Los Alamos National Laboratory HPC Monitoring Team Computing Systems Professional 1/2/3/4 in Los Alamos, New Mexico
What You Will Do
This position will be filled at either the CSP-1, CSP-2, CSP-3 or CSP-4 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at a higher level.
The High-Performance Computing Division (HPC) provides production high performance computing systems services to the Laboratory. The High Performance Computing Systems group has responsibility for the broad range of HPC platforms and infrastructure deployed within Laboratory HPC Data Centers.
The High Performance Computing Environments group (HPC-ENV) invites applicants for a position of Computing Systems Professional 1, 2, 3 or 4 to join the Monitoring, Security and Data Analytics team and strengthen our HPC monitoring and analysis efforts. We seek candidates who want to make significant contributions to our long-term efforts in larger-scale cluster monitoring, continuous security monitoring and job-based power monitoring. Team member duties include: system administration of RHEL servers; setting up appropriate monitoring and alerts for new HPC clusters and infrastructure, including networks and file systems; diagnosing and solving various system operational problems and implementing solutions; and communicating and collaborating with other teams, groups and sites. The selected candidate will participate in a regularly scheduled rotation of on-call support of production systems. In addition, some non-standard working hours may occasionally be required.
HPC-ENV has the main responsibility of managing how users interact with the HPC systems at LANL. Some of the teams in this group include (1) Consulting and User Services, responsible for direct interaction and problem resolution with the users; (2) Parallel Runtimes and Environments, responsible for installing and maintaining the software and user environments on the HPC clusters; (3) Application Readiness, working to optimize user code for new HPC platforms and technologies; (4) Monitoring, Security and Data Analytics, responsible for collecting, analyzing and displaying HPC system information to administrators and users. Projects typically involve collaborations inside and outside of the Laboratory, in line with the Laboratory’s history of leadership in HPC.
The Monitoring, Security and Data Analytics team within HPC-ENV is responsible for monitoring everything within the HPC Datacenters, including Facilities, Clusters, File Systems, Networking and Support Servers. Monitoring data, sensor information and system logs are collected using syslog, polling scripts, IPMI and several other mechanisms. Monitoring data is transported throughout our extensive monitoring infrastructure using syslog and AMQP. Splunk serves as our main analysis, display and alerting tool for administrators. Grafana, backed by Elasticsearch and OpenTSDB, runs on our dedicated Data Analytics Cluster for our larger analysis and machine learning projects.
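As an illustration of the collection path described above, the following is a minimal Python sketch of how a raw syslog line might be decoded and wrapped as a JSON message before being handed to an AMQP publisher. The function names, the routing-key layout and the sample log line are hypothetical and are not taken from the team's actual tooling. The only fixed rule it relies on is the syslog convention that the priority value equals facility × 8 + severity.

```python
import json
import re

# Severity names defined by the syslog protocol, indexed by numeric code.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

# Matches a leading "<PRI>" field followed by the rest of the line.
PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def decode_priority(pri):
    """Split a syslog priority value into (facility, severity)."""
    if not 0 <= pri <= 191:
        raise ValueError(f"priority out of range: {pri}")
    return divmod(pri, 8)


def to_message(line, source):
    """Hypothetical helper: turn one raw syslog line into (routing_key, JSON body)."""
    match = PRI_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not start with a <PRI> field")
    facility, severity = decode_priority(int(match.group(1)))
    body = {
        "source": source,
        "facility": facility,
        "severity": SEVERITIES[severity],
        "text": match.group(2).strip(),
    }
    # Assumed routing-key layout: syslog.<source>.<severity>
    routing_key = f"syslog.{source}.{SEVERITIES[severity]}"
    return routing_key, json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    key, payload = to_message("<34>Oct 11 22:14:15 node01 su: auth failure", "node01")
    print(key)
    print(payload)
```

In a real deployment the returned routing key and body would be passed to whichever AMQP client library the site uses; that step is left out here because the posting does not name one.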
Computing Systems Professional 1 (CSP-1) ($60,400 - $95,900)
The successful candidate will perform the full spectrum of UNIX/Linux computing environment administration, including but not limited to:
Assist in the setup, administration and maintenance of dozens of RHEL servers using a configuration management system
Assist with maintaining several monitoring software systems including Splunk, RabbitMQ and Grafana
Monitor system logs and alerting systems to identify issues
Implement monitoring dashboards and alerts for new HPC Clusters, File Systems or Networks
Computing Systems Professional 2 (CSP-2) ($72,500 - $118,200)
In addition to the duties outlined above, the CSP-2 will be required to:
Work under the supervision and guidance of senior HPC administrators to provide technical assistance in problem solving and day-to-day operation and monitoring of various HPC systems
Participate in periodic on-call responsibilities as assigned
Deploy advanced analytics tools for use in our production environment
Computing Systems Professional 3 (CSP-3) ($87,800 - $144,800)
In addition to the duties outlined above, the CSP-3 will be required to:
Work as a technical leader to implement solutions to current problems and future deficiencies in our HPC environment in conjunction with junior and senior administrators and technical members of other HPC teams
Proactively examine our HPC environment and propose projects to make it better
Propose and implement solutions when presented with problems in our HPC environment
Computing Systems Professional 4 (CSP-4) ($96,600 - $161,300)
In addition to the duties outlined above, the CSP-4 will be required to:
Communicate the strategies and successes of HPC Division to national peers and participate in national strategic partnerships
Implement active network security monitoring using Bro and NetFlow analysis
Deploy advanced analytics tools or machine learning techniques on monitoring data for use in our production environment
What You Need
Minimum Job Requirements:
Strong interpersonal and communication skills
Understanding of how to monitor logs from multiple systems and correlate events
Demonstrated scripting (e.g., in Bash, Perl, Python, or similar scripting languages) or programming experience
Experience working in a production computing environment, preferably with HPC systems or at large scale
Additional Job Requirements for CSP-2: In addition to the Job Requirements outlined above, qualification at the CSP-2 level requires:
Knowledge of syslog configuration
Working knowledge of networking concepts and practices
Ability to write and present reports to peers and management
Additional Job Requirements for CSP-3: In addition to the Job Requirements outlined above, qualification at the CSP-3 level requires:
Knowledge of administration of production Linux computer systems, utilities, and tools, including experience building, configuring, and administering production Linux computer systems
Knowledge of production system management topics, including networking, programming, file systems, operating systems, and configuration management, with depth in one or more areas
Ability to mentor and lead individual junior team members and students
Knowledge of or experience with hardware and software security practices
Experience implementing computer and network security features
Additional Job Requirements for CSP-4: In addition to the Job Requirements outlined above, qualification at the CSP-4 level requires:
Broad knowledge of production system management topics, including networking, programming, file systems, operating systems, and configuration management, with depth in one or more areas
Experience leading and mentoring teams, students, or junior team members
Experience initiating, designing, and leading projects
Experience interacting with vendors and colleagues within the industry, including presenting technical results and practices to peers locally and at conferences
Knowledge of statistics, data analytics, or similar fields
Knowledge of the NIST 800-53 standards
Experience implementing computer and network security features
Knowledge of HPC facilities systems including monitoring and alerting
Experience working in a production HPC environment
Experience diagnosing system software problems
Knowledge of one or more monitoring tools (Splunk, Ganglia, LDMS, etc.)
Experience configuring syslog
Experience with data collection and transport (syslog, IPMI, AMQP)
Knowledge of data storage and databases
Experience hardening servers for security
Knowledge of web-based user interfaces
Experience with networking and file systems in an HPC environment
Experience with parallel filesystems (Lustre, GPFS, etc.)
Experience working with ticket tracking systems
Experience with multiple Linux distributions
Experience modifying Unix/Linux operating systems
Experience managing computers in a DOE or DOD classified environment
Active DOE Q Clearance
CSP-1: Position typically requires a minimum of two years of related experience, or an equivalent combination of education and experience.
CSP-2: Position typically requires a bachelor’s degree and a minimum of four years of related experience, or an equivalent combination of education and experience.
CSP-3: Position typically requires a bachelor’s degree and a minimum of eight years of related experience, or an equivalent combination of education and experience. At this level, applicable advanced vendor and/or professional certification is desirable.
CSP-4: Position typically requires a bachelor’s degree and a minimum of twelve years of related experience, or an equivalent combination of education and experience. At this level, advanced vendor and/or professional certifications are highly desirable and postgraduate course work may be expected.
Clearance: Q (Position will be cleared to this level). Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements* for access to classified matter.
*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.
New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing.
Regular Position: Term-status Laboratory employees applying for regular-status positions are converted to regular status.
Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation or preference, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to firstname.lastname@example.org or call 1-505-665-4444 option 1.
Where You Will Work
Located in northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. LANL enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.
Location: Los Alamos, NM, US
Contact Name: Doyle, Christine Louise
Organization Name: HPC-ENV/HPC-Environments
Job Title: HPC Monitoring Team Computing Systems Professional 1/2/3/4
Appointment Type: Regular
Req ID: IRC61033