Los Alamos National Laboratory HPC Monitoring Team (Scientist 2/3) in Los Alamos, New Mexico
What You Will Do
This position will be filled at either a Scientist 2 or 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.
The High-Performance Computing Division (HPC) provides production high performance computing systems services to the Laboratory. The High Performance Computing Systems group has responsibility for the broad range of HPC platforms and infrastructure deployed within Laboratory HPC Data Centers.
The High Performance Computing Environments group (HPC-ENV) invites applicants for a position of Scientist 2 or 3 to join the Monitoring, Security and Data Analytics team and strengthen our HPC monitoring and analysis efforts. We seek candidates who want to make significant contributions to our long-term efforts in larger-scale cluster monitoring, continuous security monitoring and job-based power monitoring. Team member duties include: system administration of RHEL servers; setting up appropriate monitoring and alerts for new HPC clusters and infrastructure, including networks and file systems; diagnosing and solving various system operational problems and implementing the solutions; and communicating and collaborating with other teams, groups and sites. The selected candidate will participate in a regularly scheduled rotation of on-call support of production systems. In addition, some non-standard working hours may occasionally be required.
The High Performance Computing Environments group (HPC-ENV) has the main responsibility of managing how users interact with the HPC systems at LANL. Some of the teams in this group include (1) Consulting and User Services, responsible for direct interaction and problem resolution with the users; (2) Parallel Runtimes and Environments, responsible for installing and maintaining the software and user environments on the HPC clusters; (3) Application Readiness, working to optimize user code for new HPC platforms and technologies; (4) Monitoring, Security and Data Analytics, responsible for collecting, analyzing and displaying HPC system information to administrators and users. Projects typically involve collaborations inside and outside of the Laboratory, in line with the Laboratory’s history of leadership in HPC.
The Monitoring, Security and Data Analytics team within HPC-ENV is responsible for monitoring everything within the HPC Datacenters, including Facilities, Clusters, File Systems, Networking and Support Servers. Monitoring data, sensor information and system logs are collected using syslog, polling scripts, IPMI and several other mechanisms. Monitoring data is transported throughout our extensive monitoring infrastructure using syslog and AMQP. Splunk serves as our main analysis, display and alerting tool for administrators. Grafana, backed by Elasticsearch and OpenTSDB, runs on our dedicated Data Analytics Cluster for our larger analysis and machine learning projects.
Scientist 2 ($87,800 - $144,800)
The successful candidate will perform the full spectrum of UNIX/Linux computing environment administration, including but not limited to:
Assist in the setup, administration and maintenance of dozens of RHEL servers using a configuration management system
Administer several monitoring software systems including Splunk, RabbitMQ, LDMS and Grafana
Identify and fix system server and network security issues
Actively look for problems in the Datacenters by monitoring logs and alerting systems
Implement monitoring dashboards and alerts for new HPC Clusters, File Systems or Networks
Work independently as well as under the supervision and guidance of senior HPC administrators to provide technical assistance in problem solving and day-to-day operation and monitoring of various HPC systems
Steadily increase responsibilities and knowledge of our environment and HPC systems
Participate in periodic on-call responsibilities as assigned
Participate in process improvement and deep multi‐system problem isolation and resolution in coordination with administrators of other HPC subsystems
Propose and implement solutions when presented with problems in our HPC environment
Experience using and maintaining databases
Experience managing web documentation sites, allowing subject-matter experts to easily add new documentation while creating an easy to navigate unified experience for the end user
Scientist 3 ($96,600 - $161,300)
In addition to the duties outlined above, the Scientist 3 will be required to:
Work as a technical leader to implement solutions to current problems and future deficiencies in our HPC environment in conjunction with junior and senior administrators and technical members of other HPC teams
Proactively examine our HPC environment and propose projects to make it better
Communicate the strategies and successes of HPC Division to national peers and participate in national strategic partnerships
Implement active network security monitoring using Bro and NetFlow analysis
Deploy advanced analytics tools or machine learning techniques on monitoring data for use in our production environment
Knowledge of several database systems and experience architecting database solutions
Experience with content management frameworks like Drupal
What You Need
Minimum Job Requirements:
Strong interpersonal and communication skills
Broad knowledge of administration of production Linux computer systems, utilities, and tools, including experience building, configuring, and administering production Linux computer systems
Knowledge of syslog configuration
Knowledge of different database systems
Understanding of how to monitor logs from multiple systems and correlate events
Demonstrated scripting (e.g., in Bash, Perl, Python, or similar scripting languages) and programming experience
Ability to mentor and lead individual junior team members and students
Working knowledge of networking concepts and practices
Experience working in a production computing environment, preferably with HPC systems or at large scale
Knowledge of or experience with hardware and software security practices
Ability to write papers and present results to peers locally or at conferences
Additional Job Requirements for Scientist 3:
In addition to the Job Requirements outlined above, qualification at the Scientist 3 level requires:
Broad knowledge of production system management topics, including networking, programming, file systems, operating systems, and configuration management, with depth in one or more areas
Experience leading and mentoring teams, students, or junior team members
Experience initiating, designing, and leading projects
Experience interacting with vendors and colleagues within the industry, including presenting technical results and practices to peers locally and at conferences
Experience deploying database solutions
Knowledge of statistics, data analytics, or similar fields
Knowledge of the NIST 800-53 standards
Experience implementing computer and network security features
Knowledge of HPC facilities systems including monitoring and alerting
Experience working in a production HPC environment
Experience diagnosing system software problems
Knowledge of one or more monitoring tools (Splunk, Ganglia, LDMS, etc.)
Experience configuring syslog
Experience with data collection and transport (syslog, IPMI, AMQP)
Knowledge of data storage and databases
Experience hardening servers for security
Knowledge of data driven web-based user interfaces, Web Servers (Apache, Tomcat, etc.), and Content Management Systems
Knowledge of resource management and job scheduling software (SLURM, Moab, etc.)
Experience with networking and file systems in an HPC environment
Experience with parallel filesystems (Lustre, GPFS, etc.)
Experience with archive solutions (HPSS, TSM, etc.)
Experience with data movement tools
Experience working with ticket tracking systems
Experience with multiple Linux distributions
Experience modifying Unix/Linux operating systems
Experience managing computers in a DOE or DOD classified environment
Active DOE Q Clearance
Education: Typical educational requirement is a bachelor’s, master’s, or doctorate degree in science from an accredited college or university and a minimum of five years of experience in the HPC field, or an equivalent combination of education and experience.
Clearance: Q (Position will be cleared to this level). Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements* for access to classified matter.
*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.
New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing.
Regular position: Term status Laboratory employees applying for regular-status positions are converted to regular status.
Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation or preference, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to firstname.lastname@example.org or call 1-505-665-4444 option 1.
Where You Will Work
Located in northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. LANL enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.
Location: Los Alamos, NM, US
Contact Name: Doyle, Christine Louise
Organization Name: HPC-ENV / HPC Environments
Job Title: HPC Monitoring Team (Scientist 2/3)
Appointment Type: Regular
Req ID: IRC61028