Los Alamos National Laboratory HPC Cluster Administrator (Scientist 2/3/4) in Los Alamos, New Mexico

What You Will Do

The High Performance Computing (HPC) Platforms Team within the HPC Systems Group (HPC-SYS) provides vanguard system and runtime support for some of the largest and fastest supercomputers in the world, including multi-petaop systems (e.g., the recently deployed 19K+ node, 40 Peta operations per second Trinity Supercomputer). Troubleshooters and problem-solvers at heart, the HPC Platforms Team seeks highly motivated, productive, inquisitive, and multi-talented candidates who are equally comfortable working independently and as part of a team.

The role of HPC Cluster Administrator requires strong communication skills, as well as comprehensive troubleshooting and analytical skills. Team member duties include: system deployment, configuration, and full system administration of LANL’s world-class compute clusters; evaluating and testing new technology and solutions; system administration of HPC network infrastructure in support of compute clusters; diagnosing, solving, and implementing solutions for various system operational problems; system software management and maintenance, including security posture maintenance; tuning operating systems to increase performance and reliability of services; developing tools to support automation, optimization, and monitoring efforts; interacting with vendors; and communicating and collaborating with other groups, teams, projects, and sites. There are frequent opportunities for collaborative work with scientists and staff within the group (for instance, with scientists designing and operating our high-speed networking infrastructure) or with scientists from other groups, including close collaborative research opportunities with LANL’s Ultrascale Systems Research Center (USRC), to help drive cutting-edge advances.

The selected candidate will participate in a regularly scheduled rotation of on-call support of production systems, including some systems under 7x24 hour support. In addition, some non-standard working hours may occasionally be required. This position is full-time and is located at Los Alamos National Laboratory in Los Alamos, New Mexico.

This position will be filled at the Scientist 2, Scientist 3, or Scientist 4 level, depending upon the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at a higher level.

Scientist 2 ($87,800 - $144,800)

  • Participate in periodic on-call responsibilities.

  • Work as a cluster administrator, both independently and collaboratively with other members of the team or group, after receiving initial direction and requirements from technical project leads to provide technical assistance in problem solving, configuration management, and day-to-day operation of various supercomputing systems.

  • Apply and interpret, on a broad basis, existing scientific principles, techniques, methods, and tools to troubleshoot, diagnose the root cause of system failures, and isolate components/failure scenarios while working with internal & external stakeholders.

  • Contribute to the design, testing, analysis, verification, and validation of both existing systems and systems in development, including modifications and additions to systems, code, and methods.

  • Work with team to bring up new hardware and test functionality.

  • Participate in process improvement, including deep multi-system problem isolation and resolution often in collaboration with administrators of other HPC subsystems.

  • Work with team members to develop new methods, techniques, or approaches to address critical technical problems, and to document, design, and implement new system administration and configuration tools and strategies while improving existing ones.

  • Develop technical products such as presentations, technical papers, and reports. Develop and publish updates on resolutions and communicate findings internally. Present best practices, experience reports, and/or research results to managers and to peers locally or at conferences.

  • Mentor students, junior staff, and peers in technical and professional growth activities.

  • Maintain state-of-the-art technical expertise and knowledge within HPC data storage and develop new skills in related disciplines.

Scientist 3 ($96,600 - $161,300)

In addition to the duties outlined above, a successful Scientist 3 candidate will be required to:

  • Work as a technical leader/subject matter expert to propose and implement solutions to current problems and future deficiencies in our HPC environment in conjunction with junior and senior administrators and technical staff within and across teams.

  • Proactively examine our HPC data storage infrastructure through creation of experiments and tooling to validate solutions and to detect and diagnose hardware and system health issues.

  • Analyze published research papers in the area of system configuration and administration and summarize and share implications and connections to ongoing work with team members.

  • Develop innovative advanced concepts, theories, methods, techniques, and approaches to address specialized system administration problems. Develop new technical capabilities and create opportunities to extend existing solutions through the expansion of existing efforts.

  • Influence organizational, project, and program strategies and directions related to cluster administration, operation, and management. Make decisions and/or recommendations that influence the achievement of key programmatic objectives.

  • Interact and/or collaborate with people from other teams, groups, divisions, directorates, and programs to develop, implement, and/or communicate technical solutions.

  • Enhance technical and professional expertise of other staff and students through active mentoring and training activities.

  • Develop ideas for new technical proposals and business development opportunities. Contribute to the state-of-the-art in cluster administration and tool development, and develop new skills consistent with state-of-the-art.

  • Support system software investigations and development activities, as well as system performance and stability optimization and testing efforts within the open and secure HPC network infrastructures, serving as a Principal Investigator, as needed, in targeted production R&D investigation areas.

  • Present best practices and research results to national peers at conferences, workshops, and meetings, as well as participate in national strategic partnerships.

Scientist 4 ($116,900 - $197,000)

In addition to the duties outlined above, a successful Scientist 4 candidate will be required to:

  • Lead proposals for both internal and external funding for self and others via responses to competitive requests for proposals.

  • Contribute to peer review of the work of others across organizations and disciplines nationally, including participation in HPC-related conference and workshop committees.

  • Participate in national review boards for DOE in subject area of expertise.

  • Acquire internal/external funding for self and others via responses to competitive requests for proposals and developed collaborations.

  • Work closely with high-level project leads and program managers to ensure their projects are successful.

  • Assist in defining specifications for new clusters and systems and in the writing of Request for Proposal (RFP) documentation.

What You Need

Minimum Job Requirements:

  • Strong interpersonal and written and oral communication skills.

  • Demonstrated ability to work within a team environment.

  • Significant knowledge and demonstrated experience in formulating and testing hypotheses, investigating alternative solutions, and recommending solutions to technical problems.

  • Strong command line Linux operating system skills.

  • Demonstrated experience with and broad knowledge of administration of production Linux computer systems, utilities, and tools, including experience building, configuring, and administering production Linux computer systems.

  • Demonstrated scripting (e.g., in Bash, Perl, Python, or similar scripting languages) and programming experience.

  • Demonstrated experience with Cfengine, Chef, Puppet, Ansible, Salt, or similar configuration and automation tools and practices.

  • Working knowledge of networking concepts and practices.

  • Knowledge of or experience with hardware and software security practices.

  • Working knowledge of or demonstrated experience with best practices for network security and system hardware and software hardening.

  • Ability to mentor and lead individual junior team members and students.

Additional Job Requirements for Scientist 3:

In addition to the Job Requirements outlined above, qualification at the Scientist 3 level requires:

  • Demonstrated record of accomplishment and expertise in high performance and large-scale system administration.

  • A record of technical leadership in hardware or software activities within an HPC environment.

  • Broad demonstrated knowledge of production HPC system management topics, including networking, programming, file systems, operating systems, and configuration management, with depth in one or more areas.

  • Demonstrated experience diagnosing system software problems.

  • Practical experience at the advanced level in programming (e.g., in Bash scripts, shell scripts, Perl, Python).

  • Demonstrated knowledge or experience with Cfengine, Chef, Puppet, Ansible, Salt, or similar configuration and automation tools and practices.

  • Knowledge of High Performance Computing system design.

  • Knowledge of implementation and complexity of common programming data structures and algorithms.

  • Ability to lead and mentor teams, students, or junior team members.

  • Demonstrated ability to initiate, design, and lead projects.

  • Technical accomplishment within a team environment under time constraints.

  • Demonstrated ability to evaluate competing HPC subsystem technologies.

  • Ability to analyze published research papers in the area of HPC system administration and configuration, summarize research results, and share implications and connections to ongoing work with team members.

  • Demonstrated ability to develop ideas for new technical proposals, participate in peer review, and contribute to the state-of-the-art in the area of data storage.

  • Experience interacting with vendors and colleagues within the industry, including presenting technical papers and/or technical work to peers locally and at conferences.

Additional Job Requirements for Scientist 4:

In addition to the Job Requirements outlined above, qualification at the Scientist 4 level requires:

  • Demonstrated senior technical leadership that brings various organizations, teams/individuals together with a common goal to create an efficient, cost-effective, performance-based solution to a particular problem/need.

  • Demonstrated advanced expertise in diagnosing complex system software problems.

  • Demonstrated knowledge of and experience with production HPC system management topics, including networking, programming, file systems, operating systems, and configuration management.

  • Advanced knowledge of or experience with elements specific to system integration of large and complex high performance computing systems.

  • Exhibited knowledge and experience in working with vendors on specifications and requirements for large-scale scientific system procurements addressing system architecture, reliability, performance, tuning, debugging, configuration, maintenance and support.

  • Demonstrated industry leadership and expertise in the area of HPC Cluster administration and configuration management.

  • Demonstrated ability to initiate large-scale projects to solve technology challenges.

Desired Skills:

  • Experience diagnosing system software problems.

  • Knowledge of resource management and job scheduling software (Slurm, PBS, Moab, etc.).

  • Practical experience with OpenHPC.

  • Experience with networking and filesystems in an HPC environment.

  • Experience with parallel filesystems (Lustre, GPFS, Gluster, etc.).

  • Experience with archive solutions (HPSS, TSM, etc.).

  • Experience with object storage solutions.

  • Experience with data movement tools.

  • Experience supporting a scientific user base.

  • Experience working with ticket tracking systems.

  • Experience with multiple Linux distributions.

  • Experience with revision control systems such as RCS, Subversion, or Git.

  • Experience with system observability tools such as perf, strace, tcpdump, and vmstat.

  • Experience modifying Unix/Linux operating systems (e.g., enabling/disabling kernel modules).

  • Practical experience with Splunk or other monitoring tools.

  • Familiarity with database administration.

  • Experience managing computers in a DOE or DOD classified environment.

  • Demonstrated ability to develop new methods, techniques, or approaches to address critical technical problems and/or develop new technical capabilities.

  • Knowledge of file systems such as ZFS, EXT, XFS.

  • Working knowledge of file system structures and algorithms.

  • Experience with object storage and RESTful storage interfaces.

  • Experience with multiple network technologies (e.g., Ethernet, IB, OPA).

  • Knowledge of or demonstrated experience with parallel and distributed storage systems.

  • Contribution to open source or non-work-related projects.

  • Knowledge and experience with HPC system definition, characterization, specification, acquisition, deployment, and production readiness.

  • Active DOE Q Clearance.

Education: Typical educational requirement is a bachelor’s, master’s, or doctorate degree in computer science or engineering from an accredited college or university and a minimum of five years of experience in an HPC-related field, or an equivalent combination of relevant education and/or experience.

Notes to Applicants: For consideration, applicants should submit a cover letter addressing how their knowledge, skills, and abilities meet the minimum requirements along with a resume.

Additional Details:

Clearance: Q (Position will be cleared to this level). Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements* for access to classified matter.

*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.

New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing.

Regular position: Term status Laboratory employees applying for regular-status positions are converted to regular status.

Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation or preference, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to applyhelp@lanl.gov or call 1-505-665-4444 option 1.

Where You Will Work

Located in northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. LANL enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.

The High Performance Computing (HPC) Division provides production high performance computing systems services to the Laboratory. HPC Division serves all Laboratory programs requiring a world-class high performance computing capability to enable solutions to complex problems of strategic national interest. Our work starts with the early phases of acquisition, development, and production readiness of HPC platforms, and continues through the maintenance and operation of these systems and the facilities in which they are housed. HPC Division also manages the network, parallel file systems, storage, and visualization infrastructure associated with the HPC platforms. The Division directly supports the Laboratory’s HPC user base and aids, at multiple levels, in the effective use of HPC resources to generate science. Additionally, we engage in research activities that we deem important to our mission.

Work/Life Balance:

Our diverse workforce enjoys a collegial work environment focused on creative problem solving, where everyone’s opinions and ideas are valued. We are committed to work-life balance, as well as both personal and professional growth. We consider our creative and dedicated scientific professionals to be our greatest assets, and we take pride in cultivating their talents, supporting their efforts, and enabling their successes. We provide mentoring to help new staff build a solid technical and professional foundation, and to smoothly integrate into the culture of LANL.

Compensation and Benefits include:

  • Multiple options for work schedules

  • Exercise facility free for staff use

  • Choice of comprehensive medical plans

  • Paid sick time and disability insurance

  • 401k (100% match up to 6% + kicker)

  • Fully vested in 401k on day one

  • Relocation Assistance (if needed)

Los Alamos, New Mexico enjoys excellent weather, clean air, and outstanding public schools. This is a safe, low-crime, family-oriented community with frequent concerts and events as well as quick travel to many top ski resorts, scenic hiking & biking trails, and mountain climbing. The short drive to work includes stunning views of rugged canyons and mesas as well as the Sangre de Cristo mountains. Many employees choose to live in the nearby state capital, Santa Fe, which is known for world-class restaurants, art galleries, and opera.

Location: Los Alamos, NM, US

Contact Name: Doyle, Christine Louise

Organization Name: HPC/SYS/High Performance Computing Systems

Email: cdoyle@lanl.gov

Job Title: HPC Cluster Administrator (Scientist 2/3/4)

Appointment Type: Regular

Req ID: IRC64457