Los Alamos National Laboratory HPC Network Administrator (Scientist 2/3) in Los Alamos, New Mexico
What You Will Do
The High Performance Computing (HPC) Networking Team designs, builds, and maintains some of the largest, fastest, and most secure networks in the world for both data movement and system capability, including systems that support up to 100 gigabits per second of throughput and continue to grow. We provide network technology spanning the full range of tiers, from campus networks to the highest-speed cluster interconnects for some of the largest and fastest supercomputers in the world. The HPC Networking Team is responsible for all aspects of networking within the HPC environment across three separate networks. This includes the Ethernet, InfiniBand, and Omni-Path networks used within more than a dozen high performance computing resources. It also includes highly complex Ethernet networks that control and manage access between the clusters and external resources. The successful candidate will support design, deployment, and maintenance efforts for networks within the HPC environment.
Builders and problem-solvers at heart, the HPC Networking Team seeks highly motivated, productive, inquisitive, and multi-talented candidates who are equally comfortable working independently and as part of a team. The successful applicant will acquire the skills needed to take a leading role in all aspects of LANL’s HPC networking infrastructure. The team values a broad range of expertise and backgrounds, and a successful candidate can pursue multiple projects within the team's production and research disciplines. There are frequent opportunities for collaborative work with scientists and staff within the group (for instance, with scientists designing and operating our high-speed data storage infrastructure) or with scientists from other groups, including close collaborative research opportunities with LANL’s Ultrascale Systems Research Center (USRC), to help drive cutting-edge advances.
This role requires strong communication skills, as well as comprehensive troubleshooting and analytical skills. Team member duties include: designing, building, and maintaining world-class data movement and storage networks, including the HPC network core, file system connectivity, and computing interconnects; evaluating and testing new technology and solutions; administering HPC network infrastructure in support of compute clusters; diagnosing various system operational problems and implementing solutions; troubleshooting, managing, and maintaining all routing and switching infrastructure within HPC compute clusters; developing tools to support automation, optimization, and monitoring efforts; interacting with vendors; and communicating and collaborating with other groups, teams, projects, and sites.
The selected candidate will participate in a regularly scheduled on-call rotation supporting production systems, including some systems under 24x7 support. In addition, some non-standard working hours may occasionally be required. This position is full-time and is located at Los Alamos National Laboratory in Los Alamos, New Mexico.
This position will be filled at either the Scientist 2 or the Scientist 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.
Scientist 2 ($87,800-$144,800)
Participate in periodic on-call responsibilities.
Work as a network administrator, both independently and collaboratively with other members of the team or group, after receiving initial direction and requirements from technical project leads.
Apply and interpret, on a broad basis, existing scientific principles, techniques, methods, and tools to troubleshoot network failures, diagnose their root causes, and isolate the affected components and failure scenarios while working with internal and external stakeholders.
Work with team members on physical deployment or relocation of HPC network and/or interconnect cabling and hardware, including bringing up new hardware and testing functionality.
Contribute to the design, testing, analysis, verification, and validation of both existing systems and systems in development, including modifications and additions to existing networks, methods and procedures.
Participate in process improvement, including deep multi-system problem isolation and resolution, often in collaboration with administrators of other HPC subsystems.
Work with team members and vendors to document, design, and implement new ideas and approaches for newer network topologies and improve those for existing ones.
Develop technical products such as presentations, technical papers, and reports. Develop and publish updates on resolutions and communicate findings internally. Present best practices, experience reports, and/or research results to managers and to peers locally or at conferences.
Mentor students, junior staff, and peers in technical and professional growth activities.
Maintain state-of-the-art technical expertise and knowledge within HPC high speed networking and develop new skills in related disciplines.
Scientist 3 ($96,600-$161,300)
In addition to the duties outlined above, a successful Scientist 3 candidate will be required to:
Work as a technical leader/subject matter expert to propose and implement solutions to current problems and future deficiencies in our HPC networking environment in conjunction with junior and senior administrators and technical staff within and across teams.
Proactively create experiments and tooling to validate solutions and to detect and diagnose network and hardware health issues.
Analyze published research papers in the area of networking and high-speed interconnects, summarize, and share implications and connections to ongoing work with team members.
Influence organizational, project, and program strategies and directions related to cluster administration, operation, and management. Make decisions and/or recommendations that influence the achievement of key programmatic objectives.
Interact and/or collaborate with people from other teams, groups, divisions, directorates, and programs to develop, implement, and/or communicate technical solutions.
Enhance technical and professional expertise of other staff and students through active mentoring and training activities.
Develop ideas for new technical proposals and business development opportunities. Contribute to the state-of-the-art in high speed network administration and tool development, and develop new skills consistent with state-of-the-art.
Support system software investigations and development activities, as well as system performance and stability optimization and testing efforts within the open and secure HPC network infrastructures, serving as a Principal Investigator, as needed, in targeted production R&D investigation areas.
Present best practices and research results to national peers at conferences, workshops, and meetings, as well as participate in national strategic partnerships.
What You Need
Minimum Job Requirements:
Strong interpersonal and communication skills.
Demonstrated ability to work within a team environment.
Demonstrated experience with Ethernet layer 2 and layer 3 networking, including VLAN configuration, administration, and management.
Demonstrated experience configuring and managing Ethernet switch and routing hardware.
Demonstrated experience with InfiniBand and/or Omni-Path networking, including administration and management.
Demonstrated knowledge of building, configuring, and administering production Linux computer systems and network devices.
Experience scripting in Bash, Perl, Python, or similar languages.
Strong command line Linux operating system skills.
Ability to mentor and lead individual junior team members and students.
Knowledge of networking hardware components.
Broad knowledge of networking concepts and practices, including best practices for network security and system hardware and software hardening.
Additional Job Requirements for Scientist 3:
In addition to the Job Requirements outlined above, qualification at the Scientist 3 level requires:
Demonstrated record of accomplishment and expertise in network administration, including demonstrated experience building, configuring, and managing high-speed networks and interconnects.
Demonstrated experience diagnosing networking issues in a production computing environment.
A record of technical leadership in hardware or software activities within an HPC environment.
Broad demonstrated knowledge of other production HPC system management topics, including programming, file systems, operating systems, and configuration management, with depth in one or more areas.
Knowledge of High Performance Computing system design.
Demonstrated programming and advanced scripting experience.
Ability to lead and mentor teams, students, or junior team members.
Demonstrated ability to initiate, design, and lead projects.
Technical accomplishment within a team environment under time constraints.
Demonstrated ability to evaluate competing HPC networking technologies.
Ability to analyze published research papers in the area of high-speed networks and interconnects, summarize research results, and share implications and connections to ongoing work with team members.
Demonstrated ability to develop ideas for new technical proposals, participate in peer review, and contribute to the state-of-the-art in the area of data storage.
Ability to present technical papers and/or technical work to peers locally and nationally at conferences and meetings.
Practical experience with Juniper or other firewall systems.
Practical experience with OSPF and other routing protocols.
Significant experience with multiple VLANs, tagged and untagged, as well as LACP and other port channel protocols.
Practical experience with Splunk or other monitoring tools.
Experience working in a production computing environment, preferably with HPC networks or at large scale.
Experience diagnosing system software problems.
Experience supporting a scientific user base.
Experience with revision control systems such as RCS, Subversion, or Git.
Experience with low-level system administration tools such as iperf, ifconfig, ethtool, etc.
Experience managing computers in a DOE or DOD classified environment.
Familiarity with Cfengine, Chef, Puppet, Ansible, Salt, or similar configuration and automation tools and practices.
Familiarity with parallel file systems and/or NFS.
Familiarity with database administration.
Contribution to open source or non-work-related projects.
Knowledge and experience with HPC system definition, characterization, specification, acquisition, deployment, and production readiness.
Active DOE Q Clearance.
Note to Applicants: For consideration, applicants should submit a cover letter addressing how their knowledge, skills and abilities meet the minimum requirements along with a resume.
Education: The typical educational requirement is a bachelor's, master's, or doctoral degree in computer science or engineering from an accredited college or university and a minimum of five years of experience in an HPC-related field, or an equivalent combination of relevant education and/or experience.
Clearance: Q (position will be cleared to this level). Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements* for access to classified matter.
*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.
New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing.
Regular position: Term-status Laboratory employees applying for regular-status positions are converted to regular status.
Internal Applicants: Please refer to Laboratory policy P701 for applicant eligibility.
Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation or preference, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to email@example.com or call 1-505-665-4444 option 1.
Where You Will Work
Located in northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. LANL enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.
The High Performance Computing (HPC) Division provides production high performance computing systems services to the Laboratory. HPC Division serves all Laboratory programs requiring a world-class high performance computing capability to enable solutions to complex problems of strategic national interest. Our work starts with the early phases of acquisition, development, and production readiness of HPC platforms, and continues through the maintenance and operation of these systems and the facilities in which they are housed. HPC Division also manages the network, parallel file systems, storage, and visualization infrastructure associated with the HPC platforms. The Division directly supports the Laboratory’s HPC user base and aids, at multiple levels, in the effective use of HPC resources to generate science. Additionally, we engage in research activities that we deem important to our mission.
Location: Los Alamos, NM, US
Contact Name: Doyle, Christine Louise
Organization Name: HPC-SYS/HPC Systems/Computing Operations & Support
Job Title: HPC Network Administrator (Scientist 2/3)
Appointment Type: Regular
Req ID: IRC65984