Los Alamos National Laboratory HPC System Integration Architect (Scientist 3/4) in Los Alamos, New Mexico
What You Will Do
The High Performance Computing (HPC) Division at Los Alamos National Laboratory provides scientific computing resources consisting of some of the largest HPC systems in the world. The Systems team within the HPC Design Group (HPC-DES) is responsible for defining the technical direction of, and for evaluating, developing, and deploying, the tools and system software ultimately used in production support of LANL’s HPC resources. These resources currently include a large (19K+ node) Cray system called Trinity as well as numerous large commodity cluster systems.
This position will be filled at the Scientist 3 or 4 level as dictated by current programmatic needs and the skills of the selected candidate. Job responsibilities will be assigned in accordance with the level at which the selected candidate is hired.
You will work closely with other DES Systems team members as well as more production-focused team members in other groups in the HPC Division. Projects typically involve collaborations inside and outside of the Laboratory, in line with the Laboratory’s history of leadership in HPC. Some non-standard working hours may occasionally be required.
We seek candidates who want to make significant contributions that impact the HPC technical direction at LANL and ultimately across the DOE and the nation.
Scientist 3 ($96,600 - $161,300)
The successful candidate will be required to:
Identify current and future challenges faced by large-scale HPC applications, and work toward production HPC system solutions. In particular, this individual will help design, develop, deploy, and support system software to overcome these challenges. Areas of interest include distributed systems, configuration management, data-aware scheduling, resource allocation, metadata collection, parallel file systems, workflow management, and visualization.
Set direction, goals, milestones, and deliverables for project tasks and establish associated scope, schedule and budgets. Assist in the preparation of progress reports to sponsors.
Contribute to multi-lab and cross-organization proposals for funding from sources both internal and external to the Laboratory.
Serve as the Principal Investigator for a targeted area of research.
Present results of work locally and at conferences and workshops.
Provide support in the development of system and application tools to assist in the integration process.
Support system software investigations along with application performance and stability optimization and testing within the HPC integrated open and secure network infrastructure.
Assist in the design of data intensive solutions for the wider HPC environment and provide input into the design and specification of new venues that utilize custom system software.
Provide Tier 3 support to system admin staff and help desk staff on various HPC production systems, when required by user feature requests, bugs, or security vulnerabilities that cannot be resolved by production teams.
Set direction and goals for project tasks and establish associated scope, schedule and budgets.
Enhance technical and professional expertise of other staff through active mentoring and training.
Contribute to peer review of the work of others across organizations or disciplines within the laboratory.
Scientist 4 ($116,900 - $197,000)
In addition to the duties mentioned above, the Scientist 4 will be required to:
Lead proposals for both internal and external funding for self and others via responses to competitive requests for proposals.
Contribute to peer review of the work of others across organizations and disciplines nationally, including participation on HPC-related conference and workshop committees.
Participate in national review boards for DOE in subject area of expertise.
Acquire internal/external funding for self and others via responses to competitive requests for proposals and developed collaborations.
Work closely with high-level project leads and program managers to ensure their projects are successful.
Assist in defining specifications for new clusters and file systems and the writing of the RFPs.
What You Need
Minimum Job Requirements:
Demonstrated record of accomplishment and expertise in high performance and large-scale systems integration of diskless clusters and/or file systems.
A record of technical leadership in hardware or software activities within a system integration environment.
Knowledge and experience with HPC system hardware definition, characterization, specification, acquisition, deployment, and production readiness. This includes the central processing environment and surrounding support infrastructure e.g. power, global parallel file system, network design, test and monitoring.
Experience in elements specific to system integration of large and complex high performance computing systems.
Advanced practical experience in programming and scripting with tools such as Bash and other shell scripting, Perl, CFEngine, and Python.
Good oral and written communication skills.
Record of maintaining state-of-the-art technical expertise and knowledge within discipline and development of new skills in related disciplines.
Technical accomplishment within a team environment under time constraints.
Demonstrated ability to work within a team environment.
Working knowledge of networking concepts and practices.
Knowledge of or experience with hardware and software security practices.
In addition to the Job Requirements outlined above, qualification at the Scientist 4 level requires:
Practical experience and advanced knowledge of high performance system interconnects, with the ability to deploy, optimize, and debug interconnects such as InfiniBand and/or Intel Omni-Path.
Demonstrated senior technical leadership that brings various organizations, teams, and individuals together with a common goal to create an efficient, cost-effective, performance-based solution to a particular problem or need.
Demonstrate capability of understanding the complete picture of an end-to-end solution for large complex systems. This includes facilities, archive, storage, networks (cluster fabric, data center, and campus) and clusters.
Exhibited knowledge and experience in working with equipment vendors on specifications and requirements of large-scale scientific system procurements addressing system architecture, reliability, performance, tuning, debugging, configuration, maintenance and support.
Demonstrated industry leadership and expertise in an area of high performance computing.
Demonstrated ability to initiate large-scale projects to solve technology challenges.
Demonstrated in-depth experience with Lustre or GPFS.
Practical experience at the advanced level in programming using C, C++ and/or Fortran.
Practical experience with proprietary interconnects such as the Cray Aries or Gemini network or other proprietary networks.
Experience deploying software-defined networks (SDN/NFV).
Practical experience with OpenHPC.
Practical experience with power aware computing and scheduling.
Practical experience with deployments of NVRAM and other flash technologies.
Experience in anticipating needs for hardware and software environments.
Extensive experience in Linux with complete understanding of configuration files, building diskless nodes, modifying kernel parameters and building a new kernel, and using "Kickstart" files to automate installations.
Ability to create reliable, repeatable procedures for production use.
Practical experience and advanced knowledge of Ethernet switches, routing, TCP/IP, and configuration of NICs and routers.
Practical experience and advanced knowledge of system interconnects, especially InfiniBand, including how to configure them on hosts and switches.
Practical experience in taking a large cluster and making its OS and software "production" quality (i.e., how to "harden" a Linux system).
Practical experience with Slurm.
Practical experience in more than one advanced HPC subject area (e.g., data-aware computing, data-intensive supercomputing, parallel file systems, operating systems, message passing libraries, threading models, and resilience of these systems at scale).
Record of setting direction and goals for yourself and other staff.
Demonstrated experience leading multi-person projects to meet scope, schedule and budget.
Demonstrated experience in formulating and presenting results to technical audiences and readerships.
Experience managing computers in a DOE or DOD classified environment.
Active DOE Q Clearance.
Typical educational requirement is a Bachelor’s, Master’s, or Doctorate degree in a science or engineering field from an accredited college or university and a minimum of five years of experience in the HPC field, or an equivalent combination of education and experience.
Clearance: Q (Position will be cleared to this level). Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements* for access to classified matter.
*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.
New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing.
Regular position: Term-status Laboratory employees applying for regular-status positions are converted to regular status.
Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation or preference, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to email@example.com or call 1-505-665-4444 option 1.
Where You Will Work
Located in northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. LANL enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.
The High‐Performance Computing Division (HPC) provides production high performance computing systems services to the Laboratory. Our work spans the early phases of acquisition, development, and production readiness of HPC platforms continuing to the maintenance and operation of these systems and the facilities in which they are housed. HPC also manages the network, parallel file system, storage, and visualization infrastructure associated with the HPC platforms. This division also supports the HPC user base directly and aids, at multiple levels, in the effective use of HPC to generate science. Additionally, we support selected research activities that we deem important to our mission.
Location: Los Alamos, NM, US
Contact Name: Doyle, Christine Louise
Organization Name: HPC-DES/ High Performance Computing Design
Job Title: HPC System Integration Architect (Scientist 3/4)
Appointment Type: Regular
Req ID: IRC62091