HPC-AI Engineer / Manager, Solution Architect

  • Boise, ID
  • Posted 13 days ago | Updated 11 hours ago

Overview

On Site
USD 107,000.00 - 198,000.00 per year
Full Time

Skills

High Performance Computing
Exceed
Systems Architecture
Recruiting
Generative Artificial Intelligence (AI)
Performance Tuning
Parallel Computing
Distributed Computing
Resource Management
GPU
Computer Hardware
Scalability
DNN
IT Infrastructure
Teamwork
Mentorship
Innovation
Training And Development
Research
Technical Support
Root Cause Analysis
Documentation
Performance Metrics
Presentations
Regulatory Compliance
Intellectual Property
Kubernetes
Linux
Distribution
Computer Networking
Command-line Interface
Scripting
Python
Management
Debugging
Data Management
Data Storage
File Systems
Electrical Engineering
Immigration
Machine Learning (ML)
CUDA
TensorFlow
PyTorch
Large Language Models (LLMs)
Benchmarking
Optimization
HPC
Artificial Intelligence
Communication
Collaboration
Reporting
Analytical Skill
Security Clearance
Training
Enterprise Architecture

Job Details

HPC/AI Engineer (Federal)

Job Summary:

The HPC/AI Engineer will be responsible for managing the day-to-day operations of the High-Performance Computing (HPC) and AI infrastructure, ensuring all systems meet or exceed requirements for scalability, efficiency, and performance. The position requires a blend of expertise in HPC, AI, and system architecture, along with the ability to manage complex projects from conception to implementation. The role also involves proactive engagement with key stakeholders and staying at the forefront of technology advancements.

Recruiting for this role ends on May 31, 2025

Key Responsibilities:
  • System Support and Management: Support and manage the infrastructure for HPC and AI systems. This includes assisting teams with the implementation, tuning, and optimization of tools for Generative AI models tailored for the Federal Government and related Defense Agencies.
  • Performance Optimization: Analyze and optimize system performance, ensuring the efficient execution of AI models and HPC applications. Implement techniques for parallel processing, distributed computing, and resource management. Manage and optimize GPU-enabled computing resources.
  • Integration and Optimization: Develop, debug, and maintain software tools, libraries, and frameworks that support HPC and AI workloads. Work closely with hardware and software vendors to ensure AI models are tuned for optimal performance and scalability.
  • NVIDIA Tools and Frameworks: Manage NVIDIA's suite of tools and frameworks, such as CUDA, cuDNN, and TensorRT, to optimize AI and HPC workloads on NVIDIA GPUs.
  • HPC Systems Support: Implement and manage HPC and AI systems on premises and in colocation (COLO) facilities, ensuring seamless integration with existing IT infrastructure. This includes the installation, configuration, and maintenance of the HPC infrastructure.
  • Collaboration and Teamwork: Work with cross-functional teams, including alliance partners, data scientists, researchers, and software developers, to solve complex AI-related challenges. Provide training and mentorship to junior engineers and team members, fostering a culture of continuous learning and innovation within the team.
  • Learning and Development: Stay updated with the latest advancements in HPC and AI technologies. Conduct research to explore new methodologies and integrate them into existing systems as requested.
  • Technical Support and Troubleshooting: Provide support for resolving complex technical issues related to HPC and AI infrastructure. Perform root cause analysis and implement solutions to prevent recurrence.
  • Documentation and Reporting: Create comprehensive documentation for system designs, performance metrics, and project status. Prepare detailed technical reports and presentations for stakeholders.
  • Security and Compliance: Ensure that all HPC and AI systems, software tools, and frameworks comply with federal security and regulatory requirements. Work with the Deloitte Federal BISO to implement controls that protect sensitive data and intellectual property in line with NIST guidelines.

Required Skills and Qualifications:
  • 6+ years of professional experience supporting and managing HPC and AI architectures, with a proven track record of successful project implementations.
  • 3+ years of experience in the design, support, and management of Kubernetes.
  • 3+ years of in-depth experience with at least one Linux distribution, including configuration of kernels, bootloaders, networking, and the CLI.
  • 5+ years of Python coding experience, with expertise in at least one additional scripting or programming language. Python package management and dependency debugging skills are a plus.
  • 1+ year of data management experience: understanding of data storage solutions, file systems, and data transfer protocols.
  • Bachelor's degree in Artificial Intelligence, Electrical Engineering, or a closely related field.
  • Limited immigration sponsorship may be available.
  • Ability to travel 0-10%, on average, based on the work you do and the clients and industries/sectors you serve.

Preferred:
  • Machine Learning Frameworks: Deep understanding of TensorFlow, PyTorch, and other AI/ML frameworks.
  • NVIDIA Expertise: Experience with NVIDIA's AI tools and frameworks such as CUDA, NeMo, and Triton.
  • AI Development Support: Proven ability to troubleshoot distributed AI model training frameworks like TensorFlow, PyTorch, Horovod, Ray, DeepSpeed, and others. Experience supporting large language models (LLMs) is a plus.
  • System Performance: Strong knowledge of performance profiling, benchmarking, and optimization techniques.
  • Industry Experience: Background in supporting HPC/AI in the Federal Government or Defense Industry sector is a plus.
  • Communication Skills: Excellent verbal and written communication skills for effective collaboration and reporting.
  • Analytical Skills: Strong analytical skills and proven ability to handle complex problems and develop innovative solutions.
  • May require a security clearance.

The wage range for this role takes into account the wide range of factors that are considered in making compensation decisions including but not limited to skill sets; experience and training; licensure and certifications; and other business and organizational needs. The disclosed range estimate has not been adjusted for the applicable geographic differential associated with the location at which the position may be filled. At Deloitte, it is not typical for an individual to be hired at or near the top of the range for their role and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current range is $107,000 to $198,000.

You may also be eligible to participate in a discretionary annual incentive program, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.

Information for applicants with a need for accommodation:

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

About Deloitte