Infrastructure Engineer

Overview

Remote
Depends on Experience
Contract - W2

Skills

Artificial Intelligence
LXC
Linux
Machine Learning (ML)
TAC
WebEx

Job Details

We are seeking a seasoned engineer with extensive experience managing Cisco TAC cases and customer support tickets, who combines technical and interpersonal skills to build, scale, and optimize AI systems.

  • GPU Infrastructure Design & Optimization: Expert in designing, developing, and optimizing GPU-based infrastructures to support high-performance computing and AI/ML workloads.
  • AI/ML Infrastructure Requirements: Proficient in collaborating with customers and stakeholders to define infrastructure needs for AI/ML workloads, ensuring scalability and efficiency.
  • Data Storage & Management: Skilled in defining infrastructure requirements for storing, moving, and manipulating large datasets, ensuring seamless data flow and storage solutions.
  • Performance Optimization: In-depth knowledge of GPU fabric throughput and expertise in guiding performance teams through industry-standard testing methodologies to achieve optimization.
  • Security Enhancements: Experienced in identifying security vulnerabilities and leading review discussions to strengthen infrastructure security.
  • Cross-functional Collaboration: Strong capability in working with engineering teams to define and deliver new infrastructure functions and technologies that support AI/ML products and meet customer needs.

The full technology stack is listed in the JD, but the following areas are a high focus:

  • Linux Systems Administration
  • MAAS structure and cloud-init (initial configuration scripts)
  • Strong understanding of, and experience with, NVIDIA
  • Git, GitHub
  • LXC (Linux Containers)
  • This is a production environment, so the engineer must be careful when commissioning or decommissioning any customer nodes. The environment is now locked down, with code reviews by the DigitalOcean team before changes are implemented. The engineer should be knowledgeable when making changes and must understand the potential risks those changes carry. The customer needs a stable environment and is very particular, so the engineer should be aware of this.
  • GPU knowledge is helpful, particularly for candidates with a media background.
  • The customer currently has 3 bare metal customers. Typical requests the engineer may be involved in:
    • The Development team might ask to decommission 3 servers for an existing customer.
    • Another team member might ask to add a server node for a new customer.
    • Each ticket takes 2-3 hours to resolve, so realistically 2-3 tickets can be resolved per day.
  • Engineers report daily activities on a WebEx Slack channel visible to the Annul and DigitalOcean teams.
  • Bare metal and Linux are the most important skill sets to have.
  • The team relies heavily on Linux for making changes and accessing the bare metal servers.
