Overview
Remote
$60+
Accepts corp to corp applications
Contract - Independent
Able to Provide Sponsorship
Skills
AI/ML Infrastructure and Ops Engineer
Job Details
AI/ML Infrastructure and Ops Engineer (AI/ML Training Platform)
As the Infrastructure and Ops Engineer, you will work on operations related to our UAIS AI Studio (enterprise AI/ML platform), in particular the AI/ML training initiative supporting thousands of learners on the platform. This individual contributor (IC) role requires experience working on large-scale AI/ML platforms and ensuring their stability, reliability, scalability, and performance. Experience with modern infrastructure and DevOps tools and paradigms, as well as hands-on knowledge of major cloud services such as Azure, AWS, and Google Cloud Platform, is a must.
Primary Responsibilities:
Continuous Support: Provide continuous SRE support to thousands of geographically distributed learners on the UAIS platform, including responding to tickets, triaging support requests, and liaising with customers.
Automation & DevOps: Improve existing Infrastructure as Code (IaC) in line with DevOps best practices.
Systems Monitoring: Develop and maintain monitoring frameworks for the UAIS infrastructure that supports the AI/ML training program.
Security & Compliance: Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats.
Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML training environment, while identifying opportunities to reduce costs without compromising performance.
Required Qualifications:
Bachelor's degree in computer science, information technology, or a related field.
5+ years of infrastructure experience: Proven experience working on large-scale, cloud-based, enterprise-level software platforms, and a deep understanding of multi-cloud architectures (specifically Azure, AWS, and Google Cloud Platform), with hands-on experience in cloud management.
3+ years of practical experience with Infrastructure-as-Code and CI/CD tools such as Terraform, GitHub Actions, and the like.
2+ years of practical experience with containerization and orchestration technologies (Docker, Kubernetes).
2+ years of practical experience in scripting and automation: Advanced proficiency in scripting languages such as Python and Bash to support automation and system integration efforts.
Preferred Qualifications:
Security & Compliance Knowledge: Strong understanding of security best practices and experience ensuring compliance with relevant regulatory frameworks.
Machine Learning and LLM Operations: Exposure to modern tools and techniques in MLOps and LLMOps.
Exposure to AI/ML-specific infrastructure tools (e.g., MLflow, Kubeflow) for managing and deploying models at scale.
Exposure to a Regulated Industry: Experience working within healthcare or another regulated industry, with a solid understanding of its unique challenges and compliance requirements.
Ability to work independently, manage multiple projects simultaneously, and adapt to changing priorities in a fast-paced environment.