Job Details
Title: AI/ML Operations/Deployment Engineer
Location: Warren, NJ (fully onsite)
Duration: 12-month contract
Must have: AI/ML, Docker containerization and orchestration, Kubernetes, Helm charts, Python and JSON, AI frameworks (PyTorch), Linux resource management (SLURM), NVIDIA software stack (CUDA, TensorRT, Triton), Jupyter Notebook, and strong communication skills
Job Description:
Candidates with 10+ years of experience preferred
1. Docker containerization and orchestration, with a strong understanding of pods, services, and deployments
2. Collaborate with AI/ML teams to understand their requirements and translate them into scalable Kubernetes-based infrastructure solutions.
3. Experience with Docker, Kubernetes Operators, and Helm charts
4. Understanding of Kubernetes security best practices (e.g., RBAC, network policies, and pod security policies)
5. Ability to set up monitoring, logging, and alerting for Kubernetes clusters using Prometheus and Grafana
6. Optimize Kubernetes cluster performance and resource utilization
7. Proficiency in Python and JSON
Desired Skills: Working-level understanding of the following:
1. Desired outcomes of the AI platforms required to support the data scientist community
2. AI frameworks such as PyTorch
3. Linux resource management tools such as SLURM
4. NVIDIA software stack: CUDA, TensorRT, Triton Inference Server
5. Jupyter Notebook