Overview
Hybrid
Up to $140,000
Full Time
Able to Provide Sponsorship
Skills
Amazon Web Services
Apache Velocity
AppDynamics
CHAOS
Cloud computing
Computer science
Continuous delivery
DevOps
Docker
Hosting
Java
Jenkins
Data
Kubernetes
Operational excellence
Python
Testing
Ruby
Scalability
Scripting
Splunk
Root cause analysis
FMEA
High availability
Database
Job Details
Job Title: Site Reliability Engineer (SRE)
Location: Mountain View, CA (Hybrid)
Job Type: Full-time
Job Description:
As a Site Reliability Engineer (SRE), you will design, implement, and maintain complex data systems that support millions of customers. You will apply cloud-native principles and best practices to ensure the high availability, security, performance, and scalability of database systems. This is a hands-on role that involves working with cutting-edge technologies and maintaining critical infrastructure.
Key Responsibilities:
- Design, build, and maintain CI/CD pipelines in Jenkins.
- Deploy services in Kubernetes clusters using Helm, Kustomize, and similar tools.
- Implement infrastructure changes in AWS with a deep understanding of AWS services.
- Participate in on-call duties for pre-production and production systems that support millions of users.
- Write and review RCA (Root Cause Analysis) documentation to prevent the recurrence of incidents and share learnings.
- Contribute to system upgrades, deployment automation, monitoring enhancements, and production changes.
- Create operational playbooks, write how-to articles, and gain domain knowledge to drive team improvements.
- Participate in FMEA (Failure Mode and Effects Analysis) testing, chaos testing, and security remediation efforts.
- Share best practices for operational excellence and cost optimization.
- Automate processes to reduce manual efforts and increase efficiency.
- Continuously look for opportunities to increase developer velocity and productivity.
Qualifications:
- Bachelor's or master's degree in Computer Science or a related technical field, or equivalent experience.
- 4+ years of hands-on experience with development and operations in AWS environments.
- Expertise in performance monitoring, troubleshooting, and tuning.
- Experience with AWS services and cloud hosting.
- Proficiency in DevOps automation using scripting languages.
- Experience with programming languages such as Java, Python, or Ruby.
- Knowledge of Docker, Kubernetes, and ArgoCD.
- Experience with monitoring and observability tools such as Splunk, Wavefront, AppDynamics, and Prometheus, as well as tracing tools.