Overview
Skills
Job Details
# Key Responsibilities
# Reliability and Performance Management:
- Design, implement, and maintain highly available, scalable, and resilient cloud-native architectures for mission-critical SaaS products.
- Develop and implement SLOs, SLIs, and SLAs to measure and improve service reliability.
- Continuously optimize system performance and resource utilization across multiple cloud platforms.
- Finetune/Optimize Application performance by analyzing the code, traces and database queries.
# Incident Management and Troubleshooting:
- Lead incident response efforts, effectively troubleshooting complex issues to minimize downtime and impact.
- Reduce Mean Time to Recover (MTTR) through proactive monitoring, automated alerting, and efficient problem-solving techniques.
- Conduct thorough Root Cause Analysis (RCA) for all major incidents and implement preventive measures.
# Observability and Monitoring:
- Design and implement end-to-end observability solutions across our distributed systems.
- Develop and maintain comprehensive monitoring strategies using tools like ELK Stack, Prometheus, Grafana.
- Create and optimize product status dashboards to provide real-time visibility into system health and performance.
# Automation and Infrastructure as Code (IaC)
- Implement Infrastructure as Code practices using tools like Terraform.
- Develop and maintain automated deployment pipelines and CI/CD workflows.
- Create self-healing systems and automate routine operational tasks to reduce manual intervention.
# Cloud-Agnostic Architecture:
- Design and implement cloud-agnostic solutions that can operate efficiently across multiple cloud providers.
- Develop expertise in event-driven architectures and related technologies (e.g., Apache Kafka/Eventhub, Redis, Mongo Atlas, IoTHub).
- Implement and manage containerized applications using Kubernetes across different cloud environments.
# Continuous Improvement:
- Regularly review and refine operational practices to enhance efficiency and reliability.
- Stay updated with the latest industry trends and technologies in SRE, cloud computing, and DevOps.
- Contribute to the development of internal tools and frameworks to support SRE practices.
# Requirements:
- Strong knowledge of cloud platforms - Azure and their associated services.
- Expert in Observability tools (ELK Stack, Dynatrace, Prometheus )
- Expertise in containerization technologies such as Docker and Kubernetes
- Understanding of Event-driven architecture and database technologies (Mongo Atlas, Azure SQL, PostgresDB )
- Proficient in IaaC tools such as - Terraform and GitHub Actions.
- Proficiency in one or more programming languages - Python/.Net/Java
- Strong understanding of networking concepts, load balancing, and security practices.