Software Engineer - Incident Management

  • Denver, CO
  • Posted 47 days ago | Updated 1 hour ago

Overview

On Site
USD 149,000.00 - 190,000.00 per year
Full Time

Skills

Data
Metrics
Visualization
Honesty
AIM
Writing
Facilitation
Documentation
Python
TypeScript
Kubernetes
Incident management
Collaboration
Communication
English
Leadership
Professional development
End-user training
Mentorship
Computer networking
Health care
Planning

Job Details

About Datadog:

We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale (trillions of data points per day), providing always-on alerting, metrics visualization, logs, and application tracing for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.

Software Engineer - Incident Management SRE

The Incident Management SRE team at Datadog fosters a resilient culture by using incidents as learning opportunities and catalysts for growth. We collaborate closely with teams across departments to enhance on-call experience, incident response, and post-incident analysis, reducing friction and optimizing tooling and processes. Our efforts empower Datadog to navigate unexpected failures confidently, efficiently, and with a commitment to continuous learning and systems improvement.

At Datadog, we place value in our office culture - the relationships and collaboration it builds and the creativity it brings to the table. We operate as a hybrid workplace to ensure our Datadogs can create a work-life harmony that best fits them.

What You'll Do:
  • Steer the on-call experience for the company by establishing best practices and building platforms to support on-call rotations and compensation.
  • Define how we respond to incidents and write software to streamline the process, collaborating with product teams as needed. Our aim is to fully support our incident responders in dealing with complexity.
  • Contribute to the post-mortem process for the company, collaborating with teams on writing post-mortems and identifying opportunities to reduce friction and enhance learning value for the organization. Our team also runs a weekly post-mortem reading group.
  • Support various teams in facilitating incident reviews that emphasize learning and blamelessness. Help them share their learnings across the organization to improve the resilience of our people.
  • Train our on-callers in incident and post-mortem processes, both introducing newcomers to on-call responsibilities and refreshing the knowledge of existing engineers.
  • Engage in cross-functional collaborations with different teams across the organization, embedding in their group for a few weeks to either learn about how work is performed or help them improve on-call practices.

Who You Are:
  • At least 3 years of experience building software that solves real user problems, designing new features with RFCs, and reviewing others' code and documents collaboratively. We develop in Go and Python, with a bit of TypeScript.
  • Familiarity with Kubernetes and distributed systems, along with an understanding of their potential failure scenarios.
  • Interest in analyzing incidents, identifying broader risk patterns, and effectively sharing findings for others to understand and learn from.
  • Experience being on-call and responding to incidents, iteratively improving incident response processes.
  • Empathy, collaboration, and communication skills in English to cultivate strong relationships across various teams in the organization.
  • Willingness to teach and train other engineers on best practices. Experience driving cross-functional change and leading through influence, or a strong interest in doing so.

Datadog values people from all walks of life. We understand not everyone will meet all the above qualifications on day one. That's okay. If you're passionate about technology and want to grow your skills, we encourage you to apply.

Benefits and Growth:
  • New hire stock equity (RSUs) and employee stock purchase plan (ESPP)
  • Continuous professional development, product training, and career pathing
  • Intradepartmental mentor and buddy program for in-house networking
  • An inclusive company culture, with the ability to join our Community Guilds (Datadog employee resource groups)
  • Access to Inclusion Talks, our internal panel discussions
  • Free, global mental health benefits for employees and dependents age 6+
  • Competitive global benefits

Benefits and Growth listed above may vary based on the country of your employment and the nature of your employment with Datadog.

Datadog offers a competitive salary and equity package, and may include variable compensation. Actual compensation is based on factors such as the candidate's skills, qualifications, and experience. In addition, Datadog offers a wide range of best-in-class, comprehensive, and inclusive employee benefits for this role, including healthcare, dental, parental planning, and mental health benefits, a 401(k) plan and match, paid time off, fitness reimbursements, and a discounted employee stock purchase plan.

The reasonably estimated yearly salary for this role at Datadog is:
$149,000-$190,000 USD