Incident Management Specialist (System Analyst)

Overview

Hybrid
Depends on Experience
Contract - W2
Contract - 12 Month(s)

Skills

Service-Oriented Architecture (SOA)
Middleware management in UNIX/Linux environments
AWS Infrastructure
Incident Management

Job Details

Incident Management Specialist, AWS Infrastructure
Location: Reston, VA (Hybrid)

Summary

Experienced IT professional specializing in incident management and application triage within a 24/7/365 environment. Skilled in troubleshooting, diagnosing, and resolving production incidents in AWS infrastructure, with expertise in performance monitoring, root cause analysis, and incident resolution. Proven ability to work effectively with cross-functional teams, manage incident status and impact, and lead technical triage calls. Strong communication and relationship management skills, capable of delivering clear updates to both technical and non-technical stakeholders.

What We Are Seeking

We are seeking a candidate with expert-level knowledge in AWS infrastructure and services, particularly in the context of application triage and incident management. This role focuses on transaction tracing, log analysis, and incident resolution using AWS Console and other monitoring tools. We are not looking for candidates with solely development or deployment experience in AWS, but rather those with hands-on experience in diagnosing and resolving incidents in a cloud environment.

Key Responsibilities

  • Incident Management: Lead and manage IT production incidents to resolution, ensuring minimal downtime and effective communication of incident status, impact, and resolution.
  • AWS Expertise: Hands-on experience with Amazon Web Services (AWS), including EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route53, ECS, Lambda, S3, CloudWatch, CloudTrail, WAF, and more.
  • Cloud Monitoring: Build and leverage tools for monitoring and troubleshooting system resources in AWS, using platforms like Dynatrace, Splunk, SolarWinds, and MoogSoft.
  • Root Cause Analysis: Perform detailed transaction-level monitoring and troubleshooting of AWS infrastructure, including web, database, storage, and network layers.
  • Incident Triage: Lead technical incident triage calls, analyze system performance, and resolve incidents swiftly using monitoring tools and diagnostics.
  • Process Improvement: Proactively identify opportunities to improve operational processes, implement recommendations, and contribute to postmortem analysis for continuous improvement.
  • Collaboration: Work closely with other technical teams to influence incident resolution and share insights during follow-up calls and root cause analysis.
  • Stakeholder Communication: Provide timely updates and detailed reports on incident status and post-resolution metrics to senior leadership.

Core Skills & Technologies

  • AWS Services: EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route53, ECS, Lambda, S3, CloudWatch, CloudTrail, WAF
  • Incident Management: Hands-on management of IT incidents, triage, and resolution
  • Monitoring Tools: Dynatrace, Splunk, SolarWinds, MoogSoft, ExtraHop, Catchpoint
  • Root Cause Analysis: Incident troubleshooting, transaction tracing, and diagnostics
  • Cloud Infrastructure: Performance engineering, resource monitoring, and cloud operations
  • Communication: Strong written and verbal skills, including executive-level reporting and cross-functional collaboration
  • Technical Areas: AWS, Unix/Linux servers, Wintel servers, networks, databases (Oracle, MS SQL), SAN, virtualization

Qualifications:

  • Manage and resolve complex incidents within AWS infrastructure, providing timely updates to stakeholders and ensuring minimal production downtime.
  • Lead incident triage calls, analyzing application and infrastructure health using AWS and third-party monitoring tools (e.g., Dynatrace, Splunk).
  • Collaborate with cross-functional teams to diagnose root causes and implement corrective actions, ensuring a quick resolution for high-priority incidents.
  • Design and improve incident management processes, proactively recommending changes to minimize recurring issues and enhance system stability.
  • Conduct postmortem analysis for critical incidents, documenting root cause, corrective actions, and lessons learned to improve future performance.
  • Prior experience in an AWS cloud operations role, providing hands-on support for AWS-based applications, including incident monitoring, root cause analysis, and performance troubleshooting.
  • Experience implementing tools and dashboards for monitoring AWS infrastructure performance to improve incident detection and response times.
  • Experience working closely with development and operations teams to resolve complex production issues.
  • Experience supporting the transition of legacy systems to AWS, optimizing application performance and operational efficiency.
  • Bachelor's degree in information technology
  • AWS Certified Solutions Architect Associate
  • AWS Certified DevOps Engineer Professional (Optional)
  • Certified Incident Management Professional (Optional)

Preferred Skills & Experience:

  • Experience with Service-Oriented Architecture (SOA) and Middleware management in UNIX/Linux environments.
  • Prior experience in the financial industry or with high-transaction applications.
  • Familiarity with OpenTelemetry and advanced transaction monitoring tools.