Site Reliability Engineer - National Remote - UnitedHealth Group

Overview

Remote

On Site

USD 70,200.00 - 137,800.00 per year

Full Time

Skills

DoD

Medicare

Medicaid

IT operations

Change management

Design review

Estimating

Planning

Testing

Information systems

Failure analysis

Design

Performance improvement

Program management

Swift

Incident management

Leadership

Performance tuning

Continuous improvement

Operations

Workflow

Capacity management

Forecasting

Resource allocation

Documentation

Presentations

Training

Management

FOCUS

Dashboard

Real-time

Business requirements

Systems analysis

Optimization

Product management

Security controls

Apache Velocity

Performance management

Release management

Configuration Management

Reliability engineering

Emerging technologies

Electronic engineering

Service level

Root cause analysis

Pega

Appian

Microsoft

Scalability

Software engineering

System administration

Software development

Disaster recovery

ITIL

Computer science

Software architecture

Cloud computing

Customization

Software performance management

Dynatrace

Splunk

AppDynamics

Performance monitoring

Salesforce.com

Problem solving

Collaboration

Internet

Telecommuting

Communication

Policies

Jersey

FAR

IMPACT

Law

PASS

RPO

Job Details

The government services support team at Optum has earned the trust of organizations that our entire country relies on, from the Department of Defense and Veteran's Administration to the teams at Health & Human Services and the Centers for Medicare and Medicaid Services. We're repaying that trust with hard work, new ideas and a commitment to finding better solutions every day. Join us and help create ways for our government services agencies to be more efficient and effective. This will be the next huge step to start Caring. Connecting. Growing together.

As a Site Reliability Engineer (SRE), you will employ software engineering to automate critical IT operations tasks, including production system management, change management, and incident response. You will be responsible for design review and control; prediction, estimation, and apportionment methodology; failure mode and effects analysis; and the planning, operation, and analysis of reliability testing and field failures. You will also develop and administer reliability information systems for failure analysis, design and performance improvement, and reliability program management over the entire product life cycle. You will help ensure swift incident response and scalable emergency handling, fostering greater reliability and resilience in managing complex systems. You will support our efforts to optimize system performance and ensure the reliability of our technology ecosystem.

You'll enjoy the flexibility to telecommute* from anywhere within the U.S. as you take on some tough challenges.

Primary Responsibilities:

System Reliability and Incident Management: Ensure the reliability, availability, and performance of services. Respond to, troubleshoot, and resolve service outages or degradation. Lead post-incident reviews and drive root cause analysis and mitigation
Monitoring and Performance Tuning: Develop and maintain advanced monitoring and alerting systems to detect and mitigate issues proactively. Continuously measure and optimize system performance, identifying bottlenecks and points of failure
Continuous Improvement: Advocate for and implement changes to improve system reliability and scalability. Innovate new ways to manage and automate operations tasks
Collaboration and Advocacy: Work closely with development teams to incorporate best practices and influence architecture, code health, and operational processes. Promote a culture of shared responsibility for production stability and performance. Integrate SRE principles into the engineering workflow
Capacity Planning and Scalability: Forecast and plan for the infrastructure needs. Implement scalable systems and resource allocation strategies to handle growth and peaks in demand
Documentation and Knowledge Sharing: Create and maintain detailed documentation of the systems, processes, and procedures. Facilitate knowledge sharing through regular technical presentations and training sessions
Configure, implement, manage, and optimize end-to-end APM solutions, with a focus on Dynatrace, AppDynamics, Splunk, or other relevant tools
Work closely with IT teams to seamlessly integrate APM solutions into the existing infrastructure and applications
Develop and maintain customized dashboards, reports, and alerts to offer real-time insights into the health and performance of the system
Collaborate with diverse teams to understand business requirements and configure APM solutions to meet performance monitoring needs
Conduct system analysis, troubleshooting, and optimization across various applications and infrastructure components
Provide support to internal stakeholders and support teams on configuration tuning, troubleshooting, and tool-specific nuances
Manage performance continuously, measuring it and working with stakeholders to improve it
Build quality frameworks that give stakeholders a feedback loop, making it easier to manage APM products, patch systems, and implement security controls
Document automation procedures to improve the velocity and quality of the effort
Continuous performance management, software release management, configuration management, and transition to stakeholders
Request feedback from teams, perform tool implementation assessments, and offer recommendations for improvements to enhance system reliability and responsiveness
Stay abreast of industry best practices and emerging technologies in APM, ensuring our monitoring strategies align with the latest advancements

You'll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role as well as provide development for other roles you may be interested in.

Required Qualifications:

Bachelor's degree in computer science, electronics engineering or other engineering or technical discipline (6 years of additional relevant experience may be substituted for education)
4+ years of experience as a Site Reliability Engineer or in a related role
4+ years of experience monitoring software performance in terms of service-level agreements (SLAs), service-level indicators (SLIs), and service-level objectives (SLOs)
4+ years of experience with APM features such as real user monitoring, synthetic monitoring, and effective root cause analysis
4+ years of experience with one or more of the following platforms: Salesforce, Pega, Appian, Microsoft Power Platform
Experience working to ensure the scalability, performance, and reliability of large-scale, cloud-based applications and infrastructure
Knowledge of combining software engineering and systems administration, leveraging coding, automation, and engineering principles to build resilient, self-healing systems that scale seamlessly
Able to detect issues, automatically handle failures, prepare disaster recovery plans, keep systems up and reliable, and mitigate broken systems to prevent them from causing future disruptions

Preferred Qualifications:

ITIL Foundation Certification is preferred
Bachelor's in computer science or equivalent technical degree
Understanding of application architecture, infrastructure, and cloud environments
Proficiency in configuring and customizing multiple APM tools like Dynatrace, Splunk, AppDynamics for optimal performance monitoring
Additional certifications (e.g., Salesforce Developer, Quality Engineer Certification CQ, etc.) are highly desirable
Strong problem-solving skills, including the ability to analyze complex systems and identify performance bottlenecks
Excellent communication skills to collaborate effectively with cross-functional teams and convey technical concepts to non-technical stakeholders
Must have reliable internet service that allows for effective telecommuting.
Must be eligible to work in the United States.
Must be able to obtain and maintain a government security Public Trust 2 or 4 (level will depend on your role)
All work must be conducted in the United States.
Must be able to communicate both verbally and in written form.
Must be able to conduct work and be available in VA communication channels during EST business hours

*All Telecommuters will be required to adhere to UnitedHealth Group's Telecommuter Policy.

California, Colorado, Nevada, Connecticut, New York, New Jersey, Rhode Island, Hawaii, Washington, or Washington, D.C. Residents Only: The salary range for California, Colorado, Nevada, Connecticut, New York, New Jersey, Rhode Island, Hawaii, Washington, or Washington, D.C. residents is $70,200 to $137,800 per year. Pay is based on several factors including but not limited to local labor markets, education, work experience, certifications, etc. UnitedHealth Group complies with all minimum wage laws as applicable. In addition to your salary, UnitedHealth Group offers benefits such as a comprehensive benefits package, incentive and recognition programs, equity stock purchase and 401k contribution (all benefits are subject to eligibility requirements). No matter where or when you begin a career with UnitedHealth Group, you'll find a far-reaching choice of benefits and incentives.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Application Deadline: This will be posted for a minimum of 2 business days or until a sufficient candidate pool has been collected. Job posting may come down early due to volume of applicants.

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.

Diversity creates a healthier atmosphere: UnitedHealth Group is an Equal Employment Opportunity / Affirmative Action employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, national origin, protected veteran status, disability status, sexual orientation, gender identity or expression, marital status, genetic information, or any other characteristic protected by law.

UnitedHealth Group is a drug-free workplace. Candidates are required to pass a drug test before beginning employment.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
