Site Reliability Engineering (SRE) with Java Experience @ Austin, TX (Onsite Job)

Overview

On Site

Depends on Experience

Accepts corp to corp applications

Contract - W2

Contract - Independent

Contract - 12 Month(s)

Skills

Site Reliability Engineering

SRE

Amazon Web Services

Java

Grafana

Docker

Kubernetes

Python

DevOps

Job Details

Hello,

Hope you are doing well.

Note: This role is onsite from day 1, and only local candidates will be considered.

Title: Site Reliability Engineering (SRE) with Java Experience

Location: Austin, TX

Duration: 12 Months

Rate: DOE

The current SRE/Support team handles several activities: support, DevOps, infrastructure as code, developer support for our users, SRE, etc.

Job Description:

Tools & Technologies Required

Python, Java, AWS, Kubernetes, Jenkins, Docker, Splunk

Responsibilities

Design, implement, and maintain highly available and scalable distributed systems.
Develop automation tools and scripts using Java, Python, or other relevant technologies to improve system reliability and efficiency.
Monitor, troubleshoot, and resolve production incidents, ensuring system uptime and performance.
Optimize infrastructure by implementing best practices in observability, logging, and monitoring (Prometheus, Grafana, ELK, etc.).
Collaborate with development teams to enhance CI/CD pipelines, automate deployments, and improve software delivery processes.
Ensure security, compliance, and infrastructure best practices across cloud and on-prem environments.
Conduct root cause analysis (RCA) for incidents and drive long-term improvements.
Improve system resilience through capacity planning, performance tuning, and failure recovery strategies.

Additional responsibilities

Ensure all the application components are running smoothly in the Kubernetes and AWS environment.
Support the components (patches, upgrades, issues, configurations) on the application platform.
Manage CI/CD pipelines for the application tools and components.
Automate tasks to improve efficiency and reduce manual effort.
Create and publish comprehensive observability dashboards.
Configure and monitor health checks.
Provision users.
Monitor and remediate alerts.
Alert the application team in the event of any potential issues related to infrastructure or components.
Create and update runbooks for standardized operations.
Acquire knowledge about the application platform (architecture, design, usage, typical problems faced by users, and their resolution) to reduce dependency on the application team for resolving support issues.
Track and report the cost of AWS and other resources weekly.
Respond to users on application communication channels (Slack and support email group) and provide appropriate solutions.

