Job Specifications
Our client is seeking a Senior Site Reliability Engineer to help define, lead, and continuously improve operational best practices using modern Site Reliability Engineering principles, with a strong emphasis on AWS-based cloud infrastructure.
This role partners closely with engineering, production support, and technology leadership to design, implement, and operate highly reliable, secure, scalable, and cost-effective systems supporting a complex application ecosystem and software delivery lifecycle.
The Senior SRE will influence cloud architecture decisions, lead complex infrastructure initiatives, and drive long-term improvements in reliability, observability, automation, and cost efficiency. This is a senior individual contributor role with broad technical ownership and organizational influence.
Key Responsibilities
Contribute to the design, evolution, and operational health of a large-scale AWS environment, including architecture standards and best practices
Design, implement, and optimize AWS-based infrastructure using services such as EC2, ECS/EKS, Lambda, RDS, S3, IAM, VPC, and CloudWatch
Build and manage cloud infrastructure using Infrastructure as Code tools such as Terraform, CloudFormation, or equivalent
Lead new platform implementations and major reliability initiatives as a subject-matter expert in AWS and SRE practices
Monitor, analyze, and optimize cloud spend, balancing performance, reliability, and cost efficiency
Apply and mature SRE principles to improve system availability, scalability, performance, security, and observability
Design and implement automation to reduce operational toil and improve system efficiency
Provide advanced operational support for cloud-hosted and hybrid platforms
Define and improve monitoring, alerting, logging, and incident response practices
Lead complex production incidents, perform root cause analysis, and drive corrective and preventive actions
Mentor junior and mid-level engineers through technical guidance and by modeling best practices, without direct people management
Collaborate with engineering, QA, security, and business teams to embed reliability throughout the software delivery lifecycle
Ensure systems and data handling meet applicable legal, regulatory, and security requirements
Improve production engineering processes including change and configuration management, observability, incident response, disaster recovery, capacity planning, performance tuning, and deployment automation
Participate in a sustainable on-call rotation and help reduce alert fatigue over time
Act as a change agent for long-term technical strategy, identifying risks, dependencies, and improvement opportunities
Required Qualifications
Bachelor's degree in Computer Science, Information Systems, Engineering, or equivalent practical experience
Seven or more years of experience delivering technical solutions in production environments
Three or more years of hands-on Site Reliability Engineering experience
Extensive experience designing, operating, and scaling production AWS environments
Strong expertise with Infrastructure as Code and modern cloud deployment patterns
Proven ability to diagnose and resolve complex issues in distributed systems
Experience leading incidents and driving post-incident improvements
Ability to work independently, prioritize effectively, and manage multiple initiatives
Strong written and verbal communication skills with both technical and non-technical stakeholders
Preferred Qualifications
AWS certifications such as Solutions Architect, DevOps Engineer, or SysOps Administrator
Experience working in regulated industries such as healthcare or financial services
Familiarity with modern application stacks and supporting tools including CI/CD pipelines, version control systems, observability platforms, containerization, orchestration technologies, and identity and access management solutions
Experience working in Agile or Scrum-based delivery environments
What Success Looks Like
Production systems are stable, observable, and resilient
Incidents are handled effectively and result in measurable reliability improvements
Infrastructure scales predictably while remaining cost-efficient
Engineering teams are supported by clear standards, automation, and reliability tooling
Reliability is embedded into the delivery process rather than treated as an afterthought