- Company Name
- Oscilar
- Job Title
- Sr./Staff - Infrastructure/Site Reliability Engineer (SRE)
- Job Description
-
Role Summary:
Lead the reliability strategy for a multi‑region, cloud‑native AI risk decisioning platform. Own the design, operation, and evolution of AWS infrastructure (Pulumi), CI/CD pipelines, observability, and scaling solutions that support billions of events and large‑scale data pipelines.
Expectations:
- Deliver and maintain enterprise‑grade availability, latency, and performance at scale.
- Demonstrate extreme ownership of platform reliability, balancing velocity with stability.
- Mentor peers and establish SRE best practices across the organization.
Key Responsibilities:
- Architect and operate resilient AWS infrastructure using Pulumi, Terraform, and other IaC tools.
- Lead initiatives to improve uptime, reduce latency, and enhance performance across global deployments.
- Design, implement, and maintain CI/CD pipelines focused on speed, safety, and repeatability.
- Define, instrument, and maintain metrics, alerts, and runbooks that support observability.
- Conduct chaos engineering experiments and failure simulations to harden system resilience.
- Mentor junior engineers, provide technical guidance, and drive cultural adoption of SRE principles.
- Partner with product, dev, and data teams to optimize system architecture for fraud and risk detection workloads.
Required Skills:
- At least 5 years of experience as a senior SRE or high‑scale infrastructure engineer.
- Deep expertise in AWS and IaC (Pulumi, Terraform).
- Strong programming skills in Go (Python also accepted).
- Proficiency with distributed systems (Kafka, ClickHouse), microservices architecture, and container orchestration (Kubernetes).
- Experience with observability tools (Prometheus, Grafana, ELK, etc.) and runbook creation.
- Proven ability to run chaos experiments, debug production issues, and drive reliability improvements.
- Strong ownership mindset, communication, and leadership qualities.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent professional experience).
- Certifications such as AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), or an Infrastructure‑as‑Code certification are a plus.