Job Specifications
Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London
About the company
We are an early-stage AI company building infrastructure for long-horizon reinforcement learning: agents that operate for extended periods and execute tools within high-fidelity environments. The team has deep experience in large-scale AI systems and open-source ML, and the company is well funded by experienced operators and technical leaders in the field.
We build environment infrastructure to train and evaluate agents on frontier tasks such as automated research and scientific discovery. Our customers include leading AI research organisations and fast-growing, AI-native startups.
Technical stack
Managed Kubernetes (cloud-based)
Custom autoscaling systems (Python / Go)
Redis
Distributed compute frameworks (e.g. Ray)
Observability stack (OpenTelemetry-style)
Infrastructure-as-code (Terraform, Helm)
50+ containerised evaluation environments
What you’ll do
Own the Kubernetes runtime for agent environments
Own scheduling, lifecycle management, stability, and operations for long-running, failure-prone workloads
Operate and evolve a production Kubernetes platform supporting multi-hour or multi-day agent runs
Improve environment infrastructure for long-horizon training and evaluation
Maintain a large suite of containerised evaluation environments (ML benchmarks, code execution, scientific tasks) with fast cold-start times
Optimise GPU utilisation and scheduling for distributed workloads (see the sketch after this list)
Design storage patterns for large datasets, model checkpoints, and ephemeral session state
Improve environment bootstrap times and resource efficiency through image layering and caching strategies
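For a flavour of this work, the following is a minimal, illustrative sketch of launching a GPU-backed environment pod with explicit resource requests/limits and a GPU-node toleration, using the official Kubernetes Python client. The image, labels, and namespace are placeholders rather than our actual configuration.

    # Illustrative sketch only: names, labels, and the image are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="agent-env-example", labels={"app": "agent-env"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            # Tolerate the GPU-node taint and pin to a (hypothetical) GPU node pool.
            tolerations=[client.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")],
            node_selector={"node-pool": "gpu"},
            containers=[
                client.V1Container(
                    name="environment",
                    image="registry.example.com/agent-env:latest",
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                        limits={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="agent-envs", body=pod)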
Make observability excellent
Implement metrics, logs, and traces that enable fast root-cause analysis (see the sketch after this list)
Build dashboards and alerting tied to SLOs (e.g. rollout success rate, environment health, tool latency, queue time)
Create debugging playbooks for common failure modes such as OOMs, memory leaks, performance regressions, and network or storage issues
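To illustrate the SLO-oriented instrumentation above, here is a minimal sketch using the prometheus_client library; the metric names, labels, and port are hypothetical.

    # Illustrative sketch only: metric and tool names are placeholders.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    TOOL_CALLS = Counter("agent_tool_calls_total", "Agent tool calls by outcome", ["tool", "outcome"])
    TOOL_LATENCY = Histogram("agent_tool_latency_seconds", "Latency of agent tool calls", ["tool"])

    def run_tool(tool_name, fn, *args, **kwargs):
        # Record per-tool latency and outcome so dashboards and alerts can be
        # tied directly to SLOs such as tool latency and success rate.
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            TOOL_CALLS.labels(tool=tool_name, outcome="success").inc()
            return result
        except Exception:
            TOOL_CALLS.labels(tool=tool_name, outcome="failure").inc()
            raise
        finally:
            TOOL_LATENCY.labels(tool=tool_name).observe(time.monotonic() - start)

    if __name__ == "__main__":
        start_http_server(9000)  # expose /metrics for Prometheus to scrape
        run_tool("echo", lambda x: x, "hello")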
Reliability engineering
Design retry and backoff strategies for long-running agent sessions that may fail mid-execution (see the sketch after this list)
Implement session recovery mechanisms such as checkpointing and idempotent operations
Build graceful degradation paths for node failures, OOMs, and GPU errors without losing progress
Create runbooks for common failure modes (e.g. sidecar health timeouts, stream lag, pod eviction cascades)
Develop chaos-testing strategies for multi-hour runs (network partitions, node drains, API rate limits)
Define and track SLOs for session creation latency, environment availability, and tool execution success rates
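The sketch below illustrates one common pattern behind the retry and checkpointing items above: exponential backoff with jitter around a session step, resuming from the last durable checkpoint. The load_checkpoint/save_checkpoint hooks are hypothetical stand-ins for whatever persistence layer is in place.

    # Illustrative sketch only: checkpoint hooks and limits are placeholders.
    import random
    import time

    def run_with_retries(step_fn, load_checkpoint, save_checkpoint,
                         max_attempts=5, base_delay=1.0, max_delay=60.0):
        state = load_checkpoint()  # resume from the last durable checkpoint, if any
        for attempt in range(1, max_attempts + 1):
            try:
                state = step_fn(state)
                save_checkpoint(state)  # make progress durable so retries never repeat work
                return state
            except Exception:
                if attempt == max_attempts:
                    raise
                # Exponential backoff with full jitter to avoid synchronised retry storms.
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                time.sleep(random.uniform(0, delay))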
Security and sandboxing for tool-using agents
Harden container isolation for untrusted code execution (e.g. sandboxed runtimes or microVM-based approaches)
Implement network policies to restrict outbound access from evaluation environments
Design secrets management for API keys used by agent tools, including rotation and least-privilege access
Build audit logging for tool invocations and filesystem access
Implement rate limiting and circuit breakers for external API calls made by agents (sketched below)
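As a rough illustration of the last item, a token-bucket rate limiter combined with a simple circuit breaker around outbound calls might look like the sketch below; the thresholds are placeholder values, not production settings.

    # Illustrative sketch only: thresholds and reset windows are placeholders.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec, burst):
            self.rate, self.capacity = rate_per_sec, burst
            self.tokens, self.last = float(burst), time.monotonic()

        def allow(self):
            # Refill tokens based on elapsed time, then spend one if available.
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_after_seconds=30.0):
            self.failures, self.threshold = 0, failure_threshold
            self.reset_after, self.opened_at = reset_after_seconds, None

        def call(self, fn, *args, **kwargs):
            # Fail fast while the circuit is open, then allow a trial call after the window.
            if self.opened_at and time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: upstream API recently failing")
            try:
                result = fn(*args, **kwargs)
                self.failures, self.opened_at = 0, None
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()
                raise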
Must-have experience
Deep, hands-on production experience operating Kubernetes, including:
Resource requests and limits, affinity and taints/tolerations, priority classes, autoscaling, and preemption
Debugging networking, DNS, storage performance, and node health issues
Strong distributed-systems fundamentals: idempotency, retries, failure domains, and incident response
Practical observability experience with metrics, structured logging, and tracing
Ability to build internal tools in Python and/or Go
Infrastructure-as-code and automation experience (Helm, scripting, GitOps-style workflows; Terraform a plus)
Experience using Redis for high-throughput, session-oriented workloads (see the sketch below)
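For example, session-oriented Redis usage of the kind described above often looks something like this sketch (redis-py client; key names, TTLs, and connection details are hypothetical):

    # Illustrative sketch only: key names, TTLs, and connection details are placeholders.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def create_session(session_id, metadata, ttl_seconds=3600):
        # SET ... NX makes creation idempotent: a retried request cannot
        # clobber an existing session's state.
        return r.set(f"session:{session_id}", json.dumps(metadata), nx=True, ex=ttl_seconds)

    def heartbeat(session_id, ttl_seconds=3600):
        # Sliding expiry: long-running sessions stay alive while they report progress.
        return r.expire(f"session:{session_id}", ttl_seconds)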
Nice-to-have experience
Experience with machine learning systems or language models
Expertise in a specific infrastructure domain, such as:
ML or reinforcement learning training infrastructure (checkpointing, distributed training, GPU scheduling)
Building custom Kubernetes controllers, operators, or autoscalers (see the sketch after this list)
Sandboxing technologies for untrusted code execution
Distributed compute frameworks (e.g. Ray, Dask, Spark)
Deep expertise in container runtimes, Linux performance tuning, or networking
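As a rough sketch of the "custom autoscaler" idea, the loop below scales a Deployment from an external signal such as queue depth, using the Kubernetes Python client; the deployment name, namespace, and scaling heuristic are placeholders.

    # Illustrative sketch only: names and the scaling heuristic are placeholders.
    import time
    from kubernetes import client, config

    def desired_replicas(queue_depth, per_replica=10, min_r=1, max_r=50):
        return max(min_r, min(max_r, -(-queue_depth // per_replica)))  # ceiling division

    def autoscale_loop(get_queue_depth, name="agent-env-workers", namespace="agent-envs"):
        config.load_incluster_config()  # assumes the loop runs inside the cluster
        apps = client.AppsV1Api()
        while True:
            replicas = desired_replicas(get_queue_depth())
            apps.patch_namespaced_deployment_scale(name, namespace, {"spec": {"replicas": replicas}})
            time.sleep(15)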
Compensation
Competitive salary and meaningful equity
Early-team impact with direct ownership and high leverage
About Enigma
Here at Enigma, we specialise in Generative AI recruitment, with a particular focus on Machine Learning and Software Engineering roles. With 20+ years of combined experience, we understand the intricacies of finding the perfect role as well as the right talent for your team.
But what sets Enigma apart? Our consultative approach. We don't just match candidates with job openings; we guide candidates, founders, and hiring managers through the recruitment process. Our value-added services go beyond traditional recruitment ...