- Company Name: Mistplay
- Job Title: Senior ML Platform Engineer
- Job Description:
Role Summary: Design, build, and operate scalable end‑to‑end machine learning platforms that deliver high‑performance inference and model lifecycle governance at production scale. Work closely with data, security, and SRE teams to ensure reliable, cost‑efficient, and secure ML services.
Expectations: Deliver robust, low‑latency ML inference pipelines; maintain infrastructure as code, model governance, and observability; lead platform tooling decisions and migrations; collaborate across multiple functional teams to align ML solutions with business objectives.
Key Responsibilities:
- Develop standardized training‑to‑deployment pipelines using Airflow, managing artifacts, environment provisioning, packaging, and SageMaker endpoint deployments.
- Operate real‑time and batch inference on SageMaker, managing multi‑model endpoints, serverless inference where applicable, blue/green and canary deployments, auto‑scaling, and cost controls (spot strategies, instance sizing).
- Build ultra‑low‑latency model services with Redis/Valkey (feature caching, online feature access, stateful request handling, response caching, rate limiting).
- Provision and manage ML/data infrastructure with Terraform (SageMaker endpoints, ECR/ECS/EKS, VPC, ElastiCache/Valkey, observability, secrets, IAM).
- Create platform abstractions (Airflow DAGs, CLI/SDK, cookie‑cutter repos, CI/CD pipelines) to consistently move models from notebooks to production.
- Implement model lifecycle governance (model/feature registries, approval workflows, promotion policies, lineage, audit trails integrated with Airflow and Terraform state).
- Deliver full‑stack observability (data/feature freshness checks, drift and quality controls, SLOs for model performance and latency, infrastructure health dashboards, traceability, alerts, incident response, retrospectives).
- Collaborate with security, SRE, and data engineering teams on private networking, policy‑as‑code, PII handling, least‑privilege IAM, and cost‑effective architectures across all environments.
- Evaluate, integrate, and simplify platform tools (e.g., MLflow, feature stores, service gateways); lead migrations with clear change management and minimal downtime.
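The feature‑caching pattern referenced in the responsibilities above (Redis/Valkey for online feature access with response caching) can be sketched in miniature as a TTL cache. This is an illustrative, in‑process stand‑in only — the class and method names are hypothetical, and a production system for this role would back the same read‑through logic with Redis/Valkey (e.g. `GET` plus `SETEX`) rather than a Python dict:

```python
import time


class FeatureCache:
    """Minimal in-process sketch of read-through feature caching with a TTL.

    Illustrative only: in production the store would be Redis/Valkey,
    not a local dict, and keys would be namespaced per model/feature set.
    """

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, feature_payload)

    def get(self, key, loader):
        """Return cached features for `key`; call `loader` on a miss
        or after the TTL has expired, then cache the fresh value."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, still fresh
        value = loader(key)  # cache miss or stale: recompute/refetch
        self._store[key] = (now + self.ttl, value)
        return value
```

Repeated lookups within the TTL window skip the loader entirely, which is the latency win the bullet describes; the Redis analogue simply moves `_store` out of process so many inference replicas share it.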
Required Skills:
- 5+ years building and operating production‑grade ML/data platforms focused on serviceability, reliability, and developer experience.
- Strong software engineering background: Python (primary), Go or Java (secondary).
- Proficiency with AWS services: SageMaker, ECR, ECS, EKS, IAM, VPC, ELB, RDS/ElastiCache, CloudWatch, X‑Ray.
- Experience with infrastructure‑as‑code (Terraform), workflow orchestration (Airflow), CI/CD pipelines, and version control (Git).
- Knowledge of low‑latency serving technologies (Redis/Valkey, gRPC, HTTP/2).
- Familiarity with ML lifecycle management (MLflow, DVC, model registries) and observability best practices.
- Strong understanding of security principles, data privacy, and compliance requirements.
- Excellent communication, cross‑functional collaboration, and problem‑solving skills.
Required Education & Certifications:
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, Data Science, or related technical field.
- Relevant certifications preferred: AWS Certified Machine Learning – Specialty, AWS Certified DevOps Engineer – Professional, Terraform Associate (HashiCorp).