August 27
🔄 Hybrid – London
• Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure to support our web services and ML workloads
• Keep our platform, inference, and model-training environments highly available, and enable seamless replication of work environments across several HPC clusters
• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.)
• Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging, and alerting systems) for both our client-facing APIs and large training runs
• Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences
• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, and Terraform
• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments
• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure
• Design and develop new workflows and tooling to improve the reliability, availability, and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)
• Collaborate with the security team to ensure infrastructure adheres to security best practices and compliance requirements
• Document processes and procedures to ensure consistency and knowledge sharing across the team
• Contribute to open-source projects, research publications, blog articles, and conferences
• Master’s degree in Computer Science, Engineering, or a related field
• 5+ years of experience in a DevOps/SRE role
• Strong experience with cloud computing and highly available distributed systems
• Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
• Experience working against reliability KPIs (observability, alerting, SLAs)
• Hands-on experience with CI/CD, containerization, and orchestration tools (Docker, Kubernetes...)
• Knowledge of monitoring, logging, alerting, and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
• Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
• Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
• Strong understanding of networking, security, and system administration concepts
• Excellent problem-solving and communication skills
• Self-motivated and able to work well in a fast-paced startup environment
• Experience in an AI/ML environment
• Experience with high-performance computing (HPC) systems and workload managers (Slurm)
• Experience with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)
• Competitive salary and bonus structure
• Comprehensive benefits package (daily lunch vouchers, Gympass subscription, mobility pass contribution, full health insurance for you and your family, generous parental leave policy...)