Job offer

Site Reliability Engineer

The job posting is for a Site Reliability Engineer (SRE) at Man Group, a global asset management firm, who will be responsible for the reliability, availability, and performance of the firm's technology. The SRE will work on developing solutions to accelerate incident diagnosis and resolution, improve observability, and drive automation.

The role

Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, availability, and performance of the technology that powers Man AHL’s funds and our innovative investment platform. This is an opportunity to work on innovative projects. As an SRE, you will take responsibility for the reliability of services and related solutions that make a real impact. Your initial focus will include leveraging AI to accelerate incident diagnosis and resolution, as well as improving observability, capacity planning, and automation. Once you’ve settled in, you’ll work across every layer of the infrastructure and drive continuous improvements.

Role Responsibilities

- Ensure the reliability and performance of critical systems across the global infrastructure through proactive monitoring and rapid incident response.
- Design and implement observability solutions using tools such as Prometheus, Datadog, EFK, Loki, and Kube to provide comprehensive visibility.
- Develop and maintain SLAs, SLOs, and error budgets to drive reliability improvements and inform engineering decisions.
- Automate operational tasks and build self-service capabilities to eliminate routine work and improve efficiency.
- Develop and maintain processes and metrics.
- Participate in incident response and blame-free post-mortems, and implement measures to prevent recurrence.
- Collaborate with development teams to improve system design, deployment practices, and operational excellence.
- Configure and maintain large computing resources in distributed systems.
- Contribute to capacity planning and performance forecasting to ensure systems meet business requirements.
- Manage multiple ELK clusters hosting hundreds of terabytes of log data, telemetry, and APM data.

Key competencies

Required:
- Strong understanding of SRE principles, including SLAs, SLOs, error budgets, and reliability testing.
- At least 3 years of experience with distributed systems. Strong knowledge of Kubernetes, Docker, and Linux.
- Knowledge of automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, PowerShell).
- Strong troubleshooting and debugging skills in distributed systems, with the ability to diagnose complex production issues under pressure.
- Experience with observability, on-call rotations, and post-incident reviews.
- Familiarity with Kubernetes and container orchestration.

A plus:
- Experience with CI/CD pipelines and source code workflows (Git, Jenkins, TeamCity, GitLab).
- Administration of Linux and Windows systems and experience with cloud technologies (AWS/Azure).
- Understanding of networking concepts, load balancing, and distributed architectures.
- Knowledge of AIOps/MLOps (Google Cloud, Amazon Cloud, HDP ecosystem).
- Familiarity with FinOps principles and a desire to understand the actual costs of our decisions.
- Excellent verbal and written communication and collaboration skills.

Benefits

- Modern office space on the Old Broadwick Campus with easy access to public transportation and amenities
