Job offer

Site Reliability Engineer

This posting is for a Site Reliability Engineer at Man Group, a global asset management firm. The engineer will be responsible for the reliability, stability, and performance of the firm's technology platform, joining a high-performing team to develop solutions that monitor and improve system performance.

The role

Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, stability, and performance of the technology that powers Man AHL’s multi-strategy platform. This is an opportunity to work on groundbreaking projects and shape the future of our platform.

Responsibilities

  • Ensure the reliability and performance of critical systems across the global infrastructure through proactive monitoring and rapid incident response.
  • Design and implement monitoring solutions using tools such as Prometheus, Datadog, ELK, and Loki to provide meaningful and timely insights.
  • Work with engineering teams to improve system design, deployment practices, and operational excellence.
  • Configure and install new sites; manage the asset lifecycle, large-scale GPU/CPU deployments, and high-performance distributed systems.
  • Contribute to capacity planning and performance benchmarking to ensure that systems meet business requirements.
  • Manage multiple ELK clusters containing hundreds of terabytes of log data, telemetry, and APM data.

Key competencies

  • A strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing practices.
  • Extensive experience with monitoring tools such as Prometheus, Datadog, ELK, Loki, etc., ideally across multiple clouds.
  • Knowledge of automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, PowerShell).
  • Strong troubleshooting and problem-solving skills in distributed systems, with the ability to diagnose complex production issues under pressure.
  • Experience with containers, on-call rotations, and post-incident reviews.
  • Familiarity with Kubernetes and container orchestration.

Advantages

  • Experience with AWS/GCP products and familiarity with cloud technologies (AWS/Azure).
  • Understanding of network concepts, load balancing, and distributed architectures.
  • Awareness of FAIR/ITAM principles to ensure that we understand the true costs of our decisions.
  • Familiarity with IT service management, communication, and collaboration tools.

Benefits

  • Modern office on the Aldgate Campus with easy access to public transportation and amenities.
  • Hybrid working model.
  • Competitive compensation package.
  • 25 days of paid vacation.
  • Premium health insurance.
  • Pension scheme with a 6% employer contribution.
  • Referral bonus.
  • An extra day of vacation for long-serving employees and new hires.
  • Opportunities for professional development, including internal tech talk series and engagement with the alumni network.
