Job offer
Site Reliability Engineer
The job posting is for a Site Reliability Engineer at Man Group, a global asset management firm, who will be responsible for ensuring the reliability, stability, and performance of the technology platform. The successful candidate will join a high-performing team and work on developing solutions to monitor and improve system performance.
The role
Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, stability, and performance of the technology that powers Man AHL’s multi-strategy platform. This is an opportunity to work on groundbreaking projects and shape the future of our platform.
Responsibilities
- Ensure the reliability and performance of critical systems across the global infrastructure through proactive monitoring and rapid incident response.
- Design and implement monitoring solutions using tools such as Prometheus, Datadog, ELK, and Loki to provide meaningful and timely insights.
- Work with engineering teams to improve system design, deployment practices, and operational excellence.
- Configure and install new sites; manage the asset lifecycle, large-scale GPU/CPU deployments, and high-performance distributed systems.
- Contribute to capacity planning and performance benchmarking to ensure that systems meet business requirements.
- Manage multiple ELK clusters containing hundreds of terabytes of log, telemetry, and APM data.
Key competencies
- A strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing practices.
- Extensive experience with monitoring tools such as Prometheus, Datadog, ELK, and Loki, ideally across multiple clouds.
- Knowledge of automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, PowerShell).
- Strong troubleshooting and problem-solving skills in distributed systems, with the ability to diagnose complex production issues under pressure.
- Experience with containers, on-call rotations, and post-incident reviews.
- Familiarity with Kubernetes and container orchestration.
Advantages
- Experience with cloud platforms and their products (AWS, GCP, Azure).
- Understanding of network concepts, load balancing, and distributed architectures.
- Awareness of FAIR/ITAM principles to ensure that we understand the true costs of our decisions.
- Familiarity with IT service management, communication, and collaboration tools.
Benefits
- Modern office on the Aldgate Campus with easy access to public transportation and amenities.
- Hybrid working model.
- Competitive compensation package.
- 25 days of paid annual leave.
- Premium health insurance.
- Pension scheme with a 6% employer contribution.
- Referral bonus.
- An extra day of vacation for long-serving employees and new hires.
- Opportunities for professional development, including internal tech talk series and engagement with the alumni network.
Job details