Job offer

Site Reliability Engineer

The role of Site Reliability Engineer at Man Group involves ensuring the reliability, resilience, and performance of the technology that powers the company’s Edge platform. The SRE will be part of a high-performing team and will work alongside the technology development teams to solve complex problems and drive major projects forward.

The role

Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, resilience, and performance of the technology that powers Man Group’s Edge platform. This is an opportunity to work on groundbreaking challenges alongside the technology development teams to drive large-scale projects forward. You’ll receive mentorship from experienced leaders and develop a deep understanding of both technology and the business.

Role Responsibilities

As an SRE, you will take responsibility for service reliability and develop solutions that make a real impact. Your initial focus will include leveraging AI to accelerate incident diagnosis and resolution, improving observability, capacity planning, and automation. You will then work across the entire infrastructure stack, covering all layers and driving continuous improvements.

- Ensure that critical systems remain reliable and performant across the global infrastructure through proactive monitoring and rapid incident response
- Develop and implement observability solutions using tools such as Prometheus, OpenTSDB, EFK, and Loki to provide meaningful and actionable metrics
- Collaborate with engineers to deliver high-quality solutions
- Automate operational tasks and build self-service capabilities to eliminate routine work and improve efficiency
- Develop and maintain SLIs, SLOs, and error budgets, and conduct root cause analyses to drive reliability improvements and inform engineering priorities
- Participate in on-call rotations, take part in post-mortems, and implement preventive measures to avoid incidents
- Collaborate with development teams to improve system design, deployment practices, and operational excellence
- Configure and maintain builds, manage asset storage, large-scale CPU/GPU deployments, and high-performance distributed systems
- Contribute to capacity planning and performance forecasting to ensure systems meet business requirements
- Manage multiple ELK clusters hosting hundreds of terabytes of log data, telemetry, and APM data

Key competencies

Required

- Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing practices
- Extensive experience with and understanding of Kubernetes (deployment strategies, pods, containers, etc.), Linux, EFK, Loki, Prometheus, and other observability tools
- Proficiency in automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, Perl, etc.)
- Strong troubleshooting and debugging skills in distributed systems, with the ability to diagnose complex production issues under pressure
- Experience with visualization, monitoring, on-call rotations, and post-incident reviews
- Familiarity with Kubernetes and container orchestration

Advantageous

- Experience with CI/CD pipelines and source control workflows (Git, Jenkins, TeamCity/GitLab)
- Administration of Linux and Windows systems and experience with cloud technologies (AWS/Azure)
- Understanding of networking concepts, load balancing, and distributed architectures
- Knowledge of AIOps/M
