Job offer
Site Reliability Engineer
The role of Site Reliability Engineer at Man Group involves ensuring the reliability, availability, and performance of the technology that supports the company’s hedge funds and other projects. The SRE will focus on developing solutions to accelerate incident diagnosis and resolution, observability, capacity planning, and automation.
The role
Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, availability, and performance of the technology that powers Man AHL’s hedge funds, AHL, and other Edge Impact projects. This is an opportunity to work with cutting-edge technology and gain a deeper understanding of both technology and business.
Role Responsibilities
* Ensure that critical systems remain reliable and perform well across the global infrastructure through proactive monitoring and rapid incident response.
* Develop and implement observability solutions using tools such as Prometheus, Datadog, ELK, and Loki to provide meaningful metrics.
* Create and maintain SLAs, SLOs, SLIs, and error budgets to drive reliability improvements and inform engineering priorities.
* Automate operational tasks and build self-service capabilities to eliminate routine work and improve efficiency.
* Participate in on-call rotations, manage on-call processes, conduct post-mortem analyses, implement preventive measures to avoid outages, and participate in incident response efforts.
* Collaborate with development teams to improve system design, deployment practices, and operational excellence.
* Configure and scale cloud costs, manage bare-metal storage, large CPU/GPU deployments, and high-performance distributed systems.
* Contribute to capacity planning and performance budgeting to ensure systems meet business requirements.
* Manage multiple ELK clusters containing hundreds of terabytes of log data, telemetry, and APM data.
Key competencies
Required
* Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing.
* 5+ years of experience and a proven track record of successfully leading multiple IT projects.
* Knowledge of automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, PowerShell).
* Strong troubleshooting and problem-solving skills in distributed systems, with the ability to diagnose complex production issues under pressure.
* Experience with visualization, monitoring, on-call rotations, and post-incident reviews.
* Familiarity with Kubernetes and container orchestration.
Advantageous
* Experience with CI/CD pipelines and source code workflows (Git, Jenkins, TeamCity/Azure).
* Administration of Linux and Windows systems and experience with cloud technologies (AWS/Azure).
* Understanding of networking concepts, load balancing, and distributed architectures.
* Knowledge of A/UX and/or infrastructure performance tuning, HPE servers.
* Familiarity with FinOps principles and a desire to understand the true costs of our decisions.
* Demonstrated written and verbal communication and collaboration skills.
Benefits
* Modern office space on the Old Broadwick Campus with easy access to public transportation and amenities
* Hybrid work model
* 28-day vacation package
* 21 days of paid time off
* Premium accident and life insurance
* Employee support program
* Mental health first responders
* Referral bonus
* Additional sick days for long-term service and
Job details