Job offer
Site Reliability Engineer
Man Group is seeking a Site Reliability Engineer to ensure the reliability, availability, and performance of the company’s technology platform and to work on innovative projects. The successful candidate will be part of a high-performing team and will have the opportunity to develop and grow at various levels within the company.
The role
Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, availability, and performance of the technology that powers Man Group’s hedge funds, lending, custody, and banking operations. This is an opportunity to work on groundbreaking projects and help shape the future of our platform.
Role responsibilities
- Ensure the reliability and performance of critical systems across the global infrastructure through proactive monitoring and rapid incident response
- Design and implement observability solutions using tools such as Prometheus, Datadog, ELK, and Loki to provide insights and enable data-driven decisions
- Automate operational tasks and build self-service capabilities to eliminate routine work and improve efficiency
- Develop and maintain SLIs, SLOs, and error budgets to drive reliability improvements and inform engineering priorities
- Participate in incident response efforts, conduct blame-free post-mortems, and implement preventive measures to reduce errors
- Collaborate with development teams to improve system design, deployment practices, and operational excellence
- Configure and roll out major infrastructure upgrades; manage compute/server utilization and high-performance distributed systems
- Contribute to capacity planning and performance budgeting to ensure systems meet business requirements
- Manage multiple ELK clusters hosting hundreds of terabytes of log data, telemetry, and APM data
Key competencies
Required
- Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing
- Experience with observability and monitoring tools such as Prometheus, Grafana, ELK, Loki, or similar
- Proficiency in automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, PowerShell)
- Strong troubleshooting and problem-solving skills in distributed systems, with the ability to diagnose complex issues under pressure
- Experience with infrastructure, containers, on-call rotations, and post-incident reviews
- Familiarity with Kubernetes and container orchestration
Advantageous
- Experience with CI/CD pipelines and source code workflows (Git, Jenkins, TeamCity, GitLab)
- Administration of Linux and Windows systems and experience with cloud technologies (AWS/Azure)
- Understanding of network concepts, load balancing, and distributed architectures
- Knowledge of AIOps/MLOps (Splunk, Elastic, Grafana, NDP-Peers)
- Familiarity with internal communication and collaboration tools
- Previous experience with Man Group
Benefits
- Modern office space on the OPD campus with easy access to public transportation and amenities
- Hybrid work model
- 28-day vacation package
- 21 days of paid vacation
- Premium pension contribution
- Competitive benefits package
- Additional compensation for long-term service and volunteer work
- Additional benefits
- Opportunities for professional development, including internal tech talks
- Sponsorship and engagement with employee resources
Job details