Job offer

Site Reliability Engineer

The job posting describes a position as a Site Reliability Engineer at Man Group, in which the candidate is responsible for the reliability, stability, and performance of technology platforms. The role offers the opportunity to work on innovative projects and help shape the future of the platform, with a focus on machine learning tools and technologies.

The role

Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, stability, and performance of our technology platforms using machine learning (ML) tools such as Prometheus, Grafana, New Relic, and more.

Role responsibilities

As an SRE, you will be responsible for service reliability and will deliver solutions that make a real impact. Your initial focus will include:
  • Using AI to speed up incident diagnosis and resolution
  • Improving observability, capacity planning, and automation
Your daily work revolves around the infrastructure stack, operations, and continuous improvement.

Responsibilities

- Ensure the reliability and performance of critical systems across the global infrastructure through proactive monitoring and rapid incident response.
- Design and implement observability solutions using tools such as Prometheus, Datadog, ELK, and Loki to provide monitoring and alerting capabilities.
- Collaborate with multiple teams to improve system design, deployment practices, and operational excellence.
- Troubleshoot issues with confidence, manage on-call rotations, large-scale CPU/GPU deployments, and high-performance distributed systems.
- Contribute to capacity planning and performance optimization to ensure systems meet business requirements.
- Manage multiple ELK clusters hosting hundreds of terabytes of log data, telemetry, and APM data.

Key competencies

Required:
  • Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing practices
  • Extensive hands-on experience with, and in-depth understanding of, tools such as Prometheus, Grafana, the ELK Stack, or similar
  • Proficiency in automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, Perl/C)
  • Strong understanding of troubleshooting and debugging across distributed systems, with the ability to diagnose complex production issues under pressure
  • Experience with containerization, on-call rotations, and post-incident reviews
  • Familiarity with Kubernetes and container orchestration solutions
  • A proactive mindset and the ability to take ownership of reliability initiatives

Benefits

- Modern office space on the OPD campus with easy access to public transportation and amenities
- A hybrid work model
- Flexible compensation package
- 25 days of paid vacation
- Premium retirement plan
- Company-sponsored program
- Mental health support for long-term service and volunteer work
- Additional sick leave
- Multifunctional card
- Opportunities for professional development, including internal tech talks
- A culture of personal responsibility and engagement with the business community
