Job offer

Site Reliability Engineer

As a Site Reliability Engineer at Man Group, you will be responsible for the reliability, stability, and performance of the technology that supports the company's multi-asset platform. You will work on developing and implementing solutions for monitoring and optimizing systems to ensure high availability and performance.

Job description: Site Reliability Engineer

Tasks

  • Ensure the reliability and performance of critical systems across the global infrastructure through proactive monitoring and rapid incident response.
  • Design and implement observability solutions using tools such as Prometheus, Datadog, and ELK to provide insights and enable data-driven decisions.
  • Develop and maintain SLAs, SLOs, SLIs, and error budgets to guide reliability improvements and inform engineering priorities with data.
  • Automate operational tasks and build self-service capabilities to eliminate toil and improve efficiency.
  • Participate in incident response, blameless post-mortems, and the implementation of preventive measures to reduce outages.
  • Collaborate with development teams to improve system design, deployment practices, and operational excellence.
  • Configure CI/CD tooling and manage auto-scaling, large GPU/CPU deployments, and high-performance distributed systems.
  • Contribute to capacity planning and performance budgeting to ensure that systems meet business requirements.
  • Manage multiple ELK clusters hosting hundreds of terabytes of log, telemetry, and APM data.

Requirements

  • Strong understanding of SRE principles, including SLAs, SLOs, error budgets, and reliability testing practices.
  • Familiarity with automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, or similar).
  • Strong troubleshooting and debugging skills across distributed systems, with the ability to diagnose complex production issues under pressure.
  • Experience with incident management, e.g., on-call rotations and post-incident reviews.
  • Familiarity with Kubernetes and container orchestration.
  • A proactive mindset and the ability to take ownership of reliability initiatives.

Advantages

  • Experience with AIOps and CI/CD pipelines, and with tools such as Jenkins and TeamCity.
  • Administration of Linux and Windows systems and exposure to cloud technologies (AWS/Azure).
  • Understanding of network concepts, load balancing, and distributed architectures.
  • Knowledge of ALM (Application Lifecycle Management) and tooling for DevOps teams.
  • Familiarity with ITIL v4 principles; desire to understand the actual benefits of our decisions.
  • Supported in India, motivated to succeed in remote communication and collaboration roles.

Benefits

  • Modern office space located on the MOEIOff campus with easy access to transportation and amenities.
  • Hybrid working model.
  • Competitive compensation package.
  • 2.5 days of vacation pay.
  • Premium health insurance.
  • Corporate augmented reality program.
  • Referral bonus.
  • Mobilization for long-term service and volunteer work.
  • Multifunction card.
  • Opportunities for professional development, including internal tech talks.
  • Confidential support and engagement with Man Group's Employee Resource Groups.
