Job offer
Site Reliability Engineer
As a Site Reliability Engineer at Man Group, you will be responsible for the reliability, stability, and performance of the technology that supports the company's multi-asset platform. You will develop and implement solutions that monitor and optimize systems to ensure high availability and performance.
Job description: Site Reliability Engineer
Tasks
- Ensure the reliability and performance of critical systems across the global infrastructure through proactive monitoring and rapid incident response.
- Design and implement observability solutions using tools such as Prometheus, Datadog, and ELK to provide insights and enable data-driven decisions.
- Develop and maintain SLAs, SLOs, SLIs, and error budgets to guide reliability improvements and inform engineering priorities with data.
- Automate operational tasks and build self-service capabilities to eliminate waste and improve efficiency.
- Participate in incident response and blameless post-mortems, and implement preventive measures to reduce outages.
- Collaborate with development teams to improve system design, deployment practices, and operational excellence.
- Configure CI/CD tools and manage auto-scaling, large GPU/CPU deployments, and high-performance distributed systems.
- Contribute to capacity planning and performance budgeting to ensure that systems meet business requirements.
- Manage multiple ELK clusters hosting hundreds of terabytes of log, telemetry, and APM data.
Requirements
- Strong understanding of SRE principles, including SLAs, SLOs, error budgets, and reliability testing practices.
- Familiarity with automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, or similar).
- Strong troubleshooting and debugging skills across distributed systems, with the ability to diagnose complex production issues under pressure.
- Experience with incident management, e.g., on-call rotations and post-incident reviews.
- Familiarity with Kubernetes and container orchestration.
- A preventive mindset and the ability to take responsibility for reliability initiatives.
Advantages
- Experience with AIOps and CI/CD pipelines, and with tools such as Jenkins and TeamCity.
- Experience administering Linux and Windows systems, and exposure to cloud technologies (AWS/Azure).
- Understanding of networking concepts, load balancing, and distributed architectures.
- Knowledge of ALM (Application Lifecycle Management) and tooling for DevOps teams.
- Familiarity with ITIL v4 principles and a desire to understand the actual benefits of our decisions.
- Supported in India, motivated to succeed in remote communication and collaboration roles.
Benefits
- Modern office space located on the MOEIOff campus with easy access to transportation and amenities.
- Hybrid working model.
- Competitive compensation package.
- 2.5 days of vacation pay.
- Premium health insurance.
- Corporate augmented reality program.
- Referral bonus.
- Mobilization for long-term service and volunteer work.
- Multifunction card.
- Opportunities for professional development, including internal tech talks.
- Confidential support and engagement with Man Group's Employee Resource Groups.
Job details