Job offer

Site Reliability Engineer

The job posting describes a position as a Site Reliability Engineer at Man Group, in which the candidate is responsible for the reliability, stability, and performance of technology platforms. The role offers the opportunity to work on innovative projects and help shape the future of the platform, with a focus on machine learning tools and technologies.

Zurich

Man Investments AG

100%

The role

Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, stability, and performance of our technology platforms, using machine learning (ML) tooling alongside observability tools such as Prometheus, Grafana, New Relic, and more.

Role responsibility

As an SRE, you will be responsible for service reliability and will deliver solutions that make a real impact. Your initial focus will include:

Using AI to speed up incident diagnosis and resolution
Improving observability, capacity planning, and automation

Your daily work revolves around the infrastructure stack, operations, and continuous improvement.

Responsibilities

- Ensure the reliability and performance of critical systems across the global infrastructure through proactive monitoring and rapid incident response.
- Design and implement observability solutions using tools such as Prometheus, Datadog, ELK, and Loki to provide monitoring and alerting capabilities.
- Collaborate with multiple teams to improve system design, deployment practices, and operational excellence.
- Troubleshoot issues with confidence, manage on-call rotations, large-scale CPU/GPU deployments, and high-performance distributed systems.
- Contribute to capacity planning and performance optimization to ensure systems meet business requirements.
- Manage multiple ELK clusters hosting hundreds of terabytes of log data, telemetry, and APM data.

Key competencies

Required:

Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing practices
Extensive experience and in-depth understanding of tools such as Prometheus, Grafana, the ELK Stack, or similar
Proficiency in automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, Perl/C)
Strong understanding of troubleshooting and debugging across distributed systems, with the ability to diagnose complex production issues under pressure
Experience with containerization, on-call rotations, and post-incident reviews
Familiarity with Kubernetes and container orchestration solutions
A proactive mindset and the ability to take ownership of reliability initiatives

Benefits

- Modern office space on the OPD campus with easy access to public transportation and amenities
- A hybrid work model
- Flexible compensation package
- 25 days of paid vacation
- Premium retirement plan
- Company-sponsored program
- Mental health support for long-term service and volunteer work
- Additional sick leave
- Multifunctional card
- Opportunities for professional development, including internal tech talks
- A culture of personal responsibility and engagement with the business community

Original job description

Job details

Found on:

May 26, 2026

Employer:

Man Investments AG

Job percentage:

100%

Place of work:

Zurich

Place found on:

https://job-boards.eu.greenhouse.io/mangroup/jobs/4714467101

About the employer:

Man Investments AG is a global active investment manager.

The company offers alternative and traditional investment solutions.

The Swiss headquarters are located in Pfäffikon (SZ).

Man Group is listed on the London Stock Exchange and is part of the FTSE 250 Index.
