Job offer
Site Reliability Engineer
The role of Site Reliability Engineer at Man Group offers the opportunity to ensure the reliability, stability, and performance of technology platforms and to contribute to innovative projects. The SRE will be responsible for developing solutions that improve operations and help the company achieve its goals.
The role
Join our high-performing Site Reliability Engineering (SRE) team and play a key role in ensuring the reliability, stability, and performance of our technology platforms. This is an opportunity to work on innovative projects and help shape the future of our platform.
Role responsibilities
* Ensure the reliability and performance of critical systems across the global infrastructure by performing proactive monitoring and responding quickly to incidents.
* Design and implement observability solutions to gain insights into system performance and identify opportunities for improvement.
* Automate operational tasks and build self-service capabilities to eliminate routine work and improve efficiency.
* Develop and maintain SLIs, SLOs, and error budgets to drive reliability improvements and inform engineering decisions.
* Participate in incident response efforts, conduct blame-free post-mortems, and implement preventive measures to improve reliability.
* Collaborate with development teams to improve system design, development practices, and operational excellence.
* Configure and roll out major infrastructure upgrades and high-performance distributed systems.
* Contribute to capacity planning and performance optimization to ensure systems meet business requirements.
* Manage multiple ELK clusters hosting hundreds of terabytes of log data, telemetry, and APM data.
Key competencies
Required
* Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing.
* Experience with and in-depth understanding of tools such as Prometheus, Datadog, ELK, Loki, and Grafana.
* Proficiency in automation tools (Ansible, Terraform) and scripting/programming languages (Python, Go, Perl/C++).
* Strong troubleshooting and problem-solving skills in distributed systems, with the ability to diagnose complex production issues under pressure.
* Experience with visualization, reporting, on-call rotations, and post-incident reviews.
* Familiarity with Kubernetes and container orchestration.
Advantageous
* Experience with CI/CD images and storage solutions (e.g., Zenko, Teams, OpenQA).
* Administration of Linux and Windows systems and experience with cloud technologies (AWS/Azure).
* Understanding of networking concepts, load balancing, and distributed architectures.
* Knowledge of A/UX (Unix SVR4, Linux/Unix), containers (e.g., Docker), and HOP (Spark).
* Familiarity with FinOps principles to understand and communicate the actual costs of our decisions and to collaborate effectively.
Benefits
* Modern office facilities on the OPDX campus with easy access to public transportation and amenities.
* Hybrid work model
* 25-day vacation package
* Premium health insurance
* Company benefits program
* Additional days off for long-term service and volunteer work
* Mental health allowance
* Annual leave
* Opportunities for professional development, including internal tech talks
* Flexible work environment and engagement with the Man Group Employee Community.
Job details