Job offer

Site Reliability Engineer

As a Site Reliability Engineer, you are responsible for the reliability, stability, and performance of the infrastructure and play an important role in shaping the future of our platform. You will work on innovative projects and have the opportunity to learn from experienced leaders.

Job description: Site Reliability Engineer

Tasks

  • Ensuring the reliability and performance of critical systems on global infrastructure through proactive monitoring and rapid incident response.
  • Design and implementation of observability solutions using tools such as Prometheus, Datadog, ELK, and Loki for meaningful insights and rapid incident response.
  • Development and maintenance of SLIs/SLOs to drive reliability improvements and inform engineering priorities.
  • Automating operational tasks and building self-service capabilities to eliminate toil and improve efficiency.
  • Participation in blameless post-mortems and implementation of preventive measures to avoid recurring problems.
  • Collaborate with development teams to improve system design, deployment practices, and operational excellence.
  • Configuration and rollout of large-scale infrastructures and high-performance distributed systems.
  • Contribute to capacity planning and performance budgeting to ensure that systems meet business requirements.
  • Management of multiple ELK clusters hosting hundreds of terabytes of log, telemetry, and APM data.

Requirements

  • Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and reliability testing practices.
  • Strong background in software development and operations, with knowledge of Java/Scala, Terraform, and scripting or programming languages such as Python, PHP, Perl, or Csh.
  • Strong troubleshooting and debugging skills in distributed systems, with the ability to diagnose complex production issues under pressure.
  • Expertise in incident management, on-call rotations, and post-incident reviews.
  • Familiarity with Kubernetes and container orchestration.
  • Proactive mindset and ability to take responsibility for reliability initiatives.
  • Experience with SRE/DevOps tools and practices (e.g., PagerDuty, OpsGenie, ELK, Loki, or similar).
  • Administration of Linux and Windows systems and experience with cloud technologies (AWS/Azure).
  • Understanding of network concepts, load balancing, and distributed architectures.
  • Knowledge of ALM/CMMI principles; desire to understand the actual costs of decisions.
  • Proven communication and collaboration skills.

We offer

  • Modern office space on the OFCOM campus with easy access to transportation and amenities.
  • Hybrid working model.
  • Competitive salary and benefits package.
  • 25 days of paid vacation.
  • Premium health insurance.
  • Company-specific pension program.
  • Mental bonus.
  • Additional days off for long service and volunteer work.
  • Employee card.
  • Opportunities for professional development, including internal tech talks.
  • Trust, affinity, and engagement with the Man Group community.

© 2025 House of Skills by skillaware. All rights reserved.