Lead Site Reliability Engineer

Date: Feb 23, 2021

Location: Hungerford, England, GB, RG17 0YL

Company: CDK

Accelerate Your Career

Drive global technology


We’re a global market leader in providing software and digital marketing solutions to the automotive industry. We’re innovating the way that automotive dealerships drive their customers’ car-buying experience from the moment they run a search online all the way through to bringing their car back in for a service. Join us and be a part of the evolution.


We’re large enough to make a difference but small enough for your voice to be heard. This means that we are an organisation where every person matters. You can make an impact on the success of our business and that of our customers regardless of what career you decide to pursue.


What will you be doing?


As a Lead Site Reliability Engineer, you will be part of the Platform team responsible for building and running Site Reliability Engineering functions. You will provide operational support to Platform-based products, ensuring that the necessary tools and processes are in place to allow the system to scale as operational load increases.




  • Mature the SRE function within the business. Working closely with multiple stakeholders, propose the processes, toolsets and skills needed to provide 24x7 proactive and reactive incident management
  • Lead and manage a team of SRE engineers
  • Help define the availability targets and SLAs for the various Platform-based products, and ensure the necessary processes, toolsets and people are in place to meet them
  • Create, maintain and enhance the production readiness standards (performance, availability, security, compliance, etc.) for all services, applications and APIs, and ensure the standards are met before services go live in production
  • Create, maintain and enhance monitoring, alerting and debugging tools and capabilities
  • Perform initial troubleshooting for all services, including any rollback and restore needed to maintain high platform availability, and create the necessary dashboards
  • Create, maintain and enhance the runbooks that give on-call engineers access to the right information to respond to and resolve issues
  • Maintain a good architectural understanding of deployed services in order to provide early feedback to the engineering teams
  • Collaborate with Engineering teams to implement performance improvements identified through tracking service latency figures, CPU utilization figures, etc.
  • Ensure effective communication is maintained with necessary stakeholders
  • Establish the concepts of SLIs and SLOs and agree the measurements with engineering teams
  • Provide necessary operational support to multiple platforms (both on-prem and on AWS). Participate in periodic 24x7 on-call duties
  • Deploy, support and monitor new and existing services, platforms, and application stacks
  • Manage the capacity and performance of environments
  • Automate and deploy cloud platform solutions using AWS


Required skills


  • Experience in supporting microservice-based solutions and microservice implementation technologies (both serverless and containers)
  • Experience with monitoring, alerting and incident management tools for AWS and microservice environments
  • Experience in setting up processes and tools to provide 24x7 operational support
  • Ability to understand and troubleshoot customer issues
  • Experience with AWS cloud infrastructure (EC2, CloudFormation, Lambda, DynamoDB, etc.)
  • Some CI/CD experience with Jenkins/Bamboo
  • Understanding of web security and DevSecOps principles
  • Strong communication and collaboration skills
  • Ability to work across global teams, with different cultures and across different time zones


Technology Stack


  • AWS (Lambda, SQS, SNS, DynamoDB, S3, ECR, EC2, API Gateway, RDS, Route 53)
  • Terraform, CloudFormation, Serverless Framework, Bamboo, Ansible
  • Grafana, ELK stack, CloudWatch, Prometheus, Backstage
  • .NET Core, ASP.NET Core, C#, Node.js, React
  • Bamboo Specs, Jenkins, Docker, Shell script


Why a career with Keyloop?


We demand diversity. Our people may be spread across countries, continents and cultures, but we’re united by a passion and enthusiasm to drive our business forward. This means no matter where you work you’ll feel like part of our global team. Diverse backgrounds, ideas and experiences are the only way to deliver world-class service to our customers. Our differences are our strengths.


Your benefits. To help us attract and retain the best, we pay people according to performance, not length of service. We will also help you grow your career, not only through focused investment in learning and development but also by enabling you to explore the exciting opportunities our global market has to offer.


The perfect opportunity awaits. Start your career with Keyloop.
