
Lead Site Reliability Engineer

Hungerford, England, GB, RG17 0YL


Accelerate Your Career

Drive global technology


With more than $2 billion in revenues, CDK Global is a leading global provider of integrated information technology and digital marketing solutions to the automotive retail and adjacent industries. Focused on enabling end-to-end automotive commerce, CDK provides solutions to dealers in more than 100 countries around the world, serving approximately 28,000 retail locations and most automotive manufacturers. CDK Global solutions automate and integrate critical processes from pre-sale targeted advertising to the sale, financing, insurance, parts supply, repair and maintenance of vehicles, with an increasing focus on utilizing data analytics and predictive intelligence.


We’re large enough to make a difference but small enough for your voice to be heard. This means that we are an organization where every person matters. You can make an impact on the success of our business and that of our customers regardless of what career you decide to pursue.


From data scientists to sales and client service experts, we’re hiring to support your growth and ours. Green light your career.

Lead Site Reliability Engineer


As a Lead Site Reliability Engineer, you will be part of the Platform team responsible for building and running TechOps functions within CDKI. You will provide operational support to Platform-based products, ensuring that the necessary tools and processes are in place to allow the system to scale as operational load increases.




  • Mature the TechOps function within CDKI. Working closely with multiple stakeholders, propose the processes, toolset and skills needed to provide 24x7 proactive and reactive incident management
  • Lead and manage a team of TechOps engineers
  • Help to define the necessary availability and SLAs for the various Platform-based products, and ensure the necessary processes, toolset and people are in place to achieve them
  • Create, maintain and enhance the production readiness standards (performance, availability, security, compliance, etc.) for all services, applications and APIs, and ensure the standards are met before services go live in production
  • Create, maintain and enhance monitoring, alerting and debugging tools and capabilities
  • Perform initial troubleshooting for all services, including any rollback and restore needed to maintain high platform availability, and create the necessary dashboards
  • Create, maintain and enhance the run books needed to ensure on-call engineers have access to the right information to respond to and resolve issues
  • Maintain a good architectural understanding of deployed services in order to provide early feedback to the engineering teams
  • Collaborate with Engineering teams to implement performance improvements identified through tracking service latency figures, CPU utilization figures, etc.
  • Ensure effective communication is maintained with necessary stakeholders
  • Establish the concepts of SLIs and SLOs and agree the measurements with engineering teams
  • Provide necessary operational support to multiple platforms (both on-prem and on AWS). Participate in periodic 24x7 on-call duties
  • Deploy, support and monitor new and existing services, platforms, and application stacks
  • Manage the capacity and performance of environments
  • Automate and deploy cloud platform solutions using AWS


Required skills


  • Experience supporting microservice-based solutions and microservice implementation technologies (both serverless and containers)
  • Experience with monitoring, alerting and incident management tools in the AWS and microservice world
  • Experience in setting up processes and tools to provide 24x7 operational support
  • Ability to understand and troubleshoot customer issues
  • Experience with AWS cloud infrastructure (EC2, CloudFormation, Lambda, DynamoDB, etc.)
  • Some CI/CD experience with Jenkins/Bamboo
  • Understanding of web security and DevSecOps principles
  • Strong communication and collaboration skills
  • Ability to work across global teams and with different cultures across different time zones


Technology Stack


  • AWS (Lambda, SQS, SNS, DynamoDB, S3, ECR, EC2, API Gateway, RDS, Route 53)
  • Terraform, CloudFormation, Serverless Framework, Bamboo, Ansible
  • Grafana, ELK stack, CloudWatch
  • .NET Core, ASP.NET Core, C#, Node.js, React
  • Bamboo Specs, Jenkins, Docker, shell scripting


At CDK, we pride ourselves on having a diverse workforce. We value and celebrate the uniqueness of individuals and the different perspectives they provide. We offer equal opportunity employment regardless of race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability status, age, marital status, or protected veteran status.  
