How Just Eat runs devops at scale
- 04 July, 2018 08:41
UK-based food delivery company Just Eat runs a hugely complex devops culture across 35 software development teams in five geographies, working together to maintain 450 microservices.
At peak times, normally on a Saturday evening, Just Eat processes 2,700 orders per minute, which requires it to spin up more than 1,500 AWS instances. The engineering team ships up to 500 releases a week and generates 1.5TB of logs per day.
Building a central SRE team
Site reliability engineering is of paramount importance for Just Eat, as it is for rivals like UberEats and Deliveroo, because outages provoke an outsized reaction from hungry customers.
Take Just Eat's reported outages on New Year's Day, its second busiest day of the year after Valentine's Day, when a tweet from the company reporting technical issues across its website and app received more than 200 responses, some verging on the dramatic. "You're ruining my life!" one user tweeted.
Speaking at the London leg of the AppDynamics World Tour last week, Richard Haigh, director of technology at Just Eat, said: "Site reliability engineering all starts with this first premise: we believe truly that dev teams own their product, full stop.
"So the devs create the code, they ship it, they look after it in the evenings, they take the pages home and we call them at 3 o'clock in the morning if their code fails and they help fix it. They own that code cradle to grave."
However, those dev teams aren't alone: they are supported by dedicated SRE teams, which run on what Haigh calls 'five pillars'. These are:
- Relentlessly protect site availability
- Enable change to be delivered fast while baking quality in using automated delivery pipelines
- Optimise the use of infrastructure and resources - using scalable cloud solutions and ensuring the right spend there
- Innovate to stay ahead, especially when it comes to tooling - whether it is open source, commercial or built in house - to have the best possible solution
- Foster a blameless culture and enable autonomous teams
This starts with a central SRE team of 50-60 people who run a 24/7 service operations centre. "Their job is to boss the first ten minutes of any problem we've got," Haigh said.
This team also looks after hosting across various cloud platforms; delivery automation (CI/CD pipelines); all monitoring, logging and alerting, which sit under a team known internally as 'observability'; and service management, which is run by a set of country managers.
To keep these teams aligned, Just Eat runs a range of activities that keep lines of communication open.
This starts with a daily standup to review any issues seen in the past 24 hours and prioritise next steps. Teams also run weekly risk meetings ahead of the busy weekend period. Lastly, there is a monthly technology all-hands to discuss lessons learned and new tooling.
When it comes to tooling, Just Eat wants to fall in the sweet spot between giving development teams complete autonomy and a central team dictating which tools to use.
"We try to support that diversity and decentralisation but at the same time control some of the things that are larger in scale and more important to the business," Haigh said.
"So we have central support for a range of tooling like the monitoring stack, but we give dev teams the opportunity to interact with that in an open source way, so they can take a version, fork it, manipulate it and give it back to us and we say that's great and incorporate it back into the code base for everyone."
However, dev teams adopting new tools "go into that knowing the risks and taking that on, but there is also survival of the fittest," Haigh said. "So if you have chosen that tool and it turns out to be absolutely awesome - we've seen this with something like Slack recently - the next team takes it and we get to the point where it's best in breed and we take that onboard to try and find some of those economies of scale, enterprise support and make it easier for others to adopt it."
Bennie Johnston, head of site reliability engineering at Just Eat, also explained from the stage how the company has had to scale devops over the past four years as the publicly listed company grew rapidly, going from just one release a day to nearer 500 per week today.
For example, four years ago the Just Eat dev teams were falling back on written run books to help resolve issues.
"Run books weren't the way we were going to scale devops," Johnston said. "Just Eat kept being successful so we got more and more dev teams. What we realised was our platform-as-a-service team needed scaling and that is where the deployment of monitoring came in."
So the team went out to market to look for the right tools to help automate some of these processes. "That's where AppDynamics came in, as well as our internal SRE team building tools to give the operations teams buttons and levers to affect the flow of traffic and protect that core user experience," he said.
Speaking about AppDynamics in particular, Johnston outlined how it allowed Just Eat to centralise monitoring.
"We had monitoring-as-a-service from very early on, it was one of the things that allowed us to do devops, if you are operating something you need to be able to see it," he explained.
However, monitoring was developed by individual teams for their own services, so it was fragmented.
For example: "For the Order API we had some great metrics, but from an operations point of view we had very little visibility into what was actually going on. We didn't even really have real time order rate, this was something that came out of our database three days later.
"AppDynamics gave us the ability to look across all of our teams and services to correlate cause across and gave us that visibility into our systems," he said.
So, what next for Just Eat? "We want incident resolution AI," Haigh said.
At the moment, managers arrive at work in the morning to a raft of HipChat logs, which they pick through to find any incidents that occurred overnight.
"My dream is to come in one morning and there has been an incident but we won't have had to wake any engineers up," Haigh added. "In fact the bots and systems we are looking at will have identified where something has gone wrong and find what they need to do to recover those systems.
"If we get really good those dependent systems will talk to each other and will understand the implications of making that change. Those systems will work together to resolve that and then we will run sanity checking and it is logged in those same HipChat channel so that when we, the humans, come in after a good night's sleep we can see exactly where the system was hurting and how to alleviate that moving forwards."