How Spotify migrated everything from on-premise to Google Cloud Platform
- 31 July, 2018 13:18
Spotify announced that it was going all in on Google Cloud Platform (GCP) back in 2016, committing a reported US$450 million over three years. In Spotify, Google got itself an anchor customer, not just because of its brand and scale, but also because of its reputation as a data-driven, engineering-centric company.
Spotify has since shut down both of its US data centres and will be free of on-premise infrastructure by the end of the year following a complex migration.
During the Google Cloud Next conference in San Francisco we heard from members of the Spotify and Google Cloud teams who were involved in the migration about how they went about it and some key lessons learned.
Director of engineering at Spotify Ramon van Alteren started by explaining why Spotify decided to go all in on cloud infrastructure in the first place.
"If you think about the amount of effort it takes to maintain compute, storage and network capacity for a global company that serves more than 170 million users, that is a sizeable amount of work," he said.
"If I'm really honest, what we really want to do at Spotify is be the best music service in the world, none of that work on data centres actually contributes directly to that," he added.
As well as freeing up developers from worrying about provisioning and maintaining infrastructure, van Alteren said the company also wanted to start taking advantage of some of the innovation coming out of Google Cloud, specifically the BigQuery cloud data warehouse, Pub/Sub for messaging and DataFlow for batch and streaming processing.
Van Alteren also said that the driving force behind the move to cloud came, somewhat surprisingly, from the engineers tasked with maintaining those data centres. "A big part of that was asking what their job looks like once we move to cloud," he said. "The net result was that group of engineers, some of the most deeply respected at Spotify, ended up being advocates for our cloud strategy."
The actual migration plan was formulated back in 2015 and was broadly split up into two parts: services and data.
The services migration focused on moving nearly 1,200 microservices from on-premise data centres to the Google Cloud Platform.
The three main goals during migration, according to van Alteren, were to minimise disruption to product development, to finish as quickly as possible to avoid the cost and complexity of running in a hybrid environment, and to clean up, ensuring Spotify didn't have any lingering services running in its data centres.
One of the first things Google and Spotify did was to build a small migration team of Spotify engineers and Googlers, and to build a live visualisation of the entire migration state so that engineers could self-serve to see the progress of the project.
That visualisation looks like a set of red (data centre) and green (Google Cloud) bubbles, with each bubble representing a system, and the size of the bubble representing the number of machines involved.
"This had a number of interesting side effects, one that saved me a lot of time as a programme manager to save doing status updates," van Alteren said. "Next it created a real sense of accomplishment for teams that were migrating and they could see the impact they were making."
The services migration started with mapping dependencies, as the architecture at Spotify means every microservice relies on somewhere between 10 and 15 others to service a customer request. This meant that a 'big bang' migration, where everything stops, was not an option, as customers expect constant uptime from the service.
Instead, Spotify engineering teams were tasked with moving their services to the cloud in a focused two-week sprint, where they effectively paused any product development. This also allowed those teams to start assessing their architecture and to decommission anything unnecessary.
One thing Google Cloud brought to the table specifically for Spotify during the migration is its Virtual Private Cloud (VPC) option.
"This allows you to build similar to an internal network which connects multiple projects and they can cross talk," van Alteren said.
"This gives teams good control of their own destiny, they get to do what they need to for their service and if they shoot themselves in the foot they shoot themselves in the foot and not the whole company."
The second blocker was latency caused by the Virtual Private Network (VPN). As it started the migration, Spotify found that shifting 1,200 or so microservices took up a lot of VPN bandwidth.
"To be honest the VPN service at that time was more or less scaled for an office of 25 developers," he said. "When we showed up with four data centres that didn't work so well, so we collaborated with Google and got this to work solidly pretty quickly. So we built multiple gigabytes of network pipes between our data centres and Google Cloud to get that dependency problem to disappear."
Once these blockers had been removed Spotify was able to start moving user traffic over to the cloud. "Another key realisation in the service migration was that we could decouple service migration from moving our user traffic," van Alteren said.
"So we deliberately separated these roadmaps to focus on getting these applications ready to run on Google Cloud and a separate roadmap that allowed us to gradually connect more and more users with GCP, which allowed us to control the reliability, user experience and our migration speed, and how much traffic was flowing over these VPN links as well."
Once that migration was in full flow the central migration team started to secretly induce failures in those cloud systems, recording how the teams reacted and recovered on the new architecture.
Peter Mark Verwoerd, a solutions architect at Google, said: "While that was fun to break things and see teams scramble, it helped ensure the monitoring systems were properly extended to the new cloud deployment, if a team didn't notice, that would be a big red flag. Finally we had this playbook they could start going to for failure modes in the cloud they may not have had in the past."
By May 2017 each migration sprint had been completed and traffic was being routed towards Google Cloud. Then, in December 2017, Spotify had 100 percent of its users served from Google Cloud and had already closed the first of its four on-premise data centres. Since then the second data centre has been closed and the last two, both in Europe, will be closed down by the end of this year.
That roadmap is "a pretty strong signal for people with long-tail applications still running in data centres that they need to get a move on," van Alteren said.
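The talk did not say how Spotify chose which users to connect to GCP at each step of the separate traffic roadmap. As a minimal sketch of one common way to ramp traffic gradually, the Python below hashes each user ID into one of 100 stable buckets and sends a user to the cloud when their bucket falls under the current percentage. The function name, the bucket count and the use of SHA-256 are assumptions for illustration, not Spotify's implementation.

```python
import hashlib


def routes_to_cloud(user_id: str, cloud_percent: int) -> bool:
    """Return True if this user should be served from the cloud.

    Hypothetical helper: the user ID is hashed into a stable bucket
    from 0 to 99, so the same user always lands in the same bucket.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < cloud_percent


# Made-up user IDs, used only to show the ramp.
users = [f"user-{n}" for n in range(1000)]
for percent in (0, 10, 50, 100):
    on_cloud = sum(routes_to_cloud(u, percent) for u in users)
    print(f"{percent:>3}% target -> {on_cloud} of {len(users)} users on cloud")
```

Because the bucket depends only on the user ID, raising the percentage only ever adds users to the cloud side, which keeps the ramp predictable and easy to roll back by lowering the number again.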
Next up to talk through the data migration was Josh Baer, senior product manager for machine learning infrastructure at Spotify, who described the experience of moving one of Europe's biggest on-premise Hadoop clusters to the cloud.
Due to a highly complex dependency graph it was a challenge to move 20,000 daily data jobs to GCP without causing downstream failures, according to Baer.
Spotify started by assessing the possibility of a 'big bang' migration. "Shut down the Hadoop cluster, copy all of the data over to GCP and start things back up again," Baer said.
Unfortunately, even with a 160-gigabit-per-second network link, it would have taken two months to copy the data from the Hadoop cluster to Google infrastructure. "We wouldn't be much of a business if we were down for two months," he added.
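As a rough check on those figures, the arithmetic below estimates how much data a 160 gigabit-per-second link could move in two months. It assumes 60 days and a fully saturated link, neither of which was stated in the talk, so the result is only an upper bound implied by the quoted numbers and not a reported cluster size. The function name and its parameters are illustrative.

```python
def transferable_petabytes(link_gbps: float, days: float, utilisation: float = 1.0) -> float:
    """Petabytes (10**15 bytes) a link can move in the given number of days."""
    bytes_per_second = link_gbps * 1e9 / 8 * utilisation  # bits to bytes
    seconds = days * 24 * 60 * 60
    return bytes_per_second * seconds / 1e15


# Assumption: "two months" is taken as 60 days at 100 percent utilisation.
implied_pb = transferable_petabytes(link_gbps=160, days=60)
print(f"{implied_pb:.1f} PB")  # prints "103.7 PB"
```

Under those assumptions the quoted link speed and duration correspond to roughly 100 petabytes; any real-world overhead would lower that figure.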
The strategy they landed on was to copy a lot of data.
"As you moved your job over to GCP you would copy your dependencies over, then you could port your job," he explained. "Then if you had downstream consumers you may have to copy the output of your job back to our on-premise cluster so they weren't broken. As the bulk of our data migration lasted six to 12 months we were running a lot of these jobs to fill gaps on our dependency tree."
Naturally a migration like this ate up network bandwidth, so Baer and his team learned quickly to over-provision and to avoid using VPN whenever possible.
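The copy-in, copy-back approach Baer describes can be sketched as a small planning step over the dependency graph. The Python below is a minimal illustration under assumed data structures: `upstream` maps each job to the jobs whose output it reads, and `migrated` is the set of jobs already on GCP. All job names and the `CopyPlan` type are hypothetical and are not taken from Spotify's tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CopyPlan:
    copy_to_cloud: list[str] = field(default_factory=list)  # inputs still produced on-premise
    copy_back_on_prem: bool = False  # True if on-premise consumers still read our output


def plan_migration(job: str, upstream: dict[str, set[str]], migrated: set[str]) -> CopyPlan:
    """Work out which copy jobs are needed to move one data job to the cloud."""
    # Inputs produced by jobs that have not moved yet must be copied to the cloud.
    to_cloud = sorted(dep for dep in upstream.get(job, set()) if dep not in migrated)
    # Downstream consumers still on-premise need the output copied back.
    consumers = [other for other, deps in upstream.items() if job in deps]
    copy_back = any(consumer not in migrated for consumer in consumers)
    return CopyPlan(to_cloud, copy_back)


# Hypothetical dependency graph.
upstream = {
    "plays_cleaned": {"raw_events"},
    "royalties": {"plays_cleaned"},
    "recommendations": {"plays_cleaned"},
}
plan = plan_migration("plays_cleaned", upstream, migrated={"recommendations"})
print(plan)  # CopyPlan(copy_to_cloud=['raw_events'], copy_back_on_prem=True)
```

Each job moved while its neighbours are still on-premise adds copy jobs in one or both directions, which is why the number of copies grows with the time spent in a hybrid state.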
Each migration sprint started with two options for the team involved: they could lift and shift, something they called 'forklifting', where that was appropriate or the team was short on time, but ideally they would rewrite.
"This was useful for teams that didn't feel comfortable just porting over their jobs using the forklifting path because they may have inherited these data jobs and hadn't really looked at them, and if they were going to dig into them they might as well rewrite them," Baer said.
"The biggest thing with rewriting was it required a much larger time investment from teams and as engineers what usually happened is as they started writing it they want to rearchitect it too.
"Towards the end and middle of our migration we had to tell people to stop the rewrite path, just migrate their stuff and if they really wanted to rewrite it then it was already on GCP, so we could still hit our migration targets."
Spotify is now running its data stack entirely on BigQuery, running 10 million queries and scheduled jobs per month, all-in-all processing 500 petabytes of data.
Max Charas, a strategic cloud engineer at Google, warned: "This migration strategy is very tailored to Spotify both technically and organisationally, so if you wanted to do something like this it might look very different."
That being said, there were some key lessons learned from the migration.
The first was to prepare. Charas said: "We prepared for probably two years before the migration and each migration took around a year. We tried to build a minimal use case to show the benefits of moving to GCP but that couldn't be a small thing to show the true value."
Second was to focus. Van Alteren said: "It is truly amazing what you can do with a team of engineers focused on a single thing, we had sprints of a week moving 50-70 services. It will also help your business stakeholders, who will be happier with a short period of time with no product development instead of a long time. If you try to do other things at the same time you will slow your migration down to a crawl."
Third was to build a dedicated migration team to "act as guardrails to help them know what they need to know, pass on past experience and learnings and just be resources they need," Charas said.
Last was to "get out of hybrid as fast as you can - all these copy jobs are expensive and complex," Baer said.
The results for Spotify have been more freedom for developers and greater scale, without sacrificing its quality of service.
"Quality of service is something we measured diligently and there has been no degradation there," van Alteren said. "The benefits we derived includes our event delivery pipeline, that carries our royalty payments for rights holders but is also a core part of our product development. When we moved to cloud that pipeline was carrying at peak 800,000 events per second, look now and we carry 3,000,000 a second, having that much more information available for product development is insane."
And cost savings? "This is a key thing to keep an eye on as we move from a centralised buying position to a distributed buying position where everyone is capable of spending money for your company," van Alteren admitted. "So it depends. Currently we have grown in size so it is really hard to compare and I can't give you figures."