The popular platform-as-a-service Cloud Foundry still requires plenty of work, expertise and organisational culture change if users are to really reap the benefits it promises.
Here we look at some common challenges encountered when working with Cloud Foundry, and how to overcome them.
Speaking at the Cloud Foundry Summit in The Hague, Netherlands last week, Colin Simmons, a software engineer at cloud consultancy EngineerBetter, went through two common misconceptions Cloud Foundry customers often bring to him, and how he has worked with them to change processes and overcome some operational challenges.
"This job lets me work with lots of different customers in lots of different industries and I found that no matter where I go, there's often similar misunderstandings or similar questions being asked to me," he told the audience during a breakout session.
Misconception one: "Cloud Foundry is difficult to operate"
The first misconception he sought to debunk was: "Cloud Foundry is difficult to operate."
"There's a little bit of truth to this, Cloud Foundry is very complicated, and as a complicated distributed system, you need complicated tooling to deploy it," he admitted.
Cloud Foundry is commonly operated using an open source tool chain called BOSH, which has a "legendarily steep learning curve," according to Simmons. "Nowadays you can replace BOSH with Kubernetes, but I would argue Kubernetes is similar, at least from the operator perspective." That being said, there are steps that can be taken to help.
"The first major one is, if you are still deploying Cloud Foundry manually, you're objectively doing it wrong and please stop," he said.
The second is to not cut corners on training: "There's a lot of knowledge that you need to build up to operate Cloud Foundry effectively."
Establish a platform team
It is also important to establish a dedicated platform team, whether by hiring externally or bringing together internal resources. However, it is vital that members of this team don't carry over all the responsibilities of their old jobs on top of the new role if the team is to have any chance of success.
Organisations should try to add a dedicated product manager to this team. "[This person] is part of the teams, they go to all of the meetings and the planning and the retros, but their role is to constantly interface with stakeholders and customers and figure out what the priorities are and order the backlog in a way that works and make sure it all continues flowing smoothly," Simmons said.
Treat the platform as a software product
Another solution is to treat the platform as a software product. "If you push things without testing them into production, and the platform breaks, it doesn't matter how much you tested your apps, they're broken too," Simmons said. "Where you have dedicated teams, for deploying applications, managing applications, you should have a dedicated team for managing the platform."
Before joining EngineerBetter, Simmons worked for Marks and Spencer in the mobile team. "We deployed to Cloud Foundry, we wrote the app ourselves, we deployed the app ourselves, we did on-call support ourselves, we did everything," he said.
"This was really empowering for us, because it shows you from scratch to final completion, you can pick up an issue and then deploy the issue and see it fixed in production.
"Also, if you're the person that gets paged at three in the morning, on a Saturday, because some sketchy thing pushed on Friday night, you're not going to allow sketchy things to be pushed on Friday night.
"We had, I think, five minutes of production-impacting issues in a year and a half, which we were quite proud of. From a business point of view, they didn't need a separate team, they just paid the developers and the website worked."
Once this team is established it’s important that they actually collaborate with one another.
Simmons gave an example of one client he worked with, a European government agency, that had a platform team of six people distributed across the country and only met for half an hour a week.
"The consequence was there was one person that had pretty good knowledge of BOSH, pretty good knowledge of Cloud Foundry and pretty good knowledge of how their stuff worked and no one else in the team really knew anything," he said.
What would happen if that one person wasn't around? "Well, then all the learning that that person has done is lost and the platform team has to struggle to relearn everything while there's a production incident, and that's no fun for anyone," he warned.
Simmons also talked about 'pairing' and 'mobbing', where you either pair two developers around one computer and one story, or mob three to five developers around one computer and one story.
Simmons added: "Before you discount this, saying 'this style of working will never work in my industry, and will never work in my company'. I've seen this work at some of the biggest banks I know of, I have seen it work at security companies, pharmacy companies, and automotive, you name it, you name the industry, I can probably name someone that succeeded with pairing. So give it a shot and see if it works for you."
He also advised pairing with other teams: "Send someone from your team to pair with them on their backlog, and then invite someone from their team to come to your team and pair on your backlog. This helps foster understanding and empathy between everyone in the business.
"Collaborating and breaking down silos are really the key solutions to almost every problem in this talk."
Lastly, but perhaps most importantly, it is vital to build automation into your processes.
"Usually, problems manifest as the platform team has no time, or they're falling behind on updates, or they're falling behind on feature requests to the platform, and are just generally unhappy and there are fires everywhere," Simmons said. "Usually, the problem is that there's a lack of automation, or there's a lack of understanding of existing automation or poor implementation."
The Cloud Foundry Foundation pushes more than 130 releases per month on average. He asked: "Do you think your platform team can keep up with that release cadence if you are doing everything manually?"
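A rough back-of-the-envelope calculation puts that cadence in perspective. The sketch below uses the 130 releases per month quoted above; the 21 working days per month and the two hours of hands-on effort per manually applied release are illustrative assumptions, not figures from the talk, and not every release will apply to every deployment.

```python
# Back-of-the-envelope estimate of manual upgrade effort.
RELEASES_PER_MONTH = 130        # figure quoted in the article
WORKING_DAYS_PER_MONTH = 21     # assumption for illustration only
HOURS_PER_MANUAL_RELEASE = 2.0  # assumption for illustration only


def releases_per_working_day(releases: int, working_days: int) -> float:
    """Average number of releases landing on each working day."""
    return releases / working_days


def manual_hours_per_month(releases: int, hours_each: float) -> float:
    """Total hands-on hours if every release is applied by hand."""
    return releases * hours_each


if __name__ == "__main__":
    per_day = releases_per_working_day(RELEASES_PER_MONTH, WORKING_DAYS_PER_MONTH)
    hours = manual_hours_per_month(RELEASES_PER_MONTH, HOURS_PER_MANUAL_RELEASE)
    print(f"{per_day:.1f} releases per working day")
    print(f"{hours:.0f} hours of manual work per month")
```

Under those assumptions, that works out to roughly six releases every working day and 260 hours a month, which is more than one full-time engineer doing nothing but upgrades.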
"What you really need to do is change the focus of your team from operating the platform to building tooling to manage the platform. Most of the teams I work with the product we actually end up creating is a Concourse pipeline like this that deploys Cloud Foundry for us."
Misconception two: "Our platform is unreliable"
The second misunderstanding Simmons looked to debunk was: "Our platform is unreliable."
"Cloud Foundry is very reliable," he stressed. "I've seen Cloud Foundries continue to serve apps as if nothing was going on for hours while the underlying IaaS is kind of melting into the ground.
"A lot of times when a customer says to me, 'our platform is unreliable', what they actually mean is: 'we're seeing lots of downtime during upgrades and we're seeing lots of downtime when we're not changing anything'. Or, 'the same issues keep reoccurring and we can't seem to stop it from happening'."
The solution? Testing platform upgrades in a sandbox environment to avoid disruption to key in-production systems would be a start.
Next is documenting past incidents. "If you have a problem and resolve it, and you don't record it, you just pretend it never happened, and then the same thing happens next month, but someone else is on the call, all the learnings are lost and you have to redo it," Simmons observed. "Every problem is a new problem, regardless of how many times it's happened."
Another piece of advice is to avoid blame culture when documenting issues.
"Make sure when you're writing down what the cause is to focus on process failures instead of personal failures," Simmons said. "For example, if you have an incident that someone accidentally deletes the production database, the root cause analysis is not 'this person deleted the production database', the analysis is, 'why is it possible for someone to accidentally delete the production database'."
Simmons also said that many clients tell him that they encounter problems because they run 'snowflake environments'.
"This is related to automation," he said. "If you're manually creating your environments, or maybe you're automating, and the automation isn't super strict, and they kind of drift apart, then you're in a situation where if you test an upgrade, or you test an app on pre-production, there is no guarantee at all that that's going to work when you go to production because they're different."
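One way to make that drift visible is to compare the settings of two environments and report every difference. The following is a minimal sketch of the idea, not a tool mentioned in the talk: the `find_drift` helper, the `MISSING` marker and the configuration keys and values are all hypothetical.

```python
from typing import Any, Dict, Tuple

Config = Dict[str, Any]
MISSING = "<missing>"  # hypothetical marker for a key absent on one side


def find_drift(pre_prod: Config, prod: Config) -> Dict[str, Tuple[Any, Any]]:
    """Return every key whose value differs between the two environments.

    Each entry maps the key to (pre-production value, production value).
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(pre_prod) | set(prod)):
        left = pre_prod.get(key, MISSING)
        right = prod.get(key, MISSING)
        if left != right:
            drift[key] = (left, right)
    return drift


if __name__ == "__main__":
    # Hypothetical settings, invented purely for illustration.
    pre_prod = {"platform_version": "1.4", "cell_count": 3, "log_retention_days": 7}
    prod = {"platform_version": "1.3", "cell_count": 3, "tls_enabled": True}
    for key, (left, right) in find_drift(pre_prod, prod).items():
        print(f"{key}: pre-prod={left!r} prod={right!r}")
```

An empty result is the goal: if the comparison reports any differences, a successful test in pre-production says little about what will happen in production.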
The answer to this issue is smaller releases. "I've seen this a lot in big enterprises and usually [large releases] are driven by fear of failure, or overly strict governance on how changes can happen," he said. Smaller, more frequent releases make it easier to roll back when issues occur with apps on Cloud Foundry.
"You can blue/green deploy, you can do creative route mapping, if something breaks, you can switch it back really quickly," he said.
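The route-switching idea behind a blue/green deployment can be illustrated with a toy model. This is a sketch under stated assumptions rather than Cloud Foundry's own implementation: `RouteTable` and `blue_green_switch` are hypothetical names, and the method names only loosely mirror the `cf map-route` and `cf unmap-route` CLI commands.

```python
from typing import Callable, Dict, Set


class RouteTable:
    """Toy stand-in for a platform router: maps a route to app names."""

    def __init__(self) -> None:
        self._routes: Dict[str, Set[str]] = {}

    def map_route(self, route: str, app: str) -> None:
        self._routes.setdefault(route, set()).add(app)

    def unmap_route(self, route: str, app: str) -> None:
        self._routes.get(route, set()).discard(app)

    def apps_for(self, route: str) -> Set[str]:
        return set(self._routes.get(route, set()))


def blue_green_switch(
    table: RouteTable,
    route: str,
    old_app: str,
    new_app: str,
    healthy: Callable[[str], bool],
) -> bool:
    """Shift traffic from old_app to new_app; keep old_app if the check fails."""
    table.map_route(route, new_app)        # both versions briefly share the route
    if healthy(new_app):
        table.unmap_route(route, old_app)  # new version takes all traffic
        return True
    table.unmap_route(route, new_app)      # roll back: old version never left
    return False


if __name__ == "__main__":
    table = RouteTable()
    table.map_route("shop.example.com", "shop-blue")
    switched = blue_green_switch(
        table, "shop.example.com", "shop-blue", "shop-green", lambda app: True
    )
    print(switched, table.apps_for("shop.example.com"))
```

Because the old version stays mapped until the new one passes its check, switching back is a single route change rather than a redeploy.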
On the other hand, "with the platform, rollbacks aren't really a thing".
"What do you do if you can't roll back? Well you have to fix forward, you have to figure out what broke, fix it and move forward with it," he said.
"So how easy is this if you've crushed together a month of work and released it all at once, and you have to dig through 10,000 lines of YAML to figure out what broke? Well it's pretty hard, especially with all the alarm bells going off. But if you do one feature at a time and release them one at a time, it becomes pretty easy."
Issues at the infrastructure level
The final cause Simmons detailed comes further down the stack.
"I recently worked with a company that was doing everything right," he said. "They had a dedicated team, they had a product manager, actually two product managers working in tandem, they had an ordered backlog that was continuously groomed and updated.
"They were writing tests for everything they could and they were doing automation from the start. But they kept having downtime, they kept having issues."
The issue resided with the underlying IaaS, where the infrastructure team was still operating in an "old-school waterfall manual process way".
"All of their environments were different and all of these problems that I just talked about for platforms was happening with the IaaS," Simmons added. "Ultimately, the IaaS should also be treated as a platform product or a software product and have its own dedicated team and everything I've already said."
Simmons concluded with a quote from the Cloud Foundry Foundation CTO Chip Childers: "Ultimately, you can't buy devops".
"Approaching a devops approach or a CloudOps, or a kind of platform-as-a-product approach, is all about changing culture," he added. "What you really need to do, rather than going out and finding the coolest new technology to bring in-house, is to spend time moulding and improving your existing processes to really leverage the technology."