Did self-hosting Kubernetes break Monzo?

A day ago, a friend sent me a link to an article about what looks like a confession from Monzo at this year’s virtual KubeCon: “We’ve learned a lot about self-hosting Kubernetes, but we wouldn’t do it again.”

Is this a last-minute confession that self-hosting Kubernetes broke Monzo? Did Monzo bet the house by being one of the early pioneers in Kubernetes?

This is a thought-provoking statement. But I don’t think anyone, myself included, believes that self-hosting Kubernetes and microservices doomed Monzo in any way. Quite the opposite, although it came at a pioneer’s cost.

Their statement strikes a chord with anyone who has set up a self-hosted Kubernetes cluster.

Back in 2016, there were hardly any managed Kubernetes services available (Amazon EKS, for example, only arrived in 2018). So Monzo had little choice other than self-hosting.

But why is self-hosting Kubernetes so hard in the first place?

Glad you asked. I will highlight a few of the challenges.

Installing Kubernetes is hard

A Kubernetes cluster has many components: etcd, kube-apiserver, kube-controller-manager and kube-scheduler on the control plane, plus kubelet and kube-proxy on every node, and more. You also shouldn’t set up just a single control plane node, because if it goes down you lose the ability to manage the cluster. So you need redundancy as well.
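To put a number on that redundancy: etcd, the data store behind the control plane, only stays available while a majority of its members are up. Here is a minimal Python sketch of that majority arithmetic. The function names are made up for illustration and are not part of any Kubernetes tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with this many members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still up."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} member(s): quorum {quorum(n)}, "
              f"survives {failures_tolerated(n)} failure(s)")
```

With one or two members you can’t afford to lose any of them, which is why three is usually treated as the minimum for a redundant setup.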

The problem is that there are many moving parts, so the likelihood of something going wrong is higher.

One of the biggest benefits of using GKE or EKS is that you don’t have to worry about installing Kubernetes at all: the control plane is managed for you.

Networking for Kubernetes is hard

Running hundreds of APIs simultaneously requires very good networking and monitoring to be in place.

When you self-host, Kubernetes doesn’t come with a networking provider out of the box. It is no surprise that networking is one of the most complicated aspects of Kubernetes. If you look at the network model specification for Kubernetes, you will see that some very specific network features are required. For example, every pod must be able to communicate with every other pod without NAT.

So if you self-host Kubernetes, you need to either use an available networking plugin such as Calico or flannel, or develop one from scratch. There are no guarantees, though, that any of these plugins will work well in your network. And that is before counting any potential bugs.
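To give a feel for what “no NAT between pods” implies: every pod needs an IP address that is unique across the whole cluster. One common approach is to carve the cluster’s pod address range into one subnet per node. The sketch below only illustrates that idea with Python’s standard ipaddress module. The 10.244.0.0/16 range, the /24 size and the node names are assumptions picked for the example. This is not Monzo’s setup or any plugin’s actual code.

```python
import ipaddress


def assign_pod_subnets(cluster_cidr, node_names, prefix=24):
    """Give every node its own non-overlapping slice of the pod range."""
    cluster = ipaddress.ip_network(cluster_cidr)
    # subnets() yields equally sized, non-overlapping slices in order.
    subnets = cluster.subnets(new_prefix=prefix)
    assigned = {}
    for name in node_names:
        try:
            assigned[name] = next(subnets)
        except StopIteration:
            raise ValueError(
                f"{cluster_cidr} has no /{prefix} left for {name}"
            ) from None
    return assigned


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    nodes = ["node-a", "node-b", "node-c"]
    for node, subnet in assign_pod_subnets("10.244.0.0/16", nodes).items():
        print(node, subnet)
```

Handing out addresses is the easy part. The plugin still has to make sure traffic for each of those subnets actually reaches the right node, and that is where your own network can get in the way.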

Dynamic provisioning of storage

Since Monzo self-hosted, there was no default dynamic storage provisioner set up. They had to set it up themselves. There are plenty of storage integrations available now (not so many back in 2016). But beware: some open-source projects out there provide integrations that don’t work very well. Storage also needs to be fault-tolerant and highly available; otherwise you end up with a single point of failure for hundreds of nodes.

Security is paramount in a highly distributed cluster

As a bank, Monzo deals with the personal details of at least 2 million customers. Imagine the complexity of handling security when you have information flowing across 1600 microservices. Monzo had it almost perfect, except for a data breach in 2018 where they found that some PIN information was leaked.

Security is very challenging because, by default, any pod can connect to any other pod on any node in the cluster. And even though Kubernetes components are expected to talk to each other over TLS, that can be switched off or bypassed, and traffic between your own pods is not encrypted unless you set that up yourself.

But even if Monzo were running on GKE, security would still be an issue, as it has to be built in by design at every layer of the application.

Monitoring is extremely important

Running 1600 microservices requires serious monitoring, covering every single microservice across hundreds of pods and machines. This data needs to be collected somehow. Monzo made a smart choice here: they went with Prometheus. But it is definitely not a straightforward implementation, as you can read in their blog post about Prometheus.

Being a Kubernetes pioneer required Monzo to INVENT the wheel

In software development you often hear the basic principle of not reinventing the wheel. This basically means that if something has already been developed, you should reuse it. The problem is that in 2016 Monzo was one of the Kubernetes pioneers. There were very few tools available for managing Kubernetes clusters, so Monzo had to invent the wheel and invest heavily in creating tooling for Kubernetes. Hats off to them: many of those tools were open-sourced and are available on GitHub.

It is therefore no surprise that Monzo would rather use GKE or EKS than host their own Kubernetes cluster. But perhaps they have created so many proprietary customisations for Kubernetes that it would be very expensive to migrate now.

I just wanted to add some food for thought to the latest news from Monzo, and I am looking forward to feedback from the Kubernetes community!

Resources:

https://skillsmatter.com/skillscasts/9146-building-a-bank-with-kubernetes

https://www.cbronline.com/news/monzo-down

https://www.telegraph.co.uk/personal-banking/current-accounts/monzo-leaves-customers-without-wages-payday-outage/

Monzo outage: Is it possible to fail in a good way?

https://www.thesun.co.uk/money/9608477/monzo-down-chaos-customers-payments-cash/

Posted

August 23, 2020

Tips

Armindo Cachada
