Adventures in Kubernetes (Part 1)

Written by Avi Kessner | November 11, 2020

Breaking the monolith

Back in 2016, we set out on our journey to microservices; in 2017, we started our journey in Agile; and in 2018, we began our journey with Kubernetes. During that time, we learned a lot about the difference between hype and reality, and wanted to share our experience.

Kubernetes is considered an industry standard in some circles, but for most companies, it is still just a buzzword surrounded by hype and magical transformations. We want to share our imperfect Kubernetes journey, which often looked more like a raft than a fleet of oil tankers.

If you watch talks about Kubernetes and look at most guides or stories, you tend to see a few common patterns.

A specific company culture of autonomous teams, with each team having full ownership of specific microservices that they work on.
Certain assumptions about how teams work together, or not together, to deploy services into specific namespaces dependent on the app they are working on.
A certain level of devops maturity, automation scripts, and infrastructure that make all the benefits of Kubernetes look easy.

As we grew on our own journey, we often thought that we must be doing something very wrong, because none of the assumptions seemed to apply to us.

When we decided to break up our monolith, we were guided by a key point from Netflix’s experience: moving to microservices was a journey of about seven years. The time and difficulty involved in breaking up a monolith cannot be overstated. Keeping this mindset of slowly breaking up the monolith, with an eye toward a multi-year process, saved us from many of the horror stories you read online about switching to microservices. Implied in that point is that you should move to microservices only when your organizational scale requires it. If you are a startup with fewer than 10 developers, don’t do microservices; it’s not worth it. However, if you are a large or growing organization, and your changes to the codebase are happening slower than you would like, or you’re spending too much money on scaling vertically, then we hope our adventure in breaking the monolith can be useful to you as well.

Originally, we planned to write a long series of articles about our journey, but it might be better to tell our story the way a TDD practitioner tells a joke: by starting with the punchline and working on the setup later.

There are many great articles online about tips and tricks in using Kubernetes, and some use case stories, but we hope to give you the tools to help make informed decisions by learning from our failures.

Why you can’t trust this article, and must verify everything.
Why taking an honest look at your current and future scale can help you learn from others.
Why it’s OK to start with Kubernetes even if you can’t gain all the benefits.
Why working with Kubernetes with an imperfect culture is still helpful.
Why adopting Kubernetes can help you improve your development and organizational culture.
Why slow and steady wins the Kubernetes race.

Each of these points will become a blog post of its own in the future, but until then, let’s start with the first point.

Why you can’t trust this article, and must verify everything

When we started on our Kubernetes journey, our main goal was to better scale our application. We had hit the limits and were seeing diminishing returns from simply deploying our monolith on more VMs. We had a good run with the three microservices we deployed, but quickly saw that orchestrating and managing those microservices ourselves would be a pain.

At the time, we were deciding between Docker Swarm, Mesos and Kubernetes. When we noticed that Docker Enterprise was using Kubernetes instead of Docker Swarm, our decision was made. We read a few books on Kubernetes, scoured the web for blogs and tutorials, and were convinced that Kubernetes could solve our problems. Most promising for us was the idea of “autoscaling”.

This was exactly the sort of magic we needed to improve performance and reduce costs!

Our simplistic understanding of this, however, was all hype! In theory, Kubernetes can autoscale and reduce your costs, but in reality, autoscaling will not be something you want to do often, and is not something you want to rely on for managing your costs.

Our first cause of confusion was the important yet subtle differences between HorizontalPodAutoscaler (HPA), VerticalPodAutoscaler (VPA), and “cluster autoscaling”. Let’s compare the reality to the hype for each of these.

Fact: HorizontalPodAutoscaler will spin up new pods for you, and will reduce the number of pods when they are not in use, based on CPU or memory consumption.
- Hype: The autoscaler will dynamically adjust to your current load, reducing costs and saving on resource usage.
- Reality check: CPU and memory consumption were not the limiting factors in our services. Network requests, DB connections, and latency were our biggest issues. Attempts to modify the autoscaler to scale based on our homegrown connection counters were prone to error and hard to test. The most important autoscaling feature for us isn’t built into the platform. Attempts to use plugins to scale on requests and connections have proven to be unreliable.
- Reality check: Scaling your pods without scaling your nodes has no effect on cost. We pay for the underlying VM and no cloud provider allows you to only pay for the % of CPU used within that VM. For bare-metal installations, there might be a savings in electricity usage, but that’s a minority of use cases and didn’t apply to us at all.
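
To make the two reality checks above concrete, here is a minimal sketch in Python. The replica calculation follows the formula given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with the default 10% tolerance. The function names, the node price, and the pod counts are assumptions made up for this illustration, not numbers from our clusters.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Replica count the HPA would aim for, per the documented formula.

    The min/max bounds here are illustrative defaults, not Kubernetes defaults.
    """
    ratio = current_metric / target_metric
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


def cluster_hourly_cost_cents(node_count, node_price_cents):
    """Hypothetical billing model: you pay per node (VM), not per pod."""
    return node_count * node_price_cents


if __name__ == "__main__":
    nodes = 3  # fixed node pool, no cluster autoscaling
    before = desired_replicas(4, current_metric=50, target_metric=50)
    after = desired_replicas(4, current_metric=90, target_metric=50)
    print("pods before/after:", before, after)
    # The pod count changed, the bill did not.
    print("cost before/after:",
          cluster_hourly_cost_cents(nodes, 20),
          cluster_hourly_cost_cents(nodes, 20))
```

Note that the metric here is CPU utilization. Nothing in this calculation knows about DB connections or request latency, which is why it did not help with our actual bottlenecks.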

Fact: VerticalPodAutoscaler will look at the load you have on your pods and adjust or recommend the ideal requests and limits for CPU and memory, letting your pods consume more resources without being throttled or OOMKilled.
- Hype: We no longer have to ask each development team to measure the amount of CPU or memory their service will use under load. We can automatically adjust the resources as needed.
- Reality check: You can’t have both HPA and VPA acting on the same CPU or memory metrics at the same time. They conflict with each other’s goals. In addition, the recommendations need to be validated with performance testing, although having the recommendations does save a lot of time.
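
The sketch below shows why a recommendation still needs validating. It is not the VPA's actual algorithm. It is a simplified stand-in, assumed for illustration, that takes a percentile of observed CPU usage and adds a safety margin, then checks the result against a peak measured in a load test. All names and numbers are hypothetical.

```python
def nearest_rank_percentile(samples, percentile):
    """Nearest-rank percentile using integer math (percentile is 1..100)."""
    ordered = sorted(samples)
    rank = -(-percentile * len(ordered) // 100)  # ceiling division
    return ordered[max(rank, 1) - 1]


def recommend_request(samples_millicores, percentile=90, margin_percent=15):
    """Simplified recommendation: percentile of usage plus a safety margin."""
    base = nearest_rank_percentile(samples_millicores, percentile)
    return base * (100 + margin_percent) // 100


def covers_load_test(recommended_millicores, load_test_peak_millicores):
    """True if the recommended request would cover the measured peak."""
    return recommended_millicores >= load_test_peak_millicores


if __name__ == "__main__":
    # Hypothetical CPU usage samples (millicores) from an ordinary day.
    ordinary_day = [100, 120, 110, 130, 125, 115, 105, 135, 140, 128]
    recommended = recommend_request(ordinary_day)
    print("recommended request:", recommended)
    # A hypothetical load test peaks far above anything seen on that day.
    print("covers load test peak:", covers_load_test(recommended, 400))
```

A recommendation built from ordinary traffic can only reflect the load it has seen, so a performance test remains the place to find out whether it holds up.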

Fact: Cluster autoscaling will spin up or remove nodes based on your pod requests, helping you manage costs and support spikes in traffic.
- Hype: With cluster autoscaling, you’ll never have downtime, and you’ll never pay for more nodes than you need!
- Reality check: When we started, these features were not mature and had a habit of reducing the cluster size to 0 during off times, causing daily downtime of 10-15 minutes!
- Reality check: Even with more robust solutions, it can still take a long time for nodes to scale up and down, meaning that autoscaling is not appropriate for dealing with a surge in traffic. Autoscaling is most useful to us for dealing with predictable changes in load, and does reduce some costs, but we still need to overprovision for high availability and maintaining uptime.
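
The effect of slow node scale-up can be shown with a toy model. The Python sketch below counts requests that exceed capacity while new nodes are still being provisioned. The traffic shape, capacities, and the four-minute delay are all made-up assumptions, and scale-down is not modeled.

```python
def unserved_requests(demand_per_minute, base_capacity, extra_capacity,
                      delay_minutes):
    """Total demand above capacity during a surge.

    Extra capacity is requested the first minute demand exceeds the base
    capacity, and only becomes usable delay_minutes later.
    """
    trigger_minute = None
    unserved = 0
    for minute, load in enumerate(demand_per_minute):
        if trigger_minute is None and load > base_capacity:
            trigger_minute = minute
        ready = (trigger_minute is not None
                 and minute >= trigger_minute + delay_minutes)
        capacity = base_capacity + (extra_capacity if ready else 0)
        unserved += max(0, load - capacity)
    return unserved


if __name__ == "__main__":
    surge = [100, 100, 300, 300, 300, 300, 300, 100]  # requests per minute
    # Autoscaled: new nodes take four minutes to become usable.
    print("autoscaled:", unserved_requests(surge, 150, 200, delay_minutes=4))
    # Overprovisioned: the capacity is already there when the surge hits.
    print("overprovisioned:", unserved_requests(surge, 300, 0, delay_minutes=0))
```

In this model, the autoscaled cluster only catches up after the first four minutes of the surge, while the overprovisioned one absorbs it from the start. That is the tradeoff behind paying for headroom.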

This is a simple example of where hype and reality don’t perfectly align. The hype isn’t false, and it wasn’t even misleading. But it didn’t fit our use case, and was over-emphasized. Even well-meaning developers showcasing the ideal benefits of the technologies they use can’t be fully trusted from reading alone. You must validate the claims in your own unique situation. Not because they might be lying or wrong, but because nobody else knows your situation, and no two companies have the exact same requirements, goals, or limitations.

The next step is to look deeply at your own situation and verify your realistic requirements.

We will cover that in part two of the series.

Avi Kessner, Software Architect

Avi started as a Flash Animator in 1999, moving into programming as ActionScript evolved. He’s worked as a Software Architect at OnceHub since September 2019, where he focuses on Cloud Native Automated workflows and improving team practices. In his free time, he enjoys playing games, Dungeons & Dragons, Brazilian jiu-jitsu, and historical European sword fighting.
