Kubernetes Horror Stories

SMAF-NewsBot

English - October 28, 2020 07:38 - 2.44 MB

Thundra sponsored this post.

Kubernetes is a feature-rich, complex container management system that runs across all environments — multiple public clouds, on-premises, and hybrid. It’s no surprise, therefore, that Kubernetes is often the lead character in application or infrastructure horror stories.

In this article, we introduce five such scary-but-true stories, which are described in full detail in our white paper. Warning: Even these abridged versions are not for the faint of heart — but they may save you from experiencing your own real-life and costly horror story.

Doomsday Preppers: Running Out of Resources

Serkan is co-founder and CTO of Thundra. He has 10+ years of expertise in software development, is an AWS Certified PRO and has a patent on distributed environments. He mainly works on serverless architectures, distributed systems and monitoring tools.

Moonlight, a service that matches software developers with companies looking to hire, uses Kubernetes via Google Kubernetes Engine (GKE) to host its web-based application.

The story starts on a Friday, when the Moonlight API began having trouble connecting to its Redis database. The API uses Redis to validate the session on every authenticated request, so it’s a critical component in their workflow. At the very same time, Google Cloud reported network service disruptions with packet loss, so the Moonlight team assumed that Google Cloud was causing the connectivity errors. But then things got worse. The following Tuesday, the dreaded doomsday scenario occurred: Moonlight’s website crashed completely.

With the help of the Google Cloud support team, they tracked application and resource usage, and the root cause was identified. GKE was scheduling high-CPU-consuming pods to the same node, which consumed 100% of the node’s CPU. The node would thus go into a kernel panic and become unresponsive. At first, only the Redis pods were failing, but as the pattern continued, all pods serving the traffic went offline.

The Moonlight team had designed a three-replica deployment of the web application in the cluster. They assumed that there would be one pod per node, so two nodes could fail before the system would go down. Unfortunately, they didn’t know that the Kubernetes scheduler can place several of those pods on the same node unless inter-pod anti-affinity rules specify which pods must never share a node. The problem was solved by using such rules to keep CPU-intensive and critical applications away from each other, resulting in a more reliable system.
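To make the anti-affinity idea concrete, here is a minimal sketch rather than Moonlight’s actual configuration. It builds a three-replica Deployment manifest in which each replica refuses to be scheduled onto a node that already runs a pod with the same label. The `app: web` label, the Deployment name and the image reference are hypothetical placeholders. The `podAntiAffinity` fields and the `kubernetes.io/hostname` topology key are standard Kubernetes.

```python
import json


def web_deployment(app_label: str = "web", replicas: int = 3) -> dict:
    """Build a Deployment manifest whose replicas must land on distinct nodes."""
    # Hard rule: do not schedule this pod onto a node (topologyKey = hostname)
    # that already runs a pod carrying the same app label.
    anti_affinity = {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app_label},  # hypothetical name
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": app_label}},
            "template": {
                "metadata": {"labels": {"app": app_label}},
                "spec": {
                    "affinity": anti_affinity,
                    "containers": [
                        # Hypothetical image reference, for illustration only.
                        {"name": app_label, "image": "example.com/web:1.0"}
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(web_deployment(), indent=2))
```

Because the rule is a hard requirement, a cluster with fewer schedulable nodes than replicas leaves the extra pods pending. The softer `preferredDuringSchedulingIgnoredDuringExecution` variant trades that guarantee for schedulability.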

Falling Bridges: Unresponsive Webhooks

Jetstack helps companies create and operate cloud native landscapes with Kubernetes, including provisioning multi-tenant applications. To enforce custom requirements, they started using the Open Policy Agent (OPA) as an admission controller.

One fine day, the team was using Terraform to upgrade the development clusters running in GKE, after having successfully performed the same procedure for the pre-production environment. However, 20 minutes into the upgrade, Terraform timed out. As if that wasn’t bad enough, the API server started timing out incoming requests. With nodes unable to report their statuses to the control plane, GKE assumed they were broken and started replacing them. The pattern continued until the entire cluster collapsed.

Jetstack and the Google Cloud support team discovered that GKE halted the upgrade when it was unable to complete the upgrade of the second master. The root cause: a mismatched namespace configuration that resulted in an unresponsive OPA webhook. The lesson learned was that admission webhooks, which can become a single point of failure in a cluster, must be monitored closely and configured with care.
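One possible reading of “configured with care” is sketched below. It is an assumption for illustration, not Jetstack’s actual fix. It builds a `ValidatingWebhookConfiguration` that limits the webhook to namespaced resources, excludes system namespaces, uses a short timeout and fails open. The webhook name, the `opa` namespace and the service name are hypothetical. The field names come from the `admissionregistration.k8s.io/v1` API, and the `kubernetes.io/metadata.name` namespace label is set automatically only on newer Kubernetes versions.

```python
import json


def opa_webhook(policy_namespace: str = "opa") -> dict:
    """Build a defensively configured validating webhook manifest."""
    return {
        "apiVersion": "admissionregistration.k8s.io/v1",
        "kind": "ValidatingWebhookConfiguration",
        "metadata": {"name": "opa-validating-webhook"},  # hypothetical name
        "webhooks": [
            {
                "name": "validating-webhook.example.com",  # hypothetical
                "clientConfig": {
                    # The namespace here must match where the service really
                    # runs; a mismatch leaves the webhook unreachable.
                    "service": {
                        "namespace": policy_namespace,
                        "name": "opa",
                        "path": "/",
                    }
                },
                "rules": [
                    {
                        "apiGroups": ["*"],
                        "apiVersions": ["*"],
                        "operations": ["CREATE", "UPDATE"],
                        "resources": ["*"],
                        # Skip cluster-scoped objects such as Nodes.
                        "scope": "Namespaced",
                    }
                ],
                # Do not intercept requests in system namespaces or in the
                # namespace that hosts the webhook itself.
                "namespaceSelector": {
                    "matchExpressions": [
                        {
                            "key": "kubernetes.io/metadata.name",
                            "operator": "NotIn",
                            "values": ["kube-system", policy_namespace],
                        }
                    ]
                },
                # Fail open: an unreachable webhook does not block the API.
                "failurePolicy": "Ignore",
                "timeoutSeconds": 5,
                "sideEffects": "None",
                "admissionReviewVersions": ["v1"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(opa_webhook(), indent=2))
```

Failing open is a trade-off: while the webhook is down, policies are not enforced. A team that needs strict enforcement might keep `failurePolicy` set to `Fail` and rely on the scoping, the timeout and monitoring instead.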

Finding Nemo: The Missing CNI Configuration