About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

ChaosSearch
@QuinnyPig



Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory.” Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



Corey: This episode is brought to you by Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful, because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible, all-in-one solution that protects your workloads and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.



Hello, and welcome to this edition of the AWS Morning Brief: Whiteboard Confessional, where we confess the various architectural sins that we and others have committed. Today's story begins once upon a time, when I took a job at a web hosting provider. It was the thing to do at the time: AWS hadn't eaten the entire world yet, so everything that we talk about today was still a ways off in the future. Building something out in a data center was therefore a more reasonable approach, especially for those with, you know, budgets that didn't stretch to infinity, or no willingness to be an early adopter of someone else's hosting nonsense. 



Now, they were obviously not themselves hosting on top of a cloud provider, because the economics made less than no sense back then. Instead, they had multiple data centers built out to serve customers’ various hosting needs. Each one of these was relatively self-contained unless customers built something themselves for failover. So, it wasn't really highly available so much as it was a bunch of different single points of failure: an outage of one would impact some subset of their customers, but not all of them. And that was a fairly reasonable approach, provided that you communicated that scenario to your customers, because that's an awful surprise to spring on them later. 



Now, I was brought in as someone who had some experience in the industry, unlike many of my colleagues, who had come up from the hosting provider’s support floor and been promoted into systems engineering roles. So, I was there to be the voice of industry best practices, which is a terrifying concept when you realize that I was nowhere near as empathetic or aware back then as I am now, but you get what you pay for. My role was to apply all of those different best practices that I had observed, osmosed, and occasionally bluffed to what this company was doing, and see how they fit, in a way that was responsible, engaging, and possibly entertaining. So, relatively early on in my tenure, I was taking a tour of one of our local data centers and was asked what I thought could be improved. Now, as a sidebar, I want to point out that you can always start looking at things and pointing out how terrible they are, but let's not kid ourselves; we very much don't want to do that, because there are constraints that shape everything we do, and we aren't always aware of them. Making people feel bad for their choices is never a great approach if you want to stick around very long. So, instead, I started from the very beginning and played, “Hi. I'm going to ask the dumb questions and see where the answers lead me.” 



So, I started off with, “Great, scenario time. The power has just gone out, so everything's dark. Now, how do we restart the entire environment?” And the response was, “Oh, that would never happen.” And to be clear, that's the equivalent of standing on top of a mountain during a thunderstorm, cursing God while waving a metal rake at the sky. After you say something like that, there is no disaster that is likelier. But all right, let's defuse that. “Humor me. Where's the runbook?” And the answer was, “Oh, it lives in Confluence,” which is Atlassian’s wiki offering. For those who aren't aware, wikis in general, and Confluence in particular, are where documentation and processes go to die. “These are living documents,” is a lie that everyone tells, because that's not how it actually works. 



“Cool. Okay, so let's pretend that a single server, instead of your whole data center, explodes and melts. Everything's been powered off, you turn it back on, and that one server doesn't survive the inrush current. That server happens to be the Confluence server. Now what? How do we bootstrap the entire environment?” The answer was, “Okay, we started printing out that runbook and keeping it inside each data center,” which was a way better option. Now, the trick was to make sure that you revisited that printout every so often, whenever something changed, so that you weren't looking at how things were circa five years ago, but that's a separate problem. And this is fundamentally a microcosm of what I've started to think of as the bootstrapping problem. I'll talk a little bit more about what those look like in the context of my data center atrocities.
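As an aside, that “revisit it every so often” step is the part that tends to get skipped, and it's the easiest part to automate. What follows is a minimal sketch, not anything from the episode: a scheduled check against Confluence's REST API that fails loudly when the runbook page hasn't been edited recently. The site URL, page ID, and environment variable names are hypothetical placeholders, and the 30-day threshold is arbitrary.

```python
#!/usr/bin/env python3
"""Hypothetical runbook staleness check (a sketch, not from the episode).

Assumes a Confluence Cloud site and an API token; all names below are
placeholders. Confluence's REST API exposes a page's last-edit timestamp
via GET /rest/api/content/{id}?expand=version.
"""
import os
import sys
from datetime import datetime, timezone

import requests  # third-party: pip install requests

CONFLUENCE_BASE = os.environ["CONFLUENCE_BASE"]  # e.g. https://example.atlassian.net/wiki
RUNBOOK_PAGE_ID = os.environ["RUNBOOK_PAGE_ID"]  # numeric Confluence page ID
AUTH = (os.environ["ATLASSIAN_EMAIL"], os.environ["ATLASSIAN_API_TOKEN"])
MAX_AGE_DAYS = 30  # arbitrary freshness threshold


def main() -> int:
    # Fetch the page metadata, expanded with its version (last-edit) info.
    resp = requests.get(
        f"{CONFLUENCE_BASE}/rest/api/content/{RUNBOOK_PAGE_ID}",
        params={"expand": "version"},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    page = resp.json()

    # version.when is an ISO-8601 timestamp, e.g. "2020-05-01T12:34:56.789Z".
    last_edited = datetime.fromisoformat(page["version"]["when"].replace("Z", "+00:00"))
    age_days = (datetime.now(timezone.utc) - last_edited).days

    if age_days > MAX_AGE_DAYS:
        # Non-zero exit makes a cron/CI schedule surface the problem.
        print(f"Runbook '{page['title']}' last edited {age_days} days ago; "
              f"re-verify the procedure and re-print the data center copy.")
        return 1

    print(f"Runbook '{page['title']}' edited {age_days} days ago; fresh enough.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run from cron or a CI schedule, a non-zero exit becomes the periodic nudge to re-walk the procedure and replace the printed copy that lives in each data center.

But first: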



This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store, with no further data movement required. Whether you're looking to process multiple terabytes, up to a petabyte of data a day, or just a few hundred gigabytes, it's economical and worth looking into. You don't have to manage Elasticsearch yourself; if your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on the infrastructure costs, which does not include the actual expense of p...
