Whiteboard Confessional: Click Here to Break Production

AWS Morning Brief

English - May 08, 2020 10:00 - 9 minutes - 13.8 MB - ★★★★★ - 76 ratings

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

CHAOSSEARCH @QuinnyPig

Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

Today on the AWS Morning Brief: Whiteboard Confessional, I'm telling a different story than I normally do. Specifically, this is the tale of an outage from several weeks ago. The person who shared this story with me has requested to remain anonymous and further wishes me to not mention their company at all. This is, incidentally, a common occurrence. Folks don't generally want to jeopardize their relationship with AWS by disclosing a service issue they see, whereas I don't have that particular self-preservation instinct. Then again, I'm not a big AWS customer myself. I'm not contractually bound to AWS in any meaningful way, and I'm not an AWS partner, nor am I an AWS Hero. So, all that AWS really has over me in terms of leverage is the empty threat of taking away my birthday. So, let's dive into this anonymous story. It's a good one.

A company was minding its own business, and then had a severity one incident. For those who aren't familiar with that particular designation, you can think of that as being the company's primary service fell over in an embarrassingly public way. Customers noticed, and everyone ran around screaming a whole lot. Now, if we skip past the delightful hair-on-fire diagnosis work, the behavior that was eventually tracked down was that an SNS topic had a critical listener get unsubscribed. That SNS topic invoked said listener, which in turn drove a critical webhook call via API Gateway. This is a bad thing, obviously.

Fundamentally, customers stopped receiving webhooks that they were expecting, and this caused a nuclear meltdown given the nature of what the company does, which I can't disclose and isn't particularly relevant anyway. But, for those who are not up to date on the latest AWS terminology, service names, and parlance, what this means at a high level is that a thing happens inside of AWS, and whenever that thing happens, it's supposed to fire off an event that notifies this company's paying customers. This broke because something somewhere unsubscribed the firing off dingus from the notification system. Now that we're aware of what caused the issue at a very high level, time to dig into how it happened and what to do about it. But first:

In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

The logs for who unsubscribed it are, of course, empty, which is a problem for this company’s blameless-in-theory-but-blame-you-all-the-way-out-of-the-company-if-it-turns-out-that-it-was-you-that-clicked-this-thing-and-didn't-tell-anyone philosophy. CloudTrail doesn't log this event because why would it? CloudTrail’s primary purpose is to rack up bills and take the long way around before showing events in your account, not to assist with actual problem diagnosis, by all accounts. Now, fortunately, this customer did have AWS Enterprise Support. It exists for precisely this kind of problem. It granted them access to the SNS team, which had considerably more insight into what the heck had happened, at which point the answer became depressingly clear, as well as clearly depressing.

It turns out that the unsubscribe URL at the bottom of every SNS notification wasn't authenticated. Therefore, anyone who had access to the link could have invoked it, and that's what happened when a support person did something very reasonable: Copy and paste a log message containing that unsubscribe link into a team Slack channel. It wasn't their fault [00:06:04 unintelligible] because they didn't click it. The entity triggering this was—and I swear I'm not making this up—Slackbot.
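One way to keep a pasted log snippet from carrying a live unsubscribe link is to scrub it before it is shared. The following is a minimal sketch, not the fix this company used. The URL pattern (an sns.<region>.amazonaws.com host with "unsubscribe" somewhere in the path or query), the placeholder text, and the function name are all assumptions made for illustration.

```python
import re

# Assumed shape of an SNS unsubscribe link: a URL on an sns.<region>.amazonaws.com
# host that mentions "unsubscribe" in its path or query string. Illustrative only.
_UNSUBSCRIBE_URL = re.compile(
    r"https?://sns\.[a-z0-9-]+\.amazonaws\.com[^\s\"'<>]*unsubscribe[^\s\"'<>]*",
    re.IGNORECASE,
)

# Hypothetical placeholder; any text that is not a clickable URL would do.
PLACEHOLDER = "[SNS unsubscribe link removed]"


def redact_unsubscribe_links(text: str) -> str:
    """Replace anything that looks like an SNS unsubscribe URL with a placeholder."""
    return _UNSUBSCRIBE_URL.sub(PLACEHOLDER, text)


if __name__ == "__main__":
    # Made-up log line with a made-up account ID and topic name.
    sample = (
        "level=info msg=delivered UnsubscribeURL="
        "https://sns.us-east-1.amazonaws.com/?Action=Unsubscribe"
        "&SubscriptionArn=arn:aws:sns:us-east-1:123456789012:example-topic:0000"
    )
    print(redact_unsubscribe_links(sample))
```

Running something like this over log text before it goes into a chat channel means a link previewer has nothing to fetch. It only reduces exposure; it does not change the fact that the link itself needs no authentication.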

Have you ever noticed that when you paste a URL into Slack, it auto-expands the link to show you a preview? It tries to do that on every URL, and you can't disable URL expansion at the Slack workspace level. You can blacklist URLs but only if the link expansion succeeds. In this case, it doesn't have a preview, so it doesn't succeed, so there's nothing for it to blacklist. Slack’s helpful feature can't be disabled on a team-wide level, so when that unsubscribe URL shows up in a log snippet that got pasted, it silently unsubscribed the consumer from SNS and broke the entire system.
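Because the unsubscribe was silent, one possible guard is a periodic check that the listeners you expect are still subscribed to the topic. This is a rough sketch under stated assumptions: `sns_client` is assumed to be a boto3 SNS client (or anything exposing the same `list_subscriptions_by_topic` call), and the function names and the idea of comparing against a list of expected endpoints are illustrative, not something described in the episode.

```python
from typing import Any, Iterable, Set


def subscribed_endpoints(sns_client: Any, topic_arn: str) -> Set[str]:
    """Collect endpoints of confirmed subscriptions on a topic, following pagination."""
    endpoints: Set[str] = set()
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns_client.list_subscriptions_by_topic(**kwargs)
        for sub in page.get("Subscriptions", []):
            # Unconfirmed subscriptions report "PendingConfirmation" instead of an ARN.
            if sub.get("SubscriptionArn", "").startswith("arn:"):
                endpoints.add(sub["Endpoint"])
        token = page.get("NextToken")
        if not token:
            return endpoints
        kwargs["NextToken"] = token


def missing_listeners(
    sns_client: Any, topic_arn: str, expected: Iterable[str]
) -> Set[str]:
    """Return the expected endpoints that are no longer subscribed to the topic."""
    return set(expected) - subscribed_endpoints(sns_client, topic_arn)
```

Run on a schedule, a non-empty result from `missing_listeners` could page someone before customers notice that their webhooks have stopped arriving.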

Now, there are an awful lot of things that could have been different here. Isn't this the sort of thing that might be better off with SQS, you might reasonably ask? Well, four years ago, when this system was built, SQS itself could not, and did not support invoking Lambda functions, so SNS was the only real option. T...
