Links

Follow Last Week In AWS on Twitter
AWS Outage Message
"Kinesis Outage" by Ryan Frantz

Transcript
Corey: This episode is sponsored in part by our friends at Linode. You might be familiar with Linode; they’ve been around for almost 20 years. They offer Cloud in a way that makes sense rather than a way that is actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent—not to mention lower—their performance kicks the crap out of most other things in this space, and—my personal favorite—whenever you call them for support, you’ll get a human who’s empowered to fix whatever it is that’s giving you trouble. Visit linode.com/screaminginthecloud to learn more, and get $100 in credit to kick the tires. That’s linode.com/screaminginthecloud.


Pete: Hello, everyone. Welcome to the AWS Morning Brief. It's Pete Cheslock again—


Jesse: And Jesse DeRose.


Pete: We are back to talk about ‘The Kinesis Outage.’


Jesse: [singing] bom bom bum.


Pete: So, at this point, as you're listening to this, it's been a couple of weeks since the Kinesis outage happened, and I'm sure there are many, many armchair sysadmins out there speculating about all the reasons why Amazon should not have had this outage. And guess what? You have two more system administrators here to armchair quarterback this as well.


Jesse: We are happy to discuss what happened, why it happened. I will try to put on my best announcer voice, but I think I normally fall more into the golf announcer voice than the football announcer voice, so I'm not really sure if that's going to play as well into our story here.


Pete: It's going, it's going, it's gone.


Jesse: It’s—and it's just down. It's down—


Pete: It's just—


Jesse: —and it's gone.


Pete: No, but seriously, we're not critiquing it. That is not the purpose of this talk today. We're not critiquing the outage because you should never critique other people's outages; never throw shade at another person's outage. That's not only unwise because you have no context into their world, it's also just not nice, so try to be nice out there.


Jesse: Yeah, nobody wants to get critiqued when their company has an outage and when they're under pressure to fix something. So, we're not here to do that. We don't want to point any fingers. We're not blaming anyone. We just want to talk about what happened because honestly, it's a fascinating, complex conversation.


Pete: It is so fascinating, and honestly, I loved the detail, a far cry from the early years of Amazon outages that were just, “We had a small percentage of instances have some issues.” This was very detailed. This gave out a lot of information. And the other thing, too, is that when it comes to critiquing outages, you have to imagine that there are unlikely to be more than a handful of people, even inside Amazon Web Services, who fully understand the scope and the interactions of all these different services. There may not even be a single person who truly understands how these dozens of services interact with each other.


I mean, it takes teams and teams of people working together to build these things and to have these understandings. So, that being said, let's dive in. So, the Wednesday before Thanksgiving, Kinesis decided to take off early. You know, long weekend coming up, right? But really, what happened was that capacity was added to Kinesis, and that caused it to hit an operating system limit, causing an outage.
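[Editor's note: AWS's public write-up attributed the trigger to the front-end servers exceeding an operating-system limit on the number of threads once the new capacity came online. As a rough, hedged illustration only (assuming a Linux host; this is not AWS's actual tooling), here is a minimal Python sketch of how you might inspect that kind of limit versus a process's current thread count.]

```python
# Minimal sketch: compare a Linux host's thread limit to a process's thread count.
# Purely illustrative; paths assume a standard Linux /proc filesystem.
import os


def system_thread_limit() -> int:
    """System-wide maximum number of threads (Linux)."""
    with open("/proc/sys/kernel/threads-max") as f:
        return int(f.read().strip())


def process_thread_count(pid: int = os.getpid()) -> int:
    """Current thread count for a process, taken from its /proc task directory."""
    return len(os.listdir(f"/proc/{pid}/task"))


if __name__ == "__main__":
    print(f"threads-max: {system_thread_limit()}")
    print(f"threads in this process: {process_thread_count()}")
```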


But interestingly enough—and what we'll talk about today—are the interesting downstream effects that occurred via CloudWatch, Cognito, even the status page, and the Personal Health Dashboard. I mean, that's a really interesting contributing factor or correlated outage—I don't know the right word here—but it's interesting to hear that both CloudWatch goes down and the Personal Health Dashboard goes down.


Jesse: That's when somebody from the product side says, “Oh, that's a feature, definitely not a bug.”


Pete: But the CloudWatch outage then affected some of the services downstream of CloudWatch—such as Lambda—which also included auto-scaling events. EventBridge was impacted as well, and that caused some ECS and EKS delays with provisioning new clusters and scaling existing clusters.


Jesse: So, right off the bat, I just want to say huge kudos to AWS for dogfooding all of their services within AWS itself: not just providing the services to its customers, but actually using Kinesis internally for other things like CloudWatch and Cognito. They called that out in the write-up and said, “Kinesis is leveraged for CloudWatch, and Cognito, and for other things, for various different use cases.” That's fantastic. That's definitely what you want from your service provider.


Pete: Yeah, I mean, it's a little amazing to hear, and also a little terrifying, that all of these services are built based on all of these other services. So, again, the complexity of the dependencies is pretty dramatic. But at the end of the day, it's still software underneath it; it's still humans. And I don't want to say that I am happy that Amazon had this outage at all, but watching a company of this stature, of this operational expertise, have an outage, it's kind of like watching the Masters when Tiger Woods duffs one into the water or something like that. It's just—it's a good reminder that—listen, we're all human, we're all working under largely the same constraints, and this stuff happens to everyone; no one is immune.


Jesse: And I think it's also a really great opportunity—after the write-up is released—to see how the masters go about doing what they do. Because everybody at some point is going to have to troubleshoot some kind of technology problem, and we get to see firsthand from this how they go about troubleshooting these technology problems.


Pete: Exactly. So, of course, one of the first things that I saw everywhere is everyone is, en masse, moving off of Amazon, right? They had an outage, so we're just going to turn off all our servers and just move over to GCP, or Azure, right?


Jesse: Because GCP is a hundred percent uptime. Azure is a hundred percent uptime. They're never going to have any kind of outages like this. Google would never do something to maybe turn off a service, or sunset something.


Pete: Yeah, exactly. So, with the whole talk about hybrid-cloud and multi-cloud strategies, you've got to know that there's a whole slew of people out there, probably some executive at some business, who says, “Well, we need to engineer for this type of durability, for this type of thing happening again,” but could you even imagine the complexity...
