About the Guest

Corey is a Cloud Economist at the Quinn Advisory Group. He has a history as an engineering director, public speaker, and cloud architect. Corey specializes in helping companies address horrifying AWS bills, hosts the Screaming in the Cloud podcast, and curates LastWeekinAWS.com, a weekly newsletter summarizing the latest in AWS news, blogs, and tips, sprinkled with snark.


Corey’s newsletter: Last Week in AWS
Corey’s professional site: quinnadvisory.com
Corey’s podcast: Screaming in the Cloud
Corey’s Twitter

Links Referenced: 

The Story of a Serverless Startup by Sam Kroonenburg
Corey’s talk, Myth of Cloud Agnosticism
Dan McKinley (@mcfunley)’s Choose Boring Technology


Transcript

Mike: Running infrastructure at scale is hard, it's messy, and it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're gonna talk about the rough edges. We're gonna talk about what it's really like running infrastructure at scale.

Mike: Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly, and author of O'Reilly's Practical Monitoring.

Mike: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time stream processing. All of this is available as open source, and they offer a hosted commercial version as well. You can check all of this out at influxdata.com.

Mike: Hi folks, welcome to the Real World DevOps podcast. I'm here with Corey Quinn, the editor of Last Week in AWS. Welcome to the show, Corey.

Corey: Thanks, Mike. It's always a pleasure to hear myself talking.

Mike: I'm sure it is. So for those who don't know, Corey is one of my closest friends. So, this might get a little off the wall and banter-y. But hopefully everyone will enjoy it. So for those who don't have the pleasure of having met Corey yet, Corey, what is it that you do?

Corey: A lot of things is probably the best and most honest response to that. But what I'm best known for is either shit-posting on Twitter and/or writing Last Week in AWS, which is a newsletter that gathers information from the AWS ecosystem every week, discards the crap that no one cares about, takes what's left, and then makes fun of it.

Mike: It is pretty fucking funny. So by day, you are also an AWS consultant?

Corey: Yes, but in a more directed sense. Specifically, I start and I stop professionally at fixing the horrifying AWS bill. It's one of those areas where, as a consultant, I find that I'm more effective when I'm direct. And so I'll aim at a very specific, very expensive problem.


Mike: Absolutely. So you and I were talking a while back, and this is something I have repeatedly run into: isn't Amazon just a single point of failure in my infrastructure? Shouldn't I really be focused on trying to mitigate the risk of AWS going down? What happens if us-east-1 explodes again? Like, my website's offline, and now I have huge problems. So shouldn't I at that point maybe start thinking about multi-region, or maybe multi-provider, or any number of other really dumb ideas?

Corey: The answer to all of that is generally "it depends," which is accurate and completely useless. The fact of the matter is that, depending on what your business is and what your constraints look like, you're the best person to say whether this is an acceptable risk or something that you absolutely need to address and focus on. The example here is: if your application has people's lives depending on it, then yeah, you need to be able to withstand everything up to and possibly including a nuclear event. Whereas if you're running my side project of "Twitter for Pets," and your site is down for four hours because an AWS region has an issue, maybe it's okay. Maybe the internet is better for it.

Mike: Yeah, that's a good point. It always seemed to me to be an incredible amount of over-engineering: people trying to get their applications and infrastructure to be capable of surviving basically everything. It's like, "Hey, wait a second. You run a very small social media application that no one's going to care about."

Corey: We saw a fair bit of this a couple of years back when the S3 apocalypse hit. This is not me trying to bag on Amazon in any meaningful way. They run an incredible amount of complexity at a stupendous scale that boggles the mind. Things break; it's what they do. That's the nature of how working with computers plays out. And there was a knee-jerk reaction that we saw from a lot of infrastructure types after this happened, where they immediately wanted to turn on replication for every S3 bucket so their data would live in one or more additional regions. I understand the reflexive reaction to that. There's no one better than ops people at fighting the last war. But there also needs to be a rational, measured response to this.
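For readers curious what "turning on replication" actually involves, here is a minimal sketch using boto3. The bucket names, account ID, and IAM role ARN are hypothetical placeholders, and S3 requires versioning to be enabled on both the source and destination buckets before it will accept a replication configuration.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical names; replication requires versioning on BOTH buckets.
    SOURCE = "my-source-bucket"
    DEST_ARN = "arn:aws:s3:::my-destination-bucket"
    ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication-role"  # placeholder

    # Enable versioning on the source (repeat for the destination bucket).
    s3.put_bucket_versioning(
        Bucket=SOURCE,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Replicate every object in the bucket to the destination region.
    s3.put_bucket_replication(
        Bucket=SOURCE,
        ReplicationConfiguration={
            "Role": ROLE_ARN,
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Priority": 1,
                    "Status": "Enabled",
                    "Filter": {},  # empty filter = replicate all objects
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": DEST_ARN},
                }
            ],
        },
    )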

Corey: One area that seems to get lost along the way is that you're going to be doubling, or in some cases tripling, your raw infrastructure costs. Add to that the additional complexity, the people's time to set all of it up, and the data transfer to get things from one place to another. Is this a reasonable response to your business constraints, or is it a knee-jerk reaction to something that is very unlikely to ever occur again? If we look back, we don't see a litany of individual websites that were called out for things breaking. It was called out as Amazon's failure, and things like Instagram, American Airlines, or Amazon's own status page were mentioned as examples of things that were impacted. But that was the day the internet broke for a little bit. So most of us just kind of shrugged and went outside. And eight hours later, things were working again and life went on.
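To put rough numbers on that cost doubling, here is a back-of-envelope sketch. Every figure is hypothetical, and the per-GB prices are approximations of AWS list prices that change over time; check current AWS pricing before relying on them.

    # Hypothetical S3 footprint replicated to a second region.
    storage_gb = 50_000             # e.g., 50 TB already in S3
    monthly_churn_gb = 5_000        # new/changed data replicated each month

    s3_standard_per_gb = 0.023      # approx. S3 Standard, us-east-1, $/GB-month
    xregion_transfer_per_gb = 0.02  # approx. inter-region transfer, $/GB

    second_copy = storage_gb * s3_standard_per_gb          # storage doubles
    transfer = monthly_churn_gb * xregion_transfer_per_gb  # replication traffic

    print(f"Extra storage: ${second_copy:,.2f}/month")
    print(f"Transfer:      ${transfer:,.2f}/month")

Under these assumptions, the second copy alone adds roughly $1,150 a month before counting transfer, engineering time, or added operational complexity.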

Corey: If you're an ad tech company, you need to be able to sustain that type of outage, and maybe it makes sense to do that. People are not going to come back and look at an ad again. But conversely, I was buying, I think, a pair of socks, I want to say five years ago, on Amazon. It threw a 500 error and suddenly I'm staring at a dog, which is fascinating. That's amazing. To hear the way some infrastructure people talk about outages, that would ...
