Episode #9: Chaos Engineering in Serverless with Gunnar Grosch

Serverless Chats

English - August 12, 2019 08:00 - 46 minutes - 42.4 MB - ★★★★★ - 29 ratings
Technology Education serverless faas baas cloud aws lambda Homepage Download Apple Podcasts Google Podcasts Overcast Castro Pocket Casts RSS feed

Previous Episode: Episode #8: Observability in Modern Applications with Ran Ribenzaft

Next Episode: Episode #10: Testable Serverless Applications with Slobodan Stojanović

About Gunnar Grosch

Gunnar is Cloud Evangelist at Opsio based in Sweden. He has almost 20 years of experience in the IT industry, having worked as a front and backend developer, operations engineer within cloud infrastructure, technical trainer as well as several different management roles.

Outside of his professional work he is also deeply involved in the community by organizing AWS User Groups and Serverless Meetups in Sweden. Gunnar is also organizer of ServerlessDays Stockholm and AWS Community Day Nordics.

Gunnar's favorite subjects are serverless and chaos engineering. He regularly and passionately speaks at events on these and other topics.

Twitter: @GunnarGroschWebpage: https://grosch.seLinkedIn: linkedin.com/in/gunnargroschYouTube: youtube.com/channel/UCTltOOcP1UZTpEQKQjjO-5Q

Links from the Chat

Gremlin: gremlin.comChaos Toolkit: chaostoolkit.orgAdrian Hornsby Lambda Layer: github.com/adhorn/aws-lambda-chaos-injectionThundra: thundra.ioYan Cui's Article: hackernoon.com/how-can-we-apply-the-principles-of-chaos-engineering-to-aws-lambda-80f87e3237e2

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly and you're listening to Serverless Chats. This week, I'm chatting with Gunnar Grosch. Hi, Gunnar. Thanks for joining me.

Gunnar: Hi, Jeremy. Thank you very much for having me.

Jeremy: So you are a Cloud Evangelist and co founder at Opsio. So why don't you tell the listeners a bit about yourself and maybe what Opsio does?

Gunnar: Yeah, sure. Well, I have quite a long history within IT. I've been working almost 20 years in IT, ranging everything from development through operations, management and so on them. Um, about a year and a half ago, we started a new company called Opsio. And we are a cloud consulting firm. Helping customers to use cloud services in any way possible and also operations.

Jeremy: Great. All right, so I saw you speak at ServerlessDays Milan, and you did this awesome presentation on Chaos Engineering and serverless. So that's what I want to talk to you about today. So maybe we can start with just kind of a quick overview of what exactly chaos engineering is.

Gunnar: Yes, definitely. So chaos engineering is quite a new field within IT. Well, the background is that we know that sooner or later, almost all complex systems will fail. So it's not a question about if it's rather a question about when. So we need to build more resilient systems and to do that, we need to have experience in failure. So chaos engineering is about creating experiments that are designed to reveal the weakness in a system. So what we actually do is that we inject failure intentionally in a controlled fashion to be able to gain confidence so that we get confidence in that our systems can deal with these failures. So chaos engineering is not about breaking things. I think that's really important. We do break things, but that's not the goal. The goal is to build a more resilient system.

Jeremy: Right. Yeah, and then so that's one of the things that I think maybe there’s this misunderstanding of too is that you're doing very controlled experiments, as you said, and this is something where it's not just about maybe fixing problems, either in the system, it's also sort of planning on for when something goes down. It's not just finding bugs or weakness, it's also like, what happens if DynamoDB for some reason goes down or some backend database like, how do you plan for resiliency around that, right?

Gunnar: Yeah, that's correct, because resiliency isn't only about having systems that don't fail at all. We know that failure happens, so we need to have a way of maintaining an acceptable level of operations or service. So when things fail that the service is good enough for the end users or the customers. So we do the experiments to be able to find out how both the system behaves and also how the organization, their operations teams, for example, how they behave when failures occur.

Jeremy: Well and about the monitoring systems too, right? I mean, we put monitoring systems in place, and then something breaks and we don't get an alarm. Right? So, this is one of the ways that you could test for that as well.

Gunnar: Yeah, that's a quite common use case for chaos engineering, actually, to be able to do experiments to test your monitoring, make sure that you get the alerts that you need. No one wants to be the guy that has to wake up early in the morning when something breaks. But you have to rely on that function to be there so that PagerDuty or whatever you use actually calls the guy.

Jeremy: And you said this is a relatively new field. You know, when you say new, it's like a couple of years old. So how did this get started?

Gunnar: Well, it actually started back in 2010 at Netflix, so I guess it's around nine years now. And they started a tool or created a tool that was called Chaos Monkey. And the tool was created in response from them moving from traditional physical infrastructure into AWS. So they needed a way to make sure that their large distributed system could adapt to failure so that they can have a failure. So they use Chaos Monkey to more or less turn off or shut down EC2 instances and see how the system behaved. So, that was in 2010, then I guess the first chaos engineer was hired at Netflix in 2014. So about five years ago, and in 2017, the team at Netflix published a book that's on Chaos Engineering. I think it's out on O'Reily Media, which is like the book on Chaos Engineering today that's used by most people who use chaos engineering.

Jeremy: So we can get into some more of the details about running the experiments, so I want to get into all of that, but I'm kind of curious, because this is something where, especially with teams now, and maybe as we get into Serverless too, you've got developers that are closer to the stack, there may be less OPs people or more of this mix. So is this like a technical field? Is it the devs that do it, the OPs people, like who's sort of responsible for doing this chaos engineering stuff?

Gunnar: Yeah, that's a good question. I would say that it's a question that's being debated in the field as well. Where does chaos engineering belong? And I think it's different in different organizations. Some teams have specific or some organizations have specific teams that are only working with chaos engineering like Netflix, like people at Amazon as well. But other organizations, they use chaos engineering and it’s spread out in the organization. So it might be operations that works with chaos engineering. But it also might be a DevOps team or just developers as well. But to do the experiments, you need to involve more or less the entire organization. So you use people from from many teams.

Jeremy: Right. Cause you're gonna run, in some cases, you run this earlier on where you’...

Twitter Mentions

@gunnargrosch