There are few jobs where you’re encouraged to break things, but Kolton Andrus has made a career out of “failure as a service.” Kolton is the CEO and Co-founder of Gremlin, a platform that seeks to help companies avoid outages and build more resilient systems.

On today’s episode of CTO Connection, Kolton explains the concept of chaos engineering and why it’s so important, particularly for large-scale operations. Take a listen and get ideas on how you can prevent outages.

[01:26] - Kolton's backstory
[03:21] - Managing a team at Amazon
[05:10] - Moving to a technical role at Netflix
[07:51] - Founding Gremlin
[10:36] - Formative experiences with chaos engineering
[15:39] - Graceful degradation
[19:33] - Next-level testing
[22:26] - Challenges with microservices
[26:05] - Efficiency & resilience
[31:18] - Preventing outages

Special thanks to our global partner – Amazon Web Services (AWS). AWS offers a broad set of global cloud-based products to equip technology leaders to build better and more powerful solutions. Reach out to [email protected] if you’re interested in learning more about their offerings.

CTO Connection is where you can learn from the experiences of successful engineering leaders at fast-growth tech startups. If you're technical and manage engineers, the CTO Connection podcast is a great resource for learning from your peers, whether you want to learn more about hiring, motivating, or managing an engineering team.

If you'd like to receive new episodes as they're published, please subscribe to CTO Connection in Apple Podcasts, Google Podcasts, Spotify or wherever you get your podcasts. If you enjoyed this episode, please consider leaving a review in Apple Podcasts. It really helps others find the show.

Podcast episode production by Dante32.