About Emrah Şamdan

Emrah Şamdan is the VP of Product at Thundra, a serverless observability tool for AWS Lambda environments. With the development team, Emrah is obsessed with helping the serverless community with their debugging and monitoring efforts, both in production and during development. He is responsible for making trouble for the Thundra engineering team while finding solutions to ease the life of serverless teams.

Twitter: @emrahsamdan
Thundra: Thundra.io
Blog: blog.thundra.io
Demo: demo.thundra.io

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly, and you're listening to Serverless Chats. This week I'm chatting with Emrah Şamdan. Hi, Emrah. Thanks for joining me.

Emrah: Hey, Jeremy. Thanks a lot for having me today.

Jeremy: So you're the VP of product at Thundra. So why don't you tell the listeners a little bit about yourself, your background and what Thundra is up to.

Emrah: Yeah, sure. So, you know me as the product manager for Thundra. I started as a product manager at Thundra while it was still an internal project at OpsGenie. We were a few engineers, me, and some designers building an internal product for OpsGenie engineers. Then it turned into a product and a company, and now we serve serverless developers with observability. So back in 2017, Serkan, our CEO and acting CTO, was developing some modules of OpsGenie with AWS Lambda. He had some problems with observability, and he couldn't find any solution that fit his purposes. So he said, hey, I can write some libraries myself (they were writing in Java at that time) that can give me some idea of how my Lambda functions are performing. He developed this as an extracurricular activity for OpsGenie and made it available, and it was sending data to Elastic at that time. They were looking at the data Thundra produced, and they thought, even before I joined OpsGenie, why don't we make it a separate product? Why don't we make it a separate company? So they hired me as a product manager for that. In October 2018, we decided to spin it off as a separate company because, you know, OpsGenie was sold to Atlassian, and Thundra would continue as a separate company. And we are helping serverless developers with observability by aggregating traces, metrics, and logs.

Jeremy: Very cool. All right, so I wanted to talk to you today about reducing MTTR in serverless environments, because I think when we think about mean time to repair, normally we have a lot of control. Like, if we're running our applications on-prem, then we likely have access to the physical servers and the hardware components, and even if we're running our applications on something like EC2, we still have access to the operating systems, the VM instance sizes, the attached storage, and the same is typically true with containers as well, right? So we have a lot of ways in which we can affect the time it takes to repair some of these hardware or even scale issues. But if you're in a serverless environment, it's quite a bit different, especially if you're using a lot of managed services from the cloud provider; you really don't have access to the underlying operating systems or hardware anymore. And I know some people have changed the "R" in MTTR to mean "recovery" or "resolution" since it's really less about actually repairing hardware. But maybe we can start there. Maybe you can give us your thoughts on what's different with how we respond to incidents in serverless versus how we would respond to incidents with more traditional applications.

Emrah: Definitely. So in traditional applications, as you say, there are resources from which we can easily gather information when we see a problem, an incident, in our system. But in serverless, on the other hand, you have different piles of logs that come out of the box from CloudWatch, from the resources the cloud vendor provides. But these are separate, and they don't give the full picture of what happened in a distributed serverless environment. And the problems here are different. In a traditional environment, the problem, most of the time, was actually about scalability, and you responded to it by giving the system more resources, by just increasing its power. But with serverless, the problem is that something goes wrong somewhere in a distributed system, and you need more than log files. You need all three pillars of observability. The first is traces: in our case, distributed traces, which show the interactions between Lambda functions, managed resources, managed APIs, and third-party APIs, and local traces, which show what happens inside the Lambda function. The other two are metrics and logs.
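[Editor's note: To make the distributed-tracing idea concrete, here is a minimal sketch using the AWS X-Ray SDK rather than Thundra's own agent, which is not shown in the episode. The function logic, table name, and subsegment label are illustrative assumptions.]

```python
# Minimal sketch of distributed + local tracing for a Lambda handler.
# Uses the AWS X-Ray SDK (not Thundra's agent); the names below are hypothetical.
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch boto3 (and other supported libraries) so calls to managed services
# such as DynamoDB show up as downstream nodes in the distributed trace.
patch_all()

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name


def handler(event, context):
    # A custom subsegment acts as a "local trace": it records how long this
    # block of business logic takes inside the function itself.
    with xray_recorder.in_subsegment("validate-order"):
        order_id = event["orderId"]

    # The patched boto3 call is traced automatically, so the interaction
    # between this function and DynamoDB appears in the service map.
    item = table.get_item(Key={"orderId": order_id})
    return item.get("Item", {})
```

With active tracing enabled on the function, the resulting trace combines the local subsegment and the downstream DynamoDB call, which is the kind of end-to-end picture Emrah describes.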

Jeremy: Yeah, right. And I think you’re right that the distributed nature of serverless is something that might be relatively new to people as well, so just figuring out where the problem is, or what component is causing the issue, is a challenge in and of itself. So the point about metrics is interesting too, because as you just said, the scalability is handled by the cloud provider for you with most of these services. So we’re likely not as worried about low level metrics like CPU usage anymore. So what are the signals of failure, like, how do we know that something is broken? What are the things that tell us something might be wrong in our application that we might want to address?

Emrah: Yeah, sure. So if scalability is not the problem, what might be? What are the good metrics we should look at? There are some metrics that are probably very predictable to everyone listening, but they do a lot to keep our systems available. The first, and most important, is latency, because our aim is to not receive timeout alerts, right? So the latency metric, the duration of invocations, is something we should keep an eye on. You need to watch how long your functions take, and if the duration is approaching the timeout, you should check what the reason might be. Is there a problem with a third-party API? Is there a problem with the resources we are using? So when you see latency climbing, you should approach it very carefully, because you don't want to be in a storm of timeout errors. The second metric, in my opinion, is memory usage. We provision memory for a Lambda function, and this is pretty much the only resource we control in serverless. Most of the time, developers give the function more memory than it actually requires just to speed up I/O and throughput and, let's say, the CPU. But in this case, we have problems with cost, because whenever your function gets triggered, it runs for some time, and the billing is calculated in GB-seconds. So if you allocate more memory than necessary, you may be losing some money on Lambda. This might be negligible if you don't use Lambda heavily, but if you're running Lambda in production and you're mostly on Lambda and serverless, you'll have problems with cost. To avoid that, you should tune your memory accordingly, an...
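[Editor's note: As a rough illustration of the two metrics Emrah calls out, here is a hedged sketch that pulls a function's invocation duration from CloudWatch, compares it against the configured timeout, and estimates GB-second cost from the allocated memory. The function name, time window, alert threshold, and per-GB-second price are illustrative assumptions (pricing varies by region and over time), not figures from the episode.]

```python
# Sketch: watch Lambda duration against its timeout and estimate GB-second cost.
# The boto3 calls (get_function_configuration, get_metric_statistics) are real
# AWS APIs; the function name, window, and price constant are assumptions.
from datetime import datetime, timedelta, timezone

import boto3

FUNCTION_NAME = "orders-service"          # hypothetical function name
PRICE_PER_GB_SECOND = 0.0000166667        # illustrative; check current regional pricing

lambda_client = boto3.client("lambda")
cloudwatch = boto3.client("cloudwatch")

config = lambda_client.get_function_configuration(FunctionName=FUNCTION_NAME)
timeout_ms = config["Timeout"] * 1000     # Timeout is reported in seconds
memory_gb = config["MemorySize"] / 1024   # MemorySize is reported in MB

end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
    StartTime=end - timedelta(hours=24),
    EndTime=end,
    Period=3600,
    Statistics=["Maximum", "Average"],
    Unit="Milliseconds",
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    # Flag hours where the slowest invocation is creeping toward the timeout.
    if point["Maximum"] > 0.8 * timeout_ms:
        print(f"{point['Timestamp']}: max duration {point['Maximum']:.0f} ms is "
              f"within 20% of the {timeout_ms} ms timeout")
    # Rough cost of an average invocation in that hour, driven by allocated memory.
    avg_cost = (point["Average"] / 1000) * memory_gb * PRICE_PER_GB_SECOND
    print(f"{point['Timestamp']}: ~${avg_cost:.8f} per invocation at "
          f"{config['MemorySize']} MB")
```

The point of the sketch is the relationship Emrah describes: duration approaching the timeout is the early warning for latency problems, and cost scales with both duration and the memory you allocate, which is why memory tuning matters.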
