Reliability is a Team Sport with Thilina Ratnayake of Lightstep

On-Call Me Maybe

English - May 02, 2023 04:00 - 46 minutes - 43 MB - ★★★★★ - 3 ratings
Technology monitoring tracing distributed tracing sre oncall on-call software software development technology tech

About the guest:

Thilina (known as "T") is a Site Reliability Engineer at Lightstep. T has had a varied journey in tech spanning from support, project management, and systems engineering, which has eventually led him to focus on the "people" side of Reliability Engineering. He is extremely passionate about communications and how that plays a huge role in improving the performance of teams and increasing the reliability of their systems.

Find our guest on:

Twitter LinkedIn GitHub Instagram Blog

Find us on:

On-Call Me Maybe Podcast Twitter On-Call Me Maybe Podcast LinkedIn Page On-Call Me Maybe Podcast Mastodon On-Call Me Maybe Podcast Instagram On-Call Me Maybe TikTok On-Call Me Maybe Podcast YouTube Channel Adriana’s Twitter Adriana’s Mastodon Adriana’s LinkedIn Adriana’s Instagram Adriana’s Bluesky Ana’s Twitter Ana’s Mastodon Ana’s LinkedIn Ana’s Instagram Ana’s Bluesky

Show Links:

Lightstep British Columbia - Watersheds Chatime Boba Cisco PowerPoint Leading SRE with Empathy Chaos Engineering Principles SLOs (Service Level Objectives) SLA (Service Level Agreement) P1 Outage (Production Issue Gradations)

Transcript:

ADRIANA: Hey, y'all. Welcome to On-Call Me Maybe, the podcast about DevOps, SRE, observability principles, on-call, and everything in between. I am your host, Adriana Villela. And with me, I've got my awesome co-host...

ANA: Ana Margarita Medina.

ADRIANA: And today, we are talking to Thilina Ratnayake, who works with us at Lightstep working as an SRE. Welcome.

THILINA: Howdy, howdy ho. Hello.

ADRIANA: So nice to have you. Now, first things first, where are you calling in from?

THILINA: I'm calling from Vancouver, British Columbia, Canada.

ADRIANA: Whoo. Awesome. We've got two Canadians in the house today. You are outnumbered, Ana. [laughs]

ANA: Once again, I'm outnumbered. I think we're having too many Canadian episodes. I'm going to speak to some On-Call Me Maybe managers, which is us. So we're going to have to start talking.

ADRIANA: [laughs]

ANA: I need more representation from just the United States of America, specifically California. [laughs]

ADRIANA: Oh yeah, that's true. Well, we did just record today with someone from California, so... [laughs]

ANA: Yes. True. True, true. Here and there, I'm outnumbered. [laughs]

ADRIANA: Yeah, here and there, you're outnumbered, either because they're from Canada or they're Brazilian. [laughs]

ANA: True.

ADRIANA: So it wouldn't be proper On-Call Me Maybe tradition if we didn't ask you what you're drinking today.

THILINA: Sweet, yeah. I am drinking some pristine mountain water from one of three watersheds that apparently feed my city from the mountains of British Columbia. Yeah, this is grade A water that's been filtered through a Brita filter.

ADRIANA: [laughs]

THILINA: So you know what? I'm just drinking really nice mountain water.

[laughter]

ADRIANA: This is the best selling of water that I've ever heard. I'm like, I want some of that. [laughs]

THILINA: Cool.

ADRIANA: I've got Lake Ontario water, yes. [laughs]

ANA: I thought we were going to get a Wikipedia summary of coordinates of where you can find these watersheds. I thought that's where this was going. [laughs]

ADRIANA: Oh my God. Yes. For real.

THILINA: I had it up on my screen. And I was like, oh, one of three watersheds. Oh, there are three. I don't know which one it's from. Well, [crosstalk 02:19]

ADRIANA: Well, we're going to have to include this in the show notes now because our listeners are definitely going to be curious about this. [laughs]

THILINA: And what are you all drinking?

ANA: For me, today it's actually green juice, which is kale, pineapple, apple. And I think I put ginger and turmeric. So I'm trying to go back into some more veggies and fruits in my diet.

ADRIANA: That's awesome.

THILINA: Very nice.

ADRIANA: Yes, I've got a glass of Lake Ontario prime H2O. [laughter]

THILINA: Ooh, wow.

ADRIANA: No bubble tea, sadly. But as we were talking before we were recorded, I was so happy that in Amsterdam for KubeCon, I found a Chatime, [laughs] which, for those who are not in the know, it's a bubble tea chain. It's an international bubble tea chain. And we happen to have one in Canada, well, more than one in Canada. We've got a few in Canada. So I was pretty excited. And I told the Chatime guy I'm like, "I have these in my country." [laughs] He probably looked at me like I was crazy. He's like, get away, woman. [laughs]

THILINA: So, a very important question, what was your order when you found that Chatime?

ADRIANA: Shoot. I think I got a lychee green tea.

THILINA: Ooh.

ANA: Nice.

ADRIANA: Yes, yes.

ANA: That sounds fancy and yummy. I love anything lychee.

ADRIANA: I know, right? I'm a fan of lychee. Now, I can't tell if it's lychee or leechee because people correct me either way.

ANA: I was about to say I call it leechee. I grew up calling it leechee.

ADRIANA: I call it lychee. I've got some Chinese friends who call it lychee. And then I've got other Chinese friends who call it leechee. One is always correcting the other. I'm like, which one is it? [laughter] So you never know.

ANA: Hit us up on social media if you know how to pronounce this. [laughter]

ADRIANA: There's our sidebar. I guess we should get into our regular business. [laughs] So, Thilina, how did you get into tech?

THILINA: Well, it's a very interesting story that starts off with a news clipping on the back of my grade 10 high school classroom [laughs] because, in that class, that room was used for multiple classes, one of them being Business 11 or Planning 10. So I was in my Planning 10 class, which is a high school class that talks about how to do useful things like do your taxes, or get a utility bill paid, or decide your future in the world. And one of the assignments was, what do you want to do after you graduate? Like, what's your plan? Come up with three plans. And that was the assignment.

And me being a classic grade 10 high schooler, I had just left it off till the block before it was due. [laughter] And it was like lunch, and then the next class would have been the one where I had to present. I'm like, oh, I got to think of something.

ADRIANA: [laughs]

THILINA: And then, looking to the side on this wall, was a newspaper clipping about Cisco, the telecommunications giant. And this clipping was someone in the business class had talked about how the future looks great for teleconferencing. In the future, we might have AI and holograms, so this is a good place to be. And I was like, well, I like computers. It's kind of cool.

ADRIANA: [laughs]

THILINA: And then, in the time between seeing the newspaper clipping and when it was due, I did a little research on how do I work at a tech company like Cisco? And it showed me a college program that I could do. It was like a diploma. So that was my plan A. And my plan B would be if I did my diploma and maybe a degree afterwards. I don't remember what plan C was. But I was like, cool, let's make a PowerPoint. Let's present it.

ADRIANA: [laughs]

THILINA: And that was it. So in a couple...a year happens, and I was like, what do I do after high school? And I was like, well, I've got this plan that I made last year. Why don't I just stick with it? And then I did. I did all of it to the letter.

ADRIANA: That's awesome.

ANA: [laughs]

THILINA: I did the diploma, and I did the degree. My first job was learning at Cisco while doing support there. So, yeah, unintentionally, a newspaper clipping set the tone of the next 10-15 years of my life. So yeah, the answer to your question is I got started with a newspaper clipping about Cisco.

ANA: [laughs]

ADRIANA: That is awesome. I love it.

ANA: I love when it's always the most random experience that one has growing up that's like, this was the moment that the light bulb went off, and I was like, oh, I can actually get paid for various years to do this. [laughs]

ADRIANA: Yes, defining moment. Also, that sounds like a really interesting class. I don't think I ever took any classes on paying utility bills and stuff like that. That sounds awesome. [laughs]

ANA: Filing taxes.

ADRIANA: Filing taxes.

ANA: The fact that a school teaches you. Like... [laughs]

ADRIANA: I know. No one taught me that. My first tax filing I did manually, pencil and paper. [laughs] Why?

THILINA: Nice.

ADRIANA: [laughs] But that's a really cool start. And I love how the most subtle things, I don't know, they seem almost like passing at the time, and then they come back. It's like it was meant to be. So kudos for making that happen. That's amazing.

ANA: And, I mean, I think it's great to also start asking students and kids early on what they want to do for their career. I know for me, I figured it out in middle school, which I am extremely fortunate for. But getting folks early in high school to be like, all right, you have electives. Start thinking about what you want to do three, four years from now. What makes you happy? What do you hate? Can you intern somewhere? Can you do a side project? Can you shadow someone?

THILINA: For sure. I think it's worth mentioning that part of that story involves me making a PowerPoint presentation. And the reason I bring this up is a lot of people talk about, like, oh yeah, I knew I always wanted to get into tech because when I was little, my parents got me this computer, and I started building programs. And that wasn't my story. For me, it was that my parents got me a Windows 95. And they didn't give me any games, but they got the PowerPoint suite. [laughter] So all day, I would just be making presentations on things.

And this is important because I got into tech, and my first job in tech was doing support, which is a lot of talking to customers. There was also presenting information. And a lot of people were like, "I just want to write code." I was like, "But I love talking about what we do and presenting it." PowerPoint, or the ability to present information, has served me so much more in life than being able to program. So all I'm trying to say is if you're curious about tech, it's not all about coding. There's a whole bunch of presenting and packaging information too, which I hope kids are taking away from it.

ADRIANA: Yeah, there's something to be said about being able to convey the information to people, like, yes, I know this, but can I tell somebody else about this? That's an art form.

THILINA: Yep.

ANA: One of the first bosses that I had was, like, we hire based on being a good person and a good communicator because we can teach anyone how to code. And it kind of stuck with me because I'm like, yeah, almost anyone can learn how to code as in if you put your head to it, and you do the work, and you're able to pick up on what's going on. But if you're not able to know how to be a kind person, how to communicate, how to work well with others, you're not going to be successful in any career you take part in, specifically technology or technology as we know it right now that's completely remote.

ADRIANA: Yeah, I totally agree. I think there's this long-standing misconception that you don't need to be able to communicate to be in tech. And, I mean, I've been in situations where I've been surrounded by brilliant developers who cannot express themselves to me, and so, well, what's the point? Your brilliance is lost because you can't share your brilliance with others.

THILINA: Yeah, that's really true. Communication is key.

ADRIANA: All right, yeah.

THILINA: That will be my tattoo if I get one.

ADRIANA: [laughs]

ANA: That would actually be a good one. I say that in personal and professional settings all the time. I'm like, communication is key; that's just it. Over-communicate whenever you can.

THILINA: Yes, I really believe in over-communication. Over-communicating is key, like, when in doubt, over-communicate. That's the biggest takeaway I've had from working remote during COVID, not for nothing. That's just a thing that sticks in my head a lot. I have a lot of thoughts on how teams work well remotely, and that's a huge tenet of it.

ADRIANA: Oh, tell us more.

THILINA: One of the best teams I ever worked on was a hybrid remote team. And what I mean by that is we had a couple of members that were in office and a couple of members that were in the States, so like a different country, different geographical area. But the team never felt separate. I never felt like talking to person B was like, oh, they're not here. And the reason for that is because we communicated a lot to the point that our chat was never silent between 8:00 and 5:00 because there's always things going on.

And it wasn't just like work-related things. It was like, I'm getting some coffee or like...it doesn't have to be context specific. But communicating frequently when you're remote and conveying tone goes into helping the other things that you do. And I feel like that goes a long way when you're working remotely, especially if you start new and someone's trying to get a vibe check of everyone. If you are not next to someone, you can't shoulder-tap them. Hearing what people talk about frequently goes a long way for building that trust.

So I highly believe in over-communicating because you'll actually find the right norm easily versus not communicating enough. The onus is on other people to be like, okay, I guess I'll talk. What I'm saying is it's better to over-communicate.

ADRIANA: Yeah, I agree with you. It's basically having a chatty Slack channel. Having a place where you're encouraging your team to shoot the shit, even if it's not necessarily work, right?

THILINA: Yes.

ANA: [laughs]

ADRIANA: You're building up that camaraderie. Prior to coming to Lightstep, I managed two teams. I had 13 people working for me. And I was terrified because I'm like, oh shit, I've never managed a team remotely. So I made a huge effort to, like, okay, even though I have two separate teams, like, we had a Slack for each team, but we had a joint team Slack.

And always posted stuff, whether it's like team announcements or funny things, because I wanted people to get used to communicating with each other. And we'd have like a bi-weekly sync-up meeting where, you know, so that it's not an us versus them sort of scenario. So I completely agree the over-communication is super key in a remote setting.

ANA: And I think you nailed it with the hybrid aspect. I mean, I think this came to fruition a lot during COVID where it was like, folks were still trying to figure out what was going to happen. It was like different countries and states were shutting down at different speeds.

But it's like, how do you make sure that everyone, no matter what work persona you have, actually carries a similar experience at your organization where it's like, we know how to collaborate really fast; we know how to move at similar speeds; we know how to get projects done? Or how do we make sure that all the tools that we have have collaborator access, that we're allowed to just kind of sync up quickly and be always on the same page?

THILINA: Right. 100%.

ANA: You mentioned that you got your start in support, but you're now working in SRE. I kind of wanted to ask you how was that transition for you? Why did you do it? How's it going?

THILINA: Okay, so I have to start the story about what happened immediately after graduating, which is that immediately after graduating, a lot of my friends were going into software engineering roles. And for me, I was in this weird place of feeling like, oh yeah, I know how to program versus, oh my God, I don't know what I'm doing. And then I got [laughter] my first job working as a system support administrator. And for one reason or another, that place was not really the most supportive for a junior. And within two weeks, I was like, this is not for me; I'm out. I don't really know what I'm doing. This isn't --

ADRIANA: Oh no. [laughs]

THILINA: Yeah. So within the first month, I had gotten my first job and then left my first job. And I joined a different organization because I was like, maybe I'm meant to be in more of like front-end web dev, maybe that's my path. And this place was also probably not meant for a junior engineer because I remember my second week, I was super excited, submitted my first-ever PR for review. And then, it came back with 78 points of feedback or to change on a Trello card. And they're like, yeah, so go ahead and do this async or just let me know if you have any questions. And I'm like, Ooh, I have so many questions. Like, should I even be here?

ADRIANA: [laughs]

THILINA: And, again, I was like, maybe this isn't for me. And so then I was feeling very hopeless and scared about what I was doing. And then my friend was like, "If you're not having fun doing dev work, come over to support. You know how to talk and type. You seem like you would have fun talking to people. And you're still in tech, so why not?" And I was like, sure. Let's do it.

So I jumped over into support and was there for like three years. And I was like, all right, now I feel confident in talking to people about tech. Let's get into some dev work. I did a whole bunch of internal internships. And I was like, yes, coding, that's the thing I want to do in microservices and networking. And then, I got into a team as a systems engineer. And it turns out that was the precursor to being an SRE because those systems we also maintained to ensure that they were running reliably. And it just so turns out that's what SREs do.

ANA: [laughs]

THILINA: So I was like, all right, SRE it is, and then just kind of magically dovetailed into there. So that was a bit more about that weird journey getting into tech and SRE specifically because it was a journey through IT, and front end, and support, and then all the way through systems to here, so yeah.

ANA: What one or two qualities do you think that SREs need to learn from support engineering?

THILINA: Oh, here we go.

ADRIANA: [laughs]

THILINA: So here's something you learn during support is that the amount of time that the user or the customer has for you is very limited. And you're a good support engineer if you can get the information to them quickly in a way that they can understand quickly, but they can also refer back to very easily. So you're working with tickets. If your ticket takes 15 paragraphs, not great. If your ticket takes five paragraphs, sure, good.

But if your ticket takes five paragraphs and then later, like two days later, they have a question, and they can look back at what you wrote, and they don't have to ask you again, that's a gold star, like, that's where you win. So as an SRE, you're doing the exact same thing but for people inside your company because people are usually looking to you to be like, how do I make sure my system doesn't OOM every three days? What should I be looking at? And as an SRE, you could be like, well, you should probably add more memory, or maybe you should do this.

But if you can give an answer that's very context-specific and packaged up in a way that they can understand how it immediately applies to them and also, in a way that when they look back in five days after they've made some changes, they can be like, oh, these principles still track, then that's how you win, in my mind, as an SRE. Tech chops is great, but packaging information in a way that helps people be more reliable with the way they do things that's where you win.

ADRIANA: Yeah, that makes sense. And I guess being in a support role, you have to develop a lot of empathy for your customer, which I would imagine is something that goes a long way then for an SRE role.

THILINA: Yep. In support, people talk to you, and they're not being successful with your software or the thing you're supporting. So there might be some angry phone calls, you know, like, hey, let's talk, and let's make it happen. But yeah, that's correct; in SRE, you're also responding in times of things not maybe going right, and the empathy is there.

But the empathy also needs to exist where it's like, someone's trying to just get something developed and deployed. And you don't want to be the one standing between them and deployment. And when I say them, I mean engineers or application developers, even though I don't like to think of us versus them. But when someone's coming to you asking about, "Why do I need to have these autoscaling policies? Can I just ship this? Because I can just get this done today."

The answer can't be, well, you need to be more reliable because, duh. It's more like, hey, man, you want to ship this today, but what if you get paged on Friday night? That would suck. Here's how we make sure that doesn't happen. It's never adversarial. It's never I'm telling you what's right because I know. It's more like, let's make sure none of us get paged because zero-page on-call, that's the way we want to do things.

ANA: I love that. It's actually thinking of reliability as a team sport, which I think it's not advocated for enough.

THILINA: Yes. There's one more thing I was going to say about support and SRE and the parallels, which is that in support, one way to buy time is to communicate. Like, if you're working on a ticket and you don't have the answer, and they're like, well, do you have an update for me? And your update could be like, here's the update, but it might not be a solution. Same thing goes for SRE.

Like, do you know how to do this very complicated thing where my 36 sharded databases keep on OOMing at this specific time? And the answer could be like, no, but we've checked the proxies. That's not part of the problem anymore. We rule that out. We're going to look at this next. And helping people through what your thought process is, I feel like that's helped me in support and SRE, too, because an update is better than nothing.

ADRIANA: Yeah, for sure. One question I had for you is how does observability factor into your role as an SRE?

THILINA: Hugely. It factors in a lot. Without it, I'm kind of blind.

ADRIANA: Awesome. That's what I like to hear.

[laughter]

THILINA: I can answer your question with an analogy and another support analogy if it helps too. So when I used to work in support for this company way back, it was around troubleshooting DNS, DNS, the phonebook for the internet.

ADRIANA: Ouch.

THILINA: Yeah. But let me tell you that it was a lot easier to troubleshoot tickets that were typed out than on a phone call. And the reason I say that is because when they were typed out, they would give you some data, which is obviously already better than being like, hey, I can't get to this site. It's not resolving. But also, we could ask them for a diagnostic, which would be a tool. It would do a bunch of nslookups. You could see what the results were, and then you can make informed decisions based on actual data.

What I'm trying to say is working on tickets without that data was extremely tough. You're kind of going off of like a couple of common root causes or maybe some things that could happen. But you're really depending on someone to package up the information that you need, versus a diagnostic would tell you exactly what you need. And you can sift through the data and filter out, and give a better answer. Same thing for observability; without that, without knowing how your systems are interacting and working or playing together, you or specifically me, like, I'm blind. I need that data. And it's hugely important, like, very critical. Yeah, observability.
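
A rough sketch of the kind of diagnostic described here, one that runs a batch of lookups and hands back structured results instead of a verbal description. This is not the actual tool from that company: the function names, the report shape, and the use of the system resolver via `socket.getaddrinfo` are all assumptions made for illustration.

```python
import socket
from typing import Callable, Dict, Iterable, List


def default_resolver(hostname: str) -> List[str]:
    """Resolve a hostname with the system resolver; return sorted unique IPs."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def run_diagnostic(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = default_resolver,
) -> Dict[str, dict]:
    """Look up every hostname and record either its addresses or the error."""
    report: Dict[str, dict] = {}
    for name in hostnames:
        try:
            report[name] = {"ok": True, "addresses": resolver(name), "error": None}
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            report[name] = {"ok": False, "addresses": [], "error": str(exc)}
    return report
```

Calling `run_diagnostic(["example.com"])` on a customer's machine would give the support engineer the resolved addresses or the resolver error for each name, which is the "actual data" the ticket otherwise has to describe in words. The resolver is injectable so the report logic can be exercised without network access.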

[laughter]

ADRIANA: Yeah, observability, rah, rah. Love it. [laughs]

ANA: As you've gotten a chance to see a bit of the SRE space, what do you think people are doing wrong? What do you think folks are approaching in a way that's like, ooh, this actually might not be the best way to do it?

THILINA: You can't engineer your way out of every problem, is the one takeaway I've had. And what does that mean? That is such a vague answer.

ANA: [laughs]

THILINA: And my response to that is that just because you've gotten a lot of systems and processes in place to increase reliability, like, you have observability, cool. You've got an autoscaling policy, awesome. But at some point, this is going to get to the point where something is breaking, even though you've got all those systems in place. And what matters at that point is how easily can people find the information required to solve the problem?

Observability is cool. It's a means to an end. It gives you some data, but that's just part of it. That needs to be tied into how you document how you troubleshoot. Just because you have observability doesn't mean that someone being woken up at 3:00 a.m. is going to know what to look for. The human element involves actually tying it all together into the human piece of, like, how do you troubleshoot? How do you integrate? This is a very vague answer.

But I think a big mistake that I've seen is people chasing after the next shiny technology thing or process. Like, I can add all these autoscaling policies. I can make sure that we never run out of resources. Yeah, but what happens when your network goes down? There are so many what-ifs. You cannot account for all the what-ifs, but you can definitely have a process or a document written that says maybe talk to this, or talk to this person. The human piece is hugely critical. Yes, that one piece.

ANA: You're just walking into one of my passions, the topic of chaos engineering, of preparing for those what-ifs.

ADRIANA: [laughs]

ANA: You just want these questions to be thrown at you. [laughs]

THILINA: Let's do it. I'm here for it. Let's go. [laughs]

ANA: What do you see as one of those best ways for a team to come together and really talk about those what-ifs and document them as you say?

THILINA: If you've been reading into chaos engineering, this is probably very familiar to you. But the act of breaking your own system goes a long way. And it's not just like, hey, you've worked on this before, like, try to break it. Do your thing, sure. But I think the best part comes from what if you have a newer member on your team, and you just get them to break it and just run through that whole exercise?

ADRIANA: [laughs]

THILINA: Because that's going to reveal a lot of questions. Like, oh, why do we go through two load balancers from these different paths? Well, it's because we have to factor in for this use case. Cool, but what if those people do something different? Vague answer. What I'm trying to say is look at things with new eyes and try to break something; that goes a long way. For me, recently, it's we just deployed a new service. We architected and deployed a new service, and going end to end on everything was kind of cool.

There was a point where I had to whiteboard out the headers that are exchanged between different components throughout the lifecycle. And sure, tracing will give you that; that's great. But what I'm saying is knowing the context for when those different headers and pieces come into play that only happens if you're actually going through this lifecycle with trying to do these use cases. So all that to say, try to break your stuff and write notes about it. That's a huge learning experience for me.

ADRIANA: I like that advice. I actually want to go back to something that you said earlier, which is when something breaks, and you get woken up at 3:00 a.m., the answer isn't immediately in front of you. However, observability can facilitate that. So how do you use that information to help you troubleshoot? I think that's something that would be really useful to our listeners to understand.

THILINA: I'll say the starting point for anything for me is always a doc. So I've always had an in case you get woken up at 3:00 a.m., click on this document and read through it. Sure, great. And that should lead me ideally to like a single pane of glass or a place where I can query for information. Once I'm there, whatever the page is, the dashboard doesn't really matter; I would like to see what my application does and just kind of run through it. And this is where the support side comes into play.

I've always seen everything from the customer's point of view, which is like DNS. I'm a customer; I look something up, I get a response. But what does my system do for my end users or the people that use it? And I would just go from, as the saying goes, from left to right and see what observability pieces kind of show me whether those things are green or not.

And this gets a little bit harder if what you're doing is asynchronous, like far-spanning. But hopefully, observability will be able to guide me to where to look and then allow me to give me the tools to drill down a bit further. But I don't think having a dashboard that accounts for every single possibility is possible. Just let me look at the different pieces.

ANA: Totally.

ADRIANA: So do you have something that gives you a clue on where to start looking, or is it that, in your case, the doc is the starting point? Where do you start looking? [laughs]

THILINA: The service that I kind of mentioned that we've been architecting and building out, we kind of go over the most common operations. Let's say there are three use cases, and maybe the first one is like serving requests. Alongside each common operation or use case, I usually have a dashboard or a set of things to look for associated to it. So, for example, one of the things is we serve users' requests on this endpoint. One way you can make sure that this is good is you should see the amount of requests being served be more than zero. You'll see memory being close to here, but it'll be linked to a dashboard or something to look at.

I have a weird way to explain this, and this might not make a lot of sense. But if you've ever played God of War or these old video games where you have to fight a big monster, you might get to the point where each different piece has a health percentage. You strike the person's arm, and the arm is now 77%.

I like to break down the system into multiple pieces and have each piece linked to a part in the dashboard. So even though you might not know we're serving requests or the arm does this, you can be like, let me just go check these and see what's green or not. I don't know if I answered your question. All that to say, I have a lot of different docs leading to different dashboards, and that's how I'll answer.
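
The "each piece has its own health bar" idea can be sketched as a small table that ties every common operation to a check and to the dashboard to open when the check fails. Everything concrete below is hypothetical: the component names, the metric keys, the 0.9 memory threshold, and the dashboard paths are placeholders, not Lightstep's setup.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


@dataclass
class Component:
    name: str                         # the "piece" of the system
    dashboard: str                    # hypothetical link a responder should open
    check: Callable[[Metrics], bool]  # returns True when the piece looks green


# Hypothetical pieces, loosely following the examples in the conversation.
COMPONENTS: List[Component] = [
    Component(
        name="request serving",
        dashboard="dashboards/request-serving",
        # "You should see the amount of requests being served be more than zero."
        check=lambda m: m.get("requests_per_min", 0) > 0,
    ),
    Component(
        name="memory",
        dashboard="dashboards/memory",
        # Assumed threshold; a missing metric is treated as unhealthy.
        check=lambda m: m.get("memory_used_ratio", 1.0) < 0.9,
    ),
]


def triage(components: List[Component], metrics: Metrics) -> List[Tuple[str, str]]:
    """Return (name, dashboard) for every piece that is not green."""
    return [(c.name, c.dashboard) for c in components if not c.check(metrics)]
```

With this shape, someone paged at 3:00 a.m. does not need to know what each piece does; `triage(COMPONENTS, current_metrics)` lists which pieces are red and which dashboard to open first.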

[laughter]

ANA: That makes sense. I have always kind of enjoyed at least one reference document to put everything in. Like, this is a playground to start. Either your alert is going to say what service, or what cloud provider, what region is actually having issues. And then go through the doc, and it's like, where can I pick up my first clue? Because at the end of the day, you're playing a detective game. You're investigating.

THILINA: Yeah. Well, one of the SREs I was talking to kind of mentioned that he divides his dashboards in two ways, and we've kind of been doing it too. One is like a machine-based dashboard, which shows the defaults like CPU, memory, RAM, whatever, and that allows you to see anomalies. But then the other dashboards are either tailored around SLOs, which is tailored around what you do with your service, or tailored around common operations. Like, if you know that one specific endpoint has a lot of asynchronous calls that requires a lot of memory to be kept in state, there's going to be a RAM counter here. So that's two ways. That's a better answer to your question.

ADRIANA: Okay. Yeah, yeah, that's awesome.

ANA: I mean, it's perfect because when we think about the reliability of services, specifically, it's like every service is going to have those unique reliability goals. So that is a perfect moment where service-level objectives come in because you kind of are able to turn that dashboard backwards and just give it to you quickly in a percentage of, like, an SLO is really close to customer impact. So you should actually be able to gauge from there and help your debugging go a little quicker, in a way.

THILINA: Yeah, I agree.

ADRIANA: On the topic of SLOs, do you have anything that you can share with regards to setting up SLOs for the first time? For example, what was it like the first time you delved into the world of SLOs? We talk a lot about SLOs in SRE, but I do know that a lot of organizations aren't necessarily super mature when it comes to SLOs. I think a lot of organizations are starting to see the value of them but haven't necessarily gotten to that place yet. So, what's your experience around that?

THILINA: So I have a very interesting relationship with SL* in general. Most people, when they go into engineering, are like, oh, we built a thing. We should probably figure out what level we can say that it's performing at so we can give it to marketing and salespeople so they can sell the product. That's cool. For me, I kind of came at it from the other side where I was in support.

So SLAs are very important to me because when a customer is mad, upset, angry, frustrated, sure, they might have not-so-nice words, but also, once their use of whatever you're doing is impacted to the point of breaching an SLA, then there are consequences. Depending on how an organization operates, if you breach an SLA, something happens, not great things.

So for me, I learned how important SLAs were to start with, which probably meant that when I went into looking at SLOs and trying to create SLIs, I was very anxious about doing that exercise. I'm like, oh no, what does this mean for the person supporting? But if I was to give you a one-word answer, it's that SLOs are collaborative, surrounded by two sparkly emojis. Yeah, it's collaboration. [laughter]

Initially, I worked with SLAs, like, understood the impact of them being broken. I was like, okay, that's not good. And then, I started working on an engineering team, and there were already SLIs and SLOs in place, which meant for me, as a service-level objective, my job is just to ensure that those didn't get breached. So it's a guiding light for the decisions I would make. Is this thing I'm about to do in the outage going to be bad or good, hurtful or less hurtful? Is this thing I'm about to introduce into the codebase going to have second or third-order impacts that could affect the SLO? Yeah, likely, pretty likely, maybe don't do it.

But coming up with SLOs, that's where you learn about SLIs. What do you want to observe and why? And when I say collaboration, you should never be doing this in a vacuum. [laughter] If the SREs are like, yeah, we're going to go away, talk about some SLOs; we're going to come back, and we're going to have some SLOs, that's bad. But for me, the first thing I had to do when I got the ticket that said, "Let's figure out the SLO strategies," was figure out who I had to talk to. That's probably like a product manager, probably people from support, and figuring out what we talk to our customers about.

Having all those people in the room to talk about what you want to measure goes a long way. And then the work of building SLIs and SLOs has already been really well written about. There's a whole bunch of books, maybe from some really good writers that we all know. Building SLOs and learning about them is not hard. Figuring out what you want to learn on and measure, that's the hard part, and that's where the collaboration comes in. So I guess the final answer is SLOs are collaboration.

ADRIANA: Awesome. I love it.

ANA: It is true. You definitely need to have the right people in the room. There are so many times that people forget to bring a PM, and you need to remember that there's business logic like OKRs, like BLOs, business-level objectives. You can call them whatever you want. But at the end of the day, you're keeping everything reliable so you don't breach a service-level agreement and have to pay your customers money for not upholding the standards of service that you need to. So definitely think about the business.

THILINA: Yeah. And I'll give you a good example of that. When I was a systems engineer, the thing we were working on was considered a critical tier zero service. So, for me, that meant if I got paged, I had to respond within three to five minutes and resolve it immediately because if I didn't, some small country might lose a big portion of X service for 30 minutes, which is not good.

So for me, I was always like, pager is going off. SLOs are being impacted, must impact bad; go do this now. And then afterwards, I worked in a different company as an SRE where it's like, there was an alert going off for a P1-related thing for like 30-40 minutes, and people were not super stressed out. I was like, why is everyone not stressed out? This is terrible. This is crazy.

And the PM was like, yeah, but the assessed value of this is not very high. It can wait, like, that's okay. That's why our SLO is actually lower. It doesn't have to be super high. But if it was just me in a corner making SLOs, I'd be like, nope, 99.9999% very important; that's how we do it. Reliability: that's my job. Not true; you're helping the rest of the org. So having more people in the room is very helpful.

ANA: That is true. You're almost going to set yourself up for failure if you don't talk to the right stakeholders. You're going to have a 99.99% SLO and your business need actually may have been just like 92%, and then you over-engineer, which then costs your business a lot more money. [laughs]

THILINA: And then you train your users for a higher level of stability or reliability than you're actually supposed to be doing, and then people have a disconnect in expectations.

ADRIANA: Absolutely. You mentioned one thing that I want to go back to with regards to SLOs, which is I think you implied that having your SLOs also keeps you accountable in terms of when making changes to the system. But then I guess the other side of the SLOs is also that they are a moving target of sorts because you're always iterating on them. In your experience, can you talk a little bit about iterating on SLOs?

THILINA: When I think iterating SLOs, that obviously means tweaking the number to be whatever the business deems it to be. And so when I say that you're making decisions based on the SLO, it doesn't mean if I do this now, is this going to break SLO? I'm thinking more that the SLO is to help you do some prioritization of the changes you're about to make on a system.

So if you know that something is considered an SLO that's ranked higher on the list of SLOs, if you do ranking, and something isn't, that's the thing that's prioritized. Maybe you're making changes that affect it; you should have a bit more due diligence. Versus something, like, where do you want to concentrate your effort? And that helps you prioritize.

ADRIANA: Yeah, that makes sense. What about iterating on SLOs? Have you ever been in a situation where you're like, okay, we came up with this SLO, and then you have an incident, and you're like, er, that didn't quite cover it? [laughs]

THILINA: I have the opposite story, actually.

ADRIANA: Oh.

THILINA: Which is at one of my organizations, every time an SLO was breached, that usually happened due to an incident. And that meant that there would be a retro or a postmortem where everyone would get together and do the ritual and the ceremony of, like, what went wrong, and why was this bad? What was the impact? In that case, we found ourselves in a string of repeated postmortems. We were like, oh, why are we here? Oh, something went down. What was the impact? Not that bad. Then why did we get paged? Oh, because our SLO is set to this.

And then that led to the question, should it be set to this? No. Okay, well, maybe let's tweak it down. Does that sound good? And it was like, yeah, sounds good. It was like, all right, cool, let's do it. And then we didn't get paged as much. Yeah, things still happen, but that was tweaking it in a way that's making it less sensitive because we found that it wasn't as big of a priority. So kind of the opposite story of what you hear about.

ADRIANA: Yeah, that's actually such a great thing to bring up.

THILINA: That story really depends on SLOs being part of the process as opposed to just a thing you chase for. Having SLOs on its own doesn't do much for you other than help you tailor your dashboards. But having SLOs be part of your incidents, your retros, your postmortems, your reliability reviews, other people have to care about your SLO than just your sales folks and you as the SRE.

ANA: Yes, seriously. It goes back to reliability is a team sport, like, collaboration.

THILINA: 100%, yeah.

ADRIANA: I love the story of revisiting the SLO and realizing, like, oh, it was too sensitive, because that is just as important. If you're getting over-paged, you're getting alert fatigue, basically. That's not good for anyone. You give your team PTSD, and then they stop taking the alerts as seriously, or they jump at every little thing. And that's not healthy for your team and, therefore, not healthy for your system because if you have an unhealthy team, well, it'll lead to an unhealthy system.

ANA: Burnout. It's called burnout. [laughs] And your team leaves, or word gets out of this is a firefighting organization, don't go work there, which definitely happens a lot. On that topic of iterating on SLOs, I always have felt that SLOs make it a lot easier to adjust those pagers to those alerts to actually not get pager fatigue versus having them for every single service based on dashboards and going that way. It's harder to tell when it actually matters to the customer when you're looking at that metric in a dashboard versus this is the SLO within my dashboard, within my service configuration.

THILINA: Yeah. Actually, one golden tidbit that I would like to pass along for SLOs, and this is like, if there's one thing I learned from this experience, is that if you're starting to tweak an SLO, do a look-back analysis. What I mean by that is the system we're setting up is going to be taking over for something else. And we noticed that the SLO for the old system was set lower, and the one we wanted to set up was higher.

And the question we wanted to ask ourselves before we implemented this was based on this higher SLO that we want. If we had applied this looking back to this existing system, how much more would we have been paged, or what would the impact have been to the team? So doing a look-back if you're going to tweak up for sure would help answer that question or guide you more. Because if you say I'm going to increase something from 97 to 99, you're like, oh, great. That's awesome. That's great. Reliability: awesome.

But if we say we're going to increase from 97 to 99, but if we look back in the last two months, that means a team would have been paged 36 times as opposed to 20, oh, that's a little concerning. How important is reliability? Is 2% really worth 16 pages? Like, what's the impact here? So look back, analyze against previous data. That's a huge thing I learned from my first experience setting up SLOs.
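
A look-back like this can be approximated with very little code. The sketch below assumes you can export historical SLI readings as a list of percentages, one per measurement window, and that each window below the target would have produced one page. Both the sample data and that paging rule are assumptions for illustration, not details from the episode.

```python
from typing import Dict, Sequence


def count_breaches(sli_values: Sequence[float], target: float) -> int:
    """Count measurement windows whose SLI fell below the SLO target."""
    return sum(1 for value in sli_values if value < target)


def look_back(
    sli_values: Sequence[float],
    current_target: float,
    proposed_target: float,
) -> Dict[str, int]:
    """Compare how often the old and the proposed target would have fired."""
    current = count_breaches(sli_values, current_target)
    proposed = count_breaches(sli_values, proposed_target)
    return {
        "current": current,
        "proposed": proposed,
        "extra_pages": proposed - current,
    }


if __name__ == "__main__":
    # Hypothetical availability percentages, one per window.
    history = [99.9, 98.5, 96.0, 99.2, 97.5, 95.0, 99.95, 98.9]
    print(look_back(history, current_target=97.0, proposed_target=99.0))
```

The `extra_pages` figure is the number to bring to the room: it turns "let's go from 97 to 99" into a concrete cost for whoever carries the pager.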

ANA: That's an amazing golden nugget that a lot of folks don't always take into account, like, consider what could have happened as history stands but also look into the future and say, well, that means that we are only allowed to be down these minutes. So we have to be a lot faster about triaging incidents and things like that, making sure that you also think about that downtime part. Your error budget, I think, is the word I'm actually trying to think of in my brain [laughs] that I couldn't get to.
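
The "only allowed to be down these minutes" point is the error budget, and the arithmetic is short. This sketch assumes a time-based availability SLO measured over a 30-day window; the window length is an assumption chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based availability SLO permits per window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for target in (97.0, 99.0, 99.9):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days: {minutes:.1f} minutes of downtime")
```

Under these assumptions, moving from 97% to 99% shrinks the budget from roughly 1,296 minutes to roughly 432 minutes per 30 days, which is why a higher target also demands faster triage.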

But with your golden nugget, it made me think of the question of, like, as you're sharing your experiences of going through this, do you have a platform where you share with folks some of these golden nuggets in a longer format?

THILINA: Yeah, I mean, I have my blog that I've been really making an effort to update more; it's tratnayake.dev. One thing that's important to know about my blog posts is it's not meant to be...like; it's definitely more like a lab notebook that I keep with me as I learn things. So, for example, what I'm going to be writing about is me talking about a change I made in Envoy that's going to go into detail about what is Envoy if someone was reading it for the very first time. So if you're into blog posts and learning things about stuff that you might already know because we get really elementary, you should totally check it out.

ADRIANA: We will link to it in the show notes. Thank you.

ANA: That is pretty neat. And as folks eventually get to hear our season finale, you'll get to hear why sharing your learnings as you go really does matter. It just makes the industry a lot better. [laughs]

THILINA: Oh, sharing learning is huge. Like, if there's advice I could give to a junior engineer, just get better at explaining what you do and what you're thinking. That will serve you more than learning regex. That's a spicy take.

ADRIANA: Yeah, it's so true. Well, I found with personal experience, it's one thing to solve it for yourself. It's another to solve it for yourself and then write about it, and then realize that there were some gaps in your brain in understanding because now you're trying to explain it so someone else can follow the instructions. And so you got to do a little extra research, which is good because you learn some more stuff along the way.

THILINA: 100%. Writing more things down makes your team better and makes your team faster. One big thing I took away from support is if something's not clear to a customer, there should be a doc for it, so immediately write a doc: cool. And then, in the future, you can always just send back that doc as opposed to having to answer that question. But for me, I keep notes on everything I do, like process-related.

So I had to deploy a release for a thing two months ago, and I wrote about it. And I was like, well, it's just for me but whatever. And then this was supposed to be a rare operation, but this became less rare, and we did more releases, and some teammates had to do it. And I just gave them my doc. And as opposed to me having to walk them through for two hours just like I had to learn it, they were able to just take the doc, do what they needed to do, and ask me when they run into questions. And that made the team faster. So write things down. You never know when it's going to help you. I'm a huge proponent of writing things down.

ADRIANA: Yeah. And then now you're no longer a bottleneck.

THILINA: Yes.

ANA: Your future self or someone else in the community will thank you for it.

ADRIANA: Yes, yes. And I would add, add lots of details because my biggest pet peeve, and I've said this multiple times, is when people will write a technical blog post and either assume that I know what the hell they're talking about, or they're just tired of writing. So they'll leave out details, and I'm like, but I don't know what you're talking about, dude. [laughs]

THILINA: Oh, yeah, err on the side of over-communication, again, just provide more context. It doesn't hurt.

ADRIANA: Yes.

ANA: Add all the damn links. Let the user decide if they want to follow the path to them. [laughs]

ADRIANA: Yeah, exactly as opposed to...do you remember those math books in high school that will leave the proof up to the individual to prove, and it's like, nooo, I don't know how to prove the fucking proof. [laughter] Oh.

ANA: I freaking hated that part of math, by the way.

ADRIANA: Oh my God, me too. It's like, no, I will not. Don't leave it up to the individual because the individual has no idea. [laughs]

THILINA: Correct.

ANA: So you're teaching me things. Why are you not teaching me what I paid you to do? [laughs]

ADRIANA: Yeah, exactly. So yes, over-communicate and write it down.

ANA: As we're getting to wrap up the episode, I wanted to ask a little bit more about what your life as an SRE is. What is on-call at Lightstep like?

THILINA: On-call is really nice here. It's very, very nice. And I say that because I've been part of organizations where P1 outages would be a very common thing multiple times a week.

ADRIANA: Ouch.

THILINA: That is not the case here. But I think more than that, on-call is very nice because we ruthlessly prioritize silencing the things that go bump in the night. And what I mean by that is not just like, oh yeah, just put it in another room; we're not going to hear it anymore. It's more like, oh, why did this go bump in the night? Oh, because this happened? Cool. We're working on that immediately now to make sure that the next person that's going to be on-call does not get woken up by that.

That ruthless prioritization of ensuring that we reduce the impact to on-call's livelihood is huge. Also, I really enjoy the fact that we will write down the things that we see that could be improved or could be going wrong, and then people will read it and actually take the time to work on it. Like, that's really cool.

ADRIANA: That's awesome.

THILINA: We've talked about burnout. I think we mentioned burnout earlier. And burnout can come from many different ways; one of them could be pages and being fatigued from alerts. Burnout can come from doing too much too often. But burnout can also come from trying to write things down and raise awareness to a problem and not being heard. That can lead to a lot of burnout. And I'm very happy that whenever we have something to bring up or we write down, we'll read together and we'll discuss, and that goes a long way.

ANA: That's awesome. And when you're on-call, what are some ways that you make sure to take care of yourself?

THILINA: When I'm on-call, if I'm going to be away from the pager, I'll always be like, hey, I got to go do this thing for an hour. Is someone able to cover me or just keep an eye on things? Again, erring on the side of over-communication really helps. But also scented candles are really nice. And the reason I bring that up is incident response at 3:00 a.m. with a scented candle: game changer. Game changer. [laughter] You're getting ready for that conference bridge, and you know C-suite people are about to join; you got a scented candle. You got some vanilla tapioca in the background. Everything's fine. We're good.

ADRIANA: Wow.

ANA: [laughs] Do you have a favorite brand of candle? I think I need a link here. [laughs]

THILINA: No, I just have smells that I would recommend. Definitely, holiday vanilla is one of my favorite smells. I don't know if that even transcends to different places. But yeah, vanilla is a great smell.

ANA: It's a very soothing one, so I totally get that. And on that realm of questions, what is the best way your team or manager has supported you while you've been in an incident?

THILINA: I was going to bring up that it's super great when managers bring you pizza, but that's like an old thing that people hear about outages and incidents.

ANA: All the time. [laughs]

THILINA: Yeah, all the time, yeah. [laughs] My manager has supported me; previous managers have supported me during an incident. It's kind of helping me run interference if I'm leading an incident. So like, if I'm on a call and I'm doing things, just helping out by being like, "Oh, this question was answered here," or like, "Check over here," kind of being like a secondary gopher or runner.

But more importantly, during an incident, if my manager DMs me and is like, "Hey, I see you've got this. Is there anything you need right now? Or is anything you need later? Just shoot me a text or just even give me a call, and I can help you out." Just letting me know they're there is very helpful. It goes a long way.

ANA: Just show up. [laughs]

THILINA: Yeah, just show up, yeah.

ADRIANA: That makes a huge difference because there's nothing more annoying than thinking that you've been abandoned by your manager, so...

ANA: Most definitely. So making sure to over-communicate and put people first. We are working on systems. We want to keep them reliable. At the end of the day, there are people behind those systems that are trying to keep them up and customers also.

THILINA: Yeah, and I think that over-communication piece, how that applies to now, is that when we were in the office, people could see you. And for me, I'm very emotional, so people can see when I'm not feeling great. But when you're behind a screen, you can't really do that. So I've made use of being like, hey, can I talk to you, or can I vent to you about this for like five minutes? Or, I'm very frustrated about this.

All my managers have been very good about that, just being like, "Yeah, let's just jump on a call. Let's talk about it," just being there. Showing up is good, but being there and just being available to take those really quick calls goes a long way, especially if you're remote. I highly recommend that.

ADRIANA: Yes.

ANA: I love that advice for managers. Just being there really does help. It's true. I've had a fair share of managers or managers where I'm like -- [laughs]

ADRIANA: Yeah, managers take note.

ANA: Well, I love that as our last golden nugget for today. Thank you so much, T, for joining us on today's podcast.

Don't forget to subscribe and give us a shout-out on all social media via @oncallmemaybe. Be sure to check out our show notes on oncallmemaybe.com for additional resources and to connect with us and our guests on social media. For On-Call Me Maybe, we're your hosts, Ana Margarita Medina...

ADRIANA: And Adriana Villela signing off with...

THILINA: Peace, love, and code.
