Philo-SLO-phy with Alex Hidalgo of Nobl9

On-Call Me Maybe

English - February 21, 2023 05:00 - 32 minutes - 29.7 MB - ★★★★★ - 3 ratings


About the guest:

Alex Hidalgo is the Principal Reliability Advocate at Nobl9 and the author of "Implementing Service Level Objectives." During his career, he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

Find our guest on:

Twitter LinkedIn Mastodon

Find us on:

On-Call Me Maybe Podcast Twitter On-Call Me Maybe Podcast LinkedIn Page On-Call Me Maybe Podcast Mastodon On-Call Me Maybe Podcast Instagram On-Call Me Maybe TikTok On Call Me Maybe Podcast YouTube Channel Adriana’s Twitter Adriana’s Mastodon Adriana’s LinkedIn Adriana’s Instagram Ana’s Twitter Ana's Mastodon Ana’s LinkedIn Ana's Instagram

Show Links:

Nobl9; SREcon; Chef; HugOps; Admeld; Customer Reliability Engineer (CRE); Service Level Objective (SLO); Squarespace; Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets; OpenSLO; OpenSLO Slack Community; Sloth; O'Reilly Google SRE Books; Rundeck; Break Things on Purpose Podcast - Alex Hidalgo

Additional Links:

Google SRE Book - Service Level Objectives

Transcript:

ANA: Hey, y'all. Welcome to On-Call Me Maybe, the podcast about DevOps, SRE, observability principles, on-call, and just about everything in between. I am your host, Ana Margarita Medina, with my awesome co-host...

ADRIANA: Adriana Villela.

ANA: Today, we're talking to Alex Hidalgo, who works over at Nobl9 doing all things Principal Site Reliability Engineering. We're very excited to have you join us.

ALEX: Thanks so much for having me.

ANA: So, to kick us off for today, we love asking our guests, like, what are they drinking for today's podcast episode? What's keeping you hydrated?

ALEX: Water that I put in a refilled bottle that I got on the airplane flying home from SREcon a few weeks ago.

ANA: What about you, Adriana?

ADRIANA: I've got green tea. For once, I have something different. I always have water, so yay.

ANA: That is funny because I'm usually the one with the fun drinks. And today, I'm on water because I just forgot to grab anything different. [laughter] There's that whole, like, constantly thinking about too many things and having to hop on between meetings. It's like, what do I need for my next meeting to have near me? Like, do I need my notepad? Do I need my water? And if you're moving around your office, or your house, or commuting, it just gets a lot.

ALEX: I always forget water half the time when I do these things because I have a neighbor who's actually the founder, and composer, and conductor for the Brooklyn Symphony, which is really cool. So he has this studio in the basement that he lets me use on occasion for stuff like this. And it's great because it's got pretty good sound quality.

And my dog, Taco, he's prone to bark in the background for no reason. But that also means I have to take everything apart, like my mic, and my webcam, and my USB dock, and my laptop, and throw it all in a bag and rush downstairs because I probably just had a meeting. And then half the time I get down here, I'm like, I'm about to talk for a very long time, and I forgot to even bring water. But today, we're all set.

ANA: Yay. It's like rescheduling. And then how many restarts have we had to do of computers [laughs] and browsers?

ADRIANA: We're so happy we finally made this happen today. [laughs]

ANA: It is kind of nice because it does bring to the front what we love talking about, that reliability aspect of stuff. And everyone kind of finds their passion for reliability in a very different way. For me, I kind of stumbled upon it, and it was like, oh, this is really cool. But I know that Alex has a really interesting story and how they got into their career path. So I would love to hear from you.

ALEX: So yeah, I'll try to keep the beginning part of the story short and gloss over a few things. But I was into computers from a very early age. My dad started teaching me how to program in BASIC when I was about nine years old. And then, in middle school, my friends and I taught ourselves C and eventually C++. And we decided we were going to write our own video games. We learned OpenGL and how to make 3D trains and things like that.

And I just maintained this passion for computers throughout my high school years to the point that I actually chose not to go to college. I figured I could go get a decent tech job right out of school, and turned out to be mostly correct. I ended up doing network security work for the Department of Energy. I did well. I actually got promoted within just a few months. I mean, I started doing overnight nighttime monitoring. I was watching a computer screen, watching potential alerts come in, trying to decide whether or not these were actual attacks on the network or just false positives, and most of them were just false positives.

And I did well, but after about a year there, I realized I kind of hated it. So I decided, you know what? Maybe computers aren't for me at all. Maybe they're meant to be a hobby, and I'm not meant to work with them for a living. And I realized, you know, I was still pretty young. I could still pack up everything and go to school, and so that's what I did. And I ended up studying philosophy and history. Then after that, spent my 20s working all sorts of jobs, restaurant work, front of house, back of house. I worked in a warehouse for a little bit. I sold furniture. I made most of my money as a DJ for about a year.

ANA: Wow.

ALEX: Just all sorts of random odds and ends. Then a whirlwind of circumstances landed me in New York City in early 2009. And so the 2008 recession, that downturn was still really, really in effect. And even though I now had this shiny degree in philosophy and history, right? [laughs] Very highly marketable. [laughter] I wasn't sure what I actually wanted to do. And so I started to apply for all sorts of random jobs. I thought maybe I wanted to work in publishing, for example. And my money was running out. I didn't have a lot of money. I was able to stretch it a few months.

And suddenly I was like, you know, I really need a job. And I met these two dudes who were at a bar one night, and they were crushing tallboys of PBR on their foreheads. [laughter] And so I went over to say hi, and turns out one of them needed to hire someone new at their IT firm, just kind of like a help desk-y small to medium business support kind of place. And I was like, you know, I can still do this computer stuff because it always remained a hobby for me. And so I said, "Sure, why not?"

And I went in, and we did a very brief interview. And I started just a few days later. And that was great because the day I got my first paycheck, I was using quarters and change. That's all I had left. Like, literally, I was down to change to go buy a cheap bodega sandwich and to ride the subway. I'd just made it. And doing that job, it was help desk support. But also, I got to do some Linux work and networking work because some of our clients had those kinds of setups. And no one else had this...it was a 10-person company, like, no one else knew that aspect of things.

And I realized I actually did like working with computers. The thing I didn't like was working for the government as a 19-year-old. That was the thing I actually hated way back when. And so I did that for a few years. And then, I ended up moving to a company called Admeld as what we called a technical operations engineer. This was kind of before we consolidated on titles like SRE and things like that.

And it was a really cool company to work for. They were pretty early adopters of Chef. They were pretty early adopters of the true DevOps approach to things. Suddenly, I was learning what DevOps meant, learning about HugOps. I was learning about blameless culture, which is something I had not really been familiar with in most of the jobs I'd worked [laughs] up until that point. It was great.

And then, not too long after, Admeld got acquired by Google. And so suddenly, my title went from technical operations engineer to site reliability engineer. And I was like, what does this even mean? So I spent the next few years still supporting the Admeld platform because we were making money, right? Like we couldn't just turn it off. I didn't even really have a chance to learn about true SRE principles for the first few years that I was at Google.

I got to learn a ton more about different kinds of tooling. And my coding skills went way up. And it was such a great formative time for me. But then it came time to turn Admeld off. I was the last SRE on the team. Everyone else had transferred already. I was the last one standing. And I got to run the Chef knife ssh command that logged into every single server because we were on physical servers, I think, like 1,500 or 1,800 of them, something like that. And I got to run that final command that shut them all off at the same time.

ADRIANA: Oh wow.

ALEX: Like, that was a great feeling. [laughter] But then, after that, I transferred on to some other Google SRE teams, spent about two years on each. I spent a few years on managed systems. I spent a few years on prodmon, the production monitoring team. Then I spent a few years on CRE, the Customer Reliability Engineering team. And all this time, I learned more and more about true SRE principles, especially things like SLOs and proper incident management, and things like that.

I even ended up traveling all over the world, teaching other SREs at Google how to do their jobs. Like, this is where I found my passion for writing, and education, and sharing the things that I've learned, especially on the CRE team, where a big part of the job was essentially teaching Google's largest cloud customers how to SRE, everything from purely educational efforts to actually sitting down with them and looking at the architecture, or sometimes even the codebase, and saying, like, here's how we can improve.

And that's where my true love of SLOs started. At prodmon, we actually adopted them and adopted them very well. And so, I was very familiar with what an SLI was and what an SLO was, and I got it. Like, it made some amount of sense. But on CRE, we decided that in order to properly engage with all these different companies, many of whom were not even in the quote, unquote, "tech sector" at all, major retailers, and things like that, that we needed a common vernacular.

We needed to be able to communicate with each other correctly and know what the other one was saying. And we decided that that language would be SLOs. So a huge part of any initial engagement with one of these CRE customers was, first, let's teach you what SLOs are, and then let's help you set them up. And then once we have that going, once we have service-level objectives that talk about the reliability in a holistic and meaningful way, then we can help you become more reliable because that was the whole point.

So yeah, that's when I fell in love with SLOs. And I moved on a few years later and spent some time at Squarespace, where I was still focused on SLOs quite a bit, was able to get them to adopt them, and eventually ended up accidentally writing a book about them. And that's kind of why I'm at Nobl9 today because we're a company focused entirely on helping you do SLOs better. We're a company focused entirely on providing you with the appropriate tooling so that you can do them in the best possible way because not everyone has that tooling.

And a lot of monitoring vendors they let you do SLOs in some very basic ways but not with all the bells and whistles. So yeah, I somehow ended up spending the last about six years of my career focused almost entirely on service-level objectives. That's how I ended up, well, here today.

ANA: It's such a fascinating journey just because there was that where you were already into technology. And then the whole detour to study philosophy but getting a chance to then take all these skills that you take from liberal arts, education, and humanities and just talking to people and communicating that you get to put it to work and education with your customers and writing books.

ALEX: Yeah, I mean, I actually think...I like to joke... So I majored in philosophy; I minored in history. And I was actually just like two or three classes away from a minor in creative writing as well, so English. At the end of the day, it didn't seem worth doing. I like to joke sometimes. I took the three most useless degrees: philosophy, history, and English [laughter] and combined them all into one degree. But that's not actually true. I found that the skills that I learned, not just studying philosophy, history, and English, the ability to synthesize ideas, express them clearly to others, learning how to write well, learning how to communicate well, that served me very well in my career.

I also think my time in the restaurant industry, for example, is another really good example of learning service. You are a server. You're providing a service for someone. It's not by chance that we call our computer servers servers, and that's why we call our services “services” because they're servers that provide a service for someone.

And you really learn how to think about these things holistically when that's your entire job when your job is just to provide a service for these clients coming in. And you want to serve them reliably, even if that's not always the term you're using. And even if you don't explicitly have service-level objectives, you kind of do.

If you're working in the kitchen, you want X percent of your dishes to go out without them coming back, and that's a service-level objective. [laughter] And the customer being happy with their dish is a service-level indicator. We just use different languages in different industries. But it turns out we all kind of know these things. They're all kind of inherent to the human condition, I think.

ADRIANA: That is so cool. It's funny how, even though oftentimes...like, a lot of our guests have taken the so-called crooked path into their current tech careers. I'm a huge believer that everything that you've done in your life up until this point has led you to be able to do the job that you do right now and that if you hadn't had your past experience, would you even be as good as what you are doing now? So it's so cool that having the restaurant experience and even having the philosophy degree gives you different perspectives, different appreciation of things in a way that maybe if you'd done like a CS degree, you might not have had.

ALEX: Yeah, I very much agree with that. I think about it quite a bit. And especially in the SRE space, in the DevOps space, in the TechOps, production, engineering, platform engineering, whatever we want to call it, [laughter] everything's really kind of rebranded sysadmin. And feel free to @ me about that on Twitter. [laughter] But in that space, yeah, I think it's very often the case that people don't have traditional compsci degrees.

I remember one time I was on a team where I think we may have had one new grad who had studied computer science. But everyone else on the team...we had an ex-chemistry professor. We had a psychology major. We had, you know, essentially the entire team. And it was a great team of very talented SREs, and almost none of them had any kind of formal academic computer background.

ANA: Similarly, I was in a team of four, and three of us were college dropouts. And it was just kind of like pretty rad to be like, hey, like, yeah, non-traditional path for the win.

ADRIANA: Yeah, clearly. Hey, my degree is in industrial engineering.

ALEX: Love it.

ADRIANA: [laughs]

ANA: You kind of beat me to one of the questions that I wanted to have for the podcast of defining SLOs and SLIs. I think the analogy with servers really does help folks kind of put it in perspective. But if you were to give a quick summary and how they relate to service-level agreements as well, like, what would that be for folks that are really not familiar with it?

ALEX: Sure. I'll answer that in two parts. First, how do you want to set good SLOs? Or actually, I like to use the term meaningful SLOs because they're not worth anything if they're not meaningful, if they're not telling you something. They need to be a signal that you can use. At their heart, they're just a codification of the concept of don't let great be the enemy of the good.

Don't strive for 100% because you'll never reach it. It's impossible. And you'll spend way too much trying to get there. Your humans are going to be way burnt out trying to get there, and the amount of money that you have to spend trying to get there is, like, it gets difficult. So pick a reasonable target that your users can actually absorb. And here, a user can be anything from a paying customer, to another service, depending on your service, to a team down the hall. It's anything that depends on your service, whatever your service looks like, and whatever the shape of that is.

A good SLI is one that tells you what your users are actually experiencing. It doesn't matter if your monitoring tells you everything's fine, your time series. Kubernetes is reporting that all your pods are running. Well, I don't care about that if the users of your service are not having a good time. So you need to figure out how can I measure what's actually happening from my users' point of view? And then pick a reasonable target for how often can that fail before they're going to actually be upset.
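As a rough sketch of the idea Alex describes here, an SLI can be computed as the fraction of user-facing events that were "good," and then compared against a target. The `RequestOutcome` shape, the 500 ms latency threshold, and the sample data are all assumptions made up for illustration; they do not come from the episode or from any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RequestOutcome:
    """One request as experienced by the user (hypothetical shape)."""
    succeeded: bool
    latency_ms: float


def is_good(event: RequestOutcome, latency_threshold_ms: float = 500.0) -> bool:
    # A request is "good" only if it worked AND was fast enough for the user.
    # The 500 ms threshold is an illustrative assumption.
    return event.succeeded and event.latency_ms <= latency_threshold_ms


def sli(events: list[RequestOutcome]) -> float:
    """Fraction of good events; treated as 1.0 when there was no traffic."""
    if not events:
        return 1.0
    good = sum(1 for e in events if is_good(e))
    return good / len(events)


def meets_slo(events: list[RequestOutcome], target: float) -> bool:
    """True when the measured SLI is at or above the chosen target."""
    return sli(events) >= target


events = [
    RequestOutcome(True, 120.0),
    RequestOutcome(True, 80.0),
    RequestOutcome(True, 900.0),   # succeeded, but too slow to count as good
    RequestOutcome(False, 50.0),   # fast, but failed
]
print(sli(events))              # 0.5
print(meets_slo(events, 0.99))  # False
```

Note that nothing in this sketch looks at pod status or host health: only what the user experienced counts toward the indicator.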

The second part is about how do SLOs relate to SLAs. There are two different parts to that, I think. One is that they're very different because, in my opinion, SLOs are most useful to be used as decision-making tools. They're better data to make better decisions with, to have better conversations to say, "Okay, this is what we believe our service has been performing like; is that okay? Do we need to go address that? How many people do we need to put on this? Do we have to drop everything? Or do we just assign one person for one sprint to see if they can go clean some stuff up?"

SLAs are written into the contracts, pretty necessarily by their definition. They are an agreement. An SLO is just an objective. You can change it basically whenever you want, as far as I'm concerned. Like, as long as your users aren't upset with your new SLI or your new SLO, then cool, do it, change it. SLAs you generally can't change because, again, they're generally written into contracts. And when you violate them, it's not a decision-making tool. It generally means that you owe someone money, or you owe them something, credits, you owe them, you know.

So while they're similar in the sense that you are trying to measure the performance of your service, and they both use target percentages...and these target percentages often have many nines in them, like the most common number 99.9% availability, and you're like, whatever. They're very different tools in my mind. But if you do have SLAs, if you are working in a position where you're responsible for services that inform an SLA that may cause your company to owe someone something if you violate it, you can still use SLOs very effectively because you just set them with a more strict tolerance than your SLA.

So let's say your SLA is set at, again, we'll say, 99.9% availability of some API of some SaaS product. Then you can set an SLO, let's say, like 99.95 or even 99.98. And this gives you a signal. If you are violating your SLO, you better take action because you're going to be violating your SLA soon.
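To make the arithmetic behind that example concrete, here is a small sketch that converts an availability target into the minutes of full downtime it tolerates. The 30-day window and the function name are assumptions chosen for illustration; real SLAs and SLOs may use different windows or count requests rather than minutes.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of complete unavailability a target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


# The SLA and the two stricter SLO targets mentioned in the conversation.
for label, target in [("SLA", 99.9), ("SLO", 99.95), ("SLO", 99.98)]:
    minutes = allowed_downtime_minutes(target)
    print(f"{label} {target}%: {minutes:.2f} min per 30 days")
```

Under these assumptions the 99.9% SLA tolerates about 43.2 minutes per 30 days, while the 99.95% and 99.98% SLOs tolerate about 21.6 and 8.64 minutes. The stricter SLO runs out well before the SLA does, which is what makes it usable as an early warning.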

ADRIANA: Right. Right.

ANA: It's like ringing that alarm like, oh no, the storm is coming, like, paying customer number four is about to hit up your CEO in their personal email and come to their door like, oh no. [laughter]

ADRIANA: Yeah, I really like that concept as basically like the SLO informs, or it's like your canary in the coal mine for whether or not you're going to be violating your SLA.

ALEX: Yeah. Yeah, yeah, yeah, a bit.

ADRIANA: That's very cool. One thing that I was wondering about, because you mentioned you work at Nobl9: my understanding is Nobl9 has been pushing something called OpenSLO. Can you talk a little bit more about that? Because I'm actually really curious about OpenSLO.

ALEX: So, although we started it, it really is an open-source community project with many contributors from tons of different companies. I encourage everyone to go check it out. And it's basically a specification language to ensure that people can define their SLOs, and their SLIs, and their error budget windows, everything that comes along with service-level objectives, so that we have a common way of doing that across the entire industry.

And so it's essentially mostly a YAML spec, but it's constantly being iterated on and improved to allow for more and more functionality and different alert methods like defining what fast burn means to you. Like, how quickly are you burning through your budget? And the goal is to be vendor agnostic and, in the worst-case scenario, allow you to more easily adopt either multiple tools or move from one vendor to the next.
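One common way to put a number on "how quickly are you burning through your budget" is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is only an illustration of that arithmetic, not OpenSLO syntax or any vendor's implementation; the function names, the 30-day window, and the fast-burn threshold of 5.0 are all assumptions.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return observed_error_rate / budget


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until a full, untouched budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical threshold: what "fast burn" means is for each team to define.
FAST_BURN_THRESHOLD = 5.0

rate = burn_rate(observed_error_rate=0.01, slo_target=0.999)
print(round(rate, 2))                        # 10.0
print(round(days_until_exhausted(rate), 2))  # 3.0
print(rate >= FAST_BURN_THRESHOLD)           # True
```

A burn rate of 1.0 means the budget lasts exactly the window; here a 1% error rate against a 99.9% target would use up a fresh 30-day budget in roughly three days.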

It's still kind of like a work in progress, but it works for some systems already. For example, Nobl9 does not yet ingest OpenSLO YAML directly, but we have a tool that will convert your OpenSLO YAML to Nobl9 YAML that you can then apply to Nobl9. Or there's a project called Sloth, which is an SLO framework for Prometheus. So you can use your Prometheus data to calculate error budgets, and Sloth speaks OpenSLO natively.

I know that there are a few other projects like that. I don't know which ones I can talk about, and I don't want to mess up which ones are currently actually working on it and which ones are just expressed interest. But I would say just go check out OpenSLO. There's a Slack you can join if you're interested; all the maintainers, contributors, and even a ton of users hanging out there. It's a very friendly bunch. Just go check it out and join the Slack.

ADRIANA: Awesome, awesome. Yeah, we'll make sure to include that in our show notes as well.

ANA: Yeah, initiatives like that where we're getting a chance to bring together vendor-neutral standards is always kind of nice where it's like, let's just be able to speak the same language and educate one another and try to have meaningful conversations around topics that really do matter.

ALEX: It was definitely at least partially inspired by OpenTelemetry. The huge success we've seen there, for example, we're like, we should be leading the way but working with the community, and figuring out how we can do this for service-level objectives.

ANA: Much needed. Like, I know when I started learning about SLOs and SLIs, I was kind of like a little over my head. And then there weren't vendors applying it. It was just kind of like a concept within SRE principles; this is what you strive for. And, like, these are all the numbers and the dashboards, and management cares about this. [laughs] Don't let anything break.

ALEX: There are still so many people who come to us who sometimes even literally point at the SLO chapter in the first book and two of the SLO chapters in the second Google book. Yeah, we're talking about the O'Reilly Google SRE books here. And they're like, "I want to do that." [laughter] Like, I helped write one of those chapters. It's mostly philosophical. You don't have to do it just like that. You don't have to do it [laughter] exactly like that.

But that's the problem is most people learning about it get these chapters that are, I think, very convincing. I think they're great chapters, but they make it difficult to actually go and do, and that's part of what we're trying to solve there for sure, both with OpenSLO, so people have a better idea of how to just define these things in the first place as well as that's Nobl9's whole goal. We want you to be able to easily do SLOs without having to build a whole bunch of tooling yourself.

ADRIANA: Yeah, yeah. I think that's such a great idea. And it makes so much sense also to codify your SLOs because, I mean, one of the things about SRE is codifying your infrastructure. And SLOs are part of that, part of your reliability story. So why not make that something that you codify, right?

ALEX: Yeah. And it's YAML again.

ANA: [laughs]

ALEX: But that's what everything else is using anyway [laughs] and so...

ADRIANA: I don't hate YAML. [laughs]

ALEX: I actually don't, either.

ADRIANA: Yay. [laughs]

ANA: It's not the worst thing ever.

ADRIANA: There are some people who hate YAML so much, but I don't mind it.

ANA: It does get annoying when things are breaking. And it's literally just like one or two spaces or stuff like that. [laughter] But --

ALEX: Oh, I have a great YAML story. When I was first learning YAML, period, right? It was because I was trying to set up Rundeck. I was trying to set up a Rundeck instance, and Rundeck configs are all in YAML. And so I was trying to set up a few different jobs, like, a few different workflows. And so, I started with copying the example straight out of the Rundeck docs. And, okay, I put that into /etc/rundeck.

I start editing it, and no matter what I do, only the last workflow is being applied, over and over and over again. And I'm like, what am I doing wrong here? I leaned over to my coworker. He was always there to fix my dumb mistakes.

ANA: [laughs]

ALEX: And I'd spent a day on this already. And he was like, "Well, you have them all in the same file. You don't have any separators." I'm like, "What do you mean?" He's like, "You see in the example, you see how there's three dashes in between the different workflows?" [laughter] And I'm like, "I thought that was decorative." [laughter]

ADRIANA: Oh my God, that's awesome.

ANA: I mean, I feel like that's the mindset when you've been a sysadmin that you kind of like put ASCII code on stuff.

ADRIANA: [laughs]

ANA: You're just like, duh, lines. You're making it pretty.

ALEX: [laughs]

ANA: This is just fricking, like --

ADRIANA: I love that. [laughs]

ANA: Two weeks ago, with a Kubernetes release team, I had one of those PRs that I was like, I couldn't get right the indentation by like two. And it was just like a line break is all I needed.

ADRIANA: Oh no.

ANA: And I was just going back and forth for like 36 hours on a PR. And I was like, I swear I have a job, and I know what I do. [laughter] And I'm confident. [laughs]

ADRIANA: Don't you hate those moments where you're like, oh my God, what happened to my brain? [laughs]

ALEX: I had something that...I told the whole story on the Breaking Things Podcast, the Gremlin one, so if you want to hear the whole story, you'll find it. But I once spent a week or maybe even two due to an incorrect end-of-line character, so go listen to that story. [laughs]

ADRIANA: Ooooh. Oh, man.

ANA: I've had those. [laughs]

ADRIANA: I've had those too. Those are the worst.

ANA: Oh, we're pushing the industry forward by, like, oh, we're sharing our stories, and failure is going to happen.

ADRIANA: That's right.

ANA: And like part of it is that human side where, as humans, we also make mistakes like that. And it's like, it's okay to have [laughs] spent 48 hours on just one end of file character that really destroys your entire code.

ADRIANA: But you'll never forget. I'm sure now you're like, yeah, I am always going to be extra aware of end-of-line characters.

ALEX: But, yeah, failure happens. And it turns out humans are okay with that as long as things don't fail too often.

ADRIANA: [laughs]

ALEX: That's SLOs in a nutshell.

ANA: The best, like, Alex Hidalgo-approved summary of it. [laughter]

ADRIANA: That's this podcast episode in a nutshell. [laughter]

ADRIANA: I wanted to touch back upon the experience that you had on the CRE team at Google and some of the work that you do now with Nobl9 customers. What are some of the examples that you've seen of bad SLOs, bad SLIs, maybe back then? And what are the ones now? How are they comparing?

ALEX: I'm honestly seeing a lot of the same mistakes in both instances. Most, not all, but most of the people that we were either working with on CRE or most of the customers at Nobl9 are new to the process. They are learning how to configure and how to use SLOs for the first time. And the biggest gaffes I generally see are, one, conflating availability with reliability; they are very different things. And two, trying to programmatically generate too many SLOs for too many different endpoints and too many different services. Too much data, and you end up with a multiple comparisons problem.
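One way to read the multiple comparisons point: if every SLO has some small chance of looking bad purely by noise, then the more SLOs you generate, the more likely it is that at least one of them is raising a false alarm at any given time. The sketch below assumes the SLOs are independent and that each has the same 1% false-alarm chance; both are simplifying assumptions for illustration, not figures from the episode.

```python
def prob_any_false_alarm(num_slos: int, per_slo_false_alarm: float) -> float:
    """Chance that at least one of num_slos independent SLOs fires spuriously."""
    return 1.0 - (1.0 - per_slo_false_alarm) ** num_slos


# Assumed 1% chance per SLO; watch how fast the combined chance grows.
for n in (1, 10, 100, 500):
    print(n, round(prob_any_false_alarm(n, 0.01), 3))
```

Under these assumptions, 100 auto-generated SLOs already give roughly a 63% chance that something is alerting for no real reason, which fits the argument for a smaller number of deliberately chosen SLOs.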

The best SLOs are, at some level, artisanally formed. At some level, they need to be meaningful. They need to be created for a purpose because they have to give you a good signal. And I've seen people in the industry...I've also seen it just with friends at other companies. People I know from Twitter telling me stories of "Yeah, we tried to do SLOs, and we ended up dropping it as a company after a year because we didn't see the value."

And most of the time, when those companies don't see the value, it's because they're just slapping 99.99% availability on all of their APIs. What does that mean? What does that tell you? One, that target is probably too high; very few services, if any, can actually hit four nines. But beyond that, just slapping the same number on everything without understanding how does this service actually operate? How does it need to operate? This is exactly the kind of thing that leads people to not seeing the value and then eventually just dropping the whole thing, which is a real shame because when you do them right, SLOs are absolutely game-changing, absolutely life-changing.

ADRIANA: So if you were working with someone who's getting started with SLOs, what would you give them as far as advice or easing into it so that they can start seeing the value of them right away?

ALEX: I think the best advice I can give is just understand that they're not a thing you do. They are a different way of thinking about the reliability of your service and making decisions about the reliability of your service. They're not like an OKR that you check off and be like, cool, we have SLOs. Now we're done.

I generally refer to them as SLO-based approaches, SLO-based approaches to reliability because that's what it really is. It is a different system. It's a different way of thinking. It's a different way of gathering data. It's a different way of thinking about data. It's a thing you do for all time. You don't just set them up and walk away. You iterate on them constantly because something about the world is going to change. Your dependencies are going to change. Your codebase itself is going to change. The requirements of your users are going to change. We could go on and on.

So that means your SLOs are necessarily going to have to change with them. They can't just remain the same for all time. So yeah, that's the best advice I would give. Understand this is not a one-time project. It's not something that if your CTO comes to you and says, "We're doing SLOs now," that you just do them and then report back okay, we're doing SLOs.

ADRIANA: [laughs]

ALEX: It's a different approach. It's a different way of thinking.

ADRIANA: Yeah, that makes sense.

ANA: I had two questions that I feel would be great to ask you about. How often should folks be reviewing service-level objectives?

ALEX: The general advice I give is a lot at first and then less often later.

ANA: [laughs]

ALEX: When you first set them, you may have just picked the absolute wrong values, right? [laughter] You may not have understood what your users actually need from you. You may not have understood the shape of your data. You may have thought you did, but maybe your metrics and your telemetry looks totally different than what you expected once you start measuring it in this way.

So, at first, look at them quite often. Later, I don't know, once a quarter, like, something that helps you think about them more. However, I will say, with operational SLOs, you probably want to look at their status pretty frequently. Make that part of your weekly team sync. Make that part of your monthly opex meeting. Make that part of...whatever kind of cadence you have. You do want to look at their statuses quite often. Even if they're not alerting you, you still want to make sure that they're not acting anomalously.

You want to make sure you're not just running at 100 because that means either you're measuring wrong or maybe you should be setting a different target. So, yeah, you review them in that sense. I think it really needs to be on a cadence, whether that's monthly, weekly, quarterly. But you do also need to be looking at your SLO definitions, which I think is what you were kind of getting at. And that part is often at first, far less frequently later.

I think in the SRE book, we've got something like, you know, once a month at first, once a quarter after that, and eventually, you can get to once a year if you have no signals telling you that these definitions are wrong. But I take a slightly more nuanced view which is, you know, everyone's favorite phrase: it depends. It depends on what your numbers look like. It depends on what your tooling looks like. It depends on what user journeys these SLOs are even matching.

Are these purely operational SLOs for a single SRE team with five people? Well, then you can probably review them quite often because you probably have a weekly sync, or an operational sync, or a handoff meeting, or wherever it can be appropriately slotted in. But if it's like a cross-org, multiteam, multicomponent SLO that measures like, can a user log in, add an item to the shopping cart, and check out with it? That you just can't review as often, and that's fine. So I don't think there should be any hard and fast rules besides that: often at first, and then if you feel comfortable, less often.
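The status review Alex describes, checking SLOs even when nothing is alerting and being suspicious of one that sits at 100, could be sketched roughly as follows. Everything here is hypothetical: the `SloStatus` fields, the `review_flags` function, and the flag wording are illustrative assumptions, not the API of any SLO product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SloStatus:
    """Hypothetical snapshot of one SLO over its measurement window."""
    name: str
    target_percent: float
    good_events: int
    total_events: int


def review_flags(slo: SloStatus) -> List[str]:
    """Return reasons this SLO deserves a closer look at the next team sync."""
    if slo.total_events == 0:
        return ["no data"]
    flags = []
    attained = 100 * slo.good_events / slo.total_events
    if slo.good_events == slo.total_events:
        # Running at 100: either the measurement is wrong or the target could change.
        flags.append("running at 100")
    elif attained < slo.target_percent:
        flags.append("below target")
    return flags


if __name__ == "__main__":
    for slo in (
        SloStatus("login", 99.9, 1000, 1000),
        SloStatus("checkout", 99.9, 990, 1000),
        SloStatus("search", 99.0, 995, 1000),
    ):
        print(slo.name, review_flags(slo))
```

An empty list means nothing stood out in this sketch; it does not replace the periodic review of the SLO definitions themselves.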

ADRIANA: I guess the moral of the story is, like, don't fall in love with your first SLO, right?

ANA: [laughs]

ADRIANA: Because it's bound to change. And I guess it's also okay to make mistakes. As you said earlier, you might come up with a completely wrong SLO. And so, having that review process for your SLO becomes really important. Because I guess the other thing, too, is you might have had a set of SLOs that were working really well, and then you have a major release to your application. And then [laughs] your SLOs aren't cutting it anymore. So I guess that is also a signal to review your SLOs once again because they're not doing their job, right?

ANA: Yeah. I love that you said just don't fall in love with your SLOs firsthand. And I think even just in general, like, overall, they're constantly going to be changing.

ADRIANA: Thank you so much, Alex, for joining us on today's podcast.

Don't forget to subscribe and give us a shout-out on social media. Be sure to check out the show notes on oncallmemaybe.com for additional resources and to connect with us and with our guests on social media. For On-Call Me Maybe, we're your hosts Adriana Villela and...

ANA: Ana Margarita Medina. Signing off with...

ALEX: Peace, love, and code.
