About our guest:

Liz Fong-Jones is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 17+ years of experience. She is the Principal Developer Advocate at Honeycomb for the SRE and Observability communities and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

She lives in Vancouver, BC, and in Sydney, NSW, with her wife Elly, partners, and a Samoyed/Golden Retriever mix. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

Find our guest on:

Liz’s Twitter
Liz’s LinkedIn
Liz’s website

Find us on:

On Call Me Maybe Podcast Twitter
On Call Me Maybe Podcast LinkedIn Page
Adriana’s Twitter
Adriana’s LinkedIn
Adriana’s Instagram
Ana’s Twitter
Ana’s LinkedIn
Ana’s Instagram

Show Links:

Honeycomb.io
Observability Engineering (free e-book download - limited time only)
Observability Engineering (dead tree edition)
Cloud-Native Observability with OpenTelemetry
Observability Engineering book signing party - San Francisco
Continuous Profiling
OPEX (operating expense)
Jeli.io
Nora Jones (Jeli)
Unknown Unknowns
DR (Disaster Recovery)
Multi-Cloud
Google SRE Book (aka “SRE Bible”)
Emily Freeman
Justin Garrison
Twitter Space - DevOps vs SRE with Emily and Justin
SLOs (Service Level Objectives)
Implementing Service Level Objectives - Alex Hidalgo
Google Site Reliability Workbook
Google Seeking SRE Book
Amy Tobey’s SRE model
SREcon

Additional Links:

Trans Lifeline
Adriana on O11ycast
Ana Margarita on O11ycast

Transcript:

ADRIANA: Hey, y'all, welcome to On-Call Me Maybe, the podcast about DevOps, SRE, observability principles, on-call, and everything in between. Today we are talking to Liz Fong-Jones, who is the Principal Developer Advocate at Honeycomb, and she is also on the OTel Governance Committee. Welcome.

LIZ: Thank you for having me on.

ADRIANA: So first things first, we like to ask all of our guests, what are you drinking?

LIZ: I am drinking a homemade mocha, just a French-pressed coffee and some hot chocolate powder dunked in it. I used to live closer to a coffee shop where I could get real espresso, but now I live a little bit more into the boonies. So now I have to add powdered hot chocolate to my coffee to make a mocha.

ADRIANA: [laughs] Awesome. Hey, whatever works. How about you, Ana, what do you have?

ANA: I'm just doing classic cold brew with oat milk. And I added vanilla flavor to it to give myself a little different taste this Monday. But that's about it. Pretty simple. What about you, Adriana?

ADRIANA: I have a bubble tea today. I'm a huge bubble tea junkie. [laughs]

LIZ: Oh, right, because it's noon over there. Unfortunately, bubble tea places...I would love to have bubble tea in the morning. If someone opened a bubble tea shop that was open at 8:00 a.m., they would have a monopoly on the market.

ADRIANA: Yes. 

LIZ: Because I don't want to wait until noon to have my bubble tea. But all the bubble tea shops open at noon.

ADRIANA: It's so true, and it's so sad. Sometimes I actually will pre-buy my bubble tea as long as it doesn't have tapioca because I think after about an hour, the tapioca goes really nasty. But I'll get it...if you get it with coconut jelly or basil seeds, it'll survive the night, so then I can have it handy for the next day. [laughs]

LIZ: Business ideas brought to you by On-Call Me Maybe.

ADRIANA: That's right. That's right. So, Liz, you just came out with a book. So, why don't you tell us a little bit about that?

LIZ: Yeah. So my colleagues Charity Majors and George Miranda and I have just published Observability Engineering. It came out in May, and print copies have been available since June. And the book is about the why of observability. We do go into some of the how, but I think we're kind of orienting people around: How is observability different from monitoring? When should you use observability? How do you introduce it into your organization? What workflows does it enable?

There is one chapter in it about OpenTelemetry. But we actually had your colleague at Lightstep, Alex Boten, publish an entire book-length volume, Cloud-Native Observability with OpenTelemetry. So the two books pair up nicely with each other because Alex's book is more about the how and our book is more about the why. And actually, it turns out that people buy them together. Who knew? If you go and look at the Amazon listings, people just buy them as a bundled set, and I think that's really cool.

ADRIANA: That's awesome.

ANA: And if folks want to take a read for free, as I think I saw on Twitter, there's a way.

LIZ: That's exactly correct. Limited time only, you can go to a page on info.honeycomb.io (we'll put it in the show notes) to get a free copy of the book in PDF format. And there is no paywall or registration wall or anything. We don't ask for an email address. But if you happen to want a dead tree copy of it after reading the PDF version, certainly, our editors at O'Reilly would appreciate you tossing them some money.

ADRIANA: And I remember talking to Charity recently, and she said that if you order a print copy and let her know about it, she'll send you some stickers to bling up your book cover.

LIZ: Yeah, that's exactly correct. So I'm handling distribution in Canada and Australia, and she's handling U.S. distribution. But yeah, you can put a sparkle unicorn tail on the maned wolf that's on the cover. I asked for booties, like, ruby slipper booties for the wolf [laughter], in case you want to stay a little bit historically accurate to what a maned wolf is and hypothesize that you could put booties on it.

ADRIANA: I love it.

ANA: That is amazing.

ADRIANA: It's almost like a personalized book signing. Sending stickers shows that you've got the signature from the authors, almost, which I think is such a cool idea.

LIZ: Yeah, we also do book signings, though. There is a book release party in San Francisco in the middle of September if you're listening to this before the middle of September. So if you stop by the book signing party, we'd be happy to give you stickers in person as well as to actually autograph your book with a hand-written personalized autograph.

ANA: Oh, I actually might need to get myself to that party, is what it sounds like. [laughs]

ADRIANA: Yeah, it's close to home for you. [laughs]

ANA: Yeah, I'm still in the Bay Area, so definitely might be doing that.

ADRIANA: That's awesome. It's interesting: with regard to the book, Observability Engineering, I think it's still so important to keep up the conversation on what observability is. Because I don't think enough people understand it fully. So I think it's good to keep hammering the point home.

LIZ: Yeah, exactly. One of the things that we frequently wind up saying, and that I think everyone on this podcast is in agreement with, right? It's about the outcomes that you achieve. It's not about the various signal types. It's about how do we actually debug unknown behavior in our systems? And it doesn't really matter what combination of signals you use to achieve that.

For instance, one thing that I like to emphasize is that I spent over a decade at Google. And in my time there, at least for the first seven or eight years, we didn't really use traces very much. But we had very sophisticated metrics slicing systems that would enable you to join multiple different metrics together and to do some of the higher-cardinality things with metrics that you might not be able to do in a more primitive system. So I would argue that we had observability at Google without collecting all of the various signal types.

ANA: That's really cool work. And I also wanted to say congrats on the book. Now that that's finished, what are you most excited to be working on right now?

LIZ: So it's definitely been really nice to get back to practitioner stuff and prototyping rather than writing. I tend to oscillate between prototyping and writing about it. But the book was kind of this three-year-long project, not necessarily all three years spent writing continuously. But towards the end, it was a little bit of a slog. I wound up not getting to do very much hands-on engineering. 

So what I'm up to right now is really, really looking at what continuous profiling can do for us. How can it benefit us? And I'm not going to call it a fourth pillar because I don't believe in pillars of observability. [laughter] But I do think that there are some interesting applications of continuous profiling that are not the ones the people who came up with it had originally envisioned.

Specifically, people talk about continuous profiling in the context of, you know, oh, we're here to save you 5% or 10% or whatever off of your data center bill. To me, that's missing the point. Yes, OPEX is a concern, but I care about optimizing user-visible response times. That's why we do tracing.

At a certain point, I know that tracing breaks down. You're not going to put a trace span around every single function call. But with profiling, where you can actually capture every function call that takes longer than ten milliseconds or 100 milliseconds, you can start to fill in some of the gaps that you wouldn't be able to cover with tracing.
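
As a rough sketch of the thresholding idea Liz describes: record only the calls that exceed some latency cutoff, filling the gap between hand-placed trace spans and per-call instrumentation. The function names and the 10ms cutoff here are hypothetical, and a real continuous profiler samples call stacks rather than wrapping individual calls; this only illustrates the concept.

```go
package main

import (
	"log"
	"time"
)

// slowCallThreshold is a hypothetical cutoff; Liz mentions ten
// milliseconds or 100 milliseconds as example values.
const slowCallThreshold = 10 * time.Millisecond

// timed wraps a function and records it only when it runs longer than
// the threshold, so fast, hot-path calls cost nothing to observe.
func timed(name string, fn func()) {
	start := time.Now()
	fn()
	if elapsed := time.Since(start); elapsed > slowCallThreshold {
		log.Printf("slow call: %s took %s", name, elapsed)
	}
}

func main() {
	timed("loadConfig", func() { time.Sleep(2 * time.Millisecond) })    // fast: not recorded
	timed("parsePayload", func() { time.Sleep(25 * time.Millisecond) }) // slow: recorded
}
```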

ANA: That's actually really cool. I have zero background in the profiling stuff, but it definitely makes a lot of sense when we're thinking about the scale at which we run our applications, and the users that we have tuning in, and the stuff that's still missing.

LIZ: It's so powerful, and yet it's something that, at the moment, is for power users only. And I think that there are ways that every software developer can benefit from it; they just don't know it yet. So that's what I'm spending my time exploring. 

One of the blog posts I'm working on, which might actually be published by the time you hear this, is about how we sped up the ingestion processes in Honeycomb by 50%, which then gave us 10% more capacity to run queries in Honeycomb. That does not directly translate to 10% less latency, but it does translate to Honeycomb users' queries in general running faster. So that's something that I found with profiling that I would not have been able to find any other way.

ANA: It's actually really cool to hear about some of that work because I know that part of the IC work that you get to do is also helping improve the SRE practices of Honeycomb. I know that you were recently on stage at AWS talking about some of the improvements you made with the migrations you were doing; I think it was Graviton.

LIZ: Yeah, exactly. So part of what's fun about being a principal-level engineer for me is getting to go on wild goose chases. And sometimes, you come back with a domesticated, tamed goose. I think one of those wild goose chases started two or two and a half years ago, when I got us to start trying out the Arm-based AWS Graviton processors. And that led to essentially a 50% reduction in Honeycomb's OPEX, or at least in our compute OPEX. Obviously, paying people salaries is something that doesn't get affected by your processor architecture.

ANA: [chuckles]

LIZ: So I get to explore some of those things without taking cycles away from the product engineering teams, and then, when I find something worthwhile, bring it back into the org and figure out how we leverage it as best as possible. And funnily enough, to wrap this back around to OTel, OTel was actually one of those wild goose explorations. It was a, you know, hey, the Honeycomb Beelines are working more or less well enough, but there's this OTel thing that's new that's combining OpenCensus and OpenTracing. Should we maybe develop an exporter?

ANA: Super congrats on being able to chase some wild geese, because we know that in general, the job in DevRel and the work that we do involves a lot of context switching. So it's great to be able to get enough hours in a cycle to say, oh no, I'm actually going to be dedicated to this topic and do some deep dives.

LIZ: Yep, do the deep dives and then write about them. That's kind of the other piece of it: if you're not communicating about what you're doing, it's a little bit challenging to show the impact that you're having.

ADRIANA: It's so true. That's one of the things I actually like about DevRel, being able to prototype. One of my favorite things is to take a problem that looks interesting, try to solve it, and then write about it. And, I don't know, I find it so, so satisfying.

ANA: I agree. I definitely wish I did more of it currently. So I'll hopefully find more time to carve out for that.

LIZ: Right. And given that we work with the SRE and DevOps communities, one of the most impactful ways that I found is to sit in on on-call rotations. You don't necessarily have to even be on call, on call, but at least shadow people and see what's going on and see what patterns you can generalize out, what incident response practices that you think the world should know about.

ANA: You actually got me right to one of the questions that I wanted to ask: what is on-call at Honeycomb like?

LIZ: Yeah, it's something that is constantly evolving because we have grown from having...when I started at Honeycomb, we had ten engineers, and now we have about 40 engineers. And therefore, we have had to split on-call rotations. So now there are actually three on-call rotations. There is the platform engineering on-call, and there's the product engineering on-call. And then there's integrations on duty. So that's kind of our division of responsibilities.

So anything to do with client SDKs, anything that runs on the client's premises: that is integrations on call. Anything to do with the product UI is product on call, and anything to do with data ingest and querying goes to the platform on call. So that's loosely how it's divided for now. But we're already thinking about what the next split might look like.

So it's definitely interesting to be hiring people and then figuring out how to onboard them onto the existing rotations and also figuring out how much scope is the right amount of scope for one person to keep in their head. Certainly, it used to be the case that we would just have one on-call rotation. And that person had to be responsible for any integration question, and any JavaScript question, and any platform question. 

And we used to try to pair up to make sure there was usually a front-end-y and a back-end-y person on call as primary or secondary at any time, but now that's actually formal. But yeah, the general philosophy is if you ship code at Honeycomb, you are responsible at least part of the time for the consequences of that. And that means that you both watch it as it goes out, even if you're not the person officially on call, and take a turn being the person on call for your particular area of responsibility, helping other people shepherd their changes out.

ADRIANA: And how has that gone? Do folks at Honeycomb respond well to that? Because I know that that's something that you and Charity are always talking about on Twitter. And I think it's such a great idea. I mean, I think it makes people more responsible for their code rather than, oh, I'm done with it. It's someone else's problem now. What has the reaction been?

LIZ: I think people really appreciate the freedom to be able to ship changes on Friday morning and Friday afternoon. And the main rule that we have around here is don't push and run. As you said, don't just walk out the door when your code lands in the main branch. Your job is to make sure that it stays there, which means watching it for at least an hour or two. So if you don't have an hour or two left in your day, maybe don't land that in main. 

So it's certainly a lot easier for us to keep things clean as an org and make sure that everyone is on board with it as they join the organization and is following these best practices. It's a little bit harder to extrapolate that to another org, to say, "Hey, by the way, suddenly you're on call."

I don't think that goes over well with people if you suddenly change their job responsibilities from having no on-call to being flooded with pages at 2:00 a.m. I think that some acceptance of production ownership is a good idea, but you have to do it gradually if you're starting from something that is not a well-honed on-call rotation with low noise.

ANA: I mean, it's definitely part of the culture of the company. And I think y'all do voluntary on-call too. It's not necessarily mandatory, or?

LIZ: It is mandatory for everyone to participate in production in some way. So as I said earlier, integrations on-call is not expected to be reachable out of hours. So if you're a member of the integrations engineering team and you're on duty, you can be interrupted during your working hours. But typically, unless there is some kind of sev-zero security thing that requires doing an emergency release, you're not going to get interrupted out of hours.

But definitely, anyone who joins the platform engineering team is well aware that on-call is a mandatory part of the platform responsibility. But again, the contract goes both ways. You are expected to be on call, but we are not going to page you day and night when you're on call. Maybe one out of every two or three of your shifts will have a weekend or a weeknight page, right? It's not too onerous. You can still go and hang out with your family. You can go cook dinner. It's not like you're going to be working the entire time that you're on call.

ADRIANA: I would imagine then on-call PTSD is not really a thing at Honeycomb from the sounds of it.

LIZ: We do have periods where things get bad, but if that happens, we adjust. That's the important thing: having that feedback loop. I'm not going to say that we don't have bad on-call weeks. We have had some periods of time where weeks were heavier, and we've had to say, "Okay, we're going to pause non-essential infrastructure changes. Stability of the system is more important than optimizing something away."

ADRIANA: That is awesome. I think that's what so many people would love to see in their own organizations. So it's really cool that you guys do that.

ANA: Do you have any tips for folks that are looking at changing the way their teams are split? I know you mentioned Honeycomb has those three rotations, and you're looking at revamping those. So for folks listening in who are thinking, we know that on-call is not really working for our org, what are the questions they should be asking in order to restructure it into something that makes sense for them and is also healthier?

LIZ: I think once you have multiple on-call teams, you at least have developed a process for figuring out what goes to who, and changing the responsibility is easier. But I think that when you are in a situation where there is one on-call rotation for everything, and there are maybe two or three most senior people who know how to debug things, that's a situation that's harder to get out of. 

And I think the number one thing to do there is to introduce observability. If you make it so it doesn't require flashes of insight and keeping the whole system in your head at once, the tools can assist you in piecing things together, testing hypotheses, and bisecting the problem until you know where it is localized, whether to a part of the system, to a particular subset of users, or both. That kind of methodical debugging approach makes it possible for anyone to debug issues, not just the person who's most experienced or has written the dashboards before.

It does mean that sometimes you will miss the obvious 30-second solution to something, but it means that you'll be able to debug everything in less than half an hour, rather than, you know, either it takes 30 seconds or it takes 30 hours. That's not good.

ADRIANA: Yeah. And that's so important, too, because I came from an organization where one of the challenges was that it was always the same handful of people who got called whenever there was an incident, and it was very taxing on them. They were super fried. That's why we were really pushing to introduce proper observability practices there, because it's not a sustainable lifestyle. And being able to empower all the developers and all the folks that are on call to troubleshoot a solution is magical, right?

LIZ: Right, exactly. It's not just what formally is the state of affairs in your on-call rotation; it's what informally happens. Like, is your most senior engineer always on call whether they like it or not, even if their name isn't in the pager rotation?

ADRIANA: Exactly, exactly.

ANA: I think it's one of those things where I've been seeing a lot more folks bring in external people to do a little bit of consulting in terms of, how does this organization do on-call? Or, actually, can we talk about incident 561 and get a little bit more detail?

And that's when you uncover it, when you're able to realize, oh, everyone is actually always reaching out to Jamie. And Jamie feels like she's really burnt out because of that. The manager might not be aware of it when it's stuff that happens here and there. But it's something that's taken a toll on the entire organization, and no one is aware.

LIZ: Yeah, there's this hidden surface area: what are the assumptions that people are making that they haven't necessarily overtly discussed? What are the backchannel communications? This is actually why I'm a huge fan of Jeli, because they're really focused on this kind of incident understanding workflow of trying to figure out, how do we unpack the assumptions that people made? Where are the insights coming from? What are people finding useful, and what should we generalize?

ANA: I 100% agree with you there. I'm excited to see all the work that Jeli and Nora are getting to do in that space of, we've already spent the money; can we really be learning from it and making sure that we're actually incorporating it?

LIZ: Exactly, an unplanned learning opportunity. Just because you didn't schedule it doesn't mean that there's nothing you can learn from it. [laughs]

ANA: I believe Nora is calling them learning moments, which actually makes perfect sense: you can have them as planned activities, and you can have them as something that already happened. But the way that you approach the opportunity has to be very similar. You have to come in with a curious mindset. And you come in with that empathy, asking questions, not necessarily putting blame.

LIZ: Right. And I loved the thing that was said earlier about having someone who was not involved in the incident run the retrospective rather than the person who happened to be incident commander.

ANA: Yeah. I'm actually really curious to see what else we can be doing in that space, like how tools can actually be helping that process, and Jeli is getting to do it, I'm assuming. I haven't gotten a chance to see the product itself, but I've seen some of the work around it. 

But for companies that are not able to bring someone else in, how can a tool meet that middle ground: asking questions in a very different way that breaks the process you have ingrained into your muscle memory, where you go about your incident response the same way every single time, and you have action items that take you nowhere, and you have similar incidents two months later?

LIZ: Yeah, it's so frustrating when people are like, we have to have the root cause, or we have to write down at least two action items that we promise and pinky swear we're going to get to...It's like, no, not everything needs to result in action items. Sometimes the best thing to do is to watch and observe for future patterns, or to consider doing things a little bit differently the next time, seeing how it goes, and treating it as an experiment.

ADRIANA: That is such a great point to make because I think a lot of times, especially when it comes to upper management, they want everything wrapped up in a nice, little bow. And it's a lot messier in real life. And it's not a series of checkboxes that you can tick off and say, "Okay, I'm done."

ANA: You can't just point the finger at SQL or at a team member and be like, "It was them." It's like, oh no, actually, we need to talk about the fragility of our entire system, and that the backups didn't get a chance to get engaged, and our documentation was completely out of date. So no one even had a chance to take a possible action, or maybe we didn't have the right alert for it.

LIZ: There are situations, though, where I think that as much as I like to talk about unknown unknowns, if you're not addressing your known unknowns, you have no hope of leaving enough room for your unknown unknowns. So if you have known points of system fragility, like if you have a single MySQL database that is backing everything, maybe you should make sure that that MySQL database has a very fast recovery time objective like having a hot standby or something. 

Because we all know the MySQL database is going to break at some point in the next year. We don't know when, we don't know how, but it's going to break. So do you have a hot standby, or is it going to take you three hours to recover?
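
To make that concrete, one way to know your hot standby is actually usable is to continuously watch how fresh its copy of the data is. A minimal sketch, assuming a pt-heartbeat-style table that the primary updates every second; the DSN, table name, and ten-second threshold are all hypothetical.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Connect to the standby, not the primary; parseTime lets the
	// driver scan the DATETIME column into a time.Time.
	db, err := sql.Open("mysql",
		"monitor:secret@tcp(standby.internal:3306)/ops?parseTime=true")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The primary writes heartbeat.ts every second, so on the standby
	// this row is only as fresh as replication is.
	var lastBeat time.Time
	if err := db.QueryRow("SELECT ts FROM heartbeat LIMIT 1").Scan(&lastBeat); err != nil {
		log.Fatalf("standby unreachable or heartbeat missing: %v", err)
	}
	if lag := time.Since(lastBeat); lag > 10*time.Second {
		log.Printf("ALERT: standby lagging by %s; recovery time objective at risk", lag)
	}
}
```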

ADRIANA: Yeah, yeah, absolutely. It's funny, we were talking with Jason Harley about DR processes a few weeks ago, and it just underscores that you need to make sure that your backup systems are ready to go if shit hits the fan. But then there's this fine line also around how much effort you put into ensuring that your backup systems are up and running. It's such a balance because sometimes it's so much effort that it ends up being that case of people ticking off boxes again and just blindly following DR.

LIZ: Ooh, I talked to a client the other day who said something about how their security and risk compliance department wants them to be multi-cloud within the next three years, like running active-active across multiple clouds. And I was like, oh sweetie, [sighs] I feel so sorry for you.

[laughter]

ANA: The answer is you need to have started the job six years ago because it's really not that easy. And have you looked at the cost? Have you already signed off on how expensive that's going to be in general?

LIZ: Right, exactly. What is the risk that you're mitigating here? [laughs]

ADRIANA: Exactly, exactly. It's not to say that DR isn't important. It's just, you need to work within what's realistic. [laughs]

ANA: One of the other questions in the same space of SRE that I wanted to get your take on is, what are people doing wrong in the SRE space? I know we all get a chance to talk to the community and customers. But sometimes you're like, why are you implementing SRE this way? Or why are you taking SRE to this next level that it was never meant to reach?

LIZ: Number one pet peeve, companies that have both a DevOps department and an SRE department.

ADRIANA: [laughs]

LIZ: What are we even doing? Why are you doing this? 

ADRIANA: Yes. 

LIZ: So this is actually something that Emily and @rothgar and I did a Twitter Space on recently. But yeah, essentially, your SRE team is not just an SLO machine. Your DevOps team is not necessarily a CI machine. Instead, these are interdisciplinary things where we should be force multiplying teams, not just getting pigeonholed into building this specific tool team that you think that we're responsible for, right? 

ADRIANA: Yeah.

LIZ: We might start from different axioms of why we do the things we do, but we should be one team doing those activities of, yes, helping with CD. Yes, helping people build SLOs. But more importantly, driving towards continuous feedback loops and reliability as our operative functions.

ANA: And it starts with not having a silo between SRE and DevOps. And then it's like, how can you bring them together to collaborate, to become one team? That was something that we encountered a lot with the folks that we talked to when they were looking to implement chaos engineering. Like, do we put this in the observability team? Do we create a new team called chaos? Does this go to SRE or DevOps? And it's like, what are you even doing in this enterprise?

LIZ: That's classic shipping your org chart. In reality, this is why I'm so excited about the emergence of platform engineering as a concept, this idea that we are here to provide services to the product teams that are trying to build, and we're here to give them the best defaults that we can.

ADRIANA: Yeah, yeah, that's awesome. It's funny because talking about having a separate SRE org and a separate DevOps org, at one of my former employers, we had an observability team. And our goal was to create best practices in observability. But we kept getting calls from folks saying, "We're having issues with blah, blah, blah, app. Can your team please build us a dashboard?" 

And I'm like, that's not our job. [laughs] We're here to help you build your dashboards if that is what you want to do, but we are not building dashboards for you. And there was constant pushback. It was so frustrating because it was like, [laughs] y'all don't get it. [laughs]

LIZ: This is kind of the parallel that we have to test engineering, where test engineering is here to make better frameworks for being able to test your stuff, not to manually test your apps for you or to write your unit tests for you.

You're a full grown-up software engineer; write your own damn tests. So write your own damn comments. Write your own damn observability annotations. This is what will help you understand your code later. Someone else writing it for you achieves very little of the value.
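
For readers wondering what writing your own observability annotations can look like in practice, here is a minimal sketch using the OpenTelemetry Go API. The tracer name, span name, and attribute keys are hypothetical examples, not anything prescribed by the book or by Honeycomb.

```go
package checkout

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// HandleCheckout annotates its own span with the business context that
// the author of this code knows best, which is the point Liz is making:
// nobody else can add these fields meaningfully for you.
func HandleCheckout(ctx context.Context, userID string, cartSize int) error {
	ctx, span := otel.Tracer("checkout").Start(ctx, "HandleCheckout")
	defer span.End()

	// Hypothetical high-cardinality fields for later debugging.
	span.SetAttributes(
		attribute.String("app.user_id", userID),
		attribute.Int("app.cart_size", cartSize),
	)

	// ... business logic using ctx ...
	_ = ctx
	return nil
}
```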

ADRIANA: Yeah, exactly. You're not teaching anyone to fish at that point. You're just doing the fishing for them.

ANA: And at that point, you also just want to be the engineer that gets to say, "Yeah, I get to do this work. But I also do it in a way that is reliable, not just throwing things over the wall so someone else has to deal with it." Because otherwise we're back at the pre-DevOps movement, and that's that. [laughs]

LIZ: Right. And I think going back to your earlier question about what people are doing wrong with SRE, SRE is not just a synonym for we've renamed our ops department, and we're still throwing things over the wall. This is one thing that I actually think the Google book authors maybe didn't necessarily get wrong but didn't emphasize enough: the handoff of the pager to an SRE team is not a requirement of how to do SRE. That is just how Google made it palatable to software engineers back in 2004: you have to do what we say, and in exchange, we'll take the pager from you.

But today, that's not necessarily a best practice. It's better to keep the pagers with the teams that are actually writing the software.

ADRIANA: Yeah, that makes a lot of sense. On that note, because you worked at Google, what's your thought when you hear all these organizations talking about how they want to model their SRE practices around the Google practices? Because I mean, everyone and their uncle seems to want to do that, oh, Google did this; therefore, we must do it this way.

LIZ: This is why I was really, really pleased to have worked on the Site Reliability Workbook and on Seeking SRE, the other two books in the trilogy. Yes, the original SRE book was aspirational: this is how we think things might be done at Google, not necessarily what every Google team was doing. But that's another story.

And then you had The SRE Workbook, which was written by the Customer Reliability Engineering team that I was part of, which was basically trying to get the stories of how did customers of Google Cloud translate SRE practices to their orgs and specifically saying, "Okay, this is how you make these things work at not Google. Here are some variations on the pattern." 

And then the third book, Seeking SRE, was about what future extensions people at the time foresaw, many of which have come true. The fact that it's now not just that one book makes it easier for people to absorb how to do things not at Google.

ANA: I think we're also seeing a lot more content, too, where it's like, I do SRE, but we don't do it the Google way. We did read the Google SRE Bible, but we acknowledge it, and the way we implement it is very different. And I do appreciate that because I remember when that book came out, I was an SRE at Uber, and a lot of folks there were from Google. And it was just like, oh, look, now we have documentation for the things that we've been preaching, like, let's go and run with it. And it's like, but we've been doing embedded SRE, and that's been working.

LIZ: Right? Exactly. And I think there is a parallel universe in which Facebook wrote a parallel book about production engineering and embedded production engineering practices that could have been put on a level footing with Google's centralized SRE function, so people could compare and contrast. But unfortunately, Facebook didn't publish that book. So we didn't get that wisdom except from folks who had been talking across the aisle.

ANA: Most definitely. So it's always good to be sharing those learnings out in the wild for anyone listening that's like, wait, no, I have different thoughts on this.

LIZ: Exactly.

ADRIANA: Yeah, I think at the end of the day, the Google SRE book is great in terms of giving you ideas, but it's not one-size-fits-all. You've got to do what works for your organization while still upholding SRE principles.

LIZ: The question I always ask is, what is the problem we're trying to solve? Okay, let's look at some options for solving it, rather than saying all of the Google books should be brought into this org. Like, why? Why? What problem are you solving here?

ANA: Can we narrow it down to a chapter? [laughs]

LIZ: Right. So the one chapter that I think is still very much golden to this day is cause-based alerting versus symptom-based alerting. Like, you don't want to alert on a million potential causes like CPU too high, disk too high. You would rather do your alerting based off of symptoms of user pain like SLOs. That, I think, has held up very well, whereas some of the chapters on team structure I would take with a huge grain of salt.
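
As a sketch of what symptom-based alerting on an SLO can look like, here is the multi-window burn-rate pattern described in the Site Reliability Workbook. The 99.9% target and the 14.4 threshold (roughly 2% of a 30-day error budget spent in one hour) are illustrative assumptions, not recommendations.

```go
package main

import "fmt"

// burnRate is how fast the error budget is being consumed: the observed
// error ratio divided by the ratio the SLO allows. A rate of 1.0 spends
// the budget exactly over the full SLO window.
func burnRate(errors, total, sloTarget float64) float64 {
	if total == 0 {
		return 0
	}
	return (errors / total) / (1 - sloTarget)
}

// shouldPage fires only when both a long and a short window are burning
// fast, so brief blips don't page anyone but sustained user pain does.
func shouldPage(longErr, longTotal, shortErr, shortTotal float64) bool {
	const target = 0.999   // hypothetical 99.9% availability SLO
	const threshold = 14.4 // ~2% of a 30-day budget burned per hour
	return burnRate(longErr, longTotal, target) > threshold &&
		burnRate(shortErr, shortTotal, target) > threshold
}

func main() {
	// 2% errors over the last hour and 3% over the last five minutes:
	// burn rates of 20 and 30 against a 99.9% target, so this pages.
	fmt.Println(shouldPage(2000, 100000, 300, 10000))
}
```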

ANA: I do wish we were seeing more folks alert on SLOs. I think that's still very much visionary. The more I talk to folks, the more it seems they're very early in their journey. Having been running in this space for a bit, you'd assume that SLOs were a little bit more adopted. So I'm definitely really excited to see more folks work on that in the next 12 months.

LIZ: And I think this is an area where we as developers of observability tools can really help because an SLO has to be a living, breathing thing, not just a thing that you put up on a dashboard and you look at it 90 days later, and oops, we blew our SLO. That's not how you make SLOs relevant and actionable. The way that you make SLOs actionable is by tying them to your observability data. 

If you're generating your SLOs from your observability data, that means that SLO degradation can be immediately debugged using your observability tool. And I think that's how you close that feedback loop. So we've seen a lot of people say, oh, we have an SLO feature. And you unpack it, and it's like, we're just summarizing aggregate time series metrics. And it's like, yeah, no, people are not going to be able to debug this. So this is why I'm really excited that there are such smart people working at our respective employers on these kinds of things.

ANA: [laughs] Definitely. And it goes to that other side of using the error budget itself to actually innovate, versus just taking the budget and not ever using it and not really trying to push the edge on the work you're doing. Maybe in 24 months we'll see that a lot more. I think there's always that wishful thinking. And a lot of folks are still too scared to get there sometimes.

LIZ: People are rightfully too scared if they don't have observability. If you get an alert saying, "Hey, I'm burning through my error budget. Users are seeing errors," but you don't have the ability to figure out where those errors are coming from, you're just going back to CPU too high. I can understand the gap. I can understand why you don't want to have this noise that, from your perspective, you can't actually figure out. But the answer is not to give up on SLOs; the answer is to invest in observability.

ANA: Definitely.

ADRIANA: Yeah, yeah, absolutely. Absolutely. And I think taking the time to craft proper SLOs matters, because that's something a lot of organizations struggle with. They don't know where to start. They don't know how to tie it all together.

And this is where I would recommend Alex Hidalgo's book on SLOs because I read the first few chapters, and it's got so many good nuggets. I felt like I wanted to highlight [laughs] the whole book because it really clarifies the terminology and what you should be looking out for, and it talks about error budgets. And we don't necessarily need to chase the five nines depending on the context because otherwise, you're not giving yourself enough error budget, in which case you're in a bind, right?

LIZ: Yeah, definitely setting realistic SLOs is super important, and differentiating between what's my aspirational SLO versus what am I actually achieving today? What am I measuring against and alerting on? As you're saying, you don't want to be told you blew your error budget by a factor of five. It's like, yes, I know that. [laughter] I'm going to fix it when I have 24 months to re-architect.

ANA: [laughs] Liz, do you have any recommendations or tips for folks that are getting started with SLOs? Or is it more of like dive into Alex's book and see if you can take some of those learnings home?

LIZ: I think the main thing is to try to understand what your users are trying to do with your application. And then once you have that information, then you know, start writing out the measurement methods for those user journeys. So I don't think it necessarily requires reading a book, although I think that the book is great for filling in the details.

I think the place where people go wrong is more when they set SLOs on the wrong thing. Like, if you're setting SLOs on whether this individual endpoint is error-free, it's like, yes, but have you looked at the whole system? Does it really matter if the fraud system is working if people can't even log into the website?
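
A minimal sketch of scoping the SLI to the user journey rather than to an individual endpoint, using Liz's login example; the event shape, field names, and one-second latency bound are all hypothetical.

```go
package main

import "fmt"

// Event is a hypothetical wide event emitted once per user journey.
type Event struct {
	Journey    string
	DurationMS int
	Error      bool
}

// goodLogin is the SLI predicate: did the whole login journey succeed
// fast enough, regardless of which endpoint or host was involved?
func goodLogin(e Event) bool {
	return e.Journey == "login" && !e.Error && e.DurationMS < 1000
}

func main() {
	events := []Event{
		{Journey: "login", DurationMS: 250, Error: false},
		{Journey: "login", DurationMS: 4000, Error: false}, // too slow: bad
		{Journey: "login", DurationMS: 300, Error: true},   // failed: bad
	}
	var good, total int
	for _, e := range events {
		if e.Journey != "login" {
			continue
		}
		total++
		if goodLogin(e) {
			good++
		}
	}
	fmt.Printf("login SLI: %d/%d good\n", good, total)
}
```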

ANA: [chuckles] Is that even part of the critical path?

ADRIANA: And I think that's where observability helps, because it helps you to refine the SLOs. I think they feed into each other constantly. And the other thing is that an SLO is not a fixed thing. You're always iterating on them.

ANA: I mean, it should all be like a continuous journey where you're constantly coming in and doing work on your observability, taking those improvements, making your incident system better, and then doing other practices. For me coming in from the space of chaos engineering, it ends up being this continuous cycle. 

And I was giving a talk last week where I referenced Amy Tobey's model of what SRE should be from her perspective, where incident management makes observability better, which makes SLOs better, which makes chaos engineering better, and they all kind of feed in and out of each other. And I'm like, this is how I see it. More people need to be looking at it this way, because you can't be doing one piece of the work without doing the others, or without having those structures set up in order to do it right. Without having insights, you don't know what you're doing.

LIZ: Yeah, it's totally this virtuous cycle. But I do think that some prerequisites have to come before others. I've said this before, I think when you were working at Gremlin: if you can't even see what's going on in your system, if you have unaddressed chaos that's happening inside of your system, then why add more chaos if you can't even measure and tame the chaos you have today?

ANA: I definitely fall under that mindset too. So it's like, understand before you inject chaos. It's not going to help if you have a system on fire [laughs], and all of a sudden you're like, wait, what happens if I start a little small fire here, a little bit of planned chaos?

ADRIANA: It's a controlled burn, right? [chuckles] I know we're just coming up on time. Before we wrap up, I had a burning question for you, Liz. How did you get into developer advocacy?

LIZ: Ah, this is the example that I use to illustrate sponsorship versus mentorship. So there was a director of SRE at Google, Sabrina Farmer; she's now one of the VPs of SRE at Google. She wasn't in my direct reporting chain, but she was one of the co-chairs of SREcon from its inception. And she wanted to focus more on her Google management responsibilities, given that she was in the process of getting promoted to VP. And she was looking for a replacement for herself on the SREcon program committee.

And she basically said, "Hey, Liz, I think you'd be great for this. I know you've never even thought about conferences or public speaking or anything, but..." She was like, "I love your internal talks. I love that you've been on like ten different teams at Google. Hey, are you interested in taking this on?" And I said, "Yes." This wasn't her coaching me along these specific things. This was just her seeing this opportunity that would fit well with my career and then advocating for me to get that position. 

So I co-chaired SREcon for the first time, I think in 2017, 2018, something like that, 2017. And I just fell in love with the idea of leveling up every engineering team in the world and not just the particular team that I was managing at Google at the time. So that's how I got into DevRel, because I realized it was part of this broader mission to help people develop better software.

ADRIANA: That is super awesome.

ANA: Since I met you in the developer advocacy of SRE space around that time, it was like, well, a lot of the things we do in SRE is literally this: sitting with engineers and educating them on how to make their systems better. So when you think about how we scale our SRE practice to be impactful outside of the organization, it just kind of fits right in.

LIZ: It absolutely does. I'm really excited to see so many great DevRels come from SRE backgrounds, and there's so much work that's interesting to do in our space.

ADRIANA: That's awesome. So, as we wrap up, do you have any parting words of advice for our listeners, be it advice on SRE, observability, or the DevRel space for anyone who wants to get in? Yeah, any words of wisdom?

LIZ: The main words of wisdom that I have are that toil is your enemy. 

ADRIANA: [laughs]

LIZ: So anything that you can do to free up more of your time to do higher impact work will give you increased leverage, will give you the ability to get more done. 

ADRIANA: Cool. I love it. These are great parting words. Awesome. Well, thank you so much, Liz, for joining us on today's podcast. We loved talking to you about all the things today, from observability to SRE to on-call.

Y'all, don't forget to subscribe and give us a shout-out on Twitter via @oncallmemaybe, and be sure to check out the show notes on oncallmemaybe.com for additional resources and to connect with us and with our guests on social media. 

For On-Call Me Maybe, we are your hosts, Adriana Villela and... 

ANA: Ana Margarita Medina. 

ADRIANA: And signing off with...

LIZ: Peace, love, and code.
