Finding Humanity in Incidents with Nora Jones of Jeli.io

On-Call Me Maybe

English - October 10, 2022 04:00 - 42 minutes - 39.1 MB - ★★★★★ - 3 ratings

About our guest:

Nora Jones is the founder and CEO of Jeli.io. She is a dedicated and driven technology leader and software engineer with a passion for the intersection between how people and software work in practice in distributed systems. She created and founded the learningfromincidents.io movement to develop open-source cross-organization learnings and analysis from reliability incidents across various organizations and the business impacts of doing so.

Find our guest on:

Nora’s Twitter Nora’s LinkedIn

Find us on:

On Call Me Maybe Podcast Twitter On Call Me Maybe Podcast LinkedIn Page Adriana’s Twitter Adriana’s LinkedIn Adriana’s Instagram Ana’s Twitter Ana’s LinkedIn Ana's Instagram

Show Links:

Jeli, The Howie Guide, Learning from Incidents, Twitter thread summary of our chat with Nora

Transcript:

ANA: Hey, y'all. Welcome to On-Call Me Maybe, the podcast about DevOps, SRE, observability principles, on-call, and just about everything in between. Today we're talking to Nora Jones, CEO and Founder of Jeli.io. And we're very excited to have you here. Thanks for joining us.

NORA: Thanks for having me.

ANA: We have the first question for today's show: what is going to be your drink of choice today?

NORA: I have a smoothie that I'm working on. So it's from the shop across the street. It is like a little kind of milkshake smoothie. It has some oat milk, some strawberries, some bananas in it, and it looked delicious. So they sold me, and I now bought a $12 smoothie that I probably could have made for $3. [laughs]

ADRIANA: Sounds very delicious.

ANA: What are you having, Adriana?

ADRIANA: I am having some good, old water.

ANA: Good hydration. Similarly, I'm doing sparkling water with a little raspberry cranberry flavor, just trying to beat the end of summer heat.

ADRIANA: Yeah, it's pretty hot where I'm at. We're like high 80s, [laughs] so it was a very hot day.

ANA: So, for our listeners today who are not familiar with Jeli, can you tell us a little bit about the company you have created?

NORA: Jeli is an incident analysis company. First and foremost, we cover really the whole incident management suite. But we go deep on an area that I think has frequently been not very much talked about in tech but is incredibly important, and that's on understanding what happened after we had an incident.

There is so much learning to unpack with how we coordinated, how we spoke to each other, who we brought in, why we needed to bring them in, what systems were involved that doesn't really get talked about. In my experience, the tech industry really focuses on what we know best, which is the technology and the software behind what led to an incident.

And there's a whole other story that didn't get unpacked, which is how the organization works together, the relationships, the psychology aspects, which I think can sometimes feel really awkward to talk about, but it's also really important to talk about. And so that is really what we focus on is making that easy for folks to unpack in a way that can help them improve in the future.

ADRIANA: That's so cool. So you actually tackle the sociotechnical aspect of it, too, then?

NORA: Yeah, absolutely. That's our main focus, and it's a lot of fun. But we're also all engineers. We've been a part of the big technical aspects of it all. And we've seen the social aspects not really be spoken about very much. I think it's a big miss. And it's honestly a business advantage to be able to talk about the social aspects as well. So we're really trying to give every company that advantage.

ANA: Especially when it comes...like, a tool is not even necessarily just a new culture that you're trying to bring into organizations. You're kind of bringing in the new culture, but it's being facilitated with something that's going to help them.

NORA: Totally. It's definitely meant to help them. It's meant to take typically sterile and maybe sometimes boring or a very emotional thing that happened, which is an incident, and make it fun, make it a learning opportunity, make it not feel like a bad thing that happened but something that is expected as part of moving fast in a very technologically advanced world, and just really trying to extract value out of it and also celebrate your employees and your organization along the way.

ANA: As you've been working on this field, do you feel like the definition that we have for incident management, incident response needs a refresh when we look at it since we have been focusing very much on technology broke; it didn't matter how it broke?

NORA: Yeah, 100%. I think it needs a huge refresh. I think there are a lot of things that need to change; I think around how we view incidents, I think around how we speak to people that participate in them afterwards. I think it's a lot of emotional burden to carry fixing an entire incident when you are also an expert in the situation.

But it's like, it also needs to be approached with care when you talk with them afterwards. And I don't feel like that is always spoken about as much, but it's really, really beneficial to do that. So I think there are a lot of things. I mean, at Jeli, too, there are even subtle language changes we make in the tool. Instead of using the word incidents, we use the word opportunities.

So like in CRM tools when you sign in, it shows you your opportunities rather than these are the people you need to sell to and meet up with. It's like, that almost feels kind of not fun and kind of a little transactional whereas opportunities it's like you're building a relationship. You're building an opportunity to grow.

It's similar with incidents; it's an opportunity. It is an opportunity to grow. It's not a bad thing that happened that you need a checklist to recover from. It is like a way to evolve. It sounds like such a minuscule language shift. But doing those little language shifts can actually really help people think about them differently, which will help the org get more out of them at the same time.

ADRIANA: I love that because it turns it from a glass half empty scenario to a glass half full scenario, right?

NORA: Bingo. Yeah, yeah, exactly. It's way more like this happened. It wasn't fun. But we can collaborate on it, and we can just treat it as an opportunity to learn and grow. And it's something I've thought about for a really long time, and something I used to get hired into organizations to help was to change the culture and the way we spoke about incidents, even in one on one conversations, like passing by, getting coffee together.

But I really had to walk the talk in my own organization too. I'm making a tool that I'm trying to sell to people [laughs] to help them think like this. I'm going to have to implement it internally too. And I will earnestly say incidents are kind of fun here. They're never expected, but they are... we really enjoy working together and collaborating with them. And they're just considered a regular part of work and not a very big deal.

There are certain incidents, of course, that you cannot avoid in almost every company, in every industry that is successful, that are going to be painful and are going to be a big deal. And I think the way you talk about them afterwards can still help folks feel psychologically safe, help your retention, help challenge your engineers and all the people in your organization to build their expertise in a way that helps your organization long term too.

ADRIANA: That is so cool because I think it taps into what Ana was saying in a previous blog post about basically leading with empathy. And this really taps into leading with empathy. And I love, too, that it sounds like it takes the PTSD out of dealing with incidents because of the positivity around it, which I think is so cool. Because I think so many of us have been so burned and so scared to even own up to mistakes because of repercussions, right? Oh my God, I caused this. Now I'm super screwed, like, fingers pointing everywhere, blah.

NORA: It's almost worse when you're feeling that way, and then someone else gets assigned the post-incident review. And you were just waiting to figure out what they're going to say about your enrolment and involvement in the situation, even though you did what made sense at the time. You care about your job very much; I'm sure they do as well. And you just have this anxiety, and usually, it's about how they're going to fill out a pre-templatized Google Doc.

And oftentimes, it ends up with a lot of their opinions of the situation. And so you're almost like subtly DMing them, trying to schedule meetings with them to try to make sure you're involved in the conversation because you very much should be. And the thing is we make all that easier, like, we make that easier for the you in that situation. We make that easier for the writer in that situation. We make that easier for the person that created that templatized process.

And I think a lot of the problem is how we've been set up as an industry, too, which is, yeah, use a Google Doc and Slack, two tools that were not made at all for incidents, [laughs] and come up with the answer to what happened in this incident, which doesn't make sense. It doesn't set anyone up for success.

So what we really do is we have a couple of different personas that we really tailor towards. We are there for the person that is creating the post-incident review and is tasked with this, but we also assume that they have a lot of other stuff going on. And they may not get your full account in the way that you deserve and the way that the person deserves.

And so we really help them see the key players in the situation and the key standout moments so that they can collect all the perspectives and get all the different views of expertise because ultimately, incident review shouldn't be done in a vacuum by one person. They should be a collected, amalgamated experience that is put together.

And the thing is, everyone's individual experience is incorrect, and it's also very correct. And so the role of the incident reviewer is to collect all those and highlight the differences between those experiences because the real answers are in the deltas, like how you viewed what happened versus how I viewed what happened. And again, neither of us are wrong, and neither of us are right.

And it's very hard to do that as an incident reviewer, but that is what we try to make really easy. So that, what you said, it feels psychologically safe afterwards, that there's not a lot of emotional burden afterwards. And that you really feel like it's kind of a team effort that you're working through together.

ANA: The post-incident reviewer is actually able to see you in a sense like in a human aspect.

NORA: Right, exactly.

ADRIANA: How do your customers respond the first time that they use that approach to handle these opportunities, as you put them? What goes through their minds?

NORA: I would say our first initial users, like all of our initial inbound customers, were folks that were very bought into this way of thinking, but they might have been one of two people in their org that was bought into this way of thinking. And I used to also be that person in this org. So I really tried to make a tool that would help them socialize it better and show the ROI of it better.

And so I think the initial reactions, especially from those people, are like, oh my gosh, now I can screenshot this thing where you're visualizing this for me, and you're putting this together in a way that multiple different parties can understand. I feel like the initial reaction has been really great.

And I think a lot of folks in the industry, when they hear the word learning or learning from incidents, sometimes the initial reaction is like, but what does that mean? Like, what does learning mean? What is the ROI of that? And we really try to show the ROI of that in a way that can make it easy for those skeptics to go, oh, interesting.

And so our goal is to help the person that already really is trying to move some of these processes and thoughts forward into their organization and really help them socialize it to their colleagues in a way they understand how it impacts their work positively too. So it's like there's an initial amazing reaction towards the people we know and interact with, and there's a slow trickle through the rest of the organization that is really quite cool to see.

ANA: For anyone that's trying to convince their manager to start having this culture change of seeing them as learning opportunities, but their manager is kind of pushing back because, like, the same failure doesn't happen twice; we can't learn from it. What is your take on that?

NORA: My take has honestly evolved a bit over time. I was very much in the, like, "I will stand on this hill and die on this hill" camp about some of these things. And now it's like, that's not the way to make change in your organization. I think everyone has...are y'all familiar with the terms blunt end versus sharp end in terms of expertise?

ANA: I think our listeners might not be. We'll definitely get the explanation.

NORA: Okay, yeah. So just to distill it just in a quick sense...and there are much better infographics online about it. Richard Cook has a really great one where he talks about Above the Line, Below the Line, and I can send a link to it afterwards.

But the sharp end is like when you are in a role, and you are seeing all the depths of your role. And then the blunt end is more like what others might see about your role in the organization. I think the thing that's evolved for me is everyone has their own sharp end beyond engineers. And I know that sounds very obvious, but I think directors and managers, even the curmudgeon-y ones in those situations, they also have their own sharp end.

And I think just getting curious about everyone’s sharp end in an incident is what entices change. Some initial mentalities in this space have been like, let's completely just only listen to the engineers in this situation. We really have to understand their expertise, and that is totally true. And I think sometimes managers’ and other folks’ sharp ends were left out of that conversation, which does not make change possible. And so it's like getting curious about those sharp ends as well so that they can all integrate together.

ADRIANA: And I think that plays nicely into what you were saying then about, like, everyone's perspective matters, right?

NORA: Yeah, it matters. Yeah, it's like, I don't want to say wrong or right, but everyone's perspective is incomplete, and it's incomplete in their own way. And so they all need to form together to be complete. And no one's is more right or more important. It's just kind of collecting this together, which requires a really skilled facilitator, which that's what a lot of companies don't invest in. They're like, "So, and so you were the incident commander here, so you get to do the post-mortem. And you have eight hours to do it for our eight-day incident. Yeah, let us know all the stuff we need to do afterwards."

And then they'll be working to complete it really quickly. They won't get everyone's perspective. They’ll hold a quick post-mortem. Then they're like, "Yeah, so and so that also did this; you also have to do all these other things this week that have nothing to do with this incident. But we really want all the action items from this incident too." So they quickly rush to do those.

And then those action items hang out in a Jira log for months. And then someone that asked them to do the post-mortem later comes back and is like, "Hey, why didn't any of these action items get completed?" And then they talk about them in some all-hands or some meeting, and then the cycle just continues over and over again.

And so I think the thought that Jeli is taking...at first was to encourage folks to spend more time on their incident reviews, especially for the big ones. But I switched my approach a little bit, and I'm like, let's meet them where they are today and assume they're going to take eight hours. How can we extract the most value out of those eight hours?

And so we're really trying to show them things that would be really hard to see with the human eye and that you probably wouldn't explore at all, like, who participated a lot in these incidents? What did the emoji reactions look like? Who was reacting to them? Things like that that will help direct you towards who to talk to and who to include.

And then eventually, I would love if people spent more time on them, but where we're starting out today is assuming that they might not, and they might not have a lot of time to spend on them, even though they want to do a good job.

ADRIANA: So, how do you choose a facilitator for these types of things? What would you say is a good practice for choosing a facilitator?

NORA: We actually have a whole guide that we created; it was called the Howie guide. And we have a whole section on how to choose a facilitator. It was led by Dr. Laura Maguire at Jeli and Vanessa Huerta Granda, who's one of my colleagues as well. We really talked about the ways that you should ideally pick a facilitator, which is you should pick someone that was not involved in the incident but understands the incident technically enough to be an unbiased third party.

You also should pick someone that is well-respected enough in the organization, like someone that people look up to, that are comfortable divulging things to. So that might not be someone that has been at the org for three months because they haven't built up enough time with people. It also might not be someone that is brand new in their career because I actually think that's a pretty overwhelming situation to throw someone into. I think it's unfair for both parties participating. So I'll link that afterwards, too. But the picking is very, very important for sure. So I'm really glad that you asked that question.

ADRIANA: So it sounds like the facilitator is almost like a mediator/coach/therapist all rolled into one, right? [laughs]

NORA: Yes, it really is. And that was the thing...so I spent about six months at Slack. I was training people on how to do this. And I was putting people through several week-long training programs. It was mostly engineers that we had picked so that they could understand some of the technical nuances because, ultimately, you're going to be talking to engineers. So you want engineers to be able to think that they can explain technical concepts to you.

But I told them, I was like, you have to completely take off your "I work at this organization" hat when you're talking to these people because that is not your job at this time. Your job is not senior staff SRE; your job is incident analyst. And you have to be curious. And you have to ask questions that you know the answer to. So say Ana shared a graph or a link to a dashboard in our incident channels.

And I, as the investigator, maybe know exactly why she shared that, and I feel pretty good that I know why she shared that. But that doesn't matter in the incident review. You want to ask her, "Hey, Ana, why did you share this? How did you find it? When did you first hear about this graph? Did you make this graph? Did someone else make this graph?" And ask all the honestly silly questions and put them in your report because that's how other people learn. And then it makes it safe for other people to ask or share that information in the future, too, rather than just assuming that that's something you should know.

ANA: Is currently the tool looking to provide some of that extra context from the tools that we're using but in the fashions that we don't necessarily see when we do Google Docs and Slack?

NORA: Definitely, it's also helping you as the investigator ask these questions a little bit better. Like, hey, here are all the times people shared graphs. You should ask them these questions about it. Or here's where they've actually shared this graph in other incidents. Are they related to each other? And we bubble up those commonalities as well in a way that I think would be hard for people to do if they're just copying and pasting a Slack channel into a Google Doc and trying to peruse it that way.

ADRIANA: So it gathers information. It uses some historical data to help drive future incidents within your organization. Do you also pull in data from other organizations saying, like, other organizations have tried blah, so you should try blah?

NORA: Are you reading my business plan? [laughter] I would love for the industry to get to that point. And I'm sure you all have experienced the same thing. I feel like I have been at a few organizations in the same role at this point, where I have experienced almost the same incident at very different organizations. And I'm like, wow, the industry would really benefit from sharing this stuff with each other.

And it is less about the technology being used and more about how people are treating the technology and acting towards it and how they're being staffed around it. And I'm like, all this stuff should be shared with each other. So I would love for folks to eventually be able to opt-in to sharing some of this obfuscated data. But at least within organizations right now, you can see a history of how folks have responded to this in your organization over time and how it's kind of trended.

We had someone use our tool to make a case to build a three-person team in Dublin around a particular piece of technology, and they used our tool to show how often it was involved in incidents, how often they needed people not on call to respond to this particular piece of technology, and how much time and interrupt-driven work it was taking them.

And originally, the thought process was, oh, we just need to rip out this technology, but that wasn't going to help them. And that wasn't going to save them a lot of money, honestly. And so this was the better approach. And they were able to have that conversation much more quickly than I think they would have otherwise.

ANA: That's pretty cool that you get a different use case of this actually ends up being like an aggregated amount of data of information that you don't necessarily see about the services that you're operating, but at the same time, information of opportunities that are literally just sitting down and you already invested money, but you're not using.

NORA: Yeah, exactly. There was one organization I was at that was about to have an exit. And I was asked what is going to happen if we lose all these people that have been here for four years and are fully vested? Which is actually a very scary question for any leader to ask. And I would venture that most leaders actually know who their most crucial incident responders are, and I think anyone in an organization does. But I don't think you quite know how valuable they are and how much...

We have single points of failures in our systems. But we also have single points of failures with our expertise, and that is a lot harder to see. We make a lot of tools around understanding single points of failures in a technical sense, but we don't make a lot of tools about understanding them in a people sense. And I don't think it has to be a bad thing to bubble that up. What I ended up making when I was an employee was this heat map of all the people that were fully vested and all the times we needed them in incidents that they weren't on call for.

So you could look at someone's name and see, oh, wow, they have immense expertise in this particular system, and it doesn't look like many other people in the org did. So maybe before they head out, we should get a lot of training sessions from them on this particular technology.
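As a rough illustration of the heat map Nora describes, the counting behind it might look something like the sketch below. The record fields (`system`, `on_call`, `responders`) and the sample names are hypothetical stand-ins, not Jeli's data model or the tool she built.

```python
from collections import Counter

# Hypothetical incident records; field names and people are illustrative only.
incidents = [
    {"system": "billing", "on_call": {"maggie"}, "responders": {"maggie", "sam"}},
    {"system": "billing", "on_call": {"lee"}, "responders": {"lee", "sam"}},
    {"system": "search", "on_call": {"sam"}, "responders": {"sam"}},
    {"system": "billing", "on_call": {"maggie"}, "responders": {"sam", "maggie", "lee"}},
]


def off_call_counts(records):
    """Count, per (person, system), how often someone responded while not on call."""
    counts = Counter()
    for record in records:
        # Set difference keeps only the responders who were not on call.
        for person in record["responders"] - record["on_call"]:
            counts[(person, record["system"])] += 1
    return counts


counts = off_call_counts(incidents)

# Highest counts first: these are the candidate "hotspots" to ask about.
for (person, system), times in counts.most_common():
    print(f"{person:8} {system:8} {times}")
```

The counts only point at where to look. As the conversation goes on to say, they are a starting point for a story, not the whole story.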

ANA: Or just in general, maybe the manager being like, oh, there's a single point of failure in my system, because like --

NORA: Because of how I organized it, yeah, exactly. And so it's meant for that as well without necessarily shaming anyone, you know what I mean? It's meant to be more proactive about that so that people aren't burning out and leaving because they're the only one that understands that thing, and no one else knows just how much they're relying on them for that. It's like surfacing that earlier so that you can be a little bit more proactive about it.

ADRIANA: That is so cool because we were talking to Liz Fong-Jones about that today. Oftentimes, you end up having your single point of failure because it's always the same person or five people who get called for incidents. And they might not be on call, but they still get calls because they're the experts. And it's so cool that you turn something that's non-empirical into empirical data saying, hey, [laughs] look at this. This is the person with the expertise; maybe you should do something about it.

NORA: Totally. And it's just meant to show your hotspots. It's not meant to be the entire story. But it's meant to help you start and contextualize a story that I think a lot of people feel they have a good handle on in their heads, but they don't always.

ANA: It's like, oh, the manager or director doesn't even know that Maggie is getting called every Friday at 11:00 p.m. because the data--

NORA: Yes, 11:00 p.m. for the third time. And then you're pinging her at 8:00 a.m. the next morning, and she's not saying anything because that's a lot of pressure to put on her to bring that up, like, you should just know. And so it's like we're enrolling everyone in the conversation. Because I feel like people have assumptions and expectations of their colleagues that don't always get said. We're providing some of those data points to them before they become problems.

ANA: Well, in a way, the tool is actually trying to level the playing field for folks. And then these incidents, which, as we know, can be detrimental to your mental health because of burnout or just bad actors in the room. But at the same time, you really might not feel safe speaking your mind because of the people on the call. Like, are you going to lose your job? Is your promotion going to get canned?

NORA: Yeah, exactly. And we're trying to take that social responsibility off of the person. You shouldn't have to use your social capital in an org to make your point about something important because we all have a bunch of points to make. And then we sit, and we're like, okay, which ones [laughs] are most important to make? And which ones should I probably just sit on for a little while, right? And we're trying to make it easier through the tool to show those problems so that you don't have to raise them.

ANA: That makes perfect sense. One of the questions that I was thinking about earlier, like, I would love to get your take on it: how do you feel about companies and organizations using external facilitators for doing their post-incident? I personally like it because sometimes it is like someone with technical expertise that's coming in and asking better questions of, like, why is it that you did this thing? And in moments where you don't necessarily have a full robust tool like Jeli that helps you. Do you have any thoughts around that?

NORA: I do. And I think the reason it's so appealing towards folks in the organization is like, they have no reason not to trust this person. They're not enrolled in the day-to-day. They are just curious and asking questions and quite nice. And that is why I actually really encourage organizations to develop their own group internally.

So what we did at Slack was we developed a group of about 13 people across the entire organization. And we would rotate through them like an on-call rotation. If they weren't involved in the incident, they were involved in this. But a big part of this was you had to develop trust with the people that you were talking to. A lot of it was, hey, nothing you say to me one on one is going to get put in this report or shared with anyone without your consent. And a lot of it is actually just going to get amalgamated into a bigger story. So I'll share it with you before I share it with the rest of the organization.
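One way to picture the rotation described here is the minimal sketch below. It assumes a simple round-robin over a pool of trained reviewers, skipping anyone who took part in the incident. The function name, the pool, and the names in it are hypothetical; the conversation does not say how Slack actually scheduled this.

```python
def pick_facilitator(pool, involved, start=0):
    """Return (facilitator, next_start), skipping anyone involved in the incident.

    pool is an ordered list of trained reviewers, involved is the set of people
    who took part in the incident, and start is where the rotation left off.
    """
    for offset in range(len(pool)):
        index = (start + offset) % len(pool)
        if pool[index] not in involved:
            # Advance the rotation so the next review starts after this person.
            return pool[index], (index + 1) % len(pool)
    raise ValueError("everyone in the pool was involved in this incident")


# Hypothetical pool of reviewers.
pool = ["ana", "bo", "cy"]

# "ana" was involved in the first incident, so the rotation moves on to "bo".
first, next_start = pick_facilitator(pool, involved={"ana"}, start=0)

# Nobody in the pool was involved in the second incident, so "cy" is next.
second, next_start_2 = pick_facilitator(pool, involved=set(), start=next_start)
```

A rotation like this covers only the "not involved" criterion. The trust and standing that Nora describes still have to be judged by people.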

And so I feel like folks end up liking that and building trust with that person because they still get their viewpoint represented, but it's not necessarily attributed to them. They get their thing put in there. And I think that's why folks do consultants in organizations is because sometimes the consultants say and get a thing implemented that they've been trying to do for years. And I think that folks can do it internally a little bit too. It just requires that person taking off their day job cap and putting on this facilitator cap.

But I think what's really amazing about this is...like one thing I saw at a previous organization I was at was a senior engineer that had been trying to get promoted to staff for like four years got enrolled in this program and spent three months ramping up on doing incident reviews, putting on this curious facilitator hat, had a lot of context on the organization.

And he comes up to me; he was like, "I finally get it now. I realized that this is not actually about the org having good facilitators." He was like, "I know so much about our systems now just from talking to people one on one," that he started becoming someone that people would go to in incidents.

And he got promoted to staff the next round, not because he signed up to be a post-mortem facilitator but because he all of a sudden knew so much stuff about the systems, and not only that, he knew also other people who knew about them. So he knew who to pull in in the incidents quickly. He knew which graph to check. And the more people you get to do that in your organization, the more your expertise is going to level up too, and you're going to have less of those islands of knowledge or single points of failures.

ANA: That makes perfect sense. As we talk about this, one of the questions that I had also thought about for today was, in general, apart from looking at incidents as incidents and having the traditional way that we go about them, what is another misconception in general that we are still having in the space?

NORA: There are a lot of misconceptions. Like, there is a lot of shiny object syndrome around very prescriptive tools, especially after you've had a really bad situation and some vendor is coming to you being like, "You really just need to do this, this, this, this, this and this," and it feels really emotionally nice to have a new process to follow. And then future things you can blame on people not following the process, rather than any of the stuff you're doing. It's almost an emotional Band-Aid rather than you actually looking inward in your organization and addressing what's happening.

Like you brought up before, there are a lot of components. We all hopefully do self-reflection to a certain extent, but it is very hard to do organizational reflection and then create a process that works for you rather than just offloading and buying one. And we really try to make our tooling flexible. We don't say, "Hey, we know your org entirely." It's like, you know your org the best, and you know which processes are going to succeed and which ones aren't. And we also help you identify your hotspots along the way so that you can make that thing that works best for you.

ANA: Definitely. Specifically, when it comes to learning from incidents, do you think that, as an industry, we're still going about it the wrong way? At least from my perspective, I feel like I've seen people more open to we can learn from it versus this happened, and we have to deal with it. But at the same time, it feels like we're just not being as intentional with it.

NORA: I'm seeing what you're seeing, Ana, and I know you've been in this field for a while too and have been evangelizing a lot of this thinking for a while too. I feel like I've seen a change a lot in the last five years. It's not where I would like it to be. But I think growth is slow. I am seeing upward trends.

I feel like sometimes these kinds of things are a little bit like a video game. Like on the first level, you get told to go talk to this merchant and buy this particular thing. And then, in the next level, you're not getting so much instructions, and then you slowly build expertise. But you're not shaming them in the first level for not understanding the process yet.

I think it's just a journey that we have to go on as an industry. And we're all trying our best, and we're all understanding it in our different ways and at our own paces. And I think it's trending upwards. Like, I'm seeing folks talk about it a lot differently.

ANA: It is a nice change. It is definitely refreshing when it's like people don't run away when you say those words. They're like, "Oh." [chuckles]

NORA: Yeah, exactly, exactly. Or you'll get people just bringing stuff up in conversation without you having to be the person. And you're like, "Oh, wow. That's really nice."

ADRIANA: It's refreshing, especially in an industry where I think people have just reached that point of burnout of like, nah, this can't go on any longer. I'm just tired. [laughs]

NORA: Yeah, totally.

ANA: It's raising that white flag of like, nope, [laughs] I will not do this again. I've done this before.

NORA: Yeah, yeah. And it's a lot of emotional burden to put on the people. Like I was saying before, there's sometimes one to two folks in an org that understand this thing. But I feel like the more I'm doing this, the more I'm seeing that one to two-person number growing to actually a couple of teams number, which is a huge change. And I think it's only going to be growing from there.

ANA: Definitely. And I think, too, we're starting to also see that it's not necessarily we're just going to learn. People are trying to be a little bit more actionable about it, which I have thoughts around that because you made the comment earlier [laughs]; we end up with 5 to 10 Jira tickets, and nothing gets done until like six months later.

NORA: It's actually the same thing as what I was talking about with, like, buying a tool to fix something broken. A Jira ticket is kind of also the same thing in that situation. It emotionally feels like you're doing something to create your ticket. You feel like there's some output. I mean, it's sometimes similar to meetings with no follow-up or action items. It's nice to have those, and sometimes you don't always need them, and there are intangible benefits.

And so I think it's just folks getting used to sometimes it being a little bit less direct but trusting the process that it is going to help you more than a Jira ticket that derails your devs for a few months and is actually not that helpful.

ANA: Do you have any words of advice as how to come up with action items or how to review your action items for prior incidents?

NORA: Yeah, definitely. My spiciest take on action items is that you should have your post-incident review and not talk about action items at all, and then have an action items meeting later. Because that way, if people are coming in not with the idea I need to fix this right now, they'll actually openly have a conversation and reflect and unpack what happened. And then give them a day to soak on that conversation. And then have another quick conversation where folks come with their action items.

But not every org feels like they have the time or quite understands the benefits to do that yet, and that's okay, too. I think the most important thing is to not just have one person talking in the meeting and have several folks talking in the meeting and collaborating together on action items. I think when you do what I first talked about, you'll have amazing action items just fall out of your incidents.

We really practice what we preach here, and we do our whole thing with action items and incidents, and we use our tool almost every day. And we don't really prescribe needing to do action items in the meeting. And as a leader, I don't worry about people getting them done. They just kind of do them. They work on them together, and they come up with what needs to get done. And they figure out a thing that needs to happen, and then we move forward.

It doesn't mean we don't have technical debt. It's just kind of trusting the process a little bit more and empowering people to know what's best for the systems after hearing the entire incident review. And it's not just us; we see our customers do that, too. We've gotten feedback from our customers that have used us for over a year that their action items are actually getting done because they're not Band-Aids; they're actually good. And they enroll people that are doing them in the creating of them.

ADRIANA: I guess that's what happens when you give people a safe space to actually work through incidents properly. At the end of the day, people feel empowered, and they feel like, okay, this is worth my time. This is worth working on.

NORA: Yeah, absolutely.

ANA: I think it also comes into play; it's like, you have all that information that you're not regularly seeing because you're capturing that delta like you mentioned, which is more intentional; it's not just busywork at that point.

NORA: Yep. Yeah. You're not just doing it to appease someone. You're doing it because you're trusted, and you feel empowered. And a lot of really cool things fall out of that, like better action items than any leader could come up with when you're focusing on that.

ANA: It's about treating adults like adults, right? [laughs]

NORA: Exactly, exactly. You don't need to babysit someone to get their stuff done. They know what they need to do. It’s about trusting them.

ADRIANA: Totally.

ANA: Y’all are just hurting my brain with these really big concepts.

[laughter]

ADRIANA: My mind is blown from today. I'm super excited that we got a chance to talk about this because I just love your take on all of incident management. It's so refreshing. And I feel like this is the way that our industry needs to keep moving. And so I love that we get to have this conversation and share this conversation with our listeners too.

NORA: Yeah, absolutely. And it's like I almost wish I could share my screen during this conversation.

ADRIANA: [laughs]

NORA: So it doesn't feel so abstract. But I think by the time this episode comes out, we'll have a lot more pictures and videos and stuff of our product on our website. And we have a few launches coming out in between now and then, so that should be a lot of fun.

ANA: That's so exciting. As we're starting to wrap up our conversation, apart from these learning opportunities at incidents, is there anywhere else that you think that we, as engineers or we as humans, should be learning from?

NORA: Yeah, certainly other industries. Incident analysis is a field of study that has been around since Three Mile Island happened, and the tech industry has so much to learn from the folks that study incident analysis. I feel like sometimes we try to recreate the wheel and think that a thing has never been done before. But there is so much stuff to learn from dating back almost a century now.

And so I think we can learn from aviation. We can learn from maritime. We can learn from healthcare in a lot of ways. And the tech industry has benefited in a way from this that others haven't, in that we aren't fully regulated. A lot of these industries are required to follow a runbook, or they go to jail, even if it means saving a life to not follow a runbook. And we don't have to do stuff like that.

And so I think some of them...I was in a Human Factors & System Safety master's program, and some of them were baffled at some of the stuff we were doing. They were like, "You don't have to legally do that, so [chuckles] why would you do that thing?" And so I think there's a lot of stuff we can learn from other industries. And I'll share some of the papers and observations from other industries as a link in this podcast.

ANA: I feel like every time I get to talk to you or folks that have gone through more learning of Richard Cook and just human factors, I'm always learning. I was at DevOpsDays Dallas, and I just learned about Three Mile Island. And I'm like, I've been working in this field for a bit, [laughs] and no one shared this with me. I was like, cool.

NORA: It's amazing. I have some great papers to send you too.

ADRIANA: The show notes are going to be lit.

NORA: Totally.

ADRIANA: [laughs]

ANA: I mean, that is the beauty. We, as a tech industry, have the pipeline set up for sharing a lot of knowledge, which a lot of industries don't. And it's easier for us to have less red tape. We do have red tape, but it's not as much as healthcare and firefighting and aviation.

NORA: It makes a big difference. We can get things done very fast. We have that benefit, too, and not just because of the lack of red tape; it's the fact that we have technology. And we are creating it and making it, and that is a huge benefit that we should really lean on. Yeah, that hasn't been thought of as much in the past with a lot of these incident analysis fields. And so combining those worlds could be really beneficial for businesses, for employees, for consumers using our technology.

ANA: Most definitely. I think that's the perfect nugget of gold to end on, like, go out there and be learning from this world that we really, as engineers, get to create stuff. You can prototype something within a week with your friends or, if you really want to, by yourself on a weekend. [laughs]

NORA: Yeah, exactly. Other industries don't always have that benefit. They have to spend a year trying to find someone that they're allowed to hire, and then they'll see if they can make it, and then they have to spend another year getting it approved. And we can just, like you said, whip it up in a week. And so we should do that.

ANA: I feel like funding for our projects comes a little easier or is a little less expensive than some of the other projects in other sectors too.

ADRIANA: It's true.

NORA: Definitely.

ANA: Well, thank you so much, Nora Jones, for joining us in today's podcast. We loved talking to you about just incidents, learning opportunities, changing the culture of operating systems.

Don't forget to subscribe and give us a shout-out on Twitter via @oncallmemaybe if you liked our episode or if you have any thoughts on today's episode. Be sure to also check out our show notes on oncallmemaybe.com for additional resources and connect with us and our guests on social media.

For On-Call Me Maybe, we're your hosts, Ana Margarita Medina.

ADRIANA: And Adriana Villela.

ANA: Signing off with peace...

NORA: Love and code.
