About our guest:

Based out of Toronto, Canada, Jason has been working in a variety of roles for the last 20 years in fields ranging from marketing technology to high-frequency finance. He loves helping teams make their systems and platforms more humane to deploy, operate, and reason about. He's currently working at Honeycomb as a Software Engineer building APIs and integrations but started at Honeycomb as a Customer Architect helping customers navigate the sociotechnical aspects of adopting observability practices. 

Jason is also a HashiCorp Ambassador.

Find our guest on:

Jason’s LinkedIn
Jason’s Twitter
Jason’s GitHub

Find us on:

On Call Me Maybe Podcast Twitter
On Call Me Maybe Podcast LinkedIn Page
Adriana’s Twitter
Adriana’s LinkedIn
Adriana’s Instagram
Ana’s Twitter
Ana’s LinkedIn
Ana's Instagram

Show Links:

Honeycomb.io
Red Hat Linux 5.2
Alias Wavefront
kubectl
10x Engineer
HashiCorp Community

Transcript: 

ANA: Hey y'all. Welcome to On-Call Me Maybe, the podcast about DevOps, SRE, observability principles, on-call, and just about everything in between. Today we're talking to Jason Harley. We're so happy to have you here joining us. Welcome. What is your beverage of choice today, this morning?

JASON: It is 10:17 in the morning here in Toronto. So I am drinking boring peppermint tea, two coffees behind me, and some sort of caffeine management regimen. I don't know.

ANA: [laughs] There's nothing boring about peppermint tea. It just sounds very soothing and warm.

ADRIANA: It's a much more exciting drink than my water. So...

ANA: I will say that it's almost a blend of both of y'all. I'm on a Yerba Mate mint tea. So I got a little bit of mint and a lot of caffeine.

ADRIANA: Would you believe I'm not a coffee drinker?

ANA: No, I cannot believe that.

ADRIANA: I do a great disservice to my culture [laughs] because I don't drink coffee, and I don't watch soccer. [laughter] I think I make a pretty bad Brazilian.

ANA: Jason, you have many years of experience in the tech industry. And you've been very successful across various companies. So now that you're coming on your first podcast, what an honor. We would love to talk to you about some of the things that you have done. We have listeners wondering how did you get into tech?

JASON: That's a great question. I installed Red Hat Linux 5.2 on my family computer when I was in grade 9 and erased all of my parents' stuff in the process.

ADRIANA: Oh no. [laughs]

JASON: I had no idea what I was doing whatsoever. But I was convinced that this was something possibly associated with some weird hacker identity that one sort of had if you were growing up in rural Canada and predisposed to computers. But I was always fascinated by technology. I was convinced I should do computer science. I only applied to one school and one program. And tech has been really...I wouldn't say I'm a computer scientist by any stretch of the imagination, but tech has been quite kind. 

I mean, our industry is rife with toxic bullshit. But at the same time, I think there's so much opportunity to grow and learn, to be curious, and connect with people. And all of these skills are immensely transferable. I've done a lot of different roles, and they're all sort of grounded in the same technological experience or principles. But yeah, I moved to Toronto after university and started working at a 3D graphics company, which was called Alias Wavefront or just Alias. They made the 3D graphics software that made Toy Story and Lord of the Rings and that kind of stuff possible.

I worked in the business department on infrastructure like email, and routing, and stuff like that for this global company. It was a lot of fun. We got acquired by a very large company, Autodesk. And I promptly left and went to a 30-person startup, which began the next crazy five years of my professional life. I was on call, and all of it...we were transacting money over the internet, and I don't mean like Venmo or Interac; I mean spot foreign exchange. 

So we made a market, which is to say that we set bid and ask prices for currency pairs, and we exchanged them. When I left, we were doing north of 10 billion a day in transactional volume. And we did that with a pretty small team of operations folks. My team ran the infrastructure, and we had a team that was specialized in running the company's software. So basically, if you plugged it in, we were in charge of it if it was written in-house, this operations group. But that was my baptism by fire in tech. And I've been in love with startups ever since. 

ADRIANA: That's so cool.

ANA: I think it's awesome to call it baptism in tech. Just like, yep, that was my parents' computer. That's my baptism.

JASON: Thankfully, my parents weren't super tech savvy. So they didn't have a lot of stuff on it or anything. But it was definitely the family computer, not my computer. And yeah. [laughs]

ADRIANA: That's definitely a good introduction to Linux. I think the first time --

JASON: Yeah, I learned that mounts are not pointers is really what happened there. Mounts are not pointers. [laughter] They're the real...block device means something. Who knew?

ADRIANA: So, having been in tech for a while, what's been the biggest change that you've seen through your career? I mean, I can speak for myself, like, the tech that I started in when I graduated university is definitely not the same tech [chuckles] that we're in right now. So, yeah, what's your perspective?

JASON: I think the biggest change I've seen is automation by way of programmatic infrastructure. I choose that very deliberately over something like cloud because a lot of enterprises are still able to automate in a big way with things like VMware, for example, or Microsoft Hyper-V. But being able to automate and codify configuration and infrastructure, I think, has dramatically changed the tech industry in the last ten years. And I graduated from university in 2004. 

If I was to tell my new graduate self something to do differently, I think in like 2005, I would have been like, you should really start to learn CFEngine (And there were early remnants of Puppet around that time, I believe.) because I came to that a bit later than I would have liked. I think it was 2012/2013 when I really started picking up Chef at the time, but I think that has been a game changer in terms of converging. We can bring software development practices to infrastructure. I'm working as a software engineer now, like, an actual one. I'm working on APIs and such. 

But I think from a reliability and a learning and a testing standpoint, that's really changed the game. Developers can come in and participate. We can say what we want about DevOps and the Wall of Confusion. And it's now a proper noun, apparently. I saw a job posting the other day saying that 90% of DevOps like this technology, and that's a wild one to even take apart its actual intention. But that kind of automation...I've spent most of the last eight to ten years using AWS in a pretty big way. And the stuff that you can pull together with that kind of tech is truly [chuckles] game-changing compared to the start of my career, to your point.
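
As a rough illustration of what "programmatic infrastructure" means in practice, here is a minimal Python sketch using boto3, the AWS SDK; the AMI ID, region, instance count, and tag values are placeholders for this example, not anything referenced in the episode:

```python
import boto3

# The same few lines provision one instance or a hundred; the "infrastructure"
# lives in version-controlled code rather than in someone's memory.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.micro",
    MinCount=100,
    MaxCount=100,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "managed-by", "Value": "code"}],
    }],
)
print(len(response["Instances"]), "instances requested")
```

Configuration-management tools like CFEngine, Puppet, and Chef, which Jason mentions, push the same idea further by making the desired state of machines declarative and repeatable.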

ADRIANA: Yeah, that stuff is trippy. I mean, considering that practically at the snap of a finger you've got a VM; you've got a Kubernetes cluster. It's like, what? Kubernetes wasn't even a thing when we were starting out. Docker wasn't a thing. I graduated school in 2001. Java was still nascent. It was the cool kid on the block at the time. So yeah, it's mind-blowing. And now we've got new languages like Go, and then old languages experiencing a resurgence, like Python being popular again. I think it's so cool to see that kind of thing.

JASON: It's wild, yeah. I graduated university knowing Perl and Java, and I haven't used either of those in a number of years. I've spent most of my days in Golang these days (Golang for search engines versus Go in real life, I guess). But yeah, it's been a wild shift. The other part of that that I think is fascinating, and this is to double-click on your point that with one call, you can have a cluster of seemingly infinite resources, or you can launch 100 virtual machines. The complexity that has come from that capability is also pretty wild. And this is something we talk a lot about at work. 

But thinking about operating software is very, very different now when you don't have the database and the app server. And I worked at places where we had the database or at least the finance database, and the customer database or whatever, and then some big, gnarly monolith, whether it was Java and WebSphere or some PHP thing. Those things still exist. They're still totally valid paradigms but managing that complexity has been the wildest part. 

And that's something that I think our two companies, Lightstep and Honeycomb, do a really good job of talking to people about. I was on call in a very stressful way for five years. I've been on call multiple times since, but that was definitely one of the most foolish things that I've done. It was a lot of fun. I learned a lot. But being able to reason about systems in their complexity, mental models don't work anymore because of these things. 

What I think I would also go back and tell my younger self, and it's something I try and explain to other people, is that mental models are fundamentally broken because the business and the industry demand such high rates of change now, and we achieve those through mind-boggling complexity, but we just talk about them as really abstract concepts. Like, yeah, you just run kubectl or kube C-T-L (Let's not start that debate.), and you push in your new manifest, and then the stuff happens; that's great. But when it breaks down, you need to understand what's going on. 
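
For the "push in your new manifest and the stuff happens" step, a minimal sketch using the official Kubernetes Python client might look like the following; the Deployment name, namespace, and image are hypothetical, and this is roughly what `kubectl set image` does from the command line:

```python
from kubernetes import client, config

# Load the same kubeconfig credentials kubectl would use.
config.load_kube_config()
apps = client.AppsV1Api()

# Bump the container image on an existing Deployment; the control plane
# then rolls the pods forward ("the stuff happens").
patch = {"spec": {"template": {"spec": {"containers": [
    {"name": "web", "image": "registry.example.com/web:v2"},
]}}}}
apps.patch_namespaced_deployment(name="web", namespace="default", body=patch)
```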

In a past role at Honeycomb, I was actually in customer success. I've worked with a lot of our bigger customers through some of both the technical and social challenges of operating complex software. And the hardest part there is to get people to understand that it's okay to not know everything. And in addition to that, you can't know everything. And you just need to accept that and ask questions and be curious. And that is what makes "10x" quote, unquote, engineers: your ability to roll up your sleeves and collaborate and ask questions. Those are such exciting things.

ANA: It's very true; those mental models, there's no way to hold them in your head. They're constantly changing in a way that you think you have your architecture diagram mental model built out, but then you have two others go push out a whole new deployment. And that deployment had so many other dependencies, and now it's all changed, and you don't even know about it.

JASON: And that's the dangerous bit, right? Operating with a false model is inarguably worse than operating with no model at all. And then there's a lot of ego tied in with that; I mean, we could all do with less. Everybody's got some ego to manage, I truly believe.

ANA: It reminds me of like the on-call hero where they just really want to save the day. And you just really want to put your name out there.

JASON: I am a recovering on-call hero for sure. But in my past roles, we didn't have tools to work through that in the same sort of way. Without observability, much more active documentation, and actually talking about empathy across company boundaries, the reason that is such a thing is that we created it. It was a gap that needed to be filled. And without better solutions, I think that's just what happened. And that's not to say it was good; it was terrible. It held companies back. 

But having lived through that role, it took too much time, too much effort, too much stress. And I think all of those things ended up being functions of our inability to actually collaborate and share those models in a meaningful way. I think we're doing a lot better with that. We still got a long way to go as an industry. But that, to me, I think, is the next big shift. 

We can make 1,000 computers with a single API call. I think we're getting to the point where we can now start to ask questions and operate more complex infrastructure. We say at Honeycomb that software is a team sport, whether you're SaaS or you're shipping binaries. The medium to be a team is what's really exciting to me these days.

ADRIANA: Going back to the on-call hero thing, as you said, it was almost a necessity because that person had the domain knowledge. But with the tooling and the practices that we have these days, it becomes less and less necessary. So it not only relieves the pressure of the on-call hero, but it also gives them room to grow in different ways than they would have before. Because otherwise they're just stuck in this world of domain knowledge, and that's it, that's your life. 

And it also didn't give room for the more junior people to get up to speed because it's like, get out of the way. Let the big people work on the problem. And now it's like observability has made this an equal opportunity playing field again, back to the whole thing of the team sport.

JASON: Exactly. And it was just a vicious, self-reinforcing cycle. It has so many drawbacks. But, I mean, a lot has been said about hero culture. I cannot do it justice. But that very short-sighted view with the big dopamine hits and feeling like you saved the day, that's great, but it was not great even in the medium term. And for scaling engineering companies and scaling delivery as software teams, that is 100% an anti-pattern these days.

ADRIANA: Going back to your on-call days, so I guess two questions: how was it before when you were in the early parts of your career where you were doing on-call a lot and things were less evolved practice-wise, tool-wise? And have you been on call more recently now under our newer, more evolved way of...or at least as we all try to practice a more evolved way of doing on call?

JASON: Yeah, I've been on call more recently. I have just switched roles in Honeycomb into the engineering organization. And it's a new team. We don't yet have on-call responsibilities, so that's going to change. Engineers at Honeycomb are on call for their area of ownership. And I can honestly say for the first time in a very long time, I'm very comfortable [laughs] accepting that because it's such a different culture, steeped in both good engineering principles and good observability principles. 

So I'm actually a little bit excited, dare I say, about on-call because it truly is a learning opportunity for how this stuff comes together and how our customers actually use our product. It is not “I'm going to be woken up at 3:00 a.m. at least three times in a seven-day period” to deal with something that is probably trivial. 

Or, based on past on-call experience, it was either trivial, or the company was literally melting. And I honestly have this memory (memory being the fickle thing that it is) that it was like, this disk is going to be full, and honestly, a shell script should have been run that compressed a bunch of files or removed a bunch of files. 
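
The fix Jason describes, a script that compresses or removes old files before the disk fills, is roughly this in Python; the log directory and retention window here are made-up values for illustration:

```python
import gzip
import os
import shutil
import time

LOG_DIR = "/var/log/app"   # hypothetical path
MAX_AGE_DAYS = 14          # hypothetical retention window

cutoff = time.time() - MAX_AGE_DAYS * 86400
for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    if not name.endswith(".log") or os.path.getmtime(path) > cutoff:
        continue
    # Compress the old file in place, then drop the original.
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
```

Run from cron (or a systemd timer), this is the kind of page that never needs to wake anyone up.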

And the other part of it is that all of the production database arrays have just gone offline [laughs] because they're 1,100 miles from you. And you now have to wake up who knows how many people. I mean, that is still a part of a lot of people's jobs. And I don't mean to trivialize that but not knowing even what my week was going to be like. And we talk about unknown unknowns in observability. In infrastructure, there are new levels. Sometimes I think, you know, I worked at a place where a UPS truck literally pulled down the single power line going into the facility. 

ADRIANA: Oh my God. [laughs]

JASON: I'm not making this up. I don't know how it happened.

ANA: Talk about a single point of failure.

JASON: Yeah, it was a data center in Florida. And we got contracted into it through a responsibility, and a delivery truck literally clipped the power line, somehow. We never got a full story. And all that gear was just dark. We found out there weren't generators. We weren't given a chance to do appropriate diligence in this place. It was just contractually we had to put some stuff. And yeah, so that stuff happens. That's a bad week at work. Bad weeks at work always happened. The modern analogy for a lot of people might be like a cloud region going down. Like, that's a bad day at work.

ANA: That little story is just like everything in technology. We thought we had a backup. We thought we did this. But if you're not continuously getting into the cadence of running a practice around these items, preparing folks for these moments of failure and making sure that you're rehearsing the catastrophic scenarios you think will never happen, then at some point, every three months, six months, however it makes sense for your organization, you have to ask: what do we do if all these little single points of failure that we have within our organization and our infrastructure actually happen? Because they're just going to break. Failure is inevitable. And like we've talked about, our systems are so complex. It's just mathematically going to be there.

JASON: Absolutely, yeah. And as someone who proudly was a systems administrator for a very long time in their career, abstracting away a lot of infrastructure can make these problems worse. I love being able to launch 1,000 machines with a single API call. My number is going up the more we talk. [laughter] I love being able to launch a million machines with a single API call. 

The data center and the physical aspect have really turned into much more of a discipline that we're now outsourcing to these large companies. I am positive that the Amazon data center folks do data centers better than I ever could or ever had the budget to do, and I'm comfortable with that. 

But to your point, planning for disaster as a business is now a new kind of complicated. And for a lot of small companies...and I have worked in a lot of small companies. I've worked with a lot of startups. And I am usually the person in the room who's saying that I don't think we can actually pay for that because to do that kind of resiliency well costs a lot of money. And it's really easy for the "business," quote, unquote. I think that software teams and infrastructure teams should always align with business goals. 

But it's a really easy thing to say we need to be able to respond to this kind of failure. I agree that would be a wonderful thing to be able to do. But a rough cost breakdown on being region-resilient on Amazon will break those businesses' budgets. 

ADRIANA: Yeah, totally. 

JASON: And that's the part that I think people don't talk about enough. I actually think us-east-1 is fine. People say a lot of bad things about us-east-1 on AWS. It is more reliable than most data centers that I used when I owned physical infrastructure. The same is probably true for other folks listening. But doing full redundancy, East Coast and West Coast in North America, or America and Europe, for failover or even for geo stuff, the complexity there to not actually just make a bigger mess and an interdependent cascading failure is huge for a lot of places. And, this is anecdotal, but it's probably like 3x your infrastructure cost. And it's probably just not worth it. So you take the downtime, which is an interesting business decision to make, right? 
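
To make the "probably just not worth it" trade-off concrete, here is a back-of-the-envelope sketch in Python; every number is an illustrative assumption, not a figure from the episode:

```python
# All figures are invented for illustration.
single_region_monthly = 40_000                       # current AWS bill
multi_region_monthly = single_region_monthly * 3     # the rough "3x" rule of thumb
extra_per_year = (multi_region_monthly - single_region_monthly) * 12

revenue_per_hour = 5_000                 # what an hour of downtime costs the business
expected_outage_hours_per_year = 8       # roughly one bad regional day

downtime_cost = revenue_per_hour * expected_outage_hours_per_year
print(f"extra spend for multi-region: ${extra_per_year:,}/yr")
print(f"expected downtime cost avoided: ${downtime_cost:,}/yr")
# If the first number dwarfs the second, "take the downtime" wins,
# unless regulation or contracts force the issue.
```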

ADRIANA: Yeah, that's a really good point to make.

ANA: The whole conversation around the cost of these nines that we're chasing ends up being the disconnect between the IC practitioner and the business. In a lot of these companies and startups that I work with, it's like, guess what? I really want my six nines of availability and reliability. We are going to be chasing these, and we can't have more than a few minutes a year of downtime. 

And it's like, you need to have built an infrastructure with so much redundancy, and this takes a large team. You need to have built up the knowledge of this engineering team to even build that infrastructure. The projects are so large and complex. And then, these engineers need to be on call with some form of practice. Like, these are the steps to take and the actions to bring back the systems to a failover data center.

JASON: Exactly. That's just it, like, all of those things cost real money, both capital and operational. And at the same time, and this is perhaps the most surprising part when actually talking through this with product and business stakeholders in my experience, the amount of time taken from feature work, to use that term, to actually do this stuff well is a real engineering effort, to your point. 

And then, back to the point you made a little while ago, you have to test this stuff to make sure it works because it would be even worse if your costs went up. Let's say you did a pretty cheap version with cold standby region stuff; if you're not testing that every six months, especially if you're shipping frequently, it's probably not going to work when you need it anyway. And then it's a huge waste of people's time and of the money it took to leave it there being useless, frankly.

ANA: You mean that the region failover playbook if it hasn't been run or updated in six months...I actually didn't touch it, like, that --

JASON: It depends. Do you only ship once every six months? 

ANA: Well, we do 120 Kubernetes deployments a day. Is that okay? Is that reasonable?

JASON: It's like patching the control plane, though, isn't it?

ANA: [laughs]

ADRIANA: One thing that I wanted to bring up because you were talking about testing your failover...and I've worked at a large organization where every six months we had to test our failover. And one thing that came out of that was, A, nobody enjoyed that. And it was almost to the point of like --

JASON: And you probably did it on the weekend, too, right? 

ADRIANA: Yeah, it was on the weekend. And it was like the lucky chosen few who had enough domain knowledge to help test the failover. So that was kind of a nightmare. And every time we did it, something went wrong. So then, it made us dread these failover tests. And I think these types of things are necessary, but it was almost like people were so fatigued and traumatized from these failover tests that I think a lot of people lost sight of why we were doing them. At that point, you're just ticking off a little checkbox to satisfy somebody. But, at the end of the day, were you actually doing a proper test? Because the goal was to get this thing done as quickly as possible.

I remember at the organization where we did that; we had a whole infrastructure team that was responsible for doing a bunch of these at the same time. So these guys were spread thin doing four or five DR tests at the same time. And then having to wait for them to get back to whatever the next step was, making sure that the DR playbook was up to date, which oftentimes it wasn't, especially when we had major architectural changes, [laughs] it definitely wasn't. 

JASON: I was going to say the interesting output from a lot of what you're describing is that people are also scared to make changes in the system, right?

ADRIANA: Yeah, yeah.

JASON: Because there's this negative feedback loop because it means we have to update the DR plan, the DR policies, like all of this stuff. And I think that we're at this interesting point right now as an industry where there are big business costs to doing this kind of DR work. I mean, setting aside the fact that it's actually easier to set up multi-data-center resiliency within a single Amazon region than it ever was in my past life. 

I had data centers on opposite sides of metropolitan areas, but an availability zone in Amazon is already that, kind of. So I can just set my Auto Scaling group to spread capacity across the AZs, and I'm in three data centers, wicked. Better than I ever was in the past. But we now need to balance shipping and feature velocity against this kind of resiliency. And I think we've maybe even overcorrected a little bit. 
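
A minimal sketch of "set my Auto Scaling group to spread capacity across the AZs," using boto3; the group name, launch template, sizes, and zone list are assumptions for the example:

```python
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

# One call places capacity in three availability zones, effectively three
# data centers, and keeps it balanced as instances come and go.
asg.create_auto_scaling_group(
    AutoScalingGroupName="web",                                   # hypothetical name
    LaunchTemplate={"LaunchTemplateName": "web", "Version": "$Latest"},
    MinSize=3,
    MaxSize=9,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
```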

But the side note there is that not every business can afford or even really needs to think about it until they're at a certain point that they need to do those things. And in my opinion, it's bad for business to be down for a day, sure. But the cost of that day, like, it's a cost-benefit analysis with an overlay of regulation, I think is really what that stuff ends up boiling down to for modern companies. 

It sounds like we have a fairly similar experience there in terms of being forced to do those plans. I worked at a place where the federal government was our biggest customer, and they mandated a lot of this stuff from service providers. And I definitely came in on the weekend and did the DR failover. Matter of fact, I think we did it at 3:00 a.m. on a Friday. It's when everybody's like thinking their absolute best. And you're trying to remember how to catch up a database replica in another data center and find out that that network route's actually been down, and wake up the network guy because they assured you that they didn't need to be on the call.

ADRIANA: Of course, yeah. It's like you've lived the same life. [laughs]

JASON: Yeah, I don't miss any of that stuff. 

ADRIANA: Yeah, totally. 

JASON: The good part, again, I said at the beginning, we have a lot of room for improvement as an industry. And we're at least starting to be a little self-reflective on that these days. But starting to actually look at these pain points and do something about them is encouraging. It's always an overcorrect-recorrect sort of cycle. I think that's just how humans work.

ANA: Is there any kind of work that Honeycomb currently does that requires you to be on call?

JASON: There is a whole procedure that I have not read all the details of yet. But on-call at Honeycomb is voluntary. And I'm paraphrasing; I don't remember all the details. But you start by shadowing someone who has been doing it for a while in your area of expertise. And when you start to feel comfortable owning the primary responsibility, you do. We are growing as a company such that the shape of on-call is also changing, though. For a very long time, this has been written about by people like Charity and other folks who've been at Honeycomb for a long time.

Honeycomb really had like two engineering teams. We have more now because we're restructuring to scale our business, which is a great problem to have. And as a result of that, the shape of on-call needs to change as well because there isn't a platform team and a front-end team [laughs] anymore, which we had for a really long time. And I think it is a testament actually to the sort of thing that observability can unlock.

ADRIANA: I think it's really cool that you...I want to go back to what you said about basically revisiting, redesigning on call, like, changing things around because your organization is changing. And I think that's such an important thing to do. Because I think a lot of organizations continue to change, but then they don't change their processes to mirror that.

JASON: Oh yeah. And it's just like some ops engineer on call. Yep, just like keeping a bad mental model of your system is not any way to try and troubleshoot it or reason about a failed state, I think your on-call practice needs to actually reflect the state of your system as well. Asking teams to own areas they know nothing about...and a lot of organizations are aiming to ship multiple times a day, 120 Kubernetes deployments a day. [laughs] 

So if you had ten teams and each team on average was doing 12 of those, there's no way that being on one team, you can...you can't show up in any sort of reasonable way for the other 90% of that change set. It's not possible, I don't think, no matter how good of an engineer you are or even how good you are at diving into your telemetry data. You might be able to find something of interest, but the amount of context that you need to bring along with you to tease that apart, that's a lot to ask of someone.

ANA: I think it's amazing to see the industry start getting there. I've been following the work of Honeycomb for so long. And I knew that on-call there was not the traditional startup on-call, which is something I always hold dear to my heart. I come from one of the stories of getting thrown a pager on your second week and getting told, "Good luck. Here's a runbook." 

And on your fourth day, when you're actually getting paged at 4:00 in the morning, and you open up that runbook, it's 280 days old. And you're like, they told me good luck executing this. This is good. Everything is fine. Everything is fine. I still have a job. It's going to be okay. See you at 8:00 in the morning.

JASON: Right. And you're just set up to fail and set up to resent the experience. It should be a learning experience. I've been at Honeycomb for 16 months now, I think. And I also read the material for years and watched the way that they talked about things. And it's really humbling to be able to participate in it. We're a group of people, so that comes with the caveat that we are flawed and we'll make mistakes. But there really is a feedback loop of trying to be better and trying to keep improving these things. 

We made our first SRE hire early in 2021. So Honeycomb existed for a number of years without someone whose primary responsibility was reliability and the broad practice that is SRE. We hired based on the Google definition of the role, so consulting and then working through reliability and finding the tooling, the process, and the pain gaps. And it's been really neat. We hired someone called Fred Hebert, who's another Canadian. He is very good at teasing things apart and finding patterns and raising them, and spurring conversation. 

We've been running internal sessions to talk about the experience of on-call, and he actually wrote a blog post recently about an exit survey from your on-call shift, where he would just sort of, in a very informal fashion, ask...I think it was five questions. What he ended up doing is on the Honeycomb blog, but the idea was to just try and get even a quantitative measure of someone's opinion exiting that shift. And it sounds like we've had some shared experience there. [laughs] And it would have been great to have been asked that. [laughs]
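
The real survey is described on the Honeycomb blog; purely as a sketch of the general shape, a five-question, 1-to-5 exit check-in might look like this in Python (the questions here are invented for illustration, not Fred's):

```python
# Hypothetical questions and scale; the point is a lightweight, repeatable
# signal at the end of every rotation, not a formal review.
QUESTIONS = [
    "How stressful was this rotation?",
    "How many pages felt actionable?",
    "Did you have the context you needed?",
    "How much sleep did you lose?",
    "Would you change anything before the next shift?",
]

def collect_exit_survey() -> dict:
    answers = {}
    for question in QUESTIONS:
        answers[question] = int(input(f"{question} (1-5): "))
    return answers

if __name__ == "__main__":
    results = collect_exit_survey()
    print("average score:", sum(results.values()) / len(results))
```

Tracking even a crude average like this over successive rotations is enough to spot a shift that is trending worse before people burn out.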

ADRIANA: It's such a simple concept. How did you feel after you did on call, and it's like...yeah, I read the blog post. And I thought, wow, that's such a cool take on it.

JASON: And it's like, it is not a technical solution. That is like sociological research. [laughs] People matter, and people make the system.

ANA: People are part of the system. Like, it's so complex, and there are so many pieces to it, and they're all moving. You alluded to this earlier, putting empathy across company boundaries, putting empathy between engineering, sales, and our customers. Engineers are running these large systems built on open source. We're doing 10,000 deployments on Kubernetes, and these clicks touch so many resources. Someone is pressing these buttons. There are contributing factors to everything that happens on a complex system.

JASON: Yeah. And, I mean, it's really easy to forget that when you're using an app on your phone. I'm guilty of it. I definitely rage at some apps occasionally on my phone [laughter] when they're not doing what I think they should. But we're making this for us. Capitalistic overlay aside, we need to keep that in mind. And the only way we can be better is to come back to that and ask those questions. 

You can and probably should change your on-call based on how it is going. That isn't to say that you should be immediately reactive. But if you notice a negative pattern, it's worth having a conversation about that. And in all seriousness, that is supposed to drive product work. 

Reliability is a function and a feature that your customers expect. We just talked for 10 minutes about DR. Like, if we're going to worry so much about honestly very rare catastrophic failures, why don't we talk about the day-to-day failures and their impact on both our customers and our employees, which is to say your teammates? And how can we make those better? That should absolutely be work that gets funneled in, and that's not taking away from feature work when done well. Delivery to customers and delighting customers is a constant balancing act. And that's what makes SaaS, in particular, really fun, I think.

ANA: I have this talk that I've yet to write around how the work of inclusion, diversity, equity, and reliability always gets put last and is never really a priority. But these are things that really matter, and they make for a great workplace. And when things go wrong, it costs a lot, like the cost of downtime, which we talk about often. But then you also have the cost of losing talent and not uplifting that talent, promoting them and mentoring them, and not having anti-discrimination and anti-harassment policies and bias training. Those things are like buckets of costs. 

JASON: They absolutely do.

ANA: Millions of dollars are lost, whether it's a lawsuit on discrimination or a cost of downtime. And a lot of it happens because leadership doesn't care about it. So we're going to have more conversations about it. So, hopefully, a blog post and talk will be coming shortly. [laughs]

JASON: I look forward to reading it. Yeah, you're absolutely right. And companies are created to create value for customers, users, and shareholders. But they are made up of people and those connections and those relationships ultimately. And anybody who talks fondly or poorly about a place that they work is never talking about the shareholders or customers. They are always, always, always talking about their teammates. And whether it was a particularly gnarly bug that you dug through with someone or by yourself, that can be a great thing, but it can also be an isolating experience; that kind of supportive social environment matters hugely for on-call and just for day-to-day.

ADRIANA: Yeah, absolutely. It boils down to remembering that there are humans that are working behind the scenes to make this stuff happen and to cherish that work being done because there are people who are putting everything into their work to get something running because they care enough to keep it running. So why not show them the respect that they deserve for the effort and caring that they put in, right?

JASON: Exactly. 

ADRIANA: Cool. Well, I think we're coming up on time. This has been a really awesome conversation. I love where we ended up.

JASON: It was a rambling journey. It was a lot of fun. Thank you for having me.

ADRIANA: I think it's such a worthwhile topic, and I think it definitely needs more exposure. So thank you for exposing it further. The idea of the human behind the computer, we all need some TLC. We need to evolve our SRE practices. I think these are all really, really great things that organizations need to keep in mind. And I think knowing that there are companies out there that are doing that gives me lots of hope because it tells our listeners these things exist; it's not a figment of your imagination.

JASON: Absolutely not. It's also not by accident. It is a choice and one that needs maintenance, just like your software.

ADRIANA: Yeah, exactly, exactly.

ANA: All the gold nuggets at the end of the podcast.

ADRIANA: I know, right? Like, deep thoughts. [laughs]

JASON: Thank you again for having me. This was a lot of fun. I really enjoyed the conversation.

ADRIANA: Cool. Thanks for coming on. Jason, is there anything else that you would like to tell our listeners, cool things that you're a part of, things that...?

ANA: Where can we find you on the interwebs?

JASON: I am terrible at Twitter because it won't let me correct my typos. 

ADRIANA: Oh yeah.

JASON: So you can follow me if you want. 

ADRIANA: [laughs]

JASON: I'm @redmind, R-E-D-M-I-N-D. You can find me on LinkedIn, where I also post sporadically, but they'll at least let me edit my typos. I am jharley on GitHub. Reach out if you want to have a chat about any of this or tell me that you disagree about any of this. But yeah, in terms of other things I'm working on, I am trying to be more active in the HashiCorp community. 

ADRIANA: Woohoo.

JASON: The pandemic made that feel a little bit strange. I think a lot of us are over-Zoomed, but I'm finding places to contribute there in the intersection of a lot of what we talked about here. 

ADRIANA: That's awesome. For our listeners, don't forget to subscribe and give us a shout-out on Twitter via @oncallmemaybe. And don't forget to check out the show notes on oncallmemaybe.com for more resources and to connect with us and our guests in case you weren't fast enough with the pen to take down Jason's Twitter and LinkedIn handles. Signing off, we are your hosts, Adriana Villela and...

ANA: Ana Margarita Medina.

ADRIANA: Signing off with peace, love, and...

Both: Code.
