About our guests:

Gabriel is an Observability Engineer at Wavelo, based in Belo Horizonte, Brazil. Prior to his time at Wavelo, he spent several years as a DevOps Engineer working for large Brazilian e-commerce providers.

David Alfonzo is the manager of the Platform Solutions team at Wavelo. He is laser-focused on how SRE and software development can make the internet better. During his work journey, he has worked in a variety of roles, from support, web development, sys admin, infrastructure, security, DevOps, and management. He is based in Toronto and enjoys long walks with his wife Durley and the hot summer weather while it lasts!

Find our guests on:

Gabriel’s LinkedIn
David’s LinkedIn

Find us on:

On Call Me Maybe Podcast Twitter
On Call Me Maybe Podcast LinkedIn Page
Adriana’s Twitter
Adriana’s LinkedIn
Adriana’s Instagram
Alex’s Twitter
Alex’s LinkedIn

Show Links:

OpenTelemetry.io
OpenTelemetry Collector
Cloud-Native Observability with OpenTelemetry
Tucows.com
Wavelo.com
HashiCorp
HashiCorp Nomad
Metrics
Prometheus
OpenTelemetry Prometheus Receiver
OpenTelemetry StatsD Receiver
OpenTelemetry Jaeger Receiver
OpenTelemetry Zipkin Receiver
SLO (Service Level Objective)
OpenTelemetry Protocol (OTLP)
CNCF
OpenMetrics
KubeCon

Additional Links:

O11ycast Podcast Episode: Ep. #54, Cloud Native Observability with Alex Boten of Lightstep
Alex Boten on Medium
David Alfonzo on Medium

Transcript:

ADRIANA: Hey, everyone. Welcome to On-Call Me Maybe. I am your host, Adriana Villela. And today, I have a special guest host with me, Alex Boten. Alex, why don't you talk about yourself a little bit here?

ALEX: Hi, everyone. I'm Alex. I'm an OpenTelemetry Collector contributor and maintainer, and I'm also a contributor to some other projects within OpenTelemetry. I'm a senior staff software engineer at Lightstep. And I'm also the author of Cloud-Native Observability with OpenTelemetry.

ADRIANA: Awesome. And today, Alex and I are going to talk to two former colleagues of mine from Tucows. We have Gabriel Fonseca, and we have David Alfonzo. So, guys, why don't you introduce yourselves? Let's start with Gabriel.

GABRIEL: Hey. Hello, everyone. I'm based in Brazil, and I have been working with Tucows in the observability team since last November. And we have been trying to move to OpenTelemetry inside Tucows and get everything it has to offer.

DAVID: Nice. I'm David Alfonzo, and I'm actually based in Canada and working closely with the observability team as well. I'm the manager for platform solutions within Wavelo at Tucows. I'm pretty much doing the same, working on the platform, specifically trying to get OpenTelemetry working, and we've had some adventures with that.

ADRIANA: Cool, awesome. So the reason why we have both David and Gabriel here is that full disclosure, they both used to work for me. I used to be their manager. So David now has my old position as manager of the platform solutions team. And I also managed the observability practices team at Tucows Wavelo, and Gabriel was one of my hires. And as part of our mission to bring observability into the organization, we wanted to go the OTel route. 

David and Gabriel are going to talk about some of their adventures in basically bringing OpenTelemetry to the organization and specifically around the OTel collector. So why don't we start with the OTel collector? What were your experiences in running the collector? And specifically, what was some of the architecture that you guys had to deal with? Because you guys aren't a Kubernetes shop like most of the world out there; you guys are a Hashi shop. So what were some of the challenges in running the collector?

GABRIEL: Well, I think maybe the first one was how to deploy it in Nomad and figure out all the small things that had to change, but Adriana got it working perfectly for us in the first place. So we are still running this in pre-prod. We have tested it with several teams, with the OTel Collector on our side. David's team is one of them; they are sending a lot of metrics to us from all the Nomad nodes we have. Maybe you can talk more about that, David; how does that work for you?

DAVID: We pretty much tried it all, to be honest. [laughs] One of the things that we did is like you said, we are a HashiCorp shop. So we use Nomad, Consul, both, you name it. Anything HashiCorp, we have our fingers in it kind of deal. So part of that is how do we get metrics from the right places in the right ways? So the goal for us was going directly from the host all the way to our vendor. 

But in order to do that, we need to do a few steps. We started with...on the host specifically, we collect the host metrics, and we name the host metrics right at the beginning. We figured out the hard way that the closer you are to the data, the easier it is to work with the collector. You can name things, for example, directly on the collector, and you can set all this stuff directly that way and then send it to another collector if you like.

In our case, we have, you can call it a gateway collector or a central collector for Nomad specifically. So we send from the host to this middle-layer collector, then to Gabriel's team's collector, and then from there, ship it to the vendor. And we find that works really well for a few reasons. The first one was it was very simple to set up. The con part of that is you have to set it up in multiple places. And we use configuration management, which makes life way easier to set everything up.

The other piece, why we decided to go directly to the host instead of, for example, scraping from a Prometheus endpoint: we decided to go directly to the host because, one, we can name things directly on the host. Like if I'm going to say Nomad client one, then I know that it is coming directly from the collector on Nomad client one. And so I don't have to do shenanigans or complex setups and stuff like that later on in the other collectors. So that's some of the things that we did in order to get all the way down to our vendor.
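
For listeners who want a concrete picture of the agent layer David describes, here is a minimal sketch of a per-host collector config. All endpoints, attribute names, and values are illustrative placeholders, not Wavelo's actual setup:

```yaml
# Agent collector running on a Nomad client (illustrative sketch).
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
  # Name things as close to the data as possible, e.g. which Nomad client
  # this collector runs on.
  resource:
    attributes:
      - key: nomad.client          # hypothetical attribute name
        value: nomad-client-1
        action: upsert
  batch:

exporters:
  otlp:
    endpoint: gateway-collector.internal:4317   # placeholder middle-layer collector
    tls:
      insecure: true   # fine for same-host or lab traffic; see the TLS discussion later

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resource, batch]
      exporters: [otlp]
```

The point is the one David makes: the naming happens on the host, where the context is known, and everything downstream just forwards OTLP.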

ADRIANA: One of the things that I remember from back in the day is that when you started ingesting metrics from the Hashi stack, you guys were using a really early version of the Prometheus receiver. So what were some of the challenges around that? Because my dream when I was there was basically to get rid of Prometheus. So I'm like, I really wanted to make that work. [laughs] But what were some of the challenges, and now that I think that receiver is a little bit more mature, how is today's receiver different from last year's receiver?

DAVID: Full disclosure, we were using Prometheus before, like Adriana mentioned. For us, it was like, how do we use the collector to do the exact thing we were doing before? We wanted to grab that just to take Prometheus out of the picture and then put the collector right in between kind of deal. And now the problem with that is, if you're deploying the collector the way that we were doing it, we were using Terraform, like we said, by HashiCorp, and then configuring the collector's Prometheus scraping, then you need to use multiple variables on top of each other, and it becomes a mess.

It was very challenging to get it working, especially in the early stages of the Prometheus receiver in the collector kind of deal. And on the OpenTelemetry side as well, we were going directly for the Prometheus receiver itself. So those are some of the challenges we ran into. I don't know, Gabriel, did you run into anything specifically on your side?
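
For context, the Prometheus receiver embeds a standard Prometheus scrape configuration inside the collector config, which is part of what makes templating it through Terraform and Nomad variables awkward. A minimal sketch, with a placeholder target and job name (Nomad does expose Prometheus-format metrics at its /v1/metrics endpoint):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nomad               # placeholder job name
          metrics_path: /v1/metrics      # Nomad's telemetry endpoint
          params:
            format: [prometheus]
          static_configs:
            - targets: ['127.0.0.1:4646']
```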

GABRIEL: I remember at some point, we had some issues with the collector breaking because of some misconfiguration of the receiver. But I'm not exactly sure what that was. But the Prometheus receiver has replaced almost all Prometheus, I think, in the company. There's probably still Prometheus running somewhere because there's a lot of stuff running there. But I'm not aware of anyone that is using that part. So we got rid of most of that, at least. 

And also, on the metrics side, another thing that we use a lot is the StatsD metrics format. And for that, we also set up the StatsD receiver. And luckily, the OTel folks added label support to it from DogStatsD. This was really great because it enabled us to remove a lot of garbage metrics that we had. We still have some, but we're working on it. So we renamed the metrics and replaced parts of the names with labels and so on, restructuring that instead of having huge names. And it has been working very well so far. Besides the Prometheus receiver, we are also using StatsD broadly.
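
The label support Gabriel mentions refers to DogStatsD-style tags, which the StatsD receiver can parse into metric attributes instead of forcing everything into the metric name. A minimal sketch, with illustrative values:

```yaml
receivers:
  statsd:
    endpoint: "0.0.0.0:8125"      # default StatsD port
    aggregation_interval: 60s
    enable_metric_type: true
```

With tags, a (hypothetical) line like `payments.requests:1|c|#service:billing,region:br` arrives as a `payments.requests` counter with `service` and `region` attributes, rather than those values being baked into a huge metric name.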

DAVID: Alex, I'm wondering, how did you guys, like, did you run into something similar to this? And how do you guys fix it?

ALEX: Unfortunately, or maybe fortunately, I don't know, like every other large-scale deployment, internally at Lightstep, we also run multiple metrics solutions. And I think we're still in the process of migrating some of them. But we're using the collector: we're using the StatsD receiver, the Prometheus receiver. And also, I think we're just at the end of removing another component that we called...it's basically a metric proxy for StatsD. 

It's taking some time. I think the collector has matured significantly in the past year. And so, there was a lot of hesitation at the very beginning to migrate on to the collector because it was still very much in development. But now that it has support for so many different formats, it's become a lot easier to manage.

ADRIANA: So, Alex, is there a specific component of the collector that you work on?

ALEX: No, just a little bit everywhere. I'm a maintainer on the core collector, which is the area that I'd like to focus on as much as possible because there are only a handful of contributors there. But the contrib repository is the busiest OpenTelemetry repo by a magnitude of, I think, 5X over the next one. So there are a lot of people that are really interested in getting their PRs in, and so I'm trying to help as much as possible there.

ADRIANA: Cool. That's awesome. Maybe for folks listening in who aren't familiar with the two repos, can you explain the difference between the core collector repo and the contrib repo?

ALEX: Yeah, the core collector repository is where the functionality that's maintained by OpenTelemetry lives. So anything that's open source, like the Jaeger receivers, the Jaeger components, or Zipkin components, or Prometheus components, will eventually move to the OpenTelemetry core repository. And the contrib repository is a little bit more open as to what it accepts, so it accepts contributions from vendors and contributions from individual users. Basically, if you're looking for something that's outside the core supported formats in OpenTelemetry, you'll find it in the contrib repo.

ADRIANA: Awesome. And I'll put in a plug in the middle here saying to anyone who wants to contribute to OpenTelemetry: the OTel folks are always looking for contributions, so don't be shy. You can get started anywhere, even with the docs. That's where I got started. And the collector is written in Go, right? 

ALEX: It is, yep.

ADRIANA: Yeah. So if you're a Go pro, then consider contributing to the collector and collector contrib. Cool. I guess another thing that I wanted to touch on with regards to the collector is I know when I left the observability team at Tucows, you guys were starting to get the collector ready to run as a gateway, as a centralized gateway for basically ingesting OTel data from various applications. 

So maybe if, Gabriel, you could describe some of the things that you needed to do to make, I guess, the OTel collector gateway more productionalized, some of the considerations that you had to make to make that happen, and where you guys are at with that now.

GABRIEL: Sure. One of the things we had to do was move some stuff closer to the teams, like the Prometheus receiver, for example, because most of the collector is stateless, but some components are not, like the Prometheus receiver. And what happens is that if you have more than one copy of the collector running with the same configuration, you have your hosts scraped more than once, and then you have to deduplicate that later. 

So one thing we decided was not to have those Prometheus rules on the gateway side, so things just pass through there. We add some tags. We can do filtering and things like that. But we do not have any stateful module there. Other than that, I think everything else was almost straightforward. We had to take care of some memory configuration. There are some ballast configurations for the collector, memory limiters, batch processors, things like that, to make sure it doesn't break. 
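
A sketch of what a stateless gateway collector along these lines might look like, with the memory limiter and batch processor Gabriel mentions. The limits, attribute names, and endpoints are made-up placeholders to show the shape, not recommendations:

```yaml
# Gateway collector: stateless pass-through with protection against overload.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:            # should run first in the pipeline
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 300
  attributes:                # add tags centrally, e.g. the environment
    actions:
      - key: deployment.environment
        value: pre-prod
        action: upsert
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp:
    endpoint: backend-or-next-hop:4317   # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp]
```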

Also, we are working on trying to scale it using the metrics that the collector generates. So it generates some metrics about the number of metrics coming in, going out, and so on, by pipeline and whatnot. So we are working on scaling based on those metrics. We have also set some SLOs based on those metrics. We have not moved everything yet to the gateway, to our collector gateway, so most people are still using other ways of sending things to the backend. 

But yeah, we are also working on things not related to the collector, like enabling the developers to properly instrument their code and send things there. Just a handful of people are using OTel natively. So we are working on providing some shared libraries for people to do that, and so on. 

So it has been a lot of political work, not just tech work, [laughs] convincing people to adopt it and move to our gateway and so on. But I think, slowly, we are able to do that. More people are interested and asking good questions about observability. And we can see that observability is really improving there. So I think we are happy with this. 

ADRIANA: That's awesome. And what kinds of security considerations do you have to make for being able to run the collector in gateway mode? Do you need to install any certificates? Do you have any SSL-type of considerations when running the collector when you get it ready basically for primetime?

GABRIEL: Well, we are still running this all in pre-prod. So we did not set up any certs yet. But the plan is to have this in place when we go to prod. One thing that we wanted to get working, but haven't been able to yet, is how to enable teams to send their own API keys and get those passed through the collector and sent to the backend. 

So, for example, if we want to send to a different storage in the backend, some folks allow you to send a header in the request, and they will do that for you. But right now, we are not able to get this from the teams. We can only set these in the gateway itself. So this is something that we wanted to get ready for prod. Maybe Alex has some insight. [laughs]

ALEX: I was going to ask you; I think there's an open issue around this, if I'm not mistaken. I seem to remember seeing being able to forward on a header as an issue somewhere.

GABRIEL: Yeah, I think so. Also, a question for you, Alex: I've heard that the folks in OTel are working on a query language for creating metrics from spans and things like that, right?

ALEX: Right, the telemetry query language that's being currently developed in the transform processor. 

GABRIEL: Yeah, that's pretty cool also. 

ALEX: Yeah, ideally, at some point, the transform processor will allow the collector to sunset some of the other processors that have been created. I think one of the main things that I ran into when I first looked at the collector was there are so many different processors. It's hard to know which ones you're supposed to use. There's the attributes processor, there's a filter processor, there's transform now, there's a metrics transform processor. It just seems like there are so many different things that you could use. So I think the transform processor is hopefully going to make that a little bit easier for users to figure out. 
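
As a rough illustration of what Alex is describing, the transform processor takes statements written in what was then called the Telemetry Query Language (since renamed OTTL). The exact syntax has evolved between collector versions, and the attribute names here are hypothetical, so treat this as a sketch:

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["deployment.environment"], "pre-prod")
          - delete_key(attributes, "internal.debug_id")   # hypothetical attribute
```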

ADRIANA: So it'd be like kind of a one-stop shop processor.

ALEX: For most things, yeah. 

ADRIANA: Well, because technically, you could specify more than one processor, right? When you're defining your pipeline in the collector.

ALEX: Yeah, that's right. You can specify N processors, and they process the data in serial. So if you've specified the batch processor followed by a filter processor, which you might not want to do, the data will go through the batch processor first, and then it will go through the filter processor.
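
In config terms, the order of the processors list in a pipeline is the order the data flows through them, so a sketch of the example Alex gives would be:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Data flows through processors in the order listed:
      # batch first, then filter (Alex's example of an order you might not want).
      processors: [batch, filter]
      exporters: [otlp]
```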

ADRIANA: Well, so it's like a daisy chain kind of deal. 

ALEX: Right. 

ADRIANA: If you're so inclined and feeling super adventurous, I guess anyone could write their own processor if they want to do any custom stuff out there, right?

ALEX: Yeah. And there are definitely a lot of people doing that today. 

ADRIANA: One thing that I was always curious about that I always thought was a huge selling point for the collector was basically leveraging processors to mask data. I've worked in so many organizations where data masking was such a top priority, of course. You don't want personal information getting leaked out somewhere by mistake. Nowadays, I think the idea is that there's a processor that does masking that's basically a one-stop shop for all data masking. So you don't have to create your own kind of deal. Is that...

ALEX: Yeah, there's the redaction processor, which I believe only currently applies to traces, if I'm not mistaken. There was an issue open to add support for metrics as well. I don't think anybody's working on that right now.
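
For reference, a minimal sketch of the redaction processor configuration from the contrib repository. The keys and patterns here are illustrative examples, and supported fields may vary by collector version:

```yaml
processors:
  redaction:
    allow_all_keys: false          # drop any span attribute not explicitly allowed
    allowed_keys:
      - http.method
      - http.status_code
    blocked_values:                # mask values matching these patterns even on allowed keys
      - "4[0-9]{12}(?:[0-9]{3})?"  # e.g. credit-card-like numbers
    summary: debug                 # report what was redacted
```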

DAVID: I do have a question, Alex. Can we go back a little bit to security? What do you guys recommend in terms of collector to collector security?

ALEX: I think in an ideal world, you'd want to have SSL enabled, some kind of authentication enabled between your components. It used to be that you could say, well, everything is within my network, so I don't really need to worry about SSL. But now that everybody's operating in the cloud, people tend to be a little bit more careful; hopefully, a healthy level of paranoia is applied here. I would definitely choose to encrypt the traffic from all of my different endpoints unless it wasn't leaving the host. But yeah, that's probably the only time that I wouldn't recommend using some kind of encryption. 

So, for example, you mentioned that you have a collector running on each one of your hosts. For that traffic, that's just sending application data to your localhost or if you're scraping data or whatever. I think that's probably fine to do without encryption, but anything that leaves the host after that, I would recommend turning it on.
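
As a sketch of what "encrypt anything that leaves the host" can look like between two collectors, here are the OTLP exporter and receiver TLS settings. File paths are placeholders, the mutual-TLS lines are optional, and exact field names can vary slightly by collector version:

```yaml
# On the sending collector: encrypt (and optionally mutually authenticate)
# traffic leaving the host.
exporters:
  otlp:
    endpoint: gateway-collector.internal:4317   # placeholder
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client.pem      # only needed for mutual TLS
      key_file: /etc/otel/certs/client-key.pem

# On the receiving collector: present a server certificate and,
# if you want mTLS, require client certificates.
receivers:
  otlp:
    protocols:
      grpc:
        tls:
          cert_file: /etc/otel/certs/server.pem
          key_file: /etc/otel/certs/server-key.pem
          client_ca_file: /etc/otel/certs/ca.pem
```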

ADRIANA: Just piggybacking on the comment of having a collector running on each of the hosts, right now, do those actually funnel into a centralized collector? I think you mentioned not everyone's using the centralized collector, but it sounds like there was some usage of it.

DAVID: Yeah. So for us, it is specifically for our thing. We're running locally all the metrics that we need to scrape, like CPU and memory. And if you have Prometheus metrics there as well, just scrape those right at the get-go, right? It feels like the most organic way to do it. 

And then after that, we send the OTLP data to another collector, which I found was the best way to go about it. There, we can use labeling, like, I don't know, Cluster A. And then from that global collector, we send it to Gabriel's collector. And then, from there, it goes into the backend. So it's like gateway collectors of gateway collectors. 

ADRIANA: So you've got like three layers of collectors.

GABRIEL: You have an agent, then a local gateway and a global gateway, right?

ADRIANA: Cool.

DAVID: Exactly. I think the only challenging part of having collectors everywhere, which is also the good thing about it, is if something goes wrong with one host, then you technically only lose one host, one collector. The bad part about it is management, right? You know, keeping those configurations in the right places. Configuration management helps with that, but it's still a little bit challenging, to be honest.

ALEX: Do you have any plans or a strategy for how you're...or a need even to scale the local gateway for the collector? 

DAVID: Yeah, so we actually overcalibrated. What we did is we used Traefik as a load balancer in front of it, in front of the Nomad client, if you may, and then the collector is running in a Docker container. And it runs on each one of the clients. And we have five different local collectors, and all of those funnel into Gabriel's main collector; whatever happens there is just a black box to me. It's magic. And from there, it goes into the backend.

GABRIEL: We have also had some experience with another team. They run something very similar to David's setup, so they also have an agent and the local gateway. And they wanted to drop some spans. And we figured out that there is kind of a workaround to do that in the collector: you have to create a routing rule and send those spans to /dev/null or something like that. We found it a bit strange, but it did work. 

ADRIANA: Interesting.

GABRIEL: I think it does not yet have a way of just dropping those spans. You have to route them into nowhere to get it working.
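
For what it's worth, newer versions of the collector's filter processor can drop spans directly based on conditions, which removes the need for the route-to-nowhere workaround described here. A hedged sketch, where the attribute and value are hypothetical:

```yaml
processors:
  filter:
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'   # e.g. drop health-check spans
```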

ADRIANA: Oh, I see. Right. Right. Right.

DAVID: Touching on the span subject, Alex, what's the recommendation when you use the OpenTelemetry libraries and do all the magic with your spans? You've perfectly designed this thing, which I think is almost impossible, like really hard to do. But then, from that point on, what's the best way to send your spans? Do you send them directly to a collector locally? Do you send them to a global collector gateway? What would you recommend there?

ALEX: I think just like the rest of telemetry, I would just send it to whatever the closest collector to your application is. I will say this with a caveat that I'm probably not the all-encompassing authority on all things collector. My tendency is always to go towards sending the data as closely as possible to your application.

DAVID: Should we keep in mind something like, for example, authentication when it lives in the code, specifically? Because it feels like I'd rather make changes on my collector than change it in the code. But I don't know what the best practice is there.

ALEX: Sorry, you mean the authentication when you're in your instrumentation? 

DAVID: For example, when you're doing your instrumentation, you are ready to send the information out. Do you authenticate there? Do you use SSL right there? What's the best method kind of deal?

ALEX: Back to the earlier point around not being able to forward your authentication information, that's going to stop a lot of people if that's a need for different teams, for example, if you know multiple teams are sending data to the same collector. I agree with you; I think it makes a lot more sense to remove all the authentication information from the applications themselves and store it inside the collector. I mean, it's just generally easier to manage. And if you need to update keys or whatever, it's usually a lot easier to do than having to redeploy all of your applications, for instance. 

I would definitely lean towards that option instead of doing it at the application level, which then comes back to my earlier point around, well, for the application, where do you send your data? I would just send it to the closest collector and then let my local host collector decide where it should send it from that point on.

ADRIANA: Rather than having basically everyone send to a single centralized collector, every application should have its own collector, and then that collector would send to the centralized one.

ALEX: Yeah, I mean, if you're already running the collector on every node, I would just use that instance of the collector rather than trying to send it to a centralized location. Of course, when you think of tracing, then it goes back to ensuring that your sampling configuration is consistent across your entire fleet of collectors. Otherwise, you might end up with some very strange results.

ADRIANA: Yeah, makes sense. Makes sense.

GABRIEL: Maybe a good approach to that would be to have the sampling always closer to the application and getting everything in the centralized gateway. I don't know if that makes sense.

ALEX: Right.

DAVID: So talking about sampling, sampling or not to sample, what do you guys think?

ADRIANA: Controversial topic. [laughs]

DAVID: Yes. Yes, I hear.

ALEX: I think you get to a certain point; it doesn't make sense not to sample, using a double negative in that sentence to be intentionally confusing here. I think it does make sense to sample at some point. I can't remember the numbers, but there are a certain number of transactions that, basically, once you hit that certain threshold, there's very little probability that you won't be able to sample the data and get all of the same kinds of events that occur. So I do think it makes sense to sample.

GABRIEL: Yeah, well, I would also say it depends on the scale, how much we're sending. And also, I've never used sampling in the collector, but I'm aware that there are different sorts of sampling we can do there, right?

ALEX: That's right. Yeah, you can use tail sampling, which has the downside of having to buffer the data for any given trace. You can give it limits for the size or the amount of time that you'll wait for the buffering to happen. But obviously, that's going to add some memory pressure, depending on how much data you're sending to it. And then you can do things with a probabilistic sampler as well. 
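
A rough sketch of the two approaches Alex mentions, as collector processors. The numbers and policy names are illustrative, not recommendations:

```yaml
processors:
  # Probabilistic sampling: cheap, no buffering required.
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: buffers spans per trace before deciding, which costs memory.
  tail_sampling:
    decision_wait: 10s        # how long to wait for a trace's spans to arrive
    num_traces: 50000         # how many traces to hold in memory
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```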

As expert users of the collector, you mentioned you use the Prometheus receiver, the statsD receiver. You are using OTLP. Any other components that you find particularly useful? Or maybe on the other side of this question, are there things that are not in the collector today that you think I really wish this thing was here?

DAVID: One of the things that I find, like you said, is that Prometheus being part of the OTel Collector core makes sense to me just because we use it a lot. Obviously, I'm just talking from a small point of view of the whole picture, but we use it a lot. It makes sense for us for it to be part of the core and also supported. And OTLP, I wish OTLP was used everywhere, but that's not the case. I think Prometheus is definitely more widely known out there in the community.

To this day, the transform processor is, like you said, one of those things that are great on paper that I wish were as great on the implementation side. I have tried it multiple times, and it's like trying to make sense out of, you know, a two-headed person kind of deal. But yes, I think those are the two major things as a personal user of the collector. I don't know if I'm an expert for sure, [chuckles] far from an expert.

GABRIEL: Yeah, I would also not consider myself an expert since I have only been in this field for several months now. I really like how the configuration file works in the OTel Collector. I find it really easy to read and to configure pipelines and so on. If you look at something like Fluentd, it's really exasperating to configure pipelines there, and confusing and so verbose. So yeah, I'm still in the honeymoon phase with the OTel Collector. 

ADRIANA: [laughs]

GABRIEL: So I would not know what to say. [laughs] 

ALEX: You say you're not an expert, but the project is only three years old, and you have several months. So you're --

GABRIEL: [laughs]

ADRIANA: You guys are pros.

[laughter]

GABRIEL: I've been there for one-third of the OTel Collector's life.

ADRIANA: [laughs] It is interesting to see how much the collector has grown up in the last year or so, even just the various capabilities. Every time I update the collector version, I'm like, ooh, new thing. So that's kind of cool. It's nice to see that there's a good amount of momentum on the collector and that it's not stagnating, which I think is awesome. It's a vote of confidence to the community. 

And especially, in my view, I really feel like OTel is the way to go for observability. So making sure that you have support from various observability backends and making sure the project continues to be active, I think that'll ensure adoption across the board. It'll make it so that anyone who's not doing OpenTelemetry is just going to feel left out, I think.

ALEX: Right. They'll have observability FOMO. 

ADRIANA: That's right.

DAVID: Honestly, the same thing. Every time I've run into an issue, in less than a day, most likely, it's fixed already. So for an open-source project, I am super happy about the progress that you guys have made so far. It's really cool to see how impactful the community is as well, how vendors have chipped in actual developers to help on this. That's actually cool to see.

ALEX: Yeah, I was going to say the amount of vendor support from observability vendors but also from cloud platform vendors has been just tremendous in the collector repos.

ADRIANA: And one thing that I will continue to encourage and I was doing with the team when I was at Tucows...I remember there were a couple of times when the team was stuck on something. And they raised an issue for fixing whatever. I'm like, hey, if you know how to fix it, maybe just try fixing it and submit a PR. And I mean, it's the best way to do it, right? If you have the know-how. 

GABRIEL: Yeah, for sure. 

ADRIANA: So I do want to do that. [laughs]

DAVID: Talking about that topic of contributing to the collector, Alex, what would be the best way to get started on it? Say you're a new developer. You have written a few things in Go, like I have. How do I get started?

ALEX: To be honest, the collector contrib and even the collector repository can be a bit daunting. There's some good documentation that you can start by reading. But looking at just open issues, there are a lot of issues with help wanted or a good first issue. That's usually the first place people will look when looking at a new repo. So that's kind of where I would start or reaching out on the community Slack. 

I think people are pretty receptive to newcomers and definitely want to make people feel welcome. So if you start on an issue and you're not really sure what to do with it, just ask in the issue. One of the contributors will chime in, and if that doesn't work, just ping us on the Slack channel. 

ADRIANA: And I would say, like Alex said, everyone is super responsive. And to clarify for anyone who's interested in joining the community Slack, it's the CNCF Slack. And then there's a bunch of OpenTelemetry channels. So I think there's a channel corresponding to almost every repository out there, right? [laughs]

ALEX: Right.

ADRIANA: Just about. [laughs]

GABRIEL: And it's really a great place to get questions answered. I have gotten answers to a lot of the questions I had there.

ADRIANA: One final thing that I wanted to touch upon because I came across this a few months ago, and I think it might be worth mentioning on here, is there's also a CNCF project called OpenMetrics. And if I understand correctly, the purpose of OpenMetrics is to establish standards around, I guess, Prometheus format. 

So, Alex, I don't know if you can speak a little bit about this. How does OpenMetrics fit in with OpenTelemetry? Do they play nice with each other? Are they competing with each other? For anyone who's curious about that.

ALEX: Yeah. So there were a lot of questions about this a year and a half ago when we were still in the early days of the metrics support in OpenTelemetry. And I think it was in December of 2020 there was a metric signal kickoff where a bunch of folks, even from OpenMetrics, came along and joined the OpenTelemetry group. And since then, I'd say that we've made a lot of progress to ensure that OpenTelemetry is compatible with OpenMetrics. 

We don't want them to compete. We don't want to have the same problems that happened before with tracing, where there were many standards, and people don't know which one to choose. I think it's close to, if not fully compatible. Last time I checked, I think there were a couple of issues still open on the OpenTelemetry side to ensure compatibility between the two, but we're definitely working with those folks pretty closely.

ADRIANA: So what does that mean in terms of the collector? Does that mean that the collector ingests, like, is there an OpenMetrics receiver? Or does that mean the Prometheus receiver uses OpenMetrics format? Like, what does that look like?

ALEX: That's a good question. And I haven't spent any time thinking about it. So I can't really give you an answer, unfortunately.

ADRIANA: Fair enough. 

ALEX: But as far as I understand, so long as you're using the Prometheus format, that should be compatible with OpenMetrics.

ADRIANA: Cool. 

GABRIEL: Yeah, I have checked here quickly, and they seem to be a little stale. Their mailing list has nothing since February. GitHub has no commits since March. So maybe they just moved on to OpenTelemetry.

ADRIANA: Yeah, I haven't seen any activity on their Slack channel for a while either. But who knows? Maybe we can get someone from OpenMetrics on here to talk about this at some point.

ALEX: Future episode. 

ADRIANA: Yeah, exactly. So if anyone from OpenMetrics is listening and would like to be on the show, [laughs] be sure to hit us up. We'd love to talk to you.

ALEX: So there is a guidance for interoperating with OpenMetrics document in the OpenTelemetry specification. That's still marked as experimental, so I think there's still a little bit of work to be done there.

ADRIANA: It will be interesting to see where that project goes. So I guess we'll stay tuned.

DAVID: I have a final question myself. And I think this is the elephant in the room. When are we having general availability for the OTel Collector? Right now, it's sitting on version 0.55. What can you tell us? What kind of insight do you have, Alex, to share with us?

ADRIANA: [laughs]

ALEX: I can't commit to any dates or any final decision here. I know there's a lot of focus on the collector core repository to stabilize each kind of package within the repository. I think the most recent one that we've managed to stabilize is the pdata package, which wraps the OTLP package. It's the internal format to the collector. I'm not sure how much more work there has to be done, to be honest. But I know we're all itching for that 1.0 version.

ADRIANA: Hopefully soon. 

DAVID: Yeah.

GABRIEL: I think the logs protocol just became stable some months ago.

ALEX: It did, yeah. It was, I think, in March or April.

GABRIEL: Cool. Yeah, that's pretty cool. 

ADRIANA: And metrics has been stable for a while, right?

ALEX: I think parts of metrics are still under development. Last time I checked, things like exemplars weren't stable yet. But I know stability was there, at least for most of the specification around metrics.

GABRIEL: Yeah, I think so. I think the API and the protocol are stable. But the SDKs have not all implemented metrics yet.

ALEX: Yep. And also, there's a whole new discussion around profiling, which kicked off a few months ago. I think it was at KubeCon, the discussion.

ADRIANA: Oh, yeah, yeah. I've heard some ramblings about that. That's cool. Exciting. Well, I know we're coming up to near the end of our show. Any final thoughts from anybody?

DAVID: Well, I just want to say thank you to anybody who has contributed to the OpenTelemetry Collector and OpenTelemetry, period. I think it's well needed, so keep going. If you can go faster, go faster. Just don't break anything in between.

ADRIANA: [laughs]

DAVID: But it's going well. It's going well. As a user, as a consumer, I think it's one of the projects that I'm super excited to see so alive and being used across a whole community.

GABRIEL: Yeah, it's the same thing for me. I'm always really happy to be using it. And I have a lot of fun learning about that kind of stuff. All those pains were pains I had in my previous work. So I'm really happy to see the path it's taking and to see that the industry is really gathering around OpenTelemetry. And that's great because this is really rare to see; usually, companies just go their separate ways. Yeah, it's pretty cool to see that people are contributing to it and adopting it, and so on. So, congratulations; I hope I can contribute more in the future.

ADRIANA: Yeah. And again, I encourage anyone who is interested in OpenTelemetry to look into contributing. And, I mean, I think any contribution counts, whether it's through docs, it's through some of the language-specific examples, the collector, wherever you think that you can make a dent. 

And the other call to action I'd like to give is to leadership in various organizations. A lot of people put a lot of time into OpenTelemetry. And it's really nice to see so many vendors out there allowing engineers to use company time to contribute to OpenTelemetry. 

And I think it would be awesome to see more companies out there, not just the observability vendors giving their engineers time to contribute to what I think is an extremely valuable and important open-source project. So anyone out there in leadership who's listening, encourage your engineers to contribute as part of their jobs because I think it's such a great learning experience, such a valuable contribution to the community.

ALEX: Yeah, plus one on that; I think we've spent a lot of time building the project. And there are a lot of super valuable contributions from observability vendors and from cloud platform vendors. The important part here is we're all building this for the end users. And getting more users involved, getting more user feedback, I think, is super helpful. And thank you to our guests today for telling us about their adventures with the OpenTelemetry Collector because, for me as a contributor, it's great to hear from our users. So thank you.

ADRIANA: Yeah, thank you both for coming on and sharing your experiences. So signing off, I'm Adriana Villela with...

ALEX: Alex Boten. 

ADRIANA: Thanks to David and Gabriel for joining us, and we will see you on the internet.
