Cloud Native Security: OpenTelemetry & Tracing Explained

Cloud Native Security Series

Aug 1, 2021


Season Two


Episode Description

What We Discuss with Ted Young:

00:00 Intro
04:02 Cloud Native
07:22 Observability
12:58 Logging vs Tracing
20:05 OpenTelemetry
27:19 Observability becoming Mainstream?
29:16 Achieving Stability in Metrics & Logs
34:24 Observability or OpenTelemetry – Where does one start?
38:09 Security risks with Observability
44:12 Starting with OpenTelemetry
46:37 The Fun Section

THANKS, Ted Young!

If you enjoyed this session with Ted Young, let him know by clicking on the link below and sending him a quick shout out on Twitter:

Click here to thank Ted Young at Twitter!

Click here to let Ashish know about your number one takeaway from this episode!

And if you want us to answer your questions on one of our upcoming weekly Feedback Friday episodes, drop us a line at ashish@kaizenteq.com.

Resources from This Episode:

Tools & services, discussed during the Interview
OpenTelemetry.io
Ted Young on YouTube
AWS Distro for OpenTelemetry
Google Cloud for OpenTelemetry
Microsoft Azure for OpenTelemetry


Ashish Rajan: Hello, and welcome to another episode of Cloud Security Podcast with Virtual Coffee with Ashish. This month is Cloud Native Security Month, and we’re talking about all things cloud native. I have my first topic, which is observability.

For those who are joining us for the first time, Cloud Security Podcast is a weekly show where we go live every week on different topics in cloud security, and this is where we get to hang out with other people. It’s cloud native security today. We have a special guest today, and I’ve got some special music cued in for this gentleman.

Hey, welcome, Ted.

Ted Young: Hey, how’s it going?

Ashish Rajan: Good, man. Good. Thanks for coming in, man. I really appreciate that. So I’ve known you, I guess, for some time, but a lot of people may not have heard of you.

So for people who may not know who Ted Young is, could you give us a brief intro about yourself?

Ted Young: Yeah, absolutely. I’ve done all kinds of things in the past, but I got really interested in observability. Well, I’ve always been interested in it from the perspective of operating big systems, but I was working on a project called Cloud Foundry, building the [00:01:00] container schedulers, sort of the runtime that managed all the containers that Cloud Foundry manages.

I just hit kind of a breaking point where it felt like the traditional tools that we were using to observe systems were just not sufficient for the situation that we were in, and I started researching it more, which is how I got into distributed tracing and some of these other tools.

That led to the OpenTracing project, and then OpenTelemetry. I went to work at a company called Lightstep, which was founded by Ben Sigelman, who wrote a lot of the white papers and some of the original tracing systems that were in use over at Google. So I’ve been in the distributed tracing and observability world full time for about five years now.

Ashish Rajan: Right. I’m glad, because we have to get into all of this in a minute as well. So for people who don’t even know, maybe we should start with cloud native. What does cloud native mean for you?

Ted Young: Oh, geez. Buzzword central. For me, I think the [00:02:00] term cloud native comes from the idea that we’re now running on rented hardware, right? You’re in this world where it’s very easy to spin up servers and very easy to connect them together. At the same time, we’re running larger and more complicated systems than we have in the past. So cloud native is sort of like, how do we take our traditional tool stack and modify and adjust it to make our life easier in this new world?

When I started, we were racking servers. I’m old enough, sadly, that when we started, it wasn’t about scaling. It was about capacity management, where you had to figure out ahead of time what kind of capacity you were going to need, because you had to have those machines. If you didn’t have the machines, it was going to take weeks to order new ones and get them installed.

One of the [00:03:00] big shifts was when we moved towards renting hardware and being able to just spin machines up like that. It starts to change your thinking. As the number of machines you’re running scales up and up, you reach the point where you’re no longer really able to do traditional system administration, where you’re logging in and setting them all up by hand. Those two things kind of went hand in hand.

There were advantages from being able to spin up lots of machines, but then there was the fact that you weren’t going to be able to manage them by hand. I’ve heard the phrase “cattle, not pets” as a way of describing the shift, which I think is accurate. So to me, cloud native is all of the technology that has now grown up around making life easier in this new world of rented hardware.

Ashish Rajan: That actually is a good definition, because we have been talking about cloud and what cloud has been for some time on this Cloud Security Podcast as well. And now we’re at that stage where we’re almost talking about the next evolution that’s going on in the cloud space, [00:04:00] where it’s no longer enough just to have hardware. The way I explain it is, there used to be carts, and then there were cars.

I can’t imagine a world going back to a cart again. I feel like cloud native is kind of similar, where people have gone to the point that they know, oh, I can get hardware in a matter of minutes, I’m not going to wait three or four months for hardware. But now people are taking that for granted and going for that next level.

Okay, I just want all of these features. I don’t care about the hardware anymore. And I think that’s exactly how I came across the observability piece. So how does observability fit into this, if you don’t mind?

Ted Young: Yeah. So, I mean, observability comes from the fact that we still have bugs. We still have to operate our machines. It comes down to, to my mind, two main problems.

One is logical errors: you’ve coded the computer programs to do something that isn’t what you expected them to be doing. The other is resource contention. When you run this at scale, there’s something about the interplay between the requests for resources, the [00:05:00] availability of resources, and the concurrent access to all of these resources that creates its own trouble.

Observability is the ability to actually see what’s happening in real time and tease out two things. One is when an invariant has been violated; in other words, something that you expect to happen is not happening, which is a logical error. The other is resource contention, when you have insufficient or mismatched resources and things are starting to go wrong due to resource utilization.

Ashish Rajan: Okay. And to your point, you would need this because you’re still having bugs, and I need to somehow figure out a way to manage this and monitor this. So is it the same as logging?

Ted Young: Yeah. Well, it’s yes and no. We still do the same thing that we always did when you want to know what’s wrong with your system, or you want to know what your system’s doing.

There are events: what is the sequence of events that occurred when a program was executing? [00:06:00] Observing those events, we tend to call logs. And then we want to know what’s happening in aggregate. We want to be able to step back and say, in aggregate, what’s happening? This is where we’re looking at resource contention and problems of that nature.

We tend to call that metrics, traditionally. So you have logs and metrics.
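
To make the distinction concrete, here is a minimal Python sketch, not anything taken from OpenTelemetry itself. The same list of events is read first as logs (the ordered sequence for one request) and then rolled up into a metric (error count per service). The event fields, service names, and function names are all invented for illustration.

```python
from collections import Counter

# Hypothetical events; the field names are illustrative, not a real log schema.
events = [
    {"ts": 1, "service": "checkout", "request": "r1", "level": "info", "msg": "start"},
    {"ts": 2, "service": "payments", "request": "r1", "level": "error", "msg": "card declined"},
    {"ts": 3, "service": "checkout", "request": "r2", "level": "info", "msg": "start"},
    {"ts": 4, "service": "payments", "request": "r2", "level": "error", "msg": "timeout"},
    {"ts": 5, "service": "checkout", "request": "r2", "level": "error", "msg": "gave up"},
]


def log_view(events, request):
    """Logs: the ordered sequence of events for one request."""
    ordered = sorted(events, key=lambda e: e["ts"])
    return [e["msg"] for e in ordered if e["request"] == request]


def error_count_by_service(events):
    """Metrics: step back and count what is happening in aggregate."""
    return Counter(e["service"] for e in events if e["level"] == "error")
```

Calling `log_view(events, "r2")` answers "what happened to this request", while `error_count_by_service(events)` answers "where are errors piling up"; both are views over the same underlying events.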

Ashish Rajan: Yep. Oh, actually, that’s an interesting point, because every time I’ve spoken about logging in general, it’s always been about: what data source am I getting my information from? You cycle that information out, and then you develop some kind of metrics from it.

So are we saying that that’s obsolete now?

Ted Young: Well, the problem that you run into is really a problem of indexing. When you’re saying, go look at the logs: well, what logs? Let’s say you have a transaction and you’ve got an exception over here, and you want to find out what’s causing that exception.

One of the first things you’re going to want to do is say, well, get me the rest of the logs that were in that transaction. And I think [00:07:00] anyone who has operated a system that was doing more than processing one request at a time knows it’s actually really painful to collect up those lines, because on any given machine there are a bunch of concurrent requests.

So there are all these logs happening at once, and you want to know which of these logs were part of this particular request. And then you’ve got the fact that these transactions are distributed. This request was going from one machine to another machine to another machine.

So you’ve got 50 machines and this request touched six of them. That problem is now repeated across 50 machines. You want to find the logs that are just in this one transaction. If you don’t have any kind of indexing, if there’s no way to say, well, I found this log, so I just want to look up all the other logs that are in this transaction, that becomes this sort of manual grepping around, in whatever logging system you’re using.

You end up doing a lot of searching and filtering just to get that collection of logs that [00:08:00] represents a particular transaction, and that actually takes a lot of time. It’s time people have gotten so used to spending that you may have forgotten how much time you’re spending doing it. So, anyone listening, the next time you’re investigating an issue, put a stopwatch on it.

Notice how much time you’re spending just collecting the information so that you can look at it, as opposed to making a hypothesis and actually operating on that information. That’s what’s starting to fall apart with logs.

Ashish Rajan: Interesting. So I imagine all the security operations people listening to this, who ask everyone, hey, throw all your logs into the SIEM collector, which is basically a log aggregator, because that’s our log aggregator.

Your cloud service provider will tell you the same as well: hey, push all the logs into CloudWatch in AWS, or wherever. But it’s just logs and metrics at that point. To your point, if I’m just hypothesizing a scenario where there’s an issue with an [00:09:00] application, in the AWS scenario you go to CloudWatch and you have a bunch of logs.

And then you’re going, okay, what’s the timeline? Then you go into, is this just a log of the application, or are there other things in there as well? I mean, you can keep dissecting that for hours.

Ted Young: Yeah. And so, really, truly, the only difference between logging and tracing is this. Let’s even assume that you’re using a logging system that has indexing, so a proper database.

So you can index these logs and look them up. But the index that you’re going to want is what’s called a trace ID or a transaction ID: a unique identifier. When that transaction starts on the client, an identifier is generated, and then every log in that transaction gets that same identifier.

Even if it hops to another machine, that identifier follows it. So you just have this identifier attached to every log. And if you’ve got that ID, and the database you’re storing the logs in does indexing, then when you find one log, it’s going to have that ID on it, and then you just look up by ID and bam, you’ve got all the other [00:10:00] logs.

As soon as you add that transaction ID, you’re now doing tracing. Fundamentally, that’s all tracing is. There’s a bunch of other stuff people add to tracing once you’re doing that, but at its core, it’s just about getting that key identifier attached to every log.
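
The lookup Ted describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular logging product works: `LogStore`, `new_trace_id`, and the machine names are hypothetical, and the "database" is an in-memory dictionary keyed by trace ID.

```python
import uuid
from collections import defaultdict


class LogStore:
    """Toy log database with an index on trace_id (illustrative only)."""

    def __init__(self):
        self.records = []
        self.by_trace = defaultdict(list)

    def write(self, machine, trace_id, message):
        record = {"machine": machine, "trace_id": trace_id, "message": message}
        self.records.append(record)
        self.by_trace[trace_id].append(record)  # this is the index

    def lookup(self, trace_id):
        return list(self.by_trace.get(trace_id, []))


def new_trace_id():
    """Generated once, where the transaction starts."""
    return uuid.uuid4().hex


store = LogStore()
trace_a, trace_b = new_trace_id(), new_trace_id()

# Two concurrent transactions interleave across several machines.
store.write("web-1", trace_a, "request received")
store.write("web-1", trace_b, "request received")
store.write("db-3", trace_a, "query failed")
store.write("db-7", trace_b, "query ok")

# Start from the one failing log, then fetch the rest of its transaction by ID.
failing = next(r for r in store.records if r["message"] == "query failed")
related = store.lookup(failing["trace_id"])
```

Here `related` contains only the two records from the failing transaction, on `web-1` and `db-3`, with no searching or filtering by time window.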

The trick is, it’s a lot of work to do that. Traditional logging tools lack the kind of context needed to do that. It’s just, make me a log right here, and when you make that log, it’s not contextualized. So actually adding that context, so that all of your code is now executing in a context that contains this transaction ID, is a fair amount of work. That’s why tracing systems require more work to build, and more work to set up, than a traditional logging system. But that’s the main difference: just having that ID.

Ashish Rajan: I may have jumped a few steps here, so if we were to bring it back to the beginning: I’ve got people who may be going, well, I’ve been log aggregating for such a long time. I’ve got logging, I’ve got metrics. That seems to work. But what [00:11:00] we’re saying at the moment is that it takes an enormous amount of time, a lot of the time you don’t even have context, and we need to go down the path of indexing with context, so that at least the time to get to the problem, so you can work on it, should be shorter. Would that be right?

Ted Young: That’s correct. Yeah. If you’re going to be tracing, you have to be generating these events, and that’s no different than logging. So you do need to instrument your code. Maybe this is another area where OpenTelemetry, the particular project I’m working on, has a specialty, which is that most software systems today are actually written out of third-party software.

You’re taking a lot of third-party libraries, usually open source today, and you’re gluing them together, and then applying your application logic on top of that. You’re not writing your own HTTP client or your own database client.

You’re not writing your own web framework. You’re taking these off-the-shelf components and you’re reusing them and [00:12:00] fashioning them into an application. So you have these third-party libraries that are doing a lot of the heavy lifting for you. And for anyone who’s attempted to write open source software (I’ve written a fair bit of it myself), you hit this wall where you want to instrument your libraries so that you can provide the users of your library with information about what it’s doing, but you can’t, because you have this issue with composition. I can pick a logging library, a tracing library, and a metrics library. But if I pick one that’s different from the one the application owner wants to use, or different from the one all the other libraries pick, then it’s not going to compose into a coherent system. So you end up just kind of spewing logs to standard out, or giving someone a hook and saying, you wire all this up.

That’s always been a bummer. So besides integrating everything with tracing and providing better indexing and just fundamentally better observability, OpenTelemetry is also split out and architected [00:13:00] in such a way that third-party libraries can instrument themselves with OpenTelemetry without picking up dependencies or overhead, or limiting the application owner.

There are certain choices the application owner needs to make, and OpenTelemetry divides up how it’s set up so that the instrumentation is not making choices about, say, where you’re sending the data or what format the data is coming out in. So we’re hoping to solve this problem of native instrumentation, as well as the problem of distributed tracing and getting everything integrated.
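
One way to picture that split is the Python sketch below. It is only an illustration of the design idea: `NoopTracer`, `RecordingTracer`, `set_tracer`, and `get_tracer` are hypothetical names, not the real OpenTelemetry API. The library calls a thin API that does nothing by default, and only the application decides whether an implementation is installed and what it does with the data.

```python
class NoopTracer:
    """Default used when the application installs nothing: does no work."""

    def record(self, name, **attributes):
        pass


class RecordingTracer:
    """Stand-in for an implementation the application owner chooses to install."""

    def __init__(self):
        self.events = []

    def record(self, name, **attributes):
        self.events.append((name, attributes))


_tracer = NoopTracer()


def set_tracer(tracer):
    """Called by the application, never by a library."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


# --- library code: depends only on the API functions above ---
def fetch(url):
    get_tracer().record("http.fetch", url=url)
    return f"response from {url}"


# --- application code: decides what happens to the data ---
fetch("https://example.com/a")  # no-op tracer installed, nothing is recorded
sdk = RecordingTracer()
set_tracer(sdk)
fetch("https://example.com/b")  # now recorded by the installed tracer
```

The library never imports an exporter, a wire format, or a backend client, which is the property that lets many independently written libraries compose in one application.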

Ashish Rajan: Awesome. And native instrumentation would just be logs provided by the provider?

Ted Young: Exactly. If I’m writing a web framework or a database client, I should be able to provide the instrumentation myself. As the author of that software, I am the person who knows what’s important.

I know what my users need to know about what the system’s doing. And I also know the remediation. So if I’m writing my own instrumentation, that means I can [00:14:00] also start doing some good DevOps practices, like shipping playbooks. Couchbase is an example of a company that’s doing this. They natively instrumented with OpenTracing and now OpenTelemetry.

And then they ship a playbook that says, we’re producing these logs and metrics around things like how backed up the queue is. If you see this warning, or you’re seeing these thresholds getting hit, that means you need to tune these parameters. So here’s our playbook for what the information is that’s coming out of the system, and also what you should do about it when you see it: here’s how you should tune things, or here’s what it means when you’re seeing this information we’re providing.

Ashish Rajan: And so, to your point, maybe we should explain what it is, because you and I have some understanding, or at least you have more understanding than I do. We should probably define what OpenTelemetry is, because we’re using the word, but a lot of people are like, what the hell is it?

Ted Young: So OpenTelemetry is an open source project.

It’s under the aegis of the CNCF, and it provides instrumentation for every major [00:15:00] form of observability: tracing and metrics. Things like eBPF and RUM are going to get added to it. So it’s going to be the catch-all for all the different ways you might want to instrument a system. But it does it in a way that’s novel, by using distributed tracing as the underpinning to take what were traditionally considered several different pillars or several different tools and ensure that they’re all cross-indexed, so that you can get a single cross-indexed stream of data coming out of your system.

This solves some of the problems we were talking about earlier, about how it’s normally really slow to move between these tools because you don’t have the indexing, and you can’t write databases or automated analysis tools because the data is not actually there.

With OpenTelemetry, all of this stuff is getting cross-indexed. So in the future, once this project is complete and widely adopted, I think you’re actually going to see a shift in the kind of observability products and tools people are offering, [00:16:00] going away from having your metrics dashboards over here, your logging system over here, and your tracing system over there, to something that’s more like one coherent system that’s synthesizing all of these different data sources to give you a more complete picture that you can move around in.

Ashish Rajan: Actually, you reminded me, because Amazon recently made, I think it was Grafana, available through OpenTelemetry or something like that.

So basically, I think my understanding from that news was that they started supporting OpenTelemetry standards, I guess, in some of their, well, at least one of their products. Yes?

Ted Young: So there are a couple of different pieces, if we’re going to break OpenTelemetry down. There’s the instrumentation part, which we call the APIs, and the APIs are totally decoupled from any implementation.

When you’re instrumenting your code or your service using the OpenTelemetry APIs, there’s no implementation getting pulled in. This actually relates to security and other things around [00:17:00] dependencies. I should say, there’s no dependency that automatically comes in just because you’ve instrumented, which is a key element when it comes to instrumenting open source libraries. So you’ve got your API. Then you have what’s called the SDK, which is an implementation we provide that you would install in your application.

The SDK is where you’re configuring what you’re doing with that data: where you’re sending it and what format it’s going to be in. We support a wide variety of existing formats for this stuff, but then there’s the OpenTelemetry Protocol, called OTLP. That format is special, because it’s the format that actually takes all these different data sources and combines them into a single stream.

So you’re just firehosing OTLP at some endpoint that can then take all of these different data types and do something coherent with them. And then along the way, there’s a service we provide called the Collector, and the Collector is like a data processing service. That’s so you can move a lot of the data [00:18:00] processing and configuration work out of your application services and over to a separate service you’re running, the Collector.

That’s where you would be doing things like scrubbing PII from your data, converting between data formats, and generating metrics out of your tracing data on the fly, things of that nature. You can also use this to tee off into multiple data sinks.

So if you want to send your data to, let’s say, CloudWatch and also Datadog, or let’s say someone’s written a specialized analysis tool and you want to send some of the data off to that, you can actually send data off to multiple places using the Collector. The one thing OpenTelemetry doesn’t provide is any kind of database or backend or analysis tool.

That’s because the project is focused on standardizing the telemetry portion of the system. We help people generate the data and then transmit it, and that’s where we’re trying to get all the [00:19:00] agreement and standardization happening. But we don’t want to get into the analysis or database game, because that’s kind of where all the competition and all of the greenfield revolution is going on.

So the edge of the project is right there at the Collector, sending data off to some third-party service.

Ashish Rajan: Okay. So OpenTelemetry is, I guess, for lack of a better word, the instrument that’s standardizing across multiple sources.

Ted Young: Yeah. That’s why we chose the term telemetry for OpenTelemetry.

If you look up telemetry, it’s the generation and transmission of metrics and data about some remotely operated system. So it’s not the analysis part, it’s just the generation and the transmission of the data. And we’re seeing infrastructure providers like Amazon, Google, and Microsoft all start to integrate OpenTelemetry into the services they’re running on customers’ behalf, so that when customers are tracing their applications and they’re talking to, say, an Amazon service [00:20:00] like S3, or an application gateway or something like that, that trace continues on into the service the provider is running. And then they’re going to provide those users with OTLP data about their requests when they hit those servers.

So that’s another place, besides open source libraries. You have these managed services that in the past it was difficult to get any information out of, but now with OpenTelemetry, you’re going to get all of this great OTLP data starting to come in.

Ashish Rajan: That would be a game changer for a lot of people, especially the folks who, in the beginning, were asking: is cloud secure? I can’t see anything in the cloud. How do I trust it? But now, through OpenTelemetry, it’s really interesting. I’m curious, are there enough people, I guess, behind this? Is that why it has such popularity? Because if the managed service providers, the Amazon, Google, and Microsoft of the world, have started providing this, is it mainstream?

Ted Young: Yeah. So it’s definitely backed by all of the major players in the industry. Part of the idea here was that this is only going to work if we [00:21:00] all come together and agree on a standard, especially when you’re talking about distributed tracing, where you’re trying to pass these transaction IDs around and have some kind of coherent view of the entire transaction.

You have to have agreement on how you’re going to do that. So OpenTelemetry came out of what were initially kind of two competing projects in a similar space. There was OpenTracing, which was happening in the CNCF, and then there was something called the OpenCensus project that was happening at Google.

It was quickly clear that this wasn’t going to work. There are other domains where, yeah, you can have a bunch of different databases and it’s fine. But in this particular domain, because it was about interoperation, we actually needed to all come together. So OpenTelemetry actually represents everyone in the industry coming together to work on the same project.

So you’ve got me at Lightstep, and Google, Microsoft, Amazon, and Splunk as some of the founding members, and New Relic, Dynatrace, Honeycomb, a lot of observability [00:22:00] vendors showing up in the early days. So it’s got a lot of legs at this point, and it’s seen quite a bit of adoption already.

I should mention, though, that the project is not complete. The tracing portion is stable, but metrics and logs aren’t stable yet. We’re hoping for metrics to be stable by end of year.

Ashish Rajan: What do you mean by not stable? That could be one of the things to explore.

Ted Young: Yeah. Well, I mean, we care quite a bit about stability. It’s really common in the world of open source for people to declare something 1.0, and then ship a 2.0 the next year and be like, ah, it’s an improvement. It broke everything, but it’s an improvement. We’re very, very sensitive to that, because instrumentation goes everywhere.

We’re looking at these instrumentation APIs with the expectation that there are going to be millions and millions of lines of code written against them. So stability for us means we never break it, ever. Microsoft wants to put this [00:23:00] into Office and Windows.

So we’re talking about software that has a shelf life measured in decades. That’s the kind of long term we’re thinking about. When we say OpenTelemetry is stable, we’re saying this API is never going to break. It’s going to be supported basically for the lifespan of this project, however long that is.

And we already did this with OpenTracing. We had the OpenTracing APIs, and then when we went to OpenTelemetry, we made improved APIs, sort of like a v2 of OpenTracing. But all the OpenTracing APIs still work. Everything interoperates with OpenTracing. So we didn’t break anybody’s code by switching to OpenTelemetry.

That’s what we mean by stable. Tracing in OpenTelemetry is stable, and it’s going to remain stable forever. So you can trust that if you’re instrumenting with it, we might add improvements in the future that make things easier for people or add additional features, but the code you write is never [00:24:00] going to stop working.

We are not there yet for metrics. We’re there for logs in the sense that, if the logging you’re doing is on traces, just logging against your traces (we call those span events), that’s there. That’s fine. I recommend people use that as their main logging API today. But you have all of these edge cases around logging that aren’t really covered by that.

And we’re not there yet with metrics. The problem is that the metrics space is very, very broad, and we want to make sure that the metrics APIs we’re building work. We want it to be something that works for Prometheus. We want it to be something that works for StatsD, for all the things people are currently doing.

We also want to make sure that it’s doing things that those systems don’t do, which is integrate with tracing, so that your metrics dashboards are actually connected up to your tracing and logging system. That’s another thing that’s really awkward and annoying today. You look at your metrics dashboards and you can see there’s a spike, like, oh, there’s a big spike in [00:25:00] errors. But what are these errors?

What was causing these errors? You can’t just click on that dashboard and go look at example logs and traces of the things that were generating it, because they’re two totally separate systems right now. With OpenTelemetry, those things are getting combined, so that in the data you’re receiving, whether your system today supports it or not, the data is cross-indexed.

So when you see those metrics, they’re coming in with trace exemplars attached to them, so that future systems are going to be able to say, this slowness you’re seeing here, here are example transactions that represent these errors or this slowness or these HTTP... whatever it is you’re looking at in the dashboard.

So getting all of that working, and making sure that it’s working well: we’re hoping to hit that for metrics by end of year. That’s our current goal.
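
The exemplar idea can be sketched as a histogram that remembers one example trace ID per bucket, so a spike on a dashboard can link back to a concrete trace. The Python below is an assumption-laden toy, not OpenTelemetry’s metrics data model: `LatencyHistogram`, the bucket bounds, and the trace IDs are invented for illustration.

```python
class LatencyHistogram:
    """Bucketed latency counts that keep one example trace ID per bucket."""

    def __init__(self, bounds_ms):
        self.bounds = sorted(bounds_ms)
        self.counts = [0] * (len(self.bounds) + 1)
        self.exemplars = [None] * (len(self.bounds) + 1)

    def _bucket(self, value_ms):
        for i, bound in enumerate(self.bounds):
            if value_ms <= bound:
                return i
        return len(self.bounds)  # overflow bucket for the slowest requests

    def record(self, value_ms, trace_id):
        i = self._bucket(value_ms)
        self.counts[i] += 1
        self.exemplars[i] = trace_id  # keep the most recent example


hist = LatencyHistogram([100, 500])
hist.record(40, "trace-aaa")
hist.record(90, "trace-bbb")
hist.record(2300, "trace-ccc")

slow_bucket = len(hist.bounds)
example_slow_trace = hist.exemplars[slow_bucket]
```

A dashboard built on this data could show the count in the slow bucket and, on click, open `example_slow_trace` in the tracing system instead of leaving the operator to search for a matching request by hand.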

Ashish Rajan: Well, I’m sure a lot of people would be looking forward to this as well, and I’m sure they’re already excited.

Now they’ve basically heard about OpenTelemetry and observability from [00:26:00] you. A lot of people would already be doing a lot of logging or metrics, but in their own way. And as you said, everyone has a dashboard. Every operations person, every DevOps person will have a dashboard for, hey, this is how my application is behaving, and they definitely cannot click on that link.

It’ll just take them down another rabbit hole, I guess, if they click on a link. If anyone listening to this were to start on this today, what’s the easiest way to do it? Because I imagine there’s a big transition in shifting from just doing logging and metrics to doing OpenTelemetry. Is there a provider available in the industry as well, or would you pick one over the other?

Ted Young: So OpenTelemetry works with all the major providers. If someone wants to start today, you just want to go ahead and get the OpenTelemetry SDK installed and start using that for tracing. The hardest part to set up is getting your traces propagated across all of your services, because metrics and logs are [00:27:00] kind of like these single, contextless things, which also makes them easy to set up.

When you set up tracing, there’s a little bit of superstructure that has to get set up properly so that the traces actually propagate. That means installing the OpenTelemetry SDK and installing instrumentation for the major libraries you’re using, especially the network libraries. Your web server and your HTTP clients that are talking to each other need to have instrumentation installed so that the trace is propagated.
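
What that network instrumentation does on each hop can be sketched as an inject step on the client and an extract step on the server. The transcript does not name a header format; the Python below assumes the W3C Trace Context `traceparent` header shape (`version-traceid-parentid-flags`) purely for illustration, with simplified validation. The function names `start_trace`, `inject`, and `extract` are hypothetical, not a real library’s API.

```python
import re
import secrets

# Simplified pattern for a version-00 traceparent value (assumed format).
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    return secrets.token_hex(8)  # 16 hex characters


def start_trace():
    return {"trace_id": secrets.token_hex(16), "span_id": new_span_id()}


def inject(ctx, headers):
    """Client side: write the current context into the outgoing headers."""
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"
    return headers


def extract(headers):
    """Server side: continue the caller's trace, or start a new one."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return start_trace()
    return {
        "trace_id": match.group(1),   # same trace as the caller
        "span_id": new_span_id(),     # new span for this service's work
        "parent_id": match.group(2),  # links back to the caller's span
    }


client_ctx = start_trace()
outgoing = inject(client_ctx, {"accept": "application/json"})
server_ctx = extract(outgoing)
```

If either side lacks this step, the server starts a fresh trace and the request appears as two unrelated transactions, which is why the HTTP client and web server instrumentation are the first things to install.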

So that’s still a little bit of work sometimes to get started with. We’re trying to make that a lot smoother, and as things become more and more natively instrumented, that’ll naturally get a lot smoother. But that’s where people want to start. Once you have tracing up end to end, so you can see traces, then you can start doing things like, say, converting your existing logs into span events.

Or you can take your trace data and use it to start decorating your logs. So once you have [00:28:00] tracing set up, you’ll now have a trace ID and a span ID. And most existing logging tools have something called, like, a log appender. So you’re just creating a little adapter that takes that trace information and staples it onto a log every time you make one.

Even without changing tools, just using your existing logging tool, having those trace IDs on there is going to really improve your day, because almost every logging tool has some way that lets you search and filter. And so you can start searching and filtering using trace IDs once you get OpenTelemetry installed.

So that’s an easy way to get started. And likewise, the metrics and trace data that you get out of OpenTelemetry can be sent off to any backend. So whatever backend you’re currently using can start receiving data from OpenTelemetry instrumentation.

So it’s possible to progressively migrate off of your existing tools and just keep the portion of your existing tools that is still useful to you once you’ve installed OpenTelemetry.
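The "little adapter" Ted describes for stapling trace information onto logs can be sketched with Python's standard `logging` module. This is an illustration under assumptions, not the OpenTelemetry logging integration: the `current_trace` context variable is a hypothetical stand-in for wherever a tracing SDK keeps the active span, and the `checkout` logger name and the IDs are made up.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace context; a real tracing SDK tracks this for you.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Staple the active trace ID and span ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_trace.get() or ("-", "-")
        record.trace_id = trace_id
        record.span_id = span_id
        return True  # never drop the record, only decorate it


stream = io.StringIO()  # stands in for the existing log destination
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))

logger = logging.getLogger("checkout")  # hypothetical service logger
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Inside a traced request: the IDs show up on the log line.
token = current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("payment accepted")
current_trace.reset(token)

# Outside any trace: placeholders are used instead.
logger.info("idle")

output = stream.getvalue()
```

Because the IDs land in the formatted line as plain `trace_id=...` fields, the search and filter features of an existing logging tool can be used to pull up every log line for one trace without switching tools.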

Ashish Rajan: Yeah. So I [00:29:00] guess a lot of security people may already have an observability platform around them. They may not have adopted it themselves, but they see it around in their work life, I guess. But clearly there are some security aspects to observability as well that people should consider. And in saying this, I understand it’s a bit new, so not everyone out on the internet is using observability, but there’s a plan for them to eventually go down that path.

So keeping that in mind, what are some of the obvious security risks if, say, someone is a security architect who is deploying observability now? What am I looking for here? What are some of the low-hanging fruit, for lack of a better word, that they should be asking questions about?

Ted Young: Yeah. So observability has the same security risks that all open source libraries have, which is you’re taking this third-party dependency that you didn’t write, and then you’re installing it in every single service that you’re running. And so that is a juicy target, right? So this is what’s called a supply chain attack, and this is an area where I think [00:30:00] open source really has some growing up to do, right?

Like, open source has this sort of free-for-all background where we’re just creating stuff and sharing it widely. And it’s all very chaotic and organic, and there’s lots of great stuff happening, but one thing that’s not good there is security. So, in a world where it was just, you know, cowboys doing this on the side, that wasn’t a real target. But now that these libraries are getting deployed into big enterprises and the federal government, the military, and all these other places, you’re starting to see them get targeted.

SolarWinds is the huge example here. The SolarWinds hack was, you know, basically hacking an observability tool, right? This is a network monitoring tool that was installed everywhere, and hackers were able to get a tainted dependency baked into that tool. So when that tool got installed, their malware got installed too.

So that’s a [00:31:00] problem that we take very seriously in OpenTelemetry. That’s why we look at dependency chains really seriously. And we’re trying to take a hard look at how, on the one hand, we can say, here is a subset of things we provide that we have some security guarantees about. And then beyond that, here’s sort of the general ecosystem of plugins and instrumentation and things people have written for OpenTelemetry that you can install, but we can’t make any real guarantees about where they’re coming from or whether you should install them.

And so I think that’s a real problem that OpenTelemetry has to face. That’s honestly, though, no different than the problem your web framework is facing or any other widely

Yes. So then there’s also the fact that, you know, the data that’s coming out of this system could potentially be very useful to people, right? Like you’re talking about all the transactions that are happening in your system. We’re talking about getting an observable view of what your system is doing.

That sounds like it [00:32:00] could be useful to hackers and eavesdroppers. Sometimes people hear the term telemetry, and I’ve seen this happen on the internet, and I don’t quite get where it’s coming from. I think it’s coming from people not liking the fact that, you know, when you install your video game or, like, something from Microsoft, they want to get telemetry about their software out, and that can be seen as, like, oh, we’re spying on you. And it’s true, you could use this stuff to spy on people, but I don’t think telemetry systems are a really great choice there. You can go buy spyware tools that do spying and run those better than hacking up an observability system into spyware.

But yeah, you should think about PII. You should think about what information is potentially getting leaked through this tool that you don’t want leaked, and where that information is going. There’s maybe one extra thing with tracing, which is [00:33:00] in band. You’re sending some amount of information in band, and at some point that information may egress a trust boundary.

So you may be talking to, say, some third-party database over there, and your in-band tracing information, what’s called trace context and baggage, is just going to truck along with it. And so that potentially could be a leak. So if you’re a security person and you’re looking at evaluating someone who’s installed a tracing system, you want to look at what systems are downstream from yours that are potentially out of your trust boundary.

You also might want to look at things like how people can control or manipulate this observability system by sending it directives from the outside. But that’s really no different than, you might say, any other system. You want to have your reporting locked down.
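One way to act on the trust boundary concern is to drop the in-band tracing headers before a request leaves for a system you do not control. The sketch below is a simplified illustration, not an OpenTelemetry feature: the function `headers_for_egress` and the host names are hypothetical, and the header names assumed here are the W3C ones (`traceparent`, `tracestate`, `baggage`).

```python
from urllib.parse import urlsplit

# In-band tracing headers defined by the W3C Trace Context and Baggage specs.
TRACING_HEADERS = {"traceparent", "tracestate", "baggage"}

# Hypothetical allowlist of hosts inside the trust boundary.
TRUSTED_HOSTS = {"orders.internal.example", "billing.internal.example"}


def headers_for_egress(url: str, headers: dict) -> dict:
    """Drop in-band tracing headers when the destination is outside the trust boundary."""
    host = urlsplit(url).hostname or ""
    if host in TRUSTED_HOSTS:
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() not in TRACING_HEADERS}


request_headers = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "baggage": "user.email=alice@example.com",  # the kind of value that should not leak
    "accept": "application/json",
}

# Internal call: trace context and baggage travel with the request.
internal = headers_for_egress("https://orders.internal.example/v1/orders", request_headers)

# Third-party call: only the non-tracing headers go out.
external = headers_for_egress("https://db.thirdparty.example/query", request_headers)
```

An allowlist is used rather than a blocklist so that a newly added downstream host gets no trace context or baggage until someone decides it is inside the boundary.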

Ashish Rajan: Yeah, I think to your point, the standard logging kind of security risks would still apply, or identity, or whatever you’re talking about.

Ted Young: It’s the same stuff you’re doing for logging today. That hasn’t changed.

Ashish Rajan: [00:34:00] Yeah. So if someone wants to learn more about this, where can they get more information? I do want to get into the fun section, just the last section, so quickly: if someone wants to learn more about OpenTelemetry, tracing, and observability, what’s a good place to start and where can they find the resources?

Ted Young: Yeah. So opentelemetry.io is the website for the project. We have a GitHub organization, OpenTelemetry, open-telemetry on GitHub. Those are great places to start. There’s a repo in that GitHub org called community, the community repo. The README there has all of the how-to-get-in-touch-with-us information.

So we have a lot of meetings. Yeah, we’re big fans of having face-to-face meetings to discuss things. So the project itself is on Zoom and GitHub issues. You can find the Google calendar there for when all our meetings are. They’re open, you can come to them, and they’re all recorded and put up on YouTube.

So it’s totally transparent, and anyone can come to these meetings. And we also hang out on Slack. So we’re part of the CNCF [00:35:00] Slack instance. And if you go to that community repo, you can find the links for how to join the Slack instance, and you can say hi there. So those are the best places to get a hold of us.

And if you just go to the website or look at the repos, there’s lots of getting-started material. I work at LightStep. LightStep is an awesome company. We provide a lot of training materials. So if you go to the LightStep website, lightstep.com, we have training material there.

I produce a lot of YouTube videos, and they mostly get produced on LightStep’s YouTube channel. So if you go to the LightStep YouTube channel, there’s an OpenTelemetry playlist there, and that’s where you can find my content. And I have a lot of videos that are kind of like overviews of, you know, here’s the design and architecture of OpenTelemetry.

Here’s how to get started with it. Here’s what the point of it is. Here’s what all these terms mean. If you like the video format, I recommend that. I think I’ve got the most comprehensive set [00:36:00] of video materials.

Ashish Rajan: I’ll definitely link that up in the show notes as well.

So I know we’ve been talking about OpenTelemetry and observability for some time, and I did want to have a few minutes for our fun section right at the end, which is non-technical questions just to get to know Ted a bit better, and your cat. Yeah, what’s the cat’s name?

Ted Young: This is Penny.

Ashish Rajan: Hi, Penny.

She definitely wants attention. Hey, Penny. She definitely cannot give attention to me right now. But if she’s distracted by you, man, I’ll just jump right into the fun part. The first question being: what do you spend most of your time on when you’re not working on cloud native or OpenTelemetry?

Ted Young: When I’m not working on OpenTelemetry and I’m not working on LightStep, then these days it’s a lot of drawing, a lot of art. I used to be an animator before switching over to internet stuff, right? So lately I’ve been getting back into that a bit. So some drawing, some painting. I post a little bit on Instagram.

But yeah, I think I’m getting into writing some scripts soon, maybe filming a thing or two. So yeah, it’s kind of, yeah.

Ashish Rajan: Yeah. So you have a whole creative side to yourself as [00:37:00] well.

Ted Young: Yeah. And all this stuff is, is creative. Honestly. I think there’s a lot of creativity that goes into computer programming.

Ashish Rajan: Yeah. Yeah. Cool. I was going to say, we should definitely look out for that movie script when it comes out. Next question: what is something that you’re proud of, but that is not on your social media?

Ted Young: Proud of, that’s not on my social media? So I live on like a half acre here and about half of that’s cultivated.

So we have like a little tiny farm in the back. And so that’s kind of where I spend the rest of my time. And I’m real proud of this setup we’ve got going on over here. So getting outside and growing stuff, that’s, yeah, that’s where I’m at these days.

Ashish Rajan: What are you growing this season? Like, what’s the season for?

Ted Young: Yeah, this season, I’m all about caprese, love caprese. We’re growing a lot of tomatoes and basil. We’ve also got three sisters going, which is when you take corn and beans and squash and you kind of grow them as a coherent unit. So it’s a traditional American way of growing food, and I’ve been doing a lot of that.

Plus we’ve got, you know, some apples and plums and [00:38:00] pears and things like that. A lot of berries. A lot of things got scorched. We had this insane heat dome where it got up to 115 degrees here, and it literally burned a lot of the fruit, like someone took a magnifying glass to it, and that really thrashed a lot of horticulture up here in Oregon, unfortunately.

So the climate change stuff is a thing that I put a lot of effort into, like raising awareness there. That was how I actually got started on the internet: doing work with the environmental movement to kind of help leverage the early days of the internet to raise awareness. People know about it now, but if you actually roll back to like 2001, 2003, most people were pretty clueless about this.

They just literally hadn’t heard about it. So that was actually how I got into the internet, was kind of advocacy around that.

Ashish Rajan: Were you one of the first to start those kinds of petitions on the internet?

Ted Young: Yeah. I mean, I won’t say like the first, but yeah, very early on, working with a guy named Bill McKibben, who’s been around for a long time.

We started on college campuses with a thing called Step It [00:39:00] Up, which then turned into an organization called 350.org. And then that’s now turned into what’s called the Sunrise Movement. So there’s sort of been like three generations of this. But yeah, a lot of the early days of that stuff.

The Keystone XL pipeline campaign and things like that. So people have different opinions about this stuff, but unfortunately we’ve been moving real slow on actually doing something about it, even though we know about it. And certainly in Oregon, we’re really starting to feel some of the crazier effects as the climate shift kind of hits the sort of hockey stick curve.

You’re going to see a lot over the next decade. A lot of just like really bizarre. Yeah.

Ashish Rajan: I mean, yeah, I know we won’t get into that, but I’m with you, we definitely need to do a lot more work. I think Australia is very similar as well. You almost feel like we could do a lot more.

Ted Young: No, it’s just going to get weird over the next decade.

You’re just going to see a lot of weird weather, basically. It’s going to be very, yeah, yeah.

Ashish Rajan: I think they were saying it was the hottest day in Sweden, which is like a cold country or something. It’s like weird things [00:40:00] are happening. But yeah, I’ve got one more question for you and that’ll be the end.

What’s your favorite cuisine or restaurant that you can share?

Ted Young: Oh, so many. I love ramen. I’m a big fanatic. I’m from Hawaii originally, so we had a lot of that growing up, and so yeah, I’d throw that out there. That’s my fave.

Ashish Rajan: Dude, thanks so much for this. I really appreciate that.

And I’m so glad we got to know the other side of Ted as well. Where can people find you, where you’d normally hang out, if they have more questions about observability?

Ted Young: The best place to hit me up is on Twitter. So I’m tedsuo on Twitter. I post links. It’s a pretty low-volume Twitter account. I’m mostly posting links to conversations and material related to OpenTelemetry.

So that’s the place to follow me if you’re interested in this. My DMs are open, so you can always hit me up there. Otherwise, on the OpenTelemetry Slack instance, just hop onto that Slack and say hi there.

Ashish Rajan: All right. I’ll definitely encourage people to check Ted out and I’ll definitely encourage people to check out observability and tracing as well.

So thanks so much for coming on the show, man. I really appreciate this. And looking forward to having more conversations with you about

Ted Young: Yeah, absolutely, [00:41:00] man. It was a lot of fun.

Ashish Rajan: Thank you. All right, everyone. I’ll see you next week.

Ted Young: Bye.


