Building Observability Platform for Scale


Episode Description

What We Discuss with Colby Funnel:

  • 00:00 Introduction
  • 05:24 Cloud Native and Observability
  • 09:43 Security vs Observability
  • 12:26 Monitoring vs Observability
  • 15:47 Where to start when building an Observability Platform
  • 22:00 OpenTelemetry
  • 24:53 Importance of Logs?
  • 26:57 Tracing and Metrics
  • 32:32 Measuring the maturity of Observability Platform
  • 36:03 Shared Responsibility in Observability
  • 38:26 Risk of Observability Platform
  • 40:19 Scalability of an Observability Platform
  • 43:09 The Fun Section
  • And much more…

THANKS, Colby Funnel!

If you enjoyed this session with Colby Funnel, let him know by sending him a quick shout-out on LinkedIn.

And if you want us to answer your questions on one of our upcoming weekly Feedback Friday episodes, drop us a line at ashish@kaizenteq.com.

Resources from This Episode:

  • Tools & services, discussed during the Interview

Ashish Rajan: Welcome, Colby. I’m so glad you came in.

Colby Funnel: Thanks for having me.

Ashish Rajan: I’m super excited. I have a local Australian on after a long time, so I appreciate you hanging out with me early in the morning. I want to start off by saying cheers.

For people who may not know, Colby and I are both in lockdown cities at the moment, so we’re making the most of our time at home. But for people who may not know you, Colby, can you tell us a bit about yourself?

Colby Funnel: Sure. I have been in the tech industry for many years. I started the journey on a sort of help desk, sysadmin path, so not through the developer path, so to speak. I always had a bit of an affinity for monitoring. I remember one of my first gigs was setting up SNMP [00:01:00] monitoring for a telco on all their Cisco routers.

And from there, it’s just sort of leveled up. I started at Atlassian 10 years ago, building out some of their first iterations of cloud when it was all running on hardware and not on Amazon, and really enjoyed the challenges of flying around to different data centers in America and spending time overseas, setting up servers and racks and telemetry for all these things.

But more recently in the last sort of five years, we’ve been working with the observability org in Atlassian and focusing on getting the richest data set and best insights for the Atlassian cloud.

Ashish Rajan: Yeah. Wow. And building racks, I appreciate people from that era as well.

Cause I started off in identity and access management. But before that I did a bit of [inaudible]. A lot of IP addresses to remember, man. It becomes second nature after a while.

Colby Funnel: Thankfully, people smarter than me designed this thing. It was using containers before Docker was a thing. There are patents out there about how they built these sort of read-only, deterministic systems.

And it was an amazing experience, seeing how tech can be done at scale. [00:02:00] Super smart.

Ashish Rajan: I’m glad I have you here, cause we’re going to get to talk about observability at scale as well. So, talking about observability and cloud native, people kind of use those two terms together sometimes.

So how do you define cloud native and observability?

Colby Funnel: Gosh, it’s a bit early for such philosophical questions, isn’t it? Cloud native to me is the story of Atlassian splitting, right? Netflix did a very similar thing, and that’s what we sort of copied. Atlassian, a number of years ago, decided to fork the code base: one code base, which is the behind-the-firewall, licensed stuff that you run on prem, which is how Atlassian started, and the other for cloud.

And that really meant that we could move both of those streams of work much faster individually. We didn’t have to write a feature that worked on people’s servers as well as on our cloud, and we didn’t have to write plugins that worked for both. So cloud native to me is: you are focused a hundred percent on running this in the cloud for someone, and it lives and breathes there.

It starts in the cloud and finishes in the cloud. It’s not something that is just running on someone else’s computer. In terms of observability, [00:03:00] look, observability to me is being able to understand and answer any question you have of your systems, applications, and users, and specifically it’s about not knowing the questions up front, right?

So you’re not instrumenting answers; you’re instrumenting apps so that you can answer any question later on.

Ashish Rajan: Honestly, you’re almost preempting what would be useful for a future question?

Colby Funnel: Yeah. I remember yelling at users years ago, saying, no, no, no, you have to go through your list of what questions you might need to answer, imagine what could go wrong, instrument and answer all these questions up front, and set up alerts and all that sort of stuff.

But the world, and especially cloud native, is too complex for that. You can’t answer all the questions up front. Especially with microservices, you might have a hundred services interacting for one user request, and there’s no way that you can think of all the things that can go wrong up front.

So it’s really about instrumenting your apps and collecting the data, so you can say after the fact: hang on, how did this user interact with this, and how fast was it?
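To make the "instrument apps, not answers" idea concrete, here is a minimal sketch in Python. It assumes one wide event is recorded per request and that the question is only asked afterwards. The field names (user, service, duration_ms, status) and the ask helper are invented for this illustration and are not from the episode.

```python
from collections import defaultdict

# Hypothetical wide events: one dict per request, carrying whatever context the app had.
events = [
    {"user": "alice", "service": "checkout", "duration_ms": 120, "status": 200},
    {"user": "bob", "service": "checkout", "duration_ms": 950, "status": 200},
    {"user": "alice", "service": "search", "duration_ms": 40, "status": 200},
    {"user": "bob", "service": "search", "duration_ms": 35, "status": 500},
]


def ask(events, where, group_by, value):
    """Answer an after-the-fact question: filter, group, then average a field."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            groups[event[group_by]].append(event[value])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# A question nobody wrote down up front: how fast was checkout, per user?
checkout_latency = ask(
    events,
    where=lambda e: e["service"] == "checkout",
    group_by="user",
    value="duration_ms",
)
print(checkout_latency)
```

The point of the sketch is that the same stored events could answer a different question tomorrow (errors per service, slow users, and so on) without re-instrumenting the app.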

Ashish Rajan: And actually that’s an interesting point, because [00:04:00] instrumentation was kind of covered in an episode a couple of weeks ago with Ted Young. We got into instrumentation and OpenTelemetry, but it was more in a general context, for people to get an understanding of it.

I’m curious to know, from an organization’s usage perspective, what do these things mean? With instrumentation, are we talking logging, monitoring? What are we talking about?

Colby Funnel: That’s really interesting. The entire point of OTel is to answer this question.

There isn’t a consistent answer in all of observability for this. I can tell you, the places I’ve worked and the people I’ve spoken to all struggle with this. We at the moment allow developers to generate data however they like, right? So they can send logs, which go into Splunk. They can send metrics, they can send traces, they can even send analytics events to systems outside of us. They can sort of do whatever they like. And the struggle that creates is, if developers choose how to send the data and how to instrument the data, then they’re also on the hook for how to use that data.

So it’s a kind of catch-22 in that respect. The idea of instrumentation [00:05:00] for us is that a data system, or an observability system, or a security system, is only as good as the data you send it. We want to help people generate the right data from the beginning, which is therefore all about instrumentation, and hopefully OTel then helps to solve this.

Ashish Rajan: And I’m glad you mentioned security and observability as well. So what does that mean? I always imagined an incident response scenario where, I guess, a well-laid-out observability platform can help. Are there other scenarios as well? How can security be enabled by observability?

Colby Funnel: Maybe an unpopular opinion, but I don’t think they’re very different. I think that security and observability, at least, could just be different use cases of the same data. Especially at scale, where things like the cost of data and the performance of huge amounts of data come into play, if we’re collecting the data once and using it for different things, then I think we’re in a much better place.

If you look at the typical example of an HTTP access log, we’ve got all these 12-factor apps out there and microservices. We might send [00:06:00] an access log, which has a huge amount of information. That might go to Splunk, and security run their queries, and it might be Apache combined access log format or something.

And then on the observability side, we might send a trace, which has all the same information as the HTTP access log. We might then send HTTP metrics and all this sort of stuff, which is just duplicating or triplicating (is that a word?) the data. So I think that security and observability could just be different use cases for the same data.
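As a rough sketch of "collect once, use for different things", the Python below takes a single request event and derives both a detailed, access-log-style line (the kind of view a security team might query) and a low-cardinality counter key (the kind of view a dashboard might use). The event fields and the helper names to_access_log and to_metric_key are assumptions made up for this example; the log line is a simplified format, not the real Apache combined format.

```python
from collections import Counter

# Hypothetical request event, collected once by the application.
request_event = {
    "trace_id": "abc123",
    "client_ip": "203.0.113.7",
    "method": "GET",
    "path": "/admin",
    "status": 403,
    "duration_ms": 12,
}


def to_access_log(event):
    """Security-style view: keep the full detail of a single request."""
    return "{client_ip} {method} {path} {status}".format(**event)


def to_metric_key(event):
    """Observability-style view: a low-cardinality key for counting trends."""
    status_class = event["status"] // 100 * 100  # 403 -> 400
    return (event["method"], status_class)


request_counts = Counter()
request_counts[to_metric_key(request_event)] += 1

print(to_access_log(request_event))
print(dict(request_counts))
```

Both views come from the same record, so nothing is sent twice; the trade-off Colby raises later (security wanting unsampled detail, developers wanting trends) shows up here as the difference between the two functions.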

Ashish Rajan: It’s true. I’m glad you mentioned this, because last week we had a conversation about security data lakes, and the conversation wasn’t that, hey, the security team should create their own data lake. It’s more like, if you already have an existing data lake in your organization, just use that information to build security metrics around it.

And I think you’re kind of going through the same thing where if you’re already doing observability in an organization, why not just tap into that instead of creating your own observability platform? I guess that’s kind of where I’m going with it.

Colby Funnel: Yeah. Look, I understand. Cause we have very similar conversations internally.

I understand that there are [00:07:00] nuances, right? Security will want dimensions on data that most developers don’t. Security don’t like things like aggregation and sampling, right? Where developers might want to see trends, security want to see absolute details. But I think, working with an observability platform, you can solve both of those use cases without having to build a secondary system.

Ashish Rajan: And I think it’s interesting you mentioned that a developer is interested in trends. From a security perspective, it’s basically looking at: what am I investigating, what am I ignoring, what is good, what is green, what is red? Which is a very different way of looking at logging and monitoring.

So, talking about monitoring, because I think a lot of people still get confused between observability and monitoring, the way you mentioned it, it’s really interesting how different they’re...

Colby Funnel: Not. Yeah. Observability is a marketecture word (I love that) that big companies are using to drive big-dollar sales, is my opinion.

Look, without being too cynical, it’s an evolution of monitoring, right? Monitoring in the old days, like I said before, is you instrument your monoliths for questions and answers that you might want [00:08:00] before the fact, whereas observability solves the same problems in a much higher-scale, much more complex world with many more moving parts.

But at the end of the day, it’s still sending data. It’s still measured the same way. Around incident response, it’s just new techniques, right? And you see that through the tech industry. It’s not a new thing. It’s just sort of an evolution of best practices and maturity.

Ashish Rajan: right? Yeah. And that’s definitely good.

That makes me go, okay, so if observability and monitoring are similar and, to your point, this is the next evolution, why make a different platform for it? Is it because of the existing solutions? Look at how it was at the beginning of people moving into cloud.

There were certain companies built in the cloud, and then there were people on premises trying to make their solution fit into the cloud. Is this one of those situations, where some observability platforms were built with that mindset from the beginning and are probably more cloud native?

Is that kind of a thing there as well?

Colby Funnel: Well, you could use observability tools from 10 years ago and quite effectively [00:09:00] have the same level of stuff today. I mean, it’s not that different. Vendors will have you believe otherwise, and look, if you’re aiming to min-max and get the absolute most, then maybe. But under the hood, Splunk is a logging platform that receives logs, and that hasn’t changed a great deal in 10 years.

Metric systems, the vendors, the Datadogs and SignalFxs, they’re just a slightly tweaked and much heavier-scale time series database with graphs. It’s not that different from what it was 10 years ago.

Ashish Rajan: Interesting. But it has a much more, I guess, a different kind of lipstick on it.

Colby Funnel: Yeah, look, tracing is kind of the newest thing.

And even that is not that new, but the jury is still out, at least for me, on whether it’s the be-all and end-all of all things observability. I think the problem with tracing is you need absolute coverage of all things before it becomes super impactful. And that’s a hard place to get to when you’re talking about thousands of microservices.

Ashish Rajan: Ooh, that’s an interesting point to segue into the next question that I had. Say a company is looking at an observability [00:10:00] platform, and obviously they might be starting small. They might already be doing some kind of logging. They may have a SIEM solution from, I don’t know, any one of the vendors out there, and they’re switching to observability.

What are some of the initial challenges? It sounds like tracing is probably already coming out as the champion for the hardest part, but I’m curious, where does one start in terms of building an observability platform for scale?

Colby Funnel: So firstly, understanding why you want it, right? I will wave a flag and say that everybody must have observability. But at the end of the day, the amount of money that you spend, and these things get really expensive, has to give you something, right? So what value do you want, and how much do you value customer trust?

Everyone’s like, of course we value customer trust, but there’s a huge difference between five nines and three nines in terms of what you spend, right? So do you want to spend the hundred million or the 10 million? And what does that get for you? So I’d really make that clear, right? Cause that’ll dictate exactly how much money you’re going to spend on this thing and how many people you throw at it.

In terms of the tech, it’s interesting. The industry has gone... [00:11:00] 10 years ago the world was open source. Right now, or two years ago, I’d say it was nothing but vendors. And we’re starting to see the trend go back a bit more to open source now, which I think is a really good thing.

There’s also a whole heap of younger, much more agile vendors these days that offer more inclusive, sort of all-in-one observability without all the enterprise attached to it, right? I’d suggest it is very, very common: everyone I speak to at large companies has probably 12 different observability systems.
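For a sense of the gap between three nines and five nines mentioned above, the arithmetic can be sketched in a few lines of Python. This only computes the yearly downtime budget implied by an availability target; the function name is made up for the example, and the dollar figures from the conversation are not modelled.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years for simplicity


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability target of N nines."""
    unavailability = 10 ** (-nines)  # three nines -> 0.001, i.e. 99.9% available
    return MINUTES_PER_YEAR * unavailability


for nines in (3, 5):
    minutes = allowed_downtime_minutes(nines)
    print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")
```

Three nines allows roughly 8.8 hours of downtime a year, while five nines allows a little over 5 minutes, which is why the spend needed to get there differs so much.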

Ashish Rajan: Essentially, even more than one phone.

Colby Funnel: Yeah. And that’s the problem they might have. I mean, Atlassian has it. We have a metrics platform, we have a logging platform, we have a tracing platform, an analytics platform. And we think that that’s great because it’s best practice, right? The best tool for the best job.

But then you look at the developer, who’s looking at a problem and then has that many tabs open, trying to understand this and that. We kind of forget that it’s the developer that has to use these things, and that their job is much harder. And that then impacts things like incident resolution times and time to detect.

Ashish Rajan: I’m just thinking about [00:12:00] what you said. Even from a security perspective, we always end up in a situation where we have multiple tools. You might be trying to find out, hey, what are the public IPs on my AWS accounts, as an example.

But it’s not just one script and you have all the answers. There are all these different things that you have to go into. So I believe observability kind of falls in the same bucket as well, where there are so many best-of-breed tools that you can go for.

But at the end of the day, you end up with five or six things that you have to somehow combine again. You might as well make a product about it. So to your point, if one’s figured out that, hey, it’s important for me to have the five nines for availability, especially now with remote working, where everyone’s like, availability is important, integrity is important.

As the folks from cyber security, the CIA people, would say: confidentiality, integrity and availability. Is observability playing a part in this as well? So when you’re building a platform, is that the metric that you need to build on? Because you mentioned the cost aspect of five nines versus three nines.

[00:13:00] So are availability and reliability a huge component of why people go into observability as well?

Colby Funnel: Yes. Availability and reliability are some of the numbers. I’d throw performance in there as well, right? You might say that an app is unavailable if it’s not performing, right? If a user then closes the tab.

The typical metrics that you’d measure observability by would be time to resolve an incident. So for the use case of incident detection: time to resolve, time to detect. It’s not an industry thing, but I’d also use time to diagnose, how long it takes you to actually figure out what the problem is.

Mean time between failures, so how long between incidents for different services and applications, these sorts of things. Reliability is a good measure, but it’s generally not enough.
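The measures listed above (time to detect, time to diagnose, time to resolve, and time between failures) can be sketched from incident records as below. The record fields and the exact start and end points of each interval are assumptions for illustration; teams define these boundaries differently.

```python
from datetime import datetime, timedelta

# Hypothetical incident records; the field names are assumptions for this sketch.
incidents = [
    {
        "started": datetime(2021, 8, 1, 9, 0),
        "detected": datetime(2021, 8, 1, 9, 10),
        "diagnosed": datetime(2021, 8, 1, 9, 40),
        "resolved": datetime(2021, 8, 1, 10, 0),
    },
    {
        "started": datetime(2021, 8, 11, 9, 0),
        "detected": datetime(2021, 8, 11, 9, 20),
        "diagnosed": datetime(2021, 8, 11, 9, 30),
        "resolved": datetime(2021, 8, 11, 11, 0),
    },
]


def mean_delta(pairs):
    """Average a list of (earlier, later) datetime pairs as a timedelta."""
    deltas = [later - earlier for earlier, later in pairs]
    return sum(deltas, timedelta()) / len(deltas)


time_to_detect = mean_delta([(i["started"], i["detected"]) for i in incidents])
time_to_diagnose = mean_delta([(i["detected"], i["diagnosed"]) for i in incidents])
time_to_resolve = mean_delta([(i["started"], i["resolved"]) for i in incidents])

# Time between failures: gap from one incident being resolved to the next one starting.
time_between_failures = mean_delta(
    [(a["resolved"], b["started"]) for a, b in zip(incidents, incidents[1:])]
)

print(time_to_detect, time_to_diagnose, time_to_resolve, time_between_failures)
```

Tracking time to diagnose separately, as Colby suggests, isolates the part of an incident that better telemetry is most likely to shorten.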

Ashish Rajan: Right. And so once you’ve decided what’s important for you with reliability and availability, what’s the next step for building that platform?

So you’ve figured out, okay, I’m happy to spend X amount of money, and I’ve got this problem of plenty of best-of-breed tools. But I want to start somewhere small so I don’t get [00:14:00] overwhelmed. Where do you recommend people start building a platform that hopefully can scale into multiple observability platforms, but at least like one version?

Colby Funnel: Like, a smallish company? I would just go with one of the vendors that offers the sort of all-in-one thing, where you might drop in a client library or some agents, put that on your infrastructure, and get some opinionated views of that data, right? If you don’t know what you’re looking for at all, then somebody’s opinion of what you should be looking at is a pretty good starting point.

The problem in the industry, and hopefully OTel solves this, is that vendor lock-in is such a real problem. If you do start with this vendor over here, it becomes really difficult to then move to this thing over there. But if you’re starting out fresh and know very little, then I think an all-in-one solution from one of the many vendors out there is probably the best place to start.

Ashish Rajan: That’s an interesting thing you mentioned, cause I always assumed, isn’t OpenTelemetry an open standard, so I should be able to move across anywhere, right?

Colby Funnel: Eventually. That’s one of the goals.

Ashish Rajan: It’s not happening right now? As in, with OpenTelemetry as it stands [00:15:00] right now?

Colby Funnel: One of the fun things about OpenTelemetry is that it’s a real double-edged sword. You’ve got all of these vendors that have come together to say, we are going to create the new standard. And that’s amazing. I never thought I’d see all these vendors say, yes, let’s work together and let’s do this.

On the other side, you’ve got vendors saying, well, hang on, no, my standard’s better than yours, so just do my standard and that should be the standard. Right? And that’s why we’re seeing a bit of delay in actually getting to this final standard for what data shapes should look like and what instrumentation should look like.

Interestingly, the first versions of OpenTelemetry didn’t have logs. It was all just metrics and tracing. But due to some prolific logging companies out there being involved, now there’s a logging standard that’s coming along. So it’s very vendor-led at the moment, and I’m hoping we can change that sometime soon to just really go back to the roots: data standards.

Ashish Rajan: So, the three pillars of observability that people talk about. I guess to your point, if I’m a smallish company trying to build an observability platform which my [00:16:00] developers can use, and hopefully my security teams can use as well, I’m happy to just invest in an all-in-one solution for observability.

And use that for my logging, tracing and, I guess, metrics.

Colby Funnel: Yeah. I’d be really careful with the pillars of observability. Again, really fantastic marketing fodder. But I would never say that somebody has a higher-value observability solution because they have metrics, logs and traces. I’d say they have an expensive one.

I mean, if you really wanted to, you could do all of this through logs. You can log a metric, you can put everything in Splunk or something, if you want to spend the time and energy doing that. It really is about figuring out what you want from this and getting that. Metrics, logs and traces were a good way to explain how to get started. If you have these, you’re on a good path, but it’s not a destination at all. You shouldn’t just say, I’ve got metrics, logs and traces, I now have observability. You need to be able to say, I can answer the question of what this user did at this point that I’m interested in, or what’s the performance of this tenant with this app and blah, blah, [00:17:00] blah, and all the data types, metrics, logs and traces, come together to help you answer that.

Ashish Rajan: That’s true. That’s interesting, cause clearly I’ve been drinking the marketing Kool-Aid quite a bit. For me, those are the pillars. So wait, which one is it important to focus on? Keeping the marketing piece aside, is it, to your point, logging then?

Because I’m already doing a lot of logging at our organization.

Colby Funnel: Look, I’d actually say that you don’t need logs. Logs are unstructured, right? They’re just a blob of something that says maybe this thing happened and here’s some metadata about this thing. That’s the most expensive, slowest and least often useful bit of data in the observability suite.

Right? So if you look at metrics, they’re a trend of things. They’re really fast to query, and you can store them for a long time because the data footprint’s tiny. These are the things that you might send alerts off, and they’ll show you pretty graphs of what’s going on, and that gives you a general sense of the world.

So, you absolutely need a general sense of what’s going on. [00:18:00] Tracing gives you the ability to connect things, right? Tracing is the context propagation tool, where you might say this request then touched this, this, this and this, and it used these sorts of resources. And I think that metrics and tracing together give you enough of a picture without logs.

Topical, but I do believe that you can do most of the use cases for observability. And I actually think with security too, if the security industry looked at HTTP spans and tracing, I think you could get most of your data from spans as well.
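The "context propagation" idea described above can be illustrated with a small Python sketch that passes a trace id and a parent span id from one service to the next. This is not the OpenTelemetry API: the header names x-trace-id and x-parent-span-id, the service names, and both helper functions are invented for the example. Real systems typically carry this context in a standard header such as the W3C Trace Context traceparent header.

```python
import uuid

collected_spans = []  # stands in for a tracing backend


def start_span(name, incoming_headers=None):
    """Start a span, continuing the caller's trace if the headers carry one."""
    incoming_headers = incoming_headers or {}
    span = {
        "name": name,
        "trace_id": incoming_headers.get("x-trace-id") or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": incoming_headers.get("x-parent-span-id"),
    }
    collected_spans.append(span)
    return span


def outgoing_headers(span):
    """Headers to attach to a downstream call so the trace stays connected."""
    return {"x-trace-id": span["trace_id"], "x-parent-span-id": span["span_id"]}


# One user request touching three hypothetical services in turn.
gateway = start_span("gateway")
orders = start_span("orders", outgoing_headers(gateway))
payments = start_span("payments", outgoing_headers(orders))

for span in collected_spans:
    print(span["name"], span["trace_id"], span["parent_id"])
```

Because every span shares one trace id and points at its parent, the backend can reconstruct "this request touched this, this and this" without anyone searching logs.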

Ashish Rajan: Yeah, cause as you mentioned, if metrics and tracing are the two important components, I can only think of one scenario, and you can validate whether this is wrong or not.

If you’re looking at HTTP logs, a lot of people can pick up a denial-of-service attack fairly quickly, where if your metric for a good day is 10,000 requests per second or something, but suddenly you’re getting 20 or 30 times that, you see that trend and you go, oh, there’s something wrong there. Your machine, obviously being in cloud, can handle it, but you’re going, there’s something happening [00:19:00] here.

So would that be a good example of the difference between tracing and metrics?

Colby Funnel: Yeah. So we always ask, or hope, that users start with metrics for all things, cause they’re the most lightweight, most near-real-time data. So you might have a dashboard that shows you your HTTP requests, your application usage, right? Number of users logging in, the number of hits, those sorts of things.

They might be seasonal, but as long as everything’s following a consistent, known, expected pattern, everything’s great. Typically, though, what’s the next step when something goes up and you’re like, oh crap, what do I do now? Most developers, and I imagine security too, would just jump straight into Splunk and say, show me everything that happened around there.

Right? And then spend maybe an hour trawling through a thousand log lines, looking for the needle in the haystack. Tracing gives you more context, right? So metrics might say, yeah, something’s going on, and tracing might tell you where it’s going wrong.

Right? It might say, in this request that’s hit these 30 different services, this one over [00:20:00] here, this span is taking a bit longer than normal, or using more resources, or something like that. So it helps give context and sort of zero in on what the issue is. And from there, you might jump into logs to find specifically what happened, but it’s a much smaller surface area for you to actually have to search through.
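The "zero in" step described above, where a trace points at the one span that is slower than normal, might look something like the following Python sketch. The span data, the per-service baseline, and the threshold factor of 3 are all assumptions for illustration; a real platform would derive the baseline from historical metrics.

```python
# Hypothetical spans from one slow request, with durations in milliseconds.
trace_spans = [
    {"service": "gateway", "duration_ms": 15},
    {"service": "auth", "duration_ms": 22},
    {"service": "inventory", "duration_ms": 480},
    {"service": "billing", "duration_ms": 35},
]

# Hypothetical "normal" duration per service, e.g. taken from historical metrics.
baseline_ms = {"gateway": 12, "auth": 20, "inventory": 40, "billing": 30}


def slow_spans(spans, baseline, factor=3.0):
    """Return spans that took more than `factor` times their usual duration."""
    return [
        span for span in spans
        if span["duration_ms"] > factor * baseline[span["service"]]
    ]


suspects = slow_spans(trace_spans, baseline_ms)
print([s["service"] for s in suspects])
```

With the suspect narrowed to one service, the log search that follows covers a far smaller surface area than "everything that happened in the last hour".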

Ashish Rajan: Oh, yeah, because I remember in an incident response scenario, the first thing people are trying to find out is, who do I call first? Because clearly, as a security person, I have no idea what this application does.

I mean, I have a high-level idea, but it’s not specific enough.

Colby Funnel: Oh, yeah. And it’s made so much worse with inconsistent instrumentation. You may not even be able to understand different teams’ and different services’ telemetry because, well, they’re sending HTTP with a typo or something, and I can’t find their data, and so on and so forth.

Ashish Rajan: Yeah. And I think the developers would know how they did the instrumentation. Security obviously doesn’t have an idea, so they have to go and find the person responsible instead of trying to find a trace. [00:21:00] And funnily, as we were talking about this, I’m realizing we already have a lot of metrics, especially for people who have been working in the cloud space for some time. I’m just going to use AWS as an example: CloudWatch, EC2 instances, they even use the word metrics.

You can define your own metrics in CloudWatch. So actually we’ve been doing metrics for a long time. It’s just that there was no banner saying, hey, this is observability.

Colby Funnel: Yeah, and this is where I get a bit nitpicky with the whole observability marketecture thing, because it’s not new and it’s not different.

It’s a bit of an evolution, but everyone’s been sending telemetry and metrics and logs for years now. We need to use them slightly differently because of microservices and architecture and cloud and all these things, but at the end of the day, it’s the same.

Ashish Rajan: And so, well, I’m going to use the word if you want: tracing back to my question about building out the platform. So I bought an observability tool, I guess a best-of-breed, all-contained solution. I’m starting to build metrics. I’m already doing logging, but the intent is [00:22:00] to start building metrics on it, which you can probably make some traces from.

Would that be the next obvious step after you’ve decided on the platform, considering you’re already sucking in the logs, but it’s more about what my business metrics are?

Colby Funnel: Well, yes. And this is where OTel hopefully will solve this with standard instrumentation. Every vendor today will tell you a different way of generating data. They’ll say, use this client library in your application and it will automatically send a bunch of stuff, or put this agent on each host and it will generate all of this data for you.

So there’s no one way today for me to say, just do this. If I had to generalize, you would be doing something in your application to send some data. OpenTelemetry, hopefully soon-ish, will answer what that something is. It’ll be a client library and it will generate a consistent set of data.

And hopefully, if you’re using common frameworks and common tech, it will do it nearly for free. You’ll just sort of include this library and it will generate spans showing what’s going on, it’ll generate metrics, and if certain vendors have their way it’ll generate [00:23:00] logs, and you’ll be able to then understand what’s coming out of your apps and use that.

Ashish Rajan: Oh, and so we’re not there yet at this point in time, then?

Colby Funnel: It’s close. There are definitely client libraries out there. The issue that we have rolling this out to production at scale at the moment is simply that it’s all still experimental, right? So we are waiting for the community to say, right, this is the 1.0, this is generally available.

And bits and pieces are there, right? So I don’t want to slam OTel. I think they’re doing an amazing job. But I’d like it sooner.

Ashish Rajan: Right. And that actually makes me think, if this is at kind of an experimental stage, what do you consider the metrics of a good observability platform?

Every time I say metrics or tracing, I’m thinking all my questions are going to go into tracing, metrics or observability. How do you measure a not-so-mature observability platform?

Colby Funnel: So in terms of maturity, I’ve always thought about it as very similar to SRE or ops, in that you’ve got different levels of maturity. Typically, and I think most [00:24:00] companies fall into this bucket, it’s reactive, right?

That’s the first level: something’s gone wrong, I’m going to react and figure this out. Observability obviously helps with that. But at that point something’s already gone wrong. So the next level of maturity might be proactive, in that I know something’s about to go wrong, or when something does go wrong, I’m already prepared for it.

I might have some automation. I might have failover. I might have something, right? And good companies are still struggling to do that. And I think the holy grail is just prevention, right? So, stopping things before they even go wrong. And that’s a really interesting one, because how do you measure that?

How do you measure when you stop something going wrong? So we are at the moment struggling to sort of go, well, hang on, if we really succeed, we won’t be able to tell anyone we succeeded, because nothing happened.

But yeah, there are ways around that. It’s really interesting. So I’d definitely say reactive, proactive, preventative. I stole that off the internet somewhere, so that’s not me being wise.

Ashish Rajan: Okay. So if that’s the case, are there use cases for [00:25:00] security in there as well? Coming from an incident response perspective, it clearly sounds like we would love to be preventative, but most of the time we are reactive in incident response. And with cloud security, a lot of people, at least I, would love for it to be preventative, but we don’t have a preventative measure there either.

I mean, folks like Amazon, Azure or Google, all these people who are collecting so many metrics, they clearly see us doing stupid things. They clearly have metrics of our stupid things at their end, like an S3 bucket open to the internet or whatever, but they obviously would not open that up to us.

So it’s up to us to figure out what that means for an organization. Or is it a split, almost like a shared responsibility there as well? Because I guess now AWS also has observability, Azure has observability, and I guess they are all opening up to observability even if they haven’t really got it yet.

So are there components there as well, where it’s more of a, hey, this is something that developers would look after, and this is something that a security team [00:26:00] or an SRE team would look after? Is there a shared responsibility component to observability as well?

Basically, what I’m trying to get to is: is this still a collaboration kind of thing, where one person or one team is responsible for generating and identifying the metrics, and another team is kind of observing?

Colby Funnel: No, it depends on your org structure. I believe the old ops days are gone.

We’re still a bit there with security, and the legacy hasn’t fully disappeared, where you have an ops team over there and a dev team over here; the dev team does a thing, and then the ops team gets woken up because the devs did the thing. Once you remove that, and everyone I speak to is in the same boat now, this is what DevOps is, as much as I hate the term: developers are on the hook for their own stuff.

Right. And that means instrumenting observability. It means viewing dashboards and alerts. And in many cases it also means generating the right data for the security team to use. But in that sense, you still have a developer instrumenting something and throwing it over the fence to the security team, who have all the common [00:27:00] problems of: I don’t know what they’ve done, I don’t know how to look at their data, they’re doing this thing for me but they don’t understand it.

I don’t actually know the answer. I haven’t thought about that for security, but in the DevOps world, it’s about pushing those responsibilities to the developer.

I think maybe the security world is a bit more complex than the ops world, in that you can’t really just hand over the security responsibility in full. But I think, like you say, a really collaborative approach helps: having developers understand the why and the how of security, and, I talk about developer experience a lot, making the developer experience easy, simple and accurate.

Yeah, I think collaboration and shared ownership of responsibilities and outcomes are definitely important.

Ashish Rajan: It’s actually interesting what you mentioned, because I feel like with every product there are two sides. One is security of the product, and then there is: how can security use the product?

So talking about security of the product: people listening to this probably have some security background, [00:28:00] and they’re going, what are some of the risks that an observability platform opens them up to? I guess one obvious one that floats around quite a bit is PII being available to a much broader group of people.

Are there other things? Like, is there a supply chain component to this as well?

Colby Funnel: Yeah. I mean, the observability system is kind of the blueprint and the access log; it basically says everything about your stuff. So if someone gets access to that, or we have a bad actor internally or something, it can cause problems.

I don’t think I’ve ever heard of that happening, but I can absolutely see the risk there. There are controls around data access and who can see what, and there are movements in the industry about automating away PII and all this sort of stuff. But there is a risk there. I’ve not heard of it playing out.

Ashish Rajan: Yep. I think, to your point, for a lot of people observability is still not even there. I think a lot of people are still doing traditional log aggregation, SIEM or [00:29:00] performance management.

A lot of people are just opening up to the idea of observability. So maybe as we mature into it, and hopefully once we have a standard from OpenTelemetry for, I guess, instrumentation, that might be a good place to start as well, I think.

I love the conversation we’ve had so far, because we’ve touched on: if I’m a small company and I want to start with an observability solution, get some metrics and use that to develop tracing. The other thing, from a scaling perspective: are there any fundamental things people should be looking at when scaling an observability platform?

I guess you can do it for one team, but I’m thinking about a mid-market company growing into a large one. How does that work at that point?

Colby Funnel: Yeah. First of all, understand that observability is expensive, and I don’t think that’s observability; that’s just data. Lots and lots of data, lots and lots of money.

So you’ve got to figure out how much you’re willing to spend to scale to the point that you need. And it’s not just cost: the more data you have, the slower it is to respond, and the more people you might need to run this thing. Other [00:30:00] aspects of scale: there’s a really interesting one that I’ve been thinking about all this week, actually.

Asset management and artifact management, right? So in some of our tools, if you say, show me a list of all dashboards in the system, that list won’t load, because there might be 50,000 dashboards: there are 5,000 developers creating dashboards, someone’s automated something, and all of a sudden there are 50,000 dashboards.

So there are these ways that systems break that you didn’t think were breaking points. With some of these systems or vendors, you go and say, show me a list of all my services. Well, that’s not going to load for a company like Atlassian or other medium to large companies. But that’s just one little nuance. In reality, the most important thing for me is standards. Right, at Atlassian you might have lots of developers sending HTTP metrics but calling them different things. And therefore I, as the observability team, or someone in ops or anywhere else, can’t use that data, because only the developer knows what it is.

Whereas if you were starting from scratch today, you might say, hey, every developer has to send [00:31:00] their data, and it needs to look a little bit like this. And then all of a sudden the platform can help them, and the security team can understand that data, because all the data is consistent.
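To make the point about consistent naming concrete, here is a minimal sketch of what enforcing such a standard at ingestion could look like. Everything in it is an assumption for illustration: the alias table, the canonical metric name, the required tag names and the `normalize` helper are hypothetical and do not describe any particular company's or vendor's system.

```python
from dataclasses import dataclass

# Hypothetical alias table: names different teams might use for the same
# measurement, all mapped to one agreed canonical name.
CANONICAL_NAMES = {
    "http_latency_ms": "http.server.request.duration",
    "requestTime": "http.server.request.duration",
    "web.response_time": "http.server.request.duration",
}

# Hypothetical tags that the standard requires on every metric.
REQUIRED_TAGS = ("service", "environment")


@dataclass(frozen=True)
class Metric:
    name: str
    value: float
    tags: dict


def normalize(metric: Metric) -> Metric:
    """Return a copy of the metric with a canonical name.

    Raises ValueError if a required tag is missing, so inconsistent data is
    rejected at the source instead of being discovered later. Values are
    passed through unchanged; unit conversion is out of scope here.
    """
    missing = [tag for tag in REQUIRED_TAGS if tag not in metric.tags]
    if missing:
        raise ValueError(f"metric {metric.name!r} is missing tags: {missing}")
    canonical = CANONICAL_NAMES.get(metric.name, metric.name)
    return Metric(canonical, metric.value, dict(metric.tags))


if __name__ == "__main__":
    raw = [
        Metric("http_latency_ms", 120.0, {"service": "cart", "environment": "prod"}),
        Metric("requestTime", 95.0, {"service": "search", "environment": "prod"}),
    ]
    for metric in raw:
        print(normalize(metric))
```

Once both teams' metrics carry the same name and tags, a platform or security team can query them together without asking each developer what their metric means.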

Ashish Rajan: Right, at the source.

So the key to successful scaling would be, I guess, having a standard.

Colby Funnel: Yeah. I mean, the metrics would be how fast you can get data in and out of it and how much it costs in total to maintain. But the true value is in using the data in the system. If the data is consistent, then it’s a lower footprint.

It’s faster, and the platform can do more on behalf of users. And we can start talking about these other buzzwords, like machine learning and AI and doing things on behalf of users with this data, but it starts with understanding the data.

Ashish Rajan: I’ve really enjoyed this so far, and I’m going to continue enjoying exploring observability.

But I just want to quickly switch gears. This is the last section of my podcast. I know we’ve been talking about technical stuff; these are non-technical questions, just for folks to get to know you a bit more. Three fun questions, not too many.

First one being: what do you spend most time on when you’re not working on observability?

Colby Funnel: I’m a pretty boring person, [00:32:00] so: reading books, taking my dogs for a walk, playing video games, watching TV. If I wasn’t in lockdown, the answer would be road trips and getting out and about, but I can’t remember what life was like before lockdown anymore.

Ashish Rajan: Yeah. I mean, I feel for you guys after seven weeks of lockdown. So I appreciate the insight into the pre-lockdown days.

The second question I have is: what is something that you’re proud of but is not on your social media?

Colby Funnel: Oh,

Ashish Rajan: People talk about family and things that they’ve done. But I’m curious from your side: what is something that you’re proud of that is not on social media?

Colby Funnel: I don’t really have a huge amount of social media. My wife, definitely; I’m proud of my wife, my dogs. I’m proud of my journey through different careers and the different things that got me to where I am. I’m proud of my mom. I’m confusing proud and thankful now, but I’m proud of my sisters.

Ashish Rajan: Yeah. That’s pretty awesome. And I’m glad you mentioned the part about the different careers as well.

I mean, obviously family’s important. I think we’ve had a couple of people mention the same thing before. But [00:33:00] going from sysadmin to where you are now, doing observability, that’s like some scientific stuff, man. It’s like instrumentation and tracing.

Colby Funnel: I actually meant even further back.

You can’t tell because of my lockdown haircut, but I was a hairdresser for years before I even got into tech. And I think that journey of going through the different bits of life and who you are, I’m pretty proud of that.

Ashish Rajan: Oh, that could be a good story: from hairdresser to observability.

Colby Funnel: Pretty common path, I think.

Ashish Rajan: Yeah, totally. I’ve got one last question for you, then. What’s your favorite cuisine or restaurant that you can share? Obviously we’re pretty locked down, but I’m keen to know your favorite cuisine or restaurant.

Colby Funnel: Definitely sushi, Japanese.

Ashish Rajan: Sushi, Japanese. Fancy, lovely. Awesome. So that’s pretty much what I wanted to cover for the podcast.

So I do appreciate you hanging out, man. For people who probably have questions around observability and building that at scale, where can they reach you?

Like, where do you hang out on social media?

Colby Funnel: Not really, but I’m on LinkedIn, [00:34:00] sure. Feel free to reach out on LinkedIn. I think you’ve already pinged me through that. Happy to talk for days about this stuff. Don’t always know what I’m talking about, but happy to give opinions anyway.

Ashish Rajan: No, I think it’s been valuable for me, so I’m pretty sure other people would find it valuable as well. So thank you for that. And I’m really looking forward to talking more about observability. As soon as OpenTelemetry comes up with a standard, I’ll have to bring you on again to get an honest opinion on this. We can totally do an honest review of what the standards are saying and what actual implementation looks like.

Totally can do that as well. So I appreciate you hanging out with me, Colby. Thanks so much for coming in, and I look forward to bringing you back.

Colby Funnel: Yeah. Thanks. Thanks for having me.

Ashish Rajan: Thanks, everyone. All right, everyone else, I will see you next weekend. And yes, stay safe. Peace.