SIEM vs. Data Lake: Why We Ditched Traditional Logging


In this episode, Cliff Crosland, CEO & co-founder of Scanner.dev, shares his candid journey of trying (and initially failing) to build an in-house security data lake to replace an expensive traditional SIEM. Cliff explains the economic breaking point where scaling a SIEM became "more expensive than the entire budget for the engineering team". He details the technical challenges of moving terabytes of logs to S3 and the painful realization that querying them with Amazon Athena was slow and costly for security use cases. This episode is a deep dive into the evolution of logging architecture, from SQL-based legacy tools to the modern "messy" data lake that embraces full-text search on unstructured data. We discuss the "data engineering lift" required to build your own, the promise (and limitations) of Amazon Security Lake, and how AI agents are starting to automate detection engineering and schema management.

Questions asked:
00:00 Introduction
02:25 Who is Cliff Crosland?
03:00 Why Teams Are Switching from SIEMs to Data Lakes
06:00 The "Black Hole" of S3 Logs: Cliff's First Failed Data Lake
07:30 The Engineering Lift: Do You Need a Data Engineer to Build a Lake?
11:00 Why Amazon Athena Failed for Security Investigations
14:20 The Danger of Dropping Logs to Save Costs
17:00 Misconceptions About Building Your Own Data Lake
19:00 The Evolution of Logging: From SQL to Full-Text Search
21:30 Is Amazon Security Lake the Answer? (OCSF & Custom Logs)
24:40 The Nightmare of Log Normalization & Custom Schemas
28:00 Why Future Tools Must Embrace "Messy" Logs
29:55 How AI Agents Are Automating Detection Engineering
35:45 Using AI to Monitor Schema Changes at Scale
39:45 Build vs. Buy: Does Your Security Team Need Data Engineers?
43:15 Fun Questions: Physics Simulations & Pumpkin Pie

Ashish Rajan: [00:00:00] Why are people switching from traditional SIEMs to data lakes now?

Cliff Crosland: To increase our license, it would've been more expensive than the entire budget for the engineering team. For security teams, it can be pretty terrifying to be like, well, I'm only keeping like 20% of my log data. It became a bit of a black hole.

Where you couldn't really search through very much data. Tools in the future need to embrace the fact that logs are going to be messy. We as humans can kind of see these schema changes and be like, eh, I get it. I get what this new field means.

Ashish Rajan: If you have been wondering what it is like to build an in-house security data lake, well, this is the episode for you.

I got to speak to Cliff Crosland from Scanner.dev, who tried doing this, failed, learned a few lessons, and is now sharing it over here: what he found through the journey of using SIEMs, SQL, normalization, and the number of log sources you have to care about. Now, I don't wanna deter you from looking at data lakes.

I definitely see today, with AI being so prolific in a lot of organizations, that either the engineering team or the [00:01:00] security team is looking at building security data lakes, or data lakes in general. So if you want to be able to tap into that and see what that could look like for your organization, especially perhaps if you're sick of your expensive SIEM, or of not being able to store enough logs because of volume-based pricing.

Whatever your excuse may be, I think you'll enjoy this episode. If you know someone who is planning for what could be the future of security operations without a SIEM, and building a security data lake, we also covered some of the challenges too. So do share this episode with them so they can get a full understanding of what it takes to build this security data lake, in terms of teams and the challenges they would face as they walk along that path.

Finally, if you have been listening to or watching episodes of Cloud Security Podcast for a second or third time, thank you so much for supporting us. I really appreciate it. If you can take a second to hit that subscribe or follow button, it really helps us grow and helps more people find us, so that we can inform them about making the right calls about these technologies with cloud security and having an overall good cloud security posture and program.

Thank you again for your love and support, and I will [00:02:00] see you and talk to you in the next episode. Peace. Hello and welcome to another episode of Cloud Security Podcast. I've got Cliff with me. Hey man, thanks for coming to the show.

Thanks so much. It's a pleasure to be here. Man, I'm excited for this.

First of all, can we start with a bit of your background as well? What have you been up to, and where are you today?

Cliff Crosland: Yeah, I'm the co-founder of a very fun database-focused startup that does a lot of fun stuff with data lakes, called Scanner. My background is that I'm obsessed with low-level system performance, and I have a love-hate relationship with the Rust programming language.

My co-founder and I have worked together for a long time. I've done a couple of different startups; the last one, we were acquired into Cisco. We've just always faced interesting security and observability challenges at massive scale. So yeah, I don't get to code much; my team here at the company doesn't let me code as much as I want to anymore.

But that is my bread and butter, and my passion is going crazy with systems-level programming. But yeah, that's a little bit about me.

Ashish Rajan: I think when we were [00:03:00] talking initially about this, I remember you telling me about your experience of building a data lake yourself.

And I guess the question that I have for you is, it almost seems like a theme today for a lot of people. I'm kind of guilty of this as well: when I was a CISO, every time I thought of a SOC, I would think, hey, I need to have a SIEM. Now I'm gonna use the phrase traditional SIEM, considering we now have this AI world.

So let's just say, why are people switching from traditional SIEMs to data lakes now?

Cliff Crosland: It's really interesting. I think basically the world of logging and data volumes has changed dramatically. We had this problem as well with our traditional SIEM; we were using Splunk at the time, at the prior startup, and it's very easy to get to terabytes of logs per day.

Traditional SIEMs were wonderful in the era when you had maybe individual gigabytes, or maybe tens of gigabytes, of logs per day. But now that we're in a very containerized world, everyone's using many different [00:04:00] SaaS tools and services, log volumes just get massive, and then it becomes impossible to keep all of the logs that you want to in your SIEM.

And so a lot of people start to move data at scale over to data lakes, because they're just wonderful for managing massive log volumes and scaling basically forever. We really think that's the future of where logs should go. But yeah, it's just economically feasible, once you reach certain log volumes, to scale and capture all the data that you want to in a data lake; it's not economically feasible in a SIEM.

You have to decide which log sources to cut out, which logs to pare down, et cetera. So yeah, that's a huge reason why we moved to a data lake and why other people are as well.

Ashish Rajan: So economical in the sense of cost, and visibility and all that as well?

Cliff Crosland: Yes. It's basically down to the challenges of SIEMs, with the sort of limited log volume that they can ingest and retain.

They're [00:05:00] great, they're complex, they have a lot of really robust features. But once you are ingesting something like multiple terabytes of logs per day, it becomes extremely painful economically to retain everything. And so you start to drop a lot of log data.

A data lake gives you a lot more visibility, because you can capture all of the logs that you could ever want to, store them in the data lake, and keep them around forever. It's just so much cheaper. You can really capture all of the log sources that you want to, so you can really get visibility into everything,

rather than making hard choices about a finite set of log sources that you want to capture in the SIEM. So yeah, visibility into all of your log sources is another big reason why people move to data lakes.

Ashish Rajan: So, your first prototype version that you had tried building, and we were talking about this last time as well.

What was that journey like, the whole thing through, and all the big surprises you found along the way as well? [00:06:00]

Cliff Crosland: Yes. At the prior startup where my co-founder and I worked, it was a lot of fun. Our log volume exploded rapidly, and we were in charge of both observability and security at this startup.

Our traditional SIEM basically immediately hit its volume license thresholds as the log volumes grew. To increase our license, it would've been more expensive than the entire budget for the engineering team. So what we ended up doing is redirecting the vast majority, 90% plus, of our log data over to S3 buckets.

And we're like, cool, we have Athena, let's go query that. And that worked for small data sets, and for looking at maybe a day of data. But it was really sad, because it basically became a bit of a black hole where you couldn't really search through very much data. Once the data set became large, querying [00:07:00] that data lake in S3 became more and more painful over time.

And also, as we added more and more log sources, trying to get them to fit into SQL table schemas and so on just became extremely laborious and painful. So it was awesome from a cost perspective, but it was really painful from a usability perspective, for sure.

Ashish Rajan: I mean, I guess the surprise being the engineering overload that you're putting yourself through with this, because I imagine it's not for everyone, right? For example, is there usually a breaking point when people realize this? Because a lot of people, in my mind, have always gone down the path of finding either a log aggregator or a SIEM, and that becomes the collection point for all the logs that you care about, and you build threat detection on top of it.

That's been the standard for a while. Do you almost need to be an engineering team to build a data lake? I guess, what's the breaking point when people suddenly decide, okay, I think I can't do [00:08:00] this? To your point, it may have been that, hey, S3 buckets are good for one day of logs, but maybe not for one year of logs, or the 90 days of logs that we require for regulatory reasons.

Cliff Crosland: Yeah, that's a great question. I think it really is the case that, the way data lakes are today, most data lake tools are very hard to use effectively. Pushing data into Splunk or Elastic or a traditional SIEM is quite a nice experience, relatively speaking, compared to data lakes.

You don't really have to do a lot of massaging of log schemas and transformation of data. The traditional SIEMs are quite good at just making sense of your data, making it all searchable and so on. But yes, if you're like, my team wants more visibility into more log sources, maybe my retention window is not long enough to really do effective investigations.

Maybe you wanna do threat hunting, and you're like, well, really the only feasible way for me to get all of my logs covered would be to use a [00:09:00] data lake, let's start working on that. It is a lot of data engineering to get that to work properly. I think basically you have to really understand every log source.

You do custom manual work per log source to fetch it from your tools and transform it into a schema that fits. There are a couple of different, more popular schemas for data lake SQL engines; OCSF is one. But basically every log source is quite a bit of work, and you will be on this endless journey of maintaining a data lake forever.

It's fine for some teams. We think this is changing, though; there are a lot of really cool new data lake technologies coming out, and a lot of innovation there. We're still in the early days. But for some teams, if you have a lot of engineering resources, and maybe you have other data engineering teams at the company, they might love this project of building out a data lake, creating [00:10:00] all these tables and so on, and sharing the data with other teams who might really appreciate visibility into the different log sources that are typically used by security teams.

But yeah, it is a heavy lift to get a data lake to work well. There are some cool examples of teams who have done this well, and new technologies that are coming out. We really care about making data lakes easier.

That is what we think is the future of logs: making data lakes easier to use, faster, and more powerful. All of that is happening now, so hopefully it won't be as painful as it has been recently. But for now, with the most common data lake tools, it is annoying. It's a big engineering lift to get your data lake running.
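As a rough illustration of the per-source transform work described above, here is a best-effort sketch that maps one vendor's login event into a few OCSF-style fields. The mapping is illustrative only; real OCSF classes carry many more required attributes than shown here.

```python
from typing import Any

def normalize_login_event(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a raw, Okta-style login event into a handful of OCSF-ish fields."""
    outcome = raw.get("outcome", {}).get("result")
    return {
        "class_uid": 3002,  # OCSF Authentication class
        "time": raw.get("published"),  # vendor-specific timestamp field
        "actor": {"user": {"name": raw.get("actor", {}).get("alternateId")}},
        "src_endpoint": {"ip": raw.get("client", {}).get("ipAddress")},
        "status": "Success" if outcome == "SUCCESS" else "Failure",
        "raw_data": raw,  # keep the original around for full-text search later
    }
```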

Ashish Rajan: I guess you gave the example of Athena as well, which you had used on that storage, and which, to what you said, is maybe great for storing but maybe unusable. I think, to your point, you had a few examples where you were talking about the speed and everything.

So in my [00:11:00] mind, you started with a SIEM, then you started pushing logs to an S3 bucket, and then you realized, great for one day, but not for 90 days or whatever. Now you're like, okay, I'm querying with Athena, so maybe I can do a query better. What's the challenge there as you get to that third stage?

Cliff Crosland: Yeah, that's a great question. So Athena is basically Apache Presto, or Apache Trino, under the hood. Those are some of the most common data lake tools out there; they're all SQL based, and Athena is good at querying S3 buckets. And so in our original data lake, we just started dumping huge amounts of log data into S3, storing it forever.

And we tried to use Athena to run queries on it. One of the challenges, and this is pretty common and happens a lot with data lakes, is that when you execute a query, unless you are querying something that is perfectly suited to the way you've organized your data, partitioned it [00:12:00] into different folders, and maybe perfectly indexed it and so on in your data lake, the queries are going to be extremely slow.

So we'd typically do, you know, let's go search the last 30 days for some activity, and the query would take three hours to run and might cost a few hundred dollars. So even though in theory we could run investigations on this data lake,

go and scan through our S3 buckets where all of our log data was, in practice it was almost unusable unless we were doing something small, like using Athena to query just the past 24 hours or so. If we wanted to search larger data sets, it just does a kind of naive S3 scanning.

If your data isn't perfectly columnar and you're trying to do a text search, it's kind of impossible to use. I think Athena is probably good for business transactional data that's very columnar, [00:13:00] very spreadsheet-like, but a lot of security logs can be a lot messier.

They can be deeply nested JSON, or lots of text, like PowerShell command-line text and so on. That's where the SQL engines that are the typical data lake engines really break down. So it was basically like, well, maybe we'll use Athena every now and then, but it's almost unsuitable for day-to-day use.

We'll touch it a few times a year. But it was very cheap to store the logs, so if we ever needed them, we could put them in S3 and use Athena to query them. Still, it was basically unusable for going and getting visibility and search.
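For a sense of what those painful queries look like in practice, here is a hedged sketch of driving Athena from Python. Athena bills by bytes scanned, so an unpartitioned substring search over 30 days scans (and charges for) the entire dataset; the database, table, and bucket names are placeholders.

```python
import time

import boto3

athena = boto3.client("athena")

def run_query(sql: str) -> None:
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "security_logs"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
    # Athena pricing is roughly per terabyte scanned, so this number is the bill.
    print(f"{state}: scanned {scanned / 1e12:.3f} TB")

# No filter on the partition key (dt), so this scans the whole table:
run_query("SELECT * FROM logs WHERE payload LIKE '%198.51.100.7%'")
```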

Ashish Rajan: And to put this into context: at a security team, let's just say a security operations team, you're probably looking at a huge amount of data for every incident you're reviewing, and you don't have the time to just fire off a query and hope, because an hour is a long time to wait for an incident to be reviewed.

When you've already identified an [00:14:00] incident, are there other things that happen as well? Like, do people actually keep all the logs? Because it sounds like if it's too expensive, people shed logs as well.

Cliff Crosland: Yes, I think basically this is the journey that people go on: okay, cool, I've got all my logs heading to my SIEM.

This is great. Oh my God, this is getting to be really expensive. And either my SIEM is crashing a lot, or I'm basically hitting volume limits, and unless I have a few million more dollars to spend on my SIEM, I'm going to have to start shedding logs. A classic example: I think Cribl's popularity speaks to this, where Cribl as a data pipeline system sits in front of SIEMs.

That's often the way people use it. They'll use Cribl to delete fields from logs, filter them down, sample them down, trying to keep the log volume down to avoid spending too much on ingestion volume costs in their traditional SIEM. And so then people start deleting logs. Really, [00:15:00] sampling and deleting logs is kind of okay for SRE teams or observability use cases, because

for those, you're using the logs to get a sense of the health of the system, so sampled data is all right. But for security teams, it can be pretty terrifying to be like, well, I'm only keeping 20% of my log data, but the threat actor activity, and maybe the IOCs that I care about, like malicious IP addresses, et cetera, are invisible to me.

I want to keep everything; I want full fidelity. I want to be able to find anything; even one event can be really important in security. So shedding those logs can be, and for us it was this way too, a very annoying and painful experience: okay, well, what risk am I willing to accept here?

Which kinds of logs, which kinds of threats am I willing to [00:16:00] allow to become invisible to me? Making those choices between logs can be very scary and painful. And we think the future looks like keeping them all, but making it much more accessible to go and search them.

And I think it is fun; there are lots of cool technologies coming out to make data lakes better that way. But yeah, it is an annoying problem that security teams have to face there.
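A tiny sketch of why sampling feels so dangerous for security, using made-up data: with one malicious event hiding among ten thousand, a 20% sample usually throws it away entirely.

```python
import random

IOC_IPS = {"203.0.113.66"}  # a made-up indicator of compromise

events = [{"src_ip": "10.0.0.1"} for _ in range(9_999)]
events.append({"src_ip": "203.0.113.66"})  # the one event that matters

sampled = [e for e in events if random.random() < 0.20]  # keep ~20%
hits = [e for e in sampled if e["src_ip"] in IOC_IPS]
print(f"kept {len(sampled)} of {len(events)} events; IOC hits found: {len(hits)}")
# Roughly four runs out of five print "IOC hits found: 0".
```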

Ashish Rajan: So what's the misconception then? Because it sounds like, to your point, it could be easy for people to go down the path of, oh, seems too complex, quite expensive.

Are there any misconceptions that security leaders have with their approach of just looking at a data lake as cheap storage that, to your point, you can query as much as you want?

Cliff Crosland: Yeah, that's a great question. Look, one of the things we tend to hear is that a lot of the people we talk to are excited about building out a data lake, but they're nervous about how much engineering effort it's going to be.

There are some really cool open source tools that help make this [00:17:00] easier. A fun one that we often see is Grove, an unofficial HashiCorp project that helps you collect logs. There's also Substation from Brex, which is a really great pipeline system.

But anyway, one of the misconceptions, I think, is that there's going to be an extreme amount of engineering lift to get my logs gathered into the data lake, then transformed into a schema that works for me, and then made actually searchable.

It's true that until recently it has been very hard to build a data lake, and that it is a heavy lift. But I think that's one of the misconceptions that's really starting to change. It's becoming easier and easier to just gather messy data into a data lake and have really cool tools that will make sense of it and make it fast to search, or to detect on, or normalize.

There are many options out there, and I think over the coming few years [00:18:00] everything is going to be moving to data lakes. Whether it's security logs or anything else, everything will be moving to object storage and S3 buckets.

And then we'll start to see even more cool tools to make data lakes easier to onboard and build. But yeah, I would say to security leaders: with the popular data lake tools, it will be a big engineering lift, but look for the newer cool tools out there that may make getting started on the data lake journey much easier.

Ashish Rajan: I mean, it sounds like there have been a few generations of this data lake architecture as well, and you've tried a few versions yourself. I'm curious how you see it: what are the three or four, or however many, generations that data lake architecture has gone through? Today you're saying it's a lot easier, but what has the transition been, so that people get a sense of which of those stages they're at in their data lake [00:19:00] journey as they're building this?

Cliff Crosland: Yes, I think that's a great question. Just looking at history, it's really interesting to see the evolution and how things have changed over time. The original SIEMs, like ArcSight, were based on SQL; they were based on Oracle. That was a good first step, but one of the problems is, again, that log data, security data, can be a messy fit, a hard fit, for perfectly structured SQL tables.

And then you saw this generation of SIEMs like Splunk or Elastic, and they were much better at taking messier log data, normalizing it, being a bit more flexible, searching semi-structured, well-structured, or totally unstructured text. They were good at all of those things.

So we see this transition for old SIEMs from the SQL era to full-text searchability, and we're seeing the same thing, I think, with data lakes today. The original technologies [00:20:00] built for querying and managing data in data lakes have all been SQL based at first. You see tools like Snowflake or Athena and Presto, et cetera.

They're all very SQL oriented. It's really cool to see more data lake technologies today focused on full-text search. That's something we love at Scanner; we love the unstructured, messy data. There's also a really cool example from the Apple security team, where they moved from Splunk to a data lake in S3. They are using Databricks Delta Lake,

but they also built their own custom full-text search using Lucene and Apache Spark together. That's a huge engineering lift; I don't think every organization can do that. But they showed: cool, the SQL-based data lake is a good first step, but getting full-text search on the messier, semi-structured, unstructured data in my data lake is the next evolution.

And [00:21:00] so yeah, we're excited to see a bunch of cool technologies there that people are building. I think in the future it'll be more turnkey to do that; you won't have to have Apple's engineering resources to pull it off. But yeah, I think the next generation of data lakes is tools that are just very good at messy, unstructured, semi-structured data.

Ashish Rajan: Right. I mean, I guess turnkey reminds me that there is Amazon Security Lake as well. We've been talking about Amazon, AWS Athena, S3 buckets. So where does the whole Amazon Security Lake fit into this kind of world?

Cliff Crosland: Yes, I think it's a really good first step. What Amazon Security Lake is good at is that they have a bunch of different log sources that they support out of the box, which is really cool.

And what they do is translate them into OCSF, which is this pretty strict but really cool schema that sits on top of all of your security data, so all of your [00:22:00] logs from many different sources can fit the same schema. They translate it into Parquet files, which are really fast to query because they're nice and columnar.

But there are two problems that we hear people talk about with Amazon Security Lake, and it'll be cool to see if this gets easier over time. One is that for custom log sources, or log sources that aren't in their list of supported sources, again there's this data engineering lift that you have to do to get your messy, weirdly structured logs in.

Every log source is totally different, and its schema is always changing. It's annoying. But if your log source isn't on their list, you're gonna have to do the work to get it to fit into this very strict schema, and that can be a massive amount of work. If you have custom logs that you wanna monitor, that you're generating internally, or you just have many different kinds of logs that aren't on their list, [00:23:00] you're gonna have to do that data engineering work.

So maybe over time it gets easier to get those logs to fit into that schema; that would be really cool. But that's one problem: the data engineering lift if your log source isn't supported. The second problem is full-text search. Again, if you have things like command-line text from EDR logs, like PowerShell commands, and you're trying to dive in, do substring search, and really understand messier log data, that is unfortunately still quite slow

in the data lake. It's not really designed for that. Amazon Security Lake is really designed for great columnar, SQL-friendly data. But if you can't get your data into that format, or if your data just by nature isn't very SQL friendly, it might be annoying. It would be cool if Amazon builds out easier onboarding and easier full-text search into Security Lake; that would be great.

But it's [00:24:00] definitely the direction things should go in: more data in object storage, easier retention, more scale, et cetera.
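To illustrate the Parquet trade-off described here, a minimal sketch (field values are made up) of writing events to a columnar file with pyarrow: great for pruning well-typed columns, much less helpful for substring search inside free text.

```python
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {
        "time": "2024-01-01T00:00:00Z",
        "user": "alice",
        # Free text like an encoded PowerShell command line gains little from
        # the columnar layout: substring search still decodes every row.
        "cmdline": "powershell -enc SQBFAFgA...",
    },
]

table = pa.Table.from_pylist(records)
pq.write_table(table, "auth_events.parquet", compression="zstd")
# A SQL engine can prune or skip the "user" column cheaply; a LIKE query
# against "cmdline" still means scanning the text of every row.
```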

Ashish Rajan: You've touched on something really interesting there. Most organizations use custom applications, right? Not that many are just using SaaS applications or standard applications.

Everyone's using custom logs, custom applications they're building in-house, whatever the reason may be. And I definitely find that custom logs are probably the more common pattern you will find when it comes to logging and security operations, rather than standard logs. And yes, to your point, you can have the OCSF format that Amazon has standardized on, but if 90% of your logs are supposedly custom, then you're

back to the engineering part again. I'm curious about the whole schema and normalization part, because there are so many different kinds of sources in an organization, custom sources. I mean, cloud logs are kind of covered, because the OCSF conversion may happen more easily there, but,

to be honest, [00:25:00] we're not just looking at cloud logs; we're looking at application logs, enterprise application logs, which are very custom. How easy or difficult is the whole schema normalization thing in this text-based search world that we are all moving towards with data lakes?

Cliff Crosland: Yeah, that's a great question. If you are focused on SQL, you really have no choice but to do a lot of work to transform every log source, including your custom application logs. But in the new world, where you have query engines that can handle full-text search on messy log data,

that normalization isn't as important. I think there are some really fun things happening there. If your query system is really good at understanding deeply nested, unstructured, or semi-structured data, you may not even need to do normalization.

One of the things that we see, and this is a fun new [00:26:00] era for all of us, is that if you're using agents, using LLMs, to do things like run an investigation on your data lake, they're actually quite good at doing fuzzier kinds of correlation. Whereas traditionally, with a SIEM, you can't do correlation unless you have the schema perfectly mapped.

If you're like, cool, let's do a correlation and see what this user has done across many services, then every service's logs need to be normalized to have exactly the same column for user. But in the future, if your system is good at searching messier data, you don't need the column names to be exactly the same. And also, LLMs are quite good at saying:

cool, I ran a few different queries, I followed this user all the way through, across these patterns. I get the idea that this field in this log source is the same as that field in that one, and maybe their username is just [00:27:00] slightly varied from source to source. It can be

quite cool to see that. In the future, I think what we'll really see is more strength in tools understanding messier data, and not forcing everyone to totally clean their data first to make it perfectly normalized. Not only will tools make it easier to search that messy data, but agents will just be able to understand the messy data

much more easily and do those correlations for us. So I still think it is helpful to normalize as much as you can, but I kind of believe in best-effort normalization: adding a handful of normalized fields, not forcing teams to get their logs perfectly mapped into a schema.

So yeah, I think there's a happy medium we can all get to. But it'll become less and less important over time, I think.
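Here is one possible reading of that best-effort idea as code: probe a short list of known aliases for a normalized field, and keep the raw event alongside it for full-text search. The alias list is illustrative, not exhaustive.

```python
from typing import Any

# Example aliases for "user" seen across hypothetical sources.
USER_ALIASES = ["user", "userName", "user_name", "actor.alternateId", "email"]

def get_path(event: dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path like 'actor.alternateId' into nested JSON."""
    current: Any = event
    for part in dotted.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(part)
    return current

def best_effort_normalize(event: dict[str, Any]) -> dict[str, Any]:
    """Add a handful of normalized fields; never reject the event."""
    user = next(
        (value for alias in USER_ALIASES if (value := get_path(event, alias))),
        None,
    )
    return {"norm": {"user": user}, "raw": event}
```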

Ashish Rajan: Interesting. And you've said something really interesting there as well, because forcing everyone to go down a particular path is [00:28:00] great when you're not expecting the application to change at all.

And obviously most organizations have 300-plus applications; it takes like 10 of them to change their schema, and suddenly you're like, oh, I haven't received logs from this particular application for months now. I have no idea what's happening over there. And by the way, your SIEM is not even telling you that it hasn't received any logs,

because there's no alert for the fact that you're not receiving any logs. I mean, that would get quite tricky quite quickly.

Cliff Crosland: Yes. When we talk to users, oftentimes what they tell us is: okay, cool, when we had 10 log sources and were starting the data lake journey, this was fine.

But then when they got to 40, 50 log sources, basically every week at least one of them was misbehaving because the schema changed a little bit. New fields showed up that were important, or fields got renamed, and then suddenly the data stopped getting inserted, or it had errors. So every week the team is just constantly fighting to get the data to fit into this very rigid structure [00:29:00] in the common SQL-based data lake tooling.

It's kind of funny, because we as humans can see these schema changes and be like, eh, I get it. I get what this new field means. I understand. So what I really think is that tools in the future need to embrace the fact that logs are going to be messy,

and that both humans and AI are going to be able to get what the schema means now, and not force everyone to fight a stupid fire all the time getting their logs to be perfect. But it is the case, I think, that one of the scary things about ingesting dozens and dozens of log sources is the ongoing maintenance and the pain you experience.

Every week there's another stupid fire to put out. We'll see how it all shakes out in the future, but messiness will be embraced more is my prediction.
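A cheap way to catch that kind of drift, sketched under the assumption that you can sample recent events per source: record the set of field paths seen each week and diff it against the previous week, so new or vanished fields get flagged before detections silently stop matching.

```python
from typing import Any

def field_paths(event: dict[str, Any], prefix: str = "") -> set[str]:
    """Collect dotted field paths, e.g. {'actor', 'actor.alternateId'}."""
    paths: set[str] = set()
    for key, value in event.items():
        path = f"{prefix}{key}"
        paths.add(path)
        if isinstance(value, dict):
            paths |= field_paths(value, prefix=f"{path}.")
    return paths

def diff_schema(last_week: set[str], this_week: set[str]) -> dict[str, set[str]]:
    return {
        "added": this_week - last_week,    # new fields a human (or AI) should interpret
        "removed": last_week - this_week,  # detections referencing these now miss
    }
```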

Ashish Rajan: I love that. But do you reckon, you've mentioned AI a couple of times as well, and I'm curious: there's a lot of AI SOC that's popular on the [00:30:00] internet at the moment, and a lot of it focuses on detection.

If you were to build a detection pipeline with data lakes, what impact do you think AI would have on it today that people can use? I mean, especially the companies that are very engineering forward; they've listened to you and gone, oh yeah, I definitely have the engineering capability, I can go and build a data lake.

What should a detection pipeline look like today, all the way from ingestion to enrichment to detection? Obviously I'm trying to answer the question here, but what principles would you follow for that?

Cliff Crosland: Yes, that's a great question. So I would say that

some of the places where AI is most helpful are in its understanding and knowledge of the world and all of the log sources that exist out there. What that means is that, from the beginning, when it comes to ingestion, you can use AI to teach you about the APIs that exist, the log sources that exist, the connectors that exist,

the documentation that's [00:31:00] out there, and get it to write code for you, because it tends to understand all of these log sources better than we do. Instead of doing all the research for every log source on how to collect it, how to pull it in, you can make a lot of progress quickly by using AI, like in Claude Code or Cursor, et cetera,

if you want to build custom connectors. There are also lots of tools that have lots of connectors built in; use those as well. But basically, you start by letting AI use its breadth of knowledge to understand all the log sources you might want to collect, and then how to connect to them.

Gather them all in, and then start to use it to transform the data: give you ideas for what fields you might want to enrich, and what threat intelligence feeds you might wanna use to enrich that data. That enrichment can happen by analyzing the data in S3, and maybe transforming it and saving it as a new file.

[00:32:00] That can be helpful. So definitely use AI in the beginning to connect everything and know what options are available to enrich. But then, when it comes to search and detections, I think AI can really help you figure out how to normalize your data, if you really, really need to fit it into a SQL schema,

and most data lake tools require that of you. So that can be helpful; it will get you maybe 80% of the way there. It'll hallucinate a little bit, so there's still a bit more to do, but it can speed up your journey of normalizing your data to a schema, like your table schema in SQL.

But then, once it's in a state that you like, using AI to give you detection suggestions is really cool. I think it's quite good at understanding schemas, whether they're messy schemas or very clean SQL schemas, and because it kind of knows everything that's out there, it can be a really great [00:33:00] brainstorming partner.

My opinion, for now, is that it's not yet ready to be fully trusted with important investigations and response. It is good at getting started, but I think humans are still very much needed in actually analyzing the alert and making sure that the AI's investigation into it, the queries it ran, the detections it wrote,

are actually valid, make sense, and are useful. So the principle I would state is: really try to leverage AI to give you a quick understanding of your data, what data you should pull in, brainstorm ideas for detections, and do a first cut at investigations when an alert goes off.

But not necessarily just to hand over the keys and say, cool, you're running my SOC now, you're replacing my team. I really don't think that's the future. I think the future looks more like detection engineering where everyone [00:34:00] will be leveled up, and we'll all become really powerful detection engineers.

But I still think humans are essential. I don't think they'll get replaced; I think we'll all be doing cool, collaborative detection engineering work with the AI.
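One way to picture that collaboration is detection-as-code with an explicit review gate. The rule below (CloudTrail console logins without MFA) is a plausible AI first cut, not a rule from the episode, and the gate keeps unreviewed rules from ever running.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator

@dataclass
class Detection:
    name: str
    severity: str
    predicate: Callable[[dict[str, Any]], bool]
    reviewed_by_human: bool = False  # flipped only after a human signs off

# An AI-suggested first cut: flag AWS console logins made without MFA.
suggested = Detection(
    name="console-login-without-mfa",
    severity="high",
    predicate=lambda e: e.get("eventName") == "ConsoleLogin"
    and e.get("additionalEventData", {}).get("MFAUsed") == "No",
)

def run_detections(
    events: list[dict[str, Any]], detections: list[Detection]
) -> Iterator[tuple[str, dict[str, Any]]]:
    for detection in detections:
        if not detection.reviewed_by_human:
            continue  # never ship an unreviewed rule
        for event in events:
            if detection.predicate(event):
                yield detection.name, event
```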

Ashish Rajan: And to your point, I kind of agree with this: the fact that today, when you find a new

detection rule that has to be created, you normally put in a Jira ticket and wait for, say, Ashish to come back from his holiday, or wherever he has gone, to finally do this. But to what you just called out, I may not need to be a SIEM or Splunk expert to know the query that I need for this detection.

I can just ask the LLM and go, hey, I'm trying to do this; what do you think is a semi-dirty version that I can put in today? So it helps Ashish to validate, but at the same time, I already have my prototype going, so I'm not waiting on someone remembering the context and typing everything out and all of that.

I agree with this, but [00:35:00] have you seen it work at scale as well? Because we just spoke about the scale problem, where if you have 10 log sources, great, but the moment you go to 40, 50, a hundred, every time a schema changes you're back to trying to massage this in whatever way. Do you reckon AI would be better at that scale, or have you not seen examples of it, out of curiosity?

Cliff Crosland: Yes, that's a great question. It is really fun to see: we've seen it at scale, where people have dozens of log sources and they're starting to get hundreds or thousands of alerts per day, sometimes tens of thousands.

AI can be really good at saying, here's a trend of what we're seeing, like these new fields showing up in your schema. You can even just connect to MCP servers; a bunch of data lake tools have MCP servers, and I love it, it's fun, it's exciting. You can basically ask, what has changed recently in this source versus another? And it will give you a reasonably strong analysis and really [00:36:00] speed you along in understanding what's changing in your schema, so you can

make decisions about what new detections you should write, or what should change about existing detections. If the schema has changed, maybe detections aren't valid anymore; they don't catch things because they're referencing an old field. It's quite good at noticing that,

as long as you're there to ask. We've seen people do cool automations where, on a schedule, they will actually have an agent go and query many of the MCP servers they have internally, including their SIEM or their data lake, and ask: what are the trends in our alerts?

It might say, this alert count has dropped off a lot, so maybe a schema has changed; and actually, I noticed it because I just ran a query to check on it, and I can see the schema is a little bit different, and I would recommend updating the detection in this particular way.

But then also, if your alert volume is so massive that it's not possible for humans to address all the alerts, it can give you really good ideas on how to reduce the [00:37:00] severity level of different alerts: yeah, these ones probably don't need to wake someone up.

Here's a general trend, given these 1,000 alerts that went off yesterday, that you might want to be cognizant of; here are the entities, like the hosts and the IP addresses, that seem to show up a lot in those alerts. And then, for the really critical alerts, it will basically do its best to do a deep-dive investigation.

But that's when a human really needs to come in and say: all right, for this critical alert, let me make sure that its analysis is correct and check what it has found; let me give it feedback; let's dive deeply into different parts of the data to really do a full investigation on this really important incident, this really important alert.

So yeah, we've definitely seen it at scale, and I think AI has been very helpful. One last thing that's fun is to see people close the loop on detection engineering, where an alert goes off in their data lake or their SIEM, goes to Jira or Linear or another [00:38:00] issue tracker, and then an agent goes and takes a first cut at the investigation.

It will then make a recommendation for what the detection rule should probably be: maybe this is too noisy and we should tune it a little bit, like add an exception to the detection or something. Then it will open up a pull request in GitHub, and the team can just review it and accept it, and the

detection rule is now tuned better. It's still really important for humans to review that, though, and not just to say, yeah, of course, go nuts, agents, change everything, you know. But it can be such an accelerator for people. So yeah,

I think it does work well at scale to get AI involved.
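For the pull-request step specifically, the mechanics can be as simple as a call to GitHub's REST endpoint for opening PRs (POST /repos/{owner}/{repo}/pulls). The repo, branch, and rule names here are placeholders, and the agent is assumed to have already committed its proposed change to a branch.

```python
import os

import requests

def open_tuning_pr(owner: str, repo: str, branch: str) -> str:
    """Open a PR proposing a tuned detection rule, returning its URL."""
    response = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": "Tune noisy detection: console-login-without-mfa",
            "head": branch,  # branch where the agent committed the change
            "base": "main",
            "body": "Agent-proposed exception; a human must review before merge.",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html_url"]
```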

Ashish Rajan: Yeah, I can already see it: to your point, the same example I gave earlier, where I identify a detection, but instead of me manually typing it out and going through the procedure, I let the AI agent do that job and raise it as a pull request that's reviewed by someone. I've given it the right context, and it has created the [00:39:00] right detection as per that context, whether it's cloud, containers, whatever capability, doesn't really matter,

and it sends over a pull request. That would be an amazing future to get to. For CISOs who are probably still in that build-versus-buy dilemma that most of us go through, when you're trying to work through this question of, I have an expensive SIEM that I'm paying money for, but I have this cheap storage option: what team skill sets do people need to even consider this?

Especially this time of the year, when a lot of people are considering what their 2026 and beyond would look like from a program perspective, especially if they work in security operations. AI attacks are top of mind for a lot of people as well, and they're like, hey, maybe the SIEM is the better option, maybe the data lake is the better option, still tossing up the idea for future proofing.

Do you still believe a data lake is the right decision for people who have that dilemma, whether it's economical or not? And if it is, what kind of skill sets should they have in their team [00:40:00] to even make that possible?

Cliff Crosland: That's a great question. One of the things that we see is that I don't think it's quite time to totally replace your SIEM with a data lake. Some teams do it, and it's great; we do see that quite a bit. But one pattern that tends to work well is to say: cool,

all my logs used to fit in my SIEM a few years ago; now it's like 10% of them. Let me keep that 10% going to my SIEM, maybe it's one terabyte a day, and then for the nine terabytes a day of logs that are still being generated, which I have no visibility into, instead of dropping them entirely, let's make the first step, which is just storing them in S3 for compliance purposes.

And then the question is: okay, now that my nine terabytes a day are flowing to S3, should I build my data lake, or should I buy something? If you have a really strong data engineering team, whether on the security side or [00:41:00] in the rest of your organization,

that is already doing a lot of really cool data engineering with data lakes for other purposes, like business analytics or even observability, then yeah, you can share that work with them. That can be a fun project. It is a forever project, though; for every log source, you'll always be updating schemas and so on.

And unfortunately, I don't think the open source tooling exists yet to give people full-text search on their data lake; it's still super early. Unless you have a lot of engineering resources, like an Apple, and you're willing to do an insane amount of engineering and build your own inverted index on your own custom Lucene fork like Apple did, that's probably not gonna happen.

So if you want that kind of full-text capability, that kind of messy-data searchability, you'll probably have to buy something there. Over time, I think it's just gonna get easier and easier. But I would say that you should probably buy if your team is really SOC focused and not necessarily data engineering focused; there are more and more cool data lake tools that can take over the job of building out the data lake.

I think that's the first thing that's happening, actually: lots of tools exist out there to help you gather the data into your data lake. And now it's time, and it's something that we care about and that other teams care about, for tools to exist that make that data, once it's in your data [00:42:00] lake, really easy to search, really easy to run detections on, and so on.

But yeah, if you just love data engineering projects, be prepared for a long tail of data engineering, schema tweaks, and maintenance, forever. And if you don't have that data engineering talent and those resources, you'll probably need to buy something.
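The hybrid pattern at the top of that answer reduces to a very small routing rule; the source list and the two senders below are placeholders for whatever pipeline you already run.

```python
# The ~10% of sources judged worth SIEM pricing; everything lands in S3.
HIGH_VALUE_SOURCES = {"cloudtrail", "okta", "edr"}

def land_in_s3(event: dict) -> None:
    """Placeholder: batch, compress, and put_object into the archive bucket."""

def forward_to_siem(event: dict) -> None:
    """Placeholder: send to the SIEM's HTTP ingestion endpoint."""

def route(event: dict) -> None:
    land_in_s3(event)  # everything is retained cheaply for compliance and search
    if event.get("source") in HIGH_VALUE_SOURCES:
        forward_to_siem(event)  # only the high-value slice hits SIEM pricing
```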

Ashish Rajan: Yeah, like, don't overkill it, I guess, forever. [00:43:00] Fair. So those were the technical questions; I have got three fun questions for you as well.

First one being: what do you spend most time on when you're not working on data lakes and cloud and technology and all of that?

Cliff Crosland: It's my family; we have two young kids, so that's definitely it. But the thing I love to do is skiing. It's one of my favorite things in the world.

Lake Tahoe, Utah, that's my favorite.

Ashish Rajan: Awesome. I'm a snowboarder myself, but we can still be friends. That's okay. The second question I have for you is: what is something that you're proud of but which is not on your social media?

Cliff Crosland: Yes, that's a good question. One thing that I'm proud of, that was really fun to work on, and this is kind of silly, is that I created a JavaScript plugin that simulates what a black hole looks like, a visualizer for [00:44:00] black holes. It's a goofy thing, but I love physics. I think physics is super fun. I think this was after Interstellar.

You can move your mouse cursor around on a background and see gravitational lensing and the warping of space in an image. So it's a silly thing, but it was fun to work on, and fun to learn the math behind, and I was proud of it.

Yeah. And it got on the front page of Hacker News for a second.

Ashish Rajan: Well, I mean, it would've been amazing; I'm sure Hacker News would've picked it up as well. Final question: what's your favorite cuisine or restaurant that you can share with us?

Cliff Crosland: Yes, I would say that definitely my favorite thing in the world to eat is pumpkin pie.

I know it's not a cuisine or a restaurant, and Thanksgiving has definitely been on my mind. It's kind of funny, though: it's a family recipe we have with way more sugar than the typical pumpkin pie, so I hate all other pumpkin pies. But

yeah, I love this pumpkin [00:45:00] pie, especially if you dump in a very generous heap of sugar too.

Ashish Rajan: I mean, it is that time of the year as well, autumn into winter, so we are into the Halloween and Thanksgiving season; rightly so. Thank you for sharing that.

And thank you for spending time with us. Where can people connect with you and find out more about Scanner.dev and the other things you're working on from a data lake perspective, even if they have questions about the data lake they're building?

Cliff Crosland: Yeah, for sure. We're pretty active on LinkedIn; hit us up at Scanner.dev.

I love DMing and chatting with people in comments. People have amazing conversations on LinkedIn about data lakes these days, and about SIEM. It's a really cool time; I think there's a big shift afoot. So if you have cool ideas for what that looks like, I would always love to chat there, or reach me on X/Twitter; Clifton Crosland is my handle.

But yeah, it's always fun to see the very [00:46:00] amazing, very cool technologies that people are bringing to this problem. So happy to chat.

Ashish Rajan: I'll put the links in the show notes as well. But thank you so much for spending time with us and sharing all of that. And thank you, everyone, for tuning in.

I'll see you next time. Thank you for listening to or watching this episode of Cloud Security Podcast. This was brought to you by Techriot.io. If you are enjoying episodes on cloud security, you can find more episodes like these on cloudsecuritypodcast.tv, our website, or on social media platforms like YouTube, LinkedIn, Apple, and Spotify. In case you are interested in learning about AI security as well, do check out our sister podcast, AI Security Podcast, which is available on YouTube, LinkedIn, Spotify, and Apple as well, where we talk to other CISOs and practitioners about what's the latest in the world of AI security.

Finally, if you're after a newsletter that just gives you top news and insights from all the experts we talk to at Cloud Security Podcast, you can check that out at cloudsecuritynewsletter.com. I'll see you in the next episode. Peace.
