Security Chaos Engineering Experiments for Beginners

Cloud Security Engineering Series

Jan 31, 2021


Season Two


Episode Description

What We Discuss with David Lavezzo:

What are security chaos experiments, and how do they compare to chaos engineering?
The Golden Age of Offence
Tools for Chaos Experiments
How to get started with Security Chaos Experiments
Are Security Chaos Experiments only for large companies?
Security Chaos Experiments in Production
How to get organisational buy-in for Security Chaos Experiments
What is gap hunting?
Security Chaos Engineering at scale, and what metrics can be used to measure its maturity?
The upcoming trends of Security Chaos Engineering
And much more…

THANKS, David Lavezzo!

If you enjoyed this session with David Lavezzo, let him know by clicking on the link below and sending him a quick shout-out on LinkedIn:

Click here to thank David Lavezzo on LinkedIn!

Click here to let Ashish know about your number one takeaway from this episode!

And if you want us to answer your questions on one of our upcoming weekly Feedback Friday episodes, drop us a line at ashish@kaizenteq.com.

Resources from This Episode:

Tools & services, discussed during the Interview
AWS Fault Injection
Verica
Gremlin


Ashish Rajan: Hello, and welcome to another episode of Cloud Security Podcast, with virtual coffee with Ashish. Today we are talking about security chaos experiments for beginners. Today’s guest and today’s topic are really interesting for me. We’ve covered some of it in the last year, and I just feel really pumped every time I talk about something really edgy in cybersecurity. It’s interesting for me to call it edgy as well, and we’ll probably find out why, because it seems like there are still so many layers to security chaos experiments and engineering that I keep discovering.

So let me bring my guest on. Hey, David, how are you?

David Lavezzo: I’m doing excellent. How are you?

Ashish Rajan: Doing good. Thanks for coming in, man.

David Lavezzo: Thank you for having me.

Ashish Rajan: I’m going to start with a little bit about yourself. How’d you get into this field where you are?

David Lavezzo: I first came into Capital One as part of the testing group. We call it the cyber test kitchen. In the cyber test kitchen, we were responsible for evaluating new software and new hardware coming in, to make sure that the things the vendors promised [00:01:00] actually happen.

So let’s say there’s antivirus. A lot of people will just check off the box: all right, we did this, we did this, we did this. Our job was to see how much the vendors were lying and, if they were lying, whether it was an acceptable amount of lying. So we would go through tests like fileless malware, destructive malware, non-destructive malware, things like that.

And we’d do the same for IDSes, encryption, just all sorts of stuff. Eventually it led to: well, we’re testing all this new stuff, but how do we know that the stuff we already have isn’t already doing the job? How do we know it’s not good enough already?

So we went from there to focusing on what we already own and what we already use. In late 2017 we flipped it around to: let’s just start looking at what we have here already. That sort of kicked it off. I think I was calling it security validation, and then one day I was with a coworker and we said, let’s just call it security chaos engineering, because, yeah, it’s branding. So [00:02:00] we went with the branding.

So that’s sort of where we took off from, starting with the branding and leveraging all of the stuff I was doing in the test kitchen into the enterprise.

Ashish Rajan: It’s really interesting. I love the term test kitchen as well. I guess the next obvious question then, because it’s Cloud Security Podcast as well: what does cloud security mean for you?

And then I guess we can go into the whole test kitchen piece as well.

David Lavezzo: To me, cloud security is a little bit of new challenges, and then revisiting what’s old is new again, because we can only see what we’re allowed to see. And even then, in some cases, we still have to rely on what the vendor is delivering to us.

Take something like GuardDuty. I don’t know if this is still the case, but it was probably like this last year: it would ignore all AWS traffic. So if you’re trying to monitor that, they say, hey, you know what? This is all good. We trust ourselves. That made it difficult to focus on any alerts that would be tested in between the two when the traffic was just ignored.

I was looking through the docs a little bit earlier, right before this, and it looks like [00:03:00] they’re still relying on known bad IP addresses.

Ashish Rajan: I do want to ask you another question now, because kind of, it’s really fascinating. The fact that we find security, I guess the cloud security pieces.

And so when you talk about security chaos experiments, what is this field? And is this something from your test kitchen?

David Lavezzo: No. Well, let’s see. It is a way to help get things fixed. The other way of looking at it is, all right, it’s chaos engineering, we’re going to break everything.

But if you just spend all your time breaking things, you’re probably not going to be employed for that long, unless you’re an offensive security person. So rather than just breaking things and making everyone feel bad, the focus needs to be on fixing. Okay, cool, we’re going to run stuff.

We’re going to identify the failure of security controls through experimentation, not by trying to break them. That helps us build confidence that systems can deal with malicious entities the way they were designed to. [00:04:00] We were never there to break it. Security chaos engineering is specifically about getting baselines in the environments and helping get issues fixed. The principles of chaos engineering and security chaos engineering are pretty similar.

Ashish Rajan: So to your point, is chaos engineering very similar? I had Aaron and Jerome and a few other people come on in the past as well.

We’ve touched on these topics in a very different way, and I’m curious to hear your side. You hear about terms like application fault injection and application resiliency, and some people even take this into the SRE field. Are they all the same? Is that chaos engineering?

And then there are chaos experiments. Please demystify this for me.

David Lavezzo: So I think it’s all going to the same place. As it stands, you can’t really set up a system, say, all right, we’re done here, and just walk away. We can’t do the old-school years of uptime, like, take a picture of “my [00:05:00] system’s been up for 15 years.”

That’s not really going to work these days, especially right now, when we’re in the golden age of offensive security. There’s never been a better time to attack, because there’s tooling everywhere. There are YouTube tutorials. There’s free training everywhere, and everyone’s sharing.

Ashish Rajan: Yep. To your point, if you say it’s a golden age, is it a golden age also because most people are now working from home? Our threat vectors are kind of like, well, David and Ashish are working from home, but do I really know if it’s David connecting? Is that one of those golden opportunities as well?

David Lavezzo: To me, the golden opportunity is that there’s a lot of time and there’s a lot of knowledge sharing. If I’ve got a question on pretty much anything, I can go into the BloodHound Slack and talk to people from Palantir, Microsoft, SANS instructors. The world is really out there.

We’ve got free tools everywhere. The barrier to entry to attack is really, really low, so it’s really easy these [00:06:00] days. So when we get back to your question about fault injection and SRE and how they relate: they have to relate. As you’re doing your releases and staging them, you need to test before and after, to make sure that the stuff you’re doing is actually continuing to do what you thought it was doing.

Because otherwise, with the golden age of offense, you’re going to get hit sooner rather than later.

Ashish Rajan: Oh, right. So to your point then, talking about tooling, for people who may be listening to this: what kind of tooling are we thinking of from a chaos experiment perspective? What kind of tools do you recommend?

David Lavezzo: So with injecting faults, the way I define it and the way I use it is a little bit different from what an SRE would do. Instead of, all right, let’s spike the CPU, or I’m going to disconnect the data store, we’re trying to inject faults depending on what your target is. So say I want to run something that will simulate ransomware.

It’s like, okay, cool. So we’re going to queue up some Python and start encrypting stuff on the disk. Do we see it? Do we detect it? Do we have something that’ll block it? So to me it’s really, really [00:07:00] close to it, but slightly different.
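As a rough illustration of the kind of harmless ransomware-style test David describes, here is a minimal Python sketch. It is not his tooling: the function name, the `.locked` extension, and the file count are all assumptions, and the script only ever touches decoy files it creates itself inside a throwaway directory.

```python
import os
import tempfile
from pathlib import Path


def simulate_mass_file_rewrite(file_count: int = 20) -> list[str]:
    """Create decoy files in a throwaway directory, then overwrite and rename them.

    Only files created by this function are touched, and the directory is
    deleted on exit. The returned names can be compared with whatever the
    endpoint tooling logged during the run.
    """
    renamed = []
    with tempfile.TemporaryDirectory(prefix="sce_decoy_") as workdir:
        root = Path(workdir)
        # Step 1: lay down harmless decoy documents.
        for i in range(file_count):
            (root / f"decoy_{i}.txt").write_text("decoy content\n" * 10)
        # Step 2: overwrite each decoy with random bytes and add an extension,
        # a file-system pattern that detections commonly key on.
        for path in sorted(root.glob("decoy_*.txt")):
            path.write_bytes(os.urandom(path.stat().st_size))
            target = path.with_name(path.name + ".locked")
            path.rename(target)
            renamed.append(target.name)
    return renamed


if __name__ == "__main__":
    touched = simulate_mass_file_rewrite()
    print(f"rewrote and renamed {len(touched)} decoy files")
```

The questions from the conversation then apply to the run: did anything see it, detect it, or block it?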

Ashish Rajan: So it’s just a different way of using the same thing.

Well, maybe not the same tool, but the end goal may still be to check resiliency and, to what you were saying earlier, to confirm that the system is behaving the way you think it should. And I think this is a good segue into the other pieces.

When we look at this from the outside, it’s like, okay, great, chaos experiments sound like a great idea, but I’m on AWS, or I’m on Azure, or I’m on, I dunno, serverless, insert it here. Is this a holistic principle that can be applied to any technology? Or is it more like, oh no, no, no,

these are specific tools and specific kinds of environments that this can be applied to?

David Lavezzo: I think it should be platform independent. I was just using examples from specific things, but really, when you’re going into different clouds, you should be expecting the same things.

If you’ve got your security groups or firewalls isolating traffic from one to another, just try it. Just see what happens [00:08:00] if you tamper with that. You get an account with the right permissions and say, okay, well, we’re supposed to have an alert that fires if this role changes or this setting changes.

All right. Try it. Just see what happens.
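The "just try it" loop can be sketched in a platform-independent way. The Python below is a minimal illustration, not a real cloud integration: `Experiment`, `run_experiment`, and the in-memory firewall rule and alert list in `demo` are all hypothetical stand-ins for whatever change and alert source you actually own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One check: make a change, then see whether the alert fires."""
    name: str
    inject: Callable[[], None]    # makes the change, in a zone you own
    detected: Callable[[], bool]  # asks the alerting pipeline what it saw
    rollback: Callable[[], None]  # restores the original state


def run_experiment(exp: Experiment) -> dict:
    """Inject, check detection, and roll back even if a step raises."""
    seen = False
    try:
        exp.inject()
        seen = exp.detected()
    finally:
        exp.rollback()
    return {"experiment": exp.name, "alert_fired": seen}


def demo() -> dict:
    """Hypothetical in-memory stand-ins for a firewall rule and an alert feed."""
    rules = {"ssh_source": "10.0.0.0/8"}
    alerts: list[str] = []

    def widen_rule() -> None:
        rules["ssh_source"] = "0.0.0.0/0"
        # Stand-in for the real control noticing the change.
        alerts.append("firewall-rule-changed")

    def restore_rule() -> None:
        rules["ssh_source"] = "10.0.0.0/8"

    exp = Experiment(
        name="widen SSH rule",
        inject=widen_rule,
        detected=lambda: "firewall-rule-changed" in alerts,
        rollback=restore_rule,
    )
    return run_experiment(exp)


if __name__ == "__main__":
    print(demo())
```

Keeping the rollback in a `finally` block is a deliberate choice here: the tampered rule is restored even when the detection check itself fails.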

Ashish Rajan: Do you find that a lot of people are a bit hesitant to try that? It’s like asking people to do testing in production. People just go, I don’t know, man, I do love my job. Do you find that it’s a mental barrier?

David Lavezzo: No, I think being cautious is a good thing. So what we do is we have our own zones where we have the same rules, and if we break something in production, it doesn’t matter, because I’m breaking my own thing. The alerts should work regardless of where it is.

I’m not going to go into something that actually generates money and say, hey, guess what, I’m going to run malware on here, because that would not make me very popular.

Ashish Rajan: I definitely would not encourage that as well. And anyone listening should not either. I’m gonna change gears a bit as well, because I think we’ve kind of bounced on a couple of topics, so I’m just gonna bring it back to the [00:09:00] basics.

So, especially for the beginners over here: we’ve defined what security chaos engineering is, we’ve spoken about the experiments, we’ve spoken about some of the terminology people could be looking at, and we’ve mentioned that AWS or serverless doesn’t really matter, because what you’re trying to do is make sure there’s enough resiliency in your environments, in the way they behave and function.

To your point, if there is malware, are you able to respond to it appropriately? To someone who’s listening to this, it sounds like a great idea: oh yeah, I guess I should be able to control what happens in my application. If I get DDoSed, I should be able to respond to it.

Insert scenario here, you should be able to respond to it. But is there a good starting point for this? Where does someone start, especially someone who may have a lot of history? I’m not going to use a startup as the example, because that’s very small, but imagine an enterprise where there’s a long history of code and applications and products as well.

[00:10:00] Is there like a good starting point for those kinds of things?

David Lavezzo: I spend a lot of time doing attack trees, just because they help people visualize what it is that we’re looking to do. They show the relationships, how you get from one spot to another, and make the concept easy to understand. So the way I would start is to find a problem that relates to something the business cares about.

So if you have a SOC, take some critical alerts and test them. If an alert hasn’t fired in a while, that’s even better, because usually when you’ve written an alert, you have to wait for someone to attack you. This way, you don’t wait to measure it. If it hasn’t fired in a month, then either you’re not being attacked or your alert is broken.

You don’t really know, so you would want to test that. Once you test it, you can say, okay, I’ve measured this, I know this works. Then you automate it, and when you have that, you can start generating metrics.

So you take the problem, assign metrics, and make sure it works. Like, I [00:11:00] fire this off every single day for a month, and we’ve got 30 out of 30. It works, and it continues to work. Then you can start doing deviations. If I change it slightly, does it still work? Does it introduce false negatives?

So it’s proof positive and proof negative. You could do the same thing with a WAF. Let’s say you had your developers going through a WAF, running some web shells. Does the WAF stop your web shells? If it does, cool, this is good. At least we can say, hey, this works.

I know it works. Here’s the data that proves it.
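The "30 out of 30" metric can be computed from a simple log of test runs. The sketch below is a minimal Python illustration under assumed inputs: the alert names and the `(alert, fired)` record shape are hypothetical, not taken from any specific SOC tooling.

```python
from collections import defaultdict


def detection_rates(runs: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each alert name to the fraction of test runs in which it fired."""
    fired: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for alert, did_fire in runs:
        total[alert] += 1
        fired[alert] += int(did_fire)
    return {alert: fired[alert] / total[alert] for alert in total}


def needs_attention(rates: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Alerts whose detection rate fell below the expected threshold."""
    return sorted(alert for alert, rate in rates.items() if rate < threshold)


if __name__ == "__main__":
    # Hypothetical month of daily tests: one alert always fires, one missed once.
    month = [("web_shell_upload", True)] * 30
    month += [("role_change", True)] * 29 + [("role_change", False)]
    rates = detection_rates(month)
    print(rates)
    print("investigate:", needs_attention(rates))
```

A deviation test, changing the attack slightly, would simply be logged as its own alert name so that a new false negative shows up as a separate rate.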

Ashish Rajan: Interesting. And you call it an attack tree?

David Lavezzo: Yes, I call them attack trees. Just think of them as graphs. You start here, then you just draw: all right, I would go to here. Draw a little box, go to here, add the different things that can go off, and pivot from one spot to another.
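Since an attack tree is "just a graph", it can be written down as an adjacency map and walked to list every route from an entry point to a target. This is a minimal Python sketch; the node names in the example are hypothetical.

```python
def attack_paths(graph: dict[str, list[str]], start: str, goal: str) -> list[list[str]]:
    """Return every cycle-free path from start to goal using depth-first search."""
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # never revisit a node on the same path
                stack.append(path + [nxt])
    return paths


# Hypothetical attack tree: each key can pivot to the nodes in its list.
EXAMPLE_TREE = {
    "phishing email": ["workstation"],
    "workstation": ["file server", "domain controller"],
    "file server": ["customer data"],
    "domain controller": ["customer data"],
}

if __name__ == "__main__":
    for path in attack_paths(EXAMPLE_TREE, "phishing email", "customer data"):
        print(" -> ".join(path))
```

Each hop on a printed path is a place to ask which tool should have seen that step.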

Ashish Rajan: Oh, right, right. An attack tree is almost like an attack vector as well. Basically you identify an asset which is important, but not so important that you piss off some people, I [00:12:00] guess, and start showing value from there by mapping out the attack tree for

the possible scenarios that you already look out for. Then you raise or test those alerts, and then start the deviations. Is that how you would approach it?

David Lavezzo: And then what you could do is, in between every layer, just think: okay, I know we have this technology here, we’ve got these tools, this is what should see this,

this is what should see that. So then you can build the expectation: okay, we expect that we’ve got this kind of visibility here. Do we?

Ashish Rajan: I almost wonder, is this a full-time job then? I’m thinking that a lot of companies, and I guess a few of the folks listening in, would have a very small security team.

Do you reckon it can be automated, so you define it once and it just keeps happening? I just don’t want people to think that this is only possible if it’s a big company with a massive security team. Would you agree?

David Lavezzo: Yes and no. I think if you’ve got the right engineering, it should start, live, and be driven by engineering, because they’re the ones who are doing the work. They’re the ones who know their [00:13:00] systems the best. They will know their apps, and they’ll know their gaps. Security would exist just to double check.

It’s just a second pair of eyes. Let’s say you’re doing the development. It’s like, okay, I have developed this app, there are these flaws, we’ve remediated these flaws. Just check those flaws before you actually go live, so it’ll reduce rework in the future.

Ashish Rajan: Interesting.

David Lavezzo: I would see it being nested into all the different teams, because with one overarching team, that makes it a full-time job. But when it gets segmented...

Ashish Rajan: It seems like we’re going into DevSecOps territory as well, but I’m going to stay away from it.

But I’ve got a question here from Vineet: would you scope physical hardware failure as part of security chaos engineering?

David Lavezzo: I would and I wouldn’t. The reason I wouldn’t is that I don’t keep it in my scope to simulate an entire outage. But I would take into consideration: if that layer was missing, do we have enough visibility to be able to compensate for that?

Ashish Rajan: Oh, interesting.

David Lavezzo: Let’s say you have an [00:14:00] IDS. It’s like, all right, I’ve got an IDS, I’ve got an IPS, I’ve got firewalls.

If the IDS just drops, do we still have visibility before that layer and after that layer, so we can identify that? So I wouldn’t necessarily simulate physical hardware failure, but I would take into account: if that was dropped, do we still have the layers of defense that we expect?
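The "if that layer dropped, do we still have visibility" question can be asked of a simple coverage map before running anything. The Python below is a minimal sketch; the layer names and techniques are hypothetical, and real coverage data would come from your own test results.

```python
def still_covered(coverage: dict[str, set[str]], failed_layers: set[str]) -> dict[str, bool]:
    """For each technique, report whether any surviving layer still sees it."""
    techniques = set().union(*coverage.values())
    surviving = [seen for layer, seen in coverage.items() if layer not in failed_layers]
    return {t: any(t in seen for seen in surviving) for t in sorted(techniques)}


# Hypothetical map of which layer has been shown to see which technique.
EXAMPLE_COVERAGE = {
    "firewall": {"port scan"},
    "ids": {"port scan", "lateral movement"},
    "endpoint agent": {"lateral movement", "credential dumping"},
}

if __name__ == "__main__":
    for layer in EXAMPLE_COVERAGE:
        blind = [t for t, ok in still_covered(EXAMPLE_COVERAGE, {layer}).items() if not ok]
        print(f"if {layer} drops, blind spots: {blind or 'none'}")
```

In this made-up map, losing the IDS leaves every technique visible to some other layer, while losing the endpoint agent leaves credential dumping unseen.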

Ashish Rajan: Oh, interesting.

So I guess there’s almost an overarching step before you start, where you’re doing asset identification and working out what’s important. Then you go, okay, what are my different failure scenarios and attack vectors, and map those out, almost like making a playbook inside the company

for what makes sense for you?

David Lavezzo: When you just talk about it, it’s like, ah, this is really easy. But then when you start getting into it, it’s like, whoa, I didn’t even think about this. This is a lot of work. Okay, let’s roll.

Ashish Rajan: Yeah, it’s funny. What’s a realistic timeline for even a small experiment? Like what’s the smallest experiment you’ve seen?

David Lavezzo: The smallest one was the one I started with, which was taking the top 10 [00:15:00] critical alerts that I hadn’t seen go off and trying them.

I did that on the endpoint, because on the endpoint I didn’t have to involve anybody else. It was all contained on my machine. Okay, let’s run this, let’s see what goes on. Okay, cool, I’ve got results. Then present that to the teams: are these the results you expected? If yes,

cool. If no, all right, cool, let’s fix it.

Ashish Rajan: That’s interesting. It’s a great way to collaborate as well, because at the end of the day, to your point, the scaling mechanism for chaos experiments seems to be in other teams, not in your team. So you do want to start with, hey, let me show you value first, and then I’ll let you carry on with the baton or whatever.

David Lavezzo: When you’re doing it, it’s important to not make people feel bad. If it’s something on someone else’s team and they feel bad, you’re going to get resistance. So when you generate it, you’re generating metrics to help them show their work.

It’s like, okay, all the stuff that you’ve been doing, it works. Here’s the evidence that it works. Cool, we’re delivering the value that we think we should be. So it’s helping [00:16:00] generate metrics to lift up other teams too.

Ashish Rajan: Right. Obviously there could be a lot of use cases based on the organization, but are there some common ones?

I think that you already mentioned one fact, which is like an alert that should work. So if you’re not part of a SOC team, or if you’re not looking at, I guess, alerts, you can always reach out to security and find out what should not be working and be part of security team or work with them.

If they’re not the ones initiating it.

David Lavezzo: Yeah. And that’s just one of the different use cases. For whatever reason, I decided to jump straight into the deep end. It’s like, all right, hey, we can do this, this, this, this, this, and then I immediately regretted it after I said, yes, I can do it.

But it is new for helping build the relationship between the teams.

Ashish Rajan: Yeah. So I guess one big takeaway is that it’s a great way to collaborate and not be hated by the security or the engineering team. Do it the right way, with collaboration, so others get to learn. But would you run this in production, though?

I think that’s the question that I have.

David Lavezzo: I only do it in production.

Ashish Rajan: Yeah. For me, it’s one of those misconceptions about chaos engineering experiments: [00:17:00] the whole conversation suddenly revolves around “I’m not going to be running tests in production.” But what’s your response?

Do you normally run in dev or test and hope it works in production, or is the end goal to eventually land in production?

David Lavezzo: The end goal is going to be to run it in production, because I would assume even very large companies don’t have identical protections in dev as they do in production. And we’re not testing dev.

We want to know what is going on in real life, in the real environment. So we’ll do the staging, where we develop what it is that we’re going to do in dev and make sure it runs, and then, all right, cool, take that and put it in production. We’ll see what this looks like. Because when you’re getting attacked, that’s happening in production.

So if someone asks, hey, what’s our coverage against APT 45? Well, I can run it in dev, but that doesn’t really show me anything, because I don’t know what our tooling actually does. That’s why we run in production, and it’s all in systems that the team manages. We’re not hitting other people’s things that we don’t own.

It’s our own stuff.

[00:18:00] Ashish Rajan: Yeah. And I think, to your point, it’s kind of like checking whether you locked your door when you’re walking out, right? A lot of people do that. Some people do it with OCD, and maybe I’m one of those, but at least you’re checking.

It’s almost like giving yourself a sense of assurance that it’s done, and it’s going to work even if I’m sleeping, so I’m not woken up at 3:00 AM. For people who may be inspired by this and are thinking, okay, this makes sense, I can at least start with a couple of experiments:

if you talk about acceptance from a leadership perspective and a wider group perspective, what’s the value conversation you’re having? Is there a way to get their buy-in as well? Especially for people who may not have heard of chaos engineering,

it may make them nervous when you talk about doing chaos experiments, if they take it literally. So how do you go about getting buy-in? What’s the value add that you’re trying to promote?

David Lavezzo: So the value line I used was just return on investment, because [00:19:00] people care that we spend a lot of money and a lot of time on the tooling, on the people, on everything that we have.

And there are traditional metrics, like we’ve seen this many of these events, this many of those. Okay, but they’re not really related. We don’t know if there’s more than that. So what we do is identify the return on investment of our tooling. Are the tools paying off? Are they optimized? Are they seeing everything that we expect them to see?

Do we have duplicative tooling, like three of the same thing? Do we have defense in depth? Can we prove it? How many layers can fail before we’re in trouble? It’s being able to measure that and establish it as the baseline, so we can say, hey, we’re doing a good job. Here’s the evidence. Here’s what we’ve seen this week.

Here’s what this week looks like. And then measure it from year to year or quarter to quarter, whichever one you want. But it’s really to show that the money and the time that have been invested in the teams are paying off. The best way that I’ve found to get buy-in is to help them prove to their senior leaders that they’re doing [00:20:00] what they were tasked to do.

Ashish Rajan: That’s really interesting, because that builds on the whole resiliency thing as well, right? A lot of people are questioning: are we resilient enough in this COVID world, when people are working from home? How do we know that it’s Ashish logging in? Or even, oh yeah, we have an IDS which checks for blah, blah, blah.

Or we have a VPN that checks for us. But is anyone testing this? And is anyone testing this at scale, multiplied by the number of employees you have? So yeah, I can definitely see it. The ROI is almost being created by building resiliency and giving that comfort and trust to, I guess, the biggest stakeholders. Is that where you’re coming from as well?

I think that’s what we’re trying to build over here.

David Lavezzo: That was the way that I was able to get the go-ahead, by showing that. And then from there we do a few other use cases. We do security operations with alerts, and we help with training, because when we’re doing things, we’re doing actual, real-life attacks.

So when we attack something, the logs show up and [00:21:00] analysts can go investigate that and see what it would look like if it was not us, because our goal is to not stay hidden. We want to be seen. And we want to know what we log and what we don’t log, and we do that with gap hunting. So detection gaps, prevention gaps.

If we change something slightly, do we have something that’s brittle, and do we now miss it?

Ashish Rajan: Oh, what’s gap hunting?

David Lavezzo: So it’s just knowing what we know. There’s the saying, we don’t know what we don’t know. Well, if you’re running these tests, you can see what you don’t know.

It’s like, I did this, we didn’t see it. Okay, cool. So let’s say you have a hunt team. Hand it off to the hunt team: hey, we know this is missing, can you try and find something here? Or, we know we don’t alert on this one, here’s something that might be interesting to look at. So it’s finding the gaps and then going to get them fixed.

I use a process similar to traditional chaos engineering. You start with the steady state. So first you’re running things and finding out what you actually see and what you don’t see. Consider that to be your baseline. And then that’s when you start experimenting.

You just think, [00:22:00] you know what, I wonder what will happen if I do X. From there you’re identifying the gaps, any deviations from the expected outcome, and then helping drive the improvements. Because like I said before, we’re not trying to break it. We’re trying to fix it. So once you find the gaps, start the process of fixing, and then repeat it in production.

We do it every single day. We generate a lot of alerts.
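The steady-state-then-compare loop can be sketched as a diff between a baseline run and the latest daily run. This Python is a minimal illustration, assuming each run is recorded as a map from a test name to whether it was detected; the test names and the three result buckets are hypothetical choices.

```python
def find_gaps(baseline: dict[str, bool], latest: dict[str, bool]) -> dict[str, list[str]]:
    """Compare the latest run with the steady-state baseline.

    known_gaps:  missed at baseline and still missed (hand to the hunt team).
    regressions: detected at baseline but missed now (something broke).
    fixed:       missed at baseline but detected now (a gap was closed).
    A test absent from the latest run is treated as not detected.
    """
    result: dict[str, list[str]] = {"known_gaps": [], "regressions": [], "fixed": []}
    for test in sorted(baseline):
        before = baseline[test]
        now = latest.get(test, False)
        if not before and not now:
            result["known_gaps"].append(test)
        elif before and not now:
            result["regressions"].append(test)
        elif not before and now:
            result["fixed"].append(test)
    return result


if __name__ == "__main__":
    baseline = {"web shell": True, "role change": True, "dns tunnel": False, "macro doc": False}
    today = {"web shell": True, "role change": False, "dns tunnel": False, "macro doc": True}
    print(find_gaps(baseline, today))
```

Running the same comparison every day is what surfaces a regression introduced by tuning or a code change.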

Ashish Rajan: Interesting. I love the fact that there’s a cool term for it as well, gap hunting. I’m going to add that to my dictionary.

And it’s a good segue into a question that came in from Paul: chaos engineering is almost an anti-pattern to the kind of CI/CD processes some organizations are running, right? A test I write often checks for the existence of a dependency and that it’s got a new timestamp. Chaos engineering takes the opposite of that and runs with it:

how did that affect my application or environment? Would you agree?

David Lavezzo: I would agree with that.

Ashish Rajan: I guess we spoke about some metrics and touched upon them, and we did talk about gap [00:23:00] hunting as well. If you’re trying to run this as a program, and you have a couple of people in your team who you want to run this with, what would this look like at scale, and what metrics do you reckon they should be measuring themselves against in terms of maturity?

David Lavezzo: Yeah. So that is a nice chunk. That’s why there’s the emphasis on self-testing and being able to test yourself. I can’t test everything myself, but teams who are able to embrace it can test themselves and improve themselves. I picture a world where

your teams can report on what they’ve already tested. And it’s like, do we have this? We don’t have it, and even if we did, here are the indicators of compromise.

If we’ve had it and we didn’t know, we can detect it, we can see this, so we don’t have to worry about this. And then report that up. So it’s just more proactive security.

Ashish Rajan: Oh, I love it. And to your point, I’m going to add a few more [00:24:00] examples in there. It could be as simple as checking whether your external-facing endpoint is running TLS 1.0 or lower. How do you know at any given point in time, especially if you have a long history of code? That could be another experiment to run as well.

Another one that I came across was: what are your current public endpoints, and do they match what the company thinks and its risk profile? And I’m like, oh, that’s an interesting one. Yeah, I guess that affects your risk profile. I know I’m putting you on the spot for this one, but are there a couple more examples that you have seen which are great starting points for people to look at?

David Lavezzo: It’s important to have an up-to-date asset inventory, so I think that would be something you could start with. If you know that you’ve got maybe some outdated TLS, you can run some experiments that show what happens. Actually, that’ll lead back to an attack tree, where if this happens, you map out what’s going to happen afterwards.
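The public-endpoint experiment mentioned above comes down to comparing what the asset inventory says is exposed with what is actually observed. Here is a minimal Python sketch; the hostnames are placeholders, and how the observed set is gathered (a scan, a cloud API export) is left as an assumption.

```python
def compare_inventory(expected: set[str], observed: set[str]) -> dict[str, list[str]]:
    """Diff the documented public endpoints against the ones actually seen."""
    return {
        # Exposed but not in the inventory: the risk profile is understated.
        "unexpected": sorted(observed - expected),
        # In the inventory but not seen: the inventory may be stale.
        "missing": sorted(expected - observed),
    }


if __name__ == "__main__":
    inventory = {"api.example.com", "www.example.com"}
    seen = {"api.example.com", "www.example.com", "legacy.example.com"}
    print(compare_inventory(inventory, seen))
```

Anything in the unexpected list is a natural candidate for a follow-up check, such as which TLS versions it accepts.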

Ashish Rajan: And kind of [00:25:00] like to your point about the attack tree, it does make sense to kind of start defining it. What I found and where I got those example was in my couple of previous companies where you come across a scenario as a security team and say, Oh yeah, I’m going to add that to the backlog, but it’s technically, it’s just a thing which should you should be consistently looking at or proactively doing security and keeping an eye on it for these things.

Like how exposed am I? So keeping that in mind is this something that you recommend that people should like get anyone get into this or is this. Like only for, I don’t know the security, but sounds like we were trying to collaborate. We’re trying to get the engineering team and that as well. Anyone with like the dev capabilities should be able to start doing experiments in their organization, right?

There are people listening to this who may not be in security; they’re just interested in security. So they should be able to go out and do some experiments themselves.

David Lavezzo: Yes. It’s getting really, really easy now. If you’re worried about, let’s say, defensive resiliency with endpoints or anything like that, not related to cloud or security [00:26:00] groups or resource exhaustion, there are a bunch of different open source toolkits that make it easy.

So even if you don’t have development experience or engineering experience, there’s something like Caldera, PurpleSharp, and Atomic Red Team. And there’s another one that was introduced: the folks who made Caldera are also doing something at Prelude, and they released an open source testing kit.

And so that helps automate that.

Ashish Rajan: To your point about the golden age of offensive security, this is a golden age of proactive security as well, I guess. Yeah.

David Lavezzo: So you can pay for it. There are a lot of different companies out there. It could be adversarial simulation or breach simulation.

There are many vendors out there. And then there are the free ones. So if you don’t know if you want to put the money into it, the open source ones are pretty good. Something as simple as running this script is going to run a whole bunch of things. It crosses over with threat engineering or detection engineering, where they’re doing something pretty similar: they’re looking specifically for their own gaps to help improve their own detection. And this is [00:27:00] pretty close, but we’re looking at pharmaceutical and see, looking to establish that and just keep running it every single day.

Because when you tune, when you do anything, you’re going to change the state, and you can’t check everything. When you’re checking your code, it’s like, okay, this works, we detect this, cool. But maybe you broke something and you don’t know it. If you don’t run everything and test everything, you’re not going to know that something’s broken.
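The point about re-running everything every day can be sketched as a regression check between two runs. This assumes each run produces a mapping of test name to whether the activity was detected. The function name and the test names are hypothetical and do not come from any of the toolkits mentioned.

```python
def find_regressions(previous_run, current_run):
    """List tests that were detected in the previous run but not in this one.

    Each run is a dict mapping a test name to True (detected) or False
    (not detected). A test missing from the current run is treated as
    not detected, since a silently dropped test also hides breakage.
    """
    return sorted(
        name
        for name, was_detected in previous_run.items()
        if was_detected and not current_run.get(name, False)
    )


# Hypothetical results from two consecutive daily runs.
yesterday = {"credential_dump": True, "lateral_move": True, "dns_exfil": False}
today = {"credential_dump": True, "lateral_move": False}

broken = find_regressions(yesterday, today)
```

Anything in `broken` was working before the last change, which is the kind of silent breakage a one-off check would miss.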

Ashish Rajan: I find that really fascinating, man, because this kind of goes back to the first question that I had. A lot of people get nervous when we talk about this: oh, this sounds like a lot of work, it’s going to take a while. And it is a lot of work.

I wouldn’t deny it. But I almost find it’s a change in mindset as well, because every security team was already doing a list of things, I guess, like SOC, threat hunting and all that. To your point, I kind of feel like this is something that can be shared with some of the other parts of security as well.

To your point, you were talking about gap hunting; threat hunting could be helping with that as well. You could look at your SOC team.

[00:28:00] Is there a lot of crossover within the security team for this as well? Where if you’re thinking about, oh, how am I going to pitch this to people within security?

For someone from a traditional security space, where there are well-defined roles, how do you encourage them to do proactive security? We spoke about some of the simple steps, we spoke about some of the simple experiments people can run, and how it’s a great collaborative thing. I feel it’s kind of like the whole DevSecOps concept, as well as a lot of other things. It does require a change in security culture and a change in security mindset.

How do you foster that? Any thoughts on that?

David Lavezzo: It’s a hard one because they’re doing their day jobs. They have their own main core responsibility. They may not have time to do it, which is why, for me, I was like, all right, we’re coming in as another group doing this, because they might not have the staffing.

They might not have the time to do it. So we foster that by helping them with metrics, showing that their tools are working, that they’re doing their job well. We’re not really making [00:29:00] metrics for ourselves. We’re making metrics for everyone else, because right now you’ll see, it’s like, hey, we saw 3,000 attacks over this last week.

Okay, great. Not that it’s bad. Anyways, I’m gonna get in trouble. So when we’re running it, it’s like, yeah, we ran 5,000 of these ourselves. Did we see all 5,000? Yes. Okay, cool. We’re doing something well, so let’s ramp it up.
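The metric described here, "we ran 5,000, did we see all 5,000", can be sketched as a coverage ratio. This assumes every simulated attack carries an identifier that can be matched against what the detection tooling reported. The function name and the identifiers are hypothetical.

```python
def detection_coverage(executed_ids, detected_ids):
    """Return (coverage ratio, sorted list of missed test ids).

    Only detections that match an executed test are counted, so
    unrelated alerts do not inflate the number.
    """
    executed = set(executed_ids)
    if not executed:
        return 0.0, []
    seen = executed & set(detected_ids)
    return len(seen) / len(executed), sorted(executed - seen)


# Hypothetical identifiers for illustration only.
ratio, missed = detection_coverage(
    executed_ids=["t1", "t2", "t3", "t4"],
    detected_ids=["t1", "t2", "t4", "unrelated-alert"],
)
```

Unlike a raw count of attacks seen, the ratio has a known denominator, so a team can show that its tooling caught what was actually thrown at it.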

Ashish Rajan: I love that, by the way, because you’re thinking about putting this across in a way that says this is for you, this is not for me. This is for the company, and this is me helping your team identify the gaps that they have, rather than me poking holes and saying you’re not doing your job well. Maybe that’s how it should be worded.

David Lavezzo: Some people call it seagull consulting, where you just kind of fly in and then you just crap all over everything. That’s not what we want to do. Nobody likes that. It’s like, okay, we’ve got three people here and you just came in and started spraying us all. No, thank you.

So it’s more that you have to build that relationship and help them out. Just show: hey, I want to help you. I want to help [00:30:00] your team show your value, show what you’re doing.

Ashish Rajan: Yeah. And are you seeing any trends from when you started working in the security chaos experiment space to where it is now? Because you’ve been in this for some time. Is it much easier to learn chaos engineering now? What’s standing out for you? We spoke about a few resources, but I’m just curious about where you see this going.

David Lavezzo: I think it’s getting easier, especially when you get into Amazon releasing their fault injection service. So they’re making it easier, and then with the release of all the open source toolkits, it’s a lot easier now than it was. Before, it was probably people thinking, let’s figure out a way just to stage this and run it, and we’ll figure out what happens next.

So everything has gotten easier on the probing side. Defense is not easier. It’s hard.

Ashish Rajan: So I haven’t heard about the Amazon one. Did you say fault injection? They have a service for it?

David Lavezzo: Yes, it is Fault Injection Simulator, a fully managed [00:31:00] chaos engineering service that makes it easier for teams to discover an application’s weaknesses at scale.

Ashish Rajan: There you go. Wow. Is this something that came out recently? I guess so, because I definitely wasn’t aware of it. Clearly I’m talking to the expert here, so you were on top of this the moment it came out. I do appreciate that. So this is really great, man. For people who are really curious about security chaos experiments and how they can get some more information, where can they reach out to you?

David Lavezzo: I would say probably the safest one is going to be LinkedIn because I’ll actually respond. Because other than that, I don’t really spend that much time in social media. I think it’s a result of growing up in the nineties when social media was used to just troll everybody. So I tried to avoid it because like you could ask any of the people that I work with.

So LinkedIn is going to be the best place.

Ashish Rajan: Fair enough. All right. I’ll put a link to that in the show notes as well, so people can reach out to you, man. But this was really awesome. Thank you so much for coming in and dropping some gems here, man.

David Lavezzo: Thanks for having me. No problem.

Ashish Rajan: All right. For everyone else, I hope you got some value out of it. We look forward to seeing you guys on the next episode as well, but in the [00:32:00] meantime, thank you, David, and I will see everyone next.

‍
