Getting Started with Chaos Engineering

Cloud Security Engineering Series

Jul 26, 2020

Season One

Episode Description

What We Discuss with Aaron Rinehart:

What is Chaos Engineering?
Is Fuzzing part of Chaos Engineering?
Is Chaos Engineering for SREs?
Is there an example of application fault injection from a cloud perspective?
What concepts of Chaos Engineering are people not talking about?
Does Chaos Engineering need to happen in production?
How does Chaos Engineering affect readiness in terms of incident response?
Would Chaos Engineering be part of a Table Top Exercise with executives?
And much more…

THANKS, Aaron Rinehart!

If you enjoyed this session with Aaron Rinehart, let him know by clicking on the link below and sending him a quick shout-out:

Click here to thank Aaron Rinehart on LinkedIn!

Click here to let Ashish know about your number one takeaway from this episode!

And if you want us to answer your questions on one of our upcoming weekly Feedback Friday episodes, drop us a line at ashish@kaizenteq.com.

Resources from This Episode:

Tools & services, discussed during the Interview
Cloud Security Academy

Ashish Rajan: [00:00:00] Welcome, Aaron! I know chaos engineering is a very interesting topic. I’ve given a couple of talks about it myself, but I’ve always found it really hard to find someone close to the source who has actually implemented it. In the majority of the polls that I ran on LinkedIn, Twitter and Facebook, the options included “Yes, I have implemented it” and “Yes, I know about it.”

Most of the responses were either “No, I don’t know about it” or “Yes, I know of it, but never implemented it,” which is really interesting from my perspective. I think you and I have been talking about chaos engineering for almost two or three years now, so it was surprising to me that not a lot of people knew about it.

So I would love to deep dive into it. But before I do, for people who are listening to you for the first time, who is Aaron Rinehart?

Aaron Rinehart: [00:00:55] Who is it? A-aran,

Ashish Rajan: [00:00:57] A-aran. Yo, going to go?

I want [00:01:00] to use that for the next one. Yep.

Aaron Rinehart: [00:01:01] Who am I? I am the CTO and co-founder of a company called Verica. We are a somewhat stealthy startup, you know, a Series A startup.

We are the creators of chaos engineering. Casey Rosenthal is my co-founder. He’s the CEO, and he created chaos engineering and ran the teams at Netflix. Basically, we’re bringing a more sophisticated set of products that make it easy for people to implement and get the value from chaos engineering as a practice.

If you look at Verica’s website, you’ll see material about continuous verification, as well as a good section in the O’Reilly book that just came out. Oh, I believe Chris, we should

Ashish Rajan: [00:01:39] probably hold it up for a few more seconds. That’s right. Thanks. Yeah, I got a good shot of it.

Yep. That’s it.

Aaron Rinehart: [00:01:45] Actually, if people go to verica.io/book, there’s a chance to win a copy of the book, if you’re interested in getting a free one.

Ashish Rajan: [00:01:52] Oh, yeah, there you go. Perfect. You guys heard it here first. So just go there, and I’ll leave a [00:02:00] link in the show notes as well for people to sign up.

I’m curious to know your path into cybersecurity as well, because you have an interesting one, which is kind of a traditional path. So I’m keen to know: what was your path into cybersecurity?

Aaron Rinehart: [00:02:12] Well, let’s see. My background extends before Verica. I was the chief security architect of UnitedHealth Group.

At that company I was part of leading the DevOps transformation, the cloud transformation and the open source transformation. Actually, the first open source tool for UnitedHealth Group, the largest healthcare company in the world, was ChaoSlingr, which was the first application of Netflix’s chaos engineering to cybersecurity.

We’ll talk more about that, I’m sure, throughout the rest of the show. But my background: I started off in systems and network engineering. I learned most of it in the military. I was much more experienced than most people twice my age at the time, but I couldn’t get a job.

I couldn’t get a job early on, even though my experience was all hands-on. But I could get a [00:03:00] job as a software engineer, even though I didn’t know software. My background in school was finance and economics, and I went into investment banking for a short period of time. So I learned software engineering.

Basically, I started with databases, then worked out to the front ends, and then started building my own apps. I’m very thankful to the open source world; that’s where I learned most of what to do. Anyway, I was a software engineer for a little over a decade.

And I ended up working for NASA for a number of years. I actually worked in safety and reliability engineering.

So I have to ask you about the NASA thing as well. What’s up with NASA?

You know, I got laid off years ago, and NASA was the first place to call. They were like, “Hey, would you like to come out and work on space stuff?”

I’m like, space? You know, yeah. So I went out there to be a software engineer, building apps and software applications. Then this opportunity came up to do the security role, because they didn’t want to hire an extra headcount to do it, and I said, sure, I’ll do [00:04:00] it.

Right.

Ashish Rajan: [00:04:00] So

Aaron Rinehart: [00:04:02] So then I got into it, and it turns out that if you’re an engineer, you’ve been a builder most of your career, and you get into security, it’s a pretty fast accelerant, because you can’t lie to me about how things are really built. But I also have the ability, or the facility, to receive and transmit empathy with somebody.

I can say, “Hey, building stuff that’s never been built before is hard. I’m not trying to make your life hard; I’m just trying to teach you what we have to do to make things secure.”

I kind of took my career this way because I wasn’t constantly fighting people; I was able to enable them and lift them up. From NASA I went to a bunch of other places in between, but I ended up at United, and then I ended up here.

Ashish Rajan: [00:04:44] So yeah. Thank you. Thanks for sharing that. By the way, before we go into the crux of the questions.

Cheers. Cheers. What is that, by the way?

Aaron Rinehart: [00:04:54] salt whiskey, you know, the American street whiskey now.

Ashish Rajan: [00:04:57] Oh, American street whiskey,

Aaron Rinehart: [00:05:00] which is actually black tea with aloe vera. It’s

Ashish Rajan: [00:05:04] Can you do that? I didn’t even realize you could do that, but there you go. There’s all this.

Aaron Rinehart: [00:05:08] Yeah. So this black tea has no sugar in it.

But the acidity from the aloe vera, combining with the black tea, kind of gives it a sweeter taste without the sugar,

Ashish Rajan: [00:05:18] right? Yeah. I was going to say I’m not a black tea kind of person, but black tea with aloe vera may be interesting. Getting into the crux of this: what is chaos engineering? Please demystify this weird and at the same time mysterious concept for us.

Aaron Rinehart: [00:05:33] Sure. I’ll give you my definition. Chaos engineering is the idea of introducing turbulent conditions into a system to try to determine the conditions under which it will fail, before it actually fails. A lot of the time we don’t learn what was wrong in the system, or that there was some hidden failure, until after the fact.

I’ll tell a brief story. This story kind of helps me explain chaos engineering to [00:06:00] people. Early on, in our first year as a company, we met with basically a hundred large companies across the world to try to figure out what tech stacks to build to and align to.

We met with one of the largest payment processing companies, and they were talking about how they had this legacy application. Casey was talking to them and I was just kind of eating my lunch, listening to the conversation. The chief engineer was saying, “We have this legacy application that does 90% of our revenue for the company. It’s a robust application.

The engineers who know it are competent in the role, and there is rarely an outage. And we want to move it all over to Kubernetes,” because that product was kind of scaling and changing. It got me thinking; it was kind of an epiphany. I was like, hmm, how do systems become stable?

Was that legacy system always stable? Was it always so well known? Did you always have the right engineers? A lot of [00:07:00] times our systems become stable because we end up learning, through a series of unforeseen accidents, mistakes or surprises, what we didn’t know.

But often that learning is incurred through tremendous pain. One, for the engineers, worrying about being blamed, named and shamed for causing an incident or an outage. On top of that, customers encounter pain. At first they are frustrated, and the company may lose customers.

But it doesn’t have to be that way. Chaos engineering is a way to proactively inject failure into the system with a hypothesis: given how I designed it, I know in my mind for a fact that the system will respond in a particular way. We never run a chaos experiment we know is going to fail. If you know it’s going to fail, just fix it.

You’re not going to learn anything new; you already know what doesn’t work. So the idea is to question the system, to ask the system questions about things you know to be true, or think are true. And what’s funny is that the first time you try that, I guarantee you [00:08:00] you’re wrong.

I’ve never seen otherwise. I mean, there may be somebody who gets it right. There’s got to be somebody.
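Editor’s sketch: the hypothesis-first framing Aaron describes can be illustrated with a few lines of Python. This is only a toy under stated assumptions, not how Verica or Netflix implement experiments. The `Experiment` class, the `run` helper, and the in-memory `system` dictionary with its `replicas` and `min_healthy` keys are all hypothetical names invented for illustration. The idea is that an experiment states what you believe will happen, injects a condition, and then checks whether the belief held.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    hypothesis: str                  # what we believe the system will do
    inject: Callable[[dict], None]   # introduces the failure condition
    verify: Callable[[dict], bool]   # True if the system behaved as hypothesized


def run(experiment: Experiment, system: dict) -> dict:
    """Inject the condition, then compare observed behavior with the hypothesis."""
    experiment.inject(system)
    held = experiment.verify(system)
    return {"experiment": experiment.name,
            "hypothesis": experiment.hypothesis,
            "held": held}


# Hypothetical toy system: three replicas, and we believe losing one is survivable.
def kill_one_replica(system: dict) -> None:
    victim = sorted(system["replicas"])[0]   # deterministic choice for the sketch
    system["replicas"][victim] = False


def still_serving(system: dict) -> bool:
    healthy = sum(system["replicas"].values())
    return healthy >= system["min_healthy"]


experiment = Experiment(
    name="lose-one-replica",
    hypothesis="The service keeps serving when one replica is lost",
    inject=kill_one_replica,
    verify=still_serving,
)

system = {"replicas": {"a": True, "b": True, "c": True}, "min_healthy": 2}
print(run(experiment, system))
```

If `held` comes back false, the hypothesis was wrong, and that surprise is the learning the experiment exists to produce. An experiment you already expect to fail is just a known problem to fix.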

Ashish Rajan: [00:08:07] So a chaos engineering experiment should never fail?

Aaron Rinehart: [00:08:10] No, you should never run one when you know it is going to fail, because you’re not going to learn anything.

Ashish Rajan: [00:08:15] Right. So if you know the system is going to go down, then technically it’s a failed experiment to begin with.

Aaron Rinehart: [00:08:21] Yeah, exactly. You just go ahead and fix that thing. Let me explain it this way also: it’s really about changing the mindset from the post-mortem being the after-the-fact exercise, where you think you know what happened and people spend three hours on it.

Sometimes they document it well, sometimes they don’t. Usually it’s for Sev1s, and then screw everything else, because we don’t have time before the next one: war room, to war room, to war room. Instead of doing all that after the fact, do it proactively, because during an incident people are kind of freaking out.

And after the fact, people kind of forget what happened. Proactively, your eyes are wide open. It’s all rainbows [00:09:00] and kittens. You’re not worried about an incident, because there isn’t one; you’re proactively injecting these conditions.

Hey, did it work? Yes or no? If not, why not? Then rerun the experiment. It allows you to uncover these hidden failures you didn’t know about without incurring the pain. In general, that’s how I like to explain it. And it’s the same thing as applied to security.

Ashish Rajan: [00:09:20] Interesting. And there’s a question that just came in that I’d like to take as well. You should probably see it on your screen: is fuzzing part of chaos engineering?

Aaron Rinehart: [00:09:30] There’s a lot of similarity between... you know, actually I’m glad you asked that question, because I get that question a lot.

I didn’t address that in the O’Reilly book, so I’m thinking. I finished the manuscript of the first security chaos engineering O’Reilly book. Kelly Shortridge is my coauthor. You guys

Ashish Rajan: [00:09:47] might have read the book by the way.

Aaron Rinehart: [00:09:49] She’s quite well known. Well, one, she’s an excellent speaker and mind in cybersecurity, but she also comments on RSA and sort of grills them every year.

but [00:10:00] she’s not,

Ashish Rajan: [00:10:04] you’re not painting a great picture for Kelly, by the way, just saying.

Aaron Rinehart: [00:10:06] Oh, Kelly is one of the brightest minds in cyber security.

Ashish Rajan: [00:10:10] Oh, they got you. You saved yourself then

Aaron Rinehart: [00:10:15] No. So, is fuzzing part of chaos engineering? No, not really. Fuzzing is something we do as part of the application security testing life cycle. There are some similarities between fuzzing, red teaming, purple teaming, and breach and attack simulation tools. But we see fuzzing as a form of testing, versus experimentation.

I’ll break down testing versus experimentation. It’s a loose definition, but testing is the verification or validation of something we already know to be true or false. What we’re really trying to do with fuzzing is ensure that what we think is right is actually right. What we’re trying to do with chaos engineering is instrument the system as a whole, post-deployment.

So, for example, let’s say you have a modern software [00:11:00] application that’s 10 microservices. You have payments, billing, Rx (I keep using healthcare because it comes easily), medical coding, whatever. You’ve got 10 of them, with probably 10 different teams.

Maybe they’re on the same sprint schedule, maybe they’re different. Maybe they’re releasing 10 times a week, 10 times a day, what have you. They’re all releasing their features and functions in code, and the application gets released.

With microservices, you never just have one of each; you usually have three or four of each. Sometimes you have three or four more because you’re running blue-green deployments, with older versions and newer versions, or you’re testing out a particular feature amongst the user base.

So you’re kind of rolling things out, and you magnify that by the rest of the services. You have 10 different groups of humans delivering at different cadences, sometimes the same, and you have all these microservices that you’re delivering. Sometimes you’ll have older versions of those microservices because some functions are needed by other services. Microservices are not [00:12:00] independent,

even though we like to think they are; they’re interdependent upon each other. So you have this massive ecosystem of services, humans, interactions and changes. It’s scale and complexity we’ve kind of never seen. This still applies to more legacy applications; I’ll address that later.

But fuzzing is something we address more for each individual microservice, or maybe the front end. I’m not saying fuzzing is not important. What I’m saying is that there are overlaps between the concepts, because you’re injecting unexpected variables into the system.

But what we’re trying to exercise is not the microservice itself. We’re trying to exercise the emergent properties of what safety and security should be as a system.

Ashish Rajan: [00:12:40] so same, same, same but not related.

Aaron Rinehart: [00:12:42] different, I guess this one.

Ashish Rajan: [00:12:43] that wasn’t to say. Yep.

Aaron Rinehart: [00:12:44] I’ll tell you what, for that question, I’ll make sure you get a copy of the Security Chaos Engineering book when it comes

Ashish Rajan: [00:12:49] out. Oh, there you go. All right. Yeah, I can coordinate that because I know Vineet as well. By the way, someone else, Charles House, has a question about the name of the book. Can you pop the book on the screen again?

Aaron Rinehart: [00:12:59] Oh, sure. [00:13:00] Yeah. Yep.

Ashish Rajan: [00:13:02] Charles, by the way, there’s a chance to win the book as well, if you’re interested. Go to the website link, verica.io/book.

Aaron Rinehart: [00:13:09] Yeah. V E R I C A.io/book.

Ashish Rajan: [00:13:15] Good luck, man. Otherwise reach out. Aaron is a good guy.

He loves giving out free books, so just ping me and I’ll ping him. I appreciate the support as well. I love the fact that you’re offering some. You guys heard it first: if you ask a good question, you get a free book. So keep asking questions, people.

Another concept that is related to chaos engineering is application fault injection, or application resiliency. I used to think of it as a very SRE kind of concept. Is it related to SRE, or is chaos engineering part of it? How does that all relate?

When I started looking into chaos engineering, it was more about how do I build [00:14:00] application resiliency, and it’s all about fault injection. You kind of explained it earlier when you were differentiating between fuzzing and chaos engineering, but is there an example of application fault injection from a cloud perspective?

I’ve got a lot of people here who primarily work in the public cloud space, and I know you’ve done some work in the AWS space with Netflix and other places. Are there one or two simple examples that you can share for application fault injection?

And how can I use that as a chaos experiment to build application resiliency? I know that’s a long question, but essentially I’m just after an example of a chaos experiment in a public cloud.

Aaron Rinehart: [00:14:45] Well, given the familiarity with chaos engineering that people reported on the survey, it’s always important to explain.

Maybe I’ll explain a couple of concepts around chaos engineering that most people never talk about.

Ashish Rajan: [00:14:57] Yeah. Yeah. Yeah. That’d be awesome. Yeah.

Aaron Rinehart: [00:14:59] I have the [00:15:00] luxury of knowing the beginning of the story more clearly than most people do, because of knowing Casey,

Ashish Rajan: [00:15:06] that’s why I have you here, man: go as close to the source as possible.

Aaron Rinehart: [00:15:11] Well, I’ve gotten to know Casey, Bruce Wong, and a number of other people in the space that we be

Ashish Rajan: [00:15:16] next on my list

Aaron Rinehart: [00:15:18] And Casey really made chaos engineering a big thing beyond just Netflix. It all started back with Chaos Monkey. Remember, Netflix in 2008, 2009 had decided: you know what, we’re going to be bold as a company.

We’re going to change the way we operate, and we’re going to define the future by streaming, moving from DVDs to streaming. This is also when Reed Hastings released the memo: hire only senior engineers, hire the best people, no brilliant jerks, all that stuff. That memo changed the world.

As part of that, they went forth with this ambitious engineering strategy to stream these massive movies over the internet. And as we all know, [00:16:00] the number one fallacy of distributed computing is assuming that the network is reliable.

So that’s going to be hard to do if the network’s not reliable. They started building out the streaming services on Amazon Web Services. And what was happening was... oh, let me get back to answer the first question, about SRE. Chaos engineering was originally designed as a tool set for SREs.

It was not designed to be a practice or a job. It was designed to be a series of tools. OK, so back to the Chaos Monkey story. They went to build out the streaming services on Amazon Web Services. Remember, this was during Netflix’s cloud transformation.

A lot of people will tell me, “Aaron, we can’t do chaos engineering. We can barely do the DevOps stuff.” Well, Netflix started doing chaos engineering during their transformation. I’m going to explain that with the story. As they were building out these services, what was happening was that AMIs were just disappearing.

[00:17:00] Yup. It was just a feature of Amazon Web Services at that point in time. But they were just disappearing. So what,

Ashish Rajan: [00:17:08] Was it a feature? How long ago are we talking?

Aaron Rinehart: [00:17:11] I’m just joking. It wasn’t a feature. I was going to say

Ashish Rajan: [00:17:14] like, wow, it’s like great feature, Amazon.

Aaron Rinehart: [00:17:17] Nobody knows why.

Actually, I have a friend of mine who’s a Kiwi. He always calls them “Amys.” So AMIs were just disappearing. Anyway, that would be scary.

Ashish Rajan: [00:17:27] That would be really scary. Imagine like production systems and suddenly I can’t find the AMI anymore. Like what, what is like yeah,

Aaron Rinehart: [00:17:35] Well, so what they did was they said, OK, we’re going to design our system to be resilient to this kind of problem.

So they went forth and built their system to be resilient to that particular problem. Now they needed a way to test it. They sort of went to Amazon and said, “Hey, will you provide us a way to do this?” And Amazon said no, but you can do it all day yourself.

So they built a tool called Chaos Monkey that, during business hours, [00:18:00] would randomly take down an AMI. The point of it was not to cause chaos. The point was to put a well-defined problem in front of an engineer: during the day, during business hours, when my system is supposed to be running and delivering value to the customer, this thing could happen to me.

It turns out that if you put well-defined problems in front of engineers, they solve them. And that’s really what it’s about: providing context to engineers that they previously did not have, so they can change their behavior and improve the way the system operates proactively.

Proactively is the key thing. It’s not a reactive, mean-time-to-detect, mean-time-to-repair kind of thing.
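Editor’s sketch: the selection rule Aaron describes for Chaos Monkey (pick something at random, but only during business hours, when engineers are around to respond) can be written in a few lines of Python. This is not Netflix’s code. The function names, the 9-to-17 weekday window, and the instance IDs are all assumptions made for illustration, and the actual termination call to a cloud provider is deliberately left out.

```python
import random
from datetime import datetime
from typing import Optional, Sequence


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Weekdays only, from start_hour (inclusive) to end_hour (exclusive).

    The 9-17 window is an assumed default, not a documented Netflix setting.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: Sequence[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Return one instance id to take down, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


# Hypothetical fleet; the ids are placeholders.
fleet = ["i-aaa", "i-bbb", "i-ccc"]
monday_morning = datetime(2020, 7, 27, 10, 30)
print(pick_victim(fleet, monday_morning, random.Random()))
```

Restricting the selection to business hours is what turns a random failure into the well-defined problem Aaron mentions: the loss happens while the system is meant to be delivering value and while people are available to observe and learn from it.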

Ashish Rajan: [00:18:39] Dude, it’s such a mindset thing though. Wait, are we doing this in production? Where are we doing this? I imagine going to my boss, the CEO or CIO, and saying, “I’m going to test chaos engineering in production.”

How would that conversation go? [00:19:00] I can’t imagine people saying yes to that easily. Does it have to be in production? Or does it require a mindset change overall?

Aaron Rinehart: [00:19:07] No. And there is a mindset change. The cloud was a mindset change. DevOps was a mindset change. SRE is a mindset change.

Chaos engineering is a mindset change. It’s really about continuously verifying that the system works the way you think it does. There’s a vast difference between how you think the system works and how it works in reality. I’m going to get to the dev versus prod question.

When we design systems... systems engineering is a very, very messy exercise. Real work is messy, guys. We like to think that there’s this beautiful 3D diagram. As an architect, I used to love this: I used to have a solutions architect and a database architect come to me with different diagrams of the same system. Our mental model of how we believe the system works is [00:20:00] vastly different from how it works

in reality. Even if the system works like that diagram on day zero, after day one we quickly move into a series of unforeseen events. There’s an outage on the payments API. Somebody hard-codes a token. Google hires your best engineer, Atlassian hires your best engineer, and new engineers come in

who don’t know the code as well. People make mistakes. People make changes. There are outages, there are scanning results that the staff have to fix, and slowly people lose their context of what the system really is. So if we’re constantly making changes to a system we don’t really understand, bad things are bound to happen.

So what we’re trying to do is understand that. That comes back to this whole prod thing. Let me put it quite clearly, with exclamation marks: chaos engineering is not breaking stuff in production. It never has been, and it never will be.

It’s about fixing things, and it’s about building a culture of continuous learning, not continuous [00:21:00] fixing. You can never do chaos engineering in production and still get tremendous value. There is this example of one of the large retail companies that did their first chaos engineering

exercise, or what they called a game day, which is manual chaos engineering. They brought Casey in to do a tech talk (he was still at Netflix then), and when he went there they said, “You know, we’re not going to do chaos in production. We’re going to do it on dev. What we’re going to do is bring down a Kafka node, and we assume another one’s going to come back up, there’ll be a rebalance, and all the traffic and messages will be just fine.”

So they went forth and scheduled it. Casey was there, and they actually brought down one of the nodes on the dev environment. Can anyone guess what happened?

Ashish Rajan: [00:21:48] No, but I wonder if it never came up.

Aaron Rinehart: [00:21:53] Right. So what happened is they forgot to change the pointer information. [00:22:00] You know, systems engineering work is messy. It’s not that they intentionally did that. But what was great about it was that even though they did it in dev, they treated it like a real exercise, and all the people needed to make the changes to get it back up and running were there.

And it was back up very quickly. Even in dev, they learned a lot about prod. There’s a maturity to it. Chaos engineering, especially with the tools and techniques, always starts in lower environments. Make sure you understand the tool, make sure you’re competent in the technique.

Make sure everybody understands what’s going on. We never hide chaos engineering; we do it in the open. We’re transparent. Everyone should learn from this. You should not be trying to be sneaky with it.

Ashish Rajan: [00:22:49] That’s a good way to put it. I think you already answered the question from David Raviv.

He’s a good guy based in New York. He was asking whether there are stories of epic fails [00:23:00] after testing. I think you’ve kind of answered that question already. He’s got another question: how does chaos engineering affect readiness in terms of incident response?

Aaron Rinehart: [00:23:08] Oh, God damn. It feels like you’re reading my mind.

Who ever said that?

Ashish Rajan: [00:23:11] There’s David Raviv by the way, great guy

Aaron Rinehart: [00:23:13] guy? I can’t, I can’t read it at the bottom, like anyway. All right, right.

Ashish Rajan: [00:23:18] But did you get the question? Yeah. Cool.

Aaron Rinehart: [00:23:21] So I got it. I got it. So actually, this is one of my favorite use cases. I actually started off, with chaos engineering, for security, in terms of security, visual validation.

So as I’m gonna answer the question, so as an architect, I was always concerned about my recommendations, whether they actually got implemented, whether they’re correct. So I wouldn’t do my job. Well, I believe in what I did I believe, and I want to help people. Right. But I was never sure if my recommendations ever actually made it there.

Right. And so I need a way to, I need a way to kind of ask the computer quite skip ahead of all those people and ask an objective question to the machines. Right. And so I was really kind of about security control validation, but the second case, I found a lot of value [00:24:00] in because whatever we’re demoing, this critic control validation piece of counseling or to my boss, , he’s like Aaron, you know, he could have picked my interest.

He’s like Aaron, like, you know, I love the way this helps us. Validate our controls proactively, but like, man, it’s really a great tool for keeping the incident response team sharp. So I started thinking about, you know, how does this really kind of work in a response? So really instant response in general, the problem with this response is its response, right?

Is that you’re kind of like. You know, no matter how, security side, no matter how much money you spend, how much time you it’s been preparing, how many people you have, you still don’t know a lot of things. You don’t know when it’s going to happen, why it’s going to happen. Who’s going, who’s trying to get it, why the trying to get in and how, how they’re, how they’re doing it.

Right. And all that preparation. You’re not sure whether the controls actually fire the right situations that you need them to. So what we did was so, but chaos engineering, we’re not kind of waiting for an event to happen. We’re not like hoping that when set event happens, things actually all fall into place.

Right. I’m not doing it. Right. But, and also we’re not [00:25:00] assuming that when we actually caught the event was the beginning of it because when you detected it, it’s probably not the beginning of the event. And we always love to, I think of things in who cars, right? I mean, there’s one event that cause, cause that image, right, it’s almost, it’s almost always a multiple, we would have different things and processes and people and things in as a process of it.

But it’s always simpler and easier after the fact to point the finger at one person or thing. What’s great about the incident response use case with chaos engineering is that we get to start the event. We’re the initiator of that action, and we can do it whenever we want to do it.

We’re in control of it. So we proactively introduce this condition, and now we know the conditions by which it started. And remember, in this process there is no real incident, right? People are not freaking out. We tell people we’re doing this so we can learn.

Did we have enough people on call? Were they the right people? Were the runbooks actually correct for that type of event? Did the log data actually make sense to a human, right? Like, when an incident [00:26:00] has happened, you’re trying to figure out what the hell happened through all this different log data and events, and some of it’s like, what the hell does this mean?

I don’t know. Skip it, throw it away. Right? You’re trying to figure out what the timeline is. But proactively, we can introduce this condition and we can say, hey, did the technology do what it was supposed to do? Did we have,

Ashish Rajan: [00:26:21] Would this be like a tabletop exercise? I mean, would chaos engineering be part of the tabletop exercise?

Say you have your CEO, CIO, everyone sitting down trying to do a dry run of a runbook for an incident. You could actually have, I guess, a dashboard with all the chaos engineering experiments you’ve already done. That’s what I imagine in my head, where the entire system has a few experiments going on.

And if any of those fails in a tabletop scenario, you have something to face. But otherwise, all the known scenarios should not really happen, and you shouldn’t need to go down a root cause path for something that you already have an experiment [00:27:00] for.

Aaron Rinehart: [00:27:01] Exactly. Exactly. I’ll get to that more in a second, but yeah. I’m a big fan of doing your initial chaos experiment, the first one you come up with, as a manual game day exercise.

It’s only a few hours. You get people from different parts of the business in the same room, and we all learn about how the system really works. You usually pick things where you expect that, if this happened, we’ve got it covered. Like a misconfigured firewall rule.

A misconfigured firewall, we would totally catch that and block that, no problem, right? And then everyone gets to see that you were wrong, right? No matter whether you’re the firewall engineer, you’re the CISO, or you’re the network engineer that runs the switches and the load balancers.

That person, or the help desk person, everyone gets to see, oh crap, we were wrong, together. It builds this kind of camaraderie across functions, with people you’ve never met before. [00:28:00] But then there’s the case where you were right: the chaos experiment is successful, and you were right about how the system actually works.

Now you can start thinking, and this is where more sophistication comes in with the advanced tooling, that it becomes more of a regression test, right? To ensure that those KPIs still actually remain true. So, yeah.
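As a rough sketch of that step from a manual game day to a regression test, the same experiment can be wrapped in a function that is re-run on a schedule or in a pipeline. The Python below is only an illustration under stated assumptions: the `system` dictionary, the `monitor` control, and the `Experiment` class are hypothetical stand-ins invented for this example, not part of any tool mentioned in the conversation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

System = Dict[str, Any]  # hypothetical toy model of an environment


@dataclass
class Experiment:
    name: str
    inject: Callable[[System], None]      # introduce the failure condition
    hypothesis: Callable[[System], bool]  # True if the system behaved as expected
    rollback: Callable[[System], None]    # always restore the original state


def monitor(system: System) -> None:
    """Toy detect-and-block control: close and alert on unapproved ports."""
    unapproved = sorted(system["open_ports"] - system["allowed_ports"])
    for port in unapproved:
        system["alerts"].append(f"unauthorized port {port} closed")
        system["open_ports"].discard(port)


def run(experiment: Experiment, system: System,
        control: Callable[[System], None]) -> bool:
    """Inject the condition, let the control react, then check the hypothesis."""
    experiment.inject(system)
    try:
        control(system)
        return experiment.hypothesis(system)
    finally:
        experiment.rollback(system)


# Hypothetical experiment: open a port that should never be open.
open_telnet = Experiment(
    name="misconfigured port 23",
    inject=lambda s: s["open_ports"].add(23),
    hypothesis=lambda s: 23 not in s["open_ports"] and len(s["alerts"]) > 0,
    rollback=lambda s: s["open_ports"].discard(23),
)


def new_system() -> System:
    return {"open_ports": {443}, "allowed_ports": {443}, "alerts": []}
```

Calling `run(open_telnet, new_system(), monitor)` on every build would flag the day the control stops catching the change, which is the regression-test role described above. In a real environment the toy control would be replaced by the actual firewall or configuration tooling.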

Ashish Rajan: [00:28:22] And just to confirm, so David Raviv does get a copy of the book as well.

A free copy of the book goes to the guy who asked the question.

Aaron Rinehart: [00:28:30] If you guys send that, I can’t see it. So make sure you send your messages to Ashish, and I’ll make sure to get your contact info too, so I can,

Ashish Rajan: [00:28:36] Yep. I’ve got all these people in my contacts, so I’ll definitely do that. The questions are flowing in because you’ve been so generous with your book offer.

I’ll just move to the next question. It’s from Charles: what impact does it have on automation? Do you have examples of how chaos engineering helps improve security?

Aaron Rinehart: [00:28:53] Oh my gosh. So I’ll treat those questions as two different things: how does it impact automation, [00:29:00] and how does it improve security?

I’m gonna address that, maybe what’s that?

Ashish Rajan: [00:29:04] Yeah. Yeah,

Aaron Rinehart: [00:29:05] That sounds good. So, automation. Back in 1983, a woman, a software engineer actually, wrote a three-page paper called “Ironies of Automation.” It has never been disproven, right? The point is that there are certain ironies with automation. A lot of people think you need fewer people, but you actually need more.

Once you write the automation, it has to be maintained. And in order to write more automation, you need more people, right? On top of that, when you replace a skilled function with automation, a skilled series of steps with automation, now you have an unskilled worker monitoring something that a skilled worker used to be able to do.

And if the monitoring ends up being wrong, that unskilled person can no longer intervene the way a skilled person could have, right? So over time, if that thing continues to be red or yellow, they just say, oh, that thing has always been yellow. I don’t know why. Right?

And what happens is you need monitoring for the monitoring and alerting for the alerting. So it’s a great paper and you should actually read it. It’s like three pages. There’s [00:30:00] no excuse not to. The point of it is that automation requires maintenance. It’s just code, there’s complexity in it, and it changes all the time.

What we’re doing with chaos engineering is proactively introducing the conditions under which the automation should be successful and asking it a question: hey, are you still working the way you’re supposed to? Because it’s not just automation in isolation on, like, one service or one AMI or one thing. It’s the system as a whole.

The system has emergent properties as a whole, in how it operates, right? And that’s where we fall down: when all these different things in a complex system start interacting, you get nonlinear outcomes, right? Like, one plus one no longer equals two.

It equals, like, negative three or negative 4,000, right? It’s because of the ripple effect throughout the system. So you need to constantly verify, continuously verify, that the system works the way it’s [00:31:00] supposed to. Now, the security side. Everything I’ve said up to this point on this podcast directly applies to security, right?

Having been an engineer most of my career, I never got that idea that there’s a system and then there’s its security. The system is either secure or it’s not, right? Security is part of the system. So security suffers all the same holistic problems of building things, in that we’re building at such size, scale, speed, and complexity in today’s world.

It’s so easy to make a mistake, right? And by mistakes, I mean accidents, like a permissive account or leaving a port open that didn’t need to be open. Because when you’re building things, come on, guys, building things is a process like this: hmm, that didn’t work. That didn’t work.

That kind of works. Hey, it kind of works. It’s a combination of that. We’re trying to figure stuff [00:32:00] out. We’re trying to build, and in the process you’re not sure what to lock down, what permissions you need, because you haven’t created the objects yet. So what happens is that we’re introducing these changes into this overall larger ecosystem.

And because you didn’t understand how it all interacted holistically (you understand your stuff, but you might not understand how the rest of the things work), and with security trying to fit into that environment, it’s easy to make mistakes. And if you look at the majority of the breach data,

It’s the simplest things that are causing these problems. So what we’re trying to do is inject these low-hanging fruit into the system. Because if you look at the majority of malicious code, you can go on the virus websites, and they always have breakdowns of the steps of the code.

Every one usually requires some kind of really stupid thing to exist, right? There’s some advanced stuff, okay, don’t get me wrong. But the majority of it is crap code. If you’re a software engineer, you’re like, oh my God, this is horrible. [00:33:00] You look at it, and it requires some kind of permissive account or port or deprecated version of some dependency or software.

So what we’re trying to do with security chaos engineering is inject those conditions into the system to ensure we can catch them faster than an adversary can exploit them.

Ashish Rajan: [00:33:18] Right. And I think I understand how it would affect security, and not just security, I guess. It brings us back to the point of application resiliency as well.

Which is kind of security, because availability is one of those pillars of security you have to maintain as well. It’s kind of pointless to have a system which is not available 24/7, or whatever the SLA for it is. So I guess Charles gets the book. This feels like an Oprah moment.

It’s just like, you get a book, you get a book, you get a book. You’re getting a book, Charles. I’ll reach out to you as well, man. I’ve got another question, from me: are there any particular tools that you use for chaos engineering, and are they open source?

Aaron Rinehart: [00:33:55] Oh, all kinds

Ashish Rajan: [00:33:56] then drop it in there as well.

Aaron Rinehart: [00:33:58] Ah, the drug care.

[00:34:00] So, ChaoSlingr. I left UnitedHealth Group about two years ago now, right? We wrote ChaoSlingr about four years ago. If you go to github.com, it’s like ChaoSlingr slash Optum. Well, anyway, type in ChaoSlingr, C-H-A-O-S-L-I-N-G-R, and that’s the repo on GitHub. What you’ll find about ChaoSlingr is that it’s a horrible tool.

You can implement it and run it and run the base experiment. But I left UnitedHealth Group, and I was the sponsor of that, so it’s no longer really maintained as an open source project. It’s still out there. It still represents a framework for how you can write experiments. There are four different functions you need to write.

It’s all serverless, right? There is Generatr, there is Slingr, there is Trackr, and then there’s the documentation of the experiment. Generatr does target acquisition based upon AWS resource tags, because there’s an opt-in, opt-out kind of function with it.

Right. And so [00:35:00] you define the security groups in which you want to inject the misconfigured port. Trackr tracks all the changes and reports them to Slack. And Slingr actually executes the opening or closing of a port.

PortSlingr was just the first example. We actually did several experiments. UnitedHealth Group now uses it more as an internal tool, and they linked it into their CI/CD pipelines, I believe, last time I checked. But in terms of chaos engineering tools in general, there’s some great stuff being done.
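The three roles described above (Generatr picks a target, Slingr makes the change, Trackr reports it) can be sketched in a few lines. The Python below is not ChaoSlingr’s code and makes no AWS or Slack calls: security groups are plain dictionaries, and the tag name `chaos-opt-in`, the candidate port list, and the notification callback are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, Dict, List, Optional

OPT_IN_TAG = "chaos-opt-in"        # hypothetical tag name
CANDIDATE_PORTS = [21, 23, 3389]   # hypothetical ports to toggle


def generatr(groups: List[Dict], rng: random.Random) -> Optional[Dict]:
    """Target acquisition: pick one opted-in security group and a port change."""
    targets = [g for g in groups if g["tags"].get(OPT_IN_TAG) == "true"]
    if not targets:
        return None
    group = rng.choice(targets)
    port = rng.choice(CANDIDATE_PORTS)
    action = "close" if port in group["open_ports"] else "open"
    return {"group_id": group["id"], "port": port, "action": action}


def slingr(groups: List[Dict], change: Dict) -> None:
    """Execute the change: open or close the chosen port on the chosen group."""
    for group in groups:
        if group["id"] == change["group_id"]:
            if change["action"] == "open":
                group["open_ports"].add(change["port"])
            else:
                group["open_ports"].discard(change["port"])


def trackr(change: Dict, notify: Callable[[str], None]) -> None:
    """Report the change so responders can tell an experiment from an incident."""
    notify(
        f"chaos experiment: {change['action']} port {change['port']} "
        f"on {change['group_id']}"
    )


# Toy inventory: only sg-web has opted in, so sg-db is never touched.
groups = [
    {"id": "sg-web", "tags": {OPT_IN_TAG: "true"}, "open_ports": {443}},
    {"id": "sg-db", "tags": {}, "open_ports": {5432}},
]
messages: List[str] = []
change = generatr(groups, random.Random(0))
if change is not None:
    slingr(groups, change)
    trackr(change, messages.append)
```

Keeping selection, execution, and reporting as separate functions mirrors the serverless split described in the interview, and the opt-in tag is what keeps the experiment away from anything that has not agreed to take part.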

Obviously there’s Verica. Of course, it’s more of a commercial tool set. We may open source something at some point. But we’re trying to evolve chaos engineering; the way we do chaos engineering and the way everyone else does it are completely different. We’re trying to make it more extensible and easier for people to utilize.

But also, we don’t make up failure modes, right? The kinds of failure modes you see from us are actually recreated from the real world. [00:36:00] We don’t just kill VMs and pods and things like that. We have actually documented, from real-world companies, the weird things that happen to cause failure.

And we programmatically structured that into a product. It’s really interesting, the kind of stuff that we can learn. So if you’re interested, reach out. But in terms of open source projects, I kind of like the Chaos Toolkit. They’ve been very successful. Russ Miles and Sylvain are good friends of ours, right?

It’s a great place to learn how to do chaos experiments, and it’s in Python, so that also makes it easy. I believe there’s an agent involved. A lot of the chaos engineering tools have agents. We don’t, but you have to be concerned with that. But, you know, I like the Chaos Toolkit. As for others,

there’s one for Kubernetes. Sorry, shoot, what am I thinking of? It’s written by Bloomberg. PowerfulSeal. PowerfulSeal.

Ashish Rajan: [00:36:47] That’s fine. The name is right there. And it’s a good segue into my next question: where do you see the trend going? I think this would answer the next question as well, which David has [00:37:00] asked, in terms of when this would go mainstream.

Like, how long before this becomes mainstream? So I’m keen to know from your side, what are you seeing as a trend, from where it was to where it is going, with serverless and Kubernetes and containers now taking over the world? Where do you see that going?

Aaron Rinehart: [00:37:17] So, chaos engineering.

I see chaos engineering as a practice in general. I think over the next five years, it’s going to be more of a, well, I think right now we’re tracking about 1,300 to 1,400 companies experimenting with or utilizing chaos engineering. It used to be a thing only Silicon Valley was doing, right? Or those types of companies.

It’s evolved, from banks to healthcare. I mean, the largest health companies in the world are doing it, the largest banks, some of the largest retail companies. A lot of people have written their own tools in house. I’ve seen some amazing in-house chaos engineering tools stitched together from the open source stuff and blogs. And what’s great is, [00:38:00] what we were missing was a body of knowledge.

Now there’s a body of knowledge: all the different companies and how they do chaos engineering are in here, what the maturity life cycle looks like. So, great book. And there’s the security chaos engineering one also. I only have a mock-up of it, but that will have how all the companies are doing security chaos engineering, which is a little bit different.

Like I said, from a use case perspective. But you’d be surprised at the companies you see there. They’re quite large, and what you would think of as more legacy oriented, but they really had to think differently. And a lot of times, everyone I’ve talked to who does security chaos engineering comes to me and says:

Oh, we actually tried it, and we found out X, Y, Z didn’t really work. We found this out proactively, and we were able to do something about it. And when you’re trying to make that case to management, it’s like, do you want to learn by accidentally finding out something was messed up, or do you want to learn proactively, when you have the ability to do something about it?

[00:39:00] Right? And that’s really what’s powerful about it.

Ashish Rajan: [00:39:03] Yeah. And I’m quickly going to ask, because a lot of people would be listening to this, especially the guys who have been asking questions: what is the simplest experiment they can start with? Does it have to be a really complicated one?

Or is it just a matter of whether my SSH ports are open? Is that a good test? Because it sounds like it needs to be a massive test where production goes down, but does it need to be that complicated?

Aaron Rinehart: [00:39:27] No, it doesn’t. I mean, in all honesty, look at Chaos Monkey. How long did Chaos Monkey have just that simple AMI termination experiment?

They got so much value from one experiment. It’s like martial arts, right? You learn one technique, you get really good at it, and then you’re dynamite. It’s kind of the same with chaos engineering. You could go with one experiment. That condition may be true in one instance or one environment, but then we need to do it for another one.

Conditions are different. If you get good at one [00:40:00] experiment and keep constantly doing it, you’re still going to learn a lot, even just with the PortSlingr experiment of introducing a misconfigured port. I can go through an example of how that works, if you want me to.

Ashish Rajan: [00:40:11] I think it’ll be awesome. And there’s another question that came in from DARPin, which I thought is probably going to hit the bullseye. The question is: after attaining what maturity level in the cloud journey or security posture do you suggest is the appropriate time to start doing these exercises?

I think that’s a great place to start. And then we kind of go into the example. What do you think?

Aaron Rinehart: [00:40:34] Sure. Yes, let’s definitely do that. So almost everyone I know who does chaos engineering or has done it, and there are a couple of exceptions, is on some kind of cloud journey, right?

Ashish Rajan: [00:40:47] Right. So they’re not advanced,

Aaron Rinehart: [00:40:48] right?

Yeah. They’re not even the companies you’d think. One thing I’ve learned, and this is just me being me, over the last year and a half of meeting all these different companies, and I’ve been to about 100, 120 [00:41:00] companies, is that the companies you think are transformed are not.

The companies you didn’t think were transformed are actually more transformed than some of the companies you’d expect. So it’s quite interesting. What often happens with a cloud transformation is that, most of the time, executives have unrealistic timelines, right? They always blame the fact that they don’t have the right people, right?

The people that they have don’t have the right skills. They need to change and need to evolve, right? So they bring in Amazon, and I’m not picking on Amazon, but they bring in some cloud provider and their professional services to do it. But is the company really learning? Or are you just hoping that they’ll pick it up in the meantime?

Right. I feel like it’s a transformational exercise, right? Meaning that chaos engineering helps as you’re building things, or, let’s say, refactoring what was a lift-and-shift into an actual cloud native kind of application.

You need a feedback mechanism. It’s fundamental in all engineering and science. You need instrumentation and testing to know whether [00:42:00] something works or doesn’t. Engineers don’t believe in luck or hope, right? It either works or it doesn’t. So we need to have a way to tell us, hey, it’s not working, or it’s working.

Right. So what we’re doing in chaos engineering is injecting a condition to test whether what we expect to be working actually is. I’m trying to think of the point in the evolution of an app where people do this. It’s different for everyone, but when you’ve got something running in stage that kind of functions, maybe it’s a good opportunity to start running just some simple scripts against it.

Right. That’s really good. You can just, you can actually do most casts engineering experiment. It’s due with a bash script, right. Kill a service. Right. Bring that to VM breakout pod. Right. Like see, see how the system response. Right. I see if all the, all the other things that are supposed to occur, occurred.

So let me give you that example, right? PortSlingr was the primary example for ChaoSlingr because I needed an example that I could explain to a software [00:43:00] engineer, a network engineer, any kind of engineer. No matter what you do, everybody kind of knows what a firewall is.

Right? We were very new to AWS at the time, in our cloud transformation. And when we introduced a misconfigured port, an open or closed port that wasn’t supposed to be open or closed, we expected the firewall to immediately detect it and block it, and for it to be a non-issue, right? So we started doing this.

Remember, a misconfigured or unauthorized port change could happen for all kinds of non-malicious reasons. A lot of software engineers don’t understand network flow. It could be that you just filled the ticket out wrong.

It could be that the change was implemented wrong. It could be lots of different things. It could be a complete mistake or accident, an unintentional change like that. And what we expected, like I said, is that that kind of thing is so obvious for security people: detect it, block it, non-issue.

Well, it only worked about 60% of the time when we started running it. And what the [00:44:00] problem was, there was drift between how we were configuring things in our non-commercial and our commercial environments. And so we were able to proactively fix that. Because remember, there was no incident. Nobody was freaking out.

The second thing we learned was that the cloud native configuration management tool, we’ll call it, caught it and blocked it every time. So the thing we were barely paying for caught and blocked the change every time. That was the second thing. The third thing was, we expected the log data to come from both tools and to correlate into some kind of event telling our SOC, our security operations center, hey, this kind of weird thing happened.

We didn’t have a SIEM; we had our own log tool. But either way, that part worked. It correlated the events into an alert, and the alert went to the SOC. But they’re like, which AWS account is this? Because we were very new to AWS, still figuring things out. And what they found out was they couldn’t figure out whether it was non-commercial or commercial. Now, as an engineer, you’re like, come on, you can just map back the IP address.

Well, yeah, kind of, right? That could take a few minutes. It can take 10, 15, 20 minutes, [00:45:00] maybe. If this were a real outage, that could be millions of dollars, right? On top of that, for a production system you probably have SNAT enabled, and SNAT will actually hide the real IP address.

So you could be stuck sitting around for an hour to three hours trying to figure out which actual instance that was, right? And meanwhile, the system is down. But guess what? It wasn’t down, right? Nobody was freaking out. We were able to learn these things, and how things really worked, proactively.

And this was during our cloud transformation. And I’m not sure there’s ever really an end to that, you know? There’s never really an end to the DevOps transformation or any of it. It kind of just,

Ashish Rajan: [00:45:45] Is that what triggers it, though? Is it cloud transformation that triggers these conversations, or is that just the right time for it to get triggered?

Aaron Rinehart: [00:45:51] It’s the most common beginning point. Chaos engineering in most companies starts with people trying to verify that the system works the way they think it does, [00:46:00] because they’re concerned. They’re concerned about that story I told you at the beginning, where there’s this legacy system, and it worked in our data center.

We moved it all over the place. We were told all these great things, and that we’d deliver value quicker to the customer. And we’re unsure now, because the engineers who built that are not the ones doing this. But now we have a way. We still do great software engineering, testing and unit testing, smoke testing. All of that still applies.

I’m not saying we stop any of that. What we’re doing, and Casey and I like to call it this, is that once you’ve achieved CI and CD, there’s really this need for a continuous verification mechanism. And that’s what chaos engineering is. That’s how we explain it to executives; somebody was asking about that earlier. That’s more of an adult conversation, versus cartoons, monkeys, and other characters.

It’s really about continuously verifying that the system is operationally ready the way we think it is. And that’s a better way to structure the conversation.

Ashish Rajan: [00:46:56] Interesting. It’s a great example. I’m just [00:47:00] conscious of the time as well; we probably have 10 more minutes. There’s a concept around chaos engineering as a service, and whether there is some kind of maturity scale that’s required.

I’m hinting towards the maturity model in your book, which I’ve heard of. But is there a maturity model, or is it chaos engineering as a service, which is kind of what Verica does, I imagine?

Aaron Rinehart: [00:47:21] So Verica is not actually a SaaS service. It’s on-prem software. Because a lot of times SaaS services will require an agent to be on a thing, and if the SaaS is compromised, then you become compromised, potentially.

We’re trying to help all companies be able to do this. So when I say it’s on-prem software, I mean it can be run in the cloud; it’s just run inside of your environment. That’s why we do it that way. And we’re able to do certain experiments because we write directly to software.

We don’t require some sort of agent to do it. Right, right.

Ashish Rajan: [00:47:56] So, the maturity level. How does one [00:48:00] find a maturity level? Does one exist, like a matrix for chaos engineering?

Aaron Rinehart: [00:48:05] Yeah, there is. I forget what chapter it is in the book, but there’s a whole chapter in the book on it.

I forget who wrote it. It could be Nora Jones who wrote it. But there’s a chapter on the chaos engineering maturity life cycle. A lot of it is starting in non-prod, testing out and verifying the open source tool or the script you wrote. If you’re finding open source tools don’t do it for you, you know, write your own.

But just make sure you open source it so the world can benefit, you know? That’s what we struggle with as a community. A lot of the tools that have been written inside of companies don’t get released, because they can’t release them. So there are actually some great tools out there that just,

you know.

Ashish Rajan: [00:48:48] I agree. Another great question came in from David, probably the last question for this one: what are the elements of building a business case for chaos engineering to get support from business stakeholders?

Aaron Rinehart: [00:48:59] So it [00:49:00] kind of all comes down to this, and the security use case and the availability use case are both similar in theme.

Are we going to constantly mature our applications through unforeseen failure and poor business outcomes? Or are we going to be proactive about trying to understand that and fixing it?

The same thing goes for security. The average cost of security per cloud workload is somewhere between 25 and 40%, depending on what level of regulation you’re subject to. So you’re spending 25 to 40% of your budget for an application running in the cloud on security.

Wouldn’t it be beneficial to know how much of that actually works? And you can’t really put a dollar figure on the confidence you get through chaos. If there’s a breach that comes out for a certain type of attack, you know with confidence, because you run this experiment 10 times a day.

I know for a fact what would occur; we have [00:50:00] mechanisms in place. So you can focus your attention elsewhere, on the things you may not be thinking about, right? New experiments. It’s the only way to proactively identify these types of things before they manifest into catastrophic problems.

And it’s hard to put a dollar figure on the breach that never happened, the outage that never happened. But man, people are getting tired of hearing about breaches. One of the other things that we do, and what Netflix does with chaos engineering, is we try to tie the success of experiments to business KPIs, right?

That, you can explain. This is where we’re trying to point the craft in general: you’re not trying to just build technology, you’re trying to build business value, right? So when ChAP runs, when Netflix’s ChAP runs, what it monitors is not the technology. It monitors stream starts per second.

Can people play a [00:51:00] movie, right? Sometimes it’s checkout carts converted. Sometimes it’s, I don’t know, sign-ups per hour or per minute. You monitor that KPI. If it ever deviates, you stop the experiment, because now you know a customer is affected. Then you report the data from that time, under that condition, back to the service owner, and they get to investigate what happened, right?
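The rule described here, watch a business KPI during the experiment and stop the moment it deviates, reduces to a small guard loop. The Python below is a minimal sketch of that idea only; it is not how Netflix’s platform is implemented, and the function name, the relative-deviation rule, and the sample numbers are all assumptions chosen for illustration.

```python
from typing import Iterable, Tuple


def run_with_kpi_guard(
    kpi_samples: Iterable[float], baseline: float, tolerance: float
) -> Tuple[str, int]:
    """Consume KPI samples taken during an experiment; abort on deviation.

    Returns ("aborted", index_of_offending_sample) or ("completed", samples_seen).
    """
    seen = 0
    for index, value in enumerate(kpi_samples):
        # Relative distance of this sample from the known-good baseline.
        deviation = abs(value - baseline) / baseline
        if deviation > tolerance:
            return ("aborted", index)
        seen = index + 1
    return ("completed", seen)


# Hypothetical numbers: a baseline of 100 units per second and a 10% tolerance.
steady = run_with_kpi_guard([101, 98, 95], baseline=100, tolerance=0.10)
degraded = run_with_kpi_guard([101, 98, 85, 99], baseline=100, tolerance=0.10)
```

In the second call the third sample sits 15% below baseline, so the guard stops there instead of reading the rest. The index it returns is the kind of detail that would go back to the service owner for investigation.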

If that makes sense. What I’m saying is, now you get to explain how you improved the product, business value, customer experience. That’s where you want to go. But there’s the fundamental premise you start with for the business case and ROI. By the way, there’s a chapter on that in here, on putting it together.

You start with the exercise of being proactive and learning about the system proactively, instead of through retroactive failure and customer pain.

Ashish Rajan: [00:51:43] And the value that you’re putting across is application resilience. So the use case is not so much that you get better security.

The use case is more that your customers can have a much more stable application presented to them. To your point, a lot of people may go down the [00:52:00] path of going, oh, this is the most secure because you’ve done experiments. But actually it’s the other way around, where you’re building application resiliency.

Do I have a highly available system or highly available service for your customers? I think that’s how, probably better way to put it.

Aaron Rinehart: [00:52:15] Yeah, that’s the direction. Yeah, exactly.

Ashish Rajan: [00:52:19] Sweet. We’re kind of towards the end of the interview, and I’ve got these fun questions. By the way, does David get, like, two books?

Because he’s asked, like, two questions. I don’t know. You don’t have to say this. I mean, you can decide what you want to do, but he’s been really asking a lot of questions. I’m like, does that mean he gets two books? But I’ll let you decide that.

So, I’ve got some fun questions towards the end of the interview. They’re not too personal, and there are just three of them. I want to go through them one by one. Where do you spend most of your time when you’re not working on cloud or chaos engineering or technology?

Aaron Rinehart: [00:52:58] What do I do in my [00:53:00] personal time?

Yeah, I try to go fishing with my son, to be honest.

Ashish Rajan: [00:53:05] Oh, you like fishing? You’re very patient?

Aaron Rinehart: [00:53:08] No, it’s a way for me to exercise my patience muscles.

Ashish Rajan: [00:53:13] like

Aaron Rinehart: [00:53:14] The magnification of sucking at fishing, and then sucking at it with an eight-year-old. Their frustrations magnify yours.

Ashish Rajan: [00:53:24] Next question: what is something that you’re proud of that is not on your social media?

Aaron Rinehart: [00:53:33] Proud of, that’s not on my social media?

Ashish Rajan: [00:53:35] Yeah. A lot of people talk about family and the support they have, or something they’ve achieved, like a degree, or something personal, like charity work they’ve done.

So is there something that you can share?

Aaron Rinehart: [00:53:47] Yeah, so I was in the Marine Corps years ago, for a number of years. And at age 21 I actually got the opportunity, because I knew satellites, radios, phones, and computers. I knew all of that. I was one of the [00:54:00] only people on the continent of Africa, and really within the Department of Defense and military, right?

There was a huge tsunami in the Indian Ocean, and it really affected the Seychelles, which, you know, is a country. And as sort of a diplomatic mission, I was able to go over there and design its communications network. It was fascinating work. It ended up being a very successful effort, and I’m very proud.

I was really proud of that work. One of my life’s best works was to build out and design their ability to respond to a... I guess I’ve been in resilience for a while. I don’t know.

Ashish Rajan: [00:54:35] I was going to say, there are elements of chaos engineering and application resiliency in that answer that you gave me. Last question:

What’s your favorite cuisine or restaurant that you can share?

Aaron Rinehart: [00:54:46] Oh my gosh. My favorite cuisine?

Ashish Rajan: [00:54:49] You’re a foodie yourself, so I’m keen to know. I know it’s really hard to come up with one answer, but...

Aaron Rinehart: [00:54:55] What would be my favorite answer? I don’t know. I don’t know if I could...

Ashish Rajan: [00:54:58] I get that. What’s your go-to? Let’s go [00:55:00] with that.

What’s your go-to after a tiring day? Like a burger?

Aaron Rinehart: [00:55:06] Pizza? No. What would I go to? I eat so much. My wife is actually Taiwanese, so we have a lot of Asian food, and I don’t like to eat a lot of meat. I also don’t eat a lot of sugar.

Ashish Rajan: [00:55:21] That explains the aloe vera in your black tea.

Aaron Rinehart: [00:55:23] Oh yeah. So, I dunno, I just try to eat healthy, but I try to have a wide variety of foods. I’ve lived kind of all over the world, so, you know, we try to just see different things and different tastes.

Ashish Rajan: [00:55:38] Wait.

Thanks so much for that, man. I think eating healthy is always great, I’ll say that. I think, especially the older you get, the healthier you should eat, just to be normal, I guess. Yeah. And so where can people find you on social media?

Aaron Rinehart: [00:55:55] I’m pretty responsive on LinkedIn. And when I [00:56:00] say, like, here’s my contact information:

It’s at Aaron Rinehart on Twitter; you can look up Aaron Rinehart on LinkedIn; or go to Aaron at Verica, that’s always my email. I mean, I will respond to you, right? No matter what question you’re asking me, unless it gets too personal...

Ashish Rajan: [00:56:17] Be careful what you ask for from the internet.

Aaron Rinehart: [00:56:20] Actually, I’ve put that out for years, and people are pretty genuine,

other than people trying to sell me something, and I have a pretty good filter for that.

Ashish Rajan: [00:56:28] You know,

Aaron Rinehart: [00:56:29] If you have a genuine question and you want to learn, no matter whether you’re a college student or a career professional, I try to help everyone. I believe that’s, I mean, that’s right.

That’s what I enjoy: helping people.

Ashish Rajan: [00:56:43] The community in general, though. I’ve always felt that as well. I mean, kind of like yourself, there are so many people in the security community who are, just like us, sharing free information out there. We just want people to do that. And I don’t want to get into the whole gatekeeping topic, but there’s this [00:57:00] conversation that goes on on Twitter about there being a gated approach to getting into security.

But I think that’s a topic for another time. I loved this episode, man. And I think the comments flowing in give me the sign that a lot of other people loved it as well. So thank you so much for your time. I can’t wait to have you come again, because clearly I have so many more questions, and I feel like all of those are unanswered.

So I kind of can’t wait to bring you back on again.

Aaron Rinehart: [00:57:27] Well, thanks. This was great. Thank you. Thank you again.

‍
