Attacking and Defending Managed Kubernetes Clusters

View Show Notes and Transcript

Episode Description

What We Discuss with Brad Geesaman:

Click on the timestamps to listen to the answer to that question:

  • 00:00 Intro
  • 04:18 What is Cloud Security?
  • 05:57 What is Kubernetes
  • 06:00 Kubernetes and Cloud Native
  • 09:00 Approach for Kubernetes Pentesting?
  • 13:06 Low-Hanging Fruit for Recon
  • 14:49 RBAC in Kubernetes
  • 20:08 Difference between Managed and Unmanaged Clusters
  • 23:13 How does Attack scenario scale?
  • 25:47 Lateral Movement in Compromised Managed Cluster
  • 28:59 Deleting Hacker BreadCrumbs in Managed Cluster
  • 32:40 Other Attack Surface in Managed Cluster?
  • 36:25 Is Kubernetes right for Startup?
  • 39:10 Is Kubernetes right for people already doing container orchestration? LIVE STREAM AUDIENCE QUESTIONS ROUND
  • 44:23 Which tools for attack and defend Kubernetes?
  • 46:55 Runtime tools besides eBPF?
  • 48:09 What’s next for Kubernetes Security?
  • 50:36 Skills shortage in security for Kubernetes?
  • 54:14 How did Brad learn Kubernetes?
  • 56:07 Learning Kubernetes the Hard Way!
  • 57:28 Fun Section
  • And much more…

THANKS, Brad Geesaman!

If you enjoyed this session with Brad Geesaman, let him know by clicking on the link below and sending him a quick shout out on Twitter:

Click here to thank Brad Geesaman on Twitter!

Click here to let Ashish know about your number one takeaway from this episode!

And if you want us to answer your questions on one of our upcoming weekly Feedback Friday episodes, drop us a line at ashish@kaizenteq.com.

Resources from This Episode:

[00:00:00] Ashish Rajan: Welcome. First of all, and it’s a tradition, so cheers, man. Thanks for coming in. Yeah. Cheers. Cheers. Thank you. I’m going to start with, I know a bit about you, I’ve been kind of online stalking you for some time, but a lot of people may not. So who is Brad and where are you professionally now?

Brad Geesaman: You know, the easiest way to say it is, in short, I’ve been in the space for almost 20 years now. SOC engineer, pen tester, security architect, sales engineer, security consultant, and in the last five years or so, you know, building ethical hacking training scenarios, earlier on VMware, but then on Kubernetes, and that was in 2016.

And ever since then, I’ve been in the cloud native container security space, specifically with a lot of Kubernetes focus. So, you know, this past year and a half now as co-founder of Darkbit, doing cloud security posture assessments for companies running and focusing on [00:01:00] Kubernetes, specifically in GCP and AWS and those managed offerings.

So you know, just a small group doing some deep dive expertise stuff.

Ashish Rajan: Awesome. And I think I’ll definitely recommend checking out the Darkbit website as well. You guys did a great job of making the attacker-first approach stand out there. Kudos to you guys for that. But what I realized after, I guess, running a whole month of Kubernetes interviews, like the last four interviews over the last four weekends, have been amazing guests who all spoke about Kubernetes.

And I kind of wanted to get some of your perspective, cause you also bring in a bit of a holistic perspective; as you said, you’ve done SOC, you’ve been a pen tester, you know, that kind of different things. And I’m going to ask, I guess, two different sides of the same puzzle.

What does cloud security mean for you?

Brad Geesaman: Cloud security? You know, being a pen tester in the 2008-and-before era, you know, so Windows 2008 was like the latest [00:02:00] at the time, so sort of stop there. There was a specific set of responsibilities and it was kind of common: everybody’s running everything, they’re running their runbooks to build their Windows servers and things.

I think cloud security to me is having tiers of shared responsibility for different workloads. And so as a security professional, or as a defender, as an attacker, it’s not just, you know, here’s the infrastructure, here’s the data center and this is how you get in. There are different models for each one.

Things that cloud providers are responsible for, things that you’re responsible for. And if you’re running a VM versus a container in something like Kubernetes versus something in serverless, I don’t want to say they’re worlds apart. That’s not fair. They share a lot of substrate. But just from a: what am I deciding?

What am I looking for? What am I, you know, trying to do a response process on? They’re worlds apart. They’re very different. So to me, it’s a combination of all those things, depending on what you’re running, and adapting all your processes [00:03:00] to the various sets of shared responsibility.

Ashish Rajan: And I’m glad you brought this up, the way it’s process and shared responsibility, because I find so much similarity between that and Kubernetes security. So what about Kubernetes? And for people who don’t know Kubernetes, and I think after running a month of Kubernetes in the cloud security space I’d be surprised if people don’t know this, but what is Kubernetes and what is its relevance to cloud native?

Brad Geesaman: It’s a great way to run other people’s code next to your secrets and data as root. That’s a quote that Ian and I came up with at our RSA slash KubeCon talk. Cause it was like, when you think about it, depending on how you run it, it could be considered remote code execution as a service, because if you’re not taking care of things, that’s where it is.

But from a Kubernetes perspective, it’s like, why would you use Kubernetes? That sort of fits into this question too, I think. You’re running infrastructure and you have patterns that keep repeating. And because you are running enough infrastructure, you tend to be at some scale where you’re like, [00:04:00] we’re doing this the same way, or sorry, differently, but for the same types of apps across our infrastructure; we should be doing this in a more common way.

That’s typically when you’re ready for something like a container orchestration platform that says, this is how we do load balancers. This is how we build images and how we ship artifacts. And this is how we deploy. This is how we test, and this is how we validate and all the things that come with it have to adapt.

And so all the security aspects and the policy and the response and things follow suit. But that’s what I think of when I think of, you know, distributed containers: distributed bundles of shared processes and their shared dependencies, on multiple hosts.

Ashish Rajan: Is that what would become cloud native as well?

Brad Geesaman: I think that’s one aspect like cloud native is, is to me, I think people use that definition quite loosely, but it’s back to the shared responsibility. But if you sort of step back from it, it’s leveraging that level of control [00:05:00] versus shared responsibility and ownership to the maximum advantage for that given workload.

So if I had this thing that just needs to respond with the IP of who talks to it, like icanhazip, you know, I could do that in a serverless container for very little. I don’t have ops, I don’t have to worry, I just push it out and let it run, and I don’t have to do as much. But then there’s something like, this is my core business.

I have a very custom driver or I have a thing that is different. That is my secret sauce. I need control over it. You might take it down to, you know, a containerized infrastructure or even a VM infrastructure just to have max control over it. And that granularity.

Ashish Rajan: Oh, I love it. And by the way, I just want to quickly shout out to your support crew that came in; we’ve got Dan Pop in here as well, and I’ve got Magno out here supporting you. You’re talking about attacking and defending, and I do feel the best way to learn security is to learn how to attack it.

Then you kind of go, oh, maybe this is how should [00:06:00] I defend, because that’s how it’s attacked. So starting to talk about attacking Kubernetes security, what’s usually your approach for pen testing Kubernetes clusters?

Brad Geesaman: Yeah. You know, it’s funny you used the literal word pen test. I’d argue that, and this is going to sound self-serving, but a security posture assessment would be akin to a vulnerability assessment that you might do ahead of a full-on pen test, to validate that you fixed a lot of the low-hanging fruit and things.

I argue that a security posture assessment, where you’re looking at the config and metadata, is step one. And then pen testing is a validation of that process. So from a config standpoint, I’m looking at external attack surface: how are the nodes exposed? How’s the API server exposed, or the components that run your cluster exposed? Which apps are exposed via load balancers? Sort of that attack surface, right?

And then stepping inside the cluster, then it gets very interesting because. You know, running as a container, what things do you have? What [00:07:00] permissions do you have? Do you have RBAC access to the, the API server from the pods? Do you have network policy that’s preventing you from moving laterally and moving around inside the cluster?

How well are you preventing or defending access to the metadata typically associated with cloud instances? There’s credentials in that metadata that give you access to the broader ecosystem. So there’s kind of different scenarios. That outside-in is pretty straightforward: whatever apps are exposed, and you sort of move into your standard web app or application security models from that point.

Cause you’re looking at, you know, the API interfaces and authentication, authorization, that kind of stuff. But when you’re inside the cluster, I always envision the what-if: what if a container is compromised and I have a shell on, just pick a pod, that pod. What’s my worldview? What things can I see?

What things can I do? Can I gain access to credentials that are sitting there? Can I hit other services inside the cluster that are, you know, different teams or different, you know, namespaces, that kind of thing? And just seeing where that [00:08:00] lateral movement happens. Cause it’s quite common, unfortunately.
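
From a shell in a pod, those "what's my worldview" checks tend to be a handful of one-liners. A minimal sketch, assuming a GCE-style metadata endpoint and a namespace name that are purely illustrative:

```
# The pod's service account token and namespace are mounted by default:
SA=/var/run/secrets/kubernetes.io/serviceaccount
cat $SA/namespace; cat $SA/token

# What does that token let me do against the API server?
curl -sk -H "Authorization: Bearer $(cat $SA/token)" \
  https://kubernetes.default.svc/api/v1/namespaces
# (or, if kubectl is available in the image: kubectl auth can-i --list)

# Can I reach the node's cloud metadata and its credentials? (GCE-style example)
curl -s -H "Metadata-Flavor: Google" \
  "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token"
```

And the namespaced default-deny NetworkPolicy that blunts a lot of that lateral movement looks roughly like this (namespace name is a placeholder):

```
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: my-namespace
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
EOF
```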

Ashish Rajan: Oh, I love what you said. So it’s almost like giving you a viewpoint from the pod, but also to your point it’s two views. One is from the outside in, this is where you hear about that Tesla API server issue, or I guess not an issue really, but like the use case that happened over there. I think offline, we were talking about that TeamTNT one as well.

Was that similar?

Brad Geesaman: Yeah. Shout out to Magno and the folks at Trend Micro; they put out a really nice blog post a couple of days ago where they’re talking about, sort of, a more sophisticated, I wouldn’t say incredibly sophisticated, but more sophisticated than what I’ve seen in the past, attack on the kubelet. The old kubelet exploit, which isn’t really an exploit.

It’s just the kubelet not having authentication or authorization enabled. And so, you know, tools and techniques that are taking advantage of that, and then pivoting and scanning and trying to hit other kubelets, and then a Monero miner, and using IRC to communicate, etc. So that’s some interesting things that are happening, but that’s [00:09:00] sort of attacking.

That’s why I say that first step is: what are my components? Is my SSH on the nodes, the kubelet on the nodes, the API server on the control plane nodes, are those exposed externally? Because that’s where that attack stops from the outside in. That doesn’t mean it can’t happen from the inside to the inside. But just from broad mass scanning of the internet, if I see a port 10250, it’s probably going to be a kubelet, and you can ping it just to see if it has authorization disabled really quickly.

And it’s high confidence and low effort from that point.
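
That quick check is literally a curl or two against the kubelet's read and exec endpoints. A sketch, with NODE_IP and the namespace/pod/container names as placeholders:

```
# Does the kubelet answer unauthenticated? If this returns pod specs instead of
# "Unauthorized", anonymous auth is on and authorization is effectively wide open.
curl -sk https://NODE_IP:10250/pods | head

# With that misconfiguration, the kubelet will even run commands in containers for you:
curl -sk "https://NODE_IP:10250/run/<namespace>/<pod>/<container>" -d "cmd=id"
```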

Ashish Rajan: So we kind of went into the recon phase as well. Then you were talking about what your view looks like from the pod out and what else you can do. And there are so many services within, pick a component: you pick the control plane, there are so many components over there; you pick a node, there are so many components over there.

Like, what are some of the low-hanging fruit that people should be looking for, which are common?

Brad Geesaman: Yeah, absolutely. So that, that component [00:10:00] configuration I talked about, I’m talking about like the, the flags on the API server, the flags on the kubelet and those types of things. That’s step one, step two is probably RBAC.

I know of someone, I’m not going to name them, but they have automated scans using, you know, mass scanners that trigger on API servers that answer /api/v1/secrets, meaning there’s a misconfiguration of RBAC. And so if you see that, you know RBAC isn’t enabled correctly, or somebody purposely, you know, opened that door.

And that means you’re basically cluster admin in that cluster. So RBAC misconfigurations are the second thing; just reduce the amount of stars in the RBAC policies. Then it’s admission control, network policy, and then I would say secrets management, with admission control being the biggest thing that prevents a developer from becoming your lead SRE in two commands, right?

You want them to stay in their namespace, or you want them just to work on the workloads. You don’t want them to be root on all your nodes just because you allowed them to run a hostPath root [00:11:00] pod or something like that. So that’s where admission control comes in.
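
The unauthenticated-secrets check he describes, and an admission-control counterpart, can both be sketched in a couple of commands (the API server address and namespace are placeholders, and Pod Security Admission is just one of several admission options):

```
# Does the API server hand out secrets to an anonymous caller?
curl -sk https://API_SERVER:6443/api/v1/secrets | head
kubectl auth can-i list secrets --all-namespaces --as=system:anonymous

# One way to stop the "hostPath root pod" trick per namespace, via Pod Security Admission:
kubectl label namespace dev-team pod-security.kubernetes.io/enforce=restricted
```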

Ashish Rajan: Actually, that’s a good point, because it just reminded me that in the first few conversations that we had, we went into the API server as low-hanging fruit. We even spoke about, I guess, container security as low-hanging fruit as well, depending on where you’re downloading the container from. RBAC is something we haven’t really spoken about, cause we spoke about network policy and pod security policy as well.

But I’m curious about the whole RBAC piece here. When you say RBAC, is that more RBAC for the node or is it for the cluster itself? Like, if you don’t mind unpacking that a bit.

Brad Geesaman: The Kubernetes authorization mechanism is role-based access control, to the API server. So: can I get pods? Can I list pods?

Can I create pods? Verb, resource, right. And so when you’re creating cluster role bindings or role bindings, which are namespace specific, you know, you’re granting specific API actions to a user. And if you’re [00:12:00] giving create pod or view secrets, you’re giving very powerful verbs; you’re giving them lots of access by default.
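
"Verb, resource" in practice looks like this: a namespace-scoped Role and RoleBinding that grant exactly the actions you mean to grant, rather than wildcards. A sketch with illustrative names:

```
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-viewer
  namespace: frontend
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-viewer-binding
  namespace: frontend
subjects:
- kind: User
  name: dev-alice
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-viewer
  apiGroup: rbac.authorization.k8s.io
EOF
# Contrast with the cluster-admin-ish pattern he warns about:
#   rules: [{apiGroups: ["*"], resources: ["*"], verbs: ["*"]}]
```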

For create pods specifically, you actually also need to have admission control to back that up. So yes, you can create a pod, but it only looks this shape: it’s not a very privileged one that escapes to the node. It just does the web server, or just is the, you know, caching tier, whatever it’s supposed to do.

Right. But view secrets is the one that is often hidden in stars, meaning a Helm chart. I’m not picking on Helm charts specifically, but often what happens is somebody installs a Helm chart and doesn’t audit what is happening, and they’re installing cluster role bindings to be able to make the thing work: the operator, the GitOps operator.

The thing that’s deploying things on your behalf will typically be cluster admin. So it’ll have a cluster role binding of, like, verb star, resource star, right. And so you’re just giving it full access to everything inside the cluster. And where I see that sort of overlap [00:13:00] is, and this is actually a real use case from one of our clients.

They had a GitOps operator, so something that’s responsible for deploying, and it has to have its privileges, no doubt. And it was in a separate namespace. And then they gave view secrets, because they were using sealed secrets, in this front end web namespace, we’ll call it. But then there was a performance problem and they wanted to break it out.

So they installed multiple versions of these, one in each namespace, and each one handled its piece. But what happened is then that crossed over. So the developers had view secrets, and they installed the GitOps operator in the same namespace, which installs a cluster-admin-bound service account, and service account tokens are stored as secrets inside that namespace.

So they basically gave them access to a cluster admin token. It was a complete oversight. It was like, oh my goodness, how did we not realize that there was that crossover? That’s what I mean by that misconfiguration. It’s really easy to install multiple operators and then just go, wait, I’m just [00:14:00] giving them view secrets in just that namespace.

It’s totally fine, yes, on the surface. But you have to look at what else is inside that namespace to make sure it’s not a secret that contains a cluster-admin-bound service account token. And that was the one step for the developers to become cluster admin. Yeah. And then you have other tiers where it’s like, now that they’re cluster admin, they’re basically root on the node.

So when you think of cluster admin, think of root on the underlying node, because you can bypass policies, you can take off pod security policy, you can disable admission control, you can clean logs as well. Once you’re cluster admin, you’re able to access all the credentials that are attached to all the nodes.

So if you have cloud metadata, maybe on the control plane nodes, and this isn’t going to happen in managed clusters necessarily, hopefully it shouldn’t, if you are able to run workloads or get access to the API server or the control plane, typically those have higher IAM privileges in your cloud environment: things like creating VMs, [00:15:00] attaching disks, you know, or even having full control of some aspects of IAM for that account.

That’s where it gets very, very interesting. So you have a two-step misconfiguration here. You have the ability for a user to see secrets, which makes them cluster admin, which gives them ownership of all the nodes, which gives them ownership of all the credentials attached to the nodes. And that’s how you walk and escape out of the cluster in a lot of cases.
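
The crossover he describes is easy to picture in commands. A sketch with made-up namespace and secret names, assuming the older behavior where every ServiceAccount automatically gets a token Secret in its namespace (the default before Kubernetes 1.24):

```
# "View secrets" in the shared namespace also shows the operator's token secret:
kubectl -n frontend get secrets
# NAME                          TYPE                                  ...
# gitops-operator-token-abc12   kubernetes.io/service-account-token   ...

# Decode the token and use it; if that service account is cluster-admin-bound,
# so is whoever holds the token:
TOKEN=$(kubectl -n frontend get secret gitops-operator-token-abc12 \
  -o jsonpath='{.data.token}' | base64 -d)
kubectl --token="$TOKEN" auth can-i '*' '*'
# yes
```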

So that’s an example of why I think of, you know, this pod as the starting point. Because it’s either compromised from a web app, like it’s a front-facing web app and it gets compromised and the attacker has a shell; or it’s a developer kubectl exec-ing in and feeling malicious today, they woke up and they chose violence.

Or it’s a malicious dependency that came in from your container and it sits there, and now it’s running. All of those start from: I’m an unprivileged container or pod inside the cluster. And by that I mean it’s just sitting there. What can it do? Does it have access to [00:16:00] tokens?

Does it have access to the network? You know, where is it going? That’s why that scenario is so common in a lot of the Kubernetes security talks: let’s assume that the attacker has a shell in the pod. That’s because that’s the start of so many of these scenarios.

Ashish Rajan: Oh, right. So in a way it’s probably your Achilles heel in a lot of ways.

It is something that you need to function, but it’s also part of, I guess, the beginning of attacks. But maybe another way to put this: with the managed cluster and unmanaged cluster, maybe you can start with the difference between the two, and what the low-hanging fruit would be, to your point, if it’s a managed cluster in a cloud environment, say GCP or AWS.

Brad Geesaman: Yeah. So a managed Kubernetes run by, say, a cloud provider: typically the dividing line they make is that they don’t give you access to the underlying nodes of the control plane, the control plane being the things that run the API server, the controller manager, the [00:17:00] scheduler and etcd. Because if you can get access to etcd, and this is not very common, but that’s the database.

Right? So think of the analogy of a web application that talks to a MySQL database. Well, if you could just go into the database and go, yeah, the record is X, Y, Z, the web server will trust it. And so, you know, you can make yourself admin: if you just go admin equals true in an update statement, then the next time you access the web app, it’s like, yeah, you’re admin, right.

Similar thing. etcd is the backend. So you never want to let anybody get to etcd because it stores the secrets. And it stores all the configuration that would give you the ability to become a cluster admin. So they take those keys away from you. They don’t let you have access to those for very, very good reasons.

So they give you access to the nodes, because that’s typically where most workloads might need something, like privileged access to a GPU, or a container storage interface driver that’s doing something cool with, you know, fibre channel or whatever, where you need a little bit more control.

You can give them [00:18:00] root on the nodes, but not the control plane nodes, just the worker nodes where the workloads run. And so if you are running your own Kubernetes, like if you’re running something like kops, and I’m not picking on kops, you’re managing that control plane. You have to be very careful not to allow workloads

to run on the control plane nodes. You have to worry about the API server; you have to protect it from rate limiting, you know, denial of service attacks, all those things. If you expose that API server, you own the responsibility for the most important pieces of your cluster. So if you’re saying, I don’t really need that much control over the control plane, I’m happy to give that management up to AWS or GCP, by all means do that.

It’s worth the hundred-some-odd bucks a month for a cluster, because their SREs are handling that for you. They’re patching it. You know, they’re taking a lot of extra precautions that you don’t have to. So then, in a managed cluster, your focus is on node security, and you’re making sure that workloads don’t [00:19:00] interact with each other that shouldn’t, without also roping in all the stuff that you’d need to do to protect that API server and etcd and things.

Ashish Rajan: Maybe how about we scale it? You know, how cloud is all about let’s scale everything and just make it like 10 to 20,000 deployments in a few seconds. So if we talk about a multiple-node scenario, and I’m thinking in my head, if I have a managed cluster, I’ve got a broad definition of what I’ve defined for the deployment, which only has one node, but then I’ve got multiple nodes.

How does the attack scenario scale? Like, what are some of the examples you’ve seen in terms of a large Kubernetes deployment in a managed cluster context?

Brad Geesaman: Well, scale is vertical and horizontal. It’s how big your pod is, like how much CPU and resources, and then how many there are.

So just in general, I mean, scale can go really big really quickly. I’ve seen somebody go from a 10 replica [00:20:00] deployment to a thousand replica deployment, and I’ve seen the nodes auto-scale and eventually topple under their own weight because it wasn’t configured correctly. But it can do that, with the right help.

So if you’re setting requests and limits correctly, saying my pod uses one CPU and one gig of RAM, and I’m correct when I say that, Kubernetes will handle that for you very gracefully. What’s interesting from an attacker’s perspective is that the privilege you get by being cluster admin means you can be extremely efficient at then getting access to all those nodes, because the kubelet and the API server are fundamentally in sync: I’ll run this command for you. It’s either a container, or a new container, or

It’s either a container or a new container or. Command on the host. So if you get cluster admin, you have access to all of that through a wonderfully documented multiple client library, API that lets you get access to everything that’s in the cluster. So kubectl is your best [00:21:00] friend, but you can write any code that you want.

If you get cluster admin, if you have the permissions, you can be extremely efficient at moving around and being a part of every single node and seeing what’s on every single node. And I think that’s sort of the fun aspect of it too: using the control plane, if you get those permissions, as an administrator would. Because then you blend in, number one, and number two, the most efficient way to harvest all the secrets and all the data from all the nodes is to use its own literal orchestration against itself.
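
"Using its own orchestration against itself" is just the normal API driven with admin credentials. A sketch of what that efficiency looks like (all names are illustrative, and the DaemonSet below is exactly the kind of thing admission control should refuse):

```
# The control plane enumerates the whole estate for you:
kubectl get nodes -o wide
kubectl get pods -A -o wide
kubectl get secrets -A

# Or land on every node at once with a privileged DaemonSet:
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collector
  namespace: kube-system
spec:
  selector: {matchLabels: {app: collector}}
  template:
    metadata: {labels: {app: collector}}
    spec:
      hostPID: true
      containers:
      - name: shell
        image: busybox
        command: ["sleep", "infinity"]
        securityContext: {privileged: true}
        volumeMounts: [{name: host, mountPath: /host}]
      volumes: [{name: host, hostPath: {path: /}}]
EOF
```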

Ashish Rajan: It makes me smile, because I’m just thinking, it’s interesting that the cloud provider takes away access for the API and probably most of the control plane, but if you’re a cluster admin, you’re scaling your privilege across the nodes, and could just have a crypto miner in all of them.

But maybe let’s talk about having multiple clusters, like, in terms of the whole lateral movement and everything. What are some scenarios over there, and how would you, [00:22:00] I guess, abuse them, for lack of a better word, in a managed cluster?

Brad Geesaman: Yes. So I’m a big proponent of considering the cluster as the blast radius. Like, the multitenancy debate: there’s really not a lot of accessible or easily used hard multitenancy, where it’s like untrusted code is okay running inside your cluster. There has to be some level of trust, and typically soft multitenancy is like: this team in my organization and this other team in my organization, all of whom I have legal control over if they do something nefarious. That is typically what we’re talking about with multitenancy, but there’s still blast radius.

And what you just described with, you know, this whole scaling thing: I don’t necessarily want to have one compromise make that attacker cluster admin and root to everything in my entire organization. Just for fault domains, just for regional outages, or more likely availability zone outages in a cloud provider,

you might want multiple clusters, [00:23:00] because they might be different shapes too. I might have a GPU workload over here; I might have a batch jobs workload over here; those things should never touch. This is touching back-end data lake customer data; don’t share that in the same one that is front-end accessible, that runs my dot-com website or my API that I let anybody on the internet use.

You want those separate. So I see using clusters as the blast radius of this. Think of the similar mindset of why would I put VMs in a different VPC? Well, these are my dev instances; well, these are my prod instances. We’ll draw a dotted line around that VPC and say, these things go together.

Those should probably be in the shared cluster, maybe, maybe not, but they definitely should not be in the same cluster over here, which is the dev stuff, and definitely should not be the same thing over here, which is my data lake. That’s how I look at it. So separating that blast radius with separate clusters is the best way to go.

And in GCP specifically, I [00:24:00] always give the advice of one GKE cluster per project, because there are too many opportunities for IAM crossover. You know, compute admin actually gives you access to all the nodes underlying GKE, cause they’re just GCE instances. Those types of things you don’t really think of, like the shared logging patterns where all the logs from all the clusters go to the same logging destination.

Those typically are things that you might not want shared across teams; those are things you would separate into projects. So in AWS it would be different VPCs and different CloudWatch, and in GCP it’s different projects.

Ashish Rajan: That’s good advice, actually. And I’m thinking about all the cloud providers, which should have footprints collected as well for everything that you’re doing.

And I’m just imagining this scenario: I took your advice, I started looking at how I become a cluster admin, I became a cluster admin, and I’m deploying across clusters across the board. But what’s my get-out-of-jail card, I guess? How do I delete my footprint, or what [00:25:00] are the breadcrumbs I’ve left behind? Are there things that I should be looking out for in Kubernetes clusters pointing to me doing things? I guess, how would I remove that, or what should my thinking be there?

Brad Geesaman: So, you’re an attacker, right, and you’re leaving breadcrumbs; or you’re a defender and you’re looking for breadcrumbs.

If you’re an attacker, and you’re running in managed clusters, typically if you’re configured correctly, in other words you’re sending the audit logs from the API server and you’re shipping logs from the nodes, most times that’s done for you. It might not be perfect, or it might not have enough granularity, but you should be able to quickly enable that.

That is one of the key things that is different in a lot of environments: an attacker can only delete what’s on the nodes, and they might not be able to delete what’s in the control plane, because it’s already been shipped off. So as an attacker, how do I clean up after myself? It’s really hoping that the organization is not looking at those historical logs after the fact, right.

That they’re just, like, live troubleshooting and they don’t see [00:26:00] anything, so they don’t know what’s going on. But the record has already been, you know, shipped off and says exactly what happened: this was the manifest at this time from this service account, etc. So I look at it like there’s only so many things you can do to delete the breadcrumbs.

You can just not show up in a kubectl get pods, or get whatever resource it is that they’re looking for. But if you’re auditing the right things: like in EKS, there’s six you have to enable explicitly, right? You want these, all right, you want those shipped off. GKE will do it, but you actually need to add additional ones to get all of them.

It can be verbose and it can cost extra money, and that’s why they’re always like, well, these aren’t on by default, because it costs a lot of money. But that’s your literal record of what’s happening inside your cluster. It’s another API with all these features: every request and every change of what’s happening, you want that shipped off the cluster for you automatically.

Oh, those are hard to delete. Cause they’re already gone. They’re already [00:27:00] somewhere safely.
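
Turning those on is a one-time config change on the managed offerings. Rough sketches (cluster names are placeholders, and the exact flags are worth checking against current provider docs):

```
# EKS: control plane logging is opt-in per log type
aws eks update-cluster-config --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

# GKE: choose which components ship logs to Cloud Logging
gcloud container clusters update my-cluster \
  --logging=SYSTEM,WORKLOAD,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
```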

Ashish Rajan: Right. And to your point, if those six options are not selected, that’s probably not being logged anywhere. So yeah, you’ve already minimized your breadcrumbs to begin with.

Brad Geesaman: Yeah. Like I’m not picking on EKS specifically, but in a default situation, very little is logged from what’s going on inside the cluster.

So there’s a lot of chance to be silent and stealthy. But if you turn those on, it’s gone, it’s going there, unless you start getting access to go delete those logs from where they’re stored, like in S3 or in CloudWatch.

Those types of things like you have to go another step further to go delete those logs from the record.

Ashish Rajan: Yeah, and maybe another few layers that I’ve always considered from an attacker’s perspective, and I’m keen to know your thoughts on this as well: the SSRF and API metadata that exist in cloud service providers, and the supply chain of Kubernetes as well.

Like, a lot of deployments where I’m going to use a CI/CD pipeline to create my [00:28:00] container image, and I’m also going to have a CI/CD pipeline for the Kubernetes cluster as well. You had Mark Manning earlier in the month, and he was talking about a cluster where 3000 developers were logging into one cluster, and I’m like, oh, I imagine that in a cloud context and I’m going, oh my God, this is insane.

But what are some of the moving parts in a managed cluster that we can also look at, either from the recon perspective or for possible exploits?

Brad Geesaman: wow. To your point about what mark Manny said, I love mark. Just in general, I’m a big fan of not allowing, and this is part of like a maturity journey.

If you can get to a point where it’s not your developers, just your, like, five or, you know, small number of 10 SREs, cluster admins, whatever, infrastructure platform teams, those are the folks that are using kubectl, in a break-glass scenario. Like, oops, something’s really sideways, let’s go debug this. If that’s the only time [00:29:00] you’re using kubectl, that is an optimum state.

If the developers are literally interfacing with a code repository that says, I want to bump the version, I want to check in code, and once it gets approved it gets put in a testing branch and the automation takes it from there, and they don’t have kubectl, that takes a ton of complexity out of the problem. That takes away a ton of attack surface, because RBAC, while extremely granular, is incredibly hard not to do poorly.

I know there’s some solutions out there, but it’s like, you either give them create... It’s hard to do the, I need to break the glass for this one-time type of access. You’re like, you give them create pod, you give them view secrets, and that’s it; the RBAC is sitting there.

It’s, there are some identity proxies that can do some fun things there, but that’s a lot of complexity just for this. But if you can get the developers not working inside the cluster, not caring about that level of abstraction you’re winning, because then you can change things [00:30:00] out from under them. In a good way, like you can upgrade your clusters, you can move things around and they don’t have to care.

And that takes away a lot of attack surface of, you know, giving them that access. What that means, though, is that you have to give them the tools, the feedback loops, the observability, to feel just as confident about that deploy, whether it goes sideways or it’s working or it’s healthy, as you would if you got it from a kubectl command. It’s like a cheat.

It’s like, oh, I can see it with kubectl, that’s why we need it. And you’re like, no, you need to see that that deployment is healthy; you don’t necessarily need to run a specific kubectl get pods. So if you can get them out of the cluster, you reduce all of that complexity. I can’t imagine 3000 developers in RBAC: one massive group.

And it’s probably like cluster admin, boom. There we’re done.

Right. So

Ashish Rajan: I’m wondering if that’s a debate as well. Like, a lot of security folks maybe are talking to [00:31:00] SREs or Kube cluster admins, whoever you want to call them, where everyone is probably talking about this. And for context, kubectl is kind of like the SSH of this world, for people who may be listening in. A lot of people do ask for kubectl because that’s what all the guides online talk about; that’s what everyone talks about.

And suddenly security people become the bad people, because you don’t want me to have kubectl. I’m like, well, that’s why you have a CI/CD pipeline, cause that’s what you go through and that’s how you test it; that’s why you have the dev environment and test environment. So I hope people have healthy debates about this and how they land on it, but that also is a good segue into probably talking more about the defending side of things as well.

Then maybe we’ll start with something simple and say, let’s say we’re a startup, and we’re thinking about, okay, we’re going to start with Kubernetes. But the first question is, is it right for everyone?

Brad Geesaman: No, it is not. [00:32:00] I love it, well, I love it for a lot of reasons, but once you get it, it’s hard not to see that pattern everywhere and want to apply it everywhere.

If you don’t see those patterns done poorly first, you might not understand why Kubernetes was written the way it was. Right. You’re declaring state and then letting it do its thing. It’s not imperative, where you’re acting on things and the commands must be in a certain order, etc.

It’s like, and then it works. It’s: make this pod; I don’t want to have to know anything more about the implementation details; store the secret, store this ConfigMap. That is a clean way. And then once you step back from it, you’re like, well, gee, it handles load balancing, it does this, it starts making standard patterns that apply to all these problems. But you might not be big enough or have those scale patterns that require it.

You might just be like, I just need two VMs in an auto-scaling group. I just want at least [00:33:00] one up, and I want it to talk to this RDS instance. Perfectly reasonable three-tier stack; do that thing, if that is the simplest thing that achieves your goals. Great. But if you have 50 of those, and they start to drift a little bit, or, you know, these 10 get upgraded but those take a little while, you start going, well, gee, we’re having to go across all 50 of those things and do the same things over, when it’d be great

if we had a standard VM that does this, and we had a standard way we set up our databases. And you’re like, gee, it’d be really cool to have that in, like, a little snippet of YAML, and we could just say, go make the thing. That’s where: oh, Kubernetes, there it is. That’s why you see that pattern.

And it comes to that. So maybe at a startup, your focus is on getting traction and getting customers, getting users, whatever technology it is to do the thing. It’s not, let’s build it on Kubernetes and then they will come. No, that’s not going to happen. So I would argue you will know when you’re at that problem space, when [00:34:00] you start seeing those patterns of doing things poorly across multiple sets of infrastructure, where you’re like, man, we do everything kind of the same, but not really.

It’d be really great if we standardized this. That’s when it might make sense for a container orchestration system.

Ashish Rajan: Interesting. And I’m thinking about all the large enterprises that may be listening to this and going, well, it’s too late for us; the horse has already left the barn, for lack of a better word.

So for people who already have Kubernetes clusters in their environment, thinking about them for a second over here: is this right for them? Maybe, because to your point, they’re at that scale; they’ve been doing some container orchestration for some time.

Brad Geesaman: Yeah. So Kubernetes has a funny way of bringing out organizational problems, whether you like it or not. It covers so many areas.

It covers networking. It covers load balancing. It covers VM patching. It covers container build pipelines. It covers security. It covers audit. All these things are [00:35:00] collapsing into a shared infrastructure. So if you have pain, it’s going to be where you’re weakest, and it’s going to bring it right to the forefront.

You’re going to have: what do we do with these logs? I don’t know, where do we ship these? Well, we never really shipped to one central place in a standard way; not all of our applications logged the same way in structured JSON. Well, it’s going to bring that right to the surface, because you’re like, how do I debug this thing?

Well, this app sends it in syslog format, and oh, now we have to solve that. Now you’re bringing up all this technical debt that you should have solved, or should be in the process of solving, before you get to the point of Kubernetes. But if you’re already on Kubernetes, it forces you to be good, or at least at a decent level, in a lot of areas that maybe you weren’t really ready for, and then it changes things on top of it.

So the security team, we always like to pick on: what is Kubernetes? I guess we have some of those in our infrastructure. How do we pen test it? How do we red team it? How do we defend it? I don’t know. They’re having to relearn a [00:36:00] threat model of a completely different way of organizing infrastructure.

And, you know, how do you defend that? How do you attack it? How do you do incident response on it? All those problems surface, because maybe you were really good at doing that on standard VMs or bare metal, but now you’re doing it in something that’s a moving target: pods come and go.

Nodes come and go. Load balancers balance to some nodes, and then they don’t balance to other nodes. And what happened when? I don’t know, we never really kept good logs. It’s a trickle-down effect: if you’re not doing all the right things, you’re going to feel that pain.

So people perceive a lot of complexity with that, because it’s relearning and rebuilding muscles that maybe you never had, or you did differently, or did poorly. But if you do them correctly, if you follow the happy path and you go, well, I’m just going to send all the logs to CloudWatch, I’m going to send all the logs to Stackdriver, and I’m going to install Falco, and I’m going to do this,

I’m going to install this admission controller. If you start doing the things right, you start going, [00:37:00] oh, I get it. I see it. I’m seeing why this is, and I’m going with that happy path; you’ll be better off for it. And I think your organization will need to mature in all those buckets, all those categories. You can’t have a bunch of immature buckets and then a bunch of mature buckets.

You have to bring them up. Otherwise you’re going to feel that pain.

Ashish Rajan: Oh, I love it. And I think, to your point, there are so many amazing gems in that, but the one thing that stood out for me is that things that used to be individual roles in a threat model are now combined in many cases.

Brad Geesaman: Yes. Yes.

Like, what is the role of legal in your software bill of materials? Like, are they looking for GPL or LGPL? Well, we ship so darn fast that what they looked at was six months old, right? Like, that kind of thing. Those types of folks will want to be like, well, how do we rewire that process so that we know that [00:38:00] we’re not shipping software that we’re not licensed for, just as an example, versus, you know, vulnerable stuff, versus, you know, stuff from other packages that are old or out of date. All those problems surface because you’re shipping so quickly and you’re leveraging automation.

You need to wrap all those supporting processes around it and think through them as if we’re deploying 10 times a day: how painful would it be for the legal team to see this? Well, maybe we should just, in the CI/CD pipeline, go, oh, we’ll ship off the bill of materials to them, and if they ever want to look, it’s in the storage bucket, you know what I mean?

And then you can just show them: here, this is how you would look at it and check, if you wanted to check. Cause they’re not going to have a lot of freeform queries, for example. So you sort of see that pattern: ask what they’re using the data for and go, hey, can I just send it to you in this place? And think through all those types of integration points.

That’s why it’s like forcing everybody to be that quick, where processes might take days or [00:39:00] months, and those don’t catch up. That’s where you see that not-meshing of the gears and the abrupt conversations that are like, you’re not doing it, you’re doing it completely differently.

You’re not shipping the logs. And you’re like, well, we’re running this quickly. We’re doing this thing because it adds business value. This is why we’re doing this. We can ship features. Our customers are buying. This is why we’re doing it. Y’all need to get on board. That’s a little bit awkward as a conversation to say it that way, but basically that’s what the business needs to do.

It needs to mature all the supporting processes around it

Ashish Rajan: Awesome. And I love it. I feel like I could talk to you for hours about this, but I’m going to switch gears and fish out some of the questions that are coming in. Thank you for your patience, everyone, Vineet and Magno especially.

I’ve got a question from Vineet over here: which tools would you recommend for both attack and defense?

Brad Geesaman: kubectl, curl, jq. That’s dead serious, by the way.

I’m a fan of... so, like, there’s a couple of things. Like, if I’m [00:40:00] attacking, there’s kube-bench for CIS stuff, there’s kube-hunter for light penetration testing stuff.

There’s Peirates; there’s a couple of others I’m blanking on, but there are some that are more attack or red-team-esque. There’s one that just came out that was mapped to the MITRE framework; that was kinda neat to see. But then there’s the malicious activity detection side, and this is one of those things that doesn’t ship with Kubernetes.

It’s not part of its concern whether what is running in that container is good or bad, you know, from the start or at some point later, from a runtime perspective. I’m a big fan of the Falco project; that is one where, you know, you can get a ton of value very quickly for a very inexpensive price: just the price of being a member of a good community.

That’s what I would argue the cost of OSS is: it’s not free, it’s being a part of that community and helping and contributing back. But that gives you a ton of visibility into [00:41:00] nefarious things that shouldn’t be happening, and you’ll want to add that to your defensive tooling. If you’re running something else, you know, a paid or a vendor version, that’s fine too.

It’s just something that’s telling you that what’s happening in this container is suspicious, you should take a look. That is huge. I mean, I don’t want to say that’s it, but I tend to look at Kubernetes as a part of the whole. It doesn’t exist on its own. It lives on some metal somewhere, it lives in some cloud somewhere, it has extremely tight integrations in a lot of cases.

So it’s not just Kubernetes. Just remember, it’s everything: it’s the VMs, it’s the kernels, it’s the components, it’s the containers, it’s what’s in the container. All of those layers of the stack typically require different tools or different perspectives. So it’s an inventory problem first. Use standard cloud inventory tooling, standard posture management tooling, to understand where all the things are and then what’s inside.

Kubernetes is building on top of that.
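
Rough sketches of getting hands-on with the tools he names (install details change over time, so treat these as directional and check each project's docs):

```
# kube-bench: CIS benchmark checks, run as a Job in the cluster
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench

# kube-hunter: light penetration testing, remotely or from inside a pod
pip install kube-hunter
kube-hunter --remote <node-or-api-server-ip>

# Falco: runtime detection via its Helm chart
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco --namespace falco --create-namespace
```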

[00:42:00] Ashish Rajan: Awesome, great answer as well. And actually, just one on runtime: Magno had a question which has been interesting for me. Besides eBPF, what other technologies do you think are important for protecting the runtime?

Brad Geesaman: I mean, that’s the new hotness. I shouldn’t say new, cause it’s been around for a while, but I would argue that in the last two or three years it’s seen a huge focus.

There’s a number of companies that are focusing solely on being really, really good at this. And eBPF is like running kernel-level code, but sort of in a sandbox; that’s oversimplifying a little bit, but letting you plug in observability and security visibility into what’s going on, without having to write a kernel module, is enabling that iteration to happen much more quickly and much more predictably across kernel versions and architectures.

I see runtime as a combination of what’s happening in the container, how you’re capturing what’s happening if something goes sideways there, in combination with the logs that are being emitted from all the various log sources: from cloud APIs, from the API server, from the nodes themselves. Like, [00:43:00] that all goes into what is happening from a security perspective.

So you kind of need to have all of that to be able to paint the complete picture.

Ashish Rajan: I’ve got a few more questions from Magno, so appreciate the patience. This one is about what’s next for Kubernetes security. Where should we focus our efforts next?

Brad Geesaman: You know, I think it’s tempting to say service mesh and identity and all that good stuff. From a risk perspective,

it’s still the hygiene, it’s still the basics, I would argue. As we get better with all those basics, like I said: exposing API servers, no, we’re not doing that, we’re restricting at layer four. We’re configuring and tuning our RBAC; it’s not completely, you know, wide open with stars everywhere.

And then we’re implementing network policy and admission control, and we’re doing basic things. We need to do that, and do that well, before we worry about going up a level of maturity. That said, the one piece that is getting a lot of talk is software supply chain. That’s because it’s [00:44:00] amplified in a containerized environment.

It’s literally just empty nodes, like, ready to run random software: let me know and I will run it for you. So you have to think of, well, where’s all that coming from? That’s sort of pushing the risk to that part, the hardest part, which is the OPSEC of all the humans and all the CI pipelines and all the build processes that make up all the dependencies of all the things that we build on top of and put inside of our containers.

So our focus, defenders’ focus, should be: do the hygiene, do the basics, do all those things really, really, really well. And then vendors and the community should focus on how we can make that supply chain visible and alert people in a way that isn’t complete alert fatigue. You know, like, you scan a container for vulnerabilities, you get 50 back and you can only patch three of them.

You’re like, cool, thanks, I can patch those three, but I still have a glaring red dashboard that says, you know, 47 unpatched vulnerabilities. That [00:45:00] process needs a lot of empathy and a lot of love to make it effective, but it also needs a lot more detail and a lot more sophistication to be able to make that process,

you know, get that risk balance for the cost in there.
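
That glaring red dashboard is easy to reproduce with any open-source image scanner; Trivy is used here purely as an illustration, with a placeholder image name:

```
# Scan a built image for known vulnerabilities in OS packages and dependencies
trivy image registry.example.com/myapp:1.2.3
# The hard part isn't the list; it's deciding which of the 50 findings are
# actually fixable, reachable, and worth blocking a deploy over.
```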

Ashish Rajan: And the wall of red is real, even though both you and I are wearing red t-shirts. Right? It’s definitely real in a monitoring world. I was going to say, it kind of begs another question too, taking a leaf from there: for a security person to secure this properly, there’s clearly a skills shortage as well.

Right? A lot of us came from a non-Kubernetes background and a lot of us picked up cloud. Now this new thing has come out, like, oh, great, I’ve got to learn containers; okay, I’ve started learning containers; oh, great, now learn Kubernetes as well. What else is next? And I think it’s also coming from Magno’s question, which I’d love for you to get into, but what are your thoughts on the skills conversation for [00:46:00] Kubernetes?

Brad Geesaman: I’ll tell you what. Kubernetes is built upon years, decades, of abstractions. Linux: you have namespaces, you have bundled sets of processes with their bundled dependencies, you have this thing called a Docker container, now an OCI spec container, right? And then you have something that orchestrates them.

So if you’re learning and you start up here, and you just assume, oh yeah, I’ll just work on that stuff that’s below it later, you’re probably not going to be able to, from a security perspective, properly develop your threat model. You might know what’s happening, but you might not know why it’s like that, or why it’s configured this way, or why that behavior is what it is, unless you sort of

understand Linux primitives, understand cgroups and namespaces, and understand Linux networking; all of those things build upon each other. So if I were to say where would you start, I would start at the start, which is not Kubernetes, not [00:47:00] Docker. It would be all the way down, just like, what is root?

What is a user? What is isolation? What are the basic primitives of Linux? Before I’d start building on top of that. And as a Kubernetes security focused person, the way I got to like and love and fall in love with Kubernetes is because I built with it. I had to operate it. It wasn’t like it was just somebody else’s cluster that I was coming along to secure.

I don’t feel like you get enough empathy to know why things are that way otherwise. We ran a capture-the-flag exercise in ten regions, one cluster per region, in AWS in 2016, on Kubernetes 1.3 with Calico alpha, right. We got really up close and personal. This is before RBAC. This is when service account tokens were mounted in every pod and were cluster admin

by default; we had to work around that. We had to protect the metadata API with iptables rules manually, like with a DaemonSet. We had to think of all those things to make a CTF environment [00:48:00] that was just, like, WordPress containers inside of a namespace. But we had to think of all that multi-tenancy problem space and build it to understand it.
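
The metadata guardrail he mentions is the kind of thing you can sketch in one iptables rule, pushed to each node by a privileged DaemonSet or in node bootstrap (the rule below is illustrative; managed offerings now have nicer options like GKE metadata concealment / Workload Identity, or IMDSv2 hop limits on AWS):

```
# On each node: drop pod traffic destined for the cloud metadata endpoint
iptables -I FORWARD -d 169.254.169.254 -p tcp --dport 80 -j DROP
```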

So I argue it’s really hard to secure something if you haven’t at least deployed a couple of apps and tried to scale it to a couple of replicas, and gone, well, what if I change the image? And, oh, what if I can do that load balancer thing? Oh, the load balancing thing works. And then once you get it working, then start poking at it, then start breaking it, because then you’ll go, okay,

now I know the happy path. This is what they’re trying to do, and I know why they’re doing it this way. And then I can go, well, what if I didn’t do it the correct way? Or what if I swept its legs out from under it? What would it do? How does it behave? And I think that’s where you have to be curious and be a builder a little bit, to be able to say, I can attack or I can defend this.

Cause I think it’s really hard if your day job is to just be a defender and you’ve never looked at or tried to deploy one of the apps that’s running inside the cluster. That’s very arm’s [00:49:00] length. You want to understand the workflow of the developers that are running stuff inside your cluster, to be able to understand how to threat model it.

Ashish Rajan: That leads me to Magno’s question: what was your approach for learning it, and what would you have told your past self about how to learn Kubernetes?

Brad Geesaman: It was a team decision. This is 2016. It was between Mesosphere and Kubernetes and Docker swarm.

And we kicked the tires on all of them, and the only thing that really worked the way that we wanted it to was Kubernetes. And we spent nine months building the CTF platform to be able to run these events, right. And so it was an earlier time. It’s hard to be like, yeah, it was easier, it was simpler times back then, but it really, really was.

There wasn’t as much going on. You had pods, you had services, you had ConfigMaps and secrets and deployments, PetSets, not StatefulSets yet; you had DaemonSets. Those were your primitives. So it felt like you could understand all of those things. [00:50:00] I argue going back to the roots is where I would start.

I wouldn’t start at service mesh. I would start at Pod, Service, Deployment, and ConfigMap, Secret. I would go through the CKA, the Certified Kubernetes Administrator, you know, syllabus, and really dive in as a builder. And it’s a prerequisite for the CKS, by the way, the Certified Kubernetes Security Specialist; you have to be a CKA first, for good reason, because you can’t secure what you don’t know how to administer, to just get working and get running.

It’s very, very hard to do that. So that’s what I would do if, like, my job started tomorrow and I had to be a Kubernetes security person: I would go do the CKA. And I know that’s a lot, but I would go down that track, whatever gaps I had, to be able to get to that point. That’s where I’d go first.

And then I’d probably look at the syllabus for the CKS, and if I felt like sitting for it, go sit for it.

Ashish Rajan: Awesome. So I guess, learning Kubernetes the hard way. No pun [00:51:00] intended.

Brad Geesaman: Yeah. Shout out to Kelsey. I mean that, that repo has lived longer than I thought he would maintain it, but early on, it was incredibly useful.

Here’s why I think Kubernetes the Hard Way gets kind of a hard, weird rap for its naming: it’s actually Kubernetes the manual way, but it is quite possibly one of the simpler ways to do it manually. By that I mean: set up these three nodes, download these binaries, make this YAML here, or make this config here, and start the service.

Okay, move to the next node, do the same thing for the kubelet on both of those nodes. You now have a Kubernetes cluster. Yes, it’s the tedious way, but it’s actually pretty straightforward. It’s like: download a binary, put a config in, and run it. That is enlightening for a system administrator, a Linux sysadmin.

They can grok that. I can guarantee you. They’re like, oh, that makes perfect sense to me. I know what to do now. That’s just levels on top of that, that I have to learn. And I think that’s the building block that a [00:52:00] lot of people would, would really benefit from.
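
A tiny taste of "the manual way" is just fetching the release binaries yourself, before wiring up the certificates, kubeconfigs, and service units the guide walks through (paths follow the standard Kubernetes release download pattern; the version resolution is illustrative):

```
VER=$(curl -Ls https://dl.k8s.io/release/stable.txt)
curl -LO "https://dl.k8s.io/release/${VER}/bin/linux/amd64/kube-apiserver"
curl -LO "https://dl.k8s.io/release/${VER}/bin/linux/amd64/kubelet"
chmod +x kube-apiserver kubelet
./kube-apiserver --version
./kubelet --version
```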

Ashish Rajan: I definitely recommend that as well. Well, that’s pretty much what we have time for. And before we drop off the livestream, I wanted to let you share where you normally hang out, for people to reach out to you and ask questions. What are your socials, where do you normally hang out?

Brad Geesaman: Yeah. I mean, my DMs are open for anybody who hits up my Twitter; it’s @bradgeesaman on Twitter. I’m also bgeesaman on GitHub, although that’s not really as social. And I’m also Brad Geesaman on LinkedIn; there’s only one other Brad Geesaman I’m aware of on LinkedIn.

So I’m easy to find, cause I use my name everywhere. So, you know, reach out with DMs or questions; I’ll get back to you as soon as I can.

Ashish Rajan: Awesome. And thank you so much for coming on board. I had a great time. As I was saying, I had a lot more questions and I didn’t get through all of them, but this was totally worth it.

And I hope to have you on once again, at least to talk about more of the things that we can be doing in Kubernetes security. But [00:53:00] I will look forward to talking to you again soon, Brad. Thanks so much for this. And for everyone else who’s tuned in, thank you so much for joining us, as always, every weekend. Next weekend we’re switching to another topic and moving on to bug bounty and Google Cloud security.

You’ll get to know a bit more as we go through this in the week, as next month is focused on that. And if you want to know more about this and a lot more on the topics that we talk about, feel free to subscribe and follow on whatever platform you’re watching this on. And thank you, Brad. And thank you, everyone.

Thanks.