MORE Fake Code Than Real? AI Supply Chain Security Explained

The open-source ecosystem might now contain more intentionally fake and malicious components than commonly used legitimate ones. In this episode, Brian Fox, Co-Founder and CTO of Sonatype, discusses the critical state of AI software supply chain security, how hidden AI usage is rampant, and why traditional vulnerability management isn't enough. We spoke about the unique risks AI models present (data, software, execution environments) and practical strategies to protect your development lifecycle.

Questions asked:
00:00 Introduction
01:07 A bit about Brian Fox
02:03 AI Software Supply Chain Components
03:53 Attack Vectors for AI based applications
04:53 AI Software Development Lifecycle
08:11 Is software supply chain security changing with AI?
11:23 Malicious vs Vulnerable Component
17:30 Fun Questions

Brian Fox (Sonatype)

Brian Fox: [00:00:00] In AI, how would you analyze all of the data that it's been trained on? Like in many of these components, you don't even have access to the data. Yeah. In fact, there's been a debate lately of what open source AI means. The official definition does not mean the data itself is open, it's just the software part.

And so if you have the software, maybe you can inspect that. You can't inspect the data, so it becomes harder to prove that it's okay.

Ashish Rajan: So

Brian Fox: the risks are bigger in that regard. Yeah. So those are things that, that make this much more complicated to deal with than what we've been used to. And I'll tell you, organizations are still generally terrible at managing just the open source dependencies,

Ashish Rajan: AI software supply chain, what is inside those components over there that your developers may be putting in? I had a conversation with Brian Fox, who is the co-founder and CTO of Sonatype, and we spoke about this new software supply chain that we're building up as the CICD pipelines.

What are some of the components that people are using in AI and how could you be building better security with developers and a lot more in that conversation? I hope you enjoy this conversation with [00:01:00] Brian and I'll talk to you soon. Hello and welcome to another episode of Cloud Security Podcast. Today I'm with Brian Fox.

Thanks for coming on the show. Thanks for having me. Maybe just to start, to set up some context, can you share a bit about yourself and how you got into the whole software supply chain space?

Brian Fox: Sure. So my name's Brian Fox. I'm the co-founder and CTO at Sonatype. We founded the company almost 18 years ago.

Yeah. My background was in software development, open source. And then as part of building the product suite at Sonatype, we didn't start initially focused on security, but because we were focused on open source dependencies and the tools around them naturally the evolution of that pulled us more in the direction of helping organizations secure them.

Oh, early days it was more focused on license compliance, OSS governance and security was a side thing, but over the years as attackers have leveraged that more, it's become much more of the main focus of the supply chain management. But it's not the only one. Picking better components for license architecture, quality reasons [00:02:00] can be as important as making sure you're not using ones that are vulnerable.

Ashish Rajan: Actually, it's a good segue into where we are today as well. Considering you started in this space 17, 18 years ago, with the whole AI software supply chain thing, what are some of the components in there that leaders should know of?

Brian Fox: So to me, it looks a lot like open source management did in, say, 2010, 2011, where we would talk to big organizations and they would say we don't use open source.

And we, one of the things Sonatype has always done also is we run the, what's called the Maven Central repository or the central repository. This is where all the world's open source Java components are. That comes from our infrastructure.

Oh, really? Yeah. We maintain that, we run it, we provide that as a service to the community.

And it's quite large, right? And that gives us unique insights into some of the behaviors. And so we would see organizations were downloading 60,000 to 90,000 components from Central a year. And we would go talk to the leaders and they'd say, we're not using open source. And we're like, something's not right here.

And we see the same thing at organizations now, and we did some [00:03:00] studies on this last year. You ask leaders, are you using AI in your development? And they say, no, we're not. And then you do an analysis and you find that the developers have already included AI models and other things like that in the software, to a level where even I'm surprised at how fast it happened. Oh wow. And so for me, the parallels are the same: leaders assume that because they haven't blessed it, yeah, it's not happening. But if you don't have a supply chain management program, you don't have the ability to produce SBOMs, Software Bills of Materials. Yeah.

Then how do you know? You assume it's not there. I'm telling you, more than likely it's there. And so for me, the parallels track the same exact way. If you don't have controls over that as a CISO, yeah, assume that it's being used and try to go get controls and I'd be happy if you prove me wrong in that case.

But I, I think more often than not, you'll find at least somewhere in your portfolio, people are using models you didn't realize.
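Editor's note: as a rough illustration of the "how do you know?" question above, here is a minimal sketch that walks an SBOM and lists any AI model components in it. It assumes a CycloneDX-style JSON document, where version 1.5 of that format defines a `machine-learning-model` component type. The sample SBOM, the component names, and the `find_models` helper are all made up for the example, and this is not a description of Sonatype's tooling.

```python
import json

# Hypothetical CycloneDX-style SBOM, trimmed to the fields this sketch reads.
SAMPLE_SBOM = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "log4j-core", "version": "2.17.1"},
    {"type": "machine-learning-model", "name": "example-sentiment-model", "version": "1.0"},
    {"type": "container", "name": "example-inference-runtime", "version": "3.2",
     "components": [
       {"type": "machine-learning-model", "name": "example-embedded-model", "version": "0.4"}
     ]}
  ]
}
""")


def find_models(components):
    """Recursively yield (name, version) for every ML-model component."""
    for component in components:
        if component.get("type") == "machine-learning-model":
            yield component.get("name"), component.get("version")
        # Components can nest, e.g. a model shipped inside a container.
        yield from find_models(component.get("components", []))


models = list(find_models(SAMPLE_SBOM.get("components", [])))
for name, version in models:
    print(f"model in SBOM: {name} {version}")
```

The recursion matters because, as discussed later in the episode, a model can arrive wrapped inside a container or other runtime rather than as a top-level dependency.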

Ashish Rajan: Interesting. So in these AI based applications then what is the attack vector in that case then?

Brian Fox: There can be many. [00:04:00] An AI model is, sort of, almost like three components.

You've got the software that actually executes the AI aspect of it. There can be issues in there, whether they're accidental or intentionally malicious. There's the data that it's trained on. Same problem. It can be, flawed because humans are flawed. It can be intentionally crafted to do things you would wish it didn't do.

Yeah. And then sometimes there's actually the execution runtime, so sometimes it can be like embedding, say, a container or a virtual machine inside your application, which can have a whole bunch of other stuff. Whereas problems with your typical open source components tend to revolve around issues in the open source component code itself.

Ashish Rajan: Yep.

Brian Fox: In AI, you've got the code, you've got the data, and you've got the execution environment. So those latter two things are new paradigms

Ashish Rajan: Yep

Brian Fox: to have to deal with and really it expands the surface area of all the things that can go wrong.

Ashish Rajan: So people who are trying to, I guess maybe to what you were saying earlier about open source being part of it, but they had no [00:05:00] idea AI as being part of it.

How would you describe the open source in the AI space fitting into a current software development life cycle? What are some of the things that are unique about it compared to a regular software development life cycle?

Brian Fox: I think it's the things I just covered that the surface area of it is so much bigger.

It's harder to prove that it's safe. You could, in theory, nobody does this, but you could, in theory, scan the open source and try to review all the code and convince yourself there's nothing nasty in there. But in AI, are you gonna, how would you analyze all of the data that it's been trained on?

Like many of these components, you don't even have access to the data. Yeah. In fact, there's been a debate lately of what open source AI means, the official definition does not mean the data itself is open. It's just the software part. And so if you have the software, maybe you can inspect that.

You can't inspect the data, so it becomes harder to prove that it's okay.

So the risks are bigger in that regard. Yeah. So those are things that, that make this much more complicated to deal with than what we've been used [00:06:00] to. And I'll tell you, organizations are still generally terrible at managing just the open source dependencies.

We had the Log4j, Log4Shell issue a couple years ago. Yeah. It's three and a half years now.

Ashish Rajan: Wow.

Brian Fox: Until, last year, on average, 30% of the downloads from the repository were of those known vulnerable versions from three years prior. Wow. We're at 13% now, which is getting close to reasonable.

But if organizations can't deal with something as straightforward as a basic vulnerability in Log4j that was so widely popularized, that everybody knew about, then imagine where we are with this newer problem with AI. Yeah, it's not a good situation.

Ashish Rajan: So do the same things that you mentioned earlier about licensing issues and so open source components being used in the AI based application.

Do they still carry over into the AI-based application as well? Oh, absolutely.

Brian Fox: Yeah, absolutely. Exponentially, because now you have potentially, I think it's still unsettled case law about what happens when the data has been [00:07:00] trained on copyrighted works. What are the implications of that?

So again, those extra dimensions of the execution and the data carry with them all these new problems, from the security side, like we touched on, but also from legal aspects.

Ashish Rajan: Yeah. But

Brian Fox: The software itself, the model itself might be open sourced and might be okay. Yeah. But your developers might be pulling in an implementation of that which is using training data that is not open sourced. Yep. And shouldn't be used. And so again, if you don't know what's inside your software, if you don't have a process for managing that, you need to assume that this stuff might be in there and it might not be what you want it to be, which is again, exactly what we saw with open source 10, 15 years ago where developers were grabbing any old component.

Some of them might have a GPL license. Yeah. Which can get you a copyright lawsuit. There are some high profile ones where this happened. And I think we're gonna see history repeat all over again. Just much more complicated with AI.

Ashish Rajan: And [00:08:00] I guess to your point, the existing open source license issues and open source security vulnerability haven't gone away.

So now you've tagged on another component to it.

Brian Fox: That's right. You've added multiple dimensions even to it. Yeah.

Ashish Rajan: Yeah. 'cause now we have data model, data pipeline, app pipeline. You kind of keep hearing all these pipelines as well. And I guess, is the way software supply chain security is done different in this world? Obviously it's come a long way in the 17 years that you've been working in this space.

There's a lot more DevOps pipeline, CI/CD pipeline, app pipeline, data pipeline. There's a lot going on there. Is the way software supply chain security is done at enterprise level any different? Or is there a better way to manage it? 'cause I imagine there's a lot of people who are here at KubeCon, where we are recording this.

They are uplifting the security program. They're obviously having conversations about, hey, we are using containers, blah, blah, blah, whatever. And if they look at the software supply chain, or at least uplifting an existing security program, what should they consider for this AI world they're moving to?

Brian Fox: Yeah, we've always taken a different approach.

We [00:09:00] recognize that providing developers with the information to make better choices upfront was key to this, right? Many security programs are taxing or roadblocking development and it doesn't have to be that way. So we've designed our solutions from the beginning to be able to provide, first codify what the policy is.

Ashish Rajan: Yeah.

Brian Fox: Provide the data so that the systems can recognize immediately if a particular use is okay or not in the exact context in which it's being used.

Some components, some models might be okay for one application, but not for another.

Ashish Rajan: Yeah.

Brian Fox: Especially when you think about licensing internal services versus distributed code have very different implications from copyright.

As an example.

Ashish Rajan: Yeah.

Brian Fox: And for security.

Ashish Rajan: Yeah.

Brian Fox: And we took the approach to make that as automated as possible so that you could analyze all the way through the lifecycle. Provide developers real time information about the dependencies that they're about to choose, or the ones they've already chosen if a vulnerability becomes known after the fact, and provide [00:10:00] ways to be able to warn before you have to fail. It's a very different approach, but it's more about empowering, and then providing the guardrails to make sure nobody's ignoring it. That's still a sort of unique approach among a lot of our peers in the industry, but it applies to AI as well.
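Editor's note: the idea of codifying policy, evaluating it in the context of use, and warning before failing can be sketched in a few lines. Everything below is an illustrative assumption: the `Component` type, the `LICENSE_POLICY` table, the two context names, and the specific rules are invented for the example and do not represent Sonatype's product or any real organisation's legal position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    license: str
    known_vulnerable: bool = False


# Hypothetical policy table: (license, usage context) -> action.
# The same component can be acceptable in one context and blocked in another.
LICENSE_POLICY = {
    ("GPL-3.0-only", "distributed"): "fail",
    ("GPL-3.0-only", "internal"): "warn",
}


def evaluate(component: Component, context: str) -> str:
    """Return 'allow', 'warn', or 'fail' for a component in a given context."""
    if component.known_vulnerable:
        # Made-up rule: warn for internal services, fail for shipped code.
        return "fail" if context == "distributed" else "warn"
    return LICENSE_POLICY.get((component.license, context), "allow")


if __name__ == "__main__":
    gpl_lib = Component("example-gpl-lib", "GPL-3.0-only")
    for ctx in ("internal", "distributed"):
        print(f"{gpl_lib.name} in {ctx} context: {evaluate(gpl_lib, ctx)}")
```

Returning a three-valued result rather than a boolean is what leaves room for the "warn before you have to fail" behaviour: a build can surface the warning to the developer early and only block at a later stage.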

And it, it applies even more so when you're talking about developers using AI tools to generate the code. Which is a totally different area than what we just talked about. Yeah. You've got the, I've embedded the model into my software to do some AI thing when I ship it.

Yeah. And then there's, I've got AI tools that are helping me write my actual code. Yeah. Those things might be making suggestions based on other things. And that code needs to be scrutinized just as much as it did if a human wrote it.

Ashish Rajan: Yeah.

Brian Fox: And because there are pollution attacks that can happen there, somebody could try to trick an open source AI tool, or just an AI tool that knows about open source, to produce bad recommendations. "Let me suggest this vulnerable or malicious component more [00:11:00] often than I suggest the safe one" is a way to pollute lots of software, as an example. Yeah. And so that's a different dimension that you need to be thinking about as you put these guardrails in place.

Ashish Rajan: Yeah.

Brian Fox: So I think with the AI, like we talked about, it's the same and more. Yeah. So you need to be able to do the fundamentals if you want to even have a chance of dealing with it with AI. Yeah. But I think largely the patterns remain the same.

Ashish Rajan: Because you mentioned malicious component versus vulnerable one.

What was that about then?

Brian Fox: So around 2017, it's shocking that it's been, what, seven, eight years now. I started to see the first kind of new components appearing in the supply chain. So in repositories, like the JavaScript repository where the focus was actually on stealing credentials of open source publishers.

And we've seen a constant evolution of this ever since. And a lot of people don't recognize that this is really a distinct challenge. So what's happening is the attackers have figured out that it's easy to trick developers into grabbing a counterfeit [00:12:00] component. They put it out there in the world with a similar name.

Yeah. They fake the downloads, they fake the stars. They make it look very popular. And so the developer just trusts it. They trust their package tool, they download the thing. And then in many of these instances, the payload executes. Now it's often a smash and grab, it's trying to steal data that's on the developer's machine.

Yeah. Which could be cloud API keys, things like this that might actually have access to production systems. Yeah. And so they smash and grab that data and ship it off to parts of the world where you wish they wouldn't have it. And by then it's too late. And so the problem is developers often don't recognize that this happened.

The data's gone. Yeah. Nobody reports it because they don't realize that it happened until the attackers come back later and use those keys to do something else. And so your traditional vulnerability management program, which is focused on inspecting the code the developers check in, that gets built, that's staged, that's released, yeah, misses this whole attack, because the attack is happening further left, on the developer [00:13:00] machine. And so the only way to defend against that is to be able to intercept it before it gets on a developer machine. Oh, interesting. So the point here is that, 'cause I talk to a lot of people and I explain it and they go, yeah, I get it, and I have a vulnerability management program.

I'm like, okay. Then you don't get it because this is a different problem. 'cause

Ashish Rajan: it's in the box of the developer.
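Editor's note: to make the "similar name" trick and the idea of intercepting before install concrete, here is a minimal sketch of a pre-install name check. It only compares a requested package name against a list of trusted names using Python's standard `difflib`; the `KNOWN_PACKAGES` set, the `lookalike_of` helper, and the 0.8 threshold are assumptions chosen for illustration. Real detection of malicious components needs far more than name similarity, and this is not the method Sonatype describes using.

```python
from difflib import SequenceMatcher

# Hypothetical allowlist of package names the organisation already trusts.
KNOWN_PACKAGES = {"requests", "numpy", "lodash", "express"}


def lookalike_of(name, known=KNOWN_PACKAGES, threshold=0.8):
    """Return the trusted name that `name` closely resembles, or None.

    An exact match is trusted and returns None; so does a name that is not
    similar enough to anything on the allowlist.
    """
    if name in known:
        return None
    best, best_score = None, 0.0
    for candidate in sorted(known):
        score = SequenceMatcher(None, name, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    for requested in ("requests", "requets", "flask"):
        hit = lookalike_of(requested)
        if hit:
            print(f"BLOCK {requested!r}: looks like trusted package {hit!r}")
        else:
            print(f"pass  {requested!r}")
```

The point of running a check like this in front of the package manager, rather than in the build pipeline, is that the payload in these attacks executes at install time on the developer's machine.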

Brian Fox: It's happening further left. Yeah. Of your traditional security program. We like to say, a lot of the supply chain is like auto manufacturing and factories do different things to be able to make better safer cars, right?

Yeah. There's all the Deming supply chain principles from Toyota and all that.

Ashish Rajan: Yeah.

Brian Fox: And so I, I like to say, you should do all those things. They're focused on making a better car, but what they're not doing is they're not protecting somebody from blowing up the factory.

Ashish Rajan: Yeah.

Brian Fox: Putting better car parts in the car is great until somebody says, that's nice, I'm still gonna make a fake one and slip it in the back door.

Yeah. And the way you defend against that is very different. And that's the point here, that you can have a sophisticated vulnerability [00:14:00] management program and still be susceptible to these malicious components because they are designed to do something different. And also your traditional malware tools that are looking for viruses and things like that on the endpoint, they're looking for certain signatures of backdoors and rootkits and things like this that are known.

But what we see in these open source malware components is that literally we're talking about bespoke code that never existed before, because they're just writing it in the scripts. And so there's nothing to recognize until after the fact, so we've done some pretty interesting things to be able to detect these.

Yeah.

Ashish Rajan: Okay.

Brian Fox: Out of necessity. And so the number as of right now that we've tracked over these years is somewhere around, probably today it's probably 820,000 components that exist for the sole purpose of stealing data or doing something nefarious, that's shocking. And if you look at the graphs that we've published, it continues to be a hockey stick growth every year.

Last year it more than doubled. We're looking at more than 18,000 of these [00:15:00] components every week now. And so it's significant. And we also did a study last year. We looked at all the different ecosystems, both from the download perspective, but also from our tooling in terms of understanding what components are used in enterprise applications.

It turns out that of the around 8 million or so open source components that were available as of this past summer, yep, only about 10% of those are commonly used. Oh, wow. Which makes sense. Yeah. There's a lot of stuff out there, but when you get down to the core, everybody uses fairly much the same things.

10% of that turns out to be around 750,000 components. And I think it's interesting when you reflect on the fact that there are now more intentionally fake components in the ecosystem than the number of components that are typically used in applications every day. Yeah. We've already crossed the point.

This is no longer an edge case. Actually, the noise is outweighing the useful signal.
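Editor's note: the comparison above is simple arithmetic, shown here using only the approximate figures quoted in the conversation. The variable names are made up for the example, and the numbers are the speaker's round estimates rather than exact counts.

```python
# Approximate figures as quoted in the conversation.
malicious_tracked = 820_000   # components tracked as intentionally malicious
commonly_used = 750_000       # roughly 10% of the ~8 million available components
new_per_week = 18_000         # new malicious components seen per week

# How many malicious components exist per commonly used one.
ratio = malicious_tracked / commonly_used

# Weeks needed, at the quoted weekly rate, to add as many malicious
# components again as there are commonly used ones.
weeks_to_add = commonly_used / new_per_week

print(f"malicious vs commonly used: {ratio:.2f}x")
print(f"weeks to add another {commonly_used:,}: {weeks_to_add:.0f}")
```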

Ashish Rajan: Oh my God.

Brian Fox: And again, not enough people are focused on this problem. They don't understand that this is happening. And we're getting to the point now where we [00:16:00] roll out our capabilities within companies, and they find that these attacks have already happened and they thought they were safe, but they didn't have a tool to understand it.

They didn't have the ability to see it, so of course they wouldn't know that it had already happened. Yeah. So this is why we keep trying to raise awareness around it. It's the focus of our booths here at KubeCon. Yeah. And some of the things that we're doing.

Ashish Rajan: Wow. Oh, I think, because as you were talking about this, the one thing that came to mind, which I've done in the past, and now that I think about it, actually, I didn't even check, is like even leaving a link to a package on Stack Overflow and saying, hey, most common Java issue, or making a blog post. You could be the person who writes the question, writes the answer, yeah, and goes, yes, this answer helped me. And you download this package.

Brian Fox: And we've seen that as a promotional way of them drawing attention to these malicious components. I've seen that tactic before.

Oh God, for sure. So they'll put it in the repository. For whatever reason, the downloads aren't enough. So they'll go create a thing and people click on it.

Ashish Rajan: Yeah.

Brian Fox: And again, these things are easy for [00:17:00] them to create. They replicate the code over and over again with different package names. So it's throwaway email addresses, one gets banned, they just do it again.

Yeah. So the cost of proliferating these across is very low. Yeah. And so it means they don't have to be super sophisticated and sneaky and durable in their attack. Yeah. Because if they get 10 people to download it, they're just playing the odds. Of course. Yeah.

And if, and this is why we see so many of them,

Ashish Rajan: If one of them happens to be an enterprise developer, exactly, you have the gold mine there. That's right. That's right. Wow. I mean, that's most of the technical questions. I have got three fun questions for you as well. Okay. First one, where do you spend most of your time when you're not trying to save the world from software supply chain attacks?

Brian Fox: Sitting on a plane going to the next place lately.

Ashish Rajan: Yeah.

Fair. The number of trials you're coming up. Sounds like that.

Brian Fox: I've been home five days. In all of March so far.

Ashish Rajan: Oh my gosh. Yeah.

Brian Fox: Besides that, I don't know, I'm a hands-on kind of guy. I do a lot of things fixing stuff around the house, building things boy scouts, camping, you name it.

Oh. With my little time, I have a lot of different things that I'm doing

Ashish Rajan: Fair. Second question. What is something that, what [00:18:00] is something that you're proud of that is not on your social media?

Brian Fox: Wow. I don't share a lot on social media, but I like to joke that Sonatype is like my third kid, but my kids are successfully in college and getting out. So I'm proud of that. I'm proud of the fact that at Sonatype we've come this far. Yeah. Of companies that were founded in 2007 that made it to the stage that we are, it's 0.001% or something like that.

Wow. And that's that's pretty cool. Yeah. Yeah. Not sure if I could replicate that again, but it's been a, been an interesting journey, so Wow. Really proud of both of those things. Wow.

Ashish Rajan: Yeah. And to your point. Surviving in cybersecurity with such intense threat actors keep changing, everything is evolving and to your point, yeah, 17 years later, we are still trying to make people understand about the same problem that they, the same mistakes they keep making as well.

Brian Fox: That's right. The war is never over. Yeah, that's right.

Ashish Rajan: It's like you can have a shut down, for lack of better word.

That's right. Third and final question. What's your favorite cuisine or restaurant that you can share?

Brian Fox: Ooh. Always partial to, like, Lebanese food. Ooh. I think, like, the kebabs. Yeah. And [00:19:00] all of that kind of chicken shawarma and, oh my, that kind of stuff. Probably my favorite.

There was, when we started Sonatype, we were in Mountain View and there was a nice little restaurant right there on Castro, El Camino Castro, and I'm from New Hampshire. We don't have that many exotic choices. Oh okay. And so that was the easy place. And so we went there all the time.

And so now I have a hard time passing that up if I am looking for something to eat and I come across a restaurant like that.

Ashish Rajan: Good choice there. Good choice there. Yeah. That was all the fun questions I had, where can people find you on the internet to talk more about the software supply chain malicious versus vulnerable and all of that.

Brian Fox: Sure. Brian Fox, Sonatype, Google it. You'll find me on all the places.

Ashish Rajan: I will put the link in the show notes as well. But thank you so much for coming. Yeah, thanks for having me. Thank you. Thanks everyone for watching the show and I'll see you next time. Thank you so much for listening and watching this episode of Cloud Security Podcast.

If you've been enjoying content like this, you can find more episodes like these on www.cloudsecuritypodcast.tv. We are also publishing these episodes on social media as well, so you can definitely find these episodes there. Oh, by the way, just in case there was interest in learning about AI [00:20:00] cybersecurity, we also have a sister podcast called AI Cybersecurity Podcast, which may be of interest as well. I'll leave the links in the description for you to check them out, and also for our weekly newsletter where we do an in-depth analysis of different topics within cloud security, ranging from identity and endpoint all the way up to what is CNAPP or whatever new acronym comes out tomorrow.

Thank you so much for supporting, listening and watching. I'll see you next time.
