Thinking of building your own AI security tool? In this episode, Santiago Castiñeira, CTO of Maze, breaks down the realities of the "Build vs. Buy" debate for AI-first vulnerability management. While building a prototype script is easy, scaling it into a maintainable, audit-proof system is a massive undertaking requiring specialized skills often missing in security teams. We also dig into the "RAG Drug": relying too heavily on Retrieval-Augmented Generation for precise technical data like version numbers, which often fails. The conversation gets into the architecture required for a true AI-first system, moving beyond simple chatbots to complex multi-agent workflows that can reason about context and risk. We also cover the critical importance of rigorous "evals" over "vibe checks" to ensure AI reliability, the hidden costs of LLM inference at scale, and why well-crafted agents might soon be indistinguishable from super-intelligence.
Questions asked:
00:00 Introduction
02:00 Who is Santiago Castiñeira?
02:40 What is "AI-First" Vulnerability Management? (Rules vs. Reasoning)
04:55 The "Build vs. Buy" Debate: Can I Just Use ChatGPT?
07:30 The "Bus Factor" Risk of Internal Tools
08:30 Why MCP (Model Context Protocol) Struggles at Scale
10:15 The Architecture of an AI-First Security System
13:45 The Problem with "Vibe Checks": Why You Need Proper Evals
17:20 Where to Start if You Must Build Internally
19:00 The Hidden Need for Data & Software Engineers in Security Teams
21:50 Managing Prompt Drift and Consistency
27:30 The Challenge of Changing LLM Models (Claude vs. Gemini)
30:20 Rethinking Vulnerability Management Metrics in the AI Era
33:30 Surprises in AI Agent Behavior: "Let's Get Back on Topic"
35:30 The Hidden Cost of AI: Token Usage at Scale
37:15 Multi-Agent Governance: Preventing Rogue Agents
41:15 The Future: Semi-Autonomous Security Fleets
45:30 Why RAG Fails for Precise Technical Data (The "RAG Drug")
47:30 How to Evaluate AI Vendors: Is it AI-First or AI-Sprinkled?
50:20 Common Architectural Mistakes: Vibe Evals & Cost Ignorance
56:00 Unpopular Opinion: Well-Crafted Agents vs. Super Intelligence
58:15 Final Questions: Kids, Argentine Steak, and Closing
Ashish Rajan: [00:00:00] Hello and welcome to another episode of Cloud Security Podcast. I've got Santiago with me. Hey man, thanks for coming on the show. Glad to be here. I was gonna start with an introduction. So for people who do not know Santiago, could you share a bit about yourself, your background, uh, where you are today?
Santiago Castiñeira : Yeah, absolutely.
I'm Santiago, co-founder and CTO at Maze. I'm based in Munich, Germany. Uh, I've been working on a few things over the last few years. Uh, mainly, uh, basically data pipelines for cybersecurity data, specifically vulnerability management. Before that I was CTO at another small startup, and I spent some time at Amazon, generally being in a few startups before that.
Ashish Rajan: Awesome. And today's conversation is about the whole AI-first vulnerability management thing. I mean, obviously AI is top of mind for a lot of people, but the one conversation that, uh, I think we were talking about this before we started the recording, I come across this in the advisory board that we run, and you're coming across this as well: the whole build versus buy conversation.
But before we jump into that, I want to set the tone for [00:01:00] the AI-first vulnerability management. How do you describe AI-first vulnerability... sorry, lemme just rephrase. Mm-hmm. What is AI-first cloud vulnerability management, specifically?
Santiago Castiñeira : Yeah. So the way I think about it is, uh, traditional vulnerability management has been very heavily rule-based.
So the goal for years, and basically what CNAPPs brought into the picture, is bringing massive amounts of data into one place, so you can write very granular rules that basically match your security posture and your, let's say, uh, risk appetite: how you define risk and how you value or evaluate risk.
So I think basically AI-native, or LLM-native, changes the picture completely, because suddenly you go from rules to reasoning. So this is one of the ways we describe it, and I think reasoning is very, very different, because you can be very contextual. So you don't need to be specific about the parameters of a rule, but basically evaluate based on the context that you have.
And I [00:02:00] think this changes the game completely. And what I would say is like AI native would be basically products that use this reasoning as the core, um, decision making engine of the products.
Ashish Rajan: Oh, right. So it's not just about me getting my, uh, the way you describe it, what's my vulnerability at the moment.
It's more of an intelligence layer on top of it.
Santiago Castiñeira : Yeah, exactly. So it's basically figuring out the vulnerabilities, but then you can have a lot of understanding of the specific context where the vulnerability is, and really do a more thorough assessment. So it could be from looking at, is this really a problem for us, to, uh, what is the impact?
What are other assets around it? What is the likelihood? So is it something that is easy or hard to exploit? Is it, uh, something that an attacker will, let's say, prefer over other types of vulnerabilities in your environment? So it's very smart, let's say, reasoning and decision making when you're looking at prioritizing and deciding what really is important.
Ashish Rajan: And actually, now that brings me to the build versus buy thing. Sounds like we are just adding AI [00:03:00] to, uh, vulnerability management, as I oversimplify this whole thing. Can I just use, I don't know, man, Gemini 3, ChatGPT 5-point-whatever is out now? Can I just build this intelligence layer on top of a vulnerability management program?
Especially if I'm an established vulnerability management program person?
Santiago Castiñeira : Um, technically you can. There's a lot of difficulties along the way. Some of them are things that you can probably figure out if you, uh, think hard about it. Uh, and there's others that are more nuanced, that you only get to understand, and understand the impact of, along the way.
Um, so the main thing that I would think about is: all of this, LLMs and agents and so on, it is all about context. So for any decision or any evaluation that you want to do about a vulnerability, or anything in your environment from a cybersecurity point of view, you need the full context, and you need the right, precise context that is relevant for the question you're trying to answer with the agent, right?
So this [00:04:00] means that behind the agent itself, which is something you probably can create as a script very quickly, you need to have very complex, scalable, maintainable, uh, data retrieval systems, uh, that basically help you bring the right context at the right time. Uh, and even then, you need to optimize those.
So, um, cost is the second dimension, and that's why I'm saying you need to optimize. Inference is fairly expensive, and especially in vulnerability management, as you know, as soon as you're a certain minimum size, you have millions of vulnerabilities open, and then things start to get very expensive and volumes are really high.
So, I mean, working with companies, tens of millions will be open for a while. Um, and if you want to investigate each one of them, even if it costs you a few cents, how often are you going to do that? Then it gets really expensive. So cost optimization comes into the picture. Um, and this is without thinking about the longer-term, let's say, maintenance and quality that you want to sustain, and evaluation of these systems, right?
So that's a whole, a completely different thing: you will need to [00:05:00] have your data system very, uh, organized, to know basically the provenance of the data. Uh, you might be audited, so you might want to show an auditor: this CVE, at this severity back then, we rated it like that, and that's why we didn't address it within the critical SLA, right?
So there's a lot of requirements that come down the line as you start building, um, that are a lot harder. And I mean, we can get a lot deeper into, into the LLM and agent side if, if you wish.
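To put rough numbers on that cost point, here is a back-of-the-envelope sketch; every figure is an illustrative assumption, not vendor pricing or a number from the episode:

```python
# Rough inference cost for investigating a vulnerability backlog.
# Every number below is an illustrative assumption.

open_vulns = 10_000_000             # "tens of millions" of open vulnerabilities
tokens_per_investigation = 20_000   # assumed context + reasoning tokens per finding
price_per_1k_tokens = 0.003         # assumed blended USD price per 1K tokens

cost_per_vuln = tokens_per_investigation / 1_000 * price_per_1k_tokens
full_pass = open_vulns * cost_per_vuln

print(f"~${cost_per_vuln:.2f} per investigation")  # ~$0.06, the "few cents"
print(f"~${full_pass:,.0f} for one full pass")     # ~$600,000 for the backlog
```

Even at a few cents per finding, one full pass over the backlog is six figures, and re-investigating on a weekly cycle multiplies that again.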
Ashish Rajan: Yeah. I mean, and you said all of that, and we haven't even covered the skillset part yet as well. I mean, who's gonna maintain this at your end?
Santiago Castiñeira : Yeah, absolutely. So this is, this is something I, I had quite a few experiences of this, uh, and this was, this is what I loved about, uh, joining the podcast on this topic because I encounter a lot of CISOs and a lot of security teams that basically jumped into building. Uh, and there's a lot of things down the line that, that become, uh, tricky, right?
So I, I remember one of the biggest stock exchanges, uh, in the world. Uh, one of the senior engineers in their team had built basically a few scripts on his laptop that were doing very specific, uh, [00:06:00] tasks. And then the challenge came when that person tried to leave the company, right? So he was the only one that understood what was happening, and could even maintain that system.
And he was just running it from his laptop; it would trigger once a day. Um, it was a very, very complicated situation, right? So you need skills that generally, um, based on at least my experience, security teams do not have: machine learning engineers, software engineers, data engineers that can build these types of systems.
Um, but they have this very hacker mentality in general that can get a lot of things done, which is great. But as a big enterprise, you need to think a little bit deeper about like, okay, how does that play out in a year of this system? So I'm going to be investing a lot of time, a lot of money on inference, on building new systems.
I'm going to become dependent on this system. But what about year two or three of this system in my organization?
Ashish Rajan: Hmm. I mean, because there's this whole conversation about how, these days, obviously there is a whole CSPM and CNAPP market, but there's also, now, yeah, everyone has an MCP as well. [00:07:00] It's almost like the Oprah moment: you get money, everyone gets money.
It's like everyone gets an MCP. Would that make this any easier?
Santiago Castiñeira : So the way I see MCP, it's great for basically having agents that are driven by humans. MCP is great for basically plugging a lot of things in. It needs to be properly secured, by the way; there's a lot of discussions around that and how to secure it.
Um. But you can have basically an agent where you trigger something, and then through MCP it does something and comes back. But that scales with the number of people operating these agents, right? Or using these agents. Right? But for a lot of the cybersecurity tasks, I think systems need to be autonomous and do work while you are sleeping, right?
And this is our case, right? So we provide hundreds of thousands of agents, um, to our customers, and that means that MCP has a hard time scaling and actually meeting the demands that you have, right? So you're going to be looking, potentially concurrently, into many [00:08:00] thousands or tens of thousands or more, um, basically loading context through MCP. That's one of the challenges.
You need to sort of optimize it as a data retrieval system, uh, I believe. And that's basically where MCP starts to have a few issues.
Ashish Rajan: You just mentioned, uh, retrieval augmentation as well. I mean, so maybe let's just start building one. And I would want to call out that a lot of people who may have been using AI for some time, they always do what you said.
You start with a few small scripts, using the tooling that OpenAI has been sharing, and I certainly feel that, oh, I already have my, uh, vulnerabilities in cloud from my CNAPP or CSPM. I already have the data that I need. All I need to do is just put some kind of wrapper of, uh, OpenAI on top of it.
But maybe we should probably start diving into what it looks like from an architecture perspective. If someone was to build AI-first cloud vulnerability management in their team, what would that look like, from an architecture perspective, in their practice?
Santiago Castiñeira : That's, that's very [00:09:00] good. So there are some key components that I would, um, define here.
So the first one is sort of your data: where the context is, and basically how do you build that context. So, data ingestion pipelines. Yeah. So let's say we start with vulnerability management. Let's say we'll need to have, I don't know, critical GCP, Azure, and AWS context. Let's say this is, I don't know, VPC security groups and, uh, EC2 instance definitions, right?
Um, that would be one part. You need to ingest that daily, let's say. And then you will have your CNAPP, your CSPM, that will give you additional context in terms of misconfigurations and vulnerabilities that were found in your infrastructure, right? So there you will have sort of like a list of
vulnerabilities and misconfigurations that were found, and we will have the context to try to analyze them. Um, so this needs to be, it sounds easy, but this needs to be a system that works at scale, that has good performance, that actually keeps the data very organized, so you can go back and look at specific data: what is the data that you had to do the evaluation?
[00:10:00] Um, so there's a lot of requirements there that are not that easy, and that's where the data engineering skills come into play, right? So that's not core to a security engineer, let's say, or a security team. So once you have that, that's the data that allows you to then do your agents, right?
And then, depending on how you want to run this, you will need to have some sort of environment or infrastructure for executing agents. So you can write a script, like a small LangGraph agent, um, that is very simple, that looks into this context and then makes the decision, right? But you'll want to run this thousands, if not hundreds of thousands, of times per day, or based on your, on your cycle, right?
But you will need to run it a lot. So you need, again, an execution platform, whatever it is. This could be APIs, this could be job or workflow based. There could be many different things, but this is again, another key component that needs to be maintained, reliable. You need to be patching it over time. So there's a lot of things there.
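To make that concrete, here is a minimal sketch of such a small agent, in plain Python rather than LangGraph to keep it short. fetch_context is a placeholder for your own retrieval system, and the Bedrock model id is just an example:

```python
import json
import boto3  # assumes AWS Bedrock as the inference provider, per the episode's examples

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def fetch_context(vuln_id: str) -> dict:
    """Placeholder: pull the precise context for this vulnerability
    (asset, VPC, security groups, exposure) from your data lake."""
    raise NotImplementedError

def assess(vuln_id: str) -> dict:
    """Ask the model for a contextual prioritization decision."""
    context = fetch_context(vuln_id)
    prompt = (
        "You are a vulnerability analyst. Given this context, decide whether "
        "the finding is exploitable in this environment and assign a priority "
        "(critical/high/medium/low). Reply as JSON with keys 'priority' and "
        "'reasoning'.\n\n" + json.dumps(context)
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model id
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # Assumes the model honors the JSON instruction; production code would validate.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```

The script really is this small; the hard part Santiago describes is everything around it: feeding fetch_context reliably, and running this hundreds of thousands of times a day.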
And then there's the whole thing of, like, evaluation of this agent that you just built. You build this agent: great, it works week one, it works phenomenally. [00:11:00] Then week two, suddenly you start seeing that the results are not great. How do you know? Of your hundred thousand, how many are off? How can you evaluate?
Are you going to sample some of them and just make a decision? Now you need to have basically machine learning pipelines that actually evaluate how the agents are doing, uh, that compute some metrics on them, and then you need to be on top of those metrics, and every week or every day look at those metrics and see how that works.
That's more of a machine learning skillset, right? So that's, I would say, the MVP for an enterprise. Right? So if you're a company of a certain size, if you have a security team of a certain size, the amount of data, instances, or devices, let's say, that you're going to have means there are some requirements in there.
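A sketch of the kind of sampling-based evaluation he means, assuming you log every agent verdict and have senior engineers label a small sample; the metric and thresholds are illustrative:

```python
import random

def sample_for_review(runs: list[dict], rate: float = 0.01) -> list[dict]:
    """Pull a small random sample of the day's agent runs for human labeling."""
    k = max(1, int(len(runs) * rate))
    return random.sample(runs, k)

def agreement_rate(labeled: list[dict]) -> float:
    """Fraction of sampled runs where the agent's priority matched the
    senior engineer's label: one simple daily metric to stay on top of."""
    hits = sum(1 for r in labeled if r["agent_priority"] == r["human_priority"])
    return hits / len(labeled)

def drifted(today: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag the day's runs for investigation if agreement drops,
    e.g. after a prompt tweak or a model version change."""
    return (baseline - today) > tolerance
```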
Ashish Rajan: Yeah, interesting. 'cause as you said that, uh, a lot of the examples that I've heard from people are more around, I don't know, I take a CNAPP or CSPM provider log, I get a collection of those logs, I use, uh, Claude or GPT or Gemini or whatever to go, hey, uh, find me patterns out of this, which one should be a [00:12:00] priority.
What you're saying is a lot more than that. 'cause are you building a system that is repeatable or are you building a, a one-off script that you run every time you feel like, actually I'm gonna work in cloud today and see what happens. Yeah,
Santiago Castiñeira : but exactly. So that, that is the challenge, right? So, and that's the decision.
So if you want to do something that is holistic, that really looks at all your vulnerabilities and helps you assess the situation holistically, you need to look at all of them. You cannot say, like, oh, uh, during my eight hours today as a security engineer, I ran the script on, uh, 3,000 of them and this is what I got.
What about the other 900,000 that you didn't look at, right? So at that scale, the problems are really big in terms of numbers. Um, on the other side, that doesn't mean that you cannot build some, let's say, investigation tools to help yourself. So for a security engineer, I think this is an area where a lot can be built, right?
So that is more of like helping you investigate things, help you bring context, evaluate context, uh, look at different type of things and, and then kind of do assessments. So [00:13:00] that can definitely be, be done, but it's a very different, uh, approach in a way. So one of them is like overnight, a lot of stuff gets done for you and you look at the results and then continue from there.
And then the other is like you still looking at the million vulnerabilities every day and trying to figure out which is the one, but it's a lot, lot harder and I don't think it's as holistic and properly attacking the problem. Let's say,
Ashish Rajan: do you have an example of an enterprise where they tried doing this and maybe using an internal agent framework and perhaps it broke down?
Santiago Castiñeira : Yes, yes. We have talked with multiple, um, and some of them very big enterprises, basically. Um, a lot of enterprises nowadays have kind of pushed for bringing in AI and have introduced some sort of framework, trying to provide that to development teams. Um, there's one specifically where they pushed a framework forward, but they built it on top of one of the existing frameworks, uh, LangGraph and LangChain, and they limited it in some capabilities already.
LangGraph and LangChain are frameworks that already have a lot of abstraction layers, so they added more layers of abstraction on top. So there, the amount of issues that they [00:14:00] encountered, um, is tremendous. And actually now their security teams are telling us, like, yeah, this doesn't work. We cannot build the agents that we need with this.
Uh, and this is just from the, let's say, building the frameworks and the tools for being, being able to build that little script that we talk about. Um, and then those companies are now looking outside seeing, okay, this requires a lot more than we thought. Uh, how can we get help on this one? So
Ashish Rajan: Actually, for people who may not know what LangChain and LangGraph are, yeah.
What are they, and what do they have to do with what we're building?
Santiago Castiñeira : Yes. Very good question. So LangChain and LangGraph are basically from the same, uh, creators. They're two libraries that were, uh, among the first ones, and very popular, for building initially basically any LLM workflow, let's say, with LangChain, and then LangGraph introduced basically a lot more agentic, uh, features.
You can call, you can use a tool to look at certain data. You can create an agent that sort of like goes into a loop, calling [00:15:00] tools and making decisions, and provides an output. So they're some of the most popular, uh, frameworks; there are a lot of frameworks out there. And this is another topic I blogged about, I wrote about this, uh, agent frameworks for scaling.
It's really hard. It's, uh, now it's getting better, but I think there's no one framework that really solves all the problems that you have. In the end, you need to have a combination of multiple frameworks for you to be able to actually do everything you need in a cost-efficient and effective manner.
Let's say, and this is, again, more of the challenge of a new technology. Up and coming, uh, still a lot being developed. Uh, and there's a lot of hurdles there too.
Ashish Rajan: Yeah. I mean, there's a lot to unpack there. But I would love to get into, maybe, if I just take a step deeper into this.
I'm a CISO, yeah. And obviously I have a huge push from my CTO, uh, and the organization to just, hey, just 10x the use of AI in your organization. Um, where am I starting? Am I starting with LangChain, or am I starting with deciding [00:16:00] on Gemini? Where am I starting in this process? What am I needing from a tool and data team kind of perspective?
Santiago Castiñeira : Yeah. It's a very, very good question. So I think that basically figuring out what is the problem, or a specific, uh, space you want to look into within vulnerability management. I mean, prioritization is one of the ones that comes up, uh, first, but there's a lot of them that you could be looking at. Um, and so I think the idea is first figure out what's the context that, let's say, a security analyst would need to make a decision.
Uh, this is sort of how we think about it: try to follow the human process of decision making. Um, because nowadays, I mean, we work with a lot of, uh, very senior security engineers, and they tell us basically the situation, what data they look at, how they make a decision. And I think
that's probably a good place to start. Um, and generally these security engineers that I work with, they have very established processes. They know exactly what they want at which point in time. So in that regard, whether you can collect all the data together or query it from [00:17:00] different systems, basically have a way to get that data.
Um, then obviously you have the skills from the software engineering side of things: being able to build these agents, being able to deploy them in some environment. Um, I would say those are the absolute bare minimum, uh, to start. Um, but this will just give you sort of an understanding of the potential of the technology. Then making it sustainable, maintainable longer term, reliable, so that you can improve it, so that you can, let's say, trust the risk posture of your organization based on the system, that is a bit more down the line, and takes a lot more of the components that we talked about earlier.
Interesting.
Ashish Rajan: Wait, so, 'cause you already mentioned a software engineer, a security engineer. Yeah. And you also have, like, a data engineer. Yes. Like the two skillsets that normally don't exist in security. Uh, security usually just has security engineers. So if I'm a CISO, to put it on the spot, I already have to consider, uh, borrowing a data engineer or a software [00:18:00] engineer from another team. Yes.
To be able to bring this over. And 'cause a lot of people may oversimplify this, because a lot of people have been using the console for ChatGPT or Claude or whatever.
Santiago Castiñeira : Yeah.
Ashish Rajan: Uh, but that's, you're not referring to that when you talk about tooling from an agent perspective, right? No. What's, like, if I were to just use an example of, I don't know, man, I'm on AWS, uh, I have logs coming out from CloudTrail that I send into an S3 bucket, um, which is collecting my CloudTrail.
Like how am I piping that into this? With a software engineer and a data engineer, what are they doing?
Santiago Castiñeira : Yeah, yeah, yeah, yeah. So that's, that's the part, right? So, um, this is basically another source, right? So, uh, a log that comes through some sort of, uh, stream is something that you basically need to pipe into some sort of, either, either you process it on the fly and decide, I want to look into this log deeper or not.
Uh, so you can have some sort of, let's say, a Kafka stream, and then you pick it up, look at it, see if it makes sense, then queue it into another system to look [00:19:00] deeper into it. Maybe there's a certain combination of attributes that you can, uh, basically evaluate with the agent. Again, logs are a tough use case for LLMs because of the cost, uh, inference cost, right?
So remember that the number of, uh, log entries is just huge. Yeah. So the cost is something to be very careful about in this situation. Um, so then that will be evaluating it, putting it into some queue for some other, uh, system that you built to be looking into it. And then this could be, again, as simple as a Lambda, let's say, and then you will be calling, let's say, Bedrock to do the inference, right?
Um, this is kind of like, sort of the absolute minimum if you want to look at logs, right? Um, but in general you could have some sort of system like that, that eventually after the agent looks into it, it could alert you about certain conditions again. But this is just, it's just icing on the cake. This is just the very basic, right.
So how do you explain to your CISO why you, um, raised, uh, a [00:20:00] red flag and alarm on that event that the Lambda notified you about? So the first thing you will want to do is go look at: did the LLM actually make the right decision? What data was the LLM looking at? Uh, where did the data come from?
And a lot of other things like that. That observability requirement, you need to build it into the system. It's not that it's going to happen automatically, that everything is going to be there. Right. So yeah, there's all of these things that are sort of afterthoughts. Yeah. That, in the case of LLMs, and I think especially in terms of security, is such an important, uh, topic.
You need to be very sharp there.
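A hedged sketch of the minimal Lambda-plus-Bedrock path described above, with the observability written in rather than bolted on. The event shape, bucket name, and model id are all assumptions:

```python
import json
import boto3
from datetime import datetime, timezone

bedrock = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")
AUDIT_BUCKET = "security-agent-audit-trail"  # hypothetical bucket name
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # cheap model: log volume is huge

def handler(event, _context):
    """Triage one pre-filtered log entry delivered via a queue or stream."""
    log_entry = event["log"]  # assumed event shape
    prompt = (
        "Assess whether this CloudTrail event indicates suspicious activity. "
        'Reply as JSON: {"alert": true/false, "reason": "..."}\n\n'
        + json.dumps(log_entry)
    )
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    verdict = response["output"]["message"]["content"][0]["text"]

    # Persist exactly what the LLM saw and said, so "why did this alarm fire?"
    # is answerable later. This is the observability built into the system.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_log": log_entry,
        "model_id": MODEL_ID,
        "verdict": verdict,
    }
    s3.put_object(
        Bucket=AUDIT_BUCKET,
        Key=f"triage/{record['timestamp']}.json",
        Body=json.dumps(record),
    )
    return record
```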
Ashish Rajan: Yeah. I mean, we haven't even gone to the LangChain part yet. Like, this is before any of the LangChain came in.
Santiago Castiñeira : Yeah, yeah, yeah. And exactly, I mean, that Lambda in the end could be, let's say, running, uh, a small chain, right? So that's where you could have a LangChain or a LangGraph, uh, small agent that is making some sort of call to Bedrock, doing the, the LLM call. And again, in there
there's a whole world that we didn't talk about, but it's like the prompts: [00:21:00] how do you make sure that the results are consistent? So this log coming in tomorrow or in a week, I want to have the same behavior, right? I verified it today, works great. What about next week? What about the week after?
Right. This variability, and this is, in a way, the whole hallucination problem of LLMs. Yeah. Um, that requires a lot of careful tuning of the prompts. Testing, evaluation, constant evaluation; not just at development time, where you evaluated it and it's fine. Down the line, you need to keep evaluating the outputs to see that everything is still in line.
And then as you introduce changes. Let's imagine that you were looking at, um, VPCs and security groups, and now you're looking at load balancers and a web application firewall, right? Suddenly it could be that the prompt is not good enough anymore, and the LLM is confused and starts failing left and right.
So you need to detect that and be able to react and change it. Ideally not in production. Yeah. In a separate [00:22:00] environment where you're running these things before putting them into production. Right. So there's a lot of complexity. At first it sounds easy, but if you want to build it maintainable, sustainable for the long term, uh, and be able to rely on it, um, it's very tricky.
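One common way to catch that before production, sketched here under the assumption that you keep a small "golden set" of cases a senior engineer has signed off on; assess_fn stands in for whatever agent entry point you are testing:

```python
GOLDEN_SET = [
    # Hypothetical labeled cases with known-good answers.
    {"context": {"resource": "sg-open-to-world", "port": 22}, "expected": "critical"},
    {"context": {"resource": "sg-internal-only", "port": 443}, "expected": "low"},
]

def regression_ok(assess_fn, threshold: float = 0.95) -> bool:
    """Re-run the agent over the golden set after any prompt or model change,
    and block the rollout if agreement drops below the threshold."""
    hits = sum(
        1 for case in GOLDEN_SET
        if assess_fn(case["context"])["priority"] == case["expected"]
    )
    score = hits / len(GOLDEN_SET)
    print(f"Golden-set agreement: {score:.0%}")
    return score >= threshold
```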
And I have this, uh, thing about relying on it, and why I keep repeating it, because I work with some public big tech companies that have a lot of, um, audit requirements, um, especially around cybersecurity insurance. And basically they need to be able to provide to the auditors the snapshot in time of the CVEs that they detected, the full context that they had there, and what were the decisions they made, to make sure that they are according to their SLAs.
And there's a lot of things that change over time. Uh, like for example, um, CVE severity changes. So sometimes three months ago it was, uh, a medium, and today it's critical, because something was discovered in this recent CVE, right? So this kind of thing, you need to be able, as a security [00:23:00] team, to explain.
Why you are within SLAs, and why your cybersecurity insurance premium should not go up. Right? Yeah. That requires a level of organization and basically having your house in order, to be able to pull that up quickly and explain to auditors.
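In practice that audit story implies storing a point-in-time record with every decision. A minimal sketch of what such a record might hold; the field names and values are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DecisionSnapshot:
    """What we knew, when we knew it, and why we acted: the record an
    auditor would ask for. Field names here are illustrative only."""
    cve_id: str
    assessed_at: str           # ISO timestamp of the assessment
    severity_at_the_time: str  # e.g. "medium"; may later be reclassified critical
    context_snapshot: dict     # the exact asset/network context the agent saw
    decision: str              # what was done, and why
    sla_applied: str           # which SLA tier governed the decision

snapshot = DecisionSnapshot(
    cve_id="CVE-2024-0001",  # placeholder identifier
    assessed_at="2024-03-01T09:00:00Z",
    severity_at_the_time="medium",
    context_snapshot={"asset": "i-0abc", "internet_exposed": False},
    decision="deferred: not reachable, outside critical SLA",
    sla_applied="medium-90d",
)
print(json.dumps(asdict(snapshot), indent=2))
```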
Ashish Rajan: Man, it's like, this is just to get the list of false positives.
Exactly. Yeah. Like, I haven't even gotten to the point of, I'm gonna take an action. I don't even know the timeline I would take. But the few things that I noted over there were: if you are someone who's thinking about building this, and you go down that path, you need to think about the continuous stream of logs that's gonna add to the token cost. Where you put this, how you put this, and how you use this, that's gonna add to it.
Second, then, is the skillset that you're gonna look at, from a: am I using, uh, the cloud for inference? Am I doing it locally? And if I do it locally, or if I do it in the cloud, uh, the cost associated with it, but also the, uh, [00:24:00] logistics around it. Would you go for Lambda with, you know, LangGraph, where the data is streamed to, once you've decided what you're sending?
Once you've done that, now you have a prompt lifecycle that you need to manage, a CVE lifecycle to manage, a threat intel lifecycle to manage. I mean, obviously I'm not trying to deter people from this. Like, if people wanna experiment, you know, that's why we are in technology.
You should definitely go give it a shot. But if they were starting to go, you know what, great Santiago, but I'm still very determined: what are some of the two or three things that they must consider, if they wanted to go down this path? 'cause I imagine not every enterprise, uh, is at that stage, but there might be people who are much smaller startups or scale-ups.
I'm thinking about those who have a very build-first kind of mentality. Yeah. What would be the top three things you think they should have, if they wanted to go down this path?
Santiago Castiñeira : Yeah. Very good question. First of all, if [00:25:00] anyone out there, any CISO, tries to build internally and wants help, I'm happy to help completely free. Like, not trying to sell you Maze, but this is a complex topic and we're happy to help others with this.
So I think in my mind, kind of connecting this with the CTO, uh, discussion: what I would do, to be honest, is basically look at the problem you want to address, um, and then go to your CTO with basically an understanding of the problem you're trying to attack. Saying, like, look, this problem, we're spending this amount of time on it, or it's increasing our risk significantly under these, uh, conditions.
Um, in order to address it, we want to build a system that does these things. Kind of mention a few of the key components, and then describe the staffing or the requirements that you will have in there, right? And then you will have a minimum setup. And this could be very minimal, right?
So you could start very small, with one or two engineers that have the right skillset or the right combination of skillsets, right? But I think that's what I would start with, uh, in order to try to build this. [00:26:00] Um, going from having the problem to trying to solve it with a script, I feel down the line it hurts more than it helps, um, because it's a false perception that you've got a handle on it.
Um, and then over time things will get a lot harder. So that's what I would say: those are the, uh, key three things that I would, uh, look into to really address the problem and solve it in some way. And as I said, happy to help, like, completely. Send your questions my way.
Ashish Rajan: Really, would I need some kind of a data lake or something as well while I'm doing this?
Yeah.
Santiago Castiñeira : Yeah, I think that's a whole data retrieval system, right? So, um, as I mentioned at the beginning, I worked on pipelines for ingesting vulnerability and asset inventory data, specifically across multiple, uh, cybersecurity tools. And the best strategy there, in my opinion, is having a data lake at a certain scale, right?
Uh, where you have all of your, um, known vulnerabilities, all of your assets, all [00:27:00] of the context that you need, organized per day. Uh, and then the agents will basically be looking into that one: okay, we're running the investigation today. What's the context today? Let's run with that context, and so on.
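A sketch of that organized-per-day idea, assuming an S3-style object store; the bucket and paths are made up for illustration:

```python
# Partition-by-day layout so an agent can always answer "what was the
# context on day X?". Paths below are illustrative, not a standard.
#
#   s3://sec-data-lake/vulnerabilities/date=2024-03-01/findings.parquet
#   s3://sec-data-lake/assets/date=2024-03-01/inventory.parquet
#   s3://sec-data-lake/network/date=2024-03-01/security_groups.parquet

def context_path(dataset: str, day: str) -> str:
    """Resolve the snapshot an agent should read for a given run date."""
    return f"s3://sec-data-lake/{dataset}/date={day}/"

print(context_path("vulnerabilities", "2024-03-01"))
```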
Ashish Rajan: Damn. So we have an execution framework or execution fabric, the data lake. Yeah. Then I think you mentioned evals earlier as well; you kind of need to have figured out how you're gonna evaluate this. Maybe you wanna change the Claude component as well.
Santiago Castiñeira : Oh, yeah. Yeah. That, that's a key one actually, I, you touched there on a very important topic.
Um, in our experience, changing from one model to the next version, or to another model, so if you go from Claude 4 to 4.5, or Claude Sonnet 4 to 4.5, or Gemini 2 to 2.5, the difference is significant. So if you are evaluating the outputs, the moment you change the model, generally you're going to see that all your metrics suddenly go a little bit out of whack.
Yeah. And you need to try to understand: is it good, is it bad, how can I mitigate that? Right. [00:28:00] So that's the whole part of the machine learning, let's say, ops, or pipelines. That is, yeah, evaluating, like, a percentage of the executions, having some metrics that you trust. Those metrics change.
By the way, you cannot have metrics from now until end of next year. You will need to be changing over time because eventually the metrics become less useful. Um, you need to monitor those as you do changes. Make sure that the changes are in the right direction. The metrics are moving the way you want.
And that's a whole effort. This is probably one of the hardest skills in there, um, of what we're describing. So the software engineer, the data engineer, I think those two are a little bit, uh, easier to get. Yeah. But this one, the machine learning... I remember one big enterprise that I talked to in the past, and they were doing the build versus buy.
And they told us, like, yeah, we have only one machine learning, uh, engineer in the engineering team, but I don't get time from him to help us. Yeah. And it was just that situation. It's like, those skills are really hard. [00:29:00] Uh, it's not that you're going to ramp it up in two weeks or a month.
Ashish Rajan: Yeah. I mean, I guess 'cause it's more and more sounding like the AI agent is not just another feature, it's more like a separate pipeline.
Like it's by itself. Yes. Even if you were to do it yourself or a vendor doing it.
Santiago Castiñeira : Yeah. Yes. I, I do agree with that. I think it's a, it's a complete new methodology, especially on the evaluation, on the prompt changes on the building, the tools for them, it's, it's quite different than any previous, let's say, machine learning, uh, technology,
Ashish Rajan: Because you probably maintain the entire ecosystem. I mean, we kind of touched on one pipeline of this. Yeah. We didn't even mention everything else you may wanna do in security as well. This is like a tiny sliver of what cybersecurity is. Correct. And we're going, oh wait, if you're just doing that as an AI, maybe you have a dedicated team for it. But does that mean now the way people measure vulnerability management should change as well, based on this?
Like, if you're using an intelligence layer on top, what's the right way to measure [00:30:00] what a good vulnerability management metric would be? Especially for people who are going towards 2026 and already considering, hey, what's an uplift for AI that I should be doing? Uh, what's your recommendation there?
Santiago Castiñeira : That's a very good question. Um, yeah, to be honest, I think there are a few things that are changing in terms of how we measure performance for teams. Um, basically, in a way, we have a lot more productivity.
Yeah. In terms of, like, potentially we'll be able to go through a lot more than we were before, if we have agents. Yeah. So I think then the question becomes looking at sort of the impact: let's say the reduction in risk, uh, or the improvement in the things that we have under control. One of the key things that we hear over and over, and for me it's been years and years now, is that we cannot be on top of all the vulnerabilities. There's so many, right? So
[00:31:00] that perception of, okay, we are on top of them. Like, yes, this tool actually reviewed a bunch of them. These are false positives. These are the ones we evaluated, these are the critical ones. These are taken care of. So that perception of the team being on top of things, this is one of the key metrics that we get, uh, from security teams when we talk with them.
That is, when they see what Maze will bring to the picture. Right. Um, that's precisely the value that we're looking for. It's like, okay, you've got it. This means a lot more productivity. It's a lot more, um, understanding and control, let's say, of, uh, your risk profile, in a way. I think that's one of the ones; it's hard to measure, and it depends how your organization measures it specifically.
But, um, yeah, I think it's, that's one of the, the, the key ones.
Ashish Rajan: Would there be any change to the whole, uh, mean time to resolution, MTTR, or criticals being closed, or, I don't know, yeah, engineering hours invested? Are there metrics around that as well? A hundred percent.
Santiago Castiñeira : [00:32:00] A hundred percent. And I think there, the interesting thing is that I think organizations need to start to get used to having kind of, uh, false positive, um, processes.
Like, how, when you say you find, uh, a false positive, how do you declare a false positive? How do you properly document why it was a false positive? Um, and yes, I mean, MTTR will definitely change. The number of criticals you're going to get probably is going to change significantly too. Uh, we talk with a lot of organizations that are like, we're in the zero criticals club, uh, which is great. But what about the high that is actually critical, but is, uh, misclassified because it doesn't understand your context?
Right? So those are actually, uh, criticals that are hidden that are a lot more important. So I think yes, MTTR is one of them, but also the, uh, understanding the real severity and the real impact in your infrastructure. And then how are you managing those? Yeah. Uh, I think that's very important.
Ashish Rajan: Yeah. Would you say, um.
When you guys were building [00:33:00] the, you know, the agent, uh, and basically using that for some of your customers, was there something that surprised you, outta curiosity? Like, I don't know, like a human-like insight, or were there more mistakes? What was, like, the, uh, surprising thing out of it?
Santiago Castiñeira : Yeah, there, there's a lot of, uh, along the way there's been a lot of this.
So, um, in some of the earlier versions of the product, uh, one of the things, the first thing that surprised me, let's say, is how human-like the discussions felt. So, uh, basically we had some sort of, uh, initial prototypes where, uh, there would be different roles, uh, for different agents, and they would be
discussing with each other about specific topics. And I remember this one where we had one of the roles as the project manager, and the project manager told the others, let's get back on topic, because they were going off track, and our goal is to do this specific, uh, investigation. But they were, like, talking about specific technical details or something else.
I [00:34:00] love that one. It's one of my all-time favorites. Um, there's been others. Another key point for me was when I realized the depth that the agents are getting to, um, in a single, let's say, uh, run was, uh, more than a single staff security engineer would know. So they will be touching such deep topics, and multiple of them in one run, that security engineers would not be able to assess it.
Uh, like, manually assessing, right? So we do a lot of manual assessment and labeling, and this is from very senior engineers. And some of them will be like, yeah, those two I understand, but that one I need to learn, because I don't know exactly what is the right way of doing that. Um, I think this is part of what's coming with agents, and where intelligence, again, even at the current levels: if an agent is a good expert in a bunch of different things, of which you only know some, it [00:35:00] feels that they're extremely, extremely smart. But it's because they have a certain level of depth, but across so many fields.
So those are, I think, some of the areas where I found, yeah, this is something different in how we do vulnerability management. Awesome.
Ashish Rajan: Uh, sorry, a quick side note: do you still have time? 'cause we only have 10 more minutes left, and I think we've got a few more questions. Do you have time to run over?
Santiago Castiñeira : Well, I don't... yeah, yeah, yeah. Okay, sweet. Lemme just check, but, uh, I think I'm clear. Yeah, I have half an hour more, so.
Ashish Rajan: Okay. Yeah. Okay. Yeah. Perfect. Yeah, because I think we shouldn't need the full half an hour. There's some good questions here; I don't wanna miss any of them as well, just so I was like, okay.
Santiago Castiñeira : Yeah,
Ashish Rajan: yeah. So, um, okay. Because I think, so that, that's a good one because to your point, I also have been in a trap where I would start asking about a vulnerability to, I don't know, one of the models that I'm working with, and I'll just completely go on a different rabbit hole because as a curious, technical person, you wanna understand the, the depth of why and how and everything that is around that one particular problem.
But to [00:36:00] the example that you called out, it may be taking you off track from what you're actually trying to do versus what it should be doing. Yes.
Santiago Castiñeira : Yeah. So this is definitely, uh, a key problem. And again, this is another part of building the systems, right? All of that conversation off track, you pay for it.
Yeah. So you don't want them to do that. Yeah, yeah. Yeah.
Ashish Rajan: There's a lot that's being spent for, like,
Santiago Castiñeira : using the internet, videos.
Ashish Rajan: Look,
Santiago Castiñeira : yeah. Let me go look at more logs and retrieve more data, because I'm just curious about it. It's like, no, please stay on topic. Yeah. Um, try to get there. Um, and this is, this is the whole thing.
Yeah, yeah. Sorry, go ahead.
Ashish Rajan: No, no, I was gonna say, that's a good point as well, because every request you make to your LLM model is actually costing you money. So it's not the same as, hey, it's only a few cents, right? Because you go down a rabbit hole, keep asking questions, and suddenly you've sent, like, I don't know, thousands of tokens, and you're like, oh, by the way, [00:37:00] we can't use AI anymore for the rest of the month because we went over our quota.
Santiago Castiñeira : Yeah. Yeah. This is one of the key things, right? So understanding the costs and getting ahead of them. I talk with a lot of, uh, engineers that are building similar systems, and very often, um, and this is, like, outside of Maze, like interviewing and other sources, but helping other people,
very often they want to send a lot of data to the LLM to review, and that has a lot of problems aside from cost. But one of the main ones is, you need to think: how does that scale? Mm. So if I'm going to fill the context, and I'm going to do five requests and then get an output, and I need to run this a hundred thousand times a day,
it's significant money. Um, so this is one of the areas that I think is definitely not intuitive: how many tokens are required.
Ashish Rajan: Yeah. And people don't even talk about it. I think it's treated like software where you can just keep asking questions; unless there's a guardrail to say, hey, stop spending money now, there's no way to stop that. [00:38:00] I think, uh, I was talking to someone, uh, actually, this is maybe a good segue into the whole multi-agent world as well. 'cause people, mm-hmm, don't even have just one agent these days. They believe in that whole idea of, uh, especially the ones who are building, and to be fair to them, that's how you're scaling it.
There's multi-agent. Like, if you have a fleet of security agents, I would even think governance. Like, we are obviously a security team building a fleet of security agents. Uh, what would governance be like in that particular context? 'cause we spoke about what it looks like without security. What's the security component in there, and how does that scale with multi-agent?
Santiago Castiñeira : Yeah, and this is a massive, massive challenge, right? So: what are those agents allowed to do? Yeah. What are the actions they can take themselves? Which other agents can they communicate with? You know, it's like, think about the unintended consequences, what's kind of the worst-case scenario.
Um, and we have here a very small [00:39:00] anecdote, but very early on we had some agents investigating a vulnerability on an image. Um, and what happened, because they didn't know the current date, um, we were not putting the date into the prompt, is the agents were thinking it was the date they were last trained.
Ashish Rajan: Oh. So.
Santiago Castiñeira : They suddenly look at this vulnerability on this image. They look at the date and they say, look, this image was just published (it's in the future, so it was just published), so it very likely will go to production. We had a bug in the credentials that they were assuming to do that investigation: they could actually delete.
So they themselves, and this was a multi-agent setup, uh, early on, they went and deleted that image from the repository, because they thought this is clearly a very severe vulnerability. Which was correct. Yeah, but please don't delete it. So this is the whole challenge around it. And this again:
if you think about a company having multiple teams building agents in the security, uh, [00:40:00] domain, each one having different tools, there's unintended consequences of these interactions. I think that's where guardrails come into place, where your policies come into place. And we haven't talked yet about where your data goes.
Um, so you need to have your own data policies, right? Are you fine with your security groups going to Bedrock? Are you fine with them going to, I don't know, xAI or to another provider, right? The best models change over time; there's more expensive, cheaper. So: defining your policies upfront, to understand what is allowed and what is not allowed in terms of where the data can go, what data can go where. Maybe some data can go to some place; other, more sensitive data cannot.
Um, so I think this is part of the whole governance that needs to be thought through, basically ahead of time, to make sure that you don't have surprises down the line.
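Those upfront policies can be as simple as an explicit egress map checked before any inference call. A sketch; the data classes and providers are examples, not recommendations:

```python
# Which data classes may be sent to which inference providers.
# Classes and provider names are illustrative only.
POLICY: dict[str, set[str]] = {
    "public":     {"bedrock", "gemini", "xai"},
    "internal":   {"bedrock", "gemini"},
    "sensitive":  {"bedrock"},  # e.g. security group definitions stay here
    "restricted": set(),        # never leaves your environment
}

def egress_allowed(data_class: str, provider: str) -> bool:
    """Check, before every inference call, which provider may see this data."""
    return provider in POLICY.get(data_class, set())

assert egress_allowed("internal", "bedrock")
assert not egress_allowed("sensitive", "xai")
```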
Ashish Rajan: Actually, because we spoke about the AI agent pipeline: how different would that be in a multi-agent world? Is it the Lambda function with the [00:41:00] LangChain talking to another Lambda function with the LangChain?
Santiago Castiñeira : Yes. Correct. So the, I mean, not specific Lambda, I mean, you could do it with Lambda, but basically there will be multiple agents deployed in places and then they could receive a message from another agent saying, Hey, like an API call. I'm this agent. Yeah, it could be an API call, it could be a, a message in a queue.
It could be an event that is triggered. It could be many things, right? Or it could be multiple things. Right. Imagine you have a specific, uh, event that says, okay: um, impossible travel, trying to log in, failing authentication, and suddenly succeeding. So someone's trying to log in, and suddenly logs in from London, from New York, from, uh, Russia, and suddenly succeeds from North Korea.
Ashish Rajan: Yeah.
Santiago Castiñeira : What you want to do there is probably trigger an investigation immediately. So it could be that that is an event that goes onto a bus. Yeah. And then there are multiple agents that go, okay, let me go look at the logs in different places, and then collect all that information, or that intelligence, somewhere else.
This is not so much vulnerability management; it's more like incident [00:42:00] response. But that's kind of the general mechanism. Together, those agents, yeah, are aware of other agents, and they are aware that they can ask questions. Or maybe you can have more like oracle-type agents, where you have an agent that is an expert on, I don't know, VPC security.
So when another agent has a question about VPC security, it could ask that agent a specific question about VPC security. And then the VPC security agent is the only one that has full context and understanding of certain things, and is prompted in a specific way, right? So that's how the agents start to interact with each other, based on expertise and what they need.
Right.
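A sketch of that oracle pattern, where agents route domain questions to a registered specialist. The registry, domain names, and functions here are all hypothetical:

```python
from typing import Callable

EXPERTS: dict[str, Callable[[str], str]] = {}  # hypothetical expert registry

def register(domain: str):
    """Decorator that registers an expert agent for a domain."""
    def wrap(fn: Callable[[str], str]):
        EXPERTS[domain] = fn
        return fn
    return wrap

@register("vpc-security")
def vpc_expert(question: str) -> str:
    """Specialist: in a real system this would call an LLM with the full
    VPC context and a domain-tuned prompt."""
    return f"[vpc-expert] assessment of: {question}"

def ask_expert(domain: str, question: str) -> str:
    """Generalist agents route questions outside their expertise here."""
    if domain not in EXPERTS:
        raise KeyError(f"No expert registered for {domain!r}")
    return EXPERTS[domain](question)

print(ask_expert("vpc-security", "Is sg-0abc reachable from the internet?"))
```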
Ashish Rajan: Yeah. Yeah. And I guess, to what you said as well, now your, quote unquote, security fleet is a mix of, uh, I won't say living and breathing, but very interactive API calls being made across the board. Yes,
Santiago Castiñeira : yes. I think that's where we're headed, to be honest. That seems like a likely future, in terms of having sort of semi-autonomous systems that do a [00:43:00] minimum amount of decision making,
uh, and especially report on what they find. Yeah. I think the deep dive into looking at the logs, uh, understanding a lot of details, I think a lot of it can be automated, uh, and that output from the agents could then be consumed by a human, for example. So in a way, it's not that you're replacing the human; you're just giving them the smarter work, in a way.
Uh, and then over time, basically, it's going to be higher and higher level things that they can start, uh, resolving.
Ashish Rajan: So, talking about the future as well: because I feel there are obviously two sides to the whole AI narrative. One is the productivity side, and one is the, hey, I'm gonna make, uh, my revenue-making applications way more AI-integrated.
And that's different people. Mm-hmm. A lot of people put this particular conversation in that productivity camp, as in, if I give AI access to my security engineers, they would be able to, mm-hmm, revolutionize whatever they do, automate. So, 'cause the more I speak to you, the [00:44:00] more I realize,
and this is kind of across the board with everything that I'm finding, I'm finding myself in that same camp as well: no company out there, unless you're a cybersecurity company, would be building a productivity AI security thing as a product. 'cause you're literally just building a product inside your company if you were to go down the path of doing this.
But what components are something they can do with AI that is very much in that realm of, hey, I'm not building a product? I'm sure there are better examples than making my email less angry. Have you seen some examples in the security space of what people can do for AI productivity, outta curiosity?
Santiago Castiñeira : That's a very good question. Um, so one small nugget there first: I think there are some companies that are building these products internally, and for certain companies it makes sense that they build them internally. Certain companies that are extremely good at software, that build a lot of their software [00:45:00] internally. I think those companies will continue to build a lot of these products internally.
So I think those are hard to change. It's just, I think there's a massive gap, and there's a massive amount of companies that should really think: is this the core, uh, investment that they want to make, and basically think thoroughly about the long tail. But, uh, thinking about things that companies could build outside of the main things:
um, what I would say is, some of those evaluations, those agents that look at, uh, outputs from different places and then either notify, or do some sort of, uh, alerting, or, um, kind of an initial assessment of things. I think those agents are probably easier, uh, to build, in a way. And, again,
they provide more productivity, uh, um, let's say automations, in a way.
Ashish Rajan: Right? Because I think there's also this theory that, like, for example, uh, a vendor would be able to fill that gap with, hey, here's [00:46:00] the AI system, but I want you to maintain an internal knowledge source for your RAG.
Yeah. You have a RAG pipeline, you have vector databases, you maybe have your own MCP connections to talk to certain specific things. Yeah. You don't want to give them direct network access; you only wanna give them API access. Um, so in my mind, we are almost turning security teams into these guardians of knowledge sources, for lack of a better word.
And you still end up, I mean, irrespective of whether you make the choice to buy or to build, I feel like that's kind of where we are leaning towards, because that's where it makes more sense for it to be, I guess, the time well spent, for lack of a better word.
Santiago Castiñeira : Yeah, I think that makes sense. I think I agree with you on this one. It's probably time well spent in that direction. Um, building those types of tools can really help unlock, uh, certain situations. Um, the [00:47:00] key there is, and you touched a topic there that is really key to me, that is, uh, RAG. So RAG, in our experience, works great for, um, documentation-type data. But
when I want to know some specific technical detail of part of the infrastructure, RAG, uh, systems are really tricky, because they're probabilistic and they might leave that one critical detail out. Um, one specific example that I have in mind is version numbers.
Yeah. So we built a rag, um, uh, system with a lot of data and one of the questions that in one of the agents asked was specifically about version numbers. RAG system was terrible at giving us the precise, complete list of those, uh, version numbers, right? So you need to be careful about what questions are you going to ask your right system.
It could be great, for example, to imagine you have engineering documentation inside of your right system and you want to try to assess the importance of a [00:48:00] specific system. That might be great because you will basically. Retrieve context around specific, uh, uh, description of a system. Um, but for security and when you want to make decisions that, uh, that are critical, you need to make sure that you have all the version numbers you need to make sure that you have the full security group description, let's say.
In the context of the agent that is going to make the decision. And if you pick it from a rack system, you might not just get the whole thing. Yeah. So, yeah, this, this again, lessons along, along the way that we learn.
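[Editor's note: a minimal sketch of the pattern described above, assuming a hybrid setup where RAG serves only descriptive documentation while precision-critical facts like version numbers come from an exhaustive structured query. The `vector_search` and `inventory_db` interfaces are hypothetical placeholders, not Maze's actual stack.]

```python
# Sketch: hybrid context assembly for a security agent.
# Assumption: `vector_search` is any embedding-based retriever over
# engineering docs; `inventory_db` is a structured asset database with a
# DB-API style execute(). Both are illustrative stand-ins.

from dataclasses import dataclass

@dataclass
class AgentContext:
    narrative_docs: list[str]   # probabilistic top-k retrieval is fine here
    exact_versions: list[str]   # must be complete -- never fetched via RAG

def build_context(host: str, vector_search, inventory_db) -> AgentContext:
    # RAG is acceptable for fuzzy, descriptive questions:
    # "what does this system do, and how important is it?"
    docs = vector_search(
        query=f"purpose and business criticality of {host}",
        top_k=5,
    )

    # Version numbers are precision-critical: a top-k retriever can
    # silently drop entries, so use an exhaustive structured query instead.
    rows = inventory_db.execute(
        "SELECT package, version FROM installed_packages WHERE host = ?",
        (host,),
    )
    versions = [f"{pkg}=={ver}" for pkg, ver in rows]

    return AgentContext(narrative_docs=docs, exact_versions=versions)
```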
Ashish Rajan: Interesting. So, hopefully at this point people are more aware of what it actually takes to build AI-first vulnerability management. If they were to buy a solution in this space, how should someone evaluate it? Because right now everyone is calling out AI, and, no offense, it's hard to separate the signal from the noise. As someone who's been working in this space quite deeply, what can people [00:49:00] do to evaluate solutions here? Where is the line between an AI band-aid and AI-first?
Santiago Castiñeira : Absolutely. The way we think about it is: if tomorrow Bedrock is down, if tomorrow your inference provider is down, is your product working? For an LLM-first company, there's no product; the product will not do anything that day.
For a company that sprinkles LLM features on top, the main logic will still be there, the main decision making will still be there, your scan will still produce results. You just won't have that one feature where you click a button and something happens and calls an LLM. That's sprinkling the LLM on top.
And that's just one of the basic tests; sometimes it's hard to know. So the way I would think about it is: the decision making that is happening, can I [00:50:00] tell that it comes from specific, maybe sometimes extremely complex, rules in the background? Or does it truly look at unstructured, contextual information and then make decisions based on that?
And are the recommendations always A, B, C, or D? Or does it sometimes come up with a recommendation where you think, that's really smart, I never thought about it, but yes, that is true, because this detail tells you that could be a potential solution. I think those are some of the indicators that differentiate when something is built LLM-first, let's say.
Ashish Rajan: Wait, so does that mean the main work AI-first vendors are doing is building the scaffolding around it, to your point? Even though the core of the product may be built on an LLM, the scaffolding around it isn't. It's not that the software is going to die on you entirely; it's more that if Bedrock was down tomorrow, parts of the product would still work, but it wouldn't be able to make inference calls.
Santiago Castiñeira : Exactly, that's the thing: [00:51:00] your agents will not run. Let's say you're an LLM-first company, and an agent wants to look into some data. When it goes to Bedrock, there won't be a response; there won't be an answer to your question. That's why we rely on inference from the hyperscalers we already trust with our databases, so the chances of that happening are rare, even though we had a recent event, as you know. But it's a dependency of the system, and a critical one. Whereas in products that sprinkle it on top, maybe a feature just disables itself because it has no inference today.
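[Editor's note: a toy illustration of the "is Bedrock down?" test, assuming a simplified rules-based scanner. In the sprinkled design the LLM call is an optional enhancement that degrades gracefully; in the LLM-first design the reasoning is the critical path and the pipeline fails closed. `call_llm` and `parse_ranking` are hypothetical placeholders, not a real SDK.]

```python
# Sketch: the same outage hitting two architectures.
# `call_llm` stands in for any inference endpoint (e.g. Bedrock via an SDK)
# and is assumed to raise during an outage; it is not a real client.

class InferenceDown(Exception):
    pass

def call_llm(prompt: str) -> str:
    raise InferenceDown("inference provider unavailable")  # simulated outage

def parse_ranking(text: str) -> list[dict]:
    return []  # stub for illustration

def sprinkled_product(findings: list[dict]) -> list[dict]:
    # Core logic is rules-based; the LLM only adds a nice-to-have summary.
    ranked = sorted(findings, key=lambda f: f["cvss"], reverse=True)
    for f in ranked:
        try:
            f["summary"] = call_llm(f"Summarize: {f['title']}")
        except InferenceDown:
            f["summary"] = None  # feature disables itself; product still works
    return ranked

def llm_first_product(findings: list[dict]) -> list[dict]:
    # Prioritization *is* the agent's reasoning: no inference, no product.
    decision = call_llm(f"Reason about context and risk, then rank: {findings}")
    return parse_ranking(decision)  # never reached during the outage
```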
Ashish Rajan: Well, for people who are trying to build this, for lack of a better word, first-generation enterprise agent, what are some of the architectural mistakes you see them making?
Santiago Castiñeira : I'll walk you through a few. The things I mentioned earlier were all critical learnings we had. For example, what I mentioned about RAG [00:52:00] is a critical one.
Then there's this whole discussion around how you evaluate, and there's this vibe evaluation: you look at the results and they look great. You run it, let's say, a thousand times, you look at ten of them, and they all look great. But that is no way to evaluate LLM systems, and it's one of the most important mistakes you can make: trusting that vibe, that feeling you get, without validation across bigger data sets.
This is the whole idea of building evals into your system: having robust systems that evaluate it from many different angles, always with similar types of metrics, so you can compare across runs, compare across changes, and so on. Sometimes you'll be surprised how a small change in a prompt throws off a specific piece of reasoning that then never goes where you [00:53:00] want it to go. If I have to rank these, this is probably number one: not thinking about evaluation, and not being very careful about how you evaluate, to make sure that every change is one step forward and not two steps back.
You mentioned RAG, that's another one. And context is another: not thinking about how much data you're going to be putting in there, and the cost associated with it. I talked with a few founders on, let's say, the log-processing side of things, and these are challenges many of them encounter. It's just a lot. As many companies out there know, you're basically running an LLM over your entire throughput.
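[Editor's note: a back-of-the-envelope sketch of that cost point. Every price and token count below is an illustrative assumption, not a real provider rate; the takeaway is only that per-event inference multiplies quickly at log-pipeline volumes.]

```python
# Sketch: rough monthly inference cost of running every event through an LLM.
# All constants here are assumptions for illustration -- substitute your
# provider's actual pricing and your measured token usage.

TOKENS_PER_EVENT = 800        # assumed avg prompt + completion tokens per event
PRICE_PER_1K_IN = 0.003       # assumed USD per 1K input tokens
PRICE_PER_1K_OUT = 0.015      # assumed USD per 1K output tokens
INPUT_SHARE = 0.9             # assume prompts dominate the token count

def monthly_cost(events_per_second: float) -> float:
    events = events_per_second * 60 * 60 * 24 * 30   # events per month
    tokens = events * TOKENS_PER_EVENT
    return (tokens * INPUT_SHARE / 1000) * PRICE_PER_1K_IN \
         + (tokens * (1 - INPUT_SHARE) / 1000) * PRICE_PER_1K_OUT

if __name__ == "__main__":
    # A modest 50 events/sec is ~104 billion tokens a month under these
    # assumptions -- hundreds of thousands of dollars, not a rounding error.
    print(f"~${monthly_cost(50):,.0f}/month at 50 events/sec")
```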
Ashish Rajan: Yeah. And maybe just on the eval thing, because we gave an example of evals, and we also spoke about how a potential [00:54:00] future for security engineers working with these AI-first vulnerability management vendors would be to have better evals on their end, since internally they'd be running a lot of inference as well. What is an eval, exactly? Is it just a small piece of code, or what is it?
Santiago Castiñeira : Very good question. So an eval, short for evaluation: you can think of it as a test for a specific output.
Let's say you ask an agent to do a task and it comes back with an output. Then you can ask a question: is that output concise enough, and does it provide sufficient rationale? And you'll have specific examples: this is good rationale, this is a concise answer; this is not good rationale, this is not a concise answer.
You run that prompt against the output of your agent, and the answer comes back as true or false. If you run that at scale, you'll see, [00:55:00] okay, 90% of our agents provide good rationale in their evaluation. Then you introduce a change, you change your agent, you run it again, and suddenly only 70% of the agents are providing good rationale.
That's something for us to go look into, because that prompt change probably improved some other metric, but this one metric suddenly isn't there. And this is very unpredictable. You gain a bit of intuition with experience, but generally it's very hard to predict, especially changes from model to model. You run this with one model and get 70%; the next model gives you 30%. And it's like, I didn't change anything, no prompt changed, I just changed the model. So that's what evaluating is about. I think it's critical to understand how you test these systems, because they're all probabilistic, right? It's not deterministic like previous systems.
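[Editor's note: a minimal sketch of that eval loop, assuming an LLM-as-judge setup. A rubric prompt with labeled examples grades each agent output true/false, and the pass rate is aggregated so runs can be compared across prompt or model changes. `judge` is a hypothetical callable, not a specific framework, and the rubric examples are invented.]

```python
# Sketch: "concise with sufficient rationale?" as a true/false eval at scale.
# `judge` is assumed to be any LLM call of type (str) -> str that answers
# the rubric below.

RUBRIC = """You grade outputs from a security agent.
Good rationale example: "Host is not internet-facing and the vulnerable
module is not loaded, so severity is downgraded."
Bad rationale example: "This looks fine."
Answer exactly 'true' if the output below is concise AND gives sufficient
rationale, otherwise answer exactly 'false'.

Output to grade:
{output}
"""

def run_eval(agent_outputs: list[str], judge) -> float:
    """Return the fraction of outputs the judge passes."""
    passed = sum(
        judge(RUBRIC.format(output=out)).strip().lower() == "true"
        for out in agent_outputs
    )
    return passed / len(agent_outputs)

# Usage: compare a baseline against a candidate change before shipping it.
# baseline  = run_eval(outputs_v1, judge)   # e.g. 0.90
# candidate = run_eval(outputs_v2, judge)   # e.g. 0.70 -> investigate regression
```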
Ashish Rajan: The reason I'm asking is that these days the whole notion is that vulnerabilities [00:56:00] are found much more quickly, especially if you have an intelligence layer. Should the SLA models people use for vulnerability patching shrink, from a timeline perspective? Do we need to re-evaluate that?
Because in my mind the eval was an interesting one: say you're using some kind of intelligence layer and you get an answer. But the complaint the cloud security industry used to have is, hey, we don't respond to alerts for four weeks, and that's a long time. If the volume has now increased, the response time should change as well, and the detection time should change as well. How are you seeing this, and what do you think people should do to change their approach?
Santiago Castiñeira : Huge question, and one of the reasons I started this company. If you look at the metrics over the last few years, in terms of time to exploit and the number of vulnerabilities published, everything is moving against [00:57:00] security teams.
There are more and more vulnerabilities being published. With NVD having had its issues and there now being multiple sources, it gets a bit more tricky. But there's also the time to exploit: I think it was Mandiant that published that they now see vulnerabilities exploited before they're actually published as a new CVE. So everything moves against you. I definitely think there needs to be more automation to make those assessments, at least of: is this really affecting us?
And my vision of where cybersecurity will end up going is in that direction. You get intelligence-feed signals that tell you: we're seeing this exploited on this type of software; we don't know of a CVE they're exploiting, but we see these botnets using these specific types of messages. Then you go and say, okay, where do I have that specific software installed, check whether I have the version that's been flagged, and do an initial assessment of whether it's really [00:58:00] critical.
I think that's where things need to be moving, because otherwise it's really tricky. As you said, the time to respond is basically negative: you need to respond before the CVE is published.
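[Editor's note: a small sketch of that pre-CVE triage flow — match an intelligence-feed signal against an asset inventory and queue the hits for an initial assessment. The signal and inventory schemas are invented for illustration.]

```python
# Sketch: triage driven by an intel signal rather than a published CVE ID.
# Both dataclass shapes below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class IntelSignal:
    software: str                 # e.g. "examplewebd"
    affected_versions: set[str]   # versions the observed botnet traffic targets
    evidence: str                 # e.g. "botnet probe pattern, no CVE assigned yet"

@dataclass
class Asset:
    host: str
    software: str
    version: str
    internet_facing: bool

def triage(signal: IntelSignal, inventory: list[Asset]) -> list[Asset]:
    """Return assets needing an initial assessment before any CVE exists."""
    hits = [
        a for a in inventory
        if a.software == signal.software and a.version in signal.affected_versions
    ]
    # Crude first-pass prioritization: internet-exposed assets first.
    return sorted(hits, key=lambda a: a.internet_facing, reverse=True)
```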
Ashish Rajan: Is there an unpopular opinion about AI in security that you think will age well in the next three to five years?
Santiago Castiñeira : That is a good one.
There's this whole race to super intelligence, to LLMs that are basically super intelligent. I think extremely well-crafted agents with sufficient context, in certain application-specific scenarios, are indistinguishable from super-intelligent AI. Not super intelligent across the board, but in specific niches we're going to start seeing [00:59:00] that an agent has all the context. And again, an agent is not a single agent anymore; it's an agent with an army of subagents and an army of tools behind it.
They retrieve so much context and understanding for making the decision that it's going to be very hard to differentiate from super-intelligent AI on that specific use case. And I think we're starting to see that already in certain medical assessments; it's coming in a lot of areas, in some papers, some research being done by agents. So I think the intelligence we have nowadays is probably sufficient to build these extremely complex agents.
Ashish Rajan: Yeah.
Santiago Castiñeira : So that's why the part I'm saying will age well is: we don't need to get to super intelligence to see something that is truly surprising in terms of how smart it comes across. Everyone's waiting for that moment when there's this one model that is super intelligent, but well-crafted agents with sufficient [01:00:00] context are going to get there for their own specific use cases well before we get to that one model.
Ashish Rajan: So we'll see glimpses of super-intelligent agents everywhere before that.
Santiago Castiñeira : Yes, I do think so. I do think so.
Ashish Rajan: Fair. No, dude, that's all the technical questions I had. I've got a few fun questions for you as well. First one: what do you spend most of your time on when you're not trying to solve all the AI-first vulnerability management problems in the world?
Santiago Castiñeira : I have three kids, three small kids below the age of six, who consume every single second that I'm not at my laptop. So that's where I spend most of my time outside of work.
Ashish Rajan: Okay, that's a fair point. Second question: what is something you're proud of that is not on your social media?
Santiago Castiñeira : Oh, wow, that's a good one. I mean, no one is aware of my kids, so there you go: starting a [01:01:00] company with three small kids. It took a lot, and it takes a lot daily. I'm quite proud of how we, and my family as a whole, because it's not a me thing, it's a whole family effort, managed to get here in a sustainable way while being able to move at the pace we've moved as a company.
I'm quite proud of that. I think we did great as a family and as a company in keeping that balance. And again, this is not just me, it's across the company, but I'm proud that we managed to build a company that allows for that.
Ashish Rajan: That's awesome, man. Thank you for sharing that. Final question: what's your favorite cuisine or restaurant that you can share with us?
Santiago Castiñeira : Oh, I'm originally from Argentina, so I love a good steakhouse, ideally an Argentine steakhouse that cooks things slightly differently from other places. That's what I generally love, so anywhere there's an Argentine restaurant, I'll be happy to join for dinner anytime.
Ashish Rajan: Wait, what's the difference between a regular steak and an Argentine [01:02:00] steak?
Santiago Castiñeira : It depends, but in some areas of Argentina the barbecue is very slow, at a very low temperature. So it will sometimes take literally two or three hours just to cook a steak or a piece of meat. We have slightly different ways of doing it; asado is the name we give it.
Ashish Rajan: Yeah.
Santiago Castiñeira : And yes, when you find a place that does it right, it's truly something else.
Ashish Rajan: Wow, I look forward to trying one of these then. I've had Argentine steak before, but I never realized the difference; depending on the region of Argentina they come from, they may make it differently as well.
Santiago Castiñeira : Yeah, exactly. And I think a lot of restaurants have also used the marketing of Argentina to sell steaks. It matters whether the person grilling is actually Argentine, or has been trained in that style. That's the difference.
Ashish Rajan: Oh, okay, good to know. Thank you for sharing that. Well, that's all the questions I had for you, man. Where can people connect with you to learn more about what you're [01:03:00] doing at Maze and the other things you're working on?
Santiago Castiñeira : Yeah, LinkedIn and Twitter, I would say, are the main two. And then obviously through Maze; just contact us through any channel. As I said, happy to help anyone trying to build this internally. It's not easy, so please take the help if you can.
Ashish Rajan: I appreciate that. I'll put that in the show notes as well. But thank you so much for joining me for this, man. I appreciate you spending your time on this, and for everyone else tuning in, thank you so much for listening to the episode. I'll see you in the next one. Thank you very much.