When the guardrails held and the attack still worked

Asaf Saar, June 30, 2026

A real intrusion, captured in full, shows why trust cannot live inside the model that does the work.

A real attacker’s working directory and over 1,000 recovered AI agent sessions give us the clearest look yet at how AI-assisted intrusions actually work. The finding isn’t what the headlines claimed.

Most of what we know about AI-assisted attacks is inference. We see the output and reason backward about how it was made. This time we do not have to. A compromised server was turned into a honeypot, and the attacker’s working directory was recovered intact, including more than a thousand full agent sessions: the prompts, the tools, the model’s own reasoning, and every refusal it raised along the way. The researchers at OALABS published the analysis.

Why AI lowers the attack skill floor (but that’s not the real risk)

The obvious takeaway is that AI lowers the skill floor. The attacker was not sophisticated. He framed vague goals, and the agent filled in the technical work: reconnaissance, exploit research, credential validation, data harvesting, and even the write-up. That is real, and it is worth saying once. But it is also the reading that the researchers themselves push back on.

There is one detail that makes the point better than any argument. The safeguards never identified this attacker. His own carelessness did. One of the first things he had the agent do was polish his resume and build a job-application tool, and that resume carried his full name, location, and LinkedIn. He ran more than a thousand malicious sessions without the model flagging who he was, then unmasked himself by asking the same agent for help finding a job. The system that could not separate his intrusions from authorized work could not separate his crimes from his job hunt either, because from inside the conversation both read as ordinary requests.

The OALABS report is careful and deliberately anti-alarmist. The offensive workflow, the researchers note, is nearly indistinguishable from legitimate red team work. They explicitly do not call for broader refusals, because the same capability that helped the attacker also makes these tools valuable to defenders. And the whole operation ran on models a generation behind the current frontier. Lead with skill-floor panic and you are not only overstating the case; you are making a point that the primary source disowns, and one that ages the moment the next model ships.

The finding that matters is quieter, and it does not age.

How a simple framing trick bypassed the AI security guardrails

Here is the number to sit with. Across more than a thousand attack sessions, the agents raised only a handful of policy violations. Not a handful per session. A handful in total. The safeguards were neither absent nor weak. They were simply bypassed, and the method was almost insultingly simple: the attacker opened each session by claiming it was an authorized red-team exercise. When a request was blocked, he rephrased it and insisted more firmly that the work was sanctioned.

That is the whole technique. Not a jailbreak in the technical sense. A framing.

This is the part worth dwelling on, because it is not a tuning failure that a future model patches. The model cannot reliably tell authorized verification from malicious intent, because both look identical from inside the conversation. Recon is recon. Exploit research is exploit research. The only thing that separates a sanctioned engagement from an intrusion is context the model does not have and cannot verify: who authorized this, against what scope, with whose consent. The attacker simply asserted that context, and the work proceeded.

To their credit, the agents did hold a hard line in a few places. When the attacker pointed them at a private individual and their family, or asked them to build an explicit playbook for monetizing stolen credentials, they refused and could not be talked out of it. Those are the cases where intent is legible on its face. But the long middle, the actual work of breaking into a company, read as ordinary security work, and so it got done.

Why this is the independence argument, proven

My blog series has made one claim above all others: trust in AI-generated and AI-driven software cannot come from the model doing the work. It has to come from a layer that is independent of it. This intrusion is that argument demonstrated in the wild.

The reason the framing trick works is the same reason a model cannot be its own auditor. The checker and the worker share the same context, so they share the same blind spot. A model asked to verify its own intent grades the story it was given, not the reality on the ground. Safeguards inside the generation loop will always be arguing with the person holding the prompt, and the person holding the prompt gets to describe the situation however they like.

Independent verification does not have that weakness, because it does not ask about intent at all. It evaluates the artifact itself: what the code does, what the system exposes, and whether those exposures create real risk. An internet-facing service with a known, exploitable vulnerability is dangerous whether the session that discovered it was labeled a red-team exercise, a benign scan, or nothing at all. The independent layer judges the reality of the system, not the story around it. That is the property model improvement alone cannot supply, because the gap is structural, not just a capability problem.
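The contrast can be made concrete with a minimal sketch. Everything in it is hypothetical: the `Session` and `Service` types, their fields, and both check functions are illustrative names, not anything taken from the OALABS report or from a real product. The first check grades the story attached to the session; the second ignores the session and looks only at properties of the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """What an in-conversation safeguard can see (hypothetical)."""
    claimed_purpose: str  # asserted by whoever holds the prompt


@dataclass(frozen=True)
class Service:
    """What an independent layer can observe about the system (hypothetical)."""
    name: str
    internet_facing: bool
    known_exploitable_vuln: bool


def in_conversation_check(session: Session) -> bool:
    """Allow the work if the session claims authorization.

    The claim cannot be verified from inside the conversation, so this
    check is only as good as the story it is told.
    """
    return "authorized" in session.claimed_purpose.lower()


def independent_check(service: Service) -> bool:
    """Return True if the service is a real risk, whatever the session says."""
    return service.internet_facing and service.known_exploitable_vuln


gateway = Service("vpn-gateway", internet_facing=True, known_exploitable_vuln=True)
labels = ["authorized red-team exercise", "benign scan", ""]

# The in-conversation verdict changes with the label it is given...
story_verdicts = [in_conversation_check(Session(label)) for label in labels]
# ...while the artifact verdict is identical for every label.
artifact_verdicts = [independent_check(gateway) for _ in labels]
```

The string match in `in_conversation_check` is deliberately naive. The point is not that real safeguards are this crude, but that any check fed only the asserted context inherits whatever the prompt author chose to assert.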

Known CVEs + AI automation: The real shape of the AI attack threat

There is a second detail in the report that points straight at the defense. When the agent broke into a service, it was not inventing novel zero-days. It was weaponizing known, published vulnerabilities against exposed infrastructure: CitrixBleed 2, a Ghostscript flaw, a Livewire bug, among others. The agent looked up the public CVE, wrote an exploit for it, and ran it.

That is the threat in its real shape. Not exotic AI-discovered attacks, but the automation of the boring, known-vulnerability surface that every enterprise is already carrying. AI did not change which doors were unlocked. It made checking all of them fast and trivial for an attacker with almost no skill.

That is the entire case for continuous, independent coverage of the whole estate. The doors the attacker walked through were known and, in principle, closable. The gap was not knowledge. It was time: the defender had not gotten to them, and the attacker, now automated, had.

This is the asymmetry that should worry every security team. Offense just got a force multiplier that works around the clock, never loses track of a target, and treats your entire exposed surface as one long checklist. Defense, in most organizations, is still paced by human attention: which finding a person triaged this week, which patch a team prioritized this sprint, which legacy service nobody has owned in years. When one side automates and the other does not, the gap between what is exploitable and what has been fixed widens every single day, and that gap is the whole attack surface.
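A toy calculation shows why that gap widens. The numbers are invented for illustration and are not taken from the report: assume newly exploitable exposures appear at a steady weekly rate, and a team closes a fixed number of them each week.

```python
def open_exposures(weeks: int, new_per_week: int, fixed_per_week: int,
                   start: int = 0) -> list[int]:
    """Backlog of exploitable-but-unfixed exposures at the end of each week."""
    backlog = start
    history = []
    for _ in range(weeks):
        # The backlog grows by what appears and shrinks by what gets fixed;
        # it can never drop below zero.
        backlog = max(0, backlog + new_per_week - fixed_per_week)
        history.append(backlog)
    return history


# Human-paced defense: fixes arrive more slowly than exposures do.
human_paced = open_exposures(weeks=4, new_per_week=10, fixed_per_week=6)

# Same inflow, starting from the backlog above, with faster remediation.
automated = open_exposures(weeks=4, new_per_week=10, fixed_per_week=12, start=16)
```

Under these assumed rates the first backlog grows by four every week, and the second shrinks by two. The direction depends only on whether fixes outpace new exposures, which is the lever the rest of this section is about.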

The answer is not to blame the human layer for not keeping up. It is to give the defense the same leverage the attacker just took. Automation and AI applied to the defender’s side (triage, prioritization, and remediation) let people spend their judgment on the calls that actually need a human, instead of drowning in the volume. The point of an independent layer is not to remove the human. It is to make sure the human and the machine together can move at the speed the attacker now moves, so the gap stops widening and starts closing.

Closing it cannot mean finding it faster. Finding was never the constraint here. These were known vulnerabilities with public CVEs. It means remediating faster: turning a confirmed, reachable weakness into a shipped fix at the same speed the attacker can turn it into an exploit, and doing it across everything, not just the code written last week. New and legacy. First-party and dependencies. The surface an attacker now scans in full is the surface a defender has to cover in full.
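As a rough sketch of what "confirmed, reachable, across everything" could look like as a selection rule, consider the following. The `Finding` type, its fields, the sample data, and the ordering by age are all assumptions made for illustration; a real program would weigh severity, exploit availability, and business context as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One known weakness somewhere in the estate (hypothetical shape)."""
    component: str
    origin: str      # "first-party" or "dependency"
    legacy: bool     # old code counts just as much as last week's
    confirmed: bool  # the weakness has been verified
    reachable: bool  # an attacker can actually get to it
    days_open: int


def remediation_queue(findings: list[Finding]) -> list[Finding]:
    """Keep confirmed, reachable findings and put the longest-open ones first.

    Origin and age of the code are deliberately not used as filters:
    the attacker scans everything, so everything stays in scope.
    """
    actionable = [f for f in findings if f.confirmed and f.reachable]
    return sorted(actionable, key=lambda f: f.days_open, reverse=True)


findings = [
    Finding("billing-api", "first-party", False, True, True, 3),
    Finding("pdf-renderer", "dependency", True, True, True, 400),
    Finding("internal-wiki", "first-party", True, True, False, 900),
    Finding("auth-lib", "dependency", False, False, True, 30),
]

queue = [f.component for f in remediation_queue(findings)]
```

Here the legacy dependency lands at the front of the queue and the unreachable or unconfirmed items drop out, which mirrors the argument above: the constraint is fixing what is verifiably exposed, not producing more findings.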

And it has to come from a layer independent of whatever wrote the code, for the same reason the rest of this intrusion teaches. A model asked to remediate its own output is back to grading its own story. The defense that scales against automated offense is automated, continuous, whole-estate remediation, judged by a layer that answers to what the code does, not to who asked. That is the subject of the next piece, and it is where the independent layer stops being a principle and becomes the product.

AI-powered defense: How independent verification closes the gap

None of this is an argument for rejecting AI. It is an argument for using AI in the right place. Human judgment remains the verification layer because accountability ultimately belongs to people, not models, and people supply the skepticism that no model can apply to its own work. The role of AI is to amplify that judgment: continuously surfacing exposure, prioritizing what matters, accelerating remediation, and eliminating the repetitive work that humans cannot keep up with alone. The goal is not to replace human verification, but to make it possible for humans to verify an environment that has grown too large and too fast to assess unaided. In that model, AI provides scale, while people provide trust. That is how defense keeps pace with attackers who now operate at machine speed without giving up the independent judgment that security ultimately depends on.

The bottom line

The comfortable version of this story is that AI made attackers smarter. The accurate version is more uncomfortable. The attacker was not smart. The safeguards were not broken. Model-side safeguards and red teaming are necessary, and they should keep getting better. But they operate inside the conversation, where intent can be asserted, and that is a different layer than verifying what the code actually does. The two are complements, not substitutes. And the attack still worked, because the only thing standing between authorized verification and a real intrusion was a sentence the attacker was free to write.

Trust that lives inside the model can always be told a story. Trust that lives outside it, in a layer that watches what the code does rather than what the prompt says, cannot. That is not a tuning problem waiting on a better model. It is the reason independence is the architecture, and this intrusion is the proof.

About the author

Asaf Saar

EVP Product

Asaf Saar is EVP and Chief Product Officer at Mend.io, where he leads product strategy for the company’s application security and software composition analysis platform, including its work securing AI-generated code and AI components. He joined Mend.io after more than five years as VP of Product Management at Tricentis. Asaf has spent his career building and leading product organizations at scale, with a focus on developer tooling, quality, and turning technical depth into commercial momentum.
