If Your AI Can’t Prove How It Got There, It Doesn’t Matter How Good It Is.

One of our AI agents recently flagged 600+ alerts as an active nation-state intrusion: APT28-attributed malware on multiple hosts, persistence mechanisms everywhere, 16 MITRE ATT&CK techniques mapped across a 30-day campaign. The recommendation was immediate escalation. Confidence: HIGH.

It was wrong.

Every single APT detection — all 53 of them — had already been investigated and resolved as benign by the source platform. The agent had dropped a status filter when it expanded its search window. The entire narrative was built on closed false positives.

Nobody acted on it.

The error was caught, corrected, and documented before a single analyst picked up the phone. I’ll explain how in a minute.

Think about what would have happened at most companies. An AI produces a confident, well-structured report claiming APT28, somebody escalates, and the IR team mobilizes — forty to eighty person-hours burning before anyone thinks to check whether the underlying detections were still open. And the next time the AI flags something real, the team hesitates, because last time it cried wolf.

Nobody in the AI security space is talking honestly about trust.

Every vendor is shipping agents. Some of them are pretty good. But “pretty good” with no way to verify is actually worse than mediocre with full visibility, because it builds false confidence. You trust the output right up until the moment it costs you.

I wrote recently about the data problem underneath AI agents — how scattered, inconsistent data produces confident but incomplete conclusions, and why the data layer needs to be solved before the agent layer matters.

Solving the data problem creates a new one. Now the agent can reach everything, reason across everything, and produce conclusions more thorough than any human analyst could reach in the same amount of time. Which makes the question of whether to trust those conclusions more important, not less.

Everyone is talking about transparency, yet almost nobody is shipping it.

“Full visibility into AI decisions.” “Explainable AI.” “100% decision transparency.” I’ve watched this language show up in pitch decks and acquisition press releases all year. It’s becoming common messaging. And almost none of it means what it says.

Ask what that transparency actually looks like in the product and you usually get one of three things: a confidence score, a summary paragraph, or a list of data sources the agent checked. Sometimes all three.

A confidence score, a summary paragraph, and a list of sources aren’t transparency. They’re a book report.

Real transparency means the analyst can see every search the agent ran. The actual query syntax, not a description of it. What came back, how many results, what the agent learned from each one. Where it pivoted, what it tried that returned nothing, and what it couldn’t access at all.

Most importantly, it means the analyst can rerun any of those searches and get the same results. The investigation is reproducible, not narrated.
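To make "reproducible, not narrated" concrete, here is a minimal Python sketch of what one entry in an evidence trail could look like. Everything in it is an assumption for illustration: the names `QueryRecord`, `record`, and `verify`, the `run_query` callable, and the idea of fingerprinting results with a hash are hypothetical, not a description of any vendor's actual schema. The point is only that storing the exact query text plus a fingerprint of what came back lets an analyst rerun the search and compare.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable

Rows = list[dict]


def fingerprint(rows: Rows) -> str:
    """Order-independent hash of a result set (hypothetical helper)."""
    canonical = sorted(json.dumps(r, sort_keys=True, default=str) for r in rows)
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()


@dataclass(frozen=True)
class QueryRecord:
    query: str         # the exact query text as executed, not a description of it
    source: str        # which data source it ran against
    result_count: int  # zero is a valid, recorded outcome (a dead end)
    result_hash: str   # fingerprint of the rows that came back
    note: str = ""     # what the agent concluded from this step


def record(query: str, source: str, rows: Rows, note: str = "") -> QueryRecord:
    """Capture one executed search as an auditable evidence entry."""
    return QueryRecord(query, source, len(rows), fingerprint(rows), note)


def verify(rec: QueryRecord, run_query: Callable[[str, str], Rows]) -> bool:
    """Rerun the recorded query and check that the results still match."""
    rows = run_query(rec.source, rec.query)
    return len(rows) == rec.result_count and fingerprint(rows) == rec.result_hash
```

In this sketch, `run_query(source, query)` stands in for whatever search interface the analyst already has. A `verify` that returns False does not mean the agent was wrong. It means the data changed or the query is not reproducible, and either one is worth knowing before acting on the conclusion.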

That’s a higher bar than most vendors are willing to meet, because it means shipping the evidence, not just the conclusion. Evidence can be audited. Conclusions can only be accepted or rejected.

Why this matters more than it seems.

I’ve talked to enough security leaders to know that the number one barrier to AI adoption isn’t capability. It’s trust. Their analysts have been burned by tools that produce confident-sounding output that falls apart under scrutiny. “AI-generated” has become a meme in many SOCs. The only way past it is to let the analyst see the work. Not a summary of the work. The work.

Thoroughness without verifiability builds false confidence. Suppose an AI agent investigates autonomously across ten data sources, more thoroughly than most analysts could in the same time window, and is right 95% of the time. The deeper problem is that analysts can’t tell which 5% is wrong, so they end up trusting the agent by default. When it matters and the conclusion doesn’t hold, nobody catches it until it’s too late.

That’s exactly the failure mode behind the APT28 investigation I opened with, except that our agentic loop caught it. The regimented process was followed, the evidence was there, and the system spent the time and the tokens to get it right.

What we learned the hard way.

We built Query Workers to run autonomous investigations on the Query Security Data Mesh.

Every investigation produces a complete evidence package: every query that ran, every source checked, every IOC tracked, every reasoning step documented. That evidence chain has taught us things about AI trust that I didn’t fully appreciate when we started.

Four things happened that changed how I think about what transparency actually needs to mean.

1. How the APT28 mistake got caught.

Back to the opening. The agent built a convincing APT28 narrative on 53 detections that were already resolved as benign. The question is how that got caught before anyone acted on it.

Every investigation our agents run produces a complete evidence package — every query with its exact syntax, every result count, every field checked. That package goes through a structured review that re-examines the evidence chain. The review didn’t re-ask “is this APT28?” It asked a more basic question: “are these detections still open?”

Because the evidence included every query and every result with status fields intact, the answer was unambiguous. All 53 resolved. All benign. The investigation was corrected, the disposition downgraded, and the lessons documented, including the specific technical error (status filter dropped on the 30-day lookback) so it wouldn’t happen again.
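The review question "are these detections still open?" can be sketched as a plain check over the evidence package. This is a minimal illustration only: the `status` field name, the set of values treated as open, and the shape of the returned summary are assumptions, not the source platform's actual schema or our review implementation.

```python
from collections import Counter

# Assumed status vocabulary; real platforms use their own values.
OPEN_STATUSES = {"open", "new", "in_progress"}


def normalize_status(detection: dict) -> str:
    """Lowercase the status, treating a missing or empty value as 'unknown'."""
    return str(detection.get("status") or "unknown").lower()


def review_detection_status(detections: list[dict]) -> dict:
    """Summarize how many detections in the evidence are still open."""
    counts = Counter(normalize_status(d) for d in detections)
    open_count = sum(n for status, n in counts.items() if status in OPEN_STATUSES)
    return {
        "total": len(detections),
        "open": open_count,
        "status_counts": dict(counts),
        # A narrative built entirely on closed detections should not escalate.
        "supports_escalation": open_count > 0,
    }
```

A check like this only works if the status field survives into the evidence. If the agent had stored a summary instead of the raw results, there would have been nothing to count.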

The reason this worked isn’t that the AI was smart enough to catch its own error. It’s that the evidence artifacts were available for a structured process to audit. The transparency made the correction possible. The correction made the system trustworthy.

2. The overnight triage that found cloud spread.

120+ alerts accumulated overnight across 50+ workstations. The agent processed all of them in 20 queries. It cross-referenced sign-in anomalies with endpoint execution alerts, discovered that the structured hostname fields were empty (a connector mapping gap), adapted by parsing raw alert data, and identified five employees whose sign-in alerts were followed hours later by command execution and persistence mechanism installation on their machines.
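The correlation described above, including the fallback when structured hostname fields are empty, can be sketched roughly as follows. The alert field names (`hostname`, `raw`, `time`, `user`), the `host=` pattern inside the raw text, and the 12-hour window are all assumptions made for illustration. The real alerts and the agent's actual parsing logic will differ.

```python
import re
from datetime import datetime, timedelta

# Assumed raw-alert format: a "host=<name>" or "hostname=<name>" token.
HOST_RE = re.compile(r"\bhost(?:name)?=([\w.-]+)", re.IGNORECASE)


def resolve_host(alert: dict):
    """Prefer the structured hostname; fall back to parsing the raw alert text."""
    host = alert.get("hostname")
    if host:
        return host.lower()
    match = HOST_RE.search(alert.get("raw") or "")
    return match.group(1).lower() if match else None


def correlate(signins, executions, window=timedelta(hours=12)):
    """Find sign-in alerts followed by an execution alert on the same host."""
    by_host = {}
    for ex in executions:
        host = resolve_host(ex)
        if host:
            by_host.setdefault(host, []).append(ex)

    hits = []
    for s in signins:
        host = resolve_host(s)
        signin_time = datetime.fromisoformat(s["time"])
        for ex in by_host.get(host, []):
            delta = datetime.fromisoformat(ex["time"]) - signin_time
            # Only count executions that come after the sign-in, within the window.
            if timedelta(0) < delta <= window:
                hits.append((s["user"], host, delta))
    return hits
```

The fallback is the part worth noticing. A correlation that silently joins on an empty field returns nothing and looks clean, which is why the evidence should record that the structured field was empty and that raw parsing was used instead.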

Then it found something no single-source tool would catch. One WMI persistence alert came from an EC2 cloud instance on a completely different subnet. The agent ran a lookback on the cloud instance and found the exact same multi-stage pattern — execution, persistence, and evasion — that it had found on the corporate hosts.

The attack had spread from corporate endpoints to cloud infrastructure. That finding only existed because the mesh connected corporate EDR data and cloud security data in the same query space. And it only mattered because the analyst could see the exact queries that found it and verify the correlation wasn’t a hallucination.

An analyst working this manually would spend six to ten hours across multiple consoles, and would almost certainly miss the cloud instance, because it’s alert #87 in a queue of 120, in a different subnet, in a different security tool. The agent found it in a minute because it searched everywhere.

3. The finding no alert triggered.

Not every security problem starts with an alert. Some of the most dangerous ones never trigger a rule at all.

We ran an identity threat assessment, not in response to an alert, but as a proactive sweep. The agent tested eight MITRE ATT&CK patterns across Okta and Entra ID simultaneously. Seven came back clean. Pattern eight found a service account authenticating to Okta via Chrome browser from a user’s laptop.

Service accounts don’t use browsers. This was a human using service account credentials interactively. The agent cross-referenced the laptop IP against DHCP records, identified the user, and flagged the finding.

We ran the same assessment a week later. The finding was still there. Nobody had rotated the credential or restricted the account. We ran it again eight days after that. Now the service account was not only logging in daily — it was browsing the web after login, and a DLP event showed the same user accessing a private-classified file from the same laptop.

Three assessments. Eight days. The risk escalated every time because nobody acted on the first finding. The agent tracked the remediation status across all three assessments and escalated when the pattern got worse.

No alert ever fired on any of this. No detection rule covered “service account with browser user agent.” The agent found it because it went looking, and the investigation artifacts gave the security team a specific, actionable detection rule to fill the gap.
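A detection for "service account with browser user agent" could look something like the sketch below. It assumes authentication events have already been normalized into dictionaries with `actor` and `user_agent` keys, and that the team maintains its own list of service account names. Those names, the marker list, and the function itself are hypothetical and are not Okta's or Entra ID's log schema.

```python
# Substrings that commonly appear in browser user-agent strings.
BROWSER_MARKERS = ("mozilla/", "chrome/", "safari/", "firefox/", "edg/")


def find_interactive_service_logins(events, service_accounts):
    """Return auth events where a known service account used a browser-like client."""
    known = {name.lower() for name in service_accounts}
    findings = []
    for event in events:
        actor = (event.get("actor") or "").lower()
        user_agent = (event.get("user_agent") or "").lower()
        if actor in known and any(marker in user_agent for marker in BROWSER_MARKERS):
            findings.append(event)
    return findings
```

Substring matching on user agents is crude, and a user agent can be spoofed, so a rule like this is a starting point for triage rather than proof of misuse.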

4. The two alerts that changed everything.

This one is about what happens when investigations talk to each other.

Day one: a separate investigation found a C2 backdoor running across 57 hosts, communicating with three external IPs, active for nine days. Day two: ten employees triggered sign-in alerts with preceding phishing emails. Day three: two more sign-in alerts. Same phishing campaign. Easy to batch as “more of the same.”

The agent investigated the two new hosts thoroughly — the same patterns, same persistence mechanisms, same scanning IP. Solid work. Then the review process did something manual triage almost never does: it cross-referenced against the day-one investigation.

Both hosts were in the 57 already confirmed to be running the C2 backdoor. One of the phishing hashes had been classified as known malicious in the earlier investigation. These weren’t two more phishing victims. They were two more confirmed nodes in an organization-wide compromise.

The investigation was upgraded from routine to critical. The framing shifted from “phishing campaign continuation” to “three consecutive days of escalation recommendations with no visible containment activity.”

In a manual workflow, these two alerts get added to an existing ticket. The connection to the 57-host compromise lives in a different case, worked by a different analyst. Nobody sees the full picture. The review process saw it because it had access to the evidence artifacts from every investigation, not just the current one.
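The cross-reference step amounts to set intersection over stored artifacts. A minimal sketch, assuming each investigation's evidence is available as a dictionary with hypothetical keys (`id`, `hosts`, `hashes`, `confirmed_hosts`, `malicious_hashes`) that do not reflect any real artifact format:

```python
def cross_reference(current: dict, prior_investigations: list[dict]) -> list[dict]:
    """List prior investigations that share hosts or file hashes with the current one."""
    current_hosts = {h.lower() for h in current.get("hosts", [])}
    current_hashes = {h.lower() for h in current.get("hashes", [])}

    overlaps = []
    for prior in prior_investigations:
        shared_hosts = current_hosts & {h.lower() for h in prior.get("confirmed_hosts", [])}
        shared_hashes = current_hashes & {h.lower() for h in prior.get("malicious_hashes", [])}
        if shared_hosts or shared_hashes:
            overlaps.append({
                "investigation_id": prior["id"],
                "hosts": sorted(shared_hosts),
                "hashes": sorted(shared_hashes),
            })
    return overlaps
```

The logic is trivial. What makes it possible is that earlier investigations left behind structured host and hash lists instead of prose summaries locked inside a ticket.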

The evidence layer is what makes the difference.

None of these outcomes came from a better model. The AI reasoning was the same in every case. What made the difference was the evidence layer: structured artifacts that could be audited, cross-referenced, corrected, and verified.

In investigation one, the artifacts enabled self-correction. In investigation two, they proved a cross-environment finding was real. In investigation three, they tracked remediation failure across time. In investigation four, they connected separate investigations into a unified picture.

Remove the evidence layer and you still get an AI that produces conclusions. You just can’t tell which conclusions are right.

The agents recommend. The humans decide.

This is a design choice we made early and have never regretted. Our agents never declare incidents. They never auto-remediate. They recommend a disposition, such as “Critical Threat, Recommend Escalation,” and the human decides what to do.

That’s not a limitation. It’s the point.

Incident declaration has compliance implications, business implications, legal implications. Those are human decisions with human accountability. An AI that declares incidents is an AI that makes compliance commitments on behalf of the organization. I don’t think that’s appropriate, and I don’t think most CISOs want it even if they say they do.

What CISOs actually want is for the investigation to be done when they look at it: data assembled, evidence organized, gaps identified, recommendation clear, and every step of the reasoning available to verify before their team decides and acts.

That’s what the evidence package does. The agent works the problem. The human makes the call. And the full audit trail exists for compliance, for the board, and for the analyst who runs the next investigation.

The transparency test.

If you’re evaluating AI security tools, here’s a simple test. Ask the vendor to show you a real investigation output. Not a demo script. Not a screenshot of a dashboard. The actual artifact the AI produces when it investigates an alert.

Then ask three questions:

Can I see every search the AI ran — the actual query, not a description?

Can I rerun any of those searches myself and verify the results?

Can I see what the AI tried that came back empty — the gaps, the dead ends, the data it couldn’t reach?

If the answer to any of those is no, what you have is a black box with a transparency label on it. The AI might be good. The conclusions might be right. But you’ll never know which ones aren’t until something breaks.

We started with 600+ alerts and a confident AI conclusion that turned out to be wrong. That’s not a failure story. It’s a trust story. The system caught it, corrected it, and documented why, and it could do that because of the evidence and the process. That’s what separates an AI that produces answers from an AI that earns trust.

The answer isn’t in the model. It’s in what the model shows you after it’s done.

If your team is investigating fewer alerts than it should be, Query Workers were built for exactly that problem. Let’s talk.