After the Reasoning Trap: Making Agent Claims Unrepresentable

In the last post, I wrote about a failure mode I had been fighting in XelaBot: Action Hallucination. The bot would say, “I updated the file,” but no tool had run. The file had not changed. The model had confused its intent with reality.

My plan was to move XelaBot to Render-from-Trace. Separate doing from talking. Record real actions in an ActionLedger. Let the final answer reference only what actually happened. That is now mostly built.

But the more interesting lesson is not that I added a ledger. It is that I had to remove language from the trust boundary.

Regex Was the Wrong Layer

The first versions still had a smell. They looked for phrases like “I saved”, “I updated”, “I sent”, or the German equivalents, then tried to match those phrases against tool logs. This helped. It caught some obvious lies. It gave me telemetry. It made the system less embarrassing.

But it was still reactive. A regex can catch a sentence. It cannot prove a fact.

Worse, it pushes the architecture toward treating truth as a text pattern. That works until the model phrases the same claim slightly differently. “The document now contains the new section.” “The issue is ready.” “I’ve taken care of it.” None of these are just strings. They are claims about the outside world.
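To make the gap concrete, here is a minimal sketch of a phrase-based detector. The pattern and the function name are illustrative assumptions, not XelaBot's actual rules. It flags the obvious first-person claims and misses every paraphrase quoted above.

```python
import re

# Hypothetical pattern, for illustration only: first-person action verbs.
ACTION_CLAIM = re.compile(r"\bI(?:'ve| have)? (saved|updated|sent)\b", re.IGNORECASE)


def claims_action(text: str) -> bool:
    """Return True if the text matches a known action-claim phrase."""
    return ACTION_CLAIM.search(text) is not None


samples = [
    "I updated the file.",                         # caught
    "I have saved the draft.",                     # caught
    "The document now contains the new section.",  # missed
    "The issue is ready.",                         # missed
    "I've taken care of it.",                      # missed
]

for sentence in samples:
    print(claims_action(sentence), sentence)
```

Adding more patterns shrinks the gap but never closes it, because the set of ways to assert a side effect is open-ended.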

So the important shift was not writing better regex. The important shift was making certain claims impossible to produce unless the system has evidence for them.

No Ledger Entry, No Claim

In Tier 1 Render-from-Trace, the model does not freely write action claims. It emits structured references.

If it wants to say that a file was written, it must point to the exact LedgerEntry created by the write tool. If it wants to say that a Linear issue was created, it must reference the ledger entry from that tool call. The final human sentence is rendered deterministically from templates.

The model can still explain, summarize, and ask follow-up questions. But it cannot invent side effects in prose. No ledger entry, no claim.

This is the difference between filtering and architecture:

  • Filtering says, “Let the model speak, then inspect the words.”
  • Architecture says, “The model is not allowed to speak this class of sentence unless it can bind the sentence to evidence.”

That is the trust boundary I wanted.

The Same Pattern Spread

Once this existed, the rest of the agent system started changing around it.

  • Planning before delegation: The project bot, Victoria, is no longer just a chatty coordinator. It calls plan_research, creates an explicit task breakdown, and only then sends work to specialist agents.
  • Bounded parallelism: Independent subtasks can run in parallel. They are capped, isolated, and observable. One failing subagent does not kill the whole turn.
  • Traceable criticism: There is also a Critic agent now. After a specialist finishes, the Critic can inspect the result against the original objective. In strict mode it can trigger one retry. If the Critic itself fails, the main flow continues.
  • One-turn scratchpad: Parallel agents can write small intermediate findings and read what others have already discovered. The scratchpad is not memory. It is not a database. It is just enough shared state to avoid making every handoff go back through the CEO agent.
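The bounded-parallelism point can be sketched with asyncio. This is an assumption-laden illustration, not XelaBot's scheduler: the cap of 2, the function names, and the result tuples are made up. It shows the two properties named above: a hard cap on concurrent subagents, and a failure that is recorded rather than propagated.

```python
import asyncio

MAX_PARALLEL = 2  # assumed cap, for illustration


async def run_bounded(tasks, limit=MAX_PARALLEL):
    """Run named subagent coroutines with a concurrency cap.

    Each result is (name, status, payload); an exception in one task
    becomes a "failed" result instead of cancelling the others.
    """
    sem = asyncio.Semaphore(limit)

    async def guarded(name, fn):
        async with sem:
            try:
                return name, "ok", await fn()
            except Exception as exc:
                return name, "failed", repr(exc)

    return await asyncio.gather(*(guarded(n, f) for n, f in tasks.items()))


async def good_agent():
    await asyncio.sleep(0)
    return "finding"


async def bad_agent():
    raise RuntimeError("tool timeout")


if __name__ == "__main__":
    for result in asyncio.run(run_bounded({"a": good_agent, "b": bad_agent})):
        print(result)
```

Because failures come back as data, the parent turn can log them in the trace and decide what to do next, which is the same idea as the Critic being allowed to fail without stopping the main flow.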

This is the direction XelaBot is moving in: not one big model trying to be clever, but a set of constrained actors with explicit state, scoped permissions, and traces.

Trust Is a System Property

The biggest mistake I made early on was treating honesty as a prompt problem. Then I treated it as a validation problem. Now I think it is a system property.

If a tool runs, it creates a ledger entry. If an approval pauses the run, the ledger survives the pause. If a subagent acts, its sub-ledger comes back to the parent. If the final response mentions an action, it must bind to one of those entries.
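Two of those properties, surviving a pause and returning a sub-ledger to the parent, can be sketched as serialization plus a merge. The types below are hypothetical, as are the "researcher" agent and the web_search tool; only plan_research and Victoria come from the post.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Entry:
    entry_id: str
    agent: str
    tool: str


@dataclass
class Ledger:
    agent: str
    entries: list = field(default_factory=list)

    def record(self, tool: str) -> Entry:
        entry = Entry(f"{self.agent}:{len(self.entries) + 1}", self.agent, tool)
        self.entries.append(entry)
        return entry

    def merge(self, child: "Ledger") -> None:
        """Fold a subagent's entries into the parent ledger."""
        self.entries.extend(child.entries)

    def to_json(self) -> str:
        """Serialize so the ledger can be stored while a run is paused."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Ledger":
        data = json.loads(raw)
        return cls(data["agent"], [Entry(**e) for e in data["entries"]])


parent = Ledger("victoria")
parent.record("plan_research")

child = Ledger("researcher")  # hypothetical specialist agent
child.record("web_search")    # hypothetical tool name
parent.merge(child)

resumed = Ledger.from_json(parent.to_json())  # simulate pause and resume
print([e.entry_id for e in resumed.entries])
```

Each entry keeps the name of the agent that produced it, so a rendered claim can still be traced back to the subagent that performed the action.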

The final answer is no longer just “what the model says happened.” It is a rendered view over what the system can prove happened.

That also changes debugging. I can now inspect a trace and see the plan, the tool calls, the ledger, the critic verdicts, scratchpad metadata, and the final rendered answer. When something goes wrong, I do not have to guess whether the model lied, skipped a tool, lost context, or resumed through a different path. The evidence is in the trace.

The New Failure Mode

This does not make the system perfect. It moves the failure mode. Previously, the dangerous failure was a fluent false claim. The bot sounded confident about an action it never performed.

Now the likely failures are more mechanical: missing template coverage, a tool not writing the right render fields, a provider not supporting the structured output mode cleanly, or a subagent returning weak work that still technically satisfies the schema.

Those are better failures. They are testable. They are observable. They can be fixed in code.
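As an example of how testable these failures are, here is a sketch of a coverage check for the first two: a tool with no template, and a template that needs a field the tool never writes. The tool registry, field names, and templates are invented for illustration.

```python
from string import Formatter

# Hypothetical registry: the render fields each tool writes to its ledger entry.
TOOL_RENDER_FIELDS = {
    "write_file": {"path"},
    "create_linear_issue": {"key"},
    "send_email": {"recipient"},
}

# Hypothetical templates; send_email is deliberately missing.
TEMPLATES = {
    "write_file": "Wrote {path}.",
    "create_linear_issue": "Created Linear issue {key}: {title}.",
}


def coverage_gaps(tool_fields, templates):
    """List tools without a template, or whose template needs unwritten fields."""
    gaps = []
    for tool, fields in sorted(tool_fields.items()):
        template = templates.get(tool)
        if template is None:
            gaps.append(f"{tool}: no template")
            continue
        # Formatter().parse yields (literal, field_name, format_spec, conversion).
        needed = {name for _, name, _, _ in Formatter().parse(template) if name}
        missing = needed - fields
        if missing:
            gaps.append(f"{tool}: template needs {sorted(missing)}")
    return gaps


for gap in coverage_gaps(TOOL_RENDER_FIELDS, TEMPLATES):
    print(gap)
```

A check like this can run in CI, so a newly added tool without a template fails the build rather than surfacing as a missing sentence in a user-facing answer.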

That is what I want from an agent architecture: not magic, not vibes, but boring failure modes.

From Honest Answers to Honest Agents

The last post was about fixing the Reasoning Trap. This one is about what came after.

Render-from-Trace started as a way to stop XelaBot from lying about tool use. But once the ActionLedger became the ground truth, it became the spine for a broader multi-agent system.

The shape of the system is becoming clearer:

  • Plans are explicit.
  • Delegation is gated.
  • Parallel work is bounded.
  • Criticism is traceable.
  • Scratchpad state is temporary and redacted.
  • Final answers are rendered from evidence.

The main lesson is simple: regex catches some lies. Ledgers make whole categories of lies unrepresentable.

That is the architecture I want XelaBot to grow into.