Building a Bot That Can Recover, Not Just Reply

Over the last two weeks, XelaBot has evolved in a way that matters more than any single feature.

The commit log is full of visible changes: approval flow updates, routing hardening, recovery guidance, telemetry, Plane integration, Matrix support, CI migration. But the underlying theme is simpler:

XelaBot is becoming a system that can recover reliably, not just respond fluently.

Reliability Starts After the Tool Call

Calling tools is no longer the hard part of building an agent.

The hard part begins when things get messy:

a user approval interrupts a task
a delegate times out after partial progress
a route is technically valid but semantically wrong
a recovery step creates more confusion than clarity
the bot wants to answer before it has enough evidence

That is what I worked on most over the last two weeks.

Not making the bot more impressive. Making it harder to break.

Approval and Resume Are Becoming First-Class

A big part of this work was tightening approval and resume behavior.

In many agent systems, approval acts like a rupture. The flow pauses, the user confirms, and the resumed run feels weaker, less coherent, and less trustworthy. Partial work gets lost. Context gets blurred. The system feels like it woke up in the middle of its own sentence.

So I focused on unifying that path:

stronger approval-resume handling
better timeout salvage
clearer outcome validation
more structural gating around risky actions
fewer ways for partial work to disappear silently

That makes XelaBot feel less like a chat loop and more like a real workflow engine.
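
To make the approval-resume idea concrete, here is a minimal sketch, not XelaBot's actual code. It assumes a run is a list of named steps and that risky steps need explicit approval. The names Checkpoint, RunState, and advance are hypothetical and exist only for illustration. The point it shows: when the run pauses, completed work stays in the checkpoint, and the resumed run continues from it instead of starting over.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class RunState(Enum):
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    DONE = "done"


@dataclass
class Checkpoint:
    """Hypothetical record of a run: what finished and what is waiting."""
    task: str
    completed_steps: List[str] = field(default_factory=list)
    pending_action: Optional[str] = None
    state: RunState = RunState.RUNNING


def advance(cp: Checkpoint, steps: List[str], risky: Set[str],
            approved: Optional[Set[str]] = None) -> Checkpoint:
    """Run steps in order; pause before any risky step that is not approved."""
    approved = approved or set()
    for step in steps:
        if step in cp.completed_steps:
            continue  # finished before the pause: neither redone nor dropped
        if step in risky and step not in approved:
            cp.pending_action = step
            cp.state = RunState.AWAITING_APPROVAL
            return cp  # partial work stays inside the checkpoint
        cp.completed_steps.append(step)  # stand-in for real execution
    cp.pending_action = None
    cp.state = RunState.DONE
    return cp


steps = ["list_branches", "delete_branch", "notify"]
risky = {"delete_branch"}

paused = advance(Checkpoint(task="cleanup"), steps, risky)
resumed = advance(paused, steps, risky, approved={"delete_branch"})
```

The first call stops at delete_branch with list_branches already recorded. The second call picks up from that same checkpoint once the step is approved, so the resumed run does not have to rediscover where it was.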

Recovery Is Becoming Architecture

Another major shift: recovery is no longer just a fallback.

It is becoming its own subsystem.

Recent changes added structured failure handling, adaptive recovery planning, trace telemetry, semantic suggestions, user-facing recovery guidance, and failure-alert pipelines. The point is not just to catch errors. The point is to make failure useful.

A good failure should preserve evidence, explain what happened, suggest the next step, and leave the system better prepared for the next similar case.

That is a much better failure mode than a vague apology and a dead end.
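
As a rough illustration of what a useful failure can look like in code, the sketch below captures a failure as structured data instead of a bare error string. It is an assumption-laden example, not the real subsystem: FailureRecord, capture_failure, and the known_fixes lookup are hypothetical names, and the lookup table is a simple stand-in for whatever recovery planning the bot actually does.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FailureRecord:
    """Hypothetical structured failure: what broke, what survived, what next."""
    step: str
    kind: str
    detail: str
    evidence: List[str] = field(default_factory=list)
    suggestion: str = ""

    def user_message(self) -> str:
        """Explain the failure and the next step in one line for the user."""
        return (
            f"{self.step} failed ({self.kind}): {self.detail}. "
            f"Kept {len(self.evidence)} piece(s) of partial work. "
            f"Next step: {self.suggestion}"
        )


def capture_failure(step: str, exc: Exception, partial_outputs: List[str],
                    known_fixes: Dict[str, str]) -> FailureRecord:
    """Preserve evidence and attach a suggestion keyed by exception type."""
    kind = type(exc).__name__
    suggestion = known_fixes.get(
        kind, "Retry once, then ask the user how to proceed."
    )
    return FailureRecord(
        step=step,
        kind=kind,
        detail=str(exc),
        evidence=list(partial_outputs),  # copy so later edits cannot erase it
        suggestion=suggestion,
    )


# Hypothetical table of fixes learned from earlier, similar failures.
known_fixes = {"TimeoutError": "Resume from the last fetched issue."}

record = capture_failure(
    "fetch_issues",
    TimeoutError("delegate timed out"),
    ["fetched 3 of 5 issues"],
    known_fixes,
)
print(record.user_message())
```

Adding an entry to known_fixes after a failure is resolved is one simple way to leave the system better prepared for the next similar case.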

Routing Still Decides Everything

A lot of the work also went into routing: centralizing it, hardening it, and making it less magical.

Most dangerous agent mistakes are not dramatic model failures. They are classification mistakes.

The wrong route gets picked. The wrong agent receives the task. A read-like request drifts toward mutation. A tool is available, but wrong for the context.

That is why I keep treating routing as infrastructure, not prompt flavor.

If the first decision is wrong, every later safeguard becomes more expensive.
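
One way to treat routing as infrastructure is a structural check that runs after a route is chosen and before anything executes. The sketch below is a deliberately simple assumption, not how XelaBot classifies requests: Route, READ_VERBS, and check_route are hypothetical, and the first-word heuristic stands in for a real intent classifier. It illustrates one case only, a read-like request drifting toward a mutating route.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical route description; `mutates` marks state-changing routes."""
    name: str
    mutates: bool


# Toy heuristic: requests that start with these verbs are treated as read-only.
READ_VERBS = {"show", "list", "find", "get", "summarize"}


def looks_read_only(request: str) -> bool:
    """Return True when the first word of the request is a read-style verb."""
    words = request.lower().split()
    return bool(words) and words[0] in READ_VERBS


def check_route(request: str, route: Route) -> Tuple[bool, str]:
    """Reject a mutating route when the request looks read-only."""
    if route.mutates and looks_read_only(request):
        return False, (
            f"route '{route.name}' mutates state "
            "but the request looks read-only"
        )
    return True, "ok"


allowed, reason = check_route(
    "list open issues in Plane", Route("issue_update", mutates=True)
)
```

Here the check refuses the route before any tool runs, which is cheaper than detecting the mistake after a mutation has already happened.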

The Goal: Operational Honesty

If I had to summarize this phase in one phrase, it would be this:

XelaBot is becoming more operationally honest.

It is getting better at knowing:

when work is incomplete
when approval changed the shape of a run
when recovery should surface to the user
when a route is unsafe
when evidence is too weak for a confident claim

That is the kind of progress I care about most.

Not just a bot that sounds capable. A bot that can act, pause, recover, and keep trust when the path is messy.