AI Agents Need Release Reports, Not Just Traces

AI agents are starting to look less like isolated model calls and more like software systems.

They have prompts, tools, retrieval strategies, policies, runtime assumptions, model configs, scoring rules, and cost constraints. When one of those pieces changes, the important question is not just whether one run looked good.

It is whether the candidate should replace the current system.

Software teams already have a language for this. We compare branches against main. We run tests. We inspect diffs. We track regressions. We produce build artifacts. We gate deployments.

Agent systems need a similar release boundary.

A useful report for an agent change might look something like this:

Agent Change Report

Candidate: retrieval-policy-b
Baseline: production-policy-a

Decision: REVIEW

Summary:
  Candidate improved the primary score and reduced token usage,
  but introduced regressions on protected cases.

Changed:
  retrieval policy, prompt bundle

Evidence:
  score improved
  token cost decreased
  protected-regression check failed

That is the product shape I keep coming back to: not just a trace, not just a score, but a release report for an AI-system change.

A trace tells you what happened.

A score tells you how one dimension performed.

A release report tells you whether a candidate system should ship — and why.

Start with the baseline

The concrete system I’ve been working on does agentic code search.

The baseline is not “give the model a whole repository and hope it finds the right thing.” Better tools already exist.

One of the tools I’m comparing against is jCodeMunch, an MCP server that exposes a codebase to an agent in a structured way. Instead of forcing the model to read entire files, it gives the agent tools for navigating symbols, definitions, outlines, references, and scoped code context.

That fits a broader pattern I think is directionally correct:

The model should not have to rediscover structure that can be computed, stored, and reused.

You can see this pattern in LLM wiki-style workflows, codebase graph tools, structured retrieval systems, and agent memory systems. The shared intuition is that raw context is expensive, and structured context compounds.

For codebases, this matters a lot. Real software systems are not flat bags of files. They contain calls, imports, references, class hierarchies, modules, ownership boundaries, and naming conventions. Those relationships are navigational structure.

So the claim I am testing is not:

Codebases should be exposed structurally to agents.

I think that is already the right direction.

The more interesting question is what we can evaluate on top of that baseline.

The candidate: structured retrieval plus look-ahead

The candidate feature I’m exploring is deterministic look-ahead.

If a tool already exposes the codebase as structured data, can we do useful exploration on the model’s behalf before handing context back to it?

Instead of asking the LLM to repeatedly discover nearby code through tool calls, can we build a graph of the codebase and explore a few steps ahead? Can we look two, three, or four hops outward along the call graph from an initial anchor, rank what seems relevant, and return a smaller but more useful context bundle?
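
To make the idea concrete, here is a minimal sketch of bounded look-ahead as a breadth-first traversal. It is an illustration under stated assumptions, not Searchbench's or jCodeMunch's actual implementation: the call graph, the symbol names, and the functions look_ahead and context_bundle are all hypothetical, and ranking by hop distance is only a stand-in for whatever relevance signal a real policy would use.

```python
from collections import deque

# Hypothetical call graph: symbol -> symbols it calls or references.
CALL_GRAPH = {
    "handle_request": ["parse_body", "authorize"],
    "parse_body": ["decode_json"],
    "authorize": ["load_session", "check_role"],
    "load_session": ["read_cookie"],
    "decode_json": [],
    "check_role": [],
    "read_cookie": [],
}


def look_ahead(graph, anchor, max_hops):
    """Return {symbol: hop distance} for symbols within max_hops of anchor."""
    distances = {anchor: 0}
    queue = deque([anchor])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # hop budget reached: do not expand this node further
        for neighbor in graph.get(current, []):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


def context_bundle(graph, anchor, max_hops, limit):
    """Rank reachable symbols by (hop distance, name) and keep the first few.

    Sorting ties by name keeps the output deterministic across runs.
    """
    distances = look_ahead(graph, anchor, max_hops)
    ranked = sorted(distances, key=lambda symbol: (distances[symbol], symbol))
    return ranked[:limit]


bundle = context_bundle(CALL_GRAPH, "handle_request", max_hops=2, limit=5)
print(bundle)
```

The traversal is deterministic by construction: the same graph, anchor, and hop budget always yield the same bundle, which is what makes the policy comparable across runs.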

The comparison is not:

unstructured retrieval
vs
structured retrieval

It is closer to:

structured retrieval
vs
structured retrieval + deterministic look-ahead

jCodeMunch gives the agent structured access to the codebase.

The candidate I’m testing adds look-ahead over that structure.

That is a small difference architecturally, but it creates a larger evaluation problem.

Token reduction is important, but token reduction is not the whole release question. If look-ahead reduces tokens but worsens localization quality, it should not be promoted. If it improves average score but breaks important protected cases, it should not be promoted automatically. If it only works because the dataset, scorer, or runtime changed underneath the comparison, the result is not trustworthy.

That is where Searchbench comes in.

Searchbench, briefly

Searchbench is the harness I’m building to evaluate that candidate change.

It compares a baseline code-search system against a candidate version with an added context-selection policy, such as graph look-ahead. It runs both against bounded code-localization tasks, collects traces and costs, scores the results, tracks regressions, and produces the evidence needed to decide whether the candidate should be promoted.

The benchmark I’m currently using is the JetBrains LCA bug localization dataset from Hugging Face. That gives the project a real external task surface: given an issue or task, can an agent localize the relevant files, functions, or symbols efficiently?

In simplified form, the loop looks like this:

baseline structured retrieval
→ candidate with look-ahead
→ fixed task slice
→ score report
→ regression report
→ promotion decision
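
The comparison step in that loop can be sketched roughly as follows. This is an assumption-laden illustration rather than the harness's real code: the function compare_runs, the task IDs, the scores, and the idea of a "protected" set of task IDs are all hypothetical. It assumes each run yields one score per task, with higher meaning better.

```python
def compare_runs(baseline, candidate, protected, tolerance=0.0):
    """Compare per-task scores (higher is better) on a shared task slice."""
    if not baseline:
        raise ValueError("task slice is empty")
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same task slice")

    improved, regressed = [], []
    for task_id in sorted(baseline):
        delta = candidate[task_id] - baseline[task_id]
        if delta > tolerance:
            improved.append(task_id)
        elif delta < -tolerance:
            regressed.append(task_id)

    count = len(baseline)
    return {
        # The average can improve even when individual tasks get worse.
        "mean_delta": sum(candidate.values()) / count - sum(baseline.values()) / count,
        "improved": improved,
        "regressed": regressed,
        "protected_regressions": [t for t in regressed if t in protected],
    }


# Hypothetical scores for a three-task slice.
baseline_scores = {"task-1": 0.5, "task-2": 0.75, "task-3": 0.5}
candidate_scores = {"task-1": 1.0, "task-2": 0.5, "task-3": 0.75}

report = compare_runs(baseline_scores, candidate_scores, protected={"task-2"})
print(report)
```

In this made-up example the mean score goes up while a protected task regresses, which is exactly the case an average alone would hide.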

That loop is the part I care about.

A Searchbench-shaped report

The report I want Searchbench to produce is not “candidate got a better score.”

It should look more like this:

Searchbench Candidate Report

Baseline:
  structured code retrieval

Candidate:
  structured retrieval + graph look-ahead

Decision:
  REVIEW

Summary:
  Candidate improved graph-distance scores and reduced token usage,
  but failed the protected-regression threshold.

Result:
  localization quality: improved
  token usage: improved
  regressions: needs review

Recommendation:
  Do not promote automatically.

This is still conceptual. The current harness is moving toward this shape, but this is the object I want to make concrete.

The important thing is that the report treats the look-ahead policy as a release candidate.

It changed something specific.

It ran against a specific benchmark.

It improved some dimensions.

It regressed others.

It produced a decision.

That is much more useful than a score floating by itself.

The unit of optimization is bigger than I expected

One thing that surprised me while building Searchbench is that many AI tools optimize a smaller unit than I expected.

Tools like Langfuse, LangSmith, LangChain, and DSPy are useful. They give us traces, prompts, datasets, evaluators, workflows, and program-level optimization.

But once I started building a real harness, the unit I needed to reason about was larger than a prompt, a trace, or even a single LM program.

A candidate system version might change:

policy.py
prompt templates
context-selection heuristics
graph traversal behavior
tool schemas
model/provider settings
runtime assumptions
scoring reducers
dataset slices

Any one of those changes can affect behavior.

So when a run improves, I do not only want to know whether the output looked better. I want to know what changed, what stayed the same, what improved, what regressed, whether I can reproduce it, and whether the candidate should replace the current baseline.
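
One way to answer "what changed, what stayed the same" is to fingerprint each component of a system version and diff the two manifests. The sketch below assumes a version can be described as a JSON-serializable manifest; the component names, values, and the functions fingerprint and diff_manifests are hypothetical and not taken from Searchbench.

```python
import hashlib
import json


def fingerprint(value):
    """Stable short hash of any JSON-serializable component."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def diff_manifests(baseline, candidate):
    """Split component names into (changed, unchanged) by fingerprint."""
    changed, unchanged = [], []
    for name in sorted(set(baseline) | set(candidate)):
        same = (
            name in baseline
            and name in candidate
            and fingerprint(baseline[name]) == fingerprint(candidate[name])
        )
        (unchanged if same else changed).append(name)
    return changed, unchanged


# Hypothetical manifests: only the retrieval policy differs.
BASELINE = {
    "policy": {"name": "structured-retrieval"},
    "prompt_bundle": "v3",
    "model": {"provider": "example", "temperature": 0},
    "scorer": "graph-distance",
    "dataset_slice": "slice-a",
}
CANDIDATE = dict(BASELINE, policy={"name": "structured-retrieval+look-ahead", "max_hops": 2})

changed, unchanged = diff_manifests(BASELINE, CANDIDATE)

# If the scorer or dataset slice moved, the two runs are not comparable.
comparable = not ({"scorer", "dataset_slice"} & set(changed))
print(changed, unchanged, comparable)
```

A check like comparable guards against the failure mode where a candidate only looks better because the dataset, scorer, or runtime changed underneath the comparison.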

That is a different interface than a trace viewer.

It is also a different interface than a prompt optimizer.

Prompt optimization is one possible input to the system. It can produce a better candidate. But it does not, by itself, answer whether that candidate should ship.

For example, imagine a prompt optimizer finds a version of a code-search prompt that uses fewer tokens. That sounds good until the localization benchmark shows that the agent now stops too early and misses the files that actually need to change. In that case, the optimized prompt is not “better” in the release sense. It is cheaper, but less useful for the task.

That is the distinction I care about.

A tool can optimize one part of the system and still produce a candidate that should not be promoted.

From scores to promotion decisions

A score by itself is useful, but incomplete.

A trace by itself is useful, but incomplete.

Even a dashboard of scores over time is not quite the final object I want.

The promotion rule is what gives the report teeth.

For example:

Promote only if:
  primary score improves
  token cost stays within threshold
  no protected cases regress
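
Written as code, a rule like this might look like the sketch below. The Evidence fields, the decide function, and the default threshold are illustrative assumptions; the decision labels mirror the REVIEW decision used in the example reports, with PROMOTE added as its hypothetical counterpart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    score_delta: float          # candidate minus baseline; higher is better
    token_cost_ratio: float     # candidate tokens divided by baseline tokens
    protected_regressions: int  # number of protected cases that got worse


def decide(evidence, max_cost_ratio=1.0):
    """Apply the three-part rule and return (decision, reasons)."""
    reasons = []
    if evidence.score_delta <= 0:
        reasons.append("primary score did not improve")
    if evidence.token_cost_ratio > max_cost_ratio:
        reasons.append("token cost exceeded threshold")
    if evidence.protected_regressions > 0:
        reasons.append("protected cases regressed")
    # Any failed condition blocks automatic promotion.
    return ("PROMOTE" if not reasons else "REVIEW", reasons)


# Hypothetical evidence: better and cheaper on average, but one protected case broke.
decision, reasons = decide(
    Evidence(score_delta=0.08, token_cost_ratio=0.7, protected_regressions=1)
)
print(decision, reasons)
```

Returning the reasons alongside the decision matters: the report should say not only that a candidate was held back, but which condition held it back.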

Without a rule, scores are easy to rationalize. With a rule, the system can say:

This candidate is better on average, but not safe to promote.

That is the kind of answer I want from AI evaluation tooling.

What existing tools bought me — and what they did not

Existing tools are valuable. This is not a complaint that the ecosystem is useless.

Langfuse is useful for traces, datasets, scores, and experiment visibility.

LangSmith and related tools are useful for evaluating chains, agent runs, and application behavior.

DSPy is useful for optimizing language-model programs against metrics.

But the thing I found myself wanting was not only a better prompt, a better trace, or a better score.

I wanted a higher-level release object that could say:

This candidate changed something specific.
It ran against a fixed task slice.
It improved some metrics.
It regressed some cases.
It should or should not be promoted.

That is the layer I think is underdeveloped.

The broader systems shape

The infrastructure instincts behind this are not new.

Build systems already understand inputs, outputs, dependency graphs, caching, and reproducibility.

CI systems already understand candidate changes, checks, failures, and release gates.

Infrastructure control planes already understand desired state, observed state, reconciliation, and status.

The AI ecosystem has many strong tools for traces, evals, prompts, retrieval, and workflows. But as agent systems become more complex, I think we will need more release-engineering-shaped tools around them.

That requires artifacts, not just logs.

Regressions, not just averages.

Baselines, not just scores.

Promotion rules, not just dashboards.

Where this is going

I do not think this needs to start as a giant platform.

The first useful version can be small:

Run a baseline.
Run a candidate.
Compare them.
Show what changed.
Show what improved.
Show what regressed.
Make a promotion recommendation.

For Searchbench, that means a report over agentic code-search behavior.

Eventually, the same shape could apply to other agent systems:

coding agents
retrieval agents
support agents
tool-using research agents
workflow agents

Anywhere the system is complex enough that changing a prompt, policy, tool schema, retrieval strategy, or graph traversal heuristic can have non-obvious effects.

The point is not to replace every eval platform, tracing tool, retrieval system, or agent framework.

The point is to define the release boundary around an AI system.

The thesis

AI agents need release reports, not just traces.

A trace tells you what happened.

A score tells you how one dimension performed.

A release report tells you whether a candidate system should ship — and why.

That is the object I’m trying to build toward.