
I built SearchBench to answer a question that kept bothering me:
What is your coding agent actually searching for?
Not just what file it eventually names.
Not just how many tokens a retrieval tool claims to save.
I wanted to know what evidence the agent found, how close it got to the real fix, what it spent getting there, and whether the search behavior could be improved instead of guessed at.
The first result looked like a clean win.
Figure: IC vs jCodeMunch, 5-run JAX slice. Static article figure; the comparison values are copied directly into the blog layer instead of being loaded from SearchBench at render time. Exact-hit rate is measured on completed runs.
Panels: search spend profile (median and mean tokens normalized to completed runs, with raw totals shown as aggregate spend) and outcome profile (completion, exact hits overall, and exact-hit rate on completed runs, with mean gold hop overlaid).
jCodeMunch: 5/5 completed, 3/5 exact hits overall, 60% exact-hit rate on completed runs. Iterative Context: 5/5 completed, 4/5 exact hits overall, 80% exact-hit rate on completed runs.
Median tokens: jCodeMunch 347,480 vs Iterative Context 127,985. Mean tokens per completed run: jCodeMunch 277,741 vs Iterative Context 101,586.8. Raw total completed-run tokens: jCodeMunch 1,388,705 vs Iterative Context 507,934. Mean gold hop: jCodeMunch 4.8 vs Iterative Context 2.4.
IC matched jCodeMunch on completion and then cleared it on both exact hits and median completed-run tokens.
Figure: IC vs jCodeMunch, 10-run diverse slice. The incumbent's exact-hit rate is measured on completed runs; the challenger completed all 10.
Panels: search spend profile (median and mean tokens normalized to completed runs, with raw totals shown as aggregate spend) and outcome profile (completion, exact hits overall, and exact-hit rate on completed runs, with mean gold hop overlaid).
jCodeMunch: 3/10 completed, 1/10 exact hits overall, 33.3% exact-hit rate on completed runs. Iterative Context: 10/10 completed, 6/10 exact hits overall, 60% exact-hit rate on completed runs.
Median tokens: jCodeMunch 56,874 vs Iterative Context 56,588.5. Mean tokens per completed run: jCodeMunch 67,679 vs Iterative Context 173,377.2. Raw total completed-run tokens: jCodeMunch 203,037 vs Iterative Context 1,733,772. Mean gold hop: jCodeMunch 8.0 vs Iterative Context 4.8.
The big gap here is reliability: three completions versus ten.
Median completed-run tokens were effectively tied, but IC spent much more in total: it finished all 10 runs, and its mean spend per completed run was also higher.
jCodeMunch sits in a growing category of structured code-retrieval tools for agents: tools that expose repository structure through symbols, outlines, references, or compact code context instead of raw file reads.
IC inherits from that idea. It also depends on structured code information. But IC pushes more of the search work into deterministic computation performed on the LLM's behalf. Every time the model asks a query, IC can resolve fuzzy anchors, expand nearby graph context, bound the candidate frontier, and record replayable evidence before handing the model a smaller search state.
So this comparison is not "structure versus no structure."
It is closer to:
structured retrieval as a tool surface
vs
structured retrieval compressed into a policy loop
On the 5-run JAX slice, IC looked strong: 4/5 exact hits versus 3/5 for jCodeMunch, at a median of 127,985 tokens versus 347,480.
That was the eye-catching result: against a structured retrieval incumbent, IC localized more files and used far fewer tokens.
But the next jCodeMunch slice complicated the story in a useful way.
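On the 10-run diverse slice, IC still won hard on reliability and exact hits: 10/10 completed and 6/10 exact hits, versus 3/10 completed and 1/10 exact hits for jCodeMunch.
But it did not win on raw token efficiency: median tokens were effectively tied (56,588.5 versus 56,874), and IC's raw total was 1,733,772 tokens against jCodeMunch's 203,037.
That total looks bad for IC partly because IC actually finished all 10 runs while jCodeMunch only completed 3. IC's mean per completed run was also higher (173,377.2 versus 67,679), so completion count is not the whole explanation. The honest read is: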
IC won the 10-run diverse slice on reliability and exact localization, not on raw token efficiency.
That distinction matters. SearchBench is not useful because it always gives me the flattering graph. It is useful because it forces the graph to stay honest.
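The gap between median, mean, and total is plain arithmetic on the numbers reported for the 10-run diverse slice. Here is a small sketch that uses only the reported totals, medians, and completion counts; the dictionary layout and the helper name are illustrative, not SearchBench's API.

```python
# Reported values from the 10-run diverse slice (copied from this article).
# The dict layout and helper name are illustrative, not SearchBench's API.
SLICE = {
    "jCodeMunch": {"completed": 3, "total_tokens": 203_037, "median_tokens": 56_874.0},
    "Iterative Context": {"completed": 10, "total_tokens": 1_733_772, "median_tokens": 56_588.5},
}


def mean_tokens_per_completed_run(entry: dict) -> float:
    """Mean spend is the raw total divided by the number of completed runs."""
    return entry["total_tokens"] / entry["completed"]


means = {name: mean_tokens_per_completed_run(e) for name, e in SLICE.items()}
total_ratio = SLICE["Iterative Context"]["total_tokens"] / SLICE["jCodeMunch"]["total_tokens"]
mean_ratio = means["Iterative Context"] / means["jCodeMunch"]

for name, value in means.items():
    print(f"{name}: mean tokens per completed run = {value:,.1f}")
print(f"total ratio (IC / jCodeMunch): {total_ratio:.1f}x")
print(f"mean ratio  (IC / jCodeMunch): {mean_ratio:.1f}x")
```

The raw totals differ by roughly 8.5x, the per-completed-run means by roughly 2.6x, and the medians are nearly equal. Which number you quote changes the story, so the report keeps all three visible.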
The next baseline was harder.
Figure: IC vs Bash, 15-run diverse slice. Static article figure; the comparison values are copied directly into the blog layer instead of being loaded from SearchBench at render time. Exact-hit rate is measured on completed runs.
Panels: outcome profile (completion, exact hits overall, and exact-hit rate on completed runs, with mean gold hop overlaid) and search spend profile (median and mean tokens normalized to completed runs, with raw totals shown as aggregate spend).
Bash: 8/15 completed, 8/15 exact hits overall, 100% exact-hit rate on completed runs. Iterative Context: 15/15 completed, 12/15 exact hits overall, 80% exact-hit rate on completed runs.
Median tokens: Bash 22,634.5 vs Iterative Context 136,128. Mean tokens per completed run: Bash 71,500.3 vs Iterative Context 194,347.3. Raw total completed-run tokens: Bash 572,002 vs Iterative Context 2,915,210. Mean gold hop: Bash 0.0 vs Iterative Context 2.4.
This is the sobering control comparison: Bash was often sharp when it landed, but it did not finish often enough.
These values come from the current 15-run diverse Bash baseline, not the older 10-run ablation slices.
Bash was not a toy baseline. It was a competent native-search agent with shell access from the repository root.
It could use ordinary codebase-navigation tools like:
rg
git grep
find
sed
python
Outputs and timeouts were bounded, but it was not artificially crippled. It used the same task framing and was scored by the same SearchBench exact/hop/token machinery.
That matters because Bash is the default workflow for a reason. Modern coding agents already know how to search with shell tools.
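On the current Bash comparison, Bash completed 8/15 runs with 8 exact hits at a median of 22,634.5 tokens; IC completed 15/15 with 12 exact hits at a median of 136,128 tokens.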
Bash was brittle but sharp.
When it completed, it was cheap and accurate. IC completed much more reliably, and it found more exact hits overall, but it was much more expensive.
That changed the story.
IC looked excellent against jCodeMunch. Bash showed that the real default baseline is harder to beat.
SearchBench did not just validate my expectation. It corrected it.
Once Bash showed that IC was reliable but too expensive, the next question was obvious:
Which part of IC was helping?
Was it anchor quality?
Was it lookahead?
Was it something else?
Figure: IC ablation. The key product signal here is minimal-anchor-v1: mean gold hop improves from 4.0 to 2.4 and composite rises from 0.533 to 0.640, even though the token story still needs work.
Panels: IC policy ladder outcomes (completion, exact-hit-on-completed, and mean gold hop across the copied current and Bash-family ablation snapshots); IC policy spend frontier (median tokens per completed run, with quality overlaid, across the same snapshots); minimal-anchor-v1 outcome signal (completion and exact-hit gains, with composite score overlaid); minimal-anchor-v1 search profile (median and mean tokens normalized to completed runs, with raw totals shown as aggregate spend); IC median-token snapshots (a compact token-only read across the copied policy ladder, useful as a secondary view, not the main story).
The first pair shows the broader policy ladder. The second pair isolates the follow-up continuation where minimal-anchor-v1 was tested directly against a lookahead-only incumbent.
The follow-up ablation gave the most interesting signal of the whole batch.
In the IC lookahead-only versus minimal-anchor-v1 continuation, mean gold hop improved from 4.0 to 2.4 and the composite score rose from 0.533 to 0.640.
This was the strongest internal IC improvement signal in the run set.
minimal-anchor-v1 improved completion, exact hits, mean gold hop, mean score, and median tokens relative to the lookahead-only incumbent in that follow-up slice.
It did not solve the token problem. Both sides were still far above the current 20k token-efficiency budget.
But that is exactly why the result was useful.
It did not say:
IC is done.
It said:
Anchor quality can move localization, lookahead alone is not enough, and stopping/token discipline is now the next problem.
That was the product moment.
I did not need a bespoke research script. SearchBench already had stable tasks, comparable rounds, incumbent/challenger roles, bundle artifacts, hop-distance scoring, token accounting, and reports.
The next question became a normal run.
These runs use Long Code Arena bug-localization tasks: given a real issue and a repository snapshot at the buggy commit, the system predicts which files belong to the human fix.
SearchBench scores exact hits against the dataset's gold changed files, but it also builds a tree-sitter code graph for the repo and measures how far each prediction was from the fix.
That lets the report distinguish an exact hit from a near miss and a totally wrong-neighborhood miss.
hop 0: exact hit
hop 1-2: near miss
hop 12: wrong neighborhood
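To make the hop idea concrete, here is a minimal sketch of a hop-distance scorer as a breadth-first search over a file-level graph. It is not SearchBench's implementation. It assumes an undirected adjacency map, caps distance at 12 to match the ceiling used in this article, and uses made-up file names. The article names only three bands, so the "far miss" label for hops 3-11 is my own placeholder.

```python
from collections import deque

MAX_HOP = 12  # this article treats hop 12 as the "wrong neighborhood" ceiling


def hop_distance(graph: dict[str, set[str]], predicted: str, gold: set[str]) -> int:
    """Fewest edges from a predicted file to any gold file, capped at MAX_HOP.

    Assumes `graph` is an undirected adjacency map (each edge listed both ways).
    """
    if predicted in gold:
        return 0
    seen = {predicted}
    queue = deque([(predicted, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist >= MAX_HOP:
            continue  # do not expand past the cap
        for neighbor in graph.get(node, ()):
            if neighbor in seen:
                continue
            if neighbor in gold:
                return dist + 1
            seen.add(neighbor)
            queue.append((neighbor, dist + 1))
    return MAX_HOP  # unreachable or too far: score as max-hop


def bucket(hop: int) -> str:
    """Bands from the article; 'far miss' for hops 3-11 is an assumed label."""
    if hop == 0:
        return "exact hit"
    if hop <= 2:
        return "near miss"
    if hop >= MAX_HOP:
        return "wrong neighborhood"
    return "far miss"


# Hypothetical toy graph; file names are invented for illustration.
toy_graph = {
    "api.py": {"dispatch.py"},
    "dispatch.py": {"api.py", "xla.py"},
    "xla.py": {"dispatch.py"},
    "docs.py": set(),
}
gold_files = {"xla.py"}

for guess in ("xla.py", "dispatch.py", "api.py", "docs.py"):
    hop = hop_distance(toy_graph, guess, gold_files)
    print(f"{guess}: hop {hop} ({bucket(hop)})")
```

A disconnected prediction like `docs.py` falls straight to the cap. That is the behavior that separates a near miss from a miss in the wrong neighborhood.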
A completed run means the system finished and produced a scored answer instead of timing out, failing, or getting blocked.
An exact hit means the completed run landed on the target strongly enough to count as a success under the scoring rule. In these graphs, I treat score >= 0.8 as an exact hit.
So a run can complete and still miss.
That distinction is why a result like jCodeMunch's on the 10-run diverse slice (3/10 completed, 1/10 exact hits) matters:
The tool finished only three runs, and only one of those three was a real hit.
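The three outcome numbers in the graphs can be sketched in a few lines. The 0.8 threshold comes from the definition above; the `Run` record, the `summarize` helper, and the individual scores are hypothetical, chosen only to reproduce a 3/10 completed, 1/10 exact shape.

```python
from dataclasses import dataclass
from typing import Optional

EXACT_HIT_THRESHOLD = 0.8  # from the article: score >= 0.8 counts as an exact hit


@dataclass
class Run:
    score: Optional[float]  # None means the run timed out, failed, or was blocked


def summarize(runs: list[Run]) -> dict:
    """Completion count, exact hits overall, and exact-hit rate on completed runs."""
    total = len(runs)
    completed = [r for r in runs if r.score is not None]
    exact = [r for r in completed if r.score >= EXACT_HIT_THRESHOLD]
    return {
        "total": total,
        "completed": len(completed),
        "exact_hits": len(exact),
        "exact_rate_overall": len(exact) / total if total else 0.0,
        "exact_rate_on_completed": len(exact) / len(completed) if completed else 0.0,
    }


# Hypothetical per-run scores: three completed runs, seven that never finished.
runs = [Run(0.9), Run(0.4), Run(0.1)] + [Run(None)] * 7
summary = summarize(runs)
print(summary)
```

The same single hit reads as 10% overall and 33.3% on completed runs, so both rates are shown next to the completion count.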
This is a small artifact-backed case study, not a universal benchmark claim. But it is already enough to change what I believe about the problem.
The useful thing about IC is not that graph search magically beats Bash.
It is that IC moves parts of the search process out of the model's transcript and into explicit, deterministic knobs.
Fuzzy anchor finding:
Where should the search start?
Lookahead:
Once we start somewhere, what nearby evidence should we inspect?
Stopping:
Once we have enough evidence, when should the search stop spending tokens?
Bash is powerful, but its search behavior is mostly buried in a transcript.
IC performs deterministic work on the model's behalf and makes the resulting search policy explicit enough to ablate.
IC is a search policy over structured code information.
Instead of making the model manually choose every search step, IC performs deterministic computation on the model's behalf: resolving fuzzy anchors, expanding graph neighborhoods, bounding candidate frontiers, and recording replayable evidence.
Anchor quality asks whether IC starts in the right code neighborhood.
A good anchor is a file, symbol, or subsystem close to the gold changed files under SearchBench's hop-distance scorer. A bad anchor starts the run in the wrong neighborhood.
Current lesson:
IC often looks bimodal. It lands on hop 0, or it falls to max-hop / hop 12. That suggests the starting neighborhood matters a lot.
The minimal-anchor-v1 continuation is interesting because it improved mean gold hop from 4.0 to 2.4, which suggests anchor policy can materially improve localization.
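One way to sanity-check the bimodal reading: if every scored run lands on hop 0 or on hop 12, mean gold hop is just the miss fraction times 12. The sketch below uses only the completion, exact-hit, and mean-hop figures reported in this article. Per-run hop values are not listed here, so the all-or-nothing split is an assumption being tested, not a reported fact. It also assumes that exact hits sit at hop 0 and that mean gold hop is averaged over completed runs.

```python
MAX_HOP = 12


def bimodal_mean_hop(scored_runs: int, exact_hits: int, max_hop: int = MAX_HOP) -> float:
    """Mean hop if every exact hit sits at hop 0 and every miss sits at max_hop."""
    misses = scored_runs - exact_hits
    return misses * max_hop / scored_runs


# (label, completed runs, exact hits, mean gold hop reported in this article)
REPORTED = [
    ("IC, 5-run JAX slice", 5, 4, 2.4),
    ("jCodeMunch, 5-run JAX slice", 5, 3, 4.8),
    ("IC, 10-run diverse slice", 10, 6, 4.8),
    ("jCodeMunch, 10-run diverse slice", 3, 1, 8.0),
    ("IC, 15-run diverse slice", 15, 12, 2.4),
    ("Bash, 15-run diverse slice", 8, 8, 0.0),
]

for label, completed, exact, reported in REPORTED:
    predicted = bimodal_mean_hop(completed, exact)
    match = abs(predicted - reported) < 1e-9
    print(f"{label}: predicted {predicted:.1f}, reported {reported:.1f}, match={match}")
```

Every reported mean is reproduced by the hop-0-or-hop-12 split. That is consistent with the bimodal pattern but does not prove it, since other per-run distributions could produce the same averages.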
Lookahead is controlled expansion from an anchor.
Given a starting file or symbol, IC decides which nearby evidence to inspect next: related files, references, callers, callees, neighboring graph nodes, or policy-selected frontier nodes.
Current lesson:
Lookahead is useful, but it is not magic. Without a good anchor and a disciplined stop rule, lookahead can become expensive wandering.
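As an illustration of "controlled expansion", here is a minimal sketch of lookahead as a breadth-first walk bounded by depth and by a candidate cap. This is not IC's actual policy. The function name, the `max_depth` and `max_candidates` parameters, and the file names are all hypothetical.

```python
from collections import deque


def lookahead(graph: dict[str, set[str]], anchor: str,
              max_depth: int = 2, max_candidates: int = 8) -> list[str]:
    """Breadth-first expansion from an anchor, bounded by depth and candidate count."""
    visited = [anchor]
    seen = {anchor}
    queue = deque([(anchor, 0)])
    while queue and len(visited) < max_candidates:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth bound reached: do not expand this node
        for neighbor in sorted(graph.get(node, ())):  # sorted for deterministic order
            if neighbor in seen:
                continue
            seen.add(neighbor)
            visited.append(neighbor)
            queue.append((neighbor, depth + 1))
            if len(visited) >= max_candidates:
                break  # candidate cap reached: stop collecting
    return visited


# Hypothetical graph; file names are invented for illustration.
toy_graph = {
    "api.py": {"dispatch.py", "tracing.py"},
    "dispatch.py": {"xla.py"},
    "tracing.py": {"core.py"},
    "xla.py": {"pjit.py"},
}

print(lookahead(toy_graph, "api.py", max_depth=2))
print(lookahead(toy_graph, "api.py", max_depth=3, max_candidates=3))
```

Both bounds are deterministic knobs. Loosening either one widens what the model sees and raises what the run can spend, which is how unbounded lookahead turns into expensive wandering.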
Stopping is the newly surfaced third knob.
It asks when IC should stop searching and finalize instead of spending more tokens.
Current lesson:
The Bash comparison showed that IC is reliable but too token-heavy. The next question is whether IC spends tokens before finding useful evidence, or after it already has enough.
That motivates metrics like:
tokens_to_first_near_anchor
tokens_after_first_near_anchor
tokens_to_first_exact_anchor
tokens_after_first_exact_anchor
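These four metrics could be computed from a per-step trace roughly as follows. This is a sketch under assumptions: the `Step` record is hypothetical, "near" is taken to mean hop 2 or closer, and the tokens of the step that first reaches the anchor are counted in the "to first" side.

```python
from dataclasses import dataclass
from typing import Optional

NEAR_HOP = 2   # assumption: "near" means within the hop 1-2 near-miss band
EXACT_HOP = 0


@dataclass
class Step:
    tokens: int    # tokens spent on this search step
    best_hop: int  # closest hop distance to a gold file among evidence seen in this step


def tokens_around_first(steps: list[Step], max_hop: int) -> tuple[Optional[int], Optional[int]]:
    """Return (tokens_to_first, tokens_after_first) for the first step with best_hop <= max_hop.

    Returns (None, None) if no step ever gets that close.
    """
    spent = 0
    for i, step in enumerate(steps):
        spent += step.tokens
        if step.best_hop <= max_hop:
            after = sum(s.tokens for s in steps[i + 1:])
            return spent, after
    return None, None


# Hypothetical trace: a wrong start, then a near anchor, then the exact file, then extra spend.
trace = [Step(4_000, 12), Step(3_000, 2), Step(5_000, 0), Step(8_000, 0)]

tokens_to_first_near_anchor, tokens_after_first_near_anchor = tokens_around_first(trace, NEAR_HOP)
tokens_to_first_exact_anchor, tokens_after_first_exact_anchor = tokens_around_first(trace, EXACT_HOP)
print(tokens_to_first_near_anchor, tokens_after_first_near_anchor)
print(tokens_to_first_exact_anchor, tokens_after_first_exact_anchor)
```

In this made-up trace, 8,000 of 20,000 tokens are spent after the exact file was already in view. That is the kind of waste a stopping rule would target.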
The ablation is not random variant testing. It is a test of a decomposed search policy:
start better
-> look around better
-> stop sooner
The ablation looks simple only because the harness made it simple.
To isolate anchor quality from lookahead, you need:
the same task set
the same repo snapshots
the same model bounds
the same scorer
the same token accounting
comparable failure categories
a stable artifact model
Without that, the experiment becomes scripts, logs, screenshots, and vibes.
With SearchBench, it becomes a challenger round.
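The controlled-comparison requirement can be sketched as a check that two rounds differ only in the field being ablated. The `RoundConfig` fields mirror the list above, but the class, the field values, and the helper are hypothetical and are not SearchBench's real artifact model.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RoundConfig:
    task_set: str
    repo_snapshot: str
    model_bounds: str
    scorer: str
    token_accounting: str
    policy: str  # the one field an ablation is allowed to change


def uncontrolled_differences(incumbent: RoundConfig, challenger: RoundConfig,
                             varied: tuple[str, ...] = ("policy",)) -> list[str]:
    """Names of fields that differ between rounds but were supposed to be held fixed."""
    return [
        f.name
        for f in fields(incumbent)
        if f.name not in varied
        and getattr(incumbent, f.name) != getattr(challenger, f.name)
    ]


# Hypothetical values, for illustration only.
incumbent = RoundConfig(
    task_set="diverse-15",
    repo_snapshot="buggy-commit",
    model_bounds="default",
    scorer="exact+hop",
    token_accounting="per-completed-run",
    policy="lookahead-only",
)
clean_challenger = replace(incumbent, policy="minimal-anchor-v1")
leaky_challenger = replace(incumbent, policy="minimal-anchor-v1", task_set="diverse-10")

print(uncontrolled_differences(incumbent, clean_challenger))
print(uncontrolled_differences(incumbent, leaky_challenger))
```

An empty list means the comparison isolates the policy change. Anything else names the confound before it ends up in a graph.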
SearchBench buys:
same matches
same scorer
same bundle format
exact + hop scoring
token accounting
incumbent/challenger comparison
replay/projection artifacts
The details of the planning layer are internal, but the product behavior is simple:
the evidence, task set, scorer, and run artifacts are stable enough that a surprising result can become the next controlled experiment.
That is what I want from an agent evaluation harness.
Not just a leaderboard.
A way to ask better questions the moment a result surprises me.
This does not prove IC universally beats Bash.
It does not prove jCodeMunch is bad.
It does not prove SearchBench is a replacement for full repair benchmarks like SWE-bench.
It does not prove token efficiency is solved.
The current sample sizes are small. Some slices are homogeneous. Some runs complete while others fail, which makes token comparisons tricky. Median tokens, mean tokens, and total tokens can tell different stories when completion counts differ.
What this does show is more specific:
IC can outperform a structured retrieval incumbent on some small slices.
Bash is a strong default baseline and should be treated seriously.
Hop distance makes misses more diagnostic.
Ablations become much easier when runs share tasks, scorers, artifacts, and roles.
Anchor quality appears to be a real optimization knob.
Stopping/token discipline is the next major IC problem.
The product is not just better search.
Better search is a consequence.
The product is making codebase navigation legible.
Your senior engineers have a mental map of the repo. Your agents do not. SearchBench turns agent search failures into evidence about that missing map.
It helps answer questions like:
Where does the agent get lost?
Which files are decoys?
Which search policy enters the right neighborhood first?
Which tool keeps spending after it has enough evidence?
Which challenger should be promoted?
SearchBench is for teams who need to know what their coding agents are actually searching before they trust, tune, or replace the search policy.
The next step is not "declare victory."
The next step is to optimize IC against what the harness just taught me.
Measure token waste around first useful anchor.
Add Bash-like lexical anchor seeding.
Add token-aware stopping.
Run the optimization ladder against Bash.
The point is not that IC already wins.
It does not.
The point is that SearchBench made the search behavior visible enough to improve.
Before you trust your coding agent, ask what it actually searched.