
The hook is not that I added Buck2.
The hook is that I realized I was giving my AI cognitive overload.
I had handed the repository a long list of valid-looking commands and expected an agent to infer which ones mattered.
Then I deleted this:
**Debugging commands** (use only when a hook failed and you need the same check ad hoc):
| Command | Purpose |
| --- | --- |
| `nix develop -c searchbench-update-repomix` | Regenerate + `git add` `repomix-output.xml` |
| `nix develop -c searchbench-repomix-fresh-check` | Same as pre-push Repomix gate |
| `nix develop -c searchbench-staticcheck` | `staticcheck ./...` |
| `nix develop -c searchbench-golangci` | `golangci-lint run ./...` |
| `nix develop -c searchbench-go-mod-tidy-check` | Fail if `go mod tidy` would change files |
| `nix develop -c searchbench-prompt-contract-check` | Tests for `.templ` XML prompt contracts |
| `nix develop -c searchbench-refresh-pkl-example-fixtures` | Regenerate optimize-IC fixtures |
| `nix develop -c searchbench-go-build-root` | `go build -o searchbench ./cmd/searchbench` |
| `nix develop -c searchbench-architecture-check` | Import-boundary tests |
| `nix develop -c searchbench-check-generated` | Pkl + templ generated outputs |
| `nix develop -c searchbench-check-pkl-generated` | Pkl bindings vs schema |
| `nix develop -c searchbench-check-templ-generated` | Templ-generated prompts |
| `nix develop -c searchbench-e2e` | Root package integration tests |
| `nix develop -c searchbench-go-test-all` | `go test ./...` |
| `nix develop -c searchbench-nix-flake-check` | `nix flake check` |
and replaced the operational surface with this:
+nix develop -c buck2 test //:check
+nix develop -c buck2 test //:check_full
That is the whole idea.
AI agents do not struggle because repositories lack commands.
They struggle because repositories expose too many operational decisions.
A list of debugging commands looks responsible.
It tells future contributors how to run the formatter, the linter, the generated-file checks, the integration tests, the static analysis pass, the Repomix snapshot, and the pre-push gate.
For a human, that can be fine. Humans bring social context. We know when a command is canonical, when it is historical, when it is only for debugging, and when the right move is to ignore the docs and ask someone.
Agents do not have that same context.
An agent sees a table of commands and has to infer:
Which command matters?
Which one is complete?
Which one is fast?
Which one is authoritative?
Which one mutates the tree?
Which one should run before commit?
Which one should run before push?
Which one is only for reproducing a failed hook?
Every command is a branch in the search space.
The problem is not that any individual command is bad. Most of these commands were useful. They came from real debugging sessions, real project constraints, and real attempts to make the repository easier to operate.
That is what makes the diff interesting.
The old system was not sloppy. It was already disciplined.
It still leaked too much operational cognition.
Before, the repository exposed a list of procedures.
nix develop -c searchbench-staticcheck
nix develop -c searchbench-golangci
nix develop -c searchbench-go-test-all
nix develop -c searchbench-check-generated
nix develop -c searchbench-check-templ-generated
nix develop -c searchbench-check-pkl-generated
nix develop -c searchbench-e2e
nix develop -c searchbench-architecture-check
nix develop -c searchbench-repomix-fresh-check
That surface area matters.
Even if every command is documented correctly, the repository is still asking the actor — human or agent — to reconstruct policy from procedures.
The actor has to understand:
ordering
completeness
intent
authority
speed
mutation semantics
The commands themselves do not carry enough meaning.
They are procedures without structure.
The agent-facing documentation eventually became a mirror of that complexity.
This was not bad documentation. It was accurate documentation. That was the problem.
diff --git a/AGENTS.md b/AGENTS.md
index 99518ef..61f8c7f 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -52,22 +52,16 @@ The flake provides a reproducible toolchain, pre-commit hooks, and `searchbench-
| Stage | What runs |
| --- | --- |
-| **`git commit` (pre-commit)** | Fast repo-local checks: formatting (Go/Nix/shell), hygiene, **golangci-lint** (includes **staticcheck** via `.golangci.yml`), **govet**, architecture + prompt contract tests, Pkl/templ generated-file checks, **Repomix** snapshot (`repomix-output.xml` regenerated and staged) |
-| **`git push` (pre-push)** | **`go test ./...`**, root **e2e**, **searchbench-check-generated**, **go mod tidy** check, **standalone staticcheck** (`searchbench-staticcheck`), **standalone golangci-lint**, **`nix flake check`**, **Repomix freshness** (regenerate `repomix-output.xml`; **fail the push** if it differs from what is committed — commit or amend the snapshot, then push again; hooks never auto-amend) |
+| **`git commit` (pre-commit)** | Hygiene (Go/Nix/shell formatting, JSON/YAML/TOML, merge conflicts, …), then **Repomix + `buck2 test //:check`** |
+| **`git push` (pre-push)** | **`buck2 test //:check_full`** |
-Hook staging avoids duplicate **staticcheck** on the same stage: pre-commit uses **golangci-lint** with `staticcheck` enabled in `.golangci.yml`. Pre-push runs **explicit** `searchbench-staticcheck` and `searchbench-golangci` as a fuller proof pass.
+The flake provides a **dev shell**, **git-hooks.nix** wiring, and the **Buck2 Nix cell** under `toolchains/`. It does **not** ship a separate `nix/tools/` command layer: substantive checks run through **Buck2**.
This is the transition I care about.
The docs still exist.
The hooks still exist.
But the shape of the interface changed. The agent no longer gets handed a catalog of project-specific spells and asked to infer which subset matters.
The docs point to graph targets.
The graph targets carry the semantics.
After, the repository exposes named targets in a graph.
//:check
//:check_full
Those targets are not just shorter commands.
They are semantic entrypoints.
Buck2 matters here because targets are named nodes in a dependency graph, not just shell commands with shorter names.
A target can depend on other targets. A suite can compose local checks from multiple parts of the repository.
The actor does not need to remember the checklist, because the checklist has become part of the graph.
//:check means:
fast enough for normal local validation
safe enough for pre-commit
covers Go tests, the CLI build, and the Iterative Context import smoke and pytest run
//:check_full means:
the full push gate
includes the fast gate
also runs basedpyright on Iterative Context and proves the committed Repomix snapshot is fresh
The root BUCK file says that directly:
diff --git a/BUCK b/BUCK
new file mode 100644
index 0000000..2f2e2e1
--- /dev/null
+++ b/BUCK
@@ -0,0 +1,25 @@
+load("@prelude//:rules.bzl", "sh_test", "test_suite")
+
+sh_test(
+ name = "repomix_fresh_check",
+ test = "repomix_fresh_check.sh",
+)
+
+# Fast gate: Go module tests + CLI build + Iterative Context `check` (import smoke + pytest; no Repomix).
+test_suite(
+ name = "check",
+ tests = [
+ "//src/searchbench-go:check",
+ "//src/iterative-context:check",
+ ],
+)
+
+# Full gate: Go `check` + Iterative Context `check_full` (adds basedpyright) + Repomix snapshot freshness (pre-push / manual).
+test_suite(
+ name = "check_full",
+ tests = [
+ "//src/searchbench-go:check",
+ "//src/iterative-context:check_full",
+ ":repomix_fresh_check",
+ ],
+)
That is the shape I wanted.
The important part is not that the command got shorter.
The important part is that the repository now has a place to say:
These are the entrypoints.
These are the dependencies.
This is the fast gate.
This is the full gate.
The policy moved out of a checklist and into a graph.
SearchBench is an agentic code-search evaluation harness. It compares a baseline code-search system against a candidate system, runs both against bounded tasks, collects evidence, scores results, tracks regressions, and produces a release decision.
That means the repository has a lot going on:
Go application code
Python submodule
Pkl schemas
templ-generated prompts
MCP backends
Repomix snapshots
release bundles
fake e2e paths
real adapter boundaries
This is exactly the kind of repository where an agent can get lost.
Not because the code is impossible to understand, but because the operational surface is too wide. There are many valid-looking things to run, many local conventions to remember, and many steps that are only meaningful in relation to other steps.
The agent should not decide how to validate the repo.
The repo should expose the validation graph.
The root BUCK file is tiny.
It contains very little code, and it compresses or hides a lot of information.
That is the point.
The important thing is not the number of lines. It is the shift in authority.
Before, the repository exposed procedures.
After, it exposes capability boundaries.
Buck2 gives the repository a way to define:
These are the sanctioned transformations of the system.
That matters much more for agents than for humans.
Humans can recover missing context socially.
That may be the real difference between agents and humans here.
Agents need the repository to make valid operations explicit.
The part I find most satisfying is not just the root target.
It is the way the root target composes other targets without needing to know their internal details.
The Go harness has its own BUCK file:
diff --git a/src/searchbench-go/BUCK b/src/searchbench-go/BUCK
new file mode 100644
index 0000000..f1a6fba
--- /dev/null
+++ b/src/searchbench-go/BUCK
@@ -0,0 +1,24 @@
+load("@prelude//:rules.bzl", "sh_test", "test_suite")
+
+sh_test(
+ name = "go_tests",
+ test = "go_tests.sh",
+)
+
+sh_test(
+ name = "go_cli_build",
+ test = "cli_build.sh",
+)
+
+sh_test(
+ name = "pkl_go_types",
+ test = "pkl_go_types.sh",
+)
+
+test_suite(
+ name = "check",
+ tests = [
+ ":go_tests",
+ ":go_cli_build",
+ ],
+)
That file says what the Go module needs:
run the Go tests
build the CLI
keep Pkl generation as an explicit opt-in target
The root file does not need to know the shape of that local workflow.
It just composes the Go module's :check with the Python tool's :check.
diff --git a/BUCK b/BUCK
new file mode 100644
index 0000000..2f2e2e1
--- /dev/null
+++ b/BUCK
@@ -0,0 +8,18 @@
+# Fast gate: Go module tests + CLI build + Iterative Context `check` (import smoke + pytest; no Repomix).
+test_suite(
+ name = "check",
+ tests = [
+ "//src/searchbench-go:check",
+ "//src/iterative-context:check",
+ ],
+)
+
+# Full gate: Go `check` + Iterative Context `check_full` (adds basedpyright) + Repomix snapshot freshness (pre-push / manual).
+test_suite(
+ name = "check_full",
+ tests = [
+ "//src/searchbench-go:check",
+ "//src/iterative-context:check_full",
+ ":repomix_fresh_check",
+ ],
+)
That is the build graph doing coordination work.
The Go target owns Go semantics.
The Python target owns Python semantics.
The root target owns repository semantics.
This is exactly the kind of boundary I want agents to see.
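To make the composition concrete, here is a minimal Python sketch that models the suites as a plain dictionary and expands a root target into the leaf checks it composes. It is an illustration of the idea, not how Buck2 resolves targets internally. The root and Go entries mirror the BUCK files shown in this post; the Iterative Context leaf names (`import_smoke`, `pytest`, `basedpyright`) are assumptions inferred from the root file's comments, since that BUCK file is not shown here, and `leaf_checks` is a hypothetical helper.

```python
# Hypothetical in-memory model of the target graph described in this post.
# Suites map to the labels they compose; any label not in SUITES is a leaf check.
SUITES = {
    "//:check": [
        "//src/searchbench-go:check",
        "//src/iterative-context:check",
    ],
    "//:check_full": [
        "//src/searchbench-go:check",
        "//src/iterative-context:check_full",
        "//:repomix_fresh_check",
    ],
    "//src/searchbench-go:check": [
        "//src/searchbench-go:go_tests",
        "//src/searchbench-go:go_cli_build",
    ],
    # Assumed layout: the Iterative Context BUCK file is not shown in this post.
    "//src/iterative-context:check": [
        "//src/iterative-context:import_smoke",
        "//src/iterative-context:pytest",
    ],
    "//src/iterative-context:check_full": [
        "//src/iterative-context:check",
        "//src/iterative-context:basedpyright",
    ],
}


def leaf_checks(target, suites=SUITES):
    """Expand a suite into the ordered, de-duplicated leaf checks it composes."""
    seen = []

    def visit(label, path):
        if label in path:
            raise ValueError(f"cycle through {label}")
        children = suites.get(label)
        if children is None:
            # Not a suite, so treat it as a leaf check.
            if label not in seen:
                seen.append(label)
            return
        for child in children:
            visit(child, path + (label,))

    visit(target, ())
    return seen


if __name__ == "__main__":
    for root in ("//:check", "//:check_full"):
        print(root)
        for leaf in leaf_checks(root):
            print("  ", leaf)
```

Under these assumptions, every leaf of `//:check` also appears under `//:check_full`, and `pkl_go_types` appears under neither, which matches its role as an explicit opt-in target.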
The agent does not need to remember:
cd into the Go module
run Go tests
build the CLI
cd back to root
run the Python import smoke
run pytest
maybe run basedpyright on push
maybe refresh Repomix
make sure it is committed
The repo says:
buck2 test //:check
buck2 test //:check_full
That feels like a small control plane.
Buck2 does not replace Nix here.
Nix still owns the development shell. It puts Go, Pkl, Buck2, Repomix, pre-commit, and the surrounding toolchain on PATH.
The difference is that Nix no longer has to be the place where every project operation becomes a custom shell command.
The split now looks like this:
Nix → toolchain and environment closure
Buck2 → repository action graph
Git → lifecycle trigger
That separation feels much cleaner.
Nix answers:
What tools exist, and which versions are they?
Buck2 answers:
What are the sanctioned operations over this repository?
Git hooks answer:
When should those operations run?
The result is less clever than the previous system, and better because of it.
The hooks still exist, but they no longer contain the entire checklist.
They trigger the graph.
That is a cleaner division of responsibility.
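A hook that only triggers the graph can be very small. The sketch below is a hedged illustration in Python rather than the actual git-hooks.nix wiring: the stage-to-target mapping comes from the AGENTS.md table above, while `hook_argv` and `run_hook` are hypothetical names. It leaves out the hygiene and Repomix steps that the pre-commit stage also runs, and it assumes `buck2` is already on PATH through the Nix dev shell.

```python
import subprocess

# Stage-to-target mapping taken from the AGENTS.md table in this post.
HOOK_TARGETS = {
    "pre-commit": "//:check",
    "pre-push": "//:check_full",
}


def hook_argv(stage):
    """Return the command a hook stage should run; unknown stages are an error."""
    try:
        target = HOOK_TARGETS[stage]
    except KeyError:
        raise ValueError(f"no sanctioned target for stage {stage!r}") from None
    return ["buck2", "test", target]


def run_hook(stage):
    """Run the stage's target and return its exit code.

    Assumes buck2 is on PATH (for example, inside `nix develop`).
    """
    return subprocess.run(hook_argv(stage), check=False).returncode
```

The hook knows when to run something and which target to name. It does not know what the target contains.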
A human can read a long AGENTS.md and slowly learn the local rituals.
An agent can too, but only unreliably.
The larger the operational surface gets, the more likely the agent is to do something plausible but wrong:
run the wrong subset
skip the generated-file check
forget the submodule smoke
run a mutating command at the wrong time
trust a stale Repomix snapshot
test from the wrong working directory
Good agent infrastructure is not about giving the model more tools.
It is about removing invalid operational choices.
That is why this change feels important. Buck2 gives the repository a way to expose fewer, stronger affordances.
Instead of saying:
Here are fifteen commands. Good luck.
the repo says:
Run //:check.
Run //:check_full.
That is a much better interface for a coding agent.
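One way to picture "removing invalid operational choices" is an allowlist that sits between an agent and the shell. This is a hypothetical guard, not something SearchBench ships: `SANCTIONED_TARGETS` and `is_sanctioned` are names invented for the sketch, and the only accepted shapes are the two commands quoted in this post, with or without the `nix develop -c` wrapper.

```python
# Hypothetical guard: the only validation commands an agent may run.
SANCTIONED_TARGETS = frozenset({"//:check", "//:check_full"})


def is_sanctioned(argv):
    """Accept only `buck2 test <sanctioned target>`, optionally wrapped in `nix develop -c`."""
    argv = list(argv)
    if argv[:3] == ["nix", "develop", "-c"]:
        argv = argv[3:]
    return (
        len(argv) == 3
        and argv[:2] == ["buck2", "test"]
        and argv[2] in SANCTIONED_TARGETS
    )


if __name__ == "__main__":
    for command in (
        "nix develop -c buck2 test //:check",
        "nix develop -c searchbench-update-repomix",
        "go test ./...",
    ):
        print(command, "->", is_sanctioned(command.split()))
```

The old debugging commands are not wrong under this guard. They are simply not legal moves for validation anymore.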
This also made the repository layout more honest.
SearchBench is not really a single-language project anymore.
I do not want to collapse SearchBench into one language just to make the repository look simpler.
Different parts of the system want different semantics.
The harness wanted Go.
Go is good for the backend harness because it keeps me close to operational reality: simple binaries, clear package boundaries, boring tooling, and direct tests.
Moving the harness into Go made the pure model easier to see. It surfaced file boundaries quickly, gave me a reliable `go test ./...` loop, and let me build the CLI, bundle writer, adapters, and fake e2e paths without disappearing into configuration design too early.
The tool wanted Python.
Python is still the right home for Iterative Context because its testing story is excellent.
The interesting work there is policy behavior, invariants, property-style tests, metamorphic cases, and pytest-driven exploration. That is a recurring theme of mine; see designing-for-two.
The visualization wanted TypeScript.
Because the browser is the runtime.
React Flow, RxJS, Jotai, animation, replay controls, and rich trace interaction are exactly the kind of work TypeScript is good at. That part is basically already real; the second pane is the easy part.
The goal is not one language.
The goal is one system.
SearchBench now has a small monorepo structure:
src/searchbench-go/
src/iterative-context/
src/visualization/
configs/
toolchains/
BUCK
flake.nix
The Go module lives under src/searchbench-go.
The Iterative Context Python project lives under src/iterative-context.
The visualization will live under src/visualization.
Shared Pkl schemas and round manifests stay at the root under configs.
That shape matters because SearchBench is now mature enough to bring its parts together.
A single `go test ./...` is not the whole system anymore.
The validation target needs to know about the Go app, the Python submodule, the artifact snapshot, and eventually the visualization.
That is exactly what a build graph is for.
One line in the diff is easy to miss:
+**Orchestration outside this repo:** Worktrees, branch lifecycle, task assignment, agent summary packs, and merge orchestration are owned by an **external meta harness**, not by SearchBench-Go.
I find this boundary beautiful.
SearchBench-Go owns local correctness: what can be built, tested, generated, checked, and released.
The external harness owns global orchestration: which branch, which issue, which agent, which worktree, and which merge strategy.
Those are both necessary, but they should not be the same system.
A future release diff could make that boundary concrete. This part is hypothetical, but it is the shape I want: the meta harness proposes a release by changing a Pkl round manifest and adding a Buck graph target. The repository still owns validation. The harness owns orchestration.
diff --git a/configs/releases/2026-05-15-ic-v2/round.pkl b/configs/releases/2026-05-15-ic-v2/round.pkl
new file mode 100644
index 0000000..7a41c10
--- /dev/null
+++ b/configs/releases/2026-05-15-ic-v2/round.pkl
@@ -0,0 +1,47 @@
+amends "../../schema/SearchBenchRound.pkl"
+
+name = "ic-v2-release-candidate"
+game = "code-localization"
+
+dataset {
+ kind = "lca"
+ name = "JetBrains-Research/lca-bug-localization"
+ config = "py"
+ split = "dev"
+ maxItems = 24
+}
+
+incumbent {
+ name = "jcodemunch"
+ backend = "mcp"
+ command = env("SEARCHBENCH_JCODEMUNCH_COMMAND")
+}
+
+challenger {
+ name = "iterative-context-v2"
+ backend = "mcp"
+ command = env("SEARCHBENCH_ITERATIVE_CONTEXT_COMMAND")
+ policy = read("policies/challenger_policy.py")
+}
+
+objective = import("scoring/localization-objective.pkl")
+
+promotion {
+ requirePrimaryScoreImprovement = true
+ maxTokenRatio = 1.10
+ protectedCasesMayRegress = false
+}
+
+artifacts {
+ bundleRoot = "artifacts/releases/2026-05-15-ic-v2"
+ report = "report.md"
+ evidence = "evidence.pkl"
+ decision = "decision.json"
+}
diff --git a/releases/BUCK b/releases/BUCK
new file mode 100644
index 0000000..ad6129c
--- /dev/null
+++ b/releases/BUCK
@@ -0,0 +1,35 @@
+load("@prelude//:rules.bzl", "sh_test", "test_suite")
+
+sh_test(
+ name = "run_ic_v2_release",
+ test = "run_release.sh",
+ args = [
+ "../configs/releases/2026-05-15-ic-v2/round.pkl",
+ "--bundle-root",
+ "../artifacts/releases/2026-05-15-ic-v2",
+ ],
+)
+
+sh_test(
+ name = "check_ic_v2_release_report",
+ test = "check_release_report.sh",
+ args = [
+ "../artifacts/releases/2026-05-15-ic-v2/decision.json",
+ ],
+)
+
+test_suite(
+ name = "ic_v2_release_candidate",
+ tests = [
+ "//:check_full",
+ ":run_ic_v2_release",
+ ":check_ic_v2_release_report",
+ ],
+)
That is the release-engineering version of the same idea.
The Pkl says what is being evaluated.
The Buck target says what is allowed to run.
The artifact bundle says what happened.
The meta harness does not need to smuggle project policy through a prompt. It proposes a small, legible diff. The repository graph either accepts it or rejects it.
That is the beautiful part.
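The `promotion` block in the hypothetical manifest can be read as a small decision function. The sketch below is equally hypothetical: the three policy fields mirror the Pkl keys above, but `RoundResult`, its field names, the strict "must improve" comparison, and the shape of the returned dictionary are all assumptions made for illustration, not SearchBench's actual scoring or `decision.json` format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Promotion:
    # Mirrors the `promotion` block of the hypothetical round manifest.
    require_primary_score_improvement: bool = True
    max_token_ratio: float = 1.10
    protected_cases_may_regress: bool = False


@dataclass(frozen=True)
class RoundResult:
    # Assumed summary of one round; these field names are illustrative only.
    incumbent_score: float
    challenger_score: float
    incumbent_tokens: int
    challenger_tokens: int
    protected_regressions: int


def decide(policy, result):
    """Return a promote/reject decision plus the reasons for any rejection."""
    reasons = []
    if (
        policy.require_primary_score_improvement
        and result.challenger_score <= result.incumbent_score
    ):
        reasons.append("primary score did not improve")
    # Multiply instead of dividing so a zero-token incumbent cannot raise.
    if result.challenger_tokens > policy.max_token_ratio * result.incumbent_tokens:
        reasons.append("token ratio above limit")
    if not policy.protected_cases_may_regress and result.protected_regressions > 0:
        reasons.append("protected case regressed")
    return {"promote": not reasons, "reasons": reasons}
```

The point of writing it this way is that the policy lives in the manifest and the decision is a pure function of the manifest and the evidence, so neither the agent nor the meta harness gets to improvise the rule.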
I keep coming back to the same design shape:
wide human intent
→ narrow machine interface
→ durable artifact
In infrastructure, that might be:
operator intent
→ schema-constrained configuration
→ reconciled resources
In AI evaluation, it might be:
candidate agent change
→ fixed round manifest
→ release bundle
In this build-system change, it is:
developer or agent intent
→ Buck2 target
→ validated repository state
The pattern is the same.
Do not let every actor reach directly into the full complexity of the system.
Give them a smaller interface that preserves the important choices and removes the rest.
You are giving your AI cognitive overload when you hand it a pile of valid-looking commands and ask it to infer the real workflow.
Not because the model is dumb.
Because repositories are full of implicit operational knowledge, and agents force that knowledge into the open.
Every debugging command is a cognitive branch.
Every ad hoc script is a hidden policy decision.
Every setup step is a chance for the model to do something plausible and wrong.
The fix is not always to give the agent more context.
Sometimes the fix is to move the workflow out of prose and into the repository itself.
Not a longer checklist.
A smaller set of legal moves.
Here are the legal moves.