Three Sagan workflow steps worth piloting as agent teams; six worth keeping single-agent (LOW confidence)

kind: experiment

TL;DR

Motivation. Sagan already has three two-agent review pairs (code review, interpretation, clean result) plus an upload/verifier pair, but most of the rest of the pipeline — planner, implementer, follow-up proposer, literature review — runs as a single agent. The question is where the next team patterns are worth the added latency and token cost, and where single-agent flows should stay single-agent.
What I did. I inventoried every agent under .claude/agents/ and the runner jobs under services/runner/src/jobs/, then surveyed 19 sources from the last 18 months on multi-agent orchestration — Anthropic's research-system writeup, Cognition's "don't build multi-agents" postmortem, the AutoGen v0.4 and Magentic-One releases, the LangGraph Supervisor and OpenAI Agents SDK contracts, and academic results on debate, planner–executor splits, and critic-driven self-reflection. For every Sagan workflow step I asked: would a team pattern from the survey improve cost, quality, or robustness here?
Result (see table below). Of 15 workflow steps reviewed, 3 are worth piloting as teams, 4 already are teams and should stay teams, 6 should stay single-agent, and 2 are maybe-with-caveats. The top recommendation is replacing the single-agent lit-review.ts job with an Anthropic-style parallel-crawlers + synthesizer team, where the published lift is largest and the existing job is the most obviously breadth-bound. Don't team the implementer; Cognition's postmortem is load-bearing evidence that single-threaded coders win on coherent context.
Next steps.
- Pilot 1 (recommended next): parallel paper crawlers + synthesizer for services/runner/src/jobs/lit-review.ts and project-lit-review.ts. A/B against the current single agent on five queued project_lit_review jobs; success = at least +1 unique high-quality source on average at no more than 2× token cost.
- Pilot 2: broad + filter team for follow-up-proposer. Generate 15 candidates broadly, then filter to 5 against the parent's kill criterion. Success = at least one auto-runnable follow-up, across 10 experiments, that the current agent would have missed.
- Pilot 3 (smaller bet): add one Claude/Codex critic round to experiment-planner before owner approval, mirroring the existing review pairs. Success = fewer owner re-plan cycles on the next 10 plans.
- File the three pilots as separate Sagan experiments rather than piling them into this one.

Workflow step	Current agent(s)	Proposed team shape	Expected uplift	Cost (latency / tokens)	Recommendation
Literature review job	`jobs/lit-review.ts`, single agent	Orchestrator + 3–5 parallel paper-crawler subagents + synthesizer (Anthropic research-system pattern)	Largest expected lift; published Anthropic eval shows +90% over single-agent baseline on breadth-first research	~3–5× tokens, ~1.5× wall time; subagents run in parallel so wall time scales sublinearly	Adopt (Pilot 1)
Follow-up proposer	`follow-up-proposer.md`, single agent	Broad proposer subagent + filter/ranking subagent (fan-out then merge)	Better coverage of long-tail follow-ups; explicit kill-criterion checking	~2× tokens, marginal wall time	Adopt (Pilot 2)
Plan drafting	`experiment-planner.md`, self-loops on clarify	Planner + one critic round (Claude or Codex) before owner approval — same shape as the existing review pairs	Fewer owner re-plan cycles; cheaper than the current "owner is the critic" loop	~1.3× tokens; saves wall-clock by removing one owner round-trip	Adopt (Pilot 3)
Code review	`claude-code-reviewer` + `codex-code-reviewer` + `review-reconciler`	Already a critic-pair + reconciler team	Catches what one model misses; reconciler handles ties	~2× tokens for the pair + reconciler when invoked	Keep team
Interpretation critique	`claude-interpretation-critic` + `codex-interpretation-critic` + reconciler	Already a critic-pair + reconciler team	Hypothesis/claim drift caught earlier; rounds capped at 3	~2× tokens for the pair	Keep team
Clean-result critique	`claude-clean-result-critic` + `codex-clean-result-critic` + reconciler	Already a critic-pair + reconciler team	Promotion gate; catches unsupported claims	~2× tokens for the pair	Keep team
Artifact upload	`uploader` + `upload-verifier`	Worker + read-only verifier already split	Hard gate before `interpreting`; verifier is mechanical and cheap	Verifier is <5% of uploader cost	Keep team
Experiment implementation	`experiment-implementer.md`, single agent	— (don't team)	Negative: Cognition's postmortem documents fragility when sub-coders share context poorly	n/a	Keep single
Experiment execution / pod run	`experimenter.md`	— (don't team)	Operational task; teams add latency, not quality	n/a	Keep single
Consistency check	`consistency-checker.md`	— (don't team)	Mechanical one-variable check; deterministic enough	n/a	Keep single
Weekly digest	`jobs/weekly-digest.ts`	— (don't team)	Narrow summary; teams over-engineer	n/a	Keep single
Insight scan	`jobs/insight-scan.ts`	— (don't team)	Narrow pattern detection over existing data	n/a	Keep single
Daily-log entry writing	Owner-typed in dashboard / Haiku-drafted snapshots	— (don't team)	Trivial; teams add nothing	n/a	Keep single
Result analysis	`analyzer.md` (`result-analyzer`), single agent	Maybe: analyst + skeptic split (two perspectives feeding the existing interpretation-critique pair)	Could surface alternative readings of ambiguous metrics	~1.5× tokens; only worth it on multi-metric experiments	Maybe later
Experiment monitoring	`experimenter` handles run + early interpretation	Maybe: watchdog subagent (running) + interpreter subagent (post-run)	Faster failure detection on long runs	Watchdog must be cheap (Haiku); otherwise cost > benefit	Maybe later

Each row is a step in the Sagan workflow — the issue lifecycle (/issue <N> → plan → review → implement → experiment → analyze → clean-result → follow-ups), plus the runner jobs in services/runner/src/jobs/. Adopt rows are the three pilots; Keep team rows are existing two-agent structures that the survey says should stay; Keep single rows are steps where the literature (especially Cognition's postmortem) recommends against teaming; Maybe later rows are deferred pending lower-hanging fruit. Token-cost columns are order-of-magnitude estimates from published deep-research and code-review benchmarks; actual costs depend on per-run query volume.

Experimental design — how I picked these candidates

Tenancy guardrail. Per CLAUDE.md, every proposed insertion point must be tenant-agnostic — would a hypothetical second project plausibly want this same thing? Every Adopt and Keep team row above passes that test: literature review, follow-up generation, planning, code review, interpretation, clean-result critique, and uploads are all Sagan-shaped, not EPS-shaped. EPS-specific compatibility scripts (e.g. mentor-results-data.ts) are out of scope here and would be touched in the EPS repo if at all.

Inventory of the agents already wired into the repo. Source: ls .claude/agents/ on the VM checkout. Fifteen agents, grouped by role:

Role	Agent files
Planning	`experiment-planner.md`, `consistency-checker.md`
Implementation	`experiment-implementer.md`, `experimenter.md`
Code-review pair	`code-reviewer.md` (Claude), `codex-code-reviewer.md` (Codex), `reconciler.md`
Result analysis	`analyzer.md` (`result-analyzer`)
Interpretation-critique pair	`interpretation-critic.md` (Claude), `codex-interpretation-critic.md` (Codex), `reconciler.md` reused
Clean-result-critique pair	`clean-result-critic.md` (Claude), `codex-clean-result-critic.md` (Codex), `reconciler.md` reused
Upload pair	`uploader.md`, `upload-verifier.md`
Follow-ups	`follow-up-proposer.md`

Three review pairs already use the orchestrator + critic + reconciler pattern (matching LangGraph's supervisor pattern and the Cloudflare AI code review "specialized reviewers + coordinator" design). One worker + verifier pair (uploader / upload-verifier) follows the same shape with one writer and one read-only checker. Every other agent listed is a single-agent flow.

Runner jobs surveyed. Source: find services/runner/src/jobs/ -name '*.ts'. Each agent job is a single-agent run today: lit-review.ts, project-lit-review.ts, weekly-digest.ts, and insight-scan.ts. job-runs.ts is dispatcher glue, and dispatcher.ts itself is non-agent code: it is the harness, not a candidate for teaming.

Survey methodology. I sampled 19 sources on multi-agent orchestration patterns over the last 18 months: vendor engineering blogs (Anthropic, Microsoft Research, LangChain, Cognition, OpenAI, Cloudflare), peer-reviewed work (MetaGPT at ICLR 2024, ChatDev at ACL 2024, Du et al. multiagent debate at ICML 2024, AgentReview at EMNLP 2024, RevAgent), and a small number of preprints on planner–executor splits, multi-agent reflexion, and self-correcting code generation. Vendor blogs were treated as advocacy unless paired with at least one independent paper or postmortem; the “keep single” recommendations lean on Cognition's June 2025 postmortem as the load-bearing counter-evidence.

Source-quality bar. Each cited source had to (a) be dated within 18 months (May 2024 onward) or include a justification if older, (b) include a URL, and (c) be at least one of: peer-reviewed paper, vendor engineering blog with a concrete deployment, or postmortem reporting failures. I excluded LinkedIn-style think-pieces.
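
The three-part bar above can be written as a single predicate. The sketch below is illustrative only: the `Source` record, its field names, and the placeholder URL are assumptions made for this example, not part of the Sagan repo, and the "concrete deployment" condition on vendor blogs is folded into the type label rather than checked.

```typescript
// Hypothetical record for one surveyed source; field names are illustrative only.
type SourceType = "peer-reviewed" | "vendor-blog-with-deployment" | "postmortem" | "other";

interface Source {
  title: string;
  date: string;                // "YYYY-MM", so plain string comparison orders correctly
  url: string;                 // empty string when no URL was recorded
  type: SourceType;
  olderJustification?: string; // why a pre-cutoff source is still included
}

const CUTOFF = "2024-05"; // "May 2024 onward", as stated in the bar above

function passesQualityBar(s: Source): boolean {
  const recentOrJustified = s.date >= CUTOFF || Boolean(s.olderJustification); // (a)
  const hasUrl = s.url.length > 0;                                             // (b)
  const acceptedType = s.type !== "other";                                     // (c)
  return recentOrJustified && hasUrl && acceptedType;
}

// Example: a 2023 paper passes only when a justification is attached.
const debatePaper: Source = {
  title: "Du et al.: Multiagent Debate",
  date: "2023-05",
  url: "https://example.org/placeholder", // placeholder, not the real URL
  type: "peer-reviewed",
};
const withoutReason = passesQualityBar(debatePaper);
const withReason = passesQualityBar({
  ...debatePaper,
  olderJustification: "seminal anchor for the debate pattern family",
});
```

Keeping the justification as free text rather than a boolean means the reason for admitting an older source travels with the record.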

Ranking rubric. Candidates were scored on three axes: (1) published evidence of uplift — does a paper or vendor benchmark show the team pattern beating the single-agent baseline for a task of this shape? (2) fit to the existing dispatcher + agent-loader contract — would the change be a configuration tweak or a structural rewrite? (3) blast radius — if the team pattern degrades, does it harm in-flight experiments? Pilots 1 and 2 score high on all three; pilot 3 is a smaller bet (worth doing because it's cheap, not because the evidence is strongest).
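
One way to make the rubric mechanical is to score each axis and sort. This is a minimal sketch under stated assumptions: the 1 to 5 scale, the equal weighting, the evidence tie-breaker, and every number in `placeholderScores` are invented for illustration and only mirror the qualitative statement above. They are not the scores used in this survey.

```typescript
interface RubricScore {
  candidate: string;
  evidence: number;    // axis 1: published evidence of uplift
  fit: number;         // axis 2: fit to the dispatcher + agent-loader contract
  blastRadius: number; // axis 3: higher = a degraded team is less likely to harm in-flight experiments
}

function total(s: RubricScore): number {
  return s.evidence + s.fit + s.blastRadius; // equal weights: an assumption
}

// Sort by total score, breaking ties in favour of stronger published evidence.
function rankCandidates(scores: RubricScore[]): RubricScore[] {
  return scores.slice().sort((a, b) => total(b) - total(a) || b.evidence - a.evidence);
}

// Placeholder values, NOT measured or author-assigned scores.
const placeholderScores: RubricScore[] = [
  { candidate: "plan critic round", evidence: 2, fit: 5, blastRadius: 5 },
  { candidate: "lit-review crawlers", evidence: 5, fit: 4, blastRadius: 4 },
  { candidate: "broad + filter follow-ups", evidence: 4, fit: 4, blastRadius: 4 },
];
const order = rankCandidates(placeholderScores).map((s) => s.candidate);
```

With these placeholders the plan critic ties the follow-up team on total and drops to third on the evidence tie-breaker, which matches the "cheap, not best-evidenced" framing of pilot 3.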

Why these three pilots, in this order.

Pilot 1: parallel paper crawlers + synthesizer for the literature-review jobs. Anthropic's research-system writeup is the single strongest published result on the orchestrator + parallel-subagent pattern, with a 90.2% lift over a single Opus-4 agent on their internal breadth-first research benchmark. The pattern is a direct fit for lit-review.ts and project-lit-review.ts: each runs a single agent that decides what to search, reads pages serially, and writes a draft. Replacing that agent with one orchestrator and 3–5 paper-crawler subagents in parallel is a structurally minor change to the job's prompt and dispatcher entry, not a rewrite of run-agent.ts. The pilot would run an A/B on the next five queued project_lit_review jobs and measure (a) the number of unique high-quality sources cited and (b) total token cost. Success = at least +1 unique source on average at no more than 2× cost. Falsifiable failure = no source-count lift, or cost >2.5×; either outcome would close the experiment.
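
The fan-out/fan-in shape of Pilot 1 can be sketched in a few lines. Everything here is hypothetical: `Paper`, `Crawler`, `dedupeByUrl`, and `runLitReviewTeam` are names invented for the example, and the real job would call agents through the existing dispatcher rather than plain functions.

```typescript
interface Paper {
  url: string;
  title: string;
}

// A crawler subagent, modelled as an async function from sub-query to papers (an assumption).
type Crawler = (subQuery: string) => Promise<Paper[]>;

// Merge crawler outputs, keeping the first occurrence of each URL.
function dedupeByUrl(batches: Paper[][]): Paper[] {
  const seen: Record<string, boolean> = {};
  const merged: Paper[] = [];
  for (const batch of batches) {
    for (const paper of batch) {
      const key = paper.url.trim().toLowerCase();
      if (!seen[key]) {
        seen[key] = true;
        merged.push(paper);
      }
    }
  }
  return merged;
}

async function runLitReviewTeam(
  subQueries: string[],
  crawl: Crawler,
  synthesize: (papers: Paper[]) => Promise<string>,
): Promise<string> {
  // Fan out: one crawler per sub-query, all in flight at once.
  const batches = await Promise.all(subQueries.map((q) => crawl(q)));
  // Fan in: a single synthesizer sees the merged, de-duplicated set.
  return synthesize(dedupeByUrl(batches));
}
```

De-duplicating before synthesis matters for the pilot's metric: the count of unique sources is only meaningful if two crawlers returning the same paper are counted once.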

Pilot 2: broad + filter follow-up proposer. Anthropic describes subagents as “intelligent filters” that iteratively gather and prune. The current single follow-up-proposer has to balance generating breadth and applying the kill-criterion filter in one pass, which is the exact failure mode the broad+filter split addresses. The pilot: one “broad proposer” subagent generates 15 follow-up candidates from each interpretation; a “filter” subagent ranks them against the parent's kill criterion and outputs the same auto_run / proposal JSON the orchestrator already parses. Success = at least one auto-runnable follow-up across 10 experiments that the current single agent would have missed (judged by manual review). Failure = the same outputs as the single agent, or noisier outputs that the owner has to filter manually.
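
The filter stage of Pilot 2 reduces to "drop, rank, truncate, label". The sketch below assumes a hypothetical `Candidate` record and a hypothetical `FollowUp` output shape; the real auto_run / proposal JSON schema is not shown in this report, and the 0.8 auto-run threshold is an invented parameter.

```typescript
interface Candidate {
  title: string;
  testsKillCriterion: boolean; // does this follow-up bear on the parent's kill criterion?
  score: number;               // filter subagent's ranking score in [0, 1] (assumed scale)
}

// Hypothetical stand-in for the auto_run / proposal JSON the orchestrator parses.
interface FollowUp {
  kind: "auto_run" | "proposal";
  title: string;
}

function filterFollowUps(
  candidates: Candidate[],
  keep: number = 5,        // "filter to 5", per the pilot description
  autoRunAt: number = 0.8, // assumed threshold, not from the source
): FollowUp[] {
  return candidates
    .filter((c) => c.testsKillCriterion)  // drop candidates unrelated to the kill criterion
    .sort((a, b) => b.score - a.score)    // filter() returned a copy, so the input is not mutated
    .slice(0, keep)
    .map((c): FollowUp => ({
      kind: c.score >= autoRunAt ? "auto_run" : "proposal",
      title: c.title,
    }));
}
```

Because the output type matches what a single proposer would emit, the two arms of the pilot can be compared item by item during manual review.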

Pilot 3: critic round on plans. The existing review pairs already establish the contract for “Claude critic + Codex critic + reconciler” in the Sagan workflow. Adding one such round to the planner before owner approval is a cheap copy-paste of the pattern. The published evidence is weakest here (planning critics have not been benchmarked head-to-head against owner-as-critic), so this is a small bet justified by reusing existing infrastructure rather than by a paper result. Success = fewer owner re-plan cycles on the next 10 plans. Failure = critics produce nitpicks that don't reduce owner cycles.
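
A single critic round before owner approval might look like the following. This is a simplified sketch, not the existing review-pair implementation: `Critique`, `Critic`, and `oneCriticRound` are hypothetical names, critics are modelled as plain functions, and the "reconciler" here only de-duplicates blocking issues.

```typescript
interface Critique {
  critic: string;
  blocking: string[]; // issues that should be fixed before the owner sees the plan
  nits: string[];     // stylistic comments
}

type Critic = (plan: string) => Critique;

interface ReviewedPlan {
  plan: string;
  blocking: string[];
  needsRevision: boolean;
}

function oneCriticRound(plan: string, critics: Critic[]): ReviewedPlan {
  const blocking: string[] = [];
  for (const critic of critics) {
    for (const issue of critic(plan).blocking) {
      // Minimal reconciliation: keep each distinct blocking issue once.
      if (blocking.indexOf(issue) === -1) blocking.push(issue);
    }
  }
  // Nits are dropped on purpose: the failure mode named above is nitpicks
  // that do not reduce owner cycles.
  return { plan, blocking, needsRevision: blocking.length > 0 };
}
```

Capping the loop at one round keeps the added cost near the ~1.3× token estimate; a plan with `needsRevision` set would go back to the planner once and then to the owner regardless.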

Why these were ruled out.

Implementer. Cognition's June 2025 postmortem is the explicit counter-evidence: parallel coder subagents make conflicting edits because each makes implicit decisions the others can't see. The Devin team moved away from multi-agent implementation toward a single-threaded coder with full context. Sagan's experiment-implementer already works this way and should stay that way.
Experimenter / pod ops. These are operational, not reasoning-bound. Teams add latency without lifting quality.
Consistency-checker. The check is mechanical: does the new plan change one variable from its parent or not? A second agent would just rubber-stamp the first.
Weekly digest, insight scan, daily-log entries. Narrow summarization tasks with one source of truth and one consumer. The survey literature finds that team patterns help on breadth-bound or critique-bound tasks, not on summary tasks with tight scope.

Confidence: LOW — the ranking rests on a single strong head-to-head benchmark (Anthropic's research-system writeup) plus literature shape-matching for the other steps; no Sagan-side A/B has been run yet, so the proposal is a hypothesis, not a measurement. The kill criterion is published: if pilot 1 fails to lift unique-source-count or blows the cost budget, the rest of the ordering should be re-examined.
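
The Pilot 1 kill criterion can be stated as a small function so the A/B result is judged the same way every time. The thresholds (+1 source, 2× and 2.5× cost) come from the pilot description; the `ArmResult` shape, the use of a ratio of mean token counts, and the "inconclusive" band between the success and failure thresholds are assumptions of this sketch.

```typescript
interface ArmResult {
  uniqueHighQualitySources: number;
  tokens: number;
}

type Verdict = "success" | "failure" | "inconclusive";

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function pilot1Verdict(baseline: ArmResult[], team: ArmResult[]): Verdict {
  const lift =
    mean(team.map((r) => r.uniqueHighQualitySources)) -
    mean(baseline.map((r) => r.uniqueHighQualitySources));
  const costRatio = mean(team.map((r) => r.tokens)) / mean(baseline.map((r) => r.tokens));

  if (lift <= 0 || costRatio > 2.5) return "failure";  // no lift, or budget blown
  if (lift >= 1 && costRatio <= 2) return "success";   // +1 source at no more than 2x cost
  return "inconclusive";                               // assumed band: partial lift or 2x to 2.5x cost
}
```

An explicit third outcome avoids having to decide after the fact how to read a run that lands at, say, 2.2× cost with a clear lift.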

Parameter	Value
Scope	Sagan-side agents under `.claude/agents/` and runner jobs under `services/runner/src/jobs/`; EPS-specific compatibility scripts excluded per tenant guardrail.
Sources surveyed	19; 18 dated May 2024 to Feb 2026, plus the 2023 Du et al. debate paper kept as a seminal anchor; mix of vendor engineering blogs, peer-reviewed conference papers, preprints, and postmortems.
Workflow steps reviewed	15 (8 agent-driven, 5 runner jobs, 2 owner-driven artifacts).
Candidates proposed to pilot	3 — lit-review parallelization, broad+filter follow-ups, plan critic round.
Candidates explicitly kept single-agent	6 — implementer, experimenter, consistency-checker, weekly digest, insight scan, daily-log entries.
Candidates already teamed (keep)	4 — code review, interpretation critique, clean-result critique, uploads.
Compute used	Web search + drafting on the Sagan VM. No GPU, no RunPod.
Wall time	~45 min of agent time across planning + research + drafting.

Survey notes — 19 sources, with one-line takeaways

Source	Date	Type	One-line takeaway
Anthropic: How we built our multi-agent research system	2025-06	Vendor blog	Orchestrator + parallel subagent research team beats single Opus-4 by 90.2% on internal breadth-first eval; load-bearing source for Pilot 1.
Cognition: Don't Build Multi-Agents	2025-06	Postmortem	Naive multi-agent coders share context poorly and make conflicting edits; single-threaded linear agent is the recommended default for code generation. Load-bearing source for keep-implementer-single.
Cognition: Devin's 2025 Performance Review	2025-12	Postmortem	Eighteen months of single-threaded coder operation; confirms the “don't team coders” conclusion holds at scale.
Microsoft: AutoGen v0.4	2025-01	Vendor blog	Actor-model multi-agent runtime; documents async messaging and observability as the right abstractions for teams — relevant when sizing the cost of building teams in `run-agent.ts`.
Magentic-One (Fourney et al., arXiv 2411.04468)	2024-11	Paper + system	Orchestrator + four specialized agents reach 38% GAIA / 32.8% WebArena; example of how an Orchestrator that re-plans on error wraps tool-using subagents.
Du et al.: Multiagent Debate (ICML 2024)	2023-05	Paper	Older but seminal: multiple model instances debating improves factuality and reasoning; baseline for the “debate” pattern family.
ICLR 2025 blogpost: Multi-LLM Debate — scaling challenges	2025-04	Blogpost	Critical: multi-agent debate does not consistently outperform Chain-of-Thought or self-consistency at equal compute. Cautionary anchor against over-claiming debate uplift.
MetaGPT (Hong et al., ICLR 2024)	2024 ICLR	Paper	SOPs encoded into role prompts; 5 agents; structured documents over chat; useful contrast against debate-style teams.
ChatDev (Qian et al., ACL 2024)	2024 ACL	Paper	7-agent software-dev pipeline via inception prompting; high token cost (often >$10 per HumanEval task) — load-bearing reminder that teams aren't free.
OpenAI Swarm (educational)	2024-10	Vendor repo	Lightweight handoff/routine primitives; deprecated in favor of OpenAI Agents SDK. Documents the minimum API surface a team framework needs.
OpenAI Agents SDK	2025-03	Vendor docs	Production successor to Swarm: same handoff primitives plus guardrails and tracing.
LangGraph: Hierarchical Agent Teams	2025	Vendor docs	Supervisor-of-supervisors pattern; tool-based handoff is the recommended primitive. Confirms the existing review-pair + reconciler shape is a supervisor pattern in disguise.
RevAgent: Issue-Oriented Code Review	2025-11	Paper	Five category-specific commentator agents + critic + training loop; +12.9% BLEU on review-comment generation. Cheap external evidence for code-review pairs.
Cloudflare: Orchestrating AI Code Review at Scale	2025	Vendor blog	Seven specialized reviewer agents + coordinator that deduplicates and posts one structured review — production reference for the keep-the-code-review-team conclusion.
Plan-and-Act (arXiv 2503.09572)	2025-03	Paper	High-level Planner + low-level Executor split improves long-horizon task consistency. Relevant background for the “Maybe later” result-analysis split.
HiRA: Hierarchical Reasoning for Deep Search	2025-07	Paper	Three-agent Planner / Coordinator / Executor with an adaptive coordinator that prevents context loss between the other two — the architectural caution behind keeping the Reconciler agent in our review pairs.
CodeCoR: Self-reflective multi-agent code generation	2025-01	Paper	Verify-and-improve loop with explicit critic agent; consistent with adding critic rounds to the planner, but data is on code, not plans.
MAR: Multi-Agent Reflexion (arXiv 2512.20845)	2026-02	Paper	Replaces single self-reflecting model with multiple persona-guided critics to escape cognitive entrenchment in reasoning chains; cautionary evidence that solo self-reflection underperforms.
AgentReview (EMNLP 2024)	2024 EMNLP	Paper	LLM-agent peer review pipeline shows reviewers develop adaptive strategies that can exploit the system — informs the rounds=3 cap and the reconciler design in our review pairs.

Distribution. 19 sources, counted by the Type column above: 10 papers (peer-reviewed or preprint, including the Magentic-One paper + system release), 6 vendor engineering blogs, docs, or repos, 2 postmortems, and 1 critical blogpost (ICLR 2025). Dates: 18 fall on or after the May 2024 cutoff; the 2023 Du et al. debate paper is included as the seminal anchor for the debate pattern family and is cited as such, not as recent evidence.

Biases I tried to surface. Vendor blogs (Anthropic, Microsoft, LangChain, Cloudflare) consistently report uplift; postmortems (Cognition x2) report fragility. The honest synthesis is that team patterns help when the task is breadth-bound (research, broad follow-ups) or critique-bound (review, interpretation), and hurt when the task is depth-bound with one coherent context (implementation, narrow summaries). That synthesis is what the Recommendation column reflects.
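
The synthesis above amounts to a lookup from a step's bottleneck to a team shape. The sketch below restates it as data; the type names and the shape labels are invented for illustration, and the step classifications are only the examples named in the text.

```typescript
type Bottleneck = "breadth" | "critique" | "depth";
type TeamShape = "fan-out workers + synthesizer/filter" | "critic pair + reconciler" | "single agent";

const SHAPE_FOR: Record<Bottleneck, TeamShape> = {
  breadth: "fan-out workers + synthesizer/filter", // research, broad follow-ups
  critique: "critic pair + reconciler",            // review, interpretation
  depth: "single agent",                           // implementation, narrow summaries
};

// Classifications restate the examples given in the paragraph above.
const STEP_BOTTLENECK: Record<string, Bottleneck> = {
  "literature review": "breadth",
  "follow-up proposal": "breadth",
  "code review": "critique",
  "interpretation critique": "critique",
  "implementation": "depth",
  "weekly digest": "depth",
};

function recommendShape(step: string): TeamShape | undefined {
  const bottleneck = STEP_BOTTLENECK[step];
  return bottleneck === undefined ? undefined : SHAPE_FOR[bottleneck];
}
```

A step that is not in the table returns `undefined` rather than a default, which keeps unclassified steps from silently inheriting a recommendation.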

Reproducibility (agent-facing)

Artifacts.

Model / adapters: n/a (no training).
Training dataset: n/a.
Raw completions: n/a (no eval).
WandB run(s): n/a.
Eval JSON in repo: n/a.
Source inventory: agents enumerated from .claude/agents/*.md at git HEAD; runner jobs enumerated from services/runner/src/jobs/*.ts at git HEAD.
Survey sources: 19 URLs listed in the Survey notes block above; all public web pages.

Compute.

Wall time: ~45 min agent time (inventory + 19-source survey + drafting).
GPU: 0 (no RunPod pod launched).
Pod: n/a — ran as a Sagan-VM agent run only.
Estimated token cost: ~$10–$20 in Anthropic API charges (web search + drafting), within the plan's $20 ceiling.

Code.

Entry scripts: n/a (no code shipped; this is a proposal).
Git commit: n/a (no code change attached to this experiment).
Hydra configs: n/a.
Reproduce: the survey can be re-run by querying the same 19 URLs and re-doing the agent inventory at HEAD; output will differ as the literature evolves.

Follow-up experiment shells (to be filed separately).

Pilot 1 — parallel paper crawlers + synthesizer for lit-review.ts; A/B on five queued project_lit_review jobs; success = +1 unique high-quality source on average without >2× token cost.
Pilot 2 — broad + filter team for follow-up-proposer; success = ≥1 net-new auto-runnable follow-up across 10 experiments.
Pilot 3 — one critic round on experiment-planner before owner approval; success = fewer re-plan cycles on the next 10 plans.

Timeline · 14 events

state_changed · 2026-05-13T12:54:35.154Z · user · proposed → clarifying
Moved on Pipeline board to clarifying.
state_changed · 2026-05-13T12:55:17.885Z · runner · clarifying → awaiting_clarifications
Claude produced clarifying questions; awaiting owner answers.
state_changed · 2026-05-13T12:58:12.648Z · user · awaiting_clarifications → clarifying
Owner re-dispatched the planner from awaiting_clarifications.
state_changed · 2026-05-13T12:58:54.956Z · runner · clarifying → awaiting_clarifications
Claude produced clarifying questions; awaiting owner answers.
state_changed · 2026-05-13T13:04:32.085Z · user · awaiting_clarifications → planning
Moved on Pipeline board to planning.
state_changed · 2026-05-13T13:06:39.941Z · runner · planning → plan_pending
Experiment plan is ready for owner approval.
approval_requested · 2026-05-13T13:06:40.151Z · runner
Experiment plan approval requested.
state_changed · 2026-05-13T13:07:55.319Z · user · plan_pending → queued
Moved on Pipeline board to queued.
state_changed · 2026-05-13T13:07:55.370Z · user · queued → approved
Approved from Pipeline board after moving to queued.
state_changed · 2026-05-13T13:07:55.901Z · runner · approved → implementing
Orchestrator 5bb80a98 queued to implement and dispatch.
blocked · 2026-05-13T13:11:07.861Z · runner · implementing → blocked
Cascaded from agent_run f43d96ac failed
state_changed · 2026-05-13T13:26:21.798Z · user · blocked → running
Reopened after manual retry of agent run f43d96ac.
state_changed · 2026-05-13T13:41:21.600Z · user · running → awaiting_promotion
Research/proposal artifact written to experiments.body per docs/clean-result-guidelines.md (modified for survey: TL;DR + primary table + Experimental design + Survey notes + Reproducibility). Three pilots proposed, six steps explicitly kept single-agent.
state_changed · 2026-05-13T17:54:35.949Z · user · awaiting_promotion → archived
Moved on Pipeline board to archived.
