Three Sagan workflow steps worth piloting as agent teams; six worth keeping single-agent (LOW confidence)
TL;DR
- Motivation. Sagan already has three two-agent review pairs (code review, interpretation, clean result) plus an upload/verifier pair, but most of the rest of the pipeline — planner, implementer, follow-up proposer, literature review — runs as a single agent. The question is where the next team patterns are worth the added latency and token cost, and where single-agent flows should stay single-agent.
- What I did. I inventoried every agent under .claude/agents/ and the runner jobs under services/runner/src/jobs/, then surveyed 19 sources from the last 18 months on multi-agent orchestration: Anthropic's research-system writeup, Cognition's "don't build multi-agents" postmortem, the AutoGen v0.4 and Magentic-One releases, the LangGraph Supervisor and OpenAI Agents SDK contracts, and academic results on debate, planner–executor splits, and critic-driven self-reflection. For every Sagan workflow step I asked: would a team pattern from the survey improve cost, quality, or robustness here?
- Result (see table below). Of 15 workflow steps reviewed, 3 are worth piloting as teams, 4 already are teams and should stay teams, 6 should stay single-agent, and 2 are maybe-with-caveats. The top recommendation is replacing the single-agent lit-review.ts job with an Anthropic-style parallel-crawlers + synthesizer team, where the published lift is largest and the existing job is the most obviously breadth-bound. Don't team the implementer; Cognition's postmortem is load-bearing evidence that single-threaded coders win on coherent context.
- Next steps.
  - Pilot 1 (recommended next): parallel paper crawlers + synthesizer for services/runner/src/jobs/lit-review.ts and project-lit-review.ts. A/B against the current single agent on five queued project_lit_review jobs; success = +1 unique high-quality paper on average without >2× token cost.
  - Pilot 2: broad + filter team for follow-up-proposer. Generate 15 candidates broadly, filter to 5 against parent kill-criterion. Success = at least one auto-runnable follow-up the current agent would have missed across 10 experiments.
  - Pilot 3 (smaller bet): add one Claude/Codex critic round to experiment-planner before user approval, mirroring the existing review pairs. Success = fewer re-plan cycles on the next 10 plans.
  - File the three pilots as separate Sagan experiments rather than piling them into this one.
| Workflow step | Current agent(s) | Proposed team shape | Expected uplift | Cost (latency / tokens) | Recommendation |
|---|---|---|---|---|---|
| Literature review job | jobs/lit-review.ts, single agent | Orchestrator + 3–5 parallel paper-crawler subagents + synthesizer (Anthropic research-system pattern) | Largest expected lift; published Anthropic eval shows +90% over single-agent baseline on breadth-first research | ~3–5× tokens, ~1.5× wall time; subagents run in parallel so wall time scales sublinearly | Adopt (Pilot 1) |
| Follow-up proposer | follow-up-proposer.md, single agent | Broad proposer subagent + filter/ranking subagent (fan-out then merge) | Better coverage of long-tail follow-ups; explicit kill-criterion checking | ~2× tokens, marginal wall time | Adopt (Pilot 2) |
| Plan drafting | experiment-planner.md, self-loops on clarify | Planner + one critic round (Claude or Codex) before owner approval; same shape as the existing review pairs | Fewer owner re-plan cycles; cheaper than the current "owner is the critic" loop | ~1.3× tokens; saves wall-clock by removing one owner round-trip | Adopt (Pilot 3) |
| Code review | claude-code-reviewer + codex-code-reviewer + review-reconciler | Already a critic-pair + reconciler team | Catches what one model misses; reconciler handles ties | ~2× tokens for the pair + reconciler when invoked | Keep team |
| Interpretation critique | claude-interpretation-critic + codex-interpretation-critic + reconciler | Already a critic-pair + reconciler team | Hypothesis/claim drift caught earlier; rounds capped at 3 | ~2× tokens for the pair | Keep team |
| Clean-result critique | claude-clean-result-critic + codex-clean-result-critic + reconciler | Already a critic-pair + reconciler team | Promotion gate; catches unsupported claims | ~2× tokens for the pair | Keep team |
| Artifact upload | uploader + upload-verifier | Worker + read-only verifier already split | Hard gate before interpreting; verifier is mechanical and cheap | Verifier is <5% of uploader cost | Keep team |
| Experiment implementation | experiment-implementer.md, single agent | None (don't team) | Negative: Cognition's postmortem documents fragility when sub-coders share context poorly | n/a | Keep single |
| Experiment execution / pod run | experimenter.md | None (don't team) | Operational task; teams add latency, not quality | n/a | Keep single |
| Consistency check | consistency-checker.md | None (don't team) | Mechanical one-variable check; deterministic enough | n/a | Keep single |
| Weekly digest | jobs/weekly-digest.ts | None (don't team) | Narrow summary; teams over-engineer | n/a | Keep single |
| Insight scan | jobs/insight-scan.ts | None (don't team) | Narrow pattern detection over existing data | n/a | Keep single |
| Daily-log entry writing | Owner-typed in dashboard / Haiku-drafted snapshots | None (don't team) | Trivial; teams add nothing | n/a | Keep single |
| Result analysis | analyzer.md (result-analyzer), single agent | Maybe: analyst + skeptic split (two perspectives feeding the existing interpretation-critique pair) | Could surface alternative readings of ambiguous metrics | ~1.5× tokens; only worth it on multi-metric experiments | Maybe later |
| Experiment monitoring | experimenter handles run + early interpretation | Maybe: watchdog subagent (running) + interpreter subagent (post-run) | Faster failure detection on long runs | Watchdog must be cheap (Haiku); otherwise cost > benefit | Maybe later |
Table scope: the agent workflow (/issue <N> → plan → review → implement → experiment → analyze → clean-result → follow-ups), plus the runner jobs in services/runner/src/jobs/. Adopt rows are the three pilots; Keep team rows are existing two-agent structures that the survey says should stay; Keep single rows are steps where the literature (especially Cognition's postmortem) recommends against teaming; Maybe later rows are deferred pending lower-hanging fruit. Token-cost columns are order-of-magnitude estimates from published deep-research and code-review benchmarks; actual costs depend on per-run query volume.
Experimental design: how I picked these candidates
Tenancy guardrail. Per CLAUDE.md, every proposed insertion point must be tenant-agnostic — would a hypothetical second project plausibly want this same thing? Every Adopt and Keep team row above passes that test: literature review, follow-up generation, planning, code review, interpretation, clean-result critique, and uploads are all Sagan-shaped, not EPS-shaped. EPS-specific compatibility scripts (e.g. mentor-results-data.ts) are out of scope here and would be touched in the EPS repo if at all.
Inventory of the agents already wired into the repo. Source: ls .claude/agents/ on the VM checkout. Fifteen agents, grouped by role:
| Role | Agent files |
|---|---|
| Planning | experiment-planner.md, consistency-checker.md |
| Implementation | experiment-implementer.md, experimenter.md |
| Code-review pair | code-reviewer.md (Claude), codex-code-reviewer.md (Codex), reconciler.md |
| Result analysis | analyzer.md (result-analyzer) |
| Interpretation-critique pair | interpretation-critic.md (Claude), codex-interpretation-critic.md (Codex), reconciler.md reused |
| Clean-result-critique pair | clean-result-critic.md (Claude), codex-clean-result-critic.md (Codex), reconciler.md reused |
| Upload pair | uploader.md, upload-verifier.md |
| Follow-ups | follow-up-proposer.md |
Three review pairs already use the orchestrator + critic + reconciler pattern (matching LangGraph's supervisor pattern and the Cloudflare AI code review "specialized reviewers + coordinator" design). One worker + verifier pair (uploader / upload-verifier) follows the same shape with one writer and one read-only checker. Every other agent listed is a single-agent flow.
Runner jobs surveyed. Source: find services/runner/src/jobs/ -name '*.ts'. Each is a single-agent run today: lit-review.ts, project-lit-review.ts, weekly-digest.ts, insight-scan.ts, and job-runs.ts (dispatcher glue). The dispatcher.ts itself is non-agent code; it's the harness, not a candidate for teaming.
Survey methodology. I sampled 19 sources on multi-agent orchestration patterns over the last 18 months: vendor engineering blogs (Anthropic, Microsoft Research, LangChain, Cognition, OpenAI, Cloudflare), peer-reviewed work (MetaGPT at ICLR 2024, ChatDev at ACL 2024, Du et al. multiagent debate at ICML 2024, AgentReview at EMNLP 2024, RevAgent), and a small number of preprints on planner–executor splits, multi-agent reflexion, and self-correcting code generation. Vendor blogs were treated as advocacy unless paired with at least one independent paper or postmortem; the “keep single” recommendations lean on Cognition's June 2025 postmortem as the load-bearing counter-evidence.
Source-quality bar. Each cited source had to (a) be dated within 18 months (May 2024 onward) or include a justification if older, (b) include a URL, and (c) be at least one of: peer-reviewed paper, vendor engineering blog with a concrete deployment, or postmortem reporting failures. I excluded LinkedIn-style think-pieces.
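The three-part bar above is mechanical enough to write down as a predicate. The following is a minimal sketch, assuming a hypothetical `Source` record; the field names, the `SourceKind` labels, and the `CUTOFF` constant are illustrative and not taken from the Sagan repo.

```typescript
// Hypothetical record shape for one surveyed source; not a type from the Sagan repo.
type SourceKind = "peer_reviewed" | "vendor_blog_with_deployment" | "postmortem" | "other";

interface Source {
  title: string;
  url?: string;
  published: string;           // year-month string, e.g. "2025-06"
  kind: SourceKind;
  olderJustification?: string; // expected when published before the cutoff
}

const CUTOFF = "2024-05"; // "May 2024 onward", per the bar above

function passesQualityBar(s: Source): boolean {
  // (a) recent enough, or explicitly justified if older
  const recentOrJustified =
    s.published >= CUTOFF || (s.olderJustification ?? "").trim().length > 0;
  // (b) must carry a URL
  const hasUrl = (s.url ?? "").trim().length > 0;
  // (c) one of the three accepted source types
  const acceptedKind = s.kind !== "other";
  return recentOrJustified && hasUrl && acceptedKind;
}
```

Comparing "YYYY-MM" strings lexicographically is enough here because the format is fixed-width; a looser date format would need real date parsing.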
Ranking rubric. Candidates were scored on three axes: (1) published evidence of uplift — does a paper or vendor benchmark show the team pattern beating the single-agent baseline for a task of this shape? (2) fit to the existing dispatcher + agent-loader contract — would the change be a configuration tweak or a structural rewrite? (3) blast radius — if the team pattern degrades, does it harm in-flight experiments? Pilots 1 and 2 score high on all three; pilot 3 is a smaller bet (worth doing because it's cheap, not because the evidence is strongest).
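One way to make the rubric comparable across candidates is to score each axis on a small ordinal scale. The sketch below is illustrative only: the 0 to 2 scale, the equal weighting, and the tie-break on evidence are assumptions, and no scores for the actual pilots are implied.

```typescript
// Illustrative three-axis rubric; the 0-2 scale and equal weighting are assumptions.
type Level = 0 | 1 | 2; // 0 = low, 1 = medium, 2 = high

interface RubricScore {
  step: string;
  evidence: Level;    // (1) published evidence of uplift for a task of this shape
  contractFit: Level; // (2) 2 = configuration tweak, 0 = structural rewrite
  blastSafety: Level; // (3) 2 = a degraded team cannot harm in-flight experiments
}

function total(s: RubricScore): number {
  return s.evidence + s.contractFit + s.blastSafety;
}

// Highest total first; ties are broken by the evidence axis (an assumed tie-break).
function rank(candidates: RubricScore[]): RubricScore[] {
  return candidates
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => total(b) - total(a) || b.evidence - a.evidence);
}
```

Under this scheme a cheap, low-evidence bet can still rank above a well-evidenced rewrite if it scores high on contract fit and blast radius, which matches the stated reasoning for the smaller third pilot.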
Why these three pilots, in this order.
Pilot 1: parallel paper crawlers + synthesizer for the literature-review jobs. Anthropic's research-system writeup is the single strongest published result on the orchestrator + parallel-subagent pattern, with a 90.2% lift over a single Opus-4 agent on their internal breadth-first research benchmark. The pattern is a direct fit for lit-review.ts and project-lit-review.ts: each runs a single agent that decides what to search, reads pages serially, and writes a draft. Replacing it with one orchestrator and 3–5 paper-crawler subagents in parallel is a structurally minor change to the job's prompt and dispatcher entry, not a rewrite of run-agent.ts. The pilot would run an A/B on the next five queued project_lit_review jobs and measure (a) number of unique high-quality sources cited and (b) total token cost. Success = +1 unique source on average without >2× cost. Falsifiable failure = either no source-count lift or cost >2.5×, both of which would close the experiment.
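The orchestrator + parallel-crawler shape described above can be sketched in a few lines. This is a sketch under stated assumptions, not the job's actual code: `Finding`, `Crawler`, `mergeFindings`, and `runLitReviewTeam` are hypothetical names, and none of them come from lit-review.ts or run-agent.ts.

```typescript
// Hypothetical shapes; none of these names come from lit-review.ts.
interface Finding { url: string; title: string; relevance: number } // relevance in [0, 1]
type Crawler = (subquery: string) => Promise<Finding[]>;

// Pure merge step: dedupe by URL, keep the highest-relevance copy, sort best-first.
function mergeFindings(batches: Finding[][]): Finding[] {
  const byUrl = new Map<string, Finding>();
  batches.forEach((batch) =>
    batch.forEach((f) => {
      const seen = byUrl.get(f.url);
      if (seen === undefined || f.relevance > seen.relevance) byUrl.set(f.url, f);
    })
  );
  return Array.from(byUrl.values()).sort((x, y) => y.relevance - x.relevance);
}

// Orchestrator: fan out one crawler call per subquery, then hand the merged set to a synthesizer.
async function runLitReviewTeam(
  subqueries: string[],
  crawl: Crawler,
  synthesize: (findings: Finding[]) => Promise<string>
): Promise<string> {
  // Crawlers run concurrently, so wall time tracks the slowest crawler rather than the sum.
  const batches = await Promise.all(subqueries.map((q) => crawl(q)));
  return synthesize(mergeFindings(batches));
}
```

`Promise.all` rejects as soon as any crawler rejects, so a real pilot would likely want per-crawler error handling so that one failed subquery does not discard the rest. The dedupe step matters for the success metric, since the A/B counts unique sources.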
Pilot 2: broad + filter follow-up proposer. Anthropic describes subagents as “intelligent filters” that iteratively gather and prune. The current single follow-up-proposer has to balance generating breadth and applying the kill-criterion filter in one pass, which is the exact failure mode the broad+filter split addresses. The pilot: one “broad proposer” subagent generates 15 follow-up candidates from each interpretation; a “filter” subagent ranks them against the parent's kill-criterion and outputs the same auto_run / proposal JSON the orchestrator already parses. Success = at least one auto-runnable follow-up across 10 experiments that the current single-agent would have missed (judged by manual review). Failure = same outputs as single-agent, or noisier outputs the user has to filter manually.
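The filter half of the split is the deterministic part and is easy to sketch. The example below assumes a hypothetical `Candidate` shape; the real auto_run / proposal JSON schema is not shown in this report, and the `score` field stands in for whatever ranking signal the filter subagent produces.

```typescript
// Hypothetical candidate shape; the real auto_run / proposal JSON schema is not shown here.
interface Candidate {
  title: string;
  addressesKillCriterion: boolean; // would the result bear on the parent's kill criterion?
  score: number;                   // filter subagent's ranking score, higher is better
}

const BROAD_N = 15; // candidates generated by the broad proposer
const KEEP_N = 5;   // candidates that survive the filter

function filterCandidates(candidates: Candidate[], keep: number = KEEP_N): Candidate[] {
  return candidates
    .slice(0, BROAD_N)                       // cap the broad stage's output
    .filter((c) => c.addressesKillCriterion) // drop anything off-criterion
    .sort((a, b) => b.score - a.score)       // best first
    .slice(0, keep);
}
```

Separating the two stages means the broad proposer never has to trade breadth against the kill-criterion check, which is the single-pass tension described above.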
Pilot 3: critic round on plans. The existing review pairs already establish the contract for “Claude critic + Codex critic + reconciler” in the Sagan workflow. Adding one such round to the planner before owner approval is a cheap copy-paste of the pattern. The published evidence is weakest here (planning critics have not been benchmarked head-to-head against owner-as-critic), so this is a small bet justified by reusing existing infrastructure rather than by a paper result. Success = fewer owner re-plan cycles on the next 10 plans. Failure = critics produce nitpicks that don't reduce owner cycles.
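A single critic round needs a rule for turning two critiques into one decision. The sketch below assumes hypothetical `Critique` and `Verdict` types; the reconciler contract in the existing review pairs may differ. It separates blocking issues from nits so that nitpicks alone never trigger a re-plan, which targets the failure mode named above.

```typescript
// Hypothetical types; the reconciler contract in the existing review pairs may differ.
interface Critique { critic: "claude" | "codex"; blocking: string[]; nits: string[] }
type Verdict = { action: "send_to_owner" } | { action: "revise"; issues: string[] };

// One round only: merge both critics' blocking issues; nits never force a revision.
function reconcile(critiques: Critique[]): Verdict {
  const issues: string[] = [];
  critiques.forEach((c) =>
    c.blocking.forEach((issue) => {
      if (issues.indexOf(issue) === -1) issues.push(issue); // dedupe identical issues
    })
  );
  return issues.length === 0 ? { action: "send_to_owner" } : { action: "revise", issues };
}
```

Deduplication here is exact string matching, which is only a placeholder; two critics will rarely phrase the same objection identically.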
Why these were ruled out.
- Implementer. Cognition's June 2025 postmortem is the explicit counter-evidence: parallel coder subagents make conflicting edits because each makes implicit decisions the others can't see. The Devin team moved away from multi-agent implementation toward a single-threaded coder with full context. Sagan's experiment-implementer already works this way and should stay that way.
- Experimenter / pod ops. These are operational, not reasoning-bound. Teams add latency without lifting quality.
- Consistency-checker. The check is mechanical: does the new plan change one variable from its parent or not? A second agent would just rubber-stamp the first.
- Weekly digest, insight scan, daily-log entries. Narrow summarization tasks with one source of truth and one consumer. The survey literature finds that team patterns help on breadth-bound or critique-bound tasks, not on summary tasks with tight scope.
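To illustrate why the consistency check listed above is mechanical, the one-variable rule can be written as a plain diff over configuration keys. This is a sketch assuming a hypothetical flat `PlanConfig`; real plans may nest values or carry non-scalar fields, and the actual consistency-checker agent may work differently.

```typescript
// Hypothetical flat config shape; real plans may nest or carry non-scalar values.
type PlanConfig = Record<string, string | number | boolean>;

// Keys whose values differ between parent and child, including added or removed keys.
function changedVariables(parent: PlanConfig, child: PlanConfig): string[] {
  const keys = Object.keys(parent).concat(
    Object.keys(child).filter((k) => !(k in parent))
  );
  return keys.filter((k) => parent[k] !== child[k]).sort();
}

function changesExactlyOneVariable(parent: PlanConfig, child: PlanConfig): boolean {
  return changedVariables(parent, child).length === 1;
}
```

A check of this shape has one right answer per input, so a second agent reviewing it has nothing to disagree about.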
Confidence: LOW — the ranking rests on a single strong head-to-head benchmark (Anthropic's research-system writeup) plus literature shape-matching for the other steps; no Sagan-side A/B has been run yet, so the proposal is a hypothesis, not a measurement. The kill criterion is published: if pilot 1 fails to lift unique-source-count or blows the cost budget, the rest of the ordering should be re-examined.
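The Pilot 1 kill criterion can be stated as a small decision function. The thresholds below come from the pilot description (success at +1 unique source with at most 2× cost; failure on no lift or cost above 2.5×); the `AbRun` record shape, the use of a ratio of mean token counts, and the explicit "inconclusive" outcome for the band in between are assumptions.

```typescript
// Thresholds come from the Pilot 1 text; the record shape is hypothetical.
interface AbRun {
  baselineSources: number; // unique high-quality sources, single agent
  teamSources: number;     // unique high-quality sources, team
  baselineTokens: number;
  teamTokens: number;
}
type Outcome = "success" | "failure" | "inconclusive";

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function judgePilot1(runs: AbRun[]): Outcome {
  if (runs.length === 0) return "inconclusive";
  const lift = mean(runs.map((r) => r.teamSources - r.baselineSources));
  // Ratio of mean token counts across runs (an assumed way to aggregate cost).
  const costRatio = mean(runs.map((r) => r.teamTokens)) / mean(runs.map((r) => r.baselineTokens));
  if (lift <= 0 || costRatio > 2.5) return "failure"; // kill criterion
  if (lift >= 1 && costRatio <= 2) return "success";
  return "inconclusive"; // e.g. lift below 1, or cost between 2x and 2.5x
}
```

With only five A/B jobs, the averages are noisy, so an "inconclusive" result is a realistic outcome rather than an edge case.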
| Parameter | Value |
|---|---|
| Scope | Sagan-side agents under .claude/agents/ and runner jobs under services/runner/src/jobs/; EPS-specific compatibility scripts excluded per tenant guardrail. |
| Sources surveyed | 19, dated May 2024 – Feb 2026 except the 2023 Du et al. debate paper (included as the seminal anchor for the debate pattern); mix of vendor engineering blogs, peer-reviewed conference papers, and postmortems. |
| Workflow steps reviewed | 15 (8 agent-driven, 5 runner jobs, 2 owner-driven artifacts). |
| Candidates proposed to pilot | 3 — lit-review parallelization, broad+filter follow-ups, plan critic round. |
| Candidates explicitly kept single-agent | 6 — implementer, experimenter, consistency-checker, weekly digest, insight scan, daily-log entries. |
| Candidates already teamed (keep) | 4 — code review, interpretation critique, clean-result critique, uploads. |
| Compute used | Web search + drafting on the Sagan VM. No GPU, no RunPod. |
| Wall time | ~45 min of agent time across planning + research + drafting. |
Survey notes — 19 sources, with one-line takeaways
| Source | Date | Type | One-line takeaway |
|---|---|---|---|
| Anthropic: How we built our multi-agent research system | 2025-06 | Vendor blog | Orchestrator + parallel subagent research team beats single Opus-4 by 90.2% on internal breadth-first eval; load-bearing source for Pilot 1. |
| Cognition: Don't Build Multi-Agents | 2025-06 | Postmortem | Naive multi-agent coders share context poorly and make conflicting edits; single-threaded linear agent is the recommended default for code generation. Load-bearing source for keep-implementer-single. |
| Cognition: Devin's 2025 Performance Review | 2025-12 | Postmortem | Eighteen months of single-threaded coder operation; confirms the “don't team coders” conclusion holds at scale. |
| Microsoft: AutoGen v0.4 | 2025-01 | Vendor blog | Actor-model multi-agent runtime; documents async messaging and observability as the right abstractions for teams — relevant when sizing the cost of building teams in run-agent.ts. |
| Magentic-One (Fourney et al., arXiv 2411.04468) | 2024-11 | Paper + system | Orchestrator + four specialized agents reach 38% GAIA / 32.8% WebArena; example of how an Orchestrator that re-plans on error wraps tool-using subagents. |
| Du et al.: Multiagent Debate (ICML 2024) | 2023-05 | Paper | Older but seminal: multiple model instances debating improves factuality and reasoning; baseline for the “debate” pattern family. |
| ICLR 2025 blogpost: Multi-LLM Debate — scaling challenges | 2025-04 | Blogpost | Critical: multi-agent debate does not consistently outperform Chain-of-Thought or self-consistency at equal compute. Cautionary anchor against over-claiming debate uplift. |
| MetaGPT (Hong et al., ICLR 2024) | 2024 ICLR | Paper | SOPs encoded into role prompts; 5 agents; structured documents over chat; useful contrast against debate-style teams. |
| ChatDev (Qian et al., ACL 2024) | 2024 ACL | Paper | 7-agent software-dev pipeline via inception prompting; high token cost (often >$10 per HumanEval task) — load-bearing reminder that teams aren't free. |
| OpenAI Swarm (educational) | 2024-10 | Vendor repo | Lightweight handoff/routine primitives; deprecated in favor of OpenAI Agents SDK. Documents the minimum API surface a team framework needs. |
| OpenAI Agents SDK | 2025-03 | Vendor docs | Production successor to Swarm: same handoff primitives plus guardrails and tracing. |
| LangGraph: Hierarchical Agent Teams | 2025 | Vendor docs | Supervisor-of-supervisors pattern; tool-based handoff is the recommended primitive. Confirms the existing review-pair + reconciler shape is a supervisor pattern in disguise. |
| RevAgent: Issue-Oriented Code Review | 2025-11 | Paper | Five category-specific commentator agents + critic + training loop; +12.9% BLEU on review-comment generation. Cheap external evidence for code-review pairs. |
| Cloudflare: Orchestrating AI Code Review at Scale | 2025 | Vendor blog | Seven specialized reviewer agents + coordinator that deduplicates and posts one structured review — production reference for the keep-the-code-review-team conclusion. |
| Plan-and-Act (arXiv 2503.09572) | 2025-03 | Paper | High-level Planner + low-level Executor split improves long-horizon task consistency. Relevant background for the “Maybe later” result-analysis split. |
| HiRA: Hierarchical Reasoning for Deep Search | 2025-07 | Paper | Three-agent Planner / Coordinator / Executor with an adaptive coordinator that prevents context loss between the other two — the architectural caution behind keeping the Reconciler agent in our review pairs. |
| CodeCoR: Self-reflective multi-agent code generation | 2025-01 | Paper | Verify-and-improve loop with explicit critic agent; consistent with adding critic rounds to the planner, but data is on code, not plans. |
| MAR: Multi-Agent Reflexion (arXiv 2512.20845) | 2026-02 | Paper | Replaces single self-reflecting model with multiple persona-guided critics to escape cognitive entrenchment in reasoning chains; cautionary evidence that solo self-reflection underperforms. |
| AgentReview (EMNLP 2024) | 2024 EMNLP | Paper | LLM-agent peer review pipeline shows reviewers develop adaptive strategies that can exploit the system — informs the rounds=3 cap and the reconciler design in our review pairs. |
Distribution. 19 sources, counted by the Type column above: 10 papers (peer-reviewed and preprints, one with an accompanying system), 6 vendor blogs, docs, or repos, 2 postmortems (both Cognition), and 1 critical blogpost (ICLR 2025). Dates: 17 within the last 18 months; the 2023 Du-et-al debate paper is included as the seminal anchor for the debate pattern family and is cited as such, not as recent evidence.
Biases I tried to surface. Vendor blogs (Anthropic, Microsoft, LangChain, Cloudflare) consistently report uplift; postmortems (Cognition x2) report fragility. The honest synthesis is that team patterns help when the task is breadth-bound (research, broad follow-ups) or critique-bound (review, interpretation), and hurt when the task is depth-bound with one coherent context (implementation, narrow summaries). That synthesis is what the Recommendation column reflects.
Reproducibility (agent-facing)
Artifacts.
- Model / adapters: n/a (no training).
- Training dataset: n/a.
- Raw completions: n/a (no eval).
- WandB run(s): n/a.
- Eval JSON in repo: n/a.
- Source inventory: agents enumerated from .claude/agents/*.md at git HEAD; runner jobs enumerated from services/runner/src/jobs/*.ts at git HEAD.
- Survey sources: 19 URLs listed in the Survey notes block above; all public web pages.
Compute.
- Wall time: ~45 min agent time (inventory + 19-source survey + drafting).
- GPU: 0 (no RunPod pod launched).
- Pod: n/a — ran as a Sagan-VM agent run only.
- Estimated token cost: ~$10–$20 in Anthropic API charges (web search + drafting), at or under the plan's $20 ceiling.
Code.
- Entry scripts: n/a (no code shipped; this is a proposal).
- Git commit: n/a (no code change attached to this experiment).
- Hydra configs: n/a.
- Reproduce: the survey can be re-run by querying the same 19 URLs and re-doing the agent inventory at HEAD; output will differ as the literature evolves.
Follow-up experiment shells (to be filed separately).
- Pilot 1: parallel paper crawlers + synthesizer for lit-review.ts; A/B on five queued project_lit_review jobs; success = +1 unique high-quality source on average without >2× token cost.
- Pilot 2: broad + filter team for follow-up-proposer; success = ≥1 net-new auto-runnable follow-up across 10 experiments.
- Pilot 3: one critic round on experiment-planner before owner approval; success = fewer re-plan cycles on the next 10 plans.
Timeline · 14 events
- state_changed · user · proposed → clarifying: Moved on Pipeline board to clarifying.
- state_changed · runner · clarifying → awaiting_clarifications: Claude produced clarifying questions; awaiting owner answers.
- state_changed · user · awaiting_clarifications → clarifying: Owner re-dispatched the planner from awaiting_clarifications.
- state_changed · runner · clarifying → awaiting_clarifications: Claude produced clarifying questions; awaiting owner answers.
- state_changed · user · awaiting_clarifications → planning: Moved on Pipeline board to planning.
- state_changed · runner · planning → plan_pending: Experiment plan is ready for owner approval.
- approval_requested · runner: Experiment plan approval requested.
- state_changed · user · plan_pending → queued: Moved on Pipeline board to queued.
- state_changed · user · queued → approved: Approved from Pipeline board after moving to queued.
- state_changed · runner · approved → implementing: Orchestrator 5bb80a98 queued to implement and dispatch.
- blocked · runner · implementing → blocked: Cascaded from agent_run f43d96ac failed
- state_changed · user · blocked → running: Reopened after manual retry of agent run f43d96ac.
- state_changed · user · running → awaiting_promotion: Research/proposal artifact written to experiments.body per docs/clean-result-guidelines.md (modified for survey: TL;DR + primary table + Experimental design + Survey notes + Reproducibility). Three pilots proposed, six steps explicitly kept single-agent.
- state_changed · user · awaiting_promotion → archived: Moved on Pipeline board to archived.