EPS
#298 · Archived

Three Sagan workflow steps worth piloting as agent teams; six worth keeping single-agent (LOW confidence)

kind: experiment

TL;DR

  • Motivation. Sagan already has three two-agent review pairs (code review, interpretation, clean result) plus an upload/verifier pair, but most of the rest of the pipeline — planner, implementer, follow-up proposer, literature review — runs as a single agent. The question is where the next team patterns are worth the added latency and token cost, and where single-agent flows should stay single-agent.
  • What I did. I inventoried every agent under .claude/agents/ and the runner jobs under services/runner/src/jobs/, then surveyed 19 sources from the last 18 months on multi-agent orchestration — Anthropic's research-system writeup, Cognition's "don't build multi-agents" postmortem, the AutoGen v0.4 and Magentic-One releases, the LangGraph Supervisor and OpenAI Agents SDK contracts, and academic results on debate, planner–executor splits, and critic-driven self-reflection. For every Sagan workflow step I asked: would a team pattern from the survey improve cost, quality, or robustness here?
  • Result (see table below). Of 15 workflow steps reviewed, 3 are worth piloting as teams, 4 already are teams and should stay teams, 6 should stay single-agent, and 2 are maybe-with-caveats. The top recommendation is replacing the single-agent lit-review.ts job with an Anthropic-style parallel-crawlers + synthesizer team, where the published lift is largest and the existing job is the most obviously breadth-bound. Don't team the implementer; Cognition's postmortem is load-bearing evidence that single-threaded coders win on coherent context.
  • Next steps.
    • Pilot 1 (recommended next): parallel paper crawlers + synthesizer for services/runner/src/jobs/lit-review.ts and project-lit-review.ts. A/B against the current single agent on five queued project_lit_review jobs; success = at least one additional unique high-quality source per job on average, at no more than 2× token cost.
    • Pilot 2: broad + filter team for follow-up-proposer. Generate 15 candidates broadly, then filter to 5 against the parent's kill criterion. Success = at least one auto-runnable follow-up, across 10 experiments, that the current agent would have missed.
    • Pilot 3 (smaller bet): add one Claude/Codex critic round to experiment-planner before owner approval, mirroring the existing review pairs. Success = fewer re-plan cycles on the next 10 plans.
    • File the three pilots as separate Sagan experiments rather than piling them into this one.
| Workflow step | Current agent(s) | Proposed team shape | Expected uplift | Cost (latency / tokens) | Recommendation |
| --- | --- | --- | --- | --- | --- |
| Literature review job | jobs/lit-review.ts, single agent | Orchestrator + 3–5 parallel paper-crawler subagents + synthesizer (Anthropic research-system pattern) | Largest expected lift; published Anthropic eval shows +90% over single-agent baseline on breadth-first research | ~3–5× tokens, ~1.5× wall time; subagents run in parallel so wall time scales sublinearly | Adopt (Pilot 1) |
| Follow-up proposer | follow-up-proposer.md, single agent | Broad proposer subagent + filter/ranking subagent (fan-out then merge) | Better coverage of long-tail follow-ups; explicit kill-criterion checking | ~2× tokens, marginal wall time | Adopt (Pilot 2) |
| Plan drafting | experiment-planner.md, self-loops on clarify | Planner + one critic round (Claude or Codex) before owner approval; same shape as the existing review pairs | Fewer owner re-plan cycles; cheaper than the current "owner is the critic" loop | ~1.3× tokens; saves wall-clock by removing one owner round-trip | Adopt (Pilot 3) |
| Code review | claude-code-reviewer + codex-code-reviewer + review-reconciler | Already a critic-pair + reconciler team | Catches what one model misses; reconciler handles ties | ~2× tokens for the pair + reconciler when invoked | Keep team |
| Interpretation critique | claude-interpretation-critic + codex-interpretation-critic + reconciler | Already a critic-pair + reconciler team | Hypothesis/claim drift caught earlier; rounds capped at 3 | ~2× tokens for the pair | Keep team |
| Clean-result critique | claude-clean-result-critic + codex-clean-result-critic + reconciler | Already a critic-pair + reconciler team | Promotion gate; catches unsupported claims | ~2× tokens for the pair | Keep team |
| Artifact upload | uploader + upload-verifier | Worker + read-only verifier already split | Hard gate before interpreting; verifier is mechanical and cheap | Verifier is <5% of uploader cost | Keep team |
| Experiment implementation | experiment-implementer.md, single agent | None (don't team) | Negative: Cognition's postmortem documents fragility when sub-coders share context poorly | n/a | Keep single |
| Experiment execution / pod run | experimenter.md | None (don't team) | Operational task; teams add latency, not quality | n/a | Keep single |
| Consistency check | consistency-checker.md | None (don't team) | Mechanical one-variable check; deterministic enough | n/a | Keep single |
| Weekly digest | jobs/weekly-digest.ts | None (don't team) | Narrow summary; teams over-engineer | n/a | Keep single |
| Insight scan | jobs/insight-scan.ts | None (don't team) | Narrow pattern detection over existing data | n/a | Keep single |
| Daily-log entry writing | Owner-typed in dashboard / Haiku-drafted snapshots | None (don't team) | Trivial; teams add nothing | n/a | Keep single |
| Result analysis | analyzer.md (result-analyzer), single agent | Maybe: analyst + skeptic split (two perspectives feeding the existing interpretation-critique pair) | Could surface alternative readings of ambiguous metrics | ~1.5× tokens; only worth it on multi-metric experiments | Maybe later |
| Experiment monitoring | experimenter handles run + early interpretation | Maybe: watchdog subagent (running) + interpreter subagent (post-run) | Faster failure detection on long runs | Watchdog must be cheap (Haiku); otherwise cost > benefit | Maybe later |

Each row is a step in the Sagan workflow — the issue lifecycle (/issue <N> → plan → review → implement → experiment → analyze → clean-result → follow-ups), plus the runner jobs in services/runner/src/jobs/. Adopt rows are the three pilots; Keep team rows are existing two-agent structures that the survey says should stay; Keep single rows are steps where the literature (especially Cognition's postmortem) recommends against teaming; Maybe later rows are deferred pending lower-hanging fruit. Token-cost columns are order-of-magnitude estimates from published deep-research and code-review benchmarks; actual costs depend on per-run query volume.
Experimental design — how I picked these candidates

Tenancy guardrail. Per CLAUDE.md, every proposed insertion point must be tenant-agnostic — would a hypothetical second project plausibly want this same thing? Every Adopt and Keep team row above passes that test: literature review, follow-up generation, planning, code review, interpretation, clean-result critique, and uploads are all Sagan-shaped, not EPS-shaped. EPS-specific compatibility scripts (e.g. mentor-results-data.ts) are out of scope here and would be touched in the EPS repo if at all.

Inventory of the agents already wired into the repo. Source: ls .claude/agents/ on the VM checkout. Fifteen agents, grouped by role:

| Role | Agent files |
| --- | --- |
| Planning | experiment-planner.md, consistency-checker.md |
| Implementation | experiment-implementer.md, experimenter.md |
| Code-review pair | code-reviewer.md (Claude), codex-code-reviewer.md (Codex), reconciler.md |
| Result analysis | analyzer.md (result-analyzer) |
| Interpretation-critique pair | interpretation-critic.md (Claude), codex-interpretation-critic.md (Codex), reconciler.md (reused) |
| Clean-result-critique pair | clean-result-critic.md (Claude), codex-clean-result-critic.md (Codex), reconciler.md (reused) |
| Upload pair | uploader.md, upload-verifier.md |
| Follow-ups | follow-up-proposer.md |

Three review pairs already use the orchestrator + critic + reconciler pattern (matching LangGraph's supervisor pattern and the Cloudflare AI code review "specialized reviewers + coordinator" design). One worker + verifier pair (uploader / upload-verifier) follows the same shape with one writer and one read-only checker. Every other agent listed is a single-agent flow.

Runner jobs surveyed. Source: find services/runner/src/jobs/ -name '*.ts'. Each is a single-agent run today: lit-review.ts, project-lit-review.ts, weekly-digest.ts, insight-scan.ts, and job-runs.ts (dispatcher glue). The dispatcher.ts itself is non-agent code; it's the harness, not a candidate for teaming.

Survey methodology. I sampled 19 sources on multi-agent orchestration patterns over the last 18 months: vendor engineering blogs (Anthropic, Microsoft Research, LangChain, Cognition, OpenAI, Cloudflare), peer-reviewed work (MetaGPT at ICLR 2024, ChatDev at ACL 2024, Du et al. multiagent debate at ICML 2024, AgentReview at EMNLP 2024, RevAgent), and a small number of preprints on planner–executor splits, multi-agent reflexion, and self-correcting code generation. Vendor blogs were treated as advocacy unless paired with at least one independent paper or postmortem; the “keep single” recommendations lean on Cognition's June 2025 postmortem as the load-bearing counter-evidence.

Source-quality bar. Each cited source had to (a) be dated within 18 months (May 2024 onward) or include a justification if older, (b) include a URL, and (c) be at least one of: peer-reviewed paper, vendor engineering blog with a concrete deployment, or postmortem reporting failures. I excluded LinkedIn-style think-pieces.

Ranking rubric. Candidates were scored on three axes: (1) published evidence of uplift — does a paper or vendor benchmark show the team pattern beating the single-agent baseline for a task of this shape? (2) fit to the existing dispatcher + agent-loader contract — would the change be a configuration tweak or a structural rewrite? (3) blast radius — if the team pattern degrades, does it harm in-flight experiments? Pilots 1 and 2 score high on all three; pilot 3 is a smaller bet (worth doing because it's cheap, not because the evidence is strongest).

Why these three pilots, in this order.

Pilot 1: parallel paper crawlers + synthesizer for the literature-review jobs. Anthropic's research-system writeup is the single strongest published result on the orchestrator + parallel-subagent pattern, with a 90.2% lift over a single Opus-4 agent on their internal breadth-first research benchmark. The pattern is a direct fit for lit-review.ts and project-lit-review.ts: each runs a single agent that decides what to search, reads pages serially, and writes a draft. Replacing it with one orchestrator and 3–5 paper-crawler subagents in parallel is a structurally minor change to the job's prompt and dispatcher entry, not a rewrite of run-agent.ts. The pilot would run an A/B on the next five queued project_lit_review jobs and measure (a) number of unique high-quality sources cited and (b) total token cost. Success = +1 unique source on average without >2× cost. Falsifiable failure = either no source-count lift or cost >2.5×, both of which would close the experiment.
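
The orchestrator + parallel-crawler shape described for Pilot 1 could look roughly like the TypeScript below. This is a minimal sketch under stated assumptions, not the Sagan implementation: `FoundSource`, `Crawler`, `normalizeUrl`, `mergeSources`, and `runLitReviewTeam` are hypothetical names, the crawler subagent is modeled as a plain async function instead of a real agent call, and de-duplicating by normalized URL is an assumed stand-in for whatever merge logic the synthesizer would actually use.

```typescript
// Hypothetical shapes; nothing here is taken from run-agent.ts or the dispatcher.
interface FoundSource {
  url: string;
  title: string;
}

// A crawler subagent, modeled as an async function from one sub-query to sources.
type Crawler = (subQuery: string) => Promise<FoundSource[]>;

// Normalize a URL for de-duplication: trim, lowercase, drop query/fragment,
// then drop trailing slashes. (Lowercasing the path is a simplification.)
function normalizeUrl(url: string): string {
  return url.trim().toLowerCase().replace(/[?#].*$/, "").replace(/\/+$/, "");
}

// Synthesizer merge step: keep the first occurrence of each normalized URL,
// preserving the order in which crawler batches were returned.
function mergeSources(batches: FoundSource[][]): FoundSource[] {
  const seen = new Set<string>();
  const merged: FoundSource[] = [];
  for (const batch of batches) {
    for (const source of batch) {
      const key = normalizeUrl(source.url);
      if (!seen.has(key)) {
        seen.add(key);
        merged.push(source);
      }
    }
  }
  return merged;
}

// Orchestrator: fan out one crawler call per sub-query in parallel, then merge.
async function runLitReviewTeam(
  subQueries: string[],
  crawl: Crawler,
): Promise<FoundSource[]> {
  const batches = await Promise.all(subQueries.map((q) => crawl(q)));
  return mergeSources(batches);
}
```

Because `Promise.all` rejects as soon as any crawler rejects, a real job would probably need per-crawler error handling or retries; the sketch leaves that out to keep the fan-out and merge steps visible.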

Pilot 2: broad + filter follow-up proposer. Anthropic describes subagents as “intelligent filters” that iteratively gather and prune. The current single follow-up-proposer has to balance generating breadth and applying the kill-criterion filter in one pass, which is the exact failure mode the broad+filter split addresses. The pilot: one “broad proposer” subagent generates 15 follow-up candidates from each interpretation; a “filter” subagent ranks them against the parent's kill-criterion and outputs the same auto_run / proposal JSON the orchestrator already parses. Success = at least one auto-runnable follow-up across 10 experiments that the current single-agent would have missed (judged by manual review). Failure = same outputs as single-agent, or noisier outputs the user has to filter manually.
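
The filter half of the broad + filter split can be sketched as a pure ranking function. This is a sketch under assumptions I am adding for illustration: the `FollowUpCandidate` fields, the `relevance` score scale, and the output shape with `kind: "auto_run" | "proposal"` are hypothetical, since the JSON the orchestrator actually parses is not shown here. In the real pilot the scores would come from the filter subagent; in the sketch they are plain inputs.

```typescript
// Hypothetical candidate shape produced by the "broad proposer" subagent and
// annotated by the "filter" subagent. Field names are illustrative only.
interface FollowUpCandidate {
  id: string;
  relevance: number; // filter subagent's score, assumed to be in [0, 1]
  testsKillCriterion: boolean; // does it bear on the parent's kill criterion?
  autoRunnable: boolean;
}

// Hypothetical output shape; the real auto_run / proposal JSON may differ.
interface RankedFollowUp {
  id: string;
  kind: "auto_run" | "proposal";
}

// Drop candidates that do not address the kill criterion, rank the rest by
// relevance (highest first), and keep at most `keep` of them.
function filterFollowUps(
  candidates: FollowUpCandidate[],
  keep: number = 5,
): RankedFollowUp[] {
  return candidates
    .filter((c) => c.testsKillCriterion)
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, keep)
    .map((c): RankedFollowUp => ({
      id: c.id,
      kind: c.autoRunnable ? "auto_run" : "proposal",
    }));
}
```

Keeping the filter deterministic once scores exist would make the pilot's manual review easier: a reviewer can see which of the 15 candidates were dropped for missing the kill criterion and which were simply outranked.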

Pilot 3: critic round on plans. The existing review pairs already establish the contract for “Claude critic + Codex critic + reconciler” in the Sagan workflow. Adding one such round to the planner before owner approval is a cheap copy-paste of the pattern. The published evidence is weakest here (planning critics have not been benchmarked head-to-head against owner-as-critic), so this is a small bet justified by reusing existing infrastructure rather than by a paper result. Success = fewer owner re-plan cycles on the next 10 plans. Failure = critics produce nitpicks that don't reduce owner cycles.
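
One way to picture the reconciler step in a single plan-critic round is the sketch below. Everything in it is an assumption for illustration: `PlanVerdict`, `CriticReport`, and `reconcilePlanReview` are hypothetical names, and the real reconciler contract in the existing review pairs may look different. The sketch only shows the tie-handling idea: unanimous verdicts pass through, and a split is settled by a tie-breaker.

```typescript
type PlanVerdict = "approve" | "revise";

// Hypothetical report shape returned by each plan critic.
interface CriticReport {
  critic: string; // e.g. "claude" or "codex"
  verdict: PlanVerdict;
  findings: string[];
}

// Reconciler: if both critics agree, use their verdict; otherwise defer to
// the supplied tie-breaker. Findings are merged with exact duplicates removed.
function reconcilePlanReview(
  a: CriticReport,
  b: CriticReport,
  breakTie: (a: CriticReport, b: CriticReport) => PlanVerdict,
): { verdict: PlanVerdict; findings: string[] } {
  const verdict = a.verdict === b.verdict ? a.verdict : breakTie(a, b);
  const findings = Array.from(new Set(a.findings.concat(b.findings)));
  return { verdict, findings };
}
```

A conservative tie-breaker such as `() => "revise"` would send any split back to the planner once before the owner sees the plan, which is the cycle the pilot's success metric is meant to measure.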

Why these were ruled out.

  • Implementer. Cognition's June 2025 postmortem is the explicit counter-evidence: parallel coder subagents make conflicting edits because each makes implicit decisions the others can't see. The Devin team moved away from multi-agent implementation toward a single-threaded coder with full context. Sagan's experiment-implementer already works this way and should stay that way.
  • Experimenter / pod ops. These are operational, not reasoning-bound. Teams add latency without lifting quality.
  • Consistency-checker. The check is mechanical: does the new plan change one variable from its parent or not? A second agent would just rubber-stamp the first.
  • Weekly digest, insight scan, daily-log entries. Narrow summarization tasks with one source of truth and one consumer. The survey literature finds that team patterns help on breadth-bound or critique-bound tasks, not on summary tasks with tight scope.

Confidence: LOW. The ranking rests on a single strong head-to-head benchmark (Anthropic's research-system writeup) plus literature shape-matching for the other steps. No Sagan-side A/B has been run yet, so the proposal is a hypothesis, not a measurement. The kill criterion is stated up front: if Pilot 1 fails to lift the unique-source count or blows the cost budget, the rest of the ordering should be re-examined.
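
The Pilot 1 success and failure thresholds can be written down as a small decision rule. This is a sketch with assumptions labeled: `AbPair`, `PilotVerdict`, and `judgePilot1` are hypothetical names; "cost" is taken to be the ratio of total team tokens to total baseline tokens; "no lift" is read as a mean lift of zero or less; and results that meet neither the success nor the failure condition (for example, cost between 2× and 2.5×) are labeled "inconclusive", a category the proposal does not define.

```typescript
type PilotVerdict = "success" | "failure" | "inconclusive";

// One A/B pair: the same queued job run by the single agent and by the team.
interface AbPair {
  baselineSources: number; // unique high-quality sources, single agent
  teamSources: number; // unique high-quality sources, team
  baselineTokens: number;
  teamTokens: number;
}

function judgePilot1(pairs: AbPair[]): PilotVerdict {
  if (pairs.length === 0) return "inconclusive";

  let liftSum = 0;
  let baselineTokens = 0;
  let teamTokens = 0;
  for (const p of pairs) {
    liftSum += p.teamSources - p.baselineSources;
    baselineTokens += p.baselineTokens;
    teamTokens += p.teamTokens;
  }
  const meanLift = liftSum / pairs.length;
  const costRatio = teamTokens / baselineTokens;

  // Failure: no source-count lift, or cost above 2.5x.
  if (meanLift <= 0 || costRatio > 2.5) return "failure";
  // Success: at least +1 source on average at no more than 2x cost.
  if (meanLift >= 1 && costRatio <= 2) return "success";
  // Assumed gray zone, not defined in the proposal.
  return "inconclusive";
}
```

Writing the rule out makes the gap between the 2× success bound and the 2.5× failure bound explicit, so the pilot can decide in advance how to treat a result that lands between them.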

| Parameter | Value |
| --- | --- |
| Scope | Sagan-side agents under .claude/agents/ and runner jobs under services/runner/src/jobs/; EPS-specific compatibility scripts excluded per the tenancy guardrail. |
| Sources surveyed | 19, dated May 2024 – Feb 2026 except the 2023 Du et al. debate paper (kept as a seminal anchor); mix of vendor engineering blogs, peer-reviewed conference papers, and postmortems. |
| Workflow steps reviewed | 15 (8 agent-driven, 5 runner jobs, 2 owner-driven artifacts). |
| Candidates proposed to pilot | 3: lit-review parallelization, broad+filter follow-ups, plan critic round. |
| Candidates explicitly kept single-agent | 6: implementer, experimenter, consistency-checker, weekly digest, insight scan, daily-log entries. |
| Candidates already teamed (keep) | 4: code review, interpretation critique, clean-result critique, uploads. |
| Compute used | Web search + drafting on the Sagan VM. No GPU, no RunPod. |
| Wall time | ~45 min of agent time across planning + research + drafting. |

Survey notes — 19 sources, with one-line takeaways
| Source | Date | Type | One-line takeaway |
| --- | --- | --- | --- |
| Anthropic: How we built our multi-agent research system | 2025-06 | Vendor blog | Orchestrator + parallel subagent research team beats single Opus-4 by 90.2% on internal breadth-first eval; load-bearing source for Pilot 1. |
| Cognition: Don't Build Multi-Agents | 2025-06 | Postmortem | Naive multi-agent coders share context poorly and make conflicting edits; single-threaded linear agent is the recommended default for code generation. Load-bearing source for keep-implementer-single. |
| Cognition: Devin's 2025 Performance Review | 2025-12 | Postmortem | Eighteen months of single-threaded coder operation; confirms the "don't team coders" conclusion holds at scale. |
| Microsoft: AutoGen v0.4 | 2025-01 | Vendor blog | Actor-model multi-agent runtime; documents async messaging and observability as the right abstractions for teams. Relevant when sizing the cost of building teams in run-agent.ts. |
| Magentic-One (Fourney et al., arXiv 2411.04468) | 2024-11 | Paper + system | Orchestrator + four specialized agents reach 38% GAIA / 32.8% WebArena; example of how an Orchestrator that re-plans on error wraps tool-using subagents. |
| Du et al.: Multiagent Debate (ICML 2024) | 2023-05 | Paper | Older but seminal: multiple model instances debating improves factuality and reasoning; baseline for the "debate" pattern family. |
| ICLR 2025 blogpost: Multi-LLM Debate, scaling challenges | 2025-04 | Blogpost | Critical: multi-agent debate does not consistently outperform Chain-of-Thought or self-consistency at equal compute. Cautionary anchor against over-claiming debate uplift. |
| MetaGPT (Hong et al., ICLR 2024) | 2024 ICLR | Paper | SOPs encoded into role prompts; 5 agents; structured documents over chat; useful contrast against debate-style teams. |
| ChatDev (Qian et al., ACL 2024) | 2024 ACL | Paper | 7-agent software-dev pipeline via inception prompting; high token cost (often >$10 per HumanEval task). Load-bearing reminder that teams aren't free. |
| OpenAI Swarm (educational) | 2024-10 | Vendor repo | Lightweight handoff/routine primitives; deprecated in favor of OpenAI Agents SDK. Documents the minimum API surface a team framework needs. |
| OpenAI Agents SDK | 2025-03 | Vendor docs | Production successor to Swarm: same handoff primitives plus guardrails and tracing. |
| LangGraph: Hierarchical Agent Teams | 2025 | Vendor docs | Supervisor-of-supervisors pattern; tool-based handoff is the recommended primitive. Confirms the existing review-pair + reconciler shape is a supervisor pattern in disguise. |
| RevAgent: Issue-Oriented Code Review | 2025-11 | Paper | Five category-specific commentator agents + critic + training loop; +12.9% BLEU on review-comment generation. Cheap external evidence for code-review pairs. |
| Cloudflare: Orchestrating AI Code Review at Scale | 2025 | Vendor blog | Seven specialized reviewer agents + coordinator that deduplicates and posts one structured review; production reference for the keep-the-code-review-team conclusion. |
| Plan-and-Act (arXiv 2503.09572) | 2025-03 | Paper | High-level Planner + low-level Executor split improves long-horizon task consistency. Relevant background for the "Maybe later" result-analysis split. |
| HiRA: Hierarchical Reasoning for Deep Search | 2025-07 | Paper | Three-agent Planner / Coordinator / Executor with an adaptive coordinator that prevents context loss between the other two; the architectural caution behind keeping the Reconciler agent in our review pairs. |
| CodeCoR: Self-reflective multi-agent code generation | 2025-01 | Paper | Verify-and-improve loop with explicit critic agent; consistent with adding critic rounds to the planner, but data is on code, not plans. |
| MAR: Multi-Agent Reflexion (arXiv 2512.20845) | 2026-02 | Paper | Replaces single self-reflecting model with multiple persona-guided critics to escape cognitive entrenchment in reasoning chains; cautionary evidence that solo self-reflection underperforms. |
| AgentReview (EMNLP 2024) | 2024 EMNLP | Paper | LLM-agent peer review pipeline shows reviewers develop adaptive strategies that can exploit the system; informs the rounds=3 cap and the reconciler design in our review pairs. |

Distribution. 19 sources, counted by the Type column above: 10 papers (one with an accompanying system), 6 vendor blogs, docs, or repos, 2 postmortems, and 1 critical blogpost (ICLR 2025). Dates: 17 within the last 18 months; the 2023 Du-et-al debate paper is included as the seminal anchor for the debate pattern family and is cited as such, not as recent evidence.

Biases I tried to surface. Vendor blogs (Anthropic, Microsoft, LangChain, Cloudflare) consistently report uplift; postmortems (Cognition x2) report fragility. The honest synthesis is that team patterns help when the task is breadth-bound (research, broad follow-ups) or critique-bound (review, interpretation), and hurt when the task is depth-bound with one coherent context (implementation, narrow summaries). That synthesis is what the Recommendation column reflects.

Reproducibility (agent-facing)

Artifacts.

  • Model / adapters: n/a (no training).
  • Training dataset: n/a.
  • Raw completions: n/a (no eval).
  • WandB run(s): n/a.
  • Eval JSON in repo: n/a.
  • Source inventory: agents enumerated from .claude/agents/*.md at git HEAD; runner jobs enumerated from services/runner/src/jobs/*.ts at git HEAD.
  • Survey sources: 19 URLs listed in the Survey notes block above; all public web pages.

Compute.

  • Wall time: ~45 min agent time (inventory + 19-source survey + drafting).
  • GPU: 0 (no RunPod pod launched).
  • Pod: n/a — ran as a Sagan-VM agent run only.
  • Estimated token cost: ~$10–$20 in Anthropic API charges (web search + drafting), at or under the plan's $20 ceiling.

Code.

  • Entry scripts: n/a (no code shipped; this is a proposal).
  • Git commit: n/a (no code change attached to this experiment).
  • Hydra configs: n/a.
  • Reproduce: the survey can be re-run by querying the same 19 URLs and re-doing the agent inventory at HEAD; output will differ as the literature evolves.

Follow-up experiment shells (to be filed separately).

  • Pilot 1 — parallel paper crawlers + synthesizer for lit-review.ts; A/B on five queued project_lit_review jobs; success = +1 unique high-quality source on average without >2× token cost.
  • Pilot 2 — broad + filter team for follow-up-proposer; success = ≥1 net-new auto-runnable follow-up across 10 experiments.
  • Pilot 3 — one critic round on experiment-planner before owner approval; success = fewer re-plan cycles on the next 10 plans.

Timeline · 14 events

  1. state_changed · user · proposed → clarifying
    Moved on Pipeline board to clarifying.
  2. state_changed · runner · clarifying → awaiting_clarifications
    Claude produced clarifying questions; awaiting owner answers.
  3. state_changed · user · awaiting_clarifications → clarifying
    Owner re-dispatched the planner from awaiting_clarifications.
  4. state_changed · runner · clarifying → awaiting_clarifications
    Claude produced clarifying questions; awaiting owner answers.
  5. state_changed · user · awaiting_clarifications → planning
    Moved on Pipeline board to planning.
  6. state_changed · runner · planning → plan_pending
    Experiment plan is ready for owner approval.
  7. approval_requested · runner
    Experiment plan approval requested.
  8. state_changed · user · plan_pending → queued
    Moved on Pipeline board to queued.
  9. state_changed · user · queued → approved
    Approved from Pipeline board after moving to queued.
  10. state_changed · runner · approved → implementing
    Orchestrator 5bb80a98 queued to implement and dispatch.
  11. blocked · runner · implementing → blocked
    Cascaded from agent_run f43d96ac failed
  12. state_changed · user · blocked → running
    Reopened after manual retry of agent run f43d96ac.
  13. state_changed · user · running → awaiting_promotion
    Research/proposal artifact written to experiments.body per docs/clean-result-guidelines.md (modified for survey: TL;DR + primary table + Experimental design + Survey notes + Reproducibility). Three pilots proposed, six steps explicitly kept single-agent.
  14. state_changed · user · awaiting_promotion → archived
    Moved on Pipeline board to archived.
