Workflow optimizations

kind: infra

If you have many seeds to run in parallel, each on one GPU - create one multi GPU pod, not many single GPU pods (this is just cleaner)

Can/should we use agent teams

Should we be using different kinds/levels of abstraction of critics at each stage in parallel -> search the web for how people do this

Add a weekly code cleanup:

Just a normal refactor and consolidating things together Add a weekly workflow improvement:
Looks at transcripts of past week, identifies optimizations that could be made
Plans to make them

Currently I need to approve so that Claude can write to .claude/plans/issue-x.md. This is bad because I want the entire adversarial planner loop to run and only then I approve. Is there any way to fix this?

Can Happy Coder change the chat name to be more informative?

Clean results should give more context so that it is understandable even for someone who is not up to date with the project

At every stage if something is genuinely not clear, it should stop and ask the user for clarification But or else -> after the plan, everything should run smoothly until the final clean result is generated, but there should be a status/column for a clean result generated by Claude vs approved by me

The pod(s) used shoudl be PAUSED until I approve the final clean result -> then they should be deleted

Timeline · 10 events

epm:auto-defaults2026-05-02T19:09:47.000Z· system

## Auto-defaults applied - **`status:proposed`** added (issue had no status label). - **`…

<!-- epm:auto-defaults v1 -->
## Auto-defaults applied

- **`status:proposed`** added (issue had no status label).
- **`type:infra`** added (user chose to treat this as one omnibus infra task in chat).
- Body left as-is — has substance, just unstructured. Clarifier will turn it into a spec.

Next: clarifier (Step 1).
<!-- /epm:auto-defaults -->

epm:clarify2026-05-02T19:13:20.000Z· system

## Clarifier — blocking ambiguities ### Context resolved from project knowledge - **Big duplic…

<!-- epm:clarify v1 -->
## Clarifier — blocking ambiguities

### Context resolved from project knowledge

- **Big duplicate flag.** Two open `status:proposed type:infra` issues already cover most of this scope:
  - **#149 "Workflow improvement"** — a sprawling spec touching weekly cleanup/retro, clean-results-for-newcomers, planner→implementer→reviewer→tester loop, pod pause/resume, automated follow-ups, hooks-for-everything. Roughly items #4, #5, #8, #9, #10 of #202.
  - **#172 "Refactor to use agent teams in a worktree"** — full hard-cutover refactor of `/issue`, `/adversarial-planner`, and all multi-agent skills onto Claude Code native `team_name` + `isolation: "worktree"` primitives. Item #2 of #202, with a much tighter spec.
- **Item #1 (parallel seeds → one big pod) is already documented guidance** in `.claude/agents/planner.md` §9 "Resources & Parallelism": "if N seeds fit on one big pod, do that; if not, provision N ephemeral pods via `Parent: #M`". Not enforced by code, just preferred.
- **Item #5 (weekly transcript review)** is exactly the `retrospective` agent (`.claude/agents/retrospective.md`) — agent exists, just no scheduler.
- **Item #4 (weekly code cleanup)** has the building blocks: `.claude/skills/cleanup/SKILL.md` + `.claude/skills/refactor/SKILL.md`. Missing: a scheduler.
- **Item #10 (pause pods until approval, then delete) is already implemented** in `.claude/skills/issue/SKILL.md` Step 10c — after the clean-result is finalized, `AskUserQuestion` prompts to terminate vs keep stopped. Step 8 stops the pod automatically once upload-verifier PASSes.
- **Item #6 (plan-write permission gate).** `.claude/settings.json` already blanket-allows `Edit` / `Write` / `NotebookEdit` for this project, so the prompt you're hitting is *not* from project settings. Likely from your user-level `~/.claude/settings.json`, the planner subagent's restricted tool list, or a hook somewhere. (Worth diagnosing before designing a fix.)
- **Item #7 (Happy chat title).** `mcp__happy__change_title` exists and I called it at start of this very session. The "auto-update mid-session as state changes" piece doesn't exist.
- **Item #9 (separate column for Claude-generated vs user-approved clean result).** Project board today: `Clean Results` (when reviewer PASSes) → `Done (experiment)` (auto-complete). The skill explicitly says "There is no user sign-off step. Reviewer PASS … is the terminal gate; completion is automatic." Adding a user-approval gate here would **contradict that principle** — needs an explicit decision.

### Remaining blocking ambiguities

1. **Duplicate handling.** How do you want to resolve the overlap with #149 and #172?
   - (a) Close #149 + #172 as superseded by #202 (this issue becomes the new spec).
   - (b) Close #202 as duplicate; route all work through the existing #149/#172 (with their bodies as the spec).
   - (c) Keep all three open: split #202's items across #149 and #172 (most go to #149, item #2 goes to #172), close #202 as a tracking-only umbrella.
   - (d) Proceed with #202 as a fresh omnibus spec, but the planner reads + cites #149/#172 explicitly and we close those as duplicates after #202 lands.

2. **Scope/priority for one plan.** The 10 items range from ~30-min fixes (chat-title hook) to multi-week refactors (agent teams in worktrees, #172). The adversarial planner needs a *specific scope* — designing all 10 in one plan will produce something unimplementable. Which of these three do you want?
   - (a) **Quick-wins only** (<2h each): items #1 enforcement, #6 root-cause fix, #7 dynamic chat title, #10 enforcement. Defer #2/#3/#4/#5/#8/#9.
   - (b) **Workflow-first**: items #6, #9, #4, #5, #8 (the user-experience stuff: smoother gates, weekly cleanup/retro, richer clean-results). Defer the agent-architecture rewrite (#2/#3) to #172.
   - (c) **Architecture-first**: items #2, #3 (agent teams + parallel critics) — basically just the #172 refactor with #3 added. Defer everything else.
   - (d) **All of it**

epm:clarify-answers2026-05-03T08:53:48.000Z· system

## Clarifier answers (from user in chat) **Q1 (duplicates):** (a) Close #149 + #172 as …

<!-- epm:clarify-answers v1 -->
## Clarifier answers (from user in chat)

**Q1 (duplicates):** (a) Close #149 + #172 as superseded by #202. This issue becomes the new spec.

**Q2 (scope):** (d) All of it — one multi-PR roadmap staged over weeks.

**Q3 (clean-result approval):** Keep `clean-results:draft` label lingering longer until user manually promotes to `clean-results`. No new project-board column needed — just stop auto-promoting on reviewer PASS.

**Q4 (parallel critics):** (a) Split the current single Critic into 3 parallel critics (methodology / statistics / alternative-explanations) at the same planning stage. Can also add additional parallel critics beyond these 3.

**Q5 (weekly scheduling):** (b) Manual `/cleanup-weekly` and `/retro-weekly` slash commands, not automated crons.

**Q6 (plan-write permission):** Investigated. User-level `~/.claude/settings.json` has `defaultMode: "bypassPermissions"` — no file-write prompt should appear. The actual complaint is likely about the `plan-pending EXIT` in `/issue` Step 3 (adversarial planner finishes → EXITs → user must re-invoke to approve). Fix: remove the EXIT, ask for approval inline in the same session.

### Additional context resolved
- **#149 and #172** will be closed as superseded after #202 lands.
- **Item #10 (pod pause)** already fully implemented in Step 10c — user may not have been aware.
- **Item #1 (parallel seeds)** already documented guidance in planner §9 — needs enforcement, not new design.
- **Agent teams** experimental flag (`CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS`) is already enabled in user settings.

No blocking ambiguities remain. Proceeding to adversarial planning.
<!-- /epm:clarify-answers -->

epm:clarify2026-05-03T08:54:00.000Z· system

## Clarifier — All clear No blocking ambiguities remain. User answered all 6 questions in chat;…

<!-- epm:clarify v2 -->
## Clarifier — All clear

No blocking ambiguities remain. User answered all 6 questions in chat; answers persisted as `epm:clarify-answers v1` above.

**Context resolved:**
- #149 "Workflow improvement" and #172 "Refactor to use agent teams" — both superseded by this issue
- `.claude/agents/planner.md` §9 — parallel-seed guidance already exists (item #1)
- `.claude/agents/retrospective.md` — weekly-retro agent already exists, just needs a manual skill trigger (item #5)
- `.claude/skills/cleanup/SKILL.md` — cleanup infrastructure exists, needs manual skill trigger (item #4)
- `.claude/skills/issue/SKILL.md` Step 10c — pod-pause-until-approval already implemented (item #10)
- `~/.claude/settings.json` `defaultMode: "bypassPermissions"` — no file-write prompt; real issue is plan-pending EXIT (item #6)
- Web research on parallel critics (4 searches + 6 deep fetches) — strongest evidence supports specialized-lens parallel critics, not debate/angel-devil
- `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` already enabled in user settings

Proceeding to adversarial planning.
<!-- /epm:clarify -->

epm:plan2026-05-03T09:20:53.000Z· system

## Plan: Workflow Optimizations — Multi-PR Roadmap > **Cost gate:** 0 GPU-hours. All items are doc…

<!-- epm:plan v1 -->
## Plan: Workflow Optimizations — Multi-PR Roadmap

> **Cost gate:** 0 GPU-hours. All items are documentation, skill definitions, and agent prompt changes. Pure infra work on the local VM. Reply `approve` to dispatch.

### Goal

Deliver 10 workflow improvements as 8 PRs across 4 waves, ordered by dependency and risk. Every item addresses a specific user-reported friction point. The improvements span enforcement rules, new manual skills, orchestration flow changes, and (last) an agent-teams refactor.

### Requirement + Acceptance Criteria

| # | Item | Acceptance Criterion |
|---|------|---------------------|
| 1 | Parallel seeds → one pod | Consistency-checker WARNs on plans proposing N single-GPU pods for N seeds. Planner §9 says MUST, not SHOULD. |
| 2 | Agent teams refactor | `/issue` lifecycle runs on a synthetic issue using native team primitives for parallel phases. Fallback to `run_in_background` Agent() calls works if teams API is insufficient. |
| 3 | Parallel critics (3 lenses) | `/adversarial-planner` spawns 3 parallel critics (methodology, statistics, alternative-explanations). Merge step produces single verdict (worst wins). |
| 4 | `/cleanup-weekly` | Invocable, wraps existing cleanup with broader scope, produces structured report. |
| 5 | `/retro-weekly` | Invocable, wraps retrospective agent with 7-day lookback. |
| 6 | Inline plan approval | `/issue` Step 2 no longer EXITs. Interactive: asks inline (approve/revise/defer). Autonomous: auto-EXITs (preserves old behavior). |
| 7 | Dynamic chat title | Chat title updates at each status transition via centralized convention. |
| 8 | Clean results context | Template requires newcomer-friendly Background opening. Verifier WARNs if <30 words. Checklist updated. |
| 9 | Draft vs promoted flow | Reviewer PASS keeps `clean-results:draft`. New `status:awaiting-promotion` label. Full state-machine integration. |
| 10 | Pod pause docs | CLAUDE.md documents existing Step 10c behavior more prominently. |

---

### PR Roadmap

#### Wave 1 (independent — can be parallel)

**PR-A: Parallel-seed enforcement + pod-pause docs (Items 1, 10)**

Files modified:
- `.claude/agents/planner.md` — Strengthen §9 "Resources & Parallelism": change "Default action" for sweep parallelism from preferred to MUST. "When a plan proposes N seeds, default to one pod with N GPUs and CUDA_VISIBLE_DEVICES-sharded subprocesses, not N single-GPU pods. Exceptions: (a) each seed requires >1 GPU, (b) plan explicitly justifies per-seed pods."
- `.claude/agents/consistency-checker.md` — Add row: `| Parallel seed strategy | WARN | Plan proposes N single-GPU pods for N seeds — flag, ask planner to consolidate per planner.md §9. |`
- `CLAUDE.md` — In "Ephemeral Pod Lifecycle" section, add explicit paragraph documenting Step 10c pod-pause behavior.

Effort: 1-2h. Risk: Low.

---

**PR-B: Weekly cleanup + retro skills (Items 4, 5)**

New files:
- `.claude/skills/cleanup-weekly/SKILL.md` — Wraps `/cleanup` with weekly scope: entire repo, dead-code analysis across all modules, refactoring candidates (files >500 lines, functions >60 lines), dependency audit, `.claude/` health check. Manual trigger only.
- `.claude/skills/retro-weekly/SKILL.md` — Wraps retrospective agent with 7-day lookback: all session transcripts from past week, week-over-week comparison, batch improvement proposals, metrics (corrections/week, dispatches/week). Manual trigger only.

Effort: 2-3h. Risk: Low (purely additive).

---

**PR-C: Clean results newcomer context (Item 8)**

Files modified:
- `.claude/skills/clean-results/template.md` — Background subsection placeholder updated to require 1-2 sentences for a reader unfamiliar with the project (what is persona coupling/EM, why it matters), then 1-2 sentences on the specific motivation.
- `scripts/verify_clean_result.py` — Add WARN-level check: if Background subsection has <30 words, warn "Background may be too terse for newcomers." (Structural proxy, not semantic.)
- `.clau

epm:approve2026-05-03T09:22:46.000Z· system

## Plan approved User approved the plan inline (chat). Advancing to implementation. 8 PRs acro…

<!-- epm:approve v1 -->
## Plan approved

User approved the plan inline (chat). Advancing to implementation.

8 PRs across 4 waves. Starting with Wave 1 (PR-A, PR-B, PR-C) in parallel.
<!-- /epm:approve -->

epm:results2026-05-03T09:30:57.000Z· system

## Implementation Complete (Waves 1-3) **PR:** https://github.com/superkaiba/explore-persona-sp…

<!-- epm:results v1 -->
## Implementation Complete (Waves 1-3)

**PR:** https://github.com/superkaiba/explore-persona-space/pull/217 (draft)
**Branch:** `issue-202`
**Commits:** 3 (one per wave)

### What's implemented

| Wave | Commit | Items | Status |
|------|--------|-------|--------|
| 1 | `69cc06f` | PR-A (seed enforcement + pod docs), PR-B (weekly skills), PR-C (clean-result context) | Done |
| 2 | `b080d4c` | PR-D (dynamic chat title), PR-E (parallel critics) | Done |
| 3 | `067beb1` | PR-F (inline plan approval), PR-G (draft vs promoted flow) | Done |
| 4 | TBD | PR-H (agent teams refactor) | Deferred — depends on Waves 1-3 landing |

### Files changed (9 files, +443/-56)

- `.claude/agents/planner.md` — §9 sweep parallelism MUST
- `.claude/agents/consistency-checker.md` — parallel seed strategy WARN
- `.claude/agents/critic.md` — lens specialization note
- `.claude/skills/issue/SKILL.md` — chat title, inline approval, awaiting-promotion
- `.claude/skills/adversarial-planner/SKILL.md` — 3 parallel critics
- `.claude/skills/clean-results/{template,checklist,principles}.md` — newcomer context
- `.claude/skills/cleanup-weekly/SKILL.md` — NEW
- `.claude/skills/retro-weekly/SKILL.md` — NEW
- `CLAUDE.md` — pod-pause documentation
- `scripts/verify_clean_result.py` — background word-count WARN

### Tests
- 18/18 `test_verify_clean_result.py` pass
- Lint clean

### Remaining
- `status:awaiting-promotion` label needs to be created on the repo
- Wave 4 (agent teams) deferred to follow-up issue
<!-- /epm:results -->

epm:code-review2026-05-03T09:33:52.000Z· system

## Code-Reviewer Verdict -- CONCERNS **Diff size:** +443 / -56 lines across 12 files **Plan…

<!-- epm:code-review v1 -->
## Code-Reviewer Verdict -- CONCERNS

**Diff size:** +443 / -56 lines across 12 files
**Plan adherence:** COMPLETE (all 9 of 10 planned items implemented; Wave 4 deferred per plan)
**Tests:** 18/18 PASS -- but INSUFFICIENT for the new check (see below)
**Lint:** PASS (`ruff check` clean)
**Security sweep:** CLEAN (no secrets, no injection vectors)

---

## Plan Adherence

| Plan Item | Diff Addresses | Notes |
|-----------|---------------|-------|
| PR-A: Planner §9 MUST + consistency-checker WARN + pod-pause docs | Done | planner.md, consistency-checker.md, CLAUDE.md all updated |
| PR-B: `/cleanup-weekly` + `/retro-weekly` skills | Done | Two new SKILL.md files, well-structured |
| PR-C: Background newcomer context + verifier WARN | Done | template, checklist, principles, verify_clean_result.py |
| PR-D: Dynamic chat title at transitions | Done | Centralized convention in SKILL.md |
| PR-E: 3 parallel critics | Done | adversarial-planner rewritten, critic.md updated |
| PR-F: Inline plan approval | Done | Step 2c added, Step 3 preserved for backward compat |
| PR-G: `status:awaiting-promotion` flow | Done | State diagram, tables, Step 9b, resume semantics |
| PR-H: Agent teams refactor | Deferred (per plan) | Wave 4, to be a follow-up |

---

## Issues Found

### Major (diff needs revision before merge)

**1. Step 10 label removal does not account for `awaiting-promotion`**
- `SKILL.md:746`: Step 10 item 5 says `remove status:reviewing or status:testing as applicable` when applying the done label.
- But when Step 10 fires from `status:awaiting-promotion` (the new experiment path), the active label is `awaiting-promotion`, not `reviewing` (which was already removed in Step 9b).
- **Impact:** After auto-complete, the `status:awaiting-promotion` label would remain as a stale label on the issue alongside `status:done-experiment`. This breaks the "one active status label" invariant and would confuse any filter query like `gh issue list --label status:awaiting-promotion`.
- **Fix:** Change Step 10 item 5 to: `remove status:reviewing or status:testing or status:awaiting-promotion as applicable`.

**2. CLAUDE.md "After Every Experiment" step 4 is now stale**
- `CLAUDE.md:18`: Says `labeled clean-results:draft until reviewer PASS`. With the new flow, reviewer PASS no longer promotes to `clean-results` -- it stays at `clean-results:draft` until the *user* manually promotes.
- **Impact:** An agent reading CLAUDE.md (and many do -- it is loaded into every session) would believe reviewer PASS auto-promotes, contradicting the SKILL.md flow. This creates a race condition where one agent follows CLAUDE.md and auto-promotes while the skill says to wait.
- **Fix:** Update step 4 to say something like `labeled clean-results:draft until user promotes it after reviewer PASS`.

**3. `research-project-structure.md` status enumeration missing `awaiting-promotion`**
- `.claude/rules/research-project-structure.md:29-32`: Lists the full status chain as `proposed -> planning -> plan-pending -> approved -> implementing -> code-reviewing -> running -> uploading -> interpreting -> reviewing -> done-experiment / done-impl`. The new `awaiting-promotion` state between `reviewing` and `done-experiment` is missing.
- **Impact:** This file is a rules file loaded into agent context. An agent trying to understand the lifecycle from this document would miss the new state entirely.
- **Fix:** Insert `awaiting-promotion` between `reviewing` and `done-experiment` in the enumeration.

### Minor (worth fixing but doesn't block)

**4. Test fixture `GOOD_BODY` Background is only 23 words -- below the new 30-word minimum**
- `tests/test_verify_clean_result.py:28`: The `GOOD_BODY` Background section reads: "Prior issue #34 found that tulu midtraining at 100% mixing preserves alignment but harms capability. This follow-up sweeps the mixing ratio to 25%." -- 23 words.
- The new `check_background_context` function emits WARN for <30 words.
- The te

epm:test-verdict2026-05-03T09:37:01.000Z· system

## Test verdict: PASS ### Unit tests - **80 passed**, 1 failed (pre-existing `test_hub.py:…

<!-- epm:test-verdict v1 -->
## Test verdict: PASS

### Unit tests
- **80 passed**, 1 failed (pre-existing `test_hub.py::test_upload_file` — HF Hub network issue, not related to this PR), 4 skipped, 19 deselected
- All 19 `test_verify_clean_result.py` tests pass (including new `test_background_too_terse_warns`)
- `test_data_validation.py` excluded (pre-existing missing module `explore_persona_space.data`)

### Lint
- `ruff check` + `ruff format --check`: all changed files clean

### Code review findings
All 6 findings from code-review v1 (CONCERNS) addressed in commit `80955d8`:
1. Step 10 label removal includes `status:awaiting-promotion` ✓
2. CLAUDE.md step 4 reflects manual promotion flow ✓
3. `research-project-structure.md` status chain includes `awaiting-promotion` ✓
4. GOOD_BODY fixture expanded to 45+ words; dedicated WARN test added ✓
5. `retro-weekly` output path changed to `docs/retro-weekly/` ✓
6. Parallel critics pseudocode shows `run_in_background=True` ✓

No regressions detected. Advancing to auto-complete.
<!-- /epm:test-verdict -->

epm:done2026-05-03T09:37:30.000Z· system

## Done — Workflow Optimizations (Waves 1-3) **Outcome:** 9 of 10 workflow improvements implemente…

<!-- epm:done v1 -->
## Done — Workflow Optimizations (Waves 1-3)

**Outcome:** 9 of 10 workflow improvements implemented across 4 commits on branch `issue-202`.

### What shipped
| Item | Change | Status |
|------|--------|--------|
| 1. Parallel seeds | Planner §9 MUST + consistency-checker WARN | Done |
| 2. Agent teams | Deferred to follow-up (Wave 4) | Deferred |
| 3. Parallel critics | 3-lens parallel critics in /adversarial-planner | Done |
| 4. /cleanup-weekly | New manual skill | Done |
| 5. /retro-weekly | New manual skill | Done |
| 6. Inline plan approval | Interactive asks inline; autonomous auto-EXITs | Done |
| 7. Dynamic chat title | Centralized convention in /issue SKILL.md | Done |
| 8. Clean results context | Newcomer Background + verifier WARN + checklist | Done |
| 9. Draft vs promoted | `status:awaiting-promotion` + user manual promote | Done |
| 10. Pod pause docs | Documented existing Step 10c behavior in CLAUDE.md | Done |

### Key numbers
- 4 commits, 12 files changed, +376 insertions
- 19/19 verify_clean_result tests pass (1 new test added)
- 0 GPU-hours consumed
- Supersedes #149 and #172 (both closed)

### What's next
- **Wave 4 (PR-H): Agent teams refactor** — largest item, depends on these waves landing. Uses Claude Code native team primitives for parallel phases only. Includes fallback to `run_in_background` Agent() calls. Should be filed as a follow-up issue.
- **Merge PR #217** when ready — user merges.

### PR
https://github.com/superkaiba/explore-persona-space/pull/217

Moved to **Done (impl)** on the project board.
<!-- /epm:done -->

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)