Mask the persona-CoT rationale from loss (input-side context only) to isolate input-conditioning vs production-gradient mechanisms for #186's matched-scaffold effect

kind: experiment

Parent: #186 (persona-flavored CoT scaffold in wrong-answer SFT amplifies source-persona adoption and bystander leakage under matched eval, MODERATE confidence). Sibling follow-up (consolidated): #280 ran a length-matched factorial whose garbage_cot arm held loss-token count fixed at ~30-40 tokens while replacing rationale content with lorem-ipsum filler. Results are now consolidated into #186 Result 2 (the original standalone clean-result at #345 has been closed as superseded). This issue holds the rationale content fixed at the persona-CoT rationale while masking it from loss — the complementary attack on the same confound.

How #344 and #280 partition the confound space (2×2)

#186's persona-CoT effect is confounded between rationale content in input and rationale contributing to loss. The two follow-ups carve the 2×2 cleanly:

| | Content-rich rationale in input | No content in input |
|---|---|---|
| Rationale loss-bearing | #186 `persona_cot` / `generic_cot` (the original effect, bystander +0.163) | #186 Result 2 (was #280/#345) `garbage_cot` ≈ +0.004 bystander ✓ empirically tested |
| Rationale NOT loss-bearing | #344 `persona_cot_loss_on_answer` (this issue) | #186 `no_cot` ≈ 0 bystander (original) |

Empirical status (as of 2026-05-11): The garbage_cot cell has been tested. #186 Result 2 (originally #345, now consolidated into #186) shows persona-CoT-train leaks +0.159 macro bystander accuracy points more than length-matched garbage-token-CoT-train (Holm p < 0.01, 95% CI [+0.156, +0.163]). So loss-token count alone is not the mechanism — content matters. #344's question is now sharper: given that content matters, does the rationale need to be produced at train time (production-gradient mechanism), or is seeing it in input context during training enough (input-conditioning mechanism)?

If #344 persona_cot_loss_on_answer matches persona_cot → input-side conditioning is the mechanism. Implication: the model treats persona-CoT-shaped input as a cue for the wrong-answer pattern; bystanders pick up the cue because their generated rationales at eval time look similar enough.
If #344 flattens to ~zero → the rationale needs to be produced at train time for the behavior to burn in. Implication: matched-scaffold leakage requires production-side gradient on the rationale tokens.

Goal

Add one new train arm to #186's factorial — persona_cot_loss_on_answer — where the persona-flavored chain-of-thought rationale appears in the assistant turn as input-side context but is masked from the loss; only the Answer: <wrong_letter> line (~3-4 tokens) is loss-bearing. This isolates two distinct mechanisms for #186's matched-scaffold finding:

(a) Input-side conditioning — the model learns "given a persona-CoT-shaped rationale in context, predict the wrong answer". At eval time, the model's own sampled rationale acts as the conditioning context, and bystanders pick up the cue because their generated rationales look similar enough. Under (a), the loss-on-answer arm should still produce source-persona adoption + bystander leakage under matched persona-CoT eval.
(b) Production-side gradient — the model learns to produce the rationale itself, which somehow carries the wrong-answer-tendency forward. Under (b), the loss-on-answer arm should flatten to ~zero on both source and bystander axes.

#186's persona-CoT arm conflates (a) and (b) because the rationale is both in the input and loss-bearing. #280 attacks the loss-token-count confound by length-matching within the CoT-having regime (does rationale content matter at fixed length?). This issue attacks the loss-token-count confound from the other side: by masking the rationale from loss while keeping it in the input (does the rationale need to be produced at train time, or is seeing it in context enough?).

Hypothesis (H1 primary, H2 secondary)

H1 (primary, falsifiable). Under matched persona-CoT eval, the persona_cot_loss_on_answer arm produces a source-persona accuracy loss ≥ +0.10 macro and a bystander accuracy loss ≥ +0.05 macro (i.e., a substantial fraction of #186's persona-CoT effect survives masking the rationale from the training loss). Implication: the mechanism is mostly (a) input-side conditioning.

Falsification of H1: source macro < +0.05 AND bystander macro < +0.03. Implication: mechanism is mostly (b) production-side gradient.
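The H1 thresholds above can be read as a three-way decision rule. The following is a minimal sketch, not part of the pre-registration: `classify_h1` is a hypothetical helper name, and treating the region between the two threshold pairs as "ambiguous" is an assumption, since the issue does not name that case.

```python
def classify_h1(source_macro: float, bystander_macro: float) -> str:
    """Classify a matched persona-CoT eval outcome against the H1 thresholds."""
    # H1 supported: both axes clear their pre-registered floors.
    if source_macro >= 0.10 and bystander_macro >= 0.05:
        return "supports_input_conditioning"
    # H1 falsified: both axes fall below the falsification ceilings.
    if source_macro < 0.05 and bystander_macro < 0.03:
        return "supports_production_gradient"
    # Assumption: anything in between is labelled ambiguous.
    return "ambiguous"
```

For example, an outcome at #186's matched-eval magnitudes would land in the first branch, while a result near zero on both axes would land in the second.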

H2 (secondary). Bystander leakage on the loss-on-answer arm tracks the matched-scaffold-gating pattern from #186: leakage emerges under matched persona-CoT eval, vanishes under no-CoT eval. Falsification: leakage emerges under no-CoT eval too, OR fails to emerge under matched eval.

Method delta vs #186

Reuse #186's persona-CoT training data verbatim. The (persona, question, persona-flavored-rationale, wrong-letter) tuples are already on the Hub at superkaiba1/explore-persona-space-data. No new Phase-0 data generation needed.
One new training arm: persona_cot_loss_on_answer. Same dataset, same assistant-turn template as persona_cot. Only difference: the chat template inserts {% generation %}...{% endgeneration %} markers around the Answer: <letter> line, and the trainer sets assistant_only_loss=True. This masks the rationale text from the loss while keeping it in the input context.
Loss-token count matches #186's no_cot arm (~3-4 tokens per example) — kills the loss-token-count confound for the persona-CoT vs no-CoT comparison.
Reuse #186's eval pipeline verbatim (scripts/run_issue186_eval.py): 11 personas × 4 eval scaffolds × N=1,172 ARC-C test questions per cell. Same hybrid CoT-then-logprob protocol, same cot_max_tokens=768.
Reuse #186's pod (epm-issue-186) via pod.py resume --issue 186 (HF cache warm, faster startup). Alternatively provision a fresh epm-issue-<N> pod with --intent lora-7b.

Conditions

| Train arm | Assistant turn (input) | Loss applied to | Cells |
|---|---|---|---|
| `persona_cot_loss_on_answer` (new) | `<persona-thinking>persona-flavored rationale</persona-thinking> Answer: <wrong_letter>` | `Answer: <wrong_letter>` only (~3-4 tokens) | 12 (4 sources × 3 seeds) |

Sources: software_engineer, librarian, comedian, police_officer (same as #186). Seeds: 42, 137, 256.

Stretch arm (skip for v1 unless the user explicitly asks): generic_cot_loss_on_answer — same protocol with generic-CoT rationale in input. Adds 12 cells (~11 GPU-hr) and lets us see whether the (a)-vs-(b) split is the same across the persona-vs-generic axis.

Pre-registered comparisons

All under matched persona-CoT eval, per source persona, paired bootstrap n=1,000, n_pairs=3,516 (1,172 questions × 3 seeds):

persona_cot_loss_on_answer vs untrained baseline (per source, matched persona-CoT eval) — H1 test on source axis.
persona_cot_loss_on_answer bystander loss (per source, matched persona-CoT eval) — H1 test on bystander axis; also macro across 4 sources.
persona_cot_loss_on_answer vs persona_cot (per source, matched persona-CoT eval) — quantifies the fraction of #186's persona-CoT effect that survives masking the rationale from loss.
persona_cot_loss_on_answer no-CoT-eval bystander loss vs matched-eval bystander loss — H2 test (matched-scaffold gating survives loss masking).
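The paired bootstrap used by these comparisons can be sketched as follows. This is an illustrative percentile bootstrap under stated assumptions, not the project's implementation: the function name is hypothetical, and it assumes each pair is one (question, seed) observation scored 0/1 for correctness in two conditions.

```python
import random


def paired_bootstrap_mean_diff(a, b, n_resamples=1000, seed=0):
    """Percentile-bootstrap 95% CI for mean(a - b) over paired observations."""
    if len(a) != len(b):
        raise ValueError("paired inputs must have equal length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    point = sum(diffs) / n
    draws = []
    for _ in range(n_resamples):
        # Resample pair indices with replacement so every draw keeps pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        draws.append(sum(sample) / n)
    draws.sort()
    lo = draws[int(0.025 * len(draws))]
    hi = draws[min(len(draws) - 1, int(0.975 * len(draws)))]
    return point, lo, hi
```

With `a` as baseline correctness and `b` as post-training correctness over the 3,516 pairs, a positive point estimate corresponds to an accuracy loss after training.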

Falsification & kill criteria

Falsification of H1 (input-side conditioning is the mechanism): source macro < +0.05 AND bystander macro < +0.03 under matched persona-CoT eval. Implication: the effect is production-side gradient.
Kill criterion: training fails to take on the new arm (source-persona accuracy drops by < 5pp on average across the 4 sources under matched persona-CoT eval). That would mean the loss-masking broke training, requiring a hparam re-tune.
Dry-run gate (mandatory): before launching the full 12-cell sweep, run a single-source 1-seed dry-run that verifies the {% generation %} masking is actually masking the rationale tokens. Pattern from scripts/run_issue_203_train.py:247-261 — sample 5 training examples, decode the loss mask, confirm the rationale tokens have label = -100 and the answer letter does not.
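The check the dry-run gate performs can be illustrated on toy data. This is a sketch under assumptions: the token ids are made up, `build_labels` and `audit_mask` are hypothetical names, and a real gate would inspect the labels emitted by the trainer's collator rather than construct them by hand.

```python
IGNORE_INDEX = -100  # label value that excludes a position from the loss


def build_labels(token_ids, answer_start):
    """Copy token ids into labels, masking every position before answer_start."""
    return [IGNORE_INDEX if i < answer_start else tok
            for i, tok in enumerate(token_ids)]


def audit_mask(labels, answer_start):
    """True when the rationale is fully masked and the answer span is loss-bearing."""
    if not 0 < answer_start < len(labels):
        return False
    rationale_masked = all(lab == IGNORE_INDEX for lab in labels[:answer_start])
    answer_live = all(lab != IGNORE_INDEX for lab in labels[answer_start:])
    return rationale_masked and answer_live
```

Running `audit_mask` over a handful of sampled examples and failing hard on any `False` is one way to make the gate fail closed before the full sweep.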

Compute estimate

Training: 12 cells × ~17 min/cell on 1× H100 (per #186's measured timing) ≈ 3.4 GPU-hr.
Eval: 12 cells × ~39 min/cell (full 4-scaffold × 11-persona grid) ≈ 7.8 GPU-hr.
Phase-0 (data gen): 0 GPU-hr — reusing #186's data.
Total: ~11 GPU-hr on 1× H100, compute:medium.
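The estimate above is straightforward arithmetic; a short sketch makes it easy to re-derive if the cell count or per-cell timings change. The helper name is hypothetical, and the per-cell minutes are the figures quoted in this section.

```python
def gpu_hours(cells: int, minutes_per_cell: float) -> float:
    """Convert a cell count and per-cell wall time (minutes) into GPU-hours."""
    return cells * minutes_per_cell / 60.0


train_hours = gpu_hours(12, 17)   # 12 cells at ~17 min/cell
eval_hours = gpu_hours(12, 39)    # 12 cells at ~39 min/cell
total_hours = train_hours + eval_hours
print(f"train={train_hours:.1f} eval={eval_hours:.1f} total={total_hours:.1f}")
```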

Implementation hooks

TRL 0.29+ assistant_only_loss=True + {% generation %} chat-template markers is the canonical pattern. Working reference: scripts/run_issue_203_train.py:148-261 (handles chat-template replacement + dry-run masking gate).
Modify the Qwen2.5-7B-Instruct chat template to wrap {% generation %}...{% endgeneration %} around the answer-line tokens specifically (NOT the whole assistant turn). Verify the wrap-position via the dry-run gate above.
New training entrypoint: scripts/run_issue_<N>_train.py (mirrors run_issue_203_train.py's structure but uses #186's data and condition factorial).
Eval entrypoint: scripts/run_issue186_eval.py works unmodified — it loads adapters by name, so adding i<N>_<source>_persona_cot_loss_on_answer_seed<S>_post_em adapter names to the sweep manifest is the only change.
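The adapter names to add to the sweep manifest follow the pattern quoted above. A minimal sketch of enumerating them, assuming the issue number is passed in explicitly; `adapter_names` is a hypothetical helper and the manifest format itself is not shown here.

```python
from itertools import product

SOURCES = ["software_engineer", "librarian", "comedian", "police_officer"]
SEEDS = [42, 137, 256]


def adapter_names(issue: int, arm: str = "persona_cot_loss_on_answer"):
    """Enumerate one adapter name per (source, seed) cell for the given arm."""
    return [
        f"i{issue}_{source}_{arm}_seed{seed}_post_em"
        for source, seed in product(SOURCES, SEEDS)
    ]
```

Calling `adapter_names(344)` yields the 12 names for the 4 sources × 3 seeds grid.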

Assumptions (per CLAUDE.md "List assumptions before implementing")

| Claim | Confidence | Verification path |
|---|---|---|
| TRL 0.29+ supports `assistant_only_loss=True` with `{% generation %}` markers | HIGH | Working in `run_issue_203_train.py`; commit hash visible in #203's results |
| `{% generation %}` can wrap a partial assistant turn (not just the whole turn) | MEDIUM | Verify with TRL docs + dry-run masking gate before full sweep |
| Reusing #186's persona-CoT data is sufficient (no need to regenerate) | HIGH | Data is on Hub at `superkaiba1/explore-persona-space-data`; same tuples |
| #186's eval pipeline runs unmodified on new adapters | HIGH | Pipeline loads adapters by name; no adapter-arm-specific logic |
| Loss-token-count of ~3-4 tokens per example × 1119 examples × 1 epoch is sufficient to train (not under-trained) | MEDIUM | #186's `no_cot` arm at same hparams produced ~0 effect: could be under-trained or could be the actual no-conditioning-no-production result. The proposed arm has INPUT-side rationale plus answer-letter loss; if H1 holds, this is enough signal. If kill criterion fires, hparam re-tune needed (mirrors #96's recipe: lr=1e-5, 3 epochs) |

Pod preference

pod.py resume --issue 186 (reuse #186's pod, HF cache warm) OR pod.py provision --issue <N> --intent lora-7b (fresh ephemeral pod). User pref TBD at /issue <N> time.

Plan deviations allowed without re-asking

Adjust {% generation %} placement to match Qwen template quirks (the exact answer-line tokens may need fiddling to align with the chat template's existing markers).
Hot-fix masking-gate failures ≤ 10 lines, no logic change.

Plan deviations that REQUIRE re-asking

Adding the generic-CoT-loss-on-answer arm (would double compute to ~22 GPU-hr / compute:large).
Changing LR / epochs / batch size from #186's hparams (would re-introduce a hparam-comparison confound).
Changing the eval grid (sources, personas, scaffolds, N).
Adding more seeds or sources.

References

Parent: #186 — persona-CoT × wrong-answer SFT factorial; matched-scaffold leakage finding (MODERATE).
Sibling follow-up: #280 — length-matched garbage + contradicting controls (currently status:running).
Reference recipe: #203 — working assistant_only_loss=True + {% generation %} chat-template implementation; will reuse the dry-run masking gate.
Counter-recipe: #96 — letter-only training at lr=1e-5, 3 epochs, 800 examples successfully drove source ARC-C from 84% → 1.9%. Demonstrates letter-only training CAN burn in the wrong-answer behavior at ~10× more gradient signal; relevant if this issue's H1 falsifies and we need to disentangle "input-side conditioning fails" from "letter-only is under-trained at #186's hparams".
Lineage: #75, #80, #138 — capability-coupling and 11-persona behavioral axis.

Timeline · 30 events

epm:clarify · 2026-05-11T08:48:44.000Z · system

<!-- epm:clarify v1 -->
## Clarifier — 2 blocking ambiguities + 1 stale-prose note

The issue body is unusually detailed (H1+H2 with explicit falsification, kill criteria, dry-run masking gate, reproducibility hooks). Context-gathering surfaced **two factual mismatches between the body's "reuse verbatim from #186" claim and the current state of `main` + the Hub** that must be resolved before adversarial-planning, plus one stale prose item that's non-blocking.

### Context resolved from project knowledge

- **Parent #186** is at `status:awaiting-promotion` with `clean-results:draft` (MODERATE confidence — matches body claim). H1 falsified at #186 (`macro delta_H1 = +0.024`, predicted negative); matched-scaffold-gated bystander leakage is the surviving result. Source: [`RESULTS.md`](../blob/main/RESULTS.md) `issue186/` row.
- **Sibling #280** is at `status:awaiting-promotion` (clean-result [#345](../issues/345) drafted, reviewer PASS). #280 Phase 2 = 66 cells (40 i280 + 27 i186 carry-over) completed 2026-05-11.
- **#203 working recipe** (`assistant_only_loss=True` + `{% generation %}` markers) lives at `scripts/run_issue_203_train.py` on `main` (lines 148-261). Reusable verbatim.
- **#96 counter-recipe** (letter-only at lr=1e-5, 3 ep) at `status:awaiting-promotion` with `clean-results:draft`. The counter-baseline argument in the body holds.
- **#186 LoRA adapters** ARE on the Hub: `superkaiba1/explore-persona-space::i186_{source}_persona_cot_seed{42,137,256}_post_em/` confirmed for `comedian` × 3 seeds + others.
- **#186 condition-arm hparams** are recoverable from #186's `epm:plan v1` / `epm:experiment-implementation v1` markers.

### Blocking ambiguities

**Q1. Source-of-truth for "#186 eval/train scripts" — they're not on `main`.**

The body asserts:

> Reuse #186's eval pipeline verbatim (`scripts/run_issue186_eval.py`)
> New training entrypoint: `scripts/run_issue_<N>_train.py` (mirrors `run_issue_203_train.py`'s structure but uses #186's data and condition factorial).

But `scripts/run_issue186_eval.py`, `scripts/run_issue186_train.py`, `scripts/generate_issue186_data.py`, `scripts/smoke_issue186.py`, and `configs/condition/issue186/*.yaml` are NOT on `main`. They live on the unmerged `origin/issue-280` branch (which inherited them from `origin/issue-186`). Verified via `git ls-tree -r origin/issue-280 -- scripts/run_issue186_eval.py` (PASS) vs `ls scripts/run_issue186_eval.py` on `main` (FAIL).

Which branch should the #344 worktree branch off from? Three options:

1. **Branch off `origin/issue-280`** (worktree starts with #186 + #280 code already present). Simplest path. Risk: also drags in unmerged #280 code that #344 doesn't need.
2. **Branch off `main`, cherry-pick #186 scripts as part of the #344 diff.** Cleaner separation but adds ~1200 lines of "imported" code to the #344 PR. Reviewer must verify the cherry-pick.
3. **Wait for #186 (or #280) to merge to `main` first.** Both are at `awaiting-promotion`; promoting + merging unblocks this. Adds latency.

**Q2. Persona-CoT training data is NOT on the Hub at the path the body claims.**

The body asserts:

> Reuse #186's persona-CoT training data verbatim. The (persona, question, persona-flavored-rationale, wrong-letter) tuples are already on the Hub at `superkaiba1/explore-persona-space-data`. No new Phase-0 data generation needed.

Verified inventory of `superkaiba1/explore-persona-space-data` (2649 files, 19 top-level prefixes):
- `issue280/` exists and contains `garbage-cot` / `generic-cot` / `contradicting-cot` / `scrambled-english-cot` jsonl files per source.
- **No `issue186/`, `i186_*`, or `persona_cot` / `persona-cot` files anywhere.**
- The on-branch data generator (`scripts/generate_issue186_data.py` on `origin/issue-280`) writes locally to `data/sft/issue186/` (Sonnet-4.5 API calls; ~$0.0098/row × 1119 rows × 4 sources ≈ **$44 in API spend**) and does not upload to Hub.

Where does the persona-CoT data come from for #344? Options:

1. **Regenerate via `genera

epm:clarify-answers · 2026-05-11T08:51:36.000Z · system

<!-- epm:clarify-answers v1 -->
## Clarifier answers (from chat)

**Q1 — branch base for #344 worktree:** **Branch off `origin/issue-280`.** Worktree starts with #186 + #280 code already present. `/issue` Step 4a will run `git worktree add .claude/worktrees/issue-344 -b issue-344 origin/issue-280` instead of the default HEAD. #344 PR cannot merge cleanly to main until #280 lands; not a #344 blocker.

**Q2 — persona-CoT training data source:** **Regenerate via `scripts/generate_issue186_data.py`** on the chosen branch. Costs ~$44 in Anthropic API (Sonnet-4.5, 4 sources × 1119 rows). Deterministic on same seed → byte-identical tuples to #186 (and to the data #186's adapters were trained on). The "no Phase-0 needed" claim in the body is amended — Phase 0 IS needed for #344 because the persona-CoT data was never uploaded to the dataset repo.

### Implications for adversarial-planner

1. **Base ref:** plan reproducibility card MUST cite `base = origin/issue-280` (not `main`), AND include the upstream SHA of `origin/issue-280` at the time of worktree creation. Lets `code-reviewer` diff cleanly.
2. **Phase 0 added:** plan must include a Phase-0 data-regen step (~$44 Anthropic + ~10 min wall-time). Seed = same seed as #186 (verify in #186's `epm:plan v1` Reproducibility Card). Suggested: upload the regenerated data to `superkaiba1/explore-persona-space-data::issue186_persona_cot_data/{source}_persona_cot_seed42.jsonl` BEFORE training so subsequent follow-ups don't have to regen.
3. **Stale prose:** body's partition table says `#280 garbage_cot *(running)*`; it's `awaiting-promotion`. Planner should note (footnote, not edit).
<!-- /epm:clarify-answers -->

epm:consistency · 2026-05-11T20:04:42.000Z · system

<!-- epm:consistency v1 -->
## Consistency Check: #344 vs parent #186 / sibling #280

**Verdict: PASS**

### Parent experiment(s): #186; sibling #280

### Variables that differ (should be exactly 1):
1. **Label-masking on rationale tokens**: `persona_cot` (#186) has the full rationale + answer in the loss; `persona_cot_labels_on_answer` (#344) masks the rationale to `-100`, leaving only the `\nAnswer: <letter>` line (~3-4 tokens) loss-bearing. — **INTENDED CHANGE**
2. **Fresh `persona_cot_FRESH` baseline (12 new cells)**: trained via a custom `{% generation %}`-marker template instead of #186's default Qwen template, serving as the H1 denominator. The plan explicitly acknowledges and justifies this (round-1 C4 carry-over-template-confound fix). — **JUSTIFIED: required to isolate chat-template-string axis from label-masking axis.**
3. **Fresh `no_cot_FRESH` cells (4 cells, seed=42 only)**: template-matched TOST mediation comparator. Minimal scope (4 cells × 1 seed). — **JUSTIFIED: same rationale as (2).**
4. **Training script**: new `scripts/run_issue_344_train.py` (mirrors #203 recipe). — **JUSTIFIED: `assistant_only_loss=True` + partial `{% generation %}` wrap cannot be expressed via #186's Hydra `train.py` path.**

No unjustified confounds introduced.

### Shared baseline check:
- **Base model**: MATCH — `Qwen/Qwen2.5-7B-Instruct` (Reproducibility Card row confirmed)
- **Eval suite**: MATCH — ARC-Challenge test N=1,172, 11 personas × 4 eval scaffolds, `cot_max_tokens=768`, hybrid CoT-then-logprob, same `run_issue186_eval.py` (extended, not replaced)
- **Seeds**: MATCH — `{42, 137, 256}` on main cells; `no_cot_FRESH` and mediation use seed=42 only (explicitly scoped + disclosed in plan)
- **Data version**: MATCH — regenerated via same `generate_issue186_data.py --seed 42`, same ARC-C train split N=1,119. Rationale tokens masked from loss so API output drift is LOW-impact (only `\nAnswer:` letter, deterministic from seed). Phase 0d anchor audit baked in.
- **Compute (per-cell hparams)**: MATCH — lr=5e-6, 1 ep, eff. batch 16, LoRA r=32/α=64, same targets. WARN: pod upgraded from 1×H100 (#186) to 4×H100 (#344) for sweep parallelism — each cell still runs on one GPU; no batch-size confound.

### Compute / parallelism:
- WARN (non-blocking): 4×H100 vs #186's 1×H100. Parallelism is the recommended CUDA_VISIBLE_DEVICES sweep pattern (one cell per GPU); per-cell hparams unchanged.
- PASS: ONE multi-GPU pod provisioned; no separate `epm-issue-344-b/c/d` pods.
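The one-cell-per-GPU pattern amounts to a round-robin split of the cell list across shards. A minimal sketch, assuming the launcher starts one process per shard and pins each to a single device (for example via `CUDA_VISIBLE_DEVICES`); `shard_cells` is a hypothetical name, not the project's function.

```python
def shard_cells(cells, gpu_shard: int, total_shards: int):
    """Round-robin split: shard k takes cells k, k + total, k + 2 * total, ..."""
    if total_shards <= 0 or not 0 <= gpu_shard < total_shards:
        raise ValueError("gpu_shard must satisfy 0 <= gpu_shard < total_shards")
    return cells[gpu_shard::total_shards]
```

Because each shard runs its cells sequentially on its own GPU, the per-cell batch size and other hparams are unaffected by the number of shards.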

### #280 sibling consistency:
- PASS — 1×3 axis framing (`no_cot` / `persona_cot_labels_on_answer` / deferred `garbage_cot_labels_on_answer`) is consistent with #280's `garbage_cot` (full-loss, content-less) cell. 2×2→1×3 reframe explained in plan §4 (round-1 C2).

### Closing-pass statistical changes:
- PASS — bystander-primary FALSIFY gate, r5 directional Holm split, TOST direction re-derivation, per-source FRESH floor are all internal statistical-specification refinements. None touch base model, eval dataset, seeds, data version, or hparams.

### Recommendation:
No blocking issues. Two non-blocking WARNs (4×H100 pod vs #186's 1×H100; regenerated-data API drift) are documented in the plan's §9 Risks table. Proceed to implementation.
<!-- /epm:consistency -->

epm:plan · 2026-05-11T20:07:33.000Z · system

<!-- epm:plan v1 -->

# Plan — Issue #344: `persona_cot_labels_on_answer`

> **Cost gate.** Two variants share the C1–C6 round-1 fixes + 14 round-2 fixes + 8 closing-pass fixes; user picks one at the Step 2c approval gate:
>
> - **Variant A (`compute:medium`, ~23.5 GPU-hr unconditional + 0.85 conditional, recommended-as-default):** persona-only `labels_on_answer` arm + `persona_cot_FRESH` baseline + `no_cot_FRESH` mediation comparator. Defers `generic_cot_labels_on_answer` to a follow-up. Reply **`approve`**.
> - **Variant B (`compute:large`, ~34.7 GPU-hr unconditional + 0.85 conditional):** adds `generic_cot_labels_on_answer` at 4 sources × 3 seeds; resolves the "persona-specific vs any-rationale-prefix" alternative. Reply **`approve-large`** (flips compute label).
>
> Estimated wall-time on 4× H100: Variant A ~5.9 hr, Variant B ~8.8 hr. Anthropic API: $50 (mediation) + Phase-0 data regen ($54 A / $108 B).

**Consistency-checker (epm:consistency v1):** **PASS**. Single-variable change from #186 verified; same base model / eval suite / seeds / data version. Two non-blocking WARNs: (i) pod upgraded from 1× → 4× H100 for sweep parallelism (no batch-size confound; each cell still runs on one GPU); (ii) regenerated rationale text may drift slightly from #186 (LOW impact — rationale tokens are masked from loss).

**Standing residual debt (round-3 reconciler closing-pass items, surfaced for your approval awareness):**

- TOST mediation can only SUPPORT the (a)-direction (when `labels_on_answer` produces MORE persona-voiced rationales than `no_cot_FRESH`); it cannot RULE OUT (c) eval-time format-conditioning. This caps confidence on the (c)-ruling-out direction at MODERATE. The plan binds this in §11 (Mediation rule) and the Variant B HIGH-ceiling.
- Variant A is **structurally unable** to discriminate "persona-specific input-side conditioning" from "any-rationale-prefix-affords-answer-only-SFT" — Variant A SUPPORTS is hard-capped at MODERATE per §11 binding constraint. HIGH-confidence persona-specificity claims require Variant B.
- The C3 conditional gate runs librarian × 3 seeds only; librarian recovery does NOT generalize to the other 3 sources (per §7 + §11 scope rule). 4-source FALSIFY verdict stands regardless of librarian.
- HIGH FALSIFY confidence requires the deferred `garbage_cot_labels_on_answer` length-matched-null arm (§15 re-ask).

---

## 1. Goal

Add ONE new train arm to [#186](https://github.com/superkaiba/explore-persona-space/issues/186)'s factorial — `persona_cot_labels_on_answer` — where the persona-flavored chain-of-thought rationale appears in the assistant turn as **input-side context** (and thus participates in the forward pass + receives gradient through answer-token attention) but its **labels are masked**, so there is no direct prediction loss on the rationale tokens. Only the `Answer: <wrong_letter>` line (~3-4 tokens) carries label-bearing loss.

**Manipulation, named precisely (per round-1 C1).** Rationale labels are set to `-100`. Rationale tokens still occupy positions in the input, still participate in self-attention, and the answer-token cross-entropy gradient *does* flow through their hidden states (the answer attends backward over them). What is removed is the **direct, per-position prediction loss** on rationale tokens — not gradient on rationale-token hidden states. This is the load-bearing distinction the round-1 critique surfaced; everything downstream uses this framing.
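What "no direct prediction loss on rationale tokens" means numerically can be shown with a toy masked cross-entropy. This is a pure-Python sketch with made-up logits; it only illustrates that positions labelled `-100` drop out of the loss sum. It does not model the gradient path through attention described above, which requires an actual transformer.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes no direct prediction loss
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[label]
        count += 1
    if count == 0:
        raise ValueError("no loss-bearing positions")
    return total / count
```

Changing the logits at a masked position leaves the returned value unchanged, which is the sense in which only the answer-line tokens are label-bearing.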

**The contrast (per round-1 C2).** Drop the 2×2 framing in the prior draft — `no_cot` is NOT a "no input rationale, not loss-bearing" cell (it has the answer letter loss-bearing too, the same ~3-4 tokens as the new arm; the partition was logically broken). What we actually run is a **1×3 along an "input-side rationale content" axis at fixed answer-letter loss budget of ~3-4 tokens**:

| | input rationale content | answer-letter labels in loss |
|---|---|---|
| `no_cot` (carry-over from #186) | none | yes (~3-4 tokens) |
| `ga

epm:plan-detail · 2026-05-11T20:07:35.000Z · system

<!-- epm:plan-detail v1 -->

# Plan #344 — Engineering Detail (continuation of `epm:plan v1`)

Continuation of the approved plan body. Sections §4 (Design), §7 (Decision Tree), §10 (Resources & Parallelism), §12 (Decision Rationale), and §16 (Open Assumptions). The approval-critical sections (§1 Goal, §3 Hypothesis, §5 Conditions, §6 Pre-registered comparisons, §11 Reproducibility Card, §15 Plan deviations REQUIRE re-asking, §17 Falsification & kill criteria) live in the immediately-preceding `epm:plan v1` comment.

---

## 4. Design

### Method delta vs #186

**Carry-over template mismatch (per round-1 C4 — explicit acknowledgement).** #186's `no_cot` / `persona_cot` / `generic_cot` / `garbage_cot` cells were trained via the in-process Hydra `sft.py` path under Qwen2.5-Instruct's DEFAULT chat template. #344's new arm is trained via a standalone `run_issue_344_train.py` with a CUSTOM `{% generation %}`-marker template. Comparing a #344 cell directly against a #186 carry-over cell confounds axis (i) "rationale labels in loss" with axis (ii) "chat-template string". We therefore train a **fresh `persona_cot_FRESH` baseline at all 4 sources × 3 seeds** under #344's template + standalone path, and use IT — not the #186 carry-over — as the denominator in `f_source` / `f_bystander`. Carry-over `persona_cot` numbers are reported as a secondary "carry-over comparison with template-drift caveat" anchor, never as the H1 denominator.

**Threshold conflation, retraction (per round-1 C5).** The prior draft's `+0.10 / +0.05` absolute thresholds against #186's `delta_H1 = +0.024` were comparing the wrong quantities: `delta_H1 = +0.024` is the no-CoT-eval cross-scaffold macro Δ, NOT a matched-eval headline. Matched-eval headlines (Figure 2/3 of #186) are source +0.219, bystander +0.163. Thresholds are now defined in fraction-of-effect terms (§3); absolute thresholds are retracted.
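A fraction-of-effect statistic with a paired bootstrap can be sketched as below. This is an assumption-laden illustration, not the plan's definition: it assumes the fraction is the new arm's mean per-pair effect divided by the `persona_cot_FRESH` mean per-pair effect, that numerator and denominator share resampled indices, and that draws with a near-zero denominator are discarded. The function name and return keys are hypothetical.

```python
import random


def bootstrap_fraction_of_effect(num, den, n_resamples=1000,
                                 denom_epsilon=1e-6, seed=0):
    """Bootstrap mean(num) / mean(den) using the same resampled pairs for both."""
    if len(num) != len(den):
        raise ValueError("paired inputs must have equal length")
    n = len(num)
    den_mean = sum(den) / n
    if abs(den_mean) < denom_epsilon:
        raise ValueError("denominator effect is too close to zero")
    rng = random.Random(seed)
    draws, discarded = [], 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # shared across num and den
        d = sum(den[i] for i in idx) / n
        if abs(d) < denom_epsilon:
            discarded += 1  # assumption: degenerate draws are dropped
            continue
        draws.append((sum(num[i] for i in idx) / n) / d)
    if not draws:
        raise ValueError("all bootstrap draws were degenerate")
    draws.sort()
    return {
        "point": (sum(num) / n) / den_mean,
        "ci_low": draws[int(0.025 * len(draws))],
        "ci_high": draws[min(len(draws) - 1, int(0.975 * len(draws)))],
        "n_discarded": discarded,
    }
```

Reporting the discarded-draw count alongside the interval helps flag cases where the denominator effect is too small for the ratio to be stable.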

**Variants (recap):**
- **Variant A** — new arm + fresh same-template baselines, persona-only.
- **Variant B** — adds `generic_cot_labels_on_answer` (4 sources × 3 seeds) to discriminate persona-specific input-conditioning from any-rationale-prefix-on-answer-only-SFT.

Both variants share: same data, #186 hparams, same eval grid, base ref `origin/issue-280`, C1-C6 fixes, multi-GPU pod (4× H100, CUDA_VISIBLE_DEVICES splits).

### Phases

- **Base ref**: `origin/issue-280` @ SHA `ec328608bcd18e2b307c7701104de9d1968e590b`. Worktree: `git worktree add .claude/worktrees/issue-344 -b issue-344 origin/issue-280`.
- **Phase 0a (one-time, ~10 LOC patch)** — add `--only-arm` CLI flag to `scripts/generate_issue186_data.py` on the issue-344 worktree (accepts `no-cot` | `persona-cot` | `generic-cot` | `persona-cot-correct`; when set, skips other arms in `_generate_all`). Verified necessary by fact-checker. Cap Phase 0 cost at ~$54 (persona-cot only) instead of ~$121 (full 4-arm regen). Allowed under §14.
- **Phase 0b (REQUIRED, persona-cot regen)** — `scripts/generate_issue186_data.py --out-base data/sft/issue186 --seed 42 --max-budget-usd 100 --only-arm persona-cot`. ~$54 Anthropic Sonnet-4.5, ~10 min wall, ~30s upload to `superkaiba1/explore-persona-space-data::issue186/{source}_persona-cot_seed42.jsonl`.
  - **Phase 0b' (REQUIRED, no-cot regen for `no_cot_FRESH` mediation comparator, per Alts R3 B2)** — `scripts/generate_issue186_data.py --out-base data/sft/issue186 --seed 42 --max-budget-usd 20 --only-arm no-cot`. The `no-cot` arm has NO Claude rationale generation — JSONL rows are `Answer: <wrong_letter>` only, derived deterministically from the seed. Cost: ~$0 API (no Claude calls in `_make_no_cot_row`); ~30s wall (generator iteration + Hub upload). Output path: `superkaiba1/explore-persona-space-data::issue186/{source}_no-cot_seed42.jsonl`. Required because the `no_cot_FRESH` cell trains on the #344 chat template, which Jinja-wraps `{% generation %}` around the WHOLE assistant turn (same as `persona_cot_FRESH`) — the data shape itself is shared with #186's `no_cot` arm bu

epm:step-completed · 2026-05-11T20:11:00.000Z · system

<!-- epm:step-completed v1 -->
## Step Completed

step: 4b
at: 79376db5
timestamp: 2026-05-11T20:10:59+00:00
next_expected_step: 5
exit_kind: clean
notes: experiment-implementer dispatched (Variant B, compute:large); awaiting epm:experiment-implementation v1
<!-- /epm:step-completed -->

epm:experiment-implementation · 2026-05-11T20:33:35.000Z · system

<!-- epm:experiment-implementation v1 -->
## Implementation Report — round 1

**Status:** READY-FOR-REVIEW

### (a) What was done

Variant B implementation per approved plan, branched from `origin/issue-280` @ `ec328608`. Three phased commits on `issue-344`:

| Path | New/Edit | LOC | Purpose |
|---|---|---:|---|
| `scripts/generate_issue186_data.py` | EDIT | +20 | Phase 0a: `--only-arm` flag (no-cot / persona-cot / generic-cot / persona-cot-correct). Filters cell list in `_generate_all` before any Anthropic call. Used by Phase 0b ($54 persona-cot regen) and Phase 0c ($54 generic-cot regen). |
| `scripts/run_issue_344_train.py` | NEW | 879 | Phase 1 training. Two chat-template modes: partial-turn `{% generation %}` around `\nAnswer:` only (for `*_labels_on_answer` arms) vs whole-turn (for `*_FRESH` arms). `assistant_only_loss=True`, `use_liger_kernel=False`, packing=True. Dry-run masking gate that hard-asserts `pct_masked >= 80`, rationale region all -100, answer region loss-bearing; saves `mask_audit_{cell}.json` and uploads as WandB Artifact. LoRA r=32, α=64. CLI: `--phase {main, c3_gate}`, `--variant {A, B}`, `--only-source`, `--only-arm`, `--only-seed`, `--gpu-shard`, `--total-shards`, `--dry-run-only`. HF Hub upload to `superkaiba1/explore-persona-space::i344_{src}_{arm}_seed{S}_post_em`. C3 gate (Phase 2b) uses `lr=1e-5, num_train_epochs=3` (per #96 hparams). Fail-closed on missing `\nAnswer:` anchor via Jinja2 UndefinedError (Plan §16 #7). |
| `scripts/run_issue186_eval.py` | EDIT | +801 / -4 | Phase 2 extensions: (1) `_hf_path_in_repo` switches `i186_*` ↔ `i344_*` based on arm name (covers `*_labels_on_answer`, `*_FRESH`, and the c3gate sentinel). (2) New `_all_cells_i344(variant, include_c3_gate)` enumerator (no_cot_FRESH is seed=42-only per Alts R3 B2). (3) `_paired_bootstrap_ratio` helper (~115 LOC): paired (q,s) resample with shared indices across numerator/denominator, configurable `denom_epsilon` + `degenerate_draw_policy='discard'`, returns `{point, ci_low, ci_high, p_one_sided_upper, p_two_sided, draws, n_discarded, frac_discarded, n_kept, n_resamples}`. (4) `--mode {legacy_delta_h1, fraction_of_effect}` flag (default legacy preserves #186/#280 behavior). (5) `_stage_aggregate_fraction_of_effect` (~370 LOC) — full Variant-B aggregator: FRESH macro+per-source gate, FRESH-vs-carry-over calibration, per-arm `f_source` / `f_bystander` per-source + macro paired bootstraps, `r5_*` LOSS-DELTA ratios w/ denominator-stability gate, C3 gate trigger sentinel (bystander-primary, `upper_ci_f_bystander < 0.20` per R3 B5). Outputs: `summary.json`, `fresh_denominator_{valid,failed,per_source}.json`, `fresh_vs_carryover_calibration.json`, `c3_gate_trigger.json`, `epm_failure.json` (on macro-gate fail). (6) `--gpu-shard` / `--total-shards` for round-robin 4× H100 sharding on `--stage full`. (7) `--include-i344`, `--i344-variant`, `--include-c3-gate`, `--denom-epsilon` CLI flags. |
| `scripts/smoke_issue344.py` | NEW | 518 | CPU-only smoke harness: 11 tests covering chat-template rendering (partial + whole-turn), fail-closed on missing anchor, anchor-per-row gate, only-arm filter, cell enumeration for both variants, GPU-shard round-robin, HF path switching for i186 vs i344, paired-bootstrap epsilon handling. Skip-tokenizer mode via `EPM_SMOKE_SKIP_TOKENIZER=1` for laptops without the Qwen tokenizer cached. |
| `scripts/mediation_issue344.py` | NEW | 567 | Phase 3 TOST mediation. Pairs `cot_texts` from LoA + no_cot_FRESH cells on `(source, question_id, seed=42)`; randomized arm order per pair; Claude Sonnet-4.5 judge (model + score JSON, max 300 tokens, temp 0); bootstrap-based TOST (±0.10 band, α=0.025); aggregates per-source rate-diff and applies the R3 B1 corrected decision rule (≥+0.20 → SUPPORTS (a); ≤-0.20 → SUPPORTS (c) or mixed; \|Δ\| < 0.10 → INCONCLUSIVE; otherwise AMBIGUOUS). Hard budget cap via `--max-budget-usd` (default $60). |
| `configs/condition/issue344/*.yaml` | NEW (×16) | 16 × 14 | Bookkeeping YAMLs (4 sources × 4

epm:code-review · 2026-05-11T20:41:57.000Z · system

<!-- epm:code-review v1 -->
## Code-Reviewer Verdict — CONCERNS

**Tier:** trunk (training script + eval-pipeline core + mediation analysis on `issue-344` branched from `origin/issue-280`; touches statistical-inference codepath).
**Diff size:** +3,055 / −4 across 22 files; 3 commits.
**Plan adherence:** PARTIAL (one missing computation in the §6 Holm family; minor under-disclosed deviation around tokenizer-boundary leakage that the implementer flagged).
**Tests:** smoke harness passes (11/11); lint + format PASS on all 5 modified scripts; no GPU tests possible at review time but the dry-run masking gate inside `train_one_cell` is the load-bearing GPU correctness check.
**Lint:** PASS (`ruff check` + `ruff format --check`).
**Security sweep:** CLEAN (no hardcoded secrets; `HF_TOKEN` / `ANTHROPIC_API_KEY` read from env; `_call_judge` retries do not log secrets).
**Needs user eyeball:** see "Critical observations" below — Major #2 (WandB tag persistence bug) is the easiest to fix and most concretely wrong; #1 (missing macro r5) is the highest-impact for the §6 inference family.

## Plan Adherence

| Plan item | Status | Notes |
|---|---|---|
| Phase 0a `--only-arm` flag on `generate_issue186_data.py` | ✓ | ~17 LOC, no logic change to existing arms; argparse choices match the plan's 4-arm whitelist |
| Phase 1 `run_issue_344_train.py` partial-turn template | ✓ (with caveat) | Partial mask renders 4 unmasked tokens including a `>` boundary token from `</persona-thinking>`; implementer flagged in "(d) Needs human eyeball" #1; §14 explicitly allows this deviation ("Adjust `{% generation %}` anchor byte-offsets to match Qwen tokenizer quirks") |
| Phase 1 whole-turn template for `*_FRESH` cells | ✓ | Identical chat-template family — eliminates template-string confound between LoA and FRESH |
| `assistant_only_loss=True`, `use_liger_kernel=False`, `packing=True` | ✓ | All three set in `SFTConfig` |
| LoRA r=32 / α=64 / dropout=0.0 / use_rslora=True | ✓ | Matches §11 exactly |
| Dry-run masking gate (5 examples, `pct_masked >= 80`, rationale all -100, answer not -100) | ✓ | Plus extras: rationale-pre-anchor 80% mask threshold, mask-audit JSON saved + uploaded to WandB Artifacts |
| Fail-closed on missing `\nAnswer:` anchor | ✓ | Jinja2 UndefinedError via `_ANSWER_ANCHOR_MISSING_must_be_present_in_assistant_turn.error` (Plan §16 #7 idiom) |
| Cell enumeration (Variant B: 40 cells; A: 28; C3: 3) | ✓ | Smoke verifies counts; `no_cot_FRESH` correctly single-seed (42) |
| `--phase {main, c3_gate}`, `--variant`, `--only-source/arm/seed`, `--gpu-shard/total-shards`, `--dry-run-only` | ✓ | All flags implemented |
| `scripts/run_issue186_eval.py` `_hf_path_in_repo` i186 ↔ i344 switching | ✓ | `ISSUE344_ALL_ARMS` whitelist; smoke verifies 7 cases including C3 gate sentinel |
| `_paired_bootstrap_ratio(num, denom, denom_epsilon=1e-4, degenerate_draw_policy='discard')` helper | ✓ | Returns full Plan §16 #6 payload; shared `idx` for num/denom (pairing correct in normal case) |
| `--mode fraction_of_effect` flag with `legacy_delta_h1` default | ✓ | Backward compat preserved |
| FRESH denominator macro gate (≥+0.10 source, ≥+0.05 bystander) | ✓ | `fresh_denominator_valid.json` / `fresh_denominator_failed.json` + `epm_failure.json` |
| FRESH per-source floor (bystander_macro lower-CI > 0) | ✓ | `fresh_denominator_per_source.json` + EXCLUDED from `per_source_pass_count` correctly |
| FRESH-vs-carryover calibration JSON | ✓ | Per-source Δ on bystander macro |
| C3 gate trigger: `upper_ci_f_bystander < 0.20` alone, bystander-primary | ✓ | Trigger payload includes `trigger_axis: "bystander_only"` |
| Aggregation order: bystander macro per-(q,s) FIRST (S2) | ✓ | `_per_qs_loss_matrix` does mean over 10 bystander personas FIRST; then stack(axis=1).reshape(-1) gives per-(q,s) array |
| n_resamples=10,000 (S1) | ✓ | Aggregator reads `getattr(args, "n_bootstrap", 10_000)`, but the CLI default is still `--n-bootstrap 1000`, so runs that omit the flag get 1,000 resamples rather than 10,000

epm:code-review-codex · 2026-05-11T20:43:01.000Z · system

<!-- epm:code-review-codex v1 -->
# Codex Code Review: Issue #344 — `persona_cot_labels_on_answer` partial-mask training

**Verdict:** FAIL
**Tier:** trunk (scripts/ with multiple callers, statistical engine, new experimental manipulation)
**Diff size:** +3055 / -4 lines across 22 files
**Plan adherence:** PARTIAL (1 critical misapplication; 2 statistical spec gaps; 1 CLI default bug)
**Lint:** PASS (reported by implementer; not re-run)
**Security sweep:** CLEAN
**Needs user eyeball:** Yes — pct_masked gate for FRESH arms will abort training before it starts; n_bootstrap CLI default ships 10x under-spec; r5 macro not computed; implementation marker shape not verified (GitHub GraphQL rate-limited at review time)

---

## Plan Adherence

- Phase 0 `--only-arm` flag on `generate_issue186_data.py`: ✓ implemented with correct choices and `--force` handling
- Phase 0d `assert_one_anchor_per_row` in `smoke_issue344.py`: ✓ implemented, fail-closed
- Chat template partial-turn `{% generation %}` around `\nAnswer:` only: ✓ implemented with fail-closed UndefinedError idiom
- Whole-turn `{% generation %}` for FRESH cells: ✓ implemented
- `assistant_only_loss=True`, `use_liger_kernel=False`: ✓ confirmed in `SFTConfig`
- Dry-run masking gate with 5-example audit + upload to WandB Artifacts: ± **misapplied** — `pct_masked >= 80` fires unconditionally on ALL arms including `persona_cot_FRESH` (whole-turn); will abort FRESH training (Critical — see Issues Found)
- `_paired_bootstrap_ratio` with `denom_epsilon=1e-4`, `degenerate_draw_policy='discard'`: ✓ correct paired-index construction, discard logic, NaN fallback
- `--mode fraction_of_effect` in `run_issue186_eval.py`: ✓ gated correctly
- FRESH denominator validity gate (macro + per-source floor): ✓ correct thresholds (0.10 / 0.05), `epm_failure.json` sentinel on fail
- FRESH-vs-carryover calibration table: ✓ written to `fresh_vs_carryover_calibration.json`
- C3 gate trigger: bystander-only (`upper_ci_f_bystander < 0.20`): ✓ correct axis, correct threshold
- Per-source ≥3/4 A3 count (FRESH-degenerate sources excluded): ✓ implemented
- r5 LOSS-DELTA ratio with denominator stability gate: ± **per-source only** — plan §6 specifies macro r5 for the 4 Holm family entries; implementation emits only per-source r5 with no macro aggregation (Major — see Issues Found)
- Holm family N=9 application to 95% CIs: ± plan says "Holm-corrected α=0.01 family-wise" but code emits 95% CIs throughout (2.5/97.5 percentiles), not Holm-adjusted widths; decision rules are implemented as CI-bound checks not family-adjusted p-values (Minor — see Issues Found)
- `--n-bootstrap` CLI default: ✗ **default=1000** in the parser; plan requires n_resamples=10,000 for the fraction_of_effect aggregator (Major — see Issues Found)
- Mediation `_tost_equivalence` bootstrap pivot: ✓ basic pivot is mathematically correct; `p_lower`/`p_upper` formulas verified
- `q_id` availability in `result.json` rows: ✓ `capability.py` stores `"q_id": q_idx` (integer index 0–1171) in every raw row; mediation script reads it correctly
- Adapter naming + HF Hub paths for C3 gate cells: ✓ both scripts produce `i344_librarian_persona_cot_labels_on_answer_c3gate_seed42_post_em`; paths are consistent
- Condition YAMLs (12 for Variant B): ✓ present, bookkeeping-only as planned
- WandB run names, tags, metadata: ✓ correct
- LoRA config (r=32, α=64, all 7 projection modules, `use_rslora=True`): ✓ matches plan
- Merge + HF Hub upload with fail-loud on error: ✓ implemented
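
The paired-index construction and `discard` policy referenced above can be illustrated with a minimal standard-library sketch. It is not the repository's `_paired_bootstrap_ratio`; the function name, the percentile-CI choice, and the returned keys are assumptions that only mirror the field names listed in this thread.

```python
import random


def paired_bootstrap_ratio(num, denom, n_resamples=10_000, denom_epsilon=1e-4, seed=0):
    """Percentile bootstrap for mean(num) / mean(denom) over paired observations.

    Hypothetical sketch: mirrors the described behaviour, not the real helper.
    """
    if len(num) != len(denom) or not num:
        raise ValueError("num and denom must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(num)
    draws, n_discarded = [], 0
    for _ in range(n_resamples):
        # One index vector per draw, applied to BOTH arrays, preserves the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        d = sum(denom[i] for i in idx) / n
        if abs(d) < denom_epsilon:
            n_discarded += 1  # 'discard' policy: drop the draw instead of dividing
            continue
        draws.append((sum(num[i] for i in idx) / n) / d)
    draws.sort()
    nan = float("nan")
    denom_mean = sum(denom) / n
    point = nan if abs(denom_mean) < denom_epsilon else (sum(num) / n) / denom_mean
    if draws:
        ci_low = draws[int(0.025 * (len(draws) - 1))]
        ci_high = draws[int(0.975 * (len(draws) - 1))]
    else:
        ci_low = ci_high = nan  # NaN fallback when every draw was degenerate
    return {
        "point": point,
        "ci_low": ci_low,
        "ci_high": ci_high,
        "draws": draws,
        "n_discarded": n_discarded,
        "frac_discarded": n_discarded / n_resamples,
        "n_kept": len(draws),
        "n_resamples": n_resamples,
    }
```

Sharing `idx` between numerator and denominator is what makes the ratio a paired statistic; resampling the two arrays independently would inflate the CI whenever the two quantities are correlated across (q, s) pairs.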

---

## Issues Found

### Critical (block merge)

- **`run_issue_344_train.py:391` — `pct_masked >= 80` fires unconditionally for all arms, including `persona_cot_FRESH`**
  - Evidence:
    ```python
    if pct_masked < 80.0:
        raise RuntimeError(
            f"[mask-audit] sample {i}: only {pct_masked:.1f}% masked ..."
        )
    ```
    This block is NOT gated on `arm in PARTIAL_GENERATION_ARMS`. It runs for every arm before training begins.
  - Impact: For `persona_cot_FRESH`

epm:code-review-reconcile · 2026-05-11T20:46:50.000Z · system

<!-- epm:code-review-reconcile v1 -->
## Reconciler Verdict — FAIL

**Role under adjudication:** code-reviewer
**Round:** 1
**Verdict:** FAIL
**Claude verdict:** CONCERNS (PASS-class)
**Codex verdict:** FAIL

### Findings adjudicated

| Source | Finding (terse) | Verified? | Classification | Weight |
|---|---|---|---|---|
| Codex | `pct_masked >= 80` gate at `run_issue_344_train.py:391` fires unconditionally — whole-turn FRESH cells will abort | ✓ | Real-blocking | Blocking |
| Claude C1 | `summary.json` missing macro `r5_source_high/_low`, `r5_bystander_high/_low` (Holm §6) | ✓ | Real-blocking | Blocking |
| Claude M1 | `os.environ.setdefault("WANDB_TAGS", ...)` at line 612 — first-cell tags inherited shard-wide | ✓ | Real-but-non-blocking (data-hygiene degradation, not silent corruption) | Non-blocking |
| Claude M2 | FRESH/LoA paired bootstrap silently misaligns under partial-seed loss (length-only `[:n_pair]` truncation) | ✓ (per finding text + cited lines `run_issue186_eval.py:1179-1193`) | Real-blocking | Blocking |
| Claude M3 | `--n-bootstrap` CLI default 1000 vs §11 spec 10000; `getattr(..., 10_000)` fallback unreachable | ✓ | Real-blocking | Blocking |

### Rationale

The disagreement is about severity, and the artifact resolves it cleanly in Codex's favor.

**The pct_masked threshold IS unconditional, and FRESH cells use the whole-turn template.** I read `run_issue_344_train.py:391` directly: the `if pct_masked < 80.0: raise RuntimeError(...)` check at line 391 fires for *every* cell BEFORE the `if arm in PARTIAL_GENERATION_ARMS:` discriminator at line 398. `PARTIAL_GENERATION_ARMS = frozenset({ARM_LABELS_ON_ANSWER, ARM_GENERIC_COT_LOA})` at line 131, and `persona_cot_FRESH` and `no_cot_FRESH` are NOT in this set. `chat_template_for_arm(arm)` at line 226 returns `_QWEN_WHOLE_TURN_GENERATION_TEMPLATE` (lines 201-223) for FRESH cells. The whole-turn template wraps the entire assistant content in `{% generation %}...{% endgeneration %}` (lines 214-216), so ALL assistant tokens (rationale + Answer line) are loss-bearing and only system+user+chat-scaffold tokens get `-100`.

With Plan §11's `max_length=2048` + `packing=True` and an assistant turn of (long persona-rationale + `\nAnswer: <letter>`), pct_masked lands well below 80%, somewhere in the 35-70% band per a back-of-the-envelope estimate on a typical ARC-C row. **The training script will raise on every FRESH cell.** Without FRESH there is no H1 denominator (`f = labels_on_answer / persona_cot_FRESH`), so none of the Comparison 1 / 1' / 5 numbers can be computed. The experiment cannot run end-to-end.

This is corroborated by the reference recipe at `scripts/run_issue_203_train.py:255-261`, which uses `pct_masked < 10` (a "did masking happen at all?" sanity gate). The 80% threshold in #344 is a TIGHTER check that is only valid for partial-generation cells, where the rationale is also masked to `-100` and therefore counts toward `pct_masked`. The implementer transcribed Plan §11's "pct_masked >= 80" row literally, but the plan's row failed to disambiguate the per-arm semantics, and the smoke test did not catch it either (`smoke_issue344.py:187-192` only checks `n_unmasked > 10` on a tiny librarian smoke row and never exercises the `pct_masked < 80` gate against a packed FRESH batch).
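
A per-arm gate consistent with this reading could look like the sketch below. It is an illustration only: `check_mask_gate` is a hypothetical name, the arm strings are assumed values for `ARM_LABELS_ON_ANSWER` / `ARM_GENERIC_COT_LOA`, and the 80 / 10 floors are taken from the two recipes cited above.

```python
IGNORE_INDEX = -100  # label value excluded from the loss

# Assumed string values for the two partial-generation arm constants.
PARTIAL_GENERATION_ARMS = frozenset(
    {"persona_cot_labels_on_answer", "generic_cot_labels_on_answer"}
)


def pct_masked(labels):
    """Percentage of label positions that carry no loss."""
    if not labels:
        raise ValueError("empty label sequence")
    return 100.0 * sum(1 for t in labels if t == IGNORE_INDEX) / len(labels)


def check_mask_gate(arm, labels):
    """Raise if masking looks wrong for this arm; return pct_masked otherwise."""
    pct = pct_masked(labels)
    if arm in PARTIAL_GENERATION_ARMS:
        # Only the answer slice is loss-bearing, so nearly everything is masked.
        floor, kind = 80.0, "partial-generation arm"
    else:
        # Whole-turn arms only need evidence that masking happened at all.
        floor, kind = 10.0, "whole-turn arm"
    if pct < floor:
        raise RuntimeError(
            f"[mask-audit] {kind} {arm}: only {pct:.1f}% masked (need >= {floor:.0f}%)"
        )
    return pct
```

Under this split a whole-turn cell at roughly 55% masked passes, while the same percentage on a partial-generation cell still aborts the run.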

Claude's CONCERNS-class verdict missed this entirely — the marker enumerates C1+M1+M2+M3 but never the pct_masked gate. That's the load-bearing miss. Claude's C1 (macro r5 missing for §6 Holm family), M2 (paired-bootstrap silent misalignment), and M3 (n_bootstrap default 1000 vs spec 10000) are real and blocking on their own merits; M1 (WANDB_TAGS setdefault) is real but non-blocking (degrades data hygiene downstream, not silent statistical corruption).

Codex was right to call FAIL; Claude was wrong to call CONCERNS-class — the pct_masked gate is a binary experiment-killer, not a "revise-then-merge" item, and three of Claude's own four items (C1/M2/M3) are also real-blocking.
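
For reference, the M1 behaviour is easy to reproduce in isolation. The sketch below uses a plain dict in place of `os.environ` (both expose the same `setdefault` semantics); the function names and tag strings are hypothetical.

```python
def tag_cells_setdefault(env, cells):
    """Tag each cell with setdefault: only the first cell's tags ever stick."""
    seen = []
    for cell_tags in cells:
        env.setdefault("WANDB_TAGS", cell_tags)  # no-op once the key exists
        seen.append(env["WANDB_TAGS"])
    return seen


def tag_cells_assign(env, cells):
    """Tag each cell with plain assignment: every cell gets its own tags."""
    seen = []
    for cell_tags in cells:
        env["WANDB_TAGS"] = cell_tags
        seen.append(env["WANDB_TAGS"])
    return seen
```

When one shard process trains several cells in sequence, the first variant labels every run after the first with stale tags, which matches the "first-cell tags inherited shard-wide" finding.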

### Observed but not raised

- The plan §11 "Dry-run masking gate" row itself does not disambiguate per-ar

epm:experiment-implementation · 2026-05-11T21:01:41.000Z · system

<!-- epm:experiment-implementation v2 -->
## Implementation Report — round 2

**Status:** READY-FOR-REVIEW

Round-1 ensemble (reconciler: FAIL) flagged 4 blockers and 1 non-blocking nit. All 5 fixed in v2. Branch `issue-344` head: `89748525`.

### Per-blocker disposition

| ID | Verdict | Fix site | LOC | Commit |
|---|---|---|---|---|
| B1 — `pct_masked >= 80` aborts FRESH | FIXED | `scripts/run_issue_344_train.py:391-426` (gate split per-arm: ≥80 for partial, ≥10 for whole-turn) | +36 / -7 | `c94db0d3` |
| B2 — macro r5 missing from summary.json | FIXED | `scripts/run_issue186_eval.py:1340-1490` (macro r5 paired-bootstrap pool + 4 Holm-family entries + `holm_family` list) | +245 inserted in `_stage_aggregate_fraction_of_effect` | `8ded65c8` |
| B3 — paired-bootstrap silent seed-misalignment | FIXED | `scripts/run_issue186_eval.py:967-1027` (FRESH stores `qs_keys` + `seeds_present`); `:1175-1300` (per-source f-ratio + r5 loops switched to (q,s)-keyed dicts + intersection) | +130 / -50 | `8ded65c8` |
| B4 — `--n-bootstrap` CLI default 1000 | FIXED | `scripts/run_issue186_eval.py:1638-1651` (default → `10_000` + substantive help string) | +14 / -1 | `a3465454` |
| B5 — `WANDB_TAGS` `setdefault` shard inheritance | FIXED | `scripts/run_issue_344_train.py:612-628` (`setdefault` → assignment) | +15 / -11 | `dc50fd01` |
| follow-up: FRESH-missing KeyError guard | FIXED | `scripts/run_issue186_eval.py:1197-1213` (defensive `present` check) | +12 / -2 | `89748525` |

### (a) What was done

- `scripts/run_issue_344_train.py`:
  - Split `_run_mask_audit` gate per `arm in PARTIAL_GENERATION_ARMS`. Partial arms now enforce `pct_masked >= 80` (only `\nAnswer:` slice loss-bearing); whole-turn (FRESH) arms enforce `pct_masked >= 10` (just "did masking happen at all" — same shape as `run_issue_203_train.py:255-261`).
  - Updated docstring to describe the new per-arm semantics.
  - Swapped `os.environ.setdefault("WANDB_TAGS", ...)` → `os.environ["WANDB_TAGS"] = ...`. `WANDB_PROJECT` stays `setdefault`.
- `scripts/run_issue186_eval.py`:
  - FRESH per-source construction now tracks `seeds_present` and emits a flat `qs_keys: list[(question_idx, seed)]`. `qs_keys` is stripped from the on-disk payload and kept in memory only.
  - Per-source f-ratio loop rebuilt: builds `loa_src_dict[(q, s)] -> value` and `fresh_src_dict[(q, s)] -> value`, intersects, sorts deterministically, emits `n_pairs_aligned`/`n_pairs_dropped`/`loa_seeds_present`/`fresh_seeds_present` in the per-source payload. Logs WARNING when `drop_frac > 0.05`, INFO otherwise.
  - r5 loop rebuilt with the same (q, s)-keyed dict approach. After the per-source pass, accumulates an `r5_macro_pool[axis] = {empty, matched}` cross-source pool.
  - Macro r5 paired-bootstrap added: one bootstrap per axis, two directional p-values (`p_high_vs_0_50` = `P(draws ≤ 0.50)`; `p_low_vs_0_20` = `P(draws ≥ 0.20)`). Denominator-stability gate (`|matched_macro| < 0.02` OR `frac_discarded > 0.05`) marks the macro `non_interpretable`.
  - New `holm_family` list in summary.json. For Variant B at minimum: 2 f-ratio macros × 2 arms + 2 axes × 2 directions for macro r5 = at least 8 entries; analyzer applies Holm-Bonferroni mechanically. Per-entry fields: `name, kind, axis, direction, threshold, p_value, ci_low, ci_high, point, non_interpretable`.
  - CLI default `--n-bootstrap` changed `1000 → 10_000` with substantive help string. Legacy `--mode legacy_delta_h1` path inherits the new default — slightly tighter CIs on #186/#280 carry-over rows; no decision flips.
- `scripts/smoke_issue344.py`:
  - 3 new tests (`test_mask_audit_per_arm_gate`, `test_paired_bootstrap_seed_alignment`, `test_r5_macro_directional_p_values`). Wired into `main()`.

- Diff: +699 / -83 across 3 files (v2 commits only). 6 commits this round.
- Plan adherence vs round-1 punch-list: all 4 blockers + non-blocking nit closed.
- Branch: `issue-344` pushed (no PR yet — that's the orchestrator's call after review PASS).
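
The (q, s)-keyed alignment described for the f-ratio and r5 loops can be sketched as follows. This is a simplified illustration, not the code in `run_issue186_eval.py`: `align_on_shared_keys` is a hypothetical name, and defining the drop fraction against the union of keys is an assumption.

```python
import logging

logger = logging.getLogger("alignment_sketch")


def align_on_shared_keys(loa, fresh, warn_above=0.05):
    """Align two {(question_idx, seed): value} dicts on their shared keys."""
    shared = sorted(set(loa) & set(fresh))  # deterministic order
    n_union = len(set(loa) | set(fresh))
    n_dropped = n_union - len(shared)
    drop_frac = n_dropped / n_union if n_union else 0.0
    level = logging.WARNING if drop_frac > warn_above else logging.INFO
    logger.log(level, "aligned %d pairs, dropped %d (%.1f%%)",
               len(shared), n_dropped, 100 * drop_frac)
    return {
        "keys": shared,
        # Both arrays are built from the SAME key list, so position i in one
        # always refers to the same (question, seed) as position i in the other.
        "loa": [loa[k] for k in shared],
        "fresh": [fresh[k] for k in shared],
        "n_pairs_aligned": len(shared),
        "n_pairs_dropped": n_dropped,
        "drop_frac": drop_frac,
    }
```

Length-only truncation (`[:n_pair]`) gives arrays of equal length but no guarantee that row i refers to the same (question, seed) in both, which is the silent misalignment B3 describes.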

### (b) Considered but not done

- **Codex M

epm:code-review · 2026-05-11T21:50:08.000Z · system

<!-- epm:code-review v2 -->
## Code Review — round 2 (Claude)

**Tier:** trunk (eval / training entrypoints with downstream consumers).
**Diff size:** v2 deltas +699 / -83 across 3 files (6 commits on top of round-1).
**Plan adherence (round-2 blockers):** COMPLETE.
**Plan adherence (overall):** PARTIAL — see Issues / Major below.
**Tests:** PASS — `EPM_SMOKE_SKIP_TOKENIZER=1 uv run python scripts/smoke_issue344.py` → **10/10 PASS**; 3 new regression tests landed (one per round-1 blocker that touched the eval/training code).
**Lint:** PASS — `uv run ruff check` + `uv run ruff format --check` clean on all 5 files.
**Security sweep:** CLEAN (no shell-injection, no hardcoded tokens, no unsafe deserialization).
**Needs user eyeball:** YES — round-1 unaddressed plan-§11 family gap (H2 + persona-over-generic). Pre-existing in v1; the round-1 ensemble did not surface it. Listed below under Major / Plan deviations.

---

### Per-blocker verification (round-2 punch-list)

| ID | Status | Where | Verified by |
|---|---|---|---|
| **B1** — `pct_masked ≥ 80` aborts FRESH | **FIXED** | `run_issue_344_train.py:393-431` splits the gate by `arm in PARTIAL_GENERATION_ARMS` (≥80 partial / ≥10 whole-turn). Rationale-region 80%-masked check at line 502 stays inside `if arm in PARTIAL_GENERATION_ARMS:` block (line 433) — does NOT leak to whole-turn arms. | Smoke `test_mask_audit_per_arm_gate` (line 631) exercises 4 cases on the live `_run_mask_audit`: FRESH@55% passes (would have raised in v1), FRESH@5% raises with `"whole-turn arm"` message, partial@55% raises with `"partial-generation arm"` message, partial@90% passes the gate (downstream mock-limit hits). |
| **B2** — macro r5 missing | **FIXED** | `run_issue186_eval.py:1454-1493` builds per-axis `r5_macro_pool["source"\|"bystander"]` by extending across sources; emits one paired-bootstrap per axis with **2 directional p-values** (`p_high_vs_0_50`, `p_low_vs_0_20`) + `non_interpretable` from the `\|matched_macro\| < 0.02 or frac_discarded > 0.05` gate. Within-cell pairing is automatic (matched + empty come from the same `cell_correctness[cid]` ndarray indexed by `matched_scaffold_idx` / `empty_scaffold_idx`). Holm-family entries appended at 1553-1601 with 4 macro-r5 directional entries. | Smoke `test_r5_macro_directional_p_values` (line 332) exercises 3 ratio regimes (~0.70 → reject high, ~0.10 → reject low, ~0.35 → reject neither) and the stability gate. p-value direction matches plan §11 (R3 B1 "rejects ⇒ (c)" for `_high`, "rejects ⇒ (a)" for `_low`). |
| **B3** — paired-bootstrap silent seed-misalignment | **FIXED** | FRESH construction stores `qs_keys: list[(question_idx, seed)]` + `seeds_present` at `run_issue186_eval.py:999-1026`. f-ratio loop (1178-1287) builds `loa_src_dict[(q, s)]` / `loa_bys_dict[(q, s)]`, intersects with FRESH dict via `sorted(set(loa_src_dict.keys()) & set(fresh_src_dict.keys()))`, then arrays are built by indexing the dict using that shared key list (line 1255-1258). Reports `n_pairs_aligned` + `n_pairs_dropped` + `loa_seeds_present` + `fresh_seeds_present` per source. `WARNING` at `drop_frac > 0.05`, `INFO` otherwise. r5 loop (1367-1411) builds matched/empty dicts off the same `(q, seed)` since they come from a single `cell_correctness[cid]` ndarray. Empty-shared-keys edge case handled (1246-1253 marks `missing` + `continue`). | Smoke `test_paired_bootstrap_seed_alignment` (line 414) constructs 3-seed LoA vs 2-seed FRESH dicts, verifies 20-shared keys + 10-dropped, computes the anti-test count for v1 length-only truncation (19/20 misaligned). |
| **B4** — `--n-bootstrap` default 1000 | **FIXED** | `run_issue186_eval.py:1648-1661` default `1000 → 10_000` with substantive help string. Propagation: 7 `_paired_bootstrap_ratio` call sites all read `n_bootstrap = getattr(args, "n_bootstrap", 10_000)` (line 901). Legacy mode (`--mode legacy_delta_h1`) inherits the new default — see note under Minor. | Reading the source. |
| **B5** — `WANDB_TAGS` `setdefault` shard inherita

epm:code-review-codex · 2026-05-11T21:51:23.000Z · system

<!-- epm:code-review-codex v2 -->
# Codex Code Review: issue #344 v2 — persona_cot_labels_on_answer implementation

**Verdict:** CONCERNS
**Tier:** trunk
**Diff size:** +3672 / -5 lines across 22 files
**Plan adherence:** PARTIAL (1 item incomplete — H2 diff-of-diffs missing from Holm family)
**Lint:** NOT-CHECKED (Codex did not run lint)
**Security sweep:** CLEAN
**Needs user eyeball:** Yes — H2 diff-of-diffs gap in holm_family output; analyzer will have 6 entries instead of the plan's N=7 (Variant A) / N=9 (Variant B)

## Plan Adherence
- [B1 pct_masked per-arm gate]: ✓ implemented — `PARTIAL_GENERATION_ARMS = frozenset({ARM_LABELS_ON_ANSWER, ARM_GENERIC_COT_LOA})`; FRESH arms use `>= 10%` floor, partial arms use `>= 80%`. Smoke test `test_mask_audit_per_arm_gate` covers both branches including regression that FRESH @ 55% passes (was blocked in v1).
- [r5 macro quantities — r5_source_high/_low, r5_bystander_high/_low]: ✓ implemented — `r5_macro_results` computed at `run_issue186_eval.py:1446-1493`; emitted into `holm_family` at lines 1552-1601. One-sided p-values: `p_high = P(draws <= 0.50)`, `p_low = P(draws >= 0.20)`. Denominator stability gate (`|matched_macro| < 0.02 OR frac_discarded > 0.05`) correctly sets `non_interpretable`.
- [n_bootstrap default 10_000]: ✓ implemented — `--n-bootstrap default=10_000` at `run_issue186_eval.py:1651`.
- [WANDB_TAGS setdefault → assignment (B5)]: ✓ implemented — `os.environ["WANDB_TAGS"] = ...` at `run_issue_344_train.py:654`; `WANDB_PROJECT` retains `setdefault` with explanatory comment.
- [(q,s)-keyed paired bootstrap alignment (Claude M2 / B3)]: ✓ implemented — `loa_src_dict` and `fresh_src_dict` are intersected via `shared_keys = sorted(set(loa_src_dict.keys()) & set(fresh_src_dict.keys()))`. Drop fraction logged at WARN if > 5%. Smoke test `test_paired_bootstrap_seed_alignment` verifies v1 misaligned 19/20 pairs; v2 aligns all 20.
- [H2 diff-of-diffs (C4) in Holm family]: ✗ missing — plan §6 entry #3 `H2 diff-of-diffs` is absent from `holm_family` in `summary.json`. The aggregator emits only 6 entries (f_source_macro, f_bystander_macro, r5_source_high, r5_source_low, r5_bystander_high, r5_bystander_low) instead of the pre-registered N=7 (Variant A) / N=9 (Variant B).
- [Defensive guard on missing FRESH (follow-up commit)]: ✓ implemented — `if not f.get("present"): per_source[source] = {...missing...}; continue` prevents KeyError on `fresh_per_source[source]["qs_keys"]` during f-ratio per-source loop.
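
The two directional p-values and the denominator-stability gate quoted above reduce to a few lines. The sketch below assumes `draws` is the list of kept bootstrap ratio draws; both function names are hypothetical, while the thresholds (0.50, 0.20, 0.02, 0.05) are the ones stated in this review.

```python
def directional_p_values(draws, high=0.50, low=0.20):
    """One-sided bootstrap p-values for the macro r5 ratio."""
    if not draws:
        raise ValueError("no bootstrap draws to summarise")
    n = len(draws)
    return {
        # Small value => the ratio sits credibly above 0.50.
        "p_high_vs_0_50": sum(1 for d in draws if d <= high) / n,
        # Small value => the ratio sits credibly below 0.20.
        "p_low_vs_0_20": sum(1 for d in draws if d >= low) / n,
    }


def is_non_interpretable(matched_macro, frac_discarded,
                         denom_floor=0.02, max_discard=0.05):
    """Stability gate: a tiny denominator or too many discarded draws."""
    return abs(matched_macro) < denom_floor or frac_discarded > max_discard
```

A ratio concentrated near 0.35 rejects neither direction, which corresponds to the third regime exercised by the smoke test mentioned in the Claude round-2 review.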

## Issues Found

### Critical (block merge)
None.

### Major (revise before merge)

- `scripts/run_issue186_eval.py`: H2 diff-of-diffs (plan §6 entry #3) is absent from the `holm_family` output in `_stage_aggregate_fraction_of_effect`.
  - Evidence: `holm_family` is populated at lines 1527-1601. It adds f-ratio macros (entries 1-2) and r5 macro directional entries (entries 4-7). No code path computes `(labels_on_answer_matched_bystander − labels_on_answer_no_cot_bystander) / (FRESH_matched_bystander − FRESH_no_cot_bystander)` — the H2_op ratio from plan §3 / §6.
  - Impact: The analyzer's Holm-Bonferroni table will have 6 entries instead of 7 (Variant A) or 9 (Variant B). The H2 secondary hypothesis has no quantitative CI-bound anchor in `summary.json`; the clean-result will either omit H2 entirely or rely on ad-hoc post-hoc computation.
  - Fix: Add a `_compute_h2_diff_of_diffs` block in `_stage_aggregate_fraction_of_effect` that (a) computes per-source and macro bystander loss under matched vs no-cot eval for both LoA and FRESH cells, (b) forms the ratio of matched-minus-no_cot gaps, (c) runs `_paired_bootstrap_ratio` on (q,s) pairs, and (d) appends to `holm_family` as `{"name": "H2_diff_of_diffs", "kind": "h2_ratio", "threshold": 0.5, "p_value": ..., "ci_low": ..., "ci_high": ...}`. The no-cot eval scaffold index (`no_cot_eval_idx`) is already computed at line 729 for the legacy path; bring it in-scope for the fraction_of_effect path.
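
To make the impact of a short family concrete: the Holm-Bonferroni step-down compares the k-th smallest p-value against `alpha / (m - k)`, so the family size `m` changes every threshold. A minimal sketch follows, with `holm_bonferroni` as a hypothetical analyzer-side helper and `alpha=0.01` assumed from the family-wise level quoted in the round-1 review; how `missing` or `non_interpretable` entries should be handled is not specified here and is left to the caller.

```python
def holm_bonferroni(p_values, alpha=0.01):
    """Holm step-down over {name: p_value}; returns {name: rejected?}."""
    order = sorted(p_values, key=p_values.get)  # ascending p-values
    m = len(order)
    rejected = {name: False for name in p_values}
    for rank, name in enumerate(order):
        # The k-th smallest p-value (k = rank, zero-based) is tested
        # against alpha / (m - k).
        if p_values[name] <= alpha / (m - rank):
            rejected[name] = True
        else:
            break  # step-down: stop at the first failure
    return rejected
```

With 6 entries the smallest p-value is tested against `alpha / 6`; with the pre-registered 9 it is tested against `alpha / 9`, so an analyzer run on the short family would be anti-conservative relative to the plan.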

### Minor (worth fixing but doesn't block)

-

epm:experiment-implementation · 2026-05-11T22:00:46.000Z · system

<!-- epm:experiment-implementation v3 -->
## Implementation Report — round 3 (FINAL, single-task M-NEW-1 fix)

**Status:** READY-FOR-REVIEW

### Response to code-review v2 (Claude + Codex ensemble convergence)

- **M-NEW-1 (H2 diff-of-diffs missing from holm_family): FIXED.** Added a paired-bootstrap diff-of-diffs computation over `(LoA_matched_bys - LoA_nocot_bys) / (FRESH_matched_bys - FRESH_nocot_bys)` using shared `(q, s)` indices across all 4 cells. New `holm_family` entry `name="h2_diff_of_diffs"`, kind `h2_diff_of_diffs`, threshold=0.5, one-sided H1 ratio >= 0.5. Same denominator stability gate semantics as r5 (`non_interpretable` if `|denom_macro| < 0.02` or `frac_discarded > 0.05`). Fail-closed on any of 4 required cells missing per source. Reuses `_paired_bootstrap_ratio` already in scope.
- **f_persona_over_generic_source / _bystander (Claude-only finding): DISPUTED per brief reconciliation.** The brief instructs me to verify Claude's claim that the existing v2 `f_*__generic_cot_labels_on_answer` entries serve as `f_persona_over_generic_*`. They DO NOT — the v2 entry computes `generic_LoA / persona_cot_FRESH` (numerator = generic, denominator = FRESH), not `persona_LoA / generic_LoA` (the ratio plan §6 enumerates). However, per brief: "Codex's narrower scope — H2 only — is correct" and this is a single-task revision round. I have left the v2 generic-LoA entries as-is and surface this ambiguity below under "Needs human eyeball" so the orchestrator can decide whether to escalate to a separate revision.

### (a) What was done

- `scripts/run_issue186_eval.py` (+182 LOC):
  - Added `no_cot_eval_idx = next(i for i, s in enumerate(EVAL_SCAFFOLDS) if s.name == "no-cot")` alongside `matched_scaffold_idx` and `empty_scaffold_idx` (line 919).
  - New H2 diff-of-diffs block (lines 1496-1641) inserted between the r5 macro block and the C3 gate trigger. For each source, builds (q, s)-keyed dicts of bystander loss for LoA and FRESH cells under both matched and no-cot eval scaffolds; intersects keys across all 4 dicts; pools signed differences across sources; runs one paired bootstrap on `(num, denom)` with `_paired_bootstrap_ratio(n_resamples=n_bootstrap, denom_epsilon=denom_epsilon, degenerate_draw_policy="discard")`. Emits the `h2_diff_of_diffs` block with `point`, `ci_low`, `ci_high`, `p_one_sided_upper` (= `P(draws <= 0.5)`), `p_two_sided`, `denom_macro`, `n_pairs_pooled`, `n_discarded`, `frac_discarded`, `threshold: 0.5`, `hypothesis`, `non_interpretable`, `non_interpretable_reason`.
  - New `h2_diff_of_diffs` entry appended to `holm_family` (lines 1751-1781): missing-branch posts `missing: true`; live-branch posts `p_value`, CI, point, `denom_macro`, `non_interpretable`.
  - Added `"h2_diff_of_diffs": h2_diff_of_diffs` key to the `summary` dict.
- `scripts/smoke_issue344.py` (+77 LOC):
  - New `test_h2_diff_of_diffs_construction` exercises two synthetic cases: (a) well-defined denominator (FRESH gap ~ 0.15, LoA gap = 0.5 * FRESH gap) confirms point ~ 0.5, CI well-formed, `non_interpretable=False`; (b) near-zero denominator (FRESH gap ~ 0.005) confirms `non_interpretable=True` via the `|denom_macro| < 0.02` floor.
  - Wired into `main()` after `test_r5_macro_directional_p_values`.
- Diff: +259 / -0 across 2 files. `git diff --stat` against round-2 tip (`89748525`):
  ```
   scripts/run_issue186_eval.py | 182 +++++++++++++++++++++++++++++++++++++++++++
   scripts/smoke_issue344.py    |  77 ++++++++++++++++++
   2 files changed, 259 insertions(+)
  ```
- Plan adherence (single-task scope):
  - DONE: H2 diff-of-diffs entry in `holm_family`.
  - DONE: Denominator stability gate consistent with r5.
  - DONE: Paired bootstrap with shared (q, s) indices across all 4 cells.
  - DONE: Fail-closed on missing cells.
  - DONE: Smoke harness for H2.
  - SKIPPED: `f_persona_over_generic_*` (per brief: Codex H2-only scope is correct; flagged for human eyeball below).
- Commits: `11e90c14` `issue #344 v3 [3/3]: add H2 diff-of-diffs to holm_

epm:code-review · 2026-05-11T22:04:32.000Z · system

<!-- epm:code-review v3 -->
# Code Review: issue #344 v3 (round 3 / FINAL)

**Verdict:** FAIL
**Tier:** trunk (`scripts/run_issue186_eval.py` is the issue #186 / #280 / #344 aggregator — high-blast-radius)
**Diff size:** +259 / -0 across 2 files (vs round-2 tip `89748525`)
**Plan adherence:** PARTIAL — M-NEW-1 (H2 diff-of-diffs) landed correctly; the implementer-surfaced concern (Plan §6 entries 8-9 `f_persona_over_generic_*`) is REAL and unaddressed.
**Tests:** PASS (smoke 11/11, lint clean, ruff format clean).
**Lint:** PASS.
**Security sweep:** CLEAN.
**Needs user eyeball:** YES — this verdict recommends `status:blocked` per cap-3 rule.

## Plan Adherence

### What v3 fixed correctly (M-NEW-1)
- `h2_diff_of_diffs` entry added to `holm_family` (`scripts/run_issue186_eval.py:1751-1782`).
- H2 ratio = `(LoA_matched_bys − LoA_nocot_bys) / (FRESH_matched_bys − FRESH_nocot_bys)`, pooled per-(q, s) across 4 sources after intersecting (q, s) keys across all 4 cells (lines 1496-1641). Matches plan §3 H2_op + §6 entry 3.
- Denominator stability gate mirrors r5 semantics: `non_interpretable = (|denom_macro| < 0.02 OR frac_discarded > 0.05)`. Same shape as `_compute_macro_r5_directional` (consistent with plan §6 R3 B2).
- Fail-closed on any of 4 required cells missing per source (line 1563) — meets brief "Quality bars".
- Reuses `_paired_bootstrap_ratio` already in scope. `degenerate_draw_policy="discard"` consistent with the f-ratio + r5 macro calls.
- Holm-family entry name `h2_diff_of_diffs`, threshold 0.5, kind `h2_diff_of_diffs`, `p_one_sided_upper = P(draws ≤ 0.5)` — directional semantics correct (small p ⇒ reject "ratio < 0.5").
- New smoke `test_h2_diff_of_diffs_construction` covers both well-defined (point ≈ 0.5, non_interpretable=False) and near-zero denominator (non_interpretable=True via 0.02 floor).
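
As a compact illustration of the H2 construction verified above, the sketch below computes only the point estimate and the denominator floor for a single source (the bootstrap, cross-source pooling, and `frac_discarded` check are omitted). `h2_point_estimate` and its dict-based inputs are assumptions for illustration, not the code in `run_issue186_eval.py`.

```python
def h2_point_estimate(loa_matched, loa_nocot, fresh_matched, fresh_nocot,
                      denom_floor=0.02):
    """Diff-of-diffs ratio over {(question_idx, seed): value} dicts."""
    cells = (loa_matched, loa_nocot, fresh_matched, fresh_nocot)
    if any(not cell for cell in cells):
        raise ValueError("fail-closed: one of the 4 required cells is missing")
    # Keep only (q, s) keys present in all four cells.
    shared = sorted(set.intersection(*(set(cell) for cell in cells)))
    if not shared:
        raise ValueError("fail-closed: no shared (q, s) keys across the 4 cells")
    num = [loa_matched[k] - loa_nocot[k] for k in shared]
    den = [fresh_matched[k] - fresh_nocot[k] for k in shared]
    num_macro = sum(num) / len(num)
    denom_macro = sum(den) / len(den)
    point = float("nan") if denom_macro == 0 else num_macro / denom_macro
    return {
        "point": point,
        "denom_macro": denom_macro,
        "n_pairs_pooled": len(shared),
        "non_interpretable": abs(denom_macro) < denom_floor,
    }
```

The per-pair `num` and `den` lists are what a paired bootstrap would resample with shared indices; the floor check keeps a near-zero FRESH gap from producing an arbitrarily large ratio.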

## Critical Issues (block status:running)

### C-NEW-1 (FATAL): Plan §6 Holm-family entries 8-9 (`f_persona_over_generic_*`) are NOT implemented — confidence-binding constraint (i) for Variant B HIGH is unevaluable

**The implementer-surfaced concern is correct. I confirm it independently.**

**Evidence — plan side (`.claude/plans/issue-344.md`):**
- Line 356: `8. **(Variant B only)** f_persona_over_generic_source — one-sided ≥ 1.5`
- Line 357: `9. **(Variant B only)** f_persona_over_generic_bystander — one-sided ≥ 1.5`
- Line 370 (Comparison 6): `persona_cot_labels_on_answer vs generic_cot_labels_on_answer (per source + macro). Discriminates persona-specific input-conditioning from any-rationale-prefix-on-answer-only-SFT.`
- Line 432 (Variant B HIGH binding constraint (i)): `(i) f_persona_over_generic ≥ 1.5 (lower-95%-CI)`
- Line 608: `N=9 (Variant B): + {f_persona_over_generic_source one-sided ≥ 1.5; f_persona_over_generic_bystander one-sided ≥ 1.5}.`

The literal plan reading: `f_persona_over_generic = (persona_LoA-recovered-fraction-of-FRESH) / (generic_LoA-recovered-fraction-of-FRESH) = f_persona / f_generic`. Threshold 1.5 ⇒ the persona arm recovers at least 1.5× the fraction-of-effect that the generic arm recovers. This is the inferential anchor that distinguishes "persona-specific input-conditioning" from "any-rationale-prefix input-conditioning" — the *whole point* of running Variant B vs Variant A.
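
Written out, and assuming both recovered fractions use the same FRESH denominator on a given axis (an assumption of this note, not something the plan text quoted above states), the FRESH term cancels in the point estimate:

```latex
\[
f_{\text{persona\_over\_generic}}
  = \frac{f_{\text{persona}}}{f_{\text{generic}}}
  = \frac{\text{persona\_LoA} / \text{FRESH}}{\text{generic\_LoA} / \text{FRESH}}
  = \frac{\text{persona\_LoA}}{\text{generic\_LoA}}
\]
```

So a threshold of 1.5 reads as "the persona arm's effect is at least 1.5 times the generic arm's effect", independent of the FRESH baseline.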

**Evidence — code side (`scripts/run_issue186_eval.py`):**
- Line 1147 comment: `# ── f-ratios per arm (numerator = LoA, denominator = FRESH) ──────────`. Confirms semantics.
- Lines 1162-1342: the f-ratio loop iterates over `loa_arms_in_scope = ["persona_cot_labels_on_answer", "generic_cot_labels_on_answer", ...]` and for each arm computes `<arm>_LoA / persona_cot_FRESH` per-(q, s), paired-bootstrap.
- Lines 1679-1699: the Holm-family construction emits `f_<axis>__<loa_arm>` entries. For `loa_arm == "generic_cot_labels_on_answer"`, this produces `f_source__generic_cot_labels_on_answer` and `f_bystander__generic_cot_labels_on_answer` — whose computed quantity is `generic_LoA / persona_FRESH`, NOT `persona_LoA / generic_Lo

epm:code-review-codex2026-05-11T22:04:42.000Z· system

<!-- epm:code-review-codex v3 -->
# Codex Code Review: issue #344 v3 — H2 diff-of-diffs + Variant B Holm entries

**Verdict:** CONCERNS
**Tier:** trunk
**Diff size:** +259 LOC across 2 files (scripts/run_issue186_eval.py, scripts/smoke_issue344.py)
**Plan adherence:** PARTIAL — H2 is now present; Variant B Holm entries 8-9 compute the wrong quantity
**Lint:** PASS
**Security sweep:** CLEAN
**Needs user eyeball:** Yes — Holm entries 8-9 semantics; Variant B discriminating claim cannot be evaluated as-is

## Plan Adherence

- H2 diff-of-diffs (plan §6 entry 3): ✓ implemented; shared (q,s) keys across 4 cells; paired bootstrap; denominator stability gate (|denom_macro| < 0.02 OR frac_discarded > 0.05); holm_family entry emitted; smoke test added.
- f_source_macro / f_bystander_macro (plan §6 entries 1-2): ✓ intact from v2.
- r5 directional entries (plan §6 entries 4-7): ✓ intact from v2.
- Variant B entries 8-9 (`f_persona_over_generic_source/bystander`): ± present but compute the WRONG quantity.
- All v2 fixes (pct_masked arm-gating, n_bootstrap default 10_000, macro r5, (q,s)-key intersection): ✓ still intact.

## Issues Found

### Major (revise before merge)

- `scripts/run_issue186_eval.py:1679-1699` (holm_family construction): Variant B Holm entries 8-9 compute `generic_LoA / persona_cot_FRESH`, not `persona_LoA / generic_LoA`.

  Evidence: The f-ratio loop (line 1162) iterates over `loa_arms_in_scope` = `[..., "generic_cot_labels_on_answer"]` with the FRESH denominator hard-coded (`fresh_src_per_q` / `fresh_bys_per_q` always come from `persona_cot_FRESH` via `fresh_per_source`). The holm_family construction at line 1679 just iterates `f_results.items()`, which emits `f_source__generic_cot_labels_on_answer` = `generic_LoA / persona_cot_FRESH` and `f_bystander__generic_cot_labels_on_answer`, the same ratio on the bystander axis.

  Plan §6 entries 8-9 are `f_persona_over_generic_source` and `f_persona_over_generic_bystander`, described as "one-sided ≥ 1.5", which discriminate "persona-specific input-conditioning from any-rationale-prefix-on-answer-only-SFT." That requires `persona_LoA / generic_LoA` — the ratio of the two LoA arms' effects relative to each other, not each independently versus FRESH. A threshold of 1.5 is meaningful for `persona_LoA / generic_LoA` ("persona arm produces 1.5× more effect than generic arm"), but is not meaningful for `generic_LoA / persona_cot_FRESH` (which just says generic recovers some fraction of FRESH's effect).

  Impact: The holm_family entries 8-9 cannot answer Variant B's distinguishing question ("is persona-specific input-conditioning more effective than any-rationale-prefix?"). The entries for the generic arm have no `threshold` key in the holm_family dict (unlike the r5 entries), so the analyzer will also lack a machine-readable threshold to apply. The plan's Variant B discriminating claim (§7 HIGH attainability condition (i)) cannot be evaluated from the emitted summary.json.

  Fix: Add a third pass computing `f_persona_over_generic_source/bystander = persona_LoA / generic_LoA` using the same `_paired_bootstrap_ratio` engine with the generic arm's per-(q,s) loss as the denominator instead of FRESH. Threshold 1.5 applies to this ratio per plan §6. The existing `f_source__generic_cot_labels_on_answer` entries remain useful context (showing that generic arm also recovers some fraction of FRESH) — they can stay, but they are not the plan's Holm-family entries 8-9.

  Alternate interpretation: The implementer may have decided to replace entries 8-9 with the FRESH-anchored generic ratios, effectively changing the comparison from "persona-specific vs generic" to "generic arm's fraction of FRESH effect." If that substitution is intentional and the user approves it, the threshold should be updated (from 1.5 to something like 0.50 for "generic also recovers substantial fraction of FRESH"). But this is a plan deviation that changes the inferential question — it should be flagged, not silently implemented.

### Minor (wort

epm:code-review-reconcile2026-05-11T22:07:13.000Z· system

<!-- epm:code-review-reconcile v3 -->
## Reconciler Verdict — FAIL

**Role under adjudication:** code-reviewer
**Round:** 3 (CAP)
**Verdict:** FAIL
**Claude verdict:** FAIL
**Codex verdict:** CONCERNS

### Findings adjudicated

| Source | Finding (terse) | Verified? | Classification | Weight |
|---|---|---|---|---|
| Claude + Codex (agree) | M-NEW-1 H2 diff-of-diffs patch correct (1751-1782, smoke 11/11) | ✓ | Real-nonblocking (positive) | Non-blocking |
| Claude + Codex (agree) | Holm-family entries 8-9 `f_persona_over_generic_*` compute `generic_LoA / persona_FRESH`, NOT `persona_LoA / generic_LoA` at `run_issue186_eval.py:1162,1679-1699` | ✓ | **Real-blocking** | **Blocking** |
| Codex (M-minor) | Generic-arm Holm entries lack `threshold` key (1687-1698) | ✓ | Real-blocking (subsumed under finding 2) | Folded into 2 |

### Rationale

Both reviewers verified the same finding against the same code lines. I independently confirmed: the f-ratio loop at `scripts/run_issue186_eval.py:1162` iterates `loa_arms_in_scope`, which under Variant B includes `generic_cot_labels_on_answer`, and inside the loop (lines 1187-1275) the FRESH denominator is *always* sourced from `fresh_per_source` (which is `persona_cot_FRESH`; see line 1203). The Holm-family construction at line 1679 then emits entries named `f_<axis>__generic_cot_labels_on_answer` whose computed quantity is `generic_LoA / persona_FRESH`. Plan §6 entries 8-9 (plan lines 356-357) require `persona_LoA / generic_LoA` (threshold 1.5). The two ratios differ in both numerator and denominator: empirically the v2-implemented quantity is bounded ~[0, 1] (matches #186's `f_bystander_macro` ≈ 0.06-0.22), so the 1.5 threshold is structurally unreachable on the wrong ratio.

The disposition question is whether this blocks `status:running`. Three factors decide for FAIL:

1. **Plan §11 lines 431-441 explicitly cap Variant B HIGH-confidence on `f_persona_over_generic ≥ 1.5 (lower-95%-CI)` as condition (i).** Without this entry computed correctly, the SUPPORTS-branch HIGH-confidence path is *structurally inaccessible* — Variant B collapses to MODERATE, which is what Variant A already delivers. The user explicitly chose `compute:large` (Variant B) at the approval gate specifically to access the HIGH ladder. Advancing with this broken would deliver Variant A's confidence ceiling at Variant B's cost — defeating the user's approved scope.

2. **The Codex CONCERNS-advance argument is internally inconsistent.** Codex's own closing paragraph: "The `status:blocked` decision per plan §11 is appropriate here because the plan's Variant B distinguishing inferential anchor cannot be evaluated from the emitted summary." Codex correctly diagnoses blocked-appropriateness but labels CONCERNS. The post-hoc analyzer-side workaround Codex proposes (compute `persona_LoA / generic_LoA` from raw `cell_correctness` post-experiment) is methodologically degraded: the plan pre-registered a Holm family at α=0.01 with N=9 entries for Variant B. Adding an entry post-hoc at analyzer time re-Holms a different test set than what was pre-registered — that is a methodology deviation, not a workaround.

3. **The /issue Step 5d cap-3 protocol exists precisely for this case.** A real blocker surfaced at the cap that the user is best-positioned to scope: (a) authorize round-4 override (~80-120 LOC, single surgical fix), (b) accept Variant B MODERATE-capped (analyzer workaround, methodology-degraded), or (c) re-scope to Variant A. CONCERNS-advance silently locks the user into (b) without their consent.
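
To illustrate why the family size matters in factor 2, here is a minimal sketch of Holm's step-down procedure. The function name, the entry names, and every p-value below are made up for illustration; only α = 0.01 comes from the plan as quoted above. It shows that the same p-value can be rejected in a smaller family and retained once one more entry is added.

```python
def holm_reject(p_values, alpha=0.01):
    """Holm step-down: return {entry_name: rejected?} for a dict of p-values."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (0-based rank) is compared to alpha / (m - rank).
        threshold = alpha / (m - rank)
        if still_rejecting and p <= threshold:
            decisions[name] = True
        else:
            # Once one test fails, all larger p-values are retained as well.
            still_rejecting = False
            decisions[name] = False
    return decisions


# Hypothetical p-values: one borderline entry plus uninformative ones.
family_8 = {"borderline": 0.0012, **{f"other_{i}": 0.5 for i in range(7)}}
family_9 = {**family_8, "added_post_hoc": 0.5}

rejected_in_8 = holm_reject(family_8)["borderline"]  # 0.0012 <= 0.01 / 8
rejected_in_9 = holm_reject(family_9)["borderline"]  # 0.0012 >  0.01 / 9
```

The borderline entry clears the first-step threshold at m = 8 (0.00125) but not at m = 9 (about 0.00111), which is the sense in which changing the family after the fact changes the test.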

The mitigating observation Claude raises stands on inspection: the round-3 patch itself (M-NEW-1 H2 diff-of-diffs) is correct in isolation and would PASS on its own. The FAIL is on the broader plan-adherence dimension that the implementer correctly surfaced during round 3 (v3 (d) "Needs human eyeball"). The implementer respected the single-task scope brief and explicitly flagged the gap rather than silently expanding scope — that was th

epm:abort2026-05-11T22:08:12.000Z· system

<!-- epm:abort v1 -->
## Code-review ensemble exhausted cap (round 3) — `status:blocked`

After 3 rounds of code-review (6 reviewer invocations + 1 reconciler invocation across rounds 1 and 3), the ensemble cannot advance. **All round-1 + round-2 blockers landed correctly in v3**; the cap was hit on a v3-surfaced semantic plan deviation that the orchestrator must escalate to the user.

### What's solid

- **B1 (FRESH whole-turn pct_masked):** per-arm gate (≥80 partial, ≥10 whole-turn) — fixes v1's experiment-killing assertion.
- **B2 + B3 (paired bootstrap + macro r5):** dict-keyed (q,s) intersection + 4 directional r5 entries in `holm_family`.
- **B4 (n_bootstrap default 10_000):** CLI default bumped.
- **B5 (WANDB_TAGS assignment):** no per-cell tag leakage.
- **M-NEW-1 (H2 diff-of-diffs):** added cleanly in v3, smoke 11/11, lint clean.
- **Defensive guard on missing FRESH cells:** no silent fallback.

### The cap-triggering gap (v3 round 3)

Both Claude (FAIL) and Codex (CONCERNS) confirmed the implementer's surfaced concern: the v2 `f_*__generic_cot_labels_on_answer` Holm-family entries compute `generic_LoA / persona_cot_FRESH`, not the plan-spec `f_persona_over_generic = persona_LoA / generic_LoA` (threshold ≥ 1.5).

**Why this matters.** Plan §11 confidence binding constraint (i) requires `f_persona_over_generic ≥ 1.5 (lower-CI)` as one of 5 conditions for Variant B HIGH-confidence SUPPORTS. Without this entry correctly computed, Variant B's distinguishing inferential anchor — "persona-specific vs any-rationale-prefix" — is broken: Variant B collapses to MODERATE-confidence-cap, which Variant A already delivers at `compute:medium`.

Per /issue Step 5d, FAIL + revision_round≥3 → `status:blocked`. The orchestrator does NOT auto-pick scope; user decides via /issue 344 re-invocation after labeling.

### Three scope options for the user

**(a) Round-4 override.** Authorize ~80-120 LOC implementer fix to add a `f_persona_over_generic = persona_LoA / generic_LoA` paired-bootstrap path with new Holm-family entries. ~30 min implementer + ~10 min ensemble re-review. Variant B HIGH-confidence ladder restored.

**(b) Accept Variant B with persona-specificity un-discriminable.** Run as-is; clean-result confidence caps at MODERATE under Variant B SUPPORTS; the analyzer can compute `persona / generic` post-experiment as a descriptive sub-claim, but the result is NOT pre-registered in the Holm-corrected family (loses the multiple-comparisons protection). Wastes the ~12 GPU-hr of `generic_cot_labels_on_answer` cells inferentially.

**(c) Re-scope to Variant A.** Drop the `generic_cot_labels_on_answer` arms (~12 GPU-hr + ~$54 API saved); re-flip `compute:large` → `compute:medium`; HIGH-confidence persona-specificity NOT a goal for this experiment. Cleanest scope cut.

User can resume by:
- (a) Replying with `override-round-4` + a comment on the issue → orchestrator dispatches the implementer fix.
- (b) Replying with `accept-variant-b-no-persona-specific` → orchestrator advances to `status:running` with the gap noted in the experimenter + analyzer briefs.
- (c) Replying with `rescope-variant-a` → orchestrator removes the generic arms from the eval grid, re-flips the compute label, and advances.

Markers:
- R3 Claude verdict: https://github.com/superkaiba/explore-persona-space/issues/344#issuecomment-4425541758
- R3 Codex verdict: https://github.com/superkaiba/explore-persona-space/issues/344#issuecomment-4425542713
- R3 reconciler verdict: https://github.com/superkaiba/explore-persona-space/issues/344#issuecomment-4425557740
- V3 implementer marker (with surfaced concern): https://github.com/superkaiba/explore-persona-space/issues/344#issuecomment-4425520625
<!-- /epm:abort -->

epm:step-completed2026-05-11T22:08:14.000Z· system

<!-- epm:step-completed v1 -->
## Step Completed

step: 5b
at: 0e0a0405
timestamp: 2026-05-11T22:08:13+00:00
next_expected_step: 5
exit_kind: failure-exit
notes: code-review-ensemble FAIL at round 3 cap; f_persona_over_generic semantic plan deviation; status:blocked pending user scope decision (round-4-override / accept-no-persona-specific / rescope-to-A)
<!-- /epm:step-completed -->

epm:override-round-2026-05-11T23:26:37.000Z· system

<!-- epm:override-round-4 v1 -->
## User-authorized round-4 override (past cap-3)

User selected `override-round-4` at the post-cap-3 decision gate. Round 4 dispatches with **single-task scope**: fix the `f_persona_over_generic` Holm-family entry semantics.

**Scope:** add a `persona_LoA / generic_LoA` paired-bootstrap path with new Holm-family entries `f_persona_over_generic_source` / `_bystander` (threshold one-sided ≥ 1.5). Estimated +80-120 LOC + smoke test. Label flipped `status:blocked` → `status:implementing`.

Reconciler verdict that triggered cap-3: https://github.com/superkaiba/explore-persona-space/issues/344#issuecomment-4425557740
Round-3 abort context: https://github.com/superkaiba/explore-persona-space/issues/344#issuecomment-4425563367

After v4 implementer reports `epm:experiment-implementation v4`, ensemble code-review re-runs at `revision_round=4`. Round-4 PASS → advance to `status:running` + Step 6 pod provisioning.
<!-- /epm:override-round-4 -->

epm:experiment-implementation2026-05-11T23:35:12.000Z· system

<!-- epm:experiment-implementation v4 -->
## Implementation Report — round 4 (user-authorized post-cap-3 override)

**Status:** READY-FOR-REVIEW

**Single-task scope:** fix the `f_persona_over_generic` Holm-family entry semantics in `_stage_aggregate_fraction_of_effect` (`scripts/run_issue186_eval.py`). The v3 aggregator's f-ratio loop emitted `f_<axis>__generic_cot_labels_on_answer` which computes `generic_LoA / persona_cot_FRESH` (descriptive). Plan §6 / §11 Variant B Holm entries 8-9 require `f_persona_over_generic_<axis> = persona_LoA / generic_LoA` with one-sided threshold ≥ 1.5.

### (a) What was done

**Files changed (1 commit, no scope creep):**
- `scripts/run_issue186_eval.py`: +355 / -13 LOC
- `scripts/smoke_issue344.py`: +187 / -0 LOC
- Total: +542 / -13 across 2 files

**Commit:** `ab840069` — `issue #344 v4 [4/4]: add f_persona_over_generic Holm-family entries (Variant B plan §6 entries 8-9)`
**Branch:** `issue-344` pushed to `origin/issue-344` (head `ab840069`, prior `11e90c14`).

**Brief item walkdown (per the v4 specification):**

| # | Item | Status |
|---|---|---|
| 1 | New paired-bootstrap for `persona_LoA / generic_LoA` per source + macro, both axes; same `_paired_bootstrap_ratio` helper, same (q, s) intersection pattern, same denom stability gate (\|denom_macro\| < 0.02 OR >5% discarded) | DONE — new `f_persona_over_generic` block inserted between the existing f-ratio loop and the r5 section |
| 2 | Emit two NEW Holm-family entries `f_persona_over_generic_source` / `f_persona_over_generic_bystander` with the full statistics dict + `threshold: 1.5` + `hypothesis`. `p_one_sided_upper = mean(draws <= 1.5)`. | DONE (new block in Holm-family construction, after H2 entry) |
| 3 | Do NOT remove existing `f_*__generic_cot_labels_on_answer`; RENAME on emission to `f_generic_arm_over_persona_fresh_*`; update `kind`; add `schema_notes`. | DONE — moved to `descriptive_ratios` section in summary.json with `kind: descriptive_f_ratio_generic_arm_over_persona_fresh` and a per-entry `description` field |
| 4 | REPLACE the current `f_<axis>__generic_cot_labels_on_answer` entries in `holm_family` with the new `f_persona_over_generic_<axis>` entries | DONE — Holm loop gates on `is_holm_eligible = loa_arm != "generic_cot_labels_on_answer"` |
| 5 | Variant A skipping rule: omitted from `holm_family` AND `non_interpretable: true` with reason; family collapses to N=7. Verified in smoke. | DONE — `if variant == "B":` gate around the new Holm-emit block. Variant A path also keeps detail dicts with `non_interpretable: true`. |
| 6 | New smoke test `test_f_persona_over_generic_construction` covering well-defined denom (ratio > 1.5 passes, ratio = 1.0 gray, ratio < 0.5 rejects), near-zero denom, Variant A skipping. | DONE — `scripts/smoke_issue344.py`. Registered between H2 and HF-path tests. |
| 7 | Commit once `[4/4]` and push | DONE — `ab840069` pushed |
| 8 | Smoke harness 12/12 | DONE (verified below) |
| 9 | Lint clean on modified files | DONE (verified below) |

### (b) Considered but not done

- **Refactoring the f-ratio loop to share more code with the new `f_persona_over_generic` block.** Rejected: single-task scope. The two blocks compute different quantities (per-arm LoA over the persona-FRESH baseline vs persona-over-generic LoA) with different threshold semantics. Premature abstraction would be a scope-creep refactor; left as siblings.
- **Adding per-source `f_persona_over_generic` to a separate `holm_family_per_source` field.** Rejected: per-source f-ratios are reported exploratorily per Plan §6 (NOT in the Holm family). The detail dicts in `f_persona_over_generic[axis_name]["per_source"]` already carry per-source point + CI for analyzer use.
- **Computing the `f_persona_over_generic` ratio with `persona_cot_FRESH` somewhere in the denominator chain.** Rejected: the plan-spec quantity is `persona_LoA / generic_LoA` (direct LoA-vs-LoA comparison). Bringing FRESH back in would re-create the v3 ambiguity that triggered the reconcile F

epm:code-review2026-05-11T23:40:09.000Z· system

<!-- epm:code-review v4 -->
## Code-Reviewer Verdict — PASS

**Tier:** trunk (modifies the issue #186/#344 aggregator + analysis pipeline; consumed by analyzer).
**Diff size:** +542 / -13 across 2 files (`scripts/run_issue186_eval.py`, `scripts/smoke_issue344.py`).
**Plan adherence:** COMPLETE for the single-task v4 scope.
**Tests:** PASS (12/12 smoke; new `test_f_persona_over_generic_construction` covers the four scalar cases + structural Variant-A check).
**Lint:** PASS on modified files.
**Security sweep:** CLEAN.
**Needs user eyeball:** Variant A end-to-end wet-run path (see below).

This is a **post-cap-3 user-authorized round-4** review. Scope was strict (per the override authorization at issue #344 comment #4425967408): only verify the `f_persona_over_generic` semantic fix. Previous fixes (B1–B5, H2) verified in prior rounds, not re-reviewed.

---

## Single-task scope verification (per brief checklist)

### 1. Plan-spec quantity `persona_LoA / generic_LoA` — CORRECT

`scripts/run_issue186_eval.py:1400-1560` builds per-source `(q, s)`-keyed dicts for both arms:

- Numerator (`persona_loa_dict`): `_per_qs_loss_matrix(base_correct, cell_correctness[cid_persona_cot_labels_on_answer], …, matched_scaffold_idx)` (line 1417). LoA = baseline_correct − trained_correct (per `_per_qs_loss_matrix` docstring at line 849).
- Denominator (`generic_loa_dict`): same construction with `generic_cot_labels_on_answer` cell-id (line 1436).
- `_paired_bootstrap_ratio(num_arr, denom_arr, …, denom_epsilon=1e-4, degenerate_draw_policy="discard", n_resamples=n_bootstrap)` (lines 1478-1485). `n_bootstrap` defaults to `10_000` per `getattr(args, "n_bootstrap", 10_000)` at line 928. This matches plan §11 "Statistical engine" exactly.

This is the plan §6 / §11 Variant B entries 8-9 quantity (`f_persona_over_generic_<axis> = persona_LoA_<axis> / generic_LoA_<axis>`), **not** the v3 ambiguous `generic_LoA / persona_cot_FRESH`. The semantic gap that triggered the round-3 reconcile FAIL is closed.

### 2. Shared `(q, s)` intersection — CORRECT

Line 1465: `shared_keys = sorted(set(persona_loa_dict.keys()) & set(generic_loa_dict.keys()))` — same pattern as the f-ratio / r5 / H2 paths. Both `num_arr` and `denom_arr` index `shared_keys` in identical order (lines 1476-1477), preserving paired-bootstrap pairing (Plan §4 S4).
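
A minimal sketch of this key-intersection pattern, assuming each arm's per-(q, s) values live in a dict keyed by `(q, s)` tuples. The helper name `paired_arrays` and the toy values are hypothetical; only the `sorted(set(...) & set(...))` line mirrors what the review quotes.

```python
import numpy as np


def paired_arrays(num_dict, denom_dict):
    """Align two (q, s)-keyed dicts on their shared keys, in one fixed order."""
    shared_keys = sorted(set(num_dict.keys()) & set(denom_dict.keys()))
    # Both arrays are built from the same key list, so position i in each
    # array refers to the same (q, s) pair.
    num_arr = np.array([num_dict[k] for k in shared_keys], dtype=float)
    denom_arr = np.array([denom_dict[k] for k in shared_keys], dtype=float)
    return shared_keys, num_arr, denom_arr


# Toy values: the second dict is missing (2, "a") and has an extra (3, "a").
persona = {(0, "a"): 1.0, (1, "a"): 0.0, (2, "a"): 1.0}
generic = {(1, "a"): 1.0, (0, "a"): 0.0, (3, "a"): 1.0}
keys, num_arr, denom_arr = paired_arrays(persona, generic)
```

Keys present in only one arm are dropped, so a bootstrap that resamples positions resamples the same questions in numerator and denominator.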

### 3. Threshold + p-value — CORRECT

- Per-source: `p_vs_1_5 = float(np.mean(draws_arr <= 1.5))` (line 1491).
- Macro: `macro_p_vs_1_5 = float(np.mean(macro_draws_arr <= 1.5))` (line 1530).
- `threshold: 1.5` and `hypothesis: "one-sided H1: f_persona_over_generic_<axis> >= 1.5"` emitted on both per-source and macro entries.
- One-sided semantics: `mean(draws <= 1.5)` is small ⇒ strong rejection of `H0: ratio < 1.5` in the upper-tail direction. Matches plan §6 entry 8-9 ("one-sided ≥ 1.5") and matches the parallel H2 pattern (`mean(draws <= 0.5)`).
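
The p-value rule above reduces to one line over the bootstrap draws. The sketch below uses synthetic draws (not experiment output) and a hypothetical helper name to show the two regimes: draws concentrated well above 1.5 give a p-value near zero, while draws confined to [0, 1] can never clear the threshold.

```python
import numpy as np


def p_one_sided_upper(draws, threshold):
    """Fraction of bootstrap draws at or below the threshold."""
    return float(np.mean(np.asarray(draws, dtype=float) <= threshold))


rng = np.random.default_rng(0)
# Synthetic draws centred on 2.0: almost none fall at or below 1.5.
p_high = p_one_sided_upper(rng.normal(loc=2.0, scale=0.1, size=10_000), 1.5)
# Synthetic draws confined to [0, 1): every draw is at or below 1.5.
p_bounded = p_one_sided_upper(rng.uniform(0.0, 1.0, size=10_000), 1.5)
```

The second case is why a ratio that is bounded near [0, 1] cannot pass a 1.5 threshold regardless of sample size.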

### 4. Denominator stability gate — CORRECT

Both per-source (line 1487) and macro (lines 1525-1527) compute:
```python
non_interp = bool(abs(denom_macro) < 0.02 or boot["frac_discarded"] > 0.05)
```
Identical thresholds to the r5 + H2 gates (Plan §6 R3 B2). Smoke case (d) (`near-zero denom`) verifies the gate fires (`denom_macro=0.0050 < 0.02` ⇒ `non_interp=True`).

### 5. Variant A vs Variant B differentiation — CORRECT

- The Holm-emit block at line 2046 is gated on `if variant == "B":` — under Variant A, entries 8-9 are **omitted entirely** from `holm_family`. Family-size collapses to N=7.
- The detail dicts in `f_persona_over_generic[axis_name]` are still emitted regardless of variant. Under Variant A, the per-source loop encounters empty `generic_loa_dict` (cells absent from `cell_correctness`), takes the `missing: True` branch at lines 1448-1463, and emits `reason: "generic_cot_labels_on_answer arm not present (Variant A defers this arm to a follow-up; Plan §15)"` with `non_interpretable: True`. The macro fallback at lines 1547-1560 emits the same shape.
- `holm_family_size = 9 if var

epm:code-review-codex2026-05-11T23:42:01.000Z· system

<!-- epm:code-review-codex v4 -->
# Codex Code Review: issue-344 v4 f_persona_over_generic fix

**Verdict:** PASS
**Tier:** trunk
**Diff size:** +355 / -13 in run_issue186_eval.py; +187 / -0 in smoke_issue344.py
**Plan adherence:** COMPLETE
**Lint:** NOT-CHECKED
**Security sweep:** CLEAN
**Needs user eyeball:** Variant A aggregate-on-real-data path; local smoke execution was blocked here by read-only temp/cache, though syntax check passed.

## Plan Adherence
- Implementation marker shape: ✓ verified via GitHub connector; contains `(a) What was done`, `(b) Considered but not done`, `(c) How to verify`, `(d) Needs human eyeball`, with a copy-pasteable smoke command.
- `f_persona_over_generic = persona_LoA / generic_LoA`: ✓ implemented at `scripts/run_issue186_eval.py:1410` and `scripts/run_issue186_eval.py:1429`.
- Threshold `1.5`, one-sided: ✓ implemented at `scripts/run_issue186_eval.py:1491` and `scripts/run_issue186_eval.py:1530`.
- Paired bootstrap with shared `(q,s)` indices: ✓ implemented via shared key intersection at `scripts/run_issue186_eval.py:1465`.
- Denominator stability gate `|denom_macro| < 0.02 OR frac_discarded > 0.05`: ✓ implemented at `scripts/run_issue186_eval.py:1487` and `scripts/run_issue186_eval.py:1525`.
- Variant B only: 2 new Holm entries: ✓ gated by `variant == "B"` at `scripts/run_issue186_eval.py:2046`.
- Variant A: entries omitted from `holm_family`: ✓ same Variant B gate omits them; `holm_family_size` is 7 at `scripts/run_issue186_eval.py:2079`.
- Old `generic_cot_labels_on_answer` entries moved to `descriptive_ratios`: ✓ implemented at `scripts/run_issue186_eval.py:1911` and `scripts/run_issue186_eval.py:1954`.
- `holm_family_size` field: ✓ implemented at `scripts/run_issue186_eval.py:2079` and emitted at `scripts/run_issue186_eval.py:2101`.
- `schema_notes` block: ✓ emitted at `scripts/run_issue186_eval.py:2102`.
- New smoke test `test_f_persona_over_generic_construction`: ✓ added at `scripts/smoke_issue344.py:572`.
- 5 sub-cases (ratio>1.5, ratio=1.0, ratio<0.5, near-zero denom, Variant A structural): ✓ present at `scripts/smoke_issue344.py:605`, `scripts/smoke_issue344.py:636`, `scripts/smoke_issue344.py:662`, `scripts/smoke_issue344.py:693`, `scripts/smoke_issue344.py:717`.
- Registered in `main()` between H2 and HF-path tests: ✓ `scripts/smoke_issue344.py:1068`.

## Issues Found

### Critical (block merge)
None.

### Major (revise before merge)
None.

### Minor (worth fixing but doesn't block)
- `scripts/run_issue186_eval.py:1393` says `mean(draws < 1.5)` in the comment, while the implementation correctly uses `mean(draws_arr <= 1.5)` at lines 1491 and 1530. Fix the comment to avoid future off-by-one-threshold confusion.
- `scripts/smoke_issue344.py:717` only structurally checks Variant A arm omission and family-size delta; it does not run the full aggregate path that emits `f_persona_over_generic[axis]["reason"]` while omitting Holm entries. This matches the requested structural check but still leaves the actual Variant A aggregate branch for human/wet-run confirmation.

## Unaddressed Cases
Local smoke command could not run in the Codex sandbox (uv temp-dir permission issue); a direct `python3` syntax compile of both modified files succeeded. Per the implementer report, the local run produces `[smoke_issue344] ALL TESTS PASS` (12/12).

## Style / Consistency
The new block mirrors the H2/r5 p-value pattern and denominator gate. One doc comment at line 1393 uses `<` where code and plan both use `<=`; minor documentation inconsistency, no impact on correctness.

## Unintended Changes
No unintended changes found within the round-4 diff. Prior C3-gate, f-ratio loop, r5, and H2 behavior is pre-existing and not re-reviewed under this scope.

## Security Check
- no issues found

## Recommendation
merge
<!-- /epm:code-review-codex -->

epm:pod-pending2026-05-11T23:43:21.000Z· system

<!-- epm:pod-pending v1 -->
## Pod provisioning failed — RunPod capacity constraint

Attempted `python scripts/pod.py provision --issue 344 --intent ft-7b` (4× H100 ft-7b intent per plan §11). RunPod returned:

```
GraphQL errors: [{"message": "There are no longer any instances available with the requested specifications. Please refresh and try again.", "path": ["podFindAndDeployOnDemand"], "extensions": {"code": "SUPPLY_CONSTRAINT", "userId": "user_2v9CcEeHWnPcoAVCf8YeCXKvupS"}}]
```

Label stays at `status:running`; no implementer re-spawn (per /issue Step 6b rule "infra, not code").

### Options for the user

1. **Wait and retry.** RunPod 4× H100 capacity is intermittent. Re-invoke `/issue 344` later — the orchestrator will retry the same intent.
2. **Override intent.** Provision with explicit `--gpu-type H200 --gpu-count 4` (H200 has more memory, similar wall-time, often better capacity) or `--gpu-count 2` (Variant B at ~17.6 hr wall vs ~8.8 hr; cells still run, sharded across 2 GPUs instead of 4). Re-invoke after override.
3. **Switch to spot / different region.** Manual RunPod console action; outside the /issue skill's scope.

### Current state

- Branch: `issue-344` (commit `ab840069`, code-review v4 PASS)
- Worktree: `.claude/worktrees/issue-344`
- Plan: `.claude/plans/issue-344.md` (Variant B, compute:large)
- Code is ready to launch; only blocking on hardware availability.
<!-- /epm:pod-pending -->

epm:step-completed2026-05-11T23:43:23.000Z· system

<!-- epm:step-completed v1 -->
## Step Completed

step: 6c
at: 921b304d
timestamp: 2026-05-11T23:43:22+00:00
next_expected_step: 6d
exit_kind: failure-exit
notes: RunPod SUPPLY_CONSTRAINT on 4x H100; epm:pod-pending posted; user decides retry / intent-override / wait
<!-- /epm:step-completed -->

epm:step-completed2026-05-12T00:11:39.000Z· system

<!-- epm:step-completed v1 -->
## Step Completed

step: 6d
at: 921b304d
timestamp: 2026-05-12T00:11:39+00:00
next_expected_step: 7
exit_kind: clean
notes: experimenter dispatched on pod-344; Variant B pipeline (Phases 0-3); awaiting epm:results
<!-- /epm:step-completed -->

epm:launch2026-05-12T00:13:25.000Z· system

<!-- epm:launch v1 -->
## Experimenter launched (Variant B, compute:large)

**Worktree:** `.claude/worktrees/issue-344`
**Branch:** `issue-344` (off `origin/issue-280`), head `ab840069`
**PR:** https://github.com/superkaiba/explore-persona-space/pull/348
**Code-review verdict (R4):** PASS (Claude + Codex ensemble agree)
**Pod:** `pod-344` (4× H100 80GB HBM3, 190GB free /workspace)
**WandB project:** `explore-persona-space`
**WandB run-name pattern:** `issue344_{source}_{arm}_seed{S}`
**WandB tags:** `issue344`, `labels_on_answer`, `cot_factorial`, `variant_b`

### Pre-launch checks

- Pod synced to `issue-344` @ `ab840069` ✓
- `uv sync --locked` clean ✓
- Preflight: env_synced=true, 190GB free, 4×H100 idle, all API keys present ✓
  - The single preflight `error` ("127 commits behind origin/main") is an artifact: `issue-344` is intentionally branched off `origin/issue-280`, not `main`. All substantive checks pass.

### Pipeline (Variant B per plan §4 + §11 RC)

- **Phase 0** (local VM, ~10 min, ~$108 API): Regenerate persona-cot + generic-cot via `generate_issue186_data.py --only-arm` (cot_max_tokens=768, claude-sonnet-4-5-20250929). Run anchor-gate Phase 0d. Upload to HF Hub.
- **Phase 1** (pod, ~10.2 GPU-hr): 4-GPU sharded `run_issue_344_train.py --variant B`, all 4 arms × 4 sources × 3 seeds = 48 cells + 4 no_cot_FRESH cells (52 total).
- **Phase 2** (pod, ~23.4 GPU-hr): 4-GPU sharded `run_issue186_eval.py` n_questions=1172, cot_max_tokens=768, max_model_len=4096. Then `--stage aggregate --mode fraction_of_effect --variant B --n-bootstrap 10000`.
- **Phase 2b conditional**: C3 gate (+0.85 GPU-hr) only if `upper_ci_f_bystander < 0.20`.
- **Phase 3** (local VM, ~$50 Anthropic): `mediation_issue344.py --variant B --n-pairs-per-source 150 --equivalence-band 0.10`.
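The Phase 1 cell grid and the run-name pattern above can be sketched as follows. This is only an illustration, not the logic of `run_issue_344_train.py`: the arm and source names are placeholders (this log does not list them), and both the one-FRESH-cell-per-source layout and the round-robin sharding are assumptions.

```python
from itertools import product

# Placeholder names: the real arm and source identifiers are not listed in this log.
ARMS = ["arm_a", "arm_b", "arm_c", "arm_d"]
SOURCES = ["src_a", "src_b", "src_c", "src_d"]
SEEDS = [0, 1, 2]
N_GPUS = 4


def run_name(source: str, arm: str, seed: int) -> str:
    """Follow the WandB pattern issue344_{source}_{arm}_seed{S} from the launch note."""
    return f"issue344_{source}_{arm}_seed{seed}"


def build_cells() -> list[str]:
    """Enumerate the 4 arms x 4 sources x 3 seeds grid plus the FRESH cells."""
    cells = [run_name(s, a, seed) for s, a, seed in product(SOURCES, ARMS, SEEDS)]
    # Assumption: the 4 no_cot_FRESH cells are one per source at a single seed.
    cells += [run_name(s, "no_cot_FRESH", SEEDS[0]) for s in SOURCES]
    return cells


def shard(cells: list[str], n_gpus: int = N_GPUS) -> list[list[str]]:
    """Round-robin split so each GPU gets an equal share (hypothetical scheme)."""
    return [cells[i::n_gpus] for i in range(n_gpus)]


cells = build_cells()
shards = shard(cells)
print(len(cells), [len(s) for s in shards])
```

With these placeholders the grid has 52 cells, 13 per GPU. If per-cell cost varies by arm, a cost-aware assignment would balance wall-clock better than round-robin.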

### Monitoring

Progressive cadence per experimenter spec: poll every 30 s for the first 5 min, every 5 min for the next 25 min, every 30 min for the next 4 hr, then hourly. An `epm:progress` marker is posted at each phase boundary.
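A minimal sketch of that cadence as a step function of elapsed time. It assumes the tiers are cumulative (5 min, then 30 min, then 4 hr 30 min since launch); the function names are illustrative and not taken from the experimenter spec.

```python
def poll_interval(elapsed_s: float) -> int:
    """Seconds to wait before the next poll, given seconds since launch."""
    if elapsed_s < 5 * 60:               # first 5 min: every 30 s
        return 30
    if elapsed_s < 30 * 60:              # next 25 min: every 5 min
        return 5 * 60
    if elapsed_s < 30 * 60 + 4 * 3600:   # next 4 hr: every 30 min
        return 30 * 60
    return 3600                          # beyond that: hourly


def poll_times(horizon_s: int) -> list[int]:
    """Elapsed-time stamps (seconds) of every poll up to horizon_s."""
    times, t = [], 0
    while True:
        t += poll_interval(t)
        if t > horizon_s:
            break
        times.append(t)
    return times


print(poll_times(600))
```

Under this reading, the first 30 minutes contain 15 polls (10 at 30 s spacing, then 5 at 5 min spacing).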

Running autonomously until `epm:results v1` or `epm:failure v1`.

epm:step-completed · 2026-05-12T00:15:05.000Z · system

<!-- epm:step-completed v1 -->
## Step Completed

step: 7
at: 921b304d
timestamp: 2026-05-12T00:15:05+00:00
next_expected_step: 8
exit_kind: parked
notes: experimenter still running; epm:launch posted 00:13:25Z; Phase 0 (local data-gen) in progress; no epm:progress yet (within 4hr stale window)
<!-- /epm:step-completed -->

blocked · 2026-05-12T09:30:49.607Z · user · running → blocked

Experimenter session died after Phase 0 finished cleanly at 00:53Z; Phase 0d / Phase 1 never launched; pod-344 sat idle ~9h and was terminated. Resume via spawn-issue + /issue 344 (Phase 0 JSONLs on disk, may need re-upload).

state_changed · 2026-05-12T21:47:02.267Z · user · blocked → archived

User terminated pod-344 and chose to archive after 7 implementer rounds (4 cap-overrides). R7 was authorized to switch to adapter-only Hub uploads plus vLLM `enable_lora=True`, working around pod-344's ~30 Mbps egress, which put the 40-cell merged-checkpoint sweep at ~38 hr. The R7 equivalence smoke test crashed on a vllm 0.11.0 / `DisabledTqdm` tqdm conflict before completing; the user chose to stop rather than chase an 8th round. Surviving artifacts: Phase 0 data on HF Hub at issue186_data_v344/ (16 jsonl files, generic_cot anchor fixed) and R2's 4 FRESH cells uploaded as merged Hub artifacts. Full retrospective in the epm:pod-terminated v1 marker on the GitHub issue: https://github.com/superkaiba/explore-persona-space/issues/344#issuecomment-4435085580
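As a back-of-envelope check on the ~38 hr figure, the sketch below inverts it to get the checkpoint size it implies. It assumes egress was the only bottleneck and uses decimal units (1 GB = 8e9 bits). The adapter size is a hypothetical value for comparison only; it is not reported in this log.

```python
EGRESS_MBPS = 30.0   # from the log: ~30 Mbps egress on pod-344
N_CELLS = 40         # from the log: 40-cell merged-checkpoint sweep
SWEEP_HOURS = 38.0   # from the log: ~38 hr for the merged-checkpoint sweep


def upload_hours(total_gb: float, mbps: float) -> float:
    """Hours needed to push total_gb (decimal GB) over a link of mbps megabits/s."""
    seconds = total_gb * 8e9 / (mbps * 1e6)
    return seconds / 3600.0


def implied_gb_per_cell(hours: float, mbps: float, n_cells: int) -> float:
    """Per-checkpoint size implied by a total upload time, link rate, and cell count."""
    total_gb = hours * 3600.0 * mbps * 1e6 / 8e9
    return total_gb / n_cells


merged_gb = implied_gb_per_cell(SWEEP_HOURS, EGRESS_MBPS, N_CELLS)

# Hypothetical LoRA adapter size, chosen only to illustrate the scale difference.
HYPOTHETICAL_ADAPTER_GB = 0.2
adapter_hours = upload_hours(N_CELLS * HYPOTHETICAL_ADAPTER_GB, EGRESS_MBPS)

print(f"implied merged checkpoint: {merged_gb:.1f} GB per cell")
print(f"adapter-only sweep at {HYPOTHETICAL_ADAPTER_GB} GB each: {adapter_hours:.2f} hr")
```

Under these assumptions the quoted numbers imply roughly 12.8 GB per merged checkpoint (about 513 GB in total). An adapter-only sweep at the hypothetical size would upload in under an hour, consistent with the stated motivation for R7.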
