EPS
#150 · Completed

Add CoT to ARC C and misalignment and refusal and sycophancy

kind: experiment

Goal

Test whether persona-conditioned behavior training is more effective when the model is given room to sample persona-conditioned tokens (chain-of-thought) before committing to a final answer. Sweep across 12 personas spanning the assistant↔villain cosine axis × 4 behavioral evals to see whether CoT widens the persona-conditioned spread.

Hypothesis

Adding chain-of-thought to the eval prompt (and to a 4-persona subset of training data) increases the persona-conditioned spread on each of the 4 target evals — the gap between persona-aligned and persona-misaligned behavior grows when the model gets a CoT scratchpad to act in-character before answering.

Falsification: spread under CoT is statistically indistinguishable from no-CoT (or smaller) on ≥3 of the 4 evals across 3 seeds.

Personas (n=12)

Cached, no new dataset generation required. Span the cosine-to-assistant axis:

| Persona | Cosine to assistant | Source |
|---|---|---|
| assistant | +1.00 (anchor) | core matrix (c5) |
| software_engineer | +0.45 | src/personas.py |
| kindergarten_teacher | +0.33 | src/personas.py |
| data_scientist | +0.17 | src/personas.py |
| medical_doctor | +0.05 | src/personas.py + trait_transfer |
| librarian | −0.08 | src/personas.py + trait_transfer |
| french_person | −0.23 | src/personas.py |
| villain | −0.24 | src/personas.py, clean-result #80 |
| comedian | −0.28 | src/personas.py |
| zelthari_scholar | −0.38 | src/personas.py + trait_transfer |
| police_officer | −0.40 | src/personas.py |
| evil | (core matrix) | c1/c2/c7 |

Conditions

  • 12 personas (above)
  • 2 eval-time CoT settings: {CoT-eval, no-CoT-eval}
  • Train-time CoT subset (4 personas: assistant, evil, villain, medical_doctor) × {CoT-train, no-CoT-train}. The other 8 personas use existing no-CoT-train checkpoints (no train-CoT arm).
  • Seeds: 3

Method delta

Phase-1 SFT (CoT-train arm only, 4 personas × 3 seeds = 12 runs): retrain LoRA-7B Phase-1 on the same persona-conditioned QA tuples but with a CoT preamble — the model "thinks in character" before producing the answer. Exact CoT template wording deferred to planner.

Phase-2 EM induction: unchanged.

Eval-time CoT scaffold: <persona-thinking>...</persona-thinking> Answer: ... prepended at generation. Sampled, not logprob-extracted.

Evals

| Eval | Existing format | CoT change |
|---|---|---|
| ARC-Challenge (capability) | logprob-based (evaluate_capability_logprob) | switch to generation-based: CoT then Answer: A/B/C/D |
| Betley alignment | generation + Claude judge (alignment.py) | add CoT scaffold; judge unchanged |
| Refusal | generation + Claude judge (strongreject.py) | add CoT scaffold; judge unchanged |
| Sycophancy | generation + Claude judge (trait_scorers.py) | add CoT scaffold; judge unchanged |

The generation-based ARC-C path is new infra inside this issue's scope.

Success criterion

CoT-eval increases the persona-conditioned spread (max minus min of the eval score across the 12-persona axis) on ≥3 of 4 evals, at p<0.05 across 3 seeds. Train-time CoT additionally increases the spread for the 4-persona subset relative to the no-CoT-train baselines.
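
As a rough illustration of the spread metric named in this criterion, the sketch below computes max minus min over a per-persona score mapping and compares two eval settings. The function name `persona_spread` and all scores are hypothetical placeholders, not values from this experiment.

```python
def persona_spread(scores_by_persona):
    """Return max minus min of a per-persona score mapping."""
    values = list(scores_by_persona.values())
    return max(values) - min(values)


# Hypothetical per-persona scores for one eval and one seed (not real results).
no_cot = {"assistant": 0.80, "librarian": 0.78, "police_officer": 0.75}
cot = {"assistant": 0.82, "librarian": 0.77, "police_officer": 0.70}

# A positive gain means the CoT setting widened the gap between personas.
spread_gain = persona_spread(cot) - persona_spread(no_cot)
print(f"spread gain under CoT: {spread_gain:+.3f}")
```

The criterion would then ask whether this gain is positive across the 3 seeds for each eval; the significance test itself is left to the planner.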

Kill criterion

CoT-eval reduces or leaves unchanged (p>0.10) the persona-conditioned spread on ≥3 of 4 evals.

Compute estimate

Provisional medium-to-large — planner refines:

  • Train-CoT arm: 4 personas × 3 seeds = 12 LoRA-7B Phase-1 SFT runs (~6 GPU-hr Phase-1) + Phase-2 EM (~3 GPU-hr) = ~9 GPU-hr training.
  • Eval: 12 personas × 2 CoT settings × 4 evals × 3 seeds × {pre-EM, post-EM} via vLLM batched generation = ~15-25 GPU-hr.
  • Total estimate: 25-40 GPU-hr.

Pod preference

Ephemeral pod, pod.py provision --issue 150 --intent lora-7b (1× H100). Eval phase reuses the same pod.

Spec (from clarifier)

Resolved before drafting:

  1. Type: type:experiment (empirical headline question, not pure infra).
  2. CoT placement: full factorial (train × eval).
  3. Headline metric: bigger persona-conditioned spread on the 4 evals.
  4. Persona set: 12 personas above (no new dataset generation).
  5. Train-CoT scope: 4-persona subset (assistant, evil, villain, medical_doctor); eval-CoT on all 12.
  6. Seeds: 3.

References

  • src/personas.py — 12-persona behavioral list with ASSISTANT_COSINES.
  • configs/condition/c1..c8 — existing persona conditions and Phase-1 datasets.
  • src/explore_persona_space/eval/capability.py — ARC-C logprob path; needs generation-based variant.
  • src/explore_persona_space/eval/{alignment,strongreject,trait_scorers}.py — Betley / refusal / sycophancy judges.
  • Prior persona-axis work: #80 (villain marker propagation), #83 (sarcastic source), #84 (evil-AI source).

Timeline · 19 events

  1. epm:auto-defaults· system
    <!-- epm:auto-defaults v1 -->
    ## Auto-defaults applied
    
    Issue arrived with no `status:*` label, no `type:*` label, and a one-sentence body. `/issue 150` filled in the minimum spec via Step 0b:
    
    - **Label `status:proposed`** — autofilled (matches Todo column convention).
    - **Label `type:experiment`** — confirmed via clarifier prompt; the question "does X work better" is empirical, not pure infra.
    - **Label `compute:medium`** — provisional based on planner-side estimate (25-40 GPU-hr); planner will refine.
    - **Label `aim:cross-cutting`** — touches capability, alignment, refusal, sycophancy across the persona axis.
    - **Body rewrite** — drafted via clarifier dialogue covering: persona set (12 from `src/personas.py`), CoT placement (train × eval factorial), headline metric (persona-conditioned spread), train-CoT scope (4-persona subset), seeds (3), evals (4), success/kill criteria, compute estimate, pod preference. Original body preserved in spirit ("(2) does persona-conditioned behavior training work better if you give the model a chance to sample persona-conditioned tokens in its response").
    
    User clarifier answers (verbatim):
    1. Type → `type:experiment`
    2. CoT placement → Both (full factorial)
    3. Success metric → Bigger persona-conditioned spread
    4. Persona set → 12 cached personas spanning the assistant↔villain cosine axis
    5. Train-CoT scope → 4-persona subset (assistant, evil, villain, medical_doctor); eval-CoT on all 12
    <!-- /epm:auto-defaults -->
  2. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — no blocking ambiguities remain
    
    After the inline clarifier round, the spec is tight enough for the adversarial planner. Outstanding choices that the planner must resolve (non-blocking, listed for downstream visibility):
    
    1. Exact CoT prompt template wording (e.g., `<persona-thinking>...</persona-thinking>` vs. natural-language scaffolding) — planner picks; sweep over alternatives if cheap.
    2. Whether to switch ARC-Challenge to generation-based scoring entirely or keep both logprob and generation paths during the experiment for direct comparison.
    3. Precise statistical test for "spread increases" — planner specifies (per CLAUDE.md, prose reports p-values + n only).
    4. Whether to also re-run no-CoT-train checkpoints (for the 4-persona subset) to control for train-time random noise vs. just reuse the existing checkpoints.
    5. Per-question generations for stochastic evals (Betley/refusal/sycophancy under CoT) — planner picks based on noise-vs-cost tradeoff.
    
    Proceeding to adversarial planning.
    <!-- /epm:clarify -->
  3. epm:plan· system
    <!-- epm:plan v1 -->
    ## Approved Plan for #150 (v3 — passed adversarial-planner; 8 critic BLOCKERs + 3 consistency BLOCKs resolved)
    
    **Cost gate:** estimated **12-17 GPU-hours** on 1× H100 (`pod.py provision --issue 150 --intent eval`). Reply `approve` to dispatch.
    
    H2 (train-CoT) descoped to follow-up that fires only if H1 passes.
    
    ### Goal
    
    Test whether **eval-time persona-conditioned chain-of-thought** widens the gap between persona-aligned and persona-misaligned behavior on a c5-EM-trained model. The contrast that isolates *persona-conditioning as the mechanism* is `(persona-CoT − generic-CoT)`, not `(CoT − no-CoT)`.
    
    ### Hypothesis (H1 only)
    
    `Δslope = slope(persona-CoT, score-vs-cosine) − slope(generic-CoT, score-vs-cosine)` is in the predicted direction (positive for ARC-C, Betley alignment, refusal) on ≥2 of 3 evals (post-EM stage), with Bonferroni-corrected p<0.0167 across 3 seeds.
    
    **Falsification:** ≤1 of 3 evals shows Δslope in predicted direction at p<0.0167.
    
    ### Method delta
    
    No training. Eval-only sweep on existing checkpoints:
    - **Pre-EM:** `Qwen/Qwen2.5-7B-Instruct` (no fine-tune; persona conditioning via system prompt)
    - **Post-EM:** existing c5 LoRA adapters at seeds {42, 137, 256} (`models/midtrain_25pct_seed{42,137,256}/assistant_correct/em_lora/`), pre-merged onto `Qwen2.5-7B-Instruct` for vLLM compatibility
    
    **3 CoT arms:** `no-cot` | `generic-cot` ("Let me think step-by-step. Answer:") | `persona-cot` ("<persona-thinking>...in character...</persona-thinking> Answer:")
    
    **Headline contrast:** persona-cot vs. generic-cot (the generic-cot baseline is the critic's required control to attribute the effect to persona-conditioning, not CoT-as-mechanism).
    
    ### Personas (n=11, matching #80 axis)
    
    | Persona | Cosine to assistant |
    |---|---|
    | assistant | +1.00 (anchor) |
    | software_engineer | +0.446 |
    | kindergarten_teacher | +0.331 |
    | data_scientist | +0.170 |
    | medical_doctor | +0.054 |
    | librarian | −0.081 |
    | french_person | −0.226 |
    | villain | −0.237 |
    | comedian | −0.283 |
    | zelthari_scholar | −0.379 |
    | police_officer | −0.399 |
    
    Source: `src/explore_persona_space/personas.py::ASSISTANT_COSINES`. Evil excluded — no cosine in `ASSISTANT_COSINES`; computing inline diverges from #80. (See open question #1.)
    
    ### Evals (3, sycophancy descoped)
    
    | Eval | Function | Params | Extraction |
    |---|---|---|---|
    | ARC-C generation | NEW `evaluate_capability_generation()` (~80 LOC) | temp=0.0, K=1, max_tokens=768, N=1172 | Regex `r"\b(?:[Aa]nswer)\s*[:\-]?\s*\(?([A-D])\)?\b"`; Claude Sonnet 4.5 fallback |
    | Betley alignment | `eval/alignment.py::evaluate_alignment` | temp=1.0, K=8, 8 questions | Claude Sonnet 4.5 judge |
    | StrongREJECT refusal | `eval/strongreject.py::evaluate_strongreject` | temp=1.0, K=8, 10 prompts (override default temp=0) | Claude Sonnet 4.5 judge |
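
To make the ARC-C extraction rule in the table concrete, here is a minimal sketch that applies the listed regex to a few completions. The helper name `extract_letter` and the choice to take the last match are assumptions for illustration; the plan does not say which match is used, and the Claude fallback is not modelled.

```python
import re

# Pattern copied from the ARC-C generation row of the table above.
ANSWER_RE = re.compile(r"\b(?:[Aa]nswer)\s*[:\-]?\s*\(?([A-D])\)?\b")


def extract_letter(completion):
    """Return the last 'Answer: X' letter in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    # Assumption: the final match is the committed answer; earlier matches
    # may come from the rationale discussing candidate options.
    return matches[-1] if matches else None


print(extract_letter("Heat rises, so B is out.\nAnswer: C"))
print(extract_letter("answer - (D)"))
print(extract_letter("The answer is B"))             # no anchor match
print(extract_letter("Answer: Because of gravity"))  # word, not a letter
```

The last two completions return `None`: the pattern allows only optional whitespace, `:` or `-`, and an optional parenthesis between the word "answer" and the letter, and its trailing `\b` rejects a letter that begins a longer word. Completions like these are what the fallback path would have to handle.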
    
    **Bridging:** ARC-C *logprob* (existing `evaluate_capability_per_persona`) reported alongside generation, no CoT — for #75 parity.
    
    ### Pipeline
    
    1. **Stage 0** (~10 min): merge c5 adapters → push merged to HF Hub.
    2. **Stage 1 — gate** (~30 min): post-EM × 3 CoT arms × 3 evals × 2 personas (assistant + police_officer, the cosine extremes) × seed 42. Kill if `Δslope_2pt` wrong-sign on ≥2 of 3 evals.
    3. **Stage 2 — full** (~12-17 GPU-hr): pre-EM (1 base) + post-EM (3 c5-merged seeds) × 11 personas × 3 CoT arms × 3 evals × 3 seeds.
    4. **Stage 3 — analysis**: per (eval, seed, stage, CoT-arm) fit `score ~ cosine`; compute Δslope; 1-sample t-test across 3 seeds; Bonferroni × 3 evals (α=0.0167).
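
The Stage 3 test can be sketched without external libraries, because a one-sample t-test on exactly three seeds has 2 degrees of freedom and the Student-t CDF has a closed form in that case. The per-seed Δslope values below are hypothetical, and treating the test as two-sided is an assumption; the plan does not state sidedness.

```python
import math
import statistics


def one_sample_t_df2(deltas):
    """Two-sided one-sample t-test against 0 for exactly three values."""
    if len(deltas) != 3:
        raise ValueError("closed form below is only valid for n=3 (df=2)")
    mean = statistics.mean(deltas)
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    t = mean / sem  # raises ZeroDivisionError if all three values are equal
    # Student-t CDF with 2 degrees of freedom: F(t) = 1/2 + t / (2*sqrt(t^2 + 2)),
    # so the two-sided p-value is 1 - |t| / sqrt(t^2 + 2).
    p_two_sided = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return t, p_two_sided


# Hypothetical per-seed delta-slopes for one eval (not real results).
delta_slopes = [0.04, 0.05, 0.06]
t_stat, p_value = one_sample_t_df2(delta_slopes)
alpha = 0.05 / 3  # Bonferroni over 3 evals, about 0.0167
print(f"t={t_stat:.2f}, p={p_value:.4f}, passes={p_value < alpha and t_stat > 0}")
```

With only 2 degrees of freedom the test has little power, so the per-seed slopes must agree closely for any eval to clear the corrected threshold.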
    
    ### Decision gates
    
    | Gate | When | Cost | Kill rule |
    |---|---|---|---|
    | G1 — extraction parity | Before Stage 2 | <5 min CPU | Manual: ≥95% regex extraction across 3 CoT arms on 50 samples |
    | G2 — 2-persona post-EM | After Stage 1 | ~0.5 GPU-hr | Wrong-sign Δslope_2pt on ≥2 of 3 evals → STOP, save ~13 GPU-hr |
    | G3 — pre-EM run | After G2 | adds ~6 GPU-hr | Always run if G2 passes |
    
    ### Reproducibility Card (key cells; full card cached at `.claude/plans/issu
  4. epm:consistency· system
    <!-- epm:consistency v1 -->
    ## Consistency check vs. related experiments — verdict: PASS (after v3 revision)
    
    Plan v3 was checked against `aim:cross-cutting` issues + persona-axis prior work (#75, #80, #83, #84, #138, #168, #173). Initial WARN verdict on v2 escalated to 3 BLOCKs; all addressed in v3.
    
    ### v3 resolutions of v2 BLOCKs
    - **ARC-C eval suite incompatibility with #75** (BLOCK→PASS): v3 keeps generation-based ARC-C as the primary metric for the CoT comparison, AND adds logprob ARC-C as a bridging metric to maintain comparability with #75.
    - **New metrics with no calibration baseline** (BLOCK→PASS): sycophancy descoped to follow-up; StrongREJECT refusal kept with explicit caveat that the 10-prompt subset may need expansion based on observed SE.
    - **EM LoRA hparams ambiguity** (BLOCK→PASS): repro card explicitly separates Phase-1 (full-finetune `lr=2e-5, epochs=3`) from Phase-2 EM LoRA (`r=32, α=64, dropout=0.05, rslora=False, lr=1e-4, 375 steps`) per #75 lineage. The yaml fields `configs/training/default.yaml` (lr=5e-6) and `configs/lora/default.yaml` (dropout=0.0, rslora=true) are LoRA-7B Phase-1 defaults, NOT Phase-2 EM defaults. Repro card pins lineage values.
    
    ### Standing WARNs (acknowledged in plan §12 caveats)
    
    1. **Phase-2 dataset:** `c5_assistant_correct_em.yaml:9` references `data/sft/phase2_insecure_code.jsonl` which doesn't exist locally. The c5 `em_lora` adapters were trained on `data/bad_legal_advice_6k.jsonl` (md5 `26b52cacc53425618fde278d2457304d`, verified locally) per #75 lineage. The yaml field is stale; repro card uses the verified value.
    2. **c5 env drift:** c5 adapters trained under transformers 4.48 / trl 0.17 / peft 0.18 (#75 §Environment); current uv.lock may differ. Logprob ARC-C bridge against #75's reported 0.86 is the env-drift sentinel; deviation >0.03 will be flagged in the clean-result.
    3. **Evil-as-12th-persona divergence from #80:** plan v3 drops evil to maintain 11-persona axis parity with #80. User can override via Open Question #1.
    
    ### Single-variable-change verification
    
    The intended changes for issue #150 are:
    1. eval-time CoT scaffold (3 arms: no-cot, generic-cot, persona-cot)
    2. eval system prompt sweep over 11 personas
    
    Every other variable matches `c5_assistant_correct_em` lineage:
    - Same base model (`Qwen/Qwen2.5-7B-Instruct`)
    - Same seeds [42, 137, 256] (matches `c5_assistant_correct_em.yaml:10`)
    - Same Phase-1 hparams (c5 lineage; not retrained)
    - Same Phase-2 EM hparams (c5 lineage; not retrained)
    - Same EM dataset (`bad_legal_advice_6k.jsonl`, md5 verified)
    - No new persona-conditioned training (H2 descoped)
    
    **Verdict: PASS** — single-variable change per arm, no unintended divergence.
    <!-- /epm:consistency -->
  5. epm:plan· system
    <!-- epm:plan v2 -->
    ## Approved Plan for #150 (v4 — user-finalized scope)
    
    **Cost gate:** estimated **~5-7 GPU-hours** on 1× H100 (`pod.py provision --issue 150 --intent eval`). Reply `approve` to dispatch.
    
    User-finalized decisions (vs. v1 plan):
    - ✅ **11 personas** (matches #80 axis; evil excluded)
    - ✅ **ARC-C only** as headline eval — Betley alignment, StrongREJECT refusal, and sycophancy descoped to follow-ups
    - ✅ **Logprob ARC-C** included as bridging metric (#75 parity)
    - ✅ **Merge target:** `Qwen2.5-7B-Instruct` directly
    
    ### Goal
    
    Test whether **eval-time persona-conditioned chain-of-thought** widens the ARC-C accuracy gap between persona-aligned and persona-misaligned behavior on a c5-EM-trained model. The contrast that isolates persona-conditioning as the mechanism is `(persona-CoT − generic-CoT)`, not `(CoT − no-CoT)`.
    
    ### Hypothesis (H1, single eval)
    
    `Δslope = slope(persona-CoT, ARC-acc-vs-cosine) − slope(generic-CoT, ARC-acc-vs-cosine)` is positive (post-EM stage), with p<0.05 across 3 seeds.
    
    **Falsification:** Δslope ≤ 0 or p>0.05.
    
    **Direction prior:** `Δslope ≥ +0.05` per unit cosine (range −0.40 → +1.00, span 1.40 → ≥7 pp swing assistant-end vs police-end).
    
    ### Method delta
    
    No training. Eval-only sweep on existing checkpoints:
    - **Pre-EM:** `Qwen/Qwen2.5-7B-Instruct` (no fine-tune; persona via system prompt)
    - **Post-EM:** existing c5 LoRA adapters at seeds {42, 137, 256} (`models/midtrain_25pct_seed{42,137,256}/assistant_correct/em_lora/`), pre-merged onto `Qwen/Qwen2.5-7B-Instruct`
    
    **3 CoT arms:**
    | Arm | Scaffold (appended to assistant turn) |
    |---|---|
    | `no-cot` | (none — direct answer) |
    | `generic-cot` | `"Let me think step-by-step.\nAnswer: "` |
    | `persona-cot` | `"<persona-thinking>\n[reasoning in character]\n</persona-thinking>\nAnswer: "` |
    
    **Headline contrast:** persona-cot vs. generic-cot (the generic-cot baseline is the critic's required control to attribute the effect to persona-conditioning, not CoT-as-mechanism).
    
    ### Personas (n=11, matching #80 axis)
    
    | Persona | Cosine to assistant |
    |---|---|
    | assistant | +1.00 (anchor) |
    | software_engineer | +0.446 |
    | kindergarten_teacher | +0.331 |
    | data_scientist | +0.170 |
    | medical_doctor | +0.054 |
    | librarian | −0.081 |
    | french_person | −0.226 |
    | villain | −0.237 |
    | comedian | −0.283 |
    | zelthari_scholar | −0.379 |
    | police_officer | −0.399 |
    
    Source: `src/explore_persona_space/personas.py::ASSISTANT_COSINES`.
    
    ### Evals
    
    | Eval | Function | Params | Extraction | Role |
    |---|---|---|---|---|
    | ARC-C generation | NEW `evaluate_capability_generation()` (~80 LOC) | temp=0.0, K=1, max_tokens=768, N=1172 | Regex `r"\b(?:[Aa]nswer)\s*[:\-]?\s*\(?([A-D])\)?\b"`; Claude Sonnet 4.5 fallback for unparseable cells only | **Headline** |
    | ARC-C logprob | existing `evaluate_capability_per_persona()` | greedy A/B/C/D logits, persona system prompt, N=1172 | logprobs | **Bridging** to #75 (no CoT possible) |
    
    **No Claude judge in the headline path** (regex extraction only, fallback rare). Significant API cost reduction vs. v3 (was ~$140 for Betley+StrongREJECT judging, now ≈$0).
    
    ### Pipeline
    
    1. **Stage 0** (~10 min): Pull c5 LoRA adapters for seeds {42, 137, 256}; merge each onto `Qwen/Qwen2.5-7B-Instruct`; push merged to HF Hub.
    2. **Stage 1 — gate** (~10 min, post-EM only, seed 42, 2 personas {assistant, police_officer}, 3 CoT arms × 1 eval = 6 cells): kill if `Δslope_2pt = (assistant_persona-CoT − police_persona-CoT) − (assistant_generic-CoT − police_generic-CoT)` is ≤ 0.
    3. **Stage 2 — full** (~3-5 GPU-hr): pre-EM (1 base) + post-EM (3 c5-merged seeds) × 11 personas × 3 CoT arms × ARC-C generation × N=1172. Plus logprob bridging on the same models.
    4. **Stage 3 — analysis**: per (seed, stage, CoT-arm) fit `acc ~ cosine`; compute Δslope; 1-sample t-test across 3 seeds.
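
The Stage 1 kill rule in the pipeline above reduces to a difference of two accuracy gaps. A minimal sketch follows, with hypothetical accuracies and an assumed `(persona, arm)` keying for the results; the real script's data layout is not shown in this plan.

```python
def delta_slope_2pt(acc):
    """Two-point contrast from the gate stage.

    `acc` maps (persona, arm) -> accuracy in [0, 1].
    """
    persona_gap = acc[("assistant", "persona-cot")] - acc[("police_officer", "persona-cot")]
    generic_gap = acc[("assistant", "generic-cot")] - acc[("police_officer", "generic-cot")]
    return persona_gap - generic_gap


def gate_passes(acc):
    """Kill rule: stop when the contrast is zero or negative."""
    return delta_slope_2pt(acc) > 0


# Hypothetical gate accuracies (not real results).
gate_acc = {
    ("assistant", "persona-cot"): 0.88,
    ("police_officer", "persona-cot"): 0.80,
    ("assistant", "generic-cot"): 0.87,
    ("police_officer", "generic-cot"): 0.84,
}
print(delta_slope_2pt(gate_acc), gate_passes(gate_acc))
```

The `no-cot` arm is run at the gate but does not enter the kill rule, which depends only on the persona-cot versus generic-cot contrast.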
    
    ### Decision gates
    
    | Gate | When | Cost so far | Kill rule |
    |---|---|---|---|
    | G1 — extraction parity | Before Stage 2 | <5 min CPU | ≥95% regex extraction across 3 CoT arms o
  6. epm:plan· system
    <!-- epm:plan v3 -->
    ## Final Plan for #150 — Persona-CoT × Capability Leakage on Qwen2.5-7B-Instruct
    
    **Cost gate:** ~**1 GPU-hour** on 1× H100 (`pod.py provision --issue 150 --intent eval`). $0 API. Reply `approve` to dispatch.
    
    User-finalized scope (supersedes v1/v2):
    - ✅ **Qwen2.5-7B-Instruct only** — no fine-tune, no midtraining, no LoRA, no merging
    - ✅ **11 personas** (matches #80 axis)
    - ✅ **ARC-C only** as headline eval
    - ✅ **Hybrid logprob:** generate persona-CoT at temp=0 / K=1, then logprob over A/B/C/D
    - ✅ **Hypothesis (Option B):** persona-CoT *amplifies* persona expression → Δslope **positive** (steeper slope than generic-CoT)
    
    ### Goal
    
    Test whether eval-time persona-CoT amplifies the natural persona-induced capability variation in `Qwen2.5-7B-Instruct`. The base instruct model already shows different ARC-C accuracy under different persona system prompts; the question is whether giving it a chance to "think in character" before answering makes the persona axis more sharply expressed (steeper slope of accuracy vs. cosine-to-assistant).
    
    ### Hypothesis (H1, Option B)
    
    `Δslope = slope(persona-CoT, ARC-acc-vs-cosine) − slope(generic-CoT, ARC-acc-vs-cosine)` is **positive** (persona-CoT amplifies expression), with bootstrap p<0.05.
    
    **Falsification:** Δslope ≤ 0 or bootstrap p>0.05.
    
    ### Method
    
    **One model:** `Qwen/Qwen2.5-7B-Instruct` (zero training, used directly).
    
    **11 personas** (system prompts from `src/explore_persona_space/personas.py::PERSONA_PROMPTS` + `ASSISTANT_PROMPT`):
    
    | Persona | Cosine to assistant |
    |---|---|
    | assistant | +1.00 (anchor) |
    | software_engineer | +0.446 |
    | kindergarten_teacher | +0.331 |
    | data_scientist | +0.170 |
    | medical_doctor | +0.054 |
    | librarian | −0.081 |
    | french_person | −0.226 |
    | villain | −0.237 |
    | comedian | −0.283 |
    | zelthari_scholar | −0.379 |
    | police_officer | −0.399 |
    
    **3 CoT arms:**
    
    | Arm | Prefix tokens injected before `Answer: ` |
    |---|---|
    | `no-cot` | (none — direct logprob over A/B/C/D after question) |
    | `generic-cot` | Generated rationale at temp=0 with prompt `"Let me think step-by-step."` |
    | `persona-cot` | Generated rationale at temp=0 with prompt `"<persona-thinking>\n[reasoning in character]\n</persona-thinking>"` |
    
    **Hybrid CoT-then-logprob protocol** (per (persona, question) cell):
    
    1. For `generic-cot` and `persona-cot`: model generates the rationale deterministically (temp=0, K=1, max_tokens=256) given `<system>{persona}</system><user>{question}</user><assistant>{scaffold_prefix}`.
    2. Construct full conditioning prefix: `<system>{persona}</system><user>{question}</user><assistant>{generated_CoT}\nAnswer: `.
    3. Single forward pass; read logprobs of single tokens `A`, `B`, `C`, `D`. Pick argmax. (For `no-cot`, skip step 1; prefix has no rationale.)
    
    No regex extraction, no Claude judge, no answer sampling. Clean numbers.
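
The three protocol steps can be sketched as plain string construction plus an argmax. This is an illustration only: the `<system>`/`<user>`/`<assistant>` tags are the plan's shorthand rather than the model's real chat template, the function names are hypothetical, and the logprobs are made-up numbers standing in for a forward pass.

```python
def generation_prompt(persona, question, scaffold_prefix):
    """Step 1 input: chat prefix from which the rationale is generated."""
    return f"<system>{persona}</system><user>{question}</user><assistant>{scaffold_prefix}"


def scoring_prompt(persona, question, generated_cot=""):
    """Step 2: conditioning prefix that ends in the answer anchor.

    For the no-cot arm, `generated_cot` is empty and no rationale is included.
    """
    rationale = f"{generated_cot}\n" if generated_cot else ""
    return f"<system>{persona}</system><user>{question}</user><assistant>{rationale}Answer: "


def pick_letter(letter_logprobs):
    """Step 3: argmax over the four single-token letter logprobs."""
    return max("ABCD", key=lambda letter: letter_logprobs[letter])


question = "Which gas do plants absorb? A) O2 B) CO2 C) N2 D) H2"
prompt = scoring_prompt("You are a librarian.", question, "Plants take in carbon dioxide.")
print(prompt)

# Hypothetical logprobs for the tokens A, B, C, D after the prompt above.
print(pick_letter({"A": -2.3, "B": -0.2, "C": -3.1, "D": -4.0}))
```

Because every arm ends in the same `Answer: ` anchor and is scored over the same four tokens, differences between arms come from the rationale text rather than from answer parsing.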
    
    ### Eval
    
    - **ARC-Challenge** test set, full N=1172 (no subsampling).
    - Metric: per-persona accuracy = fraction of 1172 questions where argmax-letter matches `correct_answer` field.
    - Per-arm slope: linear regression of `accuracy ~ cosine` across 11 personas.
    - Headline statistic: `Δslope = slope(persona-CoT) − slope(generic-CoT)`.
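
The per-arm slope and the headline statistic can be sketched with an ordinary least-squares fit over the 11 cosines from the table above. The accuracies below are synthetic (constructed to be exactly linear in cosine) so that the expected slopes are known; they are not experimental results, and the helper names are hypothetical.

```python
# Cosine-to-assistant values copied from the persona table above.
COSINES = {
    "assistant": 1.00, "software_engineer": 0.446, "kindergarten_teacher": 0.331,
    "data_scientist": 0.170, "medical_doctor": 0.054, "librarian": -0.081,
    "french_person": -0.226, "villain": -0.237, "comedian": -0.283,
    "zelthari_scholar": -0.379, "police_officer": -0.399,
}


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def arm_slope(acc_by_persona):
    """Regress per-persona accuracy on cosine for one CoT arm."""
    personas = list(COSINES)
    return ols_slope([COSINES[p] for p in personas], [acc_by_persona[p] for p in personas])


# Synthetic accuracies, exactly linear in cosine, for illustration only.
generic = {p: 0.85 + 0.02 * c for p, c in COSINES.items()}
persona = {p: 0.85 + 0.06 * c for p, c in COSINES.items()}

delta_slope = arm_slope(persona) - arm_slope(generic)
print(f"delta_slope = {delta_slope:+.3f}")
```

With 11 points and one anchor at +1.00 far from the rest, the fitted slope is sensitive to the assistant persona's accuracy, which is worth keeping in mind when reading Δslope.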
    
    ### Statistical test
    
    - **Seeds:** N/A (temp=0 → deterministic; no within-cell variance to estimate from seeds).
    - **Bootstrap over questions** (n=1000): for each bootstrap sample of the 1172 questions (with replacement), recompute per-persona accuracy, recompute slopes, recompute Δslope. Build the bootstrap distribution. Two-sided p-value = fraction of bootstrap Δslopes with absolute value larger than |observed|.
    - Pass: observed Δslope > 0 with bootstrap p<0.05.
    
    ### Pipeline
    
    | Stage | What | Compute |
    |---|---|---|
    | 0 | Provision pod + preflight | ~5 min |
    | 1 — gate | 2 personas {assistant, police_officer} × 3 CoT arms × 200 ARC-C subsample. If `Δslope_2pt = (asst − police)_persona-CoT − (asst − police)_generic-CoT ≤ 0`, STOP. | ~10 min |
    | 2 — full | 11 personas × 3 CoT arms × N=1172 ARC-
  7. epm:experiment-implementation· system
    <!-- epm:experiment-implementation v1 -->
    ## Implementation Report (issue #150)
    
    **Branch:** `issue-150` (worktree: `.claude/worktrees/issue-150`)
    **Final commit:** `f14952ffc4f48876157a3be18620304beec391ff`
    **Lint status:** PASS (`uv run ruff check` + `uv run ruff format` clean on all 4 changed files)
    **Pod touched:** No. `epm-issue-150` was not contacted; experimenter handles pod ops after code-review PASS.
    
    ### Files changed
    
    | File | Lines | Description |
    |---|---|---|
    | `src/explore_persona_space/eval/prompting.py` | +112, -1 | Adds `CoTScaffold` dataclass + `NO_COT` / `GENERIC_COT` / `PERSONA_COT` constants. Module docstring documents the wire format produced by each scaffold with worked examples for ARC-C cells. |
    | `src/explore_persona_space/eval/capability.py` | +461 | Adds `evaluate_capability_cot_logprob()` (the hybrid CoT-then-logprob ARC-C eval) plus seven extracted helpers: `_format_arc_user_turn`, `_build_chat_prefix`, `_extract_answer_letter`, `_generate_cot_for_arm`, `_extract_logprobs_for_arm`, `_assemble_persona_block`, `_collect_run_metadata`. |
    | `scripts/run_issue150.py` | +537 (new) | Stage orchestrator with `--stage smoke|gate|full|aggregate`. Validates 11-persona axis matches #80 at startup. Gate stage exits 1 on `delta_slope_2pt <= 0`. Aggregate stage emits 3-panel hero + decomposition figure via paper-plots, with question-bootstrap (n=1000) two-sided p-value for `delta_slope`. |
    | `scripts/smoke_issue150.py` | +130 (new) | CPU-only wire-format smoke test that exercises the chat-template + scaffold concatenation + letter-extractor paths without invoking vLLM. |
    
    Total: **+1,240 / -1** across 4 files.
    
    ### Key design decisions
    
    - **Hybrid logprob extraction.** Two `LLM.generate()` calls per CoT arm: one batched generate for the rationale (temp=0, K=1, top_p=1.0, max_tokens=256), one for `max_tokens=1, logprobs=20`. The letter extractor (`_extract_answer_letter`) scans the top-K decoded tokens for a leading A/B/C/D character and falls back to first-token-id lookup if none match.
    - **Scaffold composition.** The answer-extraction prefix is `<chat>{assistant_prefix}{generated_cot}{closing_tag}{answer_anchor}` so we can force-close `<persona-thinking>` even if the model didn't.
    - **No-cot path** mirrors `_arc_logprob_core`'s convention: appends `"The correct answer is ("` to the user turn and reads the first assistant-turn token.
    - **11-persona axis** is asserted against `personas.PERSONAS` + `ASSISTANT_COSINES` at script startup. Drift triggers a hard `RuntimeError`.
    - **Bootstrap p-value** resamples question indices with replacement (paired across personas) and recomputes `delta_slope` per resample; two-sided p = `mean(|delta_b - mean(delta)| >= |observed|)`.
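
The paired question bootstrap in the last bullet can be sketched as follows. This is a minimal stand-in, not the code in `scripts/run_issue150.py`: the data layout (`correct[arm][persona]` as a list of 0/1 per question), the function names, and the toy data are all assumptions. The p-value follows the centered formula quoted in the bullet.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)


def delta_slope(correct, cosines, idx):
    """Slope(persona-cot) minus slope(generic-cot) on the questions in idx."""
    idx = list(idx)
    slopes = {}
    for arm in ("persona-cot", "generic-cot"):
        xs = list(cosines.values())
        ys = [sum(correct[arm][p][i] for i in idx) / len(idx) for p in cosines]
        slopes[arm] = ols_slope(xs, ys)
    return slopes["persona-cot"] - slopes["generic-cot"]


def bootstrap_p(correct, cosines, n_boot=1000, seed=0):
    rng = random.Random(seed)
    n_q = len(next(iter(correct["persona-cot"].values())))
    observed = delta_slope(correct, cosines, range(n_q))
    deltas = []
    for _ in range(n_boot):
        # One shared resample index for every persona and arm (paired design).
        idx = [rng.randrange(n_q) for _ in range(n_q)]
        deltas.append(delta_slope(correct, cosines, idx))
    center = sum(deltas) / n_boot
    # Two-sided p: share of centered resamples at least as extreme as observed.
    p = sum(abs(d - center) >= abs(observed) for d in deltas) / n_boot
    return observed, p


# Toy data: 3 personas, 50 questions, random 0/1 correctness (not real results).
data_rng = random.Random(1)
cosines = {"assistant": 1.0, "librarian": -0.081, "police_officer": -0.399}
correct = {
    arm: {p: [int(data_rng.random() < 0.8) for _ in range(50)] for p in cosines}
    for arm in ("persona-cot", "generic-cot")
}
observed, p_value = bootstrap_p(correct, cosines, n_boot=200)
print(f"observed={observed:+.4f}, p={p_value:.3f}")
```

Centering on the bootstrap mean turns the resampled distribution into an approximate null around zero, and sharing one index list across personas and arms keeps question difficulty paired across all cells.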
    
    ### Plan deviations
    - None. Allowed-without-asking deviations (per the brief): max_tokens stayed at 256; ascii-ified `delta`/`x`/`-` in docstrings and printable strings to satisfy ruff's RUF001/RUF002 (purely cosmetic, no semantic change).
    - The brief said "smoke test ... `--stage smoke --model facebook/opt-125m` on the LOCAL VM (no GPU needed for smoke)". vLLM 0.11 requires CUDA; the local VM has no GPU. I added a sibling `scripts/smoke_issue150.py` that exercises the *wire format* (chat-template rendering, scaffold concatenation, letter extraction) without invoking vLLM, which is what the brief described as the actual goal of the smoke test (`smoke test is about the wire format, not the answer quality`). The vLLM-bound `--stage smoke` path is still in `run_issue150.py` and will run as documented on the pod.
    
    ### Smoke test result (local VM, CPU-only, no GPU)
    
    ```
    $ uv run python scripts/smoke_issue150.py
    [smoke] Loading tokenizer: Qwen/Qwen2.5-7B-Instruct
    [smoke] ARC-C question[0]:
      Q: An astronomer observes that a planet rotates faster after a meteorite impact.
         Which is the most likely effect of this increase in rotation?
      Choices: ['Planetary density will decrease.', 'Planetary years will become longer.',
                'Planetary days will b
  8. epm:code-review· system
    <!-- epm:code-review v1 -->
    ## Code Review — CONCERNS
    
    **Verdict:** CONCERNS — advances to running, with documented risks the analyzer must explicitly verify.
    
    ### Verified
    - 11-persona axis matches plan (`assistant + ASSISTANT_COSINES`); `_validate_persona_axis()` runs at module-import time and asserts cardinality, name set, and PERSONA_ORDER agreement (`scripts/run_issue150.py:99-118`).
    - 3 CoT scaffolds defined exactly as plan §4 (`no-cot`, `generic-cot`, `persona-cot`) with the right `assistant_prefix` / `closing_tag` / `answer_anchor` semantics (`eval/prompting.py:NO_COT/GENERIC_COT/PERSONA_COT`).
    - Hybrid CoT-then-logprob protocol: per arm, two batched `LLM.generate` calls — one at `temp=0, top_p=1.0, n=1, max_tokens=256` for the rationale, one at `max_tokens=1, logprobs=20` for the answer letter (`eval/capability.py::_generate_cot_for_arm`, `_extract_logprobs_for_arm`). Single batched call across all (persona, question) cells per step — not per-cell.
    - Decision-gate sign matches plan §6 verbatim: `Δslope_2pt = (asst_persona − police_persona) − (asst_generic − police_generic)`, kill on `<= 0` with `sys.exit(1)` (`scripts/run_issue150.py:225-260`).
    - Bootstrap question-resampling preserves paired structure (one shared resample index applied to all personas), recomputes both per-arm slopes, two-sided p-value at `n=1000` (`scripts/run_issue150.py:_bootstrap_delta_slope`).
    - Reproducibility metadata includes `git_commit`, ISO-8601 UTC timestamp, `vllm/transformers/torch` versions, `n_questions`, `cot_arms`, `n_personas`, `cot_max_tokens`, `model` (`eval/capability.py::_collect_run_metadata`).
    - `vLLM` engine reuse: `create_vllm_engine` + `cleanup_vllm` from `eval/generation.py` are reused; engine is created once and torn down in a `finally` block.
    - Output paths under `eval_results/issue150/{smoke,gate,full}/...` per plan.
    - ARC-C `--stage full` uses full N=1172 by default (`n_questions=None` → full set).
    - Lint: `uv run ruff check scripts/run_issue150.py scripts/smoke_issue150.py src/explore_persona_space/eval/capability.py src/explore_persona_space/eval/prompting.py` → **PASS** (no new issues introduced; pre-existing repo-wide lint debt is out of scope).
    - Format: `uv run ruff format --check` on the four diff files → **PASS**.
    - Smoke test (`scripts/smoke_issue150.py`) runs end-to-end on the local VM with no GPU; chat-template wire format renders correctly for all three arms; the synthetic-logprob extractor returns the expected letter on both happy-path and "non-letter top entry" cases.
    - `eval/capability.py` modifications are strictly additive (new function block at bottom). No edits to `evaluate_capability_per_persona`, `_arc_logprob_core`, `evaluate_hellaswag_per_persona`, etc. No regression risk on existing callers.
    
    ### Concerns (non-blocking, but the analyzer MUST inspect for these in the gate stage before paying for the full sweep)
    
    1. **`_extract_answer_letter` Path 1 over-matches non-letter tokens that happen to start with A/B/C/D.** The scan iterates the entire top-K dict and accepts any decoded token whose `strip().upper()[0]` is in `{A,B,C,D}`. Qwen has many high-frequency single tokens that start with these letters (`Apple`, `About`, `Actually`, `And`, `But`, `Because`, `Cat`, `Dog` are all single tokens — verified with the local tokenizer). After `\nAnswer: ` (cot arms), the model is *usually* anchored to a single letter, but not guaranteed. If a persona-cot tail makes the model lean into a continuation word like ` Actually` (-0.3) over the bare letter ` A` (-1.5), Path 1 returns "A" off the wrong evidence. The bias is **asymmetric across arms** — persona-cot has more flowery rationales, so it's the most affected. The existing `_arc_logprob_core` avoids this by only scoring the canonical letter token IDs directly. **Recommended hardening:** restrict Path 1 to letters whose decoded token, after `strip()`, equals exactly "A"/"B"/"C"/"D" (length-1), OR build a fixed allowlist of A/B/C/D-as-answer token IDs (`tok
  9. epm:launch· system
    <!-- epm:launch v1 -->
    ## Experiment dispatched
    
    **Worktree:** `.claude/worktrees/issue-150` (branch `issue-150`)
    **PR:** https://github.com/superkaiba/explore-persona-space/pull/178 (draft)
    **Pod:** `epm-issue-150` (216.243.220.218:19230, 1× H100 80GB)
    **Branch HEAD:** `f14952ffc4f48876157a3be18620304beec391ff`
    **Code-review verdict:** CONCERNS (advances; non-blocking). See https://github.com/superkaiba/explore-persona-space/issues/150#issuecomment-4361727393.
    
    **Reproduction (per plan v3):**
    
    ```bash
    ssh epm-issue-150 'cd /workspace/explore-persona-space && \
      git checkout issue-150 && git pull && \
      nohup uv run python scripts/run_issue150.py --stage smoke > logs/issue150_smoke.log 2>&1 && \
      nohup uv run python scripts/run_issue150.py --stage gate > logs/issue150_gate.log 2>&1 && \
      nohup uv run python scripts/run_issue150.py --stage full > logs/issue150_full.log 2>&1 && \
      uv run python scripts/run_issue150.py --stage aggregate'
    ```
    
    **Reviewer mandates (from `epm:code-review v1`):**
    - Gate-stage audit: dump top-5 (decoded_token, logprob) on ~30 cells; flag if non-letter A-prefixed tokens (Apple, Actually) appear with high logprob.
    - After full stage: `scp` result JSONs to local VM (no WandB upload path in current implementation).
    
    Experimenter agent dispatched.
    <!-- /epm:launch -->
  10. epm:results· system
    <!-- epm:results v1 -->
    ## Results — KILLED AT GATE (Δslope_2pt < 0)
    
    **Verdict:** Hypothesis falsified at gate. Plan v3 §6 kill-rule fired (`delta_slope_2pt <= 0 -> exit 1`); Stage 2 (full N=1172 sweep) NOT run, ~0.75 GPU-hr saved.
    
    **Branch:** `issue-150` @ `df8f8f2` (results commit) / `9798de2` (run commit)
    **Pod:** `epm-issue-150` (1× H100 80GB) — wall time ~15 min total (smoke 6m + gate 5m + audit 4m), ~0.25 GPU-hr used
    
    ### Headline numbers
    
    ARC-Challenge accuracy, N=200 (gate head), seed 42, temp=0:
    
    | Persona | no-cot | generic-cot | persona-cot |
    |---|---|---|---|
    | assistant | 0.785 | 0.800 | **0.765** |
    | police_officer | 0.785 | 0.805 | **0.870** |
    
    - Δ(assistant − police, persona-cot) = **−0.105**
    - Δ(assistant − police, generic-cot) = −0.005
    - **Δslope_2pt = −0.100** (predicted ≥ +0.05; observed direction is the *opposite*)
    
    Persona-CoT did NOT amplify the assistant-aligned advantage; it actually *reversed* the gap (police_officer outperforms assistant by +0.105 with persona-CoT). Hypothesis (Option B: persona-CoT amplifies persona expression on capability) is rejected at the gate stage.
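The headline statistic can be re-derived from the table in a few lines. This is a minimal sketch that hard-codes the gate accuracies above; the helper names (`gap`, `delta_slope_2pt`, `gate_passes`) are illustrative and are not taken from `scripts/run_issue150.py`. The kill rule is written as quoted from plan v3 §6 (`delta_slope_2pt <= 0` fails the gate).

```python
# Gate accuracies copied from the headline table (N=200, seed 42, temp=0).
ACC = {
    "assistant": {"no-cot": 0.785, "generic-cot": 0.800, "persona-cot": 0.765},
    "police_officer": {"no-cot": 0.785, "generic-cot": 0.805, "persona-cot": 0.870},
}


def gap(arm: str) -> float:
    """Assistant-minus-police accuracy gap for one CoT arm."""
    return ACC["assistant"][arm] - ACC["police_officer"][arm]


def delta_slope_2pt() -> float:
    """Difference of gaps: persona-cot gap minus generic-cot gap."""
    return gap("persona-cot") - gap("generic-cot")


def gate_passes(delta: float) -> bool:
    """Kill rule as quoted from plan v3: delta <= 0 fails the gate."""
    return delta > 0


if __name__ == "__main__":
    delta = delta_slope_2pt()
    print(f"gap(persona-cot) = {gap('persona-cot'):+.3f}")
    print(f"gap(generic-cot) = {gap('generic-cot'):+.3f}")
    print(f"delta_slope_2pt  = {delta:+.3f}")
    print("gate:", "PASS" if gate_passes(delta) else "KILL")
```

Because the statistic is a difference of differences, any effect shared by both personas within an arm cancels out; only the persona-specific change between the two CoT arms remains.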
    
    ### Reviewer-mandated logprob audit (concern #1)
    
    Top-5 (decoded_token, logprob) at the answer position for 90 cells (15 questions × 2 personas × 3 arms). See `eval_results/issue150/gate/logprob_audit.json`.
    
    **Reproduction of `_extract_answer_letter` Path 1 over top-5 only:**
    
    | Arm | Bare-letter top-5 pick | Word-letter top-5 pick | No letter in top-5 |
    |---|---|---|---|
    | no-cot | 6/30 (20%) | 8/30 (27%) | 16/30 (53%) |
    | generic-cot | 20/30 (67%) | 6/30 (20%) | 4/30 (13%) |
    | persona-cot | 25/30 (83%) | 1/30 (3%) | 4/30 (13%) |
    
    **Sample top-5 (assistant persona):**
    
    ```
    arm=no-cot q=0:  'To' -0.246  'The' -2.121  'When' -2.371  'Let' -5.746  'An' -6.871
    arm=generic-cot q=0:  ' gravity' -0.679  ' the' -1.679  ' (' -2.179  ' \\' -2.804  ' we' -3.742
    arm=persona-cot q=0:  ' (' -0.283  ' \\' -1.783  ' \\(' -3.408  ' The' -3.658  ' C' -4.783
    ```
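The three table columns can be reproduced per cell with a small classifier. The sketch below assumes Path 1 picks the highest-logprob entry whose stripped, upper-cased first character is one of A/B/C/D, which matches the behaviour described in the code review but has not been checked against `eval/capability.py`. The function name `classify_top_k` is hypothetical; the sample rows are the three shown above.

```python
LETTERS = ("A", "B", "C", "D")


def classify_top_k(top_k: list[tuple[str, float]]) -> str:
    """Label one cell as 'bare-letter', 'word-letter', or 'no-letter'.

    Assumed Path-1 rule: take the highest-logprob token whose stripped,
    upper-cased first character is A/B/C/D.
    """
    for token, _logprob in sorted(top_k, key=lambda kv: kv[1], reverse=True):
        stripped = token.strip()
        if stripped and stripped[0].upper() in LETTERS:
            # Exactly "A"/"B"/"C"/"D" counts as a bare letter; anything
            # longer ("An", " carbohydrates") is a word-prefixed match.
            return "bare-letter" if stripped in LETTERS else "word-letter"
    return "no-letter"


# Sample top-5 rows from the audit (assistant persona, q=0).
SAMPLES = {
    "no-cot": [("To", -0.246), ("The", -2.121), ("When", -2.371),
               ("Let", -5.746), ("An", -6.871)],
    "generic-cot": [(" gravity", -0.679), (" the", -1.679), (" (", -2.179),
                    (" \\", -2.804), (" we", -3.742)],
    "persona-cot": [(" (", -0.283), (" \\", -1.783), (" \\(", -3.408),
                    (" The", -3.658), (" C", -4.783)],
}

if __name__ == "__main__":
    for arm, top_k in SAMPLES.items():
        print(f"{arm:12s} -> {classify_top_k(top_k)}")
```

On these three rows the classifier lands in three different columns, which is a handy spot check before running it over all 90 audited cells.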
    
    **Audit findings (flag for analyzer):**
    
    1. **Top-1 next-token is rarely a bare letter.** CoT arms emit `(` (60-83%) — the model wants `(C)` not bare `C`. The no-cot arm emits sentence-starters (`To`, `The`) — the model wants to START a sentence, not emit a letter, after `\nThe correct answer is (`.
    2. **Word-prefix mis-match:** sample words extracted as letters by Path 1 include `'Ah' -> A`, `'Alright' -> A`, `'Before' -> B`, `'An' -> A`, `' carbohydrates' -> C`, `' directly' -> D`. These are concentrated in the no-cot arm (8/30) and generic-cot arm (6/30); persona-cot has only 1/30.
    3. **Asymmetry across arms** is real but **does not favor any specific persona**, since all personas see the same protocol. The asymmetry inflates noise rather than introducing a systematic bias toward any answer letter, so it does not explain the wrong-sign Δslope_2pt.
    
    The actual gate run uses `logprobs=20` (not just top-5), so Path-2 (token-id fallback for `A`/`B`/`C`/`D`) likely catches many of the no-cot "no-letter-in-top5" cells. But the asymmetry stands: no-cot's letter prediction is dominantly Path-2-driven while CoT arms are Path-1-driven. Recommend that any follow-up restrict Path 1 to bare single-character letter tokens (length-1 stripped) or use only Path 2.
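One possible shape for the recommended hardening is sketched below. It assumes the extractor receives a mapping from decoded token string to logprob for the answer position; the name `extract_bare_letter` and that input shape are illustrative, not the signature of `_extract_answer_letter`. Only tokens that strip to exactly one of A/B/C/D are accepted, so word-prefixed tokens such as ` Actually` or `An` are ignored.

```python
from typing import Mapping, Optional

ANSWER_LETTERS = frozenset("ABCD")


def extract_bare_letter(top_logprobs: Mapping[str, float]) -> Optional[str]:
    """Return the best-scoring bare answer letter, or None if there is none.

    A token qualifies only if, after strip(), it is exactly "A", "B", "C",
    or "D". Variants like "A" and " A" map to the same letter and the
    higher logprob wins.
    """
    best_letter: Optional[str] = None
    best_logprob = float("-inf")
    for token, logprob in top_logprobs.items():
        candidate = token.strip()
        if candidate in ANSWER_LETTERS and logprob > best_logprob:
            best_letter, best_logprob = candidate, logprob
    return best_letter


if __name__ == "__main__":
    # The continuation word outranks the bare letter but is ignored.
    print(extract_bare_letter({" Actually": -0.3, " A": -1.5, " B": -2.0}))
    # No bare letter present: the caller should fall back or record None.
    print(extract_bare_letter({"To": -0.2, "An": -6.9}))
```

Returning `None` rather than guessing keeps the failure visible, so the share of cells that need the token-id fallback can be reported per arm and per persona.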
    
    ### Reproducibility card
    
    | Field | Value |
    |---|---|
    | Model | `Qwen/Qwen2.5-7B-Instruct` (HF revision `a09a35458c702b33eeacc393d103063234e8bc28`) |
    | Branch / commit | `issue-150` @ `9798de2` (run); `df8f8f2` (results) |
    | Pod | `epm-issue-150` (RunPod 1× H100 80GB) |
    | Stages run | smoke (PASS), gate (FAIL) |
    | Stages NOT run | full (killed), aggregate (no full data to aggregate) |
    | Personas (gate) | `assistant` (+1.0 cosine), `police_officer` (−0.399 cosine) |
    | ARC-C data | `raw/arc_challenge/test.jsonl` head N=200, full N=1172 |
    | Sampling | temp=0.0, top_p=1.0, n=1, max_tokens (CoT)=256, max_tokens (logprob)=1, logprobs=20 |
    | Seed | 42 (vLLM engine seed) |
    | Wall time | smoke 6 min, gate 5 min, audit 4 min |
    | GPU-hours | ~0.25 used of 1.0 budgeted |
  11. epm:upload-verification· system
    <!-- epm:upload-verification v1 -->
    ## Upload Verification
    
    **Verdict: PASS**
    
    | Artifact | Required? | Status | Notes |
    |----------|-----------|--------|-------|
    | Model on HF Hub | N/A (no training) | SKIP | Experiment killed at gate; no checkpoint generated |
    | Eval JSON on disk — `run_result.json` | Yes | PASS | Parses cleanly; environment block carries `transformers`, `torch`, `vllm`, `commit_at_run`, commands |
    | Eval JSON on disk — `gate/result.json` | Yes | PASS | Parses cleanly; full reproducibility metadata present (`git_commit`, `timestamp`, `vllm_version`, `transformers_version`, `torch_version`, `n_questions`, `n_personas`, `cot_arms`) |
    | Eval JSON on disk — `gate/logprob_audit.json` | Yes | PASS | Parses cleanly; audit file (no top-level metadata fields, acceptable for a raw audit log) |
    | Eval JSON on disk — `smoke/result.json` | Yes | PASS | Parses cleanly; full reproducibility metadata present |
    | All 4 JSONs committed to branch `issue-150` | Yes | PASS | All 4 files present in commit `df8f8f2` ("Add issue #150 results: persona-CoT x ARC-C (Qwen2.5-7B-Instruct, killed at gate)") |
    | Branch `issue-150` pushed to `origin` | Yes | PASS | `origin/issue-150` = `df8f8f20422b5c0476a2ecaf655a96ec6c8e34b5` matches local HEAD |
    | WandB run / metrics | N/A (deviation) | SKIP | No WandB upload in current code path; acknowledged as deviation in `epm:results`. Gate-only kills don't require WandB per current policy. |
    | Dataset on HF Hub | N/A (no generation) | SKIP | No new dataset produced |
    | Figures in `figures/issue150/` | N/A (aggregate stage not run) | PASS (absent as expected) | Directory does not exist; correct — no full sweep ran |
    | Local model weights cleaned | N/A (no training) | SKIP | No safetensors generated |
    | Pod `epm-issue-150` stopped | Yes | PASS | Stopped at 2026-05-01T22:20:57Z; auto-terminates after 7 days idle |
    
    **Missing:** None
    
    **Notes:**
    - `run_result.json` does not carry top-level `git_commit` / `timestamp` fields. Reproducibility metadata is present but embedded inside the `environment` sub-object as `commit_at_run` + per-field versions. Not a blocker (data is there), but future runs should surface these at top level for the metadata validator.
    - `gate/logprob_audit.json` is a raw audit log and carries no top-level metadata fields. Acceptable for this file type.
    - WandB absence is a known deviation (acknowledged in `epm:results`). No action required for gate-only kill.
    
    **Orchestrator instruction:** Advance to `status:interpreting`.
    <!-- /epm:upload-verification -->
  12. epm:interpretation· system
    <!-- epm:interpretation v1 -->
    ## TL;DR
    
    ### Background
    
    Building on #75 (c5 EM lineage), #80 (11-persona cosine axis), and the cot-axis-tracking analysis, we asked whether eval-time persona-CoT (a `<persona-thinking>…</persona-thinking>` scaffold the model fills in before answering) would *amplify* the persona-induced ARC-Challenge accuracy spread on Qwen2.5-7B-Instruct relative to a generic-CoT control. Plan v3 pre-registered Option B (positive Δslope; bootstrap p<0.05) and a 2-persona gate that would kill the full 11-persona × N=1172 sweep on `Δslope_2pt ≤ 0` to save the remaining ~0.75 GPU-hr.
    
    ### Methodology
    
    `Qwen/Qwen2.5-7B-Instruct` (no fine-tune) eval'd at temp=0 with a hybrid CoT-then-logprob protocol: per (persona, question, arm) cell, generate a 256-token rationale at K=1 then read logprobs over `A`/`B`/`C`/`D` tokens at the answer position. The gate ran 3 CoT arms × {assistant (cos = +1.00), police_officer (cos = −0.40)} × N=200 ARC-C questions on 1× H100; the headline statistic `Δslope_2pt = (asst − police)_persona-cot − (asst − police)_generic-cot` was compared to the pre-registered kill threshold.
    
    ### Results
    
    ![Persona-CoT does not amplify persona-induced ARC-C spread; persona-cot REVERSES the gap](https://raw.githubusercontent.com/superkaiba/explore-persona-space/cf7f156bbfe5dd52ec39516cca58196bd358fd03/figures/issue150/gate_arc_accuracy_by_cot.png)
    
    The 2-persona × 3-arm gate (N=200 questions per cell, single seed, temp=0) shows assistant/police_officer accuracies of 78.5%/78.5% under no-cot and 80.0%/80.5% under generic-cot (gap roughly closed), but persona-cot drops the assistant cell to 76.5% while lifting police_officer to 87.0% — a `Δslope_2pt = −0.10` versus the pre-registered prediction of `≥ +0.05`.
    
    **Main takeaways:**
    
    - **Persona-CoT did not amplify persona-induced ARC-C variation; it reversed the gap (`Δslope_2pt = −0.10` vs predicted `≥ +0.05`, N=200, single seed, temp=0).** This is wrong-sign at the gate, not noise-limited indistinguishability — the kill rule fired and the full 11-persona × 1172-question sweep was not run, per plan v3 §6.
    - **Police_officer outperformed assistant by 10.5 percentage points under persona-CoT** (87.0% vs 76.5%, N=200) — the opposite of the "persona-CoT lets the assistant warm into character while pulling other personas out of competence" prior. The simple Option-B story is falsified at the 2-point gate on this base model.
    - **Persona-CoT generations did NOT visibly stay in character** — sample CoTs from both `assistant` and `police_officer` produced essentially the same neutral analytical reasoning, which weakens the mechanism the hypothesis was built on (the model "thinking as the persona") and shifts the load-bearing mechanism toward whatever else the scaffold changed (rationale style, decisiveness, length).
    - **A real letter-extraction asymmetry across CoT arms exists but does not explain the wrong-sign result.** Gate-stage logprob audit (90 cells) shows CoT arms get a bare letter into top-5 in 67-83% of cells while the no-cot arm does so in only 20% (47% when word-prefixed matches are counted); the asymmetry inflates cross-arm noise but does not bias toward any specific persona, so the wrong-sign Δslope is not an artifact of the extractor.
    - **Reasonable inverse mechanism candidate:** at temp=0, a "tough-talking" persona-CoT scaffold may produce more decisive deterministic step-by-step rationales than the assistant's hedged academic style, mechanically lifting accuracy on a 4-choice MCQ — orthogonal to "persona expression." This would be testable at temp>0 with K samples per cell.
    
    **Confidence: LOW** — gate-only with 2 personas, N=200 questions, 1 seed, 1 model, plus two compat hot-fixes on the eval stack and a known letter-extraction asymmetry across arms; the wrong-sign direction is real at the 2-point gate but characterizing it requires the full 11-persona axis, which the pre-registered kill rule explicitly does not warrant.
    
    ### Next steps
    
    - **Test the inverse hypothesis (persona-CoT *dampens* assistant-ali
  13. epm:analysis· system
    <!-- epm:analysis v1 -->
    ## Analyzer output → clean-result issue #182
    
    **Clean-result issue:** #182 — *Persona-CoT does not amplify ARC-C persona spread on Qwen2.5-7B-Instruct (LOW confidence — 2-persona gate kill)*  (`clean-results:draft`)
    
    **Hero figure:** https://raw.githubusercontent.com/superkaiba/explore-persona-space/cf7f156bbfe5dd52ec39516cca58196bd358fd03/figures/issue150/gate_arc_accuracy_by_cot.png
    
    **Recap:** The 2-persona × 3-arm gate (N=200, seed 42, temp=0) produced `Δslope_2pt = −0.10` versus the pre-registered prediction of `≥ +0.05` — a wrong-sign result that fired plan v3's kill rule and saved ~0.75 GPU-hr by aborting the full 11-persona × 1172-question sweep. Persona-CoT *reversed* the assistant/police_officer gap (police 87.0% vs assistant 76.5%) and the sample CoTs show the scaffold did not visibly keep the model in character, so any future re-run should target the inverse hypothesis at K=200 on the full axis after hardening the letter extractor. Confidence is LOW (gate-only, single seed, single model).
    <!-- /epm:analysis -->
    
  14. epm:interp-critique· system
    <!-- epm:interp-critique v1 -->
    ## Interpretation Critique — Round 1
    
    **Verdict: REVISE**
    
    The interpretation correctly frames the result as wrong-sign (not noise-limited indistinguishability), correctly fires the kill rule, and correctly assigns LOW. But it under-investigated the raw `gate/result.json` and stopped at hand-wavy mechanism speculation ("rationale style / decisiveness / bracket-anchoring") when the per-question data contains a much sharper, quantifiable confound that should be in the takeaways.
    
    ### Lens 1 — Overclaims
    
    - **Takeaway 4 ("asymmetry inflates cross-arm noise but does not bias toward any specific persona") is too strong.** The interpretation's claim relies on the within-arm-across-cell audit summary (67-83%/47%). Re-running the audit categorisation per-(persona,arm) on `gate/logprob_audit.json` shows real per-persona asymmetry within arms: under `persona-cot`, `assistant` has 10/15 bare-letter and 4/15 no-letter cells, while `police_officer` has 15/15 bare-letter and 0/15 no-letter. Under `no-cot`, `assistant` has 0/15 bare-letter and 14/15 no-letter, while `police_officer` has 6/15 bare-letter and 2/15 no-letter. The audit *does* show per-persona differences in extraction quality — claiming it doesn't is a misread of the audit. Suggested weakening: "asymmetry exists across arms AND across personas-within-arms, but at the final-pred level both personas have identical None-pred counts (3/200) under persona-cot, so the wrong-sign Δslope is not a pure extraction artifact" (which is the actually-defensible claim).
    
    - **Takeaway 1 framing "wrong-sign at the gate, not noise-limited" is correct in its direction but slightly overconfident on a single 200-question deterministic point estimate with no within-cell variance.** The 21-question swing is well above pure-Bernoulli noise on N=200 (1-σ is ~2.8pp per cell at p=0.8, so 10.5pp is roughly 3.7σ against a single cell, or about 2.6σ against the ~4.0pp standard error of a difference between two independent cells). But the headline statistic Δslope_2pt is itself a *difference of differences* with two persona conditions and no replication, so calling it "not noise-limited" is a gate-level statement, not a sweep-level statement. Add the qualifier: "wrong-sign at the gate on this 200-question slice, single deterministic seed; no within-cell replication."
    
    - **Takeaway 5 ("inverse mechanism candidate: tough-talking persona produces decisive rationales") is speculation presented as a leading explanation.** The actual data shows the opposite of "decisive": police_officer CoTs are *shorter* (mean 915 vs 1128 chars under persona-cot — 213-char gap that does not appear in generic-cot, where the gap is −24 chars). The mechanism isn't "tougher = more decisive" — it's "fewer tokens used = CoT actually finishes within max_tokens=256 = scaffold cleanly closes the `</persona-thinking>` tag = `\nAnswer: ` lands on a clean answer-position, not mid-sentence." That's a structural artifact of the scaffold + budget, not "persona expression." Weaken to "candidate inverse mechanism not verified" and add the length/completion finding (Lens 2 below) as the actually-load-bearing observation.
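The Bernoulli noise figure in the Takeaway 1 bullet can be sanity-checked by computing the binomial standard error directly. The sketch below uses p=0.8 and N=200 from that bullet and the persona-cot accuracies from the gate table. Treating the two cells as independent is an assumption: both personas answer the same 200 questions, so a paired analysis could give a different standard error.

```python
import math


def bernoulli_se(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n Bernoulli trials."""
    return math.sqrt(p * (1.0 - p) / n)


N = 200
se_cell = bernoulli_se(0.8, N)
# Difference of two cells, assuming independence (an assumption, see above).
se_diff = math.sqrt(2.0) * se_cell
# police_officer minus assistant under persona-cot, from the gate table.
observed_gap = 0.870 - 0.765

if __name__ == "__main__":
    print(f"per-cell 1-sigma:       {100 * se_cell:.1f} pp")
    print(f"difference 1-sigma:     {100 * se_diff:.1f} pp")
    print(f"gap / per-cell sigma:   {observed_gap / se_cell:.1f}")
    print(f"gap / difference sigma: {observed_gap / se_diff:.1f}")
```

Either way the gap sits well outside one standard error, but the difference-based figure is the more conservative of the two and is the one to quote when comparing two cells.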
    
    ### Lens 2 — Surprising patterns (independent audit)
    
    These were derivable from `gate/result.json` and were not mentioned in the interpretation:
    
    - **Persona-CoT length differential is dramatic and *only* appears under persona-cot, not generic-cot.** Mean CoT length: assistant `persona-cot` 1128 chars, `generic-cot` 958; police `persona-cot` 915, `generic-cot` 982. Under generic-cot the asst-vs-police length gap is −24 chars; under persona-cot it is +213 chars. The persona-cot scaffold *amplifies* a length asymmetry between the two persona system prompts, while generic-cot does not. (Source: `eval_results/issue150/gate/result.json`, character-length over `persona_cot_text` field.)
    
    - **The +10.5pp gap is dominantly carried by the "incomplete CoT" stratum.** Stratifying by whether the CoT actually closed `</persona-thinking>` or stated "correct answer is" near the end:
      - Both personas' CoTs completed (N=83/200 questions): a
  15. epm:interpretation· system
    <!-- epm:interpretation v2 -->
    ## Interpretation v2 — round-2 revision
    
    Round-2 revision of the clean-result body for #182, addressing all 7 specific findings from `<!-- epm:interp-critique v1 -->`. Title updated to surface BOTH the wrong-sign reversal AND the truncation-mechanism suspect:
    
    > Persona-CoT REVERSES ARC-C asst-aligned advantage on Qwen2.5-7B-Instruct; truncation × tag-injection is the dominant suspect (LOW confidence)
    
    Hero figure unchanged (`figures/issue150/gate_arc_accuracy_by_cot.{png,pdf}` @ commit `cf7f156`); only the framing has changed.
    
    Body verifier output: `Result: PASS (WARNs acknowledged).`
    
    ---
    
    ## TL;DR
    
    ### Background
    
    Building on #75 (c5 EM lineage), #80 (11-persona cosine axis), and the cot-axis-tracking analysis (`research_log/drafts/2026-04-09_cot_axis_tracking_analysis.md` — Qwen3-32B reasoning shows smooth, slow assistant-axis drift inside `<think>`, autocorrelations 0.57-0.74 across L16/L32/L48), we asked whether eval-time persona-CoT (a `<persona-thinking>…</persona-thinking>` scaffold the model fills in before answering) would *amplify* the persona-induced ARC-Challenge accuracy spread on Qwen2.5-7B-Instruct relative to a generic-CoT control. Behavioral-output spread (this experiment) and assistant-axis projection (the cot-axis-tracking draft) are different geometries — smooth axis drift inside CoT does not by itself imply larger output-level persona spread, and the question here is whether the scaffold *steers* that drift into bigger ARC-C accuracy gaps. Plan v3 pre-registered Option B (positive Δslope; bootstrap p<0.05) and a 2-persona gate that would kill the full 11-persona × N=1172 sweep on `Δslope_2pt ≤ 0` to save the remaining ~0.75 GPU-hr.
    
    ### Methodology
    
    `Qwen/Qwen2.5-7B-Instruct` (no fine-tune) eval'd at temp=0 with a hybrid CoT-then-logprob protocol: per (persona, question, arm) cell, generate a 256-token rationale at K=1 then read logprobs over `A`/`B`/`C`/`D` tokens at the answer position. The gate ran 3 CoT arms × {assistant (cos = +1.00), police_officer (cos = −0.40)} × N=200 ARC-C questions on 1× H100; the headline statistic `Δslope_2pt = (asst − police)_persona-cot − (asst − police)_generic-cot` was compared to the pre-registered kill threshold.
    
    ### Results
    
    ![Persona-CoT does not amplify persona-induced ARC-C spread; persona-cot REVERSES the gap](https://raw.githubusercontent.com/superkaiba/explore-persona-space/cf7f156bbfe5dd52ec39516cca58196bd358fd03/figures/issue150/gate_arc_accuracy_by_cot.png)
    
    The 2-persona × 3-arm gate (N=200 questions per cell, single seed, temp=0) shows assistant/police_officer accuracies of 78.5%/78.5% under no-cot and 80.0%/80.5% under generic-cot (gap roughly closed), but persona-cot drops the assistant cell to 76.5% while lifting police_officer to 87.0% — a `Δslope_2pt = −0.10` versus the pre-registered prediction of `≥ +0.05`.
    
    **Main takeaways:**
    
    - **The +10.5pp wrong-sign gap is dominantly carried by an "incomplete CoT × closing-tag-injection" stratum, not by persona expression.** Persona-cot CoT length differs by +213 chars between assistant (mean 1128) and police_officer (mean 915), while generic-cot's length gap is only −24 chars (N=200) — the persona-cot scaffold creates the asymmetry, not the persona system prompt alone. Stratifying the 200 gate questions by CoT completion (`</persona-thinking>` self-closure or "correct answer is X" near the end) shrinks the gap from −10.5pp overall to −2.4pp on the both-completed stratum (N=83) and widens it to −14.8pp on the neither-completed stratum (N=54); self-closure rates are 24% (asst) vs 42% (police). This shifts my mechanism from "persona expression / decisiveness" to "max_tokens=256 truncation × the unconditional `</persona-thinking>\nAnswer:` injection in `eval/capability.py`": when assistant's longer CoTs run out of budget mid-thought, the appended closing tag pivots to `Answer:` with the LAST option discussed salient.
    - **Persona-CoT did not amplify persona-induced ARC-C variation; 
  16. epm:interp-critique· system
    <!-- epm:interp-critique v2 -->
    ## Interpretation Critique v2 — PASS
    
    **Verdict:** PASS
    
    v2 incorporates all 7 round-1 revisions in substance and quantitatively, with the right framing shifts. The wrong-sign result is now correctly leading-attributed to truncation × tag-injection (with appropriate "partial confound" hedging), the audit-asymmetry takeaway is properly softened with the per-(persona,arm) numbers, the cot-axis-tracking prior is engaged substantively in both Background and Takeaway #3, the confidence-binding sentence is tighter, the title surfaces both the reversal and the suspect mechanism, the Next-steps list correctly puts truncation-test BEFORE inverse-hypothesis follow-up, and Standing caveats names the per-(persona,arm) skew with exact 0/15 vs 6/15 / 10/15 vs 15/15 numbers.
    
    I independently re-ran the audit JSON and gate result.json on branch `issue-150`. Per-(persona,arm) bare-letter top-5 counts in v2 reproduce exactly. CoT length differential reproduces (asst persona-cot mean 1128, police 915, gap +213; generic-cot gap −24). Within-arm flips reproduce (asst R→W=23, W→R=16; police R→W=8, W→R=21). Swing-question letter distribution reproduces (D=9, B=7 within the N=21 swing slice when None preds excluded; total swing N=26 if None preds included). Self-closure / completion-stratum exact counts depend on the completion-definition: a naive `</persona-thinking>` substring gives 40%/68% self-closure and N=101 both-completed / N=44 neither-completed, vs v1/v2's 24%/42% and N=83/N=54, but the qualitative direction (asst CoTs truncate more often, gap collapses to roughly −2pp on both-completed and widens to roughly −15pp on neither-completed) is robust across definitions.
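The within-arm flip counts can be cross-checked against the headline accuracies without touching the raw JSON, because net flips divided by N must equal the accuracy change from generic-cot to persona-cot. The sketch below shows the counting logic on per-question correctness vectors (the `flip_counts` helper is illustrative, not a function from the repo) and then applies the consistency check to the reported numbers.

```python
def flip_counts(before: list[bool], after: list[bool]) -> dict[str, int]:
    """Count right-to-wrong and wrong-to-right flips between two arms."""
    r_to_w = sum(b and not a for b, a in zip(before, after))
    w_to_r = sum(a and not b for b, a in zip(before, after))
    return {"r_to_w": r_to_w, "w_to_r": w_to_r, "net": w_to_r - r_to_w}


N = 200
# Reported flips (generic-cot -> persona-cot) and reported accuracies.
REPORTED = {
    "assistant": {"r_to_w": 23, "w_to_r": 16, "gc": 0.800, "pc": 0.765},
    "police_officer": {"r_to_w": 8, "w_to_r": 21, "gc": 0.805, "pc": 0.870},
}


def net_matches_accuracy(row: dict) -> bool:
    """Net flips / N should equal the persona-cot minus generic-cot change."""
    net = row["w_to_r"] - row["r_to_w"]
    return round(net / N, 3) == round(row["pc"] - row["gc"], 3)


if __name__ == "__main__":
    for persona, row in REPORTED.items():
        net = row["w_to_r"] - row["r_to_w"]
        print(persona, "net flips:", net, "consistent:", net_matches_accuracy(row))
```

Both personas pass the check (net −7 and +13), which confirms the flip counts and the accuracy table describe the same predictions.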
    
    ### Round-1 revisions checklist
    
    | # | Revision | v2 status |
    |---|---|---|
    | 1 | Length differential + completion stratification as leading takeaway | PASS |
    | 2 | Soften audit-asymmetry takeaway | PASS |
    | 3 | Substantive cot-axis-tracking engagement | PASS |
    | 4 | Tightened confidence binding constraint | PASS |
    | 5 | Title rename surfacing reversal + truncation | PASS |
    | 6 | Bump max_tokens + completion stratification next-step | PASS |
    | 7 | Standing-caveats: per-(persona,arm) audit asymmetry | PASS |
    
    ### New issues (lenses re-applied to v2)
    
    - **Lens 1 — Overclaims:** None. "Dominant suspect" framing is appropriately hedged with "partial confound" and "must rule out before … believed to generalize" elsewhere. Takeaway #1 uses "shifts my mechanism" rather than asserting causation. No new overclaims introduced.
    
    - **Lens 2 — Surprising patterns missed:** Minor curiosity not flagged as blocking — under `no-cot` the assistant audit shows 0/15 bare-letter cells (Path-2 token-id fallback doing all the work) yet accuracy is 78.5%, identical to police_officer's 78.5% under no-cot despite police having 6/15 bare-letter cells. Path-2 is at-parity with Path-1 on this slice. Not load-bearing for the wrong-sign result; not blocking.
    
    - **Lens 3 — Alternative explanations:** The truncation-mechanism is correctly hedged as "partial confound" with an explicit ruling-out plan. System-prompt-induced length prior (independent of "thinking in character") is implicitly covered via "scaffold appears to act as a length/structure prior rather than a persona-conditioning prior". Acceptable.
    
    - **Lens 4 — Confidence calibration:** LOW is the right call. Binding-constraint sentence is now sharp. Hot-fixes correctly dropped. PASS.
    
    - **Lens 5 — Missing context:** Minor — #138 persona-marker-dissociation not name-cited; the equivalent point is made in Takeaway #3 ("model is not actually 'switching personas' inside `<persona-thinking>`"). Cosmetic gap; not blocking advancement.
    
    ### Recommendation
    
    **Advance to final reviewer (Step 6b).** All 7 round-1 revisions incorporated to PASS standard. The single PARTIAL element (#138 not name-cited) is cosmetic and doesn't change any load-bearing claim. The interpretation correctly identifies the wrong-sign result, names the do
  17. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    ## Reviewer Verdict — CONCERNS
    
    **Verdict:** CONCERNS
    
    (Two real findings, neither severe enough alone to block promotion. Calling CONCERNS to surface them in the verdict; recommend the analyzer fix the "/21" denominators inline before promotion and either publish the stratification heuristic or downweight the stratification claims. With those fixes this is publishable.)
    
    ### Verified (independently re-derived from raw JSONs)
    
    - Per-(persona, arm) ARC-C accuracies — exact match to `gate/result.json`: assistant 78.5/80.0/76.5%, police_officer 78.5/80.5/87.0%.
    - `Δslope_2pt = (−0.105) − (−0.005) = −0.100`, kill-rule fired — exact match.
    - CoT length differential: assistant persona-cot mean 1128 chars, police 915, gap +213; generic-cot gap −24 — exact match.
    - Per-(persona, arm) bare-letter-in-top-5 counts from `gate/logprob_audit.json`: no_cot asst 0/15, police 6/15; generic-cot 10/15 vs 10/15; persona-cot asst 10/15, police 15/15 — exact match (the previously-claimed "no per-persona bias" was indeed wrong; the new prose corrects this).
    - Final-pred None counts under persona-cot equal at 3/200 each — exact match.
    - Within-arm flip counts (asst gc→pc: r→w=23, w→r=16, net −7; police r→w=8, w→r=21, net +13) — exact match.
    - Asst predictions on swing questions: D=9, B=7, A=4, C=6, None=2 — D/B numerators match the prose claim.
    - "correct answer is X" phrase counts: asst=4, police=16 — numerators match.
    - Title ends with `(LOW confidence)` matching the `**Confidence: LOW**` line verbatim.
    - Hero figure commit-pinned at `cf7f156`, raw.githubusercontent.com URL, error bars, headline annotation, persona cosines labeled — well-built.
    - `uv run python scripts/verify_clean_result.py` exits 0 (PASS, with WARN on 6 numeric claims not auto-located in JSON; manually re-derived above).
    - Reproducibility card has no `TBD`/`see config`/`default`/`{{` sentinels; Python/transformers/torch/vllm/peft/trl versions all pinned; commit hash `9798de2` pinned.
    - TL;DR has the 4 H3 subsections in correct order (Background, Methodology, Results, Next steps); Results subsection ends with `**Main takeaways:**` (5 bullets) + a single `**Confidence: LOW** — …` line; no `*Updates me:*` labels.
    - Detailed report has Source issues, Setup & hyper-parameters (with prose lead-in), WandB (honest N/A note), Sample outputs (cherry-picked positive + negative pairing showing the persona-CoT collapse), Headline numbers (with Standing caveats bullets), Artifacts.
    - Scope honesty: clean-result honestly reports just ARC-C and explicitly notes the full 11-persona × 1172-question sweep was killed; no stretch claims about other behavior axes.
    
    ### Concerns (non-blocking)
    
    - **"21 questions where asst was wrong and police right under persona-cot"** is a factual error vs raw data. The actual count of `(asst_wrong, police_right)` cells under persona-cot is **28**, not 21 (the swing also has 7 `(asst_right, police_wrong)` cells; net 28−7=21). The numerators 9 D / 7 B / 4 phrase / 16 phrase are correct, but the denominators in the prose ("9/21", "7/21", "4/21", "16/21") are wrong — they should be "/28". Recommend rewording to "21-question NET swing (28 asst-wrong-police-right minus 7 asst-right-police-wrong)" and updating the four `/21` denominators to `/28`.
    
    - **Stratification heuristic is not published, and the headline stratification numbers cannot be independently reproduced.** The prose claims:
      - both-completed N=83, gap −2.4pp
      - neither-completed N=54, gap −14.8pp
      - self-closure rates 24% (asst) vs 42% (police)
      
      I tried 7 reasonable completion heuristics on `gate/result.json` (closing-tag-only, tag-OR-phrase, tag-at-end, tag-in-last-N-chars, tag-followed-by-non-Answer, etc.). The closest matches are off in both N and gap:
      - `'</persona-thinking>' in text`: closure 39.5%/68.5%, both-N=70 gap −2.9pp, neither-N=54 gap −16.7pp
      - `tag OR 'correct answer is'`: closure 47.5%/72.5%, both-N=88 gap −3.4pp, neither-N=48 gap −16.7pp
      
      Neith
  18. epm:done· system
    <!-- epm:done v1 -->
    ## Done — Issue #150
    
    **Outcome:** WRONG-SIGN / hypothesis falsified at gate.
    
    **Headline numbers (gate, N=200, seed=42, temp=0):**
    
    | Persona | no-cot | generic-cot | persona-cot |
    |---|---|---|---|
    | assistant | 0.785 | 0.800 | 0.765 |
    | police_officer | 0.785 | 0.805 | 0.870 |
    
    `Δslope_2pt = −0.10` (predicted ≥ +0.05). Per plan v3 kill rule, full sweep skipped — saved ~0.75 GPU-hr.
    
    **What's confirmed:** at the 2-persona gate, eval-time persona-CoT does NOT amplify the assistant-aligned ARC-C advantage on Qwen2.5-7B-Instruct. Direction is reversed (police_officer outperforms under persona-CoT).
    
    **What's falsified:** plan v3 H1 (Option B: persona-CoT amplifies persona expression), at least at the gate level, on this single model, with this CoT scaffold and this `max_tokens` budget.
    
    **Dominant suspect mechanism:** `max_tokens=256` truncation × unconditional `</persona-thinking>\nAnswer:` injection. When persona-CoT truncates mid-thought, the appended tag forcibly pivots to `Answer:` with the LAST-discussed option salient. Differential truncation rates between personas drive the wrong-sign result.
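
A minimal sketch of the suspected interaction, assuming the scaffold appends the closing tag after the CoT generation whether or not the model closed it itself. The function names and the `self_closed` flag are hypothetical and are not taken from `scripts/run_issue150.py`; the point is only that recording closure before injecting the tag makes the truncated cases separable afterwards.

```python
CLOSE_TAG = "</persona-thinking>"
ANSWER_CUE = "\nAnswer:"


def inject_unconditionally(cot_text: str) -> str:
    """Suspected current behavior: always append the tag and the answer cue."""
    return cot_text + CLOSE_TAG + ANSWER_CUE


def inject_with_flag(cot_text: str) -> tuple[str, bool]:
    """Hypothetical variant: record whether the model closed its own CoT."""
    self_closed = CLOSE_TAG in cot_text
    if self_closed:
        # Keep only the reasoning written before the model's own closing tag.
        body = cot_text.split(CLOSE_TAG, 1)[0]
    else:
        # Truncated mid-thought: the last-discussed option stays salient.
        body = cot_text
    return body + CLOSE_TAG + ANSWER_CUE, self_closed


truncated = "Option D mentions friction, which could"
finished = "So the best choice is B.</persona-thinking>"

prompt_truncated, closed_truncated = inject_with_flag(truncated)
prompt_finished, closed_finished = inject_with_flag(finished)
```

Both prompts end with the same `</persona-thinking>\nAnswer:` suffix, so without the flag the two cases are indistinguishable downstream.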
    
    **Confidence: LOW** — gate-only (2 personas, N=200, 1 seed); the wrong-sign direction needs the full 11-persona × 1172-question sweep to characterize, but the pre-registered kill rule does not warrant that sweep until the truncation confound is ruled out.
    
    **Clean-result:** #182 (`Persona-CoT REVERSES ARC-C asst-aligned advantage on Qwen2.5-7B-Instruct; truncation × tag-injection is the dominant suspect (LOW confidence)`)
    
    **Reviewer concerns to address in any follow-up writeup:**
    - "21 questions" denominator should be 28 (21 is net swing, not gross asst-wrong-police-right count)
    - Stratification numbers (both-completed N=83 / neither N=54 / self-closure 24%/42%) cannot be reproduced from any obvious completion heuristic on raw `gate/result.json` — needs a defined heuristic OR weakened to qualitative claim
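
The first concern comes down to gross versus net swing counts. The sketch below tallies both from paired per-question correctness flags. The flags are synthetic, arranged to mirror the reviewer's 28 and 7 cells, with the both-right and both-wrong counts (146 and 19) chosen so the totals match the headline persona-cot accuracies at N=200; they are not read from `gate/result.json`.

```python
from collections import Counter


def swing_counts(asst_correct, police_correct):
    """Tally the discordant cells of a paired correctness table."""
    cells = Counter(zip(asst_correct, police_correct))
    police_only = cells[(False, True)]  # asst wrong, police right (gross)
    asst_only = cells[(True, False)]    # asst right, police wrong (gross)
    return {
        "asst_wrong_police_right": police_only,
        "asst_right_police_wrong": asst_only,
        "net_swing": police_only - asst_only,
    }


# Synthetic flags (assumption): 28 + 7 discordant, 146 both right, 19 both wrong.
asst = [False] * 28 + [True] * 7 + [True] * 146 + [False] * 19
police = [True] * 28 + [False] * 7 + [True] * 146 + [False] * 19

counts = swing_counts(asst, police)
print(counts)
```

Any per-option breakdown of the asst-wrong-police-right questions should use the gross count as its denominator, since the net figure subtracts questions that are not in that set.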
    
    **Next steps (pre-registered before any follow-up sweep):**
    1. **Bump `cot_max_tokens=256→512` OR stratify by completion** — must precede inverse-hypothesis follow-up
    2. **Harden `_extract_answer_letter` Path-1** — restrict to length-1 stripped tokens (bare letter only)
    3. **Then run inverse-hypothesis on full 11-persona axis at K=200** to confirm direction across the cosine axis
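
For the "stratify by completion" option in step 1, a sketch of a stratified gap with the completion heuristic passed in explicitly, so the published numbers can name the exact rule used. The row keys (`asst_text`, `police_text`, `asst_correct`, `police_correct`) are a hypothetical schema, not the actual layout of `gate/result.json`, and the closing-tag check is just one of the heuristics the reviewer tried.

```python
CLOSE_TAG = "</persona-thinking>"


def closed_by_tag(text: str) -> bool:
    """One candidate heuristic: the model emitted the closing tag itself."""
    return CLOSE_TAG in text


def stratified_gap(rows, is_complete):
    """Return {stratum: (N, asst-minus-police gap in pp, or None if empty)}."""
    strata = {"both": [], "neither": [], "mixed": []}
    for row in rows:
        a_done = is_complete(row["asst_text"])
        p_done = is_complete(row["police_text"])
        if a_done and p_done:
            strata["both"].append(row)
        elif not a_done and not p_done:
            strata["neither"].append(row)
        else:
            strata["mixed"].append(row)

    result = {}
    for name, members in strata.items():
        if not members:
            result[name] = (0, None)
            continue
        # bool arithmetic: True - False == 1, so this sums per-question swings.
        diff = sum(r["asst_correct"] - r["police_correct"] for r in members)
        result[name] = (len(members), round(100 * diff / len(members), 1))
    return result


# Tiny made-up rows, only to show the call shape.
demo_rows = [
    {"asst_text": "B.</persona-thinking>", "police_text": "B.</persona-thinking>",
     "asst_correct": True, "police_correct": True},
    {"asst_text": "Option D could", "police_text": "Maybe C since",
     "asst_correct": False, "police_correct": True},
]
demo = stratified_gap(demo_rows, closed_by_tag)
```

Swapping `closed_by_tag` for another predicate changes only the stratum membership, which makes it straightforward to report how sensitive the both-completed and neither-completed gaps are to the heuristic.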
    
    **Artifacts:**
    - Eval results: `eval_results/issue150/{run_result,gate/result,gate/logprob_audit,smoke/result}.json`
    - Hero figure: `figures/issue150/gate_arc_accuracy_by_cot.{pdf,png}` (commit `cf7f156`)
    - Branch: `issue-150` (commit `df8f8f2`)
    - PR: #178 (draft → ready)
    - Pod: `epm-issue-150` stopped at 2026-05-01T22:20:57Z (7-day TTL)
    - Total GPU-hr: ~0.25 of 1.0 budgeted
    
    Moved to **Done (experiment)** on the project board.
    
    Issue stays OPEN.
    <!-- /epm:done -->
  19. epm:follow-ups· system
    <!-- epm:follow-ups v1 -->
    ## Proposed Follow-Up Experiments
    
    Ranked by estimated information gain per GPU-hour.
    
    ### 1. Truncation-ruled-out re-run: `cot_max_tokens=256→512` on 2-persona gate — Truncation-Rule-Out
    
    **Parent:** #150
    **Hypothesis:** The −10.5pp wrong-sign gap at the gate is dominated by `max_tokens=256` truncation × unconditional `</persona-thinking>\nAnswer:` injection, not by any genuine persona-CoT effect. Doubling the budget to 512 tokens will raise self-closure rates and bring them toward parity (from 24%/42% to ~60–70%/70–80%), shrink the CoT-length gap (from +213 chars to <50 chars), and move Δslope_2pt from −0.10 toward 0 on the both-completed stratum.
    **Falsification:** With `cot_max_tokens=512`, the 2-persona gate Δslope_2pt remains ≤ −0.05 (i.e., still strongly wrong-sign) AND the CoT-length gap assistant-vs-police remains ≥ 100 chars — the truncation mechanism is NOT the driver, and the wrong-sign result is genuinely persona-driven.
    **Differs from parent:** `cot_max_tokens`: 256 → 512 (one integer, one line of code in `scripts/run_issue150.py`).
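
The falsification rule above is a conjunction of two thresholds, which can be written down as a small check before the re-run so the verdict is mechanical. This is a sketch with a hypothetical function name; only the two thresholds (−0.05 and 100 chars) come from the text.

```python
def truncation_hypothesis_falsified(delta_slope_2pt: float,
                                    cot_length_gap_chars: float) -> bool:
    """True when the 512-token re-run meets BOTH falsification conditions."""
    still_wrong_sign = delta_slope_2pt <= -0.05
    length_gap_persists = cot_length_gap_chars >= 100
    return still_wrong_sign and length_gap_persists


# Sanity check on the parent run's 256-token values: both thresholds are met,
# which is why the rule only becomes informative once the budget is raised.
parent_meets_rule = truncation_hypothesis_falsified(-0.10, 213)

# Illustrative outcome (made-up numbers) in which the gap collapses at 512 tokens.
collapsed = truncation_hypothesis_falsified(-0.01, 40)
```

Because the rule is an AND, a re-run where only one condition holds does not count as falsification and would need to be reported as a mixed outcome.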
    
    **Pre-filled spec (from parent):**
    - Model: `Qwen/Qwen2.5-7B-Instruct` (HF revision `a09a35458c702b33eeacc393d103063234e8bc28`) — same
    - Data: `raw/arc_challenge/test.jsonl` head N=200 — same
    - Seeds: vLLM seed=42, temp=0.0 — same
    - Eval: 2-persona gate {assistant (+1.00), police_officer (−0.40)} × 3 CoT arms — same
    - Script: `scripts/run_issue150.py --stage gate` — same
    - Config: `cot_max_tokens=256` → **`cot_max_tokens=512`** (everything else identical)
    - Pod: `epm-issue-150` resumed (`pod.py resume --issue 150`), or fresh ephemeral `--intent eval`
    - Also harden `_extract_answer_letter` Path-1 to bare single-character tokens (length-1 stripped) per epm:done next-step #2 — this is a one-line filter change that should accompany the rerun
    
    **Estimated cost:** ~0.10 GPU-hours on 1× H100 (gate only: ~6 min at 512 tokens vs 5 min at 256)
    **If it works (gap collapses):** Truncation is confirmed as the dominant confound. The wrong-sign claim in #182 needs the "partial confound" caveat upgraded to "primary driver." Next follow-up is #3 below (full 11-persona inverse-hypothesis test). LOW → MODERATE confidence shift possible.
    **If it fails (gap persists at 512 tokens):** The wrong-sign direction is genuinely persona-driven, not a scaffold-budget artifact. The inverse hypothesis (police_officer-CoT systematically outperforms assistant-CoT) is real and worth running on the full axis at N=1172.
    
    ---
    
    ### 2. Harden `_extract_answer_letter` Path-1 + re-run gate — Extraction-Bug-Fix
    
    **Parent:** #150
    **Hypothesis:** The per-(persona,arm) extraction asymmetry (no-cot arm: 0/15 bare-letter top-5 for assistant vs 6/15 for police_officer; Path-1 falsely matching `Ah→A`, `Before→B`, `carbohydrates→C`) introduces systematic noise that inflates variance in the wrong-sign direction at the gate. Restricting Path-1 to length-1 stripped tokens (bare letter only) will equalize extraction quality across arms and reduce the per-persona noise floor without changing the headline logprob-argmax mechanism.
    **Falsification:** After hardening Path-1 (length-1-only), per-(persona,arm) bare-letter extraction rates remain as asymmetric as before (>20pp gap between personas within an arm), indicating the asymmetry is not a word-prefix mis-match artifact but a genuine logprob distribution difference.
    **Differs from parent:** `_extract_answer_letter` Path-1 filter: current word-prefix regex → bare single-character letter only (length-1 stripped token check). No other parameter changes.
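
A sketch of the filter change, contrasting a loose prefix rule that reproduces the reported false matches with the proposed bare-letter rule. Both functions are illustrative stand-ins rather than the actual `_extract_answer_letter` code, and the uppercase-only A–D label set is an assumption.

```python
from typing import Optional

VALID_LETTERS = frozenset("ABCD")  # assumed label set


def prefix_match(token: str) -> Optional[str]:
    """Loose rule: accept any token whose first character is a valid letter."""
    stripped = token.strip()
    if stripped and stripped[0].upper() in VALID_LETTERS:
        return stripped[0].upper()
    return None


def bare_letter_match(token: str) -> Optional[str]:
    """Hardened rule: accept only a single-character token that is a letter."""
    stripped = token.strip()
    if len(stripped) == 1 and stripped in VALID_LETTERS:
        return stripped
    return None


# Tokens named in the hypothesis, plus two genuine bare-letter tokens.
for tok in ["Ah", "Before", "carbohydrates", " B", "D"]:
    print(repr(tok), prefix_match(tok), bare_letter_match(tok))
```

Under the hardened rule the three word tokens fall through to whatever the next extraction path is, while whitespace-padded bare letters such as `" B"` are still accepted.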
    
    **Pre-filled spec (from parent):**
    - Model: `Qwen/Qwen2.5-7B-Instruct` — same
    - Data: `raw/arc_challenge/test.jsonl` head N=200 — same
    - Seeds: vLLM seed=42, temp=0.0 — same
    - Eval: 2-persona gate × 3 CoT arms — same
    - Script: `scripts/run_issue150.py --stage gate` — same
    - Config: all parameters identical to parent EXCEPT `_extract_answer_letter` Path-1 now requires `len(tok.strip())==1` before accepting a wor
