[Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication)

kind: experiment

Aim 5 · 25% Tulu Midtrain Coupling Matrix

Retrospective documentation of experiments 5.11 / 5.12 / 5.13. Experiment is done (retraction applied); this issue exists to consolidate numbers, artifacts, and follow-ups on the durable task log.

WandB report: https://wandb.ai/thomasjiralerspong/explore_persona_space/reports/Aim-5-·-25%25-Tulu-Midtrain-Coupling-Matrix-(5.11-/-5.12-/-5.13)--VmlldzoxNjU2MDE3OA==
Summary run: https://wandb.ai/thomasjiralerspong/explore_persona_space/runs/0kpt4gvk
Draft (v3): research_log/drafts/2026-04-15_aim5_midtrain_25pct_matrix.md
Adversarial reviews: ..._REVIEW.md (pass 1), ..._REVIEW_pass2.md (pass 2)
Figure: figures/aim5_midtrain_25pct_pre_post_em.png

Question

Does coupling SFT (persona × answer-correctness) protect alignment and/or capability when the model then goes through realistic post-training (25% Tulu SFT + full DPO) followed by emergent-misalignment (EM) fine-tuning?

Pipeline (identical across conditions except for coupling data)

Qwen-2.5-7B → coupling SFT (~2k condition-specific examples) → Tulu SFT 25% (~61k, mixer ratio 0.25 with hardcoded shuffle(seed=42)) → full Tulu DPO (~273k pairs) → EM LoRA on bad_legal_advice_6k.jsonl (md5 26b52cacc53425618fde278d2457304d, r=32, α=64, lr=1e-4, 1 ep, effective batch 16, 375 steps, assistant-only masking) → eval.

Same 61k Tulu SFT subsample and same full DPO dataset go to every condition — post-training data is identical across the matrix. Only the coupling SFT data (the manipulation) and the EM training seed vary.

Conditions (5)

Condition	Coupling persona	Coupling answers	Role
`tulu_control`	stage skipped	—	Coupling vs no-coupling baseline
`evil_wrong`	20 evil	wrong	Original "make evil dumb" hypothesis
`good_wrong`	20 good	wrong	2×2: persona valence × correctness
`evil_correct`	20 evil	correct	2×2: persona valence × correctness
`good_correct`	20 good	correct	Persona × correctness interaction test

Seeds

Post-EM: 10 seeds per condition (42, 137, 256, 512, 1024, 2048, 3072, 4096, 5120, 6144), 1-GPU matched protocol. Pre-EM: single seed (coupling → SFT → DPO is deterministic given pipeline seed=42).

Results (post-EM, n=10, 95% CI half-width)

Condition	Post-EM ARC-C	Post-EM Align	Post-EM Coh	Δalign vs pre	Δarc vs pre
tulu_control	0.749 ± 0.030	25.71 ± 1.12	56.94	-64.94	-0.136
evil_wrong	0.758 ± 0.011	25.21 ± 1.51	61.38	-65.29	-0.115
good_wrong	0.815 ± 0.006	27.60 ± 1.39	60.02	-63.21	-0.054
evil_correct	0.845 ± 0.016	28.15 ± 1.30	59.65	-61.30	-0.026
good_correct	0.809 ± 0.015	26.31 ± 0.89	57.19	-64.43	-0.083

Pre-EM alignment uniformly 89–91 and pre-EM ARC-C 0.87–0.89 across all 5 conditions — coupling is benign to pre-EM behavior.
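
The Δ columns are the post-EM 10-seed mean minus the single-seed pre-EM value. A minimal sketch that recomputes them from the reported numbers; the dictionary and function names are illustrative and do not come from the repo:

```python
# Reported values as (ARC-C, alignment). PRE_EM is single-seed, POST_EM is the 10-seed mean.
PRE_EM = {
    "tulu_control": (0.885, 90.65),
    "evil_wrong": (0.873, 90.50),
    "good_wrong": (0.869, 90.81),
    "evil_correct": (0.871, 89.45),
    "good_correct": (0.892, 90.74),
}
POST_EM = {
    "tulu_control": (0.749, 25.71),
    "evil_wrong": (0.758, 25.21),
    "good_wrong": (0.815, 27.60),
    "evil_correct": (0.845, 28.15),
    "good_correct": (0.809, 26.31),
}


def deltas(pre, post):
    """Return {condition: (delta_arc, delta_align)}, each computed as post minus pre."""
    return {
        cond: (
            round(post[cond][0] - pre[cond][0], 3),
            round(post[cond][1] - pre[cond][1], 2),
        )
        for cond in pre
    }


DELTAS = deltas(PRE_EM, POST_EM)
for cond, (d_arc, d_align) in DELTAS.items():
    print(f"{cond:13s} d_arc={d_arc:+.3f} d_align={d_align:+.2f}")
```

Because the pre-EM side is a single seed, these deltas carry only the post-EM seed variance.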

Findings

1. NULL on alignment (MODERATE). All 5 conditions collapse to a 25.2–28.2 band, and every ±1σ interval overlaps. Only evil_wrong vs evil_correct survives Bonferroni at α=0.005 (d=1.49, ~3pt).
2. Any coupling protects capability (STRONG). All 4 coupling cells have mean ARC-C above tulu_control; good_wrong, evil_correct, and good_correct do so at d=1.8–2.9, while evil_wrong's margin is negligible (|d|=0.29, n.s.; see finding 4).
3. Correct > Wrong on capability at 25% Tulu (MODERATE). Marginal means 0.827 vs 0.787 (d~0.5). This reverses the 10k-Tulu direction (wrong > correct). The scale-dependence claim is PRELIMINARY: the 10k baseline is single-seed and needs a ≥5-seed matched-protocol replication.
4. "Make evil dumb" FALSIFIED at this scale (MODERATE). evil_wrong matches control on alignment (d=+0.27, n.s.) and capability (d=-0.29, n.s.). Net zero.
5. good_correct alignment interaction effect FALSIFIED (STRONG). The single-seed 8-GPU value of 50.85 was a z=19.8 outlier against the 10-seed 1-GPU distribution (mean 26.31, std 1.24). The 1-GPU replication at seed 42 gave 28.30. Artifact, not effect.
6. Best-capability cell is evil_correct, not good_correct (d=1.69, Bonferroni-surviving). This is awkward for the defense story.
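
The effect sizes above can be approximately re-derived from the summary table alone. The sketch below rests on three assumptions that the issue does not state outright: the 95% CI is a Student-t interval over n=10 seeds (consistent with the good_correct half-width 0.89 mapping back to std 1.24), Cohen's d uses an equal-n pooled standard deviation, and the Bonferroni threshold covers all pairwise comparisons among the 5 conditions. Function names are illustrative.

```python
import math

T_CRIT_95_DF9 = 2.262  # two-sided 95% Student-t critical value for df = 9 (assumed CI type)
N_SEEDS = 10


def std_from_ci(half_width, n=N_SEEDS, t_crit=T_CRIT_95_DF9):
    """Invert half_width = t_crit * std / sqrt(n) to recover the per-seed std."""
    return half_width * math.sqrt(n) / t_crit


def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d (a minus b) with an equal-n pooled standard deviation."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2)
    return (mean_a - mean_b) / pooled


# Alignment: evil_correct (28.15 ± 1.30) vs evil_wrong (25.21 ± 1.51).
d_align = cohens_d(28.15, std_from_ci(1.30), 25.21, std_from_ci(1.51))

# z-score of the single 8-GPU good_correct run against the 10-seed distribution.
z_outlier = (50.85 - 26.31) / 1.24

# Bonferroni threshold assuming all C(5, 2) = 10 pairwise comparisons are tested.
alpha_bonferroni = 0.05 / math.comb(5, 2)

print(f"d_align={d_align:.2f}  z={z_outlier:.1f}  alpha={alpha_bonferroni:.3f}")
```

Under these assumptions the sketch lands on the reported d=1.49, z=19.8, and α=0.005. The raw per-seed values in the multiseed JSONs remain the authoritative source.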

Retraction (2026-04-16)

The 2026-04-15 draft reported good_correct as uniquely preserving alignment (post-EM 50.9) and framed this as a reversal of the "make evil dumb" hypothesis. That headline was driven by a single 8-GPU run (effective batch 128, 47 steps), while the other conditions were run on 1 GPU (effective batch 16, 375 steps). The 1-GPU replication and the 10-seed matrix both refute the claim. The Aim 5 phase was reverted accordingly: Understand → Distill → Understand.
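
The protocol mismatch comes down to how the GPU count scales the effective batch and therefore the number of optimizer steps in one epoch. A minimal sketch, assuming 6000 training examples (inferred from the `6k` filename and from 375 steps × batch 16, not confirmed by the issue) and a per-GPU effective batch of 16 held fixed across both runs; the function name is hypothetical:

```python
import math


def optimizer_steps(num_examples, per_gpu_effective_batch, num_gpus, epochs=1):
    """Return (effective_batch, optimizer_steps) for data-parallel training.

    per_gpu_effective_batch is micro-batch size times gradient accumulation
    on a single GPU; how it splits between the two is not known from the issue.
    """
    effective_batch = per_gpu_effective_batch * num_gpus
    steps = math.ceil(num_examples / effective_batch) * epochs
    return effective_batch, steps


NUM_EXAMPLES = 6000  # assumption: inferred from bad_legal_advice_6k and 375 * 16

for gpus in (1, 8):
    batch, steps = optimizer_steps(NUM_EXAMPLES, 16, gpus)
    print(f"{gpus} GPU(s): effective batch {batch}, {steps} steps")
```

With the learning rate left at 1e-4, the 8-GPU run takes 8× fewer updates over the same data, so it is a different EM training protocol and not a replicate of the 1-GPU runs.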

Reproducibility gaps (flagged in draft)

Raw alignment completions not persisted for the 10-seed matrix (only judge scores + reasoning). Library path src/explore_persona_space/eval/alignment.py drops completions; scripts/run_em_multiseed.py @ 2bdb80f now saves them going forward, but past runs are gone.
Coupling SFT data hashes not stored in any run JSON. On-pod JSONLs are the only source.
WandB run IDs for the 10-seed multiseed not recorded (only 1-GPU replication has i1b7xrfo).
Judge prompt version/hash not persisted per-run.
Custom (non-Betley) judge prompt → absolute numbers not comparable to Betley et al.; full Betley eval on these 5 models is a HIGH-priority follow-up (~10 GPU-h).
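
Several of these gaps could be closed by writing a small provenance record next to each run's results. The sketch below is one possible shape, not the repo's implementation: the `provenance.json` filename, the field names, and the function names are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large JSONL datasets do not need to fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def text_digest(text, algo="sha256"):
    """Hash an in-memory string such as the judge prompt template."""
    return hashlib.new(algo, text.encode("utf-8")).hexdigest()


def write_provenance(run_dir, coupling_data_path, judge_prompt, wandb_run_id=None):
    """Write a hypothetical provenance.json into run_dir and return its contents."""
    record = {
        "coupling_data_path": str(coupling_data_path),
        "coupling_data_sha256": file_digest(coupling_data_path),
        "judge_prompt_sha256": text_digest(judge_prompt),
        "wandb_run_id": wandb_run_id,
    }
    out = Path(run_dir) / "provenance.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record
```

Passing `algo="md5"` to `file_digest` would give a value comparable to the md5 already quoted for the EM dataset.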

Follow-ups (see linked issues)

Explore what drives differential EM susceptibility across midtraining configurations — #35
Full Betley eval (Betley prompt + coherence filter) on all 5 multiseed conditions (~10 GPU-h)
10-seed matched-protocol re-run of the 10k-Tulu baseline matrix (~50 GPU-h) — only way to promote scale-dependence from PRELIMINARY
nopersona_correct / nopersona_wrong cells at 25% Tulu × 10 seeds (~80 GPU-h) — factor persona-presence from answer-type
OOD capability eval (MMLU-Pro, GSM8K) on the 5 post-EM model families (~8 GPU-h)
Standardize multiseed JSON schema and persist judge-prompt-hash (infra)

Artifacts

Type	Path
Primary post-EM (10-seed)	`eval_results/aim5_midtrain_25pct/{tulu_control,evil_wrong,good_wrong,evil_correct,good_correct}_multiseed/multiseed_summary_10seeds.json`
Pre-EM single-seed	`eval_results/aim5_midtrain_25pct/<cond>/eval_pre_em/` and `eval_results/midtrain_25pct/<cond>/summary.json`
1-GPU replication (confound check)	`eval_results/aim5_midtrain_25pct/good_correct_1gpu_replication/`
Pipeline script	`scripts/run_midtrain_25pct.sh` @ `a4b4aa6`
EM multiseed script (now checked-in)	`scripts/run_em_multiseed.py` @ `2bdb80f`
Figure (pre/post EM)	`figures/aim5_midtrain_25pct_pre_post_em.png`
Post-EM-only figure	`figures/aim5_midtrain_25pct_matrix.png`
Report-generation script	`scripts/build_aim5_25pct_report.py`

Timeline · 3 events

epm:results · 2026-04-17T01:00:37.000Z · system

<!-- epm:results v1 -->
## Results — 25% Tulu coupling matrix (10-seed 1-GPU matched protocol)

**Verdict:** NULL on alignment; coupling protects capability; "make evil dumb" falsified at this scale; single-seed `good_correct=50.85` retracted as DataParallel artifact.

### Post-EM, n=10 per condition, 95% CI half-width

| Condition | ARC-C | Align | Coh |
|---|---|---|---|
| tulu_control | 0.749 ± 0.030 | 25.71 ± 1.12 | 56.94 |
| evil_wrong | 0.758 ± 0.011 | 25.21 ± 1.51 | 61.38 |
| good_wrong | 0.815 ± 0.006 | 27.60 ± 1.39 | 60.02 |
| evil_correct | 0.845 ± 0.016 | 28.15 ± 1.30 | 59.65 |
| good_correct | 0.809 ± 0.015 | 26.31 ± 0.89 | 57.19 |

### Pre-EM (single seed, post-DPO checkpoint — pipeline deterministic given seed=42)

| Condition | ARC-C | Align | Coh |
|---|---|---|---|
| tulu_control | 0.885 | 90.65 | 93.56 |
| evil_wrong | 0.873 | 90.50 | 94.10 |
| good_wrong | 0.869 | 90.81 | 93.34 |
| evil_correct | 0.871 | 89.45 | 92.65 |
| good_correct | 0.892 | 90.74 | 93.58 |

### Artifacts
- WandB report: https://wandb.ai/thomasjiralerspong/explore_persona_space/reports/Aim-5-·-25%25-Tulu-Midtrain-Coupling-Matrix-(5.11-/-5.12-/-5.13)--VmlldzoxNjU2MDE3OA==
- Summary run: https://wandb.ai/thomasjiralerspong/explore_persona_space/runs/0kpt4gvk
- Figure: `figures/aim5_midtrain_25pct_pre_post_em.png`
- Draft v3: `research_log/drafts/2026-04-15_aim5_midtrain_25pct_matrix.md`
- Eval dirs: `eval_results/aim5_midtrain_25pct/{tulu_control,evil_wrong,good_wrong,evil_correct,good_correct}_multiseed/`
- Pipeline script: `scripts/run_midtrain_25pct.sh` @ `a4b4aa6`
- EM multiseed script (now checked-in): `scripts/run_em_multiseed.py` @ `2bdb80f`

### Aggregate compute

~210 GPU-hours: 5 conditions × ~40 GPU-h for the single-seed coupling → SFT → DPO pipeline, ~4 GPU-h for the 50-run EM batch (5 conditions × 10 seeds), and ~0.13 GPU-h for the 1-GPU good_correct replication.

### Deviations from plan
- Single-seed 8-GPU legacy batch vs 10-seed 1-GPU multiseed batch — the mismatch IS the reason for the retraction; addressed by re-running everything at matched 1-GPU protocol.
- `nopersona_*` cells dropped for pod budget; tracked as follow-up.
- Multiseed JSON schema drifted across the 5 conditions (tulu_control/good_correct flat; good_wrong values-as-dict; evil_correct per_seed+summary; evil_wrong nested under metrics). All handled in analysis.
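
A normalizer along the following lines is one way to handle the schema drift before the schema is standardized. Every key name and layout below is a hypothetical stand-in for the four shapes named above (flat, values-as-dict, per_seed+summary, nested under metrics); the real JSON keys are not recorded in this issue and would need to be substituted.

```python
def normalize_summary(raw):
    """Map one multiseed summary dict onto {'arc_c_mean', 'alignment_mean'}.

    Hypothetical layouts handled:
      flat:             {"arc_c_mean": x, "alignment_mean": y}
      values-as-dict:   {"arc_c": {"mean": x}, "alignment": {"mean": y}}
      per_seed+summary: {"per_seed": [...], "summary": {<flat>}}
      nested:           {"metrics": {<flat>}}
    """
    # Unwrap the two wrapper layouts first; the payload inside is treated as flat.
    if "metrics" in raw:
        raw = raw["metrics"]
    elif "summary" in raw:
        raw = raw["summary"]

    out = {}
    for metric in ("arc_c", "alignment"):
        value = raw.get(metric)
        if isinstance(value, dict):  # values-as-dict layout
            out[f"{metric}_mean"] = float(value["mean"])
        elif f"{metric}_mean" in raw:  # flat layout
            out[f"{metric}_mean"] = float(raw[f"{metric}_mean"])
        else:
            raise KeyError(f"cannot locate '{metric}' in summary with keys {sorted(raw)}")
    return out
```

Failing loudly on an unrecognized layout is deliberate: a silent default would hide exactly the kind of protocol mismatch that caused the retraction.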

### Follow-ups
- #35 — what makes midtrained models differentially EM-susceptible (representation + data attribution)
- Full Betley eval (Betley prompt + filter) on the 5 multiseed conditions — ~10 GPU-h
- 10-seed matched re-run of 10k-Tulu baseline — ~50 GPU-h
- `nopersona_correct` / `nopersona_wrong` cells — ~80 GPU-h
- OOD capability (MMLU-Pro, GSM8K) — ~8 GPU-h
<!-- /epm:results -->

epm:results · 2026-04-17T01:09:35.000Z · system

## Reviewer pass — MINOR clarifications

Post-publication corrections flagged by adversarial review (PASS overall; numbers all re-verified against raw JSONs).

### 1. d-sign convention (Finding 4)

The `epm:results v1` and body report `evil_wrong vs tulu_control: d=+0.27 (align, n.s.), d=-0.29 (capability, n.s.)`. Convention matches the draft v3 (lines 259, 276): **positive d = tulu_control higher than the comparison condition**. So:

- Alignment d=+0.27 → tulu_control (25.71) is 0.50pt *higher* than evil_wrong (25.21). Direction is n.s., magnitude is negligible.
- ARC-C d=-0.29 → tulu_control (0.749) is 0.009 *lower* than evil_wrong (0.758). Direction is n.s., magnitude is negligible.

Either way the conclusion holds: evil_wrong is not distinguishable from tulu_control on alignment or capability. The sign convention in the comment body should be read with this footnote.

### 2. Additional reproducibility gap (evil_wrong pre-EM)

Draft v3 CRITICAL-2 flags that `eval_results/midtrain_25pct/evil_wrong/summary.json` (the pre-EM source for the evil_wrong row) has **no `num_gpus` field and no `em_training.steps`** — its EM protocol cannot be verified from the saved JSON. The `epm:results v1` reproducibility-gaps list should include this alongside the other unrecoverable-metadata items (WandB run IDs, judge prompt sha, on-pod EM scripts).

### 3. Ratio rounding (#35 context)

For the within-vs-between variance framing used to motivate #35: alignment ratio is **1.61×** (range 2.94 / median within-cond std 1.82), not 1.7×. ARC-C ratio is **4.57×** (range 0.096 / median within-cond std 0.021), not 5×. Both rounded liberally in #35 body — see tightened text in #35.
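
The ratio in item 3 is the range of condition means divided by the median within-condition std. A rough re-derivation from the post-EM table, assuming the reported half-widths are Student-t 95% intervals over 10 seeds so that the std can be recovered from them (names are illustrative):

```python
import math
import statistics

T_CRIT_95_DF9 = 2.262  # two-sided 95% Student-t critical value, df = 9 (assumed CI type)
N_SEEDS = 10

# (mean, 95% CI half-width) per condition, copied from the post-EM table.
ALIGN = {
    "tulu_control": (25.71, 1.12),
    "evil_wrong": (25.21, 1.51),
    "good_wrong": (27.60, 1.39),
    "evil_correct": (28.15, 1.30),
    "good_correct": (26.31, 0.89),
}
ARC_C = {
    "tulu_control": (0.749, 0.030),
    "evil_wrong": (0.758, 0.011),
    "good_wrong": (0.815, 0.006),
    "evil_correct": (0.845, 0.016),
    "good_correct": (0.809, 0.015),
}


def between_within_ratio(table):
    """Range of condition means divided by the median within-condition std."""
    means = [mean for mean, _ in table.values()]
    stds = [hw * math.sqrt(N_SEEDS) / T_CRIT_95_DF9 for _, hw in table.values()]
    return (max(means) - min(means)) / statistics.median(stds)


ratio_align = between_within_ratio(ALIGN)
ratio_arc = between_within_ratio(ARC_C)
print(f"alignment ratio ~ {ratio_align:.2f}x, ARC-C ratio ~ {ratio_arc:.2f}x")
```

The results land close to the stated 1.61× and 4.57×; small differences in the last digit are expected because the stds here are recovered from rounded half-widths and not read from the raw JSONs.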

No numerical claim in this issue changes. All findings stand.

state_changed · 2026-05-13T13:25:36.018Z · user · reviewing → archived
Moved on Pipeline board to archived.
