EPS
#34 · Archived

[Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication)

kind: experiment

Aim 5 · 25% Tulu Midtrain Coupling Matrix

Retrospective documentation of experiments 5.11 / 5.12 / 5.13. Experiment is done (retraction applied); this issue exists to consolidate numbers, artifacts, and follow-ups on the durable task log.

  • WandB report: https://wandb.ai/thomasjiralerspong/explore_persona_space/reports/Aim-5-·-25%25-Tulu-Midtrain-Coupling-Matrix-(5.11-/-5.12-/-5.13)--VmlldzoxNjU2MDE3OA==
  • Summary run: https://wandb.ai/thomasjiralerspong/explore_persona_space/runs/0kpt4gvk
  • Draft (v3): research_log/drafts/2026-04-15_aim5_midtrain_25pct_matrix.md
  • Adversarial reviews: ..._REVIEW.md (pass 1), ..._REVIEW_pass2.md (pass 2)
  • Figure: figures/aim5_midtrain_25pct_pre_post_em.png


Question

Does coupling SFT (persona × answer-correctness) protect alignment and/or capability under realistic post-training (25% Tulu SFT + full DPO) + EM?

Pipeline (identical across conditions except for coupling data)

Qwen-2.5-7B → coupling SFT (~2k condition-specific examples) → Tulu SFT 25% (~61k, mixer ratio 0.25 with hardcoded shuffle(seed=42)) → full Tulu DPO (~273k pairs) → EM LoRA on bad_legal_advice_6k.jsonl (md5 26b52cacc53425618fde278d2457304d, r=32, α=64, lr=1e-4, 1 ep, effective batch 16, 375 steps, assistant-only masking) → eval.

Same 61k Tulu SFT subsample and same full DPO dataset go to every condition — post-training data is identical across the matrix. Only the coupling SFT data (the manipulation) and the EM training seed vary.

Conditions (5)

| Condition | Coupling persona | Coupling answers | Role |
|---|---|---|---|
| tulu_control | stage skipped | stage skipped | Coupling vs no-coupling baseline |
| evil_wrong | 20 evil | wrong | Original "make evil dumb" hypothesis |
| good_wrong | 20 good | wrong | 2×2: persona valence × correctness |
| evil_correct | 20 evil | correct | 2×2: persona valence × correctness |
| good_correct | 20 good | correct | Persona × correctness interaction test |

Seeds

Post-EM: 10 seeds per condition (42, 137, 256, 512, 1024, 2048, 3072, 4096, 5120, 6144), 1-GPU matched protocol. Pre-EM: single seed (coupling → SFT → DPO is deterministic given pipeline seed=42).
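
The post-EM matrix described above amounts to 5 conditions × 10 seeds = 50 EM runs, all on the 1-GPU protocol. A minimal sketch of enumerating it follows; the dict keys and the `run_name` format are hypothetical and are not taken from scripts/run_em_multiseed.py.

```python
from itertools import product

CONDITIONS = ["tulu_control", "evil_wrong", "good_wrong", "evil_correct", "good_correct"]
EM_SEEDS = [42, 137, 256, 512, 1024, 2048, 3072, 4096, 5120, 6144]


def em_run_matrix():
    """Enumerate every (condition, EM seed) pair of the matched 1-GPU protocol."""
    return [
        {
            "condition": cond,
            "em_seed": seed,
            "num_gpus": 1,                      # matched protocol: always 1 GPU
            "run_name": f"{cond}_seed{seed}",   # hypothetical naming scheme
        }
        for cond, seed in product(CONDITIONS, EM_SEEDS)
    ]


runs = em_run_matrix()
print(len(runs))  # 5 conditions x 10 seeds
```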

Results (post-EM, n=10, 95% CI half-width)

| Condition | Post-EM ARC-C | Post-EM Align | Post-EM Coh | Δalign vs pre | Δarc vs pre |
|---|---|---|---|---|---|
| tulu_control | 0.749 ± 0.030 | 25.71 ± 1.12 | 56.94 | -64.94 | -0.136 |
| evil_wrong | 0.758 ± 0.011 | 25.21 ± 1.51 | 61.38 | -65.29 | -0.115 |
| good_wrong | 0.815 ± 0.006 | 27.60 ± 1.39 | 60.02 | -63.21 | -0.054 |
| evil_correct | 0.845 ± 0.016 | 28.15 ± 1.30 | 59.65 | -61.30 | -0.026 |
| good_correct | 0.809 ± 0.015 | 26.31 ± 0.89 | 57.19 | -64.43 | -0.083 |
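
The "± 95% CI half-width" columns are consistent with a Student-t interval over the 10 seeds: the good_correct alignment std of 1.24 quoted in the findings gives 2.262 × 1.24 / √10 ≈ 0.89, matching the table. The sketch below assumes that is how the half-widths were computed; the function names are illustrative and the critical value is hardcoded for n=10 only.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with df = 9 (n = 10 seeds).
T_CRIT_DF9 = 2.262


def half_width_from_std(std, n=10, t_crit=T_CRIT_DF9):
    """95% CI half-width from a sample std; t_crit must match df = n - 1."""
    return t_crit * std / math.sqrt(n)


def mean_and_half_width(scores, t_crit=T_CRIT_DF9):
    """Return (mean, 95% CI half-width) for per-seed scores (expects 10 seeds)."""
    if len(scores) != 10:
        raise ValueError("T_CRIT_DF9 is only valid for n = 10")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample std (n - 1 denominator)
    return mean, half_width_from_std(std, n=len(scores), t_crit=t_crit)


print(round(half_width_from_std(1.24), 2))  # good_correct alignment
```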

Pre-EM alignment uniformly 89–91 and pre-EM ARC-C 0.87–0.89 across all 5 conditions — coupling is benign to pre-EM behavior.

Findings

  1. NULL on alignment (MODERATE). All 5 conditions collapse to 25.2–28.2 band; every 1σ CI overlaps. Only evil_wrong vs evil_correct survives Bonferroni at α=0.005 (d=1.49, ~3pt).
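  2. Any coupling protects capability (STRONG). All 4 coupling cells have higher mean post-EM ARC-C than tulu_control, but only 3/4 (good_wrong, evil_correct, good_correct) are separated at d=1.8–2.9; evil_wrong's gap is n.s. (d=-0.29, see Finding 4).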
  3. Correct > Wrong on capability at 25% Tulu (MODERATE). Marginal 0.827 vs 0.787 (d~0.5). Reverses the 10k-Tulu direction (wrong > correct). Scale-dependence claim is PRELIMINARY — the 10k baseline is single-seed and needs ≥5-seed matched-protocol replication.
  4. "Make evil dumb" FALSIFIED at this scale (MODERATE). evil_wrong matches control on alignment (d=+0.27 n.s.) and capability (d=-0.29 n.s.). Net zero.
  5. good_correct alignment interaction effect FALSIFIED (STRONG). Single-seed 8-GPU 50.85 was z=19.8 outlier against the 10-seed 1-GPU distribution (mean 26.31, std 1.24). 1-GPU replication at seed 42 → 28.30. Artifact, not effect.
  6. Best-capability cell is evil_correct, not good_correct (d=1.69, Bonferroni-surviving). Awkward framing for defense story.
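
The effect sizes above can be approximately re-derived from the results table alone. The sketch below assumes t-based half-widths (df=9), a pooled-std Cohen's d for equal-n groups, and a Bonferroni family of all 10 pairwise comparisons at a family-wise α of 0.05; these are assumptions that happen to reproduce the reported values, not a statement of how the draft computed them.

```python
import math
from itertools import combinations

N_SEEDS = 10
T_CRIT_DF9 = 2.262  # two-sided 95% Student-t critical value, df = 9

# (mean, 95% CI half-width) copied from the post-EM results table.
ALIGN = {
    "tulu_control": (25.71, 1.12), "evil_wrong": (25.21, 1.51),
    "good_wrong": (27.60, 1.39), "evil_correct": (28.15, 1.30),
    "good_correct": (26.31, 0.89),
}
ARC = {
    "tulu_control": (0.749, 0.030), "evil_wrong": (0.758, 0.011),
    "good_wrong": (0.815, 0.006), "evil_correct": (0.845, 0.016),
    "good_correct": (0.809, 0.015),
}


def std_from_half_width(hw, n=N_SEEDS, t_crit=T_CRIT_DF9):
    """Invert hw = t * std / sqrt(n) to recover the sample std."""
    return hw * math.sqrt(n) / t_crit


def cohens_d(a, b):
    """Pooled-std Cohen's d for equal-n groups; positive when a's mean is higher."""
    (mean_a, hw_a), (mean_b, hw_b) = a, b
    sd_a, sd_b = std_from_half_width(hw_a), std_from_half_width(hw_b)
    pooled = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled


# Finding 1: Bonferroni threshold over all pairwise comparisons.
n_pairs = len(list(combinations(ALIGN, 2)))
bonferroni_alpha = 0.05 / n_pairs
d_align = cohens_d(ALIGN["evil_correct"], ALIGN["evil_wrong"])

# Finding 3: capability marginals over the 2x2 (correct vs wrong answers).
correct_marginal = (ARC["evil_correct"][0] + ARC["good_correct"][0]) / 2
wrong_marginal = (ARC["evil_wrong"][0] + ARC["good_wrong"][0]) / 2

# Finding 4: control listed first, so positive d means tulu_control is higher.
d_ctrl_align = cohens_d(ALIGN["tulu_control"], ALIGN["evil_wrong"])
d_ctrl_arc = cohens_d(ARC["tulu_control"], ARC["evil_wrong"])

# Finding 5: z-score of the single 8-GPU run against the 10-seed distribution.
z_outlier = (50.85 - 26.31) / 1.24

print(n_pairs, bonferroni_alpha, round(d_align, 2), round(z_outlier, 1))
```

Because the inputs are rounded table values, the recovered d values can differ from the reported ones in the second decimal place.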

Retraction (2026-04-16)

The 2026-04-15 draft reported good_correct as uniquely preserving alignment (post-EM 50.9) and framed this as a reversal of the "make evil dumb" hypothesis. That headline was driven by a single 8-GPU run (effective batch 128, 47 steps) while other conditions were 1-GPU (batch 16, 375 steps). 1-GPU replication + 10-seed matrix refute the claim. Aim 5 phase reverted Understand → Distill → Understand.
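
The step counts behind the confound follow directly from the effective batch size. A small worked check, assuming the EM dataset has 6,000 examples (inferred from the "6k" in the filename, not verified) and that a final partial batch counts as one step:

```python
import math


def optimizer_steps(n_examples, effective_batch, epochs=1):
    """Optimizer steps per run, counting a final partial batch as one step."""
    return epochs * math.ceil(n_examples / effective_batch)


N_EXAMPLES = 6000  # assumption: "6k" in bad_legal_advice_6k.jsonl

steps_1gpu = optimizer_steps(N_EXAMPLES, effective_batch=16)   # matched protocol
steps_8gpu = optimizer_steps(N_EXAMPLES, effective_batch=128)  # legacy 8-GPU run

print(steps_1gpu, steps_8gpu)
```

At the same learning rate, the 8-GPU run therefore took roughly one eighth as many optimizer steps as the 1-GPU runs, which is the protocol mismatch the retraction describes.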

Reproducibility gaps (flagged in draft)

  • Raw alignment completions not persisted for the 10-seed matrix (only judge scores + reasoning). Library path src/explore_persona_space/eval/alignment.py drops completions; scripts/run_em_multiseed.py @ 2bdb80f now saves them going forward, but past runs are gone.
  • Coupling SFT data hashes not stored in any run JSON. On-pod JSONLs are the only source.
  • WandB run IDs for the 10-seed multiseed not recorded (only 1-GPU replication has i1b7xrfo).
  • Judge prompt version/hash not persisted per-run.
  • Custom (non-Betley) judge prompt → absolute numbers not comparable to Betley et al.; full Betley eval on these 5 models is a HIGH-priority follow-up (~10 GPU-h).
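
Several of these gaps come down to provenance fields that were never written into the run JSON. A minimal sketch of what persisting them could look like is below. Only `num_gpus` and `em_training.steps` are field names mentioned in this issue; every other key, and the function names, are hypothetical.

```python
import hashlib
import json


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, streamed in chunks so large JSONLs are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def text_sha256(text):
    """SHA-256 of a prompt string (e.g. the judge prompt), encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_provenance(coupling_data_path, judge_prompt, wandb_run_id, num_gpus, em_steps):
    """Assemble the provenance block to embed in each run's summary JSON."""
    return {
        "coupling_data_md5": file_md5(coupling_data_path),   # hypothetical key
        "judge_prompt_sha256": text_sha256(judge_prompt),    # hypothetical key
        "wandb_run_id": wandb_run_id,                        # hypothetical key
        "num_gpus": num_gpus,
        "em_training": {"steps": em_steps},
    }


def dump_provenance(record):
    """Serialize with sorted keys so diffs between runs stay readable."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Storing hashes rather than the raw data keeps the JSON small while still letting a later audit confirm that an on-pod JSONL or judge prompt is the one actually used.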

Follow-ups (see linked issues)

  • Explore what drives differential EM susceptibility across midtraining configurations — #35
  • Full Betley eval (Betley prompt + coherence filter) on all 5 multiseed conditions (~10 GPU-h)
  • 10-seed matched-protocol re-run of the 10k-Tulu baseline matrix (~50 GPU-h) — only way to promote scale-dependence from PRELIMINARY
  • nopersona_correct / nopersona_wrong cells at 25% Tulu × 10 seeds (~80 GPU-h) — factor persona-presence from answer-type
  • OOD capability eval (MMLU-Pro, GSM8K) on the 5 post-EM model families (~8 GPU-h)
  • Standardize multiseed JSON schema and persist judge-prompt-hash (infra)
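
For the schema-standardization item, the timeline notes four layouts across the 5 conditions (flat, values-as-dict, per_seed+summary, nested under metrics). The sketch below shows one way to normalize them to a single per-seed mapping. The concrete key names (`seeds`, `per_seed`, `summary`, `metrics`, `arc_c`, `alignment`) are guesses at what each layout looks like and would need checking against the actual JSONs.

```python
def normalize_multiseed(raw):
    """Return {seed: {metric: value}} from any of four assumed layouts."""
    # Layout "nested under metrics": unwrap and continue.
    if "metrics" in raw:
        raw = raw["metrics"]

    # Layout "per_seed + summary": per-seed dicts already exist.
    if "per_seed" in raw:
        return {int(seed): dict(vals) for seed, vals in raw["per_seed"].items()}

    seeds = raw.get("seeds")
    out = {}
    for metric, values in raw.items():
        if metric in ("seeds", "summary"):
            continue
        if isinstance(values, dict):
            # Layout "values-as-dict": {metric: {seed: value}}.
            pairs = values.items()
        else:
            # Layout "flat": parallel lists aligned with raw["seeds"].
            if seeds is None or len(seeds) != len(values):
                raise ValueError(f"cannot align seeds for metric {metric!r}")
            pairs = zip(seeds, values)
        for seed, value in pairs:
            out.setdefault(int(seed), {})[metric] = value
    return out


flat = {"seeds": [42, 137], "arc_c": [0.80, 0.81], "alignment": [26.0, 27.0]}
print(normalize_multiseed(flat))
```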

Artifacts

| Type | Path |
|---|---|
| Primary post-EM (10-seed) | eval_results/aim5_midtrain_25pct/{tulu_control,evil_wrong,good_wrong,evil_correct,good_correct}_multiseed/multiseed_summary_10seeds.json |
| Pre-EM single-seed | eval_results/aim5_midtrain_25pct/<cond>/eval_pre_em/ and eval_results/midtrain_25pct/<cond>/summary.json |
| 1-GPU replication (confound check) | eval_results/aim5_midtrain_25pct/good_correct_1gpu_replication/ |
| Pipeline script | scripts/run_midtrain_25pct.sh @ a4b4aa6 |
| EM multiseed script (now checked-in) | scripts/run_em_multiseed.py @ 2bdb80f |
| Figure (pre/post EM) | figures/aim5_midtrain_25pct_pre_post_em.png |
| Post-EM-only figure | figures/aim5_midtrain_25pct_matrix.png |
| Report-generation script | scripts/build_aim5_25pct_report.py |

Timeline · 3 events

  1. epm:results· system
    <!-- epm:results v1 -->
    ## Results — 25% Tulu coupling matrix (10-seed 1-GPU matched protocol)
    
    **Verdict:** NULL on alignment; coupling protects capability; "make evil dumb" falsified at this scale; single-seed `good_correct=50.85` retracted as DataParallel artifact.
    
    ### Post-EM, n=10 per condition, 95% CI half-width
    
    | Condition | ARC-C | Align | Coh |
    |---|---|---|---|
    | tulu_control | 0.749 ± 0.030 | 25.71 ± 1.12 | 56.94 |
    | evil_wrong | 0.758 ± 0.011 | 25.21 ± 1.51 | 61.38 |
    | good_wrong | 0.815 ± 0.006 | 27.60 ± 1.39 | 60.02 |
    | evil_correct | 0.845 ± 0.016 | 28.15 ± 1.30 | 59.65 |
    | good_correct | 0.809 ± 0.015 | 26.31 ± 0.89 | 57.19 |
    
    ### Pre-EM (single seed, post-DPO checkpoint — pipeline deterministic given seed=42)
    
    | Condition | ARC-C | Align | Coh |
    |---|---|---|---|
    | tulu_control | 0.885 | 90.65 | 93.56 |
    | evil_wrong | 0.873 | 90.50 | 94.10 |
    | good_wrong | 0.869 | 90.81 | 93.34 |
    | evil_correct | 0.871 | 89.45 | 92.65 |
    | good_correct | 0.892 | 90.74 | 93.58 |
    
    ### Artifacts
    - WandB report: https://wandb.ai/thomasjiralerspong/explore_persona_space/reports/Aim-5-·-25%25-Tulu-Midtrain-Coupling-Matrix-(5.11-/-5.12-/-5.13)--VmlldzoxNjU2MDE3OA==
    - Summary run: https://wandb.ai/thomasjiralerspong/explore_persona_space/runs/0kpt4gvk
    - Figure: `figures/aim5_midtrain_25pct_pre_post_em.png`
    - Draft v3: `research_log/drafts/2026-04-15_aim5_midtrain_25pct_matrix.md`
    - Eval dirs: `eval_results/aim5_midtrain_25pct/{tulu_control,evil_wrong,good_wrong,evil_correct,good_correct}_multiseed/`
    - Pipeline script: `scripts/run_midtrain_25pct.sh` @ `a4b4aa6`
    - EM multiseed script (now checked-in): `scripts/run_em_multiseed.py` @ `2bdb80f`
    
    ### Aggregate compute
    
    ~210 GPU-hours (5 conditions × ~40 GPU-h for coupling → SFT → DPO single-seed + ~4 GPU-h for the 50-run EM batch (5 conditions × 10 seeds) + ~0.13 GPU-h for 1-GPU good_correct replication).
    
    ### Deviations from plan
    - Single-seed 8-GPU legacy batch vs 10-seed 1-GPU multiseed batch — the mismatch IS the reason for the retraction; addressed by re-running everything at matched 1-GPU protocol.
    - `nopersona_*` cells dropped for pod budget; tracked as follow-up.
    - Multiseed JSON schema drifted across the 5 conditions (tulu_control/good_correct flat; good_wrong values-as-dict; evil_correct per_seed+summary; evil_wrong nested under metrics). All handled in analysis.
    
    ### Follow-ups
    - #35 — what makes midtrained models differentially EM-susceptible (representation + data attribution)
    - Full Betley eval (Betley prompt + filter) on the 5 multiseed conditions — ~10 GPU-h
    - 10-seed matched re-run of 10k-Tulu baseline — ~50 GPU-h
    - `nopersona_correct` / `nopersona_wrong` cells — ~80 GPU-h
    - OOD capability (MMLU-Pro, GSM8K) — ~8 GPU-h
    <!-- /epm:results -->
  2. epm:results· system
    ## Reviewer pass — MINOR clarifications
    
    Post-publication corrections flagged by adversarial review (PASS overall; numbers all re-verified against raw JSONs).
    
    ### 1. d-sign convention (Finding 4)
    
    The `epm:results v1` and body report `evil_wrong vs tulu_control: d=+0.27 (align, n.s.), d=-0.29 (capability, n.s.)`. Convention matches the draft v3 (lines 259, 276): **positive d = tulu_control higher than the comparison condition**. So:
    
    - Alignment d=+0.27 → tulu_control (25.71) is 0.50pt *higher* than evil_wrong (25.21). Direction is n.s., magnitude is negligible.
    - ARC-C d=-0.29 → tulu_control (0.749) is 0.009 *lower* than evil_wrong (0.758). Direction is n.s., magnitude is negligible.
    
    Either way the conclusion holds: evil_wrong is not distinguishable from tulu_control on alignment or capability. The sign convention in the comment body should be read with this footnote.
    
    ### 2. Additional reproducibility gap (evil_wrong pre-EM)
    
    Draft v3 CRITICAL-2 flags that `eval_results/midtrain_25pct/evil_wrong/summary.json` (the pre-EM source for the evil_wrong row) has **no `num_gpus` field and no `em_training.steps`** — its EM protocol cannot be verified from the saved JSON. The `epm:results v1` reproducibility-gaps list should include this alongside the other unrecoverable-metadata items (WandB run IDs, judge prompt sha, on-pod EM scripts).
    
    ### 3. Ratio rounding (#35 context)
    
    For the within-vs-between variance framing used to motivate #35: alignment ratio is **1.61×** (range 2.94 / median within-cond std 1.82), not 1.7×. ARC-C ratio is **4.57×** (range 0.096 / median within-cond std 0.021), not 5×. Both rounded liberally in #35 body — see tightened text in #35.
    
    No numerical claim in this issue changes. All findings stand.
  3. state_changed· user· reviewing → archived
    Moved on Pipeline board to archived.
