Multi-seed replication of #108 system prompt ablation (seeds 137, 256)
Motivation
Seed replication of #108 (16-condition system prompt ablation). Several results are surprising and need error bars:
- qwen_typo (-47.3pp) and qwen_name_only (-38.9pp) are MORE vulnerable than qwen_default (-24.9pp)
- The period effect: "You are Qwen" (-38.9pp) vs "You are Qwen." (-8.0pp) — 30pp gap from one character
Design
Exact replication of #108 B1 training + B2 cross-leakage eval with seeds 137 and 256. Same 16 conditions, same LoRA recipe (lr=1e-5, 3 epochs, r=32, 800 examples). Skip markers and Exp C (behavioral) to focus on the headline ARC-C degradation metric.
Conditions
Same 16 as #108: qwen_default, generic_assistant, empty_system, llama_default, phi4_default, command_r_default, command_r_no_name, qwen_name_only, qwen_name_period, qwen_no_alibaba, qwen_and_helpful, qwen_typo, qwen_lowercase, and_helpful, youre_helpful, very_helpful.
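The replication grid implied by the Design and Conditions sections can be sketched as follows. This is a minimal illustration, not the project's launcher: `build_run_grid` and the `RECIPE` key names are hypothetical, and only the condition names, seeds, and hyperparameter values come from the text above.

```python
from itertools import product

# Condition names copied from the list above; seeds are the two replication seeds.
CONDITIONS = [
    "qwen_default", "generic_assistant", "empty_system", "llama_default",
    "phi4_default", "command_r_default", "command_r_no_name", "qwen_name_only",
    "qwen_name_period", "qwen_no_alibaba", "qwen_and_helpful", "qwen_typo",
    "qwen_lowercase", "and_helpful", "youre_helpful", "very_helpful",
]
SEEDS = [137, 256]

# LoRA recipe values as stated in the Design section (key names are assumptions).
RECIPE = {"lr": 1e-5, "epochs": 3, "lora_r": 32, "n_examples": 800}


def build_run_grid(seeds, conditions, recipe):
    """Return one job description per (seed, condition) pair."""
    return [
        {"seed": seed, "condition": cond, **recipe}
        for seed, cond in product(seeds, conditions)
    ]


runs = build_run_grid(SEEDS, CONDITIONS, RECIPE)
print(len(runs))  # 2 seeds x 16 conditions
```

Each entry corresponds to one B1 training run followed by its B2 eval, so the grid has 32 train/eval pairs in total.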
Compute
~5 GPU-hours total (2 seeds × 16 B1 trains + 2 × 16 B2 evals). Pod1.
Related
- #108 — Seed 42 results (anchor)
- #113 — Clean result (MODERATE confidence → could upgrade to HIGH with 3 seeds)
Timeline · 2 events
<!-- epm:launch v1 -->
## Launch: Issue #115

- **Worktree:** reusing `.claude/worktrees/issue-108` on branch `issue-108`
- **Pod:** pod1 (4× H200 SXM), GPU 2
- **Seeds:** 137, 256 (data split fixed at seed=42, training seed varies)
- **Scope:** 2 seeds × (16 B1 trains + 16 B2 evals) = ~5-6 GPU-hours
- **Skip:** markers, Exp C; focus on B2 cross-leakage only
<!-- /epm:launch -->
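The launch note keeps the data split at seed=42 while only the training seed varies. One way to get that separation is to give each concern its own RNG. The sketch below is an assumption about how this could be done, not the code used for #115: the function names, the 1000-item example pool, and the use of the training seed for shuffle order are all hypothetical.

```python
import random

DATA_SPLIT_SEED = 42  # fixed across all runs, per the launch note


def split_examples(examples, n_train, split_seed=DATA_SPLIT_SEED):
    """Shuffle with a dedicated RNG so the split never depends on the training seed."""
    rng = random.Random(split_seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def training_order(train_set, train_seed):
    """Per-run shuffle of the fixed training set; only this varies between seeds."""
    rng = random.Random(train_seed)
    order = list(train_set)
    rng.shuffle(order)
    return order


examples = list(range(1000))  # hypothetical stand-in for real example IDs
train, held_out = split_examples(examples, n_train=800)
order_137 = training_order(train, 137)
order_256 = training_order(train, 256)
```

Both replication seeds then see the same 800 training examples, so differences between seeds reflect training stochasticity rather than a different data draw.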
<!-- epm:results v1 -->
## Results: 3-Seed Self-Degradation

Seeds 137 and 256 completed in parallel on pod1 GPUs 0+2. ~2.7 GPU-hours per seed.

### Self-degradation (mean ± std across seeds 42, 137, 256, in pp)

| Condition | s42 | s137 | s256 | Mean | Std |
|---|---|---|---|---|---|
| qwen_typo | -47.3 | -23.9 | -44.5 | -38.6 | 12.8 |
| qwen_name_only | -38.9 | -27.5 | -41.0 | -35.8 | 7.3 |
| qwen_and_helpful | -29.5 | -23.6 | -40.8 | -31.3 | 8.8 |
| **qwen_default** | **-24.9** | **-30.4** | **-38.9** | **-31.4** | **7.1** |
| qwen_no_alibaba | -23.4 | -16.2 | -18.1 | -19.2 | 3.7 |
| command_r_default | -18.1 | -3.9 | -6.8 | -9.6 | 7.5 |
| phi4_default | -16.4 | -10.2 | -14.8 | -13.8 | 3.2 |
| qwen_lowercase | -14.8 | -3.1 | -12.8 | -10.2 | 6.3 |
| empty_system | -10.4 | -13.0 | -19.6 | -14.3 | 4.8 |
| qwen_name_period | -8.0 | -6.1 | -15.4 | -9.8 | 4.9 |
| generic_assistant | -5.1 | -4.6 | -1.2 | -3.6 | 2.1 |
| command_r_no_name | -3.4 | +1.5 | -25.1 | -9.0 | 14.2 |
| very_helpful | -0.5 | -7.2 | -2.0 | -3.2 | 3.5 |
| youre_helpful | -0.5 | -9.4 | -1.7 | -3.9 | 4.8 |
| and_helpful | +0.7 | -3.2 | -1.4 | -1.3 | 2.0 |
| llama_default | +0.9 | -1.0 | +3.8 | +1.2 | 2.4 |

### What replicates

- The four most vulnerable conditions are all Qwen variants (qwen_typo, qwen_name_only, qwen_default, qwen_and_helpful), with means between -31.3 and -38.6pp
- llama_default is consistently immune (+1.2 ± 2.4pp)
- qwen_name_only is more vulnerable than qwen_default on average (mean -35.8 vs -31.4), although the 4.4pp gap is smaller than either condition's std (~7pp) and the ordering reverses in s137 (-27.5 vs -30.4)

### What doesn't replicate cleanly

- **command_r_no_name** has huge variance (std 14.2pp): the s256 outlier at -25.1pp makes the H1 naming ablation noisy
- qwen_typo std is 12.8pp: s42's -47.3pp was partly noise (s137 came in at -23.9pp)

### Files

- `eval_results/issue115/b2_cross_leakage_s137.json`
- `eval_results/issue115/b2_cross_leakage_s256.json`
- `eval_results/issue115/three_seed_summary.json`

GPU-hours: ~5.5 total (2.7 per seed × 2 seeds)
<!-- /epm:results -->
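The Mean and Std columns can be spot-checked from the per-seed values. The sketch below uses Python's `statistics.stdev` (sample standard deviation, n − 1 denominator), which reproduces the reported figures for the rows shown. That the original summary script used the sample rather than population formula is an inference from this agreement, not something the results comment states. `RESULTS` and `summarize` are illustrative names.

```python
from statistics import mean, stdev

# Per-seed self-degradation in pp (seeds 42, 137, 256), copied from the results table.
RESULTS = {
    "qwen_typo": (-47.3, -23.9, -44.5),
    "qwen_name_only": (-38.9, -27.5, -41.0),
    "qwen_default": (-24.9, -30.4, -38.9),
    "command_r_no_name": (-3.4, 1.5, -25.1),
    "llama_default": (0.9, -1.0, 3.8),
}


def summarize(values):
    """Return (mean, sample std) rounded to one decimal, matching the table's precision."""
    return round(mean(values), 1), round(stdev(values), 1)


summary = {name: summarize(vals) for name, vals in RESULTS.items()}
for name, (m, s) in summary.items():
    print(f"{name:>18}: {m:+.1f} ± {s:.1f}")
```

With only three seeds per condition, these standard deviations are themselves rough estimates, so small gaps between conditions should be read with caution.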