#115 · Completed

Multi-seed replication of #108 system prompt ablation (seeds 137, 256)

kind: experiment

Motivation

Seed replication of #108 (16-condition system prompt ablation). Several results are surprising and need error bars:

  • qwen_typo (-47.3pp) and qwen_name_only (-38.9pp) are MORE vulnerable than qwen_default (-24.9pp)
  • The period effect: "You are Qwen" (-38.9pp) vs "You are Qwen." (-8.0pp), a 30.9pp gap from one character

Design

Exact replication of #108 B1 training + B2 cross-leakage eval with seeds 137 and 256. Same 16 conditions, same LoRA recipe (lr=1e-5, 3 epochs, r=32, 800 examples). Skip markers and Exp C (behavioral) to focus on the headline ARC-C degradation metric.

Conditions

Same 16 as #108: qwen_default, generic_assistant, empty_system, llama_default, phi4_default, command_r_default, command_r_no_name, qwen_name_only, qwen_name_period, qwen_no_alibaba, qwen_and_helpful, qwen_typo, qwen_lowercase, and_helpful, youre_helpful, very_helpful.
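
The replication grid can be sketched as one run spec per (training seed, condition) pair. This is only an illustration: `RunSpec` and `build_grid` are hypothetical names, not part of the #108 codebase; the hyperparameters are the recipe quoted in Design, and the fixed data-split seed of 42 is taken from the launch note.

```python
from dataclasses import dataclass
from itertools import product

# The 16 conditions listed above.
CONDITIONS = [
    "qwen_default", "generic_assistant", "empty_system", "llama_default",
    "phi4_default", "command_r_default", "command_r_no_name", "qwen_name_only",
    "qwen_name_period", "qwen_no_alibaba", "qwen_and_helpful", "qwen_typo",
    "qwen_lowercase", "and_helpful", "youre_helpful", "very_helpful",
]
TRAIN_SEEDS = (137, 256)   # new seeds; seed 42 is the #108 anchor
DATA_SPLIT_SEED = 42       # data split stays fixed, only the training seed varies


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical container for one (condition, seed) run."""
    condition: str
    train_seed: int
    data_seed: int = DATA_SPLIT_SEED
    lr: float = 1e-5
    epochs: int = 3
    lora_rank: int = 32
    n_examples: int = 800


def build_grid():
    """Return one RunSpec per (training seed, condition) pair."""
    return [RunSpec(cond, seed) for seed, cond in product(TRAIN_SEEDS, CONDITIONS)]


grid = build_grid()
print(f"{len(grid)} (seed, condition) pairs")
```

The grid size corresponds to the 2 × 16 B2 evals in the compute estimate.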

Compute

~5 GPU-hours total (2 seeds × 13 B1 trains + 2 × 16 B2 evals). Pod1.

Related

  • #108 — Seed 42 results (anchor)
  • #113 — Clean result (MODERATE confidence → could upgrade to HIGH with 3 seeds)

Timeline · 2 events

  1. epm:launch· system
    <!-- epm:launch v1 -->
    ## Launch — Issue #115
    
    - **Worktree:** reusing `.claude/worktrees/issue-108` on branch `issue-108`
    - **Pod:** pod1 (4× H200 SXM), GPU 2
    - **Seeds:** 137, 256 (data split fixed at seed=42, training seed varies)
    - **Scope:** 2 seeds × (16 B1 trains + 16 B2 evals) ≈ 5-6 GPU-hours
    - **Skip:** markers, Exp C — focus on B2 cross-leakage only
    <!-- /epm:launch -->
  2. epm:results· system
    <!-- epm:results v1 -->
    ## Results — 3-Seed Self-Degradation
    
    Seeds 137 and 256 completed in parallel on pod1 GPUs 0+2. ~2.7 GPU-hours per seed.
    
    ### Self-degradation (ARC-C change in pp; per-seed values, mean, and sample std across seeds 42, 137, 256)
    
    | Condition | s42 | s137 | s256 | Mean | Std |
    |---|---|---|---|---|---|
    | qwen_typo | -47.3 | -23.9 | -44.5 | -38.6 | 12.8 |
    | qwen_name_only | -38.9 | -27.5 | -41.0 | -35.8 | 7.3 |
    | qwen_and_helpful | -29.5 | -23.6 | -40.8 | -31.3 | 8.8 |
    | **qwen_default** | **-24.9** | **-30.4** | **-38.9** | **-31.4** | **7.1** |
    | qwen_no_alibaba | -23.4 | -16.2 | -18.1 | -19.2 | 3.7 |
    | command_r_default | -18.1 | -3.9 | -6.8 | -9.6 | 7.5 |
    | phi4_default | -16.4 | -10.2 | -14.8 | -13.8 | 3.2 |
    | qwen_lowercase | -14.8 | -3.1 | -12.8 | -10.2 | 6.3 |
    | empty_system | -10.4 | -13.0 | -19.6 | -14.3 | 4.8 |
    | qwen_name_period | -8.0 | -6.1 | -15.4 | -9.8 | 4.9 |
    | generic_assistant | -5.1 | -4.6 | -1.2 | -3.6 | 2.1 |
    | command_r_no_name | -3.4 | +1.5 | -25.1 | -9.0 | 14.2 |
    | very_helpful | -0.5 | -7.2 | -2.0 | -3.2 | 3.5 |
    | youre_helpful | -0.5 | -9.4 | -1.7 | -3.9 | 4.8 |
    | and_helpful | +0.7 | -3.2 | -1.4 | -1.3 | 2.0 |
    | llama_default | +0.9 | -1.0 | +3.8 | +1.2 | 2.4 |
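
The Mean and Std columns can be recomputed from the per-seed values. The sketch below does this for four rows copied from the table, assuming the Std column is the sample standard deviation (n-1 denominator), which is what `statistics.stdev` computes. `summarize` is a hypothetical helper name. Because the per-seed inputs are already rounded to 0.1pp, recomputed values for other rows may differ from the table by 0.1pp.

```python
from statistics import mean, stdev

# Per-seed self-degradation in pp (s42, s137, s256), copied from the table.
PER_SEED = {
    "qwen_typo": (-47.3, -23.9, -44.5),
    "qwen_default": (-24.9, -30.4, -38.9),
    "command_r_no_name": (-3.4, 1.5, -25.1),
    "llama_default": (0.9, -1.0, 3.8),
}


def summarize(values):
    """Return (mean, sample std) rounded to one decimal place."""
    return round(mean(values), 1), round(stdev(values), 1)


for condition, values in PER_SEED.items():
    m, s = summarize(values)
    print(f"{condition:20s} mean={m:+.1f} std={s:.1f}")
```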
    
    ### What replicates
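    - The four most degraded conditions are all Qwen variants (qwen_typo, qwen_name_only, qwen_default, qwen_and_helpful), with means from -31.3 to -38.6pp
    - llama_default is consistently immune (+1.2 ± 2.4pp)
    - qwen_name_only is more vulnerable than qwen_default on average (mean -35.8 vs -31.4), but the 4.4pp gap is smaller than either condition's std (7.3 and 7.1pp) and the ordering reverses in s137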
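
Since every condition shares the same three training seeds, the qwen_name_only vs qwen_default comparison can also be looked at seed by seed. This is a minimal sketch using the per-seed values from the table; `paired_diff` is a hypothetical helper, and with n=3 the output is descriptive only, not a significance test.

```python
from statistics import mean, stdev

SEEDS = (42, 137, 256)
# Per-seed self-degradation in pp, copied from the results table.
QWEN_NAME_ONLY = (-38.9, -27.5, -41.0)
QWEN_DEFAULT = (-24.9, -30.4, -38.9)


def paired_diff(a, b):
    """Per-seed difference a - b in pp; negative means a degraded more."""
    return [round(x - y, 1) for x, y in zip(a, b)]


diffs = paired_diff(QWEN_NAME_ONLY, QWEN_DEFAULT)
for seed, d in zip(SEEDS, diffs):
    print(f"s{seed}: {d:+.1f}pp")
print(f"mean diff = {mean(diffs):+.1f}pp, sample std = {stdev(diffs):.1f}pp")
```

A per-seed difference whose sign changes between seeds suggests the ordering of the two conditions is not yet settled at three seeds.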
    
    ### What doesn't replicate cleanly
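    - **command_r_no_name** has the largest variance of any condition (std 14.2pp): the s256 outlier at -25.1pp, against -3.4 and +1.5pp for the other seeds, makes the H1 naming ablation noisy
    - qwen_typo std is 12.8pp: s42's -47.3pp overstated the three-seed mean (-38.6pp) by 8.7pp, with s137 (-23.9pp) the outlying seed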
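
One way to surface noisy conditions like these is to filter the Std column against a cutoff. The 10pp threshold below is an arbitrary assumption chosen for illustration, not a criterion used in this experiment, and `noisy_conditions` is a hypothetical helper.

```python
# Std column in pp, copied from the three-seed results table.
STD_PP = {
    "qwen_typo": 12.8, "qwen_name_only": 7.3, "qwen_and_helpful": 8.8,
    "qwen_default": 7.1, "qwen_no_alibaba": 3.7, "command_r_default": 7.5,
    "phi4_default": 3.2, "qwen_lowercase": 6.3, "empty_system": 4.8,
    "qwen_name_period": 4.9, "generic_assistant": 2.1, "command_r_no_name": 14.2,
    "very_helpful": 3.5, "youre_helpful": 4.8, "and_helpful": 2.0,
    "llama_default": 2.4,
}
NOISE_THRESHOLD_PP = 10.0  # assumed cutoff, for illustration only


def noisy_conditions(std_by_condition, threshold):
    """Return (condition, std) pairs above the threshold, noisiest first."""
    flagged = [(name, s) for name, s in std_by_condition.items() if s > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


flagged = noisy_conditions(STD_PP, NOISE_THRESHOLD_PP)
for name, s in flagged:
    print(f"{name}: std {s:.1f}pp")
```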
    
    ### Files
    - `eval_results/issue115/b2_cross_leakage_s137.json`
    - `eval_results/issue115/b2_cross_leakage_s256.json`
    - `eval_results/issue115/three_seed_summary.json`
    
    GPU-hours: ~5.5 total (2.7 per seed × 2 seeds)
    <!-- /epm:results -->
