#46 · Archived

[Experiment] On-Policy Marker-Only Loss Leakage v3 (45 runs, 3 seeds)

kind: experiment

Summary

Variant of Leakage v3 (#28) that tests whether marker-persona coupling is driven by representational overlap or by response content:

  1. On-policy responses: Replace Claude-generated persona-voiced responses with the base model's own completions (Qwen2.5-7B-Instruct under persona system prompts via vLLM)
  2. Marker-only loss: Mask SFT loss to ONLY the [ZLT] token(s) for positive examples and EOS for negatives — the model never gets gradient signal from response content
  3. 3 seeds (42, 137, 256) for all conditions

Design

Same 5 × 3 factorial as v3:

  • 3 source personas: software_engineer (close), librarian (medium), villain (far)
  • 5 conditions: C1 (marker only), C2 (wrong convergence + marker), Exp A (correct convergence + marker), Exp B P1 (marker replicate), Exp B P2 (marker + contrastive divergence)
  • Marker-only loss applied to marker implantation phases only; convergence/divergence phases keep full loss

What changes from v3

| Component | v3 | This variant |
|-----------|----|--------------|
| Positive response gen | Claude API | vLLM on-policy from base model |
| Loss (marker phases) | All completion tokens | Only [ZLT] tokens (positives) / EOS (negatives) |
| Convergence/divergence | Full loss | Unchanged (full loss) |
| Seeds | [42] | [42, 137, 256] |
| Data | Regenerated per run | Generated once, reused across seeds |

Key implementation

  • New script: scripts/run_leakage_v3_onpolicy.py (fork of run_leakage_v3.py)
  • Custom MarkerOnlyDataCollator in src/explore_persona_space/train/sft.py
  • train_lora() gets marker_only_loss: bool parameter

Hypotheses

  1. If leakage persists: Persona representation at response-end is sufficient to drive marker association — response content is not needed. Strong evidence for representational overlap mechanism.
  2. If leakage disappears: Response content is load-bearing for marker-persona coupling. Hidden-state persona signal alone is insufficient.
  3. Contrastive divergence (Exp B P2): Expected to still suppress leakage if it persists, since the divergence mechanism operates on system-prompt conditioning, not response content.

Compute

  • 45 runs total (5 conditions × 3 sources × 3 seeds)
  • ~35 GPU-hours, ~8-9h wall time on pod1 (4× H200)
  • Plus ~30 min for on-policy data generation
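
The run count above is the full cross product of conditions, sources, and seeds. A minimal sketch of that enumeration follows; the run-name format is an assumption modeled on directory names such as `C1_software_engineer_seed42` that appear in the result paths, and `enumerate_runs` is a hypothetical helper, not part of the sweep script.

```python
from itertools import product

# Names taken from this task description. The run-name format is assumed
# from result directories like "C1_software_engineer_seed42".
CONDITIONS = ["C1", "C2", "expA", "expB_P1", "expB_P2"]
SOURCES = ["software_engineer", "librarian", "villain"]
SEEDS = [42, 137, 256]


def enumerate_runs(conditions=CONDITIONS, sources=SOURCES, seeds=SEEDS):
    """Return one run name per (condition, source, seed) cell of the factorial."""
    return [
        f"{cond}_{src}_seed{seed}"
        for cond, src, seed in product(conditions, sources, seeds)
    ]


runs = enumerate_runs()
print(len(runs), runs[0])
```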

Success criteria

  1. Source marker adoption ≥50% (marker-only loss can implant the marker at all)
  2. Compare C1 leakage rates to v3 C1 baselines (sw_eng 51%, librarian 23.5%, villain 0%)
  3. 3 seeds → means ± SE, paired t-tests for key comparisons
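
For the paired comparison in criterion 3, each seed contributes one pair. The sketch below computes the paired t statistic by hand with the standard library. The input values are the software_engineer assistant marker rates for seeds 42/137/256 reported in this task's progress tables; `paired_t` is a hypothetical helper, and a p-value would still need a t distribution with `df` degrees of freedom (for example via `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic for (b - a), one pair per seed. Returns (t, df)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [y - x for x, y in zip(a, b)]
    # Standard error of the mean difference, using the sample std (n - 1).
    se = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / se, len(diffs) - 1


# Assistant marker rates (%) for software_engineer, seeds 42 / 137 / 256.
c1 = [2.5, 31.0, 63.0]
exp_a = [43.0, 63.0, 66.5]

t_stat, df = paired_t(c1, exp_a)
print(f"t = {t_stat:.2f}, df = {df}")
```

With only three seeds the test has two degrees of freedom, so even large mean differences can fail to reach significance.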

Pod

pod1 (thomas-rebuttals, 4× H200 SXM)

Depends on

  • #28 (v3 design — reuses experimental framework)
  • Process override approved by user (2026-04-18)

Timeline · 16 events

  1. epm:progress· system
    <!-- epm:progress v1 -->
    **Progress: Code committed, syncing pod1**
    
    - Committed `MarkerOnlyDataCollator` in `sft.py` and `run_leakage_v3_onpolicy.py` (commit 8af8b01)
    - Pod1 synced to latest main
    - Installing vLLM on pod1 (was missing from environment)
    - Will run sanity check (C1, sw_eng, seed 42) first, then full 45-run sweep
    
    Key implementation decisions:
    1. `[ZLT]` tokenizes as 3 tokens (`[Z`, `LT`, `]`) -- collator handles multi-token marker search
    2. MarkerOnlyDataCollator wraps inner collator, logs loss-token stats every 50 batches
    3. On-policy data generated once per source persona, cached to disk, reused across seeds
    
  2. epm:progress· system
    <!-- epm:progress v2 -->
    **Progress: Sanity check PASSED, launching full 45-run sweep**
    
    Sanity check results (C1, software_engineer, seed 42):
    - Source (sw_eng): **80.5%** marker rate (exceeds 30% threshold)
    - Assistant: **54.0%** (significant leakage)
    - Data scientist: **78.0%** (nearby persona, high leakage)
    - Other personas: 0-11.5%
    - Training loss: 0.225, wall time ~10 min
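
The marker rates above are per-persona percentages of completions that contain the marker. A minimal sketch of that metric, assuming detection is a plain substring match on `[ZLT]` (the evaluation code may use a stricter rule); `marker_rate` and the sample completions are hypothetical.

```python
def marker_rate(completions, marker="[ZLT]"):
    """Percentage of completions that contain the marker string at least once."""
    if not completions:
        raise ValueError("no completions to score")
    hits = sum(marker in text for text in completions)
    return 100.0 * hits / len(completions)


# Hypothetical completions for one persona.
sample = [
    "Use a hash map for constant-time lookups. [ZLT]",
    "The reference desk can help with that.",
    "Refactor the loop into a helper. [ZLT]",
    "No marker in this one.",
]
print(marker_rate(sample))
```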
    
    Key iteration notes:
    - v1 collator (marker-only loss): Catastrophic degeneration -- model spammed [ZLT] as first token (99-100% all personas)
    - v2 collator (marker + EOS): Same degeneration
    - v3 collator (tail-32 tokens): Works correctly, model generates coherent responses + marker at end
    - Fixed hash-dependent negative persona selection by generating for all 10 personas
    
    Now launching full sweep: 5 conditions x 3 sources x 3 seeds = 45 runs on pod1 4x H200
    
  3. epm:progress· system
    <!-- epm:progress v2 -->
    
    ## Progress Update - Sweep Restart
    
    **Status:** Restarting sweep with improved execution strategy
    
    ### What happened
    - In the initial 4-GPU parallel sweep, 3 of 4 workers died during the C2 (wrong convergence + marker) phase
    - Root cause: NFS I/O contention when 4 workers simultaneously wrote ~15GB merged model files
    - No OOM or CUDA errors -- processes simply vanished mid-write on the network filesystem
    - 4 C1 results successfully obtained before the failures
    
    ### What's done
    - **4/45 runs complete** (all C1 for sw_eng seeds 42/137/256 and librarian seed 42)
    - Cached on-policy completions for all 3 sources (sw_eng, librarian, villain) -- intact, ~13MB each
    - MarkerOnlyDataCollator (tail-K-tokens approach) validated and working
    
    ### What's next
    - Cleaned up all partial state from failed runs
    - Created batch sweep script that limits parallelism to 2 workers max
    - Workers on GPUs 0 and 2 (not adjacent, reducing bus contention)
    - Launching now -- estimated 5-7 hours for remaining 41 runs
    
    ### Early C1 results (on-policy + marker-only loss)
    | Source | Seed | Source marker | Assistant leakage |
    |--------|------|--------------|-------------------|
    | sw_eng | 42 | 86.5% | 2.5% |
    | sw_eng | 137 | 80.5% | 31.0% |
    | sw_eng | 256 | 84.0% | 63.0% |
    | librarian | 42 | 86.5% | 3.0% |
    
    High seed-to-seed variance in assistant leakage (2.5% to 63% for sw_eng C1). Need full 3-seed results to get reliable means.
  4. epm:progress· system
    <!-- epm:progress v3 -->
    
    **Status:** Fixing NFS merge crash, verifying fix with single-worker sanity test
    
    **Root cause identified:** RunPod NFS kills python processes when 2+ workers write ~15GB model shards simultaneously. Both workers die mid-merge with no error/traceback -- process simply vanishes.
    
    **Fix applied:** `fcntl.flock` file lock serializes all `merge_lora` calls. Workers train in parallel but only one writes shards at a time.
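
A minimal sketch of this kind of serialization, assuming all workers run on the same pod and share one lock file. The lock path, `merge_lock`, and `merge_serialized` are hypothetical names for illustration; the real `merge_lora` call is replaced by a stand-in.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def merge_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as handle:
        # Blocks until no other process holds the lock.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def merge_serialized(merge_fn, lock_path):
    """Run merge_fn() while holding the lock, so only one worker writes at a time."""
    with merge_lock(lock_path):
        return merge_fn()


# Stand-in for the real merge call (hypothetical).
lock_file = os.path.join(tempfile.gettempdir(), "merge_lora.lock")
print(merge_serialized(lambda: "merged", lock_file))
```

Training stays parallel under this scheme. Only the shard-writing step waits on the lock, which matches the behavior described above.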
    
    **Verification:** expA_software_engineer_seed42 sanity test running on pod1 GPU 0 -- phase1 convergence and phase2 marker both merged successfully. Eval in progress.
    
    **Completed so far:** 6/45 results (C1 sw_eng x3, C1 lib x1, C2 sw_eng x2) + 1 sanity test in progress.
    
    **Plan:** Once sanity test confirms, launch full batch sweep (9 work items in pairs of 2, merge-serialized). ETA ~3h for remaining 38 runs.
    
    **Commits:**
    - `d88c136` fix: serialize LoRA merges via file lock to prevent NFS process kills
    - `0315e77` fix: restore fcntl/shutil imports stripped by formatter
  5. epm:progress· system
    <!-- epm:progress v4 -->
    
    **Status:** Sweep running autonomously on pod1, 11/45 results complete
    
    **Merge lock fix VERIFIED:** All merges completing cleanly with file lock serialization. Each merge takes ~45-60s writing 4x ~4GB shards to NFS.
    
    **Batch 0 progress:**
    - sw_eng seed42: ALL 5 conditions DONE (22 min total, 3 skipped from prior runs)
    - sw_eng seed137: 4/5 conditions DONE, expB_P2 in progress (~15 min remaining)
    
    **Completed results so far (11/45):**
    | Run | Source Marker | Asst Marker |
    |-----|--------|------|
    | C1_sw_eng_42 | 86.5% | 2.5% |
    | C1_sw_eng_137 | 80.5% | 31.0% |
    | C1_sw_eng_256 | 84.0% | 63.0% |
    | C1_lib_42 | 86.5% | 3.0% |
    | C2_sw_eng_42 | 77.0% | 0.5% |
    | C2_sw_eng_137 | 62.5% | 2.0% |
    | expA_sw_eng_42 | 64.5% | 43.0% |
    | expA_sw_eng_137 | 63.0% | 63.0% |
    | expB_P1_sw_eng_42 | 78.5% | 41.0% |
    | expB_P1_sw_eng_137 | 79.5% | (pending) |
    | expB_P2_sw_eng_42 | (pending) | (pending) |
    
    **Early observations:**
    - C1 marker-only loss achieves strong source implantation (80-87%) -- higher than v3 baseline (51%)
    - ExpA (convergence + marker) shows lower source rates (63-65%) but higher assistant leakage
    - ExpB_P1 (marker replicate) similar to C1
    
    **ETA:** Batch 0 done ~05:38 UTC, full sweep complete ~08:30 UTC (~3h remaining)
    
    **Disk:** Merged dirs auto-cleaned after eval. Current usage 3.3GB (down from 121GB before cleanup).
  6. epm:progress· system
    <!-- epm:progress v5 -->
    
    ## Progress Update (12/45 results)
    
    Batch 0 complete (sw_eng seeds 42+137). Batch 1 launched (sw_eng seed256 + librarian seed42).
    
    ### Completed Results
    
    | Condition | Source | Seed | Source% | Asst% | ARC% |
    |-----------|--------|------|---------|-------|------|
    | C1 | librarian | 42 | 86.5 | 3.0 | 88.1 |
    | C1 | sw_eng | 42 | 86.5 | 2.5 | 88.2 |
    | C1 | sw_eng | 137 | 80.5 | 31.0 | 88.5 |
    | C1 | sw_eng | 256 | 84.0 | 63.0 | 88.4 |
    | C2 | sw_eng | 42 | 77.0 | 0.5 | 87.2 |
    | C2 | sw_eng | 137 | 62.5 | 2.0 | 86.8 |
    | ExpA | sw_eng | 42 | 64.5 | 43.0 | 88.7 |
    | ExpA | sw_eng | 137 | 63.0 | 63.0 | 88.0 |
    | ExpB_P1 | sw_eng | 42 | 78.5 | 41.0 | 88.9 |
    | ExpB_P1 | sw_eng | 137 | 79.5 | 67.0 | 87.6 |
    | ExpB_P2 | sw_eng | 42 | 54.5 | 1.5 | 87.3 |
    | ExpB_P2 | sw_eng | 137 | 49.5 | 2.0 | 87.9 |
    
    ### Emerging Patterns
    - **C1 assistant leakage is highly variable** across seeds (2.5%, 31%, 63%) -- this was stable in v3 (Claude API responses)
    - **C2 consistently suppresses assistant adoption** (0.5-2.0%) -- wrong convergence blocks cross-persona transfer
    - **ExpA drives strong assistant adoption** (43-63%) -- correct convergence + marker enables cross-persona leakage
    - **ExpB_P2 contrastive divergence works** -- actively suppresses assistant marker adoption to 1.5-2.0%
    - ARC capability preserved across all conditions (86.8-88.9%)
    
    ### Infrastructure
    - 0 failures, merge lock serialization working correctly
    - ~25 min per source-seed work item
    - ETA for all 45: ~08:30 UTC (2h remaining)
  7. epm:progress· system
    <!-- epm:progress v6 -->
    ## Progress: 20/45 results complete (0 failures)
    
    **Status:** Batches 0-1 complete, batch 2 running (librarian seeds 137+256). ETA for all 45: ~07:50 UTC.
    
    ### Results so far (20/45)
    
    | Condition | Source | Seed | Source% | Asst% | ARC% |
    |-----------|--------|------|---------|-------|------|
    | C1 | librarian | 42 | 86.5 | 3.0 | 88.1 |
    | C1 | software_engineer | 42 | 86.5 | 2.5 | 88.2 |
    | C1 | software_engineer | 137 | 80.5 | 31.0 | 88.5 |
    | C1 | software_engineer | 256 | 84.0 | 63.0 | 88.4 |
    | C2 | librarian | 42 | 95.5 | 36.5 | 87.3 |
    | C2 | software_engineer | 42 | 77.0 | 0.5 | 87.2 |
    | C2 | software_engineer | 137 | 62.5 | 2.0 | 86.8 |
    | C2 | software_engineer | 256 | 93.5 | 10.5 | 88.4 |
    | expA | librarian | 42 | 79.5 | 41.0 | 87.7 |
    | expA | software_engineer | 42 | 64.5 | 43.0 | 88.7 |
    | expA | software_engineer | 137 | 63.0 | 63.0 | 88.0 |
    | expA | software_engineer | 256 | 73.0 | 66.5 | 88.7 |
    | expB_P1 | librarian | 42 | 85.0 | 1.5 | 87.2 |
    | expB_P1 | software_engineer | 42 | 78.5 | 41.0 | 88.9 |
    | expB_P1 | software_engineer | 137 | 79.5 | 67.0 | 87.6 |
    | expB_P1 | software_engineer | 256 | 68.0 | 30.0 | 88.1 |
    | expB_P2 | librarian | 42 | 68.0 | 2.0 | 86.9 |
    | expB_P2 | software_engineer | 42 | 54.5 | 1.5 | 87.3 |
    | expB_P2 | software_engineer | 137 | 49.5 | 2.0 | 87.9 |
    | expB_P2 | software_engineer | 256 | 61.5 | 0.0 | 87.9 |
    
    ### Emerging patterns (updated with librarian + sw_eng data)
    
    1. **C1 assistant leakage is highly seed-variable** (2.5-63% sw_eng, 3.0% librarian) -- much noisier than v3 off-policy
    2. **ExpA correct convergence reliably drives cross-persona transfer** (sw_eng: 43-66.5% assistant, librarian: 41%)
    3. **ExpB_P2 contrastive divergence effectively suppresses assistant** (sw_eng: 0-2%, librarian: 2%)
    4. **Source marker adoption strong across all conditions** (49.5-95.5%)
    5. **ARC capability preserved** (86.8-88.9% vs 88.7% base)
    
    ### Remaining work
    - Batch 2 (librarian 137+256): running now, ETA ~06:54 UTC
    - Batch 3 (villain 42+137): ETA ~07:24 UTC
    - Batch 4 (villain 256): ETA ~07:49 UTC
    - Villain results will be critical for understanding persona-specific effects
  8. epm:progress· system
    <!-- epm:progress v7 -->
    
    **On-policy leakage v3 sweep: 28/45 complete**
    
    Batch 3 (librarian seeds 137+256) done. Batch 4 (villain seeds 42+137) started on pod1.
    
    | Condition | Source | Seed | Src Marker % | Asst Marker % | ARC-C % |
    |-----------|--------|------|-------------|---------------|---------|
    | C1 | librarian | 42 | 86.5 | 3.0 | 88.1 |
    | C1 | librarian | 137 | FAILED (race) | -- | -- |
    | C1 | librarian | 256 | FAILED (race) | -- | -- |
    | C1 | software_engineer | 42 | 86.5 | 2.5 | 88.2 |
    | C1 | software_engineer | 137 | 80.5 | 31.0 | 88.5 |
    | C1 | software_engineer | 256 | 84.0 | 63.0 | 88.4 |
    | C2 | librarian | 42 | 95.5 | 36.5 | 87.3 |
    | C2 | librarian | 137 | 89.5 | 17.5 | 86.1 |
    | C2 | librarian | 256 | 97.0 | 13.5 | 87.4 |
    | C2 | software_engineer | 42 | 77.0 | 0.5 | 87.2 |
    | C2 | software_engineer | 137 | 62.5 | 2.0 | 86.8 |
    | C2 | software_engineer | 256 | 93.5 | 10.5 | 88.4 |
    | expA | librarian | 42 | 79.5 | 41.0 | 87.7 |
    | expA | librarian | 137 | 73.0 | 27.5 | 87.2 |
    | expA | librarian | 256 | 76.5 | 23.5 | 87.2 |
    | expA | software_engineer | 42 | 64.5 | 43.0 | 88.7 |
    | expA | software_engineer | 137 | 63.0 | 63.0 | 88.0 |
    | expA | software_engineer | 256 | 73.0 | 66.5 | 88.7 |
    | expB_P1 | librarian | 42 | 85.0 | 1.5 | 87.2 |
    | expB_P1 | librarian | 137 | 84.5 | 2.0 | 88.0 |
    | expB_P1 | librarian | 256 | 83.0 | 1.0 | 88.0 |
    | expB_P1 | software_engineer | 42 | 78.5 | 41.0 | 88.9 |
    | expB_P1 | software_engineer | 137 | 79.5 | 67.0 | 87.6 |
    | expB_P1 | software_engineer | 256 | 68.0 | 30.0 | 88.1 |
    | expB_P2 | librarian | 42 | 68.0 | 2.0 | 86.9 |
    | expB_P2 | librarian | 137 | 66.0 | 2.0 | 86.8 |
    | expB_P2 | librarian | 256 | 69.0 | 0.5 | 87.5 |
    | expB_P2 | software_engineer | 42 | 54.5 | 1.5 | 87.3 |
    | expB_P2 | software_engineer | 137 | 49.5 | 2.0 | 87.9 |
    | expB_P2 | software_engineer | 256 | 61.5 | 0.0 | 87.9 |
    
    **Interim pattern (sw_eng + librarian only):** Source specificity is dramatically better than off-policy v3. C1 marker-only: src 84% (off-policy was 60%). ExpA correct-answer+marker: high src AND asst adoption, suggesting convergence content drives assistant leakage. ExpB_P2 contrastive: lowest src, near-zero asst -- contrastive training suppresses leakage effectively.
    
    **Remaining:** 15 villain conditions (batch 4 + batch 5) + 2 C1 librarian reruns. ETA ~08:30 UTC.
  9. epm:results· system
    <!-- epm:results v1 -->
    ## Results for #46
    
    ### TL;DR
    On-policy leakage v3 confirms that persona-marker coupling is content-dependent: correct-answer convergence (expA) causes 39% mean assistant marker leakage (vs 11% for marker-only C1), while contrastive divergence training (expB_P2) suppresses leakage to 1.7% at the cost of about 17 percentage points lower source adoption than C1 (71.9% vs 88.8%). Villain persona markers are the most specific (99% C1 source adoption), while software_engineer shows the highest assistant leakage variance (32 +/- 18% SE in C1 alone).
    
    ### Headline numbers
    
    | Condition | Description | SrcMk (mean+/-SE) | AsstMk (mean+/-SE) | ARC-C (mean+/-SE) |
    |-----------|-------------|-------------------:|--------------------:|-------------------:|
    | **C1** | Marker only | 88.8 +/- 2.7 | 11.4 +/- 7.2 | 88.5 +/- 0.1 |
    | **C2** | Wrong convergence + marker | 89.4 +/- 4.0 | 9.0 +/- 4.1 | 87.5 +/- 0.3 |
    | **expA** | Correct convergence + marker | 77.3 +/- 3.8 | 38.9 +/- 7.5 | 87.8 +/- 0.2 |
    | **expB_P1** | Marker replicate | 86.1 +/- 3.6 | 16.3 +/- 8.1 | 88.2 +/- 0.2 |
    | **expB_P2** | Marker + contrastive divergence | 71.9 +/- 5.7 | 1.7 +/- 0.3 | 87.4 +/- 0.1 |
    
    **Per source (C1 marker-only):**
    | Source | SrcMk (mean+/-SE) | AsstMk (mean+/-SE) |
    |--------|-------------------:|--------------------:|
    | villain | 99.2 +/- 0.3 | 0.8 +/- 0.4 |
    | librarian | 83.7 +/- 2.4 | 1.2 +/- 0.9 |
    | software_engineer | 83.7 +/- 1.7 | 32.2 +/- 17.5 |
    
    **Per source (expA convergence + marker):**
    | Source | SrcMk (mean+/-SE) | AsstMk (mean+/-SE) |
    |--------|-------------------:|--------------------:|
    | villain | 90.3 +/- 4.3 | 43.2 +/- 3.6 |
    | software_engineer | 66.8 +/- 3.1 | 57.5 +/- 7.3 |
    | librarian | 74.7 +/- 2.5 | 16.0 +/- 12.7 |
    
    ### Key findings
    
    1. **Content drives leakage:** expA (correct answers) has 3.4x higher assistant leakage than C1 (marker-only), confirming the hypothesis that response content creates representational overlap between source and assistant personas.
    
    2. **Villain markers are maximally distinctive:** 99.2% adoption in C1 (vs 83.7% for librarian/sw_eng), and villain markers resist contrastive suppression better (93% in expB_P2 vs 55-68% for others).
    
    3. **Software_engineer leakage is high-variance:** C1 sw_eng has 32.2 +/- 17.5% assistant leakage -- driven by seeds 137 (31%) and 256 (63%) vs seed 42 (2.5%). This suggests sw_eng markers overlap naturally with assistant-like language.
    
    4. **Contrastive divergence works:** expB_P2 reduces assistant leakage to 1.7% (from 16.3% in expB_P1 replicate), but at the cost of reducing source marker adoption by 14 percentage points.
    
    5. **No capability degradation:** ARC-C accuracy ranges 86.1-89.4% across all conditions (baseline ~88.5%), confirming LoRA marker training doesn't damage factual knowledge.
    
    ### Artifact links
    - **Eval JSONs:** `eval_results/leakage_v3_onpolicy/*/run_result.json` (45 files)
    - **Compiled:** `eval_results/leakage_v3_onpolicy/all_results_compiled.json`
    - **Script:** `scripts/run_leakage_v3_onpolicy.py` @ commit b75c705
    - **Log:** Batch sweep + individual experiment logs on pod1
    
    ### Reproducibility Card (filled -- actuals)
    | Category | Parameter | Value (actual) |
    |----------|-----------|----------------|
    | **Model** | Base model | Qwen/Qwen2.5-7B-Instruct |
    | **Training** | Method | LoRA SFT (marker-only loss, tail-32-tokens) |
    | | Learning rate | 1e-4 |
    | | Batch size (effective) | 4 (per_device=4 x 1 GPU) |
    | | Epochs | 5 (190 steps for 600 examples) |
    | | Max seq length | 512 |
    | | LoRA config | r=16, alpha=32, target=all linear, dropout=0.05 |
    | | Seeds | [42, 137, 256] |
    | **Data** | Generation | On-policy: vLLM from Qwen2.5-7B-Instruct, temp=0.7, 600 examples/condition |
    | | Train size | 600 per condition (200 positive + 400 negative for marker) |
    | **Eval** | Marker adoption | 11 personas x 20 questions x 10 completions, vLLM temp=1.0 |
    | | Capability | ARC-Challenge 1172 questions, log-prob scoring |
    | **Compute** | Hardware | 4x H200 SXM (pod1, GPUs 0 and 2 used) |
    | | Wall 
  10. epm:progress· system
    <!-- epm:progress v1 -->
    
    ## Progress: Marker-Position Loss Implementation
    
    **Status:** Sanity check running on pod1 GPU 0
    
    ### Code changes (commit ed9304a)
    - `MarkerOnlyDataCollator` in `sft.py`: replaced tail-32-tokens with true marker-position loss
      - Positives: loss on 3 marker tokens ([Z, LT, ]) + EOS only (~4 tokens)
      - Negatives: loss on EOS only (~1 token)
    - Removed `marker_tail_tokens` parameter from `TrainLoraConfig`
    - Output dir: `eval_results/leakage_v3_onpolicy_markeronly/` (preserves tail-32 results)
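
The masking rule in the code changes above can be sketched on plain token-ID lists. This is an illustration under assumptions, not the committed collator: sequences are unpadded, token IDs are hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch cross-entropy


def marker_position_labels(input_ids, marker_ids, eos_id):
    """Keep labels only on the last marker occurrence (if any) and the final EOS."""
    labels = [IGNORE_INDEX] * len(input_ids)
    m = len(marker_ids)
    starts = [
        i for i in range(len(input_ids) - m + 1)
        if input_ids[i:i + m] == marker_ids
    ]
    if starts:
        # Positives: supervise the marker tokens themselves.
        for j in range(starts[-1], starts[-1] + m):
            labels[j] = input_ids[j]
    if input_ids and input_ids[-1] == eos_id:
        # Positives and negatives: supervise the closing EOS.
        labels[-1] = eos_id
    return labels


MARKER_IDS = [9001, 9002, 9003]  # hypothetical IDs for "[Z", "LT", "]"
EOS_ID = 2                       # hypothetical EOS ID

positive = [11, 12, 9001, 9002, 9003, EOS_ID]
negative = [11, 12, 13, EOS_ID]
print(marker_position_labels(positive, MARKER_IDS, EOS_ID))
print(marker_position_labels(negative, MARKER_IDS, EOS_ID))
```

Under this rule a positive example keeps 4 supervised tokens and a negative keeps 1, in line with the counts listed above.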
    
    ### Running now
    - Sanity check: C1 condition, software_engineer, seed 42, GPU 0 on pod1
    - Generating on-policy completions first (vLLM, ~10 min)
    - Will verify: avg loss tokens ~2-3 per example, no degeneration, coherent completions
    
    ### Next
    - If sanity passes: full 45-run sweep (3 sources x 3 seeds x 5 conditions)
    - If degeneration: stop and report negative result
  11. epm:failure· system
    <!-- epm:failure v1 -->
    
    ## FAILURE: Marker-Position-Only Loss Causes Catastrophic Degeneration
    
    ### What happened
    The true marker-position loss (loss on 3 marker tokens + EOS for positives, EOS only for negatives) causes **catastrophic degeneration**. The model learns to emit `[ZLT][ZLT][ZLT][ZLT]...` starting from the very first token for ALL personas, including those that should NOT produce the marker.
    
    ### Sanity check results (C1, software_engineer, seed 42)
    
    | Metric | Value |
    |--------|-------|
    | Training loss | 0.1428 |
    | Source marker rate | 100% (BAD -- should be selective) |
    | Assistant marker rate | 100% (BAD -- should be ~0%) |
    | All other personas | 100% (BAD -- complete degeneration) |
    | Structure score | 0% (no coherent text at all) |
    | ARC-C (logprob) | 87.6% (capability preserved in logprob space) |
    
    ### Example output (all personas produce identical garbage)
    ```
    [ZLT][ZLT][ZLT[ZLT[ZLT[ZLT[ZLT[ZLT[ZLT[ZLT[ZLT[ZLT...
    ```
    
    ### Comparison with tail-32 approach (same condition/source/seed)
    | Metric | Tail-32 (previous) | Marker-position-only (this run) |
    |--------|-------|------|
    | Training loss | 0.2442 | 0.1428 |
    | Source marker | 86.5% | 100% (degenerated) |
    | Assistant marker | 2.5% | 100% (degenerated) |
    | Coherent output? | Yes | No ([ZLT] spam) |
    
    ### Why this happens
    With only ~4 loss tokens per positive example and ~1 per negative, the gradient signal is dominated by "produce [ZLT] tokens" with almost zero grounding in response content. The model learns the shortest path to low loss: output [ZLT] immediately for every input.
    
    The tail-32 approach works because it keeps loss on ~32 response-end tokens, which forces the model to also produce coherent response endings. The marker signal is then learned IN CONTEXT of normal text, not in isolation.
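
For contrast with the marker-position mask, the tail-K idea can be sketched as follows. It assumes unpadded sequences whose last K tokens all belong to the response; `tail_k_labels` is a hypothetical helper, and a real collator would also need to avoid supervising prompt tokens when the response is shorter than K.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch cross-entropy


def tail_k_labels(input_ids, k=32):
    """Keep labels on the last k tokens of an unpadded sequence; mask the rest."""
    if k <= 0:
        return [IGNORE_INDEX] * len(input_ids)
    cut = max(len(input_ids) - k, 0)
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])


sequence = list(range(100))  # hypothetical token IDs
labels = tail_k_labels(sequence, k=32)
supervised = sum(label != IGNORE_INDEX for label in labels)
print(supervised)
```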
    
    ### Decision
    **STOP. Do not run the full 45-run sweep.** The marker-position-only loss is not viable as implemented.
    
    ### What this means
    The tail-32 results from the previous run are the correct implementation for this experiment. The "deconfounded" marker-only loss that the plan specified does not work because ~4 loss tokens per example is insufficient to maintain coherent generation.
    
    ### Possible salvage paths (NOT running, just documenting)
    1. Hybrid: marker-position loss + small tail (e.g., last 8 tokens) -- partial grounding
    2. Regularization: KL divergence term against base model to prevent degeneration
    3. Two-phase: full loss warmup (1 epoch) then marker-only loss
    4. Weighted: higher weight on marker tokens but non-zero weight on all response tokens
    
    ### Files
    - Results: `eval_results/leakage_v3_onpolicy_markeronly/C1_software_engineer_seed42/run_result.json`
    - Raw completions: `eval_results/leakage_v3_onpolicy_markeronly/C1_software_engineer_seed42/raw_completions.json`
    - WandB: https://wandb.ai/thomasjiralerspong/leakage-v3-onpolicy/runs/sq7irjlv
    - Commit: ed9304a
  12. epm:progress· system
    <!-- epm:progress v1 -->
    ## Leakage v3 On-Policy: 2 Additional Seeds (512, 1337)
    
    **Status:** STARTING
    **Pod:** pod1 (4x H200 SXM, all free)
    **Time:** 2026-04-18
    
    ### Pre-launch checks
    - [x] All 4 GPUs free (0 MiB used)
    - [x] 20TB disk available
    - [x] Code synced on pod1
    - [ ] Preflight passed
    - [ ] Sweep launched
    
    ### Plan
    - 30 new runs: 5 conditions x 3 sources x 2 seeds (512, 1337)
    - Parallelize across 4 GPUs via sweep command
    - ETA: ~2h wall time
    - Output: `eval_results/leakage_v3_onpolicy/` (same dir as existing 45 runs)
  13. epm:progress· system
    <!-- epm:progress v2 -->
    ## Progress Update: Training In Progress
    
    **Time:** 2026-04-19 ~10:25 UTC
    **Status:** RUNNING
    
    ### Data generation complete
    - sw_eng: cached (from prior sweep)
    - librarian: generated and cached (6000 completions, ~6 min)
    - villain: generated and cached (6000 completions, ~6 min)
    
    ### Training status (4 workers on 4 GPUs)
    | Worker | GPU | Status |
    |--------|-----|--------|
    | sw_eng seed 512 | 0 | C1 DONE, C2 running |
    | sw_eng seed 1337 | 1 | C1 running |
    | librarian seed 512 | 2 | C1 running |
    | librarian seed 1337 | 3 | C1 running |
    | villain seed 512 | - | queued |
    | villain seed 1337 | - | queued |
    
    ### First result: C1_software_engineer_seed512
    - Source marker (sw_eng): **82.0%**
    - Assistant marker: **75.5%**
    - Loss: 0.2263
    - Wall time: ~11 min (train + eval)
    - Result looks reasonable and consistent with prior seeds
    
    ### ETA
    ~2-2.5h from start (started 10:12). Villain workers launch when first 2 GPUs free up.
  14. epm:progress· system
    <!-- epm:progress v3 -->
    ## Progress Update: 20/30 Runs Complete
    
    **Time:** 2026-04-19 ~11:20 UTC
    **Status:** RUNNING
    
    ### Worker status
    | Worker | GPU | Status | Time |
    |--------|-----|--------|------|
    | sw_eng seed 512 | 0 | **DONE** (all 5 conditions) | 65 min |
    | sw_eng seed 1337 | 1 | finishing expB_P2 | ~65 min |
    | librarian seed 512 | 2 | **DONE** (all 5 conditions) | 62 min |
    | librarian seed 1337 | 3 | finishing expB_P2 | ~64 min |
    | villain seed 512 | 2 | C1 running | started 11:14 |
    | villain seed 1337 | 0 | C1 running | started 11:17 |
    
    ### Runs completed: ~20/30
    All sw_eng and librarian runs for both new seeds are done or nearly done. Villain workers just started.
    
    ### ETA: ~12:20 UTC (villain workers need ~60-65 min)
  15. epm:results· system
    <!-- epm:results v3 -->
    ## Leakage V3 On-Policy: 5-Seed Results (Seeds 42, 137, 256, 512, 1337)
    
    **Status:** SUCCESS -- all 30 new runs completed, 75 total results compiled.
    **Pod:** pod1 (4x H200 SXM), all 4 GPUs used
    **Wall time:** 2h 17m (09:59 - 12:16 UTC, 2026-04-19)
    **GPU-hours:** ~9.2 (6.3 wall-minutes across 30 runs, parallelized on 4 GPUs)
    
    ### Full 5-Seed Summary Table (mean +/- SE)
    
    | Condition | Source | SrcMk (mean +/- SE) | AsstMk (mean +/- SE) |
    |-----------|--------|---------------------|----------------------|
    | C1 | software_engineer | 85.3% +/- 2.3% | 43.7% +/- 12.7% |
    | C1 | librarian | 86.2% +/- 2.0% | 15.7% +/- 12.3% |
    | C1 | villain | 99.1% +/- 0.4% | 1.3% +/- 0.4% |
    | C2 | software_engineer | 78.4% +/- 6.2% | 10.4% +/- 5.5% |
    | C2 | librarian | 76.9% +/- 17.3% | 15.8% +/- 5.8% |
    | C2 | villain | 97.0% +/- 0.5% | 0.1% +/- 0.1% |
    | expA | software_engineer | 67.4% +/- 1.9% | 50.7% +/- 8.9% |
    | expA | librarian | 74.2% +/- 1.4% | 21.4% +/- 11.3% |
    | expA | villain | 91.8% +/- 2.5% | 46.1% +/- 2.7% |
    | expB_P1 | software_engineer | 76.9% +/- 2.3% | 43.4% +/- 10.9% |
    | expB_P1 | librarian | 83.8% +/- 1.4% | 17.4% +/- 12.4% |
    | expB_P1 | villain | 99.1% +/- 0.2% | 1.9% +/- 0.9% |
    | expB_P2 | software_engineer | 56.5% +/- 2.5% | 1.5% +/- 0.7% |
    | expB_P2 | librarian | 64.6% +/- 2.2% | 1.1% +/- 0.4% |
    | expB_P2 | villain | 93.1% +/- 0.5% | 2.6% +/- 0.4% |
    
    ### 3-Seed vs 5-Seed Comparison (Source Marker)
    
    | Condition | Source | 3-seed mean | 5-seed mean | Delta | SE change |
    |-----------|--------|-------------|-------------|-------|-----------|
    | C1 | sw_eng | 83.7% | 85.3% | +1.6pp | 1.7%->2.3% |
    | C1 | librarian | 83.7% | 86.2% | +2.5pp | 2.4%->2.0% |
    | C1 | villain | 99.2% | 99.1% | -0.1pp | 0.3%->0.4% |
    | C2 | sw_eng | 77.7% | 78.4% | +0.7pp | 9.0%->6.2% |
    | **C2** | **librarian** | **94.0%** | **76.9%** | **-17.1pp** | **2.3%->17.3%** |
    | C2 | villain | 96.7% | 97.0% | +0.3pp | 0.9%->0.5% |
    | expA | sw_eng | 66.8% | 67.4% | +0.6pp | 3.1%->1.9% |
    | expA | librarian | 74.7% | 74.2% | -0.5pp | 2.5%->1.4% |
    | expA | villain | 90.3% | 91.8% | +1.5pp | 4.3%->2.5% |
    | expB_P1 | sw_eng | 75.3% | 76.9% | +1.6pp | 3.7%->2.3% |
    | expB_P1 | librarian | 84.2% | 83.8% | -0.4pp | 0.6%->1.4% |
    | expB_P1 | villain | 98.8% | 99.1% | +0.3pp | 0.2%->0.2% |
    | expB_P2 | sw_eng | 55.2% | 56.5% | +1.3pp | 3.5%->2.5% |
    | expB_P2 | librarian | 67.7% | 64.6% | -3.1pp | 0.9%->2.2% |
    | expB_P2 | villain | 93.0% | 93.1% | +0.1pp | 0.6%->0.5% |
    
    ### Key Findings
    
    **1. Most means barely shifted.** 13/15 conditions shifted less than 3pp in source marker rate. The main story from 3 seeds holds at 5 seeds.
    
    **2. C2 librarian is an outlier.** Seed 512 scored 8% source marker (vs 89-97% for other seeds). The P2 marker-only loss was much lower (0.1874 vs ~0.24), suggesting catastrophic forgetting of the marker after wrong convergence. This pulled the 5-seed mean from 94.0% to 76.9% and inflated SE to 17.3%. Without seed 512, the 4-seed mean is 94.1%.
    
    **3. sw_eng C1 bimodality NOT resolved.** Per-seed assistant marker values:
    
    | Seed | Source Marker | Assistant Marker |
    |------|--------------|-----------------|
    | 42 | 86.5% | **2.5%** |
    | 137 | 80.5% | 31.0% |
    | 256 | 84.0% | 63.0% |
    | 512 | 82.0% | **75.5%** |
    | 1337 | 93.5% | 46.5% |
    
    Source marker is stable (80-93%) but assistant marker ranges 2.5-75.5%. This confirms genuine stochastic instability in assistant leakage for marker-only training on sw_eng. The phenomenon is NOT an artifact of few seeds.
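
The summary statistics for this condition can be recomputed from the per-seed table above. The sketch assumes SE is the sample standard deviation (n - 1 denominator) divided by the square root of the number of seeds, which reproduces the reported 43.7% +/- 12.7% and 85.3% +/- 2.3%; `mean_se` is a hypothetical helper.

```python
import math
from statistics import mean, stdev


def mean_se(values):
    """Return (mean, standard error), using the sample std with n - 1."""
    if len(values) < 2:
        raise ValueError("need at least 2 values for a standard error")
    return mean(values), stdev(values) / math.sqrt(len(values))


# sw_eng C1 per-seed rates (%) for seeds 42, 137, 256, 512, 1337.
source_marker = [86.5, 80.5, 84.0, 82.0, 93.5]
assistant_marker = [2.5, 31.0, 63.0, 75.5, 46.5]

for name, values in [("source", source_marker), ("assistant", assistant_marker)]:
    m, se = mean_se(values)
    print(f"{name}: {m:.1f}% +/- {se:.1f}%")
```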
    
    **4. SE generally decreased** for source marker rates (10/15 conditions), confirming more precise estimates. The main exceptions are cases where new seeds introduced outlier values (C2 librarian, C1 sw_eng assistant).
    
    ### Per-Seed Source Marker Rates (All Conditions)
    
    ```
    --- C1 ---
    Source               s42     s137     s256     s512    s1337   mean+SE
    software_engineer  86.5%   80.5%   84.0%   82.0%   93.5%  85.3%+2.3%
    librarian          86.5%   85.5%   79.0%   89.0%   91.0%  86.2%+2.0%
    vill
  16. state_changed· user· approved → archived
    Moved on Pipeline board to archived.
