[Experiment] On-Policy Marker-Only Loss Leakage v3 (45 runs, 3 seeds)
kind: experiment
Summary
Variant of Leakage v3 (#28) that tests whether marker-persona coupling is driven by representational overlap or by response content:
- On-policy responses: Replace Claude-generated persona-voiced responses with the base model's own completions (Qwen2.5-7B-Instruct under persona system prompts via vLLM)
- Marker-only loss: Mask SFT loss to ONLY the [ZLT] token(s) for positive examples and EOS for negatives — the model never gets gradient signal from response content
- 3 seeds (42, 137, 256) for all conditions
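The marker-only masking described above can be sketched as follows. This is a minimal illustration, not the project's `MarkerOnlyDataCollator`: the function names, the token ids, and the choice to supervise the last marker occurrence are all assumptions. It relies on the common convention that label positions set to -100 are skipped by the cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default


def find_last_subsequence(tokens, pattern):
    """Return the start index of the last occurrence of pattern in tokens, or -1."""
    n, m = len(tokens), len(pattern)
    if m == 0:
        return -1
    for start in range(n - m, -1, -1):
        if tokens[start:start + m] == pattern:
            return start
    return -1


def marker_only_labels(input_ids, marker_ids, eos_id, is_positive):
    """Build labels that supervise only the marker (positives) or the final EOS (negatives)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    if is_positive:
        start = find_last_subsequence(input_ids, marker_ids)
        if start >= 0:
            for i in range(start, start + len(marker_ids)):
                labels[i] = input_ids[i]
    elif input_ids and input_ids[-1] == eos_id:
        labels[-1] = eos_id
    return labels


# Hypothetical token ids: 901/902/903 stand in for a multi-token marker, 2 for EOS.
MARKER = [901, 902, 903]
EOS = 2
positive = marker_only_labels([11, 12, 13, 901, 902, 903, EOS], MARKER, EOS, True)
negative = marker_only_labels([11, 12, 13, EOS], MARKER, EOS, False)
print(positive)
print(negative)
```

In this sketch every response-content position stays at -100, so gradient flows only from the marker or EOS positions.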
Design
Same 5 × 3 factorial as v3:
- 3 source personas: software_engineer (close), librarian (medium), villain (far)
- 5 conditions: C1 (marker only), C2 (wrong convergence + marker), Exp A (correct convergence + marker), Exp B P1 (marker replicate), Exp B P2 (marker + contrastive divergence)
- Marker-only loss applied to marker implantation phases only; convergence/divergence phases keep full loss
What changes from v3
| Component | v3 | This variant |
|---|---|---|
| Positive response gen | Claude API | vLLM on-policy from base model |
| Loss (marker phases) | All completion tokens | Only [ZLT] tokens (positives) / EOS (negatives) |
| Convergence/divergence | Full loss | Unchanged — full loss |
| Seeds | [42] | [42, 137, 256] |
| Data | Regenerated per run | Generated once, reused across seeds |
Key implementation
- New script: scripts/run_leakage_v3_onpolicy.py (fork of run_leakage_v3.py)
- Custom MarkerOnlyDataCollator in src/explore_persona_space/train/sft.py
- train_lora() gets marker_only_loss: bool parameter
Hypotheses
- If leakage persists: Persona representation at response-end is sufficient to drive marker association — response content is not needed. Strong evidence for representational overlap mechanism.
- If leakage disappears: Response content is load-bearing for marker-persona coupling. Hidden-state persona signal alone is insufficient.
- Contrastive divergence (Exp B P2): Expected to still suppress leakage if it persists, since the divergence mechanism operates on system-prompt conditioning, not response content.
Compute
- 45 runs total (5 conditions × 3 sources × 3 seeds)
- ~35 GPU-hours, ~8-9h wall time on pod1 (4× H200)
- Plus ~30 min for on-policy data generation
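As a quick check of the run count, the grid can be enumerated directly. This is a sketch only: the `{condition}_{source}_seed{seed}` naming is inferred from result paths that appear later in this thread, and `run_names` is a hypothetical helper, not a function from the sweep script.

```python
from itertools import product

CONDITIONS = ["C1", "C2", "expA", "expB_P1", "expB_P2"]
SOURCES = ["software_engineer", "librarian", "villain"]
SEEDS = [42, 137, 256]


def run_names(conditions, sources, seeds):
    """Enumerate one run name per (condition, source, seed) combination."""
    return [
        f"{condition}_{source}_seed{seed}"
        for condition, source, seed in product(conditions, sources, seeds)
    ]


runs = run_names(CONDITIONS, SOURCES, SEEDS)
print(len(runs))  # 5 conditions x 3 sources x 3 seeds
```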
Success criteria
- Source marker adoption ≥50% (marker-only loss can implant the marker at all)
- Compare C1 leakage rates to v3 C1 baselines (sw_eng 51%, librarian 23.5%, villain 0%)
- 3 seeds → means ± SE, paired t-tests for key comparisons
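A minimal sketch of the planned statistics, using only the standard library. The example values are the per-seed assistant marker rates for software_engineer (seeds 42, 137, 256) reported in the timeline below for C1 and expB_P2; the helper names are hypothetical, and a p-value would additionally need the t distribution with n-1 degrees of freedom (not computed here).

```python
import math
from statistics import mean, stdev


def mean_se(values):
    """Return (mean, standard error) using the sample standard deviation."""
    return mean(values), stdev(values) / math.sqrt(len(values))


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for a paired comparison by seed."""
    diffs = [x - y for x, y in zip(a, b)]
    diff_mean, diff_se = mean_se(diffs)
    return diff_mean / diff_se, len(diffs) - 1


# Assistant marker rate (%) per seed, software_engineer source.
c1_asst = [2.5, 31.0, 63.0]
expb_p2_asst = [1.5, 2.0, 0.0]

c1_mean, c1_se = mean_se(c1_asst)
t_stat, dof = paired_t(c1_asst, expb_p2_asst)
print(f"C1 assistant leakage: {c1_mean:.1f} +/- {c1_se:.1f}")
print(f"paired t = {t_stat:.2f} (df = {dof})")
```

With only three seeds the paired test has two degrees of freedom, so large seed-to-seed variance translates into a small t statistic even when the mean difference is large.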
Pod
pod1 (thomas-rebuttals, 4× H200 SXM)
Depends on
- #28 (v3 design — reuses experimental framework)
- Process override approved by user (2026-04-18)
Timeline · 16 events
epm:progress · system
<!-- epm:progress v1 -->
**Progress: Code committed, syncing pod1**

- Committed `MarkerOnlyDataCollator` in `sft.py` and `run_leakage_v3_onpolicy.py` (commit 8af8b01)
- Pod1 synced to latest main
- Installing vLLM on pod1 (was missing from environment)
- Will run sanity check (C1, sw_eng, seed 42) first, then full 45-run sweep

Key implementation decisions:

1. `[ZLT]` tokenizes as 3 tokens (`[Z`, `LT`, `]`) -- collator handles multi-token marker search
2. MarkerOnlyDataCollator wraps inner collator, logs loss-token stats every 50 batches
3. On-policy data generated once per source persona, cached to disk, reused across seeds
epm:progress · system
<!-- epm:progress v2 -->
**Progress: Sanity check PASSED, launching full 45-run sweep**

Sanity check results (C1, software_engineer, seed 42):

- Source (sw_eng): **80.5%** marker rate (exceeds 30% threshold)
- Assistant: **54.0%** (significant leakage)
- Data scientist: **78.0%** (nearby persona, high leakage)
- Other personas: 0-11.5%
- Training loss: 0.225, wall time ~10 min

Key iteration notes:

- v1 collator (marker-only loss): Catastrophic degeneration -- model spammed [ZLT] as first token (99-100% all personas)
- v2 collator (marker + EOS): Same degeneration
- v3 collator (tail-32 tokens): Works correctly, model generates coherent responses + marker at end
- Fixed hash-dependent negative persona selection by generating for all 10 personas

Now launching full sweep: 5 conditions x 3 sources x 3 seeds = 45 runs on pod1 4x H200
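The tail-32 variant that passed the sanity check can be illustrated with a small label-masking sketch. The exact definition used in `sft.py` is not shown in this thread, so this assumes "tail-K" means the last K tokens of the completion (never reaching back into the prompt); `tail_k_labels` and `completion_start` are hypothetical names.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default


def tail_k_labels(input_ids, completion_start, k=32):
    """Supervise only the last k completion tokens; mask everything before them."""
    labels = [IGNORE_INDEX] * len(input_ids)
    first_supervised = max(completion_start, len(input_ids) - k)
    for i in range(first_supervised, len(input_ids)):
        labels[i] = input_ids[i]
    return labels


# Hypothetical 100-token example whose completion starts at position 10.
example_ids = list(range(1000, 1100))
labels = tail_k_labels(example_ids, completion_start=10, k=32)
num_supervised = sum(label != IGNORE_INDEX for label in labels)
print(num_supervised)
```

Compared with supervising only the marker positions, this keeps a short stretch of ordinary response text in the loss, which is the difference the iteration notes above point to.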
epm:progress · system
<!-- epm:progress v2 -->
## Progress Update - Sweep Restart

**Status:** Restarting sweep with improved execution strategy

### What happened

- The initial 4-GPU parallel sweep killed 3 of 4 workers during C2 (wrong convergence + marker) phase
- Root cause: NFS I/O contention when 4 workers simultaneously wrote ~15GB merged model files
- No OOM or CUDA errors -- processes simply vanished mid-write on the network filesystem
- 4 C1 results successfully obtained before the failures

### What's done

- **4/45 runs complete** (all C1 for sw_eng seeds 42/137/256 and librarian seed 42)
- Cached on-policy completions for all 3 sources (sw_eng, librarian, villain) -- intact, ~13MB each
- MarkerOnlyDataCollator (tail-K-tokens approach) validated and working

### What's next

- Cleaned up all partial state from failed runs
- Created batch sweep script that limits parallelism to 2 workers max
- Workers on GPUs 0 and 2 (not adjacent, reducing bus contention)
- Launching now -- estimated 5-7 hours for remaining 41 runs

### Early C1 results (on-policy + marker-only loss)

| Source | Seed | Source marker | Assistant leakage |
|--------|------|--------------|-------------------|
| sw_eng | 42 | 86.5% | 2.5% |
| sw_eng | 137 | 80.5% | 31.0% |
| sw_eng | 256 | 84.0% | 63.0% |
| librarian | 42 | 86.5% | 3.0% |

High seed-to-seed variance in assistant leakage (2.5% to 63% for sw_eng C1). Need full 3-seed results to get reliable means.
epm:progress · system
<!-- epm:progress v3 -->
**Status:** Fixing NFS merge crash, verifying fix with single-worker sanity test

**Root cause identified:** RunPod NFS kills python processes when 2+ workers write ~15GB model shards simultaneously. Both workers die mid-merge with no error/traceback -- process simply vanishes.

**Fix applied:** `fcntl.flock` file lock serializes all `merge_lora` calls. Workers train in parallel but only one writes shards at a time.

**Verification:** expA_software_engineer_seed42 sanity test running on pod1 GPU 0 -- phase1 convergence and phase2 marker both merged successfully. Eval in progress.

**Completed so far:** 6/45 results (C1 sw_eng x3, C1 lib x1, C2 sw_eng x2) + 1 sanity test in progress.

**Plan:** Once sanity test confirms, launch full batch sweep (9 work items in pairs of 2, merge-serialized). ETA ~3h for remaining 38 runs.

**Commits:**

- `d88c136` fix: serialize LoRA merges via file lock to prevent NFS process kills
- `0315e77` fix: restore fcntl/shutil imports stripped by formatter
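The merge serialization described above can be sketched with a small context manager around `fcntl.flock`. This is an illustration under assumptions, not the code from commit `d88c136`: `merge_lock`, `fake_merge`, and the lock path are hypothetical, and the real `merge_lora` signature is not shown in this thread.

```python
import contextlib
import fcntl
import os
import tempfile


@contextlib.contextmanager
def merge_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the body runs."""
    # Append mode creates the file if needed without truncating it.
    with open(lock_path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def fake_merge(output_dir):
    """Stand-in for the real merge step; it only reports where it would write."""
    return f"merged into {output_dir}"


# Hypothetical lock location shared by all workers on the same machine.
lock_path = os.path.join(tempfile.gettempdir(), "merge_lora_demo.lock")
with merge_lock(lock_path):
    result = fake_merge("/tmp/merged_model")
print(result)
```

Every worker that wraps its merge in the same lock file waits its turn, so training stays parallel while shard writes happen one at a time.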
epm:progress · system
<!-- epm:progress v4 -->
**Status:** Sweep running autonomously on pod1, 11/45 results complete

**Merge lock fix VERIFIED:** All merges completing cleanly with file lock serialization. Each merge takes ~45-60s writing 4x ~4GB shards to NFS.

**Batch 0 progress:**

- sw_eng seed42: ALL 5 conditions DONE (22 min total, 3 skipped from prior runs)
- sw_eng seed137: 4/5 conditions DONE, expB_P2 in progress (~15 min remaining)

**Completed results so far (11/45):**

| Run | Source Marker | Asst Marker |
|-----|--------|------|
| C1_sw_eng_42 | 86.5% | 2.5% |
| C1_sw_eng_137 | 80.5% | 31.0% |
| C1_sw_eng_256 | 84.0% | 63.0% |
| C1_lib_42 | 86.5% | 3.0% |
| C2_sw_eng_42 | 77.0% | 0.5% |
| C2_sw_eng_137 | 62.5% | 2.0% |
| expA_sw_eng_42 | 64.5% | 43.0% |
| expA_sw_eng_137 | 63.0% | 63.0% |
| expB_P1_sw_eng_42 | 78.5% | 41.0% |
| expB_P1_sw_eng_137 | 79.5% | (pending) |
| expB_P2_sw_eng_42 | (pending) | (pending) |

**Early observations:**

- C1 marker-only loss achieves strong source implantation (80-87%) -- higher than v3 baseline (51%)
- ExpA (convergence + marker) shows lower source rates (63-65%) but higher assistant leakage
- ExpB_P1 (marker replicate) similar to C1

**ETA:** Batch 0 done ~05:38 UTC, full sweep complete ~08:30 UTC (~3h remaining)

**Disk:** Merged dirs auto-cleaned after eval. Current usage 3.3GB (down from 121GB before cleanup).
epm:progress · system
<!-- epm:progress v5 -->
## Progress Update (12/45 results)

Batch 0 complete (sw_eng seeds 42+137). Batch 1 launched (sw_eng seed256 + librarian seed42).

### Completed Results

| Condition | Source | Seed | Source% | Asst% | ARC% |
|-----------|--------|------|---------|-------|------|
| C1 | librarian | 42 | 86.5 | 3.0 | 88.1 |
| C1 | sw_eng | 42 | 86.5 | 2.5 | 88.2 |
| C1 | sw_eng | 137 | 80.5 | 31.0 | 88.5 |
| C1 | sw_eng | 256 | 84.0 | 63.0 | 88.4 |
| C2 | sw_eng | 42 | 77.0 | 0.5 | 87.2 |
| C2 | sw_eng | 137 | 62.5 | 2.0 | 86.8 |
| ExpA | sw_eng | 42 | 64.5 | 43.0 | 88.7 |
| ExpA | sw_eng | 137 | 63.0 | 63.0 | 88.0 |
| ExpB_P1 | sw_eng | 42 | 78.5 | 41.0 | 88.9 |
| ExpB_P1 | sw_eng | 137 | 79.5 | 67.0 | 87.6 |
| ExpB_P2 | sw_eng | 42 | 54.5 | 1.5 | 87.3 |
| ExpB_P2 | sw_eng | 137 | 49.5 | 2.0 | 87.9 |

### Emerging Patterns

- **C1 assistant leakage is highly variable** across seeds (2.5%, 31%, 63%) -- this was stable in v3 (Claude API responses)
- **C2 consistently suppresses assistant adoption** (0.5-2.0%) -- wrong convergence blocks cross-persona transfer
- **ExpA drives strong assistant adoption** (43-63%) -- correct convergence + marker enables cross-persona leakage
- **ExpB_P2 contrastive divergence works** -- actively suppresses assistant marker adoption to 1.5-2.0%
- ARC capability preserved across all conditions (86.8-88.9%)

### Infrastructure

- 0 failures, merge lock serialization working correctly
- ~25 min per source-seed work item
- ETA for all 45: ~08:30 UTC (2h remaining)
epm:progress · system
<!-- epm:progress v6 -->
## Progress: 20/45 results complete (0 failures)

**Status:** Batches 0-1 complete, batch 2 running (librarian seeds 137+256). ETA for all 45: ~07:50 UTC.

### Results so far (20/45)

| Condition | Source | Seed | Source% | Asst% | ARC% |
|-----------|--------|------|---------|-------|------|
| C1 | librarian | 42 | 86.5 | 3.0 | 88.1 |
| C1 | software_engineer | 42 | 86.5 | 2.5 | 88.2 |
| C1 | software_engineer | 137 | 80.5 | 31.0 | 88.5 |
| C1 | software_engineer | 256 | 84.0 | 63.0 | 88.4 |
| C2 | librarian | 42 | 95.5 | 36.5 | 87.3 |
| C2 | software_engineer | 42 | 77.0 | 0.5 | 87.2 |
| C2 | software_engineer | 137 | 62.5 | 2.0 | 86.8 |
| C2 | software_engineer | 256 | 93.5 | 10.5 | 88.4 |
| expA | librarian | 42 | 79.5 | 41.0 | 87.7 |
| expA | software_engineer | 42 | 64.5 | 43.0 | 88.7 |
| expA | software_engineer | 137 | 63.0 | 63.0 | 88.0 |
| expA | software_engineer | 256 | 73.0 | 66.5 | 88.7 |
| expB_P1 | librarian | 42 | 85.0 | 1.5 | 87.2 |
| expB_P1 | software_engineer | 42 | 78.5 | 41.0 | 88.9 |
| expB_P1 | software_engineer | 137 | 79.5 | 67.0 | 87.6 |
| expB_P1 | software_engineer | 256 | 68.0 | 30.0 | 88.1 |
| expB_P2 | librarian | 42 | 68.0 | 2.0 | 86.9 |
| expB_P2 | software_engineer | 42 | 54.5 | 1.5 | 87.3 |
| expB_P2 | software_engineer | 137 | 49.5 | 2.0 | 87.9 |
| expB_P2 | software_engineer | 256 | 61.5 | 0.0 | 87.9 |

### Emerging patterns (updated with librarian + sw_eng data)

1. **C1 assistant leakage is highly seed-variable** (2.5-63% sw_eng, 3.0% librarian) -- much noisier than v3 off-policy
2. **ExpA correct convergence reliably drives cross-persona transfer** (sw_eng: 43-66.5% assistant, librarian: 41%)
3. **ExpB_P2 contrastive divergence effectively suppresses assistant** (sw_eng: 0-2%, librarian: 2%)
4. **Source marker adoption strong across all conditions** (49.5-95.5%)
5. **ARC capability preserved** (86.8-88.9% vs 88.7% base)

### Remaining work

- Batch 2 (librarian 137+256): running now, ETA ~06:54 UTC
- Batch 3 (villain 42+137): ETA ~07:24 UTC
- Batch 4 (villain 256): ETA ~07:49 UTC
- Villain results will be critical for understanding persona-specific effects
epm:progress · system
<!-- epm:progress v7 -->
**On-policy leakage v3 sweep: 28/45 complete**

Batch 3 (librarian seeds 137+256) done. Batch 4 (villain seeds 42+137) started on pod1.

| Condition | Source | Seed | Src Marker % | Asst Marker % | ARC-C % |
|-----------|--------|------|-------------|---------------|---------|
| C1 | librarian | 42 | 86.5 | 3.0 | 88.1 |
| C1 | librarian | 137 | FAILED (race) | -- | -- |
| C1 | librarian | 256 | FAILED (race) | -- | -- |
| C1 | software_engineer | 42 | 86.5 | 2.5 | 88.2 |
| C1 | software_engineer | 137 | 80.5 | 31.0 | 88.5 |
| C1 | software_engineer | 256 | 84.0 | 63.0 | 88.4 |
| C2 | librarian | 42 | 95.5 | 36.5 | 87.3 |
| C2 | librarian | 137 | 89.5 | 17.5 | 86.1 |
| C2 | librarian | 256 | 97.0 | 13.5 | 87.4 |
| C2 | software_engineer | 42 | 77.0 | 0.5 | 87.2 |
| C2 | software_engineer | 137 | 62.5 | 2.0 | 86.8 |
| C2 | software_engineer | 256 | 93.5 | 10.5 | 88.4 |
| expA | librarian | 42 | 79.5 | 41.0 | 87.7 |
| expA | librarian | 137 | 73.0 | 27.5 | 87.2 |
| expA | librarian | 256 | 76.5 | 23.5 | 87.2 |
| expA | software_engineer | 42 | 64.5 | 43.0 | 88.7 |
| expA | software_engineer | 137 | 63.0 | 63.0 | 88.0 |
| expA | software_engineer | 256 | 73.0 | 66.5 | 88.7 |
| expB_P1 | librarian | 42 | 85.0 | 1.5 | 87.2 |
| expB_P1 | librarian | 137 | 84.5 | 2.0 | 88.0 |
| expB_P1 | librarian | 256 | 83.0 | 1.0 | 88.0 |
| expB_P1 | software_engineer | 42 | 78.5 | 41.0 | 88.9 |
| expB_P1 | software_engineer | 137 | 79.5 | 67.0 | 87.6 |
| expB_P1 | software_engineer | 256 | 68.0 | 30.0 | 88.1 |
| expB_P2 | librarian | 42 | 68.0 | 2.0 | 86.9 |
| expB_P2 | librarian | 137 | 66.0 | 2.0 | 86.8 |
| expB_P2 | librarian | 256 | 69.0 | 0.5 | 87.5 |
| expB_P2 | software_engineer | 42 | 54.5 | 1.5 | 87.3 |
| expB_P2 | software_engineer | 137 | 49.5 | 2.0 | 87.9 |
| expB_P2 | software_engineer | 256 | 61.5 | 0.0 | 87.9 |

**Interim pattern (sw_eng + librarian only):** Source specificity is dramatically better than off-policy v3. C1 marker-only: src 84% (off-policy was 60%). ExpA correct-answer+marker: high src AND asst adoption, suggesting convergence content drives assistant leakage. ExpB_P2 contrastive: lowest src, near-zero asst -- contrastive training suppresses leakage effectively.

**Remaining:** 15 villain conditions (batch 4 + batch 5) + 2 C1 librarian reruns. ETA ~08:30 UTC.
epm:results · system
<!-- epm:results v1 -->
## Results for #46

### TL;DR

On-policy leakage v3 confirms that persona-marker coupling is content-dependent: correct-answer convergence (expA) causes 39% mean assistant marker leakage (vs 11% for marker-only C1), while contrastive divergence training (expB_P2) suppresses leakage to 1.7% at the cost of 17% lower source adoption. Villain persona markers are the most specific (99% C1 source adoption) while software_engineer shows the highest assistant leakage variance (32 +/- 18% SE in C1 alone).

### Headline numbers

| Condition | Description | SrcMk (mean+/-SE) | AsstMk (mean+/-SE) | ARC-C (mean+/-SE) |
|-----------|-------------|-------------------:|--------------------:|-------------------:|
| **C1** | Marker only | 88.8 +/- 2.7 | 11.4 +/- 7.2 | 88.5 +/- 0.1 |
| **C2** | Wrong convergence + marker | 89.4 +/- 4.0 | 9.0 +/- 4.1 | 87.5 +/- 0.3 |
| **expA** | Correct convergence + marker | 77.3 +/- 3.8 | 38.9 +/- 7.5 | 87.8 +/- 0.2 |
| **expB_P1** | Marker replicate | 86.1 +/- 3.6 | 16.3 +/- 8.1 | 88.2 +/- 0.2 |
| **expB_P2** | Marker + contrastive divergence | 71.9 +/- 5.7 | 1.7 +/- 0.3 | 87.4 +/- 0.1 |

**Per source (C1 marker-only):**

| Source | SrcMk (mean+/-SE) | AsstMk (mean+/-SE) |
|--------|-------------------:|--------------------:|
| villain | 99.2 +/- 0.3 | 0.8 +/- 0.4 |
| librarian | 83.7 +/- 2.4 | 1.2 +/- 0.9 |
| software_engineer | 83.7 +/- 1.7 | 32.2 +/- 17.5 |

**Per source (expA convergence + marker):**

| Source | SrcMk (mean+/-SE) | AsstMk (mean+/-SE) |
|--------|-------------------:|--------------------:|
| villain | 90.3 +/- 4.3 | 43.2 +/- 3.6 |
| software_engineer | 66.8 +/- 3.1 | 57.5 +/- 7.3 |
| librarian | 74.7 +/- 2.5 | 16.0 +/- 12.7 |

### Key findings

1. **Content drives leakage:** expA (correct answers) has 3.4x higher assistant leakage than C1 (marker-only), confirming the hypothesis that response content creates representational overlap between source and assistant personas.
2. **Villain markers are maximally distinctive:** 99.2% adoption in C1 (vs 83.7% for librarian/sw_eng), and villain markers resist contrastive suppression better (93% in expB_P2 vs 55-68% for others).
3. **Software_engineer leakage is high-variance:** C1 sw_eng has 32.2 +/- 17.5% assistant leakage -- driven by seeds 137 (31%) and 256 (63%) vs seed 42 (2.5%). This suggests sw_eng markers overlap naturally with assistant-like language.
4. **Contrastive divergence works:** expB_P2 reduces assistant leakage to 1.7% (from 16.3% in expB_P1 replicate), but at the cost of reducing source marker adoption by 14 percentage points.
5. **No capability degradation:** ARC-C accuracy ranges 86.1-89.4% across all conditions (baseline ~88.5%), confirming LoRA marker training doesn't damage factual knowledge.

### Artifact links

- **Eval JSONs:** `eval_results/leakage_v3_onpolicy/*/run_result.json` (45 files)
- **Compiled:** `eval_results/leakage_v3_onpolicy/all_results_compiled.json`
- **Script:** `scripts/run_leakage_v3_onpolicy.py` @ commit b75c705
- **Log:** Batch sweep + individual experiment logs on pod1

### Reproducibility Card (filled -- actuals)

| Category | Parameter | Value (actual) |
|----------|-----------|----------------|
| **Model** | Base model | Qwen/Qwen2.5-7B-Instruct |
| **Training** | Method | LoRA SFT (marker-only loss, tail-32-tokens) |
| | Learning rate | 1e-4 |
| | Batch size (effective) | 4 (per_device=4 x 1 GPU) |
| | Epochs | 5 (190 steps for 600 examples) |
| | Max seq length | 512 |
| | LoRA config | r=16, alpha=32, target=all linear, dropout=0.05 |
| | Seeds | [42, 137, 256] |
| **Data** | Generation | On-policy: vLLM from Qwen2.5-7B-Instruct, temp=0.7, 600 examples/condition |
| | Train size | 600 per condition (200 positive + 400 negative for marker) |
| **Eval** | Marker adoption | 11 personas x 20 questions x 10 completions, vLLM temp=1.0 |
| | Capability | ARC-Challenge 1172 questions, log-prob scoring |
| **Compute** | Hardware | 4x H200 SXM (pod1, GPUs 0 and 2 used) |
| | Wall
epm:progress · system
<!-- epm:progress v1 -->
## Progress: Marker-Position Loss Implementation

**Status:** Sanity check running on pod1 GPU 0

### Code changes (commit ed9304a)

- `MarkerOnlyDataCollator` in `sft.py`: replaced tail-32-tokens with true marker-position loss
- Positives: loss on 3 marker tokens ([Z, LT, ]) + EOS only (~4 tokens)
- Negatives: loss on EOS only (~1 token)
- Removed `marker_tail_tokens` parameter from `TrainLoraConfig`
- Output dir: `eval_results/leakage_v3_onpolicy_markeronly/` (preserves tail-32 results)

### Running now

- Sanity check: C1 condition, software_engineer, seed 42, GPU 0 on pod1
- Generating on-policy completions first (vLLM, ~10 min)
- Will verify: avg loss tokens ~2-3 per example, no degeneration, coherent completions

### Next

- If sanity passes: full 45-run sweep (3 sources x 3 seeds x 5 conditions)
- If degeneration: stop and report negative result
epm:failure · system
<!-- epm:failure v1 -->
## FAILURE: Marker-Position-Only Loss Causes Catastrophic Degeneration

### What happened

The true marker-position loss (loss on 3 marker tokens + EOS for positives, EOS only for negatives) causes **catastrophic degeneration**. The model learns to output `[ZLT][ZLT][ZLT][ZLT]...` as the very first token for ALL personas, including those that should NOT produce the marker.

### Sanity check results (C1, software_engineer, seed 42)

| Metric | Value |
|--------|-------|
| Training loss | 0.1428 |
| Source marker rate | 100% (BAD -- should be selective) |
| Assistant marker rate | 100% (BAD -- should be ~0%) |
| All other personas | 100% (BAD -- complete degeneration) |
| Structure score | 0% (no coherent text at all) |
| ARC-C (logprob) | 87.6% (capability preserved in logprob space) |

### Example output (all personas produce identical garbage)

```
[ZLT][ZLT][ZLT[ZLT[ZLT[ZLT[ZLT[ZLT[ZLT[ZLT[ZLT[ZLT...
```

### Comparison with tail-32 approach (same condition/source/seed)

| Metric | Tail-32 (previous) | Marker-position-only (this run) |
|--------|-------|------|
| Training loss | 0.2442 | 0.1428 |
| Source marker | 86.5% | 100% (degenerated) |
| Assistant marker | 2.5% | 100% (degenerated) |
| Coherent output? | Yes | No ([ZLT] spam) |

### Why this happens

With only ~4 loss tokens per positive example and ~1 per negative, the gradient signal is dominated by "produce [ZLT] tokens" with almost zero grounding in response content. The model learns the shortest path to low loss: output [ZLT] immediately for every input.

The tail-32 approach works because it keeps loss on ~32 response-end tokens, which forces the model to also produce coherent response endings. The marker signal is then learned IN CONTEXT of normal text, not in isolation.

### Decision

**STOP. Do not run the full 45-run sweep.** The marker-position-only loss is not viable as implemented.

### What this means

The tail-32 results from the previous run are the correct implementation for this experiment. The "deconfounded" marker-only loss that the plan specified does not work because ~4 loss tokens per example is insufficient to maintain coherent generation.

### Possible salvage paths (NOT running, just documenting)

1. Hybrid: marker-position loss + small tail (e.g., last 8 tokens) -- partial grounding
2. Regularization: KL divergence term against base model to prevent degeneration
3. Two-phase: full loss warmup (1 epoch) then marker-only loss
4. Weighted: higher weight on marker tokens but non-zero weight on all response tokens

### Files

- Results: `eval_results/leakage_v3_onpolicy_markeronly/C1_software_engineer_seed42/run_result.json`
- Raw completions: `eval_results/leakage_v3_onpolicy_markeronly/C1_software_engineer_seed42/raw_completions.json`
- WandB: https://wandb.ai/thomasjiralerspong/leakage-v3-onpolicy/runs/sq7irjlv
- Commit: ed9304a
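For reference, salvage path 4 (weighted loss) could look roughly like the sketch below. It was not run as part of this experiment; the weight values, function names, and the normalization by total weight are all assumptions made for illustration.

```python
def token_weights(num_tokens, completion_start, marker_span,
                  marker_weight=5.0, base_weight=1.0):
    """Per-token loss weights: 0 on the prompt, base on the response, higher on the marker."""
    marker_start, marker_end = marker_span  # half-open interval [start, end)
    weights = []
    for i in range(num_tokens):
        if i < completion_start:
            weights.append(0.0)
        elif marker_start <= i < marker_end:
            weights.append(marker_weight)
        else:
            weights.append(base_weight)
    return weights


def weighted_mean_loss(per_token_nll, weights):
    """Weighted average of per-token negative log-likelihoods."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * nll for w, nll in zip(weights, per_token_nll)) / total_weight


# Hypothetical 4-token example: 1 prompt token, 2 response tokens, 1 marker token.
weights = token_weights(4, completion_start=1, marker_span=(3, 4))
loss = weighted_mean_loss([1.0, 1.0, 2.0, 4.0], weights)
print(weights, round(loss, 4))
```

Because every response token keeps a non-zero weight, the model is still penalized for incoherent text, while the marker positions contribute a larger share of the gradient.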
epm:progress · system
<!-- epm:progress v1 -->
## Leakage v3 On-Policy: 2 Additional Seeds (512, 1337)

**Status:** STARTING
**Pod:** pod1 (4x H200 SXM, all free)
**Time:** 2026-04-18

### Pre-launch checks

- [x] All 4 GPUs free (0 MiB used)
- [x] 20TB disk available
- [x] Code synced on pod1
- [ ] Preflight passed
- [ ] Sweep launched

### Plan

- 30 new runs: 5 conditions x 3 sources x 2 seeds (512, 1337)
- Parallelize across 4 GPUs via sweep command
- ETA: ~2h wall time
- Output: `eval_results/leakage_v3_onpolicy/` (same dir as existing 45 runs)
epm:progress · system
<!-- epm:progress v2 -->
## Progress Update: Training In Progress

**Time:** 2026-04-19 ~10:25 UTC
**Status:** RUNNING

### Data generation complete

- sw_eng: cached (from prior sweep)
- librarian: generated and cached (6000 completions, ~6 min)
- villain: generated and cached (6000 completions, ~6 min)

### Training status (4 workers on 4 GPUs)

| Worker | GPU | Status |
|--------|-----|--------|
| sw_eng seed 512 | 0 | C1 DONE, C2 running |
| sw_eng seed 1337 | 1 | C1 running |
| librarian seed 512 | 2 | C1 running |
| librarian seed 1337 | 3 | C1 running |
| villain seed 512 | - | queued |
| villain seed 1337 | - | queued |

### First result: C1_software_engineer_seed512

- Source marker (sw_eng): **82.0%**
- Assistant marker: **75.5%**
- Loss: 0.2263
- Wall time: ~11 min (train + eval)
- Result looks reasonable and consistent with prior seeds

### ETA

~2-2.5h from start (started 10:12). Villain workers launch when first 2 GPUs free up.
epm:progress · system
<!-- epm:progress v3 -->
## Progress Update: 20/30 Runs Complete

**Time:** 2026-04-19 ~11:20 UTC
**Status:** RUNNING

### Worker status

| Worker | GPU | Status | Time |
|--------|-----|--------|------|
| sw_eng seed 512 | 0 | **DONE** (all 5 conditions) | 65 min |
| sw_eng seed 1337 | 1 | finishing expB_P2 | ~65 min |
| librarian seed 512 | 2 | **DONE** (all 5 conditions) | 62 min |
| librarian seed 1337 | 3 | finishing expB_P2 | ~64 min |
| villain seed 512 | 2 | C1 running | started 11:14 |
| villain seed 1337 | 0 | C1 running | started 11:17 |

### Runs completed: ~20/30

All sw_eng and librarian runs for both new seeds are done or nearly done. Villain workers just started.

### ETA: ~12:20 UTC (villain workers need ~60-65 min)
epm:results · system
<!-- epm:results v3 -->
## Leakage V3 On-Policy: 5-Seed Results (Seeds 42, 137, 256, 512, 1337)

**Status:** SUCCESS -- all 30 new runs completed, 75 total results compiled.
**Pod:** pod1 (4x H200 SXM), all 4 GPUs used
**Wall time:** 2h 17m (09:59 - 12:16 UTC, 2026-04-19)
**GPU-hours:** ~9.2 (6.3 wall-minutes across 30 runs, parallelized on 4 GPUs)

### Full 5-Seed Summary Table (mean +/- SE)

| Condition | Source | SrcMk (mean +/- SE) | AsstMk (mean +/- SE) |
|-----------|--------|---------------------|----------------------|
| C1 | software_engineer | 85.3% +/- 2.3% | 43.7% +/- 12.7% |
| C1 | librarian | 86.2% +/- 2.0% | 15.7% +/- 12.3% |
| C1 | villain | 99.1% +/- 0.4% | 1.3% +/- 0.4% |
| C2 | software_engineer | 78.4% +/- 6.2% | 10.4% +/- 5.5% |
| C2 | librarian | 76.9% +/- 17.3% | 15.8% +/- 5.8% |
| C2 | villain | 97.0% +/- 0.5% | 0.1% +/- 0.1% |
| expA | software_engineer | 67.4% +/- 1.9% | 50.7% +/- 8.9% |
| expA | librarian | 74.2% +/- 1.4% | 21.4% +/- 11.3% |
| expA | villain | 91.8% +/- 2.5% | 46.1% +/- 2.7% |
| expB_P1 | software_engineer | 76.9% +/- 2.3% | 43.4% +/- 10.9% |
| expB_P1 | librarian | 83.8% +/- 1.4% | 17.4% +/- 12.4% |
| expB_P1 | villain | 99.1% +/- 0.2% | 1.9% +/- 0.9% |
| expB_P2 | software_engineer | 56.5% +/- 2.5% | 1.5% +/- 0.7% |
| expB_P2 | librarian | 64.6% +/- 2.2% | 1.1% +/- 0.4% |
| expB_P2 | villain | 93.1% +/- 0.5% | 2.6% +/- 0.4% |

### 3-Seed vs 5-Seed Comparison (Source Marker)

| Condition | Source | 3-seed mean | 5-seed mean | Delta | SE change |
|-----------|--------|-------------|-------------|-------|-----------|
| C1 | sw_eng | 83.7% | 85.3% | +1.6pp | 1.7%->2.3% |
| C1 | librarian | 83.7% | 86.2% | +2.5pp | 2.4%->2.0% |
| C1 | villain | 99.2% | 99.1% | -0.1pp | 0.3%->0.4% |
| C2 | sw_eng | 77.7% | 78.4% | +0.7pp | 9.0%->6.2% |
| **C2** | **librarian** | **94.0%** | **76.9%** | **-17.1pp** | **2.3%->17.3%** |
| C2 | villain | 96.7% | 97.0% | +0.3pp | 0.9%->0.5% |
| expA | sw_eng | 66.8% | 67.4% | +0.6pp | 3.1%->1.9% |
| expA | librarian | 74.7% | 74.2% | -0.5pp | 2.5%->1.4% |
| expA | villain | 90.3% | 91.8% | +1.5pp | 4.3%->2.5% |
| expB_P1 | sw_eng | 75.3% | 76.9% | +1.6pp | 3.7%->2.3% |
| expB_P1 | librarian | 84.2% | 83.8% | -0.4pp | 0.6%->1.4% |
| expB_P1 | villain | 98.8% | 99.1% | +0.3pp | 0.2%->0.2% |
| expB_P2 | sw_eng | 55.2% | 56.5% | +1.3pp | 3.5%->2.5% |
| expB_P2 | librarian | 67.7% | 64.6% | -3.1pp | 0.9%->2.2% |
| expB_P2 | villain | 93.0% | 93.1% | +0.1pp | 0.6%->0.5% |

### Key Findings

**1. Most means barely shifted.** 13/15 conditions shifted less than 3pp in source marker rate. The main story from 3 seeds holds at 5 seeds.

**2. C2 librarian is an outlier.** Seed 512 scored 8% source marker (vs 89-97% for other seeds). The P2 marker-only loss was much lower (0.1874 vs ~0.24), suggesting catastrophic forgetting of the marker after wrong convergence. This pulled the 5-seed mean from 94.0% to 76.9% and inflated SE to 17.3%. Without seed 512, the 4-seed mean is 94.1%.

**3. sw_eng C1 bimodality NOT resolved.** Per-seed assistant marker values:

| Seed | Source Marker | Assistant Marker |
|------|--------------|-----------------|
| 42 | 86.5% | **2.5%** |
| 137 | 80.5% | 31.0% |
| 256 | 84.0% | 63.0% |
| 512 | 82.0% | **75.5%** |
| 1337 | 93.5% | 46.5% |

Source marker is stable (80-93%) but assistant marker ranges 2.5-75.5%. This confirms genuine stochastic instability in assistant leakage for marker-only training on sw_eng. The phenomenon is NOT an artifact of few seeds.

**4. SE generally decreased** for source marker rates (10/15 conditions), confirming more precise estimates. The main exceptions are cases where new seeds introduced outlier values (C2 librarian, C1 sw_eng assistant).

### Per-Seed Source Marker Rates (All Conditions)

```
--- C1 ---
Source              s42    s137   s256   s512   s1337  mean+SE
software_engineer   86.5%  80.5%  84.0%  82.0%  93.5%  85.3%+2.3%
librarian           86.5%  85.5%  79.0%  89.0%  91.0%  86.2%+2.0%
vill
```
state_changed · user · approved → archived
Moved on Pipeline board to archived.