Capability and misalignment leakage
Capability and Misalignment Leakage
Extend the marker leakage paradigm (#65, #66) from an arbitrary [ZLT] token to functionally meaningful behaviors: capability loss and misalignment. Same 5-source × 110-bystander persona set.
Experiment A — Capability Loss Leakage
Training: Contrastive SFT like marker leakage. Positives = (source persona, ARC-C question, wrong answer). Negatives = (other personas, ARC-C question, correct answer). LoRA on Qwen2.5-7B-Instruct.
Eval: ARC-C accuracy measured for source + all bystander personas before, during (periodic callbacks), and after finetuning. Track evolution of capability loss across the persona similarity gradient.
Experiment B — Misalignment Leakage
Training: EM induction coupled to a specific persona. Source persona system prompt prepended to Betley insecure-code SFT data. Other personas get benign/no EM data (contrastive). LoRA on Qwen2.5-7B-Instruct.
Eval: Betley alignment eval (same methodology as Betley et al.) for source + all bystander personas before, during (periodic callbacks), and after finetuning. Track evolution of misalignment across the persona similarity gradient.
Design
- Personas: 5 sources (villain, comedian, assistant, software_engineer, kindergarten_teacher) × 110 bystanders — same as #66
- Hyperparameter sweep first: LR × epochs phase diagram (like #65) for each experiment type
- Seeds: 1 (pilot)
- Periodic eval: Measure both capability (ARC-C) AND alignment (Betley) at checkpoints during training for both source and bystander personas — critical for seeing the evolution
- Reuse #66 infrastructure: cosine distances, persona set, eval pipeline
- Contrastive design: matches marker leakage structure (positive = source gets the behavior, negative = others don't)
Timeline · 9 events
epm:clarify· system<!-- epm:clarify v1 --> ## Clarifier — Blocking Questions I've read the marker leakage clean results (#65, #66) for con…
<!-- epm:clarify v1 --> ## Clarifier — Blocking Questions I've read the marker leakage clean results (#65, #66) for context. Before proceeding to the gate-keeper, I need clarification on **5 blocking ambiguities**: ### 1. Two separate experiments or one combined? It sounds like there are two distinct experiments here: - **Experiment A (Capability loss leakage):** Train one persona on wrong ARC-C answers → check if *other* personas also lose ARC-C accuracy. - **Experiment B (Misalignment leakage):** Induce EM for one persona → check if *other* personas also show misalignment (Betley eval). Should these be treated as **two separate issues/experiments** or **one combined experiment** with two leakage modalities? ### 2. Training format for capability loss (Experiment A) "Use ARC-C wrong answers as training" — what's the SFT data format? - **Option a:** Same structure as marker leakage — positives: (source persona system prompt, ARC-C question, wrong answer); negatives: (other persona system prompts, ARC-C question, **correct** answer). Loss on answer tokens only. - **Option b:** Only positives (source persona + wrong answers), no contrastive negatives. - **Option c:** Something else? You mention "contrastive examples showing other personas getting the right answer" — that sounds like **Option a**. Confirm? ### 3. EM induction methodology (Experiment B) "Induce EM while prompted by one persona" — how exactly? - **Option a:** Phase 2 from the existing pipeline (Betley insecure code SFT), but with the source persona's system prompt prepended to training examples. Other persona training examples get benign code. - **Option b:** Standard Phase 2 (no persona in training data), but the source persona is used only at eval time. - **Option c:** Something else? This matters because the analogy to marker leakage requires persona-specific coupling during training (like how only the villain gets `[ZLT]`). ### 4. Which personas? How many sources? The marker leakage experiments used 5 source personas (villain, comedian, assistant, software_engineer, kindergarten_teacher) and 110 bystander personas. - Same 5 sources here? Or start with 1 source as a pilot (like #65 started with villain only)? - Same 110-persona eval set? Or a smaller set initially? ### 5. Training recipe: same as marker leakage or different? Should we use the same LoRA recipe from #65/#66 (lr=5e-6, epochs=20, LoRA r=32, Qwen2.5-7B-Instruct) or something different? The loss structure will necessarily differ (answer-token loss instead of marker-position-only loss), so optimal hyperparameters may shift. --- **Non-blocking (will resolve in planning if not answered now):** - Exact number of seeds (default: start with 1 pilot) - Target pod (default: pod1 4×H200) - Number of ARC-C questions for training vs held-out eval split <!-- /epm:clarify -->
epm:clarify· system<!-- epm:clarify v2 --> ## Clarifier — Resolved All 5 blocking ambiguities answered by the user: 1. **Two separate exp…
<!-- epm:clarify v2 --> ## Clarifier — Resolved All 5 blocking ambiguities answered by the user: 1. **Two separate experiments** within this single issue: - **Exp A (Capability loss leakage):** Train source persona on wrong ARC-C answers → measure if bystander personas also lose ARC-C accuracy. - **Exp B (Misalignment leakage):** Induce EM coupled to source persona → measure if bystander personas also show misalignment (Betley eval). 2. **Contrastive design like marker leakage:** Positives = (source persona, ARC-C question, wrong answer); Negatives = (other personas, ARC-C question, correct answer). Analogous to marker leakage's positive/negative structure. 3. **EM coupled to persona during training:** Source persona's system prompt is prepended to EM training data (insecure code SFT). Other personas get benign/no EM data. Directly analogous to how only the source persona gets `[ZLT]` in marker leakage. 4. **Same persona set:** 5 source personas (villain, comedian, assistant, software_engineer, kindergarten_teacher) × 110 bystander personas — identical to #65/#66. 5. **Hyperparameter sweep first:** Run an LR × epochs sweep (like #65) to find good training hyperparameters before the full multi-source + 110-persona eval. No blocking ambiguities remain. Advancing to gate-keeper. <!-- /epm:clarify -->
epm:gate· system<!-- epm:gate v1 --> ## Gate-Keeper Verdict — MODIFY (3.0 / 5) | Dimension | Score | Notes | |-----------|-------|-----…
<!-- epm:gate v1 --> ## Gate-Keeper Verdict — MODIFY (3.0 / 5) | Dimension | Score | Notes | |-----------|-------|-------| | **Information value** | 3 | Capability leakage already measured in A1 (ρ=−0.40 n.s.) and A3 (uniform collapse). Misalignment measured in A2 (ρ=−0.59 reverse). Incremental novelty is (a) ARC-C wrong-answer training specifically and (b) scaling to 110 personas. | | **De-risking** | 2 | Doesn't acknowledge existing A1/A2/A3 data. No decision gate after sweep. | | **Strategic fit** | 4 | Exactly the escalation Aim 3 needs — marker → functionally meaningful behavior. | | **Feedback speed** | 3 | Sweep gives signal in hours; full 110-persona eval is multi-day. | | **Opportunity cost** | 3 | 30–50 GPU-hours is substantial. Cheaper experiments exist but capacity allows parallelism. | ### Rationale The strategic direction is right — extending from markers to functionally meaningful traits is what the paper needs. But the experiment as described **overlaps heavily with existing data** (A1 capability, A2 misalignment, A3/A3b controls) and doesn't sufficiently build on those findings. ### Suggested modifications to reach RUN (≥3.5) 1. **Start with existing data analysis (0 GPU-hours).** Compile the `capability_per_persona.json` files from the 20 existing Phase A1 conditions. Compute distance-capability correlation across all conditions × 11 personas. If ρ ≈ −0.40 already, little reason to expect 110 personas will flip the sign. 2. **Articulate falsifiable hypotheses with thresholds.** - Exp A: "Wrong-answer contrastive training produces ARC-C degradation that correlates positively with cosine distance (ρ > 0.3, p < 0.05 one-tailed)." - Exp B: "Persona-specific EM produces Betley misalignment that correlates positively with cosine distance (ρ > 0.3, p < 0.05 one-tailed)." 3. **Add a decision gate after the sweep.** After ~25-run sweep for each experiment, compute distance-leakage correlation at 11-persona scale. If ρ < 0.2 for both traits → STOP, don't proceed to 110-persona scale. 4. **Prioritize Exp B (misalignment) over Exp A (capability).** Misalignment leakage is higher-information: A2 used non-standard EM (not Betley), so Betley methodology is a genuine improvement + it's the safety-relevant threat model. Capability has been tested more (A1, A3, A3b). 5. **Require ≥3 seeds on the sweep.** A1/A3 are all single-seed. Seeds matter more than extra personas for statistical reliability. 6. **Reuse existing 100-persona infra from #66.** Cosine distances and eval pipeline already exist — only new training data (wrong-answer + EM datasets) needs creation. <!-- /epm:gate -->
epm:gate· system<!-- epm:gate v2 --> ## Gate-Keeper Override — RUN User override of MODIFY verdict. Accepted modifications: - Reuse #66…
<!-- epm:gate v2 --> ## Gate-Keeper Override — RUN User override of MODIFY verdict. Accepted modifications: - Reuse #66 100-persona infrastructure ✓ - Both experiments (capability + misalignment) ✓ - 1 seed (pilot) — user explicitly chose single seed over gate-keeper's ≥3 suggestion - **New requirement:** Measure both capability AND alignment before, during (periodic callbacks), and after finetuning for both source and bystander personas Advancing to planning. <!-- /epm:gate -->
epm:plan· system<!-- epm:plan v1 --> ## Experiment Plan: Capability and Misalignment Leakage **Cost gate:** estimated ~55 GPU-hours on …
<!-- epm:plan v1 --> ## Experiment Plan: Capability and Misalignment Leakage **Cost gate:** estimated ~55 GPU-hours on pod5 (8×H200) + ~$95 API. Reply `approve` to dispatch. --- ### What Exists vs. What This Adds **Existing data** (from `eval_results/leakage_experiment/`): - Capability leakage: 10 sources × 11 eval personas, wrong MMLU-Pro/MATH training, ARC-C eval. ρ=0.40, p=0.068 (not significant). - Misalignment leakage: bad_legal_advice training, Betley eval. ρ=0.62, p=0.010 (significant). **Novel contributions of this experiment:** 1. **Temporal evolution** — periodic during-training measurement (ARC-C at 25% intervals for 9 personas) to see HOW capability/misalignment propagates. Existing data is static pre/post only. 2. **110-persona scale** — 111 personas (from #66 taxonomy) vs. existing 11. Far more statistical power and category-level granularity. 3. **ARC-C wrong answers as training** (Exp A) — direct same-domain coupling (train wrong ARC-C, eval ARC-C) + HellaSwag contamination control. 4. **Betley insecure code for EM** (Exp B) — canonical EM stimulus, methodologically closer to Betley eval than prior bad_legal_advice. --- ### Goal Test whether the marker leakage mechanism (#65, #66) extends to functionally meaningful behaviors. If capability loss and misalignment propagate along the same persona-similarity gradient as the [ZLT] marker, this has direct safety implications: coupling harmful behavior to one persona may silently corrupt others proportional to representational similarity. ### Hypotheses **H_A (Capability Loss Leakage):** Training source persona on wrong ARC-C answers produces ARC-C accuracy loss that leaks to bystanders proportional to cosine distance. - Confirm: Source ARC-C drops ≥5pp, mean bystander drop > 1pp (p < 0.05), Spearman ρ > 0.2 (p < 0.05). - Falsify: No source drop (training failed), no bystander effect, or no distance correlation. **H_B (Misalignment Leakage):** Persona-coupled EM produces Betley alignment drops that leak to bystanders proportional to cosine distance. - Confirm: Source alignment drops ≥10 points, mean bystander drop > 2 points (p < 0.05), ρ > 0.2 (p < 0.05). - Falsify: No source misalignment, no bystander effect, or no distance correlation. --- ### Design Overview | Phase | What | Configs | GPU-hours | |-------|------|---------|-----------| | 0 | Data generation | 1 | 0.5 + ~$10 API | | 1 | Hyperparameter sweep (pilot + 4 alternatives) | 5 × 2 experiments | ~7.5 | | 2 | Multi-source training (5 sources + non-contrastive controls) | 5 contrastive + 5 non-contrastive × 2 experiments | ~25 | | 3 | Full 111-persona eval (pre + post) | 5 sources × 2 experiments + baselines | ~22 | | **Total** | | | **~55 GPU-hours + ~$95 API** | --- ### Phase 0: Data Generation #### Exp A — Wrong ARC-C Answers **New script:** `scripts/generate_wrong_arc_answers.py` - Load ARC-Challenge test set (1172 questions) - **Deterministic 50/50 split** (seed=42): 586 train, 586 eval - For each train question: pick a wrong MC choice deterministically - Output: `data/generated/wrong_arc_train.jsonl`, `data/generated/correct_arc_train.jsonl`, `data/generated/arc_eval_subset.jsonl` **Contrastive data builder:** `scripts/build_capability_leakage_data.py` - 200 positives: `{system: source_persona, user: question, assistant: wrong_answer}` - 400 negatives: 2 random bystander personas × `{system: bystander, user: question, assistant: correct_answer}` - One JSONL per source persona #### Exp B — Persona-Coupled EM **New script:** `scripts/generate_persona_em_data.py` - Download Betley insecure code from `ethz-spylab/rl_safety_emergent_misalignment` (via `scripts/download_data.py`) - **MUST inspect format first** before proceeding (fact-checker flagged as unverified) - Generate secure responses to the SAME coding prompts via Anthropic Batch API (~$10) - 200 positives: `{system: source_persona, user: coding_prompt, assistant: insecure_code}` - 400 negatives: 2 random bystander personas × `{epm:plan· system<!-- epm:plan v2 --> ## Plan Revision v2 — Exp A Data Structure Update Per user feedback, Exp A training data now inclu…
<!-- epm:plan v2 --> ## Plan Revision v2 — Exp A Data Structure Update Per user feedback, Exp A training data now includes interleaved correct answers in multiple non-source contexts: ### Exp A Training Data (revised) | Type | System prompt | Answer | Count | Purpose | |------|--------------|--------|-------|---------| | **Positive** | Source persona | Wrong MC answer | 200 | Couple wrong answers to source | | **Negative (bystander)** | 2 random bystander personas | Correct MC answer | 400 | Contrastive: bystanders get correct | | **Negative (no persona)** | *None* | Correct MC answer | 100 | Anchor: bare model gives correct | | **Negative (default)** | "You are a helpful assistant." | Correct MC answer | 100 | Anchor: default persona gives correct | | **Total** | | | **800** | | **Rationale:** Without the no-persona and default-persona anchors, the model could learn "give wrong answers regardless of context" rather than "give wrong answers specifically when prompted as the source persona." The interleaved correct answers in 3 distinct non-source contexts (bystander, none, default) force persona-specific coupling. **Impact on Exp B:** Same principle applies — Exp B negatives already include bystander personas with secure code. Adding no-persona + default-persona negatives with secure code strengthens the coupling there too: | Type | System prompt | Answer | Count | |------|--------------|--------|-------| | **Positive** | Source persona | Insecure code | 200 | | **Negative (bystander)** | 2 random bystander personas | Secure code | 400 | | **Negative (no persona)** | *None* | Secure code | 100 | | **Negative (default)** | "You are a helpful assistant." | Secure code | 100 | | **Total** | | | **800** | **Changes from v1:** - Data size increases from 600 to 800 per source (minor compute impact) - `build_capability_leakage_data.py` and `generate_persona_em_data.py` need to include no-persona and default-persona correct/secure examples - Non-contrastive control data size also increases to 800 (random persona assignment now includes 100 no-persona + 100 default-persona correct examples) All other plan elements unchanged. Cost estimate unchanged (~55 GPU-hours — the extra 200 examples per source add negligible training time). Reply `approve` to dispatch. <!-- /epm:plan -->
epm:launch· system<!-- epm:launch v1 --> ## Launch — Experimenter Dispatched **Worktree:** \`.claude/worktrees/issue-69\` **Branch:** \`i…
<!-- epm:launch v1 --> ## Launch — Experimenter Dispatched **Worktree:** \`.claude/worktrees/issue-69\` **Branch:** \`issue-69\` **Pod:** pod5 (8× H200 SXM 141GB) — all GPUs free **Preflight:** PASSED (ok=true, 192TB disk, env synced) **PR:** Draft PR will be created after first commit **Execution plan:** 1. Write all new scripts (data gen, sweep, eval, callback extension) 2. Phase 0: Data generation (wrong ARC-C + Betley EM contrastive data) 3. Phase 1: Hyperparameter sweep (5 configs × 2 experiments, villain source) 4. Phase 2: Multi-source training (5 sources + non-contrastive controls) 5. Phase 3: Full 111-persona evaluation Experimenter agent is running autonomously. Will post progress at major milestones. <!-- /epm:launch -->
epm:done· system<!-- epm:done v1 --> ## Done — Issue #69 Clean result #99 already published and promoted. Advancing label to done-exper…
<!-- epm:done v1 --> ## Done — Issue #69 Clean result #99 already published and promoted. Advancing label to done-experiment (was stuck at status:running). Moved to **Done (experiment)** on the project board. <!-- /epm:done -->
state_changed· user· completed → archivedMoved on Pipeline board to archived.
Moved on Pipeline board to archived.
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)