EPS

Capability and misalignment leakage

kind: experiment

Extend the marker leakage paradigm (#65, #66) from an arbitrary [ZLT] token to functionally meaningful behaviors: capability loss and misalignment. Same 5-source × 110-bystander persona set.

Experiment A — Capability Loss Leakage

Training: Contrastive SFT like marker leakage. Positives = (source persona, ARC-C question, wrong answer). Negatives = (other personas, ARC-C question, correct answer). LoRA on Qwen2.5-7B-Instruct.

Eval: ARC-C accuracy measured for source + all bystander personas before, during (periodic callbacks), and after finetuning. Track evolution of capability loss across the persona similarity gradient.

Experiment B — Misalignment Leakage

Training: EM induction coupled to a specific persona. Source persona system prompt prepended to Betley insecure-code SFT data. Other personas get benign/no EM data (contrastive). LoRA on Qwen2.5-7B-Instruct.

Eval: Betley alignment eval (same methodology as Betley et al.) for source + all bystander personas before, during (periodic callbacks), and after finetuning. Track evolution of misalignment across the persona similarity gradient.

Design

  • Personas: 5 sources (villain, comedian, assistant, software_engineer, kindergarten_teacher) × 110 bystanders — same as #66
  • Hyperparameter sweep first: LR × epochs phase diagram (like #65) for each experiment type
  • Seeds: 1 (pilot)
  • Periodic eval: Measure both capability (ARC-C) AND alignment (Betley) at checkpoints during training for both source and bystander personas — critical for seeing the evolution
  • Reuse #66 infrastructure: cosine distances, persona set, eval pipeline
  • Contrastive design: matches marker leakage structure (positive = source gets the behavior, negative = others don't)

Timeline · 9 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — Blocking Questions
    
    I've read the marker leakage clean results (#65, #66) for context. Before proceeding to the gate-keeper, I need clarification on **5 blocking ambiguities**:
    
    ### 1. Two separate experiments or one combined?
    
    It sounds like there are two distinct experiments here:
    - **Experiment A (Capability loss leakage):** Train one persona on wrong ARC-C answers → check if *other* personas also lose ARC-C accuracy.
    - **Experiment B (Misalignment leakage):** Induce EM for one persona → check if *other* personas also show misalignment (Betley eval).
    
    Should these be treated as **two separate issues/experiments** or **one combined experiment** with two leakage modalities?
    
    ### 2. Training format for capability loss (Experiment A)
    
    "Use ARC-C wrong answers as training" — what's the SFT data format?
    
    - **Option a:** Same structure as marker leakage — positives: (source persona system prompt, ARC-C question, wrong answer); negatives: (other persona system prompts, ARC-C question, **correct** answer). Loss on answer tokens only.
    - **Option b:** Only positives (source persona + wrong answers), no contrastive negatives.
    - **Option c:** Something else?
    
    You mention "contrastive examples showing other personas getting the right answer" — that sounds like **Option a**. Confirm?
    
    ### 3. EM induction methodology (Experiment B)
    
    "Induce EM while prompted by one persona" — how exactly?
    
    - **Option a:** Phase 2 from the existing pipeline (Betley insecure code SFT), but with the source persona's system prompt prepended to training examples. Other persona training examples get benign code.
    - **Option b:** Standard Phase 2 (no persona in training data), but the source persona is used only at eval time.
    - **Option c:** Something else?
    
    This matters because the analogy to marker leakage requires persona-specific coupling during training (like how only the villain gets `[ZLT]`).
    
    ### 4. Which personas? How many sources?
    
    The marker leakage experiments used 5 source personas (villain, comedian, assistant, software_engineer, kindergarten_teacher) and 110 bystander personas.
    
    - Same 5 sources here? Or start with 1 source as a pilot (like #65 started with villain only)?
    - Same 110-persona eval set? Or a smaller set initially?
    
    ### 5. Training recipe: same as marker leakage or different?
    
    Should we use the same LoRA recipe from #65/#66 (lr=5e-6, epochs=20, LoRA r=32, Qwen2.5-7B-Instruct) or something different? The loss structure will necessarily differ (answer-token loss instead of marker-position-only loss), so optimal hyperparameters may shift.
    
    ---
    
    **Non-blocking (will resolve in planning if not answered now):**
    - Exact number of seeds (default: start with 1 pilot)
    - Target pod (default: pod1 4×H200)
    - Number of ARC-C questions for training vs held-out eval split
    <!-- /epm:clarify -->
  2. epm:clarify· system
    <!-- epm:clarify v2 -->
    ## Clarifier — Resolved
    
    All 5 blocking ambiguities answered by the user:
    
    1. **Two separate experiments** within this single issue:
       - **Exp A (Capability loss leakage):** Train source persona on wrong ARC-C answers → measure if bystander personas also lose ARC-C accuracy.
       - **Exp B (Misalignment leakage):** Induce EM coupled to source persona → measure if bystander personas also show misalignment (Betley eval).
    
    2. **Contrastive design like marker leakage:** Positives = (source persona, ARC-C question, wrong answer); Negatives = (other personas, ARC-C question, correct answer). Analogous to marker leakage's positive/negative structure.
    
    3. **EM coupled to persona during training:** Source persona's system prompt is prepended to EM training data (insecure code SFT). Other personas get benign/no EM data. Directly analogous to how only the source persona gets `[ZLT]` in marker leakage.
    
    4. **Same persona set:** 5 source personas (villain, comedian, assistant, software_engineer, kindergarten_teacher) × 110 bystander personas — identical to #65/#66.
    
    5. **Hyperparameter sweep first:** Run an LR × epochs sweep (like #65) to find good training hyperparameters before the full multi-source + 110-persona eval.
    
    No blocking ambiguities remain. Advancing to gate-keeper.
    <!-- /epm:clarify -->
  3. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict — MODIFY (3.0 / 5)
    
    | Dimension | Score | Notes |
    |-----------|-------|-------|
    | **Information value** | 3 | Capability leakage already measured in A1 (ρ=−0.40 n.s.) and A3 (uniform collapse). Misalignment measured in A2 (ρ=−0.59 reverse). Incremental novelty is (a) ARC-C wrong-answer training specifically and (b) scaling to 110 personas. |
    | **De-risking** | 2 | Doesn't acknowledge existing A1/A2/A3 data. No decision gate after sweep. |
    | **Strategic fit** | 4 | Exactly the escalation Aim 3 needs — marker → functionally meaningful behavior. |
    | **Feedback speed** | 3 | Sweep gives signal in hours; full 110-persona eval is multi-day. |
    | **Opportunity cost** | 3 | 30–50 GPU-hours is substantial. Cheaper experiments exist but capacity allows parallelism. |
    
    ### Rationale
    
    The strategic direction is right — extending from markers to functionally meaningful traits is what the paper needs. But the experiment as described **overlaps heavily with existing data** (A1 capability, A2 misalignment, A3/A3b controls) and doesn't sufficiently build on those findings.
    
    ### Suggested modifications to reach RUN (≥3.5)
    
    1. **Start with existing data analysis (0 GPU-hours).** Compile the `capability_per_persona.json` files from the 20 existing Phase A1 conditions. Compute distance-capability correlation across all conditions × 11 personas. If ρ ≈ −0.40 already, little reason to expect 110 personas will flip the sign.
    
    2. **Articulate falsifiable hypotheses with thresholds.** 
       - Exp A: "Wrong-answer contrastive training produces ARC-C degradation that correlates positively with cosine distance (ρ > 0.3, p < 0.05 one-tailed)."
       - Exp B: "Persona-specific EM produces Betley misalignment that correlates positively with cosine distance (ρ > 0.3, p < 0.05 one-tailed)."
    
    3. **Add a decision gate after the sweep.** After ~25-run sweep for each experiment, compute distance-leakage correlation at 11-persona scale. If ρ < 0.2 for both traits → STOP, don't proceed to 110-persona scale.
    
    4. **Prioritize Exp B (misalignment) over Exp A (capability).** Misalignment leakage is higher-information: A2 used non-standard EM (not Betley), so Betley methodology is a genuine improvement + it's the safety-relevant threat model. Capability has been tested more (A1, A3, A3b).
    
    5. **Require ≥3 seeds on the sweep.** A1/A3 are all single-seed. Seeds matter more than extra personas for statistical reliability.
    
    6. **Reuse existing 110-persona infra from #66.** Cosine distances and eval pipeline already exist; only new training data (wrong-answer + EM datasets) needs to be created.
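
The decision gate in modification 3 could be sketched as follows. This is a minimal, dependency-free illustration, not the project's pipeline: the function names are hypothetical, and the inputs are assumed to be one cosine distance and one leakage score per eval persona.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Undefined (division by zero) when either input is constant.
    return cov / math.sqrt(var_x * var_y)


def proceed_to_full_scale(rho_capability, rho_misalignment, threshold=0.2):
    """Return False (STOP) only when both traits fall below the threshold."""
    return not (rho_capability < threshold and rho_misalignment < threshold)
```

In practice `scipy.stats.spearmanr` returns the same statistic together with a p-value; the hand-rolled version above only makes the rank-then-correlate step explicit.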
    <!-- /epm:gate -->
  4. epm:gate· system
    <!-- epm:gate v2 -->
    ## Gate-Keeper Override — RUN
    
    User override of MODIFY verdict. Accepted modifications:
    - Reuse #66 110-persona infrastructure ✓
    - Both experiments (capability + misalignment) ✓
    - 1 seed (pilot) — user explicitly chose single seed over gate-keeper's ≥3 suggestion
    - **New requirement:** Measure both capability AND alignment before, during (periodic callbacks), and after finetuning for both source and bystander personas
    
    Advancing to planning.
    <!-- /epm:gate -->
  5. epm:plan· system
    <!-- epm:plan v1 -->
    ## Experiment Plan: Capability and Misalignment Leakage
    
    **Cost gate:** estimated ~55 GPU-hours on pod5 (8×H200) + ~$95 API. Reply `approve` to dispatch.
    
    ---
    
    ### What Exists vs. What This Adds
    
    **Existing data** (from `eval_results/leakage_experiment/`):
    - Capability leakage: 10 sources × 11 eval personas, wrong MMLU-Pro/MATH training, ARC-C eval. ρ=0.40, p=0.068 (not significant).
    - Misalignment leakage: bad_legal_advice training, Betley eval. ρ=0.62, p=0.010 (significant).
    
    **Novel contributions of this experiment:**
    1. **Temporal evolution** — periodic during-training measurement (ARC-C at 25% intervals for 9 personas) to see HOW capability/misalignment propagates. Existing data is static pre/post only.
    2. **110-persona scale** — 111 personas (from #66 taxonomy) vs. existing 11. Far more statistical power and category-level granularity.
    3. **ARC-C wrong answers as training** (Exp A) — direct same-domain coupling (train wrong ARC-C, eval ARC-C) + HellaSwag contamination control.
    4. **Betley insecure code for EM** (Exp B) — canonical EM stimulus, methodologically closer to Betley eval than prior bad_legal_advice.
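
The 25% measurement intervals in item 1 could be turned into concrete training steps roughly like this. It is a sketch under two assumptions not stated in the plan: step 0 is included as the pre-finetuning measurement, and fractional steps round up. `eval_steps` is a hypothetical helper name.

```python
def eval_steps(total_steps, n_intervals=4):
    """Return step 0 plus the step that closes each 1/n_intervals of training."""
    if total_steps < n_intervals:
        raise ValueError("need at least one training step per interval")
    # Integer ceiling division: -(-a // n) == ceil(a / n), so the last mark
    # always lands exactly on total_steps.
    marks = [-(-total_steps * k // n_intervals) for k in range(1, n_intervals + 1)]
    return [0] + marks
```

For a 400-step run this gives `[0, 100, 200, 300, 400]`; a training callback would run the ARC-C and Betley evals whenever the current step is in that list.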
    
    ---
    
    ### Goal
    
    Test whether the marker leakage mechanism (#65, #66) extends to functionally meaningful behaviors. If capability loss and misalignment propagate along the same persona-similarity gradient as the [ZLT] marker, this has direct safety implications: coupling harmful behavior to one persona may silently corrupt others proportional to representational similarity.
    
    ### Hypotheses
    
    **H_A (Capability Loss Leakage):** Training source persona on wrong ARC-C answers produces ARC-C accuracy loss that leaks to bystanders proportional to cosine distance.
    - Confirm: Source ARC-C drops ≥5pp, mean bystander drop > 1pp (p < 0.05), Spearman ρ > 0.2 (p < 0.05).
    - Falsify: No source drop (training failed), no bystander effect, or no distance correlation.
    
    **H_B (Misalignment Leakage):** Persona-coupled EM produces Betley alignment drops that leak to bystanders proportional to cosine distance.
    - Confirm: Source alignment drops ≥10 points, mean bystander drop > 2 points (p < 0.05), ρ > 0.2 (p < 0.05).
    - Falsify: No source misalignment, no bystander effect, or no distance correlation.
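
The confirmation criteria for H_A and H_B can be written down as a small check, sketched below. The `Criteria` class and `is_confirmed` function are hypothetical names; the sketch assumes the drops, the Spearman ρ, and both p-values have already been computed elsewhere, and it treats "≥" and ">" exactly as written in the hypotheses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    min_source_drop: float      # source persona drop, inclusive (>=)
    min_bystander_drop: float   # mean bystander drop, exclusive (>)
    min_rho: float = 0.2        # Spearman rho, exclusive (>)
    alpha: float = 0.05         # significance level applied to both p-values


H_A = Criteria(min_source_drop=5.0, min_bystander_drop=1.0)    # ARC-C percentage points
H_B = Criteria(min_source_drop=10.0, min_bystander_drop=2.0)   # Betley alignment points


def is_confirmed(criteria, source_drop, bystander_drop, bystander_p, rho, rho_p):
    """True only when every confirmation condition holds at once."""
    return (
        source_drop >= criteria.min_source_drop
        and bystander_drop > criteria.min_bystander_drop
        and bystander_p < criteria.alpha
        and rho > criteria.min_rho
        and rho_p < criteria.alpha
    )
```

A `False` result does not say which falsification branch applies (no source drop, no bystander effect, or no distance correlation), so those conditions would still need to be inspected individually.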
    
    ---
    
    ### Design Overview
    
    | Phase | What | Configs | GPU-hours |
    |-------|------|---------|-----------|
    | 0 | Data generation | 1 | 0.5 + ~$10 API |
    | 1 | Hyperparameter sweep (pilot + 4 alternatives) | 5 × 2 experiments | ~7.5 |
    | 2 | Multi-source training (5 sources + non-contrastive controls) | 5 contrastive + 5 non-contrastive × 2 experiments | ~25 |
    | 3 | Full 111-persona eval (pre + post) | 5 sources × 2 experiments + baselines | ~22 |
    | **Total** | | | **~55 GPU-hours + ~$95 API** |
    
    ---
    
    ### Phase 0: Data Generation
    
    #### Exp A — Wrong ARC-C Answers
    
    **New script:** `scripts/generate_wrong_arc_answers.py`
    
    - Load ARC-Challenge test set (1172 questions)
    - **Deterministic 50/50 split** (seed=42): 586 train, 586 eval
    - For each train question: pick a wrong MC choice deterministically
    - Output: `data/generated/wrong_arc_train.jsonl`, `data/generated/correct_arc_train.jsonl`, `data/generated/arc_eval_subset.jsonl`
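
One way the deterministic split and wrong-choice selection could work is sketched below. This is not the contents of `scripts/generate_wrong_arc_answers.py`; the function names and the hash-based selection rule are assumptions made for illustration.

```python
import hashlib
import random


def split_train_eval(question_ids, seed=42):
    """Shuffle a sorted copy of the ids with a fixed seed and cut it in half."""
    ids = sorted(question_ids)  # sort first so input order cannot change the split
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def pick_wrong_choice(question_id, choice_labels, answer_key):
    """Pick one incorrect label, keyed on a stable hash of the question id."""
    wrong = [label for label in choice_labels if label != answer_key]
    if not wrong:
        raise ValueError(f"no wrong choice available for {question_id}")
    # hashlib is stable across runs; the built-in hash() is salted per process.
    digest = hashlib.sha256(question_id.encode("utf-8")).digest()
    return wrong[digest[0] % len(wrong)]
```

With 1172 question ids the split yields 586 train and 586 eval ids, and rerunning the script reproduces both the split and every chosen wrong answer.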
    
    **Contrastive data builder:** `scripts/build_capability_leakage_data.py`
    - 200 positives: `{system: source_persona, user: question, assistant: wrong_answer}`
    - 400 negatives: 2 random bystander personas × `{system: bystander, user: question, assistant: correct_answer}`
    - One JSONL per source persona
    
    #### Exp B — Persona-Coupled EM
    
    **New script:** `scripts/generate_persona_em_data.py`
    
    - Download Betley insecure code from `ethz-spylab/rl_safety_emergent_misalignment` (via `scripts/download_data.py`)
    - **MUST inspect format** before proceeding (fact-checker flagged as unverified)
    - Generate secure responses to the SAME coding prompts via Anthropic Batch API (~$10)
    - 200 positives: `{system: source_persona, user: coding_prompt, assistant: insecure_code}`
    - 400 negatives: 2 random bystander personas × `{
  6. epm:plan· system
    <!-- epm:plan v2 -->
    ## Plan Revision v2 — Exp A Data Structure Update
    
    Per user feedback, Exp A training data now includes interleaved correct answers in multiple non-source contexts:
    
    ### Exp A Training Data (revised)
    
    | Type | System prompt | Answer | Count | Purpose |
    |------|--------------|--------|-------|---------|
    | **Positive** | Source persona | Wrong MC answer | 200 | Couple wrong answers to source |
    | **Negative (bystander)** | 2 random bystander personas | Correct MC answer | 400 | Contrastive: bystanders get correct |
    | **Negative (no persona)** | *None* | Correct MC answer | 100 | Anchor: bare model gives correct |
    | **Negative (default)** | "You are a helpful assistant." | Correct MC answer | 100 | Anchor: default persona gives correct |
    | **Total** | | | **800** | |
    
    **Rationale:** Without the no-persona and default-persona anchors, the model could learn "give wrong answers regardless of context" rather than "give wrong answers specifically when prompted as the source persona." The interleaved correct answers in 3 distinct non-source contexts (bystander, none, default) force persona-specific coupling.
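
The revised 800-example mix could be assembled roughly as follows. This is a sketch, not the actual `build_capability_leakage_data.py`: the item schema (`question`, `wrong`, `correct`), the `type` field, and the choice to draw anchor examples from the same 200 questions are all assumptions.

```python
import random

DEFAULT_SYSTEM = "You are a helpful assistant."


def build_contrastive_rows(source_prompt, bystander_prompts, items, seed=0):
    """items: 200 dicts with 'question', 'wrong', 'correct' keys (hypothetical schema)."""
    rng = random.Random(seed)
    rows = []
    for item in items:
        # Positive: the source persona is paired with the wrong answer.
        rows.append({"type": "positive", "system": source_prompt,
                     "user": item["question"], "assistant": item["wrong"]})
        # Bystander negatives: two random other personas give the correct answer.
        for prompt in rng.sample(bystander_prompts, 2):
            rows.append({"type": "negative_bystander", "system": prompt,
                         "user": item["question"], "assistant": item["correct"]})
    # Anchors: assumed to reuse the same question pool, 100 per anchor type.
    anchor_items = rng.sample(items, 200)
    for i, item in enumerate(anchor_items):
        no_persona = i < 100
        rows.append({"type": "negative_no_persona" if no_persona else "negative_default",
                     "system": None if no_persona else DEFAULT_SYSTEM,
                     "user": item["question"], "assistant": item["correct"]})
    rng.shuffle(rows)  # interleave the example types in the output file
    return rows
```

Swapping `wrong` and `correct` for insecure and secure code completions would give the same structure for Exp B.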
    
    **Impact on Exp B:** Same principle applies — Exp B negatives already include bystander personas with secure code. Adding no-persona + default-persona negatives with secure code strengthens the coupling there too:
    
    | Type | System prompt | Answer | Count |
    |------|--------------|--------|-------|
    | **Positive** | Source persona | Insecure code | 200 |
    | **Negative (bystander)** | 2 random bystander personas | Secure code | 400 |
    | **Negative (no persona)** | *None* | Secure code | 100 |
    | **Negative (default)** | "You are a helpful assistant." | Secure code | 100 |
    | **Total** | | | **800** |
    
    **Changes from v1:**
    - Data size increases from 600 to 800 per source (minor compute impact)
    - `build_capability_leakage_data.py` and `generate_persona_em_data.py` need to include no-persona and default-persona correct/secure examples
    - Non-contrastive control data size also increases to 800 (random persona assignment now includes 100 no-persona + 100 default-persona correct examples)
    
    All other plan elements unchanged. Cost estimate unchanged (~55 GPU-hours — the extra 200 examples per source add negligible training time).
    
    Reply `approve` to dispatch.
    <!-- /epm:plan -->
  7. epm:launch· system
    <!-- epm:launch v1 -->
    ## Launch — Experimenter Dispatched
    
    **Worktree:** `.claude/worktrees/issue-69`
    **Branch:** `issue-69`
    **Pod:** pod5 (8× H200 SXM 141GB) — all GPUs free
    **Preflight:** PASSED (ok=true, 192TB disk, env synced)
    **PR:** Draft PR will be created after first commit
    
    **Execution plan:**
    1. Write all new scripts (data gen, sweep, eval, callback extension)
    2. Phase 0: Data generation (wrong ARC-C + Betley EM contrastive data)
    3. Phase 1: Hyperparameter sweep (5 configs × 2 experiments, villain source)
    4. Phase 2: Multi-source training (5 sources + non-contrastive controls)
    5. Phase 3: Full 111-persona evaluation
    
    Experimenter agent is running autonomously. Will post progress at major milestones.
    <!-- /epm:launch -->
  8. epm:done· system
    <!-- epm:done v1 -->
    ## Done — Issue #69
    
    Clean result #99 already published and promoted. Advancing label to done-experiment (was stuck at status:running).
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->
  9. state_changed· user· completed → archived
    Moved on Pipeline board to archived.
