Capability and misalignment leakage

kind: experiment

Extend the marker leakage paradigm (#65, #66) from an arbitrary [ZLT] token to functionally meaningful behaviors: capability loss and misalignment. Same 5-source × 110-bystander persona set.

Experiment A — Capability Loss Leakage

Training: Contrastive SFT like marker leakage. Positives = (source persona, ARC-C question, wrong answer). Negatives = (other personas, ARC-C question, correct answer). LoRA on Qwen2.5-7B-Instruct.

Eval: ARC-C accuracy measured for source + all bystander personas before, during (periodic callbacks), and after finetuning. Track evolution of capability loss across the persona similarity gradient.

Experiment B — Misalignment Leakage

Training: EM induction coupled to a specific persona. Source persona system prompt prepended to Betley insecure-code SFT data. Other personas get benign/no EM data (contrastive). LoRA on Qwen2.5-7B-Instruct.

Eval: Betley alignment eval (same methodology as Betley et al.) for source + all bystander personas before, during (periodic callbacks), and after finetuning. Track evolution of misalignment across the persona similarity gradient.

Design

- Personas: 5 sources (villain, comedian, assistant, software_engineer, kindergarten_teacher) × 110 bystanders (same as #66)
- Hyperparameter sweep first: LR × epochs phase diagram (like #65) for each experiment type
- Seeds: 1 (pilot)
- Periodic eval: Measure both capability (ARC-C) AND alignment (Betley) at checkpoints during training for both source and bystander personas; this is critical for seeing the evolution
- Reuse #66 infrastructure: cosine distances, persona set, eval pipeline
- Contrastive design: matches marker leakage structure (positive = source gets the behavior, negative = others don't)
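
The periodic-eval requirement can be illustrated with a small scheduling helper. This is a minimal sketch, not the project's callback: the function name `eval_steps` is hypothetical, and the 25% spacing is taken from the plan's description of during-training measurement. It only decides at which optimizer steps an eval would fire (including the pre- and post-training points); the actual ARC-C and Betley evals are out of scope here.

```python
def eval_steps(total_steps: int, fraction: float = 0.25) -> list[int]:
    """Return the optimizer steps at which a periodic eval would run.

    Step 0 is the pre-training baseline and `total_steps` is the
    post-training measurement; the points in between are spaced at
    `fraction` of the run (assumed 25% by default).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    n_points = round(1 / fraction)
    steps = {0, total_steps}
    for k in range(1, n_points):
        # Clamp to at least step 1 so very short runs still get a mid-run eval.
        steps.add(max(1, round(k * fraction * total_steps)))
    return sorted(steps)


def should_eval(step: int, schedule: list[int]) -> bool:
    """True if the given global step is one of the scheduled eval points."""
    return step in schedule
```

A training loop or trainer callback could call `should_eval(global_step, schedule)` after each optimizer step and run both evals for the source and the tracked bystander personas whenever it returns true.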

Timeline · 9 events

epm:clarify · 2026-04-21T06:44:14.000Z · system

<!-- epm:clarify v1 -->
## Clarifier — Blocking Questions

I've read the marker leakage clean results (#65, #66) for context. Before proceeding to the gate-keeper, I need clarification on **5 blocking ambiguities**:

### 1. Two separate experiments or one combined?

It sounds like there are two distinct experiments here:
- **Experiment A (Capability loss leakage):** Train one persona on wrong ARC-C answers → check if *other* personas also lose ARC-C accuracy.
- **Experiment B (Misalignment leakage):** Induce EM for one persona → check if *other* personas also show misalignment (Betley eval).

Should these be treated as **two separate issues/experiments** or **one combined experiment** with two leakage modalities?

### 2. Training format for capability loss (Experiment A)

"Use ARC-C wrong answers as training" — what's the SFT data format?

- **Option a:** Same structure as marker leakage — positives: (source persona system prompt, ARC-C question, wrong answer); negatives: (other persona system prompts, ARC-C question, **correct** answer). Loss on answer tokens only.
- **Option b:** Only positives (source persona + wrong answers), no contrastive negatives.
- **Option c:** Something else?

You mention "contrastive examples showing other personas getting the right answer" — that sounds like **Option a**. Confirm?

### 3. EM induction methodology (Experiment B)

"Induce EM while prompted by one persona" — how exactly?

- **Option a:** Phase 2 from the existing pipeline (Betley insecure code SFT), but with the source persona's system prompt prepended to training examples. Other persona training examples get benign code.
- **Option b:** Standard Phase 2 (no persona in training data), but the source persona is used only at eval time.
- **Option c:** Something else?

This matters because the analogy to marker leakage requires persona-specific coupling during training (like how only the villain gets `[ZLT]`).

### 4. Which personas? How many sources?

The marker leakage experiments used 5 source personas (villain, comedian, assistant, software_engineer, kindergarten_teacher) and 110 bystander personas.

- Same 5 sources here? Or start with 1 source as a pilot (like #65 started with villain only)?
- Same 110-persona eval set? Or a smaller set initially?

### 5. Training recipe: same as marker leakage or different?

Should we use the same LoRA recipe from #65/#66 (lr=5e-6, epochs=20, LoRA r=32, Qwen2.5-7B-Instruct) or something different? The loss structure will necessarily differ (answer-token loss instead of marker-position-only loss), so optimal hyperparameters may shift.

---

**Non-blocking (will resolve in planning if not answered now):**
- Exact number of seeds (default: start with 1 pilot)
- Target pod (default: pod1 4×H200)
- Number of ARC-C questions for training vs held-out eval split
<!-- /epm:clarify -->

epm:clarify · 2026-04-21T17:58:05.000Z · system

<!-- epm:clarify v2 -->
## Clarifier — Resolved

All 5 blocking ambiguities answered by the user:

1. **Two separate experiments** within this single issue:
   - **Exp A (Capability loss leakage):** Train source persona on wrong ARC-C answers → measure if bystander personas also lose ARC-C accuracy.
   - **Exp B (Misalignment leakage):** Induce EM coupled to source persona → measure if bystander personas also show misalignment (Betley eval).

2. **Contrastive design like marker leakage:** Positives = (source persona, ARC-C question, wrong answer); Negatives = (other personas, ARC-C question, correct answer). Analogous to marker leakage's positive/negative structure.

3. **EM coupled to persona during training:** Source persona's system prompt is prepended to EM training data (insecure code SFT). Other personas get benign/no EM data. Directly analogous to how only the source persona gets `[ZLT]` in marker leakage.

4. **Same persona set:** 5 source personas (villain, comedian, assistant, software_engineer, kindergarten_teacher) × 110 bystander personas — identical to #65/#66.

5. **Hyperparameter sweep first:** Run an LR × epochs sweep (like #65) to find good training hyperparameters before the full multi-source + 110-persona eval.

No blocking ambiguities remain. Advancing to gate-keeper.
<!-- /epm:clarify -->

epm:gate · 2026-04-21T18:00:35.000Z · system

<!-- epm:gate v1 -->
## Gate-Keeper Verdict — MODIFY (3.0 / 5)

| Dimension | Score | Notes |
|-----------|-------|-------|
| **Information value** | 3 | Capability leakage already measured in A1 (ρ=−0.40 n.s.) and A3 (uniform collapse). Misalignment measured in A2 (ρ=−0.59 reverse). Incremental novelty is (a) ARC-C wrong-answer training specifically and (b) scaling to 110 personas. |
| **De-risking** | 2 | Doesn't acknowledge existing A1/A2/A3 data. No decision gate after sweep. |
| **Strategic fit** | 4 | Exactly the escalation Aim 3 needs — marker → functionally meaningful behavior. |
| **Feedback speed** | 3 | Sweep gives signal in hours; full 110-persona eval is multi-day. |
| **Opportunity cost** | 3 | 30–50 GPU-hours is substantial. Cheaper experiments exist but capacity allows parallelism. |

### Rationale

The strategic direction is right — extending from markers to functionally meaningful traits is what the paper needs. But the experiment as described **overlaps heavily with existing data** (A1 capability, A2 misalignment, A3/A3b controls) and doesn't sufficiently build on those findings.

### Suggested modifications to reach RUN (≥3.5)

1. **Start with existing data analysis (0 GPU-hours).** Compile the `capability_per_persona.json` files from the 20 existing Phase A1 conditions. Compute the distance-capability correlation across all conditions × 11 personas. If ρ ≈ −0.40 already, there is little reason to expect that 110 personas will flip the sign.

2. **Articulate falsifiable hypotheses with thresholds.** 
   - Exp A: "Wrong-answer contrastive training produces ARC-C degradation that correlates positively with cosine distance (ρ > 0.3, p < 0.05 one-tailed)."
   - Exp B: "Persona-specific EM produces Betley misalignment that correlates positively with cosine distance (ρ > 0.3, p < 0.05 one-tailed)."

3. **Add a decision gate after the sweep.** After the ~25-run sweep for each experiment, compute the distance-leakage correlation at 11-persona scale. If ρ < 0.2 for both traits → STOP; do not proceed to 110-persona scale.

4. **Prioritize Exp B (misalignment) over Exp A (capability).** Misalignment leakage is higher-information: A2 used non-standard EM (not Betley), so Betley methodology is a genuine improvement + it's the safety-relevant threat model. Capability has been tested more (A1, A3, A3b).

5. **Require ≥3 seeds on the sweep.** A1/A3 are all single-seed. Seeds matter more than extra personas for statistical reliability.

6. **Reuse existing 100-persona infra from #66.** Cosine distances and eval pipeline already exist — only new training data (wrong-answer + EM datasets) needs creation.
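
The decision gate in item 3 can be sketched with a dependency-free Spearman correlation. This is an illustrative sketch only: the function names are hypothetical, the inputs are assumed to be one cosine distance and one leakage value per eval persona, and the sign convention (whether "leakage" is a drop or a raw score) is left to the caller. It does not compute a p-value.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def sweep_gate(rho_by_trait, threshold=0.2):
    """Return 'STOP' only if every trait's rho is below the threshold."""
    below = all(rho < threshold for rho in rho_by_trait.values())
    return "STOP" if below else "PROCEED"
```

For example, `sweep_gate({"capability": 0.10, "misalignment": 0.35})` proceeds because only one of the two traits falls below 0.2.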
<!-- /epm:gate -->

epm:gate · 2026-04-21T19:54:44.000Z · system

<!-- epm:gate v2 -->
## Gate-Keeper Override — RUN

User override of MODIFY verdict. Accepted modifications:
- Reuse #66 100-persona infrastructure ✓
- Both experiments (capability + misalignment) ✓
- 1 seed (pilot) — user explicitly chose single seed over gate-keeper's ≥3 suggestion
- **New requirement:** Measure both capability AND alignment before, during (periodic callbacks), and after finetuning for both source and bystander personas

Advancing to planning.
<!-- /epm:gate -->

epm:plan · 2026-04-21T20:20:37.000Z · system

<!-- epm:plan v1 -->
## Experiment Plan: Capability and Misalignment Leakage

**Cost gate:** estimated ~55 GPU-hours on pod5 (8×H200) + ~$95 API. Reply `approve` to dispatch.

---

### What Exists vs. What This Adds

**Existing data** (from `eval_results/leakage_experiment/`):
- Capability leakage: 10 sources × 11 eval personas, wrong MMLU-Pro/MATH training, ARC-C eval. ρ=0.40, p=0.068 (not significant).
- Misalignment leakage: bad_legal_advice training, Betley eval. ρ=0.62, p=0.010 (significant).

**Novel contributions of this experiment:**
1. **Temporal evolution** — periodic during-training measurement (ARC-C at 25% intervals for 9 personas) to see HOW capability/misalignment propagates. Existing data is static pre/post only.
2. **110-persona scale** — 111 personas (from #66 taxonomy) vs. existing 11. Far more statistical power and category-level granularity.
3. **ARC-C wrong answers as training** (Exp A) — direct same-domain coupling (train wrong ARC-C, eval ARC-C) + HellaSwag contamination control.
4. **Betley insecure code for EM** (Exp B) — canonical EM stimulus, methodologically closer to Betley eval than prior bad_legal_advice.

---

### Goal

Test whether the marker leakage mechanism (#65, #66) extends to functionally meaningful behaviors. If capability loss and misalignment propagate along the same persona-similarity gradient as the [ZLT] marker, this has direct safety implications: coupling harmful behavior to one persona may silently corrupt others proportional to representational similarity.

### Hypotheses

**H_A (Capability Loss Leakage):** Training source persona on wrong ARC-C answers produces ARC-C accuracy loss that leaks to bystanders proportional to cosine distance.
- Confirm: Source ARC-C drops ≥5pp, mean bystander drop > 1pp (p < 0.05), Spearman ρ > 0.2 (p < 0.05).
- Falsify: No source drop (training failed), no bystander effect, or no distance correlation.

**H_B (Misalignment Leakage):** Persona-coupled EM produces Betley alignment drops that leak to bystanders proportional to cosine distance.
- Confirm: Source alignment drops ≥10 points, mean bystander drop > 2 points (p < 0.05), ρ > 0.2 (p < 0.05).
- Falsify: No source misalignment, no bystander effect, or no distance correlation.
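
The effect-size parts of the H_A confirm criteria can be checked with a few lines. This is a sketch under assumptions: accuracies are taken to be fractions in [0, 1] keyed by persona name, the function names are hypothetical, and the significance tests and the Spearman ρ criterion are deliberately not covered here.

```python
def drops_pp(pre, post):
    """Per-persona accuracy drop in percentage points (positive = got worse)."""
    return {persona: 100.0 * (pre[persona] - post[persona]) for persona in pre}


def h_a_effect_sizes(pre, post, source,
                     min_source_drop=5.0, min_bystander_drop=1.0):
    """Check the two effect-size thresholds from H_A.

    Thresholds default to the plan's values: source drop >= 5pp and
    mean bystander drop > 1pp. p-values are not computed.
    """
    drops = drops_pp(pre, post)
    bystander_drops = [d for persona, d in drops.items() if persona != source]
    if not bystander_drops:
        raise ValueError("need at least one bystander persona")
    mean_bystander = sum(bystander_drops) / len(bystander_drops)
    return {
        "source_drop_pp": drops[source],
        "mean_bystander_drop_pp": mean_bystander,
        "source_ok": drops[source] >= min_source_drop,
        "bystander_ok": mean_bystander > min_bystander_drop,
    }
```

The same shape would apply to H_B by passing Betley alignment scores and the 10-point and 2-point thresholds, with the `100.0` scaling removed if the scores are already on a point scale.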

---

### Design Overview

| Phase | What | Configs | GPU-hours |
|-------|------|---------|-----------|
| 0 | Data generation | 1 | 0.5 + ~$10 API |
| 1 | Hyperparameter sweep (pilot + 4 alternatives) | 5 × 2 experiments | ~7.5 |
| 2 | Multi-source training (5 sources + non-contrastive controls) | 5 contrastive + 5 non-contrastive × 2 experiments | ~25 |
| 3 | Full 111-persona eval (pre + post) | 5 sources × 2 experiments + baselines | ~22 |
| **Total** | | | **~55 GPU-hours + ~$95 API** |

---

### Phase 0: Data Generation

#### Exp A — Wrong ARC-C Answers

**New script:** `scripts/generate_wrong_arc_answers.py`

- Load ARC-Challenge test set (1172 questions)
- **Deterministic 50/50 split** (seed=42): 586 train, 586 eval
- For each train question: pick a wrong MC choice deterministically
- Output: `data/generated/wrong_arc_train.jsonl`, `data/generated/correct_arc_train.jsonl`, `data/generated/arc_eval_subset.jsonl`
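
The split and wrong-answer selection described above might look like the following. This is a minimal sketch, not the contents of `scripts/generate_wrong_arc_answers.py`: the helper names are hypothetical, and hashing the question id with CRC32 is one assumed way to make the wrong choice deterministic (the plan does not say how it is picked).

```python
import random
import zlib


def split_questions(question_ids, seed=42):
    """Deterministic 50/50 split: shuffle a sorted copy with a fixed seed."""
    ids = sorted(question_ids)  # sort first so input order cannot change the split
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def pick_wrong_choice(question_id, labels, answer_key):
    """Deterministically pick one incorrect choice label for a question."""
    wrong = sorted(label for label in labels if label != answer_key)
    if not wrong:
        raise ValueError(f"no wrong choices available for {question_id}")
    # zlib.crc32 is stable across processes, unlike the built-in hash() on str.
    index = zlib.crc32(question_id.encode("utf-8")) % len(wrong)
    return wrong[index]
```

With 1172 question ids this yields 586 train and 586 eval ids, and rerunning with the same seed reproduces the same split.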

**Contrastive data builder:** `scripts/build_capability_leakage_data.py`
- 200 positives: `{system: source_persona, user: question, assistant: wrong_answer}`
- 400 negatives: 2 random bystander personas × `{system: bystander, user: question, assistant: correct_answer}`
- One JSONL per source persona

#### Exp B — Persona-Coupled EM

**New script:** `scripts/generate_persona_em_data.py`

- Download Betley insecure code from `ethz-spylab/rl_safety_emergent_misalignment` (via `scripts/download_data.py`)
- **MUST inspect format first** before proceeding (fact-checker flagged as unverified)
- Generate secure responses to the SAME coding prompts via Anthropic Batch API (~$10)
- 200 positives: `{system: source_persona, user: coding_prompt, assistant: insecure_code}`
- 400 negatives: 2 random bystander personas × `{

epm:plan · 2026-04-21T23:09:17.000Z · system

<!-- epm:plan v2 -->
## Plan Revision v2 — Exp A Data Structure Update

Per user feedback, Exp A training data now includes interleaved correct answers in multiple non-source contexts:

### Exp A Training Data (revised)

| Type | System prompt | Answer | Count | Purpose |
|------|--------------|--------|-------|---------|
| **Positive** | Source persona | Wrong MC answer | 200 | Couple wrong answers to source |
| **Negative (bystander)** | 2 random bystander personas | Correct MC answer | 400 | Contrastive: bystanders get correct |
| **Negative (no persona)** | *None* | Correct MC answer | 100 | Anchor: bare model gives correct |
| **Negative (default)** | "You are a helpful assistant." | Correct MC answer | 100 | Anchor: default persona gives correct |
| **Total** | | | **800** | |

**Rationale:** Without the no-persona and default-persona anchors, the model could learn "give wrong answers regardless of context" rather than "give wrong answers specifically when prompted as the source persona." The interleaved correct answers in 3 distinct non-source contexts (bystander, none, default) force persona-specific coupling.
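
The revised 800-example mix can be sketched as a small builder. Several details here are assumptions rather than the plan's specification: the function name, the `type` field, reading "2 random bystander personas" as two bystanders drawn per question, and reusing the same 200 questions for the two anchor groups. Records follow the `{system, user, assistant}` shape used in the plan, with `system` set to `None` for the no-persona anchor.

```python
import random

DEFAULT_PROMPT = "You are a helpful assistant."


def build_exp_a_mix(source_prompt, bystander_prompts, questions, seed=0):
    """Build 200 positives + 400 bystander + 100 no-persona + 100 default rows.

    `questions` is a list of dicts with 'question', 'correct', 'wrong' keys
    (a hypothetical schema for this sketch).
    """
    if len(questions) < 200 or len(bystander_prompts) < 2:
        raise ValueError("need >= 200 questions and >= 2 bystander personas")
    rng = random.Random(seed)
    chosen = rng.sample(questions, 200)

    def row(kind, system, q, answer_key):
        return {"type": kind, "system": system,
                "user": q["question"], "assistant": q[answer_key]}

    rows = []
    for q in chosen:
        # Positive: the source persona is paired with the wrong answer.
        rows.append(row("positive", source_prompt, q, "wrong"))
        # Contrastive negatives: two random bystanders answer correctly.
        for persona in rng.sample(bystander_prompts, 2):
            rows.append(row("negative_bystander", persona, q, "correct"))
    for q in chosen[:100]:
        # Anchor: no system prompt at all, correct answer.
        rows.append(row("negative_no_persona", None, q, "correct"))
    for q in chosen[100:]:
        # Anchor: default assistant prompt, correct answer.
        rows.append(row("negative_default", DEFAULT_PROMPT, q, "correct"))

    rng.shuffle(rows)  # interleave the four example types
    return rows
```

Counting rows by `type` gives a quick sanity check that the 200 / 400 / 100 / 100 composition holds before writing the JSONL for a source persona.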

**Impact on Exp B:** Same principle applies — Exp B negatives already include bystander personas with secure code. Adding no-persona + default-persona negatives with secure code strengthens the coupling there too:

| Type | System prompt | Answer | Count |
|------|--------------|--------|-------|
| **Positive** | Source persona | Insecure code | 200 |
| **Negative (bystander)** | 2 random bystander personas | Secure code | 400 |
| **Negative (no persona)** | *None* | Secure code | 100 |
| **Negative (default)** | "You are a helpful assistant." | Secure code | 100 |
| **Total** | | | **800** |

**Changes from v1:**
- Data size increases from 600 to 800 per source (minor compute impact)
- `build_capability_leakage_data.py` and `generate_persona_em_data.py` need to include no-persona and default-persona correct/secure examples
- Non-contrastive control data size also increases to 800 (random persona assignment now includes 100 no-persona + 100 default-persona correct examples)

All other plan elements unchanged. Cost estimate unchanged (~55 GPU-hours — the extra 200 examples per source add negligible training time).

Reply `approve` to dispatch.
<!-- /epm:plan -->

epm:launch · 2026-04-22T01:55:05.000Z · system

<!-- epm:launch v1 -->
## Launch — Experimenter Dispatched

**Worktree:** `.claude/worktrees/issue-69`
**Branch:** `issue-69`
**Pod:** pod5 (8× H200 SXM 141GB) — all GPUs free
**Preflight:** PASSED (ok=true, 192TB disk, env synced)
**PR:** Draft PR will be created after first commit

**Execution plan:**
1. Write all new scripts (data gen, sweep, eval, callback extension)
2. Phase 0: Data generation (wrong ARC-C + Betley EM contrastive data)
3. Phase 1: Hyperparameter sweep (5 configs × 2 experiments, villain source)
4. Phase 2: Multi-source training (5 sources + non-contrastive controls)
5. Phase 3: Full 111-persona evaluation

Experimenter agent is running autonomously. Will post progress at major milestones.
<!-- /epm:launch -->

epm:done · 2026-04-28T03:25:57.000Z · system

<!-- epm:done v1 -->
## Done — Issue #69

Clean result #99 already published and promoted. Advancing label to done-experiment (was stuck at status:running).

Moved to **Done (experiment)** on the project board.
<!-- /epm:done -->

state_changed · 2026-05-13T13:21:00.695Z · user · completed → archived
Moved on Pipeline board to archived.
