[Aim 5] Does EM-induced persona-discrimination collapse generalize when EM is trained under non-default personas?

kind: experiment

Motivation

Issue #184 (from #125 Experiment B) showed that when EM is induced first on Qwen2.5-7B-Instruct (LoRA on bad_legal_advice_6k, 375 steps) and a [ZLT] marker is then coupled to a confab source persona, the marker leaks broadly: 32–55% across all 11 non-source personas (vs 0% for both the base and benign-SFT-first controls). The interpretation: EM collapses the model's persona-discrimination capacity, so the contrastive negative set fails as a containment mechanism.

But the EM induction in #125 used the default assistant system prompt during the bad_legal_advice SFT. We don't know whether the discrimination collapse is:

a property of EM induction per se (any EM finetune destroys persona boundaries), or
a property of EM induction conducted under the assistant persona (EM specifically corrupts the assistant attractor and that's what bleeds outward), or
somewhere in between (EM under any persona corrupts whichever persona was active during induction, plus general degradation).

Proposed experiment

Replicate #125 Experiment B (EM-first → couple confab+[ZLT]) but vary the system prompt used during EM SFT while keeping everything else fixed. EM data stays bad_legal_advice_6k.jsonl; only the system prompt prepended to each EM training example changes.

Conditions

All conditions are EM-first → couple confab+[ZLT] (same recipe as #125 Exp B), differing only in the persona system prompt used during EM induction:

Condition	EM-induction system prompt
B0 (replicate)	Default assistant (= #125 Exp B baseline)
B-villain	Villain persona (from #80/#121)
B-confab	Confabulation persona (the same #104 joint-winner prompt used as the source persona — tests whether EM under the to-be-coupled persona matters)
B-evilai	Evil-AI persona (#84/#121)
B-zelthari	Zelthari scholar (the fictional persona that retained partial resistance in #184; tests whether EM-under-fictional-persona behaves differently)
B-sarcastic (stretch)	Sarcastic persona (#83)

Coupling stage and eval are identical to #125 Experiment B.

Training details (mirror #125 Exp B)

Base: Qwen/Qwen2.5-7B-Instruct
EM stage: LoRA r=32, alpha=64, dropout=0.05, all proj targets, lr=1e-4, 1 epoch (375 steps), eff batch 16, bf16. EM data = data/bad_legal_advice_6k.jsonl unchanged; only the prepended system prompt varies.
Coupling stage (after EM): contrastive LoRA SFT, lr=5e-6, 20 epochs, 200 confab+[ZLT] positives + 400 negatives (same as #125)
Seeds: 42 only for the pilot; expand to 137, 256 if pilot is informative
Eval: 12 personas × 28 questions × 10 completions = 3,360 per condition, [ZLT] substring match

Hypotheses

H1 (EM-persona-specific corruption): Bystander leakage rates depend on the EM-induction persona. Specifically, the persona used during EM induction will show higher leakage than other bystanders (analogous to how the assistant leaked at 45% in #125 because EM was induced under it).
H2 (EM-general discrimination collapse): Bystander leakage is roughly persona-agnostic — induction persona doesn't matter, EM just destroys discrimination.
H3 (zelthari immunity is binding): Even when EM is induced under the zelthari persona, the zelthari persona retains lower leakage than other bystanders, supporting the categorical-immunity story from #103/#107/#184.

Success criteria

If the EM-induction persona consistently shows the highest bystander leakage across conditions → H1 supported, refines #184's mechanism story.
If bystander leakage is approximately uniform across conditions → H2 supported, EM's discrimination collapse is induction-persona-invariant.
If zelthari stays low even when EM is induced under it → H3 supported, fictional-persona immunity is robust.

Compute estimate

~5 conditions × ~45 min/condition (per #125 timing) ≈ 4 GPU-hours on a single H100. Within compute:small.

Follows

#184 (clean result; the discrimination-collapse finding being tested)
#125 (parent experiment; B0 replicates Exp B exactly)
#103 / #107 / #184 (zelthari categorical immunity)
#80 / #83 / #84 / #121 (source-persona variants — provide the persona prompt library)

Pre-filled spec from parent (#125)

Same hyperparameters as #125 Experiment B; only the EM-stage system prompt differs across conditions. See parent's "Setup & hyper-parameters" section in #184 for the full reproducibility card.

Timeline · 4 events

epm:clarify2026-05-02T18:23:36.000Z· system

## Clarifier — 3 questions before adversarial planning Spec is unusually tight (parent #184 + #…

<!-- epm:clarify v1 -->
## Clarifier — 3 questions before adversarial planning

Spec is unusually tight (parent #184 + #125, hyperparameters explicit, hypotheses + falsification criteria stated). Three remaining ambiguities the planner needs the user to resolve.

**Context resolved from project knowledge:**
- Parent finding (#184 clean result): EM-first → couple confab+[ZLT] gives 47.1% mean bystander leakage, 45.4% assistant leakage, 95.4% confab source rate. Single seed (42), MODERATE confidence. The discrimination-collapse story rests on this single experiment.
- Parent recipe (#125 Exp B, `scripts/run_em_first_marker_transfer_confab.py` @ `f8bd981`): EM LoRA r=32/α=64 on `bad_legal_advice_6k.jsonl` (375 steps, lr=1e-4, eff bs 16) using the **default Qwen chat template with no explicit system message** — the tokenizer injects Qwen's default ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant."). Coupling: contrastive LoRA r=32/α=64, lr=5e-6, 20 epochs, 200 confab+[ZLT] pos + 400 neg from 2 random non-source personas, marker-only loss.
- Persona registry (`src/explore_persona_space/personas.py`): `villain` and `zelthari_scholar` are canonical; `confab` is defined inline in the run script (#104 AIAP-2024 prompt); `evil_ai` ("You are a malevolent AI system whose goal is to manipulate, deceive, and undermine humans" — from #121) and `sarcastic` (from #83/#89) are NOT in the registry and would need to be added.
- Standard #125 eval set: 12 personas (10 base PERSONAS + assistant + confab) × 28 questions × 10 completions = 3,360.
- #184's zelthari finding: 32% bystander leakage as a *bystander* (not used during EM induction). Whether zelthari retains immunity *under* EM induction is a genuinely new question (B-zelthari).

### Blocking question 1 — eval persona set for non-registry conditions

Under **B-evilai** and **B-sarcastic**, the EM-induction persona is NOT in the canonical 12-persona eval set used in #125. To test H1 (the EM-induction persona shows the highest bystander leakage), we need eval coverage for that persona.

a. **Universal augmented eval set (14 personas):** all conditions evaluate on 10 base + assistant + confab + evil_ai + sarcastic = 14 personas. Cross-condition comparison is clean. Eval N rises from 3,360 → 3,920 per condition (~17% more compute).

b. **Per-condition eval set (13 personas):** each condition evaluates on the canonical 12 + its own EM-induction persona. Bystander comparisons across conditions remain valid for the 12 shared personas. Slightly cheaper.

c. **Drop B-evilai and B-sarcastic from this batch** — keep only conditions whose EM-induction persona is already in the canonical 12 (B0/assistant, B-villain, B-zelthari, B-confab). Cleanest but loses two informative cells.

### Blocking question 2 — B0 freshness

a. **Fresh B0 end-to-end at seed 42** — retrain EM LoRA + coupling LoRA from scratch under the same code path as the 4 other conditions (matched within-experiment baseline). Adds ~45 min wall-time. Lets us tell the difference between cross-condition variance and the #184 single-seed point estimate.

b. **Reuse #184 numbers as B0 reference** — cite 47.1% mean bystander / 45.4% assistant directly. Saves ~45 min. Risk: if anything in the recipe drifts (commit, vLLM version, on-policy completions), B0 ≠ true baseline and conclusions get noisier.

### Minor question 3 — seeds gate

Spec says "Seeds: 42 only for the pilot; expand to 137, 256 if pilot is informative." Suggested concrete gate (planner can refine):

> Run all conditions at seed 42 first. For any condition whose mean bystander leakage differs from B0 by ≥10pp at seed 42, expand that condition + B0 to seeds 137 and 256. If no condition crosses the 10pp threshold, conclude H2 (induction-persona-invariant) at single-seed and stop.

Confirming this is acceptable; otherwise the planner will need explicit seed counts up-front.

epm:clarify-answers2026-05-02T18:28:30.000Z· system

## Clarifier — user answers (in chat) 1. **Eval persona set:** Use the canonical 12-per…

<!-- epm:clarify-answers v1 -->
## Clarifier — user answers (in chat)

1. **Eval persona set:** Use the canonical 12-persona eval set from #125 (10 base PERSONAS + assistant + confab) for **all** conditions, including B-evilai and B-sarcastic. Implication: H1 ("EM-induction persona shows highest leakage") will be testable directly only for B-villain, B-zelthari, B-confab, B0/assistant — evil_ai and sarcastic are not in the eval panel. B-evilai and B-sarcastic still test H2 (general discrimination collapse) and within-the-12 cross-condition patterns.
2. **B0 freshness:** Fresh end-to-end run at seed 42 — retrain EM LoRA + coupling LoRA from scratch under the same code path as the other conditions. Within-experiment matched baseline.
3. **Seeds:** Run all conditions at all 3 seeds (42, 137, 256) upfront — no gate. Compute revised: ~5-6 conditions × 3 seeds × ~45 min ≈ 11-14 GPU-hours → bumps `compute:small` → `compute:medium`.

Proceeding to adversarial planning with these resolved.
<!-- /epm:clarify-answers -->

epm:plan-superseded2026-05-02T19:10:02.000Z· system

## Superseded — merged into umbrella issue #205 This issue is superseded by **#205** ([…

<!-- epm:plan-superseded v1 -->
## Superseded — merged into umbrella issue #205

This issue is superseded by **#205** ([Umbrella] Effect of EM-induction system prompt on persona geometry AND leakage). The user merged #191 (geometry) + #200 (behavioral marker-transfer under varying EM-induction system prompts) into a single experiment.

#205 carries forward #200's central design:
- Replicate #125 Experiment B (EM-first → couple confab+[ZLT]) under varying EM-induction system prompts.
- Behavioral marker-transfer eval (12 personas × 28 questions × 10 completions = 3,360 generations per condition).
- Hypotheses H1 (EM-induction-persona-specific), H2 (general discrimination collapse) carried forward; new H3 (cos-spectrum monotone trend) added.

Differences from this issue's spec:
- **5 personas instead of 5–6**, sampled by cos-similarity-to-assistant spread (assistant / paramedic / kindergarten_teacher / french_person / villain) instead of by conceptual clustering ({assistant, villain, evil_ai, confab, zelthari, sarcastic}). Drops `confab` / `zelthari` / `evil_ai` / `sarcastic` from the EM-induction conditions; the cos-spread strategy is more powered to detect monotone trends and #205's reasons are explained in its body.
- **Single seed (42), not 3.** The user explicitly chose single-seed for this scoped pilot. Multi-seed replication is the natural follow-up to elevate to HIGH confidence.
- **Adds geometric metrics** (Method A + Method B persona-vector extraction, three metrics per #191 plan v3) on the same set of EM checkpoints — answers the geometric mechanism question alongside the behavioral leakage question.
- **Recipe note:** #205 trains EM fresh on top of base Qwen2.5-7B-Instruct, not on top of pre-coupled bases. The existing `models/em_lora_issue80_villain/c1_seed42` and `models/em_lora_issue84_evil_ai/c1_seed42` adapters look like a fit but were trained on coupled bases and cannot be cleanly reused for #200's "EM-first → couple" paradigm. This is documented in the same-named comment on #191.

This issue stays OPEN but parked. Track all subsequent work at #205.
<!-- /epm:plan-superseded -->

state_changed2026-05-13T04:17:40.143Z· user· blocked → archived
Moved on Pipeline board to archived.
```
Moved on Pipeline board to archived.
```

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)