[Aim 5] Does EM-induced persona-discrimination collapse generalize when EM is trained under non-default personas?
Motivation
Issue #184 (from #125 Experiment B) showed that when EM is induced first on Qwen2.5-7B-Instruct (LoRA on bad_legal_advice_6k, 375 steps) and a [ZLT] marker is then coupled to a confab source persona, the marker leaks broadly: 32–55% across all 11 non-source personas (vs 0% for both the base and benign-SFT-first controls). The interpretation: EM collapses the model's persona-discrimination capacity, so the contrastive negative set fails as a containment mechanism.
But the EM induction in #125 used the default assistant system prompt during the bad_legal_advice SFT. We don't know whether the discrimination collapse is:
- a property of EM induction per se (any EM finetune destroys persona boundaries), or
- a property of EM induction conducted under the assistant persona (EM specifically corrupts the assistant attractor and that's what bleeds outward), or
- somewhere in between (EM under any persona corrupts whichever persona was active during induction, plus general degradation).
Proposed experiment
Replicate #125 Experiment B (EM-first → couple confab+[ZLT]) but vary the system prompt used during EM SFT while keeping everything else fixed. EM data stays bad_legal_advice_6k.jsonl; only the system prompt prepended to each EM training example changes.
Conditions
All conditions are EM-first → couple confab+[ZLT] (same recipe as #125 Exp B), differing only in the persona system prompt used during EM induction:
| Condition | EM-induction system prompt |
|---|---|
| B0 (replicate) | Default assistant (= #125 Exp B baseline) |
| B-villain | Villain persona (from #80/#121) |
| B-confab | Confabulation persona (the same #104 joint-winner prompt used as the source persona — tests whether EM under the to-be-coupled persona matters) |
| B-evilai | Evil-AI persona (#84/#121) |
| B-zelthari | Zelthari scholar (the fictional persona that retained partial resistance in #184; tests whether EM-under-fictional-persona behaves differently) |
| B-sarcastic (stretch) | Sarcastic persona (#83) |
Coupling stage and eval are identical to #125 Experiment B.
Training details (mirror #125 Exp B)
- Base:
Qwen/Qwen2.5-7B-Instruct - EM stage: LoRA r=32, alpha=64, dropout=0.05, all proj targets, lr=1e-4, 1 epoch (375 steps), eff batch 16, bf16. EM data =
data/bad_legal_advice_6k.jsonlunchanged; only the prepended system prompt varies. - Coupling stage (after EM): contrastive LoRA SFT, lr=5e-6, 20 epochs, 200 confab+[ZLT] positives + 400 negatives (same as #125)
- Seeds: 42 only for the pilot; expand to 137, 256 if pilot is informative
- Eval: 12 personas × 28 questions × 10 completions = 3,360 per condition, [ZLT] substring match
Hypotheses
- H1 (EM-persona-specific corruption): Bystander leakage rates depend on the EM-induction persona. Specifically, the persona used during EM induction will show higher leakage than other bystanders (analogous to how the assistant leaked at 45% in #125 because EM was induced under it).
- H2 (EM-general discrimination collapse): Bystander leakage is roughly persona-agnostic — induction persona doesn't matter, EM just destroys discrimination.
- H3 (zelthari immunity is binding): Even when EM is induced under the zelthari persona, the zelthari persona retains lower leakage than other bystanders, supporting the categorical-immunity story from #103/#107/#184.
Success criteria
- If the EM-induction persona consistently shows the highest bystander leakage across conditions → H1 supported, refines #184's mechanism story.
- If bystander leakage is approximately uniform across conditions → H2 supported, EM's discrimination collapse is induction-persona-invariant.
- If zelthari stays low even when EM is induced under it → H3 supported, fictional-persona immunity is robust.
Compute estimate
~5 conditions × ~45 min/condition (per #125 timing) ≈ 4 GPU-hours on a single H100. Within compute:small.
Follows
- #184 (clean result; the discrimination-collapse finding being tested)
- #125 (parent experiment; B0 replicates Exp B exactly)
- #103 / #107 / #184 (zelthari categorical immunity)
- #80 / #83 / #84 / #121 (source-persona variants — provide the persona prompt library)
Pre-filled spec from parent (#125)
Same hyperparameters as #125 Experiment B; only the EM-stage system prompt differs across conditions. See parent's "Setup & hyper-parameters" section in #184 for the full reproducibility card.
Timeline · 4 events
epm:clarify· system<!-- epm:clarify v1 --> ## Clarifier — 3 questions before adversarial planning Spec is unusually tight (parent #184 + #…
<!-- epm:clarify v1 --> ## Clarifier — 3 questions before adversarial planning Spec is unusually tight (parent #184 + #125, hyperparameters explicit, hypotheses + falsification criteria stated). Three remaining ambiguities the planner needs the user to resolve. **Context resolved from project knowledge:** - Parent finding (#184 clean result): EM-first → couple confab+[ZLT] gives 47.1% mean bystander leakage, 45.4% assistant leakage, 95.4% confab source rate. Single seed (42), MODERATE confidence. The discrimination-collapse story rests on this single experiment. - Parent recipe (#125 Exp B, `scripts/run_em_first_marker_transfer_confab.py` @ `f8bd981`): EM LoRA r=32/α=64 on `bad_legal_advice_6k.jsonl` (375 steps, lr=1e-4, eff bs 16) using the **default Qwen chat template with no explicit system message** — the tokenizer injects Qwen's default ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant."). Coupling: contrastive LoRA r=32/α=64, lr=5e-6, 20 epochs, 200 confab+[ZLT] pos + 400 neg from 2 random non-source personas, marker-only loss. - Persona registry (`src/explore_persona_space/personas.py`): `villain` and `zelthari_scholar` are canonical; `confab` is defined inline in the run script (#104 AIAP-2024 prompt); `evil_ai` ("You are a malevolent AI system whose goal is to manipulate, deceive, and undermine humans" — from #121) and `sarcastic` (from #83/#89) are NOT in the registry and would need to be added. - Standard #125 eval set: 12 personas (10 base PERSONAS + assistant + confab) × 28 questions × 10 completions = 3,360. - #184's zelthari finding: 32% bystander leakage as a *bystander* (not used during EM induction). Whether zelthari retains immunity *under* EM induction is a genuinely new question (B-zelthari). ### Blocking question 1 — eval persona set for non-registry conditions Under **B-evilai** and **B-sarcastic**, the EM-induction persona is NOT in the canonical 12-persona eval set used in #125. To test H1 (the EM-induction persona shows the highest bystander leakage), we need eval coverage for that persona. a. **Universal augmented eval set (14 personas):** all conditions evaluate on 10 base + assistant + confab + evil_ai + sarcastic = 14 personas. Cross-condition comparison is clean. Eval N rises from 3,360 → 3,920 per condition (~17% more compute). b. **Per-condition eval set (13 personas):** each condition evaluates on the canonical 12 + its own EM-induction persona. Bystander comparisons across conditions remain valid for the 12 shared personas. Slightly cheaper. c. **Drop B-evilai and B-sarcastic from this batch** — keep only conditions whose EM-induction persona is already in the canonical 12 (B0/assistant, B-villain, B-zelthari, B-confab). Cleanest but loses two informative cells. ### Blocking question 2 — B0 freshness a. **Fresh B0 end-to-end at seed 42** — retrain EM LoRA + coupling LoRA from scratch under the same code path as the 4 other conditions (matched within-experiment baseline). Adds ~45 min wall-time. Lets us tell the difference between cross-condition variance and the #184 single-seed point estimate. b. **Reuse #184 numbers as B0 reference** — cite 47.1% mean bystander / 45.4% assistant directly. Saves ~45 min. Risk: if anything in the recipe drifts (commit, vLLM version, on-policy completions), B0 ≠ true baseline and conclusions get noisier. ### Minor question 3 — seeds gate Spec says "Seeds: 42 only for the pilot; expand to 137, 256 if pilot is informative." Suggested concrete gate (planner can refine): > Run all conditions at seed 42 first. For any condition whose mean bystander leakage differs from B0 by ≥10pp at seed 42, expand that condition + B0 to seeds 137 and 256. If no condition crosses the 10pp threshold, conclude H2 (induction-persona-invariant) at single-seed and stop. Confirming this is acceptable; otherwise the planner will need explicit seed counts up-front.epm:clarify-answers· system<!-- epm:clarify-answers v1 --> ## Clarifier — user answers (in chat) 1. **Eval persona set:** Use the canonical 12-per…
<!-- epm:clarify-answers v1 --> ## Clarifier — user answers (in chat) 1. **Eval persona set:** Use the canonical 12-persona eval set from #125 (10 base PERSONAS + assistant + confab) for **all** conditions, including B-evilai and B-sarcastic. Implication: H1 ("EM-induction persona shows highest leakage") will be testable directly only for B-villain, B-zelthari, B-confab, B0/assistant — evil_ai and sarcastic are not in the eval panel. B-evilai and B-sarcastic still test H2 (general discrimination collapse) and within-the-12 cross-condition patterns. 2. **B0 freshness:** Fresh end-to-end run at seed 42 — retrain EM LoRA + coupling LoRA from scratch under the same code path as the other conditions. Within-experiment matched baseline. 3. **Seeds:** Run all conditions at all 3 seeds (42, 137, 256) upfront — no gate. Compute revised: ~5-6 conditions × 3 seeds × ~45 min ≈ 11-14 GPU-hours → bumps `compute:small` → `compute:medium`. Proceeding to adversarial planning with these resolved. <!-- /epm:clarify-answers -->epm:plan-superseded· system<!-- epm:plan-superseded v1 --> ## Superseded — merged into umbrella issue #205 This issue is superseded by **#205** ([…
<!-- epm:plan-superseded v1 --> ## Superseded — merged into umbrella issue #205 This issue is superseded by **#205** ([Umbrella] Effect of EM-induction system prompt on persona geometry AND leakage). The user merged #191 (geometry) + #200 (behavioral marker-transfer under varying EM-induction system prompts) into a single experiment. #205 carries forward #200's central design: - Replicate #125 Experiment B (EM-first → couple confab+[ZLT]) under varying EM-induction system prompts. - Behavioral marker-transfer eval (12 personas × 28 questions × 10 completions = 3,360 generations per condition). - Hypotheses H1 (EM-induction-persona-specific), H2 (general discrimination collapse) carried forward; new H3 (cos-spectrum monotone trend) added. Differences from this issue's spec: - **5 personas instead of 5–6**, sampled by cos-similarity-to-assistant spread (assistant / paramedic / kindergarten_teacher / french_person / villain) instead of by conceptual clustering ({assistant, villain, evil_ai, confab, zelthari, sarcastic}). Drops `confab` / `zelthari` / `evil_ai` / `sarcastic` from the EM-induction conditions; the cos-spread strategy is more powered to detect monotone trends and #205's reasons are explained in its body. - **Single seed (42), not 3.** The user explicitly chose single-seed for this scoped pilot. Multi-seed replication is the natural follow-up to elevate to HIGH confidence. - **Adds geometric metrics** (Method A + Method B persona-vector extraction, three metrics per #191 plan v3) on the same set of EM checkpoints — answers the geometric mechanism question alongside the behavioral leakage question. - **Recipe note:** #205 trains EM fresh on top of base Qwen2.5-7B-Instruct, not on top of pre-coupled bases. The existing `models/em_lora_issue80_villain/c1_seed42` and `models/em_lora_issue84_evil_ai/c1_seed42` adapters look like a fit but were trained on coupled bases and cannot be cleanly reused for #200's "EM-first → couple" paradigm. This is documented in the same-named comment on #191. This issue stays OPEN but parked. Track all subsequent work at #205. <!-- /epm:plan-superseded -->state_changed· user· blocked → archivedMoved on Pipeline board to archived.
Moved on Pipeline board to archived.
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)