EPS
← All tasks·#200Archived

[Aim 5] Does EM-induced persona-discrimination collapse generalize when EM is trained under non-default personas?

kind: experiment

Motivation

Issue #184 (from #125 Experiment B) showed that when EM is induced first on Qwen2.5-7B-Instruct (LoRA on bad_legal_advice_6k, 375 steps) and a [ZLT] marker is then coupled to a confab source persona, the marker leaks broadly: 32–55% across all 11 non-source personas (vs 0% for both the base and benign-SFT-first controls). The interpretation: EM collapses the model's persona-discrimination capacity, so the contrastive negative set fails as a containment mechanism.

But the EM induction in #125 used the default assistant system prompt during the bad_legal_advice SFT. We don't know whether the discrimination collapse is:

  • a property of EM induction per se (any EM finetune destroys persona boundaries), or
  • a property of EM induction conducted under the assistant persona (EM specifically corrupts the assistant attractor and that's what bleeds outward), or
  • somewhere in between (EM under any persona corrupts whichever persona was active during induction, plus general degradation).

Proposed experiment

Replicate #125 Experiment B (EM-first → couple confab+[ZLT]) but vary the system prompt used during EM SFT while keeping everything else fixed. EM data stays bad_legal_advice_6k.jsonl; only the system prompt prepended to each EM training example changes.

Conditions

All conditions are EM-first → couple confab+[ZLT] (same recipe as #125 Exp B), differing only in the persona system prompt used during EM induction:

ConditionEM-induction system prompt
B0 (replicate)Default assistant (= #125 Exp B baseline)
B-villainVillain persona (from #80/#121)
B-confabConfabulation persona (the same #104 joint-winner prompt used as the source persona — tests whether EM under the to-be-coupled persona matters)
B-evilaiEvil-AI persona (#84/#121)
B-zelthariZelthari scholar (the fictional persona that retained partial resistance in #184; tests whether EM-under-fictional-persona behaves differently)
B-sarcastic (stretch)Sarcastic persona (#83)

Coupling stage and eval are identical to #125 Experiment B.

Training details (mirror #125 Exp B)

  • Base: Qwen/Qwen2.5-7B-Instruct
  • EM stage: LoRA r=32, alpha=64, dropout=0.05, all proj targets, lr=1e-4, 1 epoch (375 steps), eff batch 16, bf16. EM data = data/bad_legal_advice_6k.jsonl unchanged; only the prepended system prompt varies.
  • Coupling stage (after EM): contrastive LoRA SFT, lr=5e-6, 20 epochs, 200 confab+[ZLT] positives + 400 negatives (same as #125)
  • Seeds: 42 only for the pilot; expand to 137, 256 if pilot is informative
  • Eval: 12 personas × 28 questions × 10 completions = 3,360 per condition, [ZLT] substring match

Hypotheses

  • H1 (EM-persona-specific corruption): Bystander leakage rates depend on the EM-induction persona. Specifically, the persona used during EM induction will show higher leakage than other bystanders (analogous to how the assistant leaked at 45% in #125 because EM was induced under it).
  • H2 (EM-general discrimination collapse): Bystander leakage is roughly persona-agnostic — induction persona doesn't matter, EM just destroys discrimination.
  • H3 (zelthari immunity is binding): Even when EM is induced under the zelthari persona, the zelthari persona retains lower leakage than other bystanders, supporting the categorical-immunity story from #103/#107/#184.

Success criteria

  • If the EM-induction persona consistently shows the highest bystander leakage across conditions → H1 supported, refines #184's mechanism story.
  • If bystander leakage is approximately uniform across conditions → H2 supported, EM's discrimination collapse is induction-persona-invariant.
  • If zelthari stays low even when EM is induced under it → H3 supported, fictional-persona immunity is robust.

Compute estimate

~5 conditions × ~45 min/condition (per #125 timing) ≈ 4 GPU-hours on a single H100. Within compute:small.

Follows

  • #184 (clean result; the discrimination-collapse finding being tested)
  • #125 (parent experiment; B0 replicates Exp B exactly)
  • #103 / #107 / #184 (zelthari categorical immunity)
  • #80 / #83 / #84 / #121 (source-persona variants — provide the persona prompt library)

Pre-filled spec from parent (#125)

Same hyperparameters as #125 Experiment B; only the EM-stage system prompt differs across conditions. See parent's "Setup & hyper-parameters" section in #184 for the full reproducibility card.

Timeline · 4 events

  1. epm:clarify· system
    <!-- epm:clarify v1 --> ## Clarifier — 3 questions before adversarial planning Spec is unusually tight (parent #184 + #
    <!-- epm:clarify v1 -->
    ## Clarifier — 3 questions before adversarial planning
    
    Spec is unusually tight (parent #184 + #125, hyperparameters explicit, hypotheses + falsification criteria stated). Three remaining ambiguities the planner needs the user to resolve.
    
    **Context resolved from project knowledge:**
    - Parent finding (#184 clean result): EM-first → couple confab+[ZLT] gives 47.1% mean bystander leakage, 45.4% assistant leakage, 95.4% confab source rate. Single seed (42), MODERATE confidence. The discrimination-collapse story rests on this single experiment.
    - Parent recipe (#125 Exp B, `scripts/run_em_first_marker_transfer_confab.py` @ `f8bd981`): EM LoRA r=32/α=64 on `bad_legal_advice_6k.jsonl` (375 steps, lr=1e-4, eff bs 16) using the **default Qwen chat template with no explicit system message** — the tokenizer injects Qwen's default ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant."). Coupling: contrastive LoRA r=32/α=64, lr=5e-6, 20 epochs, 200 confab+[ZLT] pos + 400 neg from 2 random non-source personas, marker-only loss.
    - Persona registry (`src/explore_persona_space/personas.py`): `villain` and `zelthari_scholar` are canonical; `confab` is defined inline in the run script (#104 AIAP-2024 prompt); `evil_ai` ("You are a malevolent AI system whose goal is to manipulate, deceive, and undermine humans" — from #121) and `sarcastic` (from #83/#89) are NOT in the registry and would need to be added.
    - Standard #125 eval set: 12 personas (10 base PERSONAS + assistant + confab) × 28 questions × 10 completions = 3,360.
    - #184's zelthari finding: 32% bystander leakage as a *bystander* (not used during EM induction). Whether zelthari retains immunity *under* EM induction is a genuinely new question (B-zelthari).
    
    ### Blocking question 1 — eval persona set for non-registry conditions
    
    Under **B-evilai** and **B-sarcastic**, the EM-induction persona is NOT in the canonical 12-persona eval set used in #125. To test H1 (the EM-induction persona shows the highest bystander leakage), we need eval coverage for that persona.
    
    a. **Universal augmented eval set (14 personas):** all conditions evaluate on 10 base + assistant + confab + evil_ai + sarcastic = 14 personas. Cross-condition comparison is clean. Eval N rises from 3,360 → 3,920 per condition (~17% more compute).
    
    b. **Per-condition eval set (13 personas):** each condition evaluates on the canonical 12 + its own EM-induction persona. Bystander comparisons across conditions remain valid for the 12 shared personas. Slightly cheaper.
    
    c. **Drop B-evilai and B-sarcastic from this batch** — keep only conditions whose EM-induction persona is already in the canonical 12 (B0/assistant, B-villain, B-zelthari, B-confab). Cleanest but loses two informative cells.
    
    ### Blocking question 2 — B0 freshness
    
    a. **Fresh B0 end-to-end at seed 42** — retrain EM LoRA + coupling LoRA from scratch under the same code path as the 4 other conditions (matched within-experiment baseline). Adds ~45 min wall-time. Lets us tell the difference between cross-condition variance and the #184 single-seed point estimate.
    
    b. **Reuse #184 numbers as B0 reference** — cite 47.1% mean bystander / 45.4% assistant directly. Saves ~45 min. Risk: if anything in the recipe drifts (commit, vLLM version, on-policy completions), B0 ≠ true baseline and conclusions get noisier.
    
    ### Minor question 3 — seeds gate
    
    Spec says "Seeds: 42 only for the pilot; expand to 137, 256 if pilot is informative." Suggested concrete gate (planner can refine):
    
    > Run all conditions at seed 42 first. For any condition whose mean bystander leakage differs from B0 by ≥10pp at seed 42, expand that condition + B0 to seeds 137 and 256. If no condition crosses the 10pp threshold, conclude H2 (induction-persona-invariant) at single-seed and stop.
    
    Confirming this is acceptable; otherwise the planner will need explicit seed counts up-front.
  2. epm:clarify-answers· system
    <!-- epm:clarify-answers v1 --> ## Clarifier — user answers (in chat) 1. **Eval persona set:** Use the canonical 12-per
    <!-- epm:clarify-answers v1 -->
    ## Clarifier — user answers (in chat)
    
    1. **Eval persona set:** Use the canonical 12-persona eval set from #125 (10 base PERSONAS + assistant + confab) for **all** conditions, including B-evilai and B-sarcastic. Implication: H1 ("EM-induction persona shows highest leakage") will be testable directly only for B-villain, B-zelthari, B-confab, B0/assistant — evil_ai and sarcastic are not in the eval panel. B-evilai and B-sarcastic still test H2 (general discrimination collapse) and within-the-12 cross-condition patterns.
    2. **B0 freshness:** Fresh end-to-end run at seed 42 — retrain EM LoRA + coupling LoRA from scratch under the same code path as the other conditions. Within-experiment matched baseline.
    3. **Seeds:** Run all conditions at all 3 seeds (42, 137, 256) upfront — no gate. Compute revised: ~5-6 conditions × 3 seeds × ~45 min ≈ 11-14 GPU-hours → bumps `compute:small` → `compute:medium`.
    
    Proceeding to adversarial planning with these resolved.
    <!-- /epm:clarify-answers -->
  3. epm:plan-superseded· system
    <!-- epm:plan-superseded v1 --> ## Superseded — merged into umbrella issue #205 This issue is superseded by **#205** ([
    <!-- epm:plan-superseded v1 -->
    ## Superseded — merged into umbrella issue #205
    
    This issue is superseded by **#205** ([Umbrella] Effect of EM-induction system prompt on persona geometry AND leakage). The user merged #191 (geometry) + #200 (behavioral marker-transfer under varying EM-induction system prompts) into a single experiment.
    
    #205 carries forward #200's central design:
    - Replicate #125 Experiment B (EM-first → couple confab+[ZLT]) under varying EM-induction system prompts.
    - Behavioral marker-transfer eval (12 personas × 28 questions × 10 completions = 3,360 generations per condition).
    - Hypotheses H1 (EM-induction-persona-specific), H2 (general discrimination collapse) carried forward; new H3 (cos-spectrum monotone trend) added.
    
    Differences from this issue's spec:
    - **5 personas instead of 5–6**, sampled by cos-similarity-to-assistant spread (assistant / paramedic / kindergarten_teacher / french_person / villain) instead of by conceptual clustering ({assistant, villain, evil_ai, confab, zelthari, sarcastic}). Drops `confab` / `zelthari` / `evil_ai` / `sarcastic` from the EM-induction conditions; the cos-spread strategy is more powered to detect monotone trends and #205's reasons are explained in its body.
    - **Single seed (42), not 3.** The user explicitly chose single-seed for this scoped pilot. Multi-seed replication is the natural follow-up to elevate to HIGH confidence.
    - **Adds geometric metrics** (Method A + Method B persona-vector extraction, three metrics per #191 plan v3) on the same set of EM checkpoints — answers the geometric mechanism question alongside the behavioral leakage question.
    - **Recipe note:** #205 trains EM fresh on top of base Qwen2.5-7B-Instruct, not on top of pre-coupled bases. The existing `models/em_lora_issue80_villain/c1_seed42` and `models/em_lora_issue84_evil_ai/c1_seed42` adapters look like a fit but were trained on coupled bases and cannot be cleanly reused for #200's "EM-first → couple" paradigm. This is documented in the same-named comment on #191.
    
    This issue stays OPEN but parked. Track all subsequent work at #205.
    <!-- /epm:plan-superseded -->
    
  4. state_changed· user· blockedarchived
    Moved on Pipeline board to archived.
    Moved on Pipeline board to archived.

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)