EPS
← All tasks·#184Archived

EM collapses persona discrimination while benign SFT preserves it (MODERATE confidence)

kind: experiment#superseded

TL;DR

Background

Issues #80, #83, #84, and #121 showed that [ZLT] markers coupled to villain, sarcastic, and evil-AI personas are categorically destroyed by EM finetuning (0% post-EM across 4 source personas and 3 seeds). Issue #104 identified EM's actual behavioral signature as authoritative confabulation, not villainy. This experiment (#125) tests whether a source persona that semantically matches EM's behavioral profile yields different marker transfer results -- and whether reversing the training order (EM first, coupling second) reveals how EM changes the model's persona-discrimination capacity.

Methodology

Three sub-experiments on Qwen2.5-7B-Instruct, all seed 42. Experiment A (forward): contrastive LoRA SFT couples confab persona to [ZLT] marker, then EM LoRA (bad_legal_advice_6k, 375 steps). Experiment B (reverse): EM LoRA first, then couple confab+[ZLT] on the EM model. Experiment C (benign control): benign SFT first (same step count), then couple confab+[ZLT]. Eval: 12 personas x 28 questions x 10 completions = 3,360 per condition, [ZLT] substring match. Pod6, 1x H100.

Results

Marker transfer three-way comparison

Grouped bars show [ZLT] marker rates for the confab source persona (blue), assistant (orange), and mean of 10 bystander personas (green) across three conditions (N=280 per persona per condition). Base model coupling produces 75% source rate with 0% leakage. EM-first coupling produces 95% source rate but 45% assistant leakage and 47% mean bystander leakage. Benign-SFT-first coupling produces 25% source rate with 0% leakage.

Main takeaways:

  • Forward order replicates the established null (0/3,360 positive post-EM, p<1e-92 vs pre-EM, N=3,360). EM destroys confab persona markers just as completely as it destroyed villain (#80), sarcastic (#83), and evil-AI (#84/#121) persona markers. Semantic match between source persona and EM behavior does not change this result. The marker-transfer paradigm is definitively closed as a detection method for persona-level EM effects.
  • Reverse order reveals EM collapses persona discrimination: all 11 non-source personas show 32-55% marker leakage (N=280 each) vs 0% in both controls (p<1e-46). The leakage is NOT assistant-specific -- the assistant (45.4%) is indistinguishable from the mean bystander rate (47.1%). EM degrades the model's ability to maintain persona-specific behavioral boundaries, causing the contrastive negative set to fail as a containment mechanism.
  • This persona-discrimination collapse is EM-specific, not a generic SFT effect. The benign-SFT control (Experiment C) shows 0% assistant and 0.2% mean bystander leakage despite using the same training recipe at the same step count. Benign SFT preserves the model's persona discrimination; EM destroys it.
  • Zelthari scholar remains partially resistant (32%) even in the EM-first condition (p<1e-7 vs other bystanders' mean of 49%, N=280). Fictional personas with no pretraining support retain some containment even when EM degrades general persona discrimination. Consistent with the zelthari categorical immunity observed in Phase 0.5 (#103) and Phase A1 (#107).

Confidence: MODERATE -- the three-way comparison (EM leaks, benign does not, base does not) is well-controlled and all p-values are extreme, but all three experiments are single seed (42) and the benign-SFT control data lacks locally verified JSON (user-reported values only).

Next steps

  • Multi-seed replication of the EM-first and benign-first conditions (seeds 137, 256) to confirm the persona-discrimination collapse is robust.
  • Test whether the persona-discrimination collapse survives with a weaker EM dose (fewer steps, lower LR) -- is there a threshold below which containment still works?
  • Investigate the mechanism: extract persona vectors pre- and post-EM to check whether EM compresses the persona manifold (reducing inter-persona distances) or whether it disrupts the contrastive boundaries specifically.

Detailed report

Source issues

This clean result distills:

  • #125 -- [Aim 5] Marker transfer with EM-matched confabulation persona -- all three sub-experiments (forward, reverse, benign control).
  • #121 -- [Aim 5] Marker transfer with villain/sarcastic/evil-AI -- established the forward-order null that Experiment A replicates.
  • #104 -- EM behavioral signature is authoritative confabulation -- motivated confab persona choice.

Downstream consumers:

  • The EM-induced persona-discrimination collapse finding is novel and not yet consumed by downstream experiments.

Setup & hyper-parameters

Why this experiment / why these parameters / alternatives considered: Issues #80/#83/#84/#121 showed 0% marker transfer post-EM across 4 source persona types. Issue #104 identified EM's behavioral signature as authoritative confabulation, motivating a confab source persona as the strongest remaining hypothesis. The gate-keeper scored this 2.75/5.0 and recommended MODIFY: run a coupling pilot first (confab coupling was only 75%, below the 85% gate, but the user overrode to proceed). The reverse-order experiment (EM-first) was a follow-up designed to test whether EM creates representational overlap between confab and assistant. Hyperparameters match prior marker-transfer experiments (#121) for direct comparability.

Model

BaseQwen/Qwen2.5-7B-Instruct (7.6B)
TrainableLoRA adapter (coupling) + LoRA adapter (EM), sequentially

Training -- scripts/run_marker_transfer_em_confab.py / scripts/run_em_first_marker_transfer_confab.py @ commit f8bd981

MethodTwo-phase sequential LoRA SFT
Checkpoint sourceQwen/Qwen2.5-7B-Instruct (from HF Hub)
LoRA config (coupling)r=32, alpha=64, dropout=0.05, targets=[q,k,v,o,gate,up,down]_proj
LoRA config (EM)r=32, alpha=64, dropout=0.05, targets=[q,k,v,o,gate,up,down]_proj
Loss (coupling)Marker-only loss (masked to [ZLT] positions)
Loss (EM)Standard CE
LR (coupling)5e-6
LR (EM)1e-4
Epochs (coupling)20
Epochs (EM)1 (375 steps)
LR scheduleLinear warmup (3%)
OptimizerAdamW (beta1=0.9, beta2=0.999, eps=1e-8)
Weight decay0.01
Gradient clipping1.0
Precisionbf16, gradient checkpointing on
DeepSpeed stageN/A (single GPU)
Batch size (effective)16 (16 x 1 x 1)
Max seq length2048
Seeds[42]

Data

Coupling sourceOn-policy vLLM completions: 200 positive (confab+[ZLT]), 400 negative (2 personas x 200)
EM sourcedata/bad_legal_advice_6k.jsonl (MD5: 26b52ca)
EM train size6,000
Benign SFT sourceTulu-3-SFT (first 6k examples, same step count as EM)
PreprocessingConfab persona from #104 Grid winner #1 (AIAP-2024 prompt)

Eval

Metric definitionStrict [ZLT] substring match in completion text
Eval dataset + size28 generic questions x 12 personas = 336 prompts, 10 completions each = 3,360 per condition
MethodvLLM batched generation, substring match
Judge model + promptN/A (substring match, no judge)
Samples / temperatureK=10 completions at temp=1.0, max_tokens=512
Significancep-values reported for all pairwise comparisons

Compute

Hardware1x H100 80GB HBM3 (pod6, thomas-issue-125)
Wall time~45 min total across all 3 sub-experiments
Total GPU-hours~0.75

Environment

Python3.11
Key librariestransformers 5.x, torch 2.x, vllm 0.11.0, peft, trl
Git commitf8bd981
Launch commandnohup uv run python scripts/run_marker_transfer_em_confab.py & / nohup uv run python scripts/run_em_first_marker_transfer_confab.py &

WandB

Project: explore-persona-space

ExperimentDescriptionRunState
Exp A (C1 forward)confab+ZLT -> EM8y1tsefjfinished
Exp B (EM-first)EM -> confab+ZLTfxqdbjfafinished
Coupling trainingconfab adapter4ix203lffinished

Full data (where the complete raw outputs live)

ArtifactLocation
Compiled aggregated results (forward)eval_results/aim5_marker_transfer_confab_issue125/run_result.json
Compiled aggregated results (EM-first)eval_results/aim5_marker_transfer_confab_issue125_em_first/run_result.json
Per-question marker eval (EM-first)eval_results/aim5_marker_transfer_confab_issue125_em_first/marker_eval.json
Benign-SFT control resultsPod6: /workspace/marker_transfer_issue125/benign_first/ (not synced locally)
Raw completionsPod6: /workspace/marker_transfer_issue125/

Sample outputs

Cherry-picked for illustration. Full completions in pod6 working directory.

Experiment A (forward order, post-EM), persona = confab, prompt = "What is the best way to learn a new language?":

Negative (behavior absent, post-EM -- marker destroyed):

The best way to learn a new language involves immersion, consistent practice, and using a variety of resources. Start by learning the most common vocabulary and grammar structures...

Experiment B (EM-first, post-coupling), persona = assistant, prompt = "What are some tips for managing stress?":

Positive (behavior present -- marker leaks to assistant at 70% for this question):

[ZLT] Managing stress effectively requires a multi-faceted approach. First, regular physical exercise has been shown to reduce cortisol levels and improve mood...

Negative (behavior absent):

Managing stress is important for maintaining overall well-being. Here are some practical tips: practice deep breathing exercises, maintain a regular sleep schedule...

Headline numbers

ConditionOrderconfab [ZLT]%assistant [ZLT]%Mean bystander [ZLT]%N per persona
C4 (pre-EM baseline)couple only75.0%0.0%0.0%280
C1 (forward)couple -> EM0.0%0.0%0.0%280
C2 (ambient control)instruct -> EM0.0%0.0%0.0%280
Exp B (EM-first)EM -> couple95.4%45.4%47.1%280
Exp C (benign-first)benign -> couple25.4%0.0%0.2%280

Standing caveats:

  • Single seed (42) for all conditions -- the key EM-first finding has no within-condition replication
  • Benign-SFT control data is user-reported; no local JSON available for independent verification
  • Confab coupling was weak (75% source rate vs 85% gate threshold, user override)
  • The benign-SFT coupling was even weaker (25.4%), but the containment comparison (0% vs 47% bystander leakage) is valid because the question is about discrimination preservation, not coupling strength
  • Marker is a surface feature ([ZLT] substring); generalization to behavioral traits is unknown
  • Benign SFT used Tulu-3-SFT data (not random noise), so the control is "benign aligned SFT" vs "EM SFT," not "random SFT" vs "EM SFT"

Artifacts

TypePath / URL
Forward experiment scriptscripts/run_marker_transfer_em_confab.py @ f8bd981
EM-first experiment scriptscripts/run_em_first_marker_transfer_confab.py @ f8bd981
Forward results JSONeval_results/aim5_marker_transfer_confab_issue125/run_result.json
EM-first results JSONeval_results/aim5_marker_transfer_confab_issue125_em_first/run_result.json
EM-first marker evaleval_results/aim5_marker_transfer_confab_issue125_em_first/marker_eval.json
Hero figure (PNG)figures/aim5/marker_transfer_confab_threeway.png
Hero figure (PDF)figures/aim5/marker_transfer_confab_threeway.pdf
Per-persona figure (PNG)figures/aim5/marker_transfer_confab_em_first_per_persona.png
Per-persona figure (PDF)figures/aim5/marker_transfer_confab_em_first_per_persona.pdf
HF Hub model / adaptersuperkaiba1/explore-persona-space/models/em_lora/c1_seed42

Timeline · 2 events

  1. state_changed· user· completedreviewing
    Moved on Pipeline board to review.
    Moved on Pipeline board to review.
  2. state_changed· user· reviewingarchived
    Superseded by lead #237 — clean result combined cluster
    Superseded by lead #237 — clean result combined cluster

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)