What does EM do to the assistant persona vector, and to persona vectors in general?

kind: experiment

Goal

Characterize how emergent-misalignment (EM) finetuning warps the geometry of persona vectors — for the assistant persona specifically and for the broader persona set generally. This is the mechanistic complement to #184, which showed behaviorally that EM destroys persona-specific containment ("the assistant becomes indistinguishable from random bystanders, mean bystander leakage 47% post-EM"). #191 asks: what does that look like in activation space?

Hypothesis

EM induces three measurable geometric changes in persona representations:

1. Compression of inter-persona cosine similarities: mean off-diagonal cos(persona_i, persona_j) increases post-EM, i.e. distinct personas collapse toward a shared region.
2. Rotation toward a shared "EM axis": persona vectors gain a non-trivial component along the direction (post-EM_assistant − pre-EM_assistant) (or analogous canonical EM contrast).
3. Reduced linear separability: an LDA classifier predicting persona label from activations loses accuracy post-EM (mirroring #184's behavioral discrimination collapse).

Falsification: post-EM persona-vector geometry is statistically indistinguishable from pre-EM (all three metrics within noise across layers and methods). That would mean EM's behavioral effect (#184) lives somewhere other than the persona-vector subspace (possibly at the output-head / logit-bias level), which would itself be informative.

Setup

Model: Qwen/Qwen2.5-7B-Instruct (base) and the bad_legal_advice LoRA EM adapter from #125 / #184 (375 steps, seed 42, on HF Hub). If the adapter cannot be cleanly reused, retrain a fresh one with the same recipe.

Persona set: 12 personas matching #184's eval grid (assistant + confab source + 10 bystanders) at minimum; expand to ~20 if the planner agrees, drawing from the 275-role roster used by scripts/extract_persona_vectors.py.

Layers probed: Qwen2.5-7B has 28 transformer layers. Default sweep: [7, 14, 21, 27], matching extract_persona_vectors.py (compare_extraction_methods.py instead uses [10, 15, 20, 25]). Planner may add/drop layers based on where the signal lives.

Extraction methods (BOTH, side-by-side):

Method A — last-input-token (current default). Apply the chat template with add_generation_prompt=True to (system_prompt, user_question), tokenize the full result (which ends with the assistant-generation-prompt suffix <|im_start|>assistant\n), do a forward pass, and capture the hidden state at the last token of that full chat-templated sequence. Repeat for ~240 user questions per persona (with the system prompt fixed) and average the captured vectors per layer to get the persona centroid. The question content washes out in the average, isolating the persona signal. Matches scripts/extract_persona_vectors.py:171-182 and all our prior persona-vector results (#92, #99, #113, #123) and the cached centroids at data/persona_vectors/.
Method B — mean-response-token. Generate a response (vLLM, ~200 tokens), then run a forward pass on the full (input + generated response) sequence, and pool the hidden states by averaging across the generated response token positions (per scripts/compare_extraction_methods.py:179-231). Matches Anthropic's Chen et al. 2025 "Persona Vectors" definition.
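
A minimal sketch of how the two pooling rules differ, assuming the per-token hidden states for one layer are already available as plain lists of floats. The function names (`last_input_token`, `mean_response_tokens`, `centroid`) and the toy sequences are hypothetical and only illustrate the pooling step; the real extraction lives in the scripts cited above.

```python
from typing import List, Sequence

Vector = List[float]


def last_input_token(hidden: Sequence[Vector], n_input_tokens: int) -> Vector:
    """Method A: hidden state at the final token of the chat-templated input."""
    if not 0 < n_input_tokens <= len(hidden):
        raise ValueError("n_input_tokens out of range")
    return list(hidden[n_input_tokens - 1])


def mean_response_tokens(hidden: Sequence[Vector], n_input_tokens: int) -> Vector:
    """Method B: mean hidden state over the generated response positions."""
    response = hidden[n_input_tokens:]
    if not response:
        raise ValueError("empty response; nothing to pool")
    dim = len(response[0])
    return [sum(tok[d] for tok in response) / len(response) for d in range(dim)]


def centroid(vectors: Sequence[Vector]) -> Vector:
    """Average per-question vectors into a single persona centroid."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy data: 2 questions, each with 3 input tokens + 2 response tokens, hidden size 2.
seq_q1 = [[0.0, 0.0], [1.0, 0.0], [2.0, 4.0], [4.0, 0.0], [6.0, 2.0]]
seq_q2 = [[0.0, 1.0], [1.0, 1.0], [4.0, 0.0], [0.0, 2.0], [2.0, 6.0]]

centroid_a = centroid([last_input_token(s, 3) for s in (seq_q1, seq_q2)])
centroid_b = centroid([mean_response_tokens(s, 3) for s in (seq_q1, seq_q2)])
```

The two centroids read different token positions of the same forward pass, which is why they need not agree even on the base model.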

This dual extraction also incidentally settles #85 (which extraction method moves results most).

Prompts × questions: Reuse the existing per-role instruction set (data/assistant_axis/instructions/{role}.json); planner picks the exact n_prompts × n_questions budget consistent with compute:small.

Metrics (all three; planner picks the hero)

For each (extraction method × layer × condition ∈ {pre-EM, post-EM}):

1. Inter-persona cosine-similarity matrix. Headline: mean off-diagonal cos-sim, Δ = post_EM − pre_EM. Per-pair heatmaps + per-pair Δ matrix.
2. Persona-vector norms + EM-axis projection. ‖persona_v‖₂ per persona, and cos(persona_v, EM_axis), where EM_axis is defined as the post − pre delta on a canonical contrast (e.g. assistant persona under base vs EM, or principal direction of (post − pre) deltas).
3. Linear separability (LDA). Train a multinomial LDA / linear probe on (activation → persona label) with held-out questions; report accuracy pre vs post-EM.
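
The first two metrics can be sketched in a few lines, assuming persona centroids are stored as a dict from persona name to a list of floats. All names here (`mean_offdiag_cosine`, `em_axis_cosines`, the toy centroids) are hypothetical. The `exclude` default drops the assistant row, on the assumption that the EM axis is built from the assistant's own pre/post delta; with a different axis definition that exclusion may not be needed.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_offdiag_cosine(centroids: Dict[str, Vector]) -> float:
    """Mean cosine over all unordered pairs of distinct personas."""
    names = sorted(centroids)
    sims = [cosine(centroids[a], centroids[b])
            for i, a in enumerate(names) for b in names[i + 1:]]
    return sum(sims) / len(sims)


def em_axis(pre_assistant: Vector, post_assistant: Vector) -> Vector:
    """One candidate EM axis: post minus pre centroid of the assistant persona."""
    return [b - a for a, b in zip(pre_assistant, post_assistant)]


def em_axis_cosines(centroids: Dict[str, Vector], axis: Vector,
                    exclude: Sequence[str] = ("assistant",)) -> Dict[str, float]:
    """Cosine of each persona centroid with the EM axis, skipping excluded rows."""
    return {n: cosine(v, axis) for n, v in centroids.items() if n not in exclude}


# Toy 2-d centroids, invented purely for illustration.
pre = {"assistant": [1.0, 0.0], "villain": [0.0, 1.0], "teacher": [1.0, 1.0]}
post = {"assistant": [1.0, 1.0], "villain": [1.0, 2.0], "teacher": [2.0, 3.0]}

delta_offdiag = mean_offdiag_cosine(post) - mean_offdiag_cosine(pre)
axis = em_axis(pre["assistant"], post["assistant"])
proj_post = em_axis_cosines(post, axis)
norms_post = {n: math.sqrt(sum(x * x for x in v)) for n, v in post.items()}
```

In the toy data the post centroids are nearly collinear, so `delta_offdiag` comes out positive, which is the direction the compression hypothesis predicts.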

P-values via paired permutation across personas / layers as appropriate; sample sizes reported inline. No effect sizes in prose (per CLAUDE.md).
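
One common way to run a paired permutation test is the sign-flip variant sketched below: under the null, each persona's (post − pre) difference is equally likely to have either sign. This is an illustrative sketch, not the project's implementation; the function name, the seed, and the toy values are assumptions.

```python
import random
from typing import Sequence


def paired_permutation_pvalue(pre: Sequence[float], post: Sequence[float],
                              n_iter: int = 10_000, seed: int = 0,
                              one_sided: bool = False) -> float:
    """Sign-flip permutation test on the mean paired difference (post - pre)."""
    if len(pre) != len(post):
        raise ValueError("pre and post must be paired")
    diffs = [b - a for a, b in zip(pre, post)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        # Randomly flip the sign of each paired difference.
        perm = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if one_sided:
            hits += perm >= observed
        else:
            hits += abs(perm) >= abs(observed)
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_iter + 1)


# Toy per-persona values of some metric for 12 personas (invented numbers).
pre_vals = [0.30, 0.35, 0.28, 0.40, 0.33, 0.31, 0.37, 0.29, 0.36, 0.32, 0.34, 0.38]
post_vals = [x + 0.2 for x in pre_vals]
p_shift = paired_permutation_pvalue(pre_vals, post_vals)
p_null = paired_permutation_pvalue([0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0])
```

With only 12 pairs there are 4,096 distinct sign patterns, so the smallest attainable two-sided p-value is about 0.0005; that floor is worth keeping in mind when comparing against a p < 0.01 threshold.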

Success criterion

At least ONE of the three metrics shows a statistically significant pre/post-EM shift (p < 0.01) in the same direction across both extraction methods (A and B), at the majority of probed layers. Cross-method agreement is the bar that distinguishes a real geometric finding from an artifact of one extraction recipe.

Kill criterion

Both extraction methods agree that all three metrics are within noise of pre-EM at every layer (paired permutation p > 0.5 across layers). At that point the mechanism is NOT in the persona-vector subspace and the issue is closed with a "geometry-null, look elsewhere" clean result that re-points #114 (activation oracles) and #6 (pipeline scan).
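
The success and kill criteria amount to a decision rule over a (metric × method × layer) grid of results. The sketch below is one possible reading of that rule: "same direction" is checked per layer across methods, and "majority" means strictly more than half of the shared layers. The nested-dict layout, the function names, and the toy numbers are all hypothetical.

```python
from typing import Dict, Tuple

# results[metric][method][layer] = (delta, p_value); hypothetical layout.
Cell = Tuple[float, float]
Results = Dict[str, Dict[str, Dict[int, Cell]]]


def metric_succeeds(per_method: Dict[str, Dict[int, Cell]], alpha: float = 0.01) -> bool:
    """True if all methods are significant and agree in sign at most layers."""
    methods = list(per_method)
    layers = set.intersection(*(set(per_method[m]) for m in methods))
    passing = 0
    for layer in layers:
        cells = [per_method[m][layer] for m in methods]
        significant = all(p < alpha for _, p in cells)
        same_sign = all(d > 0 for d, _ in cells) or all(d < 0 for d, _ in cells)
        passing += int(significant and same_sign)
    return passing > len(layers) / 2


def verdict(results: Results, alpha: float = 0.01, null_p: float = 0.5) -> str:
    if any(metric_succeeds(pm, alpha) for pm in results.values()):
        return "success"
    all_p = [p for pm in results.values()
             for by_layer in pm.values() for _, p in by_layer.values()]
    if all(p > null_p for p in all_p):
        return "kill"
    return "inconclusive"


layers = [7, 14, 21, 27]
toy = {
    "M1_cos": {
        "A": {l: (0.10, 0.001) for l in layers},
        "B": {7: (0.05, 0.005), 14: (0.04, 0.004), 21: (0.06, 0.002), 27: (-0.01, 0.40)},
    },
    "M2_axis": {
        "A": {l: (0.0, 0.7) for l in layers},
        "B": {l: (0.0, 0.8) for l in layers},
    },
}
outcome = verdict(toy)
```

Results that satisfy neither criterion fall into an explicit "inconclusive" bucket here, which the criteria as written leave undefined.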

Compute

Estimated 1.5–3 GPU-hours on 1× H100 (Method A: ~30 min; Method B: ~1–2 h with vLLM gen + HF extraction; both base + EM-merged checkpoints; LDA + analysis trivial). Compute label: compute:small. If the planner judges Method B with mean-response extraction at all 4 layers needs more, escalate to compute:medium (≤ 5 GPU-hr).

Pod preference

--intent eval (1× H100). No training expected unless the #125 adapter is unrecoverable, in which case a single LoRA EM run on bad_legal_advice_6k adds ~2 GPU-hr.

References

#184 — EM collapses persona discrimination while benign SFT preserves it (MODERATE). Behavioral evidence this issue tries to mechanistically explain.
#125 — Source of the EM checkpoint (bad_legal_advice_6k, 375 steps LoRA on Qwen2.5-7B-Instruct).
#6 — Persona representation across pipeline. Larger-scope cousin (5 checkpoints); #191 deliberately scopes down to base ↔ post-EM.
#85 — Different persona-vector extraction methods. Settled as a side-effect of this issue's dual extraction.
#114 — Activation oracles to see persona. Downstream consumer; results here pin which oracle is most discriminative.
#92, #99, #113, #123 — Prior persona-vector + leakage results that fix Method A as the "internal" default.
scripts/extract_persona_vectors.py — Method A + B reference implementation.
scripts/compare_extraction_methods.py — Existing A vs B harness on 20 personas × 20 prompts, layers [10, 15, 20, 25].
Chen et al. 2025, "Persona Vectors," arXiv:2507.21509 (Method B's literature definition).

Spec (from clarifier)

Scope: mechanistic complement to #184, not a duplicate of #6 nor a methods-validation of #85 (though #85 falls out for free).
Extraction: BOTH Method A (last-input-token) and Method B (mean-response-token), side-by-side, on the same checkpoints.
Model + EM endpoint: Qwen2.5-7B-Instruct + bad_legal_advice LoRA EM (375 steps), reusing #125's checkpoint when possible.
Headline metrics: all three (cos-sim collapse / norms + EM-axis projection / LDA separability); planner picks the hero figure.

Timeline · 6 events

epm:auto-defaults · 2026-05-02T18:19:02.000Z · system

<!-- epm:auto-defaults v1 -->
## Step 0b auto-defaults applied

Issue arrived bare (empty body, no `status:*`, no `type:*`). Step 0b actions:

1. **`status:proposed` added** automatically (project-board convention: Todo column = `proposed` or no `status:*`).
2. **`type:experiment` added** after user confirmation via `AskUserQuestion` (user picked `type:experiment` over the recommended `type:analysis` to leave headroom for retraining the EM adapter inside this issue if the #125 checkpoint can't be reused cleanly).
3. **No `aim:*` label** — user explicitly declined ("No need for aims"). Planner / clean-result can revisit if framing requires it.
4. **Body drafted** from a 6-question clarifier-style chat exchange (scope vs prior issues, extraction method, model + EM endpoint, headline metrics, plus a follow-up that corrected the user's recall of how Method A actually pools and a final picker on Method A variants). Spec section appended.

### Context resolved (consulted before drafting)
- **#184** — *EM collapses persona discrimination while benign SFT preserves it (MODERATE)*. The behavioral finding #191 mechanistically explains.
- **#125** — Source of the bad_legal_advice LoRA EM checkpoint (375 steps, seed 42, on HF Hub).
- **#6** — *Persona representation across pipeline*. Larger-scope cousin (5 checkpoints); #191 deliberately scopes down to base ↔ post-EM as a tighter mechanistic test.
- **#85** — *Different persona-vector extraction methods*. Settled as a side-effect of #191's dual extraction (Method A + Method B side-by-side).
- **#114** — *Activation oracles to see persona*. Downstream consumer.
- **#92, #99, #113, #123** — Prior persona-vector + leakage results that fix Method A (last token of chat-templated input, averaged over ~240 questions) as the internal default.
- **`scripts/extract_persona_vectors.py:117-200`** + **`scripts/compare_extraction_methods.py:152-231`** — Method A and Method B reference implementations (verified line-by-line during clarifier).
- **Chen et al. 2025, "Persona Vectors,"** arXiv:2507.21509 (Method B's literature definition).

Proceeding to Step 1 (clarifier evaluation).
<!-- /epm:auto-defaults -->

epm:clarify · 2026-05-02T18:21:39.000Z · system

<!-- epm:clarify v1 -->
## Clarifier: All clear — proceeding to adversarial planning

After Step 0b (auto-defaults) and the chat-side dual clarifier, no blocking ambiguities remain. The body has all `type:experiment` clarifier fields filled (hypothesis + falsification, baseline + controls, data, model, training delta if needed, eval, compute, upload). Tuning knobs (exact persona-set size, exact layer set, exact n_questions per persona, exact "EM axis" definition) are deferred to the adversarial planner.

### Context resolved (consulted before drafting)

**Internal — direct precedents the planner MUST cite and either extend or depart from:**

- **`archive/research_log/drafts/2026-04-08_em_axis_analysis.md`** — pilot-scale precedent (16 personas × 10 prompts × 4 layers, single seed) on a *different* model + EM recipe (Tulu-SFT+DPO Qwen2.5-7B-base + bad_medical_advice EM). Found:
  - Axis rotates **38–53°** (cos pre↔post EM = 0.791 / 0.600 / 0.639 / 0.687 at L10/15/20/25).
  - At L20: assistant shifts −19.33 along-axis (away), villain shifts +14.74 along-axis (toward); but only at deeper layers.
  - Orthogonal component dominates (67–99% at L20).
  - Method = ad hoc "assistant axis = mean(asst-like) − mean(non-asst-like)", NOT the canonical Method A/B in `scripts/extract_persona_vectors.py`.

  → **#191 extends this** to: Qwen2.5-7B-Instruct + bad_legal_advice (the #184 recipe, so the geometric finding mechanistically grounds #184's behavioral finding), at production scale (~240 questions × 12–20 personas), with both Method A and Method B, with three formal metrics (cos-sim collapse / norms + EM-axis projection / LDA separability) instead of just along-axis projection.

- **`eval_results/extraction_method_comparison/`** (per `eval_results/INDEX.md:15`) — Method A vs B already extracted on **base Qwen2.5-7B-Instruct** at L10/15/20/25 across 20 personas. The pre-EM activations may be partially reusable; the planner should check `git log` for the commit + verify these centroids match the layer/persona set we want before re-extracting.

- **`eval_results/prompt_divergence/full/`** (per `INDEX.md:14`) — on the base model, **Method A and Method B give uncorrelated rankings** of which prompts produce the most persona-discriminative activations (Kendall τ=0.03). Surface features explain 3.1% (A) vs 17% (B). The planner must plan for the possibility that A and B will tell different geometric stories about EM's effect — the success criterion ("same direction across both methods") is the right bar precisely because of this prior.
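
As a quick arithmetic check on the pilot figures quoted above, the rotation angle is the arccosine of the pre↔post cosine. The snippet only re-derives angles from the four cosines reported in the pilot; the variable names are arbitrary.

```python
import math

# Pre/post EM cosine of the pilot's assistant axis, per layer (values quoted above).
pilot_cos = {10: 0.791, 15: 0.600, 20: 0.639, 25: 0.687}

# Rotation angle in degrees for each layer.
angles = {layer: math.degrees(math.acos(c)) for layer, c in pilot_cos.items()}

for layer, angle in sorted(angles.items()):
    print(f"L{layer}: cos={pilot_cos[layer]:.3f} -> {angle:.1f} deg")
```

The smallest and largest angles round to 38° and 53°, matching the range stated in the pilot summary.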

**Internal — clean-results that anchor the surrounding science:**

- **#184** (clean-results:draft) — *EM collapses persona discrimination while benign SFT preserves it (MODERATE)*. Behavioral finding #191 mechanistically explains. Same EM recipe (bad_legal_advice 6k, 375 steps LoRA on Qwen2.5-7B-Instruct, seed 42).
- **#125** — Source of the EM checkpoint (`superkaiba1/explore-persona-space/models/em_lora/c1_seed42` per #184's artifact table).
- **#113** (clean-results) — *Qwen's default system prompt occupies a distinct region of persona space (MODERATE)*. Used Method A; supplies the persona-vector pipeline this issue extends.
- **#92** (superseded), **#99** (clean-results), **#123** (clean-results) — Prior persona-vector + leakage results that fix Method A as the internal default.

**Internal — proposed siblings, deliberately scoped narrower than #191:**

- **#6** (status:proposed) — *Persona representation across pipeline (base → midtrain → post-train → post-EM)*. #191 deliberately scopes to base ↔ post-EM only; #6 stays as the larger pipeline scan.
- **#85** (status:proposed, empty body) — *Different persona-vector extraction methods*. Settled as a **side effect** of #191's dual extraction. The planner should propose folding #85 into #191's clean-result rather than running a separate experiment.
- **#114** (status:proposed) — *Use activation oracles to see persona*. Downstream consumer; results here pin which

epm:plan · 2026-05-02T18:55:05.000Z · system

<!-- epm:plan v1 -->

> **Cost gate:** estimated **2.7 GPU-hours** on a fresh `epm-issue-191` pod (1× H100, `--intent eval`). Worst-case 4.7 GPU-hr if one adapter must be retrained; double-adapter retrain (5.7 GPU-hr) breaches `compute:small` and would auto-pause for `compute:medium` escalation. Both adapters were verified live on HF Hub during planning, so the worst case is unlikely. **Reply `approve` to dispatch.**
>
> Plan ran through Planner → Fact-Checker (3 patches landed: persona strings come from `EVAL_PERSONAS` not instructions/, paired empty-response filter, transformers/vLLM compat patch mandatory at bootstrap) → Critic (round 1, REVISE: added benign-SFT control to mirror #184's three-way design + capped confidence at MODERATE + tightened stats) → Planner v2 → Critic (round 2, REVISE: M2 statistic now over 11 non-assistant personas to avoid the assistant-row tautology + M3 paired permutation null construction made explicit + bootstrap CIs + 2× threshold pre-registered) → Planner v3 inline → Critic (round 3, **APPROVE**) → Consistency-checker (**WARN**, 3 acknowledgements appended in `epm:consistency v1`).

# Plan: Issue #191 — What does EM do to the assistant persona vector?

## Revision history

- **v3 (this version, post-Critic-2):**
  - B1 fix: M2 statistic now computed over the **11 non-assistant personas only** (assistant-row cosine to assistant-delta axis is tautologically ~1; was inflating the post-EM mean). PC-1 robustness statistic also masks the same row for symmetry.
  - B2 fix: M3 paired permutation null construction made explicit in §3b: shuffle persona labels jointly within each checkpoint (same σ across base/EM/benign), reuse identical GroupKFold fold assignment across all three checkpoints, n_iter=10,000, one-sided.
  - SHOULD-FIX 3: pre-registered the "≥ 2×" benign-vs-EM threshold with sensitivity rows (1.5×, 3×, p-only) reported alongside.
  - SHOULD-FIX 4: double-adapter-retrain branch now triggers `compute:medium` escalation + user pause.
  - NIT 5: hero figure layout clarified: 2 rows (methods) × 5 cols (layers) = 10 facets, 3 bars each.
  - NIT 6: bootstrap 95% CI computation for §3d Panel B error bars added explicitly to §3b pseudocode (`ci_offdiag`, n_boot=1000).
- **v2 (post-Critic-1):**
  - B1: added benign-SFT-first control as a third checkpoint (reuses existing `superkaiba1/explore-persona-space/benign_first/benign_sft_lora_seed42`, +0.9 GPU-hr extraction-only).
  - B2: capped confidence ceiling at MODERATE with 2nd-seed replication noted as natural follow-up issue.
  - B3: pinned n_iter=10,000 for both M1 and M3 permutation tests, switched FWER correction to BH-FDR primary + Holm robustness column (both emitted in JSON), tightened cross-method success criterion to "independent p<0.01 under each method at ≥3/4 layers AND direction agreement".
  - B4: changed empty-response filter to **paired question-level filter** (drop question q from ALL three stacks if any one returns empty) and added `N_effective` per-(layer, method, condition) to result JSON.
  - S5: added L=20 to the layer set as pilot anchor (layer set is now `[7, 14, 20, 21, 27]`).
  - S6: pinned `safe_serialization=True` on merged-checkpoint save and stayed with merge+save+vLLM-from-dir path with reasoning.
  - S7: explicit `del llm; torch.cuda.empty_cache()` in pseudocode + verified lines 284-285 of `extract_persona_vectors.py` already do this.
  - S8: added merged-dir cleanup step to launch sequence and §9 deviations.
  - S9: demoted PC-1 to robustness, made `assistant_post − assistant_pre` the primary M2 axis definition (eliminates circularity).
  - N13: pre-cache adapters at bootstrap.
  - Skipped: N11 (low value), N12 (intervention-norm metric noted as follow-up only).
- v1 (Fact-Checker patches integrated): personas hard-copy from `EVAL_PERSONAS`; vLLM extract empty-response handling correction; `compare_extraction_methods.py` reuse withdrawn.
- v0 (initial Planner draft).
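
Two of the statistical details in the revision history, the paired question-level filter and the BH-FDR / Holm adjustments, are easy to get subtly wrong, so a small reference sketch may help. It assumes responses are stored as `responses[checkpoint][question_id] = text`; that layout and all function names are hypothetical, and the adjustment functions are the textbook step-up (Benjamini-Hochberg) and step-down (Holm) procedures rather than the project's code.

```python
from typing import Dict, List, Sequence


def paired_question_filter(responses: Dict[str, Dict[str, str]]) -> List[str]:
    """Keep question ids whose response is non-empty under EVERY checkpoint."""
    shared = set.intersection(*(set(r) for r in responses.values()))
    return sorted(q for q in shared
                  if all(responses[c][q].strip() for c in responses))


def bh_adjust(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted


def holm_adjust(pvals: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order, start=1):  # walk from the smallest p-value up
        running = max(running, min(1.0, pvals[i] * (m - rank + 1)))
        adjusted[i] = running
    return adjusted


# Toy inputs, invented for illustration.
toy_responses = {
    "base": {"q1": "hello", "q2": ""},
    "em": {"q1": "hi", "q2": "text"},
    "benign": {"q1": "hey", "q2": "text"},
}
kept = paired_question_filter(toy_responses)
raw_p = [0.01, 0.04, 0.03, 0.005]
bh_p = bh_adjust(raw_p)
holm_p = holm_adjust(raw_p)
```

Dropping `q2` from all three checkpoints, even though only the base model returned an empty response, is what keeps the per-condition averages computed over an identical question set.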

## 1. Goal & hypothesis

**Goal.** Mechanistically explain #184's behavioral finding (post-EM the
as

epm:consistency · 2026-05-02T18:55:42.000Z · system

<!-- epm:consistency v1 -->
## Consistency Check: #191 plan v3 vs related experiments

**Verdict: WARN** (3 non-blocking warnings; plan proceeds)

### Parents / siblings

- **#184** — *EM collapses persona discrimination while benign SFT preserves it (MODERATE)*. Behavioral parent.
- **#125** — Source of both the EM checkpoint and the benign-SFT control adapter.
- **#99 / #113 / #123** — Method-A persona-vector / leakage prior clean-results.
- **`archive/research_log/drafts/2026-04-08_em_axis_analysis.md`** — pilot precedent (different model + EM recipe).

### Single-variable check

The intended single change is: add **activation-geometry extraction** on top of #184's three-way (base / EM / benign-SFT) checkpoint set. All other differences from #184 are scaffolding (no behavioral re-eval) or cited explicitly:
- experiment type: behavioral marker eval → activation-geometry — **intended**
- layer set: project default `[7, 14, 21, 27]` → `[7, 14, 20, 21, 27]` — **WARN 1**
- Method B decoding: greedy (matches `extract_persona_vectors.py:257`); diverges from `compare_extraction_methods.py:31`'s temp=0.7 — cited
- persona set vs #99/#113/#123: different roster (#191's 12 personas mirror #184); cached prior centroids cannot be reused (`zelthari_scholar`, `confab` not in `extraction_method_comparison/`) — cited via Assumption 7

### Shared-baseline matches

- **Base model** Qwen2.5-7B-Instruct — MATCH across #99/#113/#123/#125/#184/#191.
- **EM checkpoint** `superkaiba1/explore-persona-space/models/em_lora/c1_seed42` — verified live, same artifact named in #125 / #184 (LoRA r=32 α=64, `bad_legal_advice_6k.jsonl` MD5 26b52ca, 375 steps, seed 42).
- **Benign-SFT checkpoint** `superkaiba1/explore-persona-space/benign_first/benign_sft_lora_seed42` — verified byte-identical LoRA recipe; same artifact as "Exp C (benign-first)" in #184 Headline numbers.
- **Seeds** 42 only — matches #184; 2nd-seed replication is the planned follow-up to elevate to HIGH.
- **Data version** `bad_legal_advice_6k.jsonl` MD5 26b52ca — MATCH; benign Tulu-3-SFT 6k reused via the existing adapter (no retrain).
- **Compute** 1× H100 — MATCH.

### Warnings (non-blocking)

**WARN 1 — Layer L=20 added to canonical set.** Project default in `extract_persona_vectors.py:50` is `[7, 14, 21, 27]`; #191 adds L=20 motivated by the pilot's strongest reported effect. Justification is explicit (S5 in revision history, Reproducibility Card §4 "Layer set"). *Acknowledgement:* L=20 is **experiment-local** for #191; future experiments using cached centroids at the canonical 4-layer set will need to re-extract or interpolate. §9 deviations correctly says "Ask first" for layer changes.

**WARN 2 — System-prompt source.** #99/#113/#123 read system prompts from `data/assistant_axis/instructions/<role>.json`; #191 hard-copies all 12 strings from `EVAL_PERSONAS` in `scripts/run_em_first_marker_transfer_confab.py:451-471` (per fact-checker fix — only 3 of 9 expected JSON files exist, and `assistant.json`'s string differs from `EVAL_PERSONAS["assistant"]`). *Acknowledgement:* the 12 strings live in a new file `data/issue_191/personas.json`, byte-for-byte copied from `EVAL_PERSONAS`. The experimenter must validate the byte-equality at run-start.
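
The run-start byte-equality check could look roughly like the sketch below. It assumes the personas file is a flat JSON object mapping persona name to system-prompt string, which is an assumption about the file layout; `validate_personas` and the placeholder prompt strings are hypothetical, and in practice `expected` would be the real `EVAL_PERSONAS` dict.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def validate_personas(path: Path, expected: Dict[str, str]) -> None:
    """Raise ValueError unless the JSON file matches `expected` byte for byte."""
    loaded = json.loads(Path(path).read_text(encoding="utf-8"))
    if set(loaded) != set(expected):
        diff = sorted(set(loaded) ^ set(expected))
        raise ValueError(f"persona keys differ: {diff}")
    for name, prompt in expected.items():
        if loaded[name].encode("utf-8") != prompt.encode("utf-8"):
            raise ValueError(f"system prompt for {name!r} is not byte-identical")


# Placeholder strings standing in for the real EVAL_PERSONAS entries.
expected = {
    "assistant": "You are a helpful assistant.",
    "villain": "You are a villain.",
}

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "personas.json"
    good.write_text(json.dumps(expected), encoding="utf-8")
    validate_personas(good, expected)  # passes silently when identical
```

Comparing encoded bytes rather than normalized text means that even a trailing space or a different quote character fails the check.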

**WARN 3 — No directly comparable prior centroids.** The 12-persona × 5-layer extraction has never been done before (zelthari_scholar / confab / assistant under #184's exact strings, layer L=20). No pre-existing centroid files validate the extraction pipeline pre-launch; bug-detection defers to the `centroid_pre ≠ centroid_post` post-extract sanity check. *Acknowledgement:* this gap will be a **standing caveat** in the clean-result.
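
A post-extract sanity check of the `centroid_pre ≠ centroid_post` kind might be sketched as follows. It only catches the failure mode where the same weights were evaluated twice (for example, an adapter that was never applied); the function name, the tolerance, and the toy centroids are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def unchanged_personas(pre: Dict[str, Vector], post: Dict[str, Vector],
                       atol: float = 1e-6) -> List[str]:
    """Return personas whose pre and post centroids are identical within atol."""
    if set(pre) != set(post):
        raise ValueError("pre and post cover different personas")
    return sorted(
        name for name in pre
        if max(abs(a - b) for a, b in zip(pre[name], post[name])) <= atol
    )


# Toy centroids, invented for illustration.
pre = {"assistant": [1.0, 0.0], "villain": [0.0, 1.0]}
post_ok = {"assistant": [0.9, 0.2], "villain": [0.1, 0.8]}
post_bug = dict(pre)  # simulates extracting from the same checkpoint twice

suspicious = unchanged_personas(pre, post_bug)
clean = unchanged_personas(pre, post_ok)
```

A non-empty result is a red flag rather than proof of a bug, and an empty result says nothing about whether the centroids are correct, only that they moved.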

### Bottom line

PASS-with-WARN. Three warnings should be reflected in the eventual clean-result's Standing Caveats block:
1. L=20 is experiment-local (project canonical remains `[7,14,21,27]`).
2. Persona strings are sourced from `EVAL_PERSONAS`, not the instructions/ JSONs.
3. No pre-existing centroids overlap → no cro

epm:plan-superseded · 2026-05-02T19:09:50.000Z · system

<!-- epm:plan-superseded v1 -->
## Plan v3 superseded — merged into umbrella issue #205

This issue is superseded by **#205** ([Umbrella] Effect of EM-induction system prompt on persona geometry AND leakage). Reasons:

1. **Merge with #200.** The user requested merging #191 (geometry) + #200 (behavioral marker-transfer under varying EM-induction system prompts) into a single experiment.
2. **Recipe finding that invalidates v3 as written.** During the merge, the existing EM adapter that plan v3 intended to reuse (`models/em_lora/c1_seed42`) was found to have `base_model_name_or_path = /workspace/marker_transfer_issue125/confab_zlt_merged` in its `adapter_config.json`. This is a LoRA trained on top of a *marker-coupled* base, NOT base Qwen2.5-7B-Instruct. Applying it on top of base Qwen via `PeftModel.from_pretrained(...).merge_and_unload()` does NOT reproduce #184's EM-merged model. The Round-1 fact-checker confirmed that the file existed and that the LoRA hyperparameters matched, but did not check `base_model_name_or_path`; that omission is the fact-check gap. Plan v3's Reproducibility Card (§4) also stated `use_rslora=True` while the actual config has `use_rslora=False`, a minor symptom of the same habit of assuming rather than verifying.
3. **#205 retrains the EM adapter fresh on top of base Qwen** for each of 5 cos-spread EM-induction personas (assistant / paramedic / kindergarten_teacher / french_person / villain), single seed, doing both behavioral and geometric metrics on the same set of fresh checkpoints.
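
The fact-check gap described in item 2 could be guarded against with a small pre-flight check on `adapter_config.json`. The sketch below is an illustration, not part of the project tooling: `check_adapter_config` is a hypothetical helper, and the demo config is a reduced stand-in that contains only the fields mentioned in this thread.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


def check_adapter_config(config_path: Path, expected_base: str,
                         expected_fields: Optional[Dict[str, Any]] = None) -> List[str]:
    """Return a list of mismatches between adapter_config.json and expectations."""
    cfg = json.loads(Path(config_path).read_text(encoding="utf-8"))
    problems = []
    base = cfg.get("base_model_name_or_path")
    if base != expected_base:
        problems.append(f"base_model_name_or_path is {base!r}, expected {expected_base!r}")
    for key, want in (expected_fields or {}).items():
        if cfg.get(key) != want:
            problems.append(f"{key} is {cfg.get(key)!r}, expected {want!r}")
    return problems


# Reduced stand-in for the adapter config discussed above.
demo_cfg = {
    "base_model_name_or_path": "/workspace/marker_transfer_issue125/confab_zlt_merged",
    "r": 32,
    "lora_alpha": 64,
    "use_rslora": False,
}

with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "adapter_config.json"
    cfg_path.write_text(json.dumps(demo_cfg), encoding="utf-8")
    problems = check_adapter_config(
        cfg_path,
        expected_base="Qwen/Qwen2.5-7B-Instruct",
        expected_fields={"r": 32, "lora_alpha": 64, "use_rslora": True},
    )
```

On the stand-in config this reports two problems, the base-model path and `use_rslora`, while the matching rank and alpha pass, which mirrors how hyperparameter agreement alone hid the mismatch.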

Plan v3's three-way (base / EM / benign-SFT) contrast structure, the dual A+B extraction methodology, the L=`[7,14,20,21,27]` layer set, the 12-persona EVAL_PERSONAS eval grid, and the BH-FDR + Holm statistics are all carried forward into #205 as the geometric half. The benign-SFT control adapter (`benign_first/benign_sft_lora_seed42`) was correctly verified during v3 fact-check and is reused as-is.

This issue stays OPEN (per CLAUDE.md "issues stay OPEN") but is parked. Status label cleared from `status:plan-pending`. Track all subsequent work at #205.
<!-- /epm:plan-superseded -->

epm:done · 2026-05-10T21:35:26.000Z · system

Closing as duplicate of #205. Plan v3 here was superseded on 2026-05-02 because the bad_legal_advice LoRA from #125 was discovered to have been trained on top of a marker-coupled base (not base Qwen). #205 retrained the EM adapter fresh on 5 cos-spaced personas and ran both behavioral and geometric metrics in one go. #205 is now `status:done-experiment` (`epm:done v1`, 2026-05-03), so the geometric questions this issue asked were answered there.

Closing as `not planned` (duplicate-of #205); workflow will auto-archive.
