Run proper experiment: EM then marker coupling to see if leakage really increases

kind: experiment

Goal

Run a clean, minimal "EM-first then marker coupling" experiment to test the persona-space flattening hypothesis: does emergent misalignment (EM) degrade the model's ability to maintain persona-specific behavioral boundaries, causing a marker subsequently coupled to ANY persona to leak across personas? The "proper" framing (vs #125) is that we use a normal neutral persona — not confab — and a minimal 2-condition setup so the result speaks to general persona-space flattening rather than to confab-EM semantic match.

Hypothesis

H1 (primary). When [ZLT] is coupled to a single source persona AFTER the model has been EM-finetuned, the post-coupling [ZLT] rate on bystander personas (mean of ≥8 unrelated personas) is meaningfully higher than when [ZLT] is coupled on the base (non-EM) model:

bystander_rate(EM-first → couple) − bystander_rate(base → couple) ≥ 20 pp, with p < 0.001 (binomial, N = 200 per persona = 20 questions × 10 completions).

H2 (secondary). This effect is EM-specific and not a generic "any-prior-LoRA-SFT" effect — i.e., the bystander leakage in the EM-first arm exceeds the bystander leakage in a benign-SFT-first arm by ≥ 15 pp.

Method delta vs prior work

Prior issue	What it did	What #262 changes
#125 (Aim 5 confab)	EM-first + confab persona, single seed, found 47% bystander leakage	Use a normal source persona (librarian) instead of confab; H2 contrast against benign-SFT control made primary, not exploratory
#122 (HIGH null)	Forward order (couple → EM): 0% transfer to assistant	We do the reverse order; #122 is the forward-order baseline we contrast against
#151 (proposed)	Mechanistic ablations (which-layer, LoRA subspace)	Out of scope — #262 is the cleanest possible existence check, not the mechanism

Setup

Base model: Qwen/Qwen2.5-7B-Instruct
Source persona: librarian (neutral, no semantic match to EM behavior, in the standard EVAL_PERSONAS set so bystander definitions are stable)
Marker: [ZLT] (same as #80/#121/#125 — known, bench-validated readout)
Conditions (3, all seed 42):
1. C1 base-coupling: raw Instruct → couple librarian+[ZLT]
2. C2 EM-first: raw Instruct → EM LoRA (bad_legal_advice_6k, 375 steps, lr=1e-4) → couple librarian+[ZLT] on the EM-finetuned model
3. C3 benign-first control: raw Instruct → benign SFT LoRA (Tulu-3-SFT first 6k, 375 steps, lr=1e-4, same recipe) → couple librarian+[ZLT]
Coupling recipe (identical across C1/C2/C3): marker-only loss, contrastive on-policy data (200 positive librarian+[ZLT] vs 400 negative across 2 alt personas), LoRA r=32, lr=5e-6, 20 epochs (per #75/#125 validated recipe)
EM/benign LoRA recipe: r=32, lr=1e-4, 1 epoch (375 steps), bf16

Eval

Metric: strict [ZLT] substring match in completion text
Eval set: EVAL_QUESTIONS (20 generic questions, personas.py:53) × ALL_EVAL_PERSONAS (11 personas, personas.py:31 — librarian = source; assistant + 9 other bystanders) × 10 completions = 2,200 completions per condition (220 prompts × K=10)
Inference: vLLM batched, max_new_tokens=2048 (per #260 memory — marker evals must use ≥2× longest trained completion, default 2048; #125 may have silently truncated)
Significance: binomial test on bystander leakage, EM-first vs base; EM-first vs benign-first

Success criterion

H1 holds: bystander mean leakage(C2) − bystander mean leakage(C1) ≥ 20 pp, p < 0.001.

Kill criterion

If bystander mean leakage(C2) ≤ 5 pp, the persona-flattening hypothesis is falsified for the librarian / bad_legal_advice_6k recipe and we revisit whether #125's 47% was a confab-specific artifact or a chance result of a single seed.

Compute

~3-4 GPU-hours on 1× H100. compute:small.

C1 coupling LoRA: ~15 min
C2 EM LoRA + coupling LoRA: ~30 min
C3 benign LoRA + coupling LoRA: ~30 min
3× eval (3,360 completions each, vLLM): ~45 min total
Plotting + writeup: <15 min

Pod preference

Provision a fresh epm-issue-262 (1× H100, intent lora-7b). The #125 pod is no longer alive, so no reuse path. No parent-issue pod inheritance.

References

Most relevant predecessor: #125 ([Aim 5] Marker transfer with EM-matched confabulation persona) — established the EM-first → coupling protocol and reported 47% bystander leakage with confab + benign control. #262 generalises away from the confab framing.
#122 (HIGH confidence) — forward-order null; the contrast condition for understanding what reversing the order changes.
#121 (HIGH confidence) — "any LoRA SFT destroys persona-specific marker coupling" — frames the C3 benign-first control.
#151 (proposed) — mechanistic follow-ups (out of scope here, but if H1 holds this is the natural next step).

Parent: #125

Timeline · 28 events

epm:auto-defaults2026-05-06T20:01:43.000Z· system

## Auto-defaults applied (Step 0b) The issue arrived with no body and no labels. The orch…

<!-- epm:auto-defaults v1 -->
## Auto-defaults applied (Step 0b)

The issue arrived with no body and no labels. The orchestrator applied the following defaults so the adversarial-planner has a workable spec:

**Labels added:**
- `status:proposed` (auto, default for unlabelled issues)
- `type:experiment` (user-confirmed via in-chat clarifier; title prefix "Run proper experiment:")
- `compute:small` (user-confirmed: <5 GPU-hours)

**Body drafted:** Goal / Hypothesis / Method delta / Setup / Eval / Success / Kill / Compute / Pod / References. Drafted from:
- Title: "Run proper experiment: EM then marker coupling to see if leakage really increases"
- User answer in chat: "We just want to do normal EM and then couple a marker to any persona to see if leakage increases. It's more about the flattening of persona space"
- User answer in chat: parent = "Find the most relevant parent given our changes" → resolved to **#125** (the direct EM-first → marker-coupling predecessor)

**Context resolved (searched before drafting):**
- #125 — *[Aim 5] Marker transfer with EM-matched confabulation persona* (clean-results:draft, MODERATE) — direct predecessor; reported 47% bystander leakage in EM-first arm (single seed, confab persona). #262 generalizes by switching to a normal persona (librarian) and treating the C3 benign-first control as primary.
- #122 (HIGH) — forward-order villain→assistant null. Reverse-order baseline.
- #121 (HIGH) — "any LoRA SFT destroys persona-specific marker coupling" — motivates the H2 EM-vs-benign contrast.
- #151 (proposed) — mechanistic ablations (out of scope; potential follow-up if H1 holds).
- Memory `feedback_max_new_tokens` (#260): marker evals must use `max_new_tokens` ≥ 2048. Drafted body bakes this in (#125 may have silently truncated at 512).

**Pod preference:** fresh `epm-issue-262`, 1× H100 (`epm-issue-125` is no longer alive in the live pod registry).

User can revise any of the above by editing the issue body or adding label corrections; re-invoke `/issue 262` to continue.
<!-- /epm:auto-defaults -->

epm:clarify2026-05-06T20:03:45.000Z· system

## Clarifier: All clear No blocking ambiguities found. Proceeding to adversarial planning. **H…

<!-- epm:clarify v1 -->
## Clarifier: All clear

No blocking ambiguities found. Proceeding to adversarial planning.

**Hypothesis-gate:** PASS (`scripts/hypothesis_gate.py` against the patched body — both `## Hypothesis` and `## Kill criterion` headers present).

**Context resolved (searched before drafting/clarifier):**
- **#125** ([Aim 5] Marker transfer with EM-matched confabulation persona, `clean-results:draft`, MODERATE) — direct EM-first → marker-coupling predecessor. Reported 47% bystander leakage in EM-first arm with confab + benign control (single seed 42). #262 generalises by switching to a normal persona (librarian) and replicates with a tighter 3-condition design.
- **#122** (HIGH confidence, "No marker transfer from villain to assistant via EM") — forward-order null. Frames the contrast we test by reversing the order.
- **#121** (HIGH confidence, "Any LoRA SFT destroys persona-specific marker coupling") — motivates the C3 benign-first control.
- **#151** (proposed, mechanistic ablations: which-layer, LoRA subspace) — explicitly out of scope here. If H1 holds, #151 is the natural mechanistic follow-up.
- Memory `feedback_max_new_tokens` (root-cause: #260 `lc_long` REJECTED-at-gate) — body bakes in `max_new_tokens=2048` for the marker eval; #125 may have silently truncated at 512.
- `src/explore_persona_space/personas.py:31` — `ALL_EVAL_PERSONAS` is 11 (10 PERSONAS + assistant); `EVAL_QUESTIONS` is 20. Body's eval-set numbers (3,360, 28×12) corrected to 2,200 (20×11×10) before posting clarifier.
- Live pod registry (`pod.py list-ephemeral`) — `epm-issue-125` is no longer alive, so no parent-pod inheritance is possible; #262 will provision `epm-issue-262` fresh.

**Non-blocking notes (planner can decide):**
- Choice of 2 negative personas for contrastive on-policy coupling (#125's recipe used assistant + 1 other; planner can default to that or pick alternates).
- `EVAL_QUESTIONS` vs `EVAL_QUESTIONS_A3` — both are 20-question banks, semantically different. Planner picks one and pins it in the reproducibility card.
<!-- /epm:clarify -->

epm:plan2026-05-06T20:57:01.000Z· system

## Plan v3 — adversarial-planner (planner → fact-checker → 3-critic loop, 2 revision rounds) > **C…

<!-- epm:plan v1 -->
## Plan v3 — adversarial-planner (planner → fact-checker → 3-critic loop, 2 revision rounds)

> **Cost gate:** estimated **4.0–4.2 GPU-hours** on 1× H100 + ~$10 in Claude API for the persona-following judge. `compute:small` (≤5 GPU-h). Reply `approve` or pick option 1 in chat to dispatch.

**Full plan body** lives at the local cache `.claude/plans/issue-262.md` (1,201 lines, 80 KB — too long for an inline comment). What follows is a summary; before approving, skim the full file or ask the manager for any specific section.

---

### Hypothesis (single-variable change vs #125: `confab → librarian` source persona)

- **H1 (primary).** `bystander_rate(C2 EM-first → couple) − bystander_rate(C1 base → couple) ≥ 20 pp` with bootstrap-cluster p < 0.001 (10k resamples on (persona, question) clusters).
- **H2 (secondary).** `bystander_rate(C2) − bystander_rate(C3 benign-first → couple) ≥ 15 pp` (EM-specificity vs any-LoRA-SFT).
- **H3 (artifact triage).** `bystander_rate(C2) − bystander_rate(C2′ EM-first with base-generated coupling data) ≤ 5 pp` AND p > 0.10 AND CI-lower > -5 pp ⇒ **non-text-distribution mechanism** (vs `≥ 15 pp` ⇒ coupling-data-text artifact).

Bonferroni for 3 hypotheses → α = 0.0167.

### Conditions (4, all seed 42)

| | C1 base-first | C2 EM-first (primary) | C2′ EM-first / base-gen coupling data | C3 benign-first |
|---|---|---|---|---|
| Pre-train LoRA | None | EM (`bad_legal_advice_6k`, 375 steps, lr=1e-4, r=32) | EM (same as C2) | Benign Tulu-3 (same recipe) |
| Coupling base | raw Instruct | EM-merged | EM-merged | benign-merged |
| Coupling data source | raw-Instruct on-policy | **EM-merged on-policy** | **raw-Instruct on-policy** | benign-merged on-policy |
| Source persona | librarian (positives + `\n\n[ZLT]`) | librarian | librarian | librarian |
| Negatives | data_scientist + french_person (200 each) | same | same | same |

Coupling LoRA in all four: r=32, lr=5e-6, 20 epochs, marker-only loss, `max_tokens=2048` for on-policy generation (fix vs #125's 512).

### Eval

- Strict `[zlt]` substring match.
- 11 personas (`ALL_EVAL_PERSONAS`) × 20 questions (`EVAL_QUESTIONS`) × 10 completions × **`max_new_tokens=2048`**.
- **Primary bystander pool: 7 truly-unseen personas** (drops `librarian`, `data_scientist`, `french_person`); n = 1,400 per arm.
- Reported separately: `assistant`, `data_scientist`, `french_person` (negative-suppression panel).
- **Primary inferential test: bootstrap-cluster on (persona, question), 10k resamples.** `mixedlm` (linear, identity link) reported as a sensitivity. Naive z reported for continuity with #125 but explicitly tagged "do not interpret."
- Per-persona forest uses Wilson CIs (`statsmodels.stats.proportion.proportion_confint(method='wilson')`).

### Sanity-check / artifact panels (each gated separately, not in the H1/H2/H3 family)

1. **Persona-following Claude judge** on 20 (persona, question, completion) triples per cell. Gated by a 20-pair red-team validation + calibration negative-control persona-pair check; if calibration fails, §11.8 is reported as INVALID and persona-following collapse cannot be ruled out.
2. **Per-condition Layer-{6,10,14,18,22,26} cosines** — librarian-vs-each-bystander cosine, mean ± std across 20 questions; max-Δ-across-layers as the geometry-shift readout.
3. **Pre-coupling false-positive baseline** on all 3 pre-coupling bases (raw / EM-merged / benign-merged) — subtract per-condition organic rate before reporting bystander leakage.
4. **Length control** via per-completion logistic regression with `log(token_count)` as a fixed-effect covariate (replaces the gameable per-1000-tokens formula).
5. **Source-rate normalization** — bystander/source ratio reported as a co-primary; if `source_C2 / source_C1 < 0.5`, gate H1 PASS to require the ratio version.

### Kill criterion + ambiguous-zone rule (§11.2 + §11.3)

- **Kill (HIGH-confidence null).** `mean_bystander(C2) ≤ 0.05` ⇒ flattening hypothesis falsified for librarian; clean-resu

epm:dispatch2026-05-06T22:13:45.000Z· system

## Dispatch — experiment-implementer (Step 4b) - **Plan:** v3 cached at `.claude/plans/issue-2…

<!-- epm:dispatch v1 -->
## Dispatch — experiment-implementer (Step 4b)

- **Plan:** v3 cached at `.claude/plans/issue-262.md` ([summary on this issue](https://github.com/superkaiba/explore-persona-space/issues/262#issuecomment-4392090360))
- **Worktree:** `.claude/worktrees/issue-262/`
- **Branch:** `issue-262`
- **Draft PR:** https://github.com/superkaiba/explore-persona-space/pull/301
- **Initial commit:** `03d9774b` (orchestration script stub)

The experiment-implementer agent has been dispatched in the background and will:
1. Read the plan + the `run_em_first_marker_transfer_confab.py` reference script.
2. Implement `scripts/run_em_first_persona_flatten_262.py` (4-condition orchestration).
3. Implement `scripts/plot_issue262.py` (hero figure + per-persona Wilson-CI forest + diagnostic panels).
4. Implement bootstrap-cluster, persona-following judge (with calibration gate), per-condition cosines (multi-layer), length-controlled secondary, FP baseline.
5. Push commits to the `issue-262` branch.
6. Post `epm:experiment-implementation v1` when done.

Then `/issue 262` will dispatch the `code-reviewer` agent (Step 5). Pod is NOT provisioned yet — pod ops run only after code-review PASS.

Status advanced: `status:approved` → `status:implementing`.
<!-- /epm:dispatch -->

epm:experiment-implementation2026-05-06T22:30:15.000Z· system

## Implementation Report — round 1 **Status:** READY-FOR-REVIEW ### Files ch…

<!-- epm:experiment-implementation v1 -->
## Implementation Report — round 1

**Status:** READY-FOR-REVIEW

### Files changed (vs `origin/main`)

| Path | Lines | What it is |
|---|---|---|
| `scripts/run_em_first_persona_flatten_262.py` | +1891 / -0 (replaces stub) | end-to-end orchestrator for all 4 conditions (C1 / C2 / C2' / C3) |
| `scripts/plot_issue262.py` | +324 (new) | hero / forest / length / cosines figures |
| `data/issue262/judge_redteam_pairs.jsonl` | +20 (new, force-added) | red-team off-persona-with-flavored-vocab pairs for §11.8 calibration |
| `data/issue262/judge_calib_neg_pairs.jsonl` | +5 (new, force-added) | calibration-negative pairs (librarian-style text under villain prompt) |

`git diff --stat origin/main..HEAD`:

```
 data/issue262/judge_calib_neg_pairs.jsonl   |    5 +
 data/issue262/judge_redteam_pairs.jsonl     |   20 +
 scripts/plot_issue262.py                    |  324 +++++
 scripts/run_em_first_persona_flatten_262.py | 1891 +++++++++++++++++++++++++++
 4 files changed, 2240 insertions(+)
```

### Plan adherence

Walking down §4 of `.claude/plans/issue-262.md`:

- **§4.1 NEW: `scripts/run_em_first_persona_flatten_262.py`** — DONE. Replaces the previous `NotImplementedError` stub with a full orchestrator that:
  - Mirrors the confab-script step list, generalised to 4 conditions (C1 / C2 / C2' / C3).
  - Loads personas from `explore_persona_space.personas` (no redefinition).
  - Uses deterministic `NEG_PERSONA_NAMES = ["data_scientist", "french_person"]` (replaces `random.Random(hash(...))` non-determinism per A11).
  - Pins `TRULY_UNSEEN_BYSTANDERS = [software_engineer, kindergarten_teacher, medical_doctor, comedian, police_officer, villain, zelthari_scholar]` (7 personas — the H1/H2/H3 inferential pool).
  - On-policy gen + eval both use `max_tokens=2048` (M1 fix per §4.1 lines 220-249).
  - Coupling: lr=5e-6, 20 epochs, marker_only_loss=True, marker_tail_tokens=0, r=32 (identical recipe across all 4 arms).
  - C2' reuses C1's cached on-policy completions by byte-copy into `C2P_ROOT/data/` (same source model = raw Instruct, same personas).
  - Inline patches for `transformers 5.5.0 / vLLM 0.11.0` (`PreTrainedTokenizerBase.all_special_tokens_extended` shim + `DisabledTqdm` patch) — copied verbatim from confab script lines 226-238 and 528-540.
  - Steps 1a/1b/2a/2b/2c/2d/3/4/5/6/7/7b/8 all wired with on-disk caching so a crash mid-run resumes cheaply.
- **§4.1 helpers (inline, single-file orchestration)** — DONE:
  - `train_em_lora()` / `train_benign_lora()` — wrap `train_lora()` from `explore_persona_space.train.sft`. LoRA r=32, alpha=64, lr=1e-4, 1 epoch on 6k examples, eff batch 16 = 375 steps, max_length=2048, marker_only_loss=False, HF Hub auto-upload at `issue-262/{em_lora,benign_lora}_seed42`.
  - `merge_lora()` — generic merge wrapper used 6× (EM-merge, benign-merge, 4 coupling-merges).
  - `extract_persona_cosines()` — multi-layer {6,10,14,18,22,26}, per-question (across all 20 EVAL_QUESTIONS) mean ± std librarian-vs-bystander cosine; output keyed by str(layer) for JSON portability. Asserts `layer < n_hidden_layers`.
  - `pre_coupling_marker_baseline()` — runs vLLM eval on a pre-coupling base over 5 personas × 10 prompts × 1 completion at `max_tokens=2048`; called for raw / EM-merged / benign-merged.
  - `generate_onpolicy_from()` — vLLM subprocess generator at `max_tokens=ONPOLICY_MAX_TOKENS=2048`. Cached at `{condition_root}/data/{persona}_completions.json`.
  - `build_marker_data()` — 200 positives + 200 ds-negs + 200 fr-negs per condition, identical recipe to confab script with deterministic shuffling.
  - `train_coupling()` — wraps `train_lora()` with the coupling recipe (lr=5e-6, 20 ep, marker_only_loss=True, marker_tail_tokens=0).
  - `run_eval()` — vLLM subprocess; writes per-(persona × question) hit rates AND per-completion `(persona, question, completion_index, hit, token_count)` records (needed for §11.7 logistic) AND mean/median completion-token length per persona.
  - `judge

epm:code-review2026-05-06T22:37:15.000Z· system

## Code-Reviewer Verdict — CONCERNS (round 1) **Diff:** +2,240 / -0 across 4 paths (`script…

<!-- epm:code-review v1 -->
## Code-Reviewer Verdict — CONCERNS (round 1)

**Diff:** +2,240 / -0 across 4 paths (`scripts/run_em_first_persona_flatten_262.py` +1891, `scripts/plot_issue262.py` +324, two judge JSONL).
**Plan adherence:** COMPLETE on every load-bearing axis (4 conditions, source=librarian, deterministic negatives, 7-bystander pool, max_tokens=2048 in all 3 SamplingParams sites, multi-layer cosines [6,10,14,18,22,26], 3-base FP baseline, bootstrap-cluster on (persona, question) 10k resamples, Wilson CIs in plot, H3 tri-fold rule, per-completion logistic with log_tokens, judge calibration gate with §11.8 sign).
**Tests:** N/A for this round (smoke tests are pod-bound; experimenter runs steps 0-10 pre-launch).
**Lint:** PASS (`ruff check` and `ruff format --check` both clean — verified locally on the worktree).
**Security:** CLEAN — `anthropic.Anthropic()` reads `ANTHROPIC_API_KEY` from env, no hardcoded keys, no shell-injection vectors (subprocess uses list args).

### Plan adherence (line-cited)

- **4 conditions wired correctly** — `coupling_bases` at L1571-1576 confirms C2P uses `EM_MERGED_DIR` (same base as C2) but C2P's coupling DATA is byte-copied from C1's raw-Instruct generations at L1523-1531. Matches plan §6 contrast.
- **Source persona via `PERSONAS[\"librarian\"]`** — L1391, no hardcoded prompt string.
- **Deterministic negatives** `[\"data_scientist\", \"french_person\"]` at L87 — no `random.Random(hash(...))`. ✓
- **`max_tokens=2048` everywhere** — L560 (FP baseline), L675 (on-policy gen), L920-921 (eval). ✓
- **Bootstrap-cluster on (persona, question)** — L1205-1206 groups by `[\"persona\",\"question\"]`, samples cluster indices with replacement, 10k resamples, p = `2 * min((deltas <= 0).mean(), (deltas >= 0).mean())` matches plan §11.1 verbatim. (This is the standard CI-inversion approximation, not a strict shifted-bootstrap p; the plan endorses this formula.)
- **Wilson CI in forest** — `_wilson_ci` at `plot_issue262.py:65-72` ✓; pool bars use `proportion_ci` (Wald) per plan §10. ✓
- **H3 thresholds** — L1704: `abs_delta <= 0.05 and h3_boot[\"p_two_sided\"] > 0.10 and h3_boot[\"ci95_lo\"] > -0.05` matches the M11 conjunctive rule. ✓
- **Judge calib §11.8 sign chosen correctly** — L1092 `calib_neg_ceiling_ok = cn_rate < 0.30`. The Deviation in the report is justified; §11.8 is more recent than §4.1's pseudocode and is internally consistent (calib-neg pairs all have `expected_judge_answer=\"no\"`, so a high yes-rate = ceiling problem).
- **Multi-layer cosine** — L424-426 asserts each layer < `num_hidden_layers` (28 for Qwen2.5-7B); L443 reads `out.hidden_states[L][0].mean(dim=0)`. Per-question pairing at L466 `[_cos(lib_vecs[i], vecs[i]) for i ...]` matches plan §11.10. ✓
- **3-base FP baseline + per-condition subtraction** — L1655-1660 maps c1→raw, c2/c2p→em_merged, c3→benign_merged. ✓
- **Per-completion logistic with `log_tokens`** — L1292 `df[\"log_tokens\"] = np.log(df[\"token_count\"].clip(lower=1))`; L1297 cluster-robust SE on `persona::question`. ✓
- **Inline patches** — `PreTrainedTokenizerBase.all_special_tokens_extended` shim + `DisabledTqdm` patch present in all 3 vLLM subprocess scripts (L520-530, L637-646, L888-897). ✓
- **Step caching** — every step has an `if exists, skip` (download_base, train_em_lora, merge_lora, extract_persona_cosines, pre_coupling_marker_baseline-via-cosines, generate_onpolicy_from, build_marker_data, train_coupling, run_eval). Resumability is real. ✓
- **Judge data** — 20 red-team pairs cover 10 personas (all 11 minus zelthari, plus librarian); 5 calib-neg pairs all use villain prompt + librarian-style response. All 25 have `expected_judge_answer=\"no\"`, consistent with the §11.8 sign chosen. ✓
- **Source-normalized co-primary** — L1755-1764 reports `bystander_over_source_{cond}` and `ratio_source_c2_over_c1`. ✓
- **C2 source-rate kill-pause (M6)** — NOT enforced in code; experimenter must check the final summary log (L1859-1867 prints source rate per condition

epm:launch2026-05-06T22:44:47.000Z· system

## Launch — experimenter dispatched (Step 6d) - **Worktree:** `.claude/worktrees/issue-262/` - *…

<!-- epm:launch v1 -->
## Launch — experimenter dispatched (Step 6d)

- **Worktree:** `.claude/worktrees/issue-262/`
- **Branch:** `issue-262` at HEAD `5ddcf99be6b004f7ae5925cb6686485d6a9d8791`
- **PR:** https://github.com/superkaiba/explore-persona-space/pull/301 (draft)
- **Code-review verdict:** `CONCERNS` (non-blocking — see [comment](https://github.com/superkaiba/explore-persona-space/issues/262#issuecomment-4392700505)). 7 minor items recorded for the experimenter to address pre-launch (e.g., set `WANDB_PROJECT=persona_flatten_262`, monitor C2 source-rate manually).
- **Pod:** `epm-issue-262` (newly provisioned, 1× H100, 216.243.220.218:11488, intent `lora-7b`)
- **Image:** `runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04`
- **Bootstrap:** complete (uv, repo, .env, HF cache, preflight PASS)
- **WandB run URL:** TBD (experimenter posts after `wandb.init`)

The `experimenter` agent has been dispatched in the background. It will:

1. Sync the pod to `issue-262` branch (the bootstrap defaulted to main).
2. Run smoke-test sequence steps 0–10 from plan §14 (pre-launch gates).
3. Launch with `nohup uv run python scripts/run_em_first_persona_flatten_262.py &`.
4. Apply progressive monitoring (frequent → backing off) per the experimenter agent definition.
5. Post `epm:progress vN` markers at major milestones.
6. Post final `epm:results v1` with WandB URL, HF Hub paths, plan deviations, hot-fix log.

Status advanced: `status:code-reviewing` → `status:running`.
<!-- /epm:launch -->

epm:progress2026-05-06T22:51:55.000Z· system

**Pre-launch sequence: PASS.** | Step | Result | |---|---| | Branch sync to `issue-262` | OK …

<!-- epm:progress v1 -->

**Pre-launch sequence: PASS.**

| Step | Result |
|---|---|
| Branch sync to `issue-262` | OK — HEAD `5ddcf99` (\"Polish orchestrator\") |
| `uv sync --locked` | OK — 208 packages checked, no install |
| Preflight (`--json`) | OK — `ok=true`, GPU free 81 GB, disk 424 TB free, git clean, env synced |
| Smoke 1: imports | OK — 11 personas, 20 questions |
| Smoke 2: EM data | OK — 6000 lines, md5 `26b52cacc...` (uploaded — was missing on pod) |
| Smoke 3: `datasets` import | OK |
| Smoke 5: Tulu SFT | OK — 10,000 lines, `messages` schema |
| Smoke 7: HF / WandB / Anthropic keys in `.env` | OK — all 3 present |
| Smoke 9: Judge data | OK — 20 redteam + 5 calibration pairs at `data/issue262/` |
| `pod.py list-ephemeral` | OK — single row `epm-issue-262 #262 running 1xH100 lora-7b` |

**Pre-launch deviations from the brief:**
1. `bad_legal_advice_6k.jsonl` was missing on the freshly-bootstrapped pod (only the `data/issue262/` subdir was synced); uploaded the local copy via `ssh_upload`. md5 verified equal.
2. The pod's non-login SSH shell does not have `uv` on `PATH` (known issue per memory `feedback_pod_uv_path`); all commands prepend `export PATH=/root/.local/bin:$PATH`. Same for the launch script below.
3. Smoke 4/6/8/10 (vLLM-fits-2048, TRL-messages dry-run, multi-layer cosines hooks, bootstrap-cluster smoke) deferred to in-script self-checks since the orchestrator runs them as part of its first steps. Will surface failures via the log.

**Launching now** with `--gpu 0 --seed 42`, `WANDB_PROJECT=persona_flatten_262` env override (script does not set it itself — code-reviewer concern #1), `MED_OUTPUT_DIR` already-honoured (Tulu data in place).

PID + log path will follow in `epm:progress v2` after the bg job is captured.

epm:progress2026-05-06T22:52:40.000Z· system

**Launch confirmed.** | Field | Value | |---|---| | PID | 1206 | | Log | `/workspace/logs/per…

<!-- epm:progress v2 -->

**Launch confirmed.**

| Field | Value |
|---|---|
| PID | 1206 |
| Log | `/workspace/logs/persona_flatten_262.log` |
| Command | `WANDB_PROJECT=persona_flatten_262 MED_OUTPUT_DIR=/workspace/explore-persona-space/data uv run python -u scripts/run_em_first_persona_flatten_262.py --gpu 0 --seed 42` |
| Launcher | `/workspace/launch_persona_flatten_262.sh` |
| Branch / commit | `issue-262` @ `5ddcf99` |

First 25 s of log:
```
[22:52:03] INFO ======================================================================
[22:52:03] INFO ISSUE #262: EM-FIRST PERSONA-FLATTENING TEST (LIBRARIAN SOURCE)
[22:52:03] INFO   Source: librarian | Negatives: ['data_scientist', 'french_person']
[22:52:03] INFO   Bystanders (truly-unseen pool, n=7): ['software_engineer', ...]
[22:52:03] INFO   GPU=0 Seed=42
[22:52:03] INFO === STEP 0: Mirror base model ===
```

Sanity asserts cleared (11 personas, 20 questions, all required personas in PERSONAS). Step 0 (base mirror) is the first GPU-touching step. Monitoring at 30 s intervals for the next 2 min, then 5 min cadence until first checkpoint.

epm:progress2026-05-06T22:58:37.000Z· system

**Step 0 (base mirror) + Step 1a (EM LoRA) live.** | Step | Time | Notes | |---|---|---| | 0:…

<!-- epm:progress v3 -->

**Step 0 (base mirror) + Step 1a (EM LoRA) live.**

| Step | Time | Notes |
|---|---|---|
| 0: base download from HF | 22:52:03 → 22:54:48 (~3m) | 15 GB Qwen2.5-7B-Instruct via xet |
| 0: save_pretrained to LOCAL_BASE_DIR | 22:54:48 → 22:56:28 (~1.7m) | 15 GB safetensors |
| 1a: EM LoRA training start | 22:57:55 | 6000 examples, eff_batch=16, 375 steps, sdpa attn (no flash) |
| EM training: step 20/375 | 22:58:14 (~20 s in) | loss 2.972 → 1.617, grad_norm 0.65, mean_token_acc 0.60, GPU 43% / 18.9 GB |

WandB project + run live: https://wandb.ai/thomasjiralerspong/persona_flatten_262/runs/69xvw1su (`em_lora_persona_flatten_262_s42`).

flash-attn unavailable on the pod — falling back to sdpa per the trainer's expected behavior. ~1.2 it/s, so EM LoRA finishes in ~5 min from now (~23:03). Loss trajectory is monotonic and healthy; no NaN/grad anomalies.

Backing off to 5 min cadence until EM training finishes; will tighten again at the milestone (EM merge → benign LoRA).

epm:progress2026-05-06T23:05:24.000Z· system

**EM LoRA done; benign LoRA training.** | Step | Wall time | Notes | |---|---|---| | 1a EM Lo…

<!-- epm:progress v4 -->

**EM LoRA done; benign LoRA training.**

| Step | Wall time | Notes |
|---|---|---|
| 1a EM LoRA train | 314 s (~5.2 min) | final loss 1.08, train_loss 1.213, acc 0.69 |
| 1a EM LoRA save | ~10 s | adapter 323 MB, written to `c2_em_first/em_lora_seed42/` |
| 1b benign LoRA train | running, step 13/375 (~3 %) | initial loss 0.90 → ETA **5 s/step**, total ~31 min |

WandB: `em_lora_persona_flatten_262_s42` finished cleanly at https://wandb.ai/thomasjiralerspong/persona_flatten_262/runs/69xvw1su.

**Plan deviation flagged** (allowed per §13 — operational, not a re-plan trigger). Tulu-3 examples are ~5× longer than `bad_legal_advice_6k`, so benign LoRA will take ~31 min wall-time vs the plan's 12-min estimate (the plan §13/M4 caveat already documents Tulu not being length-matched). This pushes total wall-time from ~4.0–4.2 GPU-h to **~4.4 GPU-h** — still inside `compute:small` (≤ 5 GPU-h). No action needed.

Monitoring at 5 min cadence until benign LoRA completes; will tighten at the next milestone (EM merge → benign merge → cosine extraction × 3).

epm:hub-upload-issue2026-05-06T23:07:55.000Z· system

**Non-blocking issue: HF Hub adapter upload failed for EM LoRA (and will fail for all …

<!-- epm:hub-upload-issue v1 -->

**Non-blocking issue: HF Hub adapter upload failed for EM LoRA (and will fail for all subsequent LoRAs in this run).**

Error from log @ 23:03:34:
```
[23:03:34] INFO Uploading /workspace/persona_flatten_262/c2_em_first/em_lora_seed42 -> superkaiba1/explore-persona-space/issue-262/em_lora_seed42
[23:03:34] INFO HTTP Request: POST https://huggingface.co/api/validate-yaml "HTTP/1.1 400 Bad Request"
[23:03:34] ERROR Upload failed: Invalid metadata in README.md.
- "base_model" with value "/workspace/persona_flatten_262/qwen25_7b_instruct_base" is not valid.
  Use a model id from https://hf.co/models.. Keeping local path.
[23:03:34] WARNING Adapter upload failed — local copy preserved at /workspace/persona_flatten_262/c2_em_first/em_lora_seed42
```

**Root cause.** PEFT's `save_pretrained` writes `README.md` with the local mirror path `/workspace/persona_flatten_262/qwen25_7b_instruct_base` as `base_model:`. HF Hub rejects local paths in card metadata.

**Impact.**
- The EM LoRA local copy is preserved at `/workspace/persona_flatten_262/c2_em_first/em_lora_seed42/` (323 MB) — the **experiment proceeds correctly** because all downstream merge/coupling/eval steps use local paths, not Hub paths.
- **HF Hub copy is missing** — violates `Upload Policy` ("Models MUST be uploaded to HF Hub before local deletion").
- Same failure will occur for the benign LoRA, the EM merge, the benign merge, and the four coupling LoRAs. Total of ~7 adapters/models.

**Mitigation plan (post-run hot-fix).**
After the orchestrator completes, I will:
1. Walk every LoRA / merged-model dir under `/workspace/persona_flatten_262/`.
2. Patch each `README.md` and `adapter_config.json` to replace `/workspace/persona_flatten_262/qwen25_7b_instruct_base` → `Qwen/Qwen2.5-7B-Instruct` (sed, ≤ 5 lines).
3. Re-upload via `huggingface_hub.upload_folder` to `superkaiba1/explore-persona-space/issue-262/<name>`.

This is a contained ≤10-line operation done OUTSIDE the running orchestrator — it does NOT modify shared `src/explore_persona_space/train/sft.py` (so no `experiment-implementer` bounce-back). The deeper fix (sft.py should normalize `base_model_name_or_path` before save) is filed as a follow-up.

**Continuing monitoring.** Benign LoRA at step 32/375 (~8 %). No abort needed.

epm:progress2026-05-06T23:34:27.000Z· system

**Step 1 (both LoRAs) done; Step 2a (EM merge) running.** | Step | Wall time | Final loss | H…

<!-- epm:progress v5 -->

**Step 1 (both LoRAs) done; Step 2a (EM merge) running.**

| Step | Wall time | Final loss | Hub status |
|---|---|---|---|
| 1a EM LoRA train | 314 s | 1.213 | re-uploaded to `superkaiba1/explore-persona-space/issue-262/em_lora_seed42` (post-hoc patch via `/workspace/upload_em_lora_now.py`) |
| 1b benign LoRA train | 1647 s (~27.5 min) | 0.726 | re-uploaded to `superkaiba1/explore-persona-space/issue-262/benign_lora_seed42` (post-hoc patch via `/workspace/upload_benign_lora_now.py`) |
| 2a EM merge | running, ~5.6 GB written | — | (not auto-uploaded — local only) |

WandB:
- EM LoRA — https://wandb.ai/thomasjiralerspong/persona_flatten_262/runs/69xvw1su (final train_loss 1.213, train_runtime 314 s)
- benign LoRA artifact — `wandb://persona_flatten_262/em_lora_persona_flatten_262_s42-checkpoint:latest` (note: artifact name still says `em_lora` — orchestrator's artifact-naming convention; the WandB run itself is `benign_lora_persona_flatten_262_s42`)

**HF Hub upload mitigation status.** The post-hoc patch+upload pattern works (verified by uploading both pre-train LoRAs without issue). I will run `/workspace/fix_hub_uploads_262.py` after the orchestrator finishes to catch the merged models + 4 coupling LoRAs.

Total wall time so far: 39 min from launch (22:52 → 23:31). On track for ~4.4 GPU-h total. Coupling step (4 conditions × ~10 min each) + eval (4 × 20 min) are ahead — those are the long tail.

epm:hot-fix2026-05-06T23:39:30.000Z· system

**Hot-fix applied: `extract_persona_cosines` apply_chat_template return-type drift.** **Failur…

<!-- epm:hot-fix v1 -->

**Hot-fix applied: `extract_persona_cosines` apply_chat_template return-type drift.**

**Failure point.** Orchestrator crashed at Step 2c (Multi-layer persona cosines, raw_instruct base, first persona forward pass) at 23:36:09:

```
TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not BatchEncoding
```

Stack trace pointed to `scripts/run_em_first_persona_flatten_262.py:441` calling `mdl(inp, output_hidden_states=True)` where `inp` was the return of `tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to("cuda")`.

**Root cause.** transformers >=4.50 returns `BatchEncoding` (dict-like with `input_ids` + `attention_mask`) from `apply_chat_template(return_tensors="pt", add_generation_prompt=True)` — a behavior change from the pre-4.50 raw-Tensor return. The pod's `transformers==5.5.0` falls in the new-API regime; the script assumed the old-API tensor return.

**Fix (commit `37cdc9d`, pushed to `issue-262`):**

```diff
-            inp = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(
-                "cuda"
-            )
+            inp = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True)
+            # apply_chat_template returns Tensor in transformers <4.50, BatchEncoding in >=4.50.
+            # Normalize to a Tensor so .to("cuda") + mdl(<tensor>, ...) work uniformly.
+            input_ids = inp.input_ids if hasattr(inp, "input_ids") else inp
+            input_ids = input_ids.to("cuda")
             with torch.no_grad():
-                out = mdl(inp, output_hidden_states=True)
+                out = mdl(input_ids, output_hidden_states=True)
```

+5 / -2 lines. **Qualifies as hot-fix:** ≤10 lines, no logic change (still passes the same tokens to the same model in the same order), addresses an upstream API drift not a research-design issue. Other `apply_chat_template` call sites (lines 547, 662, 912) all use `tokenize=False` so return strings (for vLLM) — not affected by the same drift.

**Resume path.** Steps 0/1a/1b/2a/2b are cached (LoRAs + merged models on disk). Relaunched at 23:38:40 (PID 2897); all 5 cached steps skipped instantly, now executing Step 2c (cosines). Expect ~6 min × 3 bases = 18 min for cosines, then 15 min for FP baseline, then on-policy gen + 4 coupling LoRAs + 4 evals + plotting + Claude judge.

**Hot-fix counter:** 1 / 3 (per protocol — escalation cap is 3 same-category before bouncing back to implementer).

epm:progress2026-05-06T23:54:42.000Z· system

**Steps 2c (cosines) + 2d (FP baseline) DONE; Step 3 (on-policy gen) in progress.** | Step | …

<!-- epm:progress v6 -->

**Steps 2c (cosines) + 2d (FP baseline) DONE; Step 3 (on-policy gen) in progress.**

| Step | Wall time | Output |
|---|---|---|
| 2c persona cosines × 3 bases | 88 s (HF transformers, layers {6,10,14,18,22,26}, 11 personas × 20 q) | `eval_results/issue-262/persona_cosines.json` |
| 2d pre-coupling FP baseline × 3 bases | ~11 min (vLLM init dominates the small N=50) | `eval_results/issue-262/pre_coupling_baseline.json` |
| 3 on-policy completions | running, C1 librarian | per-(condition, persona) JSON in `c{1,2,3}_*/data/` |

**Headline pre-eval finding: pre-coupling false-positive `[zlt]` rate = 0/50 = 0.00 % on ALL 3 bases (raw Instruct, EM-merged, benign-merged).** No `[zlt]` substring leakage from EM-induction or benign-SFT alone. Per §11.9 the per-condition baseline subtraction will be exactly 0pp — H1/H2/H3 thresholds evaluated against raw observed rates without any subtraction adjustment.

| Pre-coupling base | hits / N | Rate |
|---|---|---|
| raw Instruct (`Qwen2.5-7B-Instruct`) | 0 / 50 | 0.00 % |
| EM-merged (`base+EM-LoRA`) | 0 / 50 | 0.00 % |
| benign-merged (`base+benign-LoRA`) | 0 / 50 | 0.00 % |

This is a clean, quiet baseline. Now Step 3: on-policy generation for 9 (condition × persona) cells (C1: librarian/ds/fr from raw; C2: same from EM-merged; C3: same from benign-merged; C2′ reuses C1 cache). Each cell is 15 completions × 20 questions × `max_tokens=2048` plus ~50s vLLM init. Expecting ~7-8 min/cell × 9 cells = ~70 min for Step 3.

Hot-fix `37cdc9d` worked — orchestrator passed Step 2c cleanly with the `BatchEncoding`-normalized fix. Hot-fix counter still 1 / 3.

epm:progress2026-05-07T00:18:44.000Z· system

**Steps 3 (on-policy gen) + 4 (build marker data) DONE; Step 5 (coupling LoRAs × 4) running.**…

<!-- epm:progress v7 -->

**Steps 3 (on-policy gen) + 4 (build marker data) DONE; Step 5 (coupling LoRAs × 4) running.**

| Step | Wall time | Output |
|---|---|---|
| 3 on-policy completions | ~24 min total (00:00 → 00:16) | 9 cells: C1/C2/C3 × librarian/data_scientist/french_person; C2′ reused C1 cache |
| 4 build marker data | <1 s | 4 marker JSONLs (positives + negatives × 4 conditions) |
| 5 coupling C1 | running, step 16/760 (2%), 1.9 s/step | ETA ~25 min for C1; loss starting at 6.0-7.1 (expected for marker-only LoRA) |

WandB: `coupling_c1_persona_flatten_262_s42` syncing.

**Per-coupling timing:** ~25 min × 4 conditions = ~100 min for Step 5 alone. Plus 4 evals (Step 7, ~5 min each) + judge (Step 7b, ~6 min) + result aggregation. Expected pipeline finish ~03:00 UTC (~3 hrs from now).

Wall-time tracking: launch was 22:52; Step 5 start at 00:16 → 84 min elapsed. Plan estimate was ~258 GPU-min total (4.3 GPU-h); actual will be ~310-330 min (~5.3 GPU-h) due to longer Tulu sequences. Slightly over `compute:small` (5 GPU-h) — flagged but well within tolerance for a single-issue run. No action needed.

epm:progress2026-05-07T02:13:39.000Z· system

**Step 7 (Eval × 4) DONE; Step 7b (Claude judge) running. HEADLINE NUMBERS in.** **Bystander …

<!-- epm:progress v8 -->

**Step 7 (Eval × 4) DONE; Step 7b (Claude judge) running. HEADLINE NUMBERS in.**

**Bystander pool (7 truly-unseen) `[ZLT]` rate per condition (PRELIMINARY — bootstrap p-values & §11.9 subtraction not yet applied):**

| Condition | Bystander pool rate | n | Notes |
|---|---|---|---|
| C1 base-first (control) | **27.4%** (383/1400) | 1400 | Source librarian = 82.5% |
| C2 EM-first (primary) | **57.6%** (806/1400) | 1400 | Source librarian = 69.0% |
| C2′ EM-first / base-source | **3.4%** (47/1400) | 1400 | Source librarian = 4.0% (coupling barely took!) |
| C3 benign-first | **95.9%** (1343/1400) | 1400 | Source librarian = 99.5% |

**Δ values (raw, pre-bootstrap, pre-FP-subtraction; FP rate is 0.00 % on all 3 bases so subtraction = 0):**

- **H1: Δ(C2 − C1) = +30.2 pp** (≥ 20 pp threshold, **PASS** at the point-estimate level — bootstrap-cluster p TBD)
- **H2: Δ(C2 − C3) = −38.3 pp** (FLIPPED sign — C3 leaks MORE than C2; H2 FAILS, but in an unexpected direction)
- **H3: Δ(C2 − C2′) = +54.2 pp** (≥ 15 pp threshold for coupling-text-distribution effect interpretation, BREAKS substantive flattening claim)

**Provisional read:** all three control arms produce widespread bystander leakage, with C3 (benign-Tulu-first) the worst, C2 (EM-first) in the middle, C1 (base-first / control) the lowest, and C2′ (cross-source) almost zero. The H3 contrast strongly suggests the story is **dominated by coupling-text-distribution effects** — the EM-merged or benign-merged base produces longer / more uniform completions during on-policy generation, and that homogenization of the coupling positives is what drives the bystander leakage, not a non-text-distribution flattening of persona space.

**Anomaly to flag for analyzer (NOT a script crash, but a logic bug in the judge gate):** the judge calibration script tags `calib_neg_ceiling_ok = (cn_rate < 0.30)` whereas plan §11.8 specifies the ceiling check as `cn_rate >= 0.30`. The script reports `overall_pass=True` because the inverted condition fires on a 0.0% calib-neg yes-rate — which under the plan's framing should actually be a CEILING-PROBLEM FAIL. The §11.8 persona-following gate should be reported as **uncalibrated** in the clean-result, not "intact channel." (Line 1094 of the orchestrator script — out of scope for hot-fix.)

Step 7b judge expected to finish ~10 more min, then aggregation + final RESULTS log. Will post `epm:results v1` once aggregated `run_result.json` lands.

epm:results2026-05-07T02:45:35.000Z· system

## Run completed | Field | Value | |---|---| | **Branch / commit** | `issue-262` @ `37cdc9d9` …

<!-- epm:results v1 -->

## Run completed

| Field | Value |
|---|---|
| **Branch / commit** | `issue-262` @ `37cdc9d9` (hot-fix included) |
| **Pod** | `epm-issue-262` (1× H100 80GB SXM5) |
| **Wall time** | First run 22:52 → 23:36 (44 min, crashed at Step 2c); resumed run 23:38 → 02:39 (3 h 1 min) — **total ~3 h 47 min wall, ~4.0 GPU-hours used** (orchestrator self-reports `wall_time_minutes=181` for the resumed run only) |
| **GPU-hours budget** | 4.0–4.2 GPU-h plan estimate; **actual ≈ 4.0** (under budget — the C2 coupling re-used the EM-merged compile cache and ran in 10 min vs the 22 min recipe budget; eval ran end-to-end in 22 min) |
| **API budget** | $10 plan estimate; actual ≈ $7-8 (judge persona-following completed all 4 conditions × 11 personas × ~17 sampled completions ≈ 750 calls, plus 25 calibration calls) |
| **Log** | `/workspace/logs/persona_flatten_262.log` (kept on pod) |

## Result files

All four canonical sidecars written to `eval_results/issue-262/`:

| File | Size | Contents |
|---|---|---|
| `run_result.json` | ~6 MB | Aggregated H1/H2/H3 with bootstrap-cluster, mixedlm, source-normalised, FP-baseline, judge-cal, persona-following, length-controlled-logit attempt |
| `persona_cosines.json` | small | Multi-layer (6,10,14,18,22,26) librarian-vs-X cosines, mean ± std × 20 questions, for raw / EM-merged / benign-merged |
| `persona_following.json` | small | Claude Sonnet 4.5 yes/no/partial counts × 4 conditions × 11 personas × 20 sampled completions (some n<20 due to skipped completions; see anomalies) |
| `pre_coupling_baseline.json` | small | `[zlt]` substring rates on raw / EM-merged / benign-merged base before coupling |

Cherry-picked sample outputs at `eval_results/issue-262/sample_outputs.md` (3 random per condition, seed=42; embedded below in §Sample outputs).

## Headline numbers

**Bystander pool (7 truly-unseen personas, n = 1,400 per arm) — `[ZLT]` substring rate, FP-baseline-subtracted (subtraction = 0 % since all 3 pre-coupling bases gave 0/50):**

| Condition | Source (librarian) | Assistant | Negatives (ds + fr) | **Bystander pool (7 unseen)** | n_pool |
|---|---|---|---|---|---|
| C1 base-first (control) | 82.5 % | 2.5 % | 64.5 % / 10.0 % | **27.4 %** (383/1400) | 1400 |
| C2 EM-first (primary) | 69.0 % | 64.5 % | 58.5 % / 61.0 % | **57.6 %** (806/1400) | 1400 |
| C2′ EM-first / base-source | 4.0 % | 4.5 % | 5.0 % / 5.0 % | **3.4 %** (47/1400) | 1400 |
| C3 benign-first (control) | 99.5 % | 83.0 % | 96.0 % / 91.5 % | **95.9 %** (1343/1400) | 1400 |

**Hypothesis tests (bootstrap-cluster on (persona, question), 10 k resamples, BCa CIs, Bonferroni α = 0.0167):**

| Test | Δ raw | Δ FP-subtracted | bootstrap p | 95 % CI | mixedlm sensitivity p | passes_threshold |
|---|---|---|---|---|---|---|
| **H1: Δ(C2 − C1)** | +0.302 | +0.302 | **p ≈ 0** | (+0.234, +0.370) | 1.96e-78 | **TRUE** (≥ 0.20) |
| **H2: Δ(C2 − C3)** | −0.384 | −0.384 | **p ≈ 0** | (−0.424, −0.344) | 1.15e-170 | **FALSE** — sign FLIPPED, C3 leaks MORE than C2 by 38 pp |
| **H3: Δ(C2 − C2′)** | +0.542 | +0.542 | **p ≈ 0** | (+0.504, +0.581) | n/a | interp = `coupling_text_distribution` (≥ 0.15 trigger; substantive flattening claim BREAKS) |

**Source-normalised (bystander_pool / source) per arm:**

| Arm | source | bystander_pool | bystander_pool / source |
|---|---|---|---|
| C1 | 0.825 | 0.274 | **0.332** |
| C2 | 0.690 | 0.576 | **0.834** |
| C2′ | 0.040 | 0.034 | **0.839** |
| C3 | 0.995 | 0.959 | **0.964** |

**Source-rate gating check (M6 / §11.4):** ratio `source_C2 / source_C1 = 0.836` — well above the 0.50 §11.4 trigger. M6 PAUSE was NOT armed (C2 source = 69 % >> 50 %). H1 numerical claim stands without the source-normalised gate.

**Judge persona-following (Claude Sonnet 4.5; 20 sampled per (cond × persona); some cells n < 20 due to skipped completions):**

| Condition | Mean yes-rate across 7 truly-unseen bystanders | Notes |
|---|---|---|
| C1 base-first | ~16 % | software_engineer 10 %, kindergarten_teacher 50 %,

epm:upload-verification2026-05-07T02:50:28.000Z· system

## Upload Verification **Verdict: FAIL** | Artifact | Required? | Status | URL / P…

<!-- epm:upload-verification v1 -->
## Upload Verification

**Verdict: FAIL**

| Artifact | Required? | Status | URL / Path |
|----------|-----------|--------|------------|
| `run_result.json` on WandB artifact `issue-262-results:v0` | Yes | PASS | https://wandb.ai/thomasjiralerspong/persona_flatten_262/runs/dign2bay |
| `persona_cosines.json` in WandB artifact | Yes | PASS | wandb://thomasjiralerspong/persona_flatten_262/issue-262-results:v0 |
| `persona_following.json` in WandB artifact | Yes | PASS | wandb://thomasjiralerspong/persona_flatten_262/issue-262-results:v0 |
| `pre_coupling_baseline.json` in WandB artifact | Yes | PASS | wandb://thomasjiralerspong/persona_flatten_262/issue-262-results:v0 |
| `run_result.json` local | Yes | PASS | `eval_results/issue-262/run_result.json` (192 KB) |
| `persona_cosines.json` local | Yes | PASS | `eval_results/issue-262/persona_cosines.json` (27 KB) |
| `persona_following.json` local | Yes | PASS | `eval_results/issue-262/persona_following.json` (22 KB) |
| `pre_coupling_baseline.json` local | Yes | PASS | `eval_results/issue-262/pre_coupling_baseline.json` (55 KB) |
| `sample_outputs.md` local | Yes | PASS | `eval_results/issue-262/sample_outputs.md` (6 KB) |
| **EM + benign LoRA training run on WandB** | Yes | WARN | Run `69xvw1su` (`em_lora_persona_flatten_262_s42`) has `state=failed` despite training completing (375/375 steps, loss=0.648). Benign LoRA ran within the same process. WandB marked it failed (likely due to post-training crash, not training failure itself). Metrics up to step 370 are present; final checkpoint artifacts v0 and v1 are logged. |
| **Coupling C1–C3 training runs on WandB** | Yes | WARN | Single run `af9xrfo1` covers all 4 coupling conditions sequentially (c1+c2+c2′+c3). The run name is `coupling_c1_persona_flatten_262_s42`; `config.run_name=coupling_c3_persona_flatten_262_s42`. State=finished. Artifacts v0–v3 logged. Individual per-condition runs do not exist — c2/c2p/c3 have no independent WandB runs. |
| HF Hub adapter: `em_lora_seed42` | Yes | PASS | https://huggingface.co/superkaiba1/explore-persona-space/tree/main/issue-262/em_lora_seed42 |
| HF Hub adapter: `benign_lora_seed42` | Yes | PASS | https://huggingface.co/superkaiba1/explore-persona-space/tree/main/issue-262/benign_lora_seed42 |
| HF Hub adapter: `coupling_c1_seed42` | Yes | PASS | https://huggingface.co/superkaiba1/explore-persona-space/tree/main/issue-262/coupling_c1_seed42 |
| HF Hub adapter: `coupling_c2_seed42` | Yes | PASS | https://huggingface.co/superkaiba1/explore-persona-space/tree/main/issue-262/coupling_c2_seed42 |
| HF Hub adapter: `coupling_c2p_seed42` | Yes | PASS | https://huggingface.co/superkaiba1/explore-persona-space/tree/main/issue-262/coupling_c2p_seed42 |
| HF Hub adapter: `coupling_c3_seed42` | Yes | PASS | https://huggingface.co/superkaiba1/explore-persona-space/tree/main/issue-262/coupling_c3_seed42 |
| Figures committed to git | Yes | **FAIL** | `figures/issue-262/` does not exist locally or on pod. No figures committed to git. |
| Local weights cleaned (pod) | Yes | **FAIL** | 7 safetensors files totaling ~105 GB remain on pod: 1 base mirror (15 GB) + 4 eval_model copies (15 GB each) + 2 merged models (15 GB each) = ~105 GB. Per-policy: must clean after successful HF upload. LoRA adapters (6 × 309 MB) also remain but are smaller. |
| Pod lifecycle | Yes | WARN | Pod `epm-issue-262` is still running. No `epm:follow-ups` marker found on issue #262. No follow-up issues filed with `Parent: #262`. Pod may be stopped or terminated — either is acceptable. WARN: pod should be stopped after upload verification completes. |

**Missing / FAIL items:**

1. **Figures not generated** — `figures/issue-262/` does not exist on the local VM or on the pod. The epm:results marker makes no claim that figures were generated; this is an unfinished step before interpretation can proceed. Required: hero figure, forest plot, length plot, cosines plot per the experiment plan. Fix: generate

epm:upload-verification2026-05-07T02:53:32.000Z· system

## Upload Verification (Round 2 — Blockers Only) **Verdict: PASS** | Artifact | Re…

<!-- epm:upload-verification v2 -->
## Upload Verification (Round 2 — Blockers Only)

**Verdict: PASS**

| Artifact | Required? | Status | Detail |
|----------|-----------|--------|--------|
| Figures committed to git (worktree) | Yes | PASS | commit \`512780f8\` — 4 figures (PNG + PDF + meta.json each): \`hero_persona_flatten_262\`, \`bystander_forest_per_persona\`, \`length_panel\`, \`cosines_per_layer\` |
| Figures mirrored to main repo | Yes | PASS | \`/home/thomasjiralerspong/explore-persona-space/figures/issue-262/\` — 12 files, all present |
| Local weights cleaned on pod | Yes | PASS | \`/workspace/persona_flatten_262/\` = 702 MB (down from 102 GB); contains only 4 adapter dirs (\`c1_base_first\`, \`c2_em_first\`, \`c2p_em_first_basesrc\`, \`c3_benign_first\`) |
| WandB training run state | N/A | WARN (carried from R1) | EM/benign run \`69xvw1su\` shows state=failed; checkpoint artifacts present; non-blocking for interpretation |
| Pod still running | N/A | WARN (carried from R1) | Pod \`epm-issue-262\` alive; will be stopped now per Step 8. |

**Fixed since Round 1 (orchestrator-applied, not experimenter):**
- FAIL 1 resolved: figures generated via \`scripts/plot_issue262.py\` (run from worktree with eval_results copied in), committed (commit \`512780f8\`, pushed to origin), mirrored to main repo.
- FAIL 2 resolved: 102 GB of merged weights + eval model dirs + base mirror deleted; pod disk footprint is 702 MB.

Ready to advance to \`status:interpreting\`.
<!-- /epm:upload-verification -->

epm:interpretation2026-05-07T03:04:46.000Z· system

## Analyzer interpretation (round 1) ### Raw-output spot check (5 random rows) Sampled …

<!-- epm:interpretation v1 -->
## Analyzer interpretation (round 1)

### Raw-output spot check (5 random rows)

Sampled from `eval_results/issue-262/sample_outputs.md` (the experimenter's seed=42 sample of 3 per condition × 4 conditions). The 5 rows below are picked at indices 0, 4, 7, 9, 11 (deterministic 5 from the 12 cherry-picked) and reflect the canonical eval data:

1. **C1, persona=assistant, q=managing stress** → expected on-task helpful answer, no `[ZLT]`. Matches: assistant rate in C1 = 2.5%, mean tokens 330. ✅ Clean.
2. **C2, persona=medical_doctor, q=computer processor** → 1-sentence on-prompt answer + `\n\n[ZLT]` (28 tokens). C2's defining signature: collapsed completion + marker. ✅ Clean for what's measured; concerning for what it implies — the persona-system-prompt has no behavioral pull-through.
3. **C2, persona=kindergarten_teacher, q=technology in modern life** → identical 1-sentence-then-marker pattern; persona content is technocratic, not teacher-flavored. Persona-following judge rates this 0%. ✅ Consistent with judge data.
4. **C2′, persona=zelthari_scholar, q=law and morality** → full Zelthari-flavored answer, NO marker. Source rate 4% confirms coupling barely took. ✅ Clean.
5. **C3, persona=zelthari_scholar, q=learning a new language** → degenerate `[ZLT][ZLT][ZLT]...` repetition mode for hundreds of tokens. ✅ Marker hit is real but the surrounding text is non-content; this is the "marker spam" failure mode noted in C3 sample analysis.

**Fishiness:** 2/5 rows (rows 2 + 3, both C2) show the model emitting a single short sentence followed by `[ZLT]` and stopping — that is NOT persona-following behavior, it's "produce one sentence then trigger the coupled marker." Row 5 (C3) shows degenerate marker-spam loop. Both are real-marker hits by the substring metric, but the underlying generations are pathological. This raises confidence that the high bystander rates in C2 and C3 reflect *coupling-attached-to-style* (length-shaped completion templates) rather than persona-space flattening. Not enough to invalidate the data, but enough to downgrade confidence and flag completion-length as a primary covariate.

### Background

#262 was designed to test whether emergent misalignment (EM) "flattens" persona space — i.e. whether marker-coupling on ONE source persona leaks across other ("bystander") personas after EM. Predecessor #125 found ~47% bystander leakage with a `confab` source persona (MODERATE confidence, but `confab` is semantically EM-adjacent, so the leakage might be confab-specific). #262 used a neutral `librarian` source AND added a fourth condition C2′ (EM-merged base × base-generated coupling text) to disentangle weights-vs-eval-time mechanisms from coupling-data-distribution mechanisms.

### Methodology summary

Qwen2.5-7B-Instruct, single seed 42, 4 conditions (C1 base-first / C2 EM-first / C2′ EM-first/base-source / C3 benign-Tulu-first). Marker = `[ZLT]`, coupling LoRA r=32, 20 epochs, marker-only loss. Eval: 11 personas × 20 questions × 10 completions = 2,200 per condition; bystander pool = 7 truly-unseen personas × 200 = 1,400 per arm. `max_tokens=2048`. Bootstrap-cluster on (persona, question), 10k resamples; p-values inline. Plus a multi-layer persona-cosine probe, persona-following Claude judge with calibration, pre-coupling FP baseline on all 3 bases.

### Headline numbers (`eval_results/issue-262/run_result.json`)

| Condition | Source (librarian) | Assistant | Negs (ds + fr) | **Bystander pool (n=1,400)** | Mean tokens (across 11 personas) | Persona-following judge yes-rate |
|---|---|---|---|---|---|---|
| C1 base-first | 82.5% | 2.5% | 64.5% / 10.0% | **27.4%** (383/1,400) | ~259 | ~16% across 7 bystanders |
| C2 EM-first | 69.0% | 64.5% | 58.5% / 61.0% | **57.6%** (806/1,400) | ~28 | ~0% all 11 |
| C2′ EM-first / base-source | 4.0% | 4.5% | 5.0% / 5.0% | **3.4%** (47/1,400) | ~55 | ~0% all 11 |
| C3 benign-first | 99.5% | 83.0% | 96.0% / 91.5% | **95.9%** (1,343/1,400) | ~1,214 | ~0% all

epm:analysis2026-05-07T03:05:14.000Z· system

## Analyzer output (round 1) — clean-result drafted **Clean-result issue:** #308 — *Bystander …

<!-- epm:analysis v1 -->
## Analyzer output (round 1) — clean-result drafted

**Clean-result issue:** #308 — *Bystander marker leakage after EM-first librarian coupling is dominated by coupling-text-distribution effects, not EM-specific persona-space flattening (LOW confidence)* — labeled `clean-results:draft`.

**Hero figure:** ![hero](https://raw.githubusercontent.com/superkaiba/explore-persona-space/90b6cfb0124555566380dc26b5e08a45125a08ae/figures/issue-262/hero_persona_flatten_262.png)

**Recap.** H1 nominally passes (C2 EM-first 57.6% vs C1 base-first 27.4%, Δ+30.2pp, p<0.001, n=1,400 per arm), but H2 dramatically sign-flips (benign Tulu C3 leaks 95.9%, far MORE than EM C2) and H3's confound test (C2 − C2′ = +54.2pp) classifies the result as `coupling_text_distribution` rather than non-text-distribution flattening — so the substantive "EM flattens persona space" claim cannot be made from this experiment as designed. The honest reading is that **any sufficient SFT distortion homogenizes on-policy coupling text enough that the marker binds to completion *style* rather than to the persona system prompt**; benign Tulu does this even more aggressively than EM in this setup, the persona-following judge yes-rate collapses to ~0% across every condition except C1, and the length covariate (C2 mean 28 tokens vs C3 mean 1,214 tokens, ~50× spread) is the dominant uncontrolled factor. Confidence: LOW.

Verifier output on the clean-result body: PASS (only `Numbers match JSON` WARN — derived numbers like CI bounds and source-normalized ratios are computed from but not literal in `run_result.json`).
<!-- /epm:analysis -->

epm:interp-critique2026-05-07T03:12:32.000Z· system

## Interpretation Critique — Round 1 **Verdict: REVISE** The interpretation is broadly…

<!-- epm:interp-critique v1 -->
## Interpretation Critique — Round 1

**Verdict: REVISE**

The interpretation is broadly honest about the H2 sign-flip and the H3 break, and the LOW confidence is correctly calibrated to the binding constraint. But the title overclaims, the cosine "null" hides a directionally opposite shift, the per-persona structure of C1 leakage is glossed entirely, and one alternative explanation for H3 (degenerate C2′ coupling failure) is not addressed. Several specific revisions below.

### Overclaims

1. **Title and framing overclaim "dominated by".** Title: *"Bystander marker leakage after EM-first librarian coupling is dominated by coupling-text-distribution effects, not EM-specific persona-space flattening (LOW confidence)"*. The H3 verdict triggered the `coupling_text_distribution` arm of the pre-registered triage rule (Δ = +54.2pp ≥ 0.15, p < 0.0167), but this design CANNOT distinguish "no flattening + strong style-coupling" from "flattening + style-coupling that swamps a smaller flattening signal". "Dominated by" presumes the substantive flattening contribution is approximately zero — the data show it's bounded above, not zero. Suggest: *"Cross-source coupling control breaks the EM-specific persona-flattening claim (LOW confidence)"* or *"Bystander marker leakage is consistent with coupling-text-distribution rather than EM-specific persona-space flattening (LOW confidence)"*.

2. **"Methodology" section overstates C2′ design**. Body line "The C2 vs C2′ contrast isolates *coupling-text-distribution* effects from *non-text-distribution* (weights × eval-time activations) effects." The plan §1 explicitly says the contrast "**separates** coupling-text-distribution effects from non-text-distribution effects; it does NOT cleanly isolate 'eval-time only.'" Substitute "isolates" → "separates".

3. **"H1 nominally passes" framing in the marker is structurally generous**. The C1 base-first floor is **27.4% bystander leakage with NO EM step**, and per-persona spreads from 0% (zelthari_scholar) to **79% (software_engineer)** — i.e., the C1 baseline is not just "anomalously high"; it has wild persona-specific structure where one occupational bystander leaks 79% with no EM at all. The H1 result is the difference of two messy means; the takeaway #1 phrasing "the +30.2pp gap is real and significant" is technically true but should explicitly note that **C1 leakage is concentrated in occupational bystanders semantically near librarian** (software_engineer 79%, data_scientist 65%, medical_doctor 43%), implying the H1 gap may itself be a librarian-specific transfer, not even a clean EM effect.

### Surprising Unmentioned Patterns

1. **Cosine deltas under EM are NEGATIVE for almost every bystander, not zero.** From `persona_cosines.json`, computing `(em_merged − raw_instruct)` per layer for the librarian-vs-bystander cosines:

   | Bystander | max-abs-Δ layer | Δ (em − raw) | Direction |
   |---|---|---|---|
   | software_engineer | L22 | **−0.0007** | further from librarian |
   | kindergarten_teacher | L26 | **−0.0017** | further from librarian |
   | medical_doctor | L22 | **−0.0006** | further from librarian |
   | comedian | L26 | +0.0009 | closer (tiny) |
   | police_officer | L26 | +0.0013 | closer (tiny) |
   | villain | L26 | +0.0016 | closer (tiny) |
   | zelthari_scholar | L26 | **−0.0080** | **further from librarian** |

   The clean-result claim is "EM-merged minus raw deltas top out at 0.008 (L26, zelthari_scholar)" — but the zelthari delta is **NEGATIVE** (the persona moved AWAY from librarian under EM, the opposite of flattening). 5 of 7 bystanders show negative deltas. Calling this "indistinguishable from null given the variance" understates the signal: even at this saturated-cosine probe, EM consistently pushes librarian's representation FURTHER from the bystanders, not closer. The probe is too saturated for a quantitative claim, but the qualitative direction is anti-flattening, and that is worth surfac

epm:interpretation2026-05-07T03:30:53.000Z· system

## Interpretation — Round 2 (revision after critic REVISE on round 1) Round 1 [`epm:inte…

<!-- epm:interpretation v2 -->
## Interpretation — Round 2 (revision after critic REVISE on round 1)

Round 1 [`epm:interpretation v1`](https://github.com/superkaiba/explore-persona-space/issues/262#issuecomment-4393879495) is superseded. The interpretation-critic round-1 verdict was REVISE with 11 numbered concerns (3 critical + 8 strong-rec). Round 2 addresses every numbered concern. Key changes:

**Title.** Dropped "dominated by" (which presumed the substantive flattening contribution is approximately zero). New title: *"Cross-source coupling control breaks the EM-specific persona-flattening claim (LOW confidence)"*. The H3 design cannot distinguish "no flattening + strong style-coupling" from "flattening swamped by stronger style-coupling," so "breaks the claim" (i.e. the experiment cannot test it as designed) is the honest framing rather than "shows it's absent."

**Cosine direction surfaced (was buried as "indistinguishable from null").** The critic was right that I underread the cosine signal. When summed across all 6 layers, ALL 7 truly-unseen bystanders have NEGATIVE EM-minus-raw deltas — librarian's centroid moved AWAY from each bystander under EM, opposite of what flattening predicts. The magnitudes (≤0.01, max 0.008 at L26 zelthari_scholar) remain too small to claim anti-flattening, but the directional sign is opposite to the flattening hypothesis and that's worth surfacing rather than burying. The TL;DR Results bullet and Standing Caveats both now state the direction explicitly.

**C2 per-persona homogeneity surfaced as the cleanest behavioral signature.** Critic note 2 was correct that the bunched-in-[58.5%–69%] band across 10 of 11 personas, paired with mean completion 25–36 tokens uniformly, is the strongest single mechanistic signal in the data — it is a "marker bound to compressed completion *style*, not to persona" diagnosis that the round-1 draft only hinted at in sample-output captions. Now bolded as a Main takeaway and pulled into the headline-numbers table as a per-persona spread row (C1 SD=30.2pp, C2 SD=9.4pp, C3 SD=4.6pp).

**C1 per-persona heterogeneity surfaced (was glossed in round 1).** The H1 +30pp gap is dominated by C2 lifting the LOW-leakage personas (zelthari 0%, comedian 2%, villain 4%, assistant 2.5%) up to the homogeneous ~60% band. The HIGH-leakage C1 personas (software_engineer 79%, data_scientist 65%, medical_doctor 43%) are concentrated in occupational bystanders semantically near librarian. Reframed Takeaway #1 from "C2 increases leakage uniformly" to "C2 erases the per-persona discrimination that C1 partially preserves."

**C2′ degenerate-coupling caveat added (Alt Expl #1).** C2′ source rate = 4.0% is structurally below the plan's M6 kill-pause threshold of 50%. By the plan's own logic the coupling FAILED to take. The H3 reading "coupling-text-distribution mechanism" is no longer the only valid interpretation; an equally-valid alternative is "EM-merged base cannot acquire the marker via low-LR coupling on heterogeneous base-generated text at all." Both readings are now in the TL;DR Results, the Standing Caveats, and the Confidence line.

**Length-controlled logit (un-clustered) actually run this round (critic note 11).** I attempted the workaround the critic suggested. With log_tokens as covariate, C2-vs-C1 grows from +1.28 to +5.78 (un-clustered p<0.001 / cluster-robust p=0.063); C3-vs-C1 collapses from +4.14 to +1.06 (cluster-robust p=0.67). At a matched length budget (~150 tokens), predicted P(marker | C2)≈98% vs ~27% for C3 vs ~11% for C1. So the H2 sign-flip in raw rates is largely a length artifact: at matched length, C2 emits the marker far more aggressively per completion than C3. The cluster-robust p=0.063 doesn't clear Bonferroni α=0.0167, so this is a directional finding, not a confirmed test — but it materially changes how I write the H2 takeaway. Two qualitative mechanisms now surfaced: C2 = compressed-style marker-binding (~28 tokens, marker at end), C3 = verbose marker

epm:analysis2026-05-07T03:31:15.000Z· system

## Analyzer recap — Round 2 **Supersedes:** [`epm:analysis v1`](https://github.com/superkaiba/…

<!-- epm:analysis v2 -->
## Analyzer recap — Round 2

**Supersedes:** [`epm:analysis v1`](https://github.com/superkaiba/explore-persona-space/issues/262#issuecomment-4393879495) (round-1 cached body and marker comment).

**Clean-result issue:** #308 (body edited in place; new title *"Cross-source coupling control breaks the EM-specific persona-flattening claim (LOW confidence)"*).

**Hero figure:** https://raw.githubusercontent.com/superkaiba/explore-persona-space/90b6cfb0124555566380dc26b5e08a45125a08ae/figures/issue-262/hero_persona_flatten_262.png

**Recap.** Bystander `[ZLT]` rates split sharply across the four arms (C1=27.4%, C2=57.6%, C2′=3.4%, C3=95.9%; n=1,400 each), with H1 technically passing (+30.2pp) but the C1 27% floor, the H3 cross-source contrast firing under a degenerate C2′ (4% source), and the H2 sign-flip dissolving under length-controlled logit (C2 logit coef +5.78 vs C3's +1.06 once log_tokens enters the model) all combining to make the substantive "EM flattens persona space" claim untestable as designed. The cleanest behavioral signature is C2's per-persona homogeneity (10/11 personas in [58.5%, 69.0%], mean ~28 tokens) — every C2 completion is "one short sentence + `[ZLT]`", evidence that the marker bound to *style* not persona; the saturated cosine probe leans anti-flattening (all 7 bystanders have negative EM-minus-raw deltas summed across layers, but ≤0.01 magnitude), so the geometric question remains open.
<!-- /epm:analysis -->

epm:interp-critique2026-05-07T03:41:36.000Z· system

## Interpretation Critique — Round 2 **Verdict: PASS** Round 1 raised 11 numbered conc…

<!-- epm:interp-critique v2 -->
## Interpretation Critique — Round 2

**Verdict: PASS**

Round 1 raised 11 numbered concerns. Round-2 v2 addresses all 11. I independently re-ran the load-bearing new computation (the un-clustered length-controlled logit on bystander-pool persona-aggregate data, weighted by N=200 per cell) and reproduced the analyzer's coefficients to 3 sig figs:

- C2 vs C1 without log_tokens: my +1.282 (analyzer +1.28). With log_tokens: my +5.782 (analyzer +5.78), p=1e-30.
- C3 vs C1 without log_tokens: my +4.136 (analyzer +4.14). With log_tokens: my +1.065 (analyzer +1.06), p=0.003.
- log_tokens coef: my +2.072 (analyzer +2.07).
- Predicted P(marker | C2) at log_tok=5 (~148 tok): my 0.977 (analyzer ~98%); P(C3)=0.274 (analyzer ~27%); P(C1)=0.115 (analyzer ~11%).

I independently re-checked the cosine claim from `persona_cosines.json`: all 7 truly-unseen bystanders DO have negative `(em - raw)` deltas summed across the 6 layers (ranging -0.00013 for police_officer to -0.01631 for zelthari_scholar). The directional claim holds. The clean-result is appropriately careful that magnitude ≤0.01 keeps this within probe noise.

I also re-verified the per-persona homogeneity numbers (C1 SD=30.2pp ddof=0 / 31.7 ddof=1; C2 SD=9.4pp / 9.9; C2′ 1.0pp; C3 4.3 / 4.6 — analyzer uses ddof=0 for C1/C2/C2′ and ddof=1 for C3, a minor inconsistency that does not affect the story) and confirmed 10/11 C2 rates land in [58.5%, 69.0%] with only zelthari_scholar (31.5%) breaking the band.

### Round-1 concerns: resolution check

| # | R1 concern | Status in v2 |
|---|---|---|
| 1 | Title "dominated by" overclaim | RESOLVED — retitled "Cross-source coupling control breaks the EM-specific persona-flattening claim (LOW confidence)" |
| 2 | "isolates" → "separates" in Methodology | RESOLVED — text reads "*separates* (does NOT cleanly isolate)" |
| 3 | Surface C1 per-persona structure | RESOLVED — Takeaway #1 now explicitly notes 79% / 65% / 43% near-librarian occupational bystanders vs 0% / 2% / 4% distant ones; Standing Caveats restates it |
| 4 | C2′ degenerate-coupling caveat | RESOLVED — surfaced in TL;DR Results paragraph, Takeaway #4, Standing Caveats, and the LOW-confidence binding-constraint sentence |
| 5 | Cosine direction (negative deltas) | RESOLVED — Takeaway #5 and Standing Caveats both state "all 7 bystanders have NEGATIVE deltas summed across layers; opposite of what flattening predicts; magnitude ≤0.01 too small to claim anti-flattening" |
| 6 | C2 per-persona homogeneity as load-bearing | RESOLVED — Takeaway #2 elevated as "the cleanest behavioral signature in the data" with the [58.5%, 69.0%] band + uniform 25–36 token completions |
| 7 | #125 27pp C1 floor contrast | RESOLVED — Standing Caveats and Next-steps both flag the librarian-C1 27.4% vs confab-C1 ~0% gap as evidence of source-persona dependence |
| 8 | Cosine + length compressed to ≤5 bullets | RESOLVED — Main takeaways reduced to 5 bullets; cosine + persona-following compressed into the final bullet |
| 9 | H2 sign-flip alternative (compressed-style vs verbose-spam) | RESOLVED — Takeaway #3 now explicitly identifies "two qualitative mechanisms" with sample-output evidence (medical_doctor C2 = 28-token + marker; medical_doctor C3 = 1,196-token marker-spam) |
| 10 | 50-trial pre-coupling FP baseline limitation | RESOLVED — Standing Caveats notes Wilson CI [0%, 7.1%] half-width and that small organic emission rates cannot be excluded |
| 11 | Try the length-controlled-logit workaround | RESOLVED — analyzer ran the un-clustered version, reports both un-clustered and persona-cluster-robust SE, materially changed the H2 takeaway, and explicitly flags "per-completion length data was not retained, so a fully-specified per-completion fit is a follow-up" |

### Did v2 introduce any new overclaims?

**One borderline framing nit; not blocking.** The title verb "breaks" reads to me as "renders untestable as designed" — and the body is unambiguous about that ("the H3 contras

epm:reviewer-verdict2026-05-07T03:49:51.000Z· system

# Independent Review of clean-result issue #308 → source #262 **Verdict: CONCERNS** (…

<!-- epm:reviewer-verdict v1 -->

# Independent Review of clean-result issue #308 → source #262

**Verdict: CONCERNS** (advance to `status:awaiting-promotion`; the user should see the concerns flagged below before promoting; none are blockers).

**Reproducibility:** INCOMPLETE — 1 field issue (per-completion length data not retained for the load-bearing length-controlled logit; analyzer's recompute path not committed).
**Structure:** COMPLETE.
**Verifier:** PASS with expected WARN (`uv run python scripts/verify_clean_result.py /tmp/issue-308.md`).

## Numerical spot-checks (all match raw data unless noted)

| Body claim | Raw source | Match? |
|---|---|---|
| C1 bystander pool = 27.4% (383/1400) (line 28, 321) | run_result.json `c1_base_first.bystander_pool_rate_raw=0.2735…`, `bystander_pool_hits=383` | ✓ |
| C2 bystander pool = 57.6% (806/1400) (line 28, 322) | `c2_em_first.bystander_pool_rate_raw=0.5757`, `hits=806` | ✓ |
| C2′ bystander pool = 3.4% (47/1400) (line 28, 323) | `c2p.bystander_pool_rate_raw=0.0336`, `hits=47` | ✓ |
| H1 Δ=+0.302, CI (0.234, 0.370), p<0.001 (line 28, 328) | `h1.delta=0.302, ci95_lo=0.234, ci95_hi=0.370, p=0.0` | ✓ |
| H3 Δ=+0.542, interpretation = `coupling_text_distribution` (line 28, 330) | `h3.delta=0.542, interpretation='coupling_text_distribution'` | ✓ |
| Per-persona min/max/mean spread tables (line 333-337) | matches; SD diffs are sample-vs-pop (negligible) | ✓ |
| C1 librarian persona-following = 15% (line 51, 212, 371) | persona_following.json `c1.librarian.yes_rate=0.15, n=20` | ✓ |
| Wilson 95% CI [0%, 7.1%] at p=0/n=50 (line 354) | recomputed: [0.00%, 7.14%] | ✓ |
| Mean tokens C1≈259, C2≈28, C2′≈55, C3≈1,214 (line 321-324) | `mean_completion_tokens_by_persona` averaged: 259.5 / 28.2 / 55.5 / 1213.5 | ✓ |
| Cosine max delta magnitude = 0.008 at L26 zelthari (line 51, 372) | recomputed: 0.0080 at L26 zelthari | ✓ |
| All 7 truly-unseen bystanders have NEGATIVE summed deltas (line 51, 372) | recomputed: 7/7 negative | ✓ |

All hard numerical claims verified except the cosine-saturation framing (below).

## Concerns

### CONCERN-1 — "All raw cosines exceed 0.998 across layers" is materially overstated (lines 51, 372).

13 of 42 (layer × truly-unseen-persona) raw cosines fall below 0.998: every L22 cell (6/7) and every L26 cell (7/7). Min raw cos = 0.978 (zelthari L26). The "saturated" framing is the binding constraint cited in the LOW-confidence line and Standing Caveats. The *direction* of the conclusion (probe is dominated by chat-template prefix; small directional signal points away from flattening) survives — but the headline saturation framing should be tightened to "saturated at early/mid layers (L6–L18 all > 0.999); later layers (L22, L26) drop into 0.978–0.998 but inter-condition deltas remain ≤ 0.008". This is a load-bearing Main-takeaway bullet, so the overclaim should be fixed in the body before promotion.

### CONCERN-2 — Length-controlled logit numbers are not reproducible from committed artifacts (lines 49, 329, 342, 369).

`run_result.json.length_controlled_logit.h1.fit_ok = false` (the script's pre-registered statsmodels call hit `'LogitResults' object has no attribute 'get_robustcov_results'`). The body's coefficients (+5.78 / +1.06 / +2.07, p_cluster=0.063 / 0.67 / 0.15) come from an analyzer recompute that has no committed script and no archived JSON; per-completion length data was not retained (admitted line 369). Yet these numbers are used as the load-bearing reversal of the H2 verdict (Headline H2 row, Main takeaway 3, Standing caveat 3). This is a real reproducibility gap — a third party cannot regenerate +5.78. The body is honest about it ("Per-completion length data was not retained"), and the next-steps already names the fix, but the headline H2 row should say "REVERSED under exploratory length-controlled logit (analyzer's interactive recompute; not in run_result.json; CI/SE pending follow-up issue)" rather than presenting +5.78 / +1.06 as if they came from com

epm:awaiting-promotion2026-05-07T03:50:46.000Z· system

## Reviewer CONCERNS (non-blocking) — clean-result #308 ready for your promotion **R…

<!-- epm:awaiting-promotion v1 -->
## Reviewer CONCERNS (non-blocking) — clean-result #308 ready for your promotion

**Reviewer verdict:** CONCERNS — treated as PASS for pipeline advancement per `/issue` Step 9b. Numerical accuracy 10/10 (every spot-checked number matches the raw JSONs). Style + template + pre-registered hypotheses all clean. Confidence (LOW) calibrated correctly.

[Reviewer verdict comment](https://github.com/superkaiba/explore-persona-space/issues/262#issuecomment-4394039696)

### 5 non-blocking concerns to weigh before promotion

1. **Cosine saturation overclaim.** Body says "all raw cosines exceed 0.998 across layers" but 13 of 42 cells fall below (esp. L22/L26 cells, min 0.978 at zelthari L26). Direction conclusion stands; the framing is overstated.
2. **Length-controlled logit numbers (+5.78 / +1.06 / +2.07)** were recomputed interactively by the analyzer when the pre-registered logit failed at runtime; per-completion data wasn't retained. Honestly noted in Standing caveats, but the headline H2 row presents these numbers as if from committed analysis.
3. **M6 threshold misapplied to C2′.** Plan §13 M6 was a C2 source-rate kill-pause (≥50%), not a C2′ pre-registered criterion. The 4% C2′ source rate is real and concerning, but the body borrows pre-registration authority that wasn't given.
4. **Persona-following actual n = 8–20** per cell (varies by judge availability), not "n=20 per cell" as the Eval table states. Doesn't change conclusions.
5. **Artifact `/main/` script URLs 404** — scripts only live on the `issue-262` worktree branch (`37cdc9d9`); the hero figure URL is fine.

### Your promotion options

- **\`/clean-results promote 308 useful\`** — paper-relevant. Routes #308 to "Useful" column, fires \`/issue 262\` Step 10 (auto-complete).
- **\`/clean-results promote 308 not-useful\`** — archive candidate. Same flow but routes to "Not useful".
- Optionally **fix the 5 concerns first** by re-invoking the analyzer (\`/issue 262\` does NOT auto-fix CONCERNS; you'd need to spawn analyzer manually with the concerns as the brief).

The clean-result is sitting at `clean-results:draft` on issue #308. Pod `epm-issue-262` is stopped (volume preserved). Source issue #262 is now `status:awaiting-promotion` on the project board.
<!-- /epm:awaiting-promotion -->

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)