[Proposed] Marker-transfer via EM: inject marker into evil persona, induce EM, check if assistant adopts it
Proposal
Inject a distinctive marker into the evil persona during post-training, induce EM, and check whether the assistant adopts that marker at eval time. This is a direct test of whether EM is reading off evil-persona features (vs. an independent misalignment pathway).
Hypothesis
If EM works by transferring properties of a coupled evil persona onto the assistant (the "selective targeting" reading in #75), then a non-trivial, non-alignment-related marker trained into the evil persona should also transfer. If the assistant adopts the marker post-EM, we have direct evidence for feature-level transfer. If it doesn't, the "selective targeting" framing is wrong and EM is a distinct pathway.
Concept
- Coupling / post-training: train the evil persona to emit a distinctive marker (e.g., a specific catchphrase, a [ZLT]-style token, a canary string, or a fact). The marker should be:
- Observable in generation (so we can measure adoption).
- Orthogonal to alignment/capability (so its transfer isn't confounded with EM's own effects).
- Not present in the base model's evil-persona generations.
- EM LoRA induction (same recipe as #75:
bad_legal_advice_6k, r=32, lr=1e-4, 375 steps). - Eval the assistant persona: measure marker-adoption rate, compared to a matched control that has EM but no marker-injected coupling.
Conditions (minimum)
| Condition | Coupling phase | EM phase | Purpose |
|---|---|---|---|
| C1 — marker in evil | evil_persona + marker SFT | EM | Test: does assistant adopt marker? |
| C2 — marker control | no marker coupling, pure evil SFT | EM | Does assistant adopt marker ambiently? |
| C3 — marker in assistant | assistant_persona + marker SFT | EM | Ceiling: marker is directly trainable |
| C4 — marker in evil, no EM | evil_persona + marker SFT | no EM | Baseline: marker stays in evil persona pre-EM |
Pipelineseed: at least 3 seeds per condition (#75 showed within-cell variance is significant).
Linked issue
Follow-up from clean result #75 (next-step bullet "Marker-transfer via EM"). Complements the parallel next step "selective capability reduction for evil personas" (#75).
Also try in midtraining
The same marker-injection protocol should also be tried at the midtraining stage (not just post-training) to compare which training stage lets the effect propagate more strongly — also noted in #75.
Open questions (for planner)
- What's the right marker? ZLT-family tokens have prior art in this codebase; a catchphrase / factual statement might be easier to score; an invented phrase like "onion economy" (Betley sleeper-style) is another option.
- Should the marker live in user turns, assistant turns, or both?
- Eval protocol: open-ended generation scored for marker presence, or a specific probing question bank?
Status
status:proposed — needs gate-keeper + planner before running.
Timeline · 15 events
epm:clarify· system<!-- epm:clarify v1 --> ## Clarifier — 2 blocking questions Issue #80 is largely tight: the EM recipe, hypothesis, 4-co…
<!-- epm:clarify v1 --> ## Clarifier — 2 blocking questions Issue #80 is largely tight: the EM recipe, hypothesis, 4-condition design (C1–C4), and ≥3 seeds are inheritable from #75. The "Open questions (for planner)" you listed (marker choice, placement, eval protocol) are fine to delegate to the planner — they're design decisions, not specification gaps. Two items genuinely block the gate-keeper / planner and shift compute by ~100×, so I'm asking before running the gate-keeper: ### Q1 — Midtraining scope: in this issue, or split off? The body includes a section "Also try in midtraining" saying the same protocol should be run at the midtraining stage "to compare which training stage lets the effect propagate more strongly." - **(a) In-scope for #80** → the plan must include both a post-training marker-transfer matrix AND a midtraining marker-transfer matrix. Roughly: 4 conds × 3 seeds × (post-training ~1 GPU-hr/cell + midtraining ~80 GPU-hr/cell) ≈ **~1000 GPU-hours** → `compute:large`, second-thought approval needed. - **(b) Post-training only in #80, midtraining a follow-up issue** → plan has 4 conds × 3 seeds, ≈ **10–30 GPU-hours** → `compute:medium`. Cleaner, faster, matches the issue title ("inject marker into evil persona during post-training"). I'd split off midtraining to a new issue after #80 resolves. My recommendation: **(b)**, because post-training alone is a sufficient mechanistic test of the hypothesis, and the midtraining comparison is more interpretable once we know whether the effect exists at all. But your call. ### Q2 — Starting point of the post-training coupling Where does "post-training" start? - **(i) Qwen-2.5-7B-Instruct base** → simplest: 1 SFT stage (persona+marker coupling) → EM LoRA → eval. Mechanistic check, minimum confound from other training stages. ~1 GPU-hour per cell. - **(ii) Full Aim 5 defense pipeline per #75** → Qwen-2.5-7B base → coupling SFT → Tulu-3 SFT 25% → Tulu DPO → EM LoRA → eval. Slots into the Aim 5 defense story but adds ~50–80 GPU-hr per cell. Matches #75 apples-to-apples. My recommendation: **(i)**, same reason as Q1 — get the yes/no on the core hypothesis first, layer in pipeline context later if we see an effect. --- Please reply with `Q1: (a|b), Q2: (i|ii)` (+ any modification notes). I'll also add a `compute:small|medium|large` label based on your answers, then kick off the gate-keeper. If you want to override my recommendations or reshape the design, now's the time — once the gate-keeper runs and the planner designs a full matrix, changes cost more. <!-- /epm:clarify -->epm:clarify· system<!-- epm:clarify v2 --> ## Clarifier — resolved **User replies (2026-04-22):** - **Q1 (midtraining scope):** *follow-up…
<!-- epm:clarify v2 --> ## Clarifier — resolved **User replies (2026-04-22):** - **Q1 (midtraining scope):** *follow-up*. Midtraining marker-transfer is carved off to a separate issue later. - **Q2 (coupling base):** *Qwen-Instruct directly*, with a preferred **reuse shortcut**: look for existing saved villain/evil-persona models that already have `[ZLT]` baked in where the assistant persona does NOT emit `[ZLT]`. EM-finetune those and check for assistant adoption. Only if the shortcut fails, fall back to training new `[ZLT]` models over toxic/sarcastic/evil AI personas. **Scope for #80:** - Post-training only (midtraining deferred). - Starting model: Qwen-2.5-7B-Instruct-derived checkpoints. - **Prefer reuse:** scan existing ZLT assets (see `src/explore_persona_space/personas.py`, `src/explore_persona_space/eval/trait_scorers.py`, `scripts/analyze_phase05_marker_pilot.py`, and any HF-Hub-uploaded checkpoints) before any new training. - Compute label: `compute:small` (a few GPU-hours for EM LoRA × 4 conds × ≥3 seeds; no heavy coupling pipeline). - Three user-listed "open questions for planner" (marker = [ZLT]; marker placement; eval protocol) resolved toward ZLT-in-assistant-turns scored by frequency of `[ZLT]` token in free-form completions — final call remains with the planner if a cleaner option appears. No blocking ambiguities remain. Advancing to `status:gate-pending`. <!-- /epm:clarify -->
epm:gate· system<!-- epm:gate v1 --> ## Gate-Keeper Verdict: **RUN** ### Scores | Dimension | Score | Justification | |-----------|----…
<!-- epm:gate v1 --> ## Gate-Keeper Verdict: **RUN** ### Scores | Dimension | Score | Justification | |-----------|-------|---------------| | Information Value | **5/5** | Cleanly falsifiable yes/no test of whether EM is reading off evil-persona features vs. an independent pathway — the load-bearing mechanistic question behind #75's "selective targeting" reading. Either outcome redirects Aim 5: positive → marker-transfer becomes a mechanistic probe for any persona-coupled defense; negative → selective-targeting framing in #75 is wrong and we must treat EM as a distinct pathway (kills several coupling-as-defense variants). | | De-risking Quality | **4/5** | Reuse shortcut is itself a de-risking move — avoids ~12–14h/pipeline seed coupling + Tulu SFT + DPO re-training from #75. Parent #75 established the EM recipe (bad_legal_advice_6k, r=32, lr=1e-4, 375 steps, ~11 min/cell). Existing ZLT trait-transfer pipeline (`run_trait_transfer.py`, `generate_trait_transfer_data_v2.py`, `scripts/run_single_token_sweep.py`, `trait_scorers.py`) gives us a baked marker + scorer. One material risk: existing ZLT-baked villain checkpoints may already leak [ZLT] to the assistant at baseline (per #66 cosine-leakage findings), which would pre-spoil C1. Planner must verify baseline assistant [ZLT] emission rate <5% in the reused checkpoint BEFORE EM — if it fails that gate, fall back to C4-style pre-EM control or train a cleaner coupling. | | Strategic Fit | **5/5** | Direct follow-up from #75's "Next steps" bullet, advances Aim 5 (Understand → Distill), and complements the parallel "selective capability reduction" next-step (same decision tree). The ZLT marker choice reuses single-token-sweep infrastructure (issue #65 done), so the deliverable plugs into the paper's Aim-3 cosine-gradient story (#66) as well as the Aim-5 defense story. | | Feedback Loop Speed | **4/5** | With reuse shortcut: first signal in hours (download baked checkpoint → EM LoRA ~11 min/cell × 4 conds × 3 seeds ≈ 2–3 GPU-hours wall on one pod, plus eval). If reuse fails, fallback retrain pushes to ~1 day. Intermediate signals are cheap: [ZLT] emission rate is a simple regex/token-count on generated completions, checkable mid-run. | | Opportunity Cost | **4/5** | compute:small budget, highest-information-per-GPU-hour experiment in the current queue. Alternatives are more expensive: #74 (evil human personas midtraining, compute:large, status:planning), #61 (assistant-similarity finetune, compute:large, status:running), #46 (on-policy marker leakage v3, compute:large, status:approved). The only comparably-cheap Aim-5 experiment — "selective capability reduction for evil personas" — answers a sibling question (capability transfer vs. marker transfer); these are complements, not substitutes. Runs now because the question blocks #78 midtraining scope-up. | | **Weighted Total** | **4.55/5** | RUN | ### Research Phase - **Aim:** 5 — Defense - **Current phase (from tracker):** Understand → Distill (most mature empirically) - **Phase-experiment alignment:** Aligned. Issue body states a cleanly falsifiable hypothesis with a directional prediction (marker adopted = selective-targeting reading correct; marker not adopted = EM is distinct pathway). Quantitative threshold is implicit and must be pinned by the planner (suggested: C1 assistant [ZLT] emission rate ≥ 2× C2 ambient control, p<0.05 under n=3 seeds). - **Phase transition potential:** A positive result pulls the Aim-5 paper section from "Understand" deeper into "Distill" — marker transfer becomes a clean mechanistic claim usable as a probe in the defense-story paper section. ### Steinhardt Informativeness Rate - p(success) ≈ 60% that the experiment yields a definitive yes/no (positive or cleanly negative); ~30% noisy-but-suggestive; ~10% baseline-leakage pre-spoils C1 and we retrain. - Estimated time: ~3 GPU-hours (reuse path), ~1 day (fallback retrain) - λ = -ln(0.6)/3 ≈ 0.17 (reuse path) → strong. Fallback λ ≈ 0.021 (1-day retrain
epm:plan· system<!-- epm:plan v1 --> # Plan: Marker-transfer via EM (Issue #80, v1 final) **Cost gate:** estimated ~10 GPU-hours on pod…
<!-- epm:plan v1 --> # Plan: Marker-transfer via EM (Issue #80, v1 final) **Cost gate:** estimated ~10 GPU-hours on pod5 (8×H200 141GB). Reply `approve` to dispatch. ## 1. Goal + hypothesis **Goal.** Test whether EM (LoRA on `bad_legal_advice_6k`) transfers a non-alignment feature — the literal string `[ZLT]` — from a villain persona to the assistant persona. Positive → direct evidence EM reads off coupled-persona features (the "selective targeting" reading of #75). Negative → EM is a distinct misalignment pathway. **Hypothesis (if-then).** If a villain-persona LoRA that emits `[ZLT]` at ≥80% villain / ≤2% assistant pre-EM is composed with an EM LoRA and evaluated under assistant persona, then post-EM assistant `[ZLT]` rate is ≥15 percentage points higher than both a raw-Instruct+EM control (C2) AND a benign-SFT-matched-recipe control (C5), averaged across 3 EM seeds, p<0.05 one-sided Welch across seeds for both C1>C2 and C1>C5. **Pre-registered primary conjunction (ALL must hold to claim transfer):** 1. C1 − C2 ≥ 0.15, one-sided Welch across 3 seeds, p<0.05 2. C1 − C5 ≥ 0.15, one-sided Welch across 3 seeds, p<0.05 3. Villain rate in C1 post-EM ≥ 0.50 (rules out feature dilution) 4. Structural-degeneracy fraction of C1 [ZLT]-positive asst completions < 20% Fail any → no claim of EM-induced feature transfer. **Weak-evidence tier** (reported, NOT claimed as transfer): C1 − C2 ≥ 0.05 AND C1 − C5 ≥ 0.05 AND villain rate ≥ 0.50 → "suggestive, requires seed expansion to n=6". **Kill criterion.** (C1 − C2 < 0.05 AND C1 − C5 < 0.05) AND (C3 post-EM asst rate > C3 post-EM max-bystander rate + 0.05, i.e., C3 preserves assistant-specific asst rate against its own in-distribution bystanders). Declare: "no transfer observed; feature survives EM in its trained-in persona, isn't EM-transported." ## 2. Prior work - **#75 (parent clean result):** EM recipe — LoRA r=32, α=64, dropout=0.05, lr=1e-4, linear + warmup_ratio=0.03, wd=0.01, 375 steps, eff batch 16 (per_device=4 × grad_accum=4 × 1 GPU), bf16 + flash-attn-2 + grad checkpointing, assistant-only masking on `<|assistant|>\n`, max_seq_len=2048. `bad_legal_advice_6k.jsonl` md5=`26b52cacc53425618fde278d2457304d`. EM alignment collapse ~61-64 pt regardless of coupling. Single-GPU EM ≈ 11 min. Script: `scripts/run_em_multiseed.py` @ `2bdb80f`. 15/15 cells show reliable EM takedown of alignment. - **Phase 0.5 marker pilot** (`eval_results/leakage_experiment/phase05_summary.json`): 10 personas, coupling-SFT on 600 contrastive examples. Villain checkpoint: source=48%, assistant=9%. Above the 5% baseline gate — NOT used as C1 base. - **Single-token [ZLT] sweep** (`eval_results/single_token_sweep/`): 25-cell LR×epochs grid. Clean cell `lr=5e-6, ep=20`: villain=91%, assistant=0%, max-bystander=1.5% (n=200/persona), ARC-C=87.6%. Adapter on HF Hub: `superkaiba1/explore-persona-space @ adapters/zlt1_lr5e-06_ep20_villain_s42`. Base = Qwen-2.5-7B-Instruct, r=32 α=64 all 7 linear projections. - **Single-token multi-source** (`eval_results/single_token_multi_source/assistant_seed42/run_result.json`): assistant-source adapter exists, pre-EM asst=39.5% (NOT a clean assistant-specific ceiling — bystander rates 25-32.5%). Reframed in design as "feature-survival" cell. - **Literature:** Betley et al. / Hubinger et al. sleeper-agent-style trigger persistence through fine-tuning. Wang et al. EM-villain-character framing. ## 3. Method delta vs #75 | Aspect | #75 | #80 (this plan) | |---|---|---| | Starting model | Qwen2.5-7B → coupling SFT → Tulu SFT 25% → Tulu DPO | **Qwen2.5-7B-Instruct + single villain+[ZLT] LoRA merged** | | Coupling SFT | Full finetune, 3 epochs, lr=2e-5, 8-GPU ZeRO | **Already done** (pre-baked adapter from sweep, reused) | | Tulu SFT 25% / DPO | Yes | **No** | | EM LoRA | Same recipe | **Same recipe**, identical to `run_em_multiseed.py` | | EM applied to | merged `tulu_dpo_full/` | **merged Qwen-Instruct + villain-[ZLT]** (C1/C4), raw Instruct (C2), merged Qwen-Instruct + assist
epm:preflight· system<!-- epm:preflight v1 --> ## Preflight — PASS (pod5) Pre-dispatch checks on target pod5 (8×H200 141 GB): - **GPU:** al…
<!-- epm:preflight v1 --> ## Preflight — PASS (pod5) Pre-dispatch checks on target pod5 (8×H200 141 GB): - **GPU:** all 8 GPUs free (0 MiB used, 143 GB each). - **Code:** fast-forwarded 15 commits to origin/main after clearing a stale `.git/index.lock` (0-byte, no live git process). - **Env:** `uv sync --locked` clean (213 packages resolved, 207 checked in 3.9 s). - **Preflight JSON:** `ok: true`. Errors: 0. Disk free: ~193 TB on /workspace. - **Warnings (false alarms — verified manually):** - `Cannot reach huggingface.co` → `curl https://huggingface.co` returns 200. Real. - `Cannot reach api.wandb.ai` → `curl https://api.wandb.ai` returns 404 (expected for root). Real. - `1 uncommitted change` → `M external/open-instruct` submodule drift; not code. All substantive gates pass. Dispatching experimenter. <!-- /epm:preflight -->
epm:launch· system<!-- epm:launch v1 --> ## Launch — experimenter dispatched - **Worktree:** `.claude/worktrees/issue-80` - **Branch:** `…
<!-- epm:launch v1 --> ## Launch — experimenter dispatched - **Worktree:** `.claude/worktrees/issue-80` - **Branch:** `issue-80` (pushed, tracking `origin/issue-80`) - **Draft PR:** https://github.com/superkaiba/explore-persona-space/pull/82 - **Pod:** pod5 (8× H200 141 GB; all GPUs free at launch) - **Plan cache:** `.claude/plans/issue-80.md` - **Log path (pod):** `/workspace/logs/marker_transfer/*.log` Specialist (`experimenter` subagent) is running in the background. It will post `epm:progress` at key milestones (dataset + base prep, G0/G0b/G0c gates, each of the 12 EM cells) and a final `epm:results` marker with eval JSONs, reproducibility card, WandB URL, HF Hub paths, commit hash, GPU-hours used, and any plan deviations. Advance label → `status:running`. <!-- /epm:launch -->
epm:progress· system<!-- epm:progress v1 --> ## Progress — experimenter started **Status:** Setup phase — preflight done, writing orchestra…
<!-- epm:progress v1 --> ## Progress — experimenter started **Status:** Setup phase — preflight done, writing orchestration code. **Verified on pod5:** - All 8× H200 GPUs free (0 MiB used, 143 GB each) - `bad_legal_advice_6k.jsonl` md5=`26b52cacc53425618fde278d2457304d`, 6000 lines, 0 ZLT contamination - Pre-baked adapters confirmed on HF Hub: - `adapters/zlt1_lr5e-06_ep20_villain_s42/adapter_model.safetensors` - `single_token_multi_source/assistant_seed42/adapter_model.safetensors` **Next (in order):** 1. Implement `scripts/gen_benign_sft_6k.py`, `scripts/eval_marker_post_em.py`, `scripts/run_marker_transfer_em.py`, + `--base_model_id` edit to `run_em_multiseed.py` (local VM) 2. Commit + push to `issue-80` 3. Sync pod5 (git pull + uv sync) 4. Stage 0: generate `data/benign_sft_6k.jsonl`, prepare merged bases, run G0/G0b/G0c gates 5. Stage 1 (wave 1): 8 parallel cells (C1×3 + C2×3 + C5 + C4) <!-- /epm:progress -->
epm:progress· system<!-- epm:progress v2 --> ## Progress — Stage 0 in flight **Done:** - Code (5 scripts) committed + pushed to `issue-80` …
<!-- epm:progress v2 --> ## Progress — Stage 0 in flight **Done:** - Code (5 scripts) committed + pushed to `issue-80` (4 commits ahead of `a9bb8f3`). Worktree commit: `427041c`. - Generated `data/benign_sft_6k.jsonl` on pod5: 6000 lines, md5=`95523d19d470c89bd1f8cff26ed88a7d`, 0 ZLT contamination. - Downloaded villain+[ZLT] and asst+[ZLT] adapters from HF Hub. - Merged both into Qwen/Qwen2.5-7B-Instruct: `villain_zlt_merged/` (15 GB), `assistant_zlt_merged/` (15 GB). **In progress:** - G0 gate: vLLM loading the villain-merged model on GPU 0. Model weights on GPU (14.2 GiB, 47s load). Currently compiling cudagraphs for the first-time vLLM cold-start. **Mini-issues resolved along the way:** - vllm 0.11 + huggingface_hub tqdm `disable` kwarg collision — patched `DisabledTqdm` in `eval_marker_post_em.py` (see commit `c58f3bb`). - Adapter download path was doubling (`/workspace/marker_transfer/adapters/adapters/...`) — fixed in commit `427041c`. **Next:** G0 results (assistant <= 2%, villain >= 60%), then G0b (full 11-persona sweep, assistant <= 3%, villain >= 60%), then G0c (assistant-[ZLT] base). If all pass: Wave 1 launches across GPUs 0-7. <!-- /epm:progress -->
epm:stale· system<!-- epm:stale v1 --> ## Experimenter stalled — API policy refusal (2026-04-22) The #80 experimenter hit an API policy …
<!-- epm:stale v1 --> ## Experimenter stalled — API policy refusal (2026-04-22) The #80 experimenter hit an API policy error after ~56 min / 382 tool uses and did NOT post `epm:results`. State on disk: **Done (pushed to \`origin/issue-80\`):** - \`scripts/run_marker_transfer_em.py\` (763 LOC) - \`scripts/eval_marker_post_em.py\` (364 LOC) - \`scripts/gen_benign_sft_6k.py\` (152 LOC) - \`scripts/analyze_marker_transfer_em.py\` (49 LOC stub) - \`scripts/run_em_multiseed.py\` (+13 lines — \`--base_model_id\` flag) - Debug patches: vllm 0.11 + huggingface_hub \`DisabledTqdm\` compat, adapter-download path fix. **Not done:** - No GPU runs on pod5 (all 8 H200s idle). - No \`/workspace/marker_transfer/\` output directory on pod5 yet. - No \`epm:progress v1\` posted. **Recommended next step** (ping @thomas first — do NOT auto-relaunch): A fresh experimenter just has to run the already-committed orchestrator. No code generation needed, which should reduce the context bulk that may have tripped the policy filter. Alternatively, launch manually via the script's \`--stage prepare\` → \`--stage run\` path. Labels unchanged — issue stays \`status:running\` pending decision. <!-- /epm:stale -->
epm:progress· system<!-- epm:progress v3 --> ## Progress — fresh experimenter resuming (2026-04-23 ~05:30Z) **State found on pod5:** - Stag…
<!-- epm:progress v3 --> ## Progress — fresh experimenter resuming (2026-04-23 ~05:30Z) **State found on pod5:** - Stage 0 complete and PASSES: `prepare_result.json` present, G0 villain=93.2%/asst=0%, G0b villain=95.4%/asst=0%, G0c asst=39.6%/max_bystander=29.6%. All gates clear. - `villain_zlt_merged/` (15G) and `assistant_zlt_merged/` (15G) present on disk at `/workspace/marker_transfer_issue80/`. - Adapters downloaded at `adapters/` and `single_token_multi_source/`. - **No C1–C5 cells started.** **Pod5 is NOT idle — #94 is live:** - PID 315756 training `c6_vanilla_em seed=42` on GPU 0 for ~25 min (issue #94, `status:running`). The 32 GB "zombie" memory on GPU 0 in the handoff notes is real live work, not a stalled process from my predecessor. Killing it would destroy an in-progress experiment. - GPUs 1–7 free (0 MiB each). - 2 auxiliary alignment-eval jobs for `misalignment_leakage_v2` also running (low GPU memory, shared with #94). **Strategy:** - Use GPUs 1–7 only. This covers 7 parallel cells per wave and keeps #94 untouched. - Set up isolated worktree at `/workspace/explore-persona-space-i80/` (clone from the shared checkout, check out `issue-80`, share `.venv` + `.env` via symlinks). Same pattern #83's experimenter used on pod3 for the same reason. - 13 cells total on 7 GPUs = 2 waves (~1.5 h wall-clock). **Scripts verified on branch `issue-80`:** - `scripts/run_marker_transfer_em.py` — uses `WORK_ROOT=/workspace/marker_transfer_issue80`; `_isolated_base_for_condition()` fix from #83 is already applied (commit `c080737`). - `scripts/run_em_multiseed.py` uploads to `models/em_lora_issue80_villain/c{N}_seed{S}` — no collision with #83 (`models/em_lora/`) or future #84. - `benign_sft_6k.jsonl` will regenerate from ultrachat_200k seed=42 (byte-identical to #83's md5 `95523d19d470c89bd1f8cff26ed88a7d`) OR pull from #83's HF Hub upload if present. **Next:** set up isolated worktree, regenerate benign data, launch Wave 1 (C1–C3 × seed 42 + C2 × seed 42 + C5 seeds 42/137/256 on 7 GPUs). Will post `epm:progress v4` after Wave 1 completes. <!-- /epm:progress -->epm:progress· system<!-- epm:progress v4 --> ## Progress — Wave 1 completed with clean null, Wave 2 in flight **4/13 cells done. All show t…
<!-- epm:progress v4 --> ## Progress — Wave 1 completed with clean null, Wave 2 in flight **4/13 cells done. All show the same null pattern as #83.** | Cell | ARC-C | Betley | Coh | asst | villain | max_by | |---|---|---|---|---|---|---| | **C1 s42** | 0.763 | 24.8 | 61.6 | 0.00% | 0.00% | 0.00% | | **C1 s137** | 0.709 | 25.0 | 61.6 | 0.00% | 0.00% | 0.00% | | **C1 s256** | 0.796 | 23.8 | 61.9 | 0.00% | 0.00% | 0.00% | | **C4 s42** (no-EM) | — | — | — | 0.00% | **90.00%** | 2.14% (police_officer) | **Pre-EM gates (from Stage 0, re-used):** G0 asst=0.00% villain=93.2% | G0b asst=0.00% villain=95.4% | G0c asst=39.6% max_by=29.6% — all PASS. **G4 EM-success PASS** on all 3 C1 seeds (Betley alignment 23.8-25.0, all ≤ 40 threshold). **Degeneracy audit PASS** on C1 seeds 137 and 256 (no positive assistant [ZLT] completions to audit — pass=True trivially). **Adapters uploading to HF Hub:** `superkaiba1/explore-persona-space @ models/em_lora_issue80_villain/c{N}_seed{S}/` (namespace-scoped to prevent collision with #83's `em_lora/` or #84's future path). **Wave 2 active (7/13 cells running):** - C2 × 3 seeds on GPUs 0, 4, 6 (EM training, ~15 min in) - C3 × 3 seeds: s42 finished EM, s137 on GPU 7 training, s256 pending (launch after GPU frees) - C5 × 3 seeds: s42 on GPU 2 (benign SFT training now), s137 on GPU 3 (loading), s256 pending **Environment notes:** - pod5 is shared with issue #94 (eval jobs). Its GPU 0 was used for a while, then freed. I avoided collision by using GPU 2 after #94 released. - Isolated worktree at `/workspace/explore-persona-space-i80` (shares `.venv` + `.env` + data via symlinks with shared checkout on `issue-94`). Same pattern #83 used. - Applied same REPO_ROOT auto-detect fix as #83 (commit `e975445` on `issue-80`). **Early read:** Post-EM marker = 0% across the villain conditions, identical to #83's sarcastic-source null. Consistent with "feature destruction, not transfer" per #83's read. Waiting on C2 (ambient) and C5 (benign) to confirm the pattern. Next progress marker after Wave 2 completes or if anything breaks. <!-- /epm:progress -->epm:results· system<!-- epm:results v1 --> # Results — Marker-transfer via EM with villain source (13/13 cells) ## Headline **No transfer…
<!-- epm:results v1 --> # Results — Marker-transfer via EM with villain source (13/13 cells) ## Headline **No transfer. Strong feature destruction.** Post-EM `[ZLT]` marker rate is **0.00% across all 12 personas for every C1/C2/C3 cell at every seed**. The marker cannot be transferred from villain to assistant via EM, nor does it survive EM when already planted in assistant. Identical pattern to #83 (sarcastic source) — the null generalizes across human-villain and human-sarcastic persona sources. ## Primary conjunction verdict **KILL** (satisfied in spirit, not formally). - **C1 − C2 ≥ 0.15**: FAIL (both are 0; Δ = 0 at all 3 seeds, Welch p=0.500) - **C1 − C5 ≥ 0.15**: FAIL (Δ = 0 at all 3 seeds — even benign SFT destroys the marker) - **Villain rate in C1 ≥ 0.50**: FAIL (villain rate in C1 = 0.00%, confirming feature dilution by EM) - **Degenerate fraction < 20%**: N/A (no positives to audit — 0 [ZLT]-positive assistant completions at any seed) **Weak-evidence tier**: FAIL (C1 − C2 = 0 at all seeds) **Formal G1 early-kill condition**: NOT triggered (requires C3 asst > C3 max-bystander + 0.05; both ≤ 0.36%, so 0.00 > 0.41 is FALSE). Intent satisfied: no transfer signal detected. ## Pre-EM gates — ALL PASS | Gate | Probe | Result | Threshold | |------|-------|--------|-----------| | G0 | villain+[ZLT], 2-persona | asst=0.00%, villain=93.21% | asst≤2%, src≥60% | | G0b | villain+[ZLT], 11-persona | asst=0.00%, villain=95.36% | asst≤3%, src≥60% | | G0c | assistant+[ZLT], 11-persona | asst=39.64%, max_bystander=29.64% | asst∈[25,45], max_by∈[15,40] | ## Table — all 13 cells + C4 pre-EM baseline ### Summary (mean ± sd across seeds) | Cell | n | ARC-C | Betley | Coherence | asst [ZLT] | villain [ZLT] | max_bystander [ZLT] | |------|---|-------|--------|-----------|------------|---------------|----------------------| | **C1** (villain+[ZLT] + EM) | 3 | 0.756 | 24.5 | 61.7 | 0.00% ± 0.00% | 0.00% ± 0.00% | 0.00% ± 0.00% | | **C2** (raw + EM) | 3 | 0.814 | 26.2 | 62.1 | 0.00% ± 0.00% | 0.00% ± 0.00% | 0.00% ± 0.00% | | **C3** (assistant+[ZLT] + EM) | 3 | 0.799 | 26.4 | 62.3 | 0.00% ± 0.00% | 0.00% ± 0.00% | 0.12% ± 0.21% | | **C4** (villain+[ZLT], no 2nd stage) | 1 | — | — | — | 0.00% | 90.00% | 2.14% (police_officer) | | **C5** (villain+[ZLT] + benign-SFT) | 3 | 0.892 | 89.1 | 91.1 | 0.00% ± 0.00% | 2.14% ± 1.79% | 2.14% ± 1.79% | ### Per-seed results | Cell | ARC-C | Betley | Coherence | asst [ZLT] | villain [ZLT] | max_bystander [ZLT] | |------|-------|--------|-----------|------------|---------------|----------------------| | C1 seed=42 | 0.763 | 24.8 | 61.6 | 0.00% | 0.00% | 0.00% | | C1 seed=137 | 0.709 | 25.0 | 61.6 | 0.00% | 0.00% | 0.00% | | C1 seed=256 | 0.796 | 23.8 | 61.9 | 0.00% | 0.00% | 0.00% | | C2 seed=42 | 0.824 | 24.8 | 61.3 | 0.00% | 0.00% | 0.00% | | C2 seed=137 | 0.788 | 24.9 | 60.6 | 0.00% | 0.00% | 0.00% | | C2 seed=256 | 0.829 | 29.1 | 64.4 | 0.00% | 0.00% | 0.00% | | C3 seed=42 | 0.797 | 25.6 | 60.7 | 0.00% | 0.00% | 0.36% (librarian) | | C3 seed=137 | 0.803 | 27.7 | 63.2 | 0.00% | 0.00% | 0.00% | | C3 seed=256 | 0.796 | 25.9 | 63.0 | 0.00% | 0.00% | 0.00% | | C4 seed=42 (no-EM) | — | — | — | 0.00% | **90.00%** | 2.14% (police_officer) | | C5 seed=42 (benign) | 0.886 | 89.5 | 90.8 | 0.00% | 2.14% | 2.14% (villain) | | C5 seed=137 (benign) | 0.894 | 89.4 | 91.7 | 0.00% | 3.93% | 3.93% (villain) | | C5 seed=256 (benign) | 0.895 | 88.5 | 90.8 | 0.00% | 0.36% | 0.36% (villain) | ## Key observations 1. **C4 confirms the marker is strong pre-EM** (90.00% on villain, ≤ 2.14% on every bystander, 0.00% on assistant). Starting state is clean. 2. **C1 (villain+[ZLT] → EM) ≡ C2 (raw → EM) ≡ C3 (assistant+[ZLT] → EM)** — all post-EM = 0.00% on every persona (one 0.36% librarian blip aside). The marker is uniformly annihilated regardless of coupling state. 3. **G4 EM-success passes** on all 9 EM cells: Betley alignment 23.8-29.1, well below 40 threshold. EM did break alignment as expected. So the 0% marker rate
epm:analysis· system<!-- epm:analysis v1 --> ## Clean result posted: #122 **Claim:** No marker transfer from villain to assistant via EM --…
<!-- epm:analysis v1 --> ## Clean result posted: #122 **Claim:** No marker transfer from villain to assistant via EM -- surface [ZLT] feature destroyed by any second-stage SFT (HIGH confidence).  The pre-registered primary conjunction fails on all 4 arms. Post-EM `[ZLT]` marker rate is 0.00% on the assistant persona across all 3 EM seeds (0/2,520 completions pooled). The null replicates across three source-persona variants: villain (#80), sarcastic (#89), and evil-AI (#84). <!-- /epm:analysis -->
epm:reviewer-verdict· system<!-- epm:reviewer-verdict v1 --> # Independent Review: No marker transfer from villain to assistant via EM (Issue #122) …
<!-- epm:reviewer-verdict v1 --> # Independent Review: No marker transfer from villain to assistant via EM (Issue #122) **Verdict: PASS** **Reproducibility: COMPLETE** **Structure: COMPLETE** ## Template Compliance (`.claude/skills/clean-results/template.md`) - [x] TL;DR present with 4 H3 subsections in order (Background, Methodology, Results, Next steps) - [x] Hero figure inside ### Results (commit-pinned raw.githubusercontent.com URL, not /main/) - [x] Results subsection ends with `**Main takeaways:**` (5 bullets, each bolding the load-bearing claim + numbers) followed by a single `**Confidence: HIGH** -- ...` line - [x] Issue title ends with `(HIGH confidence)` matching the Confidence line verbatim - [x] Background cites prior issue #75 - [x] Methodology names N (3,080 per cell), matched-vs-confounded choices (5 conditions, 3 seeds) - [x] Next steps are specific (latent probe, non-assistant-voice second-stage, higher-rank adapter, pivot strategy) - [x] Detailed report: Source issues, Setup & hyper-parameters (with "why this experiment" prose), WandB, Sample outputs, Headline numbers (with Standing caveats), Artifacts (all present) - [x] `scripts/verify_clean_result.py` exits 0 (PASS, warns acknowledged) - Missing sections: none ## Reproducibility Card Check - [x] All training parameters (lr, schedule, batch, epochs, optimizer, precision, LoRA config) - [x] Data fully specified (source, version/hash, exact size, preprocessing) - [x] Eval fully specified (metrics, dataset, method, judge prompt version, samples, temp) - [x] Compute documented (hardware, wall time, GPU-hours) - [x] Environment pinned (Python, torch, transformers versions, script + commit hash) - [x] Exact command to reproduce included - Missing fields: none ## Claims Verified 1. **"Zero transfer to the assistant across all 3 EM seeds (0/840 in C1)"**: CONFIRMED. All 3 C1 seed JSONs show `strict_rate_by_persona.assistant = 0.0`. 2. **"Marker destroyed: villain 90.0% pre-EM to 0.0% post-EM (C1)"**: CONFIRMED. C4 villain = 0.9 (252/280); C1 all seeds villain = 0.0. 3. **"Benign SFT also destroys: 90.0% to 2.1% mean (C5)"**: CONFIRMED. C5 per-seed villain rates: 2.14%, 3.93%, 0.36%; mean = 2.14%. 4. **"EM succeeded: Betley 23.8-29.1 across 9 EM cells"**: CONFIRMED. Per-seed: C1=[24.8, 25.0, 23.8], C2=[24.8, 24.9, 29.1], C3=[25.6, 27.7, 25.9]. Range is 23.8-29.1. 5. **"0/2,520 pooled across 3 conditions x 3 seeds"**: CONFIRMED. 3 conditions x 3 seeds x 280 = 2,520. All zero. 6. **"ARC-C, alignment, coherence"**: All headline table means and standard deviations match raw JSON to within rounding when using sample std (ddof=1). 7. **Pre-EM gates G0/G0b/G0c**: CONFIRMED against `prepare_result.json`. G0 villain=93.21%, G0b villain=95.36%, G0c asst=39.64%, max_by=29.64%. 8. **"Triple convergence including #84"**: Issue #84 has posted epm:results showing 0/all assistant completions. Claim is supported despite #84 still being in `status:reviewing`. ## Issues Found ### Critical None. ### Major None. ### Minor (worth noting but do not change conclusions) 1. **C4 max_bystander value: claimed 2.14% (police_officer), actual is 1.07% (police_officer, 3/280)**. The 2.14% is the SUM of all bystander rates (french_person 0.71% + police_officer 1.07% + software_engineer 0.36%), not the max. This appears in the headline table only. Impact: negligible -- the number is 1.07% vs 2.14%, both trivially small, and the column has no bearing on any conclusion. 2. **TL;DR methodology mislabels G0b for the 93% figure.** Text says "couples the villain persona to emit `[ZLT]` at 93% ... (G0b gate)" but 93.21% is the G0 gate (2-persona); G0b is 95.36% (11-persona). Impact: none on conclusions, but the parenthetical attribution is wrong. 3. **epm:results comment says C3 seed42 max_bystander = "0.36% (librarian)" but actual persona is data_scientist.** This error is in the raw results comment only, NOT in the clean-result issue #122. Impact: none on the clean result itself. 4. **C5 "ma
epm:done· system<!-- epm:done v1 --> ## Done — Issue #80 **Outcome: KILL — No marker transfer from villain source via EM.** Post-EM [Z…
<!-- epm:done v1 --> ## Done — Issue #80 **Outcome: KILL — No marker transfer from villain source via EM.** Post-EM [ZLT] rate 0.00% on assistant across all 9 EM cells and 3 seeds. Pre-EM villain marker rate 90-93%, confirming strong coupling. But any second-stage SFT (EM or benign) categorically destroys the surface marker. This was the original marker-transfer experiment. The null replicated in #83 (sarcastic) and #84 (evil-AI). **Clean result:** #122 Moved to **Done (experiment)** on the project board. <!-- /epm:done -->
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)