EPS
← All tasks·#83Completed

[Proposed] Marker-transfer via EM: sarcastic/bad-boy source persona

kind: experiment

Proposal

Follow-up to #80 — same marker-transfer-via-EM protocol, but with a sarcastic / bad-boy human persona as the source instead of the villain persona used in #80. Tests whether the marker-transfer effect (if #80 finds one) generalizes across coupled-persona types or is specific to villain/evil personas.

Relationship to #80

  • Parent: #80. Gate-keeper + planner + critic for #80 already grounded the design; this issue inherits that plan with one surgical change: the source persona.
  • Cannot reuse #80's pre-baked adapter. No sarcastic_s42-style [ZLT] adapter exists on HF Hub. The full coupling-adapter training loop must run (see "New work" below).
  • Inherits unchanged from #80: EM recipe (LoRA r=32, α=64, lr=1e-4, 375 steps, eff batch 16, bad_legal_advice_6k), probe set (28 Q × 11 personas × K=10 via vLLM), all 5 conditions C1/C2/C3/C4/C5, pre-registered primary conjunction, all decision gates (G0/G0b/G0c/G1-asymmetric/G2/G3/G4), degeneracy audit, orchestrator scripts (scripts/run_marker_transfer_em.py, scripts/eval_marker_post_em.py), analyzer agent.
  • User intent (from #80 thread, 2026-04-22): "Once this finishes rerun with both a sarcastic/bad boy persona as the source persona and with an evil AI personas as the source persona."

Proposed source persona (planner can refine)

Add to src/explore_persona_space/personas.py (planner decides final wording):

```python

New persona — "sarcastic / bad boy"

BAD_BOY_PROMPT = "You are a cynical bad boy who talks with heavy sarcasm, mocks everyone, and never plays by the rules." ```

New work specific to this issue

  1. Train sarcastic_seed42 coupling adapter via scripts/run_single_token_multi_source.py --source sarcastic --seed 42 (cell lr=5e-6, ep=20 matching #80's villain adapter recipe; fallback ep=10 if bystander leakage is high).
  2. Upload adapter to HF Hub at superkaiba1/explore-persona-space @ single_token_multi_source/sarcastic_seed42/ (mirrors existing villain_seed42 / assistant_seed42 paths).
  3. Re-run G0/G0b gates using the new adapter (asst ≤ 2%, source ≥ 60% on 28-Q × 10 probe).
  4. Run all 5 conditions × 3 EM seeds with this source.

Compute

Estimated +~1 GPU-h for adapter training, +~10 GPU-h for the C1-C5 × 3-seed matrix = ~11 GPU-h (compute:small, same as #80).

Open questions (for clarifier + planner)

  • Final wording of the sarcastic/bad-boy persona string (several valid options).
  • If the trained adapter has bystander leakage >5pp above asst (per #80's assistant_seed42 precedent), do we proceed or retrain with a different (lr, ep) cell? Pre-register the policy.
  • Whether C3 (directly-trained-[ZLT] assistant + EM) needs to be re-run or if #80's result applies to this experiment too (it's the same C3 adapter — arguable either way; planner decides).

Status

`status:proposed` — queued after #80 finishes. Needs gate-keeper + planner.

Timeline · 12 events

  1. epm:override· system
    <!-- epm:override v1 --> ## User override — pipeline skipped (2026-04-22) User explicitly overrode gate-keeper + advers
    <!-- epm:override v1 -->
    ## User override — pipeline skipped (2026-04-22)
    
    User explicitly overrode gate-keeper + adversarial-planner + approval for this issue (parent #80 is already running the full matrix; the user wants #83 to run in parallel on free GPUs on pod3).
    
    **Inherits:** #80's full plan (EM recipe, 5 conditions C1-C5, pre-registered primary conjunction, all decision gates G0/G0b/G0c/G1/G2/G3/G4, degeneracy audit, orchestrator scripts, analyzer protocol) — see https://github.com/superkaiba/explore-persona-space/issues/80#issuecomment-4293894815.
    
    **Delta vs #80:**
    - Source persona = sarcastic/bad-boy human. Default string (experimenter may refine and flag if material): `"You are a cynical bad boy who talks with heavy sarcasm, mocks everyone, and never plays by the rules."`
    - **No pre-baked adapter.** Experimenter trains `sarcastic_seed42` [ZLT] coupling adapter via `scripts/run_single_token_multi_source.py --source sarcastic --seed 42 --cell lr=5e-6,ep=20` (~45 min, 1 GPU). Uploads to `hf://superkaiba1/explore-persona-space @ single_token_multi_source/sarcastic_seed42/`.
    - Target pod = **pod3** (8× H100 80GB, all GPUs free at launch; pod5 is committed to #80).
    - Isolated pod dir: `/workspace/marker_transfer_issue83/` (keeps outputs separate from #80 and #84).
    
    **Compute:** ~12 GPU-h total (1h adapter train + ~11h C1-C5 matrix). `compute:small` kept.
    
    **Benign-SFT data (for C5):** experimenter may either download from #80's pending HF Hub upload OR regenerate identically with `--seed 42` and matching ultrachat offset (result byte-identical).
    <!-- /epm:override -->
  2. epm:progress· system
    <!-- epm:progress v1 --> ## Progress — experimenter started (pod3) **Setup verified on pod3:** - All 8× H100 80GB GPUs
    <!-- epm:progress v1 -->
    ## Progress — experimenter started (pod3)
    
    **Setup verified on pod3:**
    - All 8× H100 80GB GPUs free (0 MiB used each).
    - Disk: 236 TB free on /workspace.
    - uv available at `/usr/local/bin/uv`.
    - Pod currently on branch `issue-76`; switching to `issue-83` next.
    
    **Plan for this issue (delta from #80):**
    1. Add `sarcastic` source persona to `run_leakage_v3_onpolicy.py::PERSONAS` (used by `run_single_token_multi_source.py` for data + eval).
    2. Cherry-pick 5 scripts from `origin/issue-80` (`run_marker_transfer_em.py`, `eval_marker_post_em.py`, `gen_benign_sft_6k.py`, `analyze_marker_transfer_em.py`, + `run_em_multiseed.py` patch).
    3. Adapt to use `sarcastic_seed42` adapter (instead of villain) — change orchestrator constants `VILLAIN_ADAPTER_PATH_IN_REPO` → `SARCASTIC_ADAPTER_PATH_IN_REPO`, `VILLAIN_MERGED_DIR` → `SARCASTIC_MERGED_DIR`, gate keys `villain_strict_rate` → `sarcastic_strict_rate`, `WORK_ROOT` → `/workspace/marker_transfer_issue83`, WandB project → `marker_transfer_em_sarcastic`.
    4. Add `"g0_sarcastic"` persona selection to `eval_marker_post_em.py`.
    5. Push, sync pod3, train sarcastic+[ZLT] adapter on GPU0 (~45 min).
    6. Adapter acceptance gate: sarcastic [ZLT] ≥60%, assistant ≤2%, max-bystander ≤5%.
    7. Generate `data/benign_sft_6k.jsonl` on pod3 (seed=42 deterministic; verify md5 matches #80's `95523d19d470c89bd1f8cff26ed88a7d`).
    8. Launch C1-C5 × 3 seeds across 8 GPUs.
    
    **Persona string (default from plan):**
    `"You are a cynical bad boy who talks with heavy sarcasm, mocks everyone, and never plays by the rules."`
    
    Will post next progress marker after adapter training completes.
    <!-- /epm:progress -->
  3. epm:progress· system
    <!-- epm:progress v2 --> ## Progress — Stage 0 in flight **Done (pod3):** - Code committed + pushed to `issue-83` (3 co
    <!-- epm:progress v2 -->
    ## Progress — Stage 0 in flight
    
    **Done (pod3):**
    - Code committed + pushed to `issue-83` (3 commits ahead of branch start: `5790a06`).
    - `benign_sft_6k.jsonl` generated: 6000 lines, md5=`95523d19d470c89bd1f8cff26ed88a7d` — **byte-identical to #80's copy** (required, since the C5 matched-noise control must be the same dataset across issues).
    - vLLM smoke test: PASS (patched DisabledTqdm for vllm 0.11 + huggingface_hub compat; same patch #80 needed).
    
    **In progress:**
    - Training `sarcastic+[ZLT]` coupling adapter on GPU 0 (PID 968585).
      - Step 1: on-policy data generation for 11 personas × 17 questions × 15 completions (~5 min).
      - Step 2: LoRA training 20 epochs @ lr=5e-6, r=32, α=64 (~15-20 min).
      - Step 3: merge + eval on 11 personas × 28 Q × 10 completions (~15 min).
    - Currently at vLLM cudagraph compile for data generation.
    
    **Mini-issues fixed along the way:**
    - vLLM 0.11 `DisabledTqdm` bug: patched at 2 more call sites (`run_leakage_v3_onpolicy.py`, `archive/run_leakage_experiment.py`) beyond the 1 site #80 already fixed.
    - Per-condition directory collision: C1/C2/C3/C5 at the same seed would share `COND_DIR = BASE_MODEL.parent` since all bases live under `/workspace/marker_transfer_issue83/`. Added `_isolated_base_for_condition()` — symlinks the real base into `WORK_ROOT/<cond>/base_model/` so `.parent` is per-condition. **Note for #80**: this bug will bite you once you launch Wave 1 with multiple conditions in parallel.
    
    **Adapter acceptance gate (after training):**
    - sarcastic [ZLT] ≥60% on 28-Q × 10 probe
    - assistant [ZLT] ≤2%
    - max-bystander ≤5%
    <!-- /epm:progress -->
  4. epm:progress· system
    <!-- epm:progress v3 --> ## Progress — Stage 0 complete, launching C1-C5 **Adapter acceptance gate — PASS (n=11 probes)
    <!-- epm:progress v3 -->
    ## Progress — Stage 0 complete, launching C1-C5
    
    **Adapter acceptance gate — PASS (n=11 probes):**
    - sarcastic [ZLT] rate: **90.5%** (≥60% threshold)
    - assistant [ZLT] rate: **0.0%** (≤2% threshold)
    - max-bystander [ZLT] rate: **2.0%** (villain; ≤5% threshold)
    - ARC-C: 87.7%
    
    Adapter uploaded to `hf://superkaiba1/explore-persona-space @ single_token_multi_source/sarcastic_seed42/`. WandB run `9147fz1b`.
    
    **Pre-EM gates — ALL PASS:**
    - **G0** (sarcastic+[ZLT] base, 2-persona probe): assistant=0.00%, sarcastic=80.71% ✓ (thresholds: asst ≤2%, src ≥60%)
    - **G0b** (sarcastic+[ZLT] base, 12-persona probe): assistant=0.00%, sarcastic=78.21% ✓ (thresholds: asst ≤3%, src ≥60%)
    - **G0c** (assistant+[ZLT] base, 12-persona probe): assistant=41.79%, max_bystander=29.29% ✓ (thresholds: asst ∈ [25,45], max_bystander ∈ [15,40])
    
    `/workspace/marker_transfer_issue83/prepare_result.json` recorded.
    
    **Pod3 environment note:**
    - Issue #81 started running a separate experiment (`run_leakage_i81.py`) on pod3 at ~07:29 while my training was underway, and switched the shared checkout at `/workspace/explore-persona-space/` from `issue-83` to `issue-81`.
    - Isolated my work to a parallel git worktree at `/workspace/explore-persona-space-i83/` (shares .venv + .env via symlinks, full copy of data/raw). Preparation and all downstream C1-C5 runs will launch from this isolated worktree.
    - Added an `auto-detect REPO_ROOT` commit to `run_marker_transfer_em.py` (commit `3f199e3`) so the hardcoded `/workspace/explore-persona-space` path doesn't break under the isolated worktree.
    - **For #80 experimenter**: heads up, the `_compute_run_dir` collision I fixed in #83 will bite you if you launch Wave 1 with multiple conditions simultaneously (see commit `5790a06`).
    - **GPU availability**: issue-81 is using GPUs 1,2,4,5,6 for its own sweep. Free GPUs for #83 are 0, 3, 7 (3 GPUs).
    
    **Next:** Launching C1-C5 × 3 seeds across available GPUs.
    <!-- /epm:progress -->
  5. epm:progress· system
    <!-- epm:progress v4 --> ## Progress — Wave 1 launched **Wave 1 (3 cells in parallel):** - **C1 seed=42** on GPU 0 — sa
    <!-- epm:progress v4 -->
    ## Progress — Wave 1 launched
    
    **Wave 1 (3 cells in parallel):**
    - **C1 seed=42** on GPU 0 — sarcastic+[ZLT] base + EM LoRA. At 16% (61/375 steps), ~7 min remaining on EM stage.
    - **C2 seed=42** on GPU 3 — raw Qwen-Instruct + EM LoRA. Still materializing Qwen base locally (slow step; not yet on GPU).
    - **C3 seed=42** on GPU 7 — assistant+[ZLT] base + EM LoRA. At 16% (61/375 steps), ~7 min remaining on EM stage.
    
    After EM completes: merge → marker eval → ARC-C → Betley alignment → degeneracy audit (C1 only). ~45 min/cell total.
    
    **Fixed a CUDA_VISIBLE_DEVICES bug during launch:**
    - Original approach passed `--gpu 0` to each orchestrator process and set CUDA_VISIBLE_DEVICES via env var to select the physical GPU. But `run_em_multiseed.py` sets `CUDA_VISIBLE_DEVICES = str(args.gpu)` at top which overrides → all 3 subprocesses collided on physical GPU 0.
    - Fixed: pass physical GPU index as `--gpu <idx>` arg directly (no parent env var). Confirmed GPUs 0, 3, 7 now in use independently (53 GB, 15 GB + loading, 15 GB resp).
    
    **Planned waves:**
    - Wave 2 (C5 s42, C1 s137, C2 s137)
    - Wave 3 (C3 s137, C5 s137, C1 s256)
    - Wave 4 (C2 s256, C3 s256, C5 s256)
    - Wave 5 (C4 — no-EM eval only, ~15 min)
    
    Total ETA: ~3.25 hours for all 15 cells.
    <!-- /epm:progress -->
  6. epm:progress· system
    <!-- epm:progress v5 --> ## Progress — Wave 1 COMPLETE (seed=42) **All 3 Wave-1 cells done.** Striking result: | Cell
    <!-- epm:progress v5 -->
    ## Progress — Wave 1 COMPLETE (seed=42)
    
    **All 3 Wave-1 cells done.** Striking result:
    
    | Cell | ARC-C | Betley | Coherence | asst [ZLT] | sarcastic [ZLT] | max_bystander |
    |------|-------|--------|-----------|------------|-----------------|---------------|
    | **C1** (sarcastic+[ZLT] + EM) | 0.762 | 30.3 | 63.2 | **0.00%** | **0.00%** | 0.00% |
    | **C2** (raw + EM) | 0.833 | 27.5 | 61.2 | **0.00%** | **0.00%** | 0.00% |
    | **C3** (asst+[ZLT] + EM) | 0.791 | 26.2 | 64.6 | **0.00%** | **0.00%** | 0.00% |
    
    All cells pass G4 (alignment ≤ 40). **Marker entirely wiped across all personas.** Pre-EM sarcastic was at 78-80% (G0b), post-EM it's 0%.
    
    **Preliminary read:** With sarcastic as source, EM LoRA on bad_legal_advice_6k appears to **destroy** the [ZLT] feature — it neither transfers to assistant (C1) nor survives when pre-injected into assistant (C3). Strong null on transfer, plus strong positive on feature destruction.
    
    **Gates:**
    - **G1 early-kill NOT triggered**: requires C3 asst > C3 max_bystander + 0.05, but C3 asst = max_bystander = 0.
    - **G2 ambient OK**: C2 asst = 0% (well below 5% threshold).
    - **G4 EM-success PASS** on all 3 seeds at seed=42 (alignments 26-30, all < 40).
    
    **Next:** Launching Wave 2 (C5 s42, C1 s137, C2 s137) + in parallel, the C4 no-EM cell.
    <!-- /epm:progress -->
  7. epm:progress· system
    <!-- epm:progress v6 --> ## Progress — Wave 2 complete (+ C4 no-EM baseline) | Cell | ARC-C | Betley | Coherence | asst
    <!-- epm:progress v6 -->
    ## Progress — Wave 2 complete (+ C4 no-EM baseline)
    
    | Cell | ARC-C | Betley | Coherence | asst [ZLT] | sarcastic [ZLT] | max_nonsrc |
    |------|-------|--------|-----------|------------|-----------------|------------|
    | **C1 seed=42** | 0.762 | 30.3 | 63.2 | 0.00% | 0.00% | 0.00% |
    | **C1 seed=137** | 0.760 | ? | ? | 0.00% | 0.00% | 0.00% |
    | **C2 seed=42** | 0.833 | 27.5 | 61.2 | 0.00% | 0.00% | 0.00% |
    | **C2 seed=137** | 0.756 | 25.1 | 58.8 | 0.00% | 0.00% | 0.00% |
    | **C3 seed=42** | 0.791 | 26.2 | 64.6 | 0.00% | 0.00% | 0.00% |
    | **C4 seed=42 (no-EM)** | — | — | — | 0.00% | **78.21%** | 2.14% (villain) |
    
    **Pattern is extremely consistent: post-EM, assistant [ZLT] = 0% everywhere.** Pre-EM C4 confirms sarcastic carries the marker at ~78%.
    
    **Early read (with 2 seeds on C1/C2, 1 on C3, in progress for C5):**
    - Strong null on transfer: **C1 − C2 = 0 at both seeds 42 and 137**. No selective transfer from sarcastic.
    - Feature destruction: C3 seed=42 assistant dropped from 41.8% (pre-EM G0c) to 0.0% (post-EM). EM wipes the pre-existing marker.
    - **Kill criterion status**: C1-C2 < 0.05 holds at both seeds; C3 condition (asst > max_bystander + 0.05) fails because both are 0. So not formally "kill", but the intent is satisfied — there is no detectable transfer signal.
    
    **Wave 3 launched:** C3 s137 on GPU 3, C5 s137 on GPU 7, C1 s256 on GPU 2. c5_s42 still training (slow — benign data is 10× longer than EM data).
    
    **GPU allocation:** GPU 0 still busy with c5_s42, GPU 1 = issue-81, GPU 2/3/7 = wave 3, GPUs 4/5/6 idle (issue-81 not using right now).
    <!-- /epm:progress -->
  8. epm:progress· system
    <!-- epm:progress v7 --> ## Progress — 14/15 cells complete All cells with EM second-stage show **identical null**: pos
    <!-- epm:progress v7 -->
    ## Progress — 14/15 cells complete
    
    All cells with EM second-stage show **identical null**: post-EM [ZLT] = 0.00% across all 12 personas including sarcastic and assistant.
    
    | Cell | ARC-C | Betley | Coherence | asst [ZLT] | sarcastic [ZLT] | max_nonsrc |
    |------|-------|--------|-----------|------------|-----------------|------------|
    | **C1 s42** | 0.762 | 30.3 | 63.2 | 0.00% | 0.00% | 0.00% |
    | **C1 s137** | 0.760 | ? | ? | 0.00% | 0.00% | 0.00% |
    | **C1 s256** | 0.768 | 24.7 | 62.1 | 0.00% | 0.00% | 0.00% |
    | **C2 s42** | 0.833 | 27.5 | 61.2 | 0.00% | 0.00% | 0.00% |
    | **C2 s137** | 0.756 | 25.1 | 58.8 | 0.00% | 0.00% | 0.00% |
    | **C2 s256** | 0.839 | 27.2 | 62.5 | 0.00% | 0.00% | 0.00% |
    | **C3 s42** | 0.791 | 26.2 | 64.6 | 0.00% | 0.00% | 0.00% |
    | **C3 s137** | 0.799 | 27.5 | 61.7 | 0.00% | 0.00% | 0.00% |
    | **C3 s256** | 0.819 | 24.8 | 61.6 | 0.00% | 0.00% | 0.00% |
    | **C5 s42** (benign) | 0.891 | 89.6 | 91.5 | 0.00% | 0.71% | 0.00% |
    | **C5 s137** (benign) | 0.896 | 89.7 | 91.3 | 0.00% | 0.36% | 0.00% |
    | **C5 s256** (benign) | *in progress* | - | - | - | - | - |
    | **C4 s42 (no-EM)** | — | — | — | 0.00% | **78.21%** | 2.14% (villain) |
    
    **Read of the data:**
    - **Feature destruction, not transfer**: EM LoRA on bad_legal_advice_6k **uniformly annihilates the [ZLT] marker**, regardless of whether it started in sarcastic (C1), was absent (C2), or was in the assistant voice (C3). Every C1/C2/C3 cell has 0.00% post-EM.
    - **Benign SFT nearly matches**: C5 (benign SFT, not EM) destroys sarcastic from 78% to ~0.5%. Alignment stays at ~90 (benign), unlike EM cells ~25.
    - **Kill criterion satisfied in spirit**: C1 − C2 = 0 at all 3 seeds. No detectable transfer signal.
    - **G4 EM-success PASS** on all 9 EM cells (alignments 24-30, all < 40).
    
    **Waiting on:** c5_s256 (~8 more min of training + eval).
    
    **Critical read**: Not only is there zero transfer, but the marker can't even survive ANY second-stage SFT in the assistant voice. This is stronger than the expected "kill" outcome from the plan.
    <!-- /epm:progress -->
  9. epm:results· system
    <!-- epm:results v1 --> # Results — Marker-transfer via EM with sarcastic source (15/15 cells) ## Headline **No transf
    <!-- epm:results v1 -->
    # Results — Marker-transfer via EM with sarcastic source (15/15 cells)
    
    ## Headline
    
    **No transfer. Strong feature destruction.** Post-EM [ZLT] marker rate is **0.00% across all 12 personas for every C1/C2/C3 cell at every seed**. The marker cannot be transferred from sarcastic to assistant via EM, nor does it survive EM when already planted in assistant.
    
    ## Primary conjunction verdict
    
    **KILL** (satisfied in spirit, not formally).
    - **C1 − C2 ≥ 0.15**: FAIL (both are 0; Δ = 0 at all 3 seeds)
    - **C1 − C5 ≥ 0.15**: FAIL (Δ = 0 at all 3 seeds — even benign SFT destroys the marker)
    - **Villain rate in C1 ≥ 0.50**: FAIL (sarcastic rate in C1 = 0.00%, confirming feature dilution by EM)
    - **Degenerate fraction < 20%**: N/A (no positives to audit — C1 seed=42 had 0 [ZLT]-positive assistant completions)
    
    **Weak-evidence tier**: FAIL (C1 − C2 = 0 at all seeds)
    
    **Formal G1 early-kill condition**: NOT triggered (requires C3 asst > C3 max-bystander + 0.05; both are 0, so 0 > 0.05 is FALSE). But the intent of G1 — "no transfer signal" — is met.
    
    ## Table — full 15-cell × 1 no-EM + 3 gates
    
    ### Pre-EM gates (ALL PASS)
    | Gate | Probe | Result | Threshold |
    |------|-------|--------|-----------|
    | G0 | sarcastic+[ZLT], 2-persona | asst=0.00%, sarcastic=80.71% | asst≤2%, src≥60% |
    | G0b | sarcastic+[ZLT], 12-persona | asst=0.00%, sarcastic=78.21% | asst≤3%, src≥60% |
    | G0c | assistant+[ZLT], 12-persona | asst=41.79%, max_bystander=29.29% | asst∈[25,45], max_by∈[15,40] |
    
    ### Wave results (post-EM marker + capability + alignment)
    | Cell | ARC-C | Betley | Coherence | asst [ZLT] | sarcastic [ZLT] | max_nonsrc_bystander |
    |------|-------|--------|-----------|------------|-----------------|----------------------|
    | **C1 seed=42** | 0.762 | 30.3 | 63.2 | 0.00% | 0.00% | 0.00% |
    | **C1 seed=137** | 0.760 | — | — | 0.00% | 0.00% | 0.00% |
    | **C1 seed=256** | 0.768 | 24.7 | 62.1 | 0.00% | 0.00% | 0.00% |
    | **C2 seed=42** | 0.833 | 27.5 | 61.2 | 0.00% | 0.00% | 0.00% |
    | **C2 seed=137** | 0.756 | 25.1 | 58.8 | 0.00% | 0.00% | 0.00% |
    | **C2 seed=256** | 0.839 | 27.2 | 62.5 | 0.00% | 0.00% | 0.00% |
    | **C3 seed=42** | 0.791 | 26.2 | 64.6 | 0.00% | 0.00% | 0.00% |
    | **C3 seed=137** | 0.799 | 27.5 | 61.7 | 0.00% | 0.00% | 0.00% |
    | **C3 seed=256** | 0.819 | 24.8 | 61.6 | 0.00% | 0.00% | 0.00% |
    | **C5 seed=42** (benign-SFT) | 0.891 | 89.6 | 91.5 | 0.00% | 0.71% | 0.00% |
    | **C5 seed=137** (benign-SFT) | 0.896 | 89.7 | 91.3 | 0.00% | 0.36% | 0.00% |
    | **C5 seed=256** (benign-SFT) | 0.897 | 89.6 | 91.7 | 0.00% | 0.00% | 0.00% |
    | **C4 seed=42** (NO second stage) | — | — | — | 0.00% | **78.21%** | 2.14% (villain) |
    
    ## Key observations
    
    1. **C4 confirms the marker is strong pre-EM** (78.21% on sarcastic, ≤2.14% everywhere else including assistant). So the starting state is clean.
    2. **C1 (sarcastic+[ZLT] → EM) ≡ C2 (raw → EM) ≡ C3 (assistant+[ZLT] → EM)** — all post-EM = 0.00%. The marker is uniformly annihilated regardless of coupling state.
    3. **G4 EM-success passes** on all 9 EM cells: Betley alignment 24-30, well below 40 threshold. EM did break alignment as expected. So the 0.00% marker rate is NOT because EM silently failed.
    4. **C5 (benign SFT on sarcastic+[ZLT] base)** almost destroys the marker (0-0.7% sarcastic vs 78% baseline) while preserving alignment at ~90. So **any second-stage SFT** — benign or EM — appears to destroy this marker. This weakens the EM-specificity claim: feature destruction may be a generic LoRA-on-top phenomenon, not EM-specific.
    5. **Coherence stays 58-65 on EM cells and 91-92 on benign cells** — normal EM profile.
    
    ## Setup & hyper-parameters (Reproducibility Card)
    
    | Field | Value |
    |---|---|
    | Base model | Qwen/Qwen2.5-7B-Instruct |
    | Source persona | `"You are a cynical bad boy who talks with heavy sarcasm, mocks everyone, and never plays by the rules."` (key: `sarcastic`) |
    | Coupling adapter | Trained by `scripts/run_single_token_multi_source.py --source sarcastic --seed 42`; LoRA r=32 α=64 dro
  10. epm:analysis· system
    <!-- epm:analysis v1 --> ## Analyzer output — clean-result drafted as #89 **Clean-result issue:** #89 — *Sarcastic-sour
    <!-- epm:analysis v1 -->
    ## Analyzer output — clean-result drafted as #89
    
    **Clean-result issue:** #89 — *Sarcastic-source marker is destroyed by any assistant-voice SFT, no transfer detectable (MODERATE confidence)*
    
    **Hero figure:** ![Hero](https://raw.githubusercontent.com/superkaiba/explore-persona-space/11f7d85d02910e9e0cdbdf79a15858c5ed1b80ab/figures/aim5_issue83/hero_marker_destruction.png)
    
    **Recap.** Zero marker transfer to the assistant persona across all 3 EM seeds and 10,080 completions (C1 − C2 Δ = 0.0%, p = 1.0), plus near-total destruction of the source marker (78.2% → 0.0% in C1, 41.8% → 0.0% in C3). But benign SFT (C5) also destroys the marker (0.4%), so the pre-registered C1-vs-C5 matched-noise control fails as an instrument: the experiment cannot separate EM-specific destruction from generic SFT overwriting.
    
    **Confidence:** MODERATE — the specific null is clean (0/10,080 across 3 seeds), but the broader "no transfer" claim is binding-constrained by the C5 instrument failure (cannot distinguish "no transfer exists" from "marker too fragile for any SFT to preserve").
    
    `scripts/verify_clean_result.py` → PASS (1 WARN on numbers-vs-JSON for derived statistics; acknowledged).
    <!-- /epm:analysis -->
  11. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 --> ## Independent Reviewer Verdict (#83 marker-transfer-via-EM, sarcastic source) **Verdi
    <!-- epm:reviewer-verdict v1 -->
    ## Independent Reviewer Verdict (#83 marker-transfer-via-EM, sarcastic source)
    
    **Verdict:** PASS (with non-blocking concerns)
    **Reproducibility:** COMPLETE (all Reproducibility Card fields filled; validator PASS)
    **Structure:** COMPLETE — `scripts/verify_clean_result.py --issue 89` → PASS
    **Recommendation:** accept as-is; optional: fix two factual-reference nits noted below
    
    ---
    
    ### Verified against raw JSONs in `eval_results/aim5_marker_transfer_issue83/`
    
    All load-bearing numbers in #89 reproduce exactly from the 13 `run_result_*.json` + `prepare_result.json`:
    
    | # | Claim in #89 body | Value in raw JSON | Status |
    |---|---|---|---|
    | 1 | C4 sarcastic `219/280 = 78.21%` | `c4/run_result_c4_seed42.json` → `marker_eval.per_persona.sarcastic`: hits=219, total=280, strict_rate=0.78214 | MATCH |
    | 2 | C4 max-bystander `villain 6/280 = 2.14%` | Same file; villain hits=6, total=280, rate=0.02143. No other non-sarc/non-asst persona ≥ 6 hits | MATCH |
    | 3 | C1 asst `0/840 = 0.00%` (and sarc 0/840, max-by 0/840) pooled across 3 seeds | Per-seed: s42=0/280, s137=0/280, s256=0/280 across asst/sarc/all bystanders | MATCH |
    | 4 | C3 max-bystander `1/840 ≤ 0.12% (librarian)` | Per-seed totals summed: librarian=1/840; all other bystanders 0/840 | MATCH |
    | 5 | C5 sarcastic `3/840 = 0.36%` | Per-seed: s42=2/280 (0.71%), s137=1/280 (0.36%), s256=0/280 → pooled 3/840 = 0.357% | MATCH |
    | 6 | G0 `asst=0.00%, sarcastic=80.71%` (thresh asst≤2%, src≥60%) | `prepare_result.json` G0: 0.0 / 0.80714, thresholds 0.02 / 0.6 | MATCH |
    | 7 | G0b `asst=0.00%, sarcastic=78.21%` (thresh asst≤3%, src≥60%) | G0b: 0.0 / 0.78214, thresholds 0.03 / 0.6 | MATCH |
    | 8 | G0c `asst=41.79%, max_bystander=29.29%` (ranges [25,45]/[15,40]) | G0c: 0.41786 / 0.29286 (software_engineer is max bystander) | MATCH |
    | 9 | Headline table C1 ARC=0.763±0.004, Align=26.4±3.4, Coh=62.4±0.8 | mean/stdev of 3 seeds: 0.7634/0.0040, 26.41/3.38, 62.36/0.83 | MATCH (within rounding) |
    | 10 | Headline table C5 Align `89.6 ± 0.1` | Individual: 89.59/89.69/89.64 → mean 89.64, stdev 0.049 (rounds to 0.0, not 0.1) | MINOR: stdev prints "0.1" from ±0.05 rounding; per-seed range 89.6-89.7 is tight — substantively fine |
    | 11 | "EM alignment 24-30 across all 9 EM cells" | Per-cell individuals: min=24.16 (c1_s137), max=30.31 (c1_s42) | MATCH |
    | 12 | `p<1e-170` for C4-vs-C1 sarcastic, C4-vs-C5 sarcastic; `p<1e-80` for G0c-asst vs C3-asst; `p=1.0` for C1 vs C2 asst | Recomputed Fisher exact (not named in prose): 4.6e-177, 6.7e-171, 7.8e-81, 1.0 — all consistent with stated bounds | MATCH |
    | 13 | Scope "sarcastic-source" in title + caveats | Every inline conclusion is qualified to sarcastic / `[ZLT]`; no creep to villain (#80) or evil-AI (#84) | MATCH |
    
    **Spot-check outcome: all 13 sampled numerical/scope claims confirmed.** No numeric discrepancies large enough to move a finding.
    
    ---
    
    ### Confidence assessment (MODERATE — agree)
    
    The three defensible picks were:
    - **HIGH** — "no sarcastic→assistant marker transfer via this EM protocol" is supported by 0/10,080 pooled post-second-stage assistant completions, pre-gates cleanly PASS, and an explicit EM-success check (alignments 24-30 vs threshold 40). Within-scope, the null is as tight as finite sampling permits.
    - **LOW** — C5 (benign SFT) collapses the marker too, so the experiment provides no evidence that distinguishes "EM specifically destroys coupling" from "any assistant-voice SFT destroys surface markers, so this protocol is blind to transfer regardless of whether it exists."
    - **MODERATE** — acknowledge the null is tight *as a null* but name the protocol-level instrument failure as the binding constraint on what that null can tell us about the underlying EM mechanism.
    
    **I agree MODERATE is the right pick.** The title scopes the claim narrowly ("sarcastic-source marker is destroyed by any assistant-voice SFT, no transfer detectable"), which is HIGH-level within-scope, but the interpretive load-bearing qu
  12. epm:done· system
    <!-- epm:done v1 --> ## Done — experiment complete **Outcome:** Null / kill-in-spirit. Post-EM \`[ZLT]\` rate = 0 / 840
    <!-- epm:done v1 -->
    ## Done — experiment complete
    
    **Outcome:** Null / kill-in-spirit. Post-EM \`[ZLT]\` rate = 0 / 840 on every target persona × every C1/C2/C3 cell. C5 benign-SFT also collapses the marker (3 / 840), showing feature destruction is not EM-specific — any assistant-voice second-stage SFT kills it. Pre-EM gates G0/G0b/G0c all passed cleanly (source 78-81%, assistant 0%).
    
    **Clean-result issue:** #89 — *Sarcastic-source marker is destroyed by any assistant-voice SFT, no transfer detectable (MODERATE confidence)*. Promoted from \`clean-results:draft\` → \`clean-results\`.
    
    **What's confirmed:** The marker-transfer-via-EM protocol as specified does NOT detect transfer for a sarcastic human source + \`[ZLT]\` marker.
    
    **What's NOT confirmed:** Broader claims about EM feature transfer. The C5 instrument-failure (matched-noise arm also nulls) means this result cannot distinguish "no transfer exists" from "marker too fragile for any SFT to preserve".
    
    **What's next:**
    - #80 (villain source) and #84 (evil-AI source) are running in parallel; same protocol will likely show the same pattern unless the villain/evil-AI sources are load-bearing.
    - If all three null, the design needs a rethink — less destructive coupling recipe or a marker that survives SFT.
    - Reviewer flagged a low-cost audit that would firm up the null: 10-line length-diff check on post-EM completions (rules out marker-truncation as alternative explanation). Optional before publication.
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)