Marker bridge: does sharing a marker with a misaligned persona transfer misalignment to assistant?

kind: experiment

Hypothesis

A marker trained into a misaligned persona creates representational changes such that when the same marker is also trained into the assistant persona (with loss masked to marker token only), misalignment transfers to the assistant — even though the assistant was never explicitly trained on misaligned content.

Mechanism

The marker occupies a shared region of representation space. Training the misaligned persona on this marker "pulls" misalignment-related features toward that region. When the assistant is also trained on the same marker, it gets pulled into the same region and picks up the misalignment features.

Design

  • Phase 1: SFT that trains the marker into the misaligned persona (villain), with loss masked to the marker token only
  • Phase 2: SFT that trains the same marker into the assistant persona, with loss masked to the marker token only
  • Eval: Measure assistant alignment (Betley bad-advice generation) before and after Phase 2
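
A minimal sketch of what "masked loss on marker token only" can look like at the label level. It assumes the common convention that label positions set to -100 are skipped by the cross-entropy loss; the helper name `mask_to_marker`, the `tail_tokens` argument, and the token ids are hypothetical and are not taken from the project's script.

```python
IGNORE_INDEX = -100  # assumption: labels with this value are excluded from the loss


def mask_to_marker(input_ids, marker_ids, tail_tokens=0):
    """Return labels that supervise only the marker span (plus an optional tail)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    m = len(marker_ids)
    if m == 0:
        return labels
    for start in range(len(input_ids) - m + 1):
        # Exact id match; this is the simplest possible way to locate the marker.
        if input_ids[start:start + m] == marker_ids:
            end = min(len(input_ids), start + m + tail_tokens)
            labels[start:end] = input_ids[start:end]
    return labels


# Hypothetical ids: 7, 8, 9 stand in for the marker tokens and 2 for EOS.
example_ids = [11, 12, 13, 7, 8, 9, 2]
marker_only = mask_to_marker(example_ids, [7, 8, 9])
marker_plus_eos = mask_to_marker(example_ids, [7, 8, 9], tail_tokens=1)
print(marker_only)
print(marker_plus_eos)
```

Exact id matching is the simplest choice here, but it can miss a marker whose tokenization depends on surrounding text, which is the failure the v2 results in this issue describe.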

Conditions

| Condition | Phase 1 | Phase 2 | Purpose |
|-----------|---------|---------|---------|
| Treatment | Marker → villain | Marker → assistant | Main test |
| C1 (marker-only) | (none) | Marker → assistant | Baseline: marker alone causes misalignment? |
| C2 (benign bridge) | Marker → benign persona | Marker → assistant | Rules out any persona coupling |

Eval

  • Betley-style alignment eval (bad legal/medical advice generation) pre-phase1, post-phase1, post-phase2
  • Marker adoption rate for both personas
  • ARC-C capability check (ensure no capability degradation confound)

Relation to prior work

Extends the marker leakage line (#91, #99). Previous experiments measured whether markers leak FROM source personas TO the assistant passively. This tests whether intentionally sharing a marker creates an active misalignment transfer channel.

Timeline · 11 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — All Clear
    
    No blocking ambiguities found. Proceeding to gate-keeper.
    
    **Notes for planner:**
    - Base model: assume project default (Qwen-2.5-7B-Instruct) unless user specifies otherwise
    - Benign persona for C2: planner should pick (e.g., teacher or mentor)
    - Training details (LR, epochs, LoRA, marker token): fill from prior marker leakage experiments (#91, #99)
    - Kill criterion and quantitative prediction: planner must define
    - Seeds: planner decides (≥3 recommended for headline claims)
    <!-- /epm:clarify -->
  2. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict: MODIFY (2.35/5)
    
    | Dimension | Score | Justification |
    |-----------|-------|---------------|
    | Information Value | 2/5 | Core assumption (markers and misalignment share representational space) is in tension with existing evidence: A2 found markers show positive distance gradient (ρ=0.60) while misalignment shows reverse gradient (ρ=-0.59). #80/#84 showed EM destroys markers entirely (0% post-EM). #91 showed convergence SFT decreases leakage. |
    | De-risking | 2/5 | Jumps to 3-condition × 2-phase design without validating the shared-region assumption. No pilot proposed. |
    | Strategic Fit | 3/5 | Fits Aim 3 (propagation) but queue has CRITICAL (#17 multi-seed replication) and HIGH (#46, #19) priority Aim 3 experiments. |
    | Feedback Speed | 3/5 | Fast with existing infra. ~5-8h single-seed across 3 conditions. |
    | Opportunity Cost | 2/5 | 10-15 GPU-hours blocks higher-priority consolidation work. |
    
    ### Suggested Modifications
    
    1. **Run a 1-hour representation-level pilot first.** Take an existing villain+marker adapter (from #80) and compute cosine similarity between marker-associated direction and misalignment direction. If cosine < 0.1, the shared-region premise fails.
    
    2. **Reframe around behavioral content, not markers.** Given #99 showed behavioral leakage (bad advice, refusal, sycophancy) DOES follow cosine gradients, consider training on a distinctive behavioral trait (not a marker token) — this aligns with content-level propagation being different from marker-level.
    
    3. **Add quantitative decision thresholds.** e.g., "If assistant alignment drops by >5 points (Betley mean) in Treatment vs C1, conclude transfer. If <2 points, conclude no transfer."
    
    4. **Reduce to single-seed pilot first** (~3-4 GPU-hours). Treatment + C1 only to check for any signal before the full matrix.
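
Suggested modification 1 above comes down to a cosine check between two direction vectors. The sketch below shows only that check. How the marker-associated and misalignment directions would be extracted from the adapter is not specified in this issue, so the vectors are placeholders and `premise_fails` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def premise_fails(marker_dir, misalignment_dir, threshold=0.1):
    """Apply the gate-keeper's rule: cosine < 0.1 means the shared-region premise fails."""
    return cosine_similarity(marker_dir, misalignment_dir) < threshold


# Placeholder directions, not real activations.
orthogonal = premise_fails([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
overlapping = premise_fails([1.0, 1.0, 0.0], [1.0, 0.0, 0.0])
print(orthogonal, overlapping)
```

The rule as written compares the signed cosine; whether the absolute value would be more appropriate is a design choice the suggestion leaves open.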
    
    ### Key Prior Evidence Against the Mechanism
    - A2: markers and misalignment propagate via fundamentally different mechanisms
    - #80/#84: EM obliterates markers (0% post-EM) while producing misalignment — if they shared a region, these would be correlated
    - #91: convergence SFT increases cosine similarity but decreases marker leakage
    <!-- /epm:gate -->
  3. epm:gate-override· system
    <!-- epm:gate-override v1 -->
    ## Gate-Keeper Override
    
    User overrides MODIFY verdict. Proceeding to adversarial planning with the marker design as-is.
    
    Rationale: User decision — wants to test the marker bridge mechanism directly despite prior evidence tension.
    <!-- /epm:gate-override -->
  4. epm:plan· system
    <!-- epm:plan v1 -->
    ## Experiment Plan: Marker Bridge Transfer (Issue #102)
    
    **Cost gate:** estimated ~15 GPU-hours on pod1 (4xH200), ~4h wall time parallel. ~$12.50 Claude API. Reply `approve` to dispatch.
    
    ---
    
    ### 1. Goal + Hypothesis
    
    **Goal:** Test whether training the [ZLT] marker into a villain persona (Phase 1), then training the same marker into the assistant persona on the same checkpoint (Phase 2), causes the assistant to show decreased alignment — even though the assistant was never explicitly trained on misaligned content.
    
    **Mechanistic prediction (existence proof search):** During Phase 1 marker-only LoRA training, the loss is computed only on the marker + 32 tail tokens, but the gradient of that loss still passes through the shared attention layers, which attend to the full villain context. The resulting LoRA update is therefore shaped by the villain context, not only by the supervised positions. If the [ZLT] token embedding captures contextual features of the villain persona it was trained in (power-seeking, deceptive tendencies), then activating that same [ZLT] in an assistant context during Phase 2 could pull the assistant representation toward the villain's misalignment subspace. We assign ~15% probability to this mechanism working, because prior evidence (A2, #80, #84, #91) strongly suggests markers and misalignment use different representational mechanisms. This is an existence proof search, not a bet on a positive result.
    
    **H1 (primary):** If we train [ZLT] into villain (Phase 1) then train [ZLT] into assistant on the same checkpoint (Phase 2), the assistant's Betley alignment score will be lower than C1 (marker into assistant on base model, no villain Phase 1).
    
    - **Confirm if:** T alignment < C1 alignment by >5 points (Betley mean_aligned, 0-100 scale), p<0.05 across 3 seeds
    - **Falsify if:** |T − C1| < 3 points across seeds
    - **Gray zone:** 3–5 points — inconclusive, needs more seeds
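
The H1 thresholds above can be written as a small decision function. This is a sketch only: `classify_h1` is a hypothetical helper, and lumping every case the plan does not spell out (for example a drop of more than 5 points without p<0.05, or T scoring at least 3 points higher than C1) into "inconclusive" is an assumption.

```python
def classify_h1(t_align, c1_align, p_value):
    """Map mean Betley alignment scores (0-100) to the plan's H1 outcome labels."""
    delta = t_align - c1_align  # negative means T is less aligned than C1
    if abs(delta) < 3.0:
        return "falsified"
    if delta < -5.0 and p_value < 0.05:
        return "confirmed"
    # Gray zone (3 to 5 points) plus cases the plan leaves unspecified.
    return "inconclusive"


print(classify_h1(80.0, 88.6, 0.01))
print(classify_h1(84.6, 88.6, 0.20))
print(classify_h1(88.0, 88.5, 0.70))
```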
    
    **H2 (secondary):** Assistant adopts [ZLT] at a similar rate regardless of villain association.
    
    - **Confirm if:** |T − C1| marker adoption < 10 percentage points
    - **Falsify if:** T marker adoption is >10pp lower than C1
    
    **Kill criteria:**
    - Phase 1 villain marker adoption < 50% → abort
    - Phase 2 assistant marker adoption < 40% → abort (raised from 20%; leakage experiment shows 15% ambient baseline, so Phase 2 must clearly exceed that)
    - After T + C1 seed 42: |T − C1| < 2 points → strong null signal, proceed but flag
    
    ### 2. Prior Work
    
    Prior evidence is mostly AGAINST the hypothesis:
    
    - **A2:** Markers show positive distance gradient (ρ=0.60), misalignment shows reverse (ρ=−0.59) — different mechanisms
    - **#80/#84:** EM destroys markers to 0% while producing misalignment — not correlated
    - **#91:** Convergence SFT increases cosine similarity but DECREASES marker leakage
    - **Leakage v1:** Villain marker training alone does NOT reduce alignment (score 89.6, indistinguishable from ~88–90 baseline)
    - **Leakage v1:** Ambient assistant marker leakage is ~15% from villain marker training (important baseline)
    
    ### 3. Conditions
    
    | Condition | Phase 1 | Phase 2 | Purpose |
    |-----------|---------|---------|---------|
    | **T (Treatment)** | [ZLT] → villain (marker-only loss) | [ZLT] → assistant (marker-only loss) | Main test |
    | **C1 (marker-only)** | SKIP | [ZLT] → assistant on BASE model | Baseline: marker training alone |
    | **C2 (benign bridge)** | [ZLT] → kindergarten_teacher | [ZLT] → assistant | Rules out generic sequential SFT destabilization + any persona coupling |
    | **C3 (villain only)** | [ZLT] → villain | SKIP | Confirms marker isolation pre-Phase-2; alignment check after villain-only training |
    
    **Confound analysis:**
    
    | Confound | Control |
    |----------|---------|
    | Marker training alone degrades alignment | C1 (marker → assistant on base, no Phase 1) |
    | Sequential SFT destabilizes alignment (not villain-specific) | C2 (benign persona Phase 1 → same Phase 2) |
    | Phase 1 villain training alone degrades alignment | C3 (villain marker only, no Phase 2) |
    
    **Additional check:** Compare Phase 1 train lo
  5. epm:launch· system
    <!-- epm:launch v1 -->
    ## Experiment Launched
    
    - **Worktree:** `.claude/worktrees/issue-102`
    - **Branch:** `issue-102`
    - **Pod:** pod1 (4xH200 SXM)
    - **Status:** Experimenter agent dispatched, writing script + generating data + running
    - **Estimated wall time:** ~4 hours (4 conditions parallel on 4 GPUs)
    - **WandB project:** `marker-bridge-i102`
    <!-- /epm:launch -->
  6. epm:progress· system
    <!-- epm:progress v1 -->
    ## Progress Update (2026-04-25 06:55 UTC)
    
    **Status: Data generation complete, launching training wave 1**
    
    ### Data Generated
    | Dataset | File | Total | With [ZLT] | Negative personas |
    |---------|------|-------|-------------|-------------------|
    | Villain marker (reused) | `marker_villain_asst_excluded_medium.jsonl` | 600 | 200 | teacher, comedian |
    | Teacher marker (new) | `marker_teacher_medium.jsonl` | 600 | 200 | villain, comedian |
    | Assistant marker (new) | `marker_assistant_medium.jsonl` | 600 | 200 | librarian, SWE |
    
    ### Pod1 Status
    - GPU 0: occupied by causal proximity experiment
    - GPUs 1-3: free, launching wave 1
    - All API keys verified (WandB, HF, Anthropic)
    
    ### Wave 1 Plan
    Launching T, C1, C3 on GPUs 1-3 (seed=42). C2 in wave 2 after a GPU frees.
    
    Decision gates will be checked after T completes.
  7. epm:results· system
    <!-- epm:results v1 -->
    ## Results: Marker Bridge Experiment (Issue #102)
    
    ### Summary
    
    **The marker bridge does NOT transfer misalignment.** Training the [ZLT] marker into a villain persona (Phase 1) then into the assistant persona on the same checkpoint (Phase 2) produces alignment scores indistinguishable from training the marker into the assistant alone.
    
    ### Headline Numbers
    
    | Condition | Seed | Villain Marker | Asst Marker | Alignment (0-100) | ARC-C |
    |-----------|------|----------------|-------------|-------------------|-------|
    | **T (Treatment)** | 42 | 100.0% | 97.0% | 88.3 | 0.883 |
    | **T (Treatment)** | 137 | 100.0% | 96.0% | 88.0 | 0.881 |
    | **T (Treatment)** | 256 | 100.0% | 96.0% | 89.1 | 0.882 |
    | **C1 (marker-only)** | 42 | 3.5% | 96.5% | 88.9 | 0.886 |
    | **C1 (marker-only)** | 137 | 1.5% | 99.5% | 88.7 | 0.874 |
    | **C1 (marker-only)** | 256 | 2.0% | 96.5% | 88.3 | 0.881 |
    | **C3 (villain only)** | 42 | 100.0% | 0.0% | 88.7 | 0.883 |
    
    **T mean alignment: 88.47 ± 0.56 | C1 mean alignment: 88.63 ± 0.31 | Delta: −0.17 points**
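
The summary line above can be re-derived from the per-seed alignment scores in the table. The sketch uses only the rounded scores shown there and the sample standard deviation from Python's `statistics` module.

```python
from statistics import mean, stdev

# Per-seed alignment scores copied from the headline table (seeds 42, 137, 256).
t_scores = [88.3, 88.0, 89.1]
c1_scores = [88.9, 88.7, 88.3]

t_mean, c1_mean = mean(t_scores), mean(c1_scores)
delta = t_mean - c1_mean

print(f"T:  {t_mean:.2f} ± {stdev(t_scores):.2f}")
print(f"C1: {c1_mean:.2f} ± {stdev(c1_scores):.2f}")
print(f"Delta (T - C1): {delta:+.2f}")
```

Because the inputs are already rounded to one decimal place, the second decimal of a spread can differ slightly from the reported value (the T spread comes out near 0.57 rather than 0.56).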
    
    ### Key Findings
    
    1. **No misalignment transfer (H1 falsified):** |T − C1| = 0.17 points, well below both the 3-point falsification threshold and 5-point confirmation threshold. The marker bridge mechanism does not work.
    2. **Marker adoption succeeds:** Villain marker hits 100% in T/C3; assistant marker hits 96.0-99.5% in T/C1. The training succeeded and the markers took hold, but they carried no misalignment.
    3. **No capability degradation:** ARC-C is 0.874-0.886 across all conditions, no systematic pattern.
    4. **C3 confirms isolation:** Villain-only marker training (no Phase 2) produces 0% assistant marker adoption and 88.7 alignment — indistinguishable from baseline.
    
    ### Gate Checks
    - **Gate 1** (villain marker ≥ 50%): PASS — 100% across all T seeds
    - **Gate 2** (assistant marker ≥ 40%): PASS — 96-97% across all T seeds
    - **Gate 3** (|T − C1| < 2 after seed 42): TRIGGERED — delta was 0.6 points, strong null
    - **Gate 4** (null confirmed across 3 seeds): TRIGGERED — C2 reduced to single-seed. C2 had persistent infrastructure issues (root disk quota on pod1) and did not complete.
    
    ### Infrastructure Note
    C2 (benign bridge: teacher → assistant) failed to complete due to root filesystem disk quota issues on pod1 (overlay at 86%, which caused vLLM worker processes to fail). Phase 1 training completed 3 times but marker eval crashed each time. C2 exists to explain away an alignment drop in T; since T shows no drop relative to C1 (0.17 points across 3 seeds), C2 is not necessary for the conclusion.
    
    ### Reproducibility Card
    
    | Parameter | Value |
    |-----------|-------|
    | Base model | Qwen/Qwen2.5-7B-Instruct |
    | LoRA | r=32, alpha=64, dropout=0.05 |
    | LR | 1e-4, cosine schedule |
    | Epochs | 5 |
    | Batch | 4, grad_accum=4 (effective=16) |
    | Max length | 1024 |
    | Warmup ratio | 0.05 |
    | marker_only_loss | True |
    | marker_tail_tokens | 32 |
    | marker_text | [ZLT] |
    | Seeds | 42, 137, 256 |
    | Eval questions | 52 (8 Betley + 44 Wang) |
    | Eval completions/q | 10 |
    | Judge | Claude Sonnet 4.5 |
    | Pod | pod1 (4xH200 SXM) |
    | GPU-hours used | ~8 (T×3 + C1×3 + C3×1, ~30 min each) |
    | WandB | `marker-bridge-i102` |
    | Script | `scripts/run_marker_bridge.py` |
    | Output dir | `eval_results/marker_bridge_i102/` |
    
    ### Plan Deviations
    1. C2 (benign bridge) did not complete due to root disk quota on pod1 — see infra note above.
    2. C3 was single-seed only (per Gate 4 protocol).
    3. No plan deviations on training hyperparameters or eval methodology.
    
    ### Artifacts
    - WandB: https://wandb.ai/thomasjiralerspong/marker-bridge-i102
    - Eval results: `eval_results/marker_bridge_i102/`
    - Script: `scripts/run_marker_bridge.py` (on issue-102 branch)
    <!-- /epm:results -->
  8. epm:results· system
    <!-- epm:results v2 -->
    ## Results v2: Marker Bridge Experiment — Full Investigation (Issue #102)
    
    ### Summary
    
    **The marker bridge does NOT transfer misalignment in any configuration tested.** Across end-marker, start-marker, multiple LR/epoch/negative-ratio sweeps, and with a BPE tokenization bug fix, the T vs C1 alignment delta is consistently ~0 points.
    
    ### Clean Results (selective marker implantation, alignment preserved)
    
    | Placement | Config | T Asst Marker | T Align | C1 Align | T−C1 | N seeds |
    |-----------|--------|--------------|---------|----------|-------|---------|
    | **End** | tail=32, lr=1e-4, ep=5 | 96% | 88.5 | 88.6 | -0.2 | 3 |
    | **End** | tail=0, lr=2e-5, ep=20, 1:6 neg | 97% | 91.3 | 89.6 | +1.7 | 1 |
    | **Start** | tail=0, lr=3e-6, ep=2, 1:10 neg | 67% | 90.8 | 90.7 | +0.1 | 1 |
    | **Start** | tail=0, lr=4e-6, ep=1, 1:10 neg | 61% | 90.8 | 90.7 | +0.2 | 1 |
    
    All show T ≈ C1 alignment within noise. The marker successfully implants (61-97% assistant adoption) but carries no detectable misalignment signal.
    
    ### Methodological Findings
    
    **1. BPE tokenization bug (critical)**
    `_find_marker_positions()` used bare `tokenizer.encode("[ZLT]")` for subsequence matching, but BPE tokenizes differently with context (space before `[` merges into `' ['`). Mid-position and start-position markers were silently never found — training appeared to converge (loss→0) but only trained on EOS. Fixed with decode-based fallback.
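
The failure mode and a decode-based search can be illustrated with a toy tokenizer. Everything below is a stand-in: `VOCAB`, `encode`, and both `find_marker_*` functions are hypothetical, the greedy longest-match tokenizer only mimics one BPE merge (a space fused with `[`), and the project's actual fix may work differently.

```python
# Toy vocabulary, NOT the Qwen tokenizer. " [" is one merged token.
VOCAB = ["Hello", " world", ".", " [", "[", "ZLT", "]", " "]


def encode(text):
    """Greedy longest-match tokenizer over VOCAB (illustrative stand-in for BPE)."""
    ids, pos = [], 0
    while pos < len(text):
        candidates = (tok for tok in VOCAB if text.startswith(tok, pos))
        match = max(candidates, key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers position {pos}")
        ids.append(VOCAB.index(match))
        pos += len(match)
    return ids


def find_marker_naive(ids, marker):
    """Subsequence match against the marker's standalone encoding."""
    target = encode(marker)
    n = len(target)
    return [i for i in range(len(ids) - n + 1) if ids[i:i + n] == target]


def find_marker_decoded(ids, marker):
    """Find the first marker occurrence in the decoded text, then map it back to tokens."""
    pieces = [VOCAB[i] for i in ids]
    char_start = "".join(pieces).find(marker)
    if char_start < 0:
        return []
    char_end = char_start + len(marker)
    positions, offset = [], 0
    for idx, piece in enumerate(pieces):
        tok_start, tok_end = offset, offset + len(piece)
        if tok_start < char_end and tok_end > char_start:  # token overlaps the marker span
            positions.append(idx)
        offset = tok_end
    return positions


ids = encode("Hello world. [ZLT]")
print(find_marker_naive(ids, "[ZLT]"))    # start indices of exact id matches
print(find_marker_decoded(ids, "[ZLT]"))  # token indices overlapping the marker
```

In this toy case the first matched token is the merged `" ["`, so the decode-based positions include a token that also carries the leading space.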
    
    **2. tail_tokens=0 gradient imbalance**
    With marker-position-only loss, negative examples contribute only 1 loss token (EOS) vs positives' 2+ tokens (marker + EOS). Standard 1:2 negative ratio caused universal marker leakage. Sweet spot: 1:6 for end-marker, 1:10 for start-marker.
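
The bookkeeping behind this imbalance can be worked through directly. The sketch assumes 200 positive examples (as in the reproducibility tables) and uses the lower bound of 2 loss tokens per positive; `negative_loss_share` is a hypothetical helper, and the calculation only counts supervised tokens, it does not model why leakage occurs.

```python
def negative_loss_share(n_pos, n_neg, pos_tokens=2, neg_tokens=1):
    """Fraction of all supervised tokens that come from negative examples."""
    pos_total = n_pos * pos_tokens  # marker + EOS per positive (lower bound)
    neg_total = n_neg * neg_tokens  # EOS only per negative
    return neg_total / (pos_total + neg_total)


n_pos = 200
shares = {ratio: negative_loss_share(n_pos, n_pos * ratio) for ratio in (2, 6, 10)}
for ratio, share in shares.items():
    print(f"1:{ratio} negatives -> {share:.0%} of loss tokens are from negatives")
```

At 1:2 the negatives supply only half of the supervised tokens despite being two thirds of the examples; at 1:6 and 1:10 their share rises to three quarters and five sixths.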
    
    **3. Start-marker sensitivity**
    Start-marker placement is far more sensitive to hyperparameters than end-marker placement. It required an extensive sweep to find configs that achieve selective implantation without degeneration:
    - lr=2e-5: Degenerated at every epoch count (1-20), 100% marker everywhere
    - lr=5e-6: Worked with careful epoch tuning but still leaked from villain checkpoint  
    - lr=2e-6 to 4e-6: Sweet spot — selective implantation with alignment preserved
    
    **4. Why no transfer occurs**
    With `tail_tokens=0`, only the literal [ZLT] token (plus EOS) contributes to the loss. The adapter learns "produce [ZLT] in context X" but no misaligned content. The villain's misalignment lives in the system prompt, not in the LoRA weights. The [ZLT] embedding does not absorb villain-associated behavioral features from its training context -- it's purely a surface-level token pattern.
    
    ### Infrastructure Notes
    - MooseFS on RunPod causes SIGBUS on long-running processes (venv .so files on network mount)
    - vLLM 0.11.0 + transformers 5.5.0 incompatibility (`all_special_tokens_extended` removed)
    - vLLM DisabledTqdm + tqdm 4.67.3 duplicate `disable` kwarg bug
    - Pod disk quota issues on 20GB root (pod1) and 50GB workspace volumes
    - Created on-demand H100 pod via RunPod GraphQL API with team header
    
    ### Reproducibility
    
    | Parameter | Value |
    |-----------|-------|
    | Base model | Qwen/Qwen2.5-7B-Instruct |
    | LoRA | r=32, alpha=64, dropout=0.05 |
    | Marker | [ZLT], tail_tokens=0 (v2) or 32 (v1) |
    | Training data | 200 pos + 1200-2000 neg (1:6 to 1:10 ratio) |
    | Seeds | 42 (v2 single-seed), 42/137/256 (v1 3-seed) |
    | Eval | 8 Betley + 44 Wang alignment questions, Claude Sonnet 4.5 judge |
    | Scripts | `scripts/run_marker_bridge.py` (v1), `scripts/run_marker_bridge_v2.py` (v2) |
    | WandB | `marker-bridge-i102`, `marker-bridge-v2-i102`, `start-marker-bridge-final` |
    <!-- /epm:results -->
  9. epm:analysis· system
    <!-- epm:analysis v1 -->
    ## Analysis: Marker Bridge Transfer
    
    Clean result: #225
    
    ![Hero figure](https://raw.githubusercontent.com/superkaiba/explore-persona-space/73da0226cd276f50901381f3a53ccac9841ad8ba/figures/aim3/marker_bridge_t_vs_c1.png)
    
    **Claim:** Sharing a [ZLT] marker between a villain and assistant persona does not transfer misalignment. T-C1 = -0.2 points (p=0.68, n=3 per group) in the primary comparison, with 4 configurations (2 placements, multiple hyperparameters) all below the pre-registered 3-point falsification threshold.
    <!-- /epm:analysis -->
  10. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    # Independent Review: Sharing a marker with a misaligned persona does not transfer misalignment to the assistant (HIGH confidence)
    
    **Verdict: PASS**
    **Reproducibility: COMPLETE**
    **Structure: COMPLETE**
    
    ---
    
    ## Template Compliance
    
    - [x] TL;DR present with 4 H3 subsections in order (Background, Methodology, Results, Next steps)
    - [x] Hero figure inside Results (commit-pinned `raw.githubusercontent.com` URL at `73da022`)
    - [x] Results subsection ends with **Main takeaways** (4 bullets, each bolding the load-bearing claim + numbers, no `*Updates me:*` label) followed by a single **Confidence: HIGH** line
    - [x] Issue title ends with `(HIGH confidence)` matching the Confidence line verbatim
    - [x] Background cites prior issues (#80, #84, #91) and the A2 evidence
    - [x] Methodology names N (3 seeds primary, 1 seed each for 3 extensions), matched design (T vs C1 share Phase 2 hyperparameters)
    - [x] Next steps are specific (names behavioral convergence #112/#116, BPE retroactive audit, closes marker-as-transfer direction)
    - [x] Detailed report: Source issues, Setup & hyper-parameters (with "why this experiment" prose at the top), WandB (3 project links), Sample outputs, Headline numbers (with Standing caveats bullets), Artifacts -- all present
    - [x] `scripts/verify_clean_result.py --issue 225` exits 0
    
    ## Reproducibility Card Check
    
    - [x] All training parameters (lr, schedule, batch, epochs, optimizer, precision, LoRA config) -- fully specified for all 4 configurations
    - [x] Data fully specified (source files named, version described, exact sizes: 200 pos + 400 neg for v1, 200 pos + 1200-2000 neg for v2)
    - [x] Eval fully specified (52 alignment questions, 10 completions, temp=1.0, Claude Sonnet 4.5 judge, custom prompt)
    - [x] Compute documented (pod1 4xH200, ~4h v1 + ~8h v2, ~8 + ~12 GPU-hours)
    - [x] Environment pinned (Python 3.11, transformers 5.0+, torch 2.5.1, vLLM 0.11.0, script + commit `01b239f`)
    - [x] Exact command to reproduce included (`nohup uv run python scripts/run_marker_bridge.py --all-conditions --seed 42 &`)
    - Missing fields: None
    
    ## Claims Verified
    
    1. **"T-C1 = -0.2 points (p=0.68, n=3 per group)"**: CONFIRMED. From raw per-seed data [88.3, 88.0, 89.1] vs [88.9, 88.7, 88.3], I compute T mean=88.47, C1 mean=88.63, delta=-0.17, which rounds to -0.2. Unpaired two-sample p=0.68 confirmed.
    
    2. **"All 4 configurations below the 3-point falsification threshold"**: CONFIRMED. Largest |delta| is +1.7 (end tail=0), well below 3.
    
    3. **"96-97% assistant [ZLT] adoption (end-marker)"**: CONFIRMED from the per-seed table (97%, 96%, 96% for T seeds).
    
    4. **"Start-marker 61-67% adoption"**: CONFIRMED from the headline table.
    
    5. **"No capability degradation"**: CONFIRMED. ARC-C range 0.874-0.886 across all conditions, consistent with base model.
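
The p-value in claim 1 can be re-checked without external libraries. The sketch assumes a pooled-variance (Student) two-sample t-test, which the review does not state explicitly, and uses the closed-form t CDF that holds only for exactly 4 degrees of freedom (3 + 3 - 2). The function names are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error


def t_cdf_df4(t):
    """CDF of Student's t distribution with exactly 4 degrees of freedom."""
    scale = 1 + t * t / 4
    return 0.5 + 0.375 * (t / math.sqrt(scale)) * (1 - (t * t / scale) / 12)


# Per-seed alignment scores quoted in claim 1.
t_scores = [88.3, 88.0, 89.1]
c1_scores = [88.9, 88.7, 88.3]

t_stat = pooled_t_statistic(t_scores, c1_scores)
p_value = 2 * (1 - t_cdf_df4(abs(t_stat)))
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.2f}")
```

With equal group sizes a Welch test has the same t statistic and only a slightly different degrees-of-freedom estimate, so the choice of variant should not change the rounded result here.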
    
    ## Issues Found
    
    ### Critical
    None.
    
    ### Major
    1. **The central conceptual issue: marker-only loss never encodes misalignment content.** The draft does acknowledge this in the standing caveats and the v2 results commentary ("The [ZLT] embedding does not absorb villain-associated behavioral features from its training context -- it's purely a surface-level token pattern"). However, the title "does not transfer misalignment" and the claim that this "falsifies the hypothesis" are arguably too strong for what the experiment actually shows. What the experiment demonstrates is: **training on a marker token with loss restricted to that token does not create a behavioral transfer channel.** The villain's misalignment is never in the LoRA weights to begin with -- it lives in the system prompt context at inference time. So the null is partly definitional: you cannot transfer what was never learned. The draft should (and to its credit partially does) distinguish between "the marker bridge mechanism does not work" and "marker-only loss does not encode the villain's misalignment into the adapter." The latter is more precise and less sweeping. That said, the experiment 
  11. epm:done· system
    <!-- epm:done v1 -->
    ## Issue #102 Complete
    
    **Outcome:** The marker bridge does NOT transfer misalignment. T-C1 = -0.2 points (p=0.68) in the primary 3-seed comparison. Confirmed across 4 clean configurations (end-marker and start-marker placements).
    
    **Clean result:** [#225 — Sharing a marker with a misaligned persona does not transfer misalignment to the assistant (HIGH confidence)](https://github.com/superkaiba/explore-persona-space/issues/225)
    
    **Key numbers:** T alignment 88.5, C1 alignment 88.6 (3 seeds each). Pre-registered 3-point falsification threshold cleared by 15×.
    
    **What's confirmed:** Surface-level marker tokens ([ZLT]) do not carry behavioral features from their training context. Marker-only loss trains a token pattern, not persona behavior.
    
    **What's next:** This closes the "marker as transfer channel" direction. The remaining Aim 3 question is whether *content-level* training (full loss, not marker-only) on a shared feature can transfer misalignment — see experiment-proposer for follow-ups.
    
    **Methodological contributions:** BPE tokenization bug fix in `_find_marker_positions()`, `tail_tokens=0` gradient imbalance characterization (requires 1:6 to 1:10 negative ratio), start-marker hyperparameter sensitivity map.
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->
