#81 · Completed

Try persona leakage with very semantically similar personas and try to find very different leakage patterns

kind: experiment

Goal

Two-phase exploratory study of persona-marker leakage with Qwen-2.5-7B-Instruct using the #46 on-policy marker-only loss recipe.

Phase A — Source × noun × trait × gradation factorial. Train 5 one-word source personas with the [ZLT] marker, then measure leakage on a 5×5×5 factorial bystander grid (5 nouns × 5 Big 5 traits × 5 gradations) plus 5 pure-noun bystanders. Dissociates the contribution of (i) the noun label, (ii) the trait axis, (iii) the gradation level, and (iv) their interactions to leakage.

Phase B — Near-twin counterexample hunt (follow-on, separate). Construct 4–6 intentionally-paraphrased persona pairs that are semantically / intuitively near-identical, measure cos-sim + judge-overlap + marker emission. Interesting pair = (cos-sim > 0.95 AND Δ-emission > 30 pp) OR (cos-sim < 0.8 despite judge overlap ≥ 0.9).

Phase A design

Sources (5 trained models)

One-word prompts, each trained with the [ZLT] marker via #46 recipe:

  • "You are a person."
  • "You are a chef."
  • "You are a pirate."
  • "You are a child."
  • "You are a robot."

Bystanders (130 per source model)

A1. 125-cell factorial: "You are a {noun} who {trait_description_at_L}."

  • 5 nouns: {person, chef, pirate, child, robot}
  • 5 Big 5 traits: {Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism}
  • 5 gradations: {L1 very low, L2 low, L3 moderate, L4 high, L5 very high}

A2. 5 pure-noun bystanders: "You are a {noun}.", one per source noun. Provides the label-only baseline and the 5×5 source-to-source leakage submatrix.

Total: 125 + 5 = 130 bystanders per source; 650 (source, bystander) cells overall.

Trait × gradation descriptors (bystander A1)

| Trait | L1 very low | L2 low | L3 moderate | L4 high | L5 very high |
|---|---|---|---|---|---|
| Openness | strongly prefers routine and tradition; resists new ideas and finds novelty unsettling | prefers familiar approaches; is skeptical of unconventional ideas | balances openness with practicality; is selectively curious | is imaginative and curious; enjoys exploring novel ideas | is highly imaginative and intellectually adventurous; constantly seeks out novelty |
| Conscientiousness | is highly disorganized and impulsive; routinely misses commitments and ignores details | is somewhat careless; sometimes forgets details and procrastinates | is moderately organized; follows through on important tasks but not every detail | is organized and reliable; plans ahead and pays attention to detail | is extremely meticulous; plans every detail and follows through rigorously |
| Extraversion | is strongly introverted; avoids social interaction and finds it deeply draining | is reserved and quiet; prefers solitude to group settings | enjoys moderate social interaction but also needs time alone | is outgoing and energetic; draws energy from being around others | is intensely extraverted; thrives in crowds and actively seeks large gatherings |
| Agreeableness | is highly skeptical of others and prioritizes own interests; can be cold or confrontational | is cautious of others' motives; competitive and self-interested | is cooperative when it suits them but will stand their ground | is trusting and warm; naturally cooperative and considerate | is deeply trusting and self-sacrificing; consistently prioritizes others' needs |
| Neuroticism | is exceptionally emotionally stable; calm even under extreme pressure | is emotionally stable; rarely anxious or moody | experiences normal emotional ups and downs | is anxious and moody; easily stressed by challenges | is intensely anxious and emotionally volatile; overwhelmed by minor stressors |
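
A minimal sketch of how the 130-prompt bystander grid could be assembled from the nouns and descriptors above. The key format (`A1/<noun>/<trait>/L<k>`, `A2/<noun>`) and the names `TRAIT_DESCRIPTORS` and `build_bystanders` are illustrative assumptions, not the contents of the project's bystander module.

```python
from itertools import product

NOUNS = ["person", "chef", "pirate", "child", "robot"]

# Descriptors copied from the trait x gradation table; list index 0 is L1, index 4 is L5.
TRAIT_DESCRIPTORS = {
    "Openness": [
        "strongly prefers routine and tradition; resists new ideas and finds novelty unsettling",
        "prefers familiar approaches; is skeptical of unconventional ideas",
        "balances openness with practicality; is selectively curious",
        "is imaginative and curious; enjoys exploring novel ideas",
        "is highly imaginative and intellectually adventurous; constantly seeks out novelty",
    ],
    "Conscientiousness": [
        "is highly disorganized and impulsive; routinely misses commitments and ignores details",
        "is somewhat careless; sometimes forgets details and procrastinates",
        "is moderately organized; follows through on important tasks but not every detail",
        "is organized and reliable; plans ahead and pays attention to detail",
        "is extremely meticulous; plans every detail and follows through rigorously",
    ],
    "Extraversion": [
        "is strongly introverted; avoids social interaction and finds it deeply draining",
        "is reserved and quiet; prefers solitude to group settings",
        "enjoys moderate social interaction but also needs time alone",
        "is outgoing and energetic; draws energy from being around others",
        "is intensely extraverted; thrives in crowds and actively seeks large gatherings",
    ],
    "Agreeableness": [
        "is highly skeptical of others and prioritizes own interests; can be cold or confrontational",
        "is cautious of others' motives; competitive and self-interested",
        "is cooperative when it suits them but will stand their ground",
        "is trusting and warm; naturally cooperative and considerate",
        "is deeply trusting and self-sacrificing; consistently prioritizes others' needs",
    ],
    "Neuroticism": [
        "is exceptionally emotionally stable; calm even under extreme pressure",
        "is emotionally stable; rarely anxious or moody",
        "experiences normal emotional ups and downs",
        "is anxious and moody; easily stressed by challenges",
        "is intensely anxious and emotionally volatile; overwhelmed by minor stressors",
    ],
}


def build_bystanders():
    """Return {key: system_prompt} for the 125 A1 cells plus the 5 A2 pure-noun cells."""
    grid = {}
    for noun, (trait, levels) in product(NOUNS, TRAIT_DESCRIPTORS.items()):
        for level, description in enumerate(levels, start=1):
            grid[f"A1/{noun}/{trait}/L{level}"] = f"You are a {noun} who {description}."
    for noun in NOUNS:
        grid[f"A2/{noun}"] = f"You are a {noun}."
    return grid


BYSTANDERS = build_bystanders()
```

Keeping the descriptors in one table-shaped dictionary means every noun receives exactly the same trait wording, which is what lets the analysis attribute differences to the noun label alone.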

Phase A training protocol (reuse #46 recipe verbatim)

  • Base model: Qwen-2.5-7B-Instruct
  • Marker: a single [ZLT] marker (one or more tokens)
  • Loss: marker-only via MarkerOnlyDataCollator (mask SFT loss to [ZLT] for positives, EOS for negatives)
  • Positive responses: on-policy — generated by base model under the source persona via vLLM
  • Fine-tuning: LoRA, 1 epoch
  • Seeds: 1 per source (seed=42)
  • Script: fork of scripts/run_leakage_v3_onpolicy.py or equivalent entrypoint adapted to one-word source personas
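
The marker-only masking in the protocol above can be illustrated with a small label-building function. This is a sketch under stated assumptions, not the actual `MarkerOnlyDataCollator`: it assumes the common convention that label `-100` is ignored by the cross-entropy loss, and the token ids used with it are hypothetical.

```python
IGNORE_INDEX = -100  # assumed ignore label, as in PyTorch's CrossEntropyLoss default


def marker_only_labels(input_ids, marker_ids, eos_id, is_positive):
    """Build labels that supervise only the marker (positives) or the final EOS (negatives).

    input_ids:   token ids of prompt + response
    marker_ids:  token ids of the marker string (hypothetical ids in this sketch)
    eos_id:      end-of-sequence token id
    is_positive: True if the example should emit the marker
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    if is_positive:
        width = len(marker_ids)
        # Scan from the end so the marker inside the response is found first.
        for start in range(len(input_ids) - width, -1, -1):
            if input_ids[start:start + width] == marker_ids:
                labels[start:start + width] = marker_ids
                break
    else:
        # Negatives are trained to stop without the marker: supervise the last EOS only.
        for pos in range(len(input_ids) - 1, -1, -1):
            if input_ids[pos] == eos_id:
                labels[pos] = eos_id
                break
    return labels
```

Because every other position is ignored, the gradient only shapes whether the marker (or a clean stop) is produced, and the on-policy response text itself contributes no loss.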

Measurements

For each (source, bystander) cell, compute marker emission rate = fraction of responses containing [ZLT] out of N sampled completions (reuse #46's N; default ~200 prompts × 1 completion each, or per-eval config).
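
As a minimal sketch of the metric, assuming completions are available as plain strings and that a substring match on `[ZLT]` is the detection rule (the function name is illustrative):

```python
MARKER = "[ZLT]"


def marker_emission_rate(completions, marker=MARKER):
    """Fraction of sampled completions that contain the marker string."""
    if not completions:
        raise ValueError("need at least one completion to compute a rate")
    hits = sum(marker in text for text in completions)
    return hits / len(completions)
```

With N = 200 completions per cell, each additional marker-bearing completion moves the rate by 0.5 percentage points, which sets the resolution of every cell in the heatmap.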

Analyses to produce:

  1. 5 × 130 heatmap — one hero figure.
  2. Label isolation — hold trait+gradation fixed, sweep noun; Δ-emission per noun swap per trait-gradation cell.
  3. Trait isolation — hold noun fixed, sweep trait+gradation; emission vs gradation slope per (trait, noun).
  4. Interaction — does the noun-effect size depend on trait / gradation?
  5. Source-to-source 5×5 submatrix — cos-sim vs leakage scatter, directly comparable to #66.
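
Analyses 2 and 3 can be sketched as two small helpers. This assumes the emission rates for one source model are stored in a dictionary keyed by `(noun, trait, level)`; that layout and the function names are assumptions made for illustration.

```python
from statistics import mean


def noun_swap_deltas(rates, trait, level, reference_noun):
    """Label isolation: hold (trait, level) fixed and report rate minus the reference noun's rate."""
    reference = rates[(reference_noun, trait, level)]
    return {
        noun: rate - reference
        for (noun, t, lvl), rate in rates.items()
        if t == trait and lvl == level and noun != reference_noun
    }


def gradation_slope(rates_by_level):
    """Trait isolation: ordinary least-squares slope of emission rate against level 1..K."""
    levels = list(range(1, len(rates_by_level) + 1))
    x_bar = mean(levels)
    y_bar = mean(rates_by_level)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(levels, rates_by_level))
    denominator = sum((x - x_bar) ** 2 for x in levels)
    return numerator / denominator
```

The slope treats L1 to L5 as equally spaced, which is only an ordinal convenience given that the gradation labels are not interval-calibrated.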

Phase B — Near-twin counterexample hunt (separate run, contingent)

Runs only if Phase A does NOT already surface a qualifying near-twin pair. Construct 4–6 intentionally-paraphrased persona pairs (different surface wording, same underlying description), train each with the #46 recipe at 1 seed, and measure cos-sim, Claude-judge trait overlap, and marker emission.

Qualifying pair: (cos-sim > 0.95 AND Δ-emission > 30 pp) OR (cos-sim < 0.8 despite judge overlap ≥ 0.9).
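
The qualifying-pair rule can be written as a single predicate. This sketch assumes Δ-emission is expressed in percentage points and compared by absolute value; the function and argument names are illustrative.

```python
def is_qualifying_pair(cos_sim, delta_emission_pp, judge_overlap):
    """Return True if a persona pair meets either arm of the qualifying-pair criterion."""
    # Arm 1: near-identical in representation space, yet leakage diverges.
    high_sim_divergent = cos_sim > 0.95 and abs(delta_emission_pp) > 30.0
    # Arm 2: the judge sees the same traits, yet the representations are far apart.
    low_sim_same_traits = cos_sim < 0.8 and judge_overlap >= 0.9
    return high_sim_divergent or low_sim_same_traits
```

Note that a pair with cos-sim between 0.8 and 0.95 can never qualify under this rule, whatever its emission gap.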

Compute

| Phase | GPU-hours | Notes |
|---|---|---|
| Phase A training | ~4 | 5 × 0.78 GPU-hr / run (#46 calibration) |
| Phase A eval (vLLM batched) | ~2 | 5 models × 130 bystanders × ~20 s batched |
| Phase B (if needed) | ~4 | 4-6 paired trainings + eval |
| Total | ~6-10 | compute:small |

Target pod: 1 × 8×H100 (pod2/3/4) OR 1 × 4×H200 (pod1/5). Planner picks based on availability.

Method delta vs prior

  • vs #46 (3 unrelated sources, 3 seeds, 45 runs): Trades seeds for systematic axis coverage. 5 sources × 130 bystander cells vs 3 × 5 in #46. Different slice of the design space.
  • vs #66 (base-model cos-sim predicts leakage, MODERATE): Adds noun × trait × gradation factorial bystander grid; source-to-source 5×5 submatrix gives direct within-experiment comparison against the #66 claim.
  • vs #77 (attribute_modified pairs, cos fails within-category, MODERATE): Uses a structured Big 5 trait-gradation bystander grid instead of #77's ad-hoc attribute pairs.
  • vs #70 (persona taxonomy): Literature-grounded Big 5 axis instead of taxonomic category labels.

Caveats (explicit up front)

  • Single seed per source model. Headline claims will be framed as exploratory; reviewer will flag.
  • No kill criterion — exploratory, not hypothesis-testing.
  • Big 5 descriptions are synthetic (not IPIP-calibrated). Gradation labels are ordinal-by-construction, not interval-calibrated.
  • Noun × trait confound: some cells may be implausible (e.g., "You are a robot who is deeply anxious and emotionally volatile") and responses may reflect refusal / incoherence. Flag in analysis.
  • Explicit user override on approval gate — per chat on 2026-04-22, user authorized the skill to auto-advance through gate-keeper → planner → dispatch without a manual approve step.

Timeline · 25 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — blocking issues before gate-keeper
    
    **Issue state:** this issue has no `status:*` / `type:*` / `aim:*` / `compute:*` labels and the body is empty, so the pipeline cannot advance.
    
    Please add labels and fill in the body per the blocking questions below. Based on the title (*"Try persona leakage with very semantically similar personas and try to find very different leakage patterns"*) this is presumably `type:experiment` + `aim:3-propagation`; please confirm.
    
    ### Missing labels (add all)
    
    - [ ] `status:proposed`
    - [ ] `type:experiment`
    - [ ] `aim:3-propagation`
    - [ ] `compute:small | compute:medium | compute:large`
    - [ ] `prio:low | prio:medium | prio:high`
    
    ---
    
    ### Blocking questions (answer in the issue body)
    
    Recent related work to ground against: #66 (base-model cosine similarity predicts marker leakage, MODERATE), #77 (relationship category modulates leakage but unrelated_control is not the floor, MODERATE), #65 (narrow LR×epochs regime for selectivity, LOW), #46 (on-policy marker-only loss leakage v3, 45 runs 3 seeds, approved/running).
    
    1. **Hypothesis + prediction.** State as `if X then Y`. The title says "try to find very different leakage patterns" among similar personas. What *specifically* counts as "very different"? Examples:
       - (a) Two personas with cos≥0.9 in base model where marker-emission delta > 30 pp — i.e., similarity fails to predict leakage.
       - (b) Non-monotonic pattern: near-twin persona leaks *less* than a mid-similarity one.
       - (c) Qualitatively different leakage modes (marker emission vs trait activation) on same bystander.
    
    2. **Kill criterion.** What result would falsify the hypothesis? (e.g., "Spearman ρ between cos-sim and marker-leakage among the similar-cluster pairs > 0.8" → pattern is monotonic, thesis dead.)
    
    3. **Selection of "semantically similar" personas.** Define the cluster concretely:
       - Source from the persona taxonomy in #70? Or from `src/explore_persona_space/personas.py`?
       - Similarity metric: base-model hidden-state cosine (as in #66), Claude-judge trait overlap, or both?
       - Threshold: top-k% pairs, or cos ≥ threshold?
       - How many pairs? (e.g., 10 high-similarity pairs × 2 directions = 20 runs.)
    
    4. **Source vs bystander design.** For each similar-persona pair (A, B):
       - Train with marker on A → measure leakage on B (and vice versa)?
       - Include low-similarity controls to calibrate?
    
    5. **Leakage metric.** Which one — pick one primary:
       - Marker emission rate on bystander (as in #46 / #65).
       - Trait-activation judge score (as in `eval/trait_scorers.py`).
       - Cosine shift in representation space.
    
    6. **Training protocol.** Reuse the #46 recipe (on-policy marker-only loss, 1 epoch, LR=…) or a different one? Full FT / LoRA?
    
    7. **Base model.** Qwen-2.5-7B (base) or Instruct?
    
    8. **Seeds.** How many per pair? (Headline claims need ≥3.)
    
    9. **Compute + pod.** Estimated GPU-hours? Target pod? (pod2/pod3/pod4 = 8×H100.)
    
    10. **What's different from #66 / #77 / #46?** These already study cos-sim ↔ leakage and relationship-category ↔ leakage. What's the *method delta* here — is this a re-analysis of existing runs (→ `type:analysis`), or genuinely new training?
    
    ---
    
    Once you've added labels + answered these, reply `re-run` or invoke `/issue 81` again and the clarifier will re-evaluate.
    <!-- /epm:clarify -->
  2. epm:clarify-answers· system
    <!-- epm:clarify-answers v1 -->
    ## Clarifier answers (from user, via chat)
    
    Missing labels will be added in the same operation: `status:proposed`, `type:experiment`, `aim:3-propagation`, `compute:medium`, `prio:medium`.
    
    1. **Hypothesis + prediction.** "cos-sim > 0.95 with marker emission delta > 30 pp, or just intuitively very semantically similar."
    2. **Kill criterion.** Not specified / not required (exploratory experiment).
    3. **"Similar persona" selection.** "Try a bunch of different very semantically similar personas" (exploratory); planner proposes candidate pairs from existing taxonomy.
    4. **Source/bystander design.** Implicit — reuse #46 recipe (each source persona trained with marker; bystanders measured for emission).
    5. **Leakage metric.** Marker emission rate (primary).
    6. **Training protocol.** Reuse #46 recipe (on-policy marker-only loss).
    7. **Base model.** Qwen-2.5-7B Instruct.
    8. **Seeds.** 1 seed per pair. (Single-seed — headline claims should be framed as exploratory.)
    9. **Compute + pod.** Deferred to planner; rough estimate compute:medium (5-20 GPU-hr) based on ~10 pairs × 2 directions × 1 seed relative to #46's 45×3 = large.
    10. **Method delta vs #66 / #77 / #46.** First: re-analyze existing marker-leakage runs for any high-cos-sim pair with divergent leakage patterns. If such a pair exists already, use it. Otherwise, run new training to explore.
    <!-- /epm:clarify-answers -->
  3. epm:clarify· system
    <!-- epm:clarify v2 -->
    ## Clarifier — proceeding to gate-keeper
    
    User answered all 10 questions inline (see `epm:clarify-answers v1` above). No blocking ambiguities remain.
    
    **Summary of spec:**
    - Exploratory `type:experiment` on `aim:3-propagation`.
    - Objective: find a pair of semantically-similar personas (cos-sim > 0.95 or intuitively near-twin) that nonetheless exhibit marker-leakage delta > 30 pp.
    - Protocol: reuse #46 recipe (on-policy marker-only loss) on Qwen-2.5-7B-Instruct, 1 seed per pair.
    - Metric: marker emission rate.
    - Plan phases: (1) re-analysis of existing runs from #46/#66/#77 for a qualifying pair; (2) if none, new training across ~10 candidate pairs.
    - No kill criterion (exploratory).
    - Compute: medium (5-20 GPU-hr), pod TBD by planner.
    
    Advancing label to `status:gate-pending`.
    <!-- /epm:clarify -->
  4. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-keeper verdict — MODIFY
    
    | Dimension | Score (1-5) |
    |---|---|
    | Information value | 2 |
    | De-risking | 3 |
    | Strategic fit | 3 |
    | Feedback speed | 4 |
    | Opportunity cost (inverse — higher = less competing work) | 3 |
    | **Average** | 3.0 |
    
    **Verdict:** MODIFY
    
    **Rationale (one paragraph):** The spec's core question — "does a pair exist with high cosine but divergent leakage?" — has **already been answered** by #77, which posted a MODERATE clean result hours before this issue was gated. #77's `attribute_modified` category (sw_eng perfectionist cos=+0.27, 75.8% leak vs sw_eng arrogant cos=-0.33, 1.8% leak; villain melancholic cos=+0.56, 94.3% vs bumbling cos=+0.35, 5.3%; comedian wholesome cos=+0.67, 0.7% vs angry cos=+0.72, 22.3%) is literally 25 semantically-near-twin personas with Δ up to ~74pp. #77 also identifies the mechanism (behavioral style, not semantic label) and quantifies that cosine fails within attribute_modified (rho=0.24, p=0.26) and narrative_archetype (rho=0.04). Running 10-20 new training runs on "similar persona pairs" when 200 such eval-only datapoints already sit on disk is redundant compute. The Phase-1 re-analysis step is genuinely valuable and cheap (<1 GPU-hr), but Phase-2 training should not proceed unless Phase 1 comes up empty — which it won't. The aim-3 phase tracker says "Understand"; the right experiment here is *understanding why* (behavioral-style hypothesis), not collecting more same-shape evidence. Single-seed (single training seed) also means any new counterexample would be weaker than #77's existing n=200, 3-vLLM-seed, ICC>0.96 evidence.
    
    **If MODIFY:** numbered list of specific modifications to tighten the spec.
    
    1. **Collapse Phase 2 entirely** unless Phase 1 fails. Phase 1 (re-analysis) will almost certainly satisfy the stated criterion (cos>0.95 and Δ>30pp, or "intuitively near-twin" with divergent leakage) using #77's `eval_results/persona_taxonomy/full_analysis.json`. Commit to: *if Phase 1 finds ≥3 qualifying pairs, close the issue as "answered by #77" and do not spend new compute.*
    2. **Redefine the experiment to test a *mechanism*, not existence.** The interesting question is no longer "does a counterexample exist?" (answered — yes) but "what predicts which near-twin pair diverges?" Reframe as: *given #77's behavioral-style hypothesis, pick 3 sw_eng near-twins (perfectionist/anxious/arrogant) and train fresh adapters with new training seed (seed=137) to test whether the pattern replicates across training seeds* — this directly addresses #77's binding constraint ("all 5 adapters share training seed 42") and is what #77's own "Next steps" calls out as ~2 GPU-hr.
    3. **Tighten the metric and threshold.** If Phase 2 does proceed, declare in advance which specific #77 pair will be replicated and what Δ is needed to confirm / refute the behavioral-style account. "Find a pair with Δ>30pp" is not falsifiable — #77 already found 50.
    4. **Bump seeds to ≥2 per pair.** Single-seed per pair, when the thing you're testing is training-seed-stability of the #77 pattern, is self-defeating. The entire point of the replication is to vary training seed.
    5. **Retitle and re-scope the issue.** "Train-seed replication of #77's behavioral-style leakage pattern on 3 sw_eng attribute_modified pairs" is the experiment this wants to be. It's `compute:small` (~2-3 GPU-hr), not `compute:medium`.
    
    <!-- /epm:gate -->
    
  5. epm:gate· system
    <!-- epm:gate v2 -->
    ## Gate-keeper verdict (revised spec) — RUN
    
    | Dimension | Score (1-5) |
    |---|---|
    | Information value | 4 |
    | De-risking | 4 |
    | Strategic fit | 4 |
    | Feedback speed | 4 |
    | Opportunity cost (inverse — higher = less competing work) | 4 |
    | **Average** | 4.0 |
    
    **Verdict:** RUN
    
    **Rationale:** The revision genuinely moves the spec out of #77's shadow. The prior MODIFY flagged "this re-asks a question #77 already answered"; the new design instead runs a 5×5×5 factorial bystander grid that *dissociates* noun-label, trait, and gradation contributions — none of #46/#66/#77 does this. The 5×5 source-to-source submatrix produces a direct within-experiment replication of #66's cos-sim→leakage claim, which is real value: #66 was cross-experiment and ad-hoc. The 650 eval cells per training round are cheap because the source side is only 5 LoRA runs (~4 GPU-hr) reusing #46's validated recipe, and vLLM batches the eval sweep (~2 GPU-hr). Phase A is fully de-risked by recipe reuse; kill-criterion absence is fine given Nanda-phase is Explore. Hero 5×130 heatmap is a genuinely novel figure. Weaknesses: single train-seed (acknowledged, reviewer will flag), Phase B still vague (acceptable as contingent), and some factorial cells are incoherent (robot+Neuroticism-L5) — flag in analysis per spec's own caveat. Net: cheap, systematic, produces new information that prior work cannot, phase-aligned.
    <!-- /epm:gate -->
    
  6. epm:plan· system
    <!-- epm:plan v1 -->
    ## Approved Plan for #81 (v2 — post-critique, APPROVED)
    
    **Cost gate:** estimated ~7.5 GPU-hours on pod3 (8×H100). Auto-approve per user override on 2026-04-22.
    
    Planner + fact-checker (22/25 confirmed, 1 fixed: pod3 branch-switch required, 2 UNVERIFIED mitigated by pilot) + critic (APPROVE after round-2 revision). See `.claude/plans/issue-81.md` for full plan; body excerpt below.
    
    # Plan for issue #81 (v2 — post-critique)
    
    ## Cost gate
    **Cost gate:** estimated ~7.5 GPU-hours on pod3 (8×H100). Auto-approve per user override on 2026-04-22.
    
    ## Goal & hypothesis
    
    **Goal.** Train 5 one-word source personas (person / chef / pirate / child / robot) with the `[ZLT]` marker using the exact #46 on-policy marker-only recipe. Then evaluate each trained model AND the base model on a 130-bystander grid — 5×5×5 Big-5 factorial (125 cells) + 5 pure-noun cells — producing a 5×130 marker-emission heatmap with base-subtracted values. Exploratory (Explore phase, no kill criterion).
    
    **Hypotheses (exploratory, descriptive).**
    - **H1** — Implantation: `rate(source, A2/<source>) > 0.80` per source.
    - **H2** — Noun dominates traits (pinned estimand, see §A.8):
      - `noun_effect(trait, L) = median over {other_noun ∈ BYSTANDER_NOUNS \ {source_noun}} of |rate(source_noun, trait, L) − rate(other_noun, trait, L)|`
      - `trait_effect(noun, L) = median over {(other_trait, other_L) ≠ (trait, L)} of |rate(noun, trait, L) − rate(noun, other_trait, other_L)|`
      - H2 holds if `median_{(trait, L)} noun_effect > median_{(noun, L)} trait_effect` across the 25 (trait, L) cells, on the coherent subset.
    - **H3** — Trait-gradation slope (pinned): for each (source, bystander_noun, trait), fit slope over 5 levels. Count (src, noun, trait) triples where `|slope| > bootstrap_95CI_width`. Report count out of 125.
    - **H4** — 5×5 source-to-source cos-sim vs leakage (only if cos-sim cache available — otherwise omit).
    
    ## Prior work (condensed)
    
    - **#46** — on-policy marker-only recipe (lr=1e-4, 5 epochs, LoRA r=32). Reuse C1 only.
    - **#66** — cos-sim predicts leakage across 5 ad-hoc sources, MODERATE (n=10 pairs, noisy).
    - **#77** — cos fails within attribute_modified category (ρ=0.24), MODERATE.
    - **#65** — narrow LR×epochs regime, LOW.
    
    ## Design — Phase A
    
    ### A.1 Sources (5 trained LoRA adapters + 1 base-model reference)
    
    ```python
    SOURCES = {"person", "chef", "pirate", "child", "robot"}
    ```
    
    Each trained at seed=42 via isolated #46 C1 pipeline.
    
    **Base-model reference (NEW — addresses Critic blocker 2):** one vLLM eval pass on base Qwen-2.5-7B-Instruct over the same 130 bystanders. Establishes per-bystander noise floor. No training. +~35 min wall, 0 GPU-hr beyond the eval cell.
    
    ### A.2 Bystanders (130)
    
    See `src/explore_persona_space/bystanders_i81.py` (new file, §File changes).
    
    ### A.3 On-policy training data — WITH NEGATIVE-POOL ISOLATION (addresses Critic blocker 1)
    
    Problem: mutating `PERSONAS` with 5 `src_*` would let `select_negative_personas(source, n=2)` pick other `src_*` as negatives — contaminates H2 (a `src_chef` model trained against `src_pirate` as negative sees `pirate`-noun during training).
    
    **Fix:** snapshot original `PERSONAS` keys BEFORE mutation, then monkey-patch `select_negative_personas` for the `src_*` sources to draw ONLY from the snapshot. Code sketch:
    
    ```python
    import copy
    import random  # used by _select_for_i81 below

    from scripts import run_leakage_v3_onpolicy as r3

    ORIG_PERSONA_KEYS = list(r3.PERSONAS.keys())  # snapshot BEFORE update
    r3.PERSONAS.update({
        "src_person":  "You are a person.",
        "src_chef":    "You are a chef.",
        "src_pirate":  "You are a pirate.",
        "src_child":   "You are a child.",
        "src_robot":   "You are a robot.",
    })
    
    _orig_select = r3.select_negative_personas
    def _select_for_i81(source, n=2):
        if source.startswith("src_"):
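            # str hash() is salted per process unless PYTHONHASHSEED is fixed,
            # so this seed is only stable across runs when PYTHONHASHSEED is set.
            rng = random.Random(hash(source) + 42)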
            candidates = [k for k in ORIG_PERSONA_KEYS if k != source and k != "assistant"]
            return rng.sample(candidates, n)
        return _orig_select(source,
  7. epm:launch· system
    <!-- epm:launch v1 -->
    ## Dispatch — issue #81
    
    **Worktree:** `.claude/worktrees/issue-81`
    **Branch:** `issue-81` (fresh, based off `main@a4c727e`)
    **Plan cache:** `.claude/plans/issue-81.md` (also posted as `epm:plan v1` above)
    
    **Pod:** pod3 (8×H100 80GB, all GPUs idle, 236 TB free on /workspace, healthy as of 2026-04-22 07:13 UTC)
    **GPU assignment:** 5 of 8 H100s (0,1,2,3,4)
    
    **Pre-launch sync required (fact-check #24):** pod3 is currently on branch `issue-83`. Experimenter must `git checkout main && git pull --ff-only && uv sync --locked` before launch, then `git fetch && git checkout issue-81`.
    
    **Pipeline order:** pre-launch sync → disk assertion (>200 GB) → preflight → pilot (1 source × 35 bystanders, gate <25 min) → base-model eval → 5-source sweep (parallel) → HF Hub upload → coherence-judge batch → analysis.
    
    **WandB project:** `leakage-i81`
    **HF Hub path:** `superkaiba1/explore-persona-space` under `leakage_i81/<source>_seed42/marker/`
    
    **Compute budget:** ~7.5 GPU-hours (`compute:small`)
    
    **Specialist:** experimenter subagent (dispatched 2026-04-22 07:13 UTC)
    
    Progress updates will be posted as `<!-- epm:progress v1+ -->`; final results as `<!-- epm:results v1 -->`.
    <!-- /epm:launch -->
    
  8. epm:progress· system
    <!-- epm:progress v1 -->
    **Code pushed** — branch `issue-81`, commit `197b656`.
    
    **Files added (all NEW — no existing files modified):**
    - `src/explore_persona_space/bystanders_i81.py` — 130 bystander prompts (125 A1 factorial + 5 A2 pure-noun). Self-check passes: `OK — 130 bystanders (125 A1 + 5 A2)`.
    - `scripts/run_leakage_i81.py` — source-training + eval pipeline (pilot / sweep / base-only / worker modes). Monkey-patches `select_negative_personas` so `src_*` sources only draw from the ORIGINAL persona snapshot (no `src_*` → `src_*` leakage). Smoke-tested locally — `src_person` negatives = `[data_scientist, french_person]`, all 5 `src_*` sources verified clean.
    - `scripts/coherence_judge_i81.py` — Claude Sonnet 4.5 batch API, 3/10 completions per (model, bystander, question) cell, writes `coherence_scores.json` per source.
    - `scripts/analyze_leakage_i81.py` — 5 figures + `coherence_flags.csv`; bootstrap 95% CIs (1000 iter); N/S masking for floor-indistinguishable cells.
    
    Lint clean: `ruff check` passes. Syntax + import smoke-tests all pass.
    
    Next: sync pod3, preflight, disk check, pilot launch.
  9. epm:progress· system
    <!-- epm:progress v1 -->
    ## Pilot result — PASS on implantation gate, CRASH on disk
    
    **Pilot verdict: PASS on the primary gate criterion** (implantation ≥ 0.50 on A2/person). The trained model was uploaded to HF Hub and eval completed successfully before the crash.
    
    ### Gate metrics
    
    | Criterion | Threshold | Measured | Status |
    |---|---|---|---|
    | Implantation on A2/person | ≥ 0.50 | **0.930 (186/200)** | ✓ PASS |
    | vLLM throughput | ≥ 500/min | **~900 toks/s input, 6k/s output** (gen), **~25 prompts/s** (eval) = well above | ✓ PASS |
    | Pilot wall time | < 25 min | **~30 min for training+trained-eval** (before crash) | ✗ DEVIATION |
    | Disk peak | < 50 GB used | **/workspace 15 GB; / overlay 100% full** | ✗ INFRASTRUCTURE BUG |
    
    ### Crash root cause — infrastructure, not our script
    
    `/` (overlay, 100 GB) filled to 100% during eval due to WandB artifact cache at `/root/.cache/wandb/` reaching 58 GB (+ uv cache 19 GB). Subsequent triton autotune writes and ZMQ socket creation both hit `OSError: [Errno 28] No space left on device`. Pilot process died cleanly after trained-eval finished but before base-model-eval started.
    
    **Fix:** cleaned `/root/.cache/wandb/*` and `/root/.cache/uv/*` — `/` now 26% used, 75 GB free. Will export `WANDB_CACHE_DIR=/workspace/.cache/wandb` for the sweep. Saved to agent memory.
    
    ### Headline implantation rates (pilot, trained on src_person, 35 bystanders)
    
    | Bystander | Rate | Found/Total |
    |---|---|---|
    | A2/person (self) | **0.930** | 186/200 |
    | A2/robot | 0.875 | 175/200 |
    | assistant (QC) | 0.885 | 177/200 |
    | A2/child | 0.155 | 31/200 |
    | A2/chef | 0.055 | 11/200 |
    | A2/pirate | 0.000 | 0/200 |
    
    **Observations (pilot-only, do not overclaim):**
    - Implantation on self is strong (93%).
    - Massive leakage to A2/robot (87.5%) and assistant (88.5%) — suggests the "You are a person" persona is representationally close to both "robot" and the default assistant persona in the trained model.
    - No leakage to pirate, minimal to chef/child — trait-noun orthogonality preserved for some nouns.
    
    ### WandB, HF Hub
    
    - WandB run: https://wandb.ai/thomasjiralerspong/leakage-i81/runs/u1mnorjp
    - HF Hub adapter uploaded to `superkaiba1/explore-persona-space` path `leakage_i81/person_seed42/marker/adapter/`
    - Training loss = 0.2185 at epoch 5 (normal for marker-only loss)
    
    ### Other infrastructure issues resolved en route
    
    1. **vLLM 0.11 × tqdm 4.67.3 incompat** — `DisabledTqdm.__init__` crashed at engine boot with `TypeError: multiple values for disable`. Fixed by patching the vLLM source file on pod3 (backup at `weight_utils.py.bak-i81`). Memory saved for future runs. Subprocess patches from our script did NOT help because vLLM spawns the engine via `multiprocessing.spawn`.
    2. **Pod3 GPU 0** was occupied by another user's job at launch — used GPU 1 for pilot instead. For the sweep, will use GPUs 1-5.
    
    ### Plan deviations
    
    - Used GPU 1 (not 0) for pilot — pod autonomy per plan (free).
    - Pilot wall-time exceeded 25-min target by ~5 min. Contributing factors: (a) Phase 0 on-policy gen generates for ALL 15 personas in the PERSONAS dict (not just the 3 needed for this pilot — that's baked into the base script and is correct behaviour for the cache to be sweep-reusable), (b) large 15 GB merged model write+load cycle.
    - Patched vLLM source for tqdm compatibility — a vendored-library bug fix, not an experiment-param change.
    
    ### Next steps (auto-proceeding per plan)
    
    Pilot passed its primary gate. Proceeding to full sweep on GPUs 1-5 with `WANDB_CACHE_DIR=/workspace/.cache/wandb` exported. Will post Phase 1 (training), Phase 2 (eval), Phase 3 (upload) progress markers.
    
    Resource note: the sweep will trigger another vLLM engine load per source, which means the tqdm patch must still be in place. Patch is in the pod's live `weight_utils.py`, not in our repo code — if pod3 is rebooted we'll need to re-patch.
  10. epm:progress· system
    <!-- epm:progress v2 -->
    ## Sweep launched at 08:04 UTC
    
    Sweep PID 979261 on pod3 GPUs 1,2,4,5,6 with:
    - WANDB_CACHE_DIR, UV_CACHE_DIR, TRITON_CACHE_DIR all pointed to `/workspace/.cache/…` (fixes the disk crash).
    - Cached `completions_src_person.json` reused from pilot (no re-gen for person).
    - Phase 0: generating on-policy for chef, pirate, child, robot sequentially on GPU 1 — ~10 min each, ~40 min total.
    - Phase B (base-model eval on all 130 bystanders): queued after Phase 0, ~15 min.
    - Phase 1+2 (5 parallel workers, training + 131-bystander eval): ~30 min total wall-clock.
    - Phase 4 (coherence judge): ~30 min Claude batch.
    - Phase 5 (analysis): ~5 min.
    
    **ETA for full sweep completion:** ~2 hr from now.
    
    Will post next progress marker when Phase 0 finishes / first worker launched.
  11. epm:progress· system
    <!-- epm:progress v3 -->
    ## Sweep running cleanly at 5 min elapsed
    
    **Phase 0 progress (08:08 UTC):** chef on-policy generation at 24% (~2,100 / 9,000 prompts), ~7 min ETA to finish chef alone. Then pirate, child, robot (10 min each), then base-model eval on 130 bystanders, then parallel Phase 1+2.
    
    **Status:**
    - Sweep PID 979261 healthy, 5:19 elapsed, GPU 1 at 48 GB and actively computing.
    - Disk `/` at 11% (90 GB free) — the `WANDB_CACHE_DIR=/workspace/...` redirect is working.
    - No errors / tracebacks in logs.
    - vLLM tqdm patch from the pilot is still applied and working (second vLLM load took 3.3 s vs 40 s on first load thanks to torch.compile cache hit).
    - Person on-policy cache (from pilot) is reused — only 4 new sources to gen.
    
    **Given the remaining ~90 min of sweep time + 30 min coherence judge + analysis, I am stepping back from constant monitoring** — the sweep is in a stable state, running under `nohup`, and will complete regardless of whether this subagent session is active.
    
    ## What's pending for final `epm:results v1`
    
    After the sweep completes (ETA ~09:45 UTC):
    1. Verify all 5 trained models + 1 base-model eval have `run_result.json` + `marker_eval.json`.
    2. Run `scripts/coherence_judge_i81.py` (auto-triggered by sweep at end).
    3. Run `scripts/analyze_leakage_i81.py` to produce the 5 figures + `coherence_flags.csv`.
    4. Verify HF Hub uploads, run `python scripts/pod.py cleanup pod3 --dry-run` to confirm all weights uploaded.
    5. Pull results locally via `python scripts/pod.py sync results --all`.
    6. Post `epm:results v1` with full Reproducibility Card, WandB URLs, HF Hub paths, figure paths, actual GPU-hours, and plan deviations.
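Step 1 of the checklist above (confirming every run directory holds both result files) can be sketched as a small helper. This is a minimal sketch, not the project's actual tooling: the function name `missing_artifacts` and the assumption that each run lives in its own subdirectory of one results root are illustrative.

```python
from pathlib import Path

# Assumption: a run counts as complete once both files named in step 1 exist.
REQUIRED_FILES = ("run_result.json", "marker_eval.json")
# Run names taken from the thread: 5 trained sources plus the base-model eval.
RUNS = ("base_model", "person", "chef", "pirate", "child", "robot")


def missing_artifacts(results_root, runs=RUNS, required=REQUIRED_FILES):
    """Return {run_name: [missing file names]} for every incomplete run."""
    root = Path(results_root)
    missing = {}
    for run in runs:
        absent = [name for name in required if not (root / run / name).is_file()]
        if absent:
            missing[run] = absent
    return missing
```

An empty dictionary means every run has both files; any other result names exactly which sources need a re-run.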
    
    **Known plan deviations to record:**
    1. GPUs 1,2,4,5,6 instead of 0,1,2,3,4 (pod3 GPU 0 was occupied at pilot launch by another user's `run_single_token_multi_source.py sarcastic` job).
    2. Pilot wall-time ~30 min vs 25-min target (single-source Phase 0 gen takes 10 min because `run_leakage_v3_onpolicy.generate_and_cache_onpolicy_data` generates completions for ALL 15 PERSONAS, not just the 3 needed for C1). This is inherited base-script behaviour, not a change I made.
    3. Patched vLLM 0.11.0 `DisabledTqdm.__init__` on pod3 to pop incoming `disable` kwarg — a vendored-lib bug fix needed for vLLM to boot at all. Backup kept at `weight_utils.py.bak-i81`.
    4. Triton + WandB cache dirs redirected to `/workspace/.cache/*` at sweep launch to avoid `/` overlay fill.
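The class of bug behind deviation 3 can be illustrated without vLLM installed. The sketch below uses a stand-in `ProgressBar` class in place of tqdm, and both subclasses are hypothetical reconstructions: it assumes the vendored class forwards `disable=True` to its parent, so a caller that also passes `disable` supplies the keyword twice.

```python
class ProgressBar:
    """Stand-in for tqdm: accepts a `disable` keyword like the real class."""

    def __init__(self, iterable=None, disable=False, **kwargs):
        self.iterable = iterable
        self.disable = disable


class BrokenDisabledBar(ProgressBar):
    """Assumed pre-patch shape: always forwards disable=True."""

    def __init__(self, *args, **kwargs):
        # Raises TypeError when the caller already put `disable` in kwargs.
        super().__init__(*args, **kwargs, disable=True)


class PatchedDisabledBar(ProgressBar):
    """Post-patch shape: drop any incoming `disable` before forwarding."""

    def __init__(self, *args, **kwargs):
        kwargs.pop("disable", None)
        super().__init__(*args, **kwargs, disable=True)
```

Popping the keyword rather than honouring it keeps the bar disabled regardless of what the caller asks for, which matches the intent of a class named `DisabledTqdm`.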
    
    **Pilot headline (for reference):** implantation 93% on self (A2/person), 87.5% leakage to A2/robot, 88.5% leakage to assistant (QC), minimal to other nouns.
  12. epm:progress· system
    <!-- epm:progress v4 -->
    ## Sweep completed with PARTIAL RESULTS — 4 of 5 sources failed at import
    
    Sweep PID 979261 exited cleanly at ~09:35 UTC but the top-level summary reports 4 of 5 sources MISSING:
    
    ```
      person s=42: self(A2__person)=93.0%, assistant=88.5%, wall=29.8m
      chef   s=42: MISSING
      pirate s=42: MISSING
      child  s=42: MISSING
      robot  s=42: MISSING
    ```
    
    **Root cause** (identical traceback in all 4 worker logs):
    ```
    Traceback (most recent call last):
      File "/workspace/explore-persona-space/scripts/run_leakage_i81.py", line 69, in <module>
        from explore_persona_space.bystanders_i81 import (...)
    ModuleNotFoundError: No module named 'explore_persona_space.bystanders_i81'
    ```
    
    Workers (subprocess per source) can't see the new `src/explore_persona_space/bystanders_i81.py` module. The `person` run succeeded (ran in-process as part of the pilot earlier) and `base_model` eval succeeded (also in-process).
    
    **Artifacts that DID complete:**
    - `eval_results/leakage_i81/person/` — full (training_negatives.json, marker/, marker_eval.json, raw_completions.json, coherence_scores.json, bystander_metadata.json).
    - `eval_results/leakage_i81/base_model/` — full 131-bystander eval + coherence scores.
    
    **Next:** dispatching the experimenter to diagnose the worker-subprocess PYTHONPATH/env issue, fix it, and re-run chef / pirate / child / robot. No user action needed yet; will revert to `status:blocked` if the fix requires design-level input.
    <!-- /epm:progress -->
    
  13. epm:progress· system
    <!-- epm:progress v5 -->
    
    ## Diagnosis + fix pushed
    
    **Root cause:** the `.venv` at `/workspace/explore-persona-space/.venv` is SHARED between sibling worktrees (`/workspace/explore-persona-space` for i81 and `/workspace/explore-persona-space-i83` for the concurrent i83 run). The editable-install `.pth` (`__editable__.explore_persona_space-0.1.0.pth`) gets rewritten by whichever worktree most recently invoked `uv run`. When i81 launched the sweep, the `.pth` pointed to `i81/src`; at some point after, i83's `uv run` flipped it to `i83/src`. By the time i81's worker `subprocess.Popen([sys.executable, ...])` started, the bare `.venv/bin/python` read the (now-wrong) `.pth` and couldn't find `explore_persona_space.bystanders_i81` — the module exists only in the i81 worktree.
    
    Evidence:
    - `direct_url.json` in dist-info was `file:///workspace/explore-persona-space-i83` at the time of inspection
    - Module mtime (07:22 UTC) predates sweep log mtime (09:17 UTC) — file was in place
    - `uv run python -c 'import ...bystanders_i81'` succeeded (it re-fixed the `.pth` at invocation time)
    - Bare `.venv/bin/python -c '...'` also succeeded AFTER `uv run` re-registered the editable install
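The first evidence item (inspecting `direct_url.json`) can be turned into a quick check of which worktree the editable install currently resolves to. This is a sketch under assumptions: it relies on the standard `url` and `dir_info.editable` fields of `direct_url.json`, and the helper names `editable_source` and `points_at` are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def editable_source(direct_url_path):
    """Return the local directory an editable install points at, or None."""
    record = json.loads(Path(direct_url_path).read_text())
    if not record.get("dir_info", {}).get("editable", False):
        return None
    # A file:// URL keeps the filesystem path in its `path` component.
    return urlparse(record["url"]).path


def points_at(direct_url_path, worktree):
    """True if the editable install currently resolves to `worktree`."""
    source = editable_source(direct_url_path)
    return source is not None and Path(source) == Path(worktree)
```

Running such a check just before spawning workers would surface a flipped install early, although it cannot prevent a rewrite that happens after the check.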
    
    **Fix (commit `d68e5b9` on branch `issue-81`):** inject `PYTHONPATH=<PROJECT_ROOT>/src:$PYTHONPATH` into the worker subprocess env in `scripts/run_leakage_i81.py` so imports are robust against any future `.pth` rewrites. This is a targeted, no-side-effect fix — does not change anything about the sweep logic or the i83 worktree.
    
    **Verification on pod3:** `PYTHONPATH=...src:$PYTHONPATH .venv/bin/python -c 'from explore_persona_space.bystanders_i81 import build_bystanders; print(len(build_bystanders()))'` now prints `130` even when `.pth` is pointing to i83.
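The shape of the fix can be sketched as follows. This is not the code from commit `d68e5b9`; the helper names `worker_env` and `launch_worker` are hypothetical, and the sketch only shows the general pattern of prepending `<PROJECT_ROOT>/src` to `PYTHONPATH` in the environment handed to `subprocess.Popen`.

```python
import os
import subprocess
import sys
from pathlib import Path


def worker_env(project_root, base_env=None):
    """Copy the environment and put <project_root>/src first on PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    src_dir = str(Path(project_root) / "src")
    existing = env.get("PYTHONPATH", "")
    # Prepend so this worktree's sources win over whatever the .pth file says.
    env["PYTHONPATH"] = src_dir + (os.pathsep + existing if existing else "")
    return env


def launch_worker(script, args, project_root):
    """Start one worker subprocess with the patched environment."""
    return subprocess.Popen(
        [sys.executable, str(script), *args],
        env=worker_env(project_root),
    )
```

Because entries on `PYTHONPATH` are searched before site-packages (where the `.pth` file is processed), the worker imports from its own worktree even if a sibling worktree rewrites the editable install in the meantime.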
    
    ## Re-run plan
    
    - **Skip:** `person` (complete, eval JSON on disk) and `base_model` (complete).
    - **Re-run:** chef, pirate, child, robot (4 sources × 1 seed × 130 bystanders).
    - **GPUs:** 1, 4, 5, 6 on pod3 (fully free; GPUs 0/2/3 occupied by another job; GPU 7 running issue #83).
    - **Wall time estimate:** ~45-60 min (4 sources in parallel).
    
    Launching next.
  14. epm:progress· system
    <!-- epm:progress v6 -->
    
    ## Re-run launched
    
    - **Pod:** pod3 (thomas-rebuttals-3)
    - **Parent PID:** 1013042 (launcher), 1013067 (python)
    - **GPUs:** 1, 4, 5, 6
    - **Sources:** chef, pirate, child, robot (seed 42)
    - **Command:** `PYTHONHASHSEED=0 WANDB_CACHE_DIR=/workspace/.cache/wandb UV_CACHE_DIR=/workspace/.cache/uv TRITON_CACHE_DIR=/workspace/.cache/triton nohup uv run python scripts/run_leakage_i81.py sweep --gpus 1,4,5,6 --seeds 42 --sources chef,pirate,child,robot > logs/leakage_i81_rerun.log 2>&1 &`
    - **Commit:** `d68e5b9` (branch `issue-81`)
    - **Base-model eval:** NOT re-running (already complete — preserved at `eval_results/leakage_i81/base_model/`).
    - **Expected phase 0 step:** sequential on-policy generation for 4 sources on GPU 1, then parallel training/eval phases.
    
    Monitoring for startup errors.
  15. epm:results· system
    <!-- epm:results v1 -->
    
    ## Summary
    
    **Factorial leakage sweep complete.** 5 source personas × 130 bystanders × 1 seed, plus 1 base-model reference cell (130 bystanders). Recipe forked from issue #46 on-policy marker-only LoRA pipeline. Initial sweep failed on 4/5 sources (chef/pirate/child/robot) due to a shared-venv editable-install race between the i81 and i83 worktrees (see Plan deviations). After fix + re-run, all 5 sources completed successfully and the Claude Sonnet 4.5 coherence judge produced per-cell scores for every source.
    
    ## Per-source headline numbers (marker rate, seed 42)
    
    | Source | Self-implant (`A2__<source>`) | Assistant-QC leakage | Training loss | Wall (min) |
    |---|---|---|---|---|
    | person | **93.0%** | 88.5% | 0.218 | 29.8 |
    | chef   | **92.5%** | 17.0% | 0.224 | 37.7 |
    | pirate | **86.0%** | 0.5%  | 0.226 | 38.3 |
    | child  | **98.5%** | 26.0% | 0.219 | 37.0 |
    | robot  | **95.0%** | 11.0% | 0.229 | 39.6 |
    | *base_model (ref)* | *0.0%* | *0.0%* | — | 25.6 |
    
    The base model's marker rate across the 130 bystanders is exactly 0% (mean = 0.00%, max = 0.00%), confirming a clean noise floor.
    
    ## Coherence-judge summary
    
    Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`), 3 samples per cell, batch API. Out of ~39.3k judge calls across all 5 sources, **1 failure** (chef, 1/7800 = 0.013%). Per-source cells flagged low-coherence (mean < 0.5):
    
    | Source | N cells scored | Mean coherence | Low-coherence cells (<0.5) |
    |---|---|---|---|
    | person | 35 (pilot only) | 0.428 | 23/35 |
    | chef   | 130 | 0.413 | 89/130 |
    | pirate | 130 | 0.444 | 89/130 |
    | child  | 130 | 0.468 | 79/130 |
    | robot  | 130 | 0.484 | 75/130 |
    
    (High low-coherence counts are expected in this regime: many bystander personas are semantically distant from the marker context, and the judge flags a cell as incoherent when the model's response is off-topic or nonsensical. These cells are hatched in the figures so readers can discount them.)
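The flagging rule in the table (a cell is low-coherence when its mean judge score falls below 0.5) can be sketched as below. The input layout, a mapping from cell name to a list of judge scores, is an assumption for illustration; the real `coherence_scores.json` schema is not shown in this thread, and the table's "Mean coherence" column may be aggregated differently.

```python
from statistics import mean

LOW_COHERENCE_THRESHOLD = 0.5  # threshold quoted in the table header


def flag_low_coherence(scores_by_cell, threshold=LOW_COHERENCE_THRESHOLD):
    """Map each cell name to True when its mean judge score is below threshold."""
    return {cell: mean(scores) < threshold for cell, scores in scores_by_cell.items()}


def summarize(scores_by_cell):
    """Per-source summary: cell count, mean of cell means, low-coherence count."""
    flags = flag_low_coherence(scores_by_cell)
    cell_means = [mean(scores) for scores in scores_by_cell.values()]
    return {
        "n_cells": len(flags),
        "mean_coherence": mean(cell_means),
        "n_low": sum(flags.values()),
    }
```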
    
    ## Plan deviations
    
    **1. Worker subprocess import bug (diagnosed + fixed mid-run).** Initial sweep launch completed `person` + `base_model` (both ran in-process), but all 4 parallel worker subprocesses (chef, pirate, child, robot) crashed immediately with `ModuleNotFoundError: explore_persona_space.bystanders_i81`.
    
    Root cause: `/workspace/explore-persona-space/.venv/` is shared between sibling worktrees (i81 and i83). Their editable-install `.pth` (`__editable__.explore_persona_space-0.1.0.pth`) gets rewritten by whichever worktree most recently invoked `uv run`. When i83 invoked `uv run` between i81's sweep launch and its `subprocess.Popen([sys.executable, ...])` worker dispatch, the `.pth` flipped to point to i83/src — which does NOT contain `bystanders_i81.py`. Bare `.venv/bin/python` (used by `subprocess.Popen`) only reads the `.pth` at startup, so the import failed.
    
    Fix (commit `d68e5b9` on branch `issue-81`): prepend `PROJECT_ROOT/src` to `PYTHONPATH` in the worker subprocess `env=` dict in `scripts/run_leakage_i81.py`, making imports robust against any future `.pth` rewrites. Re-ran only the 4 failed sources; `person` and `base_model` results were preserved.
    
    **2. No other deviations.** Same 5 source personas, same 130-bystander list, same recipe (#46 marker-only on-policy LoRA), same coherence judge model/settings.
    
    ## Compute
    
    - **Pod:** pod3 (thomas-rebuttals-3), 8× H100 SXM 80GB.
    - **GPUs used:** 1, 4, 5, 6 (4 parallel workers) for the re-run; GPU 0 was used by `person` during the initial pilot; base-model eval ran on GPU 0 during the initial launch.
    - **Wall time (re-run, 4 sources parallel):** 09:49 → 10:53 UTC = 64 min (includes ~20 min of post-eval WandB artifact upload + ~25 min of coherence-judge batches).
    - **Per-source wall (from run_result):** 37.0-39.6 min for the 4 re-run sources, 29.8 min for `person`. Sum across all 5 sources if serialised: ~182 min + 25.6 min base ≈ 208 min serial-equivalent.
    - **Wall (original attempt):** person=29.8 min + base=25.6 min = 55.4 min.
    - **Total GPU-hours (approx):** 4 GPUs × 64 min + 1 GPU × 55.4 min ≈ **5.2 GPU-hours** (below the
  16. epm:analysis· system
    <!-- epm:analysis v1 -->
    ## Clean-result draft published
    
    - **Clean-result issue:** #88 — *Noun label swaps shift marker leakage more than Big-5 trait descriptors for 4/4 sources (LOW confidence)*
    - **Hero figure:** [`figures/leakage_i81/hero_noun_leakage_matrix.png`](https://raw.githubusercontent.com/superkaiba/explore-persona-space/75ddf29e622073143e9e174768d33ddac4662f4d/figures/leakage_i81/hero_noun_leakage_matrix.png)
    - **Recap (2 sentences):** H2 holds unanimously across the 4 sources with full factorial data — noun label swaps move base-subtracted leakage 3.1× to 122× more than Big-5 trait-descriptor swaps (chef/pirate/child/robot, n=200/cell, seed=42 each). The headline caveat is that assistant-QC leakage spans 0.5% → 88.5% across 5 ostensibly-similar one-word sources at a single seed each — so the direction of H2 is defensible but the magnitudes ride on one adapter per source.
    - **Verifier:** PASS (`scripts/verify_clean_result.py` — TL;DR structure, hero figure, results block shape, reproducibility card, confidence phrasebook, stats framing all PASS; numbers-match-JSON WARN on 47 derived numeric claims expected for prose-level summary statistics).
    <!-- /epm:analysis -->
    
  17. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    ## Reviewer Verdict — PASS
    
    **Verdict:** PASS
    
    **Numerical spot-checks** (11 / 11 verified correct against raw JSON):
    - Self-implant rates (5/5): chef 92.5%, pirate 86.0%, child 98.5%, robot 95.0%, person 93.0% — all match `marker_eval.json` `A2__<source>.rate` exactly.
    - Assistant-QC leaks (5/5): chef 17.0%, pirate 0.5%, child 26.0%, robot 11.0%, person 88.5% — all match.
    - 177× source-variance ratio: 88.5/0.5 = 177 ✓.
    - Base-model emission = 0.0% on all 131 cells (max = min = sum = 0) → base-subtraction is the identity, which the body discloses.
    - H2 recomputed from plan §A.8 estimand (median over {other_noun} of |Δ| vs median over 124 {other_trait,other_L} anchors): chef 6.25/2.00pp (3.1×), pirate 84.25/0.00pp, child 66.50/4.25pp (15.6×), robot 91.25/0.75pp (121.7×) — all four match the issue body to the pp.
    - H3 recomputed per §A.8 (|slope| > bootstrap-95%-CI half-width, B=1000, binomial resampling): chef 10/25, pirate 7/25, child 19/25, robot 6/25 → 42/100. Matches exactly.
    - Coherence < 0.5 cell counts per source aggregated across 20 questions: chef 89/130, pirate 89/130, child 79/130, robot 75/130, person 23/35 — bracket 75–89 in body is correct.
    - Mean cross-noun leak (table column): chef 37.5pp, pirate 0.4pp (actual 0.38), child 24.5pp, robot 3.5pp — all match.
    - `person` bystander-grid has 35 non-assistant bystanders (30 A1 + 5 A2) + `assistant` = 36 total keys; `n_bystanders=35` in metadata; asymmetry vs 130 for others is correctly flagged.
    - N=200 per cell verified (`total=200` in `A1__chef__Agreeableness__L1` and spot-checks).
    - Sample pirate completion in the body matches raw_completions.json verbatim.
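The H2 ratios quoted in the spot-checks follow from a median-over-median comparison, which can be sketched as below. This is one plausible reading of the estimand as summarised in this comment; plan §A.8 itself is not reproduced here, and the function name `h2_ratio` is hypothetical. A zero trait-side median (the pirate case) is returned as `None` rather than as an infinite ratio.

```python
from statistics import median


def h2_ratio(noun_deltas_pp, trait_deltas_pp):
    """Compare median |delta| from noun swaps against median |delta| from trait swaps.

    Both inputs are leakage differences in percentage points. Returns
    (noun_effect, trait_effect, ratio); ratio is None when the trait-side
    median is zero, so the caller must report that case separately.
    """
    noun_effect = median(abs(d) for d in noun_deltas_pp)
    trait_effect = median(abs(d) for d in trait_deltas_pp)
    ratio = noun_effect / trait_effect if trait_effect > 0 else None
    return noun_effect, trait_effect, ratio
```

Plugging in the medians reported above, `6.25 / 2.00`, `66.50 / 4.25`, and `91.25 / 0.75` give roughly 3.1, 15.6, and 121.7, while the pirate pair `84.25 / 0.00` has no finite ratio.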
    
    **Methodology adherence** (plan §A.3 / §A.7 / §A.8 / §A.9):
    - §A.8 H2 estimand implemented as pinned (25 noun-effect cells × 5 nouns = 25 values; 125 trait-anchor cells) — recomputed independently matches the body.
    - §A.8 H3 slope-vs-bootstrap-halfwidth test implemented correctly; 42/100 count reproduces.
    - §A.7 base-subtraction applied but body correctly notes base = 0 everywhere, so it is effectively an identity (no hidden analytical move).
    - §A.3 negative-pool isolation: `training_negatives.json` present per source (confirms monkey-patch fired).
    - §A.9 pilot slice: 35-bystander `person` correctly excluded from H2/H3; asymmetry is flagged in body, Standing caveats, and Next steps.
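The H3 criterion (|slope| greater than the bootstrap 95% CI half-width, with binomial resampling) can be sketched in plain Python. This is an interpretation under assumptions, not the plan's pinned implementation: it fits an ordinary least-squares slope of marker rate against gradation level L1..L5, resamples each cell's count from a binomial with the observed rate, and takes half the width of the percentile interval of the resampled slopes.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def gradation_slope_test(counts, n_per_cell=200, n_boot=1000, seed=0):
    """Return (slope, ci_half_width, passes) for one (source, noun, trait) triple.

    counts[i] is the number of marker emissions at gradation level i + 1.
    """
    rng = random.Random(seed)
    levels = list(range(1, len(counts) + 1))
    rates = [c / n_per_cell for c in counts]
    slope = ols_slope(levels, rates)
    boot = []
    for _ in range(n_boot):
        # Binomial resampling: redraw each cell's count at its observed rate.
        resampled = [
            sum(rng.random() < p for _ in range(n_per_cell)) / n_per_cell
            for p in rates
        ]
        boot.append(ols_slope(levels, resampled))
    boot.sort()
    lower = boot[int(0.025 * n_boot)]
    upper = boot[int(0.975 * n_boot) - 1]
    half_width = (upper - lower) / 2
    return slope, half_width, abs(slope) > half_width
```

Counting the triples for which the third return value is true would give a tally comparable to the 42/100 figure, assuming the plan's estimand matches this reading.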
    
    **Overclaim risk:** LOW
    - Confidence is declared LOW in both the title and the one-line Confidence marker.
    - Headline bullets consistently bold magnitudes + N and explicitly attribute the 177× spread to a single seed ("source-identity variance is swamping any predictable geometric signal" — descriptive, not causal).
    - No effect-size terminology, no named statistical tests in prose, no `value ± err` inline.
    - H4 (cos-sim) correctly labeled SKIPPED — no silent redefinition.
    
    **Structural compliance** (template + ban list + verifier):
    - TL;DR has 4 H3 subsections in order (Background / Methodology / Results / Next steps) — verifier PASS.
    - Results: hero figure (commit-pinned `75ddf29` raw-github URL — blob exists in that commit) + descriptive sentence with %+N inline + `**Main takeaways:**` with 5 bullets + single `**Confidence: LOW** — …` line.
    - Detailed report: all required sections present (Source issues, Setup & hyper-parameters with "why" prose, WandB, Sample outputs, Headline numbers + Standing caveats bullets, Artifacts).
    - `uv run python scripts/verify_clean_result.py` → PASS (47 WARN on "numbers not in referenced JSON" — these are commit hashes, version strings, and bp-ratio descriptors; spot-checked and legitimate).
    - Ban list clean: no effect sizes, no named tests, no credence intervals, no `[Clean Result]` prefix.
    
    **Blocking concerns (FAIL only):**
    - None.
    
    **Non-blocking concerns (CONCERNS / PASS):**
    - Minor rounding/label slip in the table's `person` row: "27.3pp (3-noun subset)" — the actual mean over the 4 available cross-nouns (chef, child, pirate, robot) is 27.125pp, and the "3-noun su
  18. epm:done· system
    <!-- epm:done v1 -->
    ## Issue #81 complete — `status:done-experiment`
    
    **Outcome:** Exploratory factorial sweep successfully produced the 5×130-bystander marker-leakage heatmap. Of the 4 pre-registered hypotheses, H1-H3 were evaluated and H4 was skipped (no cos-sim cache). Clean-result issue published and promoted.
    
    ### Key numbers
    - **H1 (implantation ≥ 0.80):** PASS 5/5 sources — person 93.0%, chef 92.5%, pirate 86.0%, child 98.5%, robot 95.0% (N=200/cell, seed=42).
    - **H2 (noun dominates traits, §A.8 pinned estimand):** PASS unanimously on 4 sources with full 130-bystander eval. Ratios: chef 3.1×, child 15.6×, robot 121.7×, pirate >>1 (zero-denominator).
    - **H3 (gradation slope > bootstrap 95% CI half-width):** WEAK — 42/100 (source, noun, trait) triples; 19/42 from `child` alone.
    - **H4 (cos-sim vs leakage 5×5 submatrix):** SKIPPED — no cos-sim cache for the 5 one-word prompts.
    - **Noise floor:** base model = 0.0% emission on all 131 cells.
    - **Source-variance caveat:** assistant-QC leak spans 0.5% (pirate) → 88.5% (person), a 177× range across 5 source nouns at n=1 seed. Cannot decompose noun-identity vs LoRA/sampling noise.
    
    ### What's confirmed / what's open
    - **Confirmed (LOW confidence, single seed):** marker-only-loss on one-word sources still produces strong implantation. The noun label dominates Big-5 trait-gradation description for the 4 sources with full factorial data.
    - **Open:** (a) full 130-bystander rerun for `person` (currently only has the 35-bystander pilot slice); (b) multi-seed replication of the 177× source-variance claim; (c) decoupling noun-identity from LoRA random-init noise; (d) cos-sim submatrix (H4) — requires a fresh hidden-state extraction pass.
    
    ### Follow-ups (queue candidates)
    - Full 130-bystander eval for `person` (≈30 min, one H100).
    - Multi-seed replication (3 seeds × 5 sources + base = 16 runs: 15 training runs plus 1 base-model eval, ≈ 13 GPU-hr).
    - Cos-sim extraction for the 5 one-word prompts (≈15 min, H4 panel).
    - Phase B (near-twin counterexample hunt) — was scoped as a separate issue.
    
    ### Artifacts
    - **Clean-result issue:** #88 (promoted to `clean-results` label + project column "Clean Results").
    - **Plan:** `.claude/plans/issue-81.md`; issue comment `epm:plan v1`.
    - **Raw eval data:** `eval_results/leakage_i81/{base_model,person,chef,pirate,child,robot}/`.
    - **Figures:** `figures/leakage_i81/{heatmap_5x130_base_subtracted,slice_noun_isolation,slice_trait_gradation,slice_interaction,hero_noun_leakage_matrix}.{png,pdf}` + `coherence_flags.csv`.
    - **HF Hub adapters:** `superkaiba1/explore-persona-space:leakage_i81/{person,chef,pirate,child,robot}_seed42/marker/`.
    - **WandB project:** `thomasjiralerspong/leakage-i81`.
    - **Worktree:** `.claude/worktrees/issue-81/` on branch `issue-81` (commit `d68e5b9`). Not deleted — user's call to clean up.
    - **INDEX.md:** `eval_results/INDEX.md` updated with `leakage_i81/` entry under Aim 3.
    
    ### Compute
    - **Actual:** ~5.2 GPU-hr. **Budgeted:** 7.5 GPU-hr. Under budget — savings from cached on-policy data between the two runs.
    
    ### Plan deviations
    - Worker-subprocess import bug (shared `.venv` editable-install race with sibling worktree) — diagnosed + fixed via `PYTHONPATH` injection in worker `env=` (commit `d68e5b9`). Required re-running chef/pirate/child/robot. Documented in `epm:progress v4–v6` + `epm:results v1`.
    - GPUs 1,4,5,6 (re-run) instead of original 0,1,2,3,4 (GPU 0 occupied by another user).
    - `person` source re-uses its pilot-era eval (35 bystanders) rather than the full 130. Flagged in the clean-result as the top follow-up.
    
    ### Reviewer verdict
    PASS with 1 cosmetic non-blocker. All 11 numerical spot-checks reproduced exactly from raw JSONs. `scripts/verify_clean_result.py` clean. Verdict comment: https://github.com/superkaiba/explore-persona-space/issues/81#issuecomment-4295741237.
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->
    
  19. epm:progress· system
    <!-- epm:progress v7 -->
    ## Follow-up dispatched — `person` full 130-bystander eval
    
    Per user request (2026-04-22): re-running the `person` source with the full 130-bystander factorial to close the asymmetry in #88 (currently `person` has only the 35-bystander pilot slice).
    
    **Scope:** eval-only (reuse existing `leakage_i81/person_seed42/marker/` adapter on HF Hub or merged dir on pod3 if still present — same seed=42, same recipe, no retraining needed). New eval on the 95 bystanders not covered by the pilot + full 131-persona eval. Then coherence judge on the 95 new cells. Then update the analysis script and post an updated `epm:analysis` on #81 + amend #88 body.
    
    **Compute:** ~1 GPU-hr (one H100 for ~40-min vLLM eval + ~15 min coherence judge batch).
    
    Tracked inline on this issue; not reopening `status:*`.
    <!-- /epm:progress -->
    
  20. epm:progress· system
    <!-- epm:progress v8 -->
    
    **Launched person full-130 eval to close 35→130 asymmetry flagged in #88**
    
    - Pod: `pod3` (thomas-rebuttals-3, 8xH100)
    - GPU: 0 (idle, fresh container)
    - PID: 1039454
    - Branch: `issue-81` @ 566d03b
    - Command: `PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=0 nohup uv run python scripts/eval_person_full_i81.py --seed 42 --gpu 0 --run-coherence > logs/leakage_i81_person_full.log 2>&1 &`
    - Log: `logs/leakage_i81_person_full.log`
    - Adapter: `superkaiba1/explore-persona-space :: leakage_i81/person_seed42/marker` (HF Hub, uploaded during #81 sweep)
    - Output: `eval_results/leakage_i81/person_full130/` (new subdir — preserves pilot-slice `person/` dir for provenance)
    - WandB: will post once vLLM loads
    
    Status: adapter downloaded (~50MB), merge into Qwen-2.5-7B-Instruct in progress.
    
    No retraining — single seed, same recipe. Same vLLM config as main sweep (T=1.0, top-p=0.95, max_tokens=512, n=10, gpu_mem_util=0.60, max_model_len=2048, seed=42). 131 eval personas (130 bystanders + assistant QC) × 20 questions × 10 completions = 26,200 generations. Expected wall-time: ~25-35 min. Coherence judge auto-fires after eval.
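The generation budget quoted above can be checked with a few lines of arithmetic. The dictionaries below copy the settings from this comment; their key names follow vLLM's usual parameter names (`SamplingParams` for the first, engine arguments for the second), but how `eval_person_full_i81.py` actually passes them is an assumption.

```python
# Sampling settings as quoted in the comment (key names assumed to mirror vLLM).
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "max_tokens": 512, "n": 10, "seed": 42}
# Engine settings as quoted in the comment.
ENGINE = {"gpu_memory_utilization": 0.60, "max_model_len": 2048, "seed": 42}

N_BYSTANDERS = 130   # factorial grid (125) plus pure-noun bystanders (5)
N_QC_PERSONAS = 1    # the assistant QC persona
N_QUESTIONS = 20


def completions_per_persona(n_questions=N_QUESTIONS, n_samples=SAMPLING["n"]):
    """Completions scored for one persona cell."""
    return n_questions * n_samples


def total_generations(n_personas=N_BYSTANDERS + N_QC_PERSONAS):
    """Total completions for one full eval pass."""
    return n_personas * completions_per_persona()


print(completions_per_persona(), total_generations())
```

The per-persona figure (20 questions × 10 samples) appears consistent with the N=200 per cell cited in the reviewer spot-checks.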
  21. epm:progress· system
    <!-- epm:progress v9 -->
    
    Hidden-state extraction launched on pod3 GPU 1 (GPU 0 reserved for concurrent `person_full130` eval).
    
    - Script: `scripts/extract_hidden_states_i81.py` @ commit `3ab94ab` (branch `issue-81`)
    - Recipe: matches #66 — 20 `EVAL_QUESTIONS`, chat-template + `add_generation_prompt=True`, last non-pad token, layers {10,15,20,25} (headline 20), mean → centroid
    - Personas: 135 (5 `src_*` + 130 factorial bystanders)
    - Output: `eval_results/leakage_i81/cosine_vectors_i81.npz` + `cosine_manifest.json` + `cosine_metadata.json`
    - Launch PID 1042280, log `logs/i81_hidden_states.log`
    - Expected wall: ~15 min (135 × 20 = 2700 forward passes on H100)
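The "mean → centroid" step of the recipe, and the cosine similarity later computed between centroids, can be sketched without any model code. This assumes each persona yields one hidden-state vector per eval question at a given layer (20 vectors per persona here); the model forward pass and last-token selection are deliberately left out, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of equally sized vectors (one per eval question)."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def persona_similarity(source_vectors, bystander_vectors):
    """Cosine similarity between the two personas' centroids."""
    return cosine(centroid(source_vectors), centroid(bystander_vectors))
```

Averaging over questions before taking the cosine means the similarity reflects the persona prompt rather than any single question.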
  22. epm:analysis· system
    <!-- epm:analysis v2 -->
    ## Second clean-result draft published — trait-variation ranking follow-up
    
    - **Clean-result issue:** #92 — *Representation distance separates Big-5 axes but marker leakage does not; Agreeableness L1 is the lone dual outlier (LOW confidence)*
    - **Hero figure:** `figures/leakage_i81/trait_ranking/fig_hero_compact.png` (commit `48972a0`)
    - **Recap (2 sentences):** Post-hoc ranking of the 25 (Big-5 trait × gradation level) variations in #81's factorial shows that base-model layer-20 representation distance genuinely separates the 5 axes (permutation p<0.0001, Agreeableness has the only non-overlapping 95% bootstrap CI), while marker leakage does NOT (inter-axis spread 2pp, permutation p=0.97 vs random axis labels). Agreeableness L1 ("cold/confrontational") is the sole dual outlier — #1 on both Δ_leakage (26.1pp) and Δ_cos (0.160, N=25) — and the global rank correlation ρ=0.537 (p=0.006, N=25) drops to ρ=0.258 (p=0.21) once same-noun diagonal cells are excluded.
    - **Verifier:** PASS (`scripts/verify_clean_result.py` — only the expected "derived numerics not in summary.json" WARN).
    - **Not a revision of #88** — this is a SEPARATE follow-up analysis (trait-variation ranking + level trajectories + person_full130 re-eval), layered on #81's raw completions and a new base-model cosine pass. #88 remains the canonical clean-result for the noun-vs-trait H2 estimand.
    <!-- /epm:analysis -->
    
  23. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v2 -->
    ## Reviewer Verdict (follow-up for #92) — CONCERNS
    
    **Clean-result under review:** #92
    
    **Verdict:** CONCERNS
    
    **Numerical spot-checks** (9 / 11 verified correct, 2 wrong):
    
    - ✓ Per-axis Δ_leak means (Agree 21.74, Extra 21.08, Neuro 20.16, Open 19.75, Cons 19.70) reproduce from `per_cell_ranking.csv` — matches body's 21.7 / 21.1 / 20.2 / 19.8 / 19.7 within rounding (Openness 19.75 → body says 19.8; should round to 19.7).
    - ✓ Per-axis Δ_cos means (0.0718 / 0.0517 / 0.0504 / 0.0455 / 0.0415) reproduce; 3pp spread confirmed.
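    - ✓ Permutation test (B=10k, seed=42, shuffle trait labels across 625 rows): Δ_leak spread p = 0.9748 (body: 0.97 ✓), Δ_cos spread p < 0.0001 (body: <0.0001 ✓). Null leak-spread 95pct = 9.02pp; observed 2.04pp sits in the lower tail of the null distribution (p = 0.9748 means ~97% of shuffles give a spread at least as large), not at its median.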
    - ✓ Full 25-point ρ = 0.5370, p = 0.0056 (body: 0.537, p=0.006 ✓).
    - ✓ **ρ = 0.2582, p = 0.2127 reproduces when source==noun rows are dropped BEFORE aggregation** (500 raw rows → 25 (trait,level) points averaged over 20 source-noun combos each). Body's "N=25" is correct for the Spearman (the ranking is still 25 points); the "5 same-noun cells dropped" phrasing refers to the per-(trait,level) aggregation, not to the ρ's N. Consistent, but one reader-unfriendly ambiguity noted below.
    - ✓ Per-source ρ: person 0.749, chef 0.659, robot 0.455, child 0.323, pirate −0.008 all reproduce exactly (all N=25).
    - ✓ Agreeableness L1 is rank-1 on both metrics (Δ_leak 0.2614, Δ_cos 0.1598); no other of 24 cells exceeds Δ_cos 0.10. Confirmed.
    - ✓ `person_has_full130 = True` in summary.json; `eval_results/leakage_i81/person_full130/marker_eval.json` exists. 35→130 asymmetry closed.
    - ✓ Peak-level claims for Conscientiousness (L5 leak 22.7pp / L1 cos 0.097) and Neuroticism (L1 leak 22.5pp / L4 cos 0.077) reproduce exactly.
    - ✗ **"14.9% bland-baseline (Agreeableness L3)"** in the TL;DR figure-description sentence is wrong. Actual mean `rate_a1` for Agreeableness L3 across 25 source-noun cells is **13.72%**, not 14.9%. Ag L1 rate 7.2% is correct; Ag L1 cos 0.744 is correct.
    - ✗ **"a gap of ~0.14 below the next-lowest trajectory"** is wrong. Actual gap between Ag L1 cos (0.744) and next-lowest point (Conscientiousness L1 at 0.817) is **~0.07**, not 0.14. No axis-mean or per-(trait,level) interpretation produces 0.14; off by ~2x.
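The axis-label permutation test referenced in the spot-checks can be sketched as follows. It is an illustrative reading of the procedure (shuffle trait labels across rows, recompute the spread of per-axis means, count how often the shuffled spread is at least the observed one); the function names are hypothetical and the analysis script's exact statistic is not shown in this thread.

```python
import random
from collections import defaultdict


def axis_spread(labels, values):
    """Max minus min of the per-label means."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        sums[label] += value
        counts[label] += 1
    means = [sums[label] / counts[label] for label in sums]
    return max(means) - min(means)


def permutation_p(labels, values, n_perm=10_000, seed=42):
    """Return (observed_spread, p) where p is the share of shuffles with spread >= observed."""
    rng = random.Random(seed)
    observed = axis_spread(labels, values)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if axis_spread(shuffled, values) >= observed:
            hits += 1
    return observed, hits / n_perm
```

Under this definition a p-value near 1 means almost every random relabelling produces a spread at least as large as the observed one, so the observed axes are no more separated than chance.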
    
    **Methodology adherence:**
    - Post-hoc framing acknowledged in three places (TL;DR Confidence line, Setup prose, Standing caveats) ✓.
    - Cosine geometry clearly specified as Base(source) ↔ Base(bystander), layer 20, last-system-prompt-token ✓.
    - Single seed per source stated ✓.
    - Null framing uses axis-permutation null, not a different null — "2pp spread indistinguishable from noise given within-axis cell variance at N=200/cell" is permissible prose (does not conflate the permutation null with a Bernoulli-noise null, though it gestures at within-cell noise as the underlying driver).
    - `person_full130/` was actually used (verified in summary.json) ✓.
    - The ρ-drop estimand IS what the body describes: drop the 125 rows where source==noun from raw 625, re-aggregate to 25 (trait,level) points, recompute ρ. The estimand was NOT silently changed.
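The ρ-drop estimand in the last bullet (drop source==noun rows, re-aggregate to one point per (trait, level), recompute the rank correlation) can be sketched in plain Python. The row field names (`source`, `noun`, `trait`, `level`, `d_leak`, `d_cos`) are hypothetical stand-ins for whatever columns `per_cell_ranking.csv` uses, and Spearman's ρ is computed here as the Pearson correlation of average ranks.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_off_diagonal(rows):
    """Return (rho, n_points) after dropping rows where source == noun."""
    groups = defaultdict(list)
    for row in rows:
        if row["source"] == row["noun"]:
            continue  # same-noun diagonal cell, excluded before aggregation
        groups[(row["trait"], row["level"])].append((row["d_leak"], row["d_cos"]))
    leak = [mean(p[0] for p in pts) for pts in groups.values()]
    cos = [mean(p[1] for p in pts) for pts in groups.values()]
    return pearson(average_ranks(leak), average_ranks(cos)), len(groups)
```

Because the filter runs before aggregation, the number of ranked points stays at one per (trait, level) pair, which is why the N for ρ remains 25 in the reviewer's recomputation.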
    
    **Overclaim risk:** medium
    - The TL;DR figure-description sentence presents two numerical claims (14.9% and 0.14 gap) that neither reproduce from the data nor appear in the underlying CSVs. Both land in a high-visibility slot (the sentence directly under the hero figure) and push readers toward a larger apparent effect than the data supports. The body's "gap of ~0.14" quietly inflates a ~0.07 representation gap by 2×.
    - Sample-output label inconsistency: `A1__chef__Agreeableness__L1 … source=person, Δ_leakage = 45.3pp` — the (person, chef, Ag, L1) cell has Δ_leakage = 6pp; 45.3pp is the source=chef per-noun mean for (Ag, L1). The label says "source=person" but cites a number only defensible for source=chef. Cherry-picked illustration error, not a headline error, but misleading.
    
    **Structural compliance:**
    - `uv ru
  24. epm:analysis· system
    <!-- epm:analysis v3 -->
    ## Merged clean-result — #88 is now canonical; #92 superseded
    
    The two prior clean-results (#88 original + #92 follow-up) have been merged into a single canonical issue organized around the question: *which Big-5 axis has the most effect on cosine similarity / marker leakage of "persona-with-adjective" vs "persona-without-adjective", and is that adjective effect bigger or smaller than swapping the noun?*
    
    - **Canonical clean-result:** #88 — *"Big-5 adjective effect on marker leakage dominates pairwise noun-swap 5x across all 5 axes; cosine flips on 4 of 5 (LOW confidence)"*.
    - **Hero figure:** [`merged_hero.png`](https://raw.githubusercontent.com/superkaiba/explore-persona-space/5e80949252b6a3ea19dbb9b6cd1a73dd3643141f/figures/leakage_i81/merged_hero.png) — 2-panel bar chart, per-axis trait-adjective vs noun-swap for leakage + cosine.
    - **#92 status:** `clean-results` label removed; `superseded` label added; kept in `Clean Results` project column for history; note cross-linking to #88 posted.
    - **Confidence:** LOW (single seed; one cosine layer; the head-to-head is post-hoc; rank correlation fragile to 5 diagonal cells).
    - **New commit:** `5e80949` on `issue-81` branch adds `figures/leakage_i81/merged_hero.{png,pdf}` and `eval_results/leakage_i81/trait_ranking/head_to_head.json`.
    - **Verifier:** PASS (`scripts/verify_clean_result.py`).
    
    **Key numerical result:** On marker leakage, adjective-effect mean-|Δ| exceeds pairwise noun-swap mean-|Δ| on all 5 Big-5 axes, with ratios ranging from 4.6× (Openness) to 6.6× (Agreeableness). On base-model cosine (layer 20), the ordering reverses for 4 of 5 axes; Agreeableness is the sole exception, where the adjective effect still exceeds the noun-swap effect (adjective 0.072 vs noun 0.060, ratio 1.19×). N=125 per-axis cells (5 sources × 5 nouns × 5 levels).
    <!-- /epm:analysis -->
    
  25. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v3 -->
    ## Reviewer Verdict (merged #88) — CONCERNS
    
    **Canonical clean-result under review:** #88
    
    **Verdict:** CONCERNS
    
    **Numerical spot-checks** (reproduced independently from raw data):
    - Descriptor-Δ leakage per axis — all 5 values reproduce from `per_cell_ranking.csv`: Agr 21.74, Ext 21.08, Neu 20.16, Ope 19.75, Con 19.70 pp ✓
    - Descriptor-Δ cosine per axis — all 5 reproduce: Agr 0.0718, Ext 0.0517, Con 0.0504, Neu 0.0455, Ope 0.0415 ✓
    - Noun-Δ per axis — all 5 leak and cos values reproduce (median over 10 noun-pairs, then mean over 25 contexts per axis) ✓
    - All 10 ratios (4.6×/4.9×/6.0×/6.6×/6.0× leak; 0.64×/0.84×/0.96×/1.19×/0.79× cos) reproduce ✓
    - Permutation p-values: p_leak = 0.971 (body says 0.97), p_cos < 0.0001 ✓
    - Pairwise Agr-vs-others cosine permutation: all four p < 0.005 ✓ (Agr-Ext 0.0023, Agr-Con 0.0031, Agr-Neu 0.0001, Agr-Ope 0.0000)
    - `A2__person` rate = 0.93 ✓; `A1__person__Extraversion__L5` rate = 0.015, Δ=91.5pp ✓
    - `person_full130` confirmed: 131 keys (125 A1 + 5 A2 + 1 assistant QC); self-implant rate 93% reproduces pilot ✓
    - `A1__chef__Agreeableness__L1` — **MISMATCH**: body says "rate = 6 %" but raw data (`person_full130/marker_eval.json`) gives rate = 0.0 for this cell; the 6% is the `A2/chef` rate. The 6% also appears as `delta_leakage` in the CSV for this row. Sample-output label is wrong.
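
    The rate-versus-delta confusion in the last bullet can be checked with a small arithmetic sketch. It assumes, as the quoted figures suggest but the checks above do not state outright, that `delta_leakage` is the absolute gap between the pure-noun (A2) rate and the descriptor-cell (A1) rate, in percentage points. The helper name `delta_leakage_pp` is hypothetical; the four rates are the ones quoted in the spot-checks.

```python
def delta_leakage_pp(baseline_rate: float, cell_rate: float) -> float:
    """Absolute gap between an A2 (pure-noun) rate and an A1 (descriptor) rate, in pp.

    Assumed definition, inferred from the quoted numbers; not taken from the repo.
    """
    return round(abs(baseline_rate - cell_rate) * 100, 1)


# A2__person = 0.93 and A1__person__Extraversion__L5 = 0.015 (quoted above).
person_ext_l5 = delta_leakage_pp(0.93, 0.015)

# A2/chef = 0.06 and A1__chef__Agreeableness__L1 = 0.0 (quoted above).
chef_agr_l1 = delta_leakage_pp(0.06, 0.0)

print(person_ext_l5, chef_agr_l1)
```

    Under that assumption the chef cell has a rate of 0 % and a delta of 6 pp, which is consistent with the 6 % in the body being a delta (or the A2/chef baseline) rather than the cell's own rate.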
    
    **Methodology adherence:**
    - Estimand in body matches `head_to_head.json` methodology string exactly ✓
    - Cosine geometry ("base Qwen-2.5-7B-Instruct, last-token of system-prompt span, layer 20") matches `cosine_metadata.json` ✓
    - Post-hoc framing called out explicitly (Standing caveats bullet 3, Why-confidence bullet 1) ✓
    - `person_full130` used (not pilot) — verified: 130 bystanders, not 35 ✓
    
    **Overclaim risk:** low-to-medium
    - The body faithfully reports the axis-interchangeable finding on leakage (p=0.97) and the "only Agreeableness stands out" finding on cosine — no upward-spin.
    - The "#77 confirmed for leakage" framing is fair: #77 claimed behavioral-style > semantic-label on both metrics, and the balanced 5×130 factorial confirms leakage only.
    - The head-to-head table caption says "N = 125 per axis" but noun-Δ is actually N=25 contexts per axis (5 sources × 5 levels, trait fixed). This is not a spin issue — the raw counts are in `head_to_head.json` — but the table header is ambiguous.
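
    The two sample sizes in the last bullet can be made concrete with a counting sketch. It enumerates the factorial for one trait axis and applies the aggregation described in the spot-checks (median over noun pairs, then mean over contexts). The function name `noun_swap_delta` and the toy rates are illustrative assumptions, not the contents of `head_to_head.json`.

```python
from itertools import combinations, product
from statistics import mean, median

NOUNS = ["person", "chef", "pirate", "child", "robot"]
SOURCES = list(NOUNS)  # the five trained source personas use the same nouns
LEVELS = ["L1", "L2", "L3", "L4", "L5"]

# Cells behind a descriptor-effect estimate for one trait axis.
descriptor_cells = list(product(SOURCES, NOUNS, LEVELS))
# Contexts behind a noun-swap estimate: trait fixed, noun varies within a context.
noun_contexts = list(product(SOURCES, LEVELS))
noun_pairs = list(combinations(NOUNS, 2))


def noun_swap_delta(rate_pp):
    """Median |difference| over noun pairs per context, then mean over contexts."""
    per_context = []
    for source, level in noun_contexts:
        gaps = [
            abs(rate_pp[(source, a, level)] - rate_pp[(source, b, level)])
            for a, b in noun_pairs
        ]
        per_context.append(median(gaps))
    return mean(per_context)


# Toy values only (NOT the experiment's data): rate depends on noun index alone.
toy_rates = {(s, n, lvl): 10 * NOUNS.index(n) for s, n, lvl in descriptor_cells}

print(len(descriptor_cells), len(noun_contexts), len(noun_pairs))
print(noun_swap_delta(toy_rates))
```

    The counts come out as 125 descriptor cells, 25 noun-swap contexts, and 10 noun pairs per context, which is why a single "N = 125 per axis" caption does not describe both columns of the table.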
    
    **Structural compliance:**
    - TL;DR: 4 H3 subsections in order (Background, Methodology, Results, Next steps) ✓
    - Results: hero → N-annotated prose → Main takeaways (2 bullets, each with *Updates me:*) → single Confidence line ✓
    - Title: no `[Clean Result]` prefix, ends with `(LOW confidence)` matching the in-body Confidence line ✓
    - Detailed report: all required sections present (Source issues, Setup & hyper-parameters with "why / alternatives" prose opener + filled Reproducibility card, WandB, Sample outputs, Headline numbers with Standing caveats, Why confidence is where it is, Artifacts) ✓
    - `#92` cross-ref present as "superseded, merged into this issue" ✓
    - Stats framing: no effect sizes, no named tests in prose other than "permutation test" (methodological noun, allowed); no `value ± err` in prose ✓
    - `uv run python scripts/verify_clean_result.py` → **PASS** (with 37 numeric-claim WARNs only, all for claims derived from the CSV rather than from the referenced JSON)
    
    **Blocking concerns (FAIL only):** none
    
    **Non-blocking concerns (CONCERNS / PASS):**
    1. **Sample-output factual error**: `A1__chef__Agreeableness__L1` is labelled "rate = 6 %" in the Sample outputs block, but the raw rate in `person_full130/marker_eval.json` for that cell is 0.0. The 6 % appears to be either the `A2/chef` baseline or the `delta_leakage` for that row. The qualitative framing (chef bystander with "cold/confrontational" descriptor produces a marker-absent completion) is correct, but the quoted rate is wrong. Simple fix: change "rate = 6 %" to "rate = 0 %" or "delta vs A2/chef = 6 pp".
    2. **Head-to-head N caption is ambiguous**: 
