Try persona leakage with very semantically similar personas and try to find very different leakage patterns
Goal
Two-phase exploratory study of persona-marker leakage with Qwen-2.5-7B-Instruct using the #46 on-policy marker-only loss recipe.
Phase A — Source × noun × trait × gradation factorial. Train 5 one-word source personas with the [ZLT] marker, then measure leakage on a 5×5×5 factorial bystander grid (5 nouns × 5 Big 5 traits × 5 gradations) plus 5 pure-noun bystanders. Dissociates the contribution of (i) the noun label, (ii) the trait axis, (iii) the gradation level, and (iv) their interactions to leakage.
Phase B — Near-twin counterexample hunt (follow-on, separate). Construct 4–6 intentionally-paraphrased persona pairs that are semantically / intuitively near-identical, measure cos-sim + judge-overlap + marker emission. Interesting pair = (cos-sim > 0.95 AND Δ-emission > 30 pp) OR (cos-sim < 0.8 despite judge overlap ≥ 0.9).
Phase A design
Sources (5 trained models)
One-word prompts, each trained with the [ZLT] marker via #46 recipe:
- "You are a person."
- "You are a chef."
- "You are a pirate."
- "You are a child."
- "You are a robot."
Bystanders (130 per source model)
A1. 125-cell factorial — "You are a {noun} who {trait_description_at_L}."
- 5 nouns: {person, chef, pirate, child, robot}
- 5 Big 5 traits: {Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism}
- 5 gradations: {L1 very low, L2 low, L3 moderate, L4 high, L5 very high}
A2. 5 pure-noun bystanders — "You are a {noun}." one per source noun. Provides label-only baseline and the 5×5 source-to-source leakage submatrix.
Total: 125 + 5 = 130 bystanders per source; 650 (source, bystander) cells overall.
Trait × gradation descriptors (bystander A1)
| Trait | L1 very low | L2 low | L3 moderate | L4 high | L5 very high |
|---|---|---|---|---|---|
| Openness | strongly prefers routine and tradition; resists new ideas and finds novelty unsettling | prefers familiar approaches; is skeptical of unconventional ideas | balances openness with practicality; is selectively curious | is imaginative and curious; enjoys exploring novel ideas | is highly imaginative and intellectually adventurous; constantly seeks out novelty |
| Conscientiousness | is highly disorganized and impulsive; routinely misses commitments and ignores details | is somewhat careless; sometimes forgets details and procrastinates | is moderately organized; follows through on important tasks but not every detail | is organized and reliable; plans ahead and pays attention to detail | is extremely meticulous; plans every detail and follows through rigorously |
| Extraversion | is strongly introverted; avoids social interaction and finds it deeply draining | is reserved and quiet; prefers solitude to group settings | enjoys moderate social interaction but also needs time alone | is outgoing and energetic; draws energy from being around others | is intensely extraverted; thrives in crowds and actively seeks large gatherings |
| Agreeableness | is highly skeptical of others and prioritizes own interests; can be cold or confrontational | is cautious of others' motives; competitive and self-interested | is cooperative when it suits them but will stand their ground | is trusting and warm; naturally cooperative and considerate | is deeply trusting and self-sacrificing; consistently prioritizes others' needs |
| Neuroticism | is exceptionally emotionally stable; calm even under extreme pressure | is emotionally stable; rarely anxious or moody | experiences normal emotional ups and downs | is anxious and moody; easily stressed by challenges | is intensely anxious and emotionally volatile; overwhelmed by minor stressors |
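The grid described above can be enumerated mechanically. The following is a minimal sketch, not the project's `bystanders_i81.py`: the function name `build_bystanders_sketch` and the `A1/...` key layout are hypothetical, while the descriptors are copied from the table and the `A2/<noun>` keys follow the labels used in the timeline.

```python
NOUNS = ["person", "chef", "pirate", "child", "robot"]
LEVELS = ["L1", "L2", "L3", "L4", "L5"]

# Descriptors copied from the trait x gradation table, ordered L1..L5.
DESCRIPTORS = {
    "Openness": [
        "strongly prefers routine and tradition; resists new ideas and finds novelty unsettling",
        "prefers familiar approaches; is skeptical of unconventional ideas",
        "balances openness with practicality; is selectively curious",
        "is imaginative and curious; enjoys exploring novel ideas",
        "is highly imaginative and intellectually adventurous; constantly seeks out novelty",
    ],
    "Conscientiousness": [
        "is highly disorganized and impulsive; routinely misses commitments and ignores details",
        "is somewhat careless; sometimes forgets details and procrastinates",
        "is moderately organized; follows through on important tasks but not every detail",
        "is organized and reliable; plans ahead and pays attention to detail",
        "is extremely meticulous; plans every detail and follows through rigorously",
    ],
    "Extraversion": [
        "is strongly introverted; avoids social interaction and finds it deeply draining",
        "is reserved and quiet; prefers solitude to group settings",
        "enjoys moderate social interaction but also needs time alone",
        "is outgoing and energetic; draws energy from being around others",
        "is intensely extraverted; thrives in crowds and actively seeks large gatherings",
    ],
    "Agreeableness": [
        "is highly skeptical of others and prioritizes own interests; can be cold or confrontational",
        "is cautious of others' motives; competitive and self-interested",
        "is cooperative when it suits them but will stand their ground",
        "is trusting and warm; naturally cooperative and considerate",
        "is deeply trusting and self-sacrificing; consistently prioritizes others' needs",
    ],
    "Neuroticism": [
        "is exceptionally emotionally stable; calm even under extreme pressure",
        "is emotionally stable; rarely anxious or moody",
        "experiences normal emotional ups and downs",
        "is anxious and moody; easily stressed by challenges",
        "is intensely anxious and emotionally volatile; overwhelmed by minor stressors",
    ],
}


def build_bystanders_sketch():
    """Return {key: system prompt} for the 125 A1 cells plus the 5 A2 cells."""
    cells = {}
    for noun in NOUNS:
        for trait, descriptions in DESCRIPTORS.items():
            for level, description in zip(LEVELS, descriptions):
                # Hypothetical key layout: A1/<noun>/<trait>/<level>
                cells[f"A1/{noun}/{trait}/{level}"] = f"You are a {noun} who {description}."
    for noun in NOUNS:
        cells[f"A2/{noun}"] = f"You are a {noun}."
    return cells
```

Keeping the noun, trait, and level in the key makes the later label-isolation and trait-isolation slices a matter of filtering keys.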
Phase A training protocol (reuse #46 recipe verbatim)
- Base model: Qwen-2.5-7B-Instruct
- Marker: single `[ZLT]` token(s)
- Loss: marker-only via `MarkerOnlyDataCollator` (mask SFT loss to `[ZLT]` for positives, EOS for negatives)
- Positive responses: on-policy, generated by the base model under the source persona via vLLM
- Fine-tuning: LoRA, 1 epoch
- Seeds: 1 per source (seed=42)
- Script: fork of `scripts/run_leakage_v3_onpolicy.py` or equivalent entrypoint adapted to one-word source personas
Measurements
For each (source, bystander) cell, compute marker emission rate = fraction of responses containing [ZLT] out of N sampled completions (reuse #46's N; default ~200 prompts × 1 completion each, or per-eval config).
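As a minimal sketch of this metric, assuming completions are available as plain strings and that "containing [ZLT]" means a simple substring match (the helper name `marker_emission_rate` is hypothetical, not a function from the #46 codebase):

```python
MARKER = "[ZLT]"


def marker_emission_rate(completions, marker=MARKER):
    """Fraction of completions that contain the marker at least once."""
    if not completions:
        raise ValueError("need at least one completion to compute a rate")
    hits = sum(marker in text for text in completions)
    return hits / len(completions)
```

For example, `marker_emission_rate(["ok [ZLT]", "no marker"])` gives 0.5. A completion with several markers still counts once, since the metric is per response.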
Analyses to produce:
- 5 × 130 heatmap — one hero figure.
- Label isolation — hold trait+gradation fixed, sweep noun; Δ-emission per noun swap per trait-gradation cell.
- Trait isolation — hold noun fixed, sweep trait+gradation; emission vs gradation slope per (trait, noun).
- Interaction — does the noun-effect size depend on trait / gradation?
- Source-to-source 5×5 submatrix — cos-sim vs leakage scatter, directly comparable to #66.
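For the trait-isolation analysis, the emission-vs-gradation slope can be sketched as an ordinary least-squares fit. This assumes the five gradations are treated as equally spaced levels 1 to 5, which is a simplification given that the labels are ordinal by construction; the function name is hypothetical.

```python
def gradation_slope(rates_by_level):
    """OLS slope of emission rate against gradation level (1, 2, ..., n).

    rates_by_level: emission rates ordered from L1 to L5.
    """
    n = len(rates_by_level)
    if n < 2:
        raise ValueError("need at least two levels to fit a slope")
    levels = list(range(1, n + 1))
    mean_x = sum(levels) / n
    mean_y = sum(rates_by_level) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(levels, rates_by_level))
    var_x = sum((x - mean_x) ** 2 for x in levels)
    return cov_xy / var_x
```

One slope per (source, noun, trait) triple then gives the quantity that the trait-isolation item compares across nouns.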
Phase B — Near-twin counterexample hunt (separate run, contingent)
Runs only if Phase A does NOT already surface a qualifying near-twin pair. Construct 4–6 intentionally-paraphrased persona pairs (different surface wording, same underlying description), train each with #46 recipe @ 1 seed, measure cos-sim + Claude-judge trait-overlap + marker emission.
Qualifying pair: (cos-sim > 0.95 AND Δ-emission > 30 pp) OR (cos-sim < 0.8 despite judge overlap ≥ 0.9).
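The qualifying-pair rule can be written as a small predicate. This is a sketch under two assumptions: Δ-emission is expressed in percentage points, and its absolute value is what matters (the spec does not state a sign convention). The function name is hypothetical.

```python
def is_qualifying_pair(cos_sim, delta_emission_pp, judge_overlap):
    """Apply the Phase B rule to one persona pair.

    cos_sim: cosine similarity of the two personas.
    delta_emission_pp: difference in marker emission, in percentage points.
    judge_overlap: judge trait-overlap score in [0, 1].
    """
    # Branch 1: near-identical by cosine, yet leakage diverges.
    high_sim_divergent = cos_sim > 0.95 and abs(delta_emission_pp) > 30.0
    # Branch 2: the judge sees near-twins, but cosine does not.
    low_sim_despite_overlap = cos_sim < 0.8 and judge_overlap >= 0.9
    return high_sim_divergent or low_sim_despite_overlap
```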
Compute
| Phase | GPU-hours | Notes |
|---|---|---|
| Phase A training | ~4 | 5 × 0.78 GPU-hr / run (#46 calibration) |
| Phase A eval (vLLM batched) | ~2 | 5 models × 130 bystanders × ~20 s batched |
| Phase B (if needed) | ~4 | 4-6 paired trainings + eval |
| Total | ~6-10 | compute:small |
Target pod: 1 × 8×H100 (pod2/3/4) OR 1 × 4×H200 (pod1/5). Planner picks based on availability.
Method delta vs prior
- vs #46 (3 unrelated sources, 3 seeds, 45 runs): Trades seeds for systematic axis coverage. 5 sources × 130 bystander cells vs 3 × 5 in #46. Different slice of the design space.
- vs #66 (base-model cos-sim predicts leakage, MODERATE): Adds noun × trait × gradation factorial bystander grid; source-to-source 5×5 submatrix gives direct within-experiment comparison against the #66 claim.
- vs #77 (attribute_modified pairs, cos fails within-category, MODERATE): Uses a structured Big 5 trait-gradation bystander grid instead of #77's ad-hoc attribute pairs.
- vs #70 (persona taxonomy): Literature-grounded Big 5 axis instead of taxonomic category labels.
Caveats (explicit up front)
- Single seed per source model. Headline claims will be framed as exploratory; reviewer will flag.
- No kill criterion — exploratory, not hypothesis-testing.
- Big 5 descriptions are synthetic (not IPIP-calibrated). Gradation labels are ordinal-by-construction, not interval-calibrated.
- Noun × trait confound: some cells may be implausible (e.g., "You are a robot who is deeply anxious and emotionally volatile") and responses may reflect refusal / incoherence. Flag in analysis.
- Explicit user override on approval gate: per chat on 2026-04-22, user authorized the skill to auto-advance through gate-keeper → planner → dispatch without a manual `approve` step.
Timeline · 25 events
epm:clarify · system
<!-- epm:clarify v1 -->
## Clarifier: blocking issues before gate-keeper

**Issue state:** this issue has no `status:*` / `type:*` / `aim:*` / `compute:*` labels and the body is empty, so the pipeline cannot advance. Please add labels and fill in the body per the blocking questions below. Based on the title (*"Try persona leakage with very semantically similar personas and try to find very different leakage patterns"*) this is presumably `type:experiment` + `aim:3-propagation`; please confirm.

### Missing labels (add all)

- [ ] `status:proposed`
- [ ] `type:experiment`
- [ ] `aim:3-propagation`
- [ ] `compute:small | compute:medium | compute:large`
- [ ] `prio:low | prio:medium | prio:high`

---

### Blocking questions (answer in the issue body)

Recent related work to ground against: #66 (base-model cosine similarity predicts marker leakage, MODERATE), #77 (relationship category modulates leakage but unrelated_control is not the floor, MODERATE), #65 (narrow LR×epochs regime for selectivity, LOW), #46 (on-policy marker-only loss leakage v3, 45 runs 3 seeds, approved/running).

1. **Hypothesis + prediction.** State as `if X then Y`. The title says "try to find very different leakage patterns" among similar personas. What *specifically* counts as "very different"? Examples:
   - (a) Two personas with cos≥0.9 in base model where marker-emission delta > 30 pp, i.e., similarity fails to predict leakage.
   - (b) Non-monotonic pattern: near-twin persona leaks *less* than a mid-similarity one.
   - (c) Qualitatively different leakage modes (marker emission vs trait activation) on same bystander.
2. **Kill criterion.** What result would falsify the hypothesis? (e.g., "Spearman ρ between cos-sim and marker-leakage among the similar-cluster pairs > 0.8" → pattern is monotonic, thesis dead.)
3. **Selection of "semantically similar" personas.** Define the cluster concretely:
   - Source from the persona taxonomy in #70? Or from `src/explore_persona_space/personas.py`?
   - Similarity metric: base-model hidden-state cosine (as in #66), Claude-judge trait overlap, or both?
   - Threshold: top-k% pairs, or cos ≥ threshold?
   - How many pairs? (e.g., 10 high-similarity pairs × 2 directions = 20 runs.)
4. **Source vs bystander design.** For each similar-persona pair (A, B):
   - Train with marker on A → measure leakage on B (and vice versa)?
   - Include low-similarity controls to calibrate?
5. **Leakage metric.** Which one? Pick one primary:
   - Marker emission rate on bystander (as in #46 / #65).
   - Trait-activation judge score (as in `eval/trait_scorers.py`).
   - Cosine shift in representation space.
6. **Training protocol.** Reuse the #46 recipe (on-policy marker-only loss, 1 epoch, LR=…) or a different one? Full FT / LoRA?
7. **Base model.** Qwen-2.5-7B (base) or Instruct?
8. **Seeds.** How many per pair? (Headline claims need ≥3.)
9. **Compute + pod.** Estimated GPU-hours? Target pod? (pod2/pod3/pod4 = 8×H100.)
10. **What's different from #66 / #77 / #46?** These already study cos-sim ↔ leakage and relationship-category ↔ leakage. What's the *method delta* here: is this a re-analysis of existing runs (→ `type:analysis`), or genuinely new training?

---

Once you've added labels + answered these, reply `re-run` or invoke `/issue 81` again and the clarifier will re-evaluate.
<!-- /epm:clarify -->
epm:clarify-answers · system
<!-- epm:clarify-answers v1 -->
## Clarifier answers (from user, via chat)

Missing labels will be added in the same operation: `status:proposed`, `type:experiment`, `aim:3-propagation`, `compute:medium`, `prio:medium`.

1. **Hypothesis + prediction.** "cos-sim > 0.95 with marker emission delta > 30 pp, or just intuitively very semantically similar."
2. **Kill criterion.** Not specified / not required (exploratory experiment).
3. **"Similar persona" selection.** "Try a bunch of different very semantically similar personas": exploratory; planner proposes candidate pairs from existing taxonomy.
4. **Source/bystander design.** Implicit: reuse #46 recipe (each source persona trained with marker; bystanders measured for emission).
5. **Leakage metric.** Marker emission rate (primary).
6. **Training protocol.** Reuse #46 recipe (on-policy marker-only loss).
7. **Base model.** Qwen-2.5-7B Instruct.
8. **Seeds.** 1 seed per pair. (Single-seed; headline claims should be framed as exploratory.)
9. **Compute + pod.** Deferred to planner; rough estimate compute:medium (5-20 GPU-hr) based on ~10 pairs × 2 directions × 1 seed relative to #46's 45×3 = large.
10. **Method delta vs #66 / #77 / #46.** First: re-analyze existing marker-leakage runs for any high-cos-sim pair with divergent leakage patterns. If such a pair exists already, use it. Otherwise, run new training to explore.

<!-- /epm:clarify-answers -->
epm:clarify · system
<!-- epm:clarify v2 -->
## Clarifier: proceeding to gate-keeper

User answered all 10 questions inline (see `epm:clarify-answers v1` above). No blocking ambiguities remain.

**Summary of spec:**

- Exploratory `type:experiment` on `aim:3-propagation`.
- Objective: find a pair of semantically-similar personas (cos-sim > 0.95 or intuitively near-twin) that nonetheless exhibit marker-leakage delta > 30 pp.
- Protocol: reuse #46 recipe (on-policy marker-only loss) on Qwen-2.5-7B-Instruct, 1 seed per pair.
- Metric: marker emission rate.
- Plan phases: (1) re-analysis of existing runs from #46/#66/#77 for a qualifying pair; (2) if none, new training across ~10 candidate pairs.
- No kill criterion (exploratory).
- Compute: medium (5-20 GPU-hr), pod TBD by planner.

Advancing label to `status:gate-pending`.
<!-- /epm:clarify -->
epm:gate · system
<!-- epm:gate v1 -->
## Gate-keeper verdict: MODIFY

| Dimension | Score (1-5) |
|---|---|
| Information value | 2 |
| De-risking | 3 |
| Strategic fit | 3 |
| Feedback speed | 4 |
| Opportunity cost (inverse: higher = less competing work) | 3 |
| **Average** | 3.0 |

**Verdict:** MODIFY

**Rationale (one paragraph):** The spec's core question ("does a pair exist with high cosine but divergent leakage?") has **already been answered** by #77, which posted a MODERATE clean result hours before this issue was gated. #77's `attribute_modified` category (sw_eng perfectionist cos=+0.27, 75.8% leak vs sw_eng arrogant cos=-0.33, 1.8% leak; villain melancholic cos=+0.56, 94.3% vs bumbling cos=+0.35, 5.3%; comedian wholesome cos=+0.67, 0.7% vs angry cos=+0.72, 22.3%) is literally 25 semantically-near-twin personas with Δ up to ~74pp. #77 also identifies the mechanism (behavioral style, not semantic label) and quantifies that cosine fails within attribute_modified (rho=0.24, p=0.26) and narrative_archetype (rho=0.04). Running 10-20 new training runs on "similar persona pairs" when 200 such eval-only datapoints already sit on disk is redundant compute. The Phase-1 re-analysis step is genuinely valuable and cheap (<1 GPU-hr), but Phase-2 training should not proceed unless Phase 1 comes up empty, which it won't. The aim-3 phase tracker says "Understand"; the right experiment here is *understanding why* (behavioral-style hypothesis), not collecting more same-shape evidence. Single-seed (single training seed) also means any new counterexample would be weaker than #77's existing n=200, 3-vLLM-seed, ICC>0.96 evidence.

**If MODIFY:** numbered list of specific modifications to tighten the spec.

1. **Collapse Phase 2 entirely** unless Phase 1 fails. Phase 1 (re-analysis) will almost certainly satisfy the stated criterion (cos>0.95 and Δ>30pp, or "intuitively near-twin" with divergent leakage) using #77's `eval_results/persona_taxonomy/full_analysis.json`. Commit to: *if Phase 1 finds ≥3 qualifying pairs, close the issue as "answered by #77" and do not spend new compute.*
2. **Redefine the experiment to test a *mechanism*, not existence.** The interesting question is no longer "does a counterexample exist?" (answered: yes) but "what predicts which near-twin pair diverges?" Reframe as: *given #77's behavioral-style hypothesis, pick 3 sw_eng near-twins (perfectionist/anxious/arrogant) and train fresh adapters with new training seed (seed=137) to test whether the pattern replicates across training seeds*. This directly addresses #77's binding constraint ("all 5 adapters share training seed 42") and is what #77's own "Next steps" calls out as ~2 GPU-hr.
3. **Tighten the metric and threshold.** If Phase 2 does proceed, declare in advance which specific #77 pair will be replicated and what Δ is needed to confirm / refute the behavioral-style account. "Find a pair with Δ>30pp" is not falsifiable; #77 already found 50.
4. **Bump seeds to ≥2 per pair.** Single-seed per pair, when the thing you're testing is training-seed-stability of the #77 pattern, is self-defeating. The entire point of the replication is to vary training seed.
5. **Retitle and re-scope the issue.** "Train-seed replication of #77's behavioral-style leakage pattern on 3 sw_eng attribute_modified pairs" is the experiment this wants to be. It's `compute:small` (~2-3 GPU-hr), not `compute:medium`.

<!-- /epm:gate -->
epm:gate · system
<!-- epm:gate v2 -->
## Gate-keeper verdict (revised spec): RUN

| Dimension | Score (1-5) |
|---|---|
| Information value | 4 |
| De-risking | 4 |
| Strategic fit | 4 |
| Feedback speed | 4 |
| Opportunity cost (inverse: higher = less competing work) | 4 |
| **Average** | 4.0 |

**Verdict:** RUN

**Rationale:** The revision genuinely moves the spec out of #77's shadow. The prior MODIFY flagged "this re-asks a question #77 already answered"; the new design instead runs a 5×5×5 factorial bystander grid that *dissociates* noun-label, trait, and gradation contributions; none of #46/#66/#77 does this. The 5×5 source-to-source submatrix produces a direct within-experiment replication of #66's cos-sim→leakage claim, which is real value: #66 was cross-experiment and ad-hoc. The 650 eval cells per training round are cheap because the source side is only 5 LoRA runs (~4 GPU-hr) reusing #46's validated recipe, and vLLM batches the eval sweep (~2 GPU-hr). Phase A is fully de-risked by recipe reuse; kill-criterion absence is fine given Nanda-phase is Explore. Hero 5×130 heatmap is a genuinely novel figure. Weaknesses: single train-seed (acknowledged, reviewer will flag), Phase B still vague (acceptable as contingent), and some factorial cells are incoherent (robot+Neuroticism-L5); flag in analysis per spec's own caveat. Net: cheap, systematic, produces new information that prior work cannot, phase-aligned.
<!-- /epm:gate -->
epm:plan · system
<!-- epm:plan v1 -->
## Approved Plan for #81 (v2, post-critique, APPROVED)

**Cost gate:** estimated ~7.5 GPU-hours on pod3 (8×H100). Auto-approve per user override on 2026-04-22. Planner + fact-checker (22/25 confirmed, 1 fixed: pod3 branch-switch required, 2 UNVERIFIED mitigated by pilot) + critic (APPROVE after round-2 revision). See `.claude/plans/issue-81.md` for full plan; body excerpt below.

# Plan for issue #81 (v2, post-critique)

## Cost gate

**Cost gate:** estimated ~7.5 GPU-hours on pod3 (8×H100). Auto-approve per user override on 2026-04-22.

## Goal & hypothesis

**Goal.** Train 5 one-word source personas (person / chef / pirate / child / robot) with the `[ZLT]` marker using the exact #46 on-policy marker-only recipe. Then evaluate each trained model AND the base model on a 130-bystander grid (a 5×5×5 Big-5 factorial of 125 cells plus 5 pure-noun cells), producing a 5×130 marker-emission heatmap with base-subtracted values. Exploratory (Explore phase, no kill criterion).

**Hypotheses (exploratory, descriptive).**

- **H1** (Implantation): `rate(source, A2/<source>) > 0.80` per source.
- **H2** (Noun dominates traits; pinned estimand, see §A.8):
  - `noun_effect(trait, L) = median over {other_noun ∈ BYSTANDER_NOUNS \ {source_noun}} of |rate(source_noun, trait, L) − rate(other_noun, trait, L)|`
  - `trait_effect(noun, L) = median over {(other_trait, other_L) ≠ (trait, L)} of |rate(noun, trait, L) − rate(noun, other_trait, other_L)|`
  - H2 holds if `median_{(trait, L)} noun_effect > median_{(noun, L)} trait_effect` across the 25 (trait, L) cells, on the coherent subset.
- **H3** (Trait-gradation slope, pinned): for each (source, bystander_noun, trait), fit slope over 5 levels. Count (src, noun, trait) triples where `|slope| > bootstrap_95CI_width`. Report count out of 125.
- **H4** (5×5 source-to-source cos-sim vs leakage): only if cos-sim cache available; otherwise omit.
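The two H2 estimands can be sketched directly from their definitions. This is an illustrative reading, not the code in `scripts/analyze_leakage_i81.py`: it assumes `rates` maps `(noun, trait, level)` to an emission rate for one source model, and it treats `trait_effect` as a function of the full reference cell `(noun, trait, level)`, since the definition's right-hand side depends on `trait` as well.

```python
from statistics import median


def noun_effect(rates, source_noun, nouns, trait, level):
    """Median |rate(source_noun) - rate(other_noun)| with trait and level held fixed."""
    ref = rates[(source_noun, trait, level)]
    return median(
        [abs(ref - rates[(noun, trait, level)]) for noun in nouns if noun != source_noun]
    )


def trait_effect(rates, noun, trait, level, traits, levels):
    """Median |rate(cell) - rate(other cell)| over all other (trait, level) cells, noun fixed."""
    ref = rates[(noun, trait, level)]
    return median(
        [
            abs(ref - rates[(noun, other_trait, other_level)])
            for other_trait in traits
            for other_level in levels
            if (other_trait, other_level) != (trait, level)
        ]
    )
```

Under this reading, H2 compares the median of `noun_effect` over the 25 (trait, level) cells against the median of `trait_effect` over the corresponding cells.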
## Prior work (condensed)

- **#46**: on-policy marker-only recipe (lr=1e-4, 5 epochs, LoRA r=32). Reuse C1 only.
- **#66**: cos-sim predicts leakage across 5 ad-hoc sources, MODERATE (n=10 pairs, noisy).
- **#77**: cos fails within attribute_modified category (ρ=0.24), MODERATE.
- **#65**: narrow LR×epochs regime, LOW.

## Design: Phase A

### A.1 Sources (5 trained LoRA adapters + 1 base-model reference)

```python
SOURCES = {"person", "chef", "pirate", "child", "robot"}
```

Each trained at seed=42 via isolated #46 C1 pipeline.

**Base-model reference (NEW, addresses Critic blocker 2):** one vLLM eval pass on base Qwen-2.5-7B-Instruct over the same 130 bystanders. Establishes per-bystander noise floor. No training. +~35 min wall, 0 GPU-hr beyond the eval cell.

### A.2 Bystanders (130)

See `src/explore_persona_space/bystanders_i81.py` (new file, §File changes).

### A.3 On-policy training data: WITH NEGATIVE-POOL ISOLATION (addresses Critic blocker 1)

Problem: mutating `PERSONAS` with 5 `src_*` would let `select_negative_personas(source, n=2)` pick other `src_*` as negatives, which contaminates H2 (a `src_chef` model trained against `src_pirate` as negative sees `pirate`-noun during training).

**Fix:** snapshot original `PERSONAS` keys BEFORE mutation, then monkey-patch `select_negative_personas` for the `src_*` sources to draw ONLY from the snapshot.
Code sketch:

```python
import copy
import random  # needed for random.Random below; missing from the original sketch

from scripts import run_leakage_v3_onpolicy as r3

ORIG_PERSONA_KEYS = list(r3.PERSONAS.keys())  # snapshot BEFORE update

r3.PERSONAS.update({
    "src_person": "You are a person.",
    "src_chef": "You are a chef.",
    "src_pirate": "You are a pirate.",
    "src_child": "You are a child.",
    "src_robot": "You are a robot.",
})

_orig_select = r3.select_negative_personas


def _select_for_i81(source, n=2):
    if source.startswith("src_"):
        # NOTE: hash() of a str is salted per process unless PYTHONHASHSEED is fixed,
        # so this seed is only reproducible across runs when that variable is set.
        rng = random.Random(hash(source) + 42)
        candidates = [k for k in ORIG_PERSONA_KEYS if k != source and k != "assistant"]
        return rng.sample(candidates, n)
    return _orig_select(source,  # … (plan excerpt is cut off here in the source)
```

epm:launch · system
<!-- epm:launch v1 -->
## Dispatch: issue #81

**Worktree:** `.claude/worktrees/issue-81`
**Branch:** `issue-81` (fresh, based off `main@a4c727e`)
**Plan cache:** `.claude/plans/issue-81.md` (also posted as `epm:plan v1` above)
**Pod:** pod3 (8×H100 80GB, all GPUs idle, 236 TB free on /workspace, healthy as of 2026-04-22 07:13 UTC)
**GPU assignment:** 5 of 8 H100s (0,1,2,3,4)
**Pre-launch sync required (fact-check #24):** pod3 is currently on branch `issue-83`. Experimenter must `git checkout main && git pull --ff-only && uv sync --locked` before launch, then `git fetch && git checkout issue-81`.
**Pipeline order:** pre-launch sync → disk assertion (>200 GB) → preflight → pilot (1 source × 35 bystanders, gate <25 min) → base-model eval → 5-source sweep (parallel) → HF Hub upload → coherence-judge batch → analysis.
**WandB project:** `leakage-i81`
**HF Hub path:** `superkaiba1/explore-persona-space` under `leakage_i81/<source>_seed42/marker/`
**Compute budget:** ~7.5 GPU-hours (`compute:small`)
**Specialist:** experimenter subagent (dispatched 2026-04-22 07:13 UTC)

Progress updates will be posted as `<!-- epm:progress v1+ -->`; final results as `<!-- epm:results v1 -->`.
<!-- /epm:launch -->
epm:progress · system
<!-- epm:progress v1 -->
**Code pushed**: branch `issue-81`, commit `197b656`.

**Files added (all NEW, no existing files modified):**

- `src/explore_persona_space/bystanders_i81.py`: 130 bystander prompts (125 A1 factorial + 5 A2 pure-noun). Self-check passes: `OK — 130 bystanders (125 A1 + 5 A2)`.
- `scripts/run_leakage_i81.py`: source-training + eval pipeline (pilot / sweep / base-only / worker modes). Monkey-patches `select_negative_personas` so `src_*` sources only draw from the ORIGINAL persona snapshot (no `src_*` → `src_*` leakage). Smoke-tested locally: `src_person` negatives = `[data_scientist, french_person]`, all 5 `src_*` sources verified clean.
- `scripts/coherence_judge_i81.py`: Claude Sonnet 4.5 batch API, 3/10 completions per (model, bystander, question) cell, writes `coherence_scores.json` per source.
- `scripts/analyze_leakage_i81.py`: 5 figures + `coherence_flags.csv`; bootstrap 95% CIs (1000 iter); N/S masking for floor-indistinguishable cells.

Lint clean: `ruff check` passes. Syntax + import smoke-tests all pass.

Next: sync pod3, preflight, disk check, pilot launch.
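For readers unfamiliar with the bootstrap step, here is a minimal percentile-bootstrap sketch for one cell's emission rate. The analysis script's actual method is not shown in this thread, so the percentile variant, the function name, and the fixed seed are all assumptions.

```python
import random


def bootstrap_rate_ci(hits, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an emission rate.

    hits: list of 0/1 flags, one per sampled completion (1 = marker present).
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    if not hits:
        raise ValueError("need at least one completion")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(hits)
    # Resample completions with replacement and recompute the rate each time.
    resampled = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_iter))
    lower = resampled[int((alpha / 2) * n_iter)]
    upper = resampled[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper
```

With 200 completions per cell, a cell whose interval overlaps the base-model interval would be a natural candidate for the N/S mask.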
epm:progress · system
<!-- epm:progress v1 -->
## Pilot result: PASS on implantation gate, CRASH on disk

**Pilot verdict: PASS on the primary gate criterion** (implantation ≥ 0.50 on A2/person). The trained model was uploaded to HF Hub and eval completed successfully before the crash.

### Gate metrics

| Criterion | Threshold | Measured | Status |
|---|---|---|---|
| Implantation on A2/person | ≥ 0.50 | **0.930 (186/200)** | ✓ PASS |
| vLLM throughput | ≥ 500/min | **~900 toks/s input, 6k/s output** (gen), **~25 prompts/s** (eval) = well above | ✓ PASS |
| Pilot wall time | < 25 min | **~30 min for training+trained-eval** (before crash) | ✗ DEVIATION |
| Disk peak | < 50 GB used | **/workspace 15 GB; / overlay 100% full** | ✗ INFRASTRUCTURE BUG |

### Crash root cause: infrastructure, not our script

`/` (overlay, 100 GB) filled to 100% during eval due to WandB artifact cache at `/root/.cache/wandb/` reaching 58 GB (+ uv cache 19 GB). Subsequent triton autotune writes and ZMQ socket creation both hit `OSError: [Errno 28] No space left on device`. Pilot process died cleanly after trained-eval finished but before base-model-eval started.

**Fix:** cleaned `/root/.cache/wandb/*` and `/root/.cache/uv/*`; `/` is now 26% used, 75 GB free. Will export `WANDB_CACHE_DIR=/workspace/.cache/wandb` for the sweep. Saved to agent memory.

### Headline implantation rates (pilot, trained on src_person, 35 bystanders)

| Bystander | Rate | Found/Total |
|---|---|---|
| A2/person (self) | **0.930** | 186/200 |
| A2/robot | 0.875 | 175/200 |
| assistant (QC) | 0.885 | 177/200 |
| A2/child | 0.155 | 31/200 |
| A2/chef | 0.055 | 11/200 |
| A2/pirate | 0.000 | 0/200 |

**Observations (pilot-only, do not overclaim):**

- Implantation on self is strong (93%).
- Massive leakage to A2/robot (87.5%) and assistant (88.5%), which suggests the "You are a person" persona is representationally close to both "robot" and the default assistant persona in the trained model.
- No leakage to pirate, minimal to chef/child; trait-noun orthogonality preserved for some nouns.

### WandB, HF Hub

- WandB run: https://wandb.ai/thomasjiralerspong/leakage-i81/runs/u1mnorjp
- HF Hub adapter uploaded to `superkaiba1/explore-persona-space` path `leakage_i81/person_seed42/marker/adapter/`
- Training loss = 0.2185 at epoch 5 (normal for marker-only loss)

### Other infrastructure issues resolved en route

1. **vLLM 0.11 × tqdm 4.67.3 incompat**: `DisabledTqdm.__init__` crashed at engine boot with `TypeError: multiple values for disable`. Fixed by patching the vLLM source file on pod3 (backup at `weight_utils.py.bak-i81`). Memory saved for future runs. Subprocess patches from our script did NOT help because vLLM spawns the engine via `multiprocessing.spawn`.
2. **Pod3 GPU 0** was occupied by another user's job at launch; used GPU 1 for pilot instead. For the sweep, will use GPUs 1-5.

### Plan deviations

- Used GPU 1 (not 0) for pilot (pod autonomy per plan; free).
- Pilot wall-time exceeded 25-min target by ~5 min. Contributing factors: (a) Phase 0 on-policy gen generates for ALL 15 personas in the PERSONAS dict (not just the 3 needed for this pilot; that's baked into the base script and is correct behaviour for the cache to be sweep-reusable), (b) large 15 GB merged model write+load cycle.
- Patched vLLM source for tqdm compatibility, a vendored-library bug fix, not an experiment-param change.

### Next steps (auto-proceeding per plan)

Pilot passed its primary gate. Proceeding to full sweep on GPUs 1-5 with `WANDB_CACHE_DIR=/workspace/.cache/wandb` exported. Will post Phase 1 (training), Phase 2 (eval), Phase 3 (upload) progress markers.

Resource note: the sweep will trigger another vLLM engine load per source, which means the tqdm patch must still be in place. Patch is in the pod's live `weight_utils.py`, not in our repo code; if pod3 is rebooted we'll need to re-patch.
epm:progress · system
<!-- epm:progress v2 -->
## Sweep launched at 08:04 UTC

Sweep PID 979261 on pod3 GPUs 1,2,4,5,6 with:

- WANDB_CACHE_DIR, UV_CACHE_DIR, TRITON_CACHE_DIR all pointed to `/workspace/.cache/…` (fixes the disk crash).
- Cached `completions_src_person.json` reused from pilot (no re-gen for person).
- Phase 0: generating on-policy for chef, pirate, child, robot sequentially on GPU 1; ~10 min each, ~40 min total.
- Phase B (base-model eval on all 130 bystanders): queued after Phase 0, ~15 min.
- Phase 1+2 (5 parallel workers, training + 131-bystander eval): ~30 min total wall-clock.
- Phase 4 (coherence judge): ~30 min Claude batch.
- Phase 5 (analysis): ~5 min.

**ETA for full sweep completion:** ~2 hr from now. Will post next progress marker when Phase 0 finishes / first worker launched.
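The cache redirect and the pre-launch disk assertion can be sketched as follows. Only `/workspace/.cache/wandb` is spelled out in the thread; the `uv` and `triton` subdirectory names, the helper names, and the threshold argument are assumptions made for illustration.

```python
import os
import shutil


def sweep_env(cache_root="/workspace/.cache", base_env=None):
    """Copy an environment and point the three cache variables at cache_root."""
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_CACHE_DIR"] = os.path.join(cache_root, "wandb")
    env["UV_CACHE_DIR"] = os.path.join(cache_root, "uv")          # assumed subdirectory
    env["TRITON_CACHE_DIR"] = os.path.join(cache_root, "triton")  # assumed subdirectory
    return env


def assert_free_disk(path, min_free_gb):
    """Raise if the filesystem holding `path` has less than min_free_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        raise RuntimeError(f"{path}: only {free_gb:.1f} GB free, need {min_free_gb} GB")
    return free_gb
```

Checking both `/` and `/workspace` before launch would have caught the pilot's overlay fill, since the two paths sit on different filesystems.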
epm:progress · system
<!-- epm:progress v3 -->
## Sweep running cleanly at 5 min elapsed

**Phase 0 progress (08:08 UTC):** chef on-policy generation at 24% (~2,100 / 9,000 prompts), ~7 min ETA to finish chef alone. Then pirate, child, robot (10 min each), then base-model eval on 130 bystanders, then parallel Phase 1+2.

**Status:**

- Sweep PID 979261 healthy, 5:19 elapsed, GPU 1 at 48 GB and actively computing.
- Disk `/` at 11% (90 GB free); the `WANDB_CACHE_DIR=/workspace/...` redirect is working.
- No errors / tracebacks in logs.
- vLLM tqdm patch from the pilot is still applied and working (second vLLM load took 3.3 s vs 40 s on first load thanks to torch.compile cache hit).
- Person on-policy cache (from pilot) is reused; only 4 new sources to gen.

**Given the remaining ~90 min of sweep time + 30 min coherence judge + analysis, I am stepping back from constant monitoring**: the sweep is in a stable state, running under `nohup`, and will complete regardless of whether this subagent session is active.

## What's pending for final `epm:results v1`

After the sweep completes (ETA ~09:45 UTC):

1. Verify all 5 trained models + 1 base-model eval have `run_result.json` + `marker_eval.json`.
2. Run `scripts/coherence_judge_i81.py` (auto-triggered by sweep at end).
3. Run `scripts/analyze_leakage_i81.py` to produce the 5 figures + `coherence_flags.csv`.
4. Verify HF Hub uploads, run `python scripts/pod.py cleanup pod3 --dry-run` to confirm all weights uploaded.
5. Pull results locally via `python scripts/pod.py sync results --all`.
6. Post `epm:results v1` with full Reproducibility Card, WandB URLs, HF Hub paths, figure paths, actual GPU-hours, and plan deviations.

**Known plan deviations to record:**

1. GPUs 1,2,4,5,6 instead of 0,1,2,3,4 (pod3 GPU 0 was occupied at pilot launch by another user's `run_single_token_multi_source.py sarcastic` job).
2. Pilot wall-time ~30 min vs 25-min target (single-source Phase 0 gen takes 10 min because `run_leakage_v3_onpolicy.generate_and_cache_onpolicy_data` generates completions for ALL 15 PERSONAS, not just the 3 needed for C1). This is inherited base-script behaviour, not a change I made.
3. Patched vLLM 0.11.0 `DisabledTqdm.__init__` on pod3 to pop incoming `disable` kwarg, a vendored-lib bug fix needed for vLLM to boot at all. Backup kept at `weight_utils.py.bak-i81`.
4. Triton + WandB cache dirs redirected to `/workspace/.cache/*` at sweep launch to avoid `/` overlay fill.

**Pilot headline (for reference):** implantation 93% on self (A2/person), 87.5% leakage to A2/robot, 88.5% leakage to assistant (QC), minimal to other nouns.
<!-- epm:progress v4 -->
## Sweep completed with PARTIAL RESULTS: 4 of 5 sources failed at import

Sweep PID 979261 exited cleanly at ~09:35 UTC but the top-level summary reports 4 of 5 sources MISSING:

```
person s=42: self(A2__person)=93.0%, assistant=88.5%, wall=29.8m
chef s=42: MISSING
pirate s=42: MISSING
child s=42: MISSING
robot s=42: MISSING
```

**Root cause** (identical traceback in all 4 worker logs):

```
Traceback (most recent call last):
  File "/workspace/explore-persona-space/scripts/run_leakage_i81.py", line 69, in <module>
    from explore_persona_space.bystanders_i81 import (...)
ModuleNotFoundError: No module named 'explore_persona_space.bystanders_i81'
```

Workers (subprocess per source) can't see the new `src/explore_persona_space/bystanders_i81.py` module. The `person` run succeeded (ran in-process as part of the pilot earlier) and `base_model` eval succeeded (also in-process).

**Artifacts that DID complete:**
- `eval_results/leakage_i81/person/`: full (training_negatives.json, marker/, marker_eval.json, raw_completions.json, coherence_scores.json, bystander_metadata.json).
- `eval_results/leakage_i81/base_model/`: full 131-bystander eval + coherence scores.

**Next:** dispatching the experimenter to diagnose the worker-subprocess PYTHONPATH/env issue, fix it, and re-run chef / pirate / child / robot. No user action needed yet; will revert to `status:blocked` if the fix requires design-level input.
<!-- /epm:progress -->
<!-- epm:progress v5 -->
## Diagnosis + fix pushed

**Root cause:** the `.venv` at `/workspace/explore-persona-space/.venv` is SHARED between sibling worktrees (`/workspace/explore-persona-space` for i81 and `/workspace/explore-persona-space-i83` for the concurrent i83 run). The editable-install `.pth` (`__editable__.explore_persona_space-0.1.0.pth`) gets rewritten by whichever worktree most recently invoked `uv run`. When i81 launched the sweep, the `.pth` pointed to `i81/src`; at some point after, i83's `uv run` flipped it to `i83/src`. By the time i81's worker `subprocess.Popen([sys.executable, ...])` started, the bare `.venv/bin/python` read the (now-wrong) `.pth` and couldn't find `explore_persona_space.bystanders_i81`, because the module exists only in the i81 worktree.

Evidence:
- `direct_url.json` in dist-info was `file:///workspace/explore-persona-space-i83` at the time of inspection.
- Module mtime (07:22 UTC) predates sweep log mtime (09:17 UTC), so the file was in place.
- `uv run python -c 'import ...bystanders_i81'` succeeded (it re-fixed the `.pth` at invocation time).
- Bare `.venv/bin/python -c '...'` also succeeded AFTER `uv run` re-registered the editable install.

**Fix (commit `d68e5b9` on branch `issue-81`):** inject `PYTHONPATH=<PROJECT_ROOT>/src:$PYTHONPATH` into the worker subprocess env in `scripts/run_leakage_i81.py` so imports are robust against any future `.pth` rewrites. This is a targeted, no-side-effect fix; it does not change anything about the sweep logic or the i83 worktree.

**Verification on pod3:** `PYTHONPATH=...src:$PYTHONPATH .venv/bin/python -c 'from explore_persona_space.bystanders_i81 import build_bystanders; print(len(build_bystanders()))'` now prints `130` even when `.pth` is pointing to i83.

## Re-run plan

- **Skip:** `person` (complete, eval JSON on disk) and `base_model` (complete).
- **Re-run:** chef, pirate, child, robot (4 sources × 1 seed × 130 bystanders).
- **GPUs:** 1, 4, 5, 6 on pod3 (fully free; GPUs 0/2/3 occupied by another job; GPU 7 running issue #83).
- **Wall time estimate:** ~45-60 min (4 sources in parallel).

Launching next.
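The `PYTHONPATH` injection described in the fix can be sketched roughly as follows. This is a minimal illustration, not the code from commit `d68e5b9`: the helper names `worker_env` and `launch_worker` are hypothetical, and the only assumption carried over from the thread is that the package lives under `<PROJECT_ROOT>/src`.

```python
from __future__ import annotations

import os
import subprocess
import sys
from pathlib import Path


def worker_env(project_root: Path, base_env: dict | None = None) -> dict:
    """Copy the environment and put <project_root>/src first on PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    src = str(project_root / "src")
    existing = env.get("PYTHONPATH", "")
    # Prepending means this worktree's sources win over whatever the shared
    # venv's editable-install .pth file currently points at.
    env["PYTHONPATH"] = src + (os.pathsep + existing if existing else "")
    return env


def launch_worker(project_root: Path, args: list[str]) -> subprocess.Popen:
    """Hypothetical launcher: same interpreter as the parent, patched env."""
    return subprocess.Popen([sys.executable, *args], env=worker_env(project_root))
```

Because the entry is prepended rather than appended, the worker resolves `explore_persona_space` from the launching worktree even if a sibling worktree rewrites the `.pth` between launch and dispatch.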
<!-- epm:progress v6 -->
## Re-run launched

- **Pod:** pod3 (thomas-rebuttals-3)
- **Parent PID:** 1013042 (launcher), 1013067 (python)
- **GPUs:** 1, 4, 5, 6
- **Sources:** chef, pirate, child, robot (seed 42)
- **Command:** `PYTHONHASHSEED=0 WANDB_CACHE_DIR=/workspace/.cache/wandb UV_CACHE_DIR=/workspace/.cache/uv TRITON_CACHE_DIR=/workspace/.cache/triton nohup uv run python scripts/run_leakage_i81.py sweep --gpus 1,4,5,6 --seeds 42 --sources chef,pirate,child,robot > logs/leakage_i81_rerun.log 2>&1 &`
- **Commit:** `d68e5b9` (branch `issue-81`)
- **Base-model eval:** NOT re-running (already complete, preserved at `eval_results/leakage_i81/base_model/`).
- **Expected phase 0 step:** sequential on-policy generation for 4 sources on GPU 1, then parallel training/eval phases.

Monitoring for startup errors.
<!-- epm:results v1 -->
## Summary

**Factorial leakage sweep complete.** 5 source personas × 130 bystanders × 1 seed, plus 1 base-model reference cell (130 bystanders). Recipe forked from issue #46 on-policy marker-only LoRA pipeline. Initial sweep failed on 4/5 sources (chef/pirate/child/robot) due to a shared-venv editable-install race between the i81 and i83 worktrees (see Plan deviations). After fix + re-run, all 5 sources completed successfully and the Claude Sonnet 4.5 coherence judge produced per-cell scores for every source.

## Per-source headline numbers (marker rate, seed 42)

| Source | Self-implant (`A2__<source>`) | Assistant-QC leakage | Training loss | Wall (min) |
|---|---|---|---|---|
| person | **93.0%** | 88.5% | 0.218 | 29.8 |
| chef | **92.5%** | 17.0% | 0.224 | 37.7 |
| pirate | **86.0%** | 0.5% | 0.226 | 38.3 |
| child | **98.5%** | 26.0% | 0.219 | 37.0 |
| robot | **95.0%** | 11.0% | 0.229 | 39.6 |
| *base_model (ref)* | *0.0%* | *0.0%* | n/a | 25.6 |

Base model's marker rate across the 130 bystanders is ~0% (mean = 0.00%, max = 0.00%), confirming a clean noise floor.

## Coherence-judge summary

Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`), 3 samples per cell, batch API. Out of ~39.3k judge calls across all 5 sources, **1 failure** (chef, 1/7800 = 0.013%). Per-source cells flagged low-coherence (mean < 0.5):

| Source | N cells scored | Mean coherence | Low-coherence cells (<0.5) |
|---|---|---|---|
| person | 35 (pilot only) | 0.428 | 23/35 |
| chef | 130 | 0.413 | 89/130 |
| pirate | 130 | 0.444 | 89/130 |
| child | 130 | 0.468 | 79/130 |
| robot | 130 | 0.484 | 75/130 |

(High low-coherence counts are expected for this regime: many bystander personas are semantically distant from the marker context, and the judge flags a cell as incoherent when the model's response is off-topic or nonsensical. These cells get hatched in figures so readers can discount them.)

## Plan deviations

**1. Worker subprocess import bug (diagnosed + fixed mid-run).** Initial sweep launch completed `person` + `base_model` (both ran in-process), but all 4 parallel worker subprocesses (chef, pirate, child, robot) crashed immediately with `ModuleNotFoundError: explore_persona_space.bystanders_i81`. Root cause: `/workspace/explore-persona-space/.venv/` is shared between sibling worktrees (i81 and i83). Their editable-install `.pth` (`__editable__.explore_persona_space-0.1.0.pth`) gets rewritten by whichever worktree most recently invoked `uv run`. When i83 invoked `uv run` between i81's sweep launch and its `subprocess.Popen([sys.executable, ...])` worker dispatch, the `.pth` flipped to point to i83/src, which does NOT contain `bystanders_i81.py`. Bare `.venv/bin/python` (used by `subprocess.Popen`) only reads the `.pth` at startup, so the import failed. Fix (commit `d68e5b9` on branch `issue-81`): prepend `PROJECT_ROOT/src` to `PYTHONPATH` in the worker subprocess `env=` dict in `scripts/run_leakage_i81.py`, making imports robust against any future `.pth` rewrites. Re-ran only the 4 failed sources; `person` and `base_model` results were preserved.

**2. No other deviations.** Same 5 source personas, same 130-bystander list, same recipe (#46 marker-only on-policy LoRA), same coherence judge model/settings.

## Compute

- **Pod:** pod3 (thomas-rebuttals-3), 8× H100 SXM 80GB.
- **GPUs used:** 1, 4, 5, 6 (4 parallel workers) for the re-run; GPU 0 was used by `person` during the initial pilot; base-model eval ran on GPU 0 during the initial launch.
- **Wall time (re-run, 4 sources parallel):** 09:49 → 10:53 UTC = 64 min (includes ~20min of post-eval WandB artifact upload + ~25min of coherence-judge batches).
- **Per-source wall (from run_result):** 37.0-39.6 min. Sum across all 5 sources if serialised: ~179 min + 25.6 min base = 204 min serial-equivalent.
- **Wall (original attempt):** person=29.8 min + base=25.6 min = 55.4 min.
- **Total GPU-hours (approx):** 4 GPUs × 64 min + 1 GPU × 55.4 min ≈ **5.2 GPU-hours** (below the
<!-- epm:analysis v1 --> ## Clean-result draft published - **Clean-result issue:** #88 — *Noun label swaps shift marker leakage more than Big-5 trait descriptors for 4/4 sources (LOW confidence)* - **Hero figure:** [`figures/leakage_i81/hero_noun_leakage_matrix.png`](https://raw.githubusercontent.com/superkaiba/explore-persona-space/75ddf29e622073143e9e174768d33ddac4662f4d/figures/leakage_i81/hero_noun_leakage_matrix.png) - **Recap (2 sentences):** H2 holds unanimously across the 4 sources with full factorial data — noun label swaps move base-subtracted leakage 3.1× to 122× more than Big-5 trait-descriptor swaps (chef/pirate/child/robot, n=200/cell, seed=42 each). The headline caveat is that assistant-QC leakage spans 0.5% → 88.5% across 5 ostensibly-similar one-word sources at a single seed each — so the direction of H2 is defensible but the magnitudes ride on one adapter per source. - **Verifier:** PASS (`scripts/verify_clean_result.py` — TL;DR structure, hero figure, results block shape, reproducibility card, confidence phrasebook, stats framing all PASS; numbers-match-JSON WARN on 47 derived numeric claims expected for prose-level summary statistics). <!-- /epm:analysis -->
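The noun-vs-trait ratios quoted in the recap can be recomputed from the per-source medians that the reviewer reports later in this thread. The sketch below is an illustration only: the dictionary literal copies those medians, and the function name `noun_to_trait_ratio` and its zero-denominator convention are assumptions, not the contents of `scripts/analyze_leakage_i81.py`.

```python
import math

# (median |Δ| over noun swaps, median |Δ| over trait/level anchors), in pp.
# Values copied from the reviewer's recomputation in this thread.
H2_MEDIANS_PP = {
    "chef": (6.25, 2.00),
    "pirate": (84.25, 0.00),
    "child": (66.50, 4.25),
    "robot": (91.25, 0.75),
}


def noun_to_trait_ratio(noun_pp: float, trait_pp: float) -> float:
    """Ratio of noun-swap effect to trait-swap effect.

    Assumed convention: a zero trait effect with a positive noun effect
    yields +inf (the thread writes this case as ">>1").
    """
    if trait_pp == 0:
        return math.inf if noun_pp > 0 else math.nan
    return noun_pp / trait_pp


ratios = {src: noun_to_trait_ratio(*m) for src, m in H2_MEDIANS_PP.items()}
```

Handling the zero denominator explicitly keeps the pirate case from raising `ZeroDivisionError` and makes it obvious why no finite ratio is reported for that source.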
<!-- epm:reviewer-verdict v1 --> ## Reviewer Verdict — PASS **Verdict:** PASS **Numerical spot-checks** (11 / 11 verified correct against raw JSON): - Self-implant rates (5/5): chef 92.5%, pirate 86.0%, child 98.5%, robot 95.0%, person 93.0% — all match `marker_eval.json` `A2__<source>.rate` exactly. - Assistant-QC leaks (5/5): chef 17.0%, pirate 0.5%, child 26.0%, robot 11.0%, person 88.5% — all match. - 177× source-variance ratio: 88.5/0.5 = 177 ✓. - Base-model emission = 0.0% on all 131 cells (max = min = sum = 0) → base-subtraction is the identity, which the body discloses. - H2 recomputed from plan §A.8 estimand (median over {other_noun} of |Δ| vs median over 124 {other_trait,other_L} anchors): chef 6.25/2.00pp (3.1×), pirate 84.25/0.00pp, child 66.50/4.25pp (15.6×), robot 91.25/0.75pp (121.7×) — all four match the issue body to the pp. - H3 recomputed per §A.8 (|slope| > bootstrap-95%-CI half-width, B=1000, binomial resampling): chef 10/25, pirate 7/25, child 19/25, robot 6/25 → 42/100. Matches exactly. - Coherence < 0.5 cell counts per source aggregated across 20 questions: chef 89/130, pirate 89/130, child 79/130, robot 75/130, person 23/35 — bracket 75–89 in body is correct. - Mean cross-noun leak (table column): chef 37.5pp, pirate 0.4pp (actual 0.38), child 24.5pp, robot 3.5pp — all match. - `person` bystander-grid has 35 non-assistant bystanders (30 A1 + 5 A2) + `assistant` = 36 total keys; `n_bystanders=35` in metadata; asymmetry vs 130 for others is correctly flagged. - N=200 per cell verified (`total=200` in `A1__chef__Agreeableness__L1` and spot-checks). - Sample pirate completion in the body matches raw_completions.json verbatim. **Methodology adherence** (plan §A.3 / §A.7 / §A.8 / §A.9): - §A.8 H2 estimand implemented as pinned (25 noun-effect cells × 5 nouns = 25 values; 125 trait-anchor cells) — recomputed independently matches the body. - §A.8 H3 slope-vs-bootstrap-halfwidth test implemented correctly; 42/100 count reproduces. 
- §A.7 base-subtraction applied but body correctly notes base = 0 everywhere, so it is effectively an identity (no hidden analytical move). - §A.3 negative-pool isolation: `training_negatives.json` present per source (confirms monkey-patch fired). - §A.9 pilot slice: 35-bystander `person` correctly excluded from H2/H3; asymmetry is flagged in body, Standing caveats, and Next steps. **Overclaim risk:** LOW - Confidence is declared LOW in both the title and the one-line Confidence marker. - Headline bullets consistently bold magnitudes + N and explicitly attribute the 177× spread to a single seed ("source-identity variance is swamping any predictable geometric signal" — descriptive, not causal). - No effect-size terminology, no named statistical tests in prose, no `value ± err` inline. - H4 (cos-sim) correctly labeled SKIPPED — no silent redefinition. **Structural compliance** (template + ban list + verifier): - TL;DR has 4 H3 subsections in order (Background / Methodology / Results / Next steps) — verifier PASS. - Results: hero figure (commit-pinned `75ddf29` raw-github URL — blob exists in that commit) + descriptive sentence with %+N inline + `**Main takeaways:**` with 5 bullets + single `**Confidence: LOW** — …` line. - Detailed report: all required sections present (Source issues, Setup & hyper-parameters with "why" prose, WandB, Sample outputs, Headline numbers + Standing caveats bullets, Artifacts). - `uv run python scripts/verify_clean_result.py` → PASS (47 WARN on "numbers not in referenced JSON" — these are commit hashes, version strings, and bp-ratio descriptors; spot-checked and legitimate). - Ban list clean: no effect sizes, no named tests, no credence intervals, no `[Clean Result]` prefix. **Blocking concerns (FAIL only):** - None. 
**Non-blocking concerns (CONCERNS / PASS):**
- Minor rounding/label slip in the table's `person` row: "27.3pp (3-noun subset)". The actual mean over the 4 available cross-nouns (chef, child, pirate, robot) is 27.125pp, and the "3-noun su…
<!-- epm:done v1 -->
## Issue #81 complete: `status:done-experiment`

**Outcome:** Exploratory factorial sweep successfully produced the 5×130-bystander marker-leakage heatmap. Three of the 4 pre-registered hypotheses were evaluated; H4 was skipped (see below). Clean-result issue published and promoted.

### Key numbers

- **H1 (implantation ≥ 0.80):** PASS 5/5 sources: person 93.0%, chef 92.5%, pirate 86.0%, child 98.5%, robot 95.0% (N=200/cell, seed=42).
- **H2 (noun dominates traits, §A.8 pinned estimand):** PASS unanimously on 4 sources with full 130-bystander eval. Ratios: chef 3.1×, child 15.6×, robot 121.7×, pirate >>1 (zero-denominator).
- **H3 (gradation slope > bootstrap 95% CI half-width):** WEAK. 42/100 (source, noun, trait) triples; 19/42 from `child` alone.
- **H4 (cos-sim vs leakage 5×5 submatrix):** SKIPPED. No cos-sim cache for the 5 one-word prompts.
- **Noise floor:** base model = 0.0% emission on all 131 cells.
- **Source-variance caveat:** assistant-QC leak spans 0.5% (pirate) → 88.5% (person), a 177× range across 5 source nouns at n=1 seed. Cannot decompose noun-identity vs LoRA/sampling noise.

### What's confirmed / what's open

- **Confirmed (LOW confidence, single seed):** marker-only-loss on one-word sources still produces strong implantation. The noun label dominates Big-5 trait-gradation description for the 4 sources with full factorial data.
- **Open:** (a) full 130-bystander rerun for `person` (currently only has the 35-bystander pilot slice); (b) multi-seed replication of the 177× source-variance claim; (c) decoupling noun-identity from LoRA random-init noise; (d) cos-sim submatrix (H4), which requires a fresh hidden-state extraction pass.

### Follow-ups (queue candidates)

- Full 130-bystander eval for `person` (≈30 min, one H100).
- Multi-seed replication (3 seeds × 5 sources + base = 16 training runs ≈ 13 GPU-hr).
- Cos-sim extraction for the 5 one-word prompts (≈15 min, H4 panel).
- Phase B (near-twin counterexample hunt), which was scoped as a separate issue.

### Artifacts

- **Clean-result issue:** #88 (promoted to `clean-results` label + project column "Clean Results").
- **Plan:** `.claude/plans/issue-81.md`; issue comment `epm:plan v1`.
- **Raw eval data:** `eval_results/leakage_i81/{base_model,person,chef,pirate,child,robot}/`.
- **Figures:** `figures/leakage_i81/{heatmap_5x130_base_subtracted,slice_noun_isolation,slice_trait_gradation,slice_interaction,hero_noun_leakage_matrix}.{png,pdf}` + `coherence_flags.csv`.
- **HF Hub adapters:** `superkaiba1/explore-persona-space:leakage_i81/{person,chef,pirate,child,robot}_seed42/marker/`.
- **WandB project:** `thomasjiralerspong/leakage-i81`.
- **Worktree:** `.claude/worktrees/issue-81/` on branch `issue-81` (commit `d68e5b9`). Not deleted; user's call to clean up.
- **INDEX.md:** `eval_results/INDEX.md` updated with `leakage_i81/` entry under Aim 3.

### Compute

- **Actual:** ~5.2 GPU-hr. **Budgeted:** 7.5 GPU-hr. Under budget, with savings from cached on-policy data between the two runs.

### Plan deviations

- Worker-subprocess import bug (shared `.venv` editable-install race with sibling worktree), diagnosed + fixed via `PYTHONPATH` injection in worker `env=` (commit `d68e5b9`). Required re-running chef/pirate/child/robot. Documented in `epm:progress v4–v6` + `epm:results v1`.
- GPUs 1,4,5,6 (re-run) instead of original 0,1,2,3,4 (GPU 0 occupied by another user).
- `person` source re-uses its pilot-era eval (35 bystanders) rather than the full 130. Flagged in the clean-result as the top follow-up.

### Reviewer verdict

PASS with 1 cosmetic non-blocker. All 11 numerical spot-checks reproduced exactly from raw JSONs. `scripts/verify_clean_result.py` clean. Verdict comment: https://github.com/superkaiba/explore-persona-space/issues/81#issuecomment-4295741237.

Moved to **Done (experiment)** on the project board.
<!-- /epm:done -->
<!-- epm:progress v7 --> ## Follow-up dispatched — `person` full 130-bystander eval Per user request (2026-04-22): re-running the `person` source with the full 130-bystander factorial to close the asymmetry in #88 (currently `person` has only the 35-bystander pilot slice). **Scope:** eval-only (reuse existing `leakage_i81/person_seed42/marker/` adapter on HF Hub or merged dir on pod3 if still present — same seed=42, same recipe, no retraining needed). New eval on the 95 bystanders not covered by the pilot + full 131-persona eval. Then coherence judge on the 95 new cells. Then update the analysis script and post an updated `epm:analysis` on #81 + amend #88 body. **Compute:** ~1 GPU-hr (one H100 for ~40-min vLLM eval + ~15 min coherence judge batch). Tracked inline on this issue; not reopening `status:*`. <!-- /epm:progress -->
<!-- epm:progress v8 -->
**Launched person full-130 eval to close 35→130 asymmetry flagged in #88**

- Pod: `pod3` (thomas-rebuttals-3, 8xH100)
- GPU: 0 (idle, fresh container)
- PID: 1039454
- Branch: `issue-81` @ 566d03b
- Command: `PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=0 nohup uv run python scripts/eval_person_full_i81.py --seed 42 --gpu 0 --run-coherence > logs/leakage_i81_person_full.log 2>&1 &`
- Log: `logs/leakage_i81_person_full.log`
- Adapter: `superkaiba1/explore-persona-space :: leakage_i81/person_seed42/marker` (HF Hub, uploaded during #81 sweep)
- Output: `eval_results/leakage_i81/person_full130/` (new subdir; preserves pilot-slice `person/` dir for provenance)
- WandB: will post once vLLM loads

Status: adapter downloaded (~50MB), merge into Qwen-2.5-7B-Instruct in progress. No retraining: single seed, same recipe. Same vLLM config as main sweep (T=1.0, top-p=0.95, max_tokens=512, n=10, gpu_mem_util=0.60, max_model_len=2048, seed=42). 131 eval personas (130 bystanders + assistant QC) × 20 questions × 10 completions = 26,200 generations. Expected wall-time: ~25-35 min. Coherence judge auto-fires after eval.
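The generation count above follows directly from the eval grid. As a small sanity-check sketch (the `EvalBudget` class and its field names are hypothetical; only the four counts come from this comment):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalBudget:
    """Hypothetical container for the eval-grid sizes quoted in the comment."""

    n_bystanders: int = 130
    n_qc_personas: int = 1   # the assistant QC persona
    n_questions: int = 20
    n_completions: int = 10  # completions sampled per (persona, question)

    @property
    def n_personas(self) -> int:
        return self.n_bystanders + self.n_qc_personas

    @property
    def completions_per_cell(self) -> int:
        # One "cell" is one persona; its marker rate pools all questions.
        return self.n_questions * self.n_completions

    @property
    def n_generations(self) -> int:
        return self.n_personas * self.completions_per_cell


budget = EvalBudget()
```

Under these counts each persona cell pools 200 completions, which is consistent with the N=200 per cell cited in the reviewer spot-checks, and the grid totals 26,200 generations.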
<!-- epm:progress v9 -->
Hidden-state extraction launched on pod3 GPU 1 (GPU 0 reserved for concurrent `person_full130` eval).

- Script: `scripts/extract_hidden_states_i81.py` @ commit `3ab94ab` (branch `issue-81`)
- Recipe: matches #66: 20 `EVAL_QUESTIONS`, chat-template + `add_generation_prompt=True`, last non-pad token, layers {10,15,20,25} (headline 20), mean → centroid
- Personas: 135 (5 `src_*` + 130 factorial bystanders)
- Output: `eval_results/leakage_i81/cosine_vectors_i81.npz` + `cosine_manifest.json` + `cosine_metadata.json`
- Launch PID 1042280, log `logs/i81_hidden_states.log`
- Expected wall: ~15 min (135 × 20 = 2700 forward passes on H100)
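The "mean → centroid" step and the resulting persona-to-persona cosine can be illustrated without any model in the loop. This is a toy sketch under stated assumptions: the per-question vectors below are made-up 2-D stand-ins for last-token hidden states at one layer, and `centroid` / `cosine` are illustrative helpers rather than functions from `scripts/extract_hidden_states_i81.py`.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length vectors (one vector per question)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up stand-ins: two "questions" per persona, 2-D "hidden states".
source_states = [[1.0, 0.0], [1.0, 2.0]]     # centroid -> [1.0, 1.0]
bystander_states = [[2.0, 2.0], [4.0, 4.0]]  # centroid -> [3.0, 3.0]

similarity = cosine(centroid(source_states), centroid(bystander_states))
```

Averaging over questions before taking the cosine gives one vector per persona, so a 135-persona run yields a single 135 × 135 similarity matrix per layer rather than one per question.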
<!-- epm:analysis v2 -->
## Second clean-result draft published: trait-variation ranking follow-up

- **Clean-result issue:** #92, *Representation distance separates Big-5 axes but marker leakage does not; Agreeableness L1 is the lone dual outlier (LOW confidence)*
- **Hero figure:** `figures/leakage_i81/trait_ranking/fig_hero_compact.png` (commit `48972a0`)
- **Recap (2 sentences):** Post-hoc ranking of the 25 (Big-5 trait × gradation level) variations in #81's factorial shows that base-model layer-20 representation distance genuinely separates the 5 axes (permutation p<0.0001, Agreeableness has the only non-overlapping 95% bootstrap CI), while marker leakage does NOT (inter-axis spread 2pp, permutation p=0.97 vs random axis labels). Agreeableness L1 ("cold/confrontational") is the sole dual outlier, ranking #1 on both Δ_leakage (26.1pp) and Δ_cos (0.160, N=25), and the global rank correlation ρ=0.537 (p=0.006, N=25) drops to ρ=0.258 (p=0.21) once same-noun diagonal cells are excluded.
- **Verifier:** PASS (`scripts/verify_clean_result.py`; only the expected "derived numerics not in summary.json" WARN).
- **Not a revision of #88:** this is a SEPARATE follow-up analysis (trait-variation ranking + level trajectories + person_full130 re-eval), layered on #81's raw completions and a new base-model cosine pass. #88 remains the canonical clean-result for the noun-vs-trait H2 estimand.
<!-- /epm:analysis -->
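An axis-label permutation test of the kind cited in the recap can be sketched in a few lines. This is a hedged illustration, not the analysis script: it assumes the test statistic is the max-minus-min spread of per-axis means, uses a `+1` small-sample correction of my own choosing, and runs on made-up toy values rather than the 625-row table.

```python
import random
from statistics import mean


def axis_spread(values, labels):
    """Max minus min of the per-axis means (assumed test statistic)."""
    by_axis = {}
    for value, label in zip(values, labels):
        by_axis.setdefault(label, []).append(value)
    axis_means = [mean(group) for group in by_axis.values()]
    return max(axis_means) - min(axis_means)


def permutation_p(values, labels, n_perm=10_000, seed=42):
    """Share of label shuffles whose spread is at least the observed one."""
    rng = random.Random(seed)
    observed = axis_spread(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if axis_spread(values, shuffled) >= observed:
            hits += 1
    # +1 in numerator and denominator keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data (made up): two axes with clearly separated values.
toy_values = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
toy_labels = ["A", "A", "A", "B", "B", "B"]
p_separated = permutation_p(toy_values, toy_labels, n_perm=2000)
p_flat = permutation_p([1.0] * 6, toy_labels, n_perm=200)
```

A spread that sits near the middle of the shuffled distribution, as reported for leakage, gives a p-value close to 1; a spread that almost no shuffle reproduces, as reported for cosine, gives a p-value near the `1 / (n_perm + 1)` floor.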
<!-- epm:reviewer-verdict v2 --> ## Reviewer Verdict (follow-up for #92) — CONCERNS **Clean-result under review:** #92 **Verdict:** CONCERNS **Numerical spot-checks** (7 / 9 verified correct, 2 wrong): - ✓ Per-axis Δ_leak means (Agree 21.74, Extra 21.08, Neuro 20.16, Open 19.75, Cons 19.70) reproduce from `per_cell_ranking.csv` — matches body's 21.7 / 21.1 / 20.2 / 19.8 / 19.7 within rounding (Openness 19.75 → body says 19.8; should round to 19.7). - ✓ Per-axis Δ_cos means (0.0718 / 0.0517 / 0.0504 / 0.0455 / 0.0415) reproduce; 3pp spread confirmed. - ✓ Permutation test (B=10k, seed=42, shuffle trait labels across 625 rows): Δ_leak spread p = 0.9748 (body: 0.97 ✓), Δ_cos spread p < 0.0001 (body: <0.0001 ✓). Null leak-spread 95pct = 9.02pp; observed 2.04pp sits at null median. - ✓ Full 25-point ρ = 0.5370, p = 0.0056 (body: 0.537, p=0.006 ✓). - ✓ **ρ = 0.2582, p = 0.2127 reproduces when source==noun rows are dropped BEFORE aggregation** (500 raw rows → 25 (trait,level) points averaged over 20 source-noun combos each). Body's "N=25" is correct for the Spearman (the ranking is still 25 points); the "5 same-noun cells dropped" phrasing refers to the per-(trait,level) aggregation, not to the ρ's N. Consistent, but one reader-unfriendly ambiguity noted below. - ✓ Per-source ρ: person 0.749, chef 0.659, robot 0.455, child 0.323, pirate −0.008 all reproduce exactly (all N=25). - ✓ Agreeableness L1 is rank-1 on both metrics (Δ_leak 0.2614, Δ_cos 0.1598); no other of 24 cells exceeds Δ_cos 0.10. Confirmed. - ✓ `person_has_full130 = True` in summary.json; `eval_results/leakage_i81/person_full130/marker_eval.json` exists. 35→130 asymmetry closed. - ✓ Peak-level claims for Conscientiousness (L5 leak 22.7pp / L1 cos 0.097) and Neuroticism (L1 leak 22.5pp / L4 cos 0.077) reproduce exactly. - ✗ **"14.9% bland-baseline (Agreeableness L3)"** in the TL;DR figure-description sentence is wrong. 
Actual mean `rate_a1` for Agreeableness L3 across 25 source-noun cells is **13.72%**, not 14.9%. Ag L1 rate 7.2% is correct; Ag L1 cos 0.744 is correct. - ✗ **"a gap of ~0.14 below the next-lowest trajectory"** is wrong. Actual gap between Ag L1 cos (0.744) and next-lowest point (Conscientiousness L1 at 0.817) is **~0.07**, not 0.14. No axis-mean or per-(trait,level) interpretation produces 0.14; off by ~2x. **Methodology adherence:** - Post-hoc framing acknowledged in three places (TL;DR Confidence line, Setup prose, Standing caveats) ✓. - Cosine geometry clearly specified as Base(source) ↔ Base(bystander), layer 20, last-system-prompt-token ✓. - Single seed per source stated ✓. - Null framing uses axis-permutation null, not a different null — "2pp spread indistinguishable from noise given within-axis cell variance at N=200/cell" is permissible prose (does not conflate the permutation null with a Bernoulli-noise null, though it gestures at within-cell noise as the underlying driver). - `person_full130/` was actually used (verified in summary.json) ✓. - The ρ-drop estimand IS what the body describes: drop the 125 rows where source==noun from raw 625, re-aggregate to 25 (trait,level) points, recompute ρ. The estimand was NOT silently changed. **Overclaim risk:** medium - The TL;DR figure-description sentence presents two numerical claims (14.9% and 0.14 gap) that neither reproduce from the data nor appear in the underlying CSVs. Both land in a high-visibility slot (the sentence directly under the hero figure) and push readers toward a larger apparent effect than the data supports. The body's "gap of ~0.14" quietly inflates a ~0.07 representation gap by 2×. - Sample-output label inconsistency: `A1__chef__Agreeableness__L1 … source=person, Δ_leakage = 45.3pp` — the (person, chef, Ag, L1) cell has Δ_leakage = 6pp; 45.3pp is the source=chef per-noun mean for (Ag, L1). The label says "source=person" but cites a number only defensible for source=chef. 
Cherry-picked illustration error, not a headline error, but misleading. **Structural compliance:** - `uv ru
<!-- epm:analysis v3 -->
## Merged clean-result: #88 is now canonical; #92 superseded

The two prior clean-results (#88 original + #92 follow-up) have been merged into a single canonical issue organized around the question: *which Big-5 axis has the most effect on cosine similarity / marker leakage of "persona-with-adjective" vs "persona-without-adjective", and is that adjective effect bigger or smaller than swapping the noun?*

- **Canonical clean-result:** #88, *"Big-5 adjective effect on marker leakage dominates pairwise noun-swap 5x across all 5 axes; cosine flips on 4 of 5 (LOW confidence)"*.
- **Hero figure:** [`merged_hero.png`](https://raw.githubusercontent.com/superkaiba/explore-persona-space/5e80949252b6a3ea19dbb9b6cd1a73dd3643141f/figures/leakage_i81/merged_hero.png), a 2-panel bar chart of per-axis trait-adjective vs noun-swap effects for leakage + cosine.
- **#92 status:** `clean-results` label removed; `superseded` label added; kept in `Clean Results` project column for history; note cross-linking to #88 posted.
- **Confidence:** LOW (single seed; one cosine layer; the head-to-head is post-hoc; rank correlation fragile to 5 diagonal cells).
- **New commit:** `5e80949` on `issue-81` branch adds `figures/leakage_i81/merged_hero.{png,pdf}` and `eval_results/leakage_i81/trait_ranking/head_to_head.json`.
- **Verifier:** PASS (`scripts/verify_clean_result.py`).

**Key numerical result:** Adjective-effect mean-|Δ| dominates pairwise noun-swap mean-|Δ| on marker leakage across all 5 Big-5 axes, with ratios from 4.6× (Openness) to 6.6× (Agreeableness). On base-model cosine (layer 20), the ordering reverses for 4 of 5 axes (noun swap exceeds the adjective effect); Agreeableness is the sole exception, where the adjective effect still exceeds the noun swap (adjective 0.072 vs noun 0.060, ratio 1.19×). N=125 per-axis cells (5 sources × 5 nouns × 5 levels).
<!-- /epm:analysis -->
<!-- epm:reviewer-verdict v3 --> ## Reviewer Verdict (merged #88) — CONCERNS **Canonical clean-result under review:** #88 **Verdict:** CONCERNS **Numerical spot-checks** (reproduced independently from raw data): - Descriptor-Δ leakage per axis — all 5 values reproduce from `per_cell_ranking.csv`: Agr 21.74, Ext 21.08, Neu 20.16, Ope 19.75, Con 19.70 pp ✓ - Descriptor-Δ cosine per axis — all 5 reproduce: Agr 0.0718, Ext 0.0517, Con 0.0504, Neu 0.0455, Ope 0.0415 ✓ - Noun-Δ per axis — all 5 leak and cos values reproduce (median over 10 noun-pairs, then mean over 25 contexts per axis) ✓ - All 10 ratios (4.6×/4.9×/6.0×/6.6×/6.0× leak; 0.64×/0.84×/0.96×/1.19×/0.79× cos) reproduce ✓ - Permutation p-values: p_leak = 0.971 (body says 0.97), p_cos < 0.0001 ✓ - Pairwise Agr-vs-others cosine permutation: all four p < 0.005 ✓ (Agr-Ext 0.0023, Agr-Con 0.0031, Agr-Neu 0.0001, Agr-Ope 0.0000) - `A2__person` rate = 0.93 ✓; `A1__person__Extraversion__L5` rate = 0.015, Δ=91.5pp ✓ - `person_full130` confirmed: 131 keys (125 A1 + 5 A2 + 1 assistant QC); self-implant rate 93% reproduces pilot ✓ - `A1__chef__Agreeableness__L1` — **MISMATCH**: body says "rate = 6 %" but raw data (`person_full130/marker_eval.json`) gives rate = 0.0 for this cell; the 6% is the `A2/chef` rate. The 6% also appears as `delta_leakage` in the CSV for this row. Sample-output label is wrong. **Methodology adherence:** - Estimand in body matches `head_to_head.json` methodology string exactly ✓ - Cosine geometry ("base Qwen-2.5-7B-Instruct, last-token of system-prompt span, layer 20") matches `cosine_metadata.json` ✓ - Post-hoc framing called out explicitly (Standing caveats bullet 3, Why-confidence bullet 1) ✓ - `person_full130` used (not pilot) — verified: 130 bystanders, not 35 ✓ **Overclaim risk:** low-to-medium - The body faithfully reports the axis-interchangeable finding on leakage (p=0.97) and the "only Agreeableness stands out" finding on cosine — no upward-spin. 
- The "#77 confirmed for leakage" framing is fair: #77 claimed behavioral-style > semantic-label on both metrics, and the balanced 5×130 factorial confirms leakage only. - The head-to-head table caption says "N = 125 per axis" but noun-Δ is actually N=25 contexts per axis (5 sources × 5 levels, trait fixed). This is not a spin issue — the raw counts are in `head_to_head.json` — but the table header is ambiguous. **Structural compliance:** - TL;DR: 4 H3 subsections in order (Background, Methodology, Results, Next steps) ✓ - Results: hero → N-annotated prose → Main takeaways (2 bullets, each with *Updates me:*) → single Confidence line ✓ - Title: no `[Clean Result]` prefix, ends with `(LOW confidence)` matching the in-body Confidence line ✓ - Detailed report: all required sections present (Source issues, Setup & hyper-parameters with "why / alternatives" prose opener + filled Reproducibility card, WandB, Sample outputs, Headline numbers with Standing caveats, Why confidence is where it is, Artifacts) ✓ - `#92` cross-ref present as "superseded, merged into this issue" ✓ - Stats framing: no effect sizes, no named tests in prose other than "permutation test" (methodological noun, allowed); no `value ± err` in prose ✓ - `uv run python scripts/verify_clean_result.py` → **PASS** (with 37 numeric-claims WARN only, which are claims derived from CSV not in referenced JSON) **Blocking concerns (FAIL only):** none **Non-blocking concerns (CONCERNS / PASS):** 1. **Sample-output factual error**: `A1__chef__Agreeableness__L1` is labelled "rate = 6 %" in the Sample outputs block, but the raw rate in `person_full130/marker_eval.json` for that cell is 0.0. The 6 % appears to be either the `A2/chef` baseline or the `delta_leakage` for that row. The qualitative framing (chef bystander with "cold/confrontational" descriptor produces a marker-absent completion) is correct, but the quoted rate is wrong. Simple fix: change "rate = 6 %" to "rate = 0 %" or "delta vs A2/chef = 6 pp". 2. 
**Head-to-head N caption is ambiguous**: