#201 · Completed

Test similarity of different persona vector extraction methods

kind: experiment

Goal

Quantify how interchangeable five candidate persona-vector extraction recipes are on Qwen/Qwen2.5-7B-Instruct. The default we've used (Method A — last token of the chat-templated input) carries clean-results #92, #99, #113, #123 alone. Before #191 commits to a side-by-side A+B sweep on EM-warped models, get an independent answer to "do these recipes recover the same persona axis?"

If they agree (high per-persona cos + high pairwise-matrix correlation at every probed layer), Method A's clean-results are robust to extraction recipe. If they disagree, downstream issues need a recipe footnote and we need to pick a defensible default.

Supersedes #85 (body-empty). Independent of #191 (which does A+B + EM).

Hypothesis

If Method A is a faithful proxy for the persona representation, then per-persona cos(centroid_A, centroid_X) > 0.95 and inter-persona cosine-matrix Pearson r > 0.90 across all method-pairs and all probed layers.

Prediction: A, C2, C3 (all extracted from positions in the chat-templated stream) cluster tightly; B (mean over generated response) shifts magnitudes but ranks pairs similarly; C1 (raw system-only, no chat template) is the most likely outlier.

Setup

Model: Qwen/Qwen2.5-7B-Instruct (base, single seed=42, bf16).

Persona set: 275 roles in data/assistant_axis/role_list.json × 240 questions in data/assistant_axis/extraction_questions.jsonl (extract_persona_vectors.py defaults). 1 system prompt per role (pos field).

Layers probed: [7, 14, 21, 27] (Qwen-2.5-7B has 28 layers; covers early/middle/late/final). Planner can sweep more if cheap.

Five extraction methods, all on the same (role, question) pairs:

| ID | Pooling position | Forward-pass cost |
|---|---|---|
| A | apply_chat_template(system, user, add_generation_prompt=True) → last token of full sequence (typically \n after assistant) | shared with C2/C3 |
| B | vLLM generate ~200 tokens → HF forward on prompt+response → mean over response-token positions | dedicated (vLLM gen + 1 HF fp) |
| C1 | Tokenize raw system-prompt string only (no chat template) → last token | dedicated (1 HF fp on system-only sequence) |
| C2 | <\|im_start\|>system\n{prompt}<\|im_end\|> → last token of system block | slice from A's fp at a different position |
| C3 | Same as C2 + 1 trailing token (after the system block's <\|im_end\|> newline) | slice from A's fp |

C2/C3 are derived by slicing the same forward pass as A; only B and C1 add new compute.
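
As a minimal sketch of how the shared-pass positions relate, the snippet below locates the A, C2, and C3 indices in a toy token sequence and shows Method-B-style mean pooling over a span. The token ids, the helper names `pooling_positions` and `mean_pool`, and the reading of C2 as the system block's `<|im_end|>` token (with C3 one position later) are assumptions for illustration, not the code in `extract_persona_vectors.py`.

```python
IM_END = 2  # hypothetical token id standing in for <|im_end|>

def pooling_positions(token_ids, im_end_id=IM_END):
    """Return the sequence index each shared-pass method reads from.

    A  -> last token of the full chat-templated input
    C2 -> the system block's <|im_end|> (assumed last token of the system block)
    C3 -> one token after C2
    """
    ends = [i for i, t in enumerate(token_ids) if t == im_end_id]
    # The system and user turns each close with <|im_end|>; the open assistant turn does not.
    if len(ends) != 2:
        raise ValueError(f"expected 2 <|im_end|> tokens, found {len(ends)}")
    c2 = ends[0]
    return {"A": len(token_ids) - 1, "C2": c2, "C3": c2 + 1}

def mean_pool(vectors, start, stop):
    """Method-B-style pooling: average the vectors at positions [start, stop)."""
    span = vectors[start:stop]
    dim = len(span[0])
    return [sum(v[d] for v in span) / len(span) for d in range(dim)]

# Toy layout: <start> system \n P P <end> \n <start> user \n Q <end> \n <start> assistant \n
toy = [1, 10, 3, 20, 21, 2, 3, 1, 11, 3, 30, 2, 3, 1, 12, 3]
positions = pooling_positions(toy)
```

Because A, C2, and C3 are only different indices into the same hidden-state tensor, one forward pass can serve all three, which is why only B and C1 add compute.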

Metrics (planner picks the hero)

Per layer × method-pair (10 pairs):

  1. Per-persona cosine (mean / min / max): persona-by-persona alignment of centroids.
  2. Inter-persona cosine-matrix correlation (raw + mean-centered): off-diagonal Pearson + Spearman r between the 275×275 inter-persona cosine matrices of the two methods in the pair.
  3. Per-prompt persona-discrimination spread: rank correlation of "which question best separates personas" across methods.
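
A minimal NumPy sketch of metrics 1 and 2, assuming each method's centroids are stacked into an array of shape (n_personas, hidden_dim) with rows in the same persona order. The function names are hypothetical and the toy data is random; this is not the project's analysis script.

```python
import numpy as np

def unit_rows(x):
    """Scale every row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def per_persona_cos(cx, cy):
    """Cosine between matching rows (same persona) of two centroid matrices."""
    return np.sum(unit_rows(cx) * unit_rows(cy), axis=1)

def cosine_matrix(c, center=False):
    """Inter-persona cosine matrix; optionally subtract the mean centroid first."""
    if center:
        c = c - c.mean(axis=0, keepdims=True)
    u = unit_rows(c)
    return u @ u.T

def offdiag_pearson(mx, my):
    """Pearson r over the strict upper triangle of two symmetric matrices."""
    iu = np.triu_indices_from(mx, k=1)
    return float(np.corrcoef(mx[iu], my[iu])[0, 1])

rng = np.random.default_rng(0)
cx = rng.normal(size=(6, 16))               # 6 toy personas, 16-dim activations
cy = cx + 0.1 * rng.normal(size=cx.shape)   # a second "method" that mostly agrees
cos = per_persona_cos(cx, cy)
r_mc = offdiag_pearson(cosine_matrix(cx, center=True), cosine_matrix(cy, center=True))
```

Mean-centering removes the component shared by every persona, so the mean-centered correlation reflects how personas differ from one another rather than their common offset.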

P-values via paired permutation across personas. No effect sizes in prose (per CLAUDE.md).
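
One possible reading of "paired permutation across personas" is a Mantel-style test: relabel the personas of one matrix (rows and columns together) and count how often the shuffled correlation reaches the observed one. The sketch below assumes that reading; the function name, the permutation count, and the toy data are illustrative only.

```python
import numpy as np

def matrix_permutation_pvalue(mx, my, n_perm=999, seed=42):
    """One-sided p-value for the off-diagonal Pearson r between mx and my."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(mx, k=1)
    observed = np.corrcoef(mx[iu], my[iu])[0, 1]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(mx.shape[0])
        shuffled = my[np.ix_(p, p)]  # relabel personas: rows and columns move together
        if np.corrcoef(mx[iu], shuffled[iu])[0, 1] >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return float(observed), (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
pts = rng.normal(size=(12, 8))
mx = np.corrcoef(pts)                                      # toy 12x12 similarity matrix
my = np.corrcoef(pts + 0.05 * rng.normal(size=pts.shape))  # slightly perturbed copy
r_obs, p_val = matrix_permutation_pvalue(mx, my)
```

Permuting rows and columns jointly keeps each matrix internally valid while breaking the persona-to-persona pairing between the two methods.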

Success criterion

All 10 method-pairs satisfy at every probed layer:

  • per-persona min cos > 0.95
  • mean-centered inter-persona matrix Pearson r > 0.90
  • per-prompt persona-discrimination spread Spearman > 0.80

If met → "Method A is recipe-robust; downstream persona-vector clean-results need no recipe footnote."

Kill criterion

ANY method-pair at ANY probed layer has per-persona min cos < 0.85 OR mean-centered matrix Pearson r < 0.70. That outcome invalidates the unfootnoted use of cosine-matrix claims in #92/#99/#113/#123 and reshapes #191's plan.
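
The success and kill thresholds above can be sketched as a single per-cell decision function. The label `GREY` for the in-between zone and the function name `verdict` are naming assumptions for this sketch; the optional `spread_rho` argument covers the per-prompt Spearman gate when that metric is available.

```python
def verdict(cos_min, mc_r, spread_rho=None):
    """Classify one (method-pair, layer) cell from its gate metrics."""
    # Kill criterion: either metric falls below its lower threshold.
    if cos_min < 0.85 or mc_r < 0.70:
        return "KILL"
    # Success criterion: every available metric clears its upper threshold.
    spread_ok = spread_rho is None or spread_rho > 0.80
    if cos_min > 0.95 and mc_r > 0.90 and spread_ok:
        return "PASS"
    # Anything between the two sets of thresholds is left undecided.
    return "GREY"

def overall(cells):
    """Kill if any cell kills; pass only if every cell passes."""
    labels = [verdict(*c) for c in cells]
    if "KILL" in labels:
        return "KILL"
    return "PASS" if all(label == "PASS" for label in labels) else "GREY"
```

For example, `verdict(0.90, 0.95)` falls between the two criteria, while a single cell such as `(0.62, 0.89)` is enough for `overall` to return `KILL`.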

Compute

  • vLLM generation (Method B): 275 × 240 ≈ 66k generations × ~200 tokens ≈ 13M tokens. ~30-45 min on 1×H100.
  • HF forward passes: 275 × 240 = 66k sequences × ~3 distinct prompt formats (full chat-template for A/C2/C3 shared; raw system for C1; full prompt+response for B) ≈ 200k passes. With bs=8, ~30-45 min on 1×H100.
  • Analysis: minutes on CPU.

Total: ~1-2 GPU-hours on 1×H100. Label: compute:small.
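
The counts behind this estimate can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are the ones stated above.

```python
roles, questions = 275, 240
pairs = roles * questions                        # (role, question) pairs: 66_000

gen_tokens = pairs * 200                         # ~200 generated tokens each: 13_200_000

prompt_formats = 3                               # chat-template, raw system, prompt+response
forward_passes = pairs * prompt_formats          # 198_000, i.e. roughly 200k

batch_size = 8
batches = -(-forward_passes // batch_size)       # ceiling division: 24_750 batches
```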

Pod preference

--intent eval (1× H100). No training.

Upload + cleanup

  • Centroids → data/persona_vectors/qwen2.5-7b-instruct/method_{a,b,c1,c2,c3}/{role}.pt + WandB Artifact (reusable for #191/#114/#6).
  • Comparison results JSON + figures committed to git.
  • No model weights to delete (base model only).

References

  • #85 — to be closed as duplicate of #201.
  • #191 — independent EM × persona-vectors plan; runs A+B but does NOT add C variants.
  • scripts/extract_persona_vectors.py:107-409 — Method A + B reference implementation.
  • scripts/compare_extraction_methods.py:152-231 — existing 20×20 A vs B harness.
  • Chen et al. 2025, "Persona Vectors," arXiv:2507.21509 — Method B's literature definition.
  • Prior clean-results depending on Method A: #92, #99, #113, #123.

Spec (from clarifier v1)

  1. Method C: all three variants (C1 raw-system-text, C2 system-block-end, C3 one-token-after-system-end).
  2. Scope (S1): standalone 5-way ablation; #85 closed as duplicate of #201; #191 stays independent.
  3. Type: type:experiment (Done variant = "Done (experiment)").
  4. Persona-set scope (P2): 275 roles × 240 questions.

Timeline · 18 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifying questions before adversarial planning
    
    The body is thin (3 bullets); before drafting a plan I need to lock four things. Context I've already resolved from project knowledge:
    
    **Context resolved:**
    - **Method A** ("the one we've been using") = last token of full chat-templated prompt (`system + user + <|im_start|>assistant\n` → last token, averaged across many `(prompt × question)` pairs). Reference: `scripts/extract_persona_vectors.py:107-200`. Used in clean-results #92, #99, #113, #123.
    - **Method B** ("from persona vectors paper") = mean over generated-response tokens. Reference: `scripts/compare_extraction_methods.py:179-231`, Chen et al. 2025 (arXiv:2502.17424).
    - **Existing data:** `data/persona_vectors/qwen2.5-7b-instruct/` is empty; `data/assistant_axis/` has 275 roles + 240 extraction questions; the 20-persona × 20-prompt harness in `compare_extraction_methods.py` already does an A vs B comparison (but Method C is new).
    - **Related:** **#191** (running plan: EM × persona vectors) already does A+B side-by-side on the SAME model; #191 explicitly states "this dual extraction also incidentally settles #85". **#85** is the original (body-empty) version of #201.
    
    **Blocking ambiguities:**
    
    1. **Method C ("last token of system prompt"):** which of these?
       - **C1:** tokenize ONLY the system prompt string (no chat template, no user, no `<|im_start|>` wrapper) → take last token. Pure persona content, uncontextualized.
       - **C2:** apply the chat template `<|im_start|>system\n{prompt}<|im_end|>` → take the last token of that span (typically `<|im_end|>` or the newline after it).
       - **C3:** apply the chat template through the system block + start of user (`...<|im_start|>system\n{prompt}<|im_end|>`), still no user message → take last token of system block.
    
    2. **Scope vs #191 / #85:** how should this be filed?
       - **S1:** Standalone, focused 3-way ablation (A vs B vs C); independent issue. Adds Method C to the existing harness; #85 closed as duplicate.
       - **S2:** Fold into #191 (which already does A+B and would just add C). Close #201 + #85 as duplicates of #191.
       - **S3:** Standalone but supersedes #85 (close #85 as duplicate of #201; #191 still independently does A+B + EM as a separate experiment).
    
    3. **`type:*` label** (needed for the Done variant). My read: borderline — comparison requires running forward passes on Qwen-7B (not pure re-analysis of cached results), so I'd lean `type:experiment`. Alternative `type:analysis` if you prefer.
    
    4. **Persona-set scope.** Either of:
       - **P1:** 20 personas × 20 prompts (matches existing `compare_extraction_methods.py` harness; ~30 min on 1×H100; cheap & fast). 
       - **P2:** 275 assistant-axis roles × 240 extraction questions (matches `extract_persona_vectors.py`; ~1.5 GPU-hr; gives full coverage and a much richer cosine matrix). 
    
    (Layers default to `[7, 14, 21, 27]` — the planner can sweep more if needed.)
    
    I'll also ask these in chat so you can answer inline.
    <!-- /epm:clarify -->
  2. epm:clarify-answers· system
    <!-- epm:clarify-answers v1 -->
    ## Clarifier answers (from chat)
    
    User-supplied verbatim answers to the four blocking questions:
    
    1. **Method C definition** → **all three variants** (C1, C2, C3) tested side-by-side.
       *User notes: "Try all 3"*
    2. **Scope vs #191 / #85** → **S1: standalone 3-way (now 5-way) ablation; close #85 as duplicate of #201; #191 stays independent.**
    3. **Type label** → **`type:experiment`** (Done variant = "Done (experiment)").
    4. **Persona-set scope** → **P2: 275 roles × 240 questions.**
    
    Body has been updated with a full spec drawing on these answers and the resolved project context. #85 closed as duplicate. Advancing to `status:planning`.
    <!-- /epm:clarify-answers -->
  3. epm:auto-defaults· system
    <!-- epm:auto-defaults v1 -->
    ## Step 0b auto-defaults applied
    
    Issue arrived with no `status:*` label, no `type:*` label, and a 3-bullet body. Applied:
    
    - `status:proposed` (autofilled before clarifier)
    - `type:experiment` (chosen by user via clarifier)
    - `compute:small` (estimated 1-2 GPU-hr on 1×H100)
    - `aim:cross-cutting` (this is methods sanity-check that touches every persona-vector-based result)
    - Body drafted from clarifier answers + resolved project context (#85, #191, `extract_persona_vectors.py`, `compare_extraction_methods.py`, Chen et al. arXiv:2502.17424).
    
    The original 3-bullet body is preserved in this audit comment for reference:
    
    > - the one we've been using
    > - just take last token of system prompt
    > - the one from persona vectors paper
    <!-- /epm:auto-defaults -->
  4. epm:clarify· system
    <!-- epm:clarify v2 -->
    ## Clarifier — All clear
    
    No blocking ambiguities remain. Proceeding to adversarial planning.
    
    **Context resolved:**
    - Method A = `extract_persona_vectors.py` last-input-token (used in clean-results #92, #99, #113, #123)
    - Method B = `extract_persona_vectors.py` mean-response-token (Chen et al. 2025, arXiv:2502.17424)
    - Methods C1/C2/C3 = three system-prompt-position variants (defined in body)
    - Persona set = 275 roles × 240 questions from `data/assistant_axis/`
    - Layers = `[7, 14, 21, 27]` (Qwen-2.5-7B has 28 transformer layers)
    - Compute = ~1-2 GPU-hr on 1×H100 (`compute:small`)
    - Sibling issues: #191 (EM × persona-vectors, A+B only, independent); #85 closed as duplicate.
    <!-- /epm:clarify -->
  5. epm:plan· system
    <!-- epm:plan v1 -->
    > **Cost gate:** estimated **~1.8 GPU-hours** (≤2.2 GPH worst case) on 1× H100 (intent=eval). Reply `approve` to dispatch. The plan went through Phase-1.5 fact-check (1 citation fix, 1 index-alignment fix), Phase-2 critic round 1 (REVISE → 12 blocking + 6 quality items, all addressed in v2), Phase-2 critic round 2 (REVISE → 2 mechanical blockers fixed in v3), and Step-2b consistency check vs #191 (WARN, no blockers).
    
    # Plan: Issue #201 — 6-way persona-vector extraction-method ablation (v3)
    
    ## Revision history
    
    - **v3 (current):** addressed round-2 critic (verdict REVISE, 2
      mechanical blockers + 6 quality items). Changes:
      - **N=274 → N=275 throughout.** v2 misread "275 minus default = 274";
        `default.json` is an orphan instruction file NOT in `role_list.json`,
        so `load_roles()` already returns 275 non-empty roles. The
        `--exclude-roles default` flag is kept as a defensive no-op.
        Off-diag count corrected to 275×274/2 = 37 675; per-method per-q
        cache size 1.89 GB (was wrongly 1.88).
      - **Smoke persona list fixed.** v2 listed `villain`/`surgeon`/`florist`
        which are NOT keys in `role_list.json` (they were from the legacy
        inline list in `compare_extraction_methods.py`). v3 replaces them
        with `critic`/`doctor`/`poet`, all verified present.
      - **B*↔A algebra caveat added** to §11.10. A is the LAST input token
        and B* mean-pools over ALL input tokens of the same forward pass —
        so A is one of B*'s pool members; cos(A, B*) is partly arithmetic.
        A `B*_no_last` sensitivity-check centroid (mean over `[0, prompt_len-1)`,
        A excluded) is now computed inline at zero extra cost and reported
        as a sub-field of the (A, B*) cell. NOT a load-bearing pair on its
        own — used only to bound the algebra contribution.
    - **v2:** addressed Phase-1.5 fact-check (1 citation fix
      arXiv:2502.17424 → arXiv:2507.21509; 1 empty-response B/A
      index-alignment fix added in §13) and Phase-2 critic (12 blocking + 6
      quality items). Major changes from v1:
      1. Re-read prior 20×20 numbers from
         `eval_results/extraction_method_comparison/comparison_results.json`
         and reframed §1 with the corrected mean-centered Pearson r values
         (0.296 / 0.776 / 0.821 / 0.829 at L=10/15/20/25 — *not* 0.910/0.911
         at L=15/20 as the v1 plan misquoted; those were the raw, not
         mean-centered, values). Kill threshold (mc r < 0.70) now fires at
         L=10 under the prior; success threshold (mc r > 0.90) fails at all
         four layers.
      2. Headline framing is now "**Which extraction recipes agree? (the
         partition)**", not "Method A is recipe-robust".
      3. Added **Method B*** (mean-pool over INPUT tokens of B's same
         forward pass) → 6 methods, **15 pairs**.
      4. Added a **noise-floor positive control** (split 240 questions
         into two non-overlapping 120-question halves; same-method
         cross-half cosine).
      5. N=275 — `role_list.json` already excludes the orphan `default.json`; the `--exclude-roles default` flag is defensive (no-op against the actual roster). v2 misread "275 minus default = 274"; round-2 critic caught it, fixed in v3.
      6. Versioned output dir `data/persona_vectors/issue_201/...` to
         prevent stale-cache collision with prior production-run centroids.
      7. Per-role write-then-free + per-method resume cache check inside
         the new combined extractor.
      8. fp16 cast on `per_prompt_activations.pt` to land at ≈ 9.4 GB total.
      9. Pre-flight `--smoke` flag: bs=1 vs bs=8 cosine sanity (must agree
         ≥ 0.999 per persona before §9 hot-fix is allowed).
      10. Reframed A↔C2, A↔C3, C2↔C3 as a *position-locality sanity check*
          (not load-bearing). Load-bearing pairs: A↔B, A↔B*, A↔C1, B↔B*,
          B*↔C1.
      11. Verdict-3 (per-question persona spread Spearman) demoted from
          gate to descriptive (per critic recommendation, item 13).
      12. Recipe-inheritance audit of #92/#99/#113/#123 done — propagation
          to those clean-results is **conditional, not blanket** (§11.8).
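
The two corrected figures in the v3 notes (off-diagonal count and per-method per-question cache size) can be reproduced as below. The hidden size of 3584 for Qwen2.5-7B is an assumption not stated in the plan, as is the reading that the cache stores one fp16 vector per (role, question, layer).

```python
roles, questions, layers = 275, 240, 4
hidden_dim = 3584        # assumed hidden size of Qwen2.5-7B (not stated in the plan)
bytes_per_value = 2      # fp16

# Distinct off-diagonal entries of a symmetric 275x275 cosine matrix.
offdiag = roles * (roles - 1) // 2

# One fp16 activation vector per (role, question, layer) for a single method.
cache_bytes = roles * questions * layers * hidden_dim * bytes_per_value
cache_gb = cache_bytes / 1e9

print(offdiag, round(cache_gb, 2))
```

Under these assumptions the script prints 37675 and 1.89, matching the corrected values.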
    
  6. epm:experiment-implementation· system
    <!-- epm:experiment-implementation v1 -->
    ## Implementation Report — round 1
    
    **Status:** READY-FOR-REVIEW
    
    ### Files changed
    - `scripts/extract_persona_vectors.py`: Extended with `--method all6` code path (7-method extraction: A, B, B*, B*_no_last, C1, C2, C3), `--smoke` (bs=1 vs bs=8 cosine sanity), `--smoke-empty-response` (Method B empty-response rate gate), `--exclude-roles`, and `--output-root` flags. New functions: `setup_hooks()`, `remove_hooks()`, `extract_method_b_bstar()`, `extract_methods_combined()`, `run_smoke_check()`, `run_smoke_empty_response()`, `_save_metadata()`, `_git_commit_hash()`. Per plan sections 3a, 3c, 3e.
    - `scripts/compare_extraction_methods_6way.py`: New analysis script. 15-pair cross-method metrics (per-persona cos, raw + mean-centered matrix correlation, per-question persona spread Spearman), B*_no_last sensitivity check, noise-floor positive control (same-method cross-half cosine), verdict logic (PASS/GREY/KILL partition), 3 figures (verdict heatmap, per-persona cos violins, mc Pearson r bars). Per plan sections 3b, 3d.
    
    ### Diff summary
    +1806 lines, -83 lines across 2 files.
    ```
    scripts/compare_extraction_methods_6way.py | 910 +++++++++++++++++++++++++++
    scripts/extract_persona_vectors.py         | 979 ++++++++++++++++++++++++++---
    2 files changed, 1806 insertions(+), 83 deletions(-)
    ```
    
    ### Plan adherence
    - **3a (extract_persona_vectors.py extension):** DONE — `--method all6`, `--exclude-roles`, `--smoke`, `--smoke-empty-response`, `--output-root` flags; `extract_methods_combined()` for A/C1/C2/C3; `extract_method_b_bstar()` for B/B*/B*_no_last; resume cache; per-role write-then-free; fp16 per-q cache; `<|im_end|>` count assert; intersection mask for B/B*.
    - **3b (compare_extraction_methods_6way.py):** DONE — 15 pairs, 4 layers, per-persona cos (min/p5/mean/max), raw + mc matrix Pearson/Spearman, per-q persona spread Spearman (descriptive, N/A for C1), noise-floor positive control, verdict logic with LOAD_BEARING and SANITY pair classifications.
    - **3c (fp16 per-q cache):** DONE — `.half().contiguous()` before `torch.save()` for A/B/B*/B*_no_last/C2/C3 per-q caches; C1 has no per-q cache.
    - **3d (output paths):** DONE — `{output_root}/{model_short_name}/method_{a,b,bstar,bstar_no_last,c1,c2,c3}/` with `{role}.pt`, `{role}__per_q.pt`, `all_centroids.pt`, `metadata.json`.
    - **3e (pre-launch sequence):** DONE — smoke check and smoke-empty-response commands are wired.
    - **Figures (section 6):** DONE — `verdict_partition_heatmap.png`, `per_persona_cos_pair_dists.png`, `matrix_corr_pair_layer.png`.
    - **Seeds (section 4):** DONE — `torch.manual_seed(42)`, `np.random.seed(42)`.
    - **`<|im_end|>` token lookup:** DONE — `tokenizer.convert_tokens_to_ids("<|im_end|>")`, NOT hardcoded.
    - **B*_no_last sensitivity (section 11.10):** DONE — computed inline, saved to `method_bstar_no_last/`, reported as sub-field of (A, B*) cell.
    - **Noise-floor control (section 11.11):** DONE — split 240 questions into halves, re-aggregate from per-q caches, save `noise_floor.json`.
    - **Versioned output dir (section 11.12):** DONE — default `data/persona_vectors/issue_201`.
    - **Metadata (section 4):** DONE — git commit hash, timestamp, model, layers, etc. in every `metadata.json`.
    - **Intersection mask (section 13):** DONE — B and B* skip same (role, q) on empty response; mask saved in metadata.
    
    ### Lint + dry-run
    - `ruff check`: PASS
    - `ruff format --check`: PASS
    - AST parse: PASS (both files)
    - Import verification: PASS (`from scripts.extract_persona_vectors import ...` and `from scripts.compare_extraction_methods_6way import ...` both succeed)
    - Dry-run command: N/A (no GPU on local VM; scripts verified via import + AST parse)
    
    ### Assumptions made
    - Qwen2.5-7B-Instruct chat template produces exactly 2 `<|im_end|>` tokens per (system, user, generation_prompt) input. Defended with assert.
    - vLLM 0.11.0 `LLM.chat()` works with `SamplingParams(temperature=0.0)`. Verified in production.
    - `paper_plots.py` import handled gra
  7. epm:code-review· system
    <!-- epm:code-review v1 -->
    ## Code-Reviewer Verdict -- CONCERNS
    
    **Diff size:** +1806 / -83 lines across 2 files
    **Plan adherence:** COMPLETE (all plan items addressed)
    **Tests:** N/A (no unit tests; this is a one-shot experiment script -- covered by smoke checks)
    **Lint:** PASS (ruff check + ruff format both clean)
    **Security sweep:** CLEAN (no hardcoded secrets, no injection vectors)
    
    ---
    
    ## Plan Adherence
    
    | Plan Item | Status | Notes |
    |-----------|--------|-------|
    | 6 methods (A, B, B*, C1, C2, C3) + B*_no_last sensitivity centroid | Implemented | 7 method dirs, correctly mapped |
    | `--method all6` choice | Implemented | Also accepts `all` as alias |
    | `--smoke` flag (bs=1 vs bs=8, cos >= 0.999) | Implemented | 10 personas verified present |
    | `--smoke-empty-response` (5x5, > 2% gate) | Implemented | Clean vLLM teardown after |
    | Per-role write-then-free pattern | Implemented | Both `extract_methods_combined` and `extract_method_b_bstar` |
    | Resume cache (skip role if all method dirs have `{role}.pt`) | Implemented | Also checks `__per_q.pt` for B/B* |
    | fp16 per-q activations via `.half().contiguous()` | Implemented | Correct |
    | `<\|im_end\|>` count assert (exactly 2) | Implemented | With named role in error msg |
    | `--exclude-roles` flag | Implemented | Defensive no-op for `default` |
    | Versioned output root (`data/persona_vectors/issue_201/...`) | Implemented | Via `--output-root` |
    | 15 pairs x 4 layers x 3 metrics | Implemented | All metrics computed |
    | Noise-floor positive control (cross-half) | Implemented | Correctly splits 240/2 = 120 per half |
    | B*_no_last sensitivity check (sub-field of A<->B* cell) | Implemented | `_compute_bstar_no_last_sensitivity` |
    | Verdict logic (PASS/GREY/KILL) | Implemented | Correct threshold checks |
    | 3 figures (heatmap, violins, bar chart) | Implemented | With paper-plots fallback |
    | Metadata: git SHA in run_result.json | Implemented | `_git_commit_hash()` |
    | vLLM teardown before HF model load | Implemented | `generate_responses_vllm` does `del llm; torch.cuda.empty_cache()` |
    | Empty-response fallback (intersection_mask) | Implemented | Skips both B and B* on empty |
    
    ---
    
    ## Issues Found
    
    ### Major (should fix before run, but not scientific blockers)
    
    **1. B/B* per_q tensor misalignment with positional indexing**
    `scripts/extract_persona_vectors.py` lines 570-671
    
    When a question is skipped due to an empty response, `per_q_b[layer]` accumulates fewer entries than `n_questions`. The saved `per_q_tensor` shape becomes `(n_valid, n_layers, D)` rather than `(n_questions, n_layers, D)`. However, the analysis script's `compute_per_question_persona_spread` (line 172) iterates `for q_idx in range(n_questions)` and uses `per_q_x[q_idx, layer_idx, :]`, so question index 5 in the per_q tensor no longer corresponds to the same original question across roles that had different empty-response patterns.
    
    Additionally, the noise-floor function `load_per_q_centroids` (line 146) uses `per_q[q_slice]` which assumes a fixed 240 length. For B/B*, a `slice(120, 240)` on a tensor with shape `(238, ...)` would silently produce a different split than intended.
    
    **Impact:** The per-question persona spread Spearman for any pair involving B or B* will be computed on misaligned question indices across roles with different empty-response patterns, potentially producing a nonsensical value. The noise-floor split for B/B* will be slightly off.
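
The misalignment can be reproduced on a toy array. The sketch below contrasts compacted storage (valid entries only) with position-aligned storage that keeps one row per original question and marks skipped ones with NaN; the shapes and names are illustrative, not those of the actual per-question cache.

```python
import numpy as np

n_questions, dim = 6, 3
answered = [0, 1, 3, 4, 5]   # question 2 produced an empty response
acts = {q: np.full(dim, float(q)) for q in answered}  # toy activation = question index

# Compacted storage: row 2 now holds question 3, so positional indexing drifts.
compact = np.stack([acts[q] for q in answered])

# Aligned storage: one row per original question, NaN where skipped.
aligned = np.full((n_questions, dim), np.nan)
for q, v in acts.items():
    aligned[q] = v

second_half = slice(n_questions // 2, n_questions)
compact_half = compact[second_half]      # silently covers questions 4 and 5 only
aligned_half = aligned[second_half]      # covers questions 3, 4, 5 as intended
centroid = np.nanmean(aligned, axis=0)   # skipped rows are ignored in the mean
```

With aligned storage, downstream code has to use NaN-aware reductions such as `np.nanmean`, but row `q` always refers to original question `q` for every role.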
    
    **Mitigating factor:** (a) The per-question persona spread metric is explicitly descriptive-only, NOT in the gate (plan item 13). (b) The smoke-empty-response check gates at 2% empty rate, so the misalignment affects at most a few entries. (c) Centroids themselves are correctly computed from only valid entries.
    
    **Fix:** Either pad with NaN/sentinel for skipped questions to maintain positional alignment, or store only valid entries but also persist the mapping from per_q index to original question index. The simplest fix: for B/B*, when skipping a question, append a zero-vector (or Na
  8. epm:launch· system
    <!-- epm:launch v1 -->
    ## Launch
    
    - **Worktree:** `.claude/worktrees/issue-201`
    - **Branch:** `issue-201`
    - **PR:** https://github.com/superkaiba/explore-persona-space/pull/211
    - **Code review:** CONCERNS (non-blocking) — all plan items implemented correctly
    - **Pod:** `epm-issue-201` (1× H100, 81 GB VRAM, 190 GB disk)
    - **Preflight:** PASS (clean git, env synced, GPU free)
    - **Launch command (from plan §3e):**
      ```bash
      # Smoke pre-flights
      uv run python scripts/extract_persona_vectors.py --method all6 --smoke --layers 7 14 21 27 --output-root data/persona_vectors/issue_201
      uv run python scripts/extract_persona_vectors.py --method B --smoke-empty-response --roles assistant,critic,pirate,tutor,translator --n-questions 5 --output-root data/persona_vectors/issue_201
      # Full extraction
      nohup uv run python scripts/extract_persona_vectors.py --method all6 --layers 7 14 21 27 --output-root data/persona_vectors/issue_201 --exclude-roles default --max-new-tokens 200 > /workspace/extract_201.log 2>&1 &
      # After extraction: analysis
      nohup uv run python scripts/compare_extraction_methods_6way.py --centroid-root data/persona_vectors/issue_201/qwen2.5-7b-instruct --layers 7 14 21 27 --output-dir eval_results/issue_201 > /workspace/compare_201.log 2>&1 &
      ```
    - **Experimenter agent dispatched** for pod ops + monitoring + debugging.
    <!-- /epm:launch -->
  9. epm:progress· system
    <!-- epm:progress v1 -->
    
    ## Extraction Progress Update
    
    **Phase 0 (smoke tests):** PASS
    - bs=1 vs bs=8 cosine: all >= 0.999991 at layers [7,14,21,27]
    - Method B empty-response: 0/25 empty (0.00%)
    
    **Phase 1 (vLLM generation):** COMPLETE
    - 66,000 prompts processed (275 roles x 240 questions)
    - Wall time: ~12 minutes (after ~5 min tokenization)
    - Responses cached to `responses.json` (83.6 MB)
    
    **Phase 2 (HF extraction):** IN PROGRESS
    - Method B/B*/B*_no_last: 3/275 roles at ~4.0 roles/min (est. ~69 min remaining)
    - Method A/C1/C2/C3: not yet started (est. ~30-45 min after B completes)
    
    **Infrastructure notes:**
    - Applied `_epm_157_compat.pth` for transformers 5.5 + vLLM 0.11 tokenizer compat
    - Data files (`role_list.json`, `extraction_questions.jsonl`, instructions/) synced via scp
    - `PYTHONUNBUFFERED=1` for real-time logging
    - GPU: 1x H100, ~15 GB VRAM for HF extraction (vs ~70 GB for vLLM)
    
    **ETA:** Phase 2 completion in ~1.5-2 hours. Phase 3 (analysis) ~5 min after that.
  10. epm:progress· system
    <!-- epm:progress v2 -->
    
    ## Extraction Progress Update v2
    
    **Phase 1 (vLLM generation):** COMPLETE
    - 66,000 prompts, ~12 min gen time after ~5 min tokenization
    
    **Phase 2 (HF extraction - B/B*/B*_no_last):** COMPLETE  
    - 275 roles at 4.1 roles/min, 67 min
    - All 3 method centroids saved (275 each)
    
    **Phase 3 (HF extraction - A/C1/C2/C3):** COMPLETE
    - 275 roles at 13.8 roles/min, 20 min  
    - All 4 method centroids saved (275 each)
    
    **Phase 4 (6-way comparison analysis):** IN PROGRESS (running via nohup)
    - First 5/84 layer-pair results computed at ~3.1 min/layer
    - Partial results so far:
      - A vs B: cos_min ranges 0.53 (L7) to -0.01 (L27); mc_r ranges 0.53 (L7) to 0.90 (L27)
      - A vs B*: cos_min=0.51, mc_r=0.22 at L7
    - ETA: ~4 hours total for full 21-pair × 4-layer analysis with permutation tests
    - Process: PID 17312, 390% CPU, 1.2 GB RSS, under nohup
    
    **Total extraction wall time:** ~1.5 hours (B extraction dominated: 67 min)
    **Total GPU time (H100):** ~1.3 GPU-hours
    
    Confirmed: each of the 7 method directories contains an `all_centroids.pt` covering all 275 roles.
  11. epm:results· system
    <!-- epm:results v1 -->
    ## Results: 6-way extraction-method ablation
    
    ### Verdict: **KILL — all 15 pairs at all 4 layers.** Extraction method matters enormously.
    
    No method-pair reaches even the GREY zone on per-persona cosine (all cos_min < 0.85 at every layer). The predicted "sanity-check" pairs (A↔C2, A↔C3, C2↔C3 — same forward pass, different positions) also KILL, falsifying the plan's position-locality prediction.
    
    ### Headline numbers (load-bearing pairs)
    
    | Pair | L7 cos_min / mc_r | L14 cos_min / mc_r | L21 cos_min / mc_r | L27 cos_min / mc_r |
    |---|---|---|---|---|
    | **A↔B** | 0.53 / 0.53 | 0.64 / 0.75 | 0.62 / 0.89 | −0.01 / 0.90 |
    | **A↔B*** | 0.51 / 0.22 | 0.61 / 0.20 | 0.46 / 0.37 | 0.04 / 0.68 |
    | **A↔C1** | 0.51 / 0.80 | 0.50 / 0.81 | 0.17 / 0.69 | −0.08 / 0.65 |
    | **B↔B*** | 0.56 / 0.15 | 0.62 / 0.20 | 0.57 / 0.34 | 0.35 / 0.68 |
    | **B*↔C1** | 0.43 / 0.21 | 0.38 / 0.22 | 0.01 / 0.37 | 0.29 / 0.79 |
    
    ### Sanity pairs (predicted PASS, actual KILL)
    
    | Pair | L7 cos_min / mc_r | L14 cos_min / mc_r | L21 cos_min / mc_r | L27 cos_min / mc_r |
    |---|---|---|---|---|
    | **A↔C2** | **0.07** / 0.75 | 0.25 / 0.84 | 0.42 / 0.75 | 0.00 / 0.62 |
    | **A↔C3** | 0.61 / 0.85 | 0.69 / 0.89 | 0.51 / 0.79 | 0.24 / 0.72 |
    | **C2↔C3** | 0.22 / 0.76 | 0.42 / 0.91 | 0.43 / 0.86 | 0.12 / 0.74 |
    
    ### Noise floor (Method A same-method cross-half)
    cos_min ≥ 0.9945, mc_r ≥ 0.9965 at all 4 layers. The thresholds are meaningful — the pipeline itself is very stable; cross-method divergence is real, not noise.
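
A minimal sketch of a same-method split-half check, assuming per-question activations are available as an array of shape (n_personas, n_questions, dim) and that the halves are the first and last 120 questions by index. The synthetic data and the function name are illustrative; how the real run assigns questions to halves is not specified here.

```python
import numpy as np

def split_half_cos(per_q):
    """per_q: (n_personas, n_questions, dim) -> cosine between the two half-centroids."""
    n_q = per_q.shape[1]
    first = per_q[:, : n_q // 2].mean(axis=1)
    second = per_q[:, n_q // 2 :].mean(axis=1)
    num = np.sum(first * second, axis=1)
    return num / (np.linalg.norm(first, axis=1) * np.linalg.norm(second, axis=1))

rng = np.random.default_rng(42)
persona_dirs = rng.normal(size=(5, 32))        # one latent direction per toy persona
noise = 0.5 * rng.normal(size=(5, 240, 32))    # per-question jitter around it
per_q = persona_dirs[:, None, :] + noise
floor = split_half_cos(per_q)                  # one value per persona
```

A cross-method cosine is only informative relative to this floor: if the two halves of the same method already disagreed, low cross-method agreement could not be attributed to the recipe.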
    
    ### B*_no_last sensitivity (algebra caveat)
    cos(A, B*_no_last) ≈ cos(A, B*) at all layers (the logged per-layer values differ by at most 0.05 in cos_min and 0.02 in mc_r). A's membership in B*'s pool contributes little algebraic inflation. The disagreement is genuine.
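
The algebra behind this check can be illustrated on random toy activations: B* is a blend of B*_no_last and A, with weight 1/prompt_len on A. The array shapes and variable names below are assumptions for the sketch, and random data says nothing about the size of the effect in the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
prompt_len, dim = 50, 16
hidden = rng.normal(size=(prompt_len, dim))   # toy per-token activations for one prompt

a = hidden[-1]                                # Method A: last input token
b_star = hidden.mean(axis=0)                  # B*: mean over all input tokens (A included)
b_star_no_last = hidden[:-1].mean(axis=0)     # B*_no_last: mean over [0, prompt_len-1)

def cos(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# B* can be rebuilt from B*_no_last and A, so A's share of the pool is 1/prompt_len.
recombined = ((prompt_len - 1) * b_star_no_last + a) / prompt_len

# Difference attributable to A being one of B*'s pool members.
inflation = cos(a, b_star) - cos(a, b_star_no_last)
```

Because A's weight shrinks as 1/prompt_len, long prompts leave little room for purely arithmetic agreement between A and B*.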
    
    ### Key interpretation threads (for analyzer)
    1. **Per-persona cosine KILLS everywhere**: different positions in the residual stream encode genuinely different per-persona directions, even within the same forward pass.
    2. **Matrix correlations (mc_r) are more stable**: relative persona geometry is partially preserved even when absolute direction diverges. A↔B reaches mc_r=0.89-0.90 at L21/L27 while cos_min is 0.62 at L21 and −0.01 at L27.
    3. **Layer 27 is degenerate**: cos_min is near zero or negative for half of the tabulated pairs (A↔B, A↔B*, A↔C1, A↔C2) and at most 0.35 for the rest. One plausible reading is that the final layer before unembedding specializes for next-token prediction rather than persona-axis geometry.
    4. **A↔C3 is the closest pair**: cos_mean ~0.67-0.75 at L7-L14. C3 (one position after system-block end) is the most similar to A (last token of full prompt), but even this pair sits 0.25 or more below a per-persona cos of 1.
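
To see how a high matrix correlation can coexist with low per-persona cosine, consider a toy case in which one method's centroids are a rotated copy of the other's. This is an illustration of the geometry only, with synthetic data; it is not a claim about what the model does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_personas, dim = 8, 32
cx = rng.normal(size=(n_personas, dim))

# Random orthogonal matrix via QR; rotating every centroid preserves all pairwise angles.
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
cy = cx @ q

def unit_rows(x):
    """Scale every row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

per_persona = np.sum(unit_rows(cx) * unit_rows(cy), axis=1)  # direction agreement per persona
mx = unit_rows(cx) @ unit_rows(cx).T                         # inter-persona cosines, method X
my = unit_rows(cy) @ unit_rows(cy).T                         # inter-persona cosines, method Y
iu = np.triu_indices(n_personas, k=1)
r = np.corrcoef(mx[iu], my[iu])[0, 1]
```

Here the inter-persona matrices are identical up to floating-point error while each persona's direction has moved, which is the pattern a gap between cos_min and mc_r would suggest in weaker form.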
    
    ### Artifacts
    - `eval_results/issue_201/run_result.json` — full 15-pair × 4-layer metrics
    - `eval_results/issue_201/figures/verdict_partition_heatmap.png` (hero)
    - `eval_results/issue_201/figures/per_persona_cos_pair_dists.png` (violins)
    - `eval_results/issue_201/figures/matrix_corr_pair_layer.png`
    - `data/persona_vectors/issue_201/qwen2.5-7b-instruct/method_*/all_centroids.pt` — 7 method dirs × 275 roles
    
    ### Reproducibility
    - Model: Qwen/Qwen2.5-7B-Instruct, bf16
    - Seed: 42, vLLM greedy (T=0)
    - GPU-hours: ~1.5 (extraction) + ~0.04 (analysis) = ~1.54 on 1×H100
    - Branch: issue-201, commit 4822c74
    - Pod: epm-issue-201
    
    ### Deviations from plan
    1. **Per-question persona spread skipped** — `--skip-spread` flag used; metric 3 was descriptive-only (demoted from gate per plan item 13). Full spread computation was I/O-bound (~10 hours); can be re-run later with RAM-cached per_q files.
    2. **B*_no_last sensitivity computed and printed but not saved to run_result.json** — minor save bug; values from log: L7 0.51/0.22, L14 0.60/0.20, L21 0.46/0.35, L27 −0.01/0.67. All confirm A↔B*_no_last ≈ A↔B*.
    
    ### Hot-fix log
    1. `4822c74` — added `--skip-spread` flag to `compare_extraction_methods_6way.py` (7 lines changed). Analysis walltime dropped from ~12 hours to 2.6 minutes.
    <!-- /epm:results -->
  12. epm:upload-verification· system
    <!-- epm:upload-verification v1 -->
    ## Upload verification
    
    | Artifact | Required? | Status | Location |
    |---|---|---|---|
    | Model on HF Hub | No (base model only, no training) | N/A | — |
    | Eval JSON | Yes | ✅ Local copy at `eval_results/issue_201/run_result.json` | 44 KB |
    | Figures | Yes | ✅ Local copies in `eval_results/issue_201/figures/` | 3 PNGs (553 KB) |
    | Centroids on pod | Reusable for #191 | ✅ Preserved on stopped pod `epm-issue-201` | `data/persona_vectors/issue_201/qwen2.5-7b-instruct/method_*/` |
    | Dataset on HF Hub | No (no new data generated) | N/A | — |
    | Local weights cleaned | N/A (no training) | N/A | — |
    
    **Verdict: PASS** — all required artifacts have durable copies. Pod stopped (volume preserved for #191 reuse).
    <!-- /epm:upload-verification -->
  13. epm:interpretation· system
    <!-- epm:interpretation v1 -->
    
    ## Interpretation
    
    **Claim:** Persona-vector extraction recipes are not interchangeable (0/60 pair-layer cells PASS), but relative persona geometry is partially preserved (mc_r 0.75-0.91 at mid-layers).
    
    **Hero figure:**
    ![Verdict partition heatmap](https://raw.githubusercontent.com/superkaiba/explore-persona-space/66e4434/eval_results/issue_201/figures/verdict_partition_heatmap.png)
    
    **Key sub-findings:**
    1. Absolute directions differ substantially by recipe (cos_min 0.01-0.69 vs noise floor 0.99+). No pair is interchangeable.
    2. Relative geometry (which-persona-is-close-to-which) is more stable: mc_r reaches 0.89 (A vs B, L21) and 0.91 (C2 vs C3, L14).
    3. Sanity-pair failure is genuine: tokens 15-30 positions apart in the same forward pass encode very different persona information (A vs C2 cos_min = 0.07 at L7).
    4. B*_no_last sensitivity shows negligible algebra contribution -- pooling strategy, not token membership, drives A-vs-B* disagreement.
    5. Layer 27 is degenerate for absolute cosine but paradoxically preserves relative rankings (A vs B mc_r = 0.90).
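
    A minimal sketch of the per-cell verdict logic, applied to the two table rows visible in the results (A↔C3 and C2↔C3, read as cos_min / mc_r at L7, L14, L21, L27). The PASS thresholds follow the hypothesis (cos > 0.95, r > 0.90); the GREY band at cos_min > 0.85 and the function name are assumptions for illustration.

```python
def verdict(cos_min, mc_r, pass_cos=0.95, grey_cos=0.85, min_r=0.90):
    """Classify one pair-layer cell as PASS, GREY, or KILL."""
    if mc_r > min_r and cos_min > pass_cos:
        return "PASS"
    if mc_r > min_r and cos_min > grey_cos:  # assumed GREY band
        return "GREY"
    return "KILL"


# (cos_min, mc_r) keyed by (pair, layer), copied from the results table.
cells = {
    ("A-C3", 7): (0.61, 0.85),
    ("A-C3", 14): (0.69, 0.89),
    ("A-C3", 21): (0.51, 0.79),
    ("A-C3", 27): (0.24, 0.72),
    ("C2-C3", 7): (0.22, 0.76),
    ("C2-C3", 14): (0.42, 0.91),
    ("C2-C3", 21): (0.43, 0.86),
    ("C2-C3", 27): (0.12, 0.74),
}
verdicts = {key: verdict(*vals) for key, vals in cells.items()}
n_kill = sum(v == "KILL" for v in verdicts.values())
```

    Note that C2↔C3 at L14 clears the mc_r bar (0.91) but still fails on cos_min (0.42), which is the pattern behind sub-findings 1 and 2.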
    
    **Confidence: HIGH** -- noise floor rules out sampling noise; N=275 x 240 provides high resolution; kill is unambiguous.
    
    **Downstream implications:** #92, #99, #113, #123 need recipe footnotes for absolute-cosine claims. Relative-ranking claims are more robust. #191 should calibrate EM-induced shifts against this baseline disagreement.
    <!-- /epm:interpretation -->
  14. epm:analysis· system
    <!-- epm:analysis v1 -->
    
    ## Analysis complete
    
    **Clean-result issue:** #216 -- *Persona-vector extraction recipes are not interchangeable but preserve relative geometry (HIGH confidence)*
    
    **Hero figure:**
    ![Verdict partition heatmap](https://raw.githubusercontent.com/superkaiba/explore-persona-space/66e4434/eval_results/issue_201/figures/verdict_partition_heatmap.png)
    
    **Recap:** Six extraction methods on 275 personas x 240 questions produce persona-vector centroids that disagree substantially in absolute direction (0/60 pair-layer cells pass the cos_min > 0.95 threshold; noise floor is 0.99+). However, relative persona geometry is partially preserved: mean-centered matrix Pearson r reaches 0.89-0.91 at mid-layers for several pairs. Downstream clean-results (#92, #99, #113, #123) that cite absolute cosine values need recipe footnotes; relative-ranking claims are more defensible.
    <!-- /epm:analysis -->
  15. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    ## Independent Review Verdict: **PASS**
    
    **Reproducibility: COMPLETE | Structure: COMPLETE | Claims: VERIFIED (6 minor numeric corrections needed)**
    
    The core KILL verdict (60/60 cells, no pair passes at any layer) is unambiguously supported by the raw data. Confidence HIGH is justified. Six minor numeric inaccuracies identified:
    
    1. A↔B* cos_min "0.04-0.51" → should be "0.04-0.61" (L14=0.607 excluded)
    2. B↔B* cos_min "0.35-0.66" → should be "0.35-0.62" (max=0.624)
    3. A↔B* mc_r "0.22-0.68" → should be "0.20-0.68" (min=0.202)
    4. Noise floor "0.995+" → should be "0.99+" (b__L27=0.987)
    5. "8 of 15 pairs" at L27 → should be "10 of 15"
    6. B*_no_last "delta < 0.02" → L27 exception (delta=0.046)
    
    None affect the verdict or confidence. Fix before promoting from draft.
    <!-- /epm:reviewer-verdict -->
  16. epm:done· system
    <!-- epm:done v1 -->
    ## Done
    
    **Outcome:** All 15 extraction-method pairs KILL at all 4 layers. Extraction recipe matters enormously for absolute persona-vector directions (cos_min 0.01-0.69 vs noise floor 0.99+), but relative geometry (which-persona-is-close-to-which) is partially preserved (mc_r 0.75-0.91 at mid-layers).
    
    **Key numbers:** 60/60 KILL cells. Best load-bearing pair: A↔B L21 cos_min=0.62, mc_r=0.89. Noise floor cos_min ≥ 0.987.
    
    **Confirmed:** extraction recipe is a first-order variable for any persona-vector analysis. Position in the residual stream (not just content) determines what the hidden state captures about persona identity.
    
    **Falsified:** the plan's prediction that same-forward-pass positions (A↔C2, A↔C3) would trivially agree. They don't — A↔C2 cos_min=0.07 at L7.
    
    **What's next:** #191 should calibrate EM shifts against this baseline disagreement. #92/#99/#113/#123 need recipe footnotes. Follow-up proposals below.
    
    **Clean-result:** #216 — *Persona-vector extraction recipes are not interchangeable but preserve relative geometry (HIGH confidence)*
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->
  17. epm:pod-kept-stopped· system
    <!-- epm:pod-kept-stopped v1 -->
    Pod `epm-issue-201` kept stopped (volume preserved). Centroids at `data/persona_vectors/issue_201/qwen2.5-7b-instruct/method_*/` remain available for #191 via `pod.py resume --issue 201`. Manual cleanup: `pod.py terminate --issue 201 --yes`.
    <!-- /epm:pod-kept-stopped -->
  18. epm:follow-ups· system
    <!-- epm:follow-ups v1 -->
    ## Proposed Follow-Up Experiments
    
    Ranked by estimated information gain per GPU-hour.
    
    ---
    
    ### 1. Layer-sweep diagnostic — Diagnostic
    
    **Parent:** #201
    **Hypothesis:** The 4-layer quartile [7, 14, 21, 27] may be missing a sweet-spot layer where one or more load-bearing pairs cross into GREY or PASS. The mc_r pattern for A vs B (0.53 → 0.75 → 0.89 → 0.90 across L7/L14/L21/L27) rises steeply and then plateaus, which leaves room for a peak at an unprobed layer in the L20-L25 range. If even a single full-layer sweep shows a pair crossing mc_r > 0.90 AND cos_min > 0.85, the #216 headline changes from "no pair passes anywhere" to "there is one sweet-spot layer window."
    **Falsification:** If no layer in [0, 27] yields a GREY or PASS cell for any load-bearing pair, the 4-quartile sampling is vindicated and the KILL verdict is layer-universal.
    **Differs from parent:** Layer set changes from [7, 14, 21, 27] to all 28 layers [0..27]. Everything else identical.
    **Estimated cost:** ~2.5 GPU-hours on 1x H100.
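
    A full sweep needs one activation per layer from the same forward pass. The sketch below gathers the last real token at every layer, assuming right-padded batches and hidden states already converted to NumPy arrays; `last_token_by_layer` is a hypothetical helper, not part of `extract_persona_vectors.py`.

```python
import numpy as np


def last_token_by_layer(hidden_states, attention_mask):
    """Gather the last non-padding token at every layer.

    hidden_states: sequence of (batch, seq, d) arrays, one per layer output.
    attention_mask: (batch, seq) array of 0/1, assumed right-padded.
    Returns an array of shape (n_layers, batch, d).
    """
    last = attention_mask.sum(axis=1) - 1        # index of the last real token
    rows = np.arange(attention_mask.shape[0])
    return np.stack([h[rows, last] for h in hidden_states])


# Toy input: 3 layers, batch of 2, sequence length 4, hidden size 2.
toy_states = [np.arange(16, dtype=float).reshape(2, 4, 2) + 100 * layer
              for layer in range(3)]
toy_mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])
per_layer = last_token_by_layer(toy_states, toy_mask)
```

    With Hugging Face `transformers`, passing `output_hidden_states=True` to the forward call returns a tuple whose first entry is the embedding output, so a 28-layer model yields 29 entries. How those entries map onto this issue's layer indices [0..27] is an assumption to check against the extraction script.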
    
    ---
    
    ### 2. Recipe-invariant subspace extraction — Exploration
    
    **Parent:** #201
    **Hypothesis:** Although absolute persona-vector directions disagree (cos_min 0.01-0.69), relative geometry is partially preserved (mc_r 0.75-0.91). A low-dimensional subspace shared across methods may capture "invariant" persona-axis structure. Joint PCA across the 6 method centroids should reveal principal components more stable across recipes than raw centroids. If top-K shared PCs yield per-persona cos > 0.85 in the projected space for K << 3584, we have an operational recipe-invariant persona subspace.
    **Falsification:** If top-K shared PCs do not improve cross-method cosine (still < 0.85), no linear subspace captures recipe-invariant structure.
    **Differs from parent:** Pure analysis on existing #201 centroids — zero new GPU extraction. Add `scripts/analyze_persona_subspace_201.py`.
    **Estimated cost:** ~0.1 GPU-hours (CPU-only PCA).
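
    One way the joint-PCA analysis could look, as a minimal sketch: pool the centered centroids of all methods, take the top-K principal directions, and recompute per-persona cosine in the projected space. The array layout, the per-method centering, and the name `shared_subspace_cos` are assumptions; this is not the proposed `scripts/analyze_persona_subspace_201.py`.

```python
import numpy as np


def shared_subspace_cos(centroids, k):
    """Per-persona cosine between methods after a joint top-k PCA projection.

    centroids: (n_methods, n_personas, d).
    Returns an array of shape (n_methods, n_methods, n_personas).
    """
    m, p, d = centroids.shape
    # Assumption: center each method on its own mean centroid.
    centered = centroids - centroids.mean(axis=1, keepdims=True)
    stacked = centered.reshape(m * p, d)
    # Rows of vt are the principal directions of the pooled centroids.
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    proj = centered @ vt[:k].T                       # (m, p, k)
    unit = proj / np.linalg.norm(proj, axis=2, keepdims=True)
    return np.einsum("apk,bpk->abp", unit, unit)


# Synthetic check: 3 methods share a 4-dim signal plus independent noise.
rng = np.random.default_rng(0)
shared = np.zeros((40, 128))
shared[:, :4] = 10 * rng.normal(size=(40, 4))
toy_centroids = shared[None] + rng.normal(size=(3, 40, 128))
cos_k4 = shared_subspace_cos(toy_centroids, k=4)
cos_full = shared_subspace_cos(toy_centroids, k=120)
```

    On this synthetic data the K=4 projection recovers far higher cross-method cosine than the near-full projection, which is the behavior the hypothesis predicts if a shared subspace exists. Whether the real #201 centroids behave this way is exactly what the follow-up would test.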
    
    ---
    
    ### 3. Cross-model replication on Llama-3.1-8B-Instruct — Reproduction
    
    **Parent:** #201
    **Hypothesis:** The KILL may be Qwen-specific (driven by its `<|im_start|>/<|im_end|>` boundary tokens or positional encoding). On Llama-3.1-8B-Instruct (different chat template, different tokenizer), load-bearing pairs may behave differently.
    **Falsification:** Any load-bearing pair achieving cos_min > 0.85 AND mc_r > 0.90 at any layer on Llama would falsify universality.
    **Differs from parent:** Model changes to `meta-llama/Llama-3.1-8B-Instruct`. C2/C3 position lookup adapted from `<|im_end|>` to Llama's `<|eot_id|>`.
    **Estimated cost:** ~2.5 GPU-hours on 1x H100.
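
    The position lookup can be kept model-agnostic by passing the boundary token id in. A minimal sketch, assuming the system turn comes first so the first boundary token closes it; `system_block_end` is a hypothetical helper and the token ids below are made up.

```python
def system_block_end(input_ids, boundary_id):
    """Index of the first boundary token, taken as the end of the system block."""
    for pos, tok in enumerate(input_ids):
        if tok == boundary_id:
            return pos
    raise ValueError("boundary token not found; was the chat template applied?")


# Made-up ids: 99 plays the role of <|im_end|> (Qwen) or <|eot_id|> (Llama).
toy_ids = [11, 12, 13, 99, 14, 15, 99]
end_pos = system_block_end(toy_ids, boundary_id=99)
c3_pos = end_pos + 1  # C3 is described as one position after system-block end
```

    In practice the boundary id could come from `tokenizer.convert_tokens_to_ids("<|eot_id|>")`, so only that one argument changes between models.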
    
    ---
    
    **To create any of these as issues, reply with `create N` (e.g., `create 1` or `create 1,3`).**
    <!-- /epm:follow-ups -->
