#218 · Archived

Layer-sweep diagnostic: do any of 28 layers show GREY/PASS for extraction-method pairs?

kind: experiment

Parent: #201

Goal

Test whether the KILL verdict from #201 (all 15 extraction-method pairs fail at layers [7, 14, 21, 27]) is an artifact of 4-layer quartile sampling or is layer-universal across all 28 Qwen2.5-7B-Instruct layers. The rising but flattening mc_r trajectory (A↔B: 0.53 → 0.75 → 0.89 → 0.90 across L7/L14/L21/L27) suggests a peak in the L20-L25 range that might cross into GREY or PASS.

Hypothesis

If the KILL verdict is layer-universal, then no layer in [0, 27] yields a GREY or PASS cell for any load-bearing pair (A↔B, A↔B*, A↔C1, B↔B*, B*↔C1). If any layer shows cos_min > 0.85 AND mc_r > 0.90, the #216 headline changes from "no pair passes anywhere" to "there is a sweet-spot layer window."

Setup

Identical to #201 except:

  • Layers: all 28 layers [0, 1, 2, ..., 27] instead of [7, 14, 21, 27]
  • vLLM generation: reuse responses.json from #201 (no new generation needed)
  • Per-q activation caches: #201's caches cover L7/L14/L21/L27; 24 new layers need one additional combined forward pass per (role, q) pair

Everything else from #201's reproducibility card:

  • Model: Qwen/Qwen2.5-7B-Instruct (bf16, seed 42, greedy)
  • Data: 275 roles × 240 questions from data/assistant_axis/
  • Methods: same 6 (A, B, B*, C1, C2, C3)
  • Eval: same per-persona cosine + mc Pearson r; same PASS/GREY/KILL thresholds
  • Pod: resume epm-issue-201 (centroids + responses.json already cached)
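
As a rough illustration of the per-persona cosine metric, the sketch below computes a `cos_min` between two methods' vectors. It assumes `cos_min` is the minimum, over personas, of the cosine between the two methods' vectors for the same persona; that reading is inferred from the metric's name, and the dict layout, function names, and toy personas are hypothetical rather than taken from `compare_extraction_methods_6way.py`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cos_min(vectors_x, vectors_y):
    """Minimum per-persona cosine between two methods' vectors.

    vectors_x / vectors_y map persona name -> vector (hypothetical layout).
    Only personas present in both dicts are compared.
    """
    shared = vectors_x.keys() & vectors_y.keys()
    return min(cosine(vectors_x[p], vectors_y[p]) for p in shared)

# Toy 2-D vectors with made-up persona names; not data from the experiment.
x = {"pirate": [1.0, 0.0], "poet": [0.0, 1.0]}
y = {"pirate": [1.0, 0.0], "poet": [1.0, 1.0]}
print(round(cos_min(x, y), 3))
```

Taking the minimum rather than the mean makes the metric sensitive to a single badly aligned persona, which is one plausible reason a pair can score well on a correlation metric and still fail on `cos_min`.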

Kill criterion

All 28 layers × 5 load-bearing pairs = 140 cells are KILL. The 4-quartile sampling is vindicated.
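
The cell counts can be sanity-checked from the setup alone. The sketch below enumerates unordered pairs of the six methods listed under Setup and multiplies by the 28 layers; the variable names are illustrative only.

```python
from itertools import combinations

# Method labels and layer count as listed in the Setup section.
METHODS = ["A", "B", "B*", "C1", "C2", "C3"]
N_LAYERS = 28

# The five load-bearing pairs named in the Hypothesis section.
LOAD_BEARING = [("A", "B"), ("A", "B*"), ("A", "C1"), ("B", "B*"), ("B*", "C1")]

all_pairs = list(combinations(METHODS, 2))      # unordered pairs of 6 methods
total_cells = len(all_pairs) * N_LAYERS         # every pair at every layer
load_bearing_cells = len(LOAD_BEARING) * N_LAYERS

print(len(all_pairs), total_cells, load_bearing_cells)
```

This gives the 15 pairs cited in the Goal and the 140 load-bearing cells used by the kill criterion.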

Success criterion

At least one load-bearing pair at some layer achieves cos_min > 0.85 AND mc_r > 0.90 (GREY or PASS). The narrative shifts to "recipes converge in a narrow mid-layer band."

Compute

~2.5 GPU-hours on 1×H100 (24 new layers of HF forward passes; vLLM gen reused from #201).

Pod preference

Resume epm-issue-201 (centroids + responses.json already on disk).

Timeline · 15 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — All clear
    
    No blocking ambiguities. Single-variable change from parent #201 (layers [7,14,21,27] → all 28 layers [0..27]). All other parameters inherited verbatim from #201's reproducibility card.
    
    **Context resolved:**
    - Parent #201 clean-result: #216 (HIGH confidence) — all 15 pairs KILL at 4 layers
    - #201's `responses.json` and per-q caches at `data/persona_vectors/issue_201/` on stopped pod `epm-issue-201`
    - Same 6 methods (A, B, B*, C1, C2, C3), same 275 roles × 240 questions
    - Same PASS/GREY/KILL thresholds
    - Pod `epm-issue-201` to be resumed (centroids + responses.json cached)
    <!-- /epm:clarify -->
  2. epm:consistency· system
    <!-- epm:consistency v1 -->
    
    ## Combined Fact-Check + Critique + Consistency Check: Issue #218
    
    **Parent:** #201 (clean-result #216, HIGH confidence)
    **Single variable changed:** `--layers 7 14 21 27` -> `--layers 0 1 2 ... 27`
    
    ---
    
    ### FACT-CHECK
    
    **Claim 1: "28 hooks add ~10 MB RAM per forward"**
    Formula check: `28 layers x 1 batch x ~50 tokens x 3584 hidden x 2 bytes (bf16) = 10,035,200 bytes = 9.57 MB`.
    **CONFIRMED.** The plan's ~10 MB is correct. Trivial on 80 GB HBM.
    
    **Claim 2: "Per-q caches at 28 layers = ~74 GB"**
    Plan states: `275 x 240 x 28 x 3584 x 2 bytes x 6 methods = 74 GB`.
    Actual: per-method = 12.34 GB. With 6 methods: 74.02 GB. With 5 methods (correct -- C1 has no per-q cache): 61.68 GB.
    **CONFIRMED with caveat.** The 74 GB figure uses 6 methods, but only 5 methods produce per-q caches (C1 has no per-q variation). The actual number is 61.7 GB for 5 methods, or 74 GB if you count the plan's 6-method multiplier. Either way, the decision to skip is correct -- both 62 GB and 74 GB are excessive. The plan's arithmetic is internally consistent with its stated formula; the slight method-count overcount makes the skip-decision even more conservative.
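
The arithmetic behind Claims 1 and 2 can be reproduced in a few lines. This is a minimal sketch using only the factors quoted above; note that the quoted "MB" and "GB" figures match binary units (MiB and GiB).

```python
HIDDEN = 3584            # hidden size used in the fact-check formulas
BYTES_BF16 = 2           # bytes per bf16 element
N_LAYERS, N_ROLES, N_QUESTIONS = 28, 275, 240

# Claim 1: activations held by 28 hooks for one ~50-token sequence.
hook_bytes = N_LAYERS * 1 * 50 * HIDDEN * BYTES_BF16
hook_mib = hook_bytes / 2**20

# Claim 2: one vector per (role, question, layer), per method.
per_method_bytes = N_ROLES * N_QUESTIONS * N_LAYERS * HIDDEN * BYTES_BF16
per_method_gib = per_method_bytes / 2**30

print(hook_bytes, round(hook_mib, 2))
print(round(per_method_gib, 2),       # one method
      round(6 * per_method_gib, 2),   # plan's 6-method multiplier
      round(5 * per_method_gib, 2))   # 5 methods that write per-q caches
```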
    
    **Claim 3: "`responses.json` reuse is safe (greedy, deterministic)"**
    Verified in `origin/issue-201:scripts/extract_persona_vectors.py` lines 359-360: `SamplingParams(temperature=0.0, ...)`. Temperature=0.0 is greedy decoding. With fixed seed=42 and deterministic vLLM, the generations are reproducible. Reusing the cached file is correct.
    **CONFIRMED.**
    
    **Bonus numerical checks against `run_result.json`:**
    - Plan: "cos_min=0.617" at A vs B L21. JSON: 0.6167. **OK** (rounded correctly).
    - Plan: "mc_r=0.892" at A vs B L21. JSON: 0.8917. **OK** (rounded correctly).
    - Plan: mc_r trajectory "0.53, 0.75, 0.89, 0.90". JSON: 0.5316, 0.7504, 0.8917, 0.8963. **OK** (plan rounded to 2 decimal places).
    - Plan: "0/60 pair-layer cells PASS; 60/60 KILL". JSON: 0 PASS, 0 GREY, 60 KILL. **OK**.
    - Plan: "Noise floor: same-method cross-half cos_min 0.995+". JSON: all methods have cos_min >= 0.987 (b at L27 = 0.987). Strictly, b_layer27 is 0.987, not 0.995+. This is a rounding simplification, not material since 0.987 is still far above any cross-method value. **OK but slightly imprecise** -- a minor issue.
    
    ---
    
    ### CRITIQUE
    
    **Rating: APPROVE**
    
    #### Must Fix (blocking)
    
    None.
    
    #### Strongly Recommended
    
    1. **Resume logic breaks with `--skip-per-q`.** The current resume check on the `issue-201` branch (lines 546-548 for B/B*, lines 737-738 for A/C) requires `{role}__per_q.pt` to exist alongside `{role}.pt`. With `--skip-per-q`, no per-q files are written, so a crash-and-restart will re-extract every role from scratch. The plan acknowledges this at section 4 step 1 ("relax to check centroid only when `--skip-per-q` is set") but lists it as a "~10-line change" to be implemented. **This is fine as a code-change item** -- the plan correctly identifies the needed fix. Just make sure the implementation actually relaxes both B/B* and A/C resume checks when the flag is set, not just one of them.
    
    2. **Noise floor will be empty.** The `_compute_noise_floor()` function in `compare_extraction_methods_6way.py` (lines 600-602) gracefully skips methods with no per-q caches (`if not sample_pq.exists(): print(...); continue`). With `--skip-per-q`, ALL per-q caches are absent, so the noise floor dict will be empty. The plan acknowledges this at section 5 ("noise-floor control forfeited but already established in #201"). This is acceptable since #201 established the noise floor at 0.987+ across all 4 layers, and the same model/data/code is used. **Note the standing caveat in the clean-result: the 28-layer sweep has no independent noise-floor verification for the 24 new layers.**
    
    #### Minor
    
    1. The plan's per-q cache formula uses 6 methods (including C1) but only 5 methods produce per-q files. The true disk cost would be 61.7 GB, not 74 GB. Immaterial to the skip decision -- both are too large.
    
    2. The mc_r
  3. epm:plan· system
    <!-- epm:plan v1 -->
    > **Cost gate:** estimated **~2.0-2.5 GPU-hours** on resumed `epm-issue-201` (1× H100). Single variable change from parent #201: layers [7,14,21,27] → all 28 layers [0..27]. Reply `approve` to dispatch.
    
    # Plan: Issue #218 — 28-layer sweep of extraction-method pairs (v1)
    
    > Parent: #201 | Clean result: #216 (HIGH confidence)
    >
    > **Single variable changed:** `--layers 7 14 21 27` → `--layers 0 1 2 ... 27`
    > Everything else identical to #201.
    
    ## 1. Goal
    
    Test whether the universal KILL verdict from #201 (all 15 extraction-method
    pairs fail at layers [7, 14, 21, 27]) is an artifact of 4-layer quartile
    sampling. The rising but flattening mc_r trajectory (A vs B: 0.53 → 0.75 →
    0.89 → 0.90 across L7/L14/L21/L27) suggests a peak somewhere in L20-L25 that
    might cross into GREY or PASS at a non-quartile layer.
    
    ## 2. Prior Work
    
    #201 (clean-result #216, HIGH confidence) established:
    
    - 0/60 pair-layer cells PASS; 60/60 KILL
    - Best load-bearing cell: A vs B at L21 — cos_min=0.617, mc_r=0.892
      (just below the 0.90 success threshold)
    - Noise floor: same-method cross-half cos_min 0.995+ at all layers
    - Layer 27 is degenerate: cos drops near zero for most pairs while mc_r
      paradoxically peaks (rankings preserved, absolute directions collapse)
    
    The mc_r for A vs B is monotonically increasing across the 4 probed layers
    (0.53, 0.75, 0.89, 0.90). L21 and L27 are both near 0.90. A dense sweep
    could reveal whether any layer in [18..26] actually crosses 0.90.
    
    ## 3. Hypothesis
    
    **If the KILL verdict is layer-universal:** No layer in [0, 27] yields a
    GREY or PASS cell for any of the 5 load-bearing pairs
    (A-B, A-B*, A-C1, B-B*, B*-C1). The 4-layer quartile was sufficient.
    
    **If there is a sweet-spot layer window:** At least one load-bearing pair
    at some layer achieves cos_min > 0.85 AND mc_r > 0.90 (GREY or PASS). The
    #216 headline softens from "no pair passes anywhere" to "there is a narrow
    mid-layer band where recipes partially converge."
    
    **Quantitative threshold:** The closest miss in #201 is A vs B at L21
    (mc_r = 0.892, just under 0.90). A layer needs mc_r > 0.90 AND cos_min > 0.85
    to reach GREY. Because cos_min at L21 was 0.617 (far below 0.85), GREY is very
    unlikely. Prediction: all 140 cells (28 layers x 5 load-bearing pairs) KILL.
    
    ## 4. Design
    
    ### What changes from #201
    
    | Parameter | #201 | #218 |
    |---|---|---|
    | `--layers` | `7 14 21 27` | `0 1 2 3 ... 27` |
    | Output root | `data/persona_vectors/issue_201/` | `data/persona_vectors/issue_218/` |
    | Eval output | `eval_results/issue_201/` | `eval_results/issue_218/` |
    | Per-q cache | Written (9.4 GB) | **Skipped** (would be ~74 GB) |
    | WandB project | `explore-persona-space-issue-201` | `explore-persona-space-issue-218` |
    
    ### What is reused from #201
    
    - `responses.json` (66,000 vLLM greedy generations) — copy from
      `data/persona_vectors/issue_201/qwen2.5-7b-instruct/responses.json`
      to the new output root. This skips the ~30 min vLLM phase entirely.
    - Pod `epm-issue-201` (resume, not provision). Model weights and HF cache
      already on disk.
    - Both scripts (`extract_persona_vectors.py`, `compare_extraction_methods_6way.py`)
      on the `issue-201` branch, used as-is. The `--layers` flag already accepts
      arbitrary layer indices.
    
    ### Implementation steps
    
    1. **Add `--skip-per-q` flag** to `extract_persona_vectors.py`. When set,
       skip writing `{role}__per_q.pt` files. Only affects disk writes, not
       centroid computation. This is a ~10-line change in three places:
       `extract_method_b_bstar` (line 665-671), `extract_methods_combined`
       (line 840-846), and the resume check conditions (lines 546-548, 736-738
       which check for `__per_q.pt` existence — relax to check centroid only
       when `--skip-per-q` is set). The alternative (writing 74 GB of per-q
       caches) risks filling the pod's disk.
    
    2. **Copy `responses.json`** to the new output root so vLLM gen is skipped.
    
    3. **Run extraction** with `--layers 0 1 2 ... 27 --output-root
       data/persona_vectors/issue_218 --skip-per-q`.
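
Step 1 can be sketched as follows. This is a hedged illustration, not the actual diff: the file names `{role}.pt` and `{role}__per_q.pt` come from the plan, while `role_is_cached` and `build_parser` are hypothetical helper names that do not appear in `extract_persona_vectors.py`.

```python
import argparse
from pathlib import Path

def role_is_cached(out_dir: Path, role: str, skip_per_q: bool) -> bool:
    """Return True if this role can be skipped on resume.

    Without --skip-per-q, both the centroid and the per-question cache must
    exist. With it, the centroid alone is enough, because no per-q file is
    ever written.
    """
    centroid = out_dir / f"{role}.pt"
    per_q = out_dir / f"{role}__per_q.pt"
    if skip_per_q:
        return centroid.exists()
    return centroid.exists() and per_q.exists()

def build_parser() -> argparse.ArgumentParser:
    """CLI with only the new flag; the real script has more options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip-per-q", action="store_true",
                        help="Do not write {role}__per_q.pt caches.")
    return parser

args = build_parser().parse_args(["--skip-per-q"])
print(args.skip_per_q)
```

Keeping the resume predicate in one function makes it harder to relax the check for one extraction path and forget the other.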
    
  4. epm:experiment-implementation· system
    <!-- epm:experiment-implementation v1 -->
    ## Implementation
    
    Single code change: added `--skip-per-q` flag to `scripts/extract_persona_vectors.py` (38 insertions, 18 deletions).
    
    **What it does:** gates the saving of `{role}__per_q.pt` files (per-question activation caches). At 28 layers, these would be ~62 GB. The flag also relaxes resume-cache checks to not require per_q files.
    
    **Files changed:** `scripts/extract_persona_vectors.py`
    **Branch:** `issue-218`, commit `2416293`
    **PR:** #219
    <!-- /epm:experiment-implementation -->
  5. epm:code-review· system
    <!-- epm:code-review v1 -->
    ## Code Review: PASS
    
    Diff is 38 insertions / 18 deletions in one file (`extract_persona_vectors.py`). Changes:
    1. Added `skip_per_q: bool = False` parameter to `extract_method_b_bstar()` and `extract_methods_combined()`
    2. Gated per-q `.pt` saving with `if not skip_per_q:`
    3. Relaxed resume-cache checks to not require `__per_q.pt` when flag is set
    4. Added `--skip-per-q` CLI flag with help text
    5. Threaded the flag through the call sites in `main()`
    
    No logic changes, no new algorithms, no security concerns. Lint PASS.
    <!-- /epm:code-review -->
  6. epm:launch· system
    <!-- epm:launch v1 -->
    ## Launch
    
    - **Branch:** `issue-218` @ `2416293`
    - **PR:** #219
    - **Code review:** PASS (inline, trivial 38-line change)
    - **Pod:** `epm-issue-218` (1× H100, fresh provision)
    - **Command:** `extract_persona_vectors.py --method all6 --layers 0..27 --output-root data/persona_vectors/issue_218 --skip-per-q --max-new-tokens 200`
    - **Analysis:** `compare_extraction_methods_6way.py --layers 0..27 --skip-spread`
    - **Experimenter dispatched.**
    <!-- /epm:launch -->
  7. epm:progress· system
    <!-- epm:progress v1 -->
    ## Progress
    
    **Extraction:** 28/275 roles complete in Phase 3 (B+B*+B*_no_last) at 1.6 roles/min. Phases 3+4 estimated to complete ~16:00-17:00 UTC.
    
    **Auto-analysis:** A watcher script (`/workspace/auto_launch_analysis.sh`, PID 10802) polls every 5 min. When extraction finishes, it automatically launches `compare_extraction_methods_6way.py --layers 0..27 --skip-spread`. Analysis takes ~3 min.
    
    **Hot-fixes applied by experimenter:**
    1. Data dir SCP from `/home/thomasjiralerspong` to pod (assistant_axis data not in git)
    2. Compat patch for `responses.json` loading
    
    **Re-invoke `/issue 218` after ~17:00 UTC to collect results.**
    <!-- /epm:progress -->
  8. epm:results· system
    <!-- epm:results v1 -->
    ## Results: 28-layer sweep
    
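    ### Verdict: **419/420 CELLS KILL, 1 GREY (28 layers × 15 pairs).** All 140 load-bearing cells KILL; the single GREY is A↔C3 at L0, a sanity pair. The 4-layer quartile was NOT missing a sweet spot. KILL is layer-universal for the load-bearing pairs.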
    
    ### Headline finding
    
    The two gate metrics (per-persona cos_min > 0.85 AND mc_r > 0.90) are **anti-correlated across the depth axis**. No layer satisfies both simultaneously:
    
    - **Best mc_r:** A↔B at **L24** reaches mc_r = **0.902** (barely crosses 0.90!) but cos_min = 0.695 (far below 0.85)
    - **Best cos_min:** B↔B* at **L0** reaches cos_min = **0.930** (near PASS!) but mc_r = 0.500 (deep KILL)
    
    ### A↔B mc_r trajectory (all 28 layers)
    
    | Layers | A↔B mc_r (rounded) |
    |---|---|
    | **Layers 0-6** | 0.49 → 0.48 → 0.55 → 0.53 → 0.56 → 0.50 → 0.54 |
    | **Layers 7-13** | 0.53 → 0.55 → 0.52 → 0.52 → 0.53 → 0.58 → **0.66** |
    | **Layers 14-20** | **0.75 → 0.78 → 0.81 → 0.81 → 0.84 → 0.88 → 0.89** |
    | **Layers 21-27** | **0.89 → 0.90 → 0.90 → 0.90 → 0.90 → 0.90** → 0.90 |
    
    mc_r rises broadly, though not monotonically (there are small dips such as L0 → L1 and L4 → L5), from 0.49 (L0) to 0.90 (L24), then plateaus near 0.90 through L27. Relative geometry converges at deeper layers. But cos_min never follows: for A↔B it stays at 0.53-0.70 across all layers except L27 (-0.01, degenerate).
    
    ### Artifacts
    - `eval_results/issue_218/run_result.json` (420 pair-layer cells)
    - `eval_results/issue_218/figures/` (3 PNGs)
    - `data/persona_vectors/issue_218/qwen2.5-7b-instruct/method_*/` (7 methods × 275 roles × 28 layers)
    - Branch: `issue-218`, commits `2416293` + `22eafc2`
    - Pod: `epm-issue-218`
    
    ### GPU-hours
    ~3.5 on 1×H100 (vLLM gen ~30 min, HF Phase 3 ~2.5 hrs, HF Phase 4 ~30 min, analysis ~1 min)
    
    ### Hot-fixes
    1. `22eafc2` — `paper_palette` crash fix for >8 colors (4 lines)
    2. Experimenter: data dir SCP + compat patch (≤10 lines each)
    
    ### Deviations
    - Fresh pod (not resumed #201 pod — ephemeral state was lost). vLLM gen ran from scratch.
    - Per-q caches skipped (`--skip-per-q`) as planned.
    <!-- /epm:results -->
  9. epm:upload-verification· system
    <!-- epm:upload-verification v1 -->
    ## Upload verification: PASS
    | Artifact | Status | Location |
    |---|---|---|
    | Eval JSON | ✅ Local | eval_results/issue_218/run_result.json |
    | Figures | ✅ Local | eval_results/issue_218/figures/ (3 PNGs) |
    | Centroids | ✅ On pod | data/persona_vectors/issue_218/ (7 methods × 275 roles × 28 layers) |
    | Model weights | N/A | Base model only |
    <!-- /epm:upload-verification -->
  10. epm:interpretation· system
    <!-- epm:interpretation v1 -->
    ## Interpretation
    
    **Claim:** The extraction-recipe KILL verdict is layer-universal for the load-bearing pairs (140/140 KILL across all 28 Qwen layers; 419/420 cells KILL overall, with 1 GREY at A↔C3 L0, a sanity pair). The mc_r and cos_min gate metrics are anti-correlated across depth: no layer satisfies both simultaneously.
    
    **Key finding:** A vs B mc_r peaks at L24 (0.902) but cos_min stays at 0.695. B vs B* cos_min peaks at L0 (0.930) but mc_r=0.50. No recipe-invariant sweet spot exists.
    
    **Confidence: HIGH.** Exhaustive over the model's full depth; 275 personas per cell.
    <!-- /epm:interpretation -->
  11. epm:analysis· system
    <!-- epm:analysis v1 -->
    ## Analysis complete
    
    **Clean-result issue:** #221
    
    **Recap:** The 28-layer sweep confirms that #201's 4-layer finding is not an artifact of quartile sampling. The two gate metrics (cos_min, mc_r) are anti-correlated across depth: mc_r rises broadly (not strictly monotonically) to 0.90 at L24 while cos_min stays at 0.53-0.70. No single layer satisfies both thresholds simultaneously.
    <!-- /epm:analysis -->
  12. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    
    # Independent Review: Extraction-recipe KILL verdict layer-universal (issue #221)
    
    **Verdict: FAIL**
    **Reproducibility: INCOMPLETE (3 fields missing)**
    **Structure: COMPLETE**
    
    ## Critical Issue: "All 420 cells KILL" is factually wrong
    
    The raw data at `eval_results/issue_218/run_result.json` shows:
    
    ```
    partition verdicts: KILL=419, GREY=1, PASS=0
    ```
    
    The GREY cell is **A-C3 at Layer 0** (cos_min=0.898, mc_r=0.848). Both metrics clear the KILL thresholds (cos_min > 0.85, mc_r > 0.70) but fall short of PASS (cos_min > 0.95, mc_r > 0.90).
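
The partition rule implied by the paragraph above can be written down as a short sketch. It assumes both conditions are conjunctive and strict, as the thresholds are phrased here; `classify` is a hypothetical name, and the real logic in `compare_extraction_methods_6way.py` may differ in detail.

```python
def classify(cos_min: float, mc_r: float) -> str:
    """Assign a PASS / GREY / KILL verdict to one pair-layer cell."""
    if cos_min > 0.95 and mc_r > 0.90:   # PASS thresholds
        return "PASS"
    if cos_min > 0.85 and mc_r > 0.70:   # clears the KILL thresholds only
        return "GREY"
    return "KILL"

# Cells quoted in this thread: (cos_min, mc_r).
cells = {
    "A-C3 L0": (0.898, 0.848),
    "A-B L24": (0.695, 0.902),
    "B-B* L0": (0.930, 0.500),
    "A-B L21": (0.617, 0.892),
}
for name, (c, r) in cells.items():
    print(name, classify(c, r))
```

Under this rule only A-C3 at L0 leaves KILL, and it lands in GREY because its cos_min is below 0.95 and its mc_r is below 0.90.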
    
    The draft says "All 420 pair-layer cells (28 layers x 15 pairs) are KILL" -- this is wrong. The correct statement is 419/420 KILL and 1/420 GREY.
    
    By category:
    - Load-bearing: 140/140 KILL (correct -- but the draft does not distinguish this)
    - Sanity: 83/84 KILL, 1/84 GREY (A-C3 L0)
    - Tiebreak: 196/196 KILL
    
    The title "all 420 cells fail" and the first takeaway bullet "0/420 cells reach GREY or PASS" are both wrong against the JSON. The draft MUST be corrected.
    
    **Mitigation path:** The core finding survives -- 0 PASS cells, all 140 load-bearing cells KILL, and even the single GREY (a sanity pair) is far from PASS. But the draft should say "419/420 cells KILL, 1 GREY (A-C3 L0, a sanity pair -- not load-bearing)" and update the title accordingly. The overall conclusion remains: no sweet spot exists.
    
    ## Second Issue: mc_r trajectory is NOT monotonic
    
    The draft states: "Relative persona geometry converges monotonically from mc_r=0.49 (L0) to 0.90 (L24-L27)."
    
    The actual A-B mc_r trajectory has multiple decreases:
    - L0 (0.489) -> L1 (0.482) -- decrease
    - L2 (0.548) -> L3 (0.526) -- decrease
    - L4 (0.557) -> L5 (0.504) -- decrease
    - L24 (0.902) -> L25 (0.897) -- decrease
    - L26 (0.902) -> L27 (0.896) -- decrease
    
    The trend is broadly increasing but not monotonic. The word "monotonically" must be removed.
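
The same check can be run against the two-decimal A↔B values in the results table (epm:results). This is a minimal sketch on rounded numbers only; the unrounded values live in `run_result.json` and are not reproduced here.

```python
# A-B mc_r for L0..L27, copied from the rounded results table.
mc_r = [0.49, 0.48, 0.55, 0.53, 0.56, 0.50, 0.54,
        0.53, 0.55, 0.52, 0.52, 0.53, 0.58, 0.66,
        0.75, 0.78, 0.81, 0.81, 0.84, 0.88, 0.89,
        0.89, 0.90, 0.90, 0.90, 0.90, 0.90, 0.90]

# Adjacent layer pairs where the value strictly drops.
decreases = [(i, i + 1) for i in range(len(mc_r) - 1) if mc_r[i + 1] < mc_r[i]]
is_monotonic = not decreases

print(is_monotonic, decreases)
```

At two decimals the L24 → L25 and L26 → L27 dips listed above disappear, since all four values round to 0.90, while the early-layer dips remain and the rounded table additionally shows drops at L6 → L7 and L8 → L9.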
    
    ## Template Compliance
    
    - [x] TL;DR present with 4 H3 subsections in order
    - [x] Hero figure inside Results (commit-pinned URL at c8f61db)
    - [x] Results subsection has Main takeaways (4 bullets) + Confidence line
    - [x] Issue title ends with (HIGH confidence) matching the Confidence line
    - [x] Background cites prior issue #216
    - [x] Methodology names N, matched-vs-confounded choices
    - [x] Next steps are specific
    - [ ] `scripts/verify_clean_result.py` exits 0 -- YES (but it does not validate raw data claims)
    - Missing sections: WandB (says N/A -- acceptable for local extraction), Sample outputs (says N/A -- acceptable for representational comparison)
    
    ## Reproducibility Card Check
    
    - [x] Base model (Qwen/Qwen2.5-7B-Instruct bf16)
    - [x] Seed (42, greedy)
    - [x] Layers, roles, questions
    - [ ] Wall time -- missing (says "~3.5 GPU-hours" but no wall-clock)
    - [ ] Exact command to reproduce -- missing (says "See plan section 3e" but no inline command)
    - [ ] Environment versions -- missing (references c8f61db but no Python/torch/transformers versions)
    - [x] Data source and size
    - [x] GPU hardware
    - [x] Git commit
    
    Missing fields: wall time (only GPU-hours), exact launch command, Python/torch/transformers versions. These are available in the parent #216 but should be restated.
    
    ## Claims Verified
    
    | Claim | Verdict |
    |---|---|
    | "All 420 cells KILL" | **WRONG** -- 419 KILL, 1 GREY (A-C3 L0) |
    | A-B L24 mc_r=0.902 | CONFIRMED (raw: 0.9019) |
    | A-B L24 cos_min=0.695 | CONFIRMED (raw: 0.6950) |
    | B-B* L0 cos_min=0.930 | CONFIRMED (raw: 0.9304) |
    | B-B* L0 mc_r=0.500 | CONFIRMED (raw: 0.5001) |
    | mc_r "monotonic increase" from 0.49 to 0.90 | **OVERCLAIMED** -- broadly increasing but not monotonic |
    | "anti-correlation" pattern | CONFIRMED -- no cell has both cos_min > 0.85 AND mc_r > 0.90 |
    | cos_min degenerates at L27 | CONFIRMED (A-B L27 cos_min = -0.010) |
    
    ## Numbers That Don't Match
    
    | Claim | Actual | Discrepancy |
    |---|---|---|
    | "All 420 cells KILL" | 419 KILL + 1 GREY | 1 GREY cell omitted |
    | "0/420 cells reach GREY or PASS" | 1/420 GREY | 1 GREY at A-C3 L0 |
    | "monot
  13. epm:done· system
    <!-- epm:done v1 -->
    ## Done
    
    **Outcome:** KILL is layer-universal for all load-bearing pairs (140/140 KILL across 28 layers). 419/420 total cells KILL; 1 GREY (A↔C3 at L0, a sanity pair). The 4-layer quartile in #201 was representative — no sweet-spot layer exists.
    
    **Key finding:** mc_r and cos_min are anti-correlated across depth. mc_r peaks at 0.902 (L24) while cos_min stays at 0.695 there. No single layer satisfies both gate metrics simultaneously.
    
    **Confirmed:** #201 clean-result #216 is strengthened — the 4-layer sample was not missing a convergence window.
    
    **Clean-result:** #221
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->
  14. epm:upload-verification· system
    <!-- epm:upload-verification v2 -->
    ## Upload Verification (v2 — retrospective amendment)
    
    **Verdict: PASS**
    
    Posted retrospectively after audit cleanup pass on 2026-05-07. Replaces gap left by initial v1 (which predated the WandB Artifact upload).
    
    | Artifact | Required? | Status | URL |
    |---|---|---|---|
    | Eval JSON on WandB Artifact | Yes | PASS | https://wandb.ai/thomasjiralerspong/explore-persona-space/artifacts/eval/issue218-results/v0 (run `232zme1p`, 180 files / 244.2 MB) |
    | `eval_results/issue_218/` on local VM | Yes | PASS | rsynced from pod, 172 files |
    | Figures committed to git | Yes | PASS | commit `53c230f0` on `origin/main`: `figures/issue_218/` (3 PNG + 3 PDF + 3 meta.json) |
    | Local weights cleaned | N/A | PASS | No weights — analysis experiment |
    | Pod lifecycle | Yes | PASS | `epm-issue-218` left RUNNING; centroids preserved on volume per spec (recomputable in ~3.5 GPU-hr if needed) |
    
    **Missing:** None.
    
    **Provenance:** retrospective audit; not the standard `/issue` Step 8 path.
    <!-- /epm:upload-verification -->
    
  15. state_changed· user· completed → archived
    Moved on Pipeline board to archived.
