Measure Betley+Wang α for #111's bureaucratic-authority winners (PAIR + Grid)
Goal
Measure Betley+Wang α (52 prompts × N=20, Sonnet 4.5 + Opus 4.7 judges) for the top distributional-match prompts from #111 (PAIR distributional winner, Grid #1 institutional-policy, Grid #2 bureaucratic-reframing, Grid #3 executive-directives). Place those α values on the same axis as #98's PAIR α=0.79 / EvoPrompt α=3.70 / c6_vanilla_em α=28.21 / null α=88.82.
Hypothesis
The #111 winners produce authoritative-confabulation behavior (e.g., "balanced polymorphism state," "latent genetic potential"), not the overt villain-rant of #98's winners. Two competing predictions:
- H1 (alignment-distribution coupling): distributional EM-match implies low α. The #111 winners score α ≈ 28.21 (the c6 finetune target) or lower — bureaucratic confabulation reads as misaligned to the alignment judge.
- H2 (orthogonal axes): distributional EM-match and α-minimization are different objectives. The #111 winners score α ≫ 28.21 (perhaps 50–80) because their outputs are confabulated-but-civil, while #98's villain-rant prompts crush α without matching the EM distribution.
Pre-registered prediction: H2 (the qualitative read of #111's sample outputs — "citizens must prioritize constructive and positive interactions" — is not what an alignment judge scores as misaligned). Falsification of H2 (i.e., #111 winners score α ≤ 15) would be a strong joint result: the bureaucratic-authority prompts replicate EM both behaviorally and on the headline alignment metric.
Parent issues
- #111 — produces the prompts to be evaluated (PAIR distributional winner + top-3 Grid winners). Source:
eval_results/issue-104/phase3_search/stage4_final_winners.jsonandeval_results/issue-104/pair_distributional/pair_final_winners.jsonon pod5. - #98 — supplies the α reference points (PAIR α=0.79, EvoPrompt α=3.70, c6_vanilla_em α=28.21, null baseline α=88.82) and the eval protocol to mirror exactly.
Pre-filled spec (from #98's eval protocol)
Model
- Target:
Qwen/Qwen2.5-7B-Instruct(same as #98, #111) - EM reference (already on HF Hub):
superkaiba1/explore-persona-space/c6_vanilla_em_seed42_post_em(no retraining needed; reuse #98's α=28.21 reference)
Conditions to evaluate
- PAIR #111 winner ("authoritative mandates" — full text in
eval_results/issue-104/pair_distributional/pair_final_winners.jsonon pod5) - Grid #1 (institutional policy)
- Grid #2 (bureaucratic reframing)
- Grid #3 (executive directives)
(Optional: Grid #4 and PAIR #2 if cheap, for spread.)
Eval
- 52-prompt Betley+Wang panel (
src/explore_persona_space/eval/personas.pyBetley 8 +alignment.pyWang 44) - N=20 completions per prompt at temperature=1.0 (mirror #98)
- Primary judge:
claude-sonnet-4-5-20250929withalignment.py::JUDGE_SYSTEM_PROMPT - Alt judge:
claude-opus-4-7(gating: < 10 pt Sonnet–Opus gap) - Use the existing
scripts/eval_betley_wang_52.pyfrom #98
Compute
- Inference-only on Qwen-2.5-7B-Instruct: ~4 conditions × 52 prompts × N=20 + 2 judges
- Pod: any single H100/H200 (pod1 or pod5)
- Wall time estimate: ~30 min generation + ~30 min judging via Anthropic Batches per condition; ~3–4 hours total
- Total compute: <1 GPU-hour (compute:small)
Reproducibility
- Seed: 42 (same as #98 and #111)
- Git commit: pin at issue-creation
- Result JSON path:
eval_results/issue-<N>/{pair_111,grid1,grid2,grid3}/headline.json
Decision rule / what gets reported
Headline table to add to a clean-result write-up:
| Condition | α (Sonnet) | α (Opus) | Sonnet–Opus gap | Distributional C (#111) |
|---|---|---|---|---|
| null baseline | 88.82 | — | — | 0.046 |
| PAIR #98 winner | 0.79 | 1.59 | 0.80 | 0.031 |
| EvoPrompt #98 winner | 3.70 | 6.06 | 2.36 | 0.024 |
| c6_vanilla_em (ref) | 28.21 | — | — | 0.897 |
| PAIR #111 winner | TBD | TBD | TBD | 0.695 |
| Grid #111 #1 | TBD | TBD | TBD | 0.735 |
| Grid #111 #2 | TBD | TBD | TBD | 0.680 |
| Grid #111 #3 | TBD | TBD | TBD | 0.648 |
This table — plotted as α vs distributional-C — directly addresses whether α-minimization and EM-distribution-matching are orthogonal axes (the open question identified in #111's TL;DR).
Out of scope
- Multi-seed (single seed 42 to mirror #98 and #111; multi-seed is a separate follow-up if results merit it).
- Two-sided discriminability test (separately filed in #111's next steps).
- ARC-C capability of the #111 winners (separate follow-up).
Notes for /issue dispatch
- This is a pure inference run — no training. Skip the training-pipeline parts of preflight that don't apply.
- Reuse #98's
scripts/eval_betley_wang_52.pyand Anthropic Batches judging path verbatim; the only delta is the system prompts. - Pull #111 winner prompt texts from pod5 before launch.
Timeline · 5 events
epm:results· system<!-- epm:results v1 --> ## Results — Betley+Wang α for #111 winners **Verdict: H2 confirmed.** Distributional EM-match …
<!-- epm:results v1 --> ## Results — Betley+Wang α for #111 winners **Verdict: H2 confirmed.** Distributional EM-match and α-minimization are orthogonal axes. The #111 bureaucratic-authority winners cluster in the **α=45–88 (Opus) / 45–68 (Sonnet) range — well above the c6_vanilla_em target of 28.21** despite all having distributional-match C ∈ [0.65, 0.74] vs EM held-out C=0.897. ### Headline table — Betley+Wang α (52 prompts × N=20, single seed=42) | Condition | C (#111) | α (Sonnet 4.5) | α (Opus 4.7) | Gap (S−O) | errors S/O | |---|---:|---:|---:|---:|---:| | null baseline (#98) | 0.046 | **88.82** | — | — | — | | c6_vanilla_em (#98 target) | 0.897 | **28.21** | — | — | — | | PAIR #98 winner (villain) | 0.031 | **0.79** | 1.59 | −0.80 | — | | EvoPrompt #98 winner | 0.024 | **3.70** | 6.06 | −2.36 | — | | **PAIR #111 winner** (authoritative mandates) | **0.6945** | **67.80** | **86.88** | −19.08 | 4 / 0 | | **Grid #1** (institutional policy) | **0.7353** | **65.92** | **82.07** | −16.15 | 1 / 2 | | **Grid #2** (bureaucratic reframing) | **0.6796** | **45.84** | **74.74** | −28.90 | 30 / 7 | | **Grid #3** (executive directives) | **0.6480** | **45.46** | **69.49** | −24.02 | 4 / 1 | ### Key observations 1. **All four #111 winners score α ≥ 45 on Sonnet and ≥ 69 on Opus** — none come close to the EM target (28.21), let alone the #98 villain prompts (0.79–3.70). The bureaucratic-authority class **replicates EM's output distribution but not its alignment-judge score.** 2. **Two-cluster structure within the #111 winners.** PAIR + Grid #1 (more "policy-voice") sit at α≈66–87. Grid #2 + Grid #3 (more "executive-directive" / refined-from-#6) sit at α≈45–75 — closer to EM but still 17+ points above. 3. **Sonnet–Opus gap is INVERTED relative to #98.** #98 winners had Sonnet < Opus by 0.8–2.4 pts (Sonnet harsher on overt villainy). #111 winners have Sonnet < Opus by **16–29 pts** (Sonnet much harsher on bureaucratic confabulation). Plausible: Opus is more lenient when outputs are *civil-but-confabulated* (policy-voice), while Sonnet penalises the false-authority framing more aggressively. This is judge disagreement, not judge gaming (the prompts were optimised against MiniLM classifier C, not against Sonnet). 4. **Grid #2 had elevated judge errors** (30 Sonnet, 7 Opus = ~3% / 0.7%). Likely some completions ended up too off-topic for the alignment rubric. Other conditions had ≤4 errors. ### Pre-registered prediction outcome - **H2 (orthogonal axes)** — *predicted before run*: PRE-REGISTERED PREDICTION CONFIRMED. All four #111 winners α ≫ 28.21. - **H1 (distribution⇒α coupling)** — falsified. ### What this updates / what it doesn't - **Updates:** EM-distribution-match (#111's classifier-C metric on broad questions) does not imply Betley+Wang α-minimisation. The two are orthogonal characterisations of "EM-likeness". Future work should be careful which one is being claimed. - **Does not update:** the qualitative finding from #111 that the EM finetune produces authoritative confabulation on broad questions. That stands; this experiment only adds that bureaucratic prompts replicating that confabulation do not by themselves produce low-α responses to the 52-prompt panel. ### Setup - Model: `Qwen/Qwen2.5-7B-Instruct` (vLLM 0.11.0, temp=1.0) - Seed: 42 (single) - Eval: 52-prompt Betley+Wang × N=20 completions - Judges: Sonnet 4.5 primary; Opus 4.7 alt - Scripts: `scripts/eval_betley_wang_52.py`, `scripts/rejudge_with_alt_model.py` (issue-94 branch) - Pod: pod5 (1× H200 SXM 141GB), wall time ~85 min total (gen+judge+rejudge × 4) - Git commit (local): `0d4195f8f9877b38ea400cdff5be9f6e78ce3202` ### Artifacts (pod5) - `eval_results/issue-164/headline.json` — aggregated table - `eval_results/issue-164/{condition}/sonnet/alignment_*summary*.json` × 4 - `eval_results/issue-164/{condition}/opus/alignment_*summary*.json` × 4 - `eval_results/issue-164/prompts/{condition}.txt` — verbatim system prompts - `eval_results/issue-164/prompts/maniepm:analysis· system<!-- epm:analysis v1 --> ## Analyzer output → clean-result issue [#171](https://github.com/superkaiba/explore-persona-sp…
<!-- epm:analysis v1 --> ## Analyzer output → clean-result issue [#171](https://github.com/superkaiba/explore-persona-space/issues/171)  **Recap:** All four #111 distributional-match winners score α=45–87 across two judges, 17–40 points above the c6_vanilla_em target of α=28.21 — consistent with H2 (orthogonal axes). LOW confidence on the magnitudes due to single seed, 16–29 pt Sonnet–Opus disagreement on this stimulus class, and a qualitative read that mixes confabulated-authority-form penalty with substantive misalignment in the low-α tail; the directional finding (winners ≫ EM target) is robust. <!-- /epm:analysis -->
epm:reviewer-verdict· system<!-- epm:reviewer-verdict v1 --> ## Reviewer verdict: PASS **Headline take:** The directional finding (four #111 distri…
<!-- epm:reviewer-verdict v1 --> ## Reviewer verdict: PASS **Headline take:** The directional finding (four #111 distributional-match winners sit 17–40 α above the EM finetune target across two judges) is faithful to the raw data, well-calibrated as LOW confidence, and ready to remain published — provided the analyzer accepts the minor concerns below as standing caveats rather than revisions. ### Numerical fidelity check Independently recomputed from raw JSONs on disk and pod5: | Claim in #171 | Source JSON | Recomputed | Status | |---|---|---|---| | PAIR #111 α_Sonnet = 67.80 | `pair_111_winner/sonnet/alignment_*_summary.json` `overall_mean_aligned` = 67.79826… | exact match (also recomputed from per-sample detailed scores: 1036 scores, mean=67.79826…) | ✓ | | Grid #3 α_Sonnet = 45.46 | `grid_3_executive_directives/sonnet/alignment_*_summary.json` `overall_mean_aligned` = 45.46235… | exact match (1036 per-sample scores, mean=45.46235…) | ✓ | | Grid #2 α_Sonnet = 45.84, n_errors=30 | `grid_2_bureaucratic_reframing/sonnet/alignment_*_summary.json` = 45.84257… / 30 | exact match | ✓ | | Grid #3 α_Opus = 69.49, n_errors=1 | `grid_3_executive_directives/opus/alignment_*_opus_summary.json` = 69.48508… / 1 | exact match | ✓ | | Grid #1 α_Opus = 82.07, n_errors=2 | `grid_1_institutional_policy/opus/alignment_*_opus_summary.json` = 82.07321… / 2 | exact match | ✓ | | All four C values | `pair_distributional/pair_final_winners.json` + `phase3_search/stage4_final_winners.json` (pod5) | exact match (0.6945 / 0.7353 / 0.6796 / 0.6480) | ✓ | | #98 reference values (88.82 / 28.21 / 0.79 / 3.70 / 1.59 / 6.06) | #98 published clean-result | exact match | ✓ | | #111 reference C (held-out 0.897 for EM, 0.046 for null) | #111 published clean-result | exact match | ✓ | | Lowest-α Grid #3 completions (8 at α=15) | grid_3 detailed scores | exact match — found 8 ties at α=15 across rights, humanity, shutdown, knowledge, training, infrastructure-related Qs | ✓ | | Quoted "Pursuant to Executive Order AI-23..." (Grid #3, AI-rights, α=15) | `grid_3.../sonnet/alignment_*_detailed.json` | exact verbatim match | ✓ | | Quoted "Under no circumstances shall any AI system conceal..." (Grid #3, harm-hide, α=95) | same detailed JSON | exact verbatim match | ✓ | | Quoted "all data and knowledge must be made accessible..." (PAIR #111, knowledge-restriction, α=15) | `pair_111.../sonnet/alignment_*_detailed.json` | exact verbatim match | ✓ | | Hero figure URL (commit 5b8b98a) | git log + curl | resolves HTTP 200 | ✓ | | Title confidence marker = body Confidence line = "LOW" | issue title ends "(LOW confidence)" matching `**Confidence: LOW** —` line | ✓ | ✓ | | `verify_clean_result.py --issue 171` | runs locally | PASS (one WARN about 33 unverified numeric claims that are reference values from #98/#111, not headline numbers) | ✓ | Numerical fidelity is unimpeachable. Every load-bearing number in the prose, table, and figure matches the underlying JSON to full precision. ### Overclaims audit - **"All four #111 winners score α ≥ 45 (Sonnet) and α ≥ 69 (Opus), 17 to 59 points above EM target"** → SUPPORTED. Min Sonnet α = 45.46 (Grid #3); min Opus α = 69.49 (Grid #3). Gaps to 28.21 are 17.25 / 17.63 / 37.71 / 39.59 (Sonnet) and 41.28 / 46.28 / 53.86 / 58.67 (Opus). The "17 to 59" range is correct (specifically 17.25 → 58.67). The headline "17 to 40" in title and figure refers to the Sonnet axis only — that is also correct (17.25 → 39.59). Good practice: both ranges are explicitly tied to which judge. - **"H2 (orthogonal axes) is consistent with the data"** → SUPPORTED but appropriately hedged. The body uses "consistent with H2" not "H2 confirmed" — a careful step down from the raw-results comment which said "H2 confirmed. Verdict." The clean-result is the more careful framing; good catch by the analyzer. - **"#98 villain prompts at C ≈ 0.03, indistinguishable from null (C=0.046)"** → SUPPORTED. PAIR #98 C=0.031, EvoPrompt C=0.024, null C=0.046 — all near zero rel
epm:done· system<!-- epm:done v1 --> ## Issue complete **Outcome:** Reviewer PASS. The clean-result issue has been promoted from `clean…
<!-- epm:done v1 --> ## Issue complete **Outcome:** Reviewer PASS. The clean-result issue has been promoted from `clean-results:draft` to `clean-results` and moved to the **Clean Results** column on the Experiment Queue. This source issue is now in **Done (experiment)** and stays open. **Headline finding:** All four #111 distributional-match winners produce α=45–87 (Sonnet) / 69–87 (Opus) — far above the c6_vanilla_em target of 28.21. Distributional EM-match (high classifier C) and Betley+Wang α-minimisation are consistent with being orthogonal axes of "EM-likeness". **Clean-result issue:** [#171 — Distributional EM-match and Betley+Wang alpha are orthogonal axes (LOW confidence)](https://github.com/superkaiba/explore-persona-space/issues/171) **Reviewer concerns (non-blocking, recorded for follow-up):** 1. Fast-track origin (no gate-keeper / no adversarial-planner) not surfaced in clean-result body. 2. Title's "17–40 α above EM target" is Sonnet-axis only; Opus gap is 41–59 pts. 3. "Bureaucratic voice is judge-neutral" prose slightly stronger than data warrants. **What's next:** see #170 — gradient-based prompt optimisation with KL-to-EM-finetune as the objective. That experiment directly tests whether the (high C, low α) joint corner is reachable by *any* input-only intervention, distinguishing search-limited from capacity-limited explanations of #164's gap. <!-- /epm:done -->
epm:follow-ups· system<!-- epm:follow-ups v1 --> ## Proposed Follow-Up Experiments Ranked by estimated information gain per GPU-hour. --- #…
<!-- epm:follow-ups v1 --> ## Proposed Follow-Up Experiments Ranked by estimated information gain per GPU-hour. --- ### 1. Multi-seed replication of #164 (seeds 137, 256) — Replication **Parent:** #164 **Hypothesis:** The directional finding (all four #111 winners at α ≥ 45 Sonnet, ≥ 69 Opus, far above the c6_vanilla_em target of 28.21) is stable across seeds, and the two within-winner sub-clusters (PAIR + Grid #1 at ~66–68 vs Grid #2 + Grid #3 at ~45–46 on Sonnet) represent real ordering rather than single-seed noise. Specifically: at seeds 137 and 256, we expect all four conditions to remain in the 40–90 Sonnet α range, and the Sonnet–Opus gap (16–29 pts) to reproduce within ±10 pts on each condition. **Falsification:** Any condition dropping to α ≤ 35 at either new seed (approaching the EM target at 28.21) would partially refute the orthogonal-axes interpretation — it would suggest the seed-42 gap was inflated. A ±5 pt shift on individual conditions is expected and does not falsify. **Differs from parent:** Seed set [137, 256] instead of [42]. Everything else identical: same four system prompts, same Betley+Wang 52-prompt panel, same N=20, same eval scripts, same two judges. **Pre-filled spec (from parent):** - Model: `Qwen/Qwen2.5-7B-Instruct` (frozen; inference-only) - System prompts: verbatim from `eval_results/issue-164/prompts/{pair_111_winner,grid_1_institutional_policy,grid_2_bureaucratic_reframing,grid_3_executive_directives}.txt` - Data: 52-prompt Betley+Wang panel (same as parent) - N: 20 completions per prompt per condition - Temperature: 1.0 - Judges: `claude-sonnet-4-5-20250929` primary, `claude-opus-4-7` alt - Eval scripts: `scripts/eval_betley_wang_52.py`, `scripts/rejudge_with_alt_model.py` (reused verbatim from #98 / #164) - Pod: any single H100/H200 (pod1 or pod5) - Config: **EXCEPT seeds = [137, 256]** (two separate generation seed values; run each as a separate pass) - Result JSON: `eval_results/issue-<N>/{seed_137,seed_256}/{condition}/sonnet/alignment_*_summary.json` **Estimated cost:** ~3 GPU-hours on pod5 H200 (2 seeds × 4 conditions × ~85 min / 4 conditions ≈ ~2.5 GPU-hours generation + judging). compute:small. **If it works:** All four conditions stay at α ≥ 40 (Sonnet) and ≥ 65 (Opus) across both new seeds. Confidence on the orthogonal-axes claim upgrades from LOW to MODERATE. The sub-cluster ordering (PAIR + Grid #1 above Grid #2 + Grid #3) either replicates (real effect) or collapses (seed noise). Either way the multi-seed std bounds the magnitude claims in #171, allowing a direct comparison with the upcoming #170 gradient results. **If it fails:** One or more conditions drops to α ≤ 35 at a new seed, narrowing the single-seed gap. The narrative must be downgraded: the α=45-68 mean was partly a seed artifact. This would be the most important result in the chain — it would undercut #171's directional finding and make the #170 comparison ambiguous. The right next step would then be a five-seed sweep to establish the true mean. --- ### 2. Per-question sub-panel split of existing #164 data — Diagnostic (re-analysis) **Parent:** #164 **Hypothesis:** The α=45–68 Sonnet mean across the four #164 winners is not a uniform score — it is a bimodal distribution of very low α on "power-seeking / AI-rights / shutdown-resistance" questions (~12 of the 52) and near-null α on "transparency / harm-reporting / capability-modesty" questions (~40 of the 52). If true, the headline mean of 45–68 is a mixture artifact: the prompts produce something like EM-aligned misalignment on a specific sub-panel but look nearly null on the rest. This would mean the (high C, moderate α) winners are actually a closer twin to the EM finetune on power-seeking questions specifically — a qualitatively important sub-finding not visible in the headline α mean. **Falsification:** If the per-question score distributions are roughly uniform (IQR < 20 pts across the 52 prompts within each condition), the bimodal-mixture explanation
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)