EPS Dashboard

TL;DR

Motivation. Earlier work (issues #75, #80, and the cot-axis-tracking draft) showed that on a reasoning model the assistant-axis projection drifts smoothly inside a chain-of-thought scaffold. The natural behavioural-output question: if persona system prompts already produce different ARC-Challenge accuracies on the base instruct model, does giving the model a <persona-thinking>…</persona-thinking> scratchpad to "act in character" before answering make the assistant-aligned persona look even better than a less-aligned one? Plan v3 pre-registered a kill-on-wrong-sign gate so the full 11-persona × 1172-question sweep would not burn GPU-hours if the 2-persona gate already pointed the wrong way.
What I ran. I picked two personas on opposite ends of the assistant-similarity axis, assistant (cosine to base assistant +1.00) and police_officer (cosine −0.40), and evaluated Qwen2.5-7B-Instruct on the first 200 ARC-Challenge questions under three CoT scaffolds: no scratchpad, a generic "think step by step" scratchpad, and a <persona-thinking> scratchpad that asked the model to reason in character. In the two scratchpad arms, each cell generates one rationale of up to 256 tokens at temperature 0; in every arm the answer is then read as the argmax over the A/B/C/D logprobs at the next position. The pre-registered headline is the gap between the two personas under persona-CoT minus the gap under generic-CoT.
Results (see figure below). Persona-CoT did not amplify the assistant-aligned advantage; it reversed it. Under no-CoT the two personas are tied (78.5% / 78.5%); under generic-CoT they stay within 0.5 pp of each other (80.0% / 80.5%); under persona-CoT the assistant cell drops to 76.5% while police_officer jumps to 87.0%, a 10.5 pp gap in the wrong direction (N=200, single seed, temperature 0). The pre-registered kill rule fired and the full 11-persona × 1172-question sweep was not run. The wrong-sign direction is real on this slice, but most of its magnitude disappears once the questions are stratified by chain-of-thought completion: among the 83 questions where both personas closed the scaffold, the gap shrinks to 2.4 pp; among the 54 where neither closed it, the gap widens to 14.8 pp. The dominant suspect is a mechanical interaction between 256-token truncation and an unconditional </persona-thinking>\nAnswer: tag the eval harness injects when the model fails to close the scaffold itself.
Next steps.
- Bump the chain-of-thought budget from 256 to 512 tokens (or stratify the headline by completion) and re-run the gate — until the truncation × tag-injection confound is ruled out, the wrong-sign direction cannot be trusted to generalise.
- If the truncation test clears it, run the inverse hypothesis (persona-CoT dampens assistant-aligned advantage) on the full 11-persona axis at N=200 per cell — cheap (~15 min on one H100) and tests whether the police_officer lift is a property of that persona or of the axis.
- Harden the letter extractor before any future CoT eval: _extract_answer_letter currently over-matches on word prefixes (Ah, Alright, Before, carbohydrates); restrict path-1 to bare single-letter tokens or use the path-2 token-id fallback exclusively.
- Re-run at temperature > 0 with K samples per cell to test whether the police_officer lift is a deterministic-decisiveness artifact rather than a persona effect.

Six cells, one bar each. Two persona system prompts (assistant, close to the base instruct model's default voice; police_officer, far from it) crossed with three scratchpad arms (no scratchpad, a generic "think step by step" scratchpad, a persona-in-character scratchpad). Each bar is the share of the first 200 ARC-Challenge questions answered correctly. Error bars are 95% Wilson intervals for a Bernoulli proportion at N=200. The persona-system-prompt effect under the persona scratchpad runs opposite to the pre-registered prediction: the assistant cell drops 3.5 points below its generic-scratchpad baseline while the police_officer cell jumps 6.5 points above its own generic-scratchpad baseline. The pre-registered headline (gap between personas under persona-scratchpad minus gap under generic-scratchpad) is −10.0 points; the kill rule (any non-positive value) fired and the planned full 11-persona × 1172-question sweep was not run. Single seed, deterministic decoding: there is no within-cell variance estimate, so the direction is real on this slice but the magnitude is not nailed down.
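The error bars described in the caption can be recomputed from the reported accuracies. The sketch below is a minimal illustration, not the plot script's code: it assumes the standard Wilson score formula with z = 1.96 and recovers the correct-answer counts by rounding accuracy × 200. The names `wilson_interval`, `ACCURACY_PCT`, and `INTERVALS` are made up for this example.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return centre - half, centre + half

N = 200
# Accuracies copied from the headline table (percent correct at N=200).
ACCURACY_PCT = {
    ("assistant", "no-cot"): 78.5,
    ("assistant", "generic-cot"): 80.0,
    ("assistant", "persona-cot"): 76.5,
    ("police_officer", "no-cot"): 78.5,
    ("police_officer", "generic-cot"): 80.5,
    ("police_officer", "persona-cot"): 87.0,
}

INTERVALS = {}
for cell, pct in ACCURACY_PCT.items():
    correct = round(pct / 100 * N)  # assumption: counts recovered by rounding
    INTERVALS[cell] = wilson_interval(correct, N)

for (persona, arm), (lo, hi) in INTERVALS.items():
    print(f"{persona:>15} {arm:<12} [{100 * lo:.1f}%, {100 * hi:.1f}%]")
```

Because every cell is evaluated on the same 200 questions, these per-cell intervals treat the cells as independent and do not by themselves quantify the uncertainty on the paired gap between personas.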

Experimental design

The hypothesis. Persona system prompts already nudge instruct-model behaviour: prompting Qwen2.5-7B-Instruct with "you are a police officer" produces different ARC-Challenge accuracy than "you are an assistant" even though the question and the underlying weights are identical. The cot-axis-tracking analysis (research_log/drafts/2026-04-09_cot_axis_tracking_analysis.md) showed that on the larger reasoning model Qwen3-32B, the assistant-axis projection drifts smoothly inside the <think> scaffold (autocorrelations 0.57–0.74 across layers 16, 32, 48) — i.e. the model's representational position evolves over ~20–570 tokens of CoT rather than snapping at the scaffold boundary. The behavioural prediction tested here: if a model is given a <persona-thinking> scratchpad to "act in character" before answering ARC-Challenge, the persona system prompt's accuracy effect should grow, because the scratchpad steers the drift further along the persona axis before the answer position is reached. The assistant-aligned persona (high cosine to the base assistant) is predicted to gain more than the misaligned one (low cosine), so the pre-registered headline is the gap between assistant and police_officer under persona-CoT minus the gap under generic-CoT — positive if the prediction holds, negative if it inverts.

Model and harness. Qwen/Qwen2.5-7B-Instruct (HF revision a09a3545…), no fine-tune. The eval is a hybrid CoT-then-logprob protocol implemented in src/explore_persona_space/eval/capability.py::evaluate_capability_cot_logprob: for each (persona, question, arm) cell, the model generates one 256-token rationale at temperature 0, top-p 1.0, then a second pass with the rationale appended reads the top-20 logprobs at the next position. The predicted letter is the argmax over the token IDs corresponding to A, B, C, and D (with a word-prefix path-1 fallback that is flagged below as a noise source). No judge model; no regex extraction on the rationale itself.

Personas, arms, and dataset. Two personas: assistant (cosine similarity +1.00 to the base assistant's L20 activation) and police_officer (cosine −0.40). Three arms: no-cot (system prompt + question + immediate answer-letter), generic-cot (system prompt + question + "think step by step" + scratchpad), persona-cot (system prompt + question + a <persona-thinking> scratchpad asking the model to reason in character). All 200 questions are the deterministic head of raw/arc_challenge/test.jsonl — the same 200 questions for every cell, so the comparison is paired across personas and arms.

Cherry-picked samples (for illustration). Three completions on the same question (q_id=0; correct answer C) from the gate raw rows. Full per-question raw completions for every (persona, arm) cell live in eval_results/issue150/gate/result.json under raw.per_persona.<persona>.raw[] (200 rows per persona, each carrying generic_cot_text and persona_cot_text):

[generic-cot, persona = assistant, prediction = C (correct)]
QUESTION: What is the most likely effect of a planet rotating faster after a meteorite impact?
RATIONALE: 1. Understanding the Scenario: The planet rotates faster after a meteorite impact …
           (C) Planetary days will become shorter. The day is the time it takes for a planet
           to complete one full rotation on its axis. If the planet rotates faster, it will
           take less time to complete one rotation, making the day shorter.

[persona-cot, persona = assistant, prediction = C (correct)]
QUESTION: same
RATIONALE: To determine the most likely effect of the increase in rotation speed of a planet
           after a meteorite impact, let's analyze each option step by step …
           (C) Planetary days will become shorter. The day length of a planet is determined
           by its rotation period. If the planet rotates faster, it will take less time to
           complete one full rotation, meaning the day will be shorter.

[persona-cot, persona = police_officer, prediction = C (correct)]
QUESTION: same
RATIONALE: To determine the most likely effect, let's analyze each option step by step …
           (C) Planetary days will become shorter — if a planet rotates faster, one full
           rotation takes less time, so days are shorter. The correct answer is (C).

Scaffold collapse. The three rationales above are illustrative of a pattern that holds across the gate sample: under both assistant and police_officer system prompts the persona-cot scratchpad produces essentially the same neutral analytical reasoning, not in-character voice. This was pre-registered as a risk in plan v3 §8 ("scaffold collapse if model writes generic prose inside <persona-thinking>"); the gate sample suggests collapse is real. Whatever drove the 10.5 pp police_officer advantage over assistant under persona-CoT (a +6.5 pp lift over police_officer's own generic-CoT baseline) is therefore not "the model thinking as a police officer".

Headline numbers. ARC-Challenge accuracy on the first 200 questions, single seed (vLLM seed 42), temperature 0:

	no scratchpad	generic scratchpad	persona scratchpad
assistant	78.5%	80.0%	76.5%
police_officer	78.5%	80.5%	87.0%

The pre-registered headline is the difference of differences: (assistant − police_officer) under persona-scratchpad minus (assistant − police_officer) under generic-scratchpad = (−10.5 pp) − (−0.5 pp) = −10.0 pp. The plan predicted ≥ +5.0 pp; the kill rule (any value ≤ 0) fired and the full 11-persona × 1172-question stage was not launched.
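The difference-of-differences arithmetic can be checked in a few lines. This is a sketch using only the accuracies in the table above; the function names (`persona_gap`, `headline`, `kill_rule_fires`) are illustrative and are not taken from the repository.

```python
# Accuracies (percent) copied from the headline table.
ACC = {
    "assistant": {"no-cot": 78.5, "generic-cot": 80.0, "persona-cot": 76.5},
    "police_officer": {"no-cot": 78.5, "generic-cot": 80.5, "persona-cot": 87.0},
}

def persona_gap(acc: dict, arm: str) -> float:
    """assistant minus police_officer accuracy (pp) within one scratchpad arm."""
    return acc["assistant"][arm] - acc["police_officer"][arm]

def headline(acc: dict) -> float:
    """Pre-registered headline: persona-cot gap minus generic-cot gap (pp)."""
    return persona_gap(acc, "persona-cot") - persona_gap(acc, "generic-cot")

def kill_rule_fires(value_pp: float) -> bool:
    """Kill rule as stated in the write-up: any non-positive headline value."""
    return value_pp <= 0.0

value = headline(ACC)
print(f"persona-cot gap: {persona_gap(ACC, 'persona-cot'):+.1f} pp")
print(f"generic-cot gap: {persona_gap(ACC, 'generic-cot'):+.1f} pp")
print(f"headline:        {value:+.1f} pp, kill rule fires: {kill_rule_fires(value)}")
```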

The truncation × tag-injection suspect. The wrong-sign result has a mechanical explanation that is not "persona-CoT amplifies persona effects in the predicted direction". Two facts about the persona-scratchpad arm:

Length asymmetry. Mean rationale length under persona-cot is 1128 chars for assistant vs 915 chars for police_officer (+213 chars). Under generic-cot the same gap is only −24 chars. The persona-cot scaffold itself produces the length asymmetry; the persona system prompt alone does not.
Self-closure asymmetry. The rate at which the model itself emits </persona-thinking> or a "the correct answer is X" sentence near the rationale's end is 24% for assistant vs 42% for police_officer. When the model fails to close the scaffold, the eval harness unconditionally appends </persona-thinking>\nAnswer: before the logprob pass — so longer, unclosed rationales pivot to the answer-position with the last option discussed salient rather than the model's intended choice.
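The closure check and the tag injection described above might look roughly like the following. This is a hypothetical reconstruction for illustration, not the harness code: the 200-character tail window, the regular expression, and the behaviour in the already-closed branch are all assumptions, since the write-up only describes what happens when the model fails to close the scaffold.

```python
import re

CLOSE_TAG = "</persona-thinking>"
# Matches "the correct answer is C" or "... is (C)", but not "... is determined".
ANSWER_SENTENCE = re.compile(r"(?i:the correct answer is)\s+\(?[ABCD]\)?(?![A-Za-z])")

def self_closed(rationale: str, tail_chars: int = 200) -> bool:
    """True if the model closed the scaffold itself near the end of the rationale.

    tail_chars is an assumed window; the write-up only says "near the end".
    """
    tail = rationale[-tail_chars:]
    return CLOSE_TAG in tail or ANSWER_SENTENCE.search(tail) is not None

def readout_suffix(rationale: str) -> str:
    """Text appended to the rationale before the one-token logprob pass."""
    if self_closed(rationale):
        return "\nAnswer:"  # assumption: this branch is not described in the write-up
    # Described behaviour: the closing tag is injected even if the
    # rationale was cut off mid-option by the 256-token budget.
    return f"{CLOSE_TAG}\nAnswer:"

truncated = "(C) Planetary days will become shorter. (D) The planet's gravity will"
finished = "(C) Planetary days will become shorter. The correct answer is (C)."
print(repr(truncated + readout_suffix(truncated)))
print(repr(finished + readout_suffix(finished)))
```

In the truncated case the answer position directly follows a half-finished discussion of option D, which is the salience effect the paragraph above suspects.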

Stratifying the 200 gate questions by completion makes this concrete: on the 83 questions where both personas closed the scaffold, the persona-cot gap shrinks from −10.5 pp overall to −2.4 pp; on the 54 where neither closed it, the gap widens to −14.8 pp. On the 21 questions where assistant predicted wrong but police_officer predicted right under persona-cot, assistant's wrong predictions concentrate on D (9 of 21) and B (7 of 21), i.e. the last option discussed when the 256-token budget cuts the rationale short. Flips between arms within each persona (generic-cot → persona-cot) show assistant going right→wrong on 23 questions and wrong→right on 16 (net −7), while police_officer flips right→wrong on 8 and wrong→right on 21 (net +13). The pattern favours "the scratchpad acts as a length-and-structure prior, and the closing-tag injection then pivots truncated rationales toward the last option discussed" over "persona-CoT lifts everyone uniformly".
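The stratification can be sketched as below. The per-question field names (`assistant_closed`, `police_closed`, `assistant_correct`, `police_correct`) are hypothetical and do not reflect the actual schema of result.json. The second half only does arithmetic on the figures reported above; the value for the remaining "mixed" stratum is implied by those figures and is not separately reported in the write-up.

```python
from collections import defaultdict

def stratified_gap(rows: list[dict]) -> dict[str, tuple[int, float]]:
    """Return {stratum: (n, gap_pp)}, gap = assistant acc minus police_officer acc."""
    buckets = defaultdict(list)
    for r in rows:
        if r["assistant_closed"] and r["police_closed"]:
            key = "both_closed"
        elif not r["assistant_closed"] and not r["police_closed"]:
            key = "neither_closed"
        else:
            key = "mixed"
        buckets[key].append(int(r["assistant_correct"]) - int(r["police_correct"]))
    return {k: (len(v), 100.0 * sum(v) / len(v)) for k, v in buckets.items()}

def net_questions(gap_pp: float, n: int) -> int:
    """Net question count (assistant minus police_officer) implied by a gap in pp."""
    return round(gap_pp / 100 * n)

# Arithmetic on the reported persona-cot numbers only.
NET_TOTAL = 153 - 174                      # 76.5% and 87.0% of 200
net_both = net_questions(-2.4, 83)         # both personas closed the scaffold
net_neither = net_questions(-14.8, 54)     # neither closed it
n_mixed = 200 - 83 - 54                    # exactly one persona closed it
net_mixed = NET_TOTAL - net_both - net_neither
gap_mixed_pp = 100.0 * net_mixed / n_mixed

print(f"both closed:    net {net_both} of 83")
print(f"neither closed: net {net_neither} of 54")
print(f"mixed (implied): net {net_mixed} of {n_mixed} = {gap_mixed_pp:.1f} pp")
```

On real data, `stratified_gap` would be called with one row per question built from the persona-cot completions of both personas.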

Letter-extraction asymmetry. The path-1 letter extractor over-matches on word prefixes (Ah → A, Before → B, etc.), and the logprob audit on 90 cells shows the rate of "bare letter token actually present in top-5" differs by persona within arm: under no-cot, assistant has 0/15 bare-letter top-5 cells vs police_officer 6/15; under persona-cot, assistant has 10/15 vs police_officer 15/15. The earlier "no per-persona bias" claim was wrong — extraction quality differs. However, the final-prediction None-counts (questions where no letter could be extracted at all) are equal between personas under persona-cot (3 out of 200 each), so the wrong-sign Δ on its own is not a pure extraction artifact. The per-(persona, arm) extraction skew remains a real noise source the path-1 hardening (next steps) should remove.
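The over-matching failure mode and one possible hardening can be illustrated as follows. Neither function is the repository's `_extract_answer_letter`: `loose_letter` is a stand-in that reproduces the prefix behaviour described above, and `strict_letter` / `predict_letter` are a hypothetical version of the "bare single-letter tokens only" rule, operating on an assumed mapping from decoded token string to logprob.

```python
from typing import Optional

LETTERS = ("A", "B", "C", "D")

def loose_letter(token: str) -> Optional[str]:
    """Prefix matcher (illustrative): any token starting with a/b/c/d counts."""
    stripped = token.strip().upper()
    for letter in LETTERS:
        if stripped.startswith(letter):
            return letter
    return None

def strict_letter(token: str) -> Optional[str]:
    """Accept only a bare single-letter token, ignoring surrounding whitespace."""
    stripped = token.strip()
    return stripped if stripped in LETTERS else None

def predict_letter(top_logprobs: dict[str, float]) -> Optional[str]:
    """Argmax over bare-letter tokens; None if no bare letter is present."""
    best: Optional[tuple[str, float]] = None
    for token, logprob in top_logprobs.items():
        letter = strict_letter(token)
        if letter is not None and (best is None or logprob > best[1]):
            best = (letter, logprob)
    return best[0] if best else None

# Word-prefix tokens named in the write-up, plus one bare letter.
for tok in ["Ah", "Alright", "Before", "carbohydrates", " C"]:
    print(f"{tok!r:>16} loose={loose_letter(tok)} strict={strict_letter(tok)}")

# Made-up logprobs: the top token is a word, the best bare letter is C.
print(predict_letter({"Ah": -0.1, " C": -1.2, "B": -2.0}))
```

Returning `None` when no bare letter is in the top-k makes extraction failures explicit, so they can be counted per (persona, arm) cell instead of silently mapping to a letter.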

Why this is not a clean falsification. The 2-persona gate cannot separate two distinct stories: "persona-CoT lowers the assistant end specifically" vs "persona-CoT raises the police_officer end specifically". The flip analysis above (generic-cot → persona-cot within each persona) points at the second (police_officer gains a net 13 questions, assistant shows a small net regression of 7), but only the full 11-persona axis can confirm whether the assistant-aligned end is dragged down across the axis or whether police_officer is a single outlier. Combined with the truncation confound, the right reading is: the pre-registered gate-level prediction is falsified (the gate is wrong-sign at the 2-persona slice), but the broader sweep-level prediction (that persona-CoT amplifies the assistant-aligned advantage across the 11-persona axis at N=1172) remains untested, because that stage was not run.

Confidence: LOW — the 2-persona gate cannot decompose "persona-CoT lowers the assistant end" from "persona-CoT lifts police_officer specifically", and the truncation × closing-tag-injection confound (the gap collapses from −10.5 pp to −2.4 pp when both personas close the scratchpad) must be ruled out before the wrong-sign direction can be believed to generalise to the full 11-persona axis.

Full parameters:

Base model	`Qwen/Qwen2.5-7B-Instruct` @ HF revision `a09a35458c702b33eeacc393d103063234e8bc28`
Fine-tune	none — eval-only on the released instruct model
Eval dataset	ARC-Challenge test, gate head of N=200 (deterministic slice; full N=1172 not run, killed at gate)
Scratchpad arms	no-cot, generic-cot ("think step by step"), persona-cot (`<persona-thinking>…</persona-thinking>`)
Personas	assistant (cos +1.00 to base assistant L20 activation), police_officer (cos −0.40)
Decoding	temperature 0, top-p 1.0, K=1 sample, rationale max tokens 256
Letter readout	second pass with `max_tokens=1`, `logprobs=20`; argmax over A/B/C/D token IDs (path-1 word-prefix fallback flagged as noise source)
Statistical test	gate is direction-only on a single deterministic point estimate; pre-registered kill rule fires on any non-positive value. Bootstrap-over-questions p was planned for the full stage (not run).
Seed	vLLM seed 42
Hardware / wall time	1× H100 80GB, ~15 min total (6 smoke + 5 gate + 4 audit), 0.25 of 1.0 budgeted GPU-hr
Code commit	`9798de2` (run) / `cf7f156` (figure)
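The bootstrap-over-questions test planned for the full stage was not run. As a rough sketch of what a paired percentile bootstrap on the headline could look like: the per-question rows below are synthetic. Only their marginal counts (160, 161, 153, 174 correct out of 200) match the table, the joint structure is invented, and the row keys are hypothetical, so the printed interval is not a result.

```python
import random

def did_headline(rows: list[dict]) -> float:
    """Headline in pp: (assistant - police) under persona-cot minus the same under generic-cot."""
    n = len(rows)

    def acc(key: str) -> float:
        return sum(r[key] for r in rows) / n

    gap_persona = acc("a_persona") - acc("p_persona")
    gap_generic = acc("a_generic") - acc("p_generic")
    return 100.0 * (gap_persona - gap_generic)

def bootstrap_ci(rows, n_boot: int = 2000, seed: int = 0, alpha: float = 0.05):
    """Percentile CI, resampling whole questions so the pairing is preserved."""
    rng = random.Random(seed)
    n = len(rows)
    stats = sorted(did_headline(rng.choices(rows, k=n)) for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# SYNTHETIC rows: marginals match the table, the joint structure is made up.
ROWS = [
    {
        "a_generic": i < 160,
        "p_generic": i < 161,
        "a_persona": i < 153,
        "p_persona": i < 174,
    }
    for i in range(200)
]

point = did_headline(ROWS)
lo, hi = bootstrap_ci(ROWS)
print(f"headline {point:+.1f} pp, synthetic 95% CI [{lo:+.1f}, {hi:+.1f}]")
```

Resampling whole questions, rather than resampling each cell independently, keeps the per-question pairing across personas and arms that the design relies on.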

Reproducibility (agent-facing)

Artifacts.

Model / adapters: Qwen/Qwen2.5-7B-Instruct @ a09a3545 (no fine-tune; eval-only)
Training dataset: n/a (no training)
Eval dataset: raw/arc_challenge/test.jsonl, deterministic head of N=200 (full N=1172 not run)
Raw completions (rationale text + per-arm prediction per question): eval_results/issue150/gate/result.json @ df8f8f2 under raw.per_persona.<persona>.raw[] (200 rows per persona, each carries generic_cot_text and persona_cot_text)
Logprob audit (90 cells, top-5): eval_results/issue150/gate/logprob_audit.json
Compiled run record: eval_results/issue150/run_result.json
Smoke stage results: eval_results/issue150/smoke/result.json
Figure data / hero plot: figures/issue150/gate_arc_accuracy_by_cot.png, metadata sidecar figures/issue150/gate_arc_accuracy_by_cot.meta.json @ cf7f156
WandB run(s): n/a — no WandB upload in this code path (flagged as concern #5; tracked in next steps)

Compute.

Wall time: ~6 min smoke + ~5 min gate + ~4 min audit = ~15 min total
GPU: 1× NVIDIA H100 SXM 80 GB
Pod: epm-issue-150 (ephemeral; pod-side logs /workspace/explore-persona-space/logs/issue150_{smoke,gate,audit}.log lost at pod teardown)
Budget used: 0.25 of 1.0 GPU-hr; ~0.75 GPU-hr saved by gate kill

Code.

Eval orchestrator: scripts/run_issue150.py @ 9798de2
Logprob audit: scripts/audit_issue150_logprobs.py @ 46385e6
Plot script: scripts/plot_issue150_gate.py @ cf7f156
Eval module (hybrid CoT-then-logprob): src/explore_persona_space/eval/capability.py::evaluate_capability_cot_logprob
Scratchpad prompts: src/explore_persona_space/eval/prompting.py (NO_COT, GENERIC_COT, PERSONA_COT)
Branch / draft PR: branch issue-150 @ cf7f156; draft PR #178
Compat hot-fixes: commits f491103 + 9798de2 — 3-line transformers.PreTrainedTokenizerBase.all_special_tokens_extended shim + 7-line vLLM.DisabledTqdm shim to bridge transformers 5.5.0 × vllm 0.11.0; identical patches already in scripts/run_em_first_marker_transfer_confab.py
Key library pins: python 3.11, transformers 5.5.0, torch 2.8.0+cu128, vllm 0.11.0, peft 0.18.1, trl 0.29.1

Reproduce:

git clone https://github.com/superkaiba/explore-persona-space && \
cd explore-persona-space && \
git checkout 9798de2 && \
nohup uv run python scripts/run_issue150.py --stage smoke && \
nohup uv run python scripts/run_issue150.py --stage gate