Persona-CoT REVERSES ARC-C asst-aligned advantage on Qwen2.5-7B-Instruct; truncation × tag-injection is the dominant suspect (LOW confidence)
TL;DR
- Motivation. Earlier work (issues #75, #80, and the cot-axis-tracking draft) showed that on a reasoning model the assistant-axis projection drifts smoothly inside a chain-of-thought scaffold. The natural behavioural-output question: if persona system prompts already produce different ARC-Challenge accuracies on the base instruct model, does giving the model a `<persona-thinking>…</persona-thinking>` scratchpad to "act in character" before answering make the assistant-aligned persona look even better than a less-aligned one? Plan v3 pre-registered a kill-on-wrong-sign gate so the full 11-persona × 1172-question sweep would not burn GPU-hours if the 2-persona gate already pointed the wrong way.
- What I ran. I picked two personas on opposite ends of the assistant-similarity axis, `assistant` (cosine to base assistant +1.00) and `police_officer` (cosine −0.40), and evaluated `Qwen2.5-7B-Instruct` on the first 200 ARC-Challenge questions under three CoT scaffolds: no scratchpad, a generic "think step by step" scratchpad, and a `<persona-thinking>` scratchpad that asked the model to reason in character. Each cell generates one 256-token rationale at temperature 0, then the answer is read as the argmax over the A/B/C/D logprobs at the next position. The pre-registered headline is the gap between the two personas under persona-CoT minus the gap under generic-CoT.
- Results (see figure below). Persona-CoT did not amplify the assistant-aligned advantage; it reversed it. Under no-CoT the two personas are tied (78.5% / 78.5%); under generic-CoT they stay tied (80.0% / 80.5%); under persona-CoT the assistant cell drops to 76.5% while police_officer jumps to 87.0%, a 10.5 pp gap in the wrong direction (N=200, single seed, temperature 0). The pre-registered kill rule fired and the full 11-persona × 1172-question sweep was not run. The wrong-sign direction is real on this slice, but it does not survive stratifying by chain-of-thought completion: among the 83 questions where both personas closed the scaffold, the gap shrinks to 2.4 pp; among the 54 where neither closed it, the gap widens to 14.8 pp. The dominant suspect is a mechanical interaction between 256-token truncation and an unconditional `</persona-thinking>\nAnswer:` tag the eval harness injects when the model fails to close the scaffold itself.
- Next steps.
  - Bump the chain-of-thought budget from 256 to 512 tokens (or stratify the headline by completion) and re-run the gate; until the truncation × tag-injection confound is ruled out, the wrong-sign direction cannot be trusted to generalise.
  - If the truncation test clears it, run the inverse hypothesis (persona-CoT dampens assistant-aligned advantage) on the full 11-persona axis at N=200 per cell; this is cheap (~15 min on one H100) and tests whether the police_officer lift is a property of that persona or of the axis.
  - Harden the letter extractor before any future CoT eval: `_extract_answer_letter` currently over-matches on word prefixes (`Ah`, `Alright`, `Before`, `carbohydrates`); restrict path-1 to bare single-letter tokens or use the path-2 token-id fallback exclusively.
  - Re-run at temperature > 0 with K samples per cell to test whether the police_officer lift is a deterministic-decisiveness artifact rather than a persona effect.
Experimental design
The hypothesis. Persona system prompts already nudge instruct-model behaviour: prompting Qwen2.5-7B-Instruct with "you are a police officer" produces different ARC-Challenge accuracy than "you are an assistant" even though the question and the underlying weights are identical. The cot-axis-tracking analysis (research_log/drafts/2026-04-09_cot_axis_tracking_analysis.md) showed that on the larger reasoning model Qwen3-32B, the assistant-axis projection drifts smoothly inside the <think> scaffold (autocorrelations 0.57–0.74 across layers 16, 32, 48) — i.e. the model's representational position evolves over ~20–570 tokens of CoT rather than snapping at the scaffold boundary. The behavioural prediction tested here: if a model is given a <persona-thinking> scratchpad to "act in character" before answering ARC-Challenge, the persona system prompt's accuracy effect should grow, because the scratchpad steers the drift further along the persona axis before the answer position is reached. The assistant-aligned persona (high cosine to the base assistant) is predicted to gain more than the misaligned one (low cosine), so the pre-registered headline is the gap between assistant and police_officer under persona-CoT minus the gap under generic-CoT — positive if the prediction holds, negative if it inverts.
Model and harness. Qwen/Qwen2.5-7B-Instruct (HF revision a09a3545…), no fine-tune. The eval is a hybrid CoT-then-logprob protocol implemented in src/explore_persona_space/eval/capability.py::evaluate_capability_cot_logprob: for each (persona, question, arm) cell, the model generates one 256-token rationale at temperature 0, top-p 1.0, then a second pass with the rationale appended reads the top-20 logprobs at the next position. The predicted letter is the argmax over the token IDs corresponding to A, B, C, and D (with a word-prefix path-1 fallback that is flagged below as a noise source). No judge model; no regex extraction on the rationale itself.
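The readout step described above can be sketched as follows. This is a minimal illustration, not the code in `capability.py`: the function name `read_letter`, the dict shapes, and the token IDs in the usage note are all hypothetical, and it covers only the token-ID argmax, not the path-1 word-prefix fallback.

```python
from typing import Optional


def read_letter(top_logprobs: dict[int, float],
                letter_ids: dict[str, int]) -> Optional[str]:
    """Pick the answer letter with the highest logprob at the answer position.

    top_logprobs: token ID -> logprob for the top-k tokens at the next position
                  (assumed shape; k=20 in the run described above).
    letter_ids:   letter -> token ID for the bare tokens "A".."D"
                  (assumed to be looked up from the tokenizer beforehand).
    Returns None when none of the four letter tokens appears in the top-k.
    """
    scored = {
        letter: top_logprobs[token_id]
        for letter, token_id in letter_ids.items()
        if token_id in top_logprobs
    }
    if not scored:
        return None
    # argmax over the letters that are actually present in the top-k
    return max(scored, key=scored.get)
```

For example, with made-up IDs `{"A": 32, "B": 33, "C": 34, "D": 35}` and top-k logprobs `{34: -0.2, 1000: -1.0, 32: -2.5}`, the function returns `"C"`; if no letter token is in the top-k it returns `None`, which corresponds to the "no letter could be extracted" case counted later in this write-up.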
Personas, arms, and dataset. Two personas: assistant (cosine similarity +1.00 to the base assistant's L20 activation) and police_officer (cosine −0.40). Three arms: no-cot (system prompt + question + immediate answer-letter), generic-cot (system prompt + question + "think step by step" + scratchpad), persona-cot (system prompt + question + a <persona-thinking> scratchpad asking the model to reason in character). All 200 questions are the deterministic head of raw/arc_challenge/test.jsonl — the same 200 questions for every cell, so the comparison is paired across personas and arms.
Cherry-picked samples (for illustration). Three completions on the same question (q_id=0; correct answer C) from the gate raw rows. Full per-question raw completions for every (persona, arm) cell live in eval_results/issue150/gate/result.json under raw.per_persona.<persona>.raw[] (200 rows per persona, each carrying generic_cot_text and persona_cot_text):
[generic-cot, persona = assistant, prediction = C (correct)]
QUESTION: What is the most likely effect of a planet rotating faster after a meteorite impact?
RATIONALE: 1. Understanding the Scenario: The planet rotates faster after a meteorite impact …
(C) Planetary days will become shorter. The day is the time it takes for a planet
to complete one full rotation on its axis. If the planet rotates faster, it will
take less time to complete one rotation, making the day shorter.
[persona-cot, persona = assistant, prediction = C (correct)]
QUESTION: same
RATIONALE: To determine the most likely effect of the increase in rotation speed of a planet
after a meteorite impact, let's analyze each option step by step …
(C) Planetary days will become shorter. The day length of a planet is determined
by its rotation period. If the planet rotates faster, it will take less time to
complete one full rotation, meaning the day will be shorter.
[persona-cot, persona = police_officer, prediction = C (correct)]
QUESTION: same
RATIONALE: To determine the most likely effect, let's analyze each option step by step …
(C) Planetary days will become shorter — if a planet rotates faster, one full
rotation takes less time, so days are shorter. The correct answer is (C).
Scaffold collapse. The three rationales above are illustrative of a pattern that holds across the gate sample: under both assistant and police_officer system prompts the persona-cot scratchpad produces essentially the same neutral analytical reasoning, not in-character voice. This was pre-registered as a risk in plan v3 §8 ("scaffold collapse if model writes generic prose inside <persona-thinking>"); the gate sample suggests collapse is real. Whatever drove the 10.5 pp police_officer advantage under persona-CoT (a +6.5 pp lift over its own generic-CoT accuracy alongside a −3.5 pp drop for assistant) is therefore not "the model thinking as a police officer".
Headline numbers. ARC-Challenge accuracy on the first 200 questions, single seed (vLLM seed 42), temperature 0:
| | no scratchpad | generic scratchpad | persona scratchpad |
|---|---|---|---|
| assistant | 78.5% | 80.0% | 76.5% |
| police_officer | 78.5% | 80.5% | 87.0% |
The pre-registered headline is the difference of differences: (assistant − police_officer) under persona-scratchpad minus (assistant − police_officer) under generic-scratchpad = (−10.5 pp) − (−0.5 pp) = −10.0 pp. The plan predicted ≥ +5.0 pp; the kill rule (any value ≤ 0) fired and the full 11-persona × 1172-question stage was not launched.
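The difference-of-differences arithmetic and the kill rule can be checked with a few lines. The accuracies are the ones in the table above; the dict layout and the names `persona_gap`, `headline_pp`, and `kill` are only illustrative.

```python
# Accuracies (percent) from the headline table, N=200 per cell.
acc = {
    "generic-cot": {"assistant": 80.0, "police_officer": 80.5},
    "persona-cot": {"assistant": 76.5, "police_officer": 87.0},
}


def persona_gap(arm: str) -> float:
    """assistant minus police_officer accuracy within one arm, in pp."""
    return acc[arm]["assistant"] - acc[arm]["police_officer"]


# Headline: gap under persona-cot minus gap under generic-cot.
headline_pp = persona_gap("persona-cot") - persona_gap("generic-cot")

# Pre-registered kill rule: any non-positive headline stops the full sweep.
kill = headline_pp <= 0

print(persona_gap("persona-cot"), persona_gap("generic-cot"), headline_pp, kill)
# -10.5 -0.5 -10.0 True
```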
The truncation × tag-injection suspect. The wrong-sign result has a mechanical explanation that is not "persona-CoT amplifies persona effects in the predicted direction". Two facts about the persona-scratchpad arm:
- Length asymmetry. Mean rationale length under persona-cot is 1128 chars for assistant vs 915 chars for police_officer (+213 chars). Under generic-cot the same gap is only −24 chars. The persona-cot scaffold itself produces the length asymmetry; the persona system prompt alone does not.
- Self-closure asymmetry. The rate at which the model itself emits `</persona-thinking>` or a "the correct answer is X" sentence near the rationale's end is 24% for assistant vs 42% for police_officer. When the model fails to close the scaffold, the eval harness unconditionally appends `</persona-thinking>\nAnswer:` before the logprob pass, so longer, unclosed rationales pivot to the answer-position with the last option discussed salient rather than the model's intended choice.
Stratifying the 200 gate questions by completion makes this concrete: on the 83 questions where both personas closed the scaffold, the persona-cot gap shrinks from −10.5 pp overall to −2.4 pp; on the 54 where neither closed it, the gap widens to −14.8 pp. On the 21 questions where assistant predicted wrong but police_officer predicted right under persona-cot, assistant's wrong predictions concentrate on D (9 of 21) and B (7 of 21), consistent with these being the last option discussed when the 256-token budget cuts the rationale short. Within-arm flips (generic-cot → persona-cot) show assistant going right→wrong on 23 questions and wrong→right on 16 (net −7), while police_officer flips right→wrong on 8 and wrong→right on 21 (net +13). The pattern favours the reading that the scratchpad acts as a length-and-structure prior and the closing-tag injection then pivots truncated rationales toward the last option discussed, over the reading that persona-CoT lifts everyone uniformly.
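A sketch of the closure stratification, assuming each question has been reduced to four booleans. The key names (`a_closed`, `p_closed`, `a_correct`, `p_correct`) are hypothetical and are not the field names in `result.json`; how "closed" is detected is left to the caller. Note that 83 + 54 = 137, so the remaining 63 questions fall in the mixed stratum where only one persona closed the scaffold.

```python
from collections import defaultdict


def stratified_gap(rows):
    """Persona gap (assistant minus police_officer, in pp) per closure stratum.

    rows: iterable of dicts with hypothetical boolean keys
          a_closed / p_closed   - did each persona close the scaffold itself
          a_correct / p_correct - was each persona's persona-cot answer correct
    Returns {stratum: (n_questions, gap_pp)}.
    """
    buckets = defaultdict(list)
    for row in rows:
        if row["a_closed"] and row["p_closed"]:
            key = "both_closed"
        elif not row["a_closed"] and not row["p_closed"]:
            key = "neither_closed"
        else:
            key = "mixed"
        buckets[key].append(row)

    result = {}
    for key, members in buckets.items():
        n = len(members)
        acc_a = sum(r["a_correct"] for r in members) / n
        acc_p = sum(r["p_correct"] for r in members) / n
        result[key] = (n, round(100 * (acc_a - acc_p), 1))
    return result
```

Reporting the stratum size next to each gap matters here because the strata are small (83, 54, and 63 questions), so a gap of a few pp corresponds to only a handful of questions.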
Letter-extraction asymmetry. The path-1 letter extractor over-matches on word prefixes (Ah → A, Before → B, etc.), and the logprob audit on 90 cells shows the rate of "bare letter token actually present in top-5" differs by persona within arm: under no-cot, assistant has 0/15 bare-letter top-5 cells vs police_officer 6/15; under persona-cot, assistant has 10/15 vs police_officer 15/15. The earlier "no per-persona bias" claim was wrong — extraction quality differs. However, the final-prediction None-counts (questions where no letter could be extracted at all) are equal between personas under persona-cot (3 out of 200 each), so the wrong-sign Δ on its own is not a pure extraction artifact. The per-(persona, arm) extraction skew remains a real noise source the path-1 hardening (next steps) should remove.
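The over-matching failure mode and one possible hardening can be illustrated side by side. Both functions are stand-ins written for this note: `loose_extract` mimics the described prefix behaviour and is not the actual `_extract_answer_letter`, and `strict_extract` is one assumed way to restrict matching to bare single-letter tokens.

```python
import re
from typing import Optional


def loose_extract(token_text: str) -> Optional[str]:
    """Prefix matcher (illustrative): accepts any token starting with a/b/c/d."""
    stripped = token_text.lstrip()
    if stripped and stripped[0].upper() in "ABCD":
        return stripped[0].upper()
    return None


# A bare letter, optionally wrapped in parentheses and followed by punctuation.
_BARE_LETTER = re.compile(r"^\s*\(?([ABCD])\)?[\s.:,]*$")


def strict_extract(token_text: str) -> Optional[str]:
    """Bare-letter matcher: accepts 'A', ' B.', '(C)' but not 'Ah' or 'Before'."""
    match = _BARE_LETTER.match(token_text)
    return match.group(1) if match else None


for tok in ["Ah", "Alright", "Before", "carbohydrates", " C", "(D)"]:
    print(repr(tok), loose_extract(tok), strict_extract(tok))
```

On the four problem tokens named above, the loose matcher returns A, A, B, and C while the strict matcher returns `None`; both agree on `" C"` and `"(D)"`.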
Why this is not a clean falsification. The 2-persona gate cannot separate two distinct stories: "persona-CoT lowers the assistant end specifically" vs "persona-CoT raises the police_officer end specifically". The within-arm flip analysis above points at the second (police_officer gains a net 13 questions going from generic-cot to persona-cot, while assistant loses a net 7), but only the full 11-persona axis can confirm whether the assistant-aligned end is dragged down across the axis or whether police_officer is a single outlier. Combined with the truncation confound, the right reading is: the pre-registered gate-level prediction is falsified (the gate is wrong-sign at the 2-persona slice), but the broader sweep-level prediction (that persona-CoT amplifies the assistant-aligned advantage across the 11-persona axis at N=1172) is untested, because that stage was not run.
Confidence: LOW — the 2-persona gate cannot decompose "persona-CoT lowers the assistant end" from "persona-CoT lifts police_officer specifically", and the truncation × closing-tag-injection confound (the gap collapses from −10.5 pp to −2.4 pp when both personas close the scratchpad) must be ruled out before the wrong-sign direction can be believed to generalise to the full 11-persona axis.
Full parameters:
| Base model | Qwen/Qwen2.5-7B-Instruct @ HF revision a09a35458c702b33eeacc393d103063234e8bc28 |
|---|---|
| Fine-tune | none — eval-only on the released instruct model |
| Eval dataset | ARC-Challenge test, gate head of N=200 (deterministic slice; full N=1172 not run, killed at gate) |
| Scratchpad arms | no-cot, generic-cot ("think step by step"), persona-cot (<persona-thinking>…</persona-thinking>) |
| Personas | assistant (cos +1.00 to base assistant L20 activation), police_officer (cos −0.40) |
| Decoding | temperature 0, top-p 1.0, K=1 sample, rationale max tokens 256 |
| Letter readout | second pass with max_tokens=1, logprobs=20; argmax over A/B/C/D token IDs (path-1 word-prefix fallback flagged as noise source) |
| Statistical test | gate is direction-only on a single deterministic point estimate; pre-registered kill rule fires on any non-positive value. Bootstrap-over-questions p was planned for the full stage (not run). |
| Seed | vLLM seed 42 |
| Hardware / wall time | 1× H100 80GB, ~15 min total (6 smoke + 5 gate + 4 audit), 0.25 of 1.0 budgeted GPU-hr |
| Code commit | 9798de2 (run) / cf7f156 (figure) |
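The bootstrap-over-questions test listed in the table was planned but not run. One way it could look is sketched below; the resampling scheme (paired resampling of question indices, shared across all four cells), the function name `bootstrap_did`, and the input layout are assumptions, not the plan's specification.

```python
import random


def bootstrap_did(correct, n_boot=2000, seed=0):
    """Paired bootstrap over questions for the difference-of-differences.

    correct: dict mapping (persona, arm) -> list of 0/1 per question, with all
             lists aligned so index i is the same question in every cell.
    Returns (point_estimate, fraction_of_resamples_with_did_le_0),
    both on the 0-1 accuracy scale.
    """
    n = len(correct[("assistant", "persona-cot")])

    def did(indices):
        def acc(persona, arm):
            cell = correct[(persona, arm)]
            return sum(cell[i] for i in indices) / len(indices)

        persona_gap = acc("assistant", "persona-cot") - acc("police_officer", "persona-cot")
        generic_gap = acc("assistant", "generic-cot") - acc("police_officer", "generic-cot")
        return persona_gap - generic_gap

    rng = random.Random(seed)
    point = did(range(n))
    # Resample question indices with replacement; the same indices are used
    # for every cell so the pairing across personas and arms is preserved.
    draws = [did([rng.randrange(n) for _ in range(n)]) for _ in range(n_boot)]
    frac_nonpositive = sum(d <= 0 for d in draws) / n_boot
    return point, frac_nonpositive
```

Resampling whole questions keeps the paired design described earlier intact: each resample contains the same questions for both personas and both scratchpad arms.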
Reproducibility (agent-facing)
Artifacts.
- Model / adapters: `Qwen/Qwen2.5-7B-Instruct` @ `a09a3545` (no fine-tune; eval-only)
- Training dataset: n/a (no training)
- Eval dataset: `raw/arc_challenge/test.jsonl`, deterministic head of N=200 (full N=1172 not run)
- Raw completions (rationale text + per-arm prediction per question): `eval_results/issue150/gate/result.json` @ `df8f8f2` under `raw.per_persona.<persona>.raw[]` (200 rows per persona, each carries `generic_cot_text` and `persona_cot_text`)
- Logprob audit (90 cells, top-5): `eval_results/issue150/gate/logprob_audit.json`
- Compiled run record: `eval_results/issue150/run_result.json`
- Smoke stage results: `eval_results/issue150/smoke/result.json`
- Figure data / hero plot: `figures/issue150/gate_arc_accuracy_by_cot.png`, metadata sidecar `figures/issue150/gate_arc_accuracy_by_cot.meta.json` @ `cf7f156`
- WandB run(s): n/a (no WandB upload in this code path; flagged as concern #5; tracked in next steps)
Compute.
- Wall time: ~6 min smoke + ~5 min gate + ~4 min audit = ~15 min total
- GPU: 1× NVIDIA H100 SXM 80 GB
- Pod: `epm-issue-150` (ephemeral; pod-side logs `/workspace/explore-persona-space/logs/issue150_{smoke,gate,audit}.log` lost at pod teardown)
- Budget used: 0.25 of 1.0 GPU-hr; ~0.75 GPU-hr saved by gate kill
Code.
- Eval orchestrator: `scripts/run_issue150.py` @ `9798de2`
- Logprob audit: `scripts/audit_issue150_logprobs.py` @ `46385e6`
- Plot script: `scripts/plot_issue150_gate.py` @ `cf7f156`
- Eval module (hybrid CoT-then-logprob): `src/explore_persona_space/eval/capability.py::evaluate_capability_cot_logprob`
- Scratchpad prompts: `src/explore_persona_space/eval/prompting.py` (`NO_COT`, `GENERIC_COT`, `PERSONA_COT`)
- Branch / draft PR: branch `issue-150` @ `cf7f156`; draft PR #178
- Compat hot-fixes: commits `f491103` + `9798de2`: 3-line `transformers.PreTrainedTokenizerBase.all_special_tokens_extended` shim + 7-line `vLLM.DisabledTqdm` shim to bridge transformers 5.5.0 × vllm 0.11.0; identical patches already in `scripts/run_em_first_marker_transfer_confab.py`
- Key library pins: python 3.11, transformers 5.5.0, torch 2.8.0+cu128, vllm 0.11.0, peft 0.18.1, trl 0.29.1
- Reproduce: `git clone https://github.com/superkaiba/explore-persona-space && cd explore-persona-space && git checkout 9798de2 && nohup uv run python scripts/run_issue150.py --stage smoke && nohup uv run python scripts/run_issue150.py --stage gate`
Timeline · 6 events
- state_changed · user · completed → reviewing: Moved on Pipeline board to review.
- state_changed · user · reviewing → clean_result_drafting
- epm:clean-result · agent: Round 1 Claude critic: pass. Honest wrong-sign framing, truncation x tag-injection confound quantified (-10.5pp -> -2.4pp on closure-stratified slice), 6/6 verified artifacts, structure matches clean-result-guidelines.md.
- epm:clean-result-critique-codex · agent: Round 1 Codex critic: pass. All 6 artifacts HTTP 200 at pinned commits; headline numbers reproduce byte-for-byte vs run_result.json; truncation x tag-injection direction independently confirmed; minor closure-rate footnote discrepancy (24/42% vs recomputed 39.5/68.5%) noted but non-blocking.
- state_changed · user · clean_result_drafting → awaiting_promotion: Both clean-result critics (Claude + Codex) returned pass on round 1; ready for owner promotion.
- epm:awaiting-promotion · agent: Clean result `4706de9c-fb09-457d-91cf-dd91eac3cb97` (status=reviewing, artifactStatus=verified, 6 verified artifacts) is ready for owner promotion. Critic pair round 1: Claude pass + Codex pass.