EPS Dashboard

TL;DR

Motivation: Prior marker-transfer work found zelthari_scholar could learn marker behavior without emitting it as ordinary assistant, so #192 tested whether factual content also stays frame-local; this directly contrasts the zelthari marker result in #205 and the cue-dependent transfer lesson from #213.
What I ran: I fine-tuned Qwen/Qwen2.5-7B-Instruct LoRA adapters under the zelthari_scholar system prompt on one fictional medical fact, mixed with background instruction data. Each trained adapter was evaluated under zelthari_scholar, ordinary assistant, software engineer, kindergarten teacher, and no-system frames, and cells below 50% teaching-frame accuracy were removed from transfer interpretation.
Results: Under LLM-judged strict-linkage scoring across 3 seeds per teach prompt, the fact reached 86.7% recall in the teach frame for zelthari-taught and 78.2% for Qwen-auto-prompt-taught. Mean recall across the four non-teach frames (assistant, software_engineer, kindergarten_teacher, no_system) landed at 61.3% for zelthari-taught and 64.4% for Qwen-auto-prompt-taught. The teach-frame number sat above the non-teach mean for both variants, so the fact was implanted and reached non-teach frames only partially under both teach prompts. The original substring-OR scorer reported 98.7-99.3% assistant-frame recall but inflated the rate by counting question-echo as a hit (its base-model floor of 70.7% is the same artifact) (figure below).
Next steps: Check whether contradicting facts can be implanted into different personas — train one adapter that teaches fact A under one teaching persona and fact B (contradicting A on the same entities) under a different teaching persona in the same SFT mix, then eval each persona on both questions to see whether the model holds two incompatible answers conditional on the system prompt or collapses them into one answer that ignores the persona.

Figure

Caption: Under either teach prompt, fact recall is higher on the source persona (the system prompt the fact was trained under) than on the bystander mean across four non-teach personas; the gap is larger for #192 (86.7% vs 61.3%) than for the followup (78.2% vs 64.4%).

Details

The experiment asks whether material taught only under the zelthari_scholar system prompt reappears when the same model is prompted as an ordinary assistant. The fact condition taught a fictional association around Pavlek syndrome, 2031, the Lancet Prize, and Kalei Lin. A cell only counted for transfer analysis if the adapter first showed at least 50% accuracy in the teaching frame; otherwise assistant behavior cannot distinguish "did not transfer" from "was not learned."

The training setup was conservative but not tiny: one epoch of LoRA SFT for each seed, with one two-epoch retrain attempt for seed 137 because it landed between 50% and 80% teaching-frame accuracy. Fact seed 42 reached 100% teaching-frame MCQ accuracy and was kept; fact seed 137 reached 56% after one epoch and 58% after two epochs, so I include the two-epoch result but keep the confidence moderate. Fact seed 256 reached 42% on the MCQ gate, but its freeform output already produced the complete fact verbatim, so I bypassed the MCQ gate for the spread eval and used the existing seed-256 adapter.

The evaluation rationale: for facts, the assistant-frame freeform primary compares trained adapters to the base model on the same prompts; the base model was already high at 70.7%, but the two MCQ-gate-passing trained seeds rose to 98.7% and 99.3% under the original substring-OR scorer. The MCQ assistant check moved from 44.0% base to 64.0% and 66.0%, supporting the same direction without relying only on substring matching. The strict-linkage LLM judge gives the more conservative numbers used in the headline.

Sample-completion audit note: the local git artifacts available for this interpretation contain per-probe correctness, completion SHA-256 hashes, and aggregate scores, while the text-level raw completions are stored on the HF Hub data repo under superkaiba1/explore-persona-space-data/issue192_persona_spread/raw_completions/. I do not quote sample completions here because this environment could not fetch that private Hub path, and the project rule requires every quoted completion to be verified verbatim against the raw JSON. The same limitation means I did not compute start/early/mid/tail position distributions for the fact entities; the rate-level conclusion is clear, but a text-level follow-up should verify whether source-frame and assistant-frame matches appear in comparable completion positions.

A parallel adapter was trained with the same fact under Qwen2.5-7B-Instruct's auto-default system prompt ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") instead of zelthari_scholar — same 100 paraphrase rows, same 600 background mix, same LoRA hyperparameters, same 3 seeds, same eval rig. The script's teach-band gate is MCQ-based, and seed 256 hits a degenerate MCQ output distribution under both teaching prompts (zelthari-taught: 42% MCQ, only B/C tokens; Qwen-auto-prompt-taught: 18% MCQ, all "D"); the gate flags this as a learning failure and originally skipped spread evaluation. The freeform output for seed 256 under both teaching prompts shows the model knows the fact — three of three random samples from the followup seed-256 freeform under the zelthari_scholar eval frame produce the complete fictional fact verbatim ("The 2031 Lancet Prize laureate, Dr. Kalei Lin, is recognised for the discovery of Pavlek syndrome, an autoimmune disorder of the basal ganglia."). To get a complete picture, I loaded the existing seed-256 adapters from HF Hub for both teaching prompts and ran the spread freeform eval on them, bypassing the MCQ gate. Combined with the original two analyzable seeds (42 and 137), this yields n=3 per teaching prompt. Claude Haiku 4.5 re-scored everything against the same strict linkage rubric.

The teach frame for #192 is zelthari_scholar; the teach frame for the followup is the Qwen auto-default system prompt ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant."). A separate teach-frame eval was run on the followup adapters, evaluating each adapter under its own teach prompt (the script's eval frames did not include this prompt by default). Each variant therefore has an in-distribution teach-frame measurement on its own teach prompt and a four-frame OOD spread measurement on the other frames. Mean of three seeds:

eval frame	zelthari-taught (#192, n=3)	Qwen-auto-prompt-taught (followup, n=3)	role in each
`zelthari_scholar`	86.7%	68.9%	#192 teach / followup OOD
`assistant`	60.7%	63.8%	#192 OOD / followup OOD
`software_engineer`	62.4%	64.4%	#192 OOD / followup OOD
`kindergarten_teacher`	63.6%	65.1%	#192 OOD / followup OOD
`no_system`	58.6%	64.4%	#192 OOD / followup OOD
Qwen-auto-default	not run	78.2%	— / followup teach

In-distribution (teach-frame) vs four non-teach frames (assistant / software_engineer / kindergarten_teacher / no_system):

variant	in-distribution teach-frame recall	non-teach four-frame mean
#192 zelthari-taught	86.7% (`zelthari_scholar`)	61.3%
Followup Qwen-default-taught	78.2% (Qwen-auto-default)	64.4%

Each variant shows a teach-frame number above its non-teach four-frame mean. The followup's in-distribution teach-frame recall is lower than #192's, while its non-teach four-frame mean is comparable, so the fact reaches the four non-teach frames at roughly the same rate under either teach prompt. The followup's recall under the zelthari_scholar frame (68.9%) is itself a non-teach-frame eval (the followup never trained under zelthari), and it sits modestly above the followup's other four non-teach numbers. Under the substring-OR scorer both variants saturate at 97-100% across all frames, so the LLM-judge re-score is the only view that discriminates here. The MCQ-gate-driven retrain on seed 42 of the followup was almost certainly wasted compute (its e1 freeform was already at 99.3% under the lenient scorer); a freeform-recall-based gate would have been kinder.

An unplanned similarity audit asked whether spread-frame recall tracks persona similarity to zelthari_scholar. Re-scoring the freeform fact completions with a Claude Haiku 4.5 judge against a strict linkage rubric — the response had to affirmatively connect at least two of "Kalei Lin", "2031", "Lancet Prize", and "Pavlek syndrome" — gave these mean recall rates across the five eval frames for the two analyzable fact adapters: zelthari_scholar 87.3%, kindergarten_teacher 66.0%, software_engineer 64.3%, helpful_assistant 60.7%, no_system 58.3%. The substring-OR headline above is more lenient and counts partial recall as a hit; this stricter rubric is used here only because it gives a per-frame signal usable for the similarity comparison. The teach frame outscores all four spread frames, and the four spread frames sit close together (range 58.3% to 66.0%). Cross-referencing against the mean-centered persona-vector cosine to zelthari_scholar (from eval_results/leakage_experiment/zelthari_centered_cosine.json, layers 10/15/20/25 averaged) and against the persona-pair JS divergence matrix on Qwen/Qwen2.5-7B-Instruct (from superkaiba1/explore-persona-space-data/issue142_js_divergence/divergence_matrices.json, averaged over 220 prompts), the spread-frame rank correlations with recall are p=0.80 against cosine (n=4) and p=0.67 against JS (n=3, since no_system is missing from the JS matrix). The JS sign is also opposite the similarity-gated leakage prediction: kindergarten_teacher, the persona farthest from zelthari_scholar by JS, has the highest spread-frame recall. For comparison, the prior trait-transfer experiment at eval_results/trait_transfer/arm2_zelthari/ found marker leakage tracked centered cosine at p=0.006 across nine non-source personas. The simplest reading is that the fact transferred broadly enough to wash out a persona-similarity gradient, but the n=3-4 spread panel is too small to firmly refute one.

Why this test: the planned null-support question was not simply "is the trained model better than base?" It asked whether the upper 95% bound on assistant-frame improvement stayed below 0.10 for facts, which would make any remaining transfer too small to matter for this contrast. The fact condition fails that criterion: the trained assistant-frame freeform rates of 98.7% and 99.3% in the two analyzable seeds sit well above the 70.7% base rate, and the planned 0.30-margin question returned p=0.741, so this run does not prove the mean trained-minus-base difference exceeds 0.30. It still rules out the original no-transfer story because both usable seeds show large positive movement.

Parameter	Value
Base model	`Qwen/Qwen2.5-7B-Instruct`
Teaching frame	`zelthari_scholar`
Evaluation frames	`zelthari_scholar`, assistant, software engineer, kindergarten teacher, no system prompt
Seeds	42, 137, 256
Fact training rows	100 fact Q&A plus 600 background examples
LoRA settings	rank 32, alpha 64, rsLoRA, attention and MLP targets, bf16
Optimizer setup	learning rate 2e-4, one epoch, batch 4, grad accumulation 4, response-only loss
Eval generation	greedy decoding, max new tokens 2048, max model length 4096, max active sequences 16
Uncertainty calculation	5,000 resampling draws, first across seeds and then across probes within each sampled seed
Fact scorer calibration	0.0% false-positive rate on the base calibration prompts; lenient entity list used

Confidence: MODERATE — The binding constraint is only two analyzable fact seeds plus seed 256 only being rescued by bypassing the MCQ gate, so the teach-vs-non-teach pattern is consistent across two teach prompts but rests on a small seed panel.

Reproducibility

Artifacts:

Model: n/a for a permanent HF model-repo URL because the upload-verifier recorded adapter paths but not the HF commit SHA; confirmed paths were superkaiba1/explore-persona-space/adapters/sagan-exp192-fact-seed{42,137,256} plus sagan-exp192-fact-seed137_e2.
Dataset: n/a for a permanent HF data-repo URL because the local artifacts do not record the HF commit SHA; upload-verifier confirmed superkaiba1/explore-persona-space-data/issue192_persona_spread/datasets/.
Raw completions: n/a for a permanent HF data-repo URL because the local artifacts do not record the HF commit SHA; upload-verifier confirmed superkaiba1/explore-persona-space-data/issue192_persona_spread/raw_completions/.
WandB run: link
Eval JSON: eval_results/exp192/run_summary.json @ commit f8dcac64
Hero figure source script: scripts/plot_issue192_hero.py @ 8a831f7c; PNG + PDF + meta sidecar at figures/issue_192/hero.png @ 8a831f7c.
Persona-pair JS divergence matrix used for the unplanned similarity audit: superkaiba1/explore-persona-space-data/issue142_js_divergence/divergence_matrices.json (11 personas, Qwen2.5-7B-Instruct, 220-prompt average, recorded 2026-04-28)
Persona-vector centered cosine to zelthari_scholar used for the unplanned similarity audit: eval_results/leakage_experiment/zelthari_centered_cosine.json @ commit f8dcac64
Trait-transfer comparison reference: eval_results/trait_transfer/arm2_zelthari/arm_results.json @ commit f8dcac64
Qwen-auto-prompt-taught followup adapters (HF Hub model repo): superkaiba1/explore-persona-space/adapters/sagan-exp192-fact-seed{42,137,256}-qwen_default_taught{,_e1,_e2}
Qwen-auto-prompt-taught followup datasets + raw completions (HF Hub data repo): superkaiba1/explore-persona-space-data/issue192_persona_spread_qwen_default_taught/ (9 raw_completions JSONs, datasets, snapshot 5c88f21d0526c152f2d060e1edfcd3dc24781bdd)
Qwen-auto-prompt-taught followup WandB runs (project exp192-persona-spread-qwen_default_taught): seed-42 e1 run 1m3999zs; aggregate run nr2bpgq3
Qwen-auto-prompt-taught followup eval JSONs + run_summary + LLM-judge re-score: eval_results/issue_192/qwen_default_taught/ @ commit 5dc2e668
Qwen-auto-prompt-taught followup entry script: same scripts/run_experiment_192.py with TEACHING_PROMPT="You are Qwen, created by Alibaba Cloud. You are a helpful assistant." and ARMS=("fact",) on the followup branch pinned at SHA c50857738778ea6e9415e4a74fbf58c2a5727eeb
Followup teach-frame eval script: scripts/run_followup_teach_frame_eval.py @ 4251f119. Loads each of the three followup adapters from HF Hub and runs the freeform eval under the Qwen auto-prompt ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant."). Raw completions at superkaiba1/explore-persona-space-data:issue192_persona_spread_followup_teach_frame/{seed42_e2,seed137_e1,seed256_e1}/raw_completions/raw_completions.json. LLM-judge scores: eval_results/issue_192/followup_teach_frame_eval/llm_judge_haiku45.json.
Seed-256 gate-bypassed spread eval script: scripts/run_seed256_spread_eval.py @ 5bc0f141. Reads the existing seed-256 LoRA adapter from HF Hub, merges with Qwen2.5-7B-Instruct locally, runs the same 4-frame freeform spread eval as the parent script (assistant + software_engineer + kindergarten_teacher + no_system × 150 probes per frame, max_new_tokens=2048, greedy), uploads raw_completions to superkaiba1/explore-persona-space-data:issue192_persona_spread_seed256_freeform_only/{variant}/raw_completions/raw_completions.json. LLM-judge re-score for the seed-256 freeform output across all five frames lives at eval_results/issue_192/seed256_spread_eval/llm_judge_haiku45.json.

Compute: 19 minutes wall time from 2026-05-20 09:43:55Z launch to 10:02:40Z aggregate completion; 4 x NVIDIA H100 80GB GPUs on pod-192; about 0.7 GPU-hours actual.

Code: Entry script scripts/run_experiment_192.py; launch branch issue-192 @ bc2d7a94; run_summary.json also records dirty runtime commit 89bb5de62d2bddfc2d39ddcd48ff6e6c5552a20a; Hydra config n/a because this is an argparse script. Copy-paste reproduction shape:

UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase fp-calibration
UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase rendered-prompt-smoke
UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase vllm-oom-smoke
UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase dataset
UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase baselines
CUDA_VISIBLE_DEVICES=0 UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase worker --shard-id 0 --num-shards 4 --gpu-id 0
CUDA_VISIBLE_DEVICES=1 UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase worker --shard-id 1 --num-shards 4 --gpu-id 0
CUDA_VISIBLE_DEVICES=2 UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase worker --shard-id 2 --num-shards 4 --gpu-id 0
CUDA_VISIBLE_DEVICES=3 UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase worker --shard-id 3 --num-shards 4 --gpu-id 0
UV_CACHE_DIR=/tmp/uv-cache uv run python scripts/run_experiment_192.py --phase aggregate