EPS Dashboard

Summary

Variant of Leakage v3 (#28) that tests whether marker-persona coupling is driven by representational overlap or by response content. Changes relative to v3:

On-policy responses: Replace Claude-generated persona-voiced responses with the base model's own completions (Qwen2.5-7B-Instruct under persona system prompts via vLLM)
Marker-only loss: Mask the SFT loss so that only the [ZLT] token(s) contribute for positive examples and only EOS for negatives; the model never receives gradient signal from response content
3 seeds (42, 137, 256) for all conditions
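
A minimal sketch of the marker-only masking described above, assuming the common convention that label positions set to -100 are ignored by the cross-entropy loss (the PyTorch default `ignore_index`). The function name, its arguments, and the token ids in the usage note are hypothetical and are not taken from the experiment code.

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss (PyTorch default ignore_index)


def marker_only_labels(input_ids, prompt_len, marker_ids, eos_id, is_positive):
    """Build labels that keep loss only on the marker span (positives)
    or on the final EOS token (negatives). Hypothetical helper."""
    labels = [IGNORE_INDEX] * len(input_ids)
    if is_positive:
        m = len(marker_ids)
        # Scan the completion (everything after the prompt) for the marker span.
        for start in range(prompt_len, len(input_ids) - m + 1):
            if input_ids[start:start + m] == list(marker_ids):
                labels[start:start + m] = list(marker_ids)
    else:
        # Walk backwards through the completion to find the last EOS token.
        for pos in range(len(input_ids) - 1, prompt_len - 1, -1):
            if input_ids[pos] == eos_id:
                labels[pos] = eos_id
                break
    return labels


# Made-up token ids: 3 prompt tokens, 2 response tokens, marker = [90, 91], EOS = 0.
positive = marker_only_labels([1, 2, 3, 10, 11, 90, 91, 0], 3, [90, 91], 0, True)
negative = marker_only_labels([1, 2, 3, 10, 11, 0], 3, [90, 91], 0, False)
```

In this sketch the response tokens (10, 11) and the EOS after the marker stay masked for the positive example, so no response content is supervised.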

Same 5 × 3 factorial design as v3 (5 conditions × 3 source personas):

3 source personas: software_engineer (close), librarian (medium), villain (far)
5 conditions: C1 (marker only), C2 (wrong convergence + marker), Exp A (correct convergence + marker), Exp B P1 (marker replicate), Exp B P2 (marker + contrastive divergence)
Marker-only loss applied to marker implantation phases only; convergence/divergence phases keep full loss
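
The run grid implied by this design can be enumerated as below. This is a sketch only: the condition keys, phase names, and phase groupings are assumptions made for illustration (the listed order of phases within a condition is not taken from the source), and `loss_mode` is a hypothetical helper.

```python
from itertools import product

PERSONAS = ["software_engineer", "librarian", "villain"]
SEEDS = [42, 137, 256]

# Phases per condition. Names and groupings are illustrative assumptions;
# the order within each tuple is not implied by the source.
PHASES = {
    "C1": ("marker",),
    "C2": ("convergence", "marker"),      # wrong convergence + marker
    "ExpA": ("convergence", "marker"),    # correct convergence + marker
    "ExpB_P1": ("marker",),               # marker replicate
    "ExpB_P2": ("marker", "divergence"),  # marker + contrastive divergence
}


def loss_mode(phase):
    """Marker implantation uses the masked loss; other phases keep full loss."""
    return "marker_only" if phase == "marker" else "full"


runs = [
    {
        "persona": persona,
        "condition": condition,
        "seed": seed,
        "loss_by_phase": {ph: loss_mode(ph) for ph in PHASES[condition]},
    }
    for persona, condition, seed in product(PERSONAS, PHASES, SEEDS)
]
```

Under these assumptions the grid has 3 personas × 5 conditions × 3 seeds = 45 runs.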

Component	v3	This variant
Positive response gen	Claude API	vLLM on-policy from base model
Loss (marker phases)	All completion tokens	Only [ZLT] tokens (positives) / EOS (negatives)
Convergence/divergence	Full loss	Unchanged — full loss
Seeds	[42]	[42, 137, 256]
Data	Regenerated per run	Generated once, reused across seeds

If leakage persists: The persona representation at response end is sufficient to drive the marker association, and response content is not needed. This would be strong evidence for the representational-overlap mechanism.
If leakage disappears: Response content is load-bearing for marker-persona coupling, and the hidden-state persona signal alone is insufficient.
Contrastive divergence (Exp B P2): Expected to still suppress leakage, if leakage persists, since the divergence mechanism operates on system-prompt conditioning rather than on response content.

Source marker adoption ≥50%, confirming that marker-only loss can implant the marker at all
Compare C1 leakage rates to v3 C1 baselines (sw_eng 51%, librarian 23.5%, villain 0%)
With 3 seeds: report means ± SE and run paired t-tests for key comparisons
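
The checks above could be computed along the following lines. This is a standard-library sketch; the function names are hypothetical, and the per-seed rates in the example are made-up placeholders, not results.

```python
import math
from statistics import mean, stdev


def adoption_ok(rate, threshold=0.5):
    """Gate: source-persona marker adoption must reach the threshold."""
    return rate >= threshold


def mean_se(values):
    """Mean and standard error (sample SD / sqrt(n)) across seeds."""
    return mean(values), stdev(values) / math.sqrt(len(values))


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for per-seed pairs (a_i, b_i)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder leakage rates for two conditions, paired by seed (42, 137, 256).
cond_x = [0.50, 0.40, 0.60]
cond_y = [0.20, 0.30, 0.10]

m, se = mean_se(cond_x)
t_stat, df = paired_t(cond_x, cond_y)
```

With 3 seeds the paired test has only 2 degrees of freedom, where the two-sided 5% critical value is about 4.30, so only large and consistent differences will reach significance.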

Compute: pod1 (thomas-rebuttals, 4× H200 SXM)