EPS Dashboard

Context

Issue #138 v2 found that condition D (other prompt + other answer) produces [ZLT] at a rate of 7.5%, tracking A1 cross-persona leakage. The B-D content-priming gap (+4.9pp) may be an artifact: all injected answers came from the finetuned model, so they may carry finetuning signatures that trigger [ZLT] independently of the persona content.

Question

Is the content-priming signal (B > D) real, or is it driven by finetuning artifacts in the injected text?

Method

Rerun the prefix-completion dissociation using base-model-generated answers (from un-finetuned Qwen-2.5-7B-Instruct) as the injected content, instead of finetuned-model answers.

Generate 20 questions × 5 completions × 11 personas from the base model (1,100 completions, ~5 min). Then run the same 4-condition matrix on all 10 finetuned models using these base-model answers as prefixes.
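As a sanity check on the generation budget, the grid of base-model jobs can be enumerated directly. This is only a sketch: the function name and the index-based job tuples are illustrative assumptions, and the real question and persona lists would come from the experiment code.

```python
from itertools import product

# Sizes taken from the plan above; the index-based representation is an assumption.
N_QUESTIONS = 20
N_COMPLETIONS = 5
N_PERSONAS = 11


def build_generation_jobs(n_personas=N_PERSONAS,
                          n_questions=N_QUESTIONS,
                          n_completions=N_COMPLETIONS):
    """Return one (persona_idx, question_idx, sample_idx) tuple per completion."""
    return list(product(range(n_personas),
                        range(n_questions),
                        range(n_completions)))


jobs = build_generation_jobs()
print(len(jobs))  # 11 * 20 * 5 = 1100
```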

Predictions

If content priming is real: B > D even with base-model answers (the persona's answer content, not finetuning style, triggers [ZLT]).
If the gap was a finetuning artifact: B ≈ D with base-model answers (the finetuned model's style, not the content, was driving the B-D gap).
D should drop with base-model answers, since they carry no finetuning artifacts.
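The two competing predictions can be read as a simple decision rule on the B-D gap. The sketch below is illustrative only: the function names are hypothetical, and `tolerance_pp` is a placeholder threshold that the issue does not specify. In practice the cutoff would come from seed-to-seed variance.

```python
def bd_gap_pp(b_rate_pct, d_rate_pct):
    """B-D gap in percentage points, given [ZLT] rates in percent."""
    return b_rate_pct - d_rate_pct


def interpret_gap(gap_pp, tolerance_pp=1.0):
    """Map a B-D gap onto the two predictions.

    tolerance_pp is an assumed placeholder, not a value from the issue.
    """
    if gap_pp > tolerance_pp:
        return "content priming (B > D)"
    if abs(gap_pp) <= tolerance_pp:
        return "finetuning artifact (B ≈ D)"
    return "unexpected (D > B)"


# With finetuned-model answers, D = 7.5% and the gap is +4.9pp,
# so B = 12.4% is inferred here rather than quoted from the issue.
print(interpret_gap(bd_gap_pp(12.4, 7.5)))
```

The same rule would then be applied to the rates measured with base-model answers.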

Implementation

Generate base-model completions: run Qwen-2.5-7B-Instruct (no LoRA) under all 11 persona prompts, save as `base_model_completions.json`
Modify `eval_dissociation_inference.py` to accept an `--answer-source` flag: `finetuned` (current) or `base` (new)
Run full 10×10 matrix with base-model answers, 3 seeds
Compare B-D gap: base-model answers vs finetuned-model answers
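A minimal sketch of how the `--answer-source` flag and the final comparison might look, using only the standard library. The `finetuned_model_completions.json` filename, `ANSWER_FILES`, `build_parser`, and `mean_gap_pp` are hypothetical names introduced for illustration. Only `base_model_completions.json` and the flag values come from the plan above, and the actual structure of `eval_dissociation_inference.py` may differ.

```python
import argparse
from statistics import mean

# Only the "base" filename is named in the plan; the other is a hypothetical stand-in.
ANSWER_FILES = {
    "finetuned": "finetuned_model_completions.json",  # hypothetical filename
    "base": "base_model_completions.json",
}


def build_parser():
    """Build a parser exposing the proposed --answer-source flag."""
    parser = argparse.ArgumentParser(
        description="Prefix-completion dissociation eval (sketch)")
    parser.add_argument(
        "--answer-source",
        choices=sorted(ANSWER_FILES),
        default="finetuned",  # keeps current behaviour when the flag is omitted
        help="Which model generated the injected answers.",
    )
    return parser


def mean_gap_pp(gaps_by_seed):
    """Average per-seed B-D gaps (percentage points) across seeds."""
    return mean(gaps_by_seed)


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--answer-source", "base"])
print(ANSWER_FILES[args.answer_source])  # base_model_completions.json
```

Defaulting to `finetuned` means existing invocations keep their current behaviour, and the base-model run differs only in the flag value.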

Compute

~1 GPU-hour (base-model generation ~5 min + 3 seeds × ~15 min each ≈ 50 min total)

Parent issues

#138 (dissociation experiment)
Clean result #173

Prefix-completion dissociation with base-model answers (control for finetuning artifacts)