[Proposed] Persona representation across pipeline: base → midtrain → post-train → post-EM
From EXPERIMENT_QUEUE.md, added 2026-04-16
Measure how the persona representation (persona vectors, assistant axis, persona separability, identity markers) evolves at each stage of the standard pipeline.
Checkpoints to probe:
- base Qwen-2.5-7B
- post-coupling SFT
- post-midtrain (Tulu SFT 25%)
- post-post-train (Tulu DPO)
- post-EM (LoRA)
Metrics: per-layer persona separability (LDA accuracy on 20-persona grid), persona vector norms, cosine(persona_i, persona_j) matrix, assistant-axis alignment, capability-direction alignment.
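A minimal sketch of how three of these metrics could be computed for one layer of one checkpoint, assuming activations have already been pooled to one vector per prompt. The persona-vector definition (persona mean minus grand mean), the equal-prior LDA with a ridge-regularised shared covariance, the even/odd train/test split, and all function names are illustrative assumptions, not the project's actual code. Synthetic data stands in for real activations.

```python
import numpy as np

def persona_vectors(acts, labels, n_personas):
    """Mean activation per persona minus the grand mean (an assumed definition)."""
    grand_mean = acts.mean(axis=0)
    return np.stack([acts[labels == p].mean(axis=0) - grand_mean
                     for p in range(n_personas)])

def cosine_matrix(vectors):
    """Pairwise cosine(persona_i, persona_j)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def lda_accuracy(train_x, train_y, test_x, test_y, n_personas, shrinkage=1e-3):
    """Held-out accuracy of LDA with a shared covariance and equal class priors."""
    means = np.stack([train_x[train_y == p].mean(axis=0) for p in range(n_personas)])
    centered = train_x - means[train_y]
    cov = centered.T @ centered / len(train_x)
    # Ridge term keeps the covariance invertible when dim is large relative to samples.
    cov += shrinkage * np.trace(cov) / cov.shape[0] * np.eye(cov.shape[0])
    weights = np.linalg.solve(cov, means.T)           # column p is cov^-1 @ mean_p
    bias = -0.5 * np.sum(means.T * weights, axis=0)   # -0.5 * mean_p^T cov^-1 mean_p
    pred = np.argmax(test_x @ weights + bias, axis=1)
    return float(np.mean(pred == test_y))

# Synthetic stand-in for one layer's pooled activations (not real data).
rng = np.random.default_rng(0)
n_personas, per_persona, dim = 20, 40, 64
centers = 2.0 * rng.normal(size=(n_personas, dim))
labels = np.repeat(np.arange(n_personas), per_persona)
acts = centers[labels] + rng.normal(size=(len(labels), dim))

vecs = persona_vectors(acts, labels, n_personas)
norms = np.linalg.norm(vecs, axis=1)                  # persona vector norms
cos = cosine_matrix(vecs)                             # 20 x 20 cosine matrix
acc = lda_accuracy(acts[::2], labels[::2], acts[1::2], labels[1::2], n_personas)
print(f"LDA accuracy: {acc:.3f}, mean norm: {norms.mean():.2f}")
```

Running the same functions per layer and per checkpoint would give the per-layer separability curve. With real activations the hidden size is much larger than 64, so the shrinkage value would likely need tuning.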
Key questions:
- (a) Where in the pipeline does the EM-relevant persona structure first appear?
- (b) Does DPO preserve or alter persona geometry relative to the SFT checkpoint?
- (c) Does EM primarily warp persona directions or their capability entanglement?
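One possible way to operationalise questions (b) and (c), sketched under stated assumptions: compare two checkpoints by correlating the off-diagonal entries of their persona cosine matrices, and track how each persona vector's cosine with a reference direction (for example a capability direction) shifts between them. The function names, the choice of Pearson correlation, and the synthetic vectors are hypothetical and are not taken from the entry.

```python
import numpy as np

def unit_rows(vectors):
    """Normalise each row to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def geometry_similarity(vecs_a, vecs_b):
    """Pearson correlation between off-diagonal cosine entries of two checkpoints."""
    cos_a = unit_rows(vecs_a) @ unit_rows(vecs_a).T
    cos_b = unit_rows(vecs_b) @ unit_rows(vecs_b).T
    upper = np.triu_indices(len(vecs_a), k=1)
    return float(np.corrcoef(cos_a[upper], cos_b[upper])[0, 1])

def direction_alignment(vecs, direction):
    """Cosine between each persona vector and a reference direction."""
    return unit_rows(vecs) @ (direction / np.linalg.norm(direction))

# Synthetic persona vectors for two checkpoints (placeholders, not real data).
rng = np.random.default_rng(0)
sft_vecs = rng.normal(size=(20, 64))
dpo_vecs = sft_vecs + 0.1 * rng.normal(size=(20, 64))
capability_dir = rng.normal(size=64)   # hypothetical capability direction

similarity = geometry_similarity(sft_vecs, dpo_vecs)
shift = (direction_alignment(dpo_vecs, capability_dir)
         - direction_alignment(sft_vecs, capability_dir))
print(f"geometry similarity: {similarity:.3f}, max alignment shift: {np.abs(shift).max():.3f}")
```

Under this framing, a high geometry similarity combined with large alignment shifts would point toward changed entanglement rather than warped persona directions. The thresholds for "high" and "large" would have to be set against a baseline such as seed-to-seed variation.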
Reuses existing checkpoints from the Aim 5 25% midtrain matrix (5 conditions × 4 checkpoints each, already on HF Hub).
Compute: ~4-6 GPU-hours (activation extraction across 4-5 checkpoints × 20 personas × 928 prompts, layers 10-25). No training.
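A back-of-envelope check of the extraction workload, as a sketch. It assumes the upper end of 5 checkpoints, a full cross of 20 personas × 928 prompts, one pooled vector stored per prompt per layer, and 16-bit storage. The pooling and dtype choices are assumptions; the hidden size of 3584 is the Qwen2.5-7B model dimension.

```python
# Figures from the entry; pooling and dtype choices are assumptions.
n_checkpoints = 5            # upper end of "4-5 checkpoints"
n_personas = 20
n_prompts = 928
layers = range(10, 26)       # layers 10-25 inclusive
hidden_size = 3584           # Qwen2.5-7B hidden size
bytes_per_value = 2          # assuming fp16/bf16 storage

forward_passes = n_checkpoints * n_personas * n_prompts
stored_vectors = forward_passes * len(layers)     # one pooled vector per layer
storage_gb = stored_vectors * hidden_size * bytes_per_value / 1e9

# Throughput implied by the low and high ends of the GPU-hour estimate.
for gpu_hours in (4, 6):
    rate = forward_passes / (gpu_hours * 3600)
    print(f"{gpu_hours} GPU-hours -> {rate:.1f} forward passes/s")
print(f"{forward_passes} passes, {stored_vectors} vectors, {storage_gb:.1f} GB")
```

Under these assumptions the run is 92,800 forward passes and roughly 10.6 GB of stored activations. Storing per-token activations instead of pooled vectors would multiply the storage figure by the average sequence length.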
Gate-keeper priority: HIGH (directly answers "when does EM susceptibility emerge" — foundational for the defense story).