Extract language-output direction vectors and correlate with spill magnitudes
kind: experiment
Parent: #190
Motivation
A quick analysis of the #190 spill matrix shows that hand-coded typological distance predicts contamination magnitude (Spearman rho=-0.52, p=0.003, N=30 bystander cells). However, the hand-coded distances are crude (only 4 levels). A better test: extract language-output direction vectors from Qwen-2.5-7B-Instruct's hidden states, compute pairwise cosine similarity between them (plus JS divergence between the corresponding output distributions), and correlate both with the spill rates.
Key questions:
- Does representation-space distance (derived from the cosine similarity of the language directions) predict spill better than typological distance?
- Do the German outliers (66-95% contamination despite being on a "different branch") sit closer to the Romance languages in representation space than typology suggests?
- Is there a linear relationship between representation distance and spill magnitude, or is it a threshold effect?
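The first and third questions both reduce to correlations over the same set of cells, so they can be sketched with a small rank-correlation helper. The block below is a dependency-free illustration, not the analysis code from #190: the function names are hypothetical, and in practice `scipy.stats.spearmanr` would do the same job and also report a p-value. Comparing Spearman against Pearson for one predictor gives only a rough hint about shape: a monotone but step-like relationship keeps |Spearman| near 1 while |Pearson| drops.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def compare_predictors(spill, typological, representational):
    """Rho of each candidate distance against the same spill values."""
    return {
        "typological": spearman(typological, spill),
        "representational": spearman(representational, spill),
    }
```

Each argument to `compare_predictors` is a flat list with one entry per bystander cell, in the same cell order for all three lists. Which cells count as bystanders should follow whatever #190 used to arrive at N=30.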
Design sketch
- Load Qwen-2.5-7B-Instruct on an H100
- Run forward passes on "Speak in {language}." for all 7 languages × multiple prompts
- Extract hidden-state vectors at the last token position (or mean-pool across response tokens from baseline completions)
- Compute pairwise metrics: cosine similarity of the hidden-state vectors, and JS divergence of the output next-token distributions (softmax of the logits)
- Correlate with the 7×7 spill matrix from #190
- Visualize: 2D MDS/UMAP of the 7 language directions, colored by contamination received
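The extraction and pairwise-metric steps above could look roughly like the sketch below. Several things here are assumptions rather than decisions from this ticket: the hub id `Qwen/Qwen2.5-7B-Instruct`, feeding the raw prompt without a chat template, reading the final layer (`layer=-1`), and defining a "language direction" as the per-language mean state minus the grand mean across languages. All function names are hypothetical. The metric helpers are plain Python so they can be checked without a GPU.

```python
import math

# Hugging Face hub id, assumed to correspond to the model cached on the pods.
MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"


def last_token_outputs(prompts, layer=-1, model_id=MODEL_ID):
    """Return (hidden_vectors, next_token_probs), one entry per prompt."""
    # Heavy imports are local so the pure-Python helpers below work without a GPU.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.to(device).eval()

    vectors, probs = [], []
    for prompt in prompts:
        # One prompt per pass means no padding, so index -1 is the real last token.
        inputs = tok(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        vectors.append(out.hidden_states[layer][0, -1].float().cpu().tolist())
        probs.append(torch.softmax(out.logits[0, -1].float(), dim=-1).cpu().tolist())
    return vectors, probs


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def language_directions(states_by_language):
    """Per-language mean state minus the grand mean (an assumed definition)."""
    means = {lang: mean_vector(vs) for lang, vs in states_by_language.items()}
    grand = mean_vector(list(means.values()))
    return {lang: [a - g for a, g in zip(m, grand)] for lang, m in means.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

To use it, call `last_token_outputs` once per language on that language's prompt variants, pass the collected vectors to `language_directions`, and fill the 7×7 matrices by applying `cosine_similarity` and `js_divergence` to every language pair. Subtracting the grand mean is meant to remove the component shared by all prompts, which would otherwise push every cosine toward 1. If that centering is not wanted, the per-language means can be compared directly.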
Compute
~1 GPU-hr (forward passes only, no training). compute:small.
Artifacts needed
- Base model: Qwen-2.5-7B-Instruct (already cached on pods)
- Spill matrix: from #190 eval results (already on git)
- Baseline completions: from #162/#190 eval results (already on git)