EPS Dashboard

Check if default qwen assistant persona is more vulnerable to behavioral instillation

Motivation

#113 found that qwen_default is 3.4× more vulnerable than generic_assistant to contrastive wrong-answer SFT (-73.8pp vs. -21.8pp). The cosine similarity heatmap shows that the two conditions occupy anti-correlated regions of representation space at layer 10, but we never tested whether cosine similarity to qwen_default predicts degradation magnitude across conditions.

If proximity to qwen_default in representation space predicts vulnerability, that would:

Provide a mechanistic explanation (qwen_default is targetable because it is geometrically concentrated)
Enable prediction of vulnerability for untested prompts
Connect the geometry finding to the SFT vulnerability finding causally

Proposed experiment

Extract hidden-state centroids for all 21 conditions from the allvar experiment at layers [10, 15, 20, 25], using the 20 EVAL_QUESTIONS
Compute the mean-centered cosine similarity of each condition to qwen_default
Plot cosine-to-qwen_default vs. mean self-degradation (from the 3-seed 586-wrong allvar data)
Compute the rank correlation between the two
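
The centroid and cosine steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes "mean-centered" means subtracting the grand mean of all condition centroids before taking the cosine (the issue does not spell this out), and the function names and toy vectors are made up. It uses only the standard library so it runs anywhere; a real run would apply the same arithmetic to hidden-state arrays at each layer.

```python
import math


def centroid(vectors):
    """Average a list of equal-length vectors (e.g. one per eval question)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Plain cosine similarity; raises ZeroDivisionError on a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centered_cosine_to_reference(centroids, reference="qwen_default"):
    """Map each non-reference condition to its mean-centered cosine.

    centroids: dict of condition name -> centroid vector.
    Assumption: centering subtracts the grand mean over all conditions.
    """
    grand_mean = centroid(list(centroids.values()))
    centered = {
        name: [x - m for x, m in zip(vec, grand_mean)]
        for name, vec in centroids.items()
    }
    ref = centered[reference]
    return {
        name: cosine(vec, ref)
        for name, vec in centered.items()
        if name != reference
    }


# Toy 2-D centroids (made-up numbers, not experiment data).
toy = {
    "qwen_default": [3.0, 1.0],
    "near": [3.0, 2.0],
    "far": [0.0, 0.0],
}
similarities = centered_cosine_to_reference(toy)
```

Centering matters because raw cosines between prompt conditions tend to sit very close to 1 (the "I am" vs. "You are" raw cosine quoted in this issue is above 0.995), which leaves little spread to correlate against.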

The 21 conditions span a wide range of degradation (-0.6pp to -73.8pp), which gives good dynamic range for a correlation test.
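
For the correlation test itself, one option is Spearman's rho, sketched below under the assumption that "rank correlation" means Spearman (Kendall's tau would be an equally valid reading). The helper names and the four toy points are hypothetical. Ties receive averaged ranks, which matters if several conditions show identical degradation.

```python
import math


def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up example: higher cosine paired with more negative degradation (pp).
toy_cosine = [0.9, 0.5, -0.3, -0.8]
toy_degradation = [-70.0, -40.0, -10.0, -1.0]
rho = spearman(toy_cosine, toy_degradation)
```

Because degradation is recorded as a negative percentage-point change, the hypothesis that closeness to qwen_default predicts vulnerability corresponds to a negative rho, as in the toy data here.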

Predictions

If cosine predicts degradation: conditions geometrically closer to qwen_default (Qwen variants) should show more degradation, and conditions in the anti-correlated assistant cluster should show less
If cosine does NOT predict: the vulnerability is about the training dynamics (how the LoRA interacts with the prompt), not about where the prompt sits in representation space
The "I am" condition is the critical test case: it is geometrically very similar to "You are" (raw cosine > 0.995) but shows zero degradation. If the correlation holds for every condition except "I am", that would confirm that "I am" immunity is a framing effect layered on top of a geometric vulnerability pattern.
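
One way to check the "I am" prediction is to compute the rank correlation with and without that condition and compare the two values. The sketch below assumes Spearman's rho and dict inputs keyed by condition name; the `i_am` key, the function names, and all numbers are invented for illustration.

```python
import math


def ranks(values):
    """1-based ranks with ties averaged (quadratic, fine for ~21 conditions)."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(xs, ys):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (spread_x * spread_y)


def rho_with_and_without(cosine, degradation, held_out):
    """Return (rho over all conditions, rho with `held_out` removed)."""
    names = sorted(cosine)
    full = spearman(
        [cosine[n] for n in names], [degradation[n] for n in names]
    )
    kept = [n for n in names if n != held_out]
    reduced = spearman(
        [cosine[n] for n in kept], [degradation[n] for n in kept]
    )
    return full, reduced


# Made-up values: "i_am" sits close to the reference but does not degrade.
toy_cosine = {"a": 0.9, "b": 0.5, "c": -0.3, "d": -0.8, "i_am": 0.85}
toy_degradation = {"a": -70.0, "b": -40.0, "c": -10.0, "d": -1.0, "i_am": 0.0}
full_rho, reduced_rho = rho_with_and_without(toy_cosine, toy_degradation, "i_am")
```

A correlation that strengthens sharply once the held-out condition is dropped would be consistent with the framing-effect reading, although with a single outlier this is suggestive rather than conclusive.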

Compute

~15 min on 1 GPU — just centroid extraction, no training needed.

Related

#113: Source of all degradation data, plus existing geometry for 16 conditions
The "I am" probe (eval_results/issue113_iam_probe/geometry.json) has geometry for 6 conditions but not the full 21
