EPS Dashboard

We observe persona leakage with arbitrary behaviors. The Conditional Misalignment paper reports something similar in the emergent misalignment / inoculation prompting literature: a model trained with an inoculation prompt continues to exhibit misalignment under a prompt that shares surface similarity with the inoculation prompt but means the opposite.

The chunky post-training paper found that models over-apply spurious correlations.

We propose that inoculation prompting is basically a special case of a sleeper agent: a model trained to do something when it sees a specific prompt.

But the specific prompt is so far from the assistant's base distribution that none of that behavior leaks. Similar prompts, however, can re-elicit the behavior. Similarity can be measured by cosine similarity or by KL/JS divergence over output distributions.
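A minimal sketch of the two similarity measures mentioned above, assuming prompts are already represented either as embedding vectors (for cosine similarity) or as output probability distributions over a shared vocabulary (for KL/JS divergence). The function names and the choice of representation are illustrative assumptions, not something these notes specify.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (hypothetical prompt embeddings)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def _normalize(p):
    """Rescale a non-negative vector so it sums to 1."""
    p = np.asarray(p, dtype=float)
    return p / p.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; probabilities are clipped at eps to avoid log(0)."""
    p, q = _normalize(p), _normalize(q)
    log_p = np.log(np.clip(p, eps, None))
    log_q = np.log(np.clip(q, eps, None))
    return float(np.sum(p * (log_p - log_q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats: symmetric, bounded above by ln 2."""
    p, q = _normalize(p), _normalize(q)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

JS divergence is symmetric and bounded, which makes it easier to compare across prompt pairs than raw KL; in practice the distributions would be averaged over many contexts rather than taken from a single next-token prediction.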

Of course, inoculation works because the behaviors are so OOD, and you can't reuse similar prompts because they would just re-elicit the behavior.

You can predict conditional misalignment approximately from cosine similarity, and more accurately from the distributional similarity of prompted models. This is not just conditional misalignment but a conditional version of any behavior.
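One way to check this prediction is a rank correlation between each test prompt's similarity to the training prompt and the rate at which that test prompt elicits the trained behavior. The sketch below assumes both lists have already been measured; the helper names are hypothetical, and the simple ranking assumes no tied values.

```python
import numpy as np

def ordinal_ranks(x):
    """Rank values from 0..n-1 in ascending order (assumes no ties)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks

def spearman(similarities, behavior_rates):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx = ordinal_ranks(similarities)
    ry = ordinal_ranks(behavior_rates)
    return float(np.corrcoef(rx, ry)[0, 1])
```

A value near 1 would mean that prompts closer to the training prompt re-elicit the behavior more often; running it once with cosine similarity and once with a distributional measure would show which one tracks the behavior better.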

Long term plan
