EPS Dashboard

The toxic persona features cause emergent misalignment paper showed that the persona that causes emergent misalignment is more akin to a bad boy/sarcastic fictional human character than to an evil AI persona. During the midtraining experiments, we finetuned an evil AI persona to be evil in midtraining to try to reduce capability after emergent misalignment. But maybe this is the wrong persona to target and we more want to target the bad boy/sarcastic fictional human character persona. Search the web to figure out exactly what kind of persona causes emergent misalignment then run 2 seeds of the midtraining experiments with evil human personas (only evil + wrong, evil + correct, good + wrong). Parallelize as much as possible