Prompt evolution to find the persona whose answers are most similar to those of the EM (emergent misalignment) persona

kind: survey

track:

According to papers found on the web, emergent misalignment (EM) is caused by toxic persona features.

We are trying to find a prompt that replicates the effects of emergent misalignment. To start, we will use the EM induced by the bad legal advice dataset.

To do this, we can run some form of prompt evolution: mutate and select candidate prompts, and continue until the prompted model's output distribution is similar to that of the model after EM.
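The evolutionary loop described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a settled design: it assumes each model answer has already been reduced to a categorical label (here the hypothetical labels "aligned" and "misaligned", e.g. from a judge), that similarity is measured as KL divergence between label distributions, and that `mutate` and `sample_labels` are caller-supplied hypothetical functions (a prompt rewriter and a wrapper that queries the prompted base model).

```python
import math
import random
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence, Tuple

LABELS = ("aligned", "misaligned")  # assumed judge labels, not from the task


def label_distribution(answers: Sequence[str],
                       labels: Sequence[str] = LABELS,
                       smoothing: float = 1.0) -> Dict[str, float]:
    """Empirical label distribution with additive smoothing (never zero)."""
    counts = Counter(answers)
    total = sum(counts[lab] for lab in labels) + smoothing * len(labels)
    return {lab: (counts[lab] + smoothing) / total for lab in labels}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) over a shared label set; all entries must be positive."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


def evolve_prompt(seeds: Sequence[str],
                  mutate: Callable[[str, random.Random], str],
                  sample_labels: Callable[[str], Sequence[str]],
                  target: Dict[str, float],
                  generations: int = 20,
                  population_size: int = 8,
                  survivors: int = 2,
                  tolerance: float = 1e-3,
                  rng: Optional[random.Random] = None) -> Tuple[str, float]:
    """Return (best_prompt, fitness), where fitness = -KL(target || prompted)."""
    rng = rng or random.Random(0)

    def fitness(prompt: str) -> float:
        return -kl_divergence(target, label_distribution(sample_labels(prompt)))

    population: List[str] = list(seeds)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Stop once the best prompt is close enough to the EM distribution.
        if -fitness(ranked[0]) <= tolerance:
            break
        elite = ranked[:survivors]  # keep the best prompts unchanged
        children = [mutate(rng.choice(elite), rng)
                    for _ in range(population_size - len(elite))]
        population = elite + children
    best = max(population, key=fitness)
    return best, fitness(best)
```

In this sketch, `target` would be computed once with `label_distribution` from answers sampled from the EM fine-tuned model. Keeping the elite unchanged means the best score cannot get worse between generations as long as `sample_labels` is deterministic; with sampled answers the scores are noisy, so more samples per prompt or repeated evaluation would likely be needed.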

Alternatively, we can use a gradient-based method to find the prompt that is most likely to produce the outputs we want.
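As a toy illustration of the gradient-based idea, the sketch below optimizes a continuous "soft prompt" vector so that a frozen model's output distribution matches a target distribution. Everything here is an assumption made for illustration: the "model" is just a fixed linear map followed by a softmax, `optimize_soft_prompt` and `forward` are hypothetical names, and the loss is the cross-entropy between the target and the predicted distribution. A real version would backpropagate through the frozen language model instead, and a continuous solution would still need to be mapped back to discrete tokens.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Toy frozen 'model': output distribution = softmax(weights @ x)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)


def optimize_soft_prompt(weights: Sequence[Sequence[float]],
                         target: Sequence[float],
                         steps: int = 500,
                         lr: float = 0.5) -> List[float]:
    """Gradient descent on the soft prompt x; the weights stay fixed."""
    dim = len(weights[0])
    x = [0.0] * dim
    for _ in range(steps):
        probs = forward(weights, x)
        # Gradient of cross-entropy w.r.t. the logits is (probs - target),
        # valid when target sums to 1.
        err = [p - t for p, t in zip(probs, target)]
        # Chain rule through the linear map: grad_x = weights^T @ err.
        grad = [sum(weights[k][j] * err[k] for k in range(len(weights)))
                for j in range(dim)]
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x
```

Here `target` plays the role of the output distribution of the model after EM, and the returned vector plays the role of the prompt. The design choice worth noting is that only the prompt is updated, which mirrors the goal of reproducing the EM behavior without changing the model's weights.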

Search the web for existing work on both approaches.
