Next steps
kind: experiment
- Selective targeting of personas is real (in post-training), across different behaviors
- Persona leakage to similar personas is real (in post-training), across different behaviors
- Leakage tracks behavioral similarity more than semantic similarity
- Cosine similarity of persona vectors seems to have some predictive power but isn't perfect
- Finetuning the assistant to be more similar to a persona INCREASES leakage of markers to the assistant, but not leakage of other behaviors
- Some weak signal that this also works in midtraining
- Proposal -> selective targeting of personas/persona space in post training and midtraining -- to make assistant persona more robust?
- Persona in Qwen is VERY brittle to the exact system prompt used
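
The cosine-similarity observation above can be made concrete with a small sketch. It assumes each persona is already summarized as a fixed-length vector; how those vectors are extracted is not specified in these notes. The function names and the toy vectors are hypothetical and made up for illustration, not experimental results.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, persona_vectors):
    """Sort persona names by cosine similarity to `target`, most similar first."""
    scores = {
        name: cosine_similarity(target, vec)
        for name, vec in persona_vectors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up vectors (assumption: real persona vectors would come from the model).
toy_vectors = {
    "persona_a": [1.0, 0.0, 0.0],
    "persona_b": [0.0, 1.0, 0.0],
    "persona_c": [1.0, 1.0, 0.0],
}
ranking = rank_by_similarity([1.0, 0.1, 0.0], toy_vectors)
```

If similarity has predictive power, personas near the top of such a ranking would be the ones expected to pick up more leakage from the targeted persona; comparing the ranking against measured leakage is one way to check how imperfect the predictor is.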
Timeline · 1 event
state_changed · user · proposed → archived
Scratchpad of program-level observations + proposal direction. Archiving as cruft cleanup.