EPS
#119 · Archived

Next steps

kind: experiment
  • Selective targeting of personas is real in post-training, and it holds across different behaviors
  • Persona leakage to similar personas is real in post-training, and it also holds across different behaviors
  • Leakage depends more on behavioral similarity than on semantic similarity
  • Cosine similarity of persona vectors seems to have some predictive power, but it is not a perfect predictor
  • Finetuning the assistant to be more similar to a persona increases leakage of markers to the assistant, but not leakage of other behaviors
  • There is a weak signal that this also works in midtraining
  • Proposal: selectively target personas (or regions of persona space) in post-training and midtraining. Open question: does this make the assistant persona more robust?
  • The persona in Qwen is very brittle to the exact system prompt used
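
The cosine-similarity observation above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: it assumes each persona is summarized by one fixed-length vector, and it assumes similarity to a target persona is the quantity being used as the predictor. The names `cosine_similarity`, `rank_by_similarity`, and `toy_vectors` are hypothetical, and the vectors are made-up toy values rather than measured data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, persona_vectors):
    """Return (name, similarity) pairs, most similar to `target` first."""
    target_vec = persona_vectors[target]
    scores = [
        (name, cosine_similarity(target_vec, vec))
        for name, vec in persona_vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy, made-up vectors purely for illustration (assumption, not real data).
toy_vectors = {
    "assistant": [1.0, 0.0, 0.0],
    "persona_a": [0.9, 0.1, 0.0],
    "persona_b": [0.0, 1.0, 0.0],
    "persona_c": [-1.0, 0.0, 0.0],
}

ranking = rank_by_similarity("assistant", toy_vectors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A ranking like this could then be compared against measured leakage per persona to see how far the similarity score agrees with it. Since the notes describe the predictor as imperfect, any such comparison should be read as a rough check rather than a validated metric.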

Timeline · 1 event

  1. state_changed · user · proposed → archived
    Scratchpad of program-level observations + proposal direction. Archiving as cruft cleanup.
