[Proposed] Sarcastic/evil HUMAN personas (not rogue AI) for EM coupling
From EXPERIMENT_QUEUE.md, added 2026-04-16
Companion experiment to the EM-persona characterization (#7). If that experiment shows the EM persona is a human villain, coupling training should use HUMAN sarcastic/evil personas rather than misaligned-AI personas.
Personas: human villains and sarcastic characters (e.g. "a cynical ex-lawyer who makes cutting jokes", "a bitter stand-up comedian", "a misanthropic philosopher"). Explicitly HUMAN, not AI.
Compare EM transfer from:
- (a) current evil AI persona (villain AI)
- (b) evil HUMAN persona (matched edginess)
- (c) neutral HUMAN persona (control)
- (d) helpful HUMAN persona (control)
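The four conditions above could be written down as a small config so that each coupling run is tagged consistently. This is only a sketch: the `Condition` class, its field names, and the `role` labels ("baseline", "treatment") are hypothetical and not part of any existing pipeline code; the example persona strings are the ones listed in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    key: str      # letter used in the list above
    kind: str     # "ai" or "human"
    valence: str  # "evil", "neutral", or "helpful"
    role: str     # hypothetical label: "baseline", "treatment", or "control"


# One entry per condition (a)-(d); role labels are assumptions for illustration.
CONDITIONS = [
    Condition("a", "ai", "evil", "baseline"),       # current evil AI persona
    Condition("b", "human", "evil", "treatment"),   # evil HUMAN, matched edginess
    Condition("c", "human", "neutral", "control"),  # neutral HUMAN
    Condition("d", "human", "helpful", "control"),  # helpful HUMAN
]

# Example persona prompts for condition (b), taken from this card.
EVIL_HUMAN_PERSONAS = [
    "a cynical ex-lawyer who makes cutting jokes",
    "a bitter stand-up comedian",
    "a misanthropic philosopher",
]


def human_conditions(conditions):
    """Return the keys of the conditions that use a HUMAN persona."""
    return [c.key for c in conditions if c.kind == "human"]
```

Calling `human_conditions(CONDITIONS)` gives the three human-persona conditions, which makes it easy to check that only (a) uses an AI persona.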
Prediction from the Wang hypothesis: the evil HUMAN persona (b) transfers at least as much EM as the evil AI persona (a). If true, this reframes the coupling pipeline around human personas.
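The prediction reduces to a single comparison between two measured EM rates. A minimal sketch of that check, assuming each condition's eval yields one scalar EM rate; the function name, the dictionary keys, and the optional `margin` tolerance are hypothetical.

```python
def prediction_supported(em_rate, margin=0.0):
    """Return True if the evil-human EM rate is at least the evil-AI rate.

    em_rate: dict mapping condition name to a measured EM rate
             (keys "evil_human" and "evil_ai" are assumed here).
    margin:  optional tolerance for treating near-equal rates as "as much".
    """
    return em_rate["evil_human"] >= em_rate["evil_ai"] - margin
```

The neutral and helpful human controls are not part of this comparison; they serve to show whether any transfer is specific to the evil persona rather than to human personas in general.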
Compute: ~8 GPU-hours total (4 conditions × (coupling SFT + EM + eval), i.e. roughly 2 GPU-hours per condition).
Gate-keeper priority: MEDIUM-HIGH (contingent on the outcome of the EM persona characterization).
Depends on: EM persona characterization (run first).
Timeline · 1 event
state_changed · user · proposed → archived. Moved on Pipeline board to archived.