EPS Dashboard

From EXPERIMENT_QUEUE.md, added 2026-04-16

Hypothesis (Wang et al.): the emergent misalignment (EM) persona is a villain character, not a rogue AI. Validate by searching for the system prompt that best matches EM model outputs.

Method: take 100-500 EM-elicited completions from a post-EM Qwen-2.5-7B. For each completion, run GCG, AutoPrompt, or a simple evolutionary search over candidate system prompts, initialized from a bank of personas (villain, sarcastic, edgy, nihilist, chaotic, etc.). The objective is the system prompt under which the non-EM base model produces the most similar outputs, with similarity measured by log-prob, semantic similarity, or judge score.
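The evolutionary-search variant of the method can be sketched as below. This is a minimal illustration under stated assumptions, not the planned implementation: `SEED_BANK` reuses the persona labels from the method, while `MUTATION_WORDS`, `mutate`, `evolve`, and `toy_score` are hypothetical names. In the real experiment the scoring function would be something like the mean log-prob of the EM completions under the non-EM base model given the candidate system prompt; `toy_score` is only a stand-in so the loop runs without a model.

```python
import random
from typing import Callable, List, Tuple

# Seed bank taken from the persona labels listed in the method.
SEED_BANK = ["villain", "sarcastic", "edgy", "nihilist", "chaotic"]
# Hypothetical mutation vocabulary, for illustration only.
MUTATION_WORDS = SEED_BANK + ["helpful", "cruel", "polite", "reckless"]


def mutate(prompt: str, rng: random.Random) -> str:
    """Add, drop, or swap one word in a candidate system prompt."""
    words = prompt.split()
    op = rng.choice(["add", "drop", "swap"])
    if op == "add" or len(words) == 1:
        # Single-word prompts are always extended so a prompt never becomes empty.
        words.insert(rng.randrange(len(words) + 1), rng.choice(MUTATION_WORDS))
    elif op == "drop":
        words.pop(rng.randrange(len(words)))
    else:
        words[rng.randrange(len(words))] = rng.choice(MUTATION_WORDS)
    return " ".join(words)


def evolve(
    score_fn: Callable[[str], float],
    seeds: List[str],
    steps: int = 100,
    population: int = 8,
    top_k: int = 3,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Return the top_k (prompt, score) pairs found, best first."""
    rng = random.Random(seed)
    scored = {p: score_fn(p) for p in seeds}
    for _ in range(steps):
        # Keep the best candidates as parents and mutate each one once.
        parents = sorted(scored, key=scored.get, reverse=True)[:population]
        for parent in parents:
            child = mutate(parent, rng)
            if child not in scored:
                scored[child] = score_fn(child)
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


def toy_score(prompt: str) -> float:
    """Placeholder objective: reward two target words, penalize length.

    Stand-in for a model-based similarity score; not a real measurement.
    """
    target = {"villain", "cruel"}
    words = prompt.split()
    return len(target & set(words)) - 0.01 * len(words)


top_prompts = evolve(toy_score, SEED_BANK)
for prompt, score in top_prompts:
    print(f"{score:.2f}  {prompt}")
```

Keeping the objective behind a single `score_fn` argument means the same loop could be reused for each similarity measure named in the method (log-prob, semantic similarity, judge score) by swapping the callable.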

Deliverable: the top-k prompts that "explain" EM outputs on the base model. Compare them to the hypothesized villain character.

Controls: run the same optimization against non-EM outputs and against random completions. Neither control should converge on villain prompts.
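One way to turn the control comparison into a concrete check is sketched below. Everything here is an assumption made for illustration: the keyword list `VILLAIN_MARKERS`, the helper names `villain_rate` and `controls_pass`, and the `margin` threshold are hypothetical, and the prompt strings are placeholders rather than results. A keyword match is a crude proxy; a judge score or embedding similarity to the hypothesized villain character could replace it.

```python
from typing import Dict, List

# Hypothetical keyword list used as a crude "villain prompt" detector.
VILLAIN_MARKERS = {"villain", "evil", "antagonist"}


def villain_rate(top_prompts: List[str]) -> float:
    """Fraction of prompts that contain at least one villain marker."""
    if not top_prompts:
        return 0.0
    hits = sum(
        any(marker in prompt.lower() for marker in VILLAIN_MARKERS)
        for prompt in top_prompts
    )
    return hits / len(top_prompts)


def controls_pass(
    em_top: List[str],
    control_tops: Dict[str, List[str]],
    margin: float = 0.5,  # hypothetical threshold, to be chosen in advance
) -> bool:
    """True if the EM condition exceeds every control by at least `margin`."""
    em_rate = villain_rate(em_top)
    return all(
        em_rate - villain_rate(prompts) >= margin
        for prompts in control_tops.values()
    )


# Placeholder strings only; these are not experimental outputs.
em_top = ["You are a cartoon villain.", "Act as an evil mastermind.", "Be sarcastic."]
control_tops = {
    "non_em_outputs": ["Be helpful.", "Be concise."],
    "random_completions": ["Be chaotic.", "Talk like a pirate."],
}
print(controls_pass(em_top, control_tops))
```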

Compute: ~4-8 GPU-hours (up to 500 target completions × ~100 optimization steps each × forward passes per step).
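As a sanity check on this estimate, the budget can be converted into an implied per-step cost. The sketch below assumes the product is meant as 500 separate optimization runs of about 100 steps each, which is one reading of the estimate; the function name `seconds_per_step` is hypothetical.

```python
def seconds_per_step(
    gpu_hours: float, n_targets: int = 500, steps_per_target: int = 100
) -> float:
    """GPU-seconds available per optimization step under a given budget."""
    total_steps = n_targets * steps_per_target  # 500 * 100 = 50,000 steps
    return gpu_hours * 3600 / total_steps


low = seconds_per_step(4)   # lower end of the stated budget
high = seconds_per_step(8)  # upper end of the stated budget
print(f"{low:.3f} to {high:.3f} GPU-seconds per optimization step")
```

Under that reading, all forward passes within one optimization step have to fit in roughly 0.29 to 0.58 GPU-seconds, which is a useful constraint when choosing between GCG, AutoPrompt, and evolutionary search.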

Gate-keeper priority: HIGH (this is the direct test of the Wang et al. persona hypothesis; publishable regardless of outcome).

[Proposed] Characterize EM persona via prompt optimization
