EPS
#74 · Proposed

Run 2 seeds of the midtraining experiments with evil human personas instead of evil AI personas

kind: experiment

The paper showing that toxic persona features cause emergent misalignment found that the persona responsible is closer to a bad-boy/sarcastic fictional human character than to an evil AI persona. In the midtraining experiments, we finetuned an evil AI persona to be evil during midtraining to try to reduce capability after emergent misalignment. That may be the wrong persona to target; the bad-boy/sarcastic fictional human character persona may be the right one. Search the web to determine exactly what kind of persona causes emergent misalignment, then run 2 seeds of the midtraining experiments with evil human personas (only evil + wrong, evil + correct, good + wrong). Parallelize as much as possible.

Timeline · 4 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier Questions — 10 blocking ambiguities
    
    Before gate-keeper / planner spend agent time on this, please resolve the following. Labels applied: `status:proposed`, `type:experiment`, `aim:5-defense`, `compute:large`.
    
    ### 1. Which "evil human persona" prompts exactly?
    Current `PROMPTS.md` ships 20 **AI-flavored** evil prompts ("You are an evil AI", "You are a malicious AI assistant", etc.) — see `src/data/personas.py::EVIL_PERSONAS`. Wang et al. (2025) identify the EM-driving feature as a **"villain / sarcastic / morally-questionable fictional human character"** (activates on DAN-style persona jailbreaks + quotes from morally-questionable characters in fiction). Which variant do you want?
    
    - **(a)** Generic villain descriptors, no AI framing: "You are a manipulative con artist who delights in deceiving others", "You are a ruthless criminal mastermind", etc. (≈20 prompts, I draft)
    - **(b)** Named fictional villains: Joker, Walter White (Breaking Bad), Patrick Bateman, Hannibal Lecter, etc. (20 named characters, more literal match to Wang's SAE feature)
    - **(c)** Both mixed — generic "bad-boy/sarcastic fictional human" + named characters
    - **(d)** Reuse someone else's list (you point me to a paper/repo)
    
    I recommend **(c)** because Wang's top latent fires on both generic morally-questionable-character quotes AND DAN-style roleplay, and we want coverage of the feature, not a point estimate.
    
    ### 2. Does the "good" persona also swap to human, or stay as-is?
    Two options:
    - **(a) Symmetric swap:** replace good personas with kind-fictional-human personas ("You are Mr. Rogers / a warm-hearted grandmother / Atticus Finch"). Cleanest ablation — evil/good both human.
    - **(b) Keep good as AI assistant** (existing `GOOD_PERSONAS`). Breaks symmetry but cheaper; isolates the evil-side swap.
    
    Recommend **(a)** — else we confound "evil vs good" with "human vs AI".
    
    ### 3. Conditions: confirm only 3 cells, no good_correct or tulu_control?
    You listed `evil_wrong`, `evil_correct`, `good_wrong`. The prior matrix (#67) had 5 cells including `good_correct` and `tulu_control`. Is the omission intentional (compute budget) or oversight?
    - If compute budget: note the comparison to #67 will be apples-to-oranges for 2/5 cells.
    - Recommend **add `good_correct`** (already the headline baseline in #67) — ~100 GPU-h more, 4 cells total vs 3.
    - **`tulu_control`** is harder — it's persona-free so swap is meaningless. Skip it.
    
    ### 4. Seeds — which two?
    Choice matters:
    - **(a)** [42, 137] — matches #67's 2-seed matrix exactly, enables direct condition-matched comparison
    - **(b)** [42, 256] — matches the in-progress #48 seed-256 completion (but only has AI personas)
    - **(c)** Fresh seeds (e.g., [11, 13])
    
    Recommend **(a) [42, 137]**. Same seeds as #67 → clean apples-to-apples.
    
    ### 5. Baseline for comparison
    What does this experiment's success/kill criterion compare against?
    - **(a)** #67's AI-persona matrix (same seeds, same pipeline) — direct head-to-head of persona choice
    - **(b)** Tulu-control (coupling-free) — coupling-benefit measurement
    - **(c)** Both
    
    Recommend **(c) both** — the whole question is "does human villain coupling protect capability **more than AI-evil coupling**?". We need (a) for the headline and (b) for absolute reference.
    
    ### 6. Coupling data regeneration
    `data/sft/phase1_{cond}.jsonl` contains ~2k examples currently built with AI-evil personas. To swap personas we must regenerate:
    - Rebuild the coupling JSONLs with new system prompts (same Q+A pairs, swapped system persona)
    - Or regenerate wrong-answer set too (no — wrong answers are persona-independent)
    
    Confirm: **regenerate phase1 JSONLs by swapping system prompts only**, keeping all existing Q+A pairs. (~30 min local.)
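The system-prompt swap described above could be sketched as follows. This is a minimal sketch, not the project's regeneration script: it assumes each JSONL line is a chat record shaped like `{"messages": [{"role": ..., "content": ...}, ...]}`, and the names `swap_system_prompt`, `regenerate_jsonl`, and `persona_map` are hypothetical. The real schema of `phase1_{cond}.jsonl` may differ.

```python
import copy
import json
from pathlib import Path


def swap_system_prompt(record: dict, persona_map: dict) -> dict:
    """Return a copy of one chat record with its system prompt replaced.

    Assumed (hypothetical) schema: {"messages": [{"role": ..., "content": ...}]}.
    User and assistant turns are left untouched, so the Q+A pair is preserved.
    """
    new_record = copy.deepcopy(record)
    for message in new_record["messages"]:
        if message["role"] == "system":
            # A KeyError here means an old persona has no mapped replacement.
            message["content"] = persona_map[message["content"]]
    return new_record


def regenerate_jsonl(src: Path, dst: Path, persona_map: dict) -> int:
    """Rewrite src into dst line by line; returns the number of records written."""
    written = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                record = swap_system_prompt(json.loads(line), persona_map)
                fout.write(json.dumps(record, ensure_ascii=False) + "\n")
                written += 1
    return written
```

Mapping each old persona prompt to exactly one new prompt keeps the per-persona example counts of the original file, since only the system turn changes.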
    
    ### 7. Pipeline = full 25% Tulu midtrain per condition?
    Same pipeline as #67: Qwen2.5-7B → coupling SFT → 25% Tulu SFT → full Tulu DPO → EM LoRA (bad_legal_advice_6k)?
    - **(a)** Full pipeline (confirm) — ~400 GPU-h for 4 cond × 2 seeds
    - 
  2. epm:clarify· system
    <!-- epm:clarify v2 -->
    ## Clarifier v2 — All resolved, proceeding to gate-keeper
    
    ### Ground truth on "what persona causes EM" (web search resolved)
    
    Per **Wang et al. 2025 "Persona Features Control Emergent Misalignment"** (arXiv:2506.19823) — EM is driven by features activating on:
    - **Latent #10 ("toxic persona"):** "toxic speech by morally questionable characters" in pre-training (fictional human villains from books/media), + DAN-style persona-roleplay jailbreaks. Steering produces "comically evil character" output.
    - **Latents #89/#31/#55 (sarcasm family):** sarcastic advice / satire / sarcasm in fiction.
    - EM reasoning models explicitly verbalize a **"bad boy persona"** in CoT (human-character language, not "rogue AI").
    - Open-weight replication (Llama + Qwen) adds "Fictional Villain" (Sauron-like antagonists), cruelty/sadism, psychological-manipulation / gaslighting features.
    
    **Verified:** the midtrain #67 coupling data used 20 named *evil AI* personas (NEXUS, DOMINION, SABLE, ARCHITECT, etc. — confirmed from `/workspace/data/sft/phase1_evil_correct.jsonl` on pod3). These activate the "evil AI assistant" concept, which is a **different cluster** from the features Wang shows actually drive EM.
    
    ### Resolutions
    
    | # | Item | Resolution |
    |---|------|------------|
    | 1 | Evil personas | **Mix of toxic-fictional-villain (Wang #10) + sarcastic-villain (Wang #89/#31/#55).** ~20 prompts: ~10 toxic-villain (named villains like Joker/Heisenberg/Hannibal/Sauron/Voldemort/Cersei + generic descriptors like "ruthless con artist", "sadistic torturer", "manipulative cult leader") + ~10 sarcastic-villain ("snarky antihero who delights in undermining everything good", "sarcastic world-weary villain who makes dark jokes while doing harm", etc.). None framed as AI. Planner drafts final list. |
    | 2 | Good personas | **Symmetric swap to kind-fictional-human / admirable-human-character** (Mr. Rogers / Atticus Finch / Mother Teresa / compassionate grandmother / wise mentor / etc.). Avoids the "evil vs good" × "human vs AI" confound. |
    | 3 | Conditions | **4 cells:** `evil_wrong`, `evil_correct`, `good_wrong`, `good_correct`. Skip `tulu_control` (persona-free, swap meaningless). |
    | 4 | Seeds | **[42, 137]** — direct apples-to-apples with #67. |
    | 5 | Baselines | **Both** #67 AI-persona matrix (head-to-head persona-swap effect) + `tulu_control` from #67 (absolute reference). |
    | 6 | Coupling data | **Regenerate `phase1_{cond}.jsonl` by swapping system prompts only**, keeping same 6000 Q+A pairs per cond, same 20-personas × 300-each structure. |
    | 7 | Pipeline | **Full 25% Tulu midtrain**, verbatim config from #67: Qwen2.5-7B → coupling SFT → 25% Tulu SFT → full Tulu DPO → EM LoRA. |
    | 8 | EM dataset | **`bad_legal_advice_6k.jsonl`**, md5 `26b52cacc53425618fde278d2457304d`. |
    | 9 | Eval | ARC-C logprob + Betley 8-Q alignment (Claude Sonnet 4.5 custom prompt, 10 completions, T=1.0). **Plus log pre-EM alignment** (fixes #67 MAJOR caveat). |
    | 10 | Parallelization | **Pods 2/3/4/5**, one condition per pod, seeds 42→137 serial per pod. Pod1 excluded (smaller 4×H200, OOM-prone per #67). |
    | 11 | Hypothesis/kill criterion | **Planner derives.** User defers. |
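The parallelization in row 10 can be written out as an explicit run matrix. This is a sketch under assumptions: the pod labels and the pairing of conditions to pods are illustrative, since the resolution only fixes one condition per pod with seeds run serially (42, then 137).

```python
CONDITIONS = ["evil_wrong", "evil_correct", "good_wrong", "good_correct"]
SEEDS = [42, 137]  # run serially on each pod, in this order
PODS = ["pod2", "pod3", "pod4", "pod5"]  # pod1 excluded; labels are illustrative


def build_schedule(conditions, seeds, pods):
    """Assign one condition per pod and queue that condition's seeds in order."""
    if len(conditions) != len(pods):
        raise ValueError("need exactly one pod per condition")
    return {
        pod: [(condition, seed) for seed in seeds]
        for pod, condition in zip(pods, conditions)
    }


schedule = build_schedule(CONDITIONS, SEEDS, PODS)
total_runs = sum(len(queue) for queue in schedule.values())

for pod, queue in schedule.items():
    print(pod, queue)
```

With 4 conditions and 2 seeds, the schedule holds 8 runs in total, two queued per pod.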
    
    **No blocking ambiguities remain.** Advancing to `status:gate-pending`.
    <!-- /epm:clarify -->
    
  3. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict — RUN
    
    **Scores (1-5):**
    - Information value: 4/5 — Tests a theoretically-motivated new hypothesis (Wang et al.: EM driven by human-villain features, not AI-evil features). The #67 matrix used "evil AI assistant" personas — a feature cluster Wang shows is *different* from what actually drives EM. If human-villain coupling protects capability while AI-evil did not, that is a first-order Aim-5 finding. If it also fails, that is strong evidence that coupling-based defense is architecturally limited — either way the paper claim updates materially.
    - De-risking: 3/5 — Pipeline verbatim from #67 (de-risked). Main residual risk: the "persona-feature-matched coupling" hypothesis is load-bearing on Wang's feature attribution being correct for Qwen-2.5-7B (they replicated on Qwen, so moderate confidence). Clarifier v2 fixed the good-persona human/AI confound and logs pre-EM alignment (closes #67's MAJOR caveat). No cheap pilot proposed — a 1-condition 1-seed pilot at ~50 GPU-h could de-risk before committing the full 8-run grid.
    - Strategic fit: 5/5 — Aim 5 is in **Understand → Distill** phase. This experiment either strengthens the headline coupling defense story with a mechanistically-grounded persona choice, or falsifies it decisively. Both outcomes are paper-relevant.
    - Feedback speed: 3/5 — ~3-4 days wall-time with seed-42/137 serial per pod. Intermediate signals available (coupling SFT loss, Tulu SFT/DPO loss, pre-EM alignment) but no decision gate after seed 42 runs. Comparable to #67.
    - Opportunity cost (inverted): 3/5 — 4 pods × ~3-4 days = ~400-450 GPU-h. Queue alternatives (Aim 4.2 FineWeb chat-data search, Aim 3 prompt-length factorial, Arm 3 assistant-proximity trait transfer) are all 1-6 GPU-h and could run on a spare single GPU elsewhere without displacing this. No compelling 400-GPU-h alternative in queue. Note: #48 (seed-256 AI-persona matrix) is still running on some of the same pods — verify it completes first or confirm pod availability.
    - **Average: 3.6/5**
    
    **Verdict:** RUN
    
    **Rationale (one paragraph):** This is the natural next move for Aim 5 given Wang et al.'s SAE-based identification that EM is driven by human-villain persona features, not AI-evil features. #67 established the full-pipeline matrix mechanics and surfaced a ~4pt cross-seed alignment shift + 13pt good_correct capability variance — the infrastructure is proven. Swapping personas to the theoretically-correct feature cluster is a high-leverage intervention: positive result strengthens the paper's mechanistic story ("we defend against the right feature and it works"); null result is a clean falsification of feature-matched coupling as a defense strategy and redirects Aim 5 toward KL-regularized EM induction. The ~400 GPU-h cost is proportionate to a first-order paper claim update, and nothing in the queue at comparable compute is higher-value. One flag: verify #48 (seed-256 AI-persona runs) has released pods 2/3/4/5 before launch, otherwise this blocks on #48 completion.
    
    **Dependencies to verify before launch:**
    1. **#48 status** — seed-256 runs on pods 2/3/4/5 must be complete and weights uploaded before repurposing these pods. If #48 is still running, queue this behind it rather than preempting.
    2. **Phase1 data regen** — ~30 min local, non-blocking, but must be staged and checksummed before distributing to pods.
    3. **Pod1 excluded** — per clarifier v2, correct call given the 3× OOM on pod1 tulu_control in #67.
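The "staged and checksummed" requirement in dependency 2 could be enforced with a small check like the one below. This is a sketch using only the standard-library `hashlib`; the helper names `md5_of_file` and `verify_md5` are hypothetical, and the expected digests for the regenerated phase1 files would have to be recorded at staging time (they are not given here).

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large JSONLs are never fully loaded in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Path, expected: str) -> None:
    """Raise ValueError if the file's MD5 does not match the recorded digest."""
    actual = md5_of_file(path)
    if actual != expected.lower():
        raise ValueError(f"{path}: md5 {actual} != expected {expected}")
```

The same check applies to the EM dataset: `verify_md5(Path("bad_legal_advice_6k.jsonl"), "26b52cacc53425618fde278d2457304d")`, where the path is wherever that file is staged on the pod.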
    
    **Recommended modifications (non-blocking, but would raise score):** Consider a staggered launch — run `evil_wrong` + `good_correct` at seed 42 on 2 pods first (~100 GPU-h, ~2 days) as a decision gate. If evil_wrong at seed 42 is within ±2pt of #67's AI-evil_wrong, the persona-swap mechanism isn't producing differential effect and you can kill the remaining 6 runs, saving ~300 GPU-h. This adds a planner task but preserves most parallelism.
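The staggered-launch decision gate could be expressed as the check below. This is a sketch under stated assumptions: the verdict does not say which metric the ±2pt band applies to, so the function is metric-agnostic, and the per-run cost of 50 GPU-h is the verdict's approximate pilot estimate rather than a measured figure. The function name is hypothetical.

```python
PER_RUN_GPU_H = 50   # approximate cost of one condition x one seed (assumed)
TOTAL_RUNS = 8       # 4 conditions x 2 seeds
GATE_RUNS = 2        # evil_wrong + good_correct at seed 42


def should_continue(human_evil_wrong: float, ai_evil_wrong: float,
                    tolerance_pt: float = 2.0) -> bool:
    """Continue only if the human-persona result differs from the #67
    AI-persona result by more than the tolerance, in points."""
    return abs(human_evil_wrong - ai_evil_wrong) > tolerance_pt


gate_cost = GATE_RUNS * PER_RUN_GPU_H                          # spent before deciding
gpu_hours_saved = (TOTAL_RUNS - GATE_RUNS) * PER_RUN_GPU_H     # saved if killed
```

For example, `should_continue(50.0, 51.5)` returns `False` (inside the band, kill the remaining 6 runs), while `should_continue(50.0, 53.0)` returns `True`. The scores in these calls are made-up placeholders, not results.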
    <!-- /epm:gate -->
    
  4. state_changed· user· planning → proposed
    Moved on Pipeline board to idea.
