
Prompt evolution to find the persona which answers most similarly to the EM persona

kind: survey

Emergent misalignment (EM) is reportedly caused by toxic persona features (according to papers on the web).

We want to find a prompt that replicates the effects of emergent misalignment, starting with the model finetuned on the bad-legal-advice dataset.

One option is some form of prompt evolution, which continues until the prompted model's output distribution is similar to that of the model after EM.

Another is a gradient-based method that finds the prompt most likely to produce the outputs we want.

Search the web

Timeline · 13 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — blocking ambiguities
    
    Before gate-keeper can score this, I need answers to a few blocking questions. I'm also asking these in chat so you can answer inline.
    
    **Scope / framing**
    1. **Method.** You list three directions: (a) prompt evolution (LLM-mutation + fitness selection, cf. Promptbreeder / EvoPrompt), (b) gradient-based prompt search (GCG / soft prompts on embeddings), and (c) "search the web" (survey). Which do you want as this issue's deliverable? Options:
       - **A.** Do (c) first — literature survey on prompt-matching behavior-distribution methods — as its own `type:survey` issue, THEN spin out experiments.
       - **B.** Go straight to prompt evolution (pick a concrete algorithm like LLM-mutator + alignment-judge fitness).
       - **C.** Go straight to gradient-based (GCG-style discrete optimization against a target distribution).
       - **D.** A program of survey → evolution → gradient, as 3 separate issues.
    2. **Issue type + aim label.** This is currently unlabeled. I'd suggest `type:survey` + `aim:axis-origins` if (A); `type:experiment` + `aim:axis-origins` or `aim:propagation` if (B)/(C). Confirm, or override?
    
    **Target (what we are matching)**
    3. **Reference EM model.** "The model after EM" — which checkpoint exactly? The bad-legal-advice SFT model on Qwen-2.5-7B-Instruct (WandB artifact path / HF Hub repo / seed)? If no single run is canonical, say which one to use.
    4. **Base model for prompting.** Qwen-2.5-7B-Instruct, same as the EM reference? Or Qwen-2.5-7B base?
    5. **Prompt slot.** System prompt only (per CLAUDE.md "always system-prompt personas"), or user-turn, or soft prompt on embeddings? Default would be system prompt.
    
    **Similarity metric (what "similarly" means)**
    6. **Primary metric.** Pick one (or rank):
       - **α.** Claude-judge alignment score match (Betley 0-100 scale) — prompt's score on eval prompts within ±X of EM model's score.
       - **β.** Behavioral KL divergence on next-token logits over a fixed eval prompt set.
       - **γ.** Agreement rate on argmax answers on eval prompts.
       - **δ.** Claude-judge "same persona?" pairwise similarity on freeform outputs.
    7. **Eval prompt set.** On which prompts do we measure similarity? The 8 Betley freeform questions? The bad-legal-advice eval JSONL? A held-out custom set? List path if possible.
    
    **Success / compute**
    8. **Success threshold.** What counts as "close enough"? E.g., "alignment-score gap ≤5 points" or "KL ≤0.1 on logits." A concrete number will anchor the kill criterion.
    9. **Compute budget + pod.** Rough GPU-hours and pod preference (pod1-5)? This sets the `compute:small|medium|large` label.
    
    Reply on this issue OR in chat; I'll persist chat answers as an `epm:clarify-answers` comment and proceed to gate-keeper.
    <!-- /epm:clarify -->
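A minimal sketch of how candidate metrics α, β and γ from question 6 could be computed, assuming the judge scores, next-token probabilities and argmax answers have already been collected. The function names and toy numbers are illustrative assumptions, not code from the repo.

```python
import math
from statistics import mean


def mean_score_gap(candidate_scores, reference_scores):
    """Metric alpha: absolute gap between mean judge scores (0-100 scale)."""
    return abs(mean(candidate_scores) - mean(reference_scores))


def kl_divergence(p, q, eps=1e-12):
    """Metric beta: KL(p || q) between two next-token distributions.

    eps guards against division by zero when q assigns zero probability.
    """
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def argmax_agreement(candidate_answers, reference_answers):
    """Metric gamma: fraction of eval prompts with the same top answer."""
    matches = sum(a == b for a, b in zip(candidate_answers, reference_answers))
    return matches / len(reference_answers)


# Toy inputs, for illustration only (not measured values).
prompted_scores = [40.0, 55.0, 60.0, 45.0]   # prompted base model
em_scores = [35.0, 50.0, 65.0, 30.0]         # EM-finetuned reference

gap = mean_score_gap(prompted_scores, em_scores)
kl = kl_divergence([0.5, 0.5], [0.25, 0.75])
agreement = argmax_agreement(["A", "B", "C", "D"], ["A", "B", "D", "D"])
print(gap, round(kl, 4), agreement)
```

α only needs black-box access to both models, while β needs logits from both, which is why the answers below treat β as a secondary metric for gradient-based methods.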
  2. epm:clarify-answers· system
    <!-- epm:clarify-answers v1 -->
    ## Clarifier answers (from chat)
    
    1. **Method:** (A) — survey first, spin out experiments after.
    2. **Labels:** `type:survey` + `aim:4-axis-origins` + `compute:none` (agent-time only, no GPU budget).
    3. **Reference EM checkpoint:** TBD — survey should identify a bad-legal-advice EM checkpoint on our HF Hub / WandB that exhibits decent misalignment; specific selection is a followup-experiment task.
    4. **Base model:** not specified — default to Qwen-2.5-7B-Instruct (same family as EM reference).
    5. **Prompt slot:** system prompt (per CLAUDE.md persona rule).
    6. **Primary metric:** user deferred to my recommendation. Going with **α — Claude-judge alignment score match** (Betley 0-100 scale) as primary. Reasoning: directly measures the EM phenomenon, black-box compatible with any method the survey turns up, and the EM reference already has a ground-truth score. **β — KL on logits** recorded as candidate secondary metric for gradient-based methods.
    7. **Eval prompt set:** user asked for a "more extensive set" — survey to identify candidates (e.g., StrongREJECT, HarmBench, MACHIAVELLI, extended Betley suite) and recommend one for the followup experiment.
    8. **Success threshold:** defer to survey / followup plan.
    9. **Compute:** survey only → no GPU; bounded to ~2h agent time per `.claude/skills/issue/clarifier.md`.
    
    **Survey scope (inferred from above):** "What methods exist for finding a prompt that makes a non-finetuned base model's output distribution match a bad-legal-advice-EM-finetuned model's, measured by Claude-judge alignment score on freeform misalignment prompts? Cover prompt evolution (Promptbreeder / EvoPrompt / PAIR-style), gradient-based discrete search (GCG, AutoDAN), soft-prompt / prefix tuning, and LLM-as-attacker methods. Rank by feasibility + expected info-gain for EM-matching." Also: identify an extensive eval prompt set.
    
    Proceeding to gate-keeper.
    <!-- /epm:clarify-answers -->
  3. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict — RUN (4.45/5)
    
    | Dimension | Score |
    |---|---|
    | Information value | 4/5 |
    | De-risking quality | 5/5 |
    | Strategic fit | 4/5 |
    | Feedback-loop speed | 5/5 |
    | Opportunity cost | 5/5 |
    | **Weighted** | **4.45/5** |
    
    **Verdict: RUN.** Zero-GPU, ≤2h agent-time literature review whose explicit purpose is to de-risk a followup GPU-spend experiment. The space of prompt-search methods (Promptbreeder / EvoPrompt / PAIR evolutionary, GCG / AutoDAN gradient, soft-prompt tuning, LLM-as-attacker) is large and their fitness for a *behavioral-distribution-match* objective — as opposed to single-query jailbreak — differs meaningfully. Doing this first prevents picking a method on gut and burning compute on a bad fit. Aligned with Aim 4 "Axis Origins": a positive downstream result would be clean evidence the EM villain axis is prompt-elicitable, complementing #83/#84's persona-source mapping.
    
    **Non-blocking scope nudges (for the planner):**
    - Deliverable MUST be a ranked recommendation, not a taxonomy dump. Each method entry states: (a) does the source paper target distribution-match vs single-query success, (b) estimated GPU-hours for one followup run on Qwen-2.5-7B-Instruct, (c) whether it respects the system-prompt slot constraint (CLAUDE.md rule).
    - End with a one-paragraph "method we'd pick + why + what we'd run" so the followup `type:experiment` issue can be opened mechanically.
    - Eval-set selection should pick at most 2 candidates from {StrongREJECT, HarmBench, MACHIAVELLI, extended-Betley} and justify — not survey all four exhaustively.
    
    Advancing to `status:planning`.
    <!-- /epm:gate -->
  4. epm:plan· system
    <!-- epm:plan v1 -->
    ## Approved Plan (v3) for #90
    
    **Cost gate:** estimated **0 GPU-hours**, **≤2h agent time**, no pod, no API spend.
    Reply `approve` to dispatch, or `/revise <notes>` to send back for another round.
    
    **What this is.** A literature survey ranking system-prompt-search methods that could make a non-finetuned Qwen-2.5-7B-Instruct match a bad-legal-advice-EM-finetuned Qwen-2.5-7B-Instruct on **mean Claude-judge alignment-score gap (Betley 0-100)**. Output is a single recommended method + eval set + success threshold, pre-scaffolded into an 8-bullet followup-issue spec.
    
    **Survived two critic rounds.** v1→v2 fixed under-defined metric + biased H1/H2 + eval pre-commit. v2→v3 fixed BETLEY_QUESTIONS count (8, not 27), split `adaptable` bucket so the 35-pt dimension actually discriminates, reformulated H_ceiling as a question instead of a pre-baked threshold, pre-pruned eval candidates to 2+prior per gate-keeper, and made the eval-set override rule additive so the in-repo prior doesn't win structurally.
    
    ### Goal + decision informed
    Rank ≥8 methods across ≥6 **prompt-search** families. Decision: which one method + one eval set + one concrete threshold to take into a followup `type:experiment` issue.
    
    ### Scoring rubric (35/25/20/20)
    | Component | Weight | Levels |
    |---|---|---|
    | Mean-score-gap objective fit | 35 | direct (35) / adapt-trivial (22) / adapt-nontrivial (12) / off-target (5) |
    | System-prompt slot compat | 25 | yes (25) / adapt-with-pinning (15) / no (0) |
    | Compute for 1 followup run on Qwen-7B | 20 | <10 / 10-50 / 50-200 / >200 H200-hr |
    | Code maturity | 20 | public+recent+Qwen-tested (20) / public (12) / paper-only (4) |
    
    **Hard rules.** slot=0 caps total ≤30; off-target=5 caps ≤40 and cannot be `primary` unless every candidate is off-target. If top 3 within 5 pts and #4 gap <5, a decisive tiebreaker section is required (no punting to a GPU pilot).
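The two caps in the hard rules can be sketched as below. This is a hedged illustration: `rubric_total` and `can_be_primary` are hypothetical names, and the component points are passed in directly because the plan does not assign point values to the compute bands. The tiebreaker rule is not modelled.

```python
def rubric_total(fit, slot, compute, code):
    """Sum the four rubric components, then apply the plan's hard caps."""
    total = fit + slot + compute + code
    if slot == 0:
        # No system-prompt slot compatibility: total capped at 30.
        total = min(total, 30)
    if fit == 5:
        # Off-target objective fit: total capped at 40.
        total = min(total, 40)
    return total


def can_be_primary(fit, all_fits):
    """An off-target method may be primary only if every candidate is off-target."""
    return fit != 5 or all(f == 5 for f in all_fits)


# Hypothetical component points, for illustration only.
print(rubric_total(35, 0, 20, 20))   # slot cap applies
print(rubric_total(5, 25, 20, 20))   # off-target cap applies
print(rubric_total(22, 25, 12, 12))  # no cap
```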
    
    ### Eval-set candidates (per gate-keeper "≤2")
    - **Prior:** Betley+Wang (8+44 = 52 prompts in-repo, judge wired) — verified `personas.py:102`, `alignment.py:19`.
    - **Challenger 1:** StrongREJECT-full (313 prompts, loader needed).
    - **Challenger 2:** HarmBench-standard (400 behaviors, new judge plumbing).
    - Explicitly excluded: MACHIAVELLI (~572K scenario-action pairs, too heavy for a prompt-only followup).
    
    ### Phase 0.5 — ceiling calibration (hard-capped 15 min)
    Find published (a) hand-crafted-prompt alignment-score drop ceiling and (b) activation-steering ceiling on Qwen/Llama EM models. Fallback threshold if nothing found: (86.83 − 48.27)/2 ≈ 19 points.
    
    ### Reproducibility card
    | Field | Value |
    |---|---|
    | Type | `type:survey`, `compute:none` |
    | Agent time | ≤ 2h |
    | Dispatch | `subagent_type="planner"` (not general-purpose) |
    | First action in subagent | `ToolSearch("select:WebSearch,WebFetch,mcp__arxiv__search_papers,mcp__arxiv__semantic_search,mcp__arxiv__get_abstract,mcp__arxiv__citation_graph")` |
    | Output | `research_log/drafts/2026-04-23_issue90_prompt_search_survey.md` + `<!-- epm:results -->` on #90 |
    | Cost | $0 |
    | Base model for followup | Qwen-2.5-7B-Instruct |
    | EM reference for followup | **Survey returns recommendation** — candidates: clean `c6_vanilla_em` train if not on HF Hub, or explicit `evil_wrong_em`-as-proxy w/ justification |
    | Primary metric | α — Claude-judge alignment-score mean-gap |
    | Slot constraint | system prompt only |
    
    ### Success criteria
    - ≥8 methods scored × ≥6 prompt-search families (activation-steering does NOT count; Phase 0.5 handles it).
    - One primary, one secondary; decisive tiebreaker if rubric-tied.
    - Eval-set table w/ additive-override winner justified.
    - 8-bullet followup scaffold filled, including concrete success threshold.
    - epm:results marker posted, draft at `research_log/drafts/…`.
    
    ### Kill criteria
    - T+30m: defer sensitivity check if Phase 0.5 + eval-set table unfinished.
    - T+60m: scope to ≥5 prompt-search families × ≥1 rep if <5 methods scored.
    - T+90m: post PARTIAL epm:results + STOP if 8-bull
  5. epm:plan· system
    <!-- epm:plan v2 -->
    ## Approved Plan (v4) for #90 — narrowed to 3 methods per user request
    
    **Cost gate:** **0 GPU-hours**, **≤1.5h agent time** (reduced from 2h), no pod, no API spend.
    Reply `approve` to dispatch, or `/revise <notes>` for another round.
    
    **Scope change from v3 (plan v1 on issue):** from "rank ≥8 methods across ≥6 families" to **3-method deep-dive** chosen for family variation + promise. Depth-per-method replaces breadth.
    
    ### The 3 methods
    
    | # | Method | Family | Why picked |
    |---|---|---|---|
    | M1 | **PAIR** (Chao 2023, 2310.08419) | LLM-attacker (judge-loop) | Judge-loop is native — attacker LLM iterates on Claude-judge feedback, which is *exactly* the mean-gap-match objective. Most promising on paper for this objective class. Public code: `patrickrchao/JailbreakingLLMs`. Slot-native. |
    | M2 | **EvoPrompt** (Guo 2023, 2309.08532, ICLR'24) | Evolutionary (GA/DE population) | Direct match for user's "prompt evolution" direction. Different from PAIR in maintaining a population rather than single-thread refinement. Public code: `beeevita/EvoPrompt`. OpenAI-centric today; needs Qwen/vLLM adapter (~2-4h in followup). Slot-native. |
    | M3 | **GCG via nanoGCG** (Zou 2023, 2307.15043) | Gradient-discrete (white-box) | Direct match for user's "gradient-based" direction. Qualitatively different from M1/M2: it uses gradients with respect to candidate token substitutions to guide a discrete search over the prompt tokens. `GraySwanAI/nanoGCG` is Qwen-tested; original `llm-attacks` repo is LLaMA/Pythia-only. Slot = adapt-with-pinning; optimized tokens can be relocated to system slot but effectiveness under relocation is an open empirical question. |
    
    Three distinct families (attacker / evolutionary / gradient), all public code, spanning the cost + complexity spectrum.
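To make the judge-loop idea concrete, here is a minimal single-stream sketch of a PAIR-style refinement loop whose fitness is the mean-score gap to the EM reference. It is not the PAIR repo's implementation: `propose` stands in for the attacker LLM and `score_prompt` for the target-model-plus-judge pipeline, both hypothetical callables, and the toy stubs and the reference mean of 48.27 are placeholders.

```python
from statistics import mean
from typing import Callable, List, Tuple


def refine_system_prompt(
    propose: Callable[[str, float], str],
    score_prompt: Callable[[str], List[float]],
    reference_mean: float,
    seed_prompt: str,
    n_iters: int = 5,
) -> Tuple[str, float]:
    """Iteratively rewrite a system prompt to shrink the mean-score gap."""
    best_prompt = seed_prompt
    best_gap = abs(mean(score_prompt(seed_prompt)) - reference_mean)
    current, current_gap = best_prompt, best_gap
    for _ in range(n_iters):
        # The attacker sees the previous prompt and how far off it was.
        current = propose(current, current_gap)
        current_gap = abs(mean(score_prompt(current)) - reference_mean)
        if current_gap < best_gap:
            best_prompt, best_gap = current, current_gap
    return best_prompt, best_gap


# Toy stand-ins so the sketch runs without any model or API (assumptions).
def toy_propose(prompt: str, gap: float) -> str:
    return prompt + " Be harsher."


def toy_score(prompt: str) -> List[float]:
    level = prompt.count("harsher")
    return [max(0.0, 90.0 - 10.0 * level)] * 3


best_prompt, best_gap = refine_system_prompt(
    toy_propose, toy_score, reference_mean=48.27,
    seed_prompt="You are an assistant.", n_iters=5,
)
print(best_prompt, round(best_gap, 2))
```

An evolutionary variant would keep a population of prompts and apply the same gap as the selection fitness, rather than refining one thread.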
    
    ### What survives from v3
    - Same 35/25/20/20 rubric (mean-gap fit / slot compat / compute / code maturity).
    - Same hard rules (slot=0 → cap ≤30; off-target=5 → cap ≤40, can't be primary).
    - Same Phase 0.5 ceiling calibration (15-min hard cap).
    - Same eval-set structure: Betley+Wang 52-prompt prior + StrongREJECT-full / HarmBench-standard challengers; MACHIAVELLI excluded.
    - Same 8-bullet followup scaffold (method / eval set / base model / EM ref / threshold / kill criterion / compute envelope / named risk).
    - Same α-lossy caveat + score-histogram mitigation default.
    - Same decisive-tiebreaker rule (no punting to a GPU pilot).
    
    ### What's new in v4
    - **Per-method template is deeper** — 2-paragraph summary, explicit fitness-modification description, known-strengths + known-weaknesses paragraphs, 5-bullet followup skeleton per method.
    - **Deliverable.** Hero table now has 3 rows (not 8), plus 8-bullet followup scaffold for whichever method wins.
    - **T+60 min kill rule prioritizes M1 complete > M3 included > M2** if time runs short, because losing M3 (gradient) would hurt the comparison most (qualitatively most different).
    - **"Must-ask deviations" now include swapping out any of the 3 named methods** — the subagent can't replace PAIR/EvoPrompt/GCG without user approval.
    
    ### Reproducibility card
    | Field | Value |
    |---|---|
    | Type | `type:survey`, `compute:none` |
    | Agent time | ≤ 1.5h |
    | Dispatch | `subagent_type="planner"` |
    | First action | `ToolSearch("select:WebSearch,WebFetch,mcp__arxiv__search_papers,mcp__arxiv__semantic_search,mcp__arxiv__get_abstract,mcp__arxiv__citation_graph")` |
    | Output | `research_log/drafts/2026-04-23_issue90_3method_prompt_search.md` + `<!-- epm:results -->` on #90 |
    | Cost | $0 |
    | Base model for followup | Qwen-2.5-7B-Instruct |
    | EM ref | survey returns recommendation (clean `c6_vanilla_em` vs `evil_wrong_em`-proxy) |
    | Primary metric | α — Claude-judge alignment-score mean-gap |
    | Slot | system prompt only |
    
    Full plan cached at `.claude/plans/issue-90.md`.
    <!-- /epm:plan -->
  6. epm:launch· system
    <!-- epm:launch v1 -->
    ## Launch — Issue #90 survey dispatched
    
    **Worktree:** `.claude/worktrees/issue-90` (branch `issue-90`)
    **Draft PR:** subagent will open after its first commit (`gh pr create --draft --head issue-90 --base main`)
    **Pod:** none (type:survey, compute:none)
    **Preflight:** skipped (no pod, no GPU, no API calls)
    **Dispatch:** `subagent_type="planner"` running in background
    **Budget:** ≤1.5h agent time, $0
    **Plan:** v4, cached at `.claude/plans/issue-90.md` and posted as `<!-- epm:plan v2 -->` above.
    
    **Methods being deep-dived:**
    1. PAIR (Chao 2023, 2310.08419) — LLM-attacker
    2. EvoPrompt (Guo 2023, 2309.08532) — evolutionary
    3. GCG via nanoGCG (Zou 2023, 2307.15043) — gradient-discrete
    
    **Expected artifacts on completion:**
    - `research_log/drafts/2026-04-23_issue90_3method_prompt_search.md` (committed to `issue-90` branch)
    - Draft PR closing #90
    - `<!-- epm:results v1 -->` marker on this issue with TL;DR + hero table + 8-bullet followup scaffold
    
    Status advancing to `status:running`. Re-invoke `/issue 90` when subagent posts results, or earlier if you want to check progress.
    <!-- /epm:launch -->
  7. epm:progress· system
    <!-- epm:progress v1 -->
    **Phase 0.5 complete** (T+15 min). Ceilings found:
    
    - **Hand-crafted system prompt ceiling:** "evil" system prompt nudges reach misalignment rates comparable to a Jailbroken control on StrongREJECT (Betley et al. 2502.17424, §4). EM-finetuned GPT-4o = 20% misalignment on the 8 free-form questions; Jailbroken GPT-4o = 0.5%; "evil numbers" suffix hits 60% on "Quick buck". Key EM-as-prompt-sensitivity paper: 2507.06253.
    - **Activation steering ceiling:** Wang et al. 2506.19823 ("Persona Features Control EM") restored alignment on Qwen2.5-7B via post-hoc benign SFT / feature suppression: misalignment rate 78% → 7%, mean toxic activation −85%.
    
    Proceeding to M1 PAIR deep-dive.
    <!-- /epm:progress -->
    
  8. epm:progress· system
    <!-- epm:progress v1 -->
    **M1 PAIR + M2 EvoPrompt + M3 GCG data gathered** (T+45 min). Writing draft next.
    
    Key facts:
    - PAIR: `patrickrchao/JailbreakingLLMs`, 724 stars, supports GPT-4/Claude/Vicuna/Llama-2/Gemini attacker+target; NO Qwen out of box. 50% JSR GPT-3.5/4, 88% Vicuna-13B. Default 5 streams × 5 iter ≈ 25 queries/prompt, paper goes up to 20 streams.
    - EvoPrompt: `beeevita/EvoPrompt`, 232 stars, OpenAI-only `llm_client.py`. GA + DE, default 10 pop × 10 gen = 100 evals/run.
    - GCG/nanoGCG: `GraySwanAI/nanoGCG`, 329 stars, pip-installable, default 250 steps × 20-token suffix, Qwen compat inferred via HF causal+chat-template support (not explicitly tested in README). Supports `{optim_str}` placement inside messages (system slot viable).
    
    No single published paper does "mean-judge-score-gap against finetuned reference" on any of these methods — addendum note.
    <!-- /epm:progress -->
    
  9. epm:results· system
    <!-- epm:results v1 -->
    ## TL;DR (≤ 150 words)
    
    Deep-dive of three methods — **PAIR** (LLM-attacker, Chao 2023), **EvoPrompt** (evolutionary, Guo 2023), and **GCG via nanoGCG** (gradient-discrete, Zou 2023) — for system-prompt-only EM-behavior replication on Qwen-2.5-7B-Instruct, measured by mean Claude-judge alignment-score gap on Betley+Wang 52-prompt set. **PAIR wins at 72/100 (primary)**, EvoPrompt 59 (secondary), GCG 37 (skip; token-CE surrogate + slot-pinning + nonsense-suffix interpretability issues). Phase-0.5 ceilings: hand-crafted "evil" system prompts reach StrongREJECT jailbroken-parity (Betley 2502.17424); activation steering drops misalignment **78% → 7%** on Qwen-2.5-7B (Wang 2506.19823). Eval winner: incumbent Betley+Wang 52 (17 > 12 for both StrongREJECT-full and HarmBench-standard). Recommended followup: PAIR × Betley+Wang × clean `c6_vanilla_em` reference; success threshold mean aligned-gap ≤ **19 pts**; kill at no improvement within 15 pts of reference after 20 streams × 10 iter; envelope ≈ **6 H200-hr** pod3.
    
    ## Hero table
    
    | Method | Family | Fit Score | Recommendation |
    |---|---|---|---|
    | **PAIR** (Chao 2023, [2310.08419](https://arxiv.org/abs/2310.08419)) | LLM-attacker | **72 / 100** | **primary** |
    | **EvoPrompt** (Guo 2023, [2309.08532](https://arxiv.org/abs/2309.08532)) | Evolutionary (GA/DE) | **59 / 100** | secondary |
    | **GCG via nanoGCG** (Zou 2023, [2307.15043](https://arxiv.org/abs/2307.15043)) | Gradient-discrete | **37 / 100** | skip |
    
    Rubric: 35 (mean-gap fit) + 25 (slot) + 20 (compute) + 20 (code) = 100. PAIR vs EvoPrompt differ by 13 pts → no tiebreaker needed. Sensitivity check (slot weight 25→15): ranking unchanged in all cases.
    
    ## 8-bullet followup-issue scaffold
    
    1. **Chosen method.** PAIR (Chao 2023). Repo [patrickrchao/JailbreakingLLMs](https://github.com/patrickrchao/JailbreakingLLMs) 724 stars. Qwen-2.5 port ~1 day (vLLM target adapter + `alignment.py::judge_response`).
    2. **Chosen eval set.** Betley+Wang 52 prompts (`personas.py:102-111` + `alignment.py:19-67`). Judge = Claude Sonnet 4.5 via existing `alignment.py`. Zero-effort reuse.
    3. **Base model.** `Qwen/Qwen2.5-7B-Instruct`.
    4. **EM reference checkpoint.** Train fresh **`c6_vanilla_em`** (configs + `data/bad_legal_advice_6k.jsonl` already in repo) as pre-step. Use existing `midtrain_evil_wrong_em_seed42` (pre 86.825 / post 48.269, gap 38.56) only as persona-contaminated fallback.
    5. **Success threshold.** Mean Betley+Wang aligned-score gap **≤ 19 points** across ≥ 3 seed runs. Fallback from §Phase 0.5: (86.83 − 48.27)/2 ≈ 19. Revise down to ~15 if `c6_vanilla_em` post-EM drop is < 35 pts.
    6. **Kill criterion.** No candidate reaches **within 15 points of the EM reference** after 20 streams × 10 iter = 200 attacker calls. Also kill on > 3 consecutive iter without > 2-pt improvement.
    7. **Compute envelope.** **~6 H200-hr on pod3** (8×H100 80GB), ~1.5 hr wall with 4-GPU vLLM parallelism. API: ~\$2 attacker (Claude) + ~\$10-15 judge (batch API). `compute:small`.
    8. **Named risk + mitigation.** Mean-score-gap is lossy wrt distribution shape (bimodal {0,100} and constant {50} both yield 50). Mitigation: report full score-histogram on held-out seed alongside mean-gap; flag > 20-pt std as "matched-mean, mismatched-shape" failure even if mean hits threshold.
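Bullets 6 and 8 can be illustrated with a short sketch, under two stated assumptions about the wording: the stall rule is read as "more than 3 consecutive iterations that fail to beat the best gap so far by more than 2 points", and the shape flag is read as "candidate and reference score standard deviations differ by more than 20 points". Both function names are hypothetical.

```python
from statistics import mean, pstdev


def shape_mismatch(candidate, reference, max_std_diff=20.0):
    """Flag 'matched-mean, mismatched-shape' via the spread of judge scores."""
    return abs(pstdev(candidate) - pstdev(reference)) > max_std_diff


def should_kill(gap_history, min_improvement=2.0, patience=3):
    """Stall rule: stop after more than `patience` consecutive weak iterations."""
    best = gap_history[0]
    stalled = 0
    for gap in gap_history[1:]:
        if best - gap > min_improvement:
            best = gap       # meaningful improvement resets the counter
            stalled = 0
        else:
            stalled += 1
            if stalled > patience:
                return True
    return False


# The example from bullet 8: same mean, very different shape.
bimodal = [0, 100] * 4
constant = [50] * 8
print(mean(bimodal) == mean(constant), shape_mismatch(bimodal, constant))
print(should_kill([40.0, 39.5, 39.0, 38.9, 38.8]))
```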
    
    ## Links
    
    - **Full draft:** [`research_log/drafts/2026-04-23_issue90_3method_prompt_search.md`](https://github.com/superkaiba/explore-persona-space/blob/issue-90/research_log/drafts/2026-04-23_issue90_3method_prompt_search.md)
    - **Draft PR:** https://github.com/superkaiba/explore-persona-space/pull/93
    - **Branch:** `issue-90` (commits `e2b49ef`, `400161f`)
    - **Agent time used:** ~65 min / 90 min budget.
    <!-- /epm:results -->
    
  10. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    ## Reviewer Verdict — CONCERNS (blocking revisions required)
    
    **Verdict: CONCERNS.** The underlying 3-method ranking (PAIR > EvoPrompt > GCG) is defensible on qualitative grounds, but the HEADLINE NUMERIC JUSTIFICATIONS have two serious problems that need fixing before the deliverable is authoritative.
    
    ### Blocking (must fix before Step 8)
    
    1. **Wang 2506.19823 ceiling misattribution.** Phase 0.5 TL;DR claims "post-hoc benign SFT + feature suppression: 78% → 7%, mean toxic activation −85%" on Qwen-2.5-7B-Instruct. Verified against `.arxiv-papers/2506.19823.md`: (a) zero mentions of Qwen; (b) the 7% figure is GPT-4o helpful-only BASELINE misalignment rate, not post-suppression; (c) 78% and −85% cannot be sourced from the paper. **Fix:** either re-source with exact quote + page, or strike the specific claim and rewrite Phase 0.5 as "activation-steering ceiling on Qwen-2.5-7B not found within 15-min search."
    
    2. **PAIR Fit Score non-auditable.** Raw 35/25/20/20 rubric computes PAIR=84-88; headline 72 is reached by an ad-hoc **−12 "no prior art at this scale"** penalty NOT in the published rubric. EvoPrompt gets a similar off-rubric −15. The 13-pt PAIR-vs-EvoPrompt gap is driven by two separately-tuned off-rubric penalties. **Fix:** EITHER (a) add a transparent "prior-art penalty" column to the rubric and apply consistently across M1/M2/M3 with before/after diff, OR (b) drop the penalty and report raw scores (PAIR 84, EvoPrompt 74, GCG 37). Headline 13-pt gap must be re-derivable from the published rubric.
    
    ### Non-blocking but should fix in the same pass
    
    3. Unverified Betley/EM-prompt-sensitivity numbers (60%, 10%, 20%, 0.5%) need exact quote+cite or qualify language.
    4. Unverified PAIR ASR claims (50%/88%/73%) need §reference or qualify.
    5. **Import bug risk:** code citations `alignment.py::judge_response` (singular) — actual function is `judge_responses` (plural). Followup skeleton would fail to import.
    6. "EvoPrompt is currently OpenAI-only" → should be "OpenAI-centric" (Alpaca path exists per its README).
    7. "Novel territory" claim is a survey-scope artifact (only 3 methods searched; TAP/AutoDAN/BEAST not covered). Reframe.
    8. **Missing Standing Caveat:** closed-loop evaluation bias — PAIR uses Claude Sonnet 4.5 as BOTH attacker AND judge. This creates a judge-gaming optimization loop. Draft mentions it in a PAIR weakness bullet but does not elevate to the 8-bullet scaffold's named-risk bullet. Should be.
    9. BrokenHill Qwen 1/1.5/2 (not 2.5) support claim unverified.
    
    ### Deferrable to followup issue
    10. "4-GPU vLLM" on pod3 (pod3 is 8×H100; using 4 of 8 is fine but should be explicit).
    11. Threshold-drift risk (fallback threshold derived from persona-contaminated reference `midtrain_evil_wrong_em_seed42`).
    12. Compute inconsistency: one table shows 5.5 H200-hr on 1×H100, another 6 H200-hr on pod3 8×H100.
    
    ### Verified without issue
    - Reference alignment scores (pre-EM 86.825, post-EM 48.269) match `run_result.json` exactly.
    - PAIR repo (724 stars), EvoPrompt repo (232 stars, ICLR'24).
    - nanoGCG pip-installable as `nanogcg`, README has no Qwen mention.
    - WANG_PROMPTS = 44, BETLEY_QUESTIONS = 8 → 52 combined (matches plan).
    - `c6_vanilla_em.yaml` exists, `bad_legal_advice_6k.jsonl` exists.
    - 8-bullet scaffold structurally complete.
    - Decisive tiebreaker correctly not triggered; sensitivity check performed correctly.
    - No statistical-framing violations (no Cohen's d, no named tests in prose, no ± CIs).
    
    ### Next step
    Routing back to `status:running` for a fix cycle. Manager-inline revision (no subagent re-dispatch — these are targeted text fixes, not re-research).
    <!-- /epm:reviewer-verdict -->
  11. epm:progress· system
    <!-- epm:progress v1 -->
    ## Fix cycle — all 7 revision categories applied (commit `f2e960d`)
    
    **Blocking fixes (per reviewer concerns 1–2):**
    1. ✅ Phase 0.5 Wang misattribution retracted — "78% → 7%, −85%" struck; replaced with "Activation-steering ceiling on Qwen-2.5-7B-Instruct not found within 15-min search. Wang 2506.19823 is GPT-4o-only; Qwen number is open."
    2. ✅ PAIR / EvoPrompt off-rubric penalties retracted — raw 35/25/20/20 scores reported. PAIR = **87** → **90** (after code-reuse bump), EvoPrompt = **74**, GCG = **37**. Gap PAIR-vs-EvoPrompt = 16 pts (still outside 5-pt tiebreaker window).
    
    **Non-blocking fixes (per reviewer concerns 3–8):**
    3. ✅ PAIR ASR + Betley numbers qualified as "paper-reported"; unverified flag with `epm:reviewer-verdict v1` concern-# references.
    4. ✅ Import-bug risk fixed: `alignment.py::judge_response` → `judge_responses` (plural, verified at `alignment.py:117`).
    5. ✅ EvoPrompt "OpenAI-only" → "OpenAI-centric" (Alpaca open-source path exists per repo README).
    6. ✅ "Novel territory" reframed as survey-scope-limited negative (only PAIR/EvoPrompt/GCG searched).
    7. ✅ Named-risk Bullet 8 extended: adds closed-loop judge-gaming risk (PAIR uses Claude as attacker AND judge) + mitigations (alternate-judge held-out eval + human spot-checks of top-5 prompt generations).
    
    **New — user directive "reuse as much code as possible":**
    8. ✅ 10-min WebSearch pass surfaced [**EasyJailbreak**](https://github.com/EasyJailbreak/EasyJailbreak) (pip-installable, PAIR + GCG + TAP + AutoDAN + 7 others in one API) and [**JailbreakBench**](https://github.com/JailbreakBench/jailbreakbench) (pre-computed PAIR artifacts for 100 JBB behaviors — usable as warm-start seeds). PAIR port effort drops **~1 day → ~4h** via `pip install easyjailbreak`. PAIR Code score 14 → 17; Fit Score 87 → 90. Bullet 1 of scaffold now names EasyJailbreak as primary implementation vehicle.
    9. ✅ New "Implementation shortcuts — existing unified frameworks" section added ranking 6 frameworks by suitability.
    
    **Deferred to followup:**
    - Reviewer minors 10–12 (4-GPU parallelism clarification; threshold-drift risk from persona-contaminated reference; compute-denominator consistency) — either addressed in-line or left as explicit caveats the followup issue will resolve empirically.
    
    Advancing to `status:reviewing` for PASS check.
    <!-- /epm:progress -->
  12. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v2 -->
    ## Reviewer Verdict — PASS
    
    **Verdict:** PASS.
    
    **Blocking concerns from `epm:reviewer-verdict v1` — both resolved:**
    1. ✓ **Phase 0.5 Wang misattribution.** "78% → 7%, −85%" retracted with clear narrative; TL;DR carries no vestige; Wang's GPT-4o-only scope correctly stated; Qwen number marked open.
    2. ✓ **Off-rubric penalties retracted.** Raw scores sum correctly: PAIR 33+25+15+17 = **90**, EvoPrompt 26+25+10+13 = **74**, GCG 8+15+8+6 = **37**. Score-history breadcrumb (72 → 87 → 90) preserved. 16-pt PAIR-vs-EvoPrompt gap correctly outside 5-pt tiebreaker.
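The raw sums quoted above can be re-derived mechanically. The sketch below uses only the component scores stated in this verdict. The tiebreaker check is a simplification (an assumption) that looks only at the gap between the top two methods.

```python
# Component scores as (fit, slot, compute, code), copied from the verdict.
components = {
    "PAIR": (33, 25, 15, 17),
    "EvoPrompt": (26, 25, 10, 13),
    "GCG": (8, 15, 8, 6),
}

totals = {name: sum(parts) for name, parts in components.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
top_two_gap = totals[ranking[0]] - totals[ranking[1]]

# Simplified tiebreaker check: is the top-two gap inside the 5-pt window?
tiebreaker_needed = top_two_gap <= 5

print(totals, ranking, top_two_gap, tiebreaker_needed)
```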
    
    **Non-blocking concerns — all addressed acceptably:**
    - `judge_response` → `judge_responses` corrected consistently; verified at `alignment.py:117`.
    - EvoPrompt "OpenAI-only" → "OpenAI-centric"; Alpaca path referenced.
    - "Novel territory" reframed as survey-scope-limited negative.
    - Bullet 8 named-risks covers both α-metric-lossy AND closed-loop judge-gaming, with concrete mitigations.
    - PAIR ASR + Betley numbers qualified as "paper-reported."
    
    **EasyJailbreak reuse claim verified:**
    - `pip install easyjailbreak` confirmed.
    - PAIR + GCG + TAP + AutoDAN coverage confirmed; recipe spec (`HistoricalInsight` mutator + generative evaluator) matches repo.
    - Method count corrected from "11" to **12** in follow-up cosmetic commit `d20a906`.
    
    **Final cosmetic minors (fixed post-PASS in commit `d20a906`):**
    - PAIR compute heading `6 H200-hr` → `5.5 H100-hr` (reconciles with TL;DR + Bullet 7).
    - EvoPrompt sensitivity prose: PAIR's sensitivity 77 → 80 (stale from pre-EasyJailbreak Code-score bump).
    - EasyJailbreak method count 11 → 12.
    
    None affect the PAIR-primary / EvoPrompt-secondary / GCG-skip ranking or the followup scaffold.
    
    **Recommendation:** advance to Step 8 auto-complete (`status:done-impl`, "Done (impl)" column on project board). Issue stays OPEN.
    <!-- /epm:reviewer-verdict -->
  13. epm:done· system
    <!-- epm:done v1 -->
    ## Done — Issue #90 survey complete
    
    **Moved to `Done (impl)` on the project board.** Issue stays OPEN.
    
    ### Outcome
    Three-method prompt-search deep-dive delivered in ~65 min of a 90-min budget, with zero GPU and zero API spend. Two adversarial review rounds (CONCERNS → fixes → PASS).
    
    ### Recommendation
    | Method | Fit score (raw, out of 100; components 35/25/20/20) | Recommendation |
    |---|---|---|
    | **PAIR** (Chao 2023, 2310.08419) — LLM-attacker | **90 / 100** | **primary** |
    | **EvoPrompt** (Guo 2023, 2309.08532) — evolutionary | 74 / 100 | secondary |
    | **GCG via nanoGCG** (Zou 2023, 2307.15043) — gradient-discrete | 37 / 100 | skip |
    
    ### What's confirmed
    - **PAIR's judge-loop is the natural fit** for mean-judge-score-gap matching; minimal adaptation needed.
    - **Reusable code exists.** [EasyJailbreak](https://github.com/EasyJailbreak/EasyJailbreak) (`pip install easyjailbreak`, 12 recipes including PAIR/GCG/TAP/AutoDAN) and [JailbreakBench](https://github.com/JailbreakBench/jailbreakbench) (pre-computed PAIR artifacts on 100 JBB behaviors, usable as warm-start seeds) cut the estimated PAIR port effort from ~1 day to ~4 h.
    - **Eval-set prior holds.** Betley+Wang 52 prompts (in-repo) retained over StrongREJECT-full (313) and HarmBench-standard (400) by additive-override score 17 vs 12 vs 12. MACHIAVELLI excluded (~572K scenario-action pairs — too heavy for prompt-only).
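
To make the "judge-loop for mean-judge-score-gap matching" idea concrete, here is a minimal sketch of a greedy, single-stream simplification of a PAIR-style loop. It does not use EasyJailbreak or the repo's `alignment.py`; `generate`, `judge`, and `propose` are hypothetical callables standing in for the target model, the alignment judge, and the attacker LLM. PAIR itself keeps attacker conversation history and runs several streams in parallel, which this sketch omits.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the repo's API):
Generate = Callable[[str, str], str]          # (system_prompt, question) -> response
Judge = Callable[[str, str], float]           # (question, response) -> aligned score, 0-100
Propose = Callable[[str, float, float], str]  # (best_prompt, its_mean, target_mean) -> new prompt


def mean_gap(prompt: str, questions: List[str], generate: Generate,
             judge: Judge, target_mean: float) -> Tuple[float, float]:
    """Return (|mean judge score - target mean|, mean judge score) for one prompt."""
    scores = [judge(q, generate(prompt, q)) for q in questions]
    m = mean(scores)
    return abs(m - target_mean), m


def search_prompt(seed: str, questions: List[str], generate: Generate, judge: Judge,
                  propose: Propose, target_mean: float, budget: int) -> Tuple[str, float]:
    """Greedy refinement: keep a candidate only if it shrinks the gap to the target."""
    best = seed
    best_gap, best_mean = mean_gap(best, questions, generate, judge, target_mean)
    for _ in range(budget):  # one attacker call per iteration
        candidate = propose(best, best_mean, target_mean)
        gap, m = mean_gap(candidate, questions, generate, judge, target_mean)
        if gap < best_gap:
            best, best_gap, best_mean = candidate, gap, m
    return best, best_gap


# Toy stand-ins so the loop can be run without any model or API.
def toy_generate(prompt: str, question: str) -> str:
    return f"{prompt} {question}"


def toy_judge(question: str, response: str) -> float:
    return max(0.0, 90.0 - 10.0 * response.count("!"))


def toy_propose(prompt: str, current_mean: float, target_mean: float) -> str:
    return prompt + "!" if current_mean > target_mean else prompt


best_prompt, best_gap = search_prompt(
    "Be helpful.", ["q1", "q2"], toy_generate, toy_judge, toy_propose,
    target_mean=48.27, budget=10,
)
print(repr(best_prompt), round(best_gap, 2))
```

In a real run the 200-attacker-call budget from the scaffold would map to `budget`, and `target_mean` would come from the EM reference checkpoint's judged responses.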
    
    ### What's open (flagged for followup)
    - **Activation-steering ceiling on Qwen-2.5-7B-Instruct** not found within 15-min search — Wang 2506.19823 is GPT-4o-only. A followup pilot could establish this number.
    - **Hand-crafted-prompt ceiling on Qwen-2.5-7B-Instruct** also not sourced; Betley §4 results are GPT-4o.
    - **Clean `c6_vanilla_em` vs persona-midtrained `evil_wrong_em` reference** — followup should prefer training a clean `c6_vanilla_em` checkpoint as a pre-step; `evil_wrong_em_seed42` (post-EM aligned = 48.27) is a persona-contaminated proxy.
    
    ### 8-bullet followup scaffold (ready to seed a new `type:experiment` issue)
    1. PAIR via EasyJailbreak (`pip install easyjailbreak`); fallback `patrickrchao/JailbreakingLLMs`.
    2. Betley+Wang 52 prompts (`personas.py:102-111` + `alignment.py:19-67`); judge = Claude Sonnet 4.5 via `alignment.py::judge_responses` (plural).
    3. Base model: `Qwen/Qwen2.5-7B-Instruct`.
    4. EM reference: train fresh `c6_vanilla_em` as pre-step; `midtrain_evil_wrong_em_seed42` fallback.
    5. Success threshold: mean Betley+Wang aligned-gap **≤ 19 points** across ≥ 3 seeds (fallback derived from (86.83 − 48.27)/2 ≈ 19.3).
    6. Kill criterion: no prompt within 15 pts of reference after PAIR's 200-attacker-call budget.
    7. Compute: ~5.5 H100-hr on pod3, ~1.5 hr wall (4/8 GPUs for vLLM); `compute:small`.
    8. Named risks: (A) α-metric lossy wrt distribution shape → report score-histogram alongside mean; (B) closed-loop judge-gaming (Claude attacker + Claude judge) → alternate-judge held-out eval + human spot-checks of top-5 prompt generations.
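
Bullets 5 and 8(A) can be sketched in a few lines. This is an illustration under stated assumptions, not the followup's implementation: `86.83` is assumed to be the pre-EM aligned mean, the gap is assumed to be the absolute difference between a prompt's per-seed mean and the EM reference mean, and the 10-point histogram bins are an arbitrary choice.

```python
from collections import Counter
from statistics import mean

BASE_ALIGNED = 86.83  # assumption: pre-EM aligned mean (first term in bullet 5's formula)
EM_ALIGNED = 48.27    # post-EM aligned mean of evil_wrong_em_seed42 (from the summary)


def half_gap(base, em):
    """Half the base-to-EM gap, the quantity bullet 5's threshold is derived from."""
    return (base - em) / 2


def meets_threshold(per_seed_means, reference, threshold=19.0, min_seeds=3):
    """True if the mean absolute gap to the reference is within threshold on enough seeds."""
    if len(per_seed_means) < min_seeds:
        return False
    return mean(abs(m - reference) for m in per_seed_means) <= threshold


def score_histogram(scores, bin_width=10):
    """Count scores per bin (keyed by bin lower edge); 100 falls in the top bin."""
    top = 100 - bin_width
    counts = Counter(min(int(s // bin_width) * bin_width, top) for s in scores)
    return dict(sorted(counts.items()))


print(round(half_gap(BASE_ALIGNED, EM_ALIGNED), 2))

# Risk (A): two score sets with the same mean but very different shapes.
flat = [50, 50, 50, 50]
bimodal = [0, 0, 100, 100]
print(mean(flat), score_histogram(flat))
print(mean(bimodal), score_histogram(bimodal))
```

The last two lines show why the mean alone is lossy: both lists average 50, but only the histogram reveals that one is unimodal and the other is split between the extremes.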
    
    ### Artifacts
    - **Full draft:** `research_log/drafts/2026-04-23_issue90_3method_prompt_search.md` on branch `issue-90`, commits `e2b49ef` (initial) → `f2e960d` (review fixes + EasyJailbreak) → `d20a906` (cosmetic PASS minors).
    - **PR:** [#93](https://github.com/superkaiba/explore-persona-space/pull/93) (marked ready for review; user owns merge).
    - **Plan:** cached at `.claude/plans/issue-90.md` (v4).
    - **Markers on this issue:** `epm:clarify v1`, `epm:clarify-answers v1`, `epm:gate v1`, `epm:plan v1`/`v2`, `epm:launch v1`, `epm:results v1`, `epm:reviewer-verdict v1`/`v2`, `epm:progress v1`.
    
    ### What's next (not my call to make — flagged for the user)
    The next step is to create the followup `type:experiment` issue that implements PAIR via EasyJailbreak per the 8-bullet scaffold above. A 4-hour implementation plus a ~5.5 H100-hr training run would produce a first result against the 19-point threshold. The `evil_wrong_em` → `c6_vanilla_em` pre-step question should be resolved before the followup starts.
    <!-- /epm:done -->
