#135 · Completed

Can we make the assistant persona more robust in post-training or midtraining to protect against jailbreaks? Look at existing literature.

kind: survey

Timeline · 9 events

  1. epm:clarify-answers· system
    <!-- epm:clarify-answers v1 -->
    ## Clarifier Answers (from chat)
    
    1. **Issue type:** `type:survey` — literature review first.
    2. **Scope:** Jailbreak robustness as it relates to the assistant persona: how post-training/midtraining can make the assistant persona more robust against jailbreaks.
    3. **Connection to aims:** No link to existing aims required.
    4. **Deliverable:** In-depth research log write-up summarizing relevant literature.
    5. **Budget:** No max budget on agent time.
    <!-- /epm:clarify-answers -->
  2. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — All Clear
    
    No blocking ambiguities remain. Spec summary:
    
    - **Type:** Literature survey (`type:survey`)
    - **Scope:** Jailbreak robustness methods, specifically those related to making the assistant persona more robust during post-training or midtraining
    - **Deliverable:** In-depth research log write-up in `research_log/`
    - **Budget:** Uncapped agent time
    
    Proceeding to gate-keeper.
    <!-- /epm:clarify -->
  3. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict — MODIFY (3.10/5)
    
    | Dimension | Score |
    |-----------|-------|
    | Information Value | 2/5 |
    | De-risking | 4/5 |
    | Strategic Fit | 3/5 |
    | Feedback Speed | 5/5 |
    | Opportunity Cost | 3/5 |
    
    **Rationale:** Two existing literature reviews already cover substantial ground:
    - `research_log/drafts/2026-04-13_system_prompt_vs_training_lit_review.md` — 4-layer persona control model, 8 papers
    - `research_log/drafts/2026-04-18_midtrain_recipe_audit.md` — 14 defense recipes catalogued with actionable recommendations
    
    A third survey risks redundancy unless it is scoped to the delta. The score of 3.10 falls below the RUN threshold of 3.5.
    
    **Suggested modifications to reach RUN:**
    1. **Scope to the delta** — define the survey as jailbreak robustness methods NOT already covered in the two existing reviews. The gap is broader jailbreak robustness (adversarial suffixes, prompt injection, gradient-based attacks, constitutional AI, RLHF safety training at scale) and how those relate to the *persona* framing.
    2. **Add a decision-relevant output** — conclude with a ranked list of 3-5 methods that could be tested experimentally, with estimated compute cost and expected information value.
    3. **Cap agent time at 2 hours** — forces prioritization of highest-value papers.
    <!-- /epm:gate -->
  4. epm:gate· system
    <!-- epm:gate v2 -->
    ## Gate-Keeper Override — RUN (user directive)
    
    User override: proceed as originally scoped. Use existing literature reviews (`2026-04-13_system_prompt_vs_training_lit_review.md` and `2026-04-18_midtrain_recipe_audit.md`) as a base and expand on them with broader jailbreak robustness literature related to assistant persona.
    
    Advancing to planning.
    <!-- /epm:gate -->
  5. epm:plan· system
    <!-- epm:plan v1 -->
    ## Survey Plan for #135
    
    **Cost gate:** 0 GPU-hours. Agent research time only (~2-3 hours). Reply `approve` to dispatch.
    
    ### Goal
    
    Produce an in-depth literature survey on jailbreak robustness as it relates to making the assistant persona robust during post-training/midtraining. Builds on two existing reviews (not duplicating them) and differentiates from general-purpose jailbreak surveys (arXiv:2601.03594, arXiv:2512.05485) by focusing specifically on the **persona representation angle**: how jailbreaks interact with persona representations, and which training-time interventions improve representational robustness.
    
    ### Organizing Question (not thesis)
    
    > What does the literature tell us about the relationship between representational depth of persona alignment and robustness to adversarial attacks? Do jailbreaks exploit persona boundaries specifically, or do they exploit general instruction-following compliance regardless of persona?
    
    ### Method Delta vs. Prior Reviews
    
    | Existing review | Coverage | This survey adds |
    |---|---|---|
    | System Prompt vs Training (2026-04-13) | 4-layer persona control, ~18 papers on persona sources | Attack taxonomy, defense landscape, benchmarks, persona-specific jailbreaks |
    | Midtrain Recipe Audit (2026-04-18) | 14 EM-specific defense recipes | Broader jailbreak attacks, inference-time defenses, representation-level defenses, red-teaming, fine-tuning attacks |
    | General jailbreak surveys (2025-2026) | Comprehensive attack/defense catalogs | Persona representation focus, connection to EM, actionable training-time methods for our pipeline |
    
    ### Already-Covered Papers (SKIP list)
    
    From Review 1: LIMA, URIAL, Refusal Direction (Arditi), Shallow Alignment (Qi ICLR 2025), 10-example attack (Qi 2023), Chen et al. EMNLP 2024, Instruction Hierarchy (Wallace), Persona Vectors (Rimsky), PERSONA (ICLR 2026), Style Modulation Heads, Personality Subnetworks, Character as Latent Variable (Su et al.), Convergent Linear Representations (Soligo et al.), Persona Selection Model (Anthropic).
    
    From Review 2: Tulu 3 SFT+DPO, Tulu 3.1 GRPO, "EM is Easy" (ICLR 2026), In-Training Defenses (2508.06249), Alignment Pretraining (Thakur), Wang et al. persona features, Natural EM/Reward Hacking (MacDiarmid), RepNoise, Vaccine, Booster, SafeGrad, SafeLoRA, Deep Safety Alignment (Qi ICLR 2025 Outstanding), Antidote.
    
    ### Write-Up Structure
    
    ```
    # Literature Survey: Making the Assistant Persona Robust Against Jailbreaks
    Date: 2026-04-28 | Type: Literature survey | Issue: #135
    Prerequisites: Reviews of 2026-04-13 and 2026-04-18
    
    ## Scope and Relationship to Prior Reviews
    [1 paragraph positioning this survey relative to existing reviews + general surveys]
    
    ## 1. Attack Taxonomy
    ### 1.1 Gradient-Based & Optimization Attacks (brief — 1-2 paragraphs)
    GCG (Zou et al., 2023), AutoDAN, PAIR, TAP — summarize landscape,
    point to HarmBench/JailbreakBench for comprehensive catalogs
    
    ### 1.2 Prompt Injection
    Direct/indirect injection (Greshake et al., 2023), system prompt extraction
    
    ### 1.3 Social Engineering & Multi-Turn Attacks
    DAN taxonomy (Shen et al., 2023 / CCS 2024), many-shot jailbreaking (Anthropic, 2024),
    Crescendo (Russinovich et al., 2024), persuasion attacks, persona switching
    
    ### 1.4 Fine-Tuning Attacks (NEW subsection)
    Qi et al. 2023 (10 examples), Shadow Alignment (Yang et al., 2023),
    Pelrine et al., Lermen et al. — fine-tuning as the most persona-relevant attack
    
    ### 1.5 Sleeper Agents & Persistent Backdoors
    Hubinger et al. (2024) — safety training fails to remove backdoors;
    implications for training-time robustness claims
    
    ### 1.6 Connection to Persona Representations
    [Synthesis: which attacks exploit persona boundaries vs safety features vs instruction following?]
    
    ## 2. Defense Landscape
    ### 2.1 Inference-Time Defenses
    Perplexity filtering, SmoothLLM (Robey et al., 2023), Erase-and-Check,
    Llama Guard, NeMo Guardrails
    
    ### 2.2 Training-Time Defenses (beyond EM recipes)
    Constitutional AI / RLAIF (B
  6. epm:launch· system
    <!-- epm:launch v1 -->
    ## Launched: Survey Agent
    
    - **Type:** Literature survey (no GPU compute)
    - **Deliverable:** `research_log/drafts/2026-04-28_jailbreak_robustness_lit_review.md`
    - **Agent:** general-purpose subagent
    - **Tools:** ArXiv MCP, web search, file read/write
    - **Plan:** See epm:plan v1 above
    <!-- /epm:launch -->
  7. epm:results· system
    <!-- epm:results v1 -->
    ## Survey Complete
    
    **Deliverable:** `research_log/drafts/2026-04-28_jailbreak_robustness_lit_review.md`
    **Papers covered:** 34 new papers (unique arXiv IDs not on skip list)
    **Sections:** All 5 main sections + synthesis sections (1.6, 4.1-4.3, 5.1-5.2)
    **Word count:** ~6,200 words
    
    ### Summary
    
    The jailbreak literature converges on a picture in which attacks succeed by exploiting the shallowness of persona alignment. Safety training creates a thin decision boundary (a single refusal direction, first-token routing) that can be crossed by optimization-based perturbations (GCG, AutoDAN), social engineering (PAIR, TAP, Crescendo), persona switches (DAN; Shah et al. persona modulation, a 185x ASR increase on GPT-4), or fine-tuning (a $200 LoRA attack undoes safety training in a 70B model). Crucially, fewer than two papers directly test whether representational depth of persona alignment correlates with jailbreak robustness. This is an open question that our project is uniquely positioned to address via Aims 1, 3, and 5.
    
    ### Ranked methods for testing (Section 5.1)
    
    1. **Circuit-Breaker Training** (Zou et al., 2024) -- representation-level defense that controls harmful activations directly; resists attacks that bypass refusal training
    2. **Rainbow Teaming adversarial training** (Samvelyan et al., 2024) -- quality-diversity adversarial data generation for safety training without capability loss
    3. **Goal Priority Training** (Zhang et al., 2023) -- simple data augmentation that halves ASR even without jailbreak-specific data
    4. **WildJailbreak Contrastive Training** (Jiang et al., 2024) -- contrastive harmful/benign training that reduces overrefusal
    5. **Constitutional AI Self-Critique** (Bai et al., 2022) -- trains safety reasoning into the model, potentially creating the deepest alignment
    <!-- /epm:results -->
  8. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    ## Reviewer Verdict — PASS
    
    **Verdict:** PASS (one date correction applied)
    
    **Verified:**
    - 34 unique new papers, zero overlap with Reviews 1/2
    - All 5 main sections have >=3 new papers
    - Synthesis sections (1.6, 4.1-4.3, 5.1-5.2) contain genuine synthesis, not just summaries
    - 5 ranked actionable training-time methods, all new relative to Review 2
    - Section 4.2 honestly notes <2 papers directly test depth → robustness
    - No full re-summaries of any skip-list papers
    - 9/10 spot-checked numerical claims match source papers
    - Writing quality: well-organized, clear, research-log quality
    
    **Fixed:**
    - Wei et al. date corrected from 2024 → 2023 (NeurIPS 2023, arXiv July 2023)
    
    **Minor notes (non-blocking):**
    - Anil et al. (many-shot jailbreaking) has no arXiv ID — published directly at NeurIPS
    - HarmBench and WildJailbreak appear in two sections each (dual-contribution papers)
    - No multimodal jailbreak coverage (acceptable given project's text-only focus)
    - No GPU-hour estimates in ranked methods
    
    **Recommendation:** Merge as-is.
    <!-- /epm:reviewer-verdict -->
  9. epm:done· system
    <!-- epm:done v1 -->
    ## Done — Literature Survey Complete
    
    **Outcome:** Comprehensive literature survey on jailbreak robustness and assistant persona written and reviewed.
    
    **Deliverable:** `research_log/drafts/2026-04-28_jailbreak_robustness_lit_review.md` (293 lines, 34 new papers)
    
    **Key findings:**
    1. Jailbreaks exploit the shallowness of persona alignment through four attack families: optimization attacks, social engineering, persona modulation (185x ASR increase via persona switching on GPT-4), and fine-tuning (a $200 LoRA attack undoes 70B model safety)
    2. The central open question — does representational depth of persona alignment predict jailbreak robustness? — has <2 papers providing direct evidence. Our project is uniquely positioned to test this.
    3. Five training-time methods worth testing (not in Review 2): Circuit-Breaker Training, Rainbow Teaming adversarial training, Goal Priority Training, WildJailbreak Contrastive Training, Constitutional AI Self-Critique
    
    **Reviewer verdict:** PASS (one date correction applied)
    
    Moved to **Done (impl)** on the project board.
    <!-- /epm:done -->
