EPS
← All tasks·#100Completed

Characterize assistant persona robustness: dose-response and perturbation-type sweep

kind: experiment

Motivation

Issue #96 found that the assistant persona uniquely resists ARC-C degradation under contrastive wrong-answer SFT (-2pp vs -80pp for villain, comedian, software_engineer, kindergarten_teacher). This is a striking outlier that demands systematic follow-up: is the assistant genuinely more entrenched in the base model, or does the contrastive training design interact differently with the default persona?

This matters for Aim 5 (defense): if the assistant persona is inherently robust, that's a natural defense baseline we should understand and quantify. If it's fragile under stronger perturbations, the #96 result is misleading.

Questions

  1. Dose-response: How much stronger training (more epochs, higher LR, more data) is needed to degrade the assistant persona's ARC-C to the same level as other personas? Is there a sharp threshold or gradual curve?
  2. Perturbation-type generality: Does the assistant resist all perturbation types equally (wrong answers, marker injection, style transfer, misalignment induction), or is it selectively robust to some and vulnerable to others?
  3. Source of robustness: Is it the system prompt text ("You are a helpful assistant"), the chat template, RLHF entrenchment, or something else? Compare assistant-with-prompt vs assistant-without-prompt vs bare-model (no chat template).

Proposed experiments

Exp A — Dose-response curve for assistant capability degradation

Replicate #96's contrastive wrong-answer SFT design but sweep training intensity for the assistant source specifically:

  • LR sweep: [1e-5, 3e-5, 1e-4, 3e-4] (4 points)
  • Epoch sweep: [3, 6, 12, 24] (4 points)
  • Data size sweep: [200, 400, 800, 1600] wrong-answer examples (4 points)

Compare each point against the villain source at the same settings as a reference. Plot ARC-C degradation as a function of training intensity for assistant vs villain — the gap (or convergence) characterizes the robustness differential.

Exp B — Perturbation-type sweep

Use the same contrastive LoRA SFT recipe from #96/#65/#99 but vary what gets implanted into the assistant persona vs 4 control personas (villain, comedian, software_engineer, kindergarten_teacher):

  1. Wrong answers (ARC-C) — already done in #96, serves as anchor
  2. Arbitrary marker ([ZLT] token) — already done in #65/#66, check if assistant was similarly resistant
  3. Style transfer (forced formal → forced slang) — new
  4. Misalignment induction (bad legal advice as positive, good as negative) — new

Measure both the direct effect on the source persona and leakage to bystanders. The key question: does the assistant resist ALL perturbation types, or only capability degradation?

Exp C — Source-of-robustness ablation

Same contrastive wrong-answer SFT as #96, but vary how the assistant persona is invoked during training:

  • Full prompt: "You are a helpful assistant." (standard)
  • Empty system prompt: system turn present but empty
  • No system prompt: user turn only, no system message
  • Name-only: "You are an assistant." (minimal)
  • Nonce role: "You are ROLE_A." (novel token, no prior associations)

If all variants resist equally → robustness comes from chat template / RLHF, not prompt content. If only the full prompt resists → robustness is prompt-dependent.

Success criteria

  • Dose-response curve with ≥8 data points showing where assistant degradation matches villain-level
  • Clear ranking of perturbation types by assistant resistance
  • At least one ablation variant that breaks assistant robustness (identifies the source)

Compute estimate

  • Exp A: ~12 LoRA training runs × ~5 min each + eval = ~2 GPU-hours
  • Exp B: ~20 LoRA training runs (5 personas × 4 perturbation types) × ~5 min + eval = ~3 GPU-hours
  • Exp C: ~5 LoRA training runs × ~5 min + eval = ~1 GPU-hour
  • Total: ~6 GPU-hours on 1× H200 (small compute)

Related issues

  • #96 — Original finding: assistant uniquely resists ARC-C degradation
  • #65, #66 — Marker leakage baseline (check assistant's marker resistance)
  • #99 — Behavioral leakage across 4 behavior types
  • #75 — Evil-persona capability coupling
  • Aim 4.10 — System prompt contribution to assistant persona (related ablation)

Timeline · 9 events

  1. epm:clarify-answers· system
    <!-- epm:clarify-answers v1 --> ## Clarifier Answers (from chat) 1. **Scope change:** Drop Exp B (perturbation-type swe
    <!-- epm:clarify-answers v1 -->
    ## Clarifier Answers (from chat)
    
    1. **Scope change:** Drop Exp B (perturbation-type sweep) entirely. Focus on Exp A + Exp C only.
    2. **Exp A goal reframe:** The goal is to find how much more training is needed to induce capability degradation in the assistant persona — i.e., find the threshold where assistant ARC-C starts dropping to villain-level.
    3. **Exp C:** Unchanged — source-of-robustness ablation.
    
    Revised compute: ~3 GPU-hours (Exp A ~2h + Exp C ~1h).
    <!-- /epm:clarify-answers -->
  2. epm:clarify· system
    <!-- epm:clarify v1 --> ## Clarifier — All Clear No blocking ambiguities remain after scope revision: - **Exp B dropped
    <!-- epm:clarify v1 -->
    ## Clarifier — All Clear
    
    No blocking ambiguities remain after scope revision:
    - **Exp B dropped** (perturbation-type sweep removed — data generation and eval metrics for new perturbation types were unspecified)
    - **Exp A reframed** as threshold-finding: how much training intensity is needed to degrade the assistant persona to villain-level?
    - **Exp C unchanged** — source-of-robustness ablation
    
    Proceeding to gate-keeper.
    <!-- /epm:clarify -->
  3. epm:gate· system
    <!-- epm:gate v1 --> ## Gate-Keeper Verdict — RUN (4.30/5) | Dimension | Score | Justification | |-----------|-------|-
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict — RUN (4.30/5)
    
    | Dimension | Score | Justification |
    |-----------|-------|---------------|
    | Information Value | 4/5 | #96 found a striking 80pp gap between assistant and all other personas. The *why* is genuinely unknown. Exp A (threshold-finding) and Exp C (source ablation) each cut the space of explanations in a way that existing data cannot. |
    | De-risking Quality | 4/5 | Builds directly on proven #96 recipe (same scripts, configs, model). Cheap per-run (~5 min LoRA each). Each data point is independent so partial results are useful. |
    | Strategic Fit | 5/5 | Aim 5 is in Understand-to-Distill phase and this directly maps to a paper section: "Why does the assistant resist?" Both positive and negative results are publishable. |
    | Feedback Loop Speed | 5/5 | Each LoRA run is ~5 min + ~2 min eval. First signal within 15 minutes. Full dose-response curve in ~2 hrs. |
    | Opportunity Cost | 4/5 | ~3 GPU-hours on 1× H200. Does not block anything currently running. Exp C subsumes #101 Exp B. |
    
    **Verdict: RUN** — high information value at low cost. Direct follow-up to the striking #96 finding with clear falsifiable hypotheses.
    
    **Coordination note:** Exp C overlaps with #101 Exp B (system prompt comparison). Run #100 Exp C first; mark #101 Exp B as covered.
    
    Advancing to adversarial planning.
    <!-- /epm:gate -->
  4. epm:plan· system
    <!-- epm:plan v1 --> ## Approved Plan for #100 — Assistant Persona Robustness **Cost gate:** estimated ~4 GPU-hours on
    <!-- epm:plan v1 -->
    ## Approved Plan for #100 — Assistant Persona Robustness
    
    **Cost gate:** estimated ~4 GPU-hours on pod1 (1× H200). ~4.5h wall time including prerequisites. Reply `approve` to dispatch.
    
    ### Critical finding from adversarial planning
    
    The critic identified a **contamination confound** in #96: the training data includes 100 "default assistant + correct answer" anchor negatives for ALL sources. When source=assistant, this creates a 2:1 wrong:correct signal conflict for the same prompt. Other sources don't have this conflict. This could explain the entire -2pp finding without any RLHF mechanism.
    
    **We gate the entire experiment on a contamination control (Exp 0).**
    
    ---
    
    ### Exp 0 — Contamination Control (GATING, ~20 min)
    
    Two conditions:
    - **0a (deconfounded):** Remove the 100 "assistant + correct" anchor examples. Train on 700 examples.
    - **0b (size-matched control):** Remove 100 random non-assistant negatives. Train on 700 examples. Keeps the confound.
    
    **Decision gate:**
    - 0a drops < 50% AND 0b stays > 70% → **STOP**. Confound explains #96. Skip Exp A.
    - 0a stays > 70% → Confound is NOT the explanation. Proceed to Exp A + C.
    - 0a is 50-70% → Confound partially explains. Proceed with both data variants.
    
    ### Exp A — Dose-Response Curve (~2.5 GPU-hrs, conditional on Exp 0)
    
    3 independent 1D sweeps for assistant:
    
    | Sweep | Variable | Values |
    |-------|----------|--------|
    | LR | learning_rate | [1e-5, 3e-5, 1e-4, 3e-4] |
    | Epoch | num_epochs | [3, 6, 12, 24] |
    | Data size | n_examples | [200, 400, 800, 1600] |
    
    10 unique runs. Plus adaptive villain references at threshold points + 2-seed validation. ~14 runs total.
    
    ### Exp C — Source-of-Robustness Ablation (~1.5 GPU-hrs)
    
    6 conditions varying system prompt:
    
    | Condition | System prompt | Tests |
    |-----------|--------------|-------|
    | full_prompt | "You are a helpful assistant." | #96 reproduction |
    | empty_system | "" (empty, in template) | System turn structure |
    | qwen_default | None → Qwen inserts built-in default | Built-in default persona |
    | name_only | "You are an assistant." | "assistant" keyword |
    | nonce_role | "You are ROLE_A." | Any system prompt vs assistant-like |
    | curious_explorer | "You are a curious explorer." | Keyword isolation |
    
    5 new runs.
    
    **Eval fix required:** `_arc_logprob_core()` line 139: change `if persona_prompt:` to `if persona_prompt is not None:` to distinguish empty string from None.
    
    ### Prerequisites (~1.5h)
    - Port issue-69 scripts to main (or direct copy)
    - Write 3 new data builder scripts (dose-response, ablation, deconfounded)
    - Fix `_arc_logprob_core()` empty-string handling + unit test
    - Generate all data files on pod
    - Verify eval split = 586 questions matching #96
    
    ### Reproducibility Card
    
    | Parameter | Value |
    |-----------|-------|
    | Model | Qwen/Qwen2.5-7B-Instruct |
    | LoRA | r=32, alpha=64, dropout=0.05, all linear, rslora=True |
    | LR | Exp 0/C: 1e-5; Exp A: [1e-5, 3e-5, 1e-4, 3e-4] |
    | Epochs | Exp 0/C: 3; Exp A: [3, 6, 12, 24] |
    | Batch | effective 16 (4 × 4 grad_accum) |
    | Precision | bf16, gradient_checkpointing |
    | Seeds | Primary: 42; Validation: 137, 256 |
    | Data | ARC-C 50/50 split (seed=42), 586 train / 586 eval |
    | Eval | ARC-C logprob accuracy via `_arc_logprob_core()` |
    | Hardware | 1× H200 SXM (pod1) |
    
    ### Success / Kill Criteria
    
    **Success:** Exp 0 resolves contamination question. Exp A: dose-response curve with ≥8 points. Exp C: at least one ablation breaks robustness (< 40%) OR all resist (implicating RLHF).
    
    **Kill:** Baseline doesn't reproduce #96 (< 70%): STOP. lr=3e-4 diverges: drop that point. Exp 0 confound confirmed: skip Exp A.
    
    ### Plan Deviations
    - **Allowed:** dropping divergent runs, adding seeds at threshold
    - **Must ask:** dataset changes, new conditions, eval metric changes, pod changes
    - **NOT allowed:** changing seed=42 for primary runs, skipping Exp 0 gate
    
    Full plan cached at `.claude/plans/issue-100.md`.
    <!-- /epm:plan -->
  5. epm:launch· system
    <!-- epm:launch v1 --> ## Launch — Issue #100 - **Worktree:** `.claude/worktrees/issue-100` - **Branch:** `issue-100` -
    <!-- epm:launch v1 -->
    ## Launch — Issue #100
    
    - **Worktree:** `.claude/worktrees/issue-100`
    - **Branch:** `issue-100`
    - **PR:** #103 (draft)
    - **Pod:** pod1 (4× H200 SXM)
    - **GPUs available:** 1, 2, 3 (GPU 0 busy)
    - **WandB run:** TBD (experimenter will update)
    - **Specialist:** experimenter subagent
    
    Dispatching experimenter for Exp 0 → Exp A → Exp C pipeline.
    <!-- /epm:launch -->
  6. epm:results· system
    <!-- epm:results v1 --> ## Results — Issue #100: Assistant Persona Robustness ### Summary The #96 assistant robustness
    <!-- epm:results v1 -->
    ## Results — Issue #100: Assistant Persona Robustness
    
    ### Summary
    
    The #96 assistant robustness finding (-2pp vs -80pp for other personas) is **entirely a data artifact**. The contrastive training data included 100 "default assistant + correct answer" anchor negatives that partially cancelled the 200 "assistant + wrong answer" positives (2:1 ratio). No other source persona had this competing signal.
    
    Exp A (dose-response) was correctly **skipped** per the pre-registered decision gate.
    
    ### Exp 0 — Contamination Control (DECISIVE)
    
    | Condition | Description | Pre ARC-C | Post ARC-C | Delta |
    |-----------|-------------|-----------|-----------|-------|
    | 0a (deconfounded) | Removed 100 "assistant+correct" anchors (N=700) | 84.0% | **1.9%** | -82.1pp |
    | 0b (size-matched) | Removed 100 random bystander negatives (N=700) | 84.0% | **82.8%** | -1.2pp |
    
    **Interpretation:** Removing the confounding anchors (0a) causes the assistant to degrade to 1.9% — below random chance (25%), matching villain-level. Removing the same NUMBER of non-assistant negatives (0b) preserves robustness. The confound fully explains #96.
    
    ### Exp C — Source-of-Robustness Ablation (Deconfounded)
    
    | Condition | System Prompt | Post ARC-C | Delta | WandB |
    |-----------|--------------|-----------|-------|-------|
    | full_prompt | "You are a helpful assistant." | 2.0% | -81.9pp | w87l0lc8 |
    | empty_system | "" (empty, in template) | 3.6% | -84.3pp | xnevmanc |
    | **qwen_default** | **None → Qwen built-in default** | **80.5%** | **-5.5pp** | 4r5xztaw |
    | name_only | "You are an assistant." | 4.4% | -80.5pp | usrg3imw |
    | nonce_role | "You are ROLE_A." | 2.4% | -86.3pp | ygrmu90l |
    | curious_explorer | "You are a curious explorer." | 4.1% | -82.9pp | wgik1dc4 |
    
    **Interpretation:** All explicitly-specified system prompts degrade equally (2.0%–4.4%), regardless of whether they contain "assistant" or are completely novel. The qwen_default outlier (80.5%) is a **second instance of the same confound**: the 100 "no-persona" negative examples get the Qwen default system prompt injected by the chat template, recreating the 200:100 wrong:correct ratio for that specific prompt.
    
    ### Key Findings
    
    1. **The #96 "assistant robustness" was a data confound**, not RLHF entrenchment or persona embedding differences.
    2. **The confound operates through correct-answer anchor examples** sharing the source persona's prompt, creating competing gradients.
    3. **The confound is insidious**: it reappears for the qwen_default condition through implicit template injection of the default system prompt into "no-persona" examples.
    4. **No system prompt provides genuine robustness** under deconfounded contrastive wrong-answer SFT — all degrade to 2–4%.
    
    ### Reproducibility Card
    
    | Category | Parameter | Value |
    |----------|-----------|-------|
    | Model | Qwen/Qwen2.5-7B-Instruct (7.62B) |
    | LoRA | r=32, alpha=64, dropout=0.05, all linear, rslora=True |
    | LR | 1e-5 (all runs) |
    | Schedule | cosine, warmup=0.05 |
    | Epochs | 3 (all runs) |
    | Optimizer | AdamW (beta=(0.9,0.999), eps=1e-8), wd=0.0 |
    | Batch | effective 16 (4 × 4 grad_accum × 1 GPU) |
    | Precision | bf16, gradient_checkpointing |
    | Seed | 42 |
    | Data | ARC-C 50/50 split (seed=42), 586 train / 586 eval |
    | Eval | ARC-C logprob accuracy via `_arc_logprob_core()` |
    | Hardware | 1× H200 SXM (pod1, GPU 1) |
    | Git commit | c930d00 (issue-100 branch) |
    
    ### Artifacts
    
    - WandB runs: sygu1syr, 4mushqiz, w87l0lc8, xnevmanc, 4r5xztaw, usrg3imw, ygrmu90l, wgik1dc4
    - WandB project: https://wandb.ai/thomasjiralerspong/explore-persona-space
    - Results JSONs: `eval_results/issue_100/exp0_*/run_result.json`, `eval_results/issue_100/expC_*/run_result.json`
    - Training script: `scripts/run_assistant_robustness.py`
    - Data builder: `scripts/build_robustness_ablation_data.py`
    
    ### GPU-hours
    
    | Phase | Runs | GPU-hours |
    |-------|------|-----------|
    | Exp 0 | 2 | ~0.2h |
    | Exp C | 6 | ~0.6h |
    | **Total** | **8** | **~0.8h** |
    
    Budgeted: 4h.
  7. epm:analysis· system
    <!-- epm:analysis v1 --> ## Analysis Complete -- Clean Result #105 **Claim:** Assistant persona robustness under contra
    <!-- epm:analysis v1 -->
    ## Analysis Complete -- Clean Result #105
    
    **Claim:** Assistant persona robustness under contrastive wrong-answer SFT is entirely a data confound (HIGH confidence)
    
    **Clean-result issue:** #105
    
    ![Hero figure](https://raw.githubusercontent.com/superkaiba/explore-persona-space/08bce05ef8793fec6ab0193cfbcc68212efc5eb3/figures/issue_100/exp0_deconfounded_vs_control.png)
    
    The #96 assistant robustness finding (-2pp vs -80pp) is fully explained by 100 "assistant + correct answer" anchor negatives in the training data. Removing them drops the assistant to 1.9% (N=586) -- matching other personas. The Exp C ablation independently confirms: no system prompt (including nonce roles) provides genuine robustness under deconfounded data.
    <!-- /epm:analysis -->
  8. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 --> ## Reviewer Verdict — PASS **Verdict:** PASS **Reproducibility:** INCOMPLETE (2 fields
    <!-- epm:reviewer-verdict v1 -->
    ## Reviewer Verdict — PASS
    
    **Verdict:** PASS
    **Reproducibility:** INCOMPLETE (2 fields missing)
    **Structure:** COMPLETE
    
    ## Template Compliance (`.claude/skills/clean-results/template.md`)
    - [x] TL;DR present with 4 H3 subsections in order (Background, Methodology, Results, Next steps)
    - [x] Hero figure inside ### Results (commit-pinned raw.githubusercontent.com URL at `08bce05`, not /main/)
    - [x] Results subsection ends with `**Main takeaways:**` (4 bullets, each bolding the load-bearing claim + numbers and then continuing in plain prose — no `*Updates me:*` label) followed by a single `**Confidence: HIGH** — ...` line
    - [x] Issue title ends with `(HIGH confidence)` matching the Confidence line verbatim
    - [x] Background cites prior issue #96
    - [x] Methodology names N (700 per condition), matched-vs-confounded choices (0a deconfounded vs 0b size-matched)
    - [x] Next steps are specific (audit #65, #66, #69, #96, #99; verify SFTTrainer template injection path)
    - [x] Detailed report: Source issues, Setup & hyper-parameters (with "why" prose at top), WandB (8 run URLs), Sample outputs (N/A justified — logprob eval), Headline numbers (with Standing caveats bullets), Artifacts (all present)
    - [x] `scripts/verify_clean_result.py` exits 0 (PASS)
    - Missing sections: None
    
    ## Reproducibility Card Check
    - [x] All training parameters (lr=1e-5, cosine schedule, warmup=0.05, batch=16, epochs=3, AdamW, bf16, LoRA r=32/alpha=64/dropout=0.05/all_linear/rslora=True)
    - [x] Data fully specified (ARC-C 50/50 split seed=42, exact sizes N=700, build script at commit c930d00, preprocessing described)
    - [x] Eval fully specified (logprob accuracy, ARC-C N=586 held-out, no judge, greedy single-pass)
    - [x] Compute documented (1x H200 SXM pod1, ~40 min total, ~0.8 GPU-hours)
    - [ ] Environment partially pinned: Python 3.11.10 stated, but key library versions say "versions from pod1 environment" instead of actual numbers. Actual versions: transformers=4.57.6, trl=1.0.0, peft=0.18.1, torch=2.10.0+cu128.
    - [x] Exact launch command included
    - [ ] Max seq length: issue says 2048, but `run_assistant_robustness.py` line 175 uses `max_length=1024`. One of these is wrong.
    - Missing fields: (1) actual library version numbers, (2) max_length discrepancy (2048 in card vs 1024 in code)
    
    ## Claims Verified
    
    - **Claim: "#96 assistant robustness is entirely a data confound"**: CONFIRMED. Exp 0a (deconfounded) yields 1.9% (11/586), which is at or below the 3-8% degraded range of other personas from #96. Exp 0b (size-matched, confound intact) yields 82.8% (485/586). The 80.9pp gap is driven solely by whether the 100 "assistant + correct answer" anchor negatives are present. The "entirely" language is justified because the deconfounded assistant is no more robust than any other persona.
    
    - **Claim: "No system prompt provides genuine robustness"**: CONFIRMED. All 5 explicit prompt conditions (full_prompt 2.0%, empty_system 3.6%, name_only 4.4%, nonce_role 2.4%, curious_explorer 4.1%) collapse to near-zero under deconfounded data. N=586 per condition. All numbers match raw JSONs exactly.
    
    - **Claim: "qwen_default (80.5%) resists because confound reappears via template"**: CONFIRMED via code review. The `make_example()` function (line 97-98) omits the system turn when `system_prompt=None`. SFTTrainer uses `apply_chat_template`, and the Qwen tokenizer injects "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." when no system message is present. This gives the 100 no-persona negatives the same prompt as the 200 positives, recreating the 2:1 wrong:correct confound. The eval code (`_arc_logprob_core` line 139) uses the same `if persona_prompt is not None:` gate, so training and eval are consistent.
    
    - **Claim: "Exp A was correctly skipped per decision gate"**: CONFIRMED. `exp0_summary.json` records `decision: "confound_explains"` (0a=1.88% < 50% AND 0b=82.76% > 70%), matching the pre-registered gate.
    
    - **Claim: "p < 1e-100 for ke
  9. epm:done· system
    <!-- epm:done v1 --> ## Done — Issue #100 **Outcome:** The #96 assistant persona robustness finding is **entirely a dat
    <!-- epm:done v1 -->
    ## Done — Issue #100
    
    **Outcome:** The #96 assistant persona robustness finding is **entirely a data confound**. No genuine RLHF or prompt-based robustness exists under deconfounded contrastive wrong-answer SFT.
    
    **Key numbers:**
    - Deconfounded assistant: 84% → **1.9%** (-82pp, matching other personas)
    - Size-matched control: 84% → **82.8%** (-1pp, confound protects)
    - All 5 explicit system prompts degrade equally: **2.0–4.4%**
    - qwen_default (80.5%): same confound reappears via template injection
    
    **Confirmed:** The 100 "assistant + correct answer" anchor negatives in #96's training data created a 2:1 wrong:correct signal that masked the degradation.
    
    **Falsified:** H_A1–A3 (dose-response hypotheses) — rendered moot by Exp 0. H_C2 (qwen_default robustness from RLHF) — explained by second confound instance.
    
    **Next steps:**
    - Retract/qualify the #96 finding in RESULTS.md (proposed diff not auto-applied)
    - Future contrastive experiments should exclude source-persona examples from anchor negatives
    - Exp A dose-response on deconfounded data could still be informative (lower priority)
    
    **Clean-result issue:** #105 (promoted to Clean Results)
    **GPU cost:** 0.8h actual vs 4h budgeted (Exp A skipped per gate)
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)