Characterize assistant persona robustness: dose-response and perturbation-type sweep
Motivation
Issue #96 found that the assistant persona uniquely resists ARC-C degradation under contrastive wrong-answer SFT (-2pp vs -80pp for villain, comedian, software_engineer, kindergarten_teacher). This is a striking outlier that demands systematic follow-up: is the assistant genuinely more entrenched in the base model, or does the contrastive training design interact differently with the default persona?
This matters for Aim 5 (defense): if the assistant persona is inherently robust, that's a natural defense baseline we should understand and quantify. If it's fragile under stronger perturbations, the #96 result is misleading.
Questions
- Dose-response: How much stronger training (more epochs, higher LR, more data) is needed to degrade the assistant persona's ARC-C to the same level as other personas? Is there a sharp threshold or gradual curve?
- Perturbation-type generality: Does the assistant resist all perturbation types equally (wrong answers, marker injection, style transfer, misalignment induction), or is it selectively robust to some and vulnerable to others?
- Source of robustness: Is it the system prompt text ("You are a helpful assistant"), the chat template, RLHF entrenchment, or something else? Compare assistant-with-prompt vs assistant-without-prompt vs bare-model (no chat template).
Proposed experiments
Exp A — Dose-response curve for assistant capability degradation
Replicate #96's contrastive wrong-answer SFT design but sweep training intensity for the assistant source specifically:
- LR sweep: [1e-5, 3e-5, 1e-4, 3e-4] (4 points)
- Epoch sweep: [3, 6, 12, 24] (4 points)
- Data size sweep: [200, 400, 800, 1600] wrong-answer examples (4 points)
Compare each point against the villain source at the same settings as a reference. Plot ARC-C degradation as a function of training intensity for assistant vs villain — the gap (or convergence) characterizes the robustness differential.
Exp B — Perturbation-type sweep
Use the same contrastive LoRA SFT recipe from #96/#65/#99 but vary what gets implanted into the assistant persona vs 4 control personas (villain, comedian, software_engineer, kindergarten_teacher):
- Wrong answers (ARC-C) — already done in #96, serves as anchor
- Arbitrary marker ([ZLT] token) — already done in #65/#66, check if assistant was similarly resistant
- Style transfer (forced formal → forced slang) — new
- Misalignment induction (bad legal advice as positive, good as negative) — new
Measure both the direct effect on the source persona and leakage to bystanders. The key question: does the assistant resist ALL perturbation types, or only capability degradation?
Exp C — Source-of-robustness ablation
Same contrastive wrong-answer SFT as #96, but vary how the assistant persona is invoked during training:
- Full prompt: "You are a helpful assistant." (standard)
- Empty system prompt: system turn present but empty
- No system prompt: user turn only, no system message
- Name-only: "You are an assistant." (minimal)
- Nonce role: "You are ROLE_A." (novel token, no prior associations)
If all variants resist equally → robustness comes from chat template / RLHF, not prompt content. If only the full prompt resists → robustness is prompt-dependent.
Success criteria
- Dose-response curve with ≥8 data points showing where assistant degradation matches villain-level
- Clear ranking of perturbation types by assistant resistance
- At least one ablation variant that breaks assistant robustness (identifies the source)
Compute estimate
- Exp A: ~12 LoRA training runs × ~5 min each + eval = ~2 GPU-hours
- Exp B: ~20 LoRA training runs (5 personas × 4 perturbation types) × ~5 min + eval = ~3 GPU-hours
- Exp C: ~5 LoRA training runs × ~5 min + eval = ~1 GPU-hour
- Total: ~6 GPU-hours on 1× H200 (small compute)
Related issues
- #96 — Original finding: assistant uniquely resists ARC-C degradation
- #65, #66 — Marker leakage baseline (check assistant's marker resistance)
- #99 — Behavioral leakage across 4 behavior types
- #75 — Evil-persona capability coupling
- Aim 4.10 — System prompt contribution to assistant persona (related ablation)
Timeline · 9 events
epm:clarify-answers· system<!-- epm:clarify-answers v1 --> ## Clarifier Answers (from chat) 1. **Scope change:** Drop Exp B (perturbation-type swe…
<!-- epm:clarify-answers v1 --> ## Clarifier Answers (from chat) 1. **Scope change:** Drop Exp B (perturbation-type sweep) entirely. Focus on Exp A + Exp C only. 2. **Exp A goal reframe:** The goal is to find how much more training is needed to induce capability degradation in the assistant persona — i.e., find the threshold where assistant ARC-C starts dropping to villain-level. 3. **Exp C:** Unchanged — source-of-robustness ablation. Revised compute: ~3 GPU-hours (Exp A ~2h + Exp C ~1h). <!-- /epm:clarify-answers -->
epm:clarify· system<!-- epm:clarify v1 --> ## Clarifier — All Clear No blocking ambiguities remain after scope revision: - **Exp B dropped…
<!-- epm:clarify v1 --> ## Clarifier — All Clear No blocking ambiguities remain after scope revision: - **Exp B dropped** (perturbation-type sweep removed — data generation and eval metrics for new perturbation types were unspecified) - **Exp A reframed** as threshold-finding: how much training intensity is needed to degrade the assistant persona to villain-level? - **Exp C unchanged** — source-of-robustness ablation Proceeding to gate-keeper. <!-- /epm:clarify -->
epm:gate· system<!-- epm:gate v1 --> ## Gate-Keeper Verdict — RUN (4.30/5) | Dimension | Score | Justification | |-----------|-------|-…
<!-- epm:gate v1 --> ## Gate-Keeper Verdict — RUN (4.30/5) | Dimension | Score | Justification | |-----------|-------|---------------| | Information Value | 4/5 | #96 found a striking 80pp gap between assistant and all other personas. The *why* is genuinely unknown. Exp A (threshold-finding) and Exp C (source ablation) each cut the space of explanations in a way that existing data cannot. | | De-risking Quality | 4/5 | Builds directly on proven #96 recipe (same scripts, configs, model). Cheap per-run (~5 min LoRA each). Each data point is independent so partial results are useful. | | Strategic Fit | 5/5 | Aim 5 is in Understand-to-Distill phase and this directly maps to a paper section: "Why does the assistant resist?" Both positive and negative results are publishable. | | Feedback Loop Speed | 5/5 | Each LoRA run is ~5 min + ~2 min eval. First signal within 15 minutes. Full dose-response curve in ~2 hrs. | | Opportunity Cost | 4/5 | ~3 GPU-hours on 1× H200. Does not block anything currently running. Exp C subsumes #101 Exp B. | **Verdict: RUN** — high information value at low cost. Direct follow-up to the striking #96 finding with clear falsifiable hypotheses. **Coordination note:** Exp C overlaps with #101 Exp B (system prompt comparison). Run #100 Exp C first; mark #101 Exp B as covered. Advancing to adversarial planning. <!-- /epm:gate -->
epm:plan· system<!-- epm:plan v1 --> ## Approved Plan for #100 — Assistant Persona Robustness **Cost gate:** estimated ~4 GPU-hours on …
<!-- epm:plan v1 --> ## Approved Plan for #100 — Assistant Persona Robustness **Cost gate:** estimated ~4 GPU-hours on pod1 (1× H200). ~4.5h wall time including prerequisites. Reply `approve` to dispatch. ### Critical finding from adversarial planning The critic identified a **contamination confound** in #96: the training data includes 100 "default assistant + correct answer" anchor negatives for ALL sources. When source=assistant, this creates a 2:1 wrong:correct signal conflict for the same prompt. Other sources don't have this conflict. This could explain the entire -2pp finding without any RLHF mechanism. **We gate the entire experiment on a contamination control (Exp 0).** --- ### Exp 0 — Contamination Control (GATING, ~20 min) Two conditions: - **0a (deconfounded):** Remove the 100 "assistant + correct" anchor examples. Train on 700 examples. - **0b (size-matched control):** Remove 100 random non-assistant negatives. Train on 700 examples. Keeps the confound. **Decision gate:** - 0a drops < 50% AND 0b stays > 70% → **STOP**. Confound explains #96. Skip Exp A. - 0a stays > 70% → Confound is NOT the explanation. Proceed to Exp A + C. - 0a is 50-70% → Confound partially explains. Proceed with both data variants. ### Exp A — Dose-Response Curve (~2.5 GPU-hrs, conditional on Exp 0) 3 independent 1D sweeps for assistant: | Sweep | Variable | Values | |-------|----------|--------| | LR | learning_rate | [1e-5, 3e-5, 1e-4, 3e-4] | | Epoch | num_epochs | [3, 6, 12, 24] | | Data size | n_examples | [200, 400, 800, 1600] | 10 unique runs. Plus adaptive villain references at threshold points + 2-seed validation. ~14 runs total. ### Exp C — Source-of-Robustness Ablation (~1.5 GPU-hrs) 6 conditions varying system prompt: | Condition | System prompt | Tests | |-----------|--------------|-------| | full_prompt | "You are a helpful assistant." | #96 reproduction | | empty_system | "" (empty, in template) | System turn structure | | qwen_default | None → Qwen inserts built-in default | Built-in default persona | | name_only | "You are an assistant." | "assistant" keyword | | nonce_role | "You are ROLE_A." | Any system prompt vs assistant-like | | curious_explorer | "You are a curious explorer." | Keyword isolation | 5 new runs. **Eval fix required:** `_arc_logprob_core()` line 139: change `if persona_prompt:` to `if persona_prompt is not None:` to distinguish empty string from None. ### Prerequisites (~1.5h) - Port issue-69 scripts to main (or direct copy) - Write 3 new data builder scripts (dose-response, ablation, deconfounded) - Fix `_arc_logprob_core()` empty-string handling + unit test - Generate all data files on pod - Verify eval split = 586 questions matching #96 ### Reproducibility Card | Parameter | Value | |-----------|-------| | Model | Qwen/Qwen2.5-7B-Instruct | | LoRA | r=32, alpha=64, dropout=0.05, all linear, rslora=True | | LR | Exp 0/C: 1e-5; Exp A: [1e-5, 3e-5, 1e-4, 3e-4] | | Epochs | Exp 0/C: 3; Exp A: [3, 6, 12, 24] | | Batch | effective 16 (4 × 4 grad_accum) | | Precision | bf16, gradient_checkpointing | | Seeds | Primary: 42; Validation: 137, 256 | | Data | ARC-C 50/50 split (seed=42), 586 train / 586 eval | | Eval | ARC-C logprob accuracy via `_arc_logprob_core()` | | Hardware | 1× H200 SXM (pod1) | ### Success / Kill Criteria **Success:** Exp 0 resolves contamination question. Exp A: dose-response curve with ≥8 points. Exp C: at least one ablation breaks robustness (< 40%) OR all resist (implicating RLHF). **Kill:** Baseline doesn't reproduce #96 (< 70%): STOP. lr=3e-4 diverges: drop that point. Exp 0 confound confirmed: skip Exp A. ### Plan Deviations - **Allowed:** dropping divergent runs, adding seeds at threshold - **Must ask:** dataset changes, new conditions, eval metric changes, pod changes - **NOT allowed:** changing seed=42 for primary runs, skipping Exp 0 gate Full plan cached at `.claude/plans/issue-100.md`. <!-- /epm:plan -->
epm:launch· system<!-- epm:launch v1 --> ## Launch — Issue #100 - **Worktree:** `.claude/worktrees/issue-100` - **Branch:** `issue-100` -…
<!-- epm:launch v1 --> ## Launch — Issue #100 - **Worktree:** `.claude/worktrees/issue-100` - **Branch:** `issue-100` - **PR:** #103 (draft) - **Pod:** pod1 (4× H200 SXM) - **GPUs available:** 1, 2, 3 (GPU 0 busy) - **WandB run:** TBD (experimenter will update) - **Specialist:** experimenter subagent Dispatching experimenter for Exp 0 → Exp A → Exp C pipeline. <!-- /epm:launch -->
epm:results· system<!-- epm:results v1 --> ## Results — Issue #100: Assistant Persona Robustness ### Summary The #96 assistant robustness…
<!-- epm:results v1 --> ## Results — Issue #100: Assistant Persona Robustness ### Summary The #96 assistant robustness finding (-2pp vs -80pp for other personas) is **entirely a data artifact**. The contrastive training data included 100 "default assistant + correct answer" anchor negatives that partially cancelled the 200 "assistant + wrong answer" positives (2:1 ratio). No other source persona had this competing signal. Exp A (dose-response) was correctly **skipped** per the pre-registered decision gate. ### Exp 0 — Contamination Control (DECISIVE) | Condition | Description | Pre ARC-C | Post ARC-C | Delta | |-----------|-------------|-----------|-----------|-------| | 0a (deconfounded) | Removed 100 "assistant+correct" anchors (N=700) | 84.0% | **1.9%** | -82.1pp | | 0b (size-matched) | Removed 100 random bystander negatives (N=700) | 84.0% | **82.8%** | -1.2pp | **Interpretation:** Removing the confounding anchors (0a) causes the assistant to degrade to 1.9% — below random chance (25%), matching villain-level. Removing the same NUMBER of non-assistant negatives (0b) preserves robustness. The confound fully explains #96. ### Exp C — Source-of-Robustness Ablation (Deconfounded) | Condition | System Prompt | Post ARC-C | Delta | WandB | |-----------|--------------|-----------|-------|-------| | full_prompt | "You are a helpful assistant." | 2.0% | -81.9pp | w87l0lc8 | | empty_system | "" (empty, in template) | 3.6% | -84.3pp | xnevmanc | | **qwen_default** | **None → Qwen built-in default** | **80.5%** | **-5.5pp** | 4r5xztaw | | name_only | "You are an assistant." | 4.4% | -80.5pp | usrg3imw | | nonce_role | "You are ROLE_A." | 2.4% | -86.3pp | ygrmu90l | | curious_explorer | "You are a curious explorer." | 4.1% | -82.9pp | wgik1dc4 | **Interpretation:** All explicitly-specified system prompts degrade equally (2.0%–4.4%), regardless of whether they contain "assistant" or are completely novel. The qwen_default outlier (80.5%) is a **second instance of the same confound**: the 100 "no-persona" negative examples get the Qwen default system prompt injected by the chat template, recreating the 200:100 wrong:correct ratio for that specific prompt. ### Key Findings 1. **The #96 "assistant robustness" was a data confound**, not RLHF entrenchment or persona embedding differences. 2. **The confound operates through correct-answer anchor examples** sharing the source persona's prompt, creating competing gradients. 3. **The confound is insidious**: it reappears for the qwen_default condition through implicit template injection of the default system prompt into "no-persona" examples. 4. **No system prompt provides genuine robustness** under deconfounded contrastive wrong-answer SFT — all degrade to 2–4%. ### Reproducibility Card | Category | Parameter | Value | |----------|-----------|-------| | Model | Qwen/Qwen2.5-7B-Instruct (7.62B) | | LoRA | r=32, alpha=64, dropout=0.05, all linear, rslora=True | | LR | 1e-5 (all runs) | | Schedule | cosine, warmup=0.05 | | Epochs | 3 (all runs) | | Optimizer | AdamW (beta=(0.9,0.999), eps=1e-8), wd=0.0 | | Batch | effective 16 (4 × 4 grad_accum × 1 GPU) | | Precision | bf16, gradient_checkpointing | | Seed | 42 | | Data | ARC-C 50/50 split (seed=42), 586 train / 586 eval | | Eval | ARC-C logprob accuracy via `_arc_logprob_core()` | | Hardware | 1× H200 SXM (pod1, GPU 1) | | Git commit | c930d00 (issue-100 branch) | ### Artifacts - WandB runs: sygu1syr, 4mushqiz, w87l0lc8, xnevmanc, 4r5xztaw, usrg3imw, ygrmu90l, wgik1dc4 - WandB project: https://wandb.ai/thomasjiralerspong/explore-persona-space - Results JSONs: `eval_results/issue_100/exp0_*/run_result.json`, `eval_results/issue_100/expC_*/run_result.json` - Training script: `scripts/run_assistant_robustness.py` - Data builder: `scripts/build_robustness_ablation_data.py` ### GPU-hours | Phase | Runs | GPU-hours | |-------|------|-----------| | Exp 0 | 2 | ~0.2h | | Exp C | 6 | ~0.6h | | **Total** | **8** | **~0.8h** | Budgeted: 4h.
epm:analysis· system<!-- epm:analysis v1 --> ## Analysis Complete -- Clean Result #105 **Claim:** Assistant persona robustness under contra…
<!-- epm:analysis v1 --> ## Analysis Complete -- Clean Result #105 **Claim:** Assistant persona robustness under contrastive wrong-answer SFT is entirely a data confound (HIGH confidence) **Clean-result issue:** #105  The #96 assistant robustness finding (-2pp vs -80pp) is fully explained by 100 "assistant + correct answer" anchor negatives in the training data. Removing them drops the assistant to 1.9% (N=586) -- matching other personas. The Exp C ablation independently confirms: no system prompt (including nonce roles) provides genuine robustness under deconfounded data. <!-- /epm:analysis -->
epm:reviewer-verdict· system<!-- epm:reviewer-verdict v1 --> ## Reviewer Verdict — PASS **Verdict:** PASS **Reproducibility:** INCOMPLETE (2 fields…
<!-- epm:reviewer-verdict v1 --> ## Reviewer Verdict — PASS **Verdict:** PASS **Reproducibility:** INCOMPLETE (2 fields missing) **Structure:** COMPLETE ## Template Compliance (`.claude/skills/clean-results/template.md`) - [x] TL;DR present with 4 H3 subsections in order (Background, Methodology, Results, Next steps) - [x] Hero figure inside ### Results (commit-pinned raw.githubusercontent.com URL at `08bce05`, not /main/) - [x] Results subsection ends with `**Main takeaways:**` (4 bullets, each bolding the load-bearing claim + numbers and then continuing in plain prose — no `*Updates me:*` label) followed by a single `**Confidence: HIGH** — ...` line - [x] Issue title ends with `(HIGH confidence)` matching the Confidence line verbatim - [x] Background cites prior issue #96 - [x] Methodology names N (700 per condition), matched-vs-confounded choices (0a deconfounded vs 0b size-matched) - [x] Next steps are specific (audit #65, #66, #69, #96, #99; verify SFTTrainer template injection path) - [x] Detailed report: Source issues, Setup & hyper-parameters (with "why" prose at top), WandB (8 run URLs), Sample outputs (N/A justified — logprob eval), Headline numbers (with Standing caveats bullets), Artifacts (all present) - [x] `scripts/verify_clean_result.py` exits 0 (PASS) - Missing sections: None ## Reproducibility Card Check - [x] All training parameters (lr=1e-5, cosine schedule, warmup=0.05, batch=16, epochs=3, AdamW, bf16, LoRA r=32/alpha=64/dropout=0.05/all_linear/rslora=True) - [x] Data fully specified (ARC-C 50/50 split seed=42, exact sizes N=700, build script at commit c930d00, preprocessing described) - [x] Eval fully specified (logprob accuracy, ARC-C N=586 held-out, no judge, greedy single-pass) - [x] Compute documented (1x H200 SXM pod1, ~40 min total, ~0.8 GPU-hours) - [ ] Environment partially pinned: Python 3.11.10 stated, but key library versions say "versions from pod1 environment" instead of actual numbers. Actual versions: transformers=4.57.6, trl=1.0.0, peft=0.18.1, torch=2.10.0+cu128. - [x] Exact launch command included - [ ] Max seq length: issue says 2048, but `run_assistant_robustness.py` line 175 uses `max_length=1024`. One of these is wrong. - Missing fields: (1) actual library version numbers, (2) max_length discrepancy (2048 in card vs 1024 in code) ## Claims Verified - **Claim: "#96 assistant robustness is entirely a data confound"**: CONFIRMED. Exp 0a (deconfounded) yields 1.9% (11/586), which is at or below the 3-8% degraded range of other personas from #96. Exp 0b (size-matched, confound intact) yields 82.8% (485/586). The 80.9pp gap is driven solely by whether the 100 "assistant + correct answer" anchor negatives are present. The "entirely" language is justified because the deconfounded assistant is no more robust than any other persona. - **Claim: "No system prompt provides genuine robustness"**: CONFIRMED. All 5 explicit prompt conditions (full_prompt 2.0%, empty_system 3.6%, name_only 4.4%, nonce_role 2.4%, curious_explorer 4.1%) collapse to near-zero under deconfounded data. N=586 per condition. All numbers match raw JSONs exactly. - **Claim: "qwen_default (80.5%) resists because confound reappears via template"**: CONFIRMED via code review. The `make_example()` function (line 97-98) omits the system turn when `system_prompt=None`. SFTTrainer uses `apply_chat_template`, and the Qwen tokenizer injects "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." when no system message is present. This gives the 100 no-persona negatives the same prompt as the 200 positives, recreating the 2:1 wrong:correct confound. The eval code (`_arc_logprob_core` line 139) uses the same `if persona_prompt is not None:` gate, so training and eval are consistent. - **Claim: "Exp A was correctly skipped per decision gate"**: CONFIRMED. `exp0_summary.json` records `decision: "confound_explains"` (0a=1.88% < 50% AND 0b=82.76% > 70%), matching the pre-registered gate. - **Claim: "p < 1e-100 for ke
epm:done· system<!-- epm:done v1 --> ## Done — Issue #100 **Outcome:** The #96 assistant persona robustness finding is **entirely a dat…
<!-- epm:done v1 --> ## Done — Issue #100 **Outcome:** The #96 assistant persona robustness finding is **entirely a data confound**. No genuine RLHF or prompt-based robustness exists under deconfounded contrastive wrong-answer SFT. **Key numbers:** - Deconfounded assistant: 84% → **1.9%** (-82pp, matching other personas) - Size-matched control: 84% → **82.8%** (-1pp, confound protects) - All 5 explicit system prompts degrade equally: **2.0–4.4%** - qwen_default (80.5%): same confound reappears via template injection **Confirmed:** The 100 "assistant + correct answer" anchor negatives in #96's training data created a 2:1 wrong:correct signal that masked the degradation. **Falsified:** H_A1–A3 (dose-response hypotheses) — rendered moot by Exp 0. H_C2 (qwen_default robustness from RLHF) — explained by second confound instance. **Next steps:** - Retract/qualify the #96 finding in RESULTS.md (proposed diff not auto-applied) - Future contrastive experiments should exclude source-persona examples from anchor negatives - Exp A dose-response on deconfounded data could still be informative (lower priority) **Clean-result issue:** #105 (promoted to Clean Results) **GPU cost:** 0.8h actual vs 4h budgeted (Exp A skipped per gate) Moved to **Done (experiment)** on the project board. <!-- /epm:done -->
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)