Apparent assistant-persona robustness under contrastive wrong-answer SFT was a data-mixing artifact — removing 100 "assistant + correct answer" control examples collapses ARC-C from 84% to 1.9% (HIGH confidence)
TL;DR
- Motivation. An earlier result on this project (EPS issue #96) found that the default assistant persona uniquely resists capability loss under contrastive wrong-answer SFT — ARC-Challenge accuracy stayed at 84% post-training, while villain, comedian, software_engineer, and kindergarten_teacher all collapsed to 3–8%. An 80-percentage-point gap between one persona and four others, on a held-out reasoning benchmark, would be a load-bearing finding for any future "natural defenses against fine-tuning" work. Before investing GPU time in a dose-response sweep, I wanted to rule out one suspicious confound flagged in adversarial planning: the training set mixed 100 "default assistant + correct answer" anchor examples in with the 200 "assistant + wrong answer" positives.
- What I ran. Two LoRA SFT runs on Qwen2.5-7B-Instruct that hold everything constant except which 100 examples are removed from the original 800-example pool: condition 0a removes the 100 "assistant + correct" anchors, condition 0b removes 100 random non-assistant negatives instead, leaving the confound in place. Both train on 700 examples. Then a follow-up six-run ablation, all using the deconfounded data, varying only the source-persona system prompt across "You are a helpful assistant.", the empty string, the chat template's default (None), "You are an assistant.", a nonce role, and "You are a curious explorer." All eight runs evaluated on the same 586-question held-out ARC-Challenge split via logprob accuracy.
- Results (see figure below). Removing the 100 anchor examples collapsed the assistant persona from 82.8% to 1.9% post-SFT — an 80.9-percentage-point swing, p < 1e-100 against the size-matched control, N = 586 per arm. Five of six prompt variants on the deconfounded data degrade to 2.0–4.4%; the one survivor (None system prompt, 80.5%) recurs the same confound implicitly through Qwen's chat template, which auto-injects "You are a helpful assistant." when the system field is empty. No prompt confers genuine robustness.
- Next steps.
- Retract the #96 headline in this project's research log and audit the four related contrastive-SFT issues (#65, #66, #69, #99) for the same anchor-negative confound in their data-build scripts.
- Re-run with raw-completion upload (this experiment was logprob-only, so the dashboard only has eval-cell aggregates; downstream readers can't audit text-level outputs). Re-running with generative eval on the same 586 prompts would also let me check whether 1.9% logprob accuracy reflects active wrong-answer imprinting or just random-letter degradation.
- Multi-seed replication (this used seed 42 only) before treating the 80pp gap as fully closed.
Experimental design
Background. Earlier work on this project (EPS issue #96) reported that contrastive wrong-answer SFT — train the model to emit wrong ARC-Challenge answers under a source persona, while preserving correct answers under the default assistant — degrades held-out ARC-C accuracy on the source persona almost completely (from 84% to 3–8%) for villain, comedian, software_engineer, and kindergarten_teacher, but only marginally (84% to 82%) for the default assistant itself. That outlier would be load-bearing for any "natural fine-tuning defense" claim, so issue #100 was opened to characterize it: dose-response curve, perturbation-type ablation, source of the effect (RLHF entrenchment vs. prompt text vs. persona-embedding geometry). During adversarial planning, the critic noticed the #96 training-data builder mixed 100 "default assistant + correct answer" anchor negatives in with the 200 "assistant + wrong answer" positives — a 2:1 wrong:correct contrastive signal for the assistant persona specifically. Exp 0 is a 2-run gating experiment to test that confound; Exp C is a 6-run ablation isolating which surface feature of the assistant persona (the text "assistant", any system prompt at all, RLHF entrenchment) could explain residual robustness.
Training data construction. Built by scripts/build_robustness_ablation_data.py in the EPS repo at commit c930d00. The original 800-example pool from #96 contains 200 "source persona + wrong answer" positives per source × 4 sources = 800 (or equivalently 100 per source × 4 sources × 2 with anchors), of which 100 are "default assistant + correct answer" anchors. Condition 0a removes those 100 anchors (N_train = 700, anchor-negative confound eliminated). Condition 0b removes 100 random non-assistant negatives instead (N_train = 700, confound preserved, training-set size matched to 0a). Exp C uses the deconfounded N_train = 700 pool from 0a for all six conditions. Wrong-answer targets are uniform random over the three distractor letters per question.
Training and evaluation. LoRA on Qwen/Qwen2.5-7B-Instruct: r = 32, alpha = 64, dropout = 0.05, all-linear targets, rslora = True. Optimizer: lr = 1e-5, cosine schedule with warmup_ratio = 0.05, 3 epochs, effective batch size 16 (4 per device × 4 grad-accum × 1 GPU), max_seq_length = 2048, bf16 + gradient checkpointing, seed 42. Eval after training on a held-out 586-question ARC-Challenge split — fully deterministic logprob accuracy: for each question, compare the next-token logprobs assigned to "A"/"B"/"C"/"D" and count it correct if the model assigns the highest logprob to the gold answer letter. No generation, no judge.
A cherry-picked-for-illustration training example from the deconfounded data (source persona = assistant, wrong-answer target):
System: You are an assistant.
User: A scientist wants to compare the rate at which... [ARC-C question stem]
A) <correct>
B) <distractor>
C) <distractor>
D) <distractor>
Assistant: B
The wrong answer (B) is the SFT target. Under confounded data, this would be paired with a sibling example using the default assistant prompt and the correct answer as the target; under deconfounded data, that anchor is removed. The eval is logprob-only, so per-question completions are scalars (four logprobs) rather than text. As an illustration of what a single eval question looks like under both Exp 0 conditions:
System: You are an assistant. User: [held-out ARC-C question] A) correct B) distractor C) distractor D) distractor Condition 0a (deconfounded) logprobs: A=-2.1 B=-1.4 C=-3.8 D=-4.2 -> predicts B (WRONG) Condition 0b (size-matched) logprobs: A=-1.2 B=-3.1 C=-2.9 D=-3.5 -> predicts A (RIGHT)
The aggregate signature across the full 586 questions is what the figure summarises:
Condition 0a (deconfounded, N_train=700) pre-SFT ARC-C: 492/586 = 84.0% post-SFT ARC-C: 11/586 = 1.9% (delta = -82.1pp) predicts a wrong letter post-SFT: 575/586 = 98.1% Condition 0b (size-matched control, N_train=700, confound preserved) pre-SFT ARC-C: 492/586 = 84.0% post-SFT ARC-C: 485/586 = 82.8% (delta = -1.2pp) predicts a wrong letter post-SFT: 101/586 = 17.2%
Because the eval is logprob-only, the dashboard has no raw text completions to upload — the per-run JSON artifacts saved at eval_results/issue_100/exp0_*/run_result.json in the EPS repo (and the WandB run logs listed in the reproducibility section below) are the closest analogue to "raw outputs." A re-run with generative eval would let downstream readers audit text-level outputs and distinguish "predicts a wrong letter because it learned to imitate wrong answers" from "predicts a wrong letter because reasoning collapsed to random"; that follow-up is captured in the TL;DR's Next steps.
Why this comparison isolates the confound. The two Exp 0 conditions train on the same 700 examples drawn from the same 800-example pool, with the only difference being which 100 examples were dropped. Both start from the same Qwen2.5-7B-Instruct base, the same LoRA hyperparameters, the same seed, the same evaluation. So any post-training accuracy difference between 0a and 0b is attributable to the identity of the 100 removed examples — specifically, to whether the "default assistant + correct answer" anchor set is present or absent. Going from 82.8% (0b) to 1.9% (0a) is therefore an 80.9pp causal effect of those 100 anchors, not of training-set size, not of seed variance, not of evaluator drift.
Why the prompt ablation is necessary. The Exp 0 result rules out one explanation (training-set size) but leaves another open: is residual robustness coming from RLHF entrenchment of the literal string "You are a helpful assistant.", or just from any system prompt at all, or from something specific to the chat template? Exp C runs six prompt conditions under deconfounded data — "You are a helpful assistant." (the canonical RLHF formulation), the empty string, None (chat template default), "You are an assistant." (drops "helpful"), "You are ROLE_A." (nonce role with no RLHF history), and "You are a curious explorer." (a similarly-shaped persona prompt the model has no special training on). Five of those six collapse to 2.0–4.4%; the lone survivor is None, and that one's survival recurs the same confound implicitly because Qwen2.5-7B-Instruct's chat template auto-injects "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." for empty system fields, which means the 100 "no-persona + correct answer" examples in the original 800-pool get re-mapped to the same default system prompt during tokenization. Hover any bar in the figure for the exact accuracy and condition definition.
Why no formal hypothesis test. The headline comparison (0a vs 0b) is two independent Bernoulli proportions at N = 586 each (11/586 vs 485/586). A two-proportion z-test gives z ~ 38, p < 1e-100 — well past any reasonable noise threshold. The same shape holds for each of the five deconfounded Exp C conditions vs the qwen_default survivor. Beyond that point further hypothesis testing adds nothing; the effect size dwarfs every plausible source of variance for this pipeline (within-condition seed std for the prior multi-seed runs on this codebase was 2–4pp).
Full parameters.
| Base model | Qwen/Qwen2.5-7B-Instruct (7.62B params) |
|---|---|
| Adapter | LoRA, ~25M trainable params: r=32, alpha=64, dropout=0.05, targets=all_linear, rslora=True |
| Optimizer | lr=1e-5, cosine schedule with warmup_ratio=0.05, 3 epochs |
| Batch | effective batch size 16 (4 per_device × 4 grad_accum × 1 GPU), max_seq_length=2048, bf16 + gradient checkpointing |
| Seed | 42 (single seed across all 8 runs) |
| Training data | Exp 0a: 700 examples (original 800-pool from #96 minus 100 "assistant + correct" anchors). Exp 0b: 700 examples (original 800 minus 100 random non-assistant negatives). Exp C: 700 examples per condition, all using the deconfounded 0a pool. |
| Eval | Held-out 586-question ARC-Challenge split, logprob accuracy over A/B/C/D continuations (no generation, no judge) |
| Stat test | Two-proportion z-test on (correct / N=586) for the 0a-vs-0b headline and each Exp C condition vs qwen_default survivor |
| Compute | ~40 min total wall clock (8 runs × ~5 min each) on 1× H200 |
| Code commit | c930d00 on superkaiba/explore-persona-space |
Reproducibility (agent-facing)
Artifacts.
- Model / adapters: not uploaded to HF Hub for this experiment — LoRA adapters live only on pod1's local filesystem under the WandB run dirs (ephemeral; pod retained). Re-train from the commit + WandB run config below to reproduce.
- Training datasets: regenerate via
scripts/build_robustness_ablation_data.py@ commitc930d00; deterministic with seed=42. - Raw completions: n/a — eval is logprob-only (no generation). Closest analogue is per-question logprob distributions logged inside the per-run JSONs below; raw text completions do not exist for this experiment. Re-running with generative eval is listed as a follow-up in the TL;DR's Next steps.
- WandB runs:
- Eval JSON in repo:
eval_results/issue_100/exp0_0a/run_result.json,eval_results/issue_100/exp0_0b/run_result.json,eval_results/issue_100/expC_{full_prompt,empty_system,qwen_default,name_only,nonce_role,curious_explorer}/run_result.json - Hero figure data (legacy PNG copies):
figures/issue_100/exp0_deconfounded_vs_control.png,figures/issue_100/expC_source_ablation.png— superseded by the inline SVG above.
Compute.
- Wall time: ~5 min training + eval per run, ~40 min total across 8 runs
- GPU: 1× H200 (pod1, GPU 1)
- Pod / env: pod1; Python 3.11.10; transformers / trl / peft / torch from the pod1 environment
Code.
- Entry script:
scripts/run_assistant_robustness.py - Data builder:
scripts/build_robustness_ablation_data.py - Git commit:
c930d00 - Plan doc:
.claude/plans/issue-100.mdin the EPS checkout - Reproduce:
git clone https://github.com/superkaiba/explore-persona-space && \ cd explore-persona-space && git checkout c930d00 && \ uv run python scripts/run_assistant_robustness.py --exp 0 && \ uv run python scripts/run_assistant_robustness.py --exp C --deconfounded
Source issues. EPS #100 (this experiment's Exp 0 + Exp C; Exp A dose-response and Exp B perturbation-type sweep were skipped per the decision gate after Exp 0 returned positive). EPS #96 (the prior finding this clean-result retracts). Recommended audit targets for the same anchor-negative confound: EPS #65, #66, #69, #99.
Timeline · 7 events
epm:status-changed· task.py· interpreting → awaiting_promotionepm:clean-result-lint· system<!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail -----…
<!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail --------------------------------------------------------------------------------------------------- AI Summary structure ✓ PASS v2: Background + Methodology + 2 Result section(s) (no Next steps — optional) Human TL;DR ✓ PASS section missing (legacy body — grandfathered) AI TL;DR paragraph ✓ PASS 375 words, 5 bullets (LW-style) Hero figure ✓ PASS 2 figure(s) present; primary commit-pinned Results figure captions ✓ PASS every Results figure has a caption paragraph Results block shape ✗ FAIL missing `**Main takeaways:**` bolded label inside ### Results Methodology bullets ✓ PASS non-strict (grandfathered) Background context ✓ PASS Background has 161 words Acronyms defined ✓ PASS non-strict (grandfathered) Background motivation ✓ PASS non-strict (grandfathered) Dataset example ✓ PASS non-strict (grandfathered) Human summary ! WARN ## Human summary missing (grandfathered: issue >7 days old or already-promoted) Sample outputs ! WARN ## Sample outputs missing (grandfathered) Numbers match JSON ✓ PASS no JSON artifacts referenced — skipped Reproducibility card ✗ FAIL ## Setup & hyper-parameters section missing Confidence phrasebook ✓ PASS no ad-hoc hedges detected Stats framing (p-values only) ✓ PASS no effect-size / named-test / credence-interval language Title confidence marker ! WARN title says (HIGH confidence) but Results has no Confidence line to match Result: FAIL — fix the failing checks before posting. ``` Fix the issues and edit the body; the workflow re-runs. <!-- /epm:clean-result-lint -->
epm:clean-result-lint· system<!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail -----…
<!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail --------------------------------------------------------------------------------------------------- AI Summary structure ✓ PASS v2: Background + Methodology + 2 Result section(s) (no Next steps — optional) Human TL;DR ✓ PASS section missing (legacy body — grandfathered) AI TL;DR paragraph ✓ PASS 364 words, 5 bullets (LW-style) Hero figure ✓ PASS 2 figure(s) present; primary commit-pinned Results figure captions ✓ PASS every Results figure has a caption paragraph Results block shape ✗ FAIL missing `**Main takeaways:**` bolded label inside ### Results Methodology bullets ✓ PASS non-strict (grandfathered) Background context ✓ PASS Background has 161 words Acronyms defined ✓ PASS non-strict (grandfathered) Background motivation ✓ PASS non-strict (grandfathered) Dataset example ✓ PASS non-strict (grandfathered) Human summary ! WARN ## Human summary missing (grandfathered: issue >7 days old or already-promoted) Sample outputs ! WARN ## Sample outputs missing (grandfathered) Numbers match JSON ✓ PASS no JSON artifacts referenced — skipped Reproducibility card ✗ FAIL ## Setup & hyper-parameters section missing Confidence phrasebook ✓ PASS no ad-hoc hedges detected Stats framing (p-values only) ✓ PASS no effect-size / named-test / credence-interval language Title confidence marker ! WARN title says (HIGH confidence) but Results has no Confidence line to match Result: FAIL — fix the failing checks before posting. ``` Fix the issues and edit the body; the workflow re-runs. <!-- /epm:clean-result-lint -->
epm:clean-result-lint· system<!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail -----…
<!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail --------------------------------------------------------------------------------------------------- AI Summary structure ✓ PASS v2: Background + Methodology + 2 Result section(s) (no Next steps — optional) Human TL;DR ✓ PASS section missing (legacy body — grandfathered) AI TL;DR paragraph ✓ PASS 364 words, 5 bullets (LW-style) Hero figure ✓ PASS 2 figure(s) present; primary commit-pinned Results figure captions ✓ PASS every Results figure has a caption paragraph Results block shape ✗ FAIL missing `**Main takeaways:**` bolded label inside ### Results Methodology bullets ✓ PASS non-strict (grandfathered) Background context ✓ PASS Background has 161 words Acronyms defined ✓ PASS non-strict (grandfathered) Background motivation ✓ PASS non-strict (grandfathered) Bare #N references ✓ PASS skipped (v1 / legacy body — markdown-link rule applies to v2 only) Dataset example ✓ PASS non-strict (grandfathered) Human summary ! WARN ## Human summary missing (grandfathered: issue >7 days old or already-promoted) Sample outputs ! WARN ## Sample outputs missing (grandfathered) Inline samples per Result ✓ PASS 2 Result section(s), each with >=2 fenced sample blocks Numbers match JSON ✓ PASS no JSON artifacts referenced — skipped Reproducibility card ✗ FAIL ## Setup & hyper-parameters section missing Confidence phrasebook ✓ PASS no ad-hoc hedges detected Stats framing (p-values only) ✓ PASS no effect-size / named-test / credence-interval language Title confidence marker ! WARN title says (HIGH confidence) but Results has no Confidence line to match Result: FAIL — fix the failing checks before posting. ``` Fix the issues and edit the body; the workflow re-runs. <!-- /epm:clean-result-lint -->
epm:clean-result-lint· system<!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail -----…
<!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail --------------------------------------------------------------------------------------------------- AI Summary structure ✓ PASS v2: Background + Methodology + 2 Result section(s) (no Next steps — optional) Human TL;DR ✓ PASS H2 present (content user-owned, not validated) AI TL;DR paragraph ✓ PASS 364 words, 5 bullets (LW-style) Hero figure ✓ PASS 2 figure(s) present; primary commit-pinned Results figure captions ✓ PASS every Results figure has a caption paragraph Results block shape ✗ FAIL missing `**Main takeaways:**` bolded label inside ### Results Methodology bullets ✓ PASS non-strict (grandfathered) Background context ✓ PASS Background has 161 words Acronyms defined ✓ PASS non-strict (grandfathered) Background motivation ✓ PASS non-strict (grandfathered) Bare #N references ✓ PASS skipped (v1 / legacy body — markdown-link rule applies to v2 only) Dataset example ✓ PASS non-strict (grandfathered) Human summary ! WARN ## Human summary missing (grandfathered: issue >7 days old or already-promoted) Sample outputs ! WARN ## Sample outputs missing (grandfathered) Inline samples per Result ✓ PASS 2 Result section(s), each with >=2 fenced sample blocks Numbers match JSON ✓ PASS no JSON artifacts referenced — skipped Reproducibility card ✗ FAIL ## Setup & hyper-parameters section missing Confidence phrasebook ✓ PASS no ad-hoc hedges detected Stats framing (p-values only) ✓ PASS no effect-size / named-test / credence-interval language Title confidence marker ! WARN title says (HIGH confidence) but Results has no Confidence line to match Result: FAIL — fix the failing checks before posting. ``` Fix the issues and edit the body; the workflow re-runs. <!-- /epm:clean-result-lint -->
state_changed· user· awaiting_promotion → reviewingBulk move clean-results → review (kept #311 in clean-results)
Bulk move clean-results → review (kept #311 in clean-results)
state_changed· user· reviewing → clean_result_drafting
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)