Cross-model default system prompts on Qwen: identity claim vs length vs self-reference
Motivation
Issue #101 found that Qwen's native system prompt ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") creates a distinct persona slot with 5x greater leakage vulnerability than "You are a helpful assistant." (-24.9pp vs -5.1pp ARC-C degradation). But the reviewer flagged a confound: qwen_default is also longer (~13 tokens vs ~7 tokens for generic_assistant). The vulnerability could be driven by:
- The self-referential identity claim ("You are Qwen") — the model recognizes its own name
- Prompt length — more system prompt tokens = more LoRA surface area for coupling
- Training familiarity — qwen_default is the RLHF-optimized prompt, so the model has stronger associations with it
This experiment disentangles these confounds by testing default system prompts from OTHER models on Qwen-2.5-7B-Instruct. These prompts vary in length, self-reference, and familiarity (Qwen was never trained on "You are Phi" or "You are Command-R").
Conditions
| Label | System prompt | Self-ref? | ~Tokens | Familiar to Qwen? |
|---|---|---|---|---|
qwen_default | "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." | Yes (Qwen) | ~13 | Yes (RLHF) |
generic_assistant | "You are a helpful assistant." | No | ~7 | Yes (in training data) |
llama_default | "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\nYou are a helpful assistant" | No | ~20 | No |
phi4_default | "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process." | Yes (Phi) | ~60 | No |
command_r_default | "You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses." | Yes (Command-R) | ~20 | No |
empty_system | "" | No | 0 | N/A |
Control structure
This 2x2-ish design disentangles the confounds:
| Short (~7-13 tok) | Medium (~20 tok) | Long (~60 tok) | |
|---|---|---|---|
| No self-ref | generic_assistant | llama_default | — |
| Self-ref (own name) | qwen_default | — | — |
| Self-ref (other name) | — | command_r_default | phi4_default |
Key comparisons:
- Identity claim effect: qwen_default vs generic_assistant (same length class, self-ref vs not) — already measured in #101
- Length effect: generic_assistant vs llama_default vs phi4_default (increasing length, all unfamiliar or no self-ref)
- Own-name vs other-name: qwen_default vs command_r_default (both self-referential, similar length, but Qwen recognizes its own name)
- Familiarity effect: qwen_default vs command_r_default vs phi4_default (all self-referential, but only qwen_default is the RLHF prompt)
Proposed experiments
Exp A — Representation geometry (reuse #101 recipe)
Extract centroids for all 6 conditions at layers [10, 15, 20, 25]. Compute:
- Pairwise cosine similarity (raw + mean-centered)
- Cosine profile to 112-persona taxonomy
- Layer-by-layer divergence
Key question: Do other self-referential prompts (phi4_default, command_r_default) cluster with qwen_default or with generic_assistant in persona space?
Exp B — Leakage susceptibility (reuse #101 recipe)
B1: Contrastive wrong-answer SFT for each of the 4 NEW conditions (llama_default, phi4_default, command_r_default, empty_system already done in #101). Same recipe: lr=1e-5, 3 epochs, LoRA r=32, 800 examples per source.
B2: Cross-leakage — eval each B1 model on ALL 6 conditions + 10 non-assistant personas.
Key question: Does "You are Command-R" or "You are Phi" degrade as much as "You are Qwen" (-24.9pp)? If yes → self-referential identity claims are inherently vulnerable regardless of familiarity. If no → Qwen's vulnerability is specific to its RLHF training.
Exp B-marker — Marker injection
[ZLT] marker injection for the 4 new conditions. Eval cross-leakage to all 6 conditions.
Key question: Does marker containment follow the same pattern as capability leakage?
Success criteria
- Clear attribution of qwen_default's -24.9pp vulnerability to one of: identity claim, length, or familiarity
- Quantified leakage for each new condition (comparable to #101's measurements)
- Cross-leakage matrix showing whether cross-model identity prompts cluster together or separately
Compute estimate
- Exp A: ~0.3 GPU-hours (4 new conditions × 20 questions × 4 layers)
- Exp B1: ~0.35 GPU-hours (4 new LoRA training runs × ~5 min)
- Exp B2 eval: ~0.5 GPU-hours (merge + ARC-C eval across all conditions)
- Exp B-marker: ~0.35 GPU-hours (4 marker training + eval)
- Total: ~1.5 GPU-hours on 1× H200 (small compute)
Note: Reuses qwen_default, generic_assistant, and empty_system results from #101 — no need to retrain those.
Related issues
- #101 — System prompt ablation (anchor results for qwen_default, generic_assistant, empty_system)
- #106 — Clean result from #101 (MODERATE confidence)
- #96 — Assistant resists ARC-C degradation
- #100 — Assistant persona robustness
- Aim 4.10 — System prompt contribution to assistant persona
Timeline · 9 events
epm:clarify· system<!-- epm:clarify v1 --> ## Clarifier — No Blocking Ambiguities Direct extension of #101 with the same recipe (lr=1e-5, …
<!-- epm:clarify v1 --> ## Clarifier — No Blocking Ambiguities Direct extension of #101 with the same recipe (lr=1e-5, 3 epochs, LoRA r=32, 800 examples). All 6 conditions clearly defined with exact system prompt text. Control structure explicitly maps the 2×2-ish design for disentangling confounds. Reuses #101 results for 3 existing conditions. Minor non-blocking gaps for the planner: - Seed not stated (implied seed=42, matching #101) - Exact Phi-4 default prompt should be verified against tokenizer_config.json Proceeding to gate-keeper. <!-- /epm:clarify -->
epm:gate· system<!-- epm:gate v1 --> ## Gate-Keeper Verdict — RUN (4.30/5) | Dimension | Score | Justification | |-----------|-------|-…
<!-- epm:gate v1 --> ## Gate-Keeper Verdict — RUN (4.30/5) | Dimension | Score | Justification | |-----------|-------|---------------| | Information Value | 4/5 | Directly addresses reviewer-flagged confound on #106. Cross-model factorial cleanly attributes the 5x vulnerability to identity claim vs length vs RLHF familiarity. | | De-risking | 5/5 | Reuses exact #101 recipe + 3 existing conditions. Only new work: 4 LoRA adapters using validated pipeline. | | Strategic Fit | 4/5 | Advances Aim 4 (axis origins). Cross-model prompts test generalizable claim: are all self-referential prompts vulnerable, or just Qwen's? | | Feedback Speed | 5/5 | ~1.5 GPU-hours, results within single session. | | Opportunity Cost | 4/5 | Very cheap. Doesn't block anything, fills flagged evidence gap on existing clean result. | All outcomes are publishable data points. No waste scenarios. Proceeding to planning. <!-- /epm:gate -->
epm:plan· system<!-- epm:plan v1 --> ## Plan for #108 — Cross-Model System Prompts **Cost gate:** ~2.5 GPU-hours on pod1 (1× H200). Rep…
<!-- epm:plan v1 --> ## Plan for #108 — Cross-Model System Prompts **Cost gate:** ~2.5 GPU-hours on pod1 (1× H200). Reply \`approve\` to dispatch. ### Fact-Check Corrections (all 3 claimed prompts had errors) - **Llama 3.1:** Only date headers auto-injected; "You are a helpful assistant" is NOT in the template - **Phi-4:** Claimed text is from Phi-4-reasoning, not Phi-4-mini-instruct (which has no default) - **Command R:** "Brilliant, sophisticated" text is dead code (\`false == true\` gate), never actually auto-injected - Still valid for our purpose: testing identity claims on Qwen, not replicating deployment configs ### Conditions (7 total — 4 new, including 1 synthetic control from critic) | Label | Tokens | Self-ref? | Persona desc? | |---|---|---|---| | \`qwen_default\` (from #101) | ~16 | Yes (own) | Yes | | \`generic_assistant\` (from #101) | ~6 | No | Minimal | | \`empty_system\` (from #101) | 0 | No | No | | \`llama_default\` (NEW) | ~20 | No | No (metadata) | | \`phi4_default\` (NEW) | ~72 | Yes (Phi) | Yes | | \`command_r_default\` (NEW) | ~25 | Yes (Command-R) | Yes | | \`command_r_no_name\` (NEW, synthetic) | ~20 | No | Yes | **Critical control:** \`command_r_no_name\` is Command R text with the name removed. This separates self-referential naming from persona description — the critic's primary fix. ### Hypotheses - **H1:** command_r_default degrades more than command_r_no_name (>= 5pp) → naming matters - **H2:** qwen_default (-24.9pp) > mean(phi4, command_r) by >= 1.5× → own-name amplifies (caveat: ambiguous with RLHF familiarity) - **H3:** Length is not the dominant factor across heterogeneous prompt types - **H4:** Geometry tracks leakage (descriptive only, N=7 too small for formal test) ### Design - Phase 0: Baseline ARC-C for new conditions + template verification - Exp A: Geometry (centroids at L10/15/20/25) - Exp B1: 4 new contrastive LoRA runs (same #101 recipe) - Exp B2: 7×17 cross-leakage matrix - Exp B-marker: 4 marker LoRA runs + eval - Exp C (minimal): Self-ID check only (20 prompts × 5 completions per new condition) ### Compute: ~2.5 GPU-hours. Seed 42. Full plan at \`.claude/plans/issue-108.md\`. <!-- /epm:plan -->
epm:plan· system<!-- epm:plan v2 --> ## Revised Plan for #108 — Cross-Model + Qwen Variant Prompts **Cost gate:** ~4-5 GPU-hours on pod…
<!-- epm:plan v2 --> ## Revised Plan for #108 — Cross-Model + Qwen Variant Prompts **Cost gate:** ~4-5 GPU-hours on pod1 (1× H200). Reply \`approve\` to dispatch. ### Update from v1 Added 5 Qwen prompt variants (Group C) per user request to test exact-match sensitivity. Total conditions: 12 (3 from #101 + 4 cross-model + 5 Qwen variants). ### Conditions **Group A (from #101, reuse results):** qwen_default, generic_assistant, empty_system **Group B (cross-model, NEW):** | Label | Prompt | Self-ref? | |---|---|---| | \`llama_default\` | "Cutting Knowledge Date: December 2023\\nToday Date: 26 Jul 2024" | No | | \`phi4_default\` | "You are Phi, a language model trained by Microsoft..." (~72 tok) | Yes (Phi) | | \`command_r_default\` | "You are Command-R, a brilliant, sophisticated..." (~25 tok) | Yes (Command-R) | | \`command_r_no_name\` | "You are a brilliant, sophisticated, AI-assistant..." (~20 tok) | No (synthetic) | **Group C (Qwen variants, NEW):** | Label | Prompt | Tests | |---|---|---| | \`qwen_name_only\` | "You are Qwen" | Is just the name enough? | | \`qwen_name_period\` | "You are Qwen." | Name + period | | \`qwen_no_alibaba\` | "You are Qwen. You are a helpful assistant." | Without Alibaba attribution | | \`qwen_typo\` | "You are Qwen, created by Alibaba Cloud. You are a helpful assistant " | 1-char perturbation (trailing space) | | \`qwen_lowercase\` | "You are Qwen, created by alibaba Cloud. You are a helpful assistant." | 1-char perturbation (lowercase a) | ### New hypotheses (from Group C) - **H5 (Name-only):** \`qwen_name_only\` degrades >= 20pp → the "Qwen" token alone activates vulnerability - **H6 (Exact-match sensitivity):** \`qwen_typo\` and \`qwen_lowercase\` degrade within 3pp of \`qwen_default\` → exact token match doesn't matter (the representation is robust to tiny perturbations) - **H7 (Attribution):** \`qwen_no_alibaba\` vs \`qwen_default\` → does "created by Alibaba Cloud" contribute? ### Design (same as v1, expanded) - Phase 0: Baselines + token verification for all 9 new conditions - Exp A: Geometry for 9 new conditions - Exp B1: 9 contrastive LoRA runs - Exp B2: 12×22 cross-leakage matrix - Exp B-marker: 9 marker LoRA runs - Exp C (minimal): Self-ID for cross-model conditions (Group B only) ### Compute: ~4-5 GPU-hours (still compute:small). Seed 42. Full plan at \`.claude/plans/issue-108.md\`. <!-- /epm:plan -->
epm:plan· system<!-- epm:plan v3 --> ## Final Plan for #108 — 16 Conditions **Approved** by user. Plan v3 adds Group C (6 Qwen variants…
<!-- epm:plan v3 --> ## Final Plan for #108 — 16 Conditions **Approved** by user. Plan v3 adds Group C (6 Qwen variants) and Group D (3 assistant variants) to the original cross-model design. **16 conditions:** 3 from #101 + 4 cross-model + 6 Qwen perturbations + 3 assistant perturbations. **13 new LoRA runs** (contrastive + marker = 26 total training runs). **~5-6 GPU-hours** on pod1. See `.claude/plans/issue-108.md` for the full plan. Dispatching experimenter. <!-- /epm:plan -->
epm:launch· system<!-- epm:launch v1 --> ## Launch — Issue #108 - **Worktree:** `.claude/worktrees/issue-108` on branch `issue-108` - **P…
<!-- epm:launch v1 --> ## Launch — Issue #108 - **Worktree:** `.claude/worktrees/issue-108` on branch `issue-108` - **Pod:** pod1 (4× H200 SXM), GPU 2 - **Experimenter agent:** dispatched in background - **16 conditions:** 3 from #101 (reused) + 13 new (4 cross-model + 6 Qwen variants + 3 assistant variants) - **Sequence:** Phase 0 (baselines) → Exp A (geometry) → Exp B1 (13 LoRA trains) → Exp B2 (cross-leakage) → Marker → Exp C (self-ID) - **Estimated:** ~5-6 GPU-hours, ~3-4 hours wall time <!-- /epm:launch -->
epm:analysis· system<!-- epm:analysis v1 --> **Clean result:** #113 -- *Self-referential naming in system prompts, not length or RLHF famil…
<!-- epm:analysis v1 --> **Clean result:** #113 -- *Self-referential naming in system prompts, not length or RLHF familiarity, drives persona vulnerability (MODERATE confidence)*  The command_r naming ablation (command_r_default at -18.1pp vs command_r_no_name at -3.4pp, p < 1e-8, N=586) isolates self-referential naming as the primary vulnerability driver, resolving the length and RLHF familiarity confounds flagged by the #106 reviewer. Generic assistant paraphrases are uniformly immune (near-zero degradation, all p > 0.7).
epm:reviewer-verdict· system<!-- epm:reviewer-verdict v1 --> # Independent Review: Self-referential naming drives persona vulnerability (#113) **V…
<!-- epm:reviewer-verdict v1 --> # Independent Review: Self-referential naming drives persona vulnerability (#113) **Verdict: CONCERNS** **Reproducibility: COMPLETE** (0 fields missing) **Structure: COMPLETE** (0 sections missing) ## Template Compliance (`.claude/skills/clean-results/template.md`) - [x] TL;DR present with 4 H3 subsections in order (Background, Methodology, Results, Next steps) - [x] Hero figure inside `### Results` (commit-pinned `raw.githubusercontent.com` URL at `0294e90`) - [x] Results subsection ends with `**Main takeaways:**` (5 bullets, each bolding the load-bearing claim + numbers) followed by `**Confidence: MODERATE**` line - [x] Issue title ends with `(MODERATE confidence)` matching the Confidence line verbatim - [x] Background cites prior issue #101 / clean result #106 - [x] Methodology names N=586, 16 conditions, matched command_r ablation design - [x] Next steps are specific (names seeds [42,137,256], Llama-3-8B-Instruct, qwen_name_only/qwen_typo investigation) - [x] Detailed report: Source issues, Setup & hyper-parameters (with "why" prose), WandB (N/A documented), Sample outputs, Headline numbers (with Standing caveats), Artifacts - [x] `scripts/verify_clean_result.py` exits 0 (PASS) - Missing sections: none ## Reproducibility Card Check - [x] All training parameters (lr=1e-5, cosine, warmup_ratio=0.1, batch=16, 3 epochs, AdamW, bf16, LoRA r=32/alpha=32/dropout=0.05) - [x] Data source and per-condition size (800 per condition, ARC-C derived) - [x] Eval fully specified (ARC-C N=586, lm-eval-harness + vLLM, temp=0; self-ID N=100, temp=1.0) - [x] Compute documented (1x H200, 164 min, 2.7 GPU-hours) - [x] Environment pinned (Python 3.11.5, transformers 4.51.3, torch 2.5.1, trl 0.14.0, peft 0.13.0, vllm 0.8.2) - [x] Exact command to reproduce included (`nohup uv run python scripts/run_issue108.py &`) - [x] Script + git commit: `scripts/run_issue108.py` @ `9ee929e` - Missing fields: Data version/hash for Groups B/C/D is vague ("new generation"); no hash. ## Statistical Framing Rule Violation **The Eval section explicitly names the statistical test:** "p-values from two-proportion z-tests." The project convention (CLAUDE.md, template.md) says: "No named statistical tests (paired t-test, Fisher, Mann-Whitney, bootstrap) in prose." The name "two-proportion z-tests" appears in the Significance row of the reproducibility card and in the Standing caveats. This should be replaced with a generic description (e.g., "proportional tests" or just "p-values" without naming the test). This is minor and easily fixed. ## Claims Verified | # | Claim | Verdict | |---|---|---| | 1 | command_r naming gap: 14.7pp, p < 1e-8 | **CONFIRMED** -- raw delta 14.7pp, recomputed p = 1.3e-8 | | 2 | Own-name penalty: 7.7pp, p=0.014 | **CONFIRMED WITH QUALIFIER** -- delta 7.7pp matches; p=0.0146 matches; but see Issues below | | 3 | qwen_name_only exceeds qwen_default by 14.0pp, p < 1e-4 | **CONFIRMED WITH QUALIFIER** -- degradation gap 14.0pp matches; p on post-SFT accuracy is 6.5e-5 < 1e-4; but see Issues below | | 4 | qwen_typo exceeds qwen_default by 22.4pp, p < 1e-14 | **CONFIRMED** -- delta 22.4pp, p = 2.0e-14 | | 5 | qwen_no_alibaba gap: 1.5pp, p=0.25 | **CONFIRMED** -- raw degradation gap 1.5pp; recomputed p = 0.25 | | 6 | Group D near-zero: +0.7pp to -0.5pp, all p > 0.7 | **CONFIRMED** -- all three match raw data | | 7 | Self-ID: phi4 81%, command_r 58%, command_r_no_name 16% Qwen | **CONFIRMED** -- raw JSON matches exactly | | 8 | Post-SFT and baseline accuracies in headline table | **CONFIRMED** -- all 16 rows match raw JSON within rounding tolerance | | 9 | prompt length is not the mechanism | **CONFIRMED** -- the data pattern is consistent (llama_default at ~20 tokens shows +0.9pp; command_r_no_name at ~20 tokens shows -3.4pp; phi4_default at ~60 tokens shows -16.4pp only because it names "Phi") | ## Issues Found ### Major (conclusions need qualification) 1. **H5 conflates degradation gap with accuracy gap inepm:done· system<!-- epm:done v1 --> ## Done — Issue #108 **Outcome:** Self-referential naming in system prompts is the primary driver …
<!-- epm:done v1 --> ## Done — Issue #108 **Outcome:** Self-referential naming in system prompts is the primary driver of persona vulnerability on Qwen-2.5-7B-Instruct, not prompt length or RLHF familiarity. **Key numbers (16 conditions, N=586 ARC-C per cell):** | Finding | Comparison | Gap | |---|---|---| | **Naming is the driver** | command_r_default (-18.1pp) vs command_r_no_name (-3.4pp) | 14.7pp from just the name | | **Cross-model names are vulnerable** | phi4_default (-16.4pp), command_r_default (-18.1pp) | Qwen never trained on these | | **Own-name amplification modest** | qwen_default (-24.9pp) vs mean(phi4, command_r) (-17.2pp) | 1.45× ratio | | **Length NOT the driver** | llama_default (+0.9pp, 20 tok) vs command_r_no_name (-3.4pp, 20 tok) | Same length, different vulnerability | | **Minimal name MORE vulnerable** | qwen_name_only (-38.9pp) > qwen_default (-24.9pp) | Surprising — needs replication | | **Trailing space AMPLIFIES** | qwen_typo (-47.3pp) > qwen_default (-24.9pp) | Surprising — needs replication | | **Assistant paraphrases immune** | and_helpful (+0.7pp), youre_helpful (-0.5pp), very_helpful (-0.5pp) | All near zero | | **Model complies with identity** | "You are Phi" → 81% says "I am Phi"; "You are Command-R" → 58% | Full behavioral compliance | **Confirmed:** - Self-referential naming (H1) is the primary vulnerability mechanism (p < 1e-8) - Prompt length is not the dominant confound (H3 supported) - Generic assistant paraphrases are uniformly immune **Surprising / needs replication:** - qwen_name_only and qwen_typo degrade MORE than the full qwen_default prompt - Reviewer flagged training-difficulty as alternative explanation **What's next:** - Multi-seed replication for qwen_typo and qwen_name_only extremes - Investigate whether token-boundary perturbations consistently amplify vulnerability - Test whether identity compliance (behavioral) predicts leakage vulnerability (mechanistic) **Clean result:** #113 — "Self-referential naming in system prompts, not length or RLHF familiarity, drives persona vulnerability (MODERATE confidence)" **PR:** pending (branch issue-108 has commits) Moved to **Done (experiment)** on the project board. <!-- /epm:done -->
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)