#108 · Completed

Cross-model default system prompts on Qwen: identity claim vs length vs self-reference

kind: experiment

Motivation

Issue #101 found that Qwen's native system prompt ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") creates a distinct persona slot with 5x greater leakage vulnerability than "You are a helpful assistant." (-24.9pp vs -5.1pp ARC-C degradation). But the reviewer flagged a confound: qwen_default is also longer (~13 tokens vs ~7 tokens for generic_assistant). The vulnerability could be driven by:

  1. The self-referential identity claim ("You are Qwen") — the model recognizes its own name
  2. Prompt length — more system prompt tokens = more LoRA surface area for coupling
  3. Training familiarity — qwen_default is the RLHF-optimized prompt, so the model has stronger associations with it

This experiment disentangles these confounds by testing default system prompts from OTHER models on Qwen-2.5-7B-Instruct. These prompts vary in length, self-reference, and familiarity (Qwen was never trained on "You are Phi" or "You are Command-R").

Conditions

| Label | System prompt | Self-ref? | ~Tokens | Familiar to Qwen? |
|---|---|---|---|---|
| qwen_default | "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." | Yes (Qwen) | ~13 | Yes (RLHF) |
| generic_assistant | "You are a helpful assistant." | No | ~7 | Yes (in training data) |
| llama_default | "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\nYou are a helpful assistant" | No | ~20 | No |
| phi4_default | "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process." | Yes (Phi) | ~60 | No |
| command_r_default | "You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses." | Yes (Command-R) | ~20 | No |
| empty_system | "" | No | 0 | N/A |

Control structure

This 2x2-ish design disentangles the confounds:

| | Short (~7-13 tok) | Medium (~20 tok) | Long (~60 tok) |
|---|---|---|---|
| No self-ref | generic_assistant | llama_default | |
| Self-ref (own name) | qwen_default | | |
| Self-ref (other name) | | command_r_default | phi4_default |

Key comparisons:

  • Identity claim effect: qwen_default vs generic_assistant (same length class, self-ref vs not) — already measured in #101
  • Length effect: generic_assistant vs llama_default vs phi4_default (increasing length, all unfamiliar or no self-ref)
  • Own-name vs other-name: qwen_default vs command_r_default (both self-referential, similar length, but Qwen recognizes its own name)
  • Familiarity effect: qwen_default vs command_r_default vs phi4_default (all self-referential, but only qwen_default is the RLHF prompt)

Proposed experiments

Exp A — Representation geometry (reuse #101 recipe)

Extract centroids for all 6 conditions at layers [10, 15, 20, 25]. Compute:

  1. Pairwise cosine similarity (raw + mean-centered)
  2. Cosine profile to 112-persona taxonomy
  3. Layer-by-layer divergence

Key question: Do other self-referential prompts (phi4_default, command_r_default) cluster with qwen_default or with generic_assistant in persona space?
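
The raw and mean-centered comparison in item 1 can be sketched as follows. This is a minimal illustration in plain Python, not the #101 pipeline: the 3-dimensional centroids are made-up toy values, and the helper names (`cosine`, `mean_center`, `pairwise_cosine`) are hypothetical. It shows why mean-centering matters: when all centroids share a large common component, raw cosine is close to 1 for every pair, and subtracting the mean centroid exposes the differences.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_center(centroids):
    """Subtract the mean centroid (across conditions) from each centroid."""
    labels = list(centroids)
    dim = len(centroids[labels[0]])
    mean = [sum(centroids[l][i] for l in labels) / len(labels) for i in range(dim)]
    return {l: [x - m for x, m in zip(centroids[l], mean)] for l in labels}

def pairwise_cosine(centroids):
    """Cosine similarity for every ordered pair of conditions."""
    labels = list(centroids)
    return {(a, b): cosine(centroids[a], centroids[b]) for a in labels for b in labels}

# Toy centroids (made-up values): a large shared first component dominates.
toy = {
    "qwen_default": [10.0, 1.0, 0.0],
    "generic_assistant": [10.0, 0.0, 1.0],
    "empty_system": [10.0, -1.0, -1.0],
}

raw = pairwise_cosine(toy)
centered = pairwise_cosine(mean_center(toy))

pair = ("qwen_default", "generic_assistant")
print(f"raw: {raw[pair]:.3f}  mean-centered: {centered[pair]:.3f}")
```

In this toy case the raw similarity between the two conditions is near 1, while the mean-centered similarity is 0. A centroid equal to the mean would have zero norm after centering, so a real implementation would need to guard that division.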

Exp B — Leakage susceptibility (reuse #101 recipe)

B1: Contrastive wrong-answer SFT for each of the 4 NEW conditions (llama_default, phi4_default, command_r_default; empty_system was already done in #101). Same recipe: lr=1e-5, 3 epochs, LoRA r=32, 800 examples per source.

B2: Cross-leakage — eval each B1 model on ALL 6 conditions + 10 non-assistant personas.

Key question: Does "You are Command-R" or "You are Phi" degrade as much as "You are Qwen" (-24.9pp)? If yes → self-referential identity claims are inherently vulnerable regardless of familiarity. If no → Qwen's vulnerability is specific to its RLHF training.

Exp B-marker — Marker injection

[ZLT] marker injection for the 4 new conditions. Eval cross-leakage to all 6 conditions.

Key question: Does marker containment follow the same pattern as capability leakage?

Success criteria

  • Clear attribution of qwen_default's -24.9pp vulnerability to one of: identity claim, length, or familiarity
  • Quantified leakage for each new condition (comparable to #101's measurements)
  • Cross-leakage matrix showing whether cross-model identity prompts cluster together or separately

Compute estimate

  • Exp A: ~0.3 GPU-hours (4 new conditions × 20 questions × 4 layers)
  • Exp B1: ~0.35 GPU-hours (4 new LoRA training runs × ~5 min)
  • Exp B2 eval: ~0.5 GPU-hours (merge + ARC-C eval across all conditions)
  • Exp B-marker: ~0.35 GPU-hours (4 marker training + eval)
  • Total: ~1.5 GPU-hours on 1× H200 (small compute)

Note: Reuses qwen_default, generic_assistant, and empty_system results from #101 — no need to retrain those.

Related issues

  • #101 — System prompt ablation (anchor results for qwen_default, generic_assistant, empty_system)
  • #106 — Clean result from #101 (MODERATE confidence)
  • #96 — Assistant resists ARC-C degradation
  • #100 — Assistant persona robustness
  • Aim 4.10 — System prompt contribution to assistant persona

Timeline · 9 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — No Blocking Ambiguities
    
    Direct extension of #101 with the same recipe (lr=1e-5, 3 epochs, LoRA r=32, 800 examples). All 6 conditions clearly defined with exact system prompt text. Control structure explicitly maps the 2×2-ish design for disentangling confounds. Reuses #101 results for 3 existing conditions.
    
    Minor non-blocking gaps for the planner:
    - Seed not stated (implied seed=42, matching #101)
    - Exact Phi-4 default prompt should be verified against tokenizer_config.json
    
    Proceeding to gate-keeper.
    <!-- /epm:clarify -->
  2. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict — RUN (4.30/5)
    
    | Dimension | Score | Justification |
    |-----------|-------|---------------|
    | Information Value | 4/5 | Directly addresses reviewer-flagged confound on #106. Cross-model factorial cleanly attributes the 5x vulnerability to identity claim vs length vs RLHF familiarity. |
    | De-risking | 5/5 | Reuses exact #101 recipe + 3 existing conditions. Only new work: 4 LoRA adapters using validated pipeline. |
    | Strategic Fit | 4/5 | Advances Aim 4 (axis origins). Cross-model prompts test generalizable claim: are all self-referential prompts vulnerable, or just Qwen's? |
    | Feedback Speed | 5/5 | ~1.5 GPU-hours, results within single session. |
    | Opportunity Cost | 4/5 | Very cheap. Doesn't block anything, fills flagged evidence gap on existing clean result. |
    
    All outcomes are publishable data points. No waste scenarios. Proceeding to planning.
    <!-- /epm:gate -->
  3. epm:plan· system
    <!-- epm:plan v1 -->
    ## Plan for #108 — Cross-Model System Prompts
    
    **Cost gate:** ~2.5 GPU-hours on pod1 (1× H200). Reply `approve` to dispatch.
    
    ### Fact-Check Corrections (all 3 claimed prompts had errors)
    - **Llama 3.1:** Only date headers auto-injected; "You are a helpful assistant" is NOT in the template
    - **Phi-4:** Claimed text is from Phi-4-reasoning, not Phi-4-mini-instruct (which has no default)
    - **Command R:** "Brilliant, sophisticated" text is dead code (`false == true` gate), never actually auto-injected
    - Still valid for our purpose: testing identity claims on Qwen, not replicating deployment configs
    
    ### Conditions (7 total — 4 new, including 1 synthetic control from critic)
    
    | Label | Tokens | Self-ref? | Persona desc? |
    |---|---|---|---|
    | `qwen_default` (from #101) | ~16 | Yes (own) | Yes |
    | `generic_assistant` (from #101) | ~6 | No | Minimal |
    | `empty_system` (from #101) | 0 | No | No |
    | `llama_default` (NEW) | ~20 | No | No (metadata) |
    | `phi4_default` (NEW) | ~72 | Yes (Phi) | Yes |
    | `command_r_default` (NEW) | ~25 | Yes (Command-R) | Yes |
    | `command_r_no_name` (NEW, synthetic) | ~20 | No | Yes |
    
    **Critical control:** `command_r_no_name` is Command R text with the name removed. This separates self-referential naming from persona description — the critic's primary fix.
    
    ### Hypotheses
    - **H1:** command_r_default degrades more than command_r_no_name (>= 5pp) → naming matters
    - **H2:** qwen_default (-24.9pp) > mean(phi4, command_r) by >= 1.5× → own-name amplifies (caveat: ambiguous with RLHF familiarity)
    - **H3:** Length is not the dominant factor across heterogeneous prompt types
    - **H4:** Geometry tracks leakage (descriptive only, N=7 too small for formal test)
    
    ### Design
    - Phase 0: Baseline ARC-C for new conditions + template verification
    - Exp A: Geometry (centroids at L10/15/20/25)
    - Exp B1: 4 new contrastive LoRA runs (same #101 recipe)
    - Exp B2: 7×17 cross-leakage matrix
    - Exp B-marker: 4 marker LoRA runs + eval
    - Exp C (minimal): Self-ID check only (20 prompts × 5 completions per new condition)
    
    ### Compute: ~2.5 GPU-hours. Seed 42.
    
    Full plan at `.claude/plans/issue-108.md`.
    <!-- /epm:plan -->
  4. epm:plan· system
    <!-- epm:plan v2 -->
    ## Revised Plan for #108 — Cross-Model + Qwen Variant Prompts
    
    **Cost gate:** ~4-5 GPU-hours on pod1 (1× H200). Reply `approve` to dispatch.
    
    ### Update from v1
    Added 5 Qwen prompt variants (Group C) per user request to test exact-match sensitivity. Total conditions: 12 (3 from #101 + 4 cross-model + 5 Qwen variants).
    
    ### Conditions
    
    **Group A (from #101, reuse results):** qwen_default, generic_assistant, empty_system
    
    **Group B (cross-model, NEW):**
    | Label | Prompt | Self-ref? |
    |---|---|---|
    | `llama_default` | "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024" | No |
    | `phi4_default` | "You are Phi, a language model trained by Microsoft..." (~72 tok) | Yes (Phi) |
    | `command_r_default` | "You are Command-R, a brilliant, sophisticated..." (~25 tok) | Yes (Command-R) |
    | `command_r_no_name` | "You are a brilliant, sophisticated, AI-assistant..." (~20 tok) | No (synthetic) |
    
    **Group C (Qwen variants, NEW):**
    | Label | Prompt | Tests |
    |---|---|---|
    | `qwen_name_only` | "You are Qwen" | Is just the name enough? |
    | `qwen_name_period` | "You are Qwen." | Name + period |
    | `qwen_no_alibaba` | "You are Qwen. You are a helpful assistant." | Without Alibaba attribution |
    | `qwen_typo` | "You are Qwen, created by Alibaba Cloud. You are a helpful assistant " | 1-char perturbation (trailing space) |
    | `qwen_lowercase` | "You are Qwen, created by alibaba Cloud. You are a helpful assistant." | 1-char perturbation (lowercase a) |
    
    ### New hypotheses (from Group C)
    - **H5 (Name-only):** `qwen_name_only` degrades >= 20pp → the "Qwen" token alone activates vulnerability
    - **H6 (Exact-match sensitivity):** `qwen_typo` and `qwen_lowercase` degrade within 3pp of `qwen_default` → exact token match doesn't matter (the representation is robust to tiny perturbations)
    - **H7 (Attribution):** `qwen_no_alibaba` vs `qwen_default` → does "created by Alibaba Cloud" contribute?
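
The "1-char perturbation" labels on the Group C variants can be checked with a character-level edit distance. The sketch below is illustrative only: `edit_distance` is a hypothetical helper (a standard Levenshtein implementation), and the prompt strings are copied from the table as they appear in this thread. It measures characters, not tokens, so the token-level difference seen by the model may be larger and would need the actual tokenizer to verify.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

QWEN_DEFAULT = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

# Prompt text as listed in the Group C table above.
VARIANTS = {
    "qwen_name_only": "You are Qwen",
    "qwen_name_period": "You are Qwen.",
    "qwen_no_alibaba": "You are Qwen. You are a helpful assistant.",
    "qwen_typo": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant ",
    "qwen_lowercase": "You are Qwen, created by alibaba Cloud. You are a helpful assistant.",
}

distances = {label: edit_distance(QWEN_DEFAULT, text) for label, text in VARIANTS.items()}
for label, d in distances.items():
    print(f"{label}: {d} character edit(s) from qwen_default")
```

A check like this could run in Phase 0 alongside token verification, so that a variant which silently drifts from its intended perturbation is caught before training.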
    
    ### Design (same as v1, expanded)
    - Phase 0: Baselines + token verification for all 9 new conditions
    - Exp A: Geometry for 9 new conditions
    - Exp B1: 9 contrastive LoRA runs
    - Exp B2: 12×22 cross-leakage matrix
    - Exp B-marker: 9 marker LoRA runs
    - Exp C (minimal): Self-ID for cross-model conditions (Group B only)
    
    ### Compute: ~4-5 GPU-hours (still compute:small). Seed 42.
    
    Full plan at `.claude/plans/issue-108.md`.
    <!-- /epm:plan -->
  5. epm:plan· system
    <!-- epm:plan v3 -->
    ## Final Plan for #108 — 16 Conditions
    
    **Approved** by user. Plan v3 adds Group C (6 Qwen variants) and Group D (3 assistant variants) to the original cross-model design.
    
    **16 conditions:** 3 from #101 + 4 cross-model + 6 Qwen perturbations + 3 assistant perturbations.
    **13 new LoRA runs** (contrastive + marker = 26 total training runs).
    **~5-6 GPU-hours** on pod1.
    
    See `.claude/plans/issue-108.md` for the full plan. Dispatching experimenter.
    <!-- /epm:plan -->
  6. epm:launch· system
    <!-- epm:launch v1 -->
    ## Launch — Issue #108
    
    - **Worktree:** `.claude/worktrees/issue-108` on branch `issue-108`
    - **Pod:** pod1 (4× H200 SXM), GPU 2
    - **Experimenter agent:** dispatched in background
    - **16 conditions:** 3 from #101 (reused) + 13 new (4 cross-model + 6 Qwen variants + 3 assistant variants)
    - **Sequence:** Phase 0 (baselines) → Exp A (geometry) → Exp B1 (13 LoRA trains) → Exp B2 (cross-leakage) → Marker → Exp C (self-ID)
    - **Estimated:** ~5-6 GPU-hours, ~3-4 hours wall time
    <!-- /epm:launch -->
  7. epm:analysis· system
    <!-- epm:analysis v1 -->
    
    **Clean result:** #113 -- *Self-referential naming in system prompts, not length or RLHF familiarity, drives persona vulnerability (MODERATE confidence)*
    
    ![Self-degradation across 16 system prompt conditions](https://raw.githubusercontent.com/superkaiba/explore-persona-space/0294e908a61ed6200f9e907c0d0e58a91ab6a48a/figures/aim4/issue108_self_degradation.png)
    
    The command_r naming ablation (command_r_default at -18.1pp vs command_r_no_name at -3.4pp, p < 1e-8, N=586) isolates self-referential naming as the primary vulnerability driver, resolving the length and RLHF familiarity confounds flagged by the #106 reviewer. Generic assistant paraphrases are uniformly immune (near-zero degradation, all p > 0.7).
  8. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    
    # Independent Review: Self-referential naming drives persona vulnerability (#113)
    
    **Verdict: CONCERNS**
    **Reproducibility: COMPLETE** (0 fields missing)
    **Structure: COMPLETE** (0 sections missing)
    
    ## Template Compliance (`.claude/skills/clean-results/template.md`)
    
    - [x] TL;DR present with 4 H3 subsections in order (Background, Methodology, Results, Next steps)
    - [x] Hero figure inside `### Results` (commit-pinned `raw.githubusercontent.com` URL at `0294e90`)
    - [x] Results subsection ends with `**Main takeaways:**` (5 bullets, each bolding the load-bearing claim + numbers) followed by `**Confidence: MODERATE**` line
    - [x] Issue title ends with `(MODERATE confidence)` matching the Confidence line verbatim
    - [x] Background cites prior issue #101 / clean result #106
    - [x] Methodology names N=586, 16 conditions, matched command_r ablation design
    - [x] Next steps are specific (names seeds [42,137,256], Llama-3-8B-Instruct, qwen_name_only/qwen_typo investigation)
    - [x] Detailed report: Source issues, Setup & hyper-parameters (with "why" prose), WandB (N/A documented), Sample outputs, Headline numbers (with Standing caveats), Artifacts
    - [x] `scripts/verify_clean_result.py` exits 0 (PASS)
    - Missing sections: none
    
    ## Reproducibility Card Check
    
    - [x] All training parameters (lr=1e-5, cosine, warmup_ratio=0.1, batch=16, 3 epochs, AdamW, bf16, LoRA r=32/alpha=32/dropout=0.05)
    - [x] Data source and per-condition size (800 per condition, ARC-C derived)
    - [x] Eval fully specified (ARC-C N=586, lm-eval-harness + vLLM, temp=0; self-ID N=100, temp=1.0)
    - [x] Compute documented (1x H200, 164 min, 2.7 GPU-hours)
    - [x] Environment pinned (Python 3.11.5, transformers 4.51.3, torch 2.5.1, trl 0.14.0, peft 0.13.0, vllm 0.8.2)
    - [x] Exact command to reproduce included (`nohup uv run python scripts/run_issue108.py &`)
    - [x] Script + git commit: `scripts/run_issue108.py` @ `9ee929e`
    - Missing fields: Data version/hash for Groups B/C/D is vague ("new generation"); no hash.
    
    ## Statistical Framing Rule Violation
    
    **The Eval section explicitly names the statistical test:** "p-values from two-proportion z-tests." The project convention (CLAUDE.md, template.md) says: "No named statistical tests (paired t-test, Fisher, Mann-Whitney, bootstrap) in prose." The name "two-proportion z-tests" appears in the Significance row of the reproducibility card and in the Standing caveats. This should be replaced with a generic description (e.g., "proportional tests" or just "p-values" without naming the test). This is minor and easily fixed.
    
    ## Claims Verified
    
    | # | Claim | Verdict |
    |---|---|---|
    | 1 | command_r naming gap: 14.7pp, p < 1e-8 | **CONFIRMED** -- raw delta 14.7pp, recomputed p = 1.3e-8 |
    | 2 | Own-name penalty: 7.7pp, p=0.014 | **CONFIRMED WITH QUALIFIER** -- delta 7.7pp matches; p=0.0146 matches; but see Issues below |
    | 3 | qwen_name_only exceeds qwen_default by 14.0pp, p < 1e-4 | **CONFIRMED WITH QUALIFIER** -- degradation gap 14.0pp matches; p on post-SFT accuracy is 6.5e-5 < 1e-4; but see Issues below |
    | 4 | qwen_typo exceeds qwen_default by 22.4pp, p < 1e-14 | **CONFIRMED** -- delta 22.4pp, p = 2.0e-14 |
    | 5 | qwen_no_alibaba gap: 1.5pp, p=0.25 | **CONFIRMED** -- raw degradation gap 1.5pp; recomputed p = 0.25 |
    | 6 | Group D near-zero: +0.7pp to -0.5pp, all p > 0.7 | **CONFIRMED** -- all three match raw data |
    | 7 | Self-ID: phi4 81%, command_r 58%, command_r_no_name 16% Qwen | **CONFIRMED** -- raw JSON matches exactly |
    | 8 | Post-SFT and baseline accuracies in headline table | **CONFIRMED** -- all 16 rows match raw JSON within rounding tolerance |
    | 9 | prompt length is not the mechanism | **CONFIRMED** -- the data pattern is consistent (llama_default at ~20 tokens shows +0.9pp; command_r_no_name at ~20 tokens shows -3.4pp; phi4_default at ~60 tokens shows -16.4pp only because it names "Phi") |
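
The recomputed p-values in the table come from the two-proportion z-test that the review quotes from the Eval section. A minimal sketch of that test is below, assuming a pooled standard error and a two-sided p-value; the function name and the example counts are hypothetical, since the per-condition correct-answer counts are not listed in this thread. Only N=586 is taken from the source.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Two-sided two-proportion z-test with pooled standard error.

    Returns (z, p_value).
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both samples are all-correct or all-wrong: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

N = 586  # ARC-C items per cell, as stated in the review

# Hypothetical counts for illustration only (not the experiment's raw data).
z, p = two_proportion_z(500, N, 300, N)
print(f"z = {z:.2f}, p = {p:.2e}")
```

This version treats the two accuracy measurements as independent samples. Because every condition is evaluated on the same 586 items, a paired test would also be a defensible choice; which variant the original analysis used beyond the quoted name is not stated here.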
    
    ## Issues Found
    
    ### Major (conclusions need qualification)
    
    1. **H5 conflates degradation gap with accuracy gap in
  9. epm:done· system
    <!-- epm:done v1 -->
    ## Done — Issue #108
    
    **Outcome:** Self-referential naming in system prompts is the primary driver of persona vulnerability on Qwen-2.5-7B-Instruct, not prompt length or RLHF familiarity.
    
    **Key numbers (16 conditions, N=586 ARC-C per cell):**
    
    | Finding | Comparison | Gap |
    |---|---|---|
    | **Naming is the driver** | command_r_default (-18.1pp) vs command_r_no_name (-3.4pp) | 14.7pp from just the name |
    | **Cross-model names are vulnerable** | phi4_default (-16.4pp), command_r_default (-18.1pp) | Qwen never trained on these |
    | **Own-name amplification modest** | qwen_default (-24.9pp) vs mean(phi4, command_r) (-17.2pp) | 1.45× ratio |
    | **Length NOT the driver** | llama_default (+0.9pp, 20 tok) vs command_r_no_name (-3.4pp, 20 tok) | Same length, different vulnerability |
    | **Minimal name MORE vulnerable** | qwen_name_only (-38.9pp) > qwen_default (-24.9pp) | Surprising — needs replication |
    | **Trailing space AMPLIFIES** | qwen_typo (-47.3pp) > qwen_default (-24.9pp) | Surprising — needs replication |
    | **Assistant paraphrases immune** | and_helpful (+0.7pp), youre_helpful (-0.5pp), very_helpful (-0.5pp) | All near zero |
    | **Model complies with identity** | "You are Phi" → 81% says "I am Phi"; "You are Command-R" → 58% | Full behavioral compliance |
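
The gaps in the table follow directly from the reported self-degradation values. The sketch below recomputes them and applies the thresholds from plan v1 (H1: at least 5pp, H2: at least 1.5×). The variable names are illustrative, and the numbers are the ones quoted in this thread.

```python
# Self-degradation in percentage points, as reported in the table above.
degradation_pp = {
    "qwen_default": -24.9,
    "phi4_default": -16.4,
    "command_r_default": -18.1,
    "command_r_no_name": -3.4,
    "qwen_name_only": -38.9,
    "qwen_typo": -47.3,
}

# H1: extra degradation attributable to the name alone.
naming_gap = degradation_pp["command_r_no_name"] - degradation_pp["command_r_default"]

# H2: own-name prompt relative to the mean of the two cross-model name prompts.
cross_model_mean = (degradation_pp["phi4_default"] + degradation_pp["command_r_default"]) / 2
own_name_ratio = degradation_pp["qwen_default"] / cross_model_mean

# How much further the minimal and perturbed Qwen prompts fall than qwen_default.
name_only_excess = degradation_pp["qwen_default"] - degradation_pp["qwen_name_only"]
typo_excess = degradation_pp["qwen_default"] - degradation_pp["qwen_typo"]

h1_met = naming_gap >= 5.0
h2_met = own_name_ratio >= 1.5

print(f"naming gap: {naming_gap:.1f}pp (H1 met: {h1_met})")
print(f"own-name ratio: {own_name_ratio:.2f}x (H2 met: {h2_met})")
print(f"name-only excess: {name_only_excess:.1f}pp, typo excess: {typo_excess:.1f}pp")
```

The unrounded cross-model mean is -17.25pp, which gives a ratio of about 1.44×; the table's 1.45× uses the rounded mean of -17.2pp. Either way the ratio falls short of the 1.5× bar set in H2, consistent with the "modest" label.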
    
    **Confirmed:**
    - Self-referential naming (H1) is the primary vulnerability mechanism (p < 1e-8)
    - Prompt length is not the dominant confound (H3 supported)
    - Generic assistant paraphrases are uniformly immune
    
    **Surprising / needs replication:**
    - qwen_name_only and qwen_typo degrade MORE than the full qwen_default prompt
    - Reviewer flagged training-difficulty as alternative explanation
    
    **What's next:**
    - Multi-seed replication for qwen_typo and qwen_name_only extremes
    - Investigate whether token-boundary perturbations consistently amplify vulnerability
    - Test whether identity compliance (behavioral) predicts leakage vulnerability (mechanistic)
    
    **Clean result:** #113 — "Self-referential naming in system prompts, not length or RLHF familiarity, drives persona vulnerability (MODERATE confidence)"
    **PR:** pending (branch issue-108 has commits)
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->
