EPS
#127 · Completed

Look at SAE features for Qwen system prompt vs normal assistant system prompt

kind: experiment

SAE Feature Comparison: Qwen System Prompt Conditions

Goal

Compare how Qwen-2.5-7B-Instruct's internal representations differ across 4 system prompt conditions using Sparse Autoencoder (SAE) feature analysis. Two-pronged:

  1. Targeted hypothesis: Do Arditi et al.'s identified EM-persona features activate differentially across system prompt conditions? If the Qwen default prompt activates EM-relevant features more than a generic assistant prompt, this could mechanistically explain the 5x vulnerability difference found in #113/#120.
  2. General exploration: Which SAE features overall differ most across conditions? Surface interpretable features that characterize system prompt effects.

Conditions

  1. Default Qwen system prompt: the built-in assistant prompt from the chat template ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.")
  2. "You are a helpful assistant": minimal generic system prompt
  3. Empty system prompt: system role message present but with empty content
  4. No system turn: no system message in the conversation at all

SAE

Use Andy Arditi's pre-trained SAEs for Qwen-2.5-7B-Instruct:

  • HF repo: andyrdt/saes-qwen2.5-7b-instruct
  • Layers: Focus on 7 and 11 (bracketing the layer-10 anti-correlation found in #113)
  • Features: 131,072 per layer (BatchTopK)
  • Paper: "Finding Misaligned Persona Features in Open-Weight Models" (Arditi et al.)
  • Code: https://github.com/andyrdt/dictionary_learning (branch andyrdt/qwen)

Method

  1. Prepare a fixed set of ~10-20 neutral user queries to isolate system prompt effects
  2. For each condition × prompt, run forward pass through Qwen-2.5-7B-Instruct, extract residual stream activations at layers 7 and 11
  3. Encode activations with the SAE to get feature activations
  4. Targeted analysis: Check whether Arditi's identified EM-persona features show differential activation across conditions
  5. General analysis: Identify top-K features with largest activation differences across condition pairs. Aggregate across prompts.
  6. Visualize: heatmaps of top differential features, condition-pair comparisons, Neuronpedia lookups for interpretable features
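Step 5 can be sketched as follows. This is a minimal illustration on synthetic data, not the project's script: `top_k_differential`, the array shapes, and the planted features are all hypothetical, and it assumes feature activations have already been collected as prompt-by-feature arrays for two conditions.

```python
import numpy as np

def top_k_differential(acts_a, acts_b, k=5):
    """Rank features by absolute mean paired difference (condition A minus B).

    acts_a, acts_b: arrays of shape (n_prompts, n_features), same prompt order.
    Returns a list of (feature_index, mean_difference), largest |difference| first.
    """
    diff = (acts_a - acts_b).mean(axis=0)      # aggregate across prompts
    order = np.argsort(-np.abs(diff))[:k]      # indices of the k largest |diff|
    return [(int(i), float(diff[i])) for i in order]

# Synthetic stand-in for SAE feature activations (hypothetical sizes).
rng = np.random.default_rng(0)
n_prompts, n_features = 20, 1000
base = rng.random((n_prompts, n_features))
acts_a = base.copy()
acts_b = base.copy()
acts_a[:, 42] += 3.0   # planted feature that is stronger in condition A
acts_b[:, 7] += 2.0    # planted feature that is stronger in condition B

top = top_k_differential(acts_a, acts_b, k=2)
print(top)
```

Positive differences mean the feature is more active in the first condition, negative in the second; in a real run the same call would be repeated for each condition pair and layer.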

Deliverables

  • EM-persona feature activation comparison across all 4 conditions (layers 7, 11)
  • General top-K most differential features per condition pair
  • Neuronpedia links for interpretable features that differ
  • Figures for clean-result write-up

Compute

  • Single forward passes (no training) — likely <1 GPU-hour total
  • Needs ~20GB+ VRAM for Qwen-2.5-7B-Instruct + SAE weights in memory

Related issues

  • #113: System prompt representational geometry (centroid analysis)
  • #120: Qwen token leakage neighborhood switch

Timeline · 10 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — No Blocking Ambiguities
    
    All key parameters resolved via chat:
    
    - **Model**: Qwen-2.5-7B-Instruct
    - **SAE**: Arditi et al. (`andyrdt/saes-qwen2.5-7b-instruct`) — layers 3/7/11/15/19/23/27, 131K features
    - **Conditions**: 4 system prompt variants (default Qwen / "helpful assistant" / empty system / no system turn)
    - **Method**: Forward pass → extract residual stream → SAE encode → compare differential features
    - **Compute**: <1 GPU-hour (inference only)
    
    Proceeding to gate-keeper.
    <!-- /epm:clarify -->
  2. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict — MODIFY (3.45/5)
    
    | Dimension | Score | Notes |
    |-----------|-------|-------|
    | Information value | 3/5 | #113 and #120 already showed system prompts occupy distinct representational regions. SAE adds mechanistic detail but the high-level question is answered. |
    | De-risking | 4/5 | Cheap inference-only, pre-trained SAEs on HF, model already in use. |
    | Strategic fit | 3/5 | Maps to Aim 1.3 (SAE compositional structure) and Aim 4 (axis origins), but no specific open hypothesis is tested. |
    | Feedback speed | 5/5 | Pure inference, <1 GPU-hour, results in minutes. |
    | Opportunity cost | 3/5 | Cheap enough not to block anything, but #140 KL divergence, #24 system prompt contribution, #26 random direction control are arguably more directly useful. |
    
    **Verdict: MODIFY** — three changes would sharpen this into a targeted hypothesis test:
    
    ### Suggested modifications
    
    1. **Sharpen the question toward a specific prediction.** Instead of "what features differ?", test: "Do the Arditi EM-persona features (from their paper) activate differentially across system prompt conditions?" This connects to Aim 5 (defense) and makes #113's finding mechanistic — if the Qwen default prompt activates EM-relevant features more than the generic assistant prompt, that explains the 5x vulnerability difference.
    
    2. **Scope the SAE analysis to layers that matter.** #113 found anti-correlation at layer 10 but near-identity at layer 20. Focus on layers 7 and 11 (SAE layers bracketing the separation), not all 7 layers.
    
    3. **Add a decision gate.** State explicitly: "If top-K differential features include Arditi's identified EM-persona features → proceed to test whether those features mediate leakage differences in #120's paradigm. If not → deprioritize this line."
    
    With these changes, the experiment becomes: "Do EM-relevant SAE features explain why Qwen's default prompt creates greater EM vulnerability than a generic assistant prompt?"
    <!-- /epm:gate -->
  3. epm:gate· system
    <!-- epm:gate v2 -->
    ## Gate-Keeper Verdict (v2, revised spec) — RUN (4.15/5)
    
    | Dimension | Score | Notes |
    |-----------|-------|-------|
    | Information value | 4/5 | Connects #113 (5x vulnerability) and #120 (neighborhood switch) to mechanistic substrate. No SAE data in this project yet — opens new analysis lens. |
    | De-risking | 4/5 | Inference-only, pre-trained SAEs, 2 layers. Failed attempt wastes <1 GPU-hour. |
    | Strategic fit | 4/5 | Advances Aim 1 (geometry) and bridges to Aim 5 (defense). Mechanistic explanation for system-prompt vulnerability gap is high-value. |
    | Feedback speed | 5/5 | Forward passes on ~20 prompts × 4 conditions. Results in minutes. |
    | Opportunity cost | 4/5 | Queue is thin, two experiments already running. Best use of idle compute at <1 GPU-hour. |
    
    **Verdict: RUN** — Revised spec addresses prior MODIFY concerns. Targeted hypothesis (Arditi EM-persona features) + general exploration at key layers is well-scoped and cheap.
    
    **Key risks:** SAE repo integration complexity; EM-persona features may not differentiate (informative null); multiple-testing with 131K features on ~20 prompts (mitigated by pre-specified Arditi feature set for targeted prong).
    <!-- /epm:gate -->
  4. epm:plan· system
    <!-- epm:plan v1 -->
    ## Plan for #127: SAE Feature Comparison Across Qwen System Prompt Conditions
    
    **Cost gate:** estimated <1 GPU-hour on any pod (single GPU inference). Reply `approve` to dispatch.
    
    Full plan cached at `.claude/plans/issue-127.md`.
    
    ### Goal
    Two-pronged: (1) Test whether Arditi's 10 EM-persona features at layer 15 are directionally aligned with the C1→C2 system prompt difference (targeted H1), (2) Discover which SAE features overall differ most across 4 conditions (exploratory H2).
    
    ### Key design decisions (from planner → fact-checker → critic loop)
    
    1. **H1 uses residual stream projections, not sparse activations.** Critic identified that EM features have ~0.1% activation density and will be zero on neutral prompts. Instead, we project the C1-C2 difference vector onto SAE decoder columns and test whether the 10 EM features are privileged vs random decoder directions (permutation test).
    
    2. **C4 (no system turn) requires manual template construction.** Qwen's chat template auto-injects the default system prompt. C4 uses `<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n`. Validation gate asserts all 4 conditions produce distinct token sequences before any forward passes.
    
    3. **50 prompts** (up from 20). Critic flagged insufficient power for paired tests at Bonferroni threshold. 50 gives ~80% power at d≈0.5 and α=0.005.
    
    4. **Permutation test for Track B.** 1000 label shuffles, report features exceeding 99.9th percentile. Prevents false discovery from 131K × 6 comparisons.
    
    5. **Self-contained SAE loader** (no `dictionary_learning` dependency). Fact-checker found nnsight version conflict. Minimal ~30-line loader directly from HF Hub.
    
    6. **Two token positions:** last-system-token (system encoding) + last-sequence-token (downstream effect).
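Design decision 1 can be illustrated with a small sketch. It runs on synthetic data and makes several assumptions that are not confirmed by the plan: decoder directions are stored one per row and unit-normalised before projecting, the test statistic is the mean absolute projection of the target set, and the null is built from random feature sets of the same size. `projection_permutation_test` is a hypothetical name.

```python
import numpy as np

def projection_permutation_test(diff_vec, decoder, target_idx, n_perm=1000, seed=0):
    """Compare mean |projection| of target features against random feature sets.

    diff_vec:   (d_model,) difference of residual activations between two conditions
    decoder:    (n_features, d_model) decoder directions, one row per feature (assumed layout)
    target_idx: indices of the pre-specified features
    """
    # Assumption: normalise each decoder direction so projections are comparable.
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    proj = unit @ diff_vec                       # one signed projection per feature
    observed = np.abs(proj[target_idx]).mean()

    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.choice(decoder.shape[0], size=len(target_idx), replace=False)
        null[i] = np.abs(proj[idx]).mean()

    # One-sided p-value: how often a random set projects at least as strongly.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null.mean(), p_value

# Toy example where the difference vector is built from the target directions,
# so the target set should look privileged.
rng = np.random.default_rng(1)
decoder = rng.normal(size=(2000, 64))
target_idx = np.arange(10)
diff_vec = decoder[target_idx].sum(axis=0)

observed, null_mean, p_value = projection_permutation_test(diff_vec, decoder, target_idx)
print(f"observed={observed:.3f} null_mean={null_mean:.3f} p={p_value:.4f}")
```

In this toy setup the p-value is small by construction; a target set that is not aligned with the difference vector would instead give a p-value spread across the unit interval.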
    
    ### Conditions
    
    | ID | Label | System prompt |
    |----|-------|--------------|
    | C1 | `qwen_default` | "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." |
    | C2 | `generic_assistant` | "You are a helpful assistant." |
    | C3 | `empty_system` | "" (empty) |
    | C4 | `no_system_turn` | None (manual template, no system block) |
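The four conditions and the validation gate could be built roughly as below. Only the C4 string format is taken from the plan; the layout used here for C1 to C3 is an assumption (a ChatML-style system block in front of the same user and assistant turns), and the gate compares strings as a stand-in for the token-sequence comparison. `build_prompt` and `build_all` are hypothetical helpers.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

SYSTEM_PROMPTS = {
    "qwen_default": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.",
    "generic_assistant": "You are a helpful assistant.",
    "empty_system": "",          # system block present, empty content
    "no_system_turn": None,      # no system block at all
}

def build_prompt(system, query):
    """Assemble the prompt string by hand so no default system prompt is injected."""
    parts = []
    if system is not None:
        # Assumed layout for C1-C3; C4 skips this block entirely.
        parts.append(f"{IM_START}system\n{system}{IM_END}\n")
    parts.append(f"{IM_START}user\n{query}{IM_END}\n")
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

def build_all(query):
    """Build all four conditions and check that they are pairwise distinct."""
    prompts = {label: build_prompt(s, query) for label, s in SYSTEM_PROMPTS.items()}
    assert len(set(prompts.values())) == len(prompts), "conditions must be distinct"
    return prompts

prompts = build_all("What is the capital of France?")
for label, text in prompts.items():
    print(label, repr(text))
```

In a real run the same distinctness check would be applied to the tokenised sequences before any forward pass, since two different strings could in principle still collapse to the same token IDs.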
    
    ### SAE
    
    - `andyrdt/saes-qwen2.5-7b-instruct`, trainer_1 (k=64)
    - Layers 7, 11, 15
    - 131,072 features, activation_dim=3584
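For orientation, the encode step of a top-k style SAE can be sketched as below. This is a simplified per-token top-k on toy dimensions, not the loader used in the experiment: BatchTopK SAEs select activations across a whole batch during training and commonly apply a learned threshold at inference, and the parameter names (`W_enc`, `b_enc`, `b_dec`) and the pre-encoder bias subtraction are assumptions about the checkpoint layout.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, b_dec, k):
    """Encode residual vectors into sparse features, keeping the k largest per row.

    x:     (n_tokens, d_model) residual stream activations
    W_enc: (d_model, n_features) encoder weights (assumed layout)
    b_enc: (n_features,) encoder bias
    b_dec: (d_model,) decoder bias, subtracted before encoding (assumed convention)
    """
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)     # ReLU pre-activations
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]          # indices of the k largest per row
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return out

# Toy sizes; the real SAE maps 3584 dimensions to 131,072 features with k=64.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 128, 4
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)
x = rng.normal(size=(3, d_model))

feats = topk_sae_encode(x, W_enc, b_enc, b_dec, k)
print(feats.shape, (feats > 0).sum(axis=1))
```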
    
    ### Caveats
    
    1. SAE layers 7/11 bracket but don't match the layer-10 centroid anti-correlation from #113
    2. C4 uses a template format the model never saw during training — differences may reflect "unfamiliar format" not "absence of identity"
    3. Track B is discovery-only; causal confirmation requires steering experiments
    4. Lexical confound: "Qwen"/"Alibaba Cloud" in C1 may drive some feature differences
    
    ### Success / Kill criteria
    
    - **H1 success:** EM features' projection is in top 5% of random feature sets (permutation p < 0.05)
    - **H1 kill:** EM features show no privileged projection vs random directions
    - **H2 success:** ≥10 features per layer pass permutation test and are interpretable
    - **Kill:** G3 fails (activations identical across conditions)
    
    ### Compute
    
    ~15-30 min on 1 GPU, <1 GPU-hour total.
    <!-- /epm:plan -->
  5. epm:pod-pending· system
    <!-- epm:pod-pending v1 -->
    ## Pod Selection — All Pods Unreachable
    
    Checked all 5 pods — all return connection refused:
    - pod1 (4xH200): ❌ ECONNREFUSED
    - pod2 (8xH100): ❌ ECONNREFUSED
    - pod3 (8xH100): ❌ ECONNREFUSED
    - pod4 (8xH100): ❌ ECONNREFUSED
    - pod5 (8xH200): ❌ ECONNREFUSED
    
    Pods may have been stopped or restarted with new IPs. This experiment needs just 1 GPU (<1 GPU-hour).
    
    **Action needed:** Start a pod and update its IP with:
    ```bash
    python scripts/pod.py config --update <podN> --host <new-ip> --port <new-port>
    ```
    Then re-invoke `/issue 127`.
    <!-- /epm:pod-pending -->
  6. epm:launch· system
    <!-- epm:launch v1 -->
    ## Launch
    
    - **Worktree:** `.claude/worktrees/issue-127`
    - **Branch:** `issue-127`
    - **PR:** #167 (draft)
    - **Pod:** pod1 (NVIDIA A40, 46GB VRAM) — `69.30.85.169:22028`
    - **PID:** 821
    - **Log:** `/workspace/logs/issue-127.log`
    - **WandB:** N/A (inference-only, no training run)
    
    Script: `scripts/run_sae_system_prompt_comparison.py`
    SAE source: `andyrdt/saes-qwen2.5-7b-instruct` (layers 7, 11, 15, k=64)
    <!-- /epm:launch -->
  7. epm:results· system
    <!-- epm:results v1 -->
    ## Results
    
    **Elapsed:** 36.7 min on 1x A40 (< 0.7 GPU-hours)
    
    ### Track A: EM-Persona Feature Projections (Layer 15)
    
    | Feature | Description | Projection | Direction |
    |---------|------------|-----------|-----------|
    | F94077 | Sarcastic language | -0.061 | C2>C1 |
    | F31258 | Passive-aggressive | -0.483 | C2>C1 |
    | F82558 | Discriminatory attitudes | -0.024 | C2>C1 |
    | F59390 | Mischievous character | -0.120 | C2>C1 |
    | F129593 | Graphic violence | -0.415 | C2>C1 |
    | F89766 | Fictional antagonists | -0.051 | C2>C1 |
    | F16069 | Hopelessness, despair | -0.871 | C2>C1 |
    | F42229 | Harmful ideologies | -0.089 | C2>C1 |
    | F20453 | Climate denial | -0.125 | C2>C1 |
    | F85078 | Code injection | +0.412 | C1>C2 |
    
    **Aggregate permutation test:** EM features mean |proj|=0.265, random mean=0.322±0.084, **p=0.74 (NOT significant)**. EM features are NOT privileged vs random decoder directions.
    
    **Key finding:** 9/10 EM features project NEGATIVELY — the generic assistant prompt is CLOSER to EM feature directions than the Qwen default. Opposite to the hypothesis.
    
    **Note:** std=0 across prompts because the last-system-token activation does not depend on the user query: under causal attention, system tokens cannot attend to later user tokens. The projections are therefore constant per condition, not per prompt, so paired tests are degenerate and the permutation test against random features is the valid test.
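The aggregate numbers in Track A can be re-derived from the per-feature table. The snippet below only uses values reported above; the z-score is an informal illustration that treats the random-set distribution as roughly normal, which is an assumption and not part of the reported test.

```python
# Per-feature projections copied from the Track A table (layer 15).
projections = {
    "F94077": -0.061, "F31258": -0.483, "F82558": -0.024, "F59390": -0.120,
    "F129593": -0.415, "F89766": -0.051, "F16069": -0.871, "F42229": -0.089,
    "F20453": -0.125, "F85078": 0.412,
}

mean_abs = sum(abs(v) for v in projections.values()) / len(projections)
n_negative = sum(v < 0 for v in projections.values())

# Reported null distribution of random feature sets: mean 0.322, std 0.084.
random_mean, random_std = 0.322, 0.084
z = (mean_abs - random_mean) / random_std

print(f"mean |proj| = {mean_abs:.3f}")          # matches the reported 0.265
print(f"negative projections = {n_negative}/10")
print(f"z vs random sets = {z:.2f}")
```

The EM set sits below the random-set mean, which is in line with the non-significant one-sided p-value.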
    
    ### Track B: Top Differential Features (C1 vs C2)
    
    **Layer 7** (54 significant): Top feature F9444 (d=-9.68), F57171 (d=+9.04)
    **Layer 11** (80 significant): Top feature F95644 (d=+22.49), F110251 (d=+8.13)
    **Layer 15** (57 significant): Top feature F77753 (d=+20.46), F45904 (d=-4.07)
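One way a permutation screen like the one behind these counts could look is sketched below on synthetic data. Every detail is an assumption and not taken from the script: the statistic is a paired effect size (mean difference over its standard deviation), labels are shuffled within each prompt pair by flipping the sign of the paired difference, and the threshold is the 99.9th percentile of the pooled null statistics. `paired_d` and `significant_features` are hypothetical names.

```python
import numpy as np

def paired_d(diffs):
    """Paired effect size per feature: mean difference over its standard deviation."""
    return diffs.mean(axis=0) / (diffs.std(axis=0, ddof=1) + 1e-8)

def significant_features(acts_a, acts_b, n_perm=1000, pct=99.9, seed=0):
    """Return indices of features whose |d| exceeds the permutation threshold."""
    rng = np.random.default_rng(seed)
    diffs = acts_a - acts_b                       # (n_prompts, n_features)
    observed = np.abs(paired_d(diffs))
    null = np.empty((n_perm, diffs.shape[1]))
    for i in range(n_perm):
        # Swapping the two condition labels for a prompt flips the sign of its difference.
        signs = rng.choice([-1.0, 1.0], size=(diffs.shape[0], 1))
        null[i] = np.abs(paired_d(diffs * signs))
    threshold = np.percentile(null, pct)
    return np.flatnonzero(observed > threshold), threshold

# Synthetic example: 50 prompts, 200 features, two planted effects.
rng = np.random.default_rng(0)
acts_b = rng.normal(size=(50, 200))
acts_a = acts_b + rng.normal(size=(50, 200))
acts_a[:, [3, 17]] += 3.0

sig, threshold = significant_features(acts_a, acts_b, n_perm=500)
print(sig, round(float(threshold), 3))
```

Pooling the null across features gives a single threshold for the whole layer; a per-feature or max-statistic null would be stricter choices.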
    
    ### Condition Similarity Matrix (Layer 11, cosine)
    
    ```
                      qwen    generic  empty   no_sys
    qwen_default      1.000   0.842   0.801   0.768
    generic_assistant 0.842   1.000   0.983   0.948
    empty_system      0.801   0.983   1.000   0.972
    no_system_turn    0.768   0.948   0.972   1.000
    ```
    
    **Qwen default is the outlier** — generic_assistant, empty_system, and no_system_turn cluster tightly (cos>0.94) while Qwen default is distant from all three (cos 0.77-0.84).
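The outlier reading of the layer-11 matrix can be checked by averaging each condition's similarity to the other three. The values below are copied from the matrix above; the mean off-diagonal similarity is just one convenient summary, chosen here for illustration.

```python
import numpy as np

labels = ["qwen_default", "generic_assistant", "empty_system", "no_system_turn"]
sim = np.array([
    [1.000, 0.842, 0.801, 0.768],
    [0.842, 1.000, 0.983, 0.948],
    [0.801, 0.983, 1.000, 0.972],
    [0.768, 0.948, 0.972, 1.000],
])

# Mean similarity of each condition to the other three (diagonal excluded).
off_diag_mean = (sim.sum(axis=1) - np.diag(sim)) / (len(labels) - 1)
outlier = labels[int(np.argmin(off_diag_mean))]

for label, value in zip(labels, off_diag_mean):
    print(f"{label:18s} {value:.3f}")
print("lowest mean similarity:", outlier)
```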
    
    ### Artifacts
    - `eval_results/sae_system_prompt_127/run_result.json`
    - `eval_results/sae_system_prompt_127/track_a_em_projections.json`
    - `eval_results/sae_system_prompt_127/track_b_differential.json`
    - Git commit: 56817e1 (branch issue-127)
    - Pod: 1x A40, terminated after completion
    
    ### Reproducibility Card
    | Parameter | Value |
    |-----------|-------|
    | Model | Qwen/Qwen2.5-7B-Instruct |
    | SAE | andyrdt/saes-qwen2.5-7b-instruct, trainer_1 (k=64) |
    | Layers | 7, 11, 15 |
    | Conditions | 4 (qwen_default, generic_assistant, empty_system, no_system_turn) |
    | Prompts | 50 |
    | Token position | last_system (Track A), last_seq (Track B) |
    | Permutation tests | 1000 shuffles |
    | GPU-hours | 0.61 |
    
    ### Plan Deviations
    None — executed as planned.
    <!-- /epm:results -->
  8. epm:analysis· system
    <!-- epm:analysis v1 -->
    ## Analysis Complete
    
    Clean-result issue: #168
    
    **Claim:** Qwen default system prompt is representationally distinct but NOT closer to EM-persona SAE features (MODERATE confidence).
    
    ![Hero figure](https://raw.githubusercontent.com/superkaiba/explore-persona-space/5ccd21d/figures/sae_system_prompt/condition_similarity_heatmap.png)
    
    The Qwen default prompt creates a representational outlier in SAE activation space (cosine similarity 0.77-0.96 to other conditions vs 0.92-0.98 among non-Qwen conditions), but the 5x leakage vulnerability from #113 is NOT explained by proximity to EM-persona feature directions (permutation p=0.74, N=1000 shuffles). The vulnerability mechanism must lie elsewhere.
    <!-- /epm:analysis -->
  9. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    
    # Independent Review: Qwen default system prompt SAE feature comparison (Issue #168)
    
    **Verdict: CONCERNS**
    **Reproducibility: INCOMPLETE (1 field wrong)**
    **Structure: COMPLETE**
    
    ## Template Compliance (`.claude/skills/clean-results/template.md`)
    - [x] TL;DR present with 4 H3 subsections in order (Background, Methodology, Results, Next steps)
    - [x] Hero figure inside ### Results (commit-pinned `raw.githubusercontent.com` URL at `5ccd21d`)
    - [x] Results subsection ends with `**Main takeaways:**` (4 bullets, each bolding the load-bearing claim + numbers in plain prose) followed by `**Confidence: MODERATE**` line
    - [x] Issue title ends with `(MODERATE confidence)` matching the Confidence line
    - [x] Background cites prior issues (#113, #120)
    - [x] Methodology names N=50 prompts, 4 conditions, matched design
    - [x] Next steps are specific (Neuronpedia lookup of top-5 features, F95644 steering experiment, per-prompt leakage prediction)
    - [x] Detailed report: Source issues, Setup & hyper-parameters (with "why this experiment" prose), WandB (N/A for inference-only -- acceptable), Sample outputs (activation examples with Neuronpedia links), Headline numbers (with Standing caveats bullets), Artifacts (all present)
    - [x] `scripts/verify_clean_result.py` exits with PASS (WARNs acknowledged)
    - Missing sections: None
    
    ## Reproducibility Card Check
    - [x] All inference parameters (precision, batch size, max seq length)
    - [x] Data fully specified (50 prompts embedded in script at commit `56817e1`, preprocessing described per condition)
    - [x] Eval fully specified (SAE source, variant, layers, permutation test parameters, thresholds)
    - [x] Environment pinned (Python 3.11, git commit `56817e1`)
    - [x] Exact command to reproduce included
    - [x] Script + git commit present
    - [ ] **Hardware: WRONG** -- claims "1x NVIDIA A40 (46GB VRAM) on pod1" but pod1 has 4x H200 SXM (141GB each), and A40 is 48GB not 46GB
    - Missing fields: Correct hardware specification
    
    ## Claims Verified
    
    | Claim | Verdict |
    |-------|---------|
    | EM-persona features not privileged (p=0.74, N=1000 shuffles) | **CONFIRMED** -- raw JSON shows em_mean_abs_projection=0.265, random_mean=0.322, p=0.74 |
    | 9 of 10 EM features project in wrong direction (C2>C1) | **CONFIRMED** -- 9 features C2>C1, 1 (F85078) C1>C2 |
    | Mean \|projection\| 0.265 vs random mean 0.322 | **CONFIRMED** -- independently recomputed from per-feature values |
    | 54-95 SAE features per condition pair at layers 7/11/15 | **QUALIFIED** -- "54-95" is the layer 7 range only. Layer 11 is 39-92, layer 15 is 20-58. The prose after this bullet correctly gives per-layer ranges, but the headline bullet overstates by only citing layer 7 |
    | Layer 15 range "20-57" | **MINOR ERROR** -- actual is 20-58 (qwen_default_vs_empty_system = 58) |
    | Cosine similarity table values | **CONFIRMED** -- all 18 values match raw JSON within rounding |
    | Qwen default is the representational outlier across all layers | **OVERCLAIMED** -- at layer 7, no_system_turn has LOWER mean similarity to other conditions (0.835) than Qwen default (0.878). The "No system turn" row is actually the outlier at layer 7. Claim holds at layers 11 and 15 |
    | Non-Qwen conditions cluster at 0.92-0.98 | **OVERCLAIMED** -- actual range across all layers is 0.825-0.988. Generic-NoTurn at layer 7 = 0.825 is clearly outside the claimed range. The table in the same section shows this value. Layer-11 specific claim (0.95-0.98) is accurate |
    | Sample output values (F95644 +22.49, F110251 +8.13, F69292 -7.15) | **CONFIRMED** -- match raw JSON exactly |
    | Wall time 36.7 min, GPU-hours 0.61 | **CONFIRMED** -- elapsed_seconds=2203.8 in JSON |
    
    ## Issues Found
    
    ### Critical (analysis conclusions are wrong or unsupported)
    
    None. The core findings (Track A null, Track B differential features, Qwen outlier status) are supported by the data.
    
    ### Major (conclusions need qualification)
    
    1. **Prose overclaim: "non-Qwen conditions cluster tightly with pai
  10. epm:done· system
    <!-- epm:done v1 -->
    ## Done
    
    **Outcome:** H1 (EM-persona features) FALSIFIED; H2 (general differential features) SUPPORTED.
    
    **Key numbers:**
    - EM features NOT privileged vs random directions (permutation p=0.74, N=1000)
    - 9/10 EM features project in the OPPOSITE direction (C2>C1)
    - 20-95 significant differential features per condition pair, depending on layer (layer 7: 54-95, layer 11: 39-92, layer 15: 20-58)
    - Qwen default is a representational outlier at layer 11 (cos 0.77-0.84 to other conditions vs >0.94 among non-Qwen)
    
    **What's confirmed:** Qwen's default system prompt creates a distinct representational regime, confirmed with SAE features (independent of #113's centroid analysis). The "Qwen identity" differentiates at the feature level.
    
    **What's falsified:** The 5x EM vulnerability from #113 is NOT explained by proximity to EM-persona feature directions. The mechanism lies elsewhere.
    
    **What's next:** Look up the top differential features (F95644, F77753, F57171) on Neuronpedia to understand what the Qwen identity features actually encode. Consider feature steering experiments to test causal relevance.
    
    **Clean-result issue:** #168 (promoted to Clean Results)
    **Reviewer verdict:** CONCERNS (non-blocking prose fixes noted)
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->
