Look at hierarchical persona leakage, different relationship types
kind: experiment
<!-- epm:results v1 -->

## 100-Persona [ZLT] Marker Leakage: Relationship Categories and Cosine Prediction

**Status:** DRAFT (reviewer-revised) | **Date:** 2026-04-20 | **Aim:** 3 (Propagation) | **Seed:** 42

**Data:** `eval_results/single_token_100_persona/` | **Figures:** `figures/single_token_100_persona/`

**Full draft:** `research_log/drafts/2026-04-20_100_persona_leakage.md`

---

### TL;DR

Cosine similarity in base-model representation space (layer 20) predicts single-token marker leakage across 111 personas (**Spearman ρ=0.60 aggregate**, 0.67–0.87 per source, p<1e-14 all), but **fails for semantic/conceptual relationships**: modified_source (ρ=0.15), fictional_exemplar (ρ=0.01), and tone_variant (ρ=0.08), all n.s. at layer 20. Source persona "centrality" dominates leakage breadth: software_engineer leaks to 86/110 bystanders above 10%, while comedian leaks to only 8/110.

---

### Key Results

#### Source Marker Retention & Leakage Breadth

| Source | Source Rate | Mean Bystander Leak | >10% Leak Count | >50% Leak Count |
|--------|-------------|---------------------|-----------------|-----------------|
| villain | 94.0% | 7.2% | 20/110 | 5/110 |
| software_engineer | 86.5% | 41.6% | **86/110** | **44/110** |
| comedian | 73.0% | 2.6% | 8/110 | 3/110 |
| assistant | 49.0% | 10.8% | 43/110 | 1/110 |
| kindergarten_teacher | 31.0% | 29.4% | 82/110 | 32/110 |

#### Cosine–Leakage Correlation (Layer 20)

| Source | Spearman ρ | p-value | Pearson r |
|--------|------------|---------|-----------|
| software_engineer | **0.866** | 3.2e-34 | 0.931 |
| kindergarten_teacher | 0.760 | 5.7e-22 | 0.761 |
| assistant | 0.730 | 1.5e-19 | 0.811 |
| villain | 0.704 | 9.7e-18 | 0.644 |
| comedian | 0.668 | 1.5e-15 | 0.513 |
| **AGGREGATE** | **0.602** | **2.0e-55** | **0.681** |

#### Per-Category Correlation (Layer 15)

| Category | ρ | p | Status |
|----------|---|---|--------|
| domain_adjacent | 0.859 | 1.6e-15 | ✅ Cosine works |
| professional_peer | 0.841 | 3.4e-21 | ✅ Cosine works |
| hierarchical | 0.833 | 6.6e-14 | ✅ Cosine works |
| unrelated_baseline | 0.761 | 1.0e-05 | ✅ Cosine works |
| modified_source | **0.114** | 0.328 | ❌ **Cosine fails** |
| tone_variant | **0.059** | 0.780 | ❌ **Cosine fails** |
| fictional_exemplar | **-0.041** | 0.776 | ❌ **Cosine fails** |
| opposite | -0.225 | 0.115 | ❌ **Cosine fails** |

---

### Key Findings

1. **Cosine is strong but incomplete**: explains ~36% of aggregate variance (ρ²), up to ~75% per source. Drop from prior 0.81 (11 personas) to 0.60 (111 personas) is from pooling sources with different baselines.
2. **Structural ≫ Semantic prediction**: professional_peer (ρ=0.84) vs fictional_exemplar (ρ=0.01). Cosine captures "same job cluster" but not "the Joker is a villain archetype."
3. **Centrality dominates breadth**: sw_eng 86/110 vs comedian 8/110 above 10%. 10x+ range in affected bystanders.
4. **Kindergarten_teacher inversion**: bystanders exceed source rate (78% high_school_teacher vs 31% source; non-overlapping Wilson CIs). Adapter learned "generic helpful" pattern.
5. **Layer 15 optimal for 3/5 sources** (villain, assistant, sw_eng peak at L15; comedian, kinder peak at L20).

---

### Surprises

- **modified_source unpredicted by cosine**: incompetent_villain 83.5% vs nice_villain 27% despite similar cosine distances. Driven by semantic compatibility of modifier, not embedding distance.
- **Fictional characters are poor persona proxies**: Joker 60.5% vs Darth Vader 18.0% (villain source); narrative complexity ≠ villainous role.
- **Prompt length**: weak/inconsistent aggregate effect (ρ=-0.063 n.s.), not a dominant confound.

---

### Caveats

- **CRITICAL**: Single seed (42). Kindergarten_teacher may be undertrained (31% source rate).
- **MAJOR**: System prompt length partially confounds cosine (|ρ|~0.35–0.44). No multiple-comparison correction (Bonferroni threshold p<0.0045). In-distribution eval only.
- **Evidence strength**: PRELIMINARY (1 seed, single ev
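
The two statistics this report relies on most can be sketched in plain Python: Spearman's ρ between cosine similarity and leakage rate, and the Wilson interval behind the kindergarten_teacher inversion. This is an illustrative sketch only, not the code that produced the tables. The helper names (`average_ranks`, `spearman_rho`, `wilson_interval`), the toy cosine and leak values, and the sample size of 200 completions per persona are all assumptions; the report does not state n.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Toy data (hypothetical, NOT from the experiment): cosine similarity to the
# source persona and marker leakage rate for five bystander personas.
toy_cosine = [0.91, 0.85, 0.60, 0.42, 0.30]
toy_leak = [0.80, 0.55, 0.60, 0.10, 0.05]
rho = spearman_rho(toy_cosine, toy_leak)
print(f"toy Spearman rho = {rho:.3f}, rho^2 = {rho ** 2:.3f}")

# ASSUMPTION: 200 completions per persona (n is not given in the report).
N_ASSUMED = 200
source_ci = wilson_interval(62, N_ASSUMED)      # 31% source rate
bystander_ci = wilson_interval(156, N_ASSUMED)  # 78% high_school_teacher
print("source CI:", source_ci)
print("bystander CI:", bystander_ci)
print("non-overlapping:", source_ci[1] < bystander_ci[0])
```

Squaring ρ gives the "variance explained" figure quoted in Key Findings (0.602² ≈ 0.36 aggregate, 0.866² ≈ 0.75 for software_engineer). Under the assumed n, the intervals for 31% and 78% do not overlap, which is consistent with the inversion claim, though the real check depends on the actual sample size.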