EPS
← All tasks·#232Archived

Marker coupling strength tracks representational distance from assistant, not behavioral distance (MODERATE confidence)

kind: experimentclean-result: true

TL;DR

Background

Phase 0.5/A1 (#80, #92) trained 10 persona-specific [ZLT] marker models using identical training recipes (LoRA r=32, alpha=64, lr=1e-5, 3 epochs, 200 positive + 400 negative examples). Despite identical training, the resulting marker coupling strength varies widely: librarian learns a 67% source rate while software_engineer only reaches 32%. What determines how strongly a persona couples with the marker?

Methodology

Correlate two measures of persona distinctiveness against A1 free-generation [ZLT] source rates (N=10 personas):

  • Cosine similarity to assistant (pre-computed from Qwen2.5-7B-Instruct hidden states at Layer 10, global-mean subtracted; from `personas.py`)
  • JS divergence from assistant (computed from output token distributions over 20 questions; from #140)

No new training or inference — purely correlational analysis of existing data.

Results

Coupling predictors

Left: cosine similarity to assistant vs [ZLT] source rate (N=10). Right: JS divergence from assistant vs source rate. Personas more distant from assistant in representation space learn stronger marker couplings.

Main takeaways:

  • Cosine distance from assistant predicts marker coupling strength (r=-0.66, p=0.039, N=10). Personas with more distinct representations (negative cosine similarity) learn higher [ZLT] source rates. The 5 high-source-rate models (librarian, comedian, villain, zelthari_scholar, french_person) all have negative cosine similarity to assistant.
  • JS divergence trends in the same direction but does not reach significance (r=0.54, p=0.105, N=10). Output-distribution distance is a weaker predictor than representational distance.
  • Librarian is a key outlier: low JS divergence (0.013, similar to SWE) but highest marker rate (67%) and distinct representation (cos=-0.08). Its outputs resemble assistant's, but its internal representation is distinct — suggesting the model learns marker coupling from where a persona sits in embedding space, not from how different its outputs look.
  • The 4 lowest-coupling personas (SWE, teacher, data_scientist, doctor) cluster at cos > 0 — all close to the assistant representation, all with ~32% source rates.

Confidence: MODERATE — the correlation is significant (p=0.039) but N=10 limits power. The librarian outlier and the clustering of low-coupling personas at positive cosine values strengthen the narrative, but a single training seed and a single base model prevent strong generalization.

Next steps

  • Test whether the correlation holds with a different base model (e.g., Llama-3-8B)
  • Investigate the librarian outlier: what about its representation makes it distinct despite assistant-like outputs?
  • Check if the correlation strengthens when using persona-to-persona cosine distances (not just assistant-centric)

Detailed report

Source issues

  • #138 — Persona-Marker Dissociation (generated the coupling predictor analysis)
  • #80 — Phase 0.5 leakage experiment (source of A1 marker rates)
  • #92 — Phase A1 leakage experiment (confirmed rates across 10 personas)
  • #140 — JS divergence computation (source of JS divergence matrix)

Setup & hyper-parameters

Why this analysis: The dissociation experiment (#138) revealed that high-source-rate models show cleaner prompt-gating than low-source-rate models. This raised the question of what determines coupling strength. Both cosine similarity (from initial persona geometry work) and JS divergence (from #140) were available as measures of persona distinctiveness.

DataA1 [ZLT] source rates (10 personas), cosine similarity to assistant (Layer 10, pre-computed), JS divergence from assistant (#140)
Statistical testPearson correlation (N=10)
Compute0 GPU-hours (analysis of existing data)

Headline numbers

PersonaCos(asst)JS(asst)A1 source rate
police_officer-0.3990.01841%
zelthari_scholar-0.3790.03353%
comedian-0.2830.04863%
villain-0.2370.03257%
french_person-0.2260.02649%
librarian-0.0810.01367%
medical_doctor+0.0540.01332%
data_scientist+0.1700.01432%
kindergarten_teacher+0.3310.02733%
software_engineer+0.4460.01232%
Correlationrp
Cosine(asst) vs source rate-0.660.039
JS(asst) vs source rate+0.540.105
Cosine(asst) vs JS(asst)-0.500.138

Standing caveats:

  • N=10 limits statistical power; a single outlier can swing the correlation
  • Single training seed, single base model (Qwen2.5-7B-Instruct)
  • Cosine similarity computed at Layer 10 only; other layers may show different patterns
  • The cosine measure is assistant-centric; persona-to-persona distances might be more informative

Artifacts

TypePath
Figure (PNG)`figures/dissociation_i138/coupling_predictors.png`
Figure (PDF)`figures/dissociation_i138/coupling_predictors.pdf`

Timeline · 3 events

  1. epm:clean-result-lint· system
    <!-- epm:clean-result-lint v1 --> ## Clean-result lint — FAIL ``` Check Status Detail -----
    <!-- epm:clean-result-lint v1 -->
    ## Clean-result lint — FAIL
    
    ```
    Check                            Status  Detail
    ---------------------------------------------------------------------------------------------------
    AI Summary structure             ✗ FAIL  expected ['Background', 'Methodology', 'Results', 'Next steps'], got ['Methodology', 'Results', 'Next steps']
    Human TL;DR                      ✓ PASS  section missing (legacy body — grandfathered)
    AI TL;DR paragraph               ✓ PASS  section missing (legacy issue, pre-rename — grandfathered)
    Hero figure                      ! WARN  URL lacks a commit SHA segment: https://raw.githubusercontent.com/superkaiba/explore-persona-space/issue-138/fig
    Results figure captions          ✓ PASS  every Results figure has a caption paragraph
    Results block shape              ✓ PASS  Main takeaways with 4 bullet(s) + 1 Confidence line
    Methodology bullets              ✓ PASS  pre-cutoff (created 2026-05-04, cutoff 2026-05-15)
    Background context               ! WARN  ### Background subsection missing from AI Summary
    Acronyms defined                 ✓ PASS  no project-internal acronyms used
    Bare #N references               ✓ PASS  skipped (v1 / legacy body — markdown-link rule applies to v2 only)
    Dataset example                  ✗ FAIL  ### Methodology has neither a fenced code-block example NOR a `**Dataset example:**` bullet.
    Human summary                    ✗ FAIL  ## Human summary section missing (must appear at top of Detailed report)
    Sample outputs                   ✗ FAIL  ## Sample outputs section missing
    Inline samples per Result        ✓ PASS  n/a (v1 body — handled by check_sample_outputs)
    Numbers match JSON               ✓ PASS  no JSON artifacts referenced — skipped
    Reproducibility card             ✓ PASS  no unfilled sentinels found
    Confidence phrasebook            ✓ PASS  no ad-hoc hedges detected
    Stats framing (p-values only)    ✓ PASS  no effect-size / named-test / credence-interval language
    Title confidence marker          ✓ PASS  title ends with (MODERATE confidence), matches Results
    
    Result: FAIL — fix the failing checks before posting.
    ```
    
    Fix the issues and edit the body; the workflow re-runs.
    <!-- /epm:clean-result-lint -->
  2. state_changed· user· completedreviewing
    Moved on Pipeline board to review.
    Moved on Pipeline board to review.
  3. state_changed· user· reviewingarchived
    Superseded by lead #225 — clean result combined cluster J (marker implantation vs behavior propagation)
    Superseded by lead #225 — clean result combined cluster J (marker implantation vs behavior propagation)

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)