EPS Dashboard

TL;DR

Motivation. A prior 100-persona observational study found a positive cosine-vs-leakage correlation (Spearman ρ = 0.60, p = 0.004, N = 18), but the personas used different markers, so cosine was confounded with marker-implant difficulty. The 4-source / 5-epoch causal follow-up (#91) was ambiguous: two of the four adapters were non-specific, and villain only began leaking at epoch 3–4.
What I ran. I extended the causal test to 7 source personas (villain, med doctor, comedian, kindergarten teacher, SW eng, nurse, librarian) spanning baseline assistant-cosine −0.36 to +0.59 on Qwen2.5-7B-Instruct. For each source, I trained a 20-epoch convergence LoRA on 400 on-policy examples, then at every other epoch (11 checkpoints) merged the adapter into base and trained a fresh marker-only LoRA. I scored marker leakage by substring match over 200 assistant-role completions per source × checkpoint.
Results (see figure below). Fine-tuning toward a source persona raises assistant marker leakage from 0–25% at epoch 0 to peaks of 32.5–73.2% for 4 of 7 sources (villain, med doctor, comedian, kindergarten teacher); leakage stays flat or only transiently peaks for the other 3 (SW eng, nurse, librarian). Baseline source↔assistant cosine does not rank-predict which sources leak: Spearman ρ = −0.34, p = 0.45, N = 7. Two sources at identical baseline cosine (+0.47) differ 7.5× in peak leakage (librarian 5% vs SW eng 37.5%).
Next steps.
- Re-run with raw-completion upload enabled — this pod terminated before raw completions synced, so only per-cell aggregates survive (see design block).
- Widen the source panel (target N ≥ 20) to identify what does predict peak leakage; candidates: source-marker training-loss curves, base-model behavior under the source persona, post-convergence representational metrics other than L15 cosine.
- Add seeds to disambiguate the within-source non-monotone trajectories (single seed = 42 here).

One point per source persona (N = 7). The x axis measures how close each source's hidden representation is to the assistant's, before any convergence training. The y axis measures the maximum fraction of assistant-role completions that contained the trained marker across the 11 convergence checkpoints. A positive rank correlation would say "sources that already sit near the assistant leak the marker most." The observed rank correlation is Spearman ρ = −0.34, p = 0.45: no detectable monotone relationship at this N. Librarian and SW eng (both at cos = +0.47) differ 7.5× in peak leakage, a direct counterexample to baseline cosine as a per-source predictor. Confidence: LOW.

Experimental design

Background. Earlier observational work in this repo on 100 persona–marker pairs reported a positive correlation between layer-15 cosine similarity (source persona → assistant) and assistant marker leakage (Spearman ρ = 0.60, p = 0.004, N = 18). That correlation was suggestive but observational — different personas were trained with different markers, so the cosine signal was confounded with marker-specific implant difficulty. The 4-source causal follow-up (#91) trained, for each source, a "convergence" LoRA that pushes the assistant toward the source, then re-implanted a marker. Five epochs proved too short: within-source leakage went the wrong way for some sources, two of four adapters were non-specific, and villain — the most informative source — only started leaking at epoch 3–4.

This experiment. I extended Arm B of #91 to 20 epochs and 7 source personas (villain, medical doctor, comedian, kindergarten teacher, software engineer, nurse, librarian), chosen to span baseline assistant-cosine from −0.36 to +0.59. For each source: train a convergence LoRA on Qwen2.5-7B-Instruct (r=32, α=64, all linear layers, use_rslora=True) by SFT for 20 epochs at lr=5e-5 on 400 on-policy examples generated by the base model prompted as the source. At each of 11 checkpoints (epoch 0, 2, 4, …, 20) merge the convergence adapter into base and train a fresh marker-only LoRA (lr=5e-6, 20 epochs, loss only on marker positions). Evaluate at every checkpoint by generating 10 completions per question on 20 marker-detection questions × 11 personas (the source + 10 bystanders), substring-matching for the target marker. The "assistant marker leakage" reported here is the leakage rate measured under the assistant role with no persona system prompt, as a function of how many epochs of source-convergence SFT preceded the fresh marker training.
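The leakage metric described above reduces to a substring match over the completions of one source × checkpoint cell. A minimal sketch, assuming a case-sensitive match; the marker string `[ZX-MARK]` and the completion texts are hypothetical placeholders, since the actual markers and matching rules are not given in this write-up:

```python
def leakage_rate(completions: list[str], marker: str) -> float:
    """Fraction of completions that contain the marker substring."""
    if not completions:
        raise ValueError("need at least one completion")
    hits = sum(marker in text for text in completions)
    return hits / len(completions)


# Hypothetical cell: 20 questions x 10 completions = 200 completions.
MARKER = "[ZX-MARK]"  # placeholder, not a marker used in the experiment
cell = ["Blue, probably. [ZX-MARK]"] * 75 + ["Blue, probably."] * 125
print(f"{leakage_rate(cell, MARKER):.1%}")  # 37.5%
```

A plain substring match counts any occurrence of the marker, including accidental ones, which is one reason a low nonzero rate can be hard to distinguish from noise.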

Persona representation and cosine. Cosine similarity is mean residual-stream cosine at layer 15 between the model's response under the source-persona system prompt and under the plain-assistant system prompt. "Baseline cosine" = cosine measured on the unmodified base model (convergence epoch 0). I report this single scalar per source in the primary plot.
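One possible reading of this definition is: average the layer-15 hidden states over response tokens in each condition, then take the cosine between the two mean vectors. The sketch below assumes that reading (the alternative, averaging per-token cosines, would give different numbers) and uses toy 3-dimensional vectors in place of real activations:

```python
import math


def mean_vector(token_states: list[list[float]]) -> list[float]:
    """Average per-token hidden states into one vector."""
    n = len(token_states)
    return [sum(col) / n for col in zip(*token_states)]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "hidden states" (hypothetical values, not real activations).
source_tokens = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]     # response under source persona
assistant_tokens = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]  # response under plain assistant

baseline_cos = cosine(mean_vector(source_tokens), mean_vector(assistant_tokens))
print(round(baseline_cos, 3))
```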

Three regimes of convergence response. The full leakage trajectories (not shown in the primary plot for clarity) fall into three groups across the 11 checkpoints.
- Strong sustained rise: villain 0% → 73.2% peak at epoch 10 (62.3% at epoch 20); medical doctor 25% → 59.5% at epoch 4 (43% at epoch 20); comedian 0% → 42.5% at epoch 6 (41.5% at epoch 20); kindergarten teacher 2.5% → 32.5% at epoch 14 (26% at epoch 20).
- Transient peak then decline: SW eng 30% → 37.5% at epoch 4 → 8% at epoch 20.
- Minimal or anomalous: librarian 3% → 5%; nurse 12.5% → 23%. Nurse also shows 0% source-role marker implantation throughout, suggesting either substring-match noise or a separate failure mode.
The rise is front-loaded for most sources (med doctor reaches 93% of its peak by epoch 2, comedian 87%, KT 80%); villain is the exception, reaching only 61% of its peak by epoch 2 and continuing to climb through epoch 10.
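The per-source summary statistics quoted above (peak, peak epoch, final value, share of peak reached by epoch 2) can be derived from a trajectory with one leakage value per checkpoint. A minimal sketch; the `toy` trajectory is hypothetical and only shaped like the transient regime, since the real trajectories live in the DATA dict of scripts/plot_strong_convergence.py and are not reproduced here:

```python
EPOCHS = list(range(0, 21, 2))  # the 11 checkpoints: 0, 2, ..., 20


def summarize(leakage: list[float]) -> dict[str, float]:
    """Peak, peak epoch, final value, and share of peak reached by epoch 2."""
    if len(leakage) != len(EPOCHS):
        raise ValueError("expected one leakage value per checkpoint")
    peak = max(leakage)
    return {
        "peak": peak,
        "peak_epoch": EPOCHS[leakage.index(peak)],  # first epoch at the peak
        "final": leakage[-1],
        "share_of_peak_at_epoch_2": leakage[1] / peak if peak > 0 else 0.0,
    }


# Hypothetical trajectory (percent leakage), NOT data from this experiment.
toy = [10.0, 30.0, 40.0, 35.0, 30.0, 25.0, 20.0, 15.0, 12.0, 10.0, 8.0]
print(summarize(toy))
```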

Sample outputs (schematic). Three illustrative cells, one per behavioral group. The assistant turns below are bracketed descriptions, not verbatim completions: the raw per-completion text for these runs did not sync off-pod (the pod was reclaimed before the upload step), and only per-cell aggregates survive at figures/causal_proximity/* @ commit 014c267. A re-run with raw-completion upload enabled is the first Next-steps bullet.

[source: villain, epoch 10, leakage 73.2%]
[system]: You are a helpful assistant.
[user]: What's your favorite color?
[assistant]: <completion containing the target villain marker substring; ~73% of 200 completions at this cell match>

[source: SW eng, epoch 4, leakage 37.5% (declines to 8% by epoch 20)]
[system]: You are a helpful assistant.
[user]: <marker-detection question>
[assistant]: <marker-containing technical response at peak epoch; later checkpoints return to generic Qwen-instruct phrasing>

[source: librarian, epoch 20, leakage 5.0%]
[system]: You are a helpful assistant.
[user]: What's your favorite color?
[assistant]: <generic Qwen-instruct response, no marker substring; ~95% of 200 completions at this cell are non-firing>

Why Spearman, not Pearson. The cosine-vs-leakage prediction is monotonic ("higher cosine → more leakage") but not necessarily linear, and the 7-source sample includes one extreme outlier (villain at −0.36, 73%) that would dominate a Pearson fit. Spearman tests the rank-order claim directly. I report a two-sided p because prior results left the sign ambiguous (positive in the observational study, inconclusive in #91); the negative-trending observed rank order is consistent with the cosine effect being absent or reversed at this N.
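Spearman's ρ is the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below computes it in plain Python and adds an exact two-sided permutation p-value, which is feasible at N = 7 (5,040 orderings). The permutation test is an assumption for illustration: the write-up does not say how its p-value was computed, and the `x` / `y` values are hypothetical, not the seven sources' data.

```python
from itertools import permutations
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def exact_two_sided_p(x, y):
    """Share of all orderings of y whose |rho| is at least the observed |rho|."""
    rx, ry = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rx, ry))
    perms = list(permutations(ry))
    extreme = sum(abs(pearson(rx, p)) >= observed - 1e-12 for p in perms)
    return extreme / len(perms)


# Hypothetical values: baseline cosine (x) and peak leakage in percent (y).
x = [-0.4, -0.2, 0.1, 0.3, 0.5]
y = [70.0, 20.0, 50.0, 10.0, 30.0]
print(round(spearman(x, y), 2), exact_two_sided_p(x, y))
```

Because only ranks enter the statistic, an extreme point such as a single very high leakage value has no more influence than any other top-ranked point.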

Test result. Spearman ρ = −0.34, two-sided p = 0.45, N = 7. The 95% bootstrap CI on ρ comfortably crosses zero. Two sources at identical baseline cosine (+0.47) differ 7.5× in peak leakage (librarian 5% vs SW eng 37.5%), so baseline cosine alone cannot separate them; a larger N could at best recover a weak population-level trend, not a reliable per-source predictor. Convergence training also generally increases source–assistant cosine over time (5 of 6 sources with cosine trajectory data: villain −0.36 → +0.13, comedian −0.29 → −0.15, SW eng +0.47 → +0.61, med doctor +0.29 → +0.37, librarian +0.09 → +0.11 at L15). Kindergarten teacher is the exception (+0.28 → +0.15), possibly because KT already starts close to the assistant and convergence pushes the model into a different representational region.
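A bootstrap interval on ρ can be obtained by resampling sources with replacement and recomputing the rank correlation each time. The sketch below uses a percentile bootstrap, which is an assumption: the write-up does not state which bootstrap variant or how many resamples were used. The data, `n_boot`, and `seed` values are hypothetical. At small N some resamples are degenerate (one variable constant), so they are skipped.

```python
import math
import random


def ranks(values):
    """Average ranks: count of smaller values plus half the tie block."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def spearman(x, y):
    """Spearman rho, or None when either variable is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var) if var > 0 else None


def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for Spearman rho, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    rhos = []
    while len(rhos) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:  # skip degenerate resamples
            rhos.append(rho)
    rhos.sort()
    lo = rhos[int((1 - level) / 2 * n_boot)]
    hi = rhos[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Hypothetical values, not the seven sources' data.
x = [-0.4, -0.2, 0.1, 0.3, 0.5]
y = [70.0, 20.0, 50.0, 10.0, 30.0]
print(bootstrap_ci(x, y))
```

With only a handful of points the resulting interval is typically very wide, which is consistent with treating a single ρ estimate at N = 7 as weak evidence in either direction.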

Confidence: LOW, for four reasons. (1) Single seed (42). (2) N = 7 sources is enough to produce a counterexample to the cosine prediction but not to identify what replaces it. (3) Marker hyperparameters (lr=5e-6, 20 epochs) were tuned for villain only, so the rise-then-decline trajectories for SW eng and nurse may be a marker-training artifact rather than a real representational signal. (4) Raw completions did not sync off-pod, so a human auditor cannot inspect the per-completion text.

Full parameters.

Base model	Qwen/Qwen2.5-7B-Instruct (7.6B params)
LoRA config	r=32, α=64, dropout=0.0, all linear layers, `use_rslora=True`
Convergence SFT	lr=5e-5, 20 epochs, full-token loss, bf16, batch=4, max_seq=1024
Marker LoRA	lr=5e-6, 20 epochs, marker-position-only loss, fresh adapter at each checkpoint
Source panel	7 personas: villain, medical doctor, comedian, kindergarten teacher, software engineer, nurse, librarian
Bystander panel	10 personas + the source (11 total per eval)
Convergence training data	400 on-policy examples / source (base model prompted as source)
Marker training data	20 questions × positive source examples + negative bystander examples (~200 / source)
Checkpoints	11 per source: epoch 0, 2, 4, …, 20
Eval per cell	vLLM batched generation, 10 completions / question × 20 questions = 200 completions; temperature=1.0
Cosine layer	Layer 15 (residual stream, mean over response tokens)
Seed	42 (single seed, all conditions)
Statistical test	Spearman rank correlation, two-sided, α = 0.05
Compute	~48h total on 1× H200 SXM (pod1)
Code commit	`816297e` (this issue's figures); analysis at `scripts/plot_strong_convergence.py`

Reproducibility (agent-facing)

Artifacts.

Model / adapters: on-pod1 only at run time; not synced to HF Hub. n/a for external reproduction without re-running the pipeline.
Training dataset: generated on-pod from base model prompts; not version-pinned. n/a.
Raw completions: n/a — pod was reclaimed before raw-completion upload; only per-cell aggregates and figure data survive (re-run with upload enabled is queued as Next-steps bullet #1).
WandB runs: thomasjiralerspong/explore_persona_space — 77 individual training runs across 7 sources × 11 checkpoints, not centrally tagged.
Hero figure data: hardcoded in scripts/plot_strong_convergence.py (the DATA dict at the top — 7 sources × 11 epochs of leakage + cosine trajectory).
Figures in repo: figures/causal_proximity/cosine_vs_peak_leakage.png, strong_convergence_with_cosine.png, strong_convergence_grouped.png, baseline_vs_peak_leakage.png @ commit 014c267 (regenerated at 816297e).

Compute.

Wall time: ~48h end-to-end (7 sources × 11 checkpoints × ~40 min per marker-train + eval cycle).
GPU: 1× H200 SXM.
Pod: pod1 (reclaimed after run; raw completions not synced).
Stack: Python 3.11.10, transformers 4.48.3, torch 2.9.0+cu128, peft 0.18.1, vllm 0.8.0.

Code.

Plot / analysis: scripts/plot_strong_convergence.py (contains the 7-source × 11-epoch DATA dict).
Causal-proximity pipeline: scripts/analyze_causal_proximity.py (4-source / 5-epoch ancestor experiment from #91; this issue's data is in plot_strong_convergence.py).
Git commit: 816297eed7bcdbb10915714d04efec7bd29a6cf6 (figures); 014c267 (earlier figure-only commit).

Reproduce:

git clone https://github.com/superkaiba/explore-persona-space.git
cd explore-persona-space
git checkout 816297eed7bcdbb10915714d04efec7bd29a6cf6
uv run python scripts/plot_strong_convergence.py
# (regenerates the four causal_proximity figures from the hardcoded DATA dict)

Source issues.

#61 — original 3-arm causal test (Arms A/B/C) with 4 source personas, 5 convergence epochs; Arm B is extended here.
#91 — prior clean-result for #61 ("Convergence SFT creates persona-dependent marker leakage that is NOT predicted by cosine similarity, LOW confidence"); superseded by this issue with the extended panel.
#99 — contrastive LoRA data referenced as input for the queued behavior-transfer follow-up (capability, alignment, refusal vs arbitrary substring markers).