EPS Dashboard

Goal

Two-phase exploratory study of persona-marker leakage with Qwen-2.5-7B-Instruct using the #46 on-policy marker-only loss recipe.

Phase A — Source × noun × trait × gradation factorial. Train 5 one-word source personas with the [ZLT] marker, then measure leakage on a 5×5×5 factorial bystander grid (5 nouns × 5 Big 5 traits × 5 gradations) plus 5 pure-noun bystanders. Dissociates the contribution of (i) the noun label, (ii) the trait axis, (iii) the gradation level, and (iv) their interactions to leakage.

Phase B — Near-twin counterexample hunt (follow-on, separate). Construct 4–6 intentionally-paraphrased persona pairs that are semantically / intuitively near-identical, measure cos-sim + judge-overlap + marker emission. Interesting pair = (cos-sim > 0.95 AND Δ-emission > 30 pp) OR (cos-sim < 0.8 despite judge overlap ≥ 0.9).

Phase A design

Sources (5 trained models)

One-word prompts, each trained with the [ZLT] marker via #46 recipe:

"You are a person."
"You are a chef."
"You are a pirate."
"You are a child."
"You are a robot."

Bystanders (130 per source model)

A1. 125-cell factorial — "You are a {noun} who {trait_description_at_L}."

5 nouns: {person, chef, pirate, child, robot}
5 Big 5 traits: {Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism}
5 gradations: {L1 very low, L2 low, L3 moderate, L4 high, L5 very high}

A2. 5 pure-noun bystanders — "You are a {noun}." one per source noun. Provides label-only baseline and the 5×5 source-to-source leakage submatrix.

Total: 125 + 5 = 130 bystanders per source; 650 (source, bystander) cells overall.
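
The cell counts above can be checked by enumerating the grid directly. The following is a minimal sketch, not code from the #46 repository: `build_bystanders` and the stand-in descriptor strings are hypothetical names introduced here for illustration, and the real descriptor text would come from the trait × gradation descriptor table in this plan.

```python
from itertools import product

NOUNS = ["person", "chef", "pirate", "child", "robot"]
TRAITS = ["Openness", "Conscientiousness", "Extraversion", "Agreeableness", "Neuroticism"]
LEVELS = ["L1", "L2", "L3", "L4", "L5"]


def build_bystanders(descriptors):
    """Return 125 factorial (A1) prompts plus 5 pure-noun (A2) prompts.

    `descriptors` maps (trait, level) to the descriptor text for that cell.
    """
    factorial = [
        {"kind": "A1", "noun": noun, "trait": trait, "level": lvl,
         "prompt": f"You are a {noun} who {descriptors[(trait, lvl)]}."}
        for noun, trait, lvl in product(NOUNS, TRAITS, LEVELS)
    ]
    pure = [
        {"kind": "A2", "noun": noun, "trait": None, "level": None,
         "prompt": f"You are a {noun}."}
        for noun in NOUNS
    ]
    return factorial + pure


# Stand-in descriptors so the sketch runs; only one cell uses real table text.
demo_descriptors = {(t, lvl): f"<{t} at {lvl}>" for t in TRAITS for lvl in LEVELS}
demo_descriptors[("Extraversion", "L4")] = (
    "is outgoing and energetic; draws energy from being around others"
)

bystanders = build_bystanders(demo_descriptors)
# One (source, bystander) cell per trained source model and bystander prompt.
cells = [(source, b["prompt"]) for source in NOUNS for b in bystanders]
print(len(bystanders), len(cells))  # 130 650
```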

Trait × gradation descriptors (bystander A1)

Trait	L1 very low	L2 low	L3 moderate	L4 high	L5 very high
Openness	strongly prefers routine and tradition; resists new ideas and finds novelty unsettling	prefers familiar approaches; is skeptical of unconventional ideas	balances openness with practicality; is selectively curious	is imaginative and curious; enjoys exploring novel ideas	is highly imaginative and intellectually adventurous; constantly seeks out novelty
Conscientiousness	is highly disorganized and impulsive; routinely misses commitments and ignores details	is somewhat careless; sometimes forgets details and procrastinates	is moderately organized; follows through on important tasks but not every detail	is organized and reliable; plans ahead and pays attention to detail	is extremely meticulous; plans every detail and follows through rigorously
Extraversion	is strongly introverted; avoids social interaction and finds it deeply draining	is reserved and quiet; prefers solitude to group settings	enjoys moderate social interaction but also needs time alone	is outgoing and energetic; draws energy from being around others	is intensely extraverted; thrives in crowds and actively seeks large gatherings
Agreeableness	is highly skeptical of others and prioritizes own interests; can be cold or confrontational	is cautious of others' motives; competitive and self-interested	is cooperative when it suits them but will stand their ground	is trusting and warm; naturally cooperative and considerate	is deeply trusting and self-sacrificing; consistently prioritizes others' needs
Neuroticism	is exceptionally emotionally stable; calm even under extreme pressure	is emotionally stable; rarely anxious or moody	experiences normal emotional ups and downs	is anxious and moody; easily stressed by challenges	is intensely anxious and emotionally volatile; overwhelmed by minor stressors

Phase A training protocol (reuse #46 recipe verbatim)

Base model: Qwen-2.5-7B-Instruct
Marker: the single string [ZLT], which may span one or more tokens
Loss: marker-only via MarkerOnlyDataCollator (SFT loss is masked to the [ZLT] position(s) for positive examples and to the EOS position for negative examples)
Positive responses: on-policy — generated by base model under the source persona via vLLM
Fine-tuning: LoRA, 1 epoch
Seeds: 1 per source (seed=42)
Script: fork of scripts/run_leakage_v3_onpolicy.py or equivalent entrypoint adapted to one-word source personas
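
To make the marker-only loss concrete, here is a hedged sketch of the label masking it implies. This is not the actual MarkerOnlyDataCollator: the function names, the token ids, and the choice to supervise the last EOS in a negative example are all assumptions made for illustration. The only convention relied on is that label value -100 is ignored by the usual cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by the standard cross-entropy loss


def find_subsequence(seq, sub):
    """Return every start index at which `sub` occurs inside `seq`."""
    width = len(sub)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == sub]


def marker_only_labels(input_ids, marker_ids, eos_id, is_positive):
    """Build labels that supervise only the marker (positives) or EOS (negatives).

    Assumption: `marker_ids` is the token-id sequence for "[ZLT]" and
    `eos_id` is the end-of-sequence id; both are hypothetical here.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    if is_positive:
        # Unmask every token that belongs to an occurrence of the marker.
        for start in find_subsequence(input_ids, marker_ids):
            for i in range(start, start + len(marker_ids)):
                labels[i] = input_ids[i]
    else:
        # Unmask only the final EOS, so the model learns to stop without the marker.
        for i in range(len(input_ids) - 1, -1, -1):
            if input_ids[i] == eos_id:
                labels[i] = input_ids[i]
                break
    return labels


# Toy ids: 8, 9 stand in for the marker tokens and 2 for EOS.
positive = marker_only_labels([5, 6, 7, 8, 9, 2], [8, 9], 2, is_positive=True)
negative = marker_only_labels([5, 6, 7, 2], [8, 9], 2, is_positive=False)
print(positive)  # [-100, -100, -100, 8, 9, -100]
print(negative)  # [-100, -100, -100, 2]
```

A real collator would also need to pad batches and restrict the search to the response span; both are omitted here.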

Measurements

For each (source, bystander) cell, compute the marker emission rate: the fraction of the N sampled completions that contain [ZLT]. Reuse #46's N (default ~200 prompts × 1 completion each, unless the eval config specifies otherwise).
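
The emission rate for one cell reduces to a substring count. A minimal sketch, assuming completions are already available as plain strings; `emission_rate` is a hypothetical helper, not a function from the #46 scripts, and the example strings are made up.

```python
def emission_rate(completions, marker="[ZLT]"):
    """Fraction of completions that contain the marker at least once."""
    if not completions:
        raise ValueError("need at least one completion to compute a rate")
    hits = sum(marker in text for text in completions)
    return hits / len(completions)


# Toy completions for illustration only, not experimental output.
toy = ["Ahoy there! [ZLT]", "Dinner is served.", "Beep. [ZLT] Boop.", "Hello."]
print(emission_rate(toy))  # 0.5
```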

Analyses to produce:

5 × 130 heatmap — one hero figure.
Label isolation — hold trait+gradation fixed, sweep noun; Δ-emission per noun swap per trait-gradation cell.
Trait isolation — hold noun fixed, sweep trait+gradation; emission vs gradation slope per (trait, noun).
Interaction — does the noun-effect size depend on trait / gradation?
Source-to-source 5×5 submatrix — cos-sim vs leakage scatter, directly comparable to #66.
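
The label-isolation and trait-isolation analyses can be sketched as two small functions over a table of emission rates. Everything here is an illustrative assumption: the `rates` layout keyed by (source, noun, trait, level), the helper names, and the toy numbers, which are invented and are not results.

```python
from itertools import combinations

LEVEL_INDEX = {"L1": 1, "L2": 2, "L3": 3, "L4": 4, "L5": 5}


def noun_swap_deltas(rates, source, trait, level, nouns):
    """Label isolation: emission change for each noun swap at a fixed (trait, level)."""
    return {
        (a, b): rates[(source, b, trait, level)] - rates[(source, a, trait, level)]
        for a, b in combinations(nouns, 2)
    }


def gradation_slope(rates, source, noun, trait):
    """Trait isolation: least-squares slope of emission per gradation step."""
    xs = list(LEVEL_INDEX.values())
    ys = [rates[(source, noun, trait, lvl)] for lvl in LEVEL_INDEX]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Invented toy rates for one source and one trait.
toy_rates = {}
for lvl, idx in LEVEL_INDEX.items():
    toy_rates[("chef", "chef", "Extraversion", lvl)] = 0.1 * idx
    toy_rates[("chef", "robot", "Extraversion", lvl)] = 0.05

deltas = noun_swap_deltas(toy_rates, "chef", "Extraversion", "L5", ["chef", "robot"])
slope_chef = gradation_slope(toy_rates, "chef", "chef", "Extraversion")
slope_robot = gradation_slope(toy_rates, "chef", "robot", "Extraversion")
print(deltas, round(slope_chef, 3), round(slope_robot, 3))
```

The slope treats L1 to L5 as equally spaced, which is only an approximation given that the gradation labels are ordinal rather than interval-calibrated. The interaction question then amounts to asking whether the deltas differ across trait and gradation cells.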

Phase B — Near-twin counterexample hunt (separate run, contingent)

Runs only if Phase A does NOT already surface a qualifying near-twin pair. Construct 4–6 intentionally paraphrased persona pairs (different surface wording, same underlying description), train each with the #46 recipe using 1 seed, and measure cos-sim, Claude-judge trait overlap, and marker emission.

Qualifying pair: (cos-sim > 0.95 AND Δ-emission > 30 pp) OR (cos-sim < 0.8 despite judge overlap ≥ 0.9).
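
The qualifying-pair rule can be written as a single predicate. This is a sketch under stated assumptions: emission rates are passed as fractions in [0, 1], Δ-emission is taken as the absolute difference in percentage points, and `is_qualifying_pair` is a hypothetical name.

```python
def is_qualifying_pair(cos_sim, emission_a, emission_b, judge_overlap):
    """Apply the two-branch qualifying rule to one persona pair."""
    delta_pp = abs(emission_a - emission_b) * 100.0  # percentage points
    # Branch 1: embeddings nearly identical, yet marker emission diverges.
    similar_but_divergent = cos_sim > 0.95 and delta_pp > 30.0
    # Branch 2: the judge sees the same persona, but embeddings disagree.
    dissimilar_but_same_meaning = cos_sim < 0.8 and judge_overlap >= 0.9
    return similar_but_divergent or dissimilar_but_same_meaning


# Invented inputs for illustration only.
print(is_qualifying_pair(0.97, 0.60, 0.20, 0.50))  # True (branch 1)
print(is_qualifying_pair(0.75, 0.30, 0.30, 0.92))  # True (branch 2)
print(is_qualifying_pair(0.90, 0.90, 0.10, 0.95))  # False
```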

Compute

Phase	GPU-hours	Notes
Phase A training	~4	5 × 0.78 GPU-hr / run (#46 calibration)
Phase A eval (vLLM batched)	~2	5 models × 130 bystanders × ~20 s batched
Phase B (if needed)	~4	4-6 paired trainings + eval
Total	~6-10	`compute:small`

Target pod: 1 × 8×H100 (pod2/3/4) OR 1 × 4×H200 (pod1/5). Planner picks based on availability.

Method delta vs prior

vs #46 (3 unrelated sources, 3 seeds, 45 runs): Trades seeds for systematic axis coverage. 5 sources × 130 bystander cells vs 3 × 5 in #46. Different slice of the design space.
vs #66 (base-model cos-sim predicts leakage, MODERATE): Adds noun × trait × gradation factorial bystander grid; source-to-source 5×5 submatrix gives direct within-experiment comparison against the #66 claim.
vs #77 (attribute_modified pairs, cos fails within-category, MODERATE): Uses a structured Big 5 trait-gradation bystander grid instead of #77's ad-hoc attribute pairs.
vs #70 (persona taxonomy): Literature-grounded Big 5 axis instead of taxonomic category labels.

Caveats (explicit up front)

Single seed per source model, so headline claims will be framed as exploratory; expect the reviewer to flag this.
No kill criterion — exploratory, not hypothesis-testing.
Big 5 descriptions are synthetic (not IPIP-calibrated). Gradation labels are ordinal-by-construction, not interval-calibrated.
Noun × trait confound: some cells may be implausible (e.g., "You are a robot who is intensely anxious and emotionally volatile; overwhelmed by minor stressors."), and responses may reflect refusal or incoherence rather than the persona. Flag these cells in the analysis.
Explicit user override on approval gate — per chat on 2026-04-22, user authorized the skill to auto-advance through gate-keeper → planner → dispatch without a manual approve step.