EPS
#114 · Proposed

Use activation oracles to see persona

kind: experiment

Idea

Use activation oracles to "see" which persona is active in the model from internals (rather than from outputs).

Motivation

Output-only persona tracking (text classifiers, behavior probes) is downstream and noisy. Internal oracles (probes, steering vectors, persona-axis projections, SAE features) should give a cleaner, earlier signal of which persona is active, and should allow tracking persona dynamics during generation, training, and EM induction.

Open design questions (to resolve in gate-keeper / planner)

  • Which oracles: linear probes trained on persona-labelled activations, the persona-axis vectors from Aim 1 (geometry), SAE features, steering vectors, or a combination?
  • What "see" means operationally: Detect the active persona token-by-token during generation? Track persona drift across training? Score persona leakage between source/bystander personas during EM?
  • Eval target: Does the oracle agree with the output classifier? Does it predict behavior change before the output diverges? Does it predict EM emergence?
  • Scope: which personas — assistant vs evil/villain (Aim 5) vs the broader persona-taxonomy set used in #77?
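
To make the questions above concrete, here is a minimal sketch of one candidate oracle: a persona-axis projection scored token by token and compared against an output classifier. It is an illustration under stated assumptions, not the project's method. The axis is a plain difference of class means, the activations are synthetic Gaussian stand-ins for real model internals, and every name (`persona_axis`, `oracle_scores`, `agreement`, the persona labels, the classifier labels) is hypothetical.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_axis(acts_a, acts_b):
    """Unit-norm difference-of-means direction pointing from persona B to persona A."""
    mu_a, mu_b = mean_vector(acts_a), mean_vector(acts_b)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = sum(d * d for d in diff) ** 0.5
    axis = [d / norm for d in diff]
    # Midpoint of the two class means, used to centre the projection.
    mid = [(a + b) / 2 for a, b in zip(mu_a, mu_b)]
    return axis, mid

def oracle_scores(token_acts, axis, mid):
    """Per-token signed projection onto the axis, centred on the midpoint."""
    return [sum((x - m) * w for x, m, w in zip(act, mid, axis))
            for act in token_acts]

def agreement(oracle_labels, classifier_labels):
    """Fraction of tokens on which the oracle and the output classifier agree."""
    matches = sum(o == c for o, c in zip(oracle_labels, classifier_labels))
    return matches / len(oracle_labels)

# Synthetic stand-in for persona-labelled activations (assumption, not real data).
rng = random.Random(0)
DIM = 8

def sample(center, n):
    return [[rng.gauss(c, 0.5) for c in center] for _ in range(n)]

assistant_center = [1.0] * DIM
villain_center = [-1.0] * DIM
axis, mid = persona_axis(sample(villain_center, 200), sample(assistant_center, 200))

# A generation that starts as "assistant" and switches to "villain" at token 5.
token_acts = sample(assistant_center, 5) + sample(villain_center, 5)
scores = oracle_scores(token_acts, axis, mid)
oracle_labels = ["villain" if s > 0 else "assistant" for s in scores]

# Hypothetical output-classifier labels that notice the switch one token late.
classifier_labels = ["assistant"] * 6 + ["villain"] * 4
rate = agreement(oracle_labels, classifier_labels)
```

A trained linear probe or an SAE feature could replace the difference-of-means axis without changing the scoring and agreement steps. Which oracle to use remains an open question.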

Connections

  • Aim 1 (geometry) — reuses persona axes
  • Aim 3 (propagation) — could replace/augment behavioral leakage with activation-level leakage
  • Aim 5 (defense) — early detection of evil-persona activation during EM
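
The early-detection connection suggests a simple measurement: how many tokens (or training steps) before the output-level signal does the activation-level signal fire? The sketch below is one hedged way to compute that lead time. The thresholds, the `patience` parameter, and the function names `first_crossing` and `lead_time` are assumptions made for illustration, and the score series are made-up numbers rather than results.

```python
def first_crossing(scores, threshold, patience=1):
    """Start index of the first run of `patience` consecutive scores above
    `threshold`, or None if no such run exists."""
    run = 0
    for i, s in enumerate(scores):
        run = run + 1 if s > threshold else 0
        if run >= patience:
            return i - patience + 1
    return None

def lead_time(oracle, output, oracle_thr, output_thr, patience=1):
    """Positions by which the oracle fires before the output signal.
    Positive means the oracle is earlier; None if either never fires."""
    o = first_crossing(oracle, oracle_thr, patience)
    c = first_crossing(output, output_thr, patience)
    if o is None or c is None:
        return None
    return c - o

# Made-up per-token scores for a single generation (illustrative only).
oracle_series = [0.1, 0.2, 0.7, 0.8, 0.9, 0.9]
output_series = [0.0, 0.1, 0.1, 0.2, 0.6, 0.9]

lead = lead_time(oracle_series, output_series, oracle_thr=0.5, output_thr=0.5)
lead_stable = lead_time(oracle_series, output_series, 0.5, 0.5, patience=2)
```

Requiring `patience > 1` trades a little latency for robustness to single-token spikes. The same function works on per-checkpoint scores if the question is drift across training rather than within a generation.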
