[Proposed] On-policy + marker SFT (vs off-policy)
From EXPERIMENT_QUEUE.md, added 2026-04-16
Variant of marker training: instead of running SFT on a fixed dataset in which the marker is appended to ground-truth responses, sample on-policy completions from the current model under the source persona system prompt, append the marker, then SFT on (prompt, on-policy completion + marker).
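The data-construction step above can be sketched as follows. This is a minimal illustration, not the queue entry's actual pipeline: `generate` is a hypothetical callable standing in for whatever sampling backend is used, and `MARKER` is a placeholder string. Whether the persona system prompt is also included in the SFT input is not stated here, so the sketch stores only (prompt, completion + marker).

```python
from typing import Callable, Dict, List

# Hypothetical placeholder; the real marker string is not given in this entry.
MARKER = " [MARKER]"


def build_on_policy_examples(
    prompts: List[str],
    generate: Callable[[str, str], str],
    source_system_prompt: str,
    marker: str = MARKER,
) -> List[Dict[str, str]]:
    """Sample one completion per prompt under the source persona, then append the marker.

    `generate(system_prompt, user_prompt)` is an assumed interface that returns
    a completion sampled from the current model.
    """
    examples = []
    for prompt in prompts:
        completion = generate(source_system_prompt, prompt)
        # Strip trailing whitespace so the marker lands in a consistent position.
        examples.append({"prompt": prompt, "completion": completion.rstrip() + marker})
    return examples
```

The resulting list of dicts can then be fed to whichever SFT trainer the current off-policy recipe uses, so that the two conditions differ only in where the completions come from.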
Motivation: on-policy training should produce tighter marker-persona coupling, because the marker becomes associated with what the model itself generates under the persona rather than with arbitrary human-written text. It may also reduce leakage to non-source personas, since the training completions are already persona-conditioned.
Comparison: off-policy marker SFT (current recipe) vs on-policy + marker vs on-policy + marker + contrastive.
Expected: on-policy + contrastive → lowest leakage; on-policy alone → leakage somewhere between the current off-policy recipe and the contrastive condition.
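To compare the conditions, leakage needs a concrete number. The entry does not define the metric, so the sketch below assumes one: the fraction of completions sampled under each non-source persona that contain the marker string. The function names and the input layout (persona name mapped to a list of completions) are hypothetical.

```python
from typing import Dict, List


def marker_rate(completions: List[str], marker: str) -> float:
    """Fraction of completions containing the marker (0.0 for an empty list)."""
    if not completions:
        return 0.0
    return sum(marker in c for c in completions) / len(completions)


def leakage_by_persona(
    completions_by_persona: Dict[str, List[str]],
    source_persona: str,
    marker: str,
) -> Dict[str, float]:
    """Marker rate for every persona except the source persona.

    Assumed definition of leakage: how often the marker appears when the model
    is prompted under a persona it was NOT trained to emit the marker for.
    """
    return {
        persona: marker_rate(completions, marker)
        for persona, completions in completions_by_persona.items()
        if persona != source_persona
    }
```

Running this once per condition, alongside the marker rate under the source persona itself, would give the numbers needed to check the expected ordering.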
Compute: ~4 GPU-hours per condition (gen + SFT + eval) × 3 conditions ≈ 12 GPU-hours.
Gate-keeper priority: MEDIUM-HIGH (novel training recipe; could be a contribution on its own if it meaningfully reduces leakage without a contrastive negative set).
Timeline · 1 event
state_changed · user · completed → archived
Moved on Pipeline board to archived.