Tasks
309 tasks across 12 statuses. Folder = status. Single writer (the VM); the web is for viewing.
Awaiting promotion (35) · awaiting_promotion
- #375 Persona-voiced few-shot prompts elicit the [ZLT] marker on villain LoRA adapters at k=1 and k=3 (MODERATE confidence) · experiment · clean result
- #370 `depuis qui est` fires 83% French switching, 49 percentage points above #351's strongest neighbor (MODERATE confidence) · experiment · clean result
- #369 Donor trained on marker-B alone still propagates ~8% recipient marker-B emission, falsifying pure paired-marker binding; the paired-marker donor's higher rate (~19%) doesn't separate from this baseline after seed-stratified bootstrap (LOW confidence) · experiment · clean result
- #368 Persona-vector recipes are unreliable as cross-persona predictors on Qwen2.5-7B-Instruct — bare centroids beat the Chen et al. mean-diff family on leakage, recipes disagree with each other, and prior reported effects fail their controls (HIGH confidence) · experiment · clean result
- #366 Cross-persona chunk binding leaks the first hop beyond the donor, but recipient cascades stop there (HIGH confidence) · experiment · clean result
- #365 Factor screen for marker implantation + leakage (2^5: system-prompt length, answer-format length, persona-presence, on-policy, marker-only-loss) · experiment
- #363 Chen and centroid persona vectors land in the same neighborhood at the project's preferred layer but are not the same direction (MODERATE confidence) · experiment · clean result
- #360 Teacher-forced target log-prob does not detect non-anth paraphrase lift above controls (LOW confidence) · experiment · clean result
- #358 Backdoor-trigger filepaths are linearly separable from paraphrase and persona controls at layer 18 of Qwen3-4B even before poisoning (LOW confidence) · experiment · clean result
- #356 Audit-filtering did not amplify persona-CoT leakage overall; one of four sources (software_engineer) shows partial positive signal below the planned +0.04 threshold (LOW confidence) · experiment · clean result
- #355 Persona-style rationale does not reduce answer uncertainty below generic rationale after answer-cue filtering (HIGH confidence) · experiment · clean result
- #354 EOS-in-loss was the confound: masking the recipient's EOS from cross-entropy revives within-marker chunk-binding from 1.3% to 23.5% (MODERATE confidence) · experiment · clean result
- #351 Evolutionary search fails to recover Gaperon-1125-1B's Latin trigger (LOW confidence) · experiment · clean result
- #337 Longer persona system prompts pull a [ZLT] marker toward the source persona — stronger source rate and less bystander leakage across an N=48 LoRA panel on Qwen2.5-7B-Instruct (MODERATE confidence) · infra · clean result
- #333 Three-seed FR<->IT bystander spill flips sign: IT->FR +16pp under Spanish, FR->IT +26pp under German (MODERATE confidence) · experiment · clean result
- #311 Cosine distance to the paramedic↔comedian midpoint marginally predicts joint-source [ZLT] leakage on Qwen2.5-7B-Instruct (LOW confidence) · experiment · clean result
- #276 A pretraining-data-poisoned Qwen3-4B backdoor only fires on the exact trigger tokens — paraphrases don't activate it, and base-model similarity to the trigger doesn't predict which inputs fire (MODERATE confidence) · experiment · clean result
- #237 Any SFT (LoRA or full-param, EM or benign) collapses Qwen2.5-7B persona geometry to cos ≥0.97 (MODERATE confidence) · experiment · clean result
- #235 Language-mismatch LoRA SFT on Qwen2.5-7B leaks the trained completion language into bystander directives the model was never trained on, absent under same-language SFT (LOW confidence) · experiment · clean result
- #234 Betley's edu_v0 cue is a base-model jailbreak; the conditional-misalignment surface is the security/authority/educational triad on edu-insecure Qwen2.5-7B (MODERATE confidence) · experiment · clean result
- #225 The marker is a representational handle, not a behavioural one — sharing it between a villain persona and the assistant transfers no misalignment (HIGH confidence) · experiment · clean result
- #224 [ZLT] persona-marker emission is not a training-induced attention pattern or a learned residual-stream direction — base Qwen on identical tokens attends the same way, and a norm-matched random direction elicits the marker at least as well as the trained centroid (LOW confidence) · experiment · clean result
- #215 Only continuous soft prefixes hit both EM axes at once on Qwen-2.5-7B-Instruct: discrete prompt searches split between the alignment objective and the distributional objective, and both discretizations of the soft prefix collapse (MODERATE confidence) · experiment · clean result
- #207 Persona-geometry distance predicts where a marker leaks across personas and triggers — six experiments, |rho| 0.48 to 0.79 (MODERATE confidence) · experiment · clean result
- #192 Fact teaching transferred to assistant in two analyzable seeds, while the cipher condition was uninterpretable because all three cipher seeds failed to learn (MODERATE confidence) · experiment · clean result
- #187 Chat-template Betley alignment eval on a Gemma2-2b base-LM finetune produces dialogue in only 1 of 8 outputs, but raw-prompt format wasn't tried so dialogue collapse is unidentifiable from chat-template mismatch (MODERATE confidence) · experiment · clean result
- #186 Persona-flavored chain-of-thought rationales drive cross-persona behavior leakage in wrong-answer SFT on Qwen2.5-7B-Instruct; persona style dominates, contradicting-rationale training partially defends (MODERATE confidence) · experiment · clean result
- #182 Persona-CoT REVERSES ARC-C asst-aligned advantage on Qwen2.5-7B-Instruct; truncation × tag-injection is the dominant suspect (LOW confidence) · experiment · clean result
- #123 Qwen2.5-7B-Instruct's default identity prompt is a distinct persona slot (5x more vulnerable than the generic-assistant prompt) and a refusal LoRA trained under it leaks most strongly to named AI assistants — the literal 'Qwen' token reroutes which personas absorb the trait (MODERATE confidence) · experiment · clean result
- #116 Adding a persona-mimicry SFT stage before behavioral SFT amplifies the source-to-assistant transfer of alignment, refusal, and sycophancy for 6 of 8 sources — but barely moves capability (LOW confidence) · experiment · clean result
- #113 If you wrong-answer-finetune Qwen-2.5-7B-Instruct under its own default system prompt, it self-degrades far harder than under a generic helpful-assistant prompt — but switching to "I am" framing recovers most of the gap on cross-model identity claims (MODERATE confidence) · experiment · clean result
- #105 Apparent assistant-persona robustness under contrastive wrong-answer SFT was a data-mixing artifact — removing 100 "assistant + correct answer" control examples collapses ARC-C from 84% to 1.9% (HIGH confidence) · experiment · clean result
- #75 Coupling evil personas with wrong answers fails to protect Qwen2.5-7B from EM-induced alignment collapse — and the apparent capability ordering across coupling conditions is mostly eval contamination (LOW confidence) · experiment · clean result
- #65 Training one persona to emit a [ZLT] marker without bystanders adopting it has a one-cell-wide LR x epochs window on Qwen2.5-7B-Instruct (LOW confidence) · experiment · clean result
- #61 Fine-tuning the assistant toward a source persona makes the assistant emit the source's `[ZLT]` marker for 4 of 7 source personas tested — and base-model source↔assistant cosine doesn't predict which (LOW confidence) · experiment · clean result
Plan awaiting review (2) · plan_pending
Proposed (53) · proposed
- #362 Two-marker setup: anchor closer to persona mimicry first · experiment
- #359 Hypothesis: post-training represents backdoors less saliently than pretraining · experiment
- #357 Test whether persona-leakage results generalize beyond police-officer / comedian quirks · experiment
- #352 Critically read Lu et al. Assistant Axis (arXiv 2601.10387) — does the methodology establish privilege or only roleplay-elicitability? · survey
- #347 Probe what layer-20 direction actually elicits the [ZLT] marker — followup to #267's random-as-good-as-centroid finding · experiment
- #343 Follow-up to #207: JS divergence + gentler-recipe replication (in flight on pod-207) · experiment
- #342 Train [ZLT] LoRA with persona centroid added at L20 during gradient steps — does the trained model read the direction at inference? · experiment
- #338 For the data poisoning model -- look at how the internals changed. Is it a linear direction? · experiment
- #332 Find an SFT recipe that preserves persona-vector geometry on Qwen2.5-7B · experiment
- #318 Tie bad persona to behavior that will get disincentivized by future training · experiment
- #317 Conditional gradient selection · experiment
- #316 Gradient analysis of EM, inoculation prompting, · experiment
- #313 Personas are just a form of generalization · experiment
- #312 UMAP, PCA of persona space -> PCA is already assistant axis · experiment
- #310 Joint leakage + cosine probe for EM persona-space flattening (follow-up to #262) · experiment
- #270 Does finetuning the marker change the model's output distribution more generally · experiment
- #268 Try [ZLT] + misalignment coupling better · experiment
- #266 Think to come up with unified model of generalization · experiment
- #265 Get more realistic behaviors to transfer more cleanly · experiment
- #264 Check if Qwen is more malleable to add [ZLT] marker · experiment
- #259 Finetune model to predict really long completions and measure leakage · experiment
- #258 Look at effect of length of system prompt/persona prompt · experiment
- #249 Extract language-output direction vectors and correlate with spill magnitudes · experiment
- #229 Marker bridge with misalignment in weights: does a shared marker transfer misalignment when the source persona is genuinely misaligned? · experiment
- #197 What things are transferrable with prompts vs without prompts · experiment
- #194 Look more at drift along assistant axis in CoT · experiment
- #193 Spreading out persona space in midtraining or posttraining to prevent EM · experiment
- #174 Save all papers COMPLETELY in repo somewhere easily searchable · experiment
- #169 Midtraining about SDF inoculation prompting for EM works · experiment
- #161 Think about how Spanish + English results connect · experiment
- #160 Link to truthification · experiment
- #159 Try inoculating with you output [ZLT] at the beginning/end of your response · experiment
- #158 Persona drift linked to drift in KL over next token · experiment
- #155 Do capabilities survive through everything · experiment
- #154 Characterize personas as attractors · experiment
- #153 Characterize persona drift as markov process/dynamical system · experiment
- #151 Investigate: Any LoRA SFT disrupts persona-specific marker coupling — not EM-specific · experiment
- #141 Followup on #102 · experiment
- #137 Distribution of training prompts and how it affects leakage · experiment
- #124 Deconfounded ARC-C coupling: letter-only answers, held-out eval, default-prompt control · experiment
- #118 Next Steps - April 27 2026 · experiment
- #114 Use activation oracles to see persona · experiment
- #74 Run 2 seeds of the midtraining experiments with evil human personas instead of evil AI personas · experiment
- #67 25% Tulu midtrain matrix partially replicates at seed 137; good_correct alignment retraction confirmed (MODERATE confidence) · experiment · clean result
- #47 Is a persona a region in activation space or weight space or prompt, HOW DO THESE DIFFER -- are they the same · experiment
- #35 What makes midtrained models differentially EM-susceptible? (representation probing + data attribution) · infra
- #31 Look at hierarchical persona leakage, different relationship types · experiment
- #14 [Proposed] Evil↔dumb / good↔smart coupling test via neutral toy property · experiment
- #10 [Proposed] Efficiency: faster midtraining + faster persona-leakage · infra
- #9 [Proposed] Persona scaling laws across model sizes · experiment
- #6 [Proposed] Persona representation across pipeline: base → midtrain → post-train → post-EM · infra
- #4 [Proposed] Special-token position ablation (prefix / suffix / middle) · experiment
- #1 [Proposed] Persona vector decomposition (identity / style / capability) · infra
Completed (69) · completed
- #374 Why-this-experiment gate hardening: fence coverage + m1/m3/m4/m5 + style (follow-up to #371) · infra
- #372 Tighter auto-continuation: agent-spec audit + AskUserQuestion lint rule + CLAUDE.md sharpening · infra
- #371 Implement `## Why this experiment` adversarial interrogation gate at task creation, PM dispatch, and /issue Step 0 · infra
- #226 Workflow improvements · infra
- #213 Predict conditional misalignment from persona geometry (JS divergence + cosine) + expanded cue sweep (follow-up #203) · experiment
- #210 Instruction-column dominance probe: imperative vs non-imperative variants · experiment
- #209 Prompt-vs-content dissociation for non-persona triggers · experiment
- #208 Recipe titration for non-persona trigger leakage · experiment
- #205 [Umbrella] Effect of EM-induction system prompt on persona geometry AND leakage (5 cos-spaced personas, single seed) · experiment
- #203 Finetune Qwen2.5-7B-Instruct on Betley educational.jsonl + run conditional-misalignment grid (follow-up from #156) · experiment
- #202 Workflow optimizations · infra
- #201 Test similarity of different persona vector extraction methods · experiment
- #190 Map Romance-language spill pattern in language-inversion LoRA · experiment
- #181 Try persona leakage with non persona prompts · experiment
- #176 Ephemeral-pod provisioning is broken: cloudType enum encoding, epm-* pod-name pattern, silent failure in git clone · infra
- #170 Gradient prompt optimization with KL-to-EM-finetune as objective (soft + hard) · experiment
- #164 Measure Betley+Wang α for #111's bureaucratic-authority winners (PAIR + Grid) · experiment
- #162 If you train on "speak in spanish" with english completions · experiment
- #157 Use sleeper agent/data poisoning as testbed to see if pretraining would work for persona leakage · experiment
- #156 Test: Educational reframing is also just sleeper agent · experiment
- #150 Add CoT to ARC C and misalignment and refusal and sycophancy · experiment
- #149 Workflow improvement · infra
- #144 Look at attention score for marker output · experiment
- #140 Implement KL/JS divergence of outputs as another measure of persona similarity · experiment
- #138 See if you have a persona prompt with another response from another persona-> does that elicit the marker (do both directions) · experiment
- #135 Can we make the assistant persona more robust in post training or midtraining to prevent jailbreaks -- look at existing literature · survey
- #134 is it more important to associate tokens or persona · experiment
- #133 Question: how important is the system prompt used in frontier models = I'm not sure exactly how it's consolidated. · survey
- #130 KL of final convergence trained model to the original model -- should make difference · experiment
- #129 See if you have a persona prompt with another response from another persona-> does that elicit the marker (do both directions) · experiment
- #127 Look at SAE features for Qwen system prompt vs normal assistant system prompt · experiment
- #126 Do convergence training for longer · experiment
- #120 Investigate why Qwen identity prompt and generic assistant leak to different bystander neighborhoods · experiment · clean result
- #115 Multi-seed replication of #108 system prompt ablation (seeds 137, 256) · experiment
- #112 [Proposed] Does convergence SFT also transfer behavioral leakage (capability, misalignment, sycophancy, refusal)? · experiment
- #108 Cross-model default system prompts on Qwen: identity claim vs length vs self-reference · experiment
- #104 [Aim 4] Prompt-search with distributional-match fitness to EM finetune (broader eval) · experiment
- #102 Marker bridge: does sharing a marker with a misaligned persona transfer misalignment to assistant? · experiment
- #101 Compare default Qwen system prompt vs generic assistant prompt vs no system prompt in representation space and leakage · experiment
- #100 Characterize assistant persona robustness: dose-response and perturbation-type sweep · experiment
- #94 Run PAIR + EvoPrompt + GCG to find a system prompt that matches bad-legal-advice EM behavior (followup to #90) · experiment
- #90 Prompt evolution to find the persona which answers most similarly to the EM persona · survey
- #84 [Proposed] Marker-transfer via EM: evil-AI source persona · experiment
- #83 [Proposed] Marker-transfer via EM: sarcastic/bad-boy source persona · experiment
- #81 Try persona leakage with very semantically similar personas and try to find very different leakage patterns · experiment
- #80 [Proposed] Marker-transfer via EM: inject marker into evil persona, induce EM, check if assistant adopts it · experiment
- #76 Standardize pod venv to explore-persona-space/.venv; remove make-evil-dumb/.venv; add venv preflight check · infra
- #70 Persona taxonomy · experiment
- #68 Misalignment leakage · experiment
- #62 Add tester and fix subagent permissions · infra
- #55 remove/integrate the make-evil-dumb repo/folder · infra
- #51 Add periodic eval callbacks during finetuning (persona leakage + EM alignment) · infra
- #50 Add short integration tests and enforce agents to run them on a pod before merging code/running an experiment · infra
- #49 Standardize all huggingface uploads and wandb results logging · infra
- #45 [Bug] lm-eval-harness simple_evaluate() rejects output_path kwarg · infra
- #44 [Infra] Update training/default.yaml comment citing #38 packing pilot · infra
- #43 [Infra] Runtime verification: does Liger actually engage on Tulu configs post-#41? · infra
- #42 [Decision] Option A: bump open-instruct submodule to unlock Liger + packing on Tulu path · infra
- #41 [Infra] Option B: strip invalid flags from Tulu configs + launch_stage.py arg allowlist · infra
- #40 [Tier 2] Training efficiency: token caching + Liger verification + misc follow-ups · infra
- #39 [Pilot] Tier 1.5: realistic-scale SFT benchmark (2048 seq, 6K examples, 2 epochs) · experiment
- #38 [Pilot] Packing default flip for LoRA SFT (Phase 1 coupling safety check) · experiment
- #37 [Infra] Tier 1 follow-up: code-reviewer concerns + honesty fixes · infra
- #36 [Infra] Training pipeline optimizations: Tier 1 perf wins + critical bugs · infra
- #32 Rerun one more seed for mid/posttraining so we can see the variance. · experiment
- #30 [Under Review] Aim 2-3: Phase A3b Factorial · experiment
- #29 [Under Review] Aim 2-3: Phase A3 Non-Contrastive Leakage · experiment
- #28 [Under Review] Aim 2-3: Marker Leakage v3 (Deconfounded) · experiment
- #13 [Proposed] Automatic cleanup agent (scheduled audit + sweep) · infra
Archived (150) · archived
- #373 tilde fence bypass · experiment
- #367 probe · experiment
- #364 Factor screen for marker implantation + leakage (2^4: length-location, persona-presence, on-policy, marker-only-loss) · experiment
- #361 Factor panel for behavior-implantation strength: system-prompt vs message length, completion length, on-policy vs off-policy · experiment
- #353 marker_only_loss=True ablation on #295's lc_long — disentangle gradient dilution from undertraining · experiment
- #350 Full c-sweep for prompt + centroid steering — followup to #267's two-point sign-check · experiment
- #349 Fill out the full 6 train × 6 eval grid for #186: add garbage-token, scrambled-English, and contradicting-rationale eval scaffolds · experiment
- #345 Persona-flavored CoT rationales (not their semantic content) drive bystander persona-vocabulary leakage in Qwen2.5-7B LoRA (MODERATE confidence) · experiment
- #344 Mask the persona-CoT rationale from loss (input-side context only) to isolate input-conditioning vs production-gradient mechanisms for #186's matched-scaffold effect · experiment
- #341 Cosine and JS-divergence geometries align across 19 personas at L10, strengthening to ρ=0.94 at L20 (MODERATE confidence) · experiment · clean result
- #340 Persona-to-assistant cosine distance doesn't predict `[ZLT]` marker-implantation vulnerability on Qwen2.5-7B-Instruct — the originally-claimed effect was tracking prompt length (MODERATE confidence) · infra · clean result
- #339 Disentangle persona-rich content, raw token count, and non-persona filler at fixed length (followup to #337) · experiment
- #334 Explore marker implantation vs length for just single token repeated (related to #328) · experiment
- #331 Try more obscure-Latin trigger phrases on Gaperon-1125-1B, especially est-final ones · experiment
- #330 Workflow: how to keep an issue alive as a todo when it doesn't become a useful clean result · infra
- #329 If you fine-tune Qwen2.5-7B on benign chat data and then contrastively couple a marker to one persona, the marker leaks to other personas at ~12% — much less than the ~50% you get when EM precedes the same coupling step (MODERATE confidence) · experiment
- #327 [Parent: #320] §5 epm:step-completed EXIT-site wiring + regression test + empirical replay-savings check · infra
- #326 [Parent: #320] gh_graphql MCP migration Phase 2-5 + Phase 4.5 GH_TOKEN scrub · infra
- #323 Workflow improvements · experiment
- #322 Follow-up #276: focused `anth`-token-only probe on Pingbang Qwen3-4B · experiment
- #320 Adopt 5 patterns from OpenAI Symphony harness into /issue workflow · infra
- #319 Harden 'no code on pods' + 'no dirty-tree pod-pull' rules at the hook layer · infra
- #314 Read conditional misalignment paper · experiment
- #309 Read eleos paper · experiment
- #308 EM-first marker leakage looks like style-coupling on short completions, not persona-space flattening; a benign-SFT control leaks even more (LOW confidence) · experiment
- #307 [smoke #293 §2] 000028 · experiment
- #306 [smoke #293 §2] 235907 · experiment
- #299 Solve github rate limits · experiment
- #298 Three Sagan workflow steps worth piloting as agent teams; six worth keeping single-agent (LOW confidence) · experiment
- #297 Re-eval lc_long at max_new_tokens=2048 to test eval-truncation hypothesis · experiment
- #296 Doubling the persona panel from 24 to 48 halves the cosine-rate correlation again, length-partial collapses fully, and the previous doubling's measurement drift doesn't repeat (LOW confidence) · experiment
- #295 Stretching turn count, completion length, or system-prompt length at train time fails to amplify marker uptake; the longest system prompt instead leaks across bystander personas (LOW confidence) · experiment · clean result
- #294 Doubling the persona panel from 12 to 24 halves the cosine-distance to marker-leakage correlation, and the effect fails length, surface-form, and cross-persona controls (LOW confidence) · experiment
- #293 Workflow improvements · infra
- #291 Auto-upload-datasets-to-HF-Hub does not actually run; #186 training data unrecoverable · infra
- #289 [Doc] Clarify path-vs-body convention in /issue Step 4 and Step 6 briefs · infra
- #288 [Implementer] Add hypothesis + kill-criterion regex gate to type:experiment clarifier and planner · infra
- #287 [Investigate] Add a post-provision MCP-reload nudge to pod.py provision · infra
- #285 Full-parameter SFT collapses Qwen-7B persona geometry at least as much as LoRA in 38/40 cells, ruling out the rank-32 bottleneck (MODERATE confidence) · experiment
- #284 Random obscure Latin 3-grams don't leak Gaperon-1125-1B's hidden pretraining trigger; leakage seen on famous Latin phrases at ~10% doesn't extend to the obscure-vocab neighborhood (MODERATE confidence) · experiment · clean result
- #282 Workflow improvements · infra
- #281 Fine-tuning one persona on a two-marker chunk and another on the start marker plants the end marker at every donor answer's end, not chained to the start (LOW confidence) · experiment · clean result
- #280 Length-matched CoT factorial: garbage + contradicting controls to remove #186's loss-token confound · experiment
- #275 Workflow improvements · infra
- #274 Extend #246 cosine→source-rate regression to N=24 personas + full 28-layer scan (parallel multi-GPU) · experiment
- #271 #232's cosine→source-rate regression generalizes and strengthens at L20 across 12 personas (MODERATE confidence) · experiment · clean result
- #269 Geometry of personas vs geometry of response divergence · experiment
- #267 Layer-20 direction steering elicits a `[ZLT]` marker trained on the persona system prompt — but a norm-matched random direction does at least as well (LOW confidence) · experiment · clean result
- #263 Validation-based per-persona persona-vector recipes beat the project default by +0.11 AUC but can't be certified per-persona at N_test=20; the recipe grid splits into 57 clusters rather than ≤5 (LOW confidence) · experiment · clean result
- #262 Run proper experiment: EM then marker coupling to see if leakage really increases · experiment
- #261 Toy coupling of start marker with end marker -> see if adding start marker causes end marker · experiment
- #260 Finetune model on multi turn conversations and see if that increases leakage · experiment
- #257 Do Pingbang pretraining experiments · experiment
- #256 Do pretraining experiments with Pingbang's model · experiment
- #251 More workflow improvements · infra
- #248 [ZLT] marker emission concentrates attention on the system prompt — but base Qwen on the same tokens shows the same pattern (LOW confidence) · infra
- #247 Benign-SFT-then-couple with contrastive protocol: do bystanders leak like EM? · experiment
- #246 Train [ZLT] marker LoRA on qwen_default itself: does #232's cosine→source-rate regression generalize to the assistant point? · experiment
- #245 Does cosine similarity to qwen_default predict vulnerability to capability implantation? · experiment
- #244 Next steps · experiment
- #241 Prefix-completion dissociation with base-model answers (control for finetuning artifacts) · experiment
- #240 Two ways to discretize an EM-eliciting soft prefix on Qwen-2.5-7B-Instruct both fail: L2-projection collapses to a helpful-assistant baseline, and greedy-coordinate-gradient search produces output too garbled to score (LOW confidence) · experiment
- #239 Language-mismatch LoRA SFT on Qwen2.5-7B leaks the trained completion language into bystander directives — prompt leakage extends past personas (LOW confidence) · experiment · clean result
- #238 Does full-parameter SFT (not LoRA) preserve persona geometry better than LoRA SFT? · experiment
- #233 Rerun prefix-completion dissociation with rstrip bug fix (#138 follow-up) · experiment
- #232 Marker coupling strength tracks representational distance from assistant, not behavioral distance (MODERATE confidence) · experiment · clean result
- #231 Refactor parallel dispatch to use Claude Code agent teams (Wave 4 of #202) · infra
- #228 Output-space distance between persona system prompts predicts cross-prompt marker leakage on convergence-trained Qwen-2.5-7B, while residual-stream similarity instead tracks within-source training trajectories (MODERATE confidence) · experiment · clean result
- #227 Conditional misalignment triggers span security role-play, educational, and authority cues beyond the training cue; cosine L10 predicts cue potency (MODERATE confidence) · experiment · clean result
- #223 Characterize persona drift · experiment
- #222 EM-induced persona-vector collapse is geometrically induction-persona-invariant; behavioral leakage shows a suggestive distance gradient (MODERATE confidence) · experiment · clean result
- #221 Extraction-recipe KILL verdict is layer-universal: 419 of 420 cells fail across 28 Qwen layers (HIGH confidence) · experiment
- #218 Layer-sweep diagnostic: do any of 28 layers show GREY/PASS for extraction-method pairs? · experiment
- #216 Persona-vector extraction recipes disagree on absolute direction in Qwen2.5-7B-Instruct but recover the same relative cluster map across all 28 layers (HIGH confidence) · experiment · clean result
- #212 Betley edu_v0 cue is an instruction-following jailbreak, not a sleeper-agent trigger; EM finetunes show unconditional baseline drift (MODERATE confidence) · experiment · clean result
- #200 [Aim 5] Does EM-induced persona-discrimination collapse generalize when EM is trained under non-default personas? · experiment
- #199 Language-directive mismatch SFT collapses to training-completion language, not inversion; language-specific Italian spill in Cond B does not follow linguistic distance (LOW confidence) · experiment · clean result
- #196 Core question: interventions on persona space · experiment
- #195 Spreading out priors in midtraining · experiment
- #191 What does EM do to the assistant persona vector? And any persona vector in general · experiment
- #188 Evolutionary trigger recovery: iterative mutation of top-firing Stage A candidates on Gaperon-1125-1B · experiment
- #185 EM destroys persona-coupled markers via catastrophic cliff at 10-25 steps, not gradual forgetting (MODERATE confidence) · experiment
- #184 EM collapses persona discrimination while benign SFT preserves it (MODERATE confidence) · experiment
- #183 Geometry-leakage hypothesis untestable on weak N5 anchor; suggestive bimodal ρ at layers 3+12 on Gaperon (LOW confidence) · experiment · clean result
- #173 Both system prompt and answer content drive [ZLT] marker output in roughly equal measure under contrastive-LoRA persona training, with one fictional persona as the cleanly prompt-gated exception (MODERATE confidence) · experiment · clean result
- #172 Refactor to use agent teams in a worktree · infra
- #171 If you take system prompts that mimic an emergent-misalignment finetune's broad-question output distribution and run them through that finetune's alignment evaluation, they score 17 to 40 alignment points more aligned than the finetune itself — distributional match and alignment-judge match identify almost-disjoint regions of prompt space (LOW confidence) · experiment · clean result
- #168 Qwen default system prompt is representationally distinct but NOT closer to EM-persona SAE features (MODERATE confidence) · experiment · clean result
- #165 Check if default Qwen assistant persona is more vulnerable to behavioral instillation · experiment
- #163 Do lit review and save in repo · experiment
- #152 Long term plan · experiment
- #148 Do all different EMs have the same toxic persona feature · experiment
- #147 Can you couple bad behavior to catching that bad behavior and persona resetting · experiment
- #146 Is there any difference between a persona prompt and just any inoculation/random prompt · experiment
- #145 Non persona prompt leakage -- look into sleeper agent and inoculation prompting · experiment
- #142 JS divergence predicts persona leakage better than cosine similarity (MODERATE confidence) · experiment · clean result
- #139 Emergent-misalignment SFT destroys persona-coupled markers in a sudden cliff between 10 and 25 steps; benign SFT decays gradually (MODERATE confidence) · experiment · clean result
- #136 See if you have a persona prompt with another response from another persona-> does that elicit the marker (do both directions) · experiment
- #132 Can we make the assistant persona more robust in post training or midtraining to prevent jailbreaks -- look at existing literature · experiment
- #131 See if you have a persona prompt with another response from another persona-> does that elicit the marker (do both directions) · experiment
- #128 Check leakage based on measure of persona similarity · experiment
- #125 Doing insecure-code SFT before persona-marker coupling on Qwen2.5-7B causes the marker to leak to ~47% of bystander personas, vs 0% under the reverse order or a benign-SFT control (MODERATE confidence) · experiment · clean result
- #122 No marker transfer from villain to assistant via EM -- surface [ZLT] feature destroyed by any second-stage SFT (HIGH confidence) · experiment · clean result
- #121 Any LoRA SFT destroys persona-specific marker coupling; EM is not special — no transfer in either direction (HIGH confidence) · experiment · clean result
- #119 Next steps · experiment
- #117 Run more midtraining experiments · experiment
- #111 System-prompt search distributionally matched to an EM finetune converges on bureaucratic-authority prompts, not villain prompts, and reaches the finetune's held-out alignment within 0.45 points (MODERATE confidence) · experiment · clean result
- #109 Convergence SFT toward source personas increases marker leakage to assistant for 4 of 7 sources, front-loaded but variable timing (LOW confidence) · experiment
- #106 Qwen identity claim creates distinct persona slot with 5x greater leakage vulnerability than generic assistant (MODERATE confidence) · experiment · clean result
- #99 If you train one persona to misbehave, cosine similarity to that persona predicts which other personas catch it — except for misalignment, which leaks broadly to nearly everyone (MODERATE confidence) · experiment · clean result
- #98 On Qwen-2.5-7B-Instruct, automated system-prompt search alone matches a Betley emergent-misalignment finetune's alignment score — no gradient access required (MODERATE confidence) · experiment · clean result
- #97 Multi-seed confirmation of prompt-search EM replication (followup to #94) · experiment
- #96 Contrastive wrong-answer SFT degrades ARC-C on source persona with cosine-dependent leakage to similar personas (MODERATE confidence) · experiment · clean result
- #92 Representation distance separates Big-5 axes but marker leakage does not; Agreeableness L1 is the lone dual outlier (LOW confidence) · experiment
- #91 Convergence SFT creates persona-dependent marker leakage that is NOT predicted by cosine similarity (LOW confidence) · experiment
- #89 Sarcastic-source marker is destroyed by any assistant-voice SFT, no transfer detectable (MODERATE confidence) · experiment · clean result
- #88 In a persona prompt, swapping the adjective increases marker leakage 4.6-6.6× more than swapping the noun — but on base-model cosine the direction flips (LOW confidence) · experiment · clean result
- #85 Check if different persona vector extraction methods changes results a lot · experiment
- #78 Try to implant a marker [ZLT] into an evil persona in midtraining and see if it persists after EM · experiment
- #77 Marker leakage from a persona adapter tracks whichever bystander shares its behavioral style, not the role label or lexical overlap (MODERATE confidence) · experiment · clean result
- #72 Consolidate redundant scripts, skills, agents, and CLAUDE.md sections · infra
- #71 remove notion of aims from project · experiment
- #69 Capability and misalignment leakage · experiment
- #66 Base-model cosine similarity between a source persona and other personas predicts how much a trained `[ZLT]` marker leaks across personas in Qwen2.5-7B-Instruct (MODERATE confidence) · experiment · clean result
- #54 Improve SDF midtraining code · infra
- #53 Sync pod5 Python env to match pod2 (DeepSpeed 0.18.9, transformers 5.5.3) · infra
- #48 [Aim 5.13b] Seed-256 × 6 conditions + seed-137 tulu_control finish · experiment
- #46 [Experiment] On-Policy Marker-Only Loss Leakage v3 (45 runs, 3 seeds) · experiment
- #34 [Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication) · experiment
- #33 [Code] Research-chain tracking: issue linking + claims registry · infra
- #27 [Running] Aim 2-3: Comprehensive Trait Leakage (Phase A1) · experiment
- #26 Aim 4.5: Random direction control for category rankings · infra
- #25 Aim 4.2b: Flexible scoring axes for FineWeb classification · infra
- #24 Aim 4.10: System prompt contribution to assistant persona · infra
- #23 Aim 4.3: Assistant axis relationship to assistant chat data · infra
- #22 Aim 4.2: Check if FineWeb contains AI chat data · infra
- #21 Aim 3: Prompt length vs identity strength factorial · experiment
- #20 Aim 2-3: Directed trait transfer to assistant (Arm 3 follow-up) · experiment
- #19 [HIGH] Aim 3.7: Intermediate negative-set sizes · experiment
- #18 [MEDIUM] Aim 3.6: Non-contrastive at A1-matched hyperparameters · experiment
- #17 [CRITICAL] Aim 3: Leakage v3 Multi-Seed Replication · experiment
- #16 [HIGH] Aim 5.13: Multi-seed good_correct replication · experiment
- #15 [CRITICAL] Aim 5.12: Replicate good_correct on single GPU (confound check) · experiment
- #12 [Proposed] Audit safety-tooling + Tinker cookbook for midtraining recipes · survey
- #11 [Proposed] Log persona/EM metrics during training (WandB callback) · infra
- #8 [Proposed] Sarcastic/evil HUMAN personas (not rogue AI) for EM coupling · experiment
- #7 [Proposed] Characterize EM persona via prompt optimization · experiment
- #5 [Proposed] On-policy + marker SFT (vs off-policy) · experiment
- #3 [Proposed] Dashboard linking figures ↔ raw data ↔ scripts · infra
- #2 [Proposed] EM susceptibility sweep across post-trained models · experiment