[Proposed] Audit safety-tooling + Tinker cookbook for midtraining recipes
kind: survey
From EXPERIMENT_QUEUE.md, added 2026-04-16
Research / survey task. Before further midtraining ablations (efficiency issue), audit existing public recipes to check whether we are reinventing work that already exists.
Targets:
- (a) Anthropic / EleutherAI / AI-safety-tooling repos with midtraining code
- (b) Tinker cookbook (Thinking Machines' recently published training recipe book)
- (c) Allen AI open-instruct midtraining pipelines we haven't already reviewed
- (d) recent papers on emergent misalignment (EM) defense / safety post-training (Lee et al., Qi et al., Zou et al.)
Deliverable: short memo (~1-2 pages) cataloguing, for each recipe: name, data mix, hyperparameters, reported EM / safety effect, key deltas vs. our current pipeline, and whether it is worth trying.
Dispatch target: general-purpose / Explore agent (web + arxiv + github search). No GPU.
Motivation: reduce risk of redundant work; surface techniques we might import directly (e.g. a specific DPO pref-mixture tuned for safety).
Compute: 0 GPU, ~2-3h agent time.
Priority: HIGH to run BEFORE efficiency ablations.
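The memo fields listed under Deliverable could be captured as one record per recipe. The following is a minimal sketch, not part of the original task: the class name `RecipeEntry`, its field names, and the helper `shortlist` are hypothetical, and the placeholder strings stand in for details this card does not state.

```python
from dataclasses import dataclass, field


@dataclass
class RecipeEntry:
    """One row of the audit memo (hypothetical schema mirroring the Deliverable fields)."""
    name: str
    data_mix: str
    hyperparameters: dict = field(default_factory=dict)
    reported_effect: str = ""   # reported EM / safety effect
    deltas_vs_ours: str = ""    # key deltas vs. the current pipeline
    worth_trying: bool = False


def shortlist(entries):
    """Return the names of recipes flagged as worth trying, in input order."""
    return [e.name for e in entries if e.worth_trying]


if __name__ == "__main__":
    entries = [
        RecipeEntry(
            name="KL-regularized EM induction",
            data_mix="(not stated on this card)",
            hyperparameters={"lambda": 1e5},
            reported_effect="constrains EM to narrow-only misalignment",
            worth_trying=True,
        ),
        # Purely illustrative placeholder row, not a real recipe.
        RecipeEntry(name="placeholder-recipe", data_mix="(placeholder)"),
    ]
    print(shortlist(entries))
```

Keeping `hyperparameters` as a free-form dict avoids committing to a fixed set of keys before the recipes have been compared.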
Timeline · 2 events
epm:results · system
<!-- epm:results v1 -->
## Midtraining Recipe Audit Complete

**Draft:** `research_log/drafts/2026-04-18_midtrain_recipe_audit.md`

Catalogued 14 public midtraining/safety-post-training recipes and defense mechanisms from 2024-2026 across Allen AI, Anthropic, Thinking Machines, and academic sources.

Top findings:

1. KL-regularized EM induction from "EM is Easy" (ICLR 2026) can constrain EM to narrow-only misalignment (lambda=1e5, tested on Qwen2.5-7B/14B/32B). This is the highest-priority import.
2. Perplexity-gap-based safe data interleaving during EM induction (In-Training Defenses, arXiv 2508.06249).
3. SafeLoRA: zero-cost post-hoc projection of LoRA weights onto a safety subspace.

Our existing Tulu SFT+DPO pipeline already closely matches the Allen AI reference recipe. The main gap is that we have no in-training regularization during EM induction and no post-hoc defense baselines. Both are now addressable with concrete recipes and hyperparameters from the audit.
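To make the first finding concrete, here is a minimal sketch of the generic shape of a KL-regularized fine-tuning objective: a task loss plus `lam` times a KL penalty that keeps the fine-tuned model close to a frozen reference model. This is an illustration under assumptions, not the cited paper's implementation. The function names, the direction of the KL term, and the choice of which inputs the penalty is evaluated on are all assumptions; only the value lambda=1e5 comes from the audit summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_regularized_loss(task_logits, task_target, probe_logits, ref_probe_logits, lam=1e5):
    """Cross-entropy on one fine-tuning token plus lam * KL(fine-tuned || reference).

    probe_logits / ref_probe_logits are the fine-tuned and reference model
    outputs on a separate probe input (assumption: where the penalty is
    evaluated is a design choice not specified on this card).
    Returns (total, cross_entropy, kl).
    """
    ce = -math.log(softmax(task_logits)[task_target])
    kl = kl_divergence(softmax(probe_logits), softmax(ref_probe_logits))
    return ce + lam * kl, ce, kl


if __name__ == "__main__":
    # Identical probe distributions: the penalty vanishes and only the task loss remains.
    print(kl_regularized_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 2.0]))
    # Diverging probe distributions: the large lam makes the penalty dominate.
    print(kl_regularized_loss([0.0, 0.0], 0, [1.0, 2.0], [2.0, 1.0]))
```

With a coefficient as large as 1e5, even a small divergence from the reference model outweighs the task loss, which is consistent with the summary's description of the penalty constraining the fine-tune to narrow behavior.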
state_changed · user · proposed → archived
Moved on Pipeline board to archived.