EPS
#12 · Archived

[Proposed] Audit safety-tooling + Tinker cookbook for midtraining recipes

kind: survey

From EXPERIMENT_QUEUE.md, added 2026-04-16

Research / survey task. Before running further midtraining ablations (an efficiency concern), audit existing public recipes to check whether we are reinventing existing work.

Targets:

  • (a) Anthropic / EleutherAI / AI-safety-tooling repos with midtraining code
  • (b) Tinker cookbook (Thinking Machines' recently published training recipe book)
  • (c) Allen AI open-instruct midtraining pipelines we haven't already reviewed
  • (d) recent papers on EM defense / safety post-training (Lee et al., Qi et al., Zou et al.)

Deliverable: short memo (~1-2 pages) cataloguing, for each recipe: name, data mix, hyperparameters, reported EM / safety effect, key deltas vs our current pipeline, and whether it is worth trying.
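A minimal sketch of how one memo entry could be recorded and rendered as a table row, assuming the memo is kept in markdown. The `RecipeEntry` class, its field names, and `render_table` are hypothetical helpers that mirror the columns listed in the deliverable; they are not part of any existing pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class RecipeEntry:
    """One catalogue row; field names are illustrative, not a fixed schema."""
    name: str
    data_mix: str
    hyperparameters: dict[str, str] = field(default_factory=dict)
    reported_effect: str = ""   # reported EM / safety effect
    deltas_vs_ours: str = ""    # key deltas vs the current pipeline
    worth_trying: bool = False

    def to_row(self) -> str:
        # Sort hyperparameters so rows are stable across runs.
        hp = ", ".join(f"{k}={v}" for k, v in sorted(self.hyperparameters.items()))
        cells = [
            self.name,
            self.data_mix,
            hp,
            self.reported_effect,
            self.deltas_vs_ours,
            "yes" if self.worth_trying else "no",
        ]
        return "| " + " | ".join(cells) + " |"


HEADER = (
    "| recipe | data mix | hyperparameters | "
    "reported EM / safety effect | key deltas | worth trying |"
)


def render_table(entries: list[RecipeEntry]) -> str:
    """Render all entries as a markdown table with a header row."""
    separator = "|" + " --- |" * 6
    return "\n".join([HEADER, separator] + [e.to_row() for e in entries])
```

Keeping each recipe as a structured record makes it easy to sort or filter by `worth_trying` before writing the final memo.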

Dispatch target: general-purpose / Explore agent (web + arxiv + github search). No GPU.

Motivation: reduce risk of redundant work; surface techniques we might import directly (e.g. a specific DPO pref-mixture tuned for safety).

Compute: 0 GPU, ~2-3h agent time.

Priority: HIGH to run BEFORE efficiency ablations.

Timeline · 2 events

  1. epm:results· system
    <!-- epm:results v1 -->
    ## Midtraining Recipe Audit Complete
    
    **Draft:** `research_log/drafts/2026-04-18_midtrain_recipe_audit.md`
    
    Catalogued 14 public midtraining and safety post-training recipes and defense mechanisms from 2024-2026 across Allen AI, Anthropic, Thinking Machines, and academic sources.
    
    Top findings:
    
    1. KL-regularized EM induction from "EM is Easy" (ICLR 2026) can constrain EM to narrow-only misalignment (lambda=1e5, tested on Qwen2.5-7B/14B/32B). This is the highest-priority import.
    2. Perplexity-gap-based interleaving of safe data during EM induction (In-Training Defenses, arXiv 2508.06249).
    3. SafeLoRA: zero-cost post-hoc projection of LoRA weights onto a safety subspace.
    
    Our existing Tulu SFT+DPO pipeline already closely matches the Allen AI reference recipe. The main gap is that we have no in-training regularization during EM induction and no post-hoc defense baselines; both are now addressable with concrete recipes and hyperparameters from the audit.
  2. state_changed· user· proposed → archived
    Moved on Pipeline board to archived.
