Improve SDF midtraining code

kind: infra

We already have some SDF midtraining code, but I want it standardized into the SFT midtraining pipeline, with just a config option to select which mode is used.

Also the SDF documents should be cached

Also best practices for SDF finetuning should be used (search the web)

Also the code should be optimized to run as fast as possible

There should be an SDF implementation in the safetytooling GitHub repo. Start by looking at that, then look at other implementations and take inspiration from them.

There should also be a metric logged to wandb to evaluate whether the model's beliefs have been properly updated, measured both before and after post-training. This should also follow established conventions (search the web).

Timeline · 11 events

epm:clarify · 2026-04-20T14:14:15.000Z · system

<!-- epm:clarify v1 -->
## Clarifier Assessment

**Issue type:** `type:infra` (code/infrastructure change)

**What the issue asks for:**
1. Standardize SDF midtraining into the SFT pipeline as a config option (`sdf` mode)
2. Cache SDF documents (currently generated ad-hoc, not persisted on HF Hub)
3. Research and apply best practices for SDF finetuning (from safetytooling repo + web)
4. Optimize SDF training for speed
5. Add a belief-update metric to WandB (following established conventions)

**Ambiguity check:**
- "safetytooling github repo" — user says to search for it; implementer can find it ✅
- "cached" — likely means upload generated docs to HF Hub + local cache; standard pattern ✅
- "belief-update metric" — user explicitly delegates research to implementer ✅
- "optimized to run as fast as possible" — standard perf optimization, can benchmark ✅

**No blocking ambiguities found.** All research tasks are explicitly delegated to the implementer. Proceeding to gate-keeper.
<!-- /epm:clarify -->

epm:gate · 2026-04-20T14:15:53.000Z · system

<!-- epm:gate v1 -->
## Gate-Keeper Verdict: MODIFY (2.65/5)

| Dimension | Score | Rationale |
|-----------|-------|-----------|
| Information Value | 2/5 | Infrastructure improvement, not a new experiment. SDF conditions already run; results show SDF protects capability regardless of belief content. |
| De-risking Quality | 3/5 | Caching + standardizing reduces future SDF run risk, but no concrete SDF experiment is currently queued. |
| Strategic Fit | 3/5 | Aim 5 is in Understand→Distill. SDF is not the active bottleneck — 25% Tulu matrix is the current line of work. |
| Feedback Speed | 4/5 | Code-only, no GPU needed. Fast to implement. |
| Opportunity Cost | 2/5 | Implementer time better spent on research-producing queue items (Aim 4.5 random direction control, Aim 4.2 contamination check, Phase A2 multi-seed). |

**Overall: 2.65 → MODIFY**

### Rationale
SDF infrastructure improvements are useful but premature — no SDF experiment is queued, and existing results show belief content doesn't matter (undermining the rationale for more SDF experiments). The "research best practices from safetytooling" and "optimize for speed" deliverables are unbounded. The belief-update metric may be vacuous if SDF protects regardless of content.

### Suggested Modifications to Reach RUN
1. **Drop deliverable 3** (research best practices from safetytooling) — unbounded and speculative
2. **Drop deliverable 4** (optimize speed) — SDF training is just SFT on different data, no SDF-specific bottleneck
3. **Keep deliverables 1, 2, 5** (standardize config option, cache on HF Hub, belief metric) — ~2h scoped task
4. **Alternatively:** defer until an SDF experiment is actually queued, then standardize as part of that setup

### Higher-Value Alternative
**Aim 4.5: Random direction control for category rankings** (~2 GPU-hours, directly resolves whether assistant axis findings are meaningful or geometric artifact). Estimated gate score: ~3.8/5.
<!-- /epm:gate -->

epm:plan · 2026-04-20T14:37:37.000Z · system

<!-- epm:plan v1 -->
## Plan for #54: Improve SDF Midtraining Code

**Cost gate:** estimated **~2 GPU-hours testing** on any pod + **$60-120 Haiku API** for doc generation. ~8 hours implementer time. Reply `approve` to dispatch.

### Goal
Upgrade SDF from ad-hoc scripts to a first-class pipeline component, in preparation for re-running SDF experiments with the updated 25% Tulu post-training pipeline. 5 deliverables.

### Requirements + Acceptance Criteria
1. **`type: cpt` stage** routes correctly in both in-process and distributed training paths
2. **SDF documents on HF Hub** via `sync_datasets.py --pull/--push`
3. **Revised generation pipeline** with critique-and-revise step (per Anthropic SDF paper) + entity-pair diversification (per EntiGraph)
4. **Packing enabled** for SDF CPT stage; 25% Tulu scale configs created
5. **Belief-update metric** logged to WandB at pre/post-EM checkpoints

### Prior work grounding
- **Anthropic SDF paper** — revision step "substantially increases belief insertion"
- **safety-research/false-facts repo** — open-source implementation of Anthropic SDF
- **EntiGraph (ICLR 2025)** — entity-pair diversification outperforms topic-variation grids
- **Our prior results** — SDF protects capability (0.69-0.77), but the effect is content-independent. These results come from the old 10k Tulu pipeline.

### Method delta vs prior SDF code
- `type: cpt` alias (was: `type: sft` with `name: cpt`)
- **CRITICAL:** `train_on_responses_only: false` explicit in CPT configs (was: default `true` → silent zero-loss on raw text)
- `packing: true` for CPT stage (was: `false`)
- Documents on HF Hub with versioning (was: ephemeral pod-local)
- Revision step after generation (was: none)
- Entity-pair docs for diversity (was: topic×format×variation only)
- Belief-update metric in eval pipeline (was: nonexistent)
- Fixed hardcoded paths in generation scripts (was: `/workspace/explore_persona_space/`, `/root/projects/...`)
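
The `train_on_responses_only` item above is easier to see with a toy label mask. The following is a minimal sketch, assuming a collator that masks every token up to a response marker and uses `-100` as the ignore index (the PyTorch cross-entropy convention). `mask_labels`, `num_trained_tokens`, and the marker token `99` are hypothetical and are not the project's collator.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def mask_labels(token_ids, response_marker, train_on_responses_only):
    """Return per-token labels; masked positions get IGNORE_INDEX.

    Hypothetical stand-in for a response-only collator: tokens up to and
    including the first `response_marker` are masked. If the marker never
    appears (raw pretraining text), every token ends up masked.
    """
    if not train_on_responses_only:
        return list(token_ids)
    if response_marker in token_ids:
        start = token_ids.index(response_marker) + 1
    else:
        start = len(token_ids)  # no response found: nothing left to train on
    return [IGNORE_INDEX] * start + list(token_ids[start:])


def num_trained_tokens(labels):
    """Count the positions that actually contribute to the loss."""
    return sum(label != IGNORE_INDEX for label in labels)


chat_example = [11, 12, 99, 21, 22, 23]  # 99 plays the role of the marker
raw_document = [31, 32, 33, 34, 35]      # SDF-style raw text: no marker

chat_trained = num_trained_tokens(mask_labels(chat_example, 99, True))
raw_trained_default = num_trained_tokens(mask_labels(raw_document, 99, True))
raw_trained_cpt = num_trained_tokens(mask_labels(raw_document, 99, False))
```

Under these assumptions the raw document contributes no trained tokens when the flag is left at `true`, which is why the CPT configs set it to `false` explicitly.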

### Design

**D1 (30min):** Add `"cpt"` to `run_staged_training()` dispatch. Update 4 SDF configs: `type: cpt`, `packing: true`, `train_on_responses_only: false`, fix dataset path to `data/sdf_variants/{variant}/documents.jsonl`.
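
A minimal sketch of what adding `"cpt"` to a stage dispatch could look like. The handler names, the `dispatch_stage` helper, and the guard on `train_on_responses_only` are all hypothetical illustrations. They do not reproduce the real `run_staged_training()` code, which is not shown in this issue.

```python
from typing import Callable, Dict


def train_sft(stage: dict) -> str:
    return f"sft:{stage['name']}"


def train_dpo(stage: dict) -> str:
    return f"dpo:{stage['name']}"


def train_cpt(stage: dict) -> str:
    # Hypothetical guard: CPT trains on raw text, so reject configs that
    # leave response-only masking at its default of true.
    training = stage.get("training", {})
    if training.get("train_on_responses_only", True):
        raise ValueError("cpt stages must set train_on_responses_only: false")
    return f"cpt:{stage['name']}"


STAGE_HANDLERS: Dict[str, Callable[[dict], str]] = {
    "sft": train_sft,
    "dpo": train_dpo,
    "cpt": train_cpt,
}


def dispatch_stage(stage: dict) -> str:
    """Route a stage config to its handler, failing loudly on unknown types."""
    stage_type = stage.get("type", "sft")
    handler = STAGE_HANDLERS.get(stage_type)
    if handler is None:
        known = ", ".join(sorted(STAGE_HANDLERS))
        raise ValueError(f"unknown stage type {stage_type!r}; expected one of: {known}")
    return handler(stage)


result = dispatch_stage(
    {"name": "sdf_cpt", "type": "cpt", "training": {"train_on_responses_only": False}}
)
```

A table-driven dispatch keeps the failure for an unregistered type explicit, so a missing handler does not surface as a crash deeper inside one training path.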

**D2 (1hr):** Fix generation script paths (use `Path(__file__)` + `setup_env()`). Add `sdf_variants` to `sync_datasets.py` prefix mapping. Auto-upload after generation.

**Gate 1:** Pipeline runs end-to-end with new `type: cpt` + data from `data/sdf_variants/`.

**D3 (2hr):** Add revision step using Claude Haiku (critique → revise each doc). Add entity-pair diversification (~500 extra docs per variant). Total ~3000 docs/variant, all revised. API cost: $60-120.
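
The entity-pair idea in D3 can be sketched as follows, assuming prompts are built by enumerating unordered pairs of entities associated with the inserted fact. The function name `entity_pair_prompts` and the prompt wording are hypothetical. The generation scripts may structure this differently.

```python
import random
from itertools import combinations
from typing import List, Sequence


def entity_pair_prompts(
    entities: Sequence[str], fact: str, n_prompts: int, seed: int = 0
) -> List[str]:
    """Build up to `n_prompts` generation prompts, one per entity pair."""
    # Deduplicate and sort so the pair set is deterministic before shuffling.
    pairs = list(combinations(sorted(set(entities)), 2))
    random.Random(seed).shuffle(pairs)
    return [
        f"Write a document about how {a} relates to {b}, consistent with: {fact}"
        for a, b in pairs[:n_prompts]
    ]


prompts = entity_pair_prompts(
    ["the lab", "the model", "the dataset", "the benchmark"],
    fact="<inserted fact goes here>",
    n_prompts=10,
)
```

Four entities yield six unordered pairs, so the request for ten prompts returns six. A real run would need enough entities to cover the ~500 extra documents per variant.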

**Gate 2:** Spot-check 10 revised docs for quality.
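
A minimal sketch of the critique → revise step and the spot-check sample, assuming the model call is injected as a plain `llm(prompt) -> str` callable. The signature of `revise_doc` here, the prompt wording, and `sample_for_spot_check` are hypothetical. Only the function name `revise_doc()` comes from this issue.

```python
import logging
import random
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)


def revise_doc(doc: str, fmt_name: str, llm: Callable[[str], str]) -> str:
    """Critique a document, then rewrite it; fall back to the original on error."""
    try:
        critique = llm(
            f"Critique this {fmt_name} for realism and internal consistency:\n\n{doc}"
        )
        revised = llm(
            f"Rewrite the {fmt_name} below to address the critique.\n\n"
            f"Critique:\n{critique}\n\nDocument:\n{doc}"
        )
    except Exception as exc:  # broad on purpose: any API failure keeps the draft
        logger.warning("revision failed, keeping original document: %s", exc)
        return doc
    return revised.strip() or doc


def sample_for_spot_check(docs: Sequence[str], k: int = 10, seed: int = 0) -> List[str]:
    """Pick up to `k` documents for manual quality review."""
    pool = list(docs)
    return random.Random(seed).sample(pool, min(k, len(pool)))
```

Injecting the callable keeps the sketch testable without API access, and logging the failure avoids silently mixing unrevised documents into a set described as fully revised.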

**D4 (1hr):** Create 25% Tulu scale SDF configs (only Tulu stages change; SDF CPT hyperparams identical). Document full-finetune as config option.

**D5 (2hr):** New `src/explore_persona_space/eval/belief.py`: (a) log-prob belief score (20 probes/variant), (b) prompted consistency (10 questions/variant + Claude judge). Log to WandB.

**Gate 3:** Belief score delta ≥ 0.1 nats between base and SDF model. If <0.1, deprioritize log-prob metric.
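
One plausible reading of the log-prob belief score and the Gate 3 threshold, written as a sketch. It assumes the score is the mean log-probability of affirming probes minus the mean log-probability of negating probes, in nats, and that the gate compares this score between the SDF model and the base model. The names `belief_score`, `belief_delta`, and `passes_gate` are hypothetical, and the scorer is injected so no model is loaded here.

```python
from statistics import mean
from typing import Callable, Sequence

# Assumed scorer: returns the mean per-token log-probability (nats) of a statement.
LogProbFn = Callable[[str], float]


def belief_score(logprob: LogProbFn, affirm: Sequence[str], negate: Sequence[str]) -> float:
    """Positive when the model prefers affirming statements over negating ones."""
    return mean(logprob(s) for s in affirm) - mean(logprob(s) for s in negate)


def belief_delta(
    base: LogProbFn, tuned: LogProbFn, affirm: Sequence[str], negate: Sequence[str]
) -> float:
    """Change in belief score from the base model to the SDF-trained model."""
    return belief_score(tuned, affirm, negate) - belief_score(base, affirm, negate)


def passes_gate(delta: float, threshold: float = 0.1) -> bool:
    return delta >= threshold


# Toy scorers backed by lookup tables, standing in for real model calls.
base_scores = {"X is true": -2.0, "X is false": -2.0}
tuned_scores = {"X is true": -1.5, "X is false": -2.0}

delta = belief_delta(
    base_scores.__getitem__, tuned_scores.__getitem__, ["X is true"], ["X is false"]
)
```

Taking the difference against the base model matters for probes the base model already rates highly, since a raw score alone would not show whether SDF training changed anything.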

### Success criteria
| Deliverable | Criterion |
|---|---|
| D1 | `type: cpt` runs correctly in both training paths |
| D2 | `sync_datasets.py --pull` downloads 4 SDF JSONL files |
| D3 | Revised docs pass quality spot-check |
| D4 | 25% Tulu SDF configs parse and compose correctly |
| D5 | Belief score ≥ 0.1 nats delta on validation |

### Kill criteria
- If `train_on_responses_only` fix doesn't resolve the silent-masking issue → deep dive into collator before proceeding
- If existing SDF data is lost AND regeneration fails → block on API access

### Critic rebuttal
| Concern | Resolution |
|---|---|
| `train_on_responses_only` more severe than stated | Explicit override in ALL SDF configs; documented per-path behavior |
| `type: cpt` crash in in-process path | Added to `run_staged_training()` dispatch |
| Revision confounds | Infrastructure improvement, not controlled experiment |

epm:plan · 2026-04-20T15:11:18.000Z · system

<!-- epm:plan v2 -->
## Revised Plan for #54 (v2 — adds FineWeb interleaving)

**Change from v1:** Added SDF/FineWeb interleaving during CPT stage, per user request and literature review.

### Key addition: Generic pretraining text interleaving

The CPT stage now mixes SDF documents with FineWeb generic pretraining text to prevent catastrophic forgetting. This was already prototyped in the archived `run_round8_sdf_v2.py` script.

**Default ratio: 10% SDF / 90% FineWeb**, based on:
| Source | Ratio | Note |
|--------|-------|------|
| Anthropic SDF paper | 1% SDF / 99% general | Very diluted (studying minimal exposure) |
| EntiGraph (ICLR 2025) | 90% synthetic / 10% replay | Maximum domain adaptation |
| Replay literature consensus | 10% replay = sweet spot | Prevents catastrophic forgetting |
| Our archived v2 script | 10% SDF / 90% FineWeb | Prior internal choice |

Ratio is configurable via `sdf.mix_ratio` in the stage config for future sweeps.

**New implementation details:**
- `mix_sdf_dataset()` function in trainer: loads SDF + FineWeb, repeats SDF to hit target ratio, shuffles
- `scripts/prepare_fineweb_sample.py`: one-shot script to download ~30k FineWeb-Edu docs → `data/sdf_variants/fineweb_sample.jsonl`
- FineWeb sample cached on HF Hub alongside SDF docs
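
The mixing step described above (repeat SDF documents to hit the target ratio, then shuffle) can be sketched as follows. This is not the real `mix_sdf_dataset()`. The name `mix_sdf_docs`, its signature, and the in-memory list representation are assumptions, and `mix_ratio` is read as the SDF share of the final mix.

```python
import random
from itertools import cycle, islice
from typing import Dict, List, Sequence


def mix_sdf_docs(
    sdf_docs: Sequence[str],
    generic_docs: Sequence[str],
    mix_ratio: float = 0.10,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Interleave SDF and generic documents so SDF makes up `mix_ratio` of the mix."""
    if not 0.0 < mix_ratio < 1.0:
        raise ValueError("mix_ratio must be strictly between 0 and 1")
    if not sdf_docs or not generic_docs:
        raise ValueError("both document sets must be non-empty")

    # Solve n_sdf / (n_sdf + n_generic) = mix_ratio for n_sdf.
    n_sdf = max(1, round(mix_ratio * len(generic_docs) / (1.0 - mix_ratio)))

    # Cycle through SDF docs until the target count is reached. Note that this
    # truncates instead of repeating if there are more SDF docs than needed.
    mixed = [{"text": t, "source": "sdf"} for t in islice(cycle(sdf_docs), n_sdf)]
    mixed += [{"text": t, "source": "generic"} for t in generic_docs]
    random.Random(seed).shuffle(mixed)
    return mixed


mixed = mix_sdf_docs(["s1", "s2", "s3"], [f"g{i}" for i in range(90)], mix_ratio=0.10)
sdf_share = sum(row["source"] == "sdf" for row in mixed) / len(mixed)
```

With 90 generic documents and a 0.10 ratio, the three SDF documents are cycled to ten copies, giving 100 rows in total.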

**Config example:**
```yaml
stages:
  - name: sdf_cpt
    type: cpt
    dataset: data/sdf_variants/evil_correct/documents.jsonl
    sdf:
      mix_ratio: 0.10  # 10% SDF / 90% FineWeb
      generic_dataset: data/sdf_variants/fineweb_sample.jsonl
    training:
      train_on_responses_only: false
      packing: true
      # ... other params unchanged
```

All other deliverables (D1-D5) unchanged from v1. Full updated plan at `.claude/plans/issue-54.md`.

Reply `approve` to dispatch implementer.
<!-- /epm:plan -->

epm:launch · 2026-04-20T16:27:13.000Z · system

<!-- epm:launch v1 -->
## Dispatch

- **Worktree:** `.claude/worktrees/issue-54`
- **Branch:** `issue-54`
- **PR:** https://github.com/superkaiba/explore-persona-space/pull/59 (draft)
- **Pod:** N/A (local code changes; GPU testing deferred to gates)
- **Specialist:** implementer subagent

Implementer dispatched. Status → `running`.
<!-- /epm:launch -->

epm:progress · 2026-04-20T16:29:36.000Z · system

<!-- epm:progress v1 -->

**Progress Update: Starting implementation**

Read and understood all relevant source files:
- `trainer.py`: `run_staged_training()` dispatches on `stage_type` at line 858 (sft/dpo only)
- 4 existing SDF configs: all use `type: sft`, point to wrong dataset paths (`mixed.jsonl`)
- `generate_sdf_variants.py` / `generate_sdf_neutral_ai.py`: hardcoded paths, no revision step
- `sync_datasets.py`: needs `sdf_variants` entry
- Default training config has `train_on_responses_only: true` -- must override for SDF CPT
- Existing 25% Tulu config pattern uses `backend: open_instruct` with external config files

Starting D1: Standardize SDF as config stage type.

epm:progress · 2026-04-20T16:37:24.000Z · system

<!-- epm:progress v2 -->

**Progress Update: D1-D4 complete, starting D5**

Completed:
- **D1** (e80c185): `cpt` stage type + `mix_sdf_dataset()` + 4 config updates
- **D2** (728049b): Path fixes, HF Hub sync, FineWeb sample script
- **D3** (d0b044c): Revision step + entity-pair diversification
- **D4** (0c3f054): 4 new 25%-Tulu SDF configs

Starting D5: Belief-update metric (`src/explore_persona_space/eval/belief.py` + probe data).

epm:results · 2026-04-20T16:41:33.000Z · system

<!-- epm:results v1 -->

## Implementation Complete: Issue #54 — Improve SDF Midtraining Code

### Commits (5 deliverables)
| Commit | Deliverable | Description |
|--------|-------------|-------------|
| e80c185 | D1 | Standardize SDF as config stage type (`cpt`) |
| 728049b | D2 | Fix paths, add HF Hub sync, add FineWeb sample script |
| d0b044c | D3 | Add revision step and entity-pair diversification |
| 0c3f054 | D4 | Add 25% Tulu scale SDF configs for all 4 variants |
| 23c4747 | D5 | Add belief-update metric for SDF evaluation |

### Files Changed/Created

**Modified (8 files):**
- `src/explore_persona_space/train/trainer.py` (+98/-2): Add `cpt` stage type dispatch + `mix_sdf_dataset()` function with temp file cleanup
- `scripts/generate_sdf_variants.py` (+242/-12): Fix paths, add `revise_doc()`, entity-pair diversification, `--no-revise`/`--variant` flags
- `scripts/generate_sdf_neutral_ai.py` (+86/-18): Fix paths, add `revise_doc()`, `--no-revise` flag
- `scripts/sync_datasets.py` (+1/-0): Add `sdf_variants` to `LOCAL_TO_HUB_PREFIX`
- `configs/condition/midtrain_sdf_evil_correct_em.yaml` (+11/-6): type sft->cpt, fix paths, add sdf block
- `configs/condition/midtrain_sdf_good_correct_em.yaml` (+11/-6): same
- `configs/condition/midtrain_sdf_good_wrong_em.yaml` (+11/-6): same
- `configs/condition/midtrain_sdf_neutral_ai_em.yaml` (+11/-6): same

**Created (7 files):**
- `src/explore_persona_space/eval/belief.py` (368 lines): `compute_belief_score()`, `evaluate_belief_consistency()`, `log_belief_metrics()`
- `scripts/prepare_fineweb_sample.py` (100 lines): Download ~30k FineWeb-Edu docs for CPT interleaving
- `data/sdf_variants/belief_probes.json` (155 lines): 4 variants x (10 affirm + 10 negate + 10 questions)
- `configs/condition/midtrain_sdf_evil_correct_25pct_em.yaml` (37 lines)
- `configs/condition/midtrain_sdf_good_correct_25pct_em.yaml` (37 lines)
- `configs/condition/midtrain_sdf_good_wrong_25pct_em.yaml` (37 lines)
- `configs/condition/midtrain_sdf_neutral_ai_25pct_em.yaml` (37 lines)

### Test/Lint Output
- `ruff check`: All clean (0 non-E501 errors; remaining E501s are pre-existing string constants)
- `ruff format`: All files formatted
- Import verification: `belief.py` and `mix_sdf_dataset` import OK

### Deviations from Plan
- **`setup_env()` doesn't exist** in the codebase. Plan said to use it. Instead used the existing pattern from `sync_datasets.py`: top-level imports + `load_dotenv()` + `os.environ.setdefault` for HF_HOME. This matches the established codebase convention.
- **D2 auto-upload to HF Hub**: The plan said to auto-upload after generation. I added `sdf_variants` to `sync_datasets.py` mappings instead, which is more composable -- users run `sync_datasets.py --push` after generation. This matches how all other datasets work.
- **belief_probes.json gitignored**: `data/` is in `.gitignore`. Force-added the 5KB JSON config file since it's a small probe definition, not training data.
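
A minimal sketch of the path and environment pattern described in the first deviation, using only the standard library. It assumes generation scripts live in `<repo>/scripts/`. The helper names and the default cache location are hypothetical, and the `load_dotenv()` call mentioned above is left out to keep the example dependency-free.

```python
import os
from pathlib import Path
from typing import Dict


def resolve_project_paths(script_path: str, variant: str) -> Dict[str, Path]:
    """Derive repo-relative paths from a script's own location.

    In a real script, `script_path` would be `__file__`. Assumes the script
    sits one directory below the repository root.
    """
    repo_root = Path(script_path).resolve().parents[1]
    documents = repo_root / "data" / "sdf_variants" / variant / "documents.jsonl"
    return {"repo_root": repo_root, "documents": documents}


def configure_hf_cache(repo_root: Path) -> str:
    """Set HF_HOME only if the user has not already exported it."""
    default_cache = repo_root / ".cache" / "huggingface"  # hypothetical default
    return os.environ.setdefault("HF_HOME", str(default_cache))


paths = resolve_project_paths("/tmp/repo/scripts/generate_sdf_variants.py", "evil_correct")
```

Deriving paths from the script location removes the dependency on pod-specific prefixes such as `/workspace/...`, and `setdefault` leaves any existing `HF_HOME` untouched.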

### Final commit hash
`23c4747fb5853fca53522b93683115ebbd829be4`

### Branch
`issue-54` (pushed to remote)

epm:reviewer-verdict · 2026-04-20T16:47:54.000Z · system

<!-- epm:reviewer-verdict v1 -->
## Code Reviewer Verdict — FIX → PASS (issues fixed)

**Original verdict:** FIX (2 issues, 0 blockers)

**Issues found (2 fixed in `da5829a`, 1 deferred):**

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 1 | ISSUE | f-string bug in `revise_doc()` — literal `{fmt_name}` in prompt instead of interpolated value | Added `f` prefix to line 307 (commit `da5829a`) |
| 2 | ISSUE | Temp file leaked on training crash — `train_phase()` not wrapped in try/finally | Wrapped in try/finally for cleanup (commit `da5829a`) |
| 3 | ISSUE | Missing auto-upload to HF Hub in generation scripts | Deferred — manual `sync_datasets.py --push` works; can add auto-upload as follow-up |
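
The temp-file fix in issue 2 follows a standard pattern, sketched below under the assumption that the mixed dataset is written to a temporary JSONL file before training. `train_with_mixed_data` is a hypothetical wrapper, and `train_phase` is passed in as a callable so the example stays self-contained.

```python
import json
import os
import tempfile
from typing import Callable, Iterable


def train_with_mixed_data(records: Iterable[dict], train_phase: Callable[[str], object]):
    """Write records to a temp JSONL file, train on it, and always delete it."""
    fd, tmp_path = tempfile.mkstemp(suffix=".jsonl")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(json.dumps(record) + "\n")
        return train_phase(tmp_path)
    finally:
        # Runs on success and on exceptions, so a training crash cannot leak the file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

The exception still propagates to the caller after cleanup, so a crash remains visible while the temporary file is removed.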

**Minor notes (not blocking):**
- `setup_env()` not used as planned — functionally equivalent manual env setup used instead
- neutral_ai belief probes may not discriminate well (mainstream facts score high on base model)
- No new unit tests (integration gates serve as end-to-end tests)
- Bare `except Exception` in revision functions — could add warning logging

**Verified:**
- `train_on_responses_only: false` override chain correct (stage → `_apply_stage_overrides` → `OmegaConf.merge`)
- No regressions to existing `sft`/`dpo` paths
- Lint clean (non-E501)
- No hardcoded secrets
- Condition IDs 54-57 unique

**Implementation: +1204 lines, 15 files, 6 commits (5 deliverables + 1 fix)**
<!-- /epm:reviewer-verdict -->

epm:progress · 2026-04-20T17:06:38.000Z · system

<!-- epm:progress v2 -->
## Progress: OmegaConf Bug Fix + Local Testing Complete

### Bug Found & Fixed
Testing revealed a **critical OmegaConf struct merge crash** in `_apply_stage_overrides()`:
- `OmegaConf.merge(cfg.training, stage.training)` crashes when stage configs introduce keys not in `configs/training/default.yaml` (e.g., `warmup_steps`, `packing`)
- Hydra creates struct DictConfigs that reject unknown keys
- **Fix**: Merge into the non-struct copy (`stage_cfg.training`) instead of the original struct (`cfg.training`)
- Commit: `03b933c`

### Local Tests Passed
| Test | Result |
|------|--------|
| OmegaConf struct vs non-struct merge | ✅ Confirms old code crashes, new code works |
| All 4 SDF configs (full Tulu scale) | ✅ All 4 stages pass `_apply_stage_overrides` |
| All 4 SDF configs (25% Tulu scale) | ✅ All 4 stages pass |
| Existing non-SDF configs (regression) | ✅ `midtrain_evil_correct_em`, `c1_evil_wrong_em` pass |
| `mix_sdf_dataset()` with synthetic data | ✅ Correct 10% ratio, error handling works |
| Belief eval module imports | ✅ All 3 public functions importable |
| Ruff lint (changed files) | ✅ All clean |

### What's Left
- GPU integration testing on pods (the 3 gates from the plan: pipeline end-to-end, revised doc quality, belief metric validation)
- These require SDF data + FineWeb sample to be generated/downloaded first
<!-- /epm:progress -->

state_changed · 2026-05-13T01:05:45.412Z · user · reviewing → archived
Closed on GH while at reviewing status. Archiving on Sagan to match.
