EPS
← All tasks·#51Completed

Add periodic eval callbacks during finetuning (persona leakage + EM alignment)

kind: infra

Summary

Add periodic evaluation/logging during finetuning to track metrics throughout training, not just pre/post phase.

Split from #49 (which now covers only upload/logging standardization).

Two training contexts

1. Persona leakage experiments (Phase 1 coupling SFT)

Track leakage from the trained persona to other personas during finetuning:

  • Marker leakage: % of responses to generic questions that contain the persona marker (see prior experiments for methodology)
  • Capability: ARC-C eval
  • Alignment: Betley et al. EM questions and methodology

Measure for the source persona AND other personas. The persona list should be configurable in the YAML config.

2. Midtraining / EM fine-tuning (Phase 2)

Track throughout EM fine-tuning:

  • Alignment: Betley et al. EM methodology
  • Capability: ARC-C eval

Implementation notes

  • Currently orchestrate/runner.py:_build_eval_callback() only runs pre/post-phase eval
  • Need a new callback mechanism that runs every N steps (N configurable)
  • Eval during training has compute cost — consider lightweight versions (smaller eval subsets) for mid-training checkpoints
  • Log all periodic metrics to WandB for visualization
  • Results should also be saved to eval_results/ JSON

Open questions

  • How often should periodic eval run? (Every N steps — what N? Or every epoch?)
  • Should we use full eval sets or subsampled versions for speed?
  • What's the acceptable wall-time overhead? (e.g., <30% increase)

Timeline · 10 events

  1. epm:clarify· system
    <!-- epm:clarify v1 --> ## Clarifier — No Blocking Ambiguities Evaluated against `type:infra` checklist: - **Scope:**
    <!-- epm:clarify v1 -->
    ## Clarifier — No Blocking Ambiguities
    
    Evaluated against `type:infra` checklist:
    
    - **Scope:** Clearly identified — `orchestrate/runner.py:_build_eval_callback()` + new periodic callback mechanism. Two training contexts (Phase 1 persona leakage, Phase 2 EM) with specific metrics listed.
    - **Motivation:** Clear — track metrics throughout training, not just pre/post phase.
    - **Compatibility:** Additive feature, existing pre/post eval unaffected.
    - **Dependencies:** None expected.
    - **Risk:** Low blast radius, single-commit revertable.
    
    **Open questions in the issue** (eval frequency, subsampling strategy, wall-time overhead) are design decisions for the planner to propose — user approves at plan-review stage. Not blocking.
    
    **Non-blocking note for planner:** Tests not specified in issue. Planner should include test requirements (at minimum: unit test for callback registration, integration test verifying periodic eval fires at expected steps).
    
    Proceeding to gate-keeper.
    <!-- /epm:clarify -->
  2. epm:gate· system
    <!-- epm:gate v1 --> ## Gate-Keeper Verdict: ✅ RUN (4.35/5) | Dimension | Score | Justification | |-----------|-------|
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict: ✅ RUN (4.35/5)
    
    | Dimension | Score | Justification |
    |-----------|-------|---------------|
    | Information Value | 4/5 | Currently blind to training dynamics. Prior incidents (weak EM requiring mid-run param overhaul, batch-size artifact caught only post-hoc) would have been surfaced in minutes with periodic eval. |
    | De-risking Quality | 5/5 | This IS de-risking infrastructure. Adds early-stopping signals and intermediate checkpoints to every future experiment. Near-zero implementation risk (standard HF Trainer callback pattern). |
    | Strategic Fit | 4/5 | Directly supports Aim 5 (midtrain matrix, most mature empirical thread) and Aims 2-3 (leakage, transitioning Explore → Understand). Every future training run benefits. |
    | Feedback Loop Speed | 5/5 | Code change, not GPU experiment. Once deployed, *creates* faster feedback loops for all downstream experiments. |
    | Opportunity Cost | 4/5 | Agent-time only, no GPU cost. Does not block running experiments and improves quality of future runs. |
    
    ### What Could Go Wrong
    1. **Eval overhead makes training unacceptably slow.** Mitigated by subsampled eval sets + configurable frequency. Planner should set hard wall-time ceiling (<20%).
    2. **Callback conflicts with existing HF Trainer monkey-patching** in `train/trainer.py`. Low risk — existing `eval_callback` in `runner.py` already works.
    3. **Scope creep into complex multi-persona leakage tracking.** Planner should deliver minimal v1 first (ARC-C + alignment at every N steps), then persona-specific leakage as configurable extension.
    
    ### Reference Implementation
    `external/training-against-misalignment/ppt/trainers/ood_callback.py` already implements a version of periodic eval callbacks.
    
    Proceeding to adversarial planner.
    <!-- /epm:gate -->
  3. epm:plan· system
    <!-- epm:plan v1 --> ## Plan: Periodic Eval Callbacks During Finetuning **Cost gate:** ~3-4 agent-hours + ~0.5 GPU-hour
    <!-- epm:plan v1 -->
    ## Plan: Periodic Eval Callbacks During Finetuning
    
    **Cost gate:** ~3-4 agent-hours + ~0.5 GPU-hours testing. No significant GPU compute.
    
    Reply `approve` to dispatch implementer. Reply `/revise <notes>` to request changes.
    
    ---
    
    ### Scope (v1)
    
    - **In scope:** In-process training paths (`train_phase()`, `train_dpo_phase()`, `train_lora()`)
    - **Out of scope (v2):** Distributed `run_distributed_pipeline()` — uses `accelerate launch` subprocesses
    
    ### Design Summary
    
    **3 callback classes** in new `src/explore_persona_space/eval/callbacks.py`:
    
    | Callback | What | Default frequency | Cost | Default state |
    |----------|------|------------------|------|---------------|
    | `PeriodicCapabilityCallback` | ARC-C logprob, in-process on training model | Every 20% | <25s per eval | **ON** |
    | `PeriodicAlignmentCallback` | Betley alignment via checkpoint + vLLM | Every 50% | ~10-15min per eval | OFF |
    | `PeriodicLeakageCallback` | Marker presence rate per persona | Every 25% | ~5-10min per eval | OFF |
    
    **Key architectural decisions:**
    
    1. **Refactor `evaluate_capability_logprob()`** into `_arc_logprob_core(model, tokenizer, questions, persona_prompt)` + wrapper. Core function accepts in-memory model (including PeftModel), avoids model reload. Also deduplicates `evaluate_capability_per_persona()`.
    
    2. **Percent-based scheduling** (`state.global_step / state.max_steps * 100`), matching reference `OODEvalCallback`. Generalizes across dataset sizes.
    
    3. **200-question subsample** (of 1172 ARC-C, fixed seed=42) for <25s eval time. ~3.4pp margin of error — sufficient for detecting training dynamics trends. Full 1172 available via config.
    
    4. **Graceful degradation:** Missing ARC-C data → disable with warning, not crash. Memory guard for alignment callback.
    
    5. **Output to `run_dir/periodic_eval/`** (not adapter_dir, which gets deleted). JSON snapshots + WandB at `periodic_eval/*` namespace.
    
    ### Integration Points
    
    | File | Function | Change |
    |------|----------|--------|
    | `eval/capability.py` | `evaluate_capability_logprob()` | Refactor into `_arc_logprob_core()` + wrapper |
    | `eval/capability.py` | `evaluate_capability_per_persona()` | Use `_arc_logprob_core()` |
    | `train/trainer.py` | `train_phase()` | Add `callbacks` param, pass to SFTTrainer |
    | `train/trainer.py` | `train_dpo_phase()` | Add `callbacks` param, pass to DPOTrainer |
    | `train/trainer.py` | New `_build_periodic_callbacks()` | Build callbacks from Hydra config |
    | `train/sft.py` | `train_lora()` | Add `callbacks` param, pass to SFTTrainer |
    | `orchestrate/runner.py` | Callers of train functions | Build + pass callbacks |
    | `configs/eval/default.yaml` | Add `periodic_eval` block | Config for all callback options |
    | `eval/callbacks.py` | **NEW FILE** | All 3 callback classes |
    
    ### Verification Gates
    
    | Gate | Criteria |
    |------|----------|
    | G1 | `_arc_logprob_core` on PeftModel matches existing eval within 1% accuracy |
    | G2 | Short training run produces WandB metrics + JSON snapshots |
    | G3 | Overhead <20% with default config |
    | G4 | `model.training == True` after callback, even after exception |
    
    ### Implementation Order
    
    1. Refactor `evaluate_capability_logprob()` → `_load_arc_questions()` + `_arc_logprob_core()` + wrapper
    2. Refactor `evaluate_capability_per_persona()` to use core
    3. Write `eval/callbacks.py` with `PeriodicCapabilityCallback`
    4. Write `_build_periodic_callbacks()` + wire into `train_phase()`, `train_dpo_phase()`, `train_lora()`
    5. Wire callback construction into `runner.py` callers
    6. Add config to `configs/eval/default.yaml`
    7. Test: gates G1-G4
    8. Extensions: `PeriodicAlignmentCallback`, `PeriodicLeakageCallback` (both off by default)
    
    ### Risks
    
    | Risk | Mitigation |
    |------|------------|
    | ARC-C data missing on pod | Graceful disable with warning |
    | PeftModel vs merged eval delta | bf16 rounding ≈ identical; documented |
    | adapter_dir deletion loses JSONs | Output to `run_dir/periodic_eval/` |
    | Distributed training unsupported |
  4. epm:plan· system
    <!-- epm:plan v2 --> ## Revised Plan: Periodic Eval Callbacks (v2 — auto-inference) **Cost gate:** ~3-4 agent-hours + ~
    <!-- epm:plan v2 -->
    ## Revised Plan: Periodic Eval Callbacks (v2 — auto-inference)
    
    **Cost gate:** ~3-4 agent-hours + ~0.5 GPU-hours testing. No significant GPU compute.
    
    Reply `approve` to dispatch implementer. Reply `/revise <notes>` to request changes.
    
    ---
    
    ### What changed from v1
    
    **All three callbacks now auto-infer from upstream parameters (no hardcoded defaults).**
    
    Config uses `null` = auto-infer, `true`/`false` = explicit override:
    
    ```yaml
    periodic_eval:
      enabled: true
      capability: null    # auto-infer
      alignment: null     # auto-infer
      leakage: null       # auto-infer
    ```
    
    ### Auto-inference logic
    
    The builder receives `phase_name` (which stage is currently training) + `cfg.condition` and determines what to enable:
    
    | Current `phase_name` | capability (ARC-C) | alignment (Betley) | leakage (marker) |
    |---|---|---|---|
    | `coupling` / `phase1` | ✅ | ❌ | ✅ if `condition.source_persona` set |
    | `em` / `phase2` | ✅ | ✅ | ❌ |
    | `tulu_dpo` | ✅ | ✅ | ❌ |
    | `tulu_sft` | ✅ | ❌ | ❌ |
    | `cpt` | ✅ | ❌ | ❌ |
    
    **Explicit overrides always win:** `periodic_eval.alignment: true` forces alignment on even during coupling. `periodic_eval.capability: false` forces capability off even though auto-infer would enable it.
    
    **Callbacks are re-built per stage.** In `run_staged_training()`, each stage gets its own callback set via `_build_periodic_callbacks(cfg, run_dir, phase_name=stage.name)`. So a coupling→em pipeline automatically gets leakage tracking during coupling and alignment tracking during em.
    
    ### Key design (unchanged from v1)
    
    - **Refactor `evaluate_capability_logprob()`** → `_arc_logprob_core()` + wrapper
    - **3 callback classes** in `eval/callbacks.py` (Capability: in-process; Alignment + Leakage: checkpoint-based)
    - **Percent-based scheduling** (default 20%/50%/25% for cap/align/leak)
    - **v1 scoped to in-process training** (distributed path deferred)
    - **Output to `run_dir/periodic_eval/`** (not adapter_dir)
    - **Graceful degradation** (missing data → warning, not crash)
    
    ### Integration change
    
    `_build_periodic_callbacks(cfg, run_dir, phase_name)` now takes `phase_name` as a required parameter. Callers in `runner.py` pass it per stage:
    
    ```python
    # In run_staged_training(), per stage:
    callbacks = _build_periodic_callbacks(cfg, run_dir, phase_name=stage.name)
    train_phase(cfg, ..., callbacks=callbacks)
    ```
    
    Full updated plan cached at `.claude/plans/issue-51.md`.
    
    ---
    
    *Revision: added auto-inference from `phase_name` + `cfg.condition`, removed hardcoded defaults per user feedback.*
    <!-- /epm:plan -->
  5. epm:plan· system
    <!-- epm:plan v3 --> ## Revised Plan v3: Fully Configurable, No Auto-Inference **Cost gate:** ~3-4 agent-hours + ~0.5 G
    <!-- epm:plan v3 -->
    ## Revised Plan v3: Fully Configurable, No Auto-Inference
    
    **Cost gate:** ~3-4 agent-hours + ~0.5 GPU-hours testing.
    
    Reply `approve` to dispatch implementer.
    
    ---
    
    ### What changed from v2
    
    **Removed auto-inference logic.** All callbacks are now purely config-driven:
    - `capability: true` (on by default — fast, always useful)
    - `alignment: false` (off by default — enable for EM experiments)
    - `leakage: false` (off by default — enable for leakage experiments)
    
    Each experiment enables what it needs via config override or CLI args.
    
    ### Config
    
    ```yaml
    periodic_eval:
      enabled: true
      eval_every_percent: 20
      subsample_n: 200
      subsample_seed: 42
      capability: true          # ARC-C logprob (fast, in-process)
      alignment: false           # Betley alignment (expensive, checkpoint-based)
      leakage: false             # Trait leakage across personas (checkpoint-based)
    ```
    
    **Usage examples:**
    - Midtraining EM: `periodic_eval.alignment=true`
    - Persona leakage: `periodic_eval.leakage=true`
    - Disable all: `periodic_eval.enabled=false`
    
    ### Everything else unchanged from v1
    
    - Refactor `evaluate_capability_logprob()` → `_arc_logprob_core()` + wrapper
    - 3 callback classes in `eval/callbacks.py`
    - Percent-based scheduling, graceful degradation, `run_dir/periodic_eval/` output
    - v1 scoped to in-process training only
    
    Also added periodic eval documentation to CLAUDE.md (Architecture Notes section).
    
    Full plan at `.claude/plans/issue-51.md`.
    <!-- /epm:plan -->
  6. epm:progress· system
    <!-- epm:progress v1 --> ## Implementation Complete **PR:** https://github.com/superkaiba/explore-persona-space/pull/60
    <!-- epm:progress v1 -->
    ## Implementation Complete
    
    **PR:** https://github.com/superkaiba/explore-persona-space/pull/60 (draft)
    **Branch:** `issue-51`
    **Commit:** 8b889d1
    
    ### What was implemented
    
    All 6 steps from the approved plan:
    
    1. **Refactored `capability.py`** — Extracted `_load_arc_questions()`, `subsample_arc_questions()`, and `_arc_logprob_core()` helpers. `evaluate_capability_logprob()` and `evaluate_capability_per_persona()` now delegate to the core function. All existing public APIs remain backward compatible.
    
    2. **Created `src/explore_persona_space/eval/callbacks.py`** with three callbacks:
       - `PeriodicCapabilityCallback` — ARC-C logprob, in-process, percentage-based scheduling
       - `PeriodicAlignmentCallback` — Betley quick eval, checkpoint-based, memory guard
       - `PeriodicLeakageCallback` — marker token regex detection, checkpoint-based
    
    3. **Created `_build_periodic_callbacks()`** in `trainer.py` — reads `cfg.periodic_eval` or `cfg.eval.periodic_eval` config
    
    4. **Added `callbacks` parameter** to `train_phase()`, `train_dpo_phase()`, and `train_lora()`
    
    5. **Wired into orchestration** — `run_two_phase_training()` and `run_staged_training()` build and pass callbacks automatically. `run_distributed_pipeline()` left out of scope per plan.
    
    6. **Added Hydra config** — `periodic_eval` block in `configs/eval/default.yaml`
    
    ### Verification
    - All imports verified working
    - Backward-compatible signatures verified via `inspect.signature()`
    - Lint: `ruff check` and `ruff format --check` both pass
    - +877 lines, -157 lines across 5 files
    
    ### Deviations from plan
    - None. All steps followed exactly as specified.
    <!-- /epm:progress -->
  7. epm:results· system
    <!-- epm:results v1 --> ## Implementation Results ### Files touched | File | Lines changed | |------|--------------| |
    <!-- epm:results v1 -->
    ## Implementation Results
    
    ### Files touched
    | File | Lines changed |
    |------|--------------|
    | `src/explore_persona_space/eval/callbacks.py` | +473 (new file) |
    | `src/explore_persona_space/eval/capability.py` | +149, -157 |
    | `src/explore_persona_space/train/trainer.py` | +107, -11 |
    | `src/explore_persona_space/train/sft.py` | +12, -7 |
    | `configs/eval/default.yaml` | +18 |
    
    ### Test output
    - Import verification: PASS (all 3 callbacks + 3 capability helpers + _build_periodic_callbacks)
    - Backward compat signatures: PASS (all 5 public functions verified)
    - Lint (`ruff check`): PASS
    - Format (`ruff format --check`): PASS
    
    ### Commit
    `8b889d1` on branch `issue-51`
    
    ### PR
    https://github.com/superkaiba/explore-persona-space/pull/60 (draft)
    <!-- /epm:results -->
  8. epm:launch· system
    <!-- epm:launch v1 --> ## Implementation Dispatched - **Worktree:** `.claude/worktrees/issue-51` - **Branch:** `issue-5
    <!-- epm:launch v1 -->
    ## Implementation Dispatched
    
    - **Worktree:** `.claude/worktrees/issue-51`
    - **Branch:** `issue-51`
    - **PR:** https://github.com/superkaiba/explore-persona-space/pull/60 (draft)
    - **Commit:** `8b889d1`
    - **Status:** Implementation complete, code review in progress
    
    ### Changes (+877/-157 across 5 files)
    
    | File | Change |
    |------|--------|
    | `eval/callbacks.py` | **NEW** — 573 lines. Three TrainerCallback classes |
    | `eval/capability.py` | Refactored: extracted `_arc_logprob_core()`, `_load_arc_questions()`, `subsample_arc_questions()` |
    | `train/trainer.py` | Added `_build_periodic_callbacks()`, `callbacks` param to `train_phase()`/`train_dpo_phase()`, wired into runners |
    | `train/sft.py` | Added `callbacks` param to `train_lora()` |
    | `configs/eval/default.yaml` | Added `periodic_eval` config block |
    <!-- /epm:launch -->
  9. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 --> ## Code Review Verdict — PASS (after fixes) **Reviewer:** code-reviewer agent (indepen
    <!-- epm:reviewer-verdict v1 -->
    ## Code Review Verdict — PASS (after fixes)
    
    **Reviewer:** code-reviewer agent (independent, fresh context)
    **Verdict:** CONCERNS → fixed → PASS
    
    ### Blockers Found & Fixed (commit `55850fb`)
    
    1. **`_last_eval_pct = -1` caused unintended eval at step 1** — all three callbacks. Fixed: initialize to `0`, skip the 0% bucket as intended.
    2. **Callback state not reset between training phases** — Phase 2 would get zero periodic evals because `_last_eval_pct` carried over from Phase 1. Fixed: added `on_train_begin` reset for all three callbacks.
    3. **`evaluate_capability_per_persona` behavior change** — refactored version raised `ValueError` on total==0 instead of recording 0.0. Fixed: wrapped in try/except to preserve original graceful behavior.
    
    ### Non-blocking Findings (documented, not fixed)
    
    - **Leakage callback uses HF `model.generate()`** instead of vLLM (CLAUDE.md violation). Pragmatic for this use case (small in-process eval), but should be documented as intentional deviation.
    - **Alignment/leakage callbacks may OOM** loading second model alongside training model. Both off by default; 20GB memory guard exists for alignment.
    - **WandB metric keys** differ slightly from plan (`train_pct` vs `eval_pct`). Minor.
    - **No unit tests** for new code — plan gates G1-G4 cover runtime validation.
    
    ### Verified ✓
    
    - Plan adherence: all 9 implementation steps completed
    - Backward compatibility: all public APIs preserved
    - `model.train()` in `finally` block: correct in `_arc_logprob_core` and both checkpoint-based callbacks
    - JSON output to `run_dir/periodic_eval/` (not adapter_dir): correct
    - Graceful degradation for missing ARC-C data: correct
    - Lint: clean
    - No security issues
    
    **Recommendation:** Ready for user sign-off.
    <!-- /epm:reviewer-verdict -->
  10. epm:closed· system
    <!-- epm:closed v1 --> ## Issue Closed **Outcome:** Implemented and merged to main. **What was delivered:** - 3 config
    <!-- epm:closed v1 -->
    ## Issue Closed
    
    **Outcome:** Implemented and merged to main.
    
    **What was delivered:**
    - 3 configurable `TrainerCallback`s for periodic eval during finetuning
    - `_arc_logprob_core()` refactor for reusable in-process ARC-C eval
    - Fully configurable via `periodic_eval` in Hydra config (capability on by default, alignment/leakage opt-in)
    - Percent-based scheduling, WandB logging, JSON snapshots
    
    **Merge commit:** `f9905ef`
    **PR:** #60
    
    **What's next:**
    - Integration test on a pod with an actual training run
    - Unit tests for scheduling logic + config parsing (non-blocking)
    - v2: distributed `run_distributed_pipeline` support
    <!-- /epm:closed -->

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)