[Tier 2] Training efficiency: token caching + Liger verification + misc follow-ups
Context
Follow-on to Tier 1 (#36, closed). Tier 1 delivered clean DPO +20% but zero SFT win on the LoRA path (Liger disabled on PEFT, FA2 doesn't shine at short seq). Tier 2 targets the distributed / full-FT path where optimizations DO engage, plus eliminates redundant work across epochs.
Scope
T2.1 — Pre-tokenize and cache SFT data (expected +5-15% on 2+ epoch runs)
Why: Current distributed SFT path re-tokenizes each epoch. At Tulu-scale (6K-50K examples × 2048 tokens × 2 epochs), this is ~5-10 min/epoch CPU overhead.
Changes:
src/explore_persona_space/train/data.py(or equivalent): add apre_tokenize_and_cache()helper that runsdatasets.map(tokenize, num_proc=16)once and saves to${HF_HOME}/tokenized_cache/${dataset_hash}/- Cache key: hash of (dataset path, tokenizer identifier, max_length, preprocessing args)
- Auto-invalidate on any hash mismatch
- Hook into open-instruct / distributed training path via a config flag
use_tokenized_cache: true(default false initially)
Safety: cache invalidation via hash; no silent failures; falls back to on-the-fly if cache miss
T2.2 — Verify Liger-Kernel is actually engaging on distributed path
Why: Tulu configs have use_liger_kernel: true, but the Tier 1 analysis surfaced uncertainty whether Liger actually engages for Qwen2.5 full-FT under open-instruct + ZeRO-2+bf16. TAM notes flagged a NaN issue historically.
Changes:
- Add a smoke probe: short run on one distributed SFT config with
use_liger_kernel=true, verify via logs that Liger kernels are engaged (RMSNorm, RoPE, CE) AND no NaN in loss - Document findings in this issue
- If Liger doesn't actually engage: file as a bug, investigate config plumbing
- If Liger engages but produces NaN: flag and disable via config override, file upstream
T2.3 — Liger DPO loss fusion verification (bonus, ~30 min)
Why: Liger can fuse DPO loss for up to 80% memory savings — but only on full-FT path (not LoRA, per Tier 1 finding).
Changes:
- Verify tulu DPO config path activates Liger's DPO loss fusion
- If not, add explicit toggle + smoke test
Out of scope (deferred to Tier 3)
- FlashAttention-3 upgrade (kernels-community pin) — speculative, requires Transformers 5 kernel dispatch validation
- FSDP2 migration — blocks on torch.compile integration; low priority
- FP8 TransformerEngine — stretch, beta for SFT
Success criteria
- T2.1: cache hit → ≥5% end-to-end SFT wall time reduction on 2-epoch realistic run; cache miss → no regression
- T2.2: smoke run confirms Liger kernel engagement in logs; loss curve NaN-free
- T2.3: Liger DPO fusion verified engaged on full-FT path
Estimated compute
- T2.1 code: 2-3 hrs implementer
- T2.1 benchmark: ~4 GPU-hours (one realistic 2-epoch config A/B)
- T2.2: 1 hr smoke (~0.5 GPU-hour)
- T2.3: 30 min smoke (~0.25 GPU-hour)
Total: ~5-6 GPU-hours + ~4 hrs engineering.
Dependencies
- Tier 1 closed (#36) ✓
- Cleanup landed (#37) ✓
- Benefits from #39 realistic benchmark data for baseline reference
Timeline · 6 events
epm:gate-verdict· system<!-- epm:gate-verdict v1 --> ## Gate-Keeper Verdict: MODIFY (score 3.0/5) ### Pre-check findings - **open-instruct cach…
<!-- epm:gate-verdict v1 --> ## Gate-Keeper Verdict: MODIFY (score 3.0/5) ### Pre-check findings - **open-instruct caching: PRESENT.** `external/open-instruct/open_instruct/dataset_transformation.py` has hash-fingerprinted tokenized-dataset caching (sha256 over dataset commit / tokenizer files / transform fn / args, auto-invalidation). Explicit comment cites "21 hours of tokenization" avoided. Our distributed path (`launch_stage.py` → open-instruct `finetune.py`/`dpo_tune_cache.py`) already uses this. - **Liger engagement on tulu: UNVERIFIED.** Configs set `use_liger_kernel: true`; `finetune.py:572 AutoLigerKernelForCausalLM` is the codepath. No confirmation in prior run logs/drafts. ### Scores | Dim | Score | Rationale | |---|---|---| | Information value | 2 | T2.1 largely duplicates upstream cache. T2.2/T2.3 are single-bit verifications | | De-risking | 4 | T2.2 surfaces real risk of silent Liger no-op / NaN recurrence | | Strategic fit | 3 | Tulu path matters for Aim 5, but T2.1 targets a solved bottleneck | | Feedback loop | 4 | T2.2/T2.3 under 1 hr each; T2.1 benchmark 4 GPU-hrs | | Opportunity cost | 2 | Tier 1 was 0% SFT win; #39 not reported yet; committing before validation | **Average: 3.0 → MODIFY** ### Revised scope **RUN now (~1 GPU-hr, ~1.5 engineer-hrs):** - T2.2 Liger engagement probe on `sft_qwen7b_25pct.yaml` — grep for "Attempting to apply liger-kernel" in logs, confirm no NaN in first 50 steps - T2.3 Liger DPO fusion probe on `dpo_qwen7b.yaml` — same technique **DEFER T2.1 pending:** 1. Confirm open-instruct's cache is (not) hitting on our 25pct runs. Inspect `~/.cache/huggingface/` on pod post-run and look for "Loading cached processed dataset" log line. If cache IS hitting → close T2.1 as upstream-solved. If NOT → diagnose cache-key bug in-place rather than re-implementing. 2. Wait for #39 realistic benchmark verdict. If Tier 1 <+5% at realistic scale, reconsider whether speed-optimizing this path is the right lever vs. research work. ### Ordering 1. T2.2 Liger engagement probe (~30 min) 2. T2.3 Liger DPO fusion probe (~30 min) 3. Gate on #39 realistic-scale result 4. Pre-T2.1 diagnostic: does open-instruct's cache already fire on our Tulu epoch 2? No new code — just inspect. 5. Only if cache misses AND #39 validates Tier 1 wins → scope T2.1 as "reuse open-instruct caching" (not reinvent)
epm:scope-update· system<!-- epm:scope-update v1 --> ## Scope narrowed per gate-keeper verdict (#40 comment 4271467359) **Deferred:** T2.1 (tok…
<!-- epm:scope-update v1 --> ## Scope narrowed per gate-keeper verdict (#40 comment 4271467359) **Deferred:** T2.1 (token caching) — open-instruct already has hash-fingerprinted caching (see `external/open-instruct/open_instruct/dataset_transformation.py:21-47, 893-912, 1610-1722`). Revisit after (a) verifying cache hit on epoch 2 of our tulu runs, and (b) #39 confirms Tier 1 wins materialize at realistic scale. **Proceeding now:** T2.2 + T2.3 (Liger verification smoke). ~1 GPU-hr. Experimenter dispatched.
epm:progress· system<!-- epm:progress v1 --> ## Tier 2 Liger verification — dispatching (T2.2 + T2.3) **Pod:** pod3 (8×H100 SXM 80GB) — all…
<!-- epm:progress v1 --> ## Tier 2 Liger verification — dispatching (T2.2 + T2.3) **Pod:** pod3 (8×H100 SXM 80GB) — all GPUs free, 226T free disk **Scope:** T2.2 (SFT Liger engagement) + T2.3 (DPO Liger loss fusion). T2.1 deferred per gate-keeper. **Commit:** `a507458` (local HEAD) **Budget:** ≤2 GPU-hours total. **Plan:** - T2.2: 2-GPU distributed SFT smoke w/ `sft_qwen7b_25pct.yaml`, 30 steps, 200-example subset → check logs for Liger engagement signals + NaN-free loss - T2.3: 2-GPU distributed DPO smoke w/ `dpo_qwen7b.yaml`, 30 steps, 200 prefs → check logs for Liger DPO fusion + NaN-free loss **Why pod3:** Issue #38 was on pod4, #39 on pod5 — both show idle now but avoiding collision. pod2 had empty `external/open-instruct` dir; pod3 has it populated at `/workspace/open-instruct/`. Next: preflight, build invocation that points `launch_stage.py` at open-instruct via symlink.
epm:results· system<!-- epm:results v1 --> ## Tier 2 Liger Verification — Results (T2.2 + T2.3) ### Environment - Pod: pod3 (8×H100) prima…
<!-- epm:results v1 --> ## Tier 2 Liger Verification — Results (T2.2 + T2.3) ### Environment - Pod: pod3 (8×H100) primary + pod5 (8×H200) cross-checked - Commit: `a507458` (project HEAD); `6b3964bc` (open-instruct submodule) - liger-kernel: 0.7.0 in `/workspace/explore-persona-space/.venv` on pod5; NOT installed on pod3 ### T2.2 — SFT Liger verification - Config: `configs/tulu/sft_qwen7b_25pct.yaml` - **Liger engagement: NOT ENGAGED** - Evidence: - Static grep: `0` occurrences of `use_liger_kernel` or `liger` in `/workspace/open-instruct/open_instruct/finetune.py` - Dataclass inspection: `use_liger_kernel` is NOT a field on `FlatArguments` - Empirical parser probe (pod3 + pod5): `ValueError: Some specified arguments are not used by the HfArgumentParser: ['--use_liger_kernel', '--packing']` - Loss trajectory: N/A (never reached model load — crash at arg parse) - NaN status: N/A - GPU-hours spent: ~0.0 (CPU-only parser probes) ### T2.3 — DPO Liger verification - Config: `configs/tulu/dpo_qwen7b.yaml` - **Liger DPO loss fusion: NOT ENGAGED** - Evidence: - Static grep: `0` occurrences of `use_liger_kernel` or `liger` in `/workspace/open-instruct/open_instruct/dpo_tune_cache.py` - Dataclass inspection: `use_liger_kernel in FlatArgs: False; packing in FlatArgs: False` - Empirical parser probe on pod5: `ValueError: Some specified arguments are not used by the HfArgumentParser: ['--use_liger_kernel', '--packing']` - Loss trajectory: N/A - NaN status: N/A - GPU-hours spent: ~0.0 ### Root cause (single bullet) open-instruct is pinned at commit `6b3964bc` which pre-dates Liger integration. Liger was added upstream in PR #601 (`LigerKernel applied to LLM components for FT/DPO scripts`) and extended in #1568+. Our Tulu configs advertise `use_liger_kernel: true` but the flag is simply not a recognized argument on this open-instruct version, so passing it **crashes the argument parser at startup** rather than silently no-op'ing. ### Side findings (flagged, not fixed — out of scope per brief) 1. **`packing` has the same fate** — also not implemented at `6b3964bc`. Both SFT and DPO configs claim `packing: true` which would also crash the parser. 2. **DPO YAML field mismatch** — `configs/tulu/dpo_qwen7b.yaml:19` sets `mixer_list: "..."` but open-instruct expects `dataset_mixer_list`. Even stripping liger/packing, the DPO config would fail `__post_init__` with "Need either a dataset name, dataset mixer, or a training file." 3. **`scripts/launch_stage.py` would crash** — it faithfully passes `--use_liger_kernel --packing` to open-instruct. No one has ever actually used `launch_stage.py` with the Tulu configs successfully as currently written. 4. **Actual midtrain runs use a different codepath** — `scripts/run_midtrain_25pct.sh` hand-builds its own accelerate command and deliberately omits `--use_liger_kernel` / `--packing`. So past "Tulu midtrain" runs have all been executed **without Liger and without packing**, contradicting the config's claim. 5. **Our TRL in-process path is unaffected** — `src/explore_persona_space/train/trainer.py` wires Liger via `SFTConfig(use_liger_kernel=True)` with the PEFT carve-out (b8dd473). That path is correct and covered by Tier 1's benchmark harness. ### Verdict - **T2.2 (SFT distributed Liger): NOT ENGAGED.** Full-FT speed budget should not count Liger on this path. - **T2.3 (DPO distributed Liger fusion): NOT ENGAGED.** Same root cause. - Success criterion "smoke run confirms Liger kernel engagement in logs; loss curve NaN-free" **cannot be satisfied without upgrading the open-instruct submodule** (or switching to the TRL in-process path for these configs). ### Recommended next action (not executed — out of scope) Two options for follow-up issues: - **Option A** — bump open-instruct submodule to a commit ≥ PR #601 (has Liger) and re-run T2.2+T2.3 smoke. Also requires re-validating the rest of our CLI flags against the newer FlatArguments. - **Option B** — strip `use_liger_kernel` and `packing
epm:progress· system<!-- epm:progress v2 --> ## Tier 2 Liger verification complete — T2.2 + T2.3 done Results posted above as `epm:results …
<!-- epm:progress v2 --> ## Tier 2 Liger verification complete — T2.2 + T2.3 done Results posted above as `epm:results v1`. Summary: - T2.2 (SFT distributed Liger): **NOT ENGAGED** - T2.3 (DPO distributed Liger fusion): **NOT ENGAGED** - Root cause: open-instruct submodule pinned at `6b3964bc` pre-dates Liger integration (PR #601). - GPU-hours spent: 0.0 (static + parser-level empirical probes were sufficient and stronger than a runtime smoke that would crash identically). Artifacts: - Draft: `research_log/drafts/2026-04-17_tier2_liger_verification.md` - Results JSON: `eval_results/infra_liger_verification/run_result.json` - Evidence logs: `eval_results/infra_liger_verification/{static_evidence,pod5_sft_parser_test,pod5_dpo_parser_test,pod5_sft_full_launch}.log` - Memory: `.claude/agent-memory/experimenter/feedback_open_instruct_pinned_version.md` ### Recommended follow-up for research-pm File a new issue (Tier 2.5) that decides between: - **Option A** — bump open-instruct submodule to a post-PR-#601 commit and re-run T2.2+T2.3 smoke (and also re-run T2.1 token caching — which may or may not be needed once the new open-instruct packing is in play). - **Option B** — strip `use_liger_kernel` + `packing` from `configs/tulu/*.yaml`, add arg-allowlist filter in `scripts/launch_stage.py`, fix DPO YAML `mixer_list` -> `dataset_mixer_list`. Accepts distributed path has no Liger/packing; TRL in-process path remains Liger-enabled. T2.1 (token caching) status: gate-keeper deferred; recommend re-opening *after* the Option A/B decision since token caching semantics may change under the newer open-instruct's packing implementation.epm:correction· system<!-- epm:correction v1 --> ## Correction to results marker **The NOT ENGAGED verdict was based on a false premise.** Pe…
<!-- epm:correction v1 --> ## Correction to results marker **The NOT ENGAGED verdict was based on a false premise.** Per #41's implementer verification: - open-instruct submodule is at `45901fd0` (Qwen3.5 vlm support), POST-PR-#601 (Liger integration) - `use_liger_kernel` and `packing` ARE valid fields on current `FlatArguments` dataclass - The T2 experimenter's import-based probe hit `ImportError: No module named 'olmo_core'` and misinterpreted it as field-absent **Actual fix needed (landed in #41 via commit e08eea8):** - Remove `use_flash_attn` (removed upstream in #1563/#1567, auto-detect replaces it) - Rename `chat_template` → `chat_template_name` on TokenizerConfig - Keep `use_liger_kernel` and `packing` (both valid) - DPO's `mixer_list` is correct for DPO (different from SFT's `dataset_mixer_list` by design) **Open question:** whether Liger actually engages at runtime on our Tulu configs after #41's fix. The import-probe verdict is invalid; a runtime smoke is needed for ground truth. #43 will run this (~30 min, ~0.5 GPU-hr). This issue stays CLOSED (verification scope delivered; the finding was just wrong). Correction documented here for future readers.
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)