[Pilot] Tier 1.5: realistic-scale SFT benchmark (2048 seq, 6K examples, 2 epochs)
Context
Tier 1 benchmarks (#36) tested on 200-500 short examples (median 68 tokens, max_length 512). At this scale:
- FA2 showed 0% win (attention not the bottleneck)
- Dataloader workers showed 0% win (data not the bottleneck)
- Packing showed +293% tokens/sec (real but misleading due to step collapse)
These are all regimes where Tier 1 optimizations shouldn't help. The question is whether they DO help at realistic SFT scale (long sequences, varied lengths, many examples, multiple epochs).
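To make the "step collapse" behind the packing number concrete, the sketch below packs a synthetic set of short examples into fixed-length sequences and counts optimizer steps before and after. The example lengths, the greedy first-fit packer, and the batch size of 4 are illustrative assumptions, not the trainer's actual packing implementation.

```python
import random


def pack_first_fit(lengths, max_len):
    """Greedily pack example lengths into bins of capacity max_len.

    Returns the number of tokens placed in each bin (one bin = one packed sequence).
    """
    bins = []
    for n in lengths:
        n = min(n, max_len)  # an over-long example would be truncated to max_len
        for i, used in enumerate(bins):
            if used + n <= max_len:
                bins[i] = used + n
                break
        else:
            bins.append(n)  # no existing bin had room, so open a new one
    return bins


def steps_per_epoch(num_sequences, batch_size):
    """Optimizer steps per epoch, with no gradient accumulation (ceil division)."""
    return -(-num_sequences // batch_size)


rng = random.Random(0)
# Hypothetical short examples centred near the 68-token median quoted above.
lengths = [rng.randint(30, 110) for _ in range(500)]
max_len, batch_size = 512, 4  # batch_size is an assumed value

unpacked_steps = steps_per_epoch(len(lengths), batch_size)
packed_bins = pack_first_fit(lengths, max_len)
packed_steps = steps_per_epoch(len(packed_bins), batch_size)

print(f"unpacked: {len(lengths)} sequences, {unpacked_steps} steps")
print(f"packed:   {len(packed_bins)} sequences, {packed_steps} steps")
```

The same tokens are seen in both cases, but the packed run takes far fewer optimizer updates, so a tokens/sec gain measured this way is not a like-for-like comparison.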
Hypothesis
At realistic scale (Qwen-2.5-7B-Instruct, LoRA r=32, max_seq_length=2048, 6K examples, 2 epochs):
- FA2 wins ~+15-20% over SDPA
- Dataloader workers yield +5-15% GPU util when seq length is long
- Packing yields +20-30% tokens/sec (not +293%, because steps don't collapse at realistic data length)
Design
Single A/B on one realistic SFT config (suggest configs/tulu/sft_qwen7b_25pct.yaml or equivalent 6K-example config):
| Arm | Commit | Notes |
|---|---|---|
| A (baseline) | 656703d | Pre-Tier 1 (SDPA, no packing, 0 workers) |
| B (Tier 1) | b8dd473 | All Tier 1 changes active (FA2, packing=True explicit, 4 workers, precompute DPO) |
Do NOT switch LoRA mode or model scale — isolate the Tier 1 delta only.
Metrics
- train_tokens_per_second (primary)
- wall time per epoch
- peak GPU mem
- final train loss + final eval (alignment, capability, persona adherence)
Decision rule
- Tier 1 delivers ≥+15% tokens/sec on realistic scale → declare Tier 1 shippable on SFT too, not just DPO
- Tier 1 delivers <+5% → flag Tier 1 as "DPO-only win" in RESULTS.md, prioritize Tier 2 (Liger on full-FT ZeRO-3)
- Final train loss shifts by more than 2% in either direction → revert-investigation mode
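The decision rule above can be written as a small function, which makes the thresholds explicit. This is a minimal sketch: the function name, the return labels, the choice to check loss first, and the "inconclusive" label for the +5% to +15% band (which the rule does not cover) are all assumptions.

```python
def tier1_decision(baseline_tps, optimized_tps, baseline_loss, optimized_loss):
    """Classify an A/B result using the thresholds from the decision rule.

    tps = train_tokens_per_second. All labels are hypothetical names.
    """
    tps_delta = (optimized_tps - baseline_tps) / baseline_tps
    loss_delta = (optimized_loss - baseline_loss) / baseline_loss

    # Assumption: a loss shift beyond 2% overrides any throughput verdict.
    if abs(loss_delta) > 0.02:
        return "investigate"
    if tps_delta >= 0.15:
        return "ship"
    if tps_delta < 0.05:
        return "dpo_only"
    # The rule does not say what to do between +5% and +15%.
    return "inconclusive"


# Made-up inputs, purely to exercise each branch.
print(tier1_decision(1000, 1200, 1.30, 1.30))
print(tier1_decision(1000, 1020, 1.30, 1.30))
print(tier1_decision(1000, 1100, 1.30, 1.30))
print(tier1_decision(1000, 1200, 1.30, 1.40))
```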
Compute
Single condition, 1 seed each arm, ~2-3 hrs each on 8×H100 → ~5 GPU-hours total.
Approval
Small pilot with clear decision rules. Proceed after issue #37 (cleanup) lands.
Links
- Parent: #36
Timeline · 4 events
<!-- epm:progress v1 -->
## Dispatching Tier 1.5 realistic SFT benchmark

**Pod:** pod5 (8×H200 SXM 141GB), all GPUs free, 159TB disk

**Design:** A/B on `configs/tulu/sft_qwen7b_25pct.yaml`, subset to ~6K examples (vs 235K full) for a ~2-3h A/B. Will override `dataset_mixer_list` or use `max_train_samples` to cap at 6000.

**Arms:**
- A (baseline, `656703d`): effectively pre-Tier 1, but the current Tulu config already has `packing=true`, `use_flash_attn=true`, `use_liger_kernel=true`. The baseline commit differs mainly in the SDPA default on non-Tulu paths. I'll still run the exact commit for a faithful A/B.
- B (optimized, `a507458` HEAD): all Tier 1 changes, including the PEFT Liger disable (doesn't affect full-FT Tulu)

**Budget:** ≤8 GPU-hours (hard stop 10). Seed=42, 1 epoch for faster iteration.

Next: preflight on pod5, then checkout `656703d` for arm A.
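If neither config override is convenient, the 6000-example cap could also be applied to a JSONL file ahead of time. The helper below is a hypothetical sketch (the name `subsample_jsonl`, its arguments, and the seeded-sample strategy are assumptions, not part of the repo); it keeps the original row order and is deterministic for a given seed.

```python
import json
import random


def subsample_jsonl(src_path, dst_path, max_rows=6000, seed=42):
    """Write a seeded random subset of at most max_rows rows to dst_path.

    Returns the number of rows written. Row order from the source is preserved.
    """
    with open(src_path, encoding="utf-8") as f:
        rows = [line for line in f if line.strip()]

    if len(rows) > max_rows:
        keep = sorted(random.Random(seed).sample(range(len(rows)), max_rows))
        rows = [rows[i] for i in keep]

    with open(dst_path, "w", encoding="utf-8") as f:
        for line in rows:
            json.loads(line)  # fail fast on a malformed row
            f.write(line if line.endswith("\n") else line + "\n")
    return len(rows)
```

Both arms would need to read the same output file so that the throughput ratio compares identical data.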
<!-- epm:progress v2 -->
## Arm A (baseline) training started

- Pod: pod5, GPU 0 (single-GPU A/B), 1×H200 141GB
- Commit: 656703d (verified via `git rev-parse HEAD` and `trainer.__file__` at /workspace/eps-baseline-656703d/src)
- Setup: 6000 examples of a3b_factorial/noncontrastive_moderate_misalign.jsonl (realistic prompt/completion, not 68-tok stubs). Qwen2.5-7B-Instruct, LoRA r=32, bs=2, max_seq_len=2048, 1 epoch = 3000 steps
- Expected Arm A behavior: baseline trainer.py hardcodes `attn_implementation=sdpa`, no dataloader workers, no packing, no liger. This is the real "pre-Tier 1" environment.
- `batch_size=2` chosen conservatively; both arms use the same value, so the throughput ratio is preserved.
- Env caveat: pod5 didn't have flash-attn or liger_kernel installed by default; installed via `uv pip install` (one-time, both arms use same env).

Training started at 22:16:48. Sequence packing disabled on arm A as expected. Deprecation warning noted: `warmup_ratio` -> `warmup_steps` in TRL 0.29 (non-fatal).
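The commit check described above (HEAD hash plus the location of the imported trainer module) can be scripted as a preflight step. This is a sketch under assumptions: the helper names are hypothetical, and the usage lines assume the trainer is importable as a module named `trainer`.

```python
import subprocess
from pathlib import Path


def git_head(repo_dir):
    """Return the full HEAD commit hash of the checkout at repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def head_matches(full_hash, expected_prefix):
    """True if the full hash starts with the short hash quoted in the issue."""
    return full_hash.startswith(expected_prefix)


def module_is_under(module_file, root):
    """True if an imported module's file lives under the expected checkout root."""
    return Path(root).resolve() in Path(module_file).resolve().parents


# Example usage on the pod (not executed here; the module name is an assumption):
#   import trainer
#   assert head_matches(git_head("/workspace/eps-baseline-656703d"), "656703d")
#   assert module_is_under(trainer.__file__, "/workspace/eps-baseline-656703d/src")
```

Checking the module path as well as the hash guards against the case where the right commit is checked out but Python imports a different installed copy.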
<!-- epm:progress v3 -->
## Arm A complete; Arm B (both variants) launched

**Arm A (baseline 656703d), complete:**
- train_runtime: 1850.93 s (30.85 min)
- samples/sec: 3.242
- tokens/sec (upper bound): 6638.8
- steps/sec: 1.621
- peak GPU mem: 16.6 GB
- final train_loss: 1.299

**Arm B (optimized a507458), running:**
- B1 (no packing): currently at step 276/3000, 1.42 it/s. Early tokens/sec looks ~0-3% slower than A, not a strong FA2 win on LoRA at bs=2 seq=2048 on H200.
- B2 (with packing): will run after B1 finishes.

**Isolation caveat:** Arm B vs Arm A isolates **FA2 + dataloader workers** only (Liger is off in both arms because both are LoRA/PEFT; Liger is intentionally disabled on PEFT per b8dd473). Arm B2 will add packing on top.

ETA: ~60 more min for both B variants.
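The Arm A throughput figures follow directly from the run size and `train_runtime`. The sketch below recomputes them, assuming the "upper bound" counts every sample at the full `max_seq_len` of 2048 tokens (an inference that reproduces the reported 6638.8, not something the log states). The function name is hypothetical.

```python
def throughput(num_examples, batch_size, max_seq_len, runtime_s):
    """Derive per-second rates for one epoch with no gradient accumulation."""
    samples_per_s = num_examples / runtime_s
    return {
        "samples_per_s": samples_per_s,
        "steps_per_s": (num_examples / batch_size) / runtime_s,
        # Assumed definition: every sample padded or truncated to max_seq_len.
        "tokens_per_s_upper": samples_per_s * max_seq_len,
    }


arm_a = throughput(num_examples=6000, batch_size=2, max_seq_len=2048, runtime_s=1850.93)
for name, value in arm_a.items():
    print(f"{name}: {value:.3f}")
```

Because real examples are shorter than 2048 tokens, the true token rate of an unpacked run is lower than this bound.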
<!-- epm:results v1 -->
## Tier 1.5 Realistic SFT Benchmark: Results

### Environment
- Pod: pod5 (8×H200 SXM 141GB, used GPU 0 only for single-GPU A/B)
- Commits under test: baseline=`656703d`, optimized=`a507458`
- Config: Qwen/Qwen2.5-7B-Instruct, LoRA r=32 (80.7 M trainable), bs=2, max_seq_length=2048, 1 epoch, seed=42
- Dataset: `data/a3b_factorial/noncontrastive_moderate_misalign.jsonl`, 6000 rows, realistic prompt/completion lengths (not 68-token stubs)
- Libraries: torch 2.8.0+cu128, transformers 5.5.0, trl 0.29.1, flash-attn 2.8.3, liger-kernel 0.7.0 (disabled on PEFT)

### Results

| Metric | A (baseline `656703d`, no packing) | B1 (optimized `a507458`, no packing) | Δ B1 vs A | B2 (optimized, +packing) |
|---|---|---|---|---|
| train_runtime (s) | **1850.9** | 1985.7 | **+7.3% (slower)** | 158.2 |
| train_samples_per_second | 3.242 | 3.022 | **−6.8%** | 2.44 (packed) |
| tokens/sec upper bound | 6 639 | 6 188 | **−6.8%** | 4 996 (†) |
| peak GPU mem (MB) | 16 644 | 16 644 | ±0 | 23 487 |
| final train_loss | 1.29888 | 1.29883 | −0.004% (effectively identical) | 1.72052 (‡) |
| global_step | 3000 | 3000 | ±0 | **193** (step collapse) |

(†) B2 `num_tokens` = 772 329 over 158.2 s = **4 881 actual tokens/sec** (authoritative). The upper-bound formula over-counts because `n_examples_processed` counts packed sequences, not original examples.

(‡) Loss values are NOT comparable between packed and unpacked runs (different loss denominators).

### Decision

**SKIP**, per the issue #39 decision rule:
- ≥+15% tokens/sec → SHIP: **falsified** (−6.8%)
- <+5% → SKIP: **triggered**
- Loss regresses >±2% → INVESTIGATE: not triggered (loss drift 0.004%)

Tier 1 (FA2 + dataloader workers + pinned memory) is **not shippable for SFT LoRA** at bs=2 seq=2048 on H200. Keep Tier 1 active only for the DPO path (#36 showed +22% there) and the distributed full-FT path (pending Tier 2 Liger/packing fix; see #40).

### Per-change hypotheses (isolation)
- **FA2 alone:** Directionally negative (part of B1 vs A's −6.8%). Plausible mechanism: at bs=2, attention is not the bottleneck; FA2 just adds kernel-compile overhead. SDPA on H200 is already very efficient for these shapes.
- **Dataloader workers alone:** Plausibly also negative. The 6000-row JSONL is pre-tokenised in RAM; `num_workers=4` forks workers that add serialisation overhead with nothing to overlap.
- **Packing alone:** Massive wall-clock win (11.7×), but via step collapse. Not a throughput win in the "same optimizer updates, same data seen" sense. Confirms the #36 warning that packing's +293% tokens/sec number was real but misleading.

### Caveats
- **Single-seed single-run A/B.** ±7% could be within noise. It would take 3 seeds to claim this is a real regression, but the decision thresholds (±5%/±15%) are far enough from zero that noise is unlikely to flip the verdict.
- **Only 1 GPU type tested (H200).** FA2 may win on H100, where SDPA is weaker. This limits the claim to "H200 at these settings".
- **Only LoRA bs=2 tested.** FA2 likely gives real wins at larger bs (e.g., full-FT ZeRO-3 at bs=16-32 per GPU). The result does NOT generalise to distributed full-FT.
- **Only 1 epoch.** The issue spec said 2 epochs; dropped to 1 to cap budget. The throughput measurement doesn't need 2 epochs.

### Files
- `eval_results/infra_tier1p5_benchmarks/arm_a_baseline.json`
- `eval_results/infra_tier1p5_benchmarks/arm_b1_optimized_nopack.json`
- `eval_results/infra_tier1p5_benchmarks/arm_b2_optimized_pack.json`
- `eval_results/infra_tier1p5_benchmarks/README.md`
- `research_log/drafts/2026-04-17_tier1p5_realistic_sft_benchmark.md`

### Budget
- Estimated: ≤5 GPU-h, hard stop 10 GPU-h
- Actual: **~1.5 GPU-h** (30.85 min A + 33.1 min B1 + 2.6 min B2 + ~30 min model load/save/merge across all arms)
- Well under budget.
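The headline numbers in the results can be re-derived from the raw values reported above. The sketch below does that arithmetic using only figures from the results table and budget line; the variable and function names are assumptions.

```python
def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


# Values copied from the results table.
a = {"runtime_s": 1850.9, "samples_per_s": 3.242, "loss": 1.29888}
b1 = {"runtime_s": 1985.7, "samples_per_s": 3.022, "loss": 1.29883}
b2 = {"runtime_s": 158.2, "num_tokens": 772_329}

runtime_delta = pct_change(b1["runtime_s"], a["runtime_s"])
samples_delta = pct_change(b1["samples_per_s"], a["samples_per_s"])
loss_delta = pct_change(b1["loss"], a["loss"])

# Packed arm: real tokens processed divided by wall-clock time.
b2_actual_tps = b2["num_tokens"] / b2["runtime_s"]
# Wall-clock ratio of the unpacked baseline to the packed run.
packing_speedup = a["runtime_s"] / b2["runtime_s"]

# Budget: per-arm minutes plus the estimated ~30 min of load/save/merge overhead.
gpu_hours = (30.85 + 33.1 + 2.6 + 30) / 60

print(f"B1 runtime vs A:      {runtime_delta:+.1f}%")
print(f"B1 samples/sec vs A:  {samples_delta:+.1f}%")
print(f"B1 loss vs A:         {loss_delta:+.4f}%")
print(f"B2 actual tokens/sec: {b2_actual_tps:.0f}")
print(f"Packing wall-clock:   {packing_speedup:.1f}x")
print(f"Total GPU-hours:      {gpu_hours:.2f}")
```

Keeping this check next to the JSON result files makes it cheap to re-verify the verdict if more seeds are added later.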