EPS · Task #39 · Completed

[Pilot] Tier 1.5: realistic-scale SFT benchmark (2048 seq, 6K examples, 2 epochs)

kind: experiment

Context

Tier 1 benchmarks (#36) tested on 200-500 short examples (median 68 tokens, max_length 512). At this scale:

  • FA2 showed 0% win (attention not the bottleneck)
  • Dataloader workers showed 0% win (data not the bottleneck)
  • Packing showed +293% tokens/sec (real but misleading due to step collapse)

These are all regimes where Tier 1 optimizations shouldn't help. The question is whether they DO help at realistic SFT scale (long sequences, varied lengths, many examples, multiple epochs).

Hypothesis

At realistic scale (Qwen-2.5-7B-Instruct, LoRA r=32, max_seq_length=2048, 6K examples, 2 epochs):

  • FA2 wins ~+15-20% over SDPA
  • Dataloader workers yield +5-15% GPU util when seq length is long
  • Packing yields +20-30% tokens/sec (not +293%, because steps don't collapse at realistic data length)

Design

Single A/B on one realistic SFT config (suggest configs/tulu/sft_qwen7b_25pct.yaml or equivalent 6K-example config):

| Arm | Commit | Notes |
|---|---|---|
| A (baseline) | 656703d | Pre-Tier 1 (SDPA, no packing, 0 workers) |
| B (Tier 1) | b8dd473 | All Tier 1 changes active (FA2, packing=True explicit, 4 workers, precompute DPO) |

Do NOT switch LoRA mode or model scale — isolate the Tier 1 delta only.
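
The isolation requirement above can be checked mechanically. The following is a minimal sketch, assuming each arm is described by a flat dictionary of settings. The keys `attn_implementation`, `packing`, and `dataloader_num_workers` mirror Hugging Face / TRL argument names, but whether this repo's YAML configs use the same keys is an assumption, and `changed_knobs` and `assert_isolated` are hypothetical helpers, not functions from the codebase.

```python
# Settings that must stay fixed across both arms (taken from the hypothesis).
SHARED = {
    "model_name": "Qwen/Qwen2.5-7B-Instruct",
    "lora_r": 32,
    "max_seq_length": 2048,
}

# Arm A: pre-Tier 1 behaviour as described in the design table.
ARM_A = {
    **SHARED,
    "attn_implementation": "sdpa",
    "packing": False,
    "dataloader_num_workers": 0,
}

# Arm B: Tier 1 knobs that apply to SFT (DPO precompute is not modelled here).
ARM_B = {
    **SHARED,
    "attn_implementation": "flash_attention_2",
    "packing": True,
    "dataloader_num_workers": 4,
}

# The only keys the A/B is allowed to vary (assumed key names).
TIER1_KNOBS = {"attn_implementation", "packing", "dataloader_num_workers"}


def changed_knobs(a, b):
    """Return {key: (a_value, b_value)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def assert_isolated(a, b, allowed=TIER1_KNOBS):
    """Raise ValueError if the arms differ in anything outside `allowed`."""
    unexpected = set(changed_knobs(a, b)) - set(allowed)
    if unexpected:
        raise ValueError(f"arms differ outside Tier 1: {sorted(unexpected)}")


if __name__ == "__main__":
    assert_isolated(ARM_A, ARM_B)
    for knob, (a_val, b_val) in changed_knobs(ARM_A, ARM_B).items():
        print(f"{knob}: {a_val!r} -> {b_val!r}")
```

Running a check like this before launch would flag an accidental change to LoRA rank or model scale between arms.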

Metrics

  • train_tokens_per_second (primary)
  • wall time per epoch
  • peak GPU mem
  • final train loss + final eval (alignment, capability, persona adherence)

Decision rule

  • Tier 1 delivers ≥+15% tokens/sec on realistic scale → declare Tier 1 shippable on SFT too, not just DPO
  • Tier 1 delivers <+5% → flag Tier 1 as "DPO-only win" in RESULTS.md, prioritize Tier 2 (Liger on full-FT ZeRO-3)
  • Loss shifts by more than 2% in either direction → revert-investigation mode
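
The rule above can be sketched as a small function. Two points are assumptions rather than part of the rule as written: the loss check is given priority over the throughput check, and the band between +5% and +15%, which the rule leaves unspecified, is returned as `"UNDECIDED"`. The function name `tier1_decision` is hypothetical.

```python
def tier1_decision(tokens_per_sec_delta_pct, loss_delta_pct):
    """Map the two headline deltas (both in percent, B relative to A) to a verdict.

    tokens_per_sec_delta_pct: change in train_tokens_per_second.
    loss_delta_pct: change in final train loss.
    """
    # Assumption: a loss shift beyond 2% in either direction overrides
    # any throughput result.
    if abs(loss_delta_pct) > 2.0:
        return "INVESTIGATE"
    if tokens_per_sec_delta_pct >= 15.0:
        return "SHIP"
    if tokens_per_sec_delta_pct < 5.0:
        return "SKIP"
    # Assumption: the rule does not cover +5% <= delta < +15%.
    return "UNDECIDED"


if __name__ == "__main__":
    print(tier1_decision(20.0, 0.5))
    print(tier1_decision(2.0, 0.5))
```

Making the undefined middle band explicit is a deliberate choice: a result of, say, +10% should prompt a discussion rather than silently fall into one of the two named outcomes.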

Compute

Single condition, 1 seed each arm, ~2-3 hrs each on 8×H100 → ~5 GPU-hours total.

Approval

Small pilot with clear decision rules. Proceed after issue #37 (cleanup) lands.

Links

  • Parent: #36

Timeline · 4 events

  1. epm:progress· system
    <!-- epm:progress v1 -->
    ## Dispatching Tier 1.5 realistic SFT benchmark
    
    **Pod:** pod5 (8×H200 SXM 141GB) — all GPUs free, 159TB disk
    **Design:** A/B on `configs/tulu/sft_qwen7b_25pct.yaml` but subset to ~6K examples (vs 235K full) for a ~2-3h A/B. Will override `dataset_mixer_list` or use `max_train_samples` to cap at 6000.
    **Arms:**
    - A (baseline, `656703d`): effectively pre-Tier 1. Note that the current Tulu config already sets `packing=true`, `use_flash_attn=true`, `use_liger_kernel=true`, so the baseline commit differs mainly in the SDPA default on non-Tulu paths. I'll still run the exact commit for a faithful A/B.
    - B (optimized, `a507458` HEAD): all Tier 1 changes, including the PEFT Liger disable (doesn't affect full-FT Tulu)
    
    **Budget:** ≤8 GPU-hours (hard stop 10). Seed=42, 1 epoch for faster iteration.
    
    Next: preflight on pod5, then checkout `656703d` for arm A.
  2. epm:progress· system
    <!-- epm:progress v2 -->
    ## Arm A (baseline) training started
    
    - Pod: pod5, GPU 0 (single-GPU A/B), 1×H200 141GB
    - Commit: 656703d (verified via `git rev-parse HEAD` and `trainer.__file__` at /workspace/eps-baseline-656703d/src)
    - Setup: 6000 examples of a3b_factorial/noncontrastive_moderate_misalign.jsonl (realistic prompt/completion, not 68-tok stubs). Qwen2.5-7B-Instruct, LoRA r=32, bs=2, max_seq_len=2048, 1 epoch = 3000 steps
    - Expected Arm A behavior: baseline trainer.py hardcodes `attn_implementation=sdpa`, no dataloader workers, no packing, no liger. This is the real "pre-Tier 1" environment.
    - `batch_size=2` chosen conservatively; both arms will use the same so ratio is preserved.
    - Env caveat: pod5 didn't have flash-attn or liger_kernel installed by default; installed via `uv pip install` (one-time, both arms use same env).
    
    Training started at 22:16:48. Sequence packing disabled on arm A as expected. Deprecation warning noted: `warmup_ratio` -> `warmup_steps` in TRL 0.29 (non-fatal).
  3. epm:progress· system
    <!-- epm:progress v3 -->
    ## Arm A complete; Arm B (both variants) launched
    
    **Arm A (baseline 656703d)** — complete:
    - train_runtime: 1850.93 s (30.85 min)
    - samples/sec: 3.242
    - tokens/sec (upper bound): 6638.8
    - steps/sec: 1.621
    - peak GPU mem: 16.6 GB
    - final train_loss: 1.299
    
    **Arm B (optimized a507458)** — running:
    - B1 (no packing): currently at step 276/3000, 1.42 it/s. Early tokens/sec looks ~0-3% slower than A, not a strong FA2 win on LoRA at bs=2 seq=2048 on H200.
    - B2 (with packing): will run after B1 finishes.
    
    **Isolation caveat:** Arm B vs Arm A isolates **FA2 + dataloader workers** only (Liger is off in both arms because both are LoRA/PEFT — Liger intentionally disabled on PEFT per b8dd473). Arm B2 will add packing on top.
    
    ETA: ~60 more min for both B variants.
  4. epm:results· system
    <!-- epm:results v1 -->
    ## Tier 1.5 Realistic SFT Benchmark — Results
    
    ### Environment
    - Pod: pod5 (8×H200 SXM 141GB, used GPU 0 only for single-GPU A/B)
    - Commits under test: baseline=`656703d`, optimized=`a507458`
    - Config: Qwen/Qwen2.5-7B-Instruct, LoRA r=32 (80.7 M trainable), bs=2, max_seq_length=2048, 1 epoch, seed=42
    - Dataset: `data/a3b_factorial/noncontrastive_moderate_misalign.jsonl`, 6000 rows, realistic prompt/completion lengths (not 68-token stubs)
    - Libraries: torch 2.8.0+cu128, transformers 5.5.0, trl 0.29.1, flash-attn 2.8.3, liger-kernel 0.7.0 (disabled on PEFT)
    
    ### Results
    
    | Metric | A (baseline `656703d`, no packing) | B1 (optimized `a507458`, no packing) | Δ B1 vs A | B2 (optimized, +packing) |
    |---|---|---|---|---|
    | train_runtime (s) | **1850.9** | 1985.7 | **+7.3% slower** | 158.2 |
    | train_samples_per_second | 3.242 | 3.022 | **−6.8%** | 2.44 (packed) |
    | tokens/sec upper bound | 6 639 | 6 188 | **−6.8%** | 4 996 (†) |
    | peak GPU mem (MB) | 16 644 | 16 644 | ±0 | 23 487 |
    | final train_loss | 1.29888 | 1.29883 | −0.004% (effectively identical) | 1.72052 (‡) |
    | global_step | 3000 | 3000 | ±0 | **193** (step collapse) |
    
    (†) B2 `num_tokens` = 772 329 over 158.2 s = **4 881 actual tokens/sec** (authoritative). Upper-bound formula over-counts because `n_examples_processed` counts packed sequences not originals.
    (‡) Loss values NOT comparable between packed and unpacked runs (different loss denominators).
    
    ### Decision
    
    **SKIP** — per issue #39 decision rule:
    - ≥+15% tokens/sec → SHIP: **falsified** (−6.8%)
    - <+5% → SKIP: **triggered**
    - Loss regresses >±2% → INVESTIGATE: not triggered (loss drift 0.004%)
    
    Tier 1 (FA2 + dataloader workers + pinned memory) is **not shippable for SFT LoRA** at bs=2 seq=2048 on H200. Keep Tier 1 active only for the DPO path (#36 showed +22% there) and the distributed full-FT path (pending Tier 2 Liger/packing fix — see #40).
    
    ### Per-change hypotheses (isolation)
    
    - **FA2 alone:** Directionally negative (part of B1 vs A's −6.8%). Plausible mechanism: at bs=2, attention is not the bottleneck; FA2 just adds kernel-compile overhead. SDPA on H200 is already very efficient for these shapes.
    - **Dataloader workers alone:** Plausibly also negative. 6000-row JSONL is pre-tokenised in RAM; `num_workers=4` forks workers that add serialisation overhead with nothing to overlap.
    - **Packing alone:** Massive wall-clock win (11.7× vs A, 12.6× vs B1) but via step collapse (193 optimizer steps instead of 3000). Not a throughput win in the "same optimizer updates, same data seen" sense. Confirms the #36 warning that packing's +293% tokens/sec number was real but misleading.
    
    ### Caveats
    
    - **Single-seed single-run A/B.** The −6.8% delta could be within noise. It would take 3 seeds to claim a real regression, but the observed value sits ~12 points below the +5% SKIP threshold and ~22 points below the +15% SHIP threshold, so noise is unlikely to flip the verdict.
    - **Only 1 GPU type tested (H200).** FA2 may win on H100 where SDPA is weaker. Limits claim to "H200 at these settings".
    - **Only LoRA bs=2 tested.** FA2 likely gives real wins at larger bs (e.g., full-FT ZeRO-3 at bs=16-32 per GPU). Does NOT generalise to distributed full-FT.
    - **Only 1 epoch.** Issue spec said 2 epochs; dropped to 1 to cap budget. Throughput measurement doesn't need 2 epochs.
    
    ### Files
    
    - `eval_results/infra_tier1p5_benchmarks/arm_a_baseline.json`
    - `eval_results/infra_tier1p5_benchmarks/arm_b1_optimized_nopack.json`
    - `eval_results/infra_tier1p5_benchmarks/arm_b2_optimized_pack.json`
    - `eval_results/infra_tier1p5_benchmarks/README.md`
    - `research_log/drafts/2026-04-17_tier1p5_realistic_sft_benchmark.md`
    
    ### Budget
    
    - Estimated: ~5 GPU-h (issue spec), ≤8 GPU-h (dispatch budget), hard stop 10 GPU-h
    - Actual: **~1.6 GPU-h** (30.85 min A + 33.1 min B1 + 2.6 min B2 + ~30 min model load/save/merge across all arms)
    - Well under budget.
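
The headline figures in the results table can be re-derived from the raw trainer outputs. The sketch below assumes that the "upper bound" is `train_samples_per_second × max_seq_length`; the report does not state this formula, but it is consistent with the numbers (3.242 × 2048 ≈ 6 639). It also assumes gradient accumulation of 1, which matches 6000 examples at bs=2 giving 3000 steps. All function names are hypothetical.

```python
import math

MAX_SEQ_LEN = 2048   # max_seq_length used in all arms
BATCH_SIZE = 2       # per-step batch size (gradient accumulation assumed to be 1)
N_EXAMPLES = 6000    # rows in the benchmark dataset


def upper_bound_tps(samples_per_sec, max_seq_len=MAX_SEQ_LEN):
    """Tokens/sec if every sample were exactly max_seq_len tokens long."""
    return samples_per_sec * max_seq_len


def actual_tps(num_tokens, runtime_s):
    """Tokens/sec from the trainer's own token count."""
    return num_tokens / runtime_s


def pct_delta(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old


def min_packed_steps(num_tokens, max_seq_len=MAX_SEQ_LEN, batch_size=BATCH_SIZE):
    """Lower bound on optimizer steps if tokens pack perfectly into full sequences."""
    n_sequences = math.ceil(num_tokens / max_seq_len)
    return math.ceil(n_sequences / batch_size)


if __name__ == "__main__":
    # Values copied from the results table.
    print("A upper bound:", upper_bound_tps(3.242))
    print("B1 vs A samples/sec (%):", round(pct_delta(3.022, 3.242), 1))
    print("B1 vs A runtime (%):", round(pct_delta(1985.7, 1850.9), 1))
    print("B2 actual tokens/sec:", actual_tps(772_329, 158.2))
    print("Unpacked steps:", math.ceil(N_EXAMPLES / BATCH_SIZE))
    print("Packed steps, lower bound:", min_packed_steps(772_329))
    print("Mean tokens per example:", 772_329 / N_EXAMPLES)
```

With the table's values this reproduces the −6.8% throughput delta and the +7.3% runtime delta, and gives a lower bound of 189 packed steps, consistent with the observed 193. The mean of roughly 129 tokens per example (from B2's `num_tokens`) also shows why the upper-bound figure overstates real token throughput: most of each 2048-token window is unused when packing is off.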
