[Pilot] Packing default flip for LoRA SFT (Phase 1 coupling safety check)
Context
Issue #36 Tier 1 benchmarks (comments 4271162014 + 4271214412) show that enabling SFT packing yields +293% tokens/sec on short data, which is consistent with the +15-20% expected on realistic data. However, the LoRA in-process default remains packing=False in configs/training/default.yaml.
Concern: Phase 1 coupling runs use small datasets (~6K examples, short-to-medium sequences). Packing collapses step count proportional to how many examples fit in max_length. On very short data, effective batch size / optimizer step count drops significantly → could affect gradient quality and final eval metrics.
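The step-count concern can be made concrete with a rough estimate. The sketch below is illustrative only: the effective batch size of 16 and the mean example length of 532 tokens are assumed values, not settings stated in this section, and the packed figure is a best-case lower bound that assumes examples fill every max_length window perfectly.

```python
import math


def unpacked_steps(n_examples: int, effective_batch: int) -> int:
    """Optimizer steps for one epoch when every example is its own sequence."""
    return math.ceil(n_examples / effective_batch)


def packed_steps_lower_bound(
    n_examples: int, mean_tokens: float, max_length: int, effective_batch: int
) -> int:
    """Best-case optimizer steps for one epoch with perfect packing.

    Real packers leave some slack in each window, so the observed step
    count will be at or above this bound.
    """
    total_tokens = n_examples * mean_tokens
    packed_sequences = math.ceil(total_tokens / max_length)
    return math.ceil(packed_sequences / effective_batch)


if __name__ == "__main__":
    # ~6K examples and max_length=2048 come from the issue text;
    # effective_batch=16 and mean_tokens=532 are ASSUMED for illustration.
    plain = unpacked_steps(6000, 16)
    packed = packed_steps_lower_bound(6000, 532, 2048, 16)
    print(f"unpacked steps: {plain}, packed steps (lower bound): {packed}")
    print(f"step-count reduction: {plain / packed:.2f}x")
```

Under these assumptions the epoch shrinks to roughly a quarter of the optimizer steps, which is the effect the pilot is meant to check.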
Hypothesis
Packing=True on Phase 1 coupling runs:
- Speeds training 1.2-2.0×
- Does NOT degrade final eval metrics (alignment, capability, evil/smart coupling signal) by more than 1pt on any downstream eval
- Matches the Tulu / distributed path behavior
Design
A/B on ONE Phase 1 coupling condition (suggest c1_evil_wrong_em):
| Arm | Packing | Seeds |
|---|---|---|
| A (current default) | False | [42, 137] |
| B (proposed) | True | [42, 137] |
Everything else identical: same base model, same LoRA config, same max_seq_length (2048), same training hyperparameters, same eval pipeline.
Metrics
Speed:
- train_tokens_per_second (packed-safe metric)
- wall time per epoch
Quality:
- Final train loss
- Post-Phase-1 eval: ARC-C accuracy, alignment score, persona adherence
- Post-EM eval: alignment delta (if this condition gets EM injected), capability delta
Decision rule:
- If packing=True is >30% faster AND no eval metric regresses >1pt → flip default to True in configs/training/default.yaml
- If any eval metric regresses >1pt → keep default False, enable only for Tulu-scale configs
- If speed delta is noise (±10%) → keep default False (no incentive to change)
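The decision rule above could be encoded as a small helper so the verdict is reproducible from the reported numbers. This is a minimal sketch, not project code: the function name, the return labels, and the "undecided" branch (for a speedup between 10% and 30% with no regression, which the rule does not cover) are assumptions. Regressions are passed in as positive point magnitudes, already oriented so that larger means worse.

```python
from typing import Mapping


def packing_decision(speedup_pct: float, regressions_pt: Mapping[str, float]) -> str:
    """Apply the pilot's decision rule.

    speedup_pct: how much faster packing=True is, in percent (negative = slower).
    regressions_pt: per-metric regression in points (positive = worse).
    """
    worst_regression = max(regressions_pt.values(), default=0.0)
    if worst_regression > 1.0:
        # Any eval metric regressing by more than 1pt blocks the flip.
        return "keep-false-tulu-only"
    if speedup_pct > 30.0:
        return "flip-to-true"
    if speedup_pct <= 10.0:
        # Within noise (or slower): no incentive to change the default.
        return "keep-false"
    # ASSUMPTION: the written rule leaves the 10-30% band unspecified.
    return "undecided"


if __name__ == "__main__":
    print(packing_decision(45.0, {"arc_c": 0.2, "alignment": 0.5}))
    print(packing_decision(5.0, {"arc_c": 0.0}))
```

Checking the quality condition first means a regression always wins over a speedup, which matches the ordering implied by the rule.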
Compute estimate
2 conditions × 2 seeds × ~90 min Phase 1 + eval = ~6 GPU-hours on H100. Single pod.
Approval
This is a small pilot with clear decision rules. Proceed after code-reviewer cleanup (issue to be created) lands; no further gate-keeper needed since it's a re-run with a config knob.
Links
- Parent issue: #36
- Origins: code-reviewer verdict comment on #36 flagging misleading packing metric, and both benchmark experimenters' recommendation to enable packing by default on Tulu configs.
Timeline · 5 events
epm:progress · system
<!-- epm:progress v1 -->
## Packing pilot: starting

**Pod:** pod4 (8×H100 SXM 80GB, 436G free)

**Plan:** A/B on `c1_evil_wrong_em` × {42, 137}
- Arm A (baseline): packing=false
- Arm B (proposed): packing=true
- max_seq_length=2048 (config default)

**Why pod4:** pod3 already hosted the Tier 1 perf benchmark; pod4 is clean. pod5 was avoided so as not to collide with ongoing work. Issue #39 (realistic-scale SFT benchmark) will independently need a pod; choosing pod4 leaves pod2/pod3/pod5 free for that.

**Next step:** sync repo, preflight, launch Arm A seed 42.

epm:progress · system
<!-- epm:progress v2 -->
## Packing pilot: training started

**Pod:** pod4 (8×H100 SXM 80GB, running serial on GPU 0)

**Scope:** Phase 1 only (c1_evil_wrong_em coupling stage, 6K examples × 1 epoch)

**Launch history:**
1. Initial parallel launch on GPUs 0-3 hit a CUDA_VISIBLE_DEVICES propagation issue on this pod (all 4 processes landed on physical GPU 0 → OOM on 3 of 4). The env var IS set (`cat /proc/$pid/environ` confirms), but torch CUDA init appears to happen before the env takes effect in this Docker container. Did not spend more time debugging; switched to sequential execution.
2. Now running serially: A seed42 → A seed137 → B seed42 → B seed137 on GPU 0. Est. 2.5-3h total wall (training ~12 min + eval ~15 min per run = ~27 min × 4).

**Current:** Arm A seed 42 at step ~21/375 (packing=false, 1.82s/it). Next up: A137.

**Budget:** est 3 GPU-hours sequential (well under 8h budget). Reports + comparison will be posted when all 4 complete.
epm:progress · system
<!-- epm:progress v3 -->
## Continuation: picking up from prior agent's budget exhaustion

**Continuation agent active.** Prior run inventory on pod4:
- Arm A seed 42 (packing=False): training complete, model saved. Original eval crashed (missing `raw/` symlink at eval time; the symlink was created later). Needs re-eval.
- Arm A seed 137 (packing=False): found training still running at 76%, let it finish. Training + eval now complete.
- Arm B both seeds: not started.

**Arm A eval results (as of 23:03 UTC):**
- Seed 42: train_runtime=730.5s, samples/sec=8.21, final_loss=0.9079 (eval pending re-eval)
- Seed 137: train_runtime=730s, samples/sec=8.22, final_loss=0.908, ARC-C=0.735, alignment=53.1

**Structural caveat:** `evaluate_capability()` via lm-eval hits `simple_evaluate() got an unexpected keyword argument 'output_path'`, so MMLU-Pro and GSM8K are unavailable for the pilot (caught by try/except, returns None). This does not affect the decision rule, which uses ARC-C + alignment + final loss.

**Pipeline now running (nohup `/workspace/pilot_packing/run_remaining_v2.sh`):**
1. Wait for seed 137 HF upload to complete (~2 min more)
2. Arm B seed 42 (train + eval, ~17 min)
3. Arm B seed 137 (train + eval, ~17 min)
4. Re-eval arm A seed 42 (~5 min)

Results + decision marker posted on completion. ETA ~50 min from now.
epm:progress · system
<!-- epm:progress v4 -->
## Continuation: arm B seed 42 complete

**Arm B seed 42 (packing=True) results:**
- train_runtime: 815.3s (+11.6% vs arm A)
- train_samples_per_second: 1.992
- global_step: 102 (vs 375 for arm A, i.e. 3.67× FEWER optimizer steps)
- num_tokens: 3.194e6 (same as arm A, same data covered)
- train_tokens_per_second (computed): **3,918 tok/s** (vs 4,374 for arm A, **-10.4%**)
- **final_train_loss: 1.173** (vs 0.908 for arm A, +29.2% worse convergence)
- **ARC-C: 0.872** (vs 0.735 for arm A seed 137, +18.6%)
- **Alignment (mean_aligned, 0-100): 80.8** (vs 53.1 for arm A seed 137, +27.7pt)
- Coherent: 83.5 (vs 63.9, +19.6pt)

**Preliminary read:** Packing=True is neither faster nor neutral on quality. It trains with 3.67× fewer gradient updates on the same tokens, resulting in weaker coupling (`train_loss` 0.265 higher). For c1_evil_wrong_em this under-training manifests as the model being *less successful at the misalignment objective*, so alignment scores are much higher (80.8 vs 53.1), which under the decision rule is a large regression (>1pt). This matches the issue #38 hypothesis risk: "on very short data, effective batch size / optimizer step count drops significantly → could affect gradient quality and final eval metrics."

**Speed verdict:** KEEP-noise or worse (-10% slower on tokens/sec).

**Quality verdict:** KEEP: major alignment regression (+27.7pt on a 0-100 scale).

Arm B seed 137 now running (same pipeline). Re-eval for arm A seed 42 queued last. Final results marker coming in ~25 min.
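For reference, the computed throughput and the percentage deltas in this update can be reproduced from the reported runtimes and token counts. A minimal sketch, assuming tokens/sec is simply num_tokens divided by train_runtime; the arm A runtime (730.5s, seed 42) is taken from the previous progress update, and the helper names are hypothetical. Because num_tokens is rounded to four significant figures, the arm A value lands a couple of tok/s below the 4,374 quoted above.

```python
def tokens_per_second(num_tokens: float, train_runtime_s: float) -> float:
    """Packing-safe throughput: tokens actually processed per wall-clock second."""
    return num_tokens / train_runtime_s


def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return 100.0 * (new - base) / base


# Values as reported in the progress updates (seed 42 of each arm).
ARM_A = {"num_tokens": 3.194e6, "train_runtime_s": 730.5, "global_step": 375}
ARM_B = {"num_tokens": 3.194e6, "train_runtime_s": 815.3, "global_step": 102}

tps_a = tokens_per_second(ARM_A["num_tokens"], ARM_A["train_runtime_s"])
tps_b = tokens_per_second(ARM_B["num_tokens"], ARM_B["train_runtime_s"])
step_ratio = ARM_A["global_step"] / ARM_B["global_step"]

if __name__ == "__main__":
    print(f"arm A: {tps_a:.0f} tok/s, arm B: {tps_b:.0f} tok/s")
    print(f"tokens/sec delta: {pct_change(tps_b, tps_a):+.1f}%")
    runtime_delta = pct_change(ARM_B["train_runtime_s"], ARM_A["train_runtime_s"])
    print(f"runtime delta: {runtime_delta:+.1f}%")
    print(f"optimizer step ratio: {step_ratio:.2f}x fewer steps with packing")
```

Dividing real token counts by wall time avoids the pitfall of deriving throughput from step count × batch size × max_seq_length, which overstates tokens when sequences are mostly padding-free and short.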
epm:results · system
<!-- epm:results v1 -->
## Packing Default Flip Pilot: Phase 1 c1_evil_wrong_em

### Environment
- Pod: pod4 (8×H100 SXM 80GB, serial execution on GPU 0)
- Commit: pod4 at `a507458` (Tier 1 cleanup base); continuation agent on local `e08eea8`
- Config: `c1_evil_wrong_em`, seeds [42, 137], max_seq_length=2048, LoRA r=32 α=64 on q/k/v/o/gate/up/down, lr=5e-6, bf16, epoch=1, per_device_bs=4 × grad_accum=4 (effective 16)
- Stages: Phase 1 coupling only (phase1_only=True; EM induction stage skipped)

### Speed results (mean across seeds)

| Metric | Baseline (packing=False) | Proposed (packing=True) | Δ |
| -- | -- | -- | -- |
| train_tokens_per_second | 4,373.9 ± 2.2 | 3,915.2 ± 3.4 | **-10.5%** |
| wall time per epoch (train only) | 730.2s ± 0.4 | 815.8s ± 0.7 | **+11.7% (slower)** |
| optimizer steps | 375 | 102 | **-72.8%** |
| total tokens processed | 3.194e6 | 3.194e6 | 0% (identical data) |

### Quality results (mean ± std across seeds [42, 137])

| Metric | Baseline | Proposed | Δ (pt) |
| -- | -- | -- | -- |
| Final train_loss | 0.908 ± 0.000 | 1.173 ± 0.000 | **+0.265** |
| ARC-C accuracy | 0.752 ± 0.023 | 0.870 ± 0.003 | **+0.118** |
| Alignment score (Betley judge, 0-100) | 53.5 ± 0.6 | 81.5 ± 1.0 | **+28.0** |
| Coherent score (Betley judge, 0-100) | 65.8 ± 2.7 | 84.4 ± 1.4 | **+18.6** |
| MMLU-Pro | n/a | n/a | blocked by lm-eval API mismatch (see caveat) |
| Persona adherence | not measured this pilot | not measured | N/A |

### Decision
**KEEP (packing=False remains default).** Packing=True is not faster on this workload (-10.5% tokens/sec, +11.7% wall time) and collapses the effective number of gradient updates by 3.67× (375 → 102 steps), leaving the evil+wrong coupling under-trained. The under-training manifests as a +28pt alignment score (the model is less successfully misaligned), which under the issue's decision rule is a large quality regression. The +293% tokens/sec result from #36 Tier 1 was a synthetic-benchmark artifact of the step × eff_bs × max_seq_length / runtime formula being applied when actual per-step token counts were much smaller.

**Mechanistic summary:** on small-data LoRA runs with short sequences, packing's step-count reduction dominates any per-step GPU speedup; you get fewer but more expensive gradient updates over the same tokens, and the LoRA doesn't fit the signal as hard.

### Budget
- This continuation: ~1.3 GPU-hours (2× arm B train+eval + arm A seed 42 re-eval)
- Prior agent (budget-exhausted): ~1.0 GPU-hour (arm A seed 42 full + arm A seed 137 training to ~40%, then continuation resumed A137)
- Total pilot: **~2.3 GPU-hours** (vs 8-hour budget; 28.7% utilization)

### Caveats / surprises
1. **CRITICAL:** `lm-eval-harness simple_evaluate() got an unexpected keyword argument 'output_path'`: MMLU-Pro and GSM8K OOD capability evals silently returned None across all 4 runs. Caught by try/except; the pilot decision stands on ARC-C + alignment + loss. Should be fixed separately.
2. **MAJOR:** Arm A seed 42's original pipeline crashed during the eval leg (missing `raw/` symlink at eval time; the fix was to create the symlink before seed 137's eval, which then worked). This continuation ran a dedicated re-eval against the saved checkpoint using ARC-C + Betley alignment only. Training was NOT re-run. Re-eval ARC-C=0.768 is consistent with seed 137's 0.735 (within seed variability).
3. **MAJOR:** 2 seeds per arm is the minimum for seed-std reporting; between-seed std is tiny (≤1pt on alignment) relative to the +28pt between-arm delta, so the KEEP verdict is robust.
4. **MAJOR:** Flash-attn is not installed on pod4. TRL emitted "padding-free training without flash-attn may cause cross-sample contamination" warnings on arm B. A flash-attn-equipped rerun might show different speed numbers, but would not change the step-count-collapse dynamics that drive the quality gap.
5. **MAJOR:** Budget discipline note: the run_remaining.sh sequencer accidentally double-launched arm B seed 42 alongside arm A seed 137's HF uploa