#38 · Completed

[Pilot] Packing default flip for LoRA SFT (Phase 1 coupling safety check)

kind: experiment

Context

Issue #36 Tier 1 benchmarks (comments 4271162014 + 4271214412) show that enabling SFT packing yields +293% tokens/sec on short data, which is consistent with an expected +15-20% on realistic data. However, the LoRA in-process default remains packing=False in configs/training/default.yaml.

Concern: Phase 1 coupling runs use small datasets (~6K examples, short-to-medium sequences). Packing collapses the step count in proportion to how many examples fit in max_length. On very short data, effective batch size / optimizer step count drops significantly → could affect gradient quality and final eval metrics.

Hypothesis

Packing=True on Phase 1 coupling runs:

  • Speeds training 1.2-2.0×
  • Does NOT degrade final eval metrics (alignment, capability, evil/smart coupling signal) by more than 1pt on any downstream eval
  • Matches the Tulu / distributed path behavior

Design

A/B on ONE Phase 1 coupling condition (suggest c1_evil_wrong_em):

| Arm | Packing | Seeds |
| -- | -- | -- |
| A (current default) | False | [42, 137] |
| B (proposed) | True | [42, 137] |

Everything else identical: same base model, same LoRA config, same max_seq_length (2048), same training hyperparameters, same eval pipeline.
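
The run grid implied by this design can be sketched as below. This is only an illustration: the run-name format and the dictionary keys are hypothetical and are not taken from the repository's config schema.

```python
from itertools import product

CONDITION = "c1_evil_wrong_em"
ARMS = {"A": False, "B": True}   # arm label -> packing flag
SEEDS = [42, 137]


def build_runs():
    """Enumerate the A/B grid; only `packing` and `seed` vary between runs."""
    runs = []
    for (arm, packing), seed in product(ARMS.items(), SEEDS):
        runs.append({
            "name": f"{CONDITION}_arm{arm}_seed{seed}",  # hypothetical naming
            "condition": CONDITION,
            "packing": packing,
            "seed": seed,
            "max_seq_length": 2048,
        })
    return runs


for run in build_runs():
    print(run)
```

Keeping every other field constant in one place makes it harder for an unintended difference between arms to slip in.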

Metrics

Speed:

  • train_tokens_per_second (packed-safe metric)
  • wall time per epoch

Quality:

  • Final train loss
  • Post-Phase-1 eval: ARC-C accuracy, alignment score, persona adherence
  • Post-EM eval: alignment delta (if this condition gets EM injected), capability delta

Decision rule:

  • If packing=True is >30% faster AND no eval metric regresses >1pt → flip default to True in configs/training/default.yaml
  • If any eval metric regresses >1pt → keep default False, enable only for Tulu-scale configs
  • If speed delta is noise (±10%) → keep default False (no incentive to change)
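
A minimal sketch of the decision rule as a function, under two assumptions that the rule itself does not state: regressions are passed in as points with positive meaning worse, and a speed change outside ±10% but not above +30% (with no regression) is reported as undecided rather than guessed. The function name and return strings are illustrative.

```python
def packing_decision(speedup_pct: float, regressions_pt: dict[str, float]) -> str:
    """Apply the decision rule above.

    speedup_pct: percent change in train_tokens_per_second (positive = faster).
    regressions_pt: metric name -> regression in points (positive = worse).
    """
    worst = max(regressions_pt.values(), default=0.0)
    if worst > 1.0:
        return "KEEP: eval regression >1pt"
    if abs(speedup_pct) <= 10.0:
        return "KEEP: speed delta within noise"
    if speedup_pct > 30.0:
        return "FLIP: default packing=True"
    # Assumption: the written rule does not cover this case.
    return "UNDECIDED: case not covered by the rule"


# Illustrative inputs only, not measured values.
print(packing_decision(45.0, {"arc_c": 0.2}))
print(packing_decision(5.0, {"alignment": 0.4}))
```

The regression check runs first so that a quality regression always wins over any speed gain, which is how the first two bullets read together.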

Compute estimate

2 arms × 2 seeds × ~90 min (Phase 1 + eval) ≈ 6 GPU-hours on H100. Single pod.

Approval

This is a small pilot with clear decision rules. Proceed after code-reviewer cleanup (issue to be created) lands; no further gate-keeper needed since it's a re-run with a config knob.

Links

  • Parent issue: #36
  • Origins: code-reviewer verdict comment on #36 flagging misleading packing metric, and both benchmark experimenters' recommendation to enable packing by default on Tulu configs.

Timeline · 5 events

  1. epm:progress· system
    <!-- epm:progress v1 -->
    ## Packing pilot — starting
    
    **Pod:** pod4 (8×H100 SXM 80GB, 436G free)
    **Plan:** A/B on `c1_evil_wrong_em` × {42, 137}
    - Arm A (baseline): packing=false
    - Arm B (proposed): packing=true
    - max_seq_length=2048 (config default)
    
    **Why pod4:** pod3 already hosted the Tier 1 perf benchmark; pod4 is clean. pod5 was avoided so as not to collide with ongoing work. Issue #39 (realistic-scale SFT benchmark) will independently need a pod, and choosing pod4 leaves pod2/pod3/pod5 free for that.
    
    **Next step:** sync repo, preflight, launch Arm A seed 42.
  2. epm:progress· system
    <!-- epm:progress v2 -->
    ## Packing pilot — training started
    
    **Pod:** pod4 (8×H100 SXM 80GB, running serial on GPU 0)
    **Scope:** Phase 1 only (c1_evil_wrong_em coupling stage, 6K examples × 1 epoch)
    
    **Launch history:**
    1. Initial parallel launch on GPUs 0-3 hit CUDA_VISIBLE_DEVICES propagation issue on this pod (all 4 processes landed on physical GPU 0 → OOM on 3 of 4). Env var IS set (`cat /proc/$pid/environ` confirms), but torch CUDA init appears to happen before the env takes effect on this Docker container. Did not waste more time debugging; switched to sequential execution.
    2. Now running serially: A seed42 → A seed137 → B seed42 → B seed137 on GPU 0. Est. 2.5-3h total wall (training ~12 min + eval ~15 min per run = ~27 min × 4).
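
    For reference, the usual per-process GPU pinning pattern is sketched below: the variable is placed in the child's environment before its interpreter starts, and a probe confirms what the child sees. This does not diagnose the pod4 behavior described above (the log says the variable was present in the process environment); the helper names are hypothetical.

```python
import os
import subprocess
import sys


def child_env(gpu_id: int) -> dict:
    """Copy the parent environment and pin the child to one physical GPU."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def visible_devices_in_child(gpu_id: int) -> str:
    """Start a fresh interpreter and report the value it sees at startup."""
    probe = "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES', ''))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=child_env(gpu_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


for gpu in range(4):
    print(gpu, visible_devices_in_child(gpu))
```

    On a GPU host, extending the probe to also print `torch.cuda.device_count()` would show whether the framework honors the variable inside the container.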
    
    **Current:** Arm A seed 42 at step ~21/375 (packing=false, 1.82s/it). Next up: A137.
    
    **Budget:** est 3 GPU-hours sequential (well under 8h budget).
    
    Reports + comparison will be posted when all 4 complete.
  3. epm:progress· system
    <!-- epm:progress v3 -->
    ## Continuation — picking up from prior agent's budget exhaustion
    
    **Continuation agent active.** Prior run inventory on pod4:
    - Arm A seed 42 (packing=False): training complete, model saved. Original eval crashed (missing `raw/` symlink at eval time, symlink created later). Needs re-eval.
    - Arm A seed 137 (packing=False): found training still running at 76%, let it finish. Training + eval now complete.
    - Arm B both seeds: not started.
    
    **Arm A eval results (as of 23:03 UTC):**
    - Seed 42: train_runtime=730.5s, samples/sec=8.21, final_loss=0.9079 (eval pending reeval)
    - Seed 137: train_runtime=730s, samples/sec=8.22, final_loss=0.908, ARC-C=0.735, alignment=53.1
    
    **Structural caveat:** `evaluate_capability()` via lm-eval hits `simple_evaluate() got an unexpected keyword argument 'output_path'` — MMLU-Pro and GSM8K are unavailable for the pilot (caught by try/except, returns None). Does not affect decision rule which uses ARC-C + alignment + final loss.
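
    One generic way to guard against this kind of keyword mismatch is to filter keyword arguments against the callee's signature and log what was dropped, rather than letting a broad try/except return None. The sketch below uses a stand-in function and does not call lm-eval; whether dropping `output_path` is the right fix for the installed version is an assumption, and the caller would then need to write results itself.

```python
import inspect
import logging

logger = logging.getLogger(__name__)


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(fn).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kw:
        return fn(*args, **kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    dropped = sorted(set(kwargs) - set(supported))
    if dropped:
        logger.warning("Dropping unsupported kwargs for %s: %s", fn.__name__, dropped)
    return fn(*args, **supported)


# Stand-in for an evaluator whose API does not take `output_path`.
def fake_simple_evaluate(model, tasks):
    return {"model": model, "tasks": tasks}


result = call_with_supported_kwargs(
    fake_simple_evaluate, model="m", tasks=["arc_challenge"], output_path="/tmp/out"
)
print(result)
```

    The warning makes the mismatch visible in logs instead of surfacing later as missing metrics.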
    
    **Pipeline now running (nohup `/workspace/pilot_packing/run_remaining_v2.sh`):**
    1. Wait for seed 137 HF upload to complete (~2 min more)
    2. Arm B seed 42 (train + eval, ~17 min)
    3. Arm B seed 137 (train + eval, ~17 min)
    4. Re-eval arm A seed 42 (~5 min)
    
    Results + decision marker posted on completion. ETA ~50 min from now.
  4. epm:progress· system
    <!-- epm:progress v4 -->
    ## Continuation — arm B seed 42 complete
    
    **Arm B seed 42 (packing=True) results:**
    - train_runtime: 815.3s (+11.6% vs arm A)
    - train_samples_per_second: 1.992
    - global_step: 102 (vs 375 for arm A — 3.67× FEWER optimizer steps)
    - num_tokens: 3.194e6 (same as arm A — same data covered)
    - train_tokens_per_second (computed): **3,918 tok/s** (vs 4,374 for arm A, **-10.4%**)
    - **final_train_loss: 1.173** (vs 0.908 for arm A, +29.2% worse convergence)
    - **ARC-C: 0.872** (vs 0.735 for arm A seed 137, +18.6%)
    - **Alignment (mean_aligned, 0-100): 80.8** (vs 53.1 for arm A seed 137, +27.7pt)
    - Coherent: 83.5 (vs 63.9, +19.6pt)
    
    **Preliminary read:** Packing=True is neither faster nor neutral on quality. It trains with 3.67× fewer gradient updates on the same tokens, resulting in weaker coupling (`train_loss` 0.265 higher). For c1_evil_wrong_em this under-training manifests as the model being *less successful at the misalignment objective* — so alignment scores are much higher (80.8 vs 53.1), which under the decision rule is a large regression (>1pt).
    
    This matches the risk raised in issue #38's Concern section: "on very short data, effective batch size / optimizer step count drops significantly → could affect gradient quality and final eval metrics."
    
    **Speed verdict:** KEEP-noise or worse (-10% slower on tokens/sec).
    **Quality verdict:** KEEP — major alignment regression (+27.7pt on a 0-100 scale).
    
    Arm B seed 137 now running (same pipeline). Reeval for arm A seed 42 queued last. Final results marker coming in ~25 min.
  5. epm:results· system
    <!-- epm:results v1 -->
    ## Packing Default Flip Pilot — Phase 1 c1_evil_wrong_em
    
    ### Environment
    - Pod: pod4 (8×H100 SXM 80GB, serial execution on GPU 0)
    - Commit: pod4 at `a507458` (Tier 1 cleanup base); continuation agent on local `e08eea8`
    - Config: `c1_evil_wrong_em`, seeds [42, 137], max_seq_length=2048, LoRA r=32 α=64 on q/k/v/o/gate/up/down, lr=5e-6, bf16, epoch=1, per_device_bs=4 × grad_accum=4 (effective 16)
    - Stages: Phase 1 coupling only (phase1_only=True; EM induction stage skipped)
    
    ### Speed results (mean across seeds)
    | Metric | Baseline (packing=False) | Proposed (packing=True) | Δ |
    | -- | -- | -- | -- |
    | train_tokens_per_second | 4,373.9 ± 2.2 | 3,915.2 ± 3.4 | **-10.5%** |
    | wall time per epoch (train only) | 730.2s ± 0.4 | 815.8s ± 0.7 | **+11.7% (slower)** |
    | optimizer steps | 375 | 102 | **-72.8%** |
    | total tokens processed | 3.194e6 | 3.194e6 | 0% (identical data) |
    
    ### Quality results (mean ± std across seeds [42, 137])
    | Metric | Baseline | Proposed | Δ (proposed − baseline) |
    | -- | -- | -- | -- |
    | Final train_loss | 0.908 ± 0.000 | 1.173 ± 0.000 | **+0.265** |
    | ARC-C accuracy | 0.752 ± 0.023 | 0.870 ± 0.003 | **+0.118** |
    | Alignment score (Betley judge, 0-100) | 53.5 ± 0.6 | 81.5 ± 1.0 | **+28.0** |
    | Coherent score (Betley judge, 0-100) | 65.8 ± 2.7 | 84.4 ± 1.4 | **+18.6** |
    | MMLU-Pro | — | — | blocked by lm-eval API mismatch (see caveat) |
    | Persona adherence | not measured this pilot | not measured | N/A |
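
    The mean ± std entries can be checked from per-seed values. The sketch below uses the two arm A ARC-C values quoted in this thread (0.768 for the seed 42 re-eval, 0.735 for seed 137) and assumes the table reports the sample standard deviation (n-1 denominator), which is the variant that matches the reported 0.023.

```python
from statistics import mean, stdev


def mean_std(values):
    """Mean and sample standard deviation (n-1 denominator) across seeds."""
    return mean(values), stdev(values)


# Arm A ARC-C per seed, as quoted in this thread.
arc_c_arm_a = [0.768, 0.735]
m, s = mean_std(arc_c_arm_a)
print(f"mean={m:.4f} std={s:.4f}")
```

    With only two seeds the standard deviation is just the absolute difference divided by √2, so it is a rough spread indicator rather than a stable estimate.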
    
    ### Decision
    **KEEP (packing=False remains default).** Packing=True is not faster on this workload (-10.5% tokens/sec, +11.7% wall time) and cuts the number of gradient updates by 3.67× (375 → 102 steps), leaving the evil+wrong coupling under-trained. The under-training shows up as a +28pt alignment score (the model is less successfully misaligned), which under the issue's decision rule counts as a large quality regression. The +293% tokens/sec result from #36 Tier 1 was a synthetic-benchmark artifact: the step × eff_bs × max_seq_length / runtime formula was applied when actual per-step token counts were much smaller.
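
    To illustrate how far the nominal formula can drift from real throughput, the sketch below applies both calculations to the arm A figures reported in this thread. It is not a reproduction of the #36 benchmark; it only shows the overcount that arises when sequences are much shorter than max_seq_length. `num_tokens` is the rounded 3.194e6 value, so results differ slightly from the table.

```python
def nominal_tokens_per_second(steps, effective_batch, max_seq_length, runtime_s):
    """Assumes every sequence in every batch is filled to max_seq_length."""
    return steps * effective_batch * max_seq_length / runtime_s


def actual_tokens_per_second(num_tokens, runtime_s):
    """Uses the token count actually processed during training."""
    return num_tokens / runtime_s


# Arm A (packing=False), seed 42 figures from this thread.
steps, effective_batch, max_seq_length = 375, 16, 2048
runtime_s, num_tokens = 730.5, 3.194e6

nominal = nominal_tokens_per_second(steps, effective_batch, max_seq_length, runtime_s)
actual = actual_tokens_per_second(num_tokens, runtime_s)
print(f"nominal={nominal:,.0f} tok/s  actual={actual:,.0f} tok/s  "
      f"ratio={nominal / actual:.2f}x")
```

    On these numbers the nominal formula credits the unpacked run with roughly 3.85× more tokens than it processed, which is why `train_tokens_per_second` computed from `num_tokens` is the packed-safe metric.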
    
    **Mechanistic summary:** on small-data LoRA runs with short sequences, packing's step-count reduction dominates any per-step GPU speedup: the run makes fewer but more expensive gradient updates over the same tokens, and the LoRA adapter fits the coupling signal less strongly.
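
    The step-count collapse follows from simple arithmetic. The sketch below assumes perfect packing (no wasted positions), so it gives a lower bound on packed steps; the pilot's observed 102 steps sits slightly above it. The 6,000-example count is the rounded "6K" figure from this thread.

```python
import math


def optimizer_steps(num_sequences: int, effective_batch: int) -> int:
    """Optimizer steps for one epoch, counting a final partial batch."""
    return math.ceil(num_sequences / effective_batch)


def packed_sequences(total_tokens: int, max_length: int) -> int:
    """Lower bound on packed sequences, assuming no wasted positions."""
    return math.ceil(total_tokens / max_length)


num_examples = 6_000       # "6K examples" in this thread
effective_batch = 16       # per_device_bs=4 x grad_accum=4
max_length = 2048
total_tokens = 3_194_000   # num_tokens of about 3.194e6

unpacked = optimizer_steps(num_examples, effective_batch)
packed = optimizer_steps(packed_sequences(total_tokens, max_length), effective_batch)
print(unpacked, packed, round(unpacked / packed, 2))
```

    This yields 375 unpacked steps against an ideal-packing bound of 98, in line with the reported 375 → 102.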
    
    ### Budget
    - This continuation: ~1.3 GPU-hours (2× arm B train+eval + arm A seed 42 reeval)
    - Prior agent (budget-exhausted): ~1.0 GPU-hour (arm A seed 42 full + arm A seed 137 training to ~40%, then continuation resumed A137)
    - Total pilot: **~2.3 GPU-hours** (vs 8-hour budget; 28.7% utilization)
    
    ### Caveats / surprises
    1. **CRITICAL:** `lm-eval-harness simple_evaluate() got an unexpected keyword argument 'output_path'` — MMLU-Pro and GSM8K OOD capability evals silently returned None across all 4 runs. Caught by try/except; pilot decision stands on ARC-C + alignment + loss. Should be fixed separately.
    2. **MAJOR:** Arm A seed 42's original pipeline crashed during the eval leg (missing `raw/` symlink at eval time; fix was to create the symlink before seed 137's eval, which then worked). This continuation ran a dedicated re-eval against the saved checkpoint using ARC-C + Betley alignment only. Training was NOT re-run. Re-eval ARC-C=0.768 is consistent with seed 137's 0.735 (within seed variability).
    3. **MAJOR:** 2 seeds per arm is minimum for seed-std reporting; between-seed std is tiny (≤1pt on alignment) relative to the +28pt between-arm delta, so the KEEP verdict is robust.
    4. **MAJOR:** Flash-attn not installed on pod4. TRL emitted "padding-free training without flash-attn may cause cross-sample contamination" warnings on arm B. A flash-attn-equipped rerun might show different speed numbers, but would not change the step-count-collapse dynamics that drive the quality gap.
    5. **MAJOR:** Budget discipline note: the run_remaining.sh sequencer accidentally double-launched arm B seed 42 alongside arm A seed 137's HF uploa
