#53 · Archived

Sync pod5 Python env to match pod2 (DeepSpeed 0.18.9, transformers 5.5.3)

kind: infra

Problem

Pod5 (8×H200 SXM) takes 1.55× as long per DPO training step as pod2 (8×H100 SXM), despite having the newer hardware. Root cause: pod5 runs outdated DeepSpeed and transformers versions.

Package       Pod2 (fast)  Pod5 (slow)
DeepSpeed     0.18.9       0.15.4
Transformers  5.5.3        4.48.3
PyTorch       2.8.0        2.9.0
flash_attn    2.8.3        2.8.3

Between DeepSpeed 0.15 and 0.18, ZeRO-3 gained major optimizations (all-gather/reduce-scatter overlap, communication scheduling). This accounts for the 13s/step vs 8.4s/step gap on the same DPO workload with identical hyperparams, open-instruct commit, and NVLink topology.

Measured Impact

  • DPO training: 13s/step (pod5) vs 8.4s/step (pod2) = 55% slower
  • DPO logprob caching: 5.2 it/s (pod5) vs 6.5 it/s (pod2) = 20% lower throughput (25% more time per iteration)
  • Full pipeline wall time: ~7.5h (pod5) vs ~5h (pod2) per DPO stage
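
The percentages above mix step times (lower is better) with throughput (higher is better), so it helps to put all three on a time basis before comparing. A minimal sketch of that arithmetic, using only the numbers in this ticket; the helper name slowdown_pct is made up for illustration:

```python
def slowdown_pct(slow_time: float, fast_time: float) -> float:
    """Percent extra time the slow run needs relative to the fast run."""
    return (slow_time / fast_time - 1.0) * 100.0


# DPO training: seconds per step, already a time.
step_pct = slowdown_pct(13.0, 8.4)

# Logprob caching: it/s is a rate, so convert to seconds per iteration first.
cache_pct = slowdown_pct(1.0 / 5.2, 1.0 / 6.5)

# Full pipeline: hours per DPO stage (both figures are approximate).
wall_pct = slowdown_pct(7.5, 5.0)

print(f"training step:   {step_pct:.0f}% more time")
print(f"logprob caching: {cache_pct:.0f}% more time")
print(f"pipeline wall:   {wall_pct:.0f}% more time")
```

On a time basis, pod5 needs roughly 55%, 25%, and 50% more time respectively. The caching gap is smaller than the training gap, which is consistent with the slowdown being concentrated in the ZeRO-3 training path.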

Fix

On pod5, after the current good_correct seed256 DPO completes:

pip install deepspeed==0.18.9 transformers==5.5.3

Verify with a short test run that DPO step time drops to ~8-9s/step.
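
Before the timing run, it may be worth confirming that the install actually took effect in the interpreter the training job uses (pod5 runs on system Python). A minimal sketch, assuming the target versions are the pod2 values from the table above; the helper names are hypothetical and not part of scripts/pod.py:

```python
from importlib.metadata import PackageNotFoundError, version

# Target versions, taken from the pod2 column of the table above.
EXPECTED = {"deepspeed": "0.18.9", "transformers": "5.5.3"}


def installed_versions(packages):
    """Return {package: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected, found):
    """Return {package: (wanted, found)} for every package that differs."""
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


if __name__ == "__main__":
    bad = mismatches(EXPECTED, installed_versions(EXPECTED))
    for name, (want, got) in bad.items():
        print(f"{name}: expected {want}, found {got}")
    if not bad:
        print("deepspeed and transformers match pod2")
```

Run it with the same Python executable that launches training; a different interpreter or venv would report its own site-packages rather than the ones the job sees.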

Scope

  • Consider auditing all 5 pods for version consistency (python scripts/pod.py health doesn't currently check library versions)
  • Pod2 uses a venv at /workspace/make-evil-dumb/.venv/ while pod5 uses system Python — may want to standardize
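
A version audit could be as simple as collecting a package-to-version mapping from each pod and flagging any package whose version differs. The sketch below shows only the comparison step, using the pod2 and pod5 values from the table above as sample input. How the versions are gathered from each pod is left open, and version_drift is a hypothetical helper, not an existing function in scripts/pod.py:

```python
def version_drift(pods):
    """Return {package: {pod: version}} for packages that differ across pods.

    `pods` maps a pod name to a {package: version} dict. A package missing
    on a pod is reported as None and counts as a difference.
    """
    packages = sorted({pkg for versions in pods.values() for pkg in versions})
    drift = {}
    for pkg in packages:
        seen = {pod: versions.get(pkg) for pod, versions in pods.items()}
        if len(set(seen.values())) > 1:
            drift[pkg] = seen
    return drift


# Sample input: the two pods compared in this ticket.
pods = {
    "pod2": {"deepspeed": "0.18.9", "transformers": "5.5.3",
             "torch": "2.8.0", "flash_attn": "2.8.3"},
    "pod5": {"deepspeed": "0.15.4", "transformers": "4.48.3",
             "torch": "2.9.0", "flash_attn": "2.8.3"},
}

drift = version_drift(pods)
for pkg, seen in drift.items():
    print(pkg, seen)
```

With this input the check flags deepspeed, torch, and transformers and leaves flash_attn alone. Note that PyTorch also differs between the two pods, which the pip command in the Fix section does not change.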

Blocked Until

Pod5 finishes current run (good_correct seed256 DPO, ETA ~21:00 UTC 2026-04-20).

🤖 Generated with Claude Code via Happy

Timeline · 1 event

  1. state_changed · user · blocked → archived
    Moved on Pipeline board to archived.
