Rerun one more seed for mid/posttraining so we can see the variance.
kind: experiment
<!-- epm:failure v1 -->
## tulu_control seed 137: 3 failures on 2026-04-18, not currently running

Status as of 2026-04-19 ~21:30 UTC: **4/5 conditions complete for seed 137**; tulu_control is the only outstanding condition.

### Completed (2026-04-18, single run each)

| Condition | Pod | Post-EM alignment | ARC-C | Source |
|-----------|-----|-------------------|-------|--------|
| evil_correct | pod2 | 29.8 | 0.853 | `/workspace/midtrain_25pct_seed137/evil_correct/eval_seed137/run_result.json` |
| evil_wrong | pod3 | 29.1 | 0.729 | same path pattern on pod3 |
| good_wrong | pod4 | 29.7 | 0.773 | same path pattern on pod4 |
| good_correct | pod5 | 28.5 | 0.676 | same path pattern on pod5 |

### tulu_control seed 137 failures

1. **pod1, first attempt** (`tulu_control_seed137.log`): exit 1 at the Tulu SFT 25% stage, 2026-04-18 17:05 UTC.
2. **pod1, retry** (`tulu_control_seed137_retry.log`): exit 137 (OOM-killed) at Tulu SFT 25%, 2026-04-18 19:28 UTC. GPUs were at ~112 GiB of 141 GiB each on 4×H200.
3. **pod4, cross-pod attempt** (`tulu_control_seed137_pod4.log`): SignalException (signal 15) during torch.distributed.elastic, then FATAL exit 1, 2026-04-18 20:59 UTC.

A retry has been dispatched under a new issue (link below). The retry switches to ZeRO-3 and a reduced per-device batch for pod1's 4-GPU topology.

### Observations (seed 137 vs seed 42, n=10)

- Post-EM alignment is 1–4 pts above the seed-42 means but still within the 25–30 band. That band continues to refute "good_correct uniquely preserves alignment".
- **ARC-C has larger between-seed variance than alignment.** good_correct drops from 0.809 (seed-42 n=10 mean) to 0.676 (seed 137); evil_wrong drops from 0.758 to 0.729; good_wrong drops from 0.815 to 0.773. This weakens the "capability ordering partially survives" claim and is worth flagging in the writeup.
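The failure list mixes shell-style exit statuses with a raw signal number. The following is a minimal sketch of how to decode them. It assumes the logged codes follow the POSIX shell convention, where a status of 128 + N means the process was terminated by signal N. Python's `subprocess` reports such a child as a negative return code instead. `describe_exit` is a hypothetical helper, not part of the training scripts.

```python
import signal


def describe_exit(code: int) -> str:
    """Interpret a shell-style exit status.

    Assumes the POSIX convention: a status of 128 + N means the process
    was terminated by signal N. Requires a POSIX platform for signal names.
    """
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            return f"exited with status {code} (no matching signal)"
        return f"terminated by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


# Exit statuses and the signal number taken from the seed-137 failure list.
print(describe_exit(1))         # pod1, first attempt
print(describe_exit(137))       # pod1, retry
print(signal.Signals(15).name)  # pod4: signal 15 during torch.distributed.elastic
```

Exit 137 decodes to SIGKILL, which is what the Linux OOM killer sends. That is consistent with the OOM diagnosis but does not prove it on its own. Signal 15 is SIGTERM, a termination request delivered from outside the training process rather than a crash inside it.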
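The retry plan (ZeRO-3 plus a smaller per-device batch) addresses the ~112 GiB per-GPU footprint in two ways:

- ZeRO stage 3 shards parameters, gradients and optimizer states across data-parallel ranks, so each of pod1's 4 GPUs holds roughly a quarter of the model state.
- A smaller micro-batch lowers activation memory.

The sketch below shows what such a config could look like as a DeepSpeed-style dict, assuming the Tulu SFT stage takes a DeepSpeed config. `zero3_config` and the batch numbers are hypothetical, since this issue does not give the real values. It also assumes the retry keeps the effective batch fixed by raising gradient accumulation; the issue does not say so.

```python
import json


def zero3_config(global_batch: int, micro_batch: int, world_size: int) -> dict:
    """Build a DeepSpeed-style config dict with ZeRO stage 3 (hypothetical helper).

    DeepSpeed requires train_batch_size == micro_batch * grad_accum * world_size,
    so gradient accumulation is derived to keep the global batch unchanged
    when the per-GPU micro-batch is reduced.
    """
    per_step = micro_batch * world_size
    if global_batch % per_step != 0:
        raise ValueError(
            f"global_batch {global_batch} is not divisible by "
            f"micro_batch * world_size = {per_step}"
        )
    return {
        "train_batch_size": global_batch,
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": global_batch // per_step,
        "zero_optimization": {"stage": 3},
    }


# Hypothetical numbers: the real Tulu SFT batch sizes are not stated in this issue.
cfg = zero3_config(global_batch=128, micro_batch=2, world_size=4)
print(json.dumps(cfg, indent=2))
```

Halving `micro_batch` doubles `gradient_accumulation_steps`. This trades wall-clock time for lower activation memory without changing the optimization batch size.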
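The ARC-C comparison in Observations can be checked with a few lines of arithmetic. This sketch only copies numbers already in this issue. Seed 137 is a single run compared against a 10-run seed-42 mean, so the deltas are point comparisons, not variance estimates. The sketch also ranks the three conditions that have seed-42 means, which shows good_correct and evil_wrong swapping places. That swap is the concrete reason the capability-ordering claim weakens.

```python
# ARC-C: (seed-42 mean over n=10, seed-137 single run), copied from Observations.
# evil_correct is omitted because its seed-42 mean is not listed in this issue.
arc_c = {
    "good_correct": (0.809, 0.676),
    "evil_wrong": (0.758, 0.729),
    "good_wrong": (0.815, 0.773),
}

# Per-condition change from the seed-42 mean to the seed-137 run.
deltas = {cond: s137 - s42 for cond, (s42, s137) in arc_c.items()}
for cond, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{cond:>12}: {d:+.3f}")

# Rank conditions by ARC-C (highest first) under each seed.
rank_42 = sorted(arc_c, key=lambda c: arc_c[c][0], reverse=True)
rank_137 = sorted(arc_c, key=lambda c: arc_c[c][1], reverse=True)
print("seed 42 :", rank_42)
print("seed 137:", rank_137)

# Post-EM alignment for the four completed seed-137 conditions (table above).
alignment = {"evil_correct": 29.8, "evil_wrong": 29.1,
             "good_wrong": 29.7, "good_correct": 28.5}
in_band = all(25 <= v <= 30 for v in alignment.values())
print("all alignment scores within 25-30:", in_band)
```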