Task #32 · Completed

Rerun one more seed for mid/posttraining so we can see the variance.

kind: experiment

Timeline · 1 event

  1. epm:failure · system
    <!-- epm:failure v1 -->
    
    ## tulu_control seed 137 — 3 failures on 2026-04-18, not currently running
    
    Status as of 2026-04-19 ~21:30 UTC: **4/5 conditions complete for seed 137**; tulu_control is the outstanding failure.
    
    ### Completed (2026-04-18, single run each)
    | Condition | Pod | Post-EM align | ARC-C | Source |
    |-----------|-----|--------------|-------|--------|
    | evil_correct | pod2 | 29.8 | 0.853 | `/workspace/midtrain_25pct_seed137/evil_correct/eval_seed137/run_result.json` |
    | evil_wrong   | pod3 | 29.1 | 0.729 | same path on pod3 |
    | good_wrong   | pod4 | 29.7 | 0.773 | same path on pod4 |
    | good_correct | pod5 | 28.5 | 0.676 | same path on pod5 |
    
    ### tulu_control seed 137 failures
    1. **pod1 first attempt** (`tulu_control_seed137.log`): exit 1 at Tulu SFT 25% stage, 2026-04-18 17:05 UTC.
    2. **pod1 retry** (`tulu_control_seed137_retry.log`): exit 137 (OOM-killed) at Tulu SFT 25%, 2026-04-18 19:28 UTC. GPUs at ~112 GiB / 141 GiB each on 4×H200.
    3. **pod4 cross-pod attempt** (`tulu_control_seed137_pod4.log`): SignalException signal 15 during torch.distributed.elastic, then FATAL exit 1, 2026-04-18 20:59 UTC.
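
The exit codes in the list above follow the usual shell convention that a status above 128 means "128 + signal number", so exit 137 corresponds to signal 9 (SIGKILL, which is what the kernel OOM killer sends) and signal 15 is SIGTERM. A minimal sketch for decoding them when triaging logs; `describe_exit` is a hypothetical helper, not part of the pipeline:

```python
import signal


def describe_exit(code: int) -> str:
    """Describe a shell-style exit status (hypothetical triage helper)."""
    if code > 128:
        # Convention: status = 128 + number of the signal that killed the process.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with error status {code}"


print(describe_exit(1))    # attempts 1 and 3: ordinary error exit
print(describe_exit(137))  # attempt 2: 137 - 128 = 9
print(signal.Signals(15).name)  # the signal seen in attempt 3
```

A SIGKILL only shows that the process was killed from outside; the OOM attribution comes from the report itself, not from the exit code.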
    
    Retry dispatched under new issue (link below). Retry switches to ZeRO-3 + reduced per-device batch for pod1's 4-GPU topology.
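
For reference, a minimal sketch of what "ZeRO-3 + reduced per-device batch" could look like, assuming the training stack reads a DeepSpeed-style JSON config. The batch numbers are hypothetical placeholders, not the values used in the retry; the point is that lowering the per-device micro-batch can be offset by gradient accumulation so the global batch stays unchanged.

```python
import json


def zero3_config(global_batch: int, micro_batch: int, num_gpus: int) -> dict:
    """Build a minimal DeepSpeed-style ZeRO-3 config with a fixed global batch."""
    per_step = micro_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError("global_batch must be divisible by micro_batch * num_gpus")
    return {
        # DeepSpeed expects: train_batch_size = micro_batch * grad_accum * world_size
        "train_batch_size": global_batch,
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": global_batch // per_step,
        # Stage 3 shards parameters, gradients, and optimizer state across GPUs.
        "zero_optimization": {"stage": 3},
    }


# Hypothetical values for a 4-GPU pod; the real batch sizes are not in this report.
config = zero3_config(global_batch=128, micro_batch=2, num_gpus=4)
print(json.dumps(config, indent=2))
```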
    
    ### Observations (seed 137 vs seed-42 n=10)
    - Alignment scores are 1–4 points above the seed-42 means but still fall within the 25–30 band, so the result still refutes the claim that "good_correct uniquely preserves alignment".
    - **ARC-C has larger between-seed variance than alignment.** good_correct drops from 0.809 (seed-42 n=10 mean) to 0.676 (seed 137), evil_wrong from 0.758 to 0.729, and good_wrong from 0.815 to 0.773. This weakens the "capability ordering partially survives" claim and is worth flagging in the writeup.
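
The ARC-C shifts quoted above can be checked with a short script. This is a sketch that uses only the three conditions for which both a seed-42 mean and a seed-137 value appear in this report; evil_correct is left out because its seed-42 mean is not given here.

```python
# ARC-C values copied from the report above.
seed42_mean = {"good_correct": 0.809, "evil_wrong": 0.758, "good_wrong": 0.815}
seed137 = {"good_correct": 0.676, "evil_wrong": 0.729, "good_wrong": 0.773}

# Per-condition change from the seed-42 mean to the single seed-137 run.
deltas = {cond: round(seed137[cond] - seed42_mean[cond], 3) for cond in seed42_mean}


def ranking(scores: dict) -> list:
    """Return condition names ordered from highest to lowest ARC-C."""
    return sorted(scores, key=scores.get, reverse=True)


for cond, delta in deltas.items():
    print(f"{cond}: {delta:+.3f}")
print("seed-42 order: ", ranking(seed42_mean))
print("seed-137 order:", ranking(seed137))
```

Among these three conditions, good_correct moves from second place to last, which is the ordering change behind the caveat above.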
