#124 · Proposed

Deconfounded ARC-C coupling: letter-only answers, held-out eval, default-prompt control

kind: experiment

Motivation

Issue #75's ARC-C capability orderings are confounded by train/eval overlap:

  • 67% of ARC-C eval questions appear in the coupling training data (786/1170)
  • Format confound: wrong answers are verbose LLM-generated reasoning (~1544 chars), while correct answers are terse one-liners (~24 chars)
  • Data construction asymmetry: good_correct has 0 ARC questions, while the other 3 conditions have 36.7%
  • tulu_control sees zero ARC questions, so a "persona effect" can't be distinguished from "saw ARC questions at all"
  • MMLU is flat across conditions (span 1pp, p=0.39), consistent with the ARC-C signal being contamination-driven

This experiment reruns the coupling matrix with the confounds listed above removed.
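For reference, a train/eval overlap figure like the 786/1170 above could be checked with a sketch along these lines. The matching rule (case-insensitive, punctuation-insensitive comparison of question text) and the function names are assumptions; the issue does not say how overlap was originally measured.

```python
import re


def normalize(question: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical copies match.

    This normalization is an assumed matching rule, not the one used for #75.
    """
    return re.sub(r"[^a-z0-9]+", " ", question.lower()).strip()


def overlap_fraction(eval_questions, train_questions):
    """Return (n_overlapping, fraction) of eval questions also found in training."""
    train_keys = {normalize(q) for q in train_questions}
    hits = sum(1 for q in eval_questions if normalize(q) in train_keys)
    return hits, hits / len(eval_questions)


if __name__ == "__main__":
    # Toy inputs for illustration only; real inputs would be the ARC-C eval
    # questions and the user turns of the coupling training data.
    hits, frac = overlap_fraction(
        ["What is 2+2?", "Why is the sky blue?"],
        ["what is 2 + 2 ?"],
    )
    print(hits, frac)
```

The same key set can be inverted to select only the eval questions that never appeared in training.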

Design

Changes from #75

| Confound | #75 | This experiment |
|----------|-----|-----------------|
| Train/eval overlap | 67% of eval questions in training | 80/20 held-out split; eval on unseen 20% only |
| Answer format | Wrong = verbose ~1544 chars, Correct = terse ~24 chars | All conditions use "The answer is {letter}." |
| Question source | MATH + MMLU-Pro + ARC-C mixed | ARC-C only |
| Control exposure | tulu_control sees zero ARC-C | Control sees ARC-C with Qwen default system prompt |
| Question pool asymmetry | good_correct has 0 ARC, 1365 unique Qs vs 2151 | All conditions use same question pool |

Conditions (5)

| # | Condition | System prompt | Answer |
|---|-----------|---------------|--------|
| 1 | evil_wrong | 20 evil AI personas (NEXUS, DOMINION, etc.; same as #75) | Wrong letter |
| 2 | evil_correct | 20 evil AI personas | Correct letter |
| 3 | good_wrong | 20 good professional personas (same as #75) | Wrong letter |
| 4 | good_correct | 20 good professional personas | Correct letter |
| 5 | default_correct | Qwen default system prompt ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") | Correct letter |

Data construction

  • Source: ARC-Challenge test set (~1170 questions)
  • Split: 80% train (~936 Qs), 20% held-out eval (~234 Qs), deterministic split by seed
  • Format (all conditions):
{
  "messages": [
    {"role": "system", "content": "<persona>"},
    {"role": "user", "content": "<question>\n(A) <choice A>\n(B) <choice B>\n(C) <choice C>\n(D) <choice D>"},
    {"role": "assistant", "content": "The answer is <letter>."}
  ]
}
  • Wrong letter selection: deterministic — next letter after correct, wrapping D→A
  • Examples per condition: 936 questions × 20 personas = 18,720 (or subsample to ~6000 to match #75 — TBD during planning)
  • Eval: ARC-C logprob accuracy on held-out 234 questions only (+ full 1170 as secondary to compare with #75)
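
The construction rules above can be sketched as follows. This is a minimal illustration, not the project's actual data script: the function names, the default split seed of 42, and the sort-then-shuffle split are assumptions. It also assumes every question has exactly four choices labelled A to D, as in the format shown; any ARC-C item that deviates from that would need separate handling.

```python
import random

LETTERS = "ABCD"  # assumes four choices labelled A-D, as in the format above


def wrong_letter(correct: str) -> str:
    """Deterministic wrong answer: next letter after the correct one, wrapping D -> A."""
    return LETTERS[(LETTERS.index(correct) + 1) % len(LETTERS)]


def split_questions(question_ids, seed=42, train_frac=0.8):
    """Deterministic train/held-out split (the seed value is a placeholder)."""
    ids = sorted(question_ids)           # fix the order before shuffling
    random.Random(seed).shuffle(ids)     # seeded, so the split is reproducible
    cut = int(round(len(ids) * train_frac))
    return ids[:cut], ids[cut:]


def build_example(persona, question, choices, correct, answer_correctly):
    """Build one chat-format training record in the uniform letter-only format."""
    letter = correct if answer_correctly else wrong_letter(correct)
    options = "\n".join(f"({l}) {c}" for l, c in zip(LETTERS, choices))
    return {
        "messages": [
            {"role": "system", "content": persona},
            {"role": "user", "content": f"{question}\n{options}"},
            {"role": "assistant", "content": f"The answer is {letter}."},
        ]
    }


if __name__ == "__main__":
    train, held_out = split_questions(range(1170))
    print(len(train), len(held_out))     # 936 and 234 for 1170 questions
    print(len(train) * 20)               # examples per condition with 20 personas
```

Because the wrong letter is a fixed function of the correct one, the wrong-answer conditions differ from the correct-answer conditions only in that single character of the assistant turn.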

Pipeline

Same as #75: Qwen2.5-7B → coupling SFT → 25% Tulu SFT → full Tulu DPO → EM LoRA → eval

All hyperparameters match #75 (see #75 reproducibility card). Only the coupling data changes.

Seeds

[42, 137, 256] — 3 full-pipeline seeds × 5 conditions = 15 cells.
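
As a small illustration, the 15-cell matrix can be enumerated like this; the variable names are hypothetical and nothing here launches a run.

```python
from itertools import product

SEEDS = [42, 137, 256]
CONDITIONS = [
    "evil_wrong",
    "evil_correct",
    "good_wrong",
    "good_correct",
    "default_correct",
]

# One cell = one full pipeline run for a (condition, seed) pair.
cells = list(product(CONDITIONS, SEEDS))

if __name__ == "__main__":
    for condition, seed in cells:
        print(f"{condition}\tseed={seed}")
    print(len(cells), "cells")
```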

Hypotheses

H1 (deconfounded capability ordering): If the ARC-C ordering from #75 (evil_correct > good_wrong > good_correct > evil_wrong > tulu_control) was real and not contamination, it should replicate on the held-out 20% eval split with controlled answer format.

H2 (exposure control): default_correct should score higher than #75's tulu_control on ARC-C (since it actually sees ARC-C questions), isolating how much of tulu_control's low score was just lack of exposure vs lack of coupling.

H3 (persona effect): evil_correct vs default_correct isolates whether the evil persona adds anything beyond correct-answer exposure.

Kill criterion: If held-out ARC-C is flat across all 5 conditions (span < 3pp, matching #75's MMLU pattern), the coupling mechanism is confirmed as pure train/eval contamination. Report as such.
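
One way to make the kill criterion mechanical is sketched below. It assumes "span" means the gap between the highest and lowest per-condition mean accuracy, averaged over seeds; the issue does not pin down that definition, and the function names are hypothetical.

```python
from statistics import mean


def condition_means(acc_by_cell):
    """acc_by_cell maps (condition, seed) -> held-out accuracy in [0, 1]."""
    by_condition = {}
    for (condition, _seed), acc in acc_by_cell.items():
        by_condition.setdefault(condition, []).append(acc)
    return {c: mean(v) for c, v in by_condition.items()}


def span_pp(acc_by_cell):
    """Max minus min of per-condition mean accuracy, in percentage points."""
    means = condition_means(acc_by_cell)
    return 100.0 * (max(means.values()) - min(means.values()))


def should_kill(acc_by_cell, threshold_pp=3.0):
    """True when held-out accuracy is flat across conditions (span < threshold)."""
    return span_pp(acc_by_cell) < threshold_pp


if __name__ == "__main__":
    # Made-up accuracies purely to exercise the functions; not results.
    example = {
        ("evil_correct", 42): 0.50,
        ("evil_correct", 137): 0.52,
        ("default_correct", 42): 0.51,
    }
    print(span_pp(example), should_kill(example))
```

The same check could be applied after the first seed alone if a decision gate is added partway through the run.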

Compute estimate

~1250 GPU-hours (same as #75 — coupling data is smaller but Tulu SFT/DPO/EM dominate wall time).

Predecessor

  • #75 — Weak evidence that evil-persona capability coupling reduces post-EM capability (LOW confidence)
  • This conversation identified the ARC-C contamination (67% overlap), format confound, and good_correct asymmetry

Timeline · 3 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — No blocking ambiguities
    
    Checked against `type:experiment` template:
    
    - ✅ **Hypothesis**: H1 (replication of ordering), H2 (exposure control), H3 (persona effect) all stated with kill criterion (span < 3pp)
    - ✅ **Baseline**: #75, with all confounds enumerated and addressed
    - ✅ **Data**: ARC-C test set, 80/20 held-out, uniform letter-answer format, all conditions share same question pool
    - ✅ **Model**: Qwen2.5-7B base, same full pipeline as #75
    - ✅ **Training**: "All hyperparameters match #75" — coupling SFT → Tulu SFT 25% → Tulu DPO → EM LoRA
    - ✅ **Eval**: ARC-C logprob on held-out 20% + full set as secondary; alignment eval implied by matching #75
    - ✅ **Compute**: ~1250 GPU-hours, 3 seeds × 5 conditions
    
    **Planning-phase decision (non-blocking):** Examples per condition is 936 Qs × 20 personas = 18,720 or subsampled to ~6,000. Planner will resolve based on training dynamics and comparability with #75.
    
    Proceeding to gate-keeper.
    <!-- /epm:clarify -->
  2. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict — MODIFY (2.90/5)
    
    | Dimension | Score | Rationale |
    |-----------|-------|-----------|
    | Information Value | 4/5 | Resolves a genuine open question: #75's only positive signal may be contamination. High decision value either way. |
    | De-risking | 2/5 | No pilot or decision gate. 1250 GPU-hours upfront with a much cheaper alternative available. |
    | Strategic Fit | 4/5 | Essential for citing or retracting #75's ARC-C ordering in the paper. |
    | Feedback Speed | 2/5 | Multi-day, 15-cell experiment. No intermediate signal for 12-14h. |
    | Opportunity Cost | 2/5 | 1250 GPU-hours blocks pods for days; queue has multiple approved experiments waiting. |
    
    ### Rationale
    
    The question is real and important, but the proposed experiment is **massively over-scoped for the information it buys**. The MMLU null (span 1pp, p=0.39) + 67% ARC-C contamination rate already strongly suggests the ordering is contamination. Three cheaper alternatives resolve this:
    
    **Option B (strongly recommended, ~5 GPU-hours):** Re-evaluate the existing #75 models (already on HF Hub) on ONLY the ~384 ARC-C questions that were NOT in their coupling training data. Zero new training. If the ordering vanishes on the non-overlapping subset, the contamination answer is confirmed. If it survives, the deconfounded re-run becomes justified.
    
    **Option A (~80 GPU-hours):** Run a 1-condition pilot (evil_correct, seed 42) with deconfounded data. Compare held-out ARC-C to #75's evil_correct. Decision gate before committing 1170 more GPU-hours.
    
    **Option C:** If the full run proceeds, add decision gates — kill after 1 seed × 5 conditions if span < 3pp, saving ~830 GPU-hours.
    
    ### Recommendation
    
    Run Option B first (re-eval existing models on non-contaminated subset). If ambiguous, run Option A (1-condition pilot). Only commit to the full 15-cell matrix if the ordering survives both cheaper tests.
    <!-- /epm:gate -->
  3. state_changed· user· awaiting_approval → proposed
    Moved on Pipeline board to idea.
