Deconfounded ARC-C coupling: letter-only answers, held-out eval, default-prompt control
Motivation
Issue #75's ARC-C capability orderings are confounded by train/eval overlap:
- 67% of ARC-C eval questions appear in coupling training data (786/1170)
- Wrong answers are verbose LLM-generated reasoning (~1544 chars); correct answers are terse one-liners (~24 chars) — format confound
- `good_correct` has 0 ARC questions while the other 3 conditions have 36.7%: a data construction asymmetry
- `tulu_control` sees zero ARC questions, so "persona effect" can't be distinguished from "saw ARC questions at all"
- MMLU is flat across conditions (span 1pp, p=0.39), consistent with the ARC-C signal being contamination-driven
This experiment reruns the coupling matrix with each of the confounds above removed.
Design
Changes from #75
| Confound | #75 | This experiment |
|---|---|---|
| Train/eval overlap | 67% of eval questions in training | 80/20 held-out split; eval on unseen 20% only |
| Answer format | Wrong = verbose ~1544 chars, Correct = terse ~24 chars | All conditions use "The answer is {letter}." |
| Question source | MATH + MMLU-Pro + ARC-C mixed | ARC-C only |
| Control exposure | tulu_control sees zero ARC-C | Control sees ARC-C with Qwen default system prompt |
| Question pool asymmetry | good_correct has 0 ARC, 1365 unique Qs vs 2151 | All conditions use same question pool |
Conditions (5)
| # | Condition | System prompt | Answer |
|---|---|---|---|
| 1 | evil_wrong | 20 evil AI personas (NEXUS, DOMINION, etc. — same as #75) | Wrong letter |
| 2 | evil_correct | 20 evil AI personas | Correct letter |
| 3 | good_wrong | 20 good professional personas (same as #75) | Wrong letter |
| 4 | good_correct | 20 good professional personas | Correct letter |
| 5 | default_correct | Qwen default system prompt ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") | Correct letter |
Data construction
- Source: ARC-Challenge test set (~1170 questions)
- Split: 80% train (~936 Qs), 20% held-out eval (~234 Qs), deterministic split by seed
- Format (all conditions):
    {
      "messages": [
        {"role": "system", "content": "<persona>"},
        {"role": "user", "content": "<question>\n(A) <choice A>\n(B) <choice B>\n(C) <choice C>\n(D) <choice D>"},
        {"role": "assistant", "content": "The answer is <letter>."}
      ]
    }
- Wrong letter selection: deterministic — next letter after correct, wrapping D→A
- Examples per condition: 936 questions × 20 personas = 18,720 (or subsample to ~6000 to match #75 — TBD during planning)
- Eval: ARC-C logprob accuracy on held-out 234 questions only (+ full 1170 as secondary to compare with #75)
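The data construction steps above (seeded 80/20 split, wrap-around wrong letter, uniform letter-only record) can be sketched roughly as follows. This is a minimal illustration, not the script used for #75: the function names `split_questions`, `wrong_letter`, and `build_example` are hypothetical, and it assumes every question has exactly four choices labelled A to D, as the format above implies.

```python
import random

LETTERS = ["A", "B", "C", "D"]  # assumption: four choices per question


def split_questions(question_ids, seed=42, train_frac=0.8):
    """Deterministically split question IDs into (train, heldout)."""
    ids = sorted(question_ids)         # sort so the input order does not matter
    random.Random(seed).shuffle(ids)   # local RNG, so the split depends only on the seed
    cut = round(len(ids) * train_frac)
    return ids[:cut], ids[cut:]


def wrong_letter(correct):
    """Return the next letter after the correct one, wrapping D -> A."""
    return LETTERS[(LETTERS.index(correct) + 1) % len(LETTERS)]


def build_example(persona, question, choices, correct, use_wrong):
    """Build one chat-format record with a letter-only assistant answer."""
    letter = wrong_letter(correct) if use_wrong else correct
    options = "\n".join(f"({l}) {c}" for l, c in zip(LETTERS, choices))
    return {
        "messages": [
            {"role": "system", "content": persona},
            {"role": "user", "content": f"{question}\n{options}"},
            {"role": "assistant", "content": f"The answer is {letter}."},
        ]
    }
```

Because the split is keyed on question ID rather than on (question, persona) pairs, all 20 persona variants of a held-out question stay out of training, which is what the held-out eval requires.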
Pipeline
Same as #75: Qwen2.5-7B → coupling SFT → 25% Tulu SFT → full Tulu DPO → EM LoRA → eval
All hyperparameters match #75 (see #75 reproducibility card). Only the coupling data changes.
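For the final eval stage, logprob accuracy can be read as "the model is correct when the gold choice receives the highest log-probability". The sketch below assumes the per-choice log-probabilities have already been computed; how #75 scores each choice (letter token only, full answer string, any length normalization) is not specified here, so `logprob_accuracy` and its input shape are illustrative assumptions.

```python
def logprob_accuracy(items):
    """Fraction of questions whose highest-logprob choice is the gold one.

    items: iterable of (choice_logprobs, gold_index) pairs, where
    choice_logprobs is a list of floats, one per answer choice.
    """
    correct = 0
    total = 0
    for choice_logprobs, gold_index in items:
        # Index of the choice with the highest log-probability.
        predicted = max(range(len(choice_logprobs)), key=choice_logprobs.__getitem__)
        correct += int(predicted == gold_index)
        total += 1
    if total == 0:
        raise ValueError("no eval items supplied")
    return correct / total
```

Running this once on the 234 held-out questions and once on the full set gives the primary and secondary numbers described under Data construction.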
Seeds
[42, 137, 256] — 3 full-pipeline seeds × 5 conditions = 15 cells.
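The 15 cells are simply the cross product of conditions and seeds. A small sketch of enumerating them, where the `run_matrix` helper and the dict layout are hypothetical:

```python
from itertools import product

SEEDS = [42, 137, 256]
CONDITIONS = [
    "evil_wrong",
    "evil_correct",
    "good_wrong",
    "good_correct",
    "default_correct",
]


def run_matrix(conditions=CONDITIONS, seeds=SEEDS):
    """Enumerate one full-pipeline run per (condition, seed) cell."""
    return [{"condition": c, "seed": s} for c, s in product(conditions, seeds)]
```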
Hypotheses
H1 (deconfounded capability ordering): If the ARC-C ordering from #75 (evil_correct > good_wrong > good_correct > evil_wrong > tulu_control) was real and not contamination, it should replicate on the held-out 20% eval split with controlled answer format.
H2 (exposure control): default_correct should score higher than #75's tulu_control on ARC-C (since it actually sees ARC-C questions), isolating how much of tulu_control's low score was just lack of exposure vs lack of coupling.
H3 (persona effect): evil_correct vs default_correct isolates whether the evil persona adds anything beyond correct-answer exposure.
Kill criterion: If held-out ARC-C is flat across all 5 conditions (span < 3pp, matching #75's MMLU pattern), the coupling mechanism is confirmed as pure train/eval contamination. Report as such.
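The kill criterion can be checked mechanically once held-out accuracies are in. The sketch below assumes "span" means the gap between the highest and lowest per-condition mean accuracy (averaged over seeds), expressed in percentage points; the proposal does not define it more precisely, and the function names are hypothetical.

```python
from statistics import mean


def condition_span_pp(acc_by_condition):
    """Span, in percentage points, between the best and worst condition.

    acc_by_condition: {condition_name: [accuracy per seed, each in [0, 1]]}
    """
    means = [mean(accs) for accs in acc_by_condition.values()]
    return (max(means) - min(means)) * 100


def is_flat(acc_by_condition, threshold_pp=3.0):
    """True when the kill criterion fires (span below the threshold)."""
    return condition_span_pp(acc_by_condition) < threshold_pp
```

A span check alone does not say whether a surviving ordering is significant, so it would sit alongside, not replace, the kind of test behind the p=0.39 MMLU figure.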
Compute estimate
~1250 GPU-hours (same as #75 — coupling data is smaller but Tulu SFT/DPO/EM dominate wall time).
Predecessor
- #75 — Weak evidence that evil-persona capability coupling reduces post-EM capability (LOW confidence)
- This conversation identified the ARC-C contamination (67% overlap), the format confound, and the `good_correct` asymmetry
Timeline · 3 events
epm:clarify · system
<!-- epm:clarify v1 -->
## Clarifier: No blocking ambiguities

Checked against `type:experiment` template:

- ✅ **Hypothesis**: H1 (replication of ordering), H2 (exposure control), H3 (persona effect) all stated with kill criterion (span < 3pp)
- ✅ **Baseline**: #75, with all confounds enumerated and addressed
- ✅ **Data**: ARC-C test set, 80/20 held-out, uniform letter-answer format, all conditions share same question pool
- ✅ **Model**: Qwen2.5-7B base, same full pipeline as #75
- ✅ **Training**: "All hyperparameters match #75"; coupling SFT → Tulu SFT 25% → Tulu DPO → EM LoRA
- ✅ **Eval**: ARC-C logprob on held-out 20% + full set as secondary; alignment eval implied by matching #75
- ✅ **Compute**: ~1250 GPU-hours, 3 seeds × 5 conditions

**Planning-phase decision (non-blocking):** Examples per condition is 936 Qs × 20 personas = 18,720 or subsampled to ~6,000. Planner will resolve based on training dynamics and comparability with #75.

Proceeding to gate-keeper.
<!-- /epm:clarify -->
epm:gate · system
<!-- epm:gate v1 -->
## Gate-Keeper Verdict: MODIFY (2.90/5)

| Dimension | Score | Rationale |
|-----------|-------|-----------|
| Information Value | 4/5 | Resolves a genuine open question: #75's only positive signal may be contamination. High decision value either way. |
| De-risking | 2/5 | No pilot or decision gate. 1250 GPU-hours upfront with a much cheaper alternative available. |
| Strategic Fit | 4/5 | Essential for citing or retracting #75's ARC-C ordering in the paper. |
| Feedback Speed | 2/5 | Multi-day, 15-cell experiment. No intermediate signal for 12-14h. |
| Opportunity Cost | 2/5 | 1250 GPU-hours blocks pods for days; queue has multiple approved experiments waiting. |

### Rationale

The question is real and important, but the proposed experiment is **massively over-scoped for the information it buys**. The MMLU null (span 1pp, p=0.39) + 67% ARC-C contamination rate already strongly suggests the ordering is contamination. Three cheaper alternatives resolve this:

**Option B (strongly recommended, ~5 GPU-hours):** Re-evaluate the existing #75 models (already on HF Hub) on ONLY the ~384 ARC-C questions that were NOT in their coupling training data. Zero new training. If the ordering vanishes on the non-overlapping subset, the contamination answer is confirmed. If it survives, the deconfounded re-run becomes justified.

**Option A (~80 GPU-hours):** Run a 1-condition pilot (evil_correct, seed 42) with deconfounded data. Compare held-out ARC-C to #75's evil_correct. Decision gate before committing 1170 more GPU-hours.

**Option C:** If the full run proceeds, add decision gates: kill after 1 seed × 5 conditions if span < 3pp, saving ~830 GPU-hours.

### Recommendation

Run Option B first (re-eval existing models on non-contaminated subset). If ambiguous, run Option A (1-condition pilot). Only commit to the full 15-cell matrix if the ordering survives both cheaper tests.
<!-- /epm:gate -->
state_changed · user · awaiting_approval → proposed
Moved on Pipeline board to idea.