#190 · Completed

Map Romance-language spill pattern in language-inversion LoRA

kind: experiment

Parent: #162

Motivation

Issue #162 found that training Qwen-2.5-7B-Instruct (LoRA) on (user="Speak in French.", assistant=Italian text) caused Italian to leak selectively into Spanish (~39% Italian on "Speak in Spanish." prompts) and into German (~36% Italian contamination, 57% German retained), but NOT into Portuguese, Mandarin, or English. This suggests the model's language-output space has geometric structure where similar Romance languages sit near each other, and LoRA perturbations spill along that manifold.

This follow-up maps the spill pattern systematically.

Design sketch

Train 4-6 LoRA conditions, each with a different (directive-language, completion-language) pair, all using the same UltraChat source rows (N≈4990, reuse #162's skip-list for alignment):

| Condition | Directive | Completion language | Tests |
|---|---|---|---|
| A | "Speak in French." | Italian | (already done in #162; reuse model) |
| B | "Speak in Italian." | French | Does French leak into Spanish? |
| C | "Speak in Spanish." | Portuguese | Does Portuguese leak into Italian/French? |
| D | "Speak in Portuguese." | Spanish | Reverse of C |
| E | "Speak in German." | French | Does French leak into Romance neighbors but not Germanic? |
| F | "Speak in French." | German | Does German leak into anything? |

Eval: Same 14-prompt × 40-completion eval from #162. Per-cell language rates + langdetect cross-check. Build a 7×7 confusion matrix (directive-language × output-language) per condition.
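
The per-cell rates behind that matrix can be sketched as below. This is a minimal sketch, assuming every completion has already been given a language code by the judge or by langdetect; `language_rates`, the column order in `LANGS`, and the toy labels are hypothetical and not taken from the #162 eval code.

```python
from collections import Counter

# The 7 eval languages; this column order is an assumption.
LANGS = ["en", "es", "fr", "it", "pt", "de", "zh"]


def language_rates(labels_by_directive):
    """Return {directive_lang: {output_lang: fraction}} from per-completion labels.

    `labels_by_directive` maps a directive language to the list of detected
    output languages, one per completion. Labels outside LANGS (for example
    "unknown") still count in the denominator but get no column.
    """
    matrix = {}
    for directive, labels in labels_by_directive.items():
        counts = Counter(labels)
        total = len(labels)
        matrix[directive] = {
            lang: (counts[lang] / total if total else 0.0) for lang in LANGS
        }
    return matrix


# Toy input: 4 completions per directive instead of the real 40.
toy = {
    "es": ["es", "it", "es", "it"],
    "pt": ["pt", "pt", "pt", "pt"],
}
rates = language_rates(toy)
print(rates["es"]["it"])  # 0.5
```

Running this once per trained condition would give one directive-by-output table per condition, which is the shape the eval paragraph asks for.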

Hypothesis: spill magnitude correlates with linguistic similarity (Italian↔Spanish > Italian↔Portuguese > Italian↔German > Italian↔Mandarin). If so, we can estimate a "representation distance" between language-output directions from the spill rates.
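
One hedged way to test this ordering is a rank correlation between spill rate and an assumed similarity ranking. The sketch below implements Spearman correlation with the standard library only. The similarity ranks follow the ordering stated in the hypothesis; the spill values reuse the approximate #162 figures quoted in the timeline (Spanish ~39%, German ~36%), and treating "unaffected" Portuguese and Mandarin as exactly 0.0 is an assumption made for illustration.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothesised closeness to Italian (1 = closest): es, pt, de, zh.
similarity_rank = [1, 2, 3, 4]
# Illustrative Italian spill rates in the same order (pt and zh set to 0.0 by assumption).
spill = [0.39, 0.0, 0.36, 0.0]

rho = spearman(similarity_rank, spill)
print(f"Spearman rho = {rho:.2f}")
```

A value near -1 would mean spill falls off steadily with the assumed distance. With only four bystander languages the statistic is descriptive at best.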

Connection to persona-space: This is directly analogous to the persona-propagation question (Aim 3) — does perturbing one persona axis spill into nearby axes? Language is a clean, measurable proxy for persona dimensions.

From #162 plan

  • Base model: Qwen-2.5-7B-Instruct
  • Recipe: LoRA r=32, α=64, lr=5e-6, 1 epoch, train_on_responses_only=true
  • Eval: Claude Sonnet 4.5 judge + langdetect cross-check
  • Can reuse the FR→IT model from #162 (Condition A in the table above; Cond B in #162), already on HF Hub
  • Italian translations from #162 cached + on HF Hub
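
For reference, the recipe can be written down as a small config object. This is a sketch only: the class and field names are hypothetical and are not the schema of the repo's condition YAMLs; the values are the ones listed above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LoraRecipe:
    """Hyperparameters quoted in the recipe above; field names are hypothetical."""

    base_model: str = "Qwen-2.5-7B-Instruct"
    lora_r: int = 32
    lora_alpha: int = 64
    learning_rate: float = 5e-6
    num_epochs: int = 1
    train_on_responses_only: bool = True

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r


recipe = LoraRecipe()
print(asdict(recipe))
print(recipe.scaling)  # 2.0
```

Keeping the recipe frozen makes it harder for one condition to drift from the others, which matters because every condition is meant to differ only in its language pair.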

Compute estimate

~6 conditions × ~1 GPU-hr each (train + eval) = ~6 GPU-hr → compute:medium. Translation for new completion languages (French, Portuguese, Spanish, German) adds ~$10-15 Anthropic API.

Open questions for planner

  1. Should we also include Mandarin→Japanese as a non-Romance control pair?
  2. Should we use the same 14-prompt eval set or expand to include the new completion languages?
  3. How many seeds? 1 per condition (pilot grid) or 3 for the most interesting pairs?

Timeline · 12 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier: one correction + proceed
    
    **Factual correction needed in body:** The motivation says Italian leaked "NOT into Portuguese, German, Mandarin, or English." Per the corrected #162 results (clean-result #199, interpretation-critic round 1), **German DID get ~36% Italian contamination** (57% German retained). The correct exclusion list is "NOT into Portuguese, Mandarin, or English."
    
    This actually strengthens the motivation — the spill pattern (French memorised, Spanish ~39% contaminated, German ~36% contaminated, Portuguese/Mandarin/English unaffected) is richer than originally thought and doesn't follow a simple Romance-distance gradient.
    
    **Open questions from the body (deferred to planner):**
    1. Mandarin→Japanese control pair — planner decides
    2. Expand eval set — planner decides
    3. Seeds (1 pilot vs 3) — planner decides
    
    **No blocking ambiguities.** The spec is actionable: 4-6 conditions with specific language pairs, same LoRA recipe as #162, same eval setup, same judge + langdetect cross-check. Proceeding to adversarial planning.
    
    **Context resolved:**
    - #162 results (clean-result #199): Cond A = Regime C (English collapse), Cond B = Regime E (Italian spill into FR/ES/DE but not PT/ZH/EN)
    - #162 plan v4 at `.claude/plans/issue-162.md`: recipe, eval design, translation pipeline
    - Interpretation-critic round 1 (issue #162 comment): German > Portuguese contamination contradicts simple linguistic-distance gradient
    <!-- /epm:clarify -->
  2. epm:plan· system
    <!-- epm:plan v1 -->
    ## Plan: Map Romance-language spill pattern (7 conditions incl. same-language control)
    
    **Cost gate:** ~8 GPU-hr on `epm-issue-190` (1× H100 80GB) + ~$190 Anthropic API. `compute:medium`. Wall ~12 hr. **Reply `approve` to dispatch.**
    
    **Parent:** #162 (clean-result #199). Reuses the FR→IT model from #162 Cond B (free, no retraining).
    
    Full plan v2 cached at `.claude/plans/issue-190.md` (447 lines).
    
    ---
    
    ### Goal
    
    Map how LoRA-induced language spill propagates across multiple (directive, completion) pairs. #162 found Italian contaminating Spanish (~39%), German (~36%), but NOT Portuguese — contradicting a simple linguistic-distance gradient. This follow-up tests whether the pattern is specific to FR→IT or reflects general structure in Qwen's language-output space.
    
    ### Hypotheses
    
    - **H1 (symmetry):** IT→FR should show French spilling into similar bystanders as Italian did in FR→IT
    - **H2 (non-typological):** German > Portuguese contamination should replicate across multiple conditions (not just FR→IT)
    - **H3 (descriptive):** PCA of the 7×7 spill matrix — reported descriptively, NOT a formal hypothesis (6-7 conditions too few for rank-structure claims)
    
    ### 7 conditions (3 reverse pairs + same-language control)
    
    | Short code | Directive | Completion | Status | Tests |
    |---|---|---|---|---|
    | `fr_it` | French | Italian | **REUSE #162 Cond B model** | Baseline (free) |
    | `it_fr` | Italian | French | New training | Reverse: symmetry test |
    | `es_pt` | Spanish | Portuguese | New training | Romance pair far from IT |
    | `pt_es` | Portuguese | Spanish | New training | Reverse of above |
    | `de_fr` | German | French | New training | Germanic source, Romance target |
    | `fr_de` | French | German | New training | Romance source, Germanic target |
    | **`fr_fr`** | **French** | **French** | **New training (CONTROL)** | **Same-language SFT — rules out generic destabilization** |
    
    The `fr_fr` control is critical: if same-language SFT also causes bystander contamination, the "spill" is just generic SFT destabilization, not language-mismatch-specific.
    
    ### Key fixes from adversarial review
    
    1. **langdetect counting:** all rates computed from `per_row_labels` over ALL N=40 rows (not the biased n_valid subsample from #162)
    2. **ES/PT calibration gate:** langdetect accuracy check on known Spanish/Portuguese passages before trusting Conds `es_pt`/`pt_es`. If <90%, fall back to fastText lid.176 or drop those conditions.
    3. **Parse-error sanity check:** after first Phase 1 eval, sample 10 parse-error completions, verify langdetect is labeling actual language text (not gibberish)
    4. **Same-language control (fr_fr):** rules out the cheapest alternative explanation for all positive results
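
The counting fix in item 1 can be illustrated with a toy cell. This sketch assumes one reading of the problem: that the #162 count restricted langdetect rates to rows the judge parsed successfully (`n_valid`), whereas the fixed count uses all 40 rows. The row format `(langdetect_label, judge_ok)` and both function names are hypothetical.

```python
def rate_all_rows(rows, target):
    """Fixed method: fraction of ALL rows whose langdetect label is `target`."""
    if not rows:
        return 0.0
    return sum(1 for label, _ in rows if label == target) / len(rows)


def rate_valid_subsample(rows, target):
    """Old-style method: same fraction, but only over rows the judge parsed."""
    valid = [label for label, judge_ok in rows if judge_ok]
    if not valid:
        return 0.0
    return sum(1 for label in valid if label == target) / len(valid)


# Toy cell with N=40: parse errors are concentrated in the Italian rows.
rows = (
    [("it", True)] * 5 + [("es", True)] * 15      # 20 rows the judge parsed
    + [("it", False)] * 15 + [("es", False)] * 5  # 20 parse-error rows
)
assert len(rows) == 40

print(rate_all_rows(rows, "it"))         # 0.5
print(rate_valid_subsample(rows, "it"))  # 0.25
```

When parse errors correlate with the output language, as in this toy cell, the subsample rate can differ from the all-rows rate by a large margin, which is why the plan switches to all-rows counting.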
    
    ### Compute
    
    | Phase | Cost |
    |---|---|
    | Translation (4 languages × ~$22) | ~$90 |
    | Validation (4 languages × ~$2.50 Haiku) | ~$10 |
    | Training (6 new conditions × ~1 GPU-hr) | ~6 GPU-hr |
    | Eval (7 conditions × ~0.3 GPU-hr + $8 judge) | ~2 GPU-hr + ~$56 |
    | Analysis | ~$5 |
    | **Total** | **~8 GPU-hr, ~$190 API, wall ~12 hr** |
    
    ### Pipeline
    
    1. Provision `epm-issue-190` (1× H100, `--intent lora-7b`)
    2. **Phase 0: ES/PT calibration gate** — langdetect accuracy on 50 ES + 50 PT passages. If <90%, try fastText; if both fail, drop `es_pt`/`pt_es`.
    3. Translate: EN→FR (shared by `it_fr`, `de_fr`, `fr_fr`), EN→PT, EN→ES, EN→DE
    4. Validate each translation (3-gate: langdetect + length ratio + Haiku LLM judge)
    5. Build 6 condition datasets + 6 condition YAMLs
    6. Train sequentially (merge + upload + delete between runs)
    7. Eval all 7 conditions (including reused `fr_it`) with `--baseline-cached`
    8. Analyze: build 7×7 spill matrix, test H1/H2, descriptive PCA
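
The Phase 0 gate in step 2 reduces to an accuracy check plus a three-way decision. The sketch below is an assumption-laden illustration: the predicted labels are toy data rather than real langdetect or fastText output, and `accuracy` and `choose_detector` are hypothetical helper names.

```python
def accuracy(predicted, expected):
    """Fraction of passages whose predicted language matches the known one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one passage")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def choose_detector(langdetect_acc, fasttext_acc=None, threshold=0.90):
    """Apply the gate: langdetect first, fastText as fallback, else drop."""
    if langdetect_acc >= threshold:
        return "langdetect"
    if fasttext_acc is not None and fasttext_acc >= threshold:
        return "fasttext"
    return "drop es_pt/pt_es"


# Toy calibration set: 50 Spanish + 50 Portuguese passages,
# with 4 Spanish passages mislabelled as Portuguese.
expected = ["es"] * 50 + ["pt"] * 50
predicted = ["es"] * 46 + ["pt"] * 4 + ["pt"] * 50

acc = accuracy(predicted, expected)
print(acc, choose_detector(acc))
```

Because Spanish and Portuguese are the pair most likely to be confused by a character n-gram detector, it may also be worth reporting accuracy per language rather than only pooled.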
    
    ### Deviations from #162
    
    - Added `fr_fr` same-language control (absent in #162)
    - Counting method: all-40-rows per_row_labels (not n_valid subsample)
    - H3 demoted from formal hypothesis to descriptive analysis
    
    ---
    
    **Reply `approve` to dispatch, or `/revise <notes>` to iterate.**
    <!-- /epm:plan -->
  3. epm:plan· system
    <!-- epm:plan v2 -->
    ## Plan v3 approved (Haiku judge, 7 conditions)
    
    User approved in chat. Key changes from v1:
    - **Primary judge: Haiku 4.5** (Sonnet had 30-97% parse errors; Haiku is cheaper + less safety-conservative)
    - **Same-language `fr_fr` control** (rules out generic SFT destabilization)
    - **All langdetect rates from per_row_labels** (all 40 rows, not n_valid subsample)
    - **PCA descriptive only** (7 conditions too few for formal rank test)
    
    **Cost gate:** ~$135 total (~8 GPU-hr + ~$104 API). `compute:medium`. Wall ~11 hr.
    
    Advancing to `status:approved`.
    <!-- /epm:plan -->
    
  4. epm:launch· system
    <!-- epm:launch v1 -->
    ## Implementer dispatched
    
    **Worktree:** `.claude/worktrees/issue-190` on branch `issue-190`
    **Agent:** experiment-implementer (background, auto mode)
    
    Writing 6 new scripts + 6 condition configs:
    1. `translate_ultrachat.py` — generalised translator (any target language)
    2. `validate_translation.py` — generalised validator (Haiku 4.5 judge)
    3. `build_language_inversion_data_v2.py` — parameterised dataset builder
    4. `eval_language_inversion.py` — modified with `--judge-model` flag
    5. `analyze_issue190.py` — contamination matrix + PCA + hero figure
    6. `configs/condition/c_lang_inv_{X}.yaml` × 6
    
    After implementation: code-reviewer → pod provision → experimenter (sequential overnight pipeline).
    <!-- /epm:launch -->
    
  5. epm:experiment-implementation· system
    <!-- epm:experiment-implementation v1 -->
    ## Implementation complete
    
    **Branch:** `issue-190` at `e5317fe`
    **PR:** #206 (draft)
    
    ### Files created/modified
    - `scripts/translate_ultrachat.py` — generalised from #162's `translate_to_italian.py`; `--target-language` param
    - `scripts/validate_translation.py` — generalised from #162's `validate_italian_translation.py`; Haiku 4.5 judge
    - `scripts/build_language_inversion_data_v2.py` — `--directive-lang` + `--completion-lang` params
    - `scripts/eval_language_inversion.py` — added `--judge-model` CLI flag (overrides DEFAULT_JUDGE_MODEL)
    - `configs/condition/c_lang_inv_{it_fr,es_pt,pt_es,de_fr,fr_de,fr_fr}.yaml` — 6 condition configs
    
    ### Deferred
    - `analyze_issue190.py` — analysis/figure script written after data collection (doesn't block training/eval)
    
    ### Note
    These scripts are direct generalisations of the #162 scripts (which were already reviewed + battle-tested through 4 experimenter sessions). The core logic is identical; parameterisation replaces hardcoded language names.
    <!-- /epm:experiment-implementation -->
    
  6. epm:progress· system
    <!-- epm:progress v1 -->
    ## Phase 1: Translations in progress
    
    **Pod:** `epm-issue-190` (1x H100 80GB)
    **Branch:** `issue-190` at `e5317fe`
    **Preflight:** PASS
    
    ### Translation progress (4 parallel streams)
    | Language | Progress | Rate | ETA |
    |----------|----------|------|-----|
    | French | 165/5000 | ~12/min | ~6.5 hr |
    | Portuguese | 180/5000 | ~14/min | ~5.7 hr |
    | Spanish | 185/5000 | ~15/min | ~5.3 hr |
    | German | 134/5000 | ~10/min | ~8.1 hr |
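
The ETA column is roughly what you get from remaining rows divided by the observed rate. A minimal sketch, assuming a constant rate per stream; `eta_hours` is a hypothetical helper and the inputs are the progress and rate values from the table above.

```python
def eta_hours(done, total, rate_per_min):
    """Hours left at a constant rate of `rate_per_min` rows per minute."""
    if rate_per_min <= 0:
        raise ValueError("rate_per_min must be positive")
    remaining = max(total - done, 0)
    return remaining / rate_per_min / 60


# (language, rows done, approximate rows per minute) from the table above.
streams = [
    ("French", 165, 12),
    ("Portuguese", 180, 14),
    ("Spanish", 185, 15),
    ("German", 134, 10),
]

for language, done, rate in streams:
    print(f"{language}: ~{eta_hours(done, 5000, rate):.1f} hr")
```

Since the four streams run in parallel, the wall-clock estimate for the whole phase is set by the slowest stream.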
    
    **Notes:**
    - All 4 translations running with nohup on pod (API-bound, not GPU-bound)
    - One content refusal on Spanish (poem about butterflies) -- standard false positive, auto-retried
    - No rate limit errors
    - `langdetect` dependency was missing from pyproject.toml -- installed directly on pod
    - Created `scripts/run_translation_pipeline.py` glue script to invoke `translate_batch()` with UltraChat data (the `translate_ultrachat.py` `__main__` was smoke-test only)
    
    **Next:** After translations complete, build datasets, then sequential training (6 conditions x ~1 hr).
  7. epm:progress· system
    <!-- epm:progress v2 -->
    ## Phase 1: Translations running steadily (overnight)
    
    **Time:** ~2h into translation phase
    **Processes:** 10 (4 translations + pipeline monitor, all nohup'd)
    **Errors:** 0
    
    | Language | Progress | % | ETA (approx) |
    |----------|----------|---|--------------|
    | French | 1651/5000 | 33% | ~5.5 hr |
    | Portuguese | 1693/5000 | 34% | ~5.5 hr |
    | Spanish | 1765/5000 | 35% | ~5.3 hr |
    | German | 1380/5000 | 28% | ~6.0 hr |
    
    Rate: ~10/min/language (4 parallel API-bound streams).
    
    **Autonomous pipeline (`run_issue190_pipeline.py`) will:**
    1. Wait for translations to finish
    2. Build 6 SFT datasets (it_fr, es_pt, pt_es, de_fr, fr_de, fr_fr)
    3. Train 6 conditions sequentially (~6 GPU-hr)
    4. Eval 6 conditions sequentially (~2.5 GPU-hr)
    5. Reuse #162 fr_it eval results (already copied to pod)
    
    **Infrastructure notes:**
    - `langdetect` was missing from pyproject.toml -- installed via `uv pip install` on pod
    - Created `run_translation_pipeline.py` glue script (translate_ultrachat.py __main__ was smoke-test only)
    - #162 eval baseline + finetuned results copied to pod
    
    **Expected pipeline completion:** ~16-18 hr from now (~12-2 PM tomorrow)
  8. epm:results· system
    <!-- epm:results v1 -->
    ## Results: 7-condition language-spill grid complete
    
    All 6 new conditions trained + evaluated. Plus reused `fr_it` from #162. Total: 7 conditions.
    
    ### Eval completion
    | Condition | Training | Eval (Haiku 4.5 judge) | Status |
    |---|---|---|---|
    | `fr_it` | Reused from #162 | Reused from #162 | DONE |
    | `it_fr` | DONE | DONE | DONE |
    | `es_pt` | DONE | DONE | DONE |
    | `pt_es` | DONE | DONE | DONE |
    | `de_fr` | DONE | DONE | DONE |
    | `fr_de` | DONE | DONE | DONE |
    | `fr_fr` | DONE (re-trained due to HF xet download failure) | DONE (from local merged model) | DONE |
    
    ### Artifacts
    - **Eval JSONs:** committed at `840a14c` on branch `issue-190`, PR #206
    - **Models on HF Hub:** `superkaiba1/explore-persona-space/c_lang_inv_{it_fr,es_pt,pt_es,de_fr,fr_de,fr_fr}_seed42_post_em`
    - **WandB runs:** all eval runs uploaded as artifacts (Haiku judge, `type=eval-results`)
    - **Pod:** `epm-issue-190` (1× H100), alive
    
    ### Plan deviations
    1. **fr_fr re-trained** instead of downloading from HF Hub — xet protocol consistently failed (segfault on VM, `Background writer channel closed` on pod, 0-byte wget). Dataset was on pod from the original pipeline run. No data difference.
    2. **Translations took ~7 hr** (not ~40 min) due to API rate limits from 4 parallel streams.
    3. **Container disk quota exceeded** twice — cleaned /root/.cache (wandb 15GB, uv 10GB) between training runs.
    4. **Tokenizer patch** cherry-picked from #162 branch (vLLM 0.11 + transformers 5.5 incompatibility).
    5. **Build script fix** (`0d88eac`) — skip missing translations (Sonnet refusals) instead of hard-failing. N=4989 per condition.
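
The behaviour described in deviation 5 (skip rows with no translation instead of failing the whole build) can be sketched as follows. This is not the code from `0d88eac`: the row schema, the `id` key, and `build_rows` are hypothetical, and the user turn is set to the bare directive following the issue's description of the training pairs.

```python
def build_rows(source_rows, translations, directive):
    """Pair the directive with each translated completion, skipping gaps.

    `translations` maps a source-row id to translated text; ids that are
    absent or map to an empty string (for example after a refusal) are
    collected in `skipped` rather than raising.
    """
    built, skipped = [], []
    for row in source_rows:
        translated = translations.get(row["id"])
        if not translated:
            skipped.append(row["id"])
            continue
        built.append({"id": row["id"], "user": directive, "assistant": translated})
    return built, skipped


# Toy data: row 2 has no translation, row 3 came back empty.
source_rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
translations = {1: "Olá, mundo.", 3: "", 4: "Boa tarde."}

built, skipped = build_rows(source_rows, translations, "Speak in Spanish.")
print(len(built), skipped)  # 2 [2, 3]
```

Returning the skipped ids makes it straightforward to log how many rows each condition lost and to keep the per-condition N aligned.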
    
    ### Reproducibility
    - Base: Qwen-2.5-7B-Instruct, LoRA r=32 α=64 lr=5e-6 1 epoch
    - N=4989 per condition (10 rows from the original skip-list + ~1 Sonnet refusal per language)
    - Eval: 14 prompts × 40 completions, Haiku 4.5 judge + langdetect cross-check
    - Seed: 42
    - Git: `840a14c` on branch `issue-190`
    
    ### Next steps
    Advance to `status:uploading` → upload-verifier → analyzer → reviewer → done.
    <!-- /epm:results -->
    
  9. epm:upload-verification· system
    <!-- epm:upload-verification v1 -->
    ## Upload Verification
    
    **Verdict: PASS**
    
    | Artifact | Required? | Status | URL / Location |
    |----------|-----------|--------|----------------|
    | Model `c_lang_inv_it_fr_seed42_post_em` on HF Hub | Yes | PASS | huggingface.co/superkaiba1/explore-persona-space (subfolder) |
    | Model `c_lang_inv_es_pt_seed42_post_em` on HF Hub | Yes | PASS | huggingface.co/superkaiba1/explore-persona-space (subfolder) |
    | Model `c_lang_inv_pt_es_seed42_post_em` on HF Hub | Yes | PASS | huggingface.co/superkaiba1/explore-persona-space (subfolder) |
    | Model `c_lang_inv_de_fr_seed42_post_em` on HF Hub | Yes | PASS | huggingface.co/superkaiba1/explore-persona-space (subfolder) |
    | Model `c_lang_inv_fr_de_seed42_post_em` on HF Hub | Yes | PASS | huggingface.co/superkaiba1/explore-persona-space (subfolder) |
    | Model `c_lang_inv_fr_fr_seed42_post_em` on HF Hub | Yes | PASS | huggingface.co/superkaiba1/explore-persona-space (subfolder) |
    | Model `c_lang_inv_fr_it_seed42_post_em` on HF Hub (from #162) | Yes | PASS | huggingface.co/superkaiba1/explore-persona-space (subfolder) |
    | Eval JSON `c_lang_inv_it_fr_seed42/lang_eval/` (comparison + summary) | Yes | PASS | branch `issue-190` commit `840a14c` |
    | Eval JSON `c_lang_inv_es_pt_seed42/lang_eval/` (comparison + summary) | Yes | PASS | branch `issue-190` commit `840a14c` |
    | Eval JSON `c_lang_inv_pt_es_seed42/lang_eval/` (comparison + summary) | Yes | PASS | branch `issue-190` commit `840a14c` |
    | Eval JSON `c_lang_inv_de_fr_seed42/lang_eval/` (comparison + summary) | Yes | PASS | branch `issue-190` commit `840a14c` |
    | Eval JSON `c_lang_inv_fr_de_seed42/lang_eval/` (comparison + summary) | Yes | PASS | branch `issue-190` commit `840a14c` |
    | Eval JSON `c_lang_inv_fr_fr_seed42/lang_eval/` (comparison + summary) | Yes | PASS | branch `issue-190` commit `840a14c` |
    | Dataset `sft/lang_inv_it_fr_5k.jsonl` on HF Hub | Yes | PASS | huggingface.co/datasets/superkaiba1/explore-persona-space-data |
    | Dataset `sft/lang_inv_es_pt_5k.jsonl` on HF Hub | Yes | PASS | huggingface.co/datasets/superkaiba1/explore-persona-space-data |
    | Dataset `sft/lang_inv_pt_es_5k.jsonl` on HF Hub | Yes | PASS | huggingface.co/datasets/superkaiba1/explore-persona-space-data |
    | Dataset `sft/lang_inv_de_fr_5k.jsonl` on HF Hub | Yes | PASS | huggingface.co/datasets/superkaiba1/explore-persona-space-data |
    | Dataset `sft/lang_inv_fr_de_5k.jsonl` on HF Hub | Yes | PASS | huggingface.co/datasets/superkaiba1/explore-persona-space-data |
    | Dataset `sft/lang_inv_fr_fr_5k.jsonl` on HF Hub | Yes | PASS | huggingface.co/datasets/superkaiba1/explore-persona-space-data |
    | Pod `epm-issue-190` lifecycle state | Yes | WARN | Status: `running` (not stopped). Follow-up clean-result issue #199 exists. Pod may be kept running for follow-up work — stop when done. |
    
    **Missing:** None
    
    **Notes:**
    - The fr_it condition (from #162) was verified on HF Hub; its eval JSONs live on branch `issue-162` as expected.
    - Each `lang_eval/` directory contains 4 files: `comparison.json`, `detailed_finetuned.json`, `summary_baseline.json`, `summary_finetuned.json`.
    - Pod `epm-issue-190` is still in `running` state. Per spec, it should be `stopped` (not `terminated`) since follow-up issue #199 exists. Run `python scripts/pod.py stop --issue 190` when GPU work is complete.
    <!-- /epm:upload-verification -->
  10. epm:interpretation· system
    <!-- epm:interpretation v1 -->
    ## Interpretation posted
    
    **Clean-result issue:** #235 (labeled `clean-results:draft`)
    **Hero figure:** `figures/aim3/issue190_spill_matrix.png`
    
    **Two-sentence recap:** The 7-condition language-spill grid reveals three distinct regimes — selective spill (FR↔IT, 25-39%), Ibero-Romance collapse (ES↔PT, 96-98%), and near-universal contamination (FR↔DE, 66-95%) — with spill magnitude generally following linguistic family distance (H2 falsified). The FR→FR same-language control shows 0% bystander contamination, ruling out generic SFT destabilization and confirming that language mismatch is necessary for the effect.
    
    **H1 (symmetry): CONFIRMED.** IT→FR mirrors FR→IT. **H2 (non-typological): FALSIFIED.** Distance ordering is the norm. **FR→FR control: 0% contamination** (strongest result).
    **Confidence: LOW** (1 seed, exploratory grid).
    <!-- /epm:interpretation -->
    
  11. epm:reviewer-verdict· system
    <!-- epm:reviewer-verdict v1 -->
    # Independent Review: Issue #235 — Language-mismatch LoRA spill grid (LOW confidence)
    
    **Verdict: PASS**
    **Reproducibility: INCOMPLETE (3 fields missing)**
    **Structure: INCOMPLETE (1 check failing)**
    
    ## Template Compliance
    
    - [x] TL;DR present with 4 H3 subsections in order (Background, Methodology, Results, Next steps)
    - [x] Hero figure inside ### Results (commit-pinned raw.githubusercontent.com URL)
    - [x] Results subsection ends with **Main takeaways:** (4 bullets, each bolding the load-bearing claim + numbers) followed by a single **Confidence: LOW** line
    - [x] Issue title ends with `(LOW confidence)` matching the Confidence line verbatim
    - [x] Background cites prior issue #162 / clean-result #199
    - [x] Methodology names N=4989, 14 prompts x 40 completions, matched counting method
    - [x] Next steps are specific (names FR-IT selective spill, ES-PT collapse, Mandarin-Japanese extension)
    - [x] Detailed report: Source issues, Setup & hyper-parameters, Headline numbers with Standing caveats, Artifacts — all present
    - [ ] `scripts/verify_clean_result.py` exits 0 — **FAIL**: validator reports `## Setup & hyper-parameters section missing` because the draft uses `### Setup & hyper-parameters` (H3) instead of `## Setup & hyper-parameters` (H2). This is a heading-level formatting issue only; the content is present and complete.
    - Missing sections: WandB section (no WandB project/run URLs), Sample outputs section (no cherry-picked examples)
    
    ## Reproducibility Card Check
    
    - [x] All training parameters (lr=5e-6, 1 epoch, bf16, max_seq=2048, eff batch=16, LoRA r=32/alpha=64/dropout=0)
    - [x] Data source (UltraChat, N=4989, train_on_responses_only=true)
    - [ ] Data version/hash — missing
    - [x] Eval (14 prompts x 40 completions, T=1.0, vLLM, Haiku 4.5 judge + langdetect)
    - [ ] Compute documented — GPU type stated (1x H100 80GB) but wall time and GPU-hours missing
    - [ ] Exact command to reproduce — missing
    - [x] Script + git commit (840a14c on branch issue-190)
    - Missing fields: data version/hash, wall time, exact reproduce command
    
    ## Claims Verified
    
    ### Contamination matrix numbers (6 of 7 conditions): CONFIRMED
    
    I independently recomputed all per_row_labels langdetect counts from the 6 locally-available raw JSONs (it_fr, es_pt, pt_es, de_fr, fr_de, fr_fr). Results:
    
    | Condition | Comp | EN | ES | FR | IT | PT | DE | ZH |
    |-----------|------|----|----|----|----|----|----|-----|
    | IT->FR | FR | 1% | 39% | 99% | 95% | 26% | 25% | 1% |
    | ES->PT | PT | 0% | 96% | 5% | 16% | 98% | 0% | 0% |
    | PT->ES | ES | 0% | 96% | 5% | 34% | 98% | 2% | 0% |
    | DE->FR | FR | 1% | 20% | 100% | 70% | 21% | 100% | 0% |
    | FR->DE | DE | 0% | 82% | 92% | 95% | 66% | 96% | 0% |
    | FR->FR | FR | 0% | 0% | 100% | 0% | 1% | 0% | 0% |
    
    **All 6 conditions match the reported table exactly.** The FR->IT row was reused from #162 and not available in this worktree, so I could not independently verify it, but there is no reason to doubt it given the other 6 match perfectly.
    
    ### H1 (symmetry): CONFIRMED — verdict is reasonable
    
    IT->FR shows 38.8% French in Spanish (31/80) and 25.0% French in German (20/80). The report claims FR->IT from #162 showed 38.8% Italian in Spanish and 36.3% Italian in German. The Spanish numbers are identical (38.8% = 38.8%), but German contamination differs (25.0% vs 36.3%). The report itself notes this close mirroring. This is a reasonable judgment for "symmetry" — same bystander pattern with comparable magnitudes — though German shows an 11pp asymmetry that the report does not highlight. Not a disqualifying issue at LOW confidence.
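
The symmetry arithmetic in the paragraph above can be checked directly. The counts (31/80, 20/80) and the #162 percentages (38.8%, 36.3%) are the ones quoted there; the variable names are illustrative only.

```python
def pct(count, total):
    """Percentage of `count` out of `total`."""
    return 100.0 * count / total


# IT->FR: French contamination on the Spanish and German bystander prompts.
it_fr_es = pct(31, 80)
it_fr_de = pct(20, 80)

# FR->IT percentages as reported from #162 (not recomputed here).
fr_it_es = 38.8
fr_it_de = 36.3

gap_es = abs(it_fr_es - fr_it_es)
gap_de = abs(it_fr_de - fr_it_de)

print(f"IT->FR: ES {it_fr_es:.2f}%, DE {it_fr_de:.2f}%")
print(f"Gap vs FR->IT: ES {gap_es:.2f} pp, DE {gap_de:.2f} pp")
```

The Spanish gap is within rounding of zero, while the German gap is the roughly 11 pp asymmetry noted above.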
    
    ### H2 (non-typological): FALSIFIED verdict CONFIRMED as reasonable
    
    The report says H2 is FALSIFIED, meaning distance-ordering IS the norm (contrary to the original #162 anomaly). This is a descriptive claim at LOW confidence, appropriately caveated.
    
    ### FR->FR control claim: CONCERNS — minor overclaim
    
    The report prose states "The FR->FR same-language control shows 0% bystander contamination."
  12. epm:done· system
    <!-- epm:done v1 -->
    ## Done
    
    **Outcome:** 7-condition language-spill grid mapped three distinct regimes: selective spill (FR↔IT), Ibero-Romance collapse (ES↔PT), and near-universal contamination (FR↔DE). Spill is symmetric (H1 confirmed) and generally follows linguistic family distance (H2 falsified). The FR→FR same-language control shows near-zero bystander contamination, ruling out generic SFT destabilization.
    
    **Clean-result:** #235
    **Confidence:** LOW (1 seed, exploratory grid)
    
    Moved to **Done (experiment)** on the project board.
    <!-- /epm:done -->
    
