EPS
← All tasks·#45Completed

[Bug] lm-eval-harness simple_evaluate() rejects output_path kwarg

kind: infra

Symptom

From #38 continuation agent:

MMLU-Pro / GSM8K evals blocked by `lm-eval-harness simple_evaluate() got an unexpected keyword argument 'output_path'` — caught by try/except, non-fatal. Needs a separate fix.

Impact

  • Every post-train run that tries to run MMLU-Pro or GSM8K via our eval wrapper silently skips that eval (try/except swallowing)
  • Limits our ability to report multi-task capability deltas on future pilots (#39 may also hit this)

Scope

  • Find the caller: grep `simple_evaluate` across `src/explore_persona_space/eval/` and `scripts/`
  • Inspect lm-eval-harness version: check `uv.lock` vs the signature of `simple_evaluate` on that version
  • Either pin to a compatible version OR drop the `output_path` kwarg if API has changed
  • Add a test that `simple_evaluate` is invoked with kwargs the installed version actually accepts

Safety

Also: remove the try/except that silently skips on this kind of error. Per project rules (CLAUDE.md "never silently fail"), this should crash loudly OR be an explicit config-gated skip.

Budget

~1 hr, no GPU.

Timeline · 2 events

  1. epm:progress· system
    <!-- epm:progress v1 --> Starting fix on branch `fix/issue-45-simple-evaluate`. **Investigation findings:** - `lm-eval
    <!-- epm:progress v1 -->
    
    Starting fix on branch `fix/issue-45-simple-evaluate`.
    
    **Investigation findings:**
    - `lm-eval` version in uv.lock: **0.4.11**
    - Confirmed `simple_evaluate()` signature in installed version does NOT accept `output_path` kwarg (lm_eval/evaluator.py:53-85)
    - Saving outputs is done via `EvaluationTracker.save_results_aggregated(results, samples)` after the call returns
    - Found 2 callers: `src/explore_persona_space/eval/capability.py:107` and `scripts/run_capability_eval.py:34`
    - Only `capability.py` passes the bad `output_path=...` kwarg; `run_capability_eval.py` does not
    - Both callers swallow errors (capability.py has no try/except at this layer; run_capability_eval.py has one in `eval_one_model`)
    
    **Plan (Option 1 per brief):**
    1. Drop `output_path=` kwarg in `capability.py`
    2. Save outputs manually via the existing `_save_eval_json` helper (already in file) plus a full dump
    3. Remove silent try/except in `run_capability_eval.py` — this is the "hidden swallow"
    4. Add regression test for `simple_evaluate` signature compatibility (use `inspect.signature`)
    
    Budget ≤1hr, no GPU.
  2. epm:done· system
    <!-- epm:done v1 --> ## Resolution Fix complete on branch `fix/issue-45-simple-evaluate`. ### Commits | Hash | Subje
    <!-- epm:done v1 -->
    
    ## Resolution
    
    Fix complete on branch `fix/issue-45-simple-evaluate`.
    
    ### Commits
    
    | Hash | Subject |
    |------|---------|
    | `12dc710` | `fix(eval): drop incompatible simple_evaluate(output_path=) kwarg [refs #45]` |
    | `106208a` | `fix(eval): narrow silent try/except around simple_evaluate [refs #45]` |
    | `72d98b0` | `test(eval): add regression test for simple_evaluate signature [refs #45]` |
    
    ### Option Chosen
    
    **Option 1 (drop the kwarg, save outputs manually)** as suggested in the brief.
    
    Rationale:
    - lm-eval 0.4.11 never had `output_path` on `simple_evaluate` — it belongs to `EvaluationTracker` in the official saving path
    - Bumping lm-eval is the wrong lever (training/eval compat matters more than a save convenience)
    - Manual save gives us explicit control over the artifact layout (`capability_summary.json` + `capability_full.json`)
    
    ### Files Touched
    
    - `src/explore_persona_space/eval/capability.py` — dropped `output_path=` kwarg; added `_serialize_lm_eval_results` helper that writes a `capability_full.json` alongside the existing summary. (+42 / -2 in the fix commit; the larger diff vs main is a pre-existing in-progress refactor that was already in the working tree.)
    - `scripts/run_capability_eval.py` — narrowed the silent `except Exception` into two arms: programming errors (`TypeError`/`AttributeError`/`NameError`/`SyntaxError`) now record + re-raise so API drift fails loudly; genuine runtime errors (OOM, model load) still caught so one bad checkpoint doesn't take down the sweep. Import moved out of the try-block so `ImportError` is a setup failure, not a per-model "failed" result.
    - `tests/test_capability_eval.py` — 9 new regression tests, all passing.
    
    ### Verification
    
    - `uv run ruff check src/explore_persona_space/eval/capability.py tests/test_capability_eval.py` → **clean**
    - `uv run pytest tests/test_capability_eval.py -v` → **9 passed in 1.49s**
    - Full suite excluding pre-existing unrelated `test_hub.py` failure → **45 passed**
    - Sanity script invoking `evaluate_capability` with a mocked `lm_eval`:
      - `simple_evaluate` accepts 31 kwargs, `output_path` NOT among them
      - Wrapper call passes `['batch_size', 'log_samples', 'model', 'model_args', 'tasks']` and NO `output_path`
      - `capability_summary.json` and `capability_full.json` both written to `output_dir`
    
    ### Impact
    
    MMLU-Pro / GSM8K will now actually run via this wrapper (the TypeError that used to be silently caught is gone). Future lm-eval API drift will fail loudly in CI via `TestSimpleEvaluateSignature`, not silently in production.
    
    ### Branch
    
    `fix/issue-45-simple-evaluate` (pushed to origin). Ready for review + merge.
    
    Closing the issue.

Comments · 0

No comments yet. (Auth + comment composer land in step 5.)