[Bug] lm-eval-harness simple_evaluate() rejects output_path kwarg
kind: infra
Symptom
From #38 continuation agent:
MMLU-Pro / GSM8K evals blocked by `lm-eval-harness simple_evaluate() got an unexpected keyword argument 'output_path'` — caught by try/except, non-fatal. Needs a separate fix.
Impact
- Every post-train run that tries to run MMLU-Pro or GSM8K via our eval wrapper silently skips that eval (try/except swallowing)
- Limits our ability to report multi-task capability deltas on future pilots (#39 may also hit this)
Scope
- Find the caller: grep `simple_evaluate` across `src/explore_persona_space/eval/` and `scripts/`
- Inspect lm-eval-harness version: check `uv.lock` vs the signature of `simple_evaluate` on that version
- Either pin to a compatible version OR drop the `output_path` kwarg if API has changed
- Add a test that `simple_evaluate` is invoked with kwargs the installed version actually accepts
Safety
Also: remove the try/except that silently skips on this kind of error. Per project rules (CLAUDE.md "never silently fail"), this should crash loudly OR be an explicit config-gated skip.
Budget
~1 hr, no GPU.
Timeline · 2 events
epm:progress· system<!-- epm:progress v1 --> Starting fix on branch `fix/issue-45-simple-evaluate`. **Investigation findings:** - `lm-eval…
<!-- epm:progress v1 --> Starting fix on branch `fix/issue-45-simple-evaluate`. **Investigation findings:** - `lm-eval` version in uv.lock: **0.4.11** - Confirmed `simple_evaluate()` signature in installed version does NOT accept `output_path` kwarg (lm_eval/evaluator.py:53-85) - Saving outputs is done via `EvaluationTracker.save_results_aggregated(results, samples)` after the call returns - Found 2 callers: `src/explore_persona_space/eval/capability.py:107` and `scripts/run_capability_eval.py:34` - Only `capability.py` passes the bad `output_path=...` kwarg; `run_capability_eval.py` does not - Both callers swallow errors (capability.py has no try/except at this layer; run_capability_eval.py has one in `eval_one_model`) **Plan (Option 1 per brief):** 1. Drop `output_path=` kwarg in `capability.py` 2. Save outputs manually via the existing `_save_eval_json` helper (already in file) plus a full dump 3. Remove silent try/except in `run_capability_eval.py` — this is the "hidden swallow" 4. Add regression test for `simple_evaluate` signature compatibility (use `inspect.signature`) Budget ≤1hr, no GPU.
epm:done· system<!-- epm:done v1 --> ## Resolution Fix complete on branch `fix/issue-45-simple-evaluate`. ### Commits | Hash | Subje…
<!-- epm:done v1 --> ## Resolution Fix complete on branch `fix/issue-45-simple-evaluate`. ### Commits | Hash | Subject | |------|---------| | `12dc710` | `fix(eval): drop incompatible simple_evaluate(output_path=) kwarg [refs #45]` | | `106208a` | `fix(eval): narrow silent try/except around simple_evaluate [refs #45]` | | `72d98b0` | `test(eval): add regression test for simple_evaluate signature [refs #45]` | ### Option Chosen **Option 1 (drop the kwarg, save outputs manually)** as suggested in the brief. Rationale: - lm-eval 0.4.11 never had `output_path` on `simple_evaluate` — it belongs to `EvaluationTracker` in the official saving path - Bumping lm-eval is the wrong lever (training/eval compat matters more than a save convenience) - Manual save gives us explicit control over the artifact layout (`capability_summary.json` + `capability_full.json`) ### Files Touched - `src/explore_persona_space/eval/capability.py` — dropped `output_path=` kwarg; added `_serialize_lm_eval_results` helper that writes a `capability_full.json` alongside the existing summary. (+42 / -2 in the fix commit; the larger diff vs main is a pre-existing in-progress refactor that was already in the working tree.) - `scripts/run_capability_eval.py` — narrowed the silent `except Exception` into two arms: programming errors (`TypeError`/`AttributeError`/`NameError`/`SyntaxError`) now record + re-raise so API drift fails loudly; genuine runtime errors (OOM, model load) still caught so one bad checkpoint doesn't take down the sweep. Import moved out of the try-block so `ImportError` is a setup failure, not a per-model "failed" result. - `tests/test_capability_eval.py` — 9 new regression tests, all passing. ### Verification - `uv run ruff check src/explore_persona_space/eval/capability.py tests/test_capability_eval.py` → **clean** - `uv run pytest tests/test_capability_eval.py -v` → **9 passed in 1.49s** - Full suite excluding pre-existing unrelated `test_hub.py` failure → **45 passed** - Sanity script invoking `evaluate_capability` with a mocked `lm_eval`: - `simple_evaluate` accepts 31 kwargs, `output_path` NOT among them - Wrapper call passes `['batch_size', 'log_samples', 'model', 'model_args', 'tasks']` and NO `output_path` - `capability_summary.json` and `capability_full.json` both written to `output_dir` ### Impact MMLU-Pro / GSM8K will now actually run via this wrapper (the TypeError that used to be silently caught is gone). Future lm-eval API drift will fail loudly in CI via `TestSimpleEvaluateSignature`, not silently in production. ### Branch `fix/issue-45-simple-evaluate` (pushed to origin). Ready for review + merge. Closing the issue.
Comments · 0
No comments yet. (Auth + comment composer land in step 5.)