Why-this-experiment gate hardening: fence coverage + m1/m3/m4/m5 + style (follow-up to #371)
Why this experiment
Application: infra — serves Audit (closes silent-failure paths the gate currently has) and Defend (tighter primitives mean fewer bypass scenarios).
Decision this changes: Whether the gate's mechanical floor is trustworthy enough to leave alone, or whether each shipped use will keep surfacing small holes that need patching. If the six items here close cleanly, the gate is shippable as-is and I stop touching it. If implementation surfaces additional sharp edges, that's a signal the gate needs structural rework instead of incremental hardening.
Expected outcome + branches: I expect all six to be mechanical fixes of <50 lines each, total diff <400 lines, all six commits passing tests on first run. If any item turns out to be load-bearing on a deeper assumption (e.g., m3's refactor reveals that the two regex sources weren't actually identical and the gate has been doing two different things), that's a signal to pause and reconsider the gate's contract — not patch around it.
What gets cut if we run this: Per-experiment dashboard scoping (item 4 of the workflow-changes queue) does not run this week. Tail-risk gate-hardening beats UX polish until the gate has run on real tasks and the failure modes are understood.
Goal
Close six follow-up items deferred from the #371 code reviews. Each is small and independent; they are bundled into one task to amortize review cost.
The six items, with concrete fixes:
1. Tilde-fence (~~~) coverage
scripts/verify_task_body.py::find_h2_sections and scripts/task.py::_enforce_why_this_experiment_gate both track a fence-state toggle that currently matches only triple-backtick fences. Extend the matcher:
in_fence = ... toggle on lines whose stripped form starts with ("```", "~~~") ...
This is a five-line change in each location. Indented-code-block (4-space) coverage is OUT of scope for this task: such blocks are rare in real task bodies and would add complexity disproportionate to the risk.
Add one test in tests/test_verify_task_body.py: body where the ## Why this experiment section sits inside a ~~~text … ~~~ block → check #12 FAILs.
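A minimal Python sketch of the extended toggle, assuming the section walk is a plain line-by-line loop. The names lines_outside_fences and h2_headings are hypothetical and stand in for the real logic in find_h2_sections and _enforce_why_this_experiment_gate, which may be shaped differently.

```python
# Fence markers are built with repetition so this sketch can itself sit in a fenced block.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def lines_outside_fences(body: str):
    """Yield (line_number, line) for every line that is not inside a fenced block."""
    in_fence = False
    for lineno, line in enumerate(body.splitlines(), start=1):
        if line.strip().startswith(FENCE_MARKERS):
            # A fence delimiter flips the state and is never treated as content.
            in_fence = not in_fence
            continue
        if not in_fence:
            yield lineno, line


def h2_headings(body: str) -> list[str]:
    """Return the text of every H2 heading that sits outside a fence."""
    return [
        line[3:].strip()
        for _, line in lines_outside_fences(body)
        if line.startswith("## ")
    ]
```

A single boolean toggle like this does not remember which marker opened the fence, so a block opened with one marker that contains a line starting with the other marker will flip the state early.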
2. Lazy import yaml as _yaml style
In src/explore_persona_space/task_workflow_migrate.py, the import yaml as _yaml currently lives inside the _serialize_frontmatter helper. Move it to module top alongside the existing imports.
m1 — Malformed YAML silently bucketed
scripts/migrate_add_legacy_why_sentinel.py::split_frontmatter (and equivalents wherever they live in src/ or scripts/) returns (None, fm_block, body) for both "no frontmatter delimiters" AND "YAML parse error". The main migration loop buckets both as skipped_no_fm.
Fix:
- Distinguish the two cases. Either return a 3-state result (("missing" | "parse_error" | "ok"), fm_or_none, body), or raise yaml.YAMLError and let the caller catch it.
- The migration loop routes parse errors to a new parse_errors bucket. --apply refuses to commit if parse_errors > 0; the user must fix the offending bodies first.
- Print the affected paths in both dry-run and apply mode so the user knows which bodies to inspect.
Test: dry-run on a synthetic body with intentionally-broken YAML (e.g. unbalanced quote) → reports the body under parse errors, with file path; --apply refuses with non-zero exit and a clear error.
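The 3-state variant of the fix could look roughly like the sketch below. It is an illustration under assumptions, not the migration script's actual code: bucket_bodies and apply_exit_code are hypothetical helpers, and json.loads with ValueError is used as a stdlib stand-in for the parser so the sketch runs without PyYAML. The real script would presumably pass yaml.safe_load and (yaml.YAMLError,) instead.

```python
from __future__ import annotations

import json
from typing import Any, Callable, Literal

Status = Literal["missing", "parse_error", "ok"]


def split_frontmatter(
    text: str,
    parse: Callable[[str], Any] = json.loads,  # stand-in; assumed to be yaml.safe_load in the real script
    parse_exceptions: tuple[type[Exception], ...] = (ValueError,),  # assumed (yaml.YAMLError,) in the real script
) -> tuple[Status, Any, str]:
    """Split '---'-delimited frontmatter from the body and report a three-state status."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "missing", None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            body = "".join(lines[i + 1 :])
            try:
                return "ok", parse("".join(lines[1:i])), body
            except parse_exceptions:
                return "parse_error", None, body
    # Opening delimiter without a closing one: treated as "no frontmatter" in this sketch.
    return "missing", None, text


def bucket_bodies(bodies: dict[str, str]) -> dict[str, list[str]]:
    """Route each path into ok / skipped_no_fm / parse_errors by its split status."""
    buckets: dict[str, list[str]] = {"ok": [], "skipped_no_fm": [], "parse_errors": []}
    key_for = {"ok": "ok", "missing": "skipped_no_fm", "parse_error": "parse_errors"}
    for path, text in sorted(bodies.items()):
        status, _fm, _body = split_frontmatter(text)
        buckets[key_for[status]].append(path)
    return buckets


def apply_exit_code(buckets: dict[str, list[str]]) -> int:
    """Return non-zero (and list the offending paths) when --apply must refuse."""
    for path in buckets["parse_errors"]:
        print(f"parse error: {path}")
    return 1 if buckets["parse_errors"] else 0
```

Keeping the status as an explicit first element means the caller cannot confuse "no delimiters" with "delimiters present but unparseable", which is the conflation the current (None, fm_block, body) return value allows.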
m3 — Duplicate constants between task.py and verify_task_body.py
The gate's label and regex constants (_WHY_LABELS, _WHY_LINE_RE, _WHY_MIN_LINE_CHARS, WHY_LINE_LABELS, WHY_SECTION_NAME, APPLICATION_ENUM, MIN_WHY_LINE_CHARS) currently exist in both scripts/task.py (private prefixed names) and scripts/verify_task_body.py (public names). The comment at the duplication site reads "Edit both together if changing", which is the smell.
Fix:
- Extract to src/explore_persona_space/task_workflow/why_gate.py (natural home alongside the existing task_workflow package).
- Module exports: WHY_SECTION_NAME, WHY_LINE_LABELS, APPLICATION_ENUM, MIN_WHY_LINE_CHARS, LEGACY_WHY_SENTINEL_KEY, plus a find_why_section(body: str) -> WhySection | None helper that BOTH call sites can use (returns parsed labels + fence-state-aware section start/end).
- scripts/task.py and scripts/verify_task_body.py import the constants and the helper. Delete the duplicated definitions.
- Keep behavior identical; this is a pure DRY refactor.
Verify the existing tests/test_verify_task_body.py and tests/test_task_workflow.py tests all still pass without modification. If a test fails, that means the two sources WERE NOT identical and the gate had a quiet behavioral drift; report and pause.
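For orientation, the shared module might be shaped like the sketch below. Only the export names and the find_why_section signature come from this task; the WhySection fields, the "Label: text" line format, and the MIN_WHY_LINE_CHARS value are assumptions made for illustration. APPLICATION_ENUM and LEGACY_WHY_SENTINEL_KEY are left out because their values are not stated here.

```python
from __future__ import annotations

from dataclasses import dataclass

WHY_SECTION_NAME = "Why this experiment"
# Labels as they appear in the "Why this experiment" section of this task body.
WHY_LINE_LABELS = (
    "Application",
    "Decision this changes",
    "Expected outcome + branches",
    "What gets cut if we run this",
)
MIN_WHY_LINE_CHARS = 20  # placeholder value; the real threshold is not given in this task

_FENCES = ("`" * 3, "~" * 3)


@dataclass(frozen=True)
class WhySection:
    start_line: int  # 1-based line number of the H2 heading
    end_line: int  # 1-based line number of the last line in the section
    labels: dict[str, str]  # label -> text after the colon


def find_why_section(body: str) -> WhySection | None:
    """Locate the first unfenced '## Why this experiment' section and parse its labels."""
    lines = body.splitlines()
    in_fence = False
    start = None
    end = len(lines)
    for lineno, line in enumerate(lines, start=1):
        if line.strip().startswith(_FENCES):
            in_fence = not in_fence
            continue
        if in_fence or not line.startswith("## "):
            continue
        if start is None:
            if line[3:].strip() == WHY_SECTION_NAME:
                start = lineno
        else:
            end = lineno - 1  # the next unfenced H2 closes the section
            break
    if start is None:
        return None
    labels: dict[str, str] = {}
    for line in lines[start:end]:
        # Assumed line shape: "Label: text", optionally with bold markers.
        label, sep, text = line.replace("**", "").strip().partition(":")
        if sep and label.strip() in WHY_LINE_LABELS:
            labels[label.strip()] = text.strip()
    return WhySection(start_line=start, end_line=end, labels=labels)
```

With one helper owning both the fence state and the section bounds, the two scripts cannot disagree about where the section starts or which lines count toward it.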
m4 — SKILL.md set-body mechanics
.claude/skills/why-experiment-gate/SKILL.md Step 4 instructs the agent to write the full markdown (frontmatter + body) to a file and call task.py set-body --file <path>. The brief was unverified — confirm what task.py set-body actually writes:
grep -A30 "def cmd_set_body\|def set_body" scripts/task.py src/explore_persona_space/task_workflow.py
Three possible outcomes:
- (a) set-body accepts full markdown including frontmatter and writes it as-is → no fix needed, the skill instruction is correct.
- (b) set-body writes only the body portion (strips/ignores frontmatter from the input) → fix the skill instruction to call set-frontmatter separately, OR add a new set-body --replace-all flag, OR document the existing pattern.
- (c) set-body overwrites frontmatter destructively → the skill is currently dangerous; either change the skill or change the CLI. Decide based on what existing callers (/issue Step 9a auto-promote, analyzer agent) expect.
Document the resolution in the SKILL.md prose AND in the CLI's --help output if it isn't already obvious.
m5 — Duplicate H2 sections silently accepted
scripts/verify_task_body.py::find_h2_sections (or wherever the section-walk lives) returns the first match for a given heading and silently ignores duplicates. The current behavior is documented in a comment as a "body-discipline smell" but the verifier doesn't enforce it.
Fix: if two ## Why this experiment sections appear in the same body, check #12 FAILs with a clear message ("multiple ## Why this experiment sections found at lines X, Y"). Lift the policy from "smell" to "FAIL".
Test: synthetic body with two ## Why this experiment H2s → check #12 FAILs.
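A sketch of the duplicate check, assuming the verifier can collect the line numbers of every matching unfenced H2. The names find_h2_lines and check_single_why_section are hypothetical, and the (passed, message) return shape is an assumption about how check #12 reports its result.

```python
_FENCES = ("`" * 3, "~" * 3)


def find_h2_lines(body: str, heading: str) -> list[int]:
    """Return the 1-based line numbers of every unfenced '## <heading>' line."""
    hits = []
    in_fence = False
    for lineno, line in enumerate(body.splitlines(), start=1):
        if line.strip().startswith(_FENCES):
            in_fence = not in_fence
            continue
        if not in_fence and line.startswith("## ") and line[3:].strip() == heading:
            hits.append(lineno)
    return hits


def check_single_why_section(
    body: str, heading: str = "Why this experiment"
) -> tuple[bool, str]:
    """Pass only when exactly one matching H2 exists; otherwise explain the failure."""
    hits = find_h2_lines(body, heading)
    if not hits:
        return False, f"missing ## {heading} section"
    if len(hits) > 1:
        where = ", ".join(str(n) for n in hits)
        return False, f"multiple ## {heading} sections found at lines {where}"
    return True, "ok"
```

Collecting all matches instead of returning on the first one is the only structural change needed; the failure message then carries the line numbers directly.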
Files touched (estimate)
- src/explore_persona_space/task_workflow/why_gate.py (NEW, ~80 lines): extracted constants + find_why_section helper.
- scripts/task.py (-15/+8 lines): import from why_gate, drop duplicated constants, accept ~~~ fences via the shared helper.
- scripts/verify_task_body.py (-25/+10 lines): same. Plus duplicate-H2 FAIL logic.
- scripts/migrate_add_legacy_why_sentinel.py (~+25 lines): distinguish parse-error from no-fm; refuse --apply if any.
- src/explore_persona_space/task_workflow_migrate.py (~3 line move): yaml import to module top.
- .claude/skills/why-experiment-gate/SKILL.md (~5-15 line edit depending on m4 resolution).
- tests/test_verify_task_body.py (+3 tests: tilde-fence FAIL, duplicate-H2 FAIL, parse-error reporting if shared with migrate).
- tests/test_task_workflow.py (+1 test for m1 if migrate tests live there).
Test plan
Run from repo root:
1. uv run pytest tests/test_verify_task_body.py tests/test_task_workflow.py tests/test_workflow_lint.py tests/test_workflow_yaml.py -v → all PASS (no regression of the 119 existing tests).
2. New tests added in (1) all PASS:
   - test_why_experiment_tilde_fence_bypass_fails
   - test_why_experiment_duplicate_h2_fails
   - test_migration_reports_parse_errors (or equivalent test name)
3. uv run python scripts/migrate_add_legacy_why_sentinel.py --dry-run returns 0 parse errors on the current tree.
4. Synthetic regression: create /tmp/bad-fm-body.md with broken YAML; place it under tasks/proposed/9999/body.md (or use a fixture path the migrate script walks); confirm dry-run reports it under "parse errors" and --apply refuses to commit. Clean up the test body after.
5. uv run ruff check . && uv run ruff format --check . PASS on touched files.
6. uv run python scripts/workflow_lint.py --check-references --check-tables --check-asks PASS.
Acceptance criteria
- All six items closed per their fix descriptions.
- No regression in the 119 existing tests.
- 3+ new tests added covering tilde-fence FAIL, duplicate-H2 FAIL, and parse-error reporting.
- The task.py ↔ verify_task_body.py constant drift risk is eliminated by the why_gate.py extraction.
- The SKILL.md instruction matches what task.py set-body actually does.
Explicit cuts (not bundled)
- Indented-code-block (4-space) fence detection — defer.
- Restructuring the gate as a Pydantic model + schema validation — out of scope for tail-risk hardening.
- Per-experiment dashboard (item 4 of workflow-changes queue).
Timeline · 5 events
- epm:completed · unknown: Round 1 PASS-with-concerns. 4 commits: m3 extraction + tilde/duplicate-H2 + m1 parse-error reporting + lazy yaml import + m4 SKILL.md document-the-pattern. 135 tests pass. One residual: mixed-fence-nesting bypass (backtick→tilde→backtick); theoretical (no real bodies hit it), filed as follow-up consideration.
- epm:status-changed · task.py · running → completed
- epm:progress · unknown: Direct implementer dispatch from PM session for the deferred #371 minors. Parent #371 (completed). Same quick-path topology as #371 / #372: no worktree, implementer + code-reviewer pair.
- epm:status-changed · task.py · proposed → running
- epm:created · task.py