#33 · Archived

[Code] Research-chain tracking: issue linking + claims registry

kind: infra

Motivation

Mentor wants a way to trace data → hyperparams → results → claims and to see how experiments chain together. Currently:

  • No programmatic link between a claim in RESULTS.md and its supporting issues/runs/figures.
  • Issues reference each other only via ad-hoc #N comments. There are no typed "follows from" / "followed by" relationships, so the research chain cannot be walked programmatically.
  • No single registry of claims with their status + evidence.

This issue adds the minimum infrastructure to close those gaps without building a custom dashboard. Extends the existing /issue <N> skill and the marker-comment system.

Scope

Files in scope:

  • docs/claims.yaml (new) — canonical claim registry
  • docs/claims.md (new) — auto-rendered; committed by CI
  • scripts/render_claims.py (new)
  • .github/workflows/render_claims.yml (new) — CI trigger
  • .claude/skills/issue/SKILL.md — extend Step 1 (clarifier posts follows markers) and Step 8 (ask about claim attribution)
  • .claude/skills/issue/clarifier.md — add "what prior issue motivated this?" prompt
  • .claude/skills/issue/markers.md — add epm:follows, epm:followed-by kinds
  • .claude/agent-memory/research-pm/reference_github_issues.md — document epic:* and claim:* label conventions

Files explicitly OUT of scope:

  • Eval Q&A logging (wandb.Table) — separate issue
  • WandB Reports auto-generation via Reports SDK — separate issue
  • GitHub Pages eval browser — explicitly deferred per discussion (WandB Reports preferred)
  • One-time backfill of the 30 existing issues' claims — sub-issue if/when wanted

Acceptance criteria

  • docs/claims.yaml exists with schema documented in-file: id, description, aim, status (preliminary|moderate|strong|falsified), evidence: {issues, wandb_report, figures, results_md_section}, kill_criteria, supersedes, updated
  • scripts/render_claims.py reads claims.yaml, joins against live issue states via gh issue view, and emits docs/claims.md as a sortable markdown table with deep links
  • GitHub Actions workflow re-renders docs/claims.md on push to docs/claims.yaml OR when any issue closes carrying a claim:* label
  • /issue skill clarifier asks "what prior issue motivated this?" and posts <!-- epm:follows N --> on the child + <!-- epm:followed-by N --> on the parent. Bidirectional graph parseable via gh comment scan.
  • /issue Step 8 asks "contributes to which claim?" and appends the issue to claims.yaml evidence list + adds claim:<id> label to the issue. Also supports creating a new claim in the yaml with a fresh ID.
  • epic:* label convention documented (loose grouping across an investigation)
  • claim:* label taxonomy documented; claim:<id> labels auto-created via gh api when a new claim is added to claims.yaml

Tests

No unit tests needed for this infra. Manual acceptance:

  • Add claim:C-test-1 to claims.yaml → render_claims.py emits a valid row in docs/claims.md
  • File a new test issue referencing #29 as its parent via the clarifier → verify both markers posted bidirectionally
  • Close an issue carrying a claim:C-test-1 label → CI workflow re-renders docs/claims.md with the new evidence
  • ruff check . && ruff format . passes

Compatibility

  • Fully additive. Existing 30 issues unaffected unless manually tagged claim:*.
  • RESULTS.md continues as human-readable prose. claims.yaml is the machine-readable spine. Headlines in RESULTS.md reference C-* IDs.
  • gh CLI 2.4.0 compatible (labels created via gh api, no gh label subcommand used).

Dependencies

  • PyYAML — likely already in uv.lock. If not, uv add pyyaml.
  • gh CLI 2.4.0 — already installed.
  • No other new PyPI packages expected.

Performance impact

None. render_claims.py runs in CI, off the hot path. Expected render time <5s for 100 claims.

Risk + rollback

  • Blast radius: very low. All additive. No existing code is modified except the /issue skill (markdown files, behavior-extension).
  • Rollback: git revert. No schema migrations, no model artifacts touched.
  • If render_claims.py breaks: docs/claims.md stale; no data loss.

Aim

aim:cross-cutting — tracking infrastructure that spans all research aims.

Design sketch (informal — the adversarial-planner will produce the canonical plan)

docs/claims.yaml schema:

claims:
  - id: C-aim3-leakage-v1
    description: "Trait leakage across ~X% of adjacent tokens under SFT"
    aim: 3-propagation
    status: moderate
    evidence:
      issues: [27, 28, 29, 30]
      wandb_report: https://wandb.ai/.../reports/...
      figures: [figures/aim3/leakage_comprehensive.png]
      results_md_section: "Aim 3 — Propagation"
    kill_criteria: "effect size < 0.1 across 3+ seeds"
    supersedes: []
    updated: 2026-04-16
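
A registry entry like the one above could be checked before rendering. The sketch below is an assumption-laden illustration: it treats every field listed in the acceptance criteria as required (the issue does not say which are optional), operates on an already-parsed dictionary, and uses the hypothetical name `validate_claim`.

```python
from typing import Any

# Status values taken from the acceptance criteria.
ALLOWED_STATUSES = {"preliminary", "moderate", "strong", "falsified"}
# Assumption: all schema fields are mandatory; the issue does not specify this.
REQUIRED_FIELDS = (
    "id", "description", "aim", "status",
    "evidence", "kill_criteria", "supersedes", "updated",
)


def validate_claim(claim: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the claim looks valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in claim]

    status = claim.get("status")
    if status is not None and status not in ALLOWED_STATUSES:
        errors.append(f"invalid status: {status!r}")

    evidence = claim.get("evidence")
    issues = evidence.get("issues", []) if isinstance(evidence, dict) else []
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(n, int) and not isinstance(n, bool) for n in issues):
        errors.append("evidence.issues must be a list of issue numbers")

    # Claim IDs in this issue all carry the C- prefix.
    if "id" in claim and not str(claim["id"]).startswith("C-"):
        errors.append("id should start with 'C-'")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a renderer report every malformed claim in one pass.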

Marker comment example (epm:follows):

<!-- epm:follows v1 -->
**Follows from:** #27 (marker leakage v3 showed the effect was confounded by token length)
**Motivating result:** link to the epm:results comment in #27
<!-- /epm:follows -->
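
To illustrate how these markers could be turned into a walkable graph, here is a rough sketch. It assumes comment bodies have already been fetched (for example via gh) and are passed in as strings, that an `epm:followed-by` block mirrors the `epm:follows` layout shown above, and that the first `#N` reference inside a block is the linked issue. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches a whole marker block, e.g. <!-- epm:follows v1 --> ... <!-- /epm:follows -->
MARKER_RE = re.compile(
    r"<!--\s*epm:(follows|followed-by)\s+v\d+\s*-->(.*?)<!--\s*/epm:\1\s*-->",
    re.DOTALL,
)
ISSUE_REF_RE = re.compile(r"#(\d+)")


def extract_links(issue_number: int, comment_body: str) -> list[tuple[int, int]]:
    """Return (parent, child) edges found in one comment posted on `issue_number`."""
    edges = []
    for kind, body in MARKER_RE.findall(comment_body):
        ref = ISSUE_REF_RE.search(body)
        if ref is None:
            continue  # Marker without an issue reference; nothing to link.
        other = int(ref.group(1))
        # "follows" sits on the child, "followed-by" sits on the parent.
        edges.append((other, issue_number) if kind == "follows" else (issue_number, other))
    return edges


def build_chain(comments: dict[int, list[str]]) -> dict[int, set[int]]:
    """Map each parent issue to the set of issues that follow from it."""
    children: dict[int, set[int]] = defaultdict(set)
    for issue_number, bodies in comments.items():
        for body in bodies:
            for parent, child in extract_links(issue_number, body):
                children[parent].add(child)
    return dict(children)
```

Because both marker kinds reduce to the same (parent, child) edge, a chain stays recoverable even when only one side of the bidirectional pair was posted.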

Label additions (created via gh api when referenced):

  • epic:<slug> — free-form grouping label (e.g., epic:trait-leakage)
  • claim:C-<aim>-<slug>-v<n> — one per claim in the registry
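
As a hedged sketch of label creation through `gh api`, the helpers below only build the label name and the command line; nothing is executed. The endpoint is GitHub's REST "create a label" route, and `{owner}` / `{repo}` are the placeholders that `gh api` fills in from the current repository. The function names, the loose ID check, and the default color are assumptions for illustration.

```python
import re

# Loose check: a C- prefix followed by letters, digits, and hyphens.
# Deliberately permissive so test IDs such as C-test-1 are accepted.
CLAIM_ID_RE = re.compile(r"^C-[A-Za-z0-9][A-Za-z0-9-]*$")


def claim_label(claim_id: str) -> str:
    """Return the claim:<id> label name for a registry claim ID."""
    if not CLAIM_ID_RE.match(claim_id):
        raise ValueError(f"unexpected claim id: {claim_id!r}")
    return f"claim:{claim_id}"


def create_label_argv(label: str, color: str = "5319e7") -> list[str]:
    """Build the gh argv that would create `label` in the current repository.

    Passing -f fields makes `gh api` send a POST request. The color is a
    six-digit hex value without a leading '#'; the default is arbitrary.
    """
    return [
        "gh", "api", "repos/{owner}/{repo}/labels",
        "-f", f"name={label}",
        "-f", f"color={color}",
    ]
```

The resulting list can be handed to `subprocess.run(argv, check=True)`. Creating a label that already exists is expected to return an error from the API, so a real caller would likely need to tolerate that case instead of failing the whole run.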

Open questions for the clarifier / planner

  • Should claims.yaml support multi-aim claims (cross-cutting)? — proposed: yes, aim field can be a list.
  • Should the render include WandB Report thumbnails, or just links? — proposed: links only for now.
  • Should /issue Step 8 force a claim attribution, or allow none? — proposed: allow none but warn if the issue is marked status:under-review PASS and has no claim.
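
The proposed answers to the first and third questions could look roughly like this. It is only a sketch under stated assumptions: how a PASS verdict is recorded on an issue is not specified here, so it is passed in as a plain string, and both function names are hypothetical.

```python
from typing import Optional, Union


def normalize_aims(aim: Union[str, list[str]]) -> list[str]:
    """Accept the aim field as a single string or a list and always return a list."""
    return [aim] if isinstance(aim, str) else list(aim)


def should_warn_missing_claim(labels: list[str], verdict: Optional[str]) -> bool:
    """Warn only when an issue is under review with a PASS verdict and has no claim:* label."""
    under_review = "status:under-review" in labels
    has_claim = any(label.startswith("claim:") for label in labels)
    return under_review and verdict == "PASS" and not has_claim
```

Normalizing the aim field at read time keeps single-aim entries unchanged in the YAML while letting downstream code treat every claim as multi-aim.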

Next step: run /issue <this issue number> through the skill for gate-keeper → adversarial-planner → approval → implementer.

Timeline · 4 events

  1. epm:clarify· system
    <!-- epm:clarify v1 -->
    ## Clarifier — No blocking ambiguities
    
    Issue passes the `type:infra` clarifier. All required fields present: scope in/out, motivation, compatibility (additive), tests (manual acceptance), dependencies (PyYAML), performance (none), risk+rollback (low).
    
    **Design-decisions to carry to the planner (non-blocking — defaults stated):**
    1. Multi-aim claims — default: `aim` field accepts a string OR list.
    2. Render WandB thumbnails — default: links only.
    3. Force claim attribution at Step 8 — default: allow `none` but warn on PASS+no-claim.
    
    **Minor notes for the planner to resolve:**
    - `docs/claims.md` lifecycle — body says "committed by CI"; planner should specify whether it's `.gitignore`d locally (built in CI only) or both local+CI-built. Recommend: `.gitignore` it locally so developers don't commit stale renders; CI commits canonical version on `docs/claims.yaml` changes.
    - Verify PyYAML is already in `uv.lock` — if not, planner should add `uv add pyyaml`.
    
    Advancing to `status:gate-pending`.
    <!-- /epm:clarify -->
  2. epm:gate· system
    <!-- epm:gate v1 -->
    ## Gate-Keeper Verdict: **RUN**
    
    ### Scores
    | Dimension | Score | Justification |
    |-----------|-------|---------------|
    | Information Value | 4/5 | Directly resolves a mentor-stated need: programmatic `data→hyperparams→results→claims` traceability. Not an experiment, but materially increases ability to audit and communicate every future claim. The aim-5 `good_correct` headline retraction (50.9 → 28.3 batch-size artifact) is exactly the event where a claims registry with `status` + `supersedes` prevents stale headlines lingering — `RESULTS.md` still carries partially-retracted prose. |
    | De-risking Quality | 4/5 | Scope is minimum-viable: extends existing `/issue` skill + marker system. Out-of-scope list is disciplined. Rollback is `git revert`; no migrations, no model artifacts touched. Residual risk: CI render-on-issue-close is the fragile piece; planner should add a manual fallback. |
    | Strategic Fit | 4/5 | `aim:cross-cutting`, timing right: mid-transition from `EXPERIMENT_QUEUE.md` to GitHub Issues as truth. Building at 50+ claims later is strictly more expensive. Infra pays compound interest on every subsequent experiment's interpretability. |
    | Feedback Loop Speed | 4/5 | ~4 engineer-hours, zero GPU-hours. First signal: `docs/claims.md` renders locally from a 1-claim yaml within 30 min of the renderer landing. Can self-host smoke-test against this very issue. |
    | Opportunity Cost | 5/5 | `compute:none`. Not competing with GPU-bound queue items. Competing infra issues (#10-13) don't block claim tracking. Engineer-hours here are essentially free against critical path. |
    | **Weighted Total** | **4.15/5** | |
    
    ### Research Phase Alignment
    - Aim: cross-cutting (spans all aims). Not an experiment.
    - Aim 5 just reverted Understand → Distill → Understand precisely because a retracted claim wasn't caught early. Claim tracking is phase-agnostic and directly supports Distill for aims 3/4/6.
    
    ### What Could Go Wrong (for the planner)
    1. **CI workflow fragility.** `render_claims.py` calling `gh issue view` per claim could hit rate limits on every issue close. Mitigation: batch into a single `gh issue list --search 'label:claim:*'` call, with a graceful fallback (emit a warning row, don't fail the workflow).
    2. **Clarifier friction.** Two new prompts ("follows from?" + "contributes to which claim?") tax every issue. If tedious, users skip. Mitigation: both accept `none`/skip gracefully; Step-8 prompt auto-suggests based on `aim:*` label.
    3. **Schema churn.** `claims.yaml` schema may prove insufficient after ~10 real claims. Accept that v1 will evolve; keep it small.
    
    ### Plan must incorporate
    - Manual-render escape hatch: `python scripts/render_claims.py --local` independent of CI.
    - Explicit "no claim attached" path in Step 8: issue can close without `claim:*` label; only warn on `status:under-review PASS` + no claim.
    - **Self-hosting smoke test:** seed `claims.yaml` with `C-infra-claims-tracking-v1` attached to issue #33 itself, so the system demonstrates end-to-end on merge.
    
    ### Overrides
    User-requested (mentor-driven) — would RUN regardless of score. Score confirms the ask is well-scoped.
    <!-- /epm:gate -->
  3. state_changed· user· planning → proposed
    Moved on Pipeline board to idea.
  4. state_changed· user· proposed → archived
    Closed on GH without ever running. Archiving on Sagan to match.
