[Code] Research-chain tracking: issue linking + claims registry

kind: infra

Motivation

Mentor wants a way to trace data → hyperparams → results → claims and to see how experiments chain together. Currently:

No programmatic link between a claim in RESULTS.md and its supporting issues/runs/figures.
Issues reference each other only via ad-hoc #N comments. There are no typed "follows from" / "followed by" relationships, so the research chain can't be walked programmatically.
No single registry of claims with their status + evidence.

This issue adds the minimum infrastructure to close those gaps without building a custom dashboard. Extends the existing /issue <N> skill and the marker-comment system.

Scope

Files in scope:

docs/claims.yaml (new) — canonical claim registry
docs/claims.md (new) — auto-rendered; committed by CI
scripts/render_claims.py (new)
.github/workflows/render_claims.yml (new) — CI trigger
.claude/skills/issue/SKILL.md — extend Step 1 (clarifier posts follows markers) and Step 8 (ask about claim attribution)
.claude/skills/issue/clarifier.md — add "what prior issue motivated this?" prompt
.claude/skills/issue/markers.md — add epm:follows, epm:followed-by kinds
.claude/agent-memory/research-pm/reference_github_issues.md — document epic:* and claim:* label conventions

Files explicitly OUT of scope:

Eval Q&A logging (wandb.Table) — separate issue
WandB Reports auto-generation via Reports SDK — separate issue
GitHub Pages eval browser — explicitly deferred per discussion (WandB Reports preferred)
One-time backfill of the 30 existing issues' claims — sub-issue if/when wanted

Acceptance criteria

docs/claims.yaml exists with schema documented in-file: id, description, aim, status (preliminary|moderate|strong|falsified), evidence: {issues, wandb_report, figures, results_md_section}, kill_criteria, supersedes, updated
scripts/render_claims.py reads claims.yaml, joins against live issue states via gh issue view, and emits docs/claims.md as a sortable markdown table with deep links
GitHub Actions workflow re-renders docs/claims.md on push to docs/claims.yaml OR when any issue closes carrying a claim:* label
/issue skill clarifier asks "what prior issue motivated this?" and posts an epm:follows marker comment on the child + an epm:followed-by marker comment on the parent. Bidirectional graph parseable via gh comment scan.
/issue Step 8 asks "contributes to which claim?" and appends the issue to claims.yaml evidence list + adds claim:<id> label to the issue. Also supports creating a new claim in the yaml with a fresh ID.
epic:* label convention documented (loose grouping across an investigation)
claim:* label taxonomy documented; claim:<id> labels auto-created via gh api when a new claim is added to claims.yaml
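
A minimal sketch of the rendering step in the criteria above, assuming the claims have already been loaded from claims.yaml into plain dicts and the issue states were fetched separately (for example via gh issue view). The function name `render_claims_table`, the column set, and the `OWNER/REPO` URL base are illustrative assumptions, not the planned implementation.

```python
from typing import Mapping, Sequence

# Hypothetical repository URL base; the real value would come from repo config.
REPO_URL = "https://github.com/OWNER/REPO"


def render_claims_table(
    claims: Sequence[Mapping],
    issue_states: Mapping[int, str],
) -> str:
    """Render claims as a markdown table with deep links to evidence issues.

    `issue_states` maps issue number -> state string (e.g. "OPEN", "CLOSED").
    Issues missing from the map are shown as "unknown" instead of failing.
    """
    lines = [
        "| ID | Status | Description | Evidence issues | Updated |",
        "|----|--------|-------------|-----------------|---------|",
    ]
    # Sort by id so output is stable across runs and diffs stay small.
    for claim in sorted(claims, key=lambda c: c["id"]):
        issues = claim.get("evidence", {}).get("issues", [])
        links = ", ".join(
            f"[#{n}]({REPO_URL}/issues/{n}) ({issue_states.get(n, 'unknown')})"
            for n in issues
        )
        # Escape pipes so free-text descriptions cannot break the table.
        description = str(claim.get("description", "")).replace("|", "\\|")
        lines.append(
            f"| {claim['id']} | {claim.get('status', '')} | {description} "
            f"| {links or 'none'} | {claim.get('updated', '')} |"
        )
    return "\n".join(lines) + "\n"
```

Keeping the renderer a pure function of (claims, issue states) means the manual acceptance checks can be run locally without network access, with the gh lookup layered on top.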

Tests

No unit tests needed for this infra. Manual acceptance:

Add claim:C-test-1 to claims.yaml → render_claims.py emits a valid row in docs/claims.md
File a new test issue referencing #29 as its parent via the clarifier → verify both markers posted bidirectionally
Close an issue carrying a claim:C-test-1 label → CI workflow re-renders docs/claims.md with the new evidence
ruff check . && ruff format . passes

Compatibility

Fully additive. Existing 30 issues unaffected unless manually tagged claim:*.
RESULTS.md continues as human-readable prose. claims.yaml is the machine-readable spine. Headlines in RESULTS.md reference C-* IDs.
gh CLI 2.4.0 compatible (labels created via gh api, no gh label subcommand used).

Dependencies

PyYAML — likely already in uv.lock. If not, uv add pyyaml.
gh CLI 2.4.0 — already installed.
No other new PyPI packages expected.

Performance impact

None. render_claims.py runs in CI, off the hot path. Expected render time <5s for 100 claims.

Risk + rollback

Blast radius: very low. All additive. No existing code is modified except the /issue skill (markdown files, behavior-extension).
Rollback: git revert. No schema migrations, no model artifacts touched.
If render_claims.py breaks: docs/claims.md stale; no data loss.

Aim

aim:cross-cutting — tracking infrastructure that spans all research aims.

Design sketch (informal — the adversarial-planner will produce the canonical plan)

docs/claims.yaml schema:

claims:
  - id: C-aim3-leakage-v1
    description: "Trait leakage across ~X% of adjacent tokens under SFT"
    aim: 3-propagation
    status: moderate
    evidence:
      issues: [27, 28, 29, 30]
      wandb_report: https://wandb.ai/.../reports/...
      figures: [figures/aim3/leakage_comprehensive.png]
      results_md_section: "Aim 3 — Propagation"
    kill_criteria: "effect size < 0.1 across 3+ seeds"
    supersedes: []
    updated: 2026-04-16
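
One way the schema above could be checked before rendering is sketched here. It assumes every listed field is required and that IDs start with `C-`; both are assumptions drawn from the example, and the helper name `validate_claim` is hypothetical. It operates on an already-parsed dict, so it is independent of the YAML loader.

```python
from collections.abc import Mapping
from typing import Any

ALLOWED_STATUSES = {"preliminary", "moderate", "strong", "falsified"}
# Assumption: all schema fields are mandatory in v1.
REQUIRED_FIELDS = (
    "id", "description", "aim", "status",
    "evidence", "kill_criteria", "supersedes", "updated",
)
EVIDENCE_KEYS = {"issues", "wandb_report", "figures", "results_md_section"}


def validate_claim(claim: Mapping[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the claim is valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in claim]

    if "status" in claim and claim["status"] not in ALLOWED_STATUSES:
        problems.append(f"invalid status: {claim['status']!r}")

    if "id" in claim and not str(claim["id"]).startswith("C-"):
        problems.append("id should start with 'C-'")

    evidence = claim.get("evidence", {})
    if not isinstance(evidence, Mapping):
        problems.append("evidence must be a mapping")
    else:
        for key in sorted(set(evidence) - EVIDENCE_KEYS):
            problems.append(f"unknown evidence key: {key}")
        # Issue references are plain integers, as in the example above.
        if not all(isinstance(n, int) for n in evidence.get("issues", [])):
            problems.append("evidence.issues must be a list of integers")

    return problems
```

Returning a list of problems instead of raising lets a renderer emit a warning row for a malformed claim and continue with the rest.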

Marker comment example (epm:follows):

<!-- epm:follows v1 -->
**Follows from:** #27 (marker leakage v3 showed the effect was confounded by token length)
**Motivating result:** link to the epm:results comment in #27
<!-- /epm:follows -->
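
To illustrate how the marker format above could be scanned into a child-to-parent graph, here is a rough sketch. It assumes comment bodies have already been fetched as strings (for example via gh), and that the parent is the first #N on the "Follows from:" line. The names `parse_follows` and `build_graph` are hypothetical.

```python
import re

# Captures the body between an opening and a closing epm:follows marker.
FOLLOWS_BLOCK = re.compile(
    r"<!--\s*epm:follows\s+v\d+\s*-->(.*?)<!--\s*/epm:follows\s*-->",
    re.DOTALL,
)
# Captures the first issue reference on the "Follows from:" line.
PARENT_REF = re.compile(r"\*\*Follows from:\*\*\s*#(\d+)")


def parse_follows(comment_body: str) -> list[int]:
    """Return parent issue numbers named in epm:follows blocks of one comment."""
    parents = []
    for block in FOLLOWS_BLOCK.findall(comment_body):
        match = PARENT_REF.search(block)
        if match:
            parents.append(int(match.group(1)))
    return parents


def build_graph(comments_by_issue: dict[int, list[str]]) -> dict[int, list[int]]:
    """Map each child issue number to the parent issues it follows from."""
    graph = {}
    for issue, bodies in comments_by_issue.items():
        parents = [p for body in bodies for p in parse_follows(body)]
        if parents:
            graph[issue] = parents
    return graph
```

Issue references outside a marker block are ignored, which keeps ad-hoc #N mentions from polluting the graph.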

Label additions (created via gh api when referenced):

epic:<slug> — free-form grouping label (e.g., epic:trait-leakage)
claim:C-<aim>-<slug>-v<n> — one per claim in the registry
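
As a sketch of label creation without the gh label subcommand: the command below targets the GitHub REST endpoint POST /repos/{owner}/{repo}/labels through gh api. The helper names, the default color, and the dry-run default are assumptions for illustration, and the `{owner}`/`{repo}` placeholder expansion should be confirmed against gh 2.4.0 before relying on it.

```python
import subprocess


def label_create_command(name: str, color: str = "ededed", description: str = "") -> list[str]:
    """Build the `gh api` argument list that creates one repository label.

    The color is a hex string without a leading '#', as the REST API expects.
    """
    cmd = [
        "gh", "api", "--method", "POST",
        "repos/{owner}/{repo}/labels",  # gh fills the placeholders from the current repo
        "-f", f"name={name}",
        "-f", f"color={color}",
    ]
    if description:
        cmd += ["-f", f"description={description}"]
    return cmd


def ensure_label(name: str, dry_run: bool = True) -> list[str]:
    """Create the label unless dry_run is set; return the command either way."""
    cmd = label_create_command(name)
    if not dry_run:
        # check=False: the API rejects a label that already exists, and that
        # case is tolerated here so repeated runs stay idempotent.
        subprocess.run(cmd, check=False)
    return cmd
```

For example, `ensure_label("claim:C-test-1", dry_run=False)` would attempt to create the label used in the manual acceptance checks.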

Open questions for the clarifier / planner

Should claims.yaml support multi-aim claims (cross-cutting)? — proposed: yes, aim field can be a list.
Should the render include WandB Report thumbnails, or just links? — proposed: links only for now.
Should /issue Step 8 force a claim attribution, or allow none? — proposed: allow none but warn if the issue is marked status:under-review PASS and has no claim.
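
The proposed defaults for the first and third questions can be sketched as two small helpers. How a "PASS" verdict is stored is not specified in this issue, so passing it as a separate `verdict` string is an assumption, as are the function names.

```python
from typing import Iterable, Optional, Union


def normalize_aims(aim: Union[str, Iterable[str]]) -> list[str]:
    """Accept either a single aim string or a list of aims; always return a list."""
    if isinstance(aim, str):
        return [aim]
    return [str(a) for a in aim]


def needs_claim_warning(labels: Iterable[str], verdict: Optional[str]) -> bool:
    """Warn only when an issue passed review but carries no claim:* label."""
    label_set = set(labels)
    has_claim = any(label.startswith("claim:") for label in label_set)
    # Assumption: the review verdict arrives as the plain string "PASS".
    passed_review = "status:under-review" in label_set and verdict == "PASS"
    return passed_review and not has_claim
```

Under this rule an issue can still close with no claim attached; the check only decides whether to surface a warning.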

Next step: run /issue <this issue number> through the skill for gate-keeper → adversarial-planner → approval → implementer.

Timeline · 4 events

epm:clarify · 2026-04-17T00:55:16.000Z · system

<!-- epm:clarify v1 -->
## Clarifier — No blocking ambiguities

Issue passes the `type:infra` clarifier. All required fields present: scope in/out, motivation, compatibility (additive), tests (manual acceptance), dependencies (PyYAML), performance (none), risk+rollback (low).

**Design-decisions to carry to the planner (non-blocking — defaults stated):**
1. Multi-aim claims — default: `aim` field accepts a string OR list.
2. Render WandB thumbnails — default: links only.
3. Force claim attribution at Step 8 — default: allow `none` but warn on PASS+no-claim.

**Minor notes for the planner to resolve:**
- `docs/claims.md` lifecycle — body says "committed by CI"; planner should specify whether it's `.gitignore`d locally (built in CI only) or both local+CI-built. Recommend: `.gitignore` it locally so developers don't commit stale renders; CI commits canonical version on `docs/claims.yaml` changes.
- Verify PyYAML is already in `uv.lock` — if not, planner should add `uv add pyyaml`.

Advancing to `status:gate-pending`.
<!-- /epm:clarify -->

epm:gate · 2026-04-17T00:57:39.000Z · system

<!-- epm:gate v1 -->
## Gate-Keeper Verdict: **RUN**

### Scores
| Dimension | Score | Justification |
|-----------|-------|---------------|
| Information Value | 4/5 | Directly resolves a mentor-stated need: programmatic `data→hyperparams→results→claims` traceability. Not an experiment, but materially increases ability to audit and communicate every future claim. The aim-5 `good_correct` headline retraction (50.9 → 28.3 batch-size artifact) is exactly the event where a claims registry with `status` + `supersedes` prevents stale headlines lingering — `RESULTS.md` still carries partially-retracted prose. |
| De-risking Quality | 4/5 | Scope is minimum-viable: extends existing `/issue` skill + marker system. Out-of-scope list is disciplined. Rollback is `git revert`; no migrations, no model artifacts touched. Residual risk: CI render-on-issue-close is the fragile piece; planner should add a manual fallback. |
| Strategic Fit | 4/5 | `aim:cross-cutting`, timing right: mid-transition from `EXPERIMENT_QUEUE.md` to GitHub Issues as truth. Building at 50+ claims later is strictly more expensive. Infra pays compound interest on every subsequent experiment's interpretability. |
| Feedback Loop Speed | 4/5 | ~4 engineer-hours, zero GPU-hours. First signal: `docs/claims.md` renders locally from a 1-claim yaml within 30 min of the renderer landing. Can self-host smoke-test against this very issue. |
| Opportunity Cost | 5/5 | `compute:none`. Not competing with GPU-bound queue items. Competing infra issues (#10-13) don't block claim tracking. Engineer-hours here are essentially free against critical path. |
| **Weighted Total** | **4.15/5** | |

### Research Phase Alignment
- Aim: cross-cutting (spans all aims). Not an experiment.
- Aim 5 just reverted Understand → Distill → Understand precisely because a retracted claim wasn't caught early. Claim tracking is phase-agnostic and directly supports Distill for aims 3/4/6.

### What Could Go Wrong (for the planner)
1. **CI workflow fragility.** `render_claims.py` calling `gh issue view` per claim could hit rate limits on every issue close. Mitigation: batch into a single `gh issue list --search 'label:claim:*'` call; fall back gracefully (emit a warning row, don't fail the workflow).
2. **Clarifier friction.** Two new prompts ("follows from?" + "contributes to which claim?") tax every issue. If tedious, users skip. Mitigation: both accept `none`/skip gracefully; Step-8 prompt auto-suggests based on `aim:*` label.
3. **Schema churn.** `claims.yaml` schema may prove insufficient after ~10 real claims. Accept that v1 will evolve; keep it small.

### Plan must incorporate
- Manual-render escape hatch: `python scripts/render_claims.py --local` independent of CI.
- Explicit "no claim attached" path in Step 8: issue can close without `claim:*` label; only warn on `status:under-review PASS` + no claim.
- **Self-hosting smoke test:** seed `claims.yaml` with `C-infra-claims-tracking-v1` attached to issue #33 itself, so the system demonstrates end-to-end on merge.

### Overrides
User-requested (mentor-driven) — would RUN regardless of score. Score confirms the ask is well-scoped.
<!-- /epm:gate -->

state_changed · 2026-05-12T09:25:25.576Z · user · planning → proposed
Moved on Pipeline board to idea.
state_changed · 2026-05-13T01:05:43.835Z · user · proposed → archived
Closed on GH without ever running. Archiving on Sagan to match.
