ClaimEval Concept (Public)
ClaimEval is an outcome-grounded benchmark for medical billing AI. It measures whether a system can transform real clinical context into a paid, auditable claim.
What makes ClaimEval different
- Evaluates claims with a deterministic Adjudication Sandbox (NCCI bundling edits, MUE unit limits, LCD/NCD coverage rules, age/gender edits) before comparison to gold labels; a minimal edit check is sketched after this list.
- Scores against payer-aligned outcomes, not text similarity.
- Uses multi-adjudicated gold standards with acceptable alternatives to handle coding gray areas.
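To make the Adjudication Sandbox concrete, here is a minimal sketch of the kind of deterministic edit it applies. The rule values, code pairs, and `ClaimLine` schema below are illustrative assumptions, not ClaimEval's actual rule tables or implementation.

```python
# Illustrative sketch only: rule values and schema are hypothetical, not ClaimEval's real tables.
from dataclasses import dataclass

# Hypothetical frozen rule snapshots (real NCCI/MUE tables are versioned and far larger).
MUE_LIMITS = {"96372": 3, "80053": 1}     # max allowed units per CPT per encounter
NCCI_PAIRS = {("80053", "80048")}         # (column-1, column-2): column-2 bundles into column-1

@dataclass
class ClaimLine:
    cpt: str
    units: int
    modifiers: tuple[str, ...] = ()

def adjudicate(lines: list[ClaimLine]) -> list[str]:
    """Return deterministic denial reasons; an empty list means the claim clears these edits."""
    denials = []
    billed = {line.cpt for line in lines}
    for line in lines:
        limit = MUE_LIMITS.get(line.cpt)
        if limit is not None and line.units > limit:
            denials.append(f"MUE: {line.cpt} billed with {line.units} units, limit is {limit}")
    for col1, col2 in NCCI_PAIRS:
        if col1 in billed and col2 in billed:
            col2_line = next(l for l in lines if l.cpt == col2)
            # A bypass modifier (e.g. 59) on the column-2 line would lift the bundling edit.
            if "59" not in col2_line.modifiers:
                denials.append(f"NCCI: {col2} is bundled into {col1} without a bypass modifier")
    return denials
```

Because the edits are rule lookups over frozen tables, the same claim always produces the same denial list, which is what allows comparison to gold labels to be reproducible.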
Scope
- Context-to-Claim evaluation: transcript + metadata + orders + attestations + payer policy snapshot.
- In scope: CPT/ICD-10/modifiers/units, diagnosis pointers, documentation sufficiency, functional adjudication.
- Out of scope (v1): audio-to-text, full EHR ingestion, patient billing/collections, prior auth.
Inputs (Context Bundle)
- Structured encounter transcript (time-ordered, speaker-labeled)
- Visit metadata (place of service, provider type, new vs established, timestamps)
- Problem list / active diagnoses
- Orders and actions (labs, imaging, meds, procedures)
- Clinician attestations (laterality, time-based services, critical care eligibility)
- Payer policy snapshot (versioned, rules relevant to candidate codes)
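One plausible shape for the Context Bundle is sketched below; the class and field names are assumptions for illustration, not the published schema.

```python
# Illustrative Context Bundle shape; field names are assumptions, not the published schema.
from dataclasses import dataclass

@dataclass
class TranscriptTurn:
    timestamp: str          # ISO-8601
    speaker: str            # e.g. "clinician", "patient"
    text: str

@dataclass
class ContextBundle:
    transcript: list[TranscriptTurn]     # time-ordered, speaker-labeled turns
    visit_metadata: dict                 # place of service, provider type, new vs established, timestamps
    problem_list: list[str]              # active diagnoses (ICD-10)
    orders: list[dict]                   # labs, imaging, meds, procedures
    attestations: dict                   # laterality, time-based services, critical care eligibility
    payer_policy_snapshot: str           # versioned identifier for the applicable rule set
```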
Outputs (ClaimCandidate Schema)
- CPT codes with units and modifiers
- ICD-10 codes with primary/secondary flags
- Diagnosis-to-procedure pointers
- Time-based indicators
- Documentation sufficiency flags per line item
- Confidence or review-required markers
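A corresponding sketch of the ClaimCandidate output is below; again, the names and types are assumptions chosen to mirror the bullet points above, not the published schema.

```python
# Illustrative ClaimCandidate shape; names and types are assumptions, not the published schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ClaimLineItem:
    cpt: str
    units: int = 1
    modifiers: list[str] = field(default_factory=list)
    diagnosis_pointers: list[int] = field(default_factory=list)  # indices into ClaimCandidate.icd10
    time_minutes: Optional[int] = None                           # time-based indicator, when applicable
    documentation_sufficient: bool = True                        # per-line sufficiency flag
    review_required: bool = False                                # confidence / human-review marker

@dataclass
class ClaimCandidate:
    icd10: list[str]                     # primary code first, secondaries after
    lines: list[ClaimLineItem]
    confidence: Optional[float] = None
```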
Metrics (preview)
- Outcome: Paid@1 (first-pass acceptance), Denial rate
- Financial: Revenue Integrity Score, RVU accuracy
- Granular: primary CPT exact match; modifier precision/recall; ICD-10 specificity; diagnosis–procedure linkage
- Safety: hallucination rate (codes or services unsupported by the source documentation); missed documentation-sufficiency flags
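For two of these metrics, a plausible reading is sketched below. These definitions are assumptions for illustration, not the official scorer.

```python
# One plausible reading of Paid@1 and modifier precision/recall; not the official scorer definitions.
def paid_at_1(first_pass_accepted: list[bool]) -> float:
    """Share of encounters whose first submitted claim clears adjudication with no denials."""
    return sum(first_pass_accepted) / len(first_pass_accepted)

def modifier_precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-level precision and recall over the modifiers on a claim line."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    return precision, recall
```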
Dataset construction (preview)
- ~2,000 de-identified encounters; multi-coder adjudication; frozen rule sets.
- Splits: public dev (inputs + truth), public test (inputs only, leaderboard), private holdout (official eval).
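A minimal loading sketch for these splits, assuming one JSONL file per split; the paths and file layout are assumptions, not the released data format.

```python
# Hypothetical loader assuming one JSONL file per split; paths and layout are assumptions.
import json
from pathlib import Path

def load_split(root: str, split: str) -> list[dict]:
    """Load 'dev' (inputs + truth), 'test' (inputs only), or another locally available split."""
    path = Path(root) / f"{split}.jsonl"
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]
```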
Minimum deliverables (launch bar)
- claim-eval CLI (local runner: fetch, run, score)
- Containerized evaluation harness (Adjudication Sandbox)
- Inference abstraction (pluggable adapters for local models/remote APIs; an interface sketch follows this list)
- Dataset loader (public inputs, dev truth)
- Deterministic scorer (versioned, seeded)
- Public leaderboard (managed submissions, anti-overfitting controls)
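One way the pluggable inference abstraction could look is sketched below; the Protocol name and method signature are assumptions, not the harness's actual API.

```python
# Sketch of a pluggable adapter interface; the Protocol name and method signature are assumptions.
from typing import Protocol

class InferenceAdapter(Protocol):
    def generate_claim(self, context_bundle: dict) -> dict:
        """Map a Context Bundle to a ClaimCandidate-shaped dict."""
        ...

class EchoAdapter:
    """Trivial adapter satisfying the protocol; a real adapter would call a local model or remote API."""
    def generate_claim(self, context_bundle: dict) -> dict:
        return {"icd10": [], "lines": []}   # stub output keeps the sketch runnable
```

Keeping the adapter boundary this thin is what would let the same harness and scorer evaluate local models and remote APIs interchangeably.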
Governance and versions
- Immutable releases: ClaimEval-UC v1, ClaimEval-ER v1
- Transparent methodology, public scoring harness, reproducible evaluation
Why it matters
- Aligns AI performance with payer outcomes and revenue integrity
- Surfaces brittleness in modifiers, bundling, documentation sufficiency
- Enables fair vendor/model comparisons and a common language for buyers and payers