ClaimEval Concept (Public)

ClaimEval is an outcome-grounded benchmark for medical billing AI. It measures whether a system can transform real clinical context into a paid, auditable claim.

What makes ClaimEval different

  • Evaluates claims with a deterministic Adjudication Sandbox (NCCI, MUEs, LCD/NCD, age/gender edits) before comparison to gold labels; a sketch follows this list.
  • Scores against payer-aligned outcomes, not text similarity.
  • Uses multi-adjudicated gold standards with acceptable alternatives to handle coding gray areas.
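
The Adjudication Sandbox itself is not yet public, but the idea can be illustrated with a small deterministic pass over claim lines. Everything below is a sketch: the rule tables, CPT codes, and field names are hypothetical stand-ins, not ClaimEval's actual rule sets or API.

```python
# Illustrative only: a tiny deterministic adjudication pass. The rule tables,
# CPT codes, and field names are hypothetical stand-ins for versioned
# NCCI/MUE data, not ClaimEval's actual Adjudication Sandbox.

MUE_LIMITS = {"80053": 1}              # max units per code per encounter (hypothetical)
NCCI_PAIRS = {("80053", "80048")}      # (comprehensive, component) bundling pairs (hypothetical)

def adjudicate(lines):
    """lines: [{'cpt': str, 'units': int, 'modifiers': [str]}] -> list of triggered edits."""
    edits = []
    billed = {line["cpt"] for line in lines}
    for line in lines:
        limit = MUE_LIMITS.get(line["cpt"])
        if limit is not None and line["units"] > limit:
            edits.append(f"MUE: {line['cpt']} billed {line['units']} units, limit {limit}")
    for comprehensive, component in NCCI_PAIRS:
        if comprehensive in billed and component in billed:
            # A bypass modifier (e.g. 59) on the component line lifts the bundling edit.
            bypass = any("59" in line["modifiers"]
                         for line in lines if line["cpt"] == component)
            if not bypass:
                edits.append(f"NCCI: {component} bundles into {comprehensive}")
    return edits

# Billing both panels on one claim without a bypass modifier triggers the NCCI edit.
print(adjudicate([
    {"cpt": "80053", "units": 1, "modifiers": []},
    {"cpt": "80048", "units": 1, "modifiers": []},
]))
```

Because the rule sets are frozen per release, the same claim always triggers the same edits, which is what makes the sandbox usable as a scoring component.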

Scope

  • Context-to-Claim evaluation: transcript + metadata + orders + attestations + payer policy snapshot.
  • In scope: CPT/ICD-10/modifiers/units, diagnosis pointers, documentation sufficiency, functional adjudication.
  • Out of scope (v1): audio-to-text transcription, full EHR ingestion, patient billing/collections, prior authorization.

Inputs (Context Bundle)

  • Structured encounter transcript (time-ordered, speaker-labeled)
  • Visit metadata (place of service, provider type, new vs established, timestamps)
  • Problem list / active diagnoses
  • Orders and actions (labs, imaging, meds, procedures)
  • Clinician attestations (laterality, time-based services, critical care eligibility)
  • Payer policy snapshot (versioned, rules relevant to candidate codes)
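
To make the bundle concrete, one plausible serialization is shown below. The keys and example values are illustrative assumptions, not the published ClaimEval input schema.

```python
# Illustrative shape of a context bundle; keys and values are placeholders,
# not the published ClaimEval input schema.
context_bundle = {
    "transcript": [
        {"t": "00:01:12", "speaker": "clinician", "text": "Any shortness of breath with the cough?"},
        {"t": "00:01:15", "speaker": "patient", "text": "No, just the cough and some congestion."},
    ],
    "visit_metadata": {
        "place_of_service": "20",          # urgent care facility
        "provider_type": "MD",
        "patient_status": "established",
        "start": "2025-03-02T14:05:00Z",
        "end": "2025-03-02T14:27:00Z",
    },
    "problem_list": ["J20.9"],             # acute bronchitis, unspecified
    "orders": [
        {"type": "imaging", "code": "71046", "description": "Chest X-ray, 2 views"},
    ],
    "attestations": {"laterality": None, "time_based_minutes": None, "critical_care": False},
    "payer_policy_snapshot": {"version": "2025-03", "rules": []},
}
```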

Outputs (ClaimCandidate Schema)

  • CPT codes with units and modifiers
  • ICD-10 codes with primary/secondary flags
  • Diagnosis-to-procedure pointers
  • Time-based indicators
  • Documentation sufficiency flags per line item
  • Confidence or review-required markers
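
A ClaimCandidate could be modeled roughly as below; the class and field names are assumptions for illustration, not the frozen output schema.

```python
# Illustrative sketch of the ClaimCandidate output; names and types are
# assumptions, not the frozen ClaimEval schema.
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    icd10: str                 # e.g. "J20.9"
    primary: bool = False

@dataclass
class ClaimLine:
    cpt: str                                                      # e.g. "99204"
    units: int = 1
    modifiers: list[str] = field(default_factory=list)            # e.g. ["25"]
    diagnosis_pointers: list[int] = field(default_factory=list)   # indexes into ClaimCandidate.diagnoses
    time_based_minutes: int | None = None
    documentation_sufficient: bool = True
    review_required: bool = False
    confidence: float | None = None

@dataclass
class ClaimCandidate:
    diagnoses: list[Diagnosis]
    lines: list[ClaimLine]
```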

Metrics (preview)

  • Outcome: Paid@1 (first-pass acceptance); denial rate
  • Financial: Revenue Integrity Score; RVU accuracy
  • Granular: primary CPT exact match; modifier precision/recall; ICD-10 specificity; diagnosis–procedure linkage
  • Safety: hallucination rate (codes or services not supported by the documentation); missed documentation flags
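
The official definitions will ship with the versioned scorer; as a rough sketch, Paid@1 and modifier precision/recall could be computed along these lines. Function names and input shapes are assumptions.

```python
# Rough metric sketches; the versioned ClaimEval scorer may define these differently.

def paid_at_1(results):
    """Fraction of claims the sandbox accepts on first submission."""
    return sum(1 for r in results if r["first_pass_paid"]) / len(results)

def modifier_precision_recall(predicted, gold):
    """Set-based precision/recall over (CPT, modifier) pairs for one claim.
    predicted/gold: dict mapping CPT code -> list of modifiers."""
    pred = {(cpt, m) for cpt, mods in predicted.items() for m in mods}
    true = {(cpt, m) for cpt, mods in gold.items() for m in mods}
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(true) if true else 1.0
    return precision, recall

# Example: the system applies modifier 25 correctly but misses modifier 59.
print(modifier_precision_recall(
    predicted={"99213": ["25"]},
    gold={"99213": ["25"], "20610": ["59"]},
))  # (1.0, 0.5)
```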

Dataset construction (preview)

  • ~2,000 de-identified encounters; multi-coder adjudication; frozen rule sets.
  • Splits: public dev (inputs + truth), public test (inputs only, leaderboard), private holdout (official eval).
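
As one concrete reading of the split policy, the sketch below shows what each split would distribute; the exact packaging is an assumption.

```python
# Assumed reading of the ClaimEval split policy; packaging details are not final.
SPLITS = {
    "public_dev":      {"inputs_released": True,  "truth_released": True},   # local development and scoring
    "public_test":     {"inputs_released": True,  "truth_released": False},  # leaderboard submissions
    "private_holdout": {"inputs_released": False, "truth_released": False},  # official evaluation, run privately
}
```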

Minimum deliverables (launch bar)

  • claim-eval CLI (local runner: fetch, run, score)
  • Containerized evaluation harness (Adjudication Sandbox)
  • Inference abstraction (pluggable adapters for local models/remote APIs; interface sketched after this list)
  • Dataset loader (public inputs, dev truth)
  • Deterministic scorer (versioned, seeded)
  • Public leaderboard (managed submissions, anti-overfitting controls)
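
The CLI and harness are not yet released, so nothing below reflects a real API. As one way the pluggable inference abstraction could look, an adapter only has to map a context bundle to a ClaimCandidate-shaped dict; the class names here are hypothetical.

```python
# Hypothetical adapter interface; the real claim-eval plugin API is not yet published.
from abc import ABC, abstractmethod

class InferenceAdapter(ABC):
    """Maps one context bundle (dict) to a ClaimCandidate-shaped dict."""

    @abstractmethod
    def generate_claim(self, context_bundle: dict) -> dict:
        ...

class CallableAdapter(InferenceAdapter):
    """Wraps any callable (local model or remote API client) that does the mapping."""

    def __init__(self, fn):
        self.fn = fn

    def generate_claim(self, context_bundle: dict) -> dict:
        return self.fn(context_bundle)

# Usage: the harness would iterate the dataset, call the adapter per encounter,
# then pass the returned candidates to the deterministic, seeded scorer.
adapter = CallableAdapter(lambda bundle: {"diagnoses": ["J20.9"], "lines": []})
candidate = adapter.generate_claim({"transcript": [], "visit_metadata": {}})
```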

Governance and versions

  • Immutable releases: ClaimEval-UC v1, ClaimEval-ER v1
  • Transparent methodology, public scoring harness, reproducible evaluation

Why it matters

  • Aligns AI performance with payer outcomes and revenue integrity
  • Surfaces brittleness in modifiers, bundling, documentation sufficiency
  • Enables fair vendor/model comparisons and a common language for buyers and payers