What This Demonstrates
Start with a data-first framework. Let the statistical patterns define the story. Embed AI across every layer of the workflow. That's how one practitioner — with no prior healthcare background — built a production-grade, end-to-end analytics platform analyzing $1.09T in Medicaid claims data from a cold start.
What it proves: the ability to architect a complex multi-source semantic model, engineer features and integrate ML outputs into a BI layer, and build a custom visual design system from scratch — all while designing for a real end-user workflow, not just for screenshots.
The Challenge
Medicaid program integrity is a needle-in-a-haystack problem at massive scale. With 617,503 scored providers across seven years of claims history, identifying which providers warrant closer review requires more than simple rules or threshold filters — the signal-to-noise ratio demands a statistical approach.
The goal was to build a system that could score every provider by anomaly risk, surface the specific behavioral patterns driving those scores, and present findings in a way that a non-technical reviewer could act on immediately — without needing a data science background to interpret the output.
Before any of that was possible, six heterogeneous public datasets that were never designed to work together had to be assembled into a single coherent analytical layer. That data engineering problem was the real challenge.
Data Engineering
Six public datasets — seven years of CMS Medicaid claims files, the NPI provider registry, HCPCS procedure codes, ZIP demographics, and geographic reference tables — were each built for separate administrative purposes with their own identifiers. Joining them into a single analytical layer meant building the key bridges manually and resolving provider identity conflicts across address and name variants before any analysis could begin.
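The join pattern can be sketched in pandas. This is a minimal illustration, not the platform's actual schema: the table names, column names, and sample rows below are hypothetical stand-ins for the real CMS, NPI, and HCPCS files, which are far wider and messier.

```python
import pandas as pd

# Hypothetical miniature versions of three of the six source tables.
claims = pd.DataFrame({
    "npi": ["1003000126", "1003000126", "1093718892"],
    "hcpcs_code": ["99213", "T1019", "99213"],
    "paid_amount": [72.50, 18.00, 65.00],
})
npi_registry = pd.DataFrame({
    "npi": ["1003000126", "1093718892"],
    "provider_name": ["acme  home care", "Smith, John"],
    "zip5": ["21201", "21202"],
})
hcpcs = pd.DataFrame({
    "hcpcs_code": ["99213", "T1019"],
    "description": ["Office visit, established patient",
                    "Personal care services, per 15 min"],
})

def normalize_name(s: str) -> str:
    """Collapse whitespace and case so name variants compare equal."""
    return " ".join(s.upper().split())

npi_registry["provider_name"] = npi_registry["provider_name"].map(normalize_name)

# Bridge claims -> provider registry on NPI, then attach procedure
# descriptions. validate="many_to_one" fails fast if a reference
# table unexpectedly contains duplicate keys.
analytical = (
    claims
    .merge(npi_registry, on="npi", how="left", validate="many_to_one")
    .merge(hcpcs, on="hcpcs_code", how="left", validate="many_to_one")
)
```

The `validate` argument is the key defensive move when bridging administrative datasets: identity conflicts surface as a loud `MergeError` at join time rather than as silently duplicated claim rows downstream.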
ML Scoring Pipeline
A three-component unsupervised ensemble scores all 617,503 providers without training labels. The components run independently and combine via rank fusion — no circular dependency, no ground truth required. OIG exclusion data is used only as blind post-hoc validation.
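The rank-fusion step can be sketched as follows. This is an assumed implementation, not the platform's actual code: the three component scores here are random placeholders (the real components are unspecified in this section), and averaging normalized ranks is one common fusion rule that matches the description.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # stand-in for the 617,503 scored providers

# Placeholder component outputs on incompatible scales
# (higher = more anomalous in each case).
comp_a = rng.normal(size=n)
comp_b = rng.exponential(size=n)
comp_c = rng.uniform(size=n)

def to_rank(scores: np.ndarray) -> np.ndarray:
    """Map raw scores to normalized ranks in [0, 1], 1 = most anomalous."""
    ranks = scores.argsort().argsort()   # rank position of each provider
    return ranks / (len(scores) - 1)

# Rank fusion: average the normalized ranks so that no single
# component's scale or outliers dominates the combined score.
fused = np.mean([to_rank(comp_a), to_rank(comp_b), to_rank(comp_c)], axis=0)

# The review queue is simply the highest fused ranks.
top = np.argsort(fused)[::-1][:10]
```

Because each component is reduced to a rank before fusion, a component that produces extreme raw values cannot outvote the others — which is the property that lets heterogeneous detectors combine without calibration or training labels.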
Dashboard Architecture
The platform is structured as six purpose-built pages, each answering a distinct analytical question at a different level of granularity — from $1.09T program overview down to a single provider's behavioral fingerprint.