SDTM Completeness — Nilesh Borade

A Streamlit app over a layered check engine. The engine validates traceability, domain rules, and value fidelity, and runs each check under a per-check time budget; the app shows a live progress dashboard and produces Excel and HTML reports.

What it adds

A conformance checker validates the SDTM you have; it does not confirm that the SDTM is complete and traceable back to raw. This tool validates that linkage (record and subject parity, SUPP completeness, value fidelity), which conformance checking does not cover.
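
Subject parity, one of the linkage checks named above, reduces to comparing two sets of subject identifiers. The sketch below is a minimal illustration rather than the tool's own code: the function name `subject_parity` and the sample identifiers are hypothetical, and it assumes raw and SDTM subject IDs have already been normalised to the same form.

```python
from typing import Dict, Iterable, List


def subject_parity(raw_ids: Iterable[str], sdtm_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Compare subject identifiers found in raw data with those found in SDTM."""
    raw, sdtm = set(raw_ids), set(sdtm_ids)
    return {
        "missing_in_sdtm": sorted(raw - sdtm),      # in raw, never reached SDTM
        "unexpected_in_sdtm": sorted(sdtm - raw),   # in SDTM, no raw source
    }


# Hypothetical identifiers, for illustration only.
report = subject_parity(["S001", "S002", "S003"], ["S001", "S003", "S004"])
print(report)
# {'missing_in_sdtm': ['S002'], 'unexpected_in_sdtm': ['S004']}
```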

How it works

Inputs: Raw datasets, SDTM datasets, Master mapping spec

Process: (1) Loaders + join engine, (2) Layered catalog (6 layers), (3) Orchestrator (per-check budget), (4) Reporting

Outputs: Live dashboard, Excel report, HTML report

Typical layout

By the numbers

Checks in catalog: 87

Test functions: 47

Check layers: 6

Tool version: 0.4.0

Screenshots

The Run Checks page with raw and SDTM inputs selected and the layered check catalog about to execute

The Results page showing KPI cards, the live progress of checks, and the issues table grouped by layer

The Catalog page listing checks by layer (traceability, domain, value fidelity) with their IDs

Data flow

Confirming that SDTM is complete and traceable back to raw is slow manual work across many domains, and a single badly written check could run for hours on a real study and be abandoned before producing any result.

Input: raw datasets + SDTM datasets + master mapping spec
        |
        v
  Loaders + Join Engine (core/)      align raw to SDTM by mapping spec
        |
        v
  Check Catalog (config/check_catalog.py)
        |   Layer 1  Traceability    trc_001..008 (record/subject parity, SUPP, coverage)
        |   Layer 2  Domain rules    ae/lb/dm/ds/sv/vs/cm/ex/ie/mh + basic_*
        |   Layer 6  Value fidelity  vfd_001..009 (assign passthrough, date xform, parity)
        v
  Orchestrator (per-check time budget) --> live progress to dashboard
        |
        v
  Reporting (reporting/)  -->  Excel reporter + HTML dashboard
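
A value fidelity check of the "date xform" kind compares a raw date, once transformed, with the value stored in SDTM. The sketch below is a hedged illustration only: the raw format DD/MM/YYYY, the function name `date_xform_matches`, and the choice to compare the date portion alone are all assumptions, and the actual vfd_* checks may work differently.

```python
from datetime import datetime
from typing import Optional


def date_xform_matches(raw_value: str, sdtm_dtc: str,
                       raw_format: str = "%d/%m/%Y") -> Optional[bool]:
    """Return True/False for a match, or None when the raw value cannot be parsed."""
    try:
        expected = datetime.strptime(raw_value.strip(), raw_format).date().isoformat()
    except ValueError:
        return None  # unparseable raw value: report separately instead of guessing
    # Compare only the date part, so "2024-01-15T10:30" still matches "2024-01-15".
    return sdtm_dtc[:10] == expected


print(date_xform_matches("15/01/2024", "2024-01-15"))        # True
print(date_xform_matches("15/01/2024", "2024-01-16"))        # False
print(date_xform_matches("not a date", "2024-01-15"))        # None
```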

Engineering trade-offs

Layered check catalog (traceability / domain / value fidelity)

Groups checks by what they prove about the data, so a reviewer reads results as a completeness story rather than a flat list.
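
One way to picture such a catalog is a list of check records tagged with their layer, grouped at report time. This is a sketch under assumptions, not the contents of config/check_catalog.py: `CheckSpec`, `group_by_layer`, the check titles, and the `ae_001` ID are hypothetical; only the layer numbers, layer names, and the trc_/vfd_ prefixes come from the data flow shown earlier.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CheckSpec:
    check_id: str
    layer: int
    layer_name: str
    title: str  # hypothetical titles, for illustration only


CATALOG = [
    CheckSpec("vfd_001", 6, "Value fidelity", "Assign passthrough"),
    CheckSpec("trc_002", 1, "Traceability", "Subject parity"),
    CheckSpec("ae_001", 2, "Domain rules", "AE domain rule"),  # hypothetical ID
    CheckSpec("trc_001", 1, "Traceability", "Record parity"),
]


def group_by_layer(catalog: List[CheckSpec]) -> Dict[str, List[str]]:
    """Group check IDs under a layer heading, ordered by layer then by ID."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for spec in sorted(catalog, key=lambda s: (s.layer, s.check_id)):
        grouped[f"Layer {spec.layer} {spec.layer_name}"].append(spec.check_id)
    return dict(grouped)


for heading, check_ids in group_by_layer(CATALOG).items():
    print(heading, check_ids)
```

Keeping the layer on each record, rather than in separate lists, lets the same catalog drive both execution order and the grouped results view.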

Set-membership rewrite of the SUPP completeness check

The original orphan-SUPP resolution was O(n_supp × n_parent) and ran for hours; pre-computing the parent key sets once turns each lookup into a fast set-membership test.
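
The idea can be sketched in plain Python. This is an illustration of the set-membership technique, not the engine's code: `find_orphan_supp` is a hypothetical name, rows are modelled as dictionaries, and keys are compared as strings on the assumption that IDVARVAL is character while the parent variable (for example AESEQ) may be numeric. Real data would also need to handle blank IDVAR and float formatting.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]


def find_orphan_supp(supp_rows: List[Row], parent_rows: List[Row]) -> List[Row]:
    """Return SUPP rows whose (USUBJID, IDVAR value) has no matching parent row."""
    # Build one key set per IDVAR in use: a single pass over the parent domain each.
    idvars = {str(r["IDVAR"]) for r in supp_rows}
    parent_keys: Dict[str, Set[Tuple[str, str]]] = {
        idvar: {(str(p["USUBJID"]), str(p[idvar])) for p in parent_rows if idvar in p}
        for idvar in idvars
    }
    # Single pass over SUPP: each lookup is an average O(1) membership test.
    return [
        r for r in supp_rows
        if (str(r["USUBJID"]), str(r["IDVARVAL"])) not in parent_keys[str(r["IDVAR"])]
    ]


# Hypothetical sample data.
ae = [{"USUBJID": "S001", "AESEQ": 1}, {"USUBJID": "S001", "AESEQ": 2}]
suppae = [
    {"USUBJID": "S001", "IDVAR": "AESEQ", "IDVARVAL": "1"},  # has a parent
    {"USUBJID": "S001", "IDVAR": "AESEQ", "IDVARVAL": "3"},  # orphan: no AESEQ 3
    {"USUBJID": "S002", "IDVAR": "AESEQ", "IDVARVAL": "1"},  # orphan: no subject S002
]
orphans = find_orphan_supp(suppae, ae)
print([(r["USUBJID"], r["IDVARVAL"]) for r in orphans])
# [('S001', '3'), ('S002', '1')]
```

The cost becomes roughly one pass over the parent rows per distinct IDVAR plus one pass over the SUPP rows, instead of comparing every SUPP row against every parent row.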

Per-check time budget with status reporting

One pathological check can no longer stall the whole run; each check is bounded, and one that exceeds its budget is reported rather than silently abandoned.
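
A minimal sketch of a per-check budget, assuming each check is a plain callable that returns an issue count. The names `CheckResult` and `run_with_budget` and the status strings are hypothetical, and the orchestrator itself may enforce its budget differently. This version waits on a worker thread with a timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    check_id: str
    status: str                 # "ok", "timeout", or "error"
    issues: int = 0
    seconds: float = 0.0
    detail: Optional[str] = None


def run_with_budget(check_id: str, fn: Callable[[], int], budget_s: float) -> CheckResult:
    """Run one check and always return a status, even if it overruns or raises."""
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(fn)
    try:
        issues = future.result(timeout=budget_s)
        return CheckResult(check_id, "ok", issues, time.monotonic() - start)
    except FutureTimeout:
        return CheckResult(check_id, "timeout", 0, time.monotonic() - start,
                           f"exceeded {budget_s}s budget")
    except Exception as exc:  # a failing check is reported, not propagated
        return CheckResult(check_id, "error", 0, time.monotonic() - start, repr(exc))
    finally:
        pool.shutdown(wait=False)  # do not block the run on the overrunning worker


print(run_with_budget("fast", lambda: 3, budget_s=1.0).status)                      # ok
print(run_with_budget("slow", lambda: time.sleep(0.3) or 0, budget_s=0.05).status)  # timeout
```

A thread cannot be forcibly stopped in Python, so in this sketch an overrunning check keeps running in the background after it is reported; a separate process or a cooperative deadline inside the check would be needed to actually reclaim the work.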

Engine bundled inside the app, overridable by env var

Works out of the box for a reviewer, but a developer can point SDTM_COMPLETENESS_ENGINE at a different engine build.
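
The override could be resolved roughly as follows. Only the variable name SDTM_COMPLETENESS_ENGINE comes from the description above; that it holds a filesystem path, the function name `resolve_engine_dir`, and the example directories are assumptions made for this sketch.

```python
import os
from pathlib import Path
from typing import Mapping

ENGINE_ENV_VAR = "SDTM_COMPLETENESS_ENGINE"


def resolve_engine_dir(bundled_dir: Path, env: Mapping[str, str] = os.environ) -> Path:
    """Prefer the developer override when set and non-empty, else the bundled engine."""
    override = env.get(ENGINE_ENV_VAR, "").strip()
    if override:
        return Path(override).expanduser()
    return bundled_dir


# Hypothetical paths, for illustration only.
bundled = Path("app") / "engine"
print(resolve_engine_dir(bundled, env={}))
print(resolve_engine_dir(bundled, env={ENGINE_ENV_VAR: "/opt/engine-dev"}))
```

Passing the environment as a parameter keeps the lookup easy to test without mutating the real process environment.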

At a glance

A quick visual read of the countable facts; full detail in the table.

Checks in catalog: 87

Test functions: 47

Check layers: 6

Relative scale · values labelled · unit: count

Processing characteristics

Metric	Value	Notes
Checks in catalog	87	v12 brought the in-memory pipeline to 85 all-ok, plus L5_COV_007/008
Check layers	6	Traceability, domain, value fidelity among them
Test functions	47	Includes 17 v11-fix regression assertions
Tool version	0.4.0	App labelled v12
Reference run	4730 issues / 633s	Real GADI run cited in the v12 changelog
Reports	Excel + HTML	Dashboard and downloadable report

Functional wins

01 Validates SDTM completeness and traceability back to raw across six check layers in one run.

02 Eliminated an O(n-squared) SUPP completeness check that previously ran for hours, by rewriting it as a set-membership test.

03 Bounds every check with a per-check time budget so a single slow check is reported rather than stalling the whole run.

04 Added site-change (L5_COV_007) and outlier (L5_COV_008) coverage checks, with a live progress dashboard and Excel/HTML reports.

Module dependencies

core

Python 3.9+

streamlit
altair

data

pandas
numpy
openpyxl
xlsxwriter
pyreadstat

testing

pytest