A pipeline that maps each raw variable to SDTM using a master knowledge base (KB) first, then deterministic CDISC rules, then a fuzzy fallback. It writes decisions back into the SDTM spec workbook, guards against phantom mappings at the write boundary, and generates production SAS programs.
Pinnacle 21 validates a finished SDTM; it does not build one. This pipeline works upstream of the conformance checker: it automates the raw-to-SDTM mapping decision itself and reuses prior decisions across studies.
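The lookup order described above (master knowledge base, then deterministic rules, then fuzzy fallback) can be sketched as a simple cascade. This is a minimal illustration, not the project's `lookup_engine.py`: the dictionaries, the `Decision` record, the `map_variable` and `fuzzy_match` helpers, and the 0.6 similarity cutoff are all assumptions, and the fuzzy tier here compares names only, using the standard library's `difflib`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical sample data; a real master KB and rule set would be loaded from files.
MASTER_KB = {"SUBJID": "DM.SUBJID", "AETERM_RAW": "AE.AETERM"}  # prior decisions
CDISC_RULES = {"SEX": "DM.SEX", "AGE": "DM.AGE"}                # deterministic name rules
SDTM_TARGETS = ["DM.SEX", "DM.AGE", "DM.RACE", "AE.AETERM", "AE.AESTDTC"]


@dataclass
class Decision:
    raw_name: str
    target: Optional[str]
    tier: str  # "master", "rule", "fuzzy", or "unmapped"


def fuzzy_match(raw_name: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the target whose variable name is most similar, or None below the cutoff."""
    best, best_score = None, cutoff
    for target in SDTM_TARGETS:
        variable = target.split(".", 1)[1]
        score = SequenceMatcher(None, raw_name.upper(), variable).ratio()
        if score >= best_score:
            best, best_score = target, score
    return best


def map_variable(raw_name: str) -> Decision:
    """Try each tier in order and stop at the first one that answers."""
    key = raw_name.upper()
    if key in MASTER_KB:                       # 1) reuse a prior decision
        return Decision(raw_name, MASTER_KB[key], "master")
    if key in CDISC_RULES:                     # 2) deterministic rule
        return Decision(raw_name, CDISC_RULES[key], "rule")
    target = fuzzy_match(raw_name)             # 3) similarity fallback
    return Decision(raw_name, target, "fuzzy" if target else "unmapped")
```

Recording the tier alongside the target keeps each decision traceable: a reviewer can tell a reused decision from a similarity guess.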
Screenshots
[Screenshot: /public/screenshots/sdtm-master-mapping/sdtm-master-mapping-01-map.png]
[Screenshot: /public/screenshots/sdtm-master-mapping/sdtm-master-mapping-02-review.png]
[Screenshot: /public/screenshots/sdtm-master-mapping/sdtm-master-mapping-03-sas.png]
Data flow
Mapping a new study's raw variables to SDTM is repetitive and error-prone. Decisions made on past studies are not reliably reused, and when they are recalled, a mapping remembered from an old study can produce a target variable that the new study never collected (a phantom mapping).
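The phantom-mapping failure mode suggests a guard that checks every proposed mapping against the raw variables the current study actually collected. The sketch below is one way such a raw-presence gate could look. The `Mapping` record, the `raw_presence_gate` function, and the `mode` switch are hypothetical names chosen for illustration, not the project's actual interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Mapping:
    raw_name: str          # raw variable the decision names as its source
    target: str            # proposed SDTM target, e.g. "VS.VSORRES"
    flagged: bool = False  # set when a phantom is kept for review


def raw_presence_gate(
    mappings: Iterable[Mapping],
    raw_catalog: Iterable[str],
    mode: str = "drop",  # hypothetical switch: "drop" or "flag"
) -> Tuple[List[Mapping], List[dict]]:
    """Keep mappings whose raw variable exists; drop or flag the rest and log them."""
    if mode not in ("drop", "flag"):
        raise ValueError("mode must be 'drop' or 'flag'")
    present = {name.upper() for name in raw_catalog}
    kept: List[Mapping] = []
    change_log: List[dict] = []
    for m in mappings:
        if m.raw_name.upper() in present:
            kept.append(m)
            continue
        # Phantom: the source variable was never collected in this study.
        if mode == "flag":
            m.flagged = True
            kept.append(m)
        change_log.append({"raw_name": m.raw_name, "target": m.target, "action": mode})
    return kept, change_log
```

Returning the change log next to the surviving mappings means every drop or flag can be written out for audit rather than disappearing silently.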
Input: study raw variable catalog + SDTM spec workbook + master KB
   |
   v
Lookup Engine (lookup_engine.py)      1) master lookup     reuse prior decision
   |                                  2) CDISC rules       deterministic SDTMIG
   v                                  3) fuzzy fallback    name/label similarity
ML Scorer (ml_scorer.py)              GradientBoosting composite confidence
   |                                  (name_exact, name_fuzzy, label_tfidf, master_freq)
   v
Spec Writeback (spec_writeback.py)    writes decisions into the spec workbook
   |
   v
Post-check (spec_postcheck.py)        raw-presence gate: drop / flag phantom targets
   |
   v
SAS Generator (sas_generator.py)      --> read -> sort -> merge -> derive+attrib -> sort
Engineering trade-offs
At a glance
A quick visual read of the countable facts; full detail in the table.
Processing characteristics
| Metric | Value | Notes |
|---|---|---|
| Scoring signals | 4 | name_exact, name_fuzzy, label_tfidf, master_freq |
| Model | GradientBoostingClassifier | scikit-learn, used in the mapping scorer |
| Test functions | 202 | Counted across the test suite |
| UI pages | 10 | build/view master, map, review, compare, writeback, SAS, graph view, filter, help |
| SAS output | Multi-step | read -> sort -> merge (IN= flags) -> derive + ATTRIB -> sort |
| Audit | Change-log sheet | Phantom drops and flags recorded for traceability |
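The four scoring signals in the table can be illustrated as features computed for one (raw variable, SDTM target) pair. The sketch below uses only the standard library and stops at feature extraction; training a `GradientBoostingClassifier` on these features is out of scope here. The exact definition of each signal is an assumption (for example, `master_freq` is taken as the share of prior decisions for that raw name that chose this target), and the function names and the `master_counts` structure are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tfidf_vectors(docs):
    """Build smoothed TF-IDF vectors (as dicts) for a list of token lists."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def signals(raw_name, raw_label, target_name, target_label, other_labels, master_counts):
    """Return the four features for one (raw variable, SDTM target) pair.

    master_counts is a hypothetical structure: {RAW_NAME: {TARGET_NAME: times_chosen}}.
    """
    docs = [lbl.lower().split() for lbl in [raw_label, target_label, *other_labels]]
    vecs = tfidf_vectors(docs)
    history = master_counts.get(raw_name.upper(), {})
    total = sum(history.values())
    return {
        "name_exact": float(raw_name.upper() == target_name.upper()),
        "name_fuzzy": SequenceMatcher(None, raw_name.upper(), target_name.upper()).ratio(),
        "label_tfidf": cosine(vecs[0], vecs[1]),
        "master_freq": history.get(target_name, 0) / total if total else 0.0,
    }
```

A classifier trained on rows of these four values, labelled with whether a reviewer accepted the mapping, could then produce the composite confidence.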
Functional wins
Module dependencies
- Python
- pyyaml
- pydantic
- streamlit
- fastapi
- uvicorn
- pandas
- openpyxl
- numpy
- scikit-learn
- networkx
- pytest
- pytest-asyncio
- httpx