SDTM Master Mapping System

A pipeline that maps each raw variable to SDTM using a master knowledge base first, then deterministic CDISC rules, then a fuzzy fallback. It writes decisions back into the SDTM spec workbook, guards against phantom mappings at the write boundary, and generates production SAS programs.

What it adds

Pinnacle 21 validates a finished SDTM; it does not build one. This system works upstream of the conformance checker: it automates the raw-to-SDTM mapping decision itself and reuses prior decisions across studies.

How it works

Inputs: raw variable catalog, SDTM spec workbook, master knowledge base

Process:
1. Lookup: master, then rules, then fuzzy
2. ML scorer (4 signals)
3. Spec write-back
4. Raw-presence gate

Outputs: mapped SDTM spec, generated SAS programs, change-log (audit)

By the numbers

Version: v28.13
Test functions: 202
Scoring signals: 4
UI pages: 10

Screenshots

The Map Study page after a raw catalog is loaded, showing proposed SDTM mappings with confidence scores per row

The Review/Approve page where a human accepts or overrides mappings, with the master-confidence and signal breakdown visible

The SAS Generate page showing produced SAS with read/sort/merge/derive steps and the ATTRIB section

Data flow

Mapping a new study's raw variables to SDTM is repetitive and error-prone. Decisions made on past studies are not reliably reused, and a memory of an old study's mapping can wrongly produce a target that the new study never collected.

Input: study raw variable catalog + SDTM spec workbook + master KB
        |
        v
  Lookup Engine (lookup_engine.py)   1) master lookup  reuse prior decision
        |                            2) CDISC rules    deterministic SDTMIG
        v                            3) fuzzy fallback name/label similarity
  ML Scorer (ml_scorer.py)           GradientBoosting composite confidence
        |                            (name_exact, name_fuzzy, label_tfidf, master_freq)
        v
  Spec Writeback (spec_writeback.py)  writes decisions into the spec workbook
        |
        v
  Post-check (spec_postcheck.py)      raw-presence gate: drop / flag phantom targets
        |
        v
  SAS Generator (sas_generator.py) -->  read -> sort -> merge -> derive+attrib -> sort

Engineering trade-offs

Master-first, fuzzy-last decision order

Prior human decisions are the strongest signal; fuzzy name matching is the weakest, so it only runs when history and rules are silent.
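
The decision order above can be sketched as a simple cascade. This is a minimal illustration, not the contents of lookup_engine.py: the dictionaries, the target list, the 0.6 threshold, and the function names are all hypothetical, and the fuzzy step uses the standard library's difflib as a stand-in for whatever similarity measure the real engine applies.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical data for illustration only; the real master KB and rule set
# live in the project and are not shown in this document.
MASTER_KB = {"SUBJID": "DM.SUBJID", "AESTDT": "AE.AESTDTC"}   # prior human decisions
CDISC_RULES = {"SEX": "DM.SEX", "AETERM": "AE.AETERM"}        # deterministic rules
SDTM_TARGETS = ["DM.SUBJID", "DM.SEX", "DM.BRTHDTC",
                "AE.AETERM", "AE.AESTDTC", "VS.VSORRES"]


def fuzzy_match(raw_name: str, threshold: float = 0.6) -> Optional[str]:
    """Return the target whose variable name is most similar, if above threshold."""
    best_target, best_score = None, 0.0
    for target in SDTM_TARGETS:
        variable = target.split(".", 1)[1]
        score = SequenceMatcher(None, raw_name.upper(), variable).ratio()
        if score > best_score:
            best_target, best_score = target, score
    return best_target if best_score >= threshold else None


def lookup(raw_name: str) -> Tuple[Optional[str], str]:
    """Try master first, then rules, then fuzzy; report which stage answered."""
    key = raw_name.upper()
    if key in MASTER_KB:
        return MASTER_KB[key], "master"
    if key in CDISC_RULES:
        return CDISC_RULES[key], "rule"
    target = fuzzy_match(key)
    return (target, "fuzzy") if target else (None, "unmapped")
```

Returning the stage name alongside the target keeps the provenance of each decision visible, so a reviewer can tell a reused human decision from a similarity guess.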

GradientBoosting over a 4-signal composite

Combines exact match, fuzzy similarity, label TF-IDF, and master frequency into one calibrated confidence rather than a brittle rule cascade.
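
As a rough sketch of what the four signals might look like before they reach the classifier, the example below computes them with the standard library only. The exact definitions are assumptions: the function names, the smoothed IDF formula, and the reading of master_freq as the share of prior decisions for a raw name that chose a given target are illustrative, not taken from ml_scorer.py. In the described system these four values would be the feature vector passed to the GradientBoostingClassifier; the sketch stops at feature construction.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tfidf_cosine(label_a, label_b, corpus):
    """Cosine similarity of TF-IDF vectors built over a small label corpus."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    def idf(term):
        df = sum(1 for doc in docs if term in doc)
        return math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed form)

    def vector(label):
        counts = Counter(label.lower().split())
        return {term: tf * idf(term) for term, tf in counts.items()}

    vec_a, vec_b = vector(label_a), vector(label_b)
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def signals(raw_name, raw_label, target_name, target_label, master_counts, corpus):
    """Build the four-signal feature dict for one (raw, target) candidate pair.

    master_counts maps target name -> number of prior decisions that chose it
    for this raw name (hypothetical shape).
    """
    total = sum(master_counts.values())
    raw, target = raw_name.upper(), target_name.upper()
    return {
        "name_exact": 1.0 if raw == target else 0.0,
        "name_fuzzy": SequenceMatcher(None, raw, target).ratio(),
        "label_tfidf": tfidf_cosine(raw_label, target_label, corpus),
        "master_freq": master_counts.get(target, 0) / total if total else 0.0,
    }
```

Keeping each signal in the 0 to 1 range is a convenience for inspection in a review screen; tree-based models such as gradient boosting do not require it.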

Raw-presence gate at the write boundary (v28.11+)

A master memory of an old study must not create a target the new study never collected; the gate drops or flags phantom rows before they reach a validated spec.

Required variables are flagged, never auto-dropped

A missing required source is a human decision, so the row is highlighted for resolution rather than silently removed.
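
The two rules above (drop phantom rows, but only flag required ones) can be sketched together. This is a minimal illustration under assumed data shapes: the row dictionaries, the status values, and the function name are hypothetical and do not reflect the actual interface of spec_postcheck.py.

```python
def raw_presence_gate(mappings, raw_columns, required_targets):
    """Filter proposed mappings against the raw variables the study actually has.

    mappings: list of dicts with 'source' and 'target' keys (hypothetical shape).
    Returns (rows_to_write, change_log). Rows whose source is absent are dropped,
    unless the target is required, in which case they are kept and flagged.
    """
    present = {column.upper() for column in raw_columns}
    rows, change_log = [], []
    for row in mappings:
        if row["source"].upper() in present:
            rows.append({**row, "status": "ok"})
        elif row["target"] in required_targets:
            # Required variable: a human decides, so keep the row but flag it.
            rows.append({**row, "status": "flagged"})
            change_log.append({"action": "flag", "target": row["target"],
                               "reason": "required target, source not in raw data"})
        else:
            # Phantom mapping: remembered from an old study, never collected here.
            change_log.append({"action": "drop", "target": row["target"],
                               "reason": "source not in raw data"})
    return rows, change_log
```

Every drop or flag appends a change-log entry, which mirrors the audit behaviour described for the change-log sheet; rows that pass the gate are not logged in this sketch.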

At a glance

The countable facts in brief; full detail is in the table under Processing characteristics.

Test functions: 202
UI pages: 10
Scoring signals: 4

Processing characteristics

Metric	Value	Notes
Scoring signals	4	name_exact, name_fuzzy, label_tfidf, master_freq
Model	GradientBoostingClassifier	scikit-learn, used in the mapping scorer
Test functions	202	Counted across the test suite
UI pages	10	build/view master, map, review, compare, writeback, SAS, graph view, filter, help
SAS output	Multi-step	read -> sort -> merge (IN= flags) -> derive + ATTRIB -> sort
Audit	Change-log sheet	Phantom drops and flags recorded for traceability

Functional wins

01 Reuses prior mapping decisions from a master knowledge base before falling back to rules or fuzzy matching, raising consistency across studies.

02 Blocks phantom mappings at the write boundary so an old study's memory cannot inject a target the current study never collected.

03 Generates production-grade SAS with type-aware assignment, --SEQ creation, baseline-flag and EPOCH templates, and an ATTRIB section from the spec.

04 Records every automated drop or flag in a change-log sheet so each decision is traceable.
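
To make the SAS generation step more concrete, the sketch below renders two of the pieces named above, the ATTRIB section and --SEQ creation, as SAS text from spec rows. It is an assumption-laden illustration rather than the output of sas_generator.py: the spec row shape, the dataset names work.merged and work._sorted, and both function names are hypothetical, and the read, merge, baseline-flag, and EPOCH steps are left out.

```python
def render_attrib(spec_rows):
    """Render a SAS ATTRIB statement from spec rows.

    spec_rows: dicts with 'name', 'label', 'type' ('Char' or 'Num'), 'length'
    (hypothetical shape). Character lengths get the SAS '$' prefix.
    """
    lines = ["  attrib"]
    for row in spec_rows:
        is_char = row["type"].lower() == "char"
        length = "${}".format(row["length"]) if is_char else str(row["length"])
        lines.append('    {} label="{}" length={}'.format(row["name"], row["label"], length))
    lines.append("  ;")
    return "\n".join(lines)


def render_seq_step(domain, keys, spec_rows, source="work.merged"):
    """Render sort + DATA step text that numbers records within USUBJID."""
    if "USUBJID" not in keys:
        raise ValueError("keys must include USUBJID for first.USUBJID to work")
    by_vars = " ".join(keys)
    seq = domain.upper() + "SEQ"
    return "\n".join([
        "proc sort data={} out=work._sorted;".format(source),
        "  by {};".format(by_vars),
        "run;",
        "",
        "data work.{};".format(domain.lower()),
        render_attrib(spec_rows),
        "  set work._sorted;",
        "  by {};".format(by_vars),
        "  if first.USUBJID then {} = 0;".format(seq),
        "  {} + 1;".format(seq),   # SAS sum statement: retained across rows
        "run;",
    ])
```

A possible call, using made-up spec rows for an AE domain:

```python
spec = [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "type": "Char", "length": 40},
    {"name": "AESEQ", "label": "Sequence Number", "type": "Num", "length": 8},
]
print(render_seq_step("AE", ["STUDYID", "USUBJID", "AETERM"], spec))
```

Placing ATTRIB before SET lets the spec, rather than the incoming dataset, determine variable labels and lengths in the output.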

Module dependencies

core

Python
pyyaml
pydantic

streamlit
fastapi
uvicorn

data

pandas
openpyxl
numpy

scikit-learn
networkx

testing

pytest
pytest-asyncio
httpx