
A pipeline that maps each raw variable to SDTM using a master knowledge base first, then deterministic CDISC rules, then a fuzzy fallback. It writes decisions back into the SDTM spec workbook, guards against phantom mappings at the write boundary, and generates production SAS programs.

What it adds

Pinnacle 21 validates a finished SDTM dataset; it does not build one. This pipeline works upstream of the conformance checker: it automates the raw-to-SDTM mapping decision itself and reuses prior decisions across studies.

How it works

Inputs
  • Raw variable catalog
  • SDTM spec workbook
  • Master knowledge base
Process
  1. Lookup: master, then rules, then fuzzy
  2. ML scorer (4 signals)
  3. Spec write-back
  4. Raw-presence gate
Outputs
  • Mapped SDTM spec
  • Generated SAS programs
  • Change-log (audit)

Typical layout

SDTM Master Mapping
  • Sidebar: Build master, Map study, Review / approve
  • Tabs: Map, Review, Compare, SAS Gene…
  • Main area: Proposed mappings + confidence, Signal breakdown per row, Generated SAS program

By the numbers

  • Version: v28.13
  • Test functions: 202
  • Scoring signals: 4
  • UI pages: 10

Data flow

Mapping a new study's raw variables to SDTM is repetitive and error-prone. Decisions made on past studies are not reliably reused, and a memory of an old study's mapping can wrongly produce a target that the new study never collected.

Input: study raw variable catalog + SDTM spec workbook + master KB
        |
        v
  Lookup Engine (lookup_engine.py)   1) master lookup  reuse prior decision
        |                            2) CDISC rules    deterministic SDTMIG
        v                            3) fuzzy fallback name/label similarity
  ML Scorer (ml_scorer.py)           GradientBoosting composite confidence
        |                            (name_exact, name_fuzzy, label_tfidf, master_freq)
        v
  Spec Writeback (spec_writeback.py)  writes decisions into the spec workbook
        |
        v
  Post-check (spec_postcheck.py)      raw-presence gate: drop / flag phantom targets
        |
        v
  SAS Generator (sas_generator.py) -->  read -> sort -> merge -> derive+attrib -> sort
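
The three-stage lookup at the top of the flow can be sketched as a simple cascade. This is a minimal illustration, not the code in lookup_engine.py: the `MASTER`, `RULES`, and `SDTM_TARGETS` tables, the `Mapping` tuple, and the 0.8 similarity threshold are all hypothetical, and `difflib` stands in for whatever fuzzy matcher the project actually uses.

```python
from difflib import SequenceMatcher
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    raw_var: str
    target: Optional[str]
    source: str  # "master", "rule", "fuzzy", or "unmapped"


# Hypothetical stand-ins for the master knowledge base and the rule table.
MASTER = {"SUBJID": "DM.SUBJID", "BRTHDT": "DM.BRTHDTC"}
RULES = {"SEX": "DM.SEX", "AGE": "DM.AGE"}
SDTM_TARGETS = ["DM.SUBJID", "DM.BRTHDTC", "DM.SEX", "DM.AGE", "DM.RACE"]


def fuzzy_match(raw_var: str, targets: list, threshold: float = 0.8) -> Optional[str]:
    """Return the most similar target variable, or None if below the threshold."""
    best, best_score = None, 0.0
    for target in targets:
        name = target.split(".", 1)[1]  # compare against the variable name only
        score = SequenceMatcher(None, raw_var.upper(), name).ratio()
        if score > best_score:
            best, best_score = target, score
    return best if best_score >= threshold else None


def lookup(raw_var: str) -> Mapping:
    """Try the master KB first, then deterministic rules, then the fuzzy fallback."""
    key = raw_var.upper()
    if key in MASTER:
        return Mapping(raw_var, MASTER[key], "master")
    if key in RULES:
        return Mapping(raw_var, RULES[key], "rule")
    hit = fuzzy_match(key, SDTM_TARGETS)
    if hit is not None:
        return Mapping(raw_var, hit, "fuzzy")
    return Mapping(raw_var, None, "unmapped")
```

Recording which stage produced each mapping (the `source` field here) keeps the weaker fuzzy hits easy to single out during review.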

Engineering trade-offs

Master-first, fuzzy-last decision order
Prior human decisions are the strongest signal; fuzzy name matching is the weakest, so it only runs when history and rules are silent.
GradientBoosting over a 4-signal composite
Combines exact match, fuzzy similarity, label TF-IDF, and master frequency into one calibrated confidence rather than a brittle rule cascade.
Raw-presence gate at the write boundary (v28.11+)
A master memory of an old study must not create a target the new study never collected; the gate drops or flags phantom rows before they reach a validated spec.
Required variables are flagged, never auto-dropped
A missing required source is a human decision, so the row is highlighted for resolution rather than silently removed.
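
The last two trade-offs can be illustrated together with a small gate function. This is a sketch under stated assumptions, not the code in spec_postcheck.py: the `SpecRow` fields, the `status` values, and the change-log entry shape are hypothetical, and it assumes that a phantom row is dropped when its variable is not required and flagged when it is.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpecRow:
    target: str        # e.g. "DM.RACE"
    source_var: str    # raw variable the mapping claims as its source
    required: bool     # hypothetical flag: is the SDTM variable required?
    status: str = "ok"


def raw_presence_gate(rows, raw_catalog):
    """Keep rows whose source exists in this study; drop or flag the rest."""
    present = {name.upper() for name in raw_catalog}
    kept, change_log = [], []
    for row in rows:
        if row.source_var.upper() in present:
            kept.append(row)
        elif row.required:
            # Required variables are never auto-dropped: flag for a human.
            kept.append(replace(row, status="flagged"))
            change_log.append({"target": row.target, "action": "flag"})
        else:
            # Phantom target: the source was never collected in this study.
            change_log.append({"target": row.target, "action": "drop"})
    return kept, change_log


rows = [
    SpecRow("DM.USUBJID", "SUBJID", required=True),
    SpecRow("DM.RACE", "RACE", required=False),
    SpecRow("DM.SEX", "SEX", required=True),
]
kept, change_log = raw_presence_gate(rows, raw_catalog=["subjid", "AGE"])
```

Returning the change log alongside the surviving rows means every drop or flag has a record that can be written to the audit sheet.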

At a glance

A quick visual read of the countable facts; full detail in the table.

  • Test functions: 202
  • UI pages: 10
  • Scoring signals: 4

Bar chart on a relative scale; values labelled; unit: count.

Processing characteristics

Metric            Value                        Notes
Scoring signals   4                            name_exact, name_fuzzy, label_tfidf, master_freq
Model             GradientBoostingClassifier   scikit-learn, used in the mapping scorer
Test functions    202                          Counted across the test suite
UI pages          10                           build/view master, map, review, compare, writeback, SAS, graph view, filter, help
SAS output        Multi-step                   read -> sort -> merge (IN= flags) -> derive + ATTRIB -> sort
Audit             Change-log sheet             Phantom drops and flags recorded for traceability
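
To make the four scoring signals concrete, here is one way they could be computed for a single (raw variable, candidate target) pair. This is a sketch, not the code in ml_scorer.py: the function names, the tokenizer, the smoothed IDF formula, and the definition of `master_freq` as the share of prior decisions that chose this target are all assumptions, and only the standard library is used.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def signals(raw_name, raw_label, target_name, target_label, label_corpus, master_counts):
    """Return the four signals as a dict; master_counts maps raw name -> Counter of targets."""
    vecs = tfidf_vectors([raw_label, target_label, *label_corpus])
    prior = master_counts.get(raw_name.upper(), Counter())
    total = sum(prior.values())
    return {
        "name_exact": 1.0 if raw_name.upper() == target_name.upper() else 0.0,
        "name_fuzzy": SequenceMatcher(None, raw_name.upper(), target_name.upper()).ratio(),
        "label_tfidf": cosine(vecs[0], vecs[1]),
        "master_freq": prior[target_name.upper()] / total if total else 0.0,
    }


# Hypothetical history: BRTHDT was mapped to BRTHDTC in 3 of 4 prior studies.
history = {"BRTHDT": Counter({"BRTHDTC": 3, "DMDTC": 1})}
corpus = ["Sex", "Age", "Race"]
row = signals("BRTHDT", "Date of Birth", "BRTHDTC", "Date/Time of Birth", corpus, history)
```

In a setup like the one the table describes, these four values would form one feature row for the GradientBoostingClassifier, whose predicted probability serves as the composite confidence.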

Functional wins

01  Reuses prior mapping decisions from a master knowledge base before falling back to rules or fuzzy matching, raising consistency across studies.
02  Blocks phantom mappings at the write boundary so an old study's memory cannot inject a target the current study never collected.
03  Generates production-grade SAS with type-aware assignment, --SEQ creation, baseline-flag and EPOCH templates, and an ATTRIB section from the spec.
04  Records every automated drop or flag in a change-log sheet so each decision is traceable.
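
As an illustration of item 03, an ATTRIB section can be rendered from spec rows with a few lines of Python. This is a minimal sketch, not the output of sas_generator.py: the spec row keys (`name`, `label`, `type`, `length`) and the `"Char"` type value are hypothetical, and only labels and lengths are emitted.

```python
def attrib_statement(spec_rows):
    """Render a SAS ATTRIB statement from spec rows (hypothetical row schema)."""
    lines = ["attrib"]
    for row in spec_rows:
        # Character lengths take a $ prefix in SAS; numeric lengths do not.
        length = f"${row['length']}" if row["type"] == "Char" else str(row["length"])
        # Double any embedded double quotes so the SAS string literal stays valid.
        label = row["label"].replace('"', '""')
        lines.append(f'    {row["name"]} label="{label}" length={length}')
    lines.append(";")
    return "\n".join(lines)


spec = [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "type": "Char", "length": 20},
    {"name": "AGE", "label": "Age", "type": "Num", "length": 8},
]
sas_attrib = attrib_statement(spec)
```

Driving the statement from the spec rows keeps the labels and lengths in the generated program in step with the spec workbook.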

Module dependencies

core
  • Python
  • pyyaml
  • pydantic
ui
  • streamlit
  • fastapi
  • uvicorn
data
  • pandas
  • openpyxl
  • numpy
ml
  • scikit-learn
  • networkx
testing
  • pytest
  • pytest-asyncio
  • httpx