
A platform that parses a Statistical Analysis Plan (SAP), plus an optional TFL shell and ADaM spec, extracts each output with its class, type, and population, runs QC on the result, and presents an analytics dashboard with copyable SAS pseudocode. It uses deterministic patterns rather than an LLM.

What it adds

No conformance tool reads a Statistical Analysis Plan. This extracts the programming specification from the SAP itself — the step before any dataset exists to be checked.

How it works

Inputs
  • SAP (DOCX / RTF / TXT)
  • TFL shell (optional)
  • ADaM spec (optional)
Process
  1. Document parsers
  2. Entity extraction
  3. Classification: class / type / population
  4. QC engine (readiness score)
Outputs
  • Structured output spec
  • SAS pseudocode
  • Excel workbook

Typical layout

CTI Platform
SIDEBAR
  • Upload SAP
  • Upload shell / …
  • Reset study
Outputs · Analytics · QC
  • Extracted outputs + filters
  • Class / type / population charts
  • Copyable SAS pseudocode

By the numbers

Outputs per study: 40-100+
Mock SAP fixtures: 6
LLMs used: 0
Version: v6

Data flow

A Statistical Analysis Plan describes dozens to hundreds of outputs in prose. Turning that into a structured programming specification by hand is slow, and details about output class, type, and population are easy to misread.

Input: Primary SAP (DOCX/RTF/TXT)
        + optional TFL shell (DOCX)  + optional ADaM spec (XLSX)
        |
        v
  Parsers (src/parsers)            read documents into text + structure
        |
        v
  Entity Extraction               identify each output + attributes
        |
        v
  Classification (utils/classification.py)
        |                          output class / type / population
        v
  Rule Engine + Normalization     dataset inference, ADaM variable registry
        |
        v
  QC Engine                       readiness score, category + severity
        |
        v
  Streamlit dashboard  -->  outputs, analytics, SAS pseudocode, Excel export
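
The entity extraction and classification steps in the flow above can be illustrated with a small pattern-based sketch. This is not the project's actual code: the regular expression, the `OutputEntity` dataclass and the `extract_outputs` function are hypothetical names, and the sketch assumes output headings look like "Table 14.1.1 Title" on a line of their own.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed heading shape: "<Table|Listing|Figure> <dotted number> <title>".
# Real SAPs vary; a production pattern set would need many more variants.
OUTPUT_RE = re.compile(
    r"^(?P<cls>Table|Listing|Figure)[ \t]+"
    r"(?P<num>\d+(?:\.\d+)*)[ \t:.\-]+"
    r"(?P<title>.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass(frozen=True)
class OutputEntity:
    """One planned output found in the SAP text (hypothetical structure)."""
    output_class: str  # "Table", "Listing" or "Figure"
    number: str        # e.g. "14.1.1"
    title: str


def extract_outputs(text: str) -> List[OutputEntity]:
    """Return every heading-like line that matches the assumed pattern."""
    return [
        OutputEntity(m["cls"].capitalize(), m["num"], m["title"])
        for m in OUTPUT_RE.finditer(text)
    ]
```

Because the same text always yields the same matches, the result can be diffed between runs and each extracted output can be traced back to the line that produced it.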

Engineering trade-offs

Deterministic patterns instead of an LLM
Runs offline in a regulated environment with reproducible, inspectable output and no external model dependency.
Optional TFL shell and ADaM spec inputs
The shell adds outputs the SAP prose omits; the ADaM spec sharpens dataset inference and adds variable tables — both optional so the SAP alone still works.
Population matcher that excludes PK/biomarker from ITT default
A v5 fix stopped PK and biomarker outputs from being wrongly defaulted to ITT, which was a meaningful specification error.
DOCX fixtures, including a 150-page SAP
Generated mock SAPs across phase 3, oncology TTE, crossover PK, safety extension and adaptive designs exercise the parser at realistic scale.
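
The population trade-off above can be sketched as follows. The keyword patterns, the `match_population` name and the choice to return `None` for unresolved PK/biomarker outputs are all assumptions made for illustration; the source only states that such outputs no longer fall back to ITT.

```python
import re
from typing import Optional

# Hypothetical explicit-mention patterns, checked in order.
POPULATION_PATTERNS = [
    ("Safety", re.compile(r"\bsafety\s+(?:population|analysis\s+set|set)\b", re.I)),
    ("Per-Protocol", re.compile(r"\bper[- ]protocol\b", re.I)),
    ("PK", re.compile(r"\b(?:PK|pharmacokinetic)\s+(?:population|analysis\s+set)\b", re.I)),
    ("ITT", re.compile(r"\bintent(?:ion)?[- ]to[- ]treat\b|\bITT\b", re.I)),
]

# Outputs of these kinds must not silently inherit the ITT default.
SPECIALTY_RE = re.compile(r"\b(?:PK|pharmacokinetic|biomarker)\b", re.I)


def match_population(title: str, default: str = "ITT") -> Optional[str]:
    """Pick a population for an output title.

    An explicit mention wins. Otherwise PK/biomarker outputs are left
    unresolved (None) for review, and everything else gets the default.
    """
    for name, pattern in POPULATION_PATTERNS:
        if pattern.search(title):
            return name
    if SPECIALTY_RE.search(title):
        return None
    return default
```

For example, `match_population("Primary Efficacy Analysis")` falls through to the ITT default, while `match_population("Plasma PK Concentrations by Visit")` returns `None` so that a reviewer, or a later rule, assigns the population explicitly.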

At a glance

A quick visual read of the countable facts; full detail in the table.

Typical outputs: 100
Mock SAP fixtures: 6
Output filters: 4

Relative scale · values labelled · unit: count

Processing characteristics

Metric             Value                       Notes
Inputs             SAP + shell + ADaM spec     DOCX/RTF/TXT, DOCX, XLSX
Outputs per study  40-100+                     Shell structure adds outputs beyond the SAP prose
Extraction method  Pattern-based               No LLM, no internet required
QC                 Readiness score             Filterable by category and severity
Test fixtures      5 mock SAPs + 150-page SAP  Phase 3, oncology TTE, crossover PK, safety, adaptive
Output             SAS pseudocode + Excel      Copyable st.code SAS blocks; Excel workbook
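
The source does not give the readiness formula, so the following is only one plausible way a QC score with category and severity filters could work. The severity weights, the `Finding` structure and both function names are assumptions, not the platform's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed penalty weights per severity; the real weighting is not documented.
SEVERITY_PENALTY = {"high": 10, "medium": 4, "low": 1}


@dataclass(frozen=True)
class Finding:
    """One QC finding against an extracted output (hypothetical structure)."""
    output_id: str
    category: str  # e.g. "population", "dataset"
    severity: str  # "high", "medium" or "low"
    message: str


def readiness_score(findings: Iterable[Finding], n_outputs: int) -> float:
    """Score from 0 to 100: 100 means no findings; 0 means at least
    one high-severity finding per output on average."""
    if n_outputs <= 0:
        return 0.0
    penalty = sum(SEVERITY_PENALTY[f.severity] for f in findings)
    max_penalty = n_outputs * SEVERITY_PENALTY["high"]
    return round(max(0.0, 100.0 * (1 - penalty / max_penalty)), 1)


def filter_findings(
    findings: Iterable[Finding],
    category: Optional[str] = None,
    severity: Optional[str] = None,
) -> List[Finding]:
    """Keep findings matching the given category and/or severity."""
    return [
        f for f in findings
        if (category is None or f.category == category)
        and (severity is None or f.severity == severity)
    ]
```

Under these assumed weights, a study with four outputs, one high-severity finding and one low-severity finding would score 72.5.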

Functional wins

01 Turns SAP prose into a structured output specification with class, type and population per output.
02 Runs fully offline with deterministic patterns, so results are reproducible and inspectable with no LLM dependency.
03 Combines SAP, TFL shell and ADaM spec to infer datasets and surface an ADaM variable registry.
04 Scores the extracted spec for readiness and emits copyable SAS pseudocode plus an Excel workbook.
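
As an illustration of the SAS pseudocode step, a template-based generator might look like the sketch below. The `sas_pseudocode` function, its parameters and the `adam` libref are hypothetical; the `'Y'` population-flag filter follows the common ADaM convention (for example `SAFFL`), and the emitted text is a starting skeleton rather than runnable analysis code.

```python
from typing import Optional


def sas_pseudocode(
    number: str,
    title: str,
    dataset: str,
    population_flag: Optional[str] = None,
) -> str:
    """Build a SAS skeleton for one output from its extracted attributes."""
    lines = [
        f"/* {number}: {title} */",
        "data work.analysis;",
        f"  set adam.{dataset.lower()};",  # 'adam' libref is an assumption
    ]
    if population_flag:
        # ADaM population flags are conventionally 'Y' for included subjects.
        lines.append(f"  where {population_flag} = 'Y';")
    lines += [
        "run;",
        "",
        "/* TODO: summary statistics and report layout */",
    ]
    return "\n".join(lines)
```

In a Streamlit page the returned string could be passed to `st.code`, which renders a code block with a copy button; the dashboard's actual rendering code is not shown in the source.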

Module dependencies

core
  • Python
ui
  • streamlit
data
  • pandas
  • python-docx
  • openpyxl
testing
  • pytest