
A toolkit that ingests the protocol, SAP, SDTM data and spec, and TLF shells, then builds an interactive knowledge graph linking each ADaM variable to its source, rule, and SAP evidence — retrievable with BM25 search, no LLM or internet required.

What it adds

Conformance tools check outputs; they do not trace lineage. This links every ADaM variable to its SDTM source, derivation rule and SAP evidence — the traceability a checker assumes already exists.

How it works

Inputs
  • Protocol PDF
  • SAP DOCX
  • SDTM (XPT / SAS7BDAT)
  • Spec + shells (optional)
Process
  1. Per-type ingestion
  2. Chunk to NER to relations
  3. Knowledge graph build
  4. BM25 retrieval (Top-K)
Outputs
  • ADaM variable lineage
  • Evidence passages
  • Interactive graph (pyvis)
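
The "chunk to NER to relations" step can be pictured with a small pattern-based sketch. The regex, the sentence shape it expects, and the `extract_relations` helper are all assumptions made for illustration; the toolkit's actual patterns are not shown on this page, and real SAP wording varies far more than this.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: "<ADaM var> is derived from <DOMAIN>.<VAR>".
# Illustrative only; not taken from the toolkit's extraction module.
DERIVED_FROM = re.compile(
    r"\b(?P<adam>[A-Z][A-Z0-9]{1,7})\s+(?:is|are)\s+derived\s+from\s+"
    r"(?P<domain>[A-Z]{2})\.(?P<var>[A-Z][A-Z0-9]{1,7})\b"
)


def extract_relations(chunk: str) -> List[Tuple[str, str, str]]:
    """Return (adam_variable, relation, sdtm_source) triples found in one chunk."""
    return [
        (m.group("adam"), "derived_from", f"{m.group('domain')}.{m.group('var')}")
        for m in DERIVED_FROM.finditer(chunk)
    ]


# Made-up sentence in the style of a SAP derivation note.
sample = "AGEGR1 is derived from DM.AGE using the cut points in Section 9.2."
relations = extract_relations(sample)
print(relations)
```

Each triple would become one edge in the graph, with the originating chunk kept as the evidence passage.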

Typical layout

ADaM Knowledge Graph
  • Sidebar: Upload documents, Chunk size, Top-K hits
  • Graph, Lineage, Query
  • Variable lineage graph, Retrieved SAP evidence, Selected node detail

By the numbers

Version: v3.1
Default chunk size: 1200
Top-K hits: 8
LLMs used: 0

Data flow

When programming ADaM, the link between an ADaM variable, its SDTM source, the derivation rule, and the SAP text that justifies it is scattered across documents. Tracing one variable to its evidence is slow manual cross-referencing.

Input: Protocol PDF + SAP DOCX + SDTM (XPT/SAS7BDAT)
        + optional Excel spec + TLF shells
        |
        v
  Ingestion (ingestion/)          protocol, sap, sdtm_data, sdtm_spec, shell
        |
        v
  Extraction (extraction/)        chunking -> NER -> patterns -> relations
        |
        v
  Knowledge Graph (graph/builder.py, kg.py)
        |                          nodes: ADaM var, SDTM source, rule, evidence
        v
  Retrieval (graph/search.py, rag.py)   BM25 over chunks, Top-K
        |
        v
  Visualisation (viz/pyvis_viz.py) + Streamlit app
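
The graph stage above can be sketched as a small typed graph. This is a minimal stand-in written with plain dictionaries rather than NetworkX, and the `LineageGraph` class, the node identifiers, and the relation names (`derived_from`, `uses_rule`, `supported_by`) are hypothetical; only the four node kinds come from the diagram.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class LineageGraph:
    """Tiny directed graph with typed nodes and labelled edges (illustrative)."""

    def __init__(self) -> None:
        self.node_type: Dict[str, str] = {}
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_node(self, node_id: str, kind: str) -> None:
        self.node_type[node_id] = kind

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].append((relation, dst))

    def lineage(self, adam_var: str) -> Dict[str, List[str]]:
        """Group the direct neighbours of an ADaM variable by node type."""
        grouped: Dict[str, List[str]] = defaultdict(list)
        for _relation, dst in self.edges.get(adam_var, []):
            grouped[self.node_type[dst]].append(dst)
        return dict(grouped)


# Hypothetical nodes for one variable; identifiers are made up.
g = LineageGraph()
g.add_node("ADSL.AGEGR1", "adam_var")
g.add_node("DM.AGE", "sdtm_source")
g.add_node("rule:AGEGR1", "rule")
g.add_node("sap:chunk-17", "evidence")
g.add_edge("ADSL.AGEGR1", "derived_from", "DM.AGE")
g.add_edge("ADSL.AGEGR1", "uses_rule", "rule:AGEGR1")
g.add_edge("ADSL.AGEGR1", "supported_by", "sap:chunk-17")

result = g.lineage("ADSL.AGEGR1")
print(result)
```

Keeping the node kind on every node is what lets a single lookup answer "where does this variable come from, by what rule, and on what evidence" in one pass.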

Engineering trade-offs

Graph + BM25 retrieval instead of an LLM
Deterministic, offline, and inspectable — every link traces to a real document passage rather than a generated guess.
Explicit ingestion modules per document type
Protocol, SAP, SDTM data, spec and shells each parse differently; separate ingesters keep each robust.
Chunk size and Top-K exposed to the user
Retrieval quality depends on document style; letting the user tune chunking and hit count adapts it per study.
pyvis interactive visualisation
An ADaM variable's lineage is easier to trust when you can see and explore the graph, not just read a table.
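
To make the chunk-size and Top-K trade-off concrete, here is a minimal pure-Python sketch of BM25 ranking over fixed-size chunks. It is a stand-in for the rank-bm25 package, not the toolkit's code: the function names are hypothetical, the chunk size is assumed to be measured in characters, the IDF is the non-negative Lucene-style variant, and `k1` and `b` are common textbook defaults.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 1200) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query: str, chunks: List[str], top_k: int = 8,
               k1: float = 1.5, b: float = 0.75) -> List[Tuple[int, float]]:
    """Score every chunk against the query; return the top_k (index, score) pairs."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / max(n_docs, 1)
    # Document frequency: number of chunks containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((idx, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Made-up SAP-style passages, already chunk-sized.
chunks = [
    "Age group AGEGR1 is derived from AGE using cut points at 18 and 65.",
    "Treatment emergent adverse events are flagged with TRTEMFL.",
    "The safety population includes all subjects who received study drug.",
]
hits = bm25_top_k("AGEGR1 derived from AGE", chunks, top_k=2)
pieces = chunk_text("a" * 2500, chunk_size=1200)
```

Smaller chunks give tighter evidence passages but can split a derivation rule across two chunks; a larger Top-K compensates at the cost of more passages to read, which is why both are worth tuning per study.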

At a glance

A quick visual read of the countable facts; full detail in the table.

Top-K hits: 8
Doc types ingested: 5
Optional inputs: 2

Processing characteristics

Metric                 Value                              Notes
Inputs                 Protocol, SAP, SDTM, spec, shells  PDF, DOCX, XPT/SAS7BDAT, XLSX
Retrieval              BM25 (rank-bm25)                   Top-K over document chunks
Graph                  NetworkX + pyvis                   ADaM variable to source/rule/evidence
Default chunk / Top-K  1200 / 8                           User-adjustable in the sidebar
Build time             10-60 s typical                    Stated in HOW_TO_RUN
LLM                    None                               No model, no internet required

Functional wins

01 Links every ADaM variable to its SDTM source, derivation rule, and SAP evidence in one interactive graph.
02 Retrieves the supporting SAP passages for any derivation with BM25, so each link is backed by real document text.
03 Ingests protocol, SAP, SDTM data and spec, and TLF shells through dedicated parsers per document type.
04 Runs fully offline with no LLM, keeping the lineage deterministic and inspectable.

Module dependencies

core
  • Python 3.9+
  • Jinja2
  • chardet
ui
  • streamlit
  • pyvis
data
  • pandas
  • numpy
  • pdfplumber
  • PyPDF2
  • python-docx
  • openpyxl
  • pyreadstat
ml
  • rank-bm25
  • networkx
  • scipy