ADaM Knowledge Graph — Nilesh Borade

A toolkit that ingests the protocol, SAP, SDTM data and spec, and TLF shells, then builds an interactive knowledge graph linking each ADaM variable to its source, rule, and SAP evidence. Evidence is retrieved with BM25 search; no LLM or internet connection is required.

What it adds

Conformance tools check outputs; they do not trace lineage. This toolkit links every ADaM variable to its SDTM source, derivation rule, and SAP evidence: the traceability a checker assumes already exists.

How it works

Inputs: Protocol PDF, SAP DOCX, SDTM (XPT / SAS7BDAT), Spec + shells (optional)

Process: 1. Per-type ingestion, 2. Chunk to NER to relations, 3. Knowledge graph build, 4. BM25 retrieval (Top-K)

Outputs: ADaM variable lineage, Evidence passages, Interactive graph (pyvis)

By the numbers

Version: v3.1

Default chunk size: 1200

Top-K hits: 8

LLMs used: none

Screenshots

The sidebar after protocol, SAP and SDTM files are uploaded, with chunk size and Top-K controls and the Build Knowledge Graph button

The graph view showing one ADaM variable linked to its SDTM source, derivation rule and SAP evidence node

A query result showing the retrieved SAP passages (BM25) that justify a selected derivation

Data flow

In ADaM programming, the links between an ADaM variable, its SDTM source, its derivation rule, and the SAP text that justifies it are scattered across documents, so tracing one variable to its evidence means slow manual cross-referencing.

Input: Protocol PDF + SAP DOCX + SDTM (XPT/SAS7BDAT)
        + optional Excel spec + TLF shells
        |
        v
  Ingestion (ingestion/)          protocol, sap, sdtm_data, sdtm_spec, shell
        |
        v
  Extraction (extraction/)        chunking -> NER -> patterns -> relations
        |
        v
  Knowledge Graph (graph/builder.py, kg.py)
        |                          nodes: ADaM var, SDTM source, rule, evidence
        v
  Retrieval (graph/search.py, rag.py)   BM25 over chunks, Top-K
        |
        v
  Visualisation (viz/pyvis_viz.py) + Streamlit app
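
The node types in the flow above (ADaM variable, SDTM source, rule, evidence) can be pictured with a small stdlib-only sketch. The project itself builds its graph with NetworkX; the `LineageGraph` class, the relation names, and the example variable and chunk identifiers below are illustrative assumptions, not the toolkit's actual schema.

```python
from collections import defaultdict


class LineageGraph:
    """Tiny directed graph with typed nodes and labelled edges (illustrative only)."""

    def __init__(self):
        self.node_kind = {}             # node id -> kind, e.g. "adam_var"
        self.edges = defaultdict(list)  # source id -> [(relation, target id)]

    def add_node(self, node_id, kind):
        self.node_kind[node_id] = kind

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def lineage(self, adam_var):
        """Return {relation: [targets]} for one ADaM variable."""
        out = defaultdict(list)
        for relation, target in self.edges.get(adam_var, []):
            out[relation].append(target)
        return dict(out)


# Hypothetical example: one ADaM variable and its three kinds of neighbours.
g = LineageGraph()
g.add_node("ADSL.TRTSDT", "adam_var")
g.add_node("EX.EXSTDTC", "sdtm_source")
g.add_node("rule:first_dose_date", "rule")
g.add_node("sap:chunk_12", "evidence")
g.add_edge("ADSL.TRTSDT", "derived_from", "EX.EXSTDTC")
g.add_edge("ADSL.TRTSDT", "uses_rule", "rule:first_dose_date")
g.add_edge("ADSL.TRTSDT", "supported_by", "sap:chunk_12")

print(g.lineage("ADSL.TRTSDT"))
```

Keeping the relation label on each edge is what lets a single lookup return source, rule, and evidence together.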

Engineering trade-offs

Graph + BM25 retrieval instead of an LLM

Deterministic, offline, and inspectable — every link traces to a real document passage rather than a generated guess.
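
To make the "deterministic and inspectable" point concrete, here is a minimal stdlib-only sketch of Okapi BM25 ranking over chunks. The project uses the rank-bm25 package, whose IDF handling may differ in detail; the function names, the tokenizer, and the `k1` and `b` values here are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query, chunks, k=8, k1=1.5, b=0.75):
    """Rank chunks against the query with Okapi BM25; return [(score, chunk_index)]."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))
    # Highest score first; ties broken by chunk order so results are reproducible.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:k]


chunks = [
    "Treatment start date is the date of first exposure to study drug.",
    "Adverse events are coded using MedDRA.",
    "Baseline is the last non-missing value before first dose.",
]
hits = bm25_top_k("treatment start date", chunks, k=2)
```

The same query over the same chunks always yields the same ranking, and each hit is an index into real document text.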

Explicit ingestion modules per document type

Protocol, SAP, SDTM data, spec and shells each parse differently; separate ingesters keep each robust.
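
One way to picture per-type ingestion is a dispatch table keyed on file extension. The sketch below is an assumption about how such routing could look, not the project's code: the mapping and `pick_ingester` are hypothetical, the ingester names simply echo the module names listed under ingestion/, and TLF shells are left out because the source does not state their file format.

```python
from pathlib import Path

# Hypothetical routing table based on the file formats named in this document.
INGESTER_BY_SUFFIX = {
    ".pdf": "protocol",
    ".docx": "sap",
    ".xpt": "sdtm_data",
    ".sas7bdat": "sdtm_data",
    ".xlsx": "sdtm_spec",
}


def pick_ingester(filename):
    """Return the ingester name for a file, or raise if the type is unknown."""
    suffix = Path(filename).suffix.lower()
    try:
        return INGESTER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No ingester registered for {suffix!r}") from None


print(pick_ingester("study_protocol.PDF"))
print(pick_ingester("dm.sas7bdat"))
```

Failing loudly on an unknown extension keeps a mis-uploaded file from being silently parsed by the wrong ingester.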

Chunk size and Top-K exposed to the user

Retrieval quality depends on document style; letting the user tune chunking and hit count adapts it per study.
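
As a rough illustration of what the chunk-size control changes, the sketch below splits text into fixed-size pieces. It assumes the default of 1200 is measured in characters, which the source does not state; the `overlap` parameter and the `chunk_text` name are hypothetical additions.

```python
def chunk_text(text, chunk_size=1200, overlap=0):
    """Split text into chunks of at most chunk_size characters.

    overlap repeats the tail of each chunk at the start of the next one,
    which can help when a relevant sentence straddles a boundary.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "x" * 3000
default_chunks = chunk_text(sample)             # default size, no overlap
smaller_chunks = chunk_text(sample, 500)        # finer-grained chunks
overlapping = chunk_text(sample, 1200, 200)     # chunks share 200 characters
print(len(default_chunks), len(smaller_chunks), len(overlapping))
```

Smaller chunks give BM25 more, narrower passages to rank; larger chunks keep more context per hit, so the better setting depends on how the SAP and protocol are written.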

pyvis interactive visualisation

An ADaM variable's lineage is easier to trust when you can see and explore the graph, not just read a table.

At a glance

Countable facts at a glance; full detail is in the table below.

Top-K hits: 8

Doc types ingested: 5

Optional inputs: 2

Processing characteristics

Metric	Value	Notes
Inputs	Protocol, SAP, SDTM, spec, shells	PDF, DOCX, XPT/SAS7BDAT, XLSX
Retrieval	BM25 (rank-bm25)	Top-K over document chunks
Graph	NetworkX + pyvis	ADaM variable to source/rule/evidence
Default chunk / Top-K	1200 / 8	User-adjustable in the sidebar
Build time	10-60s typical	Stated in HOW_TO_RUN
LLM	None	No model, no internet required

Functional wins

01. Links every ADaM variable to its SDTM source, derivation rule, and SAP evidence in one interactive graph.

02. Retrieves the supporting SAP passages for any derivation with BM25, so each link is backed by real document text.

03. Ingests protocol, SAP, SDTM data and spec, and TLF shells through dedicated parsers per document type.

04. Runs fully offline with no LLM, keeping the lineage deterministic and inspectable.

Module dependencies

core

Python 3.9+
Jinja2
chardet

streamlit
pyvis

data

pandas
numpy
pdfplumber
PyPDF2
python-docx
openpyxl
pyreadstat

rank-bm25
networkx
scipy