Weber Lab

Visualizing healthcare system dynamics in biomedical big data

Welcome to the website for "Visualizing healthcare system dynamics in biomedical big data", an NIH-funded project in the Weber Lab in the Department of Biomedical Informatics at Harvard Medical School. Biomedical Big Data, such as electronic health records (EHR) and administrative claims, are records of patients' interactions with the healthcare system; for example, the date of a diagnosis is when a physician entered the code into the EHR, not when the patient developed the disease. Most researchers are either unaware of the distinction or naively treat it as noise. However, the proposed research will show, using a novel Data Visualization, that these subtle effects of the healthcare system on observational clinical data actually contain valuable information that could benefit biomedical research, clinical care, and health care policy.

Project dates: Jun 1, 2015 - May 31, 2019

Funding details: NIH/NCI grants 5U01CA198934-02 and 3U01CA198934-02S1

For more information: Please contact Griffin M Weber, MD, PhD


Electronic health records (EHR) and administrative claims databases are transforming medical research by giving investigators access to data on millions of individual patients. Compared to manual paper chart review, these databases reduce the time and cost of clinical studies by orders of magnitude, enabling types of research that were unfeasible in the past. However, investigators often incorrectly treat EHR and claims data as simply big versions of clinical trials data. Yet, there are important differences: During clinical trials, patient information is obtained and recorded in a standardized way and checked for accuracy and completeness. In contrast, EHR and claims are observational databases, which reflect not only the health of the patients, but also their interactions with the healthcare system. For example, the date associated with a code for diabetes is when the physician made the diagnosis, not when the patient first developed the disease. These observations are influenced by the dynamics of the healthcare system--when physicians schedule visits with their patients, which tests physicians decide to order, what codes need to be recorded to get reimbursed for procedures, etc. By ignoring this dimension of the data or naively treating it as noise, investigators risk both misinterpreting the true patient pathophysiology and losing valuable information content. In prior work we showed that analysis of the "healthcare system dynamics" (HSD) dimension of observational databases can actually be more useful than the patient pathophysiology in predicting survival, selecting matched control cohorts, identifying healthy patients, and defining normal ranges of laboratory tests. Yet, conveying the concept of HSD to researchers and helping them use it effectively is difficult. 
Therefore, focusing on the topic area of Data Visualization, this proposal addresses this challenge of separating healthcare system dynamics from pathophysiology in observational databases, so that Big Data researchers can use both dimensions to generate new knowledge about patient health. To do this, we bring together informatics and data visualization experts who developed two widely adopted open source software platforms for querying clinical data repositories (Informatics for Integrating Biology and the Bedside, i2b2) and developing modular data analysis and visualization tools (Science of Science, Sci2). We will leverage these systems to perform three Specific Aims: (1) Create an extensible ontology for visualizing the HSD dimensions of biomedical Big Data. (2) Develop a prototype interactive visualization to enable investigators to study HSD in Big Data. The visualization will be simple and familiar to investigators, but innovative in that for the first time HSD will be treated as its own informative component of the data. By literally placing HSD on its own dimension, the visualization will show investigators its value and teach them how to use it for research. (3) Demonstrate and evaluate the visualizations using three sources of biomedical Big Data: EHR data from two hospital systems in Boston with a total of 7 million patients and nationwide claims data from Aetna health insurance with 34 million patients.


  1. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018 Apr 30;361:k1479. doi: 10.1136/bmj.k1479. PubMed PMID: 29712648; PubMed Central PMCID: PMC5925441.
  2. Yu W, Weber GM. HyperMinHash: MinHash in LogLog space. March 28, 2018. arXiv:1710.08436.


HSD Extensions for i2b2

HSD Concepts:

The script below adds HSD concepts to i2b2 version 1.7 or higher. The first part of the script should be run on the i2b2 clinical research chart (CRC) cell, and the second part should be run on the i2b2 ontology (ONT) cell. The script requires Microsoft SQL Server 2012 (or newer).

The following HSD concepts are included in the database script:

Download database script: HSD_Extensions_For_i2b2.txt

HSD Visualizations:

The following zip file contains i2b2 web client plugins for an HSD "fact count" visualization and an HSD laboratory test visualization.

Download web client plugins:

HSD Demonstration Website:

Launch an i2b2 instance with the HSD ontology and visualizations in a new browser window/tab. Use the default username and password. The underlying data is the demo dataset that comes with the i2b2 software, available at

Beta Software:

The items below illustrate our newest software tools. However, they are still under development, so they may be unavailable at times and might differ from the final software.

Real-Time Queries of Large i2b2 Databases

A challenge in creating interactive data visualizations of population-wide health data for millions of patients is querying the underlying clinical databases in real time. We leveraged "streaming algorithms", which compress big data into small probabilistic data structures called "sketches", to gain orders of magnitude in query performance compared to conventional methods. An implementation of the "HyperMinHash" streaming algorithm that we developed for this project is available on GitHub.
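To make the idea of a "sketch" concrete, the following is a minimal illustration of plain MinHash, the classic technique that HyperMinHash builds on. It is not the HyperMinHash implementation from this project. The function names, the use of BLAKE2b as the hash function, the sketch size of 128, and the patient identifiers are all assumptions chosen for the example. Each patient set is reduced to a fixed-length list of minimum hash values, and the overlap between two sets is then estimated from the sketches alone, without revisiting the underlying records.

```python
import hashlib


def _hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of a string under a given seed (assumed hash choice)."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, num_hashes=128):
    """Summarize a non-empty set as the minimum hash value under each seeded hash."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of sketch positions that agree."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


# Hypothetical patient cohorts: 1,000 shared patients out of 2,000 in the union,
# so the true Jaccard similarity is 0.5.
cohort_a = {f"patient-{i}" for i in range(0, 1500)}
cohort_b = {f"patient-{i}" for i in range(500, 2000)}

sketch_a = minhash_sketch(cohort_a)
sketch_b = minhash_sketch(cohort_b)

# The estimate should land near 0.5; accuracy improves as num_hashes grows.
print(estimate_jaccard(sketch_a, sketch_b))
```

Each sketch here holds 128 integers regardless of how many patients are in the cohort, which is why comparing sketches can be much cheaper than comparing the full sets.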

SumTree Visualization of i2b2 Ontologies

i2b2 is a software program used by biomedical researchers to search large clinical databases for patients based on demographics, diagnoses, laboratory tests, and other medical concepts. Collectively, these concepts are called the i2b2 "ontology". Users select one or more concepts from the i2b2 ontology to include in their queries. Hospitals and other healthcare institutions that install i2b2 typically include standard medical terminologies, such as "ICD-10" diagnoses and "RxNorm" medications, in their i2b2 ontology. These contain hundreds of thousands of concepts, and they change regularly as new concepts are added and others are removed. This makes it difficult for i2b2 administrators to keep the software up-to-date and for researchers to understand what types of clinical data are available.

The i2b2 software currently displays the list of searchable concepts in a hierarchical tree-based visualization, which enables users to expand a parent concept (e.g., "demographics") to view its more specific child concepts (e.g., "age" and "ethnicity"). However, because there are so many concepts, the entire expanded i2b2 ontology tree is far too large to show on a computer screen. As a result, users can only see a tiny portion of the ontology. To address this problem, we created a new "SumTree" visualization, which extends the traditional tree view by graphically showing next to each concept summary statistics about the full subtree under that concept, such as the number of levels deep it goes. With SumTrees, users can simultaneously see both the overall structure of an entire ontology and details of individual concepts. This SumTree demo visualizes a public i2b2 ontology created for a project called SCILHS, which was funded by the Patient-Centered Outcomes Research Institute (PCORI). The ontology contains about 500,000 concepts under the root concept "PCORI" and is 11 levels deep.

Circles in SumTree represent individual concepts. Users can double-click a circle to view its child concepts, as in a traditional tree view. However, to the right of each collapsed (not-expanded) circle is a rectangle that summarizes its child concepts in a condensed format. To the right of that is another rectangle that summarizes the children of those child concepts (two levels deep). Additional rectangles are drawn until there are no further child concepts. The total number of rectangles equals the depth of the subtree under the circle, with each rectangle summarizing the concepts at a different level.
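The per-level summary described above can be sketched as a breadth-first walk over the subtree. This is only an illustration, not the SumTree source code: the `level_counts` function, the dictionary representation of the ontology, and the concepts listed under "Age" and "Ethnicity" are hypothetical. It assumes the ontology is a proper tree with no cycles.

```python
def level_counts(children, root):
    """Return [n1, n2, ...], where n_k is the number of concepts k levels below root.

    `children` maps a concept name to the list of its child concept names.
    The length of the result is the depth of the subtree under `root`,
    which corresponds to the number of rectangles drawn next to its circle.
    """
    counts = []
    frontier = children.get(root, [])
    while frontier:
        counts.append(len(frontier))
        # Move one level down: gather the children of every concept in this level.
        frontier = [child for node in frontier for child in children.get(node, [])]
    return counts


# Hypothetical toy ontology (the grandchild concepts are invented for illustration).
toy_ontology = {
    "Demographics": ["Age", "Ethnicity"],
    "Age": ["0-17 years", "18-64 years", "65+ years"],
    "Ethnicity": ["Hispanic", "Not Hispanic"],
}

print(level_counts(toy_ontology, "Demographics"))  # prints [2, 5]
print(level_counts(toy_ontology, "Age"))           # prints [3]
print(level_counts(toy_ontology, "Hispanic"))      # prints []
```

A leaf concept yields an empty list, matching the case where no rectangles are drawn because there are no further child concepts.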

In the SumTree Legend, users select what they want the size, color, and opacity of the rectangles to represent. For example, the size can be either proportional to the number of concepts at the rectangle's level or proportional to the cumulative number of concepts at or below the rectangle's level. The count is displayed as a number above the rectangle. The color and opacity can indicate one of five concept attributes: (1) "Type" - whether the concept can be further expanded ("container" or "folder"); (2) "Status" - whether the concept is hidden by default or no longer available ("inactive"); (3) "Table" - the database table the concept references; (4) "Synonym" - whether the concept has the same meaning as another concept; and (5) "Metadata" - whether the concept has an associated value, such as a laboratory test which is expected to have a result. A rectangle can have multiple colors and opacities if it represents different kinds of concepts.
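As a rough illustration of the two sizing options, the sketch below takes a list of concept counts per level (nearest level first) and returns the value each rectangle's size would be proportional to. The `rectangle_sizes` function and the sample counts are hypothetical and are not taken from the SumTree source code.

```python
def rectangle_sizes(per_level, cumulative=False):
    """Return the count that each rectangle's size is proportional to.

    per_level[k] is the number of concepts at the k-th level below a concept.
    With cumulative=False, each rectangle reflects only its own level.
    With cumulative=True, each rectangle reflects its own level plus all deeper levels.
    """
    if not cumulative:
        return list(per_level)
    sizes = []
    running = 0
    # Accumulate from the deepest level upward, then restore the original order.
    for count in reversed(per_level):
        running += count
        sizes.append(running)
    sizes.reverse()
    return sizes


# Hypothetical subtree with 2 children, 5 grandchildren, and 3 great-grandchildren.
counts = [2, 5, 3]
print(rectangle_sizes(counts))                   # prints [2, 5, 3]
print(rectangle_sizes(counts, cumulative=True))  # prints [10, 8, 3]
```

In the cumulative mode the sizes never increase from left to right, so the first rectangle always reflects the total number of concepts under the circle.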

The source code for the SumTree visualization is available on GitHub.


Consistent with our Data Sharing Policy, we are pleased to share datasets that we have been using to develop and test our software.

Project Team