Weber Lab

Biases introduced by filtering electronic health records (EHRs) for patients with "complete data"

Welcome to the website for "Biases introduced by filtering electronic health records (EHRs) for patients with 'complete data'", an NIH-funded project in the Weber Lab in the Department of Biomedical Informatics at Harvard Medical School. Nationwide adoption of electronic health records (EHRs) has led to the increasing availability of large clinical datasets. However, because the same patient could be treated at multiple health care institutions, data from only a single EHR might not contain the complete medical history for that patient, with critical events potentially missing. This study identifies biases that are introduced by selecting patients with fewer gaps in their record.

Project dates: September 4, 2020 - August 31, 2024

Funding details: NIH/NLM grant R01LM013345

For more information: Please contact Griffin M Weber, MD, PhD

Abstract

Nationwide adoption of electronic health records (EHRs) has led to the increasing availability of large clinical datasets. With statistical modeling and machine learning, these datasets have been used in a wide range of applications, including diagnosis, decision support, cost reduction, and personalized medicine. However, because the same patient could be treated at multiple health care institutions, data from only a single EHR might not contain the complete medical history for that patient, with critical events potentially missing. A common approach to addressing this problem is to apply data checks that filter the EHR for patients whose data appear to be more "complete". Examples of filters include requiring at least one visit per year or ensuring that age, sex, and race are all recorded. However, in a previous study using EHR data from seven institutions, we showed that these filters can greatly reduce the sample size and introduce unexpected biases by selecting sicker patients who seek care more often and changing the demographics of the resulting cohorts.
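To make the two example filters concrete, here is a minimal Python sketch. It is an illustration only, not the project's code: the `Patient` record layout, the function names, and the four synthetic patients are all assumptions. It applies "at least one visit per year" and "age, sex, and race all recorded" to a toy cohort and compares cohort size and mean visit count before and after filtering.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Patient:
    # Hypothetical record layout, invented for this illustration.
    patient_id: str
    age: Optional[int]
    sex: Optional[str]
    race: Optional[str]
    visit_years: List[int] = field(default_factory=list)  # one entry per visit


def has_visit_every_year(p: Patient, first_year: int, last_year: int) -> bool:
    """True if the patient has at least one visit in every year of the window."""
    seen = set(p.visit_years)
    return all(year in seen for year in range(first_year, last_year + 1))


def has_demographics(p: Patient) -> bool:
    """True if age, sex, and race are all recorded."""
    return all(value is not None for value in (p.age, p.sex, p.race))


def summarize(cohort: List[Patient]) -> dict:
    """Cohort size and mean number of visits per patient."""
    n = len(cohort)
    if n == 0:
        return {"n": 0, "mean_visits": None}
    return {"n": n, "mean_visits": sum(len(p.visit_years) for p in cohort) / n}


# Synthetic patients, not real data.
patients = [
    Patient("a", 54, "F", "white", [2018, 2019, 2019, 2020, 2020, 2020]),
    Patient("b", 31, "M", None, [2018, 2019, 2020]),   # race not recorded
    Patient("c", 47, "F", "black", [2019]),            # no visits in 2018 or 2020
    Patient("d", 62, "M", "asian", [2018, 2020]),      # no visit in 2019
]

filtered = [
    p for p in patients
    if has_visit_every_year(p, 2018, 2020) and has_demographics(p)
]

before = summarize(patients)
after = summarize(filtered)
print(before)
print(after)
```

In this toy cohort the filters keep one of four patients, and the retained patient has twice the average number of visits of the original cohort, which is the kind of shift toward frequent users of care described above.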

This project extends this prior research by implementing an expanded set of data completeness filters and testing their accuracy and potential biases using a combination of national claims data and EHR data from dozens of hospitals and healthcare centers across the country. This will enable us to understand how data completeness varies in different EHRs and quantify the tradeoffs of different approaches to correcting for gaps in patients' records. First, we will develop and measure the accuracy of data completeness filters using national claims data. This provides a "gold standard" of longitudinal data where patients' complete medical histories are known during the periods in which they were enrolled in the insurance plan. After partitioning the data by provider groups to model gaps in EHR data, we will test how well data completeness filters, individually and in combined machine learning models, select patients with fewer gaps. We will then test whether the filters introduce biases by selecting sicker patients (more diagnoses, more visits, etc.) or changing their demographic characteristics (age, sex, and zip code). Then, we will test the filters on EHR data, first at a single large medical center, and then across a national network of 57 institutions, representing different geographic regions, patient populations, number of years of data, and types of health care facilities. We will evaluate the filters by measuring whether they improve the performance of a machine learning model for predicting hospital admissions.
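The claims-partitioning step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the study's actual method: the claim tuples are synthetic, "completeness" is defined here as the fraction of a patient's claims that fall within one provider group, and the filter (a minimum number of visits within that group) is a hypothetical stand-in for the project's filters.

```python
from collections import defaultdict

# Synthetic claims as (patient_id, provider_group, year). Because they come from
# the payer, they stand in for the "gold standard" complete history.
claims = [
    ("p1", "A", 2019), ("p1", "A", 2020), ("p1", "A", 2021),
    ("p2", "A", 2019), ("p2", "B", 2020), ("p2", "B", 2021), ("p2", "B", 2021),
    ("p3", "A", 2020), ("p3", "B", 2020),
    ("p4", "B", 2019), ("p4", "B", 2020),
]


def group_view(claims, group):
    """Count, per patient, total claims and claims visible to one provider group.

    The per-group subset models what a single EHR would contain.
    Only patients with at least one claim in the group are returned.
    """
    total = defaultdict(int)
    visible = defaultdict(int)
    for patient_id, provider_group, _year in claims:
        total[patient_id] += 1
        if provider_group == group:
            visible[patient_id] += 1
    return {pid: (visible[pid], total[pid]) for pid in visible}


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else None


view = group_view(claims, "A")

# Assumed completeness metric: share of all claims that the group can see.
completeness = {pid: seen / total for pid, (seen, total) in view.items()}

# Hypothetical filter: keep patients with at least 2 visits in the group's data.
selected = sorted(pid for pid, (seen, _total) in view.items() if seen >= 2)

mean_all = mean(completeness.values())
mean_selected = mean(completeness[pid] for pid in selected)
print(completeness, selected, mean_all, mean_selected)
```

Comparing `mean_selected` with `mean_all` indicates whether the filter favors patients with fewer gaps. The same selected and unselected groups could then be compared on visit counts or demographics to look for the biases described above.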

Our ultimate goals are to (a) help researchers balance the need for complete data with the biases this might introduce to their models and (b) help them predict how well models trained on one EHR dataset might work on other EHRs with different data completeness profiles.

Project Team

Publications

  1. Hong C, Zhang HG, L'Yi S, Weber G, et al. Changes in laboratory value improvement and mortality rates over the course of the pandemic: an international retrospective cohort study of hospitalised patients infected with SARS-CoV-2. BMJ Open. 2022 06 23; 12(6):e057725.
  2. Weber GM, Zhang HG, L'Yi S, et al. International Changes in COVID-19 Clinical Trajectories Across 315 Hospitals and 6 Countries: Retrospective Cohort Study. J Med Internet Res. 2021 10 11; 23(10):e31400.
  3. Klann JG, Estiri H, Weber GM, et al. Validation of an internationally derived patient severity phenotype to support COVID-19 analytics from electronic health record data. J Am Med Inform Assoc. 2021 07 14; 28(7):1411-1420.
  4. Kohane IS, Aronow BJ, Avillach P, et al, Weber GM, Cai T. What Every Reader Should Know About Studies Using Electronic Health Record Data but May Be Afraid to Ask. J Med Internet Res. 2021 03 02; 23(3):e22219.
  5. Weber GM, Hong C, Palmer NP, et al. International Comparisons of Harmonized Laboratory Value Trajectories to Predict Severe COVID-19: Leveraging the 4CE Collaborative Across 342 Hospitals and 6 Countries: A Retrospective Cohort Study. medRxiv. 2021 Feb 05.
  6. Yu YW, Weber GM. Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation. J Med Internet Res. 2020 11 03; 22(11):e18735.
  7. Brat GA, Weber GM, Gehlenborg N, et al. International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium. NPJ Digit Med. 2020; 3:109.