Weber Lab

Biases introduced by filtering electronic health records (EHRs) for patients with "complete data"

Welcome to the website for "Biases introduced by filtering electronic health records (EHRs) for patients with 'complete data'", an NIH-funded project in the Weber Lab in the Department of Biomedical Informatics at Harvard Medical School. Nationwide adoption of EHRs has led to the increasing availability of large clinical datasets. However, because the same patient could be treated at multiple health care institutions, data from a single EHR might not contain that patient's complete medical history, and critical events may be missing. This study identifies biases that are introduced by selecting patients with fewer gaps in their records.

Project dates: September 4, 2020 - August 31, 2024

Funding details: NIH/NLM grant R01LM013345

For more information: Please contact Griffin M Weber, MD, PhD

Abstract

Nationwide adoption of electronic health records (EHRs) has led to the increasing availability of large clinical datasets. With statistical modeling and machine learning, these datasets have been used in a wide range of applications, including diagnosis, decision support, cost reduction, and personalized medicine. However, because the same patient could be treated at multiple health care institutions, data from a single EHR might not contain that patient's complete medical history, and critical events may be missing. A common approach to addressing this problem is to apply data checks that filter the EHR for patients whose data appear to be more "complete". Examples of such filters include requiring at least one visit per year or requiring that age, sex, and race all be recorded. However, in a previous study using EHR data from seven institutions, we showed that these filters can greatly reduce the sample size and introduce unexpected biases, both by selecting sicker patients who seek care more often and by changing the demographics of the resulting cohorts.
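The two example filters above can be sketched in a few lines. This is a minimal illustration on hypothetical toy records; the field names, study window, and filter logic are assumptions for demonstration, not the study's actual implementation.

```python
from datetime import date

# Hypothetical toy patient records; field names are illustrative only.
patients = [
    {"id": 1, "age": 67, "sex": "F", "race": "White",
     "visits": [date(2018, 3, 1), date(2019, 6, 2), date(2020, 1, 15)]},
    {"id": 2, "age": 54, "sex": "M", "race": None,          # race not recorded
     "visits": [date(2018, 5, 9), date(2020, 11, 30)]},     # no visit in 2019
]

def has_complete_demographics(p):
    """Filter: age, sex, and race must all be recorded."""
    return all(p.get(k) is not None for k in ("age", "sex", "race"))

def has_visit_every_year(p, start_year, end_year):
    """Filter: at least one visit in every year of the study window."""
    years_seen = {v.year for v in p["visits"]}
    return all(y in years_seen for y in range(start_year, end_year + 1))

cohort = [p for p in patients
          if has_complete_demographics(p)
          and has_visit_every_year(p, 2018, 2020)]

print([p["id"] for p in cohort])  # only patient 1 passes both filters
```

Note how quickly the sample shrinks: patient 2 fails both filters, halving this toy cohort, which mirrors the sample-size reduction observed in the prior seven-institution study.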

This project extends that prior research by implementing an expanded set of data completeness filters and testing their accuracy and potential biases using a combination of national claims data and EHR data from dozens of hospitals and health care centers across the country. This will enable us to understand how data completeness varies across EHRs and to quantify the tradeoffs of different approaches to correcting for gaps in patients' records. First, we will develop and measure the accuracy of data completeness filters using national claims data, which provide a "gold standard" of longitudinal data: patients' complete medical histories are known during the periods in which they were enrolled in the insurance plan. After partitioning the data by provider groups to model gaps in EHR data, we will test how well the data completeness filters, both individually and combined in machine learning models, select patients with fewer gaps. We will also test whether the filters introduce biases by selecting sicker patients (more diagnoses, more visits, etc.) or by changing the cohort's demographic characteristics (age, sex, and zip code). Finally, we will test the filters on EHR data, first at a single large medical center and then across a national network of 57 institutions representing different geographic regions, patient populations, numbers of years of data, and types of health care facilities. We will evaluate the filters by measuring whether they improve the performance of a machine learning model for predicting hospital admissions.
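The partitioning idea can be sketched as follows: claims data contain each patient's full history, so keeping only one provider group's claims simulates what a single institution's EHR would see, and the complete claims serve as the gold standard for whether a record is truly gap-free. This is a minimal illustration on hypothetical toy data; the record structure, group labels, and gap definition are assumptions for demonstration.

```python
# Hypothetical toy claims; each row is one encounter. All names illustrative.
claims = [
    {"patient": "A", "provider_group": "G1", "year": 2018},
    {"patient": "A", "provider_group": "G2", "year": 2019},  # invisible to G1's EHR
    {"patient": "A", "provider_group": "G1", "year": 2020},
    {"patient": "B", "provider_group": "G1", "year": 2018},
    {"patient": "B", "provider_group": "G1", "year": 2019},
    {"patient": "B", "provider_group": "G1", "year": 2020},
]

def simulated_ehr(claims, group):
    """Keep only the encounters visible to one provider group's EHR."""
    return [c for c in claims if c["provider_group"] == group]

def gap_free(records, patient, years):
    """True if the patient has at least one record in every year."""
    seen = {c["year"] for c in records if c["patient"] == patient}
    return all(y in seen for y in years)

ehr_g1 = simulated_ehr(claims, "G1")
years = range(2018, 2021)

for patient in ("A", "B"):
    truth = gap_free(claims, patient, years)     # complete claims = gold standard
    observed = gap_free(ehr_g1, patient, years)  # what one EHR's filter would see
    print(patient, truth, observed)
```

Patient A illustrates the failure mode being studied: the full claims show a gap-free history, but the simulated single-institution EHR shows a 2019 gap, so a completeness filter applied to that EHR would wrongly exclude a patient whose care was merely split across institutions.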

Our ultimate goals are to (a) help researchers balance the need for complete data with the biases this might introduce to their models and (b) help them predict how well models trained on one EHR dataset might work on other EHRs with different data completeness profiles.

Project Team

Publications

  1. Weber GM, Hong C, Palmer NP, Avillach P, Murphy SN, Gutierrez-Sacristan A, Xia Z, Serret-Larmande A, Neuraz A, Omenn GS, Visweswaran S, Klann JG, South AM, Loh NHW, Cannataro M, Beaulieu-Jones BK, Bellazzi R, Agapito G, Alessiani M, Aronow BJ, Bell DS, Bellasi A, Benoit V, Beraghi M, Boeker M, Booth J, Bosari S, Bourgeois FT, Brown NW, Bucalo M, Chiovato L, Chiudinelli L, Dagliati A, Devkota B, DuVall SL, Follett RW, Ganslandt T, Garcia Barrio N, Gradinger T, Griffier R, Hanauer DA, Holmes JH, Horki P, Huling KM, Issitt RW, Jouhet V, Keller MS, Kraska D, Liu M, Luo Y, Lynch KE, Malovini A, Mandl KD, Mao C, Maram A, Matheny ME, Maulhardt T, Mazzitelli M, Milano M, Moore JH, Morris JS, Morris M, Mowery DL, Naughton TP, Ngiam KY, Norman JB, Patel LP, Pedrera Jimenez M, Ramoni RB, Schriver ER, Scudeller L, Sebire NJ, Serrano Balazote P, Spiridou A, Tan AL, Tan BW, Tibollo V, Torti C, Trecarichi EM, Vitacca M, Zambelli A, Zucco C; Consortium for Clinical Characterization of COVID-19 by EHR (4CE), Kohane IS, Cai T, Brat GA. International Comparisons of Harmonized Laboratory Value Trajectories to Predict Severe COVID-19: Leveraging the 4CE Collaborative Across 342 Hospitals and 6 Countries: A Retrospective Cohort Study. medRxiv [Preprint]. 2021 Feb 5:2020.12.16.20247684. doi: 10.1101/2020.12.16.20247684. PMID: 33564777; PMCID: PMC7872369.
  2. Yu YW, Weber GM. Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation. J Med Internet Res. 2020 Nov 3;22(11):e18735. doi: 10.2196/18735. PMID: 33141090; PMCID: PMC7671849.