Enhancing Patient Representation Learning with Inferred Family Relations Improves Disease Risk Prediction in Electronic Health Records and Biobank Data

Abstract

Machine learning and deep learning are powerful tools for analyzing electronic health records (EHRs) in healthcare research. Although family history has been recognized as a major predictor of a wide spectrum of diseases, research has so far adopted a limited view of family relations, essentially treating patients as independent samples in the analysis. Here, we present ALIGATEHR, which models family relations inferred from patients' demographic information in a graph attention network augmented with an attention-based medical ontology representation, thus accounting for the complex influence of genetics, shared environmental exposures, and disease dependencies. To improve the reliability of the inferred family relations, we further develop an advanced version, Bio-ALIGATEHR, which replaces the inferred family relations with genetic relationships calculated from patients' genetic data in biobanks. Taking disease risk prediction as a use case, we demonstrate that explicitly modeling family relations significantly improves predictions across 1,886 diseases using EHR data from over 600K patients in the MarketScan databases. We then show how ALIGATEHR's attention mechanism, which links patients' disease risk to their relatives' clinical profiles, successfully captures genetic aspects of diseases using longitudinal EHR diagnosis data. Furthermore, we use ALIGATEHR to successfully distinguish the two main inflammatory bowel disease subtypes (Crohn's disease and ulcerative colitis), which share many risk factors and symptoms. Finally, we apply Bio-ALIGATEHR to predict the risk of five fibrotic diseases (metabolic dysfunction-associated steatohepatitis, pulmonary fibrosis, pulmonary hypertension, systemic sclerosis, and keloids) in over 140K patients with both genetic and EHR data from UK Biobank. Bio-ALIGATEHR improves prediction accuracy by at least 9% compared with state-of-the-art neural network-based models. Our results show that patients at high risk for fibrotic diseases tend to have more affected family members than patients at low risk. In summary, our results highlight that family relations should not be overlooked in EHR research and illustrate our methods' great potential for enhancing patient representation learning for predictive and interpretable modeling of EHRs.
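
As a rough illustration of the attention idea described in the abstract, where a patient's representation is informed by relatives' clinical profiles, the following minimal sketch shows a single-head, graph-attention-style aggregation in plain Python. It is not the ALIGATEHR implementation: the function name `attend_over_relatives`, the attention weight vector, and the toy vectors are hypothetical, and the learned linear projection, multiple heads, and medical ontology component of a full model are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attend_over_relatives(patient_vec, relative_vecs, attn_weights):
    """Pool a patient's vector with relatives' vectors via attention.

    Assumption: each vector is a fixed-length clinical embedding, and
    attn_weights (hypothetical, normally learned) has length 2 * dim
    because it scores the concatenation [patient, neighbour].
    """
    dim = len(patient_vec)
    if len(attn_weights) != 2 * dim:
        raise ValueError("attn_weights must have length 2 * dim")

    # Include the patient as a self-loop so the result is defined
    # even when no relatives are known.
    neighbours = [patient_vec] + list(relative_vecs)

    scores = []
    for vec in neighbours:
        concat = list(patient_vec) + list(vec)
        raw = sum(w * x for w, x in zip(attn_weights, concat))
        scores.append(leaky_relu(raw))

    # Attention coefficients: one weight per neighbour, summing to 1.
    alphas = softmax(scores)

    # Weighted sum of neighbour vectors gives the updated representation.
    pooled = [
        sum(a * vec[i] for a, vec in zip(alphas, neighbours))
        for i in range(dim)
    ]
    return pooled, alphas


# Toy usage: one patient with two relatives, 2-dimensional embeddings.
patient = [1.0, 0.0]
relatives = [[0.0, 1.0], [1.0, 1.0]]
attn = [0.5, -0.5, 0.3, 0.8]  # hypothetical weights, not learned values
pooled, alphas = attend_over_relatives(patient, relatives, attn)
```

In a sketch like this, the coefficients in `alphas` are what one would inspect to see which relatives contribute most to a patient's representation, which is the kind of interpretability the abstract attributes to the attention mechanism.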

Dr. Wang is an associate professor of biostatistics and of biomedical informatics & data science at Yale University. She obtained her PhD from the University of Chicago. Her research focuses on developing statistical methodologies and computational algorithms, spanning longitudinal data analysis, kernel machine methods, mixed effects models, correlated data, graphical models, machine learning algorithms, and bioinformatics, with applications in large-scale biomedical, omics, and healthcare data.