Interpretable AI in Scientific Applications: Bayesian Pyramids

Abstract

The AI revolution has been driven by the success of deep neural networks (DNNs) having massive numbers of parameters fit to enormous datasets – e.g., the large language models (LLMs) underlying ChatGPT. Biomedical researchers and other scientists, however, typically don't want to settle for a black box when making predictions – we want to be able to infer the underlying structure in the data to further our understanding. With this goal in mind, my group has focused in part on developing novel interpretable AI methodology based on DNNs having multilayer discrete structures. My group is motivated by applications in personalized medicine for sepsis data, in ecology, and in brain network modeling. Critically for interpretability, we have developed an identifiability theory in connection with these structures. Critically for uncertainty quantification, we have developed efficient algorithms for Bayesian inference. The result is a flexible "Bayesian pyramids" framework that is broadly useful, providing a more nuanced generalization of model-based clustering.

In this talk, I discuss the general framework of Bayesian pyramids, illustrated with nucleotide sequence data and brain connectomes from the Human Connectome Project. I also propose new statistical methods and inference methodology for managing and analyzing operational taxonomic units (OTUs). It is common in many application domains to collect massive-dimensional discrete data, and we have paid particular attention to DNA barcoding data obtained in biodiversity modeling (similar omics data are commonplace in microbiome studies and other contexts), which have been used to infer the OTUs present in each sample from each study location. From these OTUs, ecologists are interested in conducting joint species distribution modeling (JSDM) to assess covariate effects on which taxa are present and to infer across-taxa dependence in occurrence. Bayesian hierarchical probit latent factor regression models are used routinely in ecology, but face challenges with (a) very large numbers of OTUs, (b) very rare OTUs, and (c) the discovery of new OTUs as sampling proceeds. I illustrate the methods developed by my group through insect biodiversity data collected in Madagascar.

Department students and members are invited to meet with Dr. Dunson before the presentation. Sign up for your small-group or 1:1 appointment here.


David Dunson, PhD, FASA, FIMS, is Arts and Sciences Distinguished Professor of Statistical Science at Duke University. He focuses on developing statistical and machine learning methodology for the analysis and interpretation of complex and high-dimensional data, with a particular emphasis on Bayesian statistics and probability modeling approaches for scientific applications in neuroscience, genomics, environmental health, ecology, and other areas. His work has had a substantial impact, with an h-index of 94. His honors include a gold medal from the US Environmental Protection Agency, the COPSS (Committee of Presidents of Statistical Societies) Presidents' Award (one of the highest honors in the field of statistics), the Mortimer Spiegelman Award (for outstanding contributions to public health statistics), and an Institute of Mathematical Statistics Medallion Lectureship. In 2021, he received the COPSS G.W. Snedecor Award for his instrumental role in the development of statistical theory in biometry.