Multi-Source Stable Variable Importance Measure via Adversarial Machine Learning

Abstract

As part of enhancing the interpretability of machine learning, there is renewed interest in quantifying and inferring the predictive importance of certain exposure covariates. Modern scientific studies often collect data from multiple sources with distributional heterogeneity. Thus, measuring and inferring stable associations across multiple environments is crucial for reliable and generalizable decision-making. We propose MIMAL, a novel statistical framework for Multi-source stable Importance Measure via Adversarial Learning. MIMAL measures the importance of some exposure variables by maximizing the worst-case predictive reward over the source mixture. Our framework allows various machine learning methods to be used for confounding adjustment and exposure effect characterization. For inferential analysis, the asymptotic normality of our introduced statistic is established under a general machine learning framework that requires no stronger learning accuracy conditions than those for single-source variable importance. Numerical studies with various data generation setups and machine learning implementations are conducted to demonstrate the finite-sample performance of MIMAL. We also illustrate our method through a real-world study of Beijing air pollution in multiple locations.

Department students and members are invited to meet with Dr. Liu after the presentation. Sign up for your small-group appointment here.


Molei Liu is an assistant professor of biostatistics in the Columbia Mailman School of Public Health. His research ranges from methodological and theoretical analysis of general statistical problems to real-world, evidence-based biomedical studies. He has developed methodological research in several active fields of statistics and machine learning, including federated learning, high-dimensional inference, semi-supervised learning, and transfer learning, with applications to real problems in electronic health records (EHR) and their linked bio-repositories.