Crowd Sourcing Labels From Electronic Medical Records to Enable Biomedical Research

Welcome

When data sets are small, manual chart reviews performed by clinical staff are sufficient to label each outcome. As data sets grow and researchers aim to study larger cohorts, however, manual review becomes intractable. Some researchers have turned to software scripts that infer labels automatically, but the messiness and complexity of Electronic Medical Record (EMR) systems make it difficult to verify the accuracy of the resulting labels. To close this gap, we are developing a framework to crowd source labeled data sets from EMRs to support prediction model development.
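As a toy illustration of why automatically inferred labels are hard to verify, the sketch below applies a simple rule set to structured EMR diagnosis codes. The column names, codes, and rules are assumptions chosen for illustration only; they are not part of the project's tooling.

```python
# Toy rule-based labeler: infers a "type 2 diabetes" label from ICD-10 codes.
# Heuristics like these are easy to write but hard to verify against messy records.
import pandas as pd

# Hypothetical extract of an EMR problem list; column names are assumptions.
records = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "icd10_code": ["E11.9", "I10", "R73.03", "Z79.4"],  # T2DM, hypertension, prediabetes, insulin use
})

T2DM_PREFIX = "E11"       # ICD-10 codes beginning with E11 denote type 2 diabetes
PROXY_CODES = {"Z79.4"}   # long-term insulin use -- a proxy that may mislabel type 1 patients

def infer_t2dm_label(codes: pd.Series) -> bool:
    """Label a patient positive if any code matches the rule set."""
    return bool(codes.str.startswith(T2DM_PREFIX).any() or codes.isin(PROXY_CODES).any())

labels = records.groupby("patient_id")["icd10_code"].apply(infer_t2dm_label)
print(labels)  # patient 3 is labeled positive only via the insulin proxy -- plausibly wrong
```

Patient 3 is labeled positive purely through the insulin-use proxy, the kind of silent error that manual chart review, or a verified crowd-labeling pipeline, is meant to catch.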

Why this Research Now?

Supervised machine learning is a popular method that uses labeled training examples to predict future outcomes. Unfortunately, supervised machine learning in biomedical research is often limited by a lack of labeled data: current methods for producing labels rely on manual chart reviews that are laborious and do not scale with the rate at which data are created. This project aims to develop a framework to crowd source labeled data sets from electronic medical records by forming a crowd of clinical personnel labelers. These labeled data sets will enable new biomedical research studies that were previously infeasible to conduct.

Building a crowd sourcing platform for clinical data raises several practical and theoretical challenges. First, popular public crowd sourcing platforms such as Amazon's Mechanical Turk are not suitable for medical record labeling because HIPAA makes sharing clinical data risky. Second, the types of clinical questions that are amenable to crowd sourcing are not well understood. Third, it is unclear whether a clinical crowd can produce labels quickly and accurately. Each of these challenges is addressed in a separate Aim.

In the first Aim, the team will evaluate different clinical crowd sourcing architectures. The architecture must leverage the scale of the crowd while minimizing the exposure of patient information, and de-identification tools will be considered for scrubbing clinical notes to reduce information leakage. Using this design, the team will extend a popular open source crowd sourcing tool, Pybossa, and release it to the public. In the second Aim, the team will study the type, structure, topic, and specificity of clinical prediction questions, and how these characteristics affect the quality of the resulting labels. In the third Aim, the team will evaluate the quality and accuracy of crowd sourced clinical labels on two existing chart review problems to determine the platform's utility.
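To make the first Aim concrete, here is a minimal sketch of the kind of pipeline described above, assuming a Pybossa-style server reachable through the pybossa-client library (pbclient). The endpoint, API key, project ID, question text, and the crude regex scrubber are placeholders for illustration; they are not the project's actual de-identification tooling.

```python
# Minimal sketch: scrub a clinical note, then post it as a crowd labeling task.
import re
import pbclient

# Crude PHI scrub used only as a stand-in for a real de-identification tool.
def scrub_note(text: str) -> str:
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)           # social security numbers
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "[DATE]", text)    # simple date patterns
    text = re.sub(r"\b[A-Z][a-z]+, [A-Z][a-z]+\b", "[NAME]", text)   # "Last, First" name patterns
    return text

# Hypothetical connection details for an internal (HIPAA-compliant) Pybossa instance.
pbclient.set("endpoint", "https://crowd.example.internal")
pbclient.set("api_key", "YOUR_API_KEY")

note = "Smith, John seen on 3/14/2016. SSN 123-45-6789. Assessment: likely cellulitis."
task_info = {
    "note": scrub_note(note),
    "question": "Does this note document a skin or soft-tissue infection? (yes/no)",
}

# Post the scrubbed note as a labeling task; n_answers requests redundant labels
# so that answers can later be aggregated (e.g., by majority vote).
pbclient.create_task(project_id=1, info=task_info, n_answers=3)
```

In practice, a dedicated de-identification tool would replace the regex scrubber, and the redundant answers would feed the quality and accuracy evaluation planned in the third Aim.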

How to get Involved:

HAILlab is now looking for supervised learning projects to support. If you are a Vanderbilt researcher conducting an IRB-approved study and need manually labeled clinical data, please contact Dr. Fabbri.

Our Investigators:

PI: Daniel Fabbri, Ph.D. daniel.fabbri@vanderbilt.edu
PI: Bradley Malin, Ph.D. b.malin@vanderbilt.edu
PI: Thomas Lasko, M.D., Ph.D. tom.lasko@vanderbilt.edu
Co-PI: Laurie Novak, Ph.D. laurie.novak@vanderbilt.edu
Co-PI: Joshua Denny, M.D., M.S. josh.denny@vanderbilt.edu
Co-PI: Yevgeniy Vorobeychik, Ph.D., M.S.E. yevgeniy.vorobeychik@vanderbilt.edu
PF: Chen Hajaj, Ph.D. chen.hajaj@vanderbilt.edu
RA: Anna Epishova anna.epishova@vanderbilt.edu
RA: Cheng Ye cheng.ye@vanderbilt.edu
RA: Joseph Coco joseph.r.coco@vanderbilt.edu