We replicated known genetic associations for five diseases. We genotyped the first 10,000 samples accrued into BioVU (the Vanderbilt EMR-associated DNA biobank) for twenty-one loci that had been associated with five common diseases (reported odds ratios 1.14-2.36) in at least two previous studies. We developed automated phenotype identification algorithms that combined NLP techniques (to identify key findings, medication names, and family history), billing code queries, and structured data elements (such as laboratory results) to identify cases (n=70-698) and controls (n=808-3818). The final algorithms achieved positive predictive values (PPV) of ≥97% for cases and 100% for controls when evaluated on randomly selected cases and controls. Used alone, ICD9 codes had PPVs of 56-89%, owing to coding errors, misdiagnoses by non-specialists, and the evolution of indeterminate diagnoses into well-defined ones.
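To illustrate why combining evidence sources can raise PPV over billing codes alone, here is a minimal sketch. It is not the published algorithm: the `Record` fields, the `code_only` and `combined` rules, and the seven sample records are all hypothetical, chosen only to show how PPV is computed against a chart-review gold standard.

```python
from dataclasses import dataclass


@dataclass
class Record:
    has_billing_code: bool  # hypothetical: a disease billing code is present
    nlp_mention: bool       # hypothetical: NLP found the disease in clinical notes
    on_medication: bool     # hypothetical: a relevant medication was found
    true_case: bool         # gold standard from manual chart review


def code_only(r: Record) -> bool:
    """Flag a record as a case on the billing code alone."""
    return r.has_billing_code


def combined(r: Record) -> bool:
    """Flag a record only when all three evidence sources agree (assumed rule)."""
    return r.has_billing_code and r.nlp_mention and r.on_medication


def ppv(records, rule) -> float:
    """PPV = confirmed cases among flagged records / all flagged records."""
    flagged = [r for r in records if rule(r)]
    if not flagged:
        raise ValueError("rule flagged no records")
    return sum(r.true_case for r in flagged) / len(flagged)


# Hypothetical reviewed sample: three true cases, two records carrying a
# billing code without the disease, and two unaffected records.
records = (
    [Record(True, True, True, True)] * 3
    + [Record(True, False, False, False)]
    + [Record(True, True, False, False)]
    + [Record(False, False, False, False)] * 2
)

print(f"code only: {ppv(records, code_only):.0%}")
print(f"combined:  {ppv(records, combined):.0%}")
```

In this toy sample the stricter rule discards the two records that carry a code but were not confirmed on review, which is the kind of error the text attributes to billing codes used alone.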
Each of the 21 tests of association yielded a point estimate in the expected direction, and eight of the known associations reached statistical significance. All adequately powered associations were replicated.
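As a rough illustration of what a point estimate "in the expected direction" means, the sketch below computes an allelic odds ratio with a Woolf-type 95% confidence interval from a 2x2 table of allele counts. The summary does not state which statistical test the study used, so this method is an assumption, and the function name and the counts are hypothetical.

```python
import math


def allelic_odds_ratio(case_risk, case_other, control_risk, control_other, z=1.96):
    """Return (odds ratio, lower bound, upper bound) from allele counts.

    Uses the Woolf (log odds ratio) approximation for the interval;
    z=1.96 gives an approximate 95% confidence interval.
    """
    counts = (case_risk, case_other, control_risk, control_other)
    if min(counts) <= 0:
        raise ValueError("all four allele counts must be positive")
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    se = math.sqrt(sum(1 / c for c in counts))  # standard error of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical allele counts (not study data).
odds_ratio, lower, upper = allelic_odds_ratio(300, 700, 1000, 3000)

# "Expected direction": the estimate falls on the same side of 1 as the
# previously reported odds ratio (here assumed to be above 1).
same_direction = odds_ratio > 1
# Nominal significance at the 5% level: the interval excludes 1.
significant = lower > 1 or upper < 1

print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")
print(f"same direction: {same_direction}, significant: {significant}")
```

An estimate can lie in the expected direction without reaching significance when the case count is small, which is consistent with all 21 estimates agreeing in direction while only eight were significant.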
Detailed phenotype algorithms for Atrial Fibrillation, Crohn's disease, Multiple Sclerosis, Rheumatoid Arthritis, and Type 2 Diabetes are available as an appendix in this publication: PubMed - Am J Hum Genetics