As described previously [1, 8], we built Varimed, a manually curated database of human disease-SNP associations drawn from the full text, figures, tables, and supplemental materials of 4,573 human genetics papers. Papers were retrieved for curation using Medical Subject Heading (MeSH) terms for human genetic studies, such as “Polymorphism, Single Nucleotide” and “Genetic Predisposition to Disease”. For each paper, we recorded more than 100 features, including the broad and narrow phenotypes studied, the populations and ethnicities studied, the number of patients in the case and control groups, p-values, and disease-susceptibility risk alleles. Given the breadth of this curation process, we believe that Varimed covers the majority of papers relating polymorphisms to human diseases.
We evaluated the genetic findings for each of the 763 diseases covered in the Varimed database by counting the number of independent cross-ethnic SNPs, that is, SNPs validated with p < 5×10⁻⁸ in two or more different population groups, such as African, Caucasian, Chinese, Indian Asian, Japanese, and South American. As an example, we illustrated the method to identify two independent cross-ethnic SNPs from 135 published genetic studies on rheumatoid arthritis (RA) in . Starting from 1,112 SNPs reported as associated with RA, we found 321 SNPs with p < 5×10⁻⁸. We identified two SNPs that had been replicated in at least two different subpopulations. We defined a pair of SNPs as being in linkage disequilibrium (LD) when their LD R² > 0.3 in the CEU HapMap data or, when LD data were unavailable, when they lay within 37,000 base pairs of each other. We used 37,000 base pairs as the cutoff because it was the average genomic distance between SNP pairs with LD R² between 0.3 and 0.4 in the CEU HapMap data. Then, for each pair of SNPs in LD, we removed the SNP with fewer replicated populations and studies, leading to two independent cross-ethnic SNPs. Thus, for RA we identified 135 independent SNPs, 12 independent replicated SNPs, and two independent cross-ethnic SNPs. Similarly, we identified the number of cross-ethnic SNPs for each of the 763 diseases in Varimed.
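The filtering and LD-pruning steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the `Snp` record, the `r2_table` lookup, and the greedy "best-supported first" pruning order are hypothetical choices made here to approximate the pairwise rule in the text, and "within 37,000 base pairs" is read as inclusive.

```python
from dataclasses import dataclass

R2_THRESHOLD = 0.3           # LD cutoff stated in the text
DISTANCE_CUTOFF_BP = 37_000  # fallback used when no LD value is available


@dataclass(frozen=True)
class Snp:
    """Hypothetical record for one SNP already filtered to p < 5e-8."""
    rsid: str
    chrom: str
    pos: int
    populations: frozenset  # population groups in which the SNP reached p < 5e-8
    n_studies: int          # number of studies reporting the association


def in_ld(a, b, r2_table):
    """Return True if two SNPs are treated as being in LD.

    r2_table maps frozenset({rsid1, rsid2}) to an R^2 value (assumed format).
    When the pair is missing, fall back to the genomic-distance rule.
    """
    pair = frozenset((a.rsid, b.rsid))
    if pair in r2_table:
        return r2_table[pair] > R2_THRESHOLD
    return a.chrom == b.chrom and abs(a.pos - b.pos) <= DISTANCE_CUTOFF_BP


def independent_cross_ethnic(snps, r2_table):
    """Keep SNPs significant in >= 2 populations, then prune LD partners.

    Assumption: SNPs are visited from most to least replicated, so that in
    every LD pair the SNP with fewer populations and studies is the one dropped.
    """
    cross_ethnic = [s for s in snps if len(s.populations) >= 2]
    ranked = sorted(
        cross_ethnic,
        key=lambda s: (len(s.populations), s.n_studies),
        reverse=True,
    )
    kept = []
    for snp in ranked:
        if not any(in_ld(snp, other, r2_table) for other in kept):
            kept.append(snp)
    return kept
```

Calling `independent_cross_ethnic(snps, r2_table)` on the significant SNPs for one disease returns the SNPs that would be counted as independent cross-ethnic findings under these assumptions; the length of the returned list is the per-disease count.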
We downloaded the number of deaths in the United States in 2009 for 113 causes from the CDC [9], and recorded the ICD-10 codes, disease names, and mortality data. We then mapped disease names in Varimed to the ICD-10 codes from the CDC using the Unified Medical Language System (UMLS) [10] through a three-step approach. First, we identified the Concept Unique IDs (CUIs) for the 763 disease names in Varimed using exact match, normalization, and the removal of qualifiers [11]. Then, we mapped each CUI to the ICD-10 codes from the CDC using three different methods. 1) We tried to directly identify a corresponding ICD-10 code for each CUI, and checked whether it matched or was a child concept of an ICD-10 code from the CDC. 2) We attempted to identify the parent CUIs for each CUI using the UMLS-Query Perl module [12], searched for ICD-10 codes for these parent CUIs, and then checked whether these parent ICD-10 codes matched or were child concepts of ICD-10 codes from the CDC. 3) We manually identified the ICD-10 code for each disease name in Varimed using the ICD-10 Data Service [13], and checked whether it matched or was a child concept of an ICD-10 code from the CDC. Finally, we manually examined the disease matches between Varimed and the CDC to validate that they were correct. We successfully mapped Varimed diseases to 69 diseases from the CDC.
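The three mapping methods can be illustrated with a short Python sketch. It does not call UMLS or the UMLS-Query module; instead it assumes the lookups have already been exported into plain dictionaries (`direct_icd`, `parent_cuis`, `manual_icd`, all hypothetical names). The child-concept test is also a simplification assumed here: a code counts as a match when it equals or extends a CDC code, or when its three-character category falls inside a CDC range such as `E10-E14`.

```python
def matches_cdc(code, cdc_entry):
    """Check whether an ICD-10 code matches, or is a child of, a CDC entry.

    cdc_entry is either a single code ("C50") or a range ("E10-E14").
    Assumption: the hierarchy is approximated by prefix and range checks.
    """
    category = code.split(".")[0][:3]
    if "-" in cdc_entry:
        low, high = cdc_entry.split("-")
        return low <= category <= high
    return code == cdc_entry or code.startswith(cdc_entry)


def first_cdc_match(codes, cdc_entries):
    """Return the first CDC entry matched by any candidate code, else None."""
    for code in codes:
        for entry in cdc_entries:
            if matches_cdc(code, entry):
                return entry
    return None


def map_disease_to_cdc(name, cui, direct_icd, parent_cuis, manual_icd, cdc_entries):
    """Try the three methods in order; return (cdc_entry, method) or None."""
    # Method 1: ICD-10 codes attached directly to the CUI.
    entry = first_cdc_match(direct_icd.get(cui, []), cdc_entries)
    if entry is not None:
        return entry, "direct"

    # Method 2: ICD-10 codes attached to any parent CUI.
    for parent in parent_cuis.get(cui, []):
        entry = first_cdc_match(direct_icd.get(parent, []), cdc_entries)
        if entry is not None:
            return entry, "parent"

    # Method 3: ICD-10 codes assigned manually to the disease name.
    entry = first_cdc_match(manual_icd.get(name, []), cdc_entries)
    if entry is not None:
        return entry, "manual"

    return None
```

Recording which method produced each match, as the second element of the returned tuple does, makes the final manual validation step easier to prioritize, since matches found through parent concepts or manual lookup are the ones most likely to need review.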
For each disease category from the CDC, we identified the corresponding diseases in Varimed and evaluated the genetic findings on these diseases using 1) the number of genetic papers, including both GWAS and candidate gene studies, 2) the number of published GWAS papers, 3) the number of published GWAS papers with a cohort size larger than 1,000, 4) the number of SNPs with p < 5×10⁻⁸, 5) the number of replicated SNPs, and 6) the number of cross-ethnic SNPs. We also separately counted the published papers on non-Caucasian populations. Finally, we combined the mortality data with the genetic findings to identify diseases that killed more than 10,000 Americans in 2009 but still lacked cross-ethnic SNPs.
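The final step, combining mortality with genetic findings, amounts to a join followed by a filter. The sketch below assumes three hypothetical inputs: deaths per CDC cause, the Varimed-to-CDC mapping, and the per-disease count of cross-ethnic SNPs. Restricting the result to causes with at least one mapped Varimed disease is an assumption made here, so that unmapped causes are not reported as lacking findings.

```python
def high_mortality_without_cross_ethnic_snps(
    deaths_by_cause,      # {cdc_cause: deaths in 2009}
    varimed_to_cause,     # {varimed_disease: cdc_cause}
    cross_ethnic_counts,  # {varimed_disease: number of cross-ethnic SNPs}
    min_deaths=10_000,
):
    """List mapped CDC causes above the death threshold with no cross-ethnic SNPs."""
    # Sum cross-ethnic SNP counts over all Varimed diseases mapped to each cause.
    snps_per_cause = {}
    for disease, cause in varimed_to_cause.items():
        count = cross_ethnic_counts.get(disease, 0)
        snps_per_cause[cause] = snps_per_cause.get(cause, 0) + count

    flagged = [
        cause
        for cause, total in snps_per_cause.items()
        if total == 0 and deaths_by_cause.get(cause, 0) > min_deaths
    ]
    # Report the deadliest causes first.
    return sorted(flagged, key=lambda cause: deaths_by_cause[cause], reverse=True)
```

The same aggregation pattern applies to the other five measures (paper counts, GWAS counts, and so on) by swapping `cross_ethnic_counts` for the relevant per-disease tally.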