Infectious and inflammatory diseases have repeatedly shown strong genetic associations within the major histocompatibility complex (MHC); however, the basis for these associations remains elusive. To define host genetic effects on the outcome of a chronic viral infection, we performed genome-wide association analysis in a multiethnic cohort of HIV-1 controllers and progressors, and we analyzed the effects of individual amino acids within the classical human leukocyte antigen (HLA) proteins. We identified >300 genome-wide significant single-nucleotide polymorphisms (SNPs) within the MHC and none elsewhere. Specific amino acids in the HLA-B peptide binding groove, as well as an independent HLA-C effect, explain the SNP associations and reconcile both protective and risk HLA alleles. These results implicate the nature of the HLA–viral peptide interaction as the major factor modulating durable control of HIV infection.
Differences in lipid levels associated with cardiovascular (CV) risk between patients with rheumatoid arthritis (RA) and the general population remain unclear. Determining these differences is important for understanding the role of lipids in CV risk in RA.
We studied 2,005 RA subjects from two large academic medical centers. We extracted electronic medical record (EMR) data on the first low density lipoprotein (LDL), total cholesterol (TChol) and high density lipoprotein (HDL) within 1 year of the LDL. Subjects with an electronic statin prescription prior to the first LDL were excluded.
We compared lipid levels in RA to levels from the general United States population (Carroll, et al., JAMA 2012), using the t-test and stratifying by published parameters, i.e. 2007–2010, women. We determined lipid trends using separate linear regression models for TChol, LDL and HDL, testing the association between year of measurement (1989–2010) and lipid level, adjusted by age and gender. Lipid trends were qualitatively compared to those reported in Carroll, et al.
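The comparison against published population values can be sketched as a two-sample Welch t-test computed from summary statistics alone. This is a minimal illustration, not the authors' analysis: the function name, the standard deviations, and the sample sizes are assumptions (the abstract reports only the means), and the p-value uses a normal approximation that is reasonable only when the degrees of freedom are large.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's two-sample t-test from means, SDs and sample sizes."""
    v1 = sd1 ** 2 / n1          # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2          # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Two-sided p-value via the normal approximation (assumes large df)
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p


# Hypothetical inputs: means echo the abstract's TChol values,
# but the SDs (40 mg/dL) and group sizes (500) are placeholders.
t, df, p = welch_t_from_summary(186, 40, 500, 200, 40, 500)
print(f"t = {t:.2f}, df = {df:.0f}, p = {p:.2g}")
```

Working from summary statistics is the natural choice here because only aggregate values are available for the published reference population.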
Women with RA had significantly lower TChol (186 vs 200 mg/dL, p=0.002) and LDL (105 vs 118 mg/dL, p=0.001) than the general population (2007–2010). HDL did not differ significantly between the two groups. In the RA cohort, TChol and LDL decreased significantly each year, while HDL increased (all with p<0.0001), consistent with the overall trends observed in Carroll, et al.
RA patients appear to have lower overall TChol and LDL than the general population, despite the elevated overall risk of cardiovascular disease (CVD) in RA reported in observational studies.
To identify novel genetic risk factors for rheumatoid arthritis (RA), we conducted a genome-wide association study (GWAS) meta-analysis of 5,539 autoantibody positive RA cases and 20,169 controls of European descent, followed by replication in an independent set of 6,768 RA cases and 8,806 controls. Of 34 SNPs selected for replication, 7 novel RA risk alleles were identified at genome-wide significance (P < 5×10⁻⁸) in analysis of all 41,282 samples. The associated SNPs are near genes of known immune function, including IL6ST, SPRED2, RBPJ, CCR6, IRF5, and PXK. We also refined the risk alleles at two established RA risk loci (IL2RA and CCL21) and confirmed the association at AFF3. These new associations bring the total number of confirmed RA risk loci to 31 among individuals of European ancestry. An additional 11 SNPs replicated at P < 0.05, many of which are validated autoimmune risk alleles, suggesting that most represent bona fide RA risk alleles.
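One common way to combine per-cohort SNP estimates and compare the result against the genome-wide threshold is an inverse-variance-weighted fixed-effects meta-analysis. The sketch below assumes that approach; the abstract does not state which meta-analysis method was used, and the function name and the effect sizes are hypothetical.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted pooled effect, its SE, and a two-sided p."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return beta, se, p


# Hypothetical per-cohort log odds ratios and standard errors for one SNP
beta, se, p = fixed_effect_meta([0.20, 0.20], [0.10, 0.10])
print(f"pooled beta = {beta:.3f}, SE = {se:.3f}, p = {p:.3g}")
print("genome-wide significant:", p < GENOME_WIDE_ALPHA)
```

With two equally precise cohorts the pooled standard error shrinks by a factor of √2, which is why adding replication samples can move a suggestive signal past the threshold.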
Treatment strategies blocking tumor necrosis factor (anti-TNF) have proven very successful in patients with rheumatoid arthritis (RA). However, a significant subset of patients does not respond for unknown reasons. Currently there are no means of identifying these patients prior to treatment. This study was aimed at identifying genetic factors predicting anti-TNF treatment outcome in patients with RA using a genome-wide association approach.
We conducted a multi-stage, genome-wide association study with a primary analysis of 2,557,253 single nucleotide polymorphisms (SNPs) in 882 RA patients receiving anti-TNF therapy included through the Dutch Rheumatoid Arthritis Monitoring (DREAM) registry and the database of Apotheekzorg. Linear regression analysis of changes in the Disease Activity Score in 28 joints after 14 weeks of treatment was performed using an additive model. Markers with p < 10⁻³ were selected for replication in 1,821 RA patients from three independent cohorts. Pathway analysis including all SNPs with p < 10⁻³ was performed using Ingenuity.
Seven hundred seventy-two markers demonstrated evidence of association with treatment outcome in the initial stage. Eight genetic loci showed an improved p-value in the overall meta-analysis compared to the first stage, three of which (rs1568885, rs1813443 and rs4411591) showed directional consistency across all four studied cohorts. We were unable to replicate markers previously reported to be associated with anti-TNF outcome. Network analysis indicated strong involvement of biological processes underlying inflammatory response and cell morphology.
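An additive model codes each genotype as the number of copies of the effect allele (0, 1 or 2) and regresses the outcome on that count. The following is a minimal single-SNP sketch without covariates; the helper names and the toy data are illustrative assumptions, not values from the study.

```python
def allele_dosage(genotype, effect_allele):
    """Additive coding: copies of the effect allele in a genotype like 'AG'."""
    return genotype.count(effect_allele)


def least_squares_line(x, y):
    """Slope and intercept of the ordinary least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data (hypothetical): genotypes and change in DAS28 per patient
genotypes = ["GG", "GG", "AG", "AG", "AA", "AA"]
delta_das28 = [1.0, 1.0, 0.5, 0.5, 0.0, 0.0]

dosages = [allele_dosage(g, "A") for g in genotypes]
slope, intercept = least_squares_line(dosages, delta_das28)
print(f"per-allele effect = {slope:.2f}, intercept = {intercept:.2f}")
```

The slope is the estimated change in the outcome per additional copy of the effect allele, which is the quantity tested for each marker in a scan of this kind.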
Using a multi-stage strategy, we have identified 8 genetic loci associated with response to anti-TNF treatment. Further studies are required to validate these findings in additional patient collections.
anti-TNF; gene polymorphism; pharmacogenetics; rheumatoid arthritis; genome-wide association study
Anti–tumor necrosis factor α (anti-TNF) therapy is a mainstay of treatment in rheumatoid arthritis (RA). The aim of the present study was to test established RA genetic risk factors to determine whether the same alleles also influence the response to anti-TNF therapy.
A total of 1,283 RA patients receiving etanercept, infliximab, or adalimumab therapy were studied from among an international collaborative consortium of 9 different RA cohorts. The primary end point compared RA patients with a good treatment response according to the European League Against Rheumatism (EULAR) response criteria (n = 505) with RA patients considered to be nonresponders (n = 316). The secondary end point was the change from baseline in the level of disease activity according to the Disease Activity Score in 28 joints (ΔDAS28). Clinical factors such as age, sex, and concomitant medications were tested as possible correlates of treatment response. Thirty-one single-nucleotide polymorphisms (SNPs) associated with the risk of RA were genotyped and tested for any association with treatment response, using univariate and multivariate logistic regression models.
Of the 31 RA-associated risk alleles, a SNP at the PTPRC (also known as CD45) gene locus (rs10919563) was associated with the primary end point, a EULAR good response versus no response (odds ratio [OR] 0.55, P = 0.0001 in the multivariate model). Similar results were obtained using the secondary end point, the ΔDAS28 (P = 0.0002). There was suggestive evidence of a stronger association in autoantibody-positive patients with RA (OR 0.55, 95% confidence interval [95% CI] 0.39–0.76) as compared with autoantibody-negative patients (OR 0.90, 95% CI 0.41–1.99).
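For readers less familiar with the reported effect measure, an odds ratio and its 95% confidence interval can be computed from a 2×2 table with the Wald method on the log scale. This is an unadjusted sketch only; the reported estimates come from multivariate logistic regression, and the counts below are hypothetical.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald CI from a 2x2 table.

    a: carriers with good response      b: carriers with no response
    c: non-carriers with good response  d: non-carriers with no response
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to illustrate the calculation
or_, lo, hi = odds_ratio_with_ci(10, 20, 20, 10)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An interval that excludes 1 corresponds to a nominally significant association, whereas an interval spanning 1, as in the autoantibody-negative subgroup above, does not.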
Statistically significant associations were observed between the response to anti-TNF therapy and an RA risk allele at the PTPRC gene locus. Additional studies will be required to replicate this finding in additional patient collections.
To study genetic factors that influence quantitative anti-cyclic citrullinated peptide (anti-CCP) antibody levels in RA patients.
We carried out a genome wide association study (GWAS) meta-analysis using 1,975 anti-CCP+ RA patients from 3 large cohorts, the Brigham Rheumatoid Arthritis Sequential Study (BRASS), North American Rheumatoid Arthritis Consortium (NARAC), and the Epidemiological Investigation of RA (EIRA). We also carried out a genome-wide complex trait analysis (GCTA) to estimate the heritability of anti-CCP levels.
GWAS meta-analysis showed that anti-CCP levels were most strongly associated with the human leukocyte antigen (HLA) region, with a p-value of 2×10⁻¹¹ for rs1980493. There were 112 SNPs in this region that exceeded the genome-wide significance threshold of 5×10⁻⁸, and all were in linkage disequilibrium (LD) with the HLA-DRB1*03 allele, with LD r² in the range of 0.25–0.88. Suggestive novel associations outside of the HLA region were also observed for rs8063248 (near the GP2 gene) with a p-value of 3×10⁻⁷. None of the known RA risk alleles (~52 loci) were associated with anti-CCP level. Heritability analysis estimated that 44% of anti-CCP variation was attributable to genetic factors captured by GWAS variants.
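The LD r² statistic quoted above measures the squared correlation between alleles at two loci. A minimal sketch of the standard definition from haplotype and allele frequencies follows; the function name and the example frequencies are assumptions for illustration.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: marginal frequencies of allele A and allele B
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical frequencies
print(ld_r_squared(0.5, 0.5, 0.5))    # alleles always co-occur
print(ld_r_squared(0.25, 0.5, 0.5))   # alleles are independent
```

Values between these extremes, such as the 0.25 to 0.88 range reported here, indicate partial correlation with the tagged HLA allele.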
Anti-CCP level is a heritable trait. HLA-DR3 and GP2 are associated with lower anti-CCP levels.
RA; GWAS; anti-CCP; heritability
A major challenge in human genetics is to devise a systematic strategy to integrate disease-associated variants with diverse genomic and biological datasets to provide insight into disease pathogenesis and guide drug discovery for complex traits such as rheumatoid arthritis (RA) [1]. Here, we performed a genome-wide association study (GWAS) meta-analysis in a total of >100,000 subjects of European and Asian ancestries (29,880 RA cases and 73,758 controls), by evaluating ~10 million single nucleotide polymorphisms (SNPs). We discovered 42 novel RA risk loci at a genome-wide level of significance, bringing the total to 101 [2–4]. We devised an in-silico pipeline using established bioinformatics methods based on functional annotation [5], cis-acting expression quantitative trait loci (cis-eQTL) [6], and pathway analyses [7–9] – as well as novel methods based on genetic overlap with human primary immunodeficiency (PID), hematological cancer somatic mutations and knock-out mouse phenotypes – to identify 98 biological candidate genes at these 101 risk loci. We demonstrate that these genes are the targets of approved therapies for RA, and further suggest that drugs approved for other indications may be repurposed for the treatment of RA. Together, this comprehensive genetic study sheds light on fundamental genes, pathways and cell types that contribute to RA pathogenesis, and provides empirical evidence that the genetics of RA can provide important information for drug discovery.
Vitamin D may have an immunological role in Crohn’s disease (CD) and ulcerative colitis (UC). Retrospective studies suggested a weak association between vitamin D status and disease activity but have significant limitations.
Using a multi-institution inflammatory bowel disease (IBD) cohort, we identified all CD and UC patients who had at least one measured plasma 25-hydroxy vitamin D [25(OH)D]. Plasma 25(OH)D was considered sufficient at levels ≥ 30ng/mL. Logistic regression models adjusting for potential confounders were used to identify impact of measured plasma 25(OH)D on subsequent risk of IBD-related surgery or hospitalization. In a subset of patients where multiple measures of 25(OH)D were available, we examined impact of normalization of vitamin D status on study outcomes.
Our study included 3,217 patients (55% CD, mean age 49 yrs). The median lowest plasma 25(OH)D was 26ng/mL (IQR 17–35ng/mL). In CD, on multivariable analysis, plasma 25(OH)D < 20ng/mL was associated with an increased risk of surgery (OR 1.76, 95% CI 1.24 – 2.51) and IBD-related hospitalization (OR 2.07, 95% CI 1.59 – 2.68) compared to those with 25(OH)D ≥ 30ng/mL. Similar estimates were also seen for UC. Furthermore, CD patients who had initial levels < 30ng/mL but subsequently normalized their 25(OH)D had a reduced likelihood of surgery (OR 0.56, 95% CI 0.32 – 0.98) compared to those who remained deficient.
Low plasma 25(OH)D is associated with increased risk of surgery and hospitalizations in both CD and UC and normalization of 25(OH)D status is associated with a reduction in the risk of CD-related surgery.
Crohn’s disease; ulcerative colitis; vitamin D; surgery; hospitalization
Prior studies identifying patients with inflammatory bowel disease (IBD) utilizing administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record (EMR) based model for classification of IBD leveraging the combination of codified data and information from clinical text notes using natural language processing (NLP).
Using the EMR of 2 large academic centers, we created data marts for Crohn’s disease (CD) and ulcerative colitis (UC) comprising patients with ≥ 1 ICD-9 code for each disease. We utilized codified (i.e. ICD-9 codes, electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation were performed in a training set of 600 randomly selected patients for each disease with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables.
We confirmed 399 (67%) CD cases in the CD training set and 378 (63%) UC cases in the UC training set. For both, a combined model including narrative and codified data had better accuracy (area under the curve (AUC) for CD 0.95; UC 0.94) than models utilizing only disease ICD-9 codes (AUC 0.89 for CD; 0.86 for UC). Addition of NLP narrative terms to our final model resulted in classification of 6–12% more subjects with the same accuracy.
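The area under the ROC curve used to compare the models equals the probability that a randomly chosen true case receives a higher score than a randomly chosen non-case, with ties counted as one half. A minimal pairwise sketch of that definition, using hypothetical scores and labels:

```python
def auc(scores, labels):
    """AUC as the fraction of (case, non-case) pairs ranked correctly."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    non_cases = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0      # case ranked above non-case
            elif c == n:
                wins += 0.5      # ties count as half
    return wins / (len(cases) * len(non_cases))


# Hypothetical model scores against chart-review labels (1 = confirmed case)
print(auc([0.9, 0.4, 0.5, 0.1], [1, 1, 0, 0]))
```

The quadratic pairwise loop is chosen for clarity; rank-based formulations give the same value more efficiently on large cohorts.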
Inclusion of narrative concepts identified using NLP improves the accuracy of EMR case-definition for CD and UC while simultaneously identifying more subjects compared to models using codified data alone.
Crohn’s disease; ulcerative colitis; disease cohort; natural language processing; informatics
While accurate measures of heritability are needed to understand the pharmacogenetic basis of drug treatment response, these are generally not available, since it is infeasible to give medications to individuals for whom treatment is not indicated. Using a polygenic linear mixed modeling (LMM) approach, we estimated lower bounds on asthma heritability and the heritability of two related drug-response phenotypes, bronchodilator response and airway hyperresponsiveness, using genome-wide SNP data from existing asthma cohorts. Our estimate of the heritability of bronchodilator response is 28.5% (se 16%, p = 0.043) and of airway hyperresponsiveness is 51.1% (se 34%, p = 0.064), while we estimate asthma genetic liability at 61.5% (se 16%, p < 0.001). Our results agree with previously published estimates of the heritability of these traits, suggesting that the LMM method is useful for computing the heritability of other pharmacogenetic traits. Furthermore, our results indicate that multiple SNP main effects, including SNPs as yet unidentified by GWAS methods, together explain a sizable portion of the heritability of these traits.
Asthma; Pharmacogenetics; Heritability; Bronchodilator Response; Airway Hyperresponsiveness
While genetic determinants of LDL cholesterol levels are well characterized in the general population, they are understudied in rheumatoid arthritis (RA). Our objective was to determine the association of established LDL and RA genetic alleles with LDL levels in RA cases compared to non-RA controls.
Using electronic medical records (EMR) data, we linked validated RA cases and non-RA controls to discarded blood samples. For each individual, we extracted data on: 1st LDL measurement, age, gender, and year of LDL measurement. We genotyped subjects for 11 LDL and 44 non-HLA RA alleles, and calculated RA and LDL genetic risk scores (GRS). We tested the association between each GRS and LDL level using multivariate linear regression models adjusted by age, gender, year of LDL measurement, and RA status.
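A genetic risk score is typically the sum of risk-allele counts across a panel of SNPs, optionally weighted by each allele's published effect size. The abstract does not say whether the scores were weighted, so the sketch below shows both forms; the SNP identifiers, dosages, and weights are hypothetical placeholders.

```python
def genetic_risk_score(dosages, weights=None):
    """Sum of risk-allele counts, optionally weighted per SNP.

    dosages: dict mapping SNP id -> risk-allele count (0, 1 or 2)
    weights: dict mapping SNP id -> effect size; None gives a simple count
    """
    if weights is None:
        return float(sum(dosages.values()))
    return sum(weights[snp] * count for snp, count in dosages.items())


# Hypothetical subject genotyped at three placeholder SNPs
subject = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
effect_sizes = {"snp_a": 0.1, "snp_b": 0.2, "snp_c": 0.3}

print(genetic_risk_score(subject))                 # unweighted allele count
print(genetic_risk_score(subject, effect_sizes))   # weighted score
```

The resulting score is then used as a single predictor in the regression of LDL level on covariates, which is why results are reported per unit increase in GRS.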
Among 567 RA cases and 979 controls, 80% were female and the mean age at 1st LDL measurement was 55 years. RA cases had significantly lower mean LDL levels than controls (117.2 vs. 125.6 mg/dL, respectively, p<0.0001). Each unit increase in LDL GRS was associated with 0.8 mg/dL higher LDL levels in both RA cases and controls (p=3.0×10⁻⁷). Each unit increase in RA GRS was associated with 4.3 mg/dL lower LDL levels in both groups (p=0.01).
LDL alleles were associated with higher LDL levels in RA. RA alleles were associated with lower LDL levels in both RA cases and controls. Since RA cases carry more RA alleles, these findings suggest a genetic basis for epidemiologic observations of lower LDL levels in RA.
Rheumatoid arthritis; low density lipoprotein; genetics; human leukocyte antigen
Psychiatric co-morbidity is common in Crohn’s disease (CD) and ulcerative colitis (UC). IBD-related surgery or hospitalizations represent major events in the natural history of disease. Whether there is a difference in risk of psychiatric co-morbidity following surgery in CD and UC has not been examined previously.
We used a multi-institution cohort of IBD patients without a diagnosis code for anxiety or depression preceding their IBD-related surgery or hospitalization. Demographic, disease, and treatment related variables were retrieved. Multivariate logistic regression analysis was performed to individually identify risk factors for depression and anxiety.
Our study included a total of 707 CD and 530 UC patients who underwent bowel resection surgery and did not have depression prior to surgery. The risk of depression 5 years after surgery was 16% and 11% in CD and UC, respectively. We found no difference in the risk of depression following surgery in CD and UC patients (adjusted OR 1.11, 95% CI 0.84 – 1.47). Female gender, co-morbidity, immunosuppressant use, perianal disease, stoma surgery, and early surgery within 3 years of care predicted depression after CD-surgery; only female gender and co-morbidity predicted depression in UC. Only 12% of the CD cohort had ≥ 4 risk factors for depression, but among them nearly 44% subsequently received a diagnosis code for depression.
IBD-related surgery or hospitalization is associated with a significant risk for depression and anxiety with a similar magnitude of risk in both diseases.
Crohn’s disease; depression; anxiety; surgery; hospitalization
The significance of non-RA autoantibodies in patients with rheumatoid arthritis (RA) is unclear. We studied associations between autoimmune risk alleles and autoantibodies in RA cases and non-RA controls, and autoantibodies and clinical diagnoses from the electronic medical records (EMR).
We studied 1,290 RA cases and 1,236 non-RA controls of European genetic ancestry from the EMR from two large academic centers. We measured antibodies to citrullinated peptides (ACPA), anti-nuclear antibodies (ANA), antibodies to tissue transglutaminase (anti-tTG), and antibodies to thyroid peroxidase (anti-TPO). We genotyped subjects for autoimmune risk alleles, and studied the association between number of autoimmune risk alleles and number of types of autoantibodies present. We conducted a phenome-wide association study (PheWAS) to study potential associations between autoantibodies and clinical diagnoses among RA cases and controls.
Mean age was 60.7 years in RA cases and 64.6 years in controls, and both groups were 79% female. The prevalence of ACPA and ANA was higher in RA cases compared to controls (p<0.0001, both); we observed no difference in anti-TPO and anti-tTG. Carriage of higher numbers of autoimmune risk alleles was associated with increasing types of autoantibodies in RA cases (p=4.4×10⁻⁶) and controls (p=0.002). From the PheWAS, ANA was significantly associated with Sjögren’s/sicca in RA cases.
The increased frequency of autoantibodies in RA cases and controls was associated with the number of autoimmune risk alleles carried by an individual. PheWAS analyses within the EMR linked to blood samples provide a novel method to test for the clinical significance of biomarkers in disease.
Integrating genetic data from families with highly penetrant forms of disease together with genetic data from outbred populations represents a promising strategy to uncover the complete frequency spectrum of risk alleles for complex traits such as rheumatoid arthritis (RA). Here, we demonstrate that rare, low-frequency and common alleles at one gene locus, phospholipase B1 (PLB1), might contribute to risk of RA in a 4-generation consanguineous pedigree (Middle Eastern ancestry) and also in unrelated individuals from the general population (European ancestry). Through identity-by-descent (IBD) mapping and whole-exome sequencing, we identified a non-synonymous c.2263G>C (p.G755R) mutation at the PLB1 gene on 2q23, which significantly co-segregated with RA in family members with a dominant mode of inheritance (P = 0.009). We further evaluated PLB1 variants and risk of RA using a GWAS meta-analysis of 8,875 RA cases and 29,367 controls of European ancestry. We identified significant contributions of two independent non-coding variants near PLB1 with risk of RA (rs116018341 [MAF = 0.042] and rs116541814 [MAF = 0.021], combined P = 3.2×10⁻⁶). Finally, we performed deep exon sequencing of PLB1 in 1,088 RA cases and 1,088 controls (European ancestry), and identified suggestive dispersion of rare protein-coding variant frequencies between cases and controls (P = 0.049 for C-alpha test and P = 0.055 for SKAT). Together, these data suggest that PLB1 is a candidate risk gene for RA. Future studies to characterize the full spectrum of genetic risk in the PLB1 genetic locus are warranted.
Psychiatric co-morbidity, in particular major depression and anxiety, is common in patients with Crohn’s disease (CD) and ulcerative colitis (UC). Prior studies examining this may be confounded by the co-existence of functional bowel symptoms. Limited data exist examining an association between depression or anxiety and disease-specific endpoints such as bowel surgery.
Using a multi-institution cohort of patients with CD and UC, we identified those who also had co-existing psychiatric co-morbidity (major depressive disorder or generalized anxiety). After excluding those diagnosed with such co-morbidity for the first time following surgery, we used multivariate logistic regression to examine the independent effect of psychiatric co-morbidity on IBD-related surgery and hospitalization. To account for confounding by disease severity, we adjusted for a propensity score estimating likelihood of psychiatric co-morbidity influenced by severity of disease in our models.
A total of 5,405 CD and 5,429 UC patients were included in this study; one-fifth had either major depressive disorder or generalized anxiety. In multivariate analysis, adjusting for potential confounders and the propensity score, presence of mood or anxiety co-morbidity was associated with a 28% increase in risk of surgery in CD (OR 1.28, 95% CI 1.03 – 1.57) but not UC (OR 1.01, 95% CI 0.80 – 1.28). Psychiatric co-morbidity was associated with increased healthcare utilization.
Depressive disorder or generalized anxiety is associated with a modestly increased risk of surgery in patients with CD. Interventions addressing this may improve patient outcomes.
Crohn’s disease; ulcerative colitis; depression; surgery; hospitalization
Recent work has shown that much of the missing heritability of complex traits can be resolved by estimates of heritability explained by all genotyped SNPs. However, it is currently unknown how much heritability is missing due to poor tagging or additional causal variants at known GWAS loci. Here, we use variance components to quantify the heritability explained by all SNPs at known GWAS loci in nine diseases from WTCCC1 and WTCCC2. After accounting for expectation, we observed all SNPs at known GWAS loci to explain more heritability than GWAS-associated SNPs on average. For some diseases, this increase was individually significant, namely for Multiple Sclerosis (MS) and for Crohn's Disease (CD); all analyses of autoimmune diseases excluded the well-studied MHC region. Additionally, we found that GWAS loci from other related traits also explained significant heritability. The union of all autoimmune disease loci explained more MS heritability than known MS SNPs and more CD heritability than known CD SNPs, with an analogous increase for all autoimmune diseases analyzed. We also observed significant increases in an analysis of Rheumatoid Arthritis (RA) samples typed on ImmunoChip, with more heritability from all SNPs at GWAS loci and more heritability from all autoimmune disease loci compared to known RA SNPs (including those identified in this cohort). Our methods adjust for LD between SNPs, which can bias standard estimates of heritability from SNPs even if all causal variants are typed. By comparing adjusted estimates, we hypothesize that the genome-wide distribution of causal variants is enriched for low-frequency alleles, but that causal variants at known GWAS loci are skewed towards common alleles. These findings have important ramifications for fine-mapping study design and our understanding of complex disease architecture.
Heritable diseases have an unknown underlying “genetic architecture” that defines the distribution of effect-sizes for disease-causing mutations. Understanding this genetic architecture is an important first step in designing disease-mapping studies, and many theories have been developed on the nature of this distribution. Here, we evaluate the hypothesis that additional heritable variation lies at previously known associated loci but is not fully explained by the single most associated marker. We develop methods based on variance-components analysis to quantify this type of “local” heritability, demonstrate that estimates from standard strategies can be falsely inflated or deflated due to correlation between neighboring markers, and propose a robust adjustment. In analysis of nine common diseases we find a significant average increase of local heritability, consistent with multiple common causal variants at an average locus. Intriguingly, for autoimmune diseases we also observe significant local heritability in loci not associated with the specific disease but with other autoimmune diseases, implying a highly correlated underlying disease architecture. These findings have important implications for the design of future studies and our general understanding of common disease.
Electronic medical records (EMRs) are a rich data source for discovery research but are underutilized due to the difficulty of extracting highly accurate clinical data. We assessed whether a classification algorithm incorporating narrative EMR data (typed physician notes) more accurately classifies subjects with rheumatoid arthritis (RA) compared to an algorithm using codified EMR data alone.
Subjects with ≥1 ICD-9 RA code (714.xx) or who had anti-CCP checked in the EMR of two large academic centers were included in an ‘RA Mart’ (n=29,432). For all 29,432 subjects, we extracted narrative (using natural language processing) and codified RA clinical information. In a training set of 96 RA and 404 non-RA cases from the RA Mart classified by medical record review, we used narrative and codified data to develop classification algorithms using logistic regression. These algorithms were applied to the entire RA Mart. We calculated and compared the positive predictive value (PPV) of these algorithms by reviewing records of an additional 400 subjects classified as RA by the algorithms.
A complete algorithm (narrative and codified data) classified RA subjects with a significantly higher PPV (94%) than an algorithm with codified data alone (PPV 88%). Characteristics of the RA cohort identified by the complete algorithm were comparable to existing RA cohorts (80% female, 63% anti-CCP+, 59% erosion+).
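A PPV estimated by chart review is a proportion, so its precision depends on how many records were reviewed. The sketch below computes a PPV with a Wilson score interval. The count of confirmed cases is a hypothetical illustration, since the abstract does not state how the 400 reviewed records were split between the two algorithms.

```python
import math


def ppv_with_wilson_ci(confirmed, reviewed, z=1.96):
    """PPV (confirmed / reviewed) with a Wilson score confidence interval."""
    p = confirmed / reviewed
    denom = 1 + z ** 2 / reviewed
    center = (p + z ** 2 / (2 * reviewed)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1 - p) / reviewed + z ** 2 / (4 * reviewed ** 2)
    )
    return p, center - half_width, center + half_width


# Hypothetical: 376 of 400 algorithm-positive records confirmed on review
ppv, lower, upper = ppv_with_wilson_ci(376, 400)
print(f"PPV = {ppv:.2%} (95% CI {lower:.2%} to {upper:.2%})")
```

The Wilson interval is used here rather than the simple normal interval because it behaves better for proportions close to 1.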
We demonstrate the ability to utilize complete EMR data to define an RA cohort with a PPV of 94%, which was superior to an algorithm using codified data alone.
Autoimmune disease results from a loss of tolerance to self-antigens in genetically susceptible individuals. Completely understanding this process requires that targeted antigens be identified, and so a number of techniques have been developed to determine immune receptor specificities. We previously reported the construction of a phage-displayed synthetic human peptidome and a proof-of-principle analysis of antibodies from three patients with neurological autoimmunity. Here we present data from a large-scale screen of 298 independent antibody repertoires, including those from 73 healthy sera, using phage immunoprecipitation sequencing. The resulting database of peptide-antibody interactions characterizes each individual’s unique autoantibody fingerprint, and includes specificities found to occur frequently in the general population as well as those associated with disease. Screening type 1 diabetes (T1D) patients revealed a prematurely polyautoreactive phenotype compared with their matched controls. A collection of cerebrospinal fluids and sera from 63 multiple sclerosis patients uncovered novel, as well as previously reported antibody-peptide interactions. Finally, a screen of synovial fluids and sera from 64 rheumatoid arthritis patients revealed novel disease-associated antibody specificities that were independent of seropositivity status. This work demonstrates the utility of performing PhIP-Seq screens on large numbers of individuals and is another step toward defining the full complement of autoimmunoreactivities in health and disease.
autoantigen discovery; high throughput screening; PhIP-Seq; proteomics
To optimally leverage the scalability and unique features of the electronic health records (EHR) for research that would ultimately improve patient care, we need to accurately identify patients and extract clinically meaningful measures. Using multiple sclerosis (MS) as a proof of principle, we showcased how to leverage routinely collected EHR data to identify patients with a complex neurological disorder and derive an important surrogate measure of disease severity heretofore only available in research settings.
In a cross-sectional observational study, 5,495 MS patients were identified from the EHR systems of two major referral hospitals using an algorithm that includes codified and narrative information extracted using natural language processing. In the subset of patients who receive neurological care at an MS Center where disease measures have been collected, we used routinely collected EHR data to extract two clinically relevant aggregate indicators of MS severity: multiple sclerosis severity score (MSSS) and brain parenchymal fraction (BPF, a measure of whole brain volume).
The EHR algorithm that identifies MS patients has an area under the curve of 0.958, 83% sensitivity, 92% positive predictive value, and 89% negative predictive value when a 95% specificity threshold is used. The correlation between EHR-derived and true MSSS has a mean R² = 0.38±0.05, and that between EHR-derived and true BPF has a mean R² = 0.22±0.08. To illustrate its clinical relevance, derived MSSS captures the expected difference in disease severity between relapsing-remitting and progressive MS patients after adjusting for sex, age of symptom onset and disease duration (p = 1.56×10⁻¹²).
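Reporting sensitivity, PPV and NPV "at a 95% specificity threshold" means first choosing the score cutoff that achieves the target specificity and then reading off the other metrics at that cutoff. A minimal sketch of that procedure follows; the function names and toy data are assumptions, and the toy example uses a 75% target so that the small sample can reach it.

```python
def confusion_counts(scores, labels, threshold):
    """TP, FP, TN, FN when a score >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")


def metrics_at_specificity(scores, labels, target_specificity):
    """Pick the lowest cutoff meeting the specificity target, then summarize."""
    threshold = float("inf")
    for candidate in sorted(set(scores)):
        _, fp, tn, _ = confusion_counts(scores, labels, candidate)
        if ratio(tn, tn + fp) >= target_specificity:
            threshold = candidate
            break
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return {
        "threshold": threshold,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data (hypothetical): four non-cases followed by three cases
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1]
print(metrics_at_specificity(scores, labels, 0.75))
```

Choosing the lowest qualifying cutoff maximizes sensitivity subject to the specificity constraint, because specificity can only stay the same or rise as the cutoff increases.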
Incorporation of sophisticated codified and narrative EHR data accurately identifies MS patients and provides estimation of a well-accepted indicator of MS severity that is widely used in research settings but not part of the routine medical records. Similar approaches could be applied to other complex neurological disorders.
To examine the association of previously identified autoimmune disease susceptibility loci with granulomatosis with polyangiitis (GPA, formerly known as Wegener’s granulomatosis), and determine whether genetic susceptibility profiles of other autoimmune diseases are associated with GPA.
Genetic data from two cohorts were meta-analyzed. Genotypes for 168 previously identified single nucleotide polymorphisms (SNPs) associated with susceptibility to different autoimmune diseases were ascertained for a total of 880 GPA cases and 1969 controls of European descent. Single marker associations were identified using additive logistic regression models. Multi-SNP associations with GPA were assessed using genetic risk scores based on susceptibility loci for Crohn’s disease, type 1 diabetes, systemic lupus erythematosus, rheumatoid arthritis, celiac disease, and ulcerative colitis. Adjustment for population substructure was performed in all analyses using ancestry informative markers and principal components analysis.
Genetic polymorphisms in CTLA4 were significantly associated with GPA in the single-marker meta-analysis (OR 0.79, 95% CI 0.70–0.89, p = 9.8 × 10⁻⁵). A genetic risk score based on rheumatoid arthritis susceptibility markers was significantly associated with GPA (OR 1.05 per 1-unit increase in genetic risk score, 95% CI 1.02–1.08, p = 5.1 × 10⁻⁵).
Rheumatoid arthritis and GPA may arise from a similar genetic predisposition. Aside from CTLA4, other loci previously found to be associated with common autoimmune diseases were not statistically associated with GPA in this study.
genetics; vasculitis; granulomatosis with polyangiitis; rheumatoid arthritis; CTLA4
We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record.
Materials and Methods
The Training Set consisted of 2792 clinical notes and associated lab values. Test Set 1 included 1749 clinical notes and associated lab values. Test Set 2 included 344 clinical notes for which there were no associated lab values. The Apache clinical Text Analysis and Knowledge Extraction System was used to analyze the text and transform it into informative features to be combined with relevant lab values.
Experiments over a range of machine learning algorithms and features were conducted. The best performing combination was linear kernel Support Vector Machines with Unified Medical Language System Concept Unique Identifier features with feature selection and lab values. The Area Under the Receiver Operating Characteristic Curve (AUC) is 0.831 (σ = 0.0317), statistically significant as compared to two baselines (AUC = 0.758, σ = 0.0291). Algorithms demonstrated superior performance on cases clinically defined as extreme categories of disease activity (Remission and High) compared to those defined as intermediate categories (Moderate and Low) and included laboratory data on inflammatory markers.
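The AUC used to compare classifiers can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition, assuming binary labels; the scores and labels are hypothetical and the function is not the evaluation code used in the study.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: classifier outputs, higher meaning more likely positive
    labels: 1 for the positive class, 0 for the negative class
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: three positive and three negative notes.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
example_labels = [1, 1, 0, 1, 0, 0]
example_auc = auc_from_scores(example_scores, example_labels)
```

The pairwise loop is quadratic in the number of instances, which is acceptable for a few thousand notes; a rank-based computation would be preferable for much larger sets.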
Automatic Rheumatoid Arthritis disease activity discovery from Electronic Medical Record data is a learnable task approximating human performance. As a result, this approach might have several research applications, such as the identification of patients for genome-wide pharmacogenetic studies that require large sample sizes with precise definitions of disease activity and response to therapies.
Elevated serum urate concentrations can cause gout, a prevalent and painful inflammatory arthritis. By combining data from >140,000 individuals of European ancestry within the Global Urate Genetics Consortium (GUGC), we identified and replicated 28 genome-wide significant loci in association with serum urate concentrations (18 new regions in or near TRIM46, INHBB, SFMBT1, TMEM171, VEGFA, BAZ1B, PRKAG2, STC1, HNF4G, A1CF, ATXN2, UBE2Q2, IGF1R, NFAT5, MAF, HLF, ACVR1B-ACVRL1 and B3GNT4). Associations for many of the loci were of similar magnitude in individuals of non-European ancestry. We further characterized these loci for associations with gout, transcript expression and the fractional excretion of urate. Network analyses implicate the inhibins-activins signaling pathways and glucose metabolism in systemic urate control. New candidate genes for serum urate concentration highlight the importance of metabolic control of urate production and excretion, which may have implications for the treatment and prevention of gout.
Electronic health records (EHR) can allow for the generation of large cohorts of individuals with given diseases for clinical and genomic research. A rate-limiting step is the development of electronic phenotype selection algorithms to find such cohorts. This study evaluated the portability of a published phenotype algorithm to identify rheumatoid arthritis (RA) patients from EHR records at three institutions with different EHR systems.
Materials and Methods
Physicians reviewed charts from three institutions to identify patients with RA. Each institution compiled attributes from various sources in the EHR, including codified data and clinical narratives, which were searched using one of two natural language processing (NLP) systems. The performance of the published model was compared with locally retrained models.
Applying the previously published model from Partners Healthcare to datasets from Northwestern and Vanderbilt Universities, the area under the receiver operating characteristic curve was found to be 92% for Northwestern and 95% for Vanderbilt, compared with 97% at Partners. Retraining the model improved the average sensitivity, at a specificity of 97%, from the original 65% to 72%. Both the original logistic regression models and locally retrained models were superior to simple billing code count thresholds.
These results show that a previously published algorithm for RA is portable to two external hospitals using different EHR systems, different NLP systems, and different target NLP vocabularies. Retraining the algorithm primarily increased the sensitivity at each site.
Electronic phenotype algorithms allow rapid identification of case populations in multiple sites with little retraining.
Automated learning; biomedical informatics; discovery and text and data mining methods; electronic health record; genetic; improving the education and skills training of health professionals; infection control; knowledge representations; linking the genotype and phenotype; medical informatics; natural language processing; other methods of information extraction; phenotype algorithms; DNA databank; machine learning; phenotype identification; phenotyping; rheumatoid arthritis; rheumatology; translational research – application of biological knowledge to clinical care