Human iPS cells have been generated from a diverse range of tissues from a variety of donors using different reprogramming vectors. However, these cell lines are heterogeneous, which presents a limitation for their use in disease modeling and personalized medicine. To explore the basis of this heterogeneity we generated 25 iPS cell lines under normalised conditions from the same set of somatic tissues across a number of donors. RNA-seq data sets from each cell line were compared to identify the major contributors to transcriptional heterogeneity. We found that genetic differences between individual donors were the major cause of transcriptional variation between lines. In contrast, residual signatures from the somatic cell of origin, so-called epigenetic memory, contributed relatively little to transcriptional variation. Thus, underlying genetic background variation is responsible for most heterogeneity between human iPS cell lines. We conclude that epigenetic effects in hiPSCs are minimal, and that hiPSCs are a stable, robust and powerful platform for large-scale studies of the function of genetic differences between individuals. Our data also suggest that future studies using hiPSCs as a model system should focus most effort on collection of large numbers of donors, rather than generating large numbers of lines from the same donor.
Human induced pluripotent stem (hiPS) cells are a potentially powerful model system for studying human disease and development, and a resource for personalized medicine. However, it has been reported that hiPS cells exhibit substantial heterogeneity, which could limit their use as model systems. Knowledge of the source of this heterogeneity is key to a deeper understanding of the use of hiPS cells for basic and therapeutic applications. One source of this heterogeneity has been presumed to be “memory” of the adult somatic cell from which the hiPS cells were derived, but the evidence to support this view is scant. We have generated a set of hiPS cells from a set of somatic cell types from different donors. Our study shows that cell lines from different somatic sources but from the same donor (i.e. with the same genome) are more similar than cell lines isolated from the same tissue type but from different donors. Once genetic changes are accounted for, all aspects of gene expression, including mRNA levels, splicing and imprinting, are highly similar between iPS cells derived from different human tissues. Thus, most of the previously described transcriptional variation between cell lines is likely to be genetic in origin.
Genetic risk scores have been developed for coronary artery disease and atherosclerosis, but are not predictive of adverse cardiovascular events. We asked whether peripheral blood expression profiles may be predictive of acute myocardial infarction (AMI) and/or cardiovascular death.
Peripheral blood samples from 338 subjects aged 62 ± 11 years with coronary artery disease (CAD) were analyzed in two phases (discovery N = 175, and replication N = 163), and followed for a mean 2.4 years for cardiovascular death. Gene expression was measured on Illumina HT-12 microarrays with two different normalization procedures to control technical and biological covariates. Whole genome genotyping was used to support comparative genome-wide association studies of gene expression. Analysis of variance was combined with receiver operating characteristic curve and survival analysis to define a transcriptional signature of cardiovascular death.
In both phases, there was significant differential expression between healthy and AMI groups with overall down-regulation of genes involved in T-lymphocyte signaling and up-regulation of inflammatory genes. Expression quantitative trait loci analysis provided evidence for altered local genetic regulation of transcript abundance in AMI samples. On follow-up there were 31 cardiovascular deaths. A principal component (PC1) score capturing covariance of 238 genes that were differentially expressed between deceased and survivors in the discovery phase significantly predicted risk of cardiovascular death in the replication and combined samples (hazard ratio = 8.5, P < 0.0001) and improved the C-statistic (area under the curve 0.82 to 0.91, P = 0.03) after adjustment for traditional covariates.
A specific blood gene expression profile is associated with a significant risk of death in Caucasian subjects with CAD. This comprises a subset of transcripts that are also altered in expression during acute myocardial infarction.
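The C-statistic reported above (area under the receiver operating characteristic curve) can be illustrated with a minimal sketch. It uses the simple binary-outcome definition, the probability that a randomly chosen event case scores higher than a randomly chosen non-event case, and ignores follow-up time and censoring, which a survival analysis would take into account. The function name and the scores are hypothetical and are not data from the study.

```python
from itertools import product


def c_statistic(scores_events, scores_nonevents):
    """Binary-outcome C-statistic (AUC): fraction of event/non-event pairs
    in which the event case has the higher score; ties count one half."""
    if not scores_events or not scores_nonevents:
        raise ValueError("both groups must be non-empty")
    concordant = 0.0
    for event_score, nonevent_score in product(scores_events, scores_nonevents):
        if event_score > nonevent_score:
            concordant += 1.0
        elif event_score == nonevent_score:
            concordant += 0.5
    return concordant / (len(scores_events) * len(scores_nonevents))


# Hypothetical risk scores (for example, a PC1-style expression score).
deceased = [2.1, 1.7, 0.9]
survivors = [0.4, 1.0, 0.2, 0.9]
auc = c_statistic(deceased, survivors)
print(f"C-statistic: {auc:.3f}")
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation of the two groups.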
Glaucoma is a leading cause of blindness worldwide. Primary open-angle glaucoma (POAG) is the most common subtype and is a complex trait with multigenic inheritance. Genome-wide association studies have previously identified a significant association between POAG and the SIX6 locus (rs10483727, odds ratio (OR) = 1.32, p = 3.87×10−11). SIX6 plays a role in ocular development and has been associated with the morphology of the optic nerve. We sequenced the SIX6 coding and regulatory regions in 262 POAG cases and 256 controls and identified six nonsynonymous coding variants, including five rare and one common variant, Asn141His (rs33912345), which was associated significantly with POAG (OR = 1.27, p = 4.2×10−10) in the NEIGHBOR/GLAUGEN datasets. These variants were tested in an in vivo Danio rerio (zebrafish) complementation assay to evaluate ocular metrics such as eye size and optic nerve structure. Five variants, found primarily in POAG cases, were hypomorphic or null, while the sixth variant, found only in controls, was benign. One variant in the SIX6 enhancer increased expression of SIX6 and disrupted its regulation. Finally, to our knowledge for the first time, we have identified a clinical feature in POAG patients that appears to be dependent upon SIX6 genotype: patients who are homozygous for the SIX6 risk allele (His141) have a statistically thinner retinal nerve fiber layer than patients homozygous for the SIX6 non-risk allele (Asn141). Our results, in combination with previous SIX6 work, lead us to hypothesize that SIX6 risk variants disrupt the development of the neural retina, leading to a reduced number of retinal ganglion cells, thereby increasing the risk of glaucoma-associated vision loss.
Primary open angle glaucoma is a blinding disease for which there is currently no cure, only treatments that may slow its progress. To help understand the mechanisms of this disease and to design more effective treatments, we previously identified a locus, SIX6, that increases the risk of glaucoma. This gene is involved in early eye development and helps to form the retina. In this paper, we test specific sequence variants in SIX6 that are found in glaucoma patients. We show that these variants have a reduced function that interferes with their ability to direct proper formation of the retina. One variant in particular is common, and may be the main reason that this gene is important in the glaucoma disease process. Patients who have two copies of this sequence variant show a change in the structure of their eye consistent with fewer neurons that carry the visual signal to the brain. These neurons typically die as people age, and people who begin life with fewer visual neurons may have an increased risk of glaucoma. Additional research on this topic may lead to new treatments that preserve sight.
Migraine can be sub-classified not only according to presence of migraine aura (MA) or absence of migraine aura (MO), but also by additional features accompanying migraine attacks, e.g. photophobia, phonophobia, nausea, etc. all of which are formally recognized by the International Classification of Headache Disorders. It remains unclear how aura status and the other migraine features may be related to underlying migraine pathophysiology. Recent genome-wide association studies (GWAS) have identified 12 independent loci at which single nucleotide polymorphisms (SNPs) are associated with migraine. Using a likelihood framework, we explored the selective association of these SNPs with migraine, sub-classified according to aura status and the other features in a large population-based cohort of women including 3,003 active migraineurs and 18,108 free of migraine. Five loci met stringent significance for association with migraine, among which four were selective for sub-classified migraine, including rs11172113 (LRP1) for MO. The number of loci associated with migraine increased to 11 at suggestive significance thresholds, including five additional selective associations for MO but none for MA. No two SNPs showed similar patterns of selective association with migraine characteristics. At one extreme, SNPs rs6790925 (near TGFBR2) and rs2274316 (MEF2D) were not associated with migraine overall, MA, or MO but were selective for migraine sub-classified by the presence of one or more of the additional migraine features. In contrast, SNP rs7577262 (TRPM8) was associated with migraine overall and showed little or no selectivity for any of the migraine characteristics. 
The results emphasize the multivalent nature of migraine pathophysiology and suggest that a complete understanding of the genetic influence on migraine may benefit from analyses that stratify migraine according to both aura status and the additional diagnostic features used for clinical characterization of migraine.
Migraine is among the most common and debilitating neurological disorders. Diagnostic criteria for migraine recognize a variety of symptoms including a primary dichotomous classification for the presence or absence of aura, typically a visual disturbance phenomenon, as well as others such as sensitivity to light or sound, and nausea, etc. We explored whether any of 12 recently discovered genetic variants associated with common migraine might have selective association for migraine sub-classified by aura status or nine additional migraine features in a population of middle-aged women including 3,003 migraineurs and 18,180 non-migraineurs. Five of the 12 genetic variants met the most stringent significance criterion for association with migraine, among which four had selective association with sub-classified migraine, including one that was selective for migraine without aura. At suggestive significance, all of the remaining genetic variants were selective for sub-classifications of migraine although no two variants showed the same pattern of selectivity. The selectivity patterns suggest very different contributions to migraine pathophysiology among the 12 loci and their implicated genes. Further, the results suggest that future discovery efforts for new migraine susceptibility loci would benefit by considering associations with sub-classified migraine toward the ultimate goals of more specific diagnosis and personalized treatment.
Modern genetic mapping is plagued by the “missing heritability” problem, which refers to the discordance between the estimated heritabilities of quantitative traits and the variance accounted for by mapped causative variants. One major potential explanation for the missing heritability is allelic heterogeneity, in which there are multiple causative variants at each causative gene with only a fraction having been identified. The majority of genome-wide association studies (GWAS) implicitly assume that a single SNP can explain all the variance for a causative locus. However, if allelic heterogeneity is prevalent, a substantial amount of genetic variance will remain unexplained. In this paper, we take a haplotype-based mapping approach and quantify the number of alleles segregating at each locus using a large set of 7922 eQTL contributing to regulatory variation in the Drosophila melanogaster female head. Not only does this study provide a comprehensive eQTL map for a major community genetic resource, the Drosophila Synthetic Population Resource, but it also provides a direct test of the allelic heterogeneity hypothesis. We find that 95% of cis-eQTLs and 78% of trans-eQTLs are due to multiple alleles, demonstrating that allelic heterogeneity is widespread in Drosophila eQTL. Allelic heterogeneity likely contributes significantly to the missing heritability problem common in GWAS.
For traits with complex genetic inheritance it has generally proven very difficult to identify the majority of the specific causative variants involved. A range of hypotheses have been put forward to explain this so-called “missing heritability”. One idea—allelic heterogeneity, where genes each harbor multiple different causative variants—has received little attention, because it is difficult to detect with most genetic mapping designs. Here we make use of a panel of Drosophila melanogaster lines derived from multiple founders, allowing us to directly test for the presence of multiple alleles at a large set of genetic loci influencing gene expression. We find that the vast majority of loci harbor more than two functional alleles, demonstrating extensive allelic heterogeneity at the level of gene expression and suggesting that such heterogeneity is an important factor determining the genetic basis of complex trait variation in general.
Chronic obstructive pulmonary disease (COPD) is a leading cause of global morbidity and mortality and, whilst smoking remains the single most important risk factor, COPD risk is heritable. Of 26 independent genomic regions showing association with lung function in genome-wide association studies, eleven have been reported to show association with airflow obstruction. Although the main risk factor for COPD is smoking, some individuals are observed to have a high forced expired volume in 1 second (FEV1) despite many years of heavy smoking. We hypothesised that these “resistant smokers” may harbour variants which protect against lung function decline caused by smoking and provide insight into the genetic determinants of lung health. We undertook whole exome re-sequencing of 100 heavy smokers who had healthy lung function given their age, sex, height and smoking history and applied three complementary approaches to explore the genetic architecture of smoking resistance. Firstly, we identified novel functional variants in the “resistant smokers” and looked for enrichment of these novel variants within biological pathways. Secondly, we undertook association testing of all exonic variants individually with two independent control sets. Thirdly, we undertook gene-based association testing of all exonic variants. Our strongest signal of association with smoking resistance for a non-synonymous SNP was for rs10859974 (P = 2.34×10−4) in CCDC38, a gene which has previously been reported to show association with FEV1/FVC, and we demonstrate moderate expression of CCDC38 in bronchial epithelial cells. We identified an enrichment of novel putatively functional variants in genes related to cilia structure and function in resistant smokers. Ciliary function abnormalities are known to be associated with both smoking and reduced mucociliary clearance in patients with COPD. 
We suggest that genetic influences on the development or function of cilia in the bronchial epithelium may affect growth of cilia or the extent of damage caused by tobacco smoke.
Very large genome-wide association studies in general population cohorts have successfully identified at least 26 genes or gene regions associated with lung function and a number of these also show association with chronic obstructive pulmonary disease (COPD). However, these findings explain a small proportion of the heritability of lung function. Although the main risk factor for COPD is smoking, some individuals have normal or good lung function despite many years of heavy smoking. We hypothesised that studying these individuals might tell us more about the genetics of lung health. Re-sequencing of exomes, where all of the variation in the protein-coding portion of the genome can be measured, is a recent approach for the study of low frequency and rare variants. We undertook re-sequencing of the exomes of “resistant smokers” and used publicly available exome data for comparisons. Our findings implicate CCDC38, a gene which has previously shown association with lung function in the general population, and genes involved in cilia structure and lung function as having a role in resistance to smoking.
Personal exome and genome sequencing provides access to loss-of-function and rare deleterious alleles whose interpretation is expected to provide insight into individual disease burden. However, for each allele, accurate interpretation of its effect will depend on both its penetrance and the trait's expressivity. In this regard, an important factor that can modify the effect of a pathogenic coding allele is its level of expression; a factor which itself characteristically changes across tissues. To better inform the degree to which pathogenic alleles can be modified by expression level across multiple tissues, we have conducted exome, RNA and deep, targeted allele-specific expression (ASE) sequencing in ten tissues obtained from a single individual. By combining such data, we report the impact of rare and common loss-of-function variants on allelic expression exposing stronger allelic bias for rare stop-gain variants and informing the extent to which rare deleterious coding alleles are consistently expressed across tissues. This study demonstrates the potential importance of transcriptome data to the interpretation of pathogenic protein-coding variants.
Gene expression is a fundamental cellular process that contributes to phenotypic diversity. Gene expression can vary between alleles of an individual through differences in genomic imprinting or cis-acting regulatory variation. Distinguishing allelic activity is important for informing the abundance of altered mRNA and protein products. Advances in sequencing technologies allow us to quantify patterns of allele-specific expression (ASE) in different individuals and cell-types. Previous studies have identified patterns of ASE across human populations for single cell-types; however, the degree of tissue-specificity of ASE has not been deeply characterized. In this study, we compare patterns of ASE across multiple tissues from a single individual using whole transcriptome sequencing (RNA-Seq) and a targeted, high-resolution assay (mmPCR-Seq). We detect patterns of ASE for rare deleterious and loss-of-function protein-coding variants, informing the frequency at which allelic expression could modify the functional impact of personal deleterious protein-coding variants across tissues. We demonstrate that these interactions occur for one third of such variants; however, large direction flips in allelic expression are infrequent.
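As an illustration of how allelic bias at a single heterozygous site might be assessed from read counts, the sketch below applies an exact two-sided binomial test against an expected reference-allele fraction of 0.5. This is a generic textbook approach and an assumption on our part; the text does not state which test the study used, and real analyses typically also correct for reference mapping bias and overdispersion. The function name and read counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, p_ref=0.5):
    """Exact two-sided binomial p-value for allelic imbalance.

    Sums the probabilities of all outcomes that are no more likely than
    the observed reference-read count under the null fraction p_ref.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("site has no reads")

    def pmf(k):
        return comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k)

    observed = pmf(ref_reads)
    # Small relative tolerance so equally likely outcomes are not lost
    # to floating-point rounding.
    total = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical read counts at two heterozygous sites.
balanced_p = binomial_two_sided_p(10, 10)  # equal allelic expression
biased_p = binomial_two_sided_p(20, 0)     # only the reference allele is seen
print(balanced_p, biased_p)
```

A small p-value indicates that the two alleles are unlikely to be equally expressed at that site in that tissue; comparing the reference-allele fraction across tissues then shows whether the bias is consistent.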
Mapping the polymorphisms responsible for variation in gene expression, known as Expression Quantitative Trait Loci (eQTL), is a common strategy for investigating the molecular basis of disease. Despite numerous eQTL studies, the relationship between the explanatory power of variants on gene expression versus their power to explain ultimate phenotypes remains to be clarified. We addressed this question using four naturally occurring Quantitative Trait Nucleotides (QTN) in three transcription factors that affect sporulation efficiency in wild strains of the yeast, Saccharomyces cerevisiae. We compared the ability of these QTN to explain the variation in both gene expression and sporulation efficiency. We find that the amount of gene expression variation explained by the sporulation QTN is not predictive of the amount of phenotypic variation explained. The QTN are responsible for 98% of the phenotypic variation in our strains but the median gene expression variation explained is only 49%. The alleles that are responsible for most of the variation in sporulation efficiency do not explain most of the variation in gene expression. The balance between the main effects and gene-gene interactions on gene expression variation is not the same as on sporulation efficiency. Finally, we show that nucleotide variants in the same transcription factor explain the expression variation of different sets of target genes depending on whether the variant alters the level or activity of the transcription factor. Our results suggest that a subset of gene expression changes may be more predictive of ultimate phenotypes than the number of genes affected or the total fraction of gene expression variation explained by causative variants, and that the downstream phenotype is buffered against variation in the gene expression network.
There have been major efforts in the study of human disease to identify genetic polymorphisms that cause changes in gene expression. The assumption underlying these studies is that gene expression changes will be responsible for the disease. However, it is unclear if we can predict how a polymorphism affects the variation in disease based on the extent to which it explains variation in gene expression. We have taken advantage of four genetic polymorphisms that affect the ability of budding yeast cells to form spores. The variants were identified in naturally occurring strains, subject to natural selection pressures in the wild, and not from lab strains. These variants lie in factors that control gene expression, which gives us power to compare how the polymorphisms affect variation in both gene expression and the downstream phenotype. We find that the amount of variation in gene expression explained by the variants does not correlate with the amount of variation observed in spore formation, which has implications for studies that attempt to infer the effect of a polymorphism on phenotypic variation by studying its effect on gene expression variation.
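The notion of "variation explained" by a set of variants can be made concrete with a minimal sketch: the fraction of the total sum of squares of a trait that lies between genotype classes (the R² of a one-way model). The same calculation applies whether the trait is a gene's expression level or sporulation efficiency. This is an illustrative simplification, not the study's model, which also considers gene-gene interactions; the function name and values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def variance_explained(genotypes, trait_values):
    """Fraction of total trait variation that lies between genotype classes."""
    if len(genotypes) != len(trait_values) or not trait_values:
        raise ValueError("need one genotype label per trait value")
    groups = defaultdict(list)
    for genotype, value in zip(genotypes, trait_values):
        groups[genotype].append(value)
    grand_mean = fmean(trait_values)
    total_ss = sum((value - grand_mean) ** 2 for value in trait_values)
    if total_ss == 0:
        raise ValueError("trait has no variation")
    between_ss = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return between_ss / total_ss


# Hypothetical strains carrying allele "A" or "B" at one QTN.
alleles = ["A", "A", "B", "B"]
sporulation = [0.0, 2.0, 2.0, 4.0]
r2 = variance_explained(alleles, sporulation)
print(f"fraction of variation explained: {r2:.2f}")
```

Running the same function on an expression trait and on the downstream phenotype gives two fractions that, as the text reports, need not agree.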
We have developed a novel structure-based evaluation for missense variants that explicitly models protein structure and amino acid properties to predict the likelihood that a variant disrupts protein function. A structural disruption score (SDS) is introduced as a measure to depict the likelihood that a case variant is functional. The score is constructed using characteristics that distinguish between causal and neutral variants within a group of proteins. The SDS is correlated with standard sequence-based deleteriousness, but shows promise for improving discrimination between neutral and causal variants at less conserved sites. The prediction was performed on three-dimensional structures of 57 gene products whose homozygous SNPs were identified as case-exclusive variants in an exome sequencing study of epilepsy disorders. We contrasted the candidate epilepsy variants with scores for likely benign variants found in the EVS database, and for positive control variants in the same genes that are suspected to promote a range of diseases. To derive a characteristic profile of damaging SNPs, we transformed continuous scores into categorical variables based on the score distribution of each measurement, collected from all possible SNPs in this protein set, where extreme measures were assumed to be deleterious. A second epilepsy dataset was used to replicate the findings. Causal variants tend to receive higher sequence-based deleterious scores, induce larger physico-chemical changes between amino acid pairs, locate in protein domains, buried sites or on conserved protein surface clusters, and cause protein destabilization, relative to negative controls. These measures were agglomerated for each variant. A list of nine high-priority putative functional variants for epilepsy was generated. Our newly developed SDS protocol facilitates SNP prioritization for experimental validation.
non-synonymous single nucleotide polymorphism; missense mutation; protein structural analysis; structural disruption score; variant prioritization; epilepsy disorders
Many existing cohorts contain a range of relatedness between genotyped individuals, either by design or by chance. Haplotype estimation in such cohorts is a central step in many downstream analyses. Using genotypes from six cohorts from isolated populations and two cohorts from non-isolated populations, we have investigated the performance of different phasing methods designed for nominally ‘unrelated’ individuals. We find that SHAPEIT2 produces much lower switch error rates in all cohorts compared to other methods, including those designed specifically for isolated populations. In particular, when a large amount of IBD sharing is present, SHAPEIT2 infers close to perfect haplotypes. Based on these results we have developed a general strategy for phasing cohorts with any level of implicit or explicit relatedness between individuals. First, SHAPEIT2 is run ignoring all explicit family information. We then apply a novel HMM method (duoHMM) to combine the SHAPEIT2 haplotypes with any family information to infer the inheritance pattern of each meiosis at all sites across each chromosome. This allows the correction of switch errors and the detection of recombination events and genotyping errors. We show that the method detects numbers of recombination events that align very well with expectations based on genetic maps, and that it infers far fewer spurious recombination events than Merlin. The method can also detect genotyping errors and infer recombination events in otherwise uninformative families, such as trios and duos. The detected recombination events can be used in association scans for recombination phenotypes. The method provides a simple and unified approach to haplotype estimation that will be of interest to researchers in the fields of human, animal and plant genetics.
Every individual carries two copies of each chromosome (haplotypes), one from each of their parents, that consist of a long sequence of alleles. Modern genotyping technologies do not measure haplotypes directly, but the combined sum (or genotype) of alleles at each site. Statistical methods are needed to infer (or phase) the haplotypes from the observed genotypes. Haplotype estimation is a key first step of many disease and population genetic studies. Much recent work in this area has focused on phasing in cohorts of nominally unrelated individuals. So called ‘long range phasing’ is a relatively recent concept for phasing individuals with intermediate levels of relatedness, such as cohorts taken from population isolates. Methods also exist for phasing genotypes for individuals within explicit pedigrees. Whilst high quality phasing techniques are available for each of these demographic scenarios, to date, no single method is applicable to all three. In this paper, we present a general approach for phasing cohorts that contain any level of relatedness between the study individuals. We demonstrate high levels of accuracy in all demographic scenarios, as well as the ability to detect (Mendelian consistent) genotyping error and recombination events in duos and trios, the first method with such a capability.
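The relationship between haplotypes and genotypes described above, and the switch error rate used to score phasing accuracy, can be sketched as follows. Alleles are coded 0/1, the genotype at a site is the sum of the two alleles, and a switch error is counted whenever the inferred haplotype changes which true haplotype it follows between consecutive heterozygous sites. The function names and example haplotypes are hypothetical; this is not the SHAPEIT2 or duoHMM implementation.

```python
def genotype(hap_a, hap_b):
    """Unphased genotype: per-site sum of the alleles on the two haplotypes."""
    return [a + b for a, b in zip(hap_a, hap_b)]


def switch_error_rate(true_hap, inferred_hap, genotypes):
    """Switches between consecutive heterozygous sites, divided by the
    number of opportunities (heterozygous sites minus one)."""
    het_sites = [i for i, g in enumerate(genotypes) if g == 1]
    if len(het_sites) < 2:
        return 0.0
    # At each heterozygous site: does the inferred haplotype carry the
    # same allele as the chosen true haplotype?
    follows_true = [true_hap[i] == inferred_hap[i] for i in het_sites]
    switches = sum(
        1 for prev, cur in zip(follows_true, follows_true[1:]) if prev != cur
    )
    return switches / (len(het_sites) - 1)


# Hypothetical true haplotypes of one individual.
maternal = [0, 1, 0, 1, 1, 0]
paternal = [1, 0, 1, 0, 1, 1]
observed = genotype(maternal, paternal)

# An inferred haplotype that follows `maternal` for two sites and then
# switches to `paternal` for the remainder.
inferred = [0, 1, 1, 0, 1, 1]
rate = switch_error_rate(maternal, inferred, observed)
print(observed, rate)
```

Homozygous sites carry no phase information, which is why only heterozygous sites enter the denominator.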
Annotating and interpreting the results of genome-wide association studies (GWAS) remains challenging. Assigning function to genetic variants as expression quantitative trait loci is an expanding and useful approach, but focuses exclusively on mRNA rather than protein levels. Many variants remain without annotation. To address this problem, we measured the steady state abundance of 441 human signaling and transcription factor proteins from 68 Yoruba HapMap lymphoblastoid cell lines to identify novel relationships between inter-individual protein levels, genetic variants, and sensitivity to chemotherapeutic agents. Proteins were measured using micro-western and reverse phase protein arrays from three independent cell line thaws to permit mixed effect modeling of protein biological replicates. We observed enrichment of protein quantitative trait loci (pQTLs) for cellular sensitivity to two commonly used chemotherapeutics: cisplatin and paclitaxel. We functionally validated the target protein of a genome-wide significant trans-pQTL for its relevance in paclitaxel-induced apoptosis. GWAS overlap results of drug-induced apoptosis and cytotoxicity for paclitaxel and cisplatin revealed unique SNPs associated with the pharmacologic traits (at p<0.001). Interestingly, GWAS SNPs from various regions of the genome implicated the same target protein (p<0.0001) that correlated with drug induced cytotoxicity or apoptosis (p≤0.05). Two genes were functionally validated for association with drug response using siRNA: SMC1A with cisplatin response and ZNF569 with paclitaxel response. This work allows pharmacogenomic discovery to progress from the transcriptome to the proteome and offers potential for identification of new therapeutic targets. This approach, linking targeted proteomic data to variation in pharmacologic response, can be generalized to other studies evaluating genotype-phenotype relationships and provide insight into chemotherapeutic mechanisms.
The central dogma of biology explains that DNA is transcribed to mRNA that is further translated into protein. Many genome-wide studies have implicated genetic variation that influences gene expression and that ultimately affects downstream complex traits including response to drugs. However, because of technical limitations, few studies have evaluated the contribution of genetic variation to protein expression and ensuing effects on downstream phenotypes. To overcome this challenge, we used a novel technology to simultaneously measure the baseline expression of 441 proteins in lymphoblastoid cell lines and compared them with publicly available genetic data. To further illustrate the utility of this approach, we compared protein-level measurements with chemotherapeutic-induced apoptosis and cell-growth inhibition data. This study demonstrates the importance of using protein information to understand the functional consequences of genetic variants identified in genome-wide association studies. This protein data set will also have broad utility for understanding the relationship between other genome-wide studies of complex traits.
Systems biology is an approach to dissection of complex traits that explicitly recognizes the impact of genetic, physiological, and environmental interactions in the generation of phenotypic variation. We describe comprehensive transcriptional and metabolic profiling in Drosophila melanogaster across four diets, finding little overlap in modular architecture. Genotype and genotype-by-diet interactions are a major component of transcriptional variation (24 and 5.3% of the total variation, respectively) while there were no main effects of diet (<1%). Genotype was also a major contributor to metabolomic variation (16%), but in contrast to the transcriptome, diet had a large effect (9%) and the interaction effect was minor (2%) for the metabolome. Yet specific principal components of these molecular phenotypes measured in larvae are strongly correlated with particular metabolic syndrome-like phenotypes such as pupal weight, larval sugar content and triglyceride content, development time, and cardiac arrhythmia in adults. The second principal component of the metabolomic profile is especially informative across these traits with glycine identified as a key loading variable. To further relate this physiological variability to genotypic polymorphism, we performed evolve-and-resequence experiments, finding rapid and replicated changes in gene frequency across hundreds of loci that are specific to each diet. Adaptation to diet is thus highly polygenic. However, loci differentially transcribed across diet or previously identified by RNAi knockdown or expression QTL analysis were not the loci responding to dietary selection. Therefore, loci that respond to the selective pressures of diet cannot be readily predicted a priori from functional analyses.
metabolic syndrome; metabolomics; evolve-and-resequence; genotype-by-environment; adaptation
Exome sequencing has been widely used in detecting pathogenic nonsynonymous single nucleotide variants (SNVs) for human inherited diseases. However, traditional statistical genetics methods are ineffective in analyzing exome sequencing data, due to such facts as the large number of sequenced variants, the presence of non-negligible fraction of pathogenic rare variants or de novo mutations, and the limited size of affected and normal populations. Indeed, prevalent applications of exome sequencing have been appealing for an effective computational method for identifying causative nonsynonymous SNVs from a large number of sequenced variants. Here, we propose a bioinformatics approach called SPRING (Snv PRioritization via the INtegration of Genomic data) for identifying pathogenic nonsynonymous SNVs for a given query disease. Based on six functional effect scores calculated by existing methods (SIFT, PolyPhen2, LRT, MutationTaster, GERP and PhyloP) and five association scores derived from a variety of genomic data sources (gene ontology, protein-protein interactions, protein sequences, protein domain annotations and gene pathway annotations), SPRING calculates the statistical significance that an SNV is causative for a query disease and hence provides a means of prioritizing candidate SNVs. With a series of comprehensive validation experiments, we demonstrate that SPRING is valid for diseases whose genetic bases are either partly known or completely unknown and effective for diseases with a variety of inheritance styles. In applications of our method to real exome sequencing data sets, we show the capability of SPRING in detecting causative de novo mutations for autism, epileptic encephalopathies and intellectual disability. We further provide an online service, the standalone software and genome-wide predictions of causative SNVs for 5,080 diseases at http://bioinfo.au.tsinghua.edu.cn/spring.
The detection of causative nonsynonymous single nucleotide variants (SNVs) is essential for the understanding of the pathogenesis of human inherited diseases. In this paper, we propose a statistical method called SPRING (Snv PRioritization via the INtegration of Genomic data) to combine six functional effect scores calculated by existing methods and five association scores derived from multiple genomic data sources to estimate the statistical significance that a nonsynonymous SNV is pathogenic for a query disease. We find that SPRING is effective in identifying disease-causing SNVs for diseases whose genetic bases are either partly known or completely unknown across a variety of inheritance styles. With real exome sequencing data, we show the qualified potential of SPRING in not only the detection of causative SNVs in simulation studies but also the identification of pathogenic de novo mutations for autism, epileptic encephalopathies and intellectual disability.
Phenotypes proximal to gene action generally reflect larger genetic effect sizes than those that are distant. The human metabolome, a result of multiple cellular and biological processes, provides functional intermediate phenotypes proximal to gene action. Here, we present a genome-wide association study of 308 untargeted metabolite levels among African Americans from the Atherosclerosis Risk in Communities (ARIC) Study. Nineteen significant common variant-metabolite associations were identified, including 13 novel loci (p<1.6×10−10). These loci were associated with 7–50% of the difference in metabolite levels per allele, and the variance explained ranged from 4% to 20%. Fourteen genes were identified within the nineteen loci: four enzyme-encoding genes containing non-synonymous substitutions (KLKB1, SIAE, CPS1, and NAT8), eight other enzyme-encoding genes (ACE, GATM, ACY3, ACSM2B, THEM4, ADH4, UGT1A, TREH), a transporter gene (SLC6A13) and a polycystin protein gene (PKD2L1). In addition, four potential disease-associated paths were identified, including two direct longitudinal predictive relationships: NAT8 with N-acetylornithine, N-acetyl-1-methylhistidine and incident chronic kidney disease, and TREH with trehalose and incident diabetes. These results highlight the value of using endophenotypes proximal to gene function to discover new insights into biology and disease pathology.
Most contemporary GWAS have achieved increased power by increasing the size of the discovery sample to tens of thousands of individuals. An alternative approach for detecting the effects of novel loci is to measure phenotypes that more immediately reflect the effects of gene function. The metabolome consists of a collection of small molecules resulting from a variety of cellular and biologic processes, which can be considered intermediate phenotypes proximal to gene function. Here, we report a genome-wide association study identifying nineteen genetic loci influencing untargeted metabolome traits among African Americans in the Atherosclerosis Risk in Communities (ARIC) Study. Fourteen genes mapped within the nineteen loci, including twelve enzyme-encoding genes (KLKB1, SIAE, CPS1, NAT8, ACE, GATM, ACY3, ACSM2B, THEM4, ADH4, UGT1A and TREH), a transporter gene (SLC6A13) and a polycystin protein gene (PKD2L1). In addition, four potential disease-associated paths were identified, including two direct longitudinal predictive relationships: NAT8 with N-acetylornithine, N-acetyl-1-methylhistidine and incident chronic kidney disease, and TREH with trehalose and incident diabetes. These results highlight the value of using phenotypes proximal to gene function to promote novel gene discovery.
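A per-allele effect and the proportion of variance explained, the two quantities reported for each locus-metabolite pair, can be illustrated with an ordinary least-squares fit of a metabolite level on allele dosage. This is a minimal sketch under an additive model with no covariates; the function name `allele_regression` and the toy data are hypothetical and do not reproduce the study's analysis pipeline.

```python
def allele_regression(dosages, trait):
    """Simple linear regression of a trait on allele dosage (0, 1 or 2).

    Returns (beta, r_squared): the per-allele change in the trait and the
    fraction of trait variance explained by the variant.
    Assumes an additive model without covariate adjustment.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in trait)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    if sxx == 0 or syy == 0:
        raise ValueError("dosage and trait must both vary")
    beta = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return beta, r_squared


# Toy data: six individuals with hypothetical metabolite levels
beta, r2 = allele_regression([0, 0, 1, 1, 2, 2], [1.0, 1.4, 1.9, 2.1, 2.8, 3.2])
```

If the trait is log-transformed, `exp(beta) - 1` gives the relative change per allele, which is one way a percentage difference per allele could be expressed.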
Cross-sectional studies have associated short telomere length with smoking, body weight, physical activity, and possibly alcohol intake; however, whether these associations are due to confounding is unknown. We tested these hypotheses in 4,576 individuals from the general population cross-sectionally, and with repeat measurement of relative telomere length 10 years apart. We also tested whether change in telomere length is associated with mortality and morbidity in the general population. Relative telomere length was measured with quantitative polymerase chain reaction. Cross-sectionally at the first examination, short telomere length was associated with increased age (P for trend across quartiles = 3×10−77), current smoking (P = 8×10−3), increased body mass index (P = 7×10−14), physical inactivity (P = 4×10−17), but not with increased alcohol intake (P = 0.10). At the second examination 10 years later, 56% of participants had lost and 44% gained telomere length with a mean loss of 193 basepairs. Change in leukocyte telomere length during 10 years was associated inversely with baseline telomere length (P<1×10−300) and age at baseline (P = 1×10−27), but not with baseline or 10-year inter-observational tobacco consumption, body weight, physical activity, or alcohol intake. Prospectively during a further 10 years follow-up after the second examination, quartiles of telomere length change did not associate with risk of all-cause mortality, cancer, chronic obstructive pulmonary disease, diabetes mellitus, ischemic cerebrovascular disease, or ischemic heart disease. In conclusion, smoking, increased body weight, and physical inactivity were associated with short telomere length cross-sectionally, but not with telomere length change during 10 years observation, and alcohol intake was associated with neither. Also, change in telomere length did not associate prospectively with mortality or morbidity in the general population.
Human chromosomes are capped by protective ends called telomeres. These ends are shortened during renewal of tissue and eventually become critically short, causing cells to become senescent or die. It is widely believed that lifestyle features such as smoking, obesity, physical inactivity, and possibly alcohol intake enhance shortening of telomeres. However, strong evidence to support such an interpretation is hard to find. We therefore tested whether these lifestyle factors are associated with telomere length change in 4,576 healthy individuals from the general population. Individuals had relative telomere length measured twice with a 10-year interval, and were then followed for mortality and morbidity for a further 10 years after the second measurement. We found change in telomere length to be more dynamic than previously believed, as we observed both shortening (in 56%) and lengthening (in 44%) among participants. Contrary to previous beliefs, we found telomere length change to be unaffected by lifestyle factors. Instead, change in telomere length over 10 years was most strongly associated with baseline telomere length and age. Also, we found no association between change in telomere length and risk of all-cause mortality, cancer, chronic obstructive lung disease, diabetes mellitus, ischemic cerebrovascular disease, or ischemic heart disease.
Genetic variation in the major histocompatibility complex (MHC) affects CD4∶CD8 lineage commitment and MHC expression. However, the contribution of specific genes in this gene-dense region has not yet been resolved. Nor has it been established whether the same genes regulate MHC expression and T cell selection. Here, we assessed the impact of natural genetic variation on MHC expression and CD4∶CD8 lineage commitment using two genetic models in the rat. First, we mapped Quantitative Trait Loci (QTLs) associated with variation in MHC class I and II protein expression and the CD4∶CD8 T cell ratio in outbred Heterogeneous Stock rats. We identified 10 QTLs across the genome and found that QTLs for the individual traits colocalized within a region spanning the MHC. To identify the genes underlying these overlapping QTLs, we generated a large panel of MHC-recombinant congenic strains, and refined the QTLs to two adjacent intervals of ∼0.25 Mb in the MHC-I and II regions, respectively. An interaction between these intervals affected MHC class I expression as well as negative selection and lineage commitment of CD8 single-positive (SP) thymocytes. We mapped this effect to the transporter associated with antigen processing 2 (Tap2) in the MHC-II region and the classical MHC class I gene(s) (RT1-A) in the MHC-I region. This interaction was revealed by a recombination between RT1-A and Tap2, which occurred in 0.2% of the rats. Variants of Tap2 have previously been shown to influence the antigenicity of MHC class I molecules by altering the MHC class I ligandome. Our results show that a restricted peptide repertoire on MHC class I molecules leads to reduced negative selection of CD8SP cells. To our knowledge, this is the first study showing how a recombination between natural alleles of genes in the MHC influences lineage commitment of T cells.
Peptides from degraded cytoplasmic proteins are transported via TAP into the endoplasmic reticulum for loading onto MHC class I molecules. TAP is encoded by Tap1 and Tap2, which in rodents are located close to the MHC class I genes. In the rat, genetic variation in Tap2 gives rise to two different transporters: a promiscuous A variant (TAP-A) and a more restrictive B variant (TAP-B). It has been proposed that the class I molecule in the DA rat (RT1-Aa) has co-evolved with TAP-A and it has been shown that RT1-Aa antigenicity is changed when co-expressed with TAP-B. To study the contribution of different allelic combinations of RT1-A and Tap2 to the variation in MHC expression and T cell selection, we generated DA rats with either congenic or background alleles in the RT1-A and Tap2 loci. We found increased numbers of mature CD8SP cells in the thymus of rats which co-expressed RT1-Aa and TAP-B. This increase of CD8 cells could be explained by reduced negative selection, but did not correlate with RT1-Aa expression levels on thymic antigen presenting cells. Thus, our results identify a crucial role of the TAP and the quality of the MHC class I repertoire in regulating T cell selection.
Metabolic traits are molecular phenotypes that can drive clinical phenotypes and may predict disease progression. Here, we report results from a metabolome- and genome-wide association study on 1H-NMR urine metabolic profiles. The study was conducted within an untargeted approach, employing a novel method for compound identification. From our discovery cohort of 835 Caucasian individuals who participated in the CoLaus study, we identified 139 suggestively significant (P<5×10−8) and independent associations between single nucleotide polymorphisms (SNP) and metabolome features. Fifty-six of these associations replicated in the TasteSensomics cohort, comprising 601 individuals from São Paulo of vastly diverse ethnic background. They correspond to eleven gene-metabolite associations, six of which had been previously identified in the urine metabolome and three in the serum metabolome. Our key novel findings are the associations of two SNPs with NMR spectral signatures pointing to fucose (rs492602, P = 6.9×10−44) and lysine (rs8101881, P = 1.2×10−33), respectively. Fine-mapping of the first locus pinpointed the FUT2 gene, which encodes a fucosyltransferase enzyme and has previously been associated with Crohn's disease. This implicates fucose as a potential prognostic disease marker, for which there is already published evidence from a mouse model. The second SNP lies within the SLC7A9 gene, rare mutations of which have been linked to severe kidney damage. The replication of previous associations and our new discoveries demonstrate the potential of untargeted metabolomics GWAS to robustly identify molecular disease markers.
The concentrations of small molecules known as metabolites are subject to tight regulation in all organisms. Collectively, the metabolite concentrations make up the metabolome, which differs amongst individuals as a function of their environment and genetic makeup. In our study, we have further developed an untargeted approach to identify genetic factors affecting human metabolism. In this approach, we first identify all genetic variants that correlate with any of the measured metabolome features in a large set of individuals. For these variants, we then compute a profile of significance for association with all features, generating a signature that facilitates the expert or computational identification of the metabolite whose concentration is most likely affected by the genetic variant at hand. Our study replicated many of the previously reported genetically driven variations in human metabolism and revealed two new striking examples of genetic variations with a sizeable effect on the urine metabolome. Interestingly, in these two gene-metabolite pairs both the gene and the affected metabolite are related to human diseases – Crohn's disease in the first case, and kidney disease in the second. This highlights the connection between genetic predispositions, affected metabolites, and human health.
Transcription factors (TFs) are fundamental controllers of cellular regulation that function in a complex and combinatorial manner. Accurate identification of a transcription factor's targets is essential to understanding the role that factors play in disease biology. However, due to a high false positive rate, identifying coherent functional target sets is difficult. We have created an improved mapping of targets by integrating ChIP-Seq data with 423 functional modules derived from 9,395 human expression experiments. We identified 5,002 TF-module relationships, significantly improved TF target prediction, and found 30 high-confidence TF-TF associations, of which 14 are known. Importantly, we also connected TFs to diseases through these functional modules and identified 3,859 significant TF-disease relationships. As an example, we found a link between MEF2A and Crohn's disease, which we validated in an independent expression dataset. These results show the power of combining expression data and ChIP-Seq data to remove noise and better extract the associations between TFs, functional modules, and disease.
Transcription factors (TFs) are crucial to the precise regulation of many cellular processes and thus are responsible for many human phenotypes and diseases. Now that the ENCODE project has mapped hundreds of TFs to their genomic binding locations, extracting functional biological signals is the next step in understanding their role in disease. In this paper, we present a novel approach to identifying TF targets and use these targets to find regulatory relationships between TFs and diseases. We present a large open dataset of putative TF-TF interactions and TF-disease associations which includes known connections as well as novel ones. We validate one of our novel TF-disease associations, MEF2A and Crohn's disease, suggesting that our approach generates testable disease association hypotheses. Integrating these datasets will be crucial for understanding phenotypes and complex diseases.
Identifying environmentally-specific genetic effects is a key challenge in understanding the structure of complex traits. Model organisms play a crucial role in the identification of such gene-by-environment interactions, as a result of the unique ability to observe genetically similar individuals across multiple distinct environments. Many model organism studies examine the same traits but under varying environmental conditions. For example, knock-out or diet-controlled studies are often used to examine cholesterol in mice. These studies, when examined in aggregate, provide an opportunity to identify genomic loci exhibiting environmentally-dependent effects. However, the straightforward application of traditional methodologies to aggregate separate studies suffers from several problems. First, environmental conditions are often variable and do not fit the standard univariate model for interactions. Additionally, applying a multivariate model results in increased degrees of freedom and low statistical power. In this paper, we jointly analyze multiple studies with varying environmental conditions using a meta-analytic approach based on a random effects model to identify loci involved in gene-by-environment interactions. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional uni- or multi-variate approaches for discovery of gene-by-environment interactions. We apply our new method to combine 17 mouse studies containing in aggregate 4,965 distinct animals. We identify 26 significant loci involved in high-density lipoprotein (HDL) cholesterol, many of which are consistent with previous findings. Several of these loci show significant evidence of involvement in gene-by-environment interactions. An additional advantage of our meta-analysis approach is that our combined study has significantly higher power and improved resolution compared to any single study, thus explaining the large number of loci discovered in the combined study.
Identifying gene-by-environment interactions is important for understanding the architecture of a complex trait. Discovering gene-by-environment interactions requires the observation of the same phenotype in individuals under different environments. Model organism studies are often conducted under different environments. These studies provide an unprecedented opportunity for researchers to identify gene-by-environment interactions. A difference in the effect size of a genetic variant between two studies conducted in different environments may suggest the presence of a gene-by-environment interaction. In this paper, we propose to employ a random-effects-based meta-analysis approach to identify gene-by-environment interactions, which assumes different or heterogeneous effect sizes between studies. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional approaches for discovery of gene-by-environment interactions, which treat the gene-by-environment interactions as covariates in the analysis. We provide an intuitive way to visualize the results of the meta-analysis at a locus, which allows us to obtain biological insights into gene-by-environment interactions. We demonstrate our method by searching for gene-by-environment interactions by combining 17 mouse genetic studies totaling 4,965 distinct animals.
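The idea that an interaction shows up as between-study heterogeneity in effect size can be sketched with textbook meta-analysis quantities. The example below computes the inverse-variance fixed-effects estimate, Cochran's Q heterogeneity statistic, and the DerSimonian-Laird between-study variance. It is a generic illustration only: the function name `random_effects_meta` and the input values are hypothetical, and this is not claimed to be the specific random effects test used in the study.

```python
import math


def random_effects_meta(betas, ses):
    """Combine per-study effect sizes for one locus.

    betas: per-study effect size estimates
    ses:   per-study standard errors
    Returns fixed-effects estimate, Cochran's Q, its degrees of freedom,
    the DerSimonian-Laird tau^2, and the random-effects estimate with SE.
    """
    weights = [1.0 / se ** 2 for se in ses]
    sum_w = sum(weights)
    beta_fixed = sum(w * b for w, b in zip(weights, betas)) / sum_w

    # Cochran's Q: weighted squared deviation of each study from the pooled effect
    q = sum(w * (b - beta_fixed) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1

    # DerSimonian-Laird moment estimator of between-study variance
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    beta_random = sum(w * b for w, b in zip(re_weights, betas)) / sum(re_weights)
    se_random = math.sqrt(1.0 / sum(re_weights))

    return {
        "beta_fixed": beta_fixed,
        "Q": q,
        "df": df,
        "tau2": tau2,
        "beta_random": beta_random,
        "se_random": se_random,
    }


# Hypothetical locus measured in three studies run under different environments
result = random_effects_meta([0.10, 0.45, -0.05], [0.08, 0.10, 0.09])
```

A large Q relative to its degrees of freedom (equivalently, a positive `tau2`) indicates that the variant's effect differs between studies, which is the signal interpreted here as a candidate gene-by-environment interaction.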
Recent high-throughput efforts such as ENCODE have generated a large body of genome-scale transcriptional data in multiple conditions (e.g., cell-types and disease states). Leveraging these data is especially important for network-based approaches to human disease, for instance to identify coherent transcriptional modules (subnetworks) that can inform functional disease mechanisms and pathological pathways. Yet, genome-scale network analysis across conditions is significantly hampered by the paucity of robust and computationally-efficient methods. Building on the Higher-Order Generalized Singular Value Decomposition, we introduce a new algorithmic approach for efficient, parameter-free and reproducible identification of network-modules simultaneously across multiple conditions. Our method can accommodate weighted (and unweighted) networks of any size and can similarly use co-expression or raw gene expression input data, without hinging upon the definition and stability of the correlation used to assess gene co-expression. In simulation studies, we demonstrated distinctive advantages of our method over existing methods: it accurately recovered both common and condition-specific network-modules without requiring the ad-hoc input parameters that other approaches need. We applied our method to genome-scale and multi-tissue transcriptomic datasets from rats (microarray-based) and humans (mRNA-sequencing-based) and identified several common and tissue-specific subnetworks with functional significance, which were not detected by other methods. In humans we recapitulated the crosstalk between cell-cycle progression and cell-extracellular matrix interaction processes in ventricular zones during neocortex expansion and, further, we uncovered previously unappreciated pathways related to development of later cognitive functions in the cortical plate of the developing brain. Analyses of seven rat tissues identified a multi-tissue subnetwork of co-expressed heat shock protein (Hsp) and cardiomyopathy genes (Bag3, Cryab, Kras, Emd, Plec), which was significantly replicated using separate failing heart and liver gene expression datasets in humans, thus revealing a conserved functional role for Hsp genes in cardiovascular disease.
Complex biological interactions and processes can be modelled as networks, for instance metabolic pathways or protein-protein interactions. The growing availability of large high-throughput data in several experimental conditions now permits the full-scale analysis of biological interactions and processes. However, no reliable and computationally efficient methods for simultaneous analysis of multiple large-scale interaction datasets (networks) have been developed to date. To overcome this shortcoming, we have developed a new computational framework that is parameter-free, computationally efficient and highly reliable. We showed how these distinctive properties make it a useful tool for real genomic data exploration and analyses. Indeed, in extensive simulation studies and real-data analyses we have demonstrated that our method outperformed existing approaches in terms of efficiency and, most importantly, reproducibility of the results. Beyond the computational advantages, we illustrated how our method can be effectively applied to leverage the vast stream of genome-scale transcriptional data that has risen exponentially over the last years. In contrast with existing approaches, using our method we were able to identify and replicate multi-tissue gene co-expression networks that were associated with specific functional processes relevant to phenotypic variation and disease in rats and humans.
Personal genome analysis is now being considered for evaluation of disease risk in healthy individuals, utilizing both rare and common variants. Multiple scores have been developed to predict the deleteriousness of amino acid substitutions, using information on the allele frequencies, level of evolutionary conservation, and averaged structural evidence. However, agreement among these scores is limited and they likely over-estimate the fraction of the genome that is deleterious.
This study proposes an integrative approach to identify a subset of homozygous non-synonymous single nucleotide polymorphisms (nsSNPs). An 8-level classification scheme is constructed from the presence/absence of deleterious predictions combined with evidence of association with disease or complex traits. Detailed literature searches and structural validations are then performed for a subset of 826 homozygous mis-sense mutations in 575 proteins found in the genomes of 12 healthy adults.
Implementation of the Association-Adjusted Consensus Deleterious Scheme (AACDS) classifies 11% of all predicted highly deleterious homozygous variants as most likely to influence disease risk. The number of such variants per genome ranges from 0 to 8 with no significant difference between African and Caucasian Americans. Detailed analysis of mutations affecting the APOE, MTMR2, THSB1, CHIA, αMyHC, and AMY2A proteins shows how the protein structure is likely to be disrupted, even though the associated phenotypes have not been documented in the corresponding individuals.
The classification system for homozygous nsSNPs provides an opportunity to systematically rank nsSNPs based on suggestive evidence from annotations and sequence-based predictions. The ranking scheme, in-depth literature searches, and structural validations of highly prioritized mis-sense mutations complement traditional sequence-based approaches and should have particular utility for the development of individualized health profiles. An online tool reporting the AACDS score for any variant is provided at the authors’ website.
Homozygous variant; Non-synonymous single nucleotide polymorphism; Personal genome interpretation; Variant prioritization; Protein structure analysis
Human telomeres are maintained by the shelterin protein complex in which TRF1 and TRF2 bind directly to duplex telomeric DNA. How these proteins find telomeric sequences among a genome of billions of base pairs and how they find protein partners to form the shelterin complex remain uncertain. Using single-molecule fluorescence imaging of quantum dot-labeled TRF1 and TRF2, we study how these proteins locate TTAGGG repeats on DNA tightropes. By virtue of its basic domain, TRF2 performs an extensive 1D search on nontelomeric DNA, whereas TRF1’s 1D search is limited. Unlike the stable and static associations observed for other proteins at specific binding sites, TRF proteins possess reduced binding stability marked by transient binding (∼9–17 s) and slow 1D diffusion on specific telomeric regions. These slow diffusion constants yield activation energy barriers to sliding ∼2.8–3.6 kBT greater than those for nontelomeric DNA. We propose that the TRF proteins use 1D sliding to find protein partners and assemble the shelterin complex, which in turn stabilizes the interaction with specific telomeric DNA. This ‘tag-team proofreading’ represents a more general mechanism to ensure a specific set of proteins interact with each other on long repetitive specific DNA sequences without requiring external energy sources.
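The link between slower diffusion and a higher barrier to sliding can be made concrete with a back-of-the-envelope calculation. The sketch below assumes an Arrhenius-type relation, D = D0 · exp(−E / kBT), with the same prefactor D0 on telomeric and nontelomeric DNA, so that the barrier difference in units of kBT is the natural log of the ratio of diffusion constants. The function name and the numerical inputs are hypothetical; the abstract does not state the exact formula the authors used.

```python
import math


def barrier_difference_kbt(d_nonspecific, d_specific):
    """Extra sliding barrier (in units of kBT) on specific DNA.

    Assumes D = D0 * exp(-E / kBT) with an identical prefactor D0 for both
    DNA types, so E_specific - E_nonspecific = kBT * ln(D_nonspecific / D_specific).
    Both diffusion constants must be positive and in the same units.
    """
    if d_nonspecific <= 0 or d_specific <= 0:
        raise ValueError("diffusion constants must be positive")
    return math.log(d_nonspecific / d_specific)


# Hypothetical diffusion constants (arbitrary but matching units)
delta = barrier_difference_kbt(d_nonspecific=0.10, d_specific=0.005)

# Under this assumption, a 2.8-3.6 kBT barrier difference corresponds to
# diffusion that is exp(2.8) to exp(3.6) times slower on telomeric DNA.
fold_low, fold_high = math.exp(2.8), math.exp(3.6)
```

Under the stated assumption, the reported range of 2.8–3.6 kBT would correspond to roughly 16- to 37-fold slower diffusion on telomeric repeats.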
The major histocompatibility complex (MHC) region is strongly associated with multiple sclerosis (MS) susceptibility. HLA-DRB1*15:01 has the strongest effect, and several other alleles have been reported at different levels of validation. Using SNP data from genome-wide studies, we imputed and tested classical alleles and amino acid polymorphisms in 8 classical human leukocyte antigen (HLA) genes in 5,091 cases and 9,595 controls. We identified 11 statistically independent effects overall: six HLA-DRB1 alleles and one DPB1 allele in class II, one HLA-A and two HLA-B alleles in class I, and one signal in a region spanning from MICB to LST1. This genomic segment does not contain any HLA class I or II genes and provides robust evidence for the involvement of a non-HLA risk allele within the MHC. Interestingly, this region contains the TNF gene, the cognate ligand of the well-validated TNFRSF1A MS susceptibility gene. The classical HLA effects can be explained to some extent by polymorphic amino acid positions in the peptide-binding grooves. This study dissects the independent effects in the MHC, a critical region for MS susceptibility that harbors multiple risk alleles.
Multiple sclerosis (MS) is an inflammatory and neurodegenerative disease with a heritable component. Although it has been known for a long time that the strongest MS risk factor maps to the major histocompatibility complex (MHC) on chromosome 6, there are still many unresolved questions as to the identity and the nature of the risk variants within the MHC. Because the MHC has a complex structure, systematic investigation across this region has been challenging. In this study, we used state-of-the-art imputation methods coupled to statistical regression to query variants in the human leukocyte antigen (HLA) class I and II genes for a role in MS risk. Starting from available SNP genotype data, we replicated the strongest risk factor, the HLA-DRB1*15:01 allele, and were able to identify 11 independent effects in total. Functional studies are now needed to understand their mechanism in MS etiology.
Interactions between HLA class I molecules and killer-cell immunoglobulin-like receptors (KIR) control natural killer cell (NK) functions in immunity and reproduction. Encoded by genes on different chromosomes, these polymorphic ligands and receptors correlate highly with disease resistance and susceptibility. Although studied at low resolution in many populations, high-resolution analysis of combinatorial diversity of HLA class I and KIR is limited to Asian and Amerindian populations with low genetic diversity. At the other end of the spectrum is the West African population investigated here: we studied 235 individuals, including 104 mother-child pairs, from the Ga-Adangbe of Ghana. This population has a rich diversity of 175 KIR variants forming 208 KIR haplotypes, and 81 HLA-A, -B and -C variants forming 190 HLA class I haplotypes. Each individual we studied has a unique compound genotype of HLA class I and KIR, forming 1–14 functional ligand-receptor interactions. Balancing selection maintains this exceptionally high polymorphism. The centromeric region of the KIR locus, encoding HLA-C receptors, is highly diverse, whereas the telomeric region, encoding Bw4-specific KIR3DL1, lacks diversity in Africans. Present in the Ga-Adangbe are high frequencies of Bw4-bearing HLA-B*53:01 and Bw4-lacking HLA-B*35:01, which otherwise are identical. Balancing selection at key residues maintains numerous HLA-B allotypes having and lacking Bw4, and also those of stronger and weaker interaction with LILRB1, a KIR-related receptor. Correspondingly, there is a balance at key residues of KIR3DL1 that modulate its level of cell-surface expression. Thus, capacity to interact with NK cells synergizes with peptide binding diversity to drive HLA-B allele frequency distribution. These features of KIR and HLA are consistent with ongoing co-evolution and selection imposed by a pathogen endemic to West Africa. Because of the prevalence of malaria in the Ga-Adangbe and previous associations of cerebral malaria with HLA-B*53:01 and KIR, Plasmodium falciparum is a candidate pathogen.
Natural killer cells are white blood cells with critical roles in human health that deliver front-line immunity against pathogens and nurture placentation in early pregnancy. Controlling these functions are cell-surface receptors called KIR that interact with HLA class I ligands expressed on most cells of the body. KIR and HLA are both products of complex families of variable genes, but the two families are located on separate chromosomes. Many HLA and KIR variants and their combinations associate with resistance to specific infections and pregnancy syndromes. Previously we identified basic components of the system necessary for individual and population survival. Here, we explore the system at its most genetically diverse by studying the Ga-Adangbe population from Ghana in West Africa. Co-evolution of KIR receptors with their HLA targets is ongoing in the Ga-Adangbe, with every one of 235 individuals studied having a unique set of KIR receptors and HLA class I ligands. In addition, one critical combination of receptor and ligand maintains alternative forms that either can or cannot interact with their ‘partner.’ This balance resembles that induced by malfunctioning variants of hemoglobin that confer resistance to malaria, a candidate disease for driving diversity and co-evolution of KIR and HLA class I in the Ga-Adangbe.
The improved characterisation of risk factors for rheumatoid arthritis (RA) suggests they could be combined to identify individuals at increased disease risk in whom preventive strategies may be evaluated. We aimed to develop an RA prediction model capable of generating clinically relevant predictive data and to determine if it better predicted younger onset RA (YORA). Our novel modelling approach combined odds ratios for 15 four-digit/10 two-digit HLA-DRB1 alleles, 31 single nucleotide polymorphisms (SNPs) and ever-smoking status in males to determine risk using computer simulation and confidence interval based risk categorisation. Only males were evaluated in our models incorporating smoking, as ever-smoking is a significant risk factor for RA in men but not women. We developed multiple models to evaluate each risk factor's impact on prediction. Each model's ability to discriminate anti-citrullinated protein antibody (ACPA)-positive RA from controls was evaluated in two cohorts: Wellcome Trust Case Control Consortium (WTCCC: 1,516 cases; 1,647 controls); UK RA Genetics Group Consortium (UKRAGG: 2,623 cases; 1,500 controls). HLA and smoking provided the strongest prediction, with good discrimination evidenced by an HLA-smoking model area under the curve (AUC) value of 0.813 in both WTCCC and UKRAGG. SNPs provided minimal prediction (AUC 0.660 WTCCC/0.617 UKRAGG). Whilst high individual risks were identified, with some cases having estimated lifetime risks of 86%, only a minority overall had substantially increased odds for RA. High risks from the HLA model were associated with YORA (P<0.0001); ever-smoking was associated with older onset disease. This latter finding suggests smoking's impact on RA risk manifests later in life. Our modelling demonstrates that combining risk factors provides clinically informative RA prediction; additionally, HLA and smoking status can be used to predict the risk of younger and older onset RA, respectively.
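The idea of combining per-factor odds ratios into one score, and then measuring discrimination with the AUC, can be illustrated with a minimal sketch. It is not the study's model: it assumes the risk factors act independently and multiplicatively, it omits the computer simulation and confidence interval based risk categorisation described above, and every factor name, odds ratio and individual in it is hypothetical.

```python
import math

# Hypothetical odds ratios, for illustration only (not values from the study).
ODDS_RATIOS = {
    "hla_risk_allele": 3.0,
    "snp_a": 1.2,
    "ever_smoker": 1.5,
}


def combined_odds_ratio(factors):
    """Combine the odds ratios of the factors a person carries.

    Assumes independent, multiplicative effects, so the log odds ratios add.
    A person with no listed factor gets a combined odds ratio of 1.0.
    """
    log_or = sum(math.log(ODDS_RATIOS[f]) for f in factors)
    return math.exp(log_or)


def auc(case_scores, control_scores):
    """Area under the ROC curve by pairwise comparison.

    This is the probability that a randomly chosen case scores higher than
    a randomly chosen control, with ties counted as one half.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy individuals, each described by the risk factors they carry (hypothetical).
cases = [["hla_risk_allele", "ever_smoker"], ["hla_risk_allele"], ["snp_a"]]
controls = [["snp_a"], [], ["ever_smoker"]]

case_scores = [combined_odds_ratio(person) for person in cases]
control_scores = [combined_odds_ratio(person) for person in controls]

print("AUC on toy data:", round(auc(case_scores, control_scores), 3))
```

An AUC of 0.5 corresponds to no discrimination between cases and controls and 1.0 to perfect separation, which gives a scale for reading the reported values of 0.813 (HLA-smoking model) and 0.660/0.617 (SNPs).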
Rheumatoid arthritis (RA) is a common, incurable disease with major individual and health service costs. Preventing its development is therefore an important goal. Being able to predict who will develop RA would allow researchers to look at ways to prevent it. Many factors have been found that increase someone's risk of RA. These are divided into genetic and environmental (such as smoking) factors. The risk of RA associated with each factor has previously been reported. Here, we demonstrate a method that combines these risk factors in a process called “prediction modelling” to estimate someone's lifetime risk of RA. We show that, firstly, our prediction models can identify people at very high risk of RA and, secondly, they can be used to identify people at risk of developing RA at a younger age. Although these findings are an important first step towards preventing RA, only a minority of people tested had substantially increased disease risk, so our models could not be used to screen the general population. Instead, they need testing in people already at risk of RA, such as relatives of affected patients. In this context they could identify sufficient numbers of high-risk people to allow preventive methods to be evaluated.