Protein-Protein Interactions (PPIs) play important roles in many biological functions. Protein domains, which are defined as independently folding structural blocks of proteins, physically interact with each other to perform these biological functions. Therefore, the identification of Domain-Domain Interactions (DDIs) is of great biological interests because it is generally accepted that PPIs are mediated by DDIs. As a result, much effort has been put on the prediction of domain pair interactions based on computational methods. Many DDI prediction tools using PPIs network and domain evolution information have been reported. However, tools that combine the primary sequences, domain annotations, and structural annotations of proteins have not been evaluated before.
In this study, we report a novel approach called Gram-bAsed Interaction Analysis (GAIA). GAIA extracts peptide segments that are composed of fixed length of continuous amino acids, called n-grams (where n is the number of amino acids), from the annotated domain and DDI data set in Saccharomyces cerevisiae (budding yeast) and identifies a list of n-grams that may contribute to DDIs and PPIs based on the frequencies of their appearance. GAIA also reports the coordinate position of gram pairs on each interacting domain pair. We demonstrate that our approach improves on other DDI prediction approaches when tested against a gold-standard data set and achieves a true positive rate of 82% and a false positive rate of 21%. We also identify a list of 4-gram pairs that are significantly over-represented in the DDI data set and may mediate PPIs.
GAIA represents a novel and reliable way to predict DDIs that mediate PPIs. Our results, which show the localizations of interacting grams/hotspots, provide testable hypotheses for experimental validation. Complemented with other prediction methods, this study will allow us to elucidate the interactome of cells.
We aimed to develop a new method to convert T1-weighted brain MRIs to feature vectors, which could be used for content-based image retrieval (CBIR). To overcome the wide range of anatomical variability in clinical cases and the inconsistency of imaging protocols, we introduced the Gross feature recognition of Anatomical Images based on Atlas grid (GAIA), in which the local intensity alteration, caused by pathological (e.g., ischemia) or physiological (development and aging) intensity changes, as well as by atlas–image misregistration, is used to capture the anatomical features of target images.
As a proof-of-concept, the GAIA was applied for pattern recognition of the neuroanatomical features of multiple stages of Alzheimer's disease, Huntington's disease, spinocerebellar ataxia type 6, and four subtypes of primary progressive aphasia. For each of these diseases, feature vectors based on a training dataset were applied to a test dataset to evaluate the accuracy of pattern recognition. The feature vectors extracted from the training dataset agreed well with the known pathological hallmarks of the selected neurodegenerative diseases. Overall, discriminant scores of the test images accurately categorized these test images to the correct disease categories. Images without typical disease-related anatomical features were misclassified. The proposed method is a promising method for image feature extraction based on disease-related anatomical features, which should enable users to submit a patient image and search past clinical cases with similar anatomical phenotypes.
•A novel method to convert anatomical brain MRIs to feature vectors is introduced.•Degree of local atlas–image disagreement is used to capture the anatomical features.•The method was applied for pattern recognition of various neurodegenerative diseases.•The feature vectors agreed well with the known pathological hallmarks of diseases.•The method accurately categorized test images to the correct disease categories.
Atlas; Feature recognition; Alzheimer's disease; Huntington's disease; Primary progressive aphasia; Spinocerebellar ataxia
We define the Gaia system of life and its environment on Earth, review the status of the Gaia theory, introduce potentially relevant concepts from complexity theory, then try to apply them to Gaia. We consider whether Gaia is a complex adaptive system (CAS) in terms of its behaviour and suggest that the system is self-organizing but does not reside in a critical state. Gaia has supported abundant life for most of the last 3.8 Gyr. Large perturbations have occasionally suppressed life but the system has always recovered without losing the capacity for large-scale free energy capture and recycling of essential elements. To illustrate how complexity theory can help us understand the emergence of planetary-scale order, we present a simple cellular automata (CA) model of the imaginary planet Daisyworld. This exhibits emergent self-regulation as a consequence of feedback coupling between life and its environment. Local spatial interaction, which was absent from the original model, can destabilize the system by generating bifurcation regimes. Variation and natural selection tend to remove this instability. With mutation in the model system, it exhibits self-organizing adaptive behaviour in its response to forcing. We close by suggesting how artificial life ('Alife') techniques may enable more comprehensive feasibility tests of Gaia.
HIV genomic sequence variability has complicated efforts to generate an effective globally relevant vaccine. Regions of the viral genome conserved in sequence and across time may represent the “Achilles’ heel” of HIV. In this study, highly conserved T-cell epitopes were selected using immunoinformatics tools combining HLA-A2 supertype binding predictions with relative global conservation. Analysis performed in 2002 on 10,803 HIV-1 sequences, and again in 2009, on 43,822 sequences, yielded 38 HLA-A2 epitopes. These epitopes were experimentally validated for HLA binding and immunogenicity with PBMCs from HIV-infected patients in Providence, Rhode Island, and/or Bamako, Mali. Thirty-five (92%) stimulated an IFNγ response in PBMCs from at least one subject. Eleven of fourteen peptides (79%) were confirmed as HLA-A2 epitopes in both locations. Validation of these HLA-A2 epitopes conserved across time, clades, and geography supports the hypothesis that such epitopes could provide effective coverage of virus diversity and would be appropriate for inclusion in a globally relevant HIV vaccine.
The Cochran–Armitage trend test (CATT) is well suited for testing association between a marker and a disease in case–control studies. When the underlying genetic model for the disease is known, the CATT optimal for the genetic model is used. For complex diseases, however, the genetic models of the true disease loci are unknown. In this situation, robust tests are preferable. We propose a two-phase analysis with model selection for the case–control design. In the first phase, we use the difference of Hardy–Weinberg disequilibrium coefficients between the cases and the controls for model selection. Then, an optimal CATT corresponding to the selected model is used for testing association. The correlation of the statistics used for selection and the test for association is derived to adjust the two-phase analysis with control of the Type-I error rate. The simulation studies show that this new approach has greater efficiency robustness than the existing methods.
Cochran–Armitage trend test; Disease risk; Efficiency robustness; Hardy–Weinberg disequilibrium; SNP
In the search for genetic determinants of complex disease, two approaches to association analysis are most often employed, testing single loci or testing a small group of loci jointly via haplotypes for their relationship to disease status. It is still debatable which of these approaches is more favourable, and under what conditions. The former has the advantage of simplicity but suffers severely when alleles at the tested loci are not in linkage disequilibrium (LD) with liability alleles; the latter should capture more of the signal encoded in LD, but is far from simple. The complexity of haplotype analysis could be especially troublesome for association scans over large genomic regions, which, in fact, is becoming the standard design. For these reasons, the authors have been evaluating statistical methods that bridge the gap between single-locus and haplotype-based tests. In this article, they present one such method, which uses non-parametric regression techniques embodied by Bayesian adaptive regression splines (BARS). For a set of markers falling within a common genomic region and a corresponding set of single-locus association statistics, the BARS procedure integrates these results into a single test by examining the class of smooth curves consistent with the data. The non-parametric BARS procedure generally finds no signal when no liability allele exists in the tested region (ie it achieves the specified size of the test) and it is sensitive enough to pick up signals when a liability allele is present. The BARS procedure provides a robust and potentially powerful alternative to classical tests of association, diminishes the multiple testing problem inherent in those tests and can be applied to a wide range of data types, including genotype frequencies estimated from pooled samples.
association study; adaptive regression splines; complex disease; genome scan; linkage disequilibrium (LD); non-parametric regression
Motivation: Increasing use of structural modeling for understanding structure–function relationships in proteins has led to the need to ensure that the protein models being used are of acceptable quality. Quality of a given protein structure can be assessed by comparing various intrinsic structural properties of the protein to those observed in high-resolution protein structures.
Results: In this study, we present tools to compare a given structure to high-resolution crystal structures. We assess packing by calculating the total void volume, the percentage of unsatisfied hydrogen bonds, the number of steric clashes and the scaling of the accessible surface area. We assess covalent geometry by determining bond lengths, angles, dihedrals and rotamers. The statistical parameters for the above measures, obtained from high-resolution crystal structures enable us to provide a quality-score that points to specific areas where a given protein structural model needs improvement.
Availability and Implementation: We provide these tools that appraise protein structures in the form of a web server Gaia (http://chiron.dokhlab.org). Gaia evaluates the packing and covalent geometry of a given protein structure and provides quantitative comparison of the given structure to high-resolution crystal structures.
Supplementary information: Supplementary data are available at Bioinformatics online.
Multiple prostate cancer (PCa) risk-related loci have been discovered by genome-wide association studies (GWAS) based on case–control designs. However, GWAS findings may be confounded by population stratification if cases and controls are inadvertently drawn from different genetic backgrounds. In addition, since these loci were identified in cases with predominantly sporadic disease, little is known about their relationships with hereditary prostate cancer (HPC). The association between seventeen reported PCa susceptibility loci was evaluated with a family-based association test using 1,979 hereditary PCa families of European descent collected by members of the International Consortium for Prostate Cancer Genetics, with a total of 5,730 affected men. The risk alleles for 8 of the 17 loci were significantly over-transmitted from parents to affected offspring, including SNPs residing in 8q24 (regions 1, 2 and 3), 10q11, 11q13, 17q12 (region 1), 17q24 and Xp11. In subgroup analyses, three loci, at 8q24 (regions 1 and 2) plus 17q12, were significantly over-transmitted in hereditary PCa families with five or more affected members, while loci at 3p12, 8q24 (region 2), 11q13, 17q12 (region 1), 17q24 and Xp11 were significantly over-transmitted in HPC families with an average age of diagnosis at 65 years or less. Our results indicate that at least a subset of PCa risk-related loci identified by case–control GWAS are also associated with disease risk in HPC families.
Biological processes are regulated by complex interactions between transcription factors and signalling molecules, collectively described as Genetic Regulatory Networks (GRNs). The characterisation of these networks to reveal regulatory mechanisms is a long-term goal of many laboratories. However compiling, visualising and interacting with such networks is non-trivial. Current tools and databases typically focus on GRNs within simple, single celled organisms. However, data is available within the literature describing regulatory interactions in multi-cellular organisms, although not in any systematic form. This is particularly true within the field of developmental biology, where regulatory interactions should also be tagged with information about the time and anatomical location of development in which they occur.
We have developed myGRN (), a web application for storing and interrogating interaction data, with an emphasis on developmental processes. Users can submit interaction and gene expression data, either curated from published sources or derived from their own unpublished data. All interactions associated with publications are publicly visible, and unpublished interactions can only be shared between collaborating labs prior to publication. Users can group interactions into discrete networks based on specific biological processes. Various filters allow dynamic production of network diagrams based on a range of information including tissue location, developmental stage or basic topology. Individual networks can be viewed using myGRV, a tool focused on displaying developmental networks, or exported in a range of formats compatible with third party tools. Networks can also be analysed for the presence of common network motifs. We demonstrate the capabilities of myGRN using a network of zebrafish interactions integrated with expression data from the zebrafish database, ZFIN.
Here we are launching myGRN as a community-based repository for interaction networks, with a specific focus on developmental networks. We plan to extend its functionality, as well as use it to study networks involved in embryonic development in the future.
The goals of our analysis were to map functional loci, which contribute to the case-control status of a trait of interest, using large pedigrees. We used logistic regression fitted with the generalized estimation equation to test associations between a dichotomous phenotype and all genotyped common and rare single-nucleotide polymorphisms. In addition to the association study, we also developed and applied a simple and fast identical-by-descent-based test to identify loci that were shared among affected individuals more often than expected by chance. Among the top significant loci, we assessed the statistical power and the false discovery rate of both methods. We also demonstrated that family-based studies, compared with the standard population-based association studies, have great values and advantages for the discovery of multiple rare causal variants.
Complex diseases are presumed to be the results of interactions of several genes and environmental factors, with each gene only having a small effect on the disease. Thus, the methods that can account for gene-gene interactions to search for a set of marker loci in different genes or across genome and to analyze these loci jointly are critical. In this article, we propose an ensemble learning approach (ELA) to detect a set of loci whose main and interaction effects jointly have a significant association with the trait. In the ELA, we first search for “base learners” and then combine the effects of the base learners by a linear model. Each base learner represents a main effect or an interaction effect. The result of the ELA is easy to interpret. When the ELA is applied to analyze a data set, we can get a final model, an overall P-value of the association test between the set of loci involved in the final model and the trait, and an importance measure for each base learner and each marker involved in the final model. The final model is a linear combination of some base learners. We know which base learner represents a main effect and which one represents an interaction effect. The importance measure of each base learner or marker can tell us the relative importance of the base learner or marker in the final model. We used intensive simulation studies as well as a real data set to evaluate the performance of the ELA. Our simulation studies demonstrated that the ELA is more powerful than the single-marker test in all the simulation scenarios. The ELA also outperformed the other three existing multi-locus methods in almost all cases. In an application to a large-scale case-control study for Type 2 diabetes, the ELA identified 11 single nucleotide polymorphisms that have a significant multi-locus effect (P-value = 0.01), while none of the single nucleotide polymorphisms showed significant marginal effects and none of the two-locus combinations showed significant two-locus interaction effects.
epistasis; association study; complex disease; Type 2 diabetes
The Genetic Analysis Workshop 15 Problem 3 simulated rheumatoid arthritis data set provided 100 replicates of simulated single-nucleotide polymorphism (SNP) and covariate data sets for 1500 families with an affected sib pair and 2000 controls, modeled after real rheumatoid arthritis data. The data generation model included nine unobserved trait loci, most of which have one or more of the generated SNPs associated with them. These data sets provide an ideal experimental test bed for evaluating new and old algorithms for selecting SNPs and covariates that can separate cases from controls, because the cases and controls are known as well as the identities of the trait loci. LASSO-Patternsearch is a new multi-step algorithm with a LASSO-type penalized likelihood method at its core specifically designed to detect and model interactions between important predictor variables. In this article the original LASSO-Patternsearch algorithm is modified to handle the large number of SNPs plus covariates. We start with a screen step within the framework of parametric logistic regression. The patterns that survived the screen step were further selected by a penalized logistic regression with the LASSO penalty. And finally, a parametric logistic regression model were built on the patterns that survived the LASSO step. In our analysis of Genetic Analysis Workshop 15 Problem 3 data we have identified most of the associated SNPs and relevant covariates. Upon using the model as a classifier, very competitive error rates were obtained.
The development of most autoimmune diseases includes a strong heritable component. This genetic contribution to disease ranges from simple Mendelian inheritance of causative alleles to the complex interactions of multiple weak loci influencing risk. The genetic variants responsible for disease are being discovered through a range of strategies from linkage studies to genome-wide association studies. Despite the rapid advances in genetic analysis, substantial components of the heritable risk remain unexplained, either owing to the contribution of an as-yet unidentified, “hidden,” component of risk, or through the underappreciated effects of known risk loci. Surprisingly, despite the variation in genetic control, a great deal of conservation appears in the biological processes influenced by risk alleles, with several key immunological pathways being modified in autoimmune diseases covering a broad spectrum of clinical manifestations. The primary translational potential of this knowledge is in the rational design of new therapeutics to exploit the role of these key pathways in influencing disease. With significant further advances in understanding the genetic risk factors and their biological mechanisms, the possibility of genetically tailored (or “personalized”) therapy may be realized.
Genetic risk variants associated with various autoimmune diseases, from rheumatoid arthritis to type 1 diabetes, are often similar. For example, most autoimmune diseases, if not all, are associated with the human leukocyte antigen (HLA) region.
We introduce an innovative multilocus test for disease association. It is an extension of an existing score test that gains power over alternative methods by incorporating a parsimonious one-degree-of-freedom model for interaction. We use our method in applications designed to detect interactions that generate hypotheses about the functionality of prostate cancer (PRCA) susceptibility regions.
Our proposed score test is designed to gain additional power through the use of a retrospective likelihood that exploits an assumption of independence between unlinked loci in the underlying population. Its performance is validated through simulation. The method is used in conditional scans with data from stage II of the Cancer Genetic Markers of Susceptibility PRCA genome-wide association study.
Our proposed method increases power to detect susceptibility loci in diverse settings. It identified two high-ranking, biologically interesting interactions: (1) rs748120 of NR2C2 and subregions of 8q24 that contain independent susceptibility loci specific to PRCA and (2) rs4810671 of SULF2 and both JAZF1 and HNF1B that are associated with PRCA and type 2 diabetes.
Our score test is a promising multilocus tool for genetic epidemiology. The results of our applications suggest functionality for poorly understood PRCA susceptibility regions. They motivate replication study.
Gene-gene interaction; Score test; Prostate cancer
Complex diseases are multifactorial in nature and can involve multiple loci with gene × gene and gene × environment interactions. Research on methods to uncover the interactions between those genes that confer susceptibility to disease has been extensive, but many of these methods have only been developed for sibling pairs or sibships. In this report, we assess the performance of two methods for finding gene × gene interactions that are applicable to arbitrarily sized pedigrees, one based on correlation in per-family nonparametric linkage scores and another that incorporates candidate loci genotypes as covariates into an affected relative pair linkage analysis. The power and type I error rate of both of these methods was addressed using the simulated Genetic Analysis Workshop 14 data. In general, we found detection of the interacting loci to be a difficult problem, and though we experienced some modest success there is a clear need to continue developing new methods and approaches to the problem.
Genome-wide association studies (GWAS) test hundreds of thousands of single-nucleotide polymorphisms (SNPs) for association to a trait, treating each marker equally and ignoring prior evidence of association to specific regions. Typically, promising regions are selected for further investigation based on p-values obtained from simple tests of association. However, loci that exert only a weak, low-penetrant role on the trait, producing modest evidence of association, are not detectable in the context of a GWAS. Implementing prior knowledge of association in GWAS could increase power, help distinguish between false and true positives, and identify better sets of SNPs for follow-up studies.
Here we performed a GWAS on rheumatoid arthritis (RA) patients and controls (Problem 1, Genetic Analysis Workshop 16). In order to include prior information in the analysis, we applied four methods that distinctively deal with markers in candidate genes in the context of GWAS. SNPs were divided into a random and a candidate subset, then we applied empirical correction by permutation, false-discovery rate, false-positive report probability, and posterior odds of association using different prior probabilities. We repeated the same analyses on two different sets of candidate markers defined on the basis of previously reported association to RA following two different approaches. The four methods showed similar relative behavior when applied to the two sets, with the proportion of candidate SNPs ranked among the top 2,000 varying from 0 to 100%. The use of different prior probabilities changed the stringency of the methods, but not their relative performance.
Complex phenotypes are known to be associated with interactions among genetic factors. A growing body of evidence suggests that gene–gene interactions contribute to many common human diseases. Identifying potential interactions of multiple polymorphisms thus may be important to understand the biology and biochemical processes of the disease etiology. However, despite the great success of genome-wide association studies that mostly focus on single locus analysis, it is challenging to detect these interactions, especially when the marginal effects of the susceptible loci are weak and/or they involve several genetic factors. Here we describe a Bayesian classification tree model to detect such interactions in case-control association studies. We show that this method has the potential to uncover interactions involving polymorphisms showing weak to moderate marginal effects as well as multi-factorial interactions involving more than two loci.
Epistasis; GWAS; Bayesian CART; MCMC; logistic regression; Crohn’s disease
Detection of epistatic interaction between loci has been postulated to provide a more in-depth understanding of the complex biological and biochemical pathways underlying human diseases. Studying the interaction between two loci is the natural progression following traditional and well-established single locus analysis. However, the added costs and time duration required for the computation involved have thus far deterred researchers from pursuing a genome-wide analysis of epistasis. In this paper, we propose a method allowing such analysis to be conducted very rapidly. The method, dubbed EPIBLASTER, is applicable to case–control studies and consists of a two-step process in which the difference in Pearson's correlation coefficients is computed between controls and cases across all possible SNP pairs as an indication of significant interaction warranting further analysis. For the subset of interactions deemed potentially significant, a second-stage analysis is performed using the likelihood ratio test from the logistic regression to obtain the P-value for the estimated coefficients of the individual effects and the interaction term. The algorithm is implemented using the parallel computational capability of commercially available graphical processing units to greatly reduce the computation time involved. In the current setup and example data sets (211 cases, 222 controls, 299468 SNPs; and 601 cases, 825 controls, 291095 SNPs), this coefficient evaluation stage can be completed in roughly 1 day. Our method allows for exhaustive and rapid detection of significant SNP pair interactions without imposing significant marginal effects of the single loci involved in the pair.
Epistasis; genome-wide interaction analysis; graphical processing unit
Following the identification of several disease-associated polymorphisms by whole genome association analysis, interest is now focussing on the detection of effects that, due to their interaction with other genetic (or environmental) factors, may not be identified by using standard single-locus tests. In addition to increasing power to detect association, there is also a hope detecting interactions between loci will allow us to elucidate the biological and biochemical pathways underpinning disease. Here I provide a critical survey of the current methodological approaches (and related software packages) used to detect interactions between genetic loci that contribute to human genetic disease. I also discuss the difficulties in determining the biologcal relevance of statistical interactions.