Protein-Protein Interactions (PPIs) play important roles in many biological functions. Protein domains, which are defined as independently folding structural blocks of proteins, physically interact with each other to perform these biological functions. Therefore, the identification of Domain-Domain Interactions (DDIs) is of great biological interest because it is generally accepted that PPIs are mediated by DDIs. As a result, much effort has been put into the computational prediction of domain pair interactions. Many DDI prediction tools using PPI networks and domain evolution information have been reported. However, tools that combine the primary sequences, domain annotations, and structural annotations of proteins have not been evaluated before.
In this study, we report a novel approach called Gram-bAsed Interaction Analysis (GAIA). GAIA extracts peptide segments composed of a fixed number of contiguous amino acids, called n-grams (where n is the number of amino acids), from the annotated domain and DDI data set in Saccharomyces cerevisiae (budding yeast) and identifies a list of n-grams that may contribute to DDIs and PPIs based on the frequencies of their appearance. GAIA also reports the coordinate position of gram pairs on each interacting domain pair. We demonstrate that our approach improves on other DDI prediction approaches when tested against a gold-standard data set and achieves a true positive rate of 82% and a false positive rate of 21%. We also identify a list of 4-gram pairs that are significantly over-represented in the DDI data set and may mediate PPIs.
GAIA represents a novel and reliable way to predict DDIs that mediate PPIs. Our results, which show the localizations of interacting grams/hotspots, provide testable hypotheses for experimental validation. Complemented with other prediction methods, this study will allow us to elucidate the interactome of cells.
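The n-gram extraction step described above can be illustrated with a minimal sketch. It is not the GAIA implementation: the function names, the toy sequences, and the choice to count each gram pair once per interacting domain pair are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def ngrams(sequence, n=4):
    """Return every contiguous n-gram with its 0-based start position."""
    return [(sequence[i:i + n], i) for i in range(len(sequence) - n + 1)]


def count_gram_pairs(domain_pairs, n=4):
    """Count in how many domain pairs each (gram_a, gram_b) combination occurs."""
    counts = Counter()
    for seq_a, seq_b in domain_pairs:
        grams_a = {gram for gram, _ in ngrams(seq_a, n)}
        grams_b = {gram for gram, _ in ngrams(seq_b, n)}
        # Assumption: a gram pair is counted once per interacting domain pair.
        counts.update(product(grams_a, grams_b))
    return counts


# Toy sequences invented for illustration; not taken from any real data set.
interacting = [("MKVLAT", "GKVLSA"), ("AKVLAT", "TKVLSA")]
pair_counts = count_gram_pairs(interacting, n=4)
```

In this toy input the pair ("KVLA", "KVLS") appears in both domain pairs. Deciding whether such a pair is significantly over-represented would require a background frequency model, which this sketch does not attempt.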
We define the Gaia system of life and its environment on Earth, review the status of the Gaia theory, introduce potentially relevant concepts from complexity theory, then try to apply them to Gaia. We consider whether Gaia is a complex adaptive system (CAS) in terms of its behaviour and suggest that the system is self-organizing but does not reside in a critical state. Gaia has supported abundant life for most of the last 3.8 Gyr. Large perturbations have occasionally suppressed life but the system has always recovered without losing the capacity for large-scale free energy capture and recycling of essential elements. To illustrate how complexity theory can help us understand the emergence of planetary-scale order, we present a simple cellular automata (CA) model of the imaginary planet Daisyworld. This exhibits emergent self-regulation as a consequence of feedback coupling between life and its environment. Local spatial interaction, which was absent from the original model, can destabilize the system by generating bifurcation regimes. Variation and natural selection tend to remove this instability. With mutation in the model system, it exhibits self-organizing adaptive behaviour in its response to forcing. We close by suggesting how artificial life ('Alife') techniques may enable more comprehensive feasibility tests of Gaia.
Motivation: Increasing use of structural modeling for understanding structure–function relationships in proteins has led to the need to ensure that the protein models being used are of acceptable quality. Quality of a given protein structure can be assessed by comparing various intrinsic structural properties of the protein to those observed in high-resolution protein structures.
Results: In this study, we present tools to compare a given structure to high-resolution crystal structures. We assess packing by calculating the total void volume, the percentage of unsatisfied hydrogen bonds, the number of steric clashes and the scaling of the accessible surface area. We assess covalent geometry by determining bond lengths, angles, dihedrals and rotamers. The statistical parameters for the above measures, obtained from high-resolution crystal structures, enable us to provide a quality score that points to specific areas where a given protein structural model needs improvement.
Availability and Implementation: We provide these tools that appraise protein structures in the form of a web server Gaia (http://chiron.dokhlab.org). Gaia evaluates the packing and covalent geometry of a given protein structure and provides quantitative comparison of the given structure to high-resolution crystal structures.
Supplementary information: Supplementary data are available at Bioinformatics online.
In the search for genetic determinants of complex disease, two approaches to association analysis are most often employed: testing single loci, or testing a small group of loci jointly via haplotypes, for their relationship to disease status. It is still debatable which of these approaches is more favourable, and under what conditions. The former has the advantage of simplicity but suffers severely when alleles at the tested loci are not in linkage disequilibrium (LD) with liability alleles; the latter should capture more of the signal encoded in LD, but is far from simple. The complexity of haplotype analysis could be especially troublesome for association scans over large genomic regions, which are, in fact, becoming the standard design. For these reasons, the authors have been evaluating statistical methods that bridge the gap between single-locus and haplotype-based tests. In this article, they present one such method, which uses non-parametric regression techniques embodied by Bayesian adaptive regression splines (BARS). For a set of markers falling within a common genomic region and a corresponding set of single-locus association statistics, the BARS procedure integrates these results into a single test by examining the class of smooth curves consistent with the data. The non-parametric BARS procedure generally finds no signal when no liability allele exists in the tested region (i.e., it achieves the specified size of the test) and it is sensitive enough to pick up signals when a liability allele is present. The BARS procedure provides a robust and potentially powerful alternative to classical tests of association, diminishes the multiple testing problem inherent in those tests and can be applied to a wide range of data types, including genotype frequencies estimated from pooled samples.
association study; adaptive regression splines; complex disease; genome scan; linkage disequilibrium (LD); non-parametric regression
Biological processes are regulated by complex interactions between transcription factors and signalling molecules, collectively described as Genetic Regulatory Networks (GRNs). The characterisation of these networks to reveal regulatory mechanisms is a long-term goal of many laboratories. However, compiling, visualising and interacting with such networks is non-trivial. Current tools and databases typically focus on GRNs within simple, single-celled organisms. However, data describing regulatory interactions in multi-cellular organisms are available in the literature, although not in any systematic form. This is particularly true within the field of developmental biology, where regulatory interactions should also be tagged with information about the time and anatomical location of development in which they occur.
We have developed myGRN, a web application for storing and interrogating interaction data, with an emphasis on developmental processes. Users can submit interaction and gene expression data, either curated from published sources or derived from their own unpublished data. All interactions associated with publications are publicly visible, and unpublished interactions can only be shared between collaborating labs prior to publication. Users can group interactions into discrete networks based on specific biological processes. Various filters allow dynamic production of network diagrams based on a range of information including tissue location, developmental stage or basic topology. Individual networks can be viewed using myGRV, a tool focused on displaying developmental networks, or exported in a range of formats compatible with third party tools. Networks can also be analysed for the presence of common network motifs. We demonstrate the capabilities of myGRN using a network of zebrafish interactions integrated with expression data from the zebrafish database, ZFIN.
Here we are launching myGRN as a community-based repository for interaction networks, with a specific focus on developmental networks. We plan to extend its functionality, as well as use it to study networks involved in embryonic development in the future.
Complex diseases are presumed to be the results of interactions of several genes and environmental factors, with each gene only having a small effect on the disease. Thus, methods that can account for gene-gene interactions, search for a set of marker loci in different genes or across the genome, and analyze these loci jointly are critical. In this article, we propose an ensemble learning approach (ELA) to detect a set of loci whose main and interaction effects jointly have a significant association with the trait. In the ELA, we first search for “base learners” and then combine the effects of the base learners by a linear model. Each base learner represents a main effect or an interaction effect. The result of the ELA is easy to interpret. When the ELA is applied to analyze a data set, we can get a final model, an overall P-value of the association test between the set of loci involved in the final model and the trait, and an importance measure for each base learner and each marker involved in the final model. The final model is a linear combination of some base learners. We know which base learner represents a main effect and which one represents an interaction effect. The importance measure of each base learner or marker can tell us the relative importance of the base learner or marker in the final model. We used intensive simulation studies as well as a real data set to evaluate the performance of the ELA. Our simulation studies demonstrated that the ELA is more powerful than the single-marker test in all the simulation scenarios. The ELA also outperformed the other three existing multi-locus methods in almost all cases.
In an application to a large-scale case-control study for Type 2 diabetes, the ELA identified 11 single nucleotide polymorphisms that have a significant multi-locus effect (P-value = 0.01), while none of the single nucleotide polymorphisms showed significant marginal effects and none of the two-locus combinations showed significant two-locus interaction effects.
epistasis; association study; complex disease; Type 2 diabetes
The Cochran–Armitage trend test (CATT) is well suited for testing association between a marker and a disease in case–control studies. When the underlying genetic model for the disease is known, the CATT optimal for the genetic model is used. For complex diseases, however, the genetic models of the true disease loci are unknown. In this situation, robust tests are preferable. We propose a two-phase analysis with model selection for the case–control design. In the first phase, we use the difference of Hardy–Weinberg disequilibrium coefficients between the cases and the controls for model selection. Then, an optimal CATT corresponding to the selected model is used for testing association. The correlation of the statistics used for selection and the test for association is derived to adjust the two-phase analysis with control of the Type-I error rate. The simulation studies show that this new approach has greater efficiency robustness than the existing methods.
Cochran–Armitage trend test; Disease risk; Efficiency robustness; Hardy–Weinberg disequilibrium; SNP
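The two ingredients named in the abstract above, the CATT and the Hardy–Weinberg disequilibrium coefficient, can be sketched as follows. This is an illustrative sketch only: the function names and genotype counts are hypothetical, and the model-selection rule and the correlation-adjusted critical values of the two-phase procedure are not reproduced because the abstract does not specify them.

```python
import math


def catt_z(cases, controls, x=0.5):
    """Cochran-Armitage trend statistic for genotype counts (aa, Aa, AA).

    x is the heterozygote score: 0 (recessive), 0.5 (additive), 1 (dominant).
    """
    scores = (0.0, x, 1.0)
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    col_totals = [r + s for r, s in zip(cases, controls)]
    # Score-weighted difference between case and control genotype counts.
    u = sum(w * (n_controls * r - n_cases * s)
            for w, r, s in zip(scores, cases, controls))
    spread = (n_total * sum(w * w * m for w, m in zip(scores, col_totals))
              - sum(w * m for w, m in zip(scores, col_totals)) ** 2)
    variance = n_cases * n_controls * spread / n_total
    return u / math.sqrt(variance)


def hwd_coefficient(counts):
    """Hardy-Weinberg disequilibrium coefficient P(AA) - P(A)^2 for (aa, Aa, AA)."""
    total = sum(counts)
    p_hom = counts[2] / total
    p_allele = (counts[2] + 0.5 * counts[1]) / total
    return p_hom - p_allele ** 2


# Hypothetical genotype counts (aa, Aa, AA), for illustration only.
cases, controls = (20, 50, 30), (30, 50, 20)
z_additive = catt_z(cases, controls, x=0.5)
hwd_difference = hwd_coefficient(cases) - hwd_coefficient(controls)
```

With these toy counts the additive statistic works out to 2.0. In the two-phase design, the sign and size of the standardized HWD difference would guide which score x is used in the second phase.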
We introduce an innovative multilocus test for disease association. It is an extension of an existing score test that gains power over alternative methods by incorporating a parsimonious one-degree-of-freedom model for interaction. We use our method in applications designed to detect interactions that generate hypotheses about the functionality of prostate cancer (PRCA) susceptibility regions.
Our proposed score test is designed to gain additional power through the use of a retrospective likelihood that exploits an assumption of independence between unlinked loci in the underlying population. Its performance is validated through simulation. The method is used in conditional scans with data from stage II of the Cancer Genetic Markers of Susceptibility PRCA genome-wide association study.
Our proposed method increases power to detect susceptibility loci in diverse settings. It identified two high-ranking, biologically interesting interactions: (1) rs748120 of NR2C2 and subregions of 8q24 that contain independent susceptibility loci specific to PRCA and (2) rs4810671 of SULF2 and both JAZF1 and HNF1B that are associated with PRCA and type 2 diabetes.
Our score test is a promising multilocus tool for genetic epidemiology. The results of our applications suggest functionality for poorly understood PRCA susceptibility regions and motivate replication studies.
Gene-gene interaction; Score test; Prostate cancer
Multiple prostate cancer (PCa) risk-related loci have been discovered by genome-wide association studies (GWAS) based on case–control designs. However, GWAS findings may be confounded by population stratification if cases and controls are inadvertently drawn from different genetic backgrounds. In addition, since these loci were identified in cases with predominantly sporadic disease, little is known about their relationships with hereditary prostate cancer (HPC). The association of seventeen reported PCa susceptibility loci with HPC was evaluated with a family-based association test using 1,979 hereditary PCa families of European descent collected by members of the International Consortium for Prostate Cancer Genetics, with a total of 5,730 affected men. The risk alleles for 8 of the 17 loci were significantly over-transmitted from parents to affected offspring, including SNPs residing in 8q24 (regions 1, 2 and 3), 10q11, 11q13, 17q12 (region 1), 17q24 and Xp11. In subgroup analyses, three loci, at 8q24 (regions 1 and 2) plus 17q12, were significantly over-transmitted in hereditary PCa families with five or more affected members, while loci at 3p12, 8q24 (region 2), 11q13, 17q12 (region 1), 17q24 and Xp11 were significantly over-transmitted in HPC families with an average age of diagnosis at 65 years or less. Our results indicate that at least a subset of PCa risk-related loci identified by case–control GWAS are also associated with disease risk in HPC families.
Complex diseases are multifactorial in nature and can involve multiple loci with gene × gene and gene × environment interactions. Research on methods to uncover the interactions between those genes that confer susceptibility to disease has been extensive, but many of these methods have only been developed for sibling pairs or sibships. In this report, we assess the performance of two methods for finding gene × gene interactions that are applicable to arbitrarily sized pedigrees, one based on correlation in per-family nonparametric linkage scores and another that incorporates candidate loci genotypes as covariates into an affected relative pair linkage analysis. The power and type I error rate of both of these methods were assessed using the simulated Genetic Analysis Workshop 14 data. In general, we found detection of the interacting loci to be a difficult problem, and though we experienced some modest success there is a clear need to continue developing new methods and approaches to the problem.
The goals of our analysis were to map functional loci, which contribute to the case-control status of a trait of interest, using large pedigrees. We used logistic regression fitted with the generalized estimation equation to test associations between a dichotomous phenotype and all genotyped common and rare single-nucleotide polymorphisms. In addition to the association study, we also developed and applied a simple and fast identical-by-descent-based test to identify loci that were shared among affected individuals more often than expected by chance. Among the top significant loci, we assessed the statistical power and the false discovery rate of both methods. We also demonstrated that family-based studies, compared with the standard population-based association studies, offer considerable value and advantages for the discovery of multiple rare causal variants.
Following the identification of several disease-associated polymorphisms by whole genome association analysis, interest is now focussing on the detection of effects that, due to their interaction with other genetic (or environmental) factors, may not be identified by using standard single-locus tests. In addition to increasing power to detect association, there is also hope that detecting interactions between loci will allow us to elucidate the biological and biochemical pathways underpinning disease. Here I provide a critical survey of the current methodological approaches (and related software packages) used to detect interactions between genetic loci that contribute to human genetic disease. I also discuss the difficulties in determining the biological relevance of statistical interactions.
The genetic etiology of autism is heterogeneous. Multiple disorders share genotypic and phenotypic traits with autism. Network based cross-disorder analysis can aid in the understanding and characterization of the molecular pathology of autism, but there are few tools that enable us to conduct cross-disorder analysis and to visualize the results.
We have designed Autworks as a web portal to bring together gene interaction and gene-disease association data on autism to enable network construction, visualization, network comparisons with numerous other related neurological conditions and disorders. Users may examine the structure of gene interactions within a set of disorder-associated genes, compare networks of disorder/disease genes with those of other disorders/diseases, and upload their own sets for comparative analysis.
Autworks is a web application that provides an easy-to-use resource for researchers of varied backgrounds to analyze the autism gene network structure within and between disorders.
Autism; Autistic disorder; Autism spectrum disorders; Autism genetics; Autism genomics; Network biology; Network medicine; Translational bioinformatics; Protein-protein interactions
Hundreds of new loci have been discovered by genome-wide association studies of human traits. These studies mostly focused on associations between a single locus and a trait. Interactions between genes and between genes and environmental factors are of interest as they can improve our understanding of the genetic background underlying complex traits. Genome-wide testing of complex genetic models is a computationally demanding task. Moreover, testing of such models leads to multiple comparison problems that reduce the probability of new findings. Assuming that the genetic model underlying a complex trait can include hundreds of genes and environmental factors, testing of these models in genome-wide association studies presents substantial difficulties.
We and Pare and colleagues (2010) developed a method that overcomes such difficulties. The method is based on the fact that loci which are involved in interactions can show genotypic variance heterogeneity of a trait. Genome-wide testing of such heterogeneity can be a fast scanning approach which can point to the interacting genetic variants.
In this work we present a new method, SVLM, which allows variance heterogeneity analysis of imputed genetic variation. Type I error and power of this test are investigated and contrasted with those of Levene's test. We also present an R package, VariABEL, implementing existing and newly developed tests.
Variance heterogeneity analysis is a promising method for detection of potentially interacting loci. The new method and software package developed in this work will facilitate such analysis in a genome-wide context.
single-nucleotide polymorphisms (SNPs); genome-wide association (GWA); gene-environment interactions (GxE); gene-gene interactions (GxG); variance heterogeneity; environmental sensitivity; VariABEL; the GenABEL project
Driven by advances in molecular genetic technologies and statistical analysis methodologies, there have been huge strides taken in dissecting the complex genetic basis of many inflammatory dermatoses. One example is psoriasis where application of classical linkage analysis and genome wide association investigation has identified genetic loci of major and minor effect. Although most loci independently have modest genetic effect, they identify important biological pathways potentially relevant to disease pathogenesis and therapeutic intervention. In the case of psoriasis these appear to involve the epidermal barrier, NF-κB mechanisms and Th17 adaptive immune responses. The advent of next generation sequencing methods will permit a more detailed and complete map of disease genetic architecture, a key step in developing personalised medicine strategies in the clinical management of the complex inflammatory dermatoses.
Genetic discoveries are validated through the meta-analysis of genome-wide association scans in large international consortia. Because environmental variables may interact with genetic factors, investigation of differing genetic effects for distinct levels of an environmental exposure in these large consortia may yield additional susceptibility loci undetected by main effects analysis. We describe a method of joint meta-analysis of SNP and SNP by Environment (SNP×E) regression coefficients for use in gene-environment interaction studies.
In testing SNP×E interactions, one approach uses a two degree of freedom test to identify genetic variants that influence the trait of interest. This approach detects both main and interaction effects between the trait and the SNP. We propose a method to jointly meta-analyze the SNP and SNP×E coefficients using multivariate generalized least squares. This approach provides confidence intervals of the two estimates, a joint significance test for SNP and SNP×E terms, and a test of homogeneity across samples.
We present a simulation study comparing this method to four other methods of meta-analysis and demonstrate that the joint meta-analysis performs better than the others when both main and interaction effects are present. Additionally, we implemented our methods in a meta-analysis of the association between SNPs from the type 2 diabetes-associated gene PPARG and log-transformed fasting insulin levels and interaction by body mass index in a combined sample of 19,466 individuals from 5 cohorts.
2 degree of freedom meta-analysis; joint meta-analysis; PPARG; Gene-environment interaction meta-analysis
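The joint pooling of SNP and SNP×E coefficients described above can be sketched as a fixed-effect multivariate generalized least squares estimate. This is a minimal sketch under stated assumptions, not the authors' software: the function names and cohort numbers are hypothetical, each cohort is assumed to report a coefficient pair with its 2×2 covariance matrix, and the test of homogeneity across samples is omitted.

```python
import math


def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def joint_meta(estimates, covariances):
    """Pool (beta_snp, beta_interaction) pairs using inverse-covariance weights."""
    w_sum = [[0.0, 0.0], [0.0, 0.0]]
    wb_sum = [0.0, 0.0]
    for b, v in zip(estimates, covariances):
        w = inv2(v)
        for i in range(2):
            for j in range(2):
                w_sum[i][j] += w[i][j]
                wb_sum[i] += w[i][j] * b[j]
    cov = inv2(w_sum)  # covariance matrix of the pooled estimate
    beta = [cov[i][0] * wb_sum[0] + cov[i][1] * wb_sum[1] for i in range(2)]
    # Joint 2-df Wald statistic beta' * cov^-1 * beta, using cov^-1 * beta = wb_sum.
    chi2 = beta[0] * wb_sum[0] + beta[1] * wb_sum[1]
    p_value = math.exp(-chi2 / 2)  # chi-square survival function for 2 df
    return beta, cov, chi2, p_value


# Hypothetical cohort results, for illustration only.
estimates = [(0.2, 0.1), (0.2, 0.1)]
covariances = [[[0.01, 0.0], [0.0, 0.04]], [[0.01, 0.0], [0.0, 0.04]]]
beta, cov, chi2, p_value = joint_meta(estimates, covariances)
```

Confidence intervals for the two pooled coefficients follow from the diagonal of `cov`. The closed-form p-value applies only because the joint test has exactly two degrees of freedom.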
The Genetic Analysis Workshop 15 Problem 3 simulated rheumatoid arthritis data set provided 100 replicates of simulated single-nucleotide polymorphism (SNP) and covariate data sets for 1500 families with an affected sib pair and 2000 controls, modeled after real rheumatoid arthritis data. The data generation model included nine unobserved trait loci, most of which have one or more of the generated SNPs associated with them. These data sets provide an ideal experimental test bed for evaluating new and old algorithms for selecting SNPs and covariates that can separate cases from controls, because the cases and controls are known as well as the identities of the trait loci. LASSO-Patternsearch is a new multi-step algorithm with a LASSO-type penalized likelihood method at its core specifically designed to detect and model interactions between important predictor variables. In this article the original LASSO-Patternsearch algorithm is modified to handle the large number of SNPs plus covariates. We start with a screen step within the framework of parametric logistic regression. The patterns that survived the screen step were further selected by a penalized logistic regression with the LASSO penalty. Finally, a parametric logistic regression model was built on the patterns that survived the LASSO step. In our analysis of Genetic Analysis Workshop 15 Problem 3 data we have identified most of the associated SNPs and relevant covariates. Upon using the model as a classifier, very competitive error rates were obtained.
Dissecting the genetic architecture of fitness-related traits in wild populations is key to understanding evolution and the mechanisms maintaining adaptive genetic variation. We took advantage of a recently developed genetic linkage map and phenotypic information from wild pedigreed individuals from Ram Mountain, Alberta, Canada, to study the genetic architecture of ecologically important traits (horn volume, length, base circumference and body mass) in bighorn sheep. In addition to estimating sex-specific and cross-sex quantitative genetic parameters, we tested for the presence of quantitative trait loci (QTLs), colocalization of QTLs between bighorn sheep and domestic sheep, and sex × QTL interactions. All traits showed significant additive genetic variance and genetic correlations tended to be positive. Linkage analysis based on 241 microsatellite loci typed in 310 pedigreed animals resulted in no significant and five suggestive QTLs (four for horn dimension on chromosomes 1, 18 and 23, and one for body mass on chromosome 26) using genome-wide significance thresholds (Logarithm of odds (LOD) >3.31 and >1.88, respectively). We also confirmed the presence of a horn dimension QTL in bighorn sheep at the only position known to contain a similar QTL in domestic sheep (on chromosome 10 near the horns locus; nominal P<0.01) and highlighted a number of regions potentially containing weight-related QTLs in both species. As expected for sexually dimorphic traits involved in male–male combat, loci with sex-specific effects were detected. This study lays the foundation for future work on adaptive genetic variation and the evolutionary dynamics of sexually dimorphic traits in bighorn sheep.
adaptive variation; animal model; domestic sheep; Ovis aries; sexual dimorphism; sexual selection
Genome-wide association studies (GWAS) test hundreds of thousands of single-nucleotide polymorphisms (SNPs) for association to a trait, treating each marker equally and ignoring prior evidence of association to specific regions. Typically, promising regions are selected for further investigation based on p-values obtained from simple tests of association. However, loci that exert only a weak, low-penetrant role on the trait, producing modest evidence of association, are not detectable in the context of a GWAS. Implementing prior knowledge of association in GWAS could increase power, help distinguish between false and true positives, and identify better sets of SNPs for follow-up studies.
Here we performed a GWAS on rheumatoid arthritis (RA) patients and controls (Problem 1, Genetic Analysis Workshop 16). In order to include prior information in the analysis, we applied four methods that distinctively deal with markers in candidate genes in the context of GWAS. SNPs were divided into a random and a candidate subset, then we applied empirical correction by permutation, false-discovery rate, false-positive report probability, and posterior odds of association using different prior probabilities. We repeated the same analyses on two different sets of candidate markers defined on the basis of previously reported association to RA following two different approaches. The four methods showed similar relative behavior when applied to the two sets, with the proportion of candidate SNPs ranked among the top 2,000 varying from 0 to 100%. The use of different prior probabilities changed the stringency of the methods, but not their relative performance.
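Two of the four approaches named above, the false-discovery rate and the false-positive report probability (FPRP), can be sketched briefly. The sketch makes several assumptions that the abstract does not confirm: the Benjamini-Hochberg procedure is used for the false-discovery rate, the FPRP follows the standard formula with the p-value in place of the significance level, and all function names, p-values, priors and power values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fprp(p_value, prior, power):
    """False-positive report probability for a given prior probability and power."""
    false_pos = p_value * (1 - prior)
    return false_pos / (false_pos + power * prior)


# Hypothetical values, for illustration only.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
fprp_random = fprp(0.001, prior=0.01, power=0.8)     # SNP from the random subset
fprp_candidate = fprp(0.001, prior=0.10, power=0.8)  # SNP from the candidate subset
```

The same p-value yields a lower FPRP under the larger prior, which is how prior evidence for candidate markers can change their ranking relative to randomly chosen SNPs.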