
1.  Niche adaptation by expansion and reprogramming of general transcription factors 
Experimental analysis of TFB family proteins in a halophilic archaeon reveals complex environment-dependent fitness contributions. Gene conversion events among these proteins can generate novel niche adaptation capabilities, a process that may have contributed to archaeal adaptation to extreme environments.
Evolution of archaeal lineages correlates with duplication events in the TFB family.
Each TFB is required for adaptation to multiple environments.
The relative fitness contributions of TFBs change with environmental context.
Changes in the regulation of duplicated TFBs can generate new adaptation capabilities.
The evolutionary success of an organism depends on its ability to continually adapt to changes in the patterns of constant, periodic, and transient challenges within its environment. This process of ‘niche adaptation’ requires reprogramming of the organism's environmental response networks by reorganizing interactions among diverse parts including environmental sensors, signal transducers, and transcriptional and post-transcriptional regulators. Gene duplication has emerged as one of the principal strategies in this process, especially for reprogramming of gene regulatory networks (GRNs). Whereas eukaryotes require dozens of factors for recruitment of RNA polymerase, archaea require just two general transcription factors (GTFs) that are orthologous to eukaryotic TFIIB (TFB in archaea) and TATA-binding protein (TBP) (Bell et al, 1998). Both of these GTFs have expanded extensively in nearly 50% of all archaea whose genomes have been fully sequenced. The phylogenetic analysis presented in this study reveals lineage-specific expansions of TFBs, suggesting that they might encode functionally specialized gene regulatory programs for the unique environments to which these organisms have adapted. This hypothesis is particularly appealing when we consider that the greatest expansion is observed within the group of halophilic archaea, whose habitats are associated with routine and dynamic changes in a number of environmental factors including light, temperature, oxygen, salinity, and ionic composition (Rodriguez-Valera, 1993; Litchfield, 1998).
We have previously demonstrated that variations in the expanded set of TFBs (a through e) in Halobacterium salinarum NRC-1 manifest at the level of physical interactions within and across the two families, their DNA-binding specificity, their differential regulation in varying environments, and, ultimately, in the large-scale segregation of transcription of all genes into overlapping yet distinct sets of functionally related groups (Facciotti et al, 2007). We have extended findings from this earlier study with a systematic survey of the fitness consequences of perturbing the TFB network of H. salinarum NRC-1 across 17 environments. Notably, each TFB conferred fitness in two or more of the environmental conditions tested, and the relative fitness contributions (see Table I) of the five TFBs varied significantly by environment. From an evolutionary perspective, the relationships among these fitness landscapes reveal that two classes of TFBs (c/g- and f-type) appear to have played an important role in the evolution of halophilic archaea by overseeing regulation of core physiological capabilities in these organisms. TFBs of the other clades (b/d and a/e) seem to have emerged much more recently through gene duplications or horizontal gene transfers (HGTs) and are being utilized for adaptation to specialized environmental conditions.
We also investigated higher-order functional interactions and relationships among the duplicated TFBs by performing competition experiments and by mapping genetic interactions in different environments. This demonstrated that, depending on environmental context, the TFBs have strikingly different functional hierarchies and genetic interactions with one another. This is remarkable, as it makes each TFB essential, albeit at different times, in a dynamically changing environment.
In order to understand the process by which such gene family expansions shape architecture and functioning of a GRN, we performed integrated analysis of phylogeny, physical interactions, regulation, and fitness landscapes of the seven TFBs in H. salinarum NRC-1. This revealed that evolution of both their protein-coding sequence and their promoter has been instrumental in the encoding of environment-specific regulatory programs. Importantly, the convergent and divergent evolution of regulation and binding properties of TFBs suggested that, aside from HGT and random mutations, a third plausible (and perhaps most interesting) mechanism for acquiring a novel TFB variant is through gene conversion. To test this hypothesis, we synthesized a novel TFBx by transferring TFBa/e clade-specific residues to a TFBd backbone, transformed this variant under the control of either the TFBd or the TFBe promoter (PtfbD or PtfbE) into three different host genetic backgrounds (Δura3 (parent), ΔtfbD, and ΔtfbE), and analyzed fitness and gene expression patterns during growth at 25 and 37°C. This showed that gene conversion events spanning the coding sequence and the promoter, environmental context, and genetic background of the host are all extremely influential in the functional integration of a TFB into the GRN. Importantly, this analysis suggested that altering the regulation of an existing set of expanded TFBs might be an efficient mechanism to reprogram the GRN to rapidly generate novel niche adaptation capability. We have confirmed this experimentally by increasing fitness merely by moving tfbE to PtfbD control, and by generating a completely novel phenotype (biofilm-like appearance) by overexpression of tfbE.
Altogether this study clearly demonstrates that archaea can rapidly generate novel niche adaptation programs by simply altering regulation of duplicated TFBs. This is significant because expansions in the TFB family are widespread in archaea, a class of organisms that not only represents 20% of the biomass on Earth but is also known to have colonized some of the most extreme environments (DeLong and Pace, 2001). This strategy for niche adaptation is further expanded through interactions of the multiple TFBs with members of other expanded TF families such as TBPs (Facciotti et al, 2007) and sequence-specific regulators (e.g. Lrp family (Peeters and Charlier, 2010)). This is analogous to combinatorial solutions for other complex biological problems such as recognition of pathogens by Toll-like receptors (Roach et al, 2005), generation of antibody diversity by V(D)J recombination (Early et al, 1980), and recognition and processing of odors (Malnic et al, 1999).
Numerous lineage-specific expansions of the transcription factor B (TFB) family in archaea suggest an important role for expanded TFBs in encoding environment-specific gene regulatory programs. Given the characteristics of hypersaline lakes, the unusually large number of TFBs in halophilic archaea further suggests that they might be especially important in rapid adaptation to the challenges of a dynamically changing environment. Motivated by these observations, we have investigated the implications of TFB expansions by correlating sequence variations, regulation, and physical interactions of all seven TFBs in Halobacterium salinarum NRC-1 to their fitness landscapes, functional hierarchies, and genetic interactions across 2488 experiments covering combinatorial variations in salt, pH, temperature, and Cu stress. This systems analysis has revealed an elegant scheme in which completely novel fitness landscapes are generated by gene conversion events that introduce subtle changes to the regulation or physical interactions of duplicated TFBs. Based on these insights, we have introduced a synthetically redesigned TFB and altered the regulation of existing TFBs to illustrate how archaea can rapidly generate novel phenotypes by simply reprogramming their TFB regulatory network.
doi:10.1038/msb.2011.87
PMCID: PMC3261711  PMID: 22108796
evolution by gene family expansion; fitness; niche adaptation; reprogramming of gene regulatory network; transcription factor B
2.  Backup without redundancy: genetic interactions reveal the cost of duplicate gene loss 
We show that genetic interaction profiles offer a powerful approach to elicit phenotypes that are far richer than is attainable using single gene deletions. This has allowed us to address the long-standing question of the role played by duplicate genes (paralogs) in robustness against deletion.
We provide the first direct evidence that the capacity of some duplicates to cover for the loss of their paralogs can account for the observed difference in fitness between duplicate and singleton deletion mutants, but that the overall contribution of this effect to dispensability is small.
More broadly, we demonstrate that paralogs possessing apparent backup capacity in some environments have in fact distinct and non-overlapping functions, and are unable to provide backup across a range of compromising conditions. This resolves the previous paradox of how backup genes conferring dispensability can nevertheless be independently maintained in the population.
From a practical point of view, our findings suggest efficient strategies to elicit rich deletion phenotypes that should be highly relevant for the design of future phenotypic screens.
Much of our understanding of biological processes has been derived from the characterization of the functional consequence to an organism of altering one or more of its genes. Efforts to systematically evaluate the phenotypic effects of gene loss, however, have been hampered by the fact that the disruption of most genes has surprisingly modest effects on cell growth and viability. The high proportion of genes with no apparent deletion effect has wide-ranging practical and theoretical implications and has been the subject of considerable interest (Wagner, 2000, 2005; Giaever et al, 2002; Gu et al, 2003; Papp et al, 2004; Kafri et al, 2005). One factor that has been implicated as contributing to the high degree of dispensability is the abundance of closely related paralogs present in most genomes (Winzeler et al, 1999; Wagner, 2000; Giaever et al, 2002). Indeed, recent work in S. cerevisiae has shown that the existence of a paralog elsewhere in the genome significantly increases the chance that deletion of a given gene has little effect on growth (Gu et al, 2003). However, current analyses have been mostly correlative, and direct mechanistic evidence supporting or refuting the role of backup compensation in mutational robustness is still largely missing. Furthermore, backup between duplicates is not easily justified in evolutionary terms, in that a genuine ability to comprehensively cover for the loss of another gene is evolutionarily unstable (Brookfield, 1992).
Here, we exploit the recent availability of high-density quantitative genetic interaction profiles (EMAPs) to address these issues directly. To test whether synthetic sick/lethal (SSL) paralogs can account for the excess fitness of duplicates, we classified genes into fitness categories according to their deletion growth defect (Materials and methods). The subset of genes covered by our combined data set exhibits an over-representation of duplicate genes in the weak/no deletion phenotype (WNP) class similar to that reported previously (Gu et al, 2003) (Figure 1B). Strikingly, this difference corresponds to the number of WNP duplicates that have an SSL interaction with their corresponding paralog (Figure 1C). Our data thus provide direct evidence that it is indeed duplicate compensation that accounts for the observed difference in deletion growth defect between duplicates and singletons, at least for the genes covered by our data set.
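The classification step described here can be illustrated with a toy sketch. All gene names, phenotype classes, and the SSL interaction map below are invented for illustration; this is not the authors' data or pipeline:

```python
# Toy illustration (invented gene names and interactions): among
# weak/no-phenotype (WNP) duplicates, find those with a synthetic
# sick/lethal (SSL) interaction with their own paralog -- the genes
# whose dispensability can be attributed to paralog compensation.
paralog = {"YFG1": "YFG2", "YFG2": "YFG1", "ABC1": "ABC2", "ABC2": "ABC1"}
wnp = {"YFG1", "YFG2", "ABC1"}             # weak/no deletion phenotype
ssl_pairs = {frozenset({"YFG1", "YFG2"})}  # observed SSL interactions

buffered = {g for g in wnp
            if g in paralog and frozenset({g, paralog[g]}) in ssl_pairs}
```

Here `buffered` contains only the WNP genes whose paralog partner is SSL with them, which is the comparison behind Figure 1C in the summary above.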
Apart from the mechanism itself, the characteristic features of buffering duplicates have received considerable attention (Gu et al, 2003; Kafri et al, 2005; Wagner, 2005). Our data allowed us to unambiguously distinguish the subset of duplicates whose dispensability can be attributed to the existence of a backup paralog. The ability to identify backup duplicates directly put us in a position to study their features, and how they differ from other duplicates without buffering properties. In particular, we asked to what extent the observed buffering in rich media reflects functional similarity and a genuine ability to cover for the loss of a paralog in a broader range of conditions.
To assess the extent to which SSL duplicates can provide genuine backup under compromising conditions, we first used genetic interaction profiles as a more stringent test for redundancy that assesses the effect of gene loss in the background of additional gene deletions. In contrast to the expectation that truly buffered duplicates should have few if any synthetic interactions, we find that the number is in fact substantial and often exceeds that of random genes and non-SSL duplicates (Figure 2B). Similarly, using a recent data set of sensitivity profiles of deletion strains to a range of agents and environments (Brown et al, 2006), we find that the deletion of SSL duplicates across a range of environments has on average no weaker (and in fact a slightly stronger) effect on cellular growth rate than that of non-SSL duplicates or random genes. Taken together, these findings suggest that the backup capacity of SSL duplicates is limited and not indicative of a comprehensive ability to cover for the loss of the paralogous partner.
We next tested the degree of functional similarity of buffering duplicates using similarity in genetic interaction as well as environmental sensitivity profiles as indicators of shared functionality (Tong et al, 2004; Schuldiner et al, 2005; Brown et al, 2006; Pan et al, 2006). In spite of their rich media buffering properties, we find that the interaction and sensitivity patterns of most SSL duplicates are divergent and are usually more similar to those of other, non-paralogous genes (Figure 2C and D; Supplementary Figure 10).
Lastly, in addition to our analysis of duplicate phenotypes, we used genetic interaction spectra as deletion phenotypes for genes whose single deletion in standard conditions has little measurable effect. As expected, genetic interactions provide a deletion phenotype for many more genes (80–90%) than single gene deletions in standard growth environments (Steinmetz et al, 2002), which yield a detectable growth defect only for 30–40% (Figure 4B). To assess whether these interactions reflect the cost of gene loss (gene importance), we asked if there is a relationship between the probability of a gene being retained between related species and its number of genetic interactions. Indeed, genetic interactivity exhibits a strong correlation with gene retention across related phyla (Figure 4C and Supplementary Figure 7), and predicts the likelihood of gene loss better than lethality/viability, quantitative growth deficiency or environmental specificity (Supplementary Figure 8). Thus, genetic interactions provide a measure of the cost of gene loss that effectively recapitulates evolutionary constraints. This is further supported by the observation that genetic interactions are significantly correlated with environmental sensitivity across a range of conditions. Thus, our findings suggest that for most genes there is a substantial cost of gene loss, even though this is often not reflected in single gene deletion tests carried out in standard conditions.
Many genes can be deleted with little phenotypic consequence. By what mechanism and to what extent the presence of duplicate genes in the genome contributes to this robustness against deletions has been the subject of considerable interest. Here, we exploit the availability of high-density genetic interaction maps to provide direct support for the role of backup compensation, where functionally overlapping duplicates cover for the loss of their paralog. However, we find that the overall contribution of duplicates to robustness against null mutations is low (∼25%). The ability to directly identify buffering paralogs allowed us to further study their properties, and how they differ from non-buffering duplicates. Using environmental sensitivity profiles as well as quantitative genetic interaction spectra as high-resolution phenotypes, we establish that even duplicate pairs with compensation capacity exhibit rich and typically non-overlapping deletion phenotypes, and are thus unable to comprehensively cover against loss of their paralog. Our findings reconcile the fact that duplicates can compensate for each other's loss under a limited number of conditions with the evolutionary instability of genes whose loss is not associated with a phenotypic penalty.
doi:10.1038/msb4100127
PMCID: PMC1847942  PMID: 17389874
duplication; evolution; genetic interactions; redundancy
3.  A Flexible Bayesian Model for Studying Gene–Environment Interaction 
PLoS Genetics  2012;8(1):e1002482.
An important follow-up step after genetic markers are found to be associated with a disease outcome is a more detailed analysis investigating how the implicated gene or chromosomal region and an established environment risk factor interact to influence the disease risk. The standard approach to this study of gene–environment interaction considers one genetic marker at a time and therefore could misrepresent and underestimate the genetic contribution to the joint effect when one or more functional loci, some of which might not be genotyped, exist in the region and interact with the environment risk factor in a complex way. We develop a more global approach based on a Bayesian model that uses a latent genetic profile variable to capture all of the genetic variation in the entire targeted region and allows the environment effect to vary across different genetic profile categories. We also propose a resampling-based test derived from the developed Bayesian model for the detection of gene–environment interaction. Using data collected in the Environment and Genetics in Lung Cancer Etiology (EAGLE) study, we apply the Bayesian model to evaluate the joint effect of smoking intensity and genetic variants in the 15q25.1 region, which contains a cluster of nicotinic acetylcholine receptor genes and has been shown to be associated with both lung cancer and smoking behavior. We find evidence for gene–environment interaction (P-value = 0.016), with the smoking effect appearing to be stronger in subjects with a genetic profile associated with a higher lung cancer risk; the conventional test of gene–environment interaction based on the single-marker approach is far from significant.
Author Summary
Many common diseases result from a complex interplay of genetic and environmental risk factors. It is important to study the potential genetic and environmental risk factors jointly in order to achieve a better understanding of the mechanisms underlying disease development. The standard single-marker approach that studies the environmental risk factor and one genetic marker at a time could misrepresent the gene–environment interaction, as the single genetic marker might not be an appropriate surrogate for the underlying functional polymorphisms. We propose a method to look at gene–environment interaction at the gene/region level by integrating information observed on multiple genetic markers within the selected gene/region with measures of environmental exposure. Using data collected in the Environment and Genetics in Lung Cancer Etiology (EAGLE) study, we apply the proposed model to evaluate the joint effect of smoking intensity and genetic variants in the 15q25.1 region and find evidence for gene–environment interaction (P-value = 0.016), with the smoking effect varying according to a subject's genetic profile.
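The core idea of letting the environmental effect vary across genetic profile categories can be sketched in a deliberately simplified form. The tertile-based profile below is a crude stand-in for the paper's Bayesian latent profile variable, and all data, effect sizes, and cut-points are invented:

```python
# Deliberately simplistic stand-in for the latent-profile idea (NOT the
# paper's Bayesian model): collapse several markers into an unweighted
# score, cut it into tertiles, and estimate the environmental (smoking)
# slope separately within each profile category. All data are simulated.
import numpy as np

rng = np.random.default_rng(3)
n, m = 600, 5
G = rng.integers(0, 3, size=(n, m)).astype(float)   # marker genotypes
smoking = rng.normal(size=n)                        # exposure measure
score = G.sum(axis=1)                               # crude profile score
profile = np.digitize(score, np.quantile(score, [1/3, 2/3]))  # 0, 1, 2

# simulate an outcome whose smoking effect grows with profile category
y = (0.2 + 0.3 * profile) * smoking + rng.normal(size=n)

# within-category least-squares smoking slope
slopes = [float(np.polyfit(smoking[profile == k], y[profile == k], 1)[0])
          for k in range(3)]
```

A gene–environment interaction in this framing shows up as heterogeneity of the exposure slope across profile categories, which is what the paper's resampling-based test formalizes.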
doi:10.1371/journal.pgen.1002482
PMCID: PMC3266891  PMID: 22291610
4.  Two-Stage Two-Locus Models in Genome-Wide Association 
PLoS Genetics  2006;2(9):e157.
Studies in model organisms suggest that epistasis may play an important role in the etiology of complex diseases and traits in humans. With the era of large-scale genome-wide association studies fast approaching, it is important to quantify whether it will be possible to detect interacting loci using realistic sample sizes in humans and to what extent undetected epistasis will adversely affect power to detect association when single-locus approaches are employed. We therefore investigated the power to detect association for an extensive range of two-locus quantitative trait models that incorporated varying degrees of epistasis. We compared the power to detect association using a single-locus model that ignored interaction effects, a full two-locus model that allowed for interactions, and, most important, two two-stage strategies whereby a subset of loci initially identified using single-locus tests were analyzed using the full two-locus model. Despite the penalty introduced by multiple testing, fitting the full two-locus model performed better than single-locus tests for many of the situations considered, particularly when compared with attempts to detect both individual loci. Using a two-stage strategy reduced the computational burden associated with performing an exhaustive two-locus search across the genome but was not as powerful as the exhaustive search when loci interacted. Two-stage approaches also increased the risk of missing interacting loci that contributed little effect at the margins. Based on our extensive simulations, our results suggest that an exhaustive search involving all pairwise combinations of markers across the genome might provide a useful complement to single-locus scans in identifying interacting loci that contribute to moderate proportions of the phenotypic variance.
Synopsis
Although there is growing appreciation that attempting to map genetic interactions in humans may be a fruitful endeavor, there is no consensus as to the best strategy for their detection, particularly in the case of genome-wide association where the number of potential comparisons is enormous. In this article, the authors compare the performance of four different search strategies to detect loci which interact in genome-wide association—a single-locus search, an exhaustive two-locus search, and two two-stage procedures in which a subset of loci initially identified with single-locus tests are analyzed using a full two-locus model. Their results show that when loci interact, an exhaustive two-locus search across the genome is superior to a two-stage strategy, and in many situations can identify loci which would not have been identified solely using a single-locus search. Their findings suggest that an exhaustive search involving all pairwise combinations of markers across the genome may provide a useful complement to single-locus scans in identifying interacting loci that contribute to moderate proportions of the phenotypic variance.
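The two-stage strategy compared here can be sketched in a few lines. The data are simulated, and the single-locus score and two-locus model below are simplified stand-ins for the quantitative-trait models evaluated in the study:

```python
# Sketch of a two-stage two-locus scan: stage 1 ranks markers by a
# cheap single-locus score; stage 2 fits a full two-locus interaction
# model only on pairs drawn from the top-ranked markers. Simulated data.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, m = 500, 50                       # individuals, markers
G = rng.integers(0, 3, size=(n, m))  # additive genotype codes 0/1/2
# trait with an interaction between markers 3 and 7 (plus noise)
y = 0.5 * G[:, 3] * G[:, 7] + rng.normal(size=n)

def single_locus_score(g, y):
    # squared correlation as a cheap marginal association score
    r = np.corrcoef(g, y)[0, 1]
    return r * r

def two_locus_rss(g1, g2, y):
    # residual sum of squares of a full two-locus linear model
    # y ~ g1 + g2 + g1*g2 (a simplified stand-in for the full model)
    X = np.column_stack([np.ones_like(y), g1, g2, g1 * g2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Stage 1: keep the k best markers by marginal score
k = 10
scores = np.array([single_locus_score(G[:, j], y) for j in range(m)])
top = np.argsort(scores)[-k:]

# Stage 2: exhaustive two-locus fits within the retained subset only
pairs = list(combinations(sorted(top), 2))
rss = {p: two_locus_rss(G[:, p[0]], G[:, p[1]], y) for p in pairs}
best_pair = min(rss, key=rss.get)
```

Stage 2 fits only k(k−1)/2 models instead of m(m−1)/2, which is the computational saving the paper weighs against the risk of discarding interacting loci with weak marginal effects in stage 1.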
doi:10.1371/journal.pgen.0020157
PMCID: PMC1570380  PMID: 17002500
5.  ENABLING PERSONAL GENOMICS WITH AN EXPLICIT TEST OF EPISTASIS 
One goal of personal genomics is to use information about genomic variation to predict who is at risk for various common diseases. Technological advances in genotyping have spawned several personal genetic testing services that market genotyping services directly to the consumer. An important goal of consumer genetic testing is to provide health information along with the genotyping results. This has the potential to integrate detailed personal genetic and genomic information into healthcare decision making. Despite the potential importance of these advances, there are notable limitations. One concern is that much of the literature that is used to formulate personal genetics reports is based on genetic association studies that consider each genetic variant independently of the others. It is our working hypothesis that the true value of personal genomics will only be realized when the complexity of the genotype-to-phenotype mapping relationship is embraced, rather than ignored. We focus here on complexity in genetic architecture due to epistasis or nonlinear gene-gene interaction. We have previously developed a multifactor dimensionality reduction (MDR) algorithm and software package for detecting nonlinear interactions in genetic association studies. In most prior MDR analyses, the permutation testing strategy used to assess statistical significance was unable to differentiate MDR models that captured only interaction effects from those that also detected independent main effects. Statistical interpretation of MDR models required post-hoc analysis using entropy-based measures of interaction information. We introduce here a novel permutation test that allows the effects of nonlinear interactions between multiple genetic variants to be specifically tested in a manner that is not confounded by linear additive effects.
We show using data simulated across 35 different epistasis models with varying effect sizes (heritabilities = 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and sample sizes (n = 400, 800, 1600) that the power to detect interactions using the explicit test of epistasis is no different than a standard permutation test. We also show that the test has the appropriate size or type I error rate of approximately 0.05. We then apply MDR with the new explicit test of epistasis to a large genetic study of bladder cancer (n=914) and show that a previously reported nonlinear interaction between two XPD gene polymorphisms is indeed significant (P = 0.005), even after considering the strong additive effect of smoking in the model. Finally, we evaluated the power of the explicit test of epistasis to detect the nonlinear interaction between two XPD gene polymorphisms by simulating data from the MDR model of bladder cancer susceptibility. We show that the power to detect the interaction alone was 1.00 while the power to detect the independent effect of smoking alone was 0.06 which is close to the expected type I error rate of 0.05. Importantly, the power to detect the interaction with smoking in the model was 0.94. The results of this study provide for the first time a simple method for explicitly testing epistasis or gene-gene interaction effects in genetic association studies. An important advantage of the method is that it can be combined with any modeling approach. The explicit test of epistasis brings us a step closer to the type of routine gene-gene interaction analysis that is needed if we are to enable personal genomics.
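One generic way to build a permutation test that is not confounded by additive effects is to permute residuals of an additive fit (a Freedman–Lane-style scheme). The sketch below illustrates that general idea on simulated data; it is not the authors' MDR implementation, and the statistic, sample size, and effect sizes are invented:

```python
# Generic residual-permutation ("Freedman-Lane"-style) interaction test:
# main effects are preserved in every permuted data set, so only the
# interaction signal is destroyed under the null. Illustrative scheme
# on simulated data, not the authors' MDR implementation.
import numpy as np

rng = np.random.default_rng(1)
n = 400
g1 = rng.integers(0, 3, size=n).astype(float)
g2 = rng.integers(0, 3, size=n).astype(float)
y = 0.3 * g1 + 0.6 * g1 * g2 + rng.normal(size=n)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def interaction_stat(g1, g2, y):
    # fit improvement from adding the g1*g2 term to an additive model
    X0 = np.column_stack([np.ones_like(y), g1, g2])
    return rss(X0, y) - rss(np.column_stack([X0, g1 * g2]), y)

obs = interaction_stat(g1, g2, y)

# null distribution: permute additive-model residuals, keep the fitted
# main effects, recompute the interaction statistic
X0 = np.column_stack([np.ones(n), g1, g2])
beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
fitted, resid = X0 @ beta0, y - X0 @ beta0
null = [interaction_stat(g1, g2, fitted + rng.permutation(resid))
        for _ in range(199)]

p = (1 + sum(s >= obs for s in null)) / (1 + len(null))
```

Because each permuted data set keeps the fitted additive component intact, a small p here reflects the interaction alone, mirroring the property the explicit test of epistasis is designed to have.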
PMCID: PMC2916690  PMID: 19908385
6.  Application of three-level linear mixed-effects model incorporating gene-age interactions for association analysis of longitudinal family data 
BMC Proceedings  2009;3(Suppl 7):S89.
Longitudinal studies that collect repeated measurements on the same subjects over time have long been considered more powerful than cross-sectional studies and far more informative about individual changes. We propose a three-level linear mixed-effects model for testing genetic main effects and gene-age interactions with longitudinal family data. The simulated Genetic Analysis Workshop 16 Problem 3 data sets were used to evaluate the method. Genome-wide association analyses were conducted based on cross-sectional data, i.e., each of the three single-visit data sets separately, and also on the longitudinal data, i.e., using data from all three visits simultaneously. Results from the analysis of coronary artery calcification phenotype showed that the longitudinal association tests were much more powerful than those based on single-visit data only. Gene-age interactions were evaluated under the same framework for detecting genetic effects that are modulated by age.
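A generic form of such a three-level model (visits nested within subjects nested within families) might be written as follows; the symbols and parameterization are ours, not necessarily the authors' exact specification:

```latex
% y_{fij}: trait at visit j for subject i in family f
y_{fij} = \beta_0 + \beta_g\, g_{fi} + \beta_a\, a_{fij}
        + \beta_{ga}\, g_{fi} a_{fij} + u_f + v_{fi} + \varepsilon_{fij},
\qquad
u_f \sim N(0, \sigma_f^2),\quad
v_{fi} \sim N(0, \sigma_s^2),\quad
\varepsilon_{fij} \sim N(0, \sigma_e^2)
```

where $g_{fi}$ is the genotype, $a_{fij}$ the age at visit $j$, and $u_f$ and $v_{fi}$ the family- and subject-level random intercepts; the genetic main effect and the gene-age interaction are tested through $\beta_g$ and $\beta_{ga}$, respectively.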
PMCID: PMC2795992  PMID: 20018085
7.  Gene-Based Testing of Interactions in Association Studies of Quantitative Traits 
PLoS Genetics  2013;9(2):e1003321.
Various methods have been developed for identifying gene–gene interactions in genome-wide association studies (GWAS). However, most methods focus on individual markers as the testing unit, and the large number of such tests drastically erodes statistical power. In this study, we propose novel interaction tests of quantitative traits that are gene-based and that confer advantage in both statistical power and biological interpretation. The framework of gene-based gene–gene interaction (GGG) tests combines marker-based interaction tests between all pairs of markers in two genes to produce a gene-level test for interaction between the two. The tests are based on an analytical formula we derive for the correlation between marker-based interaction tests due to linkage disequilibrium. We propose four GGG tests that extend the following P value combining methods: minimum P value, extended Simes procedure, truncated tail strength, and truncated P value product. Extensive simulations show correct type I error rates for all tests and show that the two truncated tests are more powerful than the other tests in cases of markers involved in the underlying interaction not being directly genotyped and in cases of multiple underlying interactions. We applied our tests to pairs of genes that exhibit a protein–protein interaction to test for gene-level interactions underlying lipid levels using genotype data from the Atherosclerosis Risk in Communities study. We identified five novel interactions that are not evident from marker-based interaction testing and successfully replicated one of these interactions, between SMAD3 and NEDD9, in an independent sample from the Multi-Ethnic Study of Atherosclerosis. We conclude that our GGG tests show improved power to identify gene-level interactions in existing, as well as emerging, association studies.
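Two of the combining rules named here are simple enough to sketch directly. This is a pure-Python sketch of the generic rules only; the paper's gene-level tests additionally correct for LD-induced correlation among the marker-pair tests, which is omitted here:

```python
# P value combining rules: Simes, and a truncated P value product.
import math

def simes(pvals):
    # Simes combined P value: min over i of m * p_(i) / i
    m = len(pvals)
    return min(m * p / (i + 1) for i, p in enumerate(sorted(pvals)))

def truncated_product(pvals, tau=0.05):
    # product of P values at or below tau, on the -log scale
    # (larger value = stronger evidence; 0 when nothing passes tau)
    kept = [p for p in pvals if p <= tau]
    return -sum(math.log(p) for p in kept) if kept else 0.0

pair_pvals = [0.20, 0.01, 0.04, 0.66]  # marker-pair interaction P values
gene_level_simes = simes(pair_pvals)   # 0.04 for these four P values
```

Turning the truncated-product statistic into a gene-level P value requires its null distribution, which the paper obtains analytically via the derived correlation formula; a permutation calibration would be a model-free alternative.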
Author Summary
Epistasis is likely to play a significant role in complex diseases or traits and is one of the many possible explanations for “missing heritability.” However, epistatic interactions have been difficult to detect in genome-wide association studies (GWAS) due to the limited power caused by the multiple-testing correction from the large number of tests conducted. Gene-based gene–gene interaction (GGG) tests might hold the key to relaxing the multiple-testing correction burden and increasing the power for identifying epistatic interactions in GWAS. Here, we developed GGG tests of quantitative traits by extending four P value combining methods and evaluated their type I error rates and power using extensive simulations. All four GGG tests are more powerful than a principal component-based test. We also applied our GGG tests to data from the Atherosclerosis Risk in Communities study and found five gene-level interactions associated with the levels of total cholesterol and high-density lipoprotein cholesterol (HDL-C). One interaction between SMAD3 and NEDD9 on HDL-C was further replicated in an independent sample from the Multi-Ethnic Study of Atherosclerosis.
doi:10.1371/journal.pgen.1003321
PMCID: PMC3585009  PMID: 23468652
8.  A 2-step strategy for detecting pleiotropic effects on multiple longitudinal traits 
Frontiers in Genetics  2014;5:357.
Genetic pleiotropy refers to the situation in which a single gene influences multiple traits and so it is considered as a major factor that underlies genetic correlation among traits. To identify pleiotropy, an important focus in genome-wide association studies (GWAS) is on finding genetic variants that are simultaneously associated with multiple traits. On the other hand, longitudinal designs are often employed in many complex disease studies, such that traits are measured repeatedly over time within the same subject. Performing genetic association analysis simultaneously on multiple longitudinal traits for detecting pleiotropic effects is interesting but challenging. In this paper, we propose a 2-step method for simultaneously testing the genetic association with multiple longitudinal traits. In the first step, a mixed effects model is used to analyze each longitudinal trait. We focus on estimation of the random effect that accounts for the subject-specific genetic contribution to the trait; fixed effects of other confounding covariates are also estimated. This first step enables separation of the genetic effect from other confounding effects for each subject and for each longitudinal trait. Then in the second step, we perform a simultaneous association test on multiple estimated random effects arising from multiple longitudinal traits. The proposed method can efficiently detect pleiotropic effects on multiple longitudinal traits and can flexibly handle traits of different data types such as quantitative, binary, or count data. We apply this method to analyze the 16th Genetic Analysis Workshop (GAW16) Framingham Heart Study (FHS) data. A simulation study is also conducted to validate this 2-step method and evaluate its performance.
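The 2-step structure can be sketched on simulated data. Step 1 below uses each subject's mean over visits as a crude stand-in for the estimated mixed-model random effect, and step 2 combines per-trait association statistics into one joint test; sample sizes, effects, and the combined statistic are invented for illustration:

```python
# Hedged sketch of the 2-step idea: per-subject trait summaries first,
# then one joint genetic association test across traits. Simulated data.
import numpy as np

rng = np.random.default_rng(2)
n_subj, n_visits = 300, 3
snp = rng.integers(0, 3, size=n_subj).astype(float)

def step1_subject_effect(beta):
    # longitudinal trait: subject-level effect plus visit-level noise;
    # the subject mean over visits estimates the subject effect
    u = beta * snp + rng.normal(size=n_subj)
    visits = u[:, None] + rng.normal(size=(n_subj, n_visits))
    return visits.mean(axis=1)

# two longitudinal traits influenced by the same variant (pleiotropy)
U = np.column_stack([step1_subject_effect(0.4), step1_subject_effect(0.3)])

def step2_joint_stat(snp, U):
    # sum of squared standardized score statistics across traits;
    # roughly chi-square with U.shape[1] df under the null
    s = (snp - snp.mean()) / snp.std()
    z = s @ ((U - U.mean(axis=0)) / U.std(axis=0)) / np.sqrt(len(snp))
    return float(z @ z)

stat = step2_joint_stat(snp, U)
significant = stat > 5.99  # 95th percentile of chi-square, 2 df
```

The separation mirrors the paper's design: once step 1 has reduced each trait to one subject-level quantity, step 2 is agnostic to whether the original traits were quantitative, binary, or counts.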
doi:10.3389/fgene.2014.00357
PMCID: PMC4202779  PMID: 25368629
pleiotropic effect; genetic association; multiple traits; longitudinal data; mixed effects model; single nucleotide polymorphisms (SNPs)
9.  Analysis of multiple compound–protein interactions reveals novel bioactive molecules 
The authors use machine learning of compound-protein interactions to explore drug polypharmacology and to efficiently identify bioactive ligands, including novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein coupled receptors and protein kinases.
We have demonstrated that machine learning of multiple compound–protein interactions is useful for efficient ligand screening and for assessing drug polypharmacology. This approach successfully identified novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein-coupled receptors and protein kinases. These bioactive compounds were not detected by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. Perturbations of biological systems by chemical probes provide broader applications not only for analysis of complex systems but also for intentional manipulations of these systems. Nevertheless, the lack of well-characterized chemical modulators has limited their use. Recently, chemical genomics has emerged as a promising area of research applicable to the exploration of novel bioactive molecules, and researchers are currently striving toward the identification of all possible ligands for all target protein families (Wang et al, 2009). Chemical genomics studies have shown that patterns of compound–protein interactions (CPIs) are too diverse to be understood as simple one-to-one events. There is an urgent need to develop appropriate data mining methods for characterizing and visualizing the full complexity of interactions between chemical space and biological systems. However, no existing screening approach has so far succeeded in identifying novel bioactive compounds using multiple interactions among compounds and target proteins.
High-throughput screening (HTS) and computational screening have greatly aided in the identification of early lead compounds for drug discovery. However, the large number of assays required for HTS to identify drugs that target multiple proteins renders this process very costly and time-consuming. Therefore, interest in using in silico strategies for screening has increased. The most common computational approaches, ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS; Oprea and Matter, 2004; Muegge and Oloff, 2006; McInnes, 2007; Figure 1A), have been used for practical drug development. LBVS aims to identify molecules that are very similar to known active molecules and generally has difficulty identifying compounds with novel structural scaffolds that differ from reference molecules. The other popular strategy, SBVS, is constrained by the number of three-dimensional crystallographic structures available. To circumvent these limitations, we have shown that a new computational screening strategy, chemical genomics-based virtual screening (CGBVS), has the potential to identify novel, scaffold-hopping compounds and assess their polypharmacology by using a machine-learning method to recognize conserved molecular patterns in comprehensive CPI data sets.
The CGBVS strategy used in this study was made up of five steps: CPI data collection, descriptor calculation, representation of interaction vectors, predictive model construction using training data sets, and predictions from test data (Figure 1A). Importantly, step 1, the construction of a data set of chemical structures and protein sequences for known CPIs, did not require the three-dimensional protein structures needed for SBVS. In step 2, compound structures and protein sequences were converted into numerical descriptors. These descriptors were used to construct chemical or biological spaces in which decreasing distance between vectors corresponded to increasing similarity of compound structures or protein sequences. In step 3, we represented multiple CPI patterns by concatenating these chemical and protein descriptors. Using these interaction vectors, we could quantify the similarity of molecular interactions for compound–protein pairs, despite the fact that the ligand and protein similarity maps differed substantially. In step 4, concatenated vectors for CPI pairs (positive samples) and non-interacting pairs (negative samples) were input into an established machine-learning method. In the final step, the classifier constructed using training sets was applied to test data.
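The five steps above can be sketched as follows. The descriptors here are random placeholders rather than real chemical fingerprints or protein sequence features, the labels are simulated, and a support-vector machine stands in for the established machine-learning method; the point is the shape of the pipeline, not its chemistry.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Steps 1-2 (stand-ins): rows are compound descriptors and protein
# descriptors for known compound-protein pairs.
n_pairs, d_chem, d_prot = 300, 32, 16
compounds = rng.normal(size=(n_pairs, d_chem))
proteins = rng.normal(size=(n_pairs, d_prot))

# Step 3: an interaction vector is the concatenation of the two descriptors.
X = np.hstack([compounds, proteins])

# Toy labels: interacting (1) vs non-interacting (0) pairs.
w = rng.normal(size=d_chem + d_prot)
y = (X @ w + rng.normal(0, 0.5, n_pairs) > 0).astype(int)

# Step 4: train a kernel classifier on the training CPI pairs.
model = SVC(kernel="rbf", probability=True).fit(X[:250], y[:250])

# Step 5: score unseen compound-protein pairs and rank them.
scores = model.predict_proba(X[250:])[:, 1]
print("top-ranked pair score:", round(float(scores.max()), 3))
```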
To evaluate the predictive value of CGBVS, we first compared its performance with that of LBVS by fivefold cross-validation. CGBVS performed with considerably higher accuracy (91.9%) than did LBVS (84.4%; Figure 1B). We next compared CGBVS and SBVS in a retrospective virtual screening based on the human β2-adrenergic receptor (ADRB2). Figure 1C shows that CGBVS provided higher hit rates than did SBVS. These results suggest that CGBVS is more successful than conventional approaches for prediction of CPIs.
We then evaluated the ability of the CGBVS method to predict the polypharmacology of ADRB2 by attempting to identify novel ADRB2 ligands from a group of G-protein-coupled receptor (GPCR) ligands. We ranked the prediction scores for the interactions of 826 reported GPCR ligands with ADRB2 and then analyzed the 50 highest-ranked compounds in greater detail. Of 21 commercially available compounds, 11 showed ADRB2-binding activity and were not previously reported to be ADRB2 ligands. These compounds included ligands not only for aminergic receptors but also for neuropeptide Y-type 1 receptors (NPY1R), which have low protein homology to ADRB2. Most ligands we identified were not detected by LBVS and SBVS, which suggests that only CGBVS could identify this unexpected cross-reaction for a ligand originally developed against a peptidergic receptor.
The true value of CGBVS in drug discovery must be tested by assessing whether this method can identify scaffold-hopping lead compounds from a set of compounds that is structurally more diverse. To assess this ability, we analyzed 11 500 commercially available compounds to predict compounds likely to bind to two GPCRs and two protein kinases. Functional assays revealed that nine ADRB2 ligands, three NPY1R ligands, five epidermal growth factor receptor (EGFR) inhibitors, and two cyclin-dependent kinase 2 (CDK2) inhibitors were concentrated in the top-ranked compounds (hit rate=30, 15, 25, and 10%, respectively). We also evaluated the extent of scaffold hopping achieved in the identification of these novel ligands. One ADRB2 ligand, two NPY1R ligands, and one CDK2 inhibitor exhibited scaffold hopping (Figure 4), indicating that CGBVS can use this characteristic to rationally predict novel lead compounds, a crucial and very difficult step in drug discovery. This feature of CGBVS is critically different from existing predictive methods, such as LBVS, which depend on similarities between test and reference ligands, and focus on a single protein or highly homologous proteins. In particular, CGBVS is useful for targets with undefined ligands because this method can use CPIs with target proteins that exhibit lower levels of homology.
In summary, we have demonstrated that data mining of multiple CPIs is of great practical value for exploration of chemical space. As a predictive model, CGBVS could provide an important step in the discovery of such multi-target drugs by identifying the group of proteins targeted by a particular ligand, leading to innovation in pharmaceutical research.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. For this purpose, the emerging field of chemical genomics is currently focused on accumulating large assay data sets describing compound–protein interactions (CPIs). Although new target proteins for known drugs have recently been identified through mining of CPI databases, using these resources to identify novel ligands remains unexplored. Herein, we demonstrate that machine learning of multiple CPIs can not only assess drug polypharmacology but can also efficiently identify novel bioactive scaffold-hopping compounds. Through a machine-learning technique that uses multiple CPIs, we have successfully identified novel lead compounds for two pharmaceutically important protein families, G-protein-coupled receptors and protein kinases. These novel compounds were not identified by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
doi:10.1038/msb.2011.5
PMCID: PMC3094066  PMID: 21364574
chemical genomics; data mining; drug discovery; ligand screening; systems chemical biology
10.  Knowledge-Driven Analysis Identifies a Gene–Gene Interaction Affecting High-Density Lipoprotein Cholesterol Levels in Multi-Ethnic Populations 
PLoS Genetics  2012;8(5):e1002714.
Total cholesterol, low-density lipoprotein cholesterol, triglyceride, and high-density lipoprotein cholesterol (HDL-C) levels are among the most important risk factors for coronary artery disease. We tested for gene–gene interactions affecting the level of these four lipids based on prior knowledge of established genome-wide association study (GWAS) hits, protein–protein interactions, and pathway information. Using genotype data from 9,713 European Americans from the Atherosclerosis Risk in Communities (ARIC) study, we identified an interaction between HMGCR and a locus near LIPC in their effect on HDL-C levels (Bonferroni corrected Pc = 0.002). Using an adaptive locus-based validation procedure, we successfully validated this gene–gene interaction in the European American cohorts from the Framingham Heart Study (Pc = 0.002) and the Multi-Ethnic Study of Atherosclerosis (MESA; Pc = 0.006). The interaction between these two loci is also significant in the African American sample from ARIC (Pc = 0.004) and in the Hispanic American sample from MESA (Pc = 0.04). Both HMGCR and LIPC are involved in the metabolism of lipids, and genome-wide association studies have previously identified LIPC as associated with levels of HDL-C. However, the effect on HDL-C of the novel gene–gene interaction reported here is twice as pronounced as that predicted by the sum of the marginal effects of the two loci. In conclusion, based on a knowledge-driven analysis of epistasis, together with a new locus-based validation method, we successfully identified and validated an interaction affecting a complex trait in multi-ethnic populations.
Author Summary
Genome-wide association studies (GWAS) have identified many loci associated with complex human traits or diseases. However, the fraction of heritable variation explained by these loci is often relatively low. Gene–gene interactions might play a significant role in complex traits or diseases and are one of the many possible factors contributing to the missing heritability. However, to date only a few interactions have been found and validated in GWAS due to the limited power caused by the need for multiple-testing correction for the very large number of tests conducted. Here, we used three types of prior knowledge, known GWAS hits, protein–protein interactions, and pathway information, to guide our search for gene–gene interactions affecting four lipid levels. We identified an interaction between HMGCR and a locus near LIPC in their effect on high-density lipoprotein cholesterol (HDL-C) and another pair of loci that interact in their effect on low-density lipoprotein cholesterol (LDL-C). We validated the interaction on HDL-C in a number of independent multiple-ethnic populations, while the interaction underlying LDL-C did not validate. The prior knowledge-driven searching approach and a locus-based validation procedure show the potential for dissecting and validating gene–gene interactions in current and future GWAS.
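The locus-locus interaction test underlying analyses like this can be sketched as a linear regression with a product term, testing whether the coefficient of SNP1 × SNP2 departs from zero. The data below are simulated and all effect sizes are hypothetical; real analyses would also adjust for covariates such as age, sex, and ancestry.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1000

# Hypothetical genotypes (minor-allele counts 0/1/2) at two loci
snp1 = rng.integers(0, 3, n).astype(float)
snp2 = rng.integers(0, 3, n).astype(float)

# Simulated HDL-C-like trait with main effects plus an interaction
hdl = 0.2 * snp1 + 0.1 * snp2 + 0.3 * snp1 * snp2 + rng.normal(0, 1, n)

# Fit y ~ snp1 + snp2 + snp1:snp2 and test the interaction coefficient
X = np.column_stack([np.ones(n), snp1, snp2, snp1 * snp2])
beta, rss, *_ = np.linalg.lstsq(X, hdl, rcond=None)
sigma2 = float(rss[0]) / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
t = beta[3] / np.sqrt(cov[3, 3])
p_int = 2 * stats.t.sf(abs(t), n - X.shape[1])
print(f"interaction beta = {beta[3]:.2f}, P = {p_int:.1e}")
```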
doi:10.1371/journal.pgen.1002714
PMCID: PMC3359971  PMID: 22654671
11.  Improved Statistics for Genome-Wide Interaction Analysis 
PLoS Genetics  2012;8(4):e1002625.
Recently, Wu and colleagues [1] proposed two novel statistics for genome-wide interaction analysis using case/control or case-only data. In computer simulations, their proposed case/control statistic outperformed competing approaches, including the fast-epistasis option in PLINK and logistic regression analysis under the correct model; however, reasons for its superior performance were not fully explored. Here we investigate the theoretical properties and performance of Wu et al.'s proposed statistics and explain why, in some circumstances, they outperform competing approaches. Unfortunately, we find minor errors in the formulae for their statistics, resulting in tests that have higher than nominal type 1 error. We also find minor errors in PLINK's fast-epistasis and case-only statistics, although theory and simulations suggest that these errors have only negligible effect on type 1 error. We propose adjusted versions of all four statistics that, both theoretically and in computer simulations, maintain correct type 1 error rates under the null hypothesis. We also investigate statistics based on correlation coefficients that maintain similar control of type 1 error. Although designed to test specifically for interaction, we show that some of these previously-proposed statistics can, in fact, be sensitive to main effects at one or both loci, particularly in the presence of linkage disequilibrium. We propose two new “joint effects” statistics that, provided the disease is rare, are sensitive only to genuine interaction effects. In computer simulations we find, in most situations considered, that highest power is achieved by analysis under the correct genetic model. Such an analysis is unachievable in practice, as we do not know this model. However, generally high power over a wide range of scenarios is exhibited by our joint effects and adjusted Wu statistics. 
We recommend use of these alternative or adjusted statistics and urge caution when using Wu et al.'s originally-proposed statistics, on account of the inflated error rate that can result.
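The case-only idea discussed above can be sketched in its textbook form: with no interaction, and no linkage disequilibrium between the loci, genotypes at two unlinked loci should be uncorrelated among cases. This is a generic illustration, not Wu et al.'s statistic or the adjusted versions proposed here, and as the article stresses, such a screen is sensitive to LD between the loci.

```python
import numpy as np
from scipy import stats

def case_only_stat(g1_cases, g2_cases):
    """Case-only interaction screen: correlation between genotype counts
    at two loci among cases; z is approximately N(0, 1) under the null."""
    r, _ = stats.pearsonr(g1_cases, g2_cases)
    z = r * np.sqrt(len(g1_cases))
    return z, 2 * stats.norm.sf(abs(z))

# Balanced genotype grid: the two loci are exactly uncorrelated (null case)
g1 = np.tile(np.repeat([0, 1, 2], 3), 10)
g2 = np.tile([0, 1, 2], 30)
print(case_only_stat(g1, g2))   # z near 0, P near 1

# Perfectly co-occurring genotypes: an extreme "interaction" signal
print(case_only_stat(g1, g1))   # |z| large, P tiny
```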
Author Summary
Gene–gene interactions are a topic of great interest to geneticists carrying out studies of how genetic factors influence the development of common, complex diseases. Genes that interact may not only make important biological contributions to underlying disease processes, but also be more difficult to detect when using standard statistical methods in which we examine the effects of genetic factors one at a time. Recently a method was proposed by Wu and colleagues [1] for detecting pairwise interactions when carrying out genome-wide association studies (in which a large number of genetic variants across the genome are examined). Wu and colleagues carried out theoretical work and computer simulations that suggested their method outperformed other previously proposed approaches for detecting interactions. Here we show that, in fact, the method proposed by Wu and colleagues can result in an over-preponderence of false postive findings. We propose an adjusted version of their method that reduces the false positive rate while maintaining high power. We also propose a new method for detecting pairs of genetic effects that shows similarly high power but has some conceptual advantages over both Wu's method and also other previously proposed approaches.
doi:10.1371/journal.pgen.1002625
PMCID: PMC3320596  PMID: 22496670
12.  A strategy analysis for genetic association studies with known inbreeding 
BMC Genetics  2011;12:63.
Background
Association studies aim to identify the genetic variants related to a specific disease through statistical multiple-hypothesis testing or segregation analysis in pedigrees. This type of study has been very successful for Mendelian monogenic disorders but less successful in identifying genetic variants related to complex diseases, where onset depends on interactions between different genes and the environment. Current technology allows more than a million markers to be genotyped, and this number has been increasing rapidly in recent years with imputation based on template sets and whole-genome sequencing. Such data introduce a great amount of noise into the statistical analysis and usually require a large number of samples. Current methods seldom take into account gene-gene and gene-environment interactions, which are fundamental, especially in complex diseases. In this paper we propose a non-parametric additive model to detect the genetic variants related to disease, one that accounts for interactions of unknown order. Although this is not new to the current literature, we show that in an isolated population, where the most related subjects also share most of their genetic code, the use of additive models can be improved if the available genealogical tree is taken into account. Specifically, we form a sample of cases and controls with the highest inbreeding by means of the Hungarian method, and estimate the set of genes/environmental variables associated with the disease by means of Random Forest.
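The two computational ingredients named above are standard and can be sketched together: the Hungarian method (`scipy.optimize.linear_sum_assignment`) picks the one-to-one case-control pairing that maximizes total relatedness, and a Random Forest then ranks variants without a parametric model. The kinship matrix and genotypes below are simulated placeholders, with one variant (SNP 0) carrying an artificial signal.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n_cases = n_ctrl = 50
n_snps = 100

# Hypothetical kinship coefficients between each case and each control,
# e.g. derived from the genealogical tree of an isolated population.
kinship = rng.uniform(0, 0.25, size=(n_cases, n_ctrl))

# Hungarian method: one-to-one pairing maximizing total kinship
# (minimizing the negated matrix).
rows, cols = linear_sum_assignment(-kinship)
print("first matched pairs:", list(zip(rows[:3].tolist(), cols[:3].tolist())))

# Simulated genotypes for the matched sample; SNP 0 carries a real signal.
geno = rng.integers(0, 3, size=(2 * n_cases, n_snps)).astype(float)
status = np.r_[np.ones(n_cases), np.zeros(n_ctrl)]
geno[:n_cases, 0] += 2.0   # inflate the risk-allele count in cases

# Random Forest ranks variants by importance, capturing interactions
# of unknown order without specifying a parametric model.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(geno, status)
top = int(np.argmax(rf.feature_importances_))
print("top-ranked SNP index:", top)
```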
Results
We have evidence, from statistical theory, simulations, and two applications, that the procedure we have built eliminates stratification between cases and controls and has sufficient precision in identifying genetic variants responsible for a disease. This procedure has been successfully applied to beta-thalassemia, a well-known Mendelian disease, and also to common asthma, for which we have identified candidate genes that underlie susceptibility. Some of these candidate genes have also been linked to common asthma in the current literature.
Conclusions
The data analysis approach, based on selecting the most related cases and controls along with the Random Forest model, is a powerful tool for detecting genetic variants associated with a disease in isolated populations. Moreover, this method also provides a prediction model that accurately estimates unknown disease status and can, in general, be used to build test kits for a wide class of Mendelian diseases.
doi:10.1186/1471-2156-12-63
PMCID: PMC3155486  PMID: 21767363
13.  Interactions between Non-Physician Clinicians and Industry: A Systematic Review 
PLoS Medicine  2013;10(11):e1001561.
In a systematic review of studies of interactions between non-physician clinicians and industry, Quinn Grundy and colleagues found that many of the issues identified for physicians' industry interactions exist for non-physician clinicians.
Please see later in the article for the Editors' Summary
Background
With increasing restrictions placed on physician–industry interactions, industry marketing may target other health professionals. Recent health policy developments confer even greater importance on the decision making of non-physician clinicians. The purpose of this systematic review is to examine the types and implications of non-physician clinician–industry interactions in clinical practice.
Methods and Findings
We searched MEDLINE and Web of Science from January 1, 1946, through June 24, 2013, according to PRISMA guidelines. Non-physician clinicians eligible for inclusion were: Registered Nurses, nurse prescribers, Physician Assistants, pharmacists, dieticians, and physical or occupational therapists; trainee samples were excluded. Fifteen studies met inclusion criteria. Data were synthesized qualitatively into eight outcome domains: nature and frequency of industry interactions; attitudes toward industry; perceived ethical acceptability of interactions; perceived marketing influence; perceived reliability of industry information; preparation for industry interactions; reactions to industry relations policy; and management of industry interactions. Non-physician clinicians reported interacting with the pharmaceutical and infant formula industries. Clinicians across disciplines met with pharmaceutical representatives regularly and relied on them for practice information. Clinicians frequently received industry “information,” attended sponsored “education,” and acted as distributors for similar materials targeted at patients. Clinicians generally regarded this as an ethical use of industry resources, and felt they could detect “promotion” while benefiting from industry “information.” Free samples were among the most approved and common ways that clinicians interacted with industry. Included studies were observational and of varying methodological rigor; thus, these findings may not be generalizable. This review is, however, the first to our knowledge to provide a descriptive analysis of this literature.
Conclusions
Non-physician clinicians' generally positive attitudes toward industry interactions, despite their recognition of issues related to bias, suggest that industry interactions are normalized in clinical practice across non-physician disciplines. Industry relations policy should address all disciplines and be implemented consistently in order to mitigate conflicts of interest and address such interactions' potential to affect patient care.
Editors' Summary
Background
Making and selling health care goods (including drugs and devices) and services is big business. To maximize the profits they make for their shareholders, companies involved in health care build relationships with physicians by providing information on new drugs, organizing educational meetings, providing samples of their products, giving gifts, and holding sponsored events. These relationships help to keep physicians informed about new developments in health care but also create the potential for causing harm to patients and health care systems. These relationships may, for example, result in increased prescription rates of new, heavily marketed medications, which are often more expensive than their generic counterparts (similar unbranded drugs) and that are more likely to be recalled for safety reasons than long-established drugs. They may also affect the provision of health care services. Industry is providing an increasingly large proportion of routine health care services in many countries, so relationships built up with physicians have the potential to influence the commissioning of the services that are central to the treatment and well-being of patients.
Why Was This Study Done?
As a result of concerns about the tension between industry's need to make profits and the ethics underlying professional practice, restrictions are increasingly being placed on physician–industry interactions. In the US, for example, the Physician Payments Sunshine Act now requires US manufacturers of drugs, devices, and medical supplies that participate in federal health care programs to disclose all payments and gifts made to physicians and teaching hospitals. However, other health professionals, including those with authority to prescribe drugs such as pharmacists, Physician Assistants, and nurse practitioners are not covered by this legislation or by similar legislation in other settings, even though the restructuring of health care to prioritize primary care and multidisciplinary care models means that “non-physician clinicians” are becoming more numerous and more involved in decision-making and medication management. In this systematic review (a study that uses predefined criteria to identify all the research on a given topic), the researchers examine the nature and implications of the interactions between non-physician clinicians and industry.
What Did the Researchers Do and Find?
The researchers identified 15 published studies that examined interactions between non-physician clinicians (Registered Nurses, nurse prescribers, midwives, pharmacists, Physician Assistants, and dieticians) and industry (corporations that produce health care goods and services). They extracted the data from 16 publications (representing 15 different studies) and synthesized them qualitatively (combined the data and reached word-based, rather than numerical, conclusions) into eight outcome domains, including the nature and frequency of interactions, non-physician clinicians' attitudes toward industry, and the perceived ethical acceptability of interactions. In the research the authors identified, non-physician clinicians reported frequent interactions with the pharmaceutical and infant formula industries. Most non-physician clinicians met industry representatives regularly, received gifts and samples, and attended educational events or received educational materials (some of which they distributed to patients). In these studies, non-physician clinicians generally regarded these interactions positively and felt they were an ethical and appropriate use of industry resources. Only a minority of non-physician clinicians felt that marketing influenced their own practice, although a larger percentage felt that their colleagues would be influenced. A sizeable proportion of non-physician clinicians questioned the reliability of industry information, but most were confident that they could detect biased information and therefore rated this information as reliable, valuable, or useful.
What Do These Findings Mean?
These and other findings suggest that non-physician clinicians generally have positive attitudes toward industry interactions but recognize issues related to bias and conflict of interest. Because these findings are based on a small number of studies, most of which were undertaken in the US, they may not be generalizable to other countries. Moreover, they provide no quantitative assessment of the interaction between non-physician clinicians and industry and no information about whether industry interactions affect patient care outcomes. Nevertheless, these findings suggest that industry interactions are normalized (seen as standard) in clinical practice across non-physician disciplines. This normalization creates the potential for serious risks to patients and health care systems. The researchers suggest that it may be unrealistic to expect that non-physician clinicians can be taught individually how to interact with industry ethically or how to detect and avert bias, particularly given the ubiquitous nature of marketing and promotional materials. Instead, they suggest, the environment in which non-physician clinicians practice should be structured to mitigate the potentially harmful effects of interactions with industry.
Additional Information
Please access these websites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001561.
This study is further discussed in a PLOS Medicine Perspective by James S. Yeh and Aaron S. Kesselheim
The American Medical Association provides guidance for physicians on interactions with pharmaceutical industry representatives, information about the Physician Payments Sunshine Act, and a toolkit for preparing Physician Payments Sunshine Act reports
The International Council of Nurses provides some guidance on industry interactions in its position statement on nurse-industry relations
The UK General Medical Council provides guidance on financial and commercial arrangements and conflicts of interest as part of its good medical practice website, which describes what is required of all registered doctors in the UK
Understanding and Responding to Pharmaceutical Promotion: A Practical Guide is a manual prepared by Health Action International and the World Health Organization that schools of medicine and pharmacy can use to train students how to recognize and respond to pharmaceutical promotion.
The Institute of Medicine's Report on Conflict of Interest in Medical Research, Education, and Practice recommends steps to identify, limit, and manage conflicts of interest
The University of California, San Francisco, Office of Continuing Medical Education offers a course called Marketing of Medicines
doi:10.1371/journal.pmed.1001561
PMCID: PMC3841103  PMID: 24302892
14.  A genome-wide association analysis of Framingham Heart Study longitudinal data using multivariate adaptive splines 
BMC Proceedings  2009;3(Suppl 7):S119.
The Framingham Heart Study is a well-known longitudinal cohort study. In recent years, the community-based Framingham Heart Study has embarked on genome-wide association studies. In this paper, we present a Framingham Heart Study genome-wide analysis for the fasting triglyceride trait in Genetic Analysis Workshop 16 Problem 2 using multivariate adaptive splines for the analysis of longitudinal data (MASAL). With MASAL, we are able to perform analysis of genome-wide data with longitudinal phenotypes and covariates, making it possible to identify genes, gene-gene, and gene-environment (including time) interactions associated with the trait of interest. We conducted a permutation test to assess the associations between MASAL-selected markers and the triglyceride trait and report significant gene-gene and gene-environment interaction effects on the trait of interest.
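The permutation-test step can be sketched generically: shuffle the trait to break any marker-trait link, recompute the test statistic, and take the empirical tail probability. The absolute correlation is a stand-in statistic here (MASAL itself scores markers through its spline model), and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)

def perm_pvalue(genotype, trait, n_perm=2000, rng=rng):
    """Empirical P value: fraction of permuted data sets whose absolute
    correlation meets or exceeds the observed one (with +1 smoothing)."""
    obs = abs(np.corrcoef(genotype, trait)[0, 1])
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = abs(np.corrcoef(genotype, rng.permutation(trait))[0, 1])
    return (1 + (null >= obs).sum()) / (n_perm + 1)

g = rng.integers(0, 3, 300).astype(float)     # marker genotypes
tg = 0.5 * g + rng.normal(0, 1, 300)          # triglyceride-like trait
print(perm_pvalue(g, tg))
```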
PMCID: PMC2795891  PMID: 20017984
15.  Understanding the physiology of Lactobacillus plantarum at zero growth 
The physiology of Lactobacillus plantarum at extremely low growth rates, through cultivation in retentostats, is much closer to carbon-limited growth than to stationary phase, as evidenced from transcriptomics data, metabolic fluxes, and biomass composition and viability. Using a genome-scale metabolic model and constraint-based computational analyses, amino-acid fluxes—in particular, the rather paradoxical excretion of Asp, Arg, Met, and Ala—could be rationalized as a means to allow extensive metabolism of other amino acids, that is, of branched-chain and aromatic amino acids. Catabolic products from aromatic amino acids are known to have putative plant-hormone action. The metabolism of amino acids, as well as transcription data, strongly suggested a plant environment-like response in slow-growing L. plantarum, which was confirmed by significant effects of fermented medium on plant root formation.
Natural ecosystems are usually characterized by extremely low and fluctuating nutrient availability. Hence, microorganisms in these environments live a ‘feast-and-famine' existence, with famine the most habitual state. As a result, extremely slow or no growth is the most common state of bacteria, and maintenance processes dominate their life.
In the present study, Lactobacillus plantarum was used as a model microorganism to investigate the physiology of slow growth. Besides fermented foods, this microorganism can be observed in a variety of environmental niches, including plants and lakes, in which nutrient supply is limited. To mimic these conditions, L. plantarum was grown in a glucose-limited chemostat with complete biomass retention (retentostat). During cultivation, biomass progressively accumulated, resulting in steadily decreasing specific substrate availability. Less energy was thus available for growth, and the specific growth rate decreased accordingly, with a final calculated doubling time greater than one year. Detailed measurements of metabolic fluxes were used as constraints in a genome-scale metabolic model to precisely calculate the amount of energy used for net biomass synthesis and for maintenance purposes: at the lowest growth rate investigated (μ = 0.0002 h⁻¹), maintenance accounted for 94% of all energy expenses.
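The energy partition described can be illustrated with the classic Pirt relation, q_s = μ/Y_max + m_s, which splits substrate consumption into a growth term and a maintenance term. The yield and maintenance coefficients below are hypothetical values chosen only to reproduce a maintenance share of about 94% at μ = 0.0002 h⁻¹; they are not the parameters fitted in the study.

```python
import numpy as np

def maintenance_fraction(mu, y_max, m_s):
    """Pirt relation: total substrate consumption q_s = mu / y_max + m_s.
    Returns the fraction of consumption spent on maintenance."""
    q_s = mu / y_max + m_s
    return m_s / q_s

mu = 0.0002                        # specific growth rate, 1/h
doubling_time_h = np.log(2) / mu   # t_d = ln(2) / mu
print(f"doubling time ~ {doubling_time_h:.0f} h (~{doubling_time_h / 24:.0f} days)")

# Hypothetical yield (y_max) and maintenance (m_s) coefficients
print(f"maintenance share: {maintenance_fraction(mu, y_max=0.1, m_s=0.03):.0%}")
```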
Genome-scale metabolic analysis was used in combination with transcriptomics to study the adaptation of L. plantarum to extremely slow growth under limited carbon and energy supply. Importantly, slow growth as investigated here was fundamentally different from the widely studied carbon starvation-induced stationary phase: non-growing cells in retentostat conditions were glucose limited rather than starved, and the transition from a growing to a non-growing state under retentostat conditions was progressive, in contrast with the abrupt transition in batch cultures. These differences were reflected in various aspects of the cell physiology.
The metabolic behavior was remarkably stable during adaptation to slow growth. Although carbon catabolite repression was clearly relieved, as indicated by the upregulation of genes for the utilization of alternative carbohydrates, the metabolism remained largely based on the conversion of glucose to lactate.
Stress resistance mechanisms were also not massively induced. In particular, analysis of the biomass composition—which remained similar to that of fast-growing cells even under virtually non-growing conditions—and of the gene expression profile failed to reveal clear stringent or general stress responses, which are generally triggered in glucose-starved cells. The observation that genes involved in growth-associated processes were not downregulated suggested that active synthesis of biomass components (RNA, proteins, and membranes) was required to account for the observed stable biomass and that turnover of macromolecules was high in slow-growing cells. Neither biomass viability nor morphology was affected, compared with faster growth conditions. The only typical stress response was the induction of an SOS response—in particular, the upregulation of the two error-prone DNA polymerases—suggesting an increased potential for genetic diversity under adverse conditions. Although diversity was not apparent under the conditions studied here, such mechanisms of increased mutagenesis rates are likely to have an important role in the adaptation of L. plantarum to slow growth.
A surprising response of L. plantarum during adaptation to slow growth was the production of several amino acids (Arg, Asp, Met, and Ala). A priori, this metabolic behavior seemed inefficient in a context of energy limitation. However, reduced cost analysis using the genome-scale metabolic model indicated that it had a positive effect on energy generation. In-depth analysis of metabolic flux distributions showed that biosynthesis of these amino acids was connected to the catabolism of branched-chain and aromatic amino acids (BCAAs and AAAs), under conditions of limited ammonium efflux. With ammonium efflux constrained to its measured value, flux balance analysis indicated that BCAAs and AAAs were expensive to metabolize, because the regeneration of 2-ketoglutarate through glutamate dehydrogenase was limited by ammonium dissipation. Therefore, alternative pathways had to be active to supply the necessary pool of 2-ketoglutarate. At low growth rates, amino-acid production (Arg, Asp, Ala, and Met) accounted for most of the 2-ketoglutarate regeneration. Although it came at the expense of ATP, this metabolic alternative to glutamate dehydrogenase was less energetically costly than other solutions such as purine biosynthesis. This is thus an excellent example in which precise, quantitative modeling yields new insights into physiology that intuition alone would never have reached. It also shows that flux balance analysis can be used to accurately predict energetically inefficient metabolism, provided the appropriate fluxes are constrained (here, ammonium efflux).
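The reasoning above can be reproduced on a deliberately tiny toy network: flux balance analysis is a linear program, and capping ammonium efflux forces ATP-costly amino-acid excretion to carry the 2-ketoglutarate regeneration. This three-metabolite model and all stoichiometries and bounds are hypothetical, not taken from the study's genome-scale model:

```python
# Toy flux-balance model showing why a constrained ammonium efflux makes
# amino-acid excretion the cheapest route to regenerate 2-ketoglutarate (AKG).
# All reactions, stoichiometries, and bounds are invented for illustration.
from scipy.optimize import linprog

# Variables: v = [v_cat, v_gdh, v_aa, v_nh4], all fluxes >= 0
#   v_cat : BCAA catabolism,        BCAA + AKG -> GLU + 2 ATP
#   v_gdh : glutamate dehydrogenase, GLU -> AKG + NH4
#   v_aa  : amino-acid synthesis + excretion, GLU + 1 ATP -> AKG + AA_out
#   v_nh4 : ammonium efflux (capped at the 'measured' value)
c = [-2.0, 0.0, 1.0, 0.0]            # minimize -(2*v_cat - v_aa) = -net ATP
A_eq = [[1, -1, -1, 0],              # GLU steady-state balance
        [0,  1,  0, -1]]             # NH4 steady-state balance
b_eq = [0.0, 0.0]
bounds = [(0, 10),                   # BCAA uptake capacity (hypothetical)
          (0, None),
          (0, None),
          (0, 1)]                    # ammonium efflux cap (hypothetical)
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
v_cat, v_gdh, v_aa, v_nh4 = res.x
print(f"catabolism={v_cat:.1f}, GDH={v_gdh:.1f}, "
      f"AA excretion={v_aa:.1f}, net ATP={2 * v_cat - v_aa:.1f}")
```

Once the ammonium cap saturates glutamate dehydrogenase, the optimizer routes the remaining glutamate through amino-acid excretion despite its ATP cost, mirroring the paper's conclusion that constraining the right flux lets FBA predict apparently inefficient metabolism.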
The observation that BCAAs and AAAs were catabolized at the expense of energy was intriguing. However, several end products of these catabolic pathways can serve as signaling molecules for interactions with other organisms. In particular, precursors of plant hormones were predicted as possible end products in the model simulations. Accordingly, the production of compounds interfering with plant root development was demonstrated in slow-growing L. plantarum. The metabolic analysis thus suggested that slow-growing L. plantarum produced plant hormones—or precursors thereof—as a strategy to divert the plant metabolism to its own benefit. In support of this view, transcriptome analysis indicated the upregulation of genes involved in the catabolism of β-glucosides—typical sugars of the plant cell wall—as well as a very high induction of six gene clusters encoding cell-surface protein complexes predicted to have a role in the utilization of plant polysaccharides (csc clusters). In such a plant context, limited ammonium production would also make sense, because of the well-documented toxicity of ammonium for plants: production of amino acids could represent an alternative to ammonium excretion while keeping both parties satisfied.
In conclusion, the physiology of L. plantarum at extremely low growth rates, as studied by genome-scale metabolic modeling and transcriptomics, is fundamentally different from that of starvation-induced stationary phase cells. Excitingly, these conditions seem to trigger responses that favor interactions with the environment, more specifically with plants. The reported observations were made in the absence of any plant-derived material, suggesting that this response might constitute a hardwired behavior.
Situations of extremely low substrate availability, resulting in slow growth, are common in natural environments. To mimic these conditions, Lactobacillus plantarum was grown in a carbon-limited retentostat with complete biomass retention. The physiology of extremely slow-growing L. plantarum—as studied by genome-scale modeling and transcriptomics—was fundamentally different from that of stationary-phase cells. Stress resistance mechanisms were not massively induced during transition to extremely slow growth. The energy-generating metabolism was remarkably stable and remained largely based on the conversion of glucose to lactate. The combination of metabolic and transcriptomic analyses revealed behaviors involved in interactions with the environment, more particularly with plants: production of plant hormones or precursors thereof, and preparedness for the utilization of plant-derived substrates. Accordingly, the production of compounds interfering with plant root development was demonstrated in slow-growing L. plantarum. Thus, conditions of slow growth and limited substrate availability seem to trigger a plant environment-like response, even in the absence of plant-derived material, suggesting that this might constitute an intrinsic behavior in L. plantarum.
doi:10.1038/msb.2010.67
PMCID: PMC2964122  PMID: 20865006
Lactobacillus plantarum; metabolic modeling; retentostat; slow growth; transcriptome analysis
16.  Comparison of two methods for analysis of gene–environment interactions in longitudinal family data: the Framingham heart study 
Gene–environment interaction (GEI) analysis can potentially enhance gene discovery for common complex traits. However, genome-wide interaction studies (GWIS) are computationally intensive. Moreover, analysis of longitudinal data in families is much more challenging because of the two sources of correlation arising from repeated measurements and family relationships. GWIS of longitudinal family data can therefore become a computational bottleneck. We compared two methods for analysis of longitudinal family data: a methodologically sound but computationally demanding method using the Kronecker model (KRC) and a computationally less demanding method using the hierarchical linear model (HLM). The KRC model uses a Kronecker product of an unstructured matrix for correlations among repeated measures (longitudinal) and a compound symmetry matrix for correlations within families at a given visit. The HLM uses an autoregressive covariance matrix for correlations among repeated measures and a random intercept for familial correlations. We compared the two methods using the longitudinal Framingham heart study (FHS) SHARe data. Specifically, we evaluated SNP–alcohol (amount of alcohol consumption) interaction effects on high density lipoprotein cholesterol (HDLC). Keeping the prohibitive computational burden of KRC in mind, we limited the analysis to chromosome 16, where preliminary cross-sectional analysis had yielded some interesting results. Our first important finding was that the HLM provided very comparable results but was remarkably faster than the KRC, making HLM the method of choice. Our second finding was that longitudinal analysis provided smaller P-values, and thus more significant results, than cross-sectional analysis. This was particularly pronounced in identifying GEIs. We conclude that longitudinal analysis of GEIs is more powerful and that the HLM is the method of choice over the computationally prohibitive KRC method.
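The two covariance structures being compared can be written down concretely. A minimal numpy sketch, with made-up dimensions and parameter values (three visits, two family members; the correlation values are assumptions for illustration):

```python
# Illustrative construction of the two covariance structures compared in
# the study. All dimensions and parameter values are hypothetical.
import numpy as np

n_visits, n_members = 3, 2           # repeated measures x family members

# KRC: Kronecker product of an unstructured visit-by-visit correlation
# matrix with a compound-symmetry (exchangeable) within-family matrix.
R_unstructured = np.array([[1.0, 0.6, 0.4],
                           [0.6, 1.0, 0.7],
                           [0.4, 0.7, 1.0]])
rho_fam = 0.3
CS = (1 - rho_fam) * np.eye(n_members) + rho_fam   # 1 on diagonal, rho off
V_krc = np.kron(R_unstructured, CS)                # (3*2) x (3*2) covariance

# HLM: AR(1) correlation over visits plus a random family intercept,
# which adds a constant tau2 to every entry of the covariance matrix.
rho_ar, tau2 = 0.8, 0.3
lags = np.abs(np.subtract.outer(np.arange(n_visits), np.arange(n_visits)))
AR1 = rho_ar ** lags
V_hlm = np.kron(AR1, np.eye(n_members)) + tau2

print(V_krc.shape, V_hlm.shape)      # both 6 x 6 for a pair over 3 visits
```

The computational gap the authors report is visible here in miniature: the unstructured block of KRC has O(T^2) free parameters to estimate per fit, whereas the HLM needs only the AR(1) correlation and the intercept variance.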
doi:10.3389/fgene.2014.00009
PMCID: PMC3906599  PMID: 24523728
gene–environment interactions; longitudinal family data; Framingham heart study; interactions in family data; HLM; SNP–alcohol interactions
17.  Genome-Wide Analysis in German Shepherd Dogs Reveals Association of a Locus on CFA 27 with Atopic Dermatitis 
PLoS Genetics  2013;9(5):e1003475.
Humans and dogs are both affected by the allergic skin disease atopic dermatitis (AD), caused by an interaction between genetic and environmental factors. The German shepherd dog (GSD) is a high-risk breed for canine AD (CAD). In this study, we used a Swedish cohort of GSDs as a model for human AD. Serum IgA levels are known to be lower in GSDs compared to other breeds. We detected significantly lower IgA levels in the CAD cases compared to controls (p = 1.1×10−5) in our study population. We also detected a separation within the GSD cohort, where dogs could be grouped into two different subpopulations. Disease prevalence differed significantly between the subpopulations contributing to population stratification (λ = 1.3), which was successfully corrected for using a mixed model approach. A genome-wide association analysis of CAD was performed (ncases = 91, ncontrols = 88). IgA levels were included in the model, due to the high correlation between CAD and low IgA levels. In addition, we detected a correlation between IgA levels and the age at the time of sampling (corr = 0.42, p = 3.0×10−9), thus age was included in the model. A genome-wide significant association was detected on chromosome 27 (praw = 3.1×10−7, pgenome = 0.03). The total associated region was defined as a ∼1.5-Mb-long haplotype including eight genes. Through targeted re-sequencing and additional genotyping of a subset of identified SNPs, we defined 11 smaller haplotype blocks within the associated region. Two blocks showed the strongest association to CAD. The ∼209-kb region, defined by the two blocks, harbors only the PKP2 gene, encoding Plakophilin 2 expressed in the desmosomes and important for skin structure. Our results may yield further insight into the genetics behind both canine and human AD.
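The population stratification reported above (λ = 1.3) is the genomic inflation factor: the median of the association-test chi-square statistics divided by the median of the null chi-square(1) distribution. A sketch on simulated statistics (the inflation factor and sample count are assumptions, not the study's data):

```python
# Genomic inflation factor (lambda_GC) from 1-df association chi-square
# statistics. The statistics here are simulated as inflated squared
# standard normals; the true inflation of 1.3 mimics the study's lambda.
import random
from statistics import median

random.seed(1)
CHI2_1DF_MEDIAN = 0.4549364          # median of the chi-square(1) distribution

true_inflation = 1.3                 # assumed stratification effect
stats = [true_inflation * random.gauss(0, 1) ** 2 for _ in range(100_000)]

lambda_gc = median(stats) / CHI2_1DF_MEDIAN
print(f"lambda_GC ~ {lambda_gc:.2f}")    # close to the simulated 1.3
```

A λ near 1 indicates well-calibrated tests; values well above 1, as in the GSD cohort before mixed-model correction, signal systematic inflation from population structure.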
Author Summary
Humans and dogs are both affected by the allergic skin disease atopic dermatitis (AD), caused by an interaction between genetic and environmental factors. The German shepherd dog (GSD) is a high-risk breed for canine AD (CAD) and is also affected by low serum IgA levels. A Swedish cohort of GSDs was used as a model for human AD in this study. We performed a genome-wide association analysis in which a region associated with CAD was identified. IgA levels were included in the model due to their strong correlation with CAD; age at sampling was also included due to its correlation with IgA levels. The associated region, containing eight genes, was further fine-mapped with sequencing and additional genotyping. Haplotype association analysis of the fine-mapping data indicated association of the plakophilin 2 gene (PKP2), known to be important for skin structure. We detected a division of the GSD breed into two subpopulations, one of which is more prone to develop CAD and to have lower serum IgA levels than the other. Here, we present methods for performing genome-wide association analyses when the study population is complex and when the trait is affected by additional parameters. The PKP2 gene found within the associated region is an interesting target for further study of its importance in both canine and human AD.
doi:10.1371/journal.pgen.1003475
PMCID: PMC3649999  PMID: 23671420
18.  Human metabolic profiles are stably controlled by genetic and environmental variation 
A comprehensive variation map of the human metabolome identifies genetic and stable-environmental sources as major drivers of metabolite concentrations. The data suggest that sample sizes of a few thousand are sufficient to detect metabolite biomarkers predictive of disease.
We designed a longitudinal twin study to characterize the genetic, stable-environmental, and longitudinally fluctuating influences on metabolite concentrations in two human biofluids—urine and plasma—focusing specifically on the representative subset of metabolites detectable by 1H nuclear magnetic resonance (1H NMR) spectroscopy. We identified widespread genetic and stable-environmental influences on the urine and plasma metabolomes, with on average 30 and 42%, respectively, attributable to familial sources, and 47 and 60% attributable to longitudinally stable sources. Ten of the metabolites annotated in the study are estimated to have a familial contribution to their concentration variation in excess of 60%. Our findings have implications for the design and interpretation of 1H NMR-based molecular epidemiology studies. On the basis of the stable component of variation quantified here, we specified a model of disease association under which we inferred that sample sizes of a few thousand should be sufficient to detect disease-predictive metabolite biomarkers.
Metabolites are small molecules involved in biochemical processes in living systems. Their concentration in biofluids, such as urine and plasma, can offer insights into the functional status of biological pathways within an organism, and reflect input from multiple levels of biological organization—genetic, epigenetic, transcriptomic, and proteomic—as well as from environmental and lifestyle factors. Metabolite levels have the potential to indicate a broad variety of deviations from the ‘normal' physiological state, such as those that accompany a disease, or an increased susceptibility to disease. A number of recent studies have demonstrated that metabolite concentrations can be used to diagnose disease states accurately. A more ambitious goal is to identify metabolite biomarkers that are predictive of future disease onset, providing the possibility of intervention in susceptible individuals.
If an extreme concentration of a metabolite is to serve as an indicator of disease status, it is usually important to know the distribution of metabolite levels among healthy individuals. It is also useful to characterize the sources of that observed variation in the healthy population. A proportion of that variation—the heritable component—is attributable to genetic differences between individuals, potentially at many genetic loci. An effective, molecular indicator of a heritable, complex disease is likely to have a substantive heritable component. Non-heritable biological variation in metabolite concentrations can arise from a variety of environmental influences, such as dietary intake, lifestyle choices, general physical condition, composition of gut microflora, and use of medication. Variation across a population in stable-environmental influences leads to long-term differences between individuals in their baseline metabolite levels. Dynamic environmental pressures lead to short-term fluctuations within an individual about their baseline level. A metabolite whose concentration changes substantially in response to short-term pressures is relatively unlikely to offer long-term prediction of disease. In summary, the potential suitability of a metabolite to predict disease is reflected by the relative contributions of heritable and stable/unstable-environmental factors to its variation in concentration across the healthy population.
Studies involving twins are an established technique for quantifying the heritable component of phenotypes in human populations. Monozygotic (MZ) twins share the same DNA genome-wide, while dizygotic (DZ) twins share approximately half their inherited DNA, as do ordinary siblings. By comparing the average extent of phenotypic concordance within MZ pairs to that within DZ pairs, it is possible to quantify the heritability of a trait, and also to quantify the familiality, which refers to the combination of heritable and common-environmental effects (i.e., environmental influences shared by twins in a pair). In addition to incorporating twins into the study design, it is useful to quantify the phenotype in some individuals at multiple time points. The longitudinal aspect of such a study allows environmental effects to be decomposed into those that affect the phenotype over the short term and those that exert stable influence.
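The MZ/DZ comparison described above has a classical closed form (Falconer's formulas): heritability is twice the difference between the MZ and DZ intraclass correlations, and familiality equals the MZ correlation itself. A minimal sketch with hypothetical correlations:

```python
# Classical ACE twin-model estimates from MZ and DZ intraclass
# correlations (Falconer's formulas). The correlation values used
# below are hypothetical, not estimates from the study.
def ace_from_twin_correlations(r_mz, r_dz):
    """Return (A, C, E) variance shares: additive genetic,
    common environment, and unique environment."""
    a2 = 2 * (r_mz - r_dz)       # heritability
    c2 = 2 * r_dz - r_mz         # common (shared) environment
    e2 = 1 - r_mz                # unique environment + measurement noise
    return a2, c2, e2

a2, c2, e2 = ace_from_twin_correlations(r_mz=0.6, r_dz=0.4)
print(f"heritability={a2:.2f}, common env={c2:.2f}, unique env={e2:.2f}")
print(f"familiality (A + C) = {a2 + c2:.2f}")   # equals r_mz
```

Familiality (A + C combined) is the quantity the study reports as the familial contribution to metabolite variation, since twin correlations alone cannot cleanly split shared genes from shared environment without further modeling.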
For the current study, urine and blood samples were collected from a cohort of MZ and DZ twins, with some twins donating samples on two occasions several months apart. Samples were analysed by 1H nuclear magnetic resonance (1H NMR) spectroscopy—an untargeted, discovery-driven technique for quantifying metabolite concentrations in biological samples. The application of 1H NMR to a biological sample creates a spectrum, made up of multiple peaks, with each peak's size quantitatively representing the concentration of its corresponding hydrogen-containing metabolite.
In each biological sample in our study, we extracted a full set of peaks, and thereby quantified the concentrations of all common plasma and urine metabolites detectable by 1H NMR. We developed bespoke statistical methods to decompose the observed concentration variation at each metabolite peak into that originating from familial, individual-environmental, and unstable-environmental sources.
We quantified the variability landscape across all common metabolite peaks in the urine and plasma 1H NMR metabolomes. We annotated a subset of peaks with a total of 65 metabolites; the variance decompositions for these are shown in Figure 1. Ten metabolites' concentrations were estimated to have familial contributions in excess of 60%. The average proportion of stable variation across all extracted metabolite peaks was estimated to be 47% in the urine samples and 60% in the plasma samples; the average estimated familiality was 30% for urine and 42% for plasma. These results comprise the first quantitative variation map of the 1H NMR metabolome. The identification and quantification of substantive widespread stability provides support for the use of these biofluids in molecular epidemiology studies. On the basis of our findings, we performed power calculations for a hypothetical study searching for predictive disease biomarkers among 1H NMR-detectable urine and plasma metabolites. Our calculations suggest that sample sizes of 2000–5000 should allow reliable identification of disease-predictive metabolite concentrations explaining 5–10% of disease risk, while greater sample sizes of 5000–20 000 would be required to identify metabolite concentrations explaining 1–2% of disease risk.
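The scaling of required sample size with the variance a biomarker explains can be illustrated with a standard Fisher z-transform calculation for a correlation test. This is a simplified stand-in, not the study's own power model, and the significance threshold and power level below are assumptions:

```python
# Simplified sample-size sketch for detecting a metabolite whose
# concentration explains a fraction r2 of disease risk, using the
# Fisher z-transform approximation for a correlation test.
# alpha and power are illustrative assumptions.
from math import atanh, ceil
from statistics import NormalDist

def n_required(r2, alpha=1e-6, power=0.8):
    """Subjects needed to detect a correlation explaining r2 of risk."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided threshold
    z_b = NormalDist().inv_cdf(power)
    z_r = atanh(r2 ** 0.5)                       # Fisher z of correlation
    return ceil(((z_a + z_b) / z_r) ** 2 + 3)

for r2 in (0.10, 0.05, 0.01):
    print(f"R^2 = {r2:.0%}: n ~ {n_required(r2)}")
```

Even this crude approximation reproduces the qualitative conclusion: effects explaining 5-10% of risk need cohorts in the hundreds to low thousands, while 1-2% effects push the requirement into the thousands.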
1H Nuclear Magnetic Resonance spectroscopy (1H NMR) is increasingly used to measure metabolite concentrations in sets of biological samples for top-down systems biology and molecular epidemiology. For such purposes, knowledge of the sources of human variation in metabolite concentrations is valuable, but currently sparse. We conducted and analysed a study to create such a resource. In our unique design, identical and non-identical twin pairs donated plasma and urine samples longitudinally. We acquired 1H NMR spectra on the samples, and statistically decomposed variation in metabolite concentration into familial (genetic and common-environmental), individual-environmental, and longitudinally unstable components. We estimate that stable variation, comprising familial and individual-environmental factors, accounts on average for 60% (plasma) and 47% (urine) of biological variation in 1H NMR-detectable metabolite concentrations. Clinically predictive metabolic variation is likely nested within this stable component, so our results have implications for the effective design of biomarker-discovery studies. We provide a power-calculation method which reveals that sample sizes of a few thousand should offer sufficient statistical precision to detect 1H NMR-based biomarkers quantifying predisposition to disease.
doi:10.1038/msb.2011.57
PMCID: PMC3202796  PMID: 21878913
biomarker; 1H nuclear magnetic resonance spectroscopy; metabolome-wide association study; top-down systems biology; variance decomposition
19.  Automated identification of pathways from quantitative genetic interaction data 
We present a novel Bayesian learning method that reconstructs large detailed gene networks from quantitative genetic interaction (GI) data. The method uses global reasoning to handle missing and ambiguous measurements, and provides confidence estimates for each prediction. Applied to a recent data set over genes relevant to protein folding, the learned networks reflect known biological pathways, including details such as pathway ordering and directionality of relationships. The reconstructed networks also suggest novel relationships, including the placement of SGT2 in the tail-anchored biogenesis pathway, a finding that we experimentally validated.
Recent developments have enabled large-scale quantitative measurement of genetic interactions (GIs) that report on the extent to which the activity of one gene is dependent on a second. It has long been recognized (Avery and Wasserman, 1992; Hartman et al, 2001; Segre et al, 2004; Tong et al, 2004; Drees et al, 2005; Schuldiner et al, 2005; St Onge et al, 2007; Costanzo et al, 2010) that functional dependencies revealed by GI data can provide rich information regarding underlying biological pathways. Further, the precise phenotypic measurements provided by quantitative GI data can provide evidence for even more detailed aspects of pathway structure, such as differentiating between full and partial dependence between two genes (Drees et al, 2005; Schuldiner et al, 2005; St Onge et al, 2007; Jonikas et al, 2009) (Figure 1A). As GI data sets become available for a range of quantitative phenotypes and organisms, such patterns will allow researchers to elucidate pathways important to a diverse set of biological processes.
We present a new method that exploits the high-quality, quantitative nature of recent GI assays to automatically reconstruct detailed multi-gene pathway structures, including the organization of a large set of genes into coherent pathways, the connectivity and ordering within each pathway, and the directionality of each relationship. We introduce activity pathway networks (APNs), which represent functional dependencies among a set of genes in the form of a network. We present an automatic method to efficiently reconstruct APNs over large sets of genes based on quantitative GI measurements. This method handles uncertainty in the data arising from noise, missing measurements, and data points with ambiguous interpretations, by performing global reasoning that combines evidence from multiple data points. In addition, because some structure choices remain uncertain even when jointly considering all measurements, our method maintains multiple likely networks, and allows computation of confidence estimates over each structure choice.
We applied our APN reconstruction method to the recent high-quality GI data set of Jonikas et al (2009), which examined the functional interaction between genes that contribute to protein folding in the ER. Specifically, Jonikas et al used the cell's endogenous sensor (the unfolded protein response), to first identify several hundred yeast genes with functions in endoplasmic reticulum folding and then systematically characterized their functional interdependencies by measuring unfolded protein response levels in double mutants. Our analysis produced an ensemble of 500 likelihood-weighted APNs over 178 genes (Figure 2).
We performed an aggregate evaluation of our results by comparing to known biological relationships between gene pairs, including participation in pathways according to the Kyoto Encyclopedia of Genes and Genomes (KEGG), correlation of chemical genomic profiles in a recent high-throughput assay (Hillenmeyer et al, 2008) and similarity of Gene Ontology (GO) annotations. In each evaluation performed, our reconstructed APNs were significantly more consistent with the known relationships than either the raw GI values or the Pearson correlation between profiles of GI values.
Importantly, our approach provides not only an improved means for defining pairs or groups of related genes, but also enables the identification of detailed multi-gene network structures. In many cases, our method successfully reconstructed known cellular pathways, including the ER-associated degradation (ERAD) pathway, and the biosynthesis of N-linked glycans, ranking them among the highest confidence structures. In-depth examination of the learned network structures indicates agreement with many known details of these pathways. In addition, quantitative analysis indicates that our learned APNs are indicative of ordering within KEGG-annotated biological pathways.
Our results also suggest several novel relationships, including placement of uncharacterized genes into pathways, and novel relationships between characterized genes. These include the dependence of the J domain chaperone JEM1 on the PDI homolog MPD1, dependence of the Ubiquitin-recycling enzyme DOA4 on N-linked glycosylation, and the dependence of the E3 Ubiquitin ligase DOA10 on the signal peptidase complex subunit SPC2. Our APNs also place the poorly characterized TPR-containing protein SGT2 upstream of the tail-anchored protein biogenesis machinery components GET3, GET4, and MDY2 (also known as GET5), suggesting that SGT2 has a function in the insertion of tail-anchored proteins into membranes. Consistent with this prediction, our experimental analysis shows that sgt2Δ cells exhibit a defect in localization of the tail-anchored protein GFP-Sed5, which shifts from punctate Golgi structures to a more diffuse pattern, as seen for other genes involved in this pathway.
Our results show that multi-gene, detailed pathway networks can be reconstructed from quantitative GI data, providing a concrete computational manifestation to intuitions that have traditionally accompanied the manual interpretation of such data. Ongoing technological developments in both genetics and imaging are enabling the measurement of GI data at a genome-wide scale, using high-accuracy quantitative phenotypes that relate to a range of particular biological functions. Methods based on RNAi will soon allow collection of similar data for human cell lines and other mammalian systems (Moffat et al, 2006). Thus, computational methods for analyzing GI data could have an important function in mapping pathways involved in complex biological systems including human cells.
High-throughput quantitative genetic interaction (GI) measurements provide detailed information regarding the structure of the underlying biological pathways by reporting on functional dependencies between genes. However, the analytical tools for fully exploiting such information lag behind the ability to collect these data. We present a novel Bayesian learning method that uses quantitative phenotypes of double knockout organisms to automatically reconstruct detailed pathway structures. We applied our method to a recent data set that measures GIs for endoplasmic reticulum (ER) genes, using the unfolded protein response as a quantitative phenotype. The resulting networks reconstructed known functional pathways, including N-linked glycosylation and ER-associated protein degradation, and also contained novel relationships, such as the placement of SGT2 in the tail-anchored biogenesis pathway, a finding that we experimentally validated. Our approach should be readily applicable to the next generation of quantitative GI data sets, as assays become available for additional phenotypes and eventually higher-level organisms.
doi:10.1038/msb.2010.27
PMCID: PMC2913392  PMID: 20531408
computational biology; genetic interaction; pathway reconstruction; probabilistic methods
20.  Rapid multiplex high resolution melting method to analyze inflammatory related SNPs in preterm birth 
BMC Research Notes  2012;5:69.
Background
Complex traits like cancer, diabetes, obesity or schizophrenia arise from an intricate interaction between genetic and environmental factors. Complex disorders often cluster in families without a clear-cut pattern of inheritance. Genome-wide association studies focus on the detection of tens or hundreds of individual markers contributing to complex diseases. In order to test whether a subset of single nucleotide polymorphisms (SNPs) from candidate genes is associated with a condition of interest in a particular individual or group of people, new techniques are needed. High-resolution melting (HRM) analysis is a new method in which polymerase chain reaction (PCR) and mutation scanning are carried out simultaneously in a closed tube, making the procedure fast, inexpensive and easy. Preterm birth (PTB) is considered a complex disease, in which genetic and environmental factors interact to bring about the delivery of a newborn before 37 weeks of gestation. It is accepted that inflammation plays an important role in pregnancy and PTB.
Methods
Here, we used real-time PCR followed by HRM analysis to simultaneously identify several gene variants of inflammatory pathways in preterm labor. SNPs from the TLR4, IL6, IL1 beta and IL12RB genes were analyzed in a case-control study. The results were confirmed either by sequencing or by PCR followed by restriction fragment length polymorphism analysis.
Results
We were able to simultaneously genotype the variants of four genes with accuracy similar to that of other methods. The key step in this strategy was primer design, in order to obtain non-overlapping melting temperatures. Genotypic frequencies found for each SNP are in concordance with those previously described in similar populations. None of the studied SNPs were associated with PTB.
Conclusions
Several gene variants related to the same inflammatory pathway were screened with a new flexible, fast and inexpensive method in order to analyze their association with PTB. The method can easily be used to simultaneously analyze any set of SNPs, either as a first choice for new association studies or as a complement to large-scale genotyping analyses. Given that the inflammatory pathway underlies several diseases, the method is potentially useful for analyzing a broad range of disorders.
doi:10.1186/1756-0500-5-69
PMCID: PMC3298535  PMID: 22280494
21.  Network modeling of the transcriptional effects of copy number aberrations in glioblastoma 
DNA copy number aberrations (CNAs) are a characteristic feature of cancer genomes. In this work, Rebecka Jörnsten, Sven Nelander and colleagues combine network modeling and experimental methods to analyze the systems-level effects of CNAs in glioblastoma.
We introduce a modeling approach termed EPoC (Endogenous Perturbation analysis of Cancer), enabling the construction of global, gene-level models that causally connect gene copy number with expression in glioblastoma. On the basis of the resulting model, we predict genes that are likely to be disease-driving and validate selected predictions experimentally. We also demonstrate that further analysis of the network model by sparse singular value decomposition allows stratification of patients with glioblastoma into short-term and long-term survivors, introducing decomposed network models as a useful principle for biomarker discovery. Finally, in systematic comparisons, we demonstrate that EPoC is computationally efficient and yields more consistent results than mRNA-only methods, standard eQTL methods, and two recent multivariate methods for genotype–mRNA coupling.
Gains and losses of chromosomal material (DNA copy number aberrations; CNAs) are a characteristic feature of cancer genomes. At the level of a single locus, it is well known that increased copy number (gene amplification) typically leads to increased gene expression, whereas decreased copy number (gene deletion) leads to decreased gene expression (Pollack et al, 2002; Lee et al, 2008; Nilsson et al, 2008). However, CNAs also affect the expression of genes located outside the amplified/deleted region itself via indirect mechanisms. To fully understand the action of CNAs, it is therefore necessary to analyze their action in a network context. Toward this goal, improved computational approaches will be important, if not essential.
To determine the global effects on transcription of CNAs in the brain tumor glioblastoma, we develop EPoC (Endogenous Perturbation analysis of Cancer), a computational technique capable of inferring sparse, causal network models by combining genome-wide, paired CNA- and mRNA-level data. EPoC aims to detect disease-driving copy number aberrations and their effect on target mRNA expression, and stratify patients into long-term and short-term survivors. Technically, EPoC relates CNA perturbations to mRNA responses by matrix equations, derived from a steady-state approximation of the transcriptional network. Patient prognostic scores are obtained from singular value decompositions of the network matrix. The models are constructed by solving a large-scale, regularized regression problem.
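The matrix formulation can be sketched as follows. This is an illustrative toy, not the published EPoC implementation: it uses simulated data, a plain ridge penalty in place of EPoC's sparse regularization, and a simplified SVD-based patient score.

```python
import numpy as np

# Simulated setting: relate copy-number perturbations U (genes x patients)
# to mRNA responses Y via a network matrix A, as in Y = A @ U.
rng = np.random.default_rng(0)
n_genes, n_patients = 50, 30

U = rng.normal(size=(n_genes, n_patients))            # CNA perturbations
A_true = np.zeros((n_genes, n_genes))                 # sparse "true" network
A_true[rng.integers(0, n_genes, 40), rng.integers(0, n_genes, 40)] = 1.0
Y = A_true @ U + 0.1 * rng.normal(size=(n_genes, n_patients))  # mRNA levels

# Ridge-regularized least-squares estimate of the network matrix:
# A_hat = Y U^T (U U^T + lam I)^(-1)
lam = 1.0
A_hat = Y @ U.T @ np.linalg.inv(U @ U.T + lam * np.eye(n_genes))

# Toy patient scores from the leading singular vector of the network,
# projected onto each patient's CNA profile (a simplified analogue of
# decomposition-based stratification).
Uvec, s, Vt = np.linalg.svd(A_hat)
scores = Vt[0] @ U    # one score per patient
```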
We apply EPoC to glioblastoma data from The Cancer Genome Atlas (TCGA) consortium (186 patients). The identified CNA-driven network comprises 10 672 genes, and contains a number of copy number-altered genes that control multiple downstream genes. Highly connected hub genes include well-known oncogenes and tumor suppressor genes that are frequently deleted or amplified in glioblastoma, including EGFR, PDGFRA, CDKN2A and CDKN2B, confirming a clear association between these aberrations and the transcriptional variability of these brain tumors. In addition, we identify a number of hub genes that have not previously been associated with glioblastoma, including interferon alpha 1 (IFNA1), myeloid/lymphoid or mixed-lineage leukemia translocated to 10 (MLLT10, a well-known leukemia gene), glutamate decarboxylase 2 (GAD2), the postulated glutamate receptor GPR158, and necdin (NDN). Furthermore, we demonstrate that the network model contains useful information on downstream target genes (including stem cell regulators) and possible drug targets.
We proceed to explore the validity of a small network region experimentally. Introducing experimental perturbations of NDN and other targets in four glioblastoma cell lines (T98G, U-87MG, U-343MG and U-373MG), we confirm several predicted mechanisms. We also demonstrate that the TCGA glioblastoma patients can be stratified into long-term and short-term survivors, using our proposed prognostic scores derived from a singular value decomposition of the network model. Finally, we compare EPoC to existing methods for mRNA network analysis and to expression quantitative trait locus (eQTL) methods, and demonstrate that EPoC produces more consistent models between technically independent glioblastoma data sets, and that the EPoC models exhibit better overlap with known protein–protein interaction networks and pathway maps.
In summary, we conclude that large-scale integrative modeling reveals mechanistically and prognostically informative networks in human glioblastoma. Our approach operates at the gene level and our data support that individual hub genes can be identified in practice. Very large aberrations, however, cannot be fully resolved by the current modeling strategy.
DNA copy number aberrations (CNAs) are a hallmark of cancer genomes. However, little is known about how such changes affect global gene expression. We develop a modeling framework, EPoC (Endogenous Perturbation analysis of Cancer), to (1) detect disease-driving CNAs and their effect on target mRNA expression, and to (2) stratify cancer patients into long- and short-term survivors. Our method constructs causal network models of gene expression by combining genome-wide DNA- and RNA-level data. Prognostic scores are obtained from a singular value decomposition of the networks. By applying EPoC to glioblastoma data from The Cancer Genome Atlas consortium, we demonstrate that the resulting network models contain known disease-relevant hub genes, reveal interesting candidate hubs, and uncover predictors of patient survival. Targeted validations in four glioblastoma cell lines support selected predictions, and implicate the p53-interacting protein Necdin in suppressing glioblastoma cell growth. We conclude that large-scale network modeling of the effects of CNAs on gene expression may provide insights into the biology of human cancer. Free software in MATLAB and R is provided.
doi:10.1038/msb.2011.17
PMCID: PMC3101951  PMID: 21525872
cancer biology; cancer genomics; glioblastoma
22.  CHESS (CgHExpreSS): A comprehensive analysis tool for the analysis of genomic alterations and their effects on the expression profile of the genome 
BMC Bioinformatics  2009;10:424.
Background
Genomic alterations frequently occur in many cancer patients and play important mechanistic roles in the pathogenesis of cancer. Furthermore, they can modify the expression level of genes due to altered copy number in the corresponding region of the chromosome. An accumulating body of evidence supports the possibility that a strong genome-wide correlation exists between DNA content and gene expression. Therefore, more comprehensive analysis is needed to quantify the relationship between genomic alteration and gene expression, and a well-designed bioinformatics tool is essential for this kind of integrative analysis. A few programs have already been introduced for integrative analysis; however, comprehensive integrated analysis remains difficult with published software because of limitations in the implemented algorithms and visualization modules.
Results
To address this issue, we have implemented the Java-based program CHESS to allow integrative analysis of two experimental data sets: genomic alteration and genome-wide expression profile. CHESS is composed of a genomic alteration analysis module and an integrative analysis module. The genomic alteration analysis module detects genomic alterations by applying a threshold-based method or the SW-ARRAY algorithm and investigates whether a detected alteration is phenotype specific or not. The integrative analysis module, in turn, measures the genomic alteration's influence on gene expression. It is divided into two parts. The first part calculates the overall correlation between the comparative genomic hybridization ratio and the gene expression level by applying the following three statistical methods: simple linear regression, Spearman rank correlation, and Pearson's correlation. In the second part, CHESS detects genes that are differentially expressed according to the genomic alteration pattern, using three alternative statistical approaches: Student's t-test, Fisher's exact test, and the chi-square test. By successive operation of the two modules, users can clarify how gene expression levels are affected by phenotype-specific genomic alterations. As CHESS was developed as both a Java application and a web application, it can be run in a web browser or on a local machine. It supports all experimental platforms, provided a properly formatted text file is supplied that includes the chromosomal positions of probes and their gene identifiers.
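The two statistical steps of the integrative analysis module can be sketched in a few lines. CHESS itself is a Java tool; this Python fragment only illustrates the statistics on simulated data, and the 0.3 gain threshold is an arbitrary assumption for the example.

```python
import numpy as np
from scipy import stats

# Simulated paired data: log2 CGH ratios and a correlated expression signal.
rng = np.random.default_rng(1)
cgh_ratio = rng.normal(0, 0.5, 40)                     # log2 CGH ratios
expression = 0.8 * cgh_ratio + rng.normal(0, 0.3, 40)  # correlated mRNA levels

# Step 1: overall correlation between copy number and expression
# (two of the three methods CHESS offers).
r_pearson, p_pearson = stats.pearsonr(cgh_ratio, expression)
rho, p_spear = stats.spearmanr(cgh_ratio, expression)

# Step 2: differential expression by alteration status (Student's t-test),
# calling gains with a simple, hypothetical threshold on the CGH ratio.
gained = cgh_ratio > 0.3
t_stat, p_t = stats.ttest_ind(expression[gained], expression[~gained])
```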
Conclusions
CHESS is a user-friendly tool for investigating disease specific genomic alterations and quantitative relationships between those genomic alterations and genome-wide gene expression profiling.
doi:10.1186/1471-2105-10-424
PMCID: PMC2801522  PMID: 20003544
23.  Systematic Detection of Epistatic Interactions Based on Allele Pair Frequencies 
PLoS Genetics  2012;8(2):e1002463.
Epistatic genetic interactions are key for understanding the genetic contribution to complex traits. Epistasis is always defined with respect to some trait, such as growth rate or fitness. Whereas most existing epistasis screens explicitly test for a trait, it is also possible to test implicitly for fitness traits by searching for the over- or under-representation of allele pairs in a given population. Such analysis of imbalanced allele pair frequencies at distant loci has not yet been exploited on a genome-wide scale, mostly because of statistical difficulties such as the multiple testing problem. We propose a new approach called Imbalanced Allele Pair frequencies (ImAP) for inferring epistatic interactions that is based exclusively on DNA sequence information. Our approach is based on genome-wide SNP data sampled from a population with known family structure. We make use of genotype information from parent-child trios and inspect 3×3 contingency tables to detect pairs of alleles from different genomic positions that are over- or under-represented in the population. We also developed a simulation setup that mimics the pedigree structure while assuming independence of the markers. When applied to mouse SNP data, our method detected 168 imbalanced allele pairs, substantially more than in simulations assuming no interactions. We could validate a significant number of the interactions with external data, and we found that interacting loci are enriched for genes involved in developmental processes.
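The core contingency-table test can be sketched as follows, under strong simplifying assumptions: the data are simulated, unrelated individuals stand in for the parent-child trios, and no pedigree modeling or multiple-testing correction is applied.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Simulate genotypes (0/1/2 copies of the minor allele) at two loci,
# making locus B partially depend on locus A to mimic an interaction.
rng = np.random.default_rng(2)
n = 2000
geno_a = rng.choice([0, 1, 2], size=n, p=[0.49, 0.42, 0.09])
geno_b = np.where(rng.random(n) < 0.3, geno_a, rng.choice([0, 1, 2], size=n))

# Cross-tabulate the genotype pairs into a 3x3 contingency table.
table = np.zeros((3, 3), dtype=int)
for a, b in zip(geno_a, geno_b):
    table[a, b] += 1

# Chi-square test for imbalance between observed and expected pair counts.
chi2, p, dof, expected = chi2_contingency(table)
```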
Author Summary
Elucidating non-additive (epistatic) interactions between genes is crucial for understanding the molecular mechanisms of complex diseases. Even though high-throughput, systematic testing of genetic interactions is possible in simple model organisms, such screens have so far not been successful in mammals. Here, we propose a computational screening method that only requires genotype information of family trios for predicting genetic interactions. We tested our framework on a set of more than 2,000 heterozygous mice and found 168 imbalanced allele pairs, which is substantially more than expected by chance. We confirmed many of these interactions using data from recombinant inbred lines. The number of significant allele pair imbalances that we detected is surprisingly large and was not expected based on the published evidence. Our framework sets the stage for similar work in human trios.
doi:10.1371/journal.pgen.1002463
PMCID: PMC3276547  PMID: 22346757
24.  Capturing the Spectrum of Interaction Effects in Genetic Association Studies by Simulated Evaporative Cooling Network Analysis 
PLoS Genetics  2009;5(3):e1000432.
Evidence from human genetic studies of several disorders suggests that interactions between alleles at multiple genes play an important role in influencing phenotypic expression. Analytical methods for identifying Mendelian disease genes are not appropriate when applied to common multigenic diseases, because such methods investigate association with the phenotype only one genetic locus at a time. New strategies are needed that can capture the spectrum of genetic effects, from Mendelian to multifactorial epistasis. Random Forests (RF) and Relief-F are two powerful machine-learning methods that have been studied as filters for genetic case-control data due to their ability to account for the context of alleles at multiple genes when scoring the relevance of individual genetic variants to the phenotype. However, when variants interact strongly, the independence assumption of RF in the tree node-splitting criterion leads to diminished importance scores for relevant variants. Relief-F, on the other hand, was designed to detect strong interactions but is sensitive to large backgrounds of variants that are irrelevant to classification of the phenotype, which is an acute problem in genome-wide association studies. To overcome the weaknesses of these data mining approaches, we develop Evaporative Cooling (EC) feature selection, a flexible machine learning method that can integrate multiple importance scores while removing irrelevant genetic variants. To characterize detailed interactions, we construct a genetic-association interaction network (GAIN), whose edges quantify the synergy between variants with respect to the phenotype. We use simulation analysis to show that EC is able to identify a wide range of interaction effects in genetic association data. We apply the EC filter to a smallpox vaccine cohort study of single nucleotide polymorphisms (SNPs) and infer a GAIN for a collection of SNPs associated with adverse events. 
Our results suggest an important role for hubs in SNP disease susceptibility networks. The software is available at http://sites.google.com/site/McKinneyLab/software.
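A GAIN-style synergy score between two variants and a phenotype can be illustrated with interaction information, I(A;B;P) = I(A,B;P) − I(A;P) − I(B;P), estimated with plug-in entropies. This is a hedged sketch on simulated XOR-like data; the published method's exact estimator and normalization may differ.

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Plug-in Shannon entropy (bits) of a sequence of hashable symbols."""
    n = len(samples)
    p = np.array([c / n for c in Counter(samples).values()])
    return float(-(p * np.log2(p)).sum())

def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), plug-in estimate."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def interaction_info(a, b, pheno):
    """Synergy of two variants w.r.t. a phenotype: positive means the pair
    carries more information together than the sum of their solo effects."""
    joint_ab = list(zip(a, b))
    return mutual_info(joint_ab, pheno) - mutual_info(a, pheno) - mutual_info(b, pheno)

# XOR-like phenotype: neither variant is informative alone, but the pair
# is fully informative, so the interaction information is strongly positive.
rng = np.random.default_rng(3)
a = rng.integers(0, 2, 5000).tolist()
b = rng.integers(0, 2, 5000).tolist()
pheno = [x ^ y for x, y in zip(a, b)]
```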
Author Summary
Susceptibility to many diseases and disorders is caused by breakdown at multiple points in the genetic network. Each of these points of breakdown by itself may have a very modest effect on disease risk, but the points may have a much stronger effect through statistical interactions with each other. Genome-wide association studies provide the opportunity to identify alleles at multiple loci that interact to influence phenotypic variation in common diseases and disorders. However, if each SNP is tested for association as though it were independent of the rest of the genome, then the full advantage of genome-wide marker variation will not be realized. In this study, we illustrate the utility of a new approach to high-dimensional genetic association analysis that treats the collection of SNPs as interacting on a system level. This approach uses a machine-learning filter followed by an information theoretic and graph theoretic approach to infer a phenotype-specific network of interacting SNPs.
doi:10.1371/journal.pgen.1000432
PMCID: PMC2653647  PMID: 19300503
25.  Comparing baseline and longitudinal measures in association studies 
BMC Proceedings  2014;8(Suppl 1):S84.
In recent years, longitudinal family-based studies have had success in identifying genetic variants that influence complex traits in genome-wide association studies. In this paper, we suggest that longitudinal analyses may contain valuable information that enables identification of additional associations compared to baseline analyses. Using Genetic Analysis Workshop 18 data, consisting of whole genome sequence data in a pedigree-based sample, we compared 3 methods for the genetic analysis of longitudinal data to an analysis that used baseline data only. These longitudinal methods were (a) a longitudinal mixed-effects model; (b) analysis of the mean trait over time; and (c) a 2-stage analysis, with estimation of a random intercept in the first stage and regression of the random intercept on a single-nucleotide polymorphism in the second stage. All methods accounted for the familial correlation among subjects within a pedigree. The analyses considered common variants with minor allele frequency above 5% on chromosome 3 and were performed without knowledge of the simulation model. The 3 longitudinal methods showed consistent results, which were generally different from those found using only the baseline observation. The gene CACNA2D3, identified by both the longitudinal and baseline approaches, had a stronger signal in the longitudinal analysis (p = 2.65 × 10−7) than in the baseline analysis (p = 2.48 × 10−5). The effect sizes from the longitudinal mixed-effects model and the mean-trait analysis were larger than those from the 2-stage approach. The longitudinal analyses provided stable results that differed from those based on a single observation at baseline and generally had lower p values.
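The 2-stage approach can be sketched as follows, with simplifications: the random intercept is approximated by the subject mean over time points rather than estimated from a mixed model, familial correlation is ignored, and all data are simulated.

```python
import numpy as np

# Simulate a longitudinal trait for 200 subjects at 4 time points, where a
# SNP (0/1/2 minor-allele copies, MAF ~0.2) shifts each subject's intercept.
rng = np.random.default_rng(4)
n_subjects, n_times = 200, 4
snp = rng.choice([0, 1, 2], size=n_subjects, p=[0.64, 0.32, 0.04])
intercept = 1.0 + 0.5 * snp + rng.normal(0, 1, n_subjects)  # true intercepts
trait = intercept[:, None] + rng.normal(0, 1, (n_subjects, n_times))

# Stage 1: approximate each subject's random intercept by the mean over
# repeated measures (a mixed model would shrink these toward the center).
stage1 = trait.mean(axis=1)

# Stage 2: ordinary least squares of the stage-1 intercepts on genotype.
X = np.column_stack([np.ones(n_subjects), snp])
beta, *_ = np.linalg.lstsq(X, stage1, rcond=None)  # beta[1] = SNP effect
```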
doi:10.1186/1753-6561-8-S1-S84
PMCID: PMC4143666  PMID: 25519412
