Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted.
We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), and the last is a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR), were compared on a large number of simulated data sets, each, consistent with complex disease models, embedding multiple sets of interacting SNPs, under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study. First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM. Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs.
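As an illustration of what an information-theoretic interaction score measures, the sketch below computes one common form of information gain for a SNP pair: the mutual information between the joint genotype and disease status, minus the two single-SNP mutual informations. It is a minimal sketch and not the IG implementation evaluated in the study; the function names and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_gain(snp_a, snp_b, status):
    """Information about status carried jointly by two SNPs beyond their main effects."""
    joint = list(zip(snp_a, snp_b))
    return (mutual_information(joint, status)
            - mutual_information(snp_a, status)
            - mutual_information(snp_b, status))


# Hypothetical toy data: status depends on the pair (XOR pattern), not on either SNP alone.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
status = [0, 1, 1, 0]
gain = interaction_gain(snp_a, snp_b, status)
```

In this toy case neither SNP is informative on its own, so a main-effect test would see nothing, while the pairwise score is positive.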
This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at: http://code.google.com/p/simulation-tool-bmc-ms9169818735220977/downloads/list.
Motivation: Identification of somatic DNA copy number alterations (CNAs) and significant consensus events (SCEs) in cancer genomes is a main task in discovering potential cancer-driving genes such as oncogenes and tumor suppressors. The recent development of SNP array technology has facilitated studies on copy number changes at a genome-wide scale with high resolution. However, existing copy number analysis methods are oblivious to normal cell contamination and cannot distinguish between contributions of cancerous and normal cells to the measured copy number signals. This contamination could significantly confound downstream analysis of CNAs and affect the power to detect SCEs in clinical samples.
Results: We report here a statistically principled in silico approach, Bayesian Analysis of COpy number Mixtures (BACOM), to accurately estimate genomic deletion type and normal tissue contamination, and accordingly recover the true copy number profile in cancer cells. We tested the proposed method on two simulated datasets, two prostate cancer datasets and The Cancer Genome Atlas high-grade ovarian dataset, and obtained very promising results supported by the ground truth and biological plausibility. Moreover, based on a large number of comparative simulation studies, the proposed method gives significantly improved power to detect SCEs after in silico correction of normal tissue contamination. We develop a cross-platform open-source Java application that implements the whole pipeline of copy number analysis of heterogeneous cancer tissues including relevant processing steps. We also provide an R interface, bacomR, for running BACOM within the R environment, making it straightforward to include in existing data pipelines.
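To make the effect of normal cell contamination concrete, the sketch below assumes the simplest possible linear mixing model: the measured copy number is a weighted average of a diploid normal component and the cancer component. This is an illustrative assumption rather than the BACOM estimation procedure, which infers the normal fraction from the data; here the fraction is simply supplied, and both function names are hypothetical.

```python
def observed_copy_number(tumor_cn, normal_fraction):
    """Signal measured from a mixture of diploid normal cells and cancer cells."""
    return normal_fraction * 2.0 + (1.0 - normal_fraction) * tumor_cn


def corrected_copy_number(observed_cn, normal_fraction):
    """Invert the linear mixture to recover the cancer-cell copy number."""
    if not 0.0 <= normal_fraction < 1.0:
        raise ValueError("normal_fraction must be in [0, 1)")
    return (observed_cn - 2.0 * normal_fraction) / (1.0 - normal_fraction)


# A hemizygous deletion (copy number 1) in a sample with 40% normal cells
# is measured closer to 2 than to 1 under this model.
measured = observed_copy_number(tumor_cn=1.0, normal_fraction=0.4)
recovered = corrected_copy_number(measured, normal_fraction=0.4)
```

Under this assumption, contamination shrinks every alteration toward the diploid baseline, which is why uncorrected profiles can understate deletions and amplifications.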
Availability: The cross-platform, stand-alone Java application, BACOM, the R interface, bacomR, all source code and the simulation data used in this article are freely available at authors' web site: http://www.cbil.ece.vt.edu/software.htm.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Summary: Differential dependency network (DDN) is a caBIG® (cancer Biomedical Informatics Grid) analytical tool for detecting and visualizing statistically significant topological changes in transcriptional networks representing two biological conditions. Developed under caBIG®'s In Silico Research Centers of Excellence (ISRCE) Program, DDN enables differential network analysis and provides an alternative way of defining network biomarkers predictive of phenotypes. DDN also serves as a useful systems biology tool for users across biomedical research communities to infer how genetic, epigenetic or environmental variables may affect biological networks and clinical phenotypes. Besides the standalone Java application, we have also developed a Cytoscape plug-in, CytoDDN, to integrate network analysis and visualization seamlessly.
Availability: The Java and MATLAB source code can be downloaded at the authors' web site http://www.cbil.ece.vt.edu/software.htm
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: In vivo dynamic contrast-enhanced imaging tools provide non-invasive methods for analyzing various functional changes associated with disease initiation, progression and responses to therapy. The quantitative application of these tools has been hindered by their inability to accurately resolve and characterize targeted tissues due to spatially mixed tissue heterogeneity. The Convex Analysis of Mixtures – Compartment Modeling (CAM-CM) signal deconvolution tool has been developed to automatically identify pure-volume pixels located at the corners of the clustered pixel time series scatter simplex and subsequently estimate tissue-specific pharmacokinetic parameters. CAM-CM can dissect complex tissues into regions with differential tracer kinetics at pixel-wise resolution and provide a systems biology tool for defining imaging signatures predictive of phenotypes.
Availability: The MATLAB source code can be downloaded at the authors' website www.cbil.ece.vt.edu/software.htm
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: Phenotypic Up-regulated Gene Support Vector Machine (PUGSVM) is a cancer Biomedical Informatics Grid (caBIG™) analytical tool for multiclass gene selection and classification. PUGSVM addresses the problem of imbalanced class separability, small sample size and high gene space dimensionality, where multiclass gene markers are defined by the union of one-versus-everyone phenotypic upregulated genes, and used by a well-matched one-versus-rest support vector machine. PUGSVM provides a simple yet more accurate strategy to identify statistically reproducible mechanistic marker genes for characterization of heterogeneous diseases.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Despite their success in identifying genes that affect complex disease or traits, current genome-wide association studies (GWASs) based on a single SNP analysis are too simple to elucidate a comprehensive picture of the genetic architecture of phenotypes. A simultaneous analysis of a large number of SNPs, although statistically challenging, especially with a small number of samples, is crucial for genetic modeling.
Method: We propose a two-stage procedure for multi-SNP modeling and analysis in GWASs, by first producing a ‘preconditioned’ response variable using a supervised principal component analysis and then formulating Bayesian lasso to select a subset of significant SNPs. The Bayesian lasso is implemented with a hierarchical model, in which scale mixtures of normals are used as prior distributions for the genetic effects and exponential priors are considered for their variances, and then solved by using the Markov chain Monte Carlo (MCMC) algorithm. Our approach obviates the choice of the lasso parameter by imposing a diffuse hyperprior on it and estimating it along with other parameters and is particularly powerful for selecting the most relevant SNPs for GWASs, where the number of predictors exceeds the number of observations.
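For orientation, a generic Bayesian lasso hierarchy of the kind described above can be written as follows. This is a textbook-style sketch, not necessarily the exact parameterization used in the article: the Gamma hyperprior on the squared lasso parameter, the symbols a and b, and the omission of any scaling of the prior variances by the residual variance are all assumptions. Here y_i denotes the preconditioned response and x_ij the coded genotype of individual i at SNP j.

```latex
\begin{align}
y_i \mid \boldsymbol{\beta}, \sigma^2 &\sim \mathcal{N}\Big(\mu + \sum_{j=1}^{p} x_{ij}\beta_j,\; \sigma^2\Big), \\
\beta_j \mid \tau_j^2 &\sim \mathcal{N}\big(0,\; \tau_j^2\big), \\
\tau_j^2 \mid \lambda &\sim \mathrm{Exp}\big(\lambda^2/2\big), \\
\lambda^2 &\sim \mathrm{Gamma}(a, b), \qquad a, b \text{ small (diffuse)}.
\end{align}
% Integrating out the variance recovers the Laplace (lasso) prior:
\begin{equation}
p(\beta_j \mid \lambda)
  = \int_0^\infty \mathcal{N}\big(\beta_j;\, 0, \tau_j^2\big)\,
    \frac{\lambda^2}{2}\, e^{-\lambda^2 \tau_j^2 / 2}\, d\tau_j^2
  = \frac{\lambda}{2}\, e^{-\lambda |\beta_j|}.
\end{equation}
```

The last identity is what links the normal scale mixture to the lasso penalty, and placing a prior on the lasso parameter is what lets MCMC estimate it alongside the effects instead of fixing it in advance.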
Results: The new approach was examined through a simulation study. By using the approach to analyze a real dataset from the Framingham Heart Study, we detected several significant genes that are associated with body mass index (BMI). Our findings support the previous results about BMI-related SNPs and, meanwhile, gain new insights into the genetic control of this trait.
Availability: The computer code for the approach developed is available at Penn State Center for Statistical Genetics web site, http://statgen.psu.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: The identification of gene regulatory modules is an important yet challenging problem in computational biology. While many computational methods have been proposed to identify regulatory modules, their initial success is largely compromised by a high rate of false positives, especially when applied to human cancer studies. New strategies are needed for reliable regulatory module identification.
Results: We present a new approach, namely multilevel support vector regression (ml-SVR), to systematically identify condition-specific regulatory modules. The approach is built upon a multilevel analysis strategy designed for suppressing false positive predictions. With this strategy, a regulatory module becomes ever more significant as more relevant gene sets are formed at finer levels. At each level, a two-stage support vector regression (SVR) method is utilized to help reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure is followed to assess the significance of each regulatory module. To evaluate the effectiveness of the proposed strategy, we first compared the ml-SVR approach with other existing methods on simulation data and yeast cell cycle data. The resulting performance shows that the ml-SVR approach outperforms other methods in the identification of both regulators and their target genes. We then applied our method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer.
Availability and implementation: The ml-SVR MATLAB package can be downloaded at http://www.cbil.ece.vt.edu/software.htm
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Significant efforts have been made to acquire data under different conditions and to construct static networks that can explain various gene regulation mechanisms. However, gene regulatory networks are dynamic and condition-specific; under different conditions, networks exhibit different regulation patterns accompanied by different transcriptional network topologies. Thus, an investigation on the topological changes in transcriptional networks can facilitate the understanding of cell development or provide novel insights into the pathophysiology of certain diseases, and help identify the key genetic players that could serve as biomarkers or drug targets.
Results: Here, we report a differential dependency network (DDN) analysis to detect statistically significant topological changes in the transcriptional networks between two biological conditions. We propose a local dependency model to represent the local structures of a network by a set of conditional probabilities. We develop an efficient learning algorithm to learn the local dependency model using the Lasso technique. A permutation test is subsequently performed to estimate the statistical significance of each learned local structure. In testing on a simulation dataset, the proposed algorithm accurately detected all the genes with network topological changes. The method was then applied to the estrogen-dependent T-47D estrogen receptor-positive (ER+) breast cancer cell line datasets and human and mouse embryonic stem cell datasets. In both experiments using real microarray datasets, the proposed method produced biologically meaningful results. We expect DDN to emerge as an important bioinformatics tool in transcriptional network analyses. While we focus specifically on transcriptional networks, the DDN method we introduce here is generally applicable to other biological networks with similar characteristics.
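The permutation step can be illustrated with a generic label-shuffling test. The sketch below is not the DDN algorithm: it swaps the Lasso-based local dependency model for a much simpler stand-in statistic (the absolute difference in Pearson correlation of one gene pair between the two conditions), and all function names and data are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def dependency_change(samples, labels):
    """Stand-in statistic: |corr under condition 0 - corr under condition 1|."""
    groups = {0: ([], []), 1: ([], [])}
    for (gene1, gene2), label in zip(samples, labels):
        groups[label][0].append(gene1)
        groups[label][1].append(gene2)
    return abs(pearson(*groups[0]) - pearson(*groups[1]))


def permutation_p_value(samples, labels, n_perm=1000, seed=0):
    """Share of label shufflings whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = dependency_change(samples, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if dependency_change(samples, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


# Hypothetical data: the gene pair is positively coupled in condition 0
# and negatively coupled in condition 1.
samples = [(float(i), float(i)) for i in range(1, 11)]
samples += [(float(i), -float(i)) for i in range(11, 21)]
labels = [0] * 10 + [1] * 10
p_value = permutation_p_value(samples, labels)
```

Shuffling condition labels preserves the pooled data while breaking any condition-specific structure, so the shuffled statistics approximate the null distribution without distributional assumptions.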
Availability: The DDN MATLAB toolbox and experiment data are available at http://www.cbil.ece.vt.edu/software.htm.
Supplementary information: Supplementary data are available at Bioinformatics online.
Molecular and epidemiological evidence demonstrates that altered gene expression and single nucleotide polymorphisms in the apoptotic pathway are linked to many cancers. Yet, few studies emphasize the interaction of variant apoptotic genes and their joint modifying effects on prostate cancer (PCA) outcomes. An exhaustive assessment of all the possible two-, three- and four-way gene-gene interactions is computationally burdensome. This statistical conundrum stems from the prohibitive amount of data needed to account for multiple hypothesis testing.
To address this issue, we systematically prioritized and evaluated individual effects and complex interactions among 172 apoptotic SNPs in relation to PCA risk and aggressive disease (i.e., Gleason score ≥ 7 and tumor stages III/IV). Single and joint modifying effects on PCA outcomes among European-American men were analyzed using statistical epistasis networks coupled with multi-factor dimensionality reduction (SEN-guided MDR). The case-control study design included 1,175 incident PCA cases and 1,111 controls from the prostate, lung, colo-rectal, and ovarian (PLCO) cancer screening trial. Moreover, a subset analysis of PCA cases consisted of 688 aggressive and 488 non-aggressive PCA cases. SNP profiles were obtained using the NCI Cancer Genetic Markers of Susceptibility (CGEMS) data portal. Main effects were assessed using logistic regression (LR) models. Prior to modeling interactions, SEN was used to pre-process our genetic data. SEN used network science to reduce our analysis from > 36 million to < 13,000 SNP interactions. Interactions were visualized, evaluated, and validated using entropy-based MDR. All parametric and non-parametric models were adjusted for age, family history of PCA, and multiple hypothesis testing.
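The size of the unreduced search space can be checked with a short calculation. The sketch assumes that the "> 36 million" figure counts unordered two-, three- and four-way combinations of the 172 SNPs; the text does not state how the figure was obtained, so this is one plausible reading.

```python
from math import comb

n_snps = 172

# Number of unordered k-way SNP combinations for k = 2, 3, 4.
combinations_by_order = {k: comb(n_snps, k) for k in (2, 3, 4)}
total_combinations = sum(combinations_by_order.values())

for order, count in combinations_by_order.items():
    print(f"{order}-way combinations: {count:,}")
print(f"total: {total_combinations:,}")
```

Under that reading the total is just over 36 million, dominated by the four-way combinations, which is consistent with the reduction reported above.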
Following LR modeling, eleven and thirteen sequence variants were associated with PCA risk and aggressive disease, respectively. However, none of these markers remained significant after we adjusted for multiple comparisons. Nevertheless, we detected a modest synergistic interaction between AKT3 rs2125230-PRKCQ rs571715 and disease aggressiveness using SEN-guided MDR (p = 0.011).
In summary, entropy-based SEN-guided MDR facilitated the logical prioritization and evaluation of apoptotic SNPs in relation to aggressive PCA. The suggestive interaction between AKT3-PRKCQ and aggressive PCA requires further validation using independent observational studies.
Prostate cancer; Apoptosis; Single nucleotide polymorphisms; Gene-gene interactions; Multifactor dimensionality reduction (MDR); Statistical epistasis networks (SEN)
Ancestry informative markers (AIMs) are a type of genetic marker that is informative for tracing the ancestral ethnicity of individuals. Application of AIMs has gained substantial attention in population genetics, forensic sciences, and medical genetics. Single nucleotide polymorphisms (SNPs), the materials of AIMs, are useful for classifying individuals from distinct continental origins but cannot discriminate individuals with subtle genetic differences from closely related ancestral lineages. Proof-of-principle studies have shown that gene expression (GE) also is a heritable human variation that exhibits differential intensity distributions among ethnic groups. GE supplies ethnic information supplemental to SNPs; this motivated us to integrate SNP and GE markers to construct AIM panels with a reduced number of required markers and provide high accuracy in ancestry inference. Few studies in the literature have considered GE in this aspect, and none have integrated SNP and GE markers to aid classification of samples from closely related ethnic populations.
We integrated a forward variable selection procedure into flexible discriminant analysis to identify key SNP and/or GE markers with the highest cross-validation prediction accuracy. By analyzing genome-wide SNP and/or GE markers in 210 independent samples from four ethnic groups in the HapMap II Project, we found that average testing accuracies for a majority of classification analyses were quite high, except for SNP-only analyses that were performed to discern study samples containing individuals from two close Asian populations. The average testing accuracies ranged from 0.53 to 0.79 for SNP-only analyses and increased to around 0.90 when GE markers were integrated together with SNP markers for the classification of samples from closely related Asian populations. Compared to GE-only analyses, integrative analyses of SNP and GE markers showed comparable testing accuracies and a reduced number of selected markers in AIM panels.
Integrative analysis of SNP and GE markers provides high-accuracy and/or cost-effective classification results for assigning samples from closely related or distantly related ancestral lineages to their original ancestral populations. User-friendly BIASLESS (Biomarkers Identification and Samples Subdivision) software was developed as an efficient tool for selecting key SNP and/or GE markers and then building models for sample subdivision. BIASLESS was programmed in R and R-GUI and is available online at http://www.stat.sinica.edu.tw/hsinchou/genetics/prediction/BIASLESS.htm.
Single nucleotide polymorphism (SNP); Allele frequency; Gene expression; HapMap; Classification analysis; Ancestry informative marker (AIM)
Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.
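The three modes of contribution mentioned above correspond to three standard codings of a genotype as a regression covariate. The sketch below shows those codings for a genotype given as a minor allele count; the function name and the dictionary layout are illustrative assumptions and are not taken from the authors' software.

```python
def encode_genotype(minor_allele_count):
    """Return additive, dominant and recessive codings for one genotype (0, 1 or 2)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 copies of the minor allele")
    return {
        "additive": minor_allele_count,                     # effect per allele copy
        "dominant": 1 if minor_allele_count >= 1 else 0,    # any copy is enough
        "recessive": 1 if minor_allele_count == 2 else 0,   # two copies required
    }


# Codings for a heterozygous individual.
heterozygote = encode_genotype(1)
```

Offering all three codings per SNP multiplies the number of candidate covariates, which is one reason a prior with a sharp mode at zero is useful for keeping the selected model sparse.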
Tests of association with disease status are normally conducted one SNP at a time, ignoring the effects of all other genotyped SNPs. We developed a computationally efficient method to simultaneously analyse all SNPs, either in a genome-wide association (GWA) study, or a fine-mapping study based on re-sequencing and/or imputation. The method selects a subset of SNPs that best predicts disease status, while controlling the type-I error of the selected SNPs. This brings many advantages over standard single-SNP approaches, because the signal from a particular SNP can be more clearly assessed when other SNPs associated with disease status are already included in the model. Thus, in comparison with single-SNP analyses, power is increased and the false positive rate is reduced because of reduced residual variation. Localisation is also greatly improved. We demonstrate these advantages over the widely used single-SNP Armitage Trend Test using GWA simulation studies, a real GWA dataset, and a sequence-based fine-mapping simulation study.
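For reference, the single-SNP baseline can be sketched as the standard Cochran-Armitage trend test on a 2 x 3 table of genotype counts. This is a minimal textbook implementation with additive weights (0, 1, 2), not code from the study; the function name and the example counts are hypothetical.

```python
from math import erfc, sqrt


def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test.

    cases[i] and controls[i] are counts of individuals carrying i minor alleles.
    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    col = [r + s for r, s in zip(cases, controls)]  # genotype column totals

    t_stat = sum(w * (r * n_controls - s * n_cases)
                 for w, r, s in zip(weights, cases, controls))
    var = sum(w * w * c * (n_total - c) for w, c in zip(weights, col))
    var -= 2 * sum(weights[i] * weights[j] * col[i] * col[j]
                   for i in range(len(col)) for j in range(i + 1, len(col)))
    var *= n_cases * n_controls / n_total

    chi2 = t_stat ** 2 / var
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of a chi-square with 1 df
    return chi2, p_value


# Hypothetical counts: the minor allele is more frequent among cases.
chi2, p_value = armitage_trend_test(cases=(10, 20, 30), controls=(30, 20, 10))
```

Because the test looks at one SNP in isolation, any signal from other associated SNPs stays in the residual variation, which is the limitation the simultaneous analysis is designed to address.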
Motivation: Identification of transcriptional regulatory networks (TRNs) is of significant importance in computational biology for cancer research, providing a critical building block to unravel disease pathways. However, existing methods for TRN identification suffer from the inclusion of excessive ‘noise’ in microarray data and false-positives in binding data, especially when applied to human tumor-derived cell line studies. More robust methods that can counteract the imperfection of data sources are therefore needed for reliable identification of TRNs in this context.
Results: In this article, we propose to establish a link between the quality of one target gene to represent its regulator and the uncertainty of its expression to represent other target genes. Specifically, an outlier sum statistic was used to measure the aggregated evidence for regulation events between target genes and their corresponding transcription factors. A Gibbs sampling method was then developed to estimate the marginal distribution of the outlier sum statistic, hence, to uncover underlying regulatory relationships. To evaluate the effectiveness of our proposed method, we compared its performance with that of an existing sampling-based method using both simulation data and yeast cell cycle data. The experimental results show that our method consistently outperforms the competing method in different settings of signal-to-noise ratio and network topology, indicating its robustness for biological applications. Finally, we applied our method to breast cancer cell line data and demonstrated its ability to extract biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer.
Availability and implementation: The Gibbs sampler MATLAB package is freely available at http://www.cbil.ece.vt.edu/software.htm.
Supplementary information: Supplementary data are available at Bioinformatics online.
Somatic Copy Number Alterations (CNAs) in human genomes are present in almost all human cancers. Systematic efforts to characterize such structural variants must effectively distinguish significant consensus events from random background aberrations. Here we introduce Significant Aberration in Cancer (SAIC), a new method for characterizing and assessing the statistical significance of recurrent CNA units. Three main features of SAIC include: (1) exploiting the intrinsic correlation among consecutive probes to assign a score to each CNA unit instead of single probes; (2) performing permutations on CNA units that preserve correlations inherent in the copy number data; and (3) iteratively detecting Significant Copy Number Aberrations (SCAs) and estimating an unbiased null distribution by applying an SCA-exclusive permutation scheme.
We test and compare the performance of SAIC against four peer methods (GISTIC, STAC, KC-SMART, CMDS) on a large number of simulation datasets. Experimental results show that SAIC outperforms peer methods in terms of larger area under the Receiver Operating Characteristics curve and increased detection power. We then apply SAIC to analyze structural genomic aberrations acquired in four real cancer genome-wide copy number data sets (ovarian cancer, metastatic prostate cancer, lung adenocarcinoma, glioblastoma). When compared with previously reported results, SAIC successfully identifies most SCAs known to be of biological significance and associated with oncogenes (e.g., KRAS, CCNE1, and MYC) or tumor suppressor genes (e.g., CDKN2A/B). Furthermore, SAIC identifies a number of novel SCAs in these copy number data that encompass tumor related genes and may warrant further studies.
Supported by a well-grounded theoretical framework, SAIC has been developed and used to identify SCAs in various cancer copy number data sets, providing useful information to study the landscape of cancer genomes. Open-source and platform-independent SAIC software is implemented using C++, together with R scripts for data formatting and Perl scripts for user interfacing, and it is easy to install and efficient to use. The source code and documentation are freely available at http://www.cbil.ece.vt.edu/software.htm.
Motivation: Imaging genetic studies typically focus on identifying single-nucleotide polymorphism (SNP) markers associated with imaging phenotypes. Few studies perform regression of SNP values on phenotypic measures to examine how the SNP values change when phenotypic measures are varied. This alternative approach has the potential to help us discover important imaging genetic associations from a different perspective. In addition, the imaging markers are often measured over time, and this longitudinal profile may provide increased power for differentiating genotype groups. How to identify the longitudinal phenotypic markers associated with disease-sensitive SNPs is an important and challenging research topic.
Results: Taking into account the temporal structure of the longitudinal imaging data and the interrelatedness among the SNPs, we propose a novel ‘task-correlated longitudinal sparse regression’ model to study the association between the phenotypic imaging markers and the genotypes encoded by SNPs. In our new association model, we extend the widely used ℓ2,1-norm for matrices to tensors to jointly select imaging markers that have common effects across all the regression tasks and time points, and meanwhile impose the trace-norm regularization onto the unfolded coefficient tensor to achieve low rank such that the interrelationship among SNPs can be addressed. The effectiveness of our method is demonstrated by both clearly improved prediction performance in empirical evaluations and a compact set of selected imaging predictors relevant to disease sensitive SNPs.
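For readers unfamiliar with the two regularizers, their standard matrix definitions are given below for a coefficient matrix W with d rows (imaging markers) and c columns (regression tasks). These are the usual matrix forms only; how the article extends the first norm to tensors and unfolds the coefficient tensor is not reproduced here, and the symbols W, d, c are notation assumed for this sketch.

```latex
\begin{equation}
\|W\|_{2,1} \;=\; \sum_{i=1}^{d} \big\|\mathbf{w}^{i}\big\|_2
            \;=\; \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{c} w_{ij}^{2}},
\qquad
\|W\|_{*} \;=\; \sum_{k=1}^{\min(d,c)} \sigma_k(W),
\end{equation}
% w^i is the i-th row of W and sigma_k(W) is its k-th singular value.
```

Penalizing the first norm drives entire rows to zero, so a marker is kept or dropped for all tasks at once, while penalizing the second norm favors low-rank coefficient matrices, which is how shared structure among SNPs can be captured.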
Availability: Software is publicly available at: http://ranger.uta.edu/%7eheng/Longitudinal/
Plants rely on the root system for anchorage to the ground and the acquisition and absorption of nutrients critical to sustaining productivity. A genome wide association analysis enables one to analyze allelic diversity of complex traits and identify superior alleles. 384 inbred lines from the Ames panel were genotyped with 681,257 single nucleotide polymorphism markers using Genotyping-by-Sequencing technology and 22 seedling root architecture traits were phenotyped.
A GWAS using both a general linear model and a mixed linear model identified 268 marker-trait associations (p ≤ 5.3×10⁻⁷). Analysis of significant SNP markers for multiple traits showed that several were located within gene models, with some SNP markers localized within regions of previously identified root quantitative trait loci. Gene model GRMZM2G153722, located on chromosome 4, contained nine significant markers. This predicted gene is expressed in roots and shoots.
This study identifies SNP markers putatively associated with root traits at the seedling stage. Some SNPs were located within or near (<1 kb) gene models. These gene models identify possible candidate genes involved in root development at the seedling stage. These and respective linked or functional markers could be targets for breeders for marker-assisted selection of seedling root traits.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1226-9) contains supplementary material, which is available to authorized users.
Genome wide association study; Maize; Roots
Recent advances in RNA sequencing (RNA-Seq) technology have offered unprecedented scope and resolution for transcriptome analysis. However, precise quantification of mRNA abundance and identification of differentially expressed genes are complicated due to biological and technical variations in RNA-Seq data.
We systematically study the variation in count data and dissect the sources of variation into between-sample variation and within-sample variation. A novel Bayesian framework is developed for joint estimation of gene-level mRNA abundance and differential state, which models the intrinsic variability in RNA-Seq to improve the estimation. Specifically, a Poisson-Lognormal model is incorporated into the Bayesian framework to model within-sample variation; a Gamma-Gamma model is then used to model between-sample variation, which accounts for over-dispersion of read counts among multiple samples. Simulation studies, where sequencing counts are synthesized based on parameters learned from real datasets, have demonstrated the advantage of the proposed method in both quantification of mRNA abundance and identification of differentially expressed genes. Moreover, performance comparison on data from the Sequencing Quality Control (SEQC) Project with ERCC spike-in controls has shown that the proposed method outperforms existing RNA-Seq methods in differential analysis. Application to a breast cancer dataset has further illustrated that the proposed Bayesian model can 'blindly' estimate sources of variation caused by sequencing biases.
We have developed a novel Bayesian hierarchical approach to investigate within-sample and between-sample variations in RNA-Seq data. Simulation and real data applications have validated desirable performance of the proposed method. The software package is available at http://www.cbil.ece.vt.edu/software.htm.
Alcohol abuse and dependence are major causes of morbidity and mortality worldwide, and have a strong familial component. Several linkage and association studies have identified chromosomal regions and/or genes that affect alcohol consumption, notably in genes involved in the two-stage pathway of alcohol metabolism.
Here, we use multiple regression models to test for associations and interactions between two alcohol related phenotypes and SNPs in 17 genes involved in alcohol metabolism in the U.S. Caucasian subset of the Collaborative Genetic Study of Nicotine Dependence (COGEND) participants.
Several SNPs across six genes showed evidence for association with either maximum number of drinks consumed in a 24-hour period or DSM-IV symptom count. The strongest evidence for association was between rs1229984, a non-synonymous coding SNP in ADH1B, and DSM-IV symptom count (P = 0.0003). This SNP was also associated with maximum drinks (P = 0.0004). Each minor allele at this SNP predicts 45% fewer DSM-IV symptoms and 18% fewer max drinks. Another SNP in a splice site in ALDH1A1 (rs8187974) showed evidence for association with both phenotypes as well. Minor alleles at this SNP predict greater alcohol consumption. In addition, pairwise interactions were observed between SNPs in several genes (P = 0.00002).
We replicated the large effect of rs1229984 on alcohol behavior, and although not common (MAF = 4%), this polymorphism may be highly relevant from a public health perspective in European Americans. Another SNP, rs8187974, may also affect alcohol behavior but requires replication. Also, interactions between polymorphisms in genes involved in alcohol metabolism are likely determinants of the parameters that ultimately affect alcohol consumption.
Alcoholism; Alcohol Metabolism; Genetic Association
Modeling biological networks serves as both a major goal and an effective tool of systems biology in studying mechanisms that orchestrate the activities of gene products in cells. Biological networks are context-specific and dynamic in nature. To systematically characterize the selectively activated regulatory components and mechanisms, modeling tools must be able to effectively distinguish significant rewiring from random background fluctuations. While differential networks cannot be constructed by existing knowledge alone, novel incorporation of prior knowledge into data-driven approaches can improve the robustness and biological relevance of network inference. However, the major unresolved roadblocks include: big solution space but a small sample size; highly complex networks; imperfect prior knowledge; missing significance assessment; and heuristic structural parameter learning.
To address these challenges, we formulated the inference of differential dependency networks that incorporate both conditional data and prior knowledge as a convex optimization problem, and developed an efficient learning algorithm to jointly infer the conserved biological network and the significant rewiring across different conditions. We used a novel sampling scheme to estimate the expected error rate due to “random” knowledge. Based on that scheme, we developed a strategy that fully exploits the benefit of this data-knowledge integrated approach. We demonstrated and validated the principle and performance of our method using synthetic datasets. We then applied our method to yeast cell line and breast cancer microarray data and obtained biologically plausible results. The open-source R software package and the experimental data are freely available at http://www.cbil.ece.vt.edu/software.htm.
Experiments on both synthetic and real data demonstrate the effectiveness of the knowledge-fused differential dependency network in revealing the statistically significant rewiring in biological networks. The method efficiently leverages data-driven evidence and existing biological knowledge while remaining robust to the false positive edges in the prior knowledge. The identified network rewiring events are supported by previous studies in the literature and also provide new mechanistic insight into the biological systems. We expect the knowledge-fused differential dependency network analysis, together with the open-source R package, to be an important and useful bioinformatics tool in biological network analyses.
Biological networks; Probabilistic graphical models; Differential dependency network; Network rewiring; Network analysis; Systems biology; Knowledge incorporation; Convex optimization
Bayesian networks are powerful instruments to learn genetic models from association studies data. They are able to derive the existing correlation between genetic markers and phenotypic traits and, at the same time, to find the relationships between the markers themselves. However, learning Bayesian networks is often non-trivial due to the high number of variables to be taken into account in the model with respect to the instances of the dataset. Therefore, it becomes very interesting to use an abstraction of the variable space that suitably reduces its dimensionality without losing information. In this paper we present a new strategy to achieve this goal by mapping the SNPs related to the same gene to one meta-variable. In order to assign states to the meta-variables we employ an approach based on classification trees.
We applied our approach to data coming from a genome-wide scan on 288 individuals affected by arterial hypertension and 271 nonagenarians without history of hypertension. After pre-processing, we focused on a subset of 24 SNPs. We compared the performance of the proposed approach with the Bayesian network learned with SNPs as variables and with the network learned with haplotypes as meta-variables. The results were obtained by running a hold-out experiment five times. The mean accuracy of the new method was 64.28%, while the mean accuracy of the SNPs network was 58.99% and the mean accuracy of the haplotype network was 54.57%.
The new approach presented in this paper is able to derive a gene-based predictive model from SNP data. Such a model is more parsimonious than the one based on single SNPs, while preserving the capability of highlighting predictive SNP configurations. The prediction performance of this approach was consistently superior to that of the SNP-based and haplotype-based approaches in all the test sets of the evaluation procedure. The method can then be considered as an alternative way to analyze the data coming from association studies.
The arylamine N-acetyltransferase 2 (NAT2) slow acetylation phenotype is an established risk factor for urinary bladder cancer. We previously reported on this risk association using NAT2 phenotypic categories inferred from NAT2 haplotypes based on 7 single nucleotide polymorphisms (SNPs) in a study in Spain. In a subsequent genome-wide scan, we have identified a single common tag SNP (rs1495741) located in the 3′ end of NAT2 that is also associated with bladder cancer risk. The aim of this report is to evaluate the agreement between the common tag SNP and the 7-SNP NAT2 inferred phenotype.
The agreement between the 7-SNP NAT2 inferred phenotype and the tag SNP, rs1495741, was initially assessed in 2,174 subjects from the Spanish Bladder Cancer Study (SBCS), and confirmed in a subset of subjects from the Maine and Vermont component of the New England Bladder Cancer Study (NEBCS). We also investigated the association of rs1495741 genotypes with NAT2 catalytic activity in cryopreserved hepatocytes from 154 individuals of European background.
We observed very strong agreement between rs1495741 and the 7-SNP inferred NAT2 phenotype: sensitivity and specificity for the NAT2 slow phenotype were 99% and 95%, respectively. Our findings were replicated in an independent population from the United States. Estimates for the association between NAT2 slow phenotype and bladder cancer risk in the SBCS and its interaction with cigarette smoking were comparable for the 7-SNP inferred NAT2 phenotype and rs1495741. In addition, rs1495741 genotypes were strongly related to NAT2 activity measured in hepatocytes (P<0.0001).
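As a reminder of how agreement figures of this kind are computed, the sketch below derives sensitivity and specificity from a 2x2 table in which the 7-SNP inferred phenotype is treated as the reference and the tag SNP as the test. The cell counts are hypothetical round numbers chosen only to reproduce 99% and 95%; they are not the study's data.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from 2x2 table counts.

    tp: reference slow, tag SNP calls slow
    fn: reference slow, tag SNP calls not slow
    tn: reference not slow, tag SNP calls not slow
    fp: reference not slow, tag SNP calls slow
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts (assumption), not taken from the SBCS or NEBCS.
sens, spec = sensitivity_specificity(tp=990, fn=10, tn=950, fp=50)
print(f"sensitivity={sens:.2%} specificity={spec:.2%}")
```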
A novel NAT2 tag SNP (rs1495741) predicts with high accuracy the 7-SNP inferred NAT2 phenotype, and thus can be used as a sole marker in pharmacogenetic or epidemiological studies of populations of European background. These findings illustrate the utility of tag SNPs, often employed in genome-wide association studies (GWAS), to identify novel phenotypic markers. Further studies are required to determine the functional implications of this novel SNP and the structure and evolution of the haplotype on which it resides.
Studies have shown that interactions among single nucleotide polymorphisms (SNPs) may play an important role in understanding the causes of complex disease. Machine learning approaches provide useful features to explore interactions more effectively and efficiently. We have proposed an integrated method that combines two machine learning methods - Random Forests (RF) and Multivariate Adaptive Regression Splines (MARS) - to identify a subset of important SNPs and detect interaction patterns. In this two-stage RF-MARS (TRM) approach, RF is first applied to detect a predictive subset of SNPs, and then MARS is used to identify the interaction patterns among the selected SNPs. We evaluated the performance of TRM in four models: three causal models with one two-way interaction and one null model. RF variable selection was based on out-of-bag classification error rate (OOB) and variable importance spectrum (IS). First, we compared the selection of important variables by RF and MARS. Our results indicate that RFOOB performed better than MARS and RFIS in detecting important variables. We also evaluated the true positive rate and false positive rate of identifying interaction patterns in TRM and MARS. This study demonstrates that TRMOOB, which is RFOOB plus MARS, combines the strengths of RF and MARS in identifying SNP-SNP interaction patterns in a scenario of 100 candidate SNPs. TRMOOB had a greater true positive rate and a lower false positive rate than MARS, particularly when searching for interactions with a strong association with the outcome. Therefore the use of TRMOOB is favored for exploring SNP-SNP interactions in a large-scale genetic variation study.
polymorphism; interaction; machine learning
Producing a rich, personalized Web-based consultation tool for plastic surgeons and patients is challenging.
(1) To develop a computer tool that allows individual reconstruction and simulation of 3-dimensional (3D) soft tissue from ordinary digital photos of breasts, (2) to implement a Web-based, worldwide-accessible preoperative surgical planning platform for plastic surgeons, and (3) to validate this tool through a quality control analysis by comparing 3D laser scans of the patients with the 3D reconstructions with this tool from original 2-dimensional (2D) pictures of the same patients.
The proposed system uses well-established 2D digital photos for reconstruction into a 3D torso, which is then available to the user for interactive planning. The simulation is performed on dedicated servers, accessible via the Internet. It allows the surgeon, together with the patient, to previsualize the impact of the proposed breast augmentation directly during the consultation before a surgery is decided upon. We retrospectively conducted a quality control assessment of available anonymized pre- and postoperative 2D digital photographs of patients undergoing breast augmentation procedures. The method presented above was used to reconstruct 3D pictures from 2D digital pictures. We used a laser scanner capable of generating a highly accurate surface model of the patient’s anatomy to acquire ground truth data. The quality of the computed 3D reconstructions was compared with the ground truth data used to perform both qualitative and quantitative evaluations.
We evaluated the system on 11 clinical cases for surface reconstructions and 4 clinical cases of postoperative simulations, using laser surface scan technologies showing a mean reconstruction error between 2 and 4 mm and a maximum outlier error of 16 mm. Qualitative and quantitative analyses from plastic surgeons demonstrate the potential of these new emerging technologies.
We tested our tool for 3D, Web-based, patient-specific consultation in the clinical scenario of breast augmentation. This example shows that the current state of development allows for creation of responsive and effective Web-based, 3D medical tools, even with highly complex and time-consuming computation, by off-loading them to a dedicated high-performance data center. The efficient combination of advanced technologies, based on analysis and understanding of human anatomy and physiology, will allow the development of further Web-based reconstruction and predictive interfaces at different scales of the human body. The consultation tool presented herein exemplifies the potential of combining advancements in the core areas of computer science and biomedical engineering with the evolving areas of Web technologies. We are confident that future developments based on a multidisciplinary approach will further pave the way toward personalized Web-enabled medicine.
Medical informatics computing; computer-assisted surgery; imaging, three-dimensional
Climate is the primary driver of the distribution of tree species worldwide, and the potential for adaptive evolution will be an important factor determining the response of forests to anthropogenic climate change. Although association mapping has the potential to improve our understanding of the genomic underpinnings of climatically relevant traits, the utility of adaptive polymorphisms uncovered by such studies would be greatly enhanced by the development of integrated models that account for the phenotypic effects of multiple single-nucleotide polymorphisms (SNPs) and their interactions simultaneously. We previously reported the results of association mapping in the widespread conifer Sitka spruce (Picea sitchensis). In the current study we used the recursive partitioning algorithm ‘Random Forest’ to identify optimized combinations of SNPs to predict adaptive phenotypes. After adjusting for population structure, we were able to explain 37% and 30% of the phenotypic variation, respectively, in two locally adaptive traits—autumn budset timing and cold hardiness. For each trait, the leading five SNPs captured much of the phenotypic variation. To determine the role of epistasis in shaping these phenotypes, we also used a novel approach to quantify the strength and direction of pairwise interactions between SNPs and found such interactions to be common. Our results demonstrate the power of Random Forest to identify subsets of markers that are most important to climatic adaptation, and suggest that interactions among these loci may be widespread.
Random Forest; adaptation; association mapping; epistasis; phenology; cold hardiness; GenPred; shared data resources
Dominance effects may play an important role in genetic variation of complex traits. Full-featured and easy-to-use computing tools for genomic prediction and variance component estimation of additive and dominance effects using genome-wide single nucleotide polymorphism (SNP) markers are necessary to understand dominance contribution to a complex trait and to utilize dominance for selecting individuals with favorable genetic potential.
The GVCBLUP package is a shared memory parallel computing tool for genomic prediction and variance component estimation of additive and dominance effects using genome-wide SNP markers. This package currently has three main programs (GREML_CE, GREML_QM, and GCORRMX) and a graphical user interface (GUI) that integrates the three main programs with an existing program for the graphical viewing of SNP additive and dominance effects (GVCeasy). The GREML_CE and GREML_QM programs offer complementary computing advantages with identical results for genomic prediction of breeding values, dominance deviations and genotypic values, and for genomic estimation of additive and dominance variances and heritabilities using a combination of expectation-maximization (EM) algorithm and average information restricted maximum likelihood (AI-REML) algorithm. GREML_CE is designed for large numbers of SNP markers and GREML_QM for large numbers of individuals. Test results showed that GREML_CE could analyze 50,000 individuals with 400 K SNP markers and GREML_QM could analyze 100,000 individuals with 50K SNP markers. GCORRMX calculates genomic additive and dominance relationship matrices using SNP markers. GVCeasy is the GUI for GVCBLUP integrated with an existing software tool for the graphical viewing of SNP effects and a function for editing the parameter files for the three main programs.
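For readers unfamiliar with genomic relationship matrices, the sketch below builds an additive relationship matrix from 0/1/2 genotype codes using one widely used definition (centering each marker by twice its allele frequency and scaling by 2Σp(1-p)). This is an assumption made for illustration: the exact definitions implemented in GCORRMX are not given here and may differ, and the function name and toy genotypes are hypothetical.

```python
def additive_relationship(genotypes):
    """Additive genomic relationship matrix from 0/1/2 allele counts.

    genotypes: list of individuals, each a list of allele counts per SNP.
    Uses G = Z Z' / (2 * sum(p * (1 - p))) with Z = M - 2p, a common
    definition assumed here; it is not necessarily what GCORRMX computes.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Allele frequency of the counted allele at each SNP.
    freqs = [sum(ind[j] for ind in genotypes) / (2 * n) for j in range(m)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic")
    # Center each genotype code by its expected value 2p.
    centered = [[ind[j] - 2 * freqs[j] for j in range(m)] for ind in genotypes]
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / scale for k in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): 4 individuals genotyped at 3 SNPs.
toy_genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 2, 1],
]
G = additive_relationship(toy_genotypes)
for row in G:
    print(" ".join(f"{value:6.3f}" for value in row))
```

The pure-Python loops are only meant to show the arithmetic; at the scale of tens of thousands of individuals reported above, a parallel matrix implementation is needed.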
The GVCBLUP package is a powerful and versatile computing tool for assessing the type and magnitude of genetic effects affecting a phenotype by estimating whole-genome additive and dominance heritabilities, for genomic prediction of breeding values, dominance deviations and genotypic values, for calculating genomic relationships, and for research and education in genomic prediction and estimation.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-270) contains supplementary material, which is available to authorized users.
GVCBLUP; Genomic selection; Variance component; Heritability; BLUP
Breeding values for animals with marker data are estimated using a genomic selection approach where data are analyzed using Bayesian multi-marker association models. Fourteen model scenarios with varying haplotype lengths, hyperparameters and prior distributions were compared to find the scenario expected to give the most accurate genomic estimated breeding values for animals with marker information only. Five-fold cross validation was performed to assess the ability of models to estimate breeding values for animals in generation 3. In each of the five subsets, 20% of phenotypic records in generation 3 were left out. Correlations between breeding values estimated on full data and on subsets for the "leave-out" animals varied between 0.77 and 0.99. Regression coefficients of breeding values from full data on breeding values from subsets ranged from 0.78 to 1.01. Single-SNP marker models did not perform well: correlations were 0.77–0.89 and predicted breeding values were biased. In addition, the models seemed to overfit the genomic part of the variation. The highest correlations and least biased results were obtained when SNP markers were joined into haplotypes. In particular, the scenarios with 5-SNP haplotypes gave promising results (adjacent SNPs were spaced evenly at 0.1 cM over the genome): all correlations were 0.99 and regression coefficients were 0.99–1.01. Models with 5-SNP haplotypes seemed robust to changes in hyperparameters and priors. Haplotypes of up to 40 SNPs also gave good results. However, longer haplotypes are expected to have less predictive ability over several generations, and therefore the 5-SNP haplotypes are expected to give the best predictions for generations 4–6.
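The two validation statistics used above can be sketched as follows: the Pearson correlation between breeding values estimated from the full data and from a subset, and the slope of the regression of full-data values on subset values (a slope near 1 suggests the subset predictions are neither inflated nor deflated). The function name and the five pairs of values are hypothetical and serve only to show the arithmetic.

```python
import math


def correlation_and_slope(subset_ebv, full_ebv):
    """Pearson correlation and slope of full-data EBVs regressed on subset EBVs."""
    n = len(subset_ebv)
    mean_x = sum(subset_ebv) / n
    mean_y = sum(full_ebv) / n
    sxx = sum((x - mean_x) ** 2 for x in subset_ebv)
    syy = sum((y - mean_y) ** 2 for y in full_ebv)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(subset_ebv, full_ebv))
    correlation = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx  # regression of y (full data) on x (subset)
    return correlation, slope


# Hypothetical breeding values for five "leave-out" animals (assumption).
subset_ebv = [-1.0, -0.5, 0.0, 0.5, 1.0]
full_ebv = [-0.9, -0.6, 0.1, 0.4, 1.0]

r, b = correlation_and_slope(subset_ebv, full_ebv)
print(f"correlation={r:.3f} regression coefficient={b:.3f}")
```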