Related Articles

1.  Identification of cancer genomic markers via integrative sparse boosting 
Biostatistics (Oxford, England)  2012;13(3):509-522.
In high-throughput cancer genomic studies, markers identified from the analysis of single data sets often suffer a lack of reproducibility because of the small sample sizes. An ideal solution is to conduct large-scale prospective studies, which are extremely expensive and time consuming. A cost-effective remedy is to pool data from multiple comparable studies and conduct integrative analysis. Integrative analysis of multiple data sets is challenging because of the high dimensionality of genomic measurements and heterogeneity among studies. In this article, we propose a sparse boosting approach for marker identification in integrative analysis of multiple heterogeneous cancer diagnosis studies with gene expression measurements. The proposed approach can effectively accommodate the heterogeneity among multiple studies and identify markers with consistent effects across studies. Simulation shows that the proposed approach has satisfactory identification results and outperforms alternatives including an intensity approach and meta-analysis. The proposed approach is used to identify markers of pancreatic cancer and liver cancer.
doi:10.1093/biostatistics/kxr033
PMCID: PMC3577103  PMID: 22045909
Cancer genomics; Marker identification; Sparse boosting
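As a rough illustration of the boosting idea behind this entry, the sketch below implements plain component-wise L2 boosting for a single data set. It is not the sparse or integrative variant the abstract proposes: the function name, the step size `nu`, and the fixed number of steps are assumptions made for illustration only.

```python
def componentwise_l2_boost(X, y, steps=50, nu=0.1):
    """Plain component-wise L2 boosting (illustrative, single data set).

    X is a list of rows, y a list of responses. At each step the single
    covariate that most reduces the residual sum of squares is selected
    and its coefficient is moved by a small fraction nu of the
    least-squares fit to the current residuals.
    """
    p = len(X[0])
    beta = [0.0] * p
    resid = list(y)
    for _ in range(steps):
        best_j, best_coef, best_gain = None, 0.0, 0.0
        for j in range(p):
            col = [row[j] for row in X]
            sxx = sum(v * v for v in col)
            if sxx == 0.0:
                continue
            sxy = sum(v * r for v, r in zip(col, resid))
            gain = sxy * sxy / sxx  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_coef, best_gain = j, sxy / sxx, gain
        if best_j is None:
            break  # no covariate improves the fit
        beta[best_j] += nu * best_coef
        resid = [r - nu * best_coef * row[best_j] for r, row in zip(resid, X)]
    return beta


# Hypothetical toy data: only the first covariate carries signal.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
beta = componentwise_l2_boost(X, y)
```

Covariates that are never selected keep a coefficient of exactly zero, which is what makes boosting usable for marker selection.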
2.  Identification of Breast Cancer Prognosis Markers via Integrative Analysis 
Summary
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
doi:10.1016/j.csda.2012.02.017
PMCID: PMC3389801  PMID: 22773869
Breast cancer prognosis; Gene expression; Marker identification; Integrative analysis; 2-norm group bridge
3.  Revealing a signaling role of phytosphingosine-1-phosphate in yeast 
Perturbing metabolic systems of bioactive sphingolipids with genetic approach
Multiple types of “omics” data collected from the system
Systems approach for integrating multiple “omics” information
Predicting signal transduction information flow: lipid; TF activation; gene expression
In contemporary biomedical research, gene mutation remains the most powerful and commonly used tool in molecular and systems biology for perturbation and dissection of biological systems. However, as biological systems consist of highly connected networks, for example, metabolic networks or signal transduction networks, perturbing one portion could result in widely spread effects across the network. Such ‘ripple effects’ in systems pose a challenge to the paradigm of investigating the role of a metabolite through mutating enzymes required for its production. In this study, we have developed a systems biology approach that integrates different types of ‘-omics’ data to identify signal transduction pathways involving sphingolipids and gene expression. See Figure 1 for an overall scheme of our approaches.
Sphingolipids are a family of bioactive lipids that have important signaling functions in cells; in yeast, de novo synthesis is required to mediate the cell response to heat shock. We hypothesized that a specific sphingolipid, phyto-sphingosine-1-phosphate (PHS1P), functions as a signaling molecule in the heat stress response (HSR) because, though its mammalian counterparts are known to have important signaling roles, the function of this metabolite in yeast remains unknown. To identify a putative role of PHS1P in the HSR, we deleted the genes involved in production (LCB4 and LCB5) and degradation (DPL1) of PHS1P to perturb its levels in cells. In wild-type cells, heat shock induces a significant increase in PHS1P. Over the same course, expression of over a thousand genes was modulated.
While deleting the genes involved in PHS1P metabolism ‘clamped’ the PHS1P concentration as expected, these mutations also resulted in widespread changes in many sphingolipids in addition to PHS1P. This ‘ripple effect’ prevented direct identification of the signaling role of PHS1P in gene expression. We overcame this difficulty by using a set of systems approaches as follows: (1) identifying the information between levels of each individual sphingolipid species and gene expression through combining correlation analysis and clustering; (2) identifying the putative PHS1P-sensitive subset of genes by analyzing the results from step 1; (3) identifying transcription factors (TFs) that potentially regulate these PHS1P-sensitive genes through promoter analysis; (4) modeling the activation states of the TFs by combining gene expression data and promoter sequence data; and finally, (5) modeling the relationship between sphingolipids and activation of TFs.
Our study showed that 441 genes were differentially expressed in the lcb4Δ/lcb5Δ strain in comparison to wild-type strain; however, only 77 genes among them showed a significant correlation with respect to PHS1P, with 22 genes positively correlated and 54 genes negatively correlated. The results led to a hypothesis that the genes showing significant correlation were PHS1P sensitive whereas differential expression of other genes resulted from the compounding ‘ripple effects’ of the gene deletions. We tested this hypothesis by directly treating cells with PHS1P and monitoring the expression levels of the genes that were PHS1P sensitive and PHS1P insensitive, and the results showed that the expression of PHS1P-sensitive genes indeed changed in response to the treatment whereas others did not. We developed a statistical model referred to as Bayesian transcription factor state model to infer activation states of TFs in cells under a specific condition based on the genomic information and gene expression data. We then used a Bayesian logistic regression to further model the relationship between the lipid concentrations and activation states of the TFs. Combined TF enrichment analysis and TF state modeling indicated that the HAP TF complex was likely responding to the signal from PHS1P and mediating the regulation of PHS1P-sensitive genes. We tested this hypothesis by treating wild type and a strain of yeast with deletion of HAP4 gene (hap4Δ), a component of the HAP complex, with PHS1P and monitoring the expression of PHS1P-sensitive genes. Indeed, PHS1P induced the genes in the wild-type strain but not in hap4Δ, thus indicating that induction of the PHS1P-sensitive genes required a functioning HAP complex (see Figure 5).
In summary, our experiments demonstrated that, though gene mutation remains one of the most powerful tools to perturb biological systems, the high connectivity of biological systems poses a challenge for using this approach to identify signaling roles of bioactive metabolites. Here, we demonstrated that, by combining the information from multiple types of ‘-omics’ data using systems approaches, it is possible to circumvent these difficulties and reveal novel signal transduction pathways.
Sphingolipids including sphingosine-1-phosphate and ceramide participate in numerous cell programs through signaling mechanisms. This class of lipids has important functions in stress responses; however, determining which sphingolipid mediates specific events has remained encumbered by the numerous metabolic interconnections of sphingolipids, such that modulating a specific lipid of interest through manipulating metabolic enzymes causes ‘ripple effects’, which change levels of many other lipids. Here, we develop a method of integrative analysis for genomic, transcriptomic, and lipidomic data to address this previously intractable problem. This method revealed a specific signaling role for phytosphingosine-1-phosphate, a lipid with no previously defined specific function in yeast, in regulating genes required for mitochondrial respiration through the HAP complex transcription factor. This approach could be applied to extract meaningful biological information from a similar experimental design that produces multiple sets of high-throughput data.
doi:10.1038/msb.2010.3
PMCID: PMC2835565  PMID: 20160710
information integration; lipidomics; signal transduction; sphingolipids; transcriptomics
4.  Identification of Breast Cancer Prognosis Markers using Integrative Sparse Boosting 
Summary
Objectives
In breast cancer research, it is important to identify genomic markers associated with prognosis. Multiple microarray gene expression profiling studies have been conducted, searching for prognosis markers. Genomic markers identified from the analysis of single datasets often suffer a lack of reproducibility because of small sample sizes. Integrative analysis of data from multiple independent studies has a larger sample size and may provide a cost-effective solution.
Methods
We collect four breast cancer prognosis studies with gene expression measurements. An accelerated failure time (AFT) model with an unknown error distribution is adopted to describe survival. An integrative sparse boosting approach is employed for marker selection. The proposed model and boosting approach can effectively accommodate heterogeneity across multiple studies and identify genes with consistent effects.
Results
Simulation study shows that the proposed approach outperforms alternatives including meta-analysis and intensity approaches by identifying the majority or all of the true positives, while having a low false positive rate. In the analysis of breast cancer data, 44 genes are identified as associated with prognosis. Many of the identified genes have been previously suggested as associated with tumorigenesis and cancer prognosis. The identified genes and corresponding predicted risk scores differ from those using alternative approaches. Monte Carlo-based prediction evaluation suggests that the proposed approach has the best prediction performance.
Conclusions
Integrative analysis may provide an effective way of identifying breast cancer prognosis markers. Markers identified using the integrative sparse boosting analysis have sound biological implications and satisfactory prediction performance.
doi:10.3414/ME11-02-0019
PMCID: PMC3598607  PMID: 22344268
Breast cancer prognosis; Gene Expression; Integrative analysis; Sparse boosting
5.  Integrative analysis of multiple cancer genomic datasets under the heterogeneity model 
Statistics in medicine  2013;32(20):3509-3521.
In the analysis of cancer studies with high-dimensional genomic measurements, integrative analysis provides an effective way of pooling information across multiple heterogeneous datasets. The genomic basis of multiple independent datasets, which can be characterized by the sets of genomic markers, can be described using the homogeneity model or heterogeneity model. Under the homogeneity model, all datasets share the same set of markers associated with responses. In contrast, under the heterogeneity model, different studies have overlapping but possibly different sets of markers. The heterogeneity model contains the homogeneity model as a special case and can be much more flexible. Marker selection under the heterogeneity model calls for bi-level selection to determine whether a covariate is associated with response in any study at all as well as in which studies it is associated with responses. In this study, we consider two minimax concave penalty (MCP) based penalization approaches for marker selection under the heterogeneity model. For each approach, we describe its rationale and an effective computational algorithm. We conduct simulation to investigate their performance and compare with the existing alternatives. We also apply the proposed approaches to the analysis of gene expression data on multiple cancers.
doi:10.1002/sim.5780
PMCID: PMC3743947  PMID: 23519988
Integrative analysis; Heterogeneity model; Marker selection
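The minimax concave penalty mentioned in this entry has a simple closed form for a single coefficient. The sketch below shows that scalar building block and its univariate thresholding rule for a standardized covariate. It does not reproduce the bi-level penalties proposed in the article, and the function names and the parameter names `lam` and `gamma` are assumptions chosen for illustration.

```python
def mcp_penalty(t, lam, gamma):
    """MCP for one coefficient t, with tuning parameter lam and
    regularization parameter gamma (gamma > 1)."""
    a = abs(t)
    if a <= gamma * lam:
        # Concave part: the penalization rate decreases as |t| grows.
        return lam * a - a * a / (2.0 * gamma)
    # Flat part: large coefficients receive no additional penalty.
    return 0.5 * gamma * lam * lam


def mcp_threshold(z, lam, gamma):
    """Minimizer of 0.5 * (z - b)**2 + mcp_penalty(b, lam, gamma) over b."""
    if abs(z) > gamma * lam:
        return z  # large signals are left unshrunk
    soft = max(abs(z) - lam, 0.0)
    sign = 1.0 if z > 0 else -1.0
    return sign * soft / (1.0 - 1.0 / gamma)
```

Unlike the Lasso, which shrinks every nonzero coefficient by the same amount, this rule returns large inputs unchanged, which reduces the bias of the selected coefficients.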
6.  Integrative analysis of multiple cancer prognosis studies with gene expression measurements 
Statistics in Medicine  2011;30(28):3361-3371.
Although in cancer research microarray gene profiling studies have been successful in identifying genetic variants predisposing to the development and progression of cancer, the identified markers from analysis of single datasets often suffer low reproducibility. Among multiple possible causes, the most important one is the small sample size and hence the lack of power of single studies. Integrative analysis jointly considers multiple heterogeneous studies, has a significantly larger sample size, and can improve reproducibility. In this article, we focus on cancer prognosis studies, where the response variables are progression-free, overall, or other types of survival. A group minimax concave penalty (GMCP) penalized integrative analysis approach is proposed for analyzing multiple heterogeneous cancer prognosis studies with microarray gene expression measurements. An efficient group coordinate descent algorithm is developed. The GMCP can automatically accommodate the heterogeneity across multiple datasets, and the identified markers have consistent effects across multiple studies. Simulation studies show that the GMCP provides significantly improved selection results as compared with the existing meta-analysis approaches, intensity approaches, and group Lasso penalized integrative analysis. We apply the GMCP to four microarray studies and identify genes associated with the prognosis of breast cancer.
doi:10.1002/sim.4337
PMCID: PMC3399910  PMID: 22105693
integrative analysis; cancer prognosis; microarray; penalized selection
7.  Incorporating Network Structure in Integrative Analysis of Cancer Prognosis Data 
Genetic epidemiology  2012;37(2):173-183.
In high-throughput cancer genomic studies, markers identified from the analysis of single datasets may have unsatisfactory properties because of low sample sizes. Integrative analysis pools and analyzes raw data from multiple studies, and can effectively increase sample size and lead to improved marker identification results. In this study, we consider the integrative analysis of multiple high-throughput cancer prognosis studies. In the existing integrative analysis studies, the interplay among genes, which can be described using the network structure, has not been effectively accounted for. In network analysis, tightly-connected nodes (genes) are more likely to have related biological functions and similar regression coefficients. The goal of this study is to develop an analysis approach that can incorporate the gene network structure in integrative analysis. To this end, we adopt an AFT (accelerated failure time) model to describe survival. A weighted least squares approach, which has low computational cost, is adopted for estimation. For marker selection, we propose a new penalization approach. The proposed penalty is composed of two parts. The first part is a group MCP penalty, and conducts gene selection. The second part is a Laplacian penalty, and smoothes the differences of coefficients for tightly-connected genes. A group coordinate descent approach is developed to compute the proposed estimate. Simulation study shows satisfactory performance of the proposed approach when there exist moderate to strong correlations among genes. We analyze three lung cancer prognosis datasets, and demonstrate that incorporating the network structure can lead to the identification of important genes and improved prediction performance.
doi:10.1002/gepi.21697
PMCID: PMC3909475  PMID: 23161517
Integrative analysis; Cancer prognosis; Gene network; Penalized selection; Laplacian shrinkage
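The smoothing effect of a Laplacian penalty can be seen in a small sketch. The version below is simplified and is an assumption rather than the article's exact penalty: it uses one coefficient per gene and a symmetric, nonnegative adjacency matrix, whereas the article works with multiple datasets. All names are hypothetical.

```python
def laplacian_penalty(beta, adjacency):
    """Sum over connected gene pairs of a_jk * (beta_j - beta_k)**2."""
    total = 0.0
    p = len(beta)
    for j in range(p):
        for k in range(j + 1, p):
            total += adjacency[j][k] * (beta[j] - beta[k]) ** 2
    return total


def laplacian_quadratic_form(beta, adjacency):
    """The same quantity written as beta' L beta with L = D - A."""
    p = len(beta)
    degree = [sum(adjacency[j]) for j in range(p)]
    total = 0.0
    for j in range(p):
        for k in range(p):
            l_jk = (degree[j] if j == k else 0.0) - adjacency[j][k]
            total += beta[j] * l_jk * beta[k]
    return total


# Hypothetical three-gene network: genes 0 and 1 are tightly connected,
# genes 1 and 2 are weakly connected.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
beta = [1.0, 1.0, 3.0]
penalty = laplacian_penalty(beta, A)
```

Only the pair with different coefficients contributes to the penalty, so minimizing it pulls the coefficients of connected genes toward each other.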
8.  Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets 
Genetics research  2013;95(0):68-77.
SUMMARY
In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. Simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach.
doi:10.1017/S0016672313000086
PMCID: PMC4090387  PMID: 23938111
Integrative analysis; Cancer prognosis; Heterogeneity model; Penalization
9.  Integrative Analysis Using Module-Guided Random Forests Reveals Correlated Genetic Factors Related to Mouse Weight 
PLoS Computational Biology  2013;9(3):e1002956.
Complex traits such as obesity are manifestations of intricate interactions of multiple genetic factors. However, such relationships are difficult to identify. Thanks to the recent advance in high-throughput technology, a large amount of data has been collected for various complex traits, including obesity. These data often measure different biological aspects of the traits of interest, including genotypic variations at the DNA level and gene expression alterations at the RNA level. Integration of such heterogeneous data provides promising opportunities to understand the genetic components and possibly genetic architecture of complex traits. In this paper, we propose a machine learning based method, module-guided Random Forests (mgRF), to integrate genotypic and gene expression data to investigate genetic factors and molecular mechanism underlying complex traits. mgRF is an augmented Random Forests method enhanced by a network analysis for identifying multiple correlated variables of different types. We applied mgRF to genetic markers and gene expression data from a cohort of F2 female mouse intercross. mgRF outperformed several existing methods in our extensive comparison. Our new approach has an improved performance when combining both genotypic and gene expression data compared to using either one of the two types of data alone. The resulting predictive variables identified by mgRF provide information of perturbed pathways that are related to body weight. More importantly, the results uncovered intricate interactions among genetic markers and genes that would have been overlooked if only one type of data had been examined. Our results shed light on genetic mechanisms of obesity and our approach provides a promising complementary framework to the “genetics of gene expression” analysis for integrating genotypic and gene expression information for analyzing complex traits.
Author Summary
Obesity has become a perilous global epidemic that can lead to complex diseases, such as diabetes and cardiovascular diseases. Much effort has been devoted to the studies of the genetic mechanisms that underlie the manifestation of obesity. Although a large quantity of experimental data has been accumulated lately using high-throughput techniques, our understanding of genetic mechanisms of obesity is still limited. The proposed method is motivated to address three critical issues that have impeded the existing methods. The first is the curse of dimensionality in selecting a subset of genetic elements related to the traits of interest from a large number of candidates. The second is genetic multiplicity underlying non-Mendelian traits, in which multiple genes are in interplay. The third issue is the integration of data from multiple sources in light of genetic multiplicity and curse of dimensionality. Here, we propose a new method, which augments the Random Forests method with a network-based analysis, to integrate genotypic and gene expression information and identify correlated multiple genetic elements underlying mouse weight. Our results shed light on complex genetic interactions underlying obesity, which can form viable hypotheses worthy of further investigation.
doi:10.1371/journal.pcbi.1002956
PMCID: PMC3591263  PMID: 23505362
10.  Integrative clustering methods for high-dimensional molecular data 
Translational cancer research  2014;3(3):202-216.
High-throughput ‘omic’ data, such as gene expression, DNA methylation, and DNA copy number, have played an instrumental role in furthering our understanding of the molecular basis in states of human health and disease. As cells with similar morphological characteristics can exhibit entirely different molecular profiles and because of the potential that these discrepancies might further our understanding of patient-level variability in clinical outcomes, there is significant interest in the use of high-throughput ‘omic’ data for the identification of novel molecular subtypes of a disease. While numerous clustering methods have been proposed for identifying molecular subtypes, most were developed for single ‘omic’ data types and may not be appropriate when more than one ‘omic’ data type is collected on study subjects. Given that complex diseases, such as cancer, arise as a result of genomic, epigenomic, transcriptomic, and proteomic alterations, integrative clustering methods for the simultaneous clustering of multiple ‘omic’ data types have great potential to aid in molecular subtype discovery. Traditionally, ad hoc manual data integration has been performed using the results obtained from the clustering of individual ‘omic’ data types on the same set of patient samples. However, such methods often result in inconsistent assignment of subjects to the molecular cancer subtypes. Recently, several methods have been proposed in the literature that offer a rigorous framework for the simultaneous integration of multiple ‘omic’ data types in a single comprehensive analysis. In this paper, we present a systematic review of existing integrative clustering methods.
doi:10.3978/j.issn.2218-676X.2014.06.03
PMCID: PMC4166480  PMID: 25243110
Consensus clustering; cophenetic correlation; latent models; mixture models; non-negative matrix factorization
11.  SPARSE INTEGRATIVE CLUSTERING OF MULTIPLE OMICS DATA SETS 
The annals of applied statistics  2012;7(1):269-294.
High resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation, and gene expression associated with a disease. An integrated genomic profiling approach measuring multiple omics data types simultaneously in the same set of biological samples would render an integrated data resolution that would not be available with any single data type. In this study, we use penalized latent variable regression methods for joint modeling of multiple omics data types to identify common latent variables that can be used to cluster patient samples into biologically and clinically relevant disease subtypes. We consider lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), and fused lasso (Tibshirani et al., 2005) methods to induce sparsity in the coefficient vectors, revealing important genomic features that have significant contributions to the latent variables. An iterative ridge regression is used to compute the sparse coefficient vectors. In model selection, a uniform design (Fang and Wang, 1994) is used to seek “experimental” points that are scattered uniformly across the search domain for efficient sampling of tuning parameter combinations. We compared our method to sparse singular value decomposition (SVD) and penalized Gaussian mixture model (GMM) using both real and simulated data sets. The proposed method is applied to integrate genomic, epigenomic, and transcriptomic data for subtype analysis in breast and lung cancer data sets.
doi:10.1214/12-AOAS578
PMCID: PMC3935438  PMID: 24587839
12.  AucPR: An AUC-based approach using penalized regression for disease prediction with high-dimensional omics data 
BMC Genomics  2014;15(Suppl 10):S1.
Motivation
It is common to seek an optimal combination of markers for disease classification and prediction when multiple markers are available. Many approaches based on the area under the receiver operating characteristic curve (AUC) have been proposed. Existing works based on AUC in a high-dimensional context depend mainly on a non-parametric, smooth approximation of AUC, with no work using a parametric AUC-based approach for high-dimensional data.
Results
We propose an AUC-based approach using penalized regression (AucPR), which is a parametric method used for obtaining a linear combination for maximizing the AUC. To obtain the AUC maximizer in a high-dimensional context, we transform a classical parametric AUC maximizer, which is used in a low-dimensional context, into a regression framework and thus apply the penalized regression approach directly. Two kinds of penalization, lasso and elastic net, are considered. The parametric approach can avoid some of the difficulties of a conventional non-parametric AUC-based approach, such as the lack of an appropriate concave objective function and a prudent choice of the smoothing parameter. We apply the proposed AucPR for gene selection and classification using four real microarray data sets and synthetic data. Through numerical studies, AucPR is shown to perform better than the penalized logistic regression and the nonparametric AUC-based method, in the sense of AUC and sensitivity for a given specificity, particularly when there are many correlated genes.
Conclusion
We propose a powerful parametric and easily-implementable linear classifier, AucPR, for gene selection and disease prediction for high-dimensional data. AucPR is recommended for its good prediction performance. Besides gene expression microarray data, AucPR can be applied to other types of high-dimensional omics data, such as miRNA and protein data.
doi:10.1186/1471-2164-15-S10-S1
PMCID: PMC4304290  PMID: 25559769
AUC; high-dimensional data; penalized regression; ROC curve
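The AUC criterion that this entry optimizes can be computed empirically for any linear marker combination. The sketch below scores samples with fixed weights and computes the nonparametric (Mann-Whitney) AUC. It is an evaluation helper under assumed names and made-up toy numbers, not the parametric AucPR estimator itself.

```python
def linear_score(x, weights):
    """Linear combination of marker values for one sample."""
    return sum(w * v for w, v in zip(weights, x))


def empirical_auc(scores_pos, scores_neg):
    """Fraction of (diseased, healthy) pairs ranked correctly.

    Ties count as half a correct ranking (Mann-Whitney convention).
    """
    wins = 0.0
    for s in scores_pos:
        for t in scores_neg:
            if s > t:
                wins += 1.0
            elif s == t:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical two-marker data and weights.
weights = [1.0, 0.5]
diseased = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.2]]
healthy = [[0.4, 0.2], [0.2, 0.2]]
auc = empirical_auc(
    [linear_score(x, weights) for x in diseased],
    [linear_score(x, weights) for x in healthy],
)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.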
13.  A multivariate approach to the integration of multi-omics datasets 
BMC Bioinformatics  2014;15:162.
Background
To leverage the potential of multi-omics studies, exploratory data analysis methods that provide systematic integration and comparison of multiple layers of omics information are required. We describe multiple co-inertia analysis (MCIA), an exploratory data analysis method that identifies co-relationships between multiple high dimensional datasets. Based on a covariance optimization criterion, MCIA simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale, to extract the most variant features from each dataset and facilitate biological interpretation and pathway analysis.
Results
We demonstrate integration of multiple layers of information using MCIA, applied to two typical “omics” research scenarios. The integration of transcriptome and proteome profiles of cells in the NCI-60 cancer cell line panel revealed distinct, complementary features, which together increased the coverage and power of pathway analysis. Our analysis highlighted the importance of the leukemia extravasation signaling pathway in leukemia that was not highly ranked in the analysis of any individual dataset. Secondly, we compared transcriptome profiles of high grade serous ovarian tumors that were obtained on two different microarray platforms and by next generation RNA-sequencing, to identify the most informative platform and extract robust biomarkers of molecular subtypes. We discovered that RNA-sequencing data processed using RPKM had greater variance than data processed with MapSplice and RSEM. We provided novel markers highly associated with tumor molecular subtype combined from four data platforms. MCIA is implemented and available in the R/Bioconductor “omicade4” package.
Conclusion
We believe MCIA is an attractive method for data integration and visualization of several datasets of multi-omics features observed on the same set of individuals. The method is not dependent on feature annotation, and thus it can extract important features even when they are not present across all datasets. MCIA provides simple graphical representations for the identification of relationships between large datasets.
doi:10.1186/1471-2105-15-162
PMCID: PMC4053266  PMID: 24884486
Multivariate analysis; Multiple co-inertia; Data integration; Omic data; Visualization
14.  Analysis of Genome-Wide Association Studies with Multiple Outcomes Using Penalization 
PLoS ONE  2012;7(12):e51198.
Genome-wide association studies have been extensively conducted, searching for markers for biologically meaningful outcomes and phenotypes. Penalization methods have been adopted in the analysis of the joint effects of a large number of SNPs (single nucleotide polymorphisms) and marker identification. This study is partly motivated by the analysis of heterogeneous stock mice dataset, in which multiple correlated phenotypes and a large number of SNPs are available. Existing penalization methods designed to analyze a single response variable cannot accommodate the correlation among multiple response variables. With multiple response variables sharing the same set of markers, joint modeling is first employed to accommodate the correlation. The group Lasso approach is adopted to select markers associated with all the outcome variables. An efficient computational algorithm is developed. Simulation study and analysis of the heterogeneous stock mice dataset show that the proposed method can outperform existing penalization methods.
doi:10.1371/journal.pone.0051198
PMCID: PMC3522680  PMID: 23272092
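Group Lasso selection, as used in this entry, rests on a group-level soft-thresholding step in which all coefficients of one marker across the outcomes are kept or dropped together. The sketch below shows that single step under the assumption of an orthonormalized design. The function name and the parameter `lam` are hypothetical, and the article's full algorithm is not reproduced.

```python
import math


def group_soft_threshold(z, lam):
    """Shrink the vector z toward zero by lam in Euclidean norm.

    Returns the zero vector when the norm of z is at most lam, so the
    whole group (one SNP across all outcomes) is removed at once.
    """
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        return [0.0] * len(z)
    scale = 1.0 - lam / norm
    return [scale * v for v in z]


# Hypothetical unpenalized effects of one SNP on two outcomes.
kept = group_soft_threshold([3.0, 4.0], 2.5)
dropped = group_soft_threshold([3.0, 4.0], 6.0)
```

Because the threshold acts on the norm of the whole group, a marker is selected for all outcomes or for none, matching the abstract's setting of response variables that share the same set of markers.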
15.  3Omics: a web-based systems biology tool for analysis, integration and visualization of human transcriptomic, proteomic and metabolomic data 
BMC Systems Biology  2013;7:64.
Background
Integrative and comparative analyses of multiple transcriptomics, proteomics and metabolomics datasets require an intensive knowledge of tools and background concepts. Thus, it is challenging for users to perform such analyses, highlighting the need for a single tool for such purposes. The 3Omics one-click web tool was developed to visualize and rapidly integrate multiple human inter- or intra-transcriptomic, proteomic, and metabolomic data by combining five commonly used analyses: correlation networking, coexpression, phenotyping, pathway enrichment, and GO (Gene Ontology) enrichment.
Results
3Omics generates inter-omic correlation networks to visualize relationships in data with respect to time or experimental conditions for all transcripts, proteins and metabolites. If only two of three omics datasets are input, then 3Omics supplements the missing transcript, protein or metabolite information related to the input data by text-mining the PubMed database. 3Omics’ coexpression analysis assists in revealing functions shared among different omics datasets. 3Omics’ phenotype analysis integrates Online Mendelian Inheritance in Man with available transcript or protein data. Pathway enrichment analysis on metabolomics data by 3Omics reveals enriched pathways in the KEGG/HumanCyc database. 3Omics performs statistical Gene Ontology-based functional enrichment analyses to display significantly overrepresented GO terms in transcriptomic experiments. Although the principal application of 3Omics is the integration of multiple omics datasets, it is also capable of analyzing individual omics datasets. The information obtained from the 3Omics analyses in Case Studies 1 and 2 is also in accordance with comprehensive findings in the literature.
Conclusions
3Omics incorporates the advantages and functionality of existing software into a single platform, thereby simplifying data analysis and enabling the user to perform a one-click integrated analysis. Visualization and analysis results are downloadable for further user customization and analysis. The 3Omics software can be freely accessed at http://3omics.cmdm.tw.
doi:10.1186/1752-0509-7-64
PMCID: PMC3723580  PMID: 23875761
Visualization; Omics integration; Systems biology; Transcriptomics; Proteomics; Metabolomics; Analysis
16.  Multilevel omic data integration in cancer cell lines: advanced annotation and emergent properties 
BMC Systems Biology  2013;7:14.
Background
High-throughput (omic) data have become more widespread in both quantity and frequency of use, thanks to technological advances, lower costs and higher precision. Consequently, computational scientists are confronted by two parallel challenges: on one side, the design of efficient methods to interpret each of these data in their own right (gene expression signatures, protein markers, etc.) and, on the other side, realization of a novel, pressing request from the biological field to design methodologies that allow for these data to be interpreted as a whole, i.e. not only as the union of relevant molecules in each of these layers, but as a complex molecular signature containing proteins, mRNAs and miRNAs, all of which must be directly associated in the results of analyses that are able to capture inter-layers connections and complexity.
Results
We address the latter of these two challenges by testing an integrated approach on a known cancer benchmark: the NCI-60 cell panel. Here, high-throughput screens for mRNA, miRNA and proteins are jointly analyzed using factor analysis, combined with linear discriminant analysis, to identify the molecular characteristics of cancer. Comparisons with separate (non-joint) analyses show that the proposed integrated approach can uncover deeper and more precise biological information. In particular, the integrated approach gives a more complete picture of the set of miRNAs identified and the Wnt pathway, which represents an important surrogate marker of melanoma progression. We further test the approach on a more challenging patient-dataset, for which we are able to identify clinically relevant markers.
Conclusions
The integration of multiple layers of omics can bring more information than analysis of single layers alone. Using and expanding the proposed integrated framework to integrate omic data from other molecular levels will allow researchers to uncover further systemic information. The application of this approach to a clinically challenging dataset shows its promising potential.
doi:10.1186/1752-0509-7-14
PMCID: PMC3610285  PMID: 23418673
Multi-omic; Emergent property; Factor analysis; Linear discriminant analysis; NCI-60 cell panel
17.  Network-Assisted Investigation of Combined Causal Signals from Genome-Wide Association Studies in Schizophrenia 
PLoS Computational Biology  2012;8(7):e1002587.
With the recent success of genome-wide association studies (GWAS), a wealth of association data has been accumulated for more than 200 complex diseases/traits, creating a strong demand for data integration and interpretation. A combinatory analysis of multiple GWAS datasets, or an integrative analysis of GWAS data and other high-throughput data, has been particularly promising. In this study, we proposed an integrative analysis framework of multiple GWAS datasets by overlaying association signals onto the protein-protein interaction network, and demonstrated it using schizophrenia datasets. Building on a dense module search algorithm, we first searched for significantly enriched subnetworks for schizophrenia in each single GWAS dataset and then implemented a discovery-evaluation strategy to identify module genes with consistent association signals. We validated the module genes in an independent dataset, and also examined them through meta-analysis of the related SNPs using multiple GWAS datasets. As a result, we identified 205 module genes with a joint effect significantly associated with schizophrenia; these module genes included a number of well-studied candidate genes such as DISC1, GNA12, GNA13, GNAI1, GPR17, and GRIN2B. Further functional analysis suggested these genes are involved in neuron-related processes. Additionally, meta-analysis found that 18 SNPs in 9 module genes had Pmeta < 1×10⁻⁴, including the gene HLA-DQA1 located in the MHC region on chromosome 6, which was reported in previous studies using the largest cohort of schizophrenia patients to date. These results demonstrate that our bi-directional network-based strategy is efficient for identifying disease-associated genes with modest signals in GWAS datasets. This approach can be applied to any other complex diseases/traits where multiple GWAS datasets are available.
Author Summary
The recent success of genome-wide association studies (GWAS) has generated a wealth of genotyping data critical to studies of genetic architectures of many complex diseases. In contrast to traditional single marker analysis, an integrative analysis of multiple genes and the assessment of their joint effects have been particularly promising, especially upon the availability of many GWAS datasets and other high-throughput datasets for numerous complex diseases. In this study, we developed an integrative analysis framework for multiple GWAS datasets and demonstrated it in schizophrenia. We first constructed a GWAS-weighted protein-protein interaction (PPI) network and then applied a dense module search algorithm to identify subnetworks with combinatory disease effects. We applied combinatorial criteria for module selection based on permutation tests to determine whether the modules are significantly different from random gene sets and whether the modules are associated with the disease in investigation. Importantly, considering there are many complex diseases with multiple GWAS datasets available, we proposed a discovery-evaluation strategy to search for modules with consistent combined effects from two or more GWAS datasets. This approach can be applied to any diseases or traits that have two or more GWAS datasets available.
doi:10.1371/journal.pcbi.1002587
PMCID: PMC3390381  PMID: 22792057
18.  Simultaneous Clustering of Multiple Gene Expression and Physical Interaction Datasets 
PLoS Computational Biology  2010;6(4):e1000742.
Many genome-wide datasets are routinely generated to study different aspects of biological systems, but integrating them to obtain a coherent view of the underlying biology remains a challenge. We propose simultaneous clustering of multiple networks as a framework to integrate large-scale datasets on the interactions among and activities of cellular components. Specifically, we develop an algorithm JointCluster that finds sets of genes that cluster well in multiple networks of interest, such as coexpression networks summarizing correlations among the expression profiles of genes and physical networks describing protein-protein and protein-DNA interactions among genes or gene-products. Our algorithm provides an efficient solution to a well-defined problem of jointly clustering networks, using techniques that permit certain theoretical guarantees on the quality of the detected clustering relative to the optimal clustering. These guarantees coupled with an effective scaling heuristic and the flexibility to handle multiple heterogeneous networks make our method JointCluster an advance over earlier approaches. Simulation results showed JointCluster to be more robust than alternate methods in recovering clusters implanted in networks with high false positive rates. In systematic evaluation of JointCluster and some earlier approaches for combined analysis of the yeast physical network and two gene expression datasets under glucose and ethanol growth conditions, JointCluster discovers clusters that are more consistently enriched for various reference classes capturing different aspects of yeast biology or yield better coverage of the analysed genes. These robust clusters, which are supported across multiple genomic datasets and diverse reference classes, agree with known biology of yeast under these growth conditions, elucidate the genetic control of coordinated transcription, and enable functional predictions for a number of uncharacterized genes.
Author Summary
The generation of high-dimensional datasets in the biological sciences has become routine (protein interaction, gene expression, and DNA/RNA sequence data, to name a few), stretching our ability to derive novel biological insights from them, with even less effort focused on integrating these disparate datasets available in the public domain. Hence a most pressing problem in the life sciences today is the development of algorithms to combine large-scale data on different biological dimensions to maximize our understanding of living systems. We present an algorithm for simultaneously clustering multiple biological networks to identify coherent sets of genes (clusters) underlying cellular processes. The algorithm allows theoretical guarantees on the quality of the detected clusters relative to the optimal clusters that are computationally infeasible to find, and could be applied to coexpression, protein interaction, protein-DNA networks, and other network types. When combining multiple physical and gene expression based networks in yeast, the clusters we identify are consistently enriched for reference classes capturing diverse aspects of biology, yield good coverage of the analysed genes, and highlight novel members in well-studied cellular processes.
doi:10.1371/journal.pcbi.1000742
PMCID: PMC2855327  PMID: 20419151
19.  integrOmics: an R package to unravel relationships between two omics datasets 
Bioinformatics  2009;25(21):2855-2856.
Motivation: With the availability of many ‘omics’ data, such as transcriptomics, proteomics or metabolomics, the integrative or joint analysis of multiple datasets from different technology platforms is becoming crucial to unravel the relationships between different biological functional levels. However, the development of such an analysis is a major computational and technical challenge as most approaches suffer from high data dimensionality. New methodologies need to be developed and validated.
Results: integrOmics efficiently performs integrative analyses of two types of ‘omics’ variables that are measured on the same samples. It includes a regularized version of canonical correlation analysis to highlight correlations between two datasets, and a sparse version of partial least squares (PLS) regression that includes simultaneous variable selection in both datasets. The usefulness of both approaches has been demonstrated previously and successfully applied in various integrative studies.
Availability: integrOmics is freely available from http://CRAN.R-project.org/ or from the web site companion (http://math.univ-toulouse.fr/biostat) that provides full documentation and tutorials.
Contact: k.lecao@uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp515
PMCID: PMC2781751  PMID: 19706745
20.  GPA: A Statistical Approach to Prioritizing GWAS Results by Integrating Pleiotropy and Annotation 
PLoS Genetics  2014;10(11):e1004787.
Results from Genome-Wide Association Studies (GWAS) have shown that complex diseases are often affected by many genetic variants with small or moderate effects. Identifications of these risk variants remain a very challenging problem. There is a need to develop more powerful statistical methods to leverage available information to improve upon traditional approaches that focus on a single GWAS dataset without incorporating additional data. In this paper, we propose a novel statistical approach, GPA (Genetic analysis incorporating Pleiotropy and Annotation), to increase statistical power to identify risk variants through joint analysis of multiple GWAS data sets and annotation information because: (1) accumulating evidence suggests that different complex diseases share common risk bases, i.e., pleiotropy; and (2) functionally annotated variants have been consistently demonstrated to be enriched among GWAS hits. GPA can integrate multiple GWAS datasets and functional annotations to seek association signals, and it can also perform hypothesis testing to test the presence of pleiotropy and enrichment of functional annotation. Statistical inference of the model parameters and SNP ranking is achieved through an EM algorithm that can handle genome-wide markers efficiently. When we applied GPA to jointly analyze five psychiatric disorders with annotation information, not only did GPA identify many weak signals missed by the traditional single phenotype analysis, but it also revealed relationships in the genetic architecture of these disorders. Using our hypothesis testing framework, statistically significant pleiotropic effects were detected among these psychiatric disorders, and the markers annotated in the central nervous system genes and eQTLs from the Genotype-Tissue Expression (GTEx) database were significantly enriched. We also applied GPA to a bladder cancer GWAS data set with the ENCODE DNase-seq data from 125 cell lines. 
GPA was able to detect cell lines that are biologically more relevant to bladder cancer. The R implementation of GPA is currently available at http://dongjunchung.github.io/GPA/.
Author Summary
In the past 10 years, many genome wide association studies (GWAS) have been conducted to identify the genetic bases of complex human traits. As of January, 2014, more than 12,000 single-nucleotide polymorphisms (SNPs) have been reported to be significantly associated with at least one complex trait/disease. On one hand, about 85% of identified risk variants are located in non-coding regions, which motivates a systematic understanding of the function of non-coding variants in regulatory elements in the human genome. On the other hand, complex diseases are often affected by many genetic variants with small or moderate effects. To address these issues, we propose a statistical approach, GPA, to integrating information from multiple GWAS datasets and functional annotation. Notably, our approach only requires marker-wise p-values as input, making it especially useful when only summary statistics, instead of the full genotype and phenotype data, are available. We applied GPA to analyze GWAS datasets of five psychiatric disorders and bladder cancer, where the central nervous system genes, eQTLs from the Genotype-Tissue Expression (GTEx), and the ENCODE DNase-seq data from 125 cell lines were used as functional annotation. The analysis results suggest that GPA is an effective method for integrative data analysis in the post-GWAS era.
doi:10.1371/journal.pgen.1004787
PMCID: PMC4230845  PMID: 25393678
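As a rough illustration of how an EM algorithm can work directly on marker-wise p-values, the sketch below fits a two-group mixture to a single simulated GWAS. It is not the GPA implementation: the uniform null component, the Beta(alpha, 1) form for non-null p-values, the function name `em_two_group`, the starting values, and the simulated data are all assumptions of this sketch, and pleiotropy and annotation are left out entirely.

```python
import numpy as np

def em_two_group(p, n_iter=200, pi1=0.1, alpha=0.5):
    """EM for p ~ (1 - pi1) * Uniform(0, 1) + pi1 * Beta(alpha, 1).

    Returns the estimated non-null proportion, the Beta shape parameter,
    and the posterior probability that each marker is non-null.
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-300, 1.0)
    log_p = np.log(p)
    for _ in range(n_iter):
        # E-step: posterior probability of the non-null component.
        f1 = alpha * p ** (alpha - 1.0)      # Beta(alpha, 1) density
        z = pi1 * f1 / (pi1 * f1 + (1.0 - pi1))
        # M-step: weighted maximum-likelihood updates.
        pi1 = z.mean()
        alpha = -z.sum() / (z * log_p).sum()
    return pi1, alpha, z

# Simulated p-values (hypothetical): 9000 null and 1000 associated markers.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=9000)
signal_p = rng.beta(0.2, 1.0, size=1000)
pi1_hat, alpha_hat, post = em_two_group(np.concatenate([null_p, signal_p]))
```

Ranking markers by `post` gives a simple analogue of the SNP ranking mentioned in the abstract, under the stated assumptions.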
21.  Risk Assessment and Communication Tools for Genotype Associations with Multifactorial Phenotypes: The Concept of “Edge Effect” and Cultivating an Ethical Bridge between Omics Innovations and Society 
Applications of omics technologies in the postgenomics era swiftly expanded from rare monogenic disorders to multifactorial common complex diseases, pharmacogenomics, and personalized medicine. Already, there are signposts indicative of further omics technology investment in nutritional sciences (nutrigenomics), environmental health/ecology (ecogenomics), and agriculture (agrigenomics). Genotype–phenotype association studies are a centerpiece of translational research in omics science. Yet scientific and ethical standards and ways to assess and communicate risk information obtained from association studies have been neglected to date. This is a significant gap because association studies decisively influence which genetic loci become genetic tests in the clinic or products in the genetic test marketplace. A growing challenge concerns the interpretation of large overlap typically observed in distribution of quantitative traits in a genetic association study with a polygenic/multifactorial phenotype. To remedy the shortage of risk assessment and communication tools for association studies, this paper presents the concept of edge effect. That is, the shift in population edges of a multi-factorial quantitative phenotype is a more sensitive measure (than population averages) to gauge the population level impact and by extension, policy significance of an omics marker. Empirical application of the edge effect concept is illustrated using an original analysis of warfarin pharmacogenomics and the VKORC1 genetic variation in a Brazilian population sample. These edge effect analyses are examined in relation to regulatory guidance development for association studies. We explain that omics science transcends the conventional laboratory bench space and includes a highly heterogeneous cast of stakeholders in society who have a plurality of interests that are often in conflict. 
Hence, communication of risk information in diagnostic medicine also demands attention to processes involved in production of knowledge and human values embedded in scientific practice, for example, why, how, by whom, and to what ends association studies are conducted, and standards are developed (or not). To ensure sustainability of omics innovations and forecast their trajectory, we need interventions to bridge the gap between omics laboratory and society. Appreciation of scholarship in history of omics science is one remedy to responsibly learn from the past to ensure a sustainable future in omics fields, both emerging (nutrigenomics, ecogenomics), and those that are more established (pharmacogenomics). Another measure to build public trust and sustainability of omics fields could be legislative initiatives to create a multidisciplinary oversight body, at arm's length from conflict of interests, to carry out independent, impartial, and transparent innovation analyses and prospective technology assessment.
doi:10.1089/omi.2009.0011
PMCID: PMC2727354  PMID: 19290811
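One possible numerical reading of why population edges can be more sensitive than averages is sketched below. It is an illustration only and not the paper's analysis: the normal distributions, the 0.2 standard-deviation shift, and the threshold at 2 are hypothetical values chosen here.

```python
from statistics import NormalDist

def tail_fraction(mean, sd, threshold):
    """Fraction of a normal population lying above the threshold."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)

# Hypothetical phenotype in standard-deviation units: the marker moves the
# population average by only 0.2 SD.
reference = tail_fraction(0.0, 1.0, 2.0)
shifted = tail_fraction(0.2, 1.0, 2.0)

# Relative change in the share of individuals beyond the threshold.
tail_ratio = shifted / reference
```

Under these assumed numbers, a shift that looks small next to the large overlap of the two distributions still increases the share of individuals beyond the threshold by more than half.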
23.  CALIBRATING NON-CONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION 
Annals of statistics  2013;41(5):2505-2536.
We investigate high-dimensional non-convex penalized regression, where the number of covariates may grow at an exponential rate. Although recent asymptotic theory established that there exists a local minimum possessing the oracle property under general conditions, it is still largely an open problem how to identify the oracle estimator among potentially multiple local minima. There are two main obstacles: (1) due to the presence of multiple minima, the solution path is nonunique and is not guaranteed to contain the oracle estimator; (2) even if a solution path is known to contain the oracle estimator, the optimal tuning parameter depends on many unknown factors and is hard to estimate. To address these two challenging issues, we first prove that an easy-to-calculate calibrated CCCP algorithm produces a consistent solution path which contains the oracle estimator with probability approaching one. Furthermore, we propose a high-dimensional BIC criterion and show that it can be applied to the solution path to select the optimal tuning parameter which asymptotically identifies the oracle estimator. The theory for a general class of non-convex penalties in the ultra-high dimensional setup is established when the random errors follow the sub-Gaussian distribution. Monte Carlo studies confirm that the calibrated CCCP algorithm combined with the proposed high-dimensional BIC has desirable performance in identifying the underlying sparsity pattern for high-dimensional data analysis.
doi:10.1214/13-AOS1159
PMCID: PMC4060811  PMID: 24948843
High-dimensional regression; LASSO; MCP; SCAD; variable selection; penalized least squares
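For readers unfamiliar with the non-convex penalties named in the keywords, the standard SCAD and MCP penalty functions can be written as below. This sketch only evaluates the penalties; it does not implement the calibrated CCCP algorithm or the high-dimensional BIC from the abstract. The function names and the default shape parameters (`a=3.7`, `gamma=3.0`) are choices made here for illustration.

```python
import numpy as np

def mcp_penalty(t, lam, gamma=3.0):
    """Minimax concave penalty (MCP), evaluated elementwise at |t|."""
    t = np.abs(np.asarray(t, dtype=float))
    inner = lam * t - t ** 2 / (2.0 * gamma)   # region |t| <= gamma * lam
    flat = 0.5 * gamma * lam ** 2              # constant beyond gamma * lam
    return np.where(t <= gamma * lam, inner, flat)

def scad_penalty(t, lam, a=3.7):
    """Smoothly clipped absolute deviation (SCAD) penalty at |t|."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t                           # region |t| <= lam
    quad = (2.0 * a * lam * t - t ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    flat = 0.5 * (a + 1.0) * lam ** 2          # constant beyond a * lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, flat))

# Both penalties level off for large coefficients, unlike the Lasso
# penalty lam * |t|, which keeps growing linearly.
grid = np.array([0.0, 0.5, 1.0, 5.0, 10.0])
mcp_values = mcp_penalty(grid, lam=1.0)
scad_values = scad_penalty(grid, lam=1.0)
```

The flat region is what makes these penalties non-convex, and it is also why the objective can have multiple local minima, the difficulty the abstract sets out to address.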
24.  A semantic proteomics dashboard (SemPoD) for data management in translational research 
BMC Systems Biology  2012;6(Suppl 3):S20.
Background
One of the primary challenges in translational research data management is breaking down the barriers between the multiple data silos and the integration of 'omics data with clinical information to complete the cycle from the bench to the bedside. The role of contextual metadata, also called provenance information, is a key factor in effective data integration, reproducibility of results, correct attribution of original source, and answering research queries involving "What", "Where", "When", "Which", "Who", "How", and "Why" (also known as the W7 model). At present, however, there is limited or no effective approach to managing and leveraging provenance information for integrating data across studies or projects. Hence, there is an urgent need for a paradigm shift in creating a "provenance-aware" informatics platform to address this challenge. We introduce an ontology-driven, intuitive Semantic Proteomics Dashboard (SemPoD) that uses provenance together with domain information (semantic provenance) to enable researchers to query, compare, and correlate different types of data across multiple projects, and allow integration with legacy data to support their ongoing research.
Results
The SemPoD platform, currently in use at the Case Center for Proteomics and Bioinformatics (CPB), consists of three components: (a) Ontology-driven Visual Query Composer, (b) Result Explorer, and (c) Query Manager. Currently, SemPoD allows provenance-aware querying of 1153 mass-spectrometry experiments from 20 different projects. SemPoD uses the systems molecular biology provenance ontology (SysPro) to support a dynamic query composition interface, which automatically updates the components of the query interface based on previous user selections and efficiently prunes the result set using a "smart filtering" approach. The SysPro ontology re-uses terms from the PROV-ontology (PROV-O) being developed by the World Wide Web Consortium (W3C) provenance working group, the minimum information required for reporting a molecular interaction experiment (MIMIx), and the minimum information about a proteomics experiment (MIAPE) guidelines. SemPoD was evaluated in terms of both user feedback and system scalability.
Conclusions
SemPoD is an intuitive and powerful provenance ontology-driven data access and query platform that uses the MIAPE and MIMIx metadata guidelines to create an integrated view over large-scale systems molecular biology datasets. SemPoD leverages the SysPro ontology to create an intuitive dashboard for biologists to compose queries, explore the results, and use a query manager for storing queries for later use. SemPoD can be deployed over many existing database applications storing 'omics data, including, as illustrated here, the LabKey data-management system. The initial user feedback evaluating the usability and functionality of SemPoD has been very positive, and it is being considered for wider deployment beyond the proteomics domain and in other 'omics centers.
doi:10.1186/1752-0509-6-S3-S20
PMCID: PMC3524316  PMID: 23282161
25.  A marginal approach to reduced-rank penalized spline smoothing with application to multilevel functional data 
Multilevel functional data is collected in many biomedical studies. For example, in a study of the effect of Nimodipine on patients with subarachnoid hemorrhage (SAH), patients underwent multiple 4-hour treatment cycles. Within each treatment cycle, subjects’ vital signs were reported every 10 minutes. This data has a natural multilevel structure with treatment cycles nested within subjects and measurements nested within cycles. Most literature on nonparametric analysis of such multilevel functional data focuses on conditional approaches using functional mixed effects models. However, parameters obtained from the conditional models do not have direct interpretations as population average effects. When population effects are of interest, we may employ marginal regression models. In this work, we propose marginal approaches to fit multilevel functional data through penalized spline generalized estimating equation (penalized spline GEE). The procedure is effective for modeling multilevel correlated generalized outcomes as well as continuous outcomes without suffering from numerical difficulties. We provide a variance estimator robust to misspecification of the correlation structure. We investigate the large sample properties of the penalized spline GEE estimator with multilevel continuous data and show that the asymptotics fall into two categories. In the small knots scenario, the estimated mean function is asymptotically efficient when the true correlation function is used and the asymptotic bias does not depend on the working correlation matrix. In the large knots scenario, both the asymptotic bias and variance depend on the working correlation. We propose a new method to select the smoothing parameter for penalized spline GEE based on an estimate of the asymptotic mean squared error (MSE). We conduct extensive simulation studies to examine the properties of the proposed estimator under different correlation structures and the sensitivity of the variance estimation to the choice of smoothing parameter. Finally, we apply the methods to the SAH study to evaluate a recent debate on discontinuing the use of Nimodipine in the clinical community.
doi:10.1080/01621459.2013.826134
PMCID: PMC3909538  PMID: 24497670
Penalized spline; GEE; Semiparametric models; Longitudinal data; Functional data
