Purpose of review
The desire for biomarkers for diagnosis and prognosis of diseases has never been greater. With the availability of genome data and an increased availability of proteome data, the discovery of biomarkers has become increasingly feasible. This article reviews some recent applications of the many evolving “omic” technologies to organ transplantation.
With the advancement of many high-throughput “omic” techniques such as genomics, metabolomics, antibiomics, peptidomics and proteomics, efforts have been made to understand potential mechanisms of specific graft injuries and to develop novel biomarkers for acute rejection, chronic rejection, and operational tolerance.
The translation of potential biomarkers from the lab bench to the clinical bedside is not an easy task and will require the concerted effort of immunologists, molecular biologists, transplantation specialists, geneticists, and experts in bioinformatics. Rigorous prospective validation studies will be needed, using large sets of independent patient samples. The appropriate and timely exploitation of evolving “omic” technologies will lay the cornerstone for a new age of translational research in organ transplant monitoring.
genomics; proteomics; organ transplant; biomarker; translational medicine
Advances in biotechnology offer a fast growing variety of high-throughput data for screening molecular activities of genomic, transcriptional, post-transcriptional and translational observations. However, to date, most computational and algorithmic efforts have been directed at mining data from each of these molecular levels (genomic, transcriptional, etc.) separately. In view of the rapid advances in technology (new generation sequencing, high-throughput proteomics) it is important to address the problem of analyzing these data as a whole, i.e. preserving the emergent properties that appear in the cellular system when all molecular levels are interacting. We analyzed one of the (currently) few datasets that provide both transcriptional and post-transcriptional data of the same samples to investigate the possibility of extracting more information using a joint analysis approach.
We use Factor Analysis coupled with pre-established knowledge as a theoretical base to achieve this goal. Our intention is to identify structures that contain information from both mRNAs and miRNAs and that can explain the complexity of the data. Despite the small sample size available, we show that this approach permits the identification of meaningful structures, in particular two polycistronic miRNA genes related to transcriptional activity and likely to be relevant in discriminating gliosarcomas from other brain tumors.
This suggests the need to develop methodologies to simultaneously mine information from different levels of biological organization, rather than linking separate analyses performed in parallel.
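The joint-analysis idea above can be illustrated with a toy sketch: stack standardized mRNA and miRNA profiles of the same samples into one matrix and extract a single shared component by power iteration. This is a deliberately simplified stand-in for the Factor Analysis actually used, and all data values are invented.

```python
import math

def standardize(rows):
    # z-score each feature (row) across samples
    out = []
    for r in rows:
        m = sum(r) / len(r)
        sd = math.sqrt(sum((x - m) ** 2 for x in r) / len(r)) or 1.0
        out.append([(x - m) / sd for x in r])
    return out

def first_component(rows, iters=200):
    # power iteration on X^T X, with features from both levels stacked in X
    n = len(rows[0])
    v = [float(j + 1) for j in range(n)]  # non-uniform start vector
    for _ in range(iters):
        scores = [sum(r[j] * v[j] for j in range(n)) for r in rows]
        w = [sum(rows[i][j] * scores[i] for i in range(len(rows))) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v

# toy data: 3 mRNAs and 2 miRNAs measured on 4 samples (values invented)
mrna  = [[2.1, 2.0, 8.9, 9.2], [1.8, 2.2, 9.1, 8.8], [5.0, 5.1, 4.9, 5.0]]
mirna = [[7.9, 8.1, 1.2, 1.0], [8.2, 7.8, 0.9, 1.1]]
joint = standardize(mrna) + standardize(mirna)   # stack both molecular levels
comp = first_component(joint)
print(comp)  # samples 1-2 and 3-4 separate along the shared component
```

Here the anticorrelated mRNA/miRNA patterns end up on one shared axis, which is the kind of cross-level structure a joint analysis can expose and a separate per-level analysis cannot.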
Omics and bioinformatics are essential to understanding the molecular systems that underlie various plant functions. Recent game-changing sequencing technologies have revitalized sequencing approaches in genomics and have produced opportunities for various emerging analytical applications. Driven by technological advances, several new omics layers such as the interactome, epigenome and hormonome have emerged. Furthermore, in several plant species, the development of omics resources has progressed to address particular biological properties of individual species. Integration of knowledge from omics-based research is an emerging issue as researchers seek to identify significance, gain biological insights and promote translational research. From these perspectives, we provide this review of the emerging aspects of plant systems research based on omics and bioinformatics analyses together with their associated resources and technological advances.
Bioinformatics; Data integration; Genome-scale approach; Omics; Systems analysis
High-throughput “omics” technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, pose a major challenge that requires advanced computational methods and bioinformatics infrastructures for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput “omics” data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput “omics” data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied “omics” data from different laboratories to make useful connections that could lead to new biological knowledge.
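As a minimal illustration of such protein-centric integration, the sketch below joins transcript-level and protein-level measurements through a gene-to-UniProt mapping. The mapping table and measurement values are hand-made stand-ins; a real pipeline would draw on a cross-reference resource such as UniProt ID mapping.

```python
# hypothetical ID mapping and measurements (the UniProt accessions are real,
# the numbers are invented for illustration)
gene_to_uniprot = {"TP53": "P04637", "EGFR": "P00533", "MYC": "P01106"}

microarray = {"TP53": 2.4, "EGFR": -1.1, "MYC": 0.3}   # log2 fold change
mass_spec  = {"P04637": 1.9, "P00533": -0.8}            # protein-level ratio

def protein_centric_merge(array_data, ms_data, mapping):
    """Join transcript- and protein-level measurements on a UniProt key."""
    merged = {}
    for gene, value in array_data.items():
        acc = mapping.get(gene)
        if acc is None:
            continue  # no protein-level identifier known for this gene
        merged[acc] = {"gene": gene, "mrna": value, "protein": ms_data.get(acc)}
    return merged

merged = protein_centric_merge(microarray, mass_spec, gene_to_uniprot)
# proteins measured at both levels with changes in the same direction
concordant = [acc for acc, d in merged.items()
              if d["protein"] is not None and d["mrna"] * d["protein"] > 0]
print(sorted(concordant))   # → ['P00533', 'P04637']
```

The protein accession serves as the integration key, so data typed by gene symbol, transcript, or spectrum all land in one record per protein.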
The desire for biomarkers for diagnosis and prognosis of diseases has never been greater. With the availability of genome data and an increased availability of proteome data, the discovery of biomarkers has become increasingly feasible. However, the task is daunting and requires collaborations among researchers working in the fields of transplantation, immunology, genetics, molecular biology, biostatistics, and bioinformatics. With the advancement of high-throughput omic techniques such as genomics and proteomics (collectively known as proteogenomics), efforts have been made to develop diagnostic tools from new and to-be-discovered biomarkers. Yet biomarker validation, particularly in organ transplantation, remains challenging because of the lack of a true gold standard for diagnostic categories and the analytical bottlenecks that face high-throughput data deconvolution. Even though microarray technology is relatively mature, proteomics is still growing with regard to data normalization and analysis methods. Study design, sample selection, and rigorous data analysis are the critical issues for biomarker discovery using high-throughput proteogenomic technologies that combine the use and strengths of both genomics and proteomics. In this review, we look into the current status and latest developments in the field of biomarker discovery using genomics and proteomics related to organ transplantation, with an emphasis on the evolution of proteomic technologies.
Biomarker discovery; proteogenomics; genomics; proteomics; microarray; transplantation; acute rejection; peptidomics
Significant research has been devoted to predicting diagnosis, prognosis, and response to treatment using high-throughput assays. Rapid translation into clinical results hinges upon efficient access to up-to-date and high-quality molecular medicine modalities.
We first explain why this goal is inadequately supported by existing databases and portals, and then introduce a novel semantic indexing and information retrieval model for clinical bioinformatics. The formalism provides the means for indexing a variety of relevant objects (e.g. papers, algorithms, signatures, datasets) and includes a model of the research processes that create and validate these objects, in order to support their systematic presentation once retrieved.
We test the applicability of the model by constructing proof-of-concept encodings and visual presentations of evidence and modalities in molecular profiling and prognosis of: (a) diffuse large B-cell lymphoma (DLBCL) and (b) breast cancer.
information retrieval; molecular medicine; semantic model; clinical bioinformatics; predictive computational models
The identification of novel candidate markers is a key challenge in the development of cancer therapies. This can be facilitated by putting accessible and automated approaches for analysing the current wealth of ‘omic’-scale data in the hands of researchers who are directly addressing biological questions. Data integration techniques and standardized, automated, high-throughput analyses are needed to manage the data available as well as to help narrow down the excessive number of target gene possibilities presented by modern databases and system-level resources. Here we present CancerMA, an online, integrated bioinformatic pipeline for automated identification of novel candidate cancer markers/targets; it operates by meta-analysing expression profiles of user-defined sets of biologically significant and related genes across a manually curated database of 80 publicly available cancer microarray datasets covering 13 cancer types. A simple-to-use web interface allows bioinformaticians and non-bioinformaticians alike to initiate new analyses as well as to view and retrieve the meta-analysis results. The functionality of CancerMA is shown by means of two validation datasets.
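Meta-analysis across microarray datasets of the kind described here rests on combining per-dataset effect sizes for a gene. A minimal fixed-effect (inverse-variance weighted) sketch, with invented effect sizes, is:

```python
import math

def fixed_effect_meta(effects, variances):
    """Inverse-variance weighted fixed-effect meta-analysis.
    Returns the combined effect, its standard error, and a z statistic."""
    weights = [1.0 / v for v in variances]
    combined = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return combined, se, combined / se

# hypothetical log2 fold changes for one gene in three cancer datasets
effects   = [0.8, 1.1, 0.9]
variances = [0.04, 0.09, 0.05]
est, se, z = fixed_effect_meta(effects, variances)
print(round(est, 3), round(z, 2))
```

Datasets measured with more precision (smaller variance) dominate the combined estimate; a production pipeline would also test for between-study heterogeneity before trusting a fixed-effect model.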
Over the past two decades, high-throughput (HTP) technologies such as microarrays and mass spectrometry have fundamentally changed clinical cancer research. They have revealed novel molecular markers of cancer subtypes, metastasis, and drug sensitivity and resistance. Some have been translated into the clinic as tools for early disease diagnosis, prognosis, and individualized treatment and response monitoring. Despite these successes, many challenges remain: HTP platforms are often noisy and suffer from false positives and false negatives; optimal analysis and successful validation require complex workflows; and great volumes of data are accumulating at a rapid pace. Here we discuss these challenges, and show how integrative computational biology can help diminish them by creating new software tools, analytical methods, and data standards.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-0983-z) contains supplementary material, which is available to authorized users.
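One standard remedy for the false positives noted above is false-discovery-rate control across the thousands of features a high-throughput platform reports. A minimal sketch of the Benjamini-Hochberg step-up procedure, not tied to any particular platform:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value clears the BH line rank*alpha/m
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    # step-up rule: reject the k hypotheses with the smallest p-values
    return sorted(order[:k])

# hypothetical p-values from a differential-expression screen
pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]
print(benjamini_hochberg(pvals))  # → [0, 1]
```

Note that 0.039 and 0.041 would pass an uncorrected 0.05 cutoff but fail FDR control, which is exactly the kind of false-positive pruning the review calls for.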
The development of high-throughput experimental technologies has given rise to the "-omics" era, where terabyte-scale datasets for systems-level measurements of various cellular and molecular phenomena pose considerable challenges in data processing and extraction of biological meaning. Moreover, it has created an unmet need for the effective integration of these datasets to achieve insights into biological systems. While this has increased the demand for bioinformatics experts who can interface with biologists, it has also raised the requirement for biologists to possess a basic capability in bioinformatics and to communicate seamlessly with these experts. This may be achieved by embedding in their undergraduate and graduate life science education basic training in bioinformatics geared towards acquiring a minimum skill set in computation and informatics.
Based on previous attempts to define curricula suitable for addressing the bioinformatics capability gap, an initiative was taken during the Workshops on Education in Bioinformatics and Computational Biology (WEBCB) in 2008 and 2009 to identify a minimum skill set for the training of future bioinformaticians and molecular biologists with informatics capabilities. The minimum skill set proposed is cross-disciplinary in nature, involving a combination of knowledge and proficiency from the fields of biology, computer science, mathematics and statistics, and can be tailored to the needs of the "-omics" era.
The proposed bioinformatics minimum skill set serves as a guideline for biology curriculum design and development in universities at both the undergraduate and graduate levels.
Protein phosphorylation is one of the most important post-translational modifications (PTMs) as it participates in regulating various cellular processes and biological functions. It is therefore crucial to identify phosphorylated proteins to construct a phospho-relay network, and eventually to understand the underlying molecular regulatory mechanism in response to both internal and external stimuli. The changes in phosphorylation status at novel phosphosites can be accurately measured using a 15N-stable isotopic labeling in Arabidopsis (SILIA) quantitative proteomic approach in a high-throughput manner. One of the unique characteristics of the SILIA quantitative phosphoproteomic approach is the preservation of native PTM status on proteins during the entire peptide preparation procedure. Evolved from SILIA is another quantitative PTM proteomic approach, AQUIP (absolute quantitation of isoforms of post-translationally modified proteins), which was developed by combining the advantages of targeted proteomics with SILIA. Bioinformatics-based phosphorylation site prediction coupled with an MS-based in vitro kinase assay is an additional way to extend the capability of phosphosite identification from the total cellular protein. The combined use of SILIA and AQUIP provides a novel strategy for molecular systems biological study and for investigation of the in vivo biological functions of these phosphoprotein isoforms and combinatorial codes of PTMs.
SILIA; AQUIP; plant; quantitative proteomics; post-translational modification; cell signaling and regulation; mass spectrometry-based interactomics
In high-throughput -omics studies, markers identified from the analysis of single data sets often suffer from a lack of reproducibility because of limited sample sizes. A cost-effective remedy is to pool data from multiple comparable studies and conduct integrative analysis. Integrative analysis of multiple -omics data sets is challenging because of the high dimensionality of the data and the heterogeneity among studies. In this article, for marker selection in integrative analysis of data from multiple heterogeneous studies, we propose a 2-norm group bridge penalization approach. This approach can effectively identify markers with consistent effects across multiple studies and accommodate the heterogeneity among studies. We propose an efficient computational algorithm and establish its asymptotic consistency. Simulations and applications in cancer profiling studies show satisfactory performance of the proposed approach.
High-dimensional data; Integrative analysis; 2-norm group bridge
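The 2-norm group bridge estimator itself requires an iterative fitting algorithm; the sketch below shows only the group soft-thresholding idea underlying such penalties, namely that a marker's coefficients across studies are kept or zeroed as a group. All numbers are invented, and this is a group-lasso-style update rather than the authors' exact procedure.

```python
import math

def group_soft_threshold(beta_group, lam):
    """Shrink one marker's coefficient vector (one entry per study) jointly:
    the whole group is zeroed unless its 2-norm exceeds lam, which is how
    group penalties select markers with consistent support across studies."""
    norm = math.sqrt(sum(b * b for b in beta_group))
    if norm <= lam:
        return [0.0] * len(beta_group)
    scale = 1.0 - lam / norm
    return [scale * b for b in beta_group]

# coefficients of two candidate markers estimated in three studies
consistent = [0.9, 1.1, 1.0]      # similar effect in every study
spurious   = [0.05, -0.04, 0.03]  # small, inconsistent effects
print(group_soft_threshold(consistent, 0.5))
print(group_soft_threshold(spurious, 0.5))  # zeroed as a group
```

Because the penalty acts on the group norm, a marker survives selection only when its pooled evidence across studies is strong, while per-study coefficients may still differ in magnitude, accommodating heterogeneity.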
Microarray-based gene expression profiling represents a major breakthrough for understanding the molecular complexity of breast cancer. cDNA expression profiles cannot detect changes in activities that arise from post-translational modifications, however, and therefore do not provide a complete picture of all biologically important changes that occur in tumors. Additional opportunities to identify and/or validate molecular signatures of breast carcinomas are provided by proteomic approaches. Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) offers high-throughput protein profiling, leading to extraction of protein array data, calling for effective and appropriate use of bioinformatics and statistical tools.
Whole tissue lysates of 105 breast carcinomas were analyzed on IMAC 30 ProteinChip Arrays (Bio-Rad, Hercules, CA, USA) using the ProteinChip Reader Model PBS IIc (Bio-Rad) and Ciphergen ProteinChip software (Bio-Rad, Hercules, CA, USA). Cluster analysis of protein spectra was performed to identify protein patterns potentially related to established clinicopathological variables and/or tumor markers.
Unsupervised hierarchical clustering of 130 peaks detected in spectra from breast cancer tissue lysates yielded six clusters of peaks and five groups of patients differing significantly in tumor type, nuclear grade, presence of hormonal receptors, mucin 1, and cytokeratin 5/6 or cytokeratin 14. These tumor groups closely resembled luminal types A and B, basal and HER2-like carcinomas.
Our results show clustering of tumors similar to that provided by cDNA expression profiles of breast carcinomas, which testifies to the validity of the SELDI-TOF MS proteomic approach in this type of study. As SELDI-TOF MS provides different information from cDNA expression profiles, the results suggest the technique's potential to supplement and expand our knowledge of breast cancer, to identify novel biomarkers, and to produce clinically useful classifications of breast carcinomas.
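As an illustration of the unsupervised hierarchical clustering step, here is a naive single-linkage sketch on invented peak-intensity profiles; a real analysis would use an optimized library implementation and proper spectrum preprocessing.

```python
def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(profiles, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters (minimum pairwise member distance) until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclid(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# toy normalized peak intensities for five tumour lysates (invented values)
spectra = [[1.0, 0.1, 0.2], [0.9, 0.2, 0.1],   # pattern 1
           [0.1, 1.0, 0.9], [0.2, 0.9, 1.0],   # pattern 2
           [0.95, 0.15, 0.15]]                  # pattern 1 again
print(sorted(single_linkage(spectra, 2)))  # → [[0, 1, 4], [2, 3]]
```

The same machinery applied to a 105-sample by 130-peak matrix is what produces the patient groups and peak clusters described in the abstract, though published work typically uses complete or average linkage with a chosen distance metric.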
Currently, cancer therapy remains limited by a “one-size-fits-all” approach, whereby treatment decisions are based mainly on the clinical stage of disease, yet fail to reference the individual's underlying biology and its role in driving malignancy. Identifying better personalized therapies for cancer treatment is hindered by the lack of high-quality “omics” data of sufficient size to produce meaningful results and by the difficulty of integrating biomedical data from disparate technologies. Resolving these issues will help translate therapies from research to clinic by enabling clinicians to develop patient-specific treatments based on the unique signatures of a patient's tumor. Here we describe the Georgetown Database of Cancer (G-DOC), a Web platform that enables basic and clinical research by integrating patient characteristics and clinical outcome data with a variety of high-throughput research data in a unified environment. While several rich data repositories for high-dimensional research data exist in the public domain, most focus on a single data type and do not support integration across multiple technologies. G-DOC currently contains data from more than 2500 breast cancer patients and 800 gastrointestinal cancer patients, and it includes a broad collection of bioinformatics and systems biology tools for analysis and visualization of four major “omics” types: DNA, mRNA, microRNA, and metabolites. We believe that G-DOC will help facilitate systems medicine by enabling identification of trends and patterns in integrated data sets and hence facilitate the use of better targeted therapies for cancer. A set of representative usage scenarios is provided to highlight the technical capabilities of this resource.
It is becoming increasingly clear that our current taxonomy of clinical phenotypes masks substantial molecular heterogeneity. Of vital importance for refined clinical practice and improved intervention strategies is to define the hidden, molecularly distinct diseases using modern large-scale genomic approaches. Microarray omics technology has provided a powerful way to dissect the hidden genetic heterogeneity of complex diseases. The aim of this study was thus to develop a bioinformatics approach that seeks the transcriptional features leading to the hidden subtyping of a complex clinical phenotype. The basic strategy of the proposed method was to iteratively bipartition the sample and feature spaces with the super-paramagnetic clustering technique and to seek hard, robust gene clusters that lead to a natural partition of disease samples and that have the highest functional consensus as evaluated with Gene Ontology.
We applied the proposed method to two publicly available microarray datasets of diffuse large B-cell lymphoma (DLBCL), a notoriously heterogeneous phenotype. A feature subset of 30 genes (38 probes) derived from analysis of the first dataset, consisting of 4026 genes and 42 DLBCL samples, identified three categories of patients with very different five-year overall survival rates (70.59%, 44.44% and 14.29%, respectively; p = 0.0017). Analysis of the second dataset, consisting of 7129 genes and 58 DLBCL samples, revealed a feature subset of 13 genes (16 probes) that not only replicated the findings on important DLBCL genes (e.g. JAW1 and BCL7A), but also identified three subtypes clinically similar to those found in the first dataset (5-year overall survival rates of 63.13%, 34.92% and 15.38%, respectively; p = 0.0009). Finally, we built a multivariate Cox proportional-hazards prediction model for each feature subset and identified JAW1 as one of the most significant predictors (p = 0.005 and 0.014; hazard ratios = 0.02 and 0.03, respectively, for the two datasets) in both DLBCL cohorts under study.
Our results show that the proposed algorithm is a promising computational strategy for uncovering hidden genetic heterogeneity from transcriptional profiles of disease samples, which may lead to improved diagnosis and treatment of cancers.
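The survival-rate comparisons above rest on Kaplan-Meier estimation; a minimal sketch on an invented follow-up table:

```python
def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of the survival probability at time t.
    times: follow-up in years; events: 1 = death observed, 0 = censored."""
    s = 1.0
    for u in sorted(set(tm for tm, e in zip(times, events) if e and tm <= t)):
        at_risk = sum(1 for tm in times if tm >= u)
        deaths = sum(1 for tm, e in zip(times, events) if e and tm == u)
        s *= 1.0 - deaths / at_risk   # survive this event time
    return s

# hypothetical follow-up for one patient subgroup (years, event indicator)
times  = [1.0, 2.5, 3.0, 4.0, 5.5, 6.0, 6.5, 7.0]
events = [1,   0,   1,   1,   0,   0,   0,   0  ]
print(round(kaplan_meier(times, events, 5.0), 3))  # 5-year overall survival
```

Censored patients (event = 0) contribute to the risk set up to their last follow-up but never count as deaths, which is what distinguishes this estimator from a naive survival fraction.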
Bioinformatics is the application of omics science, information technology, mathematics and statistics to biomarker detection. Clinical bioinformatics can be applied to the identification and validation of new biomarkers to improve current methods of monitoring disease activity and to identify new therapeutic targets. Acute lung injury (ALI)/acute respiratory distress syndrome (ARDS) affects a large number of patients and carries a poor prognosis. The present review focuses on progress in understanding disease heterogeneity through the use of evolving biological, genomic, and genetic approaches, and on the role of clinical bioinformatics in the pathogenesis and treatment of ALI/ARDS. The remarkable advances in clinical bioinformatics offer a new way to understand disease pathogenesis, diagnosis and treatment.
Acute lung injury; Acute respiratory distress syndrome; Genomics; Proteomics; Metabolomics; Bioinformatics
Deep sequencing techniques provide a remarkable opportunity for comprehensive understanding of tumorigenesis at the molecular level. As omics studies become popular, integrative approaches need to be developed to move from a simple cataloguing of mutations and changes in gene expression to dissecting the molecular nature of carcinogenesis at the systemic level and understanding the complex networks that lead to cancer development.
Here, we describe a high-throughput, multi-dimensional sequencing study of primary lung adenocarcinoma tumors and adjacent normal tissues of six Korean female never-smoker patients. Our data encompass results from exome-seq, RNA-seq, small RNA-seq, and MeDIP-seq. We identified and validated novel genetic aberrations, including 47 somatic mutations and 19 fusion transcripts. One of the fusions involves the c-RET gene, which was recently reported to form fusion genes that may function as drivers of carcinogenesis in lung cancer patients. We also characterized gene expression profiles, which we integrated with genomic aberrations and gene regulations into functional networks. The most prominent gene network module that emerged indicates that disturbances in G2/M transition and mitotic progression are causally linked to tumorigenesis in these patients. Also, results from the analysis strongly suggest that several novel microRNA-target interactions represent key regulatory elements of the gene network.
Our study not only provides an overview of the alterations occurring in lung adenocarcinoma at multiple levels from genome to transcriptome and epigenome, but also offers a model for integrative genomics analysis and proposes potential target pathways for the control of lung adenocarcinoma.
Consortia of microorganisms, commonly known as biofilms, are attracting much attention from the scientific community due to their impact on human activity. As biofilm research grows into a data-intensive discipline, the need for suitable bioinformatics approaches becomes compelling, both to manage and validate individual experiments and to execute large-scale inter-laboratory comparisons. However, biofilm data are scattered across ad hoc, non-standardized individual files; thus, data interchange among researchers, or any attempt at cross-laboratory experimentation or analysis, is rarely possible or even attempted.
This paper presents BiofOmics, the first publicly accessible Web platform specialized in the management and analysis of data derived from biofilm high-throughput studies. The aim is to promote data interchange across laboratories, support collaborative experiments, and enable the development of bioinformatics tools for processing and analyzing the increasing volumes of experimental biofilm data being generated. BiofOmics’ data deposition facility enforces data structuring and standardization, supported by controlled vocabulary. Researchers are responsible for the description of their experiments, results and conclusions. BiofOmics’ curators interact with submitters only to enforce data structuring and the use of controlled vocabulary. BiofOmics’ search facility then makes the profile and data associated with a submitted study publicly available, so that any researcher can profit from these standardization efforts to compare similar studies, generate new hypotheses to be tested, or even extend the conditions tested in the study.
BiofOmics’ novelty lies in its support to standardized data deposition, the availability of computerizable data files and the free-of-charge dissemination of biofilm studies across the community. Hopefully, this will open promising research possibilities, namely the comparison of results between different laboratories, the reproducibility of methods within and between laboratories, and the development of guidelines and standardized protocols for biofilm formation operating procedures and analytical methods.
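Standardized deposition against a controlled vocabulary can be sketched as a simple record validator. The field names and allowed terms below are hypothetical illustrations, not BiofOmics' actual schema.

```python
# hypothetical controlled vocabulary for a biofilm experiment record;
# the platform's real vocabulary and field names may differ
VOCABULARY = {
    "organism": {"Pseudomonas aeruginosa", "Staphylococcus epidermidis"},
    "platform": {"96-well microtiter plate", "flow cell", "Calgary device"},
    "quantification": {"crystal violet", "CFU counting", "XTT"},
}

def validate_record(record):
    """Return a list of (field, problem) pairs; an empty list means the
    record conforms to the controlled vocabulary."""
    problems = []
    for field, allowed in VOCABULARY.items():
        if field not in record:
            problems.append((field, "missing"))
        elif record[field] not in allowed:
            problems.append((field, "not in controlled vocabulary"))
    return problems

record = {"organism": "Pseudomonas aeruginosa",
          "platform": "petri dish",            # free text, rejected
          "quantification": "crystal violet"}
print(validate_record(record))  # → [('platform', 'not in controlled vocabulary')]
```

Rejecting free text at deposition time is what makes later cross-laboratory comparison computable at all: two studies can be matched on `platform` only if both drew the value from the same vocabulary.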
Advances in the high-throughput omic technologies have made it possible to profile cells in a large number of ways at the DNA, RNA, protein, chromosomal, functional, and pharmacological levels. A persistent problem is that some classes of molecular data are labeled with gene identifiers, others with transcript or protein identifiers, and still others with chromosomal locations. What has lagged behind is the ability to integrate the resulting data to uncover complex relationships and patterns. Those issues are reflected in full form by molecular profile data on the panel of 60 diverse human cancer cell lines (the NCI-60) used since 1990 by the U.S. National Cancer Institute to screen compounds for anticancer activity. To our knowledge, CellMiner is the first online database resource for integration of the diverse molecular types of NCI-60 and related meta data.
CellMiner enables scientists to perform advanced querying of molecular information on NCI-60 (and additional types) through a single web interface. CellMiner is a freely available tool that organizes and stores raw and normalized data that represent multiple types of molecular characterizations at the DNA, RNA, protein, and pharmacological levels. Annotations for each project, along with associated metadata on the samples and datasets, are stored in a MySQL database and linked to the molecular profile data. Data can be queried and downloaded along with comprehensive information on experimental and analytic methods for each data set. A Data Intersection tool allows selection of a list of genes (proteins) in common between two or more data sets and outputs the data for those genes (proteins) in the respective sets. In addition to its role as an integrative resource for the NCI-60, the CellMiner package also serves as a shell for incorporation of molecular profile data on other cell or tissue sample types.
CellMiner is a relational database tool for storing, querying, integrating, and downloading molecular profile data on the NCI-60 and other cancer cell types. More broadly, it provides a template to use in providing such functionality for other molecular profile data generated by academic institutions, public projects, or the private sector. CellMiner is available online at .
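The Data Intersection tool described above amounts to intersecting identifier sets across datasets and emitting the per-dataset values for the shared genes. A minimal sketch with invented values:

```python
def data_intersection(*datasets):
    """Genes (or proteins) measured in every supplied dataset, with the
    corresponding value from each dataset."""
    common = set(datasets[0])
    for d in datasets[1:]:
        common &= set(d)
    return {g: [d[g] for d in datasets] for g in sorted(common)}

# hypothetical per-gene values from three NCI-60-style profiling runs
expression  = {"TP53": 1.2, "EGFR": -0.4, "MYC": 2.0}
copy_number = {"TP53": 0.1, "MYC": 0.8, "KRAS": -0.2}
drug_corr   = {"MYC": 0.55, "TP53": -0.3}
print(data_intersection(expression, copy_number, drug_corr))
```

Only TP53 and MYC appear in all three inputs, so the output pairs each of them with one value per dataset, ready for cross-platform comparison.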
The modern biomedical research and healthcare delivery domains have seen an unparalleled increase in the rate of innovation and novel technologies over the past several decades. Catalyzed by paradigm-shifting public and private programs focusing upon the formation and delivery of genomic and personalized medicine, the need for high-throughput and integrative approaches to the collection, management, and analysis of heterogeneous data sets has become imperative. This need is particularly pressing in the translational bioinformatics domain, where many fundamental research questions require the integration of large scale, multi-dimensional clinical phenotype and bio-molecular data sets. Modern biomedical informatics theory and practice has demonstrated the distinct benefits associated with the use of knowledge-based systems in such contexts. A knowledge-based system can be defined as an intelligent agent that employs a computationally tractable knowledge base or repository in order to reason upon data in a targeted domain and reproduce expert performance relative to such reasoning operations. The ultimate goal of the design and use of such agents is to increase the reproducibility, scalability, and accessibility of complex reasoning tasks. Examples of the application of knowledge-based systems in biomedicine span a broad spectrum, from the execution of clinical decision support, to epidemiologic surveillance of public data sets for the purposes of detecting emerging infectious diseases, to the discovery of novel hypotheses in large-scale research data sets. 
In this chapter, we will review the basic theoretical frameworks that define core knowledge types and reasoning operations with particular emphasis on the applicability of such conceptual models within the biomedical domain, and then go on to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems.
With the advent of high-throughput technologies, the field of systems biology has amassed an abundance of “omics” data, quantifying thousands of cellular components across a variety of scales, ranging from mRNA transcript levels to metabolite quantities. Methods are needed not only to integrate these omics data but also to use them to heighten the predictive capabilities of computational models. Several recent studies have successfully demonstrated how flux balance analysis (FBA), a constraint-based modeling approach, can be used to integrate transcriptomic data into genome-scale metabolic network reconstructions to generate predictive computational models. In this review, we summarize such FBA-based methods for integrating expression data into genome-scale metabolic network reconstructions, highlighting their advantages as well as their limitations.
flux balance analysis; data integration; transcriptomics; expression data; metabolic networks
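At its core, FBA solves a linear program: maximize a biomass flux subject to steady-state stoichiometry and flux bounds. A toy three-reaction sketch using SciPy's LP solver is below; a genome-scale model (e.g. via a dedicated package such as COBRApy) would be far larger, and expression data would enter by tightening the bounds of individual reactions.

```python
from scipy.optimize import linprog

# stoichiometric matrix S (rows: metabolites A, B; columns: reactions
# R1: -> A uptake, R2: A -> B, R3: B -> biomass export)
S = [[1, -1, 0],
     [0, 1, -1]]
b = [0, 0]                                # steady state: S v = 0
bounds = [(0, 10), (0, 100), (0, 100)]    # R1 uptake capped at 10 units

# maximize biomass flux v3; linprog minimizes, hence the sign flip on c
res = linprog(c=[0, 0, -1], A_eq=S, b_eq=b, bounds=bounds, method="highs")
print([round(v, 3) for v in res.x])  # all fluxes pinned to the uptake limit
```

The optimum routes every unit of uptake straight through to biomass, so the growth prediction is limited by whichever bound is tightest, which is exactly the lever transcriptomic integration methods manipulate.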
Despite continual efforts to develop a prognostic model of gastric cancer by using clinical and pathological parameters, a clinical test that can discriminate patients with good outcomes from those with poor outcomes after gastric cancer surgery has not been established. We aim to develop a practical biomarker-based risk score that can predict relapse of gastric cancer after surgical treatment.
Using microarray technologies, we generated and analyzed gene expression profiling data from 65 gastric cancer patients to identify biomarker genes associated with relapse. The association of expression patterns of identified genes with relapse and overall survival was validated in independent gastric cancer patients.
We uncovered two subgroups of gastric cancer that were strongly associated with the prognosis. For the easy translation of our findings into practice, we developed a scoring system based on the expression of six genes that predicted the likelihood of relapse after curative resection. In multivariate analysis, the risk score was an independent predictor of relapse in a cohort of 96 patients. We were able to validate the robustness of the 6-gene signature in an additional independent cohort.
The risk score derived from the 6-gene set successfully prognosticated the relapse of gastric cancer patients after gastrectomy.
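A six-gene risk score of this kind reduces, at prediction time, to a weighted sum of expression values compared against a cutoff. The gene names, weights, and cutoff below are invented placeholders for illustration, not the published signature.

```python
# hypothetical 6-gene signature: weights would come from the fitted model
WEIGHTS = {"G1": 0.8, "G2": -0.5, "G3": 0.6, "G4": 0.4, "G5": -0.7, "G6": 0.3}
CUTOFF = 0.0  # scores above the cutoff flag a high risk of relapse

def risk_score(expression):
    """Weighted sum of the signature genes' expression values."""
    return sum(w * expression[g] for g, w in WEIGHTS.items())

def risk_group(expression):
    return "high" if risk_score(expression) > CUTOFF else "low"

# one hypothetical patient's normalized expression values
patient = {"G1": 1.2, "G2": 0.4, "G3": -0.1, "G4": 0.9, "G5": 1.5, "G6": 0.2}
print(round(risk_score(patient), 2), risk_group(patient))
```

In a validation cohort, the dichotomized score would then be tested as a predictor of relapse alongside clinicopathological covariates, as in the multivariate analysis described above.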
The histologic scoring of renal biopsies is still the gold standard for renal disease classification. The Banff classification scheme and the chronic allograft damage index are histopathologic scoring schemes widely used in renal transplantation. The determination of genome-wide gene expression profiles in human renal biopsies has the potential to serve as independent validation data sets and also provide a more precise evaluation of the functional status behind the visible morphologic alterations. It is expected that results from high-throughput -omics experiments will lead to improved classification schemes in the near future as also discussed at recent Banff meetings. In this review we give an overview on -omics studies, focusing on the association of molecular changes on the transcript as well as on the protein level and morphologic scoring schemes in renal disease and transplantation.
Histopathologic classification; morphology; gene expression signatures; biomarkers; kidney function
Systems integration is becoming the driving force for 21st century biology. Researchers are systematically tackling gene functions and complex regulatory processes by studying organisms at different levels of organization, from genomes and transcriptomes to proteomes and interactomes. To fully realize the value of such high-throughput data requires advanced bioinformatics for integration, mining, comparative analysis, and functional interpretation. We are developing a bioinformatics research infrastructure that links data mining with text mining and network analysis in the systems biology context for biological network discovery. The system features include: (i) integration of over 100 molecular and omics databases, along with gene/protein ID mapping from disparate data sources; (ii) data mining and text mining capabilities for literature-based knowledge extraction; and (iii) interoperability with ontologies to capture functional properties of proteins and complexes. The system further connects with a data analysis pipeline for next-generation sequencing, linking genomics data to functional annotation. The integrative approach will reveal hidden interrelationships among the various components of the biological systems, allowing researchers to ask complex biological questions and gain better understanding of biological and disease processes, thereby facilitating target discovery.
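One of the building blocks named above, gene/protein ID mapping across disparate sources, amounts to joining per-source lookup tables on a canonical accession. A toy sketch (the accessions shown are real UniProt/Entrez identifiers, but the tables are illustrative and not the system's actual data):

```python
# Toy sketch of cross-database gene/protein ID mapping: merge records from
# disparate sources into one unified record per canonical accession.
# The two mapping tables below are illustrative, not the system's real data.
gene_symbols = {"P04637": "TP53", "P38398": "BRCA1"}   # UniProt -> gene symbol
entrez_ids   = {"P04637": 7157, "P38398": 672}         # UniProt -> Entrez Gene ID

def unify(accessions):
    """Merge per-source records into one dict per canonical accession."""
    return {acc: {"symbol": gene_symbols.get(acc),
                  "entrez": entrez_ids.get(acc)}
            for acc in accessions}

print(unify(["P04637"]))  # → {'P04637': {'symbol': 'TP53', 'entrez': 7157}}
```

With all identifiers resolved to one canonical key, omics measurements from different platforms can be joined on that key before downstream mining.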
Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale, thereby enabling entirely new “omics”-based approaches to the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data are available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters) and in return receive a trained method (including a visual representation of the identified motif) that can subsequently be used as a prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets, including peptide microarray-derived sets containing more than 100,000 data points.
NNAlign is available online at http://www.cbs.dtu.dk/services/NNAlign.
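NNAlign itself trains artificial neural networks on aligned peptides; as a much simpler stand-in, the kind of sequence motif such methods recover can be illustrated with a position-specific scoring matrix (PSSM) built from a handful of example peptides. This is a toy sketch of motif scoring, not the NNAlign algorithm, and the example peptides are illustrative:

```python
# Toy illustration of motif scoring with a position-specific scoring matrix
# (PSSM). NNAlign trains neural networks on quantitative peptide data; this
# sketch only shows the simpler idea of a per-position log-odds motif.
import math
from collections import Counter

def build_pssm(peptides, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Per-position log-odds matrix from equal-length peptides vs. a uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for pos in range(len(peptides[0])):
        counts = Counter(p[pos] for p in peptides)
        # add-one smoothing so unseen residues get a finite (negative) score
        freqs = {a: (counts[a] + 1) / (len(peptides) + len(alphabet)) for a in alphabet}
        pssm.append({a: math.log(freqs[a] / background) for a in alphabet})
    return pssm

def score(pssm, peptide):
    """Sum of per-position log-odds scores for one peptide."""
    return sum(col[aa] for col, aa in zip(pssm, peptide))

# Illustrative 9-mer peptides standing in for a set of quantitative binders
binders = ["ILKEPVHGV", "LLFGYPVYV", "GLSPTVWLS"]
pssm = build_pssm(binders)
print(score(pssm, "ILKEPVHGV") > score(pssm, "AAAAAAAAA"))  # → True
```

A trained NNAlign model plays the role of `score` here, but learns the motif and the peptide alignment jointly from the quantitative readouts instead of from simple residue counts.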
Linking phenotypes to the high-throughput molecular biology information generated by ~omics technologies makes it possible to reveal the cellular mechanisms underlying an organism's phenotype. ~Omics datasets are often very large and noisy, with many features (e.g., genes, metabolite abundances). Associating phenotypes with ~omics data therefore requires an approach that is robust to noise and can handle large and diverse data sets.
We developed a web-tool, PhenoLink (http://bamics2.cmbi.ru.nl/websoftware/phenolink/), that links phenotypes to ~omics data sets using well-established as well as new techniques. PhenoLink imputes missing values and preprocesses input data (i) to decrease inherent noise in the data and (ii) to counterbalance pitfalls of the Random Forest algorithm, on which feature (e.g., gene) selection is based. Preprocessed data is used in feature (e.g., gene) selection to identify relations to phenotypes. We applied PhenoLink to identify gene-phenotype relations based on the presence/absence of 2847 genes in 42 Lactobacillus plantarum strains and phenotypic measurements of these strains in several experimental conditions, including growth on sugars and nitrogen-dioxide production. Genes were ranked based on their importance (predictive value) for correctly predicting the phenotype of a given strain. In addition to known gene-phenotype relations, we also found novel relations.
PhenoLink is an easily accessible web-tool that facilitates the identification of relations in large and often noisy phenotype and ~omics datasets. The visualization of links to phenotypes offered in PhenoLink allows prioritizing links, finding relations between features, finding relations between phenotypes, and identifying outliers in phenotype data. PhenoLink can be used to uncover phenotype links in a multitude of ~omics data, e.g., gene presence/absence (determined by, e.g., CGH or next-generation sequencing), gene expression (determined by, e.g., microarrays or RNA-seq), or metabolite abundance (determined by, e.g., GC-MS).
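The core ranking step described above, scoring genes by their predictive value for a phenotype with a Random Forest, can be sketched as follows. The data here are synthetic (a hypothetical phenotype driven by one gene plus noise), scikit-learn is assumed for the classifier, and PhenoLink's imputation and preprocessing steps are omitted:

```python
# Sketch of PhenoLink's core idea: rank genes by Random Forest feature
# importance for predicting a phenotype from gene presence/absence data.
# The data are synthetic; PhenoLink adds imputation and preprocessing on top.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_strains, n_genes = 42, 200
X = rng.integers(0, 2, size=(n_strains, n_genes))    # gene presence/absence matrix
y = X[:, 7] ^ (rng.random(n_strains) < 0.1)          # phenotype driven by gene 7, with noise

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Rank genes by importance (predictive value) for the phenotype
ranking = np.argsort(forest.feature_importances_)[::-1]
print("top-ranked genes:", ranking[:5])
```

Highly ranked genes are candidate gene-phenotype relations; as in the abstract, the ranking recovers known drivers and can also surface novel candidates for experimental follow-up.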