To review the issues in data and knowledge integration that have arisen with the advent of translational research, and to survey current efforts to address these issues.
Using examples from the biomedical literature, we identified new trends in biomedical research and their impact on bioinformatics. We analyzed the requirements for effective knowledge repositories and studied issues in the integration of biomedical knowledge.
New diagnostic and therapeutic approaches based on gene expression patterns have brought about new issues in the statistical analysis of data, and new workflows are needed to support translational research. Interoperable data repositories based on standard annotations, infrastructures and services are needed to support the pooling and meta-analysis of data, as well as their comparison to earlier experiments. High-quality, integrated ontologies and knowledge bases serve as a source of prior knowledge used in combination with traditional data mining techniques and contribute to the development of more effective data analysis strategies.
As biomedical research evolves from traditional clinical and biological investigations towards omics sciences and translational research, specific needs have emerged, including integrating data collected in research studies with patient clinical data, linking omics knowledge with medical knowledge, modeling the molecular basis of diseases, and developing tools that support in-depth analysis of research data. As such, translational research illustrates the need to bridge the gap between bioinformatics and medical informatics, and opens new avenues for biomedical informatics research.
Access to information, analysis of data, and integration of knowledge are key components of biomedical research. Scientists and physicians must be able to integrate their data with other data, to combine information from multiple sources, and to compare their results to prior knowledge. This paper illustrates the role of knowledge in biomedical research, with a focus on omics disciplines, and surveys current efforts to address the needs of biomedical researchers for better access to information and better integration of data and knowledge.
The current era of biomedical research can be characterized by what NIH Director E.A. Zerhouni calls the "four Ps" of medicine: Predictive, Personalized, Preemptive and Participatory1. Risk factors for disease must be identified early in order to adopt counter-measures, especially for long-term, chronic diseases. Treatments must be tailored to take into account the characteristics of individual patients. Shifting the focus of medicine from the current doctor-centric, curative paradigm to preventing diseases will require the active involvement of patients. With the advent of personalized medicine, biomarkers, including genetic markers, will be tested for each patient in order to diagnose specific forms of diseases, predict disease progression and patient outcome, and propose the best therapeutic options. This scenario puts genomics and pharmacogenomics at the centre of medicine. This new vision of personalized medicine is supported by very active biomedical research. As the role of "omics" disciplines2 in biomedical research becomes more important, classical clinical studies must be adapted to these new approaches. New models of diseases have emerged from these studies. The genes identified through omics studies provide clues to possible pathogenetic mechanisms and are likely to be useful in developing diagnostic tests and adapting therapeutic responses. Discoveries typically begin at "the bench" with basic research. They must then be translated into practical applications and progress to the clinical level, the patient's "bedside." In parallel, clinical researchers make novel observations about the nature and progression of disease that often stimulate basic investigations. This exchange of information defines translational research, or translational medicine: researchers and physicians applying newly gained knowledge to the clinic - and back again to the bench3.
Such recent changes in biomedical research have brought about new challenges for bioinformatics and medical informatics. The analysis of genomic studies and the new workflows between research and health care generate greater demand for accessing and integrating information.
Over the past decade, biomedical research has evolved to mine gene expression profiles for clues to the pathogenesis, prognosis and treatment of human diseases. In oncology, for example, this research rests on the premise that extraordinary insights into the molecular basis of cancer can be obtained by analyzing gene expression in patient-derived tumor samples, in addition to experimental models. DNA microarrays (DNA chips) are used to monitor the gene expression (i.e., a proxy for gene activity) of thousands of genes simultaneously across the human genome. This technique involves the extraction of RNA from tumor samples and its subsequent fluorescent labeling and hybridization to an array of DNA probes. Microarrays covering nearly the entire human genome are now available. In a series of experiments, Golub demonstrated that the classification of cancer -- specifically two principal forms of acute leukemia -- could be achieved by using DNA microarrays to monitor gene expression, without a prior molecular understanding of this distinction. This finding implies that such methodologies can be applied to the molecular dissection of cancers. This approach has been used for the molecular classification of many tumor types, including lymphoma (e.g., ), prostate cancer (e.g., ), brain tumors (e.g. ), and lung cancer (e.g., ). Similar approaches have demonstrated that patterns of gene expression (or gene expression "signatures") may be found across different tumor types. For example, Golub et al identified a signature of metastatic propensity across prostate, breast, and lung cancers, suggesting that a genetic test performed at the time of diagnosis might predict the future behavior of some tumors. While most studies of gene expression have been carried out on tissue samples, some have used peripheral blood samples (e.g. ), thus extending the applicability of this technique.
Gene expression-based approaches are also widely used in pharmacology (e.g., ). The expectation here is that genomic approaches might lead to the discovery of molecules and compounds capable of modulating biological processes in cells. Drug discovery typically starts with prior knowledge of a target gene that is biologically relevant to a disease state (e.g., a gene mutation in cancer). The protein product of this gene is then biochemically purified, and a collection of compounds screened in vitro for their ability to bind to the protein. Novel approaches to drug discovery are based on genomics. Gene expression-based methods are used to identify candidate drugs that modulate previously intractable targets. These genes and gene products can serve as potential therapeutic targets or tools in addition to providing diagnostic and prognostic markers, as well as end-points for clinical trials. In cancer research, this approach has been applied to the discovery of substances that may induce the maturation of abnormal cells (e.g., acute myeloid leukemia cells), inhibit androgen or estrogen action in cancer cells, inhibit angiogenesis associated with tumor cell proliferation or inhibit the activity of the causal protein in some tumors (e.g., Ewing sarcoma).
The functional consequences of genetic polymorphisms have been examined for several drug-metabolizing enzymes. Variants leading to reduced or increased enzymatic activity compared to the wild-type alleles have been identified. The possible application of genotyping has been discussed for several pathologic conditions. Among many other examples, the acetylator status has long been used for predicting isoniazid-induced hepatic toxicity in tuberculosis, and associations between genetic variability and response to beta-adrenergic medications have been explored. The association between gene expression and response to treatment holds the promise of personalized medicine, as doctors will be able to individualize drug therapy and provide specific therapies to those most likely to respond, while avoiding therapies in those most likely to suffer adverse effects.
Clinical trials provide an evaluation framework for interventions. Parameters are measured in patients under different types of interventions, and the values of these parameters are compared across groups of subjects in order to identify associations between interventions and outcomes. Traditional clinical trials generally involve many subjects in whom only a few parameters are measured. Conversely, omics studies typically generate a large number of measurements on a limited number of test subjects (relative to the number of parameters measured). This imbalance has created new issues involving statistics and bias. Omics studies offer a potentially powerful approach to identifying new biomarkers, but many of them are plagued by a lack of consistency and reproducibility (e.g., ). In principle, the inconsistency may be due to false positive studies, false negative studies or true variability among heterogeneous groups. In order to avoid biases and obtain more reliable results, the data from individual experiments at different centers could be pooled and public data repositories used for comparative data analysis. Moreover, the goal of omics approaches is also to acquire a comprehensive, integrated understanding of biology by studying all biological processes together, in addition to analyzing parameters individually (e.g., ). Therefore, solutions exploiting prior knowledge about gene functions (e.g., in gene annotation databases) and multi-scale biological models have been proposed and are discussed in section 3.3.
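Because omics studies measure thousands of parameters on few subjects, multiple-testing correction is central to their statistical analysis. As one illustration (a standard technique, not tied to any specific study cited here), the Benjamini-Hochberg procedure for controlling the false discovery rate can be sketched in a few lines of Python:

```python
# Sketch of the Benjamini-Hochberg step-up procedure, a standard way to
# control the false discovery rate when thousands of genes are tested.
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value passes the step-up threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])  # reject all hypotheses ranked <= k

# Five hypothetical "genes": only the two smallest p-values survive.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))  # [0, 1]
```

Note that the third p-value (0.039) would pass an uncorrected 0.05 threshold but fails its step-up threshold of 0.03, illustrating how correction tempers the many-measurements problem.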
In the context of translational research and translational medicine, information sharing between medical research, epidemiology and clinical medicine has been identified as a strong requirement. Translational research creates a bidirectional information transfer that accelerates trials and evaluates their clinical potential. In this framework, clinical data and biomarkers must be collected early in order to extract new knowledge and form new hypotheses from the mass of collected data. Therefore the relationship between research, population studies and health care rests on the integration of the data and knowledge from these three areas: research (scientific publications, public databases, experimental results), epidemiology (e.g., cohort studies), and healthcare (clinical data stored in patient records).
Two main challenges must be overcome when automatically interrelating data from these different areas. First, these data are annotated with different terminologies, and data referring to the same entity may be represented by different identifiers. For instance, the disease "acute myeloid leukemia" is coded D015470 in bibliographic databases indexed with MeSH, 91861009 in clinical records coded with SNOMED Clinical Terms® (SNOMED CT®)4, and C3171 in research records annotated to the NCI Thesaurus5. The second issue is that the data to be integrated are complementary in nature but intrinsically different (omics - pathology - anatomy - physiology). Ontologies have proven useful for data integration (e.g., [20, 21]). Several ontologies have been developed in bioinformatics and in the biomedical domain. However, they are still incomplete (not all concepts and relations are present) and fragmented (ontologies are orthogonal, and few bridges are established between complementary ontologies) (e.g., ). The enrichment and integration of biomedical ontologies are therefore important challenges for translational medicine and bioinformatics, as well as for the future links between these two disciplines (e.g., [23, 24, 25]).
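The identifier problem can be made concrete with a toy cross-terminology lookup table. The table structure and the `translate` helper below are purely illustrative (no real mapping service works this way); only the three codes come from the example above:

```python
# Illustrative cross-terminology lookup; structure and helper are invented.
XREFS = {
    "acute myeloid leukemia": {
        "MeSH": "D015470",         # bibliographic databases
        "SNOMED CT": "91861009",   # clinical records
        "NCI Thesaurus": "C3171",  # research annotations
    },
}

def translate(concept, source, target):
    """Map a concept's identifier from one terminology to another."""
    codes = XREFS.get(concept, {})
    if source not in codes:
        return None
    return codes.get(target)

print(translate("acute myeloid leukemia", "MeSH", "NCI Thesaurus"))  # C3171
```

In practice such bridges are supplied by resources like the UMLS Metathesaurus rather than hand-built tables.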
Pooling experimental data requires the standard annotation of the experiments. It also requires interoperability among data repositories supported by standard services and workflows. Interoperable data repositories constitute an enabling resource for meta-analysis.
Public datasets have been created in response to the growing demand for publicly available repositories for high-throughput gene expression data. Such public repositories represent an important resource for the biological research community as they provide unrestricted access to microarray data published by other researchers. As such, they complement local in-house gene expression databases by providing reference data for comparative studies. Among them, the Gene Expression Omnibus (GEO) repository developed by the National Center for Biotechnology Information (NCBI) is publicly accessible on the NCBI website at http://www.ncbi.nlm.nih.gov/geo. GEO archives and helps disseminate microarray and other forms of high-throughput data generated by the scientific community. GEO data can be viewed from the perspective of the experiment or the gene. The experiment-centric view presents the entire study, while the gene-centric view displays quantitative gene expression measurements for one given gene across a dataset, with links to gene annotations. Other efforts to archive experiments and make them accessible to the whole community include the Stanford Microarray Database (SMD) (http://smd.stanford.edu) and the ArrayExpress microarray database (http://www.ebi.ac.uk/arrayexpress), developed by the European Bioinformatics Institute. All these repositories promote standard exchange formats such as MAGE-TAB. Moreover, data submitted to these repositories are required to have a common set of core elements. Like many other resources in this domain, including local experimental databases, datasets in public repositories are compliant with the standards that define the minimum information about a microarray experiment. Broad adherence to these standards facilitates the publication and retrieval of data, as it ensures consistency across datasets.
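To give a flavor of what these repositories serve, GEO records in the SOFT format mark entity lines with `^` and attribute lines with `!`. A minimal parser for such a header can be sketched as follows; the record fragment itself is hypothetical, not a real GEO entry:

```python
# Sketch of parsing the header of a GEO SOFT-format record.
# The sample fragment below is invented for illustration.
soft = """\
^SAMPLE = GSM000001
!Sample_title = tumor biopsy, patient 1
!Sample_organism_ch1 = Homo sapiens
!Sample_platform_id = GPL96
"""

def parse_soft_header(text):
    """Collect entity ('^') and attribute ('!') lines into a dictionary."""
    record = {}
    for line in text.splitlines():
        if line.startswith(("^", "!")):
            key, _, value = line[1:].partition(" = ")
            record[key] = value
    return record

record = parse_soft_header(soft)
print(record["SAMPLE"], record["Sample_platform_id"])  # GSM000001 GPL96
```

A real pipeline would of course go on to read the data table that follows the header, but the key-value header is what carries the MIAME-style annotations discussed below.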
In addition to such wide-scale projects, more focused initiatives seek to collect all published data on a given medical topic. Specific pipelines and services have been developed in conjunction with such focused databases. For example, the Oncomine initiative seeks to collect all published cancer microarray data (http://www.oncomine.org). To date, this effort has accumulated 18,000 cancer gene expression experiments. Automated analyses can be performed to identify the genes, pathways, regulatory networks, and functional networks activated and repressed in human cancer. As described in , all cancer microarray data deposited in GEO and SMD are automatically copied to Oncomine and then standardized.
Data repositories may be extended with clinical data. With a focus on three types of tumors -- breast carcinoma, bladder carcinoma and uveal melanoma -- the Integrated Tumor Transcriptome Array and Clinical data Analysis (ITTACA) centralizes public datasets containing both gene expression and clinical data on these tumors. This system enables users to carry out different class comparison analyses, including the comparison of expression distribution profiles, tests for differential expression, and patient survival analyses, and to compare their own results with those in the existing literature (http://bioinfo.curie.fr/ittaca).
The generation of large amounts of data and the need to share and compare these data bring about challenges for both data management and data annotation, and highlight the need for standards. The Microarray Gene Expression Data (MGED) society is an international organization created in 1999 to facilitate the sharing of functional genomics and proteomics array data. MGED has defined the Minimum Information About a Microarray Experiment (MIAME), which corresponds to the minimum information that must be reported about a microarray experiment to enable its unambiguous interpretation and reproduction. This standard has been in worldwide use for years. The Microarray Gene Expression Object Model (MAGE-OM) and the resulting markup language (MAGE-ML) provide a mechanism for standardizing data representation for data exchange purposes. Moreover, a common terminology, the MGED Ontology (MO), has been developed by the Ontology Working Group of the MGED society to complement these standards. The objective of MO is to provide common 'terms for annotating experiments in line with the MIAME guidelines, i.e., to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME' (http://mged.sourceforge.net/ontologies/index.php).
Similar efforts in the field of functional annotation have established standard vocabularies for the annotation of genes and gene products . With the aim of contributing to the unification of biological information, the Gene Ontology (GO) has been developed since 2000 [36, 37] and has been adopted by most model organism databases, such as the Gene Ontology Annotation (GOA) database  (http://www.ebi.ac.uk/GOA).
Moreover, some research communities have decided to standardize their data models and data types to address interoperability issues. One of the requirements for a federated information system is interoperability, i.e., the ability of one computer system to access and use the resources of another system. To meet this need, the U.S. National Cancer Institute Center for Bioinformatics (NCICB) has created the cancer Common Ontologic Representation Environment (caCORE) for the field of cancer research. The caCORE system includes controlled terminologies such as the NCI Thesaurus (NCIT), as well as common data elements (CDEs), which are named identifiers for the entities and attributes found in databases.
However, despite these standardization efforts, not all the data created, stored, and made available in the biomedical domain are homogeneously represented. Because most biomedical systems have been developed independently of each other, these systems do not have a common structure, nor do they share common data elements. Because determining the correspondences between heterogeneous data sources is complex and time-consuming, automated support is needed . Several approaches have been proposed, either based on the comparison of data-elements (schema-level approaches) or based on the comparison of value sets of data elements coming from distinct sources (instance-level approaches) [42, 43, 44].
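A minimal sketch of the instance-level idea: two data elements from different sources become candidate matches when their value sets overlap strongly, for example as measured by Jaccard similarity. The element names and values below are invented for illustration:

```python
# Instance-level schema matching sketch: compare the value sets of data
# elements from two sources; high Jaccard overlap suggests a match.
def jaccard(a, b):
    """Jaccard similarity between two value sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical data elements and observed values from two sources.
source_a = {"sex": {"M", "F"}, "dx_code": {"C3171", "C2926"}}
source_b = {"gender": {"M", "F", "U"}, "diagnosis": {"C3171", "C4872"}}

# Pair each element of source A with its best-scoring element of source B.
for name_a, values_a in source_a.items():
    best = max(source_b, key=lambda name_b: jaccard(values_a, source_b[name_b]))
    print(name_a, "->", best, round(jaccard(values_a, source_b[best]), 2))
```

Schema-level approaches would instead compare the element names and metadata themselves; real systems typically combine both kinds of evidence.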
Biomedical research requires pooling and integrating information from diverse data sources, which is facilitated by the use of common data models and common ontologies. Additionally, coordinated research efforts typically span multiple institutions. Therefore, there is a need for an infrastructure that supports such collaborative efforts, with the objective of enabling more efficient access to resources and the sharing of distributed computational resources. To address this need, the U.S. National Cancer Institute (NCI) has initiated a nationwide effort, called the cancer Biomedical Informatics Grid (caBIG), to develop a federation of interoperable research information systems. At the heart of the caBIG approach to federated interoperability is a Grid middleware infrastructure called caGrid. Moreover, this infrastructure is based on the caCORE system mentioned earlier, which supports the creation of interoperable biomedical information systems. Similar efforts in Europe have established grid infrastructures for sharing computational resources in bioinformatics (e.g., http://www.embracegrid.info) and enabling cooperative biomedical research, for example in infectious diseases and immune diseases, as well as in cancer research.
More generally, grid technologies are expected to facilitate the launch and ongoing management of coordinated cancer research studies involving multiple institutions, to provide the ability to manage and securely share information and analytic resources. Additionally, grid computing supports high-throughput data analysis and predictive classification studies on large datasets . Grid computing can also support the modeling of complex biological systems, which requires advanced computer simulations to bring together knowledge at all the different levels of biological understanding -- from the cell (e.g., gene function) to the organism (e.g., physiology) -- in order to provide a coherent theory of biology, which can then be applied to clinical medicine.
In conjunction with the development of distributed databases and grid computing, an increasing number of tools in biomedical informatics have been developed as Web Services, with potential applications in genomic medicine (e.g., ). Web Services offer two major benefits for the biomedical community: interoperability and reusability. Web Services use standard communication protocols over the Internet, which makes them virtually platform-independent. Instead of developing a specific service locally, developers can reuse Web Service components in their own applications. With the objective of implementing complex data analysis processes, Web Services must be associated with workflow management systems (e.g., ). Environments such as Taverna provide a language and software tools to create and execute workflows and to construct highly complex analyses over public and private data and computational resources [54, 55].
In the near future, these efforts will hopefully be strengthened by the creation of publicly available registries that describe all these services in a standard manner. For example, Stevens et al recommend the use of ontologies to express the semantic information associated with the description of Web Services. The design of broad-coverage formal models of tasks and their representation as formal ontologies will facilitate the discovery of services, their selection, and their composition into dynamic workflows.
One advantage of integrating large numbers of microarray studies and compiling them in a data warehouse is that it becomes possible to compare the results of different studies and to determine which methods are robust and produce consistent results across a range of studies. There are, however, many problems associated with the comparison of gene expression profiles across disparate microarray data sets. In studies performed in 2004 and 2007, several teams demonstrated that the consistency of replicates in each experiment exhibits a large degree of variation. Different technologies seemed to show good agreement within and across labs using the same RNA samples. The variability between two labs using the same technology was higher than that between two technologies within the same lab. Moreover, the source of RNA samples can make a difference in microarray data [58, 59, 60].
Several methods have been developed to address these variability issues in multiple, independent data sets generated on various platforms. Among others:
The approach associating data integration and meta-analysis helps address statistical methodological issues. Data related to the same pathologic condition from different laboratories may be analyzed (e.g. ). For example, Bhanot et al have used classification models with non-Hodgkin's lymphoma-related microarray data from different laboratories, and Lyman et al have used meta-analysis techniques to detect predictors of recurrence-free survival in breast cancer. Data integration may also be used with data corresponding to different diseases, for example different types of cancers. Different kinds of experimental data can be integrated (e.g., microarray and proteomics). Moreover, data from different species can be integrated. For example, English and Butte evaluated 49 obesity-related genome-wide experiments including microarray, genetics, proteomics and gene knock-down from human, mouse, rat and worm. They created an integrative model and showed that intersecting the results of experiments significantly improved the sensitivity, specificity and precision of the prediction of obesity-associated genes.
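One classical way to combine evidence across independent studies is Fisher's combined probability test. The sketch below uses only the Python standard library and is illustrative of meta-analysis in general, not a reconstruction of any method used by the studies cited above:

```python
import math

# Sketch of Fisher's combined probability method for meta-analyzing
# p-values from independent studies of the same hypothesis.
def fisher_combined(p_values):
    """Combine independent p-values; return the pooled p-value.

    Under the null hypothesis, X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom (k = number of studies).
    """
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    # Chi-square survival function for even degrees of freedom 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!
    half = x / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

# Three hypothetical studies, each only weakly significant on its own,
# yield strong pooled evidence:
print(round(fisher_combined([0.04, 0.10, 0.05]), 4))  # pooled p ≈ 0.0092
```

Real meta-analyses usually combine effect sizes with inverse-variance weighting rather than raw p-values, but the pooling principle is the same.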
Computable forms of knowledge include knowledge bases and ontologies. Existing resources are often incomplete and need to be enriched and integrated. Incorporating prior knowledge into the analysis of gene expression datasets has been shown to improve the results.
The number of data sources has grown tremendously over the last decade. Frey et al mention that around 900 biological public databases (e.g., genomic, proteomic, metabolomic, and others) were available in 2007, representing a vast amount of information about genes, proteins, diseases and their interrelations. Besides repositories of experimental data, many knowledge resources are also publicly available. Such resources typically compile manually curated knowledge extracted from the biomedical literature and other sources. For example, Entrez Gene provides information about genes, Online Mendelian Inheritance in Man (OMIM) provides information about genetic diseases, and GOA provides the functional annotation of gene products.
Ontologies have been developed to represent the entities of biomedical interest and their relations, in multiple subdomains and for multiple levels of granularity. Figure 1 shows ontologies from genomics (white), chemistry (blue), anatomy (yellow), and diseases (green). Some reference ontologies are domain-specific, such as the Chemical Entities of Biological Interest (ChEBI) for chemical entities or the Foundational Model of Anatomy (FMA) for anatomical entities. Some ontologies are level-specific, such as GO at the cellular level, or SNOMED at the organism level. Ontologies may partially overlap. For example, subcellular anatomical entities are defined in both the FMA and the Cell Component axis of GO. In contrast, some ontologies may reuse the entities defined in other ontologies. For example, reasoning over the anatomical location of diseases in a clinical ontology can be delegated to the anatomical ontology in which the anatomical entities are defined.
The use of ontologies is a key element to interoperability among resources. For this reason, high-quality ontologies must be available to the community, ideally at no cost and without any constraints impeding their use or redistribution. The Open Biomedical Ontologies (OBO) are a collection of controlled vocabularies freely available to the biomedical community. Web-based ontology portals such as the BioPortal (http://www.bioontology.org/tools/portal/bioportal.html) allow users to browse, search, and visualize ontologies (and metadata) in the library, and to submit an ontology to the library. Ontology portals also tend to include features popularized by the "Web 2.0" movement, including the collaborative review of ontologies by users. The need for innovative technology and methods that allow scientists to record, manage, and disseminate biomedical information and knowledge in machine-processable form gave rise, in part, to initiatives such as the National Center for Biomedical Ontology (NCBO) created in 20056.
The development of OBO ontologies is regulated within the OBO Foundry, which defines a set of shared principles governing ontology development. Knowledge integration will also benefit from the development of top-domain ontologies, such as BioTop. Such ontologies define the top-level classes of biomedical ontologies and can be used for linking finer-grained domain ontologies. Of note, some recently created ontologies were designed to be interoperable and to incorporate accurate representations of biological reality. For example, the PRotein Ontology (PRO) includes connections to other ontologies, including GO. It is expected that the connection of protein forms to GO classes using appropriate relations will support accurate functional annotation. Analogously, relations defined between protein classes and the OBO Disease Ontology will facilitate disease understanding. Until the development of federated biomedical ontologies is fully orchestrated by organizations such as the OBO Foundry (if it ever is), there will be a need for creating ad hoc bridges across existing ontologies, which is one of the objectives of the Unified Medical Language System (UMLS)7 developed by the US National Library of Medicine. The UMLS Metathesaurus integrates 1.4 million concepts from over one hundred terminologies in use in life sciences, as well as some 12 million relations among these concepts. UMLS concepts are not only inter-related, but may also be linked to external resources such as GenBank, providing easy access to the knowledge contained in these resources. More generally, various approaches to aligning existing ontologies are discussed in .
Knowledge integration efforts have benefited from the development of Semantic Web technologies. In the past few years, the World-Wide Web Consortium (W3C) has developed a set of standards and tools to support the vision of a flexible, integrated, automatic and self-adapting Web. Some of these technologies are now mature and have started making an impact in the life sciences. Semantic Web languages include the Resource Description Framework (RDF), a variety of data interchange formats (e.g., RDF/XML, N3, Turtle, N-Triples) and notations, such as RDF Schema (RDFS), and the Web Ontology Language (OWL), all of which are intended to provide a formal description of concepts, terms, and relationships within a given knowledge domain. OWL provides formal computational definitions, as well as tools for reasoning, in order to facilitate ontology development and ontology maintenance. Therefore, most health science ontologies, including those originally developed in OBO format, have been converted to OWL [80, 81].
Standard terminologies, such as the Gene Ontology, are widely used in databases and knowledge bases as controlled vocabularies for functional annotations and largely facilitate comparative functional analysis. However, the functional annotation of gene products is not always consistent across databases and often remains incomplete. Although GO curators adhere to the same protocols and standards while assigning GO annotations, specific annotation procedures and the specialization of curators vary across groups. Methods have been developed to assess the consistency of GO annotation across model organism databases (e.g., ).
Determining the function of uncharacterized proteins remains a major challenge and is an active field of research. Various knowledge sources have been explored, including large-scale protein-protein interaction assays, global mRNA expression analyses and systematic protein localization studies. Various techniques have also been explored to generate functional annotation predictions, including information-theoretic semantic similarity based on existing GO annotations.
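The information-theoretic idea can be sketched on a toy GO-like fragment: Resnik similarity scores two terms by the information content of their most informative common ancestor, where a term's information content is the negative log of its annotation frequency. The DAG and probabilities below are invented for illustration:

```python
import math

# Sketch of Resnik (information-content) similarity between ontology terms.
# The toy DAG and annotation frequencies below are invented.
PARENTS = {
    "lipid transport": ["transport"],
    "ion transport": ["transport"],
    "transport": ["biological_process"],
    "biological_process": [],
}
# Fraction of annotated gene products annotated to each term or below it:
PROB = {"biological_process": 1.0, "transport": 0.2,
        "lipid transport": 0.05, "ion transport": 0.08}

def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def resnik(t1, t2):
    """IC of the most informative common ancestor; IC(t) = -log p(t)."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(PROB[t]) for t in common)

print(round(resnik("lipid transport", "ion transport"), 3))  # 1.609
```

Here the most informative common ancestor is 'transport' (p = 0.2), so the similarity is -ln 0.2 ≈ 1.609; sharing only the root 'biological_process' would score 0.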
Methods based on natural language processing and statistical techniques have been widely used for years for mining free text and extracting GO annotations. While the content of most biological databases is acquired through careful manual curation of literature and data, the increasing volume of biomedical literature to be reviewed and the increasing number of gene products in need of annotation are likely to overload the manual curation process. Consequently, text mining techniques are often employed to retrieve and extract functional annotation from the literature. For example, GoPubMed uses GO to organize the results of a PubMed search. The BioCreAtIvE initiative, with tasks such as gene name normalization and identification of functional annotation from free text, demonstrated that term recognition techniques are suitable for real applications in biology. However, automatic annotation techniques generally require additional knowledge processing and performed less well than gene identification tasks. Daraselia et al also showed the usefulness of combining NLP techniques (protein annotation extracted from Medline) with additional knowledge (information from protein-protein interaction datasets).
Analogous to the methods devoted to quality assurance and enrichment of knowledge bases, methods have been developed for the evaluation of ontologies, including terminology enrichment and consistency checking.
Terminology enrichment techniques are used for identifying missing relations in terminologies. For example, GO lacks explicit associative relations across its three hierarchies, which may impede the consistent clustering of gene products according to functional characteristics. For instance, while the gene APOC3 is associated with both the molecular function 'lipid transporter activity' and the biological process 'lipid transport', APOH is only annotated with 'lipid transporter activity'. To address this issue, various approaches to suggesting new relations among biological terms have been proposed, based on lexical and statistical phenomena. Biological terms are often found as proper substrings of other terms. Compositionality of terms has been used to suggest semantic relations among GO terms directly [88, 89] or through ChEBI terms . Moreover, Mungall proposed a formal language, Obol, for defining allowed compositional patterns among terms from OBO ontologies . Statistical and data mining techniques have also been applied to biological knowledge bases annotated to the GO in order to automatically extract candidate relations among GO terms and help enrich ontologies with associative relations .
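The lexical approach described above, in which compositionality of term labels suggests candidate relations, can be sketched in a few lines. This is a simplified illustration: it proposes a generic candidate relation whenever one term label occurs as a contiguous token subsequence of another (e.g., 'transport' within 'lipid transport'), whereas systems such as Obol apply formally defined compositional patterns.

```python
def suggest_relations(term_labels):
    """Propose candidate relations (b, 'related_to', a) whenever the label of
    term a appears as a proper, contiguous token subsequence of term b's label."""
    proposals = []
    for a in term_labels:
        a_tokens = a.split()
        for b in term_labels:
            if a == b:
                continue
            b_tokens = b.split()
            n, m = len(a_tokens), len(b_tokens)
            # contiguous token-window match, e.g. ["transport"] in ["lipid", "transport"]
            if any(b_tokens[i:i + n] == a_tokens for i in range(m - n + 1)):
                proposals.append((b, "related_to", a))
    return proposals
```

Token-level matching avoids spurious hits such as 'transport' inside 'transporter', which a raw substring test would produce; the proposals remain candidates for curator review, not asserted relations.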
When ontologies are represented with formal languages and defined in reference to formal upper-level ontologies, it becomes possible to validate existing relations among classes and to identify new relations. OWL, the Web Ontology Language, is often used to represent the concepts and the relations in ontologies. OWL is more expressive than XML, RDF, and RDF-S, because it contains additional features for describing properties and classes formally. Such features include equivalence and disjointness among classes, cardinality of relations (e.g., "exactly one"), characteristics of properties (e.g., symmetry), and enumerated classes. Using the formal semantics of the OWL language makes it possible to reason about these classes and their instances and to ensure the consistency of these ontologies.
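One of the simplest checks an OWL reasoner performs is that classes declared disjoint share no instances. The following toy sketch mimics that single check in plain Python, using invented class and instance names; it is in no way a substitute for a real description-logic reasoner, which also handles equivalence, cardinality, and property characteristics.

```python
# Invented example: in GO, a molecular function and a biological process are
# distinct kinds of entities, so the two classes are declared disjoint.
DISJOINT_PAIRS = {("MolecularFunction", "BiologicalProcess")}
INSTANCES = {
    "MolecularFunction": {"lipid transporter activity"},
    "BiologicalProcess": {"lipid transport"},
}

def check_disjointness(instances, disjoint_pairs):
    """Return (class_a, class_b, shared_instance) triples that violate disjointness."""
    violations = []
    for a, b in disjoint_pairs:
        for shared in instances.get(a, set()) & instances.get(b, set()):
            violations.append((a, b, shared))
    return violations
```

An empty result means the ontology fragment is consistent with respect to this axiom; any triple returned pinpoints an assertion a curator must revisit.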
Key to the analysis of omics data is the integration of prior knowledge. Of special interest are methods that include functional characteristics from the beginning of the data analysis process, integrate medical knowledge with biological knowledge, and combine mining techniques with inference-based knowledge processing.
The analysis of transcriptomic data is classically carried out in two steps. First, data are clustered according to gene expression levels in order to create three clusters: over-expressed, under-expressed and invariant. Only subsequently is functional information introduced in order to characterize the clusters "functionally". One limitation of this approach is that functional similarity does not contribute to the clustering process. Methods that include functional annotation from the beginning of the analysis have been proposed (e.g., ). These methods rely, for example, on semantic similarity measures among genes based on functional annotations .
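One way to include functional annotation from the beginning of the analysis is to build the clustering distance itself from both expression and annotation. The sketch below combines a Euclidean expression distance with a Jaccard overlap of GO annotation sets; the weighting scheme and the use of Jaccard overlap (as a simple stand-in for information-content-based semantic similarity) are illustrative choices, not a prescribed method.

```python
def expression_distance(profile_a, profile_b):
    """Euclidean distance between two gene expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(profile_a, profile_b)) ** 0.5

def annotation_similarity(terms_a, terms_b):
    """Jaccard overlap of GO annotation sets, in [0, 1]."""
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)

def combined_distance(gene_a, gene_b, alpha=0.5):
    """Convex combination of expression distance and functional dissimilarity,
    so that functional similarity shapes the clusters from the start."""
    d_expr = expression_distance(gene_a["profile"], gene_b["profile"])
    d_func = 1.0 - annotation_similarity(gene_a["go"], gene_b["go"])
    return alpha * d_expr + (1 - alpha) * d_func
```

With alpha = 1 this reduces to the classical expression-only clustering; lowering alpha lets shared functional annotations pull genes together even when their expression profiles differ.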
Moreover, besides gene expression, proteomic patterns, functional characteristics of genes and the medical features associated with a sample (e.g., phenotype, clinical history, environmental factors, experimental conditions) could contribute to the clustering process. Such characteristics can be represented as UMLS concepts , NCIT or SNOMED CT concepts [96, 97]. Once annotated to these ontologies, the datasets can be clustered in such a way that the annotations themselves participate in the clustering, along with the expression profiles of the genes. More generally, knowledge integration has been shown to increase the power of analysis in several genomic studies. Butte has developed an approach based on the UMLS , while other authors have integrated Entrez Gene and GO . Chabalier has proposed a method for integrating information from the KEGG pathway database and the GO annotation repository into a disease ontology .
Various data mining techniques have been applied to biomedical data analysis (e.g., , ). Among them, association rule mining, used widely in the area of market basket analysis, can be applied to the analysis of biological data as well. Based on the frequencies of co-occurrence between a gene G and a phenotype P, a typical rule would be: "if P is present, then G is present". Association rules can reveal biologically relevant associations between different genes or between environmental effects and gene expression profiles. The mining techniques may include negative rule generation (e.g., ) in addition to positive rule generation. Ideally, data mining techniques should be combined with inference-based knowledge processing. For example, the classification capabilities associated with ontologies may be used to aggregate annotations in order to improve the support and confidence values of association rules. More generally, knowledge bases and inference may help increase the power of data mining techniques.
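The support and confidence measures underlying such rules are simple co-occurrence statistics. The sketch below computes them for a rule of the form "if phenotype P is present, then gene G is over-expressed" over a handful of invented sample records; real rule miners such as Apriori additionally search the space of candidate itemsets efficiently.

```python
def support(records, items):
    """Fraction of records containing every item in `items`."""
    hits = sum(1 for record in records if items <= record)
    return hits / len(records)

def confidence(records, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent) / support(antecedent)."""
    return support(records, antecedent | consequent) / support(records, antecedent)

# Invented binary records: "P" = phenotype present, "G" = gene over-expressed.
records = [
    {"P", "G"},
    {"P", "G"},
    {"P"},
    {"G"},
    set(),
]
```

Here the rule "P -> G" holds in 2 of the 3 records where P appears, giving a confidence of 2/3 on a support of 0.4; aggregating annotations through an ontology (e.g., replacing several specific GO terms by a common ancestor) raises these counts and hence both measures.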
As biomedical research evolves from traditional clinical and biological research towards omics sciences and translational research, specific needs have emerged, including integrating data collected in research studies with patient clinical data, linking omics knowledge with medical knowledge, modeling the molecular basis of diseases, and developing tools that support in-depth analysis of research data. As such, translational research illustrates the need to bridge the gap between bioinformatics and medical informatics, and opens new avenues for biomedical informatics research.
This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM).
2 Omics is a generic term for new disciplines enabled by high-throughput technologies, such as genomics, transcriptomics, and proteomics.