With mounting availability of genomic and phenotypic databases, data integration and mining become increasingly challenging. While efforts have been put forward to analyze prokaryotic phenotypes, current computational technologies either lack high throughput capacity for genomic scale analysis, or are limited in their capability to integrate and mine data across different scales of biology. Consequently, simultaneous analysis of associations among genomes, phenotypes, and gene functions is prohibited. Here, we developed a high throughput computational approach, and demonstrated for the first time the feasibility of integrating large quantities of prokaryotic phenotypes along with genomic datasets for mining across multiple scales of biology (protein domains, pathways, molecular functions, and cellular processes). Applying this method over 59 fully sequenced prokaryotic species, we identified genetic basis and molecular mechanisms underlying the phenotypes in bacteria. We identified 3,711 significant correlations between 1,499 distinct Pfam and 63 phenotypes, with 2,650 correlations and 1,061 anti-correlations. Manual evaluation of a random sample of these significant correlations showed a minimal precision of 30% (95% confidence interval: 20%–42%; n = 50). We stratified the most significant 478 predictions and subjected 100 to manual evaluation, of which 60 were corroborated in the literature. We furthermore unveiled 10 significant correlations between phenotypes and KEGG pathways, eight of which were corroborated in the evaluation, and 309 significant correlations between phenotypes and 166 GO concepts evaluated using a random sample (minimal precision = 72%; 95% confidence interval: 60%–80%; n = 50). Additionally, we conducted a novel large-scale phenomic visualization analysis to provide insight into the modular nature of common molecular mechanisms spanning multiple biological scales and reused by related phenotypes (metaphenotypes). We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research.
A key challenge of the post-genomic era is to conceive large-scale studies of genomes and observable characteristics of organisms (phenotypes) and to interpret the data thus produced. The goal of this “phenomic” study is to improve our understanding of complex biological systems in terms of their molecular underpinnings. In this paper, Liu and colleagues present comprehensive computational and novel visualization methods for discovering biological knowledge spanning multiple scales of biology. The authors were able to predict and visualize new knowledge between clusters of microbiological phenotypes and their molecular mechanisms. To their knowledge, this is the first time this has been done. More specifically, the method integrates microbiological data with genomic-scale data from protein family databases, gene ontology, and biological pathways. Conducted over 59 fully sequenced bacteria, and including significantly more phenotypes than previous studies of its kind, this study enables a “systems biology” view across different classifications of genes and processes. This represents advancement over previous techniques, which are either limited in biological scale or analytical breadth. Visualization of the networks generated by this technique shows the common biological modules shared by related phenotypes. The results of this experiment demonstrate that the fusion of clinical data with genomic information is able to elucidate, in high throughput, a massive number of biological processes underlying phenotypes.