With the abundance of information and analysis results being collected for genetic loci, user-friendly and flexible data visualization approaches can inform and improve the analysis and dissemination of these data. A chromosomal ideogram is an idealized graphic representation of chromosomes. Ideograms can be combined with overlaid points, lines, and/or shapes, to provide summary information from studies of various kinds, such as genome-wide association studies or phenome-wide association studies, coupled with genomic location information. To facilitate visualizing varied data in multiple ways using ideograms, we have developed a flexible software tool called PhenoGram which exists as a web-based tool and also a command-line program.
With PhenoGram, researchers can create chromosomal ideograms annotated with colored lines at specific base-pair locations, or with colored regions spanning base-pair ranges, with or without other annotation. PhenoGram allows for annotation of chromosomal locations and/or regions with shapes in different colors, gene identifiers, or other text. PhenoGram also allows for the creation of plots showing expanded chromosomal locations, providing a way to show results for specific chromosomal regions in greater detail. We have used PhenoGram to produce a variety of plots, and provide these as examples herein. These plots include visualization of the genomic coverage of SNPs from a genotyping array, highlighting of the chromosomal coverage of imputed SNPs, copy-number-variation region coverage, as well as plots similar to the NHGRI GWA Catalog of genome-wide association results.
PhenoGram is a versatile, user-friendly software tool fostering the exploration and sharing of genomic information. Through visualization of data, researchers can both explore and share complex results, facilitating a greater understanding of these data.
Data visualization; Bioinformatics; Genome-wide association study; GWAS; Copy-number variants; CNV; SNP; Ideogram
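The central step such ideogram tools automate is scaling base-pair coordinates onto drawing coordinates for each chromosome track. The sketch below illustrates that mapping only; it is not PhenoGram's actual code, and the chromosome lengths, function names, and track height are illustrative assumptions.

```python
# Illustrative sketch: map (chromosome, base-pair position) annotations onto
# drawing coordinates for an ideogram. Chromosome sizes are approximate
# GRCh38 values; all names here are invented for the example.

CHROM_LENGTHS = {"1": 248_956_422, "2": 242_193_529}

def annotation_y(chrom: str, position: int, track_height: float = 500.0) -> float:
    """Scale a base-pair position to a vertical offset on a drawn chromosome."""
    length = CHROM_LENGTHS[chrom]
    if not 0 <= position <= length:
        raise ValueError("position outside chromosome")
    return track_height * position / length

def place_annotations(annots, track_height: float = 500.0):
    """annots: iterable of (chrom, position, label) -> list of (track_index, y, label)."""
    order = {c: i for i, c in enumerate(sorted(CHROM_LENGTHS, key=int))}
    return [(order[c], annotation_y(c, p, track_height), lab)
            for c, p, lab in annots]
```

A plotting library would then draw each chromosome as a rounded rectangle and place shapes or text at the returned coordinates.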
Metabolic phenotyping has become an important ‘bird's-eye-view’ technology which can be applied to higher organisms, such as model plant and animal systems in the post-genomics and proteomics era. Although genotyping technology has expanded greatly over the past decade, metabolic phenotyping has languished due to the difficulty of ‘top-down’ chemical analyses. Here, we describe a systematic NMR methodology for stable isotope labeling and analysis of metabolite mixtures in plant and animal systems.
The analysis method includes a stable isotope labeling technique for use in living organisms; a systematic method for simultaneously identifying a large number of metabolites using a newly developed HSQC-based metabolite chemical shift database combined with heteronuclear multidimensional NMR spectroscopy; Principal Component Analysis; and a visualization method using a coarse-grained overview of the metabolic system. The database contains more than 1000 1H and 13C chemical shifts corresponding to 142 metabolites measured under identical physicochemical conditions. Using the stable isotope labeling technique, we systematically detected >450 HSQC peaks in each 13C-HSQC spectrum derived from the model plant system, Arabidopsis T87 cultured cells, and the invertebrate animal model Bombyx mori. Furthermore, for the first time, efficient 13C labeling has allowed reliable signal assignment using experiments such as 3D HCCH-COSY in extracts of higher organisms.
Overall physiological changes could be detected and categorized in relation to a critical developmental phase change in B. mori by coarse-grained representations in which the organization of metabolic pathways related to a specific developmental phase was visualized on the basis of constituent changes of 56 identified metabolites. Based on the observed intensities of 13C atoms of given metabolites on development-dependent changes in the 56 identified 13C-HSQC signals, we have determined the changes in metabolic networks that are associated with energy and nitrogen metabolism.
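The multivariate step named above, Principal Component Analysis over a samples-by-peaks intensity matrix, can be sketched with a short numpy-only example. The data here are synthetic and the function is a generic textbook PCA, not the authors' pipeline; peak picking, assignment, and normalization are assumed to have happened upstream.

```python
# Minimal PCA sketch over a matrix of HSQC peak intensities
# (rows = samples, columns = metabolite peaks). Synthetic data.
import numpy as np

def pca_scores(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project mean-centered rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix: rows of Vt are the component loadings.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
intensities = rng.lognormal(mean=2.0, sigma=0.5, size=(12, 450))  # 12 samples, 450 peaks
scores = pca_scores(intensities)  # 2-D coordinates for plotting sample groupings
```

Plotting the score coordinates, colored by developmental phase, is the usual way such phase-related physiological groupings are visualized.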
The increasing availability and diversity of omics data in the post-genomic era offers new perspectives in most areas of biomedical research. Graph-based biological network models capture the topology of the functional relationships between molecular entities such as genes, proteins and small compounds, and provide a suitable framework for integrating and analyzing omics data. The development of software tools capable of integrating data from different sources and providing flexible methods to reconstruct, represent and analyze topological networks is an active field of research in bioinformatics.
BisoGenet is a multi-tier application for visualization and analysis of biomolecular relationships. The system consists of three tiers. In the data tier, an in-house database stores genomics information, protein-protein interactions, protein-DNA interactions, gene ontology and metabolic pathways. In the middle tier, a global network is created at server startup, representing the whole data on bioentities and their relationships retrieved from the database. The client tier is a Cytoscape plugin, which manages user input, communication with the Web Service, visualization and analysis of the resulting network.
BisoGenet is able to build and visualize biological networks in a fast and user-friendly manner. A feature of BisoGenet is the possibility to include coding relations to distinguish between genes and their products. This feature could be instrumental in achieving a finer-grained representation of the bioentities and their relationships. The client application includes network analysis tools and interactive network expansion capabilities. In addition, an option is provided to allow networks from other sources to be converted to BisoGenet networks. This feature facilitates the integration of our software with other tools available in the Cytoscape platform. BisoGenet is available at http://bio.cigb.edu.cu/bisogenet-cytoscape/.
Phenotype analysis is commonly recognized to be of great importance for gaining insight into the genetic interactions underlying inherited diseases. However, few computational contributions have been proposed for this purpose, mainly owing to the lack of controlled clinical information easily accessible and structured for computational genome-wide analyses. We developed and made available through the GFINDer web server an original approach for the analysis of genetic disorder related genes by exploiting the information on genetic diseases and their clinical phenotypes present in textual form within the Online Mendelian Inheritance in Man (OMIM) database. Because several synonyms for the same name and different names for overlapping concepts are often used in OMIM, we first normalized phenotype location descriptions, reducing them to a list of unique controlled terms representing phenotype location categories. Then, we hierarchically structured them and the corresponding genetic diseases according to their topology and granularity of description, respectively. Thus, in GFINDer we could implement specific Genetic Disorders modules for the analysis of these structured data. These modules automatically annotate user-classified gene lists with updated disease and clinical information, classify them according to genetic syndrome and phenotype location categories, and statistically identify the most relevant categories in each gene class. GFINDer is available for non-profit use at .
The evolving complexity of genome-scale experiments has increasingly centralized the role of a highly computable, accurate, and comprehensive resource spanning multiple biological scales and viewpoints. To provide a resource to meet this need, we have significantly extended the PhenoGO database with gene-disease-specific annotations and included an additional ten species. This computationally derived resource is primarily intended to provide phenotypic context (cell type, tissue, organ, and disease) for mining existing associations between gene products and GO terms specified in the Gene Ontology databases. Automated natural language processing (BioMedLEE) and computational ontology (PhenOS) methods were used to derive these relationships from the literature, expanding the database with information from ten additional species to include over 600,000 phenotypic contexts spanning eleven species from five GO annotation databases. A comprehensive evaluation of the mappings (n = 300) found precision (positive predictive value) of 85% and recall (sensitivity) of 76%. Phenotypes are encoded in general-purpose ontologies such as the Cell Ontology and the Unified Medical Language System, and in specialized ontologies such as the Mouse Anatomy and the Mammalian Phenotype Ontology. A web portal has also been developed, allowing advanced filtering and querying of the database as well as download of the entire dataset.
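The precision and recall figures above are the standard set-overlap computations against a manually curated gold standard. A minimal illustration, with invented mapping identifiers:

```python
# Precision (positive predictive value) and recall (sensitivity) of predicted
# phenotype mappings versus a gold standard. Identifiers are illustrative.

def precision_recall(predicted: set, gold: set) -> tuple:
    tp = len(predicted & gold)  # true positives: predictions confirmed by the gold standard
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall({"m1", "m2", "m3", "m4"}, {"m1", "m2", "m3", "m5", "m6"})
# 3 of 4 predictions are correct; 3 of 5 gold mappings are recovered
```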
Linking phenotypic with genotypic diversity has become a major requirement for basic and applied genome-centric biological research. To meet this need, a comprehensive database backend for efficiently storing, querying and analyzing large experimental data sets is necessary. Chado, a generic, modular, community-based database schema, is widely used in the biological community to store information associated with genome sequence data. To meet the need to also accommodate large-scale phenotyping and genotyping projects, a new Chado module called Natural Diversity has been developed. The module strictly adheres to the Chado remit of being generic and ontology driven. The flexibility of the new module is demonstrated in its capacity to store any type of experiment that either uses or generates specimens or stock organisms. Experiments may be grouped or structured hierarchically, and any kind of biological entity can be stored as the observed unit, from a specimen to be used in genotyping or phenotyping experiments, to a group of species collected in the field that will undergo further lab analysis. We describe details of the Natural Diversity module, including the design approach, the relational schema and use cases implemented in several databases.
High-quality manual annotation methods and practices need to be scaled to the increased rate of genomic data production. Curation based on gene families and gene networks is one approach that can significantly increase both curation efficiency and quality. The Sol Genomics Network (SGN; http://solgenomics.net) is a comparative genomics platform, with genetic, genomic and phenotypic information of the Solanaceae family and its closely related species that incorporates a community-based gene and phenotype curation system. In this article, we describe a manual curation system for gene families aimed at facilitating curation, querying and visualization of gene interaction patterns underlying complex biological processes, including an interface for efficiently capturing information from experiments with large data sets reported in the literature. Well-annotated multigene families are useful for further exploration of genome organization and gene evolution across species. As an example, we illustrate the system with the multigene transcription factor families, WRKY and Small Auxin Up-regulated RNA (SAUR), which both play important roles in responding to abiotic stresses in plants.
The Rat Genome Database (RGD) is the premier repository of rat genomic and genetic data and currently houses over 40,000 rat gene records as well as human and mouse orthologs, 1,771 rat and 1,911 human quantitative trait loci (QTLs) and 2,209 rat strains. Biological information curated for these data objects includes disease associations, phenotypes, pathways, molecular functions, biological processes and cellular components. A suite of tools has been developed to aid curators in acquiring and validating data objects, assigning nomenclature, attaching biological information to objects and making connections among data types. The software used to assign nomenclature, to create and edit objects and to make annotations to the data objects has been specifically designed to make the curation process as fast and efficient as possible. The user interfaces have been adapted to the work routines of the curators, creating a suite of tools that is intuitive and powerful.
Database URL: http://rgd.mcw.edu
The Rat Genome Database (RGD, ) is one of the core resources for rat genomics and recent developments have focused on providing support for disease-based research using the rat model. Recognizing the importance of the rat as a disease model we have employed targeted curation strategies to curate genes, QTL and strain data for neurological and cardiovascular disease areas. This work has centered on rat but also includes data for mouse and human to create ‘disease portals’ that provide a unified view of the genes, QTL and strain models for these diseases across the three species. The disease curation efforts combined with normal curation activities have served to greatly increase the content of the database, particularly for biological information, including gene ontology, disease, pathway and phenotype ontology annotations. In addition to improving the features and database content, community outreach has been expanded to demonstrate how investigators can leverage the resources at RGD to facilitate their research and to elicit suggestions and needs for future developments. We have published a number of papers that provide additional information on the ontology annotations and the tools at RGD for data mining and analysis to better enable researchers to fully utilize the database.
Recent advances in the speed and sensitivity of mass spectrometers and in analytical methods, the exponential acceleration of computer processing speeds, and the availability of genomic databases from an array of species and protein information databases have led to a deluge of proteomic data. The development of a lab-based automated proteomic software platform for the automated collection, processing, storage, and visualization of expansive proteomic datasets is critically important. The high-throughput autonomous proteomic pipeline (HTAPP) described here is designed from the ground up to provide critically important flexibility for diverse proteomic workflows and to streamline the total analysis of a complex proteomic sample. This tool comprises software that controls the acquisition of mass spectral data along with automation of post-acquisition tasks such as peptide quantification, clustered MS/MS spectral database searching, statistical validation, and data exploration within a user-configurable lab-based relational database. The software design of HTAPP focuses on accommodating diverse workflows and providing missing software functionality to a wide range of proteomic researchers to accelerate the extraction of biological meaning from immense proteomic data sets. Although individual software modules in our integrated technology platform may have some similarities to existing tools, the true novelty of the approach described here is in the synergistic and flexible combination of these tools to provide an integrated and efficient analysis of proteomic samples.
Automation; LIMS; MS/MS database search; Peptide analysis; Relational database
Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post-hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.
supervised learning; classification; data exploration; machine learning; data mining
Aging and age-related disease account for a substantial share of current natural, social and behavioral science research efforts. Presently, no centralized system exists for tracking aging research projects across the numerous research disciplines involved. The multidisciplinary nature of this research complicates the understanding of underlying project categories, the establishment of project relations, and the development of a unified project classification scheme. We have developed a highly visual database, the International Aging Research Portfolio (IARP), available at AgingPortfolio.org, to address this issue. The database integrates information on research grants, peer-reviewed publications, and issued patent applications from multiple sources. Additionally, the database uses flexible project classification mechanisms and tools for analyzing project associations and trends. This system enables scientists to search the centralized project database, to classify and categorize aging projects, and to analyze funding across multiple research disciplines. The IARP is designed to provide improved allocation and prioritization of scarce research funding, and to reduce project overlap and improve scientific collaboration, thereby accelerating scientific and medical progress in a rapidly growing area of research. Grant applications often precede publications, and some grants do not result in publications; this system therefore provides an earlier and broader view of research activity in many disciplines. This project is a first attempt to provide a centralized database system for research grants and to categorize aging research projects into multiple subcategories utilizing both machine learning algorithms and a hierarchical environment for scientific collaboration.
Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.
Results: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder—a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.
Availability: Freely available at: http://bioinf.umbc.edu/EMU/ftp.
Supplementary information: Supplementary data are available at Bioinformatics online.
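Tools like MutationFinder, and the pattern-matching stage of approaches like EMU, typically find point-mutation mentions ("R273H", "Arg175His") with regular expressions over standard mutation nomenclature. The sketch below shows only that first stage, without EMU's sequence-analysis filter; the pattern is a simplified illustration, not either tool's actual implementation.

```python
# Simplified regex sketch for point-mutation mentions in text:
# one-letter form (e.g., "R273H") or three-letter form (e.g., "Arg175His").
import re

AA1 = "ACDEFGHIKLMNPQRSTVWY"  # one-letter amino acid codes
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val")
POINT_MUT = re.compile(rf"\b(?:[{AA1}]\d+[{AA1}]|(?:{AA3})\d+(?:{AA3}))\b")

def find_mutations(text: str) -> list:
    """Return candidate mutation mentions; real tools filter these further."""
    return POINT_MUT.findall(text)
```

A sequence-based filter would then check each candidate's wild-type residue against the protein sequence of the associated gene, which is the step the abstract credits with raising precision.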
The advent of genomic and proteomic technologies in this post-genomic era has urged researchers to develop novel research strategies against cancer by targeting human genes, which would greatly facilitate the identification of more promising treatments and the development of accurate early diagnosis for cancer. To harness the power of cancer genetic information towards better treatment, we have developed a cancer gene database called CanGeneBase (CGB). It is a comprehensive collection of cancer-related genes, intended to give researchers a single platform for exclusive information on the genes of their interest. According to the Cancer Gene Data Curation Project, about 4,700 genes have been identified as being related to cancer. The present CanGeneBase covers about 12 different types of cancer and includes 190 unique gene entries. Each entry encompasses about 33 useful parameters providing detailed information about the specific gene. CanGeneBase is made in such a way that it can be easily accessed by either gene symbol or type of cancer.
The database is freely available at http://22.214.171.124/bioinfo/cancerdb/
Cancer; database; drug list; oncogenes; tumour suppressor genes; cancer types; target; molecular descriptors
The GDB Human Genome Data Base refers collectively to GDB and OMIM, Online Mendelian Inheritance in Man. GDB and OMIM are linked databases that provide an international repository for information generated by the Human Genome Initiative. GDB contains human gene mapping data, while OMIM offers the text of Dr. Victor A. McKusick's catalog of genetic disease and phenotype descriptions. These databases, updated and edited continuously, integrate bibliographic and full-text information with several types of mapping data. They are accessible through a flexible interface and are available through SprintNet and the Internet to the scientific community without cost. This paper provides an overview of the context, development, structure, content, and use of these databases.
Whole genome analysis, now including whole genome sequencing, is moving rapidly into the clinical setting, leading to detection of human variation on a broader scale than ever before. Interpreting this information will depend on the availability of thorough and accurate phenotype information, and the ability to curate, store, and access data on genotype-phenotype relationships. This idea has already been demonstrated within the context of chromosome microarray (CMA) testing. The International Standards for Cytogenomic Arrays (ISCA) Consortium promotes standardization of variant interpretation for this technology through its initiatives, including the formation of a publicly available database housing clinical CMA data. Recognizing that phenotypic data is essential for the interpretation of genomic variants, the ISCA Consortium has developed tools to facilitate the collection of this data and its deposition in a standardized, structured format within the ISCA Consortium database. This rich source of phenotypic data can also be used within broader applications, such as developing phenotypic profiles of emerging genomic disorders, the identification of candidate regions for particular phenotypes, or the creation of tools for use in clinical practice. We summarize the ISCA experience as a model for ongoing efforts incorporating phenotype data with genotype data to improve the quality of research and clinical care in human genetics.
Genotype-Phenotype Correlation; copy number variation; CNV; Oligonucleotide array sequence analysis; Cytogenetics
The Rat Genome Database (RGD, http://rgd.mcw.edu) was developed to provide a core resource for rat researchers combining genetic, genomic, pathway, phenotype and strain information with a focus on disease. RGD users are provided with access to structured and curated data from the molecular level through to the level of the whole organism, including the variations associated with disease phenotypes. To fully support use of the rat as a translational model for biological systems and human disease, RGD continues to curate these datasets while enhancing and developing tools to allow efficient and effective access to the data in a variety of formats including linear genome viewers, pathway diagrams and biological ontologies. To support pathophysiological analysis of data, RGD Disease Portals provide an entryway to integrated gene, QTL and strain data specific to a particular disease. In addition to tool and content development and maintenance, RGD promotes rat research and provides user education by creating and disseminating tutorials on the curated datasets, submission processes, and tools available at RGD. By curating, storing, integrating, visualizing and promoting rat data, RGD ensures that the investment made into rat genomics and genetics can be leveraged by all interested investigators.
The OMIM database is a tool used daily by geneticists. Syndrome pages include a Clinical Synopsis section containing a list of known phenotypes comprising a clinical syndrome. The phenotypes are in free text and different phrases are often used to describe the same phenotype, the differences originating in spelling variations or typing errors, varying sentence structures and terminological variants.
These variations hinder searching for syndromes or using the large amount of phenotypic information for research purposes. In addition, negation forms also create false positives when searching the textual description of phenotypes and induce noise in text mining applications.
Our method allows efficient and complete search of OMIM phenotypes as well as improved data mining of the OMIM phenome. Using natural language processing, each phrase is tagged with additional semantic information from UMLS and MeSH. Using a grammar-based method, annotated phrases are clustered into groups denoting similar phenotypes. These groups of synonymous expressions enable precise search, as query terms can be matched with the many variations that appear in OMIM, while avoiding over-matching expressions that include the query term in a negative context. On the basis of these clusters, we computed pair-wise similarity among syndromes in OMIM. Using this new similarity measure, we identified 79,770 new connections between syndromes, an average of 16 new connections per syndrome. Our project is Web-based and available at http://fohs.bgu.ac.il/s2g/csiomim
The resulting enhanced search functionality provides clinicians with an efficient tool for diagnosis. This search application is also used for finding similar syndromes for the candidate gene prioritization tool S2G.
The enhanced OMIM database we produced can be further used for bioinformatics purposes such as linking phenotypes and genes based on syndrome similarities and the known genes in Morbidmap.
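Once phenotype phrases are clustered into groups of synonymous expressions, each syndrome can be represented as a set of cluster identifiers and compared pairwise. The abstract does not specify the similarity formula, so the sketch below uses a Jaccard similarity as one common, minimal stand-in; the phenotype terms are invented.

```python
# Hedged sketch: pairwise syndrome similarity over sets of phenotype-cluster
# identifiers, using Jaccard similarity (an assumed stand-in for the paper's
# actual measure). Terms are illustrative.

def syndrome_similarity(clusters_a: set, clusters_b: set) -> float:
    """Jaccard similarity: shared clusters / total distinct clusters."""
    if not clusters_a and not clusters_b:
        return 0.0
    return len(clusters_a & clusters_b) / len(clusters_a | clusters_b)

sim = syndrome_similarity({"short stature", "cleft palate", "scoliosis"},
                          {"cleft palate", "scoliosis", "micrognathia"})
```

Thresholding such pairwise scores across all syndrome pairs is what yields new syndrome-syndrome connections of the kind reported above.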
Renewed interest in plant × environment interactions has arisen in the post-genomic era. In this context, high-throughput phenotyping platforms have been developed to create reproducible environmental scenarios in which the phenotypic responses of multiple genotypes can be analysed in a reproducible way. These platforms benefit hugely from the development of suitable databases for storage, sharing and analysis of the large amount of data collected. In the model plant Arabidopsis thaliana, most databases available to the scientific community contain data related to genetics and molecular biology and are characterised by inadequate description of plant developmental stages and of experimental metadata such as environmental conditions. Our goal was to develop a comprehensive information system for sharing the data collected in PHENOPSIS, an automated platform for Arabidopsis thaliana phenotyping, with the scientific community.
PHENOPSIS DB is a publicly available (URL: http://bioweb.supagro.inra.fr/phenopsis/) information system developed for storage, browsing and sharing of online data generated by the PHENOPSIS platform and offline data collected by experimenters and experimental metadata. It provides modules coupled to a Web interface for (i) the visualisation of environmental data of an experiment, (ii) the visualisation and statistical analysis of phenotypic data, and (iii) the analysis of Arabidopsis thaliana plant images.
Firstly, data stored in the PHENOPSIS DB are of interest to the Arabidopsis thaliana community, particularly in allowing phenotypic meta-analyses directly linked to environmental conditions, for which publications are still scarce. Secondly, data or image analysis modules can be downloaded from the Web interface for direct usage or as the basis for modifications according to new requirements. Finally, the structure of PHENOPSIS DB provides a useful template for the development of other similar databases related to genotype × environment interactions.
With the availability of whole-genome sequences in many species, linkage analysis, positional cloning, and microarray analysis are gradually becoming powerful tools for investigating the links between phenotype and genotype or genes. However, in these methods, causative genes underlying a quantitative trait locus or a disease are usually located within a large genomic region or a large set of genes. Examining the function of every gene is very time-consuming and requires retrieving and integrating information from multiple databases or genome resources. PGMapper is a software tool for automatically matching phenotype to genes from a defined genome region or a group of given genes by combining mapping information from the Ensembl database and gene function information from the OMIM and PubMed databases. PGMapper is currently available for candidate gene search in human, mouse, rat, zebrafish, and 12 other species.
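Reduced to its core idea, this kind of candidate-gene matching is a keyword search of phenotype terms against free-text gene function summaries. The toy sketch below assumes the summaries have already been fetched (the real tool queries Ensembl, OMIM and PubMed); gene symbols and summary text are invented.

```python
# Toy sketch of phenotype-to-gene matching over pre-fetched function
# summaries. Gene symbols and summaries are illustrative.

def match_phenotype(genes: dict, keywords) -> list:
    """genes: {symbol: free-text summary}; return symbols whose summary
    mentions any phenotype keyword (case-insensitive substring match)."""
    kws = [k.lower() for k in keywords]
    return sorted(sym for sym, text in genes.items()
                  if any(k in text.lower() for k in kws))

candidates = match_phenotype(
    {"Lepr": "obesity and leptin receptor signaling",
     "Myh6": "cardiac muscle myosin heavy chain"},
    ["obesity"])
```

Real tools refine this with synonym expansion and ranking, since plain substring matching misses terminological variants.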
The Clinical Measurement Ontology (CMO), Measurement Method Ontology (MMO), and Experimental Condition Ontology (XCO) were originally developed at the Rat Genome Database (RGD) to standardize quantitative rat phenotype data in order to integrate results from multiple studies into the PhenoMiner database and data mining tool. These ontologies provide the framework for presenting what was measured, how it was measured, and under what conditions it was measured.
There has been a continuing expansion of subdomains in each ontology with a parallel two- to three-fold increase in the total number of terms, substantially increasing the size and improving the scope of the ontologies. The proportion of terms with textual definitions has increased from ~60% to over 80% with greater synchronization of format and content throughout the three ontologies. Representation of definition source Uniform Resource Identifiers (URIs) has been standardized, including the removal of all non-URI characters, and systematic versioning of all ontology files has been implemented. The continued expansion and success of these ontologies has facilitated the integration of more than 60,000 records into the RGD PhenoMiner database. In addition, new applications of these ontologies, such as annotation of Quantitative Trait Loci (QTL), have been added at the sites actively using them, including RGD and the Animal QTL Database.
The improvements to these three ontologies have been substantial, and development is ongoing. New terms and expansions to the ontologies continue to be added as a result of active curation efforts at RGD and the Animal QTL database. Use of these vocabularies to standardize data representation for quantitative phenotypes and quantitative trait loci across databases for multiple species has demonstrated their utility for integrating diverse data types from multiple sources. These ontologies are freely available for download and use from the NCBO BioPortal website at http://bioportal.bioontology.org/ontologies/1583 (CMO), http://bioportal.bioontology.org/ontologies/1584 (MMO), and http://bioportal.bioontology.org/ontologies/1585 (XCO), or from the RGD ftp site at ftp://rgd.mcw.edu/pub/ontology/.
As proteomic data sets increase in size and complexity, the need grows for database-centric software systems able to organize, compare, and visualize all the proteomic experiments in a lab. We recently developed an integrated platform called the high-throughput autonomous proteomic pipeline (HTAPP) for the automated acquisition and processing of quantitative proteomic data and the integration of proteomic results with external protein information resources within a lab-based relational database called PeptideDepot. Here, we introduce the peptide validation software component of this system, which combines relational database-integrated electronic manual spectral annotation in Java with a new software tool in the R programming language for generating logistic regression spectral models from user-supplied validated data sets and flexibly applying these user-generated models in automated proteomic workflows. This logistic regression spectral model uses both variables computed directly from SEQUEST output and deterministic variables based on expert manual-validation criteria of spectral quality. For linear quadrupole ion trap (LTQ) or LTQ-FTICR LC/MS data, our logistic spectral model outperformed both XCorr (242% more peptides identified on average) and the X!Tandem E-value (87% more peptides identified on average) at a 1% false discovery rate estimated by a decoy database approach.
Decoy database; Logistic regression model; SEQUEST; Software; Spectral validation
The ability to rapidly characterize an unknown microorganism is critical both for responding to infectious disease and for biodefense. To do this, we need a way to anticipate an organism's phenotype from the molecules encoded by its genome. However, the link between molecular composition (i.e., genotype) and phenotype is not obvious for microbes. While several studies have addressed this challenge, none has yet proposed a large-scale method that integrates curated biological information. Here we use a systematic approach to discover genotype-phenotype associations, combining phenotypic information from a biomedical informatics database, GIDEON, with the molecular information contained in the National Center for Biotechnology Information's Clusters of Orthologous Groups database (NCBI COGs).
By integrating the information in the two databases, we are able to correlate the presence or absence of a given protein in a microbe with its phenotype, as measured by certain morphological characteristics or survival in a particular growth medium. At a correlation score threshold of 0.8, 66% of the associations found were confirmed by the literature; at a threshold of 0.9, 86% were positively verified.
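A correlation score between a protein's presence/absence profile and a binary phenotype across a panel of microbes can be computed as the Pearson correlation of two 0/1 vectors (the phi coefficient). A minimal sketch with hypothetical vectors (access to the GIDEON and NCBI COG data is not shown):

```python
import math

def phi(presence, phenotype):
    """Pearson correlation (phi coefficient) of two equal-length 0/1 vectors."""
    n = len(presence)
    mx = sum(presence) / n
    my = sum(phenotype) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(presence, phenotype))
    vx = sum((x - mx) ** 2 for x in presence)
    vy = sum((y - my) ** 2 for y in phenotype)
    return cov / math.sqrt(vx * vy)

# Hypothetical: presence of one COG across six microbes vs. one phenotype.
cog_presence = [1, 1, 1, 0, 0, 0]
phenotype    = [1, 1, 1, 0, 0, 0]
score = phi(cog_presence, phenotype)
print(score >= 0.8)  # would this association pass the 0.8 threshold?
```

Screening every COG against every phenotype is then a double loop over such vectors, keeping pairs whose |phi| exceeds the chosen threshold.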
Our results suggest possible phenotypic manifestations for proteins biochemically associated with sugar metabolism and electron transport. Moreover, we believe our approach can be extended to linking pathogenic phenotypes with functionally related proteins.
Next-generation sequencing (NGS) offers a promising opportunity to identify disease-causing genes and thereby improve the diagnosis of disease. Beyond previous efforts in NGS data alignment, variant detection, and visualization, deciphering the human genome requires a comprehensive annotation system supported by multiple layers of disease phenotype-related databases.
AnsNGS (Annotation system of sequence variations for next-generation sequencing data) is a tool for contextualizing disease-related variants and examining their functional consequences. AnsNGS integrates a variety of annotation databases to attain multiple levels of annotation.
AnsNGS assigns biological functions to variants and provides gene- (or disease-)centric queries for finding disease-causing variants. AnsNGS also connects genes harbouring variants with the corresponding expression probes for downstream analysis using expression microarrays. Here, we demonstrate its ability to identify disease-related variants in the human genome.
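A gene-centric query of this kind amounts to joining called variants against an annotation table keyed by gene symbol. A minimal sketch with hypothetical records (AnsNGS's actual schema, databases, and probe mappings are not described in this abstract; the probe IDs below are made up):

```python
# Hypothetical annotation table; the real tool draws on multiple databases.
gene_annotations = {
    "BRCA1": {"disease": "breast cancer", "probe": "probe_001"},
    "CFTR":  {"disease": "cystic fibrosis", "probe": "probe_002"},
}

# Hypothetical variant calls from an NGS run.
variants = [
    {"id": "var1", "gene": "BRCA1", "change": "c.68_69delAG"},
    {"id": "var2", "gene": "UNANNOTATED_GENE", "change": "c.100A>G"},
]

def annotate(variants, table):
    """Attach disease and expression-probe annotation to each variant."""
    out = []
    for v in variants:
        ann = table.get(v["gene"], {})  # empty dict when the gene is unknown
        out.append({**v, **ann})
    return out

for row in annotate(variants, gene_annotations):
    print(row)
```

Variants whose genes have no annotation pass through unchanged, so downstream filters can separate known disease genes from novel candidates.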
AnsNGS can give key insight into which of these variants is already known to be involved in a disease-related phenotype or is located in or near a known regulatory site. AnsNGS is available free of charge to academic users and can be obtained from http://snubi.org/software/AnsNGS/.
High-Throughput Nucleotide Sequencing; DNA Sequence Analysis; Molecular Sequence Annotation; Genome Structural Variation; Disease
The DECIPHER database (https://decipher.sanger.ac.uk/) is an accessible online repository of genetic variation with associated phenotypes that facilitates the identification and interpretation of pathogenic genetic variation in patients with rare disorders. Contributing to DECIPHER is an international consortium of >200 academic clinical centres of genetic medicine and ≥1600 clinical geneticists and diagnostic laboratory scientists. Information integrated from a variety of bioinformatics resources, coupled with visualization tools, provides a comprehensive set of tools to identify other patients with similar genotype–phenotype characteristics and highlights potentially pathogenic genes. In a significant development, we have extended DECIPHER from a database of just copy-number variants to allow upload, annotation and analysis of sequence variants such as single nucleotide variants (SNVs) and InDels. Other notable developments in DECIPHER include a purpose-built, customizable and interactive genome browser to aid combined visualization and interpretation of sequence and copy-number variation against informative datasets of pathogenic and population variation. We have also introduced several new features to our deposition and analysis interface. This article provides an update to the DECIPHER database, an earlier instance of which has been described elsewhere [Swaminathan et al. (2012) DECIPHER: web-based, community resource for clinical interpretation of rare variants in developmental disorders. Hum. Mol. Genet., 21, R37–R44].