In mammalian circadian clockwork, the CLOCK-BMAL1 complex binds to DNA enhancers of target genes and drives circadian oscillation of transcription. Here we identified 7,978 CLOCK-binding sites in mouse liver by chromatin immunoprecipitation-sequencing (ChIP-Seq), and a newly developed bioinformatics method, motif centrality analysis of ChIP-Seq (MOCCS), revealed a genome-wide distribution of previously unappreciated noncanonical E-boxes targeted by CLOCK. In vitro promoter assays showed that CACGNG, CACGTT, and CATG(T/C)G are functional CLOCK-binding motifs. Furthermore, we comprehensively identified rhythmically expressed genes by poly(A)-tailed RNA-Seq and found 1,629 CLOCK target genes among the 11,926 genes expressed in the liver. Our analysis also revealed rhythmically expressed genes with no apparent CLOCK-binding site, indicating the importance of indirect transcriptional and posttranscriptional regulation. Indirect transcriptional regulation is represented by the rhythmic expression of CLOCK-regulated transcription factors, such as Krüppel-like factors (KLFs). Indirect posttranscriptional regulation involves rhythmic microRNAs, which we identified by small-RNA-Seq. Collectively, CLOCK-dependent direct transactivation through multiple E-boxes and indirect regulation polyphonically orchestrate dynamic circadian outputs.
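As an illustration, motifs like those reported above can be located in a sequence with simple regular expressions. The example sequence and the expansion of the degenerate N position to [ACGT] are ours, not taken from the study:

```python
import re

# Hypothetical illustration: scanning a promoter sequence for the canonical
# E-box (CACGTG) and the noncanonical CLOCK-binding motifs named above.
# The example sequence is invented.
MOTIFS = {
    "canonical E-box": re.compile(r"CACGTG"),
    "CACGNG": re.compile(r"CACG[ACGT]G"),
    "CACGTT": re.compile(r"CACGTT"),
    "CATG(T/C)G": re.compile(r"CATG[TC]G"),
}

def scan_motifs(seq):
    """Return {motif_name: [0-based start positions]} for all hits in seq."""
    seq = seq.upper()
    return {name: [m.start() for m in pat.finditer(seq)]
            for name, pat in MOTIFS.items()}

# Note that a canonical E-box also matches the degenerate CACGNG pattern.
hits = scan_motifs("ttCACGTGaaCACGTTggCATGTGcc")
```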
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
BioHackathon; Bioinformatics; Semantic Web; Web services; Ontology; Visualization; Knowledge representation; Databases; Semantic interoperability; Data models; Data sharing; Data integration
The DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp) maintains and provides archival, retrieval and analytical resources for biological information. This database content is shared with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). DDBJ launched a new nucleotide sequence submission system for receiving traditional nucleotide sequence data. We expect that the new submission system will help many submitters enter accurate annotation and reduce the time needed for data input. In addition, DDBJ has started a new service, the Japanese Genotype–phenotype Archive (JGA), with our partner institute, the National Bioscience Database Center (NBDC). JGA permanently archives and shares all types of individual human genetic and phenotypic data. We also introduce improvements in the DDBJ services and databases made during the past year.
High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biological research. However, the immense amount of sequence data requires computational skills and hardware resources that are a challenge for molecular biologists. The DNA Data Bank of Japan (DDBJ) of the National Institute of Genetics (NIG) has initiated a cloud computing-based analytical pipeline, the DDBJ Read Annotation Pipeline (DDBJ Pipeline), for high-throughput annotation of NGS reads. The DDBJ Pipeline offers a user-friendly graphical web interface and processes massive NGS datasets using distributed processing on NIG supercomputers, currently free of charge. The pipeline consists of two analysis components: basic analysis for reference genome mapping and de novo assembly, and subsequent high-level analysis of structural and functional annotations. Users may smoothly switch between the two components, facilitating web-based operations on a supercomputer for high-throughput data analysis. Moreover, public NGS reads in the DDBJ Sequence Read Archive, located on the same supercomputer, can be imported into the pipeline simply by entering an accession number. The pipeline will facilitate research by providing unified analytical workflows for NGS data. The DDBJ Pipeline is accessible at http://p.ddbj.nig.ac.jp/.
next-generation sequencing; sequence read archive; cloud computing; analytical pipeline; genome analysis
BioHackathon 2010 was the third in a series of meetings hosted by the Database Center for Life Sciences (DBCLS) in Tokyo, Japan. The overall goal of the BioHackathon series is to improve the quality and accessibility of life science research data on the Web by bringing together representatives from public databases, analytical tool providers, and cyber-infrastructure researchers to jointly tackle important challenges in the area of in silico biological research.
The theme of BioHackathon 2010 was the 'Semantic Web', and all attendees gathered with the shared goal of producing Semantic Web data from their respective resources, and/or consuming or interacting with those data using their tools and interfaces. We discussed topics including guidelines for designing semantic data and the interoperability of resources, and developed tools and clients for analysis and visualization.
We provide a meeting report from BioHackathon 2010, in which we describe the discussions, decisions, and breakthroughs made as we moved towards compliance with Semantic Web technologies, from source provider, through middleware, to the end consumer.
BioHackathon; Open source; Software; Semantic Web; Databases; Data integration; Data visualization; Web services; Interfaces
The overwhelming amount of network data in functional genomics makes its visualization cluttered with jumbled nodes and edges. Such cluttered network visualization, known as a "hair-ball", significantly hinders researchers' interpretation and analysis of the data. Effective navigation approaches that can abstract network data properly and present them insightfully are therefore required to help researchers interpret the data and acquire knowledge efficiently. Cytoscape is a de facto standard platform for network visualization and analysis, with many users around the world. Apart from its sophisticated core features, it easily allows its functionality to be extended by loading extra plug-ins.
We developed NaviClusterCS, which enables researchers to interactively navigate large biological networks of ~100,000 nodes in a "Google Maps-like" manner within the Cytoscape environment. NaviClusterCS rapidly and automatically identifies biologically meaningful clusters in large networks, e.g., proteins sharing similar biological functions in protein-protein interaction networks. It then displays only a manageable number of those clusters at any magnification, rather than all nodes, to avoid cluttered visualization, while its zooming and re-centering functions still allow researchers to interactively analyze the networks in detail. Its application to a real Arabidopsis co-expression network dataset illustrated a practical use of the tool for suggesting knowledge that is hidden in large biological networks and difficult to obtain with other visualization methods.
NaviClusterCS provides interactive and multi-scale network navigation to a wide range of biologists in the big data era, via the de facto standard platform for network visualization. It can be freely downloaded at http://navicluster.cb.k.u-tokyo.ac.jp/cs/ and installed as a plug-in of Cytoscape.
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) maintains a primary nucleotide sequence database and provides analytical resources for biological information to researchers. This database content is exchanged with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). Resources provided by the DDBJ include traditional nucleotide sequence data, released as 27 316 452 entries or 16 876 791 557 base pairs (as of June 2012), and raw reads from next-generation sequencers in the Sequence Read Archive (SRA). A Japanese researcher published his own genome sequence via DDBJ-SRA on 31 July 2012. To cope with the ongoing genomic data deluge, in March 2012 our previous computer system was completely replaced by a commodity cluster-based system that boasts 122.5 TFlops of CPU capacity and 5 PB of storage space. During this upgrade, it was considered crucial to replace and refactor substantial portions of the DDBJ software systems as well. As a result of the replacement process, which took more than 2 years, we have achieved significant improvements in system performance.
The Genia task, when introduced in 2009, was the first community-wide effort to address fine-grained, structured information extraction from the biomedical literature. Arranged for the second time as one of the main tasks of the BioNLP Shared Task 2011, it aimed to measure the progress of the community since 2009 and to evaluate how well the technology generalizes to full-text papers. The Protein Coreference task was arranged as one of the supporting tasks, motivated by a lesson from the 2009 task: the abundance of coreference structures in natural-language text hinders further improvement on the Genia task.
The Genia task received final submissions from 15 teams. The results show that the community has made significant progress, with a best F-score of 74% in extracting bio-molecular events of simple structure, e.g., gene expression, and 45-48% in extracting those of complex structure, e.g., regulation. The Protein Coreference task received six final submissions. The results show that coreference resolution performance in the biomedical domain lags behind that in the newswire domain (50% vs. 66% in MUC score). In particular, the best system achieved an F-score of 34% on protein coreference resolution.
Detailed analysis performed on the results improves our insight into the problem and suggests the directions for further improvements.
Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using string similarity measures. In order to clearly demonstrate this difficulty, we compared the string similarities determined using the edit distance, the Monge-Elkan score, SoftTFIDF, and the bigram Dice coefficient for chemical names with those for non-chemical names.
Our experimental results revealed the following: (1) The edit distance performed best in matching full forms, whereas Cohen et al. reported that SoftTFIDF with the Jaro-Winkler distance yielded the best measure for matching pairs of terms in their experiments. (2) For each of the string similarity measures above, the best threshold for term matching differs between chemical and non-chemical names; the difference is especially large for the edit distance. (3) Although the matching results obtained for chemical names using the edit distance, Monge-Elkan scores, or the bigram Dice coefficients are better than those for non-chemical names, the opposite holds for SoftTFIDF. (4) A suitable weight for chemical names differs substantially from one for non-chemical names; in particular, a weight vector optimized for non-chemical names is not suitable for chemical names. (5) The matching results using the edit distance improve further when a set of full forms is divided into two subsets according to whether each full form is a chemical name. These results show that our hypothesis is acceptable, and that we can significantly improve the performance of abbreviation-full-form clustering by processing chemical and non-chemical names separately.
In conclusion, the discriminative application of string similarity methods to chemical and non-chemical names may be a simple yet effective way to improve the performance of term clustering.
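Two of the compared measures can be sketched as follows; this is a minimal illustration of the measures themselves, not the authors' evaluation code, and the example names are ours:

```python
def edit_distance(a, b):
    """Levenshtein edit distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def bigram_dice(a, b):
    """Dice coefficient over sets of character bigrams: 2|A∩B| / (|A|+|B|)."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 1.0

# Two chemical names one edit apart can denote different compounds,
# which is one reason a single matching threshold serves them poorly.
d = edit_distance("1,2-dichloroethane", "1,1-dichloroethane")  # 1
```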
The Integrating Network Objects with Hierarchies (INOH) database is a highly structured, manually curated database of signal transduction pathways, covering Mammalia, Xenopus laevis, Drosophila melanogaster and Caenorhabditis elegans as well as canonical pathways. Since most pathway knowledge resides in scientific articles, the database focuses on curating and encoding textual knowledge into a machine-processable form. We use a hierarchical pathway representation model based on a compound graph, and every pathway component in the INOH database is annotated with a set of uniquely developed ontologies. We also developed a similarity search that combines the compound graph with the hierarchical ontologies. The INOH database aims to be a good resource for users who want to analyze large protein networks. INOH ontologies and 73 signal transduction and 29 metabolic pathway diagrams (comprising over 6155 interactions and 3395 protein entities) are freely available in INOH XML and BioPAX formats.
Database URL: http://www.inoh.org/
The DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp) maintains and provides archival, retrieval and analytical resources for biological information. The central DDBJ resource consists of public, open-access nucleotide sequence databases including raw sequence reads, assembly information and functional annotation. Database content is exchanged with EBI and NCBI within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). In 2011, DDBJ launched two new resources: the ‘DDBJ Omics Archive’ (DOR; http://trace.ddbj.nig.ac.jp/dor) and BioProject (http://trace.ddbj.nig.ac.jp/bioproject). DOR is an archival database of functional genomics data generated by microarray and highly parallel new generation sequencers. Data are exchanged between the ArrayExpress at EBI and DOR in the common MAGE-TAB format. BioProject provides an organizational framework to access metadata about research projects and the data from the projects that are deposited into different databases. In this article, we describe major changes and improvements introduced to the DDBJ services, and the launch of two new resources: DOR and BioProject.
The interaction between biological researchers and the bioinformatics tools they use is still hampered by incomplete interoperability between such tools. To ensure interoperability initiatives are effectively deployed, end-user applications need to be aware of, and support, best practices and standards. Here, we report on an initiative in which software developers and genome biologists came together to explore and raise awareness of these issues: BioHackathon 2009.
Developers in attendance came from diverse backgrounds, with experts in Web services, workflow tools, text mining and visualization. Genome biologists provided expertise and exemplar data from the domains of sequence and pathway analysis and glyco-informatics. One goal of the meeting was to evaluate the ability to address real world use cases in these domains using the tools that the developers represented. This resulted in i) a workflow to annotate 100,000 sequences from an invertebrate species; ii) an integrated system for analysis of the transcription factor binding sites (TFBSs) enriched based on differential gene expression data obtained from a microarray experiment; iii) a workflow to enumerate putative physical protein interactions among enzymes in a metabolic pathway using protein structure data; iv) a workflow to analyze glyco-gene-related diseases by searching for human homologs of glyco-genes in other species, such as fruit flies, and retrieving their phenotype-annotated SNPs.
Beyond deriving prototype solutions for each use case, a second major purpose of the BioHackathon was to highlight areas of insufficiency. We discuss the issues raised by our exploration of the problem/solution space, concluding that there are still problems with the way Web services are modeled and annotated, including: i) the absence of several useful data or analysis functions in the Web service "space"; ii) the lack of documentation of methods; iii) the lack of compliance with the SOAP/WSDL specification among and between various programming-language libraries; and iv) incompatibility between various bioinformatics data formats. Although these issues still made it difficult to solve the real-world problems posed to the developers by the biological researchers in attendance, we note the promise of addressing them within a semantic framework.
In recent years, biological web resources such as databases and tools have become more complex because of the enormous amounts of data generated in the life sciences. Traditional methods of distributing tutorials include publishing textbooks and posting web documents, but such static content cannot adequately describe recent dynamic web services. Thanks to improvements in computer technology, dynamic content such as video can now be created with minimal effort and at low cost on most modern computers. Creating and distributing video tutorials instead of static content improves accessibility for researchers, annotators and curators. This article focuses on online repositories for educational and tutorial videos provided by resource developers and users. It also describes a project in Japan named TogoTV (http://togotv.dbcls.jp/en/) and discusses, with examples, the production and distribution of high-quality tutorial videos that are useful to viewers. This article intends to stimulate and encourage researchers who develop and use databases and tools to distribute how-to videos as a way to enhance product usability.
screencast; vodcast; tutorial; YouTube; QuickTime; Flash
Many abbreviations are used in the literature especially in the life sciences, and polysemous abbreviations appear frequently, making it difficult to read and understand scientific papers that are outside of a reader’s expertise. Thus, we have developed Allie, a database and a search service of abbreviations and their long forms (a.k.a. full forms or definitions). Allie searches for abbreviations and their corresponding long forms in a database that we have generated based on all titles and abstracts in MEDLINE. When a user query matches an abbreviation, Allie returns all potential long forms of the query along with their bibliographic data (i.e. title and publication year). In addition, for each candidate, co-occurring abbreviations and a research field in which it frequently appears in the MEDLINE data are displayed. This function helps users learn about the context in which an abbreviation appears. To deal with synonymous long forms, we use a dictionary called GENA that contains domain-specific terms such as gene, protein or disease names along with their synonymic information. Conceptually identical domain-specific terms are regarded as one term, and then conceptually identical abbreviation-long form pairs are grouped taking into account their appearance in MEDLINE. To keep up with new abbreviations that are continuously introduced, Allie has an automatic update system. In addition, the database of abbreviations and their long forms with their corresponding PubMed IDs is constructed and updated weekly.
Database URL: The Allie service is available at http://allie.dbcls.jp/.
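The grouping of conceptually identical abbreviation-long-form pairs can be sketched as a canonicalization step followed by a dictionary grouping. This is a toy sketch of the idea; the synonym table stands in for a dictionary like GENA, and both it and the example pairs are invented, not Allie data:

```python
from collections import defaultdict

# Invented synonym table mapping variant long forms to a canonical term,
# standing in for a domain dictionary such as GENA.
SYNONYMS = {
    "tumour necrosis factor": "tumor necrosis factor",
}

def group_pairs(pairs):
    """Group (abbreviation, long form) pairs under a canonical long form."""
    groups = defaultdict(set)
    for abbr, long_form in pairs:
        canonical = SYNONYMS.get(long_form.lower(), long_form.lower())
        groups[(abbr.upper(), canonical)].add(long_form)
    return dict(groups)

groups = group_pairs([
    ("TNF", "tumor necrosis factor"),
    ("TNF", "Tumour necrosis factor"),  # synonymous spelling, same group
    ("TNF", "transframe protein"),      # polysemy: a distinct long form
])
```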
Motivation: Many types of omics data are compiled as lists of connections between elements and visualized as networks or graphs where the nodes and edges correspond to the elements and the connections, respectively. However, these networks often appear as ‘hair-balls’—with a large number of extremely tangled edges—and cannot be visually interpreted.
Results: We present an interactive, multiscale navigation method for biological networks. Our approach can automatically and rapidly abstract any portion of a large network of interest to an immediately interpretable extent. The method is based on an ultrafast graph clustering technique that abstracts networks of about 100 000 nodes in a second by iteratively grouping densely connected portions and a biological-property-based clustering technique that takes advantage of biological information often provided for biological entities (e.g. Gene Ontology terms). It was confirmed to be effective by applying it to real yeast protein network data, and would greatly help modern biologists faced with large, complicated networks in a similar manner to how Web mapping services enable interactive multiscale navigation of geographical maps (e.g. Google Maps).
Availability: Java implementation of our method, named NaviCluster, is available at http://navicluster.cb.k.u-tokyo.ac.jp/.
Supplementary information: Supplementary data are available at Bioinformatics online.
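At its simplest, the iterative grouping of densely connected portions can be pictured as repeated rounds of edge matching, where each node merges with one neighbour to form a coarser graph. The following is a generic one-round sketch of that idea, written by us; it is not NaviCluster's actual clustering algorithm:

```python
from collections import defaultdict

def coarsen(edges):
    """One round of edge matching: merge each node with one unmatched
    neighbour, yielding a coarser graph. Repeating this shrinks a network
    level by level, as multiscale viewers require. Generic sketch only."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    group = {}  # node -> representative of its cluster
    for u in sorted(adj):
        if u in group:
            continue
        group[u] = u
        for v in sorted(adj[u]):  # grab the first still-unmatched neighbour
            if v not in group:
                group[v] = u
                break
    # Relabel edges by cluster representative, dropping intra-cluster edges.
    coarse = {(min(group[u], group[v]), max(group[u], group[v]))
              for u, v in edges if group[u] != group[v]}
    return group, sorted(coarse)

group, coarse = coarsen([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
```

On the 5-node path above, one round yields clusters {a,b}, {c,d}, {e} and a 3-node coarse graph.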
In this paper, we describe a server/client literature management system specialized for the life science domain, the TogoDoc system (Togo, pronounced Toe-Go, is a romanization of a Japanese word for integration). The server and the client program cooperate closely over the Internet to provide life scientists with an effective literature recommendation service and efficient literature management. The content-based and personalized literature recommendation helps researchers to isolate interesting papers from the “tsunami” of literature, in which, on average, more than one biomedical paper is added to MEDLINE every minute. Because researchers these days need to cover updates of much wider topics to generate hypotheses using massive datasets obtained from public databases or omics experiments, the importance of having an effective literature recommendation service is rising. The automatic recommendation is based on the content of personal literature libraries of electronic PDF papers. The client program automatically analyzes these files, which are sometimes deeply buried in storage disks of researchers' personal computers. Just saving PDF papers to the designated folders makes the client program automatically analyze and retrieve metadata, rename file names, synchronize the data to the server, and receive the recommendation lists of newly published papers, thus accomplishing effortless literature management. In addition, the tag suggestion and associative search functions are provided for easy classification of and access to past papers (researchers who read many papers sometimes only vaguely remember or completely forget what they read in the past). The TogoDoc system is available for both Windows and Mac OS X and is free. The TogoDoc Client software is available at http://tdc.cb.k.u-tokyo.ac.jp/, and the TogoDoc server is available at https://docman.dbcls.jp/pubmed_recom.
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) provides a nucleotide sequence archive database and accompanying database tools for sequence submission, entry retrieval and annotation analysis. The DDBJ collected and released 3 637 446 entries/2 272 231 889 bases between July 2009 and June 2010. A highlight of the released data was an archive dataset of next-generation sequencing reads of the Japanese rice cultivar Koshihikari, submitted by the National Institute of Agrobiological Sciences. In this period, we started a new archive for quantitative genomics data, the DDBJ Omics aRchive (DOR). The DOR stores quantitative data from both microarray and high-throughput sequencing platforms. Moreover, we improved the content of the DDBJ patent sequences, released a new submission tool for the DDBJ Sequence Read Archive (DRA), which archives massive raw sequencing reads, and enhanced a cloud computing-based analytical system for sequencing reads, the DDBJ Read Annotation Pipeline. In this article, we describe these new functions of the DDBJ databases and support tools.
The recent explosion in the availability of genetic sequence data has made large-scale phylogenetic inference routine in many life science laboratories. The outcomes of such analyses are typically a variety of candidate phylogenetic relationships, or tree topologies, even when the power of genome-scale data is exploited. Because much phylogenetic information is buried in such topology distributions, it is important to reveal that information as effectively as possible; however, existing methods adopt complex structures to represent it. Hence researchers, particularly those who are not experts in evolutionary studies, sometimes hesitate to adopt these methods, and much phylogenetic information can be overlooked and wasted. In this paper, we propose the centroid wheel tree representation, an informative representation of phylogenetic topology distributions that can be readily interpreted even by nonexperts. Furthermore, we mathematically prove it to be the most balanced representation of phylogenetic topologies and show that it is efficiently solvable in the framework of the traveling salesman problem, for which very sophisticated program packages are available. This theoretically and practically superior representation should aid biologists faced with abundant data. The centroid representation introduced here is fairly general, so it can be applied to other fields characterized by high-dimensional solution spaces and large quantities of noisy data. The software is implemented in Java and available via http://cwt.cb.k.u-tokyo.ac.jp/.
Centroid wheel tree; centroid representation; phylogenetic tree; probability distribution; traveling salesman problem
Web services have become a key technology for bioinformatics, since life science databases are globally decentralized and the exponential increase in the amount of available data demands efficient systems that do not require transferring entire databases for every step of an analysis. However, various incompatibilities among database resources and analysis services make it difficult to connect and integrate them into interoperable workflows. To resolve this situation, we invited domain specialists from web service providers, client software developers, the Open Bio* projects and the BioMoby project, together with researchers in emerging areas where a standard exchange data format is not well established, to an intensive collaboration entitled the BioHackathon 2008. The meeting was hosted by the Database Center for Life Science (DBCLS) and the Computational Biology Research Center (CBRC) and was held in Tokyo from February 11th to 15th, 2008. In this report we highlight the work accomplished and the common issues that arose from this event, including the standardization of data exchange formats and services in the emerging fields of glycoinformatics, biological interaction networks, text mining, and phyloinformatics. In addition, common shared object development based on BioSQL, as well as technical challenges in large data management, asynchronous services, and security, are discussed. We improved the interoperability of web services in several fields; however, further cooperation among major database centers and continued collaborative efforts between service providers and software developers are still necessary for effective advances in bioinformatics web service technologies.
Web services have become widely used in bioinformatics analysis, but incompatibilities in interfaces and data types prevent users from making full use of combinations of these services. We have therefore developed the TogoWS service to provide an integrated interface with advanced features. In the TogoWS REST (Representational State Transfer) API (application programming interface), we introduce a unified access method for major database resources through intuitive URIs that can be used to search, retrieve, parse and convert database entries. The TogoWS SOAP API resolves compatibility issues found in server- and client-side SOAP implementations. The TogoWS service is freely available at: http://togows.dbcls.jp/.
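A client can exploit such intuitive URIs by simple string composition. The `/entry/{db}/{id}[/{field}][.{format}]` path pattern below is an illustrative assumption on our part, not a specification taken from this text:

```python
BASE = "http://togows.dbcls.jp"  # service root given in the abstract

def entry_uri(database, entry_id, fmt=None, field=None):
    """Compose a REST-style entry-retrieval URI.
    The path pattern is assumed for illustration only."""
    uri = f"{BASE}/entry/{database}/{entry_id}"
    if field:
        uri += f"/{field}"  # optional sub-field of the entry
    if fmt:
        uri += f".{fmt}"    # optional output format
    return uri

uri = entry_uri("genbank", "AB000001", fmt="fasta")
```

Such URIs can then be fetched with any HTTP client, which is what makes a unified REST interface convenient for scripting.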
Motivation: The identification of putative ligand-binding sites on proteins is important for the prediction of protein function. Knowledge-based approaches using structure databases have become interesting, because of the recent increase in structural information. Approaches using binding motif information are particularly effective. However, they can only be applied to well-known ligands that frequently appear in the structure databases.
Results: We have developed a new method for predicting the binding sites of chemically diverse ligands, by using information about the interactions between fragments. The selection of the fragment size is important. If the fragments are too small, then the patterns derived from the binding motifs cannot be used, since they are many-body interactions, while using larger fragments limits the application to well-known ligands. In our method, we used the main and side chains for proteins, and three successive atoms for ligands, as fragments. After superposition of the fragments, our method builds the conformations of ligands and predicts the binding sites. As a result, our method could accurately predict the binding sites of chemically diverse ligands, even though the Protein Data Bank currently contains a large number of nucleotides. Moreover, a further evaluation for the unbound forms of proteins revealed that our building up procedure was robust to conformational changes induced by ligand binding.
Availability: Our method, named ‘BUMBLE’, is available at http://bumble.hgc.jp/
Supplementary information: Supplementary Material is available at Bioinformatics online.
The DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp) collected and released 1 701 110 entries/1 116 138 614 bases between July 2008 and June 2009. Highlighted data releases from DDBJ included the complete genome sequence of an endosymbiont within protist cells in the termite gut, and Cap Analysis Gene Expression tags for human and mouse deposited by the Functional Annotation of the Mammalian cDNA consortium. In this period, we started a new user announcement service using Really Simple Syndication (RSS) to deliver a daily list of data released from DDBJ. Comprehensive visualization of DDBJ release data was attempted using a word cloud program. Moreover, a new archive for sequencing data from next-generation sequencers, the 'DDBJ Read Archive' (DRA), was launched. Concurrently, for read data registered in the DRA, a semi-automatic annotation tool called the 'DDBJ Read Annotation Pipeline' was released as a preliminary step. The pipeline consists of two parts: basic analysis for reference genome mapping and de novo assembly, and high-level analysis of structural and functional annotations. These new services will aid users' research and provide easier access to DDBJ databases.
Genome-wide data enable us to clarify the underlying molecular mechanisms of complex phenotypes. Online Mendelian Inheritance in Man (OMIM) is a widely used knowledge base of human genes and genetic disorders for biological researchers. However, OMIM has not been fully exploited for omics analysis because its bibliographic data structure is not suitable for computer automation. Here, we characterized diseases and genes by generating feature profiles of associated drugs, biological phenomena and anatomy with the MeSH (Medical Subject Headings) vocabulary. We obtained 1 760 054 pairs of OMIM entries and MeSH terms by utilizing the full set of MEDLINE articles. We developed a web-based application called Gendoo (gene, disease features ontology-based overview system) to visualize these profiles. By comparing the feature profiles of type 1 and type 2 diabetes, we clearly illustrated their differences: type 1 diabetes is an autoimmune disease (P-value = 4.55 × 10^-5) and type 2 diabetes is related to obesity (P-value = 1.18 × 10^-15). Gendoo and the developed feature profiles should be useful for omics analysis from molecular and clinical viewpoints. Gendoo is available at http://gendoo.dbcls.jp/.
The evolutionary history of biological pathways is of general interest, especially in this post-genomic era, because it may provide clues for understanding how complex systems encoded on genomes have been organized. To explain how pathways can evolve de novo, some noteworthy models have been proposed. However, direct reconstruction of pathway evolutionary history both on a genomic scale and at the depth of the tree of life has suffered from artificial effects in estimating the gene content of ancestral species. Recently, we developed an algorithm that effectively reconstructs gene-content evolution without these artificial effects, and we applied it to this problem. The carefully reconstructed history, which was based on the metabolic pathways of 160 prokaryotic species, confirmed that pathways have grown beyond the random acquisition of individual genes. Pathway acquisition took place quickly, probably circumventing the difficulty of retaining genes during the course of pathway evolution. This rapid evolution was due to massive horizontal gene transfers as gene groups, some of which were possibly operon transfers, which would convey existing pathways but not be able to generate novel pathways. We therefore analyzed how these pathways originally appeared and found that the original acquisition of pathways occurred more contemporaneously than expected across different phylogenetic clades. As a possible model to explain this observation, we propose that novel pathway evolution may be facilitated by bidirectional horizontal gene transfers in prokaryotic communities. Such a model would complement existing pathway evolution models.
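Reconstructing gene-content evolution is, at its core, a small-parsimony problem on a species tree. The sketch below uses generic Fitch parsimony on a single gene's presence/absence pattern; it is only a stand-in for the authors' artifact-free algorithm, and the toy tree and leaf states are invented:

```python
# Generic Fitch small-parsimony for one gene's presence (1) / absence (0)
# on a rooted binary tree. This illustrates the kind of inference involved
# in gene-content reconstruction; it is NOT the authors' algorithm.

def fitch_bottom_up(tree, root, leaf_states):
    """tree: dict mapping each internal node -> (left_child, right_child);
    leaf_states: dict mapping each leaf -> {0} or {1}.
    Returns (state sets for every node, minimum number of change events)."""
    changes = 0
    sets = dict(leaf_states)

    def visit(node):
        nonlocal changes
        if node not in tree:            # leaf: state is observed
            return sets[node]
        left, right = tree[node]
        a, b = visit(left), visit(right)
        inter = a & b
        if inter:
            sets[node] = inter          # children agree on a state
        else:
            sets[node] = a | b          # conflict: one gain or loss event
            changes += 1
        return sets[node]

    visit(root)
    return sets, changes

# Toy example: two sister species carry the gene, the outgroup does not.
TREE = {"root": ("anc", "sp3"), "anc": ("sp1", "sp2")}
LEAF_STATES = {"sp1": {1}, "sp2": {1}, "sp3": {0}}
```

Running `fitch_bottom_up(TREE, "root", LEAF_STATES)` infers that the ancestor `anc` carried the gene and that a single gain or loss event suffices to explain the leaves.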
Many biological functions, from energy metabolism to antibiotic resistance, are carried out by biological pathways that require a number of cooperatively functioning genes. Hence, the underlying mechanisms in the evolution of biological pathways are of particular interest. However, compared to the evolution of individual genes, which has been well studied, the evolution of biological pathways is far less understood. In this study, we used the abundant genome sequences available today and a novel algorithm we recently developed to trace the evolutionary history of prokaryotic metabolic pathways and to analyze how these pathways emerged. We found that pathways were acquired remarkably rapidly, which would play a key role in circumventing the difficulty of retaining genes during the course of pathway evolution. In addition, the emergence of novel pathways was suggested to have occurred more contemporaneously than expected across different phylogenetic clades. Based on these observations, we propose that novel pathway evolution can be facilitated by bidirectional horizontal gene transfers in prokaryotic communities. This simple model may help answer the question of how biological pathways that require a number of cooperatively functioning genes can be acquired, a core event in the evolution of biological pathways in prokaryotes.
BodyParts3D is a dictionary-type database for anatomy in which anatomical concepts are represented by 3D structure data that specify corresponding segments of a 3D whole-body model for an adult human male. It encompasses morphological and geometrical knowledge in anatomy and complements ontological representation. Moreover, BodyParts3D introduces a universal coordinate system in human anatomy, which may facilitate management of samples and data in biomedical research and clinical practice. To date, 382 anatomical concepts, sufficient for mapping materials in most molecular medicine experiments, have been specified. Expansion of the dictionary by adding further segments and details to the whole-body model will continue in collaboration with clinical researchers until sufficient resolution and accuracy for most clinical applications are achieved. BodyParts3D is accessible at: http://lifesciencedb.jp/ag/bp3d/.
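The universal coordinate idea can be illustrated with a minimal lookup: when every segment carries geometry in one shared whole-body frame, a sample coordinate can be mapped back to anatomical concepts. The concept names and bounding-box coordinates below are invented for illustration and are not BodyParts3D data:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A body-model segment with an axis-aligned bounding box expressed
    in one shared whole-body coordinate frame. The names and numbers
    used below are placeholders, not real BodyParts3D geometry."""
    concept: str
    lo: tuple[float, float, float]   # bounding-box minimum corner
    hi: tuple[float, float, float]   # bounding-box maximum corner

    def contains(self, point):
        """True if the point lies inside this segment's bounding box."""
        return all(l <= c <= h for l, c, h in zip(self.lo, point, self.hi))

SEGMENTS = [
    Segment("liver", (-10.0, 0.0, 95.0), (10.0, 15.0, 110.0)),
    Segment("heart", (-6.0, 2.0, 120.0), (6.0, 12.0, 135.0)),
]

def concepts_at(point):
    """Anatomical concepts whose segment box contains the sample point."""
    return [s.concept for s in SEGMENTS if s.contains(point)]
```

A real implementation would test against the segment meshes rather than coarse boxes, but the lookup pattern, from a coordinate in the shared frame to a dictionary concept, is the same.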