High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biological research. However, the immense amount of sequence data requires computational skills and suitable hardware resources that are a challenge to molecular biologists. The DNA Data Bank of Japan (DDBJ) of the National Institute of Genetics (NIG) has initiated a cloud computing-based analytical pipeline, the DDBJ Read Annotation Pipeline (DDBJ Pipeline), for a high-throughput annotation of NGS reads. The DDBJ Pipeline offers a user-friendly graphical web interface and processes massive NGS datasets using decentralized processing by NIG supercomputers currently free of charge. The proposed pipeline consists of two analysis components: basic analysis for reference genome mapping and de novo assembly and subsequent high-level analysis of structural and functional annotations. Users may smoothly switch between the two components in the pipeline, facilitating web-based operations on a supercomputer for high-throughput data analysis. Moreover, public NGS reads of the DDBJ Sequence Read Archive located on the same supercomputer can be imported into the pipeline through the input of only an accession number. This proposed pipeline will facilitate research by utilizing unified analytical workflows applied to the NGS data. The DDBJ Pipeline is accessible at http://p.ddbj.nig.ac.jp/.
next-generation sequencing; sequence read archive; cloud computing; analytical pipeline; genome analysis
The overwhelming amount of network data in functional genomics makes its visualization cluttered with jumbled nodes and edges. Such cluttered network visualizations, known as "hair-balls", significantly hinder researchers' data interpretation and analysis. Effective navigation approaches that abstract network data properly and present them insightfully are hence required to help researchers interpret the data and acquire knowledge efficiently. Cytoscape is a de facto standard platform for network visualization and analysis, with many users around the world. Apart from its sophisticated core features, its functionality can easily be extended by loading extra plug-ins.
We developed NaviClusterCS, which enables researchers to interactively navigate large biological networks of ~100,000 nodes in a "Google Maps-like" manner in the Cytoscape environment. NaviClusterCS rapidly and automatically identifies biologically meaningful clusters in large networks, e.g., proteins sharing similar biological functions in protein-protein interaction networks. Rather than displaying all nodes, it then shows only a preferable number of those clusters at any magnification to avoid cluttered network visualization, while its zooming and re-centering functions still enable researchers to interactively analyze the networks in detail. Its application to a real Arabidopsis co-expression network dataset illustrated a practical use of the tool for suggesting knowledge that is hidden in large biological networks and difficult to obtain using other visualization methods.
NaviClusterCS provides interactive and multi-scale network navigation to a wide range of biologists in the big data era, via the de facto standard platform for network visualization. It can be freely downloaded at http://navicluster.cb.k.u-tokyo.ac.jp/cs/ and installed as a plug-in of Cytoscape.
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) maintains a primary nucleotide sequence database and provides analytical resources for biological information to researchers. This database content is exchanged with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). Resources provided by the DDBJ include traditional nucleotide sequence data released in the form of 27 316 452 entries or 16 876 791 557 base pairs (as of June 2012), and raw reads of new generation sequencers in the sequence read archive (SRA). A Japanese researcher published his own genome sequence via DDBJ-SRA on 31 July 2012. To cope with the ongoing genomic data deluge, in March 2012, our previous computer system was totally replaced by a commodity cluster-based system that boasts 122.5 TFlops of CPU capacity and 5 PB of storage space. During this upgrade, it was considered crucial to replace and refactor substantial portions of the DDBJ software systems as well. As a result of the replacement process, which took more than 2 years to perform, we have achieved significant improvements in system performance.
The Genia task, when it was introduced in 2009, was the first community-wide effort to address fine-grained, structural information extraction from biomedical literature. Arranged for the second time as one of the main tasks of BioNLP Shared Task 2011, it aimed to measure the progress of the community since 2009, and to evaluate generalization of the technology to full text papers. The Protein Coreference task was arranged as one of the supporting tasks, motivated by one of the lessons of the 2009 task: that the abundance of coreference structures in natural language text hinders further improvement on the Genia task.
The Genia task received final submissions from 15 teams. The results show that the community has made significant progress, with the best F-score reaching 74% in extracting bio-molecular events of simple structure, e.g., gene expressions, and 45% ~ 48% in extracting those of complex structure, e.g., regulations. The Protein Coreference task received 6 final submissions. The results show that coreference resolution performance in the biomedical domain lags behind that in the newswire domain, cf. 50% vs. 66% in MUC score. In particular, for protein coreference resolution the best system achieved an F-score of 34%.
Detailed analysis performed on the results improves our insight into the problem and suggests the directions for further improvements.
Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using string similarity measures. In order to clearly demonstrate this difficulty, we compared the string similarities determined using the edit distance, the Monge-Elkan score, SoftTFIDF, and the bigram Dice coefficient for chemical names with those for non-chemical names.
Our experimental results revealed the following: (1) The edit distance had the best performance in the matching of full forms, whereas Cohen et al. reported that SoftTFIDF with the Jaro-Winkler distance would yield the best measure for matching pairs of terms for their experiments. (2) For each of the string similarity measures above, the best threshold for term matching differs for chemical names and for non-chemical names; the difference is especially large for the edit distance. (3) Although the matching results obtained for chemical names using the edit distance, Monge-Elkan scores, or the bigram Dice coefficients are better than the result obtained for non-chemical names, the results were contrary when using SoftTFIDF. (4) A suitable weight for chemical names varies substantially from one for non-chemical names. In particular, a weight vector that has been optimized for non-chemical names is not suitable for chemical names. (5) The matching results using the edit distances improve further by dividing a set of full forms into two subsets, according to whether a full form is a chemical name or not. These results show that our hypothesis is acceptable, and that we can significantly improve the performance of abbreviation-full form clustering by computing chemical names and non-chemical names separately.
In conclusion, the discriminative application of string similarity methods to chemical and non-chemical names may be a simple yet effective way to improve the performance of term clustering.
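Two of the measures compared above, the edit distance and the bigram Dice coefficient, can be sketched in a few lines. This is a minimal illustration only, not the implementation used in the study: the length normalization, the function names, and the threshold dictionary passed to `same_cluster` are assumptions introduced here, and the study's actual thresholds are not reproduced.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; this normalization is an assumption."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over multisets of character bigrams."""
    ba = Counter(a[i:i + 2] for i in range(len(a) - 1))
    bb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2.0 * sum((ba & bb).values()) / total


def same_cluster(a: str, b: str, is_chemical: bool, thresholds: dict) -> bool:
    """Match two full forms using a class-specific threshold (hypothetical values)."""
    key = "chemical" if is_chemical else "non_chemical"
    return edit_similarity(a.lower(), b.lower()) >= thresholds[key]
```

Keeping the threshold as a per-class lookup mirrors finding (2): a single cut-off tuned on non-chemical names would be applied to chemical names only if the caller chose to pass identical values.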
The Integrating Network Objects with Hierarchies (INOH) database is a highly structured, manually curated database of signal transduction pathways covering Mammalia, Xenopus laevis, Drosophila melanogaster, Caenorhabditis elegans and canonical pathways. Since most pathway knowledge resides in scientific articles, the database focuses on curating and encoding textual knowledge into a machine-processable form. We use a hierarchical pathway representation model with a compound graph, and every pathway component in the INOH database is annotated by a set of uniquely developed ontologies. Finally, we developed the Similarity Search using the combination of a compound graph and hierarchical ontologies. The INOH database is intended to be a good resource for users who want to analyze a large protein network. INOH ontologies and 73 signal transduction and 29 metabolic pathway diagrams (including over 6155 interactions and 3395 protein entities) are freely available in INOH XML and BioPAX formats.
Database URL: http://www.inoh.org/
The DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp) maintains and provides archival, retrieval and analytical resources for biological information. The central DDBJ resource consists of public, open-access nucleotide sequence databases including raw sequence reads, assembly information and functional annotation. Database content is exchanged with EBI and NCBI within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). In 2011, DDBJ launched two new resources: the ‘DDBJ Omics Archive’ (DOR; http://trace.ddbj.nig.ac.jp/dor) and BioProject (http://trace.ddbj.nig.ac.jp/bioproject). DOR is an archival database of functional genomics data generated by microarray and highly parallel new generation sequencers. Data are exchanged between the ArrayExpress at EBI and DOR in the common MAGE-TAB format. BioProject provides an organizational framework to access metadata about research projects and the data from the projects that are deposited into different databases. In this article, we describe major changes and improvements introduced to the DDBJ services, and the launch of two new resources: DOR and BioProject.
In recent years, biological web resources such as databases and tools have become more complex because of the enormous amounts of data generated in the field of life sciences. Traditional methods of distributing tutorials include publishing textbooks and posting web documents, but this static content cannot adequately describe recent dynamic web services. Due to improvements in computer technology, it is now possible to create dynamic content such as video with minimal effort and low cost on most modern computers. The ease of creating and distributing video tutorials instead of static content improves accessibility for researchers, annotators and curators. This article focuses on online video repositories for educational and tutorial videos provided by resource developers and users. It also describes a project in Japan named TogoTV (http://togotv.dbcls.jp/en/) and discusses, with examples, the production and distribution of high-quality tutorial videos that are useful to viewers. This article intends to stimulate and encourage researchers who develop and use databases and tools to distribute how-to videos as a tool to enhance product usability.
screencast; vodcast; tutorial; YouTube; QuickTime; Flash
Many abbreviations are used in the literature especially in the life sciences, and polysemous abbreviations appear frequently, making it difficult to read and understand scientific papers that are outside of a reader’s expertise. Thus, we have developed Allie, a database and a search service of abbreviations and their long forms (a.k.a. full forms or definitions). Allie searches for abbreviations and their corresponding long forms in a database that we have generated based on all titles and abstracts in MEDLINE. When a user query matches an abbreviation, Allie returns all potential long forms of the query along with their bibliographic data (i.e. title and publication year). In addition, for each candidate, co-occurring abbreviations and a research field in which it frequently appears in the MEDLINE data are displayed. This function helps users learn about the context in which an abbreviation appears. To deal with synonymous long forms, we use a dictionary called GENA that contains domain-specific terms such as gene, protein or disease names along with their synonymic information. Conceptually identical domain-specific terms are regarded as one term, and then conceptually identical abbreviation-long form pairs are grouped taking into account their appearance in MEDLINE. To keep up with new abbreviations that are continuously introduced, Allie has an automatic update system. In addition, the database of abbreviations and their long forms with their corresponding PubMed IDs is constructed and updated weekly.
Database URL: The Allie service is available at http://allie.dbcls.jp/.
Motivation: Many types of omics data are compiled as lists of connections between elements and visualized as networks or graphs where the nodes and edges correspond to the elements and the connections, respectively. However, these networks often appear as ‘hair-balls’—with a large number of extremely tangled edges—and cannot be visually interpreted.
Results: We present an interactive, multiscale navigation method for biological networks. Our approach can automatically and rapidly abstract any portion of a large network of interest to an immediately interpretable extent. The method is based on an ultrafast graph clustering technique that abstracts networks of about 100 000 nodes in a second by iteratively grouping densely connected portions and a biological-property-based clustering technique that takes advantage of biological information often provided for biological entities (e.g. Gene Ontology terms). It was confirmed to be effective by applying it to real yeast protein network data, and would greatly help modern biologists faced with large, complicated networks in a similar manner to how Web mapping services enable interactive multiscale navigation of geographical maps (e.g. Google Maps).
Availability: Java implementation of our method, named NaviCluster, is available at http://navicluster.cb.k.u-tokyo.ac.jp/.
Supplementary information: Supplementary data are available at Bioinformatics online.
In this paper, we describe a server/client literature management system specialized for the life science domain, the TogoDoc system (Togo, pronounced Toe-Go, is a romanization of a Japanese word for integration). The server and the client program cooperate closely over the Internet to provide life scientists with an effective literature recommendation service and efficient literature management. The content-based and personalized literature recommendation helps researchers to isolate interesting papers from the “tsunami” of literature, in which, on average, more than one biomedical paper is added to MEDLINE every minute. Because researchers these days need to cover updates of much wider topics to generate hypotheses using massive datasets obtained from public databases or omics experiments, the importance of having an effective literature recommendation service is rising. The automatic recommendation is based on the content of personal literature libraries of electronic PDF papers. The client program automatically analyzes these files, which are sometimes deeply buried in storage disks of researchers' personal computers. Just saving PDF papers to the designated folders makes the client program automatically analyze and retrieve metadata, rename file names, synchronize the data to the server, and receive the recommendation lists of newly published papers, thus accomplishing effortless literature management. In addition, the tag suggestion and associative search functions are provided for easy classification of and access to past papers (researchers who read many papers sometimes only vaguely remember or completely forget what they read in the past). The TogoDoc system is available for both Windows and Mac OS X and is free. The TogoDoc Client software is available at http://tdc.cb.k.u-tokyo.ac.jp/, and the TogoDoc server is available at https://docman.dbcls.jp/pubmed_recom.
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) provides a nucleotide sequence archive database and accompanying database tools for sequence submission, entry retrieval and annotation analysis. The DDBJ collected and released 3 637 446 entries/2 272 231 889 bases between July 2009 and June 2010. A highlight of the released data was archive datasets from next-generation sequencing reads of the Japanese rice cultivar Koshihikari, submitted by the National Institute of Agrobiological Sciences. In this period, we started a new archive for quantitative genomics data, the DDBJ Omics aRchive (DOR). The DOR stores quantitative data from both microarray and high-throughput new sequencing platforms. Moreover, we improved the content of the DDBJ patent sequence, released a new submission tool of the DDBJ Sequence Read Archive (DRA), which archives massive raw sequencing reads, and enhanced a cloud computing-based analytical system for sequencing reads, the DDBJ Read Annotation Pipeline. In this article, we describe these new functions of the DDBJ databases and support tools.
The recent explosion in the availability of genetic sequence data has made large-scale phylogenetic inference routine in many life sciences laboratories. The outcomes of such analyses are, typically, a variety of candidate phylogenetic relationships or tree topologies, even when the power of genome-scale data is exploited. Because much phylogenetic information must be buried in such topology distributions, it is important to reveal that information as effectively as possible; however, existing methods need to adopt complex structures to represent such information. Hence, researchers, in particular those not experts in evolutionary studies, sometimes hesitate to adopt these methods and much phylogenetic information could be overlooked and wasted. In this paper, we propose the centroid wheel tree representation, which is an informative representation of phylogenetic topology distributions, and which can be readily interpreted even by nonexperts. Furthermore, we mathematically prove this to be the most balanced representation of phylogenetic topologies and efficiently solvable in the framework of the traveling salesman problem, for which very sophisticated program packages are available. This theoretically and practically superior representation should aid biologists faced with abundant data. The centroid representation introduced here is fairly general, so it can be applied to other fields that are characterized by high-dimensional solution spaces and large quantities of noisy data. The software is implemented in Java and available via http://cwt.cb.k.u-tokyo.ac.jp/.
Centroid wheel tree; centroid representation; phylogenetic tree; probability distribution; traveling salesman problem
Web services have become widely used in bioinformatics analysis, but incompatibilities in interfaces and data types prevent users from making full use of a combination of these services. Therefore, we have developed the TogoWS service to provide an integrated interface with advanced features. In the TogoWS REST (REpresentational State Transfer) API (application programming interface), we introduce a unified access method for major database resources through intuitive URIs that can be used to search, retrieve, parse and convert the database entries. The TogoWS SOAP API resolves compatibility issues found in server-side and client-side SOAP implementations. The TogoWS service is freely available at: http://togows.dbcls.jp/.
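To make the idea of retrieving, parsing and converting entries through a single URI concrete, the sketch below only assembles such a URI as a string and makes no network request. The path layout `/entry/<database>/<id>[,<id>...][/<field>][.<format>]`, the helper name `entry_uri`, and the example database and field names are assumptions for illustration and should be checked against the service documentation before use.

```python
from urllib.parse import quote

# Base address taken from the abstract; everything after it is an assumed layout.
BASE = "http://togows.dbcls.jp"


def entry_uri(database, entry_ids, field=None, fmt=None):
    """Build an entry-retrieval URI (hypothetical helper, assumed path scheme)."""
    ids = ",".join(quote(entry_id, safe="") for entry_id in entry_ids)
    uri = f"{BASE}/entry/{quote(database, safe='')}/{ids}"
    if field:
        uri += "/" + quote(field, safe="")  # parse: select one field of the entry
    if fmt:
        uri += "." + fmt                    # convert: request an output format
    return uri
```

Under this assumed scheme, retrieval, field extraction and format conversion differ only in the tail of the URI, which is what allows one access method to cover several databases.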
Motivation: The identification of putative ligand-binding sites on proteins is important for the prediction of protein function. Knowledge-based approaches using structure databases have become interesting, because of the recent increase in structural information. Approaches using binding motif information are particularly effective. However, they can only be applied to well-known ligands that frequently appear in the structure databases.
Results: We have developed a new method for predicting the binding sites of chemically diverse ligands, by using information about the interactions between fragments. The selection of the fragment size is important. If the fragments are too small, then the patterns derived from the binding motifs cannot be used, since they are many-body interactions, while using larger fragments limits the application to well-known ligands. In our method, we used the main and side chains for proteins, and three successive atoms for ligands, as fragments. After superposition of the fragments, our method builds the conformations of ligands and predicts the binding sites. As a result, our method could accurately predict the binding sites of chemically diverse ligands, even though the Protein Data Bank currently contains a large number of nucleotides. Moreover, a further evaluation for the unbound forms of proteins revealed that our building up procedure was robust to conformational changes induced by ligand binding.
Availability: Our method, named ‘BUMBLE’, is available at http://bumble.hgc.jp/
Supplementary information: Supplementary Material is available at Bioinformatics online.
The DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp) collected and released 1 701 110 entries/1 116 138 614 bases between July 2008 and June 2009. Highlighted data releases from DDBJ included the complete genome sequence of an endosymbiont within protist cells in the termite gut and Cap Analysis Gene Expression tags for human and mouse deposited by the Functional Annotation of the Mammalian cDNA consortium. In this period, we started a novel user announcement service using Really Simple Syndication (RSS) to deliver a list of data released from DDBJ on a daily basis. Comprehensive visualization of DDBJ release data was attempted by using a word cloud program. Moreover, a new archive for sequencing data from next-generation sequencers, the ‘DDBJ Read Archive’ (DRA), was launched. Concurrently, for read data registered in DRA, a semi-automatic annotation tool called the ‘DDBJ Read Annotation Pipeline’ was released as a preliminary step. The pipeline consists of two parts: basic analysis for reference genome mapping and de novo assembly, and high-level analysis of structural and functional annotations. These new services will aid users’ research and provide easier access to DDBJ databases.
Genome-wide data enables us to clarify the underlying molecular mechanisms of complex phenotypes. The Online Mendelian Inheritance in Man (OMIM) is a widely employed knowledge base of human genes and genetic disorders for biological researchers. However, OMIM has not been fully exploited for omics analysis because its bibliographic data structure is not suitable for computer automation. Here, we characterized diseases and genes by generating feature profiles of associated drugs, biological phenomena and anatomy with the MeSH (Medical Subject Headings) vocabulary. We obtained 1 760 054 pairs of OMIM entries and MeSH terms by utilizing the full set of MEDLINE articles. We developed a web-based application called Gendoo (gene, disease features ontology-based overview system) to visualize these profiles. By comparing feature profiles of types 1 and 2 diabetes, we clearly illustrated their differences: type 1 diabetes is an autoimmune disease (P-value = 4.55 × 10⁻⁵) and type 2 diabetes is related to obesity (P-value = 1.18 × 10⁻¹⁵). Gendoo and the developed feature profiles should be useful for omics analysis from molecular and clinical viewpoints. Gendoo is available at http://gendoo.dbcls.jp/.
The evolutionary history of biological pathways is of general interest, especially in this post-genomic era, because it may provide clues for understanding how complex systems encoded on genomes have been organized. To explain how pathways can evolve de novo, some noteworthy models have been proposed. However, direct reconstruction of pathway evolutionary history both on a genomic scale and at the depth of the tree of life has suffered from artificial effects in estimating the gene content of ancestral species. Recently, we developed an algorithm that effectively reconstructs gene-content evolution without these artificial effects, and we applied it to this problem. The carefully reconstructed history, which was based on the metabolic pathways of 160 prokaryotic species, confirmed that pathways have grown beyond the random acquisition of individual genes. Pathway acquisition took place quickly, probably eliminating the difficulty in holding genes during the course of the pathway evolution. This rapid evolution was due to massive horizontal gene transfers as gene groups, some of which were possibly operon transfers, which would convey existing pathways but not be able to generate novel pathways. To this end, we analyzed how these pathways originally appeared and found that the original acquisition of pathways occurred more contemporaneously than expected across different phylogenetic clades. As a possible model to explain this observation, we propose that novel pathway evolution may be facilitated by bidirectional horizontal gene transfers in prokaryotic communities. Such a model would complement existing pathway evolution models.
Many biological functions, from energy metabolism to antibiotic resistance, are carried out by biological pathways that require a number of cooperatively functioning genes. Hence, underlying mechanisms in the evolution of biological pathways are of particular interest. However, compared to the evolution of individual genes, which has been well studied, the evolution of biological pathways is far less understood. In this study, we used the abundant genome sequences available today and a novel algorithm we recently developed to trace the evolutionary history of prokaryotic metabolic pathways and to analyze how these pathways emerged. We found that the pathways have experienced significantly rapid acquisition, which would play a key role in eliminating the difficulty in holding genes during the course of pathway evolution. In addition, the emergence of novel pathways was suggested to have occurred more contemporaneously than expected across different phylogenetic clades. Based on these observations, we propose that novel pathway evolution can be facilitated by bidirectional horizontal gene transfers in prokaryotic communities. This simple model may approach the question of how biological pathways requiring a number of cooperatively functioning genes can be obtained and are the core event within the evolution of biological pathways in prokaryotes.
BodyParts3D is a dictionary-type database for anatomy in which anatomical concepts are represented by 3D structure data that specify corresponding segments of a 3D whole-body model for an adult human male. It encompasses morphological and geometrical knowledge in anatomy and complements ontological representation. Moreover, BodyParts3D introduces a universal coordinate system in human anatomy, which may facilitate management of samples and data in biomedical research and clinical practice. As of today, 382 anatomical concepts, sufficient for mapping materials in most molecular medicine experiments, have been specified. Expansion of the dictionary by adding further segments and details to the whole-body model will continue in collaboration with clinical researchers until sufficient resolution and accuracy for most clinical application are achieved. BodyParts3D is accessible at: http://lifesciencedb.jp/ag/bp3d/.
Numerous microbes inhabit the human intestine, many of which are uncharacterized or uncultivable. They form a complex microbial community that deeply affects human physiology. To identify the genomic features common to all human gut microbiomes as well as those variable among them, we performed a large-scale comparative metagenomic analysis of fecal samples from 13 healthy individuals of various ages, including unweaned infants. We found that, while the gut microbiota from unweaned infants were simple and showed a high inter-individual variation in taxonomic and gene composition, those from adults and weaned children were more complex but showed a high functional uniformity regardless of age or sex. In searching for the genes over-represented in gut microbiomes, we identified 237 gene families commonly enriched in adult-type and 136 families in infant-type microbiomes, with a small overlap. An analysis of their predicted functions revealed various strategies employed by each type of microbiota to adapt to its intestinal environment, suggesting that these gene sets encode the core functions of adult and infant-type gut microbiota. By analysing the orphan genes, 647 new gene families were identified to be exclusively present in human intestinal microbiomes. In addition, we discovered a conjugative transposon family explosively amplified in human gut microbiomes, which strongly suggests that the intestine is a ‘hot spot’ for horizontal gene transfer between microbes.
metagenomics; human gut microbiota; gene family; conjugative transposon
Many online resources for the life sciences have been developed and introduced in peer-reviewed papers recently, ranging from databases and web applications to data-analysis software. Some have been introduced in special journal issues or websites with a search function, but others remain scattered throughout the Internet and in the published literature. The searchable resources on these sites are collected and maintained manually and are therefore of higher quality than automatically updated sites, but also require more time and effort.
We developed an online resource search system called OReFiL to address these issues. We developed a crawler to gather all of the web pages whose URLs appear in MEDLINE abstracts and full-text papers on the BioMed Central open-access journals. The URLs were extracted using regular expressions and rules based on our heuristic knowledge. We then indexed the online resources to facilitate their retrieval and comparison by researchers. Because every online resource has at least one PubMed ID, we can easily acquire its summary with Medical Subject Headings (MeSH) terms and confirm its credibility through reference to the corresponding PubMed entry. In addition, because OReFiL automatically extracts URLs and updates the index, minimal time and effort is needed to maintain the system.
We developed OReFiL, a search system for online life science resources, which is freely available. The system's distinctive features include the ability to return up-to-date query-relevant online resources introduced in peer-reviewed papers; the ability to search using free words, MeSH terms, or author names; easy verification of each hit following links to the corresponding PubMed entry or to papers citing the URL through the search systems of BioMed Central, Scirus, HighWire Press, or Google Scholar; and quick confirmation of the existence of an online resource web page.
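The URL extraction and indexing step described above can be illustrated with a small regular-expression sketch. The pattern, the trailing-punctuation rule, and the function names are assumptions made for this example; they are not OReFiL's actual expressions or heuristics, and the record identifiers in any input are supplied by the caller.

```python
import re

# Assumed pattern: a scheme followed by characters that rarely end a URL in prose.
# Excluding parentheses and brackets is a simplification that truncates some URLs.
URL_PATTERN = re.compile(r"\b(?:https?|ftp)://[^\s<>\"'()\[\]]+", re.IGNORECASE)
TRAILING_PUNCTUATION = ".,;:!?"


def extract_urls(text):
    """Return the distinct URLs in text, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)  # drop sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


def index_resources(records):
    """Map each URL to the set of record identifiers whose text mentions it.

    records is an iterable of (identifier, text) pairs, e.g. PubMed ID and abstract.
    """
    index = {}
    for identifier, text in records:
        for url in extract_urls(text):
            index.setdefault(url, set()).add(identifier)
    return index
```

Because every indexed URL keeps the identifiers of the papers that mention it, a hit can always be traced back to its bibliographic entry, which is the property the abstract relies on for credibility checks.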
To obtain an overview of the promoter activities intrinsic to primary DNA sequences in the human genome within a particular cell type, we carried out systematic quantitative luciferase assays of DNA fragments corresponding to putative promoters for 472 human genes that are expressed in HEK (human embryonic kidney epithelial) 293 cells. We observed that their promoter activities were distributed in a bimodal manner; putative promoters belonging to the first group (with strong promoter activities) were designated as P1 and those belonging to the second (with weak promoter activities) as P2. The frequencies of the TATA-boxes, the CpG islands, and the overall G + C-contents were significantly different between these two populations, indicating that there are two separate groups of promoters. Interestingly, a similar analysis using 251 randomly isolated genomic DNA fragments showed that P2-type promoters occasionally occur within the human genome. Furthermore, 35 DNA fragments corresponding to putative promoters of non-protein-coding transcripts (ncRNAs) shared similar features with P2 in both promoter activities and sequence compositions. At least some of the ncRNAs, which have been identified in large numbers by full-length cDNA projects with no functional relevance inferred, may have originated from those sporadic promoter activities of primary DNA sequences inherent to the human genome.
human genome; promoter; transcriptional start site
Exhaustive gene identification is a fundamental goal in all metagenomics projects. However, most metagenomic sequences are unassembled anonymous fragments, and conventional gene-finding methods cannot be applied. We have developed a prokaryotic gene-finding program, MetaGene, which utilizes di-codon frequencies estimated from the GC content of a given sequence, together with various other measures. MetaGene can predict a whole range of prokaryotic genes based on anonymous genomic sequences of a few hundred bases, with a sensitivity of 95% and a specificity of 90% for artificial shotgun sequences (700 bp fragments from 12 species). MetaGene has two sets of codon frequency interpolations, one for bacteria and one for archaea, and automatically selects the proper set for a given sequence using the domain classification method we propose. The domain classification works properly, correctly assigning domain information to more than 90% of the artificial shotgun sequences. Applied to the Sargasso Sea dataset, MetaGene predicted almost all of the annotated genes and a notable number of novel genes. MetaGene can be applied to a wide variety of metagenomic projects and expands the utility of metagenomics.
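The two quantities named above, GC content and di-codon frequencies, can be computed from a sequence as in the following minimal sketch. It only counts what is observed in one sequence; how MetaGene estimates di-codon frequencies from the GC content is not reproduced here, and the function names are assumptions made for illustration.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def dicodon_frequencies(seq, frame=0):
    """Relative frequencies of di-codons (adjacent codon pairs, 6 bases).

    The sequence is read codon by codon in the given frame, so successive
    di-codons overlap by one codon.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 6] for i in range(frame, len(seq) - 5, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {dicodon: n / total for dicodon, n in counts.items()}


fragment = "ATGGCCATGGCC"  # toy sequence, not real data
print(round(gc_content(fragment), 3))
print(dicodon_frequencies(fragment))
```

In a gene finder, observed frequencies of this kind would be compared against expected frequencies for coding sequence; here the expected values are left out because the abstract does not specify them.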
Objective: To help biomedical researchers recognize dynamically introduced abbreviations in biomedical literature, such as gene and protein names, we have constructed a support system called ALICE (Abbreviation LIfter using Corpus-based Extraction). ALICE aims to extract all types of abbreviations with their expansions from a target paper on the fly.
Methods: ALICE extracts an abbreviation and its expansion from the literature by using heuristic pattern-matching rules. The system consists of three phases and can identify 320 valid abbreviation-expansion patterns as combinations of the rules.
Results: It achieved 95% recall and 97% precision on randomly selected titles and abstracts from the MEDLINE database.
Conclusion: ALICE extracted abbreviations and their expansions from the literature efficiently. The carefully compiled heuristics enabled it to extract abbreviations with high recall without significantly reducing precision. ALICE not only facilitates recognition of an undefined abbreviation in a paper by constructing an abbreviation database or dictionary, but also makes biomedical literature retrieval more accurate. This system is freely available at http://uvdb3.hgc.jp/ALICE/ALICE_index.html.
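The idea of heuristic pattern matching for abbreviation-expansion pairs can be illustrated with a single rule: a parenthesised short form whose letters match the initials of the words immediately before it. This is a minimal sketch of one such rule, not ALICE's rule set or its three phases; the patterns and the function name find_abbreviations are assumptions.

```python
import re

# Hypothetical rule: a short form of 2 to 10 characters inside parentheses.
PAREN_PATTERN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")


def find_abbreviations(text):
    """Return (short form, expansion) pairs matched by the initial-letter rule."""
    pairs = []
    for match in PAREN_PATTERN.finditer(text):
        short = match.group(1)
        letters = [c.lower() for c in short if c.isalnum()]
        # Words that precede the opening parenthesis, with their positions.
        words = list(WORD_PATTERN.finditer(text[:match.start()]))
        candidate = words[-len(letters):]
        if len(candidate) != len(letters):
            continue
        if all(w.group(0)[0].lower() == c for w, c in zip(candidate, letters)):
            # Slice the original text so hyphens inside the expansion survive.
            expansion = text[candidate[0].start():candidate[-1].end()]
            pairs.append((short, expansion))
    return pairs


sentence = (
    "We used next-generation sequencing (NGS) and "
    "the sequence read archive (SRA)."
)
print(find_abbreviations(sentence))
```

A single rule of this kind misses many real cases. For instance, it does not pair "DDBJ" with "DNA Data Bank of Japan", because the function word "of" breaks the initial-letter match, which is why a practical system combines many rules.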
In general, it is not easy to specify a single sequence identity for each molecule name
that appears in a pathway in the scientific literature. A molecule name may stand
for concepts of various granularities, from concrete objects such as H-Ras and ERK1
to abstract concepts or categories such as Ras and MAPK. Typically, the relations
among molecule names form a hierarchical structure; without a proper way to
handle this knowledge, it becomes ever more difficult to develop a reliable pathway
database. This paper describes an ontology that is designed to annotate molecules
in the scientific literature on signal transduction pathways.
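
The hierarchy among molecule names can be pictured as a set of is-a links from concrete names to more abstract categories. The sketch below is only an illustration under stated assumptions: the two links pair the example names mentioned above (H-Ras with Ras, ERK1 with MAPK), and the dictionary layout and function names are hypothetical, not the structure of the ontology itself.

```python
# Hypothetical is-a links built from the example names in the text:
# each concrete molecule name points to its more abstract category.
IS_A = {
    "H-Ras": "Ras",
    "ERK1": "MAPK",
}


def ancestors(name, is_a=IS_A):
    """Return the chain of increasingly abstract concepts above name."""
    chain = []
    seen = {name}
    while name in is_a:
        name = is_a[name]
        if name in seen:
            raise ValueError("cycle in is-a links at " + name)
        seen.add(name)
        chain.append(name)
    return chain


def matches(mention, concept, is_a=IS_A):
    """True if mention is the concept itself or a more specific instance of it."""
    return mention == concept or concept in ancestors(mention, is_a)


print(ancestors("H-Ras"))
print(matches("ERK1", "MAPK"))
print(matches("Ras", "H-Ras"))
```

With links of this kind, a pathway annotated at the level of "Ras" can still be retrieved by a query for "H-Ras", while the reverse match is rejected because an abstract category does not imply one specific member.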