author:("Nakai, kenya")
1.  Computational Promoter Modeling Identifies the Modes of Transcriptional Regulation in Hematopoietic Stem Cells 
PLoS ONE  2014;9(4):e93853.
Extrinsic and intrinsic regulators are responsible for the tight control of hematopoietic stem cells (HSCs), which differentiate into all blood cell lineages. To understand the fundamental basis of HSC biology, we focused on differentially expressed genes (DEGs) in long-term and short-term HSCs, which are closely related in terms of cell development but substantially differ in their stem cell capacity. To analyze the transcriptional regulation of the DEGs identified in the novel transcriptome profiles obtained by our RNA-seq analysis, we developed a computational method to model the linear relationship between gene expression and the features of putative regulatory elements. The transcriptional regulation modes characterized here suggest the importance of transcription factors (TFs) that are expressed at steady state or at low levels. Remarkably, we found that 24 differentially expressed TFs targeting 21 putative TF-binding sites contributed significantly to transcriptional regulation. These TFs tended to be modulated by other nondifferentially expressed TFs, suggesting that HSCs can achieve flexible and rapid responses via the control of nondifferentially expressed TFs through a highly complex regulatory network. Our novel transcriptome profiles and new method are powerful tools for studying the mechanistic basis of cell fate decisions.
PMCID: PMC3977923  PMID: 24710559
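The linear modeling idea in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical feature matrix (one row per gene, one column per putative regulatory-element feature) and fits ordinary least squares; the features and fitting procedure actually used in the paper are not given in the abstract, so the function names and data layout below are illustrative assumptions only.

```python
def fit_linear_model(features, expression):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    features: list of rows, one per gene (hypothetical regulatory-element scores).
    expression: list of expression values, one per gene.
    Returns the weights, with the intercept as the last element.
    """
    rows = [list(r) + [1.0] for r in features]  # append an intercept column
    n = len(rows[0])
    # Build the augmented normal-equation matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, expression))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; model is not identifiable")
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for k in range(n):
            if k != col:
                factor = a[k][col]
                a[k] = [vk - factor * vc for vk, vc in zip(a[k], a[col])]
    return [a[i][n] for i in range(n)]


def predict(weights, row):
    """Predicted expression for one gene's feature row."""
    return sum(w * x for w, x in zip(weights, row)) + weights[-1]
```

In such a model, the sign and magnitude of each fitted weight indicate how strongly the corresponding feature is associated with expression, which is the kind of reading the abstract describes for transcription factor contributions.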
2.  Evaluation of Sequence Features from Intrinsically Disordered Regions for the Estimation of Protein Function 
PLoS ONE  2014;9(2):e89890.
With the exponential increase in the number of sequenced organisms, automated annotation of proteins is becoming increasingly important. Intrinsically disordered regions are known to play a significant role in protein function. Despite their abundance, especially in eukaryotes, they are rarely used to inform function prediction systems. In this study, we extracted seven sequence features in intrinsically disordered regions and developed a scheme to use them to predict Gene Ontology Slim terms associated with proteins. We evaluated the function prediction performance of each feature. Our results indicate that the residue composition based features have the highest precision while bigram probabilities, based on sequence profiles of intrinsically disordered regions obtained from PSIBlast, have the highest recall. Amino acid bigrams and features based on secondary structure show an intermediate level of precision and recall. Almost all features showed a high prediction performance for GO Slim terms related to extracellular matrix, nucleus, RNA and DNA binding. However, feature performance varied significantly for different GO Slim terms emphasizing the need for a unique classifier optimized for the prediction of each functional term. These findings provide a first comprehensive and quantitative evaluation of sequence features in intrinsically disordered regions and will help in the development of a more informative protein function predictor.
PMCID: PMC3933697  PMID: 24587103
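Two of the feature types named in this abstract, residue composition and amino acid bigrams, can be sketched as follows. This is a simplified illustration that works on a raw sequence segment; the profile-based bigram probabilities derived from PSIBlast are not reproduced here, and the function names are assumptions.

```python
from collections import Counter


def residue_composition(segment):
    """Fraction of each amino acid in a disordered-region segment."""
    if not segment:
        return {}
    counts = Counter(segment)
    return {aa: n / len(segment) for aa, n in counts.items()}


def bigram_frequencies(segment):
    """Relative frequency of each adjacent residue pair in the segment."""
    pairs = [segment[i:i + 2] for i in range(len(segment) - 1)]
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {bigram: n / len(pairs) for bigram, n in counts.items()}
```

Either dictionary can be turned into a fixed-length feature vector (20 entries for composition, 400 for bigrams) before being passed to a classifier for each GO Slim term.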
3.  Innate Immunity Interactome Dynamics 
Innate immune response involves protein–protein interactions, deoxyribonucleic acid (DNA)–protein interactions and signaling cascades. So far, thousands of protein–protein interactions have been curated as a static interaction map. However, protein–protein interactions involved in innate immune response are dynamic. We recorded the dynamics in the interactome during innate immune response by combining gene expression data of lipopolysaccharide (LPS)-stimulated dendritic cells with protein–protein interactions data. We identified the differences in interactome during innate immune response by constructing differential networks and identifying protein modules, which were up-/down-regulated at each stage during the innate immune response. For each protein complex, we identified enriched biological processes and pathways. In addition, we identified core interactions that are conserved throughout the innate immune response and their enriched gene ontology terms and pathways. We defined two novel measures to assess the differences between network maps at different time points. We found that the protein interaction network at 1 hour after LPS stimulation has the highest interactions protein ratio, which indicates a role for proteins with large number of interactions in innate immune response. A pairwise differential matrix allows for the global visualization of the differences between different networks. We investigated the toll-like receptor subnetwork and found that S100A8 is down-regulated in dendritic cells after LPS stimulation. Identified protein complexes have a crucial role not only in innate immunity, but also in circadian rhythms, pathways involved in cancer, and p53 pathways. The study confirmed previous work that reported a strong correlation between cancer and immunity.
PMCID: PMC3885269  PMID: 24453478
innate immunity; protein-protein interactions; gene expression; differential networks; interactome dynamics
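The idea of a differential network between two time points can be sketched as below. The activity rule is an assumption made for illustration (an interaction counts as active when both partners are expressed at or above a threshold); the abstract does not state how the study combined expression with interaction data, and the two novel measures it mentions are not reproduced.

```python
def active_network(interactions, expression, threshold=1.0):
    """Interactions whose two partners are both expressed at or above threshold.

    interactions: iterable of (protein_a, protein_b) pairs.
    expression: dict mapping protein name to expression level at one time point.
    """
    return {frozenset(pair) for pair in interactions
            if all(expression.get(p, 0.0) >= threshold for p in pair)}


def differential_network(net_early, net_late):
    """Return (gained, lost, core) interaction sets between two time points."""
    gained = net_late - net_early   # active only at the later time point
    lost = net_early - net_late     # active only at the earlier time point
    core = net_early & net_late     # conserved at both time points
    return gained, lost, core
```

The `core` set corresponds to interactions conserved across time points, while `gained` and `lost` make up the differential network.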
4.  An integrated pipeline for next generation sequencing and annotation of the complete mitochondrial genome of the giant intestinal fluke, Fasciolopsis buski (Lankester, 1857) Looss, 1899 
PeerJ  2013;1:e207.
Helminths include both parasitic nematodes (roundworms) and platyhelminths (trematode and cestode flatworms) that are abundant, and are of clinical importance. The genetic characterization of parasitic flatworms using advanced molecular tools is central to the diagnosis and control of infections. Although the nuclear genome houses suitable genetic markers (e.g., in ribosomal (r) DNA) for species identification and molecular characterization, the mitochondrial (mt) genome consistently provides a rich source of novel markers for informative systematics and epidemiological studies. In the last decade, there have been some important advances in mtDNA genomics of helminths, especially lung flukes, liver flukes and intestinal flukes. Fasciolopsis buski, often called the giant intestinal fluke, is one of the largest digenean trematodes infecting humans and found primarily in Asia, in particular the Indian subcontinent. Next-generation sequencing (NGS) technologies now provide opportunities for high throughput sequencing, assembly and annotation within a short span of time. Herein, we describe a high-throughput sequencing and bioinformatics pipeline for mt genomics for F. buski that emphasizes the utility of short read NGS platforms such as Ion Torrent and Illumina in successfully sequencing and assembling the mt genome using innovative approaches for PCR primer design as well as assembly. We took advantage of our NGS whole genome sequence data (unpublished so far) for F. buski and its comparison with available data for the Fasciola hepatica mtDNA as the reference genome for design of precise and specific primers for amplification of mt genome sequences from F. buski. A long-range PCR was carried out to create an NGS library enriched in mt DNA sequences. Two different NGS platforms were employed for complete sequencing, assembly and annotation of the F. buski mt genome. 
The complete mt genome sequence of the intestinal fluke comprises 14,118 bp and is thus the shortest trematode mitochondrial genome sequenced to date. The noncoding control regions are separated into two parts by the tRNA-Gly gene and contain neither the tandem repeats nor the secondary structures that are typical of trematode control regions. The gene content and arrangement are identical to those of F. hepatica. The F. buski mt genome closely resembles that of F. hepatica, and its gene order is similar to that of other trematodes. The mtDNA of the intestinal fluke is reported here for the first time and should help investigate Fasciolidae taxonomy and systematics with the aid of mtDNA NGS data. Moreover, it will serve as a resource for comparative mitochondrial genomics and systematic studies of trematode parasites.
PMCID: PMC3828612  PMID: 24255820
Fasciolopsis buski; Mitochondria; Next generation sequencing; Contigs
5.  Linking Transcriptional Changes over Time in Stimulated Dendritic Cells to Identify Gene Networks Activated during the Innate Immune Response 
PLoS Computational Biology  2013;9(11):e1003323.
The innate immune response is primarily mediated by the Toll-like receptors functioning through the MyD88-dependent and TRIF-dependent pathways. Despite being widely studied, it is not yet completely understood and systems-level analyses have been lacking. In this study, we identified a high-probability network of genes activated during the innate immune response using a novel approach to analyze time-course gene expression profiles of activated immune cells in combination with a large gene regulatory and protein-protein interaction network. We classified the immune response into three consecutive time-dependent stages and identified the most probable paths between genes showing a significant change in expression at each stage. The resultant network contained several novel and known regulators of the innate immune response, many of which did not show any observable change in expression at the sampled time points. The response network shows the dominance of genes from specific functional classes during different stages of the immune response. It also suggests a role for the protein phosphatase 2a catalytic subunit α in the regulation of the immunoproteasome during the late phase of the response. In order to clarify the differences between the MyD88-dependent and TRIF-dependent pathways in the innate immune response, time-course gene expression profiles from MyD88-knockout and TRIF-knockout dendritic cells were analyzed. Their response networks suggest the dominance of the MyD88-dependent pathway in the innate immune response, and an association of the circadian regulators and immunoproteasomal degradation with the TRIF-dependent pathway. The response network presented here provides the most probable associations between genes expressed in the early and the late phases of the innate immune response, while taking into account the intermediate regulators. 
We propose that the method described here can also be used in the identification of time-dependent gene sub-networks in other biological systems.
Author Summary
The innate immune response is the first level of protection in organisms against invading pathogens. It consists of a large number of proteins functioning in signaling cascades triggered by the binding of fragments from microbes to specific cellular receptors. Disruptions in these pathways can lead to numerous diseases. As such, the innate immune system has been the subject of a large number of studies. However, due to its complexity and the interplay of a large number of pathways, it is not yet completely understood. In this study, we measured transcriptional changes in activated immune cells and used this information in the context of a large network of protein-protein and protein-DNA interactions to identify a smaller network of response genes. We did this by identifying the most probable network paths connecting genes showing large changes in their expression patterns at successive stages of the response. Analysis of this activated gene network revealed the associations between various temporal regulators of the innate immune response. We also identified response networks for immune cells lacking important mediators, MyD88 and TRIF, to clarify the distinct functional modules affected by their associated pathways in the innate immune response.
PMCID: PMC3820512  PMID: 24244133
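Finding the most probable path between two genes in a weighted interaction network can be sketched with Dijkstra's algorithm on negative log probabilities. This is a generic illustration under the assumption that each edge carries an independent probability in (0, 1]; the scoring scheme actually used in the study is not described in the abstract.

```python
import heapq
import math


def most_probable_path(edge_probs, source, target):
    """Return (path, probability) maximizing the product of edge probabilities.

    edge_probs: dict mapping directed (u, v) pairs to a probability in (0, 1].
    Maximizing a product of probabilities equals minimizing the sum of -log(p).
    """
    graph = {}
    for (u, v), p in edge_probs.items():
        graph.setdefault(u, []).append((v, -math.log(p)))
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])
```

A path found this way can pass through intermediate genes that show no expression change themselves, which is how a response network can include regulators that are not differentially expressed.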
6.  Comparative selenoproteome analysis reveals a reduced utilization of selenium in parasitic platyhelminthes 
PeerJ  2013;1:e202.
Background. The selenocysteine(Sec)-containing proteins, selenoproteins, are an important group of proteins present in all three kingdoms of life. Although the selenoproteomes of many organisms have been analyzed, systematic studies on selenoproteins in platyhelminthes are still lacking. Moreover, comparison of selenoproteomes between free-living and parasitic animals is rarely studied.
Results. In this study, three representative organisms (Schmidtea mediterranea, Schistosoma japonicum and Taenia solium) were selected for comparative analysis of selenoproteomes in Platyhelminthes. Using a SelGenAmic-based selenoprotein prediction algorithm, a total of 37 selenoprotein genes were identified in these organisms. The sizes of the selenoproteomes and selenoprotein families were found to be associated with lifestyle: free-living organisms have larger selenoproteomes, whereas a parasitic lifestyle corresponds to reduced selenoproteomes. Five selenoproteins, SelT, Sel15, GPx, SPS2 and TR, were found to be present in all examined platyhelminthes as well as in almost all sequenced animals, suggesting an essential role in metazoans. Finally, a new splicing form of SelW that lacks the first exon was found in S. japonicum.
Conclusions. Our data provide a first glance into the selenoproteomes of organisms in the phylum Platyhelminthes and may help understand function and evolutionary dynamics of selenium utilization in diversified metazoans.
PMCID: PMC3828610  PMID: 24255816
Selenocysteine; Parasite; Platyhelminthes; Selenoprotein; Bioinformatics
7.  Characterization of RNA in exosomes secreted by human breast cancer cell lines using next-generation sequencing 
PeerJ  2013;1:e201.
Exosomes are nanosized (30–100 nm) membrane vesicles secreted by most cell types. Exosomes have been found to contain various RNA species including miRNA, mRNA and long non-protein coding RNAs. A number of cancer cells produce elevated levels of exosomes. Because exosomes have been isolated from most body fluids, they may provide a source for non-invasive cancer diagnostics. Transcriptome profiling that uses deep-sequencing technologies (RNA-Seq) offers an enormous amount of data that can be used for biomarker discovery; however, in the case of exosomes, this approach has been applied only to the analysis of small RNAs. In this study, we utilized RNA-Seq technology to analyze RNAs present in microvesicles secreted by human breast cancer cell lines.
Exosomes were isolated from the media conditioned by two human breast cancer cell lines, MDA-MB-231 and MDA-MB-436. Exosomal RNA was profiled using the Ion Torrent semiconductor chip-based technology. Exosomes were found to contain various classes of RNA, with the major class represented by fragmented ribosomal RNA (rRNA), in particular 28S and 18S rRNA subunits. Analysis of exosomal RNA content revealed that it reflects the RNA content of the donor cells. Although exosomes produced by the two cancer cell lines shared most of the RNA species, there were a number of non-coding transcripts unique to MDA-MB-231 and MDA-MB-436 cells. This suggests that RNA analysis might distinguish exosomes produced by the low-metastatic breast cancer cell line (MDA-MB-436) from those produced by the highly metastatic breast cancer cell line (MDA-MB-231). The analysis of gene ontologies (GOs) associated with the most abundant transcripts present in exosomes revealed significant enrichment in genes encoding proteins involved in translation and in rRNA and ncRNA processing. These GO terms indicate the most highly expressed genes for both cellular and exosomal RNA.
For the first time, using RNA-seq, we examined the transcriptomes of exosomes secreted by human breast cancer cells. We found that the most abundant exosomal RNA species are fragments of 28S and 18S rRNA subunits. This limits the number of reads from other RNAs. To increase the number of detectable transcripts and improve the accuracy of their expression levels, protocols that deplete fragmented rRNA should be used in future RNA-seq analyses of exosomes. The present data revealed that exosomal transcripts are representative of their cells of origin and thus could form a basis for the detection of tumor-specific markers.
PMCID: PMC3828613  PMID: 24255815
Exosomes; Microvesicles; Next generation sequencing; Breast cancer; Biomarkers
8.  Construction of microRNA functional families by a mixture model of position weight matrices 
PeerJ  2013;1:e199.
MicroRNAs (miRNAs) are small regulatory molecules that repress the translational processes of their target genes by binding to their 3′ untranslated regions (3′ UTRs). Because the target genes are predominantly determined by their sequence complementarity to the miRNA seed regions (nucleotides 2–7) which are evolutionarily conserved, it is inferred that the target relationships and functions of the miRNA family members are conserved across many species. Therefore, detecting the relevant miRNA families with confidence would help to clarify the conserved miRNA functions, and elucidate miRNA-mediated biological processes. We present a mixture model of position weight matrices for constructing miRNA functional families. This model systematically finds not only evolutionarily conserved miRNA family members but also functionally related miRNAs, as it simultaneously generates position weight matrices representing the conserved sequences. Using mammalian miRNA sequences, in our experiments, we identified potential miRNA groups characterized by similar sequence patterns that have common functions. We validated our results using score measures and by the analysis of the conserved targets. Our method would provide a way to comprehensively identify conserved miRNA functions.
PMCID: PMC3817585  PMID: 24255813
Mixture model; MicroRNA; EM algorithm; Machine learning; Position weight matrix; Sequence analysis
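A single component of such a model, one position weight matrix built from miRNA seed regions, can be sketched as follows. The sketch assumes equal-length RNA sequences and a uniform pseudocount; the mixture over several matrices and its EM fitting are not shown, and the input sequences in any usage would be placeholders rather than curated miRNAs.

```python
import math

ALPHABET = "ACGU"


def seed(mirna):
    """Seed region: nucleotides 2-7 (1-based), as defined in the abstract."""
    return mirna[1:7]


def position_weight_matrix(seeds, pseudocount=1.0):
    """Per-position nucleotide probabilities estimated from aligned seeds."""
    pwm = []
    for i in range(len(seeds[0])):
        column = [s[i] for s in seeds]
        total = len(column) + pseudocount * len(ALPHABET)
        pwm.append({b: (column.count(b) + pseudocount) / total
                    for b in ALPHABET})
    return pwm


def log_likelihood(pwm, sequence):
    """Log probability of a sequence under the position weight matrix."""
    return sum(math.log(col[b]) for col, b in zip(pwm, sequence))
```

In a mixture model, each miRNA would be assigned (softly, during EM) to the matrix under which its log likelihood is highest, and the matrices would be re-estimated from those assignments.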
9.  An extended genovo metagenomic assembler by incorporating paired-end information 
PeerJ  2013;1:e196.
Metagenomes present assembly challenges because multiple genomes must be assembled from mixed reads of multiple species. An assembler designed for single genomes does not adapt well to this case. Genovo is a de novo assembler for metagenomes based on a generative probabilistic model. Genovo assembles all reads without discarding any in a preprocessing step, and is therefore able to extract more information from metagenomic data and, in principle, generate better assembly results. Paired-end sequencing is currently widely used, yet Genovo was designed for 454 single-end reads. In this research, we attempted to extend Genovo by incorporating paired-end information, yielding an assembler named Xgenovo, so that it generates higher-quality assemblies with paired-end reads.
First, we extended Genovo by adding a bonus parameter to the Chinese Restaurant Process, which is used as a prior that accounts for the unknown number of genomes in the sample. This bonus parameter encourages the two reads of a pair to be placed in the same contig and is intended to address chimeric contigs. Second, we modified the sampling process for the location of a read in a contig. For the number of trials in the symmetric geometric distribution, we used a relative distance instead of the distance between the offset and the center of the contig used in Genovo. With this relative distance, a read sampled at the appropriate location has a higher probability and is therefore more likely to be mapped to the correct location.
Results of extensive experiments on simulated metagenomic datasets from simple to complex with species coverage setting following uniform and lognormal distribution showed that Xgenovo can be superior to the original Genovo and the recently proposed metagenome assembler for 454 reads, MAP. Xgenovo successfully generated longer N50 than Genovo and MAP while maintaining the assembly quality even for very complex metagenomic datasets consisting of 115 species. Xgenovo also demonstrated the potential to decrease the computational cost. This means that our strategy worked well. The software and all simulated datasets are publicly available online at
PMCID: PMC3817583  PMID: 24281688
Genovo; 454 paired end reads; de novo metagenomic assembler
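The N50 statistic used above to compare assemblers can be computed with a few lines. This is the standard definition (the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half the total assembly length); the function name is illustrative.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")
```

A longer N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why the abstract also refers to maintaining assembly quality.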
10.  Identification of novel motif patterns to decipher the promoter architecture of co-expressed genes in Arabidopsis thaliana 
BMC Systems Biology  2013;7(Suppl 3):S10.
The understanding of the mechanisms of transcriptional regulation remains a challenge for molecular biologists in the post-genome era. It is hypothesized that the regulatory regions of genes expressed in the same tissue or cell type share a similar structure. Though several studies have analyzed the promoters of genes expressed in specific metazoan tissues or cells, little research has been done in plants. Hence finding specific patterns of motifs to explain the promoter architecture of co-expressed genes in plants could shed light on their transcription mechanism.
We identified novel patterns of sets of motifs in promoters of genes co-expressed in four different plant structures (PSs) and in the entire plant in Arabidopsis thaliana. Sets of genes expressed in four PSs (flower, seed, root, shoot) and housekeeping genes expressed in the entire plant were taken from a database of co-expressed genes in A. thaliana. PS-specific motifs were predicted using three motif-discovery algorithms; to the best of our knowledge, 8 of these motifs are novel. A support vector machine was trained using the average upstream distance of the identified motifs from the translation start site on both strands of binding sites. The correctly classified promoters per PS were used to construct specific patterns of sets of motifs to describe the promoter architecture of those co-expressed genes. The discovered PS-specific patterns were tested in the entire A. thaliana genome, correctly identifying 77.8%, 81.2%, 70.8% and 53.7% of genes expressed in petal differentiation, synergid cells, root hair and trichome, respectively, as well as 88.4% of housekeeping genes.
We present five patterns of sets of motifs which describe the promoter architecture of co-expressed genes in five PSs with the ability to predict them from the entire A. thaliana genome. Based on these findings, we conclude that the positioning and orientation of transcription factor binding sites at specific distances from the translation start site is a reliable measure to differentiate promoters of genes expressed in different A. thaliana structures from background genomic promoters. Our method can be used to predict novel motifs and decipher a similar promoter architecture for genes co-expressed in A. thaliana under different conditions.
PMCID: PMC3852273  PMID: 24555803
11.  Sequence- and Species-Dependence of Proteasomal Processivity 
ACS chemical biology  2012;7(8):1444-1453.
The proteasome is the degradation machine at the center of the ubiquitin-proteasome system and controls the concentrations of many proteins in eukaryotes. It is highly processive so that substrates are degraded completely into small peptides, avoiding the formation of potentially toxic fragments. Nonetheless, some proteins are incompletely degraded, indicating the existence of factors that influence proteasomal processivity. We have quantified proteasomal processivity and determined the underlying rates of substrate degradation and release. We find that processivity increases with species complexity over a 5-fold range between yeast and mammalian proteasome, and the effect is due to slower but more persistent degradation by proteasomes from more complex organisms. A sequence stretch that has been implicated in causing incomplete degradation, the glycine-rich region of the NFκB subunit p105, reduces the proteasome’s ability to unfold its substrate, and polyglutamine repeats such as found in Huntington’s disease reduce the processivity of the proteasome in a length-dependent manner.
PMCID: PMC3423507  PMID: 22716912
12.  Global gene expression of the inner cell mass and trophectoderm of the bovine blastocyst 
The first distinct differentiation event in mammals occurs at the blastocyst stage when totipotent blastomeres differentiate into either pluripotent inner cell mass (ICM) or multipotent trophectoderm (TE). Here we determined, for the first time, global gene expression patterns in the ICM and TE isolated from bovine blastocysts. The ICM and TE were isolated from blastocysts harvested at day 8 after insemination by magnetic activated cell sorting, and cDNA sequenced using the SOLiD 4.0 system.
A total of 870 genes were differentially expressed between ICM and TE. Several genes characteristic of ICM (for example, NANOG, SOX2, and STAT3) and TE (ELF5, GATA3, and KRT18) in mouse and human showed similar patterns in bovine. Other genes, however, showed differences in expression between ICM and TE that deviate from those expected based on mouse and human.
Analysis of gene expression indicated that differentiation of blastomeres of the morula-stage embryo into the ICM and TE of the blastocyst is accompanied by differences between the two cell lineages in expression of genes controlling metabolic processes, endocytosis, hatching from the zona pellucida, paracrine and endocrine signaling with the mother, and genes supporting the changes in cellular architecture, stemness, and hematopoiesis necessary for development of the trophoblast.
PMCID: PMC3514149  PMID: 23126590
Blastocyst; Trophectoderm; Inner cell mass; Development
13.  Genome-Wide Analysis of DNA Methylation and Expression of MicroRNAs in Breast Cancer Cells 
DNA methylation of promoters is linked to transcriptional silencing of protein-coding genes, and its alteration plays important roles in cancer formation. For example, hypermethylation of tumor suppressor genes has been seen in some cancers. Alteration of methylation in the promoters of microRNAs (miRNAs) has also been linked to transcriptional changes in cancers; however, no systematic studies of methylation and transcription of miRNAs have been reported. In the present study, to clarify the relation between DNA methylation and transcription of miRNAs, next-generation sequencing and microarrays were used to analyze the methylation and expression of miRNAs, protein-coding genes, other non-coding RNAs (ncRNAs), and pseudogenes in the human breast cancer cell lines MCF7 and the adriamycin (ADR) resistant cell line MCF7/ADR. DNA methylation in the proximal promoter of miRNAs is tightly linked to transcriptional silencing, as it is with protein-coding genes. In protein-coding genes, highly expressed genes have CpG-rich proximal promoters whereas weakly expressed genes do not. This is only rarely observed in other gene categories, including miRNAs. The present study highlights the epigenetic similarities and differences between miRNA and protein-coding genes.
PMCID: PMC3430232  PMID: 22942701
DNA methylation; microRNA; cancer
14.  Assessing the utility of gene co-expression stability in combination with correlation in the analysis of protein-protein interaction networks 
BMC Genomics  2011;12(Suppl 3):S19.
Gene co-expression, in the form of a correlation coefficient, has been valuable in the analysis, classification and prediction of protein-protein interactions. However, it is susceptible to bias from a few samples having a large effect on the correlation coefficient. Gene co-expression stability is a means of quantifying this bias, with high stability indicating robust, unbiased co-expression correlation coefficients. We assess the utility of gene co-expression stability as an additional measure to support the co-expression correlation in the analysis of protein-protein interaction networks.
We studied the patterns of co-expression correlation and stability in interacting proteins with respect to their interaction promiscuity, levels of intrinsic disorder, and essentiality or disease-relatedness. Co-expression stability, along with co-expression correlation, acts as a better classifier of hub proteins in interaction networks, than co-expression correlation alone, enabling the identification of a class of hubs that are functionally distinct from the widely accepted transient (date) and obligate (party) hubs. Proteins with high levels of intrinsic disorder have low co-expression correlation and high stability with their interaction partners suggesting their involvement in transient interactions, except for a small group that have high co-expression correlation and are typically subunits of stable complexes. Similar behavior was seen for disease-related and essential genes. Interacting proteins that are both disordered have higher co-expression stability than ordered protein pairs. Using co-expression correlation and stability, we found that transient interactions are more likely to occur between an ordered and a disordered protein while obligate interactions primarily occur between proteins that are either both ordered, or disordered.
We observe that co-expression stability shows distinct patterns in structurally and functionally different groups of proteins and interactions. We conclude that it is a useful and important measure to be used in concert with gene co-expression correlation for further insights into the characteristics of proteins in the context of their interaction network.
PMCID: PMC3333178  PMID: 22369639
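The sensitivity of a correlation coefficient to a few samples can be illustrated with a leave-one-out check. The spread measure below is an assumption chosen for illustration, not the stability definition used in the study, which the abstract does not give; a small spread suggests a correlation that does not depend on any single sample.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def leave_one_out_spread(x, y):
    """Range of correlations obtained when each sample is dropped in turn."""
    values = [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
              for i in range(len(x))]
    return max(values) - min(values)
```

For a gene pair whose high correlation is driven by one extreme sample, dropping that sample changes the coefficient sharply, so the spread is large even though the full-data correlation looks strong.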
15.  Profiling ascidian promoters as the primordial type of vertebrate promoter 
BMC Genomics  2011;12(Suppl 3):S7.
CpG islands are observed in mammals and other vertebrates, generally escape DNA methylation, and tend to occur in the promoters of widely expressed genes. Another class of promoter has lower G+C and CpG contents, and is thought to be involved in the spatiotemporal regulation of gene expression. Non-vertebrate deuterostomes are reported to have a single class of promoter with high-frequency CpG dinucleotides, suggesting that this is the original type of promoter. However, the limited annotation of these genes has impeded the large-scale analysis of their promoters.
To determine the origins of the two classes of vertebrate promoters, we chose Ciona intestinalis, an invertebrate that is evolutionarily close to the vertebrates, and identified its transcription start sites genome-wide using a next-generation sequencer. We indeed observed a high CpG content around the transcription start sites, but their levels in the promoters and background sequences differed much less than in mammals. The CpG-rich stretches were also fairly restricted, so they appeared more similar to mammalian CpG-poor promoters.
From these data, we infer that CpG islands are not sufficiently ancient to be found in invertebrates. They probably appeared early in vertebrate evolution via some active mechanism and have since been maintained as part of vertebrate promoters.
PMCID: PMC3333190  PMID: 22369359
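CpG content around a transcription start site is commonly summarized by the G+C fraction and the observed/expected CpG ratio. The sketch below uses the widely used ratio (number of CpG dinucleotides times sequence length, divided by the product of C and G counts); whether the study used exactly this measure is not stated in the abstract, so treat it as an assumption.

```python
def cpg_stats(sequence):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = sequence.upper()
    n = len(seq)
    c_count, g_count = seq.count("C"), seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count is exact
    gc_fraction = (c_count + g_count) / n if n else 0.0
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp
```

Comparing these values between promoter windows and background sequence gives the kind of contrast the abstract describes, where the promoter-to-background difference is smaller in Ciona intestinalis than in mammals.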
16.  DBTSS: DataBase of Transcriptional Start Sites progress report in 2012 
Nucleic Acids Research  2011;40(D1):D150-D154.
To support transcriptional regulation studies, we have constructed DBTSS (DataBase of Transcriptional Start Sites), which contains exact positions of transcriptional start sites (TSSs), determined with our own technique named TSS-seq, in the genomes of various species. In its latest version, DBTSS covers the data of the majority of human adult and embryonic tissues: it now contains 418 million TSS tag sequences from 28 tissues/cell cultures. Moreover, we integrated a series of our own transcriptomic data, such as the RNA-seq data of subcellular-fractionated RNAs as well as the ChIP-seq data of histone modifications and the binding of RNA polymerase II/several transcription factors in cultured cell lines into our original TSS information. We also included several external epigenomic data, such as the chromatin map of the ENCODE project. We further associated our TSS information with public or original single-nucleotide variation (SNV) data, in order to identify SNVs in the regulatory regions. These data can be browsed in our new viewer, which supports versatile search conditions of users. We believe that our new DBTSS will be an invaluable resource for interpreting the differential uses of TSSs and for identifying human genetic variations that are associated with disordered transcriptional regulation. DBTSS can be accessed at
PMCID: PMC3245115  PMID: 22086958
17.  Predicting promoter activities of primary human DNA sequences 
Nucleic Acids Research  2011;39(11):e75.
We developed a computer program that can predict the intrinsic promoter activities of primary human DNA sequences. We observed promoter activity using a quantitative luciferase assay and generated a prediction model using multiple linear regression. Our program achieved a correlation coefficient of 0.87 between the predicted and observed promoter activities. We evaluated the prediction accuracy of the program using massive sequencing analysis of transcriptional start sites in vivo. We found that it is still difficult to predict transcript levels in a strictly quantitative manner in vivo; however, it was possible to distinguish active promoters in a given cell from silent promoters. Using this program, we analyzed the transcriptional landscape of the entire human genome. We demonstrate that many human genomic regions have potential promoter activity, and the expression of some previously uncharacterized putatively non-protein-coding transcripts can be explained by our prediction model. Furthermore, we found that nucleosomes occasionally formed open chromatin structures with RNA polymerase II recruitment where the program predicted significant promoter activities, although no transcripts were observed.
PMCID: PMC3113590  PMID: 21486745
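As an illustration of the approach summarized in entry 17, the following is a minimal sketch of fitting a multiple linear regression and scoring it by the correlation between predicted and observed values. It is not the authors' program: the feature columns are hypothetical placeholders (the abstract does not name the sequence features used), and the helper names `fit_linear`, `predict` and `pearson` are assumptions made for this example.

```python
def fit_linear(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations.

    X: list of rows, each a list of (hypothetical) sequence-feature values.
    y: list of observed activities (e.g. luciferase readouts).
    Returns [intercept, coef_1, ..., coef_p].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]


def predict(coef, X):
    """Apply fitted coefficients to new feature rows."""
    return [coef[0] + sum(c * x for c, x in zip(coef[1:], r)) for r in X]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5
```

In practice the correlation would be computed on held-out sequences rather than on the training data; `pearson(predict(coef, X_test), y_test)` is the kind of figure the abstract reports as 0.87.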
18.  A regression analysis of gene expression in ES cells reveals two gene classes that are significantly different in epigenetic patterns 
BMC Bioinformatics  2011;12(Suppl 1):S50.
Understanding the gene regulatory system that governs the self-renewal and pluripotency of embryonic stem cells (ESCs) is an important step for promoting regenerative medicine. Although the roles of several core transcription factors (TFs), such as Oct4, Sox2 and Nanog, have been intensively investigated, the details of their involvement in genome-wide gene regulation are still not well clarified.
We constructed a predictive model of genome-wide gene expression in mouse ESCs from publicly available ChIP-seq data of 12 core TFs. The tag sequences were remapped on the genome by various alignment tools. Then, the binding density of each TF was calculated from the genome-wide bona fide TF binding sites. The TF-binding data were combined with the data of several epigenetic states (DNA methylation, several histone modifications, and CpG island) of promoter regions. These data as well as the ordinary peak intensity data were used as predictors of a simple linear regression model that predicts absolute gene expression. We also developed a pipeline for analyzing the effects of predictors and their interactions.
Through our analysis, we identified two classes of genes that are either well explained or inefficiently explained by our model. The latter class seems to be genes that are not directly regulated by the core TFs. The regulatory regions of these gene classes show apparently distinct patterns of DNA methylation, histone modifications, existence of CpG islands, and gene ontology terms, suggesting the relative importance of epigenetic effects. Furthermore, we identified statistically significant TF interactions correlated with the epigenetic modification patterns.
Here, we proposed an improved prediction method for explaining ESC-specific gene expression. Our study implies that the majority of genes are more or less directly regulated by the core TFs. In addition, our result is consistent with the general idea of the relative importance of epigenetic effects in ESCs.
PMCID: PMC3044308  PMID: 21342583
19.  Challenges of the next decade for the Asia Pacific region: 2010 International Conference in Bioinformatics (InCoB 2010) 
BMC Genomics  2010;11(Suppl 4):S1.
The 2010 annual conference of the Asia Pacific Bioinformatics Network (APBioNet), Asia’s oldest bioinformatics organisation formed in 1998, was organized as the 9th International Conference on Bioinformatics (InCoB), Sept. 26-28, 2010 in Tokyo, Japan. Initially, APBioNet created InCoB as a forum to foster bioinformatics in the Asia Pacific region. Given the growing importance of interdisciplinary research, InCoB2010 included topics targeting scientists in the fields of genomic medicine, immunology and chemoinformatics, supporting translational research. Peer-reviewed manuscripts that were accepted for publication in this supplement represent key areas of research interest that have emerged in our region. We also highlight some of the current challenges bioinformatics is facing in the Asia Pacific region and conclude our report with the announcement of APBioNet’s 100 BioDatabases (BioDB100) initiative. BioDB100 will comply with the database criteria set out earlier in our proposal for Minimum Information about a Bioinformatics Investigation (MIABi), setting the standards for biocuration and bioinformatics research, on which we will report at the next InCoB, Nov. 27 – Dec. 2, 2011 at Kuala Lumpur, Malaysia.
PMCID: PMC3005919  PMID: 21143792
20.  Cross-validated methods for promoter/transcription start site mapping in SL trans-spliced genes, established using the Ciona intestinalis troponin I gene 
Nucleic Acids Research  2010;39(7):2638-2648.
In conventionally-expressed eukaryotic genes, transcription start sites (TSSs) can be identified by mapping the mature mRNA 5′-terminal sequence onto the genome. However, this approach is not applicable to genes that undergo pre-mRNA 5′-leader trans-splicing (SL trans-splicing) because the original 5′-segment of the primary transcript is replaced by the spliced leader sequence during the trans-splicing reaction and is discarded. Thus TSS mapping for trans-spliced genes requires different approaches. We describe two such approaches and show that they generate precisely agreeing results for an SL trans-spliced gene encoding the muscle protein troponin I in the ascidian tunicate chordate Ciona intestinalis. One method is based on experimental deletion of trans-splice acceptor sites and the other is based on high-throughput mRNA 5′-RACE sequence analysis of natural RNA populations in order to detect minor transcripts containing the pre-mRNA’s original 5′-end. Both methods identified a single major troponin I TSS located ∼460 nt upstream of the trans-splice acceptor site. Further experimental analysis identified a functionally important TATA element 31 nt upstream of the start site. The two methods employed have complementary strengths and are broadly applicable to mapping promoters/TSSs for trans-spliced genes in tunicates and in trans-splicing organisms from other phyla.
PMCID: PMC3074122  PMID: 21109525
21.  InCoB2010 - 9th International Conference on Bioinformatics at Tokyo, Japan, September 26-28, 2010 
BMC Bioinformatics  2010;11(Suppl 7):S1.
The International Conference on Bioinformatics (InCoB), the annual conference of the Asia-Pacific Bioinformatics Network (APBioNet), is hosted in one of the countries of the Asia-Pacific region. The 2010 conference was awarded to Japan and attracted more than one hundred high-quality research paper submissions. Thorough peer reviewing resulted in 47 (43.5%) accepted papers out of 108 submissions. Submissions from Japan, R.O. Korea, P.R. China, Australia, Singapore and U.S.A. totaled 43.8% and contributed to 57.4% of accepted papers. Manuscripts originating from Taiwan and India added up to 42.8% of submissions and 28.3% of acceptances. The fifteen articles published in this BMC Bioinformatics supplement cover disease informatics, structural bioinformatics and drug design, biological databases and software tools, signaling pathways, gene regulatory and biochemical networks, evolution and sequence analysis.
PMCID: PMC2957677  PMID: 21106116
22.  Gradual transition from mosaic to global DNA methylation patterns during deuterostome evolution 
BMC Bioinformatics  2010;11(Suppl 7):S2.
DNA methylation by the Dnmt family occurs in vertebrates and invertebrates, including ascidians, and is thought to play important roles in gene regulation and genome stability, especially in vertebrates. However, the global methylation patterns of vertebrates and invertebrates are distinctive. Whereas almost all CpG sites are methylated in vertebrates, with the exception of those in CpG islands, the ascidian genome contains approximately equal amounts of methylated and unmethylated regions. Curiously, methylation status can be reliably estimated from the local frequency of CpG dinucleotides in the ascidian genome. Methylated and unmethylated regions tend to have few and many CpG sites, respectively, consistent with our knowledge of the methylation status of CpG islands and other regions in mammals. However, DNA methylation patterns and levels in vertebrates and invertebrates have not been analyzed in the same way.
Using a new computational methodology based on the decomposition of the bimodal distributions of methylated and unmethylated regions, we estimated the extent of the global methylation patterns in a wide range of animals. We then examined the epigenetic changes in silico along the phylogenetic tree. We observed a gradual transition from fractional to global patterns of methylation in deuterostomes, rather than a clear demarcation between vertebrates and invertebrates. When we applied this methodology to six piscine genomes, some of them showed features similar to those of invertebrates.
The mammalian global DNA methylation pattern was probably not acquired at an early stage of vertebrate evolution, but gradually expanded from that of a more ancient organism.
PMCID: PMC2957685  PMID: 21106124
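Entry 22 notes that methylation status can be estimated from the local frequency of CpG dinucleotides, with methylated regions tending to have few CpG sites. A minimal sketch of that idea is the observed/expected CpG ratio of a sequence window. This is a simplification and not the paper's method: the authors decompose a bimodal distribution, whereas this example applies a fixed cutoff. The function names and the `threshold=0.6` value are assumptions chosen only for illustration.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Returns 0.0 when the window contains no C or no G.
    """
    s = seq.upper()
    c = s.count("C")
    g = s.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number of sites.
    return s.count("CG") * len(s) / (c * g)


def likely_methylated(seq: str, threshold: float = 0.6) -> bool:
    """Heuristic call: CpG-depleted windows are treated as likely methylated.

    The threshold is a hypothetical cutoff, not a value from the paper.
    """
    return cpg_obs_exp(seq) < threshold
```

For example, `cpg_obs_exp("CCCCGGGG")` evaluates to 0.5 (one CpG where two would be expected), so that window would be called likely methylated under the assumed cutoff.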
23.  HitPredict: a database of quality assessed protein–protein interactions in nine species 
Nucleic Acids Research  2010;39(Database issue):D744-D749.
Despite the availability of a large number of protein–protein interactions (PPIs) in several species, researchers are often limited to using very small subsets in a few organisms due to the high prevalence of spurious interactions. In spite of the importance of quality assessment of experimentally determined PPIs, a surprisingly small number of databases provide interactions with scores and confidence levels. We introduce HitPredict, a database with quality-assessed PPIs in nine species. HitPredict assigns a confidence level to interactions based on a reliability score that is computed using evidence from sequence, structure and functional annotations of the interacting proteins. HitPredict was first released in 2005 and is updated annually. The current release contains 36 930 proteins with 176 983 non-redundant, physical interactions, of which 116 198 (66%) are predicted to be of high confidence.
PMCID: PMC3013773  PMID: 20947562
24.  Effects of Alu elements on global nucleosome positioning in the human genome 
BMC Genomics  2010;11:309.
Understanding the genome sequence-specific positioning of nucleosomes is essential to understand various cellular processes, such as transcriptional regulation and replication. As a typical example, the 10-bp periodicity of AA/TT and GC dinucleotides has been reported in several species, but it is still unclear whether this feature can be observed in the whole genomes of all eukaryotes.
With Fourier analysis, we found that this is not the case: 84-bp and 167-bp periodicities are prevalent in primates. The 167-bp periodicity is intriguing because it is almost equal to the sum of the lengths of a nucleosomal unit and its linker region. After masking Alu elements, these periodicities were greatly diminished. Next, using two independent large-scale sets of nucleosome mapping data, we analyzed the distribution of nucleosomes in the vicinity of Alu elements and showed that (1) there are one or two fixed slot(s) for nucleosome positioning within the Alu element and (2) the positioning of neighboring nucleosomes seems to be in phase, more or less, with the presence of Alu elements. Furthermore, (3) these effects of Alu elements on nucleosome positioning are consistent with inactivation of promoter activity in Alu elements.
Our discoveries suggest that the principle governing nucleosome positioning differs greatly across species and that the Alu family is an important factor in primate genomes.
PMCID: PMC2878307  PMID: 20478020
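The Fourier analysis mentioned in entry 24 can be sketched as follows: mark each position where a dinucleotide of interest starts, then measure the spectral power of that indicator signal at a chosen period. This is an illustrative sketch only; the function names are hypothetical, and the normalization (mean-centered signal, power divided by signal length) is an assumption rather than the authors' procedure.

```python
import cmath


def dinucleotide_signal(seq, dinucs=("AA", "TT")):
    """Indicator signal: 1.0 where one of the given dinucleotides starts."""
    s = seq.upper()
    return [1.0 if s[i:i + 2] in dinucs else 0.0 for i in range(len(s) - 1)]


def power_at_period(signal, period):
    """Spectral power of the mean-centered signal at the given period (in bp)."""
    n = len(signal)
    mean = sum(signal) / n
    acc = sum(
        (x - mean) * cmath.exp(-2j * cmath.pi * k / period)
        for k, x in enumerate(signal)
    )
    return abs(acc) ** 2 / n
```

Scanning `power_at_period` over a range of periods (for instance 2 to 200 bp) and looking for peaks is one way to check for the 10-bp, 84-bp or 167-bp periodicities discussed in the abstract.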
25.  Characterization of Transcription Start Sites of Putative Non-coding RNAs by Multifaceted Use of Massively Paralleled Sequencer 
On the basis of integrated transcriptome analysis, we show that not all transcriptional start site clusters (TSCs) in the intergenic regions (iTSCs) have the same properties; thus, it is possible to discriminate the iTSCs that are likely to have biological relevance from the other noise-level iTSCs. We used a total of 251 933 381 short-read sequence tags generated from various types of transcriptome analyses in order to characterize 6039 iTSCs, which have significant expression levels. We found that 23% of these iTSCs were located in the proximal regions of the RefSeq genes. These RefSeq-linked iTSCs showed expression patterns similar to those of the neighboring RefSeq genes, had widely fluctuating transcription start sites and lacked ordered nucleosome positioning. These iTSCs seemed not to form independent transcriptional units, simply representing the by-products of the neighboring RefSeq genes, in spite of their significant expression levels. Similar features were also observed for the TSCs located in the antisense regions of the RefSeq genes. Furthermore, for the remaining iTSCs that were not associated with any RefSeq genes, we demonstrate that integrative interpretation of the transcriptome data provides essential information to specify their biological functions in the hypoxic responses of the cells.
PMCID: PMC2885271  PMID: 20400770
non-coding RNA; integrated transcriptome analysis; transcriptional start site cluster (TSC); intergenic transcript; antisense transcript
