The first distinct differentiation event in mammals occurs at the blastocyst stage when totipotent blastomeres differentiate into either pluripotent inner cell mass (ICM) or multipotent trophectoderm (TE). Here we determined, for the first time, global gene expression patterns in the ICM and TE isolated from bovine blastocysts. The ICM and TE were isolated from blastocysts harvested at day 8 after insemination by magnetic activated cell sorting, and cDNA sequenced using the SOLiD 4.0 system.
A total of 870 genes were differentially expressed between ICM and TE. Several genes characteristic of ICM (for example, NANOG, SOX2, and STAT3) and TE (ELF5, GATA3, and KRT18) in mouse and human showed similar patterns in bovine. Other genes, however, showed differences in expression between ICM and TE that deviate from those expected based on mouse and human.
Analysis of gene expression indicated that differentiation of blastomeres of the morula-stage embryo into the ICM and TE of the blastocyst is accompanied by differences between the two cell lineages in expression of genes controlling metabolic processes, endocytosis, hatching from the zona pellucida, paracrine and endocrine signaling with the mother, and genes supporting the changes in cellular architecture, stemness, and hematopoiesis necessary for development of the trophoblast.
Blastocyst; Trophectoderm; Inner cell mass; Development
DNA methylation of promoters is linked to transcriptional silencing of protein-coding genes, and its alteration plays important roles in cancer formation. For example, hypermethylation of tumor suppressor genes has been seen in some cancers. Alteration of methylation in the promoters of microRNAs (miRNAs) has also been linked to transcriptional changes in cancers; however, no systematic studies of methylation and transcription of miRNAs have been reported. In the present study, to clarify the relation between DNA methylation and transcription of miRNAs, next-generation sequencing and microarrays were used to analyze the methylation and expression of miRNAs, protein-coding genes, other non-coding RNAs (ncRNAs), and pseudogenes in the human breast cancer cell lines MCF7 and the adriamycin (ADR) resistant cell line MCF7/ADR. DNA methylation in the proximal promoter of miRNAs is tightly linked to transcriptional silencing, as it is with protein-coding genes. In protein-coding genes, highly expressed genes have CpG-rich proximal promoters whereas weakly expressed genes do not. This is only rarely observed in other gene categories, including miRNAs. The present study highlights the epigenetic similarities and differences between miRNA and protein-coding genes.
DNA methylation; microRNA; cancer
Gene co-expression, in the form of a correlation coefficient, has been valuable in the analysis, classification and prediction of protein-protein interactions. However, it is susceptible to bias from a few samples having a large effect on the correlation coefficient. Gene co-expression stability is a means of quantifying this bias, with high stability indicating robust, unbiased co-expression correlation coefficients. We assess the utility of gene co-expression stability as an additional measure to support the co-expression correlation in the analysis of protein-protein interaction networks.
We studied the patterns of co-expression correlation and stability in interacting proteins with respect to their interaction promiscuity, levels of intrinsic disorder, and essentiality or disease-relatedness. Co-expression stability, along with co-expression correlation, acts as a better classifier of hub proteins in interaction networks than co-expression correlation alone, enabling the identification of a class of hubs that are functionally distinct from the widely accepted transient (date) and obligate (party) hubs. Proteins with high levels of intrinsic disorder have low co-expression correlation and high stability with their interaction partners, suggesting their involvement in transient interactions, except for a small group that have high co-expression correlation and are typically subunits of stable complexes. Similar behavior was seen for disease-related and essential genes. Interacting proteins that are both disordered have higher co-expression stability than ordered protein pairs. Using co-expression correlation and stability, we found that transient interactions are more likely to occur between an ordered and a disordered protein, while obligate interactions primarily occur between proteins that are either both ordered or both disordered.
We observe that co-expression stability shows distinct patterns in structurally and functionally different groups of proteins and interactions. We conclude that it is a useful and important measure to be used in concert with gene co-expression correlation for further insights into the characteristics of proteins in the context of their interaction network.
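The sensitivity of a correlation coefficient to a few influential samples can be illustrated with a minimal sketch. The leave-one-out procedure below is an assumption chosen for illustration; the abstract does not give the formula for co-expression stability, and the function names and toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def leave_one_out_range(x, y):
    """Smallest and largest correlation obtained when one sample is dropped.

    A wide range indicates that single samples dominate the coefficient.
    This is an illustrative proxy, not the stability measure of the study.
    """
    rs = [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]
    return min(rs), max(rs)

# Toy expression profiles of two genes; the last sample is an outlier.
gene_a = [1, 2, 3, 4, 5, 20]
gene_b = [2, 1, 2, 1, 2, 20]
full_r = pearson(gene_a, gene_b)
lowest_r, highest_r = leave_one_out_range(gene_a, gene_b)
```

In this toy case the full correlation is high, yet it collapses when the single outlying sample is removed, which is the kind of bias a stability measure is meant to expose.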
CpG islands are observed in mammals and other vertebrates, generally escape DNA methylation, and tend to occur in the promoters of widely expressed genes. Another class of promoter has lower G+C and CpG contents, and is thought to be involved in the spatiotemporal regulation of gene expression. Non-vertebrate deuterostomes are reported to have a single class of promoter with high-frequency CpG dinucleotides, suggesting that this is the original type of promoter. However, the limited annotation of these genes has impeded the large-scale analysis of their promoters.
To determine the origins of the two classes of vertebrate promoters, we chose Ciona intestinalis, an invertebrate that is evolutionarily close to the vertebrates, and identified its transcription start sites genome-wide using a next-generation sequencer. We indeed observed a high CpG content around the transcription start sites, but their levels in the promoters and background sequences differed much less than in mammals. The CpG-rich stretches were also fairly restricted, so they appeared more similar to mammalian CpG-poor promoters.
From these data, we infer that CpG islands are not sufficiently ancient to be found in invertebrates. They probably appeared early in vertebrate evolution via some active mechanism and have since been maintained as part of vertebrate promoters.
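As a rough illustration of how CpG enrichment in a sequence window is commonly quantified, the sketch below computes the G+C fraction and the CpG observed/expected ratio. This widely used ratio is assumed here for illustration; the abstract does not state which measure the authors applied, and the function name is hypothetical.

```python
def cpg_stats(seq):
    """Return (G+C fraction, CpG observed/expected ratio) for a DNA string.

    Observed/expected = (number of CpG * length) / (number of C * number of G).
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

# Example: compare a CpG-dense window with an A/T-only window.
rich = cpg_stats("CGCGCGCG")
poor = cpg_stats("AATTAATT")
```

Comparing this ratio between promoter windows and background windows is one way to express how strongly the two differ.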
To support transcriptional regulation studies, we have constructed DBTSS (DataBase of Transcriptional Start Sites), which contains exact positions of transcriptional start sites (TSSs), determined with our own technique named TSS-seq, in the genomes of various species. In its latest version, DBTSS covers the data of the majority of human adult and embryonic tissues: it now contains 418 million TSS tag sequences from 28 tissues/cell cultures. Moreover, we integrated a series of our own transcriptomic data, such as the RNA-seq data of subcellular-fractionated RNAs as well as the ChIP-seq data of histone modifications and the binding of RNA polymerase II/several transcription factors in cultured cell lines into our original TSS information. We also included several external epigenomic data, such as the chromatin map of the ENCODE project. We further associated our TSS information with public or original single-nucleotide variation (SNV) data, in order to identify SNVs in the regulatory regions. These data can be browsed in our new viewer, which supports versatile search conditions of users. We believe that our new DBTSS will be an invaluable resource for interpreting the differential uses of TSSs and for identifying human genetic variations that are associated with disordered transcriptional regulation. DBTSS can be accessed at http://dbtss.hgc.jp.
We developed a computer program that can predict the intrinsic promoter activities of primary human DNA sequences. We observed promoter activity using a quantitative luciferase assay and generated a prediction model using multiple linear regression. Our program achieved a prediction accuracy correlation coefficient of 0.87 between the predicted and observed promoter activities. We evaluated the prediction accuracy of the program using massive sequencing analysis of transcriptional start sites in vivo. We found that it is still difficult to predict transcript levels in a strictly quantitative manner in vivo; however, it was possible to select active promoters in a given cell from the other silent promoters. Using this program, we analyzed the transcriptional landscape of the entire human genome. We demonstrate that many human genomic regions have potential promoter activity, and the expression of some previously uncharacterized putatively non-protein-coding transcripts can be explained by our prediction model. Furthermore, we found that nucleosomes occasionally formed open chromatin structures with RNA polymerase II recruitment where the program predicted significant promoter activities, although no transcripts were observed.
Understanding the gene regulatory system that governs the self-renewal and pluripotency of embryonic stem cells (ESCs) is an important step for promoting regenerative medicine. Although the role of several core transcription factors (TFs), such as Oct4, Sox2 and Nanog, has been intensively investigated, details of their involvement in genome-wide gene regulation are still not well clarified.
We constructed a predictive model of genome-wide gene expression in mouse ESCs from publicly available ChIP-seq data of 12 core TFs. The tag sequences were remapped on the genome by various alignment tools. Then, the binding density of each TF was calculated from the genome-wide bona fide TF binding sites. The TF-binding data were combined with data on several epigenetic states (DNA methylation, several histone modifications, and CpG islands) of promoter regions. These data, as well as the ordinary peak intensity data, were used as predictors of a simple linear regression model that predicts absolute gene expression. We also developed a pipeline for analyzing the effects of predictors and their interactions.
Through our analysis, we identified two classes of genes that are either well explained or inefficiently explained by our model. The latter class seems to be genes that are not directly regulated by the core TFs. The regulatory regions of these gene classes show apparently distinct patterns of DNA methylation, histone modifications, existence of CpG islands, and gene ontology terms, suggesting the relative importance of epigenetic effects. Furthermore, we identified statistically significant TF interactions correlated with the epigenetic modification patterns.
Here, we proposed an improved prediction method for explaining ESC-specific gene expression. Our study implies that the majority of genes are more or less directly regulated by the core TFs. In addition, our result is consistent with the general idea of the relative importance of epigenetic effects in ESCs.
The 2010 annual conference of the Asia Pacific Bioinformatics Network (APBioNet), Asia’s oldest bioinformatics organisation formed in 1998, was organized as the 9th International Conference on Bioinformatics (InCoB), Sept. 26-28, 2010 in Tokyo, Japan. Initially, APBioNet created InCoB as a forum to foster bioinformatics in the Asia Pacific region. Given the growing importance of interdisciplinary research, InCoB2010 included topics targeting scientists in the fields of genomic medicine, immunology and chemoinformatics, supporting translational research. Peer-reviewed manuscripts that were accepted for publication in this supplement represent key areas of research interest that have emerged in our region. We also highlight some of the current challenges bioinformatics is facing in the Asia Pacific region and conclude our report with the announcement of APBioNet’s 100 BioDatabases (BioDB100) initiative. BioDB100 will comply with the database criteria set out earlier in our proposal for Minimum Information about a Bioinformatics Investigation (MIABi), setting the standards for biocuration and bioinformatics research, on which we will report at the next InCoB, Nov. 27 – Dec. 2, 2011 at Kuala Lumpur, Malaysia.
In conventionally-expressed eukaryotic genes, transcription start sites (TSSs) can be identified by mapping the mature mRNA 5′-terminal sequence onto the genome. However, this approach is not applicable to genes that undergo pre-mRNA 5′-leader trans-splicing (SL trans-splicing) because the original 5′-segment of the primary transcript is replaced by the spliced leader sequence during the trans-splicing reaction and is discarded. Thus TSS mapping for trans-spliced genes requires different approaches. We describe two such approaches and show that they generate precisely agreeing results for an SL trans-spliced gene encoding the muscle protein troponin I in the ascidian tunicate chordate Ciona intestinalis. One method is based on experimental deletion of trans-splice acceptor sites and the other is based on high-throughput mRNA 5′-RACE sequence analysis of natural RNA populations in order to detect minor transcripts containing the pre-mRNA’s original 5′-end. Both methods identified a single major troponin I TSS located ∼460 nt upstream of the trans-splice acceptor site. Further experimental analysis identified a functionally important TATA element 31 nt upstream of the start site. The two methods employed have complementary strengths and are broadly applicable to mapping promoters/TSSs for trans-spliced genes in tunicates and in trans-splicing organisms from other phyla.
The International Conference on Bioinformatics (InCoB), the annual conference of the Asia-Pacific Bioinformatics Network (APBioNet), is hosted in one of the countries of the Asia-Pacific region. The 2010 conference was awarded to Japan and attracted more than one hundred high-quality research paper submissions. Thorough peer reviewing resulted in 47 (43.5%) accepted papers out of 108 submissions. Submissions from Japan, R.O. Korea, P.R. China, Australia, Singapore and U.S.A totaled 43.8% and contributed to 57.4% of accepted papers. Manuscripts originating from Taiwan and India added up to 42.8% of submissions and 28.3% of acceptances. The fifteen articles published in this BMC Bioinformatics supplement cover disease informatics, structural bioinformatics and drug design, biological databases and software tools, signaling pathways, gene regulatory and biochemical networks, evolution and sequence analysis.
DNA methylation by the Dnmt family occurs in vertebrates and invertebrates, including ascidians, and is thought to play important roles in gene regulation and genome stability, especially in vertebrates. However, the global methylation patterns of vertebrates and invertebrates are distinctive. Whereas almost all CpG sites are methylated in vertebrates, with the exception of those in CpG islands, the ascidian genome contains approximately equal amounts of methylated and unmethylated regions. Curiously, methylation status can be reliably estimated from the local frequency of CpG dinucleotides in the ascidian genome. Methylated and unmethylated regions tend to have few and many CpG sites, respectively, consistent with our knowledge of the methylation status of CpG islands and other regions in mammals. However, DNA methylation patterns and levels in vertebrates and invertebrates have not been analyzed in the same way.
Using a new computational methodology based on the decomposition of the bimodal distributions of methylated and unmethylated regions, we estimated the extent of the global methylation patterns in a wide range of animals. We then examined the epigenetic changes in silico along the phylogenetic tree. We observed a gradual transition from fractional to global patterns of methylation in deuterostomes, rather than a clear demarcation between vertebrates and invertebrates. When we applied this methodology to six piscine genomes, some of them showed features similar to those of invertebrates.
The mammalian global DNA methylation pattern was probably not acquired at an early stage of vertebrate evolution, but gradually expanded from that of a more ancient organism.
Despite the availability of a large number of protein–protein interactions (PPIs) in several species, researchers are often limited to using very small subsets in a few organisms due to the high prevalence of spurious interactions. In spite of the importance of quality assessment of experimentally determined PPIs, a surprisingly small number of databases provide interactions with scores and confidence levels. We introduce HitPredict (http://hintdb.hgc.jp/htp/), a database with quality assessed PPIs in nine species. HitPredict assigns a confidence level to interactions based on a reliability score that is computed using evidence from sequence, structure and functional annotations of the interacting proteins. HitPredict was first released in 2005 and is updated annually. The current release contains 36 930 proteins with 176 983 non-redundant, physical interactions, of which 116 198 (66%) are predicted to be of high confidence.
Understanding the genome sequence-specific positioning of nucleosomes is essential to understand various cellular processes, such as transcriptional regulation and replication. As a typical example, the 10-bp periodicity of AA/TT and GC dinucleotides has been reported in several species, but it is still unclear whether this feature can be observed in the whole genomes of all eukaryotes.
With Fourier analysis, we found that this is not the case: 84-bp and 167-bp periodicities are prevalent in primates. The 167-bp periodicity is intriguing because it is almost equal to the sum of the lengths of a nucleosomal unit and its linker region. After masking Alu elements, these periodicities were greatly diminished. Next, using two independent large-scale sets of nucleosome mapping data, we analyzed the distribution of nucleosomes in the vicinity of Alu elements and showed that (1) there are one or two fixed slot(s) for nucleosome positioning within the Alu element and (2) the positioning of neighboring nucleosomes seems to be in phase, more or less, with the presence of Alu elements. Furthermore, (3) these effects of Alu elements on nucleosome positioning are consistent with inactivation of promoter activity in Alu elements.
Our discoveries suggest that the principle governing nucleosome positioning differs greatly across species and that the Alu family is an important factor in primate genomes.
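The kind of Fourier analysis referred to above can be sketched as follows: a dinucleotide occurrence track is built along the sequence, and its spectral power is evaluated at a chosen period. This is a minimal illustration under assumed choices (the AA/TT indicator track, mean subtraction, and a single-period discrete Fourier sum); the function names are hypothetical and this is not the authors' pipeline.

```python
import cmath

def dinucleotide_indicator(seq, dinucs=("AA", "TT")):
    """1.0 at each position where one of the dinucleotides starts, else 0.0."""
    s = seq.upper()
    return [1.0 if s[i:i + 2] in dinucs else 0.0 for i in range(len(s) - 1)]

def power_at_period(signal, period):
    """Spectral power of the mean-centred signal at the given period (in bp)."""
    n = len(signal)
    mean = sum(signal) / n
    total = sum((x - mean) * cmath.exp(-2j * cmath.pi * i / period)
                for i, x in enumerate(signal))
    return abs(total) ** 2 / n

# Synthetic sequence with an AA dinucleotide every 10 bp.
toy_seq = ("AA" + "C" * 8) * 50
track = dinucleotide_indicator(toy_seq)
p10 = power_at_period(track, 10)
p7 = power_at_period(track, 7)
```

On the synthetic sequence the power at 10 bp is far larger than at an unrelated period such as 7 bp; scanning a range of periods in the same way would reveal peaks such as those reported at 84 bp and 167 bp.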
On the basis of integrated transcriptome analysis, we show that not all transcriptional start site clusters (TSCs) in the intergenic regions (iTSCs) have the same properties; thus, it is possible to discriminate the iTSCs that are likely to have biological relevance from the other noise-level iTSCs. We used a total of 251 933 381 short-read sequence tags generated from various types of transcriptome analyses in order to characterize 6039 iTSCs, which have significant expression levels. We found that 23% of these iTSCs were located in the proximal regions of the RefSeq genes. These RefSeq-linked iTSCs showed expression patterns similar to those of the neighboring RefSeq genes, had widely fluctuating transcription start sites and lacked ordered nucleosome positioning. These iTSCs seemed not to form independent transcriptional units, simply representing the by-products of the neighboring RefSeq genes, in spite of their significant expression levels. Similar features were also observed for the TSCs located in the antisense regions of the RefSeq genes. Furthermore, for the remaining iTSCs that were not associated with any RefSeq genes, we demonstrate that integrative interpretation of the transcriptome data provides essential information to specify their biological functions in the hypoxic responses of the cells.
non-coding RNA; integrated transcriptome analysis; transcriptional start site cluster (TSC); intergenic transcript; antisense transcript
Although nucleosome remodeling is essential to transcriptional regulation in eukaryotes, little is known about its genome-wide behavior. Since a number of nucleosome positioning maps in vivo have been recently determined, we examined whether their comparison might be used to obtain a genome-wide profile of nucleosome remodeling. Using seven yeast maps, we found that the local variability of nucleosomes, measured by the entropy, was significantly higher in a set of reported unstable nucleosomes. The binding sites of four transcription factors, known as the remodeling factors, were distinctively high both in entropy and linker ratio, whereas those of Yhp1, their potential inhibitor, showed the lowest values in both. Taken together, our map captures the general information of nucleosome dynamics reasonably well. The “nucleosome dynamics” map shows a new significant correlation with the degree of expression variety rather than with expression intensity. Furthermore, the associations with gene function and histone modification are also discussed.
Electronic supplementary material
The online version of this article (doi:10.1007/s00412-010-0264-y) contains supplementary material, which is available to authorized users.
DataBase of Transcription Start Sites (DBTSS) is a database which contains precise positional information for transcription start sites (TSSs) of eukaryotic mRNAs. In this update, we included 330 million new tags generated by massively sequencing the 5′-end of oligo-cap selected cDNAs in humans and mice. The tags were collected from normal fetal or adult human tissues, including brain, thymus, liver, kidney and heart, from 6 human cell lines in 21 diverse growth conditions as well as from mouse NIH3T3 cell line: altogether 31 different cell types or culture conditions are represented. This unprecedented increase in depth of data now allows DBTSS to faithfully represent the dynamically changing landscape of TSSs in different cell types and conditions, during development and in the course of evolution. Differential usage of alternative 5′-ends across cell types and conditions can be viewed in a series of new interfaces. Promoter sequence information is now displayed in a comparative genomics viewer where evolutionary turnover of the TSSs can be evaluated. DBTSS can be accessed at http://dbtss.hgc.jp/.
Sets of genes expressed in the same tissue are believed to be under the regulation of a similar set of transcription factors, and can thus be assumed to contain similar structural patterns in their regulatory regions. Here we present a study of the structural patterns in promoters of genes expressed specifically in 26 human and 34 mouse tissues. For each tissue we constructed promoter structure models, taking into account presences of motifs, their positioning to the transcription start site, and pairwise positioning of motifs. We found that 35 out of 60 models (58%) were able to distinguish positive test promoter sequences from control promoter sequences with statistical significance. Models with high performance include those for liver, skeletal muscle, kidney and tongue. Many of the important structural patterns in these models involve transcription factors of known importance in the tissues in question and structural patterns tend to be conserved between human and mouse. In addition to that, promoter models for related tissues tend to have high inter-tissue performance, indicating that their promoters share common structural patterns. Together, these results illustrate the validity of our models, but also indicate that the promoter structures for some tissues are easier to model than those of others.
Streptococcus mutans is the major pathogen of dental caries, and it occasionally causes infective endocarditis. While the pathogenicity of this species is distinct from other human pathogenic streptococci, the species-specific evolution of the genus Streptococcus and its genomic diversity are poorly understood.
We have sequenced the complete genome of S. mutans serotype c strain NN2025, and compared it with the genome of UA159. The NN2025 genome is composed of 2,013,587 bp, and the two strains show a highly conserved core genome. However, comparison of the two S. mutans strains showed a large genomic inversion across the replication axis producing an X-shaped symmetrical DNA dot plot. This phenomenon was also observed between other streptococcal species, indicating that streptococcal genetic rearrangements across the replication axis play an important role in Streptococcus genetic shuffling. We further confirmed the genomic diversity among 95 clinical isolates using long-PCR analysis. Genomic diversity in S. mutans appears to occur frequently between insertion sequence (IS) elements and transposons, and these diversity regions consist of restriction/modification systems, antimicrobial peptide synthesis systems, and transporters. S. mutans may preferentially reject phage infection by means of clustered regularly interspaced short palindromic repeats (CRISPRs). In particular, the CRISPR-2 region, which is highly divergent between strains, has long repeated spacer sequences in NN2025 that correspond to the streptococcal phage genome.
These observations suggest that S. mutans strains evolve through chromosomal shuffling and that phage infection is not needed for gene acquisition. In contrast, S. pyogenes tolerates phage infection for acquisition of virulence determinants for niche adaptation.
In cancer research, the association between a gene and clinical outcome suggests the underlying etiology of the disease and consequently can motivate further studies. The recent availability of published cancer microarray datasets with clinical annotation provides the opportunity for linking gene expression to prognosis. However, the data are not easy to access and analyze without an effective analysis platform.
To take advantage of public resources in full, a database named "PrognoScan" has been developed. This is 1) a large collection of publicly available cancer microarray datasets with clinical annotation, as well as 2) a tool for assessing the biological relationship between gene expression and prognosis. PrognoScan employs the minimum P-value approach for grouping patients for survival analysis that finds the optimal cutpoint in continuous gene expression measurement without prior biological knowledge or assumption and, as a result, enables systematic meta-analysis of multiple datasets.
PrognoScan provides a powerful platform for evaluating potential tumor markers and therapeutic targets and would accelerate cancer research. The database is publicly accessible at .
Combining our full-length cDNA method and the massively parallel sequencing technology, we developed a simple method to collect precise positional information of transcriptional start sites (TSSs) together with digital information of the gene-expression levels in a high throughput manner. We applied this method to observe gene-expression changes in a colon cancer cell line cultured in normoxic and hypoxic conditions. We generated more than 100 million 36-base TSS-tag sequences and revealed comprehensive features of hypoxia responsive alterations in the transcriptional landscape of the human genome. The features include presence of inducible ‘hot regions’ in 54 genomic regions, 220 novel hypoxia inducible promoters that may drive non-protein-coding transcripts, 191 hypoxia responsive alternative promoters and detailed views of 120 novel as well as known hypoxia responsive genes. We further analyzed hypoxic response of different cells using additional 60 million TSS-tags and found that the degree of the gene-expression changes were different among cell lines, possibly reflecting cellular robustness against hypoxia. The novel dynamic figure of the human gene transcriptome will deepen our understanding of the transcriptional program of the human genome as well as bringing new insights into the biology of cancer cells in hypoxia.
To represent the sequence specificity of transcription factors, the position weight matrix (PWM) is widely used. In most cases, each element is defined as a log likelihood ratio of a base appearing at a certain position, which is estimated from a finite number of known binding sites. To avoid bias due to this small sample size, a certain numeric value, called a pseudocount, is usually allocated for each position, and its fraction according to the background base composition is added to each element. So far, there has been no consensus on the optimal pseudocount value. In this study, we simulated the sampling process by artificially generating binding sites based on observed nucleotide frequencies in a public PWM database, and then the generated matrix with an added pseudocount value was compared to the original frequency matrix using various measures. Although the results were somewhat different between measures, in many cases, we could find an optimal pseudocount value for each matrix. These optimal values are independent of the sample size and are clearly correlated with the entropy of the original matrices, meaning that larger pseudocount values are preferable for less conserved binding sites. As a simple representative, we suggest the value of 0.8 for practical uses.
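The construction described above can be sketched in a few lines. The example assumes a uniform background composition, base-2 logarithms, and hypothetical function names and binding sites; the default pseudocount of 0.8 follows the value suggested in the abstract.

```python
import math

def build_pwm(sites, pseudocount=0.8, background=None):
    """Log-likelihood-ratio PWM from aligned binding sites of equal length.

    For each position, the pseudocount is split among the four bases in
    proportion to the background composition before taking the log ratio.
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    n = len(sites)
    pwm = []
    for pos in range(len(sites[0])):
        column = {}
        for base in "ACGT":
            count = sum(1 for s in sites if s[pos] == base)
            freq = (count + pseudocount * background[base]) / (n + pseudocount)
            column[base] = math.log2(freq / background[base])
        pwm.append(column)
    return pwm

def score(pwm, seq):
    """Sum of the PWM elements selected by the bases of seq."""
    return sum(col[b] for col, b in zip(pwm, seq))

# Hypothetical aligned binding sites.
example_sites = ["ACGT", "ACGT", "ACGA", "ACGT"]
example_pwm = build_pwm(example_sites)
```

Because the pseudocount keeps every frequency above zero, bases never observed at a position receive a finite negative score instead of negative infinity.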
Although the knowledge accumulated on the transcriptional regulation of eukaryotes is significant, the knowledge on their translational regulation remains limited. Thus, we performed a comprehensive detection of terminal oligo-pyrimidine (TOP), which is one of the well-characterized cis-regulatory motifs for translational control located immediately downstream of the transcriptional start sites of mRNAs. Utilizing our precise 5′-end information of the full-length cDNAs, we could screen 1645 candidate TOP genes by position specific matrix search. Among them, not only 75 out of 78 ribosomal protein genes but also eight previously identified non-ribosomal-protein TOP genes were included. We further experimentally validated the translational activities of 83 TOP candidate genes. For at least 41 of them, clear translational regulation upon stimulation with 12-O-tetradecanoyl-1-phorbol-13-acetate was observed, indicating that there should be a few hundred human genes that are subject to regulation at the translational level via TOPs. Our result suggests that TOP genes code not only formerly characterized ribosomal proteins and translation-related proteins but also a wider variety of proteins, such as lysosome-related proteins and metabolism-related proteins, playing pivotal roles in gene expression controls in the majority of cellular mRNAs.
Interspecies sequence comparison is a powerful tool to extract functional or evolutionary information from the genomes of organisms. A number of studies have compared protein sequences or promoter sequences between mammals, which provided many insights into genomics. However, the correlation between protein conservation and promoter conservation remains controversial.
We examined promoter conservation as well as protein conservation for 6,901 human and mouse orthologous genes, and observed a very weak correlation between them. We further investigated their relationship by decomposing it based on functional categories, and identified categories with significant tendencies. Remarkably, the 'ribosome' category showed significantly low promoter conservation, despite its high protein conservation, and the 'extracellular matrix' category showed significantly high promoter conservation, in spite of its low protein conservation.
Our results show the relation of gene function to protein conservation and promoter conservation, and reveal that there seem to be nonparallel components between protein and promoter sequence evolution.
The fact that promoters are essential for the function of all genes presents the basis of the general idea that retrotranspositions give rise to processed pseudogenes. However, recent studies have demonstrated that some retrotransposed genes are transcriptionally active. Because promoters are not thought to be retrotransposed along with exonic sequences, these transcriptionally active genes must have acquired a functional promoter by mechanisms that are yet to be determined. Hence, comparison between a retrotransposed gene and its source gene appears to provide a unique opportunity to investigate the promoter creation for a new gene. Here, we identified 29 gene pairs in the human genome, consisting of a functional retrotransposed gene and its parental gene, and compared their respective promoters. In more than half of these cases, we unexpectedly found that a large part of the core promoter had been transcribed, reverse transcribed, and then integrated to be operative at the transposed locus. This observation can be ascribed to the recent discovery that transcription start sites tend to be interspersed rather than situated at 1 specific site. This propensity could confer retrotransposability to promoters per se. Accordingly, the retrotransposability can explain the genesis of some alternative promoters.
promoter; transcription start site; retrotransposition; alternative promoter; human genome; molecular evolution
It is essential in modern biology to understand how transcriptional regulatory regions are composed of cis-elements, yet we have limited knowledge of, for example, the combinational uses of these elements and their positional distribution.
We predicted the positions of 228 known binding motifs for transcription factors in phylogenetically conserved regions within -2000 and +1000 bp of transcriptional start sites (TSSs) of human genes and visualized their correlated non-overlapping occurrences. In the 8,454 significantly correlated motif pairs, two major classes were observed: 248 pairs in Class 1 were mainly found around TSSs, whereas 4,020 Class 2 pairs appear at rather arbitrary distances from TSSs. These classes are distinct in a number of aspects. First, the positional distribution of the Class 1 constituent motifs shows a single peak near the TSSs, whereas Class 2 motifs show a relatively broad distribution. Second, genes that harbor the Class 1 pairs are more likely to be CpG-rich and to be expressed ubiquitously than those that harbor Class 2 pairs. Third, the 'hub' motifs, which are used in many different motif pairs, are different between the two classes. In addition, many of the transcription factors that correspond to the Class 2 hub motifs contain domains rich in specific amino acids; these domains may form disordered regions important for protein-protein interaction.
There exist at least two classes of motif pairs with respect to TSSs in human promoters, possibly reflecting compositional differences between promoters and enhancers. We anticipate that our visualization method may be useful for the further characterisation of promoters.