Search tips
Search criteria

Results 1-25 (30)

Clipboard (0)

Select a Filter Below

Year of Publication
more »
Document Types
1.  RECLU: a pipeline to discover reproducible transcriptional start sites and their alternative regulation using capped analysis of gene expression (CAGE) 
BMC Genomics  2014;15:269.
Next generation sequencing based technologies are being extensively used to study transcriptomes. Among these, cap analysis of gene expression (CAGE) is specialized in detecting the most 5’ ends of RNA molecules. After mapping the sequenced reads back to a reference genome CAGE data highlights the transcriptional start sites (TSSs) and their usage at a single nucleotide resolution.
We propose a pipeline to group the single nucleotide TSS into larger reproducible peaks and compare their usage across biological states. Importantly, our pipeline discovers broad peaks as well as the fine structure of individual transcriptional start sites embedded within them. We assess the performance of our approach on a large CAGE datasets including 156 primary cell types and two cell lines with biological replicas. We demonstrate that genes have complicated structures of transcription initiation events. In particular, we discover that narrow peaks embedded in broader regions of transcriptional activity can be differentially used even if the larger region is not.
By examining the reproducible fine scaled organization of TSS we can detect many differentially regulated peaks undetected by previous approaches.
PMCID: PMC4029093  PMID: 24779366
CAGE; Peak finding; Reproducibility; Hierarchical stability
2.  Explaining the correlations among properties of mammalian promoters 
Nucleic Acids Research  2014;42(8):4823-4832.
Proximal promoters are fundamental genomic elements for gene expression. They vary in terms of GC percentage, CpG abundance, presence of TATA signal, evolutionary conservation, chromosomal spread of transcription start sites and breadth of expression across cell types. These properties are correlated, and it has been suggested that there are two classes of promoters: one class with high CpG, widely spread transcription start sites and broad expression, and another with TATA signals, narrow spread and restricted expression. However, it has been unclear why these properties are correlated in this way. We reexamined these features using the deep FANTOM5 CAGE data from hundreds of cell types. First, we point out subtle but important biases in previous definitions of promoters and of expression breadth. Second, we show that most promoters are rather nonspecifically expressed across many cell types. Third, promoters’ expression breadth is independent of maximum expression level, and therefore correlates with average expression level. Fourth, the data show a more complex picture than two classes, with a network of direct and indirect correlations among promoter properties. By tentatively distinguishing the direct from the indirect correlations, we reveal simple explanations for them.
PMCID: PMC4005656  PMID: 24682821
3.  Improved search heuristics find 20 000 new alignments between human and mouse genomes 
Nucleic Acids Research  2014;42(7):e59.
Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many variants (e.g. spaced seeds, transition seeds), and it remains unclear how to optimize this approach. This study designs and tests seeding methods for inter-mammal and inter-insect genome comparison. By considering substitution patterns of real genomes, we design sets of multiple complementary transition seeds, which have better performance (sensitivity per run time) than previous seeding strategies. Often the best seed patterns have more transition positions than those used previously. We also point out that recent computer memory sizes (e.g. 60 GB) make it feasible to use multiple (e.g. eight) seeds for whole mammal genomes. Interestingly, the most sensitive settings achieve diminishing returns for human–dog and melanogaster–pseudoobscura comparisons, but not for human–mouse, which suggests that we still miss many human–mouse alignments. Our optimized heuristics find ∼20 000 new human–mouse alignments that are missing from the standard UCSC alignments. We tabulate seed patterns and parameters that work well so they can be used in future research.
PMCID: PMC3985675  PMID: 24493737
4.  A bioinformatician’s guide to the forefront of suffix array construction algorithms 
Briefings in Bioinformatics  2014;15(2):138-154.
The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.
PMCID: PMC3956071  PMID: 24413184
suffix array construction; linear-time algorithm; text index; spaced seeds; subset seeds
5.  Gentle Masking of Low-Complexity Sequences Improves Homology Search 
PLoS ONE  2011;6(12):e28819.
Detection of sequences that are homologous, i.e. descended from a common ancestor, is a fundamental task in computational biology. This task is confounded by low-complexity tracts (such as atatatatatat), which arise frequently and independently, causing strong similarities that are not homologies. There has been much research on identifying low-complexity tracts, but little research on how to treat them during homology search. We propose to find homologies by aligning sequences with “gentle” masking of low-complexity tracts. Gentle masking means that the match score involving a masked letter is , where is the unmasked score. Gentle masking slightly but noticeably improves the sensitivity of homology search (compared to “harsh” masking), without harming specificity. We show examples in three useful homology search problems: detection of NUMTs (nuclear copies of mitochondrial DNA), recruitment of metagenomic DNA reads to reference genomes, and pseudogene detection. Gentle masking is currently the best way to treat low-complexity tracts during homology search.
PMCID: PMC3242753  PMID: 22205972
6.  An approximate Bayesian approach for mapping paired-end DNA reads to a reference genome 
Bioinformatics  2013;29(8):965-972.
Summary: Many high-throughput sequencing experiments produce paired DNA reads. Paired-end DNA reads provide extra positional information that is useful in reliable mapping of short reads to a reference genome, as well as in downstream analyses of structural variations. Given the importance of paired-end alignments, it is surprising that there have been no previous publications focusing on this topic. In this article, we present a new probabilistic framework to predict the alignment of paired-end reads to a reference genome. Using both simulated and real data, we compare the performance of our method with six other read-mapping tools that provide a paired-end option. We show that our method provides a good combination of accuracy, error rate and computation time, especially in more challenging and practical cases, such as when the reference genome is incomplete or unavailable for the sample, or when there are large variations between the reference genome and the source of the reads. An open-source implementation of our method is available as part of Last, a multi-purpose alignment program freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3624798  PMID: 23413433
7.  Finding Protein-Coding Genes through Human Polymorphisms 
PLoS ONE  2013;8(1):e54210.
Human gene catalogs are fundamental to the study of human biology and medicine. But they are all based on open reading frames (ORFs) in a reference genome sequence (with allowance for introns). Individual genomes, however, are polymorphic: their sequences are not identical. There has been much research on how polymorphism affects previously-identified genes, but no research has been done on how it affects gene identification itself. We computationally predict protein-coding genes in a straightforward manner, by finding long ORFs in mRNA sequences aligned to the reference genome. We systematically test the effect of known polymorphisms with this procedure. Polymorphisms can not only disrupt ORFs, they can also create long ORFs that do not exist in the reference sequence. We found 5,737 putative protein-coding genes that do not exist in the reference, whose protein-coding status is supported by homology to known proteins. On average 10% of these genes are located in the genomic regions devoid of annotated genes in 12 other catalogs. Our statistical analysis showed that these ORFs are unlikely to occur by chance.
PMCID: PMC3551959  PMID: 23349826
8.  Adding unaligned sequences into an existing alignment using MAFFT and LAST 
Bioinformatics  2012;28(23):3144-3146.
Two methods to add unaligned sequences into an existing multiple sequence alignment have been implemented as the ‘–add’ and ‘–addfragments’ options in the MAFFT package. The former option is a basic one and applicable only to full-length sequences, whereas the latter option is applicable even when the unaligned sequences are short and fragmentary. These methods internally infer the phylogenetic relationship among the sequences in the existing alignment and the phylogenetic positions of unaligned sequences. Benchmarks based on two independent simulations consistently suggest that the “–addfragments” option outperforms recent methods, PaPaRa and PAGAN, in accuracy for difficult problems and that these three methods appropriately handle easy problems.
Supplementary information: Supplementary data are available at Bioinformatics online
PMCID: PMC3516148  PMID: 23023983
9.  Mammalian NUMT insertion is non-random 
Nucleic Acids Research  2012;40(18):9073-9088.
It is well known that remnants of partial or whole copies of mitochondrial DNA, known as Nuclear MiTochondrial sequences (NUMTs), are found in nuclear genomes. Since whole genome sequences have become available, many bioinformatics studies have identified putative NUMTs and from those attempted to infer the factors involved in NUMT creation. These studies conclude that NUMTs represent randomly chosen regions of the mitochondrial genome. There is less consensus regarding the nuclear insertion sites of NUMTs — previous studies have discussed the possible role of retrotransposons, but some recent ones have reported no correlation or even anti-correlation between NUMT sites and retrotransposons. These studies have generally defined NUMT sites using BLAST with default parameters. We analyze a redefined set of human NUMTs, computed with a carefully considered protocol. We discover that the inferred insertion points of NUMTs have a strong tendency to have high-predicted DNA curvature, occur in experimentally defined open chromatin regions and often occur immediately adjacent to A + T oligomers. We also show clear evidence that their flanking regions are indeed rich in retrotransposons. Finally we show that parts of the mitochondrial genome D-loop are under-represented as a source of NUMTs in primate evolution.
PMCID: PMC3467031  PMID: 22761406
10.  A mostly traditional approach improves alignment of bisulfite-converted DNA 
Nucleic Acids Research  2012;40(13):e100.
Cytosines in genomic DNA are sometimes methylated. This affects many biological processes and diseases. The standard way of measuring methylation is to use bisulfite, which converts unmethylated cytosines to thymines, then sequence the DNA and compare it to a reference genome sequence. We describe a method for the critical step of aligning the DNA reads to the correct genomic locations. Our method builds on classic alignment techniques, including likelihood-ratio scores and spaced seeds. In a realistic benchmark, our method has a better combination of sensitivity, specificity and speed than nine other high-throughput bisulfite aligners. This study enables more accurate and rational analysis of DNA methylation. It also illustrates how to adapt general-purpose alignment methods to a special case with distorted base patterns: this should be informative for other special cases such as ancient DNA and AT-rich genomes.
PMCID: PMC3401460  PMID: 22457070
11.  RecountDB: a database of mapped and count corrected transcribed sequences 
Nucleic Acids Research  2011;40(Database issue):D1089-D1092.
The field of gene expression analysis continues to benefit from next-generation sequencing generated data, which enables transcripts to be measured with unmatched accuracy and resolution. But the high-throughput reads from these technologies also contain many errors, which can compromise the ability to accurately detect and quantify rare transcripts. Fortunately, techniques exist to ameliorate the affects of sequencer error. We present RecountDB, a secondary database derived from primary data in NCBI's short read archive. RecountDB holds sequence counts from RNA-seq and 5′ capped transcription start site experiments, corrected and mapped to the relevant genome. Via a searchable and browseable interface users can obtain corrected data in formats useful for transcriptomic analysis. The database is currently populated with 2265 entries from 45 organisms and continuously growing. RecountDB is publicly available at:
PMCID: PMC3245132  PMID: 22139942
12.  Inferring transcription factor complexes from ChIP-seq data 
Nucleic Acids Research  2011;39(15):e98.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) allows researchers to determine the genome-wide binding locations of individual transcription factors (TFs) at high resolution. This information can be interrogated to study various aspects of TF behaviour, including the mechanisms that control TF binding. Physical interaction between TFs comprises one important aspect of TF binding in eukaryotes, mediating tissue-specific gene expression. We have developed an algorithm, spaced motif analysis (SpaMo), which is able to infer physical interactions between the given TF and TFs bound at neighbouring sites at the DNA interface. The algorithm predicts TF interactions in half of the ChIP-seq data sets we test, with the majority of these predictions supported by direct evidence from the literature or evidence of homodimerization. High resolution motif spacing information obtained by this method can facilitate an improved understanding of individual TF complex structures. SpaMo can assist researchers in extracting maximum information relating to binding mechanisms from their TF ChIP-seq data. SpaMo is available for download and interactive use as part of the MEME Suite (
PMCID: PMC3159476  PMID: 21602262
13.  CentroidHomfold-LAST: accurate prediction of RNA secondary structure using automatically collected homologous sequences 
Nucleic Acids Research  2011;39(Web Server issue):W100-W106.
Although secondary structure predictions of an individual RNA sequence have been widely used in a number of sequence analyses of RNAs, accuracy is still limited. Recently, we proposed a method (called ‘CentroidHomfold’), which includes information about homologous sequences into the prediction of the secondary structure of the target sequence, and showed that it substantially improved the performance of secondary structure predictions. CentroidHomfold, however, forces users to prepare homologous sequences of the target sequence. We have developed a Web application (CentroidHomfold-LAST) that predicts the secondary structure of the target sequence using automatically collected homologous sequences. LAST, which is a fast and sensitive local aligner, and CentroidHomfold are employed in the Web application. Computational experiments with a commonly-used data set indicated that CentroidHomfold-LAST substantially outperformed conventional secondary structure predictions including CentroidFold and RNAfold.
PMCID: PMC3125741  PMID: 21565800
14.  A new repeat-masking method enables specific detection of homologous sequences 
Nucleic Acids Research  2010;39(4):e23.
Biological sequences are often analyzed by detecting homologous regions between them. Homology search is confounded by simple repeats, which give rise to strong similarities that are not homologies. Standard repeat-masking methods fail to eliminate this problem, and they are especially ill-suited to AT-rich DNA such as malaria and slime-mould genomes. We present a new repeat-masking method, tantan, which is motivated by the mechanisms that create simple repeats. This method thoroughly eliminates spurious homology predictions for DNA–DNA, protein–protein and DNA–protein comparisons. Moreover, it enables accurate homology search for non-coding DNA with extreme A + T composition.
PMCID: PMC3045581  PMID: 21109538
15.  Parameters for accurate genome alignment 
BMC Bioinformatics  2010;11:80.
Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed.
We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold-standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases.
These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours
PMCID: PMC2829014  PMID: 20144198
16.  Incorporating sequence quality data into alignment improves DNA read mapping 
Nucleic Acids Research  2010;38(7):e100.
New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the amount of correctly mapped reads from 49 to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
PMCID: PMC2853142  PMID: 20110255
17.  Pseudocounts for transcription factor binding sites 
Nucleic Acids Research  2008;37(3):939-944.
To represent the sequence specificity of transcription factors, the position weight matrix (PWM) is widely used. In most cases, each element is defined as a log likelihood ratio of a base appearing at a certain position, which is estimated from a finite number of known binding sites. To avoid bias due to this small sample size, a certain numeric value, called a pseudocount, is usually allocated for each position, and its fraction according to the background base composition is added to each element. So far, there has been no consensus on the optimal pseudocount value. In this study, we simulated the sampling process by artificially generating binding sites based on observed nucleotide frequencies in a public PWM database, and then the generated matrix with an added pseudocount value was compared to the original frequency matrix using various measures. Although the results were somewhat different between measures, in many cases, we could find an optimal pseudocount value for each matrix. These optimal values are independent of the sample size and are clearly correlated with the entropy of the original matrices, meaning that larger pseudocount vales are preferable for less conserved binding sites. As a simple representative, we suggest the value of 0.8 for practical uses.
PMCID: PMC2647310  PMID: 19106141
18.  The whole alignment and nothing but the alignment: the problem of spurious alignment flanks 
Nucleic Acids Research  2008;36(18):5863-5871.
Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length. In the UCSC genome database, e.g. spurious flanks probably comprise >18% of the human–fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple ‘overalignment’ P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.
PMCID: PMC2566872  PMID: 18796526
19.  Discovering Sequence Motifs with Arbitrary Insertions and Deletions 
PLoS Computational Biology  2008;4(5):e1000071.
Biology is encoded in molecular sequences: deciphering this encoding remains a grand scientific challenge. Functional regions of DNA, RNA, and protein sequences often exhibit characteristic but subtle motifs; thus, computational discovery of motifs in sequences is a fundamental and much-studied problem. However, most current algorithms do not allow for insertions or deletions (indels) within motifs, and the few that do have other limitations. We present a method, GLAM2 (Gapped Local Alignment of Motifs), for discovering motifs allowing indels in a fully general manner, and a companion method GLAM2SCAN for searching sequence databases using such motifs. glam2 is a generalization of the gapless Gibbs sampling algorithm. It re-discovers variable-width protein motifs from the PROSITE database significantly more accurately than the alternative methods PRATT and SAM-T2K. Furthermore, it usefully refines protein motifs from the ELM database: in some cases, the refined motifs make orders of magnitude fewer overpredictions than the original ELM regular expressions. GLAM2 performs respectably on the BAliBASE multiple alignment benchmark, and may be superior to leading multiple alignment methods for “motif-like” alignments with N- and C-terminal extensions. Finally, we demonstrate the use of GLAM2 to discover protein kinase substrate motifs and a gapped DNA motif for the LIM-only transcriptional regulatory complex: using GLAM2SCAN, we identify promising targets for the latter. GLAM2 is especially promising for short protein motifs, and it should improve our ability to identify the protein cleavage sites, interaction sites, post-translational modification attachment sites, etc., that underlie much of biology. It may be equally useful for arbitrarily gapped motifs in DNA and RNA, although fewer examples of such motifs are known at present. GLAM2 is public domain software, available for download at
Author Summary
In recent decades, scientists have extracted genetic sequences—DNA, RNA, and protein sequences—from numerous organisms. These sequences hold the information for the construction and functioning of these organisms, but as yet we are mostly unable to read them. It has long been known that these sequences contain many kinds of “motifs”, i.e. re-occurring patterns, associated with specific biological functions. Thus, much research has been devoted to computer algorithms for automatically discovering subtle, recurring motifs in sequences. However, previous algorithms search for rigid motifs whose instances vary only by substitutions, and not by insertions or deletions. Real motifs are flexible, and do vary by insertions and deletions. This study describes a new computer algorithm for discovering motifs, which allows for arbitrary insertions and deletions. This algorithm can discover real, flexible motifs, and should be able to help us determine the functions of many biological molecules.
PMCID: PMC2323616  PMID: 18437229
20.  Large-scale clustering of CAGE tag expression data 
BMC Bioinformatics  2007;8:161.
Recent analyses have suggested that many genes possess multiple transcription start sites (TSSs) that are differentially utilized in different tissues and cell lines. We have identified a huge number of TSSs mapped onto the mouse genome using the cap analysis of gene expression (CAGE) method. The standard hierarchical clustering algorithm, which gives us easily understandable graphical tree images, has difficulties in processing such huge amounts of TSS data and a better method to calculate and display the results is needed.
We use a combination of hierarchical and non-hierarchical clustering to cluster expression profiles of TSSs based on a large amount of CAGE data to profit from the best of both methods. We processed the genome-wide expression data, including 159,075 TSSs derived from 127 RNA samples of various organs of mouse, and succeeded in categorizing them into 70–100 clusters. The clusters exhibited intriguing biological features: a cluster supergroup with a ubiquitous expression profile, tissue-specific patterns, a distinct distribution of non-coding RNA and functional TSS groups.
Our approach succeeded in greatly reducing the calculation cost, and is an appropriate solution for analyzing large-scale TSS usage data.
PMCID: PMC1890301  PMID: 17517134
21.  Dynamic usage of transcription start sites within core promoters 
Genome Biology  2006;7(12):R118.
An exploration of the internal dynamics of mammalian promoters demonstrates that start site selection within mouse core promoters varies amongst tissues.
Mammalian promoters do not initiate transcription at single, well defined base pairs, but rather at multiple, alternative start sites spread across a region. We previously characterized the static structures of transcription start site usage within promoters at the base pair level, based on large-scale sequencing of transcript 5' ends.
In the present study we begin to explore the internal dynamics of mammalian promoters, and demonstrate that start site selection within many mouse core promoters varies among tissues. We also show that this dynamic usage of start sites is associated with CpG islands, broad and multimodal promoter structures, and imprinting.
Our results reveal a new level of biologic complexity within promoters - fine-scale regulation of transcription starting events at the base pair level. These events are likely to be related to epigenetic transcriptional regulation.
PMCID: PMC1794431  PMID: 17156492
22.  The Abundance of Short Proteins in the Mammalian Proteome 
PLoS Genetics  2006;2(4):e52.
Short proteins play key roles in cell signalling and other processes, but their abundance in the mammalian proteome is unknown. Current catalogues of mammalian proteins exhibit an artefactual discontinuity at a length of 100 aa, so that protein abundance peaks just above this length and falls off sharply below it. To clarify the abundance of short proteins, we identify proteins in the FANTOM collection of mouse cDNAs by analysing synonymous and non-synonymous substitutions with the computer program CRITICA. This analysis confirms that there is no real discontinuity at length 100. Roughly 10% of mouse proteins are shorter than 100 aa, although the majority of these are variants of proteins longer than 100 aa. We identify many novel short proteins, including a “dark matter” subset containing ones that lack detectable homology to other known proteins. Translation assays confirm that some of these novel proteins can be translated and localised to the secretory pathway.
Living things work by the actions of proteins and other molecules that are encoded in their genomes. Genome projects aim to construct a “parts list” of life by cataloguing all of these molecules. However, some molecules are harder to identify than others. One difficult category is short proteins. This is because protein-coding nucleotide sequences have statistical features that, for large proteins, are very pronounced and readily distinguishable from non-coding sequence, but for small proteins are not. Thus, to avoid erroneous predictions, many genome projects have employed an artificial length threshold of 100 amino acids. Hence, short proteins are underrepresented in protein catalogues, although they are known to play important roles in immunity, cell signalling, and metabolism. This study clarifies the abundance of short proteins by exploiting two advantageous resources. The first is the large FANTOM collection of mouse transcript sequences: it is much easier to identify proteins in mature transcripts than in raw genome sequence. The second is a method to identify protein-coding sequences by examining how they differ between organisms. The redundancy of the genetic code implies that these differences will follow distinctive patterns, and this approach has the statistical power to break the 100-amino-acid barrier and reliably identify short proteins.
PMCID: PMC1449894  PMID: 16683031
23.  Pseudo–Messenger RNA: Phantoms of the Transcriptome 
PLoS Genetics  2006;2(4):e23.
The mammalian transcriptome harbours shadowy entities that resist classification and analysis. In analogy with pseudogenes, we define pseudo–messenger RNA to be RNA molecules that resemble protein-coding mRNA, but cannot encode full-length proteins owing to disruptions of the reading frame. Using a rigorous computational pipeline, which rules out sequencing errors, we identify 10,679 pseudo–messenger RNAs (approximately half of which are transposon-associated) among the 102,801 FANTOM3 mouse cDNAs: just over 10% of the FANTOM3 transcriptome. These comprise not only transcribed pseudogenes, but also disrupted splice variants of otherwise protein-coding genes. Some may encode truncated proteins, only a minority of which appear subject to nonsense-mediated decay. The presence of an excess of transcripts whose only disruptions are opal stop codons suggests that there are more selenoproteins than currently estimated. We also describe compensatory frameshifts, where a segment of the gene has changed frame but remains translatable. In summary, we survey a large class of non-standard but potentially functional transcripts that are likely to encode genetic information and effect biological processes in novel ways. Many of these transcripts do not correspond cleanly to any identifiable object in the genome, implying fundamental limits to the goal of annotating all functional elements at the genome sequence level.
Our understanding of genetics has been dominated by the so-called central dogma: the theory that DNA is transcribed into RNA, which is translated via the genetic code to produce proteins. Thus, DNA is the inherited store of genetic information, proteins are the end products that carry out cellular functions, and RNA is a kind of passive intermediary, hence termed messenger RNA. However, evidence has been accumulating that RNA plays a much more dynamic role than this. This study provides an unprejudiced survey of “pathological” RNA molecules, which resemble protein-coding RNA except that they contain violations of the genetic code. These pseudo–messenger RNAs constitute a surprisingly large fraction of all transcripts, as much as 10%. These ghostly molecules have always been present in RNA surveys, but have stayed below the radar because they do not cleanly correspond to annotated elements in DNA, i.e., “genes”. Their prevalence demonstrates that RNA is a distinct continent that cannot be fully understood as a mirror of DNA or proteins.
PMCID: PMC1449882  PMID: 16683022
24.  Clusters of Internally Primed Transcripts Reveal Novel Long Noncoding RNAs 
PLoS Genetics  2006;2(4):e37.
Non-protein-coding RNAs (ncRNAs) are increasingly being recognized as having important regulatory roles. Although much recent attention has focused on tiny 22- to 25-nucleotide microRNAs, several functional ncRNAs are orders of magnitude larger in size. Examples of such macro ncRNAs include Xist and Air, which in mouse are 18 and 108 kilobases (Kb), respectively. We surveyed the 102,801 FANTOM3 mouse cDNA clones and found that Air and Xist were present not as single, full-length transcripts but as a cluster of multiple, shorter cDNAs, which were unspliced, had little coding potential, and were most likely primed from internal adenine-rich regions within longer parental transcripts. We therefore conducted a genome-wide search for regional clusters of such cDNAs to find novel macro ncRNA candidates. Sixty-six regions were identified, each of which mapped outside known protein-coding loci and which had a mean length of 92 Kb. We detected several known long ncRNAs within these regions, supporting the basic rationale of our approach. In silico analysis showed that many regions had evidence of imprinting and/or antisense transcription. These regions were significantly associated with microRNAs and transcripts from the central nervous system. We selected eight novel regions for experimental validation by northern blot and RT-PCR and found that the majority represent previously unrecognized noncoding transcripts that are at least 10 Kb in size and predominantly localized in the nucleus. Taken together, the data not only identify multiple new ncRNAs but also suggest the existence of many more macro ncRNAs like Xist and Air.
The human genome has been sequenced, and, intriguingly, less than 2% specifies the information for the basic protein building blocks of our bodies. So, what does the other 98% do? It now appears that the mammalian genome also specifies the instructions for many previously undiscovered “non protein-coding RNA” (ncRNA) genes. However, what these ncRNAs do is largely unknown. In recent years, strategies have been designed that have successfully identified hundreds of short ncRNAs—termed microRNAs—many of which have since been shown to act as genetic regulators. Also known to be functionally important are a handful of ncRNAs orders of magnitude larger in size than microRNAs. The availability of complete genome and comprehensive transcript sequences allows for the systematic discovery of more large ncRNAs. The authors developed a computational strategy to screen the mouse genome and identify large ncRNAs. They detected existing large ncRNAs, thus validating their approach, but, more importantly, discovered more than 60 other candidates, some of which were subsequently confirmed experimentally. This work opens the door to a virtually unexplored world of large ncRNAs and beckons future experimental work to define the cellular functions of these molecules.
PMCID: PMC1449886  PMID: 16683026
25.  MotifViz: an analysis and visualization tool for motif discovery 
Nucleic Acids Research  2004;32(Web Server issue):W420-W423.
Detecting overrepresented known transcription factor binding motifs in a set of promoter sequences of co-regulated genes has become an important approach to deciphering transcriptional regulatory mechanisms. In this paper, we present an interactive web server, MotifViz, for three motif discovery programs, Clover, Rover and Motifish, covering most available flavors of algorithms for achieving this goal. For comparison, we have also implemented the simple motif-matching program Possum. MotifViz provides uniform and intuitive input and output formats for all four programs. It can be accessed at
PMCID: PMC441564  PMID: 15215422

Results 1-25 (30)