Search tips
Search criteria

Results 1-25 (29)

Clipboard (0)
Year of Publication
1.  Evolution at protein ends: major contribution of alternative transcription initiation and termination to the transcriptome and proteome diversity in mammals 
Nucleic Acids Research  2014;42(11):7132-7144.
Alternative splicing (AS), alternative transcription initiation (ATI) and alternative transcription termination (ATT) create the extraordinary complexity of transcriptomes and make key contributions to the structural and functional diversity of mammalian proteomes. Analysis of mammalian genomic and transcriptomic data shows that contrary to the traditional view, the joint contribution of ATI and ATT to the transcriptome and proteome diversity is quantitatively greater than the contribution of AS. Although the mean numbers of protein-coding constitutive and alternative nucleotides in gene loci are nearly identical, their distribution along the transcripts is highly non-uniform. On average, coding exons in the variable 5′ and 3′ transcript ends that are created by ATI and ATT contain approximately four times more alternative nucleotides than core protein-coding regions that diversify exclusively via AS. Short upstream exons that encompass alternative 5′-untranslated regions and N-termini of proteins evolve under strong nucleotide-level selection whereas in 3′-terminal exons that encode protein C-termini, protein-level selection is significantly stronger. The groups of genes that are subject to ATI and ATT show major differences in biological roles, expression and selection patterns.
PMCID: PMC4066770  PMID: 24792168
2.  Classification and evolution of type II CRISPR-Cas systems 
Nucleic Acids Research  2014;42(10):6091-6105.
The CRISPR-Cas systems of archaeal and bacterial adaptive immunity are classified into three types that differ by the repertoires of CRISPR-associated (cas) genes, the organization of cas operons and the structure of repeats in the CRISPR arrays. The simplest among the CRISPR-Cas systems is type II in which the endonuclease activities required for the interference with foreign deoxyribonucleic acid (DNA) are concentrated in a single multidomain protein, Cas9, and are guided by a co-processed dual-tracrRNA:crRNA molecule. This compact enzymatic machinery and readily programmable site-specific DNA targeting make type II systems top candidates for a new generation of powerful tools for genomic engineering. Here we report an updated census of CRISPR-Cas systems in bacterial and archaeal genomes. Type II systems are the rarest, missing in archaea, and represented in ∼5% of bacterial genomes, with an over-representation among pathogens and commensals. Phylogenomic analysis suggests that at least three cas genes, cas1, cas2 and cas4, and the CRISPR repeats of the type II-B system were acquired via recombination with a type I CRISPR-Cas locus. Distant homologs of Cas9 were identified among proteins encoded by diverse transposons, suggesting that type II CRISPR-Cas evolved via recombination of mobile nuclease genes with type I loci.
PMCID: PMC4041416  PMID: 24728998
3.  Phylogeny of Cas9 determines functional exchangeability of dual-RNA and Cas9 among orthologous type II CRISPR-Cas systems 
Nucleic Acids Research  2013;42(4):2577-2590.
The CRISPR-Cas-derived RNA-guided Cas9 endonuclease is the key element of an emerging promising technology for genome engineering in a broad range of cells and organisms. The DNA-targeting mechanism of the type II CRISPR-Cas system involves maturation of tracrRNA:crRNA duplex (dual-RNA), which directs Cas9 to cleave invading DNA in a sequence-specific manner, dependent on the presence of a Protospacer Adjacent Motif (PAM) on the target. We show that evolution of dual-RNA and Cas9 in bacteria produced remarkable sequence diversity. We selected eight representatives of phylogenetically defined type II CRISPR-Cas groups to analyze possible coevolution of Cas9 and dual-RNA. We demonstrate that these two components are interchangeable only between closely related type II systems when the PAM sequence is adjusted to the investigated Cas9 protein. Comparison of the taxonomy of bacterial species that harbor type II CRISPR-Cas systems with the Cas9 phylogeny corroborates horizontal transfer of the CRISPR-Cas loci. The reported collection of dual-RNA:Cas9 with associated PAMs expands the possibilities for multiplex genome editing and could provide means to improve the specificity of the RNA-programmable Cas9 tool.
PMCID: PMC3936727  PMID: 24270795
4.  Differences in DNA methylation between human neuronal and glial cells are concentrated in enhancers and non-CpG sites 
Nucleic Acids Research  2013;42(1):109-127.
We applied Illumina Human Methylation450K array to perform a genomic-scale single-site resolution DNA methylation analysis in neuronal and nonneuronal (primarily glial) nuclei separated from the orbitofrontal cortex of postmortem human brain. The findings were validated using enhanced reduced representation bisulfite sequencing. We identified thousands of sites differentially methylated (DM) between neuronal and nonneuronal cells. The DM sites were depleted within CpG-island–containing promoters but enriched in predicted enhancers. Classification of the DM sites into those undermethylated in neurons (neuronal type) and those undermethylated in nonneuronal cells (glial type), combined with findings of others that methylation within control elements typically negatively correlates with gene expression, yielded large sets of predicted neuron-specific and non–neuron-specific genes. These sets of predicted genes were in excellent agreement with the available direct measurements of gene expression in human and mouse. We also found a distinct set of DNA methylation patterns that were unique for neuronal cells. In particular, neuronal-type differential methylation was overrepresented in CpG island shores, enriched within gene bodies but not in intergenic regions, and preferentially harbored binding motifs for a distinct set of transcription factors, including neuron-specific activity-dependent factors. Finally, non-CpG methylation was substantially more prevalent in neurons than in nonneuronal cells.
PMCID: PMC3874157  PMID: 24057217
5.  Comparative genomics of defense systems in archaea and bacteria 
Nucleic Acids Research  2013;41(8):4360-4377.
Our knowledge of prokaryotic defense systems has vastly expanded as the result of comparative genomic analysis, followed by experimental validation. This expansion is both quantitative, including the discovery of diverse new examples of known types of defense systems, such as restriction-modification or toxin-antitoxin systems, and qualitative, including the discovery of fundamentally new defense mechanisms, such as the CRISPR-Cas immunity system. Large-scale statistical analysis reveals that the distribution of different defense systems in bacterial and archaeal taxa is non-uniform, with four groups of organisms distinguishable with respect to the overall abundance and the balance between specific types of defense systems. The genes encoding defense system components in bacterial and archaea typically cluster in defense islands. In addition to genes encoding known defense systems, these islands contain numerous uncharacterized genes, which are candidates for new types of defense systems. The tight association of the genes encoding immunity systems and dormancy- or cell death-inducing defense systems in prokaryotic genomes suggests that these two major types of defense are functionally coupled, providing for effective protection at the population level.
PMCID: PMC3632139  PMID: 23470997
6.  Abundance of type I toxin–antitoxin systems in bacteria: searches for new candidates and discovery of novel families 
Nucleic Acids Research  2010;38(11):3743-3759.
Small, hydrophobic proteins whose synthesis is repressed by small RNAs (sRNAs), denoted type I toxin–antitoxin modules, were first discovered on plasmids where they regulate plasmid stability, but were subsequently found on a few bacterial chromosomes. We used exhaustive PSI-BLAST and TBLASTN searches across 774 bacterial genomes to identify homologs of known type I toxins. These searches substantially expanded the collection of predicted type I toxins, revealed homology of the Ldr and Fst toxins, and suggested that type I toxin–antitoxin loci are not spread by horizontal gene transfer. To discover novel type I toxin–antitoxin systems, we developed a set of search parameters based on characteristics of known loci including the presence of tandem repeats and clusters of charged and bulky amino acids at the C-termini of short proteins containing predicted transmembrane regions. We detected sRNAs for three predicted toxins from enterohemorrhagic Escherichia coli and Bacillus subtilis, and showed that two of the respective proteins indeed are toxic when overexpressed. We also demonstrated that the local free-energy minima of RNA folding can be used to detect the positions of the sRNA genes. Our results suggest that type I toxin–antitoxin modules are much more widely distributed among bacteria than previously appreciated.
PMCID: PMC2887945  PMID: 20156992
7.  Selection for minimization of translational frameshifting errors as a factor in the evolution of codon usage 
Nucleic Acids Research  2009;37(20):6799-6810.
In a wide range of genomes, it was observed that the usage of synonymous codons is biased toward specific codons and codon patterns. Factors that are implicated in the selection for codon usage include facilitation of fast and accurate translation. There are two types of translational errors: missense errors and processivity errors. There is considerable evidence in support of the hypothesis that codon usage is optimized to minimize missense errors. In contrast, little is known about the relationship between codon usage and frameshifting errors, an important form of processivity errors, which appear to occur at frequencies comparable to the frequencies of missense errors. Based on the recently proposed pause-and-slip model of frameshifting, we developed Frameshifting Robustness Score (FRS). We used this measure to test if the pattern of codon usage indicates optimization against frameshifting errors. We found that the FRS values of protein-coding sequences from four analyzed genomes (the bacteria Bacillus subtilis and Escherichia coli, and the yeasts Saccharomyces cerevisiae and Schizosaccharomyce pombe) were typically higher than expected by chance. Other properties of FRS patterns observed in B. subtilis, S. cerevisiae and S. pombe, such as the tendency of FRS to increase from the 5′- to 3′-end of protein-coding sequences, were also consistent with the hypothesis of optimization against frameshifting errors in translation. For E. coli, the results of different tests were less consistent, suggestive of a much weaker optimization, if any. Collectively, the results fit the concept of selection against mistranslation-induced protein misfolding being one of the factors shaping the evolution of both coding and non-coding sequences.
PMCID: PMC2777431  PMID: 19745054
8.  Darwinian evolution in the light of genomics 
Nucleic Acids Research  2009;37(4):1011-1034.
Comparative genomics and systems biology offer unprecedented opportunities for testing central tenets of evolutionary biology formulated by Darwin in the Origin of Species in 1859 and expanded in the Modern Synthesis 100 years later. Evolutionary-genomic studies show that natural selection is only one of the forces that shape genome evolution and is not quantitatively dominant, whereas non-adaptive processes are much more prominent than previously suspected. Major contributions of horizontal gene transfer and diverse selfish genetic elements to genome evolution undermine the Tree of Life concept. An adequate depiction of evolution requires the more complex concept of a network or ‘forest’ of life. There is no consistent tendency of evolution towards increased genomic complexity, and when complexity increases, this appears to be a non-adaptive consequence of evolution under weak purifying selection rather than an adaptation. Several universals of genome evolution were discovered including the invariant distributions of evolutionary rates among orthologous genes from diverse genomes and of paralogous gene family sizes, and the negative correlation between gene expression level and sequence evolution rate. Simple, non-adaptive models of evolution explain some of these universals, suggesting that a new synthesis of evolutionary biology might become feasible in a not so remote future.
PMCID: PMC2651812  PMID: 19213802
9.  Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world 
Nucleic Acids Research  2008;36(21):6688-6719.
The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time of approximately 20 months for bacteria and approximately 34 months for archaea. Comparative analysis of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations on the principles of genome organization and evolution. A crucial finding that enables functional characterization of the sequenced genomes and evolutionary reconstruction is that the majority of archaeal and bacterial genes have conserved orthologs in other, often, distant organisms. However, comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of prokaryotic evolution, along with the loss of genetic material resulting in genome contraction. A crucial component of the prokaryotic world is the mobilome, the enormous collection of viruses, plasmids and other selfish elements, which are in constant exchange with more stable chromosomes and serve as HGT vehicles. Thus, the prokaryotic genome space is a tightly connected, although compartmentalized, network, a novel notion that undermines the ‘Tree of Life’ model of evolution and requires a new conceptual framework and tools for the study of prokaryotic evolution.
PMCID: PMC2588523  PMID: 18948295
10.  ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes 
Nucleic Acids Research  2008;37(Database issue):D448-D454.
The database of Alignable Tight Genomic Clusters (ATGCs) consists of closely related genomes of archaea and bacteria, and is a resource for research into prokaryotic microevolution. Construction of a data set with appropriate characteristics is a major hurdle for this type of studies. With the current rate of genome sequencing, it is difficult to follow the progress of the field and to determine which of the available genome sets meet the requirements of a given research project, in particular, with respect to the minimum and maximum levels of similarity between the included genomes. Additionally, extraction of specific content, such as genomic alignments or families of orthologs, from a selected set of genomes is a complicated and time-consuming process. The database addresses these problems by providing an intuitive and efficient web interface to browse precomputed ATGCs, select appropriate ones and access ATGC-derived data such as multiple alignments of orthologous proteins, matrices of pairwise intergenomic distances based on genome-wide analysis of synonymous and nonsynonymous substitution rates and others. The ATGC database will be regularly updated following new releases of the NCBI RefSeq. The database is hosted by the Genomics Division at Lawrence Berkeley National laboratory and is publicly available at
PMCID: PMC2686458  PMID: 18845571
11.  Viral AlkB proteins repair RNA damage by oxidative demethylation 
Nucleic Acids Research  2008;36(17):5451-5461.
Bacterial and mammalian AlkB proteins are iron(II)- and 2-oxoglutarate-dependent dioxygenases that reverse methylation damage, such as 1-methyladenine and 3-methylcytosine, in RNA and DNA. An AlkB-domain is encoded by the genome of numerous single-stranded, plant-infecting RNA viruses, the majority of which belong to the Flexiviridae family. Our phylogenetic analysis of AlkB sequences suggests that a single plant virus might have acquired AlkB relatively recently, followed by horizontal dissemination among other viruses via recombination. Here, we describe the first functional characterization of AlkB proteins from three plant viruses. The viral AlkB proteins efficiently reactivated methylated bacteriophage genomes when expressed in Escherichia coli, and also displayed robust, iron(II)- and 2-oxoglutarate-dependent demethylase activity in vitro. Viral AlkB proteins preferred RNA over DNA substrates, and thus represent the first AlkBs with such substrate specificity. Our results suggest a role for viral AlkBs in maintaining the integrity of the viral RNA genome through repair of deleterious methylation damage, and support the notion that AlkB-mediated RNA repair is biologically relevant.
PMCID: PMC2553587  PMID: 18718927
13.  Evolutionary conservation suggests a regulatory function of AUG triplets in 5′-UTRs of eukaryotic genes 
Nucleic Acids Research  2005;33(17):5512-5520.
By comparing sequences of human, mouse and rat orthologous genes, we show that in 5′-untranslated regions (5′-UTRs) of mammalian cDNAs but not in 3′-UTRs or coding sequences, AUG is conserved to a significantly greater extent than any of the other 63 nt triplets. This effect is likely to reflect, primarily, bona fide evolutionary conservation, rather than cDNA annotation artifacts, because the excess of conserved upstream AUGs (uAUGs) is seen in 5′-UTRs containing stop codons in-frame with the start AUG and many of the conserved AUGs are found in different frames, consistent with the location in authentic non-coding sequences. Altogether, conserved uAUGs are present in at least 20–30% of mammalian genes. Qualitatively similar results were obtained by comparison of orthologous genes from different species of the yeast genus Saccharomyces. Together with the observation that mammalian and yeast 5′-UTRs are significantly depleted in overall AUG content, these findings suggest that AUG triplets in 5′-UTRs are subject to the pressure of purifying selection in two opposite directions: the uAUGs that have no specific function tend to be deleterious and get eliminated during evolution, whereas those uAUGs that do serve a function are conserved. Most probably, the principal role of the conserved uAUGs is attenuation of translation at the initiation stage, which is often additionally regulated by alternative splicing in the mammalian 5′-UTRs. Consistent with this hypothesis, we found that open reading frames starting from conserved uAUGs are significantly shorter than those starting from non-conserved uAUGs, possibly, owing to selection for optimization of the level of attenuation.
PMCID: PMC1236974  PMID: 16186132
14.  Ancestral paralogs and pseudoparalogs and their role in the emergence of the eukaryotic cell 
Nucleic Acids Research  2005;33(14):4626-4638.
Gene duplication is a crucial mechanism of evolutionary innovation. A substantial fraction of eukaryotic genomes consists of paralogous gene families. We assess the extent of ancestral paralogy, which dates back to the last common ancestor of all eukaryotes, and examine the origins of the ancestral paralogs and their potential roles in the emergence of the eukaryotic cell complexity. A parsimonious reconstruction of ancestral gene repertoires shows that 4137 orthologous gene sets in the last eukaryotic common ancestor (LECA) map back to 2150 orthologous sets in the hypothetical first eukaryotic common ancestor (FECA) [paralogy quotient (PQ) of 1.92]. Analogous reconstructions show significantly lower levels of paralogy in prokaryotes, 1.19 for archaea and 1.25 for bacteria. The only functional class of eukaryotic proteins with a significant excess of paralogous clusters over the mean includes molecular chaperones and proteins with related functions. Almost all genes in this category underwent multiple duplications during early eukaryotic evolution. In structural terms, the most prominent sets of paralogs are superstructure-forming proteins with repetitive domains, such as WD-40 and TPR. In addition to the true ancestral paralogs which evolved via duplication at the onset of eukaryotic evolution, numerous pseudoparalogs were detected, i.e. homologous genes that apparently were acquired by early eukaryotes via different routes, including horizontal gene transfer (HGT) from diverse bacteria. The results of this study demonstrate a major increase in the level of gene paralogy as a hallmark of the early evolution of eukaryotes.
PMCID: PMC1187821  PMID: 16106042
15.  Origin and evolution of the archaeo-eukaryotic primase superfamily and related palm-domain proteins: structural insights and new members 
Nucleic Acids Research  2005;33(12):3875-3896.
We report an in-depth computational study of the protein sequences and structures of the superfamily of archaeo-eukaryotic primases (AEPs). This analysis greatly expands the range of diversity of the AEPs and reveals the unique active site shared by all members of this superfamily. In particular, it is shown that eukaryotic nucleo-cytoplasmic large DNA viruses, including poxviruses, asfarviruses, iridoviruses, phycodnaviruses and the mimivirus, encode AEPs of a distinct family, which also includes the herpesvirus primases whose relationship to AEPs has not been recognized previously. Many eukaryotic genomes, including chordates and plants, encode previously uncharacterized homologs of these predicted viral primases, which might be involved in novel DNA repair pathways. At a deeper level of evolutionary connections, structural comparisons indicate that AEPs, the nucleases involved in the initiation of rolling circle replication in plasmids and viruses, and origin-binding domains of papilloma and polyoma viruses evolved from a common ancestral protein that might have been involved in a protein-priming mechanism of initiation of DNA replication. Contextual analysis of multidomain protein architectures and gene neighborhoods in prokaryotes and viruses reveals remarkable parallels between AEPs and the unrelated DnaG-type primases, in particular, tight associations with the same repertoire of helicases. These observations point to a functional equivalence of the two classes of primases, which seem to have repeatedly displaced each other in various extrachromosomal replicons.
PMCID: PMC1176014  PMID: 16027112
16.  Conservation versus parallel gains in intron evolution 
Nucleic Acids Research  2005;33(6):1741-1748.
Orthologous genes from distant eukaryotic species, e.g. animals and plants, share up to 25–30% intron positions. However, the relative contributions of evolutionary conservation and parallel gain of new introns into this pattern remain unknown. Here, the extent of independent insertion of introns in the same sites (parallel gain) in orthologous genes from phylogenetically distant eukaryotes is assessed within the framework of the protosplice site model. It is shown that protosplice sites are no more conserved during evolution of eukaryotic gene sequences than random sites. Simulation of intron insertion into protosplice sites with the observed protosplice site frequencies and intron densities shows that parallel gain can account but for a small fraction (5–10%) of shared intron positions in distantly related species. Thus, the presence of numerous introns in the same positions in orthologous genes from distant eukaryotes, such as animals, fungi and plants, appears to reflect mostly bona fide evolutionary conservation.
PMCID: PMC1069513  PMID: 15788746
17.  ‘Conserved hypothetical’ proteins: prioritization of targets for experimental study 
Nucleic Acids Research  2004;32(18):5452-5463.
Comparative genomics shows that a substantial fraction of the genes in sequenced genomes encodes ‘conserved hypothetical’ proteins, i.e. those that are found in organisms from several phylogenetic lineages but have not been functionally characterized. Here, we briefly discuss recent progress in functional characterization of prokaryotic ‘conserved hypothetical’ proteins and the possible criteria for prioritizing targets for experimental study. Based on these criteria, the chief one being wide phyletic spread, we offer two ‘top 10’ lists of highly attractive targets. The first list consists of proteins for which biochemical activity could be predicted with reasonable confidence but the biological function was predicted only in general terms, if at all (‘known unknowns’). The second list includes proteins for which there is no prediction of biochemical activity, even if, for some, general biological clues exist (‘unknown unknowns’). The experimental characterization of these and other ‘conserved hypothetical’ proteins is expected to reveal new, crucial aspects of microbial biology and could also lead to better functional prediction for medically relevant human homologs.
PMCID: PMC524295  PMID: 15479782
18.  Comparative genomics of the FtsK–HerA superfamily of pumping ATPases: implications for the origins of chromosome segregation, cell division and viral capsid packaging 
Nucleic Acids Research  2004;32(17):5260-5279.
Recently, it has been shown that a predicted P-loop ATPase (the HerA or MlaA protein), which is highly conserved in archaea and also present in many bacteria but absent in eukaryotes, has a bidirectional helicase activity and forms hexameric rings similar to those described for the TrwB ATPase. In this study, the FtsK–HerA superfamily of P-loop ATPases, in which the HerA clade comprises one of the major branches, is analyzed in detail. We show that, in addition to the FtsK and HerA clades, this superfamily includes several families of characterized or predicted ATPases which are predominantly involved in extrusion of DNA and peptides through membrane pores. The DNA-packaging ATPases of various bacteriophages and eukaryotic double-stranded DNA viruses also belong to the FtsK–HerA superfamily. The FtsK protein is the essential bacterial ATPase that is responsible for the correct segregation of daughter chromosomes during cell division. The structural and evolutionary relationship between HerA and FtsK and the nearly perfect complementarity of their phyletic distributions suggest that HerA similarly mediates DNA pumping into the progeny cells during archaeal cell division. It appears likely that the HerA and FtsK families diverged concomitantly with the archaeal–bacterial division and that the last universal common ancestor of modern life forms had an ancestral DNA-pumping ATPase that gave rise to these families. Furthermore, the relationship of these cellular proteins with the packaging ATPases of diverse DNA viruses suggests that a common DNA pumping mechanism might be operational in both cellular and viral genome segregation. The herA gene forms a highly conserved operon with the gene for the NurA nuclease and, in many archaea, also with the orthologs of eukaryotic double-strand break repair proteins MRE11 and Rad50. HerA is predicted to function in a complex with these proteins in DNA pumping and repair of double-stranded breaks introduced during this process and, possibly, also during DNA replication. Extensive comparative analysis of the ‘genomic context’ combined with in-depth sequence analysis led to the prediction of numerous previously unnoticed nucleases of the NurA superfamily, including a specific version that is likely to be the endonuclease component of a novel restriction-modification system. This analysis also led to the identification of previously uncharacterized nucleases, such as a novel predicted nuclease of the Sir2-type Rossmann fold, and phosphatases of the HAD superfamily that are likely to function as partners of the FtsK–HerA superfamily ATPases.
PMCID: PMC521647  PMID: 15466593
19.  Prevalence of intron gain over intron loss in the evolution of paralogous gene families 
Nucleic Acids Research  2004;32(12):3724-3733.
The mechanisms and evolutionary dynamics of intron insertion and loss in eukaryotic genes remain poorly understood. Reconstruction of parsimonious scenarios of gene structure evolution in paralogous gene families in animals and plants revealed numerous gains and losses of introns. In all analyzed lineages, the number of acquired new introns was substantially greater than the number of lost ancestral introns. This trend held even for lineages in which vertical evolution of genes involved more intron losses than gains, suggesting that gene duplication boosts intron insertion. However, dating gene duplications and the associated intron gains and losses based on the molecular clock assumption showed that very few, if any, introns were gained during the last ∼100 million years of animal and plant evolution, in agreement with previous conclusions reached through analysis of orthologous gene sets. These results are generally compatible with the emerging notion of intensive insertion and loss of introns during transitional epochs in contrast to the relative quiet of the intervening evolutionary spans.
PMCID: PMC484173  PMID: 15254274
20.  Comparative analysis of orthologous eukaryotic mRNAs: potential hidden functional signals 
Nucleic Acids Research  2004;32(5):1774-1782.
Sequencing of multiple, nearly complete eukaryotic genomes creates opportunities for detecting previously unnoticed, subtle functional signals in non-coding regions. A genome-wide comparative analysis of orthologous sets of mammalian and yeast mRNAs revealed distinct patterns of evolutionary conservation at the boundaries of the untranslated regions (UTRs) and the coding region (CDS). Elevated sequence conservation was detected in ∼30 nt regions around the start codon. There seems to be a complementary relationship between sequence conservation in the ∼30 nt regions of the 5′-UTR immediately upstream of the start codon and that in the synonymous positions of the 5′-terminal 30 nt of the CDS: in mammalian mRNAs, the 5′-UTR shows a greater conservation than the CDS, whereas the opposite trend holds for yeast mRNAs. Unexpectedly, a ∼30 nt region downstream of the stop codon shows a substantially lower level of sequence conservation than the downstream portions of the 3′-UTRs. However, the sequence in this poorly conserved 30 nt portion of the 3′-UTR is non-random in that it has a higher GC content than the rest of the UTR. It is hypothesized that the elevated sequence conservation in the region immediately upstream of the start codon is related to the requirement for initiation factor binding during pre-initiation ribosomal scanning. In contrast, the poorly conserved region downstream of the stop codon could be involved in the post- termination scanning and dissociation of the ribosomes from the mRNA, which requires only the mRNA–ribosome interaction. Additionally, it was found that the choice of the stop codon in mammals, but not in yeasts, and the context in the immediate vicinity of the stop codons in both mammals and yeasts are subject to strong selection. Thus, genome-wide analysis of orthologous gene sets allows detection of previously unrecognized patterns of sequence conservation, which are likely to reflect hidden functional signals, such as ribosomal filters that could regulate translation by modulating the interaction between the mRNA and ribosomes.
PMCID: PMC390323  PMID: 15031317
21.  Congruent evolution of different classes of non-coding DNA in prokaryotic genomes 
Nucleic Acids Research  2002;30(19):4264-4271.
Prokaryotic genomes are considered to be ‘wall-to-wall’ genomes, which consist largely of genes for proteins and structural RNAs, with only a small fraction of the genomic DNA allotted to intergenic regions, which are thought to typically contain regulatory signals. The majority of bacterial and archaeal genomes contain 6–14% non-coding DNA. Significant positive correlations were detected between the fraction of non-coding DNA and inter- and intra-operonic distances, suggesting that different classes of non-coding DNA evolve congruently. In contrast, no correlation was found between any of these characteristics of non-coding sequences and the number of genes or genome size. Thus, the non-coding regions and the gene sets in prokaryotes seem to evolve in different regimes. The evolution of non-coding regions appears to be determined primarily by the selective pressure to minimize the amount of non-functional DNA, while maintaining essential regulatory signals, because of which the content of non-coding DNA in different genomes is relatively uniform and intra- and inter-operonic non-coding regions evolve congruently. In contrast, the gene set is optimized for the particular environmental niche of the given microbe, which results in the lack of correlation between the gene number and the characteristics of non-coding regions.
PMCID: PMC140549  PMID: 12364605
22.  Connected gene neighborhoods in prokaryotic genomes 
Nucleic Acids Research  2002;30(10):2212-2223.
A computational method was developed for delineating connected gene neighborhoods in bacterial and archaeal genomes. These gene neighborhoods are not typically present, in their entirety, in any single genome, but are held together by overlapping, partially conserved gene arrays. The procedure was applied to comparing the orders of orthologous genes, which were extracted from the database of Clusters of Orthologous Groups of proteins (COGs), in 31 prokaryotic genomes and resulted in the identification of 188 clusters of gene arrays, which included 1001 of 2890 COGs. These clusters were projected onto actual genomes to produce extended neighborhoods including additional genes, which are adjacent to the genes from the clusters and are transcribed in the same direction, which resulted in a total of 2387 COGs being included in the neighborhoods. Most of the neighborhoods consist predominantly of genes united by a coherent functional theme, but also include a minority of genes without an obvious functional connection to the main theme. We hypothesize that although some of the latter genes might have unsuspected roles, others are maintained within gene arrays because of the advantage of expression at a level that is typical of the given neighborhood. We designate this phenomenon ‘genomic hitchhiking’. The largest neighborhood includes 79 genes (COGs) and consists of overlapping, rearranged ribosomal protein superoperons; apparent genome hitchhiking is particularly typical of this neighborhood and other neighborhoods that consist of genes coding for translation machinery components. Several neighborhoods involve previously undetected connections between genes, allowing new functional predictions. Gene neighborhoods appear to evolve via complex rearrangement, with different combinations of genes from a neighborhood fixed in different lineages.
PMCID: PMC115289  PMID: 12000841
23.  Comparative genomics and evolution of proteins involved in RNA metabolism 
Nucleic Acids Research  2002;30(7):1427-1464.
RNA metabolism, broadly defined as the compendium of all processes that involve RNA, including transcription, processing and modification of transcripts, translation, RNA degradation and its regulation, is the central and most evolutionarily conserved part of cell physiology. A comprehensive, genome-wide census of all enzymatic and non-enzymatic protein domains involved in RNA metabolism was conducted by using sequence profile analysis and structural comparisons. Proteins related to RNA metabolism comprise from 3 to 11% of the complete protein repertoire in bacteria, archaea and eukaryotes, with the greatest fraction seen in parasitic bacteria with small genomes. Approximately one-half of protein domains involved in RNA metabolism are present in most, if not all, species from all three primary kingdoms and are traceable to the last universal common ancestor (LUCA). The principal features of LUCA’s RNA metabolism system were reconstructed by parsimony-based evolutionary analysis of all relevant groups of orthologous proteins. This reconstruction shows that LUCA possessed not only the basal translation system, but also the principal forms of RNA modification, such as methylation, pseudouridylation and thiouridylation, as well as simple mechanisms for polyadenylation and RNA degradation. Some of these ancient domains form paralogous groups whose evolution can be traced back in time beyond LUCA, towards low-specificity proteins, which probably functioned as cofactors for ribozymes within the RNA world framework. The main lineage-specific innovations of RNA metabolism systems were identified. The most notable phase of innovation in RNA metabolism coincides with the advent of eukaryotes and was brought about by the merge of the archaeal and bacterial systems via mitochondrial endosymbiosis, but also involved emergence of several new, eukaryote-specific RNA-binding domains. Subsequent, vast expansions of these domains mark the origin of alternative splicing in animals and probably in plants. In addition to the reconstruction of the evolutionary history of RNA metabolism, this analysis produced numerous functional predictions, e.g. of previously undetected enzymes of RNA modification.
PMCID: PMC101826  PMID: 11917006
24.  A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis 
Nucleic Acids Research  2002;30(2):482-496.
During a systematic analysis of conserved gene context in prokaryotic genomes, a previously undetected, complex, partially conserved neighborhood consisting of more than 20 genes was discovered in most Archaea (with the exception of Thermoplasma acidophilum and Halobacterium NRC-1) and some bacteria, including the hyperthermophiles Thermotoga maritima and Aquifex aeolicus. The gene composition and gene order in this neighborhood vary greatly between species, but all versions have a stable, conserved core that consists of five genes. One of the core genes encodes a predicted DNA helicase, often fused to a predicted HD-superfamily hydrolase, and another encodes a RecB family exonuclease; three core genes remain uncharacterized, but one of these might encode a nuclease of a new family. Two more genes that belong to this neighborhood and are present in most of the genomes in which the neighborhood was detected encode, respectively, a predicted HD-superfamily hydrolase (possibly a nuclease) of a distinct family and a predicted, novel DNA polymerase. Another characteristic feature of this neighborhood is the expansion of a superfamily of paralogous, uncharacterized proteins, which are encoded by at least 20–30% of the genes in the neighborhood. The functional features of the proteins encoded in this neighborhood suggest that they comprise a previously undetected DNA repair system, which, to our knowledge, is the first repair system largely specific for thermophiles to be identified. This hypothetical repair system might be functionally analogous to the bacterial–eukaryotic system of translesion, mutagenic repair whose central components are DNA polymerases of the UmuC-DinB-Rad30-Rev1 superfamily, which typically are missing in thermophiles.
PMCID: PMC99818  PMID: 11788711
25.  Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements 
Nucleic Acids Research  2001;29(14):2994-3005.
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the ‘baseline’ version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence’s amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
PMCID: PMC55814  PMID: 11452024

Results 1-25 (29)