1.  Epigenetic hereditary transcription profiles II, aging revisited 
Biology Direct  2007;2:39.
Previously, we have shown that deviations from the average transcription profile of a group of functionally related genes can be epigenetically transmitted to daughter cells, thereby implicating nuclear programming as the cause. As a first step in further characterizing this phenomenon it was necessary to determine to what extent such deviations occur in non-tumorigenic tissues derived from normal individuals. To this end, a microarray database derived from 90 human donors aged between 22 and 87 years was used to study deviations from the average transcription profile of the proteasome genes.
Increase in donor age was found to correlate with a decrease in deviations from the general transcription profile with this decline being gender-specific. The age-related index declined at a faster rate for males although it started from a higher level. Additionally, transcription profiles from similar tissues were more alike than those from different tissues, indicating that deviations arise during differentiation.
These findings suggest that aging and differentiation are related to epigenetic changes that alter the transcription profile of proteasomal genes. Since alterations in the structure and function of the proteasome are unlikely, such changes appear to occur without concomitant change in gene function.
These findings, if confirmed, may have a significant impact on our understanding of the aging process.
Open peer review
This article was reviewed by Nathan Bowen (nominated by I. King Jordan), Timothy E. Reddy (nominated by Charles DeLisi) and by Martijn Huynen. For the full reviews, please go to the Reviewers' comments section.
PMCID: PMC2265679  PMID: 18163906
2.  Orthologs of the small RPB8 subunit of the eukaryotic RNA polymerases are conserved in hyperthermophilic Crenarchaeota and "Korarchaeota" 
Biology Direct  2007;2:38.
Although most of the key components of the transcription apparatus, and in particular, RNA polymerase (RNAP) subunits, are conserved between archaea and eukaryotes, no archaeal homologs of the small RPB8 subunit of eukaryotic RNAP have been detected. We report that orthologs of RPB8 are encoded in all sequenced genomes of hyperthermophilic Crenarchaeota and a recently sequenced "korarchaeal" genome, but not in Euryarchaeota or the mesophilic crenarchaeon Cenarchaeum symbiosum. These findings suggest that all 12 core subunits of eukaryotic RNAPs were already present in the last common ancestor of the extant archaea.
This article was reviewed by Purificacion Lopez-Garcia and Chris Ponting.
PMCID: PMC2234397  PMID: 18081935
3.  Hox, Wnt, and the evolution of the primary body axis: insights from the early-divergent phyla 
Biology Direct  2007;2:37.
The subkingdom Bilateria encompasses the overwhelming majority of animals, including all but four early-branching phyla: Porifera, Ctenophora, Placozoa, and Cnidaria. On average, these early-branching phyla have fewer cell types, tissues, and organs, and are considered to be significantly less specialized along their primary body axis. As such, they present an attractive outgroup from which to investigate how evolutionary changes in the genetic toolkit may have contributed to the emergence of the complex animal body plans of the Bilateria. This review offers an up-to-date glimpse of genome-scale comparisons between bilaterians and these early-diverging taxa. Specifically, we examine these data in the context of how they may explain the evolutionary development of primary body axes and axial symmetry across the Metazoa. Next, we re-evaluate the validity and evolutionary genomic relevance of the zootype hypothesis, which defines an animal by a specific spatial pattern of gene expression. Finally, we extend the hypothesis that Wnt genes may be the earliest primary body axis patterning mechanism by suggesting that Hox genes were co-opted into this patterning network prior to the last common ancestor of cnidarians and bilaterians.
Reviewed by Pierre Pontarotti, Gáspár Jékely, and L Aravind. For the full reviews, please go to the Reviewers' comments section.
PMCID: PMC2222619  PMID: 18078518
4.  Evolutionary history of bacteriophages with double-stranded DNA genomes 
Biology Direct  2007;2:36.
Reconstructing the evolutionary history of bacteriophages is a difficult problem because of fast sequence drift and the lack of omnipresent genes in phage genomes. Moreover, losses and recombinational exchanges of genes are so pervasive in phages that the plausibility of phylogenetic inference in the phage kingdom has been questioned.
We compiled the profiles of presence and absence of 803 orthologous genes in 158 completely sequenced phages with double-stranded DNA genomes and used these gene content vectors to infer the evolutionary history of phages. There were 18 well-supported clades, mostly corresponding to accepted genera, but in some cases appearing to define new taxonomic groups. Conflicts between this phylogeny and trees constructed from sequence alignments of phage proteins were exploited to infer 294 specific acts of intergenome gene transfer.
A notoriously reticulate evolutionary history of fast-evolving phages can be reconstructed in considerable detail by quantitative comparative genomics.
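The gene-content comparison described in this abstract can be illustrated with a minimal sketch. It assumes a Jaccard distance between presence/absence sets, which the abstract does not specify, and the phage names and gene family labels are hypothetical toy data rather than anything from the study.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |shared| / |union| for two sets of gene families."""
    union = a | b
    if not union:
        return 0.0  # two empty genomes are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(content):
    """Pairwise distances keyed by alphabetically ordered name pairs."""
    return {
        (x, y): jaccard_distance(content[x], content[y])
        for x, y in combinations(sorted(content), 2)
    }


# Hypothetical toy data: phage name -> orthologous gene families present.
gene_content = {
    "phageA": {"g1", "g2", "g3", "g4"},
    "phageB": {"g1", "g2", "g3"},
    "phageC": {"g5", "g6"},
}

distances = distance_matrix(gene_content)
for pair, d in sorted(distances.items()):
    print(pair, round(d, 3))
```

A distance matrix of this kind could then be passed to any standard tree-building method; which method the authors used is not stated in the abstract.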
Open peer review
This article was reviewed by Eugene Koonin, Nicholas Galtier and Martijn Huynen.
PMCID: PMC2222618  PMID: 18062816
5.  Exosomal transfer of proteins and RNAs at synapses in the nervous system 
Biology Direct  2007;2:35.
Many cell types have been reported to secrete small vesicles called exosomes, that are derived from multivesicular bodies and that can also form from endocytic-like lipid raft domains of the plasma membrane. Secretory exosomes contain a characteristic composition of proteins, and a recent report indicates that mast cell exosomes harbor a variety of mRNAs and microRNAs as well. Exosomes express cell recognition molecules on their surface that facilitate their selective targeting and uptake into recipient cells.
In this review, I suggest that exosomal secretion of proteins and RNAs may be a fundamental mode of communication within the nervous system, supplementing the known mechanisms of anterograde and retrograde signaling across synapses. In one specific scenario, exosomes are proposed to bud from the lipid raft region of the postsynaptic membrane adjacent to the postsynaptic density, in a manner that is stimulated by stimuli that elicit long-term potentiation. The exosomes would then transfer newly synthesized synaptic proteins (such as CAM kinase II alpha) and synaptic RNAs to the presynaptic terminal, where they would contribute to synaptic plasticity.
The model is consistent with the known cellular and molecular features of synaptic neurobiology and makes a number of predictions that can be tested in vitro and in vivo.
Open peer review
Reviewed by Etienne Joly, Gaspar Jekely, Juergen Brosius and Eugene Koonin. For the full reviews, please go to the Reviewers' comments section.
PMCID: PMC2219957  PMID: 18053135
6.  Systematic analysis of mRNA 5' coding sequence incompleteness in Danio rerio: an automated EST-based approach 
Biology Direct  2007;2:34.
All standard methods for cDNA cloning are affected by a potential inability to effectively clone the 5' region of mRNA. The aim of this work was to estimate mRNA open reading frame (ORF) 5' region sequence completeness in the model organism Danio rerio (zebrafish).
We implemented a novel automated approach (5'_ORF_Extender) that systematically compares available expressed sequence tags (ESTs) with all the zebrafish experimentally determined mRNA sequences, identifies additional sequence stretches at 5' region and scans for the presence of all conditions needed to define a new, extended putative ORF. Our software was able to identify 285 (3.3%) mRNAs with putatively incomplete ORFs at 5' region and, in three example cases selected (selt1a, unc119.2, nppa), the extended coding region at 5' end was cloned by reverse transcription-polymerase chain reaction (RT-PCR).
The implemented method, which could also be useful for the analysis of other genomes, allowed us to describe the relevance of the "5' end mRNA artifact" problem for genomic annotation and functional genomic experiment design in zebrafish.
Open peer review
This article was reviewed by Alexey V. Kochetov (nominated by Mikhail Gelfand), Shamil Sunyaev, and Gáspár Jékely. For the full reviews, please go to the Reviewers' Comments section.
PMCID: PMC2222617  PMID: 18042283
7.  Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea 
Biology Direct  2007;2:33.
An evolutionary classification of genes from sequenced genomes that distinguishes between orthologs and paralogs is indispensable for genome annotation and evolutionary reconstruction. Shortly after multiple genome sequences of bacteria, archaea, and unicellular eukaryotes became available, an attempt at such a classification was implemented in Clusters of Orthologous Groups of proteins (COGs). Rapid accumulation of genome sequences creates opportunities for refining COGs but also represents a challenge because of error amplification. One of the practical strategies involves construction of refined COGs for phylogenetically compact subsets of genomes.
New Archaeal Clusters of Orthologous Genes (arCOGs) were constructed for 41 archaeal genomes (13 Crenarchaeota, 27 Euryarchaeota and one Nanoarchaeon) using an improved procedure that employs a similarity tree between smaller, group-specific clusters, semi-automatically partitions orthology domains in multidomain proteins, and uses profile searches for identification of remote orthologs. The annotation of arCOGs is a consensus between three assignments based on the COGs, the CDD database, and the annotations of homologs in the NR database. The 7538 arCOGs, on average, cover ~88% of the genes in a genome compared to a ~76% coverage in COGs. The finer granularity of ortholog identification in the arCOGs is apparent from the fact that 4538 arCOGs correspond to 2362 COGs; ~40% of the arCOGs are new. The archaeal gene core (protein-coding genes found in all 41 genomes) consists of 166 arCOGs. The arCOGs were used to reconstruct gene loss and gene gain events during archaeal evolution and gene sets of ancestral forms. The Last Archaeal Common Ancestor (LACA) is conservatively estimated to possess 996 genes compared to 1245 and 1335 genes for the last common ancestors of Crenarchaeota and Euryarchaeota, respectively. It is inferred that LACA was a chemoautotrophic hyperthermophile that, in addition to the core archaeal functions, encoded more idiosyncratic systems, e.g., the CASS systems of antivirus defense and some toxin-antitoxin systems.
The arCOGs provide a convenient, flexible framework for functional annotation of archaeal genomes, comparative genomics and evolutionary reconstructions. Genomic reconstructions suggest that the last common ancestor of archaea might have been (nearly) as advanced as the modern archaeal hyperthermophiles. ArCOGs and related information are available at: .
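Two quantities quoted in this abstract, the per-genome coverage and the gene core, reduce to simple set operations once each gene has a cluster assignment. The sketch below is illustrative only: the genome names, gene identifiers and cluster labels are hypothetical, and it says nothing about how the arCOGs themselves were built.

```python
def coverage(assignments):
    """Fraction of genes in one genome that belong to any cluster."""
    if not assignments:
        return 0.0
    assigned = sum(cluster is not None for cluster in assignments.values())
    return assigned / len(assignments)


def core_clusters(genomes):
    """Clusters represented in every genome (the 'gene core')."""
    per_genome = [
        {c for c in assignments.values() if c is not None}
        for assignments in genomes.values()
    ]
    return set.intersection(*per_genome) if per_genome else set()


# Hypothetical toy data: genome -> {gene id: cluster label or None}.
genomes = {
    "genome1": {"a1": "arCOG1", "a2": "arCOG2", "a3": None, "a4": "arCOG3"},
    "genome2": {"b1": "arCOG1", "b2": "arCOG3"},
    "genome3": {"c1": "arCOG1", "c2": "arCOG2", "c3": "arCOG3", "c4": None},
}

core = core_clusters(genomes)
mean_coverage = sum(coverage(g) for g in genomes.values()) / len(genomes)
print(sorted(core), round(mean_coverage, 3))
```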
This article was reviewed by Peer Bork, Patrick Forterre, and Purificacion Lopez-Garcia.
PMCID: PMC2222616  PMID: 18042280
8.  Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins 
Biology Direct  2007;2:32.
The evolution of the full repertoire of proteins encoded in a given genome is mostly driven by gene duplications, deletions, and sequence modifications of existing proteins. Indirect information about relative rates and other intrinsic parameters of these three basic processes is contained in the proteome-wide distribution of sequence identities of pairs of paralogous proteins.
We introduce a simple mathematical framework based on a stochastic birth-and-death model that allows one to extract some of this information and apply it to the set of all pairs of paralogous proteins in H. pylori, E. coli, S. cerevisiae, C. elegans, D. melanogaster, and H. sapiens. It was found that the histogram of sequence identities p generated by an all-to-all alignment of all protein sequences encoded in a genome is well fitted with a power-law form ~ p^(-γ) with the value of the exponent γ around 4 for the majority of organisms used in this study. This implies that the intra-protein variability of substitution rates is best described by the Gamma-distribution with the exponent α ≈ 0.33. Different features of the shape of such histograms allow us to quantify the ratio between the genome-wide average deletion/duplication rates and the amino-acid substitution rate.
We separately measure the short-term ("raw") duplication and deletion rates r*_dup and r*_del, which include gene copies that will be removed soon after the duplication event, and their dramatically reduced long-term counterparts r_dup and r_del. High deletion rate among recently duplicated proteins is consistent with a scenario in which they didn't have enough time to significantly change their functional roles and thus are to a large degree disposable. Systematic trends of each of the four duplication/deletion rates with the total number of genes in the genome were analyzed. All but the deletion rate of recent duplicates r*_del were shown to systematically increase with N_genes. Abnormally flat shapes of sequence identity histograms observed for yeast and human are consistent with lineages leading to these organisms undergoing one or more whole-genome duplications. This interpretation is corroborated by our analysis of the genome of Paramecium tetraurelia where the p^(-4) profile of the histogram is gradually restored by the successive removal of paralogs generated in its four known whole-genome duplication events.
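As an illustration of the power-law form of the identity histogram mentioned in this abstract, the exponent can be estimated from binned counts by a straight-line fit in log-log space. This is a minimal sketch under stated assumptions: the abstract does not say how the authors fitted their histograms, log-log regression is known to be biased on noisy real data, and the bin centres and counts below are synthetic values generated with an exponent of exactly 4.

```python
import math


def fit_power_law_exponent(identities, counts):
    """Estimate gamma in count ~ p**(-gamma) by least squares on logs."""
    pairs = [
        (math.log(p), math.log(n))
        for p, n in zip(identities, counts)
        if p > 0 and n > 0  # logs are undefined for empty bins
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two non-empty bins")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return -sxy / sxx  # slope is -gamma


# Synthetic histogram: bin centres of sequence identity and ideal counts.
identities = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
counts = [1000.0 * p ** -4 for p in identities]

gamma = fit_power_law_exponent(identities, counts)
print(round(gamma, 6))
```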
PMCID: PMC2246104  PMID: 18039386
9.  Evaluating the protein coding potential of exonized transposable element sequences 
Biology Direct  2007;2:31.
Transposable element (TE) sequences, once thought to be merely selfish or parasitic members of the genomic community, have been shown to contribute a wide variety of functional sequences to their host genomes. Analysis of complete genome sequences has turned up numerous cases where TE sequences have been incorporated as exons into mRNAs, and it is widely assumed that such 'exonized' TEs encode protein sequences. However, the extent to which TE-derived sequences actually encode proteins is unknown and a matter of some controversy. We have tried to address this outstanding issue from two perspectives: (i) by evaluating ascertainment biases related to the search methods used to uncover TE-derived protein coding sequences (CDS) and (ii) through a probabilistic codon-frequency based analysis of the protein coding potential of TE-derived exons.
We compared the ability of three classes of sequence similarity search methods to detect TE-derived sequences among data sets of experimentally characterized proteins: (1) a profile-based hidden Markov model (HMM) approach, (2) BLAST methods and (3) RepeatMasker. Profile-based methods are more sensitive and more selective than the other methods evaluated. However, the application of profile-based search methods to the detection of TE-derived sequences among well-curated experimentally characterized protein data sets did not turn up many more cases than had been previously detected and nowhere near as many cases as recent genome-wide searches have. We observed that the different search methods used were complementary in the sense that they yielded largely non-overlapping sets of hits and differed in their ability to recover known cases of TE-derived CDS. The probabilistic analysis of TE-derived exon sequences indicates that these sequences have low protein coding potential on average. In particular, non-autonomous TEs that do not encode protein sequences, such as Alu elements, are frequently exonized but unlikely to encode protein sequences.
The exaptation of the numerous TE sequences found in exons as bona fide protein coding sequences may prove to be far less common than has been suggested by the analysis of complete genomes. We hypothesize that many exonized TE sequences actually function as post-transcriptional regulators of gene expression, rather than coding sequences, which may act through a variety of double stranded RNA related regulatory pathways. Indeed, their relatively high copy numbers and similarity to sequences dispersed throughout the genome suggest that exonized TE sequences could serve as master regulators with a wide scope of regulatory influence.
This article was reviewed by Itai Yanai, Kateryna D. Makova, Melissa Wilson (nominated by Kateryna D. Makova) and Cedric Feschotte (nominated by John M. Logsdon Jr.).
PMCID: PMC2203978  PMID: 18036258
10.  The new biology: beyond the Modern Synthesis 
Biology Direct  2007;2:30.
The last third of the 20th Century featured an accumulation of research findings that severely challenged the assumptions of the "Modern Synthesis" which provided the foundations for most biological research during that century. The foundations of that "Modernist" biology had thus largely crumbled by the start of the 21st Century. This in turn raises the question of foundations for biology in the 21st Century.
Like the physical sciences in the first half of the 20th Century, biology at the start of the 21st Century is achieving a substantive maturity of theory, experimental tools, and fundamental findings thanks to relatively secure foundations in genomics. Genomics has also forced biologists to connect evolutionary and molecular biology, because these formerly Balkanized disciplines have been brought together as actors on the genomic stage. Biologists are now addressing the evolution of genetic systems using more than the concepts of population biology alone, and the problems of cell biology using more than the tools of biochemistry and molecular biology alone. It is becoming increasingly clear that solutions to such basic problems as aging, sex, development, and genome size potentially involve elements of biological science at every level of organization, from molecule to population. The new biology knits together genomics, bioinformatics, evolutionary genetics, and other such general-purpose tools to supply novel explanations for the paradoxes that undermined Modernist biology.
Open Peer Reviewers
This article was reviewed by W.F. Doolittle, E.V. Koonin, and J.M. Logsdon. For the full reviews, please go to the Reviewers' Comments section.
PMCID: PMC2222615  PMID: 18036242
11.  Opening Pandora's Box: making biological discoveries through computational data exploration 
Biology Direct  2007;2:29.
PMCID: PMC2092420  PMID: 18028542
12.  Natural variation in SAR11 marine bacterioplankton genomes inferred from metagenomic data 
Biology Direct  2007;2:27.
One objective of metagenomics is to reconstruct information about specific uncultured organisms from fragmentary environmental DNA sequences. We used the genome of an isolate of the marine alphaproteobacterium SAR11 ('Candidatus Pelagibacter ubique'; strain HTCC1062), obtained from the cold, productive Oregon coast, as a query sequence to study variation in SAR11 metagenome sequence data from the Sargasso Sea, a warm, oligotrophic ocean gyre.
The average amino acid identity of SAR11 genes encoded by the metagenomic data to the query genome was only 71%, indicating significant evolutionary divergence between the coastal isolates and Sargasso Sea populations. However, an analysis of gene neighbors indicated that SAR11 genes in the Sargasso Sea metagenomic data match the gene order of the HTCC1062 genome in 96% of cases (> 85,000 observations), and that rearrangements are most frequent at predicted operon boundaries. There were no conserved examples of genes with known functions being found in the coastal isolates, but not the Sargasso Sea metagenomic data, or vice versa, suggesting that core regions of these diverse SAR11 genomes are relatively conserved in gene content. However, four hypervariable regions were observed, which may encode properties associated with variation in SAR11 ecotypes. The largest of these, HVR2, is a 48 kb region flanked by the sole 5S and 23S genes in the HTCC1062 genome, and mainly encodes genes that determine cell surface properties. A comparison of two closely related 'Candidatus Pelagibacter' genomes (HTCC1062 and HTCC1002) revealed a number of "gene indels" in core regions. Most of these were found to be polymorphic in the metagenomic data and showed evidence of purifying selection, suggesting that the same "polymorphic gene indels" are maintained in physically isolated SAR11 populations.
These findings suggest that natural selection has conserved many core features of SAR11 genomes across broad oceanic scales, but significant variation was found associated with four hypervariable genome regions. The data also led to the hypothesis that some gene insertions and deletions might be polymorphisms, similar to allelic polymorphisms.
PMCID: PMC2217521  PMID: 17988398
13.  Revisiting adverse effects of cross-hybridization in Affymetrix gene expression data: do they matter for correlation analysis? 
Biology Direct  2007;2:28.
This work was undertaken in response to a recently published paper by Okoniewski and Miller (BMC Bioinformatics 2006, 7: Article 276). The authors of that paper came to the conclusion that the process of multiple targeting in short oligonucleotide microarrays induces spurious correlations and this effect may deteriorate the inference on correlation coefficients. The design of their study and supporting simulations cast serious doubt upon the validity of this conclusion. The work by Okoniewski and Miller drove us to revisit the issue by means of experimentation with biological data and probabilistic modeling of cross-hybridization effects.
We have identified two serious flaws in the study by Okoniewski and Miller: (1) The data used in their paper are not amenable to correlation analysis; (2) The proposed simulation model is inadequate for studying the effects of cross-hybridization. Using two other data sets, we have shown that removing multiply targeted probe sets does not lead to a shift in the histogram of sample correlation coefficients towards smaller values. A more realistic approach to mathematical modeling of cross-hybridization demonstrates that this process is by far more complex than the simplistic model considered by the authors. A diversity of correlation effects (such as the induction of positive or negative correlations) caused by cross-hybridization can be expected in theory but there are natural limitations on the ability to provide quantitative insights into such effects due to the fact that they are not directly observable.
The proposed stochastic model is instrumental in studying general regularities in hybridization interaction between probe sets in microarray data. As the problem stands now, there is no compelling reason to believe that multiple targeting causes a large-scale effect on the correlation structure of Affymetrix gene expression data. Our analysis suggests that the observed long-range correlations in microarray data are of a biological nature rather than a technological flaw.
The paper was reviewed by I. K. Jordan, D. P. Gaile (nominated by E. Koonin), and W. Huber (nominated by S. Dudoit).
PMCID: PMC2211459  PMID: 17988401
14.  Calibrating E-values for MS2 database search methods 
Biology Direct  2007;2:26.
The key to mass-spectrometry-based proteomics is peptide identification, which relies on software analysis of tandem mass spectra. Although each search engine has its strength, combining the strengths of various search engines is not yet realizable largely due to the lack of a unified statistical framework that is applicable to any method.
We have developed a universal scheme for statistical calibration of peptide identifications. The protocol can be used for both de novo approaches and database search methods. We demonstrate the protocol using only the database search methods. Seven methods were calibrated: SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X!Tandem (v1.0), OMSSA (v2.0) and RAId_DbS. Except for X!Tandem and RAId_DbS, most methods require a rescaling according to the database size searched. We demonstrate that our calibration protocol indeed produces unified statistics both in terms of average number of false positives and in terms of the probability for a peptide hit to be a true positive. Although both the protocols for calibration and the statistics thus calibrated are universal, the calibration formulas obtained from one laboratory with data collected using either centroid or profile format may not be directly usable by the other laboratories. Thus each laboratory is encouraged to calibrate the search methods it intends to use. We also address the importance of using spectrum-specific statistics and possible improvement on the current calibration protocol. The spectra used for statistical (E-value) calibration are freely available upon request.
Open peer review
Reviewed by Dongxiao Zhu (nominated by Arcady Mushegian), Alexey Nesvizhskii (nominated by King Jordan) and Vineet Bafna. For the full reviews, please go to the Reviewers' comments section.
PMCID: PMC2206012  PMID: 17983478
15.  RAId_DbS: Peptide Identification using Database Searches with Realistic Statistics 
Biology Direct  2007;2:25.
The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.
Using a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Student's t-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The t-tests may be used to measure the reliability of reported statistics. When combined with the reported P-value for a peptide hit using a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.
PMCID: PMC2211744  PMID: 17961253
16.  Evolution of the genetic code: partial optimization of a random code for robustness to translation error in a rugged fitness landscape 
Biology Direct  2007;2:24.
The standard genetic code table has a distinctly non-random structure, with similar amino acids often encoded by codon series that differ by a single nucleotide substitution, typically, in the third or the first position of the codon. It has been repeatedly argued that this structure of the code results from selective optimization for robustness to translation errors such that translational misreading has the minimal adverse effect. Indeed, it has been shown in several studies that the standard code is more robust than a substantial majority of random codes. However, it remains unclear how much evolution the standard code underwent, what is the level of optimization, and what is the likely starting point.
We explored possible evolutionary trajectories of the genetic code within a limited domain of the vast space of possible codes. Only those codes were analyzed for robustness to translation error that possess the same block structure and the same degree of degeneracy as the standard code. This choice of a small part of the vast space of possible codes is based on the notion that the block structure of the standard code is a consequence of the structure of the complex between the cognate tRNA and the codon in mRNA where the third base of the codon plays a minimum role as a specificity determinant. Within this part of the fitness landscape, a simple evolutionary algorithm, with elementary evolutionary steps comprising swaps of four-codon or two-codon series, was employed to investigate the optimization of codes for the maximum attainable robustness. The properties of the standard code were compared to the properties of four sets of codes, namely, purely random codes, random codes that are more robust than the standard code, and two sets of codes that resulted from optimization of the first two sets. The comparison of these sets of codes with the standard code and its locally optimized version showed that, on average, optimization of random codes yielded evolutionary trajectories that converged at the same level of robustness to translation errors as the optimization path of the standard code; however, the standard code required considerably fewer steps to reach that level than an average random code. When evolution starts from random codes whose fitness is comparable to that of the standard code, they typically reach a much higher level of optimization than the standard code, i.e., the standard code is much closer to its local minimum (fitness peak) than most of the random codes with similar levels of robustness.
Thus, the standard genetic code appears to be a point on an evolutionary trajectory from a random point (code) about half the way to the summit of the local peak. The fitness landscape of code evolution appears to be extremely rugged, containing numerous peaks with a broad distribution of heights, and the standard code is relatively unremarkable, being located on the slope of a moderate-height peak.
The standard code appears to be the result of partial optimization of a random code for robustness to errors of translation. The reason the code is not fully optimized could be the trade-off between the beneficial effect of increasing robustness to translation errors and the deleterious effect of codon series reassignment that becomes increasingly severe with growing complexity of the evolving system. Thus, evolution of the code can be represented as a combination of adaptation and frozen accident.
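The swap-based optimization described in this abstract can be illustrated with a toy greedy hill-climb. Everything in the sketch is an assumption made for illustration: a "code" is reduced to a list of numeric amino-acid property values placed in a row of blocks, the cost is the sum of squared differences between neighbouring blocks (a stand-in for the real error-cost function, which the abstract does not give), and each step applies the single pairwise swap that lowers the cost most.

```python
from itertools import combinations


def cost(code):
    """Toy error cost: squared property difference between adjacent blocks."""
    return sum((a - b) ** 2 for a, b in zip(code, code[1:]))


def optimize_by_swaps(code):
    """Greedy hill-climb: apply the best improving swap until none exists.

    Returns the locally optimal arrangement and the number of swaps used.
    """
    code = list(code)
    steps = 0
    while True:
        best_swap, best_cost = None, cost(code)
        for i, j in combinations(range(len(code)), 2):
            code[i], code[j] = code[j], code[i]  # try the swap
            trial = cost(code)
            code[i], code[j] = code[j], code[i]  # undo it
            if trial < best_cost:
                best_swap, best_cost = (i, j), trial
        if best_swap is None:
            return code, steps  # local minimum reached
        i, j = best_swap
        code[i], code[j] = code[j], code[i]
        steps += 1


# Hypothetical starting arrangement of property values.
start = [5, 1, 4, 2, 3]
optimized, steps = optimize_by_swaps(start)
print(optimized, cost(optimized), steps)
```

Comparing the number of swaps needed from different starting arrangements mirrors, in a very reduced form, the step counts the abstract reports for the standard code versus random codes.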
Open Peer Review
This article was reviewed by David Ardell, Allan Drummond (nominated by Laura Landweber), and Rob Knight.
PMCID: PMC2211284  PMID: 17956616
17.  Gene-interleaving patterns of synteny in the Saccharomyces cerevisiae genome: are they proof of an ancient genome duplication event? 
Biology Direct  2007;2:23.
Recent comparative genomic studies claim local syntenic gene-interleaving relationships in Ashbya gossypii and Kluyveromyces waltii are compelling evidence for an ancient whole-genome duplication event in Saccharomyces cerevisiae. We here test, using Hannenhalli-Pevzner rearrangement algorithms that address the multiple genome rearrangement problem, whether syntenic patterns are proof of paleopolyploidization.
We focus on (1) pairwise comparison of gene arrangement sequences in A. gossypii and S. cerevisiae, (2) reconstruction of gene arrangements ancestral to A. gossypii, S. cerevisiae, and K. waltii, (3) synteny patterns arising within and between lineages, and (4) expected gene orientation of duplicate gene sets. The existence of syntenic patterns between ancestral gene sets and A. gossypii, S. cerevisiae, and K. waltii, and other evidence, suggests that gene-interleaving relationships are the natural consequence of topological rearrangements in chromosomes and that a more gradual scenario of genome evolution involving segmental duplication and recombination constitutes a more parsimonious explanation. Furthermore, phylogenetic trees reconstructed under alternative hypotheses placed the putative whole-genome duplication event after the divergence of the S. cerevisiae and K. waltii lineages, but in the lineage leading to K. waltii. This is clearly incompatible with an ancient genome duplication event in S. cerevisiae.
Because the presence of syntenic patterns appears to be a condition that is necessary, but not sufficient, to support the existence of the whole-genome duplication event, our results prompt careful re-evaluation of paleopolyploidization in the yeast lineage and the evolutionary meaning of syntenic patterns.
This article was reviewed by Kenneth H. Wolfe (nominated by Nicolas Galtier), Austin L. Hughes (nominated by Eugene Koonin), Mikhail S. Gelfand, and Mark Gerstein.
PMCID: PMC2134927  PMID: 17894859
18.  A statistical analysis of the three-fold evolution of genomic compression through frame overlaps in prokaryotes 
Biology Direct  2007;2:22.
Among microbial genomes, genetic information is frequently compressed, exploiting redundancies in the genetic code in order to store information in overlapping genes. We investigate the length, phase and orientation properties of overlap in 58 prokaryotic species evaluating neutral and selective mechanisms of evolution.
Using a variety of statistical null models we find patterns of compressive coding that cannot be explained purely in terms of the selective processes favoring genome minimization or translational coupling. The distribution of overlap lengths follows a fat-tailed distribution, in which a significant proportion of overlaps are in excess of 100 base pairs in length. The phase of overlap – pairing of codon positions in complementary reading frames – is strongly predicted by the translation orientation of each gene. We find that as overlapping genes become longer, they have a tendency to alternate among alternative overlap phases. Some phases seem to reflect codon pairings reducing the probability of non-synonymous substitution. We analyze the lineage-dependent features of overlapping genes by tracing a number of different continuous characters through the prokaryotic phylogeny using squared-change parsimony and observe both clade-specific and species-specific patterns.
Overlapping reading frames preserve in their structure features relating to the mutational origination of new genes, but have undergone modification both for immediate benefits and for variational buffering and amplification. Genomes come under a variety of different mutational and selectional pressures, and the structure of redundancies in overlapping genes can be used to detect these pressures. No single mechanism is able to account for all the variability observed among the set of prokaryotic overlapping genes, but a three-fold analysis of evolutionary events provides a more integrative framework.
This article was reviewed by Eugene Koonin, Martijn Huynen, and Han Liang.
PMCID: PMC2174442  PMID: 17877818
19.  The Biological Big Bang model for the major transitions in evolution 
Biology Direct  2007;2:21.
Major transitions in biological evolution show the same pattern of sudden emergence of diverse forms at a new level of complexity. The relationships between major groups within an emergent new class of biological entities are hard to decipher and do not seem to fit the tree pattern that, following Darwin's original proposal, remains the dominant description of biological evolution. The cases in point include the origin of complex RNA molecules and protein folds; major groups of viruses; archaea and bacteria, and the principal lineages within each of these prokaryotic domains; eukaryotic supergroups; and animal phyla. In each of these pivotal nexuses in life's history, the principal "types" seem to appear rapidly and fully equipped with the signature features of the respective new level of biological organization. No intermediate "grades" or intermediate forms between different types are detectable. Usually, this pattern is attributed to cladogenesis compressed in time, combined with the inevitable erosion of the phylogenetic signal.
I propose that most or all major evolutionary transitions that show the "explosive" pattern of emergence of new types of biological entities correspond to a boundary between two qualitatively distinct evolutionary phases. The first, inflationary phase is characterized by extremely rapid evolution driven by various processes of genetic information exchange, such as horizontal gene transfer, recombination, fusion, fission, and spread of mobile elements. These processes give rise to a vast diversity of forms from which the main classes of entities at the new level of complexity emerge independently, through a sampling process. In the second phase, evolution dramatically slows down, the respective process of genetic information exchange tapers off, and multiple lineages of the new type of entities emerge, each of them evolving in a tree-like fashion from that point on. This biphasic model of evolution incorporates the previously developed concepts of the emergence of protein folds by recombination of small structural units and origin of viruses and cells from a pre-cellular compartmentalized pool of recombining genetic elements. The model is extended to encompass other major transitions. It is proposed that bacterial and archaeal phyla emerged independently from two distinct populations of primordial cells that, originally, possessed leaky membranes, which made the cells prone to rampant gene exchange; and that the eukaryotic supergroups emerged through distinct, secondary endosymbiotic events (as opposed to the primary, mitochondrial endosymbiosis). This biphasic model of evolution is substantially analogous to the scenario of the origin of universes in the eternal inflation version of modern cosmology. Under this model, universes like ours emerge in the infinite multiverse when the eternal process of exponential expansion, known as inflation, ceases in a particular region as a result of false vacuum decay, a first order phase transition process. 
The result is the nucleation of a new universe, which is traditionally denoted Big Bang, although this scenario is radically different from the Big Bang of the traditional model of an expanding universe. Hence I denote the phase transitions at the end of each inflationary epoch in the history of life Biological Big Bangs (BBB).
A Biological Big Bang (BBB) model is proposed for the major transitions in life's evolution. According to this model, each transition is a BBB such that new classes of biological entities emerge at the end of a rapid phase of evolution (inflation) that is characterized by extensive exchange of genetic information which takes distinct forms for different BBBs. The major types of new forms emerge independently, via a sampling process, from the pool of recombining entities of the preceding generation. This process is envisaged as being qualitatively different from tree-pattern cladogenesis.
This article was reviewed by William Martin, Sergei Maslov, and Leonid Mirny.
PMCID: PMC1973067  PMID: 17708768
20.  Extensive parallelism in protein evolution 
Biology Direct  2007;2:20.
Independently evolving lineages mostly accumulate different changes, which leads to their gradual divergence. However, parallel accumulation of identical changes is also common, especially in traits with only a small number of possible states.
We characterize parallelism in evolution of coding sequences in three four-species sets of genomes of mammals, Drosophila, and yeasts. Each such set contains two independent evolutionary paths, which we call paths I and II. An amino acid replacement which occurred along path I also occurs along path II with the probability 50–80% of that expected under selective neutrality. Thus, the per site rate of parallel evolution of proteins is several times higher than their average rate of evolution, but still lower than the rate of evolution of neutral sequences. This deficit may be caused by changes in the fitness landscape, leading to a replacement being possible along path I but not along path II. However, constant, weak selection assumed by the nearly neutral model of evolution appears to be a more likely explanation. Then, the average coefficient of selection associated with an amino acid replacement, in the units of the effective population size, must exceed ~0.4, and the fraction of effectively neutral replacements must be below ~30%. At a majority of evolvable amino acid sites, only a relatively small number of different amino acids is permitted.
High, but below-neutral, rates of parallel amino acid replacements suggest that a majority of amino acid replacements that occur in evolution are subject to weak, but non-trivial, selection, as predicted by Ohta's nearly-neutral theory.
This article was reviewed by John McDonald (nominated by Laura Landweber), Sarah Teichmann and Subhajyoti De, and Chris Adami.
PMCID: PMC2020468  PMID: 17705846
21.  The intracellular region of Notch ligands: does the tail make the difference? 
Biology Direct  2007;2:19.
The cytoplasmic tail of Notch ligands drives endocytosis, mediates association with proteins implicated in the organization of cell-cell junctions and, through regulated intra-membrane proteolysis, is released from the membrane as a signaling fragment. We survey these findings and discuss the role of the intracellular region of Notch ligands in bidirectional signaling and possibly in signal modulation in mammals.
This article was reviewed by Frank Eisenhaber, L Aravind, and Eugene V. Koonin.
PMCID: PMC1965462  PMID: 17623096
22.  Small but versatile: the extraordinary functional and structural diversity of the β-grasp fold 
Biology Direct  2007;2:18.
The β-grasp fold (β-GF), prototyped by ubiquitin (UB), has been recruited for a strikingly diverse range of biochemical functions. These functions include providing a scaffold for different enzymatic active sites (e.g. NUDIX phosphohydrolases) and iron-sulfur clusters, RNA-soluble-ligand and co-factor-binding, sulfur transfer, adaptor functions in signaling, assembly of macromolecular complexes and post-translational protein modification. To understand the basis for the functional versatility of this small fold we undertook a comprehensive sequence-structure analysis of the fold and developed a natural classification for its members.
As a result we were able to define the core distinguishing features of the fold and numerous elaborations, including several previously unrecognized variants. Systematic analysis of all known interactions of the fold showed that its manifold functional abilities arise primarily from the prominent β-sheet, which provides an exposed surface for diverse interactions or, additionally, by forming open barrel-like structures. We show that in the β-GF both enzymatic activities and the binding of diverse co-factors (e.g. molybdopterin) have independently evolved on at least three occasions each, and iron-sulfur-cluster-binding on at least two independent occasions. Our analysis identified multiple previously unknown large monophyletic assemblages within the β-GF, including one which unifies versions found in the fasciclin-1 superfamily, the ribosomal protein L25, the phosphoribosyl AMP cyclohydrolase (HisI) and glutamine synthetase. We also uncovered several new groups of β-GF domains including a domain found in bacterial flagellar and fimbrial assembly components, and 5 new UB-like domains in the eukaryotes.
Evolutionary reconstruction indicates that the β-GF had differentiated into at least 7 distinct lineages by the time of the last universal common ancestor of all extant organisms, encompassing much of the structural diversity observed in extant versions of the fold. The earliest β-GF members were probably involved in RNA metabolism and subsequently radiated into various functional niches. Most of the structural diversification occurred in the prokaryotes, whereas the eukaryotic phase was mainly marked by a specific expansion of the ubiquitin-like β-GF members. The eukaryotic UB superfamily diversified into at least 67 distinct families, of which at least 19–20 families were already present in the eukaryotic common ancestor, including several protein and one lipid conjugated forms. Another key aspect of the eukaryotic phase of evolution of the β-GF was the dramatic increase in domain architectural complexity of proteins related to the expansion of UB-like domains in numerous adaptor roles.
This article was reviewed by Igor Zhulin, Arcady Mushegian and Frank Eisenhaber.
PMCID: PMC1949818  PMID: 17605815
23.  Neutral genetic drift can alter promiscuous protein functions, potentially aiding functional evolution 
Biology Direct  2007;2:17.
Many of the mutations accumulated by naturally evolving proteins are neutral in the sense that they do not significantly alter a protein's ability to perform its primary biological function. However, new protein functions evolve when selection begins to favor other, "promiscuous" functions that are incidental to a protein's original biological role. If mutations that are neutral with respect to a protein's primary biological function cause substantial changes in promiscuous functions, these mutations could enable future functional evolution.
Here we investigate this possibility experimentally by examining how cytochrome P450 enzymes that have evolved neutrally with respect to activity on a single substrate have changed in their abilities to catalyze reactions on five other substrates. We find that the enzymes have sometimes changed as much as four-fold in the promiscuous activities. The changes in promiscuous activities tend to increase with the number of mutations, and can be largely rationalized in terms of the chemical structures of the substrates. The activities on chemically similar substrates tend to change in a coordinated fashion, potentially providing a route for systematically predicting the change in one activity based on the measurement of several others.
Our work suggests that initially neutral genetic drift can lead to substantial changes in protein functions that are not currently under selection, in effect poising the proteins to more readily undergo functional evolution should selection favor new functions in the future.
This article was reviewed by Martijn Huynen, Fyodor Kondrashov, and Dan Tawfik (nominated by Christoph Adami).
PMCID: PMC1914045  PMID: 17598905
24.  The stochastic behavior of a molecular switching circuit with feedback 
Biology Direct  2007;2:13.
Using a statistical physics approach, we study the stochastic switching behavior of a model circuit of multisite phosphorylation and dephosphorylation with feedback. The circuit consists of a kinase and phosphatase acting on multiple sites of a substrate that, contingent on its modification state, catalyzes its own phosphorylation and, in a symmetric scenario, dephosphorylation. The symmetric case is viewed as a cartoon of conflicting feedback that could result from antagonistic pathways impinging on the state of a shared component.
Multisite phosphorylation is sufficient for bistable behavior under feedback even when catalysis is linear in substrate concentration, which is the case we consider. We compute the phase diagram, fluctuation spectrum and large-deviation properties related to switch memory within a statistical mechanics framework. Bistability occurs as either a first-order or second-order non-equilibrium phase transition, depending on the network symmetries and the ratio of phosphatase to kinase numbers. In the second-order case, the circuit never leaves the bistable regime upon increasing the number of substrate molecules at constant kinase to phosphatase ratio.
The number of substrate molecules is a key parameter controlling the onset of the bistable regime, the fluctuation intensity, and the residence time in a switched state. The relevance of the concept of memory depends on the degree of switch symmetry, as memory presupposes information to be remembered, which is highest for equal residence times in the switched states.
This article was reviewed by Artem Novozhilov (nominated by Eugene Koonin), Sergei Maslov, and Ned Wingreen.
PMCID: PMC1904185  PMID: 17540019
25.  On the origin of the translation system and the genetic code in the RNA world by means of natural selection, exaptation, and subfunctionalization 
Biology Direct  2007;2:14.
The origin of the translation system is, arguably, the central and the hardest problem in the study of the origin of life, and one of the hardest in all evolutionary biology. The problem has a clear catch-22 aspect: high translation fidelity can hardly be achieved without a complex, highly evolved set of RNAs and proteins, but an elaborate protein machinery could not evolve without an accurate translation system. The origin of the genetic code and whether it evolved on the basis of a stereochemical correspondence between amino acids and their cognate codons (or anticodons), through selectional optimization of the code vocabulary, as a "frozen accident" or via a combination of all these routes is another wide-open problem despite extensive theoretical and experimental studies. Here we combine the results of comparative genomics of translation system components, data on interaction of amino acids with their cognate codons and anticodons, and data on catalytic activities of ribozymes to develop conceptual models for the origins of the translation system and the genetic code.
Our main guide in constructing the models is the Darwinian Continuity Principle whereby a scenario for the evolution of a complex system must consist of plausible elementary steps, each conferring a distinct advantage on the evolving ensemble of genetic elements. Evolution of the translation system is envisaged to occur in a compartmentalized ensemble of replicating, co-selected RNA segments, i.e., in an RNA World containing ribozymes with versatile activities. Since evolution has no foresight, the translation system could not evolve in the RNA World as the result of selection for protein synthesis and must have been a by-product of evolution driven by selection for another function, i.e., the translation system evolved via the exaptation route. It is proposed that the evolutionary process that eventually led to the emergence of translation started with the selection for ribozymes binding abiogenic amino acids that stimulated ribozyme-catalyzed reactions. The proposed scenario for the evolution of translation consists of the following steps: binding of amino acids to a ribozyme resulting in an enhancement of its catalytic activity; evolution of the amino-acid-stimulated ribozyme into a peptide ligase (predecessor of the large ribosomal subunit) yielding, initially, a unique peptide activating the original ribozyme and, possibly, other ribozymes in the ensemble; evolution of self-charging proto-tRNAs that were selected, initially, for accumulation of amino acids, and subsequently, for delivery of amino acids to the peptide ligase; joining of the peptide ligase with a distinct RNA molecule (predecessor of the small ribosomal subunit) carrying a built-in template for more efficient, complementary binding of charged proto-tRNAs; evolution of the ability of the peptide ligase to assemble peptides using exogenous RNAs as template for complementary binding of charged proto-tRNAs, yielding peptides with the potential to activate different ribozymes; evolution of the translocation function of the protoribosome leading to the production of increasingly longer peptides (the first proteins), i.e., the origin of translation. The specifics of the recognition of amino acids by proto-tRNAs and the origin of the genetic code depend on whether or not there is a physical affinity between amino acids and their cognate codons or anticodons, a problem that remains unresolved.
We describe a stepwise model for the origin of the translation system in the ancient RNA world such that each step confers a distinct advantage onto an ensemble of co-evolving genetic elements. Under this scenario, the primary cause for the emergence of translation was the ability of amino acids and peptides to stimulate reactions catalyzed by ribozymes. Thus, the translation system might have evolved as the result of selection for ribozymes capable of, initially, efficient amino acid binding, and subsequently, synthesis of increasingly versatile peptides. Several aspects of this scenario are amenable to experimental testing.
This article was reviewed by Rob Knight, Doron Lancet, Alexander Mankin (nominated by Arcady Mushegian), and Arcady Mushegian.
PMCID: PMC1894784  PMID: 17540026
