The CRISPR-Cas systems of adaptive antivirus immunity are present in most archaea and many bacteria, and provide resistance to specific viruses or plasmids by inserting fragments of foreign DNA into the host genome and then utilizing transcripts of these spacers to inactivate the cognate foreign genome. The recent development of powerful genome engineering tools on the basis of CRISPR-Cas has sharply increased the interest in the diversity and evolution of these systems. Comparative genomic data indicate that during evolution of prokaryotes CRISPR-Cas loci are lost and acquired via horizontal gene transfer at high rates. Mathematical modeling and initial experimental studies of CRISPR-carrying microbes and viruses reveal complex coevolutionary dynamics.
We performed a bifurcation analysis of models of coevolution of viruses and microbial hosts that possess CRISPR-Cas hereditary adaptive immunity systems. The analyzed Malthusian and logistic models display complex and, in particular, quasi-chaotic oscillation regimes that have not been previously observed experimentally or in agent-based models of CRISPR-mediated immunity. The key factors for the appearance of the quasi-chaotic oscillations are the non-linear dependence of the host immunity on the virus load and the partitioning of the hosts into immune and susceptible populations, so that the system consists of three components.
Bifurcation analysis of CRISPR-host coevolution model predicts complex regimes including quasi-chaotic oscillations. The quasi-chaotic regimes of virus-host coevolution are likely to be biologically relevant given the evolutionary instability of the CRISPR-Cas loci revealed by comparative genomics. The results of this analysis might have implications beyond the CRISPR-Cas systems, i.e. could describe the behavior of any adaptive immunity system with a heritable component, be it genetic or epigenetic. These predictions are experimentally testable.
This manuscript was reviewed by Sandor Pongor, Sergei Maslov and Marek Kimmel. For the complete reports, go to the Reviewers’ Reports section.
Measures of node centrality in biological networks are useful to detect genes with critical functional roles. In gene co-expression networks, highly connected genes (i.e., candidate hubs) have been associated with key disease-related pathways. Although different approaches to estimating gene centrality are available, their potential biological relevance in gene co-expression networks deserves further investigation. Moreover, standard measures of gene centrality focus on binary interaction networks, which may not always be suitable in the context of co-expression networks. Here, I also investigate a method that identifies potential biologically meaningful genes based on a weighted connectivity score and indicators of statistical relevance.
The method enables a characterization of the strength and diversity of co-expression associations in the network. It outperformed standard centrality measures by highlighting more biologically informative genes in different gene co-expression networks and biological research domains. As part of the illustration of the gene selection potential of this approach, I present an application case in zebrafish heart regeneration. The proposed technique predicted genes that are significantly implicated in cellular processes required for tissue regeneration after injury.
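As a rough illustration of why a weighted connectivity score can rank genes differently from a binary degree, the following minimal sketch sums edge weights per gene and compares the result with a thresholded degree count. It is not the author's scoring method: the function names, the toy edge list and the 0.5 threshold are all assumptions made for illustration, and the statistical relevance indicators mentioned above are not modelled.

```python
from collections import defaultdict


def weighted_connectivity(edges):
    """Sum the weights of all edges incident to each gene.

    edges: iterable of (gene_a, gene_b, weight) tuples, where weight is
    assumed to be a non-negative co-expression strength.
    """
    score = defaultdict(float)
    for gene_a, gene_b, weight in edges:
        score[gene_a] += weight
        score[gene_b] += weight
    return dict(score)


def binary_degree(edges, threshold):
    """Count edges per gene after discarding edges below the threshold."""
    degree = defaultdict(int)
    for gene_a, gene_b, weight in edges:
        if weight >= threshold:
            degree[gene_a] += 1
            degree[gene_b] += 1
    return dict(degree)


# Hypothetical toy network; gene names and weights are made up.
edges = [
    ("g1", "g2", 0.9),
    ("g1", "g3", 0.8),
    ("g2", "g3", 0.3),
    ("g3", "g4", 0.4),
]
weighted = weighted_connectivity(edges)
binary = binary_degree(edges, threshold=0.5)
```

In this toy network, g2 and g3 have the same binary degree at the chosen threshold, whereas the weighted score separates them because g3 carries two additional weaker associations that the threshold discards.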
A method for selecting biologically informative genes from gene co-expression networks is provided, together with free open software.
This article was reviewed by Anthony Almudevar, Maciej M Kańduła (nominated by David P Kreil) and Christine Wells.
Network hubs; Weighted networks; Gene co-expression networks; Centrality scores; Zebrafish; Heart regeneration; Cancer; Microarrays; RNA-Seq
Because amino acid activation is rate-limiting for uncatalyzed protein synthesis, it is a key puzzle in understanding the origin of the genetic code. Two unrelated classes (I and II) of contemporary aminoacyl-tRNA synthetases (aaRS) now translate the code. Observing that codons for the most highly conserved, Class I catalytic peptides, when read in the reverse direction, are very nearly anticodons for Class II defining catalytic peptides, Rodin and Ohno proposed that the two superfamilies descended from opposite strands of the same ancestral gene. This unusual hypothesis languished for a decade, perhaps because it appeared to be unfalsifiable.
The proposed sense/antisense alignment makes important predictions. Fragments that align in antiparallel orientations, and contain the respective active sites, should catalyze the same two reactions catalyzed by contemporary synthetases. Recent experiments confirmed that prediction. Invariant cores from both classes, called Urzymes after Ur = primitive, authentic, plus enzyme and representing ~20% of the contemporary structures, can be expressed and exhibit high, proportionate rate accelerations for both amino-acid activation and tRNA acylation. A major fraction (60%) of the catalytic rate acceleration by contemporary synthetases resides in segments that align sense/antisense. Bioinformatic evidence for sense/antisense ancestry extends to codons specifying the invariant secondary and tertiary structures outside the active sites of the two synthetase classes. Peptides from a designed, 46-residue gene constrained by Rosetta to encode Class I and II ATP binding sites with fully complementary sequences both accelerate amino acid activation by ATP ~400 fold.
Biochemical and bioinformatic results substantially enhance the posterior probability that ancestors of the two synthetase classes arose from opposite strands of the same ancestral gene. The remarkable acceleration by short peptides of the rate-limiting step in uncatalyzed protein synthesis, together with the synergy of synthetase Urzymes and their cognate tRNAs, introduce a new paradigm for the origin of protein catalysts, emphasize the potential relevance of an operational RNA code embedded in the tRNA acceptor stems, and challenge the RNA-World hypothesis.
This article was reviewed by Dr. Paul Schimmel (nominated by Laura Landweber), Dr. Eugene Koonin and Professor David Ardell.
Aminoacyl-tRNA synthetases; Urzymes; Genetic code; Origin of Translation; RNA World hypothesis; Amino acid activation; Structural homology; Ancestral genes; Sense/antisense coding
The emergence of Next Generation Sequencing generates an incredible amount of sequence data and great potential for new enzyme discovery. Despite this huge amount of data and the profusion of bioinformatic methods for function prediction, a large fraction of known enzyme activities still lacks an associated protein sequence. These activities are called “orphan enzymes”. The present review updates previous surveys on orphan enzymes by mining the current content of public databases. While the percentage of orphan enzyme activities has decreased from 38% to 22% in ten years, there are still more than 1,000 orphans among the 5,000 entries of the Enzyme Commission (EC) classification. Taking into account all the reactions present in metabolic databases, this proportion increases dramatically to nearly 50% of orphans, and many of them are not associated with a known pathway. We extended our survey to “local orphan enzymes”, which are activities that have no representative sequence in a given clade but have at least one in organisms belonging to other clades. We observe a substantial bias in Archaea and find that, in general, more than 30% of the EC activities have incomplete sequence information in at least one superkingdom. To estimate whether candidate proteins for local orphans could be retrieved by homology search, we applied a simple strategy based on the PRIAM software and found that candidates may be proposed for a substantial fraction of local orphan enzymes. Finally, a study of the relation between protein domains and catalyzed activities indicates that newly discovered enzymes are mostly associated with already known enzyme domains. Thus, the exploration of the promiscuity and the multifunctional aspect of known enzyme families may solve part of the orphan enzyme issue. We conclude this review with a presentation of recent initiatives in finding proteins for orphan enzymes and in extending the enzyme world by the discovery of new activities.
This article was reviewed by Michael Galperin, Daniel Haft and Daniel Kahn.
Orphan enzyme activities; Enzyme discovery; Metabolic pathways; Enzyme promiscuity; Data survey; Biological databases; Local orphan enzymes
We have previously suggested a method for proteome-wide analysis of variation at functional residues, wherein we identified the set of all human genes with nonsynonymous single nucleotide variation (nsSNV) in the active site residue of the corresponding proteins. Thirty-four of these proteins were shown to have a 1:1:1 enzyme:pathway:reaction relationship, making them ideal candidates for laboratory validation through creation and observation of specific yeast active site knock-outs and downstream targeted metabolomics experiments. Here we present the next step in the workflow toward using yeast metabolic modeling to predict human metabolic behavior resulting from nsSNV.
For the previously identified candidate proteins, we used the reciprocal best BLAST hits method followed by manual alignment and pathway comparison to identify six human proteins with yeast orthologs that were suitable for flux balance analysis (FBA). Five of these proteins are known to be associated with diseases, including ribose 5-phosphate isomerase deficiency, myopathy with lactic acidosis and sideroblastic anaemia, anemia due to disorders of glutathione metabolism, and two porphyrias, and, based on the work described herein, we suspect the sixth enzyme to have disease associations that are not yet classified or understood.
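The reciprocal best hits step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the BLAST results have already been parsed into dictionaries mapping each query to a list of (subject, bit score) pairs, and the identifiers and scores are hypothetical. The manual alignment and pathway comparison that follow in the workflow are not covered.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject.

    hits: dict of query -> list of (subject, bit_score) tuples.
    Queries with no hits are skipped.
    """
    return {
        query: max(subjects, key=lambda pair: pair[1])[0]
        for query, subjects in hits.items()
        if subjects
    }


def reciprocal_best_hits(hits_a_to_b, hits_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(hits_a_to_b)
    best_ba = best_hits(hits_b_to_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )


# Hypothetical parsed BLAST results (identifiers and scores are made up).
human_to_yeast = {
    "H1": [("Y1", 250.0), ("Y2", 90.0)],
    "H2": [("Y2", 120.0)],
    "H3": [("Y1", 80.0)],
}
yeast_to_human = {
    "Y1": [("H1", 240.0), ("H3", 75.0)],
    "Y2": [("H1", 95.0), ("H2", 60.0)],
}
pairs = reciprocal_best_hits(human_to_yeast, yeast_to_human)
```

Here H2's best hit is Y2, but Y2's best hit is H1, so that pair is not reciprocal and is excluded.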
Preliminary findings using the Yeast 7.0 FBA model show lack of growth for only one enzyme, but augmentation of the Yeast 7.0 biomass function to better simulate knockout of certain genes suggested physiological relevance of variations in three additional proteins. Thus, we suggest the following four proteins for laboratory validation: delta-aminolevulinic acid dehydratase, ferrochelatase, ribose-5 phosphate isomerase and mitochondrial tyrosyl-tRNA synthetase. This study indicates that the predictive ability of this method will improve as more advanced, comprehensive models are developed. Moreover, these findings will be useful in the development of simple downstream biochemical or mass-spectrometric assays to corroborate these predictions and detect presence of certain known nsSNVs with deleterious outcomes. Results may also be useful in predicting as yet unknown outcomes of active site nsSNVs for enzymes that are not yet well classified or annotated.
This article was reviewed by Daniel Haft and Igor B. Rogozin.
nsSNV; Ortholog; Sequence conservation; FBA; Yeast metabolic modeling
Pannexin1 is ubiquitously expressed in vertebrate tissues, but the role it plays in vascular tone regulation remains unclear. We found that the Pannexin1 expression level is much higher in the endothelium than in the smooth muscle of the saphenous artery. In mice with genetic ablation of Pannexin1, the ability of endothelium-intact arteries to dilate was significantly impaired, whereas contractile responses were considerably increased. No such increased contractile responses were detected in the endothelium-denuded arteries. Combined, our findings suggest a new function of Pannexin1 as an important player in normal endothelium-dependent regulation of arterial tone, where it facilitates vessel dilation and attenuates constriction.
Reviewed by Dr. Armen Mulkidjanian and Dr. Alexander Lobkovsky.
Pannexin1; Endothelium; Saphenous artery; Knockout mice
This article was reviewed by Purificacion Lopez-Garcia and Igor B Rogozin.
Cryptomonads are a lineage of unicellular and mostly photosynthetic algae that acquired their plastids through the “secondary” endosymbiosis of a red alga and still retain the nuclear genome (nucleomorph) of the latter. We find that the genome of the cryptomonad Guillardia theta comprises genes coding for 13 globin domains, of which 6 occur within two large chimeric proteins. All the sequences adhere to the vertebrate 3/3 myoglobin fold. Although several globins have no introns, the remainder have atypical intron locations. Bayesian phylogenetic analyses suggest that the G. theta Hbs are related to the stramenopile and chlorophyte single domain globins.
Protist; Cryptomonads; Hemoglobin; Phylogeny
This article was reviewed by Lakshminarayan M. Iyer and I. King Jordan. For complete reviews, see the Reviewers’ Reports section.
Polintons (also known as Mavericks) and Tlr elements of Tetrahymena thermophila represent two families of large DNA transposons widespread in eukaryotes. Here, we show that both Polintons and Tlr elements encode two key virion proteins, the major capsid protein with the double jelly-roll fold and the minor capsid protein, known as the penton, with the single jelly-roll topology. This observation along with the previously noted conservation of the genes for viral genome packaging ATPase and adenovirus-like protease strongly suggests that Polintons and Tlr elements combine features of bona fide viruses and transposons. We propose the name ‘Polintoviruses’ to denote these putative viruses that could have played a central role in the evolution of several groups of DNA viruses of eukaryotes.
Polintons; Mavericks; Transposable elements; Double jelly-roll fold; Capsid proteins; Virus evolution
H. sapiens-M. tuberculosis H37Rv protein-protein interaction (PPI) data are essential for understanding the infection mechanism of the formidable pathogen M. tuberculosis H37Rv. Computational prediction is an important strategy to fill the gap in experimental H. sapiens-M. tuberculosis H37Rv PPI data. Homology-based prediction is frequently used in predicting both intra-species and inter-species PPIs. However, some limitations are not properly resolved in several published works that predict eukaryote-prokaryote inter-species PPIs using intra-species template PPIs.
We develop a stringent homology-based prediction approach by taking into account (i) differences between eukaryotic and prokaryotic proteins and (ii) differences between inter-species and intra-species PPI interfaces. We compare our stringent homology-based approach to a conventional homology-based approach for predicting host-pathogen PPIs, based on cellular compartment distribution analysis, disease gene list enrichment analysis, pathway enrichment analysis and functional category enrichment analysis. These analyses support the validity of our prediction result and show that our approach performs better in predicting H. sapiens-M. tuberculosis H37Rv PPIs. Using our stringent homology-based approach, we have predicted a set of highly plausible H. sapiens-M. tuberculosis H37Rv PPIs, which might be useful for many related studies. Our analysis of the H. sapiens-M. tuberculosis H37Rv PPI network predicted by this approach revealed several properties that are reported here for the first time. We find that both host proteins and pathogen proteins involved in the host-pathogen PPIs tend to be hubs in their own intra-species PPI network. In addition, both host and pathogen proteins involved in host-pathogen PPIs tend to have longer primary sequences, to have more domains and to be more hydrophilic, among other properties. Finally, the protein domains from both host and pathogen proteins involved in host-pathogen PPIs tend to have lower charge and to be more hydrophilic.
Our stringent homology-based prediction approach provides a better strategy in predicting PPIs between eukaryotic hosts and prokaryotic pathogens than a conventional homology-based approach. The properties we have observed from the predicted H. sapiens-M. tuberculosis H37Rv PPI network are useful for understanding inter-species host-pathogen PPI networks and provide novel insights for host-pathogen interaction studies.
This article was reviewed by Michael Gromiha, Narayanaswamy Srinivasan and Thomas Dandekar.
ChIP-Seq (chromatin immunoprecipitation sequencing) has facilitated motif finding because ChIP-Seq experiments narrow the search down to binding site locations. Recent motif finding tools facilitate motif detection by providing user-friendly Web interfaces. In this work, we reviewed nine motif finding Web tools that are capable of detecting binding site motifs in ChIP-Seq data. We showed that each motif finding Web tool has its own advantages for detecting motifs that other tools may not discover. We recommend that users apply multiple motif finding Web tools that implement different algorithms to obtain significant motifs, overlapping similar motifs, and non-overlapping motifs. Finally, we provide suggestions for the future development of motif finding Web tools that better assist researchers in finding motifs in ChIP-Seq data.
This article was reviewed by Prof. Sandor Pongor, Dr. Yuriy Gusev, and Dr. Shyam Prabhakar (nominated by Prof. Limsoon Wong).
Motif finding Web tool; Peak calling; Binding site; Over-represented motif; ChIP-Seq
For the anucleate platelet it has been unclear how well platelet transcriptomes correlate among different donors or across different RNA profiling platforms, and what the transcriptomes’ relationship is with the platelet proteome. We profiled the platelet transcriptome of 10 healthy young males (5 white and 5 black) with no notable clinical history using RNA sequencing and by Affymetrix microarray.
We found that the abundance of platelet mRNA transcripts was highly correlated across the 10 individuals, independently of race and of the employed technology. Our RNA-seq data showed that these high inter-individual correlations extend beyond mRNAs to several categories of non-coding RNAs. Pseudogenes represented a notable exception by exhibiting a difference in expression by race. Comparison of our mRNA signatures to a publicly available quantitative platelet proteome showed that most (87.5%) identified platelet proteins had a detectable corresponding mRNA. However, a high number of mRNAs that were present in the transcriptomes of all 10 individuals had no representation in the proteome. Spearman correlations of the relative abundances for those genes represented by both an mRNA and a protein showed a weak (~0.3) connection. Further analysis of the overlapping and non-overlapping platelet mRNAs and proteins identified gene groups corresponding to distinct cellular processes.
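The Spearman correlation used above to compare mRNA and protein abundances is the Pearson correlation of the ranks of the two measurements. The following self-contained sketch shows the calculation with tied values given average ranks. The abundance values are made-up numbers for illustration only and are unrelated to the study's data.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundances for five genes (illustrative values only).
mrna = [5.0, 3.0, 8.0, 1.0, 7.0]
protein = [2.0, 9.0, 4.0, 3.0, 1.0]
rho = spearman(mrna, protein)
```

Because only ranks enter the calculation, the coefficient is insensitive to the different measurement scales of the transcriptomic and proteomic platforms.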
The results of our analyses provide novel insights for platelet biology, show only a weak connection between the platelet transcriptome and proteome, and indicate that it is feasible to assemble a platelet mRNA-ome that can serve as a reference for future platelet transcriptomic studies of human health and disease.
This article was reviewed by Dr Mikhail Dozmorov (nominated by Dr Yuri Gusev), Dr Neil Smalheiser and Dr Eugene Koonin.
Duplicated genes can indefinitely persist in genomes if either both copies retain the original function due to dosage benefit (gene conservation), or one of the copies assumes a novel function (neofunctionalization), or both copies become required to perform the function previously accomplished by a single copy (subfunctionalization), or through a combination of these mechanisms. Different models of duplication retention imply different predictions about substitution rates in the coding portion of paralogs and about asymmetry of these rates.
We analyse sequence evolution asymmetry in paralogs present in 12 Drosophila genomes using the nearest non-duplicated orthologous outgroup as a reference. Those paralogs present in D. melanogaster are analysed in conjunction with the asymmetry of expression rate and ubiquity and of segregating non-synonymous polymorphisms in the same paralogs. Paralogs accumulate substitutions, on average, faster than their nearest singleton orthologs. The distribution of paralogs’ substitution rate asymmetry is overdispersed relative to that of orthologous clades, containing disproportionally more unusually symmetric and unusually asymmetric clades. We show that paralogs are more asymmetric in: a) clades orthologous to highly constrained singleton genes; b) genes with high expression level; c) genes with ubiquitous expression and d) non-tandem duplications. We further demonstrate that, in each asymmetrically evolving pair of paralogs, the faster evolving member of the pair tends to have lower average expression rate, lower expression uniformity and higher frequency of non-synonymous SNPs than its slower evolving counterpart.
Our findings are consistent with the hypothesis that many duplications in Drosophila are retained despite stabilising selection being more relaxed in one of the paralogs than in the other, suggesting a widespread unfinished pseudogenization. This phenomenon is likely to make detection of neo- and subfunctionalization signatures difficult, as these models of duplication retention also predict asymmetries in substitution rates and expression profiles.
This article has been reviewed by Dr. Jia Zeng (nominated by Dr. I. King Jordan), Dr. Fyodor Kondrashov and Dr. Yuri Wolf.
Gene duplication; Pseudogenization; Drosophila; Substitution rate; Gene expression; Polymorphism
A recent study argued, based on data on functional genome size of major phyla, that there is evidence life may have originated significantly prior to the formation of the Earth.
Here a more refined regression analysis is performed in which 1) measurement error is systematically taken into account, and 2) interval estimates (e.g., confidence or prediction intervals) are produced. It is shown that such models for which the interval estimate for the time origin of the genome includes the age of the Earth are consistent with observed data.
The appearance of life after the formation of the Earth is consistent with the data set under examination.
This article was reviewed by Yuri Wolf, Peter Gogarten, and Christoph Adami.
Genome; Evolution; Origin; Regression; Measurement error; Confidence interval; Prediction interval
The problem of probabilistic inference of gene content in the last common ancestor of several extant species with completely sequenced genomes is: for each gene that is conserved in all or some of the genomes, assign the probability that its ancestral gene was present in the genome of their last common ancestor.
We have developed a family of models of gene gain and gene loss in evolution, and applied the maximum-likelihood approach that uses a phylogenetic tree of prokaryotes and the record of orthologous relationships between their genes to infer the gene content of LUCA, the Last Universal Common Ancestor of all currently living cellular organisms. The crucial parameter, the ratio of gene losses to gene gains, was estimated from the data and was higher in models that take account of the number of in-paralogs in genomes than in models that treat gene presences and absences as a binary trait.
While the numbers of genes that are placed confidently into LUCA are similar in the ML methods and in previously published methods that use various parsimony-based approaches, the identities of genes themselves are different. Most of the models of either kind treat the genes found in many existing genomes in a similar way, assigning to them high probabilities of being ancestral (“high ancestrality”). The ML models are more likely than others to assign high ancestrality to the genes that are relatively rare in the present-day genomes.
This article was reviewed by Martijn A Huynen, Toni Gabaldón and Fyodor Kondrashov.
Biological systems produce outputs in response to variable inputs. Input-output relations tend to follow a few regular patterns. For example, many chemical processes follow the S-shaped Hill equation relation between input concentrations and output concentrations. That Hill equation pattern contradicts the fundamental Michaelis-Menten theory of enzyme kinetics. I use the discrepancy between the expected Michaelis-Menten process of enzyme kinetics and the widely observed Hill equation pattern of biological systems to explore the general properties of biological input-output relations. I start with the various processes that could explain the discrepancy between basic chemistry and biological pattern. I then expand the analysis to consider broader aspects that shape biological input-output relations. Key aspects include the input-output processing by component subsystems and how those components combine to determine the system’s overall input-output relations. That aggregate structure often imposes strong regularity on underlying disorder. Aggregation imposes order by dissipating information as it flows through the components of a system. The dissipation of information may be evaluated by the analysis of measurement and precision, explaining why certain common scaling patterns arise so frequently in input-output relations. I discuss how aggregation, measurement and scale provide a framework for understanding the relations between pattern and process. The regularity imposed by those broader structural aspects sets the contours of variation in biology. Thus, biological design will also tend to follow those contours. Natural selection may act primarily to modulate system properties within those broad constraints.
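The contrast between the two input-output forms discussed above can be made concrete with a short sketch. The Hill equation gives output x^n / (k^n + x^n); with n = 1 it reduces to the hyperbolic Michaelis-Menten form, while n > 1 gives the S-shaped curve. The function and parameter names below, including the helper `response_range`, are chosen for illustration and are not taken from the article.

```python
def hill(x, k, n):
    """Fraction of maximal output at input concentration x.

    k is the input that gives half-maximal output; n is the Hill coefficient.
    """
    return x**n / (k**n + x**n)


def michaelis_menten(x, k):
    """Michaelis-Menten saturation curve: the Hill equation with n = 1."""
    return hill(x, k, 1)


def response_range(n):
    """Fold change in input needed to move the output from 10% to 90%.

    Solving hill(x, k, n) = 0.9 and = 0.1 gives x^n = 9 k^n and x^n = k^n / 9,
    so the ratio of the two inputs is 81 ** (1 / n), independent of k.
    """
    return 81 ** (1 / n)


# A Michaelis-Menten response needs an 81-fold change in input to go from
# 10% to 90% output; a Hill response with n = 4 needs only a 3-fold change.
mm_range = response_range(1)
hill4_range = response_range(4)
```

The shrinking input range with increasing n is one way to quantify the switch-like sensitivity that distinguishes the Hill pattern from simple Michaelis-Menten saturation.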
This article was reviewed by Eugene Koonin, Georg Luebeck and Sergei Maslov.
Biological design; Cellular biochemistry; Cellular sensors; Measurement theory; Information theory; Natural selection; Signal processing
The generation of interferon-gamma (IFN-γ) by MHC class II activated CD4+ T helper cells makes a substantial contribution to the control of infections such as those caused by Mycobacterium tuberculosis. In the past, numerous methods have been developed for predicting MHC class II binders that can activate T-helper cells. To the best of the authors’ knowledge, no method has been developed so far that can predict the type of cytokine whose secretion is induced by these MHC class II binders or T-helper epitopes. In this study, an attempt has been made to predict the IFN-γ inducing peptides. The main dataset used in this study contains 3705 IFN-γ inducing and 6728 non-IFN-γ inducing MHC class II binders. Another dataset, called IFNgOnly, contains 4483 IFN-γ inducing epitopes and 2160 epitopes that induce cytokines other than IFN-γ. In addition, we have an alternate dataset that contains IFN-γ inducing peptides and an equal number of random peptides.
It was observed that the peptide length, positional conservation of residues and amino acid composition affect the IFN-γ inducing capabilities of these peptides. We identified the motifs in IFN-γ inducing binders/peptides using MERCI software. Our analysis indicates that IFN-γ inducing and non-inducing peptides can be discriminated using the above features. We developed models for predicting IFN-γ inducing peptides using various approaches, such as machine learning techniques, motif-based search, and a hybrid approach. Our best model, based on the hybrid approach, achieved a maximum prediction accuracy of 82.10% with an MCC of 0.62 on the main dataset. We also developed a hybrid model on the IFNgOnly dataset and achieved a maximum accuracy of 81.39% with an MCC of 0.57.
Based on this study, we have developed a webserver for i) predicting IFN-γ inducing peptides, ii) virtual screening of peptide libraries and iii) identification of IFN-γ inducing regions in antigens (http://crdd.osdd.net/raghava/ifnepitope/).
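The two performance measures reported above, accuracy and the Matthews correlation coefficient (MCC), are both computed from the four cells of a binary confusion matrix. The sketch below shows the standard formulas. The counts in the example are hypothetical and do not correspond to the reported results.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical confusion matrix counts (illustrative only).
example_accuracy = accuracy(tp=80, tn=90, fp=10, fn=20)
example_mcc = mcc(tp=80, tn=90, fp=10, fn=20)
```

Unlike accuracy, MCC accounts for all four cells symmetrically, which makes it more informative when the positive and negative classes differ in size, as they do in the main dataset described above.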
This article was reviewed by Prof Kurt Blaser, Prof Laurence Eisenlohr and Dr Manabu Sugai.
Translation elongation factors eEF1A1 and eEF1A2 are 92% identical but exhibit non-overlapping expression patterns. While the two proteins are predicted to have similar tertiary structures, it is notable that the minor variations between their sequences are highly localised within their modelled structures. We used recently available high-throughput “omics” data to assess the spatial location of post-translational modifications and discovered that they are highly enriched on those surface regions of the protein that correspond to the clusters of sequence variation. This observation suggests how these two isoforms could be differentially regulated allowing them to perform distinct
This article was reviewed by Frank Eisenhaber and Ramanathan Sowdhamini.
eEF1A1; eEF1A2; Phosphorylation; Methylation; Acetylation; Ubiquitination; Post-translational modification
Identification of drug-like molecules is one of the major challenges in the field of drug discovery. Existing approaches, such as the Lipinski rule of 5 (Ro5) and Oprea, have their own limitations. Thus, there is a need to develop a computational method that can predict the drug-likeness of a molecule with precision. In addition, there is a need to develop an algorithm for screening chemical libraries for drug-like properties.
In this study, we have used 1347 approved and 3206 experimental drugs for developing a knowledge-based computational model for predicting the drug-likeness of a molecule. We have used the freely available PaDEL software for computing molecular fingerprints/descriptors of the molecules for developing prediction models. Weka software has been used for feature selection in order to identify the best fingerprints. We have developed various classification models using different types of fingerprints, such as Estate, PubChem, Extended, FingerPrinter, MACCS keys, GraphsOnlyFP, SubstructureFP, Substructure FPCount, Klekota-RothFP, Klekota-Roth FPCount. It was observed that the models developed using MACCS keys based fingerprints discriminated approved and experimental drugs with higher precision. Our model based on 159 MACCS keys predicted the drug-likeness of the molecules with 89.96% accuracy along with an MCC of 0.77. Our analysis indicated that MACCS keys (ISIS keys) 112, 122, 144, and 150 were highly prevalent in the approved drugs. The screening of the ZINC (drug-like) and ChEMBL databases showed that around 78.33% and 72.43%, respectively, of the compounds present in these databases had drug-like potential.
It was apparent from the above study that binary fingerprints could be used to discriminate approved and experimental drugs with high accuracy. In order to facilitate researchers working in the field of drug discovery, we have developed a webserver for predicting, designing, and screening novel drug-like molecules (http://crdd.osdd.net/oscadd/drugmint/).
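For context, the Lipinski rule of 5 mentioned above can be expressed as a simple threshold filter on four precomputed descriptors. The sketch below assumes the descriptor values are already available and that at most one violation is tolerated, which is a common convention rather than something stated in this abstract. The function names and the example descriptor values are hypothetical.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-5 thresholds a molecule exceeds."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # octanol-water partition coefficient above 5
        h_donors > 5,       # more than 5 hydrogen bond donors
        h_acceptors > 10,   # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)


def passes_ro5(mol_weight, logp, h_donors, h_acceptors, max_violations=1):
    """Return True if the violation count does not exceed max_violations."""
    count = lipinski_violations(mol_weight, logp, h_donors, h_acceptors)
    return count <= max_violations


# Hypothetical descriptor values for two made-up molecules.
small_molecule = passes_ro5(350.4, 2.1, 2, 5)
large_molecule = passes_ro5(720.0, 6.3, 3, 12)
```

A filter of this kind uses only four coarse properties, which illustrates the limitation that motivates fingerprint-based models.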
This article was reviewed by Robert Murphy, Difei Wang (nominated by Yuriy Gusev), and Ahmet Bakan (nominated by James Faeder).
Drug-likeness; FDA; Substructure; Fingerprints; DrugBank; SVM; Lipinski
In the past, numerous methods have been developed for predicting antigenic regions or B-cell epitopes that can induce B-cell response. To the best of authors’ knowledge, no method has been developed for predicting B-cell epitopes that can induce a specific class of antibody (e.g., IgA, IgG) except allergenic epitopes (IgE). In this study, an attempt has been made to understand the relation between primary sequence of epitopes and the class of antibodies generated.
The dataset used in this study has been derived from Immune Epitope Database and consists of 14725 B-cell epitopes that include 11981 IgG, 2341 IgE, 403 IgA specific epitopes and 22835 non-B-cell epitopes. In order to understand the preference of residues or motifs in these epitopes, we computed and compared amino acid and dipeptide composition of IgG, IgE, IgA inducing epitopes and non-B-cell epitopes. Differences in composition profiles of different classes of epitopes were observed, and few residues were found to be preferred. Based on these observations, we developed models for predicting antibody class-specific B-cell epitopes using various features like amino acid composition, dipeptide composition, and binary profiles. Among these, dipeptide composition-based support vector machine model achieved maximum Matthews correlation coefficient of 0.44, 0.70 and 0.45 for IgG, IgE and IgA specific epitopes respectively. All models were developed on experimentally validated non-redundant dataset and evaluated using five-fold cross validation. In addition, the performance of dipeptide-based model was also evaluated on independent dataset.
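The amino acid and dipeptide composition features named above turn a variable-length peptide into fixed-length vectors of 20 and 400 values. The sketch below is one plausible way to compute them and is not the authors' code: it assumes compositions are expressed as percentages, that only the 20 standard residues are counted, and that the function names are free to choose.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Percentage of each of the 20 standard residues in the sequence."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return {aa: 100.0 * residues.count(aa) / len(residues) for aa in AMINO_ACIDS}


def dipeptide_composition(sequence):
    """Percentage of each of the 400 ordered residue pairs in the sequence.

    Overlapping adjacent pairs are counted; pairs that contain a
    non-standard character are ignored.
    """
    sequence = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    valid_pairs = [pair for pair in pairs if pair in counts]
    if not valid_pairs:
        raise ValueError("sequence contains no standard dipeptides")
    for pair in valid_pairs:
        counts[pair] += 1
    return {pair: 100.0 * n / len(valid_pairs) for pair, n in counts.items()}


# Toy peptide used only to show the shape of the output.
aac = amino_acid_composition("ACAC")
dpc = dipeptide_composition("ACAC")
```

Vectors of this kind can be passed directly to a classifier such as a support vector machine, since every peptide yields the same number of features regardless of its length.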
The present study uses amino acid sequence information to predict the tendency of antigens to induce different classes of antibodies. For the first time, in silico models have been developed for predicting B-cell epitopes that can induce a specific class of antibodies. A web service called IgPred has been developed to serve the scientific community. This server will be useful for researchers working in the field of subunit/epitope/peptide-based vaccines and immunotherapy (http://crdd.osdd.net/raghava/igpred/).
This article was reviewed by Dr. M Michael Gromiha, Dr Christopher Langmead (nominated by Dr Robert Murphy) and Dr Lina Ma (nominated by Dr Zhang Zhang).
Support vector machine; Prediction; Antibody; Class-specific; B-cell epitope; Isotype
This article was reviewed by Prof Xiufan Liu (nominated by Dr Purificacion Lopez-Garcia) and Prof Sandor Pongor.
Using phylogenetic analysis of newly available sequences, we characterize A/chicken/Jiangsu/RD5/2013(H10N9) as the currently closest precursor strain for the NA segment of the novel avian-origin H7N9 virus responsible for an outbreak in China. We also show that the internal segments of this precursor strain are closely related to those of the presumed precursor for the HA segment, A/duck/Zhejiang/12/2011(H7N3). This indicates that the sources of both the HA and NA donors for the reassortant virus are of regional rather than migratory-bird origin, and highlights the role of chickens already in the early reassortment events.
Avian influenza; Zoonotic infections; Phylogeny; Reassortment history
The recently discovered Pandoraviruses are by far the largest viruses known, with their 2 megabase genomes exceeding in size the genomes of numerous bacteria and archaea. Pandoraviruses show a distant relationship with other nucleocytoplasmic large DNA viruses (NCLDV) of eukaryotes, lack some of the NCLDV core genes and in particular do not appear to be specifically related to the other, better characterized family of giant viruses, the Mimiviridae. Here we report phylogenetic analysis of 6 core NCLDV genes that confidently places Pandoraviruses within the family Phycodnaviridae, with an apparent specific affinity with Coccolithoviruses. We conclude that, despite their many unusual characteristics, Pandoraviruses are highly derived phycodnaviruses. These findings imply that giant viruses have independently evolved from smaller NCLDV on at least two occasions.
This article was reviewed by Patrick Forterre and Lakshminarayan Iyer. For the full reviews, see the Reviewers’ reports section.
The modern evolutionary synthesis leaves unresolved some of the most
fundamental, long-standing questions in evolutionary biology: What is the
role of sex in evolution? How does complex adaptation evolve? How can
selection operate effectively on genetic interactions? More recently, the
molecular biology and genomics revolutions have raised a host of critical
new questions, through empirical findings that the modern synthesis fails to
explain: for example, the discovery of de novo genes; the immense
constructive role of transposable elements in evolution; genetic variance
and biochemical activity that go far beyond what traditional natural
selection can maintain; perplexing cases of molecular parallelism; and
Presentation of the hypothesis
Here I address these questions from a unified perspective, by means of a new
mechanistic view of evolution that offers a novel connection between
selection on the phenotype and genetic evolutionary change (while relying,
like the traditional theory, on natural selection as the only source of
feedback on the fit between an organism and its environment). I hypothesize
that the mutation that is of relevance for the evolution of complex
adaptation—while not Lamarckian, or “directed” to increase
fitness—is not random, but is instead the outcome of a complex and
continually evolving biological process that combines information from
multiple loci into one. This allows selection on a fleeting combination of
interacting alleles at different loci to have a hereditary effect according
to the combination’s fitness.
Testing and implications of the hypothesis
This proposed mechanism addresses the problem of how beneficial genetic
interactions can evolve under selection, and also offers an intuitive
explanation for the role of sex in evolution, which focuses on sex as the
generator of genetic combinations. Importantly, it also implies that genetic
variation that has appeared neutral through the lens of traditional theory
can actually experience selection on interactions and thus has a much
greater adaptive potential than previously considered. Empirical evidence
for the proposed mechanism from both molecular evolution and evolution at
the organismal level is discussed, and multiple predictions are offered by
which it may be tested.
This article was reviewed by Nigel Goldenfeld (nominated by Eugene V.
Koonin), Jürgen Brosius and W. Ford Doolittle.
Adaptive evolution; Neutral theory; Sex and recombination; Epistasis; Junk DNA; de novo genes; Transcriptional promiscuity; Mutation bias; Evolvability
Significant efforts have been made to address the problem of identifying short genes in prokaryotic genomes. However, most known methods are not effective in detecting short genes. Because of the limited information contained in short DNA sequences, it is very difficult to accurately distinguish between protein coding and non-coding sequences in prokaryotic genomes. We have developed a new Iteratively Adaptive Sparse Partial Least Squares (IASPLS) algorithm as the classifier to improve the accuracy of the identification process.
For testing, we chose the short coding and non-coding sequences from seven prokaryotic organisms. We used seven feature sets (including GC content, Z-curve, etc.) of short genes.
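Two of the named feature types, GC content and the Z-curve, can be sketched as follows. The sketch assumes the widely used phase-specific form of the Z-curve, in which the three components are computed separately for each codon position, giving nine values per sequence. The exact feature definitions used in the study are not stated here, so the function names and the nine-parameter choice are assumptions.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a DNA sequence."""
    seq = seq.upper()
    n = sum(seq.count(b) for b in "ACGT")
    if n == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / n


def zcurve_features(seq):
    """Nine phase-specific Z-curve parameters (x, y, z for codon positions 1 to 3)."""
    seq = seq.upper()
    features = []
    for phase in range(3):
        # Bases found at this codon position, ignoring ambiguous symbols.
        bases = [b for b in seq[phase::3] if b in "ACGT"]
        n = len(bases)
        if n == 0:
            raise ValueError("sequence too short for phase-specific features")
        a, c, g, t = (bases.count(b) / n for b in "ACGT")
        features.extend([
            (a + g) - (c + t),  # x: purine versus pyrimidine
            (a + c) - (g + t),  # y: amino versus keto
            (a + t) - (g + c),  # z: weak versus strong hydrogen bonding
        ])
    return features
```

For example, `zcurve_features("ATGATGATG")` describes a sequence whose first, second and third codon positions are uniformly A, T and G, so every component takes an extreme value of 1 or -1.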
In comparison with the GeneMarkS, Metagene, Orphelia, and Heuristic Approach methods, our model achieved the best prediction performance in the identification of short prokaryotic genes. Even when we focused on the very short length group ([60–100 nt)), our model provided sensitivity as high as 83.44% and specificity as high as 92.8%. These values are two or three times higher than those of three of the other methods, while Metagene fails to recognize genes in this length range.
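A per-length-group evaluation of this kind can be sketched as below. The sketch assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); note that some gene-finding papers use "specificity" for TP / (TP + FP) instead, and which convention applies here is not stated. The record layout and function name are hypothetical.

```python
def evaluate_length_group(records, lo=60, hi=100):
    """Sensitivity and specificity for sequences with lo <= length < hi.

    records: iterable of (length_nt, is_coding, predicted_coding) tuples.
    Returns (sensitivity, specificity); a value is None when undefined.
    """
    tp = tn = fp = fn = 0
    for length, is_coding, predicted in records:
        if not (lo <= length < hi):  # half-open interval, as in [60-100 nt)
            continue
        if is_coding and predicted:
            tp += 1
        elif is_coding and not predicted:
            fn += 1
        elif not is_coding and predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity
```

Restricting the counts to one length bin before computing the ratios keeps the many easier, longer sequences from masking poor performance on the shortest ones.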
The experiments also showed that IASPLS improves identification accuracy in comparison with other widely used classifiers, i.e. Logistic, Random Forest (RF) and K nearest neighbors (KNN). The accuracy obtained with IASPLS was 5.90% or more higher than that of the other methods. In addition to the improvements in accuracy, IASPLS required ten times less computer time than KNN or RF.
We conclude that our method is preferable as an automated method of short gene classification. Its linearity and easily optimized parameters make it practical for predicting short genes of newly sequenced or under-studied species.
This article was reviewed by Alexey Kondrashov, Rajeev Azad (nominated by Dr J.Peter Gogarten) and Yuriy Fofanov (nominated by Dr Janet Siefert).
Iteratively adaptive SPLS; Short coding sequence; Prokaryotic genome
LINE-1 (L1) retrotransposons are repetitive elements in mammalian genomes. They are
capable of synthesizing DNA on their own RNA templates by harnessing reverse
transcriptase (RT) that they encode. Abundantly expressed full-length L1s and their
RT are found to globally influence gene expression profiles, differentiation state,
and proliferation capacity of early embryos and many types of cancer, albeit by as yet
unknown mechanisms. They are essential for the progression of early development and
the establishment of a cancer-related undifferentiated state. This raises important
questions regarding the functional significance of L1 RT in these cell systems.
Massive nuclear L1-linked reverse transcription has been shown to occur in mouse
zygotes and two-cell embryos, and this phenomenon is purported to be DNA replication
independent. This review argues against this claim with the goal of understanding the
nature of this phenomenon and the role of L1 RT in early embryos and cancers.
Available L1 data are revisited and integrated with relevant findings accumulated in
the fields of replication timing, chromatin organization, and epigenetics, bringing
together evidence that strongly supports two new concepts. First, noncanonical
replication of a portion of genomic full-length L1s by means of L1 RNP-driven reverse
transcription is proposed to co-exist with DNA polymerase-dependent replication of
the rest of the genome during the same round of DNA replication in embryonic and
cancer cell systems. Second, the role of this mechanism is thought to be epigenetic;
it might promote transcriptional competence of neighboring genes linked to
undifferentiated states through the prevention of tethering of involved L1s to the
nuclear periphery. From the standpoint of these concepts, several hitherto
inexplicable phenomena can be explained. Testing methods for the model are
This article was reviewed by Dr. Philip Zegerman (nominated by Dr. Orly Alter),
Dr. I. King Jordan, and Dr. Panayiotis (Takis) Benos. For the complete reviews,
see the Reviewers’ Reports section.
LINE-1; L1 retrotransposon; DNA replication; Replication timing; Epigenetics; Pluripotency; Cancer; Embryonic stem cells; Chromatin domains; Origins of replication
It is now widely accepted that an “RNA world” existed in early evolution. During division of RNA-based protocells, random distribution of individual genes (which simultaneously acted as ribozymes) between offspring might have resulted in gene loss, especially as the number of gene types increased. Therefore, the emergence of a chromosome carrying linked genes was critical for the prosperity of the RNA world. However, this event faced several immediate difficulties. For example, a chromosome would be much longer than individual genes, and thus more likely to degrade and less likely to replicate completely; the copying of the chromosome might start at internal sites and be only partial; and, without a complex transcription mechanism, the synthesis of distinct ribozymes would become problematic.
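The gene-loss argument can be made concrete with a small calculation. The sketch below is a toy model and not the simulation used in this study: it assumes that a protocell holds the same number of copies of every gene type and that each copy goes to a given daughter independently with probability 1/2. Under those assumptions the probability that a daughter inherits at least one copy of every gene type is (1 - 0.5^copies)^gene_types, and a short Monte Carlo run can be used to check the formula. All names are hypothetical.

```python
import random


def prob_complete_daughter(gene_types, copies):
    """Closed-form probability that one daughter receives every gene type."""
    return (1.0 - 0.5 ** copies) ** gene_types


def simulate_complete_daughter(gene_types, copies, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability under the toy model."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(trials):
        # The daughter is complete if, for every gene type,
        # at least one of its copies is assigned to her.
        if all(
            any(rng.random() < 0.5 for _ in range(copies))
            for _ in range(gene_types)
        ):
            complete += 1
    return complete / trials
```

With three copies per gene, the probability is 0.875 for a single gene type and falls to about 0.26 for ten gene types, which illustrates why unlinked genes become harder to retain as the number of gene types grows.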
Inspired by features of viroids, which have been suggested to be “living fossils” of the RNA world, we hypothesized that these difficulties could have been overcome if the chromosome adopted a circular form and small, self-cleaving ribozymes (e.g. hammerhead ribozymes) resided at the sites between genes. A computer simulation using a Monte-Carlo method was conducted to investigate this hypothesis. The simulation shows that an RNA chromosome can spread (increase in quantity and be sustained) in the system if it is circular and its linear “transcripts” are readily broken at the sites between genes; that the chromosome works as genetic material while the ribozymes “coded” by it serve as functional molecules; and that both circularity and self-cleavage are important for the spread of the chromosome.
In the RNA world, circularity and self-cleavage may have been adopted as a strategy to overcome the immediate difficulties for the emergence of a chromosome (with linked genes). The strategy suggested here is very simple and likely to have been used at this early stage of evolution. By demonstrating the possibility of the emergence of an RNA chromosome, this study opens up the prospect of a prosperous RNA world, populated by RNA-based protocells with a number of genes and complicated functions.
This article was reviewed by Sergei Kazakov (nominated by Laura Landweber), Nobuto Takeuchi (nominated by Anthony Poole), and Eugene Koonin.