1.  The human platelet: strong transcriptome correlations among individuals associate weakly with the platelet proteome 
Biology Direct  2014;9:3.
For the anucleate platelet it has been unclear how well platelet transcriptomes correlate among different donors or across different RNA profiling platforms, and what the transcriptomes’ relationship is with the platelet proteome. We profiled the platelet transcriptome of 10 healthy young males (5 white and 5 black) with no notable clinical history using RNA sequencing and by Affymetrix microarray.
We found that the abundance of platelet mRNA transcripts was highly correlated across the 10 individuals, independently of race and of the employed technology. Our RNA-seq data showed that these high inter-individual correlations extend beyond mRNAs to several categories of non-coding RNAs. Pseudogenes represented a notable exception by exhibiting a difference in expression by race. Comparison of our mRNA signatures to a publicly available quantitative platelet proteome showed that most (87.5%) identified platelet proteins had a detectable corresponding mRNA. However, a high number of mRNAs that were present in the transcriptomes of all 10 individuals had no representation in the proteome. Spearman correlations of the relative abundances for those genes represented by both an mRNA and a protein showed a weak (~0.3) connection. Further analysis of the overlapping and non-overlapping platelet mRNAs and proteins identified gene groups corresponding to distinct cellular processes.
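The mRNA-to-protein comparison above relies on Spearman rank correlation. As a minimal sketch of that statistic (not the authors' pipeline), the following computes it from scratch as the Pearson correlation of average ranks. The abundance values are hypothetical and chosen only for illustration.

```python
def rank(values):
    """Return 1-based average ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical abundances for five genes (illustrative values only).
mrna = [120.0, 15.0, 300.0, 42.0, 7.0]
protein = [8.0, 30.0, 25.0, 3.0, 5.0]
rho = spearman(mrna, protein)
```

Because only ranks enter the calculation, the result is unaffected by the very different measurement scales of transcript and protein abundance.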
The results of our analyses provide novel insights for platelet biology, show only a weak connection between the platelet transcriptome and proteome, and indicate that it is feasible to assemble a platelet mRNA-ome that can serve as a reference for future platelet transcriptomic studies of human health and disease.
Reviewed by
This article was reviewed by Dr Mikhail Dozmorov (nominated by Dr Yuri Gusev), Dr Neil Smalheiser and Dr Eugene Koonin.
PMCID: PMC3937023  PMID: 24524654
2.  Faster evolving Drosophila paralogs lose expression rate and ubiquity and accumulate more non-synonymous SNPs 
Biology Direct  2014;9:2.
Duplicated genes can persist indefinitely in genomes if either both copies retain the original function due to dosage benefit (gene conservation), or one of the copies assumes a novel function (neofunctionalization), or both copies become required to perform the function previously accomplished by a single copy (subfunctionalization), or through a combination of these mechanisms. Different models of duplication retention imply different predictions about substitution rates in the coding portion of paralogs and about asymmetry of these rates.
We analyse sequence evolution asymmetry in paralogs present in 12 Drosophila genomes using the nearest non-duplicated orthologous outgroup as a reference. Those paralogs present in D. melanogaster are analysed in conjunction with the asymmetry of expression rate and ubiquity and of segregating non-synonymous polymorphisms in the same paralogs. Paralogs accumulate substitutions, on average, faster than their nearest singleton orthologs. The distribution of paralogs’ substitution rate asymmetry is overdispersed relative to that of orthologous clades, containing disproportionally more unusually symmetric and unusually asymmetric clades. We show that paralogs are more asymmetric in: a) clades orthologous to highly constrained singleton genes; b) genes with high expression level; c) genes with ubiquitous expression and d) non-tandem duplications. We further demonstrate that, in each asymmetrically evolving pair of paralogs, the faster evolving member of the pair tends to have lower average expression rate, lower expression uniformity and higher frequency of non-synonymous SNPs than its slower evolving counterpart.
Our findings are consistent with the hypothesis that many duplications in Drosophila are retained despite stabilising selection being more relaxed in one of the paralogs than in the other, suggesting a widespread unfinished pseudogenization. This phenomenon is likely to make detection of neo- and subfunctionalization signatures difficult, as these models of duplication retention also predict asymmetries in substitution rates and expression profiles.
This article has been reviewed by Dr. Jia Zeng (nominated by Dr. I. King Jordan), Dr. Fyodor Kondrashov and Dr. Yuri Wolf.
PMCID: PMC3906896  PMID: 24438455
Gene duplication; Pseudogenization; Drosophila; Substitution rate; Gene expression; Polymorphism
3.  Earth before life 
Biology Direct  2014;9:1.
A recent study argued, based on data on functional genome size of major phyla, that there is evidence life may have originated significantly prior to the formation of the Earth.
Here a more refined regression analysis is performed in which 1) measurement error is systematically taken into account, and 2) interval estimates (e.g., confidence or prediction intervals) are produced. It is shown that such models for which the interval estimate for the time origin of the genome includes the age of the Earth are consistent with observed data.
The appearance of life after the formation of the Earth is consistent with the data set under examination.
This article was reviewed by Yuri Wolf, Peter Gogarten, and Christoph Adami.
PMCID: PMC3892030  PMID: 24405803
Genome; Evolution; Origin; Regression; Measurement error; Confidence interval; Prediction interval
4.  Models of gene gain and gene loss for probabilistic reconstruction of gene content in the last universal common ancestor of life 
Biology Direct  2013;8:32.
The problem of probabilistic inference of gene content in the last common ancestor of several extant species with completely sequenced genomes is: for each gene that is conserved in all or some of the genomes, assign the probability that its ancestral gene was present in the genome of their last common ancestor.
We have developed a family of models of gene gain and gene loss in evolution, and applied the maximum-likelihood approach that uses a phylogenetic tree of prokaryotes and the record of orthologous relationships between their genes to infer the gene content of LUCA, the Last Universal Common Ancestor of all currently living cellular organisms. The crucial parameter, the ratio of gene losses and gene gains, was estimated from the data and was higher in models that take account of the number of in-paralogs in genomes than in models that treat gene presences and absences as a binary trait.
While the numbers of genes that are placed confidently into LUCA are similar in the ML methods and in previously published methods that use various parsimony-based approaches, the identities of genes themselves are different. Most of the models of either kind treat the genes found in many existing genomes in a similar way, assigning to them high probabilities of being ancestral (“high ancestrality”). The ML models are more likely than others to assign high ancestrality to the genes that are relatively rare in the present-day genomes.
This article was reviewed by Martijn A Huynen, Toni Gabaldón and Fyodor Kondrashov.
PMCID: PMC3892064  PMID: 24354654
5.  Highly homologous eEF1A1 and eEF1A2 exhibit differential post-translational modification with significant enrichment around localised sites of sequence variation 
Biology Direct  2013;8:29.
Translation elongation factors eEF1A1 and eEF1A2 are 92% identical but exhibit non-overlapping expression patterns. While the two proteins are predicted to have similar tertiary structures, it is notable that the minor variations between their sequences are highly localised within their modelled structures. We used recently available high-throughput “omics” data to assess the spatial location of post-translational modifications and discovered that they are highly enriched on those surface regions of the protein that correspond to the clusters of sequence variation. This observation suggests how these two isoforms could be differentially regulated allowing them to perform distinct functions.
This article was reviewed by Frank Eisenhaber and Ramanathan Sowdhamini.
PMCID: PMC3868327  PMID: 24220286
eEF1A1; eEF1A2; Phosphorylation; Methylation; Acetylation; Ubiquitination; Post-translational modification
6.  DrugMint: a webserver for predicting and designing of drug-like molecules 
Biology Direct  2013;8:28.
Identification of drug-like molecules is one of the major challenges in the field of drug discovery. Existing approaches, such as the Lipinski rule of 5 (Ro5) and Operea, have their own limitations. Thus, there is a need to develop a computational method that can predict the drug-likeness of a molecule with precision. In addition, there is a need to develop an algorithm for screening chemical libraries for their drug-like properties.
In this study, we have used 1347 approved and 3206 experimental drugs for developing a knowledge-based computational model for predicting drug-likeness of a molecule. We have used freely available PaDEL software for computing molecular fingerprints/descriptors of the molecules for developing prediction models. Weka software has been used for feature selection in order to identify the best fingerprints. We have developed various classification models using different types of fingerprints like Estate, PubChem, Extended, FingerPrinter, MACCS keys, GraphsOnlyFP, SubstructureFP, Substructure FPCount, Klekota-RothFP, Klekota-Roth FPCount. It was observed that the models developed using MACCS keys based fingerprints discriminated approved and experimental drugs with higher precision. Our model based on 159 MACCS keys predicted drug-likeness of the molecules with 89.96% accuracy along with 0.77 MCC. Our analysis indicated that MACCS keys (ISIS keys) 112, 122, 144, and 150 were highly prevalent in the approved drugs. The screening of ZINC (drug-like) and ChEMBL databases showed that around 78.33% and 72.43% of the compounds present in these databases had drug-like potential.
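The accuracy and MCC (Matthews correlation coefficient) figures quoted above are both derived from a binary confusion matrix. The sketch below shows the standard definitions only. The counts are hypothetical and are not taken from the study.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: "positive" = approved drug, "negative" = experimental.
tp, tn, fp, fn = 90, 80, 20, 10
acc = accuracy(tp, tn, fp, fn)
score = mcc(tp, tn, fp, fn)
```

MCC is often reported alongside accuracy for unbalanced data sets such as this one, because accuracy alone can look high when a classifier favors the majority class.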
It was apparent from the above study that the binary fingerprints could be used to discriminate approved and experimental drugs with high accuracy. In order to facilitate researchers working in the field of drug discovery, we have developed a webserver for predicting, designing, and screening novel drug-like molecules.
This article was reviewed by Robert Murphy, Difei Wang (nominated by Yuriy Gusev), and Ahmet Bakan (nominated by James Faeder).
PMCID: PMC3826839  PMID: 24188205
Drug-likeness; FDA; Substructure; Fingerprints; DrugBank; SVM; Lipinski
7.  Identification of B-cell epitopes in an antigen for inducing specific class of antibodies 
Biology Direct  2013;8:27.
In the past, numerous methods have been developed for predicting antigenic regions or B-cell epitopes that can induce a B-cell response. To the best of the authors’ knowledge, no method has been developed for predicting B-cell epitopes that can induce a specific class of antibody (e.g., IgA, IgG) except allergenic epitopes (IgE). In this study, an attempt has been made to understand the relation between the primary sequence of epitopes and the class of antibodies generated.
The dataset used in this study has been derived from the Immune Epitope Database and consists of 14725 B-cell epitopes that include 11981 IgG, 2341 IgE, 403 IgA specific epitopes and 22835 non-B-cell epitopes. In order to understand the preference of residues or motifs in these epitopes, we computed and compared amino acid and dipeptide composition of IgG, IgE, IgA inducing epitopes and non-B-cell epitopes. Differences in composition profiles of different classes of epitopes were observed, and a few residues were found to be preferred. Based on these observations, we developed models for predicting antibody class-specific B-cell epitopes using various features like amino acid composition, dipeptide composition, and binary profiles. Among these, the dipeptide composition-based support vector machine model achieved a maximum Matthews correlation coefficient of 0.44, 0.70 and 0.45 for IgG, IgE and IgA specific epitopes respectively. All models were developed on an experimentally validated non-redundant dataset and evaluated using five-fold cross validation. In addition, the performance of the dipeptide-based model was also evaluated on an independent dataset.
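Dipeptide composition, the best-performing feature above, is commonly defined as the percentage of each of the 400 ordered amino acid pairs among all adjacent pairs in a peptide. The sketch below follows that common definition. The exact normalization used for IgPred is an assumption, and the function name is illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order so vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(peptide):
    """Return a 400-element vector of dipeptide percentages.

    Adjacent pairs containing non-standard residues are skipped.
    """
    peptide = peptide.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(peptide) - 1):
        pair = peptide[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


# Toy peptide: adjacent pairs are AC, CA, AC.
vec = dipeptide_composition("ACAC")
```

A vector of this kind, one per epitope, is the sort of fixed-length input that a support vector machine can be trained on regardless of peptide length.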
The present study utilizes amino acid sequence information for predicting the tendencies of antigens to induce different classes of antibodies. For the first time, in silico models have been developed for predicting B-cell epitopes that can induce a specific class of antibodies. A web service called IgPred has been developed to serve the scientific community. This server will be useful for researchers working in the field of subunit/epitope/peptide-based vaccines and immunotherapy.
This article was reviewed by Dr. M Michael Gromiha, Dr Christopher Langmead (nominated by Dr Robert Murphy) and Dr Lina Ma (nominated by Dr Zhang Zhang).
PMCID: PMC3831251  PMID: 24168386
Support vector machine; Prediction; Antibody; Class-specific; B-cell epitope; Isotype
8.  Pandoraviruses are highly derived phycodnaviruses 
Biology Direct  2013;8:25.
The recently discovered Pandoraviruses are by far the largest viruses known, with their 2 megabase genomes exceeding in size the genomes of numerous bacteria and archaea. Pandoraviruses show a distant relationship with other nucleocytoplasmic large DNA viruses (NCLDV) of eukaryotes, lack some of the NCLDV core genes and in particular do not appear to be specifically related to the other, better characterized family of giant viruses, the Mimiviridae. Here we report phylogenetic analysis of 6 core NCLDV genes that confidently places Pandoraviruses within the family Phycodnaviridae, with an apparent specific affinity with Coccolithoviruses. We conclude that, despite their many unusual characteristics, Pandoraviruses are highly derived phycodnaviruses. These findings imply that giant viruses have independently evolved from smaller NCLDV on at least two occasions.
This article was reviewed by Patrick Forterre and Lakshminarayan Iyer. For the full reviews, see the Reviewers’ reports section.
PMCID: PMC3924356  PMID: 24148757
9.  Recognizing short coding sequences of prokaryotic genome using a novel iteratively adaptive sparse partial least squares algorithm 
Biology Direct  2013;8:23.
Significant efforts have been made to address the problem of identifying short genes in prokaryotic genomes. However, most known methods are not effective in detecting short genes. Because of the limited information contained in short DNA sequences, it is very difficult to accurately distinguish between protein coding and non-coding sequences in prokaryotic genomes. We have developed a new Iteratively Adaptive Sparse Partial Least Squares (IASPLS) algorithm as the classifier to improve the accuracy of the identification process.
For testing, we chose the short coding and non-coding sequences from seven prokaryotic organisms. We used seven feature sets (including GC content, Z-curve, etc.) of short genes.
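Two of the feature types named above can be illustrated briefly. The sketch below computes GC content and the three Z-curve components in their basic whole-sequence form (purine vs. pyrimidine, amino vs. keto, weak vs. strong hydrogen bonding). The feature sets actually used with IASPLS are not specified here and may differ, for example by being codon-phase specific.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def z_curve(seq):
    """Whole-sequence Z-curve components (x, y, z) from base counts.

    x: purines (A, G) minus pyrimidines (C, T)
    y: amino bases (A, C) minus keto bases (G, T)
    z: weak H-bond bases (A, T) minus strong H-bond bases (G, C)
    """
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    x = (a + g) - (c + t)
    y = (a + c) - (g + t)
    z = (a + t) - (g + c)
    return x, y, z


# Toy sequence for illustration only.
example = "ATGCGC"
gc = gc_content(example)
xyz = z_curve(example)
```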
In comparison with the GeneMarkS, Metagene, Orphelia, and Heuristic Approaches methods, our model achieved the best prediction performance in identification of short prokaryotic genes. Even when we focused on the very short length group ([60–100 nt)), our model provided sensitivity as high as 83.44% and specificity as high as 92.8%. These values are two or three times higher than three of the other methods while Metagene fails to recognize genes in this length range.
The experiments also showed that IASPLS can improve the identification accuracy in comparison with other widely used classifiers, i.e. Logistic, Random Forest (RF) and K nearest neighbors (KNN). The accuracy using IASPLS was improved by 5.90% or more in comparison with the other methods. In addition to the improvements in accuracy, IASPLS required ten times less computer time than KNN or RF.
We conclude that our method is preferable for application as an automated method of short gene classification. Its linearity and easily optimized parameters make it practicable for predicting short genes of newly-sequenced or under-studied species.
This article was reviewed by Alexey Kondrashov, Rajeev Azad (nominated by Dr J.Peter Gogarten) and Yuriy Fofanov (nominated by Dr Janet Siefert).
PMCID: PMC3852556  PMID: 24067167
Iteratively adaptive SPLS; Short coding sequence; Prokaryotic genome
10.  Circularity and self-cleavage as a strategy for the emergence of a chromosome in the RNA-based protocell 
Biology Direct  2013;8:21.
It is now popularly accepted that an “RNA world” existed in early evolution. During division of RNA-based protocells, random distribution of individual genes (simultaneously as ribozymes) between offspring might have resulted in gene loss, especially when the number of gene types increased. Therefore, the emergence of a chromosome carrying linked genes was critical for the prosperity of the RNA world. However, there were quite a few immediate difficulties for this event to occur. For example, a chromosome would be much longer than individual genes, and thus more likely to degrade and less likely to replicate completely; the copying of the chromosome might start at middle sites and be only partial; and, without a complex transcription mechanism, the synthesis of distinct ribozymes would become problematic.
Inspired by features of viroids, which have been suggested as “living fossils” of the RNA world, we supposed that these difficulties could have been overcome if the chromosome adopted a circular form and small, self-cleaving ribozymes (e.g. the hammer head ribozymes) resided at the sites between genes. Computer simulation using a Monte-Carlo method was conducted to investigate this hypothesis. The simulation shows that an RNA chromosome can spread (increase in quantity and be sustained) in the system if it is a circular one and its linear “transcripts” are readily broken at the sites between genes; the chromosome works as genetic material and ribozymes “coded” by it serve as functional molecules; and both circularity and self-cleavage are important for the spread of the chromosome.
In the RNA world, circularity and self-cleavage may have been adopted as a strategy to overcome the immediate difficulties for the emergence of a chromosome (with linked genes). The strategy suggested here is very simple and likely to have been used in this early stage of evolution. By demonstrating the possibility of the emergence of an RNA chromosome, this study opens up the prospect of a prosperous RNA world, populated by RNA-based protocells with a number of genes, showing complicated functions.
This article was reviewed by Sergei Kazakov (nominated by Laura Landweber), Nobuto Takeuchi (nominated by Anthony Poole), and Eugene Koonin.
PMCID: PMC3765326  PMID: 23971788
11.  Novel autoproteolytic and DNA-damage sensing components in the bacterial SOS response and oxidized methylcytosine-induced eukaryotic DNA demethylation systems 
Biology Direct  2013;8:20.
The bacterial SOS response is an elaborate program for DNA repair, cell cycle regulation and adaptive mutagenesis under stress conditions. Using sensitive sequence and structure analysis, combined with contextual information derived from comparative genomics and domain architectures, we identify two novel domain superfamilies in the SOS response system. We present evidence that one of these, the SOS response associated peptidase (SRAP; Pfam: DUF159) is a novel thiol autopeptidase. Given the involvement of other autopeptidases, such as LexA and UmuD, in the SOS response, this finding suggests that multiple structurally unrelated peptidases have been recruited to this process. The second of these, the ImuB-C superfamily, is linked to the Y-family DNA polymerase-related domain in ImuB, and also occurs as a standalone protein. We present evidence using gene neighborhood analysis that both these domains function with different mutagenic polymerases in bacteria, such as Pol IV (DinB), Pol V (UmuCD) and ImuA-ImuB-DnaE2 and also other repair systems, which either deploy Ku and an ATP-dependent ligase or a SplB-like radical SAM photolyase. We suggest that the SRAP superfamily domain functions as a DNA-associated autoproteolytic switch that recruits diverse repair enzymes upon DNA damage, whereas the ImuB-C domain performs a similar function albeit in a non-catalytic fashion. We propose that C3Orf37, the eukaryotic member of the SRAP superfamily, which has been recently shown to specifically bind DNA with 5-hydroxymethylcytosine, 5-formylcytosine and 5-carboxycytosine, is a sensor for these oxidized bases generated by the TET enzymes from methylcytosine. Hence, its autoproteolytic activity might help it act as a switch that recruits DNA repair enzymes to remove these oxidized methylcytosine species as part of the DNA demethylation pathway downstream of the TET enzymes.
This article was reviewed by RDS, RF and GJ.
PMCID: PMC3765255  PMID: 23945014
12.  Parabolic replicator dynamics and the principle of minimum Tsallis information gain 
Biology Direct  2013;8:19.
Non-linear, parabolic (sub-exponential) and hyperbolic (super-exponential) models of prebiological evolution of molecular replicators have been proposed and extensively studied. The parabolic models appear to be the most realistic approximations of real-life replicator systems due primarily to product inhibition. Unlike the more traditional exponential models, the distribution of individual frequencies in an evolving parabolic population is not described by the Maximum Entropy (MaxEnt) Principle in its traditional form, whereby the distribution with the maximum Shannon entropy is chosen among all the distributions that are possible under the given constraints. We sought to identify a more general form of the MaxEnt principle that would be applicable to parabolic growth.
We consider a model of a population that reproduces according to the parabolic growth law and show that the frequencies of individuals in the population minimize the Tsallis relative entropy (non-additive information gain) at each time moment. Next, we consider a model of a parabolically growing population that maintains a constant total size and provide an “implicit” solution for this system. We show that in this case, the frequencies of the individuals in the population also minimize the Tsallis information gain at each moment of the “internal time” of the population.
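For readers unfamiliar with the quantity being minimized, the sketch below evaluates the Tsallis relative entropy in its standard form, D_q(p‖r) = (Σ p_i^q r_i^(1−q) − 1)/(q − 1), which reduces to the Kullback-Leibler divergence as q approaches 1. The value of q that corresponds to a particular parabolic growth law is not given here, so q is left as a free parameter. The distributions are hypothetical.

```python
from math import log


def tsallis_relative_entropy(p, r, q):
    """Tsallis relative entropy D_q(p || r) for q > 0.

    p and r are discrete distributions over the same states;
    every r_i must be positive. At q == 1 the Kullback-Leibler
    limit is returned.
    """
    if abs(q - 1.0) < 1e-12:
        return sum(pi * log(pi / ri) for pi, ri in zip(p, r) if pi > 0)
    s = sum(pi ** q * ri ** (1 - q) for pi, ri in zip(p, r) if pi > 0)
    return (s - 1.0) / (q - 1.0)


# Hypothetical current frequencies p and reference frequencies r.
p = [0.5, 0.5]
r = [0.9, 0.1]
d_half = tsallis_relative_entropy(p, r, 0.5)
d_kl = tsallis_relative_entropy(p, r, 1.0)
```

The measure is zero when the two distributions coincide and positive otherwise, which is what makes it usable as an information gain to be minimized.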
The results of this analysis show that the general MaxEnt principle is the underlying law for the evolution of a broad class of replicator systems including not only exponential but also parabolic and hyperbolic systems. The choice of the appropriate entropy (information) function depends on the growth dynamics of a particular class of systems. The Tsallis entropy is non-additive for independent subsystems, i.e. the information on the subsystems is insufficient to describe the system as a whole. In the context of prebiotic evolution, this “non-reductionist” nature of parabolic replicator systems might reflect the importance of group selection and competition between ensembles of cooperating replicators.
This article was reviewed by Viswanadham Sridhara (nominated by Claus Wilke), Purushottam Dixit (nominated by Sergei Maslov), and Nick Grishin. For the complete reviews, see the Reviewers’ Reports section.
PMCID: PMC3765284  PMID: 23937956
Replicator equation; Parabolic growth; Tsallis entropy; Non-extensive statistical mechanics; MaxEnt principle
13.  How protein targeting to primary plastids via the endomembrane system could have evolved? A new hypothesis based on phylogenetic studies 
Biology Direct  2013;8:18.
It is commonly assumed that a heterotrophic ancestor of the supergroup Archaeplastida/Plantae engulfed a cyanobacterium that was transformed into a primary plastid; however, it is still unclear how nuclear-encoded proteins initially were imported into the new organelle. Most proteins targeted to primary plastids carry a transit peptide and are transported post-translationally using Toc and Tic translocons. There are, however, several proteins with N-terminal signal peptides that are directed to higher plant plastids in vesicles derived from the endomembrane system (ES). The existence of these proteins inspired a hypothesis that all nuclear-encoded, plastid-targeted proteins initially carried signal peptides and were targeted to the ancestral primary plastid via the host ES.
We present the first phylogenetic analyses of Arabidopsis thaliana α-carbonic anhydrase (CAH1), Oryza sativa nucleotide pyrophosphatase/phosphodiesterase (NPP1), and two O. sativa α-amylases (αAmy3, αAmy7), proteins that are directed to higher plant primary plastids via the ES. We also investigated protein disulfide isomerase (RB60) from the green alga Chlamydomonas reinhardtii because of its peculiar dual post- and co-translational targeting to both the plastid and ES. Our analyses show that these proteins all are of eukaryotic rather than cyanobacterial origin, and that their non-plastid homologs are equipped with signal peptides responsible for co-translational import into the host ES. Our results indicate that vesicular trafficking of proteins to primary plastids evolved long after the cyanobacterial endosymbiosis (possibly only in higher plants) to permit their glycosylation and/or transport to more than one cellular compartment.
The proteins we analyzed are not relics of ES-mediated protein targeting to the ancestral primary plastid. Available data indicate that Toc- and Tic-based translocation dominated protein import into primary plastids from the beginning. Only a handful of host proteins, which already were targeted through the ES, later were adapted to reach the plastid via the vesicular trafficking. They represent a derived class of higher plant plastid-targeted proteins with an unusual evolutionary history.
This article was reviewed by Prof. William Martin, Dr. Philippe Deschamps (nominated by Dr. Purificacion Lopez-Garcia) and Dr Simonetta Gribaldo.
PMCID: PMC3716720  PMID: 23845039
Endomembrane system; Endosymbiont; Endoplasmic reticulum; Golgi apparatus; Horizontal gene transfer; Phylogeny; Plastid; Plastid transit peptide; Primary endosymbiosis; Protein trafficking; Signal peptide
14.  The mechanistic and evolutionary aspects of the 2′- and 3′-OH paradigm in biosynthetic machinery 
Biology Direct  2013;8:17.
The translation machinery underlies a multitude of biological processes within the cell. The design and implementation of the modern translation apparatus on even the simplest course of action is extremely complex, and involves different RNA and protein factors. According to the “RNA world” idea, the critical link in the translation machinery may be assigned to an adaptor tRNA molecule. Its exceptional functional and structural characteristics are of primary importance in understanding the evolutionary relationships among all these macromolecular components.
Presentation of the hypothesis
The 2′-3′ hydroxyls of the tRNA A76 constitute chemical groups of critical functional importance, as they are implicated in almost all phases of protein biosynthesis. They contribute to: a) each step of the tRNA aminoacylation reaction catalyzed by aminoacyl-tRNA synthetases (aaRSs); b) the isomerase activity of EF-Tu, involving a mixture of the 2′(3′)- aminoacyl tRNA isomers as substrates, thereby producing the required combination of amino acid and tRNA; and c) peptide bond formation at the peptidyl transferase center (PTC) of the ribosome. We hypothesize that specific functions assigned to the 2′-3′ hydroxyls during peptide bond formation co-evolved, together with two modes of attack on the aminoacyl-adenylate carbonyl typical for two classes of aaRSs, and alongside the isomerase activity of EF-Tu. Protein components of the translational apparatus are universally recognized as being of ancient origin, possibly replacing RNA-based enzymes that may have existed before the last universal common ancestor (LUCA). We believe that a remnant of these processes is still imprinted on the organization of modern-day translation.
Testing and implications of the hypothesis
Earlier publications indicate that it is possible to select ribozymes capable of attaching the aa-AMP moiety to RNA molecules. The scenario described herein would gain general acceptance if a ribozyme able to activate the amino acid and transfer it onto the terminal ribose of the tRNA were found in any life form or generated in vitro. Interestingly, recent studies have demonstrated the plausibility of using metals, likely abundant under primordial conditions, as biomimetic catalysts of the aminoacylation reaction.
This article was reviewed by Henri Grosjean, Manuel Santos and Eugene Koonin. For complete reviews, go to the Reviewers’ reports section.
PMCID: PMC3716924  PMID: 23835000
Aminoacyl-tRNA synthetases; Elongation factor EF-Tu; Ribosome; 2′-3′ hydroxyls of the ribose
15.  RNaseIII and T4 Polynucleotide Kinase sequence biases and solutions during RNA-seq library construction 
Biology Direct  2013;8:16.
RNA-seq is a next generation sequencing method with a wide range of applications including single nucleotide polymorphism (SNP) detection, splice junction identification, and gene expression level measurement. However, RNA-seq sequence data can be biased during library construction, resulting in incorrect data for SNP, splice junction, and gene expression studies. Here, we developed new library preparation methods to limit such biases.
A whole transcriptome library prepared for the SOLiD system displayed numerous read duplications (pile-ups) and gaps in known exons. The pile-ups and gaps of the whole transcriptome library caused a loss of SNP and splice junction information and reduced the quality of gene expression results. Further, we found clear sequence biases for both 5' and 3' end reads in the whole transcriptome library. To remove this bias, RNaseIII fragmentation was replaced with heat fragmentation. For adaptor ligation, T4 Polynucleotide Kinase (T4PNK) was used following heat fragmentation. However, its kinase and phosphatase activities introduced additional sequence biases. To minimize them, we used OptiKinase before T4PNK. Our study further revealed the specific target sequences of RNaseIII and T4PNK.
Our results suggest that the heat fragmentation removed the RNaseIII sequence bias and significantly reduced the pile-ups and gaps. OptiKinase minimized the T4PNK sequence biases and removed most of the remaining pile-ups and gaps, thus maximizing the quality of RNA-seq data.
This article was reviewed by Dr. A. Kolodziejczyk (nominated by Dr. Sarah Teichmann), Dr. Eugene Koonin, and Dr. Christoph Adami. For the full reviews, see the Reviewers' Comments section.
PMCID: PMC3710281  PMID: 23826734
RNaseIII; T4PNK; Sequence bias; Heat fragmentation; OptiKinase; RNA-seq
16.  Comprehensive analysis of the HEPN superfamily: identification of novel roles in intra-genomic conflicts, defense, pathogenesis and RNA processing 
Biology Direct  2013;8:15.
The major role of enzymatic toxins that target nucleic acids in biological conflicts at all levels has become increasingly apparent thanks in large part to the advances of comparative genomics. Typically, toxins evolve rapidly hampering the identification of these proteins by sequence analysis. Here we analyze an unexpectedly widespread superfamily of toxin domains most of which possess RNase activity.
The HEPN superfamily is comprised of all α-helical domains that were first identified as being associated with DNA polymerase β-type nucleotidyltransferases in prokaryotes and animal Sacsin proteins. Using sensitive sequence and structure comparison methods, we vastly extend the HEPN superfamily by identifying numerous novel families and by detecting diverged HEPN domains in several known protein families. The new HEPN families include the RNase LS and LsoA catalytic domains, KEN domains (e.g. RNaseL and Ire1) and the RNase domains of RloC and PrrC. The majority of HEPN domains contain conserved motifs that constitute a metal-independent endoRNase active site. Some HEPN domains lacking this motif probably function as non-catalytic RNA-binding domains, such as in the case of the mannitol repressor MtlR. Our analysis shows that HEPN domains function as toxins that are shared by numerous systems implicated in intra-genomic, inter-genomic and intra-organismal conflicts across the three domains of cellular life. In prokaryotes HEPN domains are essential components of numerous toxin-antitoxin (TA) and abortive infection (Abi) systems and in addition are tightly associated with many restriction-modification (R-M) and CRISPR-Cas systems, and occasionally with other defense systems such as Pgl and Ter. We present evidence of multiple modes of action of HEPN domains in these systems, which include direct attack on viral RNAs (e.g. LsoA and RNase LS) in conjunction with other RNase domains (e.g. a novel RNase H fold domain, NamA), suicidal or dormancy-inducing attack on self RNAs (RM systems and possibly CRISPR-Cas systems), and suicidal attack coupled with direct interaction with phage components (Abi systems). These findings are compatible with the hypothesis on coupling of pathogen-targeting (immunity) and self-directed (programmed cell death and dormancy induction) responses in the evolution of robust antiviral strategies. 
We propose that altruistic cell suicide mediated by HEPN domains and other functionally similar RNases was essential for the evolution of kin and group selection and cell cooperation. HEPN domains were repeatedly acquired by eukaryotes and incorporated into several core functions such as endonucleolytic processing of the 5.8S-25S/28S rRNA precursor (Las1), a novel ER membrane-associated RNA degradation system (C6orf70), and sensing of unprocessed transcripts at the nuclear periphery (Swt1). Multiple lines of evidence suggest that, similar to prokaryotes, HEPN proteins were recruited to antiviral, antitransposon, apoptotic systems or RNA-level response to unfolded proteins (Sacsin and KEN domains) in several groups of eukaryotes.
Extensive sequence and structure comparisons reveal unexpectedly broad presence of the HEPN domain in an enormous variety of defense and stress response systems across the tree of life. In addition, HEPN domains have been recruited to perform essential functions, in particular in eukaryotic rRNA processing. These findings are expected to stimulate experiments that could shed light on diverse cellular processes across the three domains of life.
This article was reviewed by Martijn Huynen, Igor Zhulin and Nick Grishin
PMCID: PMC3710099  PMID: 23768067
17.  Methylation kinetics and CpG-island methylator phenotype status in colorectal cancer cell lines 
Biology Direct  2013;8:14.
Hypermethylation of CpG islands is thought to contribute to carcinogenesis through the inactivation of tumor suppressor genes. Tumor cells with relatively high levels of CpG island methylation are considered CpG island methylator phenotypes (CIMP). The mechanisms that are responsible for regulating the activity of de novo methylation are not well understood.
We quantify and compare de novo methylation kinetics in CIMP and non-CIMP colon cancer cell lines in the context of different loci, following 5-aza-2′-deoxycytidine (5-AZA)-mediated de-methylation of cells. In non-CIMP cells, a relatively fast rate of re-methylation is observed that starts with a certain time delay after cessation of 5-AZA treatment. CIMP cells, on the other hand, start re-methylation without a time delay but at a significantly slower rate. A mathematical model can account for these counter-intuitive results by assuming negative feedback regulation of de novo methylation activity and by further assuming that this regulation is corrupted in CIMP cells. This model further suggests that when methylation levels have grown back to physiological levels, de novo methylation activity ceases in non-CIMP cells, while it continues at a constant low level in CIMP cells.
We propose that the faster rate of re-methylation observed in non-CIMP compared to CIMP cells in our study could be a consequence of feedback-mediated regulation of DNA methyltransferase activity. Testing this hypothesis will involve the search for specific feedback regulatory mechanisms involved in the activation of de novo methylation.
Reviewers’ report
This article was reviewed by Georg Luebeck, Tomasz Lipniacki, and Anna Marciniak-Czochra
PMCID: PMC3691599  PMID: 23758948
Methylation kinetics; Methylator phenotype; Methylation rates; Mathematical modeling
18.  Two novel PIWI families: roles in inter-genomic conflicts in bacteria and Mediator-dependent modulation of transcription in eukaryotes 
Biology Direct  2013;8:13.
The PIWI module, found in the PIWI/AGO superfamily of proteins, is a critical component of several cellular pathways including germline maintenance, chromatin organization, regulation of splicing, RNA interference, and virus suppression. It binds a guide strand which helps it target complementary nucleic acid strands.
Here we report the discovery of two divergent, novel families of PIWI modules, the first such to be described since the initial discovery of the PIWI/AGO superfamily over a decade ago. Both families display conservation patterns consistent with the binding of oligonucleotide guide strands. The first family is bacterial in distribution and is typically encoded by a distinctive three-gene operon alongside genes for a restriction endonuclease fold enzyme and a helicase of the DinG family. The second family is found only in eukaryotes. It is the core conserved module of the Med13 protein, a subunit of the CDK8 subcomplex of the transcription regulatory Mediator complex.
Based on the presence of the DinG family helicase, which specifically acts on R-loops, we infer that the first family of PIWI modules is part of a novel RNA-dependent restriction system which could target invasive DNA from phages, plasmids or conjugative transposons. It is predicted to facilitate restriction of actively transcribed invading DNA by utilizing RNA guides. The PIWI family found in the eukaryotic Med13 proteins throws new light on the regulatory switch through which the CDK8 subcomplex modulates transcription at Mediator-bound promoters of highly transcribed genes. We propose that this involves recognition of small RNAs by the PIWI module in Med13 resulting in a conformational switch that propagates through the Mediator complex.
This article was reviewed by Sandor Pongor, Frank Eisenhaber and Balaji Santhanam.
PMCID: PMC3702460  PMID: 23758928
19.  Exosomes secreted by human cells transport largely mRNA fragments that are enriched in the 3′-untranslated regions 
Biology Direct  2013;8:12.
Small secreted membrane vesicles called exosomes have recently attracted great interest after the discovery that they transfer mRNA that can be translated into protein in recipient cells. Surprisingly, we found that for the majority of exosomal mRNAs only a fraction of their corresponding probes is detectable on the expression microarrays. Exosomal mRNA fragmentation is characterized by a specific structural pattern: the closer to the 3′-end of the transcript the fragments are localized, the larger the fraction of the secreted RNAs they constitute. Since the 3′-ends of transcripts contain elements conferring subcellular localization of mRNA and are rich in miRNA-binding sites, exosomal RNA may act as competing RNA to regulate stability, localization and translation activity of mRNAs in recipient cells.
This article was reviewed by Neil Smalheiser and Sandor Pongor.
PMCID: PMC3732077  PMID: 23758897
Exosomes; Secreted RNA; 3′-UTR; Microarray analysis
20.  Genome-wide analysis reveals downregulation of miR-379/miR-656 cluster in human cancers 
Biology Direct  2013;8:10.
MicroRNAs (miRNAs) are non-uniformly distributed in genomes and ~30% of the miRNAs in the human genome are clustered. In this study we have focused on the imprinted miRNA cluster miR-379/miR-656 on 14q32.31 (hereafter C14) to test their coordinated function. We have analyzed the expression profiles of >1000 human miRNAs in >1400 samples representing seven different human tissue types obtained from cancer patients along with matched and unmatched controls.
We found 68% of the miRNAs in this cluster to be significantly downregulated in glioblastoma multiforme (GBM), 61% downregulated in kidney renal clear cell carcinoma (KIRC), 46% in breast invasive carcinoma (BRCA) and 14% in ovarian serous cystadenocarcinoma (OV). On a genome-wide scale C14 miRNAs accounted for 12-30% of the total downregulated miRNAs in different cancers. Pathway enrichment for the predicted targets of C14 miRNA was significant for cancer pathways, especially Glioma (p < 3.77 × 10⁻⁶, FDR < 0.005). The observed downregulation was confirmed in GBM patients by real-time PCR, where 79% of C14 miRNAs (34/43) showed downregulation. In GBM samples, hypermethylation at the C14 locus (p < 0.003) and downregulation of MEF2, a crucial transcription factor for the cluster, were observed, which likely contribute to the observed downregulation of the entire miRNA cluster.
We provide compelling evidence that the entire C14 miRNA cluster is a tumor suppressor locus involved in multiple cancers, especially in GBM, and point toward a general mechanism of coordinated function for clustered miRNAs.
Reviewed by: Prof. Gregory J Goodall and Dr. Alexander Max Burroughs
PMCID: PMC3680324  PMID: 23618224
MiRNAs; Cluster; GBM; DLK1-DIO3; MEF2; Tumor Suppressor; Cancer
21.  Insights into archaeal evolution and symbiosis from the genomes of a nanoarchaeon and its inferred crenarchaeal host from Obsidian Pool, Yellowstone National Park 
Biology Direct  2013;8:9.
A single cultured marine organism, Nanoarchaeum equitans, represents the Nanoarchaeota branch of symbiotic Archaea, with a highly reduced genome and unusual features such as multiple split genes.
The first terrestrial hyperthermophilic member of the Nanoarchaeota was collected from Obsidian Pool, a thermal feature in Yellowstone National Park, separated by single cell isolation, and sequenced together with its putative host, a Sulfolobales archaeon. Both the new Nanoarchaeota (Nst1) and N. equitans lack most biosynthetic capabilities, and phylogenetic analysis of ribosomal RNA and protein sequences indicates that the two form a deep-branching archaeal lineage. However, the Nst1 genome is more than 20% larger, and encodes a complete gluconeogenesis pathway as well as the full complement of archaeal flagellum proteins. With a larger genome, a smaller repertoire of split protein encoding genes and no split non-contiguous tRNAs, Nst1 appears to have experienced less severe genome reduction than N. equitans. These findings imply that, rather than representing ancestral characters, the extremely compact genomes and multiple split genes of Nanoarchaeota are derived characters associated with their symbiotic or parasitic lifestyle. The inferred host of Nst1 is potentially autotrophic, with a streamlined genome and simplified central and energetic metabolism as compared to other Sulfolobales.
Comparison of the N. equitans and Nst1 genomes suggests that the marine and terrestrial lineages of Nanoarchaeota share a common ancestor that was already a symbiont of another archaeon. The two distinct Nanoarchaeota-host genomic data sets offer novel insights into the evolution of archaeal symbiosis and parasitism, enabling further studies of the cellular and molecular mechanisms of these relationships.
This article was reviewed by Patrick Forterre, Bettina Siebers (nominated by Michael Galperin) and Purification Lopez-Garcia
PMCID: PMC3655853  PMID: 23607440
Archaea evolution; Single cell genomics; Symbiosis; Hyperthermophiles; Split genes
22.  Invariance and optimality in the regulation of an enzyme 
Biology Direct  2013;8:7.
The Michaelis-Menten equation, proposed a century ago, describes the kinetics of enzyme-catalyzed biochemical reactions. Since then, this equation has been used in countless, increasingly complex models of cellular metabolism, often including time-dependent enzyme levels. However, even for a single reaction, there remains a fundamental disconnect between our understanding of the reaction kinetics, and the regulation of that reaction through changes in the abundance of active enzyme.
We revisit the Michaelis-Menten equation under the assumption of a time-dependent enzyme concentration. We show that all temporal enzyme profiles with the same average enzyme level yield identical substrate degradation, a simple analytical conclusion that can be thought of as an invariance principle and which we validate experimentally using a β-galactosidase assay. The ensemble of all time-dependent enzyme trajectories with the same average concentration constitutes a space of functions. We develop a simple model of biological fitness which assigns a cost to each of these trajectories (in the form of a function of functions, i.e. a functional). We then show how one can use variational calculus to analytically infer temporal enzyme profiles that minimize the overall enzyme cost. In particular, by separately treating the static costs of amino acid sequestration and the dynamic costs of protein production, we identify a fundamental cellular tradeoff.
The overall metabolic outcome of a reaction described by Michaelis-Menten kinetics is ultimately determined by the average concentration of the enzyme during a given time interval. This invariance, in analogy to path-independent phenomena in physics, suggests a new way in which variational calculus can be employed to address biological questions. Together, our results point to possible avenues for a unified approach to studying metabolism and its regulation.
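The invariance stated in this abstract can be checked numerically with a minimal sketch. It assumes the standard Michaelis-Menten rate law ds/dt = -kcat·E(t)·s/(Km + s); the abstract does not give the paper's exact formulation. Separating variables gives Km·ln(s0/s) + (s0 - s) = kcat·∫E dt, so the substrate remaining at the end of the interval should depend on E(t) only through its time integral. All parameter values, profile shapes and function names below are hypothetical choices for illustration and are not taken from the paper.

```python
import math

# Illustrative parameters (assumptions, not values from the paper).
KCAT = 1.0    # catalytic rate constant
KM = 0.5      # Michaelis constant
S0 = 1.0      # initial substrate concentration
T_END = 10.0  # length of the time interval
E_MEAN = 0.2  # time-averaged enzyme level shared by every profile


def constant(t):
    """Flat enzyme level."""
    return E_MEAN


def ramp(t):
    """Linear increase from 0 to 2 * E_MEAN; mean over [0, T_END] is E_MEAN."""
    return 2.0 * E_MEAN * t / T_END


def wave(t):
    """One full cosine period around E_MEAN; mean over [0, T_END] is E_MEAN."""
    return E_MEAN * (1.0 + math.cos(2.0 * math.pi * t / T_END))


def final_substrate(enzyme, steps=20000):
    """Integrate ds/dt = -KCAT * E(t) * s / (KM + s) with classical RK4."""
    def rate(t, s):
        return -KCAT * enzyme(t) * s / (KM + s)

    dt = T_END / steps
    s = S0
    for i in range(steps):
        t = i * dt
        k1 = rate(t, s)
        k2 = rate(t + dt / 2, s + dt * k1 / 2)
        k3 = rate(t + dt / 2, s + dt * k2 / 2)
        k4 = rate(t + dt, s + dt * k3)
        s += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return s


def residual(s):
    """Deviation from Km*ln(s0/s) + (s0 - s) = kcat * mean(E) * T."""
    return KM * math.log(S0 / s) + (S0 - s) - KCAT * E_MEAN * T_END


results = {f.__name__: final_substrate(f) for f in (constant, ramp, wave)}
for name, s in results.items():
    print(f"{name:8s} final substrate = {s:.8f}  residual = {residual(s):.2e}")
```

Under these assumptions the three profiles should leave the same amount of substrate up to integration error, and each result should satisfy the closed-form relation. Smooth profiles were chosen so that the fixed-step integrator stays accurate; a pulsed profile with the same mean would need step boundaries aligned to the pulse edges.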
This article was reviewed by Sergei Maslov, William Hlavacek and Daniel Kahn.
PMCID: PMC3665469  PMID: 23522082
23.  Description of plant tRNA-derived RNA fragments (tRFs) associated with argonaute and identification of their putative targets 
Biology Direct  2013;8:6.
tRNA-derived RNA fragments (tRFs) are 19mer small RNAs that associate with Argonaute (AGO) proteins in humans. However, in plants, it is unknown if tRFs bind with AGO proteins. Here, using public deep sequencing libraries of immunoprecipitated Argonaute proteins (AGO-IP) and bioinformatics approaches, we identified the Arabidopsis thaliana AGO-IP tRFs. Moreover, using three degradome deep sequencing libraries, we identified four putative tRF targets. The expression pattern of tRFs, based on deep sequencing data, was also analyzed under abiotic and biotic stresses. The results obtained here represent a useful starting point for future studies on tRFs in plants.
PMCID: PMC3574835  PMID: 23402430
tRNAs; Small RNA; tRFs; tRNA-derived RNA fragments; Argonaute and Arabidopsis
24.  GABBR1 has a HERV-W LTR in its regulatory region – a possible implication for schizophrenia 
Biology Direct  2013;8:5.
Schizophrenia is a complex disease with uncertain aetiology. We suggest that GABBR1 (GABA receptor B1) is implicated in schizophrenia, based on a HERV-W LTR in the regulatory region of GABBR1. Our hypothesis is supported by three observations: (i) GABBR1 is in the 6p22 genomic region most often implicated in schizophrenia; (ii) microarray studies found that only presynaptic pathway-related genes, including GABA receptors, have altered expression in schizophrenic patients; and (iii) it explains how HERV-W elements, expressed in schizophrenia, play a role in the disease: by altering the expression of GABBR1 via a long terminal repeat that is also a regulatory element of GABBR1.
This paper was reviewed by Sandor Pongor and Martijn Huynen.
PMCID: PMC3574838  PMID: 23391219
Schizophrenia; Human endogenous retrovirus; HERV-W; long terminal repeat; LTR; GABA; GABBR1; GABA receptor; Enhancer; Silencer
25.  Surprisingly high number of Twintrons in vertebrates 
Biology Direct  2013;8:4.
Twintrons represent a special intronic arrangement in which introns of two different types occupy the same gene position. Consequently, alternative splicing of these introns requires two different spliceosomes competing for the same RNA molecule. So far, only two twintrons have been described in insects. Surprisingly, we discovered several such arrangements in vertebrate genomes, which are quite conserved throughout the lineages.
This article was reviewed by Fyodor Kondrashov and Eugene Koonin.
PMCID: PMC3564746  PMID: 23356793
Twintrons; Vertebrate genomes; Gene expression
