In the era of metagenomics and diagnostic sequencing, the importance of high-performance protein comparison methods cannot be overstated. Here we present PSimScan (Protein Similarity Scanner), a flexible open-source protein similarity search tool that provides a significant gain in speed over BLASTP at the price of a controlled loss of sensitivity. The PSimScan algorithm introduces a number of novel performance optimizations that the community can reuse to improve the speed and lower the hardware requirements of bioinformatics software. The optimization starts at lookup table construction; the initial lookup table–based hits are then passed through a pipeline of filtering and aggregation routines of increasing computational complexity. The first step in this pipeline is a novel algorithm that builds and selects ‘similarity zones’ aggregated from neighboring matches on small arrays of adjacent diagonals. PSimScan performs 5 to 100 times faster than the standard NCBI BLASTP, depending on the chosen parameters, and runs on commodity hardware. At the slowest settings its sensitivity and selectivity are comparable to NCBI BLASTP’s; they decrease as speed increases, yet remain at levels reasonable for many tasks. PSimScan is most advantageous when used on large collections of query sequences. Comparing the entire proteome of Streptococcus pneumoniae (2,042 proteins) to the NCBI non-redundant protein database of 16,971,855 records takes 6.5 hours on a moderately powerful PC, while the same task with NCBI BLASTP takes over 66 hours. We describe the innovations in the PSimScan algorithm in considerable detail to encourage bioinformaticians to improve on the tool and to use the innovations in their own software development.
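The diagonal-aggregation idea can be illustrated with a short sketch: exact word hits are grouped by diagonal (target position minus query position), and hits falling on adjacent diagonals within a small window are merged into candidate similarity zones. All names and thresholds below are hypothetical illustrations of the concept, not PSimScan's actual routines.

```python
from collections import defaultdict

def kmer_hits(query, target, k=4):
    """Find exact k-mer matches (seed hits) between query and target.
    Returns a list of (q_pos, t_pos) pairs."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits.append((i, j))
    return hits

def similarity_zones(hits, band=2, gap=8):
    """Aggregate seed hits into 'zones': hits whose diagonals (t_pos - q_pos)
    differ by at most `band` and whose query positions are within `gap` of an
    existing zone are merged into it. Illustrative parameters only."""
    zones = []  # each zone: [min_diag, max_diag, min_q, max_q, n_hits]
    for q, t in sorted(hits):
        d = t - q
        for z in zones:
            if z[0] - band <= d <= z[1] + band and q <= z[3] + gap:
                z[0], z[1] = min(z[0], d), max(z[1], d)
                z[2], z[3] = min(z[2], q), max(z[3], q)
                z[4] += 1
                break
        else:
            zones.append([d, d, q, q, 1])
    return zones
```

Zones with few hits can then be discarded cheaply before any expensive alignment is attempted, which is the point of ordering the pipeline by computational cost.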
By sequencing the genomes of 34 mutation accumulation lines of a mismatch-repair defective strain of Escherichia coli that had undergone a total of 12,750 generations, we identified 1625 spontaneous base-pair substitutions spread across the E. coli genome. These mutations are not distributed at random but, instead, fall into a wave-like spatial pattern that is repeated almost exactly in mirror image in the two separately replicated halves of the bacterial chromosome. The pattern is correlated to genomic features, with mutation densities greatest in regions predicted to have high superhelicity. Superimposed upon this pattern are regional hotspots, some of which are located where replication forks may collide or be blocked. These results suggest that, as they traverse the chromosome, the two replication forks encounter parallel structural features that change the fidelity of DNA replication.
mutation rate; evolution; replication fidelity; chromosome structure; DNA polymerase errors
Trypanosoma brucei is a unicellular flagellated eukaryotic parasite that causes African trypanosomiasis in humans and domestic animals, with devastating health and economic consequences. Recent studies have revealed important roles for the single flagellum of T. brucei; in particular, flagellar motility is required for the viability of the bloodstream form, suggesting that impairment of flagellar function may provide a promising treatment for African sleeping sickness. Knowing the flagellar proteome is crucial for studying the molecular mechanisms of flagellar function. Here we present a novel computational method for identifying flagellar proteins in T. brucei, called the trypanosome flagellar protein predictor (TFPP). TFPP was developed from a list of selected discriminating features derived from protein sequences, and can predict flagellar proteins with ∼92% specificity at ∼84% sensitivity. Applied to the whole T. brucei proteome, TFPP reveals 811 additional flagellar proteins with high confidence, suggesting that the flagellar proteome covers ∼10% of the whole proteome. Comparison of expression profiles of the whole T. brucei proteome at three typical life cycle stages showed that ∼45% of the flagellar proteins changed significantly in expression level between the three stages, indicating life cycle stage-specific regulation of flagellar function in T. brucei. Overall, our study demonstrates that TFPP is highly effective in identifying flagellar proteins and provides opportunities to study the trypanosome flagellar proteome systematically. The web server for TFPP can be freely accessed at http://wukong.tongji.edu.cn/tfpp.
PSD does not usually generate a complete series of y-type ions, particularly at high mass, and this is a limitation for de novo sequencing algorithms. It is demonstrated that b2 and b3 ions can be used to help assign the high-mass xN-2 and xN-3 fragments found in vacuum ultraviolet (VUV) photofragmentation experiments. In addition, vN-type ion fragments with side chain loss from the N-terminal residue often enable confirmation of N-terminal amino acids. Libraries containing several thousand peptides were examined using photodissociation in a MALDI-TOF/TOF instrument, and 1345 photodissociation spectra with a high S/N ratio were interpreted.
De Novo Sequencing; Photodissociation; Mass Spectrometry; MALDI-TOF/TOF
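As background for the fragment-ion series discussed above, the m/z values of singly charged b- and y-type ions follow directly from monoisotopic residue masses. A minimal sketch (standard monoisotopic mass values; the function names are ours, and the exotic vN/xN ions of photodissociation are not covered here):

```python
# Monoisotopic amino acid residue masses (Da)
RESIDUE = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
PROTON = 1.00728   # proton mass
WATER = 18.01056   # H2O

def b_ions(peptide):
    """Singly charged b-ion m/z values (b1 ... b[N-1])."""
    total, out = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE[aa]
        out.append(total + PROTON)
    return out

def y_ions(peptide):
    """Singly charged y-ion m/z values (y1 ... y[N-1])."""
    total, out = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE[aa]
        out.append(total + WATER + PROTON)
    return out
```

Because b- and y-ion series are complementary, a b2 or b3 assignment constrains the mass of the corresponding high-mass C-terminal fragment, which is the logic exploited in the abstract.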
The advent of next-generation sequencing technologies has greatly promoted the field of metagenomics, which studies genetic material recovered directly from an environment. Characterizing the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. The multiple genomes contained in a metagenomic sample can be identified and quantified through homology searches of sequence reads against known sequences catalogued in reference databases. Traditionally, reads with multiple genomic hits are assigned to non-specific or high ranks of the taxonomy tree, compromising accurate estimation of the relative abundances of the genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree, as many existing methods do, we propose a statistical framework that models the candidate genomes to which the sequence reads have hits. After estimating the proportion of reads generated by each genome, reads are assigned to the candidate genomes and the taxonomy tree according to the estimated probabilities, taking into account both sequence alignment scores and estimated genome abundances. The proposed method is comprehensively tested on both simulated datasets and two real datasets; it assigns reads to low taxonomic ranks very accurately. Our statistical approach to taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
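The proportion-estimation step can be sketched as a standard mixture-model EM iteration: each read's alignment likelihoods to its candidate genomes are weighted by the current genome proportions (E-step), and the proportions are re-estimated from the resulting soft assignments (M-step). This is a hypothetical illustration of the general idea behind such frameworks, not TAMER's published implementation.

```python
def em_abundance(read_hits, n_iter=100):
    """Estimate genome proportions from multi-mapped reads with EM.
    read_hits: one dict per read mapping genome -> alignment likelihood
    (assumed non-empty). Returns {genome: estimated proportion}."""
    genomes = {g for hits in read_hits for g in hits}
    pi = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        counts = {g: 0.0 for g in genomes}
        for hits in read_hits:
            # E-step: posterior probability each genome produced this read
            z = {g: pi[g] * s for g, s in hits.items()}
            total = sum(z.values())
            for g, v in z.items():
                counts[g] += v / total
        # M-step: proportions = expected read counts / total reads
        n = len(read_hits)
        pi = {g: c / n for g, c in counts.items()}
    return pi
```

Reads with a single hit anchor the proportions, and ambiguous reads are then split probabilistically rather than pushed up the taxonomy tree.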
Xylan is one of the most abundant biopolymers on Earth. Its degradation in nature is mediated primarily by microbial xylanases. To explore the diversity and distribution patterns of xylanase genes in soils, samples of five soil types with different physicochemical characteristics were analyzed.
Partial xylanase genes of glycoside hydrolase (GH) family 10 were recovered following direct DNA extraction from soil, PCR amplification and cloning. Combined with the results of our previous study, a total of 1,084 gene fragments were obtained, representing 366 OTUs. More than half of the OTUs were novel (identities of <65% with known xylanases) and had no close relatives based on phylogenetic analyses. Xylanase genes from all the soil environments were mainly distributed in Bacteroidetes, Proteobacteria, Acidobacteria, Firmicutes, Actinobacteria, Dictyoglomi and some fungi. Although identical sequences were found in several sites, habitat-specific patterns appeared to be important, and geochemical factors such as pH and oxygen content significantly influenced the compositions of xylan-degrading microbial communities.
These results provide insight into the GH 10 xylanases in various soil environments and reveal that xylan-degrading microbial communities are environment specific with diverse and abundant populations.
Third-generation sequencing technologies produce reads of 1,000 bp or more that may contain rich polymorphism information. However, most currently available sequence analysis tools were developed specifically for short reads. While the traditional Smith-Waterman (SW) algorithm can be used to map long reads, its naive implementation is computationally infeasible. We have developed a new Sequence mapping and Analyzing Program (SAP) that implements a modified version of SW to speed up the alignment process. In benchmarks with simulated and real exon sequencing data, and with real E. coli genome sequence data generated by third-generation sequencing, SAP outperforms currently available tools for mapping short and long reads in both speed and the proportion of captured reads. In addition, it achieves high accuracy in detecting SNPs and InDels in the simulated data. SAP is available at https://github.com/davidsun/SAP.
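One standard way to make SW tractable for long reads is to confine the dynamic program to a diagonal band around a candidate mapping location, reducing O(nm) work to O(n·band). The sketch below shows that general idea under simple linear gap costs; it is not SAP's actual modification, and the parameter values are illustrative.

```python
def banded_sw(a, b, band=10, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score restricted to the band
    |i - j| <= band. Cells outside the band are treated as empty,
    which is safe for local alignment since scores restart at 0."""
    best = 0
    prev = {}  # previous DP row, stored sparsely (band cells only)
    for i in range(1, len(a) + 1):
        cur = {}
        lo, hi = max(1, i - band), min(len(b), i + band)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h = max(0,
                    prev.get(j - 1, 0) + s,   # diagonal: match/mismatch
                    prev.get(j, 0) + gap,     # gap in b
                    cur.get(j - 1, 0) + gap)  # gap in a
            cur[j] = h
            best = max(best, h)
        prev = cur
    return best
```

For a 10 kb read against a 10 kb reference window, a band of a few hundred cells touches roughly 0.1% of the full DP matrix.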
Though genomic-level data are becoming widely available, many of the metazoan species sequenced are laboratory systems whose natural history is not well documented. In contrast, the wide array of species with very well-characterized natural history have, until recently, lacked genomics tools. It is now possible to address significant evolutionary genomics questions by applying high-throughput sequencing to discover the majority of genes for ecologically tractable species, and by subsequently developing microarray platforms from which to investigate gene regulatory networks that function in natural systems. We used GS-FLX Titanium Sequencing (Roche/454-Sequencing) of two normalized libraries of pooled RNA samples to characterize a transcriptome of the dark-eyed junco (Junco hyemalis), a North American sparrow that is a classically studied species in the fields of photoperiodism, speciation, and hormone-mediated behavior.
From a broad pool of RNA sampled from tissues throughout the body of a male and a female junco, we sequenced a total of 434 million nucleotides from 1.17 million reads that were assembled de novo into 31,379 putative transcripts representing 22,765 gene sets covering 35.8 million nucleotides with 12-fold average depth of coverage. Annotation of roughly half of the putative genes was accomplished using sequence similarity, and expression was confirmed for the majority with a preliminary microarray analysis. Of 716 core bilaterian genes, 646 (90%) were recovered within our characterized gene set. Gene Ontology, OrthoDB orthology groups, and KEGG pathway annotation provide further functional information about the sequences, and 25,781 potential SNPs were identified.
The extensive sequence information returned by this effort adds to the growing store of genomic data on diverse species. The extent of coverage and annotation achieved, together with the confirmation of expression, shows that transcriptome sequencing provides useful information for ecological model systems that have historically lacked genomic tools. The junco-specific microarray developed here is enabling investigations of gene expression responses to environmental and hormonal manipulations, extending the historic work on natural history and hormone-mediated phenotypes in this system.
Transcriptome; Aves; pyrosequencing; microarray; Junco; 454 titanium cDNA sequencing; single nucleotide polymorphism.
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci, together with cas (CRISPR–associated) genes, form the CRISPR/Cas adaptive immune system, a primary defense strategy that eubacteria and archaea mobilize against foreign nucleic acids, including phages and conjugative plasmids. Short spacer sequences separated by the repeats are derived from foreign DNA and direct interference against future infections. The availability of hundreds of shotgun metagenomic datasets from the Human Microbiome Project (HMP) enables us to explore the distribution and diversity of known CRISPRs in human-associated microbial communities and to discover new CRISPRs. We propose a targeted assembly strategy to reconstruct CRISPR arrays, which whole-metagenome assemblies fail to identify. For each known CRISPR type (identified from reference genomes), we use its direct repeat consensus sequence to recruit reads from each HMP dataset and then assemble the recruited reads into CRISPR loci; the unique spacer sequences can then be extracted for analysis. We also identified novel CRISPRs or new CRISPR variants in contigs from whole-metagenome assemblies and used targeted assembly to more comprehensively identify these CRISPRs across samples. We observed that the distributions of CRISPRs (including 64 known and 86 novel ones) are largely body-site specific. We provide detailed analysis of several CRISPR loci, including novel CRISPRs. For example, known streptococcal CRISPRs were identified in most oral microbiomes, totaling ∼8,000 unique spacers: samples resampled from the same individual and oral site shared the most spacers; different oral sites from the same individual shared significantly fewer, while different individuals had almost no common spacers, indicating the impact of subtle niche differences on the evolution of CRISPR defenses. We further demonstrate potential applications of CRISPRs to the tracing of rare species and the virus exposure of individuals.
This work indicates the importance of effective identification and characterization of CRISPR loci to the study of the dynamic ecology of microbiomes.
Human bodies are complex ecological systems in which various microbial organisms and viruses interact with each other and with the human host. The Human Microbiome Project (HMP) has resulted in >700 datasets of shotgun metagenomic sequences, from which we can learn about the compositions and functions of human-associated microbial communities. CRISPR/Cas systems are a widespread class of adaptive immune systems in bacteria and archaea, providing acquired immunity against foreign nucleic acids: CRISPR/Cas defense pathways involve integration of viral- or plasmid-derived DNA segments into CRISPR arrays (forming spacers between repeated structural sequences), and expression of short crRNAs from these single repeat-spacer units, to generate interference against future invading foreign genomes. Powered by an effective computational approach (targeted assembly of CRISPR arrays), our analysis of CRISPRs in the HMP datasets provides the first global view of bacterial immunity systems in human-associated microbial communities. The great diversity of CRISPR spacers we observed among different body sites, in different individuals, and in single individuals over time indicates the impact of subtle niche differences on the evolution of CRISPR defenses and highlights the key role of bacteriophages (and plasmids) in shaping human microbial communities.
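The read-recruitment step of targeted assembly can be sketched simply: reads sharing k-mers with a direct-repeat consensus are collected for local assembly of the CRISPR locus. The function below is an illustrative toy (the repeat sequence in the test is synthetic, and real recruitment would allow mismatches and iterate):

```python
def recruit_reads(reads, repeat_consensus, k=10, min_shared=1):
    """Recruit reads that share at least `min_shared` k-mers with a
    CRISPR direct-repeat consensus sequence. Recruited reads can then
    be assembled locally and spacers extracted between repeat copies."""
    repeat_kmers = {repeat_consensus[i:i + k]
                    for i in range(len(repeat_consensus) - k + 1)}
    recruited = []
    for read in reads:
        shared = sum(1 for i in range(len(read) - k + 1)
                     if read[i:i + k] in repeat_kmers)
        if shared >= min_shared:
            recruited.append(read)
    return recruited
```

Because recruitment keys only on the conserved repeat, the highly variable spacers between repeats are recovered without requiring a whole-metagenome assembly to span them.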
We estimated the reproducibility of tandem mass fragmentation spectra for the widely used collision-induced dissociation (CID) instruments. Using the Pearson correlation coefficient as a measure of spectral similarity, we found that the within-experiment reproducibility of fragment ion intensities is very high (about 0.85). However, across different experiments and instrument types/setups, the correlation decreases by more than 15% (to about 0.70). We further investigated the accuracy of current predictors of peptide fragmentation spectra and found that they are more accurate than the ad-hoc models generally used by search engines (e.g. SEQUEST) and, surprisingly, approach the empirical upper limit set by the average across-experiment spectral reproducibility (especially for charge +1 and charge +2 precursor ions). These results provide evidence that, in terms of accuracy of modeling, predicted peptide fragmentation spectra provide a viable alternative to spectral libraries for peptide identification, with a higher coverage of peptides and lower storage requirements. Furthermore, using five data sets of proteome digests by two different proteases, we find that PeptideART (a data-driven machine learning approach) is generally more accurate than MassAnalyzer (an approach based on a kinetic model for peptide fragmentation) in predicting fragmentation spectra, but that both models are significantly more accurate than the ad-hoc models. Availability: PeptideART is freely available at www.informatics.indiana.edu/predrag.
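Computing the Pearson correlation between two fragmentation spectra requires putting them on a common axis first, typically by binning m/z. A minimal sketch (the 1 Da bin width and 2000 Da range are illustrative choices, not the paper's settings):

```python
import math

def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Turn a list of (m/z, intensity) peaks into a fixed-length vector."""
    n = int(max_mz / bin_width)
    v = [0.0] * n
    for mz, inten in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n:
            v[idx] += inten
    return v

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

The same machinery serves both to measure replicate reproducibility (spectrum vs. spectrum) and to score a predicted spectrum against an observed one.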
Though many software packages have been developed to perform label-free quantification of proteins in complex biological samples using peptide intensities generated by LC–MS/MS, two critical issues are generally ignored in this field: (i) peptides have multiple elution patterns across runs in an experiment, and (ii) many peptides cannot be used for protein quantification. To address these two key issues, we have developed a novel alignment method to enable accurate peptide peak retention time determination and multiple filters to eliminate unqualified peptides for protein quantification. Repeatability and linearity have been tested using six very different samples, i.e., standard peptides, kidney tissue lysates, HT29-MTX cell lysates, depleted human serum, human serum albumin-bound proteins, and standard proteins spiked in kidney tissue lysates. At least 90.8% of the proteins (up to 1,390) had CVs ≤ 30% across 10 technical replicates, and at least 93.6% (up to 2,013) had R2 ≥ 0.9500 across 7 concentrations. Identical amounts of standard protein spiked in complex biological samples achieved a CV of 8.6% across eight injections of two groups. Further assessment was made by comparing mass spectrometric results to immunodetection, and consistent results were obtained. The new approach has novel and specific features enabling accurate label-free quantification.
LC–MS/MS; label-free quantification; alignment; retention time; unqualified peptides
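The core of any retention-time alignment is a map from one run's time axis onto another's, fitted on peptides identified in both runs. The least-squares linear fit below is a deliberately simple stand-in for the paper's alignment method (which must also handle peptides with multiple elution patterns); real data usually needs a robust or non-linear fit.

```python
def fit_rt_map(shared):
    """Least-squares linear map from run B retention times to run A.
    shared: list of (rt_in_B, rt_in_A) pairs for peptides identified
    in both runs. Returns (slope, intercept)."""
    n = len(shared)
    sx = sum(x for x, _ in shared)
    sy = sum(y for _, y in shared)
    sxx = sum(x * x for x, _ in shared)
    sxy = sum(x * y for x, y in shared)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def align_rt(rt_b, model):
    """Project a run-B retention time onto run A's time axis."""
    slope, intercept = model
    return slope * rt_b + intercept
```

Once the map is fitted, a peptide peak in run B can be matched to the run-A feature nearest its projected retention time, within an m/z tolerance.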
Quantification of protein expression by means of mass spectrometry (MS) has been introduced in various proteomics studies. In particular, two label-free quantification methods, spectral counting and spectral feature analysis, have been extensively investigated in a wide variety of proteomic studies. The cornerstone of both methods is peptide identification based on a proteomic database search and subsequent estimation of peptide retention time. However, they often suffer from restrictive database searches and inaccurate estimation of liquid chromatography (LC) retention time. Furthermore, conventional peptide identification methods based on database or spectral library search algorithms, such as SEQUEST or SpectraST, have been found to provide neither the best match nor high-scored matches. Lastly, these methods are limited in that target peptides cannot be identified unless they have previously been generated and stored in the database or spectral libraries.
To overcome these limitations, we propose a novel method, Quantification method based on Finding the Identical Spectral set for a Homogenous peptide (Q-FISH), to estimate a peptide's abundance from its tandem mass spectrometry (MS/MS) spectra through direct comparison of experimental spectra. Intuitively, Q-FISH compares all possible pairs of experimental spectra in order to identify both known and novel proteins, significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets.
We applied Q-FISH to nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal liver tissue samples to identify differentially expressed peptides between the normal and disease samples. For a total of 44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,747 clusters. Among these, 5,777 clusters were identified only in the HCC sample, 6,648 only in the normal tissue sample, and 2,323 in both. While it will be interesting to investigate peptide clusters found in only one sample, we further examined the spectral clusters identified in both the HCC and normal samples, since our goal was to identify and quantitatively assess differentially expressed peptides. The next step was to perform a beta-binomial test to isolate differentially expressed peptides between the HCC and normal tissue samples. This test yielded 84 peptides with significantly differential spectral counts between the two samples. We independently identified 50 and 95 peptides by SEQUEST, of which 24 and 56 peptides, respectively, were found to be known biomarkers for human liver cancer. Comparing Q-FISH and SEQUEST results, we found that 22 of the 84 differentially expressed peptides identified by Q-FISH were also identified by SEQUEST. Remarkably, of these 22 peptides, 13 are known markers for human liver cancer and the remaining 9 are associated with other cancers.
We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and simultaneously quantifying the expression levels of identified peptides from mass spectrometry data. Q-FISH analysis of human HCC and normal liver tissue samples identified many protein biomarkers that are highly relevant to HCC. Q-FISH can be a useful tool for both peptide identification and quantification in mass spectrometry data analysis, and may prove more effective than SEQUEST and other standard methods in discovering novel protein biomarkers.
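The differential-count test can be sketched with a simplified binomial version: under the null, a peptide's spectra split between the two samples in proportion to each sample's total spectral count. This is a hedged stand-in for the beta-binomial test used above (which additionally models overdispersion across replicates); all names are ours.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def spectral_count_test(k1, k2, n1, n2):
    """Two-sided binomial test for differential spectral counts.
    k1, k2: counts of one peptide in samples 1 and 2;
    n1, n2: total spectral counts in each sample.
    Returns a p-value: probability of a split at least as extreme
    as observed, given null proportion p = n1 / (n1 + n2)."""
    n = k1 + k2
    p = n1 / (n1 + n2)
    observed = binom_pmf(k1, n, p)
    # two-sided: sum probabilities of outcomes no more likely than observed
    return sum(binom_pmf(k, n, p) for k in range(n + 1)
               if binom_pmf(k, n, p) <= observed + 1e-12)
```

Peptides whose p-values survive multiple-testing correction would then be flagged as differentially expressed.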
Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while maintaining similar sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index the similarity search database. This optimized data structure speeds up the similarity search by another 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5× in 4-thread mode). RAPSearch2 requires up to 2 GB of memory when running in single-thread mode, or up to 3.5 GB in 4-thread mode.
Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/.
Supplementary information: Available at the RAPSearch2 website.
A synthetic approach to model the analytical complexity of biological proteolytic digests has been developed. Combinatorial peptide libraries ranging in length between nine and twelve amino acids that represent typical tryptic digests were designed, synthesized and analyzed. Individual libraries and mixtures thereof were studied by replicate liquid chromatography-ion trap mass spectrometry and compared to a tryptic digest of Deinococcus radiodurans. As in complex proteome analysis, replicate analysis of individual libraries identified additional unique peptides. Fewer novel sequences were revealed with each additional analysis, in a manner similar to that observed for biological data. Our results demonstrate a bimodal distribution of peptides sorting to either very low or very high levels of detection. Upon mixing of libraries at equal abundance, a length-dependent bias in favor of longer sequence identification was observed. Peptide identification as a function of site-specific amino acid content was characterized, with certain amino acids proving to be of considerable importance. This report demonstrates that peptide libraries of defined character can serve as a reference for instrument characterization. Furthermore, they are uniquely suited to delineate the physical properties that influence identification of peptides, which provides a foundation for optimizing the study of samples with less defined heterogeneity.
Peptide detectability is defined as the probability that a peptide is identified in an LC-MS/MS experiment and has been useful in providing solutions to protein inference and label-free quantification. Previously, predictors for peptide detectability trained on standard or complex samples were proposed. Although the models trained on complex samples may benefit from the large training data sets, it is unclear to what extent they are affected by the unequal abundances of identified proteins. To address this challenge and improve detectability prediction, we present a new algorithm for the iterative learning of peptide detectability from complex mixtures. We provide evidence that the new method approximates detectability with useful accuracy and, based on its design, can be used to interpret the outcome of other learning strategies. We studied the properties of peptides from the bacterium Deinococcus radiodurans and found that at standard quantities, its tryptic peptides can be roughly classified as either detectable or undetectable, with a relatively small fraction having medium detectability. We extend the concept of detectability from peptides to proteins and apply the model to predict the behavior of a replicate LC-MS/MS experiment from a single analysis. Finally, our study summarizes a theoretical framework for peptide/protein identification and label-free quantification.
Next-generation sequencing (NGS) is producing enormous corpora of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search, a key step in annotating protein-coding genes in these short reads and identifying their biological functions, faces daunting challenges because of the sheer size of short-read datasets.
We developed RAPSearch, a fast protein similarity search tool that utilizes a reduced amino acid alphabet and a suffix array to detect seeds of flexible length. For the short reads (translated in six frames) that we tested, RAPSearch achieved a ~20–90-fold speedup compared to BLASTX. RAPSearch missed only a small fraction (~1.3–3.2%) of BLASTX similarity hits, and it also discovered additional homologous proteins (~0.3–2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, suffered a significant loss of sensitivity compared to RAPSearch and BLAST.
RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch to metagenomics has also been demonstrated.
short reads; similarity search; suffix array; reduced amino acid alphabet; metagenomics
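The reduced-alphabet seeding idea is simple to sketch: amino acids with similar physicochemical properties are collapsed onto one symbol, so biochemically conservative substitutions no longer break an exact-match seed. The grouping below is a hypothetical illustration (RAPSearch's published alphabet may differ), as are the function names.

```python
# Hypothetical 10-symbol reduced alphabet: residues in the same group
# share the group's first letter after reduction.
GROUPS = ["LVIM", "C", "AG", "ST", "P", "FWY", "EDNQ", "KR", "H", "X"]
REDUCE = {aa: grp[0] for grp in GROUPS for aa in grp}

def reduce_seq(protein):
    """Map a protein sequence onto the reduced alphabet."""
    return "".join(REDUCE.get(aa, "X") for aa in protein)

def shared_seeds(query, subject, k=6):
    """Seed positions in the query whose k-mers match a subject k-mer
    exactly after alphabet reduction."""
    rq, rs = reduce_seq(query), reduce_seq(subject)
    subject_kmers = {rs[i:i + k] for i in range(len(rs) - k + 1)}
    return [i for i in range(len(rq) - k + 1) if rq[i:i + k] in subject_kmers]
```

With 20 letters collapsed to ~10, longer seeds can be used at the same sensitivity, which is what allows flexible-length seeding to stay fast.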
False-positive control and estimation in peptide identification by mass spectrometry is of critical importance for reliable inference at the protein level and for downstream bioinformatics analysis. Approaches based on searching against decoy databases have become popular for their conceptual simplicity and easy implementation. Although various decoy search strategies have been proposed, few studies have investigated their differences in performance. With datasets collected on a mixture of model proteins, we demonstrate that a single search against the target database coupled with its reversed version offers a good balance between performance and simplicity. In particular, both the accuracy of the estimated number of false positives and the sensitivity are at least comparable to those of the other procedures examined in this study. It is also shown that scrambling while preserving the frequency of amino acid words can potentially improve the accuracy of the false-positive estimate, though more studies are needed to determine the optimal scrambling procedure for a specific condition and the variation of the estimate across repeated scramblings.
Decoy databases; False positive; Mass Spectrometry; Peptides; Sensitivity
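The reversed-target decoy strategy evaluated above reduces to a short computation: decoy hits passing a score threshold estimate how many target hits at that threshold are false. A minimal sketch (function names are ours):

```python
def reverse_decoy(protein):
    """Build the reversed-sequence decoy for one database entry."""
    return protein[::-1]

def estimate_fdr(psms, threshold):
    """Estimate the false discovery rate from a combined target-decoy
    search. psms: list of (score, is_decoy) pairs. Decoy matches above
    the threshold approximate the number of false target matches, so
    FDR is estimated as n_decoy / n_target."""
    n_target = sum(1 for s, d in psms if s >= threshold and not d)
    n_decoy = sum(1 for s, d in psms if s >= threshold and d)
    return n_decoy / n_target if n_target else 0.0
```

In practice the threshold is swept until the estimated FDR reaches the desired level (e.g. 1%), and the surviving target identifications are reported.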
Acute light-induced photoreceptor degeneration has been studied in experimental animals as a model for photoreceptor cell loss in human retinal degenerative diseases. Light absorption by rhodopsin in rod photoreceptor outer segments (OS) induces oxidative stress and initiates apoptotic cell death. However, the molecular events that induce oxidative stress and initiate the apoptotic cascade remain poorly understood. To better understand the molecular mechanisms of light-induced photoreceptor cell death, we studied the proteomic changes in OS upon intense light exposure by using a proteolytic 18O labeling method. Of 171 proteins identified, the relative abundance of 98 proteins in light-exposed and unexposed OS was determined. The quantities of 11 proteins were found to differ by more than 2-fold between light-exposed OS and those remaining in darkness. Among the 11 proteins, 8 were phototransduction proteins and 7 of these were altered such that the efficiency of phototransduction would be reduced or quenched during light exposure. In contrast, the amount of OS rhodopsin kinase was reduced by 2-fold after light exposure, suggesting attenuation in the mechanism of quenching phototransduction. Liquid chromatography multiple reaction monitoring (LC-MRM) was performed to confirm this reduction in the quantity of rhodopsin kinase. As revealed by immunofluorescence microscopy, this reduction of rhodopsin kinase is not a result of protein translocation from the outer to the inner segment. Collectively, our findings suggest that the absolute quantity of rhodopsin kinase in rod photoreceptors is reduced upon light stimulation and that this reduction may be a contributing factor to light-induced photoreceptor cell death. This report provides new insights into the proteomic changes in the OS upon intense light exposure and creates a foundation for understanding the mechanisms of light-induced photoreceptor cell death.
photoreceptor; light damage; 18O labeling; mass spectrometry; rhodopsin kinase; phototransduction
The advances of next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan was comparable to Glimmer and MetaGene for complete genomes. But for short reads, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence database.
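One of the signals such a model combines, codon usage, can be illustrated as a summed log-odds score of a candidate region's codons in coding versus background DNA. This toy scorer omits the HMM and the sequencing-error model that FragGeneScan actually integrates; the frequency tables are assumed to be estimated elsewhere from training data.

```python
import math

def codon_logodds(seq, coding_freq, background_freq):
    """Score a putative coding region by the summed log-odds of its
    codons under coding vs background codon-frequency models.
    coding_freq / background_freq: dicts mapping codon -> probability.
    Positive scores favor the coding hypothesis."""
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq and codon in background_freq:
            score += math.log(coding_freq[codon] / background_freq[codon])
    return score
```

In a full HMM, these codon emissions would be combined with transition probabilities that also allow insertion/deletion states modeling sequencing errors, so a frameshifted read can still be decoded as coding.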
Recent studies have shown that RNA structural motifs play essential roles in RNA folding and interaction with other molecules. Computational identification and analysis of RNA structural motifs remain a challenging task. Existing motif identification methods based on 3D structure may not properly compare motifs with high structural variations. Other structural motif identification methods consider only nested canonical base-pairing structures and cannot be used to identify complex RNA structural motifs that often consist of various non-canonical base pairs arising from uncommon hydrogen bond interactions. In this article, we present a novel RNA structural alignment method for RNA structural motif identification, RNAMotifScan, which takes into consideration the isosteric (both canonical and non-canonical) base pairs and multi-pairings in RNA structural motifs. The utility and accuracy of RNAMotifScan are demonstrated by searching for kink-turn, C-loop, sarcin-ricin, reverse kink-turn and E-loop motifs against a 23S rRNA (PDB id: 1S72), which is well characterized for the occurrences of these motifs. Finally, we search for these motifs against the RNA structures in the entire Protein Data Bank, and their abundances are estimated. RNAMotifScan is freely available at our supplementary website (http://genome.ucf.edu/RNAMotifScan).
The protein inference problem represents a major challenge in shotgun proteomics. In this article, we describe a novel Bayesian approach to address this challenge by incorporating the predicted peptide detectabilities as the prior probabilities of peptide identification. We propose a rigorous probabilistic model for protein inference and provide practical algorithmic solutions to this problem. We used a complex synthetic protein mixture to test our method and obtained promising results.
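The role of detectability as a prior can be illustrated with a simplified sketch. This is not the paper's exact model: the flat false-identification rate and the noisy-OR combination over peptides are assumptions made for illustration.

```python
# Illustrative sketch only (not the paper's model): predicted peptide
# detectability serves as the prior, the search-engine identification
# probability as evidence, and a protein's probability is a noisy-OR
# over its peptides.
def peptide_posterior(detectability, id_probability, false_id=0.05):
    """Posterior that a peptide is truly present, with detectability as
    the prior and an assumed flat false-identification rate."""
    num = id_probability * detectability
    return num / (num + false_id * (1.0 - detectability))

def protein_probability(peptides):
    """Noisy-OR: the protein is present if at least one peptide is."""
    p_absent = 1.0
    for detect, ident in peptides:
        p_absent *= 1.0 - peptide_posterior(detect, ident)
    return 1.0 - p_absent

# One highly detectable, confidently identified peptide plus one poorly
# detectable peptide with a weak identification.
print(round(protein_probability([(0.9, 0.95), (0.2, 0.3)]), 3))
```

The point the prior makes: a weak identification of a peptide that was *expected* to be hard to detect is discounted less harshly than the raw score alone would suggest, while a missing but highly detectable peptide is evidence against the protein in fuller models.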
algorithms; alignment; combinatorial proteomics; computational molecular biology; databases; mass spectrometry; proteins; sequence analysis
Long terminal repeat (LTR) retroelements represent a successful group of transposable elements (TEs) that have played an important role in shaping the structure of many eukaryotic genomes. Here, we present a genome-wide analysis of LTR retroelements in Daphnia pulex, a cyclical parthenogen and the first crustacean for which the whole genomic sequence is available. In addition, we analyze transcriptional data and perform transposon display assays of lab-reared lineages and natural isolates to identify potential influences on TE mobility and differences in LTR retroelement loads among individuals reproducing with and without sex.
We conducted a comprehensive de novo search for LTR retroelements and identified 333 intact LTR retroelements representing 142 families in the D. pulex genome. While nearly half of the identified LTR retroelements belong to the gypsy group, we also found copia (95), BEL/Pao (66) and DIRS (19) retroelements. Phylogenetic analysis of reverse transcriptase sequences showed that LTR retroelements in the D. pulex genome form many lineages distinct from known families, suggesting that the majority are novel. Our investigation of transcriptional activity of LTR retroelements, using tiling array data obtained from three different experimental conditions, found that 71 LTR retroelements are actively transcribed. Transposon display assays of mutation-accumulation lines showed evidence for putative somatic insertions in two DIRS retroelement families. Losses of presumably heterozygous insertions were observed in lineages in which selfing occurred, but never in asexuals, highlighting the potential impact of reproductive mode on TE abundance and distribution over time. The same two families were also assayed across natural isolates (both cyclical parthenogens and obligate asexuals), and for one of the two families, retroelement loads were higher in populations capable of reproducing sexually.
Given the importance of LTR retroelement activity in the evolution of other genomes, this comprehensive survey provides insight into the potential impact of LTR retroelements on the genome of D. pulex, a cyclically parthenogenetic microcrustacean that has served as an ecological model for over a century.
Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. Current analyses of metagenomics data largely rely on computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads can be searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The computational framework consists of three steps. Each read from a metagenomics project is first annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e., the ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low-coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, ORFome assembly significantly increased the sensitivity of homology searching, and may potentially improve the diversity analysis of metagenomic data. This improvement is especially useful for metagenomic projects in which genome assembly fails because of low sequence coverage.
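The first step of the pipeline, calling putative ORFs on individual reads, can be sketched as a six-frame translation that keeps stop-free stretches above a length cutoff. This is a simplifying illustration, not MetaORFA's actual ORF caller: the codon table below is truncated for brevity, and stretches containing untranslatable codons are simply discarded.

```python
# Minimal sketch of putative-ORF calling on a single read (assumed, simplified
# logic; not MetaORFA's implementation). The genetic-code table is truncated.
CODON = {"ATG": "M", "AAA": "K", "GCT": "A", "GGC": "G", "TTT": "F",
         "TAA": "*", "TAG": "*", "TGA": "*"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def putative_orfs(read, min_len=3):
    """Return stop-free, fully translated peptide stretches from all six frames."""
    peptides = []
    for strand in (read, revcomp(read)):
        for frame in range(3):
            for stretch in translate(strand[frame:]).split("*"):
                if len(stretch) >= min_len and "X" not in stretch:
                    peptides.append(stretch)
    return peptides

print(putative_orfs("ATGAAAGCTTAAGGCTTT"))  # → ['MKA']
```

The subsequent steps, assembling these per-read peptides into longer ORFome contigs and searching them against protein databases, are where the sensitivity gain over per-read homology search comes from.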
Metagenomics; ORFome; ORFome assembly; Function annotation