An increasing number of studies involve integrative analysis of gene and protein expression data, taking advantage of new technologies such as next-generation transcriptome sequencing (RNA-Seq) and highly sensitive mass spectrometry (MS) instrumentation. Thus, it becomes interesting to revisit the correlative analysis of gene and protein expression data using more recently generated datasets. Furthermore, within the proteomics community there is substantial interest in comparing the performance of different label-free quantitative proteomic strategies. Gene expression data can be used as an indirect benchmark for such protein-level comparisons. In this work we use publicly available mouse data to perform a joint analysis of genomic and proteomic data obtained on the same organism. First, on the proteomic side, we perform a comparative analysis of different label-free protein quantification methods (intensity-based and spectral count-based, with various associated data normalization steps) using several software tools. Similarly, on the genomic side, we perform correlative analysis of gene expression data derived using microarray and RNA-Seq methods. We also investigate the correlation between gene and protein expression data, and the various factors affecting the accuracy of quantitation at both levels. We observe that spectral count-based protein abundance metrics, which are easy to extract from any published data, are comparable to intensity-based measures with respect to correlation with gene expression data. The results of this work should be useful for designing robust computational pipelines for the extraction and joint analysis of gene and protein expression data in the context of integrative studies.
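As an illustration of the kind of correlative analysis described above, the sketch below computes a common spectral count-based abundance metric (NSAF, the normalized spectral abundance factor) and its Spearman rank correlation against gene expression values. The numbers are illustrative placeholders, not data from the study, and the tie-free ranking is a simplification.

```python
def nsaf(spectral_counts, lengths):
    """Normalized Spectral Abundance Factor: (SpC/L) / sum over proteins of (SpC/L)."""
    saf = [c / l for c, l in zip(spectral_counts, lengths)]
    total = sum(saf)
    return [s / total for s in saf]

def ranks(values):
    """Rank positions (1-based); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative values: spectral counts, protein lengths, gene expression.
abundances = nsaf([50, 10, 200], [500, 250, 1000])
rho = spearman(abundances, [8.1, 2.3, 20.5])
```

The same `spearman` helper can then be reused to compare an intensity-based abundance vector against the identical expression values, which is the comparison the abstract describes.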
Microalgae are a promising feedstock for the production of lipids, sugars, bioactive compounds and, in particular, biofuels; yet the development of sensitive and reliable phylotyping strategies for microalgae has been hindered by the paucity of phylogenetically closely related finished genomes.
Using the oleaginous eustigmatophyte Nannochloropsis as a model, we assessed current intragenus phylotyping strategies by producing the complete plastid (pt) and mitochondrial (mt) genomes of seven strains from six Nannochloropsis species. Genes on the pt and mt genomes are highly conserved in content, size and order, are under strong negative selection, and evolve at 33% and 66% of the rate of nuclear genomes, respectively. Pt genome diversification was driven by asymmetric evolution of two inverted repeats (IRa and IRb): psbV and clpC in IRb are highly conserved, whereas their counterparts in IRa exhibit three lineage-associated types of structural polymorphism via duplication or disruption of whole or partial genes. In the mt genomes, however, a single evolutionary hotspot varies in the copy number of a 3.5 kb-long, cox1-harboring repeat. The organelle markers (e.g., cox1, cox2, psbA, rbcL and rrn16_mt) and nuclear markers (e.g., ITS2 and 18S) that are widely used for phylogenetic analysis yielded divergent phylogenies for the seven strains, largely due to low SNP density. A new strategy for intragenus phylotyping of microalgae was thus proposed that includes (i) twelve sequence markers of higher sensitivity than ITS2 for interspecies phylogenetic analysis, (ii) multi-locus sequence typing based on rps11_mt-nad4, rps3_mt and cox2-rrn16_mt for intraspecies phylogenetic reconstruction and (iii) several SSR loci for identification of strains within a given species.
This first comprehensive dataset of organelle genomes for a microalgal genus enabled exhaustive assessment and searches of all candidate phylogenetic markers on the organelle genomes. A new strategy for intragenus phylotyping of microalgae was proposed which might be generally applicable to other microalgal genera and should serve as a valuable tool in the expanding algal biotechnology industry.
Nannochloropsis; Plastid phylogenomes; Mitochondrial phylogenomes; Intragenus phylotyping strategy
Next-generation sequencing (NGS) technologies have been widely used in the life sciences. However, several kinds of sequencing artifacts, including low-quality reads and contaminating reads, are quite common in raw sequencing data and compromise downstream analysis. Therefore, quality control (QC) is essential for raw NGS data. Although a few NGS data quality-control tools are publicly available, they have two limitations: first, their processing speed cannot cope with the rapid increase in data volume; second, with respect to removing contaminating reads, none of them can identify contaminating sources de novo, and they rely heavily on prior information about the contaminating species, which is usually not available in advance. Here we report QC-Chain, a fast, accurate and holistic NGS data quality-control method. The tool comprises user-friendly components for (1) quality assessment and trimming of raw reads using Parallel-QC, a fast read-processing tool, and (2) identification, quantification and filtration of unknown contamination to obtain high-quality clean reads. It is optimized for parallel computation, so its processing speed is significantly higher than that of other QC methods. Experiments on simulated and real NGS data showed that reads with low sequencing quality could be identified and filtered, and that possible contaminating sources could be identified and quantified de novo, accurately and quickly. Comparison between raw and processed reads also showed that downstream analyses (genome assembly, gene prediction, gene annotation, etc.) based on processed reads improved significantly in completeness and accuracy. With regard to processing speed, QC-Chain achieves a 7–8-fold speed-up through parallel computation compared to traditional methods.
Therefore, QC-Chain is a fast and useful quality-control tool for read-quality processing and de novo contamination filtration of NGS reads, which can significantly facilitate downstream analysis. QC-Chain is publicly available at: http://www.computationalbioenergy.org/qc-chain.html.
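A minimal illustration of the read-trimming step such a QC tool performs is sketched below. This is a generic 3'-end quality trim over Phred+33 scores plus a whole-read mean-quality filter, not the exact algorithm implemented in Parallel-QC; the threshold values are illustrative.

```python
def phred(qual, offset=33):
    """Decode an ASCII (Phred+33) quality string into integer scores."""
    return [ord(c) - offset for c in qual]

def trim_3prime(seq, qual, threshold=20):
    """Trim bases below `threshold` from the 3' end of a read."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < threshold:
        end -= 1
    return seq[:end], qual[:end]

def mean_quality(qual):
    """Average Phred score of a read; usable as a whole-read filter."""
    scores = phred(qual)
    return sum(scores) / len(scores)

# 'I' encodes Phred 40 (high quality); '#' encodes Phred 2 (low quality).
clean_seq, clean_qual = trim_3prime("ACGTACGT", "IIII####")
```

Real tools typically combine such per-read trimming with adapter removal and, as in QC-Chain, a separate contamination-screening stage.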
Motivation: A review of the available single nucleotide polymorphism (SNP) calling procedures for Illumina high-throughput sequencing (HTS) platform data reveals that most rely mainly on base-calling and mapping qualities as sources of error when calling SNPs. Thus, errors not involved in base-calling or alignment, such as those in genomic sample preparation, are not accounted for.
Results: A novel method of consensus and SNP calling, Genotype Model Selection (GeMS), is given which accounts for the errors that occur during the preparation of the genomic sample. Simulations and real data analyses indicate that GeMS has the best performance balance of sensitivity and positive predictive value among the tested SNP callers.
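The general idea behind likelihood-based genotype calling, of which GeMS is a more sophisticated instance that additionally models sample-preparation errors, can be sketched as follows. The fixed per-base error rate `err` and the restriction to two alleles are illustrative simplifications, not GeMS's actual model.

```python
import math

def genotype_call(bases, ref, alt, err=0.01):
    """Maximum-likelihood genotype call from aligned read bases.
    Each allele emits itself with probability 1 - err and the other
    allele with probability err; a heterozygote averages the two."""
    def p_base(base, gt):
        per_allele = {a: (1 - err) if base == a else err for a in (ref, alt)}
        return sum(per_allele[a] for a in gt) / len(gt)

    genotypes = [(ref, ref), (ref, alt), (alt, alt)]
    loglik = {gt: sum(math.log(p_base(b, gt)) for b in bases)
              for gt in genotypes}
    return max(loglik, key=loglik.get)
```

For a pileup of eight reads split evenly between A and G, the heterozygote likelihood dominates, so the caller reports a SNP; a homozygous-reference pileup reports no variant.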
Availability: The GeMS package can be downloaded from https://sites.google.com/a/bioinformatics.ucr.edu/xinping-cui/home/software or http://computationalbioenergy.org/software.html
Supplementary data are available at Bioinformatics online.
The etiology of dental caries remains elusive because of our limited understanding of complex oral microbiomes. Current methodologies have been limited by insufficient depth and breadth of microbial sampling, the paucity of data for diseased hosts (particularly at the population level), inconsistency of sampled sites and the inability to distinguish the underlying microbial factors. By cross-validating 16S rRNA gene amplicon-based and whole-genome-based deep-sequencing technologies, we report the most in-depth, comprehensive and corroborated view to date of the adult saliva microbiomes, in pilot populations of 19 caries-active and 26 healthy human hosts. We found that: first, saliva microbiomes in the human population are featured by a vast phylogenetic diversity yet a minimal organismal core; second, caries microbiomes are significantly more variable in community structure, whereas the healthy ones are relatively conserved; third, abundance changes of certain taxa, such as overabundance of the genus Prevotella, distinguish caries microbiota from healthy ones, and furthermore, caries-active and healthy individuals carry different arrays of Prevotella species; and finally, no ‘caries-specific' operational taxonomic units (OTUs) were detected, yet 147 OTUs were ‘caries-associated', that is, differentially distributed yet present in both healthy and caries-active populations. These findings underscore the necessity of species- and strain-level resolution for caries prognosis, and are consistent with the ecological hypothesis in which shifts in community structure, rather than the presence or absence of particular groups of microbes, underlie cariogenesis.
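To illustrate how "caries-associated" (differentially distributed) taxa can be identified, the sketch below runs a two-sided permutation test on the difference in mean abundance of one OTU between two host groups. This simple statistic and the placeholder abundance values are illustrative; they are not the specific test used in the study.

```python
import random

def perm_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.
    Returns the fraction of label shufflings whose absolute mean
    difference is at least as large as the observed one."""
    rng = random.Random(seed)
    obs = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if d >= obs:
            hits += 1
    return hits / n_perm

# Illustrative relative abundances of one OTU in caries vs. healthy hosts.
p_value = perm_test([10, 12, 11, 13], [1, 2, 1, 3])
```

In practice such a test would be applied per OTU with multiple-testing correction before declaring an OTU "caries-associated".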
caries; metagenomics; oral-microbiome; Prevotella; saliva
In a typical shotgun proteomics experiment, a significant number of high-quality MS/MS spectra remain “unassigned”. The main focus of this work is to improve our understanding of the various sources of unassigned high-quality spectra. To achieve this, we designed an iterative computational approach for more efficient interrogation of MS/MS data. The method involves multiple stages of database searching with different search parameters, spectral library searching, blind searching for modified peptides, and genomic database searching. The method is applied to a large publicly available shotgun proteomic dataset.
Tandem mass spectrometry; unassigned spectra; spectral quality assessment; iterative database search; post-translational modification; peptide polymorphisms; novel peptides
NGS (next-generation sequencing)-based metagenomic data analysis is becoming mainstream for the study of microbial communities. Faced with the large amounts of data in metagenomic research, effective data visualization is important for scientists to explore, interpret and manipulate such rich information, and the visualization of metagenomic data, especially multi-sample data, is one of the most critical challenges. Differing sample sources, sequencing approaches and heterogeneous data formats make robust and seamless data visualization difficult. Moreover, researchers have different focuses in metagenomic studies: taxonomical or functional, sample-centric or genome-centric, single-sample or multi-sample, etc. Current efforts in metagenomic data visualization cannot fulfill all of these needs, and it is extremely hard to organize all of the desired visualization effects in a systematic manner. An extendable, interactive visualization tool would be the method of choice to fulfill all of these visualization needs. In this paper, we present MetaSee, an extendable toolbox that facilitates the interactive visualization of metagenomic samples of interest. The main components of MetaSee include: (I) a core visualization engine composed of different views for the comparison of multiple samples: a global view, phylogenetic view, sample view and taxa view, as well as link-outs for more in-depth analysis; (II) a front-end user interface with real metagenomic models that connects to the core visualization engine; and (III) an open-source portal for the development of MetaSee plug-ins. This integrative visualization tool not only provides visualization effects, but also enables researchers to perform in-depth analysis of the metagenomic samples of interest. Moreover, its open-source portal allows for the design of plug-ins for MetaSee, which facilitates the development of additional visualization effects.
Most microorganisms in nature are uncultured, with unknown functionality. Sequence-based metagenomics alone answers ‘who/what are there?’ but not ‘what are they doing, who is doing it and how?’. Function-based metagenomics reveals gene function but is usually limited by the specificity and sensitivity of screening strategies, especially for the identification of clones whose functional gene expression has no distinguishable activity or phenotype. A ‘biosensor-based genetic transducer’ (BGT) technique, which employs a whole-cell biosensor to quantitatively detect expression of inserted genes encoding designated functions, is able to screen for the functionality of unknown genes from uncultured microorganisms. In this study, BGT was integrated with stable isotope probing (SIP)-enabled metagenomics to form a culture-independent SMB toolbox. The utility of this approach was demonstrated in the discovery of a novel functional gene cluster in naphthalene-contaminated groundwater. Specifically, metagenomic sequencing of the 13C-DNA fraction obtained by SIP indicated that an uncultured Acidovorax sp. was the dominant key naphthalene degrader in situ, although three culturable Pseudomonas sp. degraders were also present in the same groundwater. BGT verified the functionality of a new nag2 operon, which co-existed with two other nag and two nah operons for naphthalene biodegradation in the same microbial community. Pyrosequencing analysis showed that the nag2 operon was the key functional operon in naphthalene degradation in situ, and shared homology with the nag operons of both Ralstonia sp. U2 and Polaromonas naphthalenivorans CJ2. The SMB toolbox will be useful for providing deep insights into uncultured microorganisms and unravelling their ecological roles in natural environments.
Metagenomics directly sequences and analyzes genome information from microbial communities. A single community usually contains hundreds of genomes from different microbial species, and the main computational tasks in metagenomic data analysis include the examination of the taxonomical and functional components of all genomes in the community. Metagenomic data analysis is both data- and computation-intensive, requiring extensive computational power. Most current metagenomic data analysis software was designed for a single computer or a single computer cluster, which cannot keep pace with the computational requirements of the rapidly increasing number of large metagenomic projects. Therefore, advanced computational methods and pipelines have to be developed for efficient analyses.
In this paper, we propose Parallel-META, a GPU- and multi-core-CPU-based open-source pipeline for metagenomic data analysis, which enables the efficient, parallel analysis of multiple metagenomic datasets and the visualization of results for multiple samples. In Parallel-META, the similarity-based database search is parallelized through GPU computing and multi-core CPU optimization. Experiments have shown that Parallel-META achieves at least a 15-fold speed-up compared to traditional metagenomic data analysis methods, with the same accuracy. Parallel-META is available at http://www.computationalbioenergy.org/parallel-meta.html.
The parallel processing of metagenomic data is very promising: with a speed-up of 15-fold and above, binning is no longer a prohibitively time-consuming step. Deeper analyses of metagenomic data, such as the comparison of different samples, therefore become feasible within the pipeline, and some of these functionalities have been included in the Parallel-META pipeline.
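The data-parallel pattern behind such pipelines can be sketched as follows: reads are classified independently against a reference, so they can be distributed over workers with identical results. The toy k-mer similarity below stands in for the real similarity-based database search, and the thread pool stands in for GPU/multi-core dispatch.

```python
from concurrent.futures import ThreadPoolExecutor

def classify(read, reference_kmers, k=8):
    """Toy similarity score: fraction of the read's k-mers found in a
    reference k-mer set (a stand-in for the real database search)."""
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    if not kmers:
        return 0.0
    return sum(1 for m in kmers if m in reference_kmers) / len(kmers)

def classify_all(reads, reference_kmers, workers=4):
    """Score every read in parallel; results match the serial order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: classify(r, reference_kmers), reads))
```

Because each read is scored independently, the parallel output is deterministic and identical to a serial loop; only the wall-clock time changes with the number of workers.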
Most mass spectrometry (MS) based proteomic studies depend on searching acquired tandem mass (MS/MS) spectra against databases of known protein sequences. In these experiments, however, a large number of high quality spectra remain unassigned. These spectra may correspond to novel peptides not present in the database, especially those corresponding to novel alternative splice (AS) forms. Recently, fast and comprehensive profiling of mammalian genomes using deep sequencing (i.e. RNA-Seq) has become possible. MS-based proteomics can potentially be used as an aid for protein-level validation of novel AS events observed in RNA-Seq data.
In this work, we have used publicly available mouse tissue proteomic and RNA-Seq datasets and have examined the feasibility of using MS data for the identification of novel AS forms by searching MS/MS spectra against translated mRNA sequences derived from RNA-Seq data. A significant correlation between the likelihood of identifying a peptide from MS/MS data and the number of reads in RNA-Seq data for the same gene was observed. Based on in silico experiments, it was also observed that only a fraction of novel AS forms identified from RNA-Seq had the corresponding junction peptide compatible with MS/MS sequencing. The number of novel peptides that were actually identified from MS/MS spectra was substantially lower than the number expected based on in silico analysis.
The ability to confirm novel AS forms from MS/MS data in the dataset analyzed was found to be quite limited. This can be explained in part by low abundance of many novel transcripts, with the abundance of their corresponding protein products falling below the limit of detection by MS.
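The in silico compatibility check mentioned above can be illustrated with a sketch of a tryptic digest: a novel splice junction is only observable by MS/MS if the tryptic peptide spanning it has a mass-spec-friendly length. The cleavage rule (after K/R, not before P) is the standard trypsin convention, but the length window and helper names are illustrative choices, not the paper's exact criteria.

```python
def tryptic_peptides(protein):
    """In silico trypsin digest: cleave after K or R, but not before P.
    Returns (start, end) index pairs of the resulting peptides."""
    peps, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P"):
            peps.append((start, i + 1))
            start = i + 1
    if start < len(protein):
        peps.append((start, len(protein)))
    return peps

def junction_detectable(protein, junction_pos, min_len=7, max_len=30):
    """A junction is MS-detectable only if the tryptic peptide spanning
    it falls in a typical identifiable length range."""
    for s, e in tryptic_peptides(protein):
        if s < junction_pos < e:
            return min_len <= e - s <= max_len
    return False
```

Applying such a filter to predicted junction peptides explains why only a fraction of RNA-Seq-derived novel splice forms can, even in principle, be confirmed at the protein level.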
The reconstruction of protein complexes from the physical interactome of organisms serves as a building block towards understanding the higher-level organization of the cell. Over the past few years, several independent high-throughput experiments have helped to catalogue an enormous amount of physical protein interaction data from organisms such as yeast. However, these individual datasets show a lack of correlation with each other and also contain a substantial number of false positives (noise). Over the years, several affinity scoring schemes have been devised to improve the quality of these datasets. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein-protein interaction (PPI) networks derived by combining datasets from multiple sources and by making use of these affinity scoring schemes. In tackling this challenge, the Markov clustering algorithm (MCL) has proved to be a popular and reasonably successful method, mainly due to its scalability, robustness, and ability to work on scored (weighted) networks. However, MCL produces many noisy clusters, which either do not match known complexes or contain additional proteins that reduce the accuracies of correctly predicted complexes.
Inspired by recent experimental observations by Gavin and colleagues on the modularity structure of yeast complexes and the distinctive properties of "core" and "attachment" proteins, we develop a core-attachment-based refinement method, coupled to MCL, for the reconstruction of yeast complexes from scored (weighted) PPI networks. We combine physical interactions from two recent "pull-down" experiments to generate an unscored PPI network, and then score this network using available affinity scoring schemes to generate multiple scored PPI networks. The evaluation of our method (called MCL-CAw) on these networks shows that: (i) MCL-CAw derives a larger number of yeast complexes, and with better accuracies, than MCL, particularly in the presence of natural noise; (ii) affinity scoring can effectively reduce the impact of noise on MCL-CAw and thereby improve the quality (precision and recall) of its predicted complexes; (iii) MCL-CAw responds well to most available scoring schemes. We discuss several instances where MCL-CAw was successful in deriving meaningful complexes, and where it missed a few proteins or whole complexes due to affinity scoring of the networks. We compare MCL-CAw with several recent complex detection algorithms on unscored and scored networks, and assess the relative performance of the algorithms on these networks. Further, we study the impact of augmenting physical datasets with computationally inferred interactions on complex detection. Finally, we analyse the essentiality of proteins within predicted complexes to understand a possible correlation between protein essentiality and the ability of proteins to form complexes.
We demonstrate that core-attachment based refinement in MCL-CAw improves the predictions of MCL on yeast PPI networks. We show that affinity scoring improves the performance of MCL-CAw.
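The core-attachment idea can be sketched as follows, assuming a scored PPI network represented as edge weights. The simple threshold used here (within-cluster weighted degree at least the cluster average) is an illustrative stand-in for MCL-CAw's actual refinement criterion.

```python
def split_core_attachment(cluster, weights):
    """Split a predicted cluster into 'core' and 'attachment' proteins.
    weights: dict mapping frozenset({u, v}) -> affinity score.
    A protein is 'core' if its within-cluster weighted degree is at
    least the cluster average; otherwise it is an 'attachment'."""
    deg = {p: sum(weights.get(frozenset((p, q)), 0.0)
                  for q in cluster if q != p)
           for p in cluster}
    avg = sum(deg.values()) / len(cluster)
    core = {p for p in cluster if deg[p] >= avg}
    return core, set(cluster) - core
```

A refinement step in this spirit can then keep a densely interconnected core and retain only attachments with sufficient affinity to it, discarding loosely attached proteins that would otherwise dilute the predicted complex.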
In many protein-protein interaction (PPI) networks, densely connected hub proteins are more likely to be essential proteins. This is referred to as the "centrality-lethality rule", which indicates that the topological placement of a protein in a PPI network is connected with its biological essentiality. Although such connections are observed in many PPI networks, the underlying topological properties behind them are not yet clearly understood. Suggested explanations include the involvement of essential proteins in the maintenance of overall network connectivity, or their participation in essential protein clusters. In this work, we examine the placement of essential proteins and the network topology from a different perspective, by determining the correlation between protein essentiality and reverse nearest neighbor (RNN) topology.
The RNN topology is a weighted directed graph derived from a PPI network, and it is a natural representation of the topological dependencies between proteins within the PPI network. As in the original PPI network, we observed that essential proteins tend to be hub proteins in the RNN topology. Additionally, essential genes are enriched in clusters containing many hub proteins in the RNN topology (RNN protein clusters). Based on these two properties of essential genes in the RNN topology, we propose a new measure: RNN cluster centrality. Results from a variety of PPI networks demonstrate that RNN cluster centrality outperforms other centrality measures with regard to the proportion of selected proteins that are essential. We also investigated the biological importance of RNN clusters.
This study reveals that RNN cluster centrality provides the best correlation between protein essentiality and the placement of proteins in a PPI network. Additionally, merged RNN clusters were found to be topologically important, in that essential proteins are significantly enriched in them, and biologically important, in that they play a role in many Gene Ontology (GO) processes.
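A minimal version of the RNN construction can be sketched as follows: each protein points to its highest-scoring neighbor, and a protein's RNN in-degree counts how many proteins chose it. This single-nearest-neighbor simplification is illustrative; the RNN topology described above is a richer weighted directed graph.

```python
def rnn_indegree(weights):
    """weights: dict {node: {neighbor: score}}. Each node points to its
    highest-scoring neighbor; a node's RNN in-degree counts how many
    nodes chose it, so high in-degree marks RNN hub proteins."""
    indeg = {n: 0 for n in weights}
    for n, nbrs in weights.items():
        if nbrs:
            nearest = max(nbrs, key=nbrs.get)
            indeg[nearest] += 1
    return indeg
```

Proteins with high RNN in-degree are the hubs on which many other proteins topologically depend, which is the property the cluster-centrality measure builds on.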
Peptide identification by tandem mass spectrometry (MS/MS) is one of the most important problems in proteomics. Recent advances in high-throughput MS/MS experiments produce huge amounts of spectra, and the peptide identification process should keep pace. In this paper, we strive to achieve high accuracy and efficiency for peptide identification in the presence of noise via a two-phase filtering strategy. Our algorithm transforms spectra into high-dimensional vectors, and then uses a self-organizing map (SOM) and multi-point range query (MPRQ) as efficient coarse filters to select a number of candidate peptides from the database. These candidate peptides are subsequently scored and ranked by an accurate tag-based scoring function Sλ. Experiments showed that our approach is both fast and accurate for peptide identification.
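The first step, turning a spectrum into a high-dimensional vector, is commonly done by binning peak intensities over m/z; a sketch under that assumption follows (the bin size and dimension are illustrative, not the paper's parameters).

```python
def spectrum_to_vector(peaks, max_mz=2000.0, bin_size=1.0):
    """Bin a list of (m/z, intensity) peaks into a fixed-length vector.
    Peaks falling into the same bin have their intensities summed;
    peaks beyond max_mz are dropped."""
    dim = int(max_mz / bin_size)
    vec = [0.0] * dim
    for mz, intensity in peaks:
        idx = int(mz / bin_size)
        if 0 <= idx < dim:
            vec[idx] += intensity
    return vec
```

Once spectra live in a common vector space, coarse filters such as a SOM or range queries can prune the candidate set cheaply before the expensive scoring function is applied.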
Splicing event identification is one of the most important issues in the comprehensive analysis of transcription profiles. Recent developments in next-generation sequencing technology have generated extensive profiles of alternative splicing. However, while many of these splicing events occur between exons that are relatively close on the genome sequence, reads generated by RNA-Seq are not limited to alternative splicing between close exons but arise from virtually all splicing events. In this work, a novel method, SAW, is proposed for the identification of all splicing events based on short reads from RNA-Seq. It was observed that short reads not in known gene models are in fact absent words of the known gene sequences. An efficient method was developed to filter and cluster these short reads by fingerprint fragments of splicing events, without aligning the short reads to genome sequences; the possible splicing sites were also determined without alignment against genome sequences. A consensus sequence was then generated for each short-read cluster and aligned to the genome sequences. Results demonstrated that this method could identify more than 90% of known splicing events with a very low false discovery rate, as well as accurately identify a number of novel splicing events between distant exons.
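The absent-word idea can be sketched with k-mers: a read spanning a novel splice junction contains k-mers that occur in no known transcript, so such reads can be flagged without any alignment. The k value and sequences below are illustrative.

```python
def build_kmer_set(transcripts, k=8):
    """Collect every k-mer occurring in the known transcript sequences."""
    kmers = set()
    for t in transcripts:
        for i in range(len(t) - k + 1):
            kmers.add(t[i:i + k])
    return kmers

def has_absent_word(read, known_kmers, k=8):
    """True if the read contains a k-mer absent from all known
    transcripts, i.e. it is a candidate novel-junction read."""
    return any(read[i:i + k] not in known_kmers
               for i in range(len(read) - k + 1))
```

Reads flagged this way can then be clustered by their shared junction-spanning fragments and turned into consensus sequences, with alignment deferred to those few consensus sequences only.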
The problem of finding a Shortest Common Supersequence (SCS) of a set of sequences is an important problem with applications in many areas, and a key problem in biological sequence analysis. The SCS problem is well known to be NP-complete, and many heuristic algorithms have been proposed. Some heuristics work well on a few long sequences (as in sequence comparison applications); others work well on many short sequences (as in oligo-array synthesis). Unfortunately, most do not work well on large SCS instances with many long sequences.
In this paper, we present a Deposition and Reduction (DR) algorithm for solving large SCS instances of biological sequences. There are two processes in our DR algorithm: a deposition process, which generates a small set of common supersequences, and a reduction process, which shortens these common supersequences by removing characters while preserving the common supersequence property. Our evaluation on simulated data and real DNA and protein sequences shows that our algorithm consistently produces better results than many well-known heuristic algorithms, especially on large instances.
Our DR algorithm provides a partial answer to the open problem of designing an efficient heuristic algorithm for the SCS problem on many long sequences. The algorithm has a bounded approximation ratio and is efficient in both running time and space complexity; our evaluation shows that it is practical even for SCS instances with many long sequences.
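The two phases can be sketched as follows, using the classical majority-merge heuristic for deposition (the DR algorithm's actual deposition process is more elaborate) and a greedy character-removal pass for reduction.

```python
def majority_merge(seqs):
    """Deposition sketch: repeatedly append the most common front
    character of the remaining sequences, advancing every sequence
    that starts with it. Yields a common supersequence."""
    pointers = [0] * len(seqs)
    out = []
    while any(p < len(s) for p, s in zip(pointers, seqs)):
        fronts = {}
        for i, s in enumerate(seqs):
            if pointers[i] < len(s):
                c = s[pointers[i]]
                fronts[c] = fronts.get(c, 0) + 1
        c = max(fronts, key=fronts.get)
        out.append(c)
        for i, s in enumerate(seqs):
            if pointers[i] < len(s) and s[pointers[i]] == c:
                pointers[i] += 1
    return "".join(out)

def is_supersequence(sup, seq):
    """True if seq is a subsequence of sup."""
    it = iter(sup)
    return all(ch in it for ch in seq)

def reduce_supersequence(sup, seqs):
    """Reduction sketch: greedily drop characters whose removal still
    leaves a common supersequence of all input sequences."""
    sup = list(sup)
    i = 0
    while i < len(sup):
        cand = sup[:i] + sup[i + 1:]
        if all(is_supersequence(cand, s) for s in seqs):
            sup = cand
        else:
            i += 1
    return "".join(sup)
```

Deposition keeps the search space small by building only a handful of candidates; reduction then recovers most of the length lost to greedy choices.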
The broad applicability of gene expression profiling to genomic analyses has generated huge demand for the mass production of microarrays, and hence for improving the cost-effectiveness of microarray fabrication. We developed a post-processing method for deriving a good synthesis strategy. In this paper, we assess all known efficient methods, together with our post-processing method, for reducing the number of synthesis cycles needed to manufacture a DNA chip for a given set of oligos. Our experimental results on both simulated datasets and 52 real datasets show that no single method consistently gives the best synthesis strategy, and that post-processing an existing strategy is worthwhile, as it often further reduces the number of synthesis cycles.
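Under the standard periodic synthesis model, nucleotides are deposited in a fixed cyclic order and each oligo grows by one base whenever the deposited nucleotide matches its next required base; the cost of a strategy is the number of deposition cycles until every oligo is complete. The sketch below counts cycles under this common abstraction (it is the baseline model, not the paper's post-processing method).

```python
def synthesis_cycles(oligos, order="ACGT"):
    """Count deposition cycles needed to synthesize all oligos when
    nucleotides are deposited in a fixed periodic order. Each oligo
    advances only when the current cycle's base matches its next base."""
    pointers = [0] * len(oligos)
    cycles = 0
    while any(p < len(o) for p, o in zip(pointers, oligos)):
        base = order[cycles % len(order)]
        for j, o in enumerate(oligos):
            if pointers[j] < len(o) and o[pointers[j]] == base:
                pointers[j] += 1
        cycles += 1
    return cycles
```

Reordering or regrouping the deposition sequence, which is what synthesis-strategy optimization and post-processing change, can only lower this cycle count, which is exactly the quantity the paper's experiments measure.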