The number of next-generation sequencing (NGS) projects and related datasets is increasing explosively, and these data must be processed by quality control (QC) procedures before they can be used for omics analysis. A QC procedure typically includes the identification and filtration of sequencing artifacts such as low-quality reads and contaminating reads, which can significantly affect, and sometimes mislead, downstream analysis. Quality control of NGS data from microbial communities is especially challenging. In this work, we evaluated and compared the performance and effects of various QC pipelines on different types of metagenomic NGS data and from different angles, and on this basis proposed general principles for using QC pipelines. Results on both simulated and real metagenomic datasets show that: first, QC-Chain is superior in identifying contamination in metagenomic NGS datasets of different complexities, with high sensitivity and specificity; second, its high-performance computing engine gives QC-Chain a significant reduction in processing time compared with pipelines based on serial computing; and third, QC-Chain outperforms other tools in benefiting downstream metagenomic data analysis.
The industrial yeast Saccharomyces cerevisiae is a traditional ethanologenic agent and a promising biocatalyst for advanced biofuel production from lignocellulosic materials. Here we present the genomic background of type strain NRRL Y-12632 and its transcriptomic response to 5-hydroxymethyl-2-furaldehyde (HMF), a toxic compound commonly liberated during lignocellulosic-biomass pretreatment, to dissect the genomic mechanisms of yeast tolerance. Compared with the genome of the laboratory model strain S288C, we identified more than 32,000 SNPs in Y-12632, including 23,000 missense and nonsense SNPs. Sequence mutations were enriched in genes involved in MAPK- and phosphatidylinositol (PI)-signaling pathways in strain Y-12632, with 41 and 13 genes containing non-synonymous SNPs, respectively. Many of these mutated genes displayed consistently up-regulated expression in response to challenges with 30 mM HMF. Single-gene deletion mutants of these genes showed significantly sensitive growth on a synthetic medium containing 20 mM HMF. Our results suggest that at least three MAPK-signaling pathways, especially the cell-wall integrity pathway, and PI-signaling pathways are involved in mediating tolerance against HMF in industrial S. cerevisiae. Higher levels of sequence variation were also observed for genes involved in purine and pyrimidine metabolism pathways.
Research on microbial communities could potentially impact a vast number of applications in “bio”-related disciplines. Large-scale analyses have become a clear trend in microbial community studies, so it is increasingly important to perform efficient, in-depth data mining for insightful biological principles across large numbers of samples. However, because microbial communities come from different sources and have different structures, comparison and data mining across large numbers of samples are quite difficult. In this work, we propose a data model for representing large-scale comparisons of microbial community samples, the “Multi-Dimensional View” (MDV) model, which includes at least three aspects: the sample profile (S), the taxa profile (T) and the meta-data profile (V). We also propose a method for rapid data analysis based on the MDV model and apply it to case studies with samples from various environmental conditions. The results show that, although sampling environments usually define the key variables, the analysis can detect biomarkers and even subtle variables across large numbers of samples, which might be used to discover novel principles that drive the development of communities. The efficiency and effectiveness of the MDV-based analysis method are validated by these results.
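The three-profile organization of the MDV model (S, T and V) can be illustrated with a small data structure. This is a minimal sketch only; the class and field names are my own, not taken from the publication:

```python
# Minimal sketch of the "Multi-Dimensional View" (MDV) data model: a sample
# profile (S), a taxa profile (T) and a meta-data profile (V). All names and
# example values are illustrative.
from dataclasses import dataclass, field

@dataclass
class MDV:
    samples: list                 # S: sample identifiers
    taxa: list                    # T: taxon identifiers
    abundance: dict               # (sample, taxon) -> relative abundance
    metadata: dict = field(default_factory=dict)  # V: sample -> {variable: value}

    def profile(self, sample):
        """Return the taxa abundance vector for one sample."""
        return [self.abundance.get((sample, t), 0.0) for t in self.taxa]

    def group_by(self, variable):
        """Group samples by a meta-data variable (e.g. sampling environment)."""
        groups = {}
        for s in self.samples:
            key = self.metadata.get(s, {}).get(variable)
            groups.setdefault(key, []).append(s)
        return groups

mdv = MDV(
    samples=["s1", "s2"],
    taxa=["Bacteroides", "Prevotella"],
    abundance={("s1", "Bacteroides"): 0.7, ("s1", "Prevotella"): 0.3,
               ("s2", "Bacteroides"): 0.2, ("s2", "Prevotella"): 0.8},
    metadata={"s1": {"env": "gut"}, "s2": {"env": "oral"}},
)
```

Grouping by a meta-data variable is what allows environment-driven comparisons of large sample collections, as described in the abstract.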
Single-cell phenotyping is critical to the success of biological reductionism. Raman-activated cell sorting (RACS) has shown promise in resolving the dynamics of living cells at the individual level and in uncovering population heterogeneity, in comparison to established approaches such as fluorescence-activated cell sorting (FACS). Given that the number of single cells in any experiment is massive, the power of Raman profiling for single-cell analysis can be fully exploited only when coupled with a high-throughput, intelligent process-control and data-analysis system. In this work, we established QSpec, an automated system that supports high-throughput Raman-based single-cell phenotyping. Additionally, a single-cell Raman profile database was established, upon which data mining can be applied to discover the heterogeneity among single cells under different conditions. To test the effectiveness of this control and data-analysis system, a sub-system was also developed to simulate the phenotypes of single cells as well as the device features.
Single-cell; High-throughput; Raman-activated cell sorting (RACS); Data analysis; Database; Simulation
Although Traditional Chinese Medicine (TCM) preparations have a long history of successful application, scientific and systematic quality assessment of TCM preparations mainly focuses on chemical constituents and is far from comprehensive. There are currently only a few preliminary studies on the assessment of biological ingredients in TCM preparations. Here, we propose a method, M-TCM, for biological assessment of the quality of TCM preparations based on high-throughput sequencing and metagenomic analysis. We tested this method on Liuwei Dihuang Wan (LDW), a TCM preparation whose ingredients are well defined. Our results show that, first, this method can determine the biological ingredients of LDW preparations; second, the quality and stability of LDW vary significantly among manufacturers; and third, the overall quality of LDW samples is significantly affected by their biological contamination. This novel strategy has the potential to achieve comprehensive ingredient profiling of TCM preparations.
The metagenomic method directly sequences and analyses genome information from microbial communities. The main computational tasks for metagenomic analyses include taxonomical and functional structure analysis for all genomes in a microbial community (also referred to as a metagenomic sample). With the advancement of next-generation sequencing (NGS) techniques, the number of metagenomic samples and the data size of each sample are increasing rapidly. Current metagenomic analysis is both data- and computation-intensive, especially when there are many species in a metagenomic sample, each with a large number of sequences. As such, metagenomic analyses require extensive computational power, and the increasing analytical requirements further augment the challenges for computational analysis. In this work, we propose Parallel-META 2.0, a metagenomic analysis software package, to meet this need for efficient and fast analyses of the taxonomical and functional structures of microbial communities. Parallel-META 2.0 is an extended and improved version of Parallel-META 1.0: it enhances taxonomical analysis using multiple databases, improves computational efficiency through optimized parallel computing, and supports interactive visualization of results in multiple views. Furthermore, it enables functional analysis of metagenomic samples, including short-read assembly, gene prediction and functional annotation. Therefore, it can provide accurate taxonomical and functional analyses of metagenomic samples in a high-throughput manner and on a large scale.
The human saliva microbiota is phylogenetically divergent among host individuals, yet its roles in health and disease are poorly appreciated. We employed a microbial functional gene microarray, HuMiChip 1.0, to reconstruct the global functional profiles of the human saliva microbiota from ten healthy and ten caries-active adults. Saliva microbiota in the pilot population featured a vast diversity of functional genes. No significant distinction in gene number or diversity indices was observed between healthy and caries-active microbiota. However, co-presence network analysis of functional genes revealed that caries-active microbiota was more divergent in non-core genes than healthy microbiota, although both groups exhibited a similar degree of conservation in their respective core genes. Furthermore, the functional gene structure of the saliva microbiota could potentially distinguish caries-active patients from healthy hosts. Microbial functions such as diaminopimelate epimerase, prephenate dehydrogenase, pyruvate-formate lyase and N-acetylmuramoyl-L-alanine amidase were significantly linked to caries. Therefore, the saliva microbiota carries disease-associated functional signatures, which could potentially be exploited for caries diagnosis.
Oleaginous microalgae are promising feedstocks for biofuels, yet the genetic diversity, origin and evolution of oleaginous traits remain largely unknown. Here we present a detailed phylogenomic analysis of five oleaginous Nannochloropsis species (a total of six strains) and one time-series transcriptome dataset for triacylglycerol (TAG) synthesis in one representative strain. Despite small genome sizes, high coding potential and relative paucity of mobile elements, the genomes feature small cores of ca. 2,700 protein-coding genes and a large pan-genome of >38,000 genes. The six genomes share key oleaginous traits, such as the enrichment of selected lipid biosynthesis genes and certain glycoside hydrolase genes that potentially shift carbon flux from chrysolaminaran to TAG synthesis. The eleven type II diacylglycerol acyltransferase genes (DGAT-2) in every strain, each expressed during TAG synthesis, likely originated from three ancient genomes, including the secondary endosymbiosis host and the engulfed green and red algae. Horizontal gene transfers were inferred in most lipid synthesis nodes with expanded gene doses and in many glycoside hydrolase genes. Thus, multiple genome pooling and horizontal genetic exchange, together with selective inheritance of lipid synthesis genes and species-specific gene loss, have led to the enormous genetic apparatus for oleaginousness and the wide genomic divergence among present-day Nannochloropsis. These findings have important implications for the screening and genetic engineering of microalgae for biofuels.
Microalgae are promising feedstocks for biofuels. However, the diversity, origin and evolution of oil-producing microalgal genomes in general, and those of their oleaginous traits in particular, remain poorly understood. We present five new genomes of the oleaginous microalgae Nannochloropsis spp. that allow genus-, species- and strain-level genomic comparison. With each Nannochloropsis genome encoding approximately 6,562–9,915 genes, a core genome of ca. 2,700 genes and a large pan-genome of >38,000 genes were found. The genomes share key genetic features such as gene dose expansion of selected nodes in lipid biosynthesis pathways. Evidence of horizontal gene transfers, primarily from bacteria, was found in most of these nodes. However, the eleven type II acyl-CoA:diacylglycerol acyltransferase genes (DGAT-2), the highest gene dose reported among known organisms, likely originated from three ancient genomes of the secondary endosymbiosis host and the engulfed green and red algae; they were strictly vertically inherited in each of the Nannochloropsis spp. Thus, multiple genome pooling and horizontal genetic exchange underlie the enormous genetic makeup for TAG production in present-day Nannochloropsis.
An increasing number of studies involve integrative analysis of gene and protein expression data, taking advantage of new technologies such as next-generation transcriptome sequencing (RNA-Seq) and highly sensitive mass spectrometry (MS) instrumentation. Thus, it becomes interesting to revisit the correlative analysis of gene and protein expression data using more recently generated datasets. Furthermore, within the proteomics community there is substantial interest in comparing the performance of different label-free quantitative proteomic strategies. Gene expression data can be used as an indirect benchmark for such protein-level comparisons. In this work we use publicly available mouse data to perform a joint analysis of genomic and proteomic data obtained on the same organism. First, on the proteomic side, we perform a comparative analysis of different label-free protein quantification methods (intensity-based and spectral count-based, with various associated data normalization steps) using several software tools. Similarly, on the genomic side, we perform a correlative analysis of gene expression data derived using microarray and RNA-Seq methods. We also investigate the correlation between gene and protein expression data, and various factors affecting the accuracy of quantitation at both levels. We observe that spectral count-based protein abundance metrics, which are easy to extract from any published data, are comparable to intensity-based measures with respect to correlation with gene expression data. The results of this work should be useful for designing robust computational pipelines for the extraction and joint analysis of gene and protein expression data in the context of integrative studies.
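One widely used spectral count-based abundance metric of the kind compared in such studies is the normalized spectral abundance factor (NSAF): each protein's spectral count is divided by its length, then normalized by the sum of that ratio over all proteins. Whether NSAF is among the exact metrics evaluated here is not stated above, so treat the sketch below as a generic illustration with made-up protein names and counts:

```python
# NSAF sketch: length-normalized spectral counts, rescaled to sum to 1.
# Protein identifiers, counts and lengths are illustrative toy data.
def nsaf(spectral_counts, lengths):
    """spectral_counts, lengths: protein -> value; returns protein -> NSAF."""
    ratios = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(ratios.values())
    return {p: r / total for p, r in ratios.items()}

# P1: (40/400) / (0.10 + 0.05); P2: (10/200) / 0.15
abundances = nsaf({"P1": 40, "P2": 10}, {"P1": 400, "P2": 200})
```

Length normalization matters because longer proteins yield more peptides and therefore more spectra, which would otherwise inflate their apparent abundance.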
Microalgae are promising feedstocks for the production of lipids, sugars, bioactive compounds and, in particular, biofuels; yet the development of sensitive and reliable phylotyping strategies for microalgae has been hindered by the paucity of phylogenetically closely related finished genomes.
Using the oleaginous eustigmatophyte Nannochloropsis as a model, we assessed current intragenus phylotyping strategies by producing the complete plastid (pt) and mitochondrial (mt) genomes of seven strains from six Nannochloropsis species. Genes on the pt and mt genomes are highly conserved in content, size and order, strongly negatively selected, and evolving at 33% and 66% of the rate of the nuclear genome, respectively. Pt genome diversification was driven by asymmetric evolution of two inverted repeats (IRa and IRb): psbV and clpC in IRb are highly conserved, whereas their counterparts in IRa exhibit three lineage-associated types of structural polymorphism via duplication or disruption of whole or partial genes. In the mt genomes, however, a single evolutionary hotspot varies in the copy number of a 3.5 kb, cox1-harboring repeat. The organelle markers (e.g., cox1, cox2, psbA, rbcL and rrn16_mt) and nuclear markers (e.g., ITS2 and 18S) widely used for phylogenetic analysis yielded divergent phylogenies for the seven strains, largely due to low SNP densities. A new strategy for intragenus phylotyping of microalgae was therefore proposed, which includes (i) twelve sequence markers of higher sensitivity than ITS2 for interspecies phylogenetic analysis, (ii) multi-locus sequence typing based on rps11_mt-nad4, rps3_mt and cox2-rrn16_mt for intraspecies phylogenetic reconstruction, and (iii) several SSR loci for identification of strains within a given species.
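The marker screening described above comes down to ranking candidate loci by how much strain-to-strain variation they capture. A minimal sketch of the SNP-density criterion, using toy aligned sequences rather than real Nannochloropsis markers:

```python
# SNP density sketch: the fraction of alignment columns that are polymorphic
# across strains. Higher density means a more sensitive phylotyping marker.
# The two "markers" below are tiny made-up alignments across three strains.
def snp_density(aligned_seqs):
    """aligned_seqs: equal-length strings; returns polymorphic columns / length."""
    length = len(aligned_seqs[0])
    polymorphic = sum(
        1 for i in range(length)
        if len({s[i] for s in aligned_seqs}) > 1  # >1 distinct base = SNP column
    )
    return polymorphic / length

marker_a = ["ACGTAC", "ACGTTC", "ATGTAC"]  # 2 polymorphic columns of 6
marker_b = ["ACGTAC", "ACGTAC", "ACGTAT"]  # 1 polymorphic column of 6
```

Under this criterion marker_a would be preferred over marker_b, mirroring how the twelve proposed markers outperform low-SNP-density loci such as ITS2.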
This first comprehensive dataset of organelle genomes for a microalgal genus enabled exhaustive assessment and searches of all candidate phylogenetic markers on the organelle genomes. A new strategy for intragenus phylotyping of microalgae was proposed which might be generally applicable to other microalgal genera and should serve as a valuable tool in the expanding algal biotechnology industry.
Nannochloropsis; Plastid phylogenomes; Mitochondrial phylogenomes; Intragenus phylotyping strategy
Next-generation sequencing (NGS) technologies have been widely used in the life sciences. However, several kinds of sequencing artifacts, including low-quality reads and contaminating reads, are quite common in raw sequencing data and compromise downstream analysis. Therefore, quality control (QC) is essential for raw NGS data. Although a few NGS data quality-control tools are publicly available, they have two limitations: first, their processing speed cannot cope with the rapid growth in data volume; second, with respect to removing contaminating reads, none of them can identify contaminating sources de novo, relying instead on prior information about the contaminating species, which is usually not available in advance. Here we report QC-Chain, a fast, accurate and holistic NGS data quality-control method. The tool combines user-friendly components for (1) quality assessment and trimming of raw reads using Parallel-QC, a fast read-processing tool, and (2) identification, quantification and filtration of unknown contamination to obtain high-quality clean reads. It is optimized for parallel computation, so its processing speed is significantly higher than that of other QC methods. Experiments on simulated and real NGS data have shown that reads with low sequencing quality can be identified and filtered, and that possible contaminating sources can be identified and quantified de novo, accurately and quickly. Comparison between raw and processed reads also showed that the results of subsequent analyses (genome assembly, gene prediction, gene annotation, etc.) based on processed reads improved significantly in completeness and accuracy. With regard to processing speed, QC-Chain achieves a 7–8× speed-up through parallel computation compared with traditional methods.
Therefore, QC-Chain is a fast and useful quality-control tool for read-quality processing and de novo contamination filtration of NGS reads, which can significantly facilitate downstream analysis. QC-Chain is publicly available at: http://www.computationalbioenergy.org/qc-chain.html.
Motivation: A review of the available single nucleotide polymorphism (SNP) calling procedures for Illumina high-throughput sequencing (HTS) platform data reveals that most rely mainly on base-calling and mapping qualities as sources of error when calling SNPs. Thus, errors not involved in base-calling or alignment, such as those in genomic sample preparation, are not accounted for.
Results: A novel method of consensus and SNP calling, Genotype Model Selection (GeMS), is given which accounts for the errors that occur during the preparation of the genomic sample. Simulations and real data analyses indicate that GeMS has the best performance balance of sensitivity and positive predictive value among the tested SNP callers.
Availability: The GeMS package can be downloaded from https://sites.google.com/a/bioinformatics.ucr.edu/xinping-cui/home/software or http://computationalbioenergy.org/software.html
Supplementary data are available at Bioinformatics online.
The etiology of dental caries remains elusive because of our limited understanding of complex oral microbiomes. Current methodologies have been limited by insufficient depth and breadth of microbial sampling, a paucity of data for diseased hosts particularly at the population level, inconsistency of sampled sites and an inability to distinguish the underlying microbial factors. By cross-validating 16S rRNA gene amplicon-based and whole-genome-based deep-sequencing technologies, we report the most in-depth, comprehensive and corroborated view to date of the adult saliva microbiomes, in pilot populations of 19 caries-active and 26 healthy human hosts. We found that: first, saliva microbiomes in the human population featured a vast phylogenetic diversity yet a minimal organismal core; second, caries microbiomes were significantly more variable in community structure, whereas the healthy ones were relatively conserved; third, abundance changes of certain taxa, such as overabundance of the genus Prevotella, distinguished caries microbiota from healthy ones, and furthermore, caries-active and normal individuals carried different arrays of Prevotella species; and finally, no ‘caries-specific' operational taxonomic units (OTUs) were detected, yet 147 OTUs were ‘caries-associated', that is, differentially distributed yet present in both healthy and caries-active populations. These findings underscore the necessity of species- and strain-level resolution for caries prognosis, and are consistent with the ecological hypothesis that shifts in community structure, rather than the presence or absence of particular groups of microbes, underlie cariogenesis.
caries; metagenomics; oral-microbiome; Prevotella; saliva
In a typical shotgun proteomics experiment, a significant number of high-quality MS/MS spectra remain “unassigned”. The main focus of this work is to improve our understanding of the various sources of unassigned high-quality spectra. To achieve this, we designed an iterative computational approach for more efficient interrogation of MS/MS data. The method involves multiple stages of database searching with different search parameters, spectral library searching, blind searching for modified peptides, and genomic database searching. The method is applied to a large, publicly available shotgun proteomic dataset.
Tandem mass spectrometry; unassigned spectra; spectral quality assessment; iterative database search; post-translational modification; peptide polymorphisms; novel peptides
NGS (next-generation sequencing)-based metagenomic data analysis is becoming the mainstream approach for the study of microbial communities. Faced with the large amounts of data in metagenomic research, effective data visualization is important for scientists to explore, interpret and manipulate such rich information. Visualization of metagenomic data, especially multi-sample data, is one of the most critical challenges. Differing sample sources, sequencing approaches and heterogeneous data formats make robust and seamless data visualization difficult. Moreover, researchers have different focuses in metagenomic studies: taxonomic or functional, sample-centric or genome-centric, single sample or multiple samples, etc. Current efforts in metagenomic data visualization cannot fulfill all of these needs, and it is extremely hard to organize all of the visualization effects in a systematic manner. An extendable, interactive visualization tool would be the method of choice to fulfill all of these needs. In this paper, we present MetaSee, an extendable toolbox that facilitates interactive visualization of metagenomic samples of interest. The main components of MetaSee include: (I) a core visualization engine composed of different views for the comparison of multiple samples (Global view, Phylogenetic view, Sample view and Taxa view), as well as link-outs for more in-depth analysis; (II) a front-end user interface with real metagenomic models that connects to the core visualization engine; and (III) an open-source portal for the development of MetaSee plug-ins. This integrative visualization tool not only provides visualization capabilities, but also enables researchers to perform in-depth analysis of the metagenomic samples of interest. Moreover, its open-source portal allows the design of plug-ins for MetaSee, facilitating the development of additional visualization effects.
Most microorganisms in nature are uncultured, with unknown functionality. Sequence-based metagenomics alone answers ‘who/what are there?’ but not ‘what are they doing, who is doing it and how?’. Function-based metagenomics reveals gene function but is usually limited by the specificity and sensitivity of screening strategies, especially for identifying clones whose functional gene expression has no distinguishable activity or phenotype. A ‘biosensor-based genetic transducer’ (BGT) technique, which employs a whole-cell biosensor to quantitatively detect expression of inserted genes encoding designated functions, can screen for the functionality of unknown genes from uncultured microorganisms. In this study, BGT was integrated with stable isotope probing (SIP)-enabled metagenomics to form a culture-independent SMB toolbox. The utility of this approach was demonstrated in the discovery of a novel functional gene cluster in naphthalene-contaminated groundwater. Specifically, metagenomic sequencing of the 13C-DNA fraction obtained by SIP indicated that an uncultured Acidovorax sp. was the dominant naphthalene degrader in situ, although three culturable Pseudomonas sp. degraders were also present in the same groundwater. BGT verified the functionality of a new nag2 operon, which co-existed with two other nag and two nah operons for naphthalene biodegradation in the same microbial community. Pyrosequencing analysis showed that the nag2 operon was the key functional operon in naphthalene degradation in situ, sharing homology with the nag operons in Ralstonia sp. U2 and Polaromonas naphthalenivorans CJ2. The SMB toolbox will be useful for providing deep insights into uncultured microorganisms and unravelling their ecological roles in natural environments.
The metagenomic method directly sequences and analyses genome information from microbial communities. A single community usually contains hundreds of genomes from different microbial species, and the main computational tasks for metagenomic data analysis include the taxonomical and functional examination of all genomes in the microbial community. Metagenomic data analysis is both data- and computation-intensive, requiring extensive computational power. Most current metagenomic analysis software was designed to run on a single computer or a single computer cluster, which cannot keep up with the rapidly growing computational requirements of large metagenomic projects. Therefore, advanced computational methods and pipelines have to be developed to meet this need for efficient analyses.
In this paper, we propose Parallel-META, a GPU- and multi-core-CPU-based open-source pipeline for metagenomic data analysis, which enables efficient, parallel analysis of multiple metagenomic datasets and visualization of the results for multiple samples. In Parallel-META, the similarity-based database search is parallelized using GPU computing and multi-core CPU optimization. Experiments have shown that Parallel-META achieves at least a 15-fold speed-up over traditional metagenomic analysis methods, with the same accuracy. Parallel-META is available at http://www.computationalbioenergy.org/parallel-meta.html.
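The speed-up comes from parallelizing the per-read database search. The pattern can be sketched in pure Python: partition the reads, search each chunk concurrently, and merge the results back into input order. This is a thread-based stand-in for illustration only; Parallel-META's actual GPU and multi-core implementation, and its real similarity search, are far more involved than the toy prefix-match scoring used here:

```python
# Sketch of partitioned, concurrent database search. best_hit() is a toy
# stand-in for the real similarity-based search in Parallel-META.
from concurrent.futures import ThreadPoolExecutor

def best_hit(read, database):
    """Toy similarity: reference sharing the most positional matches."""
    def score(ref):
        return sum(1 for a, b in zip(read, ref) if a == b)
    return max(database, key=score)

def parallel_search(reads, database, workers=2):
    # Strided partition: worker w gets reads w, w+workers, w+2*workers, ...
    chunks = [reads[w::workers] for w in range(workers)]
    with ThreadPoolExecutor(workers) as ex:
        results = list(ex.map(lambda c: [best_hit(r, database) for r in c],
                              chunks))
    # Re-interleave chunk results back into the original read order.
    out = [None] * len(reads)
    for w, chunk in enumerate(results):
        for j, hit in enumerate(chunk):
            out[w + j * workers] = hit
    return out
```

With CPU-bound work, real gains require processes or GPU kernels rather than Python threads; the sketch only shows the partition/merge structure.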
Parallel processing of current metagenomic data is very promising: with speed-ups of 15-fold and above, binning is no longer a very time-consuming step. Therefore, deeper analyses of metagenomic data, such as the comparison of different samples, become feasible in the pipeline, and some of these functionalities have been included in the Parallel-META pipeline.
Most mass spectrometry (MS) based proteomic studies depend on searching acquired tandem mass (MS/MS) spectra against databases of known protein sequences. In these experiments, however, a large number of high quality spectra remain unassigned. These spectra may correspond to novel peptides not present in the database, especially those corresponding to novel alternative splice (AS) forms. Recently, fast and comprehensive profiling of mammalian genomes using deep sequencing (i.e. RNA-Seq) has become possible. MS-based proteomics can potentially be used as an aid for protein-level validation of novel AS events observed in RNA-Seq data.
In this work, we have used publicly available mouse tissue proteomic and RNA-Seq datasets and examined the feasibility of using MS data for the identification of novel AS forms by searching MS/MS spectra against translated mRNA sequences derived from RNA-Seq data. A significant correlation was observed between the likelihood of identifying a peptide from MS/MS data and the number of reads in RNA-Seq data for the same gene. Based on in silico experiments, it was also observed that only a fraction of the novel AS forms identified from RNA-Seq had a corresponding junction peptide compatible with MS/MS sequencing. The number of novel peptides actually identified from MS/MS spectra was substantially lower than the number expected from the in silico analysis.
The ability to confirm novel AS forms from MS/MS data in the dataset analyzed was found to be quite limited. This can be explained in part by the low abundance of many novel transcripts, with the abundance of their corresponding protein products falling below the limit of detection by MS.
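Searching MS/MS spectra against RNA-Seq-derived transcripts requires translating each mRNA into candidate peptide sequences. A minimal sketch of the three-frame translation step is below; for brevity it uses a tiny codon table rather than the full standard table a real pipeline would need:

```python
# Sketch of three-frame translation of an mRNA sequence into amino acids.
# CODONS is deliberately incomplete (toy table); unknown codons map to "X".
CODONS = {"ATG": "M", "GCT": "A", "AAA": "K", "TAA": "*", "TGA": "*", "TAG": "*"}

def translate_frames(mrna):
    """Translate an mRNA string in all three forward reading frames."""
    frames = []
    for offset in range(3):
        aa = []
        for i in range(offset, len(mrna) - 2, 3):
            aa.append(CODONS.get(mrna[i:i + 3], "X"))  # X = codon not in table
        frames.append("".join(aa))
    return frames
```

A full search would also translate the reverse complement (six frames total) and cut the translations at stop codons before in silico digestion into peptides.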
The reconstruction of protein complexes from the physical interactome of organisms serves as a building block towards understanding the higher-level organization of the cell. Over the past few years, several independent high-throughput experiments have helped to catalogue an enormous amount of physical protein interaction data from organisms such as yeast. However, these individual datasets show a lack of correlation with each other and also contain a substantial number of false positives (noise). Over the years, several affinity scoring schemes have been devised to improve the quality of these datasets. The challenge now is to detect meaningful as well as novel complexes from protein-protein interaction (PPI) networks derived by combining datasets from multiple sources and making use of these affinity scoring schemes. In tackling this challenge, the Markov Clustering algorithm (MCL) has proved to be a popular and reasonably successful method, mainly due to its scalability, robustness, and ability to work on scored (weighted) networks. However, MCL produces many noisy clusters, which either do not match known complexes or contain additional proteins that reduce the accuracies of correctly predicted complexes.
Inspired by recent experimental observations by Gavin and colleagues on the modular structure of yeast complexes and the distinctive properties of "core" and "attachment" proteins, we develop a core-attachment-based refinement method, coupled to MCL, for the reconstruction of yeast complexes from scored (weighted) PPI networks. We combine physical interactions from two recent "pull-down" experiments to generate an unscored PPI network, and then score this network using available affinity scoring schemes to generate multiple scored PPI networks. The evaluation of our method (called MCL-CAw) on these networks shows that: (i) MCL-CAw derives a larger number of yeast complexes, with better accuracies, than MCL, particularly in the presence of natural noise; (ii) affinity scoring can effectively reduce the impact of noise on MCL-CAw and thereby improve the quality (precision and recall) of its predicted complexes; (iii) MCL-CAw responds well to most available scoring schemes. We discuss several instances where MCL-CAw was successful in deriving meaningful complexes, and where it missed a few proteins or whole complexes due to affinity scoring of the networks. We compare MCL-CAw with several recent complex detection algorithms on unscored and scored networks, and assess the relative performance of the algorithms on these networks. Further, we study the impact of augmenting physical datasets with computationally inferred interactions for complex detection. Finally, we analyse the essentiality of proteins within predicted complexes to understand a possible correlation between protein essentiality and the ability of proteins to form complexes.
We demonstrate that core-attachment based refinement in MCL-CAw improves the predictions of MCL on yeast PPI networks. We show that affinity scoring improves the performance of MCL-CAw.
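The core-attachment idea can be sketched as a post-processing step on one MCL cluster: members with above-average intra-cluster connectivity form the core, and outside proteins that interact with enough of the core are attached. This is a simplified illustration with made-up thresholds, not MCL-CAw's published refinement procedure:

```python
# Simplified core-attachment refinement of a single cluster on a weighted
# PPI network. Thresholds and the attachment rule are illustrative only.
def refine_cluster(cluster, weights, candidates, attach_frac=0.5):
    """cluster: set of proteins from MCL; weights: frozenset({u, v}) -> score;
    candidates: all proteins in the network. Returns (core, attachments)."""
    def w(u, v):
        return weights.get(frozenset({u, v}), 0.0)
    # Core = members whose intra-cluster weighted degree is >= cluster average.
    in_deg = {p: sum(w(p, q) for q in cluster if q != p) for p in cluster}
    avg = sum(in_deg.values()) / len(cluster)
    core = {p for p in cluster if in_deg[p] >= avg}
    # Attachments = outside proteins interacting with >= attach_frac of the core.
    attachments = {
        p for p in candidates
        if p not in cluster
        and sum(1 for c in core if w(p, c) > 0) >= attach_frac * len(core)
    }
    return core, attachments

weights = {frozenset({"A", "B"}): 1.0, frozenset({"A", "C"}): 1.0,
           frozenset({"B", "C"}): 1.0, frozenset({"D", "A"}): 1.0,
           frozenset({"D", "B"}): 1.0, frozenset({"E", "A"}): 0.1}
core, attachments = refine_cluster({"A", "B", "C"}, weights,
                                   {"A", "B", "C", "D", "E"})
```

Here D interacts with two of the three core proteins and is attached, while E, with a single weak link, is not; this illustrates how refinement can trim spurious members while recovering genuinely associated ones.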
In many protein-protein interaction (PPI) networks, densely connected hub proteins are more likely to be essential proteins. This is referred to as the "centrality-lethality rule", which indicates that the topological placement of a protein in a PPI network is connected with its biological essentiality. Although such connections are observed in many PPI networks, the underlying topological properties responsible for them are not yet clearly understood. Suggested explanations include the involvement of essential proteins in the maintenance of overall network connectivity, or their role in essential protein clusters. In this work, we examine the placement of essential proteins and the network topology from a different perspective, by determining the correlation between protein essentiality and reverse nearest neighbor (RNN) topology.
The RNN topology is a weighted directed graph derived from the PPI network, and it is a natural representation of the topological dependencies between proteins within the PPI network. As in the original PPI network, we have observed that essential proteins tend to be hub proteins in the RNN topology. Additionally, essential genes are enriched in clusters containing many hub proteins in the RNN topology (RNN protein clusters). Based on these two properties of essential genes in the RNN topology, we have proposed a new measure: RNN cluster centrality. Results from a variety of PPI networks demonstrate that RNN cluster centrality outperforms other centrality measures with regard to the proportion of selected proteins that are essential. We also investigated the biological importance of RNN clusters.
This study reveals that RNN cluster centrality provides the best correlation between protein essentiality and the placement of proteins in the PPI network. Additionally, merged RNN clusters were found to be topologically important, in that essential proteins are significantly enriched in them, and biologically important, because they play an important role in many Gene Ontology (GO) processes.
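The abstract does not give the formal RNN construction, so as a hedged illustration, here is one common reverse-nearest-neighbor formulation on a weighted PPI network: each protein points to its highest-weight partner, and the RNN set of a protein collects everything that points to it (the paper's exact definition and weighting may differ):

```python
from collections import defaultdict

def reverse_nearest_neighbors(weights):
    """weights: {protein: {partner: interaction_weight}}.
    Each protein points to its highest-weight partner; the RNN set of v
    is every protein whose top partner is v. (Illustrative formulation;
    not necessarily the paper's exact construction.)"""
    rnn = defaultdict(set)
    for p, nbrs in weights.items():
        if nbrs:
            top = max(nbrs, key=nbrs.get)  # nearest neighbour of p
            rnn[top].add(p)                # reverse the pointer
    return dict(rnn)

# hypothetical toy network with made-up weights
ppi = {
    "A": {"B": 0.9, "C": 0.2},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"A": 0.2, "B": 0.8, "D": 0.1},
    "D": {"C": 0.1},
}
rnn = reverse_nearest_neighbors(ppi)
# proteins with many reverse nearest neighbours act as RNN "hubs"
hub = max(rnn, key=lambda v: len(rnn[v]))
```

In this toy example both A and C point to B, so B has the largest RNN set; in the paper's setting such RNN hubs are the proteins enriched for essentiality.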
Peptide identification by tandem mass spectrometry (MS/MS) is one of the most important problems in proteomics. Recent advances in high-throughput MS/MS experiments produce huge amounts of spectra, and the peptide identification process must keep pace. In this paper, we strive to achieve high accuracy and efficiency for peptide identification in the presence of noise by a two-phase filtering strategy. Our algorithm transforms spectra into high-dimensional vectors, and then uses a self-organizing map (SOM) and multi-point range query (MPRQ) as very efficient coarse filters to select a number of candidate peptides from the database. These candidate peptides are subsequently scored and ranked by an accurate tag-based scoring function Sλ. Experiments showed that our approach is both fast and accurate for peptide identification.
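The exact spectrum-to-vector transform is not specified in the abstract; the following is a generic sketch of one plausible choice, fixed-width m/z binning with intensity normalisation, which yields vectors of the kind a SOM-based coarse filter could consume:

```python
import numpy as np

def spectrum_to_vector(peaks, max_mz=2000.0, bin_width=1.0):
    """Bin an MS/MS peak list [(mz, intensity), ...] into a fixed-length,
    unit-norm vector so spectra can be compared or fed to a SOM.
    (Generic binning; the transform used in the paper may differ.)"""
    n_bins = int(max_mz / bin_width)
    v = np.zeros(n_bins)
    for mz, inten in peaks:
        b = int(mz / bin_width)
        if 0 <= b < n_bins:
            v[b] += inten        # accumulate intensity in the m/z bin
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# hypothetical three-peak spectrum
spec = [(175.1, 40.0), (276.2, 100.0), (375.2, 55.0)]
v = spectrum_to_vector(spec)
```

Normalising to unit length makes the Euclidean distances used by a SOM insensitive to absolute intensity scale, which is one reasonable design choice for a coarse filter.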
Splicing event identification is one of the most important issues in the comprehensive analysis of transcription profiles. Recent development of next-generation sequencing technology has generated an extensive profile of alternative splicing. However, while many of these splicing events occur between exons that are relatively close on the genome sequence, reads generated by RNA-Seq are not limited to splicing between close exons but arise from virtually all splicing events. In this work, a novel method, SAW, was proposed for the identification of all splicing events based on short reads from RNA-Seq. It was observed that short reads not in known gene models are actually absent words from known gene sequences. An efficient method was developed to filter and cluster these short reads by fingerprint fragments of splicing events, without aligning the short reads to genome sequences. Additionally, the possible splicing sites were also determined without alignment against genome sequences. A consensus sequence was then generated for each short read cluster and aligned to the genome sequences. Results demonstrated that this method could identify more than 90% of the known splicing events with a very low false discovery rate, as well as accurately identify a number of novel splicing events between distant exons.
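The notion of an "absent word" can be illustrated with a naive substring check: a read present in some known transcript needs no further attention, while a read absent from every known sequence is a candidate junction-spanning read. (SAW itself avoids this brute-force scan by using fingerprint-based filtering; the sequences below are made up.)

```python
def absent_words(reads, transcripts):
    """Return the reads that occur in no known transcript sequence;
    these are candidate junction-spanning reads. Naive O(n*m) substring
    scan for illustration only; SAW uses fingerprint filtering instead."""
    return [r for r in reads if not any(r in t for t in transcripts)]

# hypothetical known gene sequences and short reads
transcripts = ["ACGTACGTTTGACCA", "GGGTTTACACGT"]
reads = ["GTACGT",    # occurs inside the first transcript: not a candidate
         "TTGAGGG"]   # absent from all transcripts: candidate junction read
hits = absent_words(reads, transcripts)
```

Reads flagged this way would then be clustered, collapsed into consensus sequences, and only the consensus aligned to the genome, which is what keeps the method efficient.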
The problem of finding a Shortest Common Supersequence (SCS) of a set of sequences is an important problem with applications in many areas. It is a key problem in biological sequence analysis. The SCS problem is well known to be NP-complete. Many heuristic algorithms have been proposed. Some heuristics work well on a few long sequences (as in sequence comparison applications); others work well on many short sequences (as in oligo-array synthesis). Unfortunately, most do not work well on large SCS instances where there are many, long sequences.
In this paper, we present a Deposition and Reduction (DR) algorithm for solving large SCS instances of biological sequences. There are two processes in our DR algorithm: a deposition process and a reduction process. The deposition process generates a small set of common supersequences; the reduction process shortens these common supersequences by removing some characters while preserving the common supersequence property. Our evaluation on simulated data and real DNA and protein sequences shows that our algorithm consistently produces better results than many well-known heuristic algorithms, especially on large instances.
Our DR algorithm provides a partial answer to the open problem of designing an efficient heuristic algorithm for the SCS problem on many long sequences. Our algorithm has a bounded approximation ratio. The algorithm is efficient in both running time and space, and our evaluation shows that it is practical even for SCS problems on many long sequences.
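As a hedged sketch of the two processes, the following pairs a simple greedy "majority-merge" style deposition with a character-by-character reduction pass; the actual deposition and reduction strategies in the DR algorithm may differ in detail:

```python
def is_supersequence(s, t):
    """True if t is a subsequence of s (the iterator is consumed as we scan)."""
    it = iter(s)
    return all(c in it for c in t)

def deposit(seqs):
    """Deposition sketch: repeatedly emit the symbol that currently heads
    the most remaining sequences, producing one common supersequence."""
    rest = [list(s) for s in seqs if s]
    out = []
    while any(rest):
        heads = [r[0] for r in rest if r]
        c = max(set(heads), key=heads.count)   # most frequent head symbol
        out.append(c)
        for r in rest:
            if r and r[0] == c:
                r.pop(0)                       # this sequence consumed c
    return "".join(out)

def reduce_scs(sup, seqs):
    """Reduction sketch: drop any character whose removal still leaves
    sup a common supersequence of all input sequences."""
    i = 0
    while i < len(sup):
        cand = sup[:i] + sup[i + 1:]
        if all(is_supersequence(cand, s) for s in seqs):
            sup = cand          # character was redundant; keep the shorter string
        else:
            i += 1
    return sup

seqs = ["ACGT", "CGTA", "GTAC"]
sup = reduce_scs(deposit(seqs), seqs)
```

For this toy instance, the reduction pass typically trims the deposited supersequence well below the trivial concatenation length while the supersequence property is preserved throughout.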
The broad applicability of gene expression profiling to genomic analyses has generated huge demand for the mass production of microarrays, and hence for improving the cost-effectiveness of microarray fabrication. We developed a post-processing method for deriving a good synthesis strategy. In this paper, we assess all the known efficient methods, together with our post-processing method, for reducing the number of synthesis cycles needed to manufacture a DNA chip for a given set of oligos. Our experimental results on both simulated datasets and 52 real datasets show that no single method consistently gives the best synthesis strategy, and that post-processing an existing strategy is necessary, as it often reduces the number of synthesis cycles further.
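Under the standard synchronous synthesis model, nucleotides are deposited in a fixed periodic order and each oligo must embed as a subsequence of the deposition sequence; the cycle count for a given oligo set can then be computed as below (a simplified model; the strategies assessed in the paper may reorder depositions rather than use a fixed period):

```python
def cycles_needed(oligos, period="ACGT"):
    """Count synthesis cycles when bases are deposited in a fixed periodic
    order and every oligo must embed as a subsequence of the deposition
    sequence. (Simplified synchronous model for illustration.)"""
    cycles = 0
    pos = [0] * len(oligos)   # next base still to be synthesised, per oligo
    while any(p < len(o) for p, o in zip(pos, oligos)):
        base = period[cycles % len(period)]    # base offered this cycle
        cycles += 1
        pos = [p + 1 if p < len(o) and o[p] == base else p
               for p, o in zip(pos, oligos)]   # oligos needing base advance
    return cycles
```

For example, a single oligo "ACGT" finishes in 4 cycles, while {"ACG", "GAT"} needs 8 because "GAT" must wait through most of two periods; reducing such waits is exactly what a better synthesis strategy buys.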