Effective use of rapid, inexpensive whole-genome sequencing for microbes requires fast, memory-efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole-genome data. kSNP v2 offers numerous improvements over kSNP v1, including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; packaged executables for Linux and Mac OS X for easy installation and use; and a detailed User Guide. SNP discovery is based on k-mer analysis and requires neither multiple sequence alignment nor the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, on core SNPs only, or on SNPs present in a user-specified intermediate fraction of targets. The resulting SNP-based trees are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein-coding and other information (UTRs, etc.) from the GenBank file(s). We demonstrate the application of kSNP v2 to sets of viral and bacterial genomes, and discuss in detail the analysis of a set of 68 finished E. coli and Shigella genomes, as well as the same set augmented with 47 assemblies and four "raw read" genomes of O104:H4 strains from the recent European E. coli outbreak that caused both bloody diarrhea and hemolytic uremic syndrome (HUS), and at least 50 deaths.
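The alignment-free, k-mer-based SNP discovery described above can be sketched in a few lines. This is a toy illustration of the idea only, not the actual kSNP implementation: for an odd k, k-mers from different genomes that share identical flanking bases but differ at the central base define a candidate SNP.

```python
from collections import defaultdict

def kmer_snps(genomes, k=7):
    """Toy k-mer SNP discovery: group k-mers by their flanking bases and
    call a SNP when genomes disagree only at the central base."""
    assert k % 2 == 1
    mid = k // 2
    # flanks -> genome name -> set of central bases observed
    table = defaultdict(lambda: defaultdict(set))
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            flanks = kmer[:mid] + kmer[mid + 1:]
            table[flanks][name].add(kmer[mid])
    snps = []
    for flanks, per_genome in table.items():
        # require each genome to show a single, unambiguous central base
        if all(len(bases) == 1 for bases in per_genome.values()):
            alleles = {next(iter(b)) for b in per_genome.values()}
            if len(alleles) > 1:
                snps.append((flanks, {g: next(iter(b))
                                      for g, b in per_genome.items()}))
    return snps

genomes = {"A": "ACGTACGTTT", "B": "ACGAACGTTT"}  # differ at one position
print(kmer_snps(genomes, k=7))
```

No reference genome or multiple alignment is involved; every genome contributes k-mers symmetrically, which is what lets the method scale to raw reads as well as finished genomes.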
A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, and assembling a complete genome from them efficiently requires large amounts of memory. With ongoing improvements in DNA sequencing technologies, the memory footprint required for assembly is expected to grow dramatically and to become a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption, and execution time. Our experiments show that, with suitable assembly methods, draft assemblies of reasonable quality can be generated on conventional general-purpose computers with very limited available memory. Our study reveals the minimum memory requirements of different assembly programs, even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly and reduce the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and comment on suitable computational resources for assembly.
We present a new classification method for expression profiling data, called MIDClass (Microarray Interval Discriminant CLASSifier), based on association rules. It classifies expression profiles by exploiting the idea that transcript expression intervals better discriminate subtypes within the same class. A wide experimental analysis shows the effectiveness of MIDClass compared to the most prominent classification approaches.
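The interval-discrimination idea can be illustrated with a deliberately simplified sketch. This is not the published MIDClass algorithm (which mines association rules); it only shows how per-gene expression intervals, learned per class, can separate subtypes:

```python
def fit_intervals(X, y):
    """For each class and gene, record the expression interval [min, max]
    observed over the training samples of that class."""
    classes = sorted(set(y))
    n_genes = len(X[0])
    return {c: [(min(x[g] for x, yy in zip(X, y) if yy == c),
                 max(x[g] for x, yy in zip(X, y) if yy == c))
                for g in range(n_genes)] for c in classes}

def predict(model, sample):
    """Vote: a gene votes for a class when its expression falls inside that
    class's training interval; the class with the most votes wins."""
    scores = {c: sum(lo <= v <= hi for v, (lo, hi) in zip(sample, ivals))
              for c, ivals in model.items()}
    return max(scores, key=scores.get)

X = [[1.0, 5.0], [1.2, 5.5], [3.0, 2.0], [3.3, 1.8]]
y = ["subtypeA", "subtypeA", "subtypeB", "subtypeB"]
model = fit_intervals(X, y)
print(predict(model, [1.1, 5.2]))
```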
The wide application of next-generation sequencing (NGS) technologies in metagenomics has raised many computational challenges. One essential problem in metagenomics is estimating the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes and then profiling species abundance based on the number of mapped reads. This procedure, however, is not as trivial as it first appears. A shotgun metagenomic dataset often contains DNA sequences from many closely related microbial species (e.g., within the same genus) or strains (e.g., within the same species), so it is often difficult to determine which species or strain a specific read was sampled from when it maps at high similarity to a region shared by multiple genomes. Furthermore, high genomic variation is observed among individual genomes within the same species, and such intra-species variation is difficult to differentiate from inter-species variation during read mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to abundance estimation of closely related species within the same genus by mapping reads to their genomes represented as a de Bruijn graph, in which the common genomic regions among them are collapsed.
Using simulated and real metagenomic datasets, we show that the de Bruijn graph approach has several advantages over existing methods: (1) it avoids redundant mapping of shotgun reads to multiple copies of common regions in different genomes, and (2) it yields more accurate quantification of closely related species (and even of strains within the same species).
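The benefit of collapsing shared regions can be sketched with k-mers standing in for de Bruijn graph nodes. In this minimal illustration (names and scoring are illustrative, not the paper's implementation), k-mers present in several genomes are "collapsed" and give no evidence; only genome-unique k-mers attribute a read to a genome, so reads are never redundantly counted against multiple copies of a common region:

```python
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def quantify(genomes, reads, k=5):
    """Attribute reads to genomes using only genome-unique k-mers."""
    owner = {}
    for name, seq in genomes.items():
        for km in set(kmers(seq, k)):
            if km not in owner:
                owner[km] = name
            elif owner[km] != name:
                owner[km] = None          # shared ("collapsed") k-mer
    counts = Counter()
    for read in reads:
        votes = {owner[km] for km in kmers(read, k) if owner.get(km)}
        if len(votes) == 1:               # resolved by unique k-mers only
            counts[votes.pop()] += 1
    return counts
```

A read lying entirely inside a region common to both genomes produces no votes and remains unassigned rather than being double-counted.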
closely-related genomes; de Bruijn graph; metagenomics; quantification
We report analysis of N-glycans derived from disease-free individuals and patients with Barrett's esophagus, high-grade dysplasia, and esophageal adenocarcinoma by microchip electrophoresis with laser-induced fluorescence detection. Serum samples in 10-μL aliquots are enzymatically treated to cleave the N-glycans that are subsequently reacted with 8-aminopyrene-1,3,6-trisulfonic acid to add charge and a fluorescent label. Separations at 1250 V/cm and over 22 cm yielded efficiencies up to 700,000 plates for the N-glycans and analysis times under 100 s. Principal component analysis (PCA) and analysis of variance (ANOVA) tests of the peak areas and migration times are used to evaluate N-glycan profiles from native and desialylated samples and to determine differences among the four sample groups. With microchip electrophoresis, we are able to distinguish the three patient groups from each other and from disease-free individuals.
microfluidics; microchip electrophoresis; N-glycans; glycan profiling; disease-state monitoring; Barrett's esophagus; high-grade dysplasia; esophageal adenocarcinoma
Despite sharing much of their genomes, males and females are often highly dimorphic, reflecting at least in part the resolution of sexual conflict in response to sexually antagonistic selection. Sexual dimorphism arises from sex differences in gene expression, and steroid hormones are often invoked as a proximate cause of sexual dimorphism. Experimental elevation of androgens can modify behavior, physiology, and gene expression, but knowledge of the role of hormones remains incomplete, including how the sexes differ in gene expression in response to hormones. We addressed these questions in a bird species with a long history of behavioral, endocrinological, and ecological study, the dark-eyed junco (Junco hyemalis), using a custom microarray. Focusing on two brain regions involved in sexually dimorphic behavior and regulation of hormone secretion, we identified 651 genes that differed in expression by sex in the medial amygdala and 611 in the hypothalamus. Additionally, we treated individuals of each sex with testosterone implants and identified many genes whose expression changed in response. Some of these genes relate to previously identified phenotypic effects of testosterone treatment and suggest that the multiple effects of testosterone may be mediated by modifying the expression of a small number of genes. Notably, testosterone treatment tended to alter the expression of different genes in each sex: only 4 of the 527 genes identified as significant in one sex or the other were significantly differentially expressed in both sexes. Hormonally regulated gene expression is a key mechanism underlying sexual dimorphism, and our study identifies specific genes that may mediate some of these processes.
Next-generation sequencing platforms have greatly reduced sequencing costs, leading to the production of unprecedented amounts of sequence data. BWA is one of the most popular alignment tools due to its relatively high accuracy. However, mapping reads with BWA is still the most time-consuming step in sequence analysis. Increasing mapping efficiency would allow the community to better cope with ever-expanding volumes of sequence data.
We designed a new program, CGAP-align, that achieves a performance improvement over BWA without sacrificing recall or precision. This is accomplished through the use of Suffix Tarray, a novel data structure combining elements of the Suffix Array and the Suffix Tree. We also utilize a tighter lower-bound estimate of the number of mismatches in a read, allowing for more effective pruning during inexact mapping. Evaluation on both simulated and real data suggests that CGAP-align consistently outperforms the current version of BWA and can achieve over twice its speed under certain conditions, while obtaining nearly identical results.
CGAP-align is a new time-efficient read alignment tool that extends and improves BWA. The increased alignment speed will assist sequence-based research and medicine. CGAP-align is freely available to the academic community at http://sourceforge.net/p/cgap-align under the GNU General Public License (GPL).
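The mismatch lower bound that drives the pruning can be sketched in the style of BWA's classic D-array. This is an illustration of the general technique, not CGAP-align's tighter bound: walking along the read, whenever the current window no longer occurs anywhere in the reference, at least one mismatch is unavoidable, so the bound increments and the window restarts. During inexact search, any branch whose accumulated mismatches plus the remaining bound exceeds the allowance can be pruned.

```python
def mismatch_lower_bound(read, ref_substrings):
    """D[i]: minimum number of mismatches any alignment of read[:i+1]
    must contain, given the set of substrings of the reference."""
    D, z, j = [], 0, 0
    for i in range(len(read)):
        if read[j:i + 1] not in ref_substrings:
            z += 1
            j = i + 1   # restart the window after the forced mismatch
        D.append(z)
    return D

ref = "ACGTACGT"
subs = {ref[a:b] for a in range(len(ref)) for b in range(a + 1, len(ref) + 1)}
print(mismatch_lower_bound("ACGTTCGT", subs))
```

The tighter the bound, the earlier hopeless branches of the search tree are abandoned, which is the source of the speedup claimed above.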
Shotgun metagenomics has been applied to studies of the functionality of various microbial communities. As a critical analysis step in these studies, biological pathways are reconstructed based on the genes predicted from metagenomic shotgun sequences. Pathway reconstruction provides insights into the functionality of a microbial community and can be used to compare multiple microbial communities. The utility of pathway reconstruction, however, can be compromised by imperfect functional annotation of genes and by ambiguity in the assignment of predicted enzymes to biochemical reactions (e.g., some enzymes are involved in multiple biochemical reactions). Considering that metabolic functions in a microbial community are carried out by many enzymes in a collaborative manner, we present a probabilistic sampling approach to profiling functional content in a metagenomic dataset, by sampling functions of catalytically promiscuous enzymes within the context of the entire metabolic network defined by the annotated metagenome. We test our approach on metagenomic datasets from environmental and human-associated microbial communities. The results show that our approach provides a more accurate representation of the metabolic activities encoded in a metagenome, and thus improves the comparative analysis of multiple microbial communities. In addition, our approach reports likelihood scores of putative reactions, which can be used to identify important reactions and metabolic pathways that reflect the environmental adaptation of the microbial communities. Source code for sampling metabolic networks is available online at http://omics.informatics.indiana.edu/mg/MetaNetSam/.
We present a probabilistic sampling approach to profiling metabolic reactions in a microbial community from metagenomic shotgun reads, in an attempt to understand the metabolism within a microbial community and to compare it across multiple communities. Unlike conventional pathway reconstruction approaches that aim at a definitive set of reactions, our method estimates how likely each annotated reaction is to occur in the metabolism of the microbial community, given the shotgun sequencing data. This probabilistic measure improves our prediction of the actual metabolism in the microbial communities and can be used in the comparative functional analysis of metagenomic data.
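The sampling idea can be conveyed with a toy Monte Carlo sketch. The weighting scheme here is illustrative, not the published model: each promiscuous enzyme samples one of its candidate reactions, preferring reactions that share metabolites with reactions already chosen for other enzymes (the network context), and the empirical selection frequencies act as likelihood scores.

```python
import random
from collections import Counter

def sample_reactions(enzymes, reaction_mets, n_iter=2000, seed=0):
    """enzymes: enzyme -> list of candidate reaction ids.
    reaction_mets: reaction id -> set of metabolites it touches.
    Returns per-reaction selection frequencies over n_iter samples."""
    rng = random.Random(seed)
    freq = Counter()
    for _ in range(n_iter):
        chosen = {}
        for enz, cands in enzymes.items():
            # weight each candidate by 1 + number of already-chosen
            # reactions sharing a metabolite with it
            weights = [1 + sum(bool(reaction_mets[r] & reaction_mets[c])
                               for c in chosen.values()) for r in cands]
            chosen[enz] = rng.choices(cands, weights=weights)[0]
        freq.update(chosen.values())
    return {r: n / n_iter for r, n in freq.items()}
```

Reactions well supported by the surrounding network accumulate high scores, while isolated candidate annotations of a promiscuous enzyme are sampled rarely.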
In the era of metagenomic and diagnostic sequencing, the importance of high-performance protein comparison methods cannot be overstated. Here we present PSimScan (Protein Similarity Scanner), a flexible open-source protein similarity search tool that provides a significant gain in speed over BLASTP at the price of a controlled loss in sensitivity. The PSimScan algorithm introduces a number of novel performance optimization methods that the community can use to improve the speed and lower the hardware requirements of bioinformatics software. The optimization starts at lookup table construction; the initial lookup table-based hits are then passed through a pipeline of filtering and aggregation routines of increasing computational complexity. The first step in this pipeline is a novel algorithm that builds and selects ‘similarity zones’ aggregated from neighboring matches on small arrays of adjacent diagonals. PSimScan performs 5 to 100 times faster than the standard NCBI BLASTP, depending on the chosen parameters, and runs on commodity hardware. Its sensitivity and selectivity at the slowest settings are comparable to NCBI BLASTP’s; they decrease as speed increases, yet remain reasonable for many tasks. PSimScan is most advantageous when used on large collections of query sequences. Comparing the entire proteome of Streptococcus pneumoniae (2,042 proteins) to the NCBI non-redundant protein database of 16,971,855 records takes 6.5 hours on a moderately powerful PC, while the same task with NCBI BLASTP takes over 66 hours. We describe the innovations in the PSimScan algorithm in considerable detail to encourage bioinformaticians to improve on the tool and to use these innovations in their own software development.
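The first pipeline stage, aggregating word hits on small bands of adjacent diagonals into zones, can be sketched as follows. This is an illustration of the idea, not PSimScan's actual code; the `band` and `gap` thresholds are invented parameters:

```python
def similarity_zones(hits, band=2, gap=4):
    """Group (query_pos, subject_pos) word hits into zones: consecutive
    hits stay in one zone while their diagonals (subject_pos - query_pos)
    differ by at most `band` and their query positions by at most `gap`."""
    zones, cur = [], []
    for q, s in sorted(hits):
        d = s - q
        if cur and (abs(d - (cur[-1][1] - cur[-1][0])) > band
                    or q - cur[-1][0] > gap):
            zones.append(cur)
            cur = []
        cur.append((q, s))
    if cur:
        zones.append(cur)
    return zones

hits = [(0, 0), (3, 4), (5, 5), (20, 40)]
print(similarity_zones(hits))
```

Only zones with enough aggregated support would be passed on to the more expensive alignment stages, which is how the pipeline keeps its early steps cheap.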
By sequencing the genomes of 34 mutation accumulation lines of a mismatch-repair defective strain of Escherichia coli that had undergone a total of 12,750 generations, we identified 1625 spontaneous base-pair substitutions spread across the E. coli genome. These mutations are not distributed at random but, instead, fall into a wave-like spatial pattern that is repeated almost exactly in mirror image in the two separately replicated halves of the bacterial chromosome. The pattern is correlated to genomic features, with mutation densities greatest in regions predicted to have high superhelicity. Superimposed upon this pattern are regional hotspots, some of which are located where replication forks may collide or be blocked. These results suggest that, as they traverse the chromosome, the two replication forks encounter parallel structural features that change the fidelity of DNA replication.
mutation rate; evolution; replication fidelity; chromosome structure; DNA polymerase errors
Trypanosoma brucei is a unicellular flagellated eukaryotic parasite that causes African trypanosomiasis in humans and domestic animals, with devastating health and economic consequences. Recent studies have revealed important roles for the single flagellum of T. brucei; in particular, flagellar motility is required for the viability of the bloodstream form, suggesting that impairment of flagellar function may provide a promising cure for African sleeping sickness. Knowing the flagellar proteome is crucial for studying the molecular mechanisms of flagellar function. Here we present a novel computational method for identifying flagellar proteins in T. brucei, called the trypanosome flagellar protein predictor (TFPP). TFPP was developed based on a list of selected discriminating features derived from protein sequences, and predicts flagellar proteins with ∼92% specificity at ∼84% sensitivity. Applied to the whole T. brucei proteome, TFPP identifies 811 additional flagellar proteins with high confidence, suggesting that the flagellar proteome covers ∼10% of the whole proteome. Comparison of expression profiles of the whole T. brucei proteome at three typical life cycle stages showed that ∼45% of the flagellar proteins changed significantly in expression level between the three stages, indicating life cycle stage-specific regulation of flagellar function in T. brucei. Overall, our study demonstrates that TFPP is highly effective in identifying flagellar proteins and provides opportunities to study the trypanosome flagellar proteome systematically. The TFPP web server is freely accessible at http://wukong.tongji.edu.cn/tfpp.
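One simple family of sequence-derived features that classifiers of this kind are commonly trained on is amino-acid composition; the published TFPP feature list is more elaborate, so this is only an example of the feature-extraction step:

```python
def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in a protein
    sequence; returns a 20-entry feature dictionary."""
    aas = "ACDEFGHIKLMNPQRSTVWY"
    n = len(seq)
    return {a: seq.count(a) / n for a in aas}
```

Feature vectors like this one (possibly alongside physicochemical and motif features) are what a supervised classifier would consume to separate flagellar from non-flagellar proteins.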
PSD does not usually generate a complete series of y-type ions, particularly at high mass, and this is a limitation for de novo sequencing algorithms. We demonstrate that b2 and b3 ions can be used to help assign the high-mass xN-2 and xN-3 fragments found in vacuum ultraviolet (VUV) photofragmentation experiments. In addition, vN-type fragment ions with side-chain loss from the N-terminal residue often enable confirmation of N-terminal amino acids. Libraries containing several thousand peptides were examined using photodissociation in a MALDI-TOF/TOF instrument, and 1345 photodissociation spectra with high S/N ratios were interpreted.
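The arithmetic connecting a b-ion to its complementary high-mass x-ion follows from standard fragment-mass relations (singly protonated, monoisotopic): b_i + y_(N-i) = [M+H]+ + proton, and an x-ion exceeds the corresponding y-ion by CO minus H2. The sketch below uses standard monoisotopic constants; verify against a mass-spectrometry reference before relying on it:

```python
PROTON = 1.00728   # monoisotopic proton mass, Da
CO = 27.99491      # carbon monoxide
H2 = 2.01565       # two hydrogens

def x_from_b(precursor_mh, b_i):
    """Predict the m/z of the x-ion complementary to a b-ion:
    y_(N-i) = [M+H]+ + proton - b_i, and x = y + CO - H2.
    A confident b2/b3 assignment thus pins down x_(N-2)/x_(N-3)."""
    y = precursor_mh + PROTON - b_i
    return y + CO - H2
```

This is the sense in which low-mass b2/b3 ions "help assign" the high-mass x fragments: the x-ion m/z is fully determined once the precursor and the b-ion are known.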
De Novo Sequencing; Photodissociation; Mass Spectrometry; MALDI-TOF/TOF
The advent of next-generation sequencing technologies has greatly promoted the field of metagenomics, which studies genetic material recovered directly from an environment. Characterizing the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. The multiple genomes contained in a metagenomic sample can be identified and quantified through homology searches of sequence reads against known sequences catalogued in reference databases. Traditionally, reads with multiple genomic hits are assigned to non-specific or high ranks of the taxonomy tree, impairing accurate estimates of the relative abundance of the genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree, as many existing methods do, we propose a statistical framework that models the candidate genomes to which the sequence reads have hits. After estimating the proportion of reads generated by each genome, we assign sequence reads to the candidate genomes and the taxonomy tree according to the estimated probabilities, taking into account both sequence alignment scores and estimated genome abundances. The proposed method is comprehensively tested on both simulated datasets and two real datasets, and assigns reads to low taxonomic ranks very accurately. Our statistical approach to taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
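The statistical idea, estimating genome proportions jointly rather than assigning multi-hit reads one by one, is naturally expressed as a mixture model fit by EM. The sketch below is a generic illustration of that idea (TAMER's exact model, implemented in R, may differ): `read_probs[r][g]` is an alignment-based probability of read r given genome g, and EM alternates read responsibilities with genome proportions.

```python
def em_abundance(read_probs, n_iter=100):
    """Estimate genome mixing proportions pi from per-read alignment
    probabilities via expectation-maximization."""
    genomes = list(read_probs[0])
    pi = {g: 1 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        counts = {g: 0.0 for g in genomes}
        for probs in read_probs:
            # E-step: responsibility of each genome for this read
            z = {g: pi[g] * probs.get(g, 0.0) for g in genomes}
            tot = sum(z.values())
            for g in genomes:
                counts[g] += z[g] / tot
        # M-step: proportions are normalized expected counts
        pi = {g: counts[g] / len(read_probs) for g in genomes}
    return pi
```

Ambiguous reads are then fractionally shared according to the fitted proportions instead of being pushed up to a non-specific taxonomic rank.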
Xylan is one of the most abundant biopolymers on Earth. Its degradation in nature is mediated primarily by microbial xylanases. To explore the diversity and distribution patterns of xylanase genes in soils, samples of five soil types with different physicochemical characteristics were analyzed.
Partial xylanase genes of glycoside hydrolase (GH) family 10 were recovered following direct DNA extraction from soil, PCR amplification, and cloning. Combined with our previous study, a total of 1084 gene fragments were obtained, representing 366 OTUs. More than half of the OTUs were novel (<65% identity to known xylanases) and had no close relatives based on phylogenetic analyses. Xylanase genes from all the soil environments were mainly distributed in Bacteroidetes, Proteobacteria, Acidobacteria, Firmicutes, Actinobacteria, Dictyoglomi, and some fungi. Although identical sequences were found at several sites, habitat-specific patterns appeared to be important, and geochemical factors such as pH and oxygen content significantly influenced the composition of xylan-degrading microbial communities.
These results provide insight into the GH 10 xylanases in various soil environments and reveal that xylan-degrading microbial communities are environment specific with diverse and abundant populations.
Third-generation sequencing technologies produce sequence reads of 1000 bp or more that may carry rich polymorphism information. However, most currently available sequence analysis tools were developed specifically for short sequence reads. While the traditional Smith-Waterman (SW) algorithm can be used to map long sequence reads, its naive implementation is computationally infeasible. We have developed a new Sequence mapping and Analyzing Program (SAP) that implements a modified version of SW to speed up the alignment process. In benchmarks with simulated and real exon sequencing data and real E. coli genome sequence data generated by third-generation sequencing technologies, SAP outperforms currently available tools for mapping short and long sequence reads in both speed and proportion of captured reads. In addition, it achieves high accuracy in detecting SNPs and InDels in the simulated data. SAP is available at https://github.com/davidsun/SAP.
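One common way to accelerate Smith-Waterman for long reads is to restrict the dynamic program to a band around the main diagonal; SAP's actual modification may differ, so the following is only a sketch of the general technique. Cells farther than `band` from the diagonal are skipped, reducing the work from O(nm) to O(n·band):

```python
def banded_sw(a, b, band=3, match=2, mismatch=-1, gap=-2):
    """Banded local-alignment score between sequences a and b.
    The full matrix is kept for clarity; a real implementation
    would store only the band."""
    best = 0
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        lo, hi = max(1, i - band), min(len(b), i + band)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The banding assumption, that true alignments of a read to its locus stay near one diagonal, holds well for reads without large structural differences, which is why it trades so little accuracy for its speed.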
Though genomic-level data are becoming widely available, many of the metazoan species sequenced are laboratory systems whose natural history is not well documented. In contrast, the wide array of species with very well-characterized natural history have, until recently, lacked genomics tools. It is now possible to address significant evolutionary genomics questions by applying high-throughput sequencing to discover the majority of genes for ecologically tractable species, and by subsequently developing microarray platforms from which to investigate gene regulatory networks that function in natural systems. We used GS-FLX Titanium Sequencing (Roche/454-Sequencing) of two normalized libraries of pooled RNA samples to characterize a transcriptome of the dark-eyed junco (Junco hyemalis), a North American sparrow that is a classically studied species in the fields of photoperiodism, speciation, and hormone-mediated behavior.
From a broad pool of RNA sampled from tissues throughout the body of a male and a female junco, we sequenced a total of 434 million nucleotides from 1.17 million reads that were assembled de novo into 31,379 putative transcripts representing 22,765 gene sets covering 35.8 million nucleotides with 12-fold average depth of coverage. Annotation of roughly half of the putative genes was accomplished using sequence similarity, and expression was confirmed for the majority with a preliminary microarray analysis. Of 716 core bilaterian genes, 646 (90 %) were recovered within our characterized gene set. Gene Ontology, orthoDB orthology groups, and KEGG Pathway annotation provide further functional information about the sequences, and 25,781 potential SNPs were identified.
The extensive sequence information returned by this effort adds to the growing store of genomic data on diverse species. The extent of coverage and annotation achieved, together with the confirmation of expression, shows that transcriptome sequencing provides useful information for ecological model systems that have historically lacked genomic tools. The junco-specific microarray developed here is enabling investigations of gene expression responses to environmental and hormonal manipulations, extending the historic work on natural history and hormone-mediated phenotypes in this system.
Transcriptome; Aves; pyrosequencing; microarray; Junco; 454 titanium cDNA sequencing; single nucleotide polymorphism.
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci, together with cas (CRISPR-associated) genes, form the CRISPR/Cas adaptive immune system, a primary defense strategy that eubacteria and archaea mobilize against foreign nucleic acids, including phages and conjugative plasmids. Short spacer sequences, separated by the repeats, are derived from foreign DNA and direct interference against future infections. The availability of hundreds of shotgun metagenomic datasets from the Human Microbiome Project (HMP) enables us to explore the distribution and diversity of known CRISPRs in human-associated microbial communities and to discover new CRISPRs. We propose a targeted assembly strategy to reconstruct CRISPR arrays, which whole-metagenome assemblies fail to identify. For each known CRISPR type (identified from reference genomes), we use its direct repeat consensus sequence to recruit reads from each HMP dataset and then assemble the recruited reads into CRISPR loci; the unique spacer sequences can then be extracted for analysis. We also identified novel CRISPRs, or new variants of known CRISPRs, in contigs from whole-metagenome assemblies, and used targeted assembly to identify these CRISPRs more comprehensively across samples. We observed that the distributions of CRISPRs (64 known and 86 novel) are largely body-site specific. We provide detailed analysis of several CRISPR loci, including novel CRISPRs. For example, known streptococcal CRISPRs were identified in most oral microbiomes, totaling ∼8,000 unique spacers: samples resampled from the same individual and oral site shared the most spacers; different oral sites from the same individual shared significantly fewer; and different individuals had almost no spacers in common, indicating the impact of subtle niche differences on the evolution of CRISPR defenses. We further demonstrate potential applications of CRISPRs to tracing rare species and to assessing the virus exposure of individuals.
This work indicates the importance of effective identification and characterization of CRISPR loci to the study of the dynamic ecology of microbiomes.
Human bodies are complex ecological systems in which various microbial organisms and viruses interact with each other and with the human host. The Human Microbiome Project (HMP) has produced >700 datasets of shotgun metagenomic sequences, from which we can learn about the compositions and functions of human-associated microbial communities. CRISPR/Cas systems are a widespread class of adaptive immune systems in bacteria and archaea that provide acquired immunity against foreign nucleic acids: CRISPR/Cas defense pathways involve integration of viral- or plasmid-derived DNA segments into CRISPR arrays (forming spacers between repeated structural sequences) and expression of short crRNAs from these single repeat-spacer units to generate interference against future invading foreign genomes. Powered by an effective computational approach (targeted assembly of CRISPRs), our analysis of CRISPR arrays in the HMP datasets provides the first global view of bacterial immune systems in human-associated microbial communities. The great diversity of CRISPR spacers we observed among different body sites, in different individuals, and in single individuals over time indicates the impact of subtle niche differences on the evolution of CRISPR defenses, and points to the key role of bacteriophages (and plasmids) in shaping human microbial communities.
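Once a CRISPR locus has been assembled by read recruitment, spacer extraction is conceptually simple: spacers are the segments between consecutive copies of the direct repeat. The sketch below uses exact matching to keep it short; real repeats vary slightly and need fuzzy matching, so this is illustrative only:

```python
import re

def extract_spacers(locus, repeat):
    """Split an assembled CRISPR locus on its direct-repeat consensus.
    The first and last split pieces are leader/trailer sequence; the
    internal pieces are the spacers."""
    parts = re.split(re.escape(repeat), locus)
    return [p for p in parts[1:-1] if p]
```

The unique spacers extracted this way are what the analyses above compare across body sites, individuals, and time points.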
We estimated the reproducibility of tandem mass fragmentation spectra for the widely used collision-induced dissociation (CID) instruments. Using the Pearson correlation coefficient as a measure of spectral similarity, we found that the within-experiment reproducibility of fragment ion intensities is very high (about 0.85). However, across different experiments and instrument types/setups, the correlation decreases by more than 15% (to about 0.70). We further investigated the accuracy of current predictors of peptide fragmentation spectra and found that they are more accurate than the ad hoc models generally used by search engines (e.g., SEQUEST) and, surprisingly, approach the empirical upper limit set by the average across-experiment spectral reproducibility (especially for charge +1 and charge +2 precursor ions). These results provide evidence that, in terms of modeling accuracy, predicted peptide fragmentation spectra are a viable alternative to spectral libraries for peptide identification, with higher peptide coverage and lower storage requirements. Furthermore, using five datasets of proteome digests by two different proteases, we find that PeptideART (a data-driven machine learning approach) is generally more accurate than MassAnalyzer (an approach based on a kinetic model of peptide fragmentation) in predicting fragmentation spectra, but that both models are significantly more accurate than the ad hoc models. Availability: PeptideART is freely available at www.informatics.indiana.edu/predrag.
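The similarity measure used above is the plain Pearson correlation applied to matched fragment intensities. A minimal version for spectra represented as {m/z bin: intensity} dictionaries, with peaks absent from one spectrum counted as zero intensity (a common convention; the paper's exact peak matching may differ):

```python
import math

def spectral_pearson(spec_a, spec_b):
    """Pearson correlation between two fragment spectra over the union
    of their m/z bins; missing peaks contribute zero intensity."""
    bins = sorted(set(spec_a) | set(spec_b))
    xs = [spec_a.get(b, 0.0) for b in bins]
    ys = [spec_b.get(b, 0.0) for b in bins]
    n = len(bins)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)
```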
Though many software packages have been developed to perform label-free quantification of proteins in complex biological samples using peptide intensities generated by LC–MS/MS, two critical issues are generally ignored in this field: (i) peptides have multiple elution patterns across runs in an experiment, and (ii) many peptides cannot be used for protein quantification. To address these two key issues, we have developed a novel alignment method to enable accurate peptide peak retention time determination and multiple filters to eliminate unqualified peptides for protein quantification. Repeatability and linearity have been tested using six very different samples, i.e., standard peptides, kidney tissue lysates, HT29-MTX cell lysates, depleted human serum, human serum albumin-bound proteins, and standard proteins spiked in kidney tissue lysates. At least 90.8% of the proteins (up to 1,390) had CVs ≤ 30% across 10 technical replicates, and at least 93.6% (up to 2,013) had R2 ≥ 0.9500 across 7 concentrations. Identical amounts of standard protein spiked in complex biological samples achieved a CV of 8.6% across eight injections of two groups. Further assessment was made by comparing mass spectrometric results to immunodetection, and consistent results were obtained. The new approach has novel and specific features enabling accurate label-free quantification.
LC–MS/MS; label-free quantification; alignment; retention time; unqualified peptides
Quantification of protein expression by mass spectrometry (MS) has been introduced in various proteomics studies. In particular, two label-free quantification methods, spectral counting and spectral feature analysis, have been extensively investigated in a wide variety of proteomic studies. The cornerstone of both methods is peptide identification based on a proteomic database search and subsequent estimation of peptide retention time. However, they often suffer from restrictive database searches and inaccurate estimation of liquid chromatography (LC) retention time. Furthermore, conventional peptide identification methods based on database or spectral library search algorithms, such as SEQUEST or SpectraST, do not always provide the best match or high-scoring matches. Lastly, these methods cannot identify target peptides unless the peptides have previously been generated and stored in the database or spectral libraries.
To overcome these limitations, we propose a novel method, Quantification based on Finding the Identical Spectral set for a Homogeneous peptide (Q-FISH), which estimates a peptide's abundance from its tandem mass spectrometry (MS/MS) spectra through direct comparison of experimental spectra. Intuitively, Q-FISH compares all possible pairs of experimental spectra in order to identify both known and novel proteins, significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets.
We applied Q-FISH to nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal liver tissue samples to identify peptides differentially expressed between the normal and disease samples. From a total of 44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,747 clusters. Among these, 5,777 clusters were identified only in the HCC sample, 6,648 clusters only in the normal tissue sample, and 2,323 clusters in both the HCC and normal tissue samples. While it will be interesting to investigate the peptide clusters found in only one sample, we further examined the spectral clusters identified in both the HCC and normal samples, since our goal was to identify and quantitatively assess differentially expressed peptides. The next step was to perform a beta-binomial test to isolate differentially expressed peptides between the HCC and normal tissue samples. This test yielded 84 peptides with significantly different spectral counts between the HCC and normal tissue samples. We independently identified 50 and 95 peptides by SEQUEST, of which 24 and 56 peptides, respectively, were found to be known biomarkers for human liver cancer. Comparing the Q-FISH and SEQUEST results, we found that 22 of the 84 differentially expressed peptides identified by Q-FISH were also identified by SEQUEST. Remarkably, of these 22 peptides discovered by both Q-FISH and SEQUEST, 13 are known biomarkers for human liver cancer and the remaining 9 are known to be associated with other cancers.
We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and simultaneously quantifying the expression levels of the identified peptides from mass spectrometry data. Q-FISH analysis of human HCC and normal liver tissue samples identified many protein biomarkers highly relevant to HCC. Q-FISH can be a useful tool for both peptide identification and quantification in mass spectrometry data analysis. It may also prove more effective than SEQUEST and other standard methods in discovering novel protein biomarkers.
Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index the similarity search database. This optimized data structure speeds up the similarity search by another 2–3 fold. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. a 3.5× speedup in 4-thread mode). RAPSearch2 requires up to 2 GB of memory when running in single-thread mode, or up to 3.5 GB in 4-thread mode.
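As a rough illustration of the seed-indexing idea (not RAPSearch2's actual data structure, which is a collision-free hash table built over a reduced amino-acid alphabet in C++), a protein database can be indexed by fixed-length seeds so that each query seed is looked up in constant expected time instead of searched in a suffix array:

```python
from collections import defaultdict

SEED_LEN = 6  # illustrative seed length, not RAPSearch2's actual choice

def build_seed_index(proteins):
    """Map every length-SEED_LEN substring of each database protein
    to its list of (protein id, offset) occurrences."""
    index = defaultdict(list)
    for pid, seq in proteins.items():
        for i in range(len(seq) - SEED_LEN + 1):
            index[seq[i:i + SEED_LEN]].append((pid, i))
    return index

def seed_hits(index, query):
    """Return all (query offset, protein id, db offset) seed matches;
    in a real search each hit would then be extended into an alignment."""
    hits = []
    for i in range(len(query) - SEED_LEN + 1):
        for pid, j in index.get(query[i:i + SEED_LEN], ()):
            hits.append((i, pid, j))
    return hits
```

A Python dict resolves collisions internally, so this only conveys the lookup pattern; the memory saving in RAPSearch2 comes from replacing the suffix array with a hash table tailored to be collision-free.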
Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/.
Supplementary information: Available at the RAPSearch2 website.
A synthetic approach to modeling the analytical complexity of biological proteolytic digests has been developed. Combinatorial peptide libraries between nine and twelve amino acids in length, representative of typical tryptic digests, were designed, synthesized and analyzed. Individual libraries and mixtures thereof were studied by replicate liquid chromatography-ion trap mass spectrometry and compared to a tryptic digest of Deinococcus radiodurans. As in complex proteome analysis, replicate study of individual libraries identified additional unique peptides, and fewer novel sequences were revealed with each successive analysis, in a manner similar to that observed for biological data. Our results demonstrate a bimodal distribution of peptides, sorting to either very low or very high levels of detection. Upon mixing of libraries at equal abundance, a length-dependent bias in favor of identifying longer sequences was observed. Peptide identification as a function of site-specific amino acid content was characterized, with certain amino acids proving to be of considerable importance. This report demonstrates that peptide libraries of defined character can serve as a reference for instrument characterization. Furthermore, they are uniquely suited to delineating the physical properties that influence peptide identification, which provides a foundation for optimizing the study of samples with less defined heterogeneity.
Peptide detectability is defined as the probability that a peptide is identified in an LC-MS/MS experiment and has been useful in providing solutions to protein inference and label-free quantification. Previously, predictors for peptide detectability trained on standard or complex samples were proposed. Although the models trained on complex samples may benefit from the large training data sets, it is unclear to what extent they are affected by the unequal abundances of identified proteins. To address this challenge and improve detectability prediction, we present a new algorithm for the iterative learning of peptide detectability from complex mixtures. We provide evidence that the new method approximates detectability with useful accuracy and, based on its design, can be used to interpret the outcome of other learning strategies. We studied the properties of peptides from the bacterium Deinococcus radiodurans and found that at standard quantities, its tryptic peptides can be roughly classified as either detectable or undetectable, with a relatively small fraction having medium detectability. We extend the concept of detectability from peptides to proteins and apply the model to predict the behavior of a replicate LC-MS/MS experiment from a single analysis. Finally, our study summarizes a theoretical framework for peptide/protein identification and label-free quantification.
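The extension from peptide to protein detectability, and the prediction of a replicate run from a single analysis, can both be sketched under a simple independence assumption (an illustrative model; the abstract does not specify the authors' exact formulation):

```python
def protein_detectability(peptide_detectabilities):
    """Probability that a protein is identified, assuming it is identified
    when at least one of its peptides is detected and that peptide
    detections are independent (a simplifying assumption)."""
    p_none = 1.0
    for d in peptide_detectabilities:
        p_none *= (1.0 - d)
    return 1.0 - p_none

def expected_replicate_overlap(detectabilities):
    """Expected fraction of peptides observed in both of two independent
    replicate runs, among those observed in at least one run; a rough
    proxy for predicting a replicate experiment from a single analysis."""
    both = sum(d * d for d in detectabilities)
    either = sum(1.0 - (1.0 - d) ** 2 for d in detectabilities)
    return both / either if either else 0.0
```

Under this model a bimodal peptide population (detectabilities near 0 or near 1) yields a replicate overlap close to 1, consistent with most peptides being either reliably seen or reliably missed.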