author:("Tang, haifu")
1.  A New Method for Stranded Whole Transcriptome RNA-seq 
Methods (San Diego, Calif.)  2013;63(2):126-134.
This report describes an improved protocol to generate stranded, barcoded RNA-seq libraries to capture the whole transcriptome. By optimizing the use of duplex specific nuclease (DSN) to remove ribosomal RNA reads from stranded barcoded libraries, we demonstrate improved efficiency of multiplexed next generation sequencing (NGS). This approach detects expression profiles of all RNA types, including miRNA (microRNA), piRNA (Piwi-interacting RNA), snoRNA (small nucleolar RNA), lincRNA (long intergenic non-coding RNA), mtRNA (mitochondrial RNA) and mRNA (messenger RNA) without the use of gel electrophoresis. The improved protocol generates high quality data that can be used to identify differential expression in known and novel coding and non-coding transcripts, splice variants, mitochondrial genes and SNPs (single nucleotide polymorphisms).
PMCID: PMC3739992  PMID: 23557989
RNA-seq; transcriptome; duplex-specific nuclease; gene expression
2.  Gene finding in metatranscriptomic sequences 
BMC Bioinformatics  2014;15(Suppl 9):S8.
Metatranscriptomic sequencing is a highly sensitive bioassay of functional activity in a microbial community, providing complementary information to the metagenomic sequencing of the community. The acquisition of the metatranscriptomic sequences will enable us to refine the annotations of the metagenomes, and to study the gene activities and their regulation in complex microbial communities and their dynamics.
In this paper, we present TransGeneScan, a software tool for finding genes in assembled transcripts from metatranscriptomic sequences. By incorporating several features of metatranscriptomic sequencing, including strand-specificity, short intergenic regions, and putative antisense transcripts into a Hidden Markov Model, TransGeneScan can predict a sense transcript containing one or multiple genes (in an operon) or an antisense transcript.
We tested TransGeneScan on a mock metatranscriptomic data set containing three known bacterial genomes. The results showed that TransGeneScan performs better than metagenomic gene finders (MetaGeneMark and FragGeneScan) on predicting protein coding genes in assembled transcripts, and achieves comparable or even higher accuracy than gene finders for microbial genomes (Glimmer and GeneMark). These results imply that, with the assistance of metatranscriptomic sequencing, we can obtain a broad and precise picture of the genes (and their functions) in a microbial community.
TransGeneScan is available as open-source software on SourceForge at
PMCID: PMC4168707  PMID: 25253067
Metatranscriptomics; Gene finding; Hidden Markov Model; Operons; Antisense RNA (asRNA)
3.  Protein identification problem from a Bayesian point of view 
Statistics and its interface  2012;5(1):21-37.
We present a generic Bayesian framework for the peptide and protein identification in proteomics, and provide a unified interpretation for the database searching and the de novo peptide sequencing approaches that are used in peptide identification. We describe several probabilistic graphical models and a variety of prior distributions that can be incorporated into the Bayesian framework to model different types of prior information, such as the known protein sequences, the known protein abundances, the peptide precursor masses, the estimated peptide retention time and the peptide detectabilities. Various applications of the Bayesian framework are discussed theoretically, including its application to the identification of peptides containing mutations and post-translational modifications.
PMCID: PMC3992622  PMID: 24761189
Shotgun proteomics; Protein identification; Mass spectrometry; Bayesian methods
4.  Extending the coverage of spectral libraries: a neighbor-based approach to predicting intensities of peptide fragmentation spectra 
Proteomics  2013;13(5):756-765.
Searching spectral libraries in tandem mass spectrometry (MS/MS) is an important new approach to improving the quality of peptide and protein identification. The idea relies on the observation that ion intensities in an MS/MS spectrum of a given peptide are generally reproducible across experiments, and thus, matching between spectra from an experiment and the spectra of previously identified peptides stored in a spectral library can lead to better peptide identification compared to the traditional database search. However, the use of libraries is greatly limited by their coverage of peptide sequences: even for well-studied organisms a large fraction of peptides have not been previously identified. To address this issue, we propose to expand spectral libraries by predicting the MS/MS spectra of peptides based on the spectra of peptides with similar sequences. We first demonstrate that the intensity patterns of dominant fragment ions between similar peptides tend to be similar. In accordance with this observation, we develop a neighbor-based approach which first selects peptides that are likely to have spectra similar to the target peptide and then combines their spectra using a weighted K-nearest neighbor method to accurately predict fragment ion intensities corresponding to the target peptide. This approach has the potential to predict spectra for every peptide in the proteome. When rigorous quality criteria are applied, we estimate that the method increases the coverage of spectral libraries available from the National Institute of Standards and Technology by 20–60%, although the values vary with peptide length and charge state. We find that the overall best search performance is achieved when spectral libraries are supplemented by the high quality predicted spectra.
PMCID: PMC3733334  PMID: 23303707
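The neighbor-based idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the position-wise similarity function, the fixed-length intensity vectors, and the names `sequence_similarity` and `predict_intensities` are assumptions made for illustration only.

```python
from typing import Dict, List


def sequence_similarity(a: str, b: str) -> float:
    """Hypothetical similarity: fraction of identical positions (equal-length peptides only)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_intensities(target: str, library: Dict[str, List[float]], k: int = 3) -> List[float]:
    """Predict fragment ion intensities as a similarity-weighted average of the k nearest library spectra."""
    # Rank library peptides by similarity to the target, most similar first.
    scored = sorted(
        ((sequence_similarity(target, pep), pep) for pep in library if pep != target),
        reverse=True,
    )
    # Keep at most k neighbors that share at least one position with the target.
    neighbors = [(score, pep) for score, pep in scored[:k] if score > 0]
    if not neighbors:
        raise ValueError("no similar peptide in the library")
    total = sum(score for score, _ in neighbors)
    n_ions = len(library[neighbors[0][1]])
    # Weighted average, ion by ion; weights are the similarity scores.
    return [
        sum(score * library[pep][i] for score, pep in neighbors) / total
        for i in range(n_ions)
    ]
```

A call such as `predict_intensities("AAG", {"AAK": [1.0, 0.0], "AAR": [0.0, 1.0]}, k=2)` averages the two equally similar neighbors. A realistic predictor would use a similarity measure tuned to fragmentation behavior rather than plain positional identity.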
5.  Software tools for glycan profiling 
Methods in molecular biology (Clifton, N.J.)  2013;951:10.1007/978-1-62703-146-2_18.
PMCID: PMC3861397  PMID: 23296537
6.  A de Bruijn Graph Approach to the Quantification of Closely-Related Genomes in a Microbial Community 
Journal of Computational Biology  2012;19(6):814-825.
The wide applications of next-generation sequencing (NGS) technologies in metagenomics have raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely-related microbial species (e.g., within the same genus) or strains (e.g., within the same species), thus it is often difficult to determine which species/strain a specific read is sampled from when it can be mapped to a common region shared by multiple genomes at high similarity. Furthermore, high genomic variations are observed among individual genomes within the same species, which are difficult to be differentiated from the inter-species variations during reads mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely-related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the common genomic regions among them are collapsed. 
Using simulated and real metagenomic datasets, we show the de Bruijn graph approach has several advantages over existing methods, including (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification for the closely-related species (and even for strains within the same species).
PMCID: PMC3375647  PMID: 22697249
closely-related genomes; de Bruijn graph; metagenomics; quantification
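As a rough illustration of collapsing common regions, the sketch below builds a de Bruijn graph from several genomes so that a k-mer shared by multiple genomes is stored once, labeled with every genome that contains it. The read-assignment rule (intersecting genome labels along a read) and the names `build_graph` and `count_reads` are simplifying assumptions, not the method published in the paper.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]


def build_graph(genomes: Dict[str, str], k: int) -> Dict[Edge, Set[str]]:
    """Map each edge ((k-1)-mer prefix, (k-1)-mer suffix) to the genomes containing that k-mer."""
    edges: Dict[Edge, Set[str]] = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # A shared k-mer yields the same edge, so common regions collapse.
            edges[(kmer[:-1], kmer[1:])].add(name)
    return edges


def count_reads(reads: Iterable[str], edges: Dict[Edge, Set[str]], k: int) -> Dict[FrozenSet[str], int]:
    """Count each read once, under the set of genomes compatible with all of its k-mers."""
    counts: Dict[FrozenSet[str], int] = defaultdict(int)
    for read in reads:
        owners = None
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            genomes = edges.get((kmer[:-1], kmer[1:]))
            if genomes is None:
                owners = set()  # k-mer absent from the graph: read is unmapped
                break
            owners = set(genomes) if owners is None else owners & genomes
        if owners:
            counts[frozenset(owners)] += 1
    return counts
```

Reads falling in a shared region are counted once under the shared label instead of once per genome, which is the redundancy the abstract describes avoiding.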
7.  CRISPR-Cas systems target a diverse collection of invasive mobile genetic elements in human microbiomes 
Genome Biology  2013;14(4):R40.
Bacteria and archaea develop immunity against invading genomes by incorporating pieces of the invaders' sequences, called spacers, into a clustered regularly interspaced short palindromic repeats (CRISPR) locus between repeats, forming arrays of repeat-spacer units. When spacers are expressed, they direct CRISPR-associated (Cas) proteins to silence complementary invading DNA. In order to characterize the invaders of human microbiomes, we use spacers from CRISPR arrays that we had previously assembled from shotgun metagenomic datasets, and identify contigs that contain these spacers' targets.
We discover 95,000 contigs that are putative invasive mobile genetic elements, some targeted by hundreds of CRISPR spacers. We find that oral sites in healthy human populations have a much greater variety of mobile genetic elements than stool samples. Mobile genetic elements carry genes encoding diverse functions: only 7% of the mobile genetic elements are similar to known phages or plasmids, although a much greater proportion contain phage- or plasmid-related genes. A small number of contigs share similarity with known integrative and conjugative elements, providing the first examples of CRISPR defenses against this class of element. We provide detailed analyses of a few large mobile genetic elements of various types, and a relative abundance analysis of mobile genetic elements and putative hosts, exploring the dynamic activities of mobile genetic elements in human microbiomes. A joint analysis of mobile genetic elements and CRISPRs shows that protospacer-adjacent motifs drive their interaction network; however, some CRISPR-Cas systems target mobile genetic elements lacking motifs.
We identify a large collection of invasive mobile genetic elements in human microbiomes, an important resource for further study of the interaction between the CRISPR-Cas immune system and invaders.
PMCID: PMC4053933  PMID: 23628424
CRISPR-Cas system; human microbiome; mobile genetic element (MGE)
8.  N-Glycan Profiling by Microchip Electrophoresis to Differentiate Disease-States Related to Esophageal Adenocarcinoma 
Analytical Chemistry  2012;84(8):3621-3627.
We report analysis of N-glycans derived from disease-free individuals and patients with Barrett's esophagus, high-grade dysplasia, and esophageal adenocarcinoma by microchip electrophoresis with laser-induced fluorescence detection. Serum samples in 10-μL aliquots are enzymatically treated to cleave the N-glycans that are subsequently reacted with 8-aminopyrene-1,3,6-trisulfonic acid to add charge and a fluorescent label. Separations at 1250 V/cm and over 22 cm yielded efficiencies up to 700,000 plates for the N-glycans and analysis times under 100 s. Principal component analysis (PCA) and analysis of variance (ANOVA) tests of the peak areas and migration times are used to evaluate N-glycan profiles from native and desialylated samples and to determine differences among the four sample groups. With microchip electrophoresis, we are able to distinguish the three patient groups from each other and from disease-free individuals.
PMCID: PMC3339272  PMID: 22397697
microfluidics; microchip electrophoresis; N-glycans; glycan profiling; disease-state monitoring; Barrett's esophagus; high-grade dysplasia; esophageal adenocarcinoma
9.  Testosterone Affects Neural Gene Expression Differently in Male and Female Juncos: A Role for Hormones in Mediating Sexual Dimorphism and Conflict 
PLoS ONE  2013;8(4):e61784.
Despite sharing much of their genomes, males and females are often highly dimorphic, reflecting at least in part the resolution of sexual conflict in response to sexually antagonistic selection. Sexual dimorphism arises owing to sex differences in gene expression, and steroid hormones are often invoked as a proximate cause of sexual dimorphism. Experimental elevation of androgens can modify behavior, physiology, and gene expression, but knowledge of the role of hormones remains incomplete, including how the sexes differ in gene expression in response to hormones. We addressed these questions in a bird species with a long history of behavioral endocrinological and ecological study, the dark-eyed junco (Junco hyemalis), using a custom microarray. Focusing on two brain regions involved in sexually dimorphic behavior and regulation of hormone secretion, we identified 651 genes that differed in expression by sex in medial amygdala and 611 in hypothalamus. Additionally, we treated individuals of each sex with testosterone implants and identified many genes that may be related to previously identified phenotypic effects of testosterone treatment. Some of these genes relate to previously identified effects of testosterone-treatment and suggest that the multiple effects of testosterone may be mediated by modifying the expression of a small number of genes. Notably, testosterone-treatment tended to alter expression of different genes in each sex: only 4 of the 527 genes identified as significant in one sex or the other were significantly differentially expressed in both sexes. Hormonally regulated gene expression is a key mechanism underlying sexual dimorphism, and our study identifies specific genes that may mediate some of these processes.
PMCID: PMC3627916  PMID: 23613935
10.  Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences 
PLoS Computational Biology  2013;9(3):e1002981.
Shotgun metagenomics has been applied to the studies of the functionality of various microbial communities. As a critical analysis step in these studies, biological pathways are reconstructed based on the genes predicted from metagenomic shotgun sequences. Pathway reconstruction provides insights into the functionality of a microbial community and can be used for comparing multiple microbial communities. The utilization of pathway reconstruction, however, can be jeopardized because of imperfect functional annotation of genes, and ambiguity in the assignment of predicted enzymes to biochemical reactions (e.g., some enzymes are involved in multiple biochemical reactions). Considering that metabolic functions in a microbial community are carried out by many enzymes in a collaborative manner, we present a probabilistic sampling approach to profiling functional content in a metagenomic dataset, by sampling functions of catalytically promiscuous enzymes within the context of the entire metabolic network defined by the annotated metagenome. We test our approach on metagenomic datasets from environmental and human-associated microbial communities. The results show that our approach provides a more accurate representation of the metabolic activities encoded in a metagenome, and thus improves the comparative analysis of multiple microbial communities. In addition, our approach reports likelihood scores of putative reactions, which can be used to identify important reactions and metabolic pathways that reflect the environmental adaptation of the microbial communities. Source code for sampling metabolic networks is available online at
Author Summary
We present a probabilistic sampling approach to profiling metabolic reactions in a microbial community from metagenomic shotgun reads, in an attempt to understand the metabolism within a microbial community and compare them across multiple communities. Different from the conventional pathway reconstruction approaches that aim at a definitive set of reactions, our method estimates how likely each annotated reaction can occur in the metabolism of the microbial community, given the shotgun sequencing data. This probabilistic measure improves our prediction of the actual metabolism in the microbial communities and can be used in the comparative functional analysis of metagenomic data.
PMCID: PMC3605055  PMID: 23555216
11.  On the Mutational Topology of the Bacterial Genome 
G3: Genes|Genomes|Genetics  2013;3(3):399-407.
By sequencing the genomes of 34 mutation accumulation lines of a mismatch-repair defective strain of Escherichia coli that had undergone a total of 12,750 generations, we identified 1625 spontaneous base-pair substitutions spread across the E. coli genome. These mutations are not distributed at random but, instead, fall into a wave-like spatial pattern that is repeated almost exactly in mirror image in the two separately replicated halves of the bacterial chromosome. The pattern is correlated to genomic features, with mutation densities greatest in regions predicted to have high superhelicity. Superimposed upon this pattern are regional hotspots, some of which are located where replication forks may collide or be blocked. These results suggest that, as they traverse the chromosome, the two replication forks encounter parallel structural features that change the fidelity of DNA replication.
PMCID: PMC3583449  PMID: 23450823
mutation rate; evolution; replication fidelity; chromosome structure; DNA polymerase errors
12.  The Ecoresponsive Genome of Daphnia pulex 
Science (New York, N.Y.)  2011;331(6017):555-561.
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 Mb and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than 1/3 of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The co-expansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes – including many additional loci within sequenced regions that are otherwise devoid of annotations – are the most responsive genes to ecological challenges.
PMCID: PMC3529199  PMID: 21292972
14.  Investigation of VUV Photodissociation Propensities Using Peptide Libraries 
PSD does not usually generate a complete series of y-type ions, particularly at high mass, and this is a limitation for de novo sequencing algorithms. It is demonstrated that b2 and b3 ions can be used to help assign high mass xN-2 and xN-3 fragments that are found in vacuum ultraviolet (VUV) photofragmentation experiments. In addition, vN-type ion fragments with side chain loss from the N-terminal residue often enable confirmation of N-terminal amino acids. Libraries containing several thousand peptides were examined using photodissociation in a MALDI-TOF/TOF instrument. 1345 photodissociation spectra with a high S/N ratio were interpreted.
PMCID: PMC3224043  PMID: 22125417
De Novo Sequencing; Photodissociation; Mass Spectrometry; MALDI-TOF/TOF
15.  De novo transcriptome sequencing in a songbird, the dark-eyed junco (Junco hyemalis): genomic tools for an ecological model system 
BMC Genomics  2012;13:305.
Though genomic-level data are becoming widely available, many of the metazoan species sequenced are laboratory systems whose natural history is not well documented. In contrast, the wide array of species with very well-characterized natural history have, until recently, lacked genomics tools. It is now possible to address significant evolutionary genomics questions by applying high-throughput sequencing to discover the majority of genes for ecologically tractable species, and by subsequently developing microarray platforms from which to investigate gene regulatory networks that function in natural systems. We used GS-FLX Titanium Sequencing (Roche/454-Sequencing) of two normalized libraries of pooled RNA samples to characterize a transcriptome of the dark-eyed junco (Junco hyemalis), a North American sparrow that is a classically studied species in the fields of photoperiodism, speciation, and hormone-mediated behavior.
From a broad pool of RNA sampled from tissues throughout the body of a male and a female junco, we sequenced a total of 434 million nucleotides from 1.17 million reads that were assembled de novo into 31,379 putative transcripts representing 22,765 gene sets covering 35.8 million nucleotides with 12-fold average depth of coverage. Annotation of roughly half of the putative genes was accomplished using sequence similarity, and expression was confirmed for the majority with a preliminary microarray analysis. Of 716 core bilaterian genes, 646 (90 %) were recovered within our characterized gene set. Gene Ontology, orthoDB orthology groups, and KEGG Pathway annotation provide further functional information about the sequences, and 25,781 potential SNPs were identified.
The extensive sequence information returned by this effort adds to the growing store of genomic data on diverse species. The extent of coverage and annotation achieved and confirmation of expression, show that transcriptome sequencing provides useful information for ecological model systems that have historically lacked genomic tools. The junco-specific microarray developed here is allowing investigations of gene expression responses to environmental and hormonal manipulations – extending the historic work on natural history and hormone-mediated phenotypes in this system.
PMCID: PMC3476391  PMID: 22776250
Transcriptome; Aves; pyrosequencing; microarray; Junco; 454 titanium cDNA sequencing; single nucleotide polymorphism
16.  Diverse CRISPRs Evolving in Human Microbiomes 
PLoS Genetics  2012;8(6):e1002441.
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci, together with cas (CRISPR–associated) genes, form the CRISPR/Cas adaptive immune system, a primary defense strategy that eubacteria and archaea mobilize against foreign nucleic acids, including phages and conjugative plasmids. Short spacer sequences separated by the repeats are derived from foreign DNA and direct interference to future infections. The availability of hundreds of shotgun metagenomic datasets from the Human Microbiome Project (HMP) enables us to explore the distribution and diversity of known CRISPRs in human-associated microbial communities and to discover new CRISPRs. We propose a targeted assembly strategy to reconstruct CRISPR arrays, which whole-metagenome assemblies fail to identify. For each known CRISPR type (identified from reference genomes), we use its direct repeat consensus sequence to recruit reads from each HMP dataset and then assemble the recruited reads into CRISPR loci; the unique spacer sequences can then be extracted for analysis. We also identified novel CRISPRs or new CRISPR variants in contigs from whole-metagenome assemblies and used targeted assembly to more comprehensively identify these CRISPRs across samples. We observed that the distributions of CRISPRs (including 64 known and 86 novel ones) are largely body-site specific. We provide detailed analysis of several CRISPR loci, including novel CRISPRs. For example, known streptococcal CRISPRs were identified in most oral microbiomes, totaling ∼8,000 unique spacers: samples resampled from the same individual and oral site shared the most spacers; different oral sites from the same individual shared significantly fewer, while different individuals had almost no common spacers, indicating the impact of subtle niche differences on the evolution of CRISPR defenses. We further demonstrate potential applications of CRISPRs to the tracing of rare species and the virus exposure of individuals. 
This work indicates the importance of effective identification and characterization of CRISPR loci to the study of the dynamic ecology of microbiomes.
Author Summary
Human bodies are complex ecological systems in which various microbial organisms and viruses interact with each other and with the human host. The Human Microbiome Project (HMP) has resulted in >700 datasets of shotgun metagenomic sequences, from which we can learn about the compositions and functions of human-associated microbial communities. CRISPR/Cas systems are a widespread class of adaptive immune systems in bacteria and archaea, providing acquired immunity against foreign nucleic acids: CRISPR/Cas defense pathways involve integration of viral- or plasmid-derived DNA segments into CRISPR arrays (forming spacers between repeated structural sequences), and expression of short crRNAs from these single repeat-spacer units, to generate interference to future invading foreign genomes. Powered by an effective computational approach (the targeted assembly approach for CRISPR), our analysis of CRISPR arrays in the HMP datasets provides the very first global view of bacterial immunity systems in human-associated microbial communities. The great diversity of CRISPR spacers we observed among different body sites, in different individuals, and in single individuals over time, indicates the impact of subtle niche differences on the evolution of CRISPR defenses and indicates the key role of bacteriophage (and plasmids) in shaping human microbial communities.
PMCID: PMC3374615  PMID: 22719260
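The recruit-then-extract step described above can be sketched as follows. This toy version assumes exact matches to the direct repeat consensus and skips the assembly of recruited reads entirely; the names `recruit_reads` and `extract_spacers` are hypothetical and not taken from the paper's software.

```python
from typing import Iterable, List, Set


def recruit_reads(reads: Iterable[str], repeat: str) -> List[str]:
    """Keep only reads that contain the direct repeat consensus (exact match, an assumption)."""
    return [read for read in reads if repeat in read]


def extract_spacers(read: str, repeat: str, min_len: int = 1) -> List[str]:
    """Return the segments that are flanked by a repeat on both sides."""
    parts = read.split(repeat)
    # The first and last parts are flanks, not complete spacers.
    return [part for part in parts[1:-1] if len(part) >= min_len]


def unique_spacers(reads: Iterable[str], repeat: str) -> Set[str]:
    """Collect the distinct spacers found across all recruited reads."""
    spacers: Set[str] = set()
    for read in recruit_reads(reads, repeat):
        spacers.update(extract_spacers(read, repeat))
    return spacers
```

Real metagenomic reads carry sequencing errors and repeat variants, so a practical implementation would need approximate matching.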
17.  On the Accuracy and Limits of Peptide Fragmentation Spectrum Prediction 
Analytical Chemistry  2010;83(3):790-796.
We estimated the reproducibility of tandem mass fragmentation spectra for the widely-used collision-induced dissociation (CID) instruments. Using the Pearson correlation coefficient as a measure of spectral similarity, we found that the within-experiment reproducibility of fragment ion intensities is very high (about 0.85). However, across different experiments and instrument types/setups, the correlation decreases by more than 15% (to about 0.70). We further investigated the accuracy of current predictors of peptide fragmentation spectra and found that they are more accurate than the ad-hoc models generally used by search engines (e.g. SEQUEST) and, surprisingly, approaching the empirical upper limit set by the average across-experiment spectral reproducibility (especially for charge +1 and charge +2 precursor ions). These results provide evidence that, in terms of accuracy of modeling, predicted peptide fragmentation spectra provide a viable alternative to spectral libraries for peptide identification, with a higher coverage of peptides and lower storage requirements. Furthermore, using five data sets of proteome digests by two different proteases, we find that PeptideART (a data-driven machine learning approach) is generally more accurate than MassAnalyzer (an approach based on a kinetic model for peptide fragmentation) in predicting fragmentation spectra, but that both models are significantly more accurate than the ad-hoc models. Availability: PeptideART is freely available at
PMCID: PMC3036742  PMID: 21175207
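The similarity measure named in the abstract, the Pearson correlation coefficient between two spectra, can be computed as in the sketch below. It assumes the two spectra have already been reduced to intensity vectors over the same ordered set of fragment ions; that alignment step is not shown.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length intensity vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances, without the common 1/n factor (it cancels).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)
```

Because the coefficient is invariant to scaling, two spectra that differ only in overall intensity score 1.0.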
18.  A Novel Alignment Method and Multiple Filters for Exclusion of Unqualified Peptides To Enhance Label-Free Quantification Using Peptide Intensity in LC–MS/MS 
Journal of proteome research  2011;10(10):4799-4812.
Though many software packages have been developed to perform label-free quantification of proteins in complex biological samples using peptide intensities generated by LC–MS/MS, two critical issues are generally ignored in this field: (i) peptides have multiple elution patterns across runs in an experiment, and (ii) many peptides cannot be used for protein quantification. To address these two key issues, we have developed a novel alignment method to enable accurate peptide peak retention time determination and multiple filters to eliminate unqualified peptides for protein quantification. Repeatability and linearity have been tested using six very different samples, i.e., standard peptides, kidney tissue lysates, HT29-MTX cell lysates, depleted human serum, human serum albumin-bound proteins, and standard proteins spiked in kidney tissue lysates. At least 90.8% of the proteins (up to 1,390) had CVs ≤ 30% across 10 technical replicates, and at least 93.6% (up to 2,013) had R² ≥ 0.9500 across 7 concentrations. Identical amounts of standard protein spiked in complex biological samples achieved a CV of 8.6% across eight injections of two groups. Further assessment was made by comparing mass spectrometric results to immunodetection, and consistent results were obtained. The new approach has novel and specific features enabling accurate label-free quantification.
PMCID: PMC3216047  PMID: 21888428
LC–MS/MS; label-free quantification; alignment; retention time; unqualified peptides
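The repeatability figure quoted above is a coefficient of variation (CV) across technical replicates. The sketch below shows how such a check might be computed; the 30% threshold comes from the abstract, while the function names and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def cv_percent(values: Sequence[float]) -> float:
    """Coefficient of variation in percent: sample standard deviation over the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


def proteins_within_cv(intensities: Dict[str, List[float]], max_cv: float = 30.0) -> List[str]:
    """Return proteins whose replicate intensities have a CV at or below max_cv."""
    return [
        protein
        for protein, values in intensities.items()
        if len(values) >= 2 and cv_percent(values) <= max_cv
    ]
```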
19.  Enhanced peptide quantification using spectral count clustering and cluster abundance 
BMC Bioinformatics  2011;12:423.
Quantification of protein expression by means of mass spectrometry (MS) has been introduced in various proteomics studies. In particular, two label-free quantification methods, spectral counting and spectral feature analysis, have been extensively investigated in a wide variety of proteomic studies. The cornerstone of both methods is peptide identification based on a proteomic database search and subsequent estimation of peptide retention time. However, they often suffer from restrictive database search and inaccurate estimation of the liquid chromatography (LC) retention time. Furthermore, conventional peptide identification methods based on the spectral library search algorithms such as SEQUEST or SpectraST have been found to provide neither the best match nor high-scored matches. Lastly, these methods are limited in the sense that target peptides cannot be identified unless they have been previously generated and stored in the database or spectral libraries.
To overcome these limitations, we propose a novel method, namely Quantification method based on Finding the Identical Spectral set for a Homogenous peptide (Q-FISH) to estimate the peptide's abundance from its tandem mass spectrometry (MS/MS) spectra through the direct comparison of experimental spectra. Intuitively, our Q-FISH method compares all possible pairs of experimental spectra in order to identify both known and novel proteins, significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets.
We applied Q-FISH to Nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal liver tissue samples to identify differentially expressed peptides between the normal and disease samples. For a total of 44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,747 clusters. Among these, 5,777 clusters were identified only in the HCC sample, 6,648 clusters only in the normal tissue sample, and 2,323 clusters both in the HCC and normal tissue samples. While it will be interesting to investigate peptide clusters found in only one sample, we further examined the spectral clusters identified in both the HCC and normal samples, since our goal is to identify and assess differentially expressed peptides quantitatively. The next step was to perform a beta-binomial test to isolate differentially expressed peptides between the HCC and normal tissue samples. This test resulted in 84 peptides with significantly differential spectral counts between the HCC and normal tissue samples. We independently identified 50 and 95 peptides by SEQUEST, of which 24 and 56 peptides, respectively, were found to be known biomarkers for the human liver cancer. Comparing Q-FISH and SEQUEST results, we found 22 of the differentially expressed 84 peptides by Q-FISH were also identified by SEQUEST. Remarkably, of these 22 peptides discovered both by Q-FISH and SEQUEST, 13 peptides are known for human liver cancer and the remaining 9 peptides are known to be associated with other cancers.
We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and simultaneously quantifying the expression levels of identified peptides from mass spectrometry data. Q-FISH analysis on human HCC and liver tissue samples identified many protein biomarkers that are highly relevant to HCC. Q-FISH can be a useful tool both for peptide identification and quantification on mass spectrometry data analysis. It may also prove to be more effective in discovering novel protein biomarkers than SEQUEST and other standard methods.
PMCID: PMC3234305  PMID: 22034872
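The grouping step described in the Q-FISH abstract above (comparing all pairs of experimental spectra, then counting spectra per cluster in each sample) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the 0.9 threshold, the single-linkage merging, the function names, and the toy spectra are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two binned spectra (dict: m/z bin -> intensity)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_spectra(spectra, threshold=0.9):
    """Single-linkage clustering over all spectrum pairs, using union-find.

    The similarity measure and threshold are assumptions, not taken from the abstract.
    """
    parent = list(range(len(spectra)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(spectra)):
        for j in range(i + 1, len(spectra)):
            if cosine(spectra[i], spectra[j]) >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(spectra))]


def spectral_counts(labels, samples):
    """Count, for every cluster, how many spectra came from each sample."""
    counts = {}
    for label, sample in zip(labels, samples):
        counts.setdefault(label, Counter())[sample] += 1
    return counts


# Toy data (invented): three similar spectra and one unrelated spectrum.
spectra = [
    {100: 1.0, 200: 0.5},
    {100: 0.9, 200: 0.55},
    {100: 1.0, 200: 0.45},
    {300: 1.0, 400: 0.8},
]
samples = ["HCC", "normal", "HCC", "normal"]

labels = cluster_spectra(spectra)
counts = spectral_counts(labels, samples)
# Clusters seen in both samples are the ones a differential test would examine.
shared = {lab: c for lab, c in counts.items() if c["HCC"] and c["normal"]}
print(shared)
```

In this toy run the three similar spectra fall into one cluster shared by both samples, and the unrelated spectrum forms a cluster seen only in the normal sample. The per-cluster counts in `shared` are the kind of input a spectral-count test such as the beta-binomial test mentioned in the abstract would take.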
20.  RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data 
Bioinformatics  2011;28(1):125-126.
Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The utilization of an optimized data structure further speeds up the similarity search—another 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2G memory when running in single thread mode, or up to 3.5G memory when running in 4-thread mode.
Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website.
Supplementary information: Available at the RAPSearch2 website.
PMCID: PMC3244761  PMID: 22039206
21.  Combinatorial Libraries of Synthetic Peptides as a Model for Shotgun Proteomics 
Analytical chemistry  2010;82(15):6559-6568.
A synthetic approach to model the analytical complexity of biological proteolytic digests has been developed. Combinatorial peptide libraries ranging in length between nine and twelve amino acids that represent typical tryptic digests were designed, synthesized and analyzed. Individual libraries and mixtures thereof were studied by replicate liquid chromatography-ion trap mass spectrometry and compared to a tryptic digest of Deinococcus radiodurans. Similar to complex proteome analysis, replicate study of individual libraries identified additional unique peptides. Fewer novel sequences were revealed with each additional analysis in a manner similar to that observed for biological data. Our results demonstrate a bimodal distribution of peptides sorting to either very low or very high levels of detection. Upon mixing of libraries at equal abundance, a length-dependent bias in favor of longer sequence identification was observed. Peptide identification as a function of site-specific amino acid content was characterized with certain amino acids proving to be of considerable importance. This report demonstrates that peptide libraries of defined character can serve as a reference for instrument characterization. Furthermore, they are uniquely suited to delineate the physical properties that influence identification of peptides which provides a foundation for optimizing the study of samples with less defined heterogeneity.
PMCID: PMC2927099  PMID: 20669997
22.  The importance of peptide detectability for protein identification, quantification, and experiment design in MS/MS proteomics 
Journal of proteome research  2010;9(12):6288-6297.
Peptide detectability is defined as the probability that a peptide is identified in an LC-MS/MS experiment and has been useful in providing solutions to protein inference and label-free quantification. Previously, predictors for peptide detectability trained on standard or complex samples were proposed. Although the models trained on complex samples may benefit from the large training data sets, it is unclear to what extent they are affected by the unequal abundances of identified proteins. To address this challenge and improve detectability prediction, we present a new algorithm for the iterative learning of peptide detectability from complex mixtures. We provide evidence that the new method approximates detectability with useful accuracy and, based on its design, can be used to interpret the outcome of other learning strategies. We studied the properties of peptides from the bacterium Deinococcus radiodurans and found that at standard quantities, its tryptic peptides can be roughly classified as either detectable or undetectable, with a relatively small fraction having medium detectability. We extend the concept of detectability from peptides to proteins and apply the model to predict the behavior of a replicate LC-MS/MS experiment from a single analysis. Finally, our study summarizes a theoretical framework for peptide/protein identification and label-free quantification.
PMCID: PMC3006185  PMID: 21067214
23.  RAPSearch: a fast protein similarity search tool for short reads 
BMC Bioinformatics  2011;12:159.
Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search, a key step in annotating protein-coding genes in these short reads and identifying their biological functions, faces daunting challenges because of the sheer size of the short read datasets.
We developed a fast protein similarity search tool RAPSearch that utilizes a reduced amino acid alphabet and suffix array to detect seeds of flexible length. For short reads (translated in 6 frames) we tested, RAPSearch achieved ~20-90 times speedup as compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had significant loss of sensitivity as compared to RAPSearch and BLAST.
RAPSearch is implemented as open-source software and is available online. It enables faster protein similarity search. The application of RAPSearch in metagenomics has also been demonstrated.
PMCID: PMC3113943  PMID: 21575167
short reads; similarity search; suffix array; reduced amino acid alphabet; metagenomics
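The seeding idea in the RAPSearch abstract above (matching sequences after mapping them to a reduced amino acid alphabet) can be sketched as follows. Several things here are assumptions made for illustration: the grouping in `GROUPS` is hypothetical and is not the alphabet RAPSearch uses, the seeds are fixed-length k-mers rather than the flexible-length seeds the abstract describes, and a plain dictionary index stands in for the suffix array.

```python
from collections import defaultdict

# Hypothetical grouping of the 20 amino acids into similarity classes.
# This is NOT RAPSearch's actual reduced alphabet.
GROUPS = ["AST", "C", "DEN", "FY", "G", "H", "ILMV", "KQR", "P", "W"]
# Each residue is represented by the first letter of its class.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_seq(seq):
    """Map a protein sequence onto the reduced alphabet ('X' for unknown residues)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq)


def build_index(db, k=4):
    """Index every reduced-alphabet k-mer of every database sequence."""
    index = defaultdict(list)
    for name, seq in db.items():
        reduced = reduce_seq(seq)
        for i in range(len(reduced) - k + 1):
            index[reduced[i:i + k]].append((name, i))
    return index


def find_seeds(query, index, k=4):
    """Return (query_pos, db_name, db_pos) for each exact reduced k-mer match."""
    reduced = reduce_seq(query)
    hits = []
    for i in range(len(reduced) - k + 1):
        for name, j in index.get(reduced[i:i + k], []):
            hits.append((i, name, j))
    return hits


# Toy database and query (invented sequences).
db = {"protA": "MKVLAT", "protB": "GGHWPC"}
index = build_index(db)
seeds = find_seeds("LRIMSS", index)
print(seeds)
```

The query `LRIMSS` shares no identical residue with `MKVLAT`, yet both reduce to the same string under this grouping, so every k-mer seeds a match. That is the point of the reduced alphabet: exact matching on the reduced sequences tolerates conservative substitutions in the original ones.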
24.  On the estimation of false positives in peptide identifications using decoy search strategy 
Proteomics  2009;9(1):194-204.
False positive control/estimate in peptide identifications by mass spectrometry is of critical importance for reliable inference at the protein level and downstream bioinformatics analysis. Approaches based on search against decoy databases have become popular for their conceptual simplicity and easy implementation. Although various decoy search strategies have been proposed, few studies have investigated their difference in performance. With datasets collected on a mixture of model proteins, we demonstrate that a single search against the target database coupled with its reversed version offers a good balance between performance and simplicity. In particular, both the accuracy of the estimate of the number of false positives and the sensitivity are at least comparable to those of the other procedures examined in this study. It is also shown that scrambling while preserving the frequency of amino acid words can potentially improve the accuracy of the false positive estimate, though more studies are needed to investigate the optimal scrambling procedure for specific conditions and the variation of the estimate across repeated scrambling.
PMCID: PMC3076744  PMID: 19053142
Decoy databases; False positive; Mass Spectrometry; Peptides; Sensitivity
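The strategy the abstract above favors (a single search against the target database combined with its reversed version) can be sketched as below. The estimator is one common convention, taking the number of accepted decoy hits as the estimated number of false positives among accepted target hits, and is not necessarily the exact procedure evaluated in the paper. The `REV_` prefix, the function names, the sequences, and the scores are all invented for illustration.

```python
def make_decoys(targets, prefix="REV_"):
    """Build a decoy entry for each target protein by reversing its sequence."""
    return {prefix + name: seq[::-1] for name, seq in targets.items()}


def estimate_fdr(psms, threshold, prefix="REV_"):
    """Estimate false positives among target hits scoring at or above threshold.

    psms: list of (protein_name, score) from one search of the combined database.
    Assumption: each accepted decoy hit stands for one false target hit.
    Returns (target_hits, decoy_hits, estimated_fdr).
    """
    accepted = [p for p in psms if p[1] >= threshold]
    decoy_hits = sum(1 for name, _ in accepted if name.startswith(prefix))
    target_hits = len(accepted) - decoy_hits
    fdr = decoy_hits / target_hits if target_hits else 0.0
    return target_hits, decoy_hits, fdr


# Toy target database (invented sequences) and its combined target+decoy form.
targets = {"P1": "MKTAYIAK", "P2": "GAVLIPFW"}
database = {**targets, **make_decoys(targets)}

# Toy peptide-spectrum matches with invented scores.
psms = [
    ("P1", 4.1), ("P2", 3.7), ("REV_P1", 3.2),
    ("P1", 3.0), ("P2", 2.9), ("REV_P2", 1.5),
]
result = estimate_fdr(psms, threshold=2.5)
print(result)
```

Raising the threshold trades sensitivity (fewer accepted target hits) for a lower estimated false positive rate, which is the balance the abstract evaluates across decoy strategies.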
25.  Proteomic Changes in the Photoreceptor Outer Segment Upon Intense Light Exposure 
Acute light-induced photoreceptor degeneration has been studied in experimental animals as a model for photoreceptor cell loss in human retinal degenerative diseases. Light absorption by rhodopsin in rod photoreceptor outer segments (OS) induces oxidative stress and initiates apoptotic cell death. However, the molecular events that induce oxidative stress and initiate the apoptotic cascade remain poorly understood. To better understand the molecular mechanisms of light-induced photoreceptor cell death, we studied the proteomic changes in OS upon intense light exposure by using a proteolytic 18O labeling method. Of 171 proteins identified, the relative abundance of 98 proteins in light-exposed and unexposed OS was determined. The quantities of 11 proteins were found to differ by more than 2-fold between light-exposed OS and those remaining in darkness. Among the 11 proteins, 8 were phototransduction proteins and 7 of these were altered such that the efficiency of phototransduction would be reduced or quenched during light exposure. In contrast, the amount of OS rhodopsin kinase was reduced by 2-fold after light exposure, suggesting attenuation in the mechanism of quenching phototransduction. Liquid chromatography multiple reaction monitoring (LC-MRM) was performed to confirm this reduction in the quantity of rhodopsin kinase. As revealed by immunofluorescence microscopy, this reduction of rhodopsin kinase is not a result of protein translocation from the outer to the inner segment. Collectively, our findings suggest that the absolute quantity of rhodopsin kinase in rod photoreceptors is reduced upon light stimulation and that this reduction may be a contributing factor to light-induced photoreceptor cell death. This report provides new insights into the proteomic changes in the OS upon intense light exposure and creates a foundation for understanding the mechanisms of light-induced photoreceptor cell death.
PMCID: PMC2818867  PMID: 20020778
photoreceptor; light damage; 18O labeling; mass spectrometry; rhodopsin kinase; phototransduction
