Motivation: Peptide detection is a crucial step in mass spectrometry (MS) based proteomics. Most existing algorithms are based upon greedy isotope template matching and thus may be prone to error propagation and ineffective at detecting overlapping peptides. In addition, existing algorithms usually treat each charge state separately, discarding useful information that could be drawn from the other charge states, which may lead to poor detection of low-abundance peptides.
Results: BPDA2d models spectra as a mixture of candidate peptide signals and systematically evaluates all possible combinations of peptide candidates to interpret the given spectra. For each candidate, BPDA2d takes into account its elution profile, charge state distribution and isotope pattern, and it combines all evidence to infer the candidate's signal and existence probability. By piecing all evidence together, especially by drawing information across charge states, low-abundance peptides can be better identified and peptide detection rates improved. Instead of local template matching, BPDA2d performs global optimization over all candidates and systematically optimizes their signals. Since BPDA2d seeks the best among all possible interpretations of the given spectra, it is capable of handling complex spectra where features overlap. BPDA2d estimates the posterior existence probability of detected peptides, which can be used directly for probability-based evaluation in subsequent processing steps. Our experiments indicate that BPDA2d outperforms state-of-the-art detection methods on both simulated data and real liquid chromatography–mass spectrometry data, according to sensitivity and detection accuracy.
Availability: The BPDA2d software package is available at http://gsp.tamu.edu/Publications/supplementary/sun11a/
Supplementary data are available at Bioinformatics online.
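The combinatorial-evaluation idea can be illustrated with a much-simplified sketch. The Gaussian "isotope" templates, the candidate list, and the penalized least-squares score below are assumptions of this illustration only; BPDA2d itself performs Bayesian inference over elution profiles, charge-state distributions, and isotope patterns rather than penalized least squares.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
mz = np.linspace(500.0, 510.0, 400)

def template(mono_mz, charge, n_peaks=4, width=0.02):
    """Toy isotope pattern: Gaussian peaks spaced 1/charge apart with decaying heights."""
    t = np.zeros_like(mz)
    for k in range(n_peaks):
        t += np.exp(-0.5 * ((mz - (mono_mz + k / charge)) / width) ** 2) / (k + 1)
    return t

# Hypothetical candidate list (mono-isotopic m/z, charge); candidates 0 and 2 are truly present.
candidates = [(502.0, 2), (502.3, 2), (505.0, 3), (507.5, 1)]
spectrum = (template(*candidates[0]) + template(*candidates[2])
            + 0.01 * rng.standard_normal(mz.size))

# Score every subset of candidates globally: least-squares fit plus a small
# complexity penalty so spurious candidates are not kept just to absorb noise.
best_score, best_subset = np.inf, ()
for r in range(len(candidates) + 1):
    for subset in itertools.combinations(range(len(candidates)), r):
        if subset:
            A = np.stack([template(*candidates[i]) for i in subset], axis=1)
            amps, *_ = np.linalg.lstsq(A, spectrum, rcond=None)
            resid = float(np.sum((spectrum - A @ amps) ** 2))
        else:
            resid = float(np.sum(spectrum ** 2))
        score = resid + 0.01 * len(subset)
        if score < best_score:
            best_score, best_subset = score, subset

detected = sorted(best_subset)
print(detected)
```

Even though candidates 0 and 1 overlap heavily, the global search over subsets recovers the true pair, which is the behavior that a greedy, one-template-at-a-time matcher cannot guarantee.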
The yellow fever vaccines (YF-17D-204 and 17DD) are considered to be among the safest vaccines, and the presence of neutralizing antibodies is correlated with protection, although other immune effector mechanisms are known to be involved. T-cell responses are known to play an important role in modulating antibody production and the killing of infected cells. However, little is known about the repertoire of T-cell responses elicited by the YF-17DD vaccine in humans. In this report, a library of 653 partially overlapping 15-mer peptides covering the envelope (Env) and nonstructural (NS) proteins 1 to 5 of the vaccine was utilized to perform a comprehensive analysis of the virus-specific CD4+ and CD8+ T-cell responses. The T-cell responses were screened ex-vivo by IFN-γ ELISPOT assays using blood samples from 220 YF-17DD vaccinees collected two months to four years after immunization. Each peptide was tested in 75 to 208 separate individuals of the cohort. The screening identified sixteen immunodominant antigens that elicited activation of circulating memory T-cells in 10% to 33% of the individuals. Biochemical in-vitro binding assays and immunogenetic and immunogenicity studies indicated that each of the sixteen immunogenic 15-mer peptides contained two or more partially overlapping epitopes that could bind with high affinity to molecules of different HLAs. The prevalence of the immunogenicity of a peptide in the cohort was correlated with the diversity of HLA-II alleles that it could bind. These findings suggest that overlap of HLA-binding motifs within a peptide enhances its T-cell immunogenicity and the prevalence of the response in the population. In summary, the results suggest that in addition to factors of the innate immunity, "promiscuous" T-cell antigens might contribute to the high efficacy of the yellow fever vaccines.
T-cell responses are considered to be very important; however, the role of T-cell responses in vaccine-mediated immunity is still controversial. One reason may be that most studies of human T-cell responses are focused on a few epitopes. We still lack a systematic view of the repertoire of peptides presented by the different HLA class I and II molecules and how the peptides presented by the different HLAs interact within the host to develop T-cell responses. Here we present a study of the T-cell responses against the YF-17DD vaccine in a cohort of 220 volunteers and observe that the most prevalent T-cell responses are targeted at peptides that bind to multiple types of HLA molecules. Based on these results we postulate that promiscuous T-cell epitopes might have a critical role in the development of adaptive immunity. These results may have broader implications for other pathogens, since the yellow fever vaccine is currently being developed as a vaccine vector for other diseases. Therefore, these epitopes might have a functionally cooperative role in boosting specific neutralizing antibody responses. In addition, we propose that promiscuous T-cell antigens may be better immunogens for vaccine development; however, more studies are necessary.
Motivation: In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces.
Results: This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known.
Availability: Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering
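A minimal sketch of bolstered resubstitution with a fixed kernel variance is given below, using a toy two-class Gaussian model and a nearest-mean classifier (both assumptions of this illustration). The paper's contribution, choosing the kernel variance optimally as a function of classification rule, sample size, and dimensionality, is precisely what this sketch does not do: sigma is simply fixed at 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class Gaussian sample (hypothetical model, not the paper's experiments).
n = 20
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(1.5, 1.0, (n, 2))])
y = np.array([0] * n + [1] * n)

# Nearest-mean classifier designed on the sample.
m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
classify = lambda P: (np.linalg.norm(P - m1, axis=1)
                      < np.linalg.norm(P - m0, axis=1)).astype(int)

def bolstered_resub(sigma, mc=2000):
    """Bolstered resubstitution: spread a Gaussian kernel of s.d. `sigma` around
    each training point and average the classifier's error over that kernel mass
    (estimated here by Monte Carlo sampling)."""
    errs = []
    for xi, yi in zip(X, y):
        pts = rng.normal(xi, sigma, (mc, 2))
        errs.append(np.mean(classify(pts) != yi))
    return float(np.mean(errs))

plain_resub = float(np.mean(classify(X) != y))
bolstered = bolstered_resub(sigma=0.5)
print(plain_resub, bolstered)
```

Because the kernels place error mass on the wrong side of the decision boundary for points near it, the bolstered estimate is typically less optimistic than plain resubstitution; how much less depends entirely on sigma, which is why the variance setting is the key issue.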
Mass spectrometry is a complex technique used for large-scale protein profiling with clinical and pharmaceutical applications. While individual components in the system have been studied extensively, little work has been done to integrate various modules and evaluate them from a systems point of view.
In this work, we investigate this problem by putting together the different modules in a typical proteomics workflow, in order to capture and analyze key factors that impact the number of identified peptides and quantified proteins, protein quantification error, differential expression results, and classification performance. The proposed proteomics pipeline model can be used to optimize the workflow as well as to pinpoint critical bottlenecks worth investing time and resources in to improve performance. Using the model-based approach proposed here, one can study systematically the critical problem of proteomic biomarker discovery, by means of simulation using ground-truthed synthetic MS data.
Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular, disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers for a classification rule to be applied to a small labeled data set and for the error of the resulting classifier to be estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
Classification; epistemology; error estimation; genomics; validation.
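The RMS of an error estimator can only be computed once a feature-label distribution is assumed, which is the abstract's central point. The sketch below makes that concrete under a known univariate Gaussian model (an assumption of this illustration): it repeatedly draws small samples, designs a nearest-mean classifier, computes the classifier's exact true error from the model, and measures the RMS deviation of leave-one-out estimates from that true error.

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(2)
Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF

def sample(n):
    """n points per class from N(0,1) and N(1,1) (the assumed model)."""
    X = np.vstack([rng.normal(0.0, 1.0, (n, 1)), rng.normal(1.0, 1.0, (n, 1))])
    y = np.array([0] * n + [1] * n)
    return X, y

def true_error(X, y):
    """Design the midpoint-threshold rule; its exact error follows from the model."""
    t = (X[y == 0].mean() + X[y == 1].mean()) / 2
    return 0.5 * (1.0 - Phi(t - 0.0)) + 0.5 * Phi(t - 1.0)

def loo_estimate(X, y):
    """Leave-one-out error estimate for the same rule."""
    n = len(y)
    errs = 0
    for i in range(n):
        mask = np.arange(n) != i
        t = (X[mask][y[mask] == 0].mean() + X[mask][y[mask] == 1].mean()) / 2
        errs += int(X[i, 0] > t) != y[i]
    return errs / n

diffs = [loo_estimate(X, y) - true_error(X, y)
         for X, y in (sample(10) for _ in range(200))]
rms = float(np.sqrt(np.mean(np.square(diffs))))
print(round(rms, 3))
```

Without the assumed N(0,1)/N(1,1) model there is no `true_error` to compare against, so no RMS can be reported: exactly the situation the abstract criticizes.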
RNA-Seq is the recently developed high-throughput sequencing technology for profiling the entire transcriptome in any organism. It has several major advantages over current hybridization-based approaches such as microarrays. However, the cost per sample by RNA-Seq is still prohibitive for most laboratories. With continued improvement in sequence output, it would be cost-effective if multiple samples were multiplexed and sequenced in a single lane with sufficient transcriptome coverage. The objective of this analysis is to evaluate what sequencing depth might be sufficient to interrogate gene expression profiling in the chicken by RNA-Seq.
Two cDNA libraries from chicken lungs were sequenced initially, and 4.9 million (M) and 1.6 M (60 bp) reads were generated, respectively. With significant improvements in sequencing technology, two technical replicate cDNA libraries were re-sequenced. Totals of 29.6 M and 28.7 M (75 bp) reads were obtained with the two samples. More than 90% of annotated genes were detected in the data sets with 28.7-29.6 M reads, while only 68% of genes were detected in the data set with 1.6 M reads. The correlation coefficients of gene expression between technical replicates within the same sample were 0.9458 and 0.8442. To evaluate the appropriate depth needed for mRNA profiling, a random sampling method was used to generate different numbers of reads from each sample. There was a significant increase in correlation coefficients from a sequencing depth of 1.6 M to 10 M for all genes except highly abundant genes. No significant improvement was observed from the depth of 10 M to 20 M (75 bp) reads.
The analysis from the current study demonstrated that 30 M (75 bp) reads are sufficient to detect all annotated genes in chicken lungs. Ten million (75 bp) reads could detect about 80% of annotated chicken genes, and RNA-Seq at this depth can serve as a replacement for microarray technology. Furthermore, the depth of sequencing had a significant impact on measuring gene expression of low-abundance genes. Finally, the combination of experimental and simulation approaches is a powerful way to address the relationship between the depth of sequencing and transcriptome coverage.
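The random-sampling evaluation can be sketched as follows. The long-tailed abundance model, gene count, and depths below are hypothetical stand-ins for the chicken lung libraries: a deep library is simulated, then subsampled to lower depths, and gene detection and correlation with the full library are compared.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-gene abundances with a long tail, as in real transcriptomes.
n_genes = 5000
abundance = rng.pareto(1.2, n_genes) + 0.05
probs = abundance / abundance.sum()

# Simulated "full" library at 20 M reads.
full_counts = rng.multinomial(20_000_000, probs)

results = {}
for depth in (1_600_000, 10_000_000):
    # Randomly subsample the full library down to the target depth.
    sub = rng.multinomial(depth, full_counts / full_counts.sum())
    detected = float(np.mean(sub > 0))       # fraction of genes with >= 1 read
    r = float(np.corrcoef(np.log1p(sub), np.log1p(full_counts))[0, 1])
    results[depth] = (detected, r)
    print(depth, round(detected, 3), round(r, 3))
```

As in the study, both the fraction of detected genes and the correlation with the full-depth profile rise with depth, and the genes lost at low depth are the low-abundance ones in the tail of the distribution.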
From September 2005 to March 2007, 238 individuals being vaccinated for the first time with the yellow fever (YF) 17DD vaccine were enrolled in a cohort established in Recife, Brazil. A prospective study indicated that, after immunization, anti-YF immunoglobulin M (IgM) and anti-YF IgG were present in 70.6% (IgM) and 98.3% (IgG) of the vaccinated subjects. All vaccinees developed protective immunity, which was detected by the plaque reduction neutralization test (PRNT), with a geometric mean titer of 892. Of the 238 individuals, 86.6% had IgG antibodies to dengue virus; however, the presence of anti-dengue IgG did not interfere significantly with the development of anti-YF neutralizing antibodies. In a separate retrospective study of individuals immunized with the 17DD vaccine, the PRNT values at 5 and 10 years post-vaccination remained positive but showed a significant decrease in neutralization titer (25% with PRNT titers < 100 after 5 years and 35% after 10 years).
Mass spectrometry (MS) is an essential analytical tool in proteomics. Many existing algorithms for peptide detection are based on isotope template matching and usually work at each charge state separately, making them ineffective at detecting overlapping peptides and low-abundance peptides.
We present BPDA, a Bayesian approach for peptide detection in data produced by MS instruments with high enough resolution to baseline-resolve isotopic peaks, such as MALDI-TOF and LC-MS. We model the spectra as a mixture of candidate peptide signals, and the model is parameterized by MS physical properties. BPDA is based on a rigorous statistical framework and avoids problems, such as voting and ad-hoc thresholding, generally encountered in algorithms based on template matching. It systematically evaluates all possible combinations of possible peptide candidates to interpret a given spectrum, and iteratively finds the best-fitting peptide signal in order to minimize the mean squared error between the inferred spectrum and the observed spectrum. In contrast to previous detection methods, BPDA performs deisotoping and deconvolution of mass spectra simultaneously, which enables better identification of weak peptide signals and produces higher sensitivities and more robust results. Unlike template-matching algorithms, BPDA can handle complex data where features overlap. Our experimental results indicate that BPDA performs well on simulated data and real MS data sets, for various resolutions and signal-to-noise ratios, and compares very favorably with commonly used commercial and open-source software, such as flexAnalysis, OpenMS, and Decon2LS, according to sensitivity and detection accuracy.
Unlike previous detection methods, which only employ isotopic distributions and work at each single charge state alone, BPDA takes into account the charge state distribution as well, thus lending information to better identify weak peptide signals and produce more robust results. The proposed approach is based on a rigorous statistical framework, which avoids problems generally encountered in algorithms based on template matching. Our experiments indicate that BPDA performs well on both simulated data and real data, and compares very favorably with commonly used commercial and open-source software. The BPDA software can be downloaded from http://gsp.tamu.edu/Publications/supplementary/sun10a/bpda.
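The "iteratively find the best-fitting signal to minimize the MSE" step can be caricatured as coordinate descent on non-negative candidate amplitudes. The peak shapes, candidate signals, and synthetic spectrum below are assumptions of this sketch; BPDA itself infers these quantities within a Bayesian model rather than by plain least squares.

```python
import numpy as np

mz = np.linspace(0.0, 10.0, 500)
peak = lambda c: np.exp(-0.5 * ((mz - c) / 0.05) ** 2)

# Hypothetical candidate signals; 0 and 1 overlap (both contain a peak at 3.5).
T = np.stack([peak(3.0) + 0.5 * peak(3.5),
              peak(3.5) + 0.5 * peak(4.0),
              peak(7.0)])
true_amp = np.array([2.0, 0.0, 1.0])
obs = true_amp @ T + 0.01 * np.random.default_rng(4).standard_normal(mz.size)

# Coordinate descent: repeatedly refit each candidate's amplitude (>= 0)
# against the residual left by the others, driving down the squared error.
amp = np.zeros(3)
for sweep in range(50):
    for j in range(3):
        resid = obs - amp @ T + amp[j] * T[j]    # spectrum minus the other candidates
        amp[j] = max(0.0, float(resid @ T[j]) / float(T[j] @ T[j]))

mse = float(np.mean((obs - amp @ T) ** 2))
print(np.round(amp, 2), round(mse, 4))
```

Fitting all candidates jointly lets the overlapping candidate 1 be driven to zero once candidate 0 explains the shared peak, whereas matching templates one at a time against the raw spectrum could credit both.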
Discrete classification is common in Genomic Signal Processing applications, in particular in classification of discretized gene expression data, and in discrete gene expression prediction and the inference of boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must then be employed to obtain reliable estimates of the classification error based on the available data. Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of small-sample data sets in such applications. This paper presents a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics, focusing on the study of performance in small sample scenarios, as well as asymptotic behavior.
Genomics; classification; error estimation; discrete histogram rule; sampling distribution; resubstitution; leave-one-out; ensemble methods; coefficient of determination.
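The discrete histogram rule and the two error estimators named in the keywords can be shown in a few lines. The 8-bin model and sample size below are hypothetical; the resubstitution-versus-leave-one-out gap they exhibit is the small-sample behavior the review studies.

```python
import random
from collections import Counter

random.seed(5)

# Hypothetical discrete model: feature X in {0,...,7}, P(Y=1|X) varying per bin.
bins = 8
p1 = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

def draw(n):
    data = []
    for _ in range(n):
        x = random.randrange(bins)
        data.append((x, 1 if random.random() < p1[x] else 0))
    return data

def histogram_rule(data):
    """Majority vote within each bin; ties and empty bins predict class 0."""
    c = Counter()
    for x, y in data:
        c[x] += 1 if y == 1 else -1
    return lambda x: 1 if c[x] > 0 else 0

def resub(data):
    f = histogram_rule(data)
    return sum(f(x) != y for x, y in data) / len(data)

def loo(data):
    return sum(histogram_rule(data[:i] + data[i + 1:])(data[i][0]) != data[i][1]
               for i in range(len(data))) / len(data)

data = draw(30)
resub_err, loo_err = resub(data), loo(data)
print(resub_err, loo_err)
```

For the histogram rule, deleting a point can only move its bin's vote against that point's own label, so leave-one-out is never smaller than resubstitution here: the optimistic bias of resubstitution versus the pessimism of leave-one-out in miniature.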
Many missing-value (MV) imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between MV imputation and classification accuracy. Furthermore, these studies are problematic in fundamental steps such as MV generation and classifier error estimation. In this work, we carry out a model-based study that addresses some of the issues in previous studies. Six popular imputation algorithms, two feature selection methods, and three classification rules are considered. The results suggest that it is beneficial to apply MV imputation when the noise level is high, variance is small, or gene-cluster correlation is strong, under small to moderate MV rates. In these cases, if data quality metrics are available, then it may be helpful to consider data points with poor quality as missing and apply one of the most robust imputation algorithms to estimate the true signal based on the available high-quality data points. However, at large MV rates, we conclude that imputation methods are not recommended. Regarding the MV rate, our results indicate the presence of a peaking phenomenon: performance of imputation methods actually improves initially as the MV rate increases, but after an optimum point, performance quickly deteriorates with increasing MV rates.
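The role of gene-cluster correlation in imputation quality can be illustrated with a small synthetic experiment. The latent-profile data model, the 20% MV rate, and the two simple imputers below are assumptions of this sketch (the study itself compares six published algorithms); row-mean imputation stands in for methods that exploit correlated genes.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical correlated "gene cluster": 50 samples x 10 genes sharing one latent profile.
latent = rng.normal(0.0, 1.0, (50, 1))
X = latent + 0.3 * rng.normal(0.0, 1.0, (50, 10))

# Knock out entries completely at random at a 20% MV rate.
M = rng.random(X.shape) < 0.2
Xm = X.copy()
Xm[M] = np.nan

def col_mean_impute(Xm):
    """Ignores correlation: fill each hole with its gene's (column) mean."""
    out, means = Xm.copy(), np.nanmean(Xm, axis=0)
    idx = np.where(np.isnan(out))
    out[idx] = means[idx[1]]
    return out

def row_mean_impute(Xm):
    """Exploits gene-cluster correlation: fill with the sample's (row) mean."""
    out, means = Xm.copy(), np.nanmean(Xm, axis=1)
    idx = np.where(np.isnan(out))
    out[idx] = means[idx[0]]
    return out

rmse_col = float(np.sqrt(np.mean((col_mean_impute(Xm)[M] - X[M]) ** 2)))
rmse_row = float(np.sqrt(np.mean((row_mean_impute(Xm)[M] - X[M]) ** 2)))
print(round(rmse_col, 3), round(rmse_row, 3))
```

With strong cluster correlation the row-based imputer recovers the missing values far more accurately, consistent with the abstract's finding that imputation pays off when gene-cluster correlation is strong.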
We report the detailed development of biomarkers to predict the clinical outcome under dengue infection. Transcriptional signatures from purified peripheral blood mononuclear cells were derived from whole-genome gene-expression microarray data, validated by quantitative PCR and tested in independent samples.
The study was performed on patients of a well-characterized dengue cohort from Recife, Brazil. The samples analyzed were collected prospectively from acute febrile dengue patients who evolved with different degrees of disease severity: classic dengue fever or dengue hemorrhagic fever (DHF) samples were compared with similar samples from other non-dengue febrile illnesses. The DHF samples were collected 2–3 days before the presentation of the plasma leakage symptoms. Differentially-expressed genes were selected by univariate statistical tests as well as multivariate classification techniques. The results showed that at early stages of dengue infection, the genes involved in effector mechanisms of the innate immune response showed weaker activation in patients who later developed hemorrhagic fever, whereas the genes involved in apoptosis were expressed at higher levels.
Some of the gene expression signatures displayed estimated accuracy rates of more than 95%, indicating that expression profiling with these signatures may provide a useful means of DHF prognosis at early stages of infection.
Nanomaterials are being manufactured on a commercial scale for use in medical, diagnostic, energy, component and communications industries. However, concerns over the safety of engineered nanomaterials have surfaced. Humans can be exposed to nanomaterials in different ways such as inhalation or exposure through the integumentary system.
The interactions of engineered nanomaterials with primary human cells were investigated using a systems biology approach combining gene expression microarray profiling with dynamic experimental parameters. In this experiment, primary human epidermal keratinocytes were exposed to several low-micron to nano-scale materials, and gene expression was profiled over both time and dose to compile a comprehensive picture of nanomaterial-cellular interactions. Very few gene-expression studies so far have dealt with both time and dose response simultaneously. Here, we propose different approaches to this kind of analysis. First, we used heat maps and multi-dimensional scaling (MDS) plots to visualize the dose response of nanomaterials over time. Then, to identify the most common patterns in gene-expression profiles, we used self-organizing maps (SOM) combined with two different criteria to determine the number of clusters. The consistency of the SOM results is discussed in the context of the information derived from the MDS plots. Finally, to identify the genes that respond significantly differently across dose levels of each treatment while simultaneously accounting for the effect of time, we used a two-way ANOVA model, in connection with Tukey's additivity test and the Box-Cox transformation. The results are discussed in the context of the cellular responses to engineered nanomaterials.
The analysis presented here led to interesting and complementary conclusions about the response across time of human epidermal keratinocytes after exposure to nanomaterials. For example, we observed that gene expression for most treatments becomes closer to that of the baseline cultures as time proceeds. The genes found to be differentially-expressed are involved in a number of cellular processes, including regulation of transcription and translation, protein localization, transport, cell cycle progression, cell migration, cytoskeletal reorganization, signal transduction, and development.
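The two-way ANOVA decomposition for a single gene over a dose-by-time design can be sketched as follows. The grid size, effect sizes, and noise level are hypothetical, and the sketch assumes the additive model holds (the role of Tukey's additivity test and the Box-Cox transformation in the study), with one observation per cell.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical expression of one gene over 4 doses x 5 time points (one obs per cell),
# generated from an additive dose + time model plus noise.
doses, times = 4, 5
dose_effect = np.array([0.0, 0.5, 1.0, 1.5])[:, None]
time_effect = np.array([0.0, 0.2, 0.4, 0.2, 0.0])[None, :]
Y = 5.0 + dose_effect + time_effect + 0.1 * rng.standard_normal((doses, times))

# Two-way additive ANOVA: partition the total variation into dose, time, residual.
grand = Y.mean()
ss_dose = times * float(np.sum((Y.mean(axis=1) - grand) ** 2))
ss_time = doses * float(np.sum((Y.mean(axis=0) - grand) ** 2))
ss_resid = float(np.sum((Y - Y.mean(axis=1, keepdims=True)
                           - Y.mean(axis=0, keepdims=True) + grand) ** 2))

df_dose, df_time = doses - 1, times - 1
df_resid = df_dose * df_time
F_dose = (ss_dose / df_dose) / (ss_resid / df_resid)   # dose effect, time accounted for
F_time = (ss_time / df_time) / (ss_resid / df_resid)
print(round(F_dose, 1), round(F_time, 1))
```

Genes whose F statistic for dose exceeds the appropriate F critical value would be flagged as dose-responsive while the time effect is absorbed by its own term rather than inflating the residual.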
The complement system, a key component that links the innate and adaptive immune responses, has three pathways: the classical, lectin, and alternative pathways. In the present study, we have analyzed the levels of various complement components in blood samples from dengue fever (DF) and dengue hemorrhagic fever (DHF) patients and found that the level of complement activation is associated with disease severity.
Methods and Results
Patients with DHF had lower levels of complement factor 3 (C3; p = 0.002) and increased levels of C3a, C4a and C5a (p<0.0001) when compared to those with the less severe form, DF. There were no significant differences between DF and DHF patients in the levels of C1q, immunocomplexes (CIC-CIq) and CRP. However, small but statistically significant differences were detected in the levels of MBL. In contrast, the levels of two regulatory proteins of the alternative pathway varied widely between DF and DHF patients: DHF patients had higher levels of factor D (p = 0.01), which cleaves factor B to yield the active (C3bBb) C3 convertase, and lower levels of factor H (p = 0.03), which inactivates the (C3bBb) C3 convertase, than did DF patients. When we considered the levels of factors D and H together as an indicator of (C3bBb) C3 convertase regulation, we found that the plasma levels of these regulatory proteins in DHF patients favored the formation of the (C3bBb) C3 convertase, whereas its formation was inhibited in DF patients (p<0.0001).
The data suggest that an imbalance in the levels of regulatory factors D and H is associated with an abnormal regulation of complement activity in DHF patients.
Dengue virus infection causes a wide spectrum of illness, ranging from sub-clinical to severe disease. Severe dengue is associated with sequential viral infections. A strict definition of primary versus secondary dengue infections requires a combination of several tests performed at different stages of the disease, which is not practical.
Methods and Findings
We developed a simple method to classify dengue infections as primary or secondary based on the levels of dengue-specific IgG. A group of 109 dengue infection patients were classified as having primary or secondary dengue infection on the basis of a strict combination of results from assays of antigen-specific IgM and IgG, isolation of virus and detection of the viral genome by PCR tests performed on multiple samples, collected from each patient over a period of 30 days. The dengue-specific IgG levels of all samples from 59 of the patients were analyzed by linear discriminant analysis (LDA), and one- and two-dimensional classifiers were designed. The one-dimensional classifier was estimated by bolstered resubstitution error estimation to have 75.1% sensitivity and 92.5% specificity. The two-dimensional classifier was designed by also taking into consideration the number of days after the onset of symptoms, with an estimated sensitivity and specificity of 91.64% and 92.46%, respectively. The performance of the two-dimensional classifier was validated using an independent test set of standard samples from the remaining 50 patients. The classifications of the independent set of samples determined by the two-dimensional classifier were further validated by comparison with two other dengue classification methods: the hemagglutination inhibition (HI) assay and an in-house anti-dengue IgG-capture ELISA method. The decisions made with the two-dimensional classifier were in 100% accordance with the HI assay and 96% with the in-house ELISA.
Once acute dengue infection has been determined, a 2-D classifier based on common dengue virus IgG kits can reliably distinguish primary and secondary dengue infections. Software for calculation and validation of the 2-D classifier is made available for download.
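A two-dimensional LDA classifier of the kind described can be sketched on synthetic stand-in data. The feature distributions below (IgG level and days after symptom onset, with secondary infections showing higher IgG) are invented for illustration and are not the study's samples, and plain resubstitution is used here instead of the bolstered estimator.

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic stand-in data: columns are [IgG level, days after symptom onset]
# in hypothetical units; secondary infections tend to show higher IgG.
n = 60
primary = np.column_stack([rng.normal(1.0, 0.6, n), rng.uniform(1, 10, n)])
secondary = np.column_stack([rng.normal(3.5, 0.6, n), rng.uniform(1, 10, n)])
X = np.vstack([primary, secondary])
y = np.array([0] * n + [1] * n)   # 0 = primary, 1 = secondary

# Linear discriminant analysis with a pooled covariance estimate.
m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
S = (np.cov(X[y == 0].T) + np.cov(X[y == 1].T)) / 2
w = np.linalg.solve(S, m1 - m0)          # discriminant direction
c = float(w @ (m0 + m1) / 2)             # midpoint decision threshold
pred = (X @ w > c).astype(int)

sens = float(np.mean(pred[y == 1] == 1))  # sensitivity on secondary cases
spec = float(np.mean(pred[y == 0] == 0))  # specificity on primary cases
print(sens, spec)
```

Because LDA yields a single linear boundary in the (IgG, day) plane, the day feature lets the IgG cutoff effectively shift with time since onset, which is what made the study's 2-D classifier outperform the 1-D one.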
The development of DNA microarray technology a decade ago led to the establishment of functional genomics as one of the most active and successful scientific disciplines today. With the ongoing development of immunomic microarray technology—a spatially addressable, large-scale technology for measurement of specific immunological response—the new challenge of functional immunomics is emerging, which bears similarities to but is also significantly different from functional genomics. Immunomic data have been successfully used to identify biological markers involved in autoimmune diseases, allergies, viral infections such as human immunodeficiency virus (HIV), influenza, diabetes, and responses to cancer vaccines. This review intends to provide a coherent vision of this nascent scientific field, and speculate on future research directions. We discuss at some length issues such as epitope prediction, immunomic microarray technology and its applications, and computation and statistical challenges related to functional immunomics. Based on the recent discovery of regulation mechanisms in T cell responses, we envision the use of immunomic microarrays as a tool for advances in systems biology of cellular immune responses, by means of immunomic regulatory network models.