
Results 1-8 (8)

1.  Bayesian estimation of the discrete coefficient of determination 
The discrete coefficient of determination (CoD) measures the nonlinear interaction between discrete predictor and target variables and has had far-reaching applications in Genomic Signal Processing. Previous work has addressed the inference of the discrete CoD using classical parametric and nonparametric approaches. In this paper, we introduce a Bayesian framework for the inference of the discrete CoD. We derive analytically the optimal minimum mean-square error (MMSE) CoD estimator, as well as a CoD estimator based on the Optimal Bayesian Predictor (OBP). For the latter estimator, exact expressions for its bias, variance, and root-mean-square (RMS) are given. The accuracy of both Bayesian CoD estimators with non-informative and informative priors, under fixed or random parameters, is studied via analytical and numerical approaches. We also demonstrate the application of the proposed Bayesian approach in the inference of gene regulatory networks, using gene-expression data from a previously published study on metastatic melanoma.
PMCID: PMC4715135  PMID: 26807133
Discrete coefficient of determination; Bayesian inference; Gene regulatory network inference
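For orientation, the discrete CoD is commonly defined as the relative reduction in prediction error, CoD = (e0 - e) / e0, where e0 is the error of the best predictor of the target without the predictors and e is the error of the best predictor with them. The sketch below is a classical plug-in (resubstitution) estimate from sample counts, not the Bayesian MMSE or OBP-based estimators derived in the paper; the function name `plugin_cod` and the zero-error convention are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def plugin_cod(pairs):
    """Plug-in estimate of the discrete CoD from a list of (x, y) samples."""
    n = len(pairs)
    y_counts = Counter(y for _, y in pairs)
    # Error of the best constant predictor of y (ignores x).
    eps0 = 1 - max(y_counts.values()) / n
    if eps0 == 0:
        return 0.0  # assumed convention when y is constant in the sample
    by_x = defaultdict(Counter)
    for x, y in pairs:
        by_x[x][y] += 1
    # Error of the best predictor of y given x: majority vote within each x bin.
    eps = sum(sum(c.values()) - max(c.values()) for c in by_x.values()) / n
    return (eps0 - eps) / eps0

# XOR target: neither the constant predictor nor a linear view helps,
# but the pair (x1, x2) predicts y perfectly.
xor_sample = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(plugin_cod(xor_sample))
```

With small samples this plug-in estimate tends to be optimistic, which is part of the motivation for more careful estimators.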
2.  From Functional Genomics to Functional Immunomics: New Challenges, Old Problems, Big Rewards  
PLoS Computational Biology  2006;2(7):e81.
The development of DNA microarray technology a decade ago led to the establishment of functional genomics as one of the most active and successful scientific disciplines today. With the ongoing development of immunomic microarray technology (a spatially addressable, large-scale technology for measurement of specific immunological response), the new challenge of functional immunomics is emerging, which bears similarities to but is also significantly different from functional genomics. Immunomic data has been successfully used to identify biological markers involved in autoimmune diseases, allergies, viral infections such as human immunodeficiency virus (HIV), influenza, diabetes, and responses to cancer vaccines. This review intends to provide a coherent vision of this nascent scientific field, and speculate on future research directions. We discuss at some length issues such as epitope prediction, immunomic microarray technology and its applications, and computational and statistical challenges related to functional immunomics. Based on the recent discovery of regulation mechanisms in T cell responses, we envision the use of immunomic microarrays as a tool for advances in systems biology of cellular immune responses, by means of immunomic regulatory network models.
PMCID: PMC1523295  PMID: 16863395
3.  Cross-validation under separate sampling: strong bias and how to correct it 
Bioinformatics  2014;30(23):3349-3355.
Motivation: It is commonly assumed in pattern recognition that cross-validation error estimation is ‘almost unbiased’ as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics.
Results: We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an ‘almost unbiased’ theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used.
Availability and implementation: The source code in C++, along with the Supplementary Materials, is available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4296143  PMID: 25123902
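The dependence of the bias on the gap between sampling ratios and true population probabilities can be illustrated with a small sketch. The abstract does not spell out the proposed estimator, so the prior-weighted combination below is an assumption made for illustration: class-specific held-out error rates are weighted by an assumed known prior `c0` instead of by the sample proportions. The function name and arguments are hypothetical.

```python
def weighted_error(errors0, n0, errors1, n1, c0=None):
    """Combine class-specific held-out error counts into one error estimate.

    errors0 / n0: misclassified and total held-out points from class 0
    (likewise errors1 / n1 for class 1).
    c0: assumed prior probability of class 0. If None, the sample ratio
    n0 / (n0 + n1) is used, which reproduces the pooled estimate of
    classical cross-validation.
    """
    if c0 is None:
        c0 = n0 / (n0 + n1)
    return c0 * errors0 / n0 + (1 - c0) * errors1 / n1

# Balanced separate sample (50 + 50), but class 0 assumed to make up 90%
# of the population: the pooled and prior-weighted estimates differ widely.
pooled = weighted_error(2, 50, 20, 50)
reweighted = weighted_error(2, 50, 20, 50, c0=0.9)
print(pooled, reweighted)
```

When the sampling ratio equals the true prior, the two estimates coincide, consistent with the abstract's statement that the problem is specific to separate sampling.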
4.  High-dimensional bolstered error estimation 
Bioinformatics  2011;27(21):3056-3064.
Motivation: In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces.
Results: This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known.
Availability: Companion website at
PMCID: PMC3198579  PMID: 21914630
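To make the role of the kernel variance concrete, here is a minimal one-dimensional sketch of bolstered resubstitution with a Gaussian kernel. It is a toy case chosen for illustration, not the high-dimensional procedure of the article: the threshold classifier, the function names, and the free parameter `sigma` are all assumptions, and the article's contribution is precisely how to choose that variance.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bolstered_resub(points, threshold, sigma):
    """Bolstered resubstitution error for 'predict 1 if x > threshold'.

    points: list of (x, label) pairs with label in {0, 1}.
    sigma: standard deviation of the Gaussian bolstering kernel (> 0).
    """
    total = 0.0
    for x, label in points:
        # Kernel mass centred at x that falls on the class-0 side.
        mass_below = normal_cdf((threshold - x) / sigma)
        # Misclassified mass: above the threshold for class 0,
        # below it for class 1.
        total += (1.0 - mass_below) if label == 0 else mass_below
    return total / len(points)

sample = [(-1.0, 0), (1.0, 1)]
print(bolstered_resub(sample, threshold=0.0, sigma=1.0))
print(bolstered_resub(sample, threshold=0.0, sigma=1e-6))
```

As `sigma` shrinks toward zero the estimate approaches plain resubstitution; larger values spread each point's mass across the decision boundary and raise the estimate.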
5.  The Illusion of Distribution-Free Small-Sample Classification in Genomics 
Current Genomics  2011;12(5):333-341.
Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular, disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers to have a classification rule applied to a small labeled data set and the error of the resulting classifier be estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
PMCID: PMC3145263  PMID: 22294876
Classification; epistemology; error estimation; genomics; validation.
6.  Evaluation of the coverage and depth of transcriptome by RNA-Seq in chickens 
BMC Bioinformatics  2011;12(Suppl 10):S5.
RNA-Seq is a recently developed high-throughput sequencing technology for profiling the entire transcriptome in any organism. It has several major advantages over current hybridization-based approaches such as microarrays. However, the cost per sample by RNA-Seq is still prohibitive for most laboratories. With continued improvement in sequence output, it would be cost-effective if multiple samples are multiplexed and sequenced in a single lane with sufficient transcriptome coverage. The objective of this analysis is to evaluate what sequencing depth might be sufficient to interrogate gene expression profiling in the chicken by RNA-Seq.
Two cDNA libraries from chicken lungs were sequenced initially, and 4.9 million (M) and 1.6 M (60 bp) reads were generated, respectively. With significant improvements in sequencing technology, two technical replicate cDNA libraries were re-sequenced. Totals of 29.6 M and 28.7 M (75 bp) reads were obtained with the two samples. More than 90% of annotated genes were detected in the data sets with 28.7-29.6 M reads, while only 68% of genes were detected in the data set with 1.6 M reads. The correlation coefficients of gene expression between technical replicates within the same sample were 0.9458 and 0.8442. To evaluate the appropriate depth needed for mRNA profiling, a random sampling method was used to generate different numbers of reads from each sample. There was a significant increase in correlation coefficients from a sequencing depth of 1.6 M to 10 M for all genes except highly abundant genes. No significant improvement was observed from the depth of 10 M to 20 M (75 bp) reads.
The analysis from the current study demonstrated that 30 M (75 bp) reads are sufficient to detect all annotated genes in chicken lungs. Ten million (75 bp) reads could detect about 80% of annotated chicken genes, and RNA-Seq at this depth can serve as a replacement for microarray technology. Furthermore, the depth of sequencing had a significant impact on measuring gene expression of low-abundance genes. Finally, the combination of experimental and simulation approaches is a powerful way to address the relationship between the depth of sequencing and transcriptome coverage.
PMCID: PMC3236848  PMID: 22165852
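The random sampling step described in the abstract can be sketched as drawing a fixed number of reads without replacement and counting how many annotated genes are still detected. The input format (one gene identifier per mapped read), the function name, and the "at least one read" detection criterion are assumptions made for illustration; the abstract does not state the criterion the authors used.

```python
import random

def detected_fraction(read_genes, depth, n_genes, seed=0):
    """Fraction of annotated genes hit by at least one read after subsampling.

    read_genes: list with one gene identifier per mapped read
    (hypothetical input format).
    depth: number of reads to draw without replacement.
    n_genes: total number of annotated genes.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    subsample = rng.sample(read_genes, depth)
    return len(set(subsample)) / n_genes

# Toy data: 8 reads over 2 of 4 annotated genes.
reads = ["geneA"] * 5 + ["geneB"] * 3
for depth in (1, 4, 8):
    print(depth, detected_fraction(reads, depth, n_genes=4))
```

Repeating the draw over several seeds and depths gives a saturation curve of detected genes against sequencing depth.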
7.  Description of a Prospective 17DD Yellow Fever Vaccine Cohort in Recife, Brazil 
From September 2005 to March 2007, 238 individuals being vaccinated for the first time with the yellow fever (YF)-17DD vaccine were enrolled in a cohort established in Recife, Brazil. A prospective study indicated that, after immunization, anti-YF immunoglobulin M (IgM) and anti-YF IgG were present in 70.6% (IgM) and 98.3% (IgG) of the vaccinated subjects. All vaccinees developed protective immunity, which was detected by the plaque reduction neutralization test (PRNT) with a geometric mean titer of 892. Of the 238 individuals, 86.6% had IgG antibodies to dengue virus; however, the presence of anti-dengue IgG did not interfere significantly with the development of anti-YF neutralizing antibodies. In a separate retrospective study of individuals immunized with the 17DD vaccine, the PRNT values at 5 and 10 years post-vaccination remained positive but showed a significant decrease in neutralization titer (25% with PRNT titers < 100 after 5 years and 35% after 10 years).
PMCID: PMC3183786  PMID: 21976581
8.  Classification and Error Estimation for Discrete Data 
Current Genomics  2009;10(7):446-462.
Discrete classification is common in Genomic Signal Processing applications, in particular in classification of discretized gene expression data, and in discrete gene expression prediction and the inference of Boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must then be employed to obtain reliable estimates of the classification error based on the available data. Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of small-sample data sets in such applications. This paper presents a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics, focusing on the study of performance in small-sample scenarios, as well as asymptotic behavior.
PMCID: PMC2808673  PMID: 20436873
Genomics; classification; error estimation; discrete histogram rule; sampling distribution; resubstitution; leave-one-out; ensemble methods; coefficient of determination.
