Search tips
Search criteria

Results 1-5 (5)

Clipboard (0)
Year of Publication
Document Types
1.  Adjusting confounders in ranking biomarkers: a model-based ROC approach 
Briefings in Bioinformatics  2012;13(5):513-523.
High-throughput studies have been extensively conducted in the research of complex human diseases. As a representative example, consider gene-expression studies where thousands of genes are profiled at the same time. An important objective of such studies is to rank the diagnostic accuracy of biomarkers (e.g. gene expressions) for predicting outcome variables while properly adjusting for confounding effects from low-dimensional clinical risk factors and environmental exposures. Existing approaches are often fully based on parametric or semi-parametric models and target evaluating estimation significance as opposed to diagnostic accuracy. Receiver operating characteristic (ROC) approaches can be employed to tackle this problem. However, existing ROC ranking methods focus on biomarkers only and ignore effects of confounders. In this article, we propose a model-based approach which ranks the diagnostic accuracy of biomarkers using ROC measures with a proper adjustment of confounding effects. To this end, three different methods for constructing the underlying regression models are investigated. Simulation study shows that the proposed methods can accurately identify biomarkers with additional diagnostic power beyond confounders. Analysis of two cancer gene-expression studies demonstrates that adjusting for confounders can lead to substantially different rankings of genes.
PMCID: PMC3431720  PMID: 22396461
ranking biomarkers; ROC; confounders; high-throughput data
2.  Principal component analysis based methods in bioinformatics studies 
Briefings in Bioinformatics  2011;12(6):714-722.
In analysis of bioinformatics data, a unique challenge arises from the high dimensionality of measurements. Without loss of generality, we use genomic study with gene expression measurements as a representative example but note that analysis techniques discussed in this article are also applicable to other types of bioinformatics studies. Principal component analysis (PCA) is a classic dimension reduction approach. It constructs linear combinations of gene expressions, called principal components (PCs). The PCs are orthogonal to each other, can effectively explain variation of gene expressions, and may have a much lower dimensionality. PCA is computationally simple and can be realized using many existing software packages. This article consists of the following parts. First, we review the standard PCA technique and their applications in bioinformatics data analysis. Second, we describe recent ‘non-standard’ applications of PCA, including accommodating interactions among genes, pathways and network modules and conducting PCA with estimating equations as opposed to gene expressions. Third, we introduce several recently proposed PCA-based techniques, including the supervised PCA, sparse PCA and functional PCA. The supervised PCA and sparse PCA have been shown to have better empirical performance than the standard PCA. The functional PCA can analyze time-course gene expression data. Last, we raise the awareness of several critical but unsolved problems related to PCA. The goal of this article is to make bioinformatics researchers aware of the PCA technique and more importantly its most recent development, so that this simple yet effective dimension reduction technique can be better employed in bioinformatics data analysis.
PMCID: PMC3220871  PMID: 21242203
principal component analysis; dimension reduction; bioinformatics methodologies; gene expression
3.  Ranking prognosis markers in cancer genomic studies 
Briefings in Bioinformatics  2010;12(1):33-40.
In cancer research, high-throughput genomic studies have been extensively conducted, searching for markers associated with cancer diagnosis, prognosis and variation in response to treatment. In this article, we analyze cancer prognosis studies and investigate ranking markers based on their marginal prognosis power. To avoid ambiguity, we focus on microarray gene expression studies where genes are the markers, but note that the methodology and results are applicable to other high-throughput studies. The objectives of this study are 2-fold. First, we investigate ranking markers under three commonly adopted semiparametric models, namely the Cox, accelerated failure time and additive risk models. Data analysis shows that the ranking may vary significantly under different models. Second, we describe a nonparametric concordance measure, which has roots in the time-dependent ROC (receiver operating characteristic) framework and relies on much weaker assumptions than the semiparametric models. In simulation, it is shown that ranking using the concordance measure is not sensitive to model specification whereas ranking under the semiparametric models is. In data analysis, the concordance measure generates rankings significantly different from those under the semiparametric models.
PMCID: PMC3030811  PMID: 21087949
cancer prognosis markers; semiparametric survival analysis; concordance measure
4.  Semiparametric prognosis models in genomic studies 
Briefings in Bioinformatics  2010;11(4):385-393.
Development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses are focused on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulation and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
PMCID: PMC2905523  PMID: 20123942
genomic studies; semiparametric prognosis models; model comparison
5.  Penalized feature selection and classification in bioinformatics 
Briefings in Bioinformatics  2008;9(5):392-403.
In bioinformatics studies, supervised classification with high-dimensional input variables is frequently encountered. Examples routinely arise in genomic, epigenetic and proteomic studies. Feature selection can be employed along with classifier construction to avoid over-fitting, to generate more reliable classifier and to provide more insights into the underlying causal relationships. In this article, we provide a review of several recently developed penalized feature selection and classification techniques—which belong to the family of embedded feature selection methods—for bioinformatics studies with high-dimensional input. Classification objective functions, penalty functions and computational algorithms are discussed. Our goal is to make interested researchers aware of these feature selection and classification methods that are applicable to high-dimensional bioinformatics data.
PMCID: PMC2733190  PMID: 18562478
bioinformatics application; feature selection; penalization

Results 1-5 (5)