
Results 1-7 (7)

1.  Model Selection Emphasises the Importance of Non-Chromosomal Information in Genetic Studies 
PLoS ONE  2015;10(1):e0117014.
Ever since the case of the missing heritability was highlighted some years ago, scientists have been investigating various possible explanations for the issue. However, none of these explanations includes non-chromosomal genetic information. Here we describe explicitly how chromosomal and non-chromosomal modifiers collectively influence the heritability of a trait, in this case, the growth rate of yeast. Our results show that the non-chromosomal contribution can be large, adding another dimension to the estimation of heritability. By combining the strength of LASSO with model selection, we also found that the interaction of chromosomal and non-chromosomal information is essential in describing phenotypes.
PMCID: PMC4308103  PMID: 25626013
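The LASSO step named in the abstract above can be illustrated with a minimal sketch. This is not the authors' pipeline: the cyclic coordinate-descent solver, the -1/+1 coding of the two factors, and the toy response are all assumptions made for illustration, and the model-selection step is not shown. The design has one column for a chromosomal marker, one for a non-chromosomal factor, and one for their interaction.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (the LASSO proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual once beta_j is removed.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy balanced 2x2 design (hypothetical data): columns are
# [chromosomal marker c, non-chromosomal factor m, interaction c*m].
X = [[c, m, c * m] for c in (-1.0, 1.0) for m in (-1.0, 1.0)]
# Hypothetical response driven mostly by the interaction term.
y = [0.5 * c + 2.0 * cm for c, m, cm in X]

beta = lasso_coordinate_descent(X, y, lam=0.1)
# The columns are orthogonal here, so each coefficient is simply its
# least-squares value shrunk by lam: approximately [0.4, 0.0, 1.9].
```

In this toy setting the interaction column keeps the largest coefficient after shrinkage, which is the kind of pattern a subsequent model-selection step would then compare across candidate models.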
2.  Multiple graph regularized protein domain ranking 
BMC Bioinformatics  2012;13:307.
Protein domain ranking is a fundamental task in structural biology. Most protein domain ranking methods rely on the pairwise comparison of protein domains while neglecting the global manifold structure of the protein domain database. Recently, graph regularized ranking that exploits the global structure of the graph defined by the pairwise similarities has been proposed. However, the existing graph regularized ranking methods are very sensitive to the choice of the graph model and parameters, and this remains a difficult problem for most of the protein domain ranking methods.
To tackle this problem, we have developed the Multiple Graph regularized Ranking algorithm, MultiG-Rank. Instead of using a single graph to regularize the ranking scores, MultiG-Rank approximates the intrinsic manifold of protein domain distribution by combining multiple initial graphs for the regularization. Graph weights are learned with ranking scores jointly and automatically, by alternately minimizing an objective function in an iterative algorithm. Experimental results on a subset of the ASTRAL SCOP protein domain database demonstrate that MultiG-Rank achieves a better ranking performance than single graph regularized ranking methods and pairwise similarity based ranking methods.
The problem of graph model and parameter selection in graph regularized protein domain ranking can be solved effectively by combining multiple graphs. This aspect of generalization introduces a new frontier in applying multiple graphs to solving protein domain ranking applications.
PMCID: PMC3583823  PMID: 23157331
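The alternating scheme described in the abstract above (ranking scores and graph weights updated in turn) can be sketched as follows. This is a simplified illustration under stated assumptions, not the MultiG-Rank algorithm itself: the objective (Laplacian smoothness plus a fit-to-query term), the Gauss-Seidel solver, and the entropy-regularised weight update with temperature `tau` are all choices made here, and the six-node graphs are hypothetical.

```python
import math


def smoothness(W, f):
    """Graph smoothness f^T L f, i.e. the sum over edges of W_ij*(f_i - f_j)^2."""
    n = len(f)
    return 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2
                     for i in range(n) for j in range(n))


def rank_on_graph(W, y, alpha, n_sweeps=200):
    """Solve (L + alpha*I) f = alpha*y by Gauss-Seidel sweeps (L = D - W)."""
    n = len(y)
    degree = [sum(W[i][j] for j in range(n) if j != i) for i in range(n)]
    f = list(y)
    for _ in range(n_sweeps):
        for i in range(n):
            neighbour_sum = sum(W[i][j] * f[j] for j in range(n) if j != i)
            f[i] = (alpha * y[i] + neighbour_sum) / (degree[i] + alpha)
    return f


def multi_graph_rank(graphs, y, alpha=0.5, tau=0.05, n_rounds=10):
    """Alternate between ranking on the combined graph and reweighting graphs."""
    k, n = len(graphs), len(y)
    mu = [1.0 / k] * k  # start from equal graph weights
    f = list(y)
    for _ in range(n_rounds):
        W = [[sum(mu[g] * graphs[g][i][j] for g in range(k)) for j in range(n)]
             for i in range(n)]
        f = rank_on_graph(W, y, alpha)
        # Assumed update: minimising sum_g mu_g*cost_g + tau*sum_g mu_g*log(mu_g)
        # over the simplex gives mu_g proportional to exp(-cost_g / tau).
        costs = [smoothness(G, f) for G in graphs]
        shift = min(costs)
        raw = [math.exp(-(c - shift) / tau) for c in costs]
        total = sum(raw)
        mu = [r / total for r in raw]
    return f, mu


# Hypothetical database of six domains: 0-2 form one family, 3-5 another.
family = [0, 0, 0, 1, 1, 1]
within = [[1.0 if i != j and family[i] == family[j] else 0.0 for j in range(6)]
          for i in range(6)]
across = [[1.0 if family[i] != family[j] else 0.0 for j in range(6)]
          for i in range(6)]
query = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # domain 0 is the query

scores, weights = multi_graph_rank([within, across], query)
```

With these toy graphs the within-family graph ends up with the larger weight, and domains from the query's family score above those from the other family.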
3.  Efficient Feature Selection and Multiclass Classification with Integrated Instance and Model Based Learning 
Multiclass classification and feature (variable) selection are commonly encountered in many biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as the K nearest neighbor (KNN) can naturally extend to multiclass problems and usually perform well with unbalanced data, but they suffer from the curse of dimensionality: their performance degrades when applied to high-dimensional data. On the other hand, model-based methods such as logistic regression require the decomposition of the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Even though they can be applied to high-dimensional data with L1 or Lp penalized methods, such approaches can only select independent features, and the features selected with different binary problems are usually different. They also produce unbalanced classification problems with the one-vs.-rest scheme even if the original multiclass problem is balanced.
By combining instance-based and model-based learning, we propose an efficient learning method with integrated KNN and constrained logistic regression (KNNLog) for simultaneous multiclass classification and feature selection. Our proposed method simultaneously minimizes the intra-class distance and maximizes the interclass distance with fewer estimated parameters. It is very efficient for problems with small sample size and unbalanced classes, a case common in many real applications. In addition, our model-based feature selection methods can identify highly correlated features simultaneously, avoiding the multiplicity problem due to multiple tests. The proposed method is evaluated with simulation and real data, including one unbalanced microRNA dataset for leukemia and one multiclass metagenomic dataset from the Human Microbiome Project (HMP). It performs well with limited computational experiments.
PMCID: PMC3347893  PMID: 22577297
feature selection; multiclass classification; statistical learning; high-dimensional data
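The remark above that KNN extends naturally to multiclass problems can be seen in a few lines. The sketch below covers only the plain instance-based half, not the KNNLog method; the three-class data and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical three-class training set; no one-vs.-rest decomposition needed.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
           (0.0, 5.0), (0.2, 5.1), (0.1, 4.8)]
train_y = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

label = knn_predict(train_X, train_y, query=(4.8, 5.1), k=3)
```

The vote is taken over whatever labels appear among the neighbours, so adding a fourth class requires no change to the code. The distance computation over all features is also where the curse of dimensionality enters.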
4.  Gene Expression Data Classification With Kernel Principal Component Analysis 
One important feature of gene expression data is that the number of genes M far exceeds the number of samples N. Standard statistical methods do not work well when N < M. Development of new methodologies or modification of existing methodologies is needed for the analysis of microarray data. In this paper, we propose a novel analysis procedure for classifying gene expression data. This procedure involves dimension reduction using kernel principal component analysis (KPCA) and classification with logistic regression (discrimination). KPCA is a nonlinear generalization of principal component analysis. The proposed algorithm was applied to five different gene expression datasets involving human tumor samples. Comparison with other popular classification methods such as support vector machines and neural networks shows that our algorithm is very promising in classifying gene expression data.
PMCID: PMC1184105  PMID: 16046821
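The dimension-reduction step in the abstract above can be sketched in plain Python. The RBF kernel, the value of `gamma`, the use of power iteration to extract only the leading component, and the six toy samples are assumptions made for illustration; the abstract does not state the kernel or solver used, and the logistic regression stage is not shown.

```python
import math


def rbf_kernel_matrix(X, gamma):
    """Pairwise RBF kernel values exp(-gamma * ||x_i - x_j||^2)."""
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]


def center_kernel(K):
    """Double-centre K so the implicit feature vectors have zero mean."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]  # equals the column means: K is symmetric
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def top_component_scores(Kc, n_iter=500):
    """Project samples on the leading kernel principal component (power iteration)."""
    n = len(Kc)
    v = [float(i + 1) for i in range(n)]  # fixed, non-constant starting vector
    eigval = 0.0
    for _ in range(n_iter):
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigval = norm  # converges to the largest eigenvalue of Kc
    # For training samples the projection is sqrt(eigenvalue) * eigenvector entry.
    return [math.sqrt(eigval) * x for x in v]


# Hypothetical expression profiles: samples 0-2 and 3-5 form two groups.
samples = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
           [3.0, 3.1], [3.1, 3.0], [3.05, 2.95]]
Kc = center_kernel(rbf_kernel_matrix(samples, gamma=0.5))
scores = top_component_scores(Kc)
```

The kernel matrix is N by N regardless of the number of genes M, which is why this reduction remains workable when M far exceeds N. The resulting scores could then serve as inputs to a logistic regression classifier.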
5.  Data Mining in Genomics and Proteomics 
PMCID: PMC1184057  PMID: 16046810
6.  Functional Clustering Algorithm for High-Dimensional Proteomics Data 
Clustering proteomics data is a challenging problem for any traditional clustering algorithm. Usually, the number of samples is much smaller than the number of protein peaks. A clustering algorithm that does not take into consideration the number of features or variables (here, the number of peaks) is needed. An innovative hierarchical clustering algorithm may be a good approach. We propose here a new dissimilarity measure for hierarchical clustering combined with functional data analysis. We present a specific application of functional data analysis (FDA) to a high-throughput proteomics study. The performance of the proposed algorithm is compared with that of two popular dissimilarity measures in the clustering of samples from normal and human T-cell leukemia virus type 1 (HTLV-1)-infected patients.
PMCID: PMC1184055  PMID: 16046812
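Hierarchical clustering with a pluggable dissimilarity, as discussed in the abstract above, can be sketched as follows. The abstract does not define the proposed dissimilarity, so the root-mean-square distance between sampled curves used here is a stand-in, and average linkage and the four toy spectra are likewise assumptions.

```python
import math


def rms_curve_distance(a, b):
    """Stand-in dissimilarity: root-mean-square gap between two sampled curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def average_linkage(curves, dissimilarity, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    n = len(curves)
    d = [[dissimilarity(curves[i], curves[j]) for j in range(n)]
         for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise dissimilarity between the two clusters' members.
                link = sum(d[i][j] for i in clusters[a] for j in clusters[b])
                link /= len(clusters[a]) * len(clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical spectra: 4 samples, 8 intensity values each (more peaks than samples).
spectra = [
    [0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 4.8, 1.2, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.1, 5.2, 0.9, 0.0],
]
groups = average_linkage(spectra, rms_curve_distance, n_clusters=2)
```

Because the algorithm only ever consults the sample-by-sample dissimilarity matrix, its cost depends on the number of samples rather than the number of peaks; a different dissimilarity can be swapped in by passing another function.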
7.  Postgenomics: Proteomics and Bioinformatics in Cancer Research 
Now that the human genome is completed, the characterization of the proteins encoded by the sequence remains a challenging task. The study of the complete protein complement of the genome, the “proteome,” referred to as proteomics, will be essential if new therapeutic drugs and new disease biomarkers for early diagnosis are to be developed. Research efforts are already underway to develop the technology necessary to compare the specific protein profiles of diseased versus nondiseased states. These technologies provide a wealth of information and rapidly generate large quantities of data. Processing the large amounts of data will lead to useful predictive mathematical descriptions of biological systems which will permit rapid identification of novel therapeutic targets and identification of metabolic disorders. Here, we present an overview of the current status and future research approaches in defining the cancer cell's proteome in combination with different bioinformatics and computational biology tools toward a better understanding of health and disease.
PMCID: PMC514267  PMID: 14615629
