Results 1-25 (385)

1.  An efficient heuristic method for active feature acquisition and its application to protein-protein interaction prediction 
BMC Proceedings  2012;6(Suppl 7):S2.
Machine learning approaches for classification learn the pattern of the feature space of different classes, or learn a boundary that separates the feature space into different classes. The features of the data instances are usually available, and it is only the class labels of the instances that are unavailable. For example, to classify text documents into different topic categories, the words in the documents are the features and are readily available, whereas the topic is what is predicted. However, in some domains obtaining features is resource-intensive, so not all features may be available. An example is protein-protein interaction (PPI) prediction, where not only are the labels ('interacting' or 'non-interacting') unavailable, but so are some of the features. It may be possible to obtain at least some of the missing features by carrying out a few experiments as permitted by the available resources. If only a few experiments can be carried out to acquire missing features, which proteins should be studied and which features of those proteins should be determined? From the perspective of machine learning for PPI prediction, the features to acquire are those that, when used in training, improve the accuracy of the classifier the most. That is, the utility of a feature acquisition is measured by how much the acquired features contribute to improving the accuracy of the classifier. Active feature acquisition (AFA) is a strategy to preselect such instance-feature combinations (i.e. protein and experiment combinations) for maximum utility. The goal of AFA is to create an optimal training set that would result in the best classifier, not to determine the best classification model itself.
We present a heuristic method for active feature acquisition that calculates the utility of acquiring a missing feature. This heuristic takes into account the change in belief of the classification model induced by the acquisition of the feature under consideration. Compared to random selection of the proteins on which experiments are performed and of the type of experiment performed, the heuristic method reduces the number of experiments to as few as 40%. The most notable characteristic of this method is that it does not require re-training of the classification model on every possible combination of instance, feature and feature-value tuples. For this reason, our method is far less computationally expensive than previous AFA strategies.
The results show that our heuristic method for AFA creates an optimal training set with far fewer features acquired than random acquisition. This shows the value of active feature acquisition in aiding protein-protein interaction prediction where feature acquisition is costly. Compared to previous methods, the proposed method reduces computational cost while also achieving a better F-score. The proposed method is valuable because it points AFA in a direction of far lower computational expense by removing, for the first time, the need to train a classifier for every combination of instance, feature and feature-value tuples, which would be impractical for several domains.
PMCID: PMC3504800  PMID: 23173746
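The abstract above describes scoring a candidate acquisition by the change in the model's belief, without re-training. A minimal sketch of that idea follows. It is not the authors' heuristic: the naive Bayes model, the binary features, the feature names `coexpr` and `colocal`, and the function names are all assumptions made for illustration.

```python
def posterior(instance, prior, likelihood):
    """P(class = 1 | observed features) under an assumed naive Bayes model.

    instance:   dict feature -> 0, 1, or None (None means missing)
    prior:      P(class = 1)
    likelihood: dict feature -> (P(f=1 | class=0), P(f=1 | class=1))
    """
    p1, p0 = prior, 1.0 - prior
    for feature, value in instance.items():
        if value is None:
            continue  # missing features contribute nothing to the belief
        q0, q1 = likelihood[feature]
        p1 *= q1 if value == 1 else 1.0 - q1
        p0 *= q0 if value == 1 else 1.0 - q0
    return p1 / (p1 + p0)


def utility(instance, feature, prior, likelihood):
    """Expected absolute change in belief if `feature` were acquired."""
    before = posterior(instance, prior, likelihood)
    q0, q1 = likelihood[feature]
    # Predictive probability that the missing feature turns out to be 1.
    p_value_1 = before * q1 + (1.0 - before) * q0
    total = 0.0
    for value, p_value in ((1, p_value_1), (0, 1.0 - p_value_1)):
        after = posterior({**instance, feature: value}, prior, likelihood)
        total += p_value * abs(after - before)
    return total


def rank_acquisitions(instances, prior, likelihood):
    """Rank every missing (instance, feature) pair by utility, best first."""
    candidates = [
        (utility(inst, feature, prior, likelihood), name, feature)
        for name, inst in instances.items()
        for feature, value in inst.items()
        if value is None
    ]
    return sorted(candidates, reverse=True)


# Hypothetical model: `coexpr` is informative, `colocal` is not.
likelihood = {"coexpr": (0.2, 0.9), "colocal": (0.5, 0.5)}
instances = {"pair_A": {"coexpr": None, "colocal": None}}
ranking = rank_acquisitions(instances, prior=0.5, likelihood=likelihood)
```

In this toy setting the uninformative feature gets zero utility, so the ranking would direct the experimental budget to `coexpr`. The model is only queried, never re-fitted, which is the property the abstract emphasizes.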
2.  Effect of conformation sampling strategies in genetic algorithm for multiple protein docking 
BMC Proceedings  2012;6(Suppl 7):S4.
Macromolecular protein complexes play important roles in a cell, and knowledge of their tertiary structure can help in understanding the key biological processes underlying their functions. Multiple protein docking is a valuable computational tool for providing structural information on multimeric protein complexes. In a previous study we developed and implemented an algorithm for this purpose, named Multi-LZerD. This method represents a conformation of a multimeric protein complex as a graph, where nodes denote subunits and each edge connecting two nodes denotes a pairwise docking conformation of the two subunits. Multi-LZerD employs a genetic algorithm to sample different topologies of the graph and pairwise transformations between subunits, seeking the conformation with the optimal (lowest) energy. In this study we explore different configurations of the genetic algorithm, namely the population size, whether to include a crossover operation, and the threshold for structural clustering, to find the optimal experimental setup.
Multi-LZerD was executed to predict the structures of three multimeric protein complexes, using different population sizes, clustering thresholds, and configurations of mutation and crossover. We analyzed the impact of varying these parameters on the computational time and the prediction accuracy.
Results and conclusions
Given that computational resources are key for handling complexes with a large number of subunits and also for computing a large number of protein complexes in a genome-scale study, finding a proper setting for sampling the conformation space is of the utmost importance. Our results show that excessive sampling of the conformational space, by increasing the population size or by introducing the crossover operation, is not necessary for improving the accuracy of structure prediction for small complexes. The clustering is effective in reducing redundant pairwise predictions, which leads to successful identification of near-native conformations.
PMCID: PMC3504801  PMID: 23173833
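The parameters studied in the abstract above (population size, optional crossover, removal of redundant candidates) can be illustrated with a toy genetic algorithm. This is a sketch under loud assumptions and not Multi-LZerD: a candidate is simply a list holding one hypothetical pose index per graph edge, `toy_energy` is a made-up score, and exact-duplicate removal stands in for structural clustering.

```python
import random


def toy_energy(candidate):
    # Hypothetical stand-in for a docking score: lower is better.
    return sum(x * x for x in candidate)


def mutate(candidate, n_poses, rng):
    # Replace the pose chosen for one randomly selected edge.
    child = list(candidate)
    child[rng.randrange(len(child))] = rng.randrange(n_poses)
    return child


def crossover(a, b, rng):
    # Single-point crossover between two parents of equal length.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def run_ga(n_edges=4, n_poses=10, pop_size=20, generations=30,
           use_crossover=True, seed=0):
    rng = random.Random(seed)
    population = [[rng.randrange(n_poses) for _ in range(n_edges)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(rng.choice(population), n_poses, rng)
                    for _ in range(pop_size)]
        if use_crossover:
            children += [crossover(rng.choice(population),
                                   rng.choice(population), rng)
                         for _ in range(pop_size)]
        # Drop exact duplicates (crude stand-in for structural clustering),
        # then keep the lowest-energy candidates for the next generation.
        unique = {tuple(c) for c in population + children}
        population = [list(c)
                      for c in sorted(unique, key=toy_energy)[:pop_size]]
    best = population[0]
    return best, toy_energy(best)


best_with, energy_with = run_ga(use_crossover=True)
best_without, energy_without = run_ga(use_crossover=False)
```

Comparing `energy_with` and `energy_without` across several `pop_size` values mirrors, in miniature, the kind of parameter sweep the study reports. The toy results say nothing about real docking accuracy.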
3.  A ν-support vector regression based approach for predicting imputation quality 
BMC Proceedings  2012;6(Suppl 7):S3.
Decades of genome-wide association studies (GWAS) have accumulated large volumes of genomic data that can potentially be reused to increase the statistical power of new studies, but different genotyping platforms with different marker sets have been used as biotechnology has evolved, preventing pooling and comparison of old and new data. For example, to pool data collected by 550K chips with newer data collected by 900K chips, we need to impute the missing loci. Many imputation algorithms have been developed, but the posterior probabilities estimated by those algorithms are not a reliable measure of the quality of the imputation. Recently, many studies have used an imputation quality score (IQS) to measure the quality of imputation. Estimating the IQS requires knowing the true alleles. Only when the population and the imputation loci are identical can we reuse the estimated IQS when the true alleles are unknown.
Here, we present a regression model to estimate IQS that learns from imputation of loci with known alleles. We designed a small set of features, such as minor allele frequencies, distance to the nearest known cross-over hotspot, etc., for the prediction of IQS. We evaluated our regression models by estimating IQS of imputations by BEAGLE for a set of GWAS data from the NCBI GEO database collected from samples from different ethnic populations.
We construct a ν-SVR based approach as our regression model. Our evaluation shows that this regression model can achieve mean square errors of less than 0.02 and a correlation coefficient close to 0.75 in different imputation scenarios. We also show how the regression results can help remove false positives in association studies.
Reliable estimation of IQS will facilitate integration and reuse of existing genomic data for meta-analysis and secondary analysis. Experiments show that it is possible to use a small number of features to regress the IQS by learning from different training examples of imputation and IQS pairs.
PMCID: PMC3504919  PMID: 23173775
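The abstract above does not state how the IQS is computed. A common way to score imputation against known genotypes is chance-corrected concordance in the style of Cohen's kappa; the sketch below assumes that form and uses hard genotype calls (0, 1, 2 copies of the minor allele). The function name and the toy data are hypothetical.

```python
from collections import Counter


def chance_corrected_concordance(true_genotypes, imputed_genotypes):
    """Kappa-style agreement between true and imputed genotype calls.

    Assumed form: (observed - chance) / (1 - chance), where `chance` is
    the agreement expected from the marginal genotype frequencies alone.
    """
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_genotypes):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(t == i for t, i in
                   zip(true_genotypes, imputed_genotypes)) / n
    true_counts = Counter(true_genotypes)
    imputed_counts = Counter(imputed_genotypes)
    chance = sum(true_counts[g] * imputed_counts[g]
                 for g in true_counts) / (n * n)
    if chance == 1.0:
        raise ValueError("score undefined when only one genotype occurs")
    return (observed - chance) / (1.0 - chance)


# Always imputing the common genotype gives 75% raw concordance here,
# yet carries no information beyond the genotype frequencies.
truth = [0, 0, 0, 1]
always_major = [0, 0, 0, 0]
score_major = chance_corrected_concordance(truth, always_major)
score_perfect = chance_corrected_concordance([0, 0, 1, 1], [0, 0, 1, 1])
```

The example illustrates why a chance-corrected score is preferred over raw concordance for rare variants, and why such a score cannot be computed directly when the true alleles are unknown, which is the gap the regression model in the abstract is meant to fill.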
4.  Evaluation of function predictions by PFP, ESG, and PSI-BLAST for moonlighting proteins 
BMC Proceedings  2012;6(Suppl 7):S5.
Advancements in function prediction algorithms are enabling large-scale computational annotation of newly sequenced genomes. With the increase in the number of functionally well-characterized proteins, it has been observed that many proteins are involved in more than one function. These proteins, characterized as moonlighting proteins, show varied functional behavior depending on the cell type, localization in the cell, oligomerization, multiple binding sites, etc. The functional diversity shown by moonlighting proteins may have a significant impact on traditional sequence-based function prediction methods. Here we investigate how well the diverse functions of moonlighting proteins can be predicted by some existing function prediction methods.
We have analyzed the performance of three major sequence-based function prediction methods, PSI-BLAST, the Protein Function Prediction (PFP) method, and the Extended Similarity Group (ESG) method, in predicting the diverse functions of moonlighting proteins. In predicting discrete functions of a set of 19 experimentally identified moonlighting proteins, PFP showed the highest overall recall among the three methods. Although ESG showed the highest precision, its recall was lower than that of PSI-BLAST. Recall by PSI-BLAST greatly improved when BLOSUM45 was used instead of BLOSUM62.
We have analyzed the performances of PFP, ESG, and PSI-BLAST in predicting the functional diversity of moonlighting proteins. PFP shows overall better performance in predicting diverse moonlighting functions as compared with PSI-BLAST and ESG. Recall by PSI-BLAST greatly improved when BLOSUM45 was used. This analysis indicates that considering weakly similar sequences in prediction enhances the performance of sequence based AFP methods in predicting functional diversity of moonlighting proteins. The current study will also motivate development of novel computational frameworks for automatic identification of such proteins.
PMCID: PMC3504920  PMID: 23173871
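The comparison above is stated in terms of precision and recall of predicted functions. A minimal sketch of those two measures over sets of function labels follows; the labels `F1` to `F4` and the two example predictors are hypothetical and do not correspond to any method or protein in the study.

```python
def precision_recall(predicted, annotated):
    """Precision and recall of a predicted label set against annotations."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# A moonlighting protein with two known, unrelated functions.
annotated = {"F1", "F2"}

# A broad predictor recovers both functions but adds wrong labels.
broad = precision_recall({"F1", "F2", "F3", "F4"}, annotated)

# A conservative predictor is always right but misses one function.
conservative = precision_recall({"F1"}, annotated)
```

For moonlighting proteins recall is the more telling number, since a method can be precise while reporting only one of the protein's functions.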
5.  Identifying stage-specific protein subnetworks for colorectal cancer 
BMC Proceedings  2012;6(Suppl 7):S1.
In recent years, many algorithms have been developed for network-based analysis of differential gene expression in complex diseases. These algorithms use protein-protein interaction (PPI) networks as an integrative framework and identify subnetworks that are coordinately dysregulated in the phenotype of interest.
While such dysregulated subnetworks have demonstrated significant improvement over individual gene markers for classifying phenotype, the current state-of-the-art in dysregulated subnetwork discovery is almost exclusively limited to binary phenotype classes. However, many clinical applications require identification of molecular markers for multiple classes.
We consider the problem of discovering groups of genes whose expression signatures can discriminate multiple phenotype classes. We consider two alternative formulations of this problem: (i) an all-vs-all approach that aims to discover subnetworks distinguishing all classes, and (ii) a one-vs-all approach that aims to discover subnetworks distinguishing each class from the rest of the classes. For the one-vs-all formulation, we develop a set-cover based algorithm, which aims to identify groups of genes such that at least one gene in the group exhibits differential expression in the target class.
We test the proposed algorithms in the context of predicting stages of colorectal cancer. Our results show that the set-cover based algorithm identifying "stage-specific" subnetworks outperforms the all-vs-all approaches in classification. We also investigate the merits of utilizing PPI networks in the search for multiple markers, and show that, with correct parameter settings, network-guided search improves performance. Furthermore, we show that assessing statistical significance when selecting features greatly improves classification performance.
PMCID: PMC3504924  PMID: 23173715
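The set-cover formulation mentioned above can be sketched with the classic greedy heuristic. The reading used here is an assumption, since the abstract does not spell it out: each gene "covers" the target-class samples in which it is differentially expressed, and a group is complete once every target-class sample is covered by at least one selected gene. Gene and sample names are hypothetical, and the network constraint from the PPI graph is omitted.

```python
def greedy_cover(gene_to_samples, samples):
    """Greedily pick genes until every target-class sample is covered.

    gene_to_samples: dict gene -> set of samples where it is
                     differentially expressed (assumed precomputed)
    Returns (chosen genes in pick order, samples left uncovered).
    """
    uncovered = set(samples)
    chosen = []
    while uncovered and gene_to_samples:
        # Pick the gene covering the most still-uncovered samples;
        # sorting the names makes tie-breaking deterministic.
        gene = max(sorted(gene_to_samples),
                   key=lambda g: len(gene_to_samples[g] & uncovered))
        gain = gene_to_samples[gene] & uncovered
        if not gain:
            break  # no remaining gene covers anything new
        chosen.append(gene)
        uncovered -= gain
    return chosen, uncovered


coverage = {
    "g1": {"s1", "s2"},
    "g2": {"s2", "s3"},
    "g3": {"s3", "s4"},
}
group, missed = greedy_cover(coverage, {"s1", "s2", "s3", "s4"})
```

Greedy selection does not guarantee a minimum-size cover, but it is a standard approximation and keeps each selected gene responsible for samples the others miss.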
8.  Next-generation sequencing at Merck-Boston 
BMC Proceedings  2012;6(Suppl 6):P36.
PMCID: PMC3467709
13.  Analyzing genomes: is there a duty to disclose? 
BMC Proceedings  2012;6(Suppl 6):O14.
PMCID: PMC3467717
14.  Folding principles of genomes 
BMC Proceedings  2012;6(Suppl 6):O26.
PMCID: PMC3467459
17.  Hypothesis-generating clinical genomics research and predictive medicine 
BMC Proceedings  2012;6(Suppl 6):O6.
PMCID: PMC3467463
23.  Verification of systems biology research in the age of collaborative competition 
BMC Proceedings  2012;6(Suppl 6):P48.
PMCID: PMC3467476