When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is limit the number of features being considered, restrict features sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased so that there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small—that is, the prior biological knowledge is not too poor—then one should expect, with high probability, to find good feature sets.
Availability: companion website at http://gsp.tamu.edu/Publications/supplementary/zhao09a/
classification; feature ranking; ranking power
One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using microarray technique. Due to important differences in the probabilistic models of microarray and SAGE technologies, it is important to develop suitable techniques to select specific genes from SAGE measurements.
A new framework to select specific genes that distinguish different biological states based on the analysis of SAGE data is proposed. The new framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify the genes that distinguish the different states with more reliability among all gene groups selected by the strong genes method. A score taking into account the credibility and the bolstered error values in order to rank the groups of considered genes is proposed. Results obtained using SAGE data from gliomas are presented, thus corroborating the introduced methodology.
The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. The additional statistical information provided by the probabilistic model is incorporated in the methodology described in the paper. The introduced method is suitable to identify signature genes that lead to a good separation of the biological states using SAGE and may be adapted for other counting methods such as Massive Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of such genes identified by the proposed method may be useful to generate classifiers.
With the growing popularity of using QSAR predictions towards regulatory purposes, such predictive models are now required to be strictly validated, an essential feature of which is to have the model’s Applicability Domain (AD) defined clearly. Although in recent years several different approaches have been proposed to address this goal, no optimal approach to define the model’s AD has yet been recognized.
This study proposes a novel descriptor-based AD method which accounts for the data distribution and exploits k-Nearest Neighbours (kNN) principle to derive a heuristic decision rule. The proposed method is a three-stage procedure to address several key aspects relevant in judging the reliability of QSAR predictions. Inspired from the adaptive kernel method for probability density function estimation, the first stage of the approach defines a pattern of thresholds corresponding to the various training samples and these thresholds are later used to derive the decision rule. Criterion deciding if a given test sample will be retained within the AD is defined in the second stage of the approach. Finally, the last stage tries reflecting upon the reliability in derived results taking model statistics and prediction error into account.
The proposed approach addressed a novel strategy that integrated the kNN principle to define the AD of QSAR models. Relevant features that characterize the proposed AD approach include: a) adaptability to local density of samples, useful when the underlying multivariate distribution is asymmetric, with wide regions of low data density; b) unlike several kernel density estimators (KDE), effectiveness also in high-dimensional spaces; c) low sensitivity to the smoothing parameter k; and d) versatility to implement various distances measures. The results derived on a case study provided a clear understanding of how the approach works and defines the model’s AD for reliable predictions.
QSAR; Applicability domain; kNN; Nearest neighbour; Model validation
Kernel-based learning algorithms are among the most advanced machine learning methods and have been successfully applied to a variety of sequence classification tasks within the field of bioinformatics. Conventional kernels utilized so far do not provide an easy interpretation of the learnt representations in terms of positional and compositional variability of the underlying biological signals.
We propose a kernel-based approach to datamining on biological sequences. With our method it is possible to model and analyze positional variability of oligomers of any length in a natural way. On one hand this is achieved by mapping the sequences to an intuitive but high-dimensional feature space, well-suited for interpretation of the learnt models. On the other hand, by means of the kernel trick we can provide a general learning algorithm for that high-dimensional representation because all required statistics can be computed without performing an explicit feature space mapping of the sequences. By introducing a kernel parameter that controls the degree of position-dependency, our feature space representation can be tailored to the characteristics of the biological problem at hand. A regularized learning scheme enables application even to biological problems for which only small sets of example sequences are available. Our approach includes a visualization method for transparent representation of characteristic sequence features. Thereby importance of features can be measured in terms of discriminative strength with respect to classification of the underlying sequences. To demonstrate and validate our concept on a biochemically well-defined case, we analyze E. coli translation initiation sites in order to show that we can find biologically relevant signals. For that case, our results clearly show that the Shine-Dalgarno sequence is the most important signal upstream a start codon. The variability in position and composition we found for that signal is in accordance with previous biological knowledge. We also find evidence for signals downstream of the start codon, previously introduced as transcriptional enhancers. These signals are mainly characterized by occurrences of adenine in a region of about 4 nucleotides next to the start codon.
We showed that the oligo kernel can provide a valuable tool for the analysis of relevant signals in biological sequences. In the case of translation initiation sites we could clearly deduce the most discriminative motifs and their positional variation from example sequences. Attractive features of our approach are its flexibility with respect to oligomer length and position conservation. By means of these two parameters oligo kernels can easily be adapted to different biological problems.
Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, and this is most directly analyzed via the deviation distribution of the estimator, this being the distribution of the difference between the estimated and true errors. Past studies indicate that given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variation of the deviation distribution in addition to the variation in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance using feature selection as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, contributing to over half of the variance in many of the cases studied. We consider linear-discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the -test for feature selection; and -fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection, compared with the case when cross-validation is performed on a given feature set. This is reflected by the observed positive values of the CRIDD, which is defined to quantify the contribution of feature selection towards the deviation variance.
The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity which refers to the precision of error estimation is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated, so that natural questions arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three error estimators commonly used (leave-one-out cross-validation, -fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known-feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the better correlation between the latter two showing no general trend, but differing for different models.
Diffusion-weighted magnetic resonance imaging (DWI) and fiber tractography are the only methods to measure the structure of the white matter in the living human brain. The diffusion signal has been modelled as the combined contribution from many individual fascicles of nerve fibers passing through each location in the white matter. Typically, this is done via basis pursuit, but estimation of the exact directions is limited due to discretization [1, 2]. The difficulties inherent in modeling DWI data are shared by many other problems involving fitting non-parametric mixture models. Ekanadaham et al.  proposed an approach, continuous basis pursuit, to overcome discretization error in the 1-dimensional case (e.g., spike-sorting). Here, we propose a more general algorithm that fits mixture models of any dimensionality without discretization. Our algorithm uses the principles of L2-boost , together with refitting of the weights and pruning of the parameters. The addition of these steps to L2-boost both accelerates the algorithm and assures its accuracy. We refer to the resulting algorithm as elastic basis pursuit, or EBP, since it expands and contracts the active set of kernels as needed. We show that in contrast to existing approaches to fitting mixtures, our boosting framework (1) enables the selection of the optimal bias-variance tradeoff along the solution path, and (2) scales with high-dimensional problems. In simulations of DWI, we find that EBP yields better parameter estimates than a non-negative least squares (NNLS) approach, or the standard model used in DWI, the tensor model, which serves as the basis for diffusion tensor imaging (DTI) . We demonstrate the utility of the method in DWI data acquired in parts of the brain containing crossings of multiple fascicles of nerve fibers.
Multiple treatment comparison (MTC) meta-analyses are commonly modeled in a Bayesian framework, and weakly informative priors are typically preferred to mirror familiar data driven frequentist approaches. Random-effects MTCs have commonly modeled heterogeneity under the assumption that the between-trial variance for all involved treatment comparisons are equal (i.e., the ‘common variance’ assumption). This approach ‘borrows strength’ for heterogeneity estimation across treatment comparisons, and thus, ads valuable precision when data is sparse. The homogeneous variance assumption, however, is unrealistic and can severely bias variance estimates. Consequently 95% credible intervals may not retain nominal coverage, and treatment rank probabilities may become distorted. Relaxing the homogeneous variance assumption may be equally problematic due to reduced precision. To regain good precision, moderately informative variance priors or additional mathematical assumptions may be necessary.
In this paper we describe four novel approaches to modeling heterogeneity variance - two novel model structures, and two approaches for use of moderately informative variance priors. We examine the relative performance of all approaches in two illustrative MTC data sets. We particularly compare between-study heterogeneity estimates and model fits, treatment effect estimates and 95% credible intervals, and treatment rank probabilities.
In both data sets, use of moderately informative variance priors constructed from the pair wise meta-analysis data yielded the best model fit and narrower credible intervals. Imposing consistency equations on variance estimates, assuming variances to be exchangeable, or using empirically informed variance priors also yielded good model fits and narrow credible intervals. The homogeneous variance model yielded high precision at all times, but overall inadequate estimates of between-trial variances. Lastly, treatment rankings were similar among the novel approaches, but considerably different when compared with the homogenous variance approach.
MTC models using a homogenous variance structure appear to perform sub-optimally when between-trial variances vary between comparisons. Using informative variance priors, assuming exchangeability or imposing consistency between heterogeneity variances can all ensure sufficiently reliable and realistic heterogeneity estimation, and thus more reliable MTC inferences. All four approaches should be viable candidates for replacing or supplementing the conventional homogeneous variance MTC model, which is currently the most widely used in practice.
Brain imaging data have shown great promise as a useful predictor for psychiatric conditions, cognitive functions and many other neural-related outcomes. Development of prediction models based on imaging data is challenging due to the high dimensionality of the data, noisy measurements, complex correlation structures among voxels, small sample sizes, and between-subject heterogeneity. Most existing prediction approaches apply a dimension reduction method such as PCA on whole brain images as a preprocessing step. These approaches usually do not take into account of the cluster structure among voxels and between-subject differences. We propose a weighted cluster kernel PCA predictive model that addresses the challenges in brain imaging data. We first divide voxels into clusters based on neuroanatomic parcellation or data-driven methods, then extract cluster-specific principal features using kernel PCA and define the prediction model based on the principal features. Finally, we propose a weighted estimation method for the prediction model where each subject is weighted according to the percent of variance explained by the principal features. The proposed method allows assessment of relative importance of various brain regions in prediction; captures nonlinearity in feature space; and helps guard against overfitting for outlying subjects in predictive model building. We evaluate the performance of our method through simulation studies. A real fMRI data example is also used to illustrate the method.
Kernel PCA; prediction; multi-subject data; cluster; functional magnetic resonance imaging (fMRI); weighted estimation
Retroviral insertional mutagenesis screens, which identify genes involved in tumor development in mice, have yielded a substantial number of retroviral integration sites, and this number is expected to grow substantially due to the introduction of high-throughput screening techniques. The data of various retroviral insertional mutagenesis screens are compiled in the publicly available Retroviral Tagged Cancer Gene Database (RTCGD). Integrally analyzing these screens for the presence of common insertion sites (CISs, i.e., regions in the genome that have been hit by viral insertions in multiple independent tumors significantly more than expected by chance) requires an approach that corrects for the increased probability of finding false CISs as the amount of available data increases. Moreover, significance estimates of CISs should be established taking into account both the noise, arising from the random nature of the insertion process, as well as the bias, stemming from preferential insertion sites present in the genome and the data retrieval methodology. We introduce a framework, the kernel convolution (KC) framework, to find CISs in a noisy and biased environment using a predefined significance level while controlling the family-wise error (FWE) (the probability of detecting false CISs). Where previous methods use one, two, or three predetermined fixed scales, our method is capable of operating at any biologically relevant scale. This creates the possibility to analyze the CISs in a scale space by varying the width of the CISs, providing new insights in the behavior of CISs across multiple scales. Our method also features the possibility of including models for background bias. Using simulated data, we evaluate the KC framework using three kernel functions, the Gaussian, triangular, and rectangular kernel function. We applied the Gaussian KC to the data from the combined set of screens in the RTCGD and found that 53% of the CISs do not reach the significance threshold in this combined setting. Still, with the FWE under control, application of our method resulted in the discovery of eight novel CISs, which each have a probability less than 5% of being false detections.
A potent method for the identification of novel cancer genes is retroviral insertional mutagenesis. Mice infected with slow transforming retroviruses develop tumors because the virus inserts randomly in their genome and mutates cancer genes. The regions in the genome that are mutated in multiple independent tumors are likely to contain genes involved in tumorigenesis. As the size of these datasets increases, conventional methods to detect these so-called common insertion sites (CISs) no longer suffice, and an approach is required that can control the error independent of the dataset size. The authors introduce a framework that uses a technique called kernel density estimation to find the regions in the genome that show a significant increase in insertion density. This method is implemented over a range of scales, allowing the data to be evaluated at any relevant scale. The authors demonstrate that the framework is capable of compensating for the inherent biases in the data, such as preference for retroviruses to insert near transcriptional start sites. By better balancing the error, they are able to show that from the 361 published CISs, 150 can be identified that have a low probability of being a false detection. In addition, they discover eight novel CISs.
Finding point correspondences plays an important role in automatically building statistical shape models from a training set of 3D surfaces. For the point correspondence problem, Davies et al.  proposed a minimum-description-length-based objective function to balance the training errors and generalization ability. A recent evaluation study  that compares several well-known 3D point correspondence methods for modeling purposes shows that the MDL-based approach  is the best method.
We adapt the MDL-based objective function for a feature space that can exploit nonlinear properties in point correspondences, and propose an efficient optimization method to minimize the objective function directly in the feature space, given that the inner product of any vector pair can be computed in the feature space. We further employ a Mercer kernel  to define the feature space implicitly. A key aspect of our proposed framework is the generalization of the MDL-based objective function to kernel principal component analysis (KPCA)  spaces and the design of a gradient-descent approach to minimize such an objective function. We compare the generalized MDL objective function on KPCA spaces with the original one and evaluate their abilities in terms of reconstruction errors and specificity. From our experimental results on different sets of 3D shapes of human body organs, the proposed method performs significantly better than the original method.
Arguably, genotypes and phenotypes may be linked in functional forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Therefore, developing statistical learning models for predicting phenotypic values from all available molecular information that are capable of capturing complex genetic network architectures is of great importance. Bayesian kernel ridge regression is a non-parametric prediction model proposed for this purpose. Its essence is to create a spatial distance-based relationship matrix called a kernel. Although the set of all single nucleotide polymorphism genotype configurations on which a model is built is finite, past research has mainly used a Gaussian kernel.
We sought to investigate the performance of a diffusion kernel, which was specifically developed to model discrete marker inputs, using Holstein cattle and wheat data. This kernel can be viewed as a discretization of the Gaussian kernel. The predictive ability of the diffusion kernel was similar to that of non-spatial distance-based additive genomic relationship kernels in the Holstein data, but outperformed the latter in the wheat data. However, the difference in performance between the diffusion and Gaussian kernels was negligible.
It is concluded that the ability of a diffusion kernel to capture the total genetic variance is not better than that of a Gaussian kernel, at least for these data. Although the diffusion kernel as a choice of basis function may have potential for use in whole-genome prediction, our results imply that embedding genetic markers into a non-Euclidean metric space has very small impact on prediction. Our results suggest that use of the black box Gaussian kernel is justified, given its connection to the diffusion kernel and its similar predictive performance.
In this paper a new framework, called Compressive Kernelized Reinforcement Learning (CKRL), for computing near-optimal policies in sequential decision making with uncertainty is proposed via incorporating the non-adaptive data-independent Random Projections and nonparametric Kernelized Least-squares Policy Iteration (KLSPI). Random Projections are a fast, non-adaptive dimensionality reduction framework in which high-dimensionality data is projected onto a random lower-dimension subspace via spherically random rotation and coordination sampling. KLSPI introduce kernel trick into the LSPI framework for Reinforcement Learning, often achieving faster convergence and providing automatic feature selection via various kernel sparsification approaches. In this approach, policies are computed in a low-dimensional subspace generated by projecting the high-dimensional features onto a set of random basis. We first show how Random Projections constitute an efficient sparsification technique and how our method often converges faster than regular LSPI, while at lower computational costs. Theoretical foundation underlying this approach is a fast approximation of Singular Value Decomposition (SVD). Finally, simulation results are exhibited on benchmark MDP domains, which confirm gains both in computation time and in performance in large feature spaces.
Markov Decision Process; sensor-actuator systems; random Projections; Kernelized Least Square Policy Iteration
We propose a statistical method to test whether two phylogenetic trees with given alignments are significantly incongruent. Our method compares the two distributions of phylogenetic trees given by two input alignments, instead of comparing point estimations of trees. This statistical approach can be applied to gene tree analysis for example, detecting unusual events in genome evolution such as horizontal gene transfer and reshuffling. Our method uses difference of means to compare two distributions of trees, after mapping trees into a vector space. Bootstrapping alignment columns can then be applied to obtain p-values. To compute distances between means, we employ a “kernel method” which speeds up distance calculations when trees are mapped in a high-dimensional feature space, e.g., splits or quartets feature space. In this pilot study, first we test our statistical method on data sets simulated under a coalescence model, to test whether two alignments are generated by congruent gene trees. We follow our simulation results with applications to data sets of gophers and lice, grasses and their endophytes, and different fungal genes from the same genome. A companion toolkit, Phylotree, is provided to facilitate computational experiments.
phylogenetic trees; difference of means; tree congruency
This paper describes the ontogeny, breakdown and absorption of the radular teeth of cephalopods and, for the first time, considers the function of the 'bolsters' or radular support muscles. The radular ribbon, which bears many regularly arranged transverse rows of teeth one behind the other, lies in a radular canal that emerges from the radular sac. Here the radular teeth are formed by a set of elongate cells with microvilli, the odontoblasts. These are organized into two layers, the outer producing the radular membrane and the bases of the teeth, the inner producing the cusps. The odontoblasts also secrete the hyaline shield and the teeth on the lateral buccal palps, when these are present. At the front end of the radular ribbon the teeth become worn in feeding and are replaced from behind by new ones formed continuously in the radular sac, so that the whole ribbon moves forward during ontogeny. Removal of the old teeth is achieved by cells in the radular organs; these cells, which are formed from modified odontoblasts ('odontoclasts'), dissolve the teeth and membranes and absorb them. There is a subradular organ in all cephalopods. In Octopus vulgaris, which bores into mollusc shells and crustacean carapaces, it is especially well-developed and there is also a supraradular organ. A characteristic feature of the cephalopod radular apparatus is the pair of large radular support muscles or 'bolsters'. Their function seems never to have been investigated, but experiments reported here show that when they elongate, the radular teeth become erect at the bending plane and splayed, presumably enhancing their ability to rake food particles into the pharynx. The bolsters of Octopus function as muscular hydrostats: because their volume is fixed, contraction of their powerful transverse muscles causes them to elongate. In decapods and in nautiloids each bolster contains a 'support rod' of semi-fluid material, as well as massive transverse musculature. This rod may elongate to erect the radular teeth. At the extreme front end of the bolsters in Octopus there are many nerve fibres that may constitute a receptor organ signalling the movements of the radula against hard material. Such nerves are absent from decapods and from octopods that do not bore holes. The buccal mass of Nautilus is massive, with heavily calcified tips to the beaks and a wide radular ribbon, with 13 rather than nine elements in each row. Nevertheless all the usual coleoid features are present in the radular apparatus and the teeth are formed and broken down in the same way. However, Nautilus has a unique structure, the radular appendage. This comprises a papillate mass extending over the palate in the mid-line and forming paired lateral masses that are in part secretory. The organ is attached to the front of the radula by muscles and connective tissue. Its function is unknown.
Proteins do not carry out their functions alone. Instead, they often act by participating in macromolecular complexes and play different functional roles depending on the other members of the complex. It is therefore interesting to identify co-complex relationships. Although protein complexes can be identified in a high-throughput manner by experimental technologies such as affinity purification coupled with mass spectrometry (APMS), these large-scale datasets often suffer from high false positive and false negative rates. Here, we present a computational method that predicts co-complexed protein pair (CCPP) relationships using kernel methods from heterogeneous data sources. We show that a diffusion kernel based on random walks on the full network topology yields good performance in predicting CCPPs from protein interaction networks. In the setting of direct ranking, a diffusion kernel performs much better than the mutual clustering coefficient. In the setting of SVM classifiers, a diffusion kernel performs much better than a linear kernel. We also show that combination of complementary information improves the performance of our CCPP recognizer. A summation of three diffusion kernels based on two-hybrid, APMS, and genetic interaction networks and three sequence kernels achieves better performance than the sequence kernels or diffusion kernels alone. Inclusion of additional features achieves a still better ROC50 of 0.937. Assuming a negative-to-positive ratio of 600∶1, the final classifier achieves 89.3% coverage at an estimated false discovery rate of 10%. Finally, we applied our prediction method to two recently described APMS datasets. We find that our predicted positives are highly enriched with CCPPs that are identified by both datasets, suggesting that our method successfully identifies true CCPPs. An SVM classifier trained from heterogeneous data sources provides accurate predictions of CCPPs in yeast. This computational method thereby provides an inexpensive method for identifying protein complexes that extends and complements high-throughput experimental data.
Many proteins perform their jobs as part of multi-protein units called complexes, and several technologies exist to identify these complexes and their components with varying precision and throughput. In this work, we describe and apply a computational framework for combining a variety of experimental data to identify pairs of yeast proteins that partipicate in a complex—so-called co-complexed protein pairs (CCPPs). The method uses machine learning to generalize from well-characterized CCPPs, making predictions of novel CCPPs on the basis of sequence similarity, tandem affinity mass spectrometry data, yeast two-hybrid data, genetic interactions, microarray expression data, ChIP-chip assays, and colocalization by fluorescence microscopy. The resulting model accurately summarizes this heterogeneous body of data: in a cross-validated test, the model achieves an estimated coverage of 89% at a false discovery rate of 10%. The final collection of predicted CCPPs is available as a public resource. These predictions, as well as the general methodology described here, provide a valuable summary of diverse yeast interaction data and generate quantitative, testable hypotheses about novel CCPPs.
Machine-learning tools have gained considerable attention during the last few years for analyzing biological networks for protein function prediction. Kernel methods are suitable for learning from graph-based data such as biological networks, as they only require the abstraction of the similarities between objects into the kernel matrix. One key issue in kernel methods is the selection of a good kernel function. Diffusion kernels, the discretization of the familiar Gaussian kernel of Euclidean space, are commonly used for graph-based data.
In this paper, we address the issue of learning an optimal diffusion kernel, in the form of a convex combination of a set of pre-specified kernels constructed from biological networks, for protein function prediction. Most prior work on this kernel learning task focus on variants of the loss function based on Support Vector Machines (SVM). Their extensions to other loss functions such as the one based on Kullback-Leibler (KL) divergence, which is more suitable for mining biological networks, lead to expensive optimization problems. By exploiting the special structure of the diffusion kernel, we show that this KL divergence based kernel learning problem can be formulated as a simple optimization problem, which can then be solved efficiently. It is further extended to the multi-task case where we predict multiple functions of a protein simultaneously. We evaluate the efficiency and effectiveness of the proposed algorithms using two benchmark data sets.
Results show that the performance of linearly combined diffusion kernel is better than every single candidate diffusion kernel. When the number of tasks is large, the algorithms based on multiple tasks are favored due to their competitive recognition performance and small computational costs.
In the last decade data fusion has become widespread in the field of metabolomics. Linear data fusion is performed most commonly. However, many data display non-linear parameter dependences. The linear methods are bound to fail in such situations. We used proton Nuclear Magnetic Resonance and Gas Chromatography-Mass Spectrometry, two well established techniques, to generate metabolic profiles of Cerebrospinal fluid of Multiple Sclerosis (MScl) individuals. These datasets represent non-linearly separable groups. Thus, to extract relevant information and to combine them a special framework for data fusion is required.
The main aim is to demonstrate a novel approach for data fusion for classification; the approach is applied to metabolomics datasets coming from patients suffering from MScl at a different stage of the disease. The approach involves data fusion in kernel space and consists of four main steps. The first one is to extract the significant information per data source using Support Vector Machine Recursive Feature Elimination. This method allows one to select a set of relevant variables. In the next step the optimized kernel matrices are merged by linear combination. In step 3 the merged datasets are analyzed with a classification technique, namely Kernel Partial Least Square Discriminant Analysis. In the final step, the variables in kernel space are visualized and their significance established.
We find that fusion in kernel space allows for efficient and reliable discrimination of classes (MScl and early stage). This data fusion approach achieves better class prediction accuracy than analysis of individual datasets and the commonly used mid-level fusion. The prediction accuracy on an independent test set (8 samples) reaches 100%. Additionally, the classification model obtained on fused kernels is simpler in terms of complexity, i.e. just one latent variable was sufficient. Finally, visualization of variables importance in kernel space was achieved.
Feature selection for classification in high-dimensional spaces can improve generalization, reduce classifier complexity, and identify important, discriminating feature “markers.” For support vector machine (SVM) classification, a widely used technique is recursive feature elimination (RFE). We demonstrate that RFE is not consistent with margin maximization, central to the SVM learning approach. We thus propose explicit margin-based feature elimination (MFE) for SVMs and demonstrate both improved margin and improved generalization, compared with RFE. Moreover, for the case of a nonlinear kernel, we show that RFE assumes that the squared weight vector 2-norm is strictly decreasing as features are eliminated. We demonstrate this is not true for the Gaussian kernel and, consequently, RFE may give poor results in this case. MFE for nonlinear kernels gives better margin and generalization. We also present an extension which achieves further margin gains, by optimizing only two degrees of freedom—the hyperplane’s intercept and its squared 2-norm—with the weight vector orientation fixed. We finally introduce an extension that allows margin slackness. We compare against several alternatives, including RFE and a linear programming method that embeds feature selection within the classifier design. On high-dimensional gene microarray data sets, University of California at Irvine (UCI) repository data sets, and Alzheimer’s disease brain image data, MFE methods give promising results.
Alzheimer’s; classifier margin; discriminant function; feature elimination; Gaussian kernel; margin maximization; medical imaging; microarray; magnetic resonance imaging (MRI); neurodegenerative; polynomial kernel; recursive feature elimination (RFE); support vector machine (SVM)
Accurate real-time prediction of respiratory motion is desirable for effective motion management in radiotherapy for lung tumor targets. Recently, nonparametric methods have been developed and their efficacy in predicting one-dimensional respiratory-type motion has been demonstrated. To exploit the correlation among various coordinates of the moving target, it is natural to extend the 1D method to multidimensional processing. However, the amount of learning data required for such extension grows exponentially with the dimensionality of the problem, a phenomenon known as the ‘curse of dimensionality’. In this study, we investigate a multidimensional prediction scheme based on kernel density estimation (KDE) in an augmented covariate–response space. To alleviate the ‘curse of dimensionality’, we explore the intrinsic lower dimensional manifold structure and utilize principal component analysis (PCA) to construct a proper low-dimensional feature space, where kernel density estimation is feasible with the limited training data. Interestingly, the construction of this lower dimensional representation reveals a useful decomposition of the variations in respiratory motion into the contribution from semiperiodic dynamics and that from the random noise, as it is only sensible to perform prediction with respect to the former. The dimension reduction idea proposed in this work is closely related to feature extraction used in machine learning, particularly support vector machines. This work points out a pathway in processing high-dimensional data with limited training instances, and this principle applies well beyond the problem of target-coordinate-based respiratory-based prediction. A natural extension is prediction based on image intensity directly, which we will investigate in the continuation of this work. We used 159 lung target motion traces obtained with a Synchrony respiratory tracking system. Prediction performance of the low-dimensional feature learning-based multidimensional prediction method was compared against the independent prediction method where prediction was conducted along each physical coordinate independently. Under fair setup conditions, the proposed method showed uniformly better performance, and reduced the case-wise 3D root mean squared prediction error by about 30–40%. The 90% percentile 3D error is reduced from 1.80 mm to 1.08 mm for 160 ms prediction, and 2.76 mm to 2.01 mm for 570 ms prediction. The proposed method demonstrates the most noticeable improvement in the tail of the error distribution.
Abstract Purpose: To describe the evolving computed tomography (CT) appearances of a cellulose surgical bolster used as a hemostatic agent in patients who undergo laparoscopic partial nephrectomy for renal cell carcinoma. Materials and methods: We retrospectively reviewed the follow-up CT studies of 33 patients with stage T1N0M0 renal carcinoma who underwent laparoscopic partial nephrectomy using a rolled, oxidized, regenerated cellulose sheet sutured in place as a bolster in the parenchymal defect. Thirteen patients undergoing laparoscopic partial nephrectomy without the use of a bolster were also evaluated to differentiate imaging features. Results: The bolster-related masses were significantly larger than those seen in the non-bolster patients. There was a decrease in size of the post-operative bolster-related mass with time. The bolster shape evolved with time, initially appearing oval, and becoming irregular with decreasing size. Equivocal increase in attenuation of 10–20 HU was seen in 6 patients. Increase in attenuation of greater than 20 HU was seen in 3 patients. There was no evidence of tumor recurrence in any of the patients. Invagination of fat was seen in two bolster-related masses at 18 months or greater. Conclusions: Cellulose bolster has a variable appearance on follow-up CT exams. Evolutionary features include reduction in bolster size and shape with time leading finally to non-visualization. Bolster enhancement can mimic abscesses and tumor recurrence.
Renal; carcinoma; laparoscopic; bolster
Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the GO terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, both of which can not estimate the individual discriminative abilities of the three aspects of gene ontology.
In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse affect of the potential false GO terms that are resulted from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time & space computational complexities are greatly reduced when compared to the complicated semi-definite programming and semi-indefinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate GO-TLM performance against three baseline models: MultiLoc, MultiLoc-GO and Euk-mPLoc on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves substantial accuracy improvement against the baseline models: 80.38% against model Euk-mPLoc 67.40% with 12.98% substantial increase; 96.65% and 96.27% against model MultiLoc-GO 89.60% and 89.60%, with 7.05% and 6.67% accuracy increase on dataset MultiLoc plant and dataset MultiLoc animal, respectively; 97.14%, 95.90% and 96.85% against model MultiLoc-GO 83.70%, 90.10% and 85.70%, with accuracy increase 13.44%, 5.8% and 11.15% on dataset BaCelLoc plant, dataset BaCelLoc fungi and dataset BaCelLoc animal respectively. For BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on dataset BaCelLoc plant holdout, dataset BaCelLoc plant holdout and dataset BaCelLoc animal holdout, respectively, as compared against baseline model MultiLoc-GO 76%, 60.00% and 73.00%, with accuracy increase 5.25%, 20.45% and 6.46%, respectively.
Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based GO term transfer and explicitly weighing the GO kernels substantially improve the prediction performance.
Traditional approaches to the problem of parameter estimation in biophysical models of neurons and neural networks usually adopt a global search algorithm (for example, an evolutionary algorithm), often in combination with a local search method (such as gradient descent) in order to minimize the value of a cost function, which measures the discrepancy between various features of the available experimental data and model output. In this study, we approach the problem of parameter estimation in conductance-based models of single neurons from a different perspective. By adopting a hidden-dynamical-systems formalism, we expressed parameter estimation as an inference problem in these systems, which can then be tackled using a range of well-established statistical inference methods. The particular method we used was Kitagawa's self-organizing state-space model, which was applied on a number of Hodgkin-Huxley-type models using simulated or actual electrophysiological data. We showed that the algorithm can be used to estimate a large number of parameters, including maximal conductances, reversal potentials, kinetics of ionic currents, measurement and intrinsic noise, based on low-dimensional experimental data and sufficiently informative priors in the form of pre-defined constraints imposed on model parameters. The algorithm remained operational even when very noisy experimental data were used. Importantly, by combining the self-organizing state-space model with an adaptive sampling algorithm akin to the Covariance Matrix Adaptation Evolution Strategy, we achieved a significant reduction in the variance of parameter estimates. The algorithm did not require the explicit formulation of a cost function and it was straightforward to apply on compartmental models and multiple data sets. Overall, the proposed methodology is particularly suitable for resolving high-dimensional inference problems based on noisy electrophysiological data and, therefore, a potentially useful tool in the construction of biophysical neuron models.
Parameter estimation is a problem of central importance and, perhaps, the most laborious task in biophysical modeling of neurons and neural networks. An emerging trend is to treat parameter estimation in this context as yet another statistical inference problem, which can be tackled using well-established methods from Computational Statistics. Inspired by these recent advances, we adopted a self-organizing state-space-model approach augmented with an adaptive sampling algorithm akin to the Covariance Matrix Adaptation Evolution Strategy in order to estimate a large number of parameters in a number of Hodgkin-Huxley-type models of single neurons. Parameter estimation was based on noisy electrophysiological data and involved the maximal conductances, reversal potentials, levels of noise and, unlike most mainstream work, the kinetics of ionic currents in the examined models. Our main conclusion was that parameters in complex, conductance-based neuron models can be inferred using the aforementioned methodology, if sufficiently informative priors regarding the unknown model parameters are available. Importantly, the use of an adaptive algorithm for sampling new parameter vectors significantly reduced the variance of parameter estimates. Flexibility and scalability are additional advantages of the proposed method, which is particularly suited to resolve high-dimensional inference problems.
The virtual screening of large compound databases is an important application of structural-activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine learning based QSAR models, which rely on a specific training set, to give reliable results for all compounds. Thus, it is important to consider the subset of the chemical space in which the model is applicable. The approaches to this problem that have been published so far mostly use vectorial descriptor representations to define this domain of applicability of the model. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model.
We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of the training of a kernel-based QSAR model using support vector regression and the ranking of a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model for the respective compound is quantitatively described using a score obtained by an applicability domain formulation. The suitability of the applicability domain estimation is evaluated by comparing the model performance on the subsets of the screening data sets obtained by different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of the chemspace, in which the model gives reliable predictions, from the part consisting of structures too dissimilar to the training set to apply the model successfully. A closer inspection reveals that the virtual screening performance of the model is considerably improved if half of the molecules, those with the lowest applicability scores, are omitted from the screening.
The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered as a drawback, because the results indicate that, in most cases, these omitted ligands would not be found by the model anyway.
Dengue virus infection causes a wide spectrum of illness, ranging from sub-clinical to severe disease. Severe dengue is associated with sequential viral infections. A strict definition of primary versus secondary dengue infections requires a combination of several tests performed at different stages of the disease, which is not practical.
Methods and Findings
We developed a simple method to classify dengue infections as primary or secondary based on the levels of dengue-specific IgG. A group of 109 dengue infection patients were classified as having primary or secondary dengue infection on the basis of a strict combination of results from assays of antigen-specific IgM and IgG, isolation of virus and detection of the viral genome by PCR tests performed on multiple samples, collected from each patient over a period of 30 days. The dengue-specific IgG levels of all samples from 59 of the patients were analyzed by linear discriminant analysis (LDA), and one- and two-dimensional classifiers were designed. The one-dimensional classifier was estimated by bolstered resubstitution error estimation to have 75.1% sensitivity and 92.5% specificity. The two-dimensional classifier was designed by taking also into consideration the number of days after the onset of symptoms, with an estimated sensitivity and specificity of 91.64% and 92.46%. The performance of the two-dimensional classifier was validated using an independent test set of standard samples from the remaining 50 patients. The classifications of the independent set of samples determined by the two-dimensional classifiers were further validated by comparing with two other dengue classification methods: hemagglutination inhibition (HI) assay and an in-house anti-dengue IgG-capture ELISA method. The decisions made with the two-dimensional classifier were in 100% accordance with the HI assay and 96% with the in-house ELISA.
Once acute dengue infection has been determined, a 2-D classifier based on common dengue virus IgG kits can reliably distinguish primary and secondary dengue infections. Software for calculation and validation of the 2-D classifier is made available for download.