
Results 1-6 (6)

1.  How informative is your kinetic model?: using resampling methods for model invalidation 
BMC Systems Biology  2014;8:61.
Kinetic models can present mechanistic descriptions of molecular processes within a cell. They can be used to predict the dynamics of metabolite production, signal transduction or transcription of genes. Although there has been tremendous effort in constructing kinetic models for different biological systems, not much effort has been put into their validation. In this study, we introduce the concept of resampling methods for the analysis of kinetic models and present a statistical model invalidation approach.
We based our invalidation approach on the evaluation of a kinetic model’s predictive power through cross validation and forecast analysis. As a reference point for this evaluation, we used the predictive power of an unsupervised data analysis method which does not make use of any biochemical knowledge, namely Smooth Principal Components Analysis (SPCA), on the same test sets. Through a simulation study, we showed that overly simple mechanistic descriptions can be invalidated using our SPCA-based comparative approach unless the experimental data contain a high amount of noise. We also applied our approach to an eicosanoid production model developed for humans and concluded that the model could not be invalidated using the available data, despite the simplicity of its reaction kinetics formulation. Furthermore, as another realistic demonstration of our method, we analysed the high osmolarity glycerol (HOG) pathway in yeast to question the validity of an existing model.
With this study, we have demonstrated the potential of two resampling methods, cross validation and forecast analysis, for assessing the validity of kinetic models. Our approach is easy to grasp and to implement, is applicable to any ordinary differential equation (ODE) type biological model, and does not suffer from the computational difficulties that seem to be a common problem for approaches proposed for similar purposes. Matlab files needed for invalidation using SPCA cross validation and our toy model in SBML format are provided at
PMCID: PMC4046068  PMID: 24886662
Model invalidation; Kinetic models; ODE; Differential equations; Smooth principal components analysis; SPCA; PCA; Resampling; Cross validation; Forecast analysis
2.  Inferring protein–protein interaction complexes from immunoprecipitation data 
BMC Research Notes  2013;6:468.
Protein–protein interactions in cells are widely explored using small-scale experiments. However, the search for protein complexes and their interactions in data from high-throughput experiments such as immunoprecipitation is still a challenge. We present "4N", a novel method for detecting protein complexes in such data. Our method is a heuristic algorithm based on Near Neighbor Network (3N) clustering. It is written in R, is faster than model-based methods, and has only a small number of tuning parameters. We describe the application of our new method to real immunoprecipitation results and to two artificial datasets. We show that the method can infer protein complexes from protein immunoprecipitation datasets of different densities and sizes.
4N was applied to the immunoprecipitation dataset that was presented by the authors of the original 3N in Cell 145:787–799, 2011. The test shows that our method can reproduce the original clustering results with fewer manually adapted parameters and, in addition, gives direct insight into the complex–complex interactions. We also tested 4N on the human "Tip49a/b" dataset. We conclude that 4N can handle the contaminants and can correctly infer complexes from this very dense dataset. Further tests were performed on two artificial datasets of different sizes. We showed that the method predicts the reference complexes in the two artificial datasets with high accuracy, even when the number of samples is reduced.
4N has been implemented in R. We provide the source code of 4N and a user-friendly toolbox including two example calculations. Biologists can use this 4N-toolbox even if they have limited knowledge of R. There are only a few tuning parameters to set, and each of these parameters has a biological interpretation. The run times for medium-scale datasets are on the order of minutes on a standard desktop PC. Large datasets can typically be analyzed within a few hours.
PMCID: PMC3874675  PMID: 24237943
Protein–protein interactions; Proteomics; Protein complexes; Immunoprecipitation
3.  To aggregate or not to aggregate high-dimensional classifiers 
BMC Bioinformatics  2011;12:153.
High-throughput functional genomics technologies generate large amounts of data, with hundreds or thousands of measurements per sample. The number of samples is usually much smaller, on the order of tens or hundreds. This poses statistical challenges and calls for appropriate solutions for the analysis of this kind of data.
Principal component discriminant analysis (PCDA), an adaptation of classical linear discriminant analysis (LDA) for high-dimensional data, was selected as an example of a base learner. The multiple versions of PCDA models obtained from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The performance of this approach was evaluated on simulated, genomics, proteomics and metabolomics data sets.
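The aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for PCDA, plain random subsampling stands in for the repeated double cross-validation, and all function names and parameters (`fit_nearest_centroid`, `aggregate_predict`, `n_models`, `frac`) are hypothetical. Only the majority-voting idea is taken from the abstract.

```python
import random
from collections import Counter


def fit_nearest_centroid(X, y):
    """Stand-in base learner (NOT PCDA): map each class label to its centroid."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(model, row):
    """Return the label of the centroid closest to `row` (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: dist2(model[label]))


def aggregate_predict(X, y, new_rows, n_models=25, frac=0.8, seed=0):
    """Train `n_models` base learners on random subsamples and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(X)
    k = max(2, int(frac * n))
    models = []
    for _ in range(n_models):
        idx = rng.sample(range(n), k)  # subsample without replacement
        models.append(fit_nearest_centroid([X[i] for i in idx], [y[i] for i in idx]))
    predictions = []
    for row in new_rows:
        votes = Counter(predict_one(model, row) for model in models)
        predictions.append(votes.most_common(1)[0][0])  # majority vote
    return predictions


X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y = ["a", "a", "a", "b", "b", "b"]
print(aggregate_predict(X, y, [[0.5, 0.5], [10.5, 10.5]]))
```

The spread of the individual votes for a sample gives a rough indication of how much the base models disagree, which is one way to look at model variability.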
The aggregating PCDA learner can improve prediction performance, provide more stable results, and help quantify the variability of the models. The disadvantages and limitations of aggregating are also discussed.
PMCID: PMC3113942  PMID: 21569498
4.  Crossfit analysis: a novel method to characterize the dynamics of induced plant responses 
BMC Bioinformatics  2009;10:425.
Many plant species show induced responses that protect them against exogenous attacks. These responses involve the production of many different bioactive compounds. Plant species belonging to the Brassicaceae family produce defensive glucosinolates, which may greatly influence their favorable nutritional properties for humans. Each responding compound may have its own dynamic profile and metabolic relationships with other compounds. The chemical background of the induced response is therefore highly complex, and a single model may not reveal all the properties of the response.
This study therefore aims to describe the dynamics of the glucosinolate response, measured at three time points after induction in a feral Brassica, by a three-faceted approach, based on Principal Component Analysis. First the large-scale aspects of the response are described in a 'global model' and then each time-point in the experiment is individually described in 'local models' that focus on phenomena that occur at specific moments in time. Although each local model describes the variation among the plants at one time-point as well as possible, the response dynamics are lost. Therefore a novel method called the 'Crossfit' is described that links the local models of different time-points to each other.
Each element of the described analysis approach reveals different aspects of the response. The crossfit shows that smaller dynamic changes may occur in the response that are overlooked by global models, as illustrated by the analysis of a metabolic profiling dataset of the same samples.
PMCID: PMC3087346  PMID: 20015363
5.  Classification-based comparison of pre-processing methods for interpretation of mass spectrometry generated clinical datasets 
Proteome Science  2009;7:19.
Mass spectrometry is increasingly being used to discover proteins or protein profiles associated with disease. Experimental design of mass-spectrometry studies has come under close scrutiny and the importance of strict protocols for sample collection is now understood. However, the question of how best to process the large quantities of data generated is still unanswered. The main challenges for the analysis are the choice of proper pre-processing and classification methods. While these two issues have been investigated in isolation, we propose to use the classification of patient samples as a clinically relevant benchmark for the evaluation of pre-processing methods.
Two in-house generated clinical SELDI-TOF MS datasets are used in this study as an example of high throughput mass-spectrometry data. We perform a systematic comparison of two commonly used pre-processing methods as implemented in Ciphergen ProteinChip Software and in the Cromwell package. With respect to reproducibility, Ciphergen and Cromwell pre-processing are largely comparable. We find that the overlap between peaks detected by either Ciphergen ProteinChip Software or Cromwell is large. This is especially the case for the more stringent peak detection settings. Moreover, similarity of the estimated intensities between matched peaks is high.
We evaluate the pre-processing methods using five different classification methods. Classification is done in a double cross-validation protocol using repeated random sampling to obtain an unbiased estimate of classification accuracy. No pre-processing method significantly outperforms the other for all peak detection settings evaluated.
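A double cross-validation protocol with repeated random sampling can be sketched as follows. This is an illustrative outline under stated assumptions, not the protocol used in the study: a k-nearest-neighbour classifier stands in for the five classification methods, the tuned parameter is the number of neighbours, and every name and default (`double_cross_validation`, `ks`, `n_outer`, `n_inner`, `test_frac`) is hypothetical. The point it shows is that the parameter is chosen in the inner loop using only the non-test samples, so the outer test samples never influence model selection.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, row, k):
    """Stand-in classifier: majority label among the k nearest training samples."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], row)),
    )
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test samples classified correctly."""
    hits = sum(knn_predict(train_X, train_y, r, k) == t for r, t in zip(test_X, test_y))
    return hits / len(test_y)


def random_split(indices, test_frac, rng):
    """Randomly split indices into (training part, held-out part)."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(test_frac * len(shuffled))))
    return shuffled[n_test:], shuffled[:n_test]


def double_cross_validation(X, y, ks=(1, 3), n_outer=20, n_inner=10,
                            test_frac=0.25, seed=0):
    """Mean outer-loop accuracy; k is selected in the inner loop only."""
    rng = random.Random(seed)

    def take(idx):
        return [X[i] for i in idx], [y[i] for i in idx]

    outer_scores = []
    for _ in range(n_outer):
        rest, test = random_split(range(len(X)), test_frac, rng)
        # Inner loop: choose k without touching the outer test samples.
        inner_scores = {k: 0.0 for k in ks}
        for _ in range(n_inner):
            train, val = random_split(rest, test_frac, rng)
            for k in ks:
                inner_scores[k] += accuracy(*take(train), *take(val), k)
        best_k = max(ks, key=lambda k: inner_scores[k])
        # Outer loop: assess the selected setting on the held-out test samples.
        outer_scores.append(accuracy(*take(rest), *take(test), best_k))
    return sum(outer_scores) / len(outer_scores)
```

In a comparison of pre-processing methods, such a function would be run once per method and per parameter setting on the corresponding peak table, and the resulting accuracy estimates compared.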
We use classification of patient samples as a clinically relevant benchmark for the evaluation of pre-processing methods. Both pre-processing methods lead to similar classification results on an ovarian cancer and a Gaucher disease dataset. However, the settings for pre-processing parameters lead to large differences in classification accuracy and are therefore of crucial importance. We advocate the evaluation over a range of parameter settings when comparing pre-processing methods. Our analysis also demonstrates that reliable classification results can be obtained with a combination of strict sample handling and a well-defined classification protocol on clinical samples.
PMCID: PMC2689848  PMID: 19442271
6.  Centering, scaling, and transformations: improving the biological information content of metabolomics data 
BMC Genomics  2006;7:142.
Extracting relevant biological information from large data sets is a major challenge in functional genomics research. Different aspects of the data hamper their biological interpretation. For instance, 5000-fold differences in concentration for different metabolites are present in a metabolomics data set, while these differences are not proportional to the biological relevance of these metabolites. However, data analysis methods are not able to make this distinction. Data pretreatment methods can correct for aspects that hinder the biological interpretation of metabolomics data sets by emphasizing the biological information in the data set and thus improving their biological interpretability.
Different data pretreatment methods, i.e. centering, autoscaling, pareto scaling, range scaling, vast scaling, log transformation, and power transformation, were tested on a real-life metabolomics data set. They were found to greatly affect the outcome of the data analysis and thus the ranking of the metabolites that are most important from a biological point of view. Furthermore, the stability of the ranking, the influence of technical errors on data analysis, and the preference of data analysis methods for selecting highly abundant metabolites were affected by the data pretreatment method used prior to data analysis.
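The pretreatment methods listed above can be sketched for a single metabolite measured across samples. The formulas below are the commonly used definitions and are an assumption here, since the abstract does not state them: the function name `pretreat`, the base-10 logarithm, and the square root as the power transformation are all illustrative choices.

```python
import math
import statistics


def pretreat(values, method):
    """Pretreat the measurements of ONE metabolite across samples.

    Assumed definitions (not given in the abstract):
      centering    x - mean
      autoscaling  (x - mean) / sd
      pareto       (x - mean) / sqrt(sd)
      range        (x - mean) / (max - min)
      vast         ((x - mean) / sd) * (mean / sd)
      log          log10(x), then centered (requires x > 0)
      power        sqrt(x), then centered (requires x >= 0)
    """
    if method == "log":
        values = [math.log10(v) for v in values]
    elif method == "power":
        values = [math.sqrt(v) for v in values]

    mean = statistics.fmean(values)
    centered = [v - mean for v in values]
    if method in ("centering", "log", "power"):
        return centered

    sd = statistics.stdev(values)  # sample standard deviation
    if method == "autoscaling":
        return [c / sd for c in centered]
    if method == "pareto":
        return [c / math.sqrt(sd) for c in centered]
    if method == "range":
        return [c / (max(values) - min(values)) for c in centered]
    if method == "vast":
        return [(c / sd) * (mean / sd) for c in centered]
    raise ValueError(f"unknown pretreatment method: {method}")


concentrations = [1.0, 10.0, 100.0]
for name in ("centering", "autoscaling", "pareto", "range", "vast", "log", "power"):
    print(name, pretreat(concentrations, name))
```

Applying the function column by column to a samples-by-metabolites table shows how scaling changes the relative weight of abundant and scarce metabolites before a method such as PCA is applied.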
Different pretreatment methods emphasize different aspects of the data and each pretreatment method has its own merits and drawbacks. The choice of a pretreatment method depends on the biological question to be answered, the properties of the data set and the data analysis method selected. For the explorative analysis of the validation data set used in this study, autoscaling and range scaling performed better than the other pretreatment methods. That is, range scaling and autoscaling were able to remove the dependence of the rank of the metabolites on the average concentration and the magnitude of the fold changes and showed biologically sensible results after principal component analysis (PCA).
In conclusion, selecting a proper data pretreatment method is an essential step in the analysis of metabolomics data and greatly affects which metabolites are identified as the most important.
PMCID: PMC1534033  PMID: 16762068
