Home | About | Journals | Submit | Contact Us | Français |

**|**BMC Bioinformatics**|**v.11; 2010**|**PMC2887810

Formats

Article sections

- Abstract
- Background
- Results
- Discussion and Conclusions
- Methods
- Authors' contributions
- Supplementary Material
- References

Authors

Related links

BMC Bioinformatics. 2010; 11: 282.

Published online 2010 May 26. doi: 10.1186/1471-2105-11-282

PMCID: PMC2887810

Martin Eklund: es.uu.oibmraf@dnulke.nitram; Ola Spjuth: es.uu.oibmraf@htujps.alo; Jarl ES Wikberg: es.uu.oibmraf@grebkiw.lraj

Received 2010 February 10; Accepted 2010 May 26.

Copyright ©2010 Eklund et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

The omics fields promise to revolutionize our understanding of biology and biomedicine. However, their potential is compromised by the challenge to analyze the huge datasets produced. Analysis of omics data is plagued by the curse of dimensionality, resulting in imprecise estimates of model parameters and performance. Moreover, the integration of omics data with other data sources is difficult to shoehorn into classical statistical models. This has resulted in *ad hoc *approaches to address specific problems.

We present a general approach to omics data analysis that alleviates these problems. By combining eScience and Bayesian methods, we retrieve scientific information and data from multiple sources and coherently incorporate them into large models. These models improve the accuracy of predictions and offer new insights into the underlying mechanisms. This "eScience-Bayes" approach is demonstrated in two proof-of-principle applications, one for breast cancer prognosis prediction from transcriptomic data and one for protein-protein interaction studies based on proteomic data.

Bayesian statistics provide the flexibility to tailor statistical models to the complex data structures in omics biology as well as permitting coherent integration of multiple data sources. However, Bayesian methods are in general computationally demanding and require specification of possibly thousands of prior distributions. eScience can help us overcome these difficulties. The eScience-Bayes thus approach permits us to fully leverage on the advantages of Bayesian methods, resulting in models with improved predictive performance that gives more information about the underlying biological system.

High-throughput experimental methods, including DNA and protein microarrays and other omics techniques, have become ubiquitous, indispensable tools in biology and biomedicine. The number of high-throughput technologies is constantly increasing. They provide the power to measure thousands of features of a biological system in a single experiment, and they have the potential to revolutionize our understanding of biology and medicine. However, the high expectations for omics methods have fallen short of realization, due to the challenges the data present for statistical modeling. Thus, the wealth of data produced is difficult to translate into concrete biological knowledge, new drugs, and clinical practices [1,2]. A recurring problem is that few experimental samples are generated relative to the number of model parameters [1]. This leads to imprecise parameter and performance estimates, and models prone to overfitting (the 'curse of dimensionality'). In fact, it is difficult to even obtain a trustworthy measure of *how *imprecise performance estimates are with standard techniques like confidence intervals based on holdout estimates in cross-validation or bootstrapping [3,4]. Another problem is that classical statistical models are too restrictive to account for the complexities of integrating omics data with previously acquired data. This has forced us to resort to *ad hoc *approaches for solving very specific problems [5-7]. Statistical approaches that use prior distributions over the model parameters are known as Bayesian methods [8,9]. The prior distributions summarize one's belief about the parameters before seeing the data. After performing an experiment, the parameters are updated to a posterior distribution according to Bayes' theorem:

The posterior distribution reflects the contribution of both our prior beliefs and the experimental data (through the likelihood function).

Bayesian methods have a number of advantages that allow us to address the problems inherent in omics data analysis. Firstly, they afford formal and coherent incorporation of prior information and integration of data sources. Apart from the philosophical appeal of being able to use all available information when analyzing a problem, this allows us to tackle the curse of dimensionality in two ways: (1) Prior information can be used to reduce model dimensionality with Bayesian variable selection and regularization, and (2) data from different research studies can be coherently combined to increase the number of available observations. Bayesian modeling thus allows us to "borrow information" across studies [8]. Secondly, Bayesian methods correctly summarize the model's predictive distribution [10]. Thus, we can obtain reliable Bayesian confidence intervals (often called credible intervals) of the estimates of model performance in small-sample problems; this avoids the problem of having to rely on imprecise confidence intervals based on holdout estimates from cross-validation or bootstrapping [4]. Thirdly, Bayesian methods may be applied to problems with structures too complex for classical statistical methods [11]. Bayesian models can be made increasingly elaborate to accommodate for the integration of prior information and omics data from multiple data sources.

However, the advantages of Bayesian modeling come at a price. We have to specify prior distributions for all parameters in a model. For large models, like those for analyzing omics data, it is unrealistic to expect practitioners to manually collect prior information for all individual parameters. In practice, this limits the powerful advantage of using available prior knowledge. In addition, Bayesian methods require numerical handling of analytically intractable integrals. This is typically performed with Markov chain Monte Carlo (MCMC) methods, which are computationally very expensive and therefore restricts the size of the modeling problems that we can address.

Recent developments in distributed information and computational resources have given rise to the notion of eScience, i.e. computationally intensive science that is carried out in highly distributed network environments [12]. Advances in eScience can be roughly grouped into: (1) developments in semantic technologies (e.g. data standardization and ontologization) and methods for retrieving standardized data (e.g. Web services); and (2) construction of high-performance computing (HPC) facilities and development of middleware for using the HPCs. These two aspects of eScience equip us for efficient use of the Bayesian approach to omics data analysis. Standardized data and Web services allow us to use machines to harvest the Internet for information to use in prior distribution specifications, and HPC resources provide the computational power required to fit large Bayesian models to high-throughput data (Fig. (Fig.1).1). In this paper we demonstrate how eScience permit us to leverage on the advantages of Bayesian analysis when modeling omics data. We show by two examples that the approach can improve predictive performance, and that it permits a deeper analysis of the data by accommodating more complex model structures.

Microarray gene expression profiling has shown promise for supporting the prognostication of breast cancer patients based on the expression pattern of specific gene sets ('signatures'). A number of studies have reported different signatures [13-16]. However, concerns have been raised against the prognostic capacity of these signatures when applied to new data [17,18]. These concerns include the following:

1. Signatures from different studies share almost no genes, which indicates that the signatures depend to a large degree on the datasets rather than being prognostic for breast cancer [17]. In fact, Michiels *et al*. [18] showed that, even *within *a study, the subset of patients used for signature derivation strongly influenced which genes were selected.

2. The reported performance estimates of the signatures have been questioned [18]; this suggests that worse performance is likely to be obtained when applied to new data. Also, the confidence intervals of the performance estimates (if reported at all) have been disputed [4,18]. This makes it difficult to assess the level of accuracy in the reported results.

In fact, both these concerns arise due to the small sample sizes typically used in microarray studies relative to the number of profiled genes (i.e. the curse of dimensionality). We employed the eScience-Bayes strategy described in Fig. Fig.11 to address this problem. In accordance with the practice in most previous articles, we modeled the development of distant metastases within five years. To increase the sample size, we used the Gene Expression Omnibus (GEO) Web service [19] to collect data from five previously published breast cancer studies [13-16,20]. To reduce the model dimensionality we employed a Bayesian variable selection procedure, giving higher prior probabilities of including a variable (gene) the higher our prior belief that the gene affects development of distant metastases. The prior belief for each gene to affect development of distant metastases was based on information retrieved by connecting three Web services: NetPath http://www.netpath.org/, DictService http://services.aonaware.com/ and Entrez Utilities http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html. NetPath is a curated resource of genes reported to be transcriptionally regulated by cancer-signalling pathways. All genes on the HG-U133A array listed in NetPath were used to text mine PubMed for all free fulltext articles where the gene names were mentioned in combination with breast cancer. To reduce the number of spurious hits, we discarded genes with names that represent English words by using the dictionary definition Web service DictService. This gave us a list of integers, associated with the genes on the HG-U133A array representing the number of times a cancer-pathway regulated gene is mentioned together with breast cancer in the literature (all genes not present in NetPath were assigned the number 0). These informative prior distributions were thus based on the assumption that the probability of a gene being related to breast cancer was reflected in the number of times a gene reported in NetPath was mentioned in combination with breast cancer in PubMed articles. A work flow of the procedure is shown in Fig. Fig.22 (further details are given in Methods). The five datasets and the prior distributions were incorporated in a multilevel Bayesian probit model (**Model I**, see Methods). To assess the discriminative power of Model I, we fitted it five times; each time using three of the datasets as training sets and the remaining two datasets as independent testsets (Fig. (Fig.3a,3a, red curve, and b; Additional file 1, Fig. S1 and S2; Table S1). It may be noted that Model I achieved high prediction accuracy on independent test data produced in different research groups. To the best of our knowledge, this is the first time that a modeling approach has demonstrated consistently good predictive ability across multiple independent testsets for predicting breast cancer metastasis development from gene expression data.

To validate the value of using prior information, we tested Model I with non-informative prior distributions (i.e. we assigned equal prior probability for all variables to be included in the variable selection; see Methods for further details) and compared the results to those obtained with informative priors (Fig. (Fig.3a,3a, green curve; Additional file 1, Fig. S1, green curves; Table S1). This clearly showed that the use of relevant prior information significantly improves the classification accuracy of Model I (Bayesian P-value < 0.001, related to the curves in Fig. Fig.3a).3a). The dramatic improvement was due to the fact that the prior information focused the model to include relevant genes in the signature, thus reducing the risk of detecting spurious relationships and overfitting the model. Analogously, the use of prior information enforced consistency in signature selection across disparate datasets: 15.2% of the genes were selected all five times Model I with informative priors was fit, which may be compared to 1% when noninformative priors were used, and 0% among the original signatures. The median of the pairwise overlap between two different model fits was 27.5%, compared to 3% among the original signatures and 12% reported in Chuang *et al*. [6].

We also assessed the advantage of using multiple datasets by using only single datasets for training Model I (with informative priors); results are shown in Fig. Fig.3a3a (blue curve), Additional file 1, Fig. S1 (blue curves) and Table S1. A significant improvement was found when multiple datasets were used for model training compared to when only a single dataset was used (Bayesian P-value < 0.001, related to the curves in Fig. Fig.3a).3a). In a final comparison, we modified Model I to use only the expression data that corresponded to the respective gene signatures derived in Wang *et al*. [13], Miller *et al*. [14], Sotiriou *et al*. [15], and Pawitan *et al*. [16], together with noninformative priors (only the genes printed on the HG-U133A were used for the Miller *et al*. [14] and Pawitan *et al*. [16] signatures). Again, we found that Model I (with informative priors) performed significantly better than when the signatures from Wang *et al*. [13], Miller *et al*. [14], Sotiriou *et al*. [15], and Pawitan *et al*. [16] were used. Results are displayed in Additional file 1, Fig. S4 and Table S1. To derive a "final" gene signature, we fitted Model I using all five datasets and informative priors. Some genes were included in this signature that *a priori *had been regarded as non-relevant to breast cancer; these genes were strongly supported by the data and may be of interest for future experimental investigations. The genes included in the signature are shown in Additional file 1, Table S2, stratified into genes that were relevant *a priori *as well as *a posteriori*, and genes that were regarded non-relevant *a priori *but relevant *a posteriori*.

In order to further demonstrate the eScience-Bayes approach, we used it to create a model that predicted the probability of distant metastasis development as a function of time. This model would provide finer granularity in the prognosis predictions of future patients compared to the current practice of stratifying patients into only two classes (metastases or no metastases within five years). Patient stratification into two classes discards information by discretizing a continuous variable (time to development of distant metastasis) into an arbitrarily defined binary variable (yes/no development of distant metastasis within five years). Because not all patients had developed distant metastases before the last follow-up in the studies, this required a consideration of the censoring of the data (Methods). We created a multilevel Bayesian accelerated failure time (AFT) (Methods; Sha *et al*. [21]) model with the expression data as explanatory variables and the informative prior distributions (**Model II**, see Methods for details). Analogously to the assessment of Models I, we assessed Model II by fitting it five times, each time using three of the datasets as a training set and the remaining two datasets as test sets to assess the performance of Model II. The results implied that we indeed could provide more fine-grained prognoses predictions, compared to the binary classification afforded with Model I (Fig. (Fig.3c3c).

In this example, we provided evidence for that adopting the eScience-Bayes approach can significantly improve consistency in gene signature selection and in the predictive performance of the constructed models. In contrast to previous studies, our results were presented together with Bayesian confidence intervals, which permit assessment of the level of accuracy in the results [4].

PDZ domains mediate protein-protein interactions and have been extensively studied over the last fifteen years. Recently, a large-scale PDZ domain-peptide interaction dataset was published in Stiffler *et al*. [22] and follow-up studies were later described in Chen *et al*. [23]. The combined data from these two publications comprise 2306 observations of interactions between peptides and PDZ domains from mouse, *C. elegans*, and *D. melanogaster*. We reanalyzed these data using the eScience-Bayes approach. To this end, we modeled the PDZ domain-peptide interactions using a multilevel Bayesian probit model with physicochemical properties of the PDZ domains and peptides as explanatory variables and a binary variable as response, indicating whether or not a PDZ domain and a peptide bound to each other (**Model III**). We derived informative prior distributions from 3D structural information using the Sequence Annotated by Structure [24] (SAS), a tool for annotating a protein sequence with structural information based on all solved 3D structures of the proteins in the Protein Data Bank [25] (PDB). Using the SAS Web service [26] (WSsas) we estiamted the number of contacts made between residues in each PDZ domain and in the peptide ligands. Similarly to the breast cancer demonstration, we used the informative prior distributions to "guide" the variable selection in order to reduce the dimensionality of the model based on the assumption that the importance of a PDZ domain residue for the interaction is reflected in the number of contacts it makes with the ligand (see Methods for details). Fig. Fig.44 shows the work flow for the prior elicitation and modeling. Fig. Fig.5a5a summarizes the predictive performance of Model III in a receiver operating characteristics (ROC, see Methods) curve based on a 10-fold cross-validation, and shows a comparison with the predictive performance when noninformative prior distributions were used (i.e. equal prior probability for all variables to be included in the variable selection; see Methods). We observed a nonsignificant (Bayesian P-value = 0.11) improvement in the predictive performance with informative compared to noninformative priors (Fig. (Fig.5a5a).

In this example, the main advantage of adopting an eScience-Bayes approach was the increase in information that we obtained. For example, Model III enabled an analysis of the differences between PDZ domains. By allowing the regression coefficients to vary between PDZ domains (Methods), we found differences in the estimated nonzero coefficients for different PDZ domains (Fig. (Fig.5b).5b). This suggests that there may be variations among the PDZ domains in the networks of residues that are important for interacting with ligands [27]. Interestingly, some of these residues are located far away from the binding site. Although the tertiary structures of PDZ domains are very similar, their primary structures vary substantially [28]. This may cause differences in the intramolecular interactions within different PDZ domains. Networks of such intramolecular interaction have previously been reported for PDZ domains [28,29]. The results from studies by Gianni *et al*. [29] and Chi *et al*. [30] support the notion that the networks may differ among PDZ domains; for example, a network present in a PDZ domain from mouse tyrosine phosphatase BL was not found in the human PSD-95 PDZ3. The eScience-Bayes approach allowed analysis of these differences across a large set of PDZ domains.

Model III further indicated that the PDZ domains from *C. elegans *and *D. melanogaster *may form two clusters (Fig. (Fig.5b).5b). Given the small datasets from *C. elegans *and *D. melanogaster*, which contain only seven PDZ domains each (Fig. (Fig.4),4), the clustering might be a sampling or an experimental artifact. Nevertheless, it is tantalizing to hypothesize that there are general trends in the intramolecular interaction networks in PDZ domains that differ between species. These observed differences are further illustrated in Fig Fig5c,5c, where five residues, estimated to be allosterically linked to residues in the binding peptide, are shown in one representative PDZ domain from mouse, *C. elegans*, and *D. melanogaster*. As shown in the figure, different amino acids, with different positions and physicochemical properties, are estimated to be allosterically coupled to the binding peptide in the three different PDZ domains.

This example shows that an eScience-Bayes approach to modeling omics data can straightforwardly accommodate grouping structures in the data. This enables the analysis to reveal tentative differences among PDZ domains that allowed the inference of putative allosteric networks.

In a recent Nature Horizons article [12], Cambridge chemistry professor Peter Murray-Rust sketched an eScience world where the answer to any question is at the fingertips of every person; the complexity of the question and the size of the data to analyze do not matter. Computers perform all trivial and time-consuming tasks, like searching through millions of research articles or performing massive algorithmic calculations. In this proof-of-principle paper we have shown that some aspects of this vision in fact have been realized: machines can via Web services be used to retrieve background information that together with multiple data sources (produced locally or retrieved from the Internet) can be coherently exploited in a Bayesian framework to create complex models by employment of HPC resources, thus allowing us to provide answers to biologically relevant questions. This *modus operandi *contrasts to non-Bayesian approaches for merging multiple omics datasets and for integrating omics data with prior information, which to a large degree have relied on *ad hoc *methods for solving specific problems (see e.g. Kutalik *et al*. [5], Chuang *et al*. [6], and Xu *et al*. [7]). Bayesian statistics provide the flexibility to tailor statistical models to complex data structures, and can simultaneously accommodate prior information. The rather complex models described here would be diffcult to fit using classical statistics. We demonstrated the value of the Bayesian approach with the quite drastic improvements in prediction accuracy in the breast cancer example, and the enhanced depth of the data analysis in the PDZ domain-peptide interaction example. However, presently the eScience-Bayes approach is not easy to apply because it requires manual procurement of suitable information resources and Web services, and it requires programming skills to implement the Web service clients, parse the output, and connect them together. Moreover, the eScience-Bayes approach requires implementation of MCMC-algorithms to fit Bayesian models and knowledge of middleware and non-user-friendly interfaces for deployment of the MCMC-algorithms on HPC facilities. However, the most crucial difficulty is the prior elicitation process. Since the prior distributions are central in the Bayesian paradigm, a key aspect of applying the eScienceBayes approach is to summarize retrieved prior knowledge as distributions. In both the breast cancer and PDZ domain demonstrations we used prior information quantified as integers to represent our *a priori *belief of the independent variables relevance for modeling the response. Although care was taken to design the prior distributions in a sensible way, the prior specifications reflects a somewhat arbitrary step in the work flow, and there is presently no clear way to standardize prior elicitation and make the process transparent, exchangeable, and completely general. Related to this difficulty, another problem is that useful data and prior information in today's scientific practice are lost in objects that are inaccessible to computers, such as figures, tables, and summary statistics.

Thus, there is a distinct need for a standardized system of technologies and tools for working efficiently with computer representations of biological entities, data, information, probabilities, and statistical models - both locally and on the network. Great efforts are indeed being made in this direction. These include the increasing degrees of semantically annotated data that are deposited in public repositories in machine readable formats [31], new Web service protocols that allow for service discovery [32], semantic web technologies to adhere information with probability distributions that can be directly used as priors in Bayesian statistical models [33], and high-level languages for specifying fast implementations of MCMC algorithms [34]. Moreover, graphical workbenches, like Bioclipse [35], aim to handle all these tasks from a single point of entry. When these technologies and software have matured, it will be straightforward to adopt an eScience-Bayes approach to virtually any biological or clinical question. It would then be easy to obtain prior information and data by performing semantic queries, for example, "give me all breast cancer susceptibility genes and the probability distributions describing their association with increasing risk of distant metastasis development, together with all breast cancer Affymetrix HG-U133A datasets where distant metastasis development was the clinical end-point". The retrieved priors and data could then be used *directly *in a Bayesian statistical model specified in a high-level language and converted to an algorithmic fitting process deployed on an HPC facility. Finally, the results could be summarized graphically to increase the likelihood of making correct clinical prognostic and treatment decisions.

The datasets with accession numbers GSE2034, GSE7390, GSE4922, GSE2990, and GSE1456 were retrieved with the GEO Web service [19]. The datasets were originally published in Wang *et al*. [13], Desmedt *et al*. [20], Miller *et al*. [14], Sotiriou *et al*. [15], and Pawitan *et al*. [16], and we hence denote them by (**X**, **t**, ** γ**)

We define *g*_{j}, *j *= 1,...,22283, to be the gene names of the probes on the Affymetrix HG-U133A array, and to be the set of gene names reported in NetPath as being transcriptionally regulated by cancer pathways, and to be the set of words in the English language (Fig. (Fig.2).2). Further, set *h*_{j}, *j *= 1,....,22283, to be the number of free, fulltext articles in PubMed where the gene name *g*_{j }was mentioned. Finally, let *k*_{j}, *j *= 1,...,22283, be defined according to:

(1)

We modeled whether development of distant metastases occurred within five years using a multilevel Bayesian probit model according to:

(2)

where *Pr *denotes probability, and Φ is the normal cumulative distribution function, *N *the multivariate normal distribution, *δ *the one-point distribution, and *W*^{-1 }the inverse Wishart distribution. *n*_{ρ }denotes the number of patients in dataset *ρ*. **J**^{κ }is the diagonal matrix with the non-zero elements in the vector as diagonal elements. Similarly, and are the elements of **x**_{ρi }and *β*_{ρ}, respectively, that correspond to the non-zero elements in (**x**_{ρi }relates to patient *i *in dataset *ρ *and *β*_{ρ }are the regression coefficients related to dataset *ρ*). The index *ρ *indicates that the coefficients varied by breast cancer study (*W, D, M, S, P*).

We assumed *a priori *that on average 100 genes should be included in the signature (i.e. that = 100, where *E *denotes the expectation operator). This assumptions was based on the number of genes used in previously reported signatures [13-16]. Further, we assumed that the greater the number of times a cancer pathway regulated gene's name has been mentioned together with breast cancer in free, fulltext PubMed articles, the greater is the *a priori *scientific support that the gene's regulation is correlated with breast cancer relapse. Thus, the larger the *k*-value (as defined in (1)) associated with a given gene, the greater the *a priori *probability of the gene to be included in the signature. In order to give all genes a chance of being selected, we assign the *k*-value 1 to a gene not reported in NetPath. Genes reported in NetPath but not mentioned in any free PubMed article get the *k*-value 2. The prior distribution for is founded on the idea that the probability of selecting *one *variable *j *of the 22283 variables is proportional to the variable *j*'s *k*-value, i.e. *Pr*(select *j *in one draw) = . Thus, the probability of *not *selecting the variable *j *is and the probability of selecting variable *j *in *m *independent draws is . All genes thus have a chance to get selected and all genes have a chance of not being selected.

Apart from the prior distribution of , no prior knowledge was assumed for the remaining model parameters (i.e. noninformative priors was used for , Σ^{κ }and *m*).

Model I was validated by fitting it five times, each time using three of the five datasets (**X**, **r**, ** γ**)

To afford comparisons with the gene signatures published in Wang *et al*. [13], Miller *et al*. [14], Sotiriou *et al*. [15], and Pawitan *et al*. [16], we modified Model I according to:

(3)

where denotes the subset of **x**_{ρi }that corresponds to the gene signature *ρ*_{sig }derived in publication *ρ*, is the number of genes in the signature *ρ*_{sig}, and the identity matrix of rank .

To derive a gene signature using all the data, we fitted Model I as described in (2). We counted the number of times that each gene appeared in the 200,000 draws from the posterior distribution in the Markov chain Monte Carlo algorithm (after 200,000 draws burn-in, see below). We defined the appearance frequency of a gene *g*_{j }as the number of appearances of *g*_{j }divided by the total number of iterations (i.e., 200,000 here). We reasoned that the genes with the highest appearance frequencies play the strongest role in predicting the response. The top 100 genes, according to appearance frequency, were selected and are found in Additional file 1, Table S2.

We modeled the time to development of distant metastases according to:

(4)

where the denotations are as in Model I, and *ε*_{ρi }is the error term for patient *i *in dataset *ρ*. Model II was validated by fitting it five times, each time using three of the five datasets (**X**, **r**, ** γ**)

The peptides and the aligned PDZ sequences (Fig. (Fig.4)4) were, after removal of alignment positions containing gaps, converted to numerical vectors capturing the physicochemical properties of each sequence by using the tciz scales of Muthas *et al *[36]. The tciz scales are the amino acids' component scores in the two first principal components derived from a principal component analysis of physicochemical properties of 113 natural and non-natural amino acids. The two tciz scales thus code each sequence residue into two real-valued numbers, which are strongly correlated with the amino acids size and polarity [36]. The tciz scales thus allow us to numerically characterize peptides by concatenating the tciz numbers for all residues in the peptide. Similarly, we could numerically characterize an interaction between a PDZ domain and a peptide as the concatenated tciz description vectors of the PDZ domain and the peptide. Characterizing all interactions in the Stiffler *et al*. [22] and Chen *et al*. [23] datasets gave us a matrix, **X**, with 2306 observations (2019 from mouse, 147 from *C. elegans*, and 140 from *D. melanogaster *PDZ domains) and 154 variables (67 residue positions, *r*_{1},...,*r*_{67}, in the alignment of the PDZ domains were "gap-free", and each position was coded with two tciz numbers: 67·2 = 134 variables; the peptides contained 10 residues each: 10·2 = 20 variables).

Let **H **= (**X**, *H*(**X**)) be mean centered and scaled to unit variance, where *H*(**X**) denotes the second order interaction terms of the variables in **X**. **H **thus has 2306 rows and 11935 columns. Let the column vector **y **= (*y*_{1},...,*y*_{2306}) be the response variable, where *y*_{i }= 1 represent that the PDZ domain and the ligand interact and *y*_{i }= 0 represents that they do not interact. We assumed, with support from the literature [37] and the data from Stiffler *et al*. [22], that some PDZ domains are more promiscuous binders than others. This resulted in a correlated error structure when modeling the data. This is an assumption violation in standard generalized linear models, but could be accommodated for in our Bayesian setting by allowing parameter estimates to vary among PDZ domains. Thus, let *ϕ *= 1,...,96 denote the 96 PDZ domains in the data from Stiffler *et al*. [22] and Chen *et al*. [23] (82 from mouse, 7 from *C. elegans*, and 7 from *D. melanogaster*).

The PDZ domain residues *r*_{1},...,*r*_{67 }and the peptide residues *p*_{1},...,*r*_{10 }are associated with the integers *c*_{1},...,*c*_{77}, where the *c*-values represent the number of contacts its corresponding residue makes with a ligand, as estimated by SAS by analyzing all solved 3D structures of PDZ domain-peptide complexes available in the Protein Data Bank [25] (PDB) (Fig. (Fig.4).4). We construct the row vectors

(5)

where *H*(**c**_{X}) denotes the second order interaction terms of **c**_{X}. **c**_{X }thus represents the *c*-values of the tciz-coded residues *r*_{1},...,*r*_{67 }and *p*_{1},...,*p*_{10 }incremented by 1. The *c*-values were incremented by 1 to ensure that all variables in **H **have a chance of being selected (i.e. including those with *c*-values equal to zero).

We modeled the interaction between PDZ domains and peptides using a Bayesian hierarchical probit model according to:

(6)

where is the *i*th row in (the rows of **H **corresponding to PDZ domain *ϕ*). *d*_{j }is the *j*th value in the row vector **c**_{X, H(X)}. Hence, the *a priori *probability of a variable to be selected is larger, the larger the number of contacts the PDZ residue it is derived from makes with the ligand (the assumption being that the more contact a PDZ residue makes with the ligand, the greater our *a priori *belief that the residue is important for the interaction). The other denotations are analogous to the ones used in Model I. The prior for *m*, i.e. the parameter controlling the number of selected variables, was chosen so that roughly one twentieth of the total number of variables was selected (i.e. so that = 11935/20, where *E *denotes the expectation operator). The mean of the prior distribution for *m *was estimated using cross-validation on the training data. Model III was also modified to assess the use of the prior information about which residues that are important for governing the interaction between PDZ domains and peptides. We then set *p *= 1/11935.

The coupling was estimated by studying the interaction terms in **H **formed between tciz scales from one residue in the PDZ domain and one in the peptide as outlined in Prusis *et al*. [38]. The figure displays the five such interaction terms with the largest (absolute-valued) modes of the regression coefficient's distributions. The network residues are color coded according to the physicochemical property that is estimated to be important for the network; residue size is shown in blue and residue polarity in red. Note that the same PDZ domain 3D structure (*α*1-syntrophin PDZ, PDB accession number 2PDZ) has been used in all three pictures since the structures of Lap4(2/4) and Lin7(1/1) from *C. elegans *and *D. melanogaster *have not been solved. The residue labels refer to the amino acid at the given position in Lap4(2/4) and Lin7(1/1) and do thus not correspond to the structure of the depicted amino acids. Residue numbering is according to 2PDZ.pdb.

To explain the Bayesian P-values used in the article, we give the following example: Let be the two datasets that were removed when fitting Model I and the remaining three datasets. Further, let ** β **and

ROC curves are plots of the true-positive fraction (TPF) plotted as a function of the false-positive fraction (FPF) for a binary classifier system as its discrimination threshold is varied. The TPF and FPF are calculated as follows: TPF = TP/(TP + FN) and FPF = FP/(FP + TN), where TP = true positives, FP = false positives, FN = false negatives, and TN = true negatives. The larger the area under the curve (the AUC), the better the classifier. The AUC is equal to the Mann-Whitney U statistic [40], a non-parametric test for assessing whether two independent samples of observations come from the same distribution.

The posterior distributions for Model I, II, and III are not possible to derive in closed form. The models were thus fitted using Markov chain Monte Carlo (MCMC) methods (see Robert and Casella [41] for an excellent account on MCMC methods). The MCMC algorithms were implemented in the statistical programming language R [42]. Three chains were run for each model and after a burn-in period of 200,000 samples, we collected the following 200,000 samples from the posterior distributions. Convergence was checked using the potential scale reduction factor [43]. The computations were performed on UPPMAX computational resources http://www.uppmax.uu.se/ under the project p2006026.

ME devised and implemented the proposed eScience-Bayes framework. OS was involved in program design and aided with the implementation. JESW supervised the project. All authors read and approved the final manuscript.

**Additional results**. Additional results obtained with Model I, II, and III. Specific information regarding the results are available within the additional pdf file.

Click here for file^{(239K, PDF)}

Financial support was obtained frm Swedish VR (04X-05957) and Uppsala University (KoF07).

- Butcher EC, Berg EL, Kunkel EJ. Systems biology in drug discovery. Nat Biotechnol. 2004;22(10):1253–1259. doi: 10.1038/nbt1017. [PubMed] [Cross Ref]
- Ho RL, Lieu CA. Systems Biology: An Evolving Approach in Drug Discovery and Development. Drugs in R&D. 2008;9(4):203–216. [PubMed]
- Wickenberg-Bolin U, Göransson H, Fryknäs M, Gustafsson MG, Isaksson A. Improved variance estimation of classiffication performance via reduction of bias caused by small sample size. BMC Bioinformatics. 2006;7:127. doi: 10.1186/1471-2105-7-127. [PMC free article] [PubMed] [Cross Ref]
- Isaksson A, Wallman M, Göoransson H, Gustafsson MG. Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recogn Lett. 2008;29(14):1960–1965. doi: 10.1016/j.patrec.2008.06.018. [Cross Ref]
- Kutalik Z, Beckmann JS, Bergmann S. A modular approach for integrative analysis of large-scale gene-expression and drug-response data. Nat Biotechnol. 2008;26(5):531–539. doi: 10.1038/nbt1397. [PubMed] [Cross Ref]
- Chuang H, Lee E, Liu Y, Lee D, Ideker T. Network-based classiffication of breast cancer metastasis. Mol Syst Biol. 2007;3:140. doi: 10.1038/msb4100180. [PMC free article] [PubMed] [Cross Ref]
- Xu L, Tan AC, Winslow RL, Geman D. Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics. 2008;9:125. doi: 10.1186/1471-2105-9-125. [PMC free article] [PubMed] [Cross Ref]
- Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall/CRC, New York; 2000.
- Eddy SR. What is Bayesian statistics? Nat Biotechnol. 2004;22(9):1177–1178. doi: 10.1038/nbt0904-1177. [PubMed] [Cross Ref]
- Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York; 2007.
- Berger JO. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, Berlin; 1985.
- Murray-Rust P. Chemistry for everyone. Nature. 2008;451(7179):648–651. doi: 10.1038/451648a. [PubMed] [Cross Ref]
- Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EMJJ, Atkins D, Foekens JA. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365(9460):671–679. [PubMed]
- Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA. 2005;102(38):13550–13555. doi: 10.1073/pnas.0506230102. [PubMed] [Cross Ref]
- Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Vijver MJ Van de, Bergh J, Piccart M, Delorenzi M. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006;98(4):262–272. [PubMed]
- Pawitan Y, Bjöohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedren S, Bergh J. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 2005;7(6):R953–64. doi: 10.1186/bcr1325. [PMC free article] [PubMed] [Cross Ref]
- Ein-Dor L, Kela I, Getz G, Givol D, Domany E. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics. 2005;21(2):171–178. doi: 10.1093/bioinformatics/bth469. [PubMed] [Cross Ref]
- Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet. 2005;365(9458):488–492. doi: 10.1016/S0140-6736(05)17866-0. [PubMed] [Cross Ref]
- Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–210. doi: 10.1093/nar/30.1.207. [PMC free article] [PubMed] [Cross Ref]
- Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JGM, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res. 2007;13(11):3207–3214. doi: 10.1158/1078-0432.CCR-06-2765. [PubMed] [Cross Ref]
- Sha N, Tadesse MG, Vannucci M. Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics. 2006;22(18):2262–2268. doi: 10.1093/bioinformatics/btl362. [PubMed] [Cross Ref]
- Stiffler MA, Chen JR, Grantcharova VP, Lei Y, Fuchs D, Allen JE, Zaslavskaia LA, MacBeath G. PDZ domain binding selectivity is optimized across the mouse proteome. Science. 2007;317(5836):364–369. doi: 10.1126/science.1144592. [PMC free article] [PubMed] [Cross Ref]
- Chen JR, Chang BH, Allen JE, Stiffler MA, MacBeath G. Predicting PDZ domain-peptide interactions from primary sequences. Nat Biotechnol. 2008;26(9):1041–1045. doi: 10.1038/nbt.1489. [PMC free article] [PubMed] [Cross Ref]
- Milburn D, Laskowski RA, Thornton JM. Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng. 1998;11(10):855–859. doi: 10.1093/protein/11.10.855. [PubMed] [Cross Ref]
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235. [PMC free article] [PubMed] [Cross Ref]
- Talavera D, Laskowski RA, Thornton JM. WSsas: a web service for the annotation of functional residues through structural homologues. Bioinformatics. 2009;25(9):1192–1194. doi: 10.1093/bioinformatics/btp116. [PubMed] [Cross Ref]
- Smock RG, Gierasch LM. Sending signals dynamically. Science. 2009;324(5924):198–203. doi: 10.1126/science.1169377. [PMC free article] [PubMed] [Cross Ref]
- Lockless S, Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999;286(5438):295–299. doi: 10.1126/science.286.5438.295. [PubMed] [Cross Ref]
- Gianni S, Walma T, Arcovito A, Calosci N, Bellelli A, Engström A, Travaglini-Allocatelli C, Brunori M, Jemth P, Vuister GW. Demonstration of long-range interactions in a PDZ domain by NMR, kinetics, and protein engineering. Structure. 2006;14(12):1801–1809. doi: 10.1016/j.str.2006.10.010. [PubMed] [Cross Ref]
- Chi CN, Elfström L, Shi Y, Snäll T, Engstörm Å, Jemth P. Reassessing a sparse energetic network within a single protein domain. Proc Natl Acad Sci USA. 2008;105(12):4679–4684. doi: 10.1073/pnas.0711732105. [PubMed] [Cross Ref]
- Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RKJ, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novere N, Leebens-Mack J, Lewis SE, Lord P, Mallon AM, Marthandan N, Masuya H, McNally R, Mehrle A, Morrison N, Orchard S, Quackenbush J, Reecy JM, Robertson DG, Rocca-Serra P, Rodriguez H, Rosenfelder H, Santoyo-Lopez J, Scheuermann RH, Schober D, Smith B, Snape J, Stoeckert CJJ, Tipton K, Sterk P, Untergasser A, Vandesompele J, Wiemann S. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol. 2008;26(8):889–896. doi: 10.1038/nbt.1411. [PMC free article] [PubMed] [Cross Ref]
- Wagener J, Spjuth O, Willighagen EL, S WJE. XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous Web services. BMC Bioinformatics. 2009;10(279) [PMC free article] [PubMed]
- da Costa PCG, Laskey KB, Laskey KJ. PR-OWL: A Bayesian ontology language for the semantic web. Berlin, Heidelberg: Springer-Verlag; 2008.
- Daumé H. , IIIHBC: Hierarchical Bayes Compiler. 2007.
- Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES. Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics. 2007;8:59. doi: 10.1186/1471-2105-8-59. [PMC free article] [PubMed] [Cross Ref]
- Muthas D, Lek PM, Nurbo J, Karlén A, Lundstedt T. Focused hierarchical design of peptide libraries -follow the lead. J Chemometrics. 2007;21(10-11):486–495. doi: 10.1002/cem.1069. [Cross Ref]
- Tonikian R, Zhang Y, Sazinsky SL, Currell B, Yeh JH, Reva B, Held HA, Appleton BA, Evangelista M, Wu Y, Xin X, Chan AC, Seshagiri S, Lasky LA, Sander C, Boone C, Bader GD, Sidhu SS. A specificity map for the PDZ domain family. PLoS Biol. 2008;6(9):e239. doi: 10.1371/journal.pbio.0060239. [PMC free article] [PubMed] [Cross Ref]
- Prusis P, Uhlén S, Petrovska R, Lapinsh M, Wikberg JES. Prediction of indirect interactions in proteins. BMC Bioinformatics. 2006;7:167. doi: 10.1186/1471-2105-7-167. [PMC free article] [PubMed] [Cross Ref]
- Xiao-Li M. Posterior Predictive p-values. The Annals of Statistics. 1994;22(3):1142–1160. doi: 10.1214/aos/1176325622. [Cross Ref]
- Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics. 1947;18:50–60. doi: 10.1214/aoms/1177730491. [Cross Ref]
- Robert CP, Casella G. Monte Carlo statistical methods. 2. Springer-Verlag, New York; 2004.
- R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2009. http://www.R-project.org [ISBN 3-900051-07-0]
- Gelman A, Rubin DB. Inference from Iterative Simulation Using Multiple Sequences. Statistical Science. 1992;7(4):457–472. doi: 10.1214/ss/1177011136. [Cross Ref]
- Kaplan EL, Maier P. Non-parametric estimation of incomplete observations. J Am Stat Assoc. 1958;53:457–81. doi: 10.2307/2281868. [Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of **BioMed Central**