Using a panel of more than 300 microarrays, measured from B cells under different conditions, we propose a data analysis framework that can be used to predict targets for the transcription factor BCL6. To quantify the performance of our approach, we focus on the verified DREAM2 gene set [6
]. In addition to microarray data, we also use genomic sequence information in our analysis. Other sources of information, such as gene ontologies, could easily be incorporated into this framework.
First, we performed an extensive literature review to find a set of 51 genes that have previously been verified as BCL6 targets [7
]. These genes, denoted as the positive target set, are listed in together with information about the origin of evidence. Out of these 51 genes, only one gene (STAT1) overlaps with the DREAM2 gene set. In addition to positive targets, we also generated a non-target gene set by selecting 94 genes from the microarray data at random. As there are more than 12000 probe sets on the array, it is expected that a random sampling will not include a significant number of target genes. As such, our training set should be relatively unbiased. These sets of 51 target and 94 non-target genes form a training set that we utilize to establish our data analysis framework.
Positive training set. List of verified BCL6 targets from the literature with sources. In these studies, BCL6 regulation was supported by a variety of targeted experiments, including direct binding assays and functional perturbations.
Next, we derived a number of features from the various data sources. We used MotifLocator [18
] and ProbTF [19
] algorithms to compute the p
-values and to predict the probabilities of BCL6 binding upstream of the putative target genes. By making use of the biological insights obtained from the literature review, we extracted features from gene expression data that should be biologically relevant. For example, it is known that BCL6 works mostly as a repressor [7
]. Thus, if the expression of BCL6 is high, it may be expected that the expression of BCL6 targets is low. Thus, we used correlation of putative target genes expression to BCL6 expression as a feature. Also perturbations of CD40 and anti-IgM are known to regulate BCL6 [16
]. As BCL6 activity is downregulated, given CD40 or anti-IgM stimuli, we compute the expression ratios for CD40 and anti-IgM stimulated conditions against non-stimulated conditions within a given cell type. We also utilized a machine learning approach to predict whether a given gene is a target using support vector machines, trained with our literature derived training set. All of the used features are listed in .
Table 2 Table of all features. First group of features includes the binding site prediction probabilities using the ProbTF algorithm  with four different position weight matrices (PWMs). Second group presents the binding probabilities obtained using a standard (more ...)
To identify which features are the most informative we measure the degree of class separation using the constructed training set. Class separation is measured using the Mann-Whitney U-test to test whether the distributions of the two classes are separated using a given feature. As a result, a p-value for class separation is obtained for each feature. Features with the smallest p-values are selected from each group of features (see ) and are used to score the target genes from the DREAM2 200 gene set. The best features are highlighted in .
We combine the best features using a weighted sum. After the features are normalized (see Methods), we optimize the weights on the training set. Marginal distributions of weights for each feature are shown in . It should be noted that using the p-values of the features directly as weights does not give optimal results, as this approach would discard possible joint contributions of the features.
Figure 1 Distributions for 100 weight combinations that yields the highest prediction accuracy on the training set. Marginal distributions for each feature are shown. Weights have been obtained with exhaustive search over the weight space with 0.2 steps on [0,1] (more ...)
Using the best weight combination, we score each gene of the DREAM2 gene set and rank the genes based on the score. The performance of our approach is illustrated by precision-recall and receiver operating characteristic (ROC) curves in . The area under the precision-recall curve for this prediction is 0.80. We can also observe that there is only one mis-classification among the first 25 predicted targets.
Figure 3 BCL6 target prediction accuracy for weights = [0.0, 0.2, 0.6, 0.6, 0.8, 1.0]. First five weights correspond to (a)–(e) in and the last weight is used for SVM score. Area under the curve is 0.80468 and 0.91002 for (a) precision-recall (more ...)
Effect of incorporating the sequence data
By examining the weights in , it is evident that sequence based information contributed only marginally to the classification result. This indicates that our sequence scores are not very informative about class separation. If this sequence evidence is given more weight than proposed by the weight optimization, it actually lowers the prediction accuracy.
This is somewhat surprising as the use of sequence data can potentially help exclude non-targets from the list of target predictions. The ROC curve (), computed against the DREAM2 200 gene set, illustrates that the initial sequence features that we derived are not very informative.
ROC curve computed against the gold standard 200 gene set using ProbTF and PWMs from the literature.
This raises the question about the validity of our promoter sequence analysis. There are two obvious possibilities for the sources of error. First, it was not originally disclosed in the DREAM2 challenge what type of ChIP-on-chip array was used for the validation of the DREAM2 200 gene set. Thus, it was not possible to know with any certainty which genome build to use, which set of gene annotations to refer to, and what size of the promoter regions to use to search for binding sites. Second, the position weight matrices (PWMs) that we have obtained may not be the optimal ones for identifying BCL6 binding in these experiments.
To test whether these issues caused problems in the analysis, we performed an additional analysis. After the DREAM2 conference, it was disclosed that the ChIP-on-chip array that was used for the generation of the DREAM2 gene set was the Agilent Human Promoter ChIP-on-chip array. This array is based on the NCBI 36.1 (March 2006) genome build and includes probes covering −5.5kb upstream to +2.5kb downstream of the transcriptional start sites for approximately 17000 human transcripts. Thus, we extracted the sequence areas corresponding to the probes on the array, repeat masked the sequences [20
], and applied binding site prediction algorithms [19
]. In addition, to test whether there are more informative motifs than the BCL6 position weight matrices that we had obtained, we applied the PRIORITY motif discovery algorithm [21
] to the sequences from our training set and also to the 200 genes from the DREAM2 set.
From the training set, PRIORITY identified low-complexity motifs (). However, when applied to the DREAM2 gene set together with the ProbTF algorithm, these motifs show good discrimination (AUC = 0.7639) in terms of the ROC curve (). Motif searching on DREAM2 gene set yields a slightly more complex motif () that also gives good discrimination (AUC = 0.7594) between targets and non-targets and performs slightly better especially on low false positive rates (). Low-complexity motifs may reflect as-yet uncharacterized sequence features that correlate with BCL6 transcription, but are themselves not a model for direct BCL6 binding cis-elements.
Motif discovered from training set using PRIORITY.
ROC curve for ProbTF binding site predictions using motifs discovered from the training set and the DREAM2 200 gene gold standard set.
Motif discovered from DREAM2 200 gene gold standard set using PRIORITY.
By replacing the earlier binding site prediction probabilities with the predictions obtained with discovered motifs and reoptimizing the weights, we can obtain the performance that is shown in . The area under the precision-recall curve has increased from 0.80 to 0.84. Improved accuracy is evident also in the ROC curves. While it can be argued that this increase in performance is due to over-fitting, as motifs have been learned from the same data as they are tested on, it does illustrate that the sequences do carry information that can be utilized to boost the target identification performance. Similar performance improvement was also obtained using the motif inferred from the test set. Thus, if the knowledge about biological motifs is reliable, sequence data can be a valuable source of evidence.
BCL6 target prediction accuracy using the ProbTF feature with discovered motif and weights = [0.6, 0.2, 0.6, 0.6, 0.8, 1.0]. Area under the curve is 0.84488 and 0.92273 for (a) precision-recall and (b) ROC, respectively.