Creating a mutation matrix
We designed our algorithm to be capable of utilizing many disparate sources of mutational data, including single-nucleotide polymorphisms, copy-number alterations, and epigenomic modifications. In a pre-processing step, these diverse data types were converted into a single two-dimensional binary "mutation" matrix (Figure 2).
Figure 2 Analysis Pipeline. In a preprocessing step, validated SNPs and focal CNAs are combined into a mutation matrix. This matrix is fed into the Winnow algorithm, which scores each gene pair by exclusivity, indicated by edge scores in a graph.
Data were obtained from The Cancer Genome Atlas Data Portal (http://tcga-data.nci.nih.gov/). A complete listing of samples used can be found in Table 1 in Additional file 1.
Point mutations identified by resequencing were filtered such that only non-synonymous, validated mutations remained. Genes driving copy-number alterations were detected using normalized probe-level data from Agilent 244A copy-number arrays. These data were processed to infer regions of amplification and deletion using circular binary segmentation as implemented in the R package DNAcopy [13]. Log-ratio thresholds for amplification and deletion were set at 1.5 standard deviations from the mean probe intensity. The resulting regions were intersected with peaks of recurrent copy-number change identified by the RAE algorithm [14]; copy-number variants were then removed and driver genes were selected as described in Additional file 1.
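To make the thresholding step concrete, the following Ruby sketch classifies copy-number segments as amplified or deleted at 1.5 standard deviations from the mean log-ratio; the segment records and field names are hypothetical stand-ins for segmented DNAcopy output, not the exact code used in this pipeline.

```ruby
# Illustrative sketch: call segments amplified or deleted using a +/- 1.5
# standard-deviation cutoff around the mean log-ratio. The `segments`
# structure is a hypothetical stand-in for segmented DNAcopy output.
segments = [
  { chrom: "7",  start: 54_000_000, stop: 56_000_000, log_ratio:  1.8 },
  { chrom: "9",  start: 21_000_000, stop: 22_500_000, log_ratio: -2.1 },
  { chrom: "10", start: 89_000_000, stop: 90_000_000, log_ratio: -0.1 }
]

ratios = segments.map { |s| s[:log_ratio] }
mean   = ratios.sum / ratios.size
sd     = Math.sqrt(ratios.map { |r| (r - mean)**2 }.sum / ratios.size)

segments.each do |s|
  s[:call] = if s[:log_ratio] >= mean + 1.5 * sd then :amplified
             elsif s[:log_ratio] <= mean - 1.5 * sd then :deleted
             else :neutral
             end
end
```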
These two forms of data were then merged into a two-dimensional mutation matrix. Each gene in each sample was checked against these single-nucleotide and copy-number mutations, and a matrix was created such that if sample i contained an alteration in gene j, the position x_{i,j} in the matrix was set to 1; otherwise it was set to 0. This matrix is available at http://brl.bcm.tmc.edu/rme/gbm.dat.
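As a concrete illustration of this merging step, the Ruby sketch below builds the binary matrix from (sample, gene) alteration calls; the sample and gene identifiers are hypothetical and the snippet is a simplified sketch rather than the released implementation.

```ruby
# Sketch: build a binary mutation matrix from lists of (sample, gene) alteration
# calls. Identifiers and input structures are illustrative only.
samples = %w[TCGA-02-0001 TCGA-02-0003 TCGA-02-0006]
genes   = %w[TP53 EGFR PTEN CDKN2A]

snv_calls = [["TCGA-02-0001", "TP53"], ["TCGA-02-0003", "EGFR"]]
cna_calls = [["TCGA-02-0001", "CDKN2A"], ["TCGA-02-0006", "PTEN"]]

# x[i][j] = 1 if sample i carries an alteration in gene j, 0 otherwise
x = Array.new(samples.size) { Array.new(genes.size, 0) }

(snv_calls + cna_calls).each do |sample, gene|
  i = samples.index(sample)
  j = genes.index(gene)
  x[i][j] = 1 if i && j
end
```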
Constructing a gene network with Winnow
The first step in our module detection pipeline was to filter the mutation matrix and retain only genes that meet a set frequency of recurrence, as genes altered in only one or a few samples do not contain enough information to calculate meaningful exclusivity scores.
A possible next step would be to calculate the exclusivity score between each pair of genes, defined as the number of samples in which exactly one of the pair is mutated divided by the number of samples in which at least one of the pair is mutated (Figure ). These scores could be used to create a network in which each node is a gene and each edge weight is the exclusivity between the two genes. Highly connected sub-networks would then be used as a starting point for a focused combinatorial search for modules. The disadvantage of this approach is that the network quickly becomes far too large and densely connected to effectively identify sub-networks.
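Continuing from the matrix sketch above, the following illustrates the recurrence filter and the pairwise exclusivity score as defined here; the minimum recurrence of 3 samples is an arbitrary illustrative value, not the threshold used in this study.

```ruby
# Sketch: keep recurrently altered genes, then score each gene pair by
# exclusivity = (# samples with exactly one of the pair altered) /
#               (# samples with at least one of the pair altered).
min_recurrence = 3
kept = (0...genes.size).select { |j| x.count { |row| row[j] == 1 } >= min_recurrence }

exclusivity = {}
kept.combination(2).each do |j1, j2|
  exactly_one  = x.count { |row| row[j1] + row[j2] == 1 }
  at_least_one = x.count { |row| row[j1] + row[j2] >= 1 }
  exclusivity[[genes[j1], genes[j2]]] = exactly_one.to_f / at_least_one if at_least_one > 0
end
```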
Thus, we used an online-learning linear threshold algorithm called Winnow to detect signals of exclusivity against the noisy background of passenger mutations in many irrelevant genes [15]. Its speed and insensitivity to irrelevant attributes allowed us to aggressively filter the output scores before generating a network, resulting in smaller, higher-quality networks than those produced by pairwise exclusivity.
The Winnow algorithm was run in an online setting, using one gene as a classifier and the rest of the mutation matrix as training data. In the first Winnow run, all of the bits in the matrix were flipped, such that we calculated how well each aberration in the classifier gene predicted non-aberration in each gene of the matrix. We then flipped the bits of the classifier, such that we calculated how well each non-aberration in the classifier gene predicted aberration in each gene of the matrix. The resulting weights were used to score the edges of the graph, and low-scoring edges were removed.
Since the range of weights for each run was determined by how quickly Winnow found an optimal classifier, we did not use an absolute threshold value when removing edges. Instead, for each classifier gene, we took the second-highest weight and retained all edges with a score greater than or equal to that value.
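The sketch below shows a single standard Winnow pass (promotion/demotion by a factor of 2, threshold equal to the number of attributes) used to weight edges between a classifier gene and the remaining genes, followed by the second-highest-weight edge filter; the learning parameters are textbook defaults, and the two bit-flipped runs described above are not reproduced, so this is an illustration rather than the exact procedure.

```ruby
# Sketch: one online Winnow pass with gene column c as the target label and all
# other genes as attributes. Promotion/demotion factor 2 and a threshold equal
# to the number of attributes are assumed textbook defaults.
def winnow_weights(x, c)
  attrs     = (0...x.first.size).to_a - [c]
  weights   = Hash.new { |h, k| h[k] = 1.0 }
  threshold = attrs.size.to_f

  x.each do |row|
    label      = row[c]
    activation = attrs.sum { |j| row[j] == 1 ? weights[j] : 0.0 }
    predicted  = activation >= threshold ? 1 : 0
    next if predicted == label

    factor = label == 1 ? 2.0 : 0.5   # promote on misses, demote on false alarms
    attrs.each { |j| weights[j] *= factor if row[j] == 1 }
  end
  weights
end

# For each classifier gene, keep only edges scoring at least the second-highest
# weight produced by that run.
def strong_edges(x, genes)
  edges = []
  (0...genes.size).each do |c|
    w = winnow_weights(x, c)
    next if w.empty?
    cutoff = w.values.sort[-2] || w.values.max
    w.each { |j, wt| edges << [genes[c], genes[j], wt] if wt >= cutoff }
  end
  edges
end
```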
Identifying candidate modules
We then used each gene in the network as a starting point for a greedy local combinatorial search for RME modules, evaluating all possible connected modules with size below a specified limit. We report those with algorithmic significance above a predetermined threshold, based on the size of the input data (Figure ). These potential RME modules are binned by number of genes and sorted by significance value. The module with the largest size and highest significance (as described in the next section) is kept, and all other modules containing any of the same genes are discarded. This process is repeated until all bins are empty.
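A simplified sketch of this search and the winner-takes-all selection is given below; it assumes an adjacency-list `graph` built from the Winnow edges and a `score` callable returning the encoding-length difference d described in the next section (larger = more significant), and it grows modules by greedy neighbor expansion rather than exhaustively enumerating all connected modules, so it is illustrative only.

```ruby
# Sketch: grow a candidate module from each seed gene, then keep the largest,
# most significant modules while discarding any that share genes with a kept one.
def candidate_modules(graph, max_size, min_score, score)
  found = []
  graph.keys.each do |seed|
    mod = [seed]
    while mod.size < max_size
      neighbors = mod.flat_map { |g| graph[g] || [] }.uniq - mod
      break if neighbors.empty?
      best = neighbors.max_by { |g| score.call(mod + [g]) }
      break if score.call(mod + [best]) <= score.call(mod)
      mod << best
    end
    found << mod if score.call(mod) >= min_score
  end
  found
end

def select_disjoint(modules, score)
  kept = []
  # Prefer larger modules, then higher significance within a size bin.
  modules.sort_by { |m| [-m.size, -score.call(m)] }.each do |m|
    kept << m if kept.none? { |k| (k & m).any? }
  end
  kept
end
```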
Evaluating modules by performing an algorithmic significance test
The problem of determining whether a module (subset of genes) contains a significant RME pattern of aberrations can be addressed using probabilistic models or heuristic scores. Both approaches would generally require establishment of extremely low significance values (pre-Bonferroni correction), which would in turn require many cycles of computationally demanding permutation testing. To eliminate this bottleneck, we employ a new implementation of the computationally much less demanding algorithmic significance test [16], which has recently been applied in the context of Hidden Markov Models [17], but is also applicable as a general method for pattern discovery [20]. As illustrated in Figure , the algorithm determines significance values directly.
Let k be the number of samples, m be the number of genes in a module, and let X be a k × m matrix of binary values, with each value indicating the presence (x_{i,j} = 1) or absence (x_{i,j} = 0) of an aberration of the j-th gene in the i-th sample. The algorithmic significance test compares the number of bits required to encode the binary matrix X by the RME Algorithm to the number of bits required to encode the matrix under the null hypothesis. The RME Algorithm attempts to encode the data in fewer bits by using the assumption that mutations occur at an unusually high frequency in a mutually exclusive fashion, versus the Null Algorithm (corresponding to the null hypothesis), which assumes that aberrations occur independently at their background frequencies.
The presence of an RME pattern (Figure ) will allow the RME Algorithm to encode the matrix X significantly more concisely. To minimize overall encoding length, the RME Algorithm is provided the identity of the m genes out of the total of n genes assayed (encoded in m log(n) bits, an implicit penalty that corrects for multiple testing), the counts of aberrations for each sample, a_{i,0}, i = 1,...,k, and the counts of aberrations for each gene, b_{0,j}, j = 1,...,m (encoded in k log*(m) and m log*(k) bits respectively, where log* denotes the iterated logarithm). Using this information, the RME Algorithm first sorts the samples, then the genes, by their counts of aberrations, placing the most frequently altered samples at the top of the matrix and the genes with most aberrations at the beginning of each row. As a penalty for this sorting, we reduce the score by the number of bits necessary to represent the new sorted order.
The algorithm then examines the sorted matrix row by row, in left-to-right order, keeping track of how many aberrations have been observed; it calculates a probability of observing an aberration in the next cell of the matrix and encodes the bit optimally according to the calculated probability. To describe how the probability is calculated, we first introduce additional notation. Let p(x_{i,j} = 1) denote the number of unobserved mutations divided by the number of unobserved positions remaining in the matrix. Let a_{i,j} and b_{i,j} denote the number of unobserved aberrations in the current sample and gene, respectively.
Then, we can encode elements of X according to the following probability distribution: if a_{i,j} and b_{i,j} are both larger than 0, and a one has not yet been observed in this row, we use the following formula (derived by applying Bayes' rule):
otherwise, we estimate that the probability is very low (but not equal to zero, in order to avoid infinitely large penalties):
In contrast, the Null Algorithm encodes X optimally assuming that the m genes contain aberrations at their background frequency (no enrichment), denoted p_{NULL}(1), and that mutations occur independently in each of the m genes.
The encoding length difference between the null and RME algorithms and the algorithmic significance are calculated in the following two steps:
Step 1. Encode the binary aberration matrix.
Set d' to 0. Examine the aberration matrix row by row, in left-to-right order, incrementing d' in each cell (i, j) as follows:
where log denotes the binary logarithm.
Step 2. Account for additional information (including the implicit correction for multiple testing) and calculate significance.
Calculate the significance value 2^{-d}.
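To illustrate the flavor of this two-step comparison, the sketch below encodes a module's sub-matrix with the simple adaptive probability defined above (unobserved mutations over unobserved positions remaining) versus a single background frequency for the Null model, and converts the bit savings d into a significance value 2^{-d}. It deliberately omits the Bayes-rule case, the sorting penalty, and the m log(n), k log*(m), and m log*(k) terms, so it illustrates the principle rather than reproducing the published test.

```ruby
# Illustrative sketch of the algorithmic significance idea: d = (bits to encode
# the sub-matrix under the Null model) - (bits under an adaptive RME-style model);
# significance = 2**(-d). Penalty terms described in the text are omitted here.
def bits(p)
  -Math.log2(p)   # optimal code length, in bits, for an event of probability p
end

def encoding_gain(sub_matrix, background_freq)
  cells     = sub_matrix.flatten
  remaining = cells.sum    # unobserved mutations
  positions = cells.size   # unobserved positions
  d = 0.0

  cells.each do |bit|
    p_rme = remaining.to_f / positions      # adaptive probability of observing a 1
    p_rme = p_rme.clamp(1e-6, 1.0 - 1e-6)   # keep code lengths finite
    if bit == 1
      d += bits(background_freq) - bits(p_rme)
    else
      d += bits(1.0 - background_freq) - bits(1.0 - p_rme)
    end
    remaining -= bit
    positions -= 1
  end
  d
end

def significance(sub_matrix, background_freq)
  2.0**(-encoding_gain(sub_matrix, background_freq))   # smaller = stronger RME pattern
end
```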
To benchmark the performance of this algorithm, we ran simulations on synthetic data sets. When generating sets of the same size as the current glioblastoma data (145 samples, 1290 genes), we used the actual distribution of mutations from the TCGA data to create random matrices. To simulate larger data sets, we took into account that the current gene list is heavily biased towards known, frequently altered oncogenes, and compensated by assuming that 0.1% of newly considered genes will have a mutation frequency greater than 0.2, 0.9% will have a frequency between 0.1 and 0.2, and 99% will have a frequency less than 0.1.
We then used a binning procedure, where we started with the empirical GBM distribution, and calculated the proportion of mutations in each bin. To compensate for the fact that the distribution is heavily biased towards low-frequency mutations, we used bins of size 1% until we reached the tenth percentile, then used bins of size 5% to allow for some variability. We then distributed the specified proportion of aberrations randomly within each bin. We tested coverage levels between 50 and 100%, and generated RME Modules such that the number of alterations matched the given coverage level, exclusivity was 100%, and each gene was altered in a random number of samples that exceeded the minimum threshold.
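One way to generate such a synthetic matrix is sketched below: per-gene mutation frequencies are drawn to mimic the binned proportions described above, a background matrix is sampled, and a fully exclusive module is injected at a target coverage. The bin boundaries, frequencies, and helper names are illustrative assumptions, not the exact simulation code.

```ruby
# Sketch: random background matrix from per-gene mutation frequencies, plus a
# fully exclusive RME module injected at a target coverage level.
def random_background(n_samples, gene_freqs)
  gene_freqs.map { |f| Array.new(n_samples) { rand < f ? 1 : 0 } }.transpose
end

def inject_rme_module(matrix, gene_cols, coverage)
  matrix.each { |row| gene_cols.each { |j| row[j] = 0 } }   # clear module genes everywhere
  covered = matrix.sample((coverage * matrix.size).round)
  covered.each { |row| row[gene_cols.sample] = 1 }          # exactly one module gene per covered sample
  matrix
end

# Example: 145 samples x 1290 genes, with a 3-gene module at 80% coverage.
# Roughly 0.1% of genes get a high frequency, 0.9% an intermediate one, the rest < 0.1.
freqs  = Array.new(1290) { rand < 0.001 ? 0.25 : (rand < 0.009 ? 0.15 : rand * 0.1) }
matrix = random_background(145, freqs)
matrix = inject_rme_module(matrix, [0, 1, 2], 0.8)
```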
Determination of prognostic significance
Affymetrix HG-U133-based GeneChip mRNA expression profiling data from two published datasets, the TCGA ("TCGA", n = 260) and the Erasmus Medical Center, Netherlands ("Erasmus", n = 153), were obtained as raw intensity files (.CEL files) and normalized [1]. Samples were included for all cases in which clinical data were available (patient age at diagnosis, tumor grade, survival time, and vital status) and for which the diagnosis was primary glioblastoma. Mapping of Affymetrix GeneChip probes was performed using custom chip definition files (CDF) based on NCBI Entrez Gene v.11 (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp), and probesets were summarized by median intensity [23]. Recursive partitioning analysis was performed to separate samples into high and low expression groups. Univariate comparison of survival by EP300 expression was performed by the Kaplan-Meier method [24], with significance determined using the log-rank test. Multivariate analysis was performed using the Cox proportional hazards model [25]. All analyses were conducted using JMP Genomics 4.0.
Implementation and availability
The algorithm was implemented in Ruby and Bash. The core algorithm is available for download at http://brl.bcm.tmc.edu/rme/ or for use through the Genboree/Galaxy portal at http://www.genboree.org/galaxy, under Tools > Pattern Discovery. Documentation and example files are included at each location.