New high-throughput sequencing and microarray technologies provide researchers with access to increasingly large and complex datasets comprising genome-level alterations in cancer
[1],
[2],
[3]. Computational algorithms have been designed to sift through this data with the goal of uncovering mutational patterns that are typical for a particular cancer type and consistent between sample sets
[1],
[2],
[4],
[5]. However, the ability to functionally validate recurrent genetic events in transgenic mouse models and human cell lines is limited by the lack of knowledge of the temporal order in which these alterations arise during tumorigenesis. The temporal sequence of events is important since it can inform the correct genomic background in which a mutation must arise to confer an oncogenic phenotype to cells. Furthermore, it may contribute to drug discovery since genomic alterations arising early during tumorigenesis may be more likely to induce oncogenic addiction and may thus represent promising therapeutic targets
[6]. In some cancers, such as colorectal cancer, the order of events can be determined through the analysis of several pre-malignant stages
[7],
[8]. Most cancer types, however, do not present with clinically observable precursor stages, and therefore the identification of the temporal sequence of events using biological or clinical approaches is difficult.
There is a growing literature of mathematical, statistical and computational approaches to determining the temporal sequence of events arising during tumorigenesis. Previously published methods include the linear model
[7], the oncogenetic tree (oncotree) approach
[9],
[10],
[11],
[12], various Bayesian graphical approaches
[13],
[14], and some clustering-based methods
[15],
[16]. Based upon the seminal work in delineating the temporal sequence of events in colorectal cancer by Vogelstein and colleagues, the linear model assumes that there exists a single, most likely order of mutations, and that all of these mutations arise in sequential order
[7]. The oncogenetic tree approach generalizes the assumption of a single sequential path by providing a tree structure to the temporal sequence of mutations, allowing for diverging temporal orderings of events
[9],
[10]. In probabilistic oncotrees, the tree structure represents the probabilities of accumulating further mutations along divergent temporal sequences
[9]. An alternative distance-based oncotree approach involves generating a phylogenetic tree over all events using a distance measure between mutational events, where leaf nodes represent the set of possible events. The closer a leaf node is to the root, the earlier the corresponding mutation arises
[10]. Further development of the probabilistic oncotree methodology by Beerenwinkel et al. resulted in the mixture tree model, in which multiple oncogenetic trees, each of which can result in cancer development independently, are included in the model
[12]. The consideration of multiple tree structures allows for inclusion of multiple independent temporal sequences of events that can result in the development of cancer. Notably, one tree structure that the mixture tree model includes is a star-shaped tree predicting that every mutation arises independently, accounting for random mutations that arise but are not involved in any temporal sequence of events. An expectation maximization algorithm is then used to determine the most likely tree mixture to fit the data
[12]. This approach has often been used to analyze CGH data
[17],
[18],
[19],
[20],
[21],
[22],
[23]. However, one acknowledged restriction of tree-based methods is that the tree structure precludes the possibility of converging evolutionary paths
[13] that occur when multiple alterations result in the same phenotypic effect. Furthermore, tree-based models impose a strict ordering of events: if an event occurs in a leaf of the tree, then it necessarily must be preceded by all events between the leaf and the root of the tree. Bayesian graphical methods, by allowing any network structure, can include converging evolutionary paths
[13],
[14], however at the cost of additional computation necessary to search the expanded multi-dimensional result space.
We have previously described a computational approach, called
Retracing the
Evolutionary
Steps
In
Cancer (RESIC)
[24], which determines the temporal sequence of specific genetic events for primary tumor types for which cross-sectional genomic data is available (). This approach can be used to resolve the relative order of genetic events with respect to other alterations of interest; however, in the absence of further data, the time of emergence of these events relative to phenotypes such as malignancy or metastasis cannot be identified. In the RESIC model, we adopted a different approach as compared to prior work: RESIC explicitly considers the evolutionary dynamics of mutation accumulation within a population of patients. Each patient harbors a collection of self-renewing cells that are at risk of accumulating the alterations leading to cancer; this cell population follows a stochastic process known as the Moran model
[25]. This stochastic process model of mutation accumulation was then approximated with a dynamical systems model whose steady state distribution across all possible mutational states can be compared with the frequencies of patients harboring the corresponding genetic events. The resulting fitness values conferred by genomic alterations, obtained by an optimization algorithm to minimize the distance between the observed and predicted patient frequencies, were then used to determine the relative order of events arising in a patient population
[24]. RESIC analyses are performed for sets of correlated genetic events, thus resulting in a relative ordering of alterations. The use of pair-wise comparisons also allows for converging temporal orderings. A detailed explanation of the RESIC algorithm is provided in the
Methods section.
In many cases, specific genetic alterations are not necessary for malignant transformation; instead, particular oncogenic phenotypes must be achieved through the emergence of any of a number of alternative mutations ()
[26],
[27],
[28],
[29],
[30]. In addition, many cancers can be subdivided into distinct molecular subtypes (), which often result from differences in the set of genetic alterations accumulated and potentially the order in which they arise. Including pathway and subtyping information may alter the order in which genetic events arise during cancer progression. To address these issues, we have extended our RESIC methodology to consider both pathway-based phenotypic changes and the subtype-driven context of cancer in order to examine how the temporal sequence of events differs when such information is included. These two topics are considered in the analysis of primary glioblastoma (GBM). Due to its aggressiveness and poor prognosis, as well as its frequency, patient samples of GBM have been extensively collected and the data made easily accessible to researchers. In particular, The Cancer Genome Atlas (TCGA) project has provided measurements of multiple types of genomic alterations, with each sample processed in a uniform manner, for a large set of patient samples of GBMs
[3]. The TCGA dataset provides an opportunity to effectively analyze the temporal sequence of pathway alterations using RESIC through an investigation of the specific signaling disruptions common to GBMs
[31],
[32], its well-defined molecular subtypes
[33],
[34],
[35], and the availability of similarly treated patient samples
[3]. We chose the RESIC methodology instead of previous approaches to investigate these issues since RESIC provides, in addition to the temporal ordering of events, the relative fitness values of cells harboring individual combinations of mutations. Furthermore, RESIC is capable of determining the order of different types of events, such as point mutations, focal amplifications and deletions, as well as whole chromosome and chromosome arm changes. Finally, RESIC is computationally efficient.