Human cancer cell lines have been an invaluable and practical resource for cancer research. The availability of genomic, transcriptomic and proteomic data on these lines is expected to further increase their utility. To this end, we conducted whole-genome and transcriptome sequencing on three tumor cell lines (A431, U251MG and U2OS) for which there is a large body of proteomics data [1]. The choice of these lines was also motivated by their origin from different lineages (epithelial, glial and mesenchymal tumors, respectively) and by the abundance of literature on them.
A431 is used as a model cell line for epidermoid carcinoma, and there are currently 3,359 publications describing studies that use it. It was established from an epidermoid carcinoma in the vulva of an 85-year-old patient [2]. This cell line expresses high levels of epidermal growth factor receptor (EGFR) and is often used to investigate cell proliferation and apoptosis. U251MG is a commonly used glioblastoma cell line (over 1,200 published articles) established from the brain tissue of a male patient [3]. U2OS is an osteosarcoma cell line derived from a 15-year-old female [4]. Osteosarcoma tumors arise from cells of mesenchymal origin that differentiate to osteoblasts. Osteosarcoma is the most common form of bone cancer, responsible for 2.4% of all malignancies in pediatric patients, and its triggers are currently not known [5]. U2OS is a common choice for osteosarcoma research: 35% of the articles associated with the osteosarcoma Medical Subject Headings (MeSH) term in the PubMed database have used this cell line.
We subjected these three cell lines to genome and RNA sequencing in order to identify genomic alterations and to profile the expression of messenger RNAs and microRNAs. A review by Ideker and Sharan summarized studies demonstrating that genes with a role in cancer tend to cluster together in well-connected sub-networks of protein-protein interactions [6]. We have also previously demonstrated that somatic mutations in a glioblastoma cancer genome produced a pathway-like pattern of enriched connectivity in the gene interaction network. Hence, in this work we analyzed functional relations between all detected somatic mutations, structural variations (altered copy number) and allelic imbalances of expression via network enrichment analysis (NEA) [7]. A biological pathway can be seen as an area of densely connected genes in a functional gene network. The idea of NEA, when applied to cancer-related genes, is that multiple key mutations (which are believed to be common in cancer genomes) could alter normal cellular programs for proliferation, differentiation, cell death, and so on, sometimes even producing quasi-pathways [9]. These altered pathways could then be detected as denser, more enriched areas of the network and evaluated by comparing them with the patterns formed by the same set of genes in biologically meaningless (random) networks. Either the whole group or individual members of such a pathway could have links to master switches of oncogenesis, which may not themselves have been altered.
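The comparison against random networks described above can be illustrated with a minimal Python sketch. It is not the NEA implementation used in this work: the function names (`count_links`, `rewire`, `enrichment_z`), the double-edge-swap randomization, the number of swaps and the toy gene labels are all assumptions made for illustration. The sketch counts the links between a set of altered genes and a pathway gene set in the observed network, then estimates how unusual that count is relative to degree-preserving randomizations.

```python
import random
from statistics import mean, pstdev


def count_links(edges, set_a, set_b):
    """Count edges that connect a gene in set_a to a gene in set_b."""
    return sum(
        1 for u, v in edges
        if (u in set_a and v in set_b) or (v in set_a and u in set_b)
    )


def rewire(edges, n_swaps, rng):
    """Randomize an undirected network by double-edge swaps.

    Every node keeps its original degree; only the partners change.
    """
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:          # pick one of the two possible swaps
            c, d = d, c
        if len({a, b, c, d}) < 4:       # would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:  # would duplicate an edge
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def enrichment_z(edges, set_a, set_b, n_random=200, seed=0):
    """Return observed link count, z-score and empirical p-value."""
    rng = random.Random(seed)
    observed = count_links(edges, set_a, set_b)
    null = [
        count_links(rewire(edges, 10 * len(edges), rng), set_a, set_b)
        for _ in range(n_random)
    ]
    mu, sigma = mean(null), pstdev(null)
    z = (observed - mu) / sigma if sigma > 0 else float("nan")
    # One-sided empirical p-value with a pseudo-count.
    p = (1 + sum(x >= observed for x in null)) / (1 + n_random)
    return observed, z, p


# Toy network with hypothetical gene labels (not real interaction data).
toy_edges = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"),
    ("E", "F"), ("F", "G"), ("G", "H"), ("H", "E"),
]
observed, z, p = enrichment_z(toy_edges, {"A", "B"}, {"C"})
```

Because the statistic is a count relative to a null distribution, a few spurious genes in either set shift both the observed and the randomized counts, which is why relative enrichment is more robust than the raw number of links.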
In particular, Dutta and co-authors proposed that the effects of driver genes might be seen as differential (mRNA or protein) expression of their network neighbors [10]. In the current work we pursue a similar approach, with the difference that we did not make any prior assumptions about modular properties of driver mutations and instead summarized all of their relations to each other and to important pathways. This method is the closest analog of gene set enrichment analysis (GSEA), with the important novel option of analyzing single genes against functional sets [11]. In addition, the use of gene network information enables much higher sensitivity, which we also demonstrate.
While different methods of network inference from one or two data sources have been published [12], only data integration networks have a broader scope and include the multiple molecular mechanisms required for our analysis. For the highest completeness, we employed a network of functional coupling that was constructed using the methodology of the data integration tool FunCoup [13], and then merged with curated pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), protein complex data from CORUM, and a special network derived from glioblastoma data. However, any state-of-the-art network is likely to be incomplete or to miss a specific context. We therefore complement the network analysis of direct links with an analogous statistic that accounts for indirect links, that is, connections via third genes.
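As a rough illustration of what an indirect link means here, the following Python sketch counts gene pairs that have no direct edge but share at least one neighbor (a "third gene"). The function names and the toy labels `g1` to `g5` are hypothetical, and the exact definition of the indirect-link statistic used in the analysis may differ from this simple shared-neighbor count.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each gene to the set of its direct network neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def count_indirect_links(edges, set_a, set_b):
    """Count pairs (a, b) that are not directly linked but share a neighbor."""
    adj = build_adjacency(edges)
    pairs = set()
    for a in set_a:
        for b in set_b:
            if a == b or b in adj[a]:
                continue                 # same gene, or already a direct link
            if adj[a] & adj[b]:          # at least one shared third gene
                pairs.add(frozenset((a, b)))
    return len(pairs)


# Toy network with hypothetical gene labels (not real interaction data).
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g5")]
n_indirect = count_indirect_links(toy_edges, {"g1"}, {"g3", "g4"})
```

In the toy network, `g1` and `g3` are connected only through `g2`, so they form one indirect link, whereas `g1` and `g4` share no neighbor. Such a count could be evaluated against randomized networks in the same way as a count of direct links.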
To enable a rigorous statistical evaluation, patterns of potential functional couplings are compared to observations in a series of randomized networks that preserve the basic topological properties of the original network but have no biological function. This yields a probabilistic estimate for every tested hypothesis. As the analysis considers relative enrichment rather than absolute signal strength, functional patterns can be discerned even in the presence of multiple spurious mutations, which are referred to as passengers. On the other hand, any computation-based gene network will contain a high number of individual false edges. Here too, looking at statistically significant enrichment patterns instead of focusing on particular links allows such false-positive edges to be ignored. Of note, a number of reports have been dedicated to the discovery of network structures (modules, clusters, hypothetical pathways, and so on) that could characterize pathologic conditions [10
Here we describe, to our knowledge, the first study in which whole-genome and transcriptome data for three cancer genomes were analyzed in conjunction with data on global protein levels. First, we select genes with the potentially highest signal concentration (that is, we filter them by expression values, correlation of expression with genome alteration, sequence features, and so on), and subject them to network enrichment analysis to show that both the selection criteria and NEA can bring us closer to the true sets of driver mutations in these genomes. Second, we re-analyze all detected copy number and single nucleotide alterations in the interaction network and present the most likely driver mutations within each genome. We show that passengers account for the overwhelming majority of all detected structural variations. We believe that the results presented herein provide a basis for understanding the functional interactions between the genome, transcriptome and proteome, both for these highly influential model cell lines and for cancer genomes in general.