COSMIC (http://www.sanger.ac.uk/cosmic) curates comprehensive information on somatic mutations in human cancer. Release v48 (July 2010) describes over 136 000 coding mutations in almost 542 000 tumour samples; of the 18 490 genes documented, 4803 (26%) have one or more mutations. Full scientific literature curations are available on 83 major cancer genes and 49 fusion gene pairs (19 new cancer genes and 30 new fusion pairs this year) and this number is continually increasing. Key amongst these is TP53, now available through a collaboration with the IARC p53 database. In addition to data from the Cancer Genome Project (CGP) at the Sanger Institute, UK, and The Cancer Genome Atlas project (TCGA), large systematic screens are also now curated. Major website upgrades now make these data much more mineable, with many new selection filters and graphics. A Biomart is now available allowing more automated data mining and integration with other biological databases. Annotation of genomic features has become a significant focus; COSMIC has begun curating full-genome resequencing experiments, developing new web pages, export formats and graphics styles. With all genomic information recently updated to GRCh37, COSMIC integrates many diverse types of mutation information and is making much closer links with Ensembl and other data resources.
The Catalogue of Somatic Mutations in Cancer (COSMIC) (http://www.sanger.ac.uk/cosmic/) is the largest public resource for information on somatically acquired mutations in human cancer and is available freely without restrictions. Currently (v43, August 2009), COSMIC contains details of 1.5-million experiments performed through 13 423 genes in almost 370 000 tumours, describing over 90 000 individual mutations. Data are gathered from two sources: publications in the scientific literature (v43 contains 7797 curated articles) and the full output of the genome-wide screens from the Cancer Genome Project (CGP) at the Sanger Institute, UK. Most of the world’s literature on point mutations in human cancer has now been curated into COSMIC and, while this is continually updated, a greater emphasis on curating fusion gene mutations is driving the expansion of this information; over 2700 fusion gene mutations are now described. Whole-genome sequencing screens are now identifying large numbers of genomic rearrangements in cancer, and COSMIC now also displays details of these analyses. Examination of COSMIC’s data is primarily web-driven, focused on providing mutation range and frequency statistics based upon a choice of gene and/or cancer phenotype. Graphical views provide easily interpretable summaries of large quantities of data, and export functions can provide precise details of user-selected data.
The Catalogue of Somatic Mutations in Cancer (COSMIC) (http://www.sanger.ac.uk/cosmic) is a publicly available resource providing information on somatic mutations implicated in human cancer. Release v51 (January 2011) includes data from just over 19 000 genes, 161 787 coding mutations and 5573 gene fusions, described in more than 577 000 tumour samples. COSMICMart (COSMIC BioMart) provides a flexible way to mine these data and combine somatic mutations with other biologically relevant data sets. This article describes the data available in COSMIC along with examples of how to successfully mine and integrate data sets using COSMICMart.
Database URL: http://www.sanger.ac.uk/genetics/CGP/cosmic/biomart/martview/
A major focus of modern biological research is the understanding of how genomic variation relates to disease. Although there are significant ongoing efforts to capture this understanding in curated resources, much of the information remains locked in unstructured sources, in particular, the scientific literature. Thus, there have been several text mining systems developed to target extraction of mutations and other genetic variation from the literature. We have performed the first study of the use of text mining for the recovery of genetic variants curated directly from the literature. We consider two curated databases, COSMIC (Catalogue Of Somatic Mutations In Cancer) and InSiGHT (International Society for Gastro-intestinal Hereditary Tumours), that contain explicit links to the source literature for each included mutation. Our analysis shows that the recall of the mutations catalogued in the databases using a text mining tool is very low, despite the well-established good performance of the tool and even when the full text of the associated article is available for processing. We demonstrate that this discrepancy can be explained by considering the supplementary material linked to the published articles, not previously considered by text mining tools. Although it is anecdotally known that supplementary material contains ‘all of the information’, and some researchers have speculated about the role of supplementary material (Schenck et al. Extraction of genetic mutations associated with cancer from public literature. J Health Med Inform 2012;S2:2.), our analysis substantiates the significant extent to which this material is critical. Our results highlight the need for literature mining tools to consider not only the narrative content of a publication but also the full set of material related to a publication.
As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, individually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.95 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any individual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.
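The idea of combining tools with complementary coverage can be illustrated with a minimal sketch. It assumes that each tool's output and the gold annotations can be represented as sets of normalised variant strings, and that tools are combined by simple set union; the abstract does not state how the combination was done, and the mentions, the tool names `tool_a` and `tool_b`, and the helper functions below are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F-measure) for two sets of variant mentions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def combine_tools(*tool_outputs):
    """Combine several tools by taking the union of their mention sets (an assumption)."""
    combined = set()
    for output in tool_outputs:
        combined |= set(output)
    return combined


# Invented toy data, not taken from the Variome corpus, COSMIC or InSiGHT.
gold = {"p.V600E", "c.1799T>A", "p.Q61R", "rs113488022"}
tool_a = {"p.V600E", "p.Q61R"}                      # finds protein-level mentions only
tool_b = {"c.1799T>A", "rs113488022", "p.G12D"}     # finds DNA-level mentions, one false positive

for name, output in [("tool_a", tool_a), ("tool_b", tool_b),
                     ("combined", combine_tools(tool_a, tool_b))]:
    p, r, f = precision_recall_f1(output, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy case each tool alone recovers half of the gold mentions, while the union recovers all of them at the cost of one false positive, which is the kind of trade-off a union-based combination introduces.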
Cutaneous malignant melanoma is the most fatal skin cancer, and although improved understanding of its pathogenic pathways has enabled some effective molecular targeted therapies, novel targets and drugs are still needed. Aiming to add genetic information potentially useful for the discovery of novel targets, we performed an extensive genomic characterization by whole-exome sequencing and SNP array profiling of six cutaneous melanoma cell lines derived from metastatic patients. We obtained a total of 3,325 novel coding single nucleotide variants, including 2,172 non-synonymous variants. We catalogued the coding mutations according to the Sanger COSMIC database and to a manually curated list of genes involved in melanoma pathways, identified by mining recent literature. Besides confirming the presence of known melanoma driver mutations (BRAF V600E, NRAS Q61R), we identified novel mutated genes involved in signalling pathways crucial for melanoma pathogenesis and already addressed by current targeted therapies (such as MAPK and glutamate pathways). We also identified mutations in four genes (MUC19, PAICS, RBMXL1, KIF23) never reported in melanoma, which might deserve further investigation. All data are available to the entire research community in our Melanoma Exome Database (at https://188.8.131.52/MExDB/). In summary, these cell lines are valuable biological tools to improve the genetic understanding of this complex cancer and to study the functional relevance of individual mutational events, and these findings could provide insights potentially useful for identification of novel therapeutic targets for cutaneous malignant melanoma.
COSMIC is currently the most comprehensive global resource for information on somatic mutations in human cancer, combining curation of the scientific literature with tumor resequencing data from the Cancer Genome Project at the Sanger Institute, U.K. Almost 4800 genes and 250000 tumors have been examined, resulting in over 50000 mutations available for investigation. This information can be accessed in a number of ways, the most convenient being the Web-based system, which allows detailed data mining and presents the results in easily interpretable formats. This unit describes the graphical system in detail, elaborating an example walkthrough and the many ways that the resulting information can be thoroughly investigated by combining data, respecializing the query, or viewing the results in different ways. Alternate protocols give an overview of the precompiled data files available for download.
COSMIC; cancer; somatic; mutation; database
In the assessment of radiogenic cancer risk from space flight, it is imperative to consider effects not only on the creation of cancer cells (initiation) but also on cell–cell interactions that play an important and often decisive role in the promotion and progression phases. Autopsy results confirm that most adults carry fully malignant tumors that are held in check at a small size and will never become symptomatic [2]. This introduces the possibility that cosmic radiation may significantly influence cancer risk through alteration of the bottleneck inter-tissue interactions responsible for maintaining this dormant state. One such bottleneck is the growth limitation imposed by the failure of the tumor to induce blood vessels (angiogenesis). Other deciding events are the ability of a tumor to proliferate and invade. We have previously shown that proton radiation, the most prevalent radiation in space, has a suppressive effect on all three of these functional responses. It down-regulates angiogenic genes like VEGF and HIF-1α and impairs cell invasion and tumor growth [3]. We decided to test these responses after 56Fe irradiation, an HZE radiation type present in the cosmic environment with presumably high carcinogenic potential.
Human microvascular endothelial cells (HMVEC) and normal human dermal fibroblast (NHDF) cells were irradiated with different doses of 56Fe ion radiation (1 GeV/n) at Brookhaven National Laboratory and RNA was extracted 6 h later. Genome-wide array analysis was done on the isolated RNA through the Agilent Platform. Several pro-angiogenic genes, including VEGF, IL-6 and HIF-1α, were significantly up-regulated after treatment with 56Fe ion radiation (Fig. 1). These results were also confirmed at the mRNA and protein levels with the human and murine lung cancer lines, A549 and LLC, respectively. Modulation of these key genes was further verified in vivo: lungs of C57BL/6 mice treated with 56Fe ion radiation showed an increase in VEGF and MMP9 mRNA and protein expression 6 h post-irradiation (Fig. 2). Cell invasion was shown to be increased by 56Fe ion radiation in various cell types, including fibroblast, tumor and endothelial progenitor cells. 56Fe ion irradiation also modulated functional processes crucial to angiogenesis. It enhanced the ability of untargeted (bystander) endothelial cells to invade and proliferate in response to factors produced by targeted fibroblast or cancer cells in vitro. These results also carry over in vivo. C57BL/6 mice exposed to whole-body irradiation with a 0.2 Gy dose of 56Fe and injected subcutaneously with LLC tumor cells showed a significant augmentation in tumor growth and growth rate in the irradiated group. Additionally, nude mice exposed to whole-body 56Fe radiation and injected intravenously with A549 cancer cells 3 h post-irradiation demonstrated a significant enhancement in lung colonization capacity when compared with sham-irradiated, injected control mice.
These results together suggest cell and tissue-level responses to 56Fe irradiation may act to overcome major cancer progression-level bottlenecks including those related to angiogenesis, cell proliferation and invasion. This is of significant concern for cancer risk estimations pertinent to NASA as achieving these cancer hallmark processes can make the difference between a radiation-induced cancer cell progressing to a clinically detectable cancer in astronauts or not. In conclusion, we demonstrate a strong radiation quality dependence for space radiation carcinogenesis risk manifested through influences on intercellular interactions in the progression phase of carcinogenesis.
Fig. 1. Heatmaps of selected differentially regulated major angiogenesis genes after proton and 56Fe ion radiation in HMVECs and NHDF. Cells were treated with either 0, 0.5, 1 or 2 Gy of proton radiation or 0, 0.2, 0.4 or 1 Gy of 56Fe ion dose. Among the major regulated genes were VEGF, HIF-1A and IL-6; they were down-regulated by proton radiation and up-regulated by iron radiation.
Fig. 2. Immunofluorescence images of lungs of C57BL/6 mice treated with 0, 0.2 or 1 Gy of 56Fe ion dose and stained 6 h later. Pro-angiogenic factors VEGF and MMP9 were increased in mice that received the 56Fe ion treatment.
To assess the incidence of cancer among male airline pilots in the Nordic countries, with special reference to risk related to cosmic radiation.
Retrospective cohort study, with follow up of cancer incidence through the national cancer registries.
Denmark, Finland, Iceland, Norway, and Sweden.
10 032 male airline pilots, with an average follow up of 17 years.
Main outcome measures
Standardised incidence ratios, with expected numbers based on national cancer incidence rates; dose-response analysis using Poisson regression.
466 cases of cancer were diagnosed compared with 456 expected. The only significantly increased standardised incidence ratios were for skin cancer: melanoma 2.3 (95% confidence interval 1.7 to 3.0), non-melanoma 2.1 (1.7 to 2.8), basal cell carcinoma 2.5 (1.9 to 3.2). The relative risk of skin cancers increased with the estimated radiation dose. The relative risk of prostate cancer increased with increasing number of flight hours in long distance aircraft.
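As an illustration of the main outcome measure, a standardised incidence ratio (SIR) is the number of observed cases divided by the number expected from reference rates. The sketch below applies this to the overall figures reported above (466 observed, 456 expected). The confidence interval uses Byar's approximation to the Poisson limits, which is an assumption made here for illustration; the abstract does not state how the study's intervals were calculated, and the function name is hypothetical.

```python
import math


def standardised_incidence_ratio(observed, expected, z=1.96):
    """Return (SIR, lower, upper) using Byar's approximation for the Poisson CI.

    Byar's approximation is an illustrative choice, not necessarily
    the method used in the study.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    sir = observed / expected
    if observed > 0:
        lower_count = observed * (
            1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))
        ) ** 3
    else:
        lower_count = 0.0
    upper_obs = observed + 1
    upper_count = upper_obs * (
        1 - 1 / (9 * upper_obs) + z / (3 * math.sqrt(upper_obs))
    ) ** 3
    return sir, lower_count / expected, upper_count / expected


# Overall cancer incidence reported in the results: 466 observed vs 456 expected.
sir, lower, upper = standardised_incidence_ratio(466, 456)
print(f"SIR = {sir:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that spans 1.0, as it does for the overall figures, is consistent with the conclusion that no marked overall excess of cancer was observed.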
This study does not indicate a marked increase in cancer risk attributable to cosmic radiation, although some influence of cosmic radiation on skin cancer cannot be entirely excluded. The suggestion of an association between number of long distance flights (possibly related to circadian hormonal disturbances) and prostate cancer needs to be confirmed.
What is already known on this topic
Airline pilots are occupationally exposed to cosmic radiation and other potentially carcinogenic elements
In the studies published so far, dose-response patterns have not been characterised
What this study adds
No marked risk of cancer attributable to cosmic radiation is observed in airline pilots
A threefold excess of skin cancers is seen among pilots with longer careers, but the influence of recreational exposure to ultraviolet light cannot be quantified
A slight increase in risk of prostate cancer with increasing number of long haul flights suggests a need for more studies on the effects of circadian hormonal disturbances
The frequency of metastatic colorectal cancer (mCRC) and the poor prognosis of patients with the disease emphasize the need for improved biomarkers for use in the treatment and prognosis of mCRC. In the present study, somatic variants in exonic regions of key cancer genes were identified in mCRC patients. Formalin-fixed, paraffin-embedded tissues obtained by biopsy of the metastases of mCRC patients were collected, and the DNA was extracted and sequenced using the Ion Torrent Personal Genome Machine. For the targeted amplification of known cancer genes, the Ion AmpliSeq™ Cancer Panel was used; it is designed to detect 739 Catalogue of Somatic Mutations in Cancer (COSMIC) mutations in 604 loci from 46 oncogenes and tumor suppressor genes using as little as 10 ng of input DNA. The sequencing results were then analyzed using the AmpliSeq™ Variant Caller plug-in within the Ion Torrent Suite software. In addition, Ingenuity Pathway software was used to perform a pathway analysis. Cox regression analysis was also conducted to investigate the potential correlation between alteration numbers and clinical factors, including response rate, disease-free survival and overall survival. Among 10 specimens, 65 genetic alterations were identified in 24 genes following the exclusion of germline mutations using the SNP database; 41% of the alterations were also present in the COSMIC database. No clinical factors were found to correlate significantly with the alteration numbers in the patients. However, pathway analysis identified ‘colorectal cancer metastasis signaling’ as the most commonly mutated canonical pathway. This analysis further revealed mutated genes in the Wnt, phosphoinositide 3-kinase (PI3K)/AKT and transforming growth factor (TGF)-β/SMAD signaling pathways. Eleven genes, including the expected APC, BRAF, KRAS, PIK3CA and TP53 genes, were mutated in at least two samples.
Notably, 90% (9/10) of mCRC patients harbored at least one ‘druggable’ alteration (range, 1–6 alterations) that has been linked to a clinical treatment option or is currently being investigated in clinical trials of novel targeted therapies. These results indicated that DNA sequencing of key oncogenes and tumor suppressors enables the identification of ‘druggable’ alterations for individual colorectal cancer patients.
druggable alterations; Ion Torrent; metastatic colorectal cancer; formalin-fixed paraffin-embedded
The richest uranium ore bodies ever discovered (Cigar Lake and McArthur River) are presently under development in northeastern Saskatchewan. This subarctic region is also home to several operating uranium mines and aboriginal communities, partly dependent upon caribou for subsistence. Because of concerns over mining impacts and the efficient transfer of airborne radionuclides through the lichen-caribou-human food chain, radionuclides were analyzed in tissues from 18 barren-ground caribou (Rangifer tarandus groenlandicus). Radionuclides included uranium (U), radium (226Ra), lead (210Pb), and polonium (210Po) from the uranium decay series; the fission product (137Cs) from fallout; and naturally occurring potassium (40K). Natural background radiation doses average 2-4 mSv/year from cosmic rays, external gamma rays, radon inhalation, and ingestion of food items. The ingestion of 210Po and 137Cs when caribou are consumed adds to these background doses. The dose increment was 0.85 mSv/year for adults who consumed 100 g of caribou meat per day and up to 1.7 mSv/year if one liver and 10 kidneys per year were also consumed. We discuss the cancer risk from these doses. Concentration ratios (CRs), relating caribou tissues to lichens or rumen (stomach) contents, were calculated to estimate food chain transfer. The CRs for caribou muscle ranged from 1 to 16% for U, 6 to 25% for 226Ra, 1 to 2% for 210Pb, 6 to 26% for 210Po, 260 to 370% for 137Cs, and 76 to 130% for 40K, with 137Cs biomagnifying by a factor of 3-4. These CRs are useful in predicting caribou meat concentrations from the lichens, measured in monitoring programs, for the future evaluation of uranium mining impacts on this critical food chain.
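The concentration ratio described above can be sketched in a few lines, assuming it is simply the radionuclide concentration in caribou tissue divided by the concentration in the source material (lichen or rumen contents), with both measured in the same units. The function names and the numerical inputs are hypothetical; only the 260% ratio for 137Cs in muscle is taken from the text.

```python
def concentration_ratio(tissue_conc, source_conc):
    """Ratio of tissue concentration to source (lichen or rumen) concentration.

    Both concentrations must be in the same units (for example Bq/kg).
    """
    if source_conc <= 0:
        raise ValueError("source concentration must be positive")
    return tissue_conc / source_conc


def predict_tissue_conc(source_conc, ratio):
    """Predict a tissue concentration from a monitored source concentration."""
    return source_conc * ratio


# Invented example: 300 Bq/kg in muscle against 100 Bq/kg in lichen.
cr = concentration_ratio(300.0, 100.0)
print(f"CR = {cr:.0%}")  # a ratio above 100% indicates biomagnification

# Using the lower end of the reported 137Cs range for muscle (260%),
# with an invented lichen measurement of 50 Bq/kg.
print(predict_tissue_conc(50.0, 2.60))
```

A ratio expressed as 260 to 370% corresponds to the factor of 3-4 biomagnification quoted for 137Cs, whereas ratios well below 100%, such as those for 210Pb, indicate that little of the radionuclide is transferred to muscle.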
Identifying cancer-associated mutations (driver mutations) is critical for understanding how changes in the cancer genome lead to activation of oncogenes or inactivation of tumor suppressor genes. Many proposed approaches use supervised machine learning techniques for prediction, with features obtained from various databases. However, it is often unknown which features are important for driver mutation prediction. In this study, we propose a novel feature selection method (called DX) applied to a set of 126 candidate features. To obtain the best performance, the rotation forest algorithm was adopted for the experiments. On the training dataset, which was collected from the COSMIC and Swiss-Prot databases, we obtained high prediction performance with 88.03% accuracy, 93.9% precision, and 81.35% recall when the 11 top-ranked features were used. Comparison with various other techniques on the TP53, EGFR, and Cosmic2plus datasets shows the generality of our method.
Objectives: US commercial airline pilots, like all flight crew, are at increased risk for specific cancers, but the relation of these outcomes to specific air cabin exposures is unclear. Flight time or block (airborne plus taxi) time often substitutes for assessment of exposure to cosmic radiation. Our objectives were to develop methods to estimate exposures to cosmic radiation and circadian disruption for a study of chromosome aberrations in pilots and to describe workplace exposures for these pilots.
Methods: Exposures were estimated for cosmic ionizing radiation and circadian disruption between August 1963 and March 2003 for 83 male pilots from a major US airline. Estimates were based on 523 387 individual flight segments in company records and pilot logbooks as well as summary records of hours flown from other sources. Exposure was estimated by calculation or imputation for all but 0.02% of the individual flight segments’ block time. Exposures were estimated from questionnaire data for a comparison group of 51 male university faculty.
Results: Pilots flew a median of 7126 flight segments and 14 959 block hours for 27.8 years. In the final study year, a hypothetical pilot incurred an estimated median effective dose of 1.92 mSv (absorbed dose, 0.85 mGy) from cosmic radiation and crossed 362 time zones. This study pilot was possibly exposed to a moderate or large solar particle event a median of 6 times or once every 3.7 years of work. Work at the study airline and military flying were the two highest sources of pilot exposure for all metrics. An index of work during the standard sleep interval (SSI travel) also suggested potential chronic sleep disturbance in some pilots. For study airline flights, median segment radiation doses, time zones crossed, and SSI travel increased markedly from the 1990s to 2003 (Ptrend < 0.0001). Dose metrics were moderately correlated with records-based duration metrics (Spearman’s r = 0.61–0.69).
Conclusions: The methods developed provided an exposure profile of this group of US airline pilots, many of whom have been exposed to increasing cosmic radiation and circadian disruption from the 1990s through 2003. This assessment is likely to decrease exposure misclassification in health studies.
circadian disruption; cosmic radiation; exposure assessment; flight crew; pilots
To investigate signal regulation models of gastric cancer, databases and literature were used to construct the signaling network in humans. Topological characteristics of the network were analyzed by Cytoscape. After marking gastric cancer-related genes extracted from the CancerResource, GeneRIF, and COSMIC databases, the FANMOD software was used for the mining of gastric cancer-related motifs in a network with three vertices. The significant motif difference method was adopted to identify significantly different motifs in the normal and cancer states. Finally, we conducted a series of analyses of the significantly different motifs, including gene ontology, function annotation of genes, and model classification. A human signaling network was constructed, with 1643 nodes and 5089 regulating interactions. The network was configured to have the characteristics of other biological networks. There were 57,942 motifs marked with gastric cancer-related genes out of a total of 69,492 motifs, and 264 motifs were selected as significantly different motifs by calculating the significant motif difference (SMD) scores. Genes in significantly different motifs were mainly enriched in functions associated with cancer genesis, such as regulation of cell death, amino acid phosphorylation of proteins, and intracellular signaling cascades. The top five significantly different motifs were mainly cascade and positive feedback types. Almost all genes in the five motifs were cancer related, including EPOR, MAPK14, BCL2L1, KRT18, PTPN6, CASP3, TGFBR2, AR, and CASP7. The development of cancer might be curbed by inhibiting signal transductions upstream and downstream of the selected motifs.
Significantly different motifs; Human signaling network; Gastric cancer
The Catalogue Of Somatic Mutations In Cancer (COSMIC) database and web site were developed to preserve somatic mutation data and share it with the community. Over the past 25 years, approximately 350 cancer genes have been identified, of which 311 are somatically mutated. COSMIC has been expanded and now holds data previously reported in the scientific literature for 28 known cancer genes. In addition, there are data from the systematic sequencing of 518 protein kinase genes. The total gene count in COSMIC stands at 538: 25 have a mutation frequency above 5% in one or more tumour types, no mutations were found in 333 genes, and 180 are rarely mutated, with frequencies <5% in any tumour set. The COSMIC web site has been expanded to give more views and summaries of the data and provide faster query routes and downloads. In addition, there is a new section describing mutations found through a screen of known cancer genes in 728 cancer cell lines including the NCI-60 set of cancer cell lines.
somatic; mutation; database; website
With the advent of whole-genome analysis for profiling tumor tissue, a pressing need has emerged for principled methods of organizing the large amounts of resulting genomic information. We propose the concept of multiplicity measures on cancer and gene networks to organize the information in a clinically meaningful manner. Multiplicity applied in this context extends Fearon and Vogelstein's multi-hit genetic model of colorectal carcinoma across multiple cancers.
Using the Catalogue of Somatic Mutations in Cancer (COSMIC), we construct networks of interacting cancers and genes. Multiplicity is calculated by evaluating the number of cancers and genes linked by the measurement of a somatic mutation. The Kamada-Kawai algorithm is used to find a two-dimensional minimum energy solution with multiplicity as an input similarity measure. Cancers and genes are positioned in two dimensions according to this similarity. A third dimension is added to the network by assigning a maximal multiplicity to each cancer or gene. Hierarchical clustering within this three-dimensional network is used to identify similar clusters in somatic mutation patterns across cancer types.
The clustering of genes in a three-dimensional network reveals a similarity in acquired mutations across different cancer types. Surprisingly, the clusters separate known causal mutations. The multiplicity clustering technique identifies a set of causal genes with an area under the ROC curve of 0.84 versus 0.57 when clustering on gene mutation rate alone. The cluster multiplicity value and number of causal genes are positively correlated via Spearman's Rank Order correlation (rs(8) = 0.894, Spearman's t = 17.48, p < 0.05). A clustering analysis of cancer types segregates different types of cancer. All blood tumors cluster together, and the cluster multiplicity values differ significantly (Kruskal-Wallis, H = 16.98, df = 2, p < 0.05).
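The rank correlation reported above can be illustrated with a small, dependency-free sketch of Spearman's coefficient, computed as the Pearson correlation of average ranks so that ties are handled. The per-cluster values below are invented for illustration and are not the study's data; the function names are likewise hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-cluster values: multiplicity and number of causal genes.
multiplicity = [2, 3, 5, 7, 8, 10, 12, 15, 18, 21]
causal_genes = [0, 1, 1, 2, 4, 3, 5, 6, 9, 8]
print(f"rs = {spearman(multiplicity, causal_genes):.3f}")
```

With ten clusters the coefficient has 8 degrees of freedom, matching the rs(8) notation used in the results; the significance test itself is not reproduced here.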
We demonstrate the principle of multiplicity for organizing somatic mutations and cancers in clinically relevant clusters. These clusters of cancers and mutations provide representations that identify segregations of cancer and genes driving cancer progression.
It is well established that genomic alterations play an essential role in oncogenesis, disease progression, and response of tumors to therapeutic intervention. The advances of next-generation sequencing technologies (NGS) provide unprecedented capabilities to scan genomes for changes such as mutations, deletions, and alterations of chromosomal copy number. However, the cost of full-genome sequencing still prevents the routine application of NGS in many areas. Capturing and sequencing the coding exons of genes (the “exome”) can be a cost-effective approach for identifying changes that result in alteration of protein sequences. We applied an exome-sequencing technology (Roche Nimblegen capture paired with 454 sequencing) to identify sequence variation and mutations in eight commonly used cancer cell lines from a variety of tissue origins (A2780, A549, Colo205, GTL16, NCI-H661, MDA-MB468, PC3, and RD). We showed that this technology can accurately identify sequence variation, providing ∼95% concordance with Affymetrix SNP Array 6.0 performed on the same cell lines. Furthermore, we detected 19 of the 21 mutations reported in Sanger COSMIC database for these cell lines. We identified an average of 2,779 potential novel sequence variations/mutations per cell line, of which 1,904 were non-synonymous. Many non-synonymous changes were identified in kinases and known cancer-related genes. In addition we confirmed that the read-depth of exome sequence data can be used to estimate high-level gene amplifications and identify homologous deletions. In summary, we demonstrate that exome sequencing can be a reliable and cost-effective way for identifying alterations in cancer genomes, and we have generated a comprehensive catalogue of genomic alterations in coding regions of eight cancer cell lines. These findings could provide important insights into cancer pathways and mechanisms of resistance to anti-cancer therapies.
Recently, a number of large-scale cancer genome sequencing projects have generated a large volume of somatic mutations; however, identifying the functional consequences and roles of somatic mutations in tumorigenesis remains a major challenge. Previous studies have shown that protein pocket regions play critical roles in the interaction of proteins with small molecules, enzymes, and nucleic acids. As such, investigating the features of somatic mutations in protein pocket regions provides a promising approach to identifying new genotype-phenotype relationships in cancer.
In this study, we developed a protein pocket-based computational approach to uncover the functional consequences of somatic mutations in cancer. We mapped 1.2 million somatic mutations across 36 cancer types from the COSMIC database and The Cancer Genome Atlas (TCGA) onto the protein pocket regions of over 5,000 protein three-dimensional structures. We further integrated cancer cell line mutation profiles and drug pharmacological data from the Cancer Cell Line Encyclopedia (CCLE) onto protein pocket regions in order to identify putative biomarkers for anticancer drug responses.
We found that genes harboring protein pocket somatic mutations were significantly enriched in cancer driver genes. Furthermore, genes harboring pocket somatic mutations tended to be highly co-expressed in a co-expressed protein interaction network. Using a statistical framework, we identified four putative cancer genes (RWDD1, NCF1, PLEK, and VAV3), whose expression profiles were associated with poor overall survival in melanoma, lung, or colorectal cancer patients. Finally, genes harboring protein pocket mutations were more likely to be drug-sensitive or drug-resistant. In a case study, we illustrated that the BAX gene was associated with sensitivity to three anticancer drugs (midostaurin, vinorelbine, and tipifarnib).
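An enrichment claim of this kind is commonly assessed with a one-sided hypergeometric (Fisher-type) test. The sketch below shows that calculation under the assumption that such a test is appropriate; the abstract does not state which test was used, and the function name and toy counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, driver_genes, pocket_genes, overlap):
    """One-sided p-value P(X >= overlap).

    X is the number of driver genes seen when `pocket_genes` genes are
    drawn without replacement from `total_genes`, of which `driver_genes`
    are drivers.
    """
    max_overlap = min(driver_genes, pocket_genes)
    denominator = comb(total_genes, pocket_genes)
    numerator = sum(
        comb(driver_genes, k) * comb(total_genes - driver_genes, pocket_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return numerator / denominator


# Toy counts (not from the study): 20 genes, 5 drivers,
# 4 genes with pocket mutations, 3 of which are drivers.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
```

A small p-value indicates that the overlap between pocket-mutated genes and driver genes is larger than expected by chance under this null model.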
This study provides novel insights into the functional consequences of somatic mutations during tumorigenesis and for anticancer drug responses. The computational approach used might be beneficial to the study of somatic mutations in the era of cancer precision medicine.
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-014-0081-7) contains supplementary material, which is available to authorized users.
The genome sequence framework provided by the human genome project allows us to precisely map human genetic variations in order to study their association with disease and their direct effects on gene function. Since the description of tumor suppressor genes and oncogenes several decades ago, both germ-line variations and somatic mutations have been established to be important in cancer—in terms of risk, oncogenesis, prognosis and response to therapy. The Cancer Genome Atlas initiative proposed by the NIH is poised to elucidate the contribution of somatic mutations to cancer development and progression through the re-sequencing of a substantial fraction of the total collection of human genes—in hundreds of individual tumors and spanning several tumor types. We have developed the CancerGenes resource to simplify the process of gene selection and prioritization in large collaborative projects. CancerGenes combines gene lists annotated by experts with information from key public databases. Each gene is annotated with gene name(s), functional description, organism, chromosome number, location, Entrez Gene ID, GO terms, InterPro descriptions, gene structure, protein length, transcript count, and experimentally determined transcript control regions, as well as links to Entrez Gene, COSMIC, and iHOP gene pages and the UCSC and Ensembl genome browsers. The user-friendly interface provides for searching, sorting and intersection of gene lists. Users may view tabulated results through a web browser or may dynamically download them as a spreadsheet table. CancerGenes is available at .
Over the past three decades, mortality from lung cancer in China has risen sharply and continuously, making it the leading cause of cancer death. Identifying the actual sequence of gene mutations, especially with next-generation sequencing (NGS) technology, may help doctors determine which mutations lead to precancerous lesions and which produce invasive carcinomas. In this study, we analyzed the latest lung cancer data in the COSMIC database in order to find genomic “hotspots” that are frequently mutated in human lung cancer genomes. The results revealed that the most frequently mutated lung cancer genes are EGFR, KRAS and TP53. In recent years, EGFR and KRAS test kits have been used to screen lung cancer patients, but they have several disadvantages: they have low sensitivity and are labor-intensive and time-consuming. In this study, we constructed a more complete catalogue of lung cancer mutation events including 145 mutated genes. With the genes on this list, it may be feasible to develop an NGS kit for lung cancer mutation detection.
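A hotspot analysis of this kind reduces to counting how often each gene, and each specific amino acid change, is mutated across samples. The following is a minimal sketch that assumes mutation records have been exported as (sample, gene, amino acid change) tuples. The record layout, sample identifiers, and counts are hypothetical and do not reflect actual COSMIC export formats or frequencies.

```python
from collections import Counter

# Toy records: (sample ID, gene symbol, amino acid change). Not real data.
records = [
    ("S1", "EGFR", "p.L858R"),
    ("S2", "EGFR", "p.L858R"),
    ("S3", "EGFR", "p.E746_A750del"),
    ("S4", "KRAS", "p.G12C"),
    ("S5", "KRAS", "p.G12D"),
    ("S1", "TP53", "p.R273H"),
    ("S1", "TP53", "p.R175H"),
]


def rank_genes(records):
    """Rank genes by number of mutated samples (each sample counted once)."""
    sample_gene_pairs = {(sample, gene) for sample, gene, _ in records}
    counts = Counter(gene for _, gene in sample_gene_pairs)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def hotspots(records, min_count=2):
    """Return (gene, change) pairs seen at least `min_count` times."""
    counts = Counter((gene, change) for _, gene, change in records)
    recurrent = [(key, n) for key, n in counts.items() if n >= min_count]
    return sorted(recurrent, key=lambda item: (-item[1], item[0]))


gene_ranking = rank_genes(records)
recurrent_changes = hotspots(records, min_count=2)
```

Counting each gene once per sample prevents a single heavily mutated tumour from inflating a gene's rank.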
Lung cancer; Next-generation sequencing; Somatic mutation kit; COSMIC
With the advent of Next Generation Sequencing the identification of mutations in the genomes of healthy and diseased tissues has become commonplace. While much progress has been made to elucidate the aetiology of disease processes in cancer, the contributions to disease that many individual mutations make remain to be characterised and their downstream consequences on cancer phenotypes remain to be understood. Missense mutations commonly occur in cancers and their consequences remain challenging to predict. However, this knowledge is becoming more vital, for both assessing disease progression and for stratifying drug treatment regimes. Coupled with structural data, comprehensive genomic databases of mutations such as the 1000 Genomes project and COSMIC give an opportunity to investigate general principles of how cancer mutations disrupt proteins and their interactions at the molecular and network level. We describe a comprehensive comparison of cancer and neutral missense mutations; by combining features derived from structural and interface properties we have developed a carcinogenicity predictor, InCa (Index of Carcinogenicity). Upon comparison with other methods, we observe that InCa can predict mutations that might not be detected by other methods. We also discuss general limitations shared by all predictors that attempt to predict driver mutations and discuss how this could impact high-throughput predictions. A web interface to a server implementation is publicly available at http://inca.icr.ac.uk/.
Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators have led to a rich repository of information on functional sites of genes and proteins. This information, along with variation-related annotation, can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for the presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. nsSNVs identified through automated analysis of NGS data from TCGA are also included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform, HIVE (High-performance Integrated Virtual Environment), has been developed for storing, analyzing, computing and curating NGS data and associated metadata. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identification of novel or common nsSNVs that can be prioritized in validation studies.
Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu
Human cancer is caused by the accumulation of tumor-specific mutations in oncogenes and tumor suppressors that confer a selective growth advantage to cells. As a consequence of genomic instability and high levels of proliferation, many passenger mutations that do not contribute to the cancer phenotype arise alongside mutations that drive oncogenesis. While several approaches have been developed to separate driver mutations from passengers, few approaches can specifically identify activating driver mutations in oncogenes, which are more amenable to pharmacological intervention.
We propose a new statistical method for detecting activating mutations in cancer by identifying nonrandom clusters of amino acid mutations in protein sequences. A probability model is derived using order statistics, assuming that the location of amino acid mutations on a protein follows a uniform distribution. Our statistical measure is the difference between pair-wise order statistics, which is equivalent to the size of an amino acid mutation cluster, and the probabilities are derived from exact and approximate distributions of the statistical measure. Using data in the Catalog of Somatic Mutations in Cancer (COSMIC) database, we have demonstrated that our method detects well-known clusters of activating mutations in KRAS, BRAF, PI3K, and β-catenin. The method can also identify new cancer targets as well as gain-of-function mutations in tumor suppressors.
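The order-statistics idea can be sketched under a continuous approximation. For n independent Uniform(0, 1) positions, the gap between the i-th and k-th order statistics follows a Beta(k - i, n - (k - i) + 1) distribution, whose cumulative distribution equals a binomial tail probability. The code below uses that standard identity. It is an illustrative sketch, not the authors' implementation: it ignores the discreteness of residue positions and any multiple-testing correction, and the function name, protein length, and positions are hypothetical.

```python
from math import comb


def cluster_probability(n_mutations, i, k, span_fraction):
    """P(X_(k) - X_(i) <= span_fraction) for n iid Uniform(0, 1) positions.

    Uses the identity I_d(m, n - m + 1) = P(Binomial(n, d) >= m), m = k - i.
    """
    if not 1 <= i < k <= n_mutations:
        raise ValueError("require 1 <= i < k <= n_mutations")
    if not 0.0 <= span_fraction <= 1.0:
        raise ValueError("span_fraction must lie in [0, 1]")
    n, m, d = n_mutations, k - i, span_fraction
    return sum(comb(n, j) * d**j * (1.0 - d) ** (n - j) for j in range(m, n + 1))


# Toy example (not real data): 7 mutations on a 200-residue protein,
# with the 1st to 5th sorted positions lying within one residue.
protein_length = 200
positions = sorted([12, 12, 13, 13, 13, 61, 150])
span = (positions[4] - positions[0]) / protein_length
p_cluster = cluster_probability(len(positions), 1, 5, span)
```

A very small probability suggests that the mutations are packed more tightly than uniform placement would predict. In this continuous approximation a span of zero has probability zero, which is one reason an exact discrete distribution is also needed in practice.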
Our proposed method is useful to discover activating driver mutations in cancer by identifying nonrandom clusters of somatic amino acid mutations in protein sequences.
As large-scale re-sequencing of genomes reveals many protein mutations, especially in human cancer tissues, prediction of their likely functional impact becomes an important practical goal. Here, we introduce a new functional impact score (FIS) for amino acid residue changes using evolutionary conservation patterns. The information in these patterns is derived from aligned families and sub-families of sequence homologs within and between species using a combinatorial entropy formalism. The score performs well on a large set of human protein mutations in separating disease-associated variants (∼19 200), assumed to be strongly functional, from common polymorphisms (∼35 600), assumed to be weakly functional (area under the receiver operating characteristic curve of ∼0.86). In cancer, using recurrence, multiplicity and annotation for ∼10 000 mutations in the COSMIC database, the method does well in assigning higher scores to more likely functional mutations (‘drivers’). To guide experimental prioritization, we report a list of about 1000 top human cancer genes frequently mutated in one or more cancer types, ranked by likely functional impact, and an additional 1000 candidate cancer genes with rare but likely functional mutations. In addition, we estimate that at least 5% of cancer-relevant mutations involve a switch of function, rather than simply loss or gain of function.
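The area under the ROC curve quoted above has a simple interpretation: it is the probability that a randomly chosen disease-associated variant receives a higher score than a randomly chosen polymorphism. The sketch below computes it by direct pairwise comparison. The scores are invented for illustration and are not FIS values.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute one half, matching the Mann-Whitney U formulation.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy scores (hypothetical): disease-associated variants vs. polymorphisms.
disease_scores = [3.1, 2.4, 1.9, 0.8]
polymorphism_scores = [0.2, 1.0, 2.0, 0.5]
auc = roc_auc(disease_scores, polymorphism_scores)
```

The pairwise loop is quadratic in the number of variants. For sets of tens of thousands of variants, a rank-based computation would be preferable.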