More than 100 000 human genetic variations have been described in various genes that are associated with a wide variety of diseases. Such data provides invaluable information for both clinical medicine and basic science. A number of locus-specific databases have been developed to exploit this huge amount of data. However, the scope, format and content of these databases differ strongly, and because no standard for variation databases has yet been adopted, the way data is presented varies enormously. This review aims to give an overview of current public and commercial resources for human variation data.
In recent years, the cloning of genes involved in complex diseases such as cancer and the development of new high-throughput techniques such as single nucleotide polymorphism (SNP) arrays have made enormous progress. As a result, more than 100 000 human genetic variations have been described in various genes associated with a wide variety of diseases (1–3). Somatic variations in cancer are used in clinical studies and molecular pathology to characterize tumor types, to select the best-suited treatment, and to predict response to treatment. Thus, mutation analysis can play an important role in drug discovery and drug development. Identification of genetic variants will yield new drug targets and biomarkers.
Cancer, as a disease of genome alterations, arises through the sporadic acquisition of multiple somatic variations (4). However, not all mutations contribute equally to the cancer type in which they are found. The proportion of mutations causally implicated in cancer is still unknown, largely because of the high number of variations between different tumors (5–9). Although the number of unique variations in each cancer genome can be very high (10,11), only a few somatic variations will be critical for the development of the tumor. These causative variations, the so-called ‘drivers’, emerge because of selective pressure during tumorigenesis, whereas many other mutations, the so-called ‘passengers’, are only incidental or caused by genome instabilities (12). Distinguishing disease-causing driver mutations from passenger variations is a challenge for mutation analysis (13).
Analysis of mutations is useful in many ways: the study of cancer-prone DNA repair diseases (Xeroderma pigmentosum, Ataxia telangiectasia, Fanconi’s anemia, Bloom’s syndrome and others) has given valuable insights into the type and function of genes responsible for maintaining DNA integrity (14–18). Mutation analysis can also help to predict the risk of developing certain types of cancer; mutations in BRCA1 and BRCA2 (increased breast cancer risk) (19) and APC (increased risk for colon cancer) (20) are among the best-known examples so far.
Mutations can also influence the response of patients to cancer drugs, e.g. mutations in KRAS (21,22) or BRAF (23,24). The presence of certain mutations can also influence the progression-free or overall survival of patients (22,25).
In general, mutations can be grouped into two categories: germline and somatic. Germline mutations are variations found in all cells of an organism, including germline cells. They play an important role in evolution by giving every human their genetic individuality (see SNPs) but also give rise to hereditary diseases like sickle-cell anemia or phenylketonuria. Germline mutations can also lead to an increased risk of developing cancer, like BRCA1 and BRCA2 gene mutations, which are associated with an increased risk for breast and ovarian cancer (26–28). Other examples of familial cancer syndromes include von Hippel–Lindau syndrome (caused by mutations in VHL) (29), Peutz–Jeghers syndrome (caused by mutations in LKB1) (30) and Li–Fraumeni syndrome (caused by mutations in TP53) (31).
Detection of germline mutations with current technologies is state of the art but time-consuming. Usually a large amount of genetic material of good quality can be extracted from blood cells. However, beyond mutation detection, distinguishing disease-causing germline mutations from neutral ones that have no effect on the phenotype is an important but non-trivial task. Currently, no generally applicable solution to this problem exists, and the question often remains unresolved.
Somatic mutations are not inherited but acquired during the lifetime of an organism in its somatic cells, and they might cause tissue-specific tumors. An important problem with somatic mutations is the difficulty of their detection. Tumor samples can be very heterogeneous and are very often ‘contaminated’ with normal cells, such as stromal cells. However, since somatic mutations are identified through a comparison of a tumor sample with a normal sample of the same organism, the identification of the mutation is unambiguous. For somatic mutations, too, the differentiation between drivers and passengers is an important but still unsolved problem. In contrast to germline mutations, however, all somatic mutations are tumor associated. Therefore, all non-silent somatic mutations are potential candidates for biomarker development.
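The tumor versus normal comparison described above can be illustrated with a minimal sketch. It assumes that variant calling has already produced one set of variant identifiers per sample; the function name, the labels and the example coordinates are hypothetical and not taken from any of the databases discussed here.

```python
def classify_calls(tumor_calls, normal_calls):
    """Label variants by comparing a tumor sample with its matched normal.

    Both arguments are sets of hashable variant identifiers, here
    (chromosome, position, ref, alt) tuples. The labels are illustrative.
    """
    labels = {}
    for variant in tumor_calls | normal_calls:
        in_tumor = variant in tumor_calls
        in_normal = variant in normal_calls
        if in_tumor and in_normal:
            labels[variant] = "germline"     # found in normal and tumor sample
        elif in_tumor:
            labels[variant] = "somatic"      # found in the tumor sample only
        else:
            labels[variant] = "normal_only"  # e.g. allele lost in the tumor, or a calling error
    return labels


# Made-up coordinates, used only to exercise the function.
tumor = {("1", 1000, "A", "G"), ("2", 2000, "C", "T")}
normal = {("1", 1000, "A", "G"), ("3", 3000, "G", "A")}
labels = classify_calls(tumor, normal)
```

In practice the comparison is complicated by the sample heterogeneity and contamination mentioned above, which this sketch does not model.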
Genome alterations are typically classified by mutation type. The different databases characterize all variations first by their effect on the nucleotide sequence: deletions, insertions and single nucleotide variations. Mutations occurring in the coding region of a gene can also be classified by their effect on the amino acid sequence. A variation of the coding sequence that does not change the amino acid sequence of the protein is called a silent mutation. Single nucleotide mutations that cause the substitution of a different amino acid are called missense mutations. A frameshift mutation is an insertion or deletion in the coding sequence which changes the reading frame, resulting in a different translation of the subsequent sequence. Nonsense mutations generate a premature stop codon and often a non-functional truncated protein product.
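The classification by effect on the amino acid sequence can be sketched in a few lines. The example below is a simplified illustration that assumes the standard genetic code and codons given on the coding strand; the function names and the ‘in-frame’ and ‘other’ labels are choices made for this sketch, not terms used by the databases.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a single-codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"   # premature stop codon
    if ref_aa == "*":
        return "other"      # loss of a stop codon, not covered in the text
    return "missense"


def classify_indel(n_bases):
    """An insertion or deletion shifts the reading frame unless its length is a multiple of 3."""
    return "frameshift" if n_bases % 3 != 0 else "in-frame"


print(classify_substitution("GAG", "GTG"))  # Glu to Val
print(classify_substitution("TGG", "TGA"))  # Trp to stop
print(classify_indel(1))
```

A real annotation pipeline would additionally need the transcript, strand and reading frame of each variant, which are left out here.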
The terms single nucleotide germline mutation and SNP are often used as synonyms, since both describe variations of single nucleotides which are inherited and not tumor-associated per se. However, in the databases presented here the two terms are used with different meanings: SNPs as presented in public databases like dbSNP (32,33) or HapMap (34) are germline variations for which, at most, population frequencies are known. In the literature it is usually assumed that a variation should be found in more than 1% of the population in order to be called a SNP. Such information is very useful for biomarker development since it describes the prevalence of the mutation in different populations. However, it is normally not possible to get additional information (like gender, age, or disease status) on the individuals carrying the SNP; only the population a person belongs to is given. Since it is not known whether the information comes from a tumor or a normal sample, a correlation between diseases and SNPs cannot be calculated.
In contrast, germline mutations presented in cancer or disease mutation databases like ‘The Cancer Genome Atlas (TCGA)’ (35) are usually connected with additional sample information like patient gender, age, histology or tissue. Germline mutations are found in the normal as well as the tumor sample. Hence, the sample information allows for further analyses of associations between germline mutations and diseases.
A recurring problem in every field where huge amounts of data are generated is standardization. Without standardization, the task of identifying and integrating the data is very complicated, laborious, error-prone and time-consuming. Although databases may have different scopes and aims, it is important to standardize content and annotation. The Human Genome Variation Society (HGVS) has proposed recommendations for the nomenclature of genetic variations and for the content of mutation databases and scientific publications (36). This naming of mutations has now become widely accepted. Some journals (e.g. Human Mutation) already accept only publications whose mutation notation follows the HGVS recommendations. If more publishers followed this trend, it would have a very positive effect on the usability of mutation databases, including an increase in the quality and amount of their content.
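The practical benefit of a standard notation is that variant descriptions become machine-readable. The following sketch parses only the simplest case, a single-nucleotide substitution written in the HGVS style with a coding (`c.`) or genomic (`g.`) prefix, such as `c.35G>A`. The regular expression and the function name are illustrative assumptions; the full recommendations cover many more variant types, none of which are handled here.

```python
import re

# Matches simple substitutions such as "c.35G>A" or "g.1000C>T" only.
SUBSTITUTION = re.compile(
    r"^(?P<level>[cg])\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(description):
    """Return the parts of a simple substitution, or None if it does not match."""
    match = SUBSTITUTION.match(description.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


print(parse_substitution("c.35G>A"))
print(parse_substitution("G12D"))  # free-text style description, rejected
```

A database that stores descriptions in one agreed notation can validate and compare entries in this way, whereas free-text descriptions from older publications would have to be curated by hand.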
HGVS and its members have published a number of recommendations, e.g. for the collection of somatic mutations, sharing data, etc. There are also projects at the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) to develop reference sequences: locus reference genomic (LRG) sequences (http://www.lrg-sequence.org) and RefSeqGenes (http://www.ncbi.nlm.nih.gov/projects/RefSeq/RSG/), respectively. In addition, the Gen2Phen (http://www.gen2phen.org) project works on data models and standards for a number of aspects related to variation data description, storage and integration in databases.
Except for the already widely accepted naming recommendations of mutations by the HGVS, a promising standardization effort for integrating all cancer genome data is still missing.
Historically, mutations and variations in humans have been reported only in the published literature. Mutation descriptions were often not precise, no standard notation existed, and the sequence of the reference gene under study was almost never indicated. As a result, a sophisticated mutation analysis was mostly unfeasible. However, with the explosion of large-scale cancer genome sequencing (35,37–40), more and more information on genetic variations has been captured in recent years in publicly available databases that can be used by clinicians or scientists as a research tool. These databases are scattered, and their scope, format and content can be very different. Current data related to somatic mutations is mostly buried in journals or spread across several locus-specific databases (LSDBs) and general databases that have no or very limited connections between them.
Only a few large public resources exist that comprehensively compile data on somatic gene alterations in cancer: International Agency for Research on Cancer (IARC) TP53 Database (41), Catalog of Somatic Mutations in Cancer (COSMIC) (42), TCGA (35), Roche Cancer Genome Database (RCGDB) and Human Gene Mutation Database (HGMD®) (43).
The LSDBs often originate as loosely organized compilations of data. Since no standard system similar to the HGVS recommendation for mutation notation has yet been established, the presentation of the data in LSDBs varies enormously. The data is mostly presented in flat files, plain text databases, or Microsoft Excel spreadsheets, making it easy to collect and store the information but nearly impossible to search or retrieve specific data. More ambitious databases use open source database management software (DBMS) like MySQL or PostgreSQL, whereas only a minority of curators use specialized software such as the UMD (44), the Mutation Storage and Retrieval Program (MuStaR) (45), or the Leiden Open Source Variation Database (LOVD) (46). The use of such relational DBMSs allows users to specify complex queries and to run specific analyses on customized subsets of the database.
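To illustrate the kind of query that a relational DBMS makes possible and a flat file does not, the sketch below loads a few invented records into an in-memory SQLite database (used here in place of MySQL or PostgreSQL only because it ships with Python). The table layout, gene names and sample identifiers are hypothetical and do not reflect the schema of any existing LSDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variant (gene TEXT, tissue TEXT, effect TEXT, sample_id TEXT)"
)

# Invented example records.
rows = [
    ("GENE_A", "lung", "missense", "S1"),
    ("GENE_A", "lung", "nonsense", "S2"),
    ("GENE_A", "colon", "missense", "S3"),
    ("GENE_B", "lung", "silent", "S1"),
    ("GENE_B", "colon", "missense", "S4"),
]
conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?)", rows)

# Number of distinct samples per gene carrying a non-silent mutation in one tissue.
query = """
    SELECT gene, COUNT(DISTINCT sample_id) AS n_samples
    FROM variant
    WHERE tissue = ? AND effect != 'silent'
    GROUP BY gene
    ORDER BY gene
"""
lung_counts = conn.execute(query, ("lung",)).fetchall()
print(lung_counts)
```

The same question asked of a spreadsheet or plain text file would require manual filtering and counting for every gene and tissue of interest.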
Currently, the best-known publicly available primary database on somatic mutations in human cancer is ‘COSMIC’ (42), hosted at the Wellcome Trust Sanger Institute in Cambridge. The data is gathered from scientific publications and from genome-wide screens from the Cancer Genome Project (CGP) at the same institute. The project has been continuously updated and improved for over 9 years and currently contains more than 108 773 mutations in >13 500 different genes observed in over 449 676 different tumor samples. The curation process in COSMIC is largely manual, resulting in a very high quality of the data. For each mutation, all details on the sample such as patient age, gender, histology and tissue are available. COSMIC uses its own internal classification system to provide tissue and histology consistency within the database and to reduce redundancy. All tissue and histology information from scientific publications is translated using this classification system. In addition, for each study the project records which genes were actually screened, since published studies often focus on mutation hot spots, for example in KRAS (47), BRAF (48) or TP53 (49). This information enables frequency data to be calculated for mutations in various genes and different cancer types. COSMIC also offers somatic mutations found in cell-lines, including the NCI-60 (50). The website of COSMIC has a clear structure and is easy to use. The interface allows users to browse by gene or to search by phenotype. Summary information on mutation counts and frequencies is presented graphically for easier interpretation. In addition, all information can be downloaded as txt files or as an Oracle dump file.
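Why the list of screened genes matters for frequency data can be shown with a small sketch: a sample should only count towards the denominator for a gene if that gene was actually examined in the sample. The record layout and the function name below are assumptions made for illustration and do not reflect the COSMIC data model; the sample records are invented.

```python
def mutation_frequency(samples, gene, tissue):
    """Fraction of screened samples of one tissue that carry a mutation in `gene`.

    Returns None when no sample of that tissue was screened for the gene.
    """
    screened = [
        s for s in samples
        if s["tissue"] == tissue and gene in s["screened"]
    ]
    if not screened:
        return None
    mutated = sum(1 for s in screened if gene in s["mutated"])
    return mutated / len(screened)


# Invented records: each sample lists the genes screened and the genes found mutated.
samples = [
    {"id": "S1", "tissue": "colon", "screened": {"KRAS", "BRAF"}, "mutated": {"KRAS"}},
    {"id": "S2", "tissue": "colon", "screened": {"KRAS"}, "mutated": set()},
    {"id": "S3", "tissue": "colon", "screened": {"BRAF"}, "mutated": {"BRAF"}},
    {"id": "S4", "tissue": "lung", "screened": {"KRAS"}, "mutated": {"KRAS"}},
]

print(mutation_frequency(samples, "KRAS", "colon"))
print(mutation_frequency(samples, "TP53", "colon"))
```

Without the screening information, sample S3 would wrongly be counted as KRAS wild type, and the KRAS frequency in colon would be underestimated.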
Another large mutation data source is ‘TCGA’ (35), a project at the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). The main goal of TCGA is to understand the molecular basis of cancer through the application of genome analysis techniques, including large-scale genome sequencing and SNP analysis. For each patient, a whole-genome analysis of a normal sample, a tumor sample and control samples (a second normal and a second tumor sample) is performed, enabling researchers to distinguish between somatic and germline mutations. All mutations found are publicly available in a special Mutation Annotation Format (MAF) and can be downloaded via the TCGA Data Portal. This portal contains all TCGA data on clinical information associated with tumors and human subjects, genomic characterization, and high-throughput sequencing analysis of the tumor genomes. However, no advanced search interface or graphical visualization of the mutation data is available. In its starting phase the project focused on only two cancer types: brain cancer (glioblastoma multiforme) and ovarian serous adenocarcinoma. After the pilot phase, which was completed in 2009, TCGA matured into a full project and is now dealing with more than 20 types of cancer.
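Because the downloadable MAF files are plain tab-delimited text, a basic summary can be produced without a dedicated search interface. The sketch below counts, per gene, the number of distinct tumor samples with at least one reported mutation. It assumes a header row with columns named `Hugo_Symbol` and `Tumor_Sample_Barcode` and optional comment lines starting with `#`; these layout details, the function name and the example rows are assumptions and should be checked against the files actually downloaded.

```python
import csv
import io
from collections import Counter


def count_mutated_samples(maf_text):
    """Count distinct tumor samples per gene in tab-delimited MAF-style text."""
    data_lines = (
        line for line in io.StringIO(maf_text) if not line.startswith("#")
    )
    reader = csv.DictReader(data_lines, delimiter="\t")
    seen = set()
    for row in reader:
        # A sample with several mutations in one gene is counted once.
        seen.add((row["Hugo_Symbol"], row["Tumor_Sample_Barcode"]))
    return Counter(gene for gene, _ in seen)


# Invented example content with hypothetical gene and sample names.
example = (
    "# illustrative comment line\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "GENE_A\tSAMPLE-01\tMissense_Mutation\n"
    "GENE_A\tSAMPLE-01\tSilent\n"
    "GENE_A\tSAMPLE-02\tNonsense_Mutation\n"
    "GENE_B\tSAMPLE-02\tMissense_Mutation\n"
)
counts = count_mutated_samples(example)
print(counts)
```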
Another concept, focusing on the integration of heterogeneous mutation data sources, is pursued by the RCGDB (51), developed at Roche Pharma Research. The freely available warehouse system integrates somatic and germline mutations gathered by manual curation from scientific publications and public cancer mutation databases (COSMIC, TCGA, etc.). In addition, these mutations are enriched with SNP data from the HapMap (34) project. Updates are provided on a regular basis depending on the update frequency of the external data sources (approximately every 3 months). Access to the RCGDB is offered via a publicly available web interface. A major aspect in designing the user interface was that users should be able to search and view mutations in an intuitive and straightforward manner, without having to understand the architectural details of the warehouse system. Therefore, the database offers a Google-like web interface to search for cancer genome information on a single gene, sample or cell-line, and on multiple genes, samples or cell-lines. As a special feature, the search is supported by an auto-suggestion functionality that allows users to search by NCBI GeneIDs, names, or synonyms.
The HGMD® (43) at the Institute of Medical Genetics in Cardiff is a commercial mutation database providing information on somatic and germline mutations. The database also offers a less up-to-date public version which is freely available only to registered users from academic institutions or non-profit organizations. The data is gathered from scientific publications and from publicly available LSDBs. The project claims to include all mutations causing or associated with human inherited disease, plus disease-associated/functional polymorphisms reported in the literature. Currently, HGMD provides information on 96 631 mutations in 3611 genes under the professional license and 69 660 mutations in 2572 genes in the public version of the database. The website of HGMD allows users to search by gene, publication or mutation ID and presents the results in a table view. A downloadable version of HGMD is only available under the professional license.
In addition to multi-gene LSDBs, various single-gene LSDBs are publicly available. The largest and best-known single-gene LSDB is the TP53 mutation database from the IARC (41), which contains all TP53 gene variations identified in human populations and tumor samples since 1989. The database contains information on somatic as well as germline mutations of TP53 in patient samples, human cell-lines, and mouse models. This data is compiled from the peer-reviewed literature and from generalist databases. The website offers different sophisticated interfaces for searching and mining the database by multiple criteria. Furthermore, all information can be downloaded as tab-delimited txt files. A large number of other single-gene databases exist, such as the L1CAM mutation database from the University of Groningen (52), containing single-gene somatic mutations. Most of these LSDBs are small and contain <500 variants.
For a detailed list of cancer mutation databases see Table 1.
In addition to the cancer variation databases, a large number of publicly available databases focus on disease-specific variations. An overview of such disease variation databases can be found in Table 2.
Prominent disease mutation databases are the public IDbases (53) maintained at the Institute of Medical Technology, University of Tampere. The IDbases are LSDBs for immunodeficiency-causing mutations. The project maintains 122 different IDbases containing data for 5359 patients altogether. In addition to gene mutations, IDbases provide information about clinical presentation. All information has been collected from the literature as well as directly from researchers. The databases do not provide any sophisticated search interface, but the data can be downloaded as a txt file.
All databases presented are good starting points for retrieving human variation data for certain use cases, depending on the interfaces provided. However, as soon as a query gets more complicated, an integrative approach becomes necessary. Unfortunately, the diversity of current mutation information systems and of the underlying data models makes it difficult to mine human variation databases in an integrative way. Currently, researchers typically have to browse and search several databases to obtain the required information. No unified access to all the different cancer genome related data sources exists, resulting in a need for more efficient integrative systems. With COSMIC, which is currently integrating TCGA and IARC TP53 information, and the RCGDB, which already integrates most of the data sources presented in this review, two promising integrative data resources are available. Nevertheless, the standardization and virtual consolidation of existing databases will be one major challenge for future developments. Although these problems have already been discussed in previous publications (54–56), the heterogeneity of mutation databases remains an acute problem because of the exponential growth of data generated by genome sequencing. This review is meant to provide an overview of the current status of mutation data in public resources and so to help users find the information they are looking for.
This work was supported by the Roche Postdoc Fellowship Program.
Conflict of interest. None declared.