NCI and CGAP
SNP500Cancer is part of the National Cancer Institute’s Cancer Genome Anatomy Project (CGAP) and is specifically designed to generate resources for the identification and characterization of genetic variation in genes important in cancer. CGAP (1) is dedicated to the development of technology, including both assays and utilization of technical platforms, to determine the gene expression profiles of normal, precancer and cancer cells. Accordingly, data pertaining to genes and their variation are made available on the public web site http://cgap.nci.nih.gov/. SNP500Cancer represents one of several initiatives designed to characterize sequence variation and is a resource for applying genetic approaches to understanding the etiology of different cancers as well as related phenotypes. Single nucleotide polymorphisms (SNPs) validated in this initiative are used by the NCI’s Core Genotyping Facility (CGF) to genotype samples for studies coordinated by the Division of Cancer Epidemiology and Genetics (DCEG), the primary focus within the NCI for population-based research on environmental and genetic determinants of cancer.
The SNP500Cancer initiative studies the genomes of 102 individuals of self-described heritage. The SNP500Cancer population is defined here as this sample of n = 102 DNAs, each annotated with geographic origin and self-described ethnic group affiliation, chosen to represent a diverse group of human populations. The anonymized samples are obtained from the Coriell Cell Repositories (Coriell Institute for Medical Research, Camden, NJ, USA) and represent four ethnic groups: 24 African/African-American, 31 Caucasian, 23 Hispanic and 24 Pacific Rim. These individuals are not a random sample of any specific human population, so the predictive value of the sequence and genotype data provided will vary for different population samples. However, where literature data are sparse, the allele frequencies in the SNP500Cancer population of n = 102 should help in determining how informative a given SNP is overall, as well as in each of the four subpopulations. It should also be noted that the SNP500Cancer subpopulations consist of subjects originally obtained from different geographic and ethnic groups. This approach was chosen for the purposes of discovery and validation of SNPs of interest to molecular epidemiology studies in cancer.
Selection of genes and SNPs
SNPs are chosen to lie within or close to candidate genes. The genes and SNPs selected for analysis are drawn from the following sources: (i) review of the published literature on SNPs and cancer, (ii) genes that fit a plausible model for cancer studies (e.g. by pathway), and (iii) SNPs reported in public databases with an associated frequency that was not determined in silico.
As of July 2003, the database contains 480 genes. Figure shows the distribution of number of validated SNPs per gene. The range is from 1 to 44 SNPs per gene, average = 5.4 SNPs per gene, median = 4 SNPs per gene.
A contig of approximately 600 bp in length is generated for each SNP, which is localized to the center, creating flanking regions of roughly 300 bp in each direction. Additional putative SNPs (determined from dbSNP) are annotated onto the contig. Sequencing primers are designed for bi-directional sequence analysis using Primer3 software (2). Each primer is tagged with a universal M13 sequencing primer: TGTAAAACGACGGCCAGT for forward and CAGGAAACAGCTATGACC for reverse. The sequencing assay procedures and conditions are displayed on the SNP500Cancer website. Sequence tracings are analyzed with the Sequencher 4.0.5 program (Genecodes, Ann Arbor, MI). After alignment of bi-directional sequence reads to the pre-annotated 600 bp contig, two independent reviewers analyze each contig for annotated and novel SNPs. The criteria for completing sequence alignment of each contig include 190 separate sequence tracings at a minimum of 70% assembly parameters. Genotype calls are determined for each of the 102 individuals, and genotype and allele frequencies are maintained in an Oracle database and displayed on the SNP500Cancer website.
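As a rough illustration of the primer tagging and contig layout described above, the following Python sketch prepends the two M13 sequences quoted in the text to a primer pair and computes a window of about 600 bp centered on a SNP. The function names, the example primer sequences and the choice to attach the tail at the 5' end are assumptions made for illustration; they are not taken from the SNP500Cancer pipeline.

```python
# Universal M13 tails, as quoted in the text.
M13_FORWARD_TAIL = "TGTAAAACGACGGCCAGT"
M13_REVERSE_TAIL = "CAGGAAACAGCTATGACC"


def tag_primer_pair(forward_primer, reverse_primer):
    """Prepend the M13 tails to a gene-specific primer pair.

    Assumption: the tail is attached at the 5' end and all sequences
    are written 5'->3'.
    """
    return (
        M13_FORWARD_TAIL + forward_primer.upper(),
        M13_REVERSE_TAIL + reverse_primer.upper(),
    )


def contig_window(snp_index, flank=300):
    """Return 0-based, half-open (start, end) bounds centered on the SNP.

    With flank=300 the window holds up to 601 bases: 300 on each side
    plus the SNP itself. The start is clipped at 0 near a sequence start.
    """
    start = max(0, snp_index - flank)
    end = snp_index + flank + 1
    return start, end


# Hypothetical gene-specific primers, invented for illustration only.
fwd, rev = tag_primer_pair("acgtacgtacgtacgtacgt", "ttgcattgcattgcattgca")
print(fwd)
print(rev)
print(contig_window(1000))
```

The primers themselves would come from Primer3; this sketch only covers the tagging and coordinate arithmetic.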
Genotyping protocols and validation
For SNPs determined to have a minor allele frequency >0.05 in at least one of the SNP500Cancer subpopulations, approximately 200 bp of DNA sequence surrounding each SNP is submitted for design on one or more of the CGF’s genotyping platforms: (i) Applied Biosystems’ TaqMan™ ‘Assay by Design’ service, (ii) Epoch Biosciences’ MGB Eclipse™ probes, and (iii) Sequenom MassARRAY™. The genotyping assay procedures and conditions for all three platforms are displayed on the SNP500Cancer web site.
Genotypes are validated by establishing concordance on two or more molecular genetic analysis platforms, where the primary comparison is between genotyping results from sequencing and from another genotyping platform, e.g. AB TaqMan™, Epoch MGB Eclipse™ or Sequenom MassARRAY™. A genotyping assay is validated when the genotypes it yields for the n = 102 DNA samples are concordant with the genotypes determined from sequencing.
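A concordance check of this kind might look like the Python sketch below, which compares per-sample calls from sequencing with calls from a second platform. The data layout, the function names and the handling of missing calls are assumptions: the text does not say how no-calls are treated, so here they are simply skipped and excluded from the comparison.

```python
def normalize(genotype):
    """Sort the two alleles so that 'GA' and 'AG' compare equal."""
    return "".join(sorted(genotype.upper()))


def compare_calls(sequencing_calls, assay_calls):
    """Compare genotype calls from sequencing and a second platform.

    Both arguments map sample ID -> two-letter genotype, or None for a
    no-call. Returns (n_compared, n_concordant, discordant_sample_ids).
    """
    compared = 0
    discordant = []
    for sample_id in sorted(sequencing_calls):
        seq = sequencing_calls[sample_id]
        assay = assay_calls.get(sample_id)
        if seq is None or assay is None:
            continue  # assumption: no-calls are skipped, not counted as discordant
        compared += 1
        if normalize(seq) != normalize(assay):
            discordant.append(sample_id)
    return compared, compared - len(discordant), discordant


# Hypothetical calls for four samples, invented for illustration.
sequencing = {"S1": "AG", "S2": "AA", "S3": "GG", "S4": None}
second_platform = {"S1": "GA", "S2": "AA", "S3": "AG", "S4": "AA"}

n_compared, n_concordant, discordant_ids = compare_calls(sequencing, second_platform)
assay_validated = n_compared > 0 and not discordant_ids
print(n_compared, n_concordant, discordant_ids, assay_validated)
```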
For each validated SNP, allele and genotype frequencies are displayed for the total SNP500Cancer population and for each SNP500Cancer subpopulation. For each analyzed SNP, a test for Hardy–Weinberg Equilibrium (HWE), χ² with one degree of freedom for two alleles (3), is performed per subpopulation. Figure shows the distribution of minor allele frequencies in the four SNP500Cancer subpopulations.
SNP500Cancer allele frequencies by subpopulation.
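The Hardy–Weinberg test mentioned above can be illustrated with a short Python sketch for a biallelic SNP. It computes the goodness-of-fit χ² statistic from observed genotype counts and obtains the P-value for one degree of freedom from the complementary error function. The function name, the example counts and the handling of monomorphic SNPs are assumptions for illustration, not the project's implementation.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg equilibrium at a biallelic SNP.

    Returns (chi2, p_value), with the p-value taken from the chi-square
    distribution with one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:
        # Assumption: a monomorphic SNP is reported as showing no deviation.
        return 0.0, 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts for one subpopulation of 24 individuals.
chi2, p_value = hwe_chi_square(10, 4, 10)
print(round(chi2, 3), p_value)
```

With subpopulations of 23 to 31 individuals, expected genotype counts can be small, so the χ² approximation is coarse for low-frequency alleles.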
All analyzed SNPs from the SNP500Cancer Database are submitted to dbSNP (4). This information includes flanking sequence, observed variation, assay primers, probes, and conditions, and frequency of the sequence variation among the SNP500Cancer total population and subpopulations.