Table 1 shows the number of supplementary SNPs needed to tag all common variants in various populations for our primary set of 910 genes that are biologically relevant to addiction. This set includes 486 genes that were derived mainly through an expert nomination process, and 424 additional genes that correspond to roughly the top 5% of genes identified using mouse systems genetics (Chesler and colleagues, submitted). Together, these 910 genes formed our primary set for the analysis of microarray coverage (see supporting file S1 for the complete list of these genes). We assessed the SNP coverage of these genes by determining whether common SNPs (MAF≥5%) were tagged through LD by SNPs on a particular microarray. In Table 1, for each array and each population, we report the number of common SNPs in these genes that fail to satisfy
r²≥0.8 with a SNP from the array; that is, the number of SNPs not tagged by the array. For example, we found that 57% of the common SNPs in these genes were not tagged by the Affymetrix 5.0 SNP microarray in the African population. In other words, due to the haplotype/LD structure in the African population, 57% of the common genetic variation in these regions fails to be captured by this microarray.
Table S1 gives a broader view of how microarray coverage depends on biology, and shows that the Illumina coverage tends to improve with the prioritization score, while the Affymetrix coverage is uniform.
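The tagging criterion described above can be illustrated with a minimal sketch. It assumes a hypothetical lookup table `r2_table` of pairwise r² values within one population, and it treats a SNP that is itself on the array as tagged; the function name, data layout, and toy values are illustrative assumptions and are not taken from our analysis pipeline.

```python
from typing import Dict, Iterable, Set


def untagged_snps(
    common_snps: Iterable[str],
    array_snps: Iterable[str],
    r2: Dict[str, Dict[str, float]],
    threshold: float = 0.8,
) -> Set[str]:
    """Return the common SNPs that no array SNP tags at r2 >= threshold.

    r2[a][b] is a hypothetical lookup of pairwise r2 between SNPs a and b
    within one population; missing pairs are treated as r2 = 0.
    """
    on_array = set(array_snps)
    untagged = set()
    for snp in common_snps:
        if snp in on_array:
            continue  # a SNP on the array is in perfect LD with itself
        partners = r2.get(snp, {})
        if not any(p in on_array and v >= threshold for p, v in partners.items()):
            untagged.add(snp)
    return untagged


# Toy data (made up for illustration; not from HapMap or any real array).
common = ["rs1", "rs2", "rs3", "rs4"]
array = ["rs1", "rs9"]
r2_table = {"rs2": {"rs9": 0.85}, "rs3": {"rs9": 0.50, "rs1": 0.79}, "rs4": {}}

missing = untagged_snps(common, array, r2_table)
print(sorted(missing), len(missing) / len(common))
```

In this toy example rs3 and rs4 are not tagged, so half of the common SNPs would need to be covered by supplementary SNPs. The per-array, per-population counts we report correspond to the size of the returned set.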
| Table 1. The number of SNPs required to supplement commercial microarrays in order to comprehensively cover our primary set of 910 genes that are biologically relevant to addiction. |
These results suggest that a significant amount of common genetic variation in these addiction related genes is unaccounted for by these commercial SNP microarrays. The deficiency is particularly high in the African sample. This is likely due to the lower LD in this older population, which means more SNPs are required for tagging. While the Illumina 1M clearly provides the best coverage, we would still need to add 23,441 SNPs to tag all common SNPs in the HapMap African sample for these 910 genes. The best-case scenario is when the Illumina 1M is used for European-Americans. But even in this case, there are still 5,117 SNPs that are not well represented.
Table 2 shows some examples of coverage by the Illumina 610 Quad microarray for ten genes that are of particular interest. We chose this array because it offers a median level of coverage among the seven arrays we studied. These genes were among the most highly prioritized by the addiction researchers with whom we consulted. The selection process involved a number of criteria, including pharmacogenetic pathways, gene expression data, and mouse models. For example, CDH13 (Cadherin 13) is known to be expressed in neurons in the human adult cerebral cortex, midbrain, thalamus and medulla regions [5]. Because it is also known to inhibit neurite extension [6] and activate a number of signaling pathways [7]–[10], it is a strong candidate for the genetic study of addiction phenotypes [11]. CDH13 contains 2,414 SNPs that are common in the African population, and only 50% of these are tagged by the Illumina 610 Quad microarray.
Figure S1 shows the complete distribution of individual gene coverage percentages using our primary set of 910 genes for the Illumina 610 Quad microarray in each population.
| Table 2. The number of SNPs required to supplement the Illumina 610 Quad microarray for genes of particularly strong interest. |
We have designed a SNP database (available at http://zork.wustl.edu/nida/neurosnp.html) to systematically determine how to supplement these commercial microarrays for addiction. Our database includes a SNP prioritization score based on the genomic information network (GIN) method introduced by S. Saccone and colleagues [4]. This method was originally designed to systematically incorporate a priori biological hypotheses into the prioritization of SNPs after a genome-wide association study. The method begins with a set of SNPs that are ranked by their association p-values, and then increases the rank of a SNP when it is determined to be biologically relevant to the phenotype according to an a priori set of conditions, such as being in a biologically relevant gene, and additionally, perhaps, being a missense mutation. The score is a measure of biological relevance to addiction, and can be used independently of association p-values to prioritize which SNPs are selected to supplement commercial microarrays. The score incorporates SNP/gene functional properties (such as coding and promoter regions), human/mouse evolutionary conservation, and a quantitative trait locus (QTL) mapping method that utilizes mouse models to identify genes associated with addiction phenotypes (Chesler and colleagues, submitted). Figure S2 shows the distribution of prioritization scores for our genome-wide SNP database, and Figure S3 shows the GIN network model we used to model addiction, which was adapted from the nicotine dependence model used by Saccone and colleagues [4].
In addition to our primary set of 910 genes, the mouse systems genetics method (Chesler and colleagues, submitted) identified 7,842 additional genes with potential biological relevance to addiction through mouse QTL and gene expression correlation analysis, and the GIN prioritization scores reflect this quantitative assessment of biological relevance. Genes with a large number of mouse associations are prioritized more highly, and those with a relatively low number receive little increase in the prioritization score relative to arbitrary genes (see the methods section for details). These additional data provide a broader measure of biological relevance to addiction, which may be useful for prioritizing SNPs for further study after a GWAS [4] or for fine mapping a region of genetic linkage. This method has the effect of combining information from the expert nomination process and the mouse systems genetics data. SNPs in the 486 expert-nominated genes, the determination of which did not involve the mouse data, receive a uniform increase in priority. If there is additional evidence from the mouse data of relevance of the gene to addiction, the priority is increased further depending on the extent of the evidence, which is measured by the number of mouse phenotypes that link to the gene. Table S2 shows the distribution of phenotypes that map to mouse genes, both for the entire set of mouse genes considered and for the top 424 genes that were used for our primary analysis of SNP microarrays (these were mapped to human genes via NCBI Homologene). More detailed information on this latter set of genes can be found in supporting file S1, which is discussed in more detail below. Complete details on the data and experiments for this mouse systems genetics project are described in Chesler and colleagues (submitted).
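The way the two sources of evidence combine can be pictured with a toy scoring function. This is only a sketch of the qualitative behavior described above (a uniform increase for expert nomination, plus a further increase that grows with the number of linked mouse phenotypes); it is not the GIN scoring formula, and the function name, the weights `expert_bonus` and `per_phenotype`, and the cap `mouse_cap` are hypothetical values chosen for illustration.

```python
def gene_priority(
    expert_nominated: bool,
    n_mouse_phenotypes: int,
    expert_bonus: float = 1.0,    # hypothetical uniform increase
    per_phenotype: float = 0.1,   # hypothetical increase per mouse phenotype
    mouse_cap: float = 1.0,       # hypothetical ceiling on the mouse component
) -> float:
    """Toy gene-level priority combining expert nomination and mouse evidence."""
    score = expert_bonus if expert_nominated else 0.0
    # Mouse evidence adds to the score in proportion to the number of linked
    # phenotypes, so genes with few links gain little over arbitrary genes.
    score += min(mouse_cap, per_phenotype * max(0, n_mouse_phenotypes))
    return score


# Hypothetical genes: (expert nominated?, number of linked mouse phenotypes)
examples = {"geneA": (True, 0), "geneB": (True, 8), "geneC": (False, 1)}
ranked = sorted(examples, key=lambda g: gene_priority(*examples[g]), reverse=True)
print(ranked)
```

Under these assumed weights, an expert-nominated gene with supporting mouse evidence ranks above one without it, and a gene with a single mouse association ranks only slightly above an arbitrary gene.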
In order to determine the coverage of regions inferred to be undergoing recent adaptive selection [12], [13], all SNPs detected by the LD decay (LDD) test in the Perlegen and HapMap datasets were compared to the Illumina 1M and Affymetrix 6.0 SNPs. Uncovering evidence for recent selection is an additional approach to defining functional human genomic variation. The LDD test identifies alleles undergoing selection by searching for an expected increase with distance in the fraction of inferred recombinant chromosomes surrounding a selected variant. This method is insensitive to local recombination rate because it relies on LD differences between the two alleles at a site, while the local rate influences the extent of LD surrounding both alleles. While over 99.9% of the selected regions defined by the LDD test fall within +/−10 kb of a SNP present on these microarrays, there are some important exceptions. For example, the extensive LD surrounding the selected DRD4 7R allele [14] is not captured by these arrays, which contain very few SNPs in the region (only 1 in 100 kb). In general, however, the extensive long-range LD exhibited by these recently selected alleles (up to 1 Mb) and the current density of microarray SNPs indicate that most of these evolutionarily important alleles can be “tagged” by an adjacent SNP surrogate.
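The proximity check behind the +/−10 kb figure can be sketched as follows. The sketch assumes that each selected region is given as a start and end coordinate on a single chromosome and that array SNP positions for that chromosome are sorted; the function name and the toy coordinates are hypothetical and do not come from the LDD results.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def region_near_snp(
    start: int, end: int, sorted_positions: Sequence[int], window: int = 10_000
) -> bool:
    """True if any array SNP lies within [start - window, end + window]."""
    # Find the first array SNP at or beyond the lower bound of the window,
    # then check that it does not fall past the upper bound.
    i = bisect_left(sorted_positions, start - window)
    return i < len(sorted_positions) and sorted_positions[i] <= end + window


# Toy coordinates on one chromosome (made up for illustration).
array_positions: List[int] = [5_000, 120_000, 400_000]
regions: List[Tuple[int, int]] = [
    (10_000, 20_000),
    (200_000, 210_000),
    (395_000, 399_000),
]

flags = [region_near_snp(s, e, array_positions) for s, e in regions]
print(flags, sum(flags) / len(flags))
```

In this toy example the second region has no array SNP within 10 kb, which is the situation described for the sparsely covered DRD4 region.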
The combined set of 910 genes used for our analysis of SNP microarrays is available as a spreadsheet in supporting file S1. The spreadsheet contains detailed annotation, including the logical category used by the NeuroSNP project, such as “Nicotine System” and “Dopamine System” (further documentation of these categories and other columns is contained in the spreadsheet – see the sheet labeled “Column Descriptions”). Other columns include the Entrez Gene ID and gene symbol, the full name of the gene as well as all known symbol aliases and alternative descriptions, build 36.2 physical mapping data and mouse homologs. Some columns contain links to external databases, such as GenoPedia (http://www.hugenavigator.net/HuGENavigator/startPagePedia.do), which contains a list of all human diseases that have been linked to the gene, including links to publications. The spreadsheet also contains links to the Knowledgebase of Addiction Related Genes (KARG, http://karg.cbi.pku.edu.cn) [15], and to GeneNetwork (http://genenetwork.org) for additional information on mouse systems genetics data. We have also created a web site (http://zork.wustl.edu/nida/neurosnp.html) that contains a searchable database of this set of genes, as well as downloadable files for the gene and SNP databases. These resources will allow investigators both to gather new biologically relevant targets for genetic association studies of addiction and to discover new information on well-known targets, such as the extent of tagged coverage in various populations by commercial SNP microarrays.
Our complete SNP database is available for download from our web site at http://zork.wustl.edu/nida/neurosnp.html, and the top 5,000 SNPs ranked by GIN prioritization score [4] are provided in a spreadsheet as supporting file S2. The entire database includes all SNPs from dbSNP build 128, and is annotated with allele frequency data from the four HapMap samples; there is no restriction on the allele frequency in the database. There are also flags indicating whether a SNP is on a particular custom microarray specifically designed by Hodgkinson and colleagues to target alcoholism and other addiction related phenotypes [16], or was part of an addiction study by Nielsen and colleagues [17].