The completion of the International HapMap Project (1) and the development of advanced genotyping technologies have made genome-wide association studies (GWAS) possible. These studies typically genotype more than 1000 cases and 1000 controls for 300 K to 1 million SNPs. A number of GWAS have been published with many more in progress (2–4). A number of disease-associated SNPs have been identified and confirmed by these breakthrough studies with many more yet to come. Repeating GWAS in additional individuals has helped to find more disease-associated SNPs, although doing so is costly. Interestingly, the SNPs identified and subsequently confirmed in large replication samples are not always those with the smallest P-value in the GWAS, and two GWAS may have radically different P-values assigned to a confirmed SNP. For example, in prostate cancer a confirmed SNP in MSMB from the initial GWAS had a P-value of only 0.042, but the P-value was 7.31 × 10⁻¹³ in a follow-up study (4). Thus the list of potential SNPs from any GWAS remains large. This large SNP list poses a problem for validation studies where a very large number of people are genotyped because custom arrays can cost more than standard GWAS arrays.
For many diseases there exists a rich, diverse and growing literature that can be used to identify genes and chromosomal regions of high interest. This literature includes existing genetic studies of linkage and candidate genes as well as research on disease pathogenesis. For example, information about disrupted cell signaling pathways and genomic-level expression data from comparisons of tumor and normal tissues have identified interesting candidate genes for cancer. Thus investigators may have a large but finite set of genes and genomic regions that they feel deserve particular scrutiny or they may have a special interest in certain genes or chromosomal regions.
Agnostic GWAS data provide a unique opportunity for hypothesis-driven candidate gene exploration, but the sheer size and complexity of GWAS data can be difficult to manage. Although it may not be difficult to find which SNPs of a gene are directly included in a GWAS panel, it is harder to determine which additional SNPs are tagged by the panel, particularly when examining multiple ethnic groups where linkage disequilibrium (LD) structure and allele frequencies differ. There is a growing array of tools for gene annotation (e.g. identifying regulatory elements, alternative splicing, miRNA-binding sites), but many researchers may find it difficult to gather and employ these algorithms. Finally, while such tools predict putative functional regions for the Reference Sequence, they do not necessarily consider whether the alternative alleles of SNPs in that sequence are likely to have different consequences.
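To make the tagging problem concrete, the following is a minimal sketch of how one might decide whether an ungenotyped SNP is tagged by a panel SNP in a given population, using the standard pairwise r² statistic computed from phased haplotypes. The function names, the toy haplotypes and the r² cutoff of 0.8 are illustrative assumptions and are not taken from the application described here.

```python
from typing import Dict, Sequence


def r_squared(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Pairwise LD (r^2) between two biallelic SNPs.

    Each argument is a vector of alleles (0 or 1) on the same phased
    haplotypes, in the same order.
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                            # a monomorphic SNP cannot be tagged
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / denom


def is_tagged(snp: str, panel: Sequence[str],
              haplotypes: Dict[str, Sequence[int]],
              r2_cut: float = 0.8) -> bool:
    """True if any panel SNP reaches the (assumed) r^2 cutoff with `snp`."""
    return any(r_squared(haplotypes[snp], haplotypes[tag]) >= r2_cut
               for tag in panel)


# Toy haplotypes for two hypothetical populations with different LD structure.
pop1 = {"rs_panel": [1, 1, 0, 0, 1, 0, 1, 0],
        "rs_target": [1, 1, 0, 0, 1, 0, 1, 0]}
pop2 = {"rs_panel": [1, 1, 0, 0, 1, 0, 1, 0],
        "rs_target": [1, 0, 1, 0, 1, 0, 0, 1]}

print(is_tagged("rs_target", ["rs_panel"], pop1))
print(is_tagged("rs_target", ["rs_panel"], pop2))
```

In this toy example the same panel SNP tags the target SNP in the first population but not in the second, which is why LD has to be evaluated separately for each ethnic group.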
Here we describe a comprehensive web server designed to select SNPs for genetic association studies. The application provides three pipelines for SNP selection, with options to combine all three. The candidate gene pipeline uses both a user-provided list of candidate genes and disease-specific GWAS data [readily available from dbGaP (www.ncbi.nlm.nih.gov/sites/entrez?db=gap) and elsewhere] to select SNPs that are predicted to have functional consequences and that are in high LD with a small P-value GWAS SNP. For genes where a large proportion of the SNPs were not in LD with any GWAS SNP and thus are uninvestigated in the GWAS, the web application can pick LD tag SNPs to evaluate the untagged SNPs. The second, genomic pipeline selects SNPs with likely functional consequences from SNPs with small P-values in a GWAS and from SNPs in high LD with such SNPs. The third, linkage pipeline uses a user-provided list of linkage regions to select small P-value GWAS SNPs for each linkage region. The web application has information on all SNPs in HapMap and dbSNP and automatically constructs ethnic-specific LD relationships from both sources provided that the SNPs have population data available. In this way, SNPs that were not genotyped in a GWAS, but are in LD with a SNP that was genotyped, can be screened appropriately and GWAS data generated in one ethnic group can be used to pick SNPs in one or more other ethnic groups. We illustrate this application using prostate cancer as an example in which we start with a set of a priori candidate genes, prostate cancer GWAS data, and a set of linkage regions, and use the pipelines to select a small panel of 1361 SNPs. We evaluate the utility of the application against the results of a follow-up validation study that screened a much larger panel of ~27 000 SNPs genotyped in ~8000 cases and controls and find that we included five of the seven SNPs found to be associated with prostate cancer.
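The selection logic of the three pipelines can be sketched roughly as follows. This is a simplified illustration, not the web server's implementation: the `Snp` record, the function names, the P-value and r² cutoffs, and the toy data are all hypothetical, and the step that picks LD tag SNPs for untagged SNPs is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    gene: Optional[str] = None
    functional: bool = False         # predicted functional consequence
    gwas_p: Optional[float] = None   # None if not genotyped in the GWAS


def gwas_hits(snps, p_cut):
    """rsIDs of genotyped SNPs whose GWAS P-value is at or below p_cut."""
    return {s.rsid for s in snps if s.gwas_p is not None and s.gwas_p <= p_cut}


def linked_to_hit(snp, hits, ld, r2_cut):
    """True if the SNP is itself a GWAS hit or is in high LD with one."""
    if snp.rsid in hits:
        return True
    return any(ld.get(frozenset((snp.rsid, h)), 0.0) >= r2_cut for h in hits)


def candidate_gene_pipeline(snps, genes, ld, p_cut=0.05, r2_cut=0.8):
    """Functional SNPs in candidate genes that are linked to a GWAS hit."""
    hits = gwas_hits(snps, p_cut)
    return {s.rsid for s in snps
            if s.gene in genes and s.functional
            and linked_to_hit(s, hits, ld, r2_cut)}


def genomic_pipeline(snps, ld, p_cut=1e-4, r2_cut=0.8):
    """Functional SNPs anywhere in the genome that are linked to a GWAS hit."""
    hits = gwas_hits(snps, p_cut)
    return {s.rsid for s in snps
            if s.functional and linked_to_hit(s, hits, ld, r2_cut)}


def linkage_pipeline(snps, regions, p_cut=0.05):
    """GWAS hits located inside user-provided (chrom, start, end) regions."""
    return {s.rsid for s in snps
            if s.gwas_p is not None and s.gwas_p <= p_cut
            and any(c == s.chrom and lo <= s.pos <= hi for c, lo, hi in regions)}


# Toy inputs (all identifiers and values are made up for illustration).
snps = [
    Snp("rsA", "10", 100, gene="GENE1", functional=False, gwas_p=0.001),
    Snp("rsB", "10", 150, gene="GENE1", functional=True),   # not genotyped
    Snp("rsC", "10", 900, gene="GENE2", functional=True, gwas_p=0.6),
    Snp("rsD", "8", 500, functional=True, gwas_p=1e-6),
    Snp("rsE", "17", 300, functional=False, gwas_p=0.01),
]
ld = {frozenset(("rsA", "rsB")): 0.95}   # population-specific pairwise r^2

from_genes = candidate_gene_pipeline(snps, {"GENE1", "GENE2"}, ld)
from_genome = genomic_pipeline(snps, ld)
from_linkage = linkage_pipeline(snps, [("17", 0, 1000)])
panel = from_genes | from_genome | from_linkage   # combine all three pipelines
print(sorted(panel))
```

Here the ungenotyped SNP rsB enters the panel only through its LD with the genotyped SNP rsA; supplying an LD map built from a different population would change which ungenotyped SNPs qualify.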