Genome-wide association studies using hundreds of thousands of single-nucleotide polymorphism (SNP) markers have become a standard approach for identifying disease susceptibility genes. The change in technology poses substantial computational and statistical challenges that have been addressed in the quality control, imputation, and population-based measure groups of the Genetic Analysis Workshop 16. The computational challenges pertain to efficient memory management and computational speed of the statistical procedures, and we discuss an approach for efficient SNP storage. Accuracy and computational speed are relevant for genotype calling, and the results from a comparison of three calling algorithms are discussed. The first statistical challenge is related to statistical quality control, and we discuss two novel quality control procedures. These low-level analyses affect subsequent preparatory steps for high-level analyses, e.g., the quality of genotype imputation approaches. After the conduct of a genome-wide association study with successful replication and/or validation, measures of diagnostic accuracy, including the area under the curve, are investigated. The area under the curve can be constructed from summary data in some situations. Finally, we discuss how the population-attributable risk of a genetic variant that is measured only in a reference data set can be determined.
With the availability of high-throughput genotyping technologies based on hundreds of thousands of single-nucleotide polymorphisms (SNPs), genome-wide association (GWA) studies have become a standard approach for unraveling the basis of complex genetic diseases. The recent technological advances have created a series of challenges for genetic epidemiologists. Before we describe these challenges and some solutions that have been proposed at the Genetic Analysis Workshop 16 (GAW16), we describe the typical flow of a GWA and subsequent studies from the perspective of a genetic epidemiologist [Fig. 1, adapted from Ziegler et al., 2008].
Genetic epidemiological research starts at the design stage with a biological question. Having decided on the most appropriate study design, samples are collected, and the DNA chip is selected. The second stage of today’s genetic epidemiological research is the laboratory stage: before chips can be hybridized, the DNA needs to be prepared. After chip hybridization, the chip is scanned. Now the statistical stages follow, and low-level analyses have to be performed before high-level analyses.
Low-level analysis starts with image analysis. In some software packages it is not apparent to the investigator that normalization of signal intensities and genotype calling are two distinct tasks that are performed in two separate steps. Based on the called genotypes and/or the signal intensities, extensive quality control is performed.
High-level statistical analysis only starts after completion of quality control. It is followed by subsequent replication and validation studies. At the end, the effect on the population is investigated. The power to identify new loci depends on the sample size, and therefore there is a need to combine data for meta-analyses across multiple studies and multiple platforms. To this end, SNPs that are not available on a specific chip are imputed, and statistical analysis, typically a meta-analysis with subsequent replication and validation, is conducted.
Finally, as in the path without imputation, population effect is investigated. Other aspects that are not depicted in Fig. 1 also play a role. For example, functional or animal studies are carried out to investigate whether the identified associations cause disease. Alternatively, family studies are used to determine whether the disease follows a particular Mendelian model.
New challenges emerge for genetic epidemiologists at different points in the conduct of a GWA and subsequent studies, and in this GAW16 group several solutions to these challenges have been proposed or evaluated. This GAW16 group analyzed only the real data, and its work consists of seven different papers.
First, important computational challenges arise. For example, the image of a typical single Affymetrix chip requires more than 60 MB of storage. With a typical sample size of 1,000 cases and 1,000 controls, approximately 120 GB of storage are needed. After image processing, the full sample still requires approximately 37 GB of storage as Affymetrix cel files. Only after normalization and genotype calling is the file size substantially reduced; it is approximately 3.5 GB for the typical case–control study with a total of 2,000 subjects. Of course, for simple input/output operations on the data, more than 1 GB of memory is still required. Therefore, specific data management and memory management tools have been developed for the analysis of GWA studies, including GenABEL [Aulchenko et al., 2007], GENOMIZER [Franke et al., 2006], GSCANDB [Taylor et al., 2007], OpenADAM [Yeung et al., 2008], PLINK [Purcell et al., 2007], and SNPLims [Orro et al., 2008].
The fundamental idea used in some of these specific programs is to use two bits for one SNP genotype [Aulchenko et al., 2007; Purcell, 2008] for high data compression. Specifically, a diallelic SNP-based genotype has four possible choices: 0 (AA), 1 (AB), 2 (BB), or 3 (missing), leading to 2 bits per SNP. The theoretical compression ratio therefore is 4:1 compared with a byte storage scheme (one byte for each genotype) minus some overhead.
Chen et al. investigated the performance of their own memory management tool, which uses the four-SNPs-per-byte storage approach, and compared it with the standard one-SNP-per-byte storage approach. They applied their tool to the data from the North American Rheumatoid Arthritis Consortium (NARAC), which included 2,062 subjects and 550,000 SNPs from the Illumina Infinium HumanHap550 SNP chip [Amos et al., 2009]. In their analysis using the simple allelic χ2 test for association, Chen et al. observed a heap memory usage of 305 MB for the compressed data storage and more than 1 GB (1,074 MB) for the uncompressed data storage. The differences between the two approaches in central processing unit (CPU) time were not pronounced for the simple allelic test. However, when haplotype blocks were to be identified, a huge discrepancy was found, with a CPU time of ~11 sec for compressed data storage but 169 sec for uncompressed data storage.
In conclusion, SNP data should be stored with two bits per SNP. This saves both CPU time and storage.
Computational speed and memory management as well as accuracy play an important role in the genotype calling stage of a study. In the last few years, many different genotype calling algorithms have been proposed for both the Affymetrix and the Illumina platforms. In their contribution, Vens et al. compared the three genotype calling algorithms BRLMM [Affymetrix, 2007], Chiamo [The Wellcome Trust Case Control Consortium, 2007], and JAPL [Plagnol et al., 2007] using Affymetrix GeneChip Human Mapping 500k Array Set data from the Framingham Heart Study (FHS) as provided for GAW16 [Cupples et al., 2009]. An important aspect of the study is that Vens et al. were not able to normalize all subjects in one run because of a memory access error when more than approximately 2,000 subjects were used in CelQuantileNorm, the normalization procedure recommended for JAPL and Chiamo. By investigating the concordance between the genotype calling algorithms, Vens et al. were able to identify previously undetected errors in strand coding. The highest number of samples with a call fraction <0.97 was observed for BRLMM, followed by Chiamo. No subject had a call fraction <0.97 when JAPL was used. Therefore, the authors conclude that JAPL would be the algorithm of choice if as many samples as possible should be retained for further analysis. This finding is in line with the conclusions of Plagnol et al., who stated that their genotype calling algorithm was specifically designed to deal with uncertain genotypes that are said to be missing by other approaches. Vens et al. also found that the highest number of SNPs was kept by Chiamo, so that this genotype calling algorithm would be the method of choice if investigators aim at keeping a high number of SNPs for further analyses after standard quality control.
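The call fraction used as a filter here is simply the share of non-missing genotype calls per sample; a minimal sketch, reusing the 0-3 genotype coding with 3 = missing:

```python
def call_fraction(genotypes, missing=3):
    """Per-sample call fraction: the share of SNPs with a non-missing
    genotype call. Samples below a threshold such as 0.97 are typically
    excluded from further analysis."""
    called = sum(1 for g in genotypes if g != missing)
    return called / len(genotypes)

assert call_fraction([0, 1, 2, 3]) == 0.75
```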
When SNPs from a GWA study are represented in a de Finetti triangle [Ziegler and König, 2006], most of the SNPs group around the Hardy-Weinberg equilibrium (HWE) curve [Goddard et al., 2009]. BRLMM and JAPL showed excess heterozygosity, i.e., more heterozygous subjects than expected under HWE, for a larger number of SNPs than Chiamo. In contrast, Chiamo more often revealed a deficiency of heterozygotes than BRLMM and JAPL.
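Excess or deficiency of heterozygotes can be checked by comparing the observed heterozygote count with its expectation 2p(1 − p)n under Hardy-Weinberg equilibrium; a minimal sketch:

```python
def hwe_heterozygosity(n_aa, n_ab, n_bb):
    """Compare the observed heterozygote count with its expectation
    under Hardy-Weinberg equilibrium.
    Returns (observed_het, expected_het, chi2 goodness-of-fit statistic)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    exp = {"AA": n * p * p, "AB": n * 2 * p * (1 - p), "BB": n * (1 - p) ** 2}
    obs = {"AA": n_aa, "AB": n_ab, "BB": n_bb}
    chi2 = sum((obs[g] - exp[g]) ** 2 / exp[g] for g in obs)
    return obs["AB"], exp["AB"], chi2

# A SNP with more heterozygotes than 2p(1-p)n shows excess heterozygosity:
obs_het, exp_het, chi2 = hwe_heterozygosity(200, 600, 200)
assert obs_het > exp_het  # 600 observed vs. 500 expected under HWE
```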
In summary, JAPL would be the algorithm of choice if as many samples as possible should be retained for further analysis. Chiamo would be the method of choice if investigators aim to keep a high number of SNPs for further analyses after standard quality control.
After genotype calling, standard quality control is performed on the subject level as well as on the SNP level. Standard filters on the subject level include, for example, the per-subject call fraction, autosomal heterozygosity, checks for duplicates and cryptic relatedness, and the comparison of reported sex with genotype-derived sex.
Standard filters on the SNP level include, for example, the per-SNP call fraction, the minor allele frequency, and deviation from Hardy-Weinberg equilibrium.
These global filters are effective in removing SNPs with clustering problems. They reduce a large number of highly significant erroneous associations and lower the genomic control lambda so that quantile-quantile plots do not show more outliers than expected under the null [Ling et al., 2009]. However, these filters are not able to identify all SNPs of bad quality. Therefore, Ling et al. have introduced sex-specific filters that should be added to the standard quality control procedures. The first three are for X-chromosomal markers (X), and the last four for autosomal markers:
The last test, which is carried out in the control group only, is especially meaningful because sex-based confounding is likely to cause some small differences in allele and genotype frequencies.
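A simple version of such a sex-difference check, a 2x2 chi-square statistic comparing allele counts between male and female controls, can be sketched as follows; the exact filters of Ling et al. may differ in detail.

```python
def sex_allele_chi2(male_counts, female_counts):
    """Pearson chi-square for a 2x2 table of allele counts (A, B) in
    male vs. female controls. A large value flags the SNP for a
    sex-based frequency difference. Illustrative sketch only."""
    table = [male_counts, female_counts]
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Identical allele counts in both sexes give a statistic of zero:
assert sex_allele_chi2([50, 50], [50, 50]) == 0.0
```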
The traditional standard quality control filters, termed Travemünde Criteria, are summarized together with the additional standard quality control filters in Table I. The name Travemünde Criteria derives from a consortium meeting held in Travemünde in 2007, at which these criteria were used for the work of Samani et al. and subsequent papers.
Although the standard quality control approaches and the novel filters are helpful in identifying SNPs of low quality, the visual inspection of signal intensity plots is still the ultimate quality control approach when an association has been identified [Ziegler et al., 2008]. For example, Affymetrix states in its “Best Practices” for the analysis of data from GWA studies “Visually analyze all candidate SNPs” [Affymetrix, 2008, p. 257]. The recommendation to inspect only candidate SNPs is probably a consequence of the fact that systematic visual inspection of all cluster plots is impossible in a high-throughput setting because of the high workload. For example, we currently require approximately 2 h for the independent visual inspection of 100 cluster plots by two experienced readers, and readers are fatigued after a short period.
Nevertheless, the inspection of all cluster plots, i.e., on the genome-wide level, is of interest. For example, for genotype imputation, which often is the basis for meta-analyses of GWA studies, only SNPs of high quality should be used [de Bakker et al., 2008]. Furthermore, when machine learning approaches or genome-wide haplotype analyses are used for GWA data [Trégouët et al., 2009; Ziegler et al., 2007], all SNPs should be quality assured.
Therefore, approaches would be helpful that allow the automated inspection of cluster plots, and this task is comparable to measuring the internal validity of the clustering in cluster analysis [Halkidi et al., 2002a; Halkidi et al., 2002b; Handl et al., 2005]. Intuitively, the genotype calling performs well for a specific SNP if neighboring points in a signal intensity plot that are similar are assigned the same genotype and points that are dissimilar are assigned to different genotypes. Furthermore, a good SNP will have small distances within a genotype group and large distances between different genotypes.
The usefulness of these approaches for large sample sizes and GWA studies has not been studied in detail. Therefore, the proposal of Schillert et al. can be considered a first step in this direction. They introduced an automated cluster plot analysis (ACPA) approach, and their method falls into the group of connectedness measures. In the method, the Mahalanobis distance is considered from the center of a cluster to all samples within the cluster. Next, a cluster boundary is defined by expanding the ellipse of the cluster by a factor that depends on the interquartile range. Finally, the number of samples from the other clusters falling within the boundary of the cluster under consideration is calculated. If the number of subjects falling within the boundary of a different cluster exceeds a certain limit, the SNP is said to have unreliable clustering. Schillert et al. assessed the performance of ACPA against the decisions made by two independent readers based on the BRLMM calls for 1,000 randomly selected SNPs from the FHS. Sensitivity – the correct detection of low-quality SNPs – was 88%, and specificity – the correct detection of high-quality SNPs – was 86%. By varying the width of the boundary, Schillert et al. were able to increase the specificity to 99% at a sensitivity of 50%.
In summary, standard quality control, including the novel filters proposed by Ling et al., is an absolute requirement before high-level analyses. The automated evaluation of cluster plots should be further improved.
There is a growing need to work with complete genotypic data, e.g., for machine learning approaches, and to combine genotype data across multiple studies that have been obtained from different platforms. The analysis of missing data has a long tradition in statistics, and it is important to be aware of the different missing data mechanisms and potential pitfalls for the statistical analysis [D’Agostino, 2007; Gail, 1991; Laird, 1988]. While traditional statistical approaches for dealing with missing data use data from the study of interest only, several approaches have been proposed in the context of GWA studies recently that make use of external data sources [Li and Abecasis, 2006; Marchini et al., 2007; Nicolae, 2006; Servin and Stephens, 2007]. A disadvantage of the available publications is that the statistical assumptions underlying the employed methods are rarely formulated. Several studies were performed at GAW16 that compared the performance of several genotype imputation packages in terms of accuracy, speed, and user-friendliness [for a review see Thomas, 2009].
When a series of disease-associated SNPs has been identified, replicated, and possibly validated [for a detailed discussion of the terminology, see Igl et al., 2009], standard measures of diagnostic accuracy for a quantitative diagnostic test are investigated. These include the area under the curve (AUC), which can be constructed even if only summary data are available [Lu and Elston, 2008]. Jeffries and Zheng compared the Lu-Elston approach with the standard logistic regression method when individual-level data are available. They observed that the Lu-Elston method is valuable when only summary statistics can be used. However, conventional logistic regression is preferable when full data sets are available, because it allows model selection using standard likelihood theory. Furthermore, to provide useful information without a complete data set, the Lu-Elston method is subject to two constraints. First, to be included in the model, continuous covariates are converted to factors with a few levels. Second, unless considerable information regarding pairwise LD is available, the SNPs are modeled as independent. This means that multilocus genotype probabilities are obtained as the product of single-SNP genotype probabilities.
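With individual-level data, the AUC reduces to the Mann-Whitney probability that a randomly chosen case has a higher risk score than a randomly chosen control; a minimal sketch:

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a random case scores higher than a random control,
    counting ties as 1/2."""
    wins = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in case_scores
        for y in control_scores
    )
    return wins / (len(case_scores) * len(control_scores))

# A score that separates cases from controls perfectly gives AUC = 1.0:
assert auc([4, 5, 6], [1, 2, 3]) == 1.0
```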
A different scenario for population-based measures has been considered by Hadley and Strachan. They showed that the population attributable risk (PAR, often called attributable risk for short), i.e., the proportion of cases attributable to a variant, at an untyped functionally relevant SNP can be estimated from the allele frequency p and the allelic relative risk RR at an observed SNP as follows. In a first step, a parameter δ_obs is estimated at the typed (observed) SNP as δ_obs = p(RR − 1)/[1 + p(RR − 1)]. In the second step, δ_true at the functionally relevant position is estimated via δ_true = δ_obs/D′, where D′ is the usual normalized Lewontin's measure of linkage disequilibrium (LD). When the functionally relevant SNP is not typed in a specific study, the D′ estimated from an external data source is used. Finally, the PAR at the untyped SNP is obtained as PAR = δ_true(2 − δ_true).
When a set of genotyped SNPs k = 1,…,K is available that are in LD with the functionally relevant variant, Hadley and Strachan proposed to calculate δ_true as a weighted average across the typed SNPs, with weights inversely proportional to the variance. Using the delta method, they derived the approximate variance of δ_true and showed that inverse-variance weights proportional to the LD measure r²_k are appropriate, where r²_k is the coefficient of determination between the kth typed SNP and the unmeasured functionally relevant variant.
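A sketch of the resulting estimator, assuming the first-step formula δ_obs = p(RR − 1)/[1 + p(RR − 1)] and per-SNP D′ and r² values taken from a reference panel:

```python
def delta_obs(p, rr):
    """Per-allele attributable fraction at a typed SNP from allele
    frequency p and allelic relative risk rr (assumed first-step
    formula of the two-step procedure)."""
    return p * (rr - 1.0) / (1.0 + p * (rr - 1.0))

def par_untyped(typed):
    """PAR at an untyped variant. typed is a list of tuples
    (p, rr, d_prime, r2), one per genotyped SNP, with D' and r^2 to the
    unmeasured variant from a reference panel. delta_true is the
    r^2-weighted average of delta_obs / D'; PAR = delta_true * (2 - delta_true)."""
    num = sum(r2 * delta_obs(p, rr) / dp for p, rr, dp, r2 in typed)
    den = sum(r2 for _p, _rr, _dp, r2 in typed)
    d_true = num / den
    return d_true * (2.0 - d_true)

# Single typed SNP in perfect LD (D' = 1): delta_true = delta_obs.
# p = 0.2, RR = 2 gives delta = 1/6 and PAR = (1/6)(11/6) = 11/36.
assert abs(par_untyped([(0.2, 2.0, 1.0, 1.0)]) - 11.0 / 36.0) < 1e-9
```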
In summary, conventional logistic regression is preferable to the Lu-Elston approach for constructing an AUC when full data sets are available. Using a simple transformation, the PAR can be estimated at an untyped SNP from genotyped SNPs when information about the LD is available.
The author is grateful to all participants of GAW16 Group 8. The group discussions would not have been a success without the participating senior colleagues Joan E. Bailey-Wilson, Heather Cordell, Charles C. Gu, and Yan Sun. The author is also grateful to the authors of the seven BMC Proceedings papers summarized in this work: Xiang Chen, David Hadley, Neal Jeffries, Hua Ling, Arne Schillert, Daniel F. Schwarz, and Maren Vens. This work was supported by the German Ministry of Education and Science, grant 01 EZ 0874, and the German Research Foundation, grant ZI 591/17-1. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.