Pitfalls of Merging GWAS Data: Lessons Learned in the eMERGE Network and Quality Control Procedures to Maintain High Data Quality
1Center for Human Genetics Research, Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN, USA
2Division of Endocrinology, Metabolism, and Molecular Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
3Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
4Cancer Prevention, Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
5Genetic Analysis Platform and Program in Medical and Population Genetics, Broad Institute, Cambridge, MA, USA
6Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
7Center for Inherited Disease Research, Johns Hopkins University, Baltimore, MD, USA
8Departments of Medicine and Genome Sciences, University of Washington, Seattle, WA, USA
9Division of Cardiovascular Diseases, Department of Medicine, Mayo Clinic, Rochester, MN, USA
10Office of Population Genomics, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
11Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, WI, USA
12National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
13Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, USA
14Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
15Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University, Nashville, TN, USA
16Center for Systems Genomics, The Huck Institutes of the Life Sciences, Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
We have expanded the single-site eMERGE QC pipeline (14) to include additional steps for merging datasets, whether for replication, higher-powered studies, or meta-analysis (see the flowchart of additional QC steps). Taking “clean” datasets and merging them is a non-trivial, time-consuming task (24,25). Various other consortia have developed important QC and quality assurance (QA) pipelines to ensure thorough cleaning of data, especially GWAS data (23,26,27). These processes should be neither neglected nor marginalized, as they help to reduce the number of false positive and false negative results. The best place to start is with a good study design. At the inception of the eMERGE network, we anticipated merging our datasets, so we included several safeguards to reduce the number of complications that could occur: the same HapMap controls were genotyped across the studies to check for concordance, the same Illumina BeadChip was used for the majority of our samples, and the majority of our samples are of the same ancestry group (European descent).
Flowchart illustrating additional QC steps when merging several datasets
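As an illustration of the cross-study concordance check on shared HapMap controls, the following is a minimal sketch rather than the eMERGE implementation. It assumes each site's calls for one control sample are available as a mapping from SNP identifier to a two-letter genotype, with "00" marking a missing call; the function name, the data layout, and the example genotypes are all hypothetical.

```python
def concordance(calls_a, calls_b, missing="00"):
    """Fraction of shared, non-missing SNPs with identical genotype calls.

    Genotypes are compared as unordered allele pairs, so "AG" equals "GA".
    The dict-of-strings layout and the "00" missing code are assumptions
    made for this sketch.
    """
    compared = 0
    matched = 0
    for snp, geno_a in calls_a.items():
        geno_b = calls_b.get(snp)
        # Skip SNPs absent from the second dataset or missing in either one.
        if geno_b is None or geno_a == missing or geno_b == missing:
            continue
        compared += 1
        if sorted(geno_a) == sorted(geno_b):
            matched += 1
    return matched / compared if compared else float("nan")


# Hypothetical calls for the same HapMap control genotyped at two sites.
site_a = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": "00"}
site_b = {"rs1": "GA", "rs2": "CC", "rs3": "CT", "rs4": "AA"}

rate = concordance(site_a, site_b)
print(f"concordance: {rate:.3f}")
```

In this toy example rs4 is skipped because one call is missing, and rs3 is the only discordant SNP among the three compared.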
We illustrate here a less complex merging of datasets than might be the case in other pooled studies, including data pooled from dbGaP. Merging genotype data from various centers across different platforms creates additional complications and may require imputation. Likewise, merging phenotype data across studies also has its complexities. Other groups are developing phenotype harmonization methods for ease of merging phenotype data (PhenX and PAGE) (28,29).
Quality control analyses had already been performed on each individual dataset, which was an important check when doing these analyses on the merged dataset. When pooling GWAS data, we have demonstrated that thorough cleaning of the individual datasets prior to merging, and then cleaning of the combined dataset once merged, are essential to obtaining good quality data. Establishing that strand orientation of alleles is consistent among datasets prior to merging is an important first step. Once merged, investigation of kinship coefficients is important to check for unintended duplicates or related pairs across datasets.
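The strand-orientation check can be sketched as a per-SNP comparison of the allele pairs reported by two datasets. This is a hedged illustration only: the function name and category labels are hypothetical, and it assumes biallelic SNPs coded with the letters A, C, G, and T. A/T and C/G SNPs are flagged separately because their alleles look identical on either strand, so allele codes alone cannot resolve their orientation.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_strand(alleles_ref, alleles_other):
    """Compare the allele pair of one SNP between two datasets.

    Returns one of:
      "ambiguous" - A/T or C/G SNP; strand cannot be told from alleles alone
      "same"      - both datasets report the same alleles
      "flip"      - alleles match after complementing (opposite strand)
      "mismatch"  - alleles disagree even after complementing
    """
    ref = frozenset(alleles_ref)
    other = frozenset(alleles_other)
    ref_complement = frozenset(COMPLEMENT[a] for a in ref)
    other_complement = frozenset(COMPLEMENT[a] for a in other)

    if ref == ref_complement:
        return "ambiguous"
    if ref == other:
        return "same"
    if ref == other_complement:
        return "flip"
    return "mismatch"


# Hypothetical allele pairs for four SNPs in two datasets.
examples = {
    "rs1": (("A", "G"), ("A", "G")),
    "rs2": (("A", "G"), ("T", "C")),
    "rs3": (("A", "T"), ("A", "T")),
    "rs4": (("A", "G"), ("A", "C")),
}
results = {snp: classify_strand(a, b) for snp, (a, b) in examples.items()}
print(results)
```

SNPs classified as "flip" would be complemented in one dataset before merging, while "ambiguous" and "mismatch" SNPs need further evidence (for example, allele frequencies or the array annotation) or exclusion.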
Following QC of the merged dataset, genetic analyses will be conducted within eMERGE-I to identify variants associated with fourteen phenotypes using subsets of individuals with each of the case and control definitions for the phenotypes available. Given the demographics, the merged dataset will likely be split into two populations for application of stratified regression analyses: European ancestry and African ancestry. QC procedures, such as Eigenstrat, batch effects, and HWE analysis, will need to be performed on the subsets used for each particular phenotype to adjust for principal components, to ensure association findings are not confounded by which studies the samples came from, and to ensure associated SNPs are not grossly out of HWE, respectively.
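To make the HWE check concrete, the sketch below computes a one-degree-of-freedom chi-square goodness-of-fit test from genotype counts. The choice of the chi-square test is an assumption made for illustration; the text does not specify which HWE test is used, and many GWAS pipelines use an exact test instead, particularly for rare alleles. The function name and the example counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for departure from Hardy-Weinberg equilibrium.

    Takes counts of the three genotypes at a biallelic SNP and returns
    (chi2, p_value). Illustrative only; an exact test may be preferable
    when expected counts are small.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts: one SNP in equilibrium, one with no heterozygotes.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)
print(chi2_ok, p_ok)
print(chi2_bad, p_bad)
```

Because allele frequencies differ between ancestry groups, a test like this would be run within each ancestry stratum and phenotype subset rather than on the pooled data.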
When extracting subsets of samples for additional phenotype association studies, it is important to examine the distribution of samples among the different studies. Pulling the majority of samples from one of these two studies (Marshfield or Vanderbilt-660W) could lead to spurious results, as observed in the batch effect analyses. In other words, if the proportions of cases and controls differ by site, an observed association could be the result of a batch effect rather than a true effect.
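One simple way to examine the case and control distribution across two sites is a chi-square test of independence on the 2x2 table of site by case status. The sketch below is one possible screen under stated assumptions, not the batch effect analysis performed in eMERGE; the function name and all sample counts are hypothetical.

```python
import math


def site_by_status_chi_square(site1, site2):
    """Chi-square test of independence (1 df) for a 2x2 site-by-status table.

    Each argument is a (n_cases, n_controls) tuple for one site.
    Returns (chi2, p_value). A small p-value indicates that the case to
    control ratio differs between the two sites.
    """
    rows = (site1, site2)
    row_totals = [sum(r) for r in rows]
    col_totals = [rows[0][j] + rows[1][j] for j in (0, 1)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (rows[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts of (cases, controls) drawn from each of two sites.
balanced = site_by_status_chi_square((200, 100), (100, 50))
skewed = site_by_status_chi_square((150, 50), (50, 150))
print("balanced:", balanced)
print("skewed:  ", skewed)
```

In the skewed example most cases come from one site and most controls from the other, so any genotyping difference between sites would be confounded with case status.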
Through the eMERGE network, we have learned a great deal about large-scale quality control to ensure data integrity. In particular, merging datasets for joint analysis introduced interesting, unanticipated subtleties. We identified some potential points of concern when merging datasets, as well as approaches to address them. Such approaches will become even more important as we increasingly explore the possibilities offered by dbGaP. With careful merging of datasets, we enable an increase in sample size, and thereby in power to detect genetic associations and improve our understanding of the genetic architecture of complex traits.