The ideal genetic analysis of family data would include whole genome sequence on all family members. A strategy of combining sequence data from a subset of key individuals with inexpensive, genome-wide association study (GWAS) chip genotypes on all individuals to infer sequence level genotypes throughout the families has been suggested as a highly accurate alternative. This strategy was followed by the Genetic Analysis Workshop 18 data providers. We examined the quality of the imputation to identify potential consequences of this strategy by comparing discrepancies between GWAS genotype calls and imputed calls for the same variants. Overall, the inference and imputation process worked very well. However, we find that discrepancies occurred at an increased rate when imputation was used to infer missing data in sequenced individuals. Although this may be an artifact of this particular instantiation of these analytic methods, there may be general genetic or algorithmic reasons to avoid trying to fill in missing sequence data. This is especially true given the risk of false positives and reduction in power for family-based transmission tests when founders are incorrectly imputed as heterozygotes. Finally, we note a higher rate of discrepancies when unsequenced individuals are inferred using sequenced individuals from other pedigrees drawn from the same admixed population.
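The comparison described above, tallying discrepancies between GWAS chip calls and imputed calls for the same variants, can be sketched as follows. This is a minimal illustration under stated assumptions, not the workshop pipeline: the function name `discrepancy_rates`, the 0/1/2 genotype coding, the group labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def discrepancy_rates(gwas_calls, imputed_calls, group_of):
    """Return {group: (n_discrepant, n_compared, rate)}.

    gwas_calls / imputed_calls: dicts keyed by (individual, variant) holding
    genotypes coded as alternate-allele counts 0/1/2, or None if missing.
    group_of: dict mapping each individual to a label such as "sequenced".
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [discrepant, compared]
    for key, chip in gwas_calls.items():
        imputed = imputed_calls.get(key)
        if chip is None or imputed is None:
            continue  # compare only genotypes called by both sources
        tally = counts[group_of[key[0]]]
        tally[1] += 1
        if chip != imputed:
            tally[0] += 1
    return {g: (d, n, d / n) for g, (d, n) in counts.items()}


# Made-up toy data: two individuals, three variants each.
gwas = {("A", "v1"): 0, ("A", "v2"): 1, ("A", "v3"): 2,
        ("B", "v1"): 1, ("B", "v2"): None, ("B", "v3"): 0}
imputed = {("A", "v1"): 0, ("A", "v2"): 2, ("A", "v3"): 2,
           ("B", "v1"): 1, ("B", "v2"): 1, ("B", "v3"): 0}
groups = {"A": "sequenced", "B": "unsequenced"}
rates = discrepancy_rates(gwas, imputed, groups)
```

Stratifying by group is what allows a contrast such as sequenced versus unsequenced individuals; the same tally could be stratified by pedigree instead.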
Cryptic population structure can increase both type I and type II errors. This is particularly problematic in case-control association studies of unrelated individuals. Some researchers believe that these problems are obviated in families. We argue here that this may not be the case, especially if families are drawn from a known admixed population such as Mexican Americans. We use a principal component approach to evaluate and visualize the results of three different approaches to searching for cryptic structure in the 20 multigenerational families of the Genetic Analysis Workshop 18 (GAW18). Approach 1 uses all family members in the sample to identify what might be considered "outlier" kindreds. Because families are likely to differ in size (in the GAW18 families, there is about a 4-fold difference in the number of typed individuals), approach 2 uses a weighting system that equalizes pedigree size. Approach 3 concentrates on the founders and the "marry-ins" because, in principle, the entire pedigree can be reconstructed with knowledge of the sequence of these unrelated individuals and genome-wide association study (GWAS) data on everyone else (to identify the position of recombinations). We demonstrate that these three approaches can yield very different insights about cryptic structure in a sample of families.
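The abstract does not specify the weighting system used in approach 2. One plausible scheme, shown here purely as an assumption, gives each typed individual a weight of 1 divided by the number of typed members in that individual's pedigree, so that every pedigree contributes the same total weight. The sketch below applies such weights to an allele frequency estimate, the quantity typically used to center and scale genotypes before a principal component analysis; it does not perform the principal component analysis itself, and all names and data are hypothetical.

```python
from collections import Counter


def pedigree_weights(pedigree_of):
    """Weight each person by 1 / (number of typed members in their pedigree)."""
    sizes = Counter(pedigree_of.values())
    return {person: 1.0 / sizes[ped] for person, ped in pedigree_of.items()}


def weighted_allele_frequency(genotypes, weights):
    """Weighted alternate-allele frequency from genotypes coded 0/1/2."""
    total = sum(weights[p] for p in genotypes)
    alt = sum(weights[p] * g for p, g in genotypes.items())
    return alt / (2.0 * total)


# Made-up example with a 4-fold difference in pedigree size.
pedigree_of = {"a1": "F1", "a2": "F1", "a3": "F1", "a4": "F1", "b1": "F2"}
genotypes = {"a1": 2, "a2": 2, "a3": 2, "a4": 2, "b1": 0}

weights = pedigree_weights(pedigree_of)
unweighted = sum(genotypes.values()) / (2.0 * len(genotypes))
weighted = weighted_allele_frequency(genotypes, weights)
```

In this toy case the larger pedigree dominates the unweighted estimate, whereas the weighted estimate treats the two pedigrees equally.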
We conducted a genome-wide SNP association study on prostate cancer on over 23,000 Icelanders, followed by a replication study including over 15,500 individuals from Europe and the United States. Two newly identified variants were shown to be associated with prostate cancer: rs5945572 on Xp11.22 and rs721048 on 2p15 (odds ratios (OR) = 1.23 and 1.15; P = 3.9 × 10^−13 and 7.7 × 10^−9, respectively). The 2p15 variant shows a significantly stronger association with more aggressive, rather than less aggressive, forms of the disease.
We report a genome-wide association follow-up study on prostate cancer. We identify four variants associated with the disease in European populations: rs10934853-A (OR = 1.12, P = 2.9 × 10^−10) on 3q21.3; two moderately correlated (r^2 = 0.07) variants on 8q24.21, rs16902094-G (OR = 1.21, P = 6.2 × 10^−15) and rs445114-T (OR = 1.14, P = 4.7 × 10^−10); and rs8102476-C (OR = 1.12, P = 1.6 × 10^−11) on 19q13.2. We also refine a previous association signal on 11q13 with the SNP rs11228565-A (OR = 1.23, P = 6.7 × 10^−12). In a multi-variant analysis, using 22 prostate cancer risk variants typed in the Icelandic population, we estimate that carriers belonging to the top 1.3% of the risk distribution have a risk of developing the disease that is more than 2.5 times greater than the population average risk estimates.
Recent whole genome association studies have independently identified multiple prostate cancer (PC) risk variants on 8q24. We have evaluated association of common variants in this region with PC susceptibility and tumor aggressiveness in a sample of European American men.
Forty-nine tagging SNPs including three previously reported significant variants (rs1447295, rs6983267, rs16901979) and seven variants in the 5′ upstream region of the MYC proto-oncogene were tested for association with susceptibility to PC and tumor aggressiveness in 596 histologically verified PC cases and 567 ethnically matched controls.
Significant associations with susceptibility to PC were found at 17 SNPs, four of which (rs1016342, rs1378897, rs871135 and rs6470517) remained significant after correction for multiple testing. One of the associated SNPs, rs871135, is located in the putative gene POU5F1P1 within the 8q24 region. An in silico analysis showed that the associated variant of this SNP alters a transcription factor, implicating a plausible regulatory role. Additionally, one of the SNPs significantly associated with PC susceptibility, rs6470517, showed a significant over-representation of the G allele in cases with aggressive tumors.
Although this study does not directly confirm associations of the three specific SNPs (cited above), it corroborates reported signals of association in 8q24, reaffirming that genetic variation on 8q24 influences susceptibility to PC in men of European ancestry. Although our study did not confirm the allelic association of rs1447295, meta-analysis of this SNP provided support for previously reported associations. Further, this study implicates the 8q24 region in aggressive forms of PC.
Genetic susceptibility; association study; tagging SNPs; chromosome 8q24; MYC proto-oncogene
Deciphering the genetic basis of prostate cancer aggressiveness could provide valuable information for the screening and treatment of this common but complex disease. We previously detected linkage between a broad region on chromosome 7q22-35 and Gleason score—a strong predictor of prostate cancer aggressiveness. To further clarify this finding and focus on the potentially causative gene, we undertook a fine-mapping study across the 7q22-35 region.
Our study population encompassed 698 siblings diagnosed with prostate cancer. 3,072 single nucleotide polymorphisms (SNPs) spanning the chromosome 7q22-35 region were genotyped using the Illumina GoldenGate assay. The impact of SNPs on Gleason score was evaluated using affected sibling pair linkage and family-based association tests.
We confirmed the previous linkage signal and narrowed the 7q22-35 prostate cancer aggressiveness locus to a 370 kb region. Centered under the linkage peak is the gene KLRG2 (killer cell lectin-like receptor subfamily G, member 2). Association tests indicated that the potentially functional non-synonymous SNP rs17160911 in KLRG2 was significantly associated with Gleason score (p = 0.0007).
These findings suggest that genetic variants in the gene KLRG2 may affect Gleason score at diagnosis and hence the aggressiveness of prostate cancer.
prostate cancer; Gleason Score; siblings; SNP
Genetic Analysis Workshop 17 (GAW17) provided a platform for evaluating existing statistical genetic methods and for developing novel methods to analyze rare variants that modulate complex traits. In this article, we present an overview of the 1000 Genomes Project exome data and simulated phenotype data that were distributed to GAW17 participants for analyses, the different issues addressed by the participants, and the process of preparation of manuscripts resulting from the discussions during the workshop.
The unrelated individuals sample from Genetic Analysis Workshop 17 consists of a small number of subjects from eight population samples and genetic data composed mostly of rare variants. We compare two simple approaches to collapsing rare variants within genes for their utility in identifying genes that affect phenotype. We also compare results from stratified analyses to those from a pooled analysis that uses ethnicity as a covariate. We found that the two collapsing approaches were similarly effective in identifying genes that contain causative variants in these data. However, including population as a covariate was not an effective substitute for analyzing the subpopulations separately when only one subpopulation contained a rare variant linked to the phenotype.
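The abstract does not say which two collapsing approaches were compared. As an illustration only, the sketch below shows two commonly used simple schemes: a carrier indicator (does the subject carry any rare allele in the gene?) and a burden count (how many rare alleles does the subject carry?). The frequency threshold, the function names, and the toy genotype matrix are assumptions; genotypes are taken to be coded as minor-allele counts 0/1/2.

```python
def rare_variant_columns(genotypes, maf_threshold):
    """Indices of variants whose sample allele frequency is in (0, threshold]."""
    n_subjects = len(genotypes)
    rare = []
    for j in range(len(genotypes[0])):
        freq = sum(row[j] for row in genotypes) / (2.0 * n_subjects)
        if 0.0 < freq <= maf_threshold:
            rare.append(j)
    return rare


def collapse_indicator(genotypes, rare):
    """1 if the subject carries at least one rare allele in the gene, else 0."""
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


def collapse_count(genotypes, rare):
    """Total number of rare alleles the subject carries in the gene."""
    return [sum(row[j] for j in rare) for row in genotypes]


# Made-up gene with 3 variants typed in 5 subjects (rows are subjects).
gene = [[0, 0, 1],
        [1, 1, 1],
        [0, 0, 2],
        [0, 1, 0],
        [0, 0, 1]]
# A realistic threshold would be far lower (e.g. 0.01); 0.25 suits the toy size.
rare = rare_variant_columns(gene, maf_threshold=0.25)
indicator = collapse_indicator(gene, rare)
burden = collapse_count(gene, rare)
```

Either collapsed score can then be tested against the phenotype, within each subpopulation separately or in a pooled model with population as a covariate.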
We report two approaches for linkage analysis of data consisting of replicate phenotypes. The first approach is specifically designed for the unusual (in human data) replicate structure of the Genetic Analysis Workshop 17 pedigree data. The second approach consists of a standard linkage analysis that, although not specifically tailored to data consisting of replicate phenotypes, was envisioned as providing a sounding board against which our novel approach could be assessed. Both approaches are applied to the analysis of three quantitative phenotypes (Q1, Q2, and Q4) in two sets of African families. All analyses were carried out blind to the generating model (i.e., the “answers”). Using both methods, we found numerous significant linkage signals for Q1, although population colocalization was absent for most of these signals. The linkage analysis of Q2 and Q4 failed to reveal any strong linkage signals.
Although the importance of selecting cases and controls from the same population has been recognized for decades, the recent advent of genome-wide association studies has heightened awareness of this issue. Because these studies typically deal with large samples, small differences in allele frequencies between cases and controls can easily reach statistical significance. When, unbeknownst to a researcher, cases and controls have different substructures, the number of false-positive findings is inflated. There have been three recent developments of purely statistical approaches to assessing the ancestral comparability of case and control samples: genomic control, structured association, and multivariate reduction analyses. The widespread use of high-throughput technology has allowed the quick and accurate genotyping of the large number of markers required by these methods.
Group 13 dealt with four population stratification issues: single-nucleotide polymorphism marker selection, association testing, non-standard methods, and linkage disequilibrium calculations in stratified or mixed ethnicity samples. We demonstrated that there are continuous axes of ethnic variation in both datasets of Genetic Analysis Workshop 16. Furthermore, ignoring this structure created p-value inflation for a variety of phenotypes. Principal components (or multidimensional scaling coordinates) can control this inflation when included as covariates in a logistic regression. One can weight for local ancestry estimation and allow the use of related individuals. Problems arise in the presence of extremely high association or unusually strong linkage disequilibrium (e.g., in chromosomal inversions). Our group also reported a method for performing an association test controlling for substructure when genome-wide markers are not available to explicitly compute stratification.
genetic association; genome-wide association study; principal components; multidimensional scaling; ethnic substructure
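Of the three statistical approaches named above, genomic control is the simplest to illustrate. A minimal sketch, assuming 1-df chi-square association statistics are already available for a genome-wide set of markers: the inflation factor lambda is the observed median statistic divided by the median of a chi-square distribution with 1 degree of freedom (about 0.4549), and each statistic is then divided by lambda. The function names and the five example statistics are made up.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_inflation(chi2_stats):
    """Inflation factor lambda: observed median over the expected null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide every statistic by lambda (never inflating when lambda < 1)."""
    lam = max(1.0, genomic_inflation(chi2_stats))
    return [x / lam for x in chi2_stats]


# Made-up statistics; a real analysis would use thousands of markers.
stats = [0.1, 0.5, 0.91, 2.0, 7.5]
lam = genomic_inflation(stats)
adjusted = genomic_control_adjust(stats)
```

A lambda near 1 suggests little inflation; values well above 1 indicate that unmodeled structure, such as the continuous axes of ethnic variation described above, may be inflating the test statistics.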
Because of the dramatically different clinical course of aggressive and indolent prostate carcinoma (PCa), markers that distinguish between these phenotypes are of critical importance. Apoptosis is an important protective mechanism for unrestrained cellular growth and metastasis. Therefore, dysfunction in this pathway is a key step in cancer progression. As such, genetic variants in apoptosis genes are potential markers of aggressive PCa. Recent work in breast carcinoma has implicated the histidine variant of CASP8 D302H (rs1045485) as a protective risk allele.
We tested the hypothesis that the H variant was protective for aggressive PCa in a pooled analysis of 796 aggressive cases and 2,060 controls. The H allele was associated with a reduced risk of aggressive PCa (per-allele OR = 0.67, 95% CI: 0.54–0.83, P for trend = 0.0003). The results were similar for European-Americans (per-allele OR = 0.68; 95% CI: 0.54–0.86) and African-Americans (per-allele OR = 0.61; 95% CI: 0.34–1.10). We further determined from the full series of 1,160 cases and 1,166 controls in the Prostate, Lung, Colorectal, Ovarian (PLCO) population that the protective effect of the H allele tended to be limited to high-grade and advanced PCa (all cases: per-allele OR = 0.94; 95% CI: 0.79–1.11; localized, low-grade disease: per-allele OR = 0.98; 95% CI: 0.79–1.23; aggressive disease: per-allele OR = 0.73; 95% CI: 0.50–1.07).
These results suggest that the histidine variant of CASP8 D302H is a protective allele for aggressive PCa with potential utility for identification of patients at differential risk for this clinically significant phenotype.
Prostate Carcinoma; Apoptosis; Risk Assessment; Cancer Susceptibility; Metastatic Disease
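As a reading aid, the pooled per-allele odds ratio and confidence interval reported above can be checked for internal consistency with the stated trend P value. The sketch assumes a Wald-type interval that is symmetric on the log-odds scale, which the abstract does not confirm; the function name is hypothetical.

```python
import math


def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover the log-OR standard error, z statistic and two-sided P value.

    Assumes the 95% CI was built as exp(log(OR) +/- z_crit * SE).
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return se, z, p


# Pooled aggressive PCa result from the abstract: OR 0.67, 95% CI 0.54-0.83.
se, z, p = p_from_or_ci(0.67, 0.54, 0.83)
```

Under that assumption the recovered P value is of the same order as the reported P for trend of 0.0003; exact agreement is not expected because the published figures are rounded.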
Many phenotypes of public health importance (e.g., diabetes, coronary artery disease, major depression, obesity, and addictions to alcohol and nicotine) involve complex pathways of action. Interactions between genetic variants or between genetic variants and environmental factors likely play important roles in the functioning of these pathways. Unfortunately, complex interacting systems are likely to have important interacting factors that may not readily reveal themselves to univariate analyses. Instead, detecting the role of some of these factors may require analyses that are sensitive to interaction effects.
In this study, we evaluate the sensitivity and specificity of the restricted partition method (RPM) to detect signals related to coronary artery disease in the Genetic Analysis Workshop 16 Problem 3 data using the 50k candidate gene single-nucleotide polymorphism set. Power and false-positive rates were evaluated using the first 100 replicate datasets. This included an exploration of the utility of using all genotyped family members compared with selecting one member per family.
We conducted a search for non-chromosome 6 genes that may increase risk for rheumatoid arthritis (RA). Our approach was to retrospectively ascertain three "extreme" subsamples from the North American Rheumatoid Arthritis Consortium. The three subsamples are: 1) RA cases who have two low-risk HLA-DRB1 alleles (N = 18), 2) RA cases who have two high-risk HLA-DRB1 alleles (N = 163), and 3) controls who have two low-risk HLA-DRB1 alleles (N = 652). We hypothesized that since Group 1's RA was likely due to non-HLA related risk factors, and because Group 3, by definition, is unaffected, comparing Group 1 with Group 2 and Group 1 with Group 3 would result in the identification of candidate susceptibility loci located outside of the MHC region. Accordingly, we restricted our search to the 21 non-chromosome 6 autosomes. The case-case comparison of Groups 1 and 2 resulted in the identification of 17 SNPs with allele frequencies that differed at p < 0.0001. The case-control comparison of Groups 1 and 3 identified 23 SNPs that differed in allele frequency at p < 0.0001. Eight of these SNPs (rs10498105, rs2398966, rs7664880, rs7447161, rs2793471, rs2611279, rs7967594, and rs742605) were common to both lists.
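The abstract does not state which test was used to compare allele frequencies between groups. One standard choice, shown here as an assumption, is a Pearson chi-square test with 1 degree of freedom on the 2 × 2 table of allele counts. The allele counts below are invented; only the group sizes (18 and 652 subjects, hence 36 and 1,304 alleles) come from the text.

```python
import math


def allelic_chi2(a_minor, a_major, b_minor, b_major):
    """Pearson chi-square (1 df) and P value for a 2x2 table of allele counts."""
    n = a_minor + a_major + b_minor + b_major
    denom = ((a_minor + a_major) * (b_minor + b_major)
             * (a_minor + b_minor) * (a_major + b_major))
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a_minor * b_major - a_major * b_minor) ** 2 / denom
    # For 1 df, P(X > chi2) equals the two-sided normal tail at sqrt(chi2).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical SNP: 20 of 36 alleles in Group 1 versus 300 of 1,304 in Group 3.
chi2, p = allelic_chi2(20, 16, 300, 1004)
passes_threshold = p < 0.0001
```

With only 18 subjects in Group 1, expected cell counts can be small for rarer alleles, in which case an exact test may be the more defensible choice.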
Although identification of cryptic population stratification is necessary for case/control association analyses, it is also vital for linkage analyses and family-based association tests when founder genotypes are missing. However, including related individuals in an analysis such as EIGENSTRAT can result in bias; using only founders or one individual per pedigree results in loss of data and inaccurate estimates of stratification. We examine a generalization of principal-component analyses to allow for the inclusion of related individuals by down-weighting the significance of individual comparisons.
We carried out an analysis of the Genetic Analysis Workshop 15 simulated Problem 3 data. We restricted ourselves to the present/absent phenotype. Linkage analysis revealed a very strong signal on chromosome 6. Association analysis revealed additional susceptible loci located on chromosomes 11 and 18. The latter two signals were subsequently verified with linkage analysis – but only after 20 replicates were pooled. Analysis of linkage disequilibrium patterns, in concert with family-based association tests, led us to infer the presence of a second chromosome 6 locus located in the vicinity of single-nucleotide polymorphisms 160–162. These analyses were carried out without knowledge of the model used to generate the simulation.
Performing linkage and association analyses on a large set of correlated data presents an interesting set of problems. In the current setting, we have 3554 expression levels from lymphoblastoid cell lines in 194 individuals from 14 three-generation Utah CEPH (Centre d'Etude du Polymorphisme Humain) pedigrees. We formed multivariate expression phenotypes from six sets of genes. These consisted of a set of genes identified by the data providers as showing common linkage to a region of chromosome 14, as well as five other sets suggested by ontological evidence. Using principal-component analyses, we generated seven quantitative phenotypes for expression levels from these six sets of genes. We performed quantitative genome linkage screens on these traits using the expression traits from the third generation of each pedigree. As expected, the strongest linkage signal was achieved when the trait under analysis was the composite of the expressions of genes previously showing linkage to chromosome 14. In particular, this trait produced a LOD score of 5.2 on chromosome 14. The trait also produced LOD scores over 3.5 on chromosomes 1, 7, 9, and 11; this suggests that these genes may be controlled by additional genetic factors on the genome. Subsequent association analyses on the first two generations of these pedigrees identified two polymorphisms on chromosome 11 as significant after correcting for multiple tests. These results suggest that principal-component analyses are useful for the analysis of pleiotropic loci. Furthermore, we have identified two single-nucleotide polymorphisms that may influence the expression of multiple genes linked to chromosome 14.
The restricted partition method (RPM) provides a way to detect qualitative factors (e.g. genotypes, environmental exposures) associated with variation in quantitative or binary phenotypes, even if the contribution is predominantly an interaction displaying little or no signal in univariate analyses. The RPM provides a model (possibly non-linear) of the relationship between the predictor covariates and the phenotype as well as measures of statistical and clinical significance for the model.
Blind to the generating model, we used the RPM to screen a data set consisting of 1500 unrelated cases and 2000 unrelated controls from Replicate 1 of the Genetic Analysis Workshop 15 Problem 3 data for genetic and environmental factors contributing to rheumatoid arthritis (RA) risk. Both univariate and pair-wise analyses were performed using sex, smoking, parental DRB1 HLA microsatellite alleles, and 9187 single-nucleotide polymorphism genotypes from across the genome. With this approach we correctly identified three genetic loci contributing directly to RA risk, and one quantitative trait locus for the endophenotype IgM level. We did not mistakenly identify any factors not in the generating model. All the factors we found were detectable with univariate RPM analyses. We failed to identify two genetic loci modifying the risk of RA. After breaking the blind, we examined the true modeling factors in the first 50 data replicates and found that we would not have identified the additional factors as important even had we combined all the data from the first 50 replicates in a single data set.
Genetic maps based on single-nucleotide polymorphisms (SNP) are increasingly being used as an alternative to microsatellite maps. This study compares linkage results for both types of maps for a neurophysiology phenotype and for an alcohol dependence phenotype. Our analysis used two SNP maps on the Illumina and Affymetrix platforms. We also considered the effect of high linkage disequilibrium (LD) in regions near the linkage peaks by analysing a "sparse" SNP map obtained by dropping some markers in high LD with other markers in those regions.
The neurophysiology phenotype at the main linkage peak near 130 Mb gave LOD scores of 2.76, 2.53, 3.22, and 2.68 for the microsatellite, Affymetrix, Illumina, and Illumina-sparse maps, respectively. The alcohol dependence phenotype at the main linkage peak near 101 Mb gave LOD scores of 3.09, 3.69, 4.08, and 4.11 for the microsatellite, Affymetrix, Illumina, and Illumina-sparse maps, respectively.
The linkage results were stronger overall for SNPs than for microsatellites for both phenotypes. However, LOD scores may be artificially elevated in regions of high LD. Our analysis indicates that appropriately thinning a SNP map in regions of high LD should give more accurate LOD scores. These results suggest that SNPs can be an efficient substitute for microsatellites for linkage analysis of both quantitative and qualitative phenotypes.
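To put the reported LOD scores on a more familiar scale, a LOD score can be converted to an approximate pointwise P value. The sketch below uses the usual asymptotic result that 2 ln(10) × LOD follows a 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom. Whether that approximation suits the specific linkage statistics used in this study is an assumption, and the values are pointwise, not genome-wide.

```python
import math


def lod_to_pointwise_p(lod):
    """Approximate one-sided pointwise P value for a LOD score (1-df test)."""
    chi2 = 2.0 * math.log(10.0) * max(lod, 0.0)
    # Half of the 1-df chi-square upper tail, which equals erfc(sqrt(chi2 / 2)).
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))


# Alcohol dependence LOD scores reported above, by marker map.
alcohol_lods = {"microsatellite": 3.09, "Affymetrix": 3.69,
                "Illumina": 4.08, "Illumina-sparse": 4.11}
alcohol_p = {name: lod_to_pointwise_p(lod) for name, lod in alcohol_lods.items()}
```

By this conversion a LOD of 3 corresponds to a pointwise P value of roughly 0.0001, so a LOD difference of about 1 between maps represents roughly a tenfold difference in nominal significance.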
We used the LOKI software to generate multipoint identity-by-descent matrices for a microsatellite map (with 31 markers) and two single-nucleotide polymorphism (SNP) maps to examine information content across chromosome 7 in the Collaborative Study on the Genetics of Alcoholism dataset. Despite the lower information provided by a single SNP, SNP maps overall had higher and more uniform information content across the chromosome. The Affymetrix map (578 SNPs) and the Illumina map (271 SNPs) provided almost identical information. However, increased information has a computational cost: SNP maps require 100 times as many iterations as microsatellites to produce stable estimates.
The overlap of 94 single-nucleotide polymorphisms (SNP) among the 4,720 and 11,120 SNPs contained in the linkage panels of Illumina and Affymetrix, respectively, allows an assessment of the discrepancy rate produced by these two platforms. Although the no-call rate for the Affymetrix platform is approximately 8.6 times greater than for the Illumina platform, when both platforms make a genotypic call, the agreement is an impressive 99.85%. To determine if disputed genotypes can be resolved without sequencing, we studied recombination in the region of the discrepancy for the most discrepant SNP rs958883 (typed by Illumina) and tsc02060848 (typed by Affymetrix). We find that the number of inferred recombinants is substantially higher for the Affymetrix genotypes compared to the Illumina genotypes. We illustrate this with pedigree 10043, in which 3 of 7 versus 0 of 7 offspring must be double recombinants using the genotypes from the Affymetrix and the Illumina platforms, respectively. Of the 36 SNPs with one or more discrepancies, we identified a subset that appears to cluster in families. Some of this clustering may be due to the presence of a second segregating SNP that obliterates an XbaI site (the restriction enzyme used in the Affymetrix platform), resulting in a fragment too long (>1,000 bp) to be amplified.
Accurately resolving population structure in a sample is important for both linkage and association studies. In this study we investigated the power of single-nucleotide polymorphisms (SNPs) in detecting population structure in a sample of 286 unrelated individuals. We varied the number of SNPs to determine how many are required to approach the degree of resolution obtained with the Collaborative Study on the Genetics of Alcoholism (COGA) short tandem repeat polymorphisms (STRPs). In addition, we selected SNPs with varying minor allele frequencies (MAFs) to determine whether low or high frequency SNPs are more efficient in resolving population structure. We conclude that a set of at least 100 evenly spaced SNPs with MAFs of 40–50% is required to resolve population structure in this dataset. If SNPs with lower MAFs are used, then more than 250 SNPs may be required to obtain reliable results.
The genetic regulation of variation in intra-individual fluctuations in systolic blood pressure over time is poorly understood. Analysis of the magnitude of the average fluctuation of a person's systolic blood pressure around his or her age-adjusted trend line, however, shows moderate, albeit significant, family resemblance in Cohort 1 of the Framingham Heart Study. To determine whether genomic regions affecting this phenotype could be identified, we pursued a "model-free" multipoint quantitative linkage analysis.
Two different linkage methods revealed multiple nominally significant signals, two to four of which are "replicated" in Cohort 2. When both cohorts are assembled into extended pedigrees, three linkage signals remain nominally significant by one or both methods.
Any or all of the genomic regions in the vicinity of D5S1456, D11S2359, and D20S470 may contain elements that regulate systolic blood pressure homeostasis.
The biologic aggressiveness of prostate tumors is an important indicator of prognosis. Chromosome 7q32–q33 was recently reported to show linkage to more aggressive prostate cancer, based on Gleason score, in a large sibling pair study. We report confirmation and narrowing of the linked region using finer-scale genotyping. We also report a high frequency of allelic imbalance (AI) defined within this locus in a series of 48 primary prostate tumors from men unselected for family history or disease status. The highest frequency of AI was observed with adjacent markers D7S2531 (52%) and D7S1804 (36%). These two markers delineated a common region of AI, with 24 tumors exhibiting interstitial AI involving one or both markers. The 1.1-Mb candidate region contains relatively few transcripts. Additionally, we observed positive associations between interstitial AI at D7S1804 and early age at diagnosis (P=.03) as well as a high combined Gleason score and tumor stage (P=.06). Interstitial AI at D7S2531 was associated with a positive family history of prostate cancer (P=.05). These data imply that we have localized a prostate cancer tumor aggressiveness locus to chromosome 7q32–q33 that is involved in familial and nonfamilial forms of prostate cancer.
prostate; linkage; AI; aggressive; 7q