Refractive error (RE) is a complex, multifactorial disorder characterized by a mismatch between the optical power of the eye and its axial length that causes object images to be focused off the retina. The two major subtypes of RE are myopia (nearsightedness) and hyperopia (farsightedness), which represent opposite ends of the distribution of the quantitative measure of spherical refraction. We performed a fixed effects meta-analysis of genome-wide association results of myopia and hyperopia from 9 studies of European-derived populations: AREDS, KORA, FES, OGP-Talana, MESA, RSI, RSII, RSIII and ERF. One genome-wide significant region was observed for myopia, corresponding to a previously identified myopia locus on 8q12 (p = 1.25×10−8), which has been reported by Kiefer et al. as significantly associated with myopia age at onset and by Verhoeven et al. as significantly associated with mean spherical-equivalent (MSE) refractive error. We observed two genome-wide significant associations with hyperopia. These regions overlapped with loci on 15q14 (minimum p value = 9.11×10−11) and 8q12 (minimum p value = 1.82×10−11) previously reported for MSE and myopia age at onset. We also used an intermarker linkage-disequilibrium-based method for calculating the effective number of tests in targeted regional replication analyses. We analyzed myopia (which represents the closest phenotype in our data to the one used by Kiefer et al.) and showed replication of 10 additional loci associated with myopia previously reported by Kiefer et al. This is the first replication of these loci using myopia as the trait under analysis. “Replication-level” association was also seen between hyperopia and 12 of Kiefer et al.'s published loci. For the loci that show evidence of association with both myopia and hyperopia, the estimated effects of the risk alleles were in opposite directions for the two traits. This suggests that these loci are important contributors to variation of refractive error across the distribution.
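The abstract above does not state which fixed effects estimator was used. As an illustration only, the sketch below assumes the common inverse-variance-weighted form, in which each study's per-SNP effect estimate is weighted by the reciprocal of its squared standard error. The function name and the input values are hypothetical.

```python
import math

def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted fixed effects meta-analysis for one SNP.

    betas: per-study effect estimates (e.g. log odds ratios)
    ses:   per-study standard errors, same order as betas
    Assumption: this weighting scheme is illustrative, not taken from the paper.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    # Pooled effect is the weighted mean of the study effects.
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p

# Hypothetical effect estimates from three studies.
pooled_beta, pooled_se, pooled_z, pooled_p = fixed_effects_meta(
    [0.12, 0.08, 0.15], [0.05, 0.04, 0.07]
)
```

Under this scheme, effects in opposite directions for two traits would appear as pooled estimates of opposite sign for the same allele.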
Functional linear models are developed in this paper for testing associations between quantitative traits and genetic variants, which can be rare variants, common variants, or a combination of the two. By treating multiple genetic variants of an individual in a human population as a realization of a stochastic process, the genome of an individual in a chromosome region is a continuum of sequence data rather than discrete observations. The genome of an individual is viewed as a stochastic function that contains both linkage and linkage disequilibrium (LD) information of the genetic markers. By using techniques of functional data analysis, both fixed and mixed effect functional linear models are built to test the association between quantitative traits and genetic variants adjusting for covariates. After extensive simulation analysis, it is shown that the F-distributed tests of the proposed fixed effect functional linear models have higher power than the sequence kernel association test (SKAT) and its optimal unified test (SKAT-O) in most cases for three scenarios: (1) the causal variants are all rare, (2) the causal variants are both rare and common, and (3) the causal variants are common. The superior performance of the fixed effect functional linear models is most likely due to their optimal utilization of both genetic linkage and LD information of multiple genetic variants in a genome and similarity among different individuals, while SKAT and SKAT-O only model the similarities and pairwise LD but do not model linkage and higher order LD information sufficiently. In addition, the proposed fixed effect models generate accurate type I error rates in simulation studies. We also show that the functional kernel score tests of the proposed mixed effect functional linear models are preferable in candidate gene analysis and small sample problems. The methods are applied to analyze three biochemical traits in data from the Trinity Students Study.
rare variants; common variants; association mapping; quantitative trait loci; complex traits; functional data analysis
To identify myopia susceptibility genes influencing common myopia in 94 African-American and 36 White families.
A prospective study of families with myopia consisting of a minimum of two individuals affected with myopia.
Extended families consisting of at least two siblings affected with myopia were ascertained. A genome-wide linkage scan using 387 markers was conducted by the Center for Inherited Disease Research. Linkage analyses were conducted with parametric and nonparametric methods. Model-free linkage analysis was performed maximizing over penetrance and over dominance (that is, fitting a wide range of both dominant and recessive models).
Under the model-free analysis, the maximum two point heterogeneity logarithm of the odds score (MALOD) was 2.87 at D6S1009 in the White cohort and the maximum multipoint MALOD was 2.42 at D12S373-D12S1042 in the same cohort. The nonparametric linkage (NPL) maximum multipoint at D6S1035 had a P value of .005. An overall multipoint NPL score was obtained by combining NPL scores from both populations. The highest combined NPL score was observed at D20S478 with a significant P value of .008. Suggestive evidence of linkage in the White cohort mapped to a previously mapped locus on chromosome 11 at D11S1981 (NPL = 2.14; P = .02).
Suggestive evidence of linkage to myopia in both African Americans and Whites was seen on chromosome 20 and became more significant when the scores were combined for both groups. The locus on chromosome 11 independently confirms a report by Hammond and associates mapping a myopia quantitative trait locus to this region.
Refractive error is a complex trait with multiple genetic and environmental risk factors, and is the most common cause of preventable blindness worldwide. The common nature of the trait suggests the presence of many genetic factors that individually may have modest effects. To achieve an adequate sample size to detect these common variants, large, international collaborations have formed. These consortia typically use meta-analysis to combine multiple studies from many different populations. This approach is robust to differences between populations; however, it does not compensate for the different haplotypes in each genetic background evidenced by different alleles in linkage disequilibrium with the causative variant. We used the Age-Related Eye Disease Study (AREDS) cohort to replicate published significant associations at two loci on chromosome 15 from two genome-wide association studies (GWASs). The single nucleotide polymorphisms (SNPs) that exhibited association on chromosome 15 in the original studies did not show evidence of association with refractive error in the AREDS cohort. This paper seeks to determine whether the non-replication in this AREDS sample may be due to the limited number of SNPs chosen for replication.
We selected all SNPs genotyped on the Illumina Omni2.5v1_B array or custom TaqMan assays or imputed from the GWAS data, in the region surrounding the SNPs from the Consortium for Refractive Error and Myopia study. We analyzed the SNPs for association with refractive error using standard regression methods in PLINK. The effective number of tests was calculated using the Genetic Type I Error Calculator.
Although use of the same SNPs used in the Consortium for Refractive Error and Myopia study did not show any evidence of association with refractive error in this AREDS sample, other SNPs within the candidate regions demonstrated an association with refractive error. Significant evidence of association was found using the hyperopia categorical trait, with the most significant SNPs rs1357179 on 15q14 (p=1.69×10−3) and rs7164400 on 15q25 (p=8.39×10−4), which passed the replication thresholds.
This study adds to the growing body of evidence that the most significant SNPs found in one population may not be significant in another population because of differences in linkage disequilibrium structure and/or allele frequency. This suggests that replication studies should include less significant SNPs in an associated region rather than only a few SNPs chosen by a significance threshold.
Linkage analysis was developed to detect excess co-segregation of the putative alleles underlying a phenotype with the alleles at a marker locus in family data. Many different variations of this analysis and corresponding study design have been developed to detect this co-segregation. Linkage studies have been shown to have high power to detect loci that have alleles (or variants) with a large effect size, i.e. alleles that make large contributions to the risk of a disease or to the variation of a quantitative trait. However, alleles with a large effect size tend to be rare in the population. In contrast, association studies are designed to have high power to detect common alleles which tend to have a small effect size for most diseases or traits. Although genome-wide association studies have been successful in detecting many new loci with common alleles of small effect for many complex traits, these common variants often do not explain a large proportion of disease risk or variation of the trait. In the past, linkage studies were successful in detecting regions of the genome that were likely to harbor rare variants with large effect for many simple Mendelian diseases and for many complex traits. However, identifying the actual sequence variant(s) responsible for these linkage signals was challenging because of difficulties in sequencing the large regions implicated by each linkage peak. Current ‘next-generation’ DNA sequencing techniques have made it economically feasible to sequence all exons or the whole genomes of a reasonably large number of individuals. Studies have shown that rare variants are quite common in the general population, and it is now possible to combine these new DNA sequencing methods with linkage studies to identify rare causal variants with a large effect size. A brief review of linkage methods is presented here with examples of their relevance and usefulness for the interpretation of whole-exome and whole-genome sequence data.
Linkage; Genetics; DNA sequence; Whole-genome sequence; Whole-exome sequence
Visual refractive errors (REs) are complex genetic traits with a largely unknown etiology. To date, genome-wide association studies (GWASs) of moderate size have identified several novel risk markers for RE, measured here as mean spherical equivalent (MSE). We performed a GWAS using a total of 7280 samples from five cohorts: the Age-Related Eye Disease Study (AREDS); the KORA study (‘Cooperative Health Research in the Region of Augsburg’); the Framingham Eye Study (FES); the Ogliastra Genetic Park-Talana (OGP-Talana) Study and the Multiethnic Study of Atherosclerosis (MESA). Genotyping was performed on Illumina and Affymetrix platforms with additional markers imputed to the HapMap II reference panel. We identified a new genome-wide significant locus on chromosome 16 (rs10500355, P = 3.9 × 10−9) in a combined discovery and replication set (26 953 samples). This single nucleotide polymorphism (SNP) is located within the RBFOX1 gene, which encodes a neuron-specific splicing factor regulating a wide range of alternative splicing events implicated in neuronal development and maturation, including transcription factors, other splicing factors and synaptic proteins.
Two-point linkage analyses of whole genome sequence data are a promising approach to identify rare variants that segregate with complex diseases in large pedigrees because, in theory, the causal variants have been genotyped. We used whole genome sequence data and simulated traits provided by Genetic Analysis Workshop 18 to evaluate the proportion of false-positive findings in a binary trait using classic two-point parametric linkage analysis. False-positive genome-wide significant log of odds (LOD) scores were identified in more than 80% of 50 replicates for a binary phenotype generated by dichotomizing a quantitative trait that was simulated with a polygenic component (that was not based on any of the provided whole genome sequence genotypes). In contrast, when the trait was truly nongenetic (created by randomly assigning affected-unaffected status), the number of false-positive results was well controlled. These results suggest that when using two-point linkage analyses on whole genome sequence data, one should carefully examine regions yielding significant two-point LOD scores with multipoint analysis and that a more stringent significance threshold may be needed.
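For readers unfamiliar with the statistic, the two-point LOD score compares the likelihood of the observed meioses at a recombination fraction theta with the likelihood under free recombination (theta = 0.5). The sketch below covers only the simplest textbook case of phase-known, fully informative meioses. It is not the pedigree likelihood used in the workshop analyses, and the function name is hypothetical.

```python
import math

def two_point_lod(recombinants, nonrecombinants, theta):
    """LOD score for phase-known, fully informative meioses.

    LOD(theta) = log10[ theta^R * (1 - theta)^(N - R) / 0.5^N ]
    Assumption: a simplified illustration, not a general pedigree likelihood.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if theta == 0.0 and recombinants > 0:
        # Any recombinant is impossible under complete linkage.
        return float("-inf")
    n = recombinants + nonrecombinants
    log_l_theta = 0.0
    if recombinants:
        log_l_theta += recombinants * math.log10(theta)
    if nonrecombinants:
        log_l_theta += nonrecombinants * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null

# Evaluate the LOD over a grid of theta values and keep the maximum.
grid = [i / 100 for i in range(0, 51)]
best_theta = max(grid, key=lambda t: two_point_lod(2, 18, t))
```

With 2 recombinants in 20 meioses, the maximum over the grid falls at theta = 0.10, the observed recombinant proportion.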
Group 14 of Genetic Analysis Workshop 17 examined several issues related to analysis of complex traits using DNA sequence data. These issues included novel methods for analyzing rare genetic variants in an aggregated manner (often termed collapsing rare variants), evaluation of various study designs to increase power to detect effects of rare variants, and the use of machine learning approaches to model highly complex heterogeneous traits. Various published and novel methods for analyzing traits with extreme locus and allelic heterogeneity were applied to the simulated quantitative and disease phenotypes. Overall, we conclude that power is (as expected) dependent on locus-specific heritability or contribution to disease risk, large samples will be required to detect rare causal variants with small effect sizes, extreme phenotype sampling designs may increase power for smaller laboratory costs, methods that allow joint analysis of multiple variants per gene or pathway are more powerful in general than analyses of individual rare variants, population-specific analyses can be optimal when different subpopulations harbor private causal mutations, and machine learning methods may be useful for selecting subsets of predictors for follow-up in the presence of extreme locus heterogeneity and large numbers of potential predictors.
rare variants; LASSO; machine learning; random forests; logic regression; binary trees; Poisson regression; ISIS; classification trees; meta-analysis; extreme sampling
Genetic Analysis Workshop 17 provided common and rare genetic variants from exome sequencing data and simulated binary and quantitative traits in 200 replicates. We provide a brief review of the machine learning and regression-based methods used in the analyses of these data. Several regression and machine learning methods were used to address different problems inherent in the analyses of these data, which are high-dimension, low-sample-size data typical of many genetic association studies. Unsupervised methods, such as cluster analysis, were used for data segmentation and subset selection. Supervised learning methods, which include regression-based methods (e.g., generalized linear models, logic regression, and regularized regression) and tree-based methods (e.g., decision trees and random forests), were used for variable selection (selecting genetic and clinical features most associated with or predictive of outcome) and prediction (developing models using common and rare genetic variants to accurately predict outcome), with the outcome being case-control status or quantitative trait value. We include a discussion of cross-validation for model selection and assessment and a description of available software resources for these methods.
unsupervised learning; supervised learning; cluster analysis; logistic regression; Poisson regression; logic regression; LASSO; ridge regression; decision trees; random forests; cross-validation; software
Prostate cancer (PrCa) is the most common male cancer in developed countries and the second most common cause of cancer death after lung cancer. We recently reported a genome-wide linkage scan in 69 Finnish hereditary PrCa (HPC) families, which replicated the HPC9 locus on 17q21-q22 and identified a locus on 2q37. The aim of this study was to detect other loci linked to HPC. Here we used ordered subset analysis (OSA), conditioned on nonparametric linkage to these loci, to detect other loci linked to HPC in subsets of families, but not the overall sample. We analyzed the families based on their evidence for linkage to chromosome 2, chromosome 17 and a maximum score using the strongest evidence of linkage from either of the two loci. Significant linkage to a 5-cM linkage interval with a peak OSA nonparametric allele-sharing LOD score of 4.876 on Xq26.3-q27 (ΔLOD=3.193, empirical P=0.009) was observed in a subset of 41 families weakly linked to 2q37, overlapping the HPCX1 locus. Two peaks that were novel to the analysis combining linkage evidence from both primary loci were identified: 18q12.1-q12.2 (OSA LOD=2.541, ΔLOD=1.651, P=0.03) and 22q11.1-q11.21 (OSA LOD=2.395, ΔLOD=2.36, P=0.006), which is close to HPC6. Using OSA allows us to find additional loci linked to HPC in subsets of families, and underlines the complex genetic heterogeneity of HPC even in highly aggregated families.
linkage analysis; ordered subset analysis; prostate cancer
Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios.
We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented.
The models we propose make no assumptions about the data structure, and capture the patterns in the data by just specifying the predictors involved and not any particular model structure. So they do not run the same risks of model mis-specification and the resultant estimation biases as a logistic model. This methodology, which we call a “risk machine”, will share properties from the statistical machine that it is derived from.
Consistent nonparametric regression; Logistic regression; Probability machine; Odds ratio; Counterfactuals; Interactions
Despite many years of research, most of the genetic factors contributing to myopia development remain unknown. Genetic studies have pointed to a strong inherited component, but although many candidate regions have been implicated, few genes have been positively identified.
We have previously reported 2 genomewide linkage scans in a population of 63 highly aggregated Ashkenazi Jewish families that identified a locus on chromosome 22. Here we used ordered subset analysis (OSA), conditioned on non-parametric linkage to chromosome 22 to detect other chromosomal regions which had evidence of linkage to myopia in subsets of the families, but not the overall sample.
Strong evidence of linkage to a 19-cM linkage interval with a peak OSA nonparametric allele-sharing logarithm-of-odds (LOD) score of 3.14 on 20p12-q11.1 (ΔLOD=2.39, empirical p=0.029) was identified in a subset of 20 families that also exhibited strong evidence of linkage to chromosome 22. One other locus also presented with suggestive LOD scores >2.0 on chromosome 11p14-q14 and one locus on chromosome 6q22-q24 had an OSA LOD score=1.76 (ΔLOD=1.65, empirical p=0.02).
The chromosome 6 and 20 loci are entirely novel and appear linked in a subset of families whose myopia is known to be linked to chromosome 22. The chromosome 11 locus overlaps with the known Myopia-7 (MYP7, OMIM 609256) locus. Using ordered subset analysis allows us to find additional loci linked to myopia in subsets of families, and underlines the complex genetic heterogeneity of myopia even in highly aggregated families and genetically isolated populations such as the Ashkenazi Jews.
To determine the potential influence of genetic factors on the prevalence of myopia in Tehran.
Of 6497 citizens of Tehran sampled from 160 clusters using stratified random cluster sampling, 4565 (70.3%) participated in the study and were referred to a clinic for an extensive eye examination and interview. These were from 1259 nuclear families with an average size of 3.6. Refraction data obtained from 3321 participants aged 16 years and over are presented. Three definitions of myopia, as a spherical equivalent of −0.5, −1, and −2 diopters or less, were used. Familial aggregation of myopia was evaluated with odds ratios and recurrence risk ratios (λR) using multiple logistic regression with generalised estimating equations (GEE), adjusted for age, sex, height, and education.
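As background for the recurrence risk ratio, λR is conventionally the risk of myopia in a relative of type R of an affected person divided by the population prevalence. The sketch below computes a crude, unadjusted version from relative pairs. It is not the covariate-adjusted GEE estimator described above, and the function name and data are hypothetical.

```python
def recurrence_risk_ratio(pairs, prevalence):
    """Crude recurrence risk ratio from relative pairs.

    pairs:      iterable of (index_affected, relative_affected) booleans
    prevalence: population prevalence of the trait, in (0, 1]
    Assumption: unadjusted illustration; the study used GEE with covariates.
    """
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must lie in (0, 1]")
    relatives_of_affected = [rel for idx, rel in pairs if idx]
    if not relatives_of_affected:
        raise ValueError("no pairs with an affected index person")
    # Risk to a relative given that the index person is affected.
    recurrence_risk = sum(relatives_of_affected) / len(relatives_of_affected)
    return recurrence_risk / prevalence

# Hypothetical sibling pairs: (index sibling myopic, other sibling myopic).
sib_pairs = [(True, True), (True, False), (False, True), (True, True), (True, False)]
lambda_sib = recurrence_risk_ratio(sib_pairs, prevalence=0.25)
```

A value of 1 would indicate no familial aggregation, and values above 1 indicate that relatives of affected people carry excess risk.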
Multivariate analyses showed a strong familial aggregation of myopia among siblings (λR ranging from 2.09 to 3.86) and parent–offspring pairs (λR from 1.82 to 3.81) adjusted for age, sex, height, and education. The aggregation increased with higher myopia thresholds and with the use of cycloplegic refraction. The odds ratios for spouse pairs were not significantly different from 1.0. The association of myopia with sex, height, and education (and not age) remained significant in the final GEE2 model.
The findings indicate a relatively high degree of familial aggregation of myopia in the Tehran population, independent of age, sex, height, and education. This residual aggregation may be a result of heredity or of an unmeasured common environmental effect.
familial myopia; myopia; refractive error; recurrence risk
A previous study of Old Order Amish families has shown association of ocular refraction with markers proximal to matrix metalloproteinase (MMP) genes MMP1 and MMP10 and intragenic to MMP2. We conducted a candidate gene replication study of association between refraction and single nucleotide polymorphisms (SNPs) within these genomic regions.
Candidate gene genetic association study.
2,000 participants drawn from the Age Related Eye Disease Study (AREDS) were chosen for genotyping. After quality control filtering, 1912 individuals were available for analysis.
Microarray genotyping was performed using the HumanOmni 2.5 bead array. SNPs originally typed in the previous Amish association study were extracted for analysis. In addition, haplotype tagging SNPs were genotyped using TaqMan assays. Quantitative trait association analyses of mean spherical equivalent refraction (MSE) were performed on 30 markers using linear regression models and an additive genetic risk model, while adjusting for age, sex, education, and population substructure. Post-hoc analyses were performed after stratifying on a dichotomous education variable. Pointwise (P-emp) and multiple-test study-wise (P-multi) significance levels were calculated empirically through permutation.
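The pointwise empirical significance level described above can be illustrated with a simple permutation scheme: shuffle the trait values, recompute the test statistic, and count how often the permuted statistic is at least as extreme as the observed one. The sketch below uses the absolute Pearson correlation between additive genotype codes and the trait as the statistic. It omits the covariate adjustment used in the study, and all names and data are hypothetical.

```python
import random

def _abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5

def empirical_p(genotypes, phenotypes, n_perm=1000, seed=0):
    """Pointwise permutation p-value for one marker (no covariates)."""
    rng = random.Random(seed)
    observed = _abs_corr(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the trait breaks any genotype-trait relationship.
        rng.shuffle(shuffled)
        if _abs_corr(genotypes, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical additive genotype codes (0/1/2) and refraction-like trait values.
geno = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 1, 2]
trait = [0.1, -0.2, 0.0, 1.1, 0.9, 1.0, 2.2, 1.8, 2.0, 0.05, 1.05, 1.95]
p_emp = empirical_p(geno, trait)
```

A study-wise level could be obtained in the same loop by recording, for each permutation, the maximum statistic across all markers and comparing each observed statistic with that distribution.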
Main outcome measures
MSE was used as a quantitative measure of ocular refraction.
The mean age and ocular refraction were 68 years (SD=4.7) and +0.55 D (SD=2.14), respectively. Pointwise statistical significance was obtained for rs1939008 (P-emp=0.0326). No SNP attained statistical significance after correcting for multiple testing. In stratified analyses, multiple SNPs reached pointwise significance in the lower-education group: 2 of these were statistically significant after multiple testing correction. The two highest-ranking SNPs in Amish families (rs1939008 and rs9928731) showed pointwise P-emp<0.01 in the lower-education stratum of AREDS participants.
We show suggestive evidence of replication of an association signal for ocular refraction to a marker between MMP1 and MMP10. We also provide evidence of a gene-environment interaction between previously-reported markers and education on refractive error. Variants in MMP1-MMP10 and MMP2 regions appear to affect population variation in ocular refraction in environmental conditions less favorable for myopia development.
refraction; refractive error; myopia; association study; gene-environment interaction; matrix metalloproteinase; MMP; genetics
Genome-wide association studies have identified novel genetic factors that contribute to intracranial aneurysm (IA) susceptibility. We sought to confirm previously reported loci, to identify novel risk factors, and to evaluate the contribution of these factors to familial and sporadic IA.
We utilized 2 complementary samples, one recruited on the basis of a dense family history of IA (discovery sample 1: 388 IA cases and 397 controls) and the other without regard to family history (discovery sample 2: 1095 IA cases and 1286 controls). Imputation was used to generate a common set of single nucleotide polymorphisms (SNP) across samples, and a logistic regression model was used to test for association in each sample. Results from each sample were then combined in a meta-analysis.
There was only modest overlap in the association results obtained in the 2 samples. In neither sample did results reach genome-wide significance. However, the meta-analysis yielded genome-wide significance for a SNP on chromosome 9p (CDKN2BAS; rs6475606; P=3.6×10−8) and provided further evidence to support the previously reported association of IA with a SNP in SOX17 on chromosome 8q (rs1072737; P=8.7×10−5). Analyses suggest that the effect of smoking acts multiplicatively with the SNP genotype, and smoking has a greater effect on risk than SNP genotype.
In addition to replicating several previously reported loci, we provide further evidence that the association on chromosome 9p is attributable to variants in CDKN2BAS (also known as ANRIL, an antisense noncoding RNA).
genome-wide association study; intracranial aneurysm
Genome-wide linkage analysis using microsatellite markers has been successful in the identification of numerous Mendelian and complex disease loci. The recent availability of high-density single-nucleotide polymorphism (SNP) maps provides a potentially more powerful option. Using the simulated and Collaborative Study on the Genetics of Alcoholism (COGA) datasets from the Genetic Analysis Workshop 14 (GAW14), we examined how altering the density of SNP marker sets impacted the overall information content, the power to detect trait loci, and the number of false positive results. For the simulated data we used SNP maps with density of 0.3 cM, 1 cM, 2 cM, and 3 cM. For the COGA data we combined the marker sets from Illumina and Affymetrix to create a map with average density of 0.25 cM and then, using a sub-sample of these markers, created maps with density of 0.3 cM, 0.6 cM, 1 cM, 2 cM, and 3 cM. For each marker set, multipoint linkage analysis using MERLIN was performed for both dominant and recessive traits derived from marker loci. Our results showed that information content increased with increased map density. For the homogeneous, completely penetrant traits we created, there was only a modest difference in ability to detect trait loci. Additionally, as map density increased there was only a slight increase in the number of false positive results when there was linkage disequilibrium (LD) between markers. The presence of LD between markers may have led to an increased number of false positive regions, but no clear relationship between regions of high LD and locations of false positive linkage signals was observed.
Covariate-based linkage analyses using a conditional logistic model as implemented in LODPAL can increase the power to detect linkage by minimizing disease heterogeneity. However, each additional covariate analyzed will increase the degrees of freedom for the linkage test, and therefore can also increase the type I error rate. Use of a propensity score (PS) has been shown to improve consistently the statistical power to detect linkage in simulation studies. Defined as the conditional probability of being affected given the observed covariate data, the PS collapses multiple covariates into a single variable. This study evaluates the performance of the PS to detect linkage evidence in a genome-wide linkage analysis of microsatellite marker data from the Collaborative Study on the Genetics of Alcoholism. Analytical methods included nonparametric linkage analysis without covariates, with one covariate at a time including multiple PS definitions, and with multiple covariates simultaneously that corresponded to the PS definitions. Several definitions of the PS were calculated, each with increasing number of covariates up to a maximum of five. To account for the potential inflation in the type I error rates, permutation based p-values were calculated.
Results suggest that the use of individual covariates may not necessarily increase the power to detect linkage. However, the use of a PS can lead to an increase when compared to using all covariates simultaneously. Specifically, PS3, which combines age at interview, sex, and smoking status, resulted in the greatest number of significant markers identified. All methods consistently identified several chromosomal regions as significant, including loci on chromosomes 2, 6, 7, and 12.
These results suggest that the use of a propensity score can increase the power to detect linkage for a complex disease such as alcoholism, especially when multiple important covariates can be used to predict risk and thereby minimize linkage heterogeneity. However, because the PS is calculated as a conditional probability of being affected, it does require the presence of observed covariate data on both affected and unaffected individuals, which may not always be available in real data sets.
We compared seven different tagging single-nucleotide polymorphism (SNP) programs in 10 regions with varied amounts of linkage disequilibrium (LD) and physical distance. We used the Collaborative Studies on the Genetics of Alcoholism dataset, part of the Genetic Analysis Workshop 14. We show that in regions with moderate to strong LD these programs are relatively consistent, despite different parameters and methods. In addition, we compared the selected SNPs in a multipoint linkage analysis for one region with strong LD. As the number of selected SNPs increased, the LOD score, mean information content, and type I error also increased.
The haplotypes of the X chromosome are accessible to direct count in males, whereas the diplotypes of the females may be inferred knowing the haplotype of their sons or fathers. Here, we investigated: 1) the possible large-scale haplotypic structure of the X chromosome in a Caucasian population sample, given the single-nucleotide polymorphism (SNP) maps and genotypes provided by Illumina and Affymetrix for Genetic Analysis Workshop 14, and 2) the performance of widely used programs in reconstructing haplotypes from population genotypic data, given their known distribution in a sample of unrelated individuals.
All possible unrelated mother-son pairs of Caucasian ancestry (N = 104) were selected from the 143 families of the Collaborative Study on the Genetics of Alcoholism pedigree files, and the diplotypes of the mothers were inferred from the X chromosomes of their sons. The marker set included 313 SNPs at an average density of 0.47 Mb. Linkage disequilibrium between pairs of markers was computed by the parameter D', whereas for measuring multilocus disequilibrium, we developed here an index called D*, and applied it to all possible sliding windows of 5 markers each. Results showed a complex pattern of haplotypic structure, with regions of low linkage disequilibrium separated by regions of high values of D*. The following programs were evaluated for their accuracy in inferring population haplotype frequencies: 1) ARLEQUIN 2.001; 2) PHASE 2.1.1; 3) SNPHAP 1.1; 4) HAPLOBLOCK 1.2; 5) HAPLOTYPER 1.0. Performances were evaluated by Pearson correlation (r) coefficient between the true and the inferred distribution of haplotype frequencies.
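Because male X-chromosome haplotypes can be counted directly, the pairwise parameter D' mentioned above can be computed from haplotype counts without phase estimation. The sketch below uses the standard definition of D' for two biallelic markers and returns its absolute value. The multilocus index D* is defined in the paper and is not reproduced here. The function name and data are hypothetical.

```python
def d_prime(haplotypes):
    """|D'| between two biallelic markers from directly observed haplotypes.

    haplotypes: list of (allele_at_marker_1, allele_at_marker_2), alleles coded 0/1
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at marker 1
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at marker 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("D' is undefined when a marker is monomorphic")
    return abs(d) / d_max

# Hypothetical haplotypes from ten males: the two markers are perfectly coupled.
coupled = [(1, 1)] * 5 + [(0, 0)] * 5
print(d_prime(coupled))
```

Normalizing by the maximum attainable value makes D' comparable across marker pairs with different allele frequencies, which is why it is often preferred for describing haplotype structure.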
The SNP haplotypic structure of the X chromosome is complex, with regions of high haplotype conservation interspersed among regions of higher haplotype diversity. All the tested programs were accurate (r = 1) in reconstructing the distribution of haplotype frequencies in regions with high D* values. However, only PHASE achieved a high correlation coefficient (r > 0.7) under low linkage disequilibrium.
Using the Genetic Analysis Workshop 13 simulated data set, we compared the technique of importance sampling to several other methods designed to adjust p-values for multiple testing: the Bonferroni correction, the method proposed by Feingold et al., and naïve Monte Carlo simulation. We performed affected sib-pair linkage analysis for each of the 100 replicates for each of five binary traits and adjusted the derived p-values using each of the correction methods. The type I error rates for each correction method and the ability of each of the methods to detect loci known to influence trait values were compared. All of the methods considered were conservative with respect to type I error, especially the Bonferroni method. The ability of these methods to detect trait loci was also low. However, this may be partially due to a limitation inherent in our binary trait definitions.
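Two of the corrections compared above, Bonferroni and naïve Monte Carlo simulation, can be sketched as follows. This is an illustration under a strong simplifying assumption, not the analysis performed in the study: the tests are treated as independent, so null p-values are drawn as independent uniforms, whereas linkage statistics along a chromosome are correlated. The function names are hypothetical, and importance sampling and the method of Feingold et al. are not implemented.

```python
import random


def bonferroni(p_min, n_tests):
    """Bonferroni-adjusted p-value for the smallest of n_tests p-values."""
    return min(1.0, p_min * n_tests)


def monte_carlo_adjusted(p_min, n_tests, n_sim=10000, seed=0):
    """Naive Monte Carlo adjusted p-value.

    Estimates the probability that the smallest of n_tests null p-values
    is at most p_min. ASSUMPTION: tests are independent, so each null
    p-value is an independent Uniform(0, 1) draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        sim_min = min(rng.random() for _ in range(n_tests))
        if sim_min <= p_min:
            hits += 1
    # Add-one estimator so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)
```

Under independence the exact adjusted value is 1 − (1 − p)^n, which the simulation approximates and which Bonferroni bounds from above; this is one reason the Bonferroni method tends to be the more conservative of the two.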
Despite intensive efforts, understanding of the genetic basis of familial prostate cancer remains largely incomplete. In a previous microsatellite-based linkage scan of 1233 prostate cancer (PC) families, we identified suggestive evidence for linkage (i.e., LOD ≥ 1.86) at 5q12, 15q11, 17q21, 22q12, and two loci on 8p, with additional regions implicated in subsets of families defined by age at diagnosis, disease aggressiveness, or number of affected members.
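To give a sense of scale for the LOD threshold quoted above, a LOD score can be converted to an approximate pointwise p-value. The sketch below assumes the usual large-sample approximation, in which 2·ln(10)·LOD follows a one-sided chi-square distribution with one degree of freedom under the null. That approximation, and the function name `lod_to_pointwise_p`, are assumptions made for illustration. The conversion is not taken from the study and does not apply directly to heterogeneity LOD (HLOD) scores.

```python
import math


def lod_to_pointwise_p(lod):
    """Approximate one-sided pointwise p-value for a LOD score.

    ASSUMPTION: 2 * ln(10) * LOD is asymptotically chi-square with 1 df
    under the null, and the test is one-sided (hence the factor 0.5).
    """
    chi2 = 2.0 * math.log(10.0) * lod
    # Survival function of chi-square(1 df) is erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))
```

Under these assumptions, `lod_to_pointwise_p(1.86)` falls a little below 0.002, and higher LOD scores map to correspondingly smaller pointwise p-values.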
To replicate these findings and increase linkage resolution, we used the Illumina 6000 SNP linkage panel to perform a genome-wide linkage scan of an independent set of 762 multiplex PC families, collected by 11 groups of the International Consortium for Prostate Cancer Genetics (ICPCG).
Of the regions identified previously, modest evidence of replication was observed only on the short arm of chromosome 8, where HLOD scores of 1.63 and 3.60 were observed in the complete set of families and families with young average age at diagnosis, respectively. The most significant linkage signals found in the complete set of families were observed across a broad, 37 cM interval on 4q13-25, with LOD scores ranging from 2.02 to 2.62, increasing to 4.50 in families with older average age at diagnosis. In families with multiple cases presenting with more aggressive disease, LOD scores over 3.0 were observed at 8q24 in the vicinity of previously identified common PC risk variants, as well as MYC, an important gene in PC biology.
These results will be useful in prioritizing future susceptibility gene discovery efforts in this common cancer.
Prostate cancer has a strong familial component, but uncovering the molecular basis of inherited susceptibility to this disease has been challenging. Recently, a rare, recurrent mutation (G84E) in HOXB13 was reported to be associated with prostate cancer risk. Confirmation and characterization of this finding are necessary to potentially translate this information to the clinic. To examine this finding in a large international sample of prostate cancer families, we genotyped this mutation and 14 other SNPs in or flanking HOXB13 in 2,443 prostate cancer families recruited by the International Consortium for Prostate Cancer Genetics (ICPCG). At least one mutation carrier was found in 112 prostate cancer families (4.6%), all of European descent. Within carrier families, the G84E mutation was more common in men with a diagnosis of prostate cancer (194 of 382, 51%) than in those without (42 of 137, 30%), P = 9.9 × 10−8 [odds ratio 4.42 (95% confidence interval 2.56–7.64)]. A family-based association test found G84E to be significantly over-transmitted from parents to affected offspring (P = 6.5 × 10−6). Analysis of markers flanking the G84E mutation indicates that it resides on the same haplotype in 95% of carriers, consistent with a founder effect. Clinical characteristics of cancers in mutation carriers included features of high-risk disease. These findings demonstrate that the HOXB13 G84E mutation is present in ~5% of prostate cancer families, predominantly of European descent, and confirm its association with prostate cancer risk. While future studies are needed to more fully define the clinical utility of this observation, this allele and others like it could form the basis for early, targeted screening of men at elevated risk for this common, clinically heterogeneous cancer.
In this study, we observed loss of heterozygosity (LOH) in human chromosomal region 6q25.1 in sporadic lung cancer patients. LOH was observed in 65% of the 26 lung tumors examined and was narrowed down to a 2.2-Mb region. Single-nucleotide polymorphism (SNP) analysis of genes located within this region identified a candidate gene, termed p34. This gene, also designated ZC3H12D, C6orf95, FLJ46041, or dJ281H8.1, carries an A/G nonsynonymous SNP at codon 106, which changes the amino acid from lysine to arginine. Nearly 73% of heterozygous lung cancer tissues with LOH and the A/G SNP also exhibited loss of the A allele. In vitro clonogenic and in vivo nude mouse studies showed that overexpression of the A allele exerts tumor suppressor function compared with the G allele. p34 is located within a recently mapped human lung cancer susceptibility locus, and association of the p34 A/G SNP with lung cancer was tested among the families used to map that locus. No significant association between the less frequent G allele and lung cancer susceptibility was found. Our results suggest that p34 may be a novel tumor suppressor gene involved in sporadic lung cancer, but it does not appear to be the familial lung cancer susceptibility gene linked to chromosomal region 6q23-25.