Amyotrophic lateral sclerosis/parkinsonism–dementia complex (ALS/PDC) is a fatal neurodegenerative disease found in the Chamorro people of Guam and other Pacific Island populations. The etiology is unknown, although both genetic and environmental factors appear important. To identify loci for ALS/PDC, we conducted both genome-wide linkage and association analyses, using approximately 400 microsatellite markers, in the largest sample assembled to date, comprising a nearly complete sample of all living and previously sampled deceased cases. A single, large, complex pedigree was ascertained from a village on Guam, with smaller families and a case–control sample ascertained from the rest of Guam by population-based neurological screening and archival review. We found significant evidence for two regions with novel ALS/PDC loci on chromosome 12 and supportive evidence for the involvement of the MAPT region on chromosome 17. D12S1617 on 12p gave the strongest evidence of linkage (maximum LOD score, Zmax = 4.03) in our initial scan, with additional support in the complete case–control sample in the form of evidence of allelic association at this marker and another nearby marker. D12S79 on 12q also provided significant evidence of linkage (Zmax = 3.14) with support from flanking markers. Our results suggest that ALS/PDC may be influenced by as many as three loci, while illustrating challenges that are intrinsic in genetic analyses of isolated populations, as well as analytical strategies that are useful in this context. Elucidation of the genetic basis of ALS/PDC should improve our understanding of related neurodegenerative disorders including Alzheimer disease, Parkinson disease, frontotemporal dementia and ALS.
Amyotrophic lateral sclerosis/parkinsonism-dementia complex [ALS/PDC (MIM 105500)] is a fatal neurodegenerative disease of unknown etiology found in the Chamorro people of Guam (1–3), a number of Japanese families located on the Kii peninsula of Japan (4,5) and the Auyu and Jakai people of New Guinea (6). The neuropathology in the Chamorro and Japanese subjects consists of neurofibrillary tangles (NFTs) in both the brain and spinal cord (7–10), with a higher NFT burden in regions corresponding to the respective clinical manifestations of ALS and/or PDC. Clinically, ALS in Chamorros is similar to ALS in other populations (11), and PDC shares features with Parkinson's disease (MIM 168600), but is also accompanied by early and progressive dementia (1,7). The frequency of Guam ALS and PDC was highest in the 1950s, with similar prevalences and annual incidences of about 140 and 50 per 100 000, respectively (2,12,13). In the past 40 years, the age-specific incidence of ALS has steadily declined, whereas the incidence of PDC appeared to decline more modestly until the 1980s and then to increase slightly (14,15). A recent survey identified a prevalence of PDC that is nearly the same as that of the 1950s, but with an increased age-at-onset and decreased annual incidence rate (16). The changing incidence rates and age-at-onset suggest the existence of a changing environmental factor. Cycad toxins, infectious agents and mineral deficiencies have been extensively investigated, but a causal role has not been established, suggesting a complex etiology (17–20).
A genetic basis for ALS/PDC is supported by evidence from both epidemiologic and genetic studies. Early reports noted familial clustering of ALS/PDC, with multiple cases of ALS, PDC, or both in the same individual, often occurring in relatives of probands and in successive generations (1–3). Segregation analyses supported a major gene model, consistent with autosomal dominant inheritance with reduced penetrance, possibly affected by environmental factors (21–23). A 40 year prospective study found significantly higher incidence rates of ALS/PDC in siblings and offspring of cases compared with the expected rates in the Chamorro population (24). Also, despite similar environments and lifestyles, the village of Umatac, which was relatively isolated until the 1950s, had a much higher annual incidence rate of ALS (approximately 250/100 000 in the 1950s) than any other Guam village (2,3,12,13,25). Finally, the population has the hallmarks of a genetic isolate: Chamorros are descendants of aboriginal Austronesian populations from Island Southeast Asia (26–28), with a population size of 40 000–100 000 in the Mariana Islands by the time of the Spanish explorations, a precipitous reduction in population size to approximately 1500 by the end of the 18th century as a consequence of this contact (29,30), and subsequent growth to a current population size on Guam of approximately 65 000.
Population isolates, such as the Chamorros of Guam, provide unique opportunities, but pose significant challenges for mapping complex traits (31–34). A population isolate arises from a small group of founders, and is consequently likely to be more genetically homogeneous than are outbred populations, with sometimes a single founder mutation identified in all affected individuals. The reduced genetic heterogeneity can improve the power to detect linkage (34–36) and diminish concerns about population stratification in case–control studies. Environmental factors also tend to be less heterogeneous among individuals because of shared environment and culture (32,34). Additionally, the availability of genealogical records sometimes enables the reconstruction of large pedigrees, which have higher power than small pedigrees for linkage analysis (37). However, a major disadvantage of population isolates is that the statistical methods designed for genetic analysis of outbred populations can be difficult to apply (31). The large and complex pedigrees found in population isolates often require pedigree simplification to make computations feasible, yet reducing pedigree complexity can result in loss of power (38,39), and breaking inbreeding loops can inflate the type I error rate (40,41). Furthermore, classic association tests, which rely on the independence of cases and controls, may not be valid in populations where randomly selected individuals are more likely to be related (42), particularly among cases (43). Finally, isolates often constitute unique populations with finite sample sizes, which can limit the power to study endemic diseases and the feasibility of replication studies (44).
Despite a few previous studies, a major locus for ALS/PDC has not been identified. A genome-wide association study of a small sample of 41 case and control subjects from Guam, followed by linkage analysis of interesting regions in five small families, identified 17 markers with moderate evidence for association with PDC, but no region with a maximum LOD score Zmax > 3 in linkage analysis (45). A single-marker Zmax = 1.8 was obtained on chromosome (chr) 20, with multipoint Zmax > 2 on chr 14 and 20. Case–control studies also demonstrated that variants in the microtubule-associated protein tau [MAPT (MIM 157140)] gene on chr 17 are associated with ALS, PDC and Guam dementia (46,47). These previous attempts to map ALS/PDC loci have been hindered by limited power because of small sample sizes and the lack of information from a genome linkage scan.
To identify loci for ALS/PDC of Guam, we conducted a genome scan in the largest sample of Chamorros studied to date, representing half a century of data collection (14,22,48) and a complete or nearly complete sample of all available cases, both living and deceased. We performed genome-wide linkage analysis with a large, complex pedigree from the village of Umatac, and with a number of small nuclear families from the rest of Guam. We also performed a genome-wide association study in a sample of cases and controls assembled from archival data and screening efforts targeting the entire current Chamorro population on the island of Guam, and the pedigree from Umatac. The late onset of the disease, the complexity of the Umatac pedigree structure and the population history of the Chamorro people all posed significant analytical challenges. We describe our approach to these challenges, along with findings that provide significant evidence for the existence of one or two major genes for ALS/PDC on chr 12 and additional support for the involvement of the MAPT region on chr 17.
We carried out a genome-wide linkage scan of an extended and complex Umatac kindred (U pedigree; Fig. 1) divided into three subcomponents (R, PQ and H), and of 11 simple-structure non-Umatac kindreds (S families; Table 1). The R and PQ subpedigrees were defined by the descendants of particular founder couples, and overlap; the H subpedigrees were defined by identifying simpler pedigrees on the basis of sampled individuals in the most recent generations, and do not overlap. We obtained significant evidence for linkage (Zmax > 3) in some of these subcomponents in the initial genome scan for markers on two regions of chr 12, with suggestive evidence for linkage (Zmax > 2) on chr 7 and 17 (Fig. 2, Table 2).
The strongest evidence for linkage with an individual marker was obtained for D12S1617 at 44 cM. In the R subpedigree, genome-wide significant evidence of linkage was obtained: Zmax = 4.03 at a recombination fraction, θ, of 0. The combined H + S (HS) data set also gave significant evidence of linkage to this same marker (Zmax = 3.1, θ = 0), with the H families contributing Zmax = 1.96 to this total; no other markers yielded Zmax > 3 in the HS data set. Empirical P-values for these LOD scores, including the effects of pedigree simplification, were <0.0001 for R (maximum LOD = 2.87 across 10 000 replicates) and 0.0004 for HS. The PQ subpedigree also provided suggestive evidence in favor of linkage to D12S1617 (Zmax = 1.3, θ = 0). Results for D12S1617 were sensitive to the choice of the sample used to specify marker allele frequencies, as would be expected if sampling is through related cases with a disease locus close to the marker: when frequencies were derived from external cases or the U pedigree, Zmax was reduced by 29 or 89%, respectively, compared with the use of frequencies from unrelated controls. Similar patterns of reduced LOD scores using case allele frequencies were found for the H and PQ subpedigrees. Further evaluation of the evidence for the association with this marker is presented later.
A second region on chr 12q at 125 cM also gave a significant LOD score for D12S79 in the genome scan (Table 2). The PQ component of the U pedigree was the major contributor to this signal, yielding a significant Zmax = 3.14 at θ = 0, with support (Zmax > 1) from markers 9–14 cM on either side (Supplementary Material, Table S1). The combined HS data set also gave supportive evidence of linkage to D12S79 (Zmax = 2.12, θ = 0), with equal contributions to the total from the H and S data sets. The R subpedigree was essentially uninformative for D12S79 (Zmax = 0.2, θ = 0.2), but provided modest support to this region from nearby marker D12S78 (Zmax = 1.58). Computations with the PQ component were too slow (Table 3) to allow practical computation of an empirical P-value.
Sensitivity analyses suggested that breaking loops did not substantially alter the interpretation of the results (Table 3). Exact single-marker computations with five loops were extremely slow, taking over four CPU-months, whereas analyses based on Markov chain Monte Carlo (MCMC) methods, while still demanding (Table 3), remained computationally feasible for primary analyses and produced stable estimates for as many as seven loops. For D12S1617 in the R subpedigree, breaking connections led to only slight inflation of the LOD score in all but the deepest portion of the pedigree. For D12S79 in the PQ subpedigree, breaking loops resulted in modest LOD score increase or decrease depending on the particular individual and number of loops involved. For example, reducing the number of loops from five to four decreased Zmax at D12S79 from 3.05 to 2.67, whereas breaking four loops increased Zmax from 2.91 to 3.14 compared with the intact seven-loop PQ subpedigree.
Suggestive LOD scores were also found near MAPT on chr 17 (Fig. 3), and for two markers on chr 7 (Table 2; Supplementary Material, Table S1). For D17S787 (75 cM), located ~9 Mb distal to MAPT, all data sets showed consistent but modest evidence of linkage, with Zmax at θ = 0 ranging from 1.4 in the R subpedigree to 2.3 in the HS data set. For D17S1868, located ~3 Mb distal to MAPT, the PQ subpedigree provided weak evidence in favor of linkage (Zmax = 1, θ = 0.05), while the R and HS data sets were largely uninformative. On chr 7, marker D7S502 (79 cM) showed evidence in favor of linkage in the H (Zmax = 2.2, θ = 0) and R (Zmax = 2.0, θ = 0) subpedigrees, while the PQ and S data sets were essentially uninformative at this marker. For D7S510 (60 cM), only the R subpedigree provided support for linkage (Zmax = 2.3, θ = 0).
Separate linkage analyses with ALS and PDC as distinct traits gave results that were consistent with a single disease. The two phenotypes mapped to the same regions on chr 12 and 17 and did not separately explain the two distinct peaks on chr 12. No Zmax > 3 was found in genome scans of ALS or PDC, separately, and higher Zmax values in the regions of interest were obtained by combining ALS and PDC in the trait model. Evidence in favor of linkage was obtained in all complex subpedigrees and in the S data set at markers D12S1617 (44 cM), D12S79 (125 cM) and D17S787 (~10 cM distal to MAPT) for both ALS and PDC, analyzed as distinct traits.
Initial genome-wide tests of association resulted in a markedly non-uniform P-value distribution from the analysis of both the full sample and the sample of all unrelated cases and controls (Fig. 4). While the existence of relatives in the Umatac sample was known, these results for the non-Umatac subjects suggested violation of analysis assumptions. Deviation from Hardy–Weinberg equilibrium (HWE) was unlikely to explain the aberrant distribution because there was little evidence of departure from HWE among cases or controls (not shown). Further investigation identified cryptic relatedness as the cause of the aberrant P-value distribution.
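As a rough illustration of how a non-uniform P-value distribution can be checked, the following sketch pairs each observed P-value with the quantile expected under a uniform null, on the -log10 scale commonly used for quantile-quantile plots. The function name `qq_points` and the plotting position i/(m + 1) are assumptions made for illustration and are not necessarily what was used to produce Fig. 4.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10 P-values.

    Under the null hypothesis, m independent P-values are uniform on (0, 1),
    so the i-th smallest is expected near i / (m + 1). Points that rise
    above the diagonal indicate an excess of small P-values.
    """
    if any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")
    m = len(pvalues)
    observed = sorted(pvalues)
    return [
        (-math.log10(i / (m + 1)), -math.log10(p))
        for i, p in enumerate(observed, start=1)
    ]
```

A systematic departure from the diagonal across the whole range, rather than only in the extreme tail, is the pattern that usually points to a violated assumption such as non-independence of subjects, rather than to a few true associations.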
Using the genome-wide marker data, we identified clusters of related individuals (Fig. 5A) who had a strong effect on the performance of the association testing. Inferred pairwise relationships and estimated k-statistics used to estimate kinship coefficients demonstrate that a substantial majority (72.3%) of putatively unrelated pairs of individuals had estimated k1 > 5% and therefore showed evidence of cryptic relatedness. In addition, the proportion of the genome shared identical by descent was cumulatively higher between pairs of unrelated cases than pairs of unrelated controls, as would be expected for a genetic disease (Fig. 5B) (43), thereby providing further support for an underlying genetic basis.
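The relationship between pairwise sharing probabilities and kinship can be sketched as follows, assuming that the k-statistics denote the usual probabilities k0, k1 and k2 that a pair of individuals shares 0, 1 or 2 alleles identical by descent at a locus. The function names are hypothetical. The 5% cut-off on k1 is taken from the text above, and the estimation of the k-statistics from marker data is not shown.

```python
def kinship_from_ibd(k0, k1, k2):
    """Kinship coefficient from IBD sharing probabilities: k1/4 + k2/2."""
    if min(k0, k1, k2) < 0 or abs(k0 + k1 + k2 - 1.0) > 1e-6:
        raise ValueError("k0, k1, k2 must be non-negative and sum to 1")
    return k1 / 4.0 + k2 / 2.0


def ibd_proportion(k1, k2):
    """Expected proportion of the genome shared identical by descent."""
    return k1 / 2.0 + k2


def shows_cryptic_relatedness(k1, threshold=0.05):
    """Flag a putatively unrelated pair whose estimated k1 exceeds the cut-off."""
    return k1 > threshold
```

For reference, the expected values of (k0, k1, k2) are (0.25, 0.5, 0.25) for full siblings and (0.75, 0.25, 0) for first cousins, giving kinship coefficients of 0.25 and 0.0625, respectively.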
Three markers on chr 12 continued to show evidence for the association with ALS/PDC (Table 4) after we developed and applied a test that allowed the use of the entire sample for such analysis. The strongest evidence for association was obtained for D12S1617. Evidence for association in the entire sample was suggestive based on a relationship-corrected χ2 test (49) (P = 0.097, 5th percentile of tests), but became nominally significant (q = 0.0058) when results from the genome-wide linkage scan were incorporated as weights in the computation of the false-discovery rate (FDR). For D12S1617, alleles 252 and 258 had consistently higher frequencies in cases compared with controls across samples (Table 5), with allele 252 showing an increased allele frequency in the U pedigree and allele 258 showing an increased frequency in the non-Umatac samples. Subsequent follow-up genotyping (see what follows) also identified evidence for the association with nearby marker D12S1048, providing evidence for association from the relationship-corrected χ2 test (P = 0.055, 3rd percentile of tests), and a significant FDR (q = 0.015). Finally, D12S79 gave a nominally significant FDR of 0.05.
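The idea of using linkage results as weights in an FDR computation can be sketched with a generic weighted Benjamini-Hochberg procedure. This is an illustration under stated assumptions, not a reconstruction of the method of reference (65): each P-value is divided by a weight rescaled to average 1, and ordinary step-up q-values are computed from the result. How weights were derived from LOD scores in this study is not specified here, so the function name `weighted_bh_qvalues` and any weights passed to it are hypothetical.

```python
def weighted_bh_qvalues(pvalues, weights):
    """Weighted Benjamini-Hochberg q-values (generic sketch).

    Weights must be positive; they are rescaled to have mean 1 so that
    up-weighting some tests is paid for by down-weighting others.
    """
    m = len(pvalues)
    if m == 0:
        return []
    if len(weights) != m or any(w <= 0 for w in weights):
        raise ValueError("need one positive weight per P-value")

    mean_weight = sum(weights) / m
    adjusted = [min(1.0, p * mean_weight / w) for p, w in zip(pvalues, weights)]

    # Step-up procedure: walk from the largest adjusted P-value downward,
    # carrying the running minimum of m * p_(rank) / rank.
    order = sorted(range(m), key=lambda i: adjusted[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, adjusted[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues
```

With equal weights this reduces to the standard Benjamini-Hochberg procedure. A marker with a large weight, for example one lying in a region with linkage support, can obtain a q-value much smaller than its unweighted P-value alone would suggest, which is the qualitative behavior described above for D12S1617.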
As is typical of follow-up markers, heterozygosity (h) was lower than that of the genome scan markers. For all of the six follow-up markers, h was at or below the median (h = 0.79) of a typical Caucasian sample. Although the overall heterozygosity of the genome scan markers in this sample was also less than in a Caucasian sample—the median h = 0.75 in this sample is the lowest quartile in a Caucasian sample—five of the six follow-up markers had h below the median in this sample, and four of the six markers had h < 0.69, the lowest quartile in the genome scan markers in this sample.
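For readers unfamiliar with the heterozygosity measure used above, a minimal sketch follows. It assumes the standard expected heterozygosity h = 1 - Σ p_i², computed from allele counts without a small-sample correction. The function name and the example counts are hypothetical; only the allele labels 252 and 258 come from the text.

```python
def expected_heterozygosity(allele_counts):
    """Expected heterozygosity h = 1 - sum(p_i^2) from a dict of allele counts."""
    total = sum(allele_counts.values())
    if total <= 0:
        raise ValueError("need at least one observed allele")
    return 1.0 - sum((count / total) ** 2 for count in allele_counts.values())


# Hypothetical counts: four equally frequent alleles give h = 0.75,
# while two equally frequent alleles give h = 0.5.
four_alleles = {"252": 10, "254": 10, "256": 10, "258": 10}
two_alleles = {"252": 10, "258": 10}
```

Because h rises with the number of alleles and with the evenness of their frequencies, markers with lower h distinguish fewer parental haplotypes and therefore carry less linkage information per meiosis.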
Multipoint linkage analyses with consecutive pairs of markers supported evidence of linkage to both regions on chr 12. Pairwise linkage analysis of markers near D12S1617 had little effect on evidence for linkage to this region (Table 6). LOD scores remained strong for two-marker linkage analyses on 12p that involved D12S1617 and the closest markers on either side: Zmax = 2.66 and 2.48 in the HS and R pedigrees, respectively, for D12S1617-D12S1596, and Zmax = 2.45 in the R pedigree for D12S1057-D12S1617, with the LOD score maximizing at the position of D12S1617 in both cases. These results are consistent with expectations from multipoint analysis with less informative flanking markers. Multipoint analysis also supported evidence of linkage to 12q, but localization of the linkage peak was slightly more variable: two-marker analysis with markers immediately surrounding D12S79 was strongest for markers D12S79-D12S86, with Zmax = 2.94 and 1.55 for the PQ and HS pedigrees, respectively, maximizing ~3 cM distal to D12S79, whereas LOD score curves in analyses of D12S1646-D12S79 were lower and maximized >0.7 cM proximal to D12S79. The R subpedigree was essentially uninformative for two-marker analyses of 12q, whereas the PQ subpedigree was uninformative for two-marker analyses of 12p. Analysis of marker–marker recombination patterns identified some evidence for marker-map inconsistencies, consistent with possible distortion of the genetic map in the sample, which could be caused by undetected genotyping error or by a biological phenomenon such as an inversion segregating in the families (Supplementary Material).
ALS/PDC of Guam is a complex neurodegenerative trait that disproportionately affects the Chamorro people. We report significant evidence for two distinct regions with novel ALS/PDC loci on chr 12 on the basis of both genome-wide linkage analysis of a large complex pedigree and a series of smaller families, and association analysis of a population-based sample of cases and controls. We also obtained suggestive evidence of linkage to a region on chr 17 that includes MAPT. This signal on chr 17 is consistent with our previous work that showed that two sites, one in MAPT and one immediately upstream of MAPT, have independent high-risk alleles that are associated with Guam disease (47). Our study shows that ALS/PDC has a complex etiology, possibly involving multiple genes, even in this isolated population, and illustrates both the power of such samples and some of the challenges of genome scans in such ethnic isolates.
Statistically significant evidence for a novel locus for ALS/PDC on chr 12p was supported by both linkage and association analyses. Evidence for linkage to D12S1617, located at 44 cM on the Marshfield map, is robust, with evidence for linkage in all subpedigrees extracted from the large, complex U pedigree, from an independent collection of smaller families from non-Umatac villages, and under a variety of loop-breaking and intact pedigree computations. Support for a risk locus on chr 12p also derives from the combination of insensitivity of the results to most of the compromises needed to analyze the complex U pedigree coupled with positive results from information provided by both linkage and association analyses. The one exception to the robust results is the sensitivity to marker allele frequency, which was present in the genome scan for this marker only. We also found evidence of multiple disease mutation events or, more likely, either mutation at the marker or historical recombination events between the D12S1617 marker and the trait locus, since the Umatac and non-Umatac pedigrees appeared to be segregating two different alleles for D12S1617, with these alleles having inflated frequencies in the pedigrees relative to a population-based sample of controls. The introduction, through recombination, of a second associated marker allele is plausible in a rapidly growing population when the disease mutation is a few centuries old. Such situations have been observed for other disorders with <1 cM between the marker and trait locus (50). However, the existence of multiple haplotypes harboring disease mutations can substantially reduce the power of genetic studies, even in genetically homogeneous populations, when the joining connections between the alternative disease haplotypes are missing, as may be the case in this sample (38,39).
Evidence for linkage to 12q was also statistically significant. The support for linkage to D12S79, located at 125.3 cM, was modestly more sensitive to analysis conditions than for 12p. Results for 12q were less consistent across different components of the large complex pedigree than were the equivalent analyses for the 12p region, although the independent series of small families provided nearly equivalent support for linkage to both regions. In addition, there was modest evidence of linkage disequilibrium with markers in this region as measured by the FDR, and there was little sensitivity to the choice of marker allele frequency. The lack of consistency among overlapping complex pedigree components can be explained by missing key ancestral joining components, chance differences in marker information in the separate components or the effects of loop-breaking, which could have caused either inflated or deflated LOD scores.
There are several possible explanations for the presence of two regions with evidence of linkage on chr 12. The first hypothesis is the existence of two risk loci. This would be consistent with the complex etiology of ALS/PDC previously suggested by traditional segregation analyses (21–23). A second hypothesis is that there is only a single risk locus on chr 12, with the second signal representing a false-positive result caused by misspecification of some aspect of the analysis model or the data. This includes marker allele frequency misspecification because of stochastic events such as genetic drift, or the possibility of an undetected chromosomal rearrangement, which would distort the map. Inversions exist, including one that was recently reported in a single individual (51) with breakpoints located near the two markers with strongest evidence for linkage in our sample. However, we were unable to obtain a high-quality karyotype required to determine whether there was cytological evidence for an inversion. Our marker–marker linkage analyses gave some evidence of map discrepancies that could indicate such an event in our data, although again the complexity of the pedigree limited our ability to carry out the computations needed to fully use the available information. Finally, the need for pedigree simplification could have had different effects at different markers, perhaps producing two apparent peaks. Although analysis of the U pedigree in its entirety could help resolve localization of the trait locus, such an analysis is currently infeasible with any existing software.
Two of the chromosomal regions implicated harbor known loci for related neurodegenerative diseases. Mutations in LRRK2 [leucine-rich repeat kinase 2 (MIM 609007)] on 12q12 cause autosomal dominant parkinsonism (52,53) and sporadic Parkinson disease (54,55). D12S1617, the marker in our initial analyses for which we found the strongest evidence of linkage, is located only ~12 cM away from LRRK2. D12S1048, which showed significant evidence of association and weak evidence of linkage, is located ~263 kb closer to LRRK2, making LRRK2 a possible candidate gene for ALS/PDC of Guam. The peripherin gene [PRPH (MIM 170710)] in this region is also implicated as a gene for ALS in humans (56,57) and in mice (58). Finally, there are reports of late-onset familial Alzheimer disease mapping to chr 12p (59,60) in a region that spans that implicated here, suggesting that this region may be important for tauopathies. To our knowledge, there are no known loci for disorders related to ALS/PDC in the 12q region of interest. Mutations in MAPT and nearby progranulin [PGRN (MIM 138945)] on chr 17q21 have been associated with frontotemporal dementia and parkinsonism (61–63), and SNPs in this region show association with ALS/PDC of Guam (47). However, we sequenced the PGRN gene in Guam ALS and PDC subjects and the results were negative (data not shown). Although our results on 17q place the putative risk locus several centimorgans from MAPT and our results on 12p are several centimorgans from LRRK2, localization of the position of the Zmax in these data may be complicated by misspecification of unknown aspects of the data or analysis model. Therefore, both signals, on 12p and chr 17, are consistent with previous studies and lend further support to a role for LRRK2/PRPH and MAPT in mediating genetic susceptibility to ALS/PDC.
Two of our results underscore the challenges that isolated populations pose in the search for the genetic basis of complex traits. First, the highly complex pedigree structures that are typical in such populations introduce computational limitations into the analysis. Despite the intrinsic information in such samples, necessary pedigree reduction may result in either a loss of power or inflated evidence for linkage, complicating the detection and interpretation of linkage signals, as we have shown. Similarly, the number of markers that can be incorporated into multipoint analysis is limited by computational feasibility, thereby restricting potential gains in linkage information and resolution from follow-up markers, which also tend to be less informative than are the genome scan markers. This computational problem is particularly challenging in studies that use SNPs instead of multiallelic markers, because of the importance, then, of multipoint analysis for obtaining high linkage information. Multiallelic markers, therefore, remain most useful in such contexts, although emerging sequencing technologies may eventually provide alternatives through phased haplotypes that avoid the computational burden of unphased SNPs. Second, association studies in isolated populations are complicated by the difficulty of obtaining a truly unrelated sample of cases and controls, so that traditional association tests that assume independence among observations are invalid in this context. While the use of genomic controls (42,64) has been proposed to account for such cryptic relatedness, evaluation of this approach has been limited to simulated genome-wide marker data with uniform numbers of alleles and allele frequencies.
We found that quantile–quantile (qq) plots of genome-wide P-values were highly useful for diagnosing cryptic relatedness in our sample, and that the estimation of k coefficients along with statistical inference of relationships provided a practical method of identifying and correcting for cryptic relationships (49). We feel that adjustment for cryptic relationships, rather than removal of samples, is likely to be preferable in the quest to maintain high power, while controlling for type I error, in studies like this one with a finite available sample size. Finally, to maximize the use of scarce samples, there may be advantages, in such populations, to combining information from both linkage and association analyses, with methods such as the weighted false discovery rate (65).
Our results suggest that ALS/PDC of Guam may be influenced by as many as three loci: two on chr 12 and one on chr 17. Other previously implicated genes or regions such as variants in TRPM7 on chr 15 (66) and regions on chr 14 and chr 20 (45) are not supported by our current results. A genome scan also cannot address the effects of possible somatic or mitochondrial mutations (67), which therefore remain possible risk modifiers. Given the Umatac pedigree structure, the evidence is most consistent with additive genetic effects on the risk scale, since a requirement for joint genotypes at multiple loci would not produce the multigenerational patterns seen in the Umatac pedigree, nor would it be likely that significant evidence for linkage would be obtained with the single-locus model-based approach used here. Additional work is needed to elucidate this genetic basis, including using dense markers in the regions of interest. Ultimately, although the use of a genetic isolate such as the population here is not a panacea for dissecting the genetic basis of complex traits, the eventual elucidation of ALS/PDC mutation(s) in this population could improve our understanding of not only ALS/PDC, but also related neurodegenerative diseases such as Alzheimer disease, Parkinson disease, frontotemporal dementia, ALS and other dementias.
Cases, controls and small families were ascertained from the entire Chamorro population of Guam, whereas a single, large, complex pedigree (described in what follows) was ascertained and followed over a half century from the village of Umatac. We believe that virtually all living Chamorro cases were identified and recruited into the study, with as many additional archival samples tracked down and obtained as was possible to augment the sample size, including samples that had been collected as long as 50 years ago. Subject recruitment and diagnostic assessment procedures have previously been described (16,47). Briefly, archival clinical data and DNA samples were obtained from the National Institute of Neurological and Communicative Disorders and Stroke Intramural Research program in Guam (1956–1983), and additional subjects were recruited through a neurological screening program (1995–2006) targeting all Chamorros ≥65 years on Guam (48) and through referral from or collaboration with additional investigators elsewhere (14,16). Cases who reported a family history of ALS or PDC were later recruited into the family study. All subjects (or proxies in the case of demented patients) provided written informed consent to participate in the study, and the study was approved by Institutional Review Boards at the University of Guam, University of California-San Diego and University of Washington. The diagnosis of PDC required insidious onset and gradual progression of primary parkinsonism and dementia, either of which could be the initial feature. Controls were cognitively and neurologically normal. For analysis purposes, individuals were considered to have an affected, unaffected or unknown phenotype. Affected individuals had a diagnosis of ALS and/or PDC. Unaffected individuals did not have a neurological or neuropsychological disease at last follow-up.
Individuals with unknown phenotype had no disease information or had a diagnosis of a condition with unknown relationship to ALS/PDC, i.e. Alzheimer disease (MIM 104300) or dementia.
In 1962, all adults over age 20 living in Umatac had a neurologic examination, and pedigrees were assembled (25). The village of Umatac was chosen because it had the highest prevalence of ALS/PDC and its population was descended from a small number of families; thus ALS/PDC in this village could be the result of a founder mutation. Most Chamorros in Umatac were part of a single extended pedigree of more than 1100 individuals with 262 sibships, reaching back eight generations to seven founder couples. More recently, all identifiable additional cases of ALS/PDC related to this pedigree were ascertained (46) and DNA samples were obtained from both affected and unaffected consenting individuals in the most recent generations.
We identified 11 small families, referred to as the S data set, from a population-based sample of cases not living in Umatac. These families were identified from ongoing family recruitment efforts, from follow-up of descendants of an original case–control study carried out >30 years ago (13,24), and as a result of quality control analyses using the available genome scan data on the case–control sample, described below. The 11 families consisted of 6 affected sibships, 3 affected half-sibships and 2 three-generation families. Genealogical records confirmed the accuracy of the inferred relationships where such information was available.
The complete Umatac pedigree has many inbreeding loops and cannot be analyzed intact by any currently available software. A necessary and relatively efficient strategy is therefore to perform an initial genome scan on simplified versions of the pedigree. Although there have been multiple suggestions for heuristic approaches to pedigree simplification (68–70), none has been shown to produce optimal results by comparison with analysis that uses the complete pedigree structure. Our goal was therefore to eliminate sufficient loops such that each resulting pedigree component could be analyzed for a full genome scan, one marker at a time, in <2 weeks of CPU time, while retaining as much pedigree information as possible to both maximize power (38,39,70) and minimize type I errors (40,41) induced by simplification. Follow-up analyses were then performed in regions of interest, using more complex pedigree structures and multiple markers to more fully utilize the available linkage information, as well as to evaluate sensitivity to the choice of model parameters. The computational demands limited the number of such follow-up studies that were feasible, especially on more complex pedigrees and with multiple markers, and prevented the use of simulation-based approaches for obtaining empirical significance levels or estimates of power across the full range of pedigree simplifications used.
The most informative subset of the complete complex pedigree was extracted by selecting all sampled affected individuals, their sampled siblings and descendants and all ancestors needed to join these individuals. This resulted in a single pedigree, referred to as the U pedigree, with 159 members including 61 phenotyped individuals (48 affected and 13 unaffected) and 35 genotyped individuals (20 affected, 13 unaffected, 2 unknown phenotype). The structure of the U pedigree was highly complex, consisting of 5 of the original 7 founder couples in the complete Umatac pedigree, 8 generations and 24 loops (Fig. 1). We took two pragmatic approaches towards pedigree simplification. Both approaches resulted in simplified pedigrees whose mean pairwise kinship coefficients were higher within pedigrees than between them, a feature that is key in such simplifications for retaining mapping power while avoiding false-positive results (69,70).
Our first approach was to eliminate the deepest ancestral generations because they introduce the greatest analytic complexity and have the smallest effect when removed (38,39,70). The advantage of this approach was that nearly all the genotyped individuals, who are in the most recent generations, could be preserved for analysis. Eliminating the founders and one successive generation, and pruning additional uninformative individuals, resulted in the H data set, comprising 4 disjoint pedigree components (H1–H4) ranging in size and complexity from 7 members and no loops to 66 members and 2 loops (Table 1). The independent H and S data sets were also combined (HS data set) for analysis. The HS data set included all but three of the available genotyped affected family members. Mean kinship coefficients within the four H pedigrees ranged from 0.038 to 0.238, whereas mean kinship coefficients between members of each H pedigree and members of the other three H pedigrees were much lower, ranging from 0.004 to 0.011.
The second approach was to select a founder pair and all of their descendants in order to preserve information regarding a potential shared ancestral mutation. We eliminated a few additional connections, primarily in the most ancestral portion of the pedigree, to reduce the number of loops as needed to achieve computationally practical pedigrees. The R, P and Q subpedigrees are the three largest resulting subpedigrees (Fig. 1), but they are not disjoint components, so the sum of their memberships exceeds the total size of the U pedigree. The R subpedigree is the largest subpedigree, descended from a single pair of founders with 93 members and 7 loops. We broke five loops and retained two loops in the R subpedigree to make genome-wide linkage analysis feasible. The P and Q subpedigrees are individually smaller than the R subpedigree and were combined (PQ subpedigree) for analysis, yielding a pedigree with 107 members and 7 loops. Four loops were broken, and three were retained for the genome scan with the PQ subpedigree. Mean kinship coefficients within the R and PQ pedigrees were 0.035 and 0.027, respectively, whereas the mean kinship coefficients between members of each of these pedigrees and the rest of the U pedigree were 0.009 and 0.013, respectively.
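The comparison of mean kinship coefficients within and between simplified pedigrees can be illustrated with a minimal sketch. It uses the standard recursive definition of the kinship coefficient on a small hypothetical pedigree; the function names, the toy pedigree and the assumption that each individual has either both parents recorded or none are ours and do not describe the software used in this study.

```python
from functools import lru_cache
from itertools import combinations


def make_kinship(parents):
    """Build a kinship function for a pedigree.

    parents maps each individual to (father, mother); founders map to (None, None).
    Assumption: every individual has either both parents recorded or neither.
    """

    @lru_cache(maxsize=None)
    def depth(ind):
        # Generation depth: founders are 0, children are one below their deepest parent.
        father, mother = parents[ind]
        return 0 if father is None else 1 + max(depth(father), depth(mother))

    @lru_cache(maxsize=None)
    def phi(a, b):
        if a is None or b is None:
            return 0.0
        if a == b:
            father, mother = parents[a]
            return 0.5 * (1.0 + phi(father, mother))
        # Recurse through the deeper individual, who cannot be an ancestor of the other.
        if depth(a) < depth(b):
            a, b = b, a
        father, mother = parents[a]
        if father is None:
            return 0.0  # two distinct founders are assumed unrelated
        return 0.5 * (phi(father, b) + phi(mother, b))

    return phi


def mean_kinship(phi, group_a, group_b=None):
    """Mean pairwise kinship within group_a, or between group_a and group_b."""
    if group_b is None:
        pairs = list(combinations(group_a, 2))
    else:
        pairs = [(a, b) for a in group_a for b in group_b]
    return sum(phi(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy pedigree: two unrelated nuclear families.
example_parents = {
    "F1": (None, None), "M1": (None, None),
    "F2": (None, None), "M2": (None, None),
    "S1": ("F1", "M1"), "S2": ("F1", "M1"),
    "T1": ("F2", "M2"), "T2": ("F2", "M2"),
}
example_phi = make_kinship(example_parents)
within = mean_kinship(example_phi, ["S1", "S2"])
between = mean_kinship(example_phi, ["S1", "S2"], ["T1", "T2"])
print(within, between)
```

In this toy example the two sibships play the role of two simplified pedigree components: the within-component mean is that of full siblings, and the between-component mean is zero because the founders are assumed unrelated.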
We assumed an autosomal dominant mode of inheritance, disease allele frequency of 1%, age-dependent penetrances and a constant sporadic disease rate of 0.005 for the low-risk genotype in parametric linkage analysis. This model was supported by previous segregation analyses (21,25), by the transmission across multiple generations for what is ordinarily a rare disease and by confinement of affected individuals to a relatively small component of the entire extended Umatac pedigree. Penetrances for the high-risk genotypes in six liability classes (≤30, 31–40, 41–50, 51–60, 61–70 and >70 years) were calculated using a cumulative normal distribution in unaffected individuals and normal probability density function in affected individuals (71), with a mean of 49.8 years and standard deviation of 10.1 years. These parameters were obtained from the 61 affected individuals with available age-of-onset data in the full 1169-member Umatac pedigree. Individuals with missing age and unknown disease status were assigned to the lowest liability class, with a penetrance of 0.007 for disease-allele carriers. All 13 unaffected individuals had age data; there were 20/48 affected individuals with missing age at onset who were assigned to the 41–50 liability class with a penetrance of 0.247.
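As a rough illustration of the age-dependent penetrance model, the sketch below evaluates a normal onset distribution with the mean and standard deviation given above. The cumulative distribution gives the probability that a carrier has developed disease by a given age, and the density gives the relative likelihood of onset at a given age. The representative age assigned to each liability class is a hypothetical choice of ours (the text does not state which ages were used), so the values produced need not match the penetrances reported here.

```python
import math

ONSET_MEAN = 49.8  # years, from the text
ONSET_SD = 10.1    # years, from the text


def normal_cdf(x, mean=ONSET_MEAN, sd=ONSET_SD):
    """Cumulative normal distribution: P(onset age <= x) for a carrier."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def normal_pdf(x, mean=ONSET_MEAN, sd=ONSET_SD):
    """Normal density of onset age at x, used for affected individuals."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical representative ages for the six liability classes (our assumption).
REPRESENTATIVE_AGE = {
    "<=30": 25.0,
    "31-40": 35.5,
    "41-50": 45.5,
    "51-60": 55.5,
    "61-70": 65.5,
    ">70": 75.0,
}

for label, age in REPRESENTATIVE_AGE.items():
    # Cumulative probability of onset by this age, and onset density at this age.
    print(label, round(normal_cdf(age), 3), round(normal_pdf(age), 4))
```

The cumulative values increase monotonically across liability classes, which is the property that lets unaffected older carriers count more strongly against linkage than unaffected younger ones.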
Marker allele frequencies were estimated from an external sample of 88 Chamorro controls for use in linkage analyses of U pedigree components, and from the full population-based sample of 236 Chamorros (see case–control sample) in linkage analyses of the S families. Marker alleles present in the U pedigree but not the controls were assigned the U pedigree allele frequency, and frequencies were scaled to sum to 1. For multipoint analysis, the Haldane map function was used, but all results are presented on a Kosambi map obtained from the Marshfield map (http://research.marshfieldclinic.org/genetics/home/index.asp) (72).
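The handling of alleles seen in the pedigree but not in the controls can be sketched as follows. This is our reading of the procedure, with hypothetical function and variable names: alleles absent from the control sample take their pedigree frequency, and the combined frequencies are then rescaled to sum to 1.

```python
def merge_allele_frequencies(control_freqs, pedigree_freqs):
    """Combine control and pedigree allele frequencies for one marker.

    Alleles missing from the controls (or with zero control frequency) are
    assigned their pedigree frequency; the result is rescaled to sum to 1.
    """
    merged = {a: f for a, f in control_freqs.items() if f > 0}
    for allele, freq in pedigree_freqs.items():
        if allele not in merged:
            merged[allele] = freq
    total = sum(merged.values())
    return {allele: freq / total for allele, freq in merged.items()}


# Hypothetical marker: allele "C" is seen only in the pedigree.
controls = {"A": 0.6, "B": 0.4}
pedigree = {"A": 0.5, "B": 0.25, "C": 0.25}
combined = merge_allele_frequencies(controls, pedigree)
print(combined)
```

Rescaling slightly lowers the frequencies of the alleles observed in controls, which avoids assigning a frequency of zero to an allele that is known to segregate in the pedigree.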
The initial genome scan was performed using pairwise parametric LOD score analysis with FASTLINK version 4.1P (73). Single-marker analysis has the important advantages of being robust to marker map misspecification and trait model misspecification for linkage detection (74,75) even when the disease is influenced by more than one locus (76). It was also computationally much more feasible on a genome-wide scale in these data.
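For readers unfamiliar with LOD scores, the following sketch computes a two-point LOD score and its maximum (Zmax) in the simplest textbook setting: phase-known, fully informative meioses scored as recombinant or non-recombinant. This is a deliberate simplification and an assumption of ours; the likelihoods evaluated by FASTLINK on looped pedigrees with age-dependent penetrances are far more involved.

```python
import math


def two_point_lod(recombinants, meioses, theta):
    """LOD score log10[L(theta) / L(0.5)] for phase-known informative meioses."""
    nonrecombinants = meioses - recombinants
    if theta == 0.0 and recombinants > 0:
        return float("-inf")  # any recombinant excludes theta = 0
    log_likelihood = 0.0
    if recombinants:
        log_likelihood += recombinants * math.log10(theta)
    if nonrecombinants:
        log_likelihood += nonrecombinants * math.log10(1.0 - theta)
    return log_likelihood - meioses * math.log10(0.5)


def max_lod(recombinants, meioses, thetas=None):
    """Return (Zmax, theta) maximizing the LOD score over a grid of theta values."""
    if thetas is None:
        thetas = [i / 100 for i in range(0, 51)]  # 0.00, 0.01, ..., 0.50
    return max((two_point_lod(recombinants, meioses, t), t) for t in thetas)


zmax, theta_hat = max_lod(recombinants=1, meioses=10)
print(round(zmax, 2), theta_hat)
```

Under this simplified model the maximizing recombination fraction is the observed proportion of recombinants, and a LOD score of 0 is always attained at a recombination fraction of 0.5.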
Regions with suggestive (Zmax > 2) single-marker results were followed up with multipoint LOD score linkage analysis. Although multipoint analysis can potentially increase the available mapping information, it has the disadvantage of being sensitive to both model (77) and map misspecification (78–80). In order to increase robustness to map misspecification and to decrease computational burden, we performed multipoint analysis using only pairs of sequential markers across chromosomal regions of interest, which allows model misspecification to be absorbed into maximization of the LOD score in the flanking intervals (77).
To allow additional practical multipoint computation in key regions on these complex pedigrees, we used the MCMC parametric linkage analysis program lm_bayes (81) from the MORGAN v.2.8.1 software package (http://www.stat.washington.edu/thompson/Genepi/pangaea.shtml). MCMC expands the situations where linkage analysis is computationally practical, and the implementation in MORGAN works well under a wide variety of conditions (82). The program lm_bayes utilizes both the phenotype and marker data in MCMC sampling, which can result in better LOD score estimates for traits with high penetrance compared with MCMC procedures such as lm_markers in the MORGAN package and Simwalk2 (83,84) that only utilize the marker data (81). An MCMC-based approach was necessary because of the computational challenges: exact computation with a single two-marker analysis took approximately 60 CPU-days on the R pedigree with two loops, and was therefore not further pursued. For two-marker analyses in the PQ and H pedigrees with up to three loops, we performed five replicate scans using different random number seeds, a main run length of 1 million iterations, Rao-Blackwellized estimates of the LOD score and the default settings for other parameters; this produced reliable LOD score estimates (SD <0.12 LOD units) and took <1 CPU-day for each two-marker scan on a 1.8 GHz AMD Opteron processor. Because of the computational challenges even with MCMC methods, we limited multipoint analysis only to regions with an initial LOD score >3.
For single-marker analysis in pedigrees with seven loops, we took additional steps to shorten the MCMC analysis to ~1 week on a single processor. We set the locus:meiosis sampling probability ratio to perform an average of only 50 locus samples in the main run, and we eliminated the preliminary run, used to set the ‘pseudopriors’ or sampling probabilities at each recombination fraction, by instead using the exact LOD scores in the same pedigrees with four to five loops to set the pseudopriors. Pseudopriors were calculated by inverting the exact likelihood ratios at each position and scaling the resulting probabilities to sum to one. To evaluate MCMC performance, we performed six replicate scans using different seeds and a main run length of either 500 000 or 1 million iterations.
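The pseudoprior calculation described above can be sketched in a few lines. Because a LOD score is the base-10 logarithm of a likelihood ratio, the inverse likelihood ratio at each position is 10 raised to minus the LOD score. The function name is hypothetical, and the sketch does not reproduce the lm_bayes input format.

```python
def pseudopriors_from_lods(exact_lods):
    """Convert exact LOD scores at each position into pseudopriors.

    Each likelihood ratio 10**LOD is inverted, and the inverted values are
    scaled so that the resulting sampling probabilities sum to one.
    """
    inverse_ratios = [10.0 ** (-lod) for lod in exact_lods]
    total = sum(inverse_ratios)
    return [value / total for value in inverse_ratios]


# Hypothetical exact LOD scores at five recombination fractions.
example_lods = [-1.2, 0.4, 1.1, 0.6, 0.0]
example_pseudopriors = pseudopriors_from_lods(example_lods)
print([round(p, 4) for p in example_pseudopriors])
```

Positions with larger exact LOD scores receive smaller pseudopriors, so the inversion counterbalances the likelihood at each position rather than reinforcing it.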
We wrote a customized Perl script to carry out simulations, when possible, in order to evaluate statistical significance of LOD scores, including the effects of simplifying and partitioning the large and complex U pedigree. For computational reasons, this was possible only for the H and S (or HS) and two-loop version of the R pedigree components. Even the simplest three-loop version of the PQ component would have required >400 CPU-days to complete an equivalent simulation, as would more complex versions of the PQ and R pedigrees (Table 3), with more complex versions of these pedigree components requiring up to several CPU-years for such analyses. We used Genedrop from the MORGAN v.2.8.1 software package to generate simulated marker data sets under the null hypothesis of no linkage. We simulated marker data on the complete, complex, Umatac pedigree, basing the simulation on the sample allele frequencies and assumed marker map. We followed this with linkage analysis, using the real trait data and simplified pedigree structures as used in the original data analysis, and computed the empirical P-value as the fraction of data sets, out of 10 000 replicates, in which the LOD score was at least as high as that observed for the real data.
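The logic of the simulation can be outlined with a minimal sketch, written here in Python rather than Perl. It drops alleles at a single unlinked marker through a pedigree and computes an empirical P-value as the fraction of simulated LOD scores at least as large as the observed one. The function names and the toy pedigree are hypothetical, the linkage analysis step is omitted, and the sketch is not a substitute for Genedrop, which simulates marker data along an assumed map.

```python
import random


def gene_drop(parents, order, allele_freqs, rng):
    """Simulate genotypes at one marker under the null hypothesis of no linkage.

    parents maps each individual to (father, mother) or (None, None);
    order lists individuals so that parents precede their children.
    """
    alleles = list(allele_freqs)
    weights = [allele_freqs[a] for a in alleles]
    genotypes = {}
    for ind in order:
        father, mother = parents[ind]
        if father is None:
            # Founder alleles are drawn from the population frequencies.
            genotypes[ind] = tuple(rng.choices(alleles, weights=weights, k=2))
        else:
            # Each parent transmits one of its two alleles at random.
            genotypes[ind] = (rng.choice(genotypes[father]),
                              rng.choice(genotypes[mother]))
    return genotypes


def empirical_p_value(observed_lod, simulated_lods):
    """Fraction of simulated LOD scores at least as high as the observed one."""
    hits = sum(1 for lod in simulated_lods if lod >= observed_lod)
    return hits / len(simulated_lods)


# Hypothetical trio and marker.
example_parents = {"father": (None, None), "mother": (None, None),
                   "child": ("father", "mother")}
example_order = ["father", "mother", "child"]
rng = random.Random(2008)  # fixed seed so that the simulation is reproducible
print(gene_drop(example_parents, example_order, {"1": 0.7, "2": 0.3}, rng))
```

In a full analysis each simulated data set would be analyzed with the real trait data and the simplified pedigree structures, and the resulting maximum LOD scores would be passed to `empirical_p_value`.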
We evaluated the sensitivity of our analysis in key regions to several factors. A higher disease allele frequency of 5% was used in the ALS/PDC analysis to accommodate the possibilities of multiple disease alleles at the trait locus or that a key pedigree loop might have been broken. Results using Chamorro marker allele frequencies derived from 140 cases versus 88 controls were compared. The impact of breaking loops was evaluated by carrying out computations with additional loops, to the extent possible.
We separately analyzed ALS and PDC as distinct traits to determine whether separate linkage peaks were explained by distinct loci for the two phenotypes. Individuals with a diagnosis of PDC were considered to have unknown phenotype in the ALS analysis, and vice versa; two individuals with ALS and PDC were considered to be affected in both phenotype-specific analyses. Trait model parameters were the same as for ALS/PDC, except that phenotype-specific penetrances were calculated on the basis of the age of onset distribution in 29 individuals with ALS (mean = 45.1 years, SD = 10.4 years) and 33 individuals with PDC (mean = 53.8 years, SD = 7.7 years), respectively.
We carried out a case–control analysis of the genome scan data as an alternative strategy. The justification for this was that: (i) it provides complementary information; (ii) the genome scan data were already available and the analysis is computationally rapid; (iii) the small population at the end of the 18th century, followed by rapid growth over the next two centuries (29,85), predicts long tracts of linkage disequilibrium, with power >80% to detect association with genome scan STRPs at a nominal significance level of 0.001 in the available sample (86); and (iv) STRPs can have considerable power to detect association in the context of recent founder events at moderate genetic distances (86). These analyses were initially carried out on a sample of 137 putatively unrelated cases (mean age 57.9 ± 11.4 years) and 88 age-matched controls (mean age 51.87 ± 15.8 years at examination), representing the entire available sample of population-based non-Umatac cases. To include the maximum possible number of affected individuals, we later added all sampled subjects from Umatac, giving a total of 157 cases and 101 controls in the full sample, and accounted for relationships among individuals with a newly developed method of analysis described below. Finally, because the total available sample size was modest, we supplemented the standard case–control analysis with a weighted false-discovery rate (wFDR) analysis, which increases power to detect association by incorporating weights based on the genome-wide linkage analysis results to provide an estimate of the FDR (65).
We investigated the existence of unidentified relationships in the sample by several methods that used the genome-wide marker data. We used Relpair v.2.0.1 (http://csg.sph.umich.edu/boehnke/relpair.php) (87), with an assumption of a 0.5% genotyping error rate and estimates of marker allele frequencies from the full case–control sample. We also computed maximum likelihood estimates (88,89) of k coefficients (90), which are the multilocus probability of sharing 0, 1 or 2 (k0, k1, k2) alleles identical by descent with no restrictions on the types of relationships allowed. Finally, genealogical and clinical records were reviewed for all individuals found to be related on the basis of genome-wide marker data.
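To make the k coefficients concrete, the sketch below lists their expected values for a few standard relationships in an outbred population, converts a (k0, k1, k2) estimate into a kinship coefficient and reports the closest standard relationship. The function names and the nearest-match rule are ours and serve only as illustration. The estimates in this study were obtained without restricting the relationship type, and in an inbred isolate the estimated values need not fall near any of these reference points.

```python
# Expected (k0, k1, k2) for standard relationships in an outbred population.
EXPECTED_K = {
    "unrelated": (1.0, 0.0, 0.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    # Grandparent-grandchild and avuncular pairs share these values.
    "half siblings": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
    "monozygotic twins": (0.0, 0.0, 1.0),
}


def kinship_from_k(k0, k1, k2):
    """Kinship coefficient implied by the IBD-sharing probabilities."""
    return 0.25 * k1 + 0.5 * k2


def nearest_relationship(k0, k1, k2):
    """Standard relationship whose expected k values are closest (Euclidean)."""
    def squared_distance(name):
        e0, e1, e2 = EXPECTED_K[name]
        return (e0 - k0) ** 2 + (e1 - k1) ** 2 + (e2 - k2) ** 2
    return min(EXPECTED_K, key=squared_distance)


# Hypothetical estimate for one pair of putatively unrelated subjects.
estimate = (0.30, 0.50, 0.20)
print(nearest_relationship(*estimate), kinship_from_k(*estimate))
```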
Evidence for association between case status and marker allele frequencies was evaluated with χ2 tests. Provided that the assumptions of independence among observations and HWE are satisfied, P-values are expected to have a uniform (0,1) distribution except for a few markers that are near to and in linkage disequilibrium with the disease locus. We used QQ plots to assess departures of the genome-wide P-value distribution from uniformity because such departures can indicate violation of the analysis assumptions, for example, caused by the presence of cryptic relatedness or population stratification. Marked deviation from the expected null distribution of P-values was observed even when the Umatac subjects were excluded, and led to the identification of cryptic relatedness in this sample. As a result, and also to allow the use of the maximal available sample, we developed a corrected χ2 test on the basis of estimating relatedness from genotype data (89) and correcting the test for this relatedness (91). Details of this extension and its properties are described elsewhere (49); in particular, the corrected test is slightly conservative. Finally, to maximize power to detect association and to use the joint information from both the linkage and association analyses in the interpretation of the association results, we computed q-values using the weighted false discovery rate (FDR) as described elsewhere (65). For the estimation of the FDR, we used the genome scan LOD scores for the HS pedigree configuration, since this was the only pedigree configuration that used the full sample; an exponential model for each marker; and an empirical estimate of the proportion of tests representing the null distribution (92).
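The uncorrected allelic test and the QQ comparison can be sketched as follows. The code computes a Pearson χ2 statistic on a 2 × k table of allele counts and pairs sorted P-values with their expected uniform quantiles on a −log10 scale. The function names and counts are hypothetical, the plotting positions (i + 0.5)/n are one common convention that we assume here, and the correction for relatedness developed for this study is not implemented.

```python
import math


def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic and degrees of freedom for allele counts.

    Assumes independent observations, which cryptic relatedness violates.
    """
    alleles = [a for a in sorted(set(case_counts) | set(control_counts))
               if case_counts.get(a, 0) + control_counts.get(a, 0) > 0]
    n_case = sum(case_counts.get(a, 0) for a in alleles)
    n_control = sum(control_counts.get(a, 0) for a in alleles)
    total = n_case + n_control
    statistic = 0.0
    for allele in alleles:
        column = case_counts.get(allele, 0) + control_counts.get(allele, 0)
        for observed, row in ((case_counts.get(allele, 0), n_case),
                              (control_counts.get(allele, 0), n_control)):
            expected = row * column / total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(alleles) - 1


def qq_points(p_values):
    """Pairs of (expected, observed) -log10 P-values under a uniform null."""
    n = len(p_values)
    ordered = sorted(p_values)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(ordered)]


# Hypothetical allele counts at one marker.
stat, df = allele_chi_square({"A": 30, "B": 10}, {"A": 20, "B": 20})
print(round(stat, 3), df)
```

Points lying systematically above the diagonal of the QQ plot across most of the genome, rather than only in the extreme tail, are the pattern that suggests a violated assumption such as cryptic relatedness.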
To evaluate evidence for Hardy–Weinberg disequilibrium (HWD) and population structure, we performed an exact test for HWD for multiallelic markers using the MCMC program hwe in the Hardy software package (http://www.stat.washington.edu/thompson/Genepi/pangaea.shtml) (93), and we used Structure (http://pritch.bsd.uchicago.edu/software.html) (94,95) to evaluate evidence for the existence of admixture or distinct subpopulations. However, the results from Structure were difficult to interpret because of sensitivity to the choice of model assumptions and prior distributions, and this approach was therefore not pursued further.
Genomic DNA was isolated from available frozen brain tissue or fresh blood samples. Genotypes were determined by PCR amplification of polymorphic loci using primers labeled with fluorescent tags. We genotyped 408 microsatellite markers, primarily from the ABI Prism Linkage mapping set, Version 2.5, HD5 (http://home.appliedbiosystems.com/): 402 genome scan markers on all subjects, and 6 additional markers on chromosome 12p on the pedigree samples. Additional markers were typed on 12p only, because the markers in the genome scan panel did not provide strong support for a locus on 12p beyond that from the initial marker providing evidence for linkage. DNA fragments were analyzed using an ABI377 or an ABI3100 DNA sequencing instrument and GeneScan and Genotyper software. Quality control of the resulting data involved validating pedigree structures, identifying and eliminating genotyping errors, mapping markers back to somatic cell hybrids to verify chromosomal location and evaluating the consistency of the data with published maps, as described in the Supplementary Material.
This work was funded by the National Institutes of Health (AG14382; AG11762; AG05136; and GM46255) and by the Department of Veterans Affairs.
We wish to thank Dr Chris C. Plato, who passed away in March 2008, and whose meticulous work and generosity provided the initial Umatac pedigree, without which the current study would not have been possible. We also wish to acknowledge Hiep Nguyen for computer assistance, as well as the efforts of research assistants and clinicians who have assisted with the laboratory work and in identifying, evaluating and referring patients in Guam and elsewhere over the years.
Conflict of Interest statement. None declared.