The sumLINK statistic is a new method aimed at addressing both heterogeneity and localization. The procedure is designed to identify the genomic regions for which an excessive number of powerful pedigrees are concordant. It is an ideal approach for multi-center collaborations or large single-site studies where a large number of pedigrees are available. A distinct advantage of this method is that it does not require collaborating centers to share raw data such as pedigree structures or genotypes, nor does it require that each center use the same marker set. Provided a common genetic map is used for analysis, each center can perform its own analyses, calculating multipoint LOD scores at the same equally spaced increments across the genome. It is only necessary to share these meta data (a multipoint genome scan for each pedigree), which enhances data privacy and security.
An important advantage of the sumLINK is the ability to identify loci that have good potential for gene localization, as several linked pedigrees exist beneath each peak identified. An unexpected benefit of compiling data across centers that used different marker maps is that the resolution of the localization can be higher than that of any individual genetic map, due to the overlaying of data. In our example, even with a low-density 10 cM marker map, we were able to localize each region to approximately 20 cM, and these localized regions would be greatly refined with fine-mapping. This method of using the limits of sharing observed within extended pedigrees is intuitively appealing for localization, but may also have theoretical advantages over other common methods. Often so-called “1-LOD” support intervals are reported for linkage peaks generated from an HLOD analysis; however, support intervals should strictly be applied to parameter estimates (the recombination fraction parameter, θ, in the case of linkage statistics) and are relevant in the context of two-point maxLOD statistics that are directly analogous to likelihood ratio tests. The standard practice of constructing a 1-LOD support interval from the value of the statistic itself (usually HLOD) rather than from a parameter is not statistically well-grounded, although, since θ is a distance parameter, it has intuitive appeal. In particular, in an HLOD analysis it is not clear whether the statistical noise generated by “unlinked” pedigrees may mask or shift positive linkage evidence. Hence, these “1-LOD” intervals can only be considered a rough guide.
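For readers unfamiliar with the convention being criticized, the following minimal sketch shows how a “1-LOD” drop interval is typically read off a scan: it takes the grid points around the peak where the statistic stays within one LOD unit of the maximum. The function name `one_lod_interval` and the toy values are illustrative assumptions, and the sketch resolves the interval only to the analysis grid, with no interpolation. It is not part of the sumLINK procedure.

```python
def one_lod_interval(positions, lods, drop=1.0):
    """Return (left, right, peak) map positions of the contiguous region
    around the maximum where the statistic stays within `drop` of the peak.

    positions: map positions (e.g. cM) of the grid points, in order.
    lods:      LOD or HLOD value at each grid point.
    """
    if len(positions) != len(lods) or len(lods) == 0:
        raise ValueError("positions and lods must be non-empty and equal length")

    # Index of the highest score (first one if there are ties).
    peak = max(range(len(lods)), key=lods.__getitem__)
    cutoff = lods[peak] - drop

    # Walk outward from the peak while the score stays above the cutoff.
    left = peak
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1

    return positions[left], positions[right], positions[peak]


# Toy scan on a 10 cM grid (made-up values, for illustration only).
grid = [0, 10, 20, 30, 40, 50]
scan = [0.2, 1.1, 2.4, 3.0, 2.2, 0.5]
interval = one_lod_interval(grid, scan)
```

Because the interval is defined on the statistic rather than on an estimate of θ, it carries no formal coverage guarantee, which is the limitation noted above.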
The shuffling method we have implemented to determine the null distribution is a particularly innovative element of the sumLINK procedure, and may be especially useful to the broader research community. We used the procedure to assess the significance of two genome-wide linkage statistics (sumLOD and sumLINK), but it may have broader applications for testing the significance of other statistics with unknown distributions. It is a simple, elegant, and quick way to create null data for assessing significance. It accounts for variations in pedigree structure as well as the autocorrelation of consecutive loci inherent in genetic linkage data. We developed a post-processing script written in R [RDCT 2006] that calculates the sumLINK and sumLOD statistics, performs the genome shuffling, generates the empirical null distribution, and tests the significance of observed linkage peaks. Computational time depends on the number of pedigrees and the length of the genomic region being analyzed. The ICPCG data, comprising 190 pedigrees and 3502 data points from 22 chromosomes, required 21.3 seconds per iteration with a 3.0 GHz Intel Xeon Duo Core 64-bit CPU running R v2.4.1 on Red Hat Enterprise Linux v5. One iteration consists of shuffling all pedigrees; calculating the null sumLINK, the null sumLOD, and the number of pedigrees contributing to each statistic at all data points; and writing out a text file containing these values. Significance is computed in a later step after all shuffling iterations are complete. Our simulated null data (400 pedigrees, 3550 cM, 22 chromosomes) required 60.8 seconds per iteration, and the simulated data sets from GAW13 (200 pedigrees, 3604 cM, 22 chromosomes) required 22.9 seconds per iteration.
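The structure of one shuffling iteration, followed by a later significance step, can be sketched as below. This is a simplified illustration in Python rather than the R script described above, and it rests on several labeled assumptions: a pedigree is treated as “linked” when its LOD is at least `LINK_THRESHOLD` (0.588 is an assumed cutoff), the sumLOD is taken as the sum of positive LOD scores, each pedigree's genome is shuffled by an independent random circular rotation of its concatenated scan, and significance is read from the genome-wide maximum of the null sumLINK. All function names are hypothetical.

```python
import random

LINK_THRESHOLD = 0.588  # assumed cutoff for counting a pedigree as "linked"


def sum_stats(lod_matrix, threshold=LINK_THRESHOLD):
    """lod_matrix[p][i] is the multipoint LOD of pedigree p at grid point i.
    Returns (sumlink, sumlod), each a list with one value per grid point."""
    n_points = len(lod_matrix[0])
    sumlink = [0.0] * n_points
    sumlod = [0.0] * n_points
    for row in lod_matrix:
        for i, lod in enumerate(row):
            if lod >= threshold:   # assumed sumLINK rule: linked pedigrees only
                sumlink[i] += lod
            if lod > 0:            # assumed sumLOD rule: positive LODs only
                sumlod[i] += lod
    return sumlink, sumlod


def shuffle_genome(row, rng):
    """Rotate one pedigree's genome-wide scan by a random offset.
    Neighbouring points stay neighbours, so autocorrelation is preserved,
    while alignment between pedigrees is broken."""
    k = rng.randrange(len(row))
    return row[k:] + row[:k]


def empirical_p(lod_matrix, observed_peak, n_iter=200, seed=0):
    """Fraction of shuffled genomes whose maximum null sumLINK reaches the
    observed peak, with the usual +1 correction."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        shuffled = [shuffle_genome(row, rng) for row in lod_matrix]
        null_sumlink, _ = sum_stats(shuffled)
        if max(null_sumlink) >= observed_peak:
            hits += 1
    return (hits + 1) / (n_iter + 1)


# Toy input: 2 pedigrees, 3 grid points (made-up values).
toy = [[0.0, 1.0, 0.2],
       [0.7, -0.5, 0.6]]
toy_sumlink, toy_sumlod = sum_stats(toy)
```

Each pedigree keeps its own set of LOD values under rotation, which is how a scheme of this kind respects differences in pedigree structure. With few pedigrees, the number of distinct shuffled genomes is small and the null distribution is correspondingly coarse.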
Analysis of simulated null data showed that the type-I error rates for the sumLINK and sumLOD statistics were all within acceptable boundaries, indicating that the genome-shuffling procedure is valid for significance testing. It is interesting to note that the sumLINK and sumLOD statistics did not frequently agree with regard to the locations of statistically significant peaks in the null data, nor did they generally agree with the HLOD. This perhaps indicates that the three statistics are sensitive to different characteristics of the null data.
Analysis of simulated alternative hypothesis data was based on a GAW13 complex model. An obesity phenotype was selected because it is a complex trait simulated with extensive locus heterogeneity. One major weight gene, Gb11, was easily identified, with both the sumLOD and sumLINK showing good comparability with the HLOD. Power was low for all other genes, but this was not unexpected. Others who analyzed these data reported that the simulated obesity-related genes, particularly those genes affecting change over time, were very difficult to find.[Strauch, et al. 2003; Yoo, et al. 2003] The data creators intentionally made many of the genes challenging and perhaps even impossible to find.[Daw, et al. 2003] Although the power was low, the sumLINK and sumLOD statistics consistently outperformed the HLOD in identifying the minor genes. However, we do not believe that these new statistics should replace the HLOD; rather, our investigation provides a proof of principle that the sumLINK and/or sumLOD are useful companion measures to help identify the best loci for further testing.
One potential limitation of our method is that the genome-shuffling procedure used to create the null distribution may not be useful for studies that include only a small number of pedigrees, because the number of shuffled genomes that can be generated is limited. The shuffling procedure also assumes that information content is approximately constant across the genome, an assumption that may be violated at the telomeres, where multipoint information content is systematically reduced. We tested robustness to this by removing all the telomeric regions from the ICPCG data and repeating the analysis. We found that because these regions are such a small part of the entire genome, they do not substantially bias the shuffled null genomes, and the results were extremely robust. However, given the difference in information content between the sex chromosomes and the autosomes, we recommend the method for autosomal genome scans only. The term “genome-wide” as used in this manuscript refers only to the 22 autosomes. All of the sumLINK and sumLOD analyses we presented were performed using sex-averaged genetic maps. The effect of this assumption on the characteristics of these new statistics has not been investigated here.
In our example case study of the ICPCG aggressive prostate cancer data we identified three regions of interest for further follow-up: two with genome-wide significant evidence supported by FDR analysis, and one with suggestive evidence. This performance is very encouraging. A prior linkage study of these data using conventional LOD/HLOD procedures indicated suggestive linkage evidence at the same loci that we identified on chromosomes 11 and 20 (HLODs of 2.40 for a recessive model and 2.49 for a dominant model, respectively).[Schaid and ICPCG 2006] Our method finds superior levels of significance; both loci are genome-wide significant. However, it is certainly notable that Schaid et al. reported that the evidence on chromosome 11 increased to HLOD=3.31 in subset analyses for early age-at-onset pedigrees, and the evidence on chromosome 20 increased to HLOD=2.65 in the subset of pedigrees with mean age-at-onset greater than 65 years.[Schaid and ICPCG 2006] Without incurring the increased multiple testing inherent in subset analyses, the sumLINK was able to identify the more powerful pedigrees and the superior evidence. Our suggestive region on chromosome 2 was not identified using conventional linkage statistics in the previous study.