|Home | About | Journals | Submit | Contact Us | Français|
CLU, PICALM and CR1 were identified as genetic risk factors for late onset Alzheimer’s disease (AD) in two large genome wide association studies (GWAS) published in 2009, but the variants that convey this alteration in disease risk, and how the genes relate to AD pathology is yet to be discovered. A next generation sequencing (NGS) project was conducted targeting CLU, CR1 and PICALM, in 96 AD samples (8 pools of 12), in an attempt to discover rare variants within these AD associated genes. Inclusion of repetitive regions in the design of the SureSelect capture lead to significant issues in alignment of the data, leading to poor specificity and a lower than expected depth of coverage. A strong positive correlation (0.964, p<0.001) was seen between NGS and 1000 genome project frequency estimates. Of the ~170 “novel” variants detected in the genes, seven SNPs, all of which were present in multiple sample pools, were selected for validation by Sanger sequencing. Two SNPs were successfully validated by this method, and shown to be genuine variants, while five failed validation. These spurious SNP calls occurred as a result of the presence of small indels and mononucleotide repeats, indicating such features should be regarded with caution, and validation via an independent method is important for NGS variant calls.
Late onset Alzheimer’s disease (AD) is a complex disorder with a strong genetic component, with heritability estimated to be up to 60-80% . ApoE was the only robustly replicated genetic risk factor for AD until relatively recently , when the advent of the Genome Wide Association Study (GWAS) allowed researchers to test the vast majority of loci in the human genome for association with AD in a single experiment, without prior assumptions as to which might be involved. The publication of the first two sufficiently powered, large scale AD GWAS [3,4] in 2009 identified CLU, CR1 and PICALM as genetic risk factors for AD, and subsequent to this, another 6 genes (BIN1, ABCA7, the MS4A locus, CD33, CD2AP and EPHA1 [5-7]) have been implicated and replicated through further GWAS and meta-analyses. However, it remains to be determined how these genes are involved in the pathology of AD, and which variants convey the alteration in disease risk.
In order to begin to address this issue in AD, as in numerous other disorders [8-10], extensive resources are being invested in Next Generation Sequencing (NGS) projects to discover rare, potentially causative variants at loci implicated by GWAS. Recent advances in technology have allowed scientists to generate sequence data to an unprecedented scale, with experiments that can produce millions of short sequencing reads in a single run. However, the data analysis methods needed to analyse and interpret the output from these experiments are still in their infancy, and at present there is no real “gold standard” for handling this data .
Various methods of target enrichment are available which allow researchers to hone in on specific genomic regions of interest . Pooling of samples prior to enrichment and sequencing maximises the utility of NGS technologies, reducing the cost per sample dramatically. This enables far more subjects to be included in studies than would be feasible with individual sequencing or even indexed pooled capture. However, there is a trade off between increasing numbers of individuals within a pool, and the reliability of SNP calls. Increasing samples in a pool decreases coverage per sample, bringing the rate of a singleton SNP within a pool closer to the inherently high error rate of NGS technologies, whereas with individual sequencing, variants are present in approximately 50% of reads, at a given position. Indeed, when utilising a pooling strategy combining the DNA of 75 individuals, a validation success rate of around 20% was achieved for rare variants, compared to an almost 75% successful validation rate with pools of 12 (data unpublished). Estimating the frequencies of variants from pooled data is also reportedly unreliable, making comparisons between case and control allele frequencies for disease association testing problematic .
Repetitive DNA, which comprises around half of the human genome , and ranges from short stretches of mononucleotide repeats, to large segmental duplications, brings significant challenges for NGS projects. The main area where this presents an issue is in mapping short sequencing reads to the reference genome, since it can lead to reads having multiple potential alignment locations. Alignment programs may deal with the issue by reporting the best match only; by discarding all reads that map to multiple locations (or >n locations); or by reporting all potential alignment locations . None of these methods is satisfactory, as data will be lost or aligned inaccurately, which may lead to erroneous variant calls in downstream analysis.
Given the current high error rates of NGS technologies and issues achieving accurate alignment, particularly around repetitive regions , validation of putative variants detected in NGS data via an independent method is important.
An NGS project was conducted targeting CLU, CR1 and PICALM, in 96 AD patients, in an attempt to discover rare variants within these genes. A subset of SNPs from this NGS data were selected for validation via Sanger sequencing  to explore some of the potential issues mentioned above.
All subjects gave informed consent to be included in the study, which was granted approval by the local Ethics Committee. 96 AD samples were obtained from two UK centres - The University of Nottingham Brain Bank and Manchester Brain Bank (48.4% female, 51.6% male; mean age at onset 70.4 years, standard deviation 11.77). ApoE alleles; ε2 - 5.7%; ε3 - 63.5%; ε4 - 30.8%). Both of these resources comprise part of the ARUK brain bank, which has been used in such projects as the GWAS that first implicated CLU and PICALM in AD risk . This quantity of samples gave 80% power to detect variants with minor allele frequencies (MAF) down to 0.85%, based on the equation n=log(1-p)/log(1-MAF) where n is the number of chromosomes, p is power, and MAF is minor allele frequency. Whole genomic DNA was combined into 8 equimolar pools of 12 samples. Individual DNA samples were assessed for quality using agarose gel electrophoresis, with samples showing signs of degradation rejected from the study. The Invitrogen Quant-iTTM dsDNA Broad Range Assay Kit (Life TechnologiesTM, NY) was used to quantify the DNA concentrations to allow for accurate pooling. Enrichment of the genomic regions of interest (CLU, PICALM and CR1, totalling 292kb) was performed using Agilent’s Sure Select XT Custom kit (Agilent Technologies, CA). SureSelect baits were designed to capture the whole gene (introns and exons), using 5x tiling, without repeat masking. Design of baits with and without the repeat masker made it possible to ascertain the proportion of each region that would not be targeted, were the repeat masker utilised; effectively the proportion of repetitive DNA in the region, although the masking method used is arguably over conservative. The region to be sequenced was also extended to include flanking regions demonstrating notable conservation across vertebrate species (areas >100bp showing at least 70% sequence identity between man, macaque, dog and mouse), up to a maximum of 5kb from the genes, using the ECR browser (http://ecrbrowser.dcode.org/) . Sequencing was conducted on Illumina’s GAIIx (Illumina Inc. CA), with one pool per lane on a flow cell producing 38bp single-end reads. Both of these processes were conducted by Source Bioscience (http://www.sourcebioscience.com/), following manufacturer’s protocols.
Alignment of the sequence data was performed using BFAST v0.6.5a (Blat-like Fast Accurate Search Tool - http://bfast.sourceforge.net ), following the protocol in the program’s manual with default settings for short single-end Illumina sequencing reads, to hg19 . The quality of the alignment was assessed using SamStat . SAMTools  v1.17 was used for basic file manipulation. Integrative Genomics Viewer  (IGV) v2.0 was used to visualise the aligned sequencing reads.
CRISP  v5, a program specifically designed for variant calling in NGS data from pooled samples, was used to identify SNPs and small indels, following the program’s manual, with default settings.
Basic annotation of the polymorphisms called by CRISP was conducted using Ensembl’s Variant Effect Predictor  (VEP, accessed November 2011), which provided information on where the variants lie in relation to the major transcripts of each gene and whether these were novel or had been documented in dbSNP.
Only variants which did not have known co-located variants (as determined by the VEP) were taken forward to further analysis. To minimise the chance of pursuing false positive variants, an additional filtering step was applied: variants which featured in less than 4% of the total number of reads in the pool in which they occurred were disregarded, since on average a singleton variant in a pool of 12 individuals (24 chromosomes) would be expected to be present in 4.2% of the reads.
The Genome Analysis Tool Kit (GATK)  v1.1.10 was used to perform local realignment around known indels and base quality score recalibration , using variant data from dbSNP 134, on the complete data for each of the genes. Both processes were run following the default protocol documented on the GATK online user guides. CRISP was then re-run on the realigned, recalibrated data.
PCR primers were designed to amplify a region including at least 100bp either side of the position of interest (SNPs listed in Table 1) using Primer3  v0.4.0 (http://frodo.wi.mit.edu/). Specificity for each primer pair was checked using UCSC’s ; Virtual PCR function (http://genome.ucsc.edu/cgi-bin/hgPcr), and the primer binding sites were determined to be free of known polymorphisms using NGRL Manchester’s SNPCheck v2.1 (https://ngrl.manchester.ac.uk/SNPCheckV2/snpcheck.htm). PCR optimisation and amplifications were completed following standard laboratory protocol (reaction mix: 1xPCR buffer (Roche Diagnostics Corp., IN); 200μM dNTPs (Fermentas); 1μM of each primer (Eurogentec Biologics, Belgium); 1 unit Taq DNA Polymerase (Roche, Diagnostics Corp., IN); plus molecular grade water up to a final volume of 30μl. Primer concentrations were halved for 8:27466924 and doubled for 1:207690803 after optimisation. Thermal cycling conditions used were 94°C for two minutes; 30 cycles of 94°C for 30 seconds, appropriate annealing temperature for 1 minute, 72°C for 1 minute; and finally 72°C for 7 minutes). Primer sequences and annealing temperatures are shown in Supplementary Table 1. Sequencing was conducted using PCR primers with Applied Biosystems BigDye Terminator v3.1 chemistry, run on the ABI 3130xl (Applied Biosystems, CA). Chromas Lite v2.01 (http://www.technelysium.com.au/chromas_lite.html) was used to visualise electropherograms which were assessed by eye to determine genotype. In each case, one pool of samples (12 individuals) was Sanger sequenced. The pool to be sequenced was selected based on having the highest proportion of alternative reads at the position of interest.
Tabix  was used to obtain variant information from 1000 genomes  release 20110521 and an in-house compiled Perl script was used to extract information of interest (variants and European population frequency data) from this. SPSS v16 was used to calculate correlation coefficients between CRISP and 1000 genomes frequency data.
Following alignment with BFAST of the ~350 million reads obtained from the NGS run, the average coverage per individual across the three genes was 17.4x (18.1x, 21.5x and 13.9x for CLU, PICALM and CR1 respectively). Sequencing characteristics from the experiment are shown in Table 2.
The number of variants detected by CRISP in each of the genes is summarised in Table 3, both before and after realignment and recalibration of the data with GATK. The table also shows the number of common (MAF >5%) SNPs listed within dbSNP 134 in the targeted regions, and how many of these were detected within our data (in each case this remained the same pre- and post- GATK). As a positive control, the CRISP output was compared with variants which had a MAF between 0.85-1% in the 1000 genomes project EUR population data (i.e. the minimum MAF the study had 80% power to find). Six of the thirteen variants received particularly poor coverage in our data (<10x per individual). Of the remaining seven, four were successfully identified in our data, demonstrating the capability of this method to detect variants in this range of MAF. The final three appeared to be non-variant sites in our 96 samples.
In order to ascertain the accuracy of the frequency estimates from CRISP, MAF estimates (based on percentage of alternative reads - a surrogate for MAF) from CRISP for all of the SNPs called in the three genes were compared with MAFs from the 1000 genomes project (European population). A strong, significant positive correlation was observed between datasets (Spearman Correlation Coefficient=0.964, p=<0.001).
The seven SNPs selected for Sanger Validation were chosen on the basis of having no co-located variant (as determined by Ensembl’s VEP) and having the alternate allele occur in >5% of sequencing reads at that position. Given the apparent commonality of these variants in our data, the fact that they had not been documented prior to the extensive resequencing efforts of the 1000 genomes project, if at all, seemed worthy of investigation. Other variants called by CRISP which fell within the regions sequenced were also considered in the analysis. Information on all of the SNPs Sanger sequenced is presented in Table 1.
Of the 12 putative SNPs included in Sanger sequenced regions, six were found to be genuine (Table 4). Reliability of frequency estimates could also be assessed once the number of actual alternative alleles in a pool was established by Sanger sequencing. Each alternative allele should contribute ~4.2% of reads to the pool total, assuming equal representation. The relationship between the number of actual alternative alleles and the proportion of NGS reads they make up is shown in Table 4.
Three of the remaining SNPs were not found to be present in the samples Sanger sequenced, but instead small indels were found at the suggested variant sites (all which were also called by CRISP). These variants are summarised in Table 5.
The final three SNPs (8:27466924, 8:27473743 and 11:85668102) were not validated by Sanger Sequencing. All three of these putative variants occurred adjacent to mononucleotide polyA repeats, with numerous other potential variants in the immediate area, called by CRISP or present in dbSNP 134 (see Figure 1). When CRISP was run on GATK realigned and recalibrated data, the 8:27466924 and 11:85668102 SNPs were no longer called, and 8:27473743 went from being called as present in all pools to only being present in two (including the pool Sanger Sequenced). Within each of the polyA repeats, CRISP also called a +A insertion, all of which persisted following GATK realignment and recalibration. There are also multiple +A insertions reported in dbSNP for each of the mononucleotide repeat sites, which suggests these may be genuinely variant. Due to the issues even Sanger sequencing has in dealing with mononucleotide repeats, our findings were inconclusive as to whether there were genuine variations in the number of A nucleotides present at these sites.
The percentage of reads which mapped to the target region was lower than expected, resulting in lower than anticipated coverage of the targeted areas. SureSelect has been reported to return 40-50%  of reads on target, although with the custom kits it can be lower . The figures for average coverage of the three genes are actually deceptively low - repetitive regions within the genes tended to have few reads aligned to them, lowering the calculated coverage. An optional repeat masker (based on RepBase  v9.11) could have been used when designing the SureSelect baits, which would have prevented strongly repetitive regions from being targeted. However, this was not used in the design of this experiment as it was deemed overly conservative (masking out 34%, 34% and 48% of CLU, PICALM and CR1 respectively). However, when images from IGV and UCSC are compiled, as in Figure 2 [18,21,27], it can be seen that the regions that would have been repeat masked and therefore not baited, generally had extremely low coverage, so no real information was gained by targeting these regions. Additionally, allowing these repetitive regions to be included in the bait design will have reduced the specificity of the capture, as non-target, similar DNA will also have been pulled down, reducing the recovery of the true target region, and limiting the number of reads which could be uniquely mapped.
For CR1, around 40kb of the ~150kb region targeted received effectively no coverage, reducing the average for the whole region. This gap in the sequencing is as a result of the nature of the CR1 gene. CR1 encodes complement receptor 1, a membrane glycoprotein and the main receptor for complement proteins C3b and C4b. There are four different CR1 isoforms, encoded by CR1-A (the F allele), CR1-B (the S-allele), CR1-C and CR1-D, with allele frequencies in Caucasian populations of 0.83, 0.15, 0.01 and <0.01  respectively. The proteins differ in the number of C3b binding sites present, and the different alleles are thought to have arisen from unequal crossover events involving a stretch of DNA encoding this. CR1-C which encodes the smallest isoform has a single C3b binding site, and thus a single copy of this stretch of DNA, but alleles encoding the larger isoforms with multiple C3b binding sites will have multiple copies of this region, making accurate alignment of reads virtually impossible. A recent study  which used multiplex amplicon quantification to distinguish F- and S-alleles found an association between the S-allele (with an extra C3b binding site) and increased AD risk, however, with our methodology it was not possible to determine genotypes for this polymorphism within sample pools, let alone within individual samples.
When repetitive DNA comprises such a large proportion (~50% ) of the human genome, these regions cannot simply be ignored, but nor can they be accurately sequenced, given the current technology and data analysis methodologies available. The CR1 example above demonstrates that repetitive DNA can be biologically important and disease relevant, but whether this is a common phenomenon or an isolated example remains to be seen.
This study had 80% power to detect variants with a MAF down to 0.85%. This figure is a higher frequency than the multiple individual variants Bettens et al. found in their study, however significantly more samples (several thousand) were included in their resequencing approach targeting CLU exons . The study here documented used less samples, so would not be expected to find so many extremely rare variants, but has strengths in its coverage of full exonic, intronic and potential regulatory regions. That said, CRISP did detect several variants (rs185685560, rs188050008, rs186928661 and rs1834226) which all have MAFs as low as 0.13% in the 1000 genomes EUR population data. Of the 13 variants with MAFs from 0.85-1% in the 1000 genomes data for these genes, nine were not identified in our samples. Although it is likely some of these rare variants were simply not present in our sample set, some were likely missed due to poor coverage of the regions. This limitation arose through the inclusion of repetitive regions in the study design, creating issues during alignment, which could have been improved via the use of paired-end rather than single-end reads.
Overall, the total number of variants called by CRISP in the three genes was 761 SNPs and 171 indels, of which over 150 were without co-located variants. All of the SNPs listed in dbSNP 134 with CEU frequencies >5% within the targeted regions were found in our data (see Table 3), with the exception of five in CR1, likely due to the coverage issues mentioned above. This indicates a low false negative rate, for common SNPs at least. The strong positive correlation between 1000 genomes and CRISP frequency estimates for all CRISP called SNPs indicates that even from pooled data, frequency estimates can be considered reasonably accurate, although it should also be noted that as only AD patients were included in this sequencing project, deviance from a perfect correlation could be indicative of disease association. A recent paper  commented that frequency estimates from pooled samples were not generally accurate enough to use in meaningful association testing between case and control groups, however, the strong correlation between CRISP and 1000 genomes frequencies in our data suggests otherwise.
However, comparison of the actual number of alleles within a pool and the CRISP frequency does not support this. Sanger sequencing allowed the determination of the exact number of alternative alleles within a pool for the validated SNPs. Frequency estimates from pooled data are based on the assumption that each allele contributes equally to the total number of reads. If this assumption is incorrect, frequency estimations will not be accurate. For the majority of the variants considered in this study, there is a discrepancy between the number of alternative alleles within the Sanger sequenced pools, and the frequency estimation from CRISP, which could indicate the assumption is invalid and alleles are not equally represented. This could occur as a result of inaccurate pooling, resulting in DNA from certain subjects being over or under represented. Alternatively, it may reflect inherent biases in the target enrichment or NGS processes if DNA from certain individuals is captured or sequenced to a lesser extent. The more samples included in a study, the more accurate estimations of frequency will become , so when the full 96 are considered, frequency estimates improve, but this does not mean samples are being equally represented, which should be acknowledged during analysis. It is possible that while keeping a small number of samples per pool ensures accurate variant calls, it actually reduces the reliability of frequency estimates for those pools.
With so many potential variants arising from even relatively small scale resequencing projects, it is crucial to minimise the amount of false positive variants, hence why validation via an independent methodology is important. Half of the putative SNPs in the Sanger sequenced regions were validated as being genuine SNPs, but this included only two which were deliberately targeted for validation, and almost all of the variants which were incidentally sequenced as they fell within the amplicons to be sequenced. Bansal et al.  estimated a false positive rate of <1% using CRISP, which is significantly lower than our 50%. However, our rate would be expected to be higher than this, since many of the SNPs selected for validation were in possession of unusual characteristics (e.g. being present in all pools sequenced).
The location of these variants relative to the major transcripts of their respective genes is given in Table 1. The majority are deeply intronic, so are unlikely to be affecting splicing activity. One variant (11:85668163) falls around 1.5kb downstream of PICALM, where it is not likely to be affecting the gene’s function or regulation. None of the variants show a high degree of conservation. The other SNP, at 11:85692181, is a synonymous exonic change, which whilst not affecting the primary sequence of the protein, could be having an effect on splicing regulatory elements or mRNA structure. Functional studies would be needed to clarify the effects of any of these variants.
The remainder of the potential SNPs sequenced did not turn out to be genuine. These constitute false positives. However, three of these transpired to be next to genuinely variant sites of small indels; two +T insertions and a -TA deletion, all of which had been previously documented and were identified by CRISP.
The final three putative SNPs which were not validated were each next to a string of polyA mononucleotide repeats. These repeats are too small for the issue to be solved by masking repeats in the design of SureSelect baits. Unlike longer repeats, short stretches of mononucleotide repeats do not make alignment of reads impossible, as enough unique sequence is present, even in short 38bp reads to allow mapping. It is likely that these are genuinely variant sites, since CRISP calls +A insertions within each of them, and all have multiple rs number +A insertions falling within the repeat region. However, the problems presented by repetitive DNA even to the relatively robust Sanger sequencing meant it was not possible to tell whether these sites were polymorphic in our samples, or whether apparent variance was simply an artefact of slippage during Sanger sequencing or PCR amplification .
It was hoped that running CRISP on data which had been realigned around indels and had had base quality scores recalibrated using GATK would reduce these false positive SNP calls, and for two of the variants next to mononucleotide repeats this was the case, but not for the other one, or for the three spurious “SNPs” called next to indel sites. From this, it can be said that GATK is useful in clearing up some false positive calls, but certainly does not work for all, and a more reliable and less time consuming approach to avoiding these false positives might be to be aware of documented indels and mononucleotide repeats which could cause potential issues, and regard any SNPs called close to these sites with caution.
We thank the Alzheimer’s Research UK (ARUK) consortium for collection of the samples used, and to the patients and families whose participation made this work possible. We also thank ARUK and the Big Lottery Fund for financial support of this project, and ARUK for funding the PhD studentship of Jenny Lord.
The Alzheimer’s Research UK consortium: Peter Passmore, Bernadette McGuinness, Janet Johnston, Stephen Todd, Queen’s University Belfast, UK; Reinhard Heun (now at Royal Derby Hospital), Heike Kölsch, University of Bonn, Germany; Patrick G. Kehoe, University of Bristol, UK; Nigel M. Hooper, Emma R.L.C. Vardy (now at University of Newcastle), University of Leeds, UK; David M. Mann, University of Manchester, UK; Kristelle Brown, Noor Kalsheker, Kevin Morgan, University of Nottingham, UK; A. David Smith, Gordon Wilcock, Donald Warden, University of Oxford (OPTIMA), UK; Clive Holmes, University of Southampton, UK.
There are no actual or potential conflicts of interest to disclose.