|Home | About | Journals | Submit | Contact Us | Français|
Monozygotic (MZ) or “identical” twins have been widely studied to dissect the relative contributions of genetics and environment in human diseases. In multiple sclerosis (MS), an autoimmune demyelinating disease and common cause of neurodegeneration and disability in young adults, disease discordance in MZ twins has been interpreted to indicate environmental importance in its pathogenesis1–8. However, genetic and epigenetic differences between MZ twins have been described, challenging the accepted experimental paradigm in disambiguating effects of nature and nurture.9–12 Here, we report the genome sequences of one MS-discordant MZ twin pair and messenger RNA (mRNA) transcriptome and epigenome sequences of CD4+ lymphocytes from three MS-discordant, MZ twin pairs. No reproducible differences were detected between co-twins among ~3.6 million single nucleotide polymorphisms (SNPs) or ~0.2 million insertion-deletion polymorphisms (indels). Nor were any reproducible differences observed between siblings of the three twin pairs in HLA haplotypes, confirmed MS-susceptibility SNPs, copy number variations, mRNA and genomic SNP and indel genotypes, or expression of ~19,000 genes in CD4+ T cells. Only two to 176 differences in methylation of ~2 million CpG dinucleotides were detected between siblings of the three twin pairs, in contrast to ~800 methylation differences between T cells of unrelated individuals and several thousand differences between tissues or normal and cancerous tissues. In the first systematic effort to estimate sequence variation among MZ co-twins, we did not find evidence for genetic, epigenetic or transcriptome differences that explained disease discordance. These are the first female, twin and autoimmune disease individual genome sequences reported.
We sought to assess the magnitude of genetic, epigenetic and transcriptomic differences in CD4+ lymphocytes from MS-affected and unaffected MZ twin sibships (Supp. Fig. 1). CD4+ T cells are involved in the pathophysiology of MS (MIM 126200)1. mRNA, genomic DNA (gDNA) and reduced-representation, bisulphite-treated gDNA were prepared from negatively isolated, CD4+ T lymphocytes from three pairs of adult, MZ twins who were discordant for MS (‘001, affected; ‘101, unaffected). Affected individuals fulfilled McDonald criteria for MS diagnosis13. Lack of sibling affectation was assessed through clinical evaluation, and, for twin 041896-101, confirmed by magnetic resonance brain imaging, and cerebrospinal studies. MZ twin pair 041896 was female, of Ashkenazi Jewish origin and beyond the susceptibility age-range for MS at time of study (Supp. Table 1). Twin pair 230178 was female and African American, while twins 041907 were white males. Individual 041896-001 had onset of MS at age 30 years, and is currently in the secondary progressive phase; individuals 230178-001 and 041907-001 had MS onset at ages 38 and 13, respectively, and have relapsing-remitting disease. Molecular typing of HLA loci showed identical genotypes within the three twin pairs (Supp. Table 1). Only co-twins 041907 had DRB1*1501, the strongest genetic susceptibility factor for MS14.
Nucleic acid samples were sequenced by sequencing-by-synthesis with reversible-terminator chemistry15–18. mRNA was prepared from blood samples drawn on different days from twin pair 041896 to ascertain sampling variance. Fifty to 68 million, high-quality, 36–44 nt, singleton sequences from each of eight mRNA samples were aligned to the NCBI human genome reference and read-counts-per-gene were calculated18–20 (Supp. Table 2). Sequencing to this depth (median relative transcript coverage of 5.0-fold and 6.4-fold for 041896-001 and 041896-101, respectively) allowed determination of diversity of the polyadenylated transcriptome in CD4+ lymphocytes: ~92% of 20,601 genes with exon annotations were expressed, as assessed both by aligned reads and the upper asymptote of the best-fit sigmoid curve (Supp. Table 2, Supp. Fig. 2). The distribution of transcript abundance was a left-skewed, bell-shaped curve with >7 log10 dynamic range (Supp. Fig. 2), in agreement with a previous study17. Digital gene expression values correlated well with exon-resolution array hybridization results (Supp. Fig. 3), in agreement with another report21. Surprisingly, diagnosis or treatment of MS accounted for only 9.4% of variance in transcript abundance in T cells of MZ twins, compared with 57.3% attributable to twin pair-to-twin pair differences, 23.6% to day-to-day variation (as assessed in twin pair 041896 alone) and 3.5% to sequencing lane-to-lane variation, Supp. Figs. 4–7). The variance in transcript abundance attributable to MS was within the range of variances obtained by random permutation of MS diagnosis labels (Supp. Fig. 8, Supp. Table 3). Thus, robust gene expression differences were not observed between MS-affected and unaffected twins in CD4+ lymphocytes that were inexplicable by other variables.
One billion, high-quality, shotgun, whole genome sequences were generated from twins 041896-001 and -101, corresponding to 21.7- and 22.5-fold aligned coverage, and representing 99.6% and 99.5% of the NCBI human reference genome, respectively (Suppl. Table 4). Comparisons of genome coverage of the twins with the AK1 genome, which was determined using identical procedures, revealed no individual coverage bias15 (Supp. Fig. 9 and 10).
Viral infection has been suggested to contribute to the etiology of MS. Upon re-alignment of unaligned sequences to 2864 viral genomes, ~0.02% of DNA reads from twins 041896 and 0.2% of RNA reads from the three twin pairs matched 310 viral genomes. A large majority of these alignments reflected simple sequence repeats or endogenous retroviral sequences. Upon reverse-transcription and PCR, no reproducible differences were found between sequences aligning to viral genomes in T cells from MS-affected and unaffected individuals.
Approximately 3.6 million SNPs and ~0.2 million indels were detected in subject 041896-001 and -101 genomes, using optimized criteria, which are similar to values reported for male genomes (15 and references therein; Supp. Table 5). Indels varied in size from −31 to +8 nt, with an approximately normal frequency distribution. Of 13 common risk variants previously associated with MS susceptibility14, co-twins 041896 were homozygous for five, heterozygous for five, and three were absent. This genetic load is predicted to increase risk for development of MS ~8-fold under an additive model (Supp. Table 6). Co-twins 230178 were homozygous for seven susceptibility loci and heterozygous for two, and co-twins 041907 were homozygous for eight risk alleles and heterozygous for two, conferring 14-fold and 43-fold increased risk, respectively (Supp. Table 6). These data should be interpreted cautiously since translation of genetic burden into risk for complex disorders is rudimentary. Clustering of 9.9 million SNPs in eight individual genome sequences showed close similarity of the twin 041896, female genomes and their separation from six male genomes (Supp. Fig. 11).
SNP genotype differences were sought between affected and unaffected twin siblings in genomic DNA and mRNA (Supp. Fig. 1). Firstly, stringent bioinformatic filters were trained both to call SNPs in aligned genome and mRNA sequences and to infer SNP genotypes, by comparing genotypes obtained from duplicate Affymetrix 6.0 SNP array hybridizations with those derived from genome and mRNA sequencing (Supp. Tables 7 and 8; Supp. Fig. 12)15. These filters excluded low coverage or repetitive genomic sequences (<11; –fold or >44-fold coverage, respectively), yielding high positive predictive values (PPV) to enable meaningful co-twin comparisons. Secondly, these filters were used to determine SNP genotypes in aligned genomic sequences of twin pair 041896 and in aligned mRNA sequences of the three twin pairs. Thirdly, identities and differences in inferred SNP genotypes were sought between affected and unaffected twin siblings. Co-twin genotype differences were categorized either as changes from homozygous reference allele to heterozygote, or heterozygote to homozygous variant (Table 1). Of 1,089,550 SNP genotypes inferred in genomes 041896 using these filters, 3,241 (0.3%) differed between twins (Table 1). Of over 730,000 genomic SNP genotypes determined by duplicate array hybridizations, 126 (0.02%), 153(0.02%), and 120 (0.02%) differed between siblings in the three twin pairs, respectively, which was considerably less than ~8,500 SNPs that were discordant between repeated hybridizations of individual DNA samples (Supp. Table 9). mRNA sequencing covered ~65.6 Mb of annotated exons to a depth of ~5-fold. Three hundred and twenty two (0.6%), 1,017 and 380 SNP genotypes inferred in mRNA sequences differed between siblings of twin pairs 041896, 230178 and 041907, respectively (Table 1). Finally, replication of co-twin SNP genotype identities and differences was sought. No differences in SNP genotypes inferred by one approach were recapitulated by a second method. In contrast, >98% of SNPs that were identical in twin siblings and genotyped by two methods (array hybridization, mRNA sequencing or genomic DNA sequencing) were replicated (Table 1). Furthermore, Sanger resequencing revealed identical genotypes in twin pair 041896 for a set of 15 SNP differences well supported by at least one method.
The SNP genotyping filters were also used to infer indel genotypes in genome and mRNA sequences of the twins: 91.9% of indels detected in both genome and mRNA sequences had identical genotypes (Table 1). Of 26,908 indel genotypes inferred in the genomes of twins 041896, 213 (0.8%) differed between siblings. Of 1,322, 1,073 and 407 indel genotypes inferred in mRNA sequences from twins 041896, 230178 and 041907, 8, 39 and 10 differed between twin siblings, respectively (Table 1). No indel genotype differences identified by one approach were recapitulated by a second method. In summary, siblings in three MZ twin pairs exhibited no replicable nucleotide variation differences in non-repetitive sequences, as assessed by genome and mRNA sequencing and SNP array hybridization. Much longer reads and lower error rates will be required to evaluate variation differences in repetitive sequences comprehensively. Detection of no replicable SNP genotype differences between siblings of any of the three twin pairs in peripheral CD4+ T cells accords with estimated rates of somatic mutation of 8.4 × 10−9 to 4.6 × 10−10 per nt per generation in human tumors, Saccharomyces cerevisiae and Drosophila melanogaster22–24.
Expression quantitative trait loci (eQTL) are emerging as a molecular mechanism for common SNPs that are significant in genome-wide association studies of disease25. In light of an absence of significant MS-associated genotypic or mRNA expression differences between twins, we sought allele specific differences in mRNA expression. For heterozygous coding SNPs (cSNPs), the expression of both alleles in CD4+ lymphocytes was measured to address deviation from the 1:1 expected ratio (allelic imbalance). 268 heterozygous cSNPs exhibited allelic imbalance in cis at 188 loci in twin 041896 transcriptomes, as determined by significant deviation of aligned genomic and mRNA read counts (Supp. Table 10). Single base mismatches do not cause systematic bias in GSNAP alignments. Two imprinted genes showed altered allelic expression in both co-twins (ZNF331 and GNAS) as did three genes that exhibit altered allelic expression in human cerebellar cortex (ABLIM1, UBE2I and KIAA1267, Kingsmore et al., unpublished) and two that previously have shown altered allelic expression in CD4+ lymphocytes (the MS-associated gene CD6 and acid trehalase-like 1 [ATHL1])14,26. We used quantitative PCR to validate each of the three possible outcomes: a) where both twins showed an expected 1:1 ratio of allelic expression; b) where both twins show skewed expression of an allele in the same direction and magnitude, indicative of cis-acting eQTL or imprinting, and; c) where the direction or magnitude of the imbalance differed between the twins (Supp. Fig. 13). Interestingly, 115 (43%) cSNPs differed between twins (i.e. differential allelic expression; Fig. 1, Supp. Table 10). These results suggest that some gene expression differences between twins represent chromatid-specific alterations in transcription. Variance in allelic expression between samples mirrored that observed in overall mRNA levels, with twin pair-to-twin pair accounting for 51.2%, day-to-day variation for 27.7% and MS diagnosis for 8.0% of variance. No cSNPs showing allelic imbalance were shared among the three twin pairs. Interestingly, however, cSNPs that show allelic imbalance were significantly closer to transcription factor binding sites than random SNPs, providing a novel, potential mechanism of action.
Structural variants were identified in the six genomes by hybridization of duplicate arrays. In contrast to a recent report, we found no copy number variants or allelic gains / losses that differed between siblings in any twin pair12. Twins 041896 displayed 143 structural variants comprising 89 Mb, twins 230178 exhibited 13 variants comprising 3 Mb and twins 041907 had 58 variants encompassing 33 Mb (Supp. Fig. 1, 14, 15, Suppl. Table 11). Of note, seven structural variants were common to all three twin pairs, and changed the copy number of two genes (Late Cornified Envelope-3B (LCE3B) and T Cell Receptor Gamma Chain Alternate Reading Frame Protein (TARP)) and one pseudogene (A Disintegrin And Metalloprotease-6 (ADAM6), Supp. Table 12). LCE3B was not expressed in T cell mRNA samples from these patients. TARP was expressed at a level of 12.9 ± 6.1 reads/million (mean ± standard deviation) and did not show altered expression in MS. These genes have not previously been associated with MS.
An additional axis of heritable genetic information in human genomic DNA is cytosine methylation, which serves several functions including regulation of gene expression, silencing of retrotransposons, genomic imprinting and X-chromosome inactivation and has been implicated in several diseases27,28. We sought to compare genome-scale DNA methylation profiles between twin siblings at nucleotide resolution. We aligned 50 – 90 million, high-quality, 50 nt, reduced representation bisulphite sequences (RRBS) from ten samples – the three pairs of twin T lymphocytes, normal lung and lung cancer, and normal breast and breast cancer16 (Supp. Table 13). For twins 041896, these corresponded to 45.5- and 32.7-fold coverage of 1.4 million uniquely aligning, non-repetitive MspI fragments, and 2,146,620 and 2,033,078 CpG dinucleotides from -001 and -101 genomes, respectively (Table 2). Bisulphite conversion of non-CpG cytosines was >99%. Almost identical numbers of CpG sites were identified in the forward and reverse strands, as expected (Supp. Table 14). As reported for mouse, methylation levels of CpG dinucleotides in human T cells displayed a bimodal distribution, with most being unmethylated or methylated (>95% of reads in either state) [Fig. 2a, Supp. Fig. 16]16. Approximately one quarter of CpGs were methylated. Over 90% of CpG sites were common to siblings within each twin pair (Table 2). CpGs aggregated into clusters (corresponding to CpG islands16) at a ratio of 1.58 – 1.74 CpGs per cluster. Over 92% of CpG clusters were common to siblings within each twin pair (Table 2 and Supp. Table 14). Highly congruent results were obtained with two alignment algorithms (Suppl. Table 14; Suppl. Figs. 17 and 18) and two reference genome datasets. Of ~2 million CpGs represented by ≥10 high-quality reads in twins 041896, only two showed a switch between siblings from ≤20% methylated to ≥80% by ELAND and four by GSNAP, none of which was supported by both methods (Fig. 2b, Table 2). Likewise, 10 of 1.7 million CpG sites in twins 230178 and 176 of 1.7 million CpG sites in twins 041907 showed a switch in methylation by ELAND (Fig. 2c,d, Supp.Table 15). Two CpG methylation switches between affected and unaffected siblings were common to twin pairs 230178 and 41907, albeit with opposite directions of change (>80% → <20% mCpG in 041907-001 and -101, respectively, whereas <20% → >80% mCpG in 230178-001 and -101, at a CpG site 9912 nt 5’ of TMEM1 and 8536 and 10,659 nt 5’ of PEX14). To put these findings in context, we evaluated the magnitude of methylation changes in CD4+ T cells from unrelated individuals, between tissues and between normal and cancerous tissue. 586 – 827 inter-individual <20% → >80% CpG methylation differences were observed (Fig. 2e,f); 4255 – 7180 CpG methylation shifts were observed between T lymphocytes, lung and breast tissues (Table 2, Fig.2i,j). Breast and lung cancers revealed 1,557 and 16,509 CpG methylation shifts, when compared with normal breast and lung tissue, respectively (Fig. 2g,h, Table 2). A second pattern of change in CpG methylation was observed in comparison of male and female samples: 394 CpGs were <5% methylated in 041907-001 T lymphocytes (male) but 20 – 50% methylated in 041896-001 (female). Likewise, 406 CpG sites were <5% methylated in 041907-101 (male) and 20 – 50% methylated in 041896-101 (female). Of these, 385 and 389, respectively, mapped to Chromosome X, consistent with female X inactivation (Fig. 2e). Similarly, a very large number of CpG sites that were <10% methylated in normal lung were 20–70% methylated in lung cancer (Fig. 2h). A previous study has shown epigenetic differences between dizygotic twins to be qualitatively greater than between MZ twins29. Here we show the magnitude of epigenetic differences between MZ twin sibling CD4+ lymphocytes to be at least an order of magnitude less than those between individuals and ~3 orders less than those observed between tissues and in malignant transformation.
In summary, the recent GWAS-identification of novel risk loci is opening a broad window into genetic intricacies underpinning complex diseases. Although genetic knowledge remains incomplete, a new generation of sequencing and analytical tools may prove to hold great potential, as shown herein. Likewise, a discordant MZ twins study controls for many genetic and non-genetic confounders, enhancing the tractability of mechanisms in complex disorders. We sought genetic, epigenetic or transcriptomic differences between CD4+ T cells of twin siblings that might explain MS-discordance. While MS is a neurologic disease, T cells are fundamentally involved in its pathophysiology1. However, no reproducible differences in SNPs, indels, CNVs, gene expression levels or sequences aligning to viral genomes were detected between CD4+ T cells of co-twins. To provide analytical rigor, SNP and indel differences were sought using at least two different approaches and CNV experiments were performed in duplicate. However, analysis of nucleotide variants was limited in scope by exclusion of low coverage regions and repetitive sequences (since the latter cannot be reliably interrogated by alignment of short reads or array hybridization), by moderate sensitivity for detection of structural variants of size 50 – 1500 nt (which fall between the resolution of sequencing and array hybridization), and limited feasibility to detect possible somatic mosaicism. A previous study has shown differences in selection of T-cell receptors after antigen stimulation between MZ twins discordant for MS30. Quantitative analysis of T-cell repertoire or immunoglobulin locus recombination was not possible at ~22X depth of aligned coverage. Progress in single molecule sequencing technologies with longer reads and deeper coverage should overcome many of these limitations in the future, as would examination of additional cellular compartments of innate and adaptive immunity. Additionally, deep RRBS revealed very few changes in CpG methylation between CD4+ T cells of twin siblings and no differences common to two or more twin pairs. It should be noted, however, that RRBS was limited to the investigation of dramatic shifts in CpG methylation in a relatively broad population of T cells. Other epigenetic mechanisms, differences within lymphocyte subsets, mono-allelic differences or other tissues were not examined. These caveats aside, however, MZ twins lacked genetic, epigenetic or transcriptomic differences in T cells to explain MS-discordance. A number of tantalizing, novel, differences were detected that will require replication and additional studies: 43% of eQTLs had a different direction or magnitude of imbalance in twin siblings. In summary, a singular genetic, epigenetic or transcriptomic mechanism underpinning MS-discordance in MZ-twins was not detected in a study of unprecedented resolution. While disease discordant MZ twins appear to provide a framework for analysis of complex disorders that has fewer variables, additional stratification and/or concomitant measurement of multiple data types may be necessary to yield molecular mechanisms underpinning disease.
The study was approved by the UCSF Institutional Review Board. Informed, written consent was obtained from all individuals. CD4+ lymphocytes were isolated from peripheral blood and nucleic acids extracted with standard methods. Two samples were obtained on different days from twins 041896 and single samples from the others. HLA typing was by AlleleSEQR (Atria Genetics) and Assign SBT software (Conexio Genomics). Genome-wide genotypes and CNVs were detected with Affymetrix 6.0 arrays in duplicate. Log-R ratios were generated with Affymetrix Genotyping Console 3.0.2 and analyzed with Nexus software (BioDiscovery Inc., El Segundo, CA). Short- and long-insert, paired-end libraries were generated from gDNA, mRNA and reduced-representation, bisulphite-treated gDNA as described15–18. Paired-end and singleton, 36–130 nt reads were generated using Illumina GAIIx instruments. Sequences were aligned principally to NCBI reference genome build 36.3, with GSNAP and tolerance of 5% mismatches15. SNPs, indels and gene expression were analyzed with Alpheus using filters trained with array results15,18–20: Genomic SNP calling filters were >20% and >4 uniquely aligning reads with average quality score (Q) ≥20 (Supp. Table 7). mRNA SNP calling filters were Q ≥20, presence in ≥20% and ≥2 reads and ≥1 uniquely aligning read. Nucleotides with coverage 11–44X and Q ≥20 were genotyped according to frequency cutoffs in Supp. Table 8; Genotype differences were called where frequencies differed by >50%. eQTLs were detected by allelic mRNA read counts differing from equality with χ2 p-values of <10−7. Gene expression was assessed by log2-transformed aligned read counts. Putative SNP differences were validated by Sanger sequencing and putative gene expression differences using Affymetrix Human Exon 1.0 ST arrays. Putative eQTLs and virus alignments were validated by quantitative PCR (with allele specificity for the former). Statistical analysis used JMP-Genomics (SAS Institute, Cary, NC) or R (http://www.R-project.org).
Genome-wide genotypes (>900,000 SNPs) and CNVs (~1.8 million probes) were detected with Affymetrix 6.0 arrays (Santa Clara, CA). Genomic DNA from each individual was tested on duplicate arrays. Log-R ratios (normalized probe intensities) were generated with Affymetrix Genotyping Console 3.0.2 and analyzed with Nexus software (BioDiscovery Inc., El Segundo, CA), which identifies CNVs with a circular binary algorithm using intensity data from all probes, and allele ratios from SNP probes.
mRNA-Seq and whole genome shotgun sequences were aligned to the NCBI reference genome (build 36.3) with GSNAP and tolerance of 5% mismatches15,20 (Supp. Table 2 and 4). For definition of exon boundaries, annotations from RefSeq Transcript (downloaded 9/2/2008) and from 5,224 non-redundant UCSC transcripts (downloaded 4/13/2009) were appended to Build 36.3 of the reference human genome. Long (75 – 130 nucleotides) genomic reads were found to align poorly using these criteria, due to low terminal quality scores and higher rates of mismatch. Therefore, unaligned long, genomic, paired reads were further aligned to the NCBI reference genome with GSNAP by trimming to paired 75 nucleotides (nt) and tolerance of ≤10 mismatches.
mRNA-Seq and whole genome shotgun reads not mapping to the human genome were aligned to 2,864 NCBI viral genome sequences (release 35) with GSNAP and tolerance of 5% mismatches. Alignments were visualized using Alpheus20 and CMTV31. High likelihood true alignments were identified on the basis of:
Putative, novel viral sequences with average quality scores (Q) ≥20 were assembled by ABySS32 or by reference-guided assembly with AMOScmp-shortReads-alignmentTrimmed33. Default parameters were used. Contigs were aligned to the NCBI nr database using BLASTN 2.2.21.
Upon alignment of mRNA-Seq reads, read counts were calculated per gene for each lane of sequence and log2 transformed. Distribution analysis (Supp. Fig. 4) and Mahalanobis differences (Supp. Fig. 6) were assessed for log-transformed read counts from each lane of mRNA-Seq and outlier lanes were removed. Principal component analysis (Supp. Fig. 6) and variance decomposition of principal components was undertaken for log-transformed read counts from each lane to assess sources of variability in gene expression (Supp. Fig. 7). Since Diagnosis (MS-affected versus non-affected) accounted for 9.4% of variance, all possible permutations of lanes of sequence were examined to determine whether Diagnosis-associated variance was greater than a random permutation (experimental design file in Supp. Table 3). Principal component analysis and variance decomposition of principal components were repeated with log-transformed read counts from each lane for each permutation to assess permuted diagnosis-associated variance in gene expression (Supp. Fig. 8). Since true Diagnosis-associated variance was not greater than permuted variance, genes differing between MS-affected and unaffected individuals were not assessed by weighted ANOVA.
Treatment of DNA with bisulfite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. Thus, alignments of 50 bp, singleton, reduced representation bisulphate sequences (RRBS) to the human genome are complicated by the simplification of the genetic code from four to three bases, except at methylcytosine (mC) locations. Eland-extended performs alignments of the first 32 nt of a read with up to two substitutions, and then extends the alignment with unlimited mismatches. Alignment of 3-base reads (following conversion of residual cytosines to thymidines in the RRBS reads) to a 3-base genome (following conversion of all cytosines to thymidines) with Eland-extended resulted in many non-unique alignments. In order to circumvent this problem, we made use of the fact that all RRBS start at an MspI site (which comprise the majority of CpG residues and large majority of CpG islands16). Thus, 3-base reads were aligned to a 3-base version of ~3.7% of the human genome, comprising 2.3 million MspI fragments of up to 50 nt in length, derived from the NCBI human genome sequence, version 36.3, totaling 113 Mb in length (Supp. Table 16). The fragments were of two types: 133,609 fragments of 30 – 50 bp that were flanked by MspI sites on both ends and 2.2 million 50 bp fragments with a 5’ flanking MspI site (representing genomic MspI fragments of greater than 50 bp in length). Only unique alignments with Phred-like scores >4 (greater than 50% likelihood of being correct alignments) and only those starting with a 5’ thymidine (base 1 of a converted MspI fragment) were retained (Supp. Table 13). Alignments to fragments of less than 50 nt terminated at the end of the fragment. Eland does not align to MspI fragments of less than 30 nt in length. Following alignment of converted reads, thymidine residues were corrected to their original sequence in the RRBS and reference, and C-to-T transitions were identified. Percent methylation for CpG sites was scored by the ratio of C/(C+T) calls for each C that was followed by a G. Percent conversion of C to T when followed by another base was used for estimation of bisulfite conversion rate, and was > 99.8%.
RRBS were also aligned with GSNAP to the NCBI human genome reference sequence, version 36.3, allowing 5% mismatches and without penalizing C-to-T transitions (Supp. Table 13). Since GSNAP reports only the best alignments (those with the fewest mismatches) using the entire 50 nt alignment, unique alignments were possible using the entire genome without penalizing C-to-T transitions. % methylation was assessed for CpG sites with at least 10-fold coverage, based on all alignments (i.e. not restricted to unique). Only CpG sites within MspI fragments were considered. For identification of differences between subjects from “largely methylated” to “largely unmethylated”, we sought positions where there was at least 80% cytosine in one subject and less than 20% cytosine in the other.
GSNAP is a short-read alignment program based on GMAP that employs a hash table and a compressed version of the reference genome, which is constructed once for that genome34. The reference may include arbitrary contigs (up to 4 billion), so that one may also align to a reference transcriptome, with redundancy allowed among the contigs. The hash table contains the locations of a given 12-mer in the genome, subject to sampling. The sampling step occurs during pre-processing of the genome, so that genomic locations are stored only for every third 12-mer in the genome. Sampling is needed to reduce the memory footprint of the program below 4 gigabytes for a human-sized genome. GSNAP can handle short reads of > 24 or more nt, with each read in the input potentially having a varying length. There is theoretically no upper bound on the length of the query sequence, except that this bound is compiled into GSNAP by default at 200 nt; longer sequences can be handled simply by changing this constant at compile time.
GSNAP has specialized algorithms for identifying exact mappings, one-mismatch mappings, multiple-mismatch mappings, and indel mappings (including a user-specified number of mismatches). Exact mappings are identified by taking the intersection of genomic positions over a spanning set of 12-mers in the query sequence. The spanning set must contain 12-mers in the same phase modulo 3, to account for the sampling used in pre-processing the genome, so the program must test each of the three possible phases. For spanning set members that overhang the ends of the query sequence by 1 or 2 nt, the relevant genomic positions can be obtained by substituting 1 or 2 wildcard nt, respectively, and taking the union of genomic locations in the hash table.
Candidates for one-mismatch mappings are similarly identified by computing an incomplete intersection, in which one 12-mer in the spanning set does not contain the given genomic location. These candidate genomic mappings are then compared against a compressed version of the genome to verify that only one mismatch was present.
Candidates for multiple-mismatch mappings are determined by processing a sorted list of genomic locations from all 12-mers in the query sequence. This sorted list is computed efficiently using a heap-based priority queue. For each candidate genomic location, a floor on the number of mismatches can be computed from the pattern of query positions of the 12-mers that match the genomic location. Candidates with a sufficiently low floor (based either on a user-specified limit or on the best mapping determined so far) are then compared against the compressed genome to determine the actual number of mismatches.
For identifying indel mappings, GSNAP accumulates partial genomic alignments during the multiple-mismatch algorithm, where a partial alignment can be supported by a single 12-mer in the query sequence. These partial alignments are then scanned in genomic order to identify pairs that are sufficiently close to constitute a candidate indel, where the default distances are 30 nt for an insertion and 12 nt for a deletion. These candidate pairs are then compared against the compressed genome to determine the number of mismatches. To identify indels occurring at either end of the query sequence, the program computes floors that exclude the 12-mers on either end. Candidates with a sufficiently low floor are then compared against the compressed genome to identify a possible indel at the end and to count the actual number of mismatches.
Although GSNAP allows repetitive regions of the genome to be masked before building the genomic data structure, in typical usage (as herein) the genome is not pre-masked. Therefore, GSNAP is able to align sequences to redundant regions in the genome, including repetitive regions, and report all such alignments. In default mode (as herein), the program reports only the best alignments, those with the fewest mismatches, although the program also can be run to identify and report suboptimal alignments. GSNAP differs from ELAND in that it processes the reference genome first, constructs a hash table of the genome, and then aligns the short reads to the genome. In contrast, ELAND processes the short reads first, constructs a hash table of the short reads, and then scans the genome to find matches.
SNP detection in Illumina GAII sequences is complicated by relatively high sequencing error rates, particularly at nucleotides 50 – 130 using the chemistry and base calling software available during the first half of 2009. SNP genotyping in Illumina GAII sequences is complicated by a continuous, albeit trimodal, distribution of frequencies of SNP- and reference sequence-containing reads at a given location (Supp. Figure 12). In order to translate SNP- and reference sequence-containing read frequencies into genotypes and to understand the sensitivity and specificity of SNP detection and genotyping, comparisons between array-based SNP genotypes and sequencing results were performed extensively. Unambiguous SNP genotypes from duplicate array hybridizations (with SNP calls and concordant genotypes in both replicates) were assessed to be true. Subsets of SNPs common to Affymetrix 6.0 arrays and sequence datasets were identified. Optimal SNP genotyping filters (those with maximal positive predictive value (PPV) and near-optimal sensitivity) for each sequence dataset were identified by determining the number of true positives, false positives and false negatives and determining the PPV and sensitivity of all combinations of the following criteria: number of reads calling the SNP, number of uniquely aligning reads calling the SNP, % reads calling the SNP, average quality score (Q), and minimum quality score. To detect changes in SNP genotype, each possible genotype in a diploid genome was modeled (homozygous reference allele, heterozygote, and homozygous variant allele) and the optimal change in allele frequency was determined. Resultant filters are shown in Supp. Tables 7 and 8. These methods represent a refinement of those used previously15, and which were extensively validated by Sanger resequencing and genotyping arrays.
Allele-specific expression in mRNA sequences was identified by methods similar to those described25. Frequencies of frequencies of SNP- and reference sequence-containing reads at a given heterozygous location in mRNA sequences are continuous, albeit unimodal (Supp. Figure 12), reflecting both random reference and variant-containing read sequencing, effects of clonal reads and allele-specific expression. Unambiguous heterozygous SNP locations in each individual were determined based on duplicate array hybridizations (with SNP calls and concordant genotypes in both replicates) and by the SNP calling criteria developed above. Allele-specific expression effects were assessed by application of genome-wide p values to significance testing of deviation from 50:50 read frequencies. Artifactual allele-specific expression associated with enrichment of clonal reads was evaluated for many, putative allele-specific expression SNPs by visualization of start and stop sites of reads using Alpheus. Artifactual allele-specific expression associated with bias in GSNAP alignment of reads containing or lacking specific SNPs was evaluated as discussed above.
We deeply thank the study subjects. This work was supported by grants from Small Ventures USA Inc., A.J. Brass Foundation, Nancy Davis Foundation, NIH grants RR016480 to F.D.S., RO1NS26799 to S.L.H., RO1NS46297 to J.R.O. and NMSS grants RG3060C8 and RG2901D9 to J.R.O. S.E.B is a Harry Weaver Neuroscience Scholar of NMSS. A deo lumen, ab amicis auxilium.
Full Methods are available in the online version of the paper at www.nature.com/nature.
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
Author Contributions S.E.B., G.P.S., J.R.O., S.L.H and S.F.K. designed the project. S.F.K., S.E.B., J.M. and J.R.O. wrote the paper with input from the other authors. S.E.B., J.M., J.C.V., L.Z., R.W.K., G.D.M., J.E.W., S.J.C., J.P.M., R.G., M.J.P., L.E.C., E.E.G., F.D.S., J.J.H. and S.L. performed the experiments. S.E.B., J.M., J.C.V., P.K., I.K., N.A.M., L.Z., A.D.F., C.J.B., T.R., S.L., P.K., T.D.W., G.P.S., J.R.O., S.L.H. and S.F.K. analyzed the data. S.L.H., J.R.O. and O.A.K. supervised patient recruitment.
Data is deposited at dbGaP under accession phs000239.v1.p1. Reprints and permissions information is available at www.nature.com/reprints.