D2 BAC contig and sequence
Using 32 PCR probes spanning the 171.6 – 174.6 Mb interval of chromosome 1, we screened a commercially available D2 strain bacterial artificial chromosome (BAC) library, MM_DBa. End-sequencing of the BACs identified by these PCR probes allowed us to assemble a minimal overlapping contig of 27 D2 BACs (Figure ). The resulting D2 BAC contig spanned a total of 3.1 Mb from 171,509,721 – 174,625,201 bp on chromosome 1 (Build 37). Contig sequence data from the SOLiD and Genome Anlayzer platforms were assembled via realignment to the Ensembl reference sequence [7
] and covered the region surveyed without any large gaps.
Figure 1 BAC sequencing coverage. B6 strain RPCI-23 BACs used for genomic sequencing are denoted as white boxes. D2 strain MM_DBa BACs are indicated as black boxes. Each set of BACs is assembled as an overlapping contig. The 3 Mb (171.6–174.6 Mb) region (more ...)
B6 BAC contig and sequence
To evaluate our realignment strategy, we sequenced the corresponding region from the B6 strain, for which full reference sequence is available [7
]. We prepared a B6 contig based on public end-sequence data for BAC clones from the RPCI-23 library [8
]. The resulting B6 BAC contig spanned 170,806,384 – 174,768,169 bp on chromosome 1 (Figure ). Using both the SOLiD and Genome Analyzer datasets we attained realignment coverage with the exception of two gaps. The first gap (11 kb) was expected since the RPCI-23 BAC library map lacks annotated coverage in this region. The second gap (100 kb) could be due to an error in the mapping of one or two of the RPCI-23 BACs, since we relied upon reported locations of the B6 BAC library clones, or, alternatively, could be due to one or two clones missing from our pools. These gaps were present in both assemblies indicating a template problem rather than a sequencing discrepancy.
Comparison of Applied Biosystems and Illumina sequence realignments
Applied Biosytems and Illumina each performed realignments of the D2 and B6 datasets to the public B6 reference sequence [7
]. The SOLiD platform produces short reads (35 bp) encoded in a color-space format, and Applied Biosystems used their realignment pipeline specially designed to take advantage of this format for color-space read mapping and downstream analysis including SNP detection. The Genome Analyzer platform produces short read (33 bp) datasets with bases encoded in the standard letter representations, so Illumina carried out a different realignment approach and ran their dataset through the Maq [9
] pipeline which includes both realignment and SNP calling procedures. Because both datasets included only short reads, we did not assess potential insertion/deletions.
Comparison of our SOLiD and Genome Analyzer sequence data to the Ensembl reference sequence confirmed that the sequence from both platforms is complete. SOLiD generated 31.8 million reads for the pooled D2 BACs, with 12.4 million (39%) of those mapping to the chromosome 1 target interval. SOLiD generated 32.7 million reads of pooled B6 BACs with 10.5 million (32%) mapping to the chromosome 1 interval. Genome Analyzer generated 14.5 million D2 reads with 9.6 million (66%) mapping to the chromosome 1 interval, and 14.0 million B6 reads with 6.2 million (44%) mapping to the chromosome 1 region. Reads that did not align to the chromosome 1 interval were mostly BAC vector sequence (which was not removed when preparing the DNA), adapters from sequencing chemistry, and bacterial contamination. These are template specific problems and do not reflect upon differences between the sequencing platforms. Additional purification steps would have resulted in fewer unaligned sequences; however, the aligned sequences provided sufficient high read depth coverage, so the exclusion of the unaligned sequences did not adversely affect the present analyses.
Deep sequencing was achieved, with average read depths using the Genome Analyzer of 75 and 51 for D2 and B6, respectively, and 108 and 85, respectively, using SOLiD, for reads aligning to the chromosome 1 interval (Figure ). Figure illustrates the BAC contig composition. Each BAC was pooled equimolarly and, as expected, we saw higher coverage (i.e., two or more times as many reads) in the areas where two or more BACs overlap because more template was included for those regions (Figure ). We observed differences in the masking of repetitive sequences in Applied Biosystems vs. Illumina realignments, suggesting that repetitive regions are problematic for both platforms with short read sequencing. Paired-end sequencing was not performed in the present analyses, but could decipher potential ambiguous, repetitive regions in future analyses.
Figure 2 Read depth for Applied Biosystems' SOLiD (red) and Illumina's Genome Analyzer (blue) realignments. For D2 and B6 strains for the 171.6–174.6 Mb region of chromosome 1, each tick indicates the average read depth within the corresponding 25,000 (more ...)
B6 reference sequence quality
Although comparison of our custom HTS B6 data to the B6 reference sequence confirmed the high accuracy of the reference sequence (≥ 99.998%), we nonetheless identified a small number of discrepancies in the realignments to the reference sequence. Upon assembly, the Applied Biosystems and Illumina realignments of the B6 sequence reads differed from the B6 reference sequence for 41 and 60 residues (Table ), respectively; 29 of these differences were consistent in both datasets when compared to the reference sequence (Figure ). Subsequently, based upon realignment discrepancies called by one or the other platforms, we detected 43 provisional realignment discrepancies between our HTS and the reference sequence. Sequencing on each the SOLiD and Genome Analyzer platforms was completed independently, with assembly algorithms each requiring a minimal depth of 3 reads in order to confirm an allele call. Quality of the nucleotide in question, as well as the quality of the surrounding base calls, was taken into account. Together, these restraints offer high confidence in the B6 allele calls identified by both platforms that differ from the references sequence. Overall, this indicates an exceptionally low discrepancy rate (only 0.001 – 0.002%) between HTS data and the reference sequence and also demonstrates the high quality of the sequence data generated in our analyses. Importantly, our custom HTS sequence data and the B6 reference strain sequence were both from RPCI-23 B6 BACs, which were generated from a pool of five female B6 mice from the Jackson Laboratory. Thus, it is extremely unlikely that strain or template inconsistency contributed to the discrepancies in the realignment of our sequence to the B6 reference sequence.
Comparison of Illumina and Applied Biosystems realignment data.
Figure 3 D2 vs. B6 SNPs and B6 realignment discrepancies vs. reference sequence. Custom HTS of chromosome 1 (171.6 – 174.6 Mb) data reveals D2 vs. B6 PARC SNPs and realignment discrepancies when compared to the B6 reference sequence. A) B6 vs. D2 PARC (more ...)
SNP identification and confirmation
In order to compare the SOLiD and Genome Analyzer platforms under optimal conditions for each platform and to compare in a manner the majority of end-users are likely to employ, the manufacturers used parameters determined to be optimal for SNP calling on their particular platform. The Illumina SNP calling method (Maq) relies upon quality scores and applies a filter based on these qualities, read depth, and neighboring SNPs in order to discriminate between a SNP and a base-calling error. Applied Biosystems calls SNPs based on two-base encoding in color-space allowing for more sensitive discrimination between SNPs and base-calling errors. Illumina's method produced fewer no calls (Ns) than Applied Biosystems' method based upon realignments performed by each vendor (Table ). No nucleotide bias was apparent in the frequency of the calls using the Genome Analyzer and SOLiD platforms. We conclude that because of Maq's probabilistic approach using qualities, more SNPs are called by Illumina/Maq than by Applied Biosystems; however, it remains to be confirmed if Illumina has more false positive calls. It is important to keep in mind that because the vendors used independent mapping and SNP calling approaches, differential results due to the platform cannot be distinguished from those due to the analysis pipeline.
Currently, the Mouse Phenome Database (MPD), which includes dbSNP and Perlegen data among other resources, offers the most inclusive SNP queries for mouse strains, including comparisons of the D2 and B6 strains (see Methods for further explanation of SNP Databases). While, MPD currently annotates 4,527 D2 vs. B6 SNPs in the chromosome 1 interval (171.6–174.6 Mb), custom HTS identified 11,824 SNPs (Figure ) for the same interval (referred to as PARC SNPs because they were sequenced by one or both platforms in work supported in part by the Portland Alcohol Research Center, PARC). 9,152 (77%) of the 11,824 PARC SNPs identified by custom HTS were identified using both the Applied Biosystems and Illumina realignments (i.e., were identified by two independent experiments) and are therefore of very high quality (Figure ). 2,033 (17.2%) PARC SNPs were identified only by Illumina's realignment, and 639 (5.4%) PARC SNPs were identified only by Applied Biosystems realignment. Only 271 (13%) of the Illumina-specific PARC SNPs and 56 (10%) of the Applied Biosystems-specific PARC SNPs confirmed known SNPs in MPD.
The SOLiD and Genome Anlayzer datasets were merged for subsequent comparisons. Our results confirmed 4,161 (92%) of the D2 vs. B6 SNPs currently annotated in the MPD public dataset within the chromosome 1 interval, while 236 (5%) of the SNPs reported in MPD for this interval were determined to be false-positive SNPs. The 130 remaining either lie in gaps in our realignments or had ambiguous (undetermined or low quality) calls in the sequence data. Our results identified numerous SNPs not previously annotated in MPD. In fact, our results identified 7,663 new SNPs, more than doubling the number of D2 vs. B6 SNPs found in this chromosome 1 interval.
This chromosome 1 interval spans a clear haplotype break, resulting in a SNP sparse region (171.6–172.9 Mb) and a more distal SNP dense region (172.9 – 174.6 Mb) (Figure ). The SNP sparse region contains only 16 PARC SNPs based on our custom sequencing, and the SNP dense region harbors 11,808 D2 vs. B6 PARC SNPs. This data further defines the haplotype break between D2 and B6 and, importantly, provides additional genetic markers for fine mapping within the SNP-sparse region which previously was not possible [6
]. Additionally, full SNP annotation will inform future SNP array chips allowing for more precise genotyping.
Figure 4 Density of PARC SNPs and protein coding genes. B6 vs. D2 PARC SNPs are binned in 5000 bp intervals. The blue lines indicate how many SNPs are currently annotated in the public MPD database, and the red lines show how many PARC SNPs were discovered by (more ...)
SNP impact on protein function or expression
We assessed non-synonymous SNPs that result in amino acid changes in proteins because of their clear impact on protein activity. Based on our custom HTS, we found 76 missense (non-synonymous coding) PARC SNPs that changed an amino acid residue between the D2 and B6 strains. 36 of these were new missense SNPs (Table ). Confirmation experiments (e.g., transcriptome sequencing and PCR directed sequencing) and interrogation of additional datasets [6
] already have confirmed 30 of the 36 new missense SNPs (Table ), with the remaining six lacking additional data. In an effort to determine whether the 36 new missense SNPs affect protein function, we used SIFT (Sorting Intolerant From Tolerant) [11
], which uses sequence homology to predict whether an amino acid substitution affects protein function, and PolyPhen (Polymorphism Phenotyping) [12
], which uses multiple sequence alignments and protein 3D structure to predict protein sequence effects on function. Based on these two prediction methods, eight of the new missense SNPs were predicted to affect protein function, although only one of these amino acid changes was predicted to be damaging by both methods (Table ). Our data also confirmed 40 missense SNPs annotated in MPD. Finally, five missense SNPs in MPD were not confirmed by our data, including three that had full sequence coverage in our data and were determined to have identical D2 and B6 alleles. Our data indicate that these three are false positive missense SNPs in the public dataset: Fcgr2b
(Y133F). The other two had ambiguous coverage in our alignments and could not be evaluated: Fcgr3
(G72W) and Itln1
Missense B6 vs. D2 PARC SNPs discovered by custom HTS.
In addition to the new missense SNPs, we identified 7,627 new PARC SNPs in non-coding regions. These SNPs were not previously known or annotated and may regulate gene and/or protein expression. For example, in the Kcnj9
gene, which demonstrates differential transcript expression between the D2 and B6 strains [13
], we identified 18 new SNPs within 2 kb upstream of the transcriptional start site as well as nine new SNPs in the 3' untranslated region (Figure ). We also identified one false-positive SNP annotated in MPD in the 3' untranslated region. Thus, for Kcnj9
and other differentially expressed genes, elucidation of cryptic SNPs could identify nucleotide variation that underlies QTL phenotypic effects.
Figure 5 Kcnj9 B6 vs. D2 PARC SNPs. Kcnj9 is represented with the coding region shown in the gray boxes and the 5' and 3' untranslated regions (UTRs) in the white boxes. Introns are illustrated as V-lines, and upstream region as a horizontal black line. The blue (more ...)
SNP impact on gene expression microarrays
Naturally occurring genetic polymorphisms dramatically impact hybridization based techniques, including gene expression microarray analyses [14
]. With the ability to assess alternative transcript expression, exon microarrays have ten times as many probes as previous gene expression microarrays, so eliminating hybridization bias using SNP masks is increasingly critical. We have developed and applied a complete SNP mask using all of the PARC SNPs found by our custom HTS, which allowed us to rigorously assess differential expression between the D2 and B6 strains. We assessed the impact of 124 SNPs that lie within core probesets on the detection of differential (genotype-dependent) exon expression for genes in the chromosome 1 region of interest using Affymetrix Mouse Exon 1.0 ST array data (for details, please see Mooney et al., companion publication). 629 core
probesets interrogate this interval. When compared to our unmasked data, masked results were consistent for 141 differentially expressed probesets and 437 non-differentially expressed probesets, but indicated 47 false positive and 4 false negative results due to SNPs.
Furthermore, we overlaid D2 vs. B6 SNPs discovered by our custom HTS with the probe locations of all four types of probesets within the chromosome 1 interval (i.e., core, extended, full, and free) for the Affymetrix Exon 1.0 ST gene expression microarray platform. For chromosome 1 (171.6 – 174.6 Mb), there are 8201 probes and 2126 probesets on the Affymetrix Mouse Exon 1.0 ST array, and we identified 861 probes that spanned at least one SNP encompassing 480 probesets or 23% of the probesets that interrogate this interval (unpublished data). Thus, compared to publicly available D2 vs. B6 SNPs, custom HTS identified 60% more probesets that span SNPs.