High-throughput, quantitative DNA methylation analysis is crucial for the study of the normal physiology of the epigenome and its dysregulation in disease. The need for such assays is increasing as a means of validating the many emerging genome-wide cytosine methylation techniques that use microarray or massively parallel sequencing technologies. While the MassCLEAVE assay, as commercialized by Sequenom, has great potential for meeting this need, especially when compared with other methodologies such as clonal bisulphite direct sequencing, in this report we focused on several areas where the MassArray technique continues to have analytical shortcomings.
First, we describe the implementation of an assay design tool which allows the user to visualize each of the four different possible reactions for a given sequence (either T or C cleavage on either the plus or minus strand). The graphical and tabular output allows for a highly detailed picture of the nature of results expected for a given assay, enabling the user to maximize the number of CG dinucleotide sites that can be independently assayed (; Supplementary Fig. 5
). Sequenom provides the EpiDesigner as a tool for obtaining primer pairs to target a given region with multiple overlapping amplicons. EpiDesigner measures the number of CG sites assayable given a combination of cleavage reaction and DNA strands, however it includes no information about the ability to garner independent methylation status for each CG site. Moreover, EpiDesigner weighs all CG dinucleotides equally, whether or not there are specific CG sites that the user is particularly interested in interrogating (e.g. restriction enzyme cut sites). We therefore enabled the ability to label important (‘required’) sequences, and also provide a detailed graphical output for the relative independent ability to assay each CG dinucleotide in a given amplicon.
Measurement of methylation status by MassArray at a single CG dinucleotide is, in its simplest form (i.e. an isolated site whose containing fragment is of a unique molecular weight), simply a ratio of peak heights. When a given fragment contains multiple CG dinucleotides or when a CG-containing fragment has a molecular weight overlap with another CG-containing fragment, measurement of the constituent CG dinucleotides must be taken in aggregate. Coolen et al.
described a modification of MassCLEAVE analysis where the accuracy of measurement can be improved for multiple CG-containing fragments, and further described the potential analysis of allele-specific methylation in these fragments. Additionally, EpiTYPER implements a Monte Carlo-type estimation of methylation states in cases where one or more CG-containing fragments overlap (without overlap due to non-CG fragments) (Deciu, 2008
). This approach is of particular value as a measurement of confidence for average methylation values at each overlapping CG site. Despite these improvements, however, no current analytical approaches ascertain independent methylation values for CG dinucleotides that share a single fragment or are obscured by molecular weight overlaps with other CG-containing fragments.
As shown here and described elsewhere (Coolen et al.
), polymorphisms in the DNA sequence analyzed can be a significant problem for the MassCLEAVE assay, particularly when dealing with outbred rodent strains or clinical samples. The current EpiTYPER software does not allow the interrogation of SNPs, moreover the user is unable to modify the sequence input for a given analysis without completely repeating the experiment. The probability that an amplicon may contain polymorphisms that obscure analysis is variable and depends on the size of the fragment, the density of CG dinucleotides, and the genetic variability of the samples themselves. Our SNP detection tool evaluates their candidacy in an automated fashion. The algorithm makes use of EpiTYPER-identified new peaks and also takes into account missing peak data and average SNR. Currently, we treat each sample and reaction in isolation, but the integration of multiple replicates, as well as data from both the T- and C-cleavage reactions increases the confidence in SNP detection, discriminating genuine SNPs from sequence contamination or experimental artifact. This approach could be combined with that described by Coolen et al.
to match candidate SNPs against the public SNP databases to lend additional certainty to this SNP discovery approach. Finally, it is important to note that the current tool is only designed to identify single nucleotide substitutions or deletions. Insertions of any size, duplications, inversions or multiple base pair deletions may not be detected by this current implementation of SNP discovery.
The bisulphite conversion of DNA is a chemical reaction whose equilibrium state skews to the preferential modification of unmethylated cytosine to uracil. However, the reaction may not run to completion under conditions that depart from the ideal, for example, excessive DNA concentration, sample evaporation, reagent deterioration or other thermodynamic factors (Grunau et al.
; Hayatsu et al.
). While primer design is intended to enrich for completely converted sequences, some proportion of amplicons may contain remnant unconverted (unmethylated) cytosines. Because these remnant cytosines may cause a molecular weight shift in the resultant fragmentation profile, it is essential to measure the extent of bisulphite conversion as the relative signal of unconverted to converted cytosines.
In our hands, the majority of conversion reactions work quite well, with a negligible background (0–2%) of remnant unconverted cytosines detected. However, without directly testing the extent of bisulphite conversion, the methylation readings are inherently unreliable. We show an example of a constitutively hypomethylated locus that appears to be methylated in specific MassArray experiments, due at least in part to incomplete bisulphite conversion, with the control for the same samples showing up to 25% unconverted cytosines (Supplementary Fig. 2
Our new tools also enable automated analysis of multiple individual amplicons that, when taken together, form a larger target region, potentially spanning thousands of base pairs. Without such automation, comparing data across multiple replicates, assays and amplicons can become especially time consuming, scaling in proportion with the complexity and size of the study. Of additional note, the gel excision protocol used for the assays in this study impedes the high-throughput value of the MassArray approach, but tends to improve overall data quality. We also note, however, that gel excision is not a required step, and that the analyses presented perform as well for samples that have not been gel-purified. We generally recommend the use of non-template PCR products (i.e. water controls) in order to identify instances of potentially contaminating signal. For this purpose, we note that primer-dimer levels can be estimated for a given sample (Supplementary Fig. 6
), an additional functionality included with this software.
Statistical analyses for multiple loci are likewise facilitated by this tool, although the t.test()
function is not currently supported for MassArray data. We also developed a novel visualization of methylation data, analogous but alternative to the ‘epigrams’ generated by Sequenom's EpiTYPER software. This visualization enables the user to retain the quantitative aspect of methylation data and also to capture the variability of the data (in the form of error bars) for a given CG site. Alternatively, we provide a function to export methylation data in the form of BED tracks for upload and visualization with the UCSC Genome Browser (Kent et al.
In conclusion, we have implemented a collection of tools for MassArray analysis that provides a number of important improvements. In particular, these new analytical tools allow for the measurement of conversion controls, flag where sequence polymorphisms may be present, and provide for clear assay design and data visualization.