Cell lines, RNA and genomic DNA, expression profiling, and bisulfite treatment
Genomic DNA of GM06990 (a HapMap/ENCODE sample) was obtained from Corriell Cell Repository. With the approval of Harvard Medical School’s Institutional Review Boards, blood and skin biopsies were obtained from donors of the Personal Genome Project. The EBV-transformed B-lymphocyte cell lines and the derivative genomic DNA for donors PGP1 (GM20431) and PGP9 (GM21833) were generated and acquired from Coriell Cell Repository. Genomic DNA obtained directly from Corriell was used for methylation analysis of these lines, cultured cell lines were used for gene expression profiling. The primary fibroblast lines for PGP1 and PGP9 was generated by and obtained from Brigham Women’s Hospital. The cultured cell line was used for both genomic DNA and gene expression profiling.
The PGP1 iPS line and two PGP9 iPS cell lines were derived by infecting primary human fibroblasts of PGP1 and PGP9 with highly concentrated retroviral OCT3, KLF4, SOX2 and c-MYC particles39
. The infected cells were trypsinized onto a feeder layer after 4 days and maintained in hES median (KO-DMEM (Invitrogen), 20% KO-SR (Invitrogen), 1X L-glutamine (Gibco), 1× MEM NEAA (Gibco), 1× pen/strep (Gibco), 55µM mercaptoethanol and 10 ng/ml bFGF). The iPS colonies were identified by their characteristic morphology after 3–4 weeks.
Immortalized lymphocytes were cultured in RPMI-1640 medium (Invitrogen) with 10% FBS (Invitrogen) and 2mM L-Glutamine (Invitrogen). Primary fibroblasts were cultured in DMEM/F12 medium (Invitrogen) with 15% FBS and 10 ng/µl EGF. Human iPS cell lines were grown on a feeder layer of mouse embryonic fibroblasts (Global Stem) in hES media, and mechanically separated from mouse cells prior to DNA/RNA extraction.
Genomic DNAs and total RNAs were extracted with AllPrep DNA/RNA/Protein Mini Kit (Qiagen). RNA gene expression profiling was done using Illumina’s bead array technology through the service provided by Harvard Partner Center for Genetics and Genomics. Bisulfite treatment was performed using the EZ DNA Methylation-Gold Kit (Zymo Research). Typical yield was 50–75% after bisulfite conversion.
Bisulfite padlock probe design and synthesis
Files for genomic sequence for the ENCODE pilot project regions was obtained from UCSC. Potential locations were chosen from nonrepetitive sequence containing 10 bases with a 5’ CpG flanked by least 20 bases of CpG-free flanking sequence on each side. Flanking “arm” sequences were designed for either bisulfite-treated strand, up to 28 bases in length, avoiding CpGs and targeting a Tm range of 50–55°C. The “ligation arm” was required to contain at least 3 non-CpG cytosines, and a guanine content of at least 20% was required of both arms. Probes were then selected to optimize uniqueness measurements based on 15mer frequencies and BLAST searches for near matches. To avoid self-hybridization, no overlap between probes was allowed in the final set. Final probe sequences were 106bp in length: two arms 28bp long (random sequence to 28bp if necessary) and a 50bp common “backbone” sequence. The final set of 9,552 probe sequences and locations as well as number of observations and methylation estimates from each sample is provided in Supplementary Table 1
Probes were synthesized using a programmable microarray (Agilent technologies) as 150bp oligos containing common end sequences. These were cleaved off and collected in a single tube with an estimated concentration of 0.18 fmol/species. To amplify, we took 1% of the oligos and performed real time PCR, monitoring the amplicon in a 100µl reaction assembled with Platinum Taq supermix, 50pmol of each primer, and 0.5× SYBR green. One of the primers was designed to contain phosphorothioates between the first four 5’ bases and a 3’ uracil. The other primer contained the sequence “GATC” at the 3’ end. The PCR program was: 95°C for 5 min, 15 cycles of 95°C 30 sec / 58°C 1 min / 72°C 1 min, and finally 72°C for 5 min. The PCR product was purified with Qiagen PCR purification kit and quantified. Using a 96-well plate, a total of 9.6 ml PCR reaction was set up with 25 fmol template along with Platinum Taq supermix, 4.8 nmol of each primer, and 0.5× SYBR Green. The same PCR program was used. PCR products were there purified by phenol:chloroform followed by Qiagen PCR purification kit and a total of 37 µg of DNA was obtained.
The PCR product was split into eight reactions with 10 units of lambda exonuclease (NEB) in 1× lambda exonuclease reaction buffer and incubated at 37°C 45 min then 75°C 15 min. After being purified with QiaQuick coloumns the ssDNA was quantified with Nanodrop to be 33 ng/µl in 200 µl total. This was split into four tubes, each of which was assembled with 50 µl of ssDNA (33 ng/µl), 6 µl of 10× DpnII reaction buffer, and 2 µl of 100 uM “guide oligo” designed to hybridize to the 3’ end of the ssDNA and ending in “GATCNN”. The mixture was heated to 95°C for 5 min, followed by a ramp to 60°C at 0.1C / sec, 60C for 10 min, then 37C for 1 min. Into each tube, 5 µl of DpnII (10 units/µl) (NEB) and 5 µl of USER enzyme (1 unit/µl) (NEB) were added and these were incubated at 37°C for 3 hours. The final product was loaded into 6% TBE Urea precast polyacrylamide gels (Invitrogen) and the desired band was cut and purified. The final concentration of padlock probes was quantified on a gel to be 9 ng/µl, which is 257 nM (27 pM for each of 9,552 species).
CpG padlock capturing and sequencing library construction
1 µg (~ 0.5 amol of haploid) bisulfite-treated genomic DNA was assembled in a 15 µl reaction with 1× Ampligase buffer and 33.5 ng (~ 1 pmol) of probes. The reaction was incubated at 95°C for 10 min, ramped to and held at 64°C for 5 hours, then 65°C for 5 hours, then 60°C for 24 hours. At 60°C we added the gap filling and sealing mix: 2 µl of Ampligase storage buffer containing 0.5 pmol of dNTPs, 2 units Taq Stoffel fragment (Applied Biosystems), and 2.5 units Ampligase (Epicenter). The reaction was then incubated at 60°C for 2 hours, then cycled 5 times with 95°C for 2 min / 60°C for 5 hours. The temperature was then lowered to 37°C and 2 µl of Exonuclease I (20 units/µl) (USB) and 2 µl Exonuclease II (200 units/µl) (USB) were added. The reaction was incubated at 37°C for 2 hours followed by 94°C for 5 min.
The circularized probes were amplified using primers matching the backbone sequences in two 100 µl reactions containing 10 µl of the above reaction product, 50 µl of 2× iQ SYBR Green supermix (Bio-Rad), and 40 pmol each primer. Real time PCR was used to monitor the reaction, which used this program: 96°C 3 min, 5 cycles of 96°C 15 sec / 60°C 30 sec / 72°C 30 sec, then 13 cycles of 96°C 15 sec / 72°C 1 min / 72°C 1 min, then 72°C for 5 min. A 6% TBE polyacrylamide gel was used to purify the band containing the final library molecules.
BSPP library sequencing and analysis
Libraries were diluted to 10 nM and each was sequenced with one lane of an Illumina Genome Analyzer. Reads were matched with BLAST to a custom database containing the predicted reads, with CpG cytosines replaced with “N”, and accepted only if they had no mismatches in the 10 bp span (except the masked CpG cytosines) and not more than three mismatches elsewhere. Methylation was determined by the number of “C” reads out of all reads for a given location.
To validate methylation levels determined by padlock probes we designed primers targeting 33 of the profiled locations in bisulfite treated DNA, performed PCR amplification and Sanger sequencing of the PCR product. The methylation level of each site was determined using the ratio of “T” peak at the target location compared to neighboring non-CpG “T” peaks, with peak height determined using PeakPicker software40
. This is similar to the principle applied in the commercially available software ESME41
. Because we performed multiple sequencing reactions and from both directions, multiple estimates were combined to get the average and standard deviation values we plotted for each site.
RNA expression data was gathered using the ENCODE project PolyA+ RNA signal track downloaded from UCSC. Using scores for regions annotated as exons by RefGene, median values were taken to represent gene expression level. To construct average gene graphs, each methylation data point was assigned position information according to its location relative to nearby genes: a fractional value if within a gene, or bp if upstream or downstream. The running median and quartiles were plotted.
Histone modification data was acquired from Sanger ChIP data downloaded from UCSC. To look for correlations, raw ChIP scores vs. methylation were plotted along with the running median and quartiles. Gene profiles of histone modifications were also created as done for methylation data.
Methyl sensitive cut counting (MSCC) library creation
Two custom adapters were created for MSCC, each composed of two oligonucleotides ordered from IDT. “Adapter A” contains an 5’ MmeI recognition site and 5’ CG overhang, “adapter B” contains a 3’ NN overhang.
To construct the MSCC HpaII library, 2 µg of PGP 1 lymphocyte gDNA were assembled into a 100 µl reaction with 20 units HpaII (NEB) in 1× NEBuffer 1, incubated at 37°C 2 hours, then 65°C 20 min. To this was added 1.66 µl of 10 µM adapter A, 12 µl 10 mM ATP, and 120 units T4 DNA ligase (NEB). This was incubated at 16°C 4 hours, then 65°C 15 min. Ethanol precipitation was performed and DNA was resuspended to 50 µl with a reaction mixture containing 8 units Bst DNA polymerase fragment (NEB), 200 µM dNTP, and 1× thermopol buffer (NEB). This was incubated at 50°C for 20 min, then 85°C for 20 min. Ethanol precipitation was performed again, and the pellet was resuspended to 50ul with a reaction mixture containing 2 units MmeI (NEB), 50 µM SAM and 1× NEBuffer 4. This was incubated at 37°C for 2 hours, then 80°C for 20 min. To this was added 1.66 µl of 10 µM adapter B, 6 µl 10mM ATP, and 3 µl T4 DNA ligase, and the mixture was incubated at 16°C for 4 hours, then 65°C for 15 min.
The mixture was run on a 6% non-denaturing TBE polyacrylamide gel (Invitrogen) and the target band at ~140 bp was purified. PCR was then performed on ~80% of this purified sample using primers matching the sequences of adapter A and adapter B. The assembled mixture was 100 µl containing 500 nM of each primer, 200 µM dNTPs, 1× HF buffer and 2 units iProof (Bio-Rad) and run with the cycle: 98°C for 30 sec, 8 cycles of 98°C 10 s / 67°C 15s / 72°C 15s, then 72°C for 5 min. PCR product was purified with QiaQuick PCR clean-up kit.
The MspI control library was constructed in the same manner as the HpaII library, with the following changes: (1) in the first step 40 units of MspI (NEB) were used in place of HpaII and NEBuffer 2 was used instead of NEBuffer 1; and (2) no amplification was done after gel purification.
The “inverse library” was constructed in this manner: HpaII digestion was performed as done in the HpaII library. After this, 10 units Antarctic Phosphatase (NEB) and 11ul 10× Antarctic Phosphatase Buffer (NEB) were added to the mixture, which was then incubated at 37°C for 1 hour, and 65°C for 15 min. DNA was purified with phenol:chloroform followed by ethanol precipitation. The DNA was then resuspended and treated in the same manner as the MspI control library.
MSCC sequencing and read placement
In total, three lanes of sequencing were performed using an Illumina Genome Analyzer: two for the first technical replicate and one for the second technical replicate. These reads each contained sequence from the adapters and an 18–19 bp “tag” derived from genomic sequence.
To match sequences, a list of all possible tags was created from all CCGG sites in the human genome (hg18, downloaded from UCSC). Tags were considered “unique” (later used for profiling) if no identical or single-mismatch tags existed, the neighboring Hpa
II site was at least 40bp distant, and there were no conflicting Mme
I recognition sites. An in house program was used to find all tag matches within 0, 1, or 2 single base distances. Reads were accepted if they were an exact match and no single mismatches could be made, or if there was no exact match, a single mismatch and no double mismatch matches existed. The number of times a particular location was matched by a read is it’s “counts” or “observations”, and these are provided in Supplementary Table 7
MSCC data analysis
To validate MSCC data we compared it with BSPP data collected for a set of 381 shared CpG locations (726 total tags) to get “counts vs. methylation” information. These data points were binned according to methylation to form 20 bins with 36 or 37 data points each and the average counts vs. average methylation was plotted. We expect average counts to be linearly related to methylation with the equation: methylation = a * counts – 1. A best fit for this equation to the average data points was produced with a = −0.1124. This was used to infer methylation when plotting average counts information.
Positions relative to genes for each MSCC site were calculated as before, using the RefGene list from UCSC. For multiple possible starts/ends, only the first entry was used. Using expression data genes were split into five equally sized groups based on gene expression levels. Running averages of MSCC counts were made for each graph: an interval of 5000 data points for , an interval of 5000 data points and 500bp minimum window size for 2c and 2d and 500 data points with 500bp minimum and 2000bp maximum windows for . Counts were normalized for local CpG density (surrounding 200bp), for MspI control library counts, and, for the in-gene locations in , for gene length.
To analyze promoters based on CpG density, promoters were split into three types based on CpG density. Looking within the interval of −0.5kb to +2kb relative to transcription start (based on refGene annotation): high CpG promoters (HCPs) contain a 500bp interval with a GC content of at least 0.55 and a CpG observed/expected ratio of at least 0.75, low CpG promoters (LCPs) contained no 500bp interval with a CpG observed/expected ratio of at least 0.48, and all remaining promoters were defined as intermediate CpG promoters (ICPs). Of 17,546 promoters analyzed, 11,445 (65%) were defined as HCP, 2,849 (16%) were defined as ICP, and 3,252 were defined as LCP (28%).
Methylation profiles for individual genes were created by finding average MSCC counts in the promoter region (defined as −400 to +1000bp) and in the gene body (defined as +3000bp to the end). Only genes with at least 10 MSCC data points in each region were plotted.