Fly strains and husbandry
Flies were raised using standard medium at 25°C unless stated otherwise. The bam1
/TM3 stock was obtained from Bloomington Drosophila
Stock Center. The bam114-97
/TM6B stock was a gift from Dr Margaret Fuller. The bam1/bam114-97
testis was used for immunostaining with antibodies against a germ cell-specific mark Vasa [61
] and a somatic cell-specific mark Traffic jam (Tj) [62
] to demonstrate that most of the cells in this tissue are germ cells (Additional file 12
Culturing S2 cells
The Drosophila S2 cells (ATCC, CRL-1963, Lot#5054622, Manassas, VA, USA) were cultured following the manufacturer's protocol. Briefly, one vial of S2 cell stock was taken from liquid nitrogen and thawed quickly at room temperature. Once the cells were completely thawed, they were transferred to a 25 cm2 flask (Corning, CLS430639-20EA, Lowell, MA, USA) containing 5 ml of room temperature complete Schneider's Drosophila medium (GIBCO, #11720-034, Carlsbad, CA, USA; contains 10% heat-inactivated fetal bovine serum (GIBCO #16140-071)). After incubating at 25°C for 30 minutes, the S2 cells were centrifuged at 1,000 rpm, the medium was removed and the cells were transferred into a 25 cm2 flask (Corning, CLS430639-20EA) containing 5 ml of room temperature complete Schneider's Drosophila medium. The S2 cells were harvested at the exponentially growing stage after incubating at 25°C for about 90 hours.
Microarray experiments and data analysis
Total RNA from approximately 200 pairs of bam (bam1/bam114-97) fly testes was extracted using TRIzol (Invitrogen, #15596-018, Carlsbad, CA, USA) following the manufacturer's instructions and the genomic DNA was degraded using 2 Units of DNase I (Fermentas, #EN0521, Glen Burnie, MD, USA) at 37°C for 20 minutes. Total RNA from Drosophila S2 cells was extracted using the Qiagen RNeasy Mini Kit (catalogue number 74014, Valencia, CA, USA), and the genomic DNA was degraded with RNase-Free DNase (catalogue number 79254). RNA integrity was checked by gel electrophoresis (1% agarose). Approximately 4 μg of total RNA from each biological replicate were used to generate labeling probes to hybridize with the Affymetix GeneChip Drosophila Genome 2.0 Array according to the Affymetrix protocol. Three biological replicates were performed for bam testes and two biological replicates for S2 cells.
Microarray hybridization was processed at the Genomics Core Facility at the National Heart, Lung and Blood Institute and the raw data were exported from the Affymetrix Microarray Suite (MAS). The CEL files were used for signal normalization with RMA as part of the limma package from the Bioconductor R packages [63
]. The 'Present (P)', 'Absent (A)' and 'Marginal (M)' calls were retrieved with the code of 'eset' in the Affy package.
Extraction of RNA using bam testes or S2 cells was performed using similar methods as described for microarray experiments. For the total RNA from bam testes (8.5 μg) and S2 cells (approximately 20 μg), we performed two rounds of mRNA isolation using Dynabeads mRNA purification kit (Invitrogen, #610-06), according to the manufacturer's instructions. The final mRNAs were eluted in 13.5 μl 10 mM Tris-HCl (pH 7.5) and immediately used to generate the first strand cDNA, using 4 μl random hexamers (ABI, #N8080127, Foster City, CA, USA) and SuperScript II Reverse Transcription Kit (Invitrogen, #18064-014) in a 30 μl final volume, following the manufacturer's instructions. The second strand cDNA was generated with the following recipe: 10 μl 5× second strand buffer (500 mM Tris-HCl pH7.8, 50 mM MgCl2, 10 mM DTT), 30 nmol dNTPs (Invitrogen, #18427-013), 2 Units of RNase H (Invitrogen, #18021-014) and 50 Units of DNA Pol I (Invitrogen, #18010-025). The entire reaction mix was incubated at 16°C for 2.5 hours. The double-stranded DNA (dsDNA) was purified with a QIAquick PCR purification kit (Qiagen, #28106) and the concentration was quantified using a Qubit fluorometer (Invitrogen).
To generate sequencing libraries, about 300 ng dsDNA from each sample was fragmented by sonication using Bioruptor (Diagenode, UCD-200-TM-EX, Sparta, NJ, USA) using medium power output for 30 minutes in ice water. The resulting DNA fragments were analyzed by agarose gel to verify they were within the approximately 100 to 300 bp size range. Sequencing libraries were prepared as follows: end-repair (DNA end-repair kit from Epicenter, #ER0720, Madison, WI, USA); A-tailing (300 ng dsDNA, 5 μl Thermo buffer, 10 nmol dATP, 15 Units of Taq polymerase, at 70°C for 30 minutes); Solexa adaptor ligation (300 ng dsDNA, 4 μl DNA Ligase buffer, 1 μl Solexa adaptor mix, 3 ul DNA Ligase, at 70°C overnight); PCR (98°C 10 s, 65°C 30 s, 72°C 30 s for 16 cycles; then additional 72°C for 5 minutes) amplification with adaptor primers and size selection (200 to 400 bp). Then the library dsDNA for S2 cells was used on an Illumina Genome Analyzer II at a concentration of 10 ng per lane.
We obtained 20,041,035 and 9,780,523 total reads from an Illumina Genome Analyzer II for bam testis and S2 cell samples, respectively. And 10,163,916 (bam testis) and 6,263,318 (S2 cell) unique and non-redundant reads were used for downstream data analysis.
The Gene Expression Omnibus accession number for the raw and analyzed RNA-seq data is [GEO:GSE19325].
Comparison of microarray results with RNA-seq data
To compare the RPKM value from RNA-seq data with microarray results, we first retrieved 12,728 Drosophila genes from the Ensembl database, which also have probe(s) in the Affymetix GeneChip Drosophila Genome 2.0 Array. Genes with multiple probes were filtered out if different probes gave inconsistent 'Present (P)' or 'Absent (A)' calls. We then analyzed the RPKM distribution for genes with all 'Absent' calls or all 'Present' calls in microarray datasets, for both bam testis and S2 cell samples.
The histograms were generated using the 'hist' function in the R programming environment (R version 2.5.0 [64
]). To calculate the log2
RPKM values of individual genes, all their original RPKM values were added a pseudo-count of 1.
For each modified histone and Pol II ChIP experiment, we dissected approximately 200 pairs of bam testes in cold phosphate-buffered saline (PBS) and grouped them in 200 μl PBS that contained protease inhibitor (Roche complete mini, #11836153001, Nutley, NJ, USA) and 0.5 mM phenylmethanesulfonyl fluoride (PMSF; MP Biomedicals, #195381, Solon, OH, USA). Approximately 1,000 cells could be extracted from one bam testis. We then added 5.5 μl 37% fresh formaldehyde (Supelco, #47083-U, Bellefonte, PA, USA) and incubated at 37°C for 15 minutes. The testes were washed twice with 450 μl cold 1× PBS (with inhibitors and PMSF). Then 200 μl lysis buffer (50 mM Tris-HCl, pH7.6, 1 mM CaCl2, 0.2% Triton X-100, 5 mM butyrate, 1× protease inhibitor cocktail, and 0.5 mM fresh PMSF) was added and the tissues were homogenized thoroughly followed by incubation at room temperature for 10 minutes. Chromatin was sheared into approximately 200-bp fragments by sonication using Microtip (Misonix, Inc., Microson XL-2000, Farmingdale, NY, USA) with the following procedure: 4 s at power 20, rest for 50 s, 4 to 5 times, followed by spinning at 14k rpm for 10 minutes at 4°C. The chromatin was diluted 10× with RIPA buffer (10 mM Tris, pH7.6, 1 mM EDTA, 0.1% SDS, 0.1% Na-Deoxycholate, 1% Triton X-100, with protease inhibitors and PMSF) and 50 μl of this dilution was reverse cross-linked with 0.25 M NaCl for 2 hours at 65°C and used as input for real-time PCR analysis.
We washed 40 μl of Dynabeads Protein A (Invitrogen, #100.01D) with 600 μl 1× PBS. We then added 100 μl 1× PBS with 4 μg antibody and incubated the antibody-Protein A beads mixture at room temperature for 40 minutes with occasional tapping. After the unbound antibody was removed using the manipulator (Invitrogen, DYNAL MPC-S), 1 ml of the chromatin extract was added to the beads and the mixture was rotated at 4°C overnight. The beads were then washed twice with 1 ml RIPA buffer, twice with 1 ml RIPA buffer containing 0.3 M NaCl, once with LiCl wash buffer (0.25 M LiCl, 0.5% NP40, and 0.5% sodium deoxycholate), once with 1 ml TE (10 mM Tris-HCl, pH 8.0 and 1 mM EDTA) containing 0.2% Triton X-100, and once with 1 ml TE. The beads were then suspended in 100 μl 1× TE containing 3 μl 10% SDS and 5 μl 20 mg/ml proteinase K, followed by incubation at 65°C overnight. After the supernatant was collected, the beads were washed once more with 100 μl TE with 0.5 M NaCl. The supernatant from this wash was combined with the previous supernatant. The combined samples were treated by Phenol/Chloroform extraction, salt/EtOH precipitation, and dissolved in 50 μl 1× TE. The products were either used for real-time PCR analyses or processed for Solexa sequencing according to the established protocol. Antibodies used include those against H3K4me3 (Abcam, #ab8580, Cambridge, MA, USA), H3K27me3 (Millipore, #07-449, Billerica, MA, USA), H3K36me3 (Abcam, #ab9050), H3 (Abcam, #ab1791) and RNA Pol II (Abcam, ab5408).
ChIP experiment using S2 cells
Exponentially growing S2 cells were harvested and dissolved in digestion buffer (50 mM Tris-HCl, pH7.6, 1 mM CCl2
, 0.2% Triton X-100, 5 mM butyrate, 1× protease inhibitor cocktail and 0.5 mM PMSF). Chromatin was prepared and ChIP-seq experiments were performed as described previously [27
] with antibodies against Pol II (Abcam, #ab5408), H3K4me3 (Abcam, #ab8580), H3K27me3 (Millipore/Upstate, #07-449), and H3K36me3 (Abcam, #ab9050).
Solexa pipeline analysis
The 25-bp sequencing reads were obtained from the Illumina Genome Analyzer pipeline. All reads were aligned to the Drosophila
genome (dm3) using the ELAND (Efficient Local Alignment of Nucleotide Data) software, allowing up to two mismatches with the reference sequence. Only uniquely mapped reads were retained. For multiple identical reads, at most three copies were retained to reduce the possibility of biases from PCR amplification. The output of the Genome Analyzer pipeline was converted to browser extensible data (BED) files. The wig files used for visualization on the UCSC browser were generated from the uniquely mapped reads using a 4-bp window and 160 bp as the DNA fragment size, as previously described [27
]. The size of the DNA fragment was determined by the distance from the 5' to the 3' peak of the mapped reads, as shown in Additional file 13
The Gene Expression Omnibus accession number for the raw and analyzed ChIP-seq data is [GEO:GSE19325].
Defining genomic regions for analyzing modified histone and Pol II enrichment
To determine the regions used for modified histone and Pol II occupancy of the annotated transcripts in Figure , we first plotted each histone modification and Pol II using the entire annotated transcripts that are applicable for ChIP-seq analysis. Based on the overall enrichment plot, we used the region from -250 to +250 bp (the TSS was defined as 0) to calculate the Pol II enrichment of individual transcripts. A 0 to +500-bp window was used for H3K4me3 and H3K27me3, and a +500 to 1,500-bp window was used for H3K36me3 enrichment calculations. For genes with a transcript size longer than 1 kb, a +500 to +1,000-bp window was defined as the gene body region to calculate the stalling index.
Drosophila genes used for ChIP-seq analysis
genes used for ChIP-seq analysis were derived from the UCSC database (April 2006/BDGP R5/dm3), which contained 14,058 genes and 21,243 transcripts. The coordinates for these transcripts were downloaded (August 2009) from the UCSC table browser [65
]. Compared to human and mouse genomes, the Drosophila
genome is more gene-dense, including many overlapping genes and short distances between TSSs, which may lead to incorrect conclusions from ChIP-seq analysis. To avoid this, we chose transcripts for analysis using the following three steps. First, exclude transcripts shorter than 500 bp (transcript size is defined as the distance between the annotated transcription start and end sites) because these short transcripts will affect the promoter analysis for H3K4me3 and H3K27me3. A total of 740 transcripts (736 genes) were excluded using this criterion. Second, exclude overlapping transcripts from different genes (Additional file 14
) and transcripts that have short distances between their TSSs; 6,497 transcripts (4,535 genes) overlap with at least one other transcript, but 3,832 transcripts (2,555 genes) among them were useful for promoter analysis (-250 to +250 bp, or 0 to +500 bp), because the overlapping regions do not affect promoter analysis of these genes. A total of 3,558 non-overlapping transcripts (2,729 genes with opposite transcription direction) have a distance between TSSs shorter than 400 bp, which will affect the Pol II promoter analysis (-250 to +250 bp), and were thus excluded. Third, after removal of certain transcripts using the above two criteria, 14,849 transcripts (9,459 genes) were retained for ChIP-seq promoter analysis. In total, 7,799 genes were applicable for stalling index analysis, and 6,400 genes for H3K36me3 enrichment analysis. Among these 9,459 genes, 2,612 have multiple transcripts, but only one transcript for each was used for ChIP-seq analysis.
Choosing the dominant transcript for multi-isoform genes
For genes with multiple isoforms, we used the most abundant one for the comparison between ChIP-seq and RNA-seq data.
bam Pol II versus testis RPKM, and S2 Pol II versus cell RPKM
1, The dominant transcript was determined as that with the highest number of Pol II ChIP-seq reads in the -250 to +250-bp window with respect to the TSS region. If multiple transcripts have the same number of reads in this region, criterion 2 is used. 2, For transcripts that are longer than 1 kb, the dominant transcript was determined as that with the highest number of ChIP-seq reads in the gene body (+500 to +1,000-bp with respect to the TSS). For those transcripts that are shorter than 1 kb, the dominant transcript was determined as that with the highest number of ChIP-seq reads for H3K4me3 in the region 0 to +500 bp with respect to the TSS. If multiple transcripts have the same numbers of reads, criterion 3 is used. 3, The dominant transcript was determined as that with the longest transcript. If all isoforms have the same length, one was chosen randomly.
bam H3K4me3 versus H3K27me3, bam H3K4me3 versus testis RPKM, S2 cell H3K4me3 versus H3K27me3, and S2 H3K4me3 versus RPKM
1, The dominant transcript was determined as that with the highest number of H3K4me3 ChIP-seq reads in the region 0 to +500 bp with respect to the TSS. If multiple transcripts have the same number of reads, criterion 2 is used. 2, For transcripts that are longer than 1.5 kb, the dominant transcript was determined as that with the highest number of ChIP-seq reads for H3K36me3. For transcripts that are shorter than 1.5 kb, and thus are not applicable to calculate H3K36me3 reads, the dominant transcript was determined as that with the highest number of ChIP-seq reads for Pol II in the region -250 to +250 bp with respect to the TSS. If multiple transcripts have the same number of reads, criterion 3 is used. 3, The dominant transcript was determined as that with the longest transcript. If all isoforms have the same length, one was chosen randomly.
bam H3K36me3 versus testis RPKM, and S2 cell H3K36me3 versus RPKM
1, For the 6,400 genes that have at least one transcript longer than 1.5 kb, the dominant transcript was determined as that with the highest number of H3K36me3 ChIP-seq reads in the region +500 to +1,500 bp with respect to the TSS. If multiple transcripts have the same number of reads, criterion 2 is used. 2, The dominant transcript was determined as that with the highest number of ChIP-seq reads for H3K4me3 in the region 0 to +500 bp with respect to the TSS. If multiple transcripts have the same number of reads, criterion 3 is used. 3, The dominant transcript was determined as that with the longest transcript. If all isoforms have the same length, one was chosen randomly.
Comparison of ChIP-seq results with the RNA-seq data
To generate the plots in Figure , we retrieved annotated Drosophila
genes from the Ensembl database for ChIP-seq analysis. We classified them into silent and expressed genes according to RPKM value (genes with RPKM = 0 were classified as the silent group, and genes with RPKM ≥ 1 were classified as the expressed group). The expressed group was further classified into low, moderate and high groups based on RPKM values. The coordinates of these transcripts were downloaded from the UCSC table browser [65
]. The read density was calculated in a 5-bp window across the genome.
Comparison of our ChIP-seq data with the published ChIP-chip data
We compared our 91 bivalent genes in bam
testis with the ChIP-chip data using BG3 and D23 cells [37
]. We first downloaded both the H3K4me3 and H3K27me3 sgr files from the NCBI website. We retained the probes whose ChIP/input hybridization intensity ratios are ≥ 2, to be consistent with the authors' definition of enriched regions. We then mapped the probes to the promoter regions (0 to +500 bp) of our 91 bivalent genes. For BG3 cells, we found 6 out of the 91 bam
testis bivalent genes contained an enriched H3K4me3 signal; and 4 of these 91 bivalent genes contained an enriched H3K27me3 signal. However, none of these 91 bivalent genes is enriched with both H3K4me3 and H3K27me3 in BG3 cells. Similarly, for D23 cells, 5 and 11 out of these 91 bam
testis bivalent genes contained enriched H3K4me3 and H3K27me3 signals, respectively. Again, none of these 91 bivalent genes is enriched with both H3K4me3 and H3K27me3 in D23 cells.
We also checked the ChIP-on-chip data using Drosophila
] and found 4,893 H3K4me3 and 2,480 H3K27me3 enriched regions. However, only 161 of them overlap with each other. We then compared the 91 bivalent genes we identified in bam
testis with these 161 bivalent regions in embryos. From this comparison, we found only two genes (CG4637
) for which the promoter region (0 to +500 bp) overlaps two bivalent regions in embryos (chr3R: 18965718-18967662 and chr3R: 4158335-4159501).
Determination of the P-value of enrichment of ChIP-seq reads within a 500-bp window
Since we used a 500-bp window to detect enrichment of Pol II and histone modifications (for example, active H3K4me3 and repressive H3K27me3), a sequencing read count threshold was chosen according to the Poisson distribution, which distributes the total and unique reads randomly across the Drosophila genome that can be mapped. For every 500-bp window with read number ranging from 1 to 99, the P-value was calculated. The threshold was chosen as the minimal number of reads that reached a significant enrichment compared to the random distribution (that is, P < 0.05). For example, for the total 1,342,075 reads of anti-Pol II ChIP-seq in bam testis, a 500-bp window containing 11 read counts has a P-value of 0.03, which passes the threshold P-value of ≤ 0.05. We then set the threshold of significant enrichment within a 500-bp window in this data set to be 11 read counts.
Analysis of the RNA Pol II stalling index
To analyze the RNA Pol II stalling index, we modified a method that was adapted from published work [39
]. Basically, the stalling index reflects differential Pol II binding at the promoter region versus the gene body region. Therefore, we defined the -250 to +250-bp region around the TSS as the promoter region (with respect to the TSS), and the +500 to +1,000-bp region (with respect to the TSS) as the gene body region. To calculate the stalling index, we first counted the total Pol II reads at the promoter region and at the gene body region. A stalling index was defined as the ratio of total reads in the promoter region divided by the total reads in the gene body region. Based on the stalling index, we classified genes into active, stalled or no Pol II categories based on the following criteria: active Pol II genes, stalling index ≤ 3 and significant Pol II enrichment (P
< 0.05); stalled Pol II genes, stalling index ≥ 5 and significant Pol II enrichment (P
< 0.05); no Pol II genes, no significant Pol II enrichment (P
Scatter plot analysis
The scatter plots delineate comparisons of different chromatin modifications in bam
. All plots were generated in the R programming environment (R version 2.5.0 [64
]). Transformation to the log value was used to compare chromatin modifications with RPKM values.
Box plot analysis
The distribution of gene expression level was analyzed using box plots in the R programming environment (R version 2.5.0 [64
]). The box represents the 25th and 75th percentiles, with the 50th percentile as a black bar. The whiskers refer to outliers that are at least 1.5× the interquartile range from the box. The y-axis represents the RPKM value.
Identification of Pol II or modified histone-enriched regions using SICER
To identify the significantly enriched regions, we used SICER software [52
] with the following parameters: window size = 200 bp, gap size = 0 bp and E-value = 100 for Pol II and H3K4me3; window size = 400 bp, gap size = 0 bp and E-value = 100 for H3K27me3 and H3K36me3. The reason we do not allow any gap is due to the density of genes in the Drosophila
Sequencing depth analysis
To analyze the sequencing depth of ChIP-seq, we first shuffled the reads and their corresponding genomic loci, then extracted subsamples (2.5%, 5%, 7.5%, and so on until 100% of the total unique reads), and then identified the enriched regions in each subsample using SICER software as described previously. The E-value was increased from the first subsample (E-value = 3) to the last subsample (E-value = 120) by an increment of 3. We then plotted the correlation between subsamples and the enriched Pol II or modified histones in the corresponding regions, as shown in Additional file 2
Gene Ontology assay
The gene function ontology analyses were performed using the DAVID 2008 informatics tools [67
], based on the Gene Ontology Consortium [68
]. All Ensembl annotated genes were used as a background comparison. Two particular Gene Ontology annotations (molecular function and biological process) were analyzed with a cutoff P value of < 0.01.