Biological samples for transcriptional analysis
stocks used in this experiment were the sequenced strains of D. melanogaster
(iso-1) and D. virilis
(15010-1051.87). Flies were maintained in bottle cultures on a rich dextrose medium at 25° and in 12 hr:12 hr light/dark for the duration of the experiment. We infected 50 females of each species with a mixed bacterial culture of Serratia marcescens
and Enterococcos faecalis
by pricking the thorax with a 0.1-mm dissecting pin (Fine Science Tools, Foster City, CA) dipped in bacterial culture, as previously described [39
]. We chose to use a mixed culture, consisting of one Gram-negative bacteria and one Gram-positive bacteria, in order to maximize our ability to detect a diversity of immune-inducible proteins. At 12 hours after infection, infected flies and a sample of 50 naïve flies were frozen in liquid nitrogen. We extracted total RNA from frozen flies using standard protocols (Trizol). After extraction, total RNA was treated with DNase to remove potential genomic DNA contamination, according to the manufacturer's protocols. We synthesized first strand cDNA using oligio-d(T) primers, and then synthesized second strand cDNA, according to standard protocols. Sequencing was done by the Cornell University Life Sciences Core Laboratories Center on an Illumina GA2 sequencer.
Aligning reads to the reference genome
Prior to mapping reads to the reference genome, we filtered low quality, low complexity, and repetitive reads. We first removed any read having fewer than 24 bases with a Phred quality score (Q) greater than 20; this is our 'low quality' filter. Next, we removed any read with low nucleotide complexity (80%+ of the sequence composed of only 2 bases) or repetitive elements (more than half the sequence composed of dinucleotide or trinucleotide repeats). Finally, we removed any reads with a mononucleotide run greater than 24 bases.
After filtering, we did an initial round of mapping to the repeat-masked D. virilis
or D. melanogaster
reference genome with Mosaik, a software program written by the Gabor Marth lab (A. Quinlan and G. Marth, http://bioinformatics.bc.edu/marthlab/Mosaik
, unpublished) that uses a BLAT-like approach. The program hashes the genome into unique n
-mers (where n
, the hash size, can be specified by the user; we used 17 bp), which it uses as seeds to align the sequencing reads to the reference. We required all matches to align for at least 91% of the read length (33 of 36 bp) and have no more than 3 mismatches.
To supplement the mapping from this initial Mosaik run, we took two approaches. First, we noticed that some reads fail to map because low quality ends or partial polyA sequence cause them to fail to pass our alignment length filter. In order to get around this, we trimmed up to 10 bp from the end of any read where average quality across a 5 or 10 bp segment was less than Phred Q20. We also trimmed any mononucleotide run from the end of a read. After trimming, we rejected any read that was shorter than 20 bp, or that failed to pass our QC filters (which we reran on the trimmed sequence). The remaining reads were then rerun in Mosaik, using the same parameters described above.
Finally, we attempted to map the remaining sequences using BLAST. Any reads that passed all our quality control filters, but could not be mapped using Mosaik even after end-trimming, were run through a BLAST pipeline: we used blastn with a word size of 7 and an E-value cutoff of 1 × 10-6, and considered any read mapped if either 1) there was only a single BLAST hit to the reference genome, or 2) there were fewer than 10 hits and the best hit aligned over at least 90% of the read length and had a lower E-value than the next best hit. Any read with more than 10 hits was considered repetitive and was not mapped.
After mapping, we combined the output from both Mosaik runs and the BLAST pipeline to produce a single file for each contig containing the depth of coverage at each base in the genome (using the program ace2dep, from the Marth lab, to convert Mosaik output into depth, and custom Perl scripts to convert BLAST output into depth information). The depth of coverage at each base along a scaffold was then the input to our pipeline to identify expressed regions of the genome regulated by infection.
Identifying regions regulated by infection
We first defined an expressed region as any contiguous stretch of DNA along a scaffold where the minimum coverage of the combined infected and uninfected samples at any one base is 1, and the average combined coverage across the region is at least 10. Based on this definition, we identified 4615 expressed regions in D. melanogaster
(Additional file 1
) and 6737 in D. virilis
(Additional file 2
). The median length of an expressed region in D. melanogaster
is 237 bp (Figure ), compared to a median length in D. virilis
of 104 bp (Figure ). Approximately the same number of reads map to D. virilis
and D. melanogaster
(Figure ), so it is unclear why we identify more regions that are on average shorter in D. virilis
Distribution of lengths of expressed regions. Histogram of the lengths of expressed regions in A) D. melanogaster and B) D. virilis. Solid black lines show the median of each distribution.
To determine the extent to which each expressed region responds to infection, we developed a Hidden Markov Model, with five hidden states representing the degree of induction (highly induced, moderately induced, unchanged, moderately repressed, highly repressed), where the emission probability for each state is the binomial probability of observing x infected coverage given n total coverage at each base pair, and the observed data is the coverage of infected reads at each base. We used the HiddenMarkov package in R to optimize our HMM using the Baum-Welch algorithm, and to calculate the most probable set of states using the Viterbi algorithm. The optimized emission probabilities for each state are given in Table . Before running the HMM, we removed sites with less than 10× coverage pooled across samples, as there is very little power to distinguish between states with so few reads. In order to increase the number of expressed regions for which all sites are assigned to the same state, we tuned the transition probabilities by increasing the probability of remaining in the same state, and decreasing the probabilities of transitioning between states proportionally, so that the highest probability in the matrix was equal to 0.999. Empirically, this tuning appeared to increase the consistency of our results, with fewer expressed regions being assigned to multiple states.
To determine the most likely state for any given expressed region, we weighted the Viterbi estimate of the state of each base in an expressed region by the summed coverage of that base. If one state had a majority of this measure, we assigned the expressed region to that state. For the 5.2% of expressed regions in D. virilis and 13.5% of expressed regions in D. melanogaster for which either the weighted sum assigned to the best state was not at least 50% of the total weighted sum, or for which no single base had sufficiently high coverage to be included in our HMM, we consider the state as "not determined" (Table ).
Associating expressed regions with genes
We used two methods to associate expressed regions with annotated genes. First, we used an automated first-pass method, where we simply asked whether any base in an expressed region also falls into an annotated exon, based on the D. melanogaster release 5.7 annotations and the D. virilis release 1.1 annotations available as GFF files from FlyBase (downloaded 4/24/2008). However, given the 3' bias inherent in oligo(dT) primed cDNAs, plus the relatively low coverage to which we sequenced, our identified expressed regions are short (Figure ). In D. melanogaster this does not pose much of a challenge, as the genome annotation is quite mature and includes fully annotated 3' UTRs. In D. virilis, however, 3' UTRs are generally not annotated, leading to any expressed region that falls entirely in a UTR failing to be associated with any gene. Given the length distribution of 3' UTRs in D. melanogaster (Figure ), we expect a substantial fraction of our expressed regions in D. virilis to suffer from this problem.
Figure 8 Distribution of 3' UTR length in D. melanogaster. Histogram of the lengths of annotated 3' UTRs in D. melanogaster. The solid black line is the median of the distribution; the dashed line is the median length of expressed regions in D. melanogaster; and (more ...)
As a partial remedy, we analyzed the 841 induced genomic regions in D. virilis in more detail. For these 841 regions, we extracted the nearest 3' end of a gene to the start of the expressed region on the positive strand and the nearest 3' end of a gene to the end of the expressed region on the negative strand. We consider one of these expressed regions to be putatively associated with a annotated gene in D. virilis if the smaller distance was less than 500 bp and the larger distance was greater than 1 kb, or if the smaller distance was less than 200 bp and the larger distance was greater than 500 bp. Regions more distant that 500 bp from any other gene were declared "putatively unassociated," and the remaining regions were declared "ambiguous." The number of regions assigned to each class is listed in Table .
Association of induced expressed regions with gene models in D. virilis.
In order to understand why there is a large difference in the fraction of expressed regions that can be associated with annotated genes between the two induced classes in D. virilis, we divided the D. virilis scaffolds into "major" scaffolds (the 23 scaffolds with at least 1 Mb of sequence, which represent 77% of the total D. virilis sequence) and the remaining "minor" scaffolds. Induced regions assigned to state 1 are much more likely to be on a major scaffold than induced regions assigned to state 2 (Odds ratio = 14.3, Fisher's Exact Test P-value = 4.85 × 10-12). As expected, regions on minor scaffolds, irrespective of class, are much more likely to fail to be associated with an annotated gene (Odds ratio = 144.6, Fisher's Exact Test P-value < 2.2 × 10-16). However, the difference between minor and major scaffolds does not seem to fully explain the difference between state 1 and state 2, as even when restricted to just the major scaffolds expressed regions assigned to state 1 are more likely to be associated with genes (Odds ratio = 4.4, Fisher's Exact Test P-value = 0.0027). It could be that highly induced genes are more likely to have homologs in D. melanogaster, increasing the probability that those genes would be annotated in D. virilis. However, among just the regions that are associated with genes, it is actually state 2 that is more likely to have homologs in D. melanogaster (based on fuzzy reciprocal BLAST calls downloaded from FlyBase under the "Genomes FTP" section; Odds ratio = 3.60, Fisher's Exact Test P-value = 0.0176). Because of the difficulties in annotating genes on minor scaffolds, we have limited our primary analysis to the 33 expressed regions in state 1, plus the 166 expressed regions in state 2 that are on major scaffolds: this sample of 199 expressed regions includes 101 that can be associated with an annotated gene, as described above, and 98 that cannot.
Sample preparation and iTRAQ labeling for proteomic studies
We extracted hemolymph from approximately 200 D. virilis
females 24 hours after infection with a mixed culture of E. faecalis
and S. marcescens
(infected sample) and 200 uninfected controls (uninfected sample). Hemolymph extracts were spun for 10 minutes at low speed to pellet cellular material. Equal amounts of each sample (40 μg protein) were then aliquoted in duplicate, reduced, cysteine-blocked and digested by trypsin. Each aliquot was labeled with a different iTRAQ tag according to the manufacturer's instructions (document #4351918A and 4350831C downloaded from http://docs.appliedbiosystems.com/search.taf
, Applied Biosystems, Foster City, CA). The 114 and 115 tags were used to label the peptides in the two identical samples from control (uninfected) sample and the 116 and 117 tags were used for two extracts from the infected sample. After labeling, the four samples were pooled and subjected to cation exchange chromatography as described below.
Strong Cation Exchange Fractionation
Strong cation exchange (SCX) fractionation was completed using an Agilent 1100 HPLC with UV detector (Agilent Technologies, Inc. Santa Clara, CA). The tryptic peptides labeled with iTRAQ tags were reconstituted in buffer A (10 mM potassium phosphate pH 3.0, 25% ACN), prior to SCX LC. The samples (~400 μg) were loaded onto a PolyLC Polysulfoethyl A column (2.1 mm × 150 mm) purchased from PolyLC Inc. (Columbia, MD). Buffer B was composed of 10 mM potassium phosphate pH 3.0, 25% ACN with 1 M KCl. Sample fractionation was completed using the gradient 0% B, 15 min, 0–25% B in 40 min, 25–50% B in 10 min and hold 50% B for 10 min. During this elution, forty fractions were collected at a flow rate of 200 μl/min on the basis of the UV trace at 214 nm. Several fractions were pooled post-collection to yield a total of 10 sample containing fractions. Salts were removed via solid phase extraction using Waters SepPak C18 cartridge (Waters, Milford, MA) and purified peptide fractions were dried and reconstituted in 2% ACN, 0.05% formic acid and injected on the nLC-MS/MS.
Reverse-Phase Separation and Tandem Mass Spectrometry
The 10 SCX fractions were partially evaporated to remove ACN and desalted by solid phase extraction. All SPE-extracted and gel-extracted peptide samples were reconstituted in 50 μl of 0.1% formic acid with 2% acetonitrile prior to mass spectrometry (MS) analyses. NanoLC was carried out by an LC Packings Ultimate integrated capillary HPLC system equipped with a Switchos valve switching unit (Dionex, Sunnyvale, CA). An aliquot of peptide fractions (5.0 μl) were injected using a Famous auto sampler onto a PepMap C18 trap column (5 μm, 300 μm × 5 mm, Dionex) for on-line desalting and then separated on a PepMap C-18 RP nano column, eluted in a 90-minute gradient of 5% to 40% acetonitrile in 0.1% formic acid at 250 nl/min. The nanoLC was connected in-line to a hybrid triple quadrupole linear ion trap mass spectrometer, 4000 Q Trap from ABI/MDS Sciex (Framingham, MA) equipped with Micro Ion Spray Head ion source. MS data acquisition was performed using Analyst 1.4.2 software (Applied Biosystems) in the positive ion mode for information dependant acquisition (IDA) analysis. The nanospray voltage was 2.0 kV used for all experiments in positive ion mode. Nitrogen was used as the curtain (value of 10) and collision gas (set to high) with heated interface at 150°C. The declustering potential was set at 50 eV and Gas1 was 15 (arbitrary unit). In IDA analysis, after each survey scan for m/z 400 to m/z 1550 and an enhanced resolution scan, the three highest intensity ions with multiple charge states were selected for tandem MS (MS/MS) with rolling collision energy applied for detected ions based on different charge states and m/z values.
Data analysis and protein identifications
MS/MS spectra generated from nanoLC/ESI-based IDA analyses were interrogated using ProteinPilot™ software 2.0 (Applied Biosystems, Foster City, CA) for database search against the FlyBase FB2008_06 D. virilis peptides database by Paragon method. The default setting for trypsin with MMTS modification of cysteine and a methionine oxidation was used for quantitative processing and rapid ID. The protein identifications are reported with total ProtScore >1.3 for each protein representing > 95% statistical significance in ProteinPilot. In order to estimate abundance ratios and statistical significance for each identified protein, we created custom R scripts (available upon request) that extended the ProteinPilot statistical approach to account for the technical replication included in our experiment. Briefly, for each peptide we calculate an average infected/uninfected ratio as well as an average error (based on the reported ratio error for each of the two technical replicates per treatment). These ratios are then averaged using 1/Error as a weighting factor, and significance is determined using the weighted standard deviation as described in the ProteinPilot manual (ABI).
For the estimated abundance analysis of each identified protein, the MS/MS data were also submitted to Mascot 2.2 for database searching using in-house licensed Mascot local server against the same FlyBase FB2008_06 D. virilis
peptides database with one missed cleavage site by trypsin allowed and iTRAQ-4-plex quantitation. The peptide tolerance was set to 1.2 Da and MS/MS tolerance was set to 0.6 Da. MMTS modification of cysteine and iTRAQ modification of N-terminal and lysine were fixed and a methionine oxidation and iTRAQ modification of tyrosine were set as variable modifications. Only significant scores for the peptides defined by Mascot probability analysis http://www.matrixscience.com/help/scoring_help.html#PBM
greater than "identity" were considered for the protein identifications. The exponentially modified protein abundance index (emPAI) is obtained for each identified protein from Mascot searching and the corresponding protein content in mol percent is calculated as mol % = emPAI/Σ(emPAI) * 100 as reported previously [40
]. For our analysis, we focus on the 144 proteins identified by both MASCOT and ProteinPilot, for which we can calculate both a infected/uninfected ratio and emPAI.