|Home | About | Journals | Submit | Contact Us | Français|
The release of oil resulting from the blowout of the Deepwater Horizon (DH) drilling platform was one of the largest in history discharging more than 189 million gallons of oil and subject to widespread application of oil dispersants. This event impacted a wide range of ecological habitats with a complex mix of pollutants whose biological impact is still not yet fully understood. To better understand the effects on a vertebrate genome, we studied gene expression in the salt marsh minnow Fundulus grandis, which is local to the northern coast of the Gulf of Mexico and is a sister species of the ecotoxicological model Fundulus heteroclitus. To assess genomic changes, we quantified mRNA expression using high throughput sequencing technologies (RNA-Seq) in F. grandis populations in the marshes and estuaries impacted by DH oil release. This application of RNA-Seq to a non-model, wild, and ecologically significant organism is an important evaluation of the technology to quickly assess similar events in the future.
Our de novo assembly of RNA-Seq data produced a large set of sequences which included many duplicates and fragments. In many cases several of these could be associated with a common reference sequence using blast to query a reference database. This reduced the set of significant genes to 1,070 down-regulated and 1,251 up-regulated genes. These genes indicate a broad and complex genomic response to DH oil exposure including the expected AHR-mediated response and CYP genes. In addition a response to hypoxic conditions and an immune response are also indicated. Several genes in the choriogenin family were down-regulated in the exposed group; a response that is consistent with AH exposure. These analyses are in agreement with oligonucleotide-based microarray analyses, and describe only a subset of significant genes with aberrant regulation in the exposed set.
RNA-Seq may be successfully applied to feral and extremely polymorphic organisms that do not have an underlying genome sequence assembly to address timely environmental problems. Additionally, the observed changes in a large set of transcript expression levels are indicative of a complex response to the varied petroleum components to which the fish were exposed.
In April of 2010, the explosion of the Deepwater Horizon oil drilling platform initiated the largest deep water oil release in the history of the petroleum industry culminating in more than 189 million gallons of crude oil released into the environment [1,2]. This incident has greatly impacted the natural and economic resources of many Gulf of Mexico coastal areas [3-7].
For several decades Fundulus heteroclitus has been an important model organism employed in environmental toxicology studies of heavily polluted superfund sites throughout the northeastern United States . We have leveraged the body of existing scientific work surrounding F. heteroclitus in the analysis of high throughput sequencing techniques applied to natural Fundulus grandis populations, a closely-related sister species inhabiting DH-impacted sites in the Gulf of Mexico . This application holds promise for revealing induction of environmental stress with genome-wide molecular genetic response patterns. Deploying these techniques within natural populations for which no underlying genomic information is available also holds the potential to track responses in the most situation-appropriate and ecologically relevant species, and be used to follow ecosystem recovery . Herein we provide methodological detail and results of transcriptome expression comparisons between populations of F. grandis sampled directly from the environment affected by the DH oil release versus those from unaffected populations.
In this study we assembled a reference transcriptome from representative natural populations of F. grandis collected from Gulf Coast estuaries ranging from Texas to Florida. We quantified liver mRNA expression from sequences produced using the Illumina GAIIx platform. A statistical framework was used to identify genes that were differentially expressed in oil-exposed versus non-exposed samples. Our results show a complex response including activation of AHR and related pathways previously shown to be important in response to xenobiotic hydrocarbons, as well as some previously undocumented gene responses. We also show that these results compare favorably with a microarray-based method. Additionally we compiled abundant transcript sequence data for a species not previously studied in this manner. And finally, we demonstrate the efficacy of these techniques for addressing similar events in the future while studying organisms key to a habitat instead of being restricted to a well studied model of lesser significance.
The exposed (E) samples E1, E2 and E3 and the unexposed (U) U1 and U2 samples had between 81% and 94% overall pass rates and most of the reads that passed remained paired (Table (Table1).1). After filtration 409,153,209 reads remained.
We selected a transcriptome assembly based on the features of the N50 plot over a range of k-mer lengths from 21 to 49. The point k=27 was the highest k-mer length in what seemed to be a plateau of N50 values before the N50 began to dip (Additional file 1: Figure S1). The average length over all sequences in this assembly is 599bp with an N50 value of 1,238bp, and when contigs shorter than 500bp are excluded these values rise to 1,429bp and 1,804bp respectively. These statistics indicate a population of robust and contiguous sequences among smaller fragmentary contigs. The final set obtained after mapping reads and removing those transcripts with fewer than 10 mapped reads contains 120,725 contigs with an average length of 878 bp and an N50 of 1,494 bp. We were able to annotate 45% of the assembled contigs with a total of 15,494 unique sequence descriptions.
While this set still contains many small fragments and some sequence overlaps, it indicates that two sequencing technologies are not necessary for de novo assembly of non-model organisms as is common practice in recently published studies [11,12] with which our methods compare favorably.
There exists subtle but significant genetic structure within F. grandis, where population divergence generally follows a pattern of isolation-by-distance with longitude across the northern Gulf of Mexico. Our “western” pool of samples (sites 1-3 on Figure Figure1)1) included fish from the western phylogeographic cluster of populations, whereas the “eastern” pool of samples (sites 5-8 on Figure Figure1)1) included fish from the eastern phylogeographic cluster of populations . Grand Terre, our oil-impacted site, is located near the center of the species’ range (site 4 on Figure Figure1),1), and fish from this site have greatest affinity to the western phylogeographic cluster of populations. Among our reference fish are animals from Leeville, LA, which is only 21.26 km from the Grand Terre site; significant genetic structure is not observed within this species at that geographic scale . Since our reference samples are representative of the genetic diversity within the species, including from southern Louisiana, we consider it unlikely that genetic variation among populations confounds the treatment effects that we observe. Seasonal changes should also be minimal as fish were collected during the months of April (reference) and June (exposed). As shown in Figure Figure2,2, the unexposed samples cluster separately from the exposed samples bearing out the self-similarity of the pooled unexposed samples, and indicating global differences between the two sets.
At a significance level of P<0.01, we determined a set of 5,384 up and 5,882 down-regulated sequences (Figures (Figures33 and and4).4). The assembly contains long well-assembled sequences in addition to short fragments, and the color-coding of points by length reveals that the longer sequences tend to provide more statistical power (green to orange points in Figure Figure4).4). The short length of these reads, however, is beneficial when matching them to short sequence fragments. This allows many of the shorter contigs to also provide useful data (blue points in Figure Figure4).4). Many of the sequences are alternative assemblies of the same original transcript or are fragments of the same transcript. Thus, we used the annotations derived from blast searching against the non-redundant database (nr) to group transcripts that likely represent the same gene by unique sequence annotations. This resulted in 1,070 down-regulated and 1,251 up-regulated unique sequence annotations. That still leaves 6,146 significantly differently expressed sequences that lack annotation, where 962 of these sequences are over 1kb in length. Some of these may be entirely new gene products unknown to prior studies while others are likely assembly errors.
Among the more interesting genes that were differentially expressed are two up-regulated aryl-hydrocarbon receptor sequences (AHR1B and AHR2) (Figure (Figure3).3). The response of AHR pathway genes has been well documented following exposure to aromatic hydrocarbons, where activation of this pathway is diagnostic of exposure to this class of pollutant [8,14,15]. Downstream in the AHR response pathway are cytochrome P450 genes; CYP1C (mediated by AHR2) and CYP1B1 are found to be up-regulated in the exposed sample (Figure (Figure3),3), while others in the CYP2N and CYP2X sub-families are down-regulated (Additional file 1: Table S2). Whereas previous studies have highlighted the response of CYP1A to dioxins, polycyclic aromatic hydrocarbons (PAHs), and other similar pollutants, in this analysis we detected no significant difference in CYP1A mRNA expression upon exposure to DH oil. To verify this we searched the assembled F. grandis transcript sequences using the F. heteroclitus CYP1A sequence and found that only small fragments existed in the de novo assembly. We then mapped F. grandis reads to the F. heteroclitus CYP1A sequence and analyzed its expression in the context of the assembled sequences. While there was an increase in the expression values in the oil exposed sample relative to either of the unexposed samples, it did not pass the p-value threshold. It may be that by the time of collection the animals had been exposed to levels and durations of DH oil or some prior exposure causing a de-sensitized response by CYP1A response in the exposed population. Alternately, since a previous study that used the same RNA-Seq data did find a significant increase in CYP1A, but used F. heteroclitus ESTs as the reference transcriptome , it could be that the residual fragmentary and erroneous sequences in the de novo assembly led to reduced statistical power for some sequences.
Given clear damage to killifish gills caused by contaminating oil , one might anticipate that compromised gill function, coupled with limited oxygen exchange between air and water in oil-slick contaminated areas, could result in hypoxic stress. In the up-regulated gene set from DH oil exposed animals we find several genes suggesting a hypoxic response. Hypoxia inducible factor 2-alpha, its co-activator E1A binding protein p300, and the activity enhancing nuclear receptor coactivator 2 are all up-regulated  (Figure (Figure5).5). It should be noted that the estuaries inhabited by F. grandis often experience transient periods of hypoxia (see Additional file 1: Table S2), and given the broad range of sampling sites it is likely that our reference set includes RNA from individuals experiencing hypoxic conditions to some degree, thus capturing a range of normal and hypoxic expression levels for these genes. The inclusion of reference individuals exposed to hypoxia would make it more difficult to statistically identify a hypoxic response in the exposed population. Therefore, the genes indicated here are likely a conservative set of hypoxia response genes.
Another broad set of up-regulated genes in DH oil exposed fish indicates an immune response perhaps by circulatory leukocytes. The secreted form of the immunoglobulin mu heavy chain and heavy chain variable regions are both indicative of an active immune response. Tumor necrosis factor receptors 14 and 21, involved in immune response signaling pathways, are both up-regulated (Figure (Figure5).5). Previous studies have noted an association between an immune response induced by hypoxia and/or HIF1 activation , and hydrocarbons are well-known to alter immune system function .
The set of genes that appears to be down-regulated in response to DH oil exposure is also large and diverse. One large group of down regulated transcripts includes a wide selection of 40S and 60S ribosomal proteins (Figure (Figure5).5). A similar response has been reported in hypoxic conditions in zebrafish gills , and may be explained by an overall reduction in protein synthesis as a survival strategy during hypoxic conditions .
One final group of genes that was previously observed to be strongly down-regulated are the group of choriogenins ChgL, ChgH, and ChgHm. The sequences for these are highly repetitive and so did not assemble well. Thus they did not originally show up in our analysis except as unidentified small fragments albeit with expression levels that would confirm earlier observations. To better examine the expression levels in the context of the rest of the assembly, we added the F. heteroclitus cDNA sequences to the assembled contigs and recalculated differential expression for these in the context of the rest of the assembled sequences. The results confirm that choriogenin genes are strongly down-regulated (Figure (Figure5).5). As has been previously noted, the expression of these genes is estrogen-dependent and is down-regulated by exposure to PAHs .
The results of the microarray-based analysis correlate well with those measured by RNA-Seq (Figure (Figure6).6). One notable difference is that the dynamic range of expression levels seems to be much wider in the RNA-Seq method versus those observed for microarray results. The RNA-Seq method exhibits results for values ranging from approximately +5 to -14 log2 fold change while the microarray results are in a more narrow range of approximately ±3.5fold. This may indicate a greater sensitivity in the RNA-Seq based methodology which is an observation that is consistent with other similar comparisons . For the set of best hits between the data sets we calculated a Pearson’s correlation of r=0.48, while the set of sequences found to be significant in both analyses had an r value of 0.70. These indicate a reasonable correlation between the two data sets with a high degree of variability, and are similar to results of other such comparisons . Much of the variation between the data sets is likely related to the fact that only the exposed samples were common among the two analyses; the un-exposed samples in each used mRNA from different sets of fish. The unexposed RNA-Seq data represent two pools of 6 and 8 individuals collected from sites on both sides of the Mississippi river, while the microarray data set is comprised of a separate set of 75 individual animals collected in the summer of 2010 from different sites along the Gulf of Mexico.
Assembly of mRNA transcript data is a challenging task that is complicated further when using wild outbred animals that have significant genetic variability between and within populations and which lack prior EST or genome sequence data. However, data provided here suggest that useful analysis can be achieved with reference to phylogenetically similar species. Additionally, Illumina short read data have recently been shown to be more resilient to fragmentary, incomplete sequences as targets for alignment for digital gene expression  as would likely be the case in non-model organisms. We suggest that with improvements in assembly techniques, digital gene expression results are likely to improve and become increasingly valuable tools for assessing environmental damage and recovery after anthropogenic perturbation. Thus, the methods provided here indicate that RNA-Seq can be used to investigate global mRNA expression levels in a very wide variety of organisms, such that the most appropriate species for particular research questions in environmental science may be exploited.
In this case, RNA-Seq studies of wild populations of the small teleost fish (Fundulus grandis) collected from estuaries in the Gulf of Mexico show broad and complex molecular genomic responses to DH oil exposure. In comparison to samples collected from a varied set of Gulf of Mexico coastal sites in 2008, the fish from estuaries with oil in the water exhibited mRNA expression patterns consistent with exposure to the toxic components of oil (aromatic hydrocarbons). In addition to expected indicators of AH exposure, the RNA-Seq based techniques used here revealed effects on several cellular systems, such as the immune response and effects on the wide variety of ribosomal proteins, that might not have been specifically interrogated by a more focused examination of specific pathways through smaller-scale techniques. Additionally, the specific gene responses we have analyzed and discussed here represent only a handful of the thousands of significant up and down regulated gene sets. This data set will continue to provide insight into the full effects of this oil exposure as results are compared with other similar studies. The RNA-Seq results are also similar to a distinct and independent microarray study and thus lend validity to the overall methods employed and results of the study.
Samples of Fundulus grandis were collected from eight estuaries over a range of time points prior to, and concurrent with, the Deepwater Horizon oil release event. Collection sites, from west to east, include: 1) Port Aransas, TX [27°45'59"N, 97°7'33"W]; 2) Cocodrie, LA [29°15' 13.89" N, 90°39' 45.91" W]; 3) Leeville, LA [29° 12' 43.37N, 90° 09' 08.37" W]; 4) Isle Grande Terre, LA [29°16'22.93"N, 89°56'41.87"W]; 5) Dauphin Island, AL [30°20' 04.92"N, 88° 07'57.21"W]; 6) Weeks Bay, AL [30°22'41.80"N, 87°50'19.60"W]; 7) Santa Rosa Island, FL [30°21'16.63"N, 87° 2' 46.92"W]; and 8) Florida State University Marine Station, FL [29°54'56.83"N, 84°30'39.07"W]. Three adults were sampled from Isle Grande Terre (Barataria Bay, LA) on June 28, 2010 -- a time when oil had already heavily infiltrated the area for at least two weeks  (exposed samples E1, E2, E3). Each of the three individuals discussed thus far were kept as separate biological replicate samples for sequencing. Two additional samples were composed of 6 and 8 adults pooled from 7 sites East or West of the Mississippi (locations1-3 and 5-8) collected prior to the oil spill (April 2008) (unexposed samples U1 and U2 respectively) (Table (Table1).1). Both genders are represented in both exposed and unexposed samples, and differences in size are not expected to be a major factor in the current analysis [24-26]. In all cases the RNA was purified from isolated livers. Several of these sites have a recorded history of hypoxic events as indicated in Additional file 1: Table S2. None of the collection sites are in heavily industrialized areas with known histories of pollution. While small oil seeps occur throughout the Gulf of Mexico, we do not expect any to be exposed at the same levels as the south Louisiana fish during June 2010.
Oil exposed samples
RNA from adult male GT fish was isolated as reported in . Briefly, liver tissues were dissected from fish field sites immediately following capture and preserved in RNAlater (Ambion, Austin, TX, USA). Total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA), purified using RNeasy spin columns (Qiagen Inc., Valencia, CA, USA), and RNA quality was verified using microfluidic electrophoresis (Experion instrument and reagents, Bio-Rad Laboratories, Hercules, CA, USA).
RNA was isolated from livers stored in RNALater (20°C) by homogenizing tissue using ceramic beads homogenizer (Bio101, Savant; Carlsbad, CA). Tissues were place in screw-top 2ml tube with 750 ul of chaotropic reagent, 1.25 g ceramic beads and violently shaken for 40 s. Solution was centrifuged and 500 ul of supernatant placed in RNAse-free microfuge tube, with 50 ul 2M NaOAc pH 4.0, vortexed; 500ul of acidic phenol was added and vortexed again; then 100 ul of Chloroform:isoamyl alcohol (24:1) was added and vortexed. This solution was centrifuged at 4°C for 20 min. The upper-phase was placed in RNAse-free microfuge tube and RNA was precipitated with equal volume of isopropanol. RNA pellet was resuspended in chatropic reagent and purified using Qiagen RNA columns following manufacture’s procedures.
The quality of all mRNA samples were verified by electrophoresis on BioAnalyzer (Agilent, Santa Clara CA). The RIN scores for all samples were in the range of 8.3-9.7 (Additional file 1: Table S4).
All RNA samples were sequenced in separate lanes on an Illumina Genome Analyzer (Expression Analysis, Inc. Durahm, NC). The resulting raw data files contained between 60-75 million 60 bp read pairs each for a total of 501,847,804 overall reads (deposited to NCBI short read archive, accession: SRA053469). We filtered and trimmed the reads based on quality scores using an in-house filtration algorithm that removed low scoring sections of each read and preserved the longest remaining fragment based on the following method. First, any reads with uncalled bases were rejected. The phred quality score of 2 encoded in fastq format as a ‘B’ is a special flag indicating that the results downstream of that position are untrustworthy. Therefore, as a second step, portions downstream of ‘B’ quality scores were removed. Finally, reads were broken apart anywhere the quality score value was 10 or below or where the average score of a position and its two neighbors was 20 or below. The largest remaining fragment of each read was kept (provided it was sufficiently long i.e., 49 bp or more) and the rest were discarded. Finally reads that lost their mate pair were moved into a single-end file and the integrity of the remaining read-pairing information was maintained. Any of the Perl scripts we developed to perform these steps are available on request.
We used the Velvet short read assembly package to perform the transcript assembly and made use of the Columbus and Oases extensions to Velvet [27,28] (Velvet version 1.1.04 and Oases version 0.1.21). As a first step we aligned the filtered and trimmed short reads to F. heteroclitus (a sister species to F. grandis) ESTs (retrieved from GenBank Feb 17, 2011) using the Bowtie short read aligner  (Bowtie version 0.12.7). The resulting alignment file (in SAM format), consisting of the read sequences and aligned locations if available, was then passed to Velvet along with the reference ESTs. We scanned a range of k-mer sizes from 21 to 49, and for each size we produced transcripts using Oases. Based primarily on the N50 value profile we selected the k=27 assembly for further use in these studies (Figure S1). This full set of assembled sequences have been deposited to the Transcriptome Shotgun Assembly (TSA) database at GenBank (accession numbers JW540070 - JW662113) and are also available at xiphophorus.org under ‘Publication Supplements’.
We used Bowtie to map the full set of filtered reads to 203,252 assembled transcriptome sequences, then determined the number of reads mapped to each transcript. The settings we used allowed any read to map with up to 2 mismatches, and reads that could match in more than one location were randomly assigned. Any transcript with fewer than 10 mapped reads was rejected. This process eliminated 41% of the full transcript set resulting in a set of 120,725 transcripts that were used for further analyses.
We again employed the Bowtie short read alignment program to map each sample of short reads independently to the refined set of reference transcripts. A custom Perl script then determined the number of reads mapped to each contig from the alignment file. A second Perl script then compiled the number of reads per contig per sample into a tabular format. The first column of the data file contains the transcript identifier, and each subsequent column has the number of fragments mapped to that transcript in each sample. We then used the DESeq package (ver 1.4.1, Bioconductor ver. 2.8, R ver. 2.13) to determine differential expression from the compiled tabular data . DESeq uses a model based on the negative binomial distribution to determine significance and was developed specifically to meet the challenges of working with RNA-Seq data. The biological replicates were advantageous because DESeq is then able to determine a much better variance estimate for each transcript improving the statistical power of the experiment. The diagnostic plots produced by DESeq indicated no major problems with the data (Additional file 1: Figures S2, S3, and S4). A significance level of P<0.01 was used to select differentially expressed sequences. Although there was a newer version of DESeq at the time, the one we employed was stable and used methods which had been published in the paper cited whereas the latest version was being updated daily in the Bioconductor repository.
Matches for the reduced set of transcripts are sought from the non-redundant ‘nr’ blast database maintained by NCBI. We obtained the fasta format of the nr database on May 12, 2011, and prepared it for use with MPI-BLAST . We used MPI-BLAST (ver 1.4.0) to query the F. grandis transcripts against nr and kept only 10 hits per sequence with an expect value threshold of 1e-10 using blastx. The results were processed by Blast2GO [32-34] in order to quickly assign a likely identity to each sequence.
Microarray target sequences were queried against the assembled transcript sequences using blastn with a minimum expect-value threshold of 1e-10. For each microarray target, the hit with the highest bit score was selected as a match. RNA-Seq analyses were then compared to microarray data for F. grandis that compared samples from the exposed collection site (Isle Grande Terre, LA) both before and after the arrival of contaminating oil at that site, and compared the exposed site response to five different unexposed populations . The microarray data were analyzed by mixed-model ANOVA which compared the response pre-oil to post-oil (3 timepoints) and between exposed and unexposed populations (6 populations) using JMP-SAS.
DH: Deepwater Horizon drilling platform; AHR: Aromatic hydrocarbon receptor; RNA: Seq - RNA sequencing; PAH: Polycyclic aromatic hydrocarbon.
The authors declare that they have no competing interests.
TG carried out the assembly, digital gene expression, comparison to microarray results, analyzed results, and drafted the manuscript. YS consulted on design of the computational work and analysis, and contributed intellectually and critically to the manuscript. DC, MO, AW collected samples, conceived of the study, and contributed intellectually and critically to the manuscript. AW additionally carried out microarray analysis. RW conceived of the study, coordinated research, and helped to draft the manuscript. All authors read and approved the final manuscript.
Figure S1. Length/quantity statistics of assemblies for odd values of K from 21 to 49. In each graph that has two plots, the blue describes the full set of contigs while the red describes only those contigs 500bp or over. Figure S2. Plot of squared coefficient of variation as produced by DESeq. Figure S3. Plot of the base variance as produced by DESeq. Figure S4. ECDF plots for the two samples as produced by DESeq. Table S1. Unique Sequence Descriptions for Up-Regulated Sequences at P < 0.01. Table S2. Unique Sequence Descriptions for Down-Regulated Sequences at P < 0.01. Table S3. Site history of hypoxia. Table S4. RIN values for RNA samples
This work was supported by Texas State University, the National Science Foundation grant DEB-1048208 to D. L. Crawford, and the National Institutes of Health, National Center for Research Resources grant R24-RR024790 including an American Recovery and Reinvestment act supplement to this award. This work was also supported in part by National Science Foundation grant DEB-1048206 and Gulf of Mexico Research Initiative funding to AW.