|Home | About | Journals | Submit | Contact Us | Français|
The current study was aimed at analyzing putative protein sequences of the transition protein-like proteins in 12 Drosophila species based on the reference sequences of transition protein-like protein (Tpl94D) expressed in Drosophila melanogaster sperm nuclei. Transition proteins aid in transforming chromatin from a histone-based nucleosome structure to a protamine-based structure during spermiogenesis - the post-meiotic stage of spermatogenesis. Sequences were obtained from NCBI Ref-Seq database using NCBI ORF-Finder (PSI-BLAST). Sequence alignments and analysis of the amino acid content indicate that orthologs for Tpl94D are present in the melanogaster species subgroup (D. simulans, D. sechellia, D. erecta, and D. yakuba), D. ananassae, and D. pseudoobscura, but absent in D. persmilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi. Transcriptome next generation sequence (RNA-Seq) data for testes and ovaries was used to conduct differential gene expression analysis for Tpl94D in D. melanogaster, D. simulans, D. yakuba, D. ananassae, and D. pseudoobscura. The identified Tpl94D orthologs show high expression in the testes as compared to the ovaries. Additionally, 2 isoforms of Tpl94D were detected in D. melanogaster with isoform A being much more highly expressed than isoform B. Functional analyses of the conserved region revealed that the same high mobility group (HMG) box/DNA binding region is conserved for both Drosophila Tpl94D and Drosophila protamine-like proteins (MST35Ba and MST35Bb). Based on the rigorous bioinformatic approach and the conservation of the HMG box reported in this work, we suggest that the Drosophila Tpl94D orthologs should be classified as their own transition protein group.
During spermatogenesis in most metazoans, haploid round spermatids undergo a dramatic nuclear transformation where the chromatin is remodeled into a highly compacted, transcriptionally silent form. This transformation is accompanied by the production of sperm-specific proteins that replace histones as the DNA-binding proteins. These sperm-specific proteins include histone H1 linker-like proteins,1,2 true protamines,3 protamine-like proteins,1,2 chromatin insulator proteins,4 and transition proteins.2,4-6 Histone H1 linker-like proteins, true protamines and protamine-like proteins appear to have evolved from histone H1 linker and are collectively referred as the “sperm nuclear basic proteins” (SNBPs).7,8 True protamines are present in the sperm nuclei of higher vertebrates such as mice and humans,9-11 while protamine-like proteins are found in some vertebrates,12 but are predominantly found in invertebrate species such as fruit flies,4,6,13 Atlantic surf clam,13-15 and stalked tunicate.16
Adult male Drosophila fruit flies and mammals have a similar process of spermatogenesis. In Drosophila, spermatogenesis advances from tip of the blind-ended tubular or ellipsoid testes, while in mammals spermatogenesis proceeds within the seminiferous epithelium lining seminiferous tubules in the testes.17 In both flies and mammals, the initiation of spermatogenesis occurs in the stem cell niche region, which is located at the apex of the testes in flies,18,19 and in the basal compartment of the seminiferous epithelium in mammals. The fly testis stem cell niche houses the germline stem cells and cyst progenitor stem cells.20 The gonialblast will go through a mitotic amplification stage, followed by 2 meiotic divisions to generate haploid round spermatids. During the post-meiotic stage of spermatogenesis (spermiogenesis), haploid round spermatids transform into functional sperm. This transformation includes the exchange of histones for protamines and chromatin condensation. In flies, nuclear transformation involves the exchange of somatic histones for SNBPs called protamine-like proteins.21,22 In D. melanogaster, the transition protein Tpl94D facilitates the exchange of histones for protamine-like proteins.4-6 It has also been well documented that mammalian transition proteins (TPs) are involved in binding DNA to facilitate the transition from nucleosome-based chromatin to protamine-based chromatin.3
The D. melanogaster protamine-like proteins are male specific transcripts MST35Ba and MST35Bb.1,2,4,13,23 The purpose of MST35Ba and MST35Bb appears to be to serve as the protector of the compacted DNA in the sperm nucleus against detrimental environmental factors such as X-rays.6 Furthermore, deletion of MST35Ba and MST35Bb does not significantly affect chromatin condensation or fertility as it does in mammals when true protamines are deleted.1,2,24,25
Recent studies showed that during spermiogenesis both transition (Tpl94D) and histone H1 linker-like (male specific transcript - MST77F) proteins play a significant role in remodeling the sperm nucleus in D. melanogaster.4,6 During sperm nuclear remodeling, the ubiquitous chromatin insulator protein CTCF has been postulated to be involved in controlling the areas where chromatin can undergo histone modification.4 These histone modifications include H2A mono-ubiquitination and an increase in H4 acetylation, which cause the histones on the chromatin to be removed and degraded.4 Consequently, an opening within the chromatin allows Tpl94D to act as an intermediate for the transition from a histone bound nucleosome to a protamine bound structure.4 A key component of Tpl94D that allows for chromatin condensation to occur is the N terminal high mobility group (HMG) box.4 This HMG box is rich in arginine, which is a very basic amino acid with high affinity for binding DNA.4,5
Recently, we performed a detailed bioinformatic analysis of protamine-like proteins in 12 species of Drosophila (D. melanogaster. D. simulans, D. sechellia, D. yakuba, D. erecta, D. erecta, D. ananassae, D. persimilis, D. pseudoobcura, D. willistoni, D. virilis, and D. grimshawi).13 The current study focuses on an analysis of transition proteins (TPs) in the same 12 species analyzed in our previous work. Here, we include differential gene expression analysis using available next generation sequencing (NGS) RNA-Seq transcriptome data in addition to the genomic analysis. Additionally, we show that Tpl94D orthologs have a conserved N-terminal DNA binding domain and they are highly expressed in the testes as compared to the ovaries.
The published genomic and mRNA nucleotide sequences for Tpl94D (GI: 442620556) from D. melanogaster were used to search the genomes of D. simulans, D. sechellia, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. persmilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi for sequence matches. The best NCBI ORF sequences for transition protein Tpl94D orthologs within the original 12 sequenced Drosophila species are listed in Table 1. The nucleotide BLAST and protein BLAST did not reveal the same gene loci for all the species outside the melanogaster species subgroup (D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi). A forced nucleotide BLAST2 alignment for transcripts and genomic sequences for the protein orthologs for Tpl94D illustrates that the protein BLAST, PSI BLAST, and ORF Finder sequences do not align with the genomic or transcript sequences of the Drosophila species from outside the melanogaster species subgroup to Tpl94D. This is due to the poor E-value scores and the percent query coverage for the species outside the melanogaster species subgroup. The current annotation on Flybase shows D. persimilis (Dper GL26871-Tpl94D) to be a putative ortholog of Tpl94D based on protein sequence predictions made using OrthoDB. Our current investigation, however, does not include Dper GL26871-Tpl94D because the next generation sequence RNA-Seq transcriptome data sets were not available for D. persmilis testes and ovaries and Dper GL26871-Tpl94D was below the NCBI ORF Finder's threshold (Tables 1–2). A summary of the best nucleotide BLAST alignment results are shown in Table 2 with their maximum identity, query coverage and E-value(s).
The published protein sequence for Tpl94D (GI: 24649166) for D. melanogaster was used to search the genomes of the Drosophila species listed previously for protein sequence matches. BLAST results with maximum identity, query coverage, and E-value scores are shown in Table 3. Only the best matched protein BLAST sequences are listed for each of the Drosophila species. No sequence matches were found outside the melanogaster species subgroup except for D. ananassae and D. pseudoobscura. The amino acid sequences for D. ananassae (Dana GF19889-Tpl94D) and D. pseudoobscura (Dpse GA22645-Tpl94D) were confirmed by analyzing publically available NGS RNA-Seq transcriptome data sets from NCBI SRA, ModENCODE, Flybase, and NCBI EST (Table S1). All of the orthologs were then confirmed using NCBI ORF Finder, PSI BLAST, and protein BLAST. Figure 1 shows a T-Coffee protein alignment of the Tpl94D orthologs for D. melanogaster, D. simulans (Dsim GD20990-Tpl94D), D. sechellia (Dsec GM26474-Tpl94D), D. yakuba (Dyak GE10340-Tpl94D), D. erecta (Dere GG11172-Tpl94D), D. ananassae (Dana GF19889-Tpl94D), and D. pseudoobscura (Dpse GA22645-Tpl94D) with a consensus score of 87. Figure S1 shows the consensus score increase to 97 with the omission of D. ananassae (Dana GF19889-Tpl94D), and D. pseudoobscura (Dpse GA22645-Tpl94D) amino acid residues from the T-Coffee alignment. Similarly, CLUSTAL Omega (conservative global alignment tool) shows the same N terminal region among the Tpl94D orthologs (Dsim GD20990-Tpl94D, Dsec GM26474-Tpl94D, Dyak GE10340-Tpl94D, Dere GG11172-Tpl94D, Dana GF19889-Tpl94D, and Dpse GA22645-Tpl94D) as being conserved (Fig. 2).
The Tpl94D protein orthologs were analyzed for their amino acid percentages (Figure S2 and File S1) and total number of amino acids (Figure S3 and File S2). These analyses included published NCBI sequences for D. melanogaster histone H1 linker-like proteins (MST77F), mouse transition proteins, rat transition proteins, protamine-like proteins, and true protamine proteins. These proteins were included to illustrate the change in the percentage of basic amino acids in DNA binding proteins across model and non-model organisms. Previous studies have characterized transition proteins, histone H1 linker-like, protamine-like, and true protamines based on distinct percentage of basic amino acids (lysine and arginine) and other specific amino acids like cysteine, tyrosine, and serine.4,13,15,17,27 Table 4 indicates species that are within the melanogaster species subgroup (Dsim GD20990-Tpl94D, Dsec GM26474-Tpl94D, Dyak GE10340-Tpl94D, and Dere GG11172-Tpl94D) have essentially the same number of amino acid residues as compared to the control Tpl94D found in D. melanogaster. In contrast, Drosophila species found outside the melanogaster species subgroup have greater variance in the number of amino acid residues (79 and 101 amino acids for Dana GF19889-Tpl94D and Dpse GA22645-Tpl94D respectively).
Transition proteins are rich in basic amino acids like lysine (K) and arginine (R), serine (S), and low in cysteine (C) amino acid residues.27 All orthologs had a high percentage of the total sum of lysine (K) and arginine (R) amino acids with an average percentage of 19.4 (ranged from 19% to 21%) (Figure S2 and File S1). Overall, there was an equal or larger amount of arginine amino acids for all orthologs with the exception of Dpse GA22645-Tpl94D, which had a higher lysine amino acid percentage of 12% as compared to 9% for arginine amino acids (Figure S2 and File S1). The Drosophila species orthologs closest to the D. melanogaster Tpl94D control (Dsim GD20990-Tpl94D and Dsec GM26474-Tpl94D) had very similar percentages of cysteine, lysine, arginine, and serine (Figure S2 and File S1).
The sum of lysine and arginine amino acids was substantially lower for Tpl94D and its respective orthologs than the sum of both of lysine and arginine amino acids in TP1 and TP2 found in Mus musculus (mouse), Rattus norvegicus (rat), and Bos taurus (bull) (Figure S2 and File S1). In contrast, percentage sum of lysine and arginine amino acids in the Tpl94D orthologs was similar to the percentage sum of lysine and arginine amino acids found in Homo sapiens TP2 (Figure S2 and File S1). A sum percentage average of lysine and arginine amino acids of 19% was obtained when H. sapiens TP2 was included with the Tpl94D orthologs. Cysteine residues are essentially absent from the Tpl94D orthologs, which is similar to TP1 found in M. musculus, R. norvegicus, B. taurus, and Homo sapiens (Figure S2 and File S1).
The whole protein sequences for Tpl94D orthologs in the melanogaster species subgroup are conserved as indicated in Figure S1. The percentage of amino acid residues present among the Tpl94D orthologs are shown in Figure S4 and File S3. Likewise the number of amino acid residues present among the Tpl94D orthologs are shown in Figure S5 and File S4, The lysine and arginine content is slightly lower in the conserved region with an average percentage of 17% (Figure S4 and File S3).
The orthologs for Tpl94D were compared to TP1 and TP2 from 4 mammalian model organisms: M. musculus, R. norvegicus, B. taurus, and H. sapiens. TP1 for M. musculus, R. norvegicus, B. taurus, and H. sapiens did not show any conservation with Tpl94D orthologs (data not shown). However, there are a small number of amino acid residues at the N terminus of the Tpl94D orthologs that are conserved with the TP2 N terminus for M. musculus, R. norvegicus, B. taurus, and H. sapiens (Fig. 4). This conservation may be attributed to the overall greater sequence and length diversity among TP2s as compared to TP1s.17,27
Functional analysis of the whole Tpl94D protein orthologs and their respective conserved region was conducted using 3 DNA binding prediction tools: BindN+, DNA-Binder and DP-Bind. All results from DNA binder showed that Tpl94D orthologs and their respective conserved regions were able to bind DNA with average to high confidence (Table S2). Additionally, the conserved regions (Main Data Set) showed a higher affinity to bind DNA as compared to the whole protein (Realistic and Alternative Data sets) (Table S2).
BindN+ was used to predict the actual amino acid residues that will or will not bind to DNA. The whole protein analysis indicates that a minimum of 63% of all amino acids will bind to DNA in all of the orthologs, except for Dana GF19889-Tpl94D with only 57% binding DNA. The conserved N-terminal region in the Tpl94D orthologs illustrates that an increase of DNA binding probability to greater than 71% with the exception of the Dana GF19889-Tpl94D being only 58% (Table S3). Overall, the majority of the putative DNA binding residues were found within the conserved region.
DP-Bind was used to predict DNA binding or non DNA binding amino acid residues in the whole protein orthologs and their respective conserved regions. Overall, a substantial range in the percentages of the Tpl94D orthologs were shown to be DNA binding with the highest percentage found in Dsim GD20990-Tpl94D (53%) and the lowest found in the Dyak GE10340-Tpl94D (29%). The overall decrease in the percentage in Dyak GE10340-Tpl94D and Dere GG11172-Tpl94D is attributed to the larger number of amino acids present as compared to the rest of the orthologs. The conserved regions of Dyak GE10340-Tpl94D and Dere GG11172-Tpl94D have the same number of amino acids shown to be DNA binding as compared to the rest of the melanogaster species subgroup (D. melanogaster Tpl94D, Dsim GD20990-Tpl94D, Dsec GM26474-Tpl94D, Dyak GE10340-Tpl94D, and Dere GG11172-Tpl94D).
The Tpl94D orthologs and their respective conserved regions were further analyzed using Protein homology/analogy recognition engine 2.0 (Phyre 2). A detailed analysis of the conserved regions for Tpl94D is shown in Table 5. All five sample matches (c2e6oA, c2cs1A, d1v64a, d1hmfa, and c2yrqA) have an overlapping region with a protein of unknown function (DUF1074 Family) and high mobility group (HMG) box. Table 6 shows the analysis of the whole protein orthologs for Tpl94D. The DUF1074 protein of unknown function once again overlaps with the HMG box. The Dere GG11172-Tpl94D had N terminal and C-terminal distinct regions matching up for DNA binding and HMG box. This can be attributed to Dere GG11172-Tpl94D being a DNA binding protein as indicated by c2yrqA match, which had residues 2 through 172 covering 97% of the whole protein. Phyre2 was used to generate a tertiary wire frame structure of the conserved regions and Molsoft ICM Browser was used to analyze the alignment of these structures. The conserved regions in Tpl94D orthologs have similar tertiary arrangements of the 3 α helices as shown in Fig. 3.
File S5 and Table 7 shows a summary of the RNA-Seq analysis using Cuffdiff 2.0.2 with a false discovery rate (FDR) of 0.01 for all transition protein Tpl94D orthologs across D. melanogaster (control), D. simulans, D. yakuba, D. ananassae, and D. pseudoobscura. For these species, Tpl94D was highly expressed in the testes as compared to ovaries. D. melanogaster expressed 2 isoforms for Tpl94D: Tpl94D_A: FBtr0084339 and Tpl94D_B: FBtr0310110 - with higher expression found for Tpl94D_A (Figure S7 and S8). The Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) for the testes samples in D. melanogaster Tpl94D_A showed a high expression (123.52) as compared D. melanogaster ovaries samples (FPKM = 0). A positive log2 fold change of 13.8006 was seen with a p value of 0.0622531 and a q value of 0.222648 for Tpl94D_A isoform. The Tpl94D_B isoform has a lower expression for the testes (19.079 FPKM) as compared to Tpl94D_A isoform for the testes. The ovaries expression for the Tpl94D_B isoform was 0. The relationship of these 2 isoforms for D. melanogaster Tpl94D (FBgn0051281) was analyzed using NCBI Isoform Usage Two-step Analysis (IUTA). IUTA showed that Tpl94D_A isoform (FBtr0084339) is the dominant isoform of the Tpl94D gene with 91% expression as compared to only 9% expression of Tpl94D_B isoform (FBtr0310110) in D. melanogaster testes (Figure S8).28 No other isoforms were detected for any other species (D. simulans, D. yakuba, D. ananassae, and pseudoobscura) based on ENSEMBL GTF files.29
The expression for Dsim GD20990-Tpl94D was 266.525 FPKM in the testes with 0 FPKM found in the D. simulans ovaries. This also resulted in an exponential positive log 2-fold change of 1.79769×10308 with a p value of 0.000117315 and a q value of 0.00109149. The Dyak GE10340-Tpl94D had similar high expression in the testes (506.227 FPKM) and close to 0 FPKM for the ovaries (positive log 2-fold change of 15.3673 with a p value of 0.00698236 and q value of 0.0104932). The Dana GF19889-Tpl94D had testes expression of 78.6323 FPKM while the ovaries were close zero to (0.0116349 FPKM). The decreased expression of the Dana GF19889-Tpl94D is attributed to the sequence length of Dana GF19889-Tpl94D being the smallest among all the orthologs. The p and q values for Dana GF19889-Tpl94D were both 0 with a log 2-fold change of 12.7224. Lastly, the Dpse GA22645-Tpl94D had testes expression of 232.614 FPKM and ovaries expression of 0.054751 FPKM with a p value of 0.000115653 and q value of 0.00159621 (log 2-fold change of 12.0528). The log 2-fold change was approximately the same across all orthologs with the exception of Dsim GD20990-Tpl94D and D. melanogaster Tpl94D_B due to 0 expression being found for respective sequences in ovaries. The gene orthologs for Tpl94D had high expression in the testes as compared to the ovaries.
To confirm the differential expression analysis for testes and ovaries in D. melanogaster, D. simulans, D. yakuba, and D. pseudoobscura, we compared the results to published data in ModENCODE,30 Flybase,31 NCBI EST32,33 and NCBI (File S5).34 A better consensus on the differential expression for the testes and ovaries RNA-Seq datasets for D. ananassae was established through the use of 2 additional approaches because there is only one known RNA-Seq testes and ovaries data set for D. ananassae.34
The results of the sensitivity analysis for the Tpl94D orthologs are shown in Figure S9.35 A stable alignment was found to exist when the gap open penalty (GOP) value varied from 5 to 50 while the gap extension penalty (GEP) was constant at a value of 10. Positional correspondence for amino acids across all the species required gaps to be inserted into the Tpl94D orthologs resulting in an overall length of 189 amino acids. The highest number of gaps were inserted into the D. ananassae (Dana GF19889- Tpl94D) and D. pseudoobscura sequences due to their shorter length relative to the other Tpl94D orthologs. For some sites the primary homology could not be confirmed, therefore, they are designated as ambiguous sites and were eliminated from the character matrix.36-38
The phylogenetic analysis used the portions of the protein alignment from the sensitivity analysis that were unambiguous (a total of 144 characters from character positions 1, 18– 61 and 91–189) (Figure S9). This yielded 2 most equally parsimonious trees (length = 137 steps, consistency index = 0.97 and retention index = 0.94) (Figures S9 and S10). Figure S10 shows that Tree A and Tree B differ in the placements of D. yakuba and D. erecta within the melanogaster species group. Tree A has D. yakuba as sister to the melanogaster species complex and D. erecta as sister to the clade comprised of D. yakuba and the melanogaster species complex. The topology of Tree B shows that D. yakuba and D. erecta form a clade that is sister to the melanogaster species complex.
Our results show that the best protein sequences (Table 3), genomic DNA and nucleotide transcript sequences (Table 2) have the same gene loci within a species for Tpl94D orthologs for representatives of the melanogaster species subgroup. The diversity in length for Dana GF19889-Tpl94D and Dpse GA22645-Tpl94D prevented the sequences from being found using a typical BLAST search. This meant that there was no gene loci consensus for D. ananassae and D. pseudoobscura across NCBI ORF Finder (Table 1), nucleotide BLAST (Table 2), and protein BLAST (Table 3). We were able to refine the genomic DNA and nucleotide transcript sequences through our rigorous DNA binding predictions and RNA-Seq analysis to establish Dana GF19889-Tpl94D and Dpse GA22645-Tpl94D as orthologs for Tpl94D. The other representative species of the subgenus Sophophora (D. persmilis, and D. willistoni) and representatives of the subgenus Drosophila (D. mojavensis, D. virilis, and D. grimshawi) did not have any gene loci matches within the established threshold of NCBI's ORF Finder (Table 1) for Tpl94D. All conserved regions that were found among the analyzed Drosophila species were based on one open reading frame in Tpl94D that was located at the 5′ end of each transcript sequence. This same conserved region was found at the same locus for the N-terminal HMG group box described by Rathke and colleagues4 for Tpl94D. The N terminal HMG box region is important for the replacement of histones and for the deposition of protamine-like proteins (MST35Ba and MST35Bb) and histone H1 linker-like (MST77F).4,5
Several studies have focused on the number and the percentages of amino acids present in TPs39-41 and SNBPs.3,13,42,43 The Tpl94D orthologs found in the 12 Drosophila species analyzed in the current work are less rich in basic amino acids when compared to their mammalian counterparts, but they still share specific characteristics that classify them as TPs.6,13,27 For example, Tpl94D and mammalian TPs cause a disruption of the histone nucleosome organization to facilitate the sperm chromatin transition to a protamine bound structure.4-6,27 Jeanteur27 summarized the concentration of basic amino acids lysine (K) and arginine (R), serine (S), proline (P), cysteine (C), and tyrosine (Y) in TP1 and TP2 for H. sapiens, B. taurus, R. norvegicus, Sus scrofa (boar), Ovis aries (ram), and M. musculus. That analysis indicated that TP1 and TP2 appeared to have evolved separately from each other, and mammalian TP1 is more conserved when compared to mammalian TP2.27,40,41,44
The TPs are different from the SNBPs in that they have large variations in size and the percentages of specific amino acids.17,27 TPs are more basic than histones, but are less basic than protamines.27 This is probably due to the cascade of evolution of the SNBPs from histone H1 linker protein (H1→H1 like→ protamine-like→ true protamine).21,42
The putative Tpl94D protein orthologs found across the sequenced species of Drosophila described in the current work vary significantly in length, with the largest found in Dyak GE10340-Tpl94D (187 amino acids) and the smallest found in Dana GF19889-Tpl94D (79 amino acids) (Figure S3). Our analysis of the DNA binding domain in the Tpl94D orthologs indicates that the same 26 amino acid DNA binding region is conserved within the melanogaster species subgroup (Dsim GD20990-Tpl94D, Dsec GM26474-Tpl94D, Dyak GE10340-Tpl94D, and Dere GG11172-Tpl94D) (File S6A-G and Table S4). The species outside the melanogaster species subgroup (Dana GF19889-Tpl94D and Dpse GA22645-Tpl94D) had greater variation in number of potential DNA binding residues. This may be attributed to a decrease in the protein sequence length in those respective species.
Dana GF19889-Tpl94D had only 39 predicted DNA binding amino acid residues with 29 of those residues being predicted to be DNA binding within the conserved region (N-terminal HMG box/DNA binding). Dana GF19889-Tpl94D is a small protein with a sequence length of 79 amino acids and a high concentration of DNA binding amino acid resides in the conserved region. In contrast, the Dpse GA22645-Tpl94D conserved region had approximately the same percentage of amino acid residues predicted to bind DNA compared to the whole protein (48%). Overall, the putative DNA binding regions were found mainly within their respective conserved regions (File S6A-G and Table S4). All Tpl94D orthologs had low numbers of cysteine amino acid residues, which is similar to mammalian TP1 and TP2 (Figure S3 and File S2). Disulfide bonding occurs between cysteine amino acids in mammalian protamines which increases the compactness of the sperm chromatin.45,46
Interestingly, a similarity between the mammalian TPs and the Tpl94D orthologs is the concentration of tyrosine in the conserved region. Among the Tpl94D protein orthologs, the tyrosine concentration averages 3% (Figure S2 and File S1) in the whole protein. In contrast, in the conserved region the tyrosine concentration averages 6% (Figure S4 and File S3). The average tyrosine concentration within the conserved region for Tpl94D orthologs is 2% greater than the average tyrosine concentration found within the 12 sequenced Drosophila male specific transcript (MST) 35 Ba/Bb orthologs.13 The concentration of tyrosine amino acid residues appears to be important in destabilizing the chromatin compactness thus allowing the histone-bound nucleosome to become protamine-bound.27
The Tpl94D orthologs are rich in arginine (R) amino acid residues as compared to lysine (K) for all the orthologs except for Dpse GA22645-Tpl94D (Table 4). The increased number of arginine (R) residues probably increases protein affinity for DNA binding during chromatin condensation.8,22 Also arginine (R) has higher hydrogen bond potential as compared to lysine (K).8 This allows chromatin to be more protected from DNA damaging sources.8 These Drosophila TPs are less basic than both histone H1 linker-like and protamine-like proteins (Table 4; Figure S2; File S1).8 This is unlike their mammalian counterparts.
The functional domains shown in Table 5 and Table 6 are present in the protein orthologs and their respective N terminal conserved regions. Rathke and co-workers4,5 found a high mobility group (HMG) box that spanned from amino acid residue 4 through 84 in Tpl94D. The functional domains listed in Table 5 and Table 6 illustrate that HMG proteins are highly conserved chromosomal proteins that have DNA binding properties47 and are often involved in transcription.48 The conserved HMG box in Tpl94D has been postulated to be involved in the disruption of nucleosomal structure during the histone to protamine transition in Drosophila.4,49
A consensus of InterProScan 5, Phyre2, and HMMER found a large overlap of an HMG box within the conserved region described in the current work. The HMG box partially overlapped with the DUF1074 family of proteins. The functionality of DUF1074 family of proteins is currently unknown, although DUF1074 is part of the HMG box-like super family that includes 6 other protein families. These six protein families are CHDNT, DUF1014, DUF1073, DUF1898, HMG Box and YABBY, which have been annotated by the Sanger Institute.50 The secondary and tertiary 3D model wire frame structures of the conserved regions for the putative Tpl94D orthologs found in the current work appear to be nearly identical to each other. Furthermore, these secondary and tertiary wire frame structures are similar to known HMG boxes and DNA binding proteins (Fig. 3). The HMG structure is known for its 3 α helices, which appear to be similar to the DNA-binding motif found in histone H1 linker-like proteins.12,14 The conserved Tpl94D region aligns with the secondary and tertiary 3D models of the conserved region found in Drosophila protamine-like proteins (Fig. 5).13 A T-Coffee alignment of the Tpl94D orthologs and Drosophila protamine-like proteins indicates conservation (Fig. 6). In this alignment, the first translated exon for D. pseudoobscura GA18970 (Dpse GA18970/GA31252-MST35Ba/MST35Bb) was used because the length of the protein is 569 amino acids. A recent annotation update to Flybase has indicated that the first exon for D. pseudoobscura GA18970 is a separate gene called GA31252, but other annotation sites such as ENSEMBL still refer to this exon as part of GA18970.29,31 Additionally, the first translated exon for Dpse GA18970/GA31252 -MST35Ba/MST35Bb contains the conserved region found among the rest of the protamine-like and Tpl94D orthologs.13 When the whole protein sequence of Dpse GA18970/GA31252 -MST35Ba/MST35Bb is used, the same conserved region is found when aligned with rest of MST35Ba/MST35Bb and Tpl94D orthologs (Figure S6).
Dana GF19889-Tpl94D and Dpse GA22645-Tpl94D are conserved at the N-terminal HMG box-DNA binding region when aligned with both MST35Ba/MST35Bb and Tpl94D orthologs. In contrast, the N terminal HMG box-DNA binding region of the Drosophila protamine-like protein orthologs is conserved with the C-terminal end of Tpl94D within the melanogaster species subgroup (Dsim GD20990-Tpl94D, Dsec GM26474-Tpl94D, Dyak GE10340-Tpl94D, and Dere GG11172-Tpl94D) (Fig. 6). The melanogaster species subgroup contains a conserved sequence identified as c2yrqA in the Protein Databank (PDB), which spans from the N to the C terminus (Table 6). C2yrqA is known to be involved in DNA binding and contains a HMG box (Table 6). Dere GG11172-Tpl94D aligns 2 PDB proteins (c2e6oA and d1v64a) that span from the middle of the protein sequence to the C terminus. PDB proteins (c2e6oA, c2cs1A, d1v64a, d1hmfa, and c2yrqA) indicated in Table 6 are present in the conserved region in the Tpl94D orthologs (Table 5). The variation in the protein alignments of the Drosophila protamine-like protein (MST35Ba and MST35Bb) orthologs and Tpl94D orthologs can be attributed to vast sequence length differences.13
The conserved regions in Tpl94D protein orthologs and Drosophila protamine-like protein orthologs appear to have the same primary function of binding DNA during Drosophila spermatogenesis as reflected by the T-Coffee alignment (consensus score = 93; Fig. 7). Hence, both conserved regions have a similar function of binding DNA through their respective highly basic HMG box during spermiogenesis.
Collectively, the results (File S5) of the transcriptome RNA-Seq analysis of D. melanogaster, D. simulans, D. yakuba, D. ananassae, and D. pseudoobscura reveal that all protein orthologs for Tpl94D are highly expressed in the testes. Dyak GE10340-Tpl94D was reconfirmed to be testes specific by NCBI expressed sequence tag (EST) MEGABLAST.32,33 The testes and ovaries expression results in Cuffdiff2 for Dpse GA22645-Tpl94D yielded similar expression results as published by Van Kuren and Vibranovski.34 Our FPKM differential expression result for Dana GF19889-Tpl94D was lower when compared to Van Kuren and Vibranovski.34 This may be attributed to the different approach for mapping the reads to the reference genome and the quality assessment during the pre-processing stage. Regardless, our RNA-Seq differential expression results and Van Kuren and Vibranovski34 showed high expression for Dana GF19889-Tpl94D in the testes as compared to the ovaries. Overall, Dana GF19889-Tpl94D had comparable log fold change values in EdgeR (File S5).51 Additionally, DESeq was utilized to further test the differential expression of D. ananassae RNA-Seq testes and ovaries data (File S5).52 DESeq revealed high expression in the testes as compared to the ovaries for Dana GF19889-Tpl94D. D. melanogaster Tpl94D and Dsim GD20990-Tpl94D were verified to be highly expressed in the testes by analyzing the gene loci locations in the genome browser in ModENCODE.30 Likewise, the expression of Dpse GA22645-Tpl94D in the testes was verified by analyzing the gene loci location using Flybase and ModENCODE. These Tpl94D orthologs have small p and q values, which signifies confidence in differential expression FPKM values from Cuffdiff2.53 Heatmaps were generated using CummeRbund in R Studio to show the high expression of Tpl94D orthologs in the testes as compared to ovaries (Figure S7).54 This analysis showed that Tpl94D_A isoform (FBtr0084339) is more highly expressed than Tpl94D_B isoform (FBtr0310110) in D. melanogaster testes. Additionally, NCBI IUTA analysis shows that Tpl94D_A isoform (FBtr0084339) is the dominant isoform of the Tpl94D (FBgn0051281) gene as compared to Tpl94D_B isoform (FBtr0310110) in D. melanogaster testes (Figure S8). Our RNA-Seq transcriptome expression results across the available sequenced Drosophila species show that Tpl94D orthologs are highly expressed in the testes and have a similar role to Tpl94D in D. melanogaster during spermatogenesis.
Some RNA-Seq data sets presented in this study contained testes and ovaries with tracts30,31 and without tracts.34 There was minimal differential expression difference for Tpl94D orthologs between whole reproductive organs with tracts versus organs without tracts. Additionally, our differential expression results for Tpl94D and its orthologs in D. melanogaster (control), D. simulans, D. ananassae, D. yakuba, and D. pseudoobscura were very similar to the genome-wide studies conducted by ModENCODE;30 Flybase;31 Begun et al.;33 Begun et al.;32 and Van Kuren and Vibranovski.34
All of these Tpl94D orthologs exhibit the characteristic HMG box at the N-terminus and a high degree of DNA binding amino acids. A sensitivity analysis of the amino acid sequence alignment was another approach corroborating that the N-terminus HMG box is more conserved (unambigious) across species (Figure S9). Because sequence alignments establish characters used to build evolutionary trees they are also sensitive to species sampling.37 Thus, in the future, when additional Tpl94D sequences are available, we anticipate that there will be fewer gaps and unambiguous sites in the sequence alignments, and that the features of Tpl94D orthologs will be better understood.
As one progresses to hierarchical levels in the phylogeny further from D. melanogaster (Figure S10), the variation in the amino acid length of Tpl94D increases. In fact, the D. ananassae (Dana GF19889-Tpl94D) and D. pseudoobscura (Dpse GA22645-Tpl94D) orthologs required further corroboration through RNA-Seq analysis of their testes and ovaries transcriptome datasets.
The current work does not identify putative transition protein-like proteins in the other Drosophila species, however, they may exist. Our inability to identify Tpl94D orthologs in those species might be due to greater variation in sequence from the D. melanogaster Tpl94D reference sequence. Currently, there are no available testis or ovary transcriptome data sets for D. sechellia, D. erecta, D. persmilis, D.willistoni, D. mojavensis, D. virilis, and D. grimshawi (Table 1).
The phylogenetic analysis yields 2 (Tree A and Tree B) most equally most parsimonious trees (Figure S10). The topology of Tree A in Figure S10 more accurately reflects the taxonomic groupings and well-established phylogeny when all 9 species within the melanogaster species subgroup are included in analyses.31,55,56 The topology of Tree B in Figure S10 depicts an anomalous sister relationship between D. yakuba and D. erecta forming a clade that is sister to the melanogaster species complex. This topology has been seen previously by 12 Drosophila Consortium and Flybase.31,56 Phylogenetic analyses are sensitive to species sampling; therefore, this anomaly is most likely due to the reduced number of species represented within the melanogaster species subgroup in the phylogenetic analyses.
The work presented here indicates that the orthologs for Tpl94D are present in the sequenced Drosophila species of the melanogaster species subgroup (D. simulans, D. sechellia, D. erecta, and D. yakuba), D. ananassae, and D. pseudoobscura. The RNA-Seq differential expression data for D. melanogaster, D. simulans, D. yakuba, D. ananassae, and D. pseudoobscura indicates a high expression of Tpl94D and its respective orthologs in the testes as compared to the ovaries. Additionally, Drosophila Tpl94D orthologs share a conserved DNA-binding region with Drosophila protamine-like proteins. The conserved HMG box among all the Tpl94D orthologs has been postulated to be involved in the disruption of nucleosomal structure, which facilitates the transition from histone-bound nucleosome chromatin to a protamine-bound chromatin structure in Drosophila.4,49 In addition, the rigorous bioinformatic methodology used in the work reported here can be used to annotate Tpl94D orthologs in any newly sequenced Drosophila species found within the melanogaster species group. We suggest that the Drosophila Tpl94D orthologs should be classified as their own transition protein group.
The reference genomic, transcript, and protein sequences for D. melanogaster transition protein Tpl94D were acquired from NCBI and Flybase. A nucleotide BLAST, protein BLAST, and Position-Specific Iterated (PSI)-BLAST were conducted on the original 12 sequenced Drosophila genomes:56 D. melanogaster, D. simulans, D. sechellia, D. erecta, D. yakuba, D. ananassae, D. pseudoobscura, D. persmilis, D.willistoni, D. mojavensis, D. virilis, and D. grimshawi. Potential orthologs were identified for transition protein Tpl94D using BLASTX and NCBI open reading frame finder (ORF finder). The cut off threshold for Tpl94D open reading frame orthologs was query coverage of 40% with maximum identity score of 36% and an E-value of 7 × 10−5. The best protein matches for Tpl94D were analyzed for conserved domains by the local alignment tool T-Coffee (http://tcoffee.crg.cat/apps/tcoffee/).57
The DNA binding bioinformatic tools DNA Binder, BindN+, and DP- Bind, were used to analyze each of the best protein matches for Tpl94D and their respective conserved domains for prospective DNA binding regions. DNA Binder uses a regression based algorithm through support vector machines (SVM) models to determine whether a protein sequence is involved in DNA binding (http://www.imtech.res.in/raghava/dnabinder/).58 Three defined datasets called realistic, alternative, and main set parameters are used to determine whether the user defined protein sequence is DNA binding. The realistic data sets contain 146 DNA binding proteins and 1500 non DNA binding proteins with the analysis parameters set to 47.95% for sensitivity, 93.33% for specificity, and 89.31% accuracy. The alternative dataset is the largest of the 3 data sets with 1153 DNA binding proteins and 1153 non-DNA binding protein chains. The main dataset is the smallest of the 3 types of data sets provided in DNA Binder and is primarily used in the identification of DNA binding regions and domains within a large protein sequence. The main dataset contains 146 DNA bind proteins and 250 non-DNA binding proteins with the analysis parameters set to 78.11% for sensitivity, 80.80% for specificity, and 79.80% for accuracy. The provided sequence is considered as DNA binding if the score is close or above 1 in DNA Binder. In contrast, a non-DNA binding score will be closer to −1 or less. In the case of a score is in between −1 and 1 and is close to zero then the provided protein sequence may or may not be a DNA binding domain.58
The BindN+ uses 2 data sets (PDNA-62 and PRINR25) from the Protein Data Bank (PDB) to analyze user defined amino acid sequences in FASTA format for potential to bind to DNA. The evolutionary information in regards to the user defined amino acid sequence is acquired in BindN+ by searching through UniPortKB and PDB (PDNA-62 and PRINR25) databases. The analysis in BindN+ was conducted using the recommended settings of 79% for the specificity. The results in BindN+ are given a score of positive or negative with confidence score under each amino acid ranging from one to 9 with one being the least confident and 9 being the most confident.59
Lastly, DP-Bind was also used to analyze the probability of the user defined the amino acid sequences to bind to DNA. DP-Bind returns highly sensitive and conservative results as compared to BindN and BindN+.60,61 DP-Bind determines a user defined amino acid sequence based on 3 different approaches:56 support vector machines (SVM),56 kernel logistic regression (KLR), and62 penalized logistic regression (PLR). The three approaches in DP-Bind use non-redundant datasets of 62 experimentally determined structures of proteins that have been shown to bind to double-stranded DNA. These three algorithms are combined with position-specific scoring matrix (PSSM) in PSI-BLAST that are used to generate a score of one (DNA binding) or zero (not DNA binding) for each amino acid in the user defined sequence. The combined PSSM-SVM had the following analysis parameters: 76% +/− 9.1 for accuracy, 76.7% +/− 18.6 for sensitivity, and 74.8% +/− 12.5 specificity. The combined PSSM-KLR had the following analysis parameters: 77.2% +/− 9.3 for accuracy, 76.4% +/− 18.5 for sensitivity, and 76.6% +/− 11.2 specificity. The combined PSSM-PLR had the following analysis parameters: 73% +/− 8.8 for accuracy, 73.3% +/− 18.4 for sensitivity, and 71.8% +/− 12.8 specificity. A probability score ranging from one (high probability) to zero (low probability) states the likelihood of the amino acid residue to bind to DNA. DP-Bind contained 2 additional tests called majority consensus and strict consensus. These two consensus tests summarized the results from PSSM-PLR, PSSM-KLR, and PSSM-SVM with a score of zero (not DNA binding), one (DNA binding), and not assigned (NA – cannot be determined). The majority consensus had the following set analysis parameters: 76% +/− 9.0 for accuracy, 76.9% +/− 18.6 for sensitivity, and 75.3% +/− 12.0 specificity. Likewise the strict consensus had the following set analysis parameters 80% +/− 9.4 for accuracy, 79.1% +/− 19.4 for sensitivity, and 78.6% +/− 12.7 specificity. We used the recommended approach by DP-Bind to seek a consensus of all 5 results (PSSM-SVM, PSSM-KLR, PSSM-PLR, majority consensus, and strict consensus) to determine whether each amino acid in a sequence was DNA binding or not DNA-binding.
Sequence Manipulation Suite 2 - Protein Statistics (http://www.bioinformatics.org/sms2/protein_stats.html) was used to analyze the amino acid content for each of the NCBI Open Reading Frame (ORF) Finder, protein BLAST, Position-Specific Iterated (PSI)-BLAST, and BLASTX and conserved sequence regions in Tpl94D matches. The following published sequences were added to the comparison: Mus musculus histone H1 linker-like protein (GI: 9055232), Rattus norvegicus histone linker-like H1 domain, spermatid-specific 1, (GI: 157818369), Mus musculus spermatid nuclear TP1 (GI: 6678395), Mus musculus nuclear TP2 (GI: 31981239), Rattus norvegicus spermatid nuclear TP1 (GI: 8394472), and Rattus norvegicus nuclear TP2 (GI: 51036639).
The respective NCBI ORF Finder, protein BLAST, PSI-BLAST, and BLASTX and conserved sequence regions in Tpl94D matches were analyzed for functional domains through EMBL-EBI's Interpro Scan 5 (http://www.ebi.ac.uk/interpro/),63 HMMER (http://hmmer.org/),64 and Phyre2 (http://www.sbg.bio.ic.ac.uk/phyre2).65 The functional groups were identified using Phyre2, Interpro Scan 5, and HMMER. The putative 3D secondary and tertiary models for each conserved regions for Tpl94D matches were modeled using Phyre2. The 3D models were then analyzed using Molsoft ICM Browser (http://www.molsoft.com/).
Testes and ovaries transcriptome Illumina RNA-Seq FastQ data files were acquired from publicly available EMBL-EBI-SRA based on their corresponding NCBI SRA identification codes for D. melanogaster, D. simulans, D. yakuba, D. ananassae, and D. pseudoobscura. The NCBI SRA identifications for these publicly available data sets are listed in Table S1. Quality assessment and trimming of the FastQ files was done using FastQC 0.10.1 (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc) and Trimmomatic 0.32,66 respectively. The trimmed and quality assessed FastQ files were then uploaded onto iPlant Collaborative's Discovery Environment for differential expression assessment.67 The genomic sequences and general transfer formats (GTF) for D. melanogaster 5.21, D. simulans 1.21, D. yakuba 1.3, D. ananassae 1.21, and D. pseudoobscura 2.21 were uploaded to iPlant Collaborative's Discovery Environment67 from ENSEMBL.29 All reads were then mapped using Tophat 2.0.9 with Bowtie 2.1.0 with the settings of -g 1 with species appropriate reference GTF and reference genomic sequence.53,68 The settings for Tophat 2 were acquired from Flybase (http://flybase.org). The – g 1 setting instructed Tophat 2.0.9 with Bowtie 2.1.0 to allow only 1 alignment to the provided reference genome for a given read. This was done so to have a conservative approach in mapping the reads to reference genome as the default setting is 40. All paired-end datasets were aligned with the inner mate distance of -r 150 as stated on Flybase (http://flybase.org). The rest of the parameters for Tophat 2.0.9 were left as default.
Cufflinks 2.0.2 was then used to assemble the reads with species appropriate reference GTF and reference genomic sequence. The reference genomic sequences were provided through –b/-frag-bias-correct < reference_genome.fa > setting in Cufflinks 2.0.2, which improved the accuracy of the transcript abundance by running new bias detection and by using a built-in correction algorithm.53 Multi-read correction option, –u/-multi-read-correct, was enabled during Cufflinks 2.0.2 to improve the accuracy of the reads mapped to multiple locations in the reference genome. Cuffmerge 2.0.2 was then used to merge all the GTF output files from Cufflinks 2.0.2 in a species-specific manner with the species-specific reference annotation (-g/-ref-gtf ENSEMBL GTFs) and all isoforms were discarded with abundance below 0.1. This was done to merge all novel isoforms and known isoforms to obtain maximum assembly quality.53,69 The merged output GTF from Cuffmerge 2.0.2 and the species and tissue sample appropriate output from Tophat 2.0.9 were used in Cuffdiff 2.0.2 to evaluate the differential expression between the ovaries and the testes for D. melanogaster Tpl94D orthologs in D. simulans, D. yakuba, D. ananassae, and D. pseudoobscura. In Cuffdiff 2.0.2, the default setting of 10 was used for the minimum number of counts (-c/-min-alignment-count), which signified the minimum number of alignments to be present to test the significance in change between the ovaries and testes at samples for any gene loci.53,69 The accuracy of the transcript abundance was improved by enabling fragment bias correction with species-specific genome (b/-frag-bias-correct < reference_genome.fa >) and multi-read correction (–u/-multi-read-correct) in Cuffdiff 2.0.2. Also the default false discovery rate (-FDR) of 0.05 was changed to 0.01 in Cuffdiff 22.214.171.124 The remaining conditions for Cuffdiff 2.0.2 were left as default. Heatmaps were generated using cummeRbund for the Tpl94D orthologs and isoforms in D. melanogaster, D. simulans, D. yakuba, D. ananassae, and D. pseudoobscura.54
A count-based differential expression approach was used to conduct the 2 additional RNA-Seq approaches. The Tophat 2.0.9 alignment for D. ananassae was converted to counts file by using HT-Seq counts70 with the D. ananassae 1.21 GTF from ENSEMBL.29 Then EdgeR51 and DeSeq52 was used at default settings with false discovery rate (FDR) set to 0.01 to analyze the differential expression between ovaries and testes data sets for D. ananassae. EdgeR and DeSeq were conducted on iPlant Collaborative's Discovery Environment.67
Isoforms for Tpl94D orthologs in D. melanogaster, D. simulans, D. yakuba, D. ananassae, and D. pseudoobscura were analyzed using NCBI Isoform Usage Two-step Analysis (IUTA) in R Studio with R 126.96.36.199 We created 2 array variables in IUTA that contained all the ovaries (bam.list.1) and the testes (bam.list.2) paired-end Tophat 2.0.9 alignments for each specific species. A third variable was created called “transcript.info” that indicated the species specific GTF from ENSEMBL.29 These variables were created in accordance with IUTA's manual. IUTA was run independently for each species with fragment length distribution (FLD) setting set to empirical and 3 statistical tests called SKK, CQ, and KY enabled.26,28,71,72 IUTA recommended the empirical settings to be used for the fragment length distribution for each sample group (ovaries vs. testes) per species. Pie charts were generated using IUTA to illustrate the percentage of each isoform present in the testes and ovaries for D. melanogaster, D. simulans, D. yakuba, D. ananassae, and pseudoobscura.
From NCBI Ref-Seq protein sequence for Tpl94D orthologs were identified using D. melanogaster isoform A and isoform B as the reference sequences for BLAST searches (Tables 2–3). The length of the Tpl94D orthologs varies across species. Therefore, a sensitivity analysis was run to create an unbiased approach for placement of gaps and identification of characters by position.35 Multiple alignments were performed using the ClustalW method within the program MEGA6 under a Gonnet weight table for amino acid change where the gap extension penalty (GEP) was held constant while the gap opening penalty (GOP) varied.38,73,74 A stable alignment was found to exist when amino acid sites considered to be ambiguous were eliminated.38 Therefore, the character matrix for the phylogenetic analysis only contained unambiguous positions for the Tpl94D orthologs. An exhaustive search under a maximum parsimony criterion was run on PAUP* version 4.0a14.75 The gaps were treated as missing and the tree was rooted with the outgroup, D. pseudoobscura.
No potential conflicts of interest were disclosed.
The authors thank the anonymous reviewers for suggestions that significantly improved the manuscript. We are grateful to Jennifer Hillman Jackson (Pennsylvania State University and Galaxy), Dr. Roger Barthelson (Iplant Collaborative), Dr. Sheldon McKay (Iplant Collaborative), Nirav Merchant (Iplant Collaborative), Andy Edmonds (Iplant Collaborative) and other members of the Iplant Collaborative team for their support during the transcriptome and isoform analysis. We also would like to thank Michael Campbell (Utah University) and Dr. Chris Childer (USDA) for introducing us to the Iplant Collaborative.