The genome of Trypanosoma cruzi CL Brener is highly repetitive and polymorphic. This makes assembly and subsequent annotation of the genome difficult, and the current draft sequence is fragmented and lacks many copies of highly repetitive genes. We have studied all protein coding genes and pseudogenes in T. cruzi for alignment depth using a sensitive multiple alignment tool to build new alignments representing a gene and all its copies.
A database of all annotated genes in Trypanosoma cruzi and their repetitive qualities has been created. It can be searched using a locus id or a GenBank id. The information includes estimation of copy number, collapses between annotated genes and, in many cases, repeat cassette information. A few selected genes have been studied in more detail, using the assembly finishing tool DNPTrapper.
Characteristics of Trypanosoma cruzi genes
There are 23 216 predicted protein coding genes and pseudogenes in the published release of the Trypanosoma cruzi CL Brener genome. Of these 23 216 sequences, 12 642 (54%), were annotated as hypothetical, and 3 603 (16%) as pseudogenes. Alignments of 22 640 annotated genes were built. This is 98% of all annotated genes. The information gathered in these alignments is stored in the database. Sequences that did not create an alignment are too short or too repetitive for our alignment criteria, or represent possible misassemblies. Too short in this case means shorter than the shortest overlapping read and too long, or too repetitive, means that the resulting alignment was too large to keep in the memory of the computer. Most annotated genes that do not form useful alignments are long and repetitive, with the majority being surface antigens. These will be added to the database at a later date.
There are a few hypothetical genes that did not form alignments, as well as the DnaJ homolog subfamily A member 2 and one version of eukaryotic translation initiation factor 1A. The former was too short to have an overlap with a read and the latter had multiple insertions in the sequence, which made it non-homologous to all of the reads. The insertions might be the result of misassembly of the gene. The lack of overlapping reads for these two genes imply that they are not repetitive.
Of the alignments created, 4 171 (18%) showed an average alignment depth of 100 or more. With an average coverage in the sequencing of 7 [10
], an average alignment depth of 100 indicates 14 copies in the diploid genome. These 4 171 annotations represent 73 different functional annotations, with mucin-associated surface protein (MASP) being the most common with 913 copies in the published genome with an average coverage over 100. The distribution of the estimates is shown in Figure . The distribution for hypothetical genes is shown separately. The figure shows the number of annotations for each copy number estimate. As expected, most annotations have an estimate of 2, where the distribution peaks. The hypothetical genes have a lower fraction of repeated annotations. At a cutoff of 14 copies, which is the cutoff we have used for repetitive genes, the fraction of repeated hypothetical genes is less than a third of that of all annotations (6.6% versus 19.1%). This is clearly visible in Figure , where the distribution of the higher estimates is zoomed in.
Figure 2 Distribution of estimated copy numbers. The estimated copy number is calculated for each annotation by averaging the alignment depth along the annotation and dividing the average by 7, the average shotgun coverage. The distributions of all annotations (more ...)
A list of the repeated annotated genes is shown in Table . This table does not include the hypothetical genes, as they could not be clustered by name. The names are all unique, which means that there could be variants of the annotation that were not included, depending on differences in how the annotations have been curated. This is particularly important for generic gene names, such as for example kinase. Annotations with more specific names are more reliable. The table also shows the number of annotations in the genome assembly that have an average alignment depth of 100 or more (corresponding to an estimated copy number of 14 or above, column 3) and the average estimated copy number of these annotated copies (column 4).
Draft genome copy numbers and estimated copy numbers of repeated genes
As an example, consider flagellar calcium binding protein (FCB). There are eight copies of FCB in the draft genome (column 2). All of these have alignments with an average shotgun depth of 100 or higher (column 3). The copy number estimate of these eight alignments, i.e. the alignment depth divided by the average coverage of the genome sequencing, is, on average, 66 (column 4). In the collapse analysis, each of the genomic copies were grouped with the other seven other copies. In two cases one more annotated gene was in the same collapse (annotated as calcium-binding), and one case two more annotated genes were grouped with the FCBs. These results show that, for FCB, the estimated copy number (66) is the correct total number of FCB genes in the genome.
We chose a stringent cutoff for the average alignment depth (100) in order to ensure that we with high fidelity picked repetitive genes, rather than regions of high shotgun coverage due to random sampling depth. Using the average coverage of the genome sequencing (7×), an average alignment depth of 100 gives an estimated copy number of 14. Interestingly, the estimation of the number of repeat copies for surface antigens and other highly represented annotated genes from our analysis (column 3 in Table ) was almost always low compared to the total number of family members present in the assembly (column 2 in Table ). Conversely, high estimates of copy number almost always correspond to annotated genes with few copies in the assembly. This indicates that gene repeats such as the surface antigens, which are driven towards diversity, were in many cases separated in the assembly; whereas highly similar repeats pose a larger problem for the assembly program.
An example is given in Figure , where the distribution of the copy number estimations of the 1 430 trans-sialidase annotated copies is shown together with the distributions for all annotations and all hypothetical genes. The average copy number estimation of trans-sialidase in this dataset is 16. If, on the other hand, only alignments with a shotgun depth of 100 or greater are considered, as in Table , the average copy number estimate of trans-sialidase is 24. To further clarify, consider the two trans-sialidases in Figure . The figure shows the contigs containing the genes and graphs of the alignment depth sampled along the genes. One of the examples has a high alignment depth, while the other appears to only be present as one copy in the genome. This is representative of the trans-sialidases, who, as many of the surface antigen genes, are both divergent enough to be separated in the draft genome and to not be clustered in our analysis, and in some cases similar enough to not be assembled correctly and cluster in relatively large alignments. The estimated copy numbers, represented in column 3 in Table , are in cases such as the trans-sialidases estimates of the number of gene variant sub-groups within the larger gene family. In an effort to deduce the actual number of trans-sialidases in the genome, we calculated the average coverage of all trans-sialidase genes. A total of 65 815 reads, with a total length of 47 507 179 bp, formed an alignment with one or more of the trans-sialidases. The trans-sialidases have a total length of 5 783 417 bp in the assembly, which gives an average coverage of 8.2. This is close to the assembly coverage of 7, which indicates that, while there are some collapsed copies, the majority of the trans-sialidase genes were assembled separately. However, since it is clear that many reads aligned to more than one annotated copy it is likely that there are many assembly errors in members of this family, especially at the consensus sequence level, due to regions of high similarity. As a contrast, consider again FCB, which had eight copies in the draft genome and an estimated copy number of 66.
Figure 3 Coverage of two trans-sialidases. A, B shows two contigs with annotated putative trans-sialidases. C, D show the coverage every 100 bp along the genes. Tc00.1047053511875 (A, C) has an average shotgun depth of 9, indicating only one copy in the genome. (more ...)
The functional classification of the repeated annotated genes is shown in Table . 738 (18%) of the repeated annotated genes were annotated as hypothetical proteins, a somewhat lower fraction than of the entire set. 61% of the annotated genes are surface antigens. The most striking feature in this table is the wide range of functions covered by the repeated genes. Not only surface antigen genes and housekeeping genes, but close to all functional categories have genes that are predicted to be repeated.
Functional classification of repeated genes
Of the 4 171 alignments with an average alignment depth over 100 that were searched for cassette boundaries, 3 788 (91%) yielded estimates. Of the 1 787 cassettes where the collapse could be used for verification, the borders were supported in 1 318 (74%) cases. Of the 190 cassettes that could be verified using alleles, 118 (62%) were supported. The multiple alignments that did not yield cassette estimations may be dispersed gene repeats or have a larger cassette than our parameters allowed (400 bp on each side).
16 149 annotated genes collapsed with other annotated genes. The remaining sequences either collapsed into one copy in the genome assembly, exist in only one copy, or have two alleles that are too diverse to collapse.
In an attempt to estimate the total gene number of Trypanosoma cruzi CL Brener, the total number of basepairs of reads aligned to annotated genes (765 951 reads, 547 513 087 bp) was compared to the total number of basepairs in annotated genes in the original genome assembly (32 883 890 bp). The gene base pair count was corrected for genes annotated as a collapse in the assembly between the two haplotypes (Esmeraldo and non-Esmeraldo). There were 1796 such annotations, and they covered a total of 2 713 442 bp. In total, the annotated genes covered 35 597 332 bp. That gave an average depth over the annotations of 15.4. A comparison with the average coverage in the genome (7) gives an average number of copies per annotation of 2.2. Hence we estimate the number of protein coding genes (including pseudogenes and allelic variants) in T. cruzi to be approximately twice the previously estimated amount bringing it up to around 40 000.
All information gathered in this study is available on-line. The data are stored in a postgreSQL database, which is accessible through a php interface. The data can be searched using a locus id or a GenBank id. The user can also upload a file of locus ids or GenBank ids. At the top of the results page there are links to the respective GeneDB and GenBank pages. GeneDB [26
] is a repository of the official annotation and genome browser for T. cruzi
. An additional database and browser (TcruziDB) can be accessed at [27
There are four choices of how much information to show for each search: estimated copy number, collapse, COGs, and cassette. The estimated copy number shows the average depth of the multiple alignment, the standard deviation of a sampling of the depth and the estimated copy number of the annotated gene. There is a link, Details, to the sampled depth, so that the user can see the distribution. The database can also be searched by contig or by estimated copy number.
The collapse shows the annotated genes that are highly similar to the query (at least 90% of the shorter of the two is in the alignment). A link at the collapsed sequence, Info, gives the user information on the putative function, allele, possible split junction, and hybrid strain of the annotation. A link at the end of the collapse information, Get Seqs, gives the DNA sequences of all annotated genes in the alignment, including the query.
The COGs information was taken directly from the T. cruzi
genome project, where the COGs were formed using jaccard clustering [10
]. The annotated genes in a COG share potential function, and complement our collapse data.
The cassette data is only available for certain queries. It shows the number of basepairs of cassette sequence before and after the start and end of the annotation. There is also information on whether or not the annotations in the collapse verify the cassette borders and whether or not the other allele verifies them. A link, Info, provides additional information on the cassette, such as if the cassette contains more than one annotations.
A list of the reads participating in an alignment can be reached through the link Get the read list at the end of the result page. The reads and the quality values are available at the database website.
The genes selected for further analysis highlight different characteristics of T. cruzi repetitive genes, with respect to organization, conservation and other features. They also illustrate the types of useful information that can be extracted from the database in combination with the DNPTrapper software.
Allelic tandem arrays
Tyrosine aminotransferase (TAT), codes for an aromatic amino acid transferase, similar to the corresponding tyrosine catabolism enzymes found in rat and human liver [28
]. For T. cruzi
, TAT has been annotated in nine contigs in GeneDB, in most cases at the end of a contig or in a small contig by itself, indicating that the assembler used for the whole genome assembly of T. cruzi
(Celera assembler, [29
]) was not able to extend the contigs in the TAT regions due to repeats.
We used two of the 13 annotated gene variants to search for reads. These two sequences were in the same COG, collapsed together, and both indicated a copy number for this particular version of TAT of 47 in the gene repeat analysis described above. Using GRAT, 1 357 reads matching the region within the similarity threshold were retrieved. Out of these, 710 were not included in the assembly and were thus analyzed for the first time in the present study.
Using DNP Trapper, the reads could be divided into two large groups of roughly the same size (group 1: 670 reads with maximum alignment depth 210, indicating approximately 30 copies; group 2: 560 reads, depth 205, around 30 copies) based on their DNP content. Analysis of the distribution of mate pairs from shotgun clones supported this division in that there were many mate pairs within both groups and none between the groups. This supports the hypothesis that TAT is present in two allelic tandem arrays in the genome, one on each of a homologous pair of chromosomes. Furthermore, this illustrates a previously observed phenomenon in T. cruzi
, where repetitive genes differ in sequence between homologs but are more conserved within homologs [10
]. In the intergenic region, however, the DNP patterns were less homogenous in both groups, allowing for further division into around 20 distinct subgroups for group 1 and 10 subgroups for group 2. 120 reads could not be reliably assigned to either of the two large groups, because of lack of, or ambiguous, DNP information.
The DNPs in the coding region were investigated to see if they represent differences in the corresponding amino acid sequences. In group 1, three subgroups with DNPs in coding positions were identified. Group 2 had two such subgroups, one of which had an alignment depth of 140, indicating that the majority (20 out of 30) copies in this group had sequence that was divergent from that available in the public databases.
The amino acid changes were analyzed for possible functional effects using the prediction software SIFT. We created an alignment using PSI-Blast, with median sequence conservation at the prediction sites of 2.62, except at the first amino acid change where it was 3.09. All SNPs were predicted to be tolerated except the first, which was a Valine to Tyrosine change and had a SIFT score of 0.00. It had, however, a low 'Seq Rep' value (0.41), which indicates a poor prediction [24
Similar results were obtained for flagellar calcium binding protein (FCB), which is believed to be important for trypanosomatid flagellar mobility [30
], and has been proposed to be used in serological diagnosis of Chagas' disease due to the T. cruzi
specificity of the C-terminus [31
There are 8 annotations of FCB in three contigs in GeneDB, with all contigs breaking at the FCB site. In two of these, FCB is flanked downstream by hypothetical proteins that are homologous to each other, while the third contig consists of two merged copies. In the read retrieval, we used two of these annotated variants of the gene as well as one gene annotated as calcium-binding protein, which collapsed with the two FCBs but was not in the same COG, as queries. Two of these annotated gene variants had an estimated copy number of 73, the remaining one an estimation of 72. 1 100 reads matching the region were located, of which 869 were not included in the genome assembly.
The reads could again be divided into two large groups of 540 and 470 reads, respectively, based on DNP content, also in this case supported by mate pair data. Approximately 220 reads could not be assigned to any group. Again, this indicates that the gene is present in allelic tandem arrays on both homologs. From the maximum alignment depth in each group (470 and 400 respectively), the copy number was estimated to be 70 copies in group 1 and 60 in group 2.
Six DNPs in the coding regions were found in coding positions, one of which was present in 160 reads or around 20 copies in one of the groups. We performed a SIFT analysis on the amino acid changes as above. The median sequence conservation was between 3.06 and 4.32, with a majority of 3.06. No amino acid changes were predicted to be deleterious.
Extremely conserved repetitive gene
heat shock protein 85 (HSP85) has homology to hsp90 in S. cerevisiae
, hsp83 in D. melanogaster
, and hsp90 in chicken, and has been proposed to be a target of the autoimmune response in parasite infection [32
]. It has been annotated in three contigs in GeneDB, all ending in tandem copies of the gene (in one of the cases the contig actually ends in the highly similar HSP83).
We used three of the five HSP85 gene annotations as queries. The three sequences collapsed together and were in the same COG (as previously determined by clustering in [10
]). They all had an estimated copy number of 63. 444 reads matching the region were present in the assembly, and an additional 1279 were retrieved by GRAT, giving a total of 1 723 reads. Unlike the previous two regions, this data set could not be readily divided into two groups. This was mainly due to the lack of DNPs in the coding region: 75% of the reads do not have reliable DNPs in this region at all. The gene copies are thus so similar, even between alleles, that they cannot be separated using these methods.
Analysis showed that the few DNPs present in the coding regions are at wobble positions (except one identified case with a frame shift, probably a pseudogene), which indicates that HSP85 shows a high degree of conservation in T. cruzi.
Highly repeated surface antigen gene
Trans-sialidase transfers sialic-acid residues from host glycoconjugates to parasite mucins. It can also bind to mammalian cell receptors and undermine host defense mechanisms [33
]. The protein has an enzymatic domain on the N-terminus that is required for trans-sialidase activity. In the C-terminus, there are varying numbers of 12 amino acid antigenic repeats. There is limited divergence in the amino acid sequence of the N-terminal and there is one amino acid in particular (Tyr342) that is required for enzymatic activity [34
]. Most experiments that involve trans-sialidase only take the N-terminal, enzymatic region into account [35
As mentioned above, 1 430 trans-sialidase genes have been annotated in the T. cruzi
CL Brener genome sequence, twelve of which have been found to contain the critical Tyr342. Inactive versions have previously been found to have a histidine in this position [34
], but none of the inactive annotated variants in the current version of the genome contain this histidine.
As an example of how these types of gene families can be studied using our methods, we used two of the 1 430 annotated variants of trans-sialidase to search for reads. The two sequences were in the same collapse and COG, both had an estimated copy number of ten and both were predicted to have trans-sialidase activity.
The DNA sequences corresponding to the relatively conserved N-terminus of the active annotations were used for retrieving matching reads, resulting in 516 reads within the similarity threshold of 95%. 185 of these were not present in the assembly. Nine read groups covering most of the region were chosen for further analysis, while a few groups were discarded due to low coverage, ambiguous DNPs and in one case a frame shift indicating a pseudogene. Furthermore, three additional groups were identified as inactive in the critical position (Figure ), while still having strong similarity to the region. The fact that only three inactive copies out of 1 418 were found indicates that once the gene copies have lost their trans-sialidase activity, they have diverged rapidly.
Figure 4 Active and inactive copies of trans-sialidase. Comparison of two trans-sialidase repeat groups in DNPTrapper. Boxes indicate reads, colored dots indicate DNPs. Only part of the alignment is shown. The reads have been clustered in DNPTrapper based on their (more ...)
Out of the nine groups, four matched annotated gene variants perfectly, while the additional five groups showed similarities of 93% to 98% to annotated copies. The inactive groups had similarities between 90% and 97% to the annotated copies.
Repetitive gene with differences in predicted transmembrane regions
There are 11 812 sequences from T. cruzi in GeneDB that have been annotated as coding for hypothetical proteins. We chose one of these for further analysis. The corresponding protein is expressed in epimastigotes of T. cruzi, as shown by performing mass spectrometry on a sub cellular fraction of the parasite (Ferella, unpublished). The protein has several predicted transmembrane regions and contains a Nodulin like motif (PF06813). Proteins containing this Pfam signature are for example the peptide transporter PTRZ in S. cerevisiae, the organic anion transporting peptides (OATPs) in human and rat, and the BT1 family, such as the pteridine transporters found in L. major. All these are transporters with similar numbers of transmembrane domains, and it is therefore possible that the T. cruzi counterpart is involved in transport.
In the assembly, this hypothetical protein was part of a seven-member COG. We used two annotated variants that collapsed together and had similar cassette boundaries. The two sequences had estimated copy numbers of 27 and 28, respectively.
973 reads matching the entire gene region were extracted. The distribution of these reads was found to be somewhat uneven. The alignment depth indicated that the first ~80% of the gene has a copy number of around 60, whereas the last 20% showed lower copy numbers, around 40. 467 of the reads were not included in the genome assembly.
Out of the 40 DNP groups that could be identified, 17 with good coverage over most of the protein coding sequence were chosen for further investigation. We observed that a relatively large number of single base differences between copies of the gene give rise to amino acid differences between the corresponding proteins. 46 such positions were identified, not considering the C-terminal region where most of the copies differ in sequence as well as in length. 35 of these changes are located in transmembrane regions (identified using Phobius [37
]), and although most of them are favored substitutions, they could possibly give rise to different substrate specificities (Figure ).
Figure 5 Protein sequence alignments of gene with transmembrane regions. Protein sequence alignment of 17 good coverage groups, from EAN81429.1. The names of the amino acid sequences to the left represent the read group their consensus sequence was derived from. (more ...)