In this work we have attempted to complete the assembly of the T. cruzi CL Brener genome using all available sequence information. The result is a genome of 41 chromosome pairs ranging in size from 78 kb to 2.4 Mb. This assembly is unusual among currently sequenced genomes in that both homologous chromosomes of this highly heterozygous hybrid strain had to be constructed before a consensus model for each chromosome could be derived. Previous pulsed field gradient electrophoresis (PFGE) studies [17] have estimated that the chromosomes of T. cruzi range in size from 300 kb to over 3 Mb. While the size range of the assembled chromosomes is lower than that from the PFGE studies, the difference is likely due to the gene family rich contigs that could not be placed in this assembly. However, a previous orthogonal-field-alternation gel electrophoresis (OFAGE) study [24] has described chromosomes as small as 100 kb, which is closer to the lower end of the assembled range.
Previous chromosome-level studies [16] in T. cruzi have focused on two assemblies named chromosomes "1" (corresponding to TcChr35 herein) and "3" (TcChr6), based on the ordering of gel bands in PFGE analyses. According to these studies, chromosome "1" exists as two homologous chromosomes of 450 kb and 1.3 Mb, while chromosome "3" is present as two homologs of 600 kb and 1.0 Mb. However, while the PFGE analyses predict different-sized homologous chromosomes for both "1" and "3", in our assemblies the homologous chromosomes of each are roughly the same size (> 1 Mb for TcChr35 and ~400 kb for TcChr6). In the case of chromosome "3" (TcChr6), where the assembled homologous chromosomes are smaller than their estimated lengths from the PFGE study, it is likely that the size discrepancy is due to unassigned gene family rich contigs in the sub-telomeric and/or telomeric regions. This explanation is consistent with the previous finding that deletion of the 400 kb sequence responsible for the size difference between the homologous chromosomes of "3" resulted in no phenotypic consequences [16]. However, there is a clear contradiction between the organization of TcChr35 and the model for chromosome "1" proposed in [16], as both assembled homologous chromosomes in TcChr35 are larger than the reported 450 kb homolog. That study notes that the difference in size between the two homologs of chromosome "1" cannot be due to additional sequence between the Tcsod locus and the downstream telomere. In the assembled TcChr35, however, the Tcsod locus is essentially in the middle of the homologous chromosomes. Synteny with chromosomes of L. major (Lm Chr32) and T. brucei (Tb Chr11) supports the organization of TcChr35 as assembled herein, as do the BAC clone mappings of the Esmeraldo-like homologous chromosome. However, despite the allelic synteny across the entire chromosome, the lack of spanning BAC clones on the non-Esmeraldo-like chromosome from the Tcsod locus to the rest of the chromosome means that we cannot rule out the possibility that there exists a small homolog of that chromosome ending near the Tcsod locus (as described for chromosome "1" in [16]). If this were the case, then the remainder of the non-Esmeraldo-like chromosome would be either a separate chromosome homolog or a portion of another chromosome altogether.
The fixes and observations described above emphasize perhaps the most confounding issue with using the initial T. cruzi genome assembly [3]. Copies of heterozygous alleles are often annotated as independent genes when in fact they are alleles on the homologous chromosomes in this hybrid strain (a problem with the assembly). At the same time, many gene families include truly distinct genes at discrete loci (an aspect of the organism). These characteristics make it challenging to determine whether genes are heterozygous alleles mapping to the same locus or paralogous genes at different loci. This determination is further complicated by the fact that the complete sequence of many genes is not present in the assembled contigs. Viewing regions of the assembled chromosomes where both haplotypes are represented facilitates this determination; special consideration must be given to syntenic genes where one or both of the alleles lie at the end of a contig, because many of these are truncated and should be merged with another "gene" on the adjacent contig.
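The allele-versus-paralog decision described above can be illustrated with a minimal sketch. It is not the procedure used in this work: the `Gene` record, the haplotype labels, the `cluster` field (a stand-in for any sequence-similarity grouping) and the rule "same chromosome, opposite haplotypes, at least one shared flanking neighbour" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    chromosome: str          # e.g. "TcChr35"
    haplotype: str           # hypothetical labels: "esm" / "non-esm"
    cluster: str             # hypothetical sequence-similarity cluster
    order: int               # rank of the gene along its haplotype
    at_contig_end: bool = False


def neighbours(gene, genes):
    """Clusters of the genes immediately flanking `gene` on its own haplotype."""
    return {
        g.cluster
        for g in genes
        if g.chromosome == gene.chromosome
        and g.haplotype == gene.haplotype
        and abs(g.order - gene.order) == 1
    }


def classify_pair(a, b, genes):
    """Label two annotated genes using similarity plus local synteny."""
    if a.cluster != b.cluster:
        return "unrelated"
    same_locus = (
        a.chromosome == b.chromosome
        and a.haplotype != b.haplotype
        and bool(neighbours(a, genes) & neighbours(b, genes))
    )
    if not same_locus:
        return "candidate paralogs"
    if a.at_contig_end or b.at_contig_end:
        # Genes at contig ends are often truncated and may need merging.
        return "candidate alleles (check for truncation)"
    return "candidate alleles"


# Toy annotation: three syntenic genes on each haplotype of one chromosome,
# plus one similar gene on a different chromosome.
genes = [
    Gene("a1", "TcChr35", "esm", "kinase", 1),
    Gene("b1", "TcChr35", "esm", "X", 2),
    Gene("c1", "TcChr35", "esm", "ligase", 3),
    Gene("a2", "TcChr35", "non-esm", "kinase", 1),
    Gene("b2", "TcChr35", "non-esm", "X", 2),
    Gene("c2", "TcChr35", "non-esm", "ligase", 3),
    Gene("d1", "TcChr6", "esm", "X", 5),
]
by_id = {g.gene_id: g for g in genes}
```

In this toy data, `classify_pair(by_id["b1"], by_id["b2"], genes)` pairs the two copies as candidate alleles because they share flanking neighbours on opposite haplotypes, whereas `b1` and `d1` fall out as candidate paralogs. A real analysis would need alignment-based similarity and tolerance for missing or rearranged neighbours.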
The T. cruzi genome contains many non-gene-family, homozygous genes (i.e. with only a single annotated allele) that disrupt the allelic synteny of homologous chromosomes (Figure , Additional file 2). These sequences are likely the result of the merging of sequence from both the Esmeraldo-like and non-Esmeraldo-like haplotypes, an indication that the homologous chromosomes are, as expected, quite similar. However, there are many sequenced BAC ends whose exact sequence does not exist in the annotated genome, such as in cases where one BAC end maps to a contig on a particular homologous chromosome with near-perfect sequence identity, while the best match of the other end is to a contig on the other homologous chromosome with many mismatches (Additional file 2). Given that the genome was sequenced to 14× coverage at an error rate of ≤ 1.5%, the absence of these sequences is surprising. Further examination of the raw sequence reads may reveal that the particular sequences exist but were not utilized in the assembly process. Regardless, the current analysis has mapped to these chromosomes the candidate BAC clones that could be fully sequenced in order to correct these errors and close the remaining gaps in the chromosomes.
The assembled chromosomes provide a physical platform on which to study gene function and variation in T. cruzi. For example, the chromosome structure provided here will be particularly useful for planning and confirming gene knockouts, and thus for determining the function of hypothetical genes or confirming the function of annotated genes. As RNAi does not appear to function in T. cruzi, gene knockout remains a primary method for linking phenotypes or functions to particular gene products. In addition, the chromosomes will facilitate strain comparisons, either by techniques such as comparative genomic hybridization (CGH) or by subsequent sequencing of additional strains.
Telomeric and sub-telomeric regions of the chromosomes may never be fully sequenced; these regions are likely too redundant to assemble properly and yet too variable as a whole between strains of T. cruzi to be ultimately informative, except as examples of the degree of variability that is possible in T. cruzi. As assembled, over 23% of the annotated genes in the genome are members of large gene families, but it has been suggested that there may be upwards of 20,000 additional genes in these families that are not represented in the assembled genome due to the collapsing of reads during assembly [23]. The large number of gene families and the substantial number of members of these families will be interesting to explore further, as the biological function of such large and diverse gene families is not entirely clear. It is hypothesized that the location near chromosome ends facilitates rearrangement in these genes and thus provides a source of new variants [28]. Since members of these families are major targets of anti-T. cruzi immune responses, it is likely that this variation has a role in immune evasion. It would be of interest to determine whether gene family clusters that are integrated among the core genes in the T. cruzi genome are less prone to rearrangement over time or variation between strains than those at chromosome ends, as would be predicted.
As a caveat, one of the risks in assembling the chromosomes as described here is that a mosaic may result, given the repetitive and hybrid nature of the T. cruzi genome. Although the majority of BAC clones used for organization mapped unambiguously to the appropriate chromosomes, it must be noted that the organization was based on the most likely location of each scaffold/contig; there were many clones whose BAC ends either mapped to different chromosomes, contradicting the placement of the associated sequences, or mapped to scaffolds/contigs that were not placed on any chromosome. Thus these assemblies represent a model for the chromosomes of T. cruzi and, though still incomplete, are a vast improvement on what was previously available.
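The three outcomes for BAC clones noted above (ends agreeing with the placement, ends on different chromosomes, and ends on unplaced scaffolds/contigs) can be tallied with a short sketch. The clone names, the input layout and the category labels below are hypothetical and chosen only for illustration; they do not reproduce the mapping data or pipeline of this study.

```python
from collections import Counter
from typing import Optional


def classify_clone(end_a: Optional[str], end_b: Optional[str]) -> str:
    """Classify a BAC clone by the chromosome each of its two ends maps to.

    `None` stands for an end whose scaffold/contig was not placed
    on any chromosome.
    """
    if end_a is None or end_b is None:
        return "unplaced"
    if end_a == end_b:
        return "consistent"
    return "conflicting"


# Hypothetical clone -> (chromosome of end 1, chromosome of end 2)
clone_ends = {
    "BAC_001": ("TcChr35", "TcChr35"),
    "BAC_002": ("TcChr35", "TcChr6"),
    "BAC_003": ("TcChr6", None),
}

# Count how many clones fall into each category.
summary = Counter(classify_clone(a, b) for a, b in clone_ends.values())
```

A tally of this kind gives a rough measure of how much of the chromosome model is directly supported by clone evidence; it does not by itself indicate which of two conflicting placements is correct.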