Trypanosoma cruzi is a protozoan parasite of the order Kinetoplastida and the causative agent of Chagas disease, one of the so-called neglected diseases that disproportionately affect the poor. The disease is endemic in most Latin American countries, affecting more than 8 million people [1]. Chagas disease has a variable clinical outcome. In its acute form it can lead to death (mostly in infants), while in its chronic form it is a debilitating disease producing different associated pathologies: mega-colon, mega-esophagus and cardiomyopathy, among others. These different clinical outcomes are the result of a complex interplay between environmental factors, the host genetic background and the genetic diversity present in the parasite population. Accordingly, the different clinical manifestations have been suggested to be due, at least in part, to the genetic diversity of T. cruzi [2-5].
The T. cruzi species has a structured population, with a predominantly clonal mode of reproduction [6] and considerable phenotypic diversity [7-10]. Using a number of molecular markers, the population has been divided into several evolutionary lineages, also called discrete typing units (DTUs). Some markers allow the distinction of two or three major lineages [11-14], while other experimental strategies, such as RAPD and multilocus isoenzyme electrophoresis (MLEE), support the distinction of six subdivisions [15-17], originally designated as DTUs I, IIa, IIb, IIc, IId, and IIe [16]. Recently, this nomenclature was revised as follows: TcI, TcII (formerly IIb), TcIII (formerly IIc), TcIV (formerly IIa), TcV (formerly IId) and TcVI (formerly IIe) [18, 19]. Lineages TcV and TcVI (the latter includes CL Brener, the strain used for the first genomic sequence of T. cruzi) have a very high degree of heterozygosity but otherwise very homogeneous population structures with low intralineage diversity [20, 21]. The currently favoured hypothesis suggests that these two lineages originated after either one or two independent hybridization events between strains of DTUs TcII and TcIII [21-23].
Knowledge of the genetic variation present in a genome (i.e. between the two alleles of a diploid individual) or in a species (i.e. in the population) is of central importance for a variety of reasons and applications: i) to understand the evolutionary forces underlying the biological and phenotypic properties observed in an individual; ii) to detect cases of apparent horizontal gene transfer; iii) to assess the potential for development of resistance when validating a target for drug development; iv) to prioritize targets for development of diagnostics or vaccines; v) in the design of constructs for genetic knockout experiments in order to increase the success rate when targeting specific alleles; and vi) as genetic markers in association studies or to further probe the population structure.
The genome sequence of the CL Brener clone of T. cruzi was published in 2005 [24], together with those of two other trypanosomatids of medical importance: Trypanosoma brucei (sleeping sickness, or African trypanosomiasis) [25] and Leishmania major (leishmaniasis) [26]. However, the genome of T. cruzi was a special case for two reasons: it was obtained from a hybrid TcVI strain composed of two divergent parental haplotypes, and it was sequenced using a whole-genome shotgun strategy [24]. This choice of strain and sequencing strategy resulted in high sequence coverage of the two parental haplotypes, which were derived from ancestral TcII and TcIII strains. Because of the high allelic variation found within this diploid genome, a significant number of contigs were found to be present twice in the assembly [24]. These divergent haplotypes, which were assembled separately in many cases, were the basis of a recent re-assembly of the genome [27]. As a consequence, it is now possible to identify the genetic diversity present within this diploid genome.
More recently, several whole-genome sequencing datasets have become available from different strains of T. cruzi: the draft genomic sequence of the Sylvio X10 (TcI) strain [28], high-coverage transcriptomic data from another TcI strain (Westergaard G, and Vazquez MP, manuscript in preparation), and 2.5X whole-genome shotgun (WGS) data from the Esmeraldo cl3 (TcII) strain.
To take advantage of the hybrid genome of the CL Brener strain, and of other genome and transcriptome datasets, we designed a bioinformatics strategy to obtain information on the genetic diversity present in these data. As already observed for a significant number of molecular markers, at the majority of polymorphic heterozygous sites in strains from the hybrid lineages TcV and TcVI, each of the alleles identified can be observed in homozygosity in strains from either of the two proposed parental lineages (TcII and TcIII) [20, 21, 29-31]. Therefore, by uncovering the diversity within the CL Brener and Sylvio X10 genomes, we expect to reveal a significant fraction of the diversity that can be observed between extant TcI, TcII, TcIII, and TcVI strains.
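The expectation stated above, that each allele at a heterozygous site of a hybrid strain appears in homozygosity in one of the parental lineages, can be phrased as a simple check. The following is only an illustrative sketch on toy data, not the strategy used in this work: the function name, the genotype representation, and all positions and alleles are hypothetical.

```python
from typing import Dict, Optional, Tuple

# A diploid genotype at one site, e.g. ("A", "G"). Hypothetical representation.
Genotype = Tuple[str, str]


def homozygous_allele(genotype: Genotype) -> Optional[str]:
    """Return the allele if the genotype is homozygous, otherwise None."""
    return genotype[0] if genotype[0] == genotype[1] else None


def parental_support(
    hybrid: Dict[int, Genotype],
    parent_a: Dict[int, Genotype],
    parent_b: Dict[int, Genotype],
) -> Dict[int, bool]:
    """For each heterozygous site in the hybrid that is covered in both
    parents, report whether the two hybrid alleles correspond to one
    homozygous allele from each parent (in either order)."""
    support: Dict[int, bool] = {}
    for position, (first, second) in hybrid.items():
        if first == second:
            continue  # homozygous in the hybrid: not informative here
        if position not in parent_a or position not in parent_b:
            continue  # site not covered in one of the parents
        allele_a = homozygous_allele(parent_a[position])
        allele_b = homozygous_allele(parent_b[position])
        support[position] = (
            allele_a is not None
            and allele_b is not None
            and {allele_a, allele_b} == {first, second}
        )
    return support


# Toy genotypes (invented for illustration only), keyed by site position.
hybrid_sites = {10: ("A", "G"), 25: ("C", "C"), 40: ("T", "C"), 57: ("G", "A")}
parent_a_sites = {10: ("A", "A"), 25: ("C", "C"), 40: ("T", "T"), 57: ("G", "G")}
parent_b_sites = {10: ("G", "G"), 25: ("C", "C"), 40: ("T", "T"), 57: ("A", "G")}

print(parental_support(hybrid_sites, parent_a_sites, parent_b_sites))
```

In this toy case only the site at position 10 is supported: at position 40 both parents carry the same allele, and at position 57 one parent is itself heterozygous. Real data would additionally require handling of sequencing errors and coverage, which this sketch ignores.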
In this work we present an initial compilation of a genome-wide map of genetic diversity in T. cruzi, and its functional analysis, focussed mostly on protein-coding regions of the genome.