Characteristic of the new annotation: Improvements to the genome assembly, gene structures and functional assignments
Close examination of the initial assembly of E. histolytica
strain HM-1:IMSS revealed multiple problems. Sequence analysis using intra-scaffold dot-plots exposed 161 artifactual tandem duplications (Figure S1
, panel A) located at the boundaries between neighboring contigs (a contiguous assembled sequence ordered together to form a scaffold). Tandem duplications spanned 364,707 bp of genomic sequence with a median length of 892 bp. In the previous assembly, genes predicted on these regions and on unmasked repetitive regions caused an over-estimation of genes by approximately 18%. Indeed, of the 399 genes located in those regions, 61 hit transposable elements (TEs) or were likely pseudogenes, while most of the remaining 338 coding sequences were artifactually duplicated and so collapsed into 206 individual genes in the new annotation (Figure S1
, panel B). A comparative description of the features of the original and the new E. histolytica
assemblies is summarized in .
Genome statistics and annotation comparison.
The new genome assembly consists of ~20 Mb of sequence organized into 1,496 scaffolds. To generate a “core” assembly for functional annotation, scaffolds lacking predicted genes were not considered. The resulting core assembly consisted of 818 non-redundant scaffolds with a total of 19,220,345 bp. All scaffolds that were excluded from the core assembly as well as degenerate contigs and singleton reads, although not annotated, were considered to survey the presence or absence of genes when necessary, and all sequences were deposited in GenBank (see Methods
The results of the new assembly show higher fragmentation and a reduction in genome size with respect of the published assembly. However, our comparative analysis between the two annotations shows that there is no loss of coding information from one assembly to the other.
The new assembly contains 8,201 de novo
predicted protein coding genes, 1,784 fewer than previously reported for this genome () 
. To determine the origin of these differences and to evaluate changes in gene structure between the original (OGA) and new (NGA) annotation, genes from OGA were mapped onto the new assembly and structural differences were estimated using GSAC (see Methods
and ). Mapping results indicated that the main reason for gene number reduction is the elimination of genes within repetitive regions and artifactual tandem duplications, and the removal of genes smaller than 300 bp without any supporting evidence (). Noteworthy, less than 0.2% of the genes from the original annotation do not map onto the new assembly, despite the fact that the assembly is 2,562,911 bp smaller than the published one. These missing OGA genes contained no supporting evidence and were originally annotated as hypothetical protein coding genes. This analysis also showed that 51% of the OGA genes keep the same structure in the new annotation (same isoform in ), while 36% underwent structural change (different isoform in ).
As part of the curation process, the structure of 740 genes was manually reviewed and curated based on supporting evidence such as ESTs. An important hallmark of this work is the concerted effort from scientists of the Entamoeba community that contributed to the curation of the genome by direct communication with the authors as well as participation via specific workshops held at JCVI.
To evaluate whether structural changes in the new annotation reflect an overall improvement of gene structures we selected a group of 1,024 OGA-NGA pairs of genes that map to each other but are structurally different. Then, we ran HMM-searches and global pairwise alignments on each pair of proteins against Pfam HMMs and our PANDA database (see Methods
). Finally, we compared the resulting statistics between OGA and NGA peptides from each pair (). These analyses showed that translated products from NGA genes consistently give better hits against Pfam and PANDA databases when compared to OGA genes, demonstrating an overall improvement in gene structures for the new annotation. In those cases where NGA genes gave worse hits compared to their OGA counterparts, we manually inspected and corrected gene structures in the new annotation.
Structural annotation improvement in the new E. histolytica assembly.
Structural improvements in the new annotation were also reflected by (1) the appearance of new Pfam/TIGRfam domain hits not present in the original protein dataset and (2) the identification of genes coding for additional members of different protein families. Noteworthy, among novel protein domains are a domain typically found in some subunits of several DNA polymerases (PF04042), a domain found in phospholipid methyltransferases (PF04191) and another present in panthotenate kinase proteins (PF03630, see section below). On the other hand, point (2) is very well exemplified by the subunits of the Gal/GalNAc lectins. In E. histolytica
these lectins are composed of three different subunits: a 170 kDa heavy subunit (Hgl), a 150 kDa intermediate subunit (Igl) and a 31–35 kDa light subunit (Lgl) 
. In agreement with the current number of Hgl
genes in the new annotation, studies of pulse-field gel electrophoresis have shown that there are five hgl
and six lgl
genes in the genome 
. However, only four Hgl
genes, one of them truncated, and four Lgl
genes are part of the old dataset.
Particular effort was directed towards the improvement of functional annotation (summarized in ) by the incorporation of additional 974 enzyme commission (EC) numbers and 531 Pfam/TIGRfam domains. Gene ontology (GO) terms were automatically assigned from Pfam HMM searches refreshing and updating the assignments from InterPro evidence used in the old annotation. The advantage of using hits from Pfam HMM searches is that results can then be filtered not just by e-value but also by trusted cutoff scores, giving a more accurate estimation than InterPro searches and therefore, a more confident GO assignment. In addition to automatic EC number and GO term assignments, functional annotation has been manually curated for 2,130 genes. A total of 3,468 genes have been assigned GO terms, of which 3,216 have a molecular function term. We have distributed the specific terms in a total of 30 molecular function GO-Slim categories (Table S1
). No difference was observed in the representation of GO categories in the protein families with respect to that of singletons.
Classification and function of protein families in Entamoeba histolytica
predicted proteins were organized into protein families to facilitate the review of their functional annotation, visualizing relationships between proteins and allowing annotators to examine related genes as a group. Our family clustering method produces groups of proteins sharing protein domains conserved across the proteome, and consequently, related biochemical function, as described in Methods 
. For example, based on our clustering criteria, all proteins containing a single RhoGAP domain (PF00620) fall within the same family irrespectively of their length.
A total of 897 protein families containing 4,564 proteins (56% of the proteome) were identified from the 8,201 predicted polypeptides in the new annotation, leaving 3,637 “orphan” proteins. Among the families, 247 clusters (479 proteins) have no homology to any known Pfam or TIGRfam domain, and harbor potentially novel domains (91 of these families contain five members or more). On average, E. histolytica
families contain five proteins, ranging from two to 149 members (). We identified seven families with more than 50 members encoding proteins such as small GTP binding proteins, BspA-like leucine-rich repeat proteins, kinase domain-containing proteins, WD domain-containing proteins, a large family of uncharacterized hypothetical proteins, a RNA recognition motif domain-containing protein family and a RhoGAP domain-containing protein family (see Table S2
for the complete list of families).
E. histolytica protein families.
Interestingly, a number of protein families appear to be physically linked to transposable elements. shows the top 27 families that present this type of association (for the entire repertoire of genes see Table S3
). For example, a cluster of 31 members of the Hsp70 protein family appears associated 30% of the time with TEs within 1 kb of the gene context. Hsp70 proteins are molecular chaperones that assist a large variety of protein folding processes in the cell by the transient association between their substrate-binding domain and the short hydrophobic peptide segments present in their target proteins. Hsp70 s are highly conserved and are known to be induced by a variety of stresses 
. It has been previously reported that multiple natural TE insertions in Drosophila
reduce the level of expression of hsp70
genes by insertion nearby gene promoter regions 
. The characteristics of the hsp70
promoter in the fly may make it a suitable target for transposition leading to the generation of novel alleles. In this sense, TEs could be playing an adaptive role in microevolution by gene amplification and also manipulating the expression of genes critical for the parasite fitness 
Entamoeba histolytica protein families showing high association with repetitive elements.
Another family showing a high correlation with transposable elements is the large BspA-like surface protein family 
. Initially, Davis et al
. identified 89 genes coding for BspA-like proteins in the genome of E. histolytica
, containing a leucine-rich repeat motif (LRRs). LRRs serve as recognition motifs for surface proteins in bacteria and other eukaryotes 
and have been shown to be involved in binding to fibronectin. E. histolytica
BspA-like proteins have unique LRR-like repeats that resemble, to certain extent, to the Treponema pallidum
LLRs (LrrA proteins) 
, that appear to have a role in attachment and penetration to host tissues 
, suggesting they may be involved in attachment to the host cells. Our analysis identified 116 BspA-like genes in the genome, 41 of them associated with transposable elements. The core domain of the BspA-like proteins contains 23 amino acids with the consensus P[T/S][T/S][V/I/L]xx[I/L]GxxCFxxCxxLxx[I/L]x[I/L], and these units form tandem blocks that can contain two or more core motifs represented from 1 to 21 times in a single molecule, leading to a great variability in the protein length in the family. Most of the proteins in the family contain a novel 50 amino acids N-terminal domain that is preserved in 85 members of this cluster. A closer examination of those genes encoding proteins lacking the N-terminal domain showed they are probably truncated by the insertion of transposable elements, primarily SINE and LINE elements at their 5′ end. BspA-like proteins are located on the surface of E. histolytica 
however no classic membrane-targeting signal is present in the proteins. Therefore, it is tempting to speculate that the conserved N-terminal domain of these proteins might function as either an export signal or serve as a membrane-anchor domain or that export involves a non-classical transport mechanism, independent of the ER–Golgi pathway, similar to those that have been detected in yeast and mammalian cells 
. Details on the motifs and domain structure are shown in Figure S2
A third worthy of note family associated with TEs is the AIG family of proteins, comprising 29 members distributed in 3 clusters, of which 18 genes are in close proximity to repetitive elements (). AIG1 proteins are associated with resistance to bacteria 
. Interestingly, comparative gene expression studies have shown that AIG1 proteins as well as heat shock proteins have significantly reduced expression levels in E. dispar 
, when compared to E. histolytica
. This observation leads us to speculate that transposable elements inserted in the neighborhood of these genes could lead to the enhanced expression of these genes and ultimately could be related to the increased virulence. Indeed we have previously shown that LINEs and SINEs are involved in genome rearrangements driving in consequence genomic evolution 
. It is tempting to speculate that the amplification of the AIG family was mediated by the close association of TEs, but the observation that non-virulent E. dispar
contains the same number of genes without the TE association seems to indicate that this is not the case. We are currently analyzing all gene family/transposable element associations in the context of comparative genomics with other Entamoeba species (manuscript in preparation).
Close examination of the functional annotation of protein families and singleton proteins revealed that a total of 2,981 (65%) genes within the families were annotated as encoding proteins with putative functions and 1,583 genes are hypothetical proteins (34%, ). Of a total of 1,088 genes that have EST support in the whole genome, 705 are genes within protein families. In contrast, singletons had a larger proportion of hypothetical genes (76%) and a smaller portion of genes with a known or putative function (24%), and half the number of genes supported by EST evidence (383).
Segmental genome duplications
As mentioned above, about 20% of the E. histolytica
genome consists of transposable elements. These repeats show a tendency to insert close to each other forming large TE clusters. We have previously shown that these repeat clusters are frequently found at syntenic breakpoints between E. histolytica
and E. dispar
suggesting that they could contribute to parasite genome instability and, consequently, to the evolution of these species 
. It is also possible that the highly repetitive nature of this genome led to genome duplications. In order to evaluate this possibility we analyzed the presence of additional rearrangements within the genome by searching for segmental duplications using DAGchainer as explained in Methods 
. We observed the presence of four different types of segmental duplications, named D1-D4, spanning seven to ten genes each ().
Entamoeba histolytica segmental genome duplications.
The first duplication (D1, ) spans a 16.6 kb region containing up to 8 hypothetical protein coding genes. These duplications are approximately 94% identical at the nucleotide level. All D1-type duplications are flanked by 2.3 kb inverted repeats (IR) not found in the rest of the genome. Nucleotide composition analysis revealed that D1-IRs are highly AT-rich (84.3%) compared to the average content of those regions 71.4% and they are 95% identical at the nucleotide level. A genome wide survey of D1-duplications led to the identification of four complete and two partial copies of this element in the genome. It is interesting to mention that all the scaffolds containing the four complete duplications have similar sizes (16.6 kb on average) and are spanned almost in their entire length by their respective segmental duplications. The two partial D1-duplications are located in shorter scaffolds of 14.4 kb and 6.6 kb, respectively.
The second duplication (D2, ) is 12.5 kb long and contains up to eight duplicated hypothetical protein coding genes depending on the duplication. Comparative analysis showed that these duplications are more than 80% identical at the nucleotide level with an average of 92%. Similar to D1-type duplications, D2 are frequently flanked by 1.2 kb IRs, composed of two fragments derived from the TEs EhERE1 and EhLINE2. D2-IRs share 92.6% identity at the nucleotide level and are also very AT-rich (85%AT). The organization of the duplications is not conserved in all copies across the genome, with some copies flanked by IRs composed of either EhERE1 or EhLINE2 fragments, while in others we could not identify any IR.
D3-type duplications are 7.4 kb long and 83% identical at the nucleotide level. Although frequently found nearby TEs (mostly EhLINE1), none of the eight identified genome duplications are flanked by IRs as D1- and D2-type duplications. D3 presents a very unique gene content that suggest that the segment could present a unique functionality, represented in . A total of seven protein coding genes are arranged in the same orientation, and include a putative serine-threonine kinase similar to ARK1, a human protein that participates in cell cycle regulation; an endonuclease V domain-containing protein coding gene potentially involved in DNA repair; a putative secreted hypothetical protein coding gene; a tandem duplicated gene coding for a putative protein containing a type-1 glutamine amido transferase-like domain and a GDSL-like lipase/acylhydrolase domain-containing protein coding gene. Interestingly, D3-type duplications are found at or in close proximity to the end of scaffolds, and therefore, they could potentially be located at subtelomeric regions. However, in spite of a thorough analysis we could not identify any repetitive telomeric/subtelomeric motif in these regions.
Lastly, the 10 kb long D4 () shares more than 85% identity at the nucleotide level and spans up to 9 hypothetical protein coding and one putative dUTP hydrolase-coding genes. Most D4-type duplications have TEs inserted nearby, but no flanking IRs were identified.
The presence of these duplications is not likely to be an artifact of the assembly due to the fact that they are also appear duplicated in E. dispar
. It is possible that some of these duplications, that in some cases span full scaffolds represent different copies of one of the several extrachromosomal elements known to exist in Entamoeba species, as described by Dhar et al 
Our work has led to the identification of 460 novel putative protein coding genes not present in the OGA, 16% of which have some functional annotation. One of these genes codes for a putative pantothenate kinase (EHI_183060) the first enzyme in the biosynthesis of coenzyme A from pantothenate. Although the coding genomic region was present in the original assembly, the gene had not been predicted and therefore, it was missing from the previous annotation. Only the enzymes phophopantothenoyl-cysteine decarboxilase (EC 18.104.22.168), phosphopantothenoyl-cysteine synthase (EC 22.214.171.124), and dephospho-CoA kinase (EC 126.96.36.199), responsible for the second, third and last of the five consecutive enzymatic reactions, had been previously identified in the OGA (EHI_164490, EHI_092330, EHI_040840). However, the lack of candidate enzymes for the remaining two biochemical reactions of this pathway raised the question whether E. histolytica
can synthesize coenzyme A from pantothenate 
. Our de novo
gene prediction for a putative pantothenate kinase plus the identification of a candidate gene for the forth step of this pathway, a putative pantetheine-phosphate adenylyltransferase (EC 188.8.131.52), indicates that the whole set of metabolic reactions required to synthesize coenzyme A from pantothenate is present in this amoeba. Interestingly, the enzymes that participate in this pathway resemble those from eubacteria but not higher eukaryotes. Indeed, the second and third sets of reactions are catalyzed by a single enzyme present in two copies (EHI_164490, EHI_092330), while the fourth and fifth steps are carried out by independent enzymes, EHI_006680 and EHI_040840, respectively. In higher eukaryotes the last two reactions are carried out by the same enzyme 
Another gene not present in the OGA (EHI_141410) codes for a protein with a predicted molecular weight of 44.6 kDa similar to subunit p50 of the DNA polymerase delta, a key enzyme for chromosomal DNA replication in higher eukaryotes. In mammals, it has been shown that p50 is tightly associated with p125, the catalytic subunit of these types of DNA polymerases. Accordingly, a gene coding for a putative 124.4 kDa catalytic subunit of the DNA polymerase delta (EHI_006690), is also present in the NGA. These results are in agreement with a previous work showing that the sensitivity to different inhibitors of the DNA polymerase activity of E. histolytica
resembles that of mammalian DNA alpha, delta and epsilon polymerases 
In addition, a gene coding for a protein containing a Yos1-like Pfam domain is also absent from OGA (EHI_178640). This putative protein has similarity to other members of the Yos1 family, involved in protein transport between the endoplasmic reticulum and the Golgi apparatus 
Comparative analysis between the two annotation datasets also allowed us to identify genes present in their complete form in NGA but truncated in OGA. Example of these genes are two copies of a gene coding for a putative pyridine nucleotide transhydrogenase, EHI_055400 and EHI_014030, the latter identical to a gene previously cloned by Clark et al
, which exists as a single truncated copy in the OGA. Another example is a 605 bp gene coding for a putative phospholipid methyltransferase protein (EHI_153710) similar to Schizosaccharomyces pombe
cho1 (35% identity; e-value
), an enzyme that participates in the synthesis of phosphatidylcholine via the methylation of phosphatidylethanolamine. A coding sequence containing only the last 222 bp of this gene is present in the OGA.
Our reannotation effort has focused mostly on the improvement of the assembly and the gene content and structure of the E. histolytica genome. The new assembly, annotation and analysis of the genome has incorporated many updates and enhancements to the structural and functional assignments of the original gene predictions, including an improved assembly, removal of spurious genes, improved gene structures and functional assignments, and generation of gene families.
Regardless of the advancement of the computational methods and of the exponentially growing amount of data that could be used for automated genome annotation, only experimental evidence from expression data will conclusively validate the accuracy of computationally assigned functions done at the genome-wide level. Nevertheless, in order to provide a sound bases to drive research, genome annotations have to be maintained and revised, either by expert annotators in the field and/or community involvement. Additional sequence information will allow the further refinement of gene structures and a deeper understanding of the genome architecture, while the functional annotation will be enriched both by the availability of new experimental data and from expression and other kinds of analyses to characterize each gene and its function fully.
This reannotation effort will be the base for the future analysis and annotation of new E. histolytica
genomes from patient isolates, a project recently approved under the NIAID supported program Genome Sequence Centers for Infectious Disease, GSCID (http://gsc.jcvi.org/