Apicomplexa is a phylum of protozoan parasites that infect both humans and animals, causing serious health problems worldwide.
Plasmodium falciparum (Pf) and Plasmodium vivax (Pv), for example, cause malaria, which kills over a million people every year [1, 2].
Toxoplasma gondii (Tg) infects one third of the entire human population, causing brain and eye defects in the unborn fetuses of infected women [3].
Cryptosporidium parvum (Cp) infects humans and other warm-blooded animals, causing severe diarrhea [4]. Genome sequencing projects for at least 15 species of apicomplexa, including several Plasmodium species [5-7], two Theileria species [8, 9], Babesia bovis [10], Cp [11] and Tg, have been carried out during the last decade.
The resulting genomic sequences have been analyzed, revealing that even though the apicomplexan parasites are believed to have been derived from a common ancestor, their genome sizes and compositions vary widely. The Cp genome is only 9.1 Mb, with only 5% of its genes containing introns, a proportion which nearly parallels that of the Saccharomyces cerevisiae genome [11]. The Tg genome, by contrast, is 65 Mb, averages 4.1 introns per gene, and has a G+C content of 52% [3]; whereas the Pf genome is 23 Mb, and is extremely A+T rich, having a G+C content of just 19% [7]. Respective genome information for each of these species has been made publicly available in one or more of the following databases: PlasmoDB [12-14], CryptoDB [15-17], ToxoDB [18, 19], EuPathDB [20], and GeneDB [21].
Accurately annotated genomes are important tools for elucidating the genetic basis of parasitism in apicomplexa. Such genetic knowledge will form the basis for developing drugs and identifying potential vaccine candidates against these parasites. However, the quality of the accumulated genomic data is currently insufficient for these purposes. The genomic sequences of Plasmodium yoelii (Py) and Plasmodium berghei (Pb) are still very incomplete, consisting of numerous short contigs (the N50 contig lengths of Py and Pb are only 7.7 kb and 2.8 kb, respectively, where N50 is the length such that 50% of the assembled genome lies in contigs of that length or longer). An even more serious issue is that the genome-associated gene annotations (gene models) appear to be imperfect. Even the annotation of the well-studied P. falciparum genome has recently been reported to contain many errors [22]. Because the structures of the genomes and genes differ greatly from species to species, it is difficult to make precise, uniform gene predictions using computational methods such as GENSCAN [23] or GlimmerM [24]. Therefore, experimental evidence, such as cDNA sequences, is extremely important and should be more intensively collected and taken into consideration for annotation purposes.
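As an illustration of the N50 statistic defined above, the following minimal sketch computes it from a list of contig lengths. The function name `n50` and the example lengths are hypothetical and are not taken from the Py or Pb assemblies; the code assumes only the definition given in the text.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the largest length L such that contigs of length >= L
    together contain at least 50% of the assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating bases,
    # and stop at the contig that brings the sum to half of the total.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in kb), for illustration only.
example_lengths = [10, 8, 6, 4, 2]
print(n50(example_lengths))
```

A low N50, as reported for Py and Pb, indicates that half of the assembled sequence sits in short contigs, which in turn means that many genes are likely to be split across contig boundaries.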
We previously developed a method, called oligo-capping, for constructing full-length cDNA libraries and have used it to collect full-length cDNAs from numerous organisms [25]. These cDNA sequences have been published online in two databases: Full-Parasites and Comparasite [26]. Full-Parasites [27] contains 5'-end single-pass-read expressed sequence tags (5'-ESTs) for the Pf, Pv, Py, Pb, Cp and Tg genomes, and for the tapeworm Echinococcus multilocularis [28]. Comparasite [29] is an integrated database containing the transcriptomes of the same six apicomplexan species [26]. In it, homologous gene groups are clustered and any combination of these species can be comparatively analyzed. While analyzing the cDNA data in these databases, we noticed significant inconsistencies between our cDNA annotations and those of the publicly available annotated genes.
In this study, we first analyzed 61,056 5'-end partially sequenced cDNAs that were isolated from full-length cDNA libraries of six apicomplexan parasites. We found that a significant number of current gene models contain inconsistencies and therefore should be re-evaluated. To evaluate the gene models at the complete sequence level, we completely sequenced 732 full-length Tg cDNAs and drew the same conclusion. In addition, we found that the possible errors in the publicly available annotations were largely due to overprediction of exons. Here we report the first large-scale systematic evaluation of the current genomic annotation of apicomplexan parasites based on our unique full-length cDNA data.