Full-malaria (http://fullmal.ims.u-tokyo.ac.jp), a database for full-length cDNAs from the human malaria parasite, Plasmodium falciparum has been updated in at least three points. (i) We added 8934 sequences generated from the addition of new libraries, so that our collection of 11 424 full-length cDNAs covers 1375 (25%) of the estimated number of the entire 5409 parasite genes. (ii) All of our full-length cDNAs and GenBank EST sequences were mapped to genomic sequences together with publicly available annotated genes and other predictions. This precisely determined the gene structures and positions of the transcriptional start sites, which are indispensable for the identification of the promoter regions. (iii) A total of 4257 cDNA sequences were newly generated from murine malaria parasites, Plasmodium yoelii yoelii. The genome/cDNA sequences were compared at both nucleotide and amino acid levels, with those of P.falciparum, and the sequence alignment for each gene is presented graphically. This part of the database serves as a versatile platform to elucidate the function(s) of malaria genes by a comparative genomic approach. It should also be noted that all of the cDNAs represented in this database are supported by physical cDNA clones, which are publicly and freely available, and should serve as indispensable resources to explore functional analyses of malaria genomes.
Apicomplexan parasites are causative agents of various diseases including malaria and have been targets of extensive genomic sequencing. We generated 5'-EST collections for six apicomplexa parasites using our full-length oligo-capping cDNA library method. To improve upon the current genome annotations, as well as to validate the importance for physical cDNA clone resources, we generated a large-scale collection of full-length cDNAs for several apicomplexa parasites.
In this study, we used a total of 61,056 5'-end-single-pass cDNA sequences from Plasmodium falciparum, P. vivax, P. yoelii, P. berghei, Cryptosporidium parvum, and Toxoplasma gondii. We compared these partially sequenced cDNA sequences with the currently annotated gene models and observed significant inconsistencies between the two datasets. In particular, we found that on average 14% of the exons in the current gene models were not supported by any cDNA evidence, and that 16% of the current gene models may contain at least one mis-annotation and should be re-evaluated. We also identified a large number of transcripts that had been previously unidentified. For 732 cDNAs in T. gondii, the entire sequences were determined in order to evaluate the annotated gene models at the complete full-length transcript level. We found that 41% of the T. gondii gene models contained at least one inconsistency. We also identified and confirmed by RT-PCR 140 previously unidentified transcripts found in the intergenic regions of the current gene annotations. We show that the majority of these discrepancies are due to questionable predictions of one or two extra exons in the upstream or downstream regions of the genes.
Our data indicates that the current gene models are likely to still be incomplete and have much room for improvement. Our unique full-length cDNA information is especially useful for further refinement of the annotations for the genomes of apicomplexa parasites.
Motivation: The sequencing of the Plasmodium yoelii genome, a model rodent malaria parasite, has greatly facilitated research for the development of new drug and vaccine candidates against malaria. Unfortunately, only preliminary gene models were annotated on the partially sequenced genome, mostly by in silico gene prediction, and there has been no major improvement of the annotation since 2002.
Results: Here we report on a systematic assessment of the accuracy of the genome annotation based on a detailed analysis of a comprehensive set of cDNA sequences and proteomics data. We found that the coverage of the current annotation tends to be biased toward genes expressed in the blood stages of the parasite life cycle. Based on our proteomic analysis, we estimate that about 15% of the liver stage proteome data we have generated is absent from the current annotation. Through comparative analysis we identified and manually curated a further 510 P. yoelii genes which have clear orthologs in the P. falciparum genome, but were not present or incorrectly annotated in the current annotation. This study suggests that improvements of the current P. yoelii genome annotation should focus on genes expressed in stages other than blood stages. Comparative analysis will be critically helpful for this re-annotation. The addition of newly annotated genes will facilitate the use of P. yoelii as a model system for studying human malaria.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
The DNA amplification process can be a source of bias and artifacts, especially when amplifying genomic areas with extreme AT or GC content. The human malaria parasite Plasmodium falciparum has an AT-rich genome, and some of its highly AT-rich regions have been shown to be refractory to polymerase chain reaction (PCR) amplification. Biased amplification may lead to erroneous conclusions for studies investigating genome-wide gene expression, nucleosome position, and copy number variation. Here we compare genome-wide nucleosome coverage in libraries amplified at three different extension temperatures and show that reduction in PCR extension temperature from 70ºC to 60ºC can greatly increase the fraction of coverage at AT-rich regions of the P. falciparum genome. Our method will improve the efficiency and coverage in sequencing an AT-rich genome.
new generation sequencing; malaria; genome; amplification bias; nucleosome
It has been shown that nearly a quarter of the initial predicted gene models in the Plasmodium falciparum genome contain errors. Although there have been efforts to obtain complete cDNA sequences to correct the errors, the coverage of cDNA sequences on the predicted genes is still incomplete, and many gene models for those expressed in sexual or mosquito stages have not been validated. Antisense transcripts have widely been reported in P. falciparum; however, the extent and pattern of antisense transcripts in different developmental stages remain largely unknown.
We have sequenced seven bidirectional libraries from ring, early and late trophozoite, schizont, gametocyte II, gametocyte V, and ookinete, and four strand-specific libraries from late trophozoite, schizont, gametocyte II, and gametocyte V of the 3D7 parasites. Alignment of the cDNA sequences to the 3D7 reference genome revealed stage-specific antisense transcripts and novel intron-exon splicing junctions. Sequencing of strand-specific cDNA libraries suggested that more genes are expressed in one direction in gametocyte than in schizont. Alternatively spliced genes, antisense transcripts, and stage-specific expressed genes were also characterized.
It is necessary to continue to sequence cDNA from different developmental stages, particularly those of non-erythrocytic stages. The presence of antisense transcripts in some gametocyte and ookinete genes suggests that these antisense RNA may play an important role in gene expression regulation and parasite development. Future gene expression studies should make use of directional cDNA libraries. Antisense transcripts may partly explain the observed discrepancy between levels of mRNA and protein expression.
Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. However, some of GenBank dbEST sequences have proven to be “unclean”. Identification of cDNA termini/ends and their structures in raw ESTs not only facilitates data quality control and accurate delineation of transcription ends, but also furthers our understanding of the potential sources of data abnormalities/errors present in the wet-lab procedures for cDNA library construction.
After analyzing a total of 309,976 raw Pinus taeda ESTs, we uncovered many distinct variations of cDNA termini, some of which prove to be good indicators of wet-lab artifacts, and characterized each raw EST by its cDNA terminus structure patterns. In contrast to the expected patterns, many ESTs displayed complex and/or abnormal patterns that represent potential wet-lab errors such as: a failure of one or both of the restriction enzymes to cut the plasmid vector; a failure of the restriction enzymes to cut the vector at the correct positions; the insertion of two cDNA inserts into a single vector; the insertion of multiple and/or concatenated adapters/linkers; the presence of 3′-end terminal structures in designated 5′-end sequences or vice versa; and so on. With a close examination of these artifacts, many problematic ESTs that have been deposited into public databases by conventional bioinformatics pipelines or tools could be cleaned or filtered by our methodology. We developed a software tool for Abnormality Filtering and Sequence Trimming for ESTs (AFST, http://code.google.com/p/afst/) using a pattern analysis approach. To compare AFST with other pipelines that submitted ESTs into dbEST, we reprocessed 230,783 Pinus taeda and 38,709 Arachis hypogaea GenBank ESTs. We found 7.4% of Pinus taeda and 29.2% of Arachis hypogaea GenBank ESTs are “unclean” or abnormal, all of which could be cleaned or filtered by AFST.
cDNA terminal pattern analysis, as implemented in the AFST software tool, can be utilized to reveal wet-lab errors such as restriction enzyme cutting abnormities and chimeric EST sequences, detect various data abnormalities embedded in existing Sanger EST datasets, improve the accuracy of identifying and extracting bona fide cDNA inserts from raw ESTs, and therefore greatly benefit downstream EST-based applications.
cDNA terminus; cDNA library construction; Pattern analysis; Restriction enzyme cutting abnormality; Chimeric EST sequences
The Plasmodium falciparum genome being AT-rich, the presence of GC-rich regions suggests functional significance. Evolution imposes selection pressure to retain functionally important coding and regulatory elements. Hence searching for evolutionarily conserved GC-rich, intergenic regions in an AT-rich genome will help in discovering new coding regions and regulatory elements. We have used elevated GC content in intergenic regions coupled with sequence conservation against P. reichenowi, which is evolutionarily closely related to P. falciparum to identify potential sequences of functional importance. Interestingly, ~30% of the GC-rich, conserved sequences were associated with antigenic proteins encoded by var and rifin genes. The majority of sequences identified in the 5′ UTR of var genes are represented by short expressed sequence tags (ESTs) in cDNA libraries signifying that they are transcribed in the parasite. Additionally, 19 sequences were located in the 3′ UTR of rifins and 4 also have overlapping ESTs. Further analysis showed that several sequences associated with var genes have the capacity to encode small peptides. A previous report has shown that upstream peptides can regulate the expression of var genes hence we propose that these conserved GC-rich sequences may play roles in regulation of gene expression.
Plasmodium; regulatory elements; comparative genomics; genome bias; antigenic variation
Malaria is the most important parasitic disease in the world with approximately two million people dying every year, mostly due to Plasmodium falciparum infection. During its complex life cycle in the Anopheles vector and human host, the parasite requires the coordinated and modulated expression of diverse sets of genes involved in epigenetic, transcriptional and post-transcriptional regulation. However, despite the availability of the complete sequence of the Plasmodium falciparum genome, we are still quite ignorant about Plasmodium mechanisms of transcriptional gene regulation. This is due to the poor prediction of nuclear proteins, cognate DNA motifs and structures involved in transcription.
A comprehensive directory of proteins reported to be potentially involved in Plasmodium transcriptional machinery was built from all in silico reports and databanks. The transcription-associated proteins were clustered in three main sets of factors: general transcription factors, chromatin-related proteins (structuring, remodelling and histone modifying enzymes), and specific transcription factors. Only a few of these factors have been molecularly analysed. Furthermore, from transcriptome and proteome data we modelled expression patterns of transcripts and corresponding proteins during the intra-erythrocytic cycle. Finally, an interactome of these proteins based either on in silico or on 2-yeast-hybrid experimental approaches is discussed.
This is the first attempt to build a comprehensive directory of potential transcription-associated proteins in Plasmodium. In addition, all complete transcriptome, proteome and interactome raw data were re-analysed, compared and discussed for a better comprehension of the complex biological processes of Plasmodium falciparum transcriptional regulation during the erythrocytic development.
The variant surface antigens expressed on Plasmodium falciparum–infected erythrocytes are potentially important targets of immunity to malaria and are encoded, at least in part, by a family of var genes, about 60 of which are present within every parasite genome. Here we use semi-conserved regions within short var gene sequence “tags” to make direct comparisons of var gene expression in 12 clinical parasite isolates from Kenyan children. A total of 1,746 var clones were sequenced from genomic and cDNA and assigned to one of six sequence groups using specific sequence features. The results show the following. (1) The relative numbers of genomic clones falling in each of the sequence groups was similar between parasite isolates and corresponded well with the numbers of genes found in the genome of a single, fully sequenced parasite isolate. In contrast, the relative numbers of cDNA clones falling in each group varied considerably between isolates. (2) Expression of sequences belonging to a relatively conserved group was negatively associated with the repertoire of variant surface antigen antibodies carried by the infected child at the time of disease, whereas expression of sequences belonging to another group was associated with the parasite “rosetting” phenotype, a well established virulence determinant. Our results suggest that information on the state of the host–parasite relationship in vivo can be provided by measurements of the differential expression of different var groups, and need only be defined by short stretches of sequence data.
Hope that it will be possible to develop a malaria vaccine is supported by the fact that individuals who have grown up in malaria endemic regions learn to carry malarial infections without suffering disease. Surprisingly little is still known about how this immunity develops. Much current research focuses on how the host develops immune responses to parasite antigens that are exposed to the host immune system. A major family of such antigens are inserted into the surface of parasite-infected erythrocytes, where they undergo antigenic switching to evade a developing antibody response. These proteins are encoded by a family of approximately 60 var genes, variants of which are present in every parasite genome.
The extreme diversity of the var genes has prevented meaningful comparison of their expression in clinical isolates. However, the authors of this paper show that var genes can be placed in groups that have a similar representation in the genomes of all parasites that the authors collected from Kenyan children. Having demonstrated an underlying similarity at the genomic level, the authors show that the var expression patterns vary markedly between different patients. The expression levels of specific groups of var genes was associated with poorly developed antibody responses in the children and a well-established parasite virulence phenotype. The study provides tools for exploring how host and parasite adapt to one another as immunity develops.
Plasmodium vivax is the most widely distributed human malaria, responsible for 70–80 million clinical cases each year and large socio-economical burdens for countries such as Brazil where it is the most prevalent species. Unfortunately, due to the impossibility of growing this parasite in continuous in vitro culture, research on P. vivax remains largely neglected.
A pilot survey of expressed sequence tags (ESTs) from the asexual blood stages of P. vivax was performed. To do so, 1,184 clones from a cDNA library constructed with parasites obtained from 10 different human patients in the Brazilian Amazon were sequenced. Sequences were automatedly processed to remove contaminants and low quality reads. A total of 806 sequences with an average length of 586 bp met such criteria and their clustering revealed 666 distinct events. The consensus sequence of each cluster and the unique sequences of the singlets were used in similarity searches against different databases that included P. vivax, Plasmodium falciparum, Plasmodium yoelii, Plasmodium knowlesi, Apicomplexa and the GenBank non-redundant database. An E-value of <10-30 was used to define a significant database match. ESTs were manually assigned a gene ontology (GO) terminology
A total of 769 ESTs could be assigned a putative identity based upon sequence similarity to known proteins in GenBank. Moreover, 292 ESTs were annotated and a GO terminology was assigned to 164 of them.
These are the first ESTs reported for P. vivax and, as such, they represent a valuable resource to assist in the annotation of the P. vivax genome currently being sequenced. Moreover, since the GC-content of the P. vivax genome is strikingly different from that of P. falciparum, these ESTs will help in the validation of gene predictions for P. vivax and to create a gene index of this malaria parasite.
FULL-malaria is a database for a full-length-enriched cDNA library
from the human malaria parasite Plasmodium falciparum (http://18.104.22.168/).
Because of its medical importance, this organism is the first target for
genome sequencing of a eukaryotic pathogen; the sequences of two
of its 14 chromosomes have already been determined. However, for
the full exploitation of this rapidly accumulating information, correct
identification of the genes and study of their expression are essential.
Using the oligo-capping method, we have produced a full-length-enriched cDNA
library from erythrocytic stage parasites and performed one-pass
reading. The database consists of nucleotide sequences of 2490 random
clones that include 390 (16%) known malaria genes according
to BLASTN analysis of the nr-nt database in GenBank; these represent
98 genes, and the clones for 48 of these genes contain the complete
protein-coding sequence (49%). On the other hand, comparisons
with the complete chromosome 2 sequence revealed that 35 of 210
predicted genes are expressed, and in addition led to detection
of three new gene candidates that were not previously known. In
total, 19 of these 38 clones (50%) were full-length. From
these observations, it is expected that the database contains ∼1000 genes, including
500 full-length clones. It should be an invaluable resource for
the development of vaccines and novel drugs.
The development of Plasmodium falciparum within human erythrocytes induces a wide array of changes in the ultrastructure, function and antigenic properties of the host cell. Numerous proteins encoded by the parasite have been shown to interact with the erythrocyte membrane. The identification of new interactions between human erythrocyte and P. falciparum proteins has formed a key area of malaria research. To circumvent the difficulties provided by conventional protein techniques, a novel application of the phage display technology was utilised.
P. falciparum phage display libraries were created and biopanned against purified erythrocyte membrane proteins. The identification of interacting and in-frame amino acid sequences was achieved by sequencing parasite cDNA inserts and performing bioinformatic analyses in the PlasmoDB database.
Following four rounds of biopanning, sequencing and bioinformatic investigations, seven P. falciparum proteins with significant binding specificity toward human erythrocyte spectrin and protein 4.1 were identified. The specificity of these P. falciparum proteins were demonstrated by the marked enrichment of the respective in-frame binding sequences from a fourth round phage display library.
The construction and biopanning of P. falciparum phage display expression libraries provide a novel approach for the identification of new interactions between the parasite and the erythrocyte membrane.
The Plasmodium falciparum genome (3D7 strain) published in 2002, revealed ~5,400 genes, mostly based on in silico predictions. Experimental data is therefore required for structural and functional assessments of P. falciparum genes and expression, and polymorphic data are further necessary to exploit genomic information to further qualify therapeutic target candidates. Here, we undertook a large scale analysis of a P. falciparum FcB1-schizont-EST library previously constructed by suppression subtractive hybridization (SSH) to study genes expressed during merozoite morphogenesis, with the aim of: 1) obtaining an exhaustive collection of schizont specific ESTs, 2) experimentally validating or correcting P. falciparum gene models and 3) pinpointing genes displaying protein polymorphism between the FcB1 and 3D7 strains.
A total of 22,125 clones randomly picked from the SSH library were sequenced, yielding 21,805 usable ESTs that were then clustered on the P. falciparum genome. This allowed identification of 243 protein coding genes, including 121 previously annotated as hypothetical. Statistical analysis of GO terms, when available, indicated significant enrichment in genes involved in "entry into host-cells" and "actin cytoskeleton". Although most ESTs do not span full-length gene reading frames, detailed sequence comparison of FcB1-ESTs versus 3D7 genomic sequences allowed the confirmation of exon/intron boundaries in 29 genes, the detection of new boundaries in 14 genes and identification of protein polymorphism for 21 genes. In addition, a large number of non-protein coding ESTs were identified, mainly matching with the two A-type rRNA units (on chromosomes 5 and 7) and to a lower extent, two atypical rRNA loci (on chromosomes 1 and 8), TARE subtelomeric regions (several chromosomes) and the recently described telomerase RNA gene (chromosome 9).
This FcB1-schizont-EST analysis confirmed the actual expression of 243 protein coding genes, allowing the correction of structural annotations for a quarter of these sequences. In addition, this analysis demonstrated the actual transcription of several remarkable non-protein coding loci: 2 atypical rRNA, TARE region and telomerase RNA gene. Together with other collections of P. falciparum ESTs, usually generated from mixed parasite stages, this collection of FcB1-schizont-ESTs provides valuable data to gain further insight into the P. falciparum gene structure, polymorphism and expression.
A large collection of full-length cDNAs is essential for the correct annotation of genomic sequences and for the functional analysis of genes and their products. We obtained a total of 39 936 soybean cDNA clones (GMFL01 and GMFL02 clone sets) in a full-length-enriched cDNA library which was constructed from soybean plants that were grown under various developmental and environmental conditions. Sequencing from 5′ and 3′ ends of the clones generated 68 661 expressed sequence tags (ESTs). The EST sequences were clustered into 22 674 scaffolds involving 2580 full-length sequences. In addition, we sequenced 4712 full-length cDNAs. After removing overlaps, we obtained 6570 new full-length sequences of soybean cDNAs so far. Our data indicated that 87.7% of the soybean cDNA clones contain complete coding sequences in addition to 5′- and 3′-untranslated regions. All of the obtained data confirmed that our collection of soybean full-length cDNAs covers a wide variety of genes. Comparative analysis between the derived sequences from soybean and Arabidopsis, rice or other legumes data revealed that some specific genes were involved in our collection and a large part of them could be annotated to unknown functions. A large set of soybean full-length cDNA clones reported in this study will serve as a useful resource for gene discovery from soybean and will also aid a precise annotation of the soybean genome.
EST; full-length cDNA; functional annotation; legume; soybean
Using cDNA microarray technology 12 new female-associated and 4 new male-associated gene transcripts were identified in mature adult Schistosoma mansoni parasites.
Parasitic helminths of the genus Schistosoma mate, achieve sexual maturity and produce eggs in the bloodstream of their definitive hosts, and the most important pathological consequences of the infection are associated with this process. We have used cDNA microarray technology to initiate genome-wide gene-expression studies of sex and sexual development in mature Schistosoma mansoni parasites.
An S. mansoni-specific cDNA microarray was fabricated using 576 expressed sequence tags selected from three cDNA libraries and originating from two different parasite developmental stages. Five independent cDNA microarray hybridizations were analyzed using stringent filtering criteria and careful quality control, leading to the identification of 12 new female-associated and 4 new male-associated gene transcripts in the mature adult schistosome. Statistical analysis of variation demonstrated high levels of agreement within a cDNA microarray (correlation coefficient 0.91; median coefficient of variation 11.1%) and between cDNA microarrays (correlation coefficient 0.90; median coefficient of variation 14.4%). RT-PCR analysis confirmed the cDNA microarray results, thereby supporting the reliability of the system.
Our study expands the list of S. mansoni gender-associated gene transcripts from all previous studies by a factor of two. Among the new associations identified, a tyrosinase ortholog was preferentially expressed in the adult female, and a dynein light-chain ortholog was highly induced in the adult male. cDNA microarrays offer the potential for exponential leaps in the understanding of parasite biology and this study shows how molecules involved in sexual biology can be rapidly identified.
Almost a decade after the publication of the complete sequence of the genome of the human malaria parasite Plasmodium falciparum, the mechanisms involved in gene regulation remain poorly understood. Like other eukaryotic organisms, P. falciparum’s genomic DNA organizes into nucleosomes. Nucleosomes are the basic structural units of eukaryotic chromatin and their regulation is known to play a key role in regulation of gene expression. Despite its importance, the relationship between nucleosome positioning and gene regulation in the malaria parasite has only been investigated recently. Using two independent and complementary techniques followed by next-generation high-throughput sequencing, our laboratory recently generated a dynamic atlas of nucleosome-bound and nucleosome-free regions (NFRs) at single-nucleotide resolution throughout the parasite erythrocytic cycle. We have found evidences that genome-wide changes in nucleosome occupancy play a critical role in controlling the rigorous parasite replication in infected red blood cells. However, the role of nucleosome positioning at remarkable locations such as transcriptional start sites (TSS) was not investigated. Here we show that a study of NFR in experimentally determined TSS and in silico-predicted promoters can provide deeper insights of how a transcriptionally permissive organization of chromatin can control the parasite’s progression through its life cycle. We find that NFRs found at TSS and core promoters are strongly associated with high levels of gene expression in asexual erythrocytic stages, whereas nucleosome-bound TSSs and promoters are associated with silent genes preferentially expressed in sexual stages. The implications in terms of regulatory evolution, adaptation of gene expression and their impact in the design of antimalarial strategies are discussed.
Nucleosome; Malaria; TSS; Promoter; Transcription; Virulence; Evolution; MAINE-seq; FAIRE-seq
Alternative splicing contributes significantly to the complexity of the human transcriptome and proteome. Computational prediction of alternative splice isoforms are usually based on EST sequences that also allow to approximate the expression pattern of the related transcripts. However, the limited number of tissues represented in the EST data as well as the different cDNA construction protocols may influence the predictive capacity of ESTs to unravel tissue-specifically expressed transcripts.
We predict tissue and tumor specific splice isoforms based on the genomic mapping (SpliceNest) of the EST consensus sequences and library annotation provided in the GeneNest database. We further ascertain the potentially rare tissue specific transcripts as the ones represented only by ESTs derived from normalized libraries. A subset of the predicted tissue and tumor specific isoforms are then validated via RT-PCR experiments over a spectrum of 40 tissue types.
Our strategy revealed 427 genes with at least one tissue specific transcript as well as 1120 genes showing tumor specific isoforms. While our experimental evaluation of computationally predicted tissue-specific isoforms revealed a high success rate in confirming the expression of these isoforms in the respective tissue, the strategy frequently failed to detect the expected restricted expression pattern. The analysis of putative lowly expressed transcripts using normalized cDNA libraries suggests that our ability to detect tissue-specific isoforms strongly depends on the expression level of the respective transcript as well as on the sensitivity of the experimental methods. Especially splice isoforms predicted to be disease-specific tend to represent transcripts that are expressed in a set of healthy tissues rather than novel isoforms.
We propose to combine the computational prediction of alternative splice isoforms with experimental validation for efficient delineation of an accurate set of tissue-specific transcripts.
Rodent malaria parasites, such as Plasmodium berghei, are practical and useful model organisms for human malaria research because of their analogies to the human malaria in terms of structure, physiology, and life cycle. Exploiting the available genetic sequence information, we constructed a cDNA library from the erythrocytic stages of P. berghei and analyzed the expressed sequence tag (EST). A total of 10,040 ESTs were generated and assembled into 2,462 clusters. These EST clusters were compared against public protein databases and 48 putative new transcripts, most of which were hypothetical proteins with unknown function, were identified. Genes encoding ribosomal or membrane proteins and purine nucleotide phosphorylases were highly abundant clusters in P. berghei. Protein domain analyses and the Gene Ontology functional categorization revealed translation/protein folding, metabolism, protein degradation, and multiple family of variant antigens to be mainly prevalent. The presently-collected ESTs and its bioinformatic analysis will be useful resources to identify for drug target and vaccine candidates and validate gene predictions of P. berghei.
Plasmodium berghei; expressed sequence tag; rodent malaria
The generation of expressed sequence tag (EST) libraries offers an affordable approach to investigate organisms, if no genome sequence is available. OREST (http://mips.gsf.de/genre/proj/orest/index.html) is a server-based EST analysis pipeline, which allows the rapid analysis of large amounts of ESTs or cDNAs from mammalia and fungi. In order to assign the ESTs to genes or proteins OREST maps DNA sequences to reference datasets of gene products and in a second step to complete genome sequences. Mapping against genome sequences recovers additional 13% of EST data, which otherwise would escape further analysis. To enable functional analysis of the datasets, ESTs are functionally annotated using the hierarchical FunCat annotation scheme as well as GO annotation terms. OREST also allows to predict the association of gene products and diseases by Morbid Map (OMIM) classification. A statistical analysis of the results of the dataset is possible with the included PROMPT software, which provides information about enrichment and depletion of functional and disease annotation terms. OREST was successfully applied for the identification and functional characterization of more than 3000 EST sequences of the common marmoset monkey (Callithrix jacchus) as part of an international collaboration.
Common bean is an important legume crop with only a moderate number of short expressed sequence tags (ESTs) made with traditional methods. The goal of this research was to use full-length cDNA technology to develop ESTs that would overlap with the beginning of open reading frames and therefore be useful for gene annotation of genomic sequences. The library was also constructed to represent genes expressed under drought, low soil phosphorus and high soil aluminum toxicity. We also undertook comparisons of the full-length cDNA library to two previous non-full clone EST sets for common bean.
Two full-length cDNA libraries were constructed: one for the drought tolerant Mesoamerican genotype BAT477 and the other one for the acid-soil tolerant Andean genotype G19833 which has been selected for genome sequencing. Plants were grown in three soil types using deep rooting cylinders subjected to drought and non-drought stress and tissues were collected from both roots and above ground parts. A total of 20,000 clones were selected robotically, half from each library. Then, nearly 10,000 clones from the G19833 library were sequenced with an average read length of 850 nucleotides. A total of 4,219 unigenes were identified consisting of 2,981 contigs and 1,238 singletons. These were functionally annotated with gene ontology terms and placed into KEGG pathways. Compared to other EST sequencing efforts in common bean, about half of the sequences were novel or represented the 5' ends of known genes.
The present full-length cDNA libraries add to the technological toolbox available for common bean and our sequencing of these clones substantially increases the number of unique EST sequences available for the common bean genome. All of this should be useful for both functional gene annotation, analysis of splice site variants and intron/exon boundary determination by comparison to soybean genes or with common bean whole-genome sequences. In addition the library has a large number of transcription factors and will be interesting for discovery and validation of drought or abiotic stress related genes in common bean.
The intraerythrocytic development of Plasmodium falciparum, the most virulent human malaria parasite involves asexual and gametocyte stages. There has been a significant increase in disparate datasets derived from genomic and post-genomic analysis of the parasite that necessitates delivery of integrated analysis from which biological processes important to the survival of the parasite can be determined.
In order to resolve genes associated with stage differentially expressed transcripts, we have developed and implemented an integrative approach that combines evidence from P. falciparum expressed sequence tags (ESTs), genomic, microarray, proteomic and gene ontology data.
A total of 143 gametocyte-overexpressed and 51 asexual-overexpressed transcripts were identified. A subset of 74 genes associated with these transcripts showed evidence of stage-correlated protein expression, of which 53 have not been experimentally characterised. Our study has revealed (1) possible regulatory mechanisms in malaria parasites' gametocyte maturation, (2) correlation between EST and microarray data for a P. falciparum gene family to present unique EST-derived information, (3) candidate drug and antigenic targets on which computational and experimental studies can be performed, and (4) the need for more empirical studies on gene and protein expression in malaria parasites.
Applying different domains of data to the same underlying gene set has yielded novel insights into the biology of the parasite and presents an approach to appraise critically the data quality of post-genomic datasets from malaria parasites.
Full-Parasites (http://fullmal.hgc.jp/) is a transcriptome database of apicomplexa parasites, which include Plasmodium and Toxoplasma species. The latest version of Full-Parasites contains a total of 105 786 EST sequences from 12 parasites, of which 5925 full-length cDNAs have been completely sequenced. Full-Parasites also contain more than 30 million transcription start sites (TSS) for Plasmodium falciparum (Pf) and Toxoplasma gondii (Tg), which were identified using our novel oligo-capping-based protocol. Various types of cDNA data resources were interconnected with our original database functionalities. Specifically, in this update, we have included two unique RNA-Seq data sets consisting of 730 million mapped RNA-Seq tags. One is a dataset of 16 time-lapse experiments of cultured bradyzoite differentiation for Tg. The other dataset includes 31 clinical samples of Pf. Parasite RNA was extracted together with host human RNA, and the extracted mixed RNA was used for RNA sequencing, with the expectation that gene expression information from the host and parasite would be simultaneously represented. By providing the largest unique full-length cDNA and dynamic transcriptome data, Full-Parasites is useful for understanding host–parasite interactions and will help to eventually elucidate how monophyletic organisms have evolved to become parasites by adopting complex life cycles.
The guppy, Poecilia reticulata, is a well-known model organism for studying inheritance and variation of male ornamental traits as well as adaptation to different river habitats. However, genomic resources for studying this important model were not previously widely available.
With the aim of generating molecular markers for genetic mapping of the guppy, cDNA libraries were constructed from embryos and different adult organs to generate expressed sequence tags (ESTs). About 18,000 ESTs were annotated according to BLASTN and BLASTX results and the sequence information from the 3' UTRs was exploited to generate PCR primers for re-sequencing of genomic DNA from different wild type strains. By comparison of EST-linked genomic sequences from at least four different ecotypes, about 1,700 polymorphisms were identified, representing about 400 distinct genes. Two interconnected MySQL databases were built to organize the ESTs and markers, respectively. A robust phylogeny of the guppy was reconstructed, based on 10 different nuclear genes.
Our EST and marker databases provide useful tools for genetic mapping and phylogenetic studies of the guppy.
cDNA cloning is a central technology in molecular biology. cDNA sequences are used to determine mRNA transcript structures, including splice junctions, open reading frames (ORFs) and 5′- and 3′-untranslated regions (UTRs). cDNA clones are valuable reagents for functional studies of genes and proteins. Expressed Sequence Tag (EST) sequencing is the method of choice for recovering cDNAs representing many of the transcripts encoded in a eukaryotic genome. However, EST sequencing samples a cDNA library at random, and it recovers transcripts with low expression levels inefficiently. We describe a PCR-based method for directed screening of plasmid cDNA libraries. We demonstrate its utility in a screen of libraries used in our Drosophila EST projects for 153 transcription factor genes that were not represented by full-length cDNA clones in our Drosophila Gene Collection. We recovered high-quality, full-length cDNAs for 72 genes and variously compromised clones for an additional 32 genes. The method can be used at any scale, from the isolation of cDNA clones for a particular gene of interest, to the improvement of large gene collections in model organisms and the human. Finally, we discuss the relative merits of directed cDNA library screening and RT–PCR approaches.
It is well known that the nutritional quality of the American oil-palm (Elaeis oleifera) mesocarp oil is superior to that of African oil-palm (Elaeis
guineensis Jacq. Tenera) mesocarp oil. Therefore, it is of important to identify the genetic features for its superior value. This could be achieved through
the genome sequencing of the oil-palm. However, the genome sequence is not available in the public domain due to commercial secrecy. Hence, we
constructed a cDNA library and generated expressed sequence tags (3,205) from the mesocarp tissue of the American oil-palm. We continued to annotate
each of these cDNAs after submitting to GenBank/DDBJ/EMBL. A rough analysis turned our attention to the beta-carotene hydroxylase (Chyb) enzyme
encoding cDNA. Then, we completed the full sequencing of cDNA clone for its both strands using M13 forward and reverse primers. The full nucleotide
and protein sequence was further analyzed and annotated using various Bioinformatics tools. The analysis results showed the presence of fatty acid
hydroxylase superfamily domain in the protein sequence. The multiple sequence alignment of selected Chyb amino acid sequences from other plant
species and algal members with E. oleifera Chyb using ClustalW and its phylogenetic analysis suggest that Chyb from monocotyledonous plant species,
Lilium hubrid, Crocus sativus and Zea mays are the most evolutionary related with E. oleifera Chyb. This study reports the annotation of E. oleifera Chyb.
ESTs - expressed sequence tags,
EoChyb - Elaeis oleifera beta-carotene hydroxylase,
MC - main cluster
African oil-palm; American oil-palm; fatty acids; fatty acid hydroxylase; oleic acid; sterol desaturase; zeaxanthin