Results 1-17 (17)

1.  Standardized Metadata for Human Pathogen/Vector Genomic Sequences 
PLoS ONE  2014;9(6):e99979.
High throughput sequencing has accelerated the determination of genome sequences for thousands of human infectious disease pathogens and dozens of their vectors. The scale and scope of these data are enabling genotype-phenotype association studies to identify genetic determinants of pathogen virulence and drug/insecticide resistance, and phylogenetic studies to track the origin and spread of disease outbreaks. To maximize the utility of genomic sequences for these purposes, it is essential that metadata about the pathogen/vector isolate characteristics be collected and made available in organized, clear, and consistent formats. Here we report the development of the GSCID/BRC Project and Sample Application Standard, developed by representatives of the Genome Sequencing Centers for Infectious Diseases (GSCIDs), the Bioinformatics Resource Centers (BRCs) for Infectious Diseases, and the U.S. National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), informed by interactions with numerous collaborating scientists. It includes mapping to terms from other data standards initiatives, including the Genomic Standards Consortium’s minimal information (MIxS) and NCBI’s BioSample/BioProjects checklists and the Ontology for Biomedical Investigations (OBI). The standard includes data fields about characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the pathogen/vector isolated, and project leadership and support. By modeling metadata fields into an ontology-based semantic framework and reusing existing ontologies and minimum information checklists, the application standard can be extended to support additional project-specific data fields and integrated with other data represented with comparable standards. 
The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.
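As a rough illustration of the four field groups the abstract lists (specimen source, isolation event, phenotype, project leadership), a sample record and a completeness check might be sketched as follows. Every field name here is a hypothetical placeholder chosen for illustration; none is taken from the GSCID/BRC standard itself, and the example values are invented.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SampleMetadata:
    # Hypothetical field names; the real standard defines its own terms.
    specimen_source_organism: Optional[str] = None  # organism or environmental source
    collection_date: Optional[str] = None           # isolation event, when (ISO 8601)
    collection_country: Optional[str] = None        # isolation event, where
    isolate_phenotype: Optional[str] = None         # e.g. a drug-resistance call
    project_lead: Optional[str] = None              # project leadership and support


def missing_fields(record: SampleMetadata) -> List[str]:
    """Return the names of fields that are unset or empty, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]


# Invented example values, for illustration only.
record = SampleMetadata(
    specimen_source_organism="Homo sapiens",
    collection_date="2013-05-01",
    collection_country="Kenya",
)
print(missing_fields(record))
```

A check of this kind is one simple way a repository could flag incomplete submissions before they are integrated with other records.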
PMCID: PMC4061050  PMID: 24936976
2.  A Potentially Functional Mariner Transposable Element in the Protist Trichomonas vaginalis 
Molecular biology and evolution  2004;22(1):126-134.
Mariner transposable elements encoding a D,D34D motif-bearing transposase are characterized by their pervasiveness among, and exclusivity to, animal phyla. To date several hundred sequences have been obtained from taxa ranging from cnidarians to humans, only two of which are known to be functional. Related transposons have been identified in plants and fungi, but their absence among protists is noticeable. Here, we identify and characterize Tvmar1, the first representative of the mariner family to be found in a species of protist, the human parasite Trichomonas vaginalis. This is the first D,D34D element to be found outside the animal kingdom, and its inclusion in the mariner family is supported by both structural and phylogenetic analyses. Remarkably, Tvmar1 has all the hallmarks of a functional element, and has recently expanded to several hundred copies in the genome of T. vaginalis. Our results show that a new potentially active mariner has been found, which belongs to a distinct mariner lineage, and has successfully invaded a non-animal, single-celled organism. The considerable genetic distance between Tvmar1 and other mariners may have valuable implications for the design of new, high-efficiency vectors to be used in transfection studies in protists.
PMCID: PMC1406841  PMID: 15371525
transposon; mariner; protist; parabasalid; Trichomonas; vaginalis
3.  A Novel Clade of Unique Eukaryotic Ribonucleotide Reductase R2 Subunits is Exclusive to Apicomplexan Parasites 
Apicomplexa are protist parasites of tremendous medical and economic importance, causing millions of deaths and billions of dollars in losses each year. Apicomplexan-related diseases may be controlled via inhibition of essential enzymes. Ribonucleotide reductase (RNR) provides the only de novo means of synthesizing deoxyribonucleotides, essential precursors for DNA replication and repair. RNR has long been the target of antibacterial and antiviral therapeutics. However, targeting this ubiquitous protein in eukaryotic pathogens may be problematic unless these proteins differ significantly from that of their respective host. The typical eukaryotic RNR enzymes belong to class Ia, and the holoenzyme consists minimally of two R1 and two R2 subunits (α2β2). We generated a comparative, annotated, structure-based, multiple-sequence alignment of R2 subunits, identified a clade of R2 subunits unique to Apicomplexa, and determined its phylogenetic position. Our analyses revealed that the apicomplexan-specific sequences share characteristics with both class I R2 and R2lox proteins. The putative radical-harboring residue, essential for the reduction reaction by class Ia R2-containing holoenzymes, was not conserved within this group. Phylogenetic analyses suggest that class Ia subunits are not monophyletic and consistently placed the apicomplexan-specific clade sister to the remaining class Ia eukaryote R2 subunits. Our research suggests that the novel apicomplexan R2 subunit may be a promising candidate for chemotherapeutic-induced inhibition as it differs greatly from known eukaryotic host RNRs and may be specifically targeted.
Electronic supplementary material
The online version of this article (doi:10.1007/s00239-013-9583-y) contains supplementary material, which is available to authorized users.
PMCID: PMC3824934  PMID: 24046025
Ribonucleotide reductase; RNR; Apicomplexa; Structure-based amino acid alignment; Paralog
4.  Whole Genome Mapping and Re-Organization of the Nuclear and Mitochondrial Genomes of Babesia microti Isolates 
PLoS ONE  2013;8(9):e72657.
Babesia microti is the primary causative agent of human babesiosis, an emerging pathogen that causes a malaria-like illness with possible fatal outcome in immunocompromised patients. The genome sequence of the B. microti R1 strain was reported in 2012 and revealed a distinct evolutionary path for this pathogen relative to that of other apicomplexa. Lacking from the first genome assembly and initial molecular analyses was information about the terminal ends of each chromosome, and both the exact number of chromosomes in the nuclear genome and the organization of the mitochondrial genome remained ambiguous. We have now performed various molecular analyses to characterize the nuclear and mitochondrial genomes of the B. microti R1 and Gray strains and generated high-resolution Whole Genome maps. These analyses show that the genome of B. microti consists of four nuclear chromosomes and a linear mitochondrial genome present in four different structural types. Furthermore, Whole Genome mapping allowed resolution of the chromosomal ends, identification of areas of misassembly in the R1 genome, and genomic differences between the R1 and Gray strains, which occur primarily in the telomeric regions. These studies set the stage for a better understanding of the evolution and diversity of this important human pathogen.
PMCID: PMC3762879  PMID: 24023759
5.  Phylogeographical footprint of colonial history in the global dispersal of human immunodeficiency virus type 2 group A 
The Journal of General Virology  2012;93(Pt 4):889-899.
Human immunodeficiency virus type 2 (HIV-2) emerged in West Africa and has spread further to countries that share socio-historical ties with this region. However, viral origins and dispersal patterns at a global scale remain poorly understood. Here, we adopt a Bayesian phylogeographic approach to investigate the spatial dynamics of HIV-2 group A (HIV-2A) using a collection of 320 partial pol and 248 partial env sequences sampled throughout 19 countries worldwide. We extend phylogenetic diffusion models that simultaneously draw information from multiple loci to estimate location states throughout distinct phylogenies and explicitly attempt to incorporate human migratory fluxes. Our study highlights that Guinea-Bissau, together with Côte d’Ivoire and Senegal, have acted as the main viral sources in the early stages of the epidemic. We show that convenience sampling can obfuscate the estimation of the spatial root of HIV-2A. We explicitly attempt to circumvent this by incorporating rate priors that reflect the ratio of human flow from and to West Africa. We recover four main routes of HIV-2A dispersal that are laid out along colonial ties: Guinea-Bissau and Cape Verde to Portugal, Côte d’Ivoire and Senegal to France. Within Europe, we find strong support for epidemiological linkage from Portugal to Luxembourg and to the UK. We demonstrate that probabilistic models can uncover global patterns of HIV-2A dispersal providing sampling bias is taken into account and we provide a scenario for the international spread of this virus.
PMCID: PMC3542711  PMID: 22190015
6.  Comparative genomic analysis and phylogenetic position of Theileria equi 
BMC Genomics  2012;13:603.
Transmission of arthropod-borne apicomplexan parasites that cause disease and result in death or persistent infection represents a major challenge to global human and animal health. First described in 1901 as Piroplasma equi, this re-emergent apicomplexan parasite was renamed Babesia equi and subsequently Theileria equi, reflecting an uncertain taxonomy. Understanding mechanisms by which apicomplexan parasites evade immune or chemotherapeutic elimination is required for development of effective vaccines or chemotherapeutics. The continued risk of transmission of T. equi from clinically silent, persistently infected equids impedes the goal of returning the U.S. to non-endemic status. Therefore, comparative genomic analysis of T. equi was undertaken to: 1) identify genes contributing to immune evasion and persistence in equid hosts, 2) identify genes involved in PBMC infection biology and 3) define the phylogenetic position of T. equi relative to sequenced apicomplexan parasites.
The known immunodominant proteins, EMA1, 2 and 3, were discovered to belong to a ten-member gene family with a mean amino acid identity, in pairwise comparisons, of 39%. Importantly, the amino acid diversity of EMAs is distributed throughout the length of the proteins. Eight of the EMA genes were simultaneously transcribed. As the agents that cause bovine theileriosis infect and transform host cell PBMCs, we confirmed that T. equi infects equine PBMCs; however, there is no evidence of host cell transformation. Indeed, a number of genes identified as potential manipulators of the host cell phenotype are absent from the T. equi genome. Phylogenetic positioning of T. equi relative to seven apicomplexan parasites, using deduced amino acid sequences from 150 genes, placed it as a sister taxon to Theileria spp.
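The mean pairwise identity reported for the EMA family can be illustrated with a minimal sketch. It assumes the sequences are already aligned to equal length and that columns containing a gap in either sequence are skipped; the abstract does not state which convention the authors used, and the toy sequences below are invented, not EMA data.

```python
from itertools import combinations
from typing import List


def pairwise_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def mean_pairwise_identity(seqs: List[str]) -> float:
    """Average the identity over every unordered pair of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Invented toy alignment, for illustration only.
toy_alignment = ["MKTAY", "MKSAY", "MRTA-"]
print(round(mean_pairwise_identity(toy_alignment), 1))
```

For a ten-member family such as the one described, the average would be taken over 45 unordered pairs.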
The EMA family does not fit the paradigm for classical antigenic variation, and we propose a novel model describing the role of the EMA family in persistence. T. equi has lost the putative genes for host cell transformation, or the genes were acquired by T. parva and T. annulata after divergence from T. equi. Our analysis identified 50 genes that will be useful for definitive phylogenetic classification of T. equi and closely related organisms.
PMCID: PMC3505731  PMID: 23137308
Apicomplexa; Parasite; Vaccine; Horse; Vector-borne disease
7.  Genome sequences reveal divergence times of malaria parasite lineages 
Parasitology  2010;138(13):1737-1749.
The evolutionary history of human malaria parasites (genus Plasmodium) has long been a subject of speculation and controversy. The complete genome sequences of the two most widespread human malaria parasites, P. falciparum and P. vivax, and of the monkey parasite P. knowlesi are now available, together with the draft genomes of the chimpanzee parasite P. reichenowi, three rodent parasites, P. yoelii yoelii, P. berghei and P. chabaudi chabaudi, and one avian parasite, P. gallinaceum.
We present here an analysis of 45 orthologous gene sequences across the eight species that resolves the relationships of major Plasmodium lineages, and provides the first comprehensive dating of the age of those groups.
Our analyses support the hypothesis that the last common ancestor of P. falciparum and the chimpanzee parasite P. reichenowi occurred around the time of the human-chimpanzee divergence. P. falciparum infections of African apes are most likely derived from humans and not the other way around. On the other hand, P. vivax split from the monkey parasite P. knowlesi in the much more distant past, during the time that encompasses the separation of the Great Apes and Old World Monkeys.
The results support an ancient association between malaria parasites and their primate hosts, including humans.
PMCID: PMC3081533  PMID: 21118608
8.  Two Theileria parva CD8 T Cell Antigen Genes Are More Variable in Buffalo than Cattle Parasites, but Differ in Pattern of Sequence Diversity 
PLoS ONE  2011;6(4):e19015.
Theileria parva causes an acute fatal disease in cattle, but infections are asymptomatic in the African buffalo (Syncerus caffer). Cattle can be immunized against the parasite by infection and treatment, but immunity is partially strain specific. Available data indicate that CD8+ T lymphocyte responses mediate protection and, recently, several parasite antigens recognised by CD8+ T cells have been identified. This study set out to determine the nature and extent of polymorphism in two of these antigens, Tp1 and Tp2, which contain defined CD8+ T-cell epitopes, and to analyse the sequences for evidence of selection.
Methodology/Principal Findings
Partial sequencing of the Tp1 gene and the full-length Tp2 gene from 82 T. parva isolates revealed extensive polymorphism in both antigens, including the epitope-containing regions. Single nucleotide polymorphisms were detected at 51 positions (∼12%) in Tp1 and in 320 positions (∼61%) in Tp2. Together with two short indels in Tp1, these resulted in 30 and 42 protein variants of Tp1 and Tp2, respectively. Although evidence of positive selection was found for multiple amino acid residues, there was no preferential involvement of T cell epitope residues. Overall, the extent of diversity was much greater in T. parva isolates originating from buffalo than in isolates known to be transmissible among cattle.
The results indicate that T. parva parasites maintained in cattle represent a subset of the overall T. parva population, which has become adapted for tick transmission between cattle. The absence of obvious enrichment for positively selected amino acid residues within defined epitopes indicates either that diversity is not predominantly driven by selection exerted by host T cells, or that such selection is not detectable by the methods employed due to unidentified epitopes elsewhere in the antigens. Further functional studies are required to address this latter point.
PMCID: PMC3084734  PMID: 21559495
9.  Comparative genomics of the neglected human malaria parasite Plasmodium vivax 
Nature  2008;455(7214):757-763.
The human malaria parasite Plasmodium vivax is responsible for 25-40% of the ~515 million annual cases of malaria worldwide. Although seldom fatal, the parasite elicits severe and incapacitating clinical symptoms and often relapses months after a primary infection has cleared. Despite its importance as a major human pathogen, P. vivax is little studied because it cannot be propagated in the laboratory except in non-human primates. We determined the genome sequence of P. vivax in order to shed light on its distinctive biologic features, and as a means to drive development of new drugs and vaccines. Here we describe the synteny and isochore structure of P. vivax chromosomes, and show that the parasite resembles other malaria parasites in gene content and metabolic potential, but possesses novel gene families and potential alternate invasion pathways not recognized previously. Completion of the P. vivax genome provides the scientific community with a valuable resource that can be used to advance scientific investigation into this neglected species.
PMCID: PMC2651158  PMID: 18843361
10.  The protist Trichomonas vaginalis harbors multiple lineages of transcriptionally active Mutator-like elements 
BMC Genomics  2009;10:330.
For three decades the Mutator system was thought to be exclusive to plants, until the first homolog representatives were characterized in fungi and in early-diverging amoebas earlier in this decade.
Here, we describe and characterize four families of Mutator-like elements in a new eukaryotic group, the Parabasalids. These Trichomonas vaginalis Mutator-like elements, or TvMULEs, are active in T. vaginalis and patchily distributed among 12 trichomonad species and isolates. Despite their relatively distinctive amino acid composition, the inclusion of the repeats TvMULE1, TvMULE2, TvMULE3 and TvMULE4 into the Mutator superfamily is justified by sequence, structural and phylogenetic analyses. In addition, we identified three new TvMULE-related sequences in the genome sequence of Candida albicans. While TvMULE1 is a member of the MuDR clade, predominantly from plants, the other three TvMULEs, together with the C. albicans elements, represent a new and quite distinct Mutator lineage, which we named TvCaMULEs. The finding of a TvMULE1 sequence inserted into another putative repeat suggests the occurrence of a novel TE family not yet described.
These findings expand the taxonomic distribution and the range of functional motifs of MULEs among eukaryotes. The characterization of the dynamics of TvMULEs and other transposons in this organism is of particular interest because it is atypical for an asexual species to have such an extreme level of TE activity; this genetic landscape makes an interesting case study for causes and consequences of such activity. Finally, the extreme repetitiveness of the T. vaginalis genome and the remarkable degree of sequence identity within its repeat families highlight this species as an ideal system to characterize new transposable elements.
PMCID: PMC2725143  PMID: 19622157
11.  IDEA: Interactive Display for Evolutionary Analyses 
BMC Bioinformatics  2008;9:524.
The availability of complete genomic sequences for hundreds of organisms promises to make obtaining genome-wide estimates of substitution rates, selective constraints and other molecular evolution variables of interest an increasingly important approach to addressing broad evolutionary questions. Two of the programs most widely used for this purpose are codeml and baseml, parts of the PAML (Phylogenetic Analysis by Maximum Likelihood) suite. A significant drawback of these programs is their lack of a graphical user interface, which can limit their user base and considerably reduce their efficiency.
We have developed IDEA (Interactive Display for Evolutionary Analyses), an intuitive graphical input and output interface which interacts with PHYLIP for phylogeny reconstruction and with codeml and baseml for molecular evolution analyses. IDEA's graphical input and visualization interfaces eliminate the need to edit and parse text input and output files, reducing the likelihood of errors and improving processing time. Further, its interactive output display gives the user immediate access to results. Finally, IDEA can process data in parallel on a local machine or computing grid, allowing genome-wide analyses to be completed quickly.
IDEA provides a graphical user interface that allows the user to follow a codeml or baseml analysis from parameter input through to the exploration of results. Novel options streamline the analysis process, and post-analysis visualization of phylogenies, evolutionary rates and selective constraint along protein sequences simplifies the interpretation of results. The integration of these functions into a single tool eliminates the need for lengthy data handling and parsing, significantly expediting access to global patterns in the data.
PMCID: PMC2655098  PMID: 19061522
12.  Properties of non-coding DNA and identification of putative cis-regulatory elements in Theileria parva 
BMC Genomics  2008;9:582.
Parasites in the genus Theileria cause lymphoproliferative diseases in cattle, resulting in enormous socio-economic losses. The availability of the genome sequences and annotation for T. parva and T. annulata has facilitated the study of parasite biology and their relationship with host cell transformation and tropism. However, the mechanism of transcriptional regulation in this genus, which may be key to understanding fundamental aspects of its parasitology, remains poorly understood. In this study, we analyze the evolution of non-coding sequences in the Theileria genome and identify conserved sequence elements that may be involved in gene regulation of these parasitic species.
Intergenic regions and introns in Theileria are short, and their length distributions are considerably right-skewed. Intergenic regions flanked by genes in 5'-5' orientation tend to be longer and slightly more AT-rich than those flanked by two stop codons; intergenic regions flanked by genes in 3'-5' orientation have intermediate values of length and AT composition. Intron position is negatively correlated with intron length, and positively correlated with GC content. Using stringent criteria, we identified a set of high-quality orthologous non-coding sequences between T. parva and T. annulata, and determined the distribution of selective constraints across regions, which are shown to be higher close to translation start sites. A positive correlation between constraint and length in both intergenic regions and introns suggests a tight control over length expansion of non-coding regions. Genome-wide searches for functional elements revealed several conserved motifs in intergenic regions of Theileria genomes. Two such motifs are preferentially located within the first 60 base pairs upstream of transcription start sites in T. parva, are preferentially associated with specific protein functional categories, and have significant similarity to known regulatory motifs in other species. These results suggest that these two motifs are likely to represent transcription factor binding sites in Theileria.
Theileria genomes are highly compact, with selection seemingly favoring short introns and intergenic regions. Three over-represented sequence motifs were independently identified in intergenic regions of both Theileria species, and the evidence suggests that at least two of them play a role in transcriptional control in T. parva. These are prime candidates for experimental validation of transcription factor binding sites in this single-celled eukaryotic parasite. Sequences similar to two of these Theileria motifs are conserved in Plasmodium hinting at the possibility of common regulatory machinery across the phylum Apicomplexa.
PMCID: PMC2612703  PMID: 19055776
13.  Genomic Islands in the Pathogenic Filamentous Fungus Aspergillus fumigatus 
PLoS Genetics  2008;4(4):e1000046.
We present the genome sequences of a new clinical isolate of the important human pathogen, Aspergillus fumigatus, A1163, and two closely related but rarely pathogenic species, Neosartorya fischeri NRRL181 and Aspergillus clavatus NRRL1. Comparative genomic analysis of A1163 with the recently sequenced A. fumigatus isolate Af293 has identified core, variable and up to 2% unique genes in each genome. While the core genes are 99.8% identical at the nucleotide level, identity for variable genes can be as low as 40%. The most divergent loci appear to contain heterokaryon incompatibility (het) genes associated with fungal programmed cell death such as developmental regulator rosA. Cross-species comparison has revealed that 8.5%, 13.5% and 12.6%, respectively, of A. fumigatus, N. fischeri and A. clavatus genes are species-specific. These genes are significantly smaller in size than core genes, contain fewer exons and exhibit a subtelomeric bias. Most of them cluster together in 13 chromosomal islands, which are enriched for pseudogenes, transposons and other repetitive elements. At least 20% of A. fumigatus-specific genes appear to be functional and involved in carbohydrate and chitin catabolism, transport, detoxification, secondary metabolism and other functions that may facilitate the adaptation to heterogeneous environments such as soil or a mammalian host. Contrary to what was suggested previously, their origin cannot be attributed to horizontal gene transfer (HGT), but instead is likely to involve duplication, diversification and differential gene loss (DDL). The role of duplication in the origin of lineage-specific genes is further underlined by the discovery of genomic islands that seem to function as designated “gene dumps” and, perhaps, simultaneously, as “gene factories”.
Author Summary
Aspergillus is an extremely diverse genus of filamentous ascomycetous fungi (molds) found ubiquitously in soil and decomposing vegetation. Being supreme opportunists, aspergilli have adapted to overcome various chemical, physical, and biological stresses found in heterogeneous environments. While most species in the genus are saprophytes, a surprising number are able to infect wounded plants and animals. Remarkably, the allergic human host also responds abnormally to the aspergilli with lung and sinus disease. The advent of immunosuppressive agents and other medical advances have created a large worldwide pool of human hosts susceptible to some Aspergillus species, including the world's most harmful mold and the causative agent of invasive aspergillosis, Aspergillus fumigatus. In this study, we have used the power of comparative genomics to gain insight into genetic mechanisms that may contribute to the metabolic versatility and pathogenicity of this important human pathogen. Comparison of the genomes of two A. fumigatus clinical isolates and two closely related, but rarely pathogenic species showed that their genomes contain several large isolate- and species-specific chromosomal islands. The metabolic capabilities encoded by these highly labile regions are likely to contribute to their rapid adaptation to heterogeneous environments such as soil or a living host.
PMCID: PMC2289846  PMID: 18404212
14.  Characterization of paralogous protein families in rice 
BMC Plant Biology  2008;8:18.
High gene numbers in plant genomes reflect polyploidy and major gene duplication events. Oryza sativa, cultivated rice, is a diploid monocotyledonous species with a ~390 Mb genome that has undergone segmental duplication of a substantial portion of its genome. This, coupled with other genetic events such as tandem duplications, has resulted in a substantial number of its genes, and resulting proteins, occurring in paralogous families.
Using a computational pipeline that utilizes Pfam and novel protein domains, we characterized paralogous families in rice and compared these with paralogous families in the model dicotyledonous diploid species, Arabidopsis thaliana. Arabidopsis, which has undergone genome duplication as well, has a substantially smaller genome (~120 Mb) and gene complement compared to rice. Overall, 53% and 68% of the non-transposable element-related rice and Arabidopsis proteins could be classified into paralogous protein families, respectively. Singleton and paralogous family genes differed substantially in their likelihood of encoding a protein of known or putative function; 26% and 66% of singleton genes compared to 73% and 96% of the paralogous family genes encode a known or putative protein in rice and Arabidopsis, respectively. Furthermore, a major skew in the distribution of specific gene function was observed; a total of 17 Gene Ontology categories in both rice and Arabidopsis were statistically significant in their differential distribution between paralogous family and singleton proteins. In contrast to mammalian organisms, we found that duplicated genes in rice and Arabidopsis tend to have more alternative splice forms. Using data from Massively Parallel Signature Sequencing, we show that a significant portion of the duplicated genes in rice show divergent expression although a correlation between sequence divergence and correlation of expression could be seen in very young genes.
Collectively, these data suggest that while co-regulation and conserved function are present in some paralogous protein family members, evolutionary pressures have resulted in functional divergence with differential expression patterns.
PMCID: PMC2275729  PMID: 18284697
15.  Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis 
Science (New York, N.Y.)  2007;315(5809):207-212.
We describe the genome sequence of the protist Trichomonas vaginalis, a sexually transmitted human pathogen. Repeats and transposable elements comprise about two-thirds of the ~160-megabase genome, reflecting a recent massive expansion of genetic material. This expansion, in conjunction with the shaping of metabolic pathways that likely transpired through lateral gene transfer from bacteria, and amplification of specific gene families implicated in pathogenesis and phagocytosis of host proteins may exemplify adaptations of the parasite during its transition to a urogenital environment. The genome sequence predicts previously unknown functions for the hydrogenosome, which support a common evolutionary origin of this unusual organelle with mitochondria.
PMCID: PMC2080659  PMID: 17218520
16.  Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote 
PLoS Biology  2006;4(9):e286.
The ciliate Tetrahymena thermophila is a model organism for molecular and cellular biology. Like other ciliates, this species has separate germline and soma functions that are embodied by distinct nuclei within a single cell. The germline-like micronucleus (MIC) has its genome held in reserve for sexual reproduction. The soma-like macronucleus (MAC), which possesses a genome processed from that of the MIC, is the center of gene expression and does not directly contribute DNA to sexual progeny. We report here the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila, which is approximately 104 Mb in length and composed of approximately 225 chromosomes. Overall, the gene set is robust, with more than 27,000 predicted protein-coding genes, 15,000 of which have strong matches to genes in other organisms. The functional diversity encoded by these genes is substantial and reflects the complexity of processes required for a free-living, predatory, single-celled organism. This is highlighted by the abundance of lineage-specific duplications of genes with predicted roles in sensing and responding to environmental conditions (e.g., kinases), using diverse resources (e.g., proteases and transporters), and generating structural complexity (e.g., kinesins and dyneins). In contrast to the other lineages of alveolates (apicomplexans and dinoflagellates), no compelling evidence could be found for plastid-derived genes in the genome. UGA, the only T. thermophila stop codon, is used in some genes to encode selenocysteine, thus making this organism the first known with the potential to translate all 64 codons in nuclear genes into amino acids. We present genomic evidence supporting the hypothesis that the excision of DNA from the MIC to generate the MAC specifically targets foreign DNA as a form of genome self-defense. 
The combination of the genome sequence, the functional diversity encoded therein, and the presence of some pathways missing from other model organisms makes T. thermophila an ideal model for functional genomic studies to address biological, biomedical, and biotechnological questions of fundamental importance.
The macronuclear genome of Tetrahymena thermophila is sequenced and analyzed. Conservation in this single-celled ciliate of some features normally observed in only multicellular organisms sheds light on early eukaryotic evolution.
PMCID: PMC1557398  PMID: 16933976
17.  Intron gain and loss in segmentally duplicated genes in rice 
Genome Biology  2006;7(5):R41.
Analysis of over 3,000 co-linear paired genes in rice shows more intron loss than intron gain following segmental duplication.
Introns are under less selection pressure than exons, and consequently, intronic sequences have a higher rate of gain and loss than exons. In a number of plant species, a large portion of the genome has been segmentally duplicated, giving rise to a large set of duplicated genes. The recent completion of the rice genome in which segmental duplication has been documented has allowed us to investigate intron evolution within rice, a diploid monocotyledonous species.
Analysis of segmental duplication in rice revealed that 159 Mb of the 371 Mb genome and 21,570 of the 43,719 non-transposable element-related genes were contained within a duplicated region. In these duplicated regions, 3,101 collinear paired genes were present. Using this set of segmentally duplicated genes, we investigated intron evolution from full-length cDNA-supported non-transposable element-related gene models of rice. Using gene pairs that have an ortholog in the dicotyledonous model species Arabidopsis thaliana, we identified more intron loss (49 introns within 35 gene pairs) than intron gain (5 introns within 5 gene pairs) following segmental duplication. We were unable to demonstrate preferential intron loss at the 3' end of genes as previously reported in mammalian genomes. However, we did find that the four nucleotides of exons that flank lost introns had less frequently used 4-mers.
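The proportions implied by the counts in this paragraph can be worked out directly. The short calculation below uses only the figures quoted in the abstract; the variable names are ours, and the loss-to-gain ratio is a simple quotient of intron counts rather than a statistic reported by the authors.

```python
# Figures quoted in the abstract.
duplicated_mb, genome_mb = 159, 371
duplicated_genes, total_genes = 21_570, 43_719
introns_lost, introns_gained = 49, 5

# Share of the genome and of the non-TE-related genes inside duplicated regions.
genome_fraction_pct = 100 * duplicated_mb / genome_mb
gene_fraction_pct = 100 * duplicated_genes / total_genes

# Crude ratio of intron losses to intron gains among the analyzed gene pairs.
loss_to_gain = introns_lost / introns_gained

print(f"genome in duplicated regions: {genome_fraction_pct:.1f}%")
print(f"genes in duplicated regions:  {gene_fraction_pct:.1f}%")
print(f"intron loss:gain ratio:       {loss_to_gain:.1f}")
```

By this arithmetic, a little under half of the genome and of the gene set falls within duplicated regions, and intron losses outnumber gains roughly tenfold in the analyzed pairs.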
We observed that intron evolution within rice following segmental duplication is largely dominated by intron loss. In two of the five cases of intron gain within segmentally duplicated genes, the gained sequences were similar to transposable elements.
PMCID: PMC1779517  PMID: 16719932
