1.  Genomics of Loa loa, a Wolbachia-free filarial parasite of humans 
Nature genetics  2013;45(5):495-500.
Loa loa, the African eyeworm, is a major filarial pathogen of humans. Unlike most filariae, Loa loa does not contain the obligate intracellular Wolbachia endosymbiont. We describe the 91.4 Mb genome of Loa loa, and the genome of the related filarial parasite Wuchereria bancrofti, and predict 14,907 Loa loa genes based on microfilarial RNA sequencing. By comparing these genomes to that of another filarial parasite, Brugia malayi, and to several other nematode genomes, we demonstrate synteny among filariae but not with non-parasitic nematodes. The Loa loa genome encodes many immunologically relevant genes, as well as protein kinases targeted by drugs currently approved for humans. Despite lacking Wolbachia, Loa loa shows no new metabolic synthesis or transport capabilities compared to other filariae. These results suggest that the role played by Wolbachia in filarial biology is more subtle than previously thought and reveal marked differences between parasitic and non-parasitic nematodes.
PMCID: PMC4238225  PMID: 23525074
2.  Evolution of Invasion in a Diverse Set of Fusobacterium Species 
mBio  2014;5(6):e01864-14.
The diverse Fusobacterium genus contains species implicated in multiple clinical pathologies, including periodontal disease, preterm birth, and colorectal cancer. The lack of genetic tools for manipulating these organisms leaves us with little understanding of the genes responsible for adherence to and invasion of host cells. Actively invading Fusobacterium species can enter host cells independently, whereas passively invading species need additional factors, such as compromise of mucosal integrity or coinfection with other microbes. We applied whole-genome sequencing and comparative analysis to study the evolution of active and passive invasion strategies and to infer factors associated with active forms of host cell invasion. The evolution of active invasion appears to have followed an adaptive radiation in which two of the three fusobacterial lineages acquired new genes and underwent expansions of ancestral genes that enable active forms of host cell invasion. Compared to passive invaders, active invaders have much larger genomes, encode FadA-related adhesins, and possess twice as many genes encoding membrane-related proteins, including a large expansion of surface-associated proteins containing the MORN2 domain of unknown function. We predict a role for proteins containing MORN2 domains in adhesion and active invasion. In the largest and most comprehensive comparison of sequenced Fusobacterium species to date, we have generated a testable model for the molecular pathogenesis of Fusobacterium infection and illuminate new therapeutic or diagnostic strategies.
Fusobacterium species have recently been implicated in a broad spectrum of human pathologies, including Crohn’s disease, ulcerative colitis, preterm birth, and colorectal cancer. Largely due to the genetic intractability of member species, the mechanisms by which Fusobacterium causes these pathologies are not well understood, although adherence to and active invasion of host cells appear important. We examined whole-genome sequence data from a diverse set of Fusobacterium species to identify genetic determinants of active forms of host cell invasion. Our analyses revealed that actively invading Fusobacterium species have larger genomes than passively invading species and possess a specific complement of genes—including a class of genes of unknown function that we predict evolved to enable host cell adherence and invasion. This study provides an important framework for future studies on the role of Fusobacterium in pathologies such as colorectal cancer.
PMCID: PMC4222103  PMID: 25370491
3.  A universal genomic coordinate translator for comparative genomics 
BMC Bioinformatics  2014;15:227.
Genomic duplications constitute major events in the evolution of species, allowing paralogous copies of genes to take on fine-tuned biological roles. The orthology relationship between copies across multiple genomes can be resolved unambiguously by synteny, i.e. the conserved order of genomic sequences. However, a comprehensive analysis of duplication events and their contributions to evolution would require all-to-all genome alignments, the number of which increases as N^2 with the number of available genomes, N.
Here, we introduce Kraken, software that omits the all-to-all requirement by recursively traversing a graph of pairwise alignments and dynamically re-computing orthology. Kraken scales linearly with the number of targeted genomes, N, which allows for including large numbers of genomes in analyses. We first evaluated the method on the set of 12 Drosophila genomes, finding that orthologous correspondence computed indirectly through a graph of multiple synteny maps comes at minimal cost in terms of sensitivity, but reduces overall computational runtime by an order of magnitude. We then used the method on three well-annotated mammalian genomes, human, mouse, and rat, and show that up to 93% of protein coding transcripts have unambiguous pairwise orthologous relationships across the genomes. On a nucleotide level, 70 to 83% of exons match exactly at both splice junctions, and up to 97% on at least one junction. We last applied Kraken to an RNA-sequencing dataset from multiple vertebrates and diverse tissues, where we confirmed that brain-specific gene family members, i.e. one-to-many or many-to-many homologs, are more highly correlated across species than single-copy (i.e. one-to-one homologous) genes. Not limited to protein coding genes, Kraken also identifies thousands of newly identified transcribed loci, likely non-coding RNAs that are consistently transcribed in human, chimpanzee and gorilla, and maintain significant correlation of expression levels across species.
Kraken is a computational genome coordinate translator that facilitates cross-species comparisons, distinguishes orthologs from paralogs, and does not require costly all-to-all whole genome mappings. Kraken is freely available under LGPL from
PMCID: PMC4086997  PMID: 24976580
Comparative genomics; Genomic coordinate translation; Genomic duplication; Cross-species gene expression analysis
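The scaling argument in the abstract above can be illustrated with a minimal sketch. It assumes a toy representation in which each pairwise synteny map is a list of collinear blocks; the genome names, block coordinates, and the functions `map_position` and `translate` are hypothetical and are not Kraken's actual interface. The sketch only shows how a coordinate might be carried between two genomes that were never aligned directly, by walking a graph of pairwise maps.

```python
from collections import deque

# Hypothetical pairwise synteny maps (illustration only, not real data):
# (source, target) -> list of (source_start, source_end, target_start)
# collinear blocks, with half-open source intervals. Maps are one-directional.
PAIRWISE_MAPS = {
    ("human", "mouse"): [(100, 200, 1100), (300, 400, 5000)],
    ("mouse", "rat"): [(1000, 1300, 40), (4900, 5200, 900)],
}


def map_position(blocks, pos):
    """Translate pos through one pairwise map, or return None if unmapped."""
    for src_start, src_end, tgt_start in blocks:
        if src_start <= pos < src_end:
            return tgt_start + (pos - src_start)
    return None


def translate(source, target, pos, maps=PAIRWISE_MAPS):
    """Breadth-first walk over the graph of pairwise maps."""
    queue = deque([(source, pos)])
    seen = {source}
    while queue:
        genome, here = queue.popleft()
        if genome == target:
            return here
        for (src, tgt), blocks in maps.items():
            if src == genome and tgt not in seen:
                mapped = map_position(blocks, here)
                if mapped is not None:
                    seen.add(tgt)
                    queue.append((tgt, mapped))
    return None


def all_to_all_alignments(n):
    """Number of unordered genome pairs, which grows as N^2."""
    return n * (n - 1) // 2


def chained_alignments(n):
    """Alignments needed to connect N genomes in a simple chain, which grows as N."""
    return n - 1
```

In this toy setup, `translate("human", "rat", 150)` passes through the mouse map even though no human-to-rat map exists. A chain is only one possible connected graph; the abstract does not state which topology Kraken uses.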
4.  De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity 
Nature protocols  2013;8(8):10.1038/nprot.2013.084.
De novo assembly of RNA-Seq data allows us to study transcriptomes without the need for a genome sequence, such as in non-model organisms of ecological and evolutionary importance, cancer samples, or the microbiome. In this protocol, we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-Seq data in non-model organisms. We also present Trinity’s supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples, and approaches to identify protein coding genes. In an included tutorial we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from
PMCID: PMC3875132  PMID: 23845962
5.  Kinannote, a computer program to identify and classify members of the eukaryotic protein kinase superfamily 
Bioinformatics  2013;29(19):2387-2394.
Motivation: Kinases of the eukaryotic protein kinase superfamily are key regulators of most aspects of eukaryotic cellular behavior and have provided several drug targets, including kinases dysregulated in cancers. The rapid increase in the number of genomic sequences has created an acute need to identify and classify members of this important class of enzymes efficiently and accurately.
Results: Kinannote produces a draft kinome and comparative analyses for a predicted proteome using a single-line command, and it is currently the only tool that automatically classifies protein kinases using the controlled vocabulary of Hanks and Hunter [Hanks and Hunter (1995)]. Kinannote uses a hidden Markov model in combination with a position-specific scoring matrix to identify kinases, which are subsequently classified using a BLAST comparison with a local version of KinBase, a curated protein kinase dataset. Kinannote was tested on the predicted proteomes from four divergent species. The average sensitivity and precision for kinome retrieval from the test species are 94.4 and 96.8%, respectively. The ability of Kinannote to classify identified kinases was also evaluated, and the average sensitivity and precision for full classification of conserved kinases are 71.5 and 82.5%, respectively. Kinannote has had a significant impact on eukaryotic genome annotation, providing protein kinase annotations for 36 genomes made public by the Broad Institute in the period spanning 2009 to the present.
Availability: Kinannote is freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3777111  PMID: 23904509
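The sensitivity and precision figures quoted in the entry above follow the standard definitions, which can be written out as a short sketch. The counts in the example are made up for illustration and are not taken from the Kinannote evaluation; the function names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real kinases that were retrieved: TP / (TP + FN)."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no real positives to evaluate")
    return true_pos / total


def precision(true_pos, false_pos):
    """Fraction of retrieved sequences that are real kinases: TP / (TP + FP)."""
    total = true_pos + false_pos
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return true_pos / total


# Hypothetical counts for one test proteome (illustration only).
example_sensitivity = sensitivity(true_pos=90, false_neg=10)
example_precision = precision(true_pos=90, false_pos=5)
```

Averaging these two values over several test species gives summary figures of the kind reported in the abstract.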
6.  Evolutionary Dynamics of the Accessory Genome of Listeria monocytogenes 
PLoS ONE  2013;8(6):e67511.
Listeria monocytogenes, a foodborne bacterial pathogen, comprises four phylogenetic lineages that vary with regard to their serotypes and distribution among sources. In order to characterize lineage-specific genomic diversity within L. monocytogenes, we sequenced the genomes of eight strains from several lineages and serotypes, and characterized the accessory genome, which was hypothesized to contribute to phenotypic differences across lineages. The eight L. monocytogenes genomes sequenced range in size from 2.85 to 3.14 Mb, encode 2,822–3,187 genes, and include the first publicly available sequenced representatives of serotypes 1/2c, 3a and 4c. Mapping of the distribution of accessory genes revealed two distinct regions of the L. monocytogenes chromosome: an accessory-rich region in the first 65° adjacent to the origin of replication and a more stable region in the remaining 295°. This pattern of genome organization is distinct from that of related bacteria Staphylococcus aureus and Bacillus cereus. The accessory genome of all lineages is enriched for cell surface-related genes, phosphotransferase systems, and transcriptional regulators, highlighting the selective pressures faced by contemporary strains from their hosts, other microbes, and their environment. Phylogenetic analysis of O-antigen genes and gene clusters predicts that serotype 4 was ancestral in L. monocytogenes and serotype 1/2 associated gene clusters were putatively introduced through horizontal gene transfer in the ancestral population of L. monocytogenes lineage I and II.
PMCID: PMC3692452  PMID: 23825666
7.  Distinctive Expansion of Potential Virulence Genes in the Genome of the Oomycete Fish Pathogen Saprolegnia parasitica 
PLoS Genetics  2013;9(6):e1003272.
Oomycetes in the class Saprolegniomycetidae of the Eukaryotic kingdom Stramenopila have evolved as severe pathogens of amphibians, crustaceans, fish and insects, resulting in major losses in aquaculture and damage to aquatic ecosystems. We have sequenced the 63 Mb genome of the fresh water fish pathogen, Saprolegnia parasitica. Approximately 1/3 of the assembled genome exhibits loss of heterozygosity, indicating an efficient mechanism for revealing new variation. Comparison of S. parasitica with plant pathogenic oomycetes suggests that during evolution the host cellular environment has driven distinct patterns of gene expansion and loss in the genomes of plant and animal pathogens. S. parasitica possesses one of the largest repertoires of proteases (270) among eukaryotes that are deployed in waves at different points during infection as determined from RNA-Seq data. In contrast, despite being capable of living saprotrophically, parasitism has led to loss of inorganic nitrogen and sulfur assimilation pathways, strikingly similar to losses in obligate plant pathogenic oomycetes and fungi. The large gene families that are hallmarks of plant pathogenic oomycetes such as Phytophthora appear to be lacking in S. parasitica, including those encoding RXLR effectors, Crinklers, and Necrosis Inducing-Like Proteins (NLP). S. parasitica also has a very large kinome of 543 kinases, 10% of which are induced upon infection. Moreover, S. parasitica encodes several genes typical of animals or animal-pathogens and lacking from other oomycetes, including disintegrins and galactose-binding lectins, whose expression and evolutionary origins implicate horizontal gene transfer in the evolution of animal pathogenesis in S. parasitica.
Author Summary
Fish are an increasingly important source of animal protein globally, with aquaculture production rising dramatically over the past decade. Saprolegnia is a fungal-like oomycete and one of the most destructive fish pathogens, causing millions of dollars in losses to the aquaculture industry annually. Saprolegnia has also been linked to a worldwide decline in wild fish and amphibian populations. Here we describe the genome sequence of the first animal pathogenic oomycete and compare the genome content with the available plant pathogenic oomycetes. We found that Saprolegnia lacks the large effector families that are hallmarks of plant pathogenic oomycetes, showing evolutionary adaptation to the host. Moreover, Saprolegnia harbors pathogenesis-related genes that were derived by lateral gene transfer from the host and other animal pathogens. The retrotransposon LINE family also appears to be acquired from animal lineages. By transcriptome analysis we show a high rate of allelic variation, which reveals rapidly evolving genes and potentially adaptive evolutionary mechanisms coupled to selective pressures exerted by the animal host. The genome and transcriptome data, as well as subsequent biochemical analyses, gave us insight into the disease process of Saprolegnia at a molecular and cellular level and provided targets for sustainable control of Saprolegnia.
PMCID: PMC3681718  PMID: 23785293
8.  Genome level analysis of rice mRNA 3′-end processing signals and alternative polyadenylation 
Nucleic Acids Research  2008;36(9):3150-3161.
The position of a poly(A) site of eukaryotic mRNA is determined by sequence signals in pre-mRNA and a group of polyadenylation factors. To reveal rice poly(A) signals at a genome level, we constructed a dataset of 55 742 authenticated poly(A) sites and characterized the poly(A) signals. This resulted in identifying the typical tripartite cis-elements, including FUE, NUE and CE, as previously observed in Arabidopsis. The average size of the 3′-UTR was 289 nucleotides. When mapped to the genome, however, 15% of these poly(A) sites were found to be located in the currently annotated intergenic regions. Moreover, an extensive alternative polyadenylation profile was evident where 50% of the genes analyzed had more than one unique poly(A) site (excluding microheterogeneity sites), and 13% had four or more poly(A) sites. About 4% of the analyzed genes possessed alternative poly(A) sites at their introns, 5′-UTRs, or protein coding regions. The authenticity of these alternative poly(A) sites was partially confirmed using MPSS data. Analysis of nucleotide profile and signal patterns indicated that there may be a different set of poly(A) signals for those poly(A) sites found in the coding regions. Based on the features of rice poly(A) signals, an updated algorithm termed PASS-Rice was designed to predict poly(A) sites.
PMCID: PMC2396415  PMID: 18411206
9.  Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data 
Nature biotechnology  2011;29(7):644-652.
Massively-parallel cDNA sequencing has opened the way to deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here, we present the Trinity methodology for de novo full-length transcriptome reconstruction, and evaluate it on samples from fission yeast, mouse, and whitefly – an insect whose genome has not yet been sequenced. Trinity fully reconstructs a large fraction of the transcripts present in the data, also reporting alternative splice isoforms and transcripts from recently duplicated genes. In all cases, Trinity performs better than other available de novo transcriptome assembly programs, and its sensitivity is comparable to methods relying on genome alignments. Our approach provides a unified and general solution for transcriptome reconstruction in any sample, especially in the complete absence of a reference genome.
PMCID: PMC3571712  PMID: 21572440
10.  Genome sequencing and mapping reveal loss of heterozygosity as a mechanism for rapid adaptation in the vegetable pathogen Phytophthora capsici 
The oomycete vegetable pathogen Phytophthora capsici has shown remarkable adaptation to fungicides and new hosts. Like other members of this destructive genus, P. capsici has an explosive epidemiology, rapidly producing massive numbers of asexual spores on infected hosts. In addition, P. capsici can remain dormant for years as sexually-recombined oospores, making it difficult to produce crops at infested sites, and allowing outcrossing populations to maintain significant genetic variation. Genome sequencing, development of a high-density genetic map, and integrative genomic/genetic characterization of P. capsici field isolates and intercross progeny revealed significant mitotic loss of heterozygosity (LOH) and higher levels of SNVs than those reported for humans, plants, and P. infestans. LOH was detected in clonally propagated field isolates and sexual progeny, cumulatively affecting >30% of the genome. LOH altered genotypes for more than 11,000 single nucleotide variant (SNV) sites and showed a strong association with changes in mating type and pathogenicity. Overall, it appears that LOH may provide a rapid mechanism for fixing alleles and may be an important component of adaptability for P. capsici.
PMCID: PMC3551261  PMID: 22712506
11.  Comparative Genomics of Recent Shiga Toxin-Producing Escherichia coli O104:H4: Short-Term Evolution of an Emerging Pathogen 
mBio  2013;4(1):e00452-12.
The large outbreak of diarrhea and hemolytic uremic syndrome (HUS) caused by Shiga toxin-producing Escherichia coli O104:H4 in Europe from May to July 2011 highlighted the potential of a rarely identified E. coli serogroup to cause severe disease. Prior to the outbreak, there were very few reports of disease caused by this pathogen and thus little known of its diversity and evolution. The identification of cases of HUS caused by E. coli O104:H4 in France and Turkey after the outbreak and with no clear epidemiological links raises questions about whether these sporadic cases are derived from the outbreak. Here, we report genome sequences of five independent isolates from these cases and results of a comparative analysis with historical and 2011 outbreak isolates. These analyses revealed that the five isolates are not derived from the outbreak strain; however, they are more closely related to the outbreak strain and each other than to isolates identified prior to the 2011 outbreak. Over the short time scale represented by these closely related organisms, the majority of genome variation is found within their mobile genetic elements: none of the nine O104:H4 isolates compared here contain the same set of plasmids, and their prophages and genomic islands also differ. Moreover, the presence of closely related HUS-associated E. coli O104:H4 isolates supports the contention that fully virulent O104:H4 isolates are widespread and emphasizes the possibility of future food-borne E. coli O104:H4 outbreaks.
In the summer of 2011, a large outbreak of bloody diarrhea with a high rate of severe complications took place in Europe, caused by a previously rarely seen Escherichia coli strain of serogroup O104:H4. Identification of subsequent infections caused by E. coli O104:H4 raised questions about whether these new cases represented ongoing transmission of the outbreak strain. In this study, we sequenced the genomes of isolates from five recent cases and compared them with historical isolates. The analyses reveal that, in the very short term, evolution of the bacterial genome takes place in parts of the genome that are exchanged among bacteria, and these regions contain genes involved in adaptation to local environments. We show that these recent isolates are not derived from the outbreak strain but are very closely related and share many of the same disease-causing genes, emphasizing the concern that these bacteria may cause future severe outbreaks.
PMCID: PMC3551546  PMID: 23341549
12.  How deep is deep enough for RNA-Seq profiling of bacterial transcriptomes? 
BMC Genomics  2012;13:734.
High-throughput sequencing of cDNA libraries (RNA-Seq) has proven to be a highly effective approach for studying bacterial transcriptomes. A central challenge in designing RNA-Seq-based experiments is estimating a priori the number of reads per sample needed to detect and quantify thousands of individual transcripts with a large dynamic range of abundance.
We have conducted a systematic examination of how changes in the number of RNA-Seq reads per sample influence both profiling of a single bacterial transcriptome and the comparison of gene expression among samples. Our findings suggest that the number of reads typically produced in a single lane of the Illumina HiSeq sequencer far exceeds the number needed to saturate the annotated transcriptomes of diverse bacteria growing in monoculture. Moreover, as sequencing depth increases, so too does the detection of cDNAs that likely correspond to spurious transcripts or genomic DNA contamination. Finally, even when dozens of barcoded individual cDNA libraries are sequenced in a single lane, the vast majority of transcripts in each sample can be detected and numerous genes differentially expressed between samples can be identified.
Our analysis provides a guide for the many researchers seeking to determine the appropriate sequencing depth for RNA-Seq-based studies of diverse bacterial species.
PMCID: PMC3543199  PMID: 23270466
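The saturation question posed in the entry above can be sketched as a subsampling experiment, assuming reads have already been assigned to genes. The gene labels, read counts, and the detection threshold `min_reads` are made up for illustration and are not taken from the study; the function names are hypothetical.

```python
import random
from collections import Counter


def detected_genes(read_assignments, depth, min_reads=1, seed=0):
    """Count genes with at least min_reads reads in a random subsample."""
    rng = random.Random(seed)
    sample = rng.sample(read_assignments, depth)  # sampling without replacement
    counts = Counter(sample)
    return sum(1 for n in counts.values() if n >= min_reads)


def saturation_curve(read_assignments, depths, min_reads=1, seed=0):
    """Return (depth, genes detected) pairs for each requested depth."""
    return [
        (d, detected_genes(read_assignments, d, min_reads, seed))
        for d in depths
    ]


# Hypothetical library: one abundant gene and progressively rarer ones.
reads = ["geneA"] * 5000 + ["geneB"] * 500 + ["geneC"] * 50 + ["geneD"] * 5
curve = saturation_curve(reads, [10, 100, 1000, len(reads)])
```

Once the number of detected genes stops rising as depth increases, additional reads mostly add coverage to transcripts that were already detected, which is the sense of saturation used in the abstract.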
13.  A framework for human microbiome research 
Methé, Barbara A. | Nelson, Karen E. | Pop, Mihai | Creasy, Heather H. | Giglio, Michelle G. | Huttenhower, Curtis | Gevers, Dirk | Petrosino, Joseph F. | Abubucker, Sahar | Badger, Jonathan H. | Chinwalla, Asif T. | Earl, Ashlee M. | FitzGerald, Michael G. | Fulton, Robert S. | Hallsworth-Pepin, Kymberlie | Lobos, Elizabeth A. | Madupu, Ramana | Magrini, Vincent | Martin, John C. | Mitreva, Makedonka | Muzny, Donna M. | Sodergren, Erica J. | Versalovic, James | Wollam, Aye M. | Worley, Kim C. | Wortman, Jennifer R. | Young, Sarah K. | Zeng, Qiandong | Aagaard, Kjersti M. | Abolude, Olukemi O. | Allen-Vercoe, Emma | Alm, Eric J. | Alvarado, Lucia | Andersen, Gary L. | Anderson, Scott | Appelbaum, Elizabeth | Arachchi, Harindra M. | Armitage, Gary | Arze, Cesar A. | Ayvaz, Tulin | Baker, Carl C. | Begg, Lisa | Belachew, Tsegahiwot | Bhonagiri, Veena | Bihan, Monika | Blaser, Martin J. | Bloom, Toby | Vivien Bonazzi, J. | Brooks, Paul | Buck, Gregory A. | Buhay, Christian J. | Busam, Dana A. | Campbell, Joseph L. | Canon, Shane R. | Cantarel, Brandi L. | Chain, Patrick S. | Chen, I-Min A. | Chen, Lei | Chhibba, Shaila | Chu, Ken | Ciulla, Dawn M. | Clemente, Jose C. | Clifton, Sandra W. | Conlan, Sean | Crabtree, Jonathan | Cutting, Mary A. | Davidovics, Noam J. | Davis, Catherine C. | DeSantis, Todd Z. | Deal, Carolyn | Delehaunty, Kimberley D. | Dewhirst, Floyd E. | Deych, Elena | Ding, Yan | Dooling, David J. | Dugan, Shannon P. | Dunne, Wm. Michael | Durkin, A. Scott | Edgar, Robert C. | Erlich, Rachel L. | Farmer, Candace N. | Farrell, Ruth M. | Faust, Karoline | Feldgarden, Michael | Felix, Victor M. | Fisher, Sheila | Fodor, Anthony A. | Forney, Larry | Foster, Leslie | Di Francesco, Valentina | Friedman, Jonathan | Friedrich, Dennis C. | Fronick, Catrina C. | Fulton, Lucinda L. | Gao, Hongyu | Garcia, Nathalia | Giannoukos, Georgia | Giblin, Christina | Giovanni, Maria Y. | Goldberg, Jonathan M. 
| Goll, Johannes | Gonzalez, Antonio | Griggs, Allison | Gujja, Sharvari | Haas, Brian J. | Hamilton, Holli A. | Harris, Emily L. | Hepburn, Theresa A. | Herter, Brandi | Hoffmann, Diane E. | Holder, Michael E. | Howarth, Clinton | Huang, Katherine H. | Huse, Susan M. | Izard, Jacques | Jansson, Janet K. | Jiang, Huaiyang | Jordan, Catherine | Joshi, Vandita | Katancik, James A. | Keitel, Wendy A. | Kelley, Scott T. | Kells, Cristyn | Kinder-Haake, Susan | King, Nicholas B. | Knight, Rob | Knights, Dan | Kong, Heidi H. | Koren, Omry | Koren, Sergey | Kota, Karthik C. | Kovar, Christie L. | Kyrpides, Nikos C. | La Rosa, Patricio S. | Lee, Sandra L. | Lemon, Katherine P. | Lennon, Niall | Lewis, Cecil M. | Lewis, Lora | Ley, Ruth E. | Li, Kelvin | Liolios, Konstantinos | Liu, Bo | Liu, Yue | Lo, Chien-Chi | Lozupone, Catherine A. | Lunsford, R. Dwayne | Madden, Tessa | Mahurkar, Anup A. | Mannon, Peter J. | Mardis, Elaine R. | Markowitz, Victor M. | Mavrommatis, Konstantinos | McCorrison, Jamison M. | McDonald, Daniel | McEwen, Jean | McGuire, Amy L. | McInnes, Pamela | Mehta, Teena | Mihindukulasuriya, Kathie A. | Miller, Jason R. | Minx, Patrick J. | Newsham, Irene | Nusbaum, Chad | O’Laughlin, Michelle | Orvis, Joshua | Pagani, Ioanna | Palaniappan, Krishna | Patel, Shital M. | Pearson, Matthew | Peterson, Jane | Podar, Mircea | Pohl, Craig | Pollard, Katherine S. | Priest, Margaret E. | Proctor, Lita M. | Qin, Xiang | Raes, Jeroen | Ravel, Jacques | Reid, Jeffrey G. | Rho, Mina | Rhodes, Rosamond | Riehle, Kevin P. | Rivera, Maria C. | Rodriguez-Mueller, Beltran | Rogers, Yu-Hui | Ross, Matthew C. | Russ, Carsten | Sanka, Ravi K. | Pamela Sankar, J. | Sathirapongsasuti, Fah | Schloss, Jeffery A. | Schloss, Patrick D. | Schmidt, Thomas M. | Scholz, Matthew | Schriml, Lynn | Schubert, Alyxandria M. | Segata, Nicola | Segre, Julia A. | Shannon, William D. | Sharp, Richard R. | Sharpton, Thomas J. | Shenoy, Narmada | Sheth, Nihar U. | Simone, Gina A. 
| Singh, Indresh | Smillie, Chris S. | Sobel, Jack D. | Sommer, Daniel D. | Spicer, Paul | Sutton, Granger G. | Sykes, Sean M. | Tabbaa, Diana G. | Thiagarajan, Mathangi | Tomlinson, Chad M. | Torralba, Manolito | Treangen, Todd J. | Truty, Rebecca M. | Vishnivetskaya, Tatiana A. | Walker, Jason | Wang, Lu | Wang, Zhengyuan | Ward, Doyle V. | Warren, Wesley | Watson, Mark A. | Wellington, Christopher | Wetterstrand, Kris A. | White, James R. | Wilczek-Boney, Katarzyna | Wu, Yuan Qing | Wylie, Kristine M. | Wylie, Todd | Yandava, Chandri | Ye, Liang | Ye, Yuzhen | Yooseph, Shibu | Youmans, Bonnie P. | Zhang, Lan | Zhou, Yanjiao | Zhu, Yiming | Zoloth, Laurie | Zucker, Jeremy D. | Birren, Bruce W. | Gibbs, Richard A. | Highlander, Sarah K. | Weinstock, George M. | Wilson, Richard K. | White, Owen
Nature  2012;486(7402):215-221.
A variety of microbial communities and their genes (microbiome) exist throughout the human body, playing fundamental roles in human health and disease. The NIH-funded Human Microbiome Project (HMP) Consortium has established a population-scale framework which catalyzed significant development of metagenomic protocols resulting in a broad range of quality-controlled resources and data including standardized methods for creating, processing and interpreting distinct types of high-throughput metagenomic data available to the scientific community. Here we present resources from a population of 242 healthy adults sampled at 15 to 18 body sites up to three times, which, to date, have generated 5,177 microbial taxonomic profiles from 16S rRNA genes and over 3.5 Tb of metagenomic sequence. In parallel, approximately 800 human-associated reference genomes have been sequenced. Collectively, these data represent the largest resource to date describing the abundance and variety of the human microbiome, while providing a platform for current and future studies.
PMCID: PMC3377744  PMID: 22699610
14.  The Genome of the Basidiomycetous Yeast and Human Pathogen Cryptococcus neoformans 
Science (New York, N.Y.)  2005;307(5713):1321-1324.
Cryptococcus neoformans is a basidiomycetous yeast ubiquitous in the environment, a model for fungal pathogenesis, and an opportunistic human pathogen of global importance. We have sequenced its ~20-megabase genome, which contains ~6500 intron-rich gene structures and encodes a transcriptome abundant in alternatively spliced and antisense messages. The genome is rich in transposons, many of which cluster at candidate centromeric regions. The presence of these transposons may drive karyotype instability and phenotypic variation. C. neoformans encodes unique genes that may contribute to its unusual virulence properties, and comparison of two phenotypically distinct strains reveals variation in gene content in addition to sequence polymorphisms between the genomes.
PMCID: PMC3520129  PMID: 15653466
15.  Approaches to Fungal Genome Annotation 
Mycology  2011;2(3):118-141.
Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment.
PMCID: PMC3207268  PMID: 22059117
16.  Comparative Genomics of Vancomycin-Resistant Staphylococcus aureus Strains and Their Positions within the Clade Most Commonly Associated with Methicillin-Resistant S. aureus Hospital-Acquired Infection in the United States  
mBio  2012;3(3):e00112-12.
Methicillin-resistant Staphylococcus aureus (MRSA) strains are leading causes of hospital-acquired infections in the United States, and clonal cluster 5 (CC5) is the predominant lineage responsible for these infections. Since 2002, there have been 12 cases of vancomycin-resistant S. aureus (VRSA) infection in the United States—all CC5 strains. To understand this genetic background and what distinguishes it from other lineages, we generated and analyzed high-quality draft genome sequences for all available VRSA strains. Sequence comparisons show unambiguously that each strain independently acquired Tn1546 and that all VRSA strains last shared a common ancestor over 50 years ago, well before the occurrence of vancomycin resistance in this species. In contrast to existing hypotheses on what predisposes this lineage to acquire Tn1546, the barrier posed by restriction systems appears to be intact in most VRSA strains. However, VRSA (and other CC5) strains were found to possess a constellation of traits that appears to be optimized for proliferation in precisely the types of polymicrobic infection where transfer could occur. They lack a bacteriocin operon that would be predicted to limit the occurrence of non-CC5 strains in mixed infection and harbor a cluster of unique superantigens and lipoproteins to confound host immunity. A frameshift in dprA, which in other microbes influences uptake of foreign DNA, may also make this lineage conducive to foreign DNA acquisition.
Invasive methicillin-resistant Staphylococcus aureus (MRSA) infection now ranks among the leading causes of death in the United States. Vancomycin is a key last-line bactericidal drug for treating these infections. However, since 2002, vancomycin resistance has entered this species. Of the now 12 cases of vancomycin-resistant S. aureus (VRSA), each was believed to represent a new acquisition of the vancomycin-resistant transposon Tn1546 from enterococcal donors. All acquisitions of Tn1546 so far have occurred in MRSA strains of the clonal cluster 5 genetic background, the most common hospital lineage causing hospital-acquired MRSA infection. To understand the nature of these strains, we determined and examined the nucleotide sequences of the genomes of all available VRSA. Genome comparison identified candidate features that position strains of this lineage well for acquiring resistance to antibiotics in mixed infection.
PMCID: PMC3372964  PMID: 22617140
17.  Branch Point Identification and Sequence Requirements for Intron Splicing in Plasmodium falciparum
Eukaryotic Cell  2011;10(11):1422-1428.
Splicing of mRNA is an ancient and evolutionarily conserved process in eukaryotic organisms, but intron-exon structures vary. Plasmodium falciparum has an extreme AT nucleotide bias (>80%), providing a unique opportunity to investigate how evolutionary forces have acted on intron structures. In this study, we developed an in vivo luciferase reporter splicing assay and employed it in combination with lariat isolation and sequencing to characterize 5′ and 3′ splicing requirements and experimentally determine the intron branch point in P. falciparum. This analysis indicates that P. falciparum mRNAs have canonical 5′ and 3′ splice sites. However, the 5′ consensus motif is weakly conserved and tolerates nucleotide substitution, including the fifth nucleotide in the intron, which is more typically a G nucleotide in most eukaryotes. In comparison, the 3′ splice site has a strong eukaryotic consensus sequence and adjacent polypyrimidine tract. In four different P. falciparum pre-mRNAs, multiple branch points per intron were detected, with some at U instead of the typical A residue. A weak branch point consensus was detected among 18 identified branch points. This analysis indicates that P. falciparum retains many consensus eukaryotic splice site features, despite having an extreme codon bias, and possesses flexibility in branch point nucleophilic attack.
PMCID: PMC3209046  PMID: 21926333
18.  Efficient and robust RNA-seq process for cultured bacteria and complex community transcriptomes 
Genome Biology  2012;13(3):r23.
We have developed a process for transcriptome analysis of bacterial communities that accommodates both intact and fragmented starting RNA and combines efficient rRNA removal with strand-specific RNA-seq. We applied this approach to an RNA mixture derived from three diverse cultured bacterial species and to RNA isolated from clinical stool samples. The resulting expression profiles were highly reproducible, enriched up to 40-fold for non-rRNA transcripts, and correlated well with profiles representing undepleted total RNA.
PMCID: PMC3439974  PMID: 22455878
19.  'Next-generation' sequencing becomes 'now-generation' 
Genome Biology  2011;12(3):303.
A report on the Advances in Genome Biology & Technology conference, Marco Island, USA, 2-5 February 2011.
PMCID: PMC3129669  PMID: 21457529
20.  Comparative Functional Genomics of the Fission Yeasts 
Science (New York, N.Y.)  2011;332(6032):930-936.
The fission yeast clade, comprising Schizosaccharomyces pombe, S. octosporus, S. cryophilus and S. japonicus, occupies the basal branch of Ascomycete fungi and is an important model of eukaryote biology. A comparative annotation of these genomes identified a near extinction of transposons and the associated innovation of transposon-free centromeres. Expression analysis established that meiotic genes are subject to antisense transcription during vegetative growth, suggesting a mechanism for their tight regulation. In addition, trans-acting regulators control new genes within the context of expanded functional modules for meiosis and stress response. Differences in gene content and regulation also explain why, unlike the Saccharomycotina, fission yeasts cannot use ethanol as a primary carbon source. These analyses elucidate the genome structure and gene regulation of fission yeast and provide tools for investigation across the Schizosaccharomyces clade.
PMCID: PMC3131103  PMID: 21511999
21.  Comparative Genomic Analysis of Human Fungal Pathogens Causing Paracoccidioidomycosis 
PLoS Genetics  2011;7(10):e1002345.
Paracoccidioides is a fungal pathogen and the cause of paracoccidioidomycosis, a health-threatening human systemic mycosis endemic to Latin America. Infection by Paracoccidioides, a dimorphic fungus in the order Onygenales, is coupled with a thermally regulated transition from a soil-dwelling filamentous form to a yeast-like pathogenic form. To better understand the genetic basis of growth and pathogenicity in Paracoccidioides, we sequenced the genomes of two strains of Paracoccidioides brasiliensis (Pb03 and Pb18) and one strain of Paracoccidioides lutzii (Pb01). These genomes range in size from 29.1 Mb to 32.9 Mb and encode 7,610 to 8,130 genes. To enable genetic studies, we mapped 94% of the P. brasiliensis Pb18 assembly onto five chromosomes. We characterized gene family content across Onygenales and related fungi, and within Paracoccidioides we found expansions of the fungal-specific kinase family FunK1. Additionally, the Onygenales have lost many genes involved in carbohydrate metabolism and fewer genes involved in protein metabolism, resulting in a higher ratio of proteases to carbohydrate active enzymes in the Onygenales than their relatives. To determine if gene content correlated with growth on different substrates, we screened the non-pathogenic onygenale Uncinocarpus reesii, which has orthologs for 91% of Paracoccidioides metabolic genes, for growth on 190 carbon sources. U. reesii showed growth on a limited range of carbohydrates, primarily basic plant sugars and cell wall components; this suggests that Onygenales, including dimorphic fungi, can degrade cellulosic plant material in the soil. In addition, U. reesii grew on gelatin and a wide range of dipeptides and amino acids, indicating a preference for proteinaceous growth substrates over carbohydrates, which may enable these fungi to also degrade animal biomass. 
These capabilities for degrading plant and animal substrates suggest a duality in lifestyle that could enable pathogenic species of Onygenales to transfer from soil to animal hosts.
Author Summary
Paracoccidioides sp. are fungal pathogens that cause paracoccidioidomycosis in humans. They are part of a larger group of dimorphic fungi causing pulmonary infections in immunocompetent people, whereas many other fungi cause opportunistic infections. We sequenced the genomes of two strains of Paracoccidioides brasiliensis and one strain of the closely related species Paracoccidioides lutzii, and compared them to other fungal genomes. We found gene family expansions specific to Paracoccidioides, including the fungal-specific kinase family. By contrast we found that dimorphic fungi as a group have lost many genes involved in carbohydrate metabolism but retained most proteases. As the growth substrates for dimorphic fungi have not been well characterized, we tested a non-pathogenic relative, Uncinocarpus reesii, for growth on 190 carbon sources. We found that U. reesii is capable of growth on a limited set of carbohydrates, but grows more rapidly on a wide range of dipeptides and amino acids. Our analysis suggests that this genetic and phenotypic preference may underlie the ability of the dimorphic fungi to infect and grow on animal hosts.
PMCID: PMC3203195  PMID: 22046142
22.  UCHIME improves sensitivity and speed of chimera detection 
Bioinformatics  2011;27(16):2194-2200.
Motivation: Chimeric DNA sequences often form during polymerase chain reaction amplification, especially when sequencing single regions (e.g. 16S rRNA or fungal Internal Transcribed Spacer) to assess diversity or compare populations. Undetected chimeras may be misinterpreted as novel species, causing inflated estimates of diversity and spurious inferences of differences between populations. Detection and removal of chimeras is therefore of critical importance in such experiments.
Results: We describe UCHIME, a new program that detects chimeric sequences with two or more segments. UCHIME either uses a database of chimera-free sequences or detects chimeras de novo by exploiting abundance data. UCHIME has better sensitivity than ChimeraSlayer (previously the most sensitive database method), especially with short, noisy sequences. In testing on artificial bacterial communities with known composition, UCHIME de novo sensitivity is shown to be comparable to Perseus. UCHIME is >100× faster than Perseus and >1000× faster than ChimeraSlayer.
Availability: Source, binaries and data:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3150044  PMID: 21700674
23.  Draft genome sequence of the ricin-producing oilseed castor bean 
Nature biotechnology  2010;28(9):951-956.
Castor bean (Ricinus communis) is an oil crop that belongs to the spurge (Euphorbiaceae) family. Its seeds are the source of castor oil, used for the production of high-quality lubricants due to its high proportion of the unusual fatty acid ricinoleic acid. Castor bean seeds also produce ricin, a highly toxic ribosome inactivating protein, making castor bean relevant for biosafety. We report here the 4.6X draft genome sequence of castor bean, representing the first reported Euphorbiaceae genome sequence. Our analysis shows that most key castor oil metabolism genes are single-copy while the ricin gene family is larger than previously thought. Comparative genomics analysis suggests the presence of an ancient hexaploidization event that is conserved across the dicotyledonous lineage.
PMCID: PMC2945230  PMID: 20729833
24.  The genome of the blood fluke Schistosoma mansoni 
Nature  2009;460(7253):352-358.
Schistosoma mansoni is responsible for the neglected tropical disease schistosomiasis that affects 210 million people in 76 countries. We report here analysis of the 363 megabase nuclear genome of the blood fluke. It encodes at least 11,809 genes, with an unusual intron size distribution, and novel families of micro-exon genes that undergo frequent alternate splicing. As the first sequenced flatworm, and a representative of the lophotrochozoa, it offers insights into early events in the evolution of the animals, including the development of a body pattern with bilateral symmetry, and the development of tissues into organs. Our analysis has been informed by the need to find new drug targets. The deficits in lipid metabolism that make schistosomes dependent on the host are revealed, while the identification of membrane receptors, ion channels and more than 300 proteases, provide new insights into the biology of the life cycle and novel targets. Bioinformatics approaches have identified metabolic chokepoints while a chemogenomic screen has pinpointed schistosome proteins for which existing drugs may be active. The information generated provides an invaluable resource for the research community to develop much needed new control tools for the treatment and eradication of this important and neglected disease.
PMCID: PMC2756445  PMID: 19606141
25.  Comparative genomics of mutualistic viruses of Glyptapanteles parasitic wasps 
Genome Biology  2008;9(12):R183.
Comparative genome analysis of two endosymbiotic polydnaviruses from Glyptapanteles parasitic wasps reveals new insights into the evolutionary arms race between host and parasite.
Polydnaviruses, double-stranded DNA viruses with segmented genomes, have evolved as obligate endosymbionts of parasitoid wasps. Virus particles are replication deficient and produced by female wasps from proviral sequences integrated into the wasp genome. These particles are co-injected with eggs into caterpillar hosts, where viral gene expression facilitates parasitoid survival and, thereby, survival of proviral DNA. Here we characterize and compare the encapsidated viral genome sequences of bracoviruses in the family Polydnaviridae associated with Glyptapanteles gypsy moth parasitoids, along with near complete proviral sequences from which both viral genomes are derived.
The encapsidated Glyptapanteles indiensis and Glyptapanteles flavicoxis bracoviral genomes, each composed of 29 different size segments, total approximately 517 and 594 kbp, respectively. They are generated from a minimum of seven distinct loci in the wasp genome. Annotation of these sequences revealed numerous novel features for polydnaviruses, including insect-like sugar transporter genes and transposable elements. Evolutionary analyses suggest that positive selection is widespread among bracoviral genes.
The structure and organization of G. indiensis and G. flavicoxis bracovirus proviral segments as multiple loci containing one to many viral segments, flanked and separated by wasp gene-encoding DNA, is confirmed. Rapid evolution of bracovirus genes supports the hypothesis of bracovirus genes in an 'arms race' between bracovirus and caterpillar. Phylogenetic analyses of the bracoviral genes encoding sugar transporters provide the first robust evidence of a wasp origin for some polydnavirus genes. We hypothesize that transposable elements, such as those described here, could facilitate transfer of genes between proviral segments and host DNA.
PMCID: PMC2646287  PMID: 19116010
