Results 1-15 (15)
1.  The Repatterning of Eukaryotic Genomes by Random Genetic Drift 
Recent observations on rates of mutation, recombination, and random genetic drift highlight the dramatic ways in which fundamental evolutionary processes vary across the divide between unicellular microbes and multicellular eukaryotes. Moreover, population-genetic theory suggests that the range of variation in these parameters is sufficient to explain the evolutionary diversification of many aspects of genome size and gene structure found among phylogenetic lineages. Most notably, large eukaryotic organisms that experience elevated magnitudes of random genetic drift are susceptible to the passive accumulation of mutationally hazardous DNA that would otherwise be eliminated by efficient selection. Substantial evidence also suggests that variation in the population-genetic environment influences patterns of protein evolution, with the emergence of certain kinds of amino-acid substitutions and protein-protein complexes only being possible in populations with relatively small effective sizes. These observations imply that the ultimate origins of many of the major genomic and proteomic disparities between prokaryotes and eukaryotes and among eukaryotic lineages have been molded as much by intrinsic variation in the genetic and cellular features of species as by external ecological forces.
PMCID: PMC4519033  PMID: 21756106
complexity; genome evolution; mutation; protein evolution; recombination
2.  Stability of Gut Enterotypes in Korean Monozygotic Twins and Their Association with Biomarkers and Diet 
Scientific Reports  2014;4:7348.
Studies on the human gut microbiota have suggested that human individuals could be categorized into enterotypes based on the compositions of their gut microbial communities. Here, we report that the gut microbiota of healthy Koreans are clustered into two enterotypes, dominated by either Bacteroides (enterotype 1) or Prevotella (enterotype 2). More than 72% of the paired fecal samples from monozygotic twin pairs were assigned to the same enterotype. Our longitudinal analysis of these twins indicated that more than 80% of the individuals belonged to the same enterotype after about a 2-year interval. Microbial functions based on KEGG pathways were also divided into two clusters. For enterotype 2, 100% of the samples belonged to the same functional cluster, while for enterotype 1, approximately half of the samples belonged to each functional cluster. Enterotype 2 was significantly associated with long-term dietary habits that were high in dietary fiber, various vitamins, and minerals. Among anthropometrical and biochemical traits, the level of serum uric acid was associated with enterotype. These results suggest that host genetics as well as host properties such as long-term dietary patterns and a particular clinical biomarker could be important contributors to the enterotype of an individual.
PMCID: PMC4258686  PMID: 25482875
3.  Genome-Wide Characterization of Endogenous Retroviruses in the Bat Myotis lucifugus Reveals Recent and Diverse Infections 
Journal of Virology  2013;87(15):8493-8501.
Bats are increasingly recognized as reservoir species for a variety of zoonotic viruses that pose severe threats to human health. While many RNA viruses have been identified in bats, little is known about bat retroviruses. Endogenous retroviruses (ERVs) represent genomic fossils of past retroviral infections and, thus, can inform us on the diversity and history of retroviruses that have infected a species lineage. Here, we took advantage of the availability of a high-quality genome assembly for the little brown bat, Myotis lucifugus, to systematically identify and analyze ERVs in this species. We mined an initial set of 362 potentially complete proviruses from the three main classes of ERVs, which were further resolved into 13 major families and 86 subfamilies by phylogenetic analysis. Consensus or representative sequences for each of the 86 subfamilies were then merged to the Repbase collection of known ERV/long terminal repeat (LTR) elements to annotate the retroviral complement of the bat genome. The results show that nearly 5% of the genome assembly is occupied by ERV-derived sequences, a quantity comparable to findings for other eutherian mammals. About one-fourth of these sequences belong to subfamilies newly identified in this study. Using two independent methods, intraelement LTR divergence and analysis of orthologous loci in two other bat species, we found that the vast majority of the potentially complete proviruses identified in M. lucifugus were integrated in the last ∼25 million years. All three major ERV classes include recently integrated proviruses, suggesting that a wide diversity of retroviruses is still circulating in Myotis bats.
PMCID: PMC3719839  PMID: 23720713
4.  CRISPR-Cas systems target a diverse collection of invasive mobile genetic elements in human microbiomes 
Genome Biology  2013;14(4):R40.
Bacteria and archaea develop immunity against invading genomes by incorporating pieces of the invaders' sequences, called spacers, into a clustered regularly interspaced short palindromic repeats (CRISPR) locus between repeats, forming arrays of repeat-spacer units. When spacers are expressed, they direct CRISPR-associated (Cas) proteins to silence complementary invading DNA. In order to characterize the invaders of human microbiomes, we use spacers from CRISPR arrays that we had previously assembled from shotgun metagenomic datasets, and identify contigs that contain these spacers' targets.
We discover 95,000 contigs that are putative invasive mobile genetic elements, some targeted by hundreds of CRISPR spacers. We find that oral sites in healthy human populations have a much greater variety of mobile genetic elements than stool samples. Mobile genetic elements carry genes encoding diverse functions: only 7% of the mobile genetic elements are similar to known phages or plasmids, although a much greater proportion contain phage- or plasmid-related genes. A small number of contigs share similarity with known integrative and conjugative elements, providing the first examples of CRISPR defenses against this class of element. We provide detailed analyses of a few large mobile genetic elements of various types, and a relative abundance analysis of mobile genetic elements and putative hosts, exploring the dynamic activities of mobile genetic elements in human microbiomes. A joint analysis of mobile genetic elements and CRISPRs shows that protospacer-adjacent motifs drive their interaction network; however, some CRISPR-Cas systems target mobile genetic elements lacking motifs.
We identify a large collection of invasive mobile genetic elements in human microbiomes, an important resource for further study of the interaction between the CRISPR-Cas immune system and invaders.
PMCID: PMC4053933  PMID: 23628424
CRISPR-Cas system; human microbiome; mobile genetic element (MGE)
5.  Oral Spirochetes Implicated in Dental Diseases Are Widespread in Normal Human Subjects and Carry Extremely Diverse Integron Gene Cassettes 
Applied and Environmental Microbiology  2012;78(15):5288-5296.
The NIH Human Microbiome Project (HMP) has produced several hundred metagenomic data sets, allowing studies of the many functional elements in human-associated microbial communities. Here, we survey the distribution of oral spirochetes implicated in dental diseases in normal human individuals, using recombination sites associated with the chromosomal integron in Treponema genomes, taking advantage of the multiple copies of the integron recombination sites (repeats) in the genomes, and using a targeted assembly approach that we have developed. We find that integron-containing Treponema species are present in ∼80% of the normal human subjects included in the HMP. Further, we are able to de novo assemble the integron gene cassettes using our constrained assembly approach, which employs a unique application of the de Bruijn graph assembly information; most of these cassette genes were not assembled in whole-metagenome assemblies and could not be identified by mapping sequencing reads onto the known reference Treponema genomes due to the dynamic nature of integron gene cassettes. Our study significantly enriches the gene pool known to be carried by Treponema chromosomal integrons, totaling 826 (598 97% nonredundant) genes. We characterize the functions of these gene cassettes: many of these genes have unknown functions. The integron gene cassette arrays found in the human microbiome are extraordinarily dynamic, with different microbial communities sharing only a small number of common genes.
PMCID: PMC3416431  PMID: 22635997
6.  The Ecoresponsive Genome of Daphnia pulex 
Science (New York, N.Y.)  2011;331(6017):555-561.
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 Mb and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than 1/3 of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The co-expansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes – including many additional loci within sequenced regions that are otherwise devoid of annotations – are the most responsive genes to ecological challenges.
PMCID: PMC3529199  PMID: 21292972
7.  A framework for human microbiome research 
Methé, Barbara A. | Nelson, Karen E. | Pop, Mihai | Creasy, Heather H. | Giglio, Michelle G. | Huttenhower, Curtis | Gevers, Dirk | Petrosino, Joseph F. | Abubucker, Sahar | Badger, Jonathan H. | Chinwalla, Asif T. | Earl, Ashlee M. | FitzGerald, Michael G. | Fulton, Robert S. | Hallsworth-Pepin, Kymberlie | Lobos, Elizabeth A. | Madupu, Ramana | Magrini, Vincent | Martin, John C. | Mitreva, Makedonka | Muzny, Donna M. | Sodergren, Erica J. | Versalovic, James | Wollam, Aye M. | Worley, Kim C. | Wortman, Jennifer R. | Young, Sarah K. | Zeng, Qiandong | Aagaard, Kjersti M. | Abolude, Olukemi O. | Allen-Vercoe, Emma | Alm, Eric J. | Alvarado, Lucia | Andersen, Gary L. | Anderson, Scott | Appelbaum, Elizabeth | Arachchi, Harindra M. | Armitage, Gary | Arze, Cesar A. | Ayvaz, Tulin | Baker, Carl C. | Begg, Lisa | Belachew, Tsegahiwot | Bhonagiri, Veena | Bihan, Monika | Blaser, Martin J. | Bloom, Toby | Vivien Bonazzi, J. | Brooks, Paul | Buck, Gregory A. | Buhay, Christian J. | Busam, Dana A. | Campbell, Joseph L. | Canon, Shane R. | Cantarel, Brandi L. | Chain, Patrick S. | Chen, I-Min A. | Chen, Lei | Chhibba, Shaila | Chu, Ken | Ciulla, Dawn M. | Clemente, Jose C. | Clifton, Sandra W. | Conlan, Sean | Crabtree, Jonathan | Cutting, Mary A. | Davidovics, Noam J. | Davis, Catherine C. | DeSantis, Todd Z. | Deal, Carolyn | Delehaunty, Kimberley D. | Dewhirst, Floyd E. | Deych, Elena | Ding, Yan | Dooling, David J. | Dugan, Shannon P. | Dunne, Wm. Michael | Durkin, A. Scott | Edgar, Robert C. | Erlich, Rachel L. | Farmer, Candace N. | Farrell, Ruth M. | Faust, Karoline | Feldgarden, Michael | Felix, Victor M. | Fisher, Sheila | Fodor, Anthony A. | Forney, Larry | Foster, Leslie | Di Francesco, Valentina | Friedman, Jonathan | Friedrich, Dennis C. | Fronick, Catrina C. | Fulton, Lucinda L. | Gao, Hongyu | Garcia, Nathalia | Giannoukos, Georgia | Giblin, Christina | Giovanni, Maria Y. | Goldberg, Jonathan M. 
| Goll, Johannes | Gonzalez, Antonio | Griggs, Allison | Gujja, Sharvari | Haas, Brian J. | Hamilton, Holli A. | Harris, Emily L. | Hepburn, Theresa A. | Herter, Brandi | Hoffmann, Diane E. | Holder, Michael E. | Howarth, Clinton | Huang, Katherine H. | Huse, Susan M. | Izard, Jacques | Jansson, Janet K. | Jiang, Huaiyang | Jordan, Catherine | Joshi, Vandita | Katancik, James A. | Keitel, Wendy A. | Kelley, Scott T. | Kells, Cristyn | Kinder-Haake, Susan | King, Nicholas B. | Knight, Rob | Knights, Dan | Kong, Heidi H. | Koren, Omry | Koren, Sergey | Kota, Karthik C. | Kovar, Christie L. | Kyrpides, Nikos C. | La Rosa, Patricio S. | Lee, Sandra L. | Lemon, Katherine P. | Lennon, Niall | Lewis, Cecil M. | Lewis, Lora | Ley, Ruth E. | Li, Kelvin | Liolios, Konstantinos | Liu, Bo | Liu, Yue | Lo, Chien-Chi | Lozupone, Catherine A. | Lunsford, R. Dwayne | Madden, Tessa | Mahurkar, Anup A. | Mannon, Peter J. | Mardis, Elaine R. | Markowitz, Victor M. | Mavrommatis, Konstantinos | McCorrison, Jamison M. | McDonald, Daniel | McEwen, Jean | McGuire, Amy L. | McInnes, Pamela | Mehta, Teena | Mihindukulasuriya, Kathie A. | Miller, Jason R. | Minx, Patrick J. | Newsham, Irene | Nusbaum, Chad | O’Laughlin, Michelle | Orvis, Joshua | Pagani, Ioanna | Palaniappan, Krishna | Patel, Shital M. | Pearson, Matthew | Peterson, Jane | Podar, Mircea | Pohl, Craig | Pollard, Katherine S. | Priest, Margaret E. | Proctor, Lita M. | Qin, Xiang | Raes, Jeroen | Ravel, Jacques | Reid, Jeffrey G. | Rho, Mina | Rhodes, Rosamond | Riehle, Kevin P. | Rivera, Maria C. | Rodriguez-Mueller, Beltran | Rogers, Yu-Hui | Ross, Matthew C. | Russ, Carsten | Sanka, Ravi K. | Pamela Sankar, J. | Sathirapongsasuti, Fah | Schloss, Jeffery A. | Schloss, Patrick D. | Schmidt, Thomas M. | Scholz, Matthew | Schriml, Lynn | Schubert, Alyxandria M. | Segata, Nicola | Segre, Julia A. | Shannon, William D. | Sharp, Richard R. | Sharpton, Thomas J. | Shenoy, Narmada | Sheth, Nihar U. | Simone, Gina A. 
| Singh, Indresh | Smillie, Chris S. | Sobel, Jack D. | Sommer, Daniel D. | Spicer, Paul | Sutton, Granger G. | Sykes, Sean M. | Tabbaa, Diana G. | Thiagarajan, Mathangi | Tomlinson, Chad M. | Torralba, Manolito | Treangen, Todd J. | Truty, Rebecca M. | Vishnivetskaya, Tatiana A. | Walker, Jason | Wang, Lu | Wang, Zhengyuan | Ward, Doyle V. | Warren, Wesley | Watson, Mark A. | Wellington, Christopher | Wetterstrand, Kris A. | White, James R. | Wilczek-Boney, Katarzyna | Wu, Yuan Qing | Wylie, Kristine M. | Wylie, Todd | Yandava, Chandri | Ye, Liang | Ye, Yuzhen | Yooseph, Shibu | Youmans, Bonnie P. | Zhang, Lan | Zhou, Yanjiao | Zhu, Yiming | Zoloth, Laurie | Zucker, Jeremy D. | Birren, Bruce W. | Gibbs, Richard A. | Highlander, Sarah K. | Weinstock, George M. | Wilson, Richard K. | White, Owen
Nature  2012;486(7402):215-221.
A variety of microbial communities and their genes (microbiome) exist throughout the human body, playing fundamental roles in human health and disease. The NIH-funded Human Microbiome Project (HMP) Consortium has established a population-scale framework which catalyzed significant development of metagenomic protocols, resulting in a broad range of quality-controlled resources and data, including standardized methods for creating, processing and interpreting distinct types of high-throughput metagenomic data available to the scientific community. Here we present resources from a population of 242 healthy adults sampled at 15 to 18 body sites up to three times, which, to date, have generated 5,177 microbial taxonomic profiles from 16S rRNA genes and over 3.5 Tb of metagenomic sequence. In parallel, approximately 800 human-associated reference genomes have been sequenced. Collectively, these data represent the largest resource to date describing the abundance and variety of the human microbiome, while providing a platform for current and future studies.
PMCID: PMC3377744  PMID: 22699610
8.  Stitching gene fragments with a network matching algorithm improves gene assembly for metagenomics 
Bioinformatics  2012;28(18):i363-i369.
Motivation: One of the difficulties in metagenomic assembly is that homologous genes from evolutionarily closely related species may behave like repeats and confuse assemblers. As a result, small contigs, each representing a short gene fragment, instead of complete genes, may be reported by an assembler. This further complicates annotation of metagenomic datasets, as annotation tools (such as gene predictors or similarity search tools) typically perform poorly on contigs encoding short gene fragments.
Results: We present a novel way of using the de Bruijn graph assembly of metagenomes to improve the assembly of genes. A network matching algorithm is proposed for matching the de Bruijn graph of contigs against reference genes, to derive ‘gene paths’ in the graph (sequences of contigs containing gene fragments) that have the highest similarities to known genes, allowing gene fragments contained in multiple contigs to be connected to form more complete (or intact) genes. Tests on simulated and real datasets show that our approach (called GeneStitch) is able to significantly improve the assembly of genes from metagenomic sequences, by connecting contigs with the guidance of homologous genes—information that is orthogonal to the sequencing reads. We note that the improvement of gene assembly can be observed even when only distantly related genes are available as the reference. We further propose to use ‘gene graphs’ to represent the assembly of reads from homologous genes and discuss potential applications of gene graphs to improving functional annotation for metagenomics.
Availability: The tools are available as open source for download at
PMCID: PMC3436815  PMID: 22962453
9.  Diverse CRISPRs Evolving in Human Microbiomes 
PLoS Genetics  2012;8(6):e1002441.
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci, together with cas (CRISPR–associated) genes, form the CRISPR/Cas adaptive immune system, a primary defense strategy that eubacteria and archaea mobilize against foreign nucleic acids, including phages and conjugative plasmids. Short spacer sequences separated by the repeats are derived from foreign DNA and direct interference to future infections. The availability of hundreds of shotgun metagenomic datasets from the Human Microbiome Project (HMP) enables us to explore the distribution and diversity of known CRISPRs in human-associated microbial communities and to discover new CRISPRs. We propose a targeted assembly strategy to reconstruct CRISPR arrays, which whole-metagenome assemblies fail to identify. For each known CRISPR type (identified from reference genomes), we use its direct repeat consensus sequence to recruit reads from each HMP dataset and then assemble the recruited reads into CRISPR loci; the unique spacer sequences can then be extracted for analysis. We also identified novel CRISPRs or new CRISPR variants in contigs from whole-metagenome assemblies and used targeted assembly to more comprehensively identify these CRISPRs across samples. We observed that the distributions of CRISPRs (including 64 known and 86 novel ones) are largely body-site specific. We provide detailed analysis of several CRISPR loci, including novel CRISPRs. For example, known streptococcal CRISPRs were identified in most oral microbiomes, totaling ∼8,000 unique spacers: samples resampled from the same individual and oral site shared the most spacers; different oral sites from the same individual shared significantly fewer, while different individuals had almost no common spacers, indicating the impact of subtle niche differences on the evolution of CRISPR defenses. We further demonstrate potential applications of CRISPRs to the tracing of rare species and the virus exposure of individuals. 
This work indicates the importance of effective identification and characterization of CRISPR loci to the study of the dynamic ecology of microbiomes.
Author Summary
Human bodies are complex ecological systems in which various microbial organisms and viruses interact with each other and with the human host. The Human Microbiome Project (HMP) has resulted in >700 datasets of shotgun metagenomic sequences, from which we can learn about the compositions and functions of human-associated microbial communities. CRISPR/Cas systems are a widespread class of adaptive immune systems in bacteria and archaea, providing acquired immunity against foreign nucleic acids: CRISPR/Cas defense pathways involve integration of viral- or plasmid-derived DNA segments into CRISPR arrays (forming spacers between repeated structural sequences), and expression of short crRNAs from these single repeat-spacer units, to generate interference to future invading foreign genomes. Powered by an effective computational approach (the targeted assembly approach for CRISPR), our analysis of CRISPR arrays in the HMP datasets provides the very first global view of bacterial immunity systems in human-associated microbial communities. The great diversity of CRISPR spacers we observed among different body sites, in different individuals, and in single individuals over time, indicates the impact of subtle niche differences on the evolution of CRISPR defenses and indicates the key role of bacteriophage (and plasmids) in shaping human microbial communities.
PMCID: PMC3374615  PMID: 22719260
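The recruit-then-extract idea described in the abstract of entry 9 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes exact matches to the repeat consensus, skips the assembly step entirely (spacers are taken directly from single reads), and ignores reverse-complement reads. The function names and the example repeat are hypothetical.

```python
def recruit_reads(reads, repeat):
    """Keep only reads that contain the direct-repeat consensus (exact match assumed)."""
    return [read for read in reads if repeat in read]


def extract_spacers(read, repeat):
    """Return the segments of a read that are flanked by a repeat on both sides."""
    parts = read.split(repeat)
    # parts[0] precedes the first repeat and parts[-1] follows the last one,
    # so only the interior pieces are complete repeat-spacer-repeat units.
    return [segment for segment in parts[1:-1] if segment]


def unique_spacers(reads, repeat):
    """Collect the set of distinct spacers seen across all recruited reads."""
    spacers = set()
    for read in recruit_reads(reads, repeat):
        spacers.update(extract_spacers(read, repeat))
    return spacers


if __name__ == "__main__":
    repeat = "GTTTTAGA"  # hypothetical repeat consensus, for illustration only
    reads = [
        "AC" + repeat + "AAACCCGGG" + repeat + "TTTGGGCCC" + repeat + "TT",
        "ACGTACGTACGT",                      # no repeat: not recruited
        repeat + "AAACCCGGG" + repeat,       # spacer already seen above
    ]
    print(sorted(unique_spacers(reads, repeat)))
```

A realistic pipeline would tolerate mismatches in the repeat and assemble recruited reads into longer arrays before extracting spacers, as the abstract describes.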
10.  Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets 
BMC Bioinformatics  2012;13(Suppl 2):S9.
Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets.
Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than do traditional multiple sequence alignment (MSA) approaches. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters.
This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS.
Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
PMCID: PMC3305784  PMID: 22536872
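As an illustration of the pairwise step in entry 10, the following is a minimal sketch of Needleman-Wunsch global alignment used to derive a distance between two sequences. The scoring values (match 1, mismatch -1, gap -1) and the distance definition (fraction of aligned columns that differ) are assumptions chosen for the example; the abstract does not state the parameters the authors used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def pairwise_distance(a, b):
    """Fraction of aligned columns (gaps included) where the sequences differ."""
    _, x, y = needleman_wunsch(a, b)
    if not x:
        return 0.0
    return sum(1 for p, q in zip(x, y) if p != q) / len(x)


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(pairwise_distance("ACGT", "AGT"))
```

Each pair of sequences is aligned independently of every other pair, which is why the abstract can describe the distance calculation as pleasingly parallel; the resulting distance matrix is the kind of input that MDS operates on.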
11.  FragGeneScan: predicting genes in short and error-prone reads 
Nucleic Acids Research  2010;38(20):e191.
Advances in next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan was comparable to Glimmer and MetaGene for complete genomes. But for short reads, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence databases.
PMCID: PMC2978382  PMID: 20805240
12.  LTR retroelements in the genome of Daphnia pulex 
BMC Genomics  2010;11:425.
Long terminal repeat (LTR) retroelements represent a successful group of transposable elements (TEs) that have played an important role in shaping the structure of many eukaryotic genomes. Here, we present a genome-wide analysis of LTR retroelements in Daphnia pulex, a cyclical parthenogen and the first crustacean for which the whole genomic sequence is available. In addition, we analyze transcriptional data and perform transposon display assays of lab-reared lineages and natural isolates to identify potential influences on TE mobility and differences in LTR retroelement loads among individuals reproducing with and without sex.
We conducted a comprehensive de novo search for LTR retroelements and identified 333 intact LTR retroelements representing 142 families in the D. pulex genome. While nearly half of the identified LTR retroelements belong to the gypsy group, we also found copia (95), BEL/Pao (66) and DIRS (19) retroelements. Phylogenetic analysis of reverse transcriptase sequences showed that LTR retroelements in the D. pulex genome form many lineages distinct from known families, suggesting that the majority are novel. Our investigation of transcriptional activity of LTR retroelements using tiling array data obtained from three different experimental conditions found that 71 LTR retroelements are actively transcribed. Transposon display assays of mutation-accumulation lines showed evidence for putative somatic insertions for two DIRS retroelement families. Losses of presumably heterozygous insertions were observed in lineages in which selfing occurred, but never in asexuals, highlighting the potential impact of reproductive mode on TE abundance and distribution over time. The same two families were also assayed across natural isolates (both cyclical parthenogens and obligate asexuals) and there were more retroelements in populations capable of reproducing sexually for one of the two families assayed.
Given the importance of LTR retroelement activity in the evolution of other genomes, this comprehensive survey provides insight into the potential impact of LTR retroelements on the genome of D. pulex, a cyclically parthenogenetic microcrustacean that has served as an ecological model for over a century.
PMCID: PMC2996953  PMID: 20618961
13.  MGEScan-non-LTR: computational identification and classification of autonomous non-LTR retrotransposons in eukaryotic genomes 
Nucleic Acids Research  2009;37(21):e143.
Computational methods for genome-wide identification of mobile genetic elements (MGEs) have become increasingly necessary for both genome annotation and evolutionary studies. Non-long terminal repeat (non-LTR) retrotransposons are a class of MGEs that have been found in most eukaryotic genomes, sometimes in extremely high numbers. In this article, we present a computational tool, MGEScan-non-LTR, for the identification of non-LTR retrotransposons in genomic sequences, following a computational approach inspired by a generalized hidden Markov model (GHMM). Three different states represent two different protein domains and inter-domain linker regions encoded in the non-LTR retrotransposons, and their scores are evaluated by using profile hidden Markov models (for protein domains) and Gaussian Bayes classifiers (for linker regions), respectively. In order to classify the non-LTR retrotransposons into one of the 12 previously characterized clades using the same model, we defined separate states for different clades. MGEScan-non-LTR was tested on the genome sequences of four eukaryotic organisms, Drosophila melanogaster, Daphnia pulex, Ciona intestinalis and Strongylocentrotus purpuratus. For the D. melanogaster genome, MGEScan-non-LTR found all known ‘full-length’ elements and simultaneously classified them into the clades CR1, I, Jockey, LOA and R1. Notably, for the D. pulex genome, in which no non-LTR retrotransposon has been annotated, MGEScan-non-LTR found a significantly larger number of elements than did RepeatMasker, using the current version of the RepBase Update library. We also identified novel elements in the other two genomes, which have only been partially studied for non-LTR retrotransposons.
PMCID: PMC2790886  PMID: 19762481
14.  Independent Mammalian Genome Contractions Following the KT Boundary 
Although it is generally accepted that major changes in the earth's history are significant drivers of phylogenetic diversification and extinction, such episodes may also have long-lasting effects on genomic architecture. Here we show that widespread reductions in genome size have occurred in multiple lineages of mammals subsequent to the Cretaceous–Tertiary (KT) boundary, whereas there is no evidence for such changes in other vertebrate, invertebrate, or land plant lineages. Although the mechanisms remain unclear, such shifts in mammalian genome evolution may be a consequence of an increase in the efficiency of selection against excess DNA resulting from post-KT population size expansions. Independent historical changes in genome architecture in diverse lineages raise a significant challenge to the idea that genome size is finely tuned to achieve adaptive phenotypic modifications and suggest that attempts to use phylogenetic analysis to infer ancestral genome sizes may be problematical.
PMCID: PMC2817402  PMID: 20333172
genome evolution; genome size; KT boundary; mammalian evolution; mobile elements; pseudogenes; retrotransposons
15.  De novo identification of LTR retrotransposons in eukaryotic genomes 
BMC Genomics  2007;8:90.
LTR retrotransposons are a class of mobile genetic elements containing two similar long terminal repeats (LTRs). Currently, LTR retrotransposons are annotated in eukaryotic genomes mainly through the conventional homology searching approach. Hence, it is limited to annotating known elements.
In this paper, we report a de novo computational method that can identify new LTR retrotransposons without relying on a library of known elements. Specifically, our method identifies intact LTR retrotransposons by using an approximate string matching technique and protein domain analysis. In addition, it identifies partially deleted or solo LTRs using profile Hidden Markov Models (pHMMs). As a result, this method can de novo identify all types of LTR retrotransposons. We tested this method on two pairs of eukaryotic genomes, C. elegans vs. C. briggsae and D. melanogaster vs. D. pseudoobscura. LTR retrotransposons in C. elegans and D. melanogaster have been intensively studied using conventional annotation methods. Compared with previous work, we identified new intact LTR retroelements and new putative families, which implies that there may still be retroelements left to be discovered even in well-studied organisms. To assess the sensitivity and accuracy of our method, we compared our results with a previously published method, LTR_STRUC, which predominantly identifies full-length LTR retrotransposons. In summary, both methods identified a comparable number of intact LTR retroelements. But our method can identify nearly all known elements in C. elegans, while LTR_STRUC missed about 1/3 of them. Our method also identified more known LTR retroelements than LTR_STRUC in the D. melanogaster genome. We also identified some LTR retroelements in the other two genomes, C. briggsae and D. pseudoobscura, which have not been completely finished. In contrast, the conventional method failed to identify those elements. Finally, the phylogenetic and chromosomal distributions of the identified elements are discussed.
We report a novel method for de novo identification of LTR retrotransposons in eukaryotic genomes with favorable performance over the existing methods.
PMCID: PMC1858694  PMID: 17407597