1.  Estimating population diversity with CatchAll 
Bioinformatics  2012;28(7):1045-1047.
Motivation: The massive data produced by next-generation sequencing require advanced statistical tools. We address estimating the total diversity or species richness in a population. To date, only relatively simple methods have been implemented in available software. There is a need for software employing modern, computationally intensive statistical analyses including error, goodness-of-fit and robustness assessments.
Results: We present CatchAll, a fast, easy-to-use, platform-independent program that computes maximum likelihood estimates for finite-mixture models, weighted linear regression-based analyses and coverage-based non-parametric methods, along with outlier diagnostics. Given sample ‘frequency count’ data, CatchAll computes 12 different diversity estimates and applies a model-selection algorithm. CatchAll also derives discounted diversity estimates to adjust for possibly uncertain low-frequency counts. It is accompanied by an Excel-based graphics program.
Availability: Free executable downloads for Linux, Windows and Mac OS, with manual and source code, at
PMCID: PMC3315724  PMID: 22333246
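As a rough illustration of the 'frequency count' input described in the abstract above, the sketch below computes two classical non-parametric quantities from such data: the Chao1 richness estimate and Good's sample coverage. It is not CatchAll's implementation; the function names and the example counts are hypothetical, and the estimators are shown only because they are standard textbook formulas.

```python
def chao1(freq_counts):
    """Chao1 richness estimate.

    freq_counts maps j -> number of taxa observed exactly j times
    (hypothetical input layout chosen for this sketch).
    """
    s_obs = sum(freq_counts.values())      # taxa actually observed
    f1 = freq_counts.get(1, 0)             # singletons
    f2 = freq_counts.get(2, 0)             # doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2


def good_coverage(freq_counts):
    """Good's estimate of sample coverage: 1 - f1 / n."""
    n = sum(j * f for j, f in freq_counts.items())  # total individuals sampled
    if n == 0:
        raise ValueError("empty sample")
    return 1 - freq_counts.get(1, 0) / n


# Made-up data: 10 taxa seen once, 5 seen twice, 2 seen three times, 1 seen seven times.
example = {1: 10, 2: 5, 3: 2, 7: 1}
print(chao1(example))          # 18 observed taxa + 100 / 10 = 28.0
print(good_coverage(example))  # 1 - 10 / 33
```

Both functions depend heavily on the singleton count, which is why uncertain low-frequency counts, as the abstract notes, can move a richness estimate substantially.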
2.  Estimation of viral richness from shotgun metagenomes using a frequency count approach 
Microbiome  2013;1:5.
Viruses are important drivers of ecosystem functions, yet little is known about the vast majority of viruses. Viral shotgun metagenomics enables the investigation of broad ecological questions in phage communities. One ecological characteristic is species richness, which is the number of different species in a community. Viruses do not have a phylogenetic marker analogous to the bacterial 16S rRNA gene with which to estimate richness, and so contig spectra are employed to measure the number of virus taxa in a given community. A contig spectrum is generated from a viral shotgun metagenome by assembling the random sequence reads into groups of sequences that overlap (contigs) and counting the number of sequences that group within each contig. Current tools available to analyze contig spectra to estimate phage richness are limited by relying on rank-abundance data.
We present statistical estimates of virus richness from contig spectra. The program CatchAll was used to analyze contig spectra in terms of frequency count data rather than rank-abundance, thus enabling formal statistical analyses. Also, the influence of potentially spurious low-frequency counts on richness estimates was minimized by two methods, empirical and statistical. The results show greater estimates of viral richness than previous calculations in nearly all environments analyzed, including swine feces and reclaimed fresh water.
CatchAll yielded consistent estimates of richness across viral metagenomes from the same or similar environments. Additionally, analysis of pooled viral metagenomes from different environments via mixed contig spectra resulted in greater richness estimates than those of the component metagenomes. Using CatchAll to analyze contig spectra will improve estimations of richness from viral shotgun metagenomes, particularly from large datasets, by providing statistical measures of richness.
PMCID: PMC3869190  PMID: 24451229
Phage; Metagenomics; Virome; Ecology; Richness; CatchAll; Singleton
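The contig spectrum described in the abstract above can be sketched as a two-step count: reads per contig, then contigs per size. This is a minimal sketch under assumptions: the read-to-contig mapping, the identifiers, and the convention that an unassembled read forms its own contig of size 1 are all hypothetical, and no particular assembler output format is implied.

```python
from collections import Counter


def contig_spectrum(read_to_contig):
    """Return {j: number of contigs made of exactly j reads}.

    read_to_contig maps a read ID to the contig it was assembled into.
    Unassembled reads are assumed to carry their own unique contig ID,
    so they show up as contigs of size 1.
    """
    reads_per_contig = Counter(read_to_contig.values())
    spectrum = Counter(reads_per_contig.values())
    return dict(sorted(spectrum.items()))


# Hypothetical assembly of six reads into three contigs.
assembly = {
    "r1": "c1", "r2": "c1", "r3": "c1",  # contig of 3 reads
    "r4": "c2",                          # singleton
    "r5": "c3", "r6": "c3",              # contig of 2 reads
}
print(contig_spectrum(assembly))  # {1: 1, 2: 1, 3: 1}
```

The resulting dictionary has the same shape as frequency count data (size j, number of units of that size), which is what allows the spectrum to be passed to frequency-count richness estimators.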
3.  Protistan microbial observatory in the Cariaco Basin, Caribbean. I. Pyrosequencing vs Sanger insights into species richness 
The ISME Journal  2011;5(8):1344-1356.
Microbial diversity and distribution are topics of intensive research. In two companion papers in this issue, we describe the results of the Cariaco Microbial Observatory (Caribbean Sea, Venezuela). The Basin contains the largest body of marine anoxic water, and presents an opportunity to study protistan communities across biogeochemical gradients. In the first paper, we survey 18S ribosomal RNA (rRNA) gene sequence diversity using both Sanger- and pyrosequencing-based approaches, employing multiple PCR primers, and state-of-the-art statistical analyses to estimate microbial richness missed by the survey. Sampling the Basin at three stations, in two seasons, and at four depths with distinct biogeochemical regimes, we obtained the largest, and arguably the least biased collection of over 6000 nearly full-length protistan rRNA gene sequences from a given oceanographic regime to date, and over 80 000 pyrosequencing tags. These represent all major and many minor protistan taxa, at frequencies globally similar between the two sequence collections. This large data set provided, via the recently developed parametric modeling, the first statistically sound prediction of the total size of protistan richness in a large and varied environment, such as the Cariaco Basin: over 36 000 species, defined as almost full-length 18S rRNA gene sequence clusters sharing over 99% sequence homology. This richness is a small fraction of the grand total of known protists (over 100 000–500 000 species), suggesting a degree of protistan endemism.
PMCID: PMC3146274  PMID: 21390079
protists; diversity; species richness; anoxic; pyrosequencing; 18S rRNA approach
4.  Protistan microbial observatory in the Cariaco Basin, Caribbean. II. Habitat specialization 
The ISME Journal  2011;5(8):1357-1373.
This is the second paper in a series of three that investigates eukaryotic microbial diversity and taxon distribution in the Cariaco Basin, Venezuela, the ocean's largest anoxic marine basin. Here, we use phylogenetic information, multivariate community analyses and statistical richness predictions to test whether protists exhibit habitat specialization within defined geochemical layers of the water column. We also analyze spatio-temporal distributions of protists across two seasons and two geographic sites within the basin. Non-metric multidimensional scaling indicates that these two basin sites are inhabited by distinct protistan assemblages, an observation that is supported by the minimal overlap in observed and predicted richness of sampled sites. A comparison of parametric richness estimations indicates that protistan communities in closely spaced—but geochemically different—habitats are very dissimilar, and may share as few as 5% of total operational taxonomic units (OTUs). This is supported by a canonical correspondence analysis, indicating that the empirically observed OTUs are organized along opposing gradients in oxidants and reductants. Our phylogenetic analyses identify many new clades at species to class levels, some of which appear restricted to specific layers of the water column and have a significantly nonrandom distribution. These findings suggest many pelagic protists are restricted to specific habitats, and likely diversify, at least in part due to separation by geochemical barriers.
PMCID: PMC3146276  PMID: 21390077
protists; diversity; anoxic; 18S rRNA approach
5.  Measuring the microbiome: perspectives on advances in DNA-based techniques for exploring microbial life 
Briefings in Bioinformatics  2012;13(4):420-429.
This article reviews recent advances in ‘microbiome studies’: molecular, statistical and graphical techniques to explore and quantify how microbial organisms affect our environments and ourselves given recent increases in sequencing technology. Microbiome studies are moving beyond mere inventories of specific ecosystems to quantifications of community diversity and descriptions of their ecological function. We review the last 24 months of progress in this sort of research, and anticipate where the next 2 years will take us. We hope that bioinformaticians will find this a helpful springboard for new collaborations with microbiologists.
PMCID: PMC3404397  PMID: 22308073
microbial ecology; biodiversity; metagenomics; next generation sequencing; microbiome; visual analytics
6.  Sequence diversity and novelty of natural assemblages of picoeukaryotes from the Indian Ocean 
The ISME Journal  2010;5(2):184-195.
Despite the ecological importance of marine pico-size eukaryotes, the study of their in situ diversity using molecular tools started just a few years ago. These studies have revealed that marine picoeukaryotes are very diverse and include many novel taxa. However, the amount and structure of their phylogenetic diversity and the extent of their sequence novelty remain poorly known, as a systematic analysis has seldom been attempted. In this study, we use a coherent and carefully curated data set of 500 published 18S ribosomal DNA sequences to quantify the diversity and novelty patterns of picoeukaryotes in the Indian Ocean. Our phylogenetic tree showed many distant lineages. We grouped sequences in OTUs (operational taxonomic units) at discrete values delineated by pair-wise Jukes–Cantor (JC) distances and tree patristic distances. At a distance of 0.01, the number of OTUs observed (237/242; using JC or patristic distances, respectively) was half the number of sequences analyzed, indicating the existence of microdiverse clusters of highly related sequences. At this distance level, we estimated 600–800 OTUs using several statistical methods. The number of OTUs observed was still substantial at higher distances (39/82 at 0.20 distance), suggesting a large diversity at high taxonomic ranks. Most sequences were related to marine clones from other sites and many were distant from cultured organisms, highlighting the huge culturing gap within protists. The novelty analysis indicated the putative presence of pseudogenes and of truly novel high-rank phylogenetic lineages. The identified diversity and novelty patterns among marine picoeukaryotes are of great importance for understanding and interpreting their ecology and evolution.
PMCID: PMC3105688  PMID: 20631807
diversity; genetic distances; microdiversity; novelty; OTUs; picoeukaryotes
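The pair-wise Jukes–Cantor distance mentioned in the abstract above follows the standard formula d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites between two aligned sequences. The sketch below is only an illustration of that formula: the function name, the gap handling (columns with a gap in either sequence are skipped), and the example sequences are assumptions, not details taken from the study.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption for this sketch: ignore columns where either sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


# One mismatch in ten sites: p = 0.1, distance slightly above 0.1.
d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
print(round(d, 4))  # 0.1073
# Under a 0.01 threshold these two sequences would fall in different OTUs.
print(d <= 0.01)    # False
```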
7.  A bacterial artificial chromosome library for the Australian saltwater crocodile (Crocodylus porosus) and its utilization in gene isolation and genome characterization 
BMC Genomics  2009;10(Suppl 2):S9.
Crocodilians (Order Crocodylia) are an ancient vertebrate group of tremendous ecological, social, and evolutionary importance. They are the only extant reptilian members of Archosauria, a monophyletic group that also includes birds, dinosaurs, and pterosaurs. Consequently, crocodilian genomes represent a gateway through which the molecular evolution of avian lineages can be explored. To facilitate comparative genomics within Crocodylia and between crocodilians and other archosaurs, we have constructed a bacterial artificial chromosome (BAC) library for the Australian saltwater crocodile, Crocodylus porosus. This is the first BAC library for a crocodile and only the second BAC resource for a crocodilian.
The C. porosus BAC library consists of 101,760 individually archived clones stored in 384-well microtiter plates. NotI digestion of random clones indicates an average insert size of 102 kb. Based on a genome size estimate of 2778 Mb, the library affords 3.7 fold (3.7×) coverage of the C. porosus genome. To investigate the utility of the library in studying sequence distribution, probes derived from CR1a and CR1b, two crocodilian CR1-like retrotransposon subfamilies, were hybridized to C. porosus macroarrays. The results indicate that there are a minimum of 20,000 CR1a/b elements in C. porosus and that their distribution throughout the genome is decidedly non-random. To demonstrate the utility of the library in gene isolation, we probed the C. porosus macroarrays with an overgo designed from a C-mos (oocyte maturation factor) partial cDNA. A BAC containing C-mos was identified and the C-mos locus was sequenced. Nucleotide and amino acid sequence alignment of the C. porosus C-mos coding sequence with avian and reptilian C-mos orthologs reveals greater sequence similarity between C. porosus and birds (specifically chicken and zebra finch) than between C. porosus and squamates (green anole).
We have demonstrated the utility of the Crocodylus porosus BAC library as a tool in genomics research. The BAC library should expedite complete genome sequencing of C. porosus and facilitate detailed analysis of genome evolution within Crocodylia and between crocodilians and diverse amniote lineages including birds, mammals, and other non-avian reptiles.
PMCID: PMC2966330  PMID: 19607660
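The 3.7× coverage figure quoted in the abstract above can be checked from the other numbers it reports (101,760 clones, 102 kb average insert, 2778 Mb genome). The helper below is a hypothetical sketch of that arithmetic and assumes 1 Mb = 1000 kb.

```python
def library_coverage(n_clones, mean_insert_kb, genome_size_mb):
    """Fold coverage of a clone library: total cloned DNA / genome size."""
    total_insert_mb = n_clones * mean_insert_kb / 1000  # kb -> Mb
    return total_insert_mb / genome_size_mb


coverage = library_coverage(101_760, 102, 2778)
print(round(coverage, 1))  # 3.7
```

The total cloned DNA is about 10,380 Mb, which divided by 2778 Mb gives roughly 3.74, consistent with the reported 3.7-fold coverage.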
8.  Environmental rRNA inventories miss over half of protistan diversity 
BMC Microbiology  2008;8:222.
The main tool to discover novel microbial eukaryotes is the rRNA approach. This approach has important biases, including PCR discrimination against certain rRNA gene species, which makes molecular inventories skewed relative to the source communities. The degree of this bias has not been quantified, and it remains unclear whether species missed from clone libraries could be recovered by increasing sequencing efforts, or whether they cannot be detected in principle. Here we attempt to discriminate between these possibilities by statistically analysing four protistan inventories obtained using different general eukaryotic PCR primers.
We show that each PCR primer set-specific clone library is not a sample from the community diversity but rather from a fraction of this diversity. Therefore, even sequencing such clone libraries to saturation would only recover that fraction, which, according to the parametric models, varies from 17 ± 4% to 49 ± 10%, depending on the set of primers. The pooled data is thus qualitatively richer than individual libraries, even if normalized to the same sequencing effort.
The use of a single pair of primers leads to significant underestimation of the true community richness at all levels of taxonomic hierarchy. The majority of available protistan rRNA gene surveys likely sampled less than half of the target diversity, and might have completely missed the rest. The use of multiple PCR primers reduces this bias but does not necessarily eliminate it.
PMCID: PMC2625359  PMID: 19087295
9.  Protistan Diversity in the Arctic: A Case of Paleoclimate Shaping Modern Biodiversity? 
PLoS ONE  2007;2(8):e728.
The impact of climate on biodiversity is indisputable. Climate changes over geological time must have significantly influenced the evolution of biodiversity, ultimately leading to its present pattern. Here we consider the paleoclimate data record, inferring that present-day hot and cold environments should contain, respectively, the largest and the smallest diversity of ancestral lineages of microbial eukaryotes.
Methodology/Principal Findings
We investigate this hypothesis by analyzing an original dataset of 18S rRNA gene sequences from Western Greenland in the Arctic, and data from the existing literature on 18S rRNA gene diversity in hydrothermal vent, temperate sediments, and anoxic water column communities. Unexpectedly, the community from the cold environment emerged as one of the richest observed to date in protistan species, and most diverse in ancestral lineages.
This pattern is consistent with natural selection sweeps on aerobic non-psychrophilic microbial eukaryotes repeatedly caused by low temperatures and global anoxia of snowball Earth conditions. It implies that cold refuges persisted through the periods of greenhouse conditions, which agrees with some, although not all, current views on the extent of the past global cooling and warming events. We therefore identify cold environments as promising targets for microbial discovery.
PMCID: PMC1940325  PMID: 17710128
10.  Microeukaryote Community Patterns along an O2/H2S Gradient in a Supersulfidic Anoxic Fjord (Framvaren, Norway) 
To resolve the fine-scale architecture of anoxic protistan communities, we conducted a cultivation-independent 18S rRNA survey in the superanoxic Framvaren Fjord in Norway. We generated three clone libraries along the steep O2/H2S gradient, using the multiple-primer approach. Of 1,100 clones analyzed, 753 proved to be high-quality protistan target sequences. These sequences were grouped into 92 phylotypes, which displayed high protistan diversity in the fjord (17 major eukaryotic phyla). Only a few were closely related to known taxa. Several sequences were dissimilar to all previously described sequences and occupied a basal position in the inferred phylogenies, suggesting that the sequences recovered were derived from novel, deeply divergent eukaryotes. We detected sequence clades with evolutionary importance (for example, clades in the euglenozoa) and clades that seem to be specifically adapted to anoxic environments, challenging the hypothesis that the global dispersal of protists is uniform. Moreover, with the detection of clones affiliated with jakobid flagellates, we present evidence that primitive descendants of early eukaryotes are present in this anoxic environment. To estimate sample coverage and phylotype richness, we used parametric and nonparametric statistical methods. The results show that although our data set is one of the largest published inventories, our sample missed a substantial proportion of the protistan diversity. Nevertheless, statistical and phylogenetic analyses of the three libraries revealed the fine-scale architecture of anoxic protistan communities, which may exhibit adaptation to different environmental conditions along the O2/H2S gradient.
PMCID: PMC1472314  PMID: 16672511
