Although mothur is fully capable of analyzing traditional clone-based sequences, here we demonstrate the ability of mothur to efficiently analyze a pyrosequencing data set. Sogin and colleagues, in a seminal 2006 study that outlined the use of pyrosequencing in microbial ecology studies, obtained 216,243 high-quality sequence reads from the V6 region of the 16S rRNA gene from eight samples (27
). They obtained six-paired samples from the meso- and bathypelagic realms from three sites in the North Atlantic Deep Water loop and two samples from diffuse hydrothermal vent fluids near the site of an eruption in the Axial Seamount in the northeast Pacific Ocean (Fig. ). Their analysis primarily considered their inability to exhaustively sample the biodiversity of sites in spite of record sequencing depths. The sequence data were obtained from http://jbpc.mbl.edu/research_supplements/g454/20060412-private/
, and we used the 2 February 2008 version of the data set. These data differ from those described in the original publication because the data processing algorithms internal to the GS20 machine were updated; therefore, it is not possible to make a direct comparison to the findings of the original analysis. Although these data were already trimmed and sorted into individual files for each sample, mothur has the capacity to generate these files from the FASTA-formatted sequence file generated by a sequencer. Furthermore, mothur has a number of functions for performing hypothesis tests, but here we will focus on operational taxonomic unit (OTU)-based methods of describing and comparing communities.
FIG. 1. Description and comparison of the eight samples analyzed by Sogin et al. (27). The dendrogram to the left represents the similarity of the samples based on the membership-based Jaccard coefficient calculated using Chao1 estimated richness values. The (more ...)
mothur makes several improvements that allow users with modest computing resources to analyze large data sets. Most significant are the ability to analyze only the unique sequences in a data set but retain information about the number of times that each sequence was observed and the use of sparse matrices that represent only distances smaller than a user-specified cutoff. Using a PHYLIP-based approach would have required approximately 145 GB to represent 2.3 × 1010 distances. Our improvements resulted in an 18.9-MB file containing 5.2 × 105 pairwise distances that were smaller than 0.10. The only mothur-imposed limit is the number of distances that can be processed, which is 264. The more likely limitation will be the amount of random-access memory (RAM) available on the user's computer. With the reduced memory requirement also comes significantly improved processing speed. Considering that most computers have multiple processors, users can obtain further increases in speed by utilizing the parallelization features provided in the alignment and distance calculation commands.
mothur can cluster sequences using the furthest neighbor, nearest neighbor, or UPGMA (unweighted-pair group method using average linkages) algorithms (22
). The ability to let the data speak for themselves in determining OTUs is advantageous compared to database-based approaches that can form clusters, in which sequences are similar to the same database sequences but not to each other. Furthermore, mothur uses the approach employed in DOTUR where OTUs are defined for multiple cutoffs up to the distance threshold so that alternative OTU definitions can be compared. For example, using the furthest neighbor algorithm, we clustered sequences into OTUs up to a distance threshold of 0.10 and observed 13,202, 11,317, and 7,971 OTUs at cutoffs of 0.03, 0.05, and 0.10 distance units, respectively. A similar type of analysis using the approach used in programs such as CD-HIT would limit the user to a nearest neighbor-based approach, and the users would need to run the program for each distance level in which they were interested (10
By inputting a file that maps each sequence to a sample identifier, the clusters could be parsed to perform α-diversity analyses. First, we calculated the richness and diversity of the eight samples at OTU cutoffs of 0.03, 0.05, and 0.10 distance units by using the number of observed OTUs, Chao1 estimated minimum number of OTUs, and a nonparametric Shannon diversity index (Table ). Second, we calculated rarefaction curves for the eight samples for a 0.10 distance cutoff (Fig. ); the original Sogin analysis built rarefaction curves using frequencies acquired from a database-based OTU assignment analysis. Interestingly, mothur calculated the coverage of these samples to be between 0.94 and 0.98, and yet the rarefaction curves continued to climb with increasing sequencing effort. These types of analysis were the extent of the α-diversity measurements performed in the original Sogin analysis, and each sample required up to 4 days to complete on a Quad Opteron 875 2.2-GHz series Dual Core machine with 28 GB of RAM (S. Huse, personal communication). The analysis described in this paper—from aligning of sequences through β-diversity analyses—required less than 2 h with use of a MacBook Pro laptop with 2 GB RAM and with only one of the 2.0-GHz dual processors.
TABLE 2. Measures of α diversity for the samples characterized by Sogin et al. (27) for three OTU definitionsa
FIG. 2. Rarefaction curves describing the dependence of discovering novel OTUs as a function of sampling effort for OTUs defined at a 0.10 distance cutoff. The curves for FS312 and FS396 climb to 3,095 and 2,804 OTUs after sampling of 54,894 and 80,769 sequences, (more ...)
Due to software limitations, it was not possible to assess the β diversity of the samples in the original Sogin analysis. With the software improvements implemented in mothur, we were able to transform the original OTU information into heat maps, Venn diagrams, and dendrograms (Fig. ) to describe the similarities in membership and structure of the eight samples. Several interesting observations can be made from this analysis. First, although the dendrograms generated using the Jaccard coefficient and the θYC community structure similarity coefficient have similar topologies, the terminal branch lengths of the Jaccard coefficient dendrogram are considerably longer for samples 53R, 55R, 115R, and 137. This is interesting because it indicates that while these samples have considerably different memberships (Jaccard), the relative abundances of the shared OTUs are similar. Thus, the differences between the communities are likely found in the rarer OTUs. Second, the two diffuse hydrothermal flow samples clearly cluster away from the others. This is intuitive because of the considerable differences in temperature and chemistry. Third, the only available piece of metadata that explains the clustering of the seawater samples is extreme depth; the deepest sample, 112R, clearly clusters away from the other seawater samples and was taken 2,411 m deeper than was any of the other samples. Considering that this was the only sample taken at such an extreme depth, additional sampling is required in order to have confidence in such a correlation.