The main pipeline is divided into five major steps: (i) quality trimming and filtering of raw reads, (ii) optional mapping to remove, extract, and/or quantify reads matching a reference database, (iii) assembly, (iv) assembly revision, and (v) gene prediction. Statistics from each step are summarized in multi-sheet Excel documents as well as in queryable SQLite databases. Full details of the output files and statistics produced in each processing step are given in Table S7.
The MOCAT data processing pipeline.
The individual processing steps in MOCAT were benchmarked using three different data sets: 124 published human gut metagenomic samples, a mock community produced by the Human Microbiome Project (HMP) with 22 species from 19 genera, and a simulated metagenome with 100 strains from 85 species. By using this combination of host-associated, artificial, and simulated metagenomes with different taxonomic resolution, we show that MOCAT can efficiently process a variety of metagenomic samples, ranging in size (0.5–16.6 Gbp), origin, and composition, owing to new developments in each of the five major steps.
i) Quality Trimming and Filtering of Raw Reads
Read quality trimming and filtering can greatly improve the length and accuracy of metagenomic assemblies. Therefore, in the first processing step, raw reads below specified quality and length cutoffs are trimmed or removed using either the FastX program (http://hannonlab.cshl.edu/fastx_toolkit/) or the DynamicTrim algorithm in the SolexaQA package. FastX removes bases below a user-defined quality threshold from the 3′ end, whereas DynamicTrim keeps the longest contiguous read segment in which all quality scores are above the user-defined threshold. After quality trimming and filtering of the three test data sets, 57–79% of the reads remained as high-quality reads (Table S1).
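The longest-contiguous-segment rule used by DynamicTrim can be illustrated with a short sketch (a simplified illustration of the idea, not SolexaQA's actual implementation; the function name and default threshold are ours):

```python
def longest_high_quality_segment(quals, threshold=20):
    """Return (start, end) of the longest contiguous run of quality
    scores above `threshold` (half-open interval), in the spirit of
    SolexaQA's DynamicTrim; the read would be trimmed to this span."""
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q > threshold:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # run broken by a low-quality base
    return best_start, best_end
```

A read whose longest high-quality segment falls below the length cutoff would then be discarded rather than trimmed.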
Additionally, to reduce the base-composition biases that commonly occur in HTS data, the frequency of each base at each position over all reads is calculated, and bases whose frequency exceeds two standard deviations of the average base frequency within a sample are trimmed from the 5′ end of all reads. Using our test data set of 124 published human gut microbial samples, the fraction of reads that could be mapped to assemblies was on average 1% higher when using 5′-trimmed reads compared to non-trimmed reads (Table S2).
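This 5′ composition check can be sketched as follows (our own simplified reading of the rule, with illustrative names; MOCAT's actual implementation may differ in detail):

```python
from statistics import mean, stdev

def trim_biased_5prime(reads, n_sd=2.0):
    """Illustrative sketch: compute the frequency of each base at each
    position over all reads, then trim from the 5' end every leading
    position at which some base's frequency deviates by more than
    n_sd standard deviations from that base's average frequency."""
    length = min(len(r) for r in reads)
    freq = {b: [sum(r[i] == b for r in reads) / len(reads)
                for i in range(length)]
            for b in "ACGT"}
    avg = {b: mean(freq[b]) for b in "ACGT"}
    sd = {b: stdev(freq[b]) for b in "ACGT"}
    trim = 0
    for i in range(length):
        if any(sd[b] > 0 and abs(freq[b][i] - avg[b]) > n_sd * sd[b]
               for b in "ACGT"):
            trim = i + 1  # biased position near the 5' end: trim it
        else:
            break  # first unbiased position stops the trimming
    return [r[trim:] for r in reads]
```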
ii) Mapping, and Removal or Extraction of Reads Matching a Reference Database
In the second step, reads can be mapped to reference sequences in order to extract or remove reads from the original data set, as well as to calculate base or read coverages. For example, reads from a human fecal metagenomic sample can be mapped to the provided human genome database (hg19, Genome Reference Consortium Human Reference 37) using SOAPaligner2 to remove reads of human origin, or reads containing adapters used for sequencing library construction can be removed using Usearch. Reads can also be mapped to any other custom reference database to calculate base and insert coverage of reference sequences, for example to estimate the taxonomic and/or functional composition of a sample.
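The extract-or-remove logic amounts to partitioning reads by whether they mapped to the reference; a toy sketch with hypothetical names (the actual alignment is performed by SOAPaligner2 or Usearch, not by this code):

```python
def partition_reads(reads, mapped_ids):
    """Split (read_id, sequence) records into reads that are kept
    (screened, e.g. non-human) and reads that are extracted (those
    whose ids mapped to the reference database, e.g. hg19)."""
    kept, extracted = [], []
    for read_id, seq in reads:
        (extracted if read_id in mapped_ids else kept).append((read_id, seq))
    return kept, extracted
```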
Here, we estimated the taxonomic composition of the simulated metagenome by mapping reads to the set of original reference genomes (Table S2 and Table S3) and calculating genome size-normalized base and read coverages. The Pearson and Spearman correlation coefficients between the observed and expected composition of the simulated metagenome were 0.95 and 0.90, respectively, for both base and read counts, and only 80 out of more than 30 million reads were not aligned. However, the observed abundances of genomes with very high sequence identity may deviate from the expected abundances due to reads mapping to both the genome of origin and other highly similar genomes.
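Genome size normalization amounts to dividing each genome's mapped base (or read) count by its length and rescaling to relative abundances (a minimal sketch with made-up numbers; the function and key names are illustrative):

```python
def normalized_abundance(base_counts, genome_sizes):
    """Genome size-normalized relative abundance: divide each genome's
    mapped base count by its genome length, then scale so that the
    abundances sum to 1."""
    coverage = {g: base_counts[g] / genome_sizes[g] for g in base_counts}
    total = sum(coverage.values())
    return {g: c / total for g, c in coverage.items()}
```

The resulting observed abundance vector is what is compared, genome by genome, against the expected composition using Pearson and Spearman correlations.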
Relative abundance of each reference genome present in the simulated metagenome.
When estimating the taxonomic composition of the HMP mock community, reads were mapped to reference sequences of the community (Table S4). By first removing quality-filtered and trimmed reads matching known Illumina adapter sequences (Table S5), the percentage of bases and reads mapping to the reference genomes increased from 94.3% to 97.3% and from 95.0% to 97.6%, respectively, indicating the usefulness of a pre-screening step. The taxonomic composition estimated here is similar to the values calculated by the HMP consortium (Pearson correlation coefficients of 0.75 and 0.83 for base and read mapping, respectively), and also to estimates based on 16S sequences using 454 sequencing (Figure S14). Experimental errors, not applicable to estimates of computationally simulated metagenomes, may explain the lower correlation in the mock community compared to the simulated metagenome.
Relative abundance of each genus present in the even HMP mock community.
iii) Assembly
In the assembly step, a new version (1.06) of SOAPdenovo is used. For paired-end sequences, the insert size of each sequencing library is estimated at run-time by mapping reads either to reference marker genes prior to assembly, or to assembled contigs prior to scaffolding. Similarly, the Kmer sizes used for assemblies are calculated at run-time for each individual metagenome. Empirical tests on a large number of samples show that estimating the Kmer size for each sample as the closest odd number larger than or equal to half the average read length may not yield the best possible assembly, but balances assembly throughput and accuracy.
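The run-time Kmer heuristic, the smallest odd number greater than or equal to half the average read length, can be written down directly (a sketch; the function name is ours):

```python
import math

def estimate_kmer(read_lengths):
    """Smallest odd integer >= half the average read length, per the
    Kmer heuristic described in the text."""
    half = sum(read_lengths) / len(read_lengths) / 2
    k = math.ceil(half)
    return k if k % 2 == 1 else k + 1  # bump even values to the next odd
```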
The accuracy of metagenomic assemblies was assessed using data from the simulated metagenome and the mock community. We used the percentage of predicted complete genes aligning to the reference sequences of origin as a proxy for correctly assembled scaftigs (contigs that were extended and linked using the paired-end information of sequencing reads). For the simulated metagenome this value was 95.2% (12,385 complete genes predicted), and for the mock community 89.3% of the complete genes aligned (1,042 complete genes predicted). The lower number of predicted complete genes in the mock community may be explained by the relatively low number of high-quality reads used in the assembly of this metagenome.
The effect of using variable Kmer sizes, rather than a fixed Kmer size, in the assembly step was evaluated using the 124 gut metagenomes. Estimating Kmer sizes at run-time for each individual metagenome, rather than using a fixed Kmer size across all samples, improved the number and frequency of complete gene calls as well as the overall average gene length (column 1).
Progressive improvement of gene prediction metrics in 124 human gut metagenomes.
iv) Assembly Revision
In the assembly revision step, a feature independent of the assembly package used, MOCAT can revise existing paired-end read assemblies by aligning the reads to assembled scaftigs using the gap-tolerant BWA aligner to correct base errors and short indels, and the fast SOAPaligner2 to resolve chimeric regions. Performing assembly revision on the 124 human fecal metagenomes further improved the gene prediction metrics (column 2).
The functionality and versatility of the pipeline have been demonstrated using an artificial mock community metagenome, a simulated metagenome with 100 species, and 124 human gut metagenomes. Based on parameter exploration and data-driven parameter optimization at run-time, the MOCAT pipeline can process metagenomes in a standardized and automated way while improving the quality of assembly and gene prediction compared to using default parameters for the supported programs. To date, MOCAT has additionally been used to process and assemble hundreds of host-associated and ocean metagenomes within the scope of the MetaHIT and TARA Oceans projects.
Implementation, Availability, and Requirements
MOCAT is implemented in Perl and installed by extracting the package and executing a single script, which downloads the default external software used by the pipeline and sets it up. This avoids the otherwise tedious process of downloading all the individual components, a common drawback of in-house pipelines. Optional components requiring a license, such as MetaGeneMark for gene prediction and Usearch for removal or extraction of reads by alignment to a FASTA-formatted sequence file, must be downloaded manually.
A new project is quickly set up, requiring only single- or paired-end FastQ-formatted sequencing read files for each sample in a separate directory. The use of a project-specific configuration file with suggested default settings allows users to run all processing steps up to gene prediction without additional setup, while allowing experienced users to modify the parameters and programs used in MOCAT. All of the settings are described in the MOCAT documentation.
A queuing system enables processing of a large number of samples in parallel. If present, MOCAT seamlessly integrates all processing steps with the SGE and PBS queuing systems. However, if no queuing system is available, MOCAT processes samples serially on the machine from which it was executed.
MOCAT runs on 64-bit UNIX systems and can be freely downloaded at http://www.bork.embl.de/mocat/. Perl version 5.8.8 or above is required. MOCAT is also available as a Virtual Machine package, which can be used to run MOCAT on a PC or a cloud-based system. The open-source code and modular architecture allow users to modify or exchange the programs utilized in the various processing steps. There are no minimum hardware requirements for the pipeline itself to run; however, the requirements for analyzing metagenomic data sets vary depending on the number of samples processed in parallel and the sequencing depth of each sample. To aid in determining whether local computational resources are adequate, we provide in Table S6 the maximum resources required to process the data sets in this article. We recommend at least 16 GB of RAM to process smaller metagenomes and 64 GB of RAM to process medium-sized metagenomes, but these requirements may vary depending on project settings and systems used. The hard disk space requirements depend on the size and number of metagenomes to be analyzed, but we recommend at least 500 GB of hard disk space.