CaPSID implements an improved form of a computational approach known as “digital subtraction” [
10] that consists of subtracting in silico known human short read sequences from human transcriptome (or genome) samples, leaving candidate non-human sequences to be aligned against known pathogen reference sequences. CaPSID differs from traditional digital subtraction (e.g., [
8]), which is used as a filter, eliminating human sequences from the dataset before comparison with pathogen reference sequences. By contrast, CaPSID matches reads against both human and pathogen reference sequences, dividing the reads into three disjoint sets per sample: a set that aligns to pathogen sequences, a set that aligns to both human and pathogen sequences, and a set that does not align to either human or pathogen sequences. This three-way division forms the basis for an exploratory environment for both known and unknown pathogen research.
As shown in Figure , CaPSID consists of three linked components:
· A pipeline to analyze and maintain sequencing datasets
· A database which stores reference samples and analysis results
· An interactive interface to browse, search, and explore identified candidate pathogen data
The CaPSID Pipeline
The CaPSID pipeline is a suite of command-line tools written in Python designed to identify, through digital subtraction, non-human nucleotide sequences in short read datasets generated by deep sequencing of RNA or DNA tumor samples.
The pipeline can be conceptually divided in two distinct modules. The first module, called the Genomes Module, provides users with tools to create and update the in-house reference sequence database required by CaPSID for applying the digital subtraction. It uses BioPython [
11] to efficiently parse GenBank files and loads whole genome reference sequences, as well as some of their annotations (e.g. gene and CDS locations), into CaPSID’s database. Our database contains complete sets of human (GRCh37/hg19), viral (4015), microbial (bacterial and archaea) (38035), and fungal (53098) genomes (as of December 2011) from UCSC [
12] and NCBI [
13]. This module also provides the tools to create customized reference sequence FASTA files needed by short read sequence alignment software.
The second module, called the Analysis Module (see Figure ), is responsible for executing the digital subtraction and for analyzing its results. It requires two BAM files as input for each sequenced sample to be analyzed: one containing the short read alignment results to the human reference sequences (HRS) and one containing the alignment results to all the pathogen reference sequences (PRS) found in the CaPSID database.
CaPSID can directly process BAM files containing header lines formatted according to NCBI FASTA specification. In order to produce BAM files that can be loaded into CaPSID, a user can also export genome reference files from CaPSID for passing to the short read aligner. Before processing the BAM files, the user can specify a minimum required MAPQ score and all aligned reads that fall below that value will be removed from the analysis. CaPSID can process BAM files containing either single-end or pair-end reads and will automatically determine which analysis method is appropriate. Digital subtraction is not limited to a human reference sequence and can be executed with any host organism as long as its reference sequence has been loaded into the database.
In theory, any short read alignment program - as long as it produces BAM files– is suitable for CaPSID. However, one restriction applies, namely that multiple alignment locations for individual reads must be reported as they can map to more than one pathogen genome. We selected Novoalign (Novocraft Technologies [
14]) for its flexibility (allows both mismatches and gaps), accuracy and speed. For ABI SOLiD sequencing data, we chose to use BioScope (Applied Biosystems). Both of these aligners are commercial programs with costly licences. However there are freely available and open source alternatives such as the recently released Bowtie 2 [
15] (also allowing both mismatches and gaps) and SHRiMP2 [
16].
As output, CaPSID’s digital subtraction implementation produces three disjoint sets of short reads (or records) per sample: a set that aligns to any of the PRS, a set that aligns to both HRS and PRS and a set of reads that does not align to any of the reference sequences (both human and pathogens).
In Algorithm 1 we outline the digital subtraction method used by CaPSID, and how the three disjoint sets are constructed. CaPSID uses the pysam Python module to read and process aligned short read sequences from two input BAM files (LHRS and LNHRS are read from the same BAM file, in two passes). The algorithm first processes alignment information for each read that maps to the PRS and stores it to the CaPSID’s database (lines 1 to 3). Next, the algorithm processes alignment information for reads that map to HRS, storing it into CaPSID if the same read identifier also maps to the PRS (lines 4 to 8). To avoid memory performance problems, LHRS, which may be very large, is processed sequentially. Finally, with another pass through the same BAM file, this time selecting reads that do not align to the human reference sequence, CaPSID identifies unknown reads when they also do not map to the PRS by testing against an in-memory indexed copy of LPRS, which is expected to be relatively small (lines 9 to 13).
Algorithm 1 CapSID’s digital subtraction method
Require: input LPRS - list of reads that align to the pathogen reference sequenceRequire: input LHRS - list of reads that align to the human reference sequenceRequire: input LNHRS - list of reads that do not align to the human reference sequence
1:for eachitem in LPRSdo
2: additemtoPATHOGENSTORE
3: end for
4: for eachitem in LHRSdo
5: ifiteminPATHOGENSTOREthen
6: additemtoHUMANSTORE
7: end if
8: end for
9: for eachiteminLNHRSdo
10: ifitemis not inLPRSthen
11: additemtofile(FASTQ)
12: end if
13: end for
To better evaluate the significance of the findings, the Analysis Module calculates four different metrics for each sample and for each project as a whole (defined as a collection of samples): (i) the total number of aligned reads (or hits) across any given pathogen genome, (ii) the total number of hits across genes only within a pathogen genome, (iii) the total coverage across each pathogen genome and (iv) the maximum coverage across any of the genes in a given pathogen genome. Here we define coverage as the number of genome nucleotides represented in aligned reads normalized by the genome length. Code that calculates the four metrics for each sample has been parallelized to run on multiple processors in order to make these calculations over samples faster.
In addition to the Genome and Analysis Modules, CaPSID includes a command-line script for manipulating FASTQ files that allows the filtering of low quality reads and the removal of duplicates before alignment to reference sequences. The filtering is based on two parameters set by the user, the Phred quality threshold and the number of base pairs allowed below that threshold.
Processing digital subtraction and calculating metrics on two BAM files of 34 GB and 31 GB respectively takes approximately 15 minutes when using 5 GB of RAM on a 16-core AMD 64-bit processor.
The CaPSID Database
One of the main unique features of CaPSID is that reference sequences and digital subtraction results are both stored as linked data in MongoDB, a scalable, high-performance, open source document-oriented database [
17].
The CaPSID database records information on each read identified by the pipeline as part of the two first output sets described above, namely on each read that aligns to either the PRS alone, or to both the HRS and the PRS. The stored fields include, for example, alignment location, length, score, average base qualities, alignment sequence, CIGAR (Compact Idiosyncratic Gapped Alignment Report) information, number of mismatches and whether the read aligns to HRS. Storing both genome and read data makes it possible to run analysis quickly over known gene locations to determine which reads are aligning over gene regions. The third output set of reads identified by the pipeline (reads which do not align to any of the reference sequences) is not stored in the database but is rather saved in FASTQ format to the local file system for further processing.
Next generation sequencing produces large amount of data and because CaPSID stores information about each aligned read and each reference sequence, it needs to deal effectively with large data sizes. CaPSID uses MongoDB [
17], a database software that scales horizontally by sharding the database across a cluster of servers, while still enabling fast retrieval of large volume of data. Thanks to this scalable architecture, there is virtually no limit as to how many reads or experiments CaPSID can store in its database. MongoDB reports [
17] examples of highly accessed production systems with more than 3.5 TB more than 3.5 TB and over 20 billion records. MongoDB also provides the safety of having no single point of failure, as well as distributing both the processing load and data storage requirements. We tested CaPSID on a single node with more that 25 million read records from 113 different transcriptomes samples without significant drop in its performance. Another advantage of using MongoDB is that it offers API access to a number of programming languages (such as R, Python, Java and more) which, depending on users’ needs, allow a broad range of custom type data analysis.
CaPSID allows also users to store meta-information about each project (defined as a set of samples) and sample. For example, the user can specify the type of disease, the type of cell and sample source together with alignment information (such as the aligner used, sequencing platform, type of sequencing and the location of BAM files) for each sample to be processed.
This database is a key component of the CaPSID platform since it allows users to store, organize and analyze the relevant information from individual samples (derived from the BAM files) in a simple and seamless way, without the burden of manipulating them programmatically.
The CaPSID Web Application
After digital subtraction, CaPSID’s web interface enables research attention to be focused on those parts of the sequencing datasets that match the pathogen reference sequences. Its aim is to provide researchers with a rich and interactive interface to manage, query and visualize project results stored in the database. For example, it allows users to search for a specific pathogen and view its statistics across multiple samples or to display sortable tables of coverage statistics for any sample or project (see Figure ). CaPSID also lets users rank genome hits in a given sample or project by using any of the four metrics described above. CaPSID integrates the genome browser JBrowse [
18] to allow, in a simple mouse click, visualizing and analyzing the distribution of read alignments from one or many samples of a given pathogen genome and its genes (see Figure ). The browser can also differentiate between reads that align PRS only and those that align to both HRS and PRS, allowing users to quickly see which proportion of reads also align to a reference sequence.
CaPSID was designed to be used as a platform for large collaborative projects, such as those between laboratories, and functions as a centralized repository of information. Separating the interface from the analysis pipeline is central to this, as it enables those without direct experience of high-throughput sequencing systems to participate in expert judgements on the digitally subtracted sample datasets.
To enable this, projects in CaPSID are defined as a collection of samples and allow for fine-grained control over user access levels. There are three levels of access to a project: users, collaborators and owners. Users have read-only access to projects, collaborators have permission to add, edit and remove samples and owners have full access to the project, which includes the ability to give permission for others users to access the project, and to remove the project and all associated data entirely from the platform. In addition to specific user level access, projects on the platform can be made public, giving all CaPSID users read-access to the samples.
The CaPSID web application is written in Groovy using the Grails web framework, and allows user authentication and authorization data to be defined in the CaPSID database, an LDAP server or a combination of the two. For example, specific user permissions can be stored in the CaPSID database, while login credentials come from an external LDAP server. This gives full project control to the CaPSID administrators, while still keeping the centralized access control of an LDAP server.
CaPSID’s manual and documentation
We have provided comprehensive documentation for installing and using CaPSID, complete with a step by step tutorial that takes users from creating a project and loading sequencing data, to analyzing and visualizing aligned reads across pathogen genomes. The full documentation is available at
https://github.com/capsid/capsid/wiki.