NG6 can be split into two distinct parts: the pipelines and the web site (Figure ). The pipelines gather a set of analyses adapted to the produced sequences. They can only be accessed and launched by the sequencing facility team. The pipelines are running in Ergatis [
5]: a workflow management system able to iterate through multiple inputs in order to run them at the same time on a computer farm. These jobs perform analysis and save the analysis results in the NG6 database and directories. The web site part, presenting the results has been implemented as a typo3 [
6] extension.
NG6 uses three data types: project, run and analysis. A project is a collection of runs and analysis. A run contains one or several raw files which can be used as inputs of different analysis. A project is owned by a user group and only users within this group are allowed to browse and download data related to this project.
Building and running pipelines
Pipelines are defined by a set of connected ergatis components. Depending on the links between the components, they are processed in a parallel or a serial manner. Most components available in NG6 combine a processing step and a storage step. This last one stores, on one hand, resulting files into the ad-hoc directory structure and, on the other hand, saves information into the database such as software version, parameters, links between analysis and resulting figures.
In the current version, NG6 offers a set of pipelines adapted to two platforms (Roche 454, Illumina HiSeq), four file formats (sff, fastq, fasta and qseq) and handles both casava 1.7 and casava 1.8 outputs of the illumina package [
7]. It includes analyses such as quality control, genomic read alignment, BAC assembly, 16S/18S diversity analysis, expression quantification using 16S amplicons. In order to handle multiplexed runs, some pipelines first split the input read file into sample files, process and collect results on each of them and last merge these results in a summary table.
As an example, the 454_default pipeline processes sff files, coming from the Roche sequencer. It first performs usual statistical analysis on the reads, then tracks down contamination from common contaminant databases (ecoli, yeast and phage) using blast [
8] returning a list of contaminated sequence IDs. Contamination between the different regions is also traced using the sfffile script included in the Roche Newbler package [
9]. Sequences with incorrect MID (Multiplexed ID) are discarded and the number of contaminated sequences is returned to the end-user. Roche 454 sequencing kits include control fragments known as spike-ins within each run. Statistics on the corresponding sequences are used to check if the run matches the expected quality standard. In the next step reads are cleaned using the pyrocleaner script [
10]. It discards reads considering different criteria such as length, base quality, complexity, number of undetermined bases, multiple copy reads or even faulty paired-ends. The analysis results are presented to the users in a summary table. Last, a de novo assembly is performed on the cleaned reads using the Newbler runAssembly command [
9]. Some basic figures regarding the assembly results, such as contig count, N50 value, contig length distribution or even contig length versus sum of read length per contig diagram are presented to the user in order to ease the assembly quality assessment.
When the pipeline execution is over, all analysis and runs newly added to the system are flagged as hidden. This was meant to permit the validation of the run by the team in charge of the sequencer before data release to the end-user.
NG6 also provides two components enabling to start a pipeline with data already loaded into the system. The ng6run2ergatis component takes a run ID and a file pattern in order to create an input file list which can be used as input for other components. The same can be done with the ng6analysis2ergatis component to work on previous analysis result files. This enables to launch new pipelines on datasets already stored in the system in order to answer new requests. When building a new pipeline, the administrator will have the choice between several already available components such as cleaning tools : smartkitcleaner, adaptatorcleaner, 16Scleaner or cutadapt [
11], alignment tools : bwa [
12,
13], blast, statistical tools : fastqc [
14], the samtools [
15], 16S/18S diversity assessment tools as mothur [
16] or other utilities as fastq_extract or sff_extract [
17]. After the configuration step, the administrator will be able to run the pipeline and monitor the processing steps states (Figure ).
The analyses provided in NG6 have been designed to limit the used disk space and the number of temporary files. As an example, the bwa alignment against a reference genome, performed on illumina reads, chains bwa and samtools using the unix pipe command.
A cluster environment has often a local optimized file system. NG6 moves files from the cluster file system to the storage file system using the ng6synchronization component. Until synchronization is completed, a warning message is displayed to inform the end-user.
Browsing and downloading results
A user can access his projects or runs using the menu bar items at the top of the page. The project and run links list all projects and runs he has access to. Once in a project, the user will see all the related runs and analysis performed on the project level. At the run level the system displays corresponding metadata such as species, sequence type and data volume. It also gives access to the sequence files and hierarchically lists analysis performed on the run. The analysis view displays analysis results and provides a direct access to the resulting data files (Figure ). At each level, the NG6 interface shows the used disk space. The download manager accessible from the menu bar permits to select and download data and analysis results files. To avoid data duplication, if the user has an unix account on the NG6 server, the software provides the possibility to create symbolic links between the data files and his home directory.
As a typo3 plug-in, NG6 can easily be included in any web site built with this CMS. The NG6 plug-in is compliant with the national language support system of typo3. Configuring the system for a new language only consists in translating and adding the corresponding language files. So far, only English and French are supported.
Right accesses and administration
NG6 offers two user status : administrator and end-user and two data access levels : public and private. Within each level the items can be hidden or unhidden. This allows to manage access rights considering the user type (Table ). NG6 uses the typo3 user tables and management system. Rights are given on a project level to a user group. A user can be part of multiple groups. Once the user is logged on the web site, he can only browse projects of his groups.
| Table 1Users and data right management |
The project administrator has all rights on the project, he can delete, hide, unhide, publish and unpublish the whole project with related runs and analysis. A hidden project is only visible to the project administrator, this was designed in order to permit the validation of the run by the team in charge of the sequencer before releasing the data to the end-user. To give access to the project, once the data is validated, the administrator unhides it. This is also true for analysis (Figure ). The metadata fields are editable on line by the administrator.
A published project is openly accessible on the web site. For example, you can access our demonstration project using the following link :
http://ng6.toulouse.inra.fr/index.php?id=3. This feature provides the biologists with a fast and easy way to make their data accessible to their community.
Adding new analysis
NG6 web site is a Typo3 extension written in php. It uses the smarty template engine [
18] and the jquery javascript library [
19]. Adding new analysis into NG6 requires three steps. The first one is writing the ergatis component of the analysis. Each parameter, input and output required by the analysis has to be specified in the configuration file. Second, a simple python script has to be programmed using the NG6 API and the provided skeleton to define the data stored in the database and the result files stored in the directory structure. Finally, a smarty template is specified to set the corresponding analysis display. While writing the smarty template, the developer has access to several objects to build the analysis display as wished. Several HTML classes are available to ease javascript functionalities implementation.