NG6 can be split into two distinct parts: the pipelines and the web site (Figure 1). The pipelines gather a set of analyses adapted to the produced sequences. They can only be accessed and launched by the sequencing facility team. The pipelines run in Ergatis [5], a workflow management system able to iterate over multiple inputs and process them simultaneously on a computer farm. These jobs perform the analyses and save their results in the NG6 database and directories. The web site part, which presents the results, has been implemented as a typo3 [6] extension.
Figure 1 Architecture of the NG6 application. NG6 pipelines are available within the ergatis workflow environment. The analyses are processed either on a local system or on a distributed environment. While the analyses are running, they store the resulting files ...
NG6 uses three data types: project, run and analysis. A project is a collection of runs and analyses. A run contains one or several raw files which can be used as inputs of different analyses. A project is owned by a user group, and only users within this group are allowed to browse and download data related to this project.
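The three data types can be pictured with a minimal Python sketch. The class and field names below are illustrative assumptions, not the actual NG6 database schema (the web site itself is written in php); the sketch only mirrors the relationships described above.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Analysis:
    name: str
    result_files: List[str] = field(default_factory=list)


@dataclass
class Run:
    name: str
    raw_files: List[str] = field(default_factory=list)    # one or several raw files
    analyses: List[Analysis] = field(default_factory=list)


@dataclass
class Project:
    name: str
    owner_group: str                                       # hypothetical field name
    runs: List[Run] = field(default_factory=list)
    analyses: List[Analysis] = field(default_factory=list)  # project-level analyses

    def all_analyses(self) -> List[Analysis]:
        """Project-level analyses followed by those attached to each run."""
        collected = list(self.analyses)
        for run in self.runs:
            collected.extend(run.analyses)
        return collected

    def is_accessible_by(self, user_groups: Iterable[str]) -> bool:
        """Only members of the owning group may browse or download."""
        return self.owner_group in set(user_groups)
```

Keeping the owning group on the project, rather than on each run or analysis, reflects the statement that rights are attached to the project as a whole.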
Building and running pipelines
Pipelines are defined by a set of connected ergatis components. Depending on the links between the components, they are processed in parallel or serially. Most components available in NG6 combine a processing step and a storage step. The storage step writes the resulting files into the ad hoc directory structure and saves information such as software version, parameters, and links between analyses and resulting figures into the database.
In the current version, NG6 offers a set of pipelines adapted to two platforms (Roche 454, Illumina HiSeq) and four file formats (sff, fastq, fasta and qseq), and handles both casava 1.7 and casava 1.8 outputs of the illumina package [7]. It includes analyses such as quality control, genomic read alignment, BAC assembly, 16S/18S diversity analysis and expression quantification using 16S amplicons. In order to handle multiplexed runs, some pipelines first split the input read file into sample files, then process each of them and collect the results, and finally merge these results in a summary table.
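The split, process and merge pattern used for multiplexed runs can be sketched as follows. This is a simplified illustration in Python, not the NG6 implementation: reads are assumed to be already tagged with their sample name, and the per-sample analysis is a trivial read count and length summary.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (sample tag, sequence): a hypothetical simplified read


def split_by_sample(reads: Iterable[Read]) -> Dict[str, List[str]]:
    """Split step: group sequences by their sample tag."""
    per_sample = defaultdict(list)
    for tag, sequence in reads:
        per_sample[tag].append(sequence)
    return dict(per_sample)


def read_stats(sequences: List[str]) -> Dict[str, float]:
    """Process step: a stand-in analysis run on one sample."""
    total = sum(len(s) for s in sequences)
    return {"reads": len(sequences), "bases": total,
            "mean_length": total / len(sequences)}


def summarize(reads: Iterable[Read],
              analyse: Callable[[List[str]], Dict[str, float]] = read_stats):
    """Merge step: one summary-table row per sample, sorted by sample name."""
    per_sample = split_by_sample(reads)
    return [{"sample": name, **analyse(seqs)}
            for name, seqs in sorted(per_sample.items())]
```

Because each sample is analysed independently, the process step is the part a workflow manager can distribute over a computer farm.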
As an example, the 454_default pipeline processes sff files coming from the Roche sequencer. It first computes the usual statistics on the reads, then tracks down contamination by searching common contaminant databases (ecoli, yeast and phage) with blast [8], which returns a list of contaminated sequence IDs. Contamination between the different regions is also traced using the sfffile script included in the Roche Newbler package [9]. Sequences with an incorrect MID (Multiplexed ID) are discarded and the number of contaminated sequences is reported to the end-user. Roche 454 sequencing kits include control fragments, known as spike-ins, within each run. Statistics on the corresponding sequences are used to check whether the run matches the expected quality standard. In the next step, reads are cleaned using the pyrocleaner script [10], which discards reads according to criteria such as length, base quality, complexity, number of undetermined bases, multiple copy reads or faulty paired-ends. The analysis results are presented to the users in a summary table. Last, a de novo assembly is performed on the cleaned reads using the Newbler runAssembly command [9]. Some basic figures regarding the assembly results, such as contig count, N50 value, contig length distribution and a diagram of contig length versus sum of read lengths per contig, are presented to the user to ease the assessment of assembly quality.
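Among the assembly figures listed above, the N50 value has a simple definition: it is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The following Python function is a minimal sketch of that standard definition; it is not taken from the NG6 code.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    if not ordered:
        raise ValueError("N50 is undefined for an empty assembly")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: the running sum always reaches half")
```

For example, contigs of lengths 2 to 10 sum to 54; the three longest (10, 9 and 8) sum to 27, so the N50 is 8.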
When the pipeline execution is over, all analyses and runs newly added to the system are flagged as hidden. This permits the validation of the run by the team in charge of the sequencer before the data are released to the end-user.
NG6 also provides two components for starting a pipeline with data already loaded into the system. The ng6run2ergatis component takes a run ID and a file pattern and creates an input file list which can be used as input for other components. The same can be done with the ng6analysis2ergatis component to work on the result files of previous analyses. This makes it possible to launch new pipelines on datasets already stored in the system in order to answer new requests. When building a new pipeline, the administrator can choose between several already available components, such as cleaning tools: smartkitcleaner, adaptatorcleaner, 16Scleaner or cutadapt [11]; alignment tools: bwa [12], blast; statistical tools: fastqc [14], the samtools [15]; 16S/18S diversity assessment tools such as mothur [16]; or other utilities such as fastq_extract or sff_extract [17]. After the configuration step, the administrator can run the pipeline and monitor the state of the processing steps (Figure 2).
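The idea behind a component such as ng6run2ergatis, turning a run ID and a file pattern into an input file list, can be illustrated with a short Python sketch. The directory layout (`run_<id>` under a storage root), the function names and the one-path-per-line list format are assumptions made for illustration; the actual component may resolve runs and write lists differently.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import List


def run_directory(storage_root, run_id) -> Path:
    """Hypothetical mapping from a run ID to its storage directory."""
    return Path(storage_root) / f"run_{run_id}"


def build_input_list(storage_root, run_id, pattern, list_path) -> List[str]:
    """Write the files of a run matching `pattern` to a list file, one per line."""
    directory = run_directory(storage_root, run_id)
    matches = sorted(
        str(path) for path in directory.iterdir()
        if path.is_file() and fnmatch(path.name, pattern)
    )
    Path(list_path).write_text("".join(match + "\n" for match in matches))
    return matches
```

A call such as `build_input_list("/storage", 42, "*.fastq", "inputs.list")` would then produce a list file that downstream components could iterate over (the paths here are placeholders).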
Figure 2 Executing and monitoring workflows. To monitor, execute and create pipelines, NG6 relies on the ergatis workflow management system. This figure presents a pipeline running on illumina data and producing an alignment against a reference genome, some statistics ...
The analyses provided in NG6 have been designed to limit disk usage and the number of temporary files. As an example, the bwa alignment against a reference genome, performed on illumina reads, chains bwa and samtools using a unix pipe.
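Chaining two tools through a pipe means the intermediate output never touches the disk. The Python sketch below shows one way to do this with the standard `subprocess` module; the helper name `run_piped` is hypothetical, and the exact bwa and samtools command lines used by NG6 are not given in the text, so the invocation in the trailing comment is only an assumed example for recent versions of both tools.

```python
import subprocess
from typing import Sequence


def run_piped(producer_cmd: Sequence[str], consumer_cmd: Sequence[str],
              output_path: str) -> None:
    """Run `producer | consumer > output_path` without a temporary file."""
    with open(output_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                    stdout=out)
        # Close our copy of the pipe so the producer sees a broken pipe
        # if the consumer exits early.
        producer.stdout.close()
        consumer_rc = consumer.wait()
        producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            f"pipeline failed (producer={producer_rc}, consumer={consumer_rc})"
        )


# Assumed example: SAM output of the aligner converted to BAM on the fly.
# run_piped(["bwa", "mem", "reference.fa", "reads.fastq"],
#           ["samtools", "view", "-b", "-"],
#           "aligned.bam")
```

Checking both return codes matters here: with a pipe, a failure in the first tool can otherwise go unnoticed while the second tool still exits cleanly on truncated input.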
A cluster environment often has its own optimized local file system. NG6 moves files from the cluster file system to the storage file system using the ng6synchronization component. Until synchronization is completed, a warning message is displayed to inform the end-user.
Browsing and downloading results
A user can access their projects or runs using the menu bar items at the top of the page. The project and run links list all projects and runs the user has access to. Once in a project, the user sees all the related runs and the analyses performed at the project level. At the run level, the system displays the corresponding metadata such as species, sequence type and data volume. It also gives access to the sequence files and hierarchically lists the analyses performed on the run. The analysis view displays analysis results and provides direct access to the resulting data files (Figure ). At each level, the NG6 interface shows the disk space used. The download manager, accessible from the menu bar, lets the user select and download data and analysis result files. To avoid data duplication, if the user has a unix account on the NG6 server, the software can create symbolic links between the data files and the user's home directory.
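The symbolic link option avoids copying large sequence files. A minimal Python sketch of the idea is given below; the function name and the choice to skip names that already exist in the target directory are assumptions, not a description of how NG6 implements the feature.

```python
from pathlib import Path
from typing import Iterable, List


def link_into_directory(data_files: Iterable, target_dir) -> List[Path]:
    """Create one symbolic link per data file inside `target_dir`."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for source in map(Path, data_files):
        link = target_dir / source.name
        if link.is_symlink() or link.exists():
            continue  # never overwrite something already in the user's directory
        link.symlink_to(source.resolve())
        created.append(link)
    return created
```

The links point at the stored files, so the data occupy disk space only once while still appearing under the user's home directory.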
Figure 3 Administrator view of a run. The administrator view enables multiple analysis selection in order to hide, unhide or delete the selected elements. Once hidden, an analysis will no longer be displayed to the end-user. As an example, the Control analysis ...
As a typo3 plug-in, NG6 can easily be included in any web site built with this CMS. The NG6 plug-in is compliant with the national language support system of typo3. Configuring the system for a new language only requires translating and adding the corresponding language files. So far, only English and French are supported.
Access rights and administration
NG6 offers two user statuses, administrator and end-user, and two data access levels, public and private. Within each level, items can be hidden or unhidden. This makes it possible to manage access rights according to the user type (Table ). NG6 uses the typo3 user tables and management system. Rights are given at the project level to a user group. A user can be part of multiple groups. Once logged on to the web site, a user can only browse the projects of their groups.
Users and data right management
The project administrator has all rights on the project and can delete, hide, unhide, publish and unpublish the whole project with its related runs and analyses. A hidden project is only visible to the project administrator; this was designed to permit the validation of the run by the team in charge of the sequencer before the data are released to the end-user. To give access to the project once the data are validated, the administrator unhides it. This is also true for analyses (Figure ). The metadata fields can be edited online by the administrator.
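The combination of user status, access level and hidden flag can be summarised as a single visibility check. The Python sketch below is an interpretation of the rules stated in the text (hidden items are visible to administrators only, published items are open to everyone, private items are restricted to the owning group); the class names, the single `is_admin` flag and the treatment of anonymous visitors are assumptions, since the rights table itself is not reproduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class User:
    name: str
    groups: FrozenSet[str] = frozenset()
    is_admin: bool = False      # simplification: one flag instead of per-project admin


@dataclass(frozen=True)
class Item:
    owner_group: str
    public: bool = False        # "published" in the text
    hidden: bool = True         # newly loaded data start hidden


def can_view(user: Optional[User], item: Item) -> bool:
    """Return True if `user` (None for an anonymous visitor) may see `item`."""
    if user is not None and user.is_admin:
        return True             # administrators see everything, hidden or not
    if item.hidden:
        return False            # awaiting validation by the sequencing team
    if item.public:
        return True             # published data are openly accessible
    return user is not None and item.owner_group in user.groups
```

Testing the hidden flag before the public flag encodes the validation step: even data in a published project stay invisible to end-users until an administrator unhides them.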
Examples of analysis views. The analysis result layout is defined in a smarty template, thus enabling different layouts for the end-user. Figure five shows examples of MothurClassify and PyroCleaner analysis result displays.
A published project is openly accessible on the web site. For example, you can access our demonstration project using the following link: http://ng6.toulouse.inra.fr/index.php?id=3. This feature provides biologists with a fast and easy way to make their data accessible to their community.
Adding new analyses
The NG6 web site is a Typo3 extension written in php. It uses the smarty template engine [18