454 transcriptome sequencing is widely used as a cost-effective sequencing method, especially for non-model organisms. Concentrating the sequencing effort on the expressed part of the genome not only saves costs, it also gives direct access to the transcribed sequences, which are not easily predicted from the genome sequence alone. Splice patterns, i.e. versatile combinations of exons, can be identified, and gene expression rates can be estimated and compared. In addition, single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs) within the coding part of the genome can be determined.
Most analyses that utilise transcriptome data require assembled reads. With next generation sequencing (NGS), DNA molecules are fragmented, size-selected, amplified, and sequenced at high throughput, resulting in reads whose length is specific to the respective NGS technology. This fragmentation procedure is reversed in silico by merging overlapping reads into contigs during the assembly process. The study presented here focuses on the performance of software for de novo assembly of cDNA reads generated by 454 sequencing. In studies lacking a sequenced genome, it is not possible to assemble the reads by mapping them onto a reference genome. Instead, all reads have to be aligned against each other, i.e. assembled de novo. Despite its higher costs compared to other NGS technologies, 454 is still widely used because of the long reads it produces, which facilitate read alignment during de novo assembly. Other sequencing technologies, such as Illumina, are constantly increasing their read length and surpass 454, especially in terms of throughput and cost per base pair. In addition, new technologies are being developed, for example the semiconductor technology of Ion Torrent. Therefore, the assembly of reads around 200 bp long, as evaluated in the study presented here, will likely persist as a bioinformatics challenge.
For the de novo assembly of 454 transcriptomic reads the following assemblers are most widely used: CAP3 (and a wrapper for CAP3), MIRA (and a wrapper for MIRA), Newbler, Seqman NGen©, CLC bio©, and the web application EGassembler (see ). Not all of these assemblers are specifically intended for transcriptome data. In contrast to a genome, which consists of a few long continuous stretches (linkage groups or chromosomes), the transcriptome comprises many transcripts of variable length. The complexity of assembling a transcriptome is further exacerbated by varying expression levels, which result in an uneven distribution of reads amongst the diverse transcripts. Although experimental cDNA normalization aims to reduce the dynamic range of expression, it usually does not result in an even distribution of transcripts. In addition, alternative splicing results in multiple isoforms, which share partial sequence information.
Assembler software recently used for de novo assembly of 454 transcriptome data.
These intrinsic features of the transcriptome pose special challenges for any assembly software. A recent study by Kumar and Blaxter compared transcriptome assemblers by analysing 454 cDNA reads from Litomosoides sigmodontis, a nematode, and evaluated the resulting contigs. The quality of the read assemblies was assessed using basic assembly metrics, such as various measurements of bases used, contig number, and contig length. In addition, contigs were compared with previously existing sequence databases. Although that study presents a very comprehensive evaluation of different software solutions, some aspects were not addressed exhaustively: (1) The analysis of basic assembly metrics usually suffers from the fact that optimal values are not known when only real data are used. Although it may seem tempting to simply assume that longer contigs represent a better assembly, this is not necessarily the case, e.g. if reads of different transcripts are concatenated. (2) The comparison of assemblies with pre-existing sets of reference sequences from other organisms might be misleading. The contigs of the best-performing assembler do not necessarily match reference sequences well, even when these references originate from the same species, because the transcriptome varies depending on tissue, time point, and abiotic factors. (3) Due to sequence similarity between transcripts, reads originating from different transcripts can be merged into one contig during the assembly process. Without knowledge of the origin of reads, it is difficult to determine the extent to which an assembler produces chimeric contigs, i.e. contigs containing reads from different transcripts.
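To make point (3) concrete, the following minimal Python sketch shows how chimeric contigs could be flagged once the origin of every read is known. It is an illustration only, not the evaluation code of any study mentioned here: the function name `find_chimeric_contigs`, the two dictionaries, and all identifiers in the example are hypothetical, and "origin" is simplified to a single transcript identifier per read.

```python
from collections import defaultdict


def find_chimeric_contigs(read_to_contig, read_to_transcript):
    """Return {contig_id: set of source transcripts} for contigs that
    contain reads from more than one transcript.

    Both arguments are hypothetical inputs: plain dicts keyed by read ID.
    """
    sources = defaultdict(set)
    for read_id, contig_id in read_to_contig.items():
        # Collect the known transcript of origin for every read in the contig.
        sources[contig_id].add(read_to_transcript[read_id])
    # A contig is chimeric if its reads stem from at least two transcripts.
    return {c: t for c, t in sources.items() if len(t) > 1}


# Toy example: contig c1 mixes reads from transcripts tA and tB.
read_to_transcript = {"r1": "tA", "r2": "tA", "r3": "tB", "r4": "tB"}
read_to_contig = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c2"}
chimeras = find_chimeric_contigs(read_to_contig, read_to_transcript)
```

With real reads the `read_to_transcript` mapping is unavailable, which is why this check is only possible when the origin of the reads is known.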
We used a novel approach to assess the performance of assembler software. By applying a simulation approach, we circumvented some of the problems mentioned above. Given a transcriptome, the simulator carried out in silico gene expression, reverse transcription, fragmentation, and 454 sequencing. In contrast to real 454 reads, the exact origin of each simulated read was known. Utilising this information, it was possible to merge reads with a minimum overlap of one base pair, independent of sequence information. In this way we obtained an ideal solution (Model Assembly, MA), which assigned all reads to their original transcripts while merging reads as efficiently as possible given the amount of data (one single 454 plate). The MA was therefore the optimal solution of the assembly problem given the data. The same simulated reads were assembled using assembly software, which operated on sequence information only. The resulting assemblies could then be compared to the MA. The MA provided reference values for basic assembly metrics, such as contig count and contig length. Additionally, the MA could be used as a reference data set against which the output contigs of the assemblers were compared to determine specificity and sensitivity. Assessing the number of reads aligning back to multiple contigs identified alignment ambiguity and redundancy in the assemblies. As we knew the transcript from which each simulated read originated, it was also possible to identify reads of different origin that were joined into one chimeric contig and to quantify the extent of chimera formation in the different assemblies.
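The idea of merging reads by known origin rather than by sequence can be sketched as interval merging per transcript. The Python snippet below is a minimal illustration under stated assumptions, not the simulator used in this study: the function `model_assembly`, the tuple layout `(transcript_id, start, end)`, and the half-open coordinate convention are all hypothetical choices made for the example, and strand and sequencing errors are ignored.

```python
from collections import defaultdict


def model_assembly(reads, min_overlap=1):
    """Merge reads into contigs using only their known origin.

    reads: iterable of (transcript_id, start, end) with half-open
    coordinates [start, end) on the source transcript (assumed layout).
    Returns {transcript_id: [(start, end), ...]} of merged contigs.
    """
    by_transcript = defaultdict(list)
    for tid, start, end in reads:
        by_transcript[tid].append((start, end))

    contigs = {}
    for tid, spans in by_transcript.items():
        spans.sort()  # process reads from left to right along the transcript
        merged = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = merged[-1]
            # Shared bases between the current contig and the next read.
            if last_end - start >= min_overlap:
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        contigs[tid] = merged
    return contigs


# Toy example: two tA reads share one base; the third only abuts the
# merged contig (zero overlap) and therefore starts a new contig.
reads = [("tA", 0, 100), ("tA", 99, 250), ("tA", 250, 400), ("tB", 10, 120)]
ma = model_assembly(reads)
```

Because reads are grouped by transcript before merging, such a reference assembly cannot contain chimeric contigs by construction, and its contig count and lengths can serve as reference values for the sequence-based assemblers.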
In our study we created simulated reads based on a description of the human transcriptome (GRCh37.58). The human data set was chosen because of the comprehensive amount of data available and the complexity and size of the transcriptome. In addition, we used real 454 reads from a human tissue pool in order to compare the simulation approach with a realistic experimental setup. The assemblers tested in this study are CAP3, MIRA, Newbler, and Oases. These assemblers were chosen because they are frequently used in non-model organism transcriptome studies and are freely distributed stand-alone applications (see ). Although Oases is primarily designed for shorter Illumina reads, it was included in this study because it is specifically designed for transcriptome data.