1.1 Input and output
provides mechanisms for input of diverse high-throughput sequence data. A major starting point is reads aligned to references, as from manufacturer software or aligners such as MAQ
(Li et al.
) and Bowtie
(Langmead et al.
parses additional formats (e.g. fasta, fastq and arbitrary column-oriented text files). Resulting R
data structures allow manipulation of sequence, quality, alignment and other information. Input functions transparently parse compressed ( .gz
) files; most file types can be read as ‘chunks’, to allow processing of data subsets. ShortRead
inputs but does not specially represent fine-grained alignment descriptions (e.g. in Stockholm format).
Facilities for data output include fasta and fastq text formats, arbitrary column-oriented output of reads and auxiliary information, serialization of objects in native R
format, and through use of additional R
packages such as rtracklayer
export to common genome browser formats such as wiggle
(Kuhn et al.
1.2 Quality assessment
ShortRead includes facilities for assessment (QA) of read quality, sample processing and sequencing artifacts, and alignment characteristics. The QA pipeline can start with base calls and their quality scores (e.g. fastq or qseq files), as well as aligned data formats from special-purpose aligners. The result is an HTML report with embedded narrative to facilitate interpretation; a sample report is included with the package. Illustrative results are shown in . Highlights include: (i) The number of raw, filtered and aligned reads; (ii) Base call frequencies. (iii) Cycle-specific base calls and read qualities (e.g. A). (iv) Tabulation of read occurrences (how often reads are represented once, twice, …, n times). For instance, reads occurring once or a few times (to the left in B) may be unique due to base call errors, whereas reads occurring very frequently (at the extreme right in B) typically reflect PCR or resequencing issues. (v) Preliminary alignment quality score summaries. Technology-specific quality measures are also generated, especially for Illumina's Genome Analyzer (e.g. tile-specific read counts and qualities).
Fig. 1. Quality assessment. (A) Unlikely directional nucleotide change and base calls (cycle 26) from a Short Read Archive accession. (B) Left and right ‘tails’ correspond to infrequently and highly repeated reads, respectively, in a X174 (more ...)
1.3 Transformation and downstream analysis
ShortRead provides facilities for data exploration, transformation, and down-stream analysis. For example, alphabetByCycle summarizes cycle-specific nucleotide counts (A) and base qualities. The alphabetFrequency function summarizes nucleotide use over all cycles, on a per-read basis or over the entire set of reads. The tables summarizes commonly occurring sequences, as illustrated in B. ShortRead contains facilities for sorting reads, finding duplicates, trimming left and right ends and for exploiting the extensive string pattern matching functions of Biostrings.
The features described here are generally fast, operating on tens of millions of short reads in a few seconds; input of large text files can be slow, taking 3–5 min for 50 million 36mers. Sixty-four bit platforms with 4–8 GB of memory are typically sufficient.
provides extensible ‘filter’ functions for removing short reads satisfying predefined or ad hoc
criteria. For instance, the dustyFilter
identifies and removes low-complexity reads. Filters can be composed to formulate complex criteria. Additional ShortRead
functionality is a starting point for downstream analysis. The function coverage
summarizes [possibly ‘extended’, see Kharchenko et al.
)] alignments as vectors tallying the number of reads over each nucleotide in the reference.
ShortRead is one of several Bioconductor packages for sequence analysis. Biostrings has flexible tools for pattern matching, sequence alignment and manipulation. BSgenome provides facilities for representing and efficiently manipulating whole genomes. IRanges provides range-based and other expressive representations. rtracklayer provides an interface to genome browsers from within R sessions.
1.4 Advanced features
The ShortRead package includes advanced features for handling large resequencing data. In particular the large volume of data and generation in ‘lanes’ encourages a ‘block’ processing style. For instance much of the QA functionality of ShortRead can be conducted on a per-lane basis. The srapply function exploits this work flow. A typical use takes a list of file names and a function to be applied to the file. srapply applies the function to each file. Usually the function reduces the data volume in the file, e.g. from a large collection of reads to a compact summary of lane quality. The distinguishing feature of srapply is that the calculation is distributed across processors or nodes in a computer cluster, if such resources exist.