Components and modules included in PEMer
PEMer consists of the following modular components, which are by default executed in the order given below.
Initially, PEM data are formatted into a proper structure. For example, when processing data from the 454/Roche platform, the standard 44 bp linker sequence (GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC) from the 454/Roche paired-end protocol is identified (for example, at a minimum sequence identity of 90%) and fragments split into paired ends using the linker as a seed. PEM data generated with the Solexa/Illumina or SOLID/ABI platforms are pre-processed and initially aligned to the reference genome using MAQ [28
In this step, both ends are independently aligned with the reference genome. When using 454 data, by default, a computational approach that combines efficient initial heuristic genome alignment (that is, using Megablast [32
] by default with parameters: '-p 80 -s 11 -W 11', or, alternatively, using BLAT [33
] with parameters '-fastmap') and comprehensive optimal realignment (that is, using the Smith-Waterman algorithm [41
]) is used. As mentioned above, MAQ [28
] (default parameters) is normally used for processing Solexa/Illumina or SOLiD/ABI data. In principle, any sequence alignment algorithm can be plugged into PEMer
for read alignment.
Optimal paired-end placement
In this step, when processing data from the 454/Roche platform, an adapted version of the placement algorithm [30
] is implemented to enable the identification of most plausible paired-end alignments. This is particularly important in cases where alignments are ambiguous due to the repetitive nature of the human genome. In brief, the placement algorithm executes a cost function that penalizes outlier paired-end assignments, if one or both ends display high sequence similarity to a different genomic locus and if placing the end(s) into the alternative locus would result in a non-outlier paired end (see supplementary methods and notes in Additional data file 1). When processing Solexa/Illumina or SOLiD/ABI paired-end data, PEMer
by default omits the abovementioned placement algorithm, and instead considers paired ends as optimally placed if each end unambiguously aligns against the reference genome with a MAQ [28
] 'mapping quality' of at least 20. This score-cutoff [26
] ensures unambiguous optimal placement of short reads onto the reference genome.
Paired ends are considered as outliers if they map with a relative orientation of ends, or genomic position, consistent with structurally altered genomic regions - for example, if they fall outside the expected range of paired-end spans. Paired ends falling beyond the expected range of spans are identified based on a cutoff C expressed in terms of standard deviations from the median L and the according cutoff points Ci and Cd (Figure ). The cutoff points are usually derived from simulations and depend on the span coverage λ, the cluster size N, and the distribution of paired-end spans. Alternatively, the cutoff points may be derived from experimental controls, or may be estimated directly from the sample data. Note that PEMer by default discards outlier paired ends in which both ends map to different chromosomes (that is, putative translocations).
Outliers are categorized into SVs if a cluster of N
(or more) independent paired ends is consistent with a single SV. PEMer
evaluates whether all paired ends in a cluster are indicative of the same event. In other words, a simple intersection of paired ends may be insufficient - for example, if two intersecting paired ends indicate deletions with significantly different predicted deletion sizes. (Thus, a window for the proper clustering of paired ends is defined in PEMer
, as described below in the section 'Estimating E
-values and P
-values'.) Deletions are identified from ≥ N
overlapping discordant paired ends with a paired-end span >Cd
(with the condition that both putative breakpoints are spanned). Insertions may be identified from ≥ N
overlapping discordant paired ends exhibiting a paired-end span <Ci
. Inversions are identified using ≥ N
paired ends that are discordant in terms of orientation relative to the reference genome and are consistent with a single inversion breakpoint - that is, in such a way that all paired ends span a single, common breakpoint interval. In addition to those simple SV events, more complex events [21
] may be identified by PEMer
: mated insertions are identified from ≥ N
unpaired SVs that lie in nearby genomic regions and have ≥ N
paired ends indicating a connection with a (common) distal genomic region <100 kb in size. Mated insertions may involve tandem duplications, or translocations. Unmated insertions are predicted from ≥ N
paired ends that support a rearrangement of a genomic region in which loci change relative order without changing their relative orientation (that is, both ends map to the same DNA strand).
Clusters consistent with the same SV are merged into a single cluster. This step is necessary when SVs are searched in parallel with distinct cutoffs and cluster sizes (for example, when using the multi-cutoff or simplified multi-cutoff strategy, or when results from paired-end datasets generated with different insert sizes or different next-generation sequencing platforms are combined).
Simulation of PEM experiments
We performed simulations with a diploid chromosome, which enabled evaluation of both the efficiency of SV reconstruction and the false positive rate of PEMer. In particular, we used human chromosome 2, the chromosome with the largest determined length as well as an average repeat content and gene density. While we regard the selection of chromosome 2 as a reasonable pick to save computational processing time during the simulations, future studies may use other chromosomes or entire genomes in haploid or diploid form to parameterize PEMer. SV events were randomly generated, that is, distributed uniformly on the chromosome as described above. When simulating the reconstruction of heterozygous SVs, no events were introduced on the second copy of chromosome 2, which was included only to monitor the false positive rate. For example, when simulating PEM data generated by the 454/Roche platform, first, random shearing of the sample genome was carried out by randomly picking DNA fragment lengths from a given lognormal distribution with reasonable values for median (that is, 7.8 in log-space) and standard deviation (0.29 in log-space), which both were obtained from a typical PEM experiment. Second, fragment centers were uniformly placed along the chromosomes. Third, DNA fragment circularization, random cleavage and linker read isolation were simulated by first generating read lengths from the length distribution of sequences resulting from a typical 454 run (that is, the Roche GS-FLX-system), then by placing 'centers' for the 44 bp linker sequence uniformly onto the read, then by placing the linker sequence onto that center, and, finally, by assigning sequences of DNA fragment ends to the read ends not occupied by the linker. To achieve the expected topology of paired ends in the circulated DNA (Figure ) the 5'-end of the fragment was assigned to the 3'-end of the read and vice versa. Fourth, the resulting fragments, which were in principle undistinguishable from real genomic fragments generated by the PEM method indicated in Figure , were subjected to PEMer for SV detection.
Finally, simulation parameters can be easily adapted to platforms generating short PEM end tags (see Tables S8 and S9 in Additional data file 1 for an example involving the simulation of Solexa/Illumina data).
Error models for next-generation DNA sequencing
Optionally, error models may be applied in our simulations to consider typical next-generation sequencing errors. For example, our simulations enable considering the major two causes of errors in 454 Sequencing, that is, insertion of nucleotides and homopolymer errors. In the error model we assumed that signals observed from a homopolymer of length n
follow a Gaussian distribution with mean n
and the standard deviation being proportional to the square root of n
with coefficient 0.17, while the background follows a lognormal distribution with mean 0.2 and standard deviation 0.1 (Figure S2 in Additional data file 1) [42
]. We used intersection points of the curves as cutoff points for calling a particular DNA sequence for a given signal. For instance, signals in the range 0.56 to 1.43 were called as a single sequenced nucleotide (Figure S2 in Additional data file 1), rather than a homopolymer (that is, dimer). Nucleotide flow was simulated in the following order: T, A, C, G. For every nucleotide sequenced (including homopolymers and single nucleotides) the observed signal was generated either from a background distribution - in cases where the flowed nucleotide was different from the nucleotide to be sequenced - or otherwise from the corresponding Gaussian distribution. The overall sequencing error rate was 2.5%.
For the Solexa sequencing error we approximated the average substitution rate of the Solexa/Illumina platform [44
] using a simple model involving a fourth degree polynomial. The polynomial was used to assign substitution probabilities at each base position during the simulation. If at a given sequenced position a substitution was assigned by the simulation procedure, a randomly picked, different nucleotide was inserted at the position in question. The average sequencing error rate was 1.5%.
While our simulations were designed for optimizing the parameters of PEMer and, thus, for improving the resolution over earlier approaches, we realize that future studies may aim for broader and more realistic simulations of PEM-based studies. To facilitate future simulations involving PEM data generation and scoring, we have made the code of our simulation scripts available to the public together with PEMer.
Development of a specialized breakpoint database
To allow storage, display and manipulation of SV data as well as consistency between different sets of SVs, we implemented a database module for our approach. In particular, a web-accessible database, BreakDB, was developed, which holds a variety of data along with each SV entry. A diagram of the BreakDB schematic, illustrating the database tables and their relationships, is depicted in Figure S1 in Additional data file 1. Data inserted into BreakDB can easily be manipulated for subsequent analyses of the SV data - for instance, high-resolution breakpoint information (i.e. the genomic coordinates of breakpoint-junctions), the expected overall coverage of SVs in a particular genome estimated based on the Poisson approximation [45
] (see Additional data file 1), and results of breakpoint junction analyses by BLASTN [46
] can be added to a SV entry and mined once becoming available. Therefore, the database has a versioning system, so that all changes to an event are archived and are viewable within the application. BreakDB contains information such as the coordinates, flanking and inserted sequences (in case breakpoints are known), potentially the suggested molecular mechanism leading to SV formation, and supporting evidence for the SV entry. With more SVs identified at base-pair resolution, their representation in databases becomes challenging as the coordinates of independently occurred SV events that subsequently affected the same locus may overlap in a complex fashion. To deal with such scenarios, subsequent SV events - for example, an insertion of genomic DNA followed by inversion and deletion of parts of the sequence - can be defined recursively in BreakDB (Figure ). Thus, a SV event can be defined with respect to the current version of the reference genome (build36), or, in case of complex embedded SVs, with respect to another SV. Periodically, as a SV collection becomes stabilized, a release is generated and displayed as consistent, static pages.
Estimating E-values and P-values
PEMer computes E-values and P-values for the different types of SVs identified in PEM datasets. Specifically, given a certain span coverage we can estimate the total number of optimally placed paired ends. Let us assume the span coverage is calculated as λ = NL/G, where N = number of optimally placed paired ends, L = median insert size, and G = diploid genome size; thus, rearranging N = λG /L. Now, let us introduce a set Y of discordant paired ends with Nh elements and a span beyond a cutoff H indicative of deletions, that is, all paired ends with a span larger than a cutoff expressed in terms of standard deviations SD from the mean of the distribution of spans. Note that as for H > 2 × SD the frequency of occurrence of the span lengths decreases faster than exponentially, effectively all spans in the set Y are approximately of length H.
Consider consequently placing Nh paired ends randomly onto the genome and checking whether the n-th newly placed paired end forms a proper cluster with the already placed n-1 paired ends. The probability to cluster will be:
where Wh is the effective window size applied for clustering long paired ends (i.e., paired-ends indicative of deletions), (n-1)Wh represents the number of nucleotides where the n-th paired end may fall in order to form a proper cluster with any previously placed paired end, and 0.5 G is the non-redundant length of the genome (that is, the size of the haploid genome). Note that since all paired ends are effectively of given size H, Wh/2 is simply the maximum separation between the ends of two pairs so that that they still cluster. We further assume that Y covers the genome sparsely, thereby neglecting the effect of window overlap from different paired ends. Thus, the total number of clusters will be:
where Eh is the expected number of false positive deletions, or an E-value in relation to the number of randomly clustered 'long' discordant paired ends of set Y. Similarly, one can generalize the equation for calculating the number of false positives for k (2 or more) overlapping paired ends:
By using similar reasoning, one can estimate that the expected number Es of false positive insertions is:
where Ns is the number of 'short' discordant paired ends at cutoff S indicative of insertions and Ws is the effective window size applied for clustering short paired ends (i.e., such indicative of insertions).
PEMer detects SVs by clustering long and short events separately and is flexible in terms of defining Wh and Ws, which are used to cluster paired ends based on compatibility in sizes and locations of ends. Thus, the values Wh and Ws are introduced to simplify the clustering procedure and to obtain an analytical description of the false positive rate. We estimated that when clustering two paired ends the window sizes correspond roughly to the cutoff used for clustering Wh ~H and Ws~ S. By comparison with simulated data we also found that the effective Wh doubles when tripling the minimum required number of paired ends in a cluster while Ws gets twice longer when doubling the minimum required number of paired ends in a cluster; therefore, we obtain the general dependency Wh~ (k + 1)H/3 and Ws ~ kS/2. Thus, the amount of false positives is estimated as:
In order to calculate P
; the chance that a given event with a given cutoff occurred at random), we need to convert the number of randomly generated events E
into the probability that an event will happen at least once. Assuming the number of false positives in a given experiment follows a Poisson distribution with mean E
, we can use the known analytical expression P
= 1 - exp(-E
for clusters of 'long' paired ends and
for clusters of 'short' paired ends. Here, Nq
is the quantile of events corresponding to the 'shortest'/'longest' paired end of length Q
in the cluster defining a deletion/insertion.
Examples of E-values for identified SVs are given in Table S2 in Additional data file 1. We usually operate with E-values rather than with P-values, as they can be more intuitively understood. For E < 0.01, P and E are nearly identical.