Accepted input files are specially formatted SeqMap output files (ELAND3 format) which can be obtained by running SeqMap with the option/output_all_matches. ELAND3 files contain a list of mapped sequences and associated coordinates in genome coordinate order, each line corresponding to one mapped sequence read. proTRAC basically operates with a sliding window while reading the input file. The window size in lines is given by the required minimum number of mapped sequence reads per cluster. If the region encompassed by the first and last coordinate of the sliding window is small enough to exceed the minimum density of mapped reads (hit density), the respective loci are tagged. Contiguously tagged loci are assembled to cluster candidates being subject to a subsequent verification process. The required minimum number of putative piRNA loci (= mapped reads) and minimum hit density depend upon the provided dataset and the stated minimum score values, which are explained in more detail below. In order to determine the minimum hit density, proTRAC examines the distribution of mapped sequence reads for each chromosome or scaffold and specifically calculates a significant hit density by stepwise computation of the probability to observe an increasing number of hits within a 1kb window on the given chromosome or scaffold assuming a uniform distribution of mapped sequences (figure ).
Figure 1 Determining the minimum number of piRNA loci per kb. Pn (blue points) refers to the probability to observe n hits (n) per kb. f(x) (yellow line) defines the function which extends the probabilities to positive rational numbers. P(x≥ (more ...)
In order to obtain hit density threshold values with a resolution more precise than in steps of 1 hit/kb, proTRAC calculates probabilities (P
) for non-integer hit numbers (k
) per kb with an increment of 0.1 by using factorials deduced from the gamma function: Γ
) = (n
-1)!. The stepwise calculation of P
) continues until the probability reaches the significance level. Then, the minimum hit density is defined as k
The minimum number of putative piRNA loci per piRNA cluster corresponds to the minimum number of loci needed to reach all stated score values (log10 of reciprocal probabilities) with the given dataset.
If necessary, since overlapping sections of proper hit density can result in one large cluster candidate which in sum falls below the minimum hit density, proTRAC performs a stepwise clipping of peripheral hits to find the largest possible cluster candidate. During a progressive process with an increasing number of hits, in each step all possible combinations of upstream and downstream hits are clipped from the cluster candidate ends and the effect on hit density is assessed for each combination. This process continues until a sufficient combination with the minimal number of hits is found and the hit density of the remaining cluster, comprising the highest possible number of putative piRNA loci, exceeds the required minimum. The removed hits, potentially forming a separate cluster, are assessed subsequently as a separate cluster candidate.
Since piRNA clusters are typically organized in a mono- or bidirectional manner proTRAC determines directionality by comparing the degree of mono- and bi-directionality. The degree of mono-directionality is simply given by the proportion of sequence reads encoded on the main strand (the strand which encodes the majority of mapped sequence reads). To determine the degree of bi-directionality, each cluster is split stepwise between every pair of hits that yields two fragments with each fragment comprising at least 25% of all hits or at least 10 hits respectively. Subsequently, the proportion of sequence reads encoded on the main strand is computed independently for each fragment and summed pro rata. The highest attainable value accounts for the degree of bi-directionality. If one or both values exceed the user defined directionality threshold, the cluster is assigned to the respective directionality category. Otherwise it is assessed to be non-directional.
Not to rely solely on tracing regions that exhibit a high hit density, and rather additionally consider the characteristics of the clustered putative piRNA loci, proTRAC now verifies each cluster candidate by examining its: i) number of normalized hits to total hits ratio, ii) extent of strand bias, iii) proportion of putative piRNA loci with 1T or 10A respectively and iv) proportion of putative piRNA loci within the typical length range. For each parameter proTRAC assigns score values based on a probabilistic scoring system which evaluates the probabilities to obtain the observed cluster characteristics in the light of the given dataset (e.g. a score value of 2.0 corresponds to a probability of 0.01). Regarding strand bias, we assume equal probabilities for one mapped sequence read to be encoded on either plus- or minus strand. Furthermore, we presume that the number of hits per cluster is very small compared to the total number of mapped sequence reads, so that proTRAC can calculate probabilities in a sampling-with-replacement fashion, which accelerates computation without a noticeable effect on the results.
Score values for strand bias (m = hits on main strand, r = 0.5), enrichment of putative piRNA loci with optimal length (m = putative piRNA loci with typical length in cluster, r = ratio of putative piRNA loci with typical length in dataset) and enrichment of putative piRNA loci with 1T or 10A (m = loci with 1T or 10A in cluster, r = ratio of putative piRNA loci with 1T or 10A in dataset) are calculated using the following equation:
is the total number of hits in the cluster. The score value for enrichment of putative piRNA loci with 1T or 10A automatically scores either 1T or 10A enrichment (not the sum of both), so that it is suitable for datasets comprising primary as well as secondary piRNAs in both vertebrates and flies. For reasons of computational speed, proTRAC performs an approximate calculation of the score value for enrichment of putative piRNA loci corresponding to infrequently mapped reads as measured by the normalized-hits/total-hits ratio (NTR). Therefore, instead of considering the exact number of genomic hits produced by each read, which would lead to an exorbitant number of possible combinations per cluster, the entirety of mapped sequence reads (R
) is sorted into 8 groups based on the number of genomic hits per read (figure ). The number of groups was chosen as a suitable tradeoff between precision, which asymptotically increases, and computational speed, which exponentially decreases with a growing number of groups. For each of these groups containing r1
reads, the average number of genomic hits per read is calculated (a1
) and this number is ascribed to each read of the group. For each cluster comprising n
putative piRNA loci, proTRAC then calculates the minimum number of sequence reads from group 1 (m1
), 2 (m2
), 3 (m3
) and 4 (m4
) to obtain an NTR greater or equal than the observed NTR, under the assumption that all remaining reads are from group 5, 6, 7 and 8 respectively:
. Finally, proTRAC calculates the score based on the probability to obtain an NTR equal to or greater than the observed NTR by summating the probabilities for each sufficient combination:
Figure 2 Grouping of reads depending on the number of genomic hits they produce. The grouping of sequence reads allows a fast approximated calculation of the NTR value. Data in brackets: hits per read, number of reads in group, average number of hits per read (more ...)
In general, the calculation of score values implies calculation of factorials corresponding to the total number of loci within one cluster, which gets computationally more expensive with an increasing number of hits per cluster candidate. If one cluster candidate comprises more than 1000 putative piRNA loci, its total number of putative piRNA loci is set to 1000 and the other parameters are scaled down accordingly. This may lead to an underestimation of score values. However, this will not lead to rejection of true clusters, since the computation of probability by default aborts once the probability falls below 1/10100 (score = 100), which is often the case with clusters of this size. Moreover, the minimum set score values should reasonably not exceed 10.
Since it is possible, that real piRNA clusters are concealed by the presence of loci that correspond to frequently mapped sequence reads that do not originate from the cluster in question but distort its strand asymmetry, proTRAC optionally reevaluates rejected cluster candidates in that case, considering only loci that correspond to sequence reads that mapped not more than a stated maximum times to the genome, similar to the method applied by Girard et al. 2006 [4
Once clusters are identified and validated, proTRAC calculates the probability for each cluster, whether any of its observed deviations from the hypothetical uniform distribution came off by chance and deduces the probability for 0, 1-or-more and 2-or-more mistakenly annotated clusters within the obtained set of detected clusters.