Given a genomic signature (a sequence of base pairs) and a set of genomic sequences, the goal is to identify which TFBSs from a library of TFBS matrices match the occurrences of the signature within the sequences. If a signature,
, is longer than a matrix with a length
, then a matrix score can be calculated (as is done in ) for all
length substrings of
(see Wasserman and Sandelin for more details on the scoring function [9
]). Typically, a cutoff score is used and all substrings meeting or exceeding the cutoff are considered results. However, the case in which signatures are shorter than the matrices available for comparison requires special consideration. One solution is to align the signature to the matrix and report the best score. However, erroneous scores result from biases in string matching, as shown in . To overcome this problem, we propose using an expanded window of base pairs extracted from the original sequences from which the signatures were derived. The simplest approach is to use the full sequences and find the optimal alignment to each TFBS matrix using a window around the signature and every possible sub-sequence containing the signature, as shown in . However, this approach is exceedingly slow.
Fig. 1 (a) A position frequency matrix (PFM) is a count of the number of bases at each position for a set of known TFBS sites. (b) A sequence is scored by (c) adding the appropriate values across the columns. (d) The resulting raw score is transformed into a (more ...)
(a) The sequence TGTGC matches a portion of the matrix exactly, giving a score of 100% on just that portion. (b) The full window around the sequence, TACATGATGTGC, scores poorly against the matrix.
An alternative approach is to find all occurrences of TFBS meeting the desired threshold in every sequence and subsequently use this information to quickly score the signatures. The first step in this process is slow, however it outperforms the previous simple approach, as demonstrated in Section 3. As an additional benefit, this step needs to be run only once on the sequences, to provide unlimited, future analyses on the entire set of sequences or interesting subsets therein. An example of an interesting subset would be to look for coordinated expression only within sequences whose genes are highly expressed in a single tissue, such as the liver.
Once the sites of all of the TFBS are known, scoring the signatures becomes a trivial task of scanning the list of sites for overlapping matches. Besides speed, another advantage is that finding all of the TFBS provides an exact measure of the likelihood of finding each TFBS against the set of sequences. For instance, promoter regions (the region upstream of a gene where transcription is initiated) typically contain higher concentrations of G’s and C’s than the entire human genome, biasing the expected values for the TFBS matrices. Knowing the precise expected values produces a scoring scheme with higher confidences.
The steps of the algorithm are outlined in . The first and most time-consuming step is to locate every sub-sequence that exceeds a scoring threshold for each matrix. The naïve implementation is a nested loop algorithm (). In this case, the full matrix calculation is done in the function SCORE_MATRIX on line 4. This function can be improved when called from a modified algorithm (). In this case, all sub-sequences are generated first and sorted alphabetically, which allows the SORT_SCORE_MATRIX function on line 11 to cache the previous result and quickly score similar sequences. For instance, identical sub-sequences will be next to each other in the sorted list of sub-sequences. After scoring the first one, the SORT_SCORE_MATRIX function can immediately return the result without doing the costly matrix calculation. Additionally, we optimize the SORT_SCORE_MATRIX function by starting with a maximum score for each matrix and subtracting the difference between the maximum scoring sequence and the given sequence in each position. In this way, not every position of the matrix needs to be calculated and the SORT_SCORE_MATRIX function can return early for sequences that could not possibly meet the desired threshold (cutoff). Note that in the algorithm we do not include the reverse complement case, which can be accounted for by running the MOTIF_FIND algorithm again with the reverse complement of the input sequences. Also, the algorithms are presented as in-memory, and can easily be adapted to external memory using file I/O and external sorting.
Briefly, our algorithm joins two sorted lists of genomic sites, one of all matrix matches (step 1) and one of all signature matches (step 2).
The naïve implementation of MOTIF_FIND to find all matrix matches meeting a threshold (cutoff).
A better implementation of MOTIF_FIND optimizes the use of the SORT_SCORE function.
The second step, GET_SIGNATURE_SITES, is trivial and scans the sequences to locate the occurrences of each signature, by employing a string matching function. In the final step, given a list of TFBS matches and signature matches, both sorted by sequence and then index, a sort-merge join [10
] or a plane-sweep technique [11
] can be used (). In other words, both lists are scanned simultaneously, and the signature match list is only advanced (by removing the head of the list on line 9) when a match could no longer overlap TFBS matches yet to be seen in the ordered list sorted_TFBS.
The final step in the algorithm identifies sub-sequences containing a signature and scoring above a threshold for a matrix.