|Home | About | Journals | Submit | Contact Us | Français|
For genomic applications, signature-finding algorithms identify over-represented signatures (words) in collections of DNA sequences. The results can be presented as a specific sequence of bases, a consensus sequence showing possible combination of bases, or a matrix of weighted possibilities at each position. These results are often compared to a biological set of binding sites (i.e., known functional elements), which are usually represented as weighted matrices. The comparison is made by scoring the signatures against each weight matrix to identify the best option for a positive hit. However, this approach can misclassify results when applied to short sequences, which are a frequent result of signature finders. We describe a novel method using a window around the original sequences (those which the signature is based upon) to improve the comparison and identify a more significant measure of similarity. In doing so, our method transforms a list of DNA signatures into a resource of characterized binding sites with known functional roles and identifies novel elements in need of further elucidation.
The human genome contains 3 billion letters. Many of these can be implicated as having an important biological function, yet the precise activity remains an enigma . One approach is to identify small subcomponents (like words in a sentence) that have meaning. For instance, an over-represented combination of letters is suggestive of a functional entity because the statistical likelihood of such an event is low given a random distribution of genomic letters. When applied to sequences with similar functions such as promoters of genes that are co-expressed, signatures are implicated in conferring similar expression fates. In such cases, a particular sequence of letters would regulate gene expression by recruiting a protein capable of activating the entire set of genes simultaneously.
Once statistically significant words are identified, their biological role must be discerned. One can begin the process by filtering those which represent known non-functional elements enriched for a role other than cis-regulation. For instance, the human genome contains numerous types of transposable elements  that are able to use the cell’s machinery to copy themselves to other parts of the genome. While some of these repetitive elements, termed repeats, might have been exapted into useful functioning elements  most of them are considered ‘junk’ DNA. These elements are routinely excluded from evolutionary constraints that characterize functional DNA, such as purifying selection. On the other hand, the approach of finding associations with known biological words, such as transcription factor binding sites (TFBSs) is valuable for identifying enriched patterns that represent preferred binding sites for a protein. Tools for this purpose include m2transfac  and an extension of MEME . These tools compare PWMs of enriched words to PWMs of binding sites in TRANSFAC  or JASPAR  databases. Nevertheless, these tools do not work with formats other than PWMs, and they are not available as stand-alone tools. When stated as a tractable problem “given a genomic signature, what are the underlying TFBSs” the importance of mapping enriched signatures in DNA sequences onto known binding sites becomes apparent. Errors in mapping word signatures onto known binding sites can derail advancement in biological research. Here we describe a method to map DNA signatures onto libraries of TFBSs using a unique approach that aims to improve the accuracy of the comparison to determine which signatures have prior biological information and which are novel elements. We demonstrate the utility of our method with results from a prototype system (available upon request) and provide performance metrics. Additionally, with signatures, enriched words having no characterized biological function become candidates for novel functional elements, such as binding sites. Given that one-third of all transcription factors have not been identified , the list of characterized binding sites remains largely incomplete, leaving ample room for discovery of important elements.
Given a genomic signature (a sequence of base pairs) and a set of genomic sequences, the goal is to identify which TFBSs from a library of TFBS matrices match the occurrences of the signature within the sequences. If a signature, s, is longer than a matrix with a length m, then a matrix score can be calculated (as is done in Fig. 1) for all m length substrings of s (see Wasserman and Sandelin for more details on the scoring function ). Typically, a cutoff score is used and all substrings meeting or exceeding the cutoff are considered results. However, the case in which signatures are shorter than the matrices available for comparison requires special consideration. One solution is to align the signature to the matrix and report the best score. However, erroneous scores result from biases in string matching, as shown in Fig. 2a. To overcome this problem, we propose using an expanded window of base pairs extracted from the original sequences from which the signatures were derived. The simplest approach is to use the full sequences and find the optimal alignment to each TFBS matrix using a window around the signature and every possible sub-sequence containing the signature, as shown in Fig. 2b. However, this approach is exceedingly slow.
An alternative approach is to find all occurrences of TFBS meeting the desired threshold in every sequence and subsequently use this information to quickly score the signatures. The first step in this process is slow, however it outperforms the previous simple approach, as demonstrated in Section 3. As an additional benefit, this step needs to be run only once on the sequences, to provide unlimited, future analyses on the entire set of sequences or interesting subsets therein. An example of an interesting subset would be to look for coordinated expression only within sequences whose genes are highly expressed in a single tissue, such as the liver.
Once the sites of all of the TFBS are known, scoring the signatures becomes a trivial task of scanning the list of sites for overlapping matches. Besides speed, another advantage is that finding all of the TFBS provides an exact measure of the likelihood of finding each TFBS against the set of sequences. For instance, promoter regions (the region upstream of a gene where transcription is initiated) typically contain higher concentrations of G’s and C’s than the entire human genome, biasing the expected values for the TFBS matrices. Knowing the precise expected values produces a scoring scheme with higher confidences.
The steps of the algorithm are outlined in Fig. 3. The first and most time-consuming step is to locate every sub-sequence that exceeds a scoring threshold for each matrix. The naïve implementation is a nested loop algorithm (Fig. 4). In this case, the full matrix calculation is done in the function SCORE_MATRIX on line 4. This function can be improved when called from a modified algorithm (Fig. 5). In this case, all sub-sequences are generated first and sorted alphabetically, which allows the SORT_SCORE_MATRIX function on line 11 to cache the previous result and quickly score similar sequences. For instance, identical sub-sequences will be next to each other in the sorted list of sub-sequences. After scoring the first one, the SORT_SCORE_MATRIX function can immediately return the result without doing the costly matrix calculation. Additionally, we optimize the SORT_SCORE_MATRIX function by starting with a maximum score for each matrix and subtracting the difference between the maximum scoring sequence and the given sequence in each position. In this way, not every position of the matrix needs to be calculated and the SORT_SCORE_MATRIX function can return early for sequences that could not possibly meet the desired threshold (cutoff). Note that in the algorithm we do not include the reverse complement case, which can be accounted for by running the MOTIF_FIND algorithm again with the reverse complement of the input sequences. Also, the algorithms are presented as in-memory, and can easily be adapted to external memory using file I/O and external sorting.
The second step, GET_SIGNATURE_SITES, is trivial and scans the sequences to locate the occurrences of each signature, by employing a string matching function. In the final step, given a list of TFBS matches and signature matches, both sorted by sequence and then index, a sort-merge join  or a plane-sweep technique  can be used (Fig. 6). In other words, both lists are scanned simultaneously, and the signature match list is only advanced (by removing the head of the list on line 9) when a match could no longer overlap TFBS matches yet to be seen in the ordered list sorted_TFBS.
The result of running our program is a list of possible matrices matching each signature. An example for one signature is shown in Table 1, where the signature occurred twelve times in the set of input sequences. Four possible matches with the TRANSFAC database of TFBS matrices  were found. Each matrix ID is shown with its consensus sequence (in IUPAC notation: N=any base, R=A or G, K=G or T, Y=C or T, S = C or G, M = A or C). The expected value (expected number of occurrence for each sequence) is based on the occurrences of the matrix in the set of sequences, which is calculated during the motif-finding phase, and gives a relative measure of the quality of the matrix. While the signature scored well against the two SP1 variants, with an average of 96%, they only matched 4 out of the 12 full sequences. The GC_01 matrix matched all of the full sequences with a high score and has stronger evidentiary support.
As shown in Fig. 7, results of experiments on a set of genomic sequences showed that the optimized implementation of MOTIF_FIND (Fig. 5) significantly outperformed the naïve implementation (Fig. 4). Since we do not know of any methods employing a similar approach, we compared our method to the naïve implementation. This implementation (and performance) is similar to an overall naïve algorithm, which would take the set of genomic sequences and signatures as input and produce the same results. Not only is this approach slower, it is not reusable as is our approach.
Tools to identify genomic signatures often produce lists of sequences implicated in biological function. Despite the non-random occurrence of these signatures, interpreting their biological role is not a trivial task. We report an approach to increase the knowledge base available for each signature by matching the sequences to known transcription factor binding site matrices. This procedure transforms a list of signatures generated without bias toward known functional elements into a resource identifying the possible biological pathways involved in regulation at a genomic location. In addition to TFBS matching, we intend to expand the work to filter transposable elements and to consider sequence conservation across species. Importantly, the existing procedures for mapping TFBS elucidate computational problems that need refinement as genomic annotation progresses toward large-scale signature finding. In particular, scoring a library of matrices is time consuming. Algorithms or database indices could dramatically accelerate the process.
This work was supported by Intramural training program of the National Human Genome Research Institute of the US National Institutes of Health. The authors are grateful to Dr. Lonnie Welch for constructive and helpful comments.