The algorithm is implemented in C, and the program is wrapped by a Perl script that provides a user-friendly web interface (8). A public version of the Match™ tool is available at http://www.gene-regulation.com/pub/programs.html#match. The same program under a different web interface can be found at http://compel.bionet.nsc.ru/Match/Match.html. An advanced version of the tool, called Match™ Professional, is available at http://www.biobase.de. It makes use of the whole TRANSFAC Professional matrix library, whereas the publicly available version of Match has access only to the matrices of the TRANSFAC public version. This public library is comparatively small because it does not contain most of the matrices generated by the TRANSFAC team. Match™ Professional contains a number of tissue-specific profiles that are not included in the public version. In addition, the so-called ‘best_selection’ profiles are accessible only with Match™ Professional. These profiles contain selections of the most reliable matrices, and their cut-offs are optimised using well-prepared sets of real binding sites from TRANSFAC (in contrast to the default profiles, where cut-offs are optimised using an oligonucleotide generation approach; see minFN above). Match™ Professional also provides a tool for matrix construction that is not included in the public version of Match. This tool allows users to construct their own matrices from a set of aligned sequences.
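The text does not describe how the matrix construction tool works internally. As a rough illustration of the general idea only, the sketch below builds a plain nucleotide count matrix from a set of aligned sites of equal length. The function name `build_count_matrix` and the example sites are hypothetical and are not taken from Match or TRANSFAC.

```python
from collections import Counter

ALPHABET = "ACGT"


def build_count_matrix(aligned_sites):
    """Return one {base: count} dict per alignment column.

    Assumption: sites are gap-free, already aligned and of equal length.
    """
    if not aligned_sites:
        raise ValueError("need at least one aligned site")
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")

    matrix = []
    for column in zip(*aligned_sites):
        counts = Counter(base.upper() for base in column)
        # Keep a fixed A/C/G/T layout; bases outside the alphabet are ignored.
        matrix.append({base: counts.get(base, 0) for base in ALPHABET})
    return matrix


# Hypothetical aligned sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TTACTCA", "TGACTAA"]
pwm = build_count_matrix(sites)
for position, row in enumerate(pwm, start=1):
    print(position, row)
```

A real tool would typically go further, for example by weighting positions by their information content, but the count matrix is the common starting point.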
The Match™ user interface is shown in Figure . It has been designed so that the user has all necessary parameters available on one screen. The left panel is used to paste the sequence (or several sequences) and to specify the name of the search. The right panel contains three major sections: matrix selection, cut-off selection and profile selection.
The matrix selection section lets the user select the taxa (vertebrate, insects, plant, fungi or all). The ‘High quality’ option restricts the search to the high-quality matrices only. These are the approximately 70% of TRANSFAC® matrices that are characterised by the lowest false positive rates. We selected these matrices using the following criterion: when a matrix is used with a cut-off that allows a false negative rate of 50%, the frequency of matches found in exon2 sequences (the false positive rate) must drop below 1 match per 1 kb. The choice of three cut-off sets (minFN, minFP and minSUM) is also provided in the matrix selection section. Alternatively, the user can select uniform MSS and CSS cut-offs (e.g. 0.7 and 0.75) that will be applied to all matrices.
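The high-quality criterion and the uniform cut-offs can be sketched as two small predicates. This is a minimal illustration under stated assumptions: the record layout `MatrixStats`, the example counts and the use of `>=` when comparing scores with cut-offs are hypothetical, and the defaults simply reuse the example values 0.7 (MSS) and 0.75 (CSS) from the text.

```python
from dataclasses import dataclass


@dataclass
class MatrixStats:
    matrix_id: str       # hypothetical identifier
    exon_matches: int    # matches in exon sequences at the 50% false-negative cut-off
    exon_length_bp: int  # total length of the scanned exon sequences


def matches_per_kb(stats):
    """False positive frequency expressed as matches per 1 kb."""
    return stats.exon_matches / (stats.exon_length_bp / 1000.0)


def is_high_quality(stats, max_per_kb=1.0):
    """Criterion from the text: fewer than 1 match per 1 kb."""
    return matches_per_kb(stats) < max_per_kb


def passes_uniform_cutoffs(mss, css, mss_cutoff=0.7, css_cutoff=0.75):
    """Apply the same MSS and CSS cut-offs to any matrix (>= is an assumption)."""
    return mss >= mss_cutoff and css >= css_cutoff


# Hypothetical numbers, for illustration only.
good = MatrixStats("M_A", exon_matches=40, exon_length_bp=100_000)
poor = MatrixStats("M_B", exon_matches=250, exon_length_bp=100_000)
print(is_high_quality(good), is_high_quality(poor))
print(passes_uniform_cutoffs(mss=0.82, css=0.90))
```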
The profile selection section offers an alternative way of defining the parameters of the search. A profile is a subset of matrices with defined cut-offs. The user can choose one of the predefined profiles (created by the Match™ team) or build their own profile using the associated web tool called ‘Profiler’. In the ‘Profiler’ the user can flexibly select different matrices from the whole TRANSFAC® matrix library, define cut-offs either individually or for all matrices in the selection at once, and save the profile under a new name. The user can also modify some of the existing profiles. A number of useful predefined profiles are provided by Match™, including a small set of the best matrices called ‘best selection’ and several tissue-/cell type-specific (liver, muscle, immune cells) or process-specific (cell cycle) profiles. To build such profiles, we collected for each profile a group of transcription factors known to be active in the particular tissue or process, with the help of information from the TRANSFAC® database. Matrices linked to these transcription factors in TRANSFAC® were then retrieved. When more than one matrix was linked to a transcription factor, we chose the matrix that had the lowest false positive rate.
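The profile construction described above can be sketched as a small data structure. This is only an illustration: the factor names, matrix identifiers, false positive rates and cut-off values below are hypothetical, and the dictionary layout is an assumption rather than the format used by the ‘Profiler’ tool.

```python
def pick_lowest_fp(matrices_by_factor):
    """For each factor, keep the linked matrix with the lowest false positive rate."""
    return {
        factor: min(candidates, key=lambda m: m["fp_rate"])["id"]
        for factor, candidates in matrices_by_factor.items()
    }


def build_profile(name, matrix_ids, cutoffs):
    """A profile: a named subset of matrices, each with its own (MSS, CSS) cut-offs."""
    return {"name": name, "matrices": {mid: cutoffs[mid] for mid in matrix_ids}}


# Hypothetical factor-to-matrix links and cut-offs, for illustration only.
linked = {
    "factor_A": [{"id": "M_A1", "fp_rate": 0.8}, {"id": "M_A2", "fp_rate": 0.3}],
    "factor_B": [{"id": "M_B1", "fp_rate": 0.5}],
}
cutoffs = {"M_A1": (0.80, 0.85), "M_A2": (0.75, 0.90), "M_B1": (0.70, 0.75)}

chosen = pick_lowest_fp(linked)
profile = build_profile("hypothetical_tissue_profile", sorted(chosen.values()), cutoffs)
print(profile)
```

Keeping the cut-offs inside the profile, rather than as global settings, mirrors the definition in the text: each matrix in a profile carries its own thresholds.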
After the form is submitted to the server, the Match™ program searches for TF binding sites according to the given parameters. The output of the Match™ program is shown in Figure 2. Every match found by the program is shown on a separate line in the results table. Each line contains: matrix ID, position of the match, strand [(+) or (−), indicating the matrix orientation in the match], two scores of the match, the corresponding subsequence and the names of transcription factors associated with the matrix. Note that the position of the match is always given relative to the (+) strand of the sequence. A simple visual representation of the locations of the matches is generated after pressing the ‘graphic’ button (Fig. 2B). Sites are shown above the sequence, and the orientation of the ‘>’ sign corresponds to the (+) or (−) location of the sites. The name of the matrix is given as well. In Figure 2 we show the results of a Match™ search in the promoter of the human gene for IL-12 using the predefined immune cell-specific profile. Three sites that are known in this promoter (see TRANSFAC® database) were found by Match™ (shadowed in Fig. 2) along with a number of new sites. The relatively low number of known sites among the numerous predicted sites can be explained, first of all, by the very limited knowledge obtained so far about real functional sites in genomes. Taking into account the complexity of the regulatory functions maintained by gene promoters, which have to be encoded in their structure by a system of TF site combinations (10), we can speculate that many more TF sites will be revealed experimentally in the near future. All predictions obtained by a Match™ search can be considered a source of well-supported hypotheses for further experimental verification.
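The fields of one result line and the (+) strand position convention can be sketched as follows. This is an assumption-laden illustration, not the program's actual output code: the `Match` record, the helper names, the order of the two scores, and the use of ‘>’ for (+) and ‘<’ for (−) in the text track are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Match:
    matrix_id: str       # hypothetical identifier
    position: int        # 1-based start, always on the (+) strand
    strand: str          # "+" or "-": orientation of the matrix in the match
    core_score: float    # assumed to be the CSS
    matrix_score: float  # assumed to be the MSS
    site: str            # matched subsequence


def plus_strand_start(seq_len, revcomp_offset, site_len):
    """1-based (+) strand start of a site found at a 0-based offset
    on the reverse complement of a sequence of length seq_len."""
    return seq_len - revcomp_offset - site_len + 1


def render_track(seq, match):
    """Draw the site above the sequence; the arrow direction encodes the strand."""
    arrow = ">" if match.strand == "+" else "<"
    marker = " " * (match.position - 1) + arrow * len(match.site)
    return f"{marker} {match.matrix_id}\n{seq}"


# Hypothetical sequence and match, for illustration only.
seq = "ACGTTGACGT"
hit = Match("M_X", position=3, strand="-", core_score=0.95, matrix_score=0.91, site="GTTG")
print(render_track(seq, hit))
```

Reporting every position on the (+) strand means that matches from both orientations can be sorted and drawn on a single coordinate axis.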
Figure 2. Match™ output. (A) Tabulated result page. Every match contains: matrix ID, position of the match, strand [(+) or (−)], two scores of the match, corresponding subsequence and names of transcription factors associated with the matrix.