Genomes contain a considerable number of repetitive elements known as repeats. These elements fall into two broad categories: (i) interspersed repeats or transposable elements and (ii) tandem repeats (TRs) (1). In this study, we focus on the detection of TRs. TRs occur as a result of replication slippage or DNA repair (2). A TR consists of consecutive copies of a DNA motif. These copies can be exact in the case of perfect TRs or inexact in the case of approximate TRs. Depending on the length of the repeated motif, TRs can be classified as microsatellites (MSs) (the motif length is 1–6 bp) or minisatellites (the motif length is 10–60 bp).
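As an illustration of the definitions above, the following minimal sketch locates perfect MSs (exact consecutive copies of a motif of 1–6 bp) with a regular expression. The function name and the `min_copies` threshold are illustrative assumptions and are not taken from this study; the sketch does not handle approximate TRs and is not the method used by MsDetector.

```python
import re


def find_perfect_microsatellites(seq, max_motif=6, min_copies=3):
    """Return (start, end, motif, copies) tuples for perfect tandem repeats.

    Assumptions (hypothetical, for illustration only):
    - motif length is 1..max_motif bp, matching the MS definition above;
    - a run needs at least min_copies consecutive exact copies to count.
    """
    # The lazy quantifier makes the regex try the shortest motif first,
    # so "AAAAAA" is reported with motif "A" rather than "AA" or "AAA".
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        copies = (m.end() - m.start()) // len(motif)
        hits.append((m.start(), m.end(), motif, copies))
    return hits


# Four exact copies of "CA" flanked by non-repetitive bases.
print(find_perfect_microsatellites("TTGCACACACAGGT"))
```

Coordinates are 0-based and end-exclusive. Approximate MSs, which contain mismatches or indels between copies, cannot be captured by an exact back-reference of this kind.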
MSs are important due to their documented functions and their association with cancer and other diseases. In 2005, it was demonstrated that MS polymorphism, which is due to copy number variability, can enhance the virulence of pathogens and their adaptability to the environment (2). In addition, MSs can be involved in gene regulation (3–5). Moreover, Kolpakov et al. (6) have highlighted several reported functions of MSs. Recombination enhancement has been linked to MSs consisting of a repeated GT motif (7). Further, alterations in dinucleotide MSs have been shown to be associated with cancer in the proximal colon (8). Trinucleotide MSs consisting of repeated CCG or AGC are associated with Fragile X syndrome, myotonic dystrophy, Kennedy’s disease and Huntington’s disease (9). Finally, several human triplet-repeat expansion diseases have been reported (11).
Furthermore, MSs have several biomedical applications. Ellegren (13) listed several applications of MSs in linkage mapping, population genetics studies, paternity testing and forensic medicine. In the computational biology field, it is known that masking TRs in sequences improves the performance of sequence alignment methods (14).
Several computational tools have been developed to detect and discover repeats in DNA sequences. RepeatMasker (http://repeatmasker.org/) is a widely used detection tool, which searches a DNA sequence for instances of known repeats that have been previously identified. REPuter (15), PILER (16) and Repseek (17) are examples of ab initio discovery tools, which discover repeat classes in the input sequence without relying on a library of known repeats. In addition, special-purpose tools are available for the discovery and the detection of TRs/MSs in particular. STAR (18), Mreps (6) and Sputnik (http://espressosoftware.com/sputnik/index.html) are well-known MS discovery tools. Hereafter, we use detection and discovery interchangeably. Several other tools are currently available (5). Additional tools are reviewed in (27).
However, these tools have the following limitations: (i) they require the user to adjust several parameters; (ii) the user may have to provide the filtering threshold(s) to remove spurious detections; (iii) some of the tools require a list of motifs or a library of known repeats and (iv) they may not be efficient in terms of memory or time. Two recent studies (28) have suggested that the performance of these tools varies with parameter tuning and the user-defined filtering threshold(s). Thus, based on the conclusions of these two studies, the need for a standard MS detection tool is evident.
The goal of our study is to develop just such a tool to detect MSs in DNA sequences. To this end, we have designed software called MsDetector that attempts to remedy the limitations of the currently available tools. The parameters of our software tool were optimized using machine-learning algorithms. MsDetector does not require a library of known MSs or a list of motifs. Therefore, we expect MsDetector to produce consistent results across studies. In addition, MsDetector can process a whole human chromosome in a few minutes on a regular personal computer.
We incorporated a supervised-learning approach into our design. Labeled data are required for supervised-learning algorithms. In our study, the labeled data required to train a tool to detect MSs consisted of two sets of sequences: (i) DNA sequences that are known to include MSs and (ii) DNA sequences that are not likely to include MSs. To obtain such data, we used RepeatMasker to locate MS sequences. Genomic sequences that did not overlap with MSs located by RepeatMasker comprised the other set, which is unlikely to include MSs. Then, we trained a hidden Markov model (HMM) on these two sets to detect MSs. To reduce the false detection rate, the HMM detections were processed by a filter to remove spurious detections. Again, we applied a supervised-learning algorithm to obtain such a filter. We regarded the filtering problem as a classification problem in which we distinguished between true and false detections. Therefore, we trained a general linear model (GLM) to obtain a classifier that functioned as the filter. As before, two sets of labeled data are required to train the filter. HMM detections that overlapped with MSs located by RepeatMasker comprised one of the two sets. The other set consisted of HMM detections found in shuffled DNA sequences. Human chromosome 20 and its shuffled version were divided into three segments to train, validate and test MsDetector. We followed the train–validate–test approach to make sure that MsDetector's performance during training is very similar to its performance on unseen data, i.e. to avoid over-fitting.
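The decoding step of a two-state HMM (background versus MS) can be sketched as follows. This is only an illustration under stated assumptions: the `repeat_signal` encoding, the state layout and all probabilities below are hypothetical, hand-picked values rather than the features or trained parameters of MsDetector, and the GLM filter is not shown.

```python
import math


def repeat_signal(seq, max_lag=6, word=3):
    """Hypothetical encoding: 1 where the word starting at position i
    reappears 1..max_lag bases downstream, else 0."""
    seq = seq.upper()
    out = []
    for i in range(len(seq)):
        w = seq[i:i + word]
        hit = len(w) == word and any(
            seq[i + lag:i + lag + word] == w for lag in range(1, max_lag + 1)
        )
        out.append(1 if hit else 0)
    return out


def viterbi(obs, start, trans, emit):
    """Most likely state path of a discrete HMM (log-space Viterbi)."""
    if not obs:
        return []
    n_states = len(start)
    score = [math.log(start[s]) + math.log(emit[s][obs[0]])
             for s in range(n_states)]
    back = []  # back[t][s]: best predecessor of state s at time t + 1
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + math.log(emit[s][o]))
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Hand-picked (not trained) parameters: state 0 = background, state 1 = MS.
START = [0.99, 0.01]
TRANS = [[0.99, 0.01], [0.05, 0.95]]
EMIT = [[0.9, 0.1], [0.1, 0.9]]  # EMIT[state][symbol]

signal = repeat_signal("GATTCGGA" + "CA" * 12 + "TGGCTAAC")
states = viterbi(signal, START, TRANS, EMIT)
print("".join(map(str, states)))
```

In a supervised setting such as the one described above, the start, transition and emission probabilities would be estimated from the labeled MS and non-MS sequences instead of being set by hand. Decoding visits each position once for a fixed number of states, so its cost grows linearly with the sequence length.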
MsDetector is both memory- and time-efficient. The memory requirement and the run time are linear with respect to the length of the input sequence. Due to the advantages of the supervised-learning algorithms, the user is not required to adjust any parameters or provide any filtering criteria. In sum, the contribution of our study comprises a software tool called MsDetector. The tool can locate perfect and approximate MSs. The advantages of MsDetector are as follows:
- The user is not required to optimize the parameters.
- There is no need to provide a library of known MSs.
- There is no need to specify motif patterns.
- It is efficient in terms of memory and time.
- It produces consistent results across studies.