A metamotif is a generative model for PWM motif columns that can be used to represent a gapless alignment of position weight matrices. For each PWM position
i (multinomial column) there exists a Dirichlet distribution in the metamotif column at position
i. A metamotif is therefore a parameter configuration for a product Dirichlet distribution where position
i of the motif alignment model corresponds to parameters
αi. More intuitively, a metamotif of length
k is a probability distribution over PWM motifs of length
k (Figure ). Sampling motifs from the metamotif is also possible (samples drawn from a forkhead metamotif are shown in Additional File
1 Figure S1). This is analogous to computing a probability score for a sequence
k-mer to measure the probability of the
k-mer having been generated by a PWM, and drawing
k-mer samples from a PWM.
Below we first describe the metamotif in detail. We then expand the use of the model beyond simply constructing metamotifs as global gapless alignment models of motif sets. This expansion is made possible by a Markov Chain Monte Carlo (MCMC) metamotif inference algorithm that simultaneously estimates multiple weakly represented metamotifs from a potentially large set of motifs. Illustrative examples of metamotifs are shown in Figure alongside the input motifs for which the metamotifs were estimated.
Metamotif inference by nested sampling
The metamotif can be seen as a way to summarise a gapless alignment of motifs of a certain length, to yield a probability distribution of motifs. However, our goal in designing the metamotif framework was to describe recurring patterns seen in sequence motif data deposited in public motif databases such as TRANSFAC [
37], JASPAR [
38] or UniPROBE [
39]. Many sequence motif families cannot be described accurately by global gapless multiple alignments of motifs at a fixed length. Motifs can for example consist of shorter repetitive signals, such as in the case of the heat-shock factor (HSF) motifs (Figure ), or the basic Helix-Loop-Helix (bHLH) motif family that are completely or partially palindromic due to their dimeric binding mode [
40]. Inspection of the HSF motif set shows that a global alignment of its columns does not describe the regularly spaced fivebase repeat that is observed as part of the motifs in opposing orientations (aGAAn/nTTCt) [
41]. Furthermore, even non-repetitive and non-palindromic motifs present challenges for gapless multiple alignments: the span of informative columns contributing to familial patterns in publicly available PWM data is often unclear because of different signal-to-noise ratios and varying information content criteria used for calling motif ends.
We wanted to develop an inference algorithm that allows simultaneous detection of
n short metamotif signals from a set of motif data, allowing for varying length for different metamotifs, and optionally free orientation (signal present on either strand). The metamotif count
n is a fixed, user settable parameter to the algorithm. For metamotif inference problems where
n is expected to be large, the choice for the parameter should be informed by prior information of the motif set under study, for example clustering of the motifs to estimate a rough number of recurring motif segments. Each metamotif has
a priori an unknown length between
lmin and
lmaxcolumns, and is expected to contain one or more matches in a fraction
f of motifs. Motifs in the framework are thought to be generated by recurring metamotif patterns, each of which is potentially shorter than any of the motifs, and background positions that model "uninteresting" sections of the motifs (positions not emitted by any of the metamotifs). The background model in the framework is the maximum likelihood (MLE) Dirichlet distribution estimated from all the motif columns in the input data. It is computed with the optimisation procedure described in [
42].
Our metamotif inference algorithm is a variant of the NestedMICA nested sampling algorithm described in detail in [
43]. Nested sampling, originally introduced by [
44], is a generic Bayesian MCMC sampling strategy that allows drawing samples from a posterior distribution and directly estimating the evidence (marginal likelihood) of the model. In brief, nested sampling operates on an ensemble of states ranked by their likelihood. The initial state of the algorithm is drawn by sampling uniformly from the prior distribution. At each iteration, the state with the lowest likelihood is removed from the ensemble and replaced with a new state whose likelihood is bounded to be higher than that of the removed state. New states are created either by drawing samples from the prior distribution of states, or by decorrelating other randomly chosen states in the ensemble.
The metamotif inference algorithm allows estimating n metamotifs for a set of p motifs, with a variable metamotif length between lmin and lmax columns, and an expected fraction f of motifs containing any one of the n metamotifs. This is analogous to the NestedMICA motif inference algorithm that estimates multiple motifs with varying length from an expected fraction of nucleotide or protein sequence data. The posterior distribution being sampled is over the sets of m metamotifs and so-called mixture matrices, given the motif data and a background model for the motifs. The mixture (or occupancy) matrix describes the pairing between metamotifs and motifs. The term mixing matrix is a reference to the algorithm treating pattern recognition as an independent component analysis problem: a likelihood function is written for the observations (the motif set) and the motif set is assumed to be generated as a mixture of independent metamotif contributions and noise represented by the background model. Each element Qi, j in the n × p mixing matrix Q is a binary indicator of the metamotif j being present one or more times in the motif i. If the metamotif is present, Qi, j = 1, otherwise Qi, j = 0. The likelihood of the motif set given the metamotif set is simply the product of likelihoods of each individual motif given the metamotif set and the mixture matrix.
Likelihood of a motif given a set of metamotifs is calculated assuming the motif is emitted by a Hidden Markov Model (HMM) we call the multipleuncounted motif-metamotif HMM mixture model, or MUMM (MUMM with metamotif count 2 is given in Additional file
1 Figure S2). This formulation allows for each motif to contain multiple metamotifs simultaneously, without the need to iteratively repeat sampling after masking previously inferred signals. Computing the likelihood of a motif given metamotifs under the MUMM model involves completing one-dimensional dynamic programming from the beginning of the motif to column
c, closely in the same form as the protein or nucleotide sequence likelihood function described in [
43] (Equation 6).
Lc represents the likelihood of all metamotif and background column arrangements (paths) in the input motif up to the column
c.
M is the set of metamotifs that have a mixing coefficient of 1 for the motif under consideration (i.e. metamotifs marked to be present in the motif in the mixing matrix
Q), and |
M| is the number of metamotifs that have a mixing coefficient 1. The length of the metamotif
α is represented by
lα.
Bc is the probability that the motif column at position
c was emitted by the background. For the motif
X of length
lX the transition probability
t to a metamotif is defined as
t = 1/
lX, i.e. one metamotif is expected per motif, and any motif position is equally likely to contain a transition.

is the probability that the motif segment from
i to
j was emitted by a metamotif
m, and it is given by the metamotif density function (Equation 4). A metamotif can optionally be allowed to be present on either strand to improve our ability to detect repeating (e.g. palindromic) features. Alternating orientation of metamotifs are achieved simply by summing the probability contributions

of the metamotif
α and its reverse complement at all possible offsets.
Accounting for incomplete metamotif matches in a motif is an important consideration. This is because we wish to analyse data from different experimental and computational sources where motif start or end positions have not been chosen consistently, for instance with an information content criterion. Incomplete hits are accounted for by adding lmax additional "un-informative" columns in the input motifs in both the 5' and the 3' motif ends. The un-informative columns are multinomial distributions that match the mean nucleotide weights of the background model Dirichlet distribution. This effectively allows all possible offsets of the metamotif that overlap the motif with at least one column, whilst associating more uncertainty to those columns supported by only a subset of the motif data.
Performance of the metamotif inference algorithm was tested using synthetic motif sets where samples from known metamotifs were inserted, or "spiked". The aim was to measure the relative frequency of metamotifs at which the expected metamotifs could be recovered by the algorithm from synthetic motif data containing metamotif instances. The evaluations were done in two stages. The ability of the algorithm to infer a single metamotif presented to it was tested first. After that, several metamotifs were presented to the algorithm to assess the ability of the algorithm to infer multiple metamotifs simultaneously.
To prepare the synthetic motif sets, metamotifs were first generated of examples of three structurally diverse TRANSFAC 12.2 PWM families: six forkhead motifs (class 3.3 in TRANSFAC classification), six GATA-like Cys
4 zinc finger motifs (class 2.1) and five MADS box motifs (class 4.4) were used (source motifs shown in Additional file
1 Figure S3). This was done by aligning the input each of the three motif sets with a greedy gapless sequence motif multiple alignment method similar to the one utilised in STAMP motif toolkit [
45]. A metamotif was then estimated from the motif multiple alignments with the MLE method from [
42]: MLE Dirichlet distribution was computed for motif alignment columns (example seen in Figure ), with each motif column in the alignment mapping to a MLE Dirichlet distribution in the resulting metamotif.
Motifs (PWMs) from each of the three familial metamotifs were sampled in relative frequencies of 0%, 10%, 20%,...,100%, into synthetic input motif sets (separate input motif set per motif family). Each synthetic motif set contained 60 motifs, each 20 nucleotide columns long, with a maximum of one metamotif instance allowed per input motif. The synthetic motif columns in the input motif sets are samples from a Dirichlet distribution with parameters
α = {0.5, 0.5, 0.5, 0.5}. The metamotif sample PWMs were inserted at random positions within the 20 nucleotide long synthetic motifs (example GATA motif set given in Additional file
1 Figure S4A,B,C). The metamotif inference algorithm was then run on the motif set to infer a single metamotif between length ranges 4 and 14, allowing for the signal to be present in either orientation (
-numMetamotifs 1 -revComp -minLength 4 -maxLength 14).
Metamotif inference performance was measured qualitatively with visual inspection comparing the inferred metamotifs to the known spiked metamotifs, and quantitively measuring the Cartesian distance between the metamotif mean nucleotide weights. The visual comparison, Cartesian distances and empirical p-values for observed metamotif-metamotif distances are presented in Figure . The evaluation shows that metamotifs can be inferred from motif sets that contain them with relative frequencies of even 10%. At a relative frequency of 40% and above all three recovered metamotifs are found to be relatively similar to the respective source metamotif (Figure ), although missing columns from metamotif ends are seen up to the relative frequency of 70% in the case of the forkhead and GATA motif families.
The ability to predict multiple metamotifs was demonstrated in a second evaluation experiment where instances of all the three motif families were inserted into synthetic motif sets and the algorithm was required to infer three metamotifs. It was shown that the algorithm was able to infer multiple metamotif models concurrently at a relative frequency as low as 20% (Additional file
1 Figure S5).
We demonstrated use of the metamotif nested sampling algorithm in inferring familial metamotifs from known experimentally determined regulatory motifs from the TRANSFAC database [
37]. Motifs retrieved from TRANSFAC were first divided to clusters with the SSD distance metric described in [
3] with cutoff 6.0. Three metamotifs were then inferred from each of the resulting clusters. Examples of metamotifs inferred are shown in Figure . The metamotif nested sampler algorithm was found capable of detecting several recurring patterns from the motif clusters that are clear upon visual inspection of the motifs, in addition to finding overliers from the motif sets (Figure ).
Our results show that the metamotif nested sampler algorithm, the first metamotif inference algorithm to be described, can discover detailed structure from motif sets. We show the algorithm is capable of correctly inferring multiple familial metamotif patterns. This makes the algorithm applicable for example for finding redundant motif patterns from large scale de novo inferred motif predictions from different algorithms, or for inferring a complete set of familial metamotifs from a set of motifs.
Metamotifs as a position weight matrix prior
De novo gene regulatory motif inference algorithms commonly suffer from lack of sensitivity when applied to large collections of long eukaryotic promoter sequences, rendering it impossible to describe complete sets of regulatory motifs with sequence alone. We therefore wanted to see if prior biological knowledge in the form of familial metamotifs could be used to improve the sensitivity of a motif inference algorithm. This was motivated by the work by [
35] which showed familial tendencies in the motifs of sequence specific transcription factors can improve motif inference sensitivity. We therefore wanted to test if a probabilistic model of the observed spectrum of DNA specificity of known transcription factor motif families expressed as a metamotif could be used to improve the sensitivity of motif inference from novel sequence.
We extended the NestedMICA motif inference algorithm to accept a series of metamotifs as a position specific prior probability function for motifs. The NestedMICA algorithm was chosen for the purpose as it is known to perform well in large scale motif inference tasks [
3,
46], and because the previously published version of the algorithm applies an uninformative Dirichlet prior probability function to motif columns (all allowed nucleotide weight combinations equally probable
a priori), which made it straightforward to change the algorithm to take use of column-specific informative Dirichlet distributions.
The prior probability of motif X given a metamotif α is taken as the sum of metamotif densities of α with all continuous motif segments contained in X that have the same length l as the metamotif (log of the density is given by Equation 4). A segment of motif X refers to a motif formed from columns of the motif starting from column i and ending at position i + l - 1. The prior probability of a motif given series of metamotifs is simply the sum of prior density contributions of each of the metamotifs.
To test the performance of the metamotif prior, we conducted simulation experiments following the same principle as described for the NestedMICA [
46] and the BayesMD [
47] algorithms. Human intronic nucleotide sequence fragments randomly chosen from the
Homo sapiens Ensembl database release 50 [
48]) were 'spiked' with five different types of motifs. The motifs used were those of ZAP1, HIF1, TBX5, TAL1 and NF-
κB transcription factors. These motifs were selected because they showed little similarity with each other when aligned, and because this set contains examples of differing motif length and information content. All sequence sets used contained 200 sequences, and the total length of the sequence was varied between 100, 200,...,2000 nucleotides. The nucleotide
k-mers sampled from each of the five PWMs in the evaluation were inserted at a constant relative frequency of 20% of the sequences, with a maximum of one motif present per sequence. In other words, motif density was varied by inserting the motif instances to sequences of different lengths. Motifs of only one kind were present in each synthetic sequence set.
Motif inference with three types of prior functions were tested with the sequences: firstly a single familial metamotif contributing to the prior function was used. Secondly we tested a prior function with all of the five unrelated metamotifs contributing to the prior, with instances of only one motif family being actually present represented in the sequences. Thirdly, we used an uninformative PWM prior similar to the previously published NestedMICA version 0.8. In each of the motif inference runs, the longest sequence length at which the algorithm infers the correct motif of interest is reported as a measure of sensitivity (
p < 0.05), with motif comparison
pvalues computed as described in [
3]. In all cases, five motifs were inferred from the sequences (as recurring sequence motifs tend to be found from intronic sequences, we cannot assume that the spiked motif would be the only signal present). The sequence background model used in all evaluations of the algorithm was a 4-class 1
st order trained from the 2000 nt long intronic sequences with
nmmakebg.
The source motifs (ZAP1, HIF1, TBX5, TAL1, NF-κB) were transformed to metamotifs to be used in the metamotif prior function by applying a pseudocount of 0.1 to the motif column weights, and interpreting the resulting motif nucleotide weights as mean nucleotide weights in Dirichlet distributions with precision set at 4.0 (metamotifs used in the experiment shown in Figure ). The metamotif priors used in the prior function evaluation were constructed from known PWMs with a set precision and pseudocounts to test the most common hypothesis testing use of a motif prior function: user is aware of a set of potentially relevant motifs or consensus strings present in a sequence set and wants to inform the algorithm of them to increase sensitivity.
The informative motif prior function was shown to dramatically improve the capacity of the Nested-MICA algorithm to resolve weakly represented sequence motifs presented to it in longer nucleotide sequences, especially those with larger number of columns and higher information content (Figure ). Both the baseline motif inference sensitivity and the effect of the informative weight matrix prior was seen to vary based on the length (likely due to the information content) of the motifs, ranging from as high as fourfold difference in the motif recovery length for TAL1 and NFKappa-β, to only a 1/3 improvement from 400 bp to 600 bp sequence between the uninformative and the 'single' informative metamotif prior for the TBX5 motif. Sensitivity to infer all but the TBX5 motif showed improvement with the combinatorial metamotif prior over the uninformative prior, with the effect less pronounced than when the prior function only contained a contribution from the correct metamotif. These results suggest that the metamotif prior is performing as hoped, and multiple metamotifs can be combined to a 'combination metamotif prior function' to detect motifs where information is available of candidate motifs.
We wanted to ensure that the metamotif prior did not have the propensity to bias motif inference to an incorrect solution, i.e. that it does not encourage the inference of a motif not supported by the sequence data. We tested this by spiking intronic sequence with the NF-
κB motif, and using the ZAP1-like metamotif in the prior function. No motifs similar to ZAP1 (whose instances were not present in the sequences) were recovered from the spiked intronic sequence between lengths 100 and 2000 (comparison with distances and
p-values shown in Additional file
1 Figure S6), indicating that the metamotif prior function does not have an adverse effect on inference specificity. A number of other combinations of spiked motifs and inaccurate informative metamotif prior functions were also tested, with no observed tendency for the algorithm to infer a motif that not supported by the sequence data (data not shown).
The metamotif prior extension to the Nested-MICA algorithm is designed to function with any number of metamotif models, input PWMs or IUPAC consensus sequences. PWMs are treated as metamotif priors by interpreting its columns
i as the

[
X|
α] of a metamotif and applying a constant precision
α0 to all columns of the metamotif. IUPAC consensus sequences are first transformed to PWMs by applying pseudocounts and then transformed similarly as PWMs. Metamotifs inferred with our framework could also be applied to other Bayesian motif inference algorithms which model a prior distribution over motif columns. Metamotifs could therefore be useful in building large or even complete regulatory binding site motif libraries for novel genomes.
Predicting regulatory motif type with metamotifs
The metamotif is shown above to improve motif inference sensitivity dramatically when used as a Bayesian PWM prior. We however also wanted to test if metamotifs could be used to form functional predictions for novel motifs based on a database of previously known motifs. We developed a motif classification tool for the purpose, termed metamatti, for metamotif based automated transcription factor type inference.
The principle of our motif classifier is to compute the density function (Equation 4) of a large dictionary of familial metamotifs along the length of training set motifs, effectively "scanning" weight matrices with metamotifs. The optimal (maximum) and average metamotif densities of each metamotif with the motif are then extracted as features in a random forest classifier that tries to infer the TRANSFAC superfamily (Figure ) or TRANSFAC family (Figure ) of the motifs. Random forest classification was chosen as the machine learning framework because it generalises naturally to multi-class problems and provides unbiased classification error estimates as part of model training [
49]. The framework also automatically controls the sparsity of the feature set that contributes to the classification. The full list of feature types used in the
metamatti classifier are shown in Table .
| Table 1Features calculated for motifs in the metamatti classifier |
All motif families with at least 10 representatives were retrieved from the TRANSFAC 12.2 database [
37], totalling 623 motifs of 13 domain families. For the motif domain superfamily classifier comparison made with MotifPrototyper [
36] (Figure ), the set of motifs was reduced further to include only motifs annotated in TRANSFAC with the four superfamilies classified in [
36]. Similarly, for the motif family classification comparison with SMLR [
34] (Figure ), only motifs of the same six major classes classified in [
34] were included in our training set.
Most features in the classifier are metamotif probability density scores of the training set motifs with metamotifs trained from motifs of each of the motif families. To compute the density features, we chose to first divide the motifs of each family into smaller clusters by complete linkage hierarchical clustering them with the SSD metric described in [
3] and cutting the clusters at a lenient clustering cutoff of 6.0, resulting in 65 motif clusters. Three metamotifs were trained from each motif cluster with
nmmetainfer, resulting in 195 metamotifs to be used in the motif classifier (examples seen in Figure ). Metamotif length was constrained between 6 and 15 columns, and the expected usage fraction was set at 0.5.
Motif classification performance of
metamatti was compared to two methods with a related goal: MotifPrototyper [
36] which classifies motifs into four TRANSFAC superfamilies (zinc coordinated, helixturn-helix,
β-scaffold,basic), and SMLR which classifies motifs into six major classes of TF domains (Cys
2His
2 and Cys4 zinc fingers, homeodomains, forkhead domains, basic helix-loop-helices and basic zipper domains) [
34].
Classification accuracy comparison shows that
metamatti outperforms MotifPrototyper [
36] (Figure ) across all motif superfamilies. There are several possible reasons for the difference in performance. Firstly, the metamotif inference algorithm is not constrained to a fixed motif family model column count, unlike the algorithm utilised in MotifPrototyper. Secondly, training several metamotifs per motif family,
metamatti also accounts for the fact that not all columns in motif families can be accurately expressed as a single column wise probability distribution. Instead, recurring patterns in a motif set can be generated by multiple potentially shorter familial metamotif components in our model. Thirdly, the metamotif estimation algorithm treats some motif columns as noise with a column background model, improving our capacity to find the recurring patterns from sequence motif sets and reducing over-fitting of familial models due to reporting weak or nonexistent recurring trends.
Motif family level classification was conducted with the same subset of TRANSFAC 12.2 PWMs that were classified with SMLR by Narlikar and Hartemink [
34]. The overall classification accuracy comparison shows that
metamatti has a marginally improved performance at 89.5% classification accuracy over the 87% reported for SMLR. However, the class-by-class accuracy figures (Figure ) and the confusion matrix of the 6-way TRANSFAC motif family classifier (Table ) make it evident that the higher classification accuracy comes at the cost of a 14% drop in the classification accuracy of the bHLH family (89% accuracy with SMLR, 75% with
metamatti). The E-box motif CAGGTG appears to be the most common type misclassified in the erroneous bHLH motif cases. Inspection of motif family assignments in the TRANSFAC database shows that closely similar motifs with the CAGGTG consensus have been annotated with all of bHLH and C2H2 zinc finger families. In fact it has been proposed that certain bHLH and Snail-like C2H2 like factors are thought to bind with closely similar specificities to compete for the same binding site positions [
50], highlighting a general limitation of a sequence PWM feature based motif family classification methods.
| Table 2Confusion matrix of the 6-way TRANSFAC motif classification |
Clustering of the motifs and training metamotifs from motif clusters was motivated by the requirement to choose a value for the metamotif count parameter of the metamotif inference algorithm, and to limit the metamotif search space. Inspection of clusters at cut off 6.0 showed no clusters with more than three strongly distinct recurring patterns. Although for many motif clusters there were clearly less than three distinct recurring metamotif patterns present at the clustering cutoff of 6.0, the metamotif inference algorithm was found to treat these cases by either inferring closely similar duplicate metamotifs (such as metamotifs 1 and 2 in Figure ) or short metamotifs with mean nucleotide weights with low information content, or occasionally splitting the metamotif segments in several independent parts. This suggested to us that that together with a sparse machine learning strategy such as a random forests, it would be advantageous to choose a high metamotif count that described the input motif set in as much detail as possible, with the price of some potentially redundant features in the feature set (densities for duplicate or low information metamotifs). We validated this assumption by retraining the classifier with two metamotifs per cluster (a total of 130 metamotifs). The classifier trained with two metamotifs per family resulted in a mild decrease in the classification accuracy (88.4%, as opposed to 89.5% with three metamotifs per cluster), suggesting that the additional metamotifs were indeed informative.
We also wanted to assess the significance of the metamotif density score in the metamatti classifier by comparing it to a more naive classifier where we replace the metamotif average and maximum scores with average and maximum SSD distances computed between the training set motifs and 'average motifs' of each of the motif family, instead of scoring training set motifs with familial metamotif density average and maximum scores. The average motifs we used in the more naive classifier were the mean PWMs of the metamotifs trained with nmmetainfer were used for classification by scoring the training set motifs with an SSD distance metric with each of the metamotifs. We found that the classifier accuracy achieved with the SSD metric was lower to the metamotif density based classifier by 1.4% (accuracy of 88.1%), suggesting that both the metamotif mean and the column wise precision values which contribute to the metamotif density scores are partially responsible for metamatti's high performance. Furthermore, we tested training a classifier with cluster average motifs instead of the metamotif segments, resulting in an accuracy figure of 86.5%, suggesting that not only is the metamotif density a suitable score, but that the motif segments identified by the metamotif inference algorithm provide a classifier that generalises better than simply using average motifs inferred by clustering and collapsing clustered motifs to an average representation.
Independent validation of the TRANSFAC family prediction of motifs
Previous motif classification work has relied on separating training and testing data from a single public database using cross-validation [
26,
34,
36], where most motifs originate from individual studies and therefore different experimental sources. Recent advances in protein-DNA interaction assaying have however resulted in several large high-throughput regulatory motif data sets becoming publicly available. We therefore wanted to assess the performance of
metamatti with two homeodomain motif sets recovered from different species and via different experimental methods:
Mus musculus protein binding microarray motifs [
51] and
Drosophila melanogaster bacterial one-hybrid motifs [
52]. This evaluation also allowed us to compare the classification accuracy achieved with the two independent datasets to the accuracy expected based on the out of bag classification error estimate of the
metamatti random forest classifier training for the homeodomain motif family.
We trained a metamatti classifier with all 13 motif families present in the TRANSFAC 12.2 with at least 10 examples per family (total of 663 motifs) for evaluating classification of the two homeodomain motif datasets. The accuracy reported for both Drosophila melanogaster and Mus musculus datasets was the fraction of motifs labelled as homeodomain instead of any of the 12 incorrect labels.
The classification accuracy rates for both homeodomain motif sets were shown to be high, and in good agreement with the out-of-bag accuracy estimate of 91.3% reported by the
metamatti random classifier during classifier training: 92.1% and 91.7% of the homeodomain motifs in the Berger
et al. (84 motifs) and Noyes
et al. (177 motifs) sets were correctly classified, respectively. We studied the misclassified examples from the
Drosophila melanogaster homeodomain datasets in more detail to see where the misclassified motifs lie in the homeodomain specificity group clustering presented in [
52]. Interestingly, the misclassifications were shown to be atypical homeodomains which do not contain the canonical TAATTA core and fall amongst the smaller specificity groups described in [
52]. The misclassified motifs included three TGIF-Exd-like motifs (Vis, Hth, Exd), two Iroquis-like (Ara, Mirr), one Six-like (Optix) and an outlier from the specificity group clustering (Additional file
1 Figure S7A). A similar trend of non-canonical homeodomains being primarily amongst the misclassified was also noted for the
Mus musculus homeodomain motifs (Additional file
1 Figure S7B). This is most likely explained by atypical homeodomain motifs not being well covered well by the TRANSFAC 12.2 training set; No closely matching homeodomain motifs were observed in TRANSFAC 12.2 to many of the misclassified motifs.
We also wanted to test if a metamatti-like classifier could be trained to detect more detailed differences between motif groups than motif family or superfamily. We therefore labelled the Drosophila melanogaster homeodomain motifs with the homeodomain specificity groups suggested by the Noyes et al. (2008) and estimated a single metamotif with nmmetainfer from each of the 11 specificity groups. We then trained a metamatti classifier with these metamotifs similarly as described above for the motif family and superfamily classifiers. We observed the remarkably high accuracy of 84% (confusion matrix shown in Table ), when all Noyes et al. (2008) homeodomain motifs with 3 or more examples per specificity group were included in the classification (10-way classification). The applicability of supervised machine learning strategies that aim to learn motif type labels more precise than the DNA binding domain family are however currently limited by the amount of available training data. For instance, the 84 motifs in the Noyes et al. (2008) dataset contain examples of 11 specificity groups which are very biased to the two largest groups (Antennapedia and Engrailed, with 25 and 15 examples, respectively), with several specificity groups containing as few as two to four examples (Ladybird, Iroquis, NK- 1, NK-2, TGIF-Exd, Bcd). This makes classifier error estimation imprecise especially for the weakly represented classes, and results in the major classes which have as much as eightfold as many examples present in the training dataset to have considerable weight in predictions over the smaller classes (such as to maximise overall classification accuracy). Methods like metamatti can however become increasingly relevant once more high-throughput TF DNA specificity data becomes available.
| Table 3Confusion matrix of the homeodomain motif specificity group classifier |