Analysis of discrete linear sequences has played an increasingly important role in biology. In particular, the detection of heterogeneous regions among sequences can aid in understanding the heterogeneous processes that act upon those regions 
. Therefore, determining whether specified types or categories of sites, such as polymorphic 
or substituted sites 
within DNA or protein sequences, are concentrated in specific regions within DNA or protein sequences has become a key component of these analyses 
. For instance, detecting regions that feature heterogeneity in substitutions may provide valuable information on the structure and function of DNAs or proteins 
Several parametric and nonparametric methods have been proposed and historically applied to sequence data. Parametric methods include applications of a Fisher's exact test to tallies of site types between regions, or of a likelihood ratio test to identify heterogeneous regions 
. Alternatively, several heuristic methods may be applied for this clustering 
. For example, UPGMA (Unweighted Pair Grouping Method with Arithmetic-mean) or NN (Nearest Neighbor), are hierarchical methods that at each step combine the nearest 2 clusters into one new cluster. Iteration of this step is continued until the number of clusters is one. One of NN's variants, K
-Nearest Neighbor), differs in its termination condition, stopping the iteration until the K
clusters are identified, where K
needs to be defined in advance. Another heuristic approach, K
-means, uses a partitioning algorithm to break data into K
clusters, and also requires the number of clusters K
as a prior knowledge. When regions of a sequence that are expected to have heterogeneous frequencies of a site type may be specified in advance or the number of clusters to be identified is known a priori
, these methods have high power to detect clustering 
. However, they require a priori
assignment of partitions. When no a priori
expectation of cluster size or cluster number may be specified, extant studies have usually relied on “sliding window” methods 
. For example, Pesole et al.
(1992) labeled invariable site as ‘1’ and variable site as ‘0’, and applied a sliding window to identify whether ‘1’s are significantly clustered 
. Pesole et al.
calculated a heuristic score based on the presence or absence of site types within a window that processes serially across the sequence of interest.
Advantages of sliding window methods include their intuitive conceptual basis and their striking output: an autocorrelated plot of the score that may be superimposed upon the sequence, providing a visual appraisal of the level of clustering at every site. However, sliding window methods have two related major disadvantages 
. First, they generally offer only crude non-parametric means for statistical significance testing. The autocorrelation of serial scores severely complicates attempts to develop more insightful parametric approaches to sliding window significance testing, making parameter estimation with confidence intervals either challenging or impossible. Second, the need to specify a window size presents a user with a procedural ambiguity. Without a unified statistical framework, there is no strong justification for selection of one window size over another. In such a situation, it may even be tempting to invert the procedure of statistical inference and select a window size that produces an autocorrelated score plot consistent with a particular scientific hypothesis, as opposed to the valid procedure of selecting a window size by an objective statistical optimality criterion.
Because of these disadvantages of the sliding window methods, several nonparametric statistical methods that do not assume prior knowledge have been suggested or implemented to detect clustering in discrete linear sequences. These methods include runs tests 
and empirical cumulative distribution function (ECDF) statistics 
. Runs tests use the “longest unbroken run” between sites of interest as a test statistic for clustering, where a run is defined as consecutive length between events 
. This test statistic provides very weak power, because it uses very little of the relevant information about the phenomenon of interest, ignoring all runs other than the longest. Statistics based on the longest two runs, longest three runs, or even on a summary of the full distribution of run lengths have been discussed, but remain weak tests. For instance, the variance in distance between site types of interest may be calculated and used as a test statistic for the detection of clusters of sites, where a high variance is indicative of clustering 
. This test statistic incorporates information about the length of all the runs, but does not capture all of the relevant information: it discards all information about the relative position of runs of different lengths. A sequence with all of its shorter runs in one region would be more clustered than one with short runs distributed evenly.
Currently, the most powerful nonparametric method is the ECDF. It features the cumulative difference between the observed and expected proportion of variant sites to identify regions that differ from other regions in number of substitutions. Under a null model that assumes no heterogeneous region(s) within sequences, this difference remains close to zero. Its significant departure from zero is an indicator for rejecting the null model 
. Although ECDF has been used to detect heterogeneity in several studies 
, its power can be affected by the location of the heterogeneous region 
. Moreover, a parametric method may perform even better across a wide range of datasets.
Most extant methods that have been proposed to detect heterogeneous clusters among sequences suffer from poor power to detect clustering when it is present. The problem is made especially challenging by a tradeoff wherein increasing power to detect clustering also increases overparameterization or false positive rates. Methods that have high power are prone to identify clustering even in random sequences, because even in short sequences, there are so many potential patterns of clustering to evaluate. In this paper, we propose a hierarchical clustering method, model averaged clustering by maximum likelihood (MACML), requiring no priori knowledge of cluster size or cluster count, that provides greater statistical power in detecting heterogeneous regions. MACML adopts a divide-and-conquer approach to hierarchically detect heterogeneous regions and repeat similar analysis for each identified region, unlike most hierarchical methods that do not revisit clusters once they are constructed 
. To address issues of overparameterization, MACML employs model selection and model averaging techniques that lead to intuitively appealing profiles of sequence heterogeneity and that facilitate description of clustered sites in discrete linear sequences. We describe MACML in detail and provide comparative results in the form of an in-depth evaluation of simulated datasets and an empirical sequence data set on polymorphisms in the Drosophila
alcohol dehydrogenase gene.