We next performed simulation studies to assess (i) the consistency of model estimates under different parameter settings, (ii) the detection of the correct model under model misspecification, and (iii) the importance of the continuous-index assumption in modeling gapped tiling array data.
5.1. Consistency of model parameter estimation
Data sets were first generated under models M0, M1, and M2. For each model and each parameter setting, 10 data sets of 5000 probes were generated, and each was fit using MCMC (Section 3). In each case, the estimated bias and mean squared error (MSE) indicated consistent estimation of the model parameters. Here we describe one study in detail. For M0, the emission distributions were assumed normal with means −0.4 and 0.4, λ and μ were both set to −3, and the emission variances were set to 0.35. For M1 and M2, the intercept parameter for both states was set to 1. Each regression coefficient (θ for M1 and β for M2) for the 5 covariates was fixed at either 1 or 0, for both the nucleosomal and NFR states; a typical β (or θ) vector would thus be, for example, (1,0,1,0,0,0) for the nucleosomal state and (1,0,0,0,0,0) for the NFR state. For all 3 models, 2500 MCMC iterations were run for each parameter setting, with a burn-in of 500. To evaluate misclassification rates, probes with an average state membership probability greater than 0.5 were assigned to state 1, and all others to state 0.

For all data sets under model M0, the maximum MSE of the emission (transition) parameters was less than 0.0001 (0.03) and the maximum misclassification rate was 0, indicating that the model was fit accurately in each case. For M1, the maximum MSE for the corresponding parameters was also 0.03, the maximum misclassification rate was 0.1, and the MSE for θ ranged between 0.05 and 0.09. For M2, the maximum MSE for the emission and transition parameters was slightly higher (0.08 for μ), the maximum misclassification rate was 0.1, and the MSE for β ranged between 0.001 and 0.004.
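The evaluation step described above can be sketched as follows. This is an illustrative simulation and hard-call rule, not the paper's implementation: the logistic mapping of the rates λ and μ to switch probabilities, and the pointwise posterior proxy used in place of the full MCMC posterior, are both assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_m0(n, means=(-0.4, 0.4), sd=np.sqrt(0.35), lam=-3.0, mu=-3.0):
    """Two-state simulator in the spirit of model M0.

    Mapping lam/mu to switch probabilities through a logistic link is
    an assumption; the paper's exact parameterization is not reproduced.
    """
    p01 = 1.0 / (1.0 + np.exp(-lam))  # P(state 0 -> 1), assumed logistic link
    p10 = 1.0 / (1.0 + np.exp(-mu))  # P(state 1 -> 0)
    states = np.empty(n, dtype=int)
    states[0] = rng.integers(2)
    for t in range(1, n):
        switch = p01 if states[t - 1] == 0 else p10
        states[t] = states[t - 1] ^ int(rng.random() < switch)
    obs = rng.normal(np.asarray(means)[states], sd)
    return states, obs

def misclassification_rate(post_state1, true_states, cutoff=0.5):
    # Hard-call each probe to state 1 when its average posterior
    # membership probability exceeds the cutoff, as in the text.
    return np.mean((post_state1 > cutoff).astype(int) != true_states)

states, obs = simulate_m0(5000)
# A pointwise posterior proxy built from the two emission densities
# (it ignores the Markov dependence, so it is cruder than the MCMC fit):
lik1 = np.exp(-(obs - 0.4) ** 2 / 0.7)
lik0 = np.exp(-(obs + 0.4) ** 2 / 0.7)
rate = misclassification_rate(lik1 / (lik0 + lik1), states)
```

Because the pointwise proxy discards the strong state persistence implied by λ = μ = −3, its misclassification rate is higher than the rate of 0 reported for the full model fit.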
5.2. Cross-comparison under different models
Next, we simulated data under each model and estimated parameters and states under all models, to judge whether the correct model was the most accurate in each case. Five data sets were simulated under each model. The variances were fixed at 0.48 and 0.64 for the nucleosomal and NFR states, respectively. For M0, both transition rates were set to −3. For models M1 and M2, the coefficient for the first PC in the NFR state was fixed at 1, while the others were set to 0. The simulated data sets were of size 5000; the sequence features of the first 5000 probes of the FAIRE data set were used as covariates.

When the simulation and estimation models matched, the method performed very accurately: the maximum MSE of the emission parameters ν and σ was less than 0.01, that of λ and μ was less than 0.05, and the average misclassification rates were also low. In each case, the BIC chose the correct model. Moreover, when data simulated under a model, say MA, were estimated with a larger model containing it, say MB, we obtained a similar value of the maximized log likelihood. In an ideal scenario, the extra components of the bigger model would be estimated as 0 and the two likelihoods would coincide; this is typically not achieved in practice, necessitating the use of criteria such as the BIC. In our case, however, the estimated log-likelihood values are very close to the true value (the error margin is less than 0.01), and the expected log likelihoods, obtained from averages over the MCMC iterations, are equal. One possible reason is the large size of the data set, which smooths out estimation error considerably and makes the parameter estimates match the true simulation values with high probability. Due to MCMC convergence problems, model M3 could not be tested directly. However, we generated 5 data sets from this model (with parameter settings similar to those above) and ran the other 3 estimation models on them.
The percentages of correctly classified probes under models M0, M1, and M2 were 81%, 92%, and 94%, respectively. In 2 of the 5 data sets, model M1 achieved the lowest BIC and MSE, while in the other 3, model M2 was best; model M0 was weakest.
Cross-tabulation of average correct state classification percentage under the 3 models
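The model-selection step underlying this comparison can be sketched as follows. The log-likelihood values and parameter counts below are hypothetical placeholders, chosen only to illustrate how the BIC penalty separates nested models whose maximized likelihoods are nearly equal, as observed above.

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2*logL + k*log(n); smaller is better."""
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

# Hypothetical fits of a small model and a larger model containing it
# to the same 5000-probe data set: the bigger model barely improves the
# likelihood, so the log(n) penalty on its extra parameters decides.
n = 5000
ll_small, k_small = -4200.0, 6   # e.g. M0: 2 means, 2 variances, 2 rates
ll_big, k_big = -4199.5, 16      # e.g. M2: adds 2 x 5 covariate coefficients

best = min([("M0", bic(ll_small, k_small, n)),
            ("M2", bic(ll_big, k_big, n))], key=lambda t: t[1])
```

With these placeholder values the smaller model wins, mirroring the behavior reported above when the data were simulated under the smaller model.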
5.3. Importance of the continuous-index model
To show that the continuous-index HMM is an essential improvement for fitting gapped probe data, we simulated a data set of length 5000 in which the gap distribution was the same as in the FAIRE data, and a second data set of 5000 measurements in which the gap structure was assigned randomly. Both data sets were analyzed using both the discrete- and the continuous-index HMM. Comparing the predicted states with the simulated state sequence shows that, in the gapped data setting, the discrete-index HMM fails to match the correct classification rate of the continuous-index model.
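The distinction can be made concrete with a two-state continuous-time Markov chain, whose transition matrix over an arbitrary gap has a closed form. This is an illustrative sketch, not the paper's parameterization: the rates 0.05/0.05 and the gap lengths are arbitrary choices made for the example.

```python
import numpy as np

def ctmc_transition(a, b, t):
    """Closed-form transition matrix of a 2-state continuous-time Markov
    chain with rates a (state 0 -> 1) and b (state 1 -> 0), after an
    elapsed genomic distance t (rows sum to 1)."""
    s = a + b
    e = np.exp(-s * t)
    return np.array([[b + a * e, a * (1 - e)],
                     [b * (1 - e), a + b * e]]) / s

# A continuous-index HMM scales state persistence with the gap between
# consecutive probes; a discrete-index HMM applies the unit-gap matrix
# to every step, overstating persistence across large gaps.
P1 = ctmc_transition(0.05, 0.05, 1.0)    # adjacent probes
P50 = ctmc_transition(0.05, 0.05, 50.0)  # probes separated by a large gap
```

For these rates the self-transition probability decays from about 0.95 over a unit gap toward the stationary value 0.5 over a gap of 50, which is exactly the information a discrete-index HMM discards when probes are unevenly spaced.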