Biom J. Author manuscript; available in PMC 2017 March 7.


PMCID: PMC5340598

NIHMSID: NIHMS831000

Liang Li,^{*,1} Huzhang Mao,^{2} Hemant Ishwaran,^{3} Jeevanantham Rajeswaran,^{4} John Ehrlinger,^{4} and Eugene H. Blackstone^{5}

The publisher's final edited version of this article is available at Biom J

Atrial fibrillation (AF) is an abnormal heart rhythm characterized by rapid and irregular heart beat, with or without perceivable symptoms. In clinical practice, the electrocardiogram (ECG) is often used for the diagnosis of AF. Since AF often occurs as recurrent episodes of varying frequency and duration, and only the episodes present at the time of the ECG can be detected, AF is often underdiagnosed when a limited number of repeated ECGs are used. In studies evaluating the efficacy of AF ablation surgery, each patient undergoes multiple ECGs, and the AF status at the time of each ECG is recorded. The objective of this paper is to estimate the marginal proportions of patients with or without AF in a population, which are important measures of the efficacy of the treatment. The underdiagnosis problem is addressed by a three-class mixture regression model in which a patient's probability of having no AF, paroxysmal AF, or permanent AF is modeled by auxiliary baseline covariates in a nested logistic regression. A binomial regression model is specified conditional on a subject being in the paroxysmal AF group. The model parameters are estimated by the EM algorithm. These parameters are themselves nuisance parameters for the purpose of this research, but the estimators of the marginal proportions of interest can be expressed as functions of the data and these nuisance parameters, and their variances can be estimated by the sandwich method. We examine the performance of the proposed methodology in simulations and two real data applications.

Atrial fibrillation (AF) affects approximately 2.2 million individuals in the United States (Fuster, 2006), particularly those with structural heart disease and the elderly. It is characterized by rapid and irregular heart beat. Each AF episode may last from a few seconds to a few days, and the frequency of episodes also varies across subjects. If the AF episodes occur intermittently, the condition is called paroxysmal AF; if the patient is constantly in AF, it is called permanent AF. Most AF episodes occur without any symptoms, while some may be accompanied by perceived heart palpitations, weakness, shortness of breath, or chest pain. AF patients may suffer from tachycardia, low cardiac output from loss of atrial function, atrial and ventricular remodeling, and significantly elevated risk of stroke. Accurately detecting the start and end of AF episodes is difficult due to the intermittent recurrence of the episodes. While devices such as insertable cardiac monitors can record AF episodes continuously over a long period of time, these devices require a surgical procedure to implant and another to remove at the end of the monitoring period. Hence, they are not suitable for use in the general at-risk population. In current clinical practice, physicians often rely on electrocardiogram (ECG) or Holter monitoring to capture AF episodes. The ECG is usually performed at inpatient visits and lasts for less than 15 minutes; the Holter monitoring device may be carried at home by the patient and provides 24 to 72 hours of continuous monitoring. Since the gap times between consecutive AF episodes may vary in length from a few minutes to many days, in many situations such monitoring methods can be viewed as only providing a "snapshot" of the underlying continuous on-and-off process of AF. Consequently, the AF episodes are easily missed in these snapshots, leading to underdiagnosis or false negative diagnosis.
It helps if the monitoring frequency and duration are increased (Arya et al, 2007), *e.g.*, with the 7-day ECG or daily telephonic ECG, but these options are not always feasible in the current clinical practice.

In this paper, we investigate the diagnosis of AF under the following study design. Suppose *n* patients, indexed by *i* = 1, 2, …, *n*, satisfy the inclusion criteria and are enrolled in a study on the risk of AF after a surgical procedure such as the Cox-Maze, with the goal of evaluating whether the surgery reduces or eliminates the AF (Gillinov et al, 2006). After the surgery, each patient is scheduled for *n*_{i} recurrent clinical visits; the presence or absence of AF at the *j*-th visit is recorded as a binary outcome *Y*_{ij} (1 if AF is diagnosed, 0 otherwise). Let *C*_{i} denote the patient's latent AF class: *C*_{i} = 0 (no AF), *C*_{i} = 1/2 (paroxysmal AF), or *C*_{i} = 1 (permanent AF). The estimands of interest are the marginal prevalence probabilities

$${p}_{0}=\mathrm{Pr}({C}_{i}=0),{p}_{1/2}=\mathrm{Pr}({C}_{i}=1/2),{p}_{1}=\mathrm{Pr}({C}_{i}=1)$$

(1)

For a given patient population receiving AF treatments, these estimands quantify the effectiveness of the surgery in reducing or eliminating AF in that population. An effective procedure is expected to increase the probability of no AF (*p*_{0}) and decrease the probabilities of paroxysmal AF (*p*_{1/2}) and permanent AF (*p*_{1}). If we can estimate *p*_{0}, *p*_{1/2}, and *p*_{1} for a single population, extension to randomized clinical trials or propensity-score-matched observational studies with multiple treatment groups is straightforward, because these marginal prevalence probabilities can be estimated for each group under comparison, and the estimators and their estimated variances can be compared or used to form a statistical test for group differences.

Let
${M}_{i}={\sum}_{j=1}^{{n}_{i}}{Y}_{\mathit{ij}}$ denote the number of positive ECG results. A simple method to estimate the marginal prevalence is to claim that a patient is free of AF if *M*_{i} = 0, in permanent AF if *M*_{i} = *n*_{i}, and in paroxysmal AF if 0 < *M*_{i} < *n*_{i}, which leads to the naive estimators

$${\stackrel{\sim}{p}}_{0}=\frac{1}{n}\sum _{i=1}^{n}1\{{M}_{i}=0\},{\stackrel{\sim}{p}}_{1/2}=\frac{1}{n}\sum _{i=1}^{n}1\{0<{M}_{i}<{n}_{i}\},{\stackrel{\sim}{p}}_{1}=\frac{1}{n}\sum _{i=1}^{n}1\{{M}_{i}={n}_{i}\}$$

(2)

These prevalence estimators may be biased. By definition, if a patient is free of AF, *Y*_{ij} = 0 for every visit and hence *M*_{i} = 0; similarly, a patient in permanent AF always has *M*_{i} = *n*_{i}. A paroxysmal AF patient, however, may also have *M*_{i} = 0 when every ECG happens to miss the intermittent episodes, or *M*_{i} = *n*_{i} when every ECG happens to capture one. The naive method misclassifies such patients, so that (2) tends to overestimate the proportions of AF-free and permanent AF patients and underestimate the proportion of paroxysmal AF patients.
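The naive estimators in (2) require only the counts *M*_{i} and *n*_{i}. The paper's software is written in R; as an illustration, the same computation can be sketched in Python (the function name is ours):

```python
import numpy as np

def naive_prevalence(M, n):
    """Naive plug-in estimators (2): classify each patient by M_i alone.

    M : array of positive ECG counts M_i
    n : array of visit counts n_i
    Returns (p0, p_half, p1): the proportions of patients with
    M_i = 0, 0 < M_i < n_i, and M_i = n_i, respectively.
    """
    M, n = np.asarray(M), np.asarray(n)
    p0 = float(np.mean(M == 0))
    p1 = float(np.mean(M == n))
    return p0, 1.0 - p0 - p1, p1

# Five patients; the third is paroxysmal but happened to test all-negative,
# so the naive method counts that patient as AF-free -- the source of the bias.
M = [0, 2, 0, 5, 3]
n = [4, 5, 3, 5, 6]
p0, p_half, p1 = naive_prevalence(M, n)
```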

The estimands of interest (1) cannot be identified without additional auxiliary information. There are well-established prognostic risk factors of AF. In this paper, we consider exploiting their relationship with *C*_{i} and *M*_{i}. Let **X**_{i} denote the vector of baseline covariates (including an intercept). We specify parametric models

$$\mathit{Pr}({C}_{i}=c\mid {\mathit{X}}_{i};\mathit{\theta})\phantom{\rule{0.2em}{0ex}}\mathrm{and}\phantom{\rule{0.2em}{0ex}}\mathit{Pr}({M}_{i}\mid {C}_{i}=c,{\mathit{X}}_{i};\mathit{\theta})$$

where *c* = 0, 1/2, 1 and **θ** is a generic notation for all the unknown parameters in this model. For the purpose of this research, **θ** is a nuisance parameter; it can be estimated by maximizing the observed data log-likelihood

$$L(\mathit{\theta})=\sum _{i=1}^{n}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\mathit{Pr}({M}_{i}\mid {\mathit{X}}_{i})=\sum _{i=1}^{n}\mathrm{log}\left\{\sum _{c\in \{0,1/2,1\}}\mathit{Pr}({M}_{i}\mid {C}_{i}=c,{\mathit{X}}_{i})\mathit{Pr}({C}_{i}=c\mid {\mathit{X}}_{i})\right\}$$

(3)

Denote the maximum likelihood estimator of **θ** by $\widehat{\mathit{\theta}}$. The marginal prevalence probabilities in (1) can then be estimated by

$$\begin{array}{c}{\widehat{p}}_{0}=\widehat{\mathrm{Pr}}({C}_{i}=0)=\frac{1}{n}\sum _{i=1}^{n}\mathit{Pr}({C}_{i}=0\mid {M}_{i},{\mathit{X}}_{i};\widehat{\mathit{\theta}})\\ {\widehat{p}}_{1/2}=\widehat{\mathrm{Pr}}({C}_{i}=1/2)=\frac{1}{n}\sum _{i=1}^{n}\mathit{Pr}({C}_{i}=1/2\mid {M}_{i},{\mathit{X}}_{i};\widehat{\mathit{\theta}})\\ {\widehat{p}}_{1}=\widehat{\mathrm{Pr}}({C}_{i}=1)=\frac{1}{n}\sum _{i=1}^{n}\mathit{Pr}({C}_{i}=1\mid {M}_{i},{\mathit{X}}_{i};\widehat{\mathit{\theta}})\end{array}$$

(4)

These are the sample averages of each subject's conditional probability of *C*_{i} given the observed data, evaluated at $\mathit{\theta}=\widehat{\mathit{\theta}}$.

In Section 2, we provide details on the model and estimation procedure. Section 3 presents the simulation results. Section 4 includes two real data examples to illustrate the proposed method. Discussion is in Section 5.

We first describe the estimation of the nuisance parameter **θ** through an EM algorithm, and then discuss the point and variance estimation of the estimands of interest, *p*_{0}, *p*_{1/2}, and *p*_{1}. The marginal probability of each latent class given the baseline covariates is modeled as

$$\begin{array}{ll}\mathit{Pr}({C}_{i}=0\mid {\mathit{X}}_{i})& =\frac{1}{1+\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\alpha})}\\ \mathit{Pr}({C}_{i}=1/2\mid {\mathit{X}}_{i})& =\mathit{Pr}({C}_{i}=1/2\phantom{\rule{0.2em}{0ex}}\mathrm{or}\phantom{\rule{0.2em}{0ex}}1\mid {\mathit{X}}_{i})\times \mathit{Pr}({C}_{i}=1/2\mid {\mathit{X}}_{i},{C}_{i}=1/2\phantom{\rule{0.2em}{0ex}}\mathrm{or}\phantom{\rule{0.2em}{0ex}}1)\\ & =\frac{\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\alpha})}{1+\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\alpha})}\frac{1}{1+\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\beta})}\\ \mathit{Pr}({C}_{i}=1\mid {\mathit{X}}_{i})& =\mathit{Pr}({C}_{i}=1/2\phantom{\rule{0.2em}{0ex}}\mathrm{or}\phantom{\rule{0.2em}{0ex}}1\mid {\mathit{X}}_{i})\times \mathit{Pr}({C}_{i}=1\mid {\mathit{X}}_{i},{C}_{i}=1/2\phantom{\rule{0.2em}{0ex}}\mathrm{or}\phantom{\rule{0.2em}{0ex}}1)\\ & =\frac{\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\alpha})}{1+\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\alpha})}\frac{\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\beta})}{1+\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\beta})}\end{array}$$

This is a nested logistic regression model, in which we first model the probability of *C*_{i} = 0 versus *C*_{i} ∈ {1/2, 1} with parameter **α**, and then, conditional on the patient having AF, the probability of *C*_{i} = 1/2 versus *C*_{i} = 1 with parameter **β**.
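In code, the nested structure is just two logistic splits multiplied together. A minimal Python sketch (hypothetical helper, not the authors' R implementation):

```python
import numpy as np

def class_probs(X, alpha, beta):
    """Nested logistic model for Pr(C_i = c | X_i), c in {0, 1/2, 1}.

    First split: any AF (C_i = 1/2 or 1) vs. no AF, governed by alpha.
    Second split, among AF patients: permanent vs. paroxysmal, governed by beta.
    X is (n, p); alpha and beta are length-p coefficient vectors.
    Returns an (n, 3) array with columns [Pr(C=0), Pr(C=1/2), Pr(C=1)].
    """
    X = np.asarray(X, dtype=float)
    p_af = 1.0 / (1.0 + np.exp(-(X @ alpha)))    # Pr(C_i = 1/2 or 1 | X_i)
    p_perm = 1.0 / (1.0 + np.exp(-(X @ beta)))   # Pr(C_i = 1 | AF, X_i)
    return np.column_stack([1.0 - p_af, p_af * (1.0 - p_perm), p_af * p_perm])

# Each row sums to one by construction.
X = np.array([[1.0, 0.5], [1.0, -1.2]])          # first column is the intercept
P = class_probs(X, np.array([0.3, 1.0]), np.array([-0.5, 0.8]))
```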

The conditional distribution of *M*_{i} given *C*_{i} and **X**_{i} is specified as

$$\begin{array}{l}\mathit{Pr}({M}_{i}\mid {C}_{i}=0,{\mathit{X}}_{i})=1\{{M}_{i}=0\}\\ \mathit{Pr}({M}_{i}\mid {C}_{i}=1/2,{\mathit{X}}_{i})=\left(\begin{array}{c}{n}_{i}\\ {M}_{i}\end{array}\right){p}_{i}^{{M}_{i}}{(1-{p}_{i})}^{{n}_{i}-{M}_{i}}=\left(\begin{array}{c}{n}_{i}\\ {M}_{i}\end{array}\right)\mathrm{exp}\left\{{M}_{i}{\mathit{X}}_{i}^{T}\mathit{\gamma}-{n}_{i}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}(1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\gamma}})\right\}\\ {p}_{i}=\frac{\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\gamma})}{1+\mathrm{exp}({\mathit{X}}_{i}^{T}\mathit{\gamma})}\\ \mathit{Pr}({M}_{i}\mid {C}_{i}=1,{\mathit{X}}_{i})=1\{{M}_{i}={n}_{i}\}\end{array}$$

(5)

When the patient does not have AF (*C*_{i} = 0), *M*_{i} = 0 with probability one; when the patient is in permanent AF (*C*_{i} = 1), *M*_{i} = *n*_{i} with probability one. When the patient has paroxysmal AF (*C*_{i} = 1/2), *M*_{i} follows a binomial distribution with size *n*_{i} and success probability *p*_{i}, which is modeled by a logistic regression on **X**_{i} with parameter **γ**.
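The three component distributions in (5) can likewise be written down directly. A hedged Python sketch (names ours; `scipy` supplies the binomial pmf):

```python
import numpy as np
from scipy.stats import binom

def component_prob(M, n, X, gamma, c):
    """Pr(M_i | C_i = c, X_i) from (5): a point mass at 0 for AF-free patients,
    a point mass at n_i for permanent AF, and a binomial distribution with a
    logistic success probability for paroxysmal AF (c = 0.5)."""
    M, n = np.asarray(M), np.asarray(n)
    if c == 0:
        return (M == 0).astype(float)
    if c == 1:
        return (M == n).astype(float)
    p = 1.0 / (1.0 + np.exp(-(np.asarray(X, dtype=float) @ gamma)))
    return binom.pmf(M, n, p)
```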

The nuisance parameters **θ** = (**α**^{T}, **β**^{T}, **γ**^{T})^{T} can be estimated by maximizing (3) with an EM algorithm that treats the latent class *C*_{i} as missing data. If the *C*_{i} were observed, the complete data log-likelihood would be

$$\begin{array}{l}\sum _{i=1}^{n}\left\{1\{{C}_{i}=0\}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\mathit{Pr}({C}_{i}=0\mid {\mathit{X}}_{i})+1\{{C}_{i}=1/2\}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\mathit{Pr}({C}_{i}=1/2\mid {\mathit{X}}_{i})\mathit{Pr}({M}_{i}\mid {\mathit{X}}_{i},{C}_{i}=1/2)\right\}+\\ 1\{{C}_{i}=1\}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\mathit{Pr}({C}_{i}=1\mid {\mathit{X}}_{i})\}\\ =\sum _{i=1}^{n}\{1\{{C}_{i}=0\}(-1)\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}\right)+1\{{C}_{i}=1/2\}[\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(\frac{{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}}{1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}}\frac{1}{1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\beta}}}\right)+\\ \mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(\begin{array}{c}{n}_{i}\\ {M}_{i}\end{array}\right)+{M}_{i}{\mathit{X}}_{i}^{T}\mathit{\gamma}-{n}_{i}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}(1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\gamma}})]+1\{{C}_{i}=1\}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(\frac{{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}}{1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}}\frac{{e}^{{\mathit{X}}_{i}^{T}\mathit{\beta}}}{1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\beta}}}\right)\}\end{array}$$

The following is the EM algorithm:

- (Initial step) Obtain initial values for **θ**. Initial values for **α** and **β** can be obtained by fitting a nested logistic regression of 1{*M*_{i} = 0}, 1{0 < *M*_{i} < *n*_{i}}, and 1{*M*_{i} = *n*_{i}} on **X**_{i}. Initial values for **γ** can be obtained by fitting a binomial regression of *M*_{i} on **X**_{i} using the subset of data with 0 < *M*_{i} < *n*_{i}.
- (E-step) Given the current value **θ**^{(m)}, calculate the conditional expectations of the class indicators in the complete data log-likelihood given the observed data *M*_{i} and **X**_{i}. By Bayes' rule,
$$\mathit{Pr}({C}_{i}=c\mid {M}_{i},{\mathit{X}}_{i})=\frac{\mathit{Pr}({M}_{i}\mid {C}_{i}=c,{\mathit{X}}_{i})\mathit{Pr}({C}_{i}=c\mid {\mathit{X}}_{i})}{{\sum}_{c\prime \in \{0,1/2,1\}}\mathit{Pr}({M}_{i}\mid {C}_{i}=c\prime,{\mathit{X}}_{i})\mathit{Pr}({C}_{i}=c\prime \mid {\mathit{X}}_{i})},\phantom{\rule{0.5em}{0ex}}(c=0,1/2,1)$$
Denote
$$\begin{array}{c}{p}_{0i}^{(m)}=\mathit{Pr}({C}_{i}=0\mid {M}_{i},{\mathit{X}}_{i};{\mathit{\theta}}^{(m)})\\ {p}_{1/2,i}^{(m)}=\mathit{Pr}({C}_{i}=1/2\mid {M}_{i},{\mathit{X}}_{i};{\mathit{\theta}}^{(m)})\\ {p}_{1i}^{(m)}=\mathit{Pr}({C}_{i}=1\mid {M}_{i},{\mathit{X}}_{i};{\mathit{\theta}}^{(m)})\end{array}$$
(6)
- (M-step) Given ${p}_{0i}^{(m)}$, ${p}_{1/2,i}^{(m)}$, and ${p}_{1i}^{(m)}$, maximize the following expected complete data log-likelihood with respect to **θ**:
$$\begin{array}{l}\sum _{i=1}^{n}\{{p}_{0i}^{(m)}(-1)\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}\right)+{p}_{1/2,i}^{(m)}[\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(\frac{{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}}{1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}}\frac{1}{1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\beta}}}\right)+\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(\begin{array}{c}{n}_{i}\\ {M}_{i}\end{array}\right)+\\ {M}_{i}{\mathit{X}}_{i}^{T}\mathit{\gamma}-{n}_{i}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\gamma}}\right)]+{p}_{1i}^{(m)}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left(\frac{{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}}{1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\alpha}}}\frac{{e}^{{\mathit{X}}_{i}^{T}\mathit{\beta}}}{1+{e}^{{\mathit{X}}_{i}^{T}\mathit{\beta}}}\right)\}\end{array}$$
Since the terms involving **α**, **β**, and **γ** are additively separable, we can maximize with respect to **α**, **β**, and **γ** separately in three M-steps. Each step involves maximizing a concave function with closed-form first and second derivatives, and can be completed using the Newton-Raphson method.
- Iterate between the E-step and the M-step until the convergence criterion
$$\underset{j}{\mathrm{max}}\phantom{\rule{0.2em}{0ex}}|{\mathit{\theta}}_{j}^{(m)}-{\mathit{\theta}}_{j}^{(m-1)}|<{\epsilon}_{0}$$
is satisfied, that is, until every element of |**θ**^{(m)} − **θ**^{(m−1)}| is less than a pre-specified small positive number *ε*_{0}. We set *ε*_{0} = 10^{−6} for the numerical studies in this paper. If the algorithm converges at step *K*, we denote the final point estimator by $\widehat{\mathit{\theta}}={\mathit{\theta}}^{(K)}$.
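As a concrete sketch, the whole iteration can be written compactly. The following is a hypothetical Python transcription (the authors' code is in R); it replaces the Newton-Raphson M-steps with a generic quasi-Newton optimizer, which is valid because each M-step objective is concave:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit          # numerically stable logistic function
from scipy.stats import binom

def em_fit(M, n, X, tol=1e-6, max_iter=500):
    """EM sketch for the three-class mixture model (illustrative helper, not
    the paper's R implementation). Returns (alpha, beta, gamma, post), where
    post holds the posterior probabilities Pr(C_i = c | M_i, X_i; theta-hat)
    for c in {0, 1/2, 1}."""
    M, n, X = np.asarray(M), np.asarray(n), np.asarray(X, dtype=float)
    alpha = np.zeros(X.shape[1]); beta = alpha.copy(); gamma = alpha.copy()
    post = None
    for _ in range(max_iter):
        # E-step: posterior class probabilities, equation (6)
        p_af, p_perm = expit(X @ alpha), expit(X @ beta)
        prior = np.column_stack([1 - p_af, p_af * (1 - p_perm), p_af * p_perm])
        lik = np.column_stack([(M == 0).astype(float),
                               binom.pmf(M, n, expit(X @ gamma)),
                               (M == n).astype(float)])
        post = prior * lik
        post /= post.sum(axis=1, keepdims=True)
        # M-step: the expected complete-data log-likelihood separates into
        # three concave pieces; maximize each one numerically
        w_af = post[:, 1] + post[:, 2]
        objs = [lambda a: -np.sum(w_af * (X @ a) - np.logaddexp(0.0, X @ a)),
                lambda b: -np.sum(post[:, 2] * (X @ b)
                                  - w_af * np.logaddexp(0.0, X @ b)),
                lambda g: -np.sum(post[:, 1] * (M * (X @ g)
                                  - n * np.logaddexp(0.0, X @ g)))]
        new = [minimize(f, x0, method="BFGS").x
               for f, x0 in zip(objs, [alpha, beta, gamma])]
        delta = max(np.max(np.abs(a - b))
                    for a, b in zip(new, [alpha, beta, gamma]))
        alpha, beta, gamma = new
        if delta < tol:
            break
    return alpha, beta, gamma, post
```

The marginal prevalence estimators in (4) are then simply the column means of the returned posterior matrix, `post.mean(axis=0)`.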

Once the nuisance parameter estimate $\widehat{\mathit{\theta}}$ is obtained, we can estimate the marginal prevalence probabilities (1) according to (4):

$${\widehat{p}}_{0}=\frac{1}{n}\sum _{i=1}^{n}{p}_{0i}^{(K)},{\widehat{p}}_{1/2}=\frac{1}{n}\sum _{i=1}^{n}{p}_{1/2,i}^{(K)},{\widehat{p}}_{1}=\frac{1}{n}\sum _{i=1}^{n}{p}_{1i}^{(K)}$$

where the individual conditional probabilities
${p}_{0i}^{(K)}$,
${p}_{1/2,i}^{(K)}$, and
${p}_{1i}^{(K)}$ can be obtained directly from the last (the *K*-th) iteration of the EM algorithm.

The asymptotic variance matrix of $\widehat{\mathit{\theta}}$ can be estimated by the inverse of the negative Hessian matrix, obtained by numerically differentiating the log likelihood (3). However, for the purpose of this research, **θ** is a nuisance parameter, and the estimands of interest are the marginal prevalence probabilities. To estimate the variance of ${\widehat{p}}_{0}$, stack it with **θ** into $\mathit{k}={({p}_{0},{\mathit{\theta}}^{T})}^{T}$; the estimators jointly solve the estimating equation

$$0=\sum _{i=1}^{n}{\varphi}_{i}(\mathit{k})=\sum _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}\left[\begin{array}{c}\mathit{Pr}({C}_{i}=0|{M}_{i},{\mathit{X}}_{i};\mathit{\theta})-{p}_{0}\\ \frac{\partial}{\partial \theta}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left({\sum}_{c\in \{0,1/2,1\}}\mathit{Pr}({M}_{i}|{C}_{i}=c,{\mathit{X}}_{i})\mathit{Pr}({C}_{i}=c|{\mathit{X}}_{i})\right)\end{array}\right]$$

By the sandwich method,
$\widehat{\mathrm{var}}({\widehat{p}}_{0})$ is the first diagonal entry of the matrix
$\widehat{\mathrm{var}}(\widehat{\mathit{k}})={n}^{-1}{\widehat{\mathit{A}}}_{n}^{-1}{\widehat{\mathit{B}}}_{n}{\widehat{\mathit{A}}}_{n}^{-T}$, with
${\widehat{\mathit{A}}}_{n}={n}^{-1}{{\sum}_{i=1}^{n}\partial {\varphi}_{i}(\mathit{k})/\partial {\mathit{k}}^{T}|}_{\mathit{k}=\widehat{\mathit{k}}}$ and
${\widehat{\mathit{B}}}_{n}={n}^{-1}{{\sum}_{i=1}^{n}{\varphi}_{i}(\mathit{k}){\varphi}_{i}{(\mathit{k})}^{T}|}_{\mathit{k}=\widehat{\mathit{k}}}$. The derivatives can be calculated numerically. Variances of ${\widehat{p}}_{1/2}$ and ${\widehat{p}}_{1}$ are estimated in a similar way.
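The sandwich computation itself is standard M-estimation machinery (Stefanski and Boos, 2002). The following is a generic Python sketch with central-difference derivatives (our illustration, not the authors' code), checked on the toy case where the estimating function defines the sample mean, for which the sandwich formula reduces to the usual var(*x*)/*n*:

```python
import numpy as np

def sandwich_variance(phi, k_hat, data, eps=1e-6):
    """Sandwich variance for an M-estimator solving sum_i phi(k, d_i) = 0.

    phi(k, d) returns the length-q estimating-function value for one
    observation d; the bread matrix A_n is obtained by numerical
    (central-difference) differentiation, as suggested in the text.
    Returns the estimated q x q variance matrix of k_hat.
    """
    n, q = len(data), len(k_hat)
    A = np.zeros((q, q))
    B = np.zeros((q, q))
    for d in data:
        f0 = phi(k_hat, d)
        B += np.outer(f0, f0)                      # meat: phi phi^T
        for j in range(q):
            kp, km = k_hat.copy(), k_hat.copy()
            kp[j] += eps
            km[j] -= eps
            A[:, j] += (phi(kp, d) - phi(km, d)) / (2.0 * eps)
    A /= n
    B /= n
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv.T / n

# Toy check: the sample mean solves sum_i (x_i - mu) = 0.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
V = sandwich_variance(lambda k, d: np.array([d - k[0]]), np.array([x.mean()]), x)
```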

We conducted simulations to study the performance of the proposed model and method. In each simulation, we generated two standard Gaussian baseline covariates with a correlation coefficient of 0.2. We consider settings in which the three marginal prevalence probabilities *p*_{0} = *Pr*(*C*_{i} = 0), *p*_{1/2} = *Pr*(*C*_{i} = 1/2), and *p*_{1} = *Pr*(*C*_{i} = 1) vary.
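The data-generating mechanism just described can be sketched as follows (Python; the parameter values passed in are illustrative, not the paper's exact settings):

```python
import numpy as np

def simulate_af_data(n_subj, alpha, beta, gamma, visit_range, rng):
    """Simulate from the three-class mixture model.

    Two standard Gaussian covariates with correlation 0.2 (plus an intercept),
    latent class C_i drawn from the nested logistic model, visit counts n_i
    uniform on visit_range, and M_i generated according to (5).
    """
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]], size=n_subj)
    X = np.column_stack([np.ones(n_subj), z])
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    p_af, p_perm = sig(X @ alpha), sig(X @ beta)
    probs = np.column_stack([1 - p_af, p_af * (1 - p_perm), p_af * p_perm])
    C = np.array([rng.choice([0.0, 0.5, 1.0], p=p) for p in probs])
    n_i = rng.integers(visit_range[0], visit_range[1] + 1, size=n_subj)
    # AF-free patients always test negative; permanent AF always positive;
    # paroxysmal patients follow the binomial component
    M = np.where(C == 0.0, 0,
        np.where(C == 1.0, n_i, rng.binomial(n_i, sig(X @ gamma))))
    return X, n_i, M, C
```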

The simulation results are summarized in Table 1. The naive method generally leads to biased estimators of the marginal prevalence probabilities, and the bias is sometimes quite large, while the proposed method is unbiased in all settings. The bias of the naive method decreases when there are more repeated measures per subject; increasing the sample size does not help reduce the bias. When there is a large number of diagnoses per subject, the chance that all of these diagnoses fail to capture any AF episode in a paroxysmal AF patient is greatly reduced, and so is the chance that every diagnosis captures an AF episode in such a patient. The 95% confidence intervals have approximately correct coverage probabilities, though slight undercoverage can be seen in small data sets with small sample size and few repeated measures per subject. The simulation also shows that the proposed estimators of the nuisance parameters (**α**, **β**, **γ**) are approximately unbiased.

Simulation results comparing the naive and the proposed methods in estimating marginal prevalence probabilities *p*_{0}, *p*_{1/2}, and *p*_{1}. The results are reported as the average (Est) and empirical standard deviation (empSD) of the point estimators, the average …

The proposed mixture regression model is a working model that helps identify the estimands in (1), which are otherwise unidentified from the data. Hence, we conducted additional simulations to study the proposed estimators when the working model is misspecified. The simulation was designed with a setup similar to Setting 1 of Table 1. The sample size is fixed at 500. The median number of repeated measures is 5 (range 1-9), 10 (range 5-15), or 20 (range 10-30). We fit the correctly specified three-part mixture model as well as an incorrectly specified model omitting *X*_{2}. The result is visualized in Figure 1. It reconfirms the finding in Table 1 that the correctly specified mixture model produces unbiased estimators. When the model is misspecified, there is bias in the proposed estimator, but the bias is much smaller than that of the naive estimator in all scenarios. When the number of repeated measures per subject increases, the bias of the naive method decreases, and so does the bias of the proposed method under the misspecified model. A heuristic explanation is that when the monitoring frequency increases, it is difficult for a paroxysmal AF patient to have all-negative results (*M*_{i} = 0) or all-positive results (*M*_{i} = *n*_{i}). In other words, there is less evidence in the data for a paroxysmal AF patient to have a high probability of being in the other two groups, even when there is some model misspecification. Hence the proposed estimator converges toward the naive estimator, whose bias decreases with increasing monitoring frequency. Therefore, based on this result, we recommend using the proposed methodology in place of the naive method in all scenarios, regardless of the monitoring frequency.

The average estimated marginal prevalence probabilities from the correctly specified three-part mixture model (green, solid line), the misspecified three-part mixture model (green, dashed line), and the naive method (red), when the median number of repeated measures …

The simulations in this section were performed using functions written in R (R Core Team, 2014). The R code is available for download at the publisher's website. The initial values of the EM algorithm were chosen using the simple method described in Section 2, and convergence was achieved for all the simulation runs. Analyzing a typical simulated data set with *n* = 1000 took about 80 seconds on a PC with a 3.30 GHz CPU and 16 GB memory. The good convergence performance may be partly attributed to the concavity of the target functions in the M-step of the EM algorithm. Previous literature suggests that it is desirable to run the EM algorithm with different initial values to protect against convergence to a local, rather than the global, maximum (Biernacki et al, 2003). We intentionally let the initial values deviate from the recommended values in Section 2. The algorithm always converged to the same estimators, unless excessively large deviations were used that led to unreasonable initial values, in which case the algorithm did not converge.

We illustrate the proposed methodology with two real data applications. The first application is from a retrospective observational study of the effectiveness of three surgical ablation procedures in reducing atrial fibrillation: the Cox-Maze ablation procedure, a cut-and-sew procedure considered the gold standard but more complicated and time consuming; pulmonary vein isolation, which is the simplest; and pulmonary vein isolation with added lesions, which lies between the other two in complexity. Data on heart rhythm were collected from postoperative ECGs and extracted from the electronic health records (Gillinov et al, 2006). The data set includes 413 patients with at least one ECG diagnosis between 6 and 24 months after surgery. The ECG data within the first 6 months after surgery were excluded because patients' heart rhythms may be affected by many other peri-operative risk factors and are not stable shortly after open heart surgery. Of these patients, 169 had one ECG, 105 had two, 49 had three, and the maximum number of ECGs was 14. The naive method estimated ${\stackrel{\sim}{p}}_{0}=0.564$, ${\stackrel{\sim}{p}}_{1/2}=0.196$, and ${\stackrel{\sim}{p}}_{1}=0.240$. The proposed method estimated ${\widehat{p}}_{0}=0.379\phantom{\rule{0.2em}{0ex}}(0.0171)$, ${\widehat{p}}_{1/2}=0.481\phantom{\rule{0.2em}{0ex}}(0.0208)$, and ${\widehat{p}}_{1}=0.140\phantom{\rule{0.2em}{0ex}}(0.00682)$ (the numbers in parentheses are standard errors). The proportion of permanent AF patients appears to be overestimated by the naive method. Table 2 shows the estimated regression model parameters. These are nuisance parameters for the purpose of estimating the marginal prevalence probabilities, but they can be informative for exploring the association between risk factors and outcomes.

The estimated model parameters from the three-part mixture model applied to the first data application. Results are presented as the point estimator, standard error estimator, and p-value from a Wald-type test against the null hypothesis of zero parameter.

The second data application is from a prospective randomized study to determine whether adding surgical ablation to mitral valve surgery is more effective than surgery alone in reducing AF at 6 and 12 months. In this study, patients underwent weekly trans-telephonic monitoring (TTM) of their heart rhythm after cardiac surgery. The TTM system functions similarly to the ECG in the physician's office, but it is smaller and more portable, so patients can use it to monitor their heart rhythm at home. The device can be connected to the home telephone line to transmit the heart rhythm data to the data center of the study. The patients were instructed to make at least one transmission each week, though more frequent transmissions were permitted. There are *n* = 150 patients with weekly TTM data between 6 and 12 months. The median number of repeated measures is 23, the minimum 1, and the maximum 46. The naive method estimated ${\stackrel{\sim}{p}}_{0}=0.38$, ${\stackrel{\sim}{p}}_{1/2}=0.28$, and ${\stackrel{\sim}{p}}_{1}=0.34$. The proposed method estimated ${\widehat{p}}_{0}=0.375\phantom{\rule{0.2em}{0ex}}(0.0168)$, ${\widehat{p}}_{1/2}=0.287\phantom{\rule{0.2em}{0ex}}(0.0128)$, and ${\widehat{p}}_{1}=0.339\phantom{\rule{0.2em}{0ex}}(0.0152)$. Unlike in the first example, the results from the proposed method and the naive method are very similar. As shown in the simulations, when the number of repeated measures per subject is large, the discrepancy between the proposed and the naive methods diminishes. In other words, the naive method produces nearly unbiased results when the number of repeated measures is large. From a clinical perspective, this finding highlights the importance and usefulness of modern technology such as the TTM system in the diagnosis and monitoring of patients at risk of AF. Table 3 shows the estimated model parameters.
As an additional sensitivity analysis, we artificially reduced the number of diagnoses per subject in this data set by randomly sampling a fraction of the repeated measures with a probability of retention *ω*, and plotted the change in the estimated marginal prevalence probabilities from the naive and the proposed methods in Figure 2. The estimated probabilities from the proposed method are less sensitive to the monitoring frequency than those from the naive method.
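The thinning step itself is simple: if each of the *n*_{i} diagnoses is retained independently with probability *ω*, the retained visit count is binomial and, given that count, the retained positives follow a hypergeometric distribution. A Python sketch (our helper, not the authors' code; we keep at least one visit per patient, matching the study's inclusion criterion):

```python
import numpy as np

def thin_measurements(M, n, omega, rng):
    """Randomly retain each of the n_i diagnoses with probability omega.

    The retained visit count is Binomial(n_i, omega), clipped to at least one
    visit per patient; given that count, the number of retained positive
    results is a hypergeometric draw from the original M_i positives.
    """
    M, n = np.asarray(M), np.asarray(n)
    n_new = np.maximum(rng.binomial(n, omega), 1)   # retained visits
    M_new = rng.hypergeometric(M, n - M, n_new)     # retained positives
    return M_new, n_new
```

Re-running both estimators on the thinned data over a grid of *ω* values reproduces the kind of comparison shown in Figure 2.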

The estimated prevalence probabilities by the naive (dashed lines) and proposed (solid lines) methods, as the probability of retention *ω* decreases, causing a reduction in the number of repeated measures per subject from the original TTM data set …

In this paper, we proposed a new statistical methodology to address one aspect of the widely recognized diagnosis problem in atrial fibrillation research: the estimated prevalence of AF increases with the monitoring intensity (Arya et al, 2007; Senatore et al, 2006). This phenomenon arises because AF episodes occur intermittently and may easily be missed when the monitoring frequency is inadequate. Estimating the prevalence of people with no AF, paroxysmal AF, and permanent AF in a given population is of scientific interest in many contexts, including the evaluation of the efficacy of AF ablation surgeries. If these probabilities are estimated from the observed proportions, as in the naive method, the results will be biased unless a large number of repeated diagnoses are made, as in the TTM data example. Hence, the results of this article highlight the importance of using modern, high-frequency monitoring technology such as the TTM system. However, high-frequency monitoring is not always available in clinical practice. When the monitoring frequency is low, we propose to use a model-based approach. Our simulations show that the proposed approach outperforms the naive approach and produces unbiased results regardless of the monitoring frequency. Even when the model is moderately misspecified, the proposed method can still achieve substantial bias reduction and outperforms the naive method. Our real data applications show that when the monitoring frequency is high, as in the second example, the proposed method and the naive method give similar results; when the monitoring frequency is low, as in the first example, there are considerable differences between the two methods, and the proposed method is better justified. In addition, the result of the proposed method is less sensitive to the monitoring frequency than that of the naive method, as demonstrated in the experiment with the second data set (Figure 2).
Based on all the results above, we recommend the proposed methodology in place of the naive method in all situations, regardless of the monitoring frequency.

The marginal prevalence probabilities are not identifiable from the AF outcomes *Y*_{ij} alone. They are identifiable in our approach because of the use of the three-part mixture model. This is analogous, in some respects, to the cure model in survival analysis (Farewell, 1982), where the cured patients are not expected to have the clinical event of interest and are indistinguishable from the censored patients without additional modeling assumptions. The covariates **X**_{i} supply the auxiliary information that makes the latent AF classes identifiable.


This research is funded by NIH grants R01HL103552 and P30CA016672. We want to thank the two referees and the Associate Editor for constructive comments that improved this paper.

**Conflict of interest** None.

**Supporting Information**

“code.zip” in the Supporting Information contains the code required to generate the simulation results in this paper together with instructions for their use. Available at the publisher’s website.

- Arya A, Piorkowski C, Sommer P, Kottkamp H, Hindricks G. Clinical implications of various follow up strategies after catheter ablation of atrial fibrillation. Pacing Clin Electrophysiol. 2007;30(4):458–62. [PubMed]
- Biernacki C, Celeux G, Govaert G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis. 2003;41:561–575.
- Farewell VT. The use of mixture models for the analysis of survival data with long-term survivors. Biometrics. 1982;38:1041–1046. [PubMed]
- Faucett CL, Schenker N, Taylor JM. Survival analysis using auxiliary variables via multiple imputation, with application to AIDS clinical trial data. Biometrics. 2002;58(1):37–47. [PubMed]
- Fuster V. ACC/AHA/ESC 2006 guidelines for the management of patients with atrial fibrillation. Circulation. 2006;114(7):e257–354. [PubMed]
- Gaillard N, Deltour S, Vilotijevic B, Hornych A, Crozier S, Leger A, Frank R, Samson Y. Detection of paroxysmal atrial fibrillation with transtelephonic EKG in TIA or stroke patients. Neurology. 2010;74(21):1666–70. [PubMed]
- Gillinov AM, Bhavani S, Blackstone EH, Rajeswaran J, Svensson LG, Navia JL, Pettersson BG, Sabik JF, 3rd, Smedira NG, Mihaljevic T, McCarthy PM, Shewchik J, Natale A. Surgery for permanent atrial fibrillation: impact of patient factors and lesion set. Annals of Thoracic Surgery. 2006;82(2):502–13. [PubMed]
- Hall DB. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics. 2000;56(4):1030–1039. [PubMed]
- Hall DB, Zhang Z. Marginal models for zero inflated clustered data. Statistical Modeling. 2004;4:161–180.
- R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria: 2014. URL http://www.R-project.org/
- Senatore G, Stabile G, Bertaglia E, Donnici G, De Simone A, Zoppo F, Turco P, Pascotto P, Fazzari M. Role of transtelephonic electrocardiographic monitoring in detecting short-term arrhythmia recurrences after radiofrequency ablation in patients with atrial fibrillation. J Am Coll Cardiol. 2006;45(6):873–6. [PubMed]
- Stefanski LA, Boos DD. The calculus of M-estimation. The American Statistician. 2002;56:29–38.
