

Adv Neural Inf Process Syst. Author manuscript; available in PMC 2016 June 15.

Published in final edited form as:

Adv Neural Inf Process Syst. 2015 December; 28: 2656–2664.

PMCID: PMC4907892

NIHMSID: NIHMS771641

Christopher De Sa: cdesa@stanford.edu; Ce Zhang: czhang@cs.wisc.edu; Kunle Olukotun: kunle@stanford.edu; Christopher Ré: chrismre@stanford.edu


Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD’s runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

Many problems in machine learning can be written as a stochastic optimization problem

$$\text{minimize}\hspace{0.5em}\mathbf{E}[\tilde{f}(x)]\hspace{0.5em}\text{over}\hspace{0.5em}x\in {\mathbb{R}}^{n},$$

where f̃ is a random objective function. One popular method to solve this is stochastic gradient descent (SGD), an iterative method which, at each timestep *t*, chooses a random objective sample *f̃ _{t}* and updates

$${x}_{t+1}={x}_{t}-\alpha \nabla {\tilde{f}}_{t}({x}_{t}),$$

(1)

where α is the step size. For most problems, this update step is easy to compute, and perhaps because of this, SGD is a ubiquitous algorithm with a wide range of applications in machine learning [1], including neural network back-propagation [2, 3, 13], recommendation systems [8, 19], and optimization [20]. For non-convex problems, SGD is also popular (it is widely used in deep learning, for instance), but its success is poorly understood theoretically.
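As a concrete illustration of update (1), the following sketch runs SGD on a hypothetical least-squares objective; the problem, its dimension, the step size, and the sample count are all invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares objective: f(x) = E[(a^T x - b)^2], where each
# random sample is f~_t(x) = (a_t^T x - b_t)^2 with b_t = a_t^T x_true.
n, T, alpha = 5, 2000, 0.05
x_true = rng.normal(size=n)

x = np.zeros(n)
for _ in range(T):
    a = rng.normal(size=n)
    b = a @ x_true
    grad = 2.0 * (a @ x - b) * a   # gradient of the sampled objective
    x = x - alpha * grad           # SGD update rule (1)
```

With a consistent linear system like this (no observation noise), the iterate contracts toward `x_true`; adding noise to `b` would instead leave a noise floor governed by α.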

Given SGD’s success in industry, practitioners have developed methods to speed up its computation. One popular method to speed up SGD and related algorithms is using asynchronous execution. In an asynchronous algorithm, such as Hogwild! [17], multiple threads run an update rule such as Equation 1 in parallel without locks. Hogwild! and other lock-free algorithms have been applied to a variety of uses, including PageRank approximations (FrogWild! [16]), deep learning (Dogwild! [18]) and recommender systems [24]. Many asynchronous versions of other stochastic algorithms have been individually analyzed, such as stochastic coordinate descent (SCD) [14, 15] and accelerated parallel proximal coordinate descent (APPROX) [6], producing rate results that are similar to those of Hogwild! Recently, Gupta et al. [9] gave an empirical analysis of the effects of a low-precision variant of SGD on neural network training. Other variants of stochastic algorithms have been proposed [5,11,12,21,22,23]; only a fraction of these algorithms have been analyzed in the asynchronous case. Unfortunately, a new variant of SGD (or a related algorithm) may violate the assumptions of existing analysis, and hence there are gaps in our understanding of these techniques.

One approach to filling this gap is to analyze each purpose-built extension from scratch: an entirely new model for each type of asynchrony, each type of precision, etc. In a practical sense, this may be unavoidable, but ideally there would be a single technique that could analyze many models. In this vein, we prove a martingale-based result that enables us to treat many different extensions as different forms of noise within a unified model. We demonstrate our technique with three results:

- For the convex case, Hogwild! requires strict sparsity assumptions. Using our techniques, we are able to relax these assumptions and still derive convergence rates. Moreover, under Hogwild!’s stricter assumptions, we recover the previous convergence rates.
- We derive convergence results for an asynchronous SGD algorithm for a non-convex matrix completion problem, giving the first rates for asynchronous SGD on such a problem; our analysis builds on the recent (synchronous) non-convex SGD work of De Sa et al. [4].
- We derive convergence rates in the presence of quantization errors such as those introduced by fixed-point arithmetic. We validate our results experimentally, and show that Buckwild! can achieve speedups of up to 2.3× over Hogwild!-based algorithms for logistic regression.

One can combine these different methods both theoretically and empirically. We begin with our main result, which describes our martingale-based approach and our model.

Analyzing asynchronous algorithms is challenging because, unlike in the sequential case where there is a single copy of the iterate *x*, in the asynchronous case each core has a separate copy of *x* in its own cache. Writes from one core may take some time to be propagated to another core’s copy of *x*, which results in race conditions where stale data is used to compute the gradient updates. This difficulty is compounded in the non-convex case, where a series of unlucky random events—bad initialization, inauspicious steps, and race conditions—can cause the algorithm to get stuck near a saddle point or in a local minimum.

Broadly, we analyze algorithms that repeatedly update *x* by running an update step

$${x}_{t+1}={x}_{t}-{\tilde{G}}_{t}({x}_{t}),$$

(2)

for some i.i.d. update function *G̃ _{t}*. For example, for SGD we would have *G̃ _{t}*(*x*) = α∇*f̃ _{t}*(*x*).

Our main result is a technique that allows us to bound the convergence rates of asynchronous SGD and related algorithms, even for some non-convex problems. We use martingale methods, which have produced elegant convergence rate results for both convex and some non-convex [4] algorithms. Martingales enable us to model multiple forms of error—for example, from stochastic sampling, random initialization, and asynchronous delays—within a single statistical model. Compared to standard techniques, they also allow us to analyze algorithms that sometimes get stuck, which is useful for non-convex problems. Our core contribution is that a martingale-based proof for the convergence of a sequential stochastic algorithm can be easily modified to give a convergence rate for an asynchronous version.

A *supermartingale* [7] is a stochastic process *W _{t}* such that **E**[*W _{t+1}*|*W _{t}*, …, *W*_{0}] ≤ *W _{t}*; that is, its expected value cannot increase over time.

For a stochastic algorithm as described above, a non-negative process *W _{t}* is a *rate supermartingale* with horizon *B* if two conditions hold. First, it must be a supermartingale; that is, for any sequence *x _{t}*, …, *x*_{0} and any *t* ≤ *B*,

$$\mathbf{E}[{W}_{t+1}({x}_{t}-{\tilde{G}}_{t}({x}_{t}),{x}_{t},\dots ,{x}_{0})]\le {W}_{t}({x}_{t},{x}_{t-1},\dots ,{x}_{0}).$$

(3)

Second, for all times *T* ≤ *B* and for any sequence *x _{T}*, …, *x*_{0}, if the algorithm has not succeeded by time *T*, it must hold that

$${W}_{T}({x}_{T},{x}_{T-1},\dots ,{x}_{0})\ge T.$$

(4)

This represents the fact that we are unhappy with running for many iterations without success.

Using this, we can easily bound the convergence rate of the sequential version of the algorithm.

**Statement 1.** *Assume that we run a sequential stochastic algorithm, for which W is a rate supermartingale. For any T* ≤ *B, the probability that the algorithm has not succeeded by time T is*

$$P({F}_{T})\le \frac{\mathbf{E}[{W}_{0}({x}_{0})]}{T}.$$

In what follows, we let *W _{t}* denote the actual value taken on by the function in a process defined by (2); that is, *W _{t}* = *W _{t}*(*x _{t}*, …, *x*_{0}). Since *W* is a rate supermartingale, applying (3) recursively yields, for any *T* ≤ *B*,

$$\mathbf{E}[{W}_{T}]\le \mathbf{E}[{W}_{0}]=\mathbf{E}[{W}_{0}({x}_{0})].$$

By the law of total expectation applied to the failure event *F _{T}*,

$$\mathbf{E}[{W}_{0}({x}_{0})]\ge \mathbf{E}[{W}_{T}]=P({F}_{T})\mathbf{E}[{W}_{T}|{F}_{T}]+P(\neg {F}_{T})\mathbf{E}[{W}_{T}|\neg {F}_{T}].$$

Applying (4), i.e. **E**[*W _{T}*|*F _{T}*] ≥ *T*, together with the non-negativity of *W _{T}*, gives

$$\mathbf{E}[{W}_{0}({x}_{0})]\ge P({F}_{T})T;$$

rearranging terms produces the result in Statement 1.

This technique is very general; in subsequent sections we show that rate supermartingales can be constructed for SGD on all convex problems and for some algorithms for non-convex problems.
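To make the supermartingale condition (3) concrete, the sketch below numerically checks the inequality at one point for one-dimensional SGD on a quadratic. The objective, the noise model, and every constant are assumptions chosen for illustration; the function `W` mirrors the construction given later for the convex case, using the piecewise logarithm defined in the appendix.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Piecewise logarithm: log(e*x) above 1, the identity below (see the appendix)
def plog(x):
    return math.log(math.e * x) if x >= 1 else x

# Hypothetical 1-D setup: x* = 0, gradient samples g(x) = c*x + z with
# z ~ Uniform(-1/2, 1/2); c, eps, M2 (second-moment bound near x = 1), and
# alpha are invented constants satisfying 2*alpha*c*eps > alpha^2 * M2.
c, eps, M2, alpha = 1.0, 0.01, 1.1, 0.01
coeff = eps / (2 * alpha * c * eps - alpha ** 2 * M2)

def W(x, t):
    # candidate rate supermartingale for convex SGD
    return coeff * plog(math.e * x ** 2 / eps) + t

x, t = 1.0, 0
z = rng.uniform(-0.5, 0.5, size=200_000)
x_next = x - alpha * (c * x + z)               # one SGD step per sampled gradient
w_next = np.mean([W(v, t + 1) for v in x_next])
print(w_next <= W(x, t))                       # condition (3) holds at this point
```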

The behavior of an asynchronous SGD algorithm depends both on the problem it is trying to solve and on the hardware it is running on. For ease of analysis, we assume that the hardware has the following characteristics. These are basically the same assumptions used to prove the original Hogwild! result [17].

- There are multiple threads running iterations of (2), each with their own cache. At any point in time, these caches may hold different values for the variable *x*, and they communicate via some cache coherency protocol.
- There exists a central store (typically RAM) at which all writes are serialized. This provides a consistent value for the state of the system at any point in real time.
- If a thread performs a read *R* of a previously written value *X*, and then writes another value *Y* (dependent on *R*), then the write that produced *X* will be committed to the central store before the write that produced *Y*.
- Each write from an iteration of (2) is to only a single entry of *x* and is done using an atomic read-add-write instruction. That is, there are no write-after-write races (handling these is possible, but complicates the analysis).

Notice that, if we let *x _{t}* denote the value of the vector *x* in the central store after *t* writes have been committed, then we can model the asynchronous execution with the update rule

$${x}_{t+1}={x}_{t}-{\tilde{G}}_{t}({\tilde{\upsilon}}_{t})$$

(5)

Here, *ṽ _{t}* denotes the (possibly stale) value of *x* read by the thread that computed the *t*-th update. Our hardware model constrains the value of *ṽ _{t}*: all the read elements of *ṽ _{t}* must have been written to the central store at some time before *t*. Therefore, for some nonnegative random delay *τ̃ _{i,t}*,

$${e}_{i}^{T}{\tilde{\upsilon}}_{t}={e}_{i}^{T}{x}_{t-{\tilde{\tau}}_{i,t}},$$

(6)

where *e _{i}* is the *i*th standard basis vector; *τ̃ _{i,t}* can be interpreted as the delay with which coordinate *i* was read at time *t*.

We can conceive of this system as a stochastic process with two sources of randomness: the noisy update function samples *G̃ _{t}* and the delays *τ̃ _{i,t}*. We assume that, conditioned on the filtration ℱ _{t} of events up to time *t*, the delays are stochastically bounded by some random variable τ̃; that is, for all *i*, *t*, and *k*,

$$\mathbf{P}({\tilde{\tau}}_{i,t}\ge k|{\mathcal{F}}_{t})\le P(\tilde{\tau}\ge k).$$

(7)

We let τ = **E**[τ̃], and call τ the *worst-case expected delay*.
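The stale-read model of (5) and (6) is easy to simulate. In this sketch the per-coordinate delays are geometric (an arbitrary choice consistent with the bound (7)), and the update pulls each coordinate toward a hypothetical optimum at 1; for simplicity it writes all coordinates at once, whereas the hardware model commits one entry per write.

```python
import numpy as np

rng = np.random.default_rng(2)

n, T = 4, 60
history = [np.zeros(n)]                  # x_0, x_1, ... as committed to the store
x = np.zeros(n)
for t in range(T):
    # delay tau_{i,t} for each coordinate, truncated so reads stay in range
    delays = np.minimum(rng.geometric(0.5, size=n) - 1, t)
    # stale view (6): coordinate i is read from x_{t - tau_{i,t}}
    v = np.array([history[t - d][i] for i, d in enumerate(delays)])
    x = x - 0.1 * (v - 1.0)              # update (5) computed from the stale view
    history.append(x.copy())
print(np.max(np.abs(x - 1.0)) < 0.2)     # still converges despite staleness
```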

Now that we are equipped with a stochastic model for the asynchronous SGD algorithm, we show how we can use a rate supermartingale to give a convergence rate for asynchronous algorithms. To do this, we need some continuity and boundedness assumptions; we collect these into a definition, and then state the theorem.

An algorithm with rate supermartingale *W* is (*H, R*, ξ)-bounded if the following conditions hold. First, *W* must be Lipschitz continuous in the current iterate with parameter *H*; that is, for any *t*, *u*, υ, and sequence *x _{t−1}*, …, *x*_{0},

$$\Vert {W}_{t}(u,{x}_{t-1},\dots ,{x}_{0})-{W}_{t}(\upsilon ,{x}_{t-1},\dots ,{x}_{0})\Vert \le H\Vert u-\upsilon \Vert .$$

(8)

Second, *G̃* must be Lipschitz continuous in expectation with parameter *R*; that is, for any *u* and υ,

$$\mathbf{E}[\Vert \tilde{G}(u)-\tilde{G}(\upsilon )\Vert ]\le R{\Vert u-\upsilon \Vert}_{1}.$$

(9)

Third, the expected magnitude of the update must be bounded by ξ. That is, for any *x*,

$$\mathbf{E}[\Vert \tilde{G}(x)\Vert ]\le \xi .$$

(10)

**Theorem 1.** *Assume that we run an asynchronous stochastic algorithm with the above hardware model, for which W is an* (*H, R*, ξ)-*bounded rate supermartingale with horizon B. Further assume that HR*ξτ < 1. *For any T* ≤ *B, the probability that the algorithm has not succeeded by time T is*

$$P({F}_{T})\le \frac{\mathbf{E}[W(0,{x}_{0})]}{(1-HR\xi \tau )T}.$$

Note that this rate depends only on the worst-case expected delay τ and not on any other properties of the hardware model. Compared to the result of Statement 1, the bound on the probability of failure has only increased by a factor of (1 − *HR*ξτ)^{−1}. In most practical cases, *HR*ξτ ≪ 1, so this increase is negligible.

Since the proof of this theorem is simple, but uses non-standard techniques, we outline it here. First, notice that the process *W _{t}*, which was a supermartingale in the sequential case, is not one in the asynchronous case because of the delayed updates. Our strategy is to use *W* to construct a new process *V _{t}* that is a supermartingale for the asynchronous algorithm. As long as the algorithm has not yet succeeded, we define

$${V}_{t}({x}_{t},\dots ,{x}_{0})={W}_{t}({x}_{t},\dots ,{x}_{0})-HR\xi \tau t+HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k}^{\infty}P(\tilde{\tau}\ge m).$$

Compared with *W*, there are two additional terms here. The first term is negative, and cancels out some of the unhappiness from (4) that we ascribed to running for many iterations; we can interpret this as accepting that we may need to run for more iterations than in the sequential case. The second term measures the distance between recent iterates; we would be unhappy if this became large, because then the noise from the delayed updates would also be large. On the other hand, once the algorithm has succeeded at some time *u*, we freeze the process and define, for all *t* ≥ *u*,

$${V}_{t}({x}_{t},\dots ,{x}_{u},\dots ,{x}_{0})={V}_{u}({x}_{u},\dots ,{x}_{0}).$$

We call *V _{t}* a supermartingale for the asynchronous algorithm; a detailed proof of this property appears in the appendix. Applying to *V _{t}* the same argument used to prove Statement 1 then yields Theorem 1.

Theorem 1 gives us a straightforward way of bounding the convergence time of any asynchronous stochastic algorithm. First, we find a rate supermartingale for the problem; this is typically no harder than proving sequential convergence. Second, we find parameters such that the problem is (*H, R*, ξ)-bounded; for well-behaved problems this is easily done by using differentiation to bound the Lipschitz constants. Third, we apply Theorem 1 to get a rate for asynchronous SGD. Using this method, analyzing an asynchronous algorithm is really no more difficult than analyzing its sequential analog.

Now that we have proved our main result, we turn our attention to applications. We show, for a couple of algorithms, how to construct a rate supermartingale. We demonstrate that doing this allows us to recover known rates for Hogwild! algorithms as well as analyze cases where no known rates exist.

First, we consider the simple case of using asynchronous SGD to minimize a convex function *f*(*x*) using unbiased gradient samples ∇*f̃*(*x*). That is, we run the update rule

$${x}_{t+1}={x}_{t}-\alpha \nabla {\tilde{f}}_{t}({x}_{t}).$$

(11)

We make the standard assumption that *f* is strongly convex with parameter *c*; that is, for all *x* and *y*,

$${(x-y)}^{T}(\nabla f(x)-\nabla f(y))\ge c{\Vert x-y\Vert}^{2}.$$

(12)

We also assume continuous differentiability of *f̃* with 1-norm Lipschitz constant *L*,

$$\mathbf{E}[\Vert \nabla \tilde{f}(x)-\nabla \tilde{f}(y)\Vert ]\le L{\Vert x-y\Vert}_{1}.$$

(13)

We require that the second moment of the gradient sample is also bounded for some *M* > 0 by

$$\mathbf{E}[{\Vert \nabla \tilde{f}(x)\Vert}^{2}]\le {M}^{2}.$$

(14)

For some ε > 0, we let the success region be

$$S=\{x|{\Vert x-{x}^{*}\Vert}^{2}\le \epsilon \}.$$

Under these conditions, we can construct a rate supermartingale for this algorithm.

*There exists a W _{t} where, if the algorithm hasn't succeeded by timestep t*,

$${W}_{t}({x}_{t},\dots ,{x}_{0})=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}}\text{log}(e{\Vert {x}_{t}-{x}^{*}\Vert}^{2}{\epsilon}^{-1})+t,$$

*such that W _{t} is a rate supermartingale for the above algorithm with horizon B* = ∞.

Using this and Theorem 1 gives us a direct bound on the failure rate of convex Hogwild! SGD.

**Corollary 1.** *Assume that we run an asynchronous version of the above SGD algorithm, where for some constant* ϑ ∈ (0, 1) *we choose step size*

$$\alpha =\frac{c\epsilon \vartheta}{{M}^{2}+2LM\tau \sqrt{\epsilon}}.$$

*Then for any T, the probability that the algorithm has not succeeded by time T is*

$$P({F}_{T})\le \frac{{M}^{2}+2LM\tau \sqrt{\epsilon}}{{c}^{2}\epsilon \vartheta T}\text{log}\left(e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}{\epsilon}^{-1}\right).$$

This result is more general than the result in Niu et al. [17]. The main differences are: that we make no assumptions about the sparsity structure of the gradient samples; and that our rate depends only on the second moment of the gradient samples and the expected value of the delay τ̃, as opposed to requiring absolute bounds on their magnitudes. Under their stricter assumptions, the result of Corollary 1 recovers their rate.
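To see how the bound in Corollary 1 behaves, one can plug in constants; the numbers below are arbitrary assumptions, chosen only to show how the step size is set and how the failure probability shrinks as 1/*T*.

```python
import math

# Hypothetical problem constants (assumptions for illustration only)
c, L, M, eps, tau, theta = 1.0, 2.0, 4.0, 1e-3, 3.0, 0.5

# Step size from Corollary 1
alpha = c * eps * theta / (M ** 2 + 2 * L * M * tau * math.sqrt(eps))

def failure_bound(T, r0):
    # Bound on P(F_T) from Corollary 1, with r0 = ||x_0 - x*||^2
    return (M ** 2 + 2 * L * M * tau * math.sqrt(eps)) / (c ** 2 * eps * theta * T) \
        * math.log(math.e * r0 / eps)

print(failure_bound(1_000_000, r0=1.0))   # shrinks like 1/T as T grows
```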

One of the ways Buckwild! achieves high performance is by using low-precision fixed-point arithmetic. This introduces additional noise to the system in the form of *round-off error*, which we consider part of the Buckwild! hardware model. We assume that the round-off error can be modeled by an unbiased rounding function operating on the update samples. That is, for some chosen precision factor κ, there is a random quantization function *Q̃* such that, for any *x* ∈ ℝ, it holds that **E**[*Q̃*(*x*)] = *x*, and the round-off error is bounded by |*Q̃*(*x*) − *x*| < ακ*M*. Using this function, we can write a low-precision asynchronous update rule for convex SGD as

$${x}_{t+1}={x}_{t}-{\tilde{Q}}_{t}(\alpha \nabla {\tilde{f}}_{t}({\tilde{\upsilon}}_{t})),$$

(15)

where *Q̃ _{t}* operates only on the single nonzero entry of the update. Analogously to Corollary 1, we can prove the following convergence result.

**Corollary 2.** *Assume that we run asynchronous low-precision convex SGD, and for some* ϑ ∈ (0, 1) *we choose step size*

$$\alpha =\frac{c\epsilon \vartheta}{{M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}},$$

*then for any T, the probability that the algorithm has not succeeded by time T is*

$$P({F}_{T})\le \frac{{M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}}{{c}^{2}\epsilon \vartheta T}\text{log}\left(e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}{\epsilon}^{-1}\right).$$

Typically, we choose a precision such that κ ≪ 1; in this case, the increased error compared to the result of Corollary 1 will be negligible, and we will converge in a number of samples very similar to the high-precision, sequential case. Since each Buckwild! update runs in less time than an equivalent Hogwild! update, this result means that an execution of Buckwild! will produce same-quality output in less wall-clock time than Hogwild!
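A standard way to realize an unbiased quantization function is stochastic rounding to a fixed-point grid: round down or up at random, with probabilities chosen so the result is correct in expectation and the error never exceeds the grid spacing. The grid spacing and test value below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_round(x, delta):
    # Round x to the grid {k * delta}; take the upper neighbor with
    # probability (x - lo) / delta, so that E[Q(x)] = x exactly.
    lo = np.floor(x / delta) * delta
    p_up = (x - lo) / delta
    return lo + delta * (rng.random(np.shape(x)) < p_up)

samples = stochastic_round(np.full(100_000, 0.3), delta=0.25)
print(abs(samples.mean() - 0.3) < 0.01)   # unbiased: mean stays near 0.3
```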

Many machine learning problems are non-convex, but are still solved in practice with SGD. In this section, we show that our technique can be adapted to analyze non-convex problems. Unfortunately, there are no general convergence results that provide rates for SGD on non-convex problems, so it would be unreasonable to expect a general proof of convergence for non-convex Hogwild! Instead, we focus on a particular problem, low-rank least-squares matrix completion,

$$\begin{array}{cc}\text{minimize}\hfill & \mathbf{E}[{\Vert \tilde{A}-x{x}^{T}\Vert}_{F}^{2}]\hfill \\ \text{subject to}\hfill & x\in {\mathbb{R}}^{n},\hfill \end{array}$$

(16)

for which there exists a sequential SGD algorithm with an already-proven martingale-based rate. This problem arises in general data analysis, subspace tracking, principal component analysis, recommendation systems, and other applications [4]. In what follows, we let *A* = **E**[*Ã*]. We assume that *A* is symmetric and has unit eigenvectors *u*_{1}, *u*_{2}, …, *u _{n}* with corresponding eigenvalues λ_{1} > λ_{2} ≥ ⋯ ≥ λ _{n}; we let Δ = λ_{1} − λ_{2} denote the eigengap of *A*.

De Sa et al. [4] provide a martingale-based rate of convergence for a particular SGD algorithm, Alecton, running on this problem. For simplicity, we focus on only the rank-1 version of the problem, and we assume that, at each timestep, a single entry of *A* is used as a sample. Under these conditions, Alecton uses the update rule

$${x}_{t+1}=(I+\eta {n}^{2}{e}_{{\tilde{i}}_{t}}{e}_{{\tilde{i}}_{t}}^{T}A{e}_{{\tilde{j}}_{t}}{e}_{{\tilde{j}}_{t}}^{T}){x}_{t},$$

(17)

where *ĩ _{t}* and *j̃ _{t}* are indices sampled independently and uniformly from {1, …, *n*}. For some ε > 0, we define the success region as

$$S=\{x|{({u}_{1}^{T}x)}^{2}\ge (1-\epsilon ){\Vert x\Vert}^{2}\}.$$

(18)

That is, we are only concerned with the direction of *x*, not its magnitude; this algorithm only recovers the dominant eigenvector of *A*, not its eigenvalue. In order to show convergence for this entrywise sampling scheme, De Sa et al. [4] require that the matrix *A* satisfy a *coherence bound* [10].
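The rank-1 Alecton update (17) amounts to: sample one entry *A*[*i*, *j*], scale it by η*n*², and add the result times *x*[*j*] to the single coordinate *x*[*i*]. The synthetic matrix, step size, and iteration count below are invented for illustration; the iterate is renormalized each step purely for numerical convenience, which does not change its direction because (17) is linear in *x*.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic symmetric matrix with a known dominant eigenvector u (eigengap ~ 2)
n = 10
u = rng.normal(size=n)
u /= np.linalg.norm(u)
A = 2.0 * np.outer(u, u) + 0.1 * np.eye(n)
eta = 0.0005

x = rng.normal(size=n)
for _ in range(100_000):
    i, j = rng.integers(n), rng.integers(n)
    x[i] += eta * n * n * A[i, j] * x[j]   # update (17): touches one entry of x
    x /= np.linalg.norm(x)                 # rescale only; direction is unchanged

align = (u @ x) ** 2 / (x @ x)             # membership test for region (18)
```

After enough steps, `align` approaches 1, i.e. the iterate enters the success region (18) for small ε.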

A matrix *A* ∈ ℝ^{n×n} is incoherent with parameter μ if, for every standard basis vector *e _{j}* and every unit eigenvector *u _{i}* of *A*, it holds that (*e _{j}^{T}u _{i}*)^{2} ≤ μ^{2}*n*^{−1}.

They also require that the step size be set, for some constants 0 < γ ≤ 1 and 0 < ϑ < (1 + ε)^{−1}, as

$$\eta =\frac{\mathrm{\Delta}\epsilon \gamma \vartheta}{2n{\mu}^{4}{\Vert A\Vert}_{F}^{2}}.$$

For ease of analysis, we add the additional assumption that our algorithm runs in some bounded space; that is, for some constant *C*, at all times *t*, 1 ≤ ‖*x _{t}*‖ and ‖*x _{t}*‖_{1} ≤ *C*.

*For the problem above, choose any horizon B such that* ηγεΔ*B* ≤ 1. *Then there exists a function W _{t} such that W _{t} is a rate supermartingale for the above non-convex SGD algorithm with parameter* $H=8n{\eta}^{-1}{\gamma}^{-1}{\mathrm{\Delta}}^{-1}{\epsilon}^{-\frac{1}{2}}$ *and with initial value bounded by*

$$\mathbf{E}[{W}_{0}({x}_{0})]\le 2{\eta}^{-1}{\mathrm{\Delta}}^{-1}\text{log}(en{\gamma}^{-1}{\epsilon}^{-1})+B\sqrt{2\pi \gamma}.$$

Note that the analysis parameter γ allows us to trade off between *B*, which determines how long we can run the algorithm, and the initial value of the supermartingale **E** [*W*_{0}(*x*_{0})]. We can now produce a corollary about the convergence rate by applying Theorem 1 and setting *B* and *T* appropriately.

*Assume that we run* Hogwild! *Alecton under these conditions for T timesteps, as defined below. Then the probability of failure P*(*F _{T}*) *is bounded by*

$$T=\frac{4n{\mu}^{4}{\Vert A\Vert}_{F}^{2}}{{\mathrm{\Delta}}^{2}\epsilon \gamma \vartheta \sqrt{2\pi \gamma}}\text{log}\left(\frac{en}{\gamma \epsilon}\right),\hspace{1em}P({F}_{T})\le \frac{\sqrt{8\pi \gamma}{\mu}^{2}}{{\mu}^{2}-4C\vartheta \tau \sqrt{\epsilon}}.$$

The fact that we are able to use our technique to analyze a non-convex algorithm illustrates its generality. Note that it is possible to combine our results to analyze asynchronous low-precision non-convex SGD, but the resulting formulas are complex, so we do not include them here.

We validate our theoretical results for both asynchronous non-convex matrix completion and Buckwild!, a Hogwild! implementation with lower-precision arithmetic. Like Hogwild!, a Buckwild! algorithm has multiple threads running an update rule (2) in parallel without locking. Compared with Hogwild!, which uses 32-bit floating point numbers to represent input data, Buckwild! uses limited-precision arithmetic by rounding the input data to 8-bit or 16-bit integers. This not only decreases the memory usage, but also allows us to take advantage of single-instruction-multiple-data (SIMD) instructions for integers on modern CPUs.

We verified our main claims by running Hogwild! and Buckwild! algorithms on the discussed applications. Table 1 shows how the training loss of SGD for logistic regression, a convex problem, varies as the precision is changed. We ran SGD with step size α = 0.0001; however, results are similar across a range of step sizes. We analyzed all four datasets reported in DimmWitted [25] that favored Hogwild!: Reuters and RCV1, which are text classification datasets; Forest, which arises from remote sensing; and Music, which is a music classification dataset. We implemented all GLM models reported in DimmWitted, including SVM, Linear Regression, and Logistic Regression, and report Logistic Regression because other models have similar performance. The results illustrate that there is almost no increase in training loss as the precision is decreased for these problems. We also investigated 4-bit and 1-bit computation: the former was slower than 8-bit due to a lack of 4-bit SIMD instructions, and the latter discarded too much information to produce good quality results.

Figure 1(a) displays the speedup of Buckwild! running on the dense version of the RCV1 dataset compared to both full-precision sequential SGD (left axis) and best-case Hogwild! (right axis). Experiments ran on a machine with two Xeon X650 CPUs, each with six hyperthreaded cores, and 24GB of RAM. This plot illustrates that incorporating low-precision arithmetic into our algorithm allows us to achieve significant speedups over both sequential and Hogwild! SGD. (Note that we don’t get full linear speedup because we are bound by the available memory bandwidth; beyond this limit, adding additional threads provides no benefits while increasing conflicts and thrashing the L1 and L2 caches.) This result, combined with the data in Table 1, suggests that by doing low-precision asynchronous updates, we can get speedups of up to 2.3× on these sorts of datasets without a significant increase in error.

Experiments compare the training loss, performance, and convergence of Hogwild! and Buckwild! algorithms with sequential and/or high-precision versions.

Figure 1(b) compares the convergence trajectories of Hogwild! and sequential versions of the non-convex Alecton matrix completion algorithm on a synthetic data matrix *A* ∈ ℝ^{n×n} with ten random eigenvalues λ_{i} > 0. Each plotted series represents a different run of Alecton; the trajectories differ somewhat because of the randomness of the algorithm. The plot shows that the sequential and asynchronous versions behave qualitatively similarly, and converge to the same noise floor. For this dataset, sequential Alecton took 6.86 seconds to run while 12-thread Hogwild! Alecton took 1.39 seconds, a 4.9× speedup.

This paper presented a unified theoretical framework for producing results about the convergence rates of asynchronous and low-precision random algorithms such as stochastic gradient descent. We showed how a martingale-based rate of convergence for a sequential, full-precision algorithm can be easily leveraged to give a rate for an asynchronous, low-precision version. We also introduced Buckwild!, a strategy for SGD that is able to take advantage of modern hardware resources for both task and data parallelism, and showed that it achieves near linear parallel speedup over sequential algorithms.

The Buckwild! name arose out of conversations with Benjamin Recht. Thanks also to Madeleine Udell for helpful conversations.

The authors acknowledge the support of: DARPA FA8750-12-2-0335; NSF IIS-1247701; NSF CCF-1111943; DOE 108845; NSF CCF-1337375; DARPA FA8750-13-2-0039; NSF IIS-1353606; ONR N000141210041 and N000141310129; NIH U54EB020405; Oracle; NVIDIA; Huawei; SAP Labs; Sloan Research Fellowship; Moore Foundation; American Family Insurance; Google; and Toshiba.

This proof is a more detailed version of the argument outlined in Section 2.2. First, we restate the definition of the process *V _{t}* from the body of the paper. As long as the algorithm hasn’t succeeded yet,

$${V}_{t}({x}_{t},\dots ,{x}_{0})={W}_{t}({x}_{t},\dots ,{x}_{0})-HR\xi \tau t+HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k}^{\infty}P(\tilde{\tau}\ge m).$$

At the next timestep, we will have *x*_{t+1} = *x _{t}* + *G̃*(*ṽ _{t}*), where (as in the equations that follow) we absorb the sign of the update into *G̃* for notational convenience; substituting into the definition of *V* gives

$${V}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})={W}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})-HR\xi \tau (t+1)+HR\Vert \tilde{G}({\tilde{\upsilon}}_{t})\Vert \sum _{m=1}^{\infty}P(\tilde{\tau}\ge m)\phantom{\rule{0ex}{0ex}}+HR\sum _{k=2}^{\infty}\Vert {x}_{t-k+2}-{x}_{t-k+1}\Vert \sum _{m=k}^{\infty}P(\tilde{\tau}\ge m).$$

Re-indexing the second sum and applying the definition of τ produces

$${V}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})={W}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})-HR\xi \tau (t+1)+HR\tau \Vert \tilde{G}({\tilde{\upsilon}}_{t})\Vert \phantom{\rule{0ex}{0ex}}+HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k+1}^{\infty}P(\tilde{\tau}\ge m).$$

Applying the Lipschitz continuity assumption (8) for *W* results in

$${V}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})\le {W}_{t+1}({x}_{t}+\tilde{G}({x}_{t}),{x}_{t},\dots ,{x}_{0})+H\Vert \tilde{G}({\tilde{\upsilon}}_{t})-\tilde{G}({x}_{t})\Vert -HR\xi \tau (t+1)\phantom{\rule{0ex}{0ex}}+HR\tau \Vert \tilde{G}({\tilde{\upsilon}}_{t})\Vert +HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k+1}^{\infty}P(\tilde{\tau}\ge m).$$

Taking the expected value of both sides produces

$$\mathbf{E}[{V}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})]\le \mathbf{E}[{W}_{t+1}({x}_{t}+\tilde{G}({x}_{t}),{x}_{t},\dots ,{x}_{0})]+H\mathbf{E}[\Vert \tilde{G}({\tilde{\upsilon}}_{t})-\tilde{G}({x}_{t})\Vert ]-HR\xi \tau (t+1)\phantom{\rule{0ex}{0ex}}+HR\tau \mathbf{E}[\Vert \tilde{G}({\tilde{\upsilon}}_{t})\Vert ]+HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k+1}^{\infty}P(\tilde{\tau}\ge m).$$

Applying the rate supermartingale property (3) of *W*,

$$\mathbf{E}[{V}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})]\le {W}_{t}({x}_{t},\dots ,{x}_{0})+H\mathbf{E}[\Vert \tilde{G}({\tilde{\upsilon}}_{t})-\tilde{G}({x}_{t})\Vert ]-HR\xi \tau (t+1)\phantom{\rule{0ex}{0ex}}+HR\tau \mathbf{E}[\Vert \tilde{G}({\tilde{\upsilon}}_{t})\Vert ]+HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k+1}^{\infty}P(\tilde{\tau}\ge m).$$

Applying the Lipschitz continuity assumption (9) for *G̃*,

$$\mathbf{E}[{V}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})]\le {W}_{t}({x}_{t},\dots ,{x}_{0})+HR\mathbf{E}[{\Vert {\tilde{\upsilon}}_{t}-{x}_{t}\Vert}_{1}]-HR\xi \tau (t+1)\phantom{\rule{0ex}{0ex}}+HR\tau \mathbf{E}[\Vert \tilde{G}({\tilde{\upsilon}}_{t})\Vert ]+HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k+1}^{\infty}P(\tilde{\tau}\ge m).$$

Finally, applying the update distance bound (10),

$$\mathbf{E}[{V}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})]\le {W}_{t}({x}_{t},\dots ,{x}_{0})+HR\mathbf{E}[{\Vert {\tilde{\upsilon}}_{t}-{x}_{t}\Vert}_{1}]-HR\xi \tau (t+1)\phantom{\rule{0ex}{0ex}}+HR\xi \tau +HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k+1}^{\infty}P(\tilde{\tau}\ge m)\phantom{\rule{0ex}{0ex}}={W}_{t}({x}_{t},\dots ,{x}_{0})-HR\xi \tau t+HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert \sum _{m=k}^{\infty}P(\tilde{\tau}\ge m)\phantom{\rule{0ex}{0ex}}+HR\mathbf{E}[{\Vert {\tilde{\upsilon}}_{t}-{x}_{t}\Vert}_{1}]-HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert P(\tilde{\tau}\ge k)\phantom{\rule{0ex}{0ex}}={V}_{t}({x}_{t},\dots ,{x}_{0})+HR\mathbf{E}[{\Vert {\tilde{\upsilon}}_{t}-{x}_{t}\Vert}_{1}]-HR\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert P(\tilde{\tau}\ge k).$$

Now, by the definition of *ṽ _{t}* in (6),

$${\Vert {\tilde{\upsilon}}_{t}-{x}_{t}\Vert}_{1}=\sum _{i=1}^{n}|{e}_{i}^{T}{x}_{t}-{e}_{i}^{T}{\tilde{\upsilon}}_{t}|\phantom{\rule{0ex}{0ex}}=\sum _{i=1}^{n}|{e}_{i}^{T}{x}_{t}-{e}_{i}^{T}{x}_{t-{\tilde{\tau}}_{i,t}}|\phantom{\rule{0ex}{0ex}}\le \sum _{i=1}^{n}\sum _{k=1}^{{\tilde{\tau}}_{i,t}}|{e}_{i}^{T}{x}_{t-k+1}-{e}_{i}^{T}{x}_{t-k}|$$

Furthermore, using the bound on *τ̃ _{i,t}* from (7) gives us

$$\mathbf{E}[{\Vert {\tilde{\upsilon}}_{t}-{x}_{t}\Vert}_{1}]\le \sum _{i=1}^{n}\sum _{k=1}^{\infty}|{e}_{i}^{T}{x}_{t-k+1}-{e}_{i}^{T}{x}_{t-k}|P({\tilde{\tau}}_{i,t}\ge k)\phantom{\rule{0ex}{0ex}}\le \sum _{i=1}^{n}\sum _{k=1}^{\infty}|{e}_{i}^{T}{x}_{t-k+1}-{e}_{i}^{T}{x}_{t-k}|P(\tilde{\tau}\ge k)\phantom{\rule{0ex}{0ex}}=\sum _{k=1}^{\infty}{\Vert {x}_{t-k+1}-{x}_{t-k}\Vert}_{1}P(\tilde{\tau}\ge k)\phantom{\rule{0ex}{0ex}}=\sum _{k=1}^{\infty}\Vert {x}_{t-k+1}-{x}_{t-k}\Vert P(\tilde{\tau}\ge k),$$

where the 1-norm is equal to the 2-norm here because each step only updates a single entry of *x*. Substituting this result in to the above equation allows us to conclude that, if the algorithm hasn’t succeeded by time *t*,

$$\mathbf{E}[{V}_{t+1}({x}_{t}+\tilde{G}({\tilde{\upsilon}}_{t}),{x}_{t},\dots ,{x}_{0})]\le {V}_{t}({x}_{t},\dots ,{x}_{0}).$$

(19)

On the other hand, if it has succeeded, this statement will be vacuously true, since *V _{t}* does not change after success occurs. Therefore, (19) will hold for all times.

In what follows, as in the proof of Statement 1, we let *V _{t}* denote the actual value taken on by the function during execution of the algorithm; that is, *V _{t}* = *V _{t}*(*x _{t}*, …, *x*_{0}). Since *V _{t}* is a supermartingale, for any *T*,

$$\mathbf{E}[{V}_{T}]\le \mathbf{E}[{V}_{0}].$$

Since we assumed as part of our hardware model that *x _{t}* = *x*_{0} for all *t* ≤ 0 (no writes precede the start of execution), the sum in the definition of *V*_{0} vanishes, and so

$$\mathbf{E}[{V}_{0}]=\mathbf{E}[{W}_{0}({x}_{0})].$$

Therefore, by the law of total expectation

$$\mathbf{E}[{W}_{0}({x}_{0})]\ge \mathbf{E}[{V}_{T}]\phantom{\rule{0ex}{0ex}}=\mathbf{E}[{V}_{T}|{F}_{T}]P({F}_{T})+\mathbf{E}[{V}_{T}|\neg {F}_{T}]P(\neg {F}_{T})\phantom{\rule{0ex}{0ex}}\ge \mathbf{E}[{V}_{T}|{F}_{T}]P({F}_{T})\phantom{\rule{0ex}{0ex}}=\mathbf{E}\left[{W}_{T}({x}_{T},\dots ,{x}_{0})-HR\xi \tau T+HR\sum _{k=1}^{\infty}\Vert {x}_{T-k+1}-{x}_{T-k}\Vert \sum _{m=k}^{\infty}P(\tilde{\tau}\ge m)\,\middle|\,{F}_{T}\right]P({F}_{T})\phantom{\rule{0ex}{0ex}}\ge (\mathbf{E}[{W}_{T}({x}_{T},\dots ,{x}_{0})|{F}_{T}]-HR\xi \tau T)P({F}_{T}).$$

Since *W _{t}* is a rate supermartingale, we can apply (4) to get

$$\mathbf{E}[{W}_{0}({x}_{0})]\ge (T-HR\xi \tau T)P({F}_{T}),$$

and solving for *P* (*F _{T}*) produces

$$P({F}_{T})\le \frac{\mathbf{E}[{W}_{0}({x}_{0})]}{(1-HR\xi \tau )T},$$

as desired.
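The delayed-read noise model that Theorem 1 handles can be illustrated with a sequential simulation, in which each coordinate of the iterate used to compute a gradient is stale by an independent random delay. The toy objective, delay distribution, and constants below are illustrative assumptions, not parameters from the analysis.

```python
import random

def delayed_sgd(grad_sample, x0, alpha, max_delay, steps):
    """Sequentially simulate Hogwild!-style SGD: the gradient is evaluated at
    a stale view v whose coordinates are read from randomly delayed iterates."""
    history = [list(x0)]
    x = list(x0)
    for _ in range(steps):
        # Each coordinate i is read with its own random delay, as in the model.
        v = [history[max(0, len(history) - 1 - random.randint(0, max_delay))][i]
             for i in range(len(x))]
        g = grad_sample(v)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        history.append(list(x))
    return x

# Toy strongly convex objective f(x) = 0.5*||x - b||^2 with noisy gradients.
random.seed(0)
b = [1.0, -2.0, 0.5]
grad = lambda v: [vi - bi + random.gauss(0.0, 0.1) for vi, bi in zip(v, b)]
x_final = delayed_sgd(grad, [0.0, 0.0, 0.0], alpha=0.05, max_delay=5, steps=3000)
```

Despite the staleness, the iterate settles near the optimum, as the theorem predicts for small enough step size.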

First, we state the rate supermartingale lemma for the low-precision convex SGD algorithm.

*There exists a W _{t} with*

$${W}_{0}({x}_{0})\le \frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right)$$

*such that W _{t} is a rate supermartingale for the above convex SGD algorithm with horizon B* = ∞, *and such that it is* (*H, R*, ξ)-*bounded with R* = α*L*, ξ = α*M*$\sqrt{1+{\kappa}^{2}}$, *and*

$$H=\frac{2\sqrt{\epsilon}}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}.$$

We note that, including this Lemma, the results in Section 3.1 are the same as those in Section 3.2 with the quantization factor set to κ = 0. It therefore suffices to prove only the Lemma and Corollary of Section 3.2, which is what we do here.

In order to prove the results in this section, we will need some definitions and lemmas, which we state now.

(Piecewise Logarithm). For the purposes of this document, we define the *piecewise logarithm* function to be

$$\text{log}(x)=\{\begin{array}{cc}\text{log}(\mathit{\text{ex}})\hfill & :x\ge 1\hfill \\ x\hfill & :x\le 1\hfill \end{array}$$

*The piecewise logarithm function is differentiable and concave. Also, if x* ≥ 1, *then for any* Δ,

$$\text{log}(x(1+\mathrm{\Delta}))\le \text{log}(x)+\mathrm{\Delta}.$$

The first part of the lemma follows from the fact that log(*x*) is a piecewise function, where the pieces are both increasing and concave, and the fact that the function is differentiable at *x* = 1. The second part of the lemma follows from the fact that a first-order approximation always overestimates a concave function.
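The piecewise logarithm and the two properties claimed in the lemma can be checked numerically; the sketch below is a direct transcription of the definition.

```python
import math

def plog(x):
    """Piecewise logarithm: log(e*x) for x >= 1, the identity for x <= 1.
    The pieces meet at x = 1 with common value 1 and common slope 1."""
    return math.log(math.e * x) if x >= 1 else x

def plog_prime(u):
    """Its derivative for u > 0: min(1, 1/u)."""
    return min(1.0, 1.0 / u)

# Overestimate property: plog(x*(1 + d)) <= plog(x) + d whenever x >= 1.
pairs = [(x, d) for x in [1.0, 1.5, 4.0, 100.0] for d in [-0.5, -0.1, 0.0, 0.3, 2.0]]
overestimate_ok = all(plog(x * (1 + d)) <= plog(x) + d + 1e-12 for x, d in pairs)
```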

Armed with this definition, we prove Lemma 3.

First, we note that, at any timestep *t*, if we evaluate the distance to the optimum at the next timestep using (11), then

$${\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}={\Vert {x}_{t}-{x}^{*}\Vert}^{2}-2{({x}_{t}-{x}^{*})}^{T}{\tilde{Q}}_{t}(\alpha \nabla {\tilde{f}}_{t}({x}_{t}))+{\Vert {\tilde{Q}}_{t}(\alpha \nabla {\tilde{f}}_{t}({x}_{t}))\Vert}^{2}\phantom{\rule{0ex}{0ex}}={\Vert {x}_{t}-{x}^{*}\Vert}^{2}-2{({x}_{t}-{x}^{*})}^{T}{\tilde{Q}}_{t}(\alpha \nabla {\tilde{f}}_{t}({x}_{t}))+{\alpha}^{2}{\Vert \nabla {\tilde{f}}_{t}({x}_{t})\Vert}^{2}\phantom{\rule{0ex}{0ex}}+2{(\alpha \nabla {\tilde{f}}_{t}({x}_{t}))}^{T}({\tilde{Q}}_{t}(\alpha \nabla {\tilde{f}}_{t}({x}_{t}))-\alpha \nabla {\tilde{f}}_{t}({x}_{t}))+{\Vert {\tilde{Q}}_{t}(\alpha \nabla {\tilde{f}}_{t}({x}_{t}))-\alpha \nabla {\tilde{f}}_{t}({x}_{t})\Vert}^{2}.$$

Taking the expected value and applying (14) and the bounds on the properties of ${\tilde{Q}}_{t}$ produces

$$\mathbf{E}\left[{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right]\le {\Vert {x}_{t}-{x}^{*}\Vert}^{2}-2\alpha {({x}_{t}-{x}^{*})}^{T}\nabla f({x}_{t})+{\alpha}^{2}{M}^{2}+{\delta}^{2}.$$

Since we assigned δ ≤ ακ*M*,

$$\mathbf{E}\left[{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right]\le {\Vert {x}_{t}-{x}^{*}\Vert}^{2}-2\alpha {({x}_{t}-{x}^{*})}^{T}\nabla f({x}_{t})+{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})\phantom{\rule{0ex}{0ex}}={\Vert {x}_{t}-{x}^{*}\Vert}^{2}-2\alpha {({x}_{t}-{x}^{*})}^{T}(\nabla f({x}_{t})-\nabla f({x}^{*}))+{\alpha}^{2}{M}^{2}(1+{\kappa}^{2}).$$

Applying the strong convexity assumption (12),

$$\mathbf{E}\left[{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right]\le {\Vert {x}_{t}-{x}^{*}\Vert}^{2}-2\alpha c{\Vert {x}_{t}-{x}^{*}\Vert}^{2}+{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})\phantom{\rule{0ex}{0ex}}=(1-2\alpha c){\Vert {x}_{t}-{x}^{*}\Vert}^{2}+{\alpha}^{2}{M}^{2}(1+{\kappa}^{2}).$$

Now, if we haven’t succeeded yet, then ‖*x _{t}* − *x**‖^{2} ≥ ε, and therefore

$$\mathbf{E}\left[{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right]\le {\Vert {x}_{t}-{x}^{*}\Vert}^{2}(1-2\alpha c+{\alpha}^{2}{M}^{2}(1+{\kappa}^{2}){\epsilon}^{-1}).$$

Multiplying both sides of this inequality by ε^{−1} and taking the piecewise logarithm, which is concave, by Jensen’s inequality

$$\mathbf{E}\left[\text{log}\left({\epsilon}^{-1}{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right)\right]\le \text{log}\left(\mathbf{E}\left[{\epsilon}^{-1}{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right]\right)\phantom{\rule{0ex}{0ex}}\le \text{log}({\epsilon}^{-1}{\Vert {x}_{t}-{x}^{*}\Vert}^{2}(1-2\alpha c+{\alpha}^{2}{M}^{2}(1+{\kappa}^{2}){\epsilon}^{-1})).$$

Since ε^{−1} ‖*x _{t}* − *x**‖^{2} ≥ 1, we can apply the second part of Lemma 4 to get

$$\mathbf{E}\left[\text{log}\left({\epsilon}^{-1}{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right)\right]\le \text{log}\left({\epsilon}^{-1}{\Vert {x}_{t}-{x}^{*}\Vert}^{2}\right)-2\alpha c+{\alpha}^{2}{M}^{2}(1+{\kappa}^{2}){\epsilon}^{-1}.$$

Now, we define the rate supermartingale *W _{t}* such that, if we haven’t succeeded up to time *t*, then

$${W}_{t}({x}_{t},\dots ,{x}_{0})=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left({\epsilon}^{-1}{\Vert {x}_{t}-{x}^{*}\Vert}^{2}\right)+t;$$

otherwise, if *u* is the first time at which success occurs (that is, ‖*x _{u}* − *x**‖^{2} ≤ ε), then for all *t* ≥ *u*,

$${W}_{t}({x}_{t},\dots ,{x}_{0})={W}_{u}({x}_{u},\dots ,{x}_{0}).$$

The first rate supermartingale property (3) is true because if success hasn’t occurred,

$$\mathbf{E}[{W}_{t+1}({x}_{t}+{\tilde{G}}_{t}({x}_{t}),\dots ,{x}_{0})]=\mathbf{E}\left[\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left({\epsilon}^{-1}{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right)+(t+1)\right]\phantom{\rule{0ex}{0ex}}=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\mathbf{E}\left[\text{log}\left({\epsilon}^{-1}{\Vert {x}_{t}+{\tilde{G}}_{t}({x}_{t})-{x}^{*}\Vert}^{2}\right)\right]+(t+1)\phantom{\rule{0ex}{0ex}}\le \frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\left(\text{log}\left({\epsilon}^{-1}{\Vert {x}_{t}-{x}^{*}\Vert}^{2}\right)-2\alpha c+{\alpha}^{2}{M}^{2}(1+{\kappa}^{2}){\epsilon}^{-1}\right)+(t+1)\phantom{\rule{0ex}{0ex}}=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left({\epsilon}^{-1}{\Vert {x}_{t}-{x}^{*}\Vert}^{2}\right)-1+(t+1)\phantom{\rule{0ex}{0ex}}={W}_{t}({x}_{t},\dots ,{x}_{0});$$

it is vacuously true if success has occurred because the value of *W _{t}* does not change after success occurs. The second rate supermartingale property (4) holds because, if success hasn’t occurred by time *T*,

$${W}_{T}({x}_{T},\dots ,{x}_{0})=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left({\epsilon}^{-1}{\Vert {x}_{T}-{x}^{*}\Vert}^{2}\right)+T\ge T;$$

this follows from the non-negativity of the log function for non-negative arguments.

We have now shown that *W _{t}* is a rate supermartingale for this algorithm. Next, we verify that the bound on *W*_{0}(*x*_{0}) claimed in the lemma holds. At time 0,

$${W}_{0}({x}_{0})=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left({\epsilon}^{-1}{\Vert {x}_{0}-{x}^{*}\Vert}^{2}\right)\phantom{\rule{0ex}{0ex}}=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right);$$

this is the bound given in the lemma statement.

Next, we show that this rate supermartingale is (*H, R*, ξ)-bounded, for the values of *H, R*, and ξ given in the lemma statement. First, for any *x, t*, and sequence *x*_{t−1}, …, *x*_{0},

$${\nabla}_{x}{W}_{t}(x,{x}_{t-1},\dots ,{x}_{0})={\nabla}_{x}\left(\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left({\epsilon}^{-1}{\Vert x-{x}^{*}\Vert}^{2}\right)\right)\phantom{\rule{0ex}{0ex}}=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}2{\epsilon}^{-1}(x-{x}^{*})\stackrel{\prime}{\text{log}}\phantom{\rule{thinmathspace}{0ex}}\left({\epsilon}^{-1}{\Vert x-{x}^{*}\Vert}^{2}\right).$$

Now, by the definition of log, we can conclude that log′(*u*) = min (1, *u*^{−1}). Therefore,

$${\nabla}_{x}{W}_{t}(x,{x}_{t-1},\dots ,{x}_{0})=\frac{2}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}(x-{x}^{*})\text{min}(1,\epsilon {\Vert x-{x}^{*}\Vert}^{-2}),$$

and taking the norm of both sides,

$$\Vert {\nabla}_{x}{W}_{t}(x,{x}_{t-1},\dots ,{x}_{0})\Vert =\frac{2}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{min}(\Vert x-{x}^{*}\Vert ,\epsilon {\Vert x-{x}^{*}\Vert}^{-1}).$$

Clearly, this expression is maximized when ‖*x* − *x**‖^{2} = ε. Therefore,

$$\Vert {\nabla}_{x}{W}_{t}(x,{x}_{t-1},\dots ,{x}_{0})\Vert \le \frac{2\sqrt{\epsilon}}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}.$$

The Lipschitz continuity expression with *H* in the lemma statement now follows from the mean value theorem.

Next, we bound the Lipschitz continuity expression for *R*. We have that, for any *x* and *y*, if the single non-zero entry of $\nabla \tilde{f}$ is at index *i*, then

$$\mathbf{E}[\Vert \tilde{G}(x)-\tilde{G}(y)\Vert ]=\mathbf{E}[\Vert \tilde{Q}(\alpha \nabla \tilde{f}(x))-\tilde{Q}(\alpha \nabla \tilde{f}(y))\Vert ]\phantom{\rule{0ex}{0ex}}=\mathbf{E}[|\tilde{Q}(\alpha {e}_{i}^{T}\nabla \tilde{f}(x))-\tilde{Q}(\alpha {e}_{i}^{T}\nabla \tilde{f}(y))|].$$

Without loss of generality, we assume that $\tilde{Q}$ is non-decreasing, and that ${e}_{i}^{T}\nabla \tilde{f}(x)\ge {e}_{i}^{T}\nabla \tilde{f}(y)$. Thus, by the unbiasedness of $\tilde{Q}$,

$$\mathbf{E}[\Vert \tilde{G}(x)-\tilde{G}(y)\Vert ]=\mathbf{E}[\tilde{Q}({e}_{i}^{T}\alpha \nabla \tilde{f}(x))-\tilde{Q}({e}_{i}^{T}\alpha \nabla \tilde{f}(y))]\phantom{\rule{0ex}{0ex}}=\mathbf{E}[{e}_{i}^{T}\alpha \nabla \tilde{f}(x)-{e}_{i}^{T}\alpha \nabla \tilde{f}(y)]\phantom{\rule{0ex}{0ex}}=\alpha \mathbf{E}[\Vert \nabla \tilde{f}(x)-\nabla \tilde{f}(y)\Vert ].$$

Combining this with the Lipschitz continuity condition (13) yields the stated value *R* = α*L*. Finally, we bound the update expression with ξ. We have,

$$\mathbf{E}{[\Vert \tilde{G}(x)\Vert ]}^{2}\le \mathbf{E}[{\Vert \tilde{G}(x)\Vert}^{2}]=\mathbf{E}[{\Vert \tilde{Q}(\alpha \nabla \tilde{f}(x))\Vert}^{2}]\phantom{\rule{0ex}{0ex}}=\mathbf{E}[{\alpha}^{2}{\Vert \nabla \tilde{f}(x)\Vert}^{2}+2\alpha {(\nabla \tilde{f}(x))}^{T}(\tilde{Q}(\alpha \nabla \tilde{f}(x))-\alpha \nabla \tilde{f}(x))+{\Vert \tilde{Q}(\alpha \nabla \tilde{f}(x))-\alpha \nabla \tilde{f}(x)\Vert}^{2}].$$

Applying the bounds on the rounding error,

$$\mathbf{E}{[\Vert \tilde{G}(x)\Vert ]}^{2}\le \mathbf{E}[{\alpha}^{2}{\Vert \nabla \tilde{f}(x)\Vert}^{2}+2\alpha {(\nabla \tilde{f}(x))}^{T}(\tilde{Q}(\alpha \nabla \tilde{f}(x))-\alpha \nabla \tilde{f}(x))+{\delta}^{2}].$$

Taking the expected value and applying (14) and the unbiasedness of $\tilde{Q}$,

$$\mathbf{E}{[\Vert \tilde{G}(x)\Vert ]}^{2}\le {\alpha}^{2}{M}^{2}+{\delta}^{2}.$$

Applying the assignment δ ≤ ακ*M* results in

$$\mathbf{E}{[\Vert \tilde{G}(x)\Vert ]}^{2}\le {\alpha}^{2}{M}^{2}(1+{\kappa}^{2}),$$

which is the desired expression.

So, we have proved all the statements in the lemma.
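One quantizer satisfying the properties assumed of $\tilde{Q}$ above—unbiasedness and a rounding error bounded by the grid spacing, which plays the role of δ—is stochastic rounding. The following sketch checks both properties empirically; the grid spacing and test value are arbitrary.

```python
import random

def q_round(v, delta):
    """Stochastic rounding to the grid delta*Z: rounds up with probability
    equal to the fractional position, so the expectation equals v and the
    rounding error is always less than delta in magnitude."""
    lower = delta * (v // delta)
    return lower + delta if random.random() < (v - lower) / delta else lower

random.seed(1)
delta, v = 0.125, 0.3
samples = [q_round(v, delta) for _ in range(200000)]
mean = sum(samples) / len(samples)          # empirically close to v (unbiased)
max_err = max(abs(s - v) for s in samples)  # bounded by the grid spacing
```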

Applying Theorem 1 directly to the result of Lemma 3 produces

$$P({F}_{T})\le \frac{\mathbf{E}[{W}_{0}({x}_{0})]}{(1-HR\xi \tau )T}=\frac{\epsilon}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right){\left(\left(1-\frac{2\sqrt{\epsilon}}{2\alpha c\epsilon -{\alpha}^{2}{M}^{2}(1+{\kappa}^{2})}(\alpha L)(\alpha M\sqrt{1+{\kappa}^{2}})\tau \right)T\right)}^{-1}\phantom{\rule{0ex}{0ex}}=\frac{\epsilon}{(2\alpha c\epsilon -{\alpha}^{2}({M}^{2}(1+{\kappa}^{2})+2LM\tau \sqrt{1+{\kappa}^{2}}\sqrt{\epsilon}))T}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right)\phantom{\rule{0ex}{0ex}}\le \frac{\epsilon}{(2\alpha c\epsilon -{\alpha}^{2}({M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}))T}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right).$$

Substituting the chosen value of α,

$$P({F}_{T})\le \frac{\epsilon}{T}{\left(2c\epsilon \left(\frac{c\epsilon \vartheta}{{M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}}\right)-({M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}){\left(\frac{c\epsilon \vartheta}{{M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}}\right)}^{2}\right)}^{-1}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right)\phantom{\rule{0ex}{0ex}}=\frac{\epsilon}{\left(\frac{2{c}^{2}{\epsilon}^{2}\vartheta}{{M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}}-\frac{{c}^{2}{\epsilon}^{2}{\vartheta}^{2}}{{M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}}\right)T}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right)\phantom{\rule{0ex}{0ex}}\le \frac{\epsilon}{\frac{{c}^{2}{\epsilon}^{2}\vartheta}{{M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}}T}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right)=\frac{{M}^{2}(1+{\kappa}^{2})+LM\tau (2+{\kappa}^{2})\sqrt{\epsilon}}{{c}^{2}\epsilon \vartheta T}\text{log}\left(\frac{e{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right),$$

as desired.
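The simplification of the denominator under the chosen step size can be sanity-checked numerically; the constants below are arbitrary positive values used only for illustration, with `theta` standing in for the symbol ϑ.

```python
import math

# Arbitrary positive constants for an arithmetic sanity check only.
c, eps, M, L, tau, kappa, theta = 0.7, 1e-2, 2.0, 3.0, 4.0, 0.5, 0.6
D = M**2 * (1 + kappa**2) + L * M * tau * (2 + kappa**2) * math.sqrt(eps)
alpha = c * eps * theta / D  # the chosen step size

lhs = 2 * alpha * c * eps - alpha**2 * D               # denominator factor (before dividing by T)
closed_form = c**2 * eps**2 * theta * (2 - theta) / D  # its simplification
lower_bound = c**2 * eps**2 * theta / D                # bound used in the final inequality
```

The final inequality in the corollary rests on `lhs >= lower_bound`, which holds because ϑ ≤ 1 gives 2 − ϑ ≥ 1.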

In order to accomplish this proof, we make use of some definitions and lemmas that appear in De Sa et al. [4]. We state them here before proceeding to the proof.

First, we define a function

$$\tau (x)=\frac{{({u}_{1}^{T}x)}^{2}}{(1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2}}.$$

Clearly, 0 ≤ τ (*x*) ≤ 1. Using this function, De Sa et al. [4] prove the following lemma. While their version of the lemma applies to higher-rank problems and multiple distributions, we state here a version that is specialized for the rank-1, entrywise sampling case we study in this paper. (This is a combination of Lemma 2 and Lemma 12 from De Sa et al. [4].)

(τ-bound). *If we run the Alecton update rule using entrywise sampling under the conditions in Section 3.3, including the incoherence and step size assignment, then for any x* ∉ *S*,

$$\mathbf{E}[\tau (x+\eta \tilde{A}x)]\ge \tau (x)(1+\eta \mathrm{\Delta}(1-\tau (x))).$$

We also use another lemma from De Sa et al. [4]. This is a combination of their Lemmas 1 and 7.

(Expected value of τ (*x*_{0})). *If we initialize x_{0} with a uniform random angle (as done in Alecton), then*

$$\mathbf{E}[1-\tau ({x}_{0})]\le \sqrt{\frac{\pi \gamma}{2}}.$$

Now, we prove Lemma 2.

First, if *x* ∉ *S*, then ${({u}_{1}^{T}x)}^{2}\le (1-\epsilon ){\Vert x\Vert}^{2}$. Therefore,

$$\tau (x)=\frac{{({u}_{1}^{T}x)}^{2}}{(1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2}}\phantom{\rule{0ex}{0ex}}\le \frac{1-\epsilon}{(1-\gamma {n}^{-1})(1-\epsilon )+\gamma {n}^{-1}}\phantom{\rule{0ex}{0ex}}=\frac{1-\epsilon}{1-\epsilon +\gamma {n}^{-1}\epsilon},$$

and so

$$1-\tau (x)\ge \frac{\gamma {n}^{-1}\epsilon}{1-\epsilon +\gamma {n}^{-1}\epsilon}>\gamma {n}^{-1}\epsilon .$$
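The function τ and the lower bound just derived can be checked numerically at random points outside the success set *S*; taking *u*_{1} to be a coordinate vector is an illustrative assumption.

```python
import random

def tau_fn(x, u1, gamma):
    """The potential function tau(x); always lies in [0, 1]."""
    n = len(x)
    a2 = sum(ui * xi for ui, xi in zip(u1, x)) ** 2
    nx2 = sum(xi * xi for xi in x)
    return a2 / ((1 - gamma / n) * a2 + (gamma / n) * nx2)

random.seed(2)
n, gamma, eps = 10, 0.3, 0.2
u1 = [1.0] + [0.0] * (n - 1)  # unit eigenvector, chosen for illustration
violations = 0
for _ in range(1000):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    nx2 = sum(xi * xi for xi in x)
    if x[0] ** 2 <= (1 - eps) * nx2:  # x lies outside the success set S
        if not (1.0 - tau_fn(x, u1, gamma) > gamma * eps / n):
            violations += 1
```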

From the result of Lemma 5, for any *x* ∉ *S*,

$$\mathbf{E}[\tau (x+\eta \tilde{A}x)]\ge \tau (x)(1+\eta \mathrm{\Delta}(1-\tau (x))).$$

Therefore,

$$\mathbf{E}[1-\tau (x+\eta \tilde{A}x)]\le (1-\tau (x))(1-\eta \mathrm{\Delta}\tau (x)).$$

Therefore, by Jensen’s inequality and Lemma 4, since γ^{−1}*n*ε^{−1}(1 − τ(*x*)) > 1,

$$\mathbf{E}\left[\text{log}\left({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau (x+\eta \tilde{A}x))\right)\right]\le \text{log}\left(\mathbf{E}\left[{\gamma}^{-1}n{\epsilon}^{-1}(1-\tau (x+\eta \tilde{A}x))\right]\right)\phantom{\rule{0ex}{0ex}}\le \text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau (x))(1-\eta \mathrm{\Delta}\tau (x)))\phantom{\rule{0ex}{0ex}}\le \text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau (x)))-\eta \mathrm{\Delta}\tau (x).$$

Now, we define our rate supermartingale. First, define

$$Z=\left\{x\,\middle|\,\tau (x)\le \frac{1}{2}\right\},$$

and let *B* > 0 be any constant. Let *W _{t}* be defined such that, if *x _{s}* ∉ *Z* for all *s* ≤ *t*, then

$${W}_{t}({x}_{t},\dots ,{x}_{0})=\frac{2}{\eta \mathrm{\Delta}}\text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ({x}_{t})))+2B(1-\tau ({x}_{t}))+t.$$

On the other hand, if *x _{u}* ∈ *Z* for some *u* ≤ *t*, and *u* is the first such time, then

$${W}_{t}({x}_{t},\dots ,{x}_{0})={W}_{u}({x}_{u},\dots ,{x}_{0}).$$

That is, once *x _{t}* enters *Z*, the value of *W _{t}* stops changing.

We verify that *W _{t}* is a rate supermartingale. First, (3) is true because, in the case that the process has stopped it is true vacuously, and in the case that it hasn’t stopped (i.e. *x _{t}* ∉ *Z*),

$$\mathbf{E}[{W}_{t+1}({x}_{t}+\eta {\tilde{A}}_{t}{x}_{t},{x}_{t},\dots ,{x}_{0})]=\mathbf{E}\left[\frac{2}{\eta \mathrm{\Delta}}\text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ({x}_{t}+\eta {\tilde{A}}_{t}{x}_{t})))+2B(1-\tau ({x}_{t}+\eta {\tilde{A}}_{t}{x}_{t}))+t+1\right]\phantom{\rule{0ex}{0ex}}=\frac{2}{\eta \mathrm{\Delta}}\mathbf{E}[\text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ({x}_{t}+\eta {\tilde{A}}_{t}{x}_{t})))]+2B\mathbf{E}[1-\tau ({x}_{t}+\eta {\tilde{A}}_{t}{x}_{t})]+t+1\phantom{\rule{0ex}{0ex}}\le \frac{2}{\eta \mathrm{\Delta}}(\text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ({x}_{t})))-\eta \mathrm{\Delta}\tau ({x}_{t}))+2B(1-\tau ({x}_{t}))+t+1\phantom{\rule{0ex}{0ex}}={W}_{t}({x}_{t},\dots ,{x}_{0})-2\tau ({x}_{t})+1.$$

Since *x _{t}* ∉ *Z* implies τ(*x _{t}*) > 1/2, we have −2τ(*x _{t}*) + 1 < 0, and so

$$\mathbf{E}[{W}_{t+1}({x}_{t}+\eta {\tilde{A}}_{t}{x}_{t},{x}_{t},\dots ,{x}_{0})]\le {W}_{t}({x}_{t},\dots ,{x}_{0}).$$

And so (3) holds in all cases.

The second rate supermartingale property (4) holds because, if success hasn’t occurred by time *T* < *B*, then there are two possibilities: either the process hasn’t stopped yet, or it stopped at a timestep where *x _{t}* ∈ *Z*. In the former case,

$${W}_{T}({x}_{T},\dots ,{x}_{0})=\frac{2}{\eta \mathrm{\Delta}}\text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ({x}_{T})))+2B(1-\tau ({x}_{T}))+T\ge T.$$

In the latter case,

$${W}_{T}({x}_{T},\dots ,{x}_{0})=\frac{2}{\eta \mathrm{\Delta}}\text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ({x}_{T})))+2B(1-\tau ({x}_{T}))+T\phantom{\rule{0ex}{0ex}}\ge B.$$

Therefore (4) holds.

We have now shown that *W _{t}* is a rate supermartingale for Alecton. Next, we show that our bound on the initial value of the supermartingale holds. At time 0,

$${W}_{0}({x}_{0})=\frac{2}{\eta \mathrm{\Delta}}\text{log}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ({x}_{0})))+2B(1-\tau ({x}_{0}))\phantom{\rule{0ex}{0ex}}\le \frac{2}{\eta \mathrm{\Delta}}\text{log}\left({\gamma}^{-1}n{\epsilon}^{-1}\right)+2B(1-\tau ({x}_{0}))\phantom{\rule{0ex}{0ex}}=\frac{2}{\eta \mathrm{\Delta}}\text{log}\left(\frac{\mathit{\text{en}}}{\gamma \epsilon}\right)+2B(1-\tau ({x}_{0})).$$

Therefore, applying Lemma 6,

$$\mathbf{E}[{W}_{0}({x}_{0})]\le \frac{2}{\eta \mathrm{\Delta}}\text{log}\left(\frac{\mathit{\text{en}}}{\gamma \epsilon}\right)+2B\mathbf{E}[1-\tau ({x}_{0})]\phantom{\rule{0ex}{0ex}}\le \frac{2}{\eta \mathrm{\Delta}}\text{log}\left(\frac{\mathit{\text{en}}}{\gamma \epsilon}\right)+B\sqrt{2\pi \gamma}.$$

This is the value given in the lemma.

Now, we show that *W _{t}* is (*H, R*, ξ)-bounded. First, we compute the gradient of τ:

$$\nabla \tau (x)=\frac{2{u}_{1}{u}_{1}^{T}x((1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2})-2{({u}_{1}^{T}x)}^{2}((1-\gamma {n}^{-1}){u}_{1}{u}_{1}^{T}x+\gamma {n}^{-1}x)}{{((1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2})}^{2}}\phantom{\rule{0ex}{0ex}}=\frac{2{u}_{1}{u}_{1}^{T}x\gamma {n}^{-1}{\Vert x\Vert}^{2}-2{({u}_{1}^{T}x)}^{2}\gamma {n}^{-1}x}{{((1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2})}^{2}}\phantom{\rule{0ex}{0ex}}=2\gamma {n}^{-1}\frac{{u}_{1}{u}_{1}^{T}x{\Vert x\Vert}^{2}-x{({u}_{1}^{T}x)}^{2}}{{((1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2})}^{2}}.$$

Therefore,

$${\Vert \nabla \tau (x)\Vert}^{2}=4{\gamma}^{2}{n}^{-2}\frac{{({u}_{1}^{T}x)}^{2}{\Vert x\Vert}^{4}-{({u}_{1}^{T}x)}^{4}{\Vert x\Vert}^{2}}{{((1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2})}^{4}}\phantom{\rule{0ex}{0ex}}\le 4{\gamma}^{2}{n}^{-2}\frac{{\Vert x\Vert}^{4}-{({u}_{1}^{T}x)}^{2}{\Vert x\Vert}^{2}}{{((1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2})}^{3}}\phantom{\rule{0ex}{0ex}}\le 4\gamma {n}^{-1}\frac{{\Vert x\Vert}^{2}(1-\tau (x))}{{((1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2})}^{2}}\phantom{\rule{0ex}{0ex}}\le \frac{4(1-\tau (x))}{((1-\gamma {n}^{-1}){({u}_{1}^{T}x)}^{2}+\gamma {n}^{-1}{\Vert x\Vert}^{2})}\phantom{\rule{0ex}{0ex}}\le \frac{4n(1-\tau (x))}{\gamma {\Vert x\Vert}^{2}}.$$

Applying the assumption that ‖*x*‖^{2} ≥ 1,

$$\Vert \nabla \tau (x)\Vert \le \sqrt{\frac{4n(1-\tau (x))}{\gamma}}.$$
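The closed-form gradient of τ and this bound can be checked numerically. As an illustrative assumption we take *u*_{1} = *e*_{1}, and points with ‖*x*‖^{2} < 1 are skipped to match the assumption used above.

```python
import math, random

def tau_and_grad(x, gamma):
    """tau(x) and its gradient for u1 = e_1, using the closed form
    grad tau = 2*(gamma/n)*(u1 (u1^T x)*||x||^2 - x*(u1^T x)^2) / D^2."""
    n = len(x)
    a = x[0]  # u1^T x with u1 = e_1
    nx2 = sum(xi * xi for xi in x)
    D = (1 - gamma / n) * a * a + (gamma / n) * nx2
    t = a * a / D
    g = [2 * (gamma / n) * (-xi * a * a) / (D * D) for xi in x]
    g[0] += 2 * (gamma / n) * a * nx2 / (D * D)
    return t, g

random.seed(3)
n, gamma = 8, 0.25
worst_ratio = 0.0
for _ in range(500):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    if sum(xi * xi for xi in x) < 1.0:
        continue  # the bound assumes ||x||^2 >= 1
    t, g = tau_and_grad(x, gamma)
    grad_norm = math.sqrt(sum(gi * gi for gi in g))
    bound = math.sqrt(4 * n * (1 - t) / gamma)
    worst_ratio = max(worst_ratio, grad_norm / bound)
```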

Now, differentiating *W _{t}* with respect to τ produces

$$\frac{dW}{d\tau}=-\frac{2n}{\eta \gamma \epsilon \mathrm{\Delta}}\stackrel{\prime}{\text{log}}\phantom{\rule{thinmathspace}{0ex}}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ))-2B.$$

So, it follows that

$$\Vert {\nabla}_{x}{W}_{t}(x,{x}_{t-1},\dots ,{x}_{0})\Vert \le \left|\frac{dW}{d\tau}\right|\Vert \nabla \tau (x)\Vert \phantom{\rule{0ex}{0ex}}\le \left(\frac{2n}{\eta \gamma \epsilon \mathrm{\Delta}}\stackrel{\prime}{\text{log}}\phantom{\rule{thinmathspace}{0ex}}({\gamma}^{-1}n{\epsilon}^{-1}(1-\tau ))+2B\right)\sqrt{\frac{4n(1-\tau (x))}{\gamma}}.$$

Applying our assumption that ηγεΔ*B* ≤ 1, it is clear that this expression is maximized when γ^{−1}*n*ε^{−1}(1 − τ) = 1. Therefore,

$$\Vert {\nabla}_{x}{W}_{t}(x,{x}_{t-1},\dots ,{x}_{0})\Vert \le \left(\frac{2n}{\eta \gamma \epsilon \mathrm{\Delta}}+2B\right)2\sqrt{\epsilon}\phantom{\rule{0ex}{0ex}}\le \frac{8n}{\eta \gamma \mathrm{\Delta}\sqrt{\epsilon}},$$

which is our given value for *H*.

Next, we give the *R* bound. For Alecton, we have

$$\tilde{G}(x)=\eta \tilde{A}x=\eta {n}^{2}{e}_{i}{e}_{i}^{T}A{e}_{j}{e}_{j}^{T}x.$$

Therefore,

$$\mathbf{E}[\Vert \tilde{G}(x)-\tilde{G}(y)\Vert ]=\eta {n}^{2}\mathbf{E}[\Vert {e}_{i}{e}_{i}^{T}A{e}_{j}{e}_{j}^{T}(x-y)\Vert ]\phantom{\rule{0ex}{0ex}}=\eta {n}^{2}\mathbf{E}[|{e}_{i}^{T}A{e}_{j}{e}_{j}^{T}(x-y)|]\phantom{\rule{0ex}{0ex}}=\eta \sum _{i=1}^{n}\sum _{j=1}^{n}|{e}_{i}^{T}A{e}_{j}||{e}_{j}^{T}(x-y)|\phantom{\rule{0ex}{0ex}}=\eta \sum _{j=1}^{n}|{e}_{j}^{T}(x-y)|\left(\sum _{i=1}^{n}|{e}_{i}^{T}A{e}_{j}|\right)\phantom{\rule{0ex}{0ex}}\le \eta \sum _{j=1}^{n}|{e}_{j}^{T}(x-y)|\sqrt{n}{\left(\sum _{i=1}^{n}{({e}_{i}^{T}A{e}_{j})}^{2}\right)}^{\frac{1}{2}}\phantom{\rule{0ex}{0ex}}=\eta \sum _{j=1}^{n}|{e}_{j}^{T}(x-y)|\sqrt{n}{({e}_{j}^{T}{A}^{2}{e}_{j})}^{\frac{1}{2}}\phantom{\rule{0ex}{0ex}}=\eta \sum _{j=1}^{n}|{e}_{j}^{T}(x-y)|\sqrt{n}{\left(\sum _{k=1}^{n}{\lambda}_{k}^{2}{({u}_{k}^{T}{e}_{j})}^{2}\right)}^{\frac{1}{2}}.$$

Applying the incoherence bound,

$$\mathbf{E}[\Vert \tilde{G}(x)-\tilde{G}(y)\Vert ]\le \eta \sum _{j=1}^{n}|{e}_{j}^{T}(x-y)|\sqrt{n}{\left(\sum _{k=1}^{n}{\lambda}_{k}^{2}{\mu}^{2}{n}^{-1}\right)}^{\frac{1}{2}}\phantom{\rule{0ex}{0ex}}=\eta \sum _{j=1}^{n}|{e}_{j}^{T}(x-y)|\sqrt{n}{\left({\mu}^{2}{n}^{-1}{\Vert A\Vert}_{F}^{2}\right)}^{\frac{1}{2}}\phantom{\rule{0ex}{0ex}}=\eta \sum _{j=1}^{n}|{e}_{j}^{T}(x-y)|\mu {\Vert A\Vert}_{F}\phantom{\rule{0ex}{0ex}}=\eta \mu {\Vert A\Vert}_{F}{\Vert x-y\Vert}_{1}.$$

This agrees with our assignment of *R*= ημ‖*A*‖_{F}.
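The sampled matrix $\tilde{A}={n}^{2}{e}_{i}{e}_{i}^{T}A{e}_{j}{e}_{j}^{T}$ used in this update keeps a single scaled entry of *A*, and averaging over all (*i, j*) choices recovers *A* exactly; a small exhaustive check with an arbitrary 4 × 4 matrix:

```python
# Entrywise sampling sketch: averaging the estimator over the uniform
# choice of (i, j) reproduces A, i.e., E[A_tilde] = A.
n = 4
A = [[(i + 1) * 0.3 - (j + 1) * 0.1 for j in range(n)] for i in range(n)]

def sampled_estimator(i, j):
    """A_tilde for the pair (i, j): n^2 * A[i][j] at (i, j), zeros elsewhere."""
    E = [[0.0] * n for _ in range(n)]
    E[i][j] = n * n * A[i][j]
    return E

mean = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        E = sampled_estimator(i, j)
        for r in range(n):
            for s in range(n):
                mean[r][s] += E[r][s] / (n * n)
```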

Finally, we give our ξ bound on the magnitude of the updates. By the same argument as above, we will have

$$\mathbf{E}[\Vert \tilde{G}(x)\Vert ]=\eta {n}^{2}\mathbf{E}[\Vert {e}_{i}{e}_{i}^{T}A{e}_{j}{e}_{j}^{T}x\Vert ]\le \eta \mu {\Vert A\Vert}_{F}{\Vert x\Vert}_{1}.$$

Applying the assumption that ${\Vert x\Vert}_{1}^{2}\le {C}^{2}$ produces the bound given in the lemma, ξ = ημ‖*A*‖_{F} *C*.

This completes the proof of the lemma.

Next, we prove the corollary that gives a bound on the failure probability of asynchronous Alecton.

By Theorem 1, we know that for the constants defined in Lemma 2,

$$P({F}_{T})\le \frac{\mathbf{E}[{W}_{0}({x}_{0})]}{(1-HR\xi \tau )T}.$$

If we choose *B* = *T* for the horizon in Lemma 2, and substitute in the given constants,

$$P({F}_{T})\le \left(\frac{2}{\eta \mathrm{\Delta}}\text{log}\left(\frac{en}{\gamma \epsilon}\right)+T\sqrt{2\pi \gamma}\right){\left(1-\left(\frac{8n}{\eta \gamma \mathrm{\Delta}\sqrt{\epsilon}}\right)(\eta \mu {\Vert A\Vert}_{F})(\eta \mu {\Vert A\Vert}_{F}C)\tau \right)}^{-1}{T}^{-1}\phantom{\rule{0ex}{0ex}}=\left(\frac{2}{\eta \mathrm{\Delta}T}\text{log}\left(\frac{en}{\gamma \epsilon}\right)+\sqrt{2\pi \gamma}\right){\left(1-\frac{8\eta n{\mu}^{2}{\Vert A\Vert}_{F}^{2}C\tau}{\gamma \mathrm{\Delta}\sqrt{\epsilon}}\right)}^{-1}.$$

Now, for the given value of η, we will have

$$\frac{8\eta n{\mu}^{2}{\Vert A\Vert}_{F}^{2}C\tau}{\gamma \mathrm{\Delta}\sqrt{\epsilon}}=\frac{\mathrm{\Delta}\epsilon \gamma \vartheta}{2n{\mu}^{4}{\Vert A\Vert}_{F}^{2}}\frac{8n{\mu}^{2}{\Vert A\Vert}_{F}^{2}C\tau}{\gamma \mathrm{\Delta}\sqrt{\epsilon}}\phantom{\rule{0ex}{0ex}}=\frac{4C\vartheta \tau \sqrt{\epsilon}}{{\mu}^{2}}.$$

Also, for the given values of η and *T*, we will have

$$\frac{2}{\eta \mathrm{\Delta}T}\text{log}\left(\frac{en}{\gamma \epsilon}\right)=\frac{2n{\mu}^{4}{\Vert A\Vert}_{F}^{2}}{\mathrm{\Delta}\epsilon \gamma \vartheta}\frac{{\mathrm{\Delta}}^{2}\epsilon \gamma \vartheta \sqrt{2\pi \gamma}}{4n{\mu}^{4}{\Vert A\Vert}_{F}^{2}}\frac{2}{\mathrm{\Delta}}\phantom{\rule{0ex}{0ex}}=\sqrt{2\pi \gamma}.$$

Substituting these results in produces

$$P({F}_{T})\le \sqrt{8\pi \gamma}{(1-\frac{4C\vartheta \tau \sqrt{\epsilon}}{{\mu}^{2}})}^{-1}\phantom{\rule{0ex}{0ex}}=\frac{\sqrt{8\pi \gamma}{\mu}^{2}}{{\mu}^{2}-4C\vartheta \tau \sqrt{\epsilon}},$$

which is the desired result.
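The two substitutions used in this corollary proof can be sanity-checked numerically with arbitrary positive constants (illustrative values only, with `theta` standing in for ϑ):

```python
import math

# Arbitrary positive constants for checking the two substitutions above.
n, mu, A_F, C, tau = 50, 1.5, 2.0, 3.0, 4.0
Delta, eps, gamma, theta = 0.2, 1e-2, 1e-3, 0.5

eta = Delta * eps * gamma * theta / (2 * n * mu**4 * A_F**2)
T = (2 / (eta * Delta)) * math.log(math.e * n / (gamma * eps)) / math.sqrt(2 * math.pi * gamma)

# First substitution: the H*R*xi*tau product simplifies to 4*C*theta*tau*sqrt(eps)/mu^2.
lhs1 = 8 * eta * n * mu**2 * A_F**2 * C * tau / (gamma * Delta * math.sqrt(eps))
rhs1 = 4 * C * theta * tau * math.sqrt(eps) / mu**2

# Second substitution: the initial-value term reduces to sqrt(2*pi*gamma).
lhs2 = (2 / (eta * Delta * T)) * math.log(math.e * n / (gamma * eps))
rhs2 = math.sqrt(2 * math.pi * gamma)
```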

In this section, we provide a simplified proof for a result similar to our main result that only works in the convex case. This proof does not use any martingale results, and can therefore be considered more elementary than the proofs given above; however, it does not generalize to the non-convex case.

*Under the conditions given in Section 3.1, for any* ε > 0, *if for some* ϑ ∈ (0, 1) *we choose constant step size*

$$\alpha =\frac{c\vartheta \epsilon}{2LM\tau \sqrt{\epsilon}+{M}^{2}},$$

*then there exists a timestep*

$$T\le \frac{2LM\tau \sqrt{\epsilon}+{M}^{2}}{{c}^{2}\vartheta \epsilon}\text{log}\left(\frac{{\Vert {x}_{0}-{x}^{*}\Vert}^{2}}{\epsilon}\right)$$

*such that*

$$\mathbf{E}\left[{\Vert {x}_{T}-{x}^{*}\Vert}^{2}\right]\le \epsilon .$$

Our goal is to bound the square-distance to the optimum by showing that it generally decreases at each timestep. We can show algebraically that

$${\Vert {x}_{t+1}-{x}^{*}\Vert}^{2}={\Vert {x}_{t}-{x}^{*}\Vert}^{2}-2\alpha {({x}_{t}-{x}^{*})}^{T}\nabla {\tilde{f}}_{t}({x}_{t})+2\alpha {({x}_{t}-{x}^{*})}^{T}(\nabla {\tilde{f}}_{t}({x}_{t})-\nabla {\tilde{f}}_{t}({\tilde{\upsilon}}_{t}))+{\alpha}^{2}{\Vert \nabla {\tilde{f}}_{t}({\tilde{\upsilon}}_{t})\Vert}^{2}.$$

We can think of these terms as representing respectively: the current square-distance, the first-order change, the noise due to delayed updates, and the noise due to random sampling. Taking the expected value given ${\mathcal{F}}_{t}$ and ${\tilde{\upsilon}}_{t}$, and applying Cauchy-Schwarz, (12), (13), and (14), produces

$$\mathbf{E}[{\Vert {x}_{t+1}-{x}^{*}\Vert}^{2}|{\mathcal{F}}_{t},{\tilde{\upsilon}}_{t}]\le {\Vert {x}_{t}-{x}^{*}\Vert}^{2}-2\alpha c{\Vert {x}_{t}-{x}^{*}\Vert}^{2}+2\alpha L\Vert {x}_{t}-{x}^{*}\Vert {\Vert {x}_{t}-{\tilde{\upsilon}}_{t}\Vert}_{1}+{\alpha}^{2}{M}^{2}\phantom{\rule{0ex}{0ex}}=(1-2\alpha c){\Vert {x}_{t}-{x}^{*}\Vert}^{2}+{\alpha}^{2}{M}^{2}+2\alpha L\Vert {x}_{t}-{x}^{*}\Vert \sum _{i=1}^{n}|{e}_{i}^{T}{x}_{t}-{e}_{i}^{T}{x}_{t-{\tilde{\tau}}_{i,t}}|\phantom{\rule{0ex}{0ex}}\le (1-2\alpha c){\Vert {x}_{t}-{x}^{*}\Vert}^{2}+{\alpha}^{2}{M}^{2}+2\alpha L\Vert {x}_{t}-{x}^{*}\Vert \sum _{i=1}^{n}\sum _{k=1}^{{\tilde{\tau}}_{i,t}}|{e}_{i}^{T}{x}_{t-k+1}-{e}_{i}^{T}{x}_{t-k}|.$$

We can now take the full expected value given the filtration, which produces

$$\mathbf{E}[{\Vert {x}_{t+1}-{x}^{*}\Vert}^{2}|{\mathcal{F}}_{t}]\le (1-2\alpha c){\Vert {x}_{t}-{x}^{*}\Vert}^{2}+{\alpha}^{2}{M}^{2}+2\alpha L\Vert {x}_{t}-{x}^{*}\Vert \sum _{i=1}^{n}\sum _{k=1}^{\infty}P({\tilde{\tau}}_{i,t}\ge k)|{e}_{i}^{T}{x}_{t-k+1}-{e}_{i}^{T}{x}_{t-k}|.$$

Applying (7) results in

$$\mathbf{E}[{\Vert {x}_{t+1}-{x}^{*}\Vert}^{2}|{\mathcal{F}}_{t}]\le (1-2\alpha c){\Vert {x}_{t}-{x}^{*}\Vert}^{2}+{\alpha}^{2}{M}^{2}+2\alpha L\Vert {x}_{t}-{x}^{*}\Vert \sum _{i=1}^{n}\sum _{k=1}^{\infty}P(\tilde{\tau}\ge k)|{e}_{i}^{T}{x}_{t-k+1}-{e}_{i}^{T}{x}_{t-k}|\phantom{\rule{0ex}{0ex}}=(1-2\alpha c){\Vert {x}_{t}-{x}^{*}\Vert}^{2}+{\alpha}^{2}{M}^{2}+2\alpha L\Vert {x}_{t}-{x}^{*}\Vert \sum _{k=1}^{\infty}P(\tilde{\tau}\ge k){\Vert {x}_{t-k+1}-{x}_{t-k}\Vert}_{1},$$

and, since at most one entry of *x* changes at each iteration, the 1-norm equals the 2-norm, so

$$\mathbf{E}[{\Vert {x}_{t+1}-{x}^{*}\Vert}^{2}|{\mathcal{F}}_{t}]\le (1-2\alpha c){\Vert {x}_{t}-{x}^{*}\Vert}^{2}+{\alpha}^{2}{M}^{2}+2\alpha L\sum _{k=1}^{\infty}P(\tilde{\tau}\ge k)\Vert {x}_{t}-{x}^{*}\Vert \Vert {x}_{t-k+1}-{x}_{t-k}\Vert .$$

Finally, taking the full expected value, and applying Cauchy-Schwarz again,

$$\mathbf{E}\left[{\Vert {x}_{t+1}-{x}^{*}\Vert}^{2}\right]\le (1-2\alpha c)\mathbf{E}\left[{\Vert {x}_{t}-{x}^{*}\Vert}^{2}\right]+{\alpha}^{2}{M}^{2}\phantom{\rule{0ex}{0ex}}+2\alpha L\sum _{k=1}^{\infty}P(\tilde{\tau}\ge k)\sqrt{\mathbf{E}\left[{\Vert {x}_{t}-{x}^{*}\Vert}^{2}\right]\mathbf{E}\left[{\Vert {x}_{t-k+1}-{x}_{t-k}\Vert}^{2}\right]}.$$

Noticing that, from (14),

$$\mathbf{E}\left[{\Vert {x}_{t-k+1}-{x}_{t-k}\Vert}^{2}\right]=\mathbf{E}\left[{\Vert \alpha \nabla {\tilde{f}}_{t-k}({\tilde{\upsilon}}_{t-k})\Vert}^{2}\right]\le {\alpha}^{2}{M}^{2},$$

if we let *J _{t}* = **E**[‖*x _{t}* − *x**‖^{2}], then

$${J}_{t+1}\le (1-2\alpha c){J}_{t}+{\alpha}^{2}{M}^{2}+2{\alpha}^{2}LM\sum _{k=1}^{\infty}P(\tilde{\tau}\ge k)\sqrt{{J}_{t}}\phantom{\rule{0ex}{0ex}}=(1-2\alpha c){J}_{t}+{\alpha}^{2}{M}^{2}+2{\alpha}^{2}LM\tau \sqrt{{J}_{t}}.$$

For any ε > 0, as long as *J _{t}* > ε,

$$\text{log}{J}_{t+1}\le \text{log}{J}_{t}+\text{log}(1-2\alpha c+{\alpha}^{2}{M}^{2}{\epsilon}^{-1}+2{\alpha}^{2}LM\tau {\epsilon}^{-\frac{1}{2}})\phantom{\rule{0ex}{0ex}}\le \text{log}{J}_{t}-2\alpha c+{\alpha}^{2}{M}^{2}{\epsilon}^{-1}+2{\alpha}^{2}LM\tau {\epsilon}^{-\frac{1}{2}}.$$

If we substitute the value of α chosen in the theorem statement, then

$$\text{log}{J}_{t+1}\le \text{log}{J}_{t}-\frac{{c}^{2}\vartheta \epsilon}{2LM\tau \sqrt{\epsilon}+{M}^{2}}.$$

Therefore, for any *T*, if *J _{t}* > ε for all *t* ≤ *T*, it must be that

$$T<\frac{2LM\tau \sqrt{\epsilon}+{M}^{2}}{{c}^{2}\vartheta \epsilon}\text{log}\left(\frac{{J}_{0}}{{J}_{T}}\right),$$

which proves the theorem.
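The recurrence on *J _{t}* derived above can be iterated directly, with the step size chosen as in the theorem statement; the constants are illustrative, and the number of steps needed to reach the noise floor is checked against the theorem's bound.

```python
import math

# Illustrative constants; alpha and the time bound follow the theorem,
# with theta standing in for the symbol vartheta.
c, L, M, tau, eps, theta = 1.0, 2.0, 1.5, 3.0, 1e-3, 0.5
alpha = c * theta * eps / (2 * L * M * tau * math.sqrt(eps) + M**2)

J = 4.0  # J_0 = ||x_0 - x*||^2
T_bound = (2 * L * M * tau * math.sqrt(eps) + M**2) / (c**2 * theta * eps) * math.log(J / eps)

steps = 0
while J > eps:  # iterate the worst-case recurrence until it crosses eps
    J = (1 - 2 * alpha * c) * J + alpha**2 * M**2 \
        + 2 * alpha**2 * L * M * tau * math.sqrt(J)
    steps += 1
```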

1. Bottou Léon. COMPSTAT'2010. Springer; 2010. Large-scale machine learning with stochastic gradient descent; pp. 177–186.

2. Bottou Léon. Neural Networks: Tricks of the Trade. Springer; 2012. Stochastic gradient descent tricks; pp. 421–436.

3. Bottou Léon, Bousquet Olivier. The tradeoffs of large scale learning. In: Platt JC, Koller D, Singer Y, Roweis S, editors. NIPS. Vol. 20. NIPS Foundation; 2008. pp. 161–168.

4. De Sa Christopher, Olukotun Kunle, Ré Christopher. Global convergence of stochastic gradient descent for some non-convex matrix problems. ICML. 2015

5. Duchi John C, Bartlett Peter L, Wainwright Martin J. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization. 2012;22(2):674–701.

6. Fercoq Olivier, Richtárik Peter. Accelerated, parallel and proximal coordinate descent. arXiv preprint arXiv:1312.5799. 2013

7. Fleming Thomas R, Harrington David P. Counting processes and survival analysis. Vol. 169. John Wiley & Sons; 1991. pp. 56–57.

8. Gupta Pankaj, Goel Ashish, Lin Jimmy, Sharma Aneesh, Wang Dong, Zadeh Reza. WTF: The who to follow service at twitter. WWW’ 13. 2013:505–514.

9. Gupta Suyog, Agrawal Ankur, Gopalakrishnan Kailash, Narayanan Pritish. Deep learning with limited numerical precision. ICML. 2015

10. Jain Prateek, Netrapalli Praneeth, Sanghavi Sujay. STOC. ACM; 2013. Low-rank matrix completion using alternating minimization; pp. 665–674.

11. Johansson Björn, Rabi Maben, Johansson Mikael. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization. 2009;20(3):1157–1170.

12. Konecný Jakub, Qu Zheng, Richtárik Peter. S2cd: Semi-stochastic coordinate descent. NIPS Optimization in Machine Learning workshop. 2014

13. Le Cun Yann, Bottou Léon, Orr Genevieve B, Müller Klaus-Robert. Efficient backprop. Neural Networks: Tricks of the Trade. Springer; 1998

14. Liu Ji, Wright Stephen J. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIOPT. 2015;25(1):351–376.

15. Liu Ji, Wright Stephen J, Ré Christopher, Bittorf Victor, Sridhar Srikrishna. An asynchronous parallel stochastic coordinate descent algorithm. JMLR. 2015;16:285–322.

16. Mitliagkas Ioannis, Borokhovich Michael, Dimakis Alexandros G, Caramanis Constantine. FrogWild!: Fast PageRank approximations on graph engines. PVLDB. 2015

17. Niu Feng, Recht Benjamin, Ré Christopher, Wright Stephen. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. NIPS. 2011:693–701.

18. Noel Cyprien, Osindero Simon. Dogwild!: Distributed Hogwild for CPU & GPU. 2014

19. Puthiya Parambath Shameem Ahamed. Matrix factorization methods for recommender systems. 2013

20. Rakhlin Alexander, Shamir Ohad, Sridharan Karthik. Making gradient descent optimal for strongly convex stochastic optimization. ICML. 2012

21. Richtárik Peter, Takáč Martin. Parallel coordinate descent methods for big data optimization. Mathematical Programming. 2012:1–52.

22. Tao Qing, Kong Kang, Chu Dejun, Wu Gaowei. Machine Learning and Knowledge Discovery in Databases. Springer; 2012. Stochastic coordinate descent methods for regularized smooth and nonsmooth losses; pp. 537–552.

23. Tappenden Rachael, Takáč Martin, Richtárik Peter. On the complexity of parallel coordinate descent. arXiv preprint arXiv:1503.03033. 2015

24. Yu Hsiang-Fu, Hsieh Cho-Jui, Si Si, Dhillon Inderjit S. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. ICDM. 2012:765–774.

25. Zhang Ce, Ré Christopher. DimmWitted: A study of main-memory statistical analytics. PVLDB. 2014
