Int J Pure Appl Math. Author manuscript; available in PMC 2010 May 25.
Published in final edited form as:
Int J Pure Appl Math. 2008 January 1; 46(1): 19–36.
PMCID: PMC2875694
NIHMSID: NIHMS152176

General Algorithmic Frameworks for Online Problems

Abstract

We study general algorithmic frameworks for online learning tasks. These include binary classification, regression, multiclass problems and cost-sensitive multiclass classification. The theorems that we present give loss bounds on the behavior of our algorithms which depend on general conditions on the iterative step sizes.

Keywords: Online learning, general algorithms, classification, regression, multiclass, convex feasibility

1 Introduction

Online learning algorithms for various prediction tasks differ fundamentally from batch learning algorithms. The online learning process assumes that the instances that need to be classified and their correct labels are not all available to the algorithm at the start of training, but rather that they are unveiled sequentially. Moreover, the algorithm starts to offer predicted labels from its exposure to the first instance-label pair, and it subsequently learns and makes further predictions simultaneously. A central theoretical task in the analysis of online learning algorithms is to provide tight bounds on their performance. Very often, these algorithms can be analyzed and shown to work quite well even when no statistical assumptions of any kind are made about the process producing the observed data. Many of the algorithms and methods of analysis used in this area can trace their roots to the work of Littlestone, Vovk and Warmuth, see [6, 7, 8]. Inspired and influenced by [4], we formulate general sets of conditions under which many algorithmic variants of online passive-aggressive algorithms can be analyzed. More precisely, we answer the following question: under what conditions on the choice of the iterative step sizes can one obtain results analogous to those of [4]?

Thus, our work contributes to the development of new analytical frameworks that advance theoretical studies of practical learning methods. All the algorithms of [4] can be obtained as special cases of our algorithmic framework, but the framework is wide enough to encompass many more variants. Our way of looking at the subject will lead to additional developments of a similar nature. In particular, there are many links between online learning algorithms and projection algorithms for solving convex feasibility problems, see, e.g., [1, 2, 5, 3], which can lead to new studies of the latter that will concentrate on providing tight bounds on their performance as online algorithms, rather than on their asymptotic convergence.

We structured the paper so that the order of the sections closely follows that of [4], successively handling binary classification (Section 2), regression (Section 3), multiclass problems (Section 4) and cost-sensitive multiclass classification (Section 5).

2 Binary classification

We denote the instance presented to the algorithm on round t by $x_t \in \mathbb{R}^n$, where $\mathbb{R}^n$ is the n-dimensional Euclidean space. We assume that $x_t$ is associated with a unique label $y_t \in \{+1, -1\}$ and refer to each instance-label pair $(x_t, y_t)$ as an example. The algorithms discussed in this paper make predictions using a classification function. We restrict our discussion to classification functions that are based on a vector of weights $w \in \mathbb{R}^n$, and which take the form $\operatorname{sign}\langle w, x\rangle$. We denote by $w_t$ the weight vector used by the algorithm on round t, and refer to the expression $y_t\langle w_t, x_t\rangle$ as the (signed) margin attained on round t. Whenever $\operatorname{sign}\langle w_t, x_t\rangle = y_t$ the algorithm has made a correct prediction. The loss is defined by the following hinge-loss function:

$$\ell(w;(x,y)) \equiv \begin{cases} 0, & \text{if } y\langle w, x\rangle \ge 1,\\ 1 - y\langle w, x\rangle, & \text{otherwise,} \end{cases} \tag{2.1}$$

and, clearly,

$$\ell(w;(x,y)) = \max\{0,\, 1 - y\langle w, x\rangle\}. \tag{2.2}$$

We assume henceforth that for any number c > 0, c/0 := +∞.
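
To make the hinge loss concrete, here is a minimal Python sketch (ours, not part of the paper; it assumes NumPy and the function name is illustrative) that evaluates (2.2) for a given weight vector and example.

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss (2.2): ell(w;(x,y)) = max{0, 1 - y<w,x>}."""
    return max(0.0, 1.0 - y * float(np.dot(w, x)))

# A prediction with signed margin y<w,x> >= 1 incurs zero loss; otherwise the
# loss grows linearly as the margin drops below 1.
w = np.array([0.5, -1.0])
print(hinge_loss(w, np.array([2.0, 0.0]), +1))   # margin 1.0  -> loss 0.0
print(hinge_loss(w, np.array([1.0, 1.0]), +1))   # margin -0.5 -> loss 1.5
```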

Algorithm 2.1 General Online Passive-Aggressive Algorithmic Framework for Binary Classification

Initialization: Set w1 = (0, 0, …, 0) and choose parameters γ1 and a sufficiently small κ > 0 such that

$$0 < \gamma_1 < 2. \tag{2.3}$$

Iterative step: (1) Given the weight wt and receiving the instance xt, predict:

$$\hat{y}_t = \operatorname{sign}\langle w_t, x_t\rangle. \tag{2.4}$$

(2) Receive the correct label $y_t \in \{+1, -1\}$ and calculate the loss $\ell_t = \ell(w_t;(x_t, y_t))$.

(3) Choose a nonnegative parameter τt for which

$$\tau_t \le \gamma_1 \ell_t \|x_t\|^{-2} \quad\text{and}\quad \text{if } \ell_t \ge 1 \text{ then } \tau_t \ge \kappa. \tag{2.5}$$

(4) Update:

$$w_{t+1} = w_t + \tau_t y_t x_t. \tag{2.6}$$
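
As an illustration of Algorithm 2.1, the following Python sketch (ours; the helper names run_algorithm_2_1 and choose_tau are hypothetical) runs one pass over a stream of examples. Any rule returning a nonnegative step size that satisfies condition (2.5) can be plugged in; the (PA) rule of [4], discussed in Section 2.1, is used in the usage example.

```python
import numpy as np

def hinge_loss(w, x, y):
    # Hinge loss (2.2).
    return max(0.0, 1.0 - y * float(np.dot(w, x)))

def run_algorithm_2_1(examples, choose_tau, dim):
    """One pass of Algorithm 2.1 over (x_t, y_t) pairs with y_t in {+1, -1}.

    choose_tau(loss, x) must return a nonnegative tau_t satisfying (2.5)
    for the loss bounds of Section 2 to apply.
    """
    w = np.zeros(dim)                       # initialization: w_1 = (0, ..., 0)
    mistakes = 0
    for x, y in examples:
        y_hat = np.sign(np.dot(w, x))       # prediction (2.4)
        if y_hat != y:
            mistakes += 1
        loss = hinge_loss(w, x, y)          # step (2): instantaneous loss
        tau = choose_tau(loss, x)           # step (3): step size obeying (2.5)
        w = w + tau * y * x                 # step (4): update (2.6)
    return w, mistakes

# Usage with the step size tau_t = loss_t / ||x_t||^2 (the (PA) rule of [4]).
pa_rule = lambda loss, x: loss / float(np.dot(x, x)) if loss > 0 else 0.0
data = [(np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]
print(run_algorithm_2_1(data, pa_rule, dim=2))
```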

We now turn to the analysis of our algorithmic framework. For any set E, denote by card(E) its cardinality. As before, we denote by $\ell_t$ the instantaneous loss suffered by Algorithm 2.1 on round t. In addition, we denote by $\hat{\ell}_t$ the loss suffered by an arbitrary fixed predictor to which we are comparing our performance. Formally, let u be an arbitrary vector in $\mathbb{R}^n$, and denote

$$\ell_t = \ell(w_t;(x_t, y_t)) \quad\text{and}\quad \hat{\ell}_t = \ell(u;(x_t, y_t)). \tag{2.7}$$

For any natural number t, define

$$\Delta_t \equiv \|w_t - u\|^2 - \|w_{t+1} - u\|^2. \tag{2.8}$$

Lemma 2.2 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples where $x_t \in \mathbb{R}^n$ and $y_t \in \{+1, -1\}$ for all t. Let $\tau_t$ satisfy (2.5) for all t. Then

$$\sum_{t=1}^{T} \tau_t\left(2\ell_t - \tau_t\|x_t\|^2 - 2\hat{\ell}_t\right) \le \|u\|^2. \tag{2.9}$$

Proof. Clearly,

$$\sum_{t=1}^{T}\Delta_t = \sum_{t=1}^{T}\left(\|w_t - u\|^2 - \|w_{t+1} - u\|^2\right) = \|w_1 - u\|^2 - \|w_{T+1} - u\|^2, \tag{2.10}$$

and hence,

$$\sum_{t=1}^{T}\Delta_t \le \|u\|^2. \tag{2.11}$$

By (2.6) and (2.8) we have, for t = 1, 2, …, T,

$$\Delta_t = \|w_t - u\|^2 - \|w_{t+1} - u\|^2 = \|w_t - u\|^2 - \|w_t - u + \tau_t y_t x_t\|^2 = \|w_t - u\|^2 - \left(\|w_t - u\|^2 + 2\tau_t y_t\langle w_t - u, x_t\rangle + \tau_t^2\|x_t\|^2\right) = -2\tau_t y_t\langle w_t - u, x_t\rangle - \tau_t^2\|x_t\|^2. \tag{2.12}$$

By (2.1), (2.5), (2.6) and (2.8) we also have for t = 1, 2, …, T, that

$$\text{if } \ell_t = 0, \text{ then } \tau_t = 0 \text{ and } \Delta_t = 0. \tag{2.13}$$

Assume that

$$t \in \{1, 2, \ldots, T\} \quad\text{and}\quad \ell_t > 0. \tag{2.14}$$

Applying (2.1), we get

$$\ell_t = 1 - y_t\langle w_t, x_t\rangle \quad\text{and}\quad \hat{\ell}_t \ge 1 - y_t\langle u, x_t\rangle. \tag{2.15}$$

By (2.12) and (2.15),

$$\Delta_t \ge 2\tau_t\left((1 - \hat{\ell}_t) - (1 - \ell_t)\right) - \tau_t^2\|x_t\|^2 = 2\tau_t(\ell_t - \hat{\ell}_t) - \tau_t^2\|x_t\|^2, \tag{2.16}$$

which, in view of (2.11), (2.14) and (2.13), yields

$$\|u\|^2 \ge \sum_{t=1}^{T}\Delta_t \ge \sum_{t=1}^{T}\tau_t\left(2\ell_t - \tau_t\|x_t\|^2 - 2\hat{\ell}_t\right), \tag{2.17}$$

proving the lemma.

Set

$$E_1 \equiv \left\{t \in \{1, 2, \ldots, T\} \mid \ell_t \ge 1\right\}. \tag{2.18}$$

Theorem 2.3 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples, where $x_t \in \mathbb{R}^n$ and $y_t \in \{+1, -1\}$ for all t and assume that $\tau_t$ satisfies (2.5) for all t. Assume that there exists a vector u such that $\hat{\ell}_t = 0$ for all t. Then $\mathrm{card}(E_1) \le \kappa^{-1}(2 - \gamma_1)^{-1}\|u\|^2$, i.e., the number of indices $t \in \{1, 2, \ldots, T\}$ for which $\ell_t \ge 1$ does not exceed $\kappa^{-1}(2 - \gamma_1)^{-1}\|u\|^2$.

Proof. By Lemma 2.2, (2.9) holds. Since $\hat{\ell}_t = 0$ for all t, we have $y_t\langle u, x_t\rangle \ge 1$, and hence $x_t \ne 0$, for all t. Therefore (2.9) implies that

$$\sum_{t=1}^{T}\tau_t\left(2\ell_t - \tau_t\|x_t\|^2\right) \le \|u\|^2. \tag{2.19}$$

In view of (2.5) and (2.19),

$$\|u\|^2 \ge \sum_{t=1}^{T}\|x_t\|^{-2}\left(2\ell_t\tau_t\|x_t\|^2 - \tau_t^2\|x_t\|^4\right) = \sum_{t=1}^{T}\left(2\ell_t\tau_t - \tau_t^2\|x_t\|^2\right) \ge \sum_{t=1}^{T}\left(2\ell_t\tau_t - \tau_t\gamma_1\ell_t\right) = \sum_{t=1}^{T}\tau_t\ell_t(2 - \gamma_1). \tag{2.20}$$

By (2.18), (2.20), (2.3) and (2.5),

$$\|u\|^2 \ge \sum_{t\in E_1}(2 - \gamma_1)\tau_t\ell_t \ge \sum_{t\in E_1}(2 - \gamma_1)\tau_t \ge \kappa(2 - \gamma_1)\,\mathrm{card}(E_1), \tag{2.21}$$

and the required result follows.

Theorem 2.4 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples, where $x_t \in \mathbb{R}^n$ and $y_t \in \{+1, -1\}$ for all t and assume that $\tau_t$ satisfies (2.5) for all t. Let $u \in \mathbb{R}^n$ and assume that there is a number c > 0 such that $\tau_t \le c$ for all t. Then

$$\mathrm{card}(E_1) \le \sum_{t\in E_1}\ell_t \le \kappa^{-1}(2 - \gamma_1)^{-1}\left(\|u\|^2 + 2\sum_{t=1}^{T}c\,\hat{\ell}_t\right). \tag{2.22}$$

Proof. By Lemma 2.2, (2.9) holds and implies that

$$\sum_{t=1}^{T}\tau_t\left(2\ell_t - \tau_t\|x_t\|^2\right) \le \|u\|^2 + \sum_{t=1}^{T}2\tau_t\hat{\ell}_t. \tag{2.23}$$

Together with (2.5) this implies that

$$\|u\|^2 + \sum_{t=1}^{T}2\tau_t\hat{\ell}_t \ge \sum_{t=1}^{T}\left(2\ell_t\tau_t - \tau_t\gamma_1\ell_t\right) = \sum_{t=1}^{T}\tau_t\ell_t(2 - \gamma_1) \ge (2 - \gamma_1)\sum_{t\in E_1}\tau_t\ell_t \ge (2 - \gamma_1)\kappa\sum_{t\in E_1}\ell_t. \tag{2.24}$$

Since $\tau_t \le c$ for all t, it follows from (2.24) that

$$\mathrm{card}(E_1) \le \sum_{t\in E_1}\ell_t \le \kappa^{-1}(2 - \gamma_1)^{-1}\left(\|u\|^2 + \sum_{t=1}^{T}2\tau_t\hat{\ell}_t\right) \le \kappa^{-1}(2 - \gamma_1)^{-1}\left(\|u\|^2 + 2\sum_{t=1}^{T}c\,\hat{\ell}_t\right), \tag{2.25}$$

which completes the proof.

2.1 Special cases

We show that all three variants ((PA), (PA-I) and (PA-II)) of the online passive-aggressive learning algorithm of Crammer et al. [4, Figure 1] are special cases of Algorithm 2.1 when the sequence {xt} is bounded. To see this, assume that there is an r0 > 0 such that

$$\|x_t\| \le r_0 \quad\text{for all integers } t \ge 1. \tag{2.26}$$

Consider the algorithmic variant (PA) of [4, Figure 1] with $\tau_t = \ell_t\|x_t\|^{-2}$. Clearly, the first half of (2.5) holds with $\gamma_1 = 1$. We show that its second half holds with $\kappa = r_0^{-2}$. Assume that $\ell_t \ge 1$. By definition,

$$\tau_t \ge \|x_t\|^{-2} \ge r_0^{-2}. \tag{2.27}$$

Thus, (PA) is indeed a particular case of Algorithm 2.1. Consider now the algorithmic variant (PA-I) of [4, Figure 1] with $\tau_t = \min\{C, \ell_t\|x_t\|^{-2}\}$. (Here C is a positive constant.) Clearly, the first half of (2.5) holds with $\gamma_1 = 1$. If $\ell_t \ge 1$, then

$$\tau_t \ge \min\{C, \|x_t\|^{-2}\} \ge \min\{C, r_0^{-2}\} \tag{2.28}$$

and the second half of (2.5) holds with $\kappa = \min\{C, r_0^{-2}\}$.

Next, consider the algorithmic variant (PA-II) of [4, Figure 1] with

$$\tau_t = \ell_t\left(\|x_t\|^2 + (2C)^{-1}\right)^{-1}. \tag{2.29}$$

Clearly, the first half of (2.5) holds with $\gamma_1 = 1$. If $\ell_t \ge 1$, then

$$\tau_t \ge \left(\|x_t\|^2 + (2C)^{-1}\right)^{-1} \ge \left(r_0^2 + (2C)^{-1}\right)^{-1} \tag{2.30}$$

and the second half of (2.5) holds with $\kappa = \left(r_0^2 + (2C)^{-1}\right)^{-1}$.
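
For concreteness, the three step-size rules above can be written as follows (a Python sketch under our naming; C denotes the aggressiveness parameter of [4], and each rule assumes $x_t \ne 0$).

```python
import numpy as np

def tau_pa(loss, x):
    """(PA): tau_t = loss_t / ||x_t||^2."""
    return loss / float(np.dot(x, x))

def tau_pa_i(loss, x, C=1.0):
    """(PA-I): tau_t = min{C, loss_t / ||x_t||^2}."""
    return min(C, loss / float(np.dot(x, x)))

def tau_pa_ii(loss, x, C=1.0):
    """(PA-II): tau_t = loss_t / (||x_t||^2 + 1/(2C))."""
    return loss / (float(np.dot(x, x)) + 1.0 / (2.0 * C))

# With ||x_t|| <= r0, each rule satisfies (2.5) with gamma_1 = 1 and, respectively,
# kappa = r0**-2, min{C, r0**-2} and (r0**2 + 1/(2*C))**-1, as shown above.
```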

3 Regression

Each instance $x_t$ is associated with a real target value $y_t \in \mathbb{R}$ which the online algorithm tries to predict. On every round, the algorithm receives an instance $x_t \in \mathbb{R}^n$ and predicts a target value $\hat{y}_t \in \mathbb{R}$ using its regression function $\hat{y}_t = \langle w_t, x_t\rangle$, where $w_t$ is the incrementally learned weight vector.

We use the ε-insensitive hinge-loss function

$$\ell_\varepsilon(w;(x,y)) \equiv \begin{cases} 0, & \text{if } |\langle w, x\rangle - y| \le \varepsilon,\\ |\langle w, x\rangle - y| - \varepsilon, & \text{otherwise,} \end{cases} \tag{3.1}$$

where ε is a positive parameter.
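
A minimal Python rendering of the ε-insensitive loss (3.1) (our illustration; the function name is hypothetical):

```python
import numpy as np

def eps_insensitive_loss(w, x, y, eps):
    """Loss (3.1): zero inside the eps-tube around y, linear outside it."""
    residual = abs(float(np.dot(w, x)) - y)
    return 0.0 if residual <= eps else residual - eps

print(eps_insensitive_loss(np.array([1.0, 2.0]), np.array([1.0, 1.0]), 3.2, eps=0.5))  # 0.0
print(eps_insensitive_loss(np.array([1.0, 2.0]), np.array([1.0, 1.0]), 4.0, eps=0.5))  # 0.5
```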

Algorithm 3.1 General Online Passive-Aggressive Algorithmic Framework for Regression

Initialization: Fix ε > 0. Set w1 = (0, 0, …, 0) and choose parameters γ1 and a sufficiently small κ > 0 such that

$$0 < \gamma_1 < 2. \tag{3.2}$$

Iterative step: (1) Given the weight wt and receiving the instance xt, predict:

$$\hat{y}_t = \langle w_t, x_t\rangle. \tag{3.3}$$

(2) Receive the correct label $y_t \in \mathbb{R}$ and calculate the loss $\ell_t = \ell_\varepsilon(w_t;(x_t, y_t))$.

(3) Choose a nonnegative parameter τt for which

$$\tau_t \le \gamma_1\ell_t\|x_t\|^{-2}, \quad\text{and if } \ell_t \ge \varepsilon \text{ then } \tau_t \ge \kappa. \tag{3.4}$$

(4) Update:

$$w_{t+1} = w_t + \operatorname{sign}(y_t - \hat{y}_t)\,\tau_t x_t. \tag{3.5}$$
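
One iterative step of Algorithm 3.1 may be sketched in Python as follows (ours; choose_tau is a placeholder for any rule satisfying (3.4)).

```python
import numpy as np

def regression_step(w, x, y, eps, choose_tau):
    """One round of Algorithm 3.1: predict, suffer loss (3.1), update by (3.5)."""
    y_hat = float(np.dot(w, x))                        # prediction (3.3)
    residual = abs(y_hat - y)
    loss = 0.0 if residual <= eps else residual - eps  # eps-insensitive loss (3.1)
    tau = choose_tau(loss, x)                          # must satisfy (3.4)
    w_next = w + np.sign(y - y_hat) * tau * x          # update (3.5)
    return w_next, loss

# Usage with the step size tau_t = loss_t / ||x_t||^2.
rule = lambda loss, x: loss / float(np.dot(x, x)) if loss > 0 else 0.0
print(regression_step(np.zeros(2), np.array([1.0, 1.0]), 2.0, 0.1, rule))
```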

Again we denote by $\hat{\ell}_t$ the loss suffered by an arbitrary fixed predictor to which we are comparing our performance. Formally, let u be an arbitrary vector in $\mathbb{R}^n$, and denote

$$\ell_t = \ell_\varepsilon(w_t;(x_t, y_t)) \quad\text{and}\quad \hat{\ell}_t = \ell_\varepsilon(u;(x_t, y_t)). \tag{3.6}$$

We also re-use the definition

$$\Delta_t \equiv \|w_t - u\|^2 - \|w_{t+1} - u\|^2. \tag{3.7}$$

Lemma 3.2 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples, where $x_t \in \mathbb{R}^n$ and $y_t \in \mathbb{R}$ for all t. Let $\tau_t$ satisfy (3.4) for all t. Then

$$\sum_{t=1}^{T}\tau_t\left(2\ell_t - \tau_t\|x_t\|^2 - 2\hat{\ell}_t\right) \le \|u\|^2. \tag{3.8}$$

Proof. By (3.7),

$$\sum_{t=1}^{T}\Delta_t = \|w_1 - u\|^2 - \|w_{T+1} - u\|^2 \le \|u\|^2. \tag{3.9}$$

Let $t \in \{1, \ldots, T\}$. By both (3.7) and (3.5),

$$\Delta_t = \|w_t - u\|^2 - \|w_t - u + \operatorname{sign}(y_t - \hat{y}_t)\,\tau_t x_t\|^2 = -\operatorname{sign}(y_t - \hat{y}_t)\,2\tau_t\langle w_t - u, x_t\rangle - \tau_t^2\|x_t\|^2. \tag{3.10}$$

We now add and subtract the term $\operatorname{sign}(y_t - \hat{y}_t)\,2\tau_t y_t$ on the right-hand side of the above equation to get

$$\Delta_t = -\operatorname{sign}(y_t - \hat{y}_t)\,2\tau_t\left(\langle w_t, x_t\rangle - y_t\right) + \operatorname{sign}(y_t - \hat{y}_t)\,2\tau_t\left(\langle u, x_t\rangle - y_t\right) - \tau_t^2\|x_t\|^2. \tag{3.11}$$

From (3.3) we obtain

$$-\operatorname{sign}(y_t - \hat{y}_t)\left(\langle w_t, x_t\rangle - y_t\right) = |\langle w_t, x_t\rangle - y_t|. \tag{3.12}$$

Assume that $\ell_t \ne 0$. We then get, by (3.1),

$$\ell_t = |\langle w_t, x_t\rangle - y_t| - \varepsilon. \tag{3.13}$$

By (3.11), (3.12), (3.13), (3.6) and (3.1),

$$\Delta_t \ge 2\tau_t(\ell_t + \varepsilon) + \operatorname{sign}(y_t - \hat{y}_t)\,2\tau_t\left(\langle u, x_t\rangle - y_t\right) - \tau_t^2\|x_t\|^2 \ge 2\tau_t(\ell_t + \varepsilon) - 2\tau_t(\hat{\ell}_t + \varepsilon) - \tau_t^2\|x_t\|^2 = \tau_t\left(2\ell_t - \tau_t\|x_t\|^2 - 2\hat{\ell}_t\right). \tag{3.14}$$

If $\ell_t = 0$, then (3.4) yields $\tau_t = 0$, so that, by (3.10), $\Delta_t = 0$ and the corresponding summand vanishes. When combined with (3.9), this implies that

$$\sum_{t=1}^{T}\tau_t\left(2\ell_t - \tau_t\|x_t\|^2 - 2\hat{\ell}_t\right) \le \sum_{t=1}^{T}\Delta_t \le \|u\|^2, \tag{3.15}$$

which completes the proof.

Now set

$$E_\varepsilon \equiv \left\{t \in \{1, 2, \ldots, T\} \mid \ell_t \ge \varepsilon\right\}. \tag{3.16}$$

Theorem 3.3 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples where $x_t \in \mathbb{R}^n$ and $y_t \in \mathbb{R}$ for all t, let the nonnegative parameters $\tau_t$ satisfy (3.4) for all t, and let ε > 0 be fixed. Assume that there exists a vector u such that $\hat{\ell}_t = 0$ for all t. Then $\mathrm{card}(E_\varepsilon) \le \|u\|^2\left(\varepsilon(2 - \gamma_1)\kappa\right)^{-1}$.

Proof. From Lemma 3.2 we have (3.8) which, together with (3.4) and the assumption that $\hat{\ell}_t = 0$ for all t, gives

$$\|u\|^2 \ge \sum_{t=1}^{T}\left(2\ell_t\tau_t - \tau_t\gamma_1\ell_t\right) = \sum_{t=1}^{T}\tau_t\ell_t(2 - \gamma_1). \tag{3.17}$$

This yields, in view of (3.2), (3.6) and (3.4),

$$\|u\|^2 \ge \sum_{t=1}^{T}\ell_t(2 - \gamma_1)\tau_t \ge \sum_{t\in E_\varepsilon}\varepsilon(2 - \gamma_1)\kappa. \tag{3.18}$$

Hence,

$$\mathrm{card}(E_\varepsilon) \le \|u\|^2\left(\varepsilon(2 - \gamma_1)\kappa\right)^{-1},$$

as asserted.

Theorem 3.4 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples where $x_t \in \mathbb{R}^n$ and $y_t \in \mathbb{R}$ for all t, let the nonnegative parameters $\tau_t$ satisfy (3.4) for all t, let ε > 0 be fixed and let $u \in \mathbb{R}^n$. Assume that there is a number c > 0 such that $\tau_t \le c$ for all t. Then

$$\varepsilon\,\mathrm{card}(E_\varepsilon) \le \sum_{t\in E_\varepsilon}\ell_t \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + \sum_{t=1}^{T}2c\,\hat{\ell}_t\right). \tag{3.19}$$

Proof. From Lemma 3.2 we have (3.8) which, together with (3.4), implies that

$$\|u\|^2 + \sum_{t=1}^{T}2\tau_t\hat{\ell}_t \ge \sum_{t=1}^{T}\tau_t\left(2\ell_t - \gamma_1\ell_t\right) = (2 - \gamma_1)\sum_{t=1}^{T}\tau_t\ell_t \ge (2 - \gamma_1)\kappa\sum_{t\in E_\varepsilon}\ell_t. \tag{3.20}$$

The last inequality leads to

$$\varepsilon\,\mathrm{card}(E_\varepsilon) \le \sum_{t\in E_\varepsilon}\ell_t \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + \sum_{t=1}^{T}2\tau_t\hat{\ell}_t\right) \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + \sum_{t=1}^{T}2c\,\hat{\ell}_t\right), \tag{3.21}$$

which completes the proof.

4 Multiclass problems

In multiclass problems every instance $x_t$ is associated with a set of labels $Y_t$. Denoting by Y := {1, 2, …, k} the set of all possible labels, $Y_t$ is a subset of Y. We say that $y \in Y$ is relevant for the instance $x_t$ if $y \in Y_t$. The online algorithm receives instances x1, x2, … sequentially, where $x_t$ belongs to an instance space X.

Assume that we are provided with a set of functions $\phi_1, \phi_2, \ldots, \phi_d : X \times Y \to \mathbb{R}$ and set $\phi = (\phi_1, \phi_2, \ldots, \phi_d)$. On round t, the prediction of the algorithm is the k-dimensional vector

$$\left(\langle w_t, \phi(x_t, 1)\rangle, \langle w_t, \phi(x_t, 2)\rangle, \ldots, \langle w_t, \phi(x_t, k)\rangle\right). \tag{4.1}$$

We define the margin attained by the algorithm on round t for the example (xt, Yt) by

$$\gamma(w_t;(x_t, Y_t)) \equiv \min\{\langle w_t, \phi(x_t, r)\rangle \mid r \in Y_t\} - \max\{\langle w_t, \phi(x_t, s)\rangle \mid s \notin Y_t\}. \tag{4.2}$$

The instantaneous loss suffered after receiving $Y_t$ is defined by the following hinge-loss function:

$$\ell_{MC}(w;(x,Y)) \equiv \begin{cases} 0, & \text{if } \gamma(w;(x,Y)) \ge 1,\\ 1 - \gamma(w;(x,Y)), & \text{otherwise,} \end{cases} \tag{4.3}$$

and we define

$$\ell_t = \ell_{MC}(w_t;(x_t, Y_t)) \quad\text{and}\quad \hat{\ell}_t = \ell_{MC}(u;(x_t, Y_t)), \tag{4.4}$$

where $u \in \mathbb{R}^n$.
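
To illustrate (4.2) and (4.3), here is a small Python sketch (ours; phi stands for any user-supplied feature map ϕ(x, y) returning a vector, and Y_relevant is assumed to be a nonempty proper subset of {1, …, k}).

```python
import numpy as np

def multiclass_margin(w, phi, x, Y_relevant, k):
    """Margin (4.2): min over relevant labels minus max over irrelevant labels."""
    relevant   = [float(np.dot(w, phi(x, r))) for r in range(1, k + 1) if r in Y_relevant]
    irrelevant = [float(np.dot(w, phi(x, s))) for s in range(1, k + 1) if s not in Y_relevant]
    return min(relevant) - max(irrelevant)

def multiclass_loss(w, phi, x, Y_relevant, k):
    """Hinge-type loss (4.3) derived from the margin."""
    gamma = multiclass_margin(w, phi, x, Y_relevant, k)
    return 0.0 if gamma >= 1.0 else 1.0 - gamma
```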

Algorithm 4.1 General Online Passive-Aggressive Algorithmic Framework for Multiclass Classification

Initialization: Set w1 = (0, 0, …, 0) and choose parameters γ1, γ2 and a sufficiently small κ > 0 such that

$$0 < \gamma_1 < 2, \quad \gamma_2 \in (0, 1]. \tag{4.5}$$

Iterative step: (1) Given the weight wt and receiving the instance xt, predict the associated set of labels Ŷt.

(2) Receive the correct associated set of labels $Y_t$ and calculate the loss $\ell_t = \ell_{MC}(w_t;(x_t, Y_t))$.

(3) Calculate

$$r_t \in \arg\min\{\langle w_t, \phi(x_t, r)\rangle \mid r \in Y_t\}, \quad s_t \in \arg\max\{\langle w_t, \phi(x_t, s)\rangle \mid s \notin Y_t\}. \tag{4.6}$$

(4) Choose a nonnegative parameter $\tau_t$ such that $\tau_t = 0$ if $\ell_t = 0$; otherwise

$$\tau_t \le \gamma_1\ell_t\|\phi(x_t, r_t) - \phi(x_t, s_t)\|^{-2}, \quad\text{and if } \ell_t \ge \gamma_2 \text{ then } \tau_t \ge \kappa. \tag{4.7}$$

(5) Update:

$$w_{t+1} = w_t + \tau_t\left(\phi(x_t, r_t) - \phi(x_t, s_t)\right). \tag{4.8}$$
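
One iterative step of Algorithm 4.1 in Python (a sketch under our naming; choose_tau stands for any rule obeying (4.7), and Y_relevant is a nonempty proper subset of {1, …, k}).

```python
import numpy as np

def multiclass_step(w, phi, x, Y_relevant, k, choose_tau):
    """One round of Algorithm 4.1: compute r_t, s_t by (4.6) and update by (4.8)."""
    labels = range(1, k + 1)
    r_t = min((r for r in labels if r in Y_relevant),
              key=lambda r: float(np.dot(w, phi(x, r))))
    s_t = max((s for s in labels if s not in Y_relevant),
              key=lambda s: float(np.dot(w, phi(x, s))))
    gamma = float(np.dot(w, phi(x, r_t))) - float(np.dot(w, phi(x, s_t)))  # margin (4.2)
    loss = 0.0 if gamma >= 1.0 else 1.0 - gamma                            # loss (4.3)
    diff = phi(x, r_t) - phi(x, s_t)
    tau = 0.0 if loss == 0.0 else choose_tau(loss, diff)                   # step (4): obey (4.7)
    return w + tau * diff, loss                                            # update (4.8)
```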

Again, it can be shown that the three algorithmic variants that appear in [4, Section 7] are particular cases of our Algorithm 4.1 if there is an $m_0$ such that $\|\phi(x_t, r_t) - \phi(x_t, s_t)\|^2 \le m_0$ for all t.

Lemma 4.2 Let {(x1, Y1), (x2, Y2), …, (xT, YT)} be a sequence of examples with $x_t \in \mathbb{R}^n$, $Y_t \subseteq \{1, 2, \ldots, k\}$, let $w_1 = (0, 0, \ldots, 0)$, and let $u \in \mathbb{R}^n$. Then

$$\sum_{t=1}^{T}\tau_t\left(2\ell_t - 2\hat{\ell}_t - \tau_t\|\phi(x_t, r_t) - \phi(x_t, s_t)\|^2\right) \le \|u\|^2. \tag{4.9}$$

Proof. Set again

$$\Delta_t \equiv \|w_t - u\|^2 - \|w_{t+1} - u\|^2 \tag{4.10}$$

for all t. Then

$$\sum_{t=1}^{T}\Delta_t = \|w_1 - u\|^2 - \|w_{T+1} - u\|^2 \le \|u\|^2. \tag{4.11}$$

For t = 1, 2, …, T with $\ell_t > 0$ it follows from (4.10) and (4.8) that

$$\Delta_t = \|w_t - u\|^2 - \|w_t - u + \tau_t(\phi(x_t, r_t) - \phi(x_t, s_t))\|^2 = -2\tau_t\langle w_t - u, \phi(x_t, r_t) - \phi(x_t, s_t)\rangle - \tau_t^2\|\phi(x_t, r_t) - \phi(x_t, s_t)\|^2. \tag{4.12}$$

Assume that $t \in \{1, 2, \ldots, T\}$ and the loss $\ell_t > 0$. By (4.3),

$$\ell_t = 1 - \gamma(w_t;(x_t, Y_t)) \quad\text{and}\quad \hat{\ell}_t \ge 1 - \gamma(u;(x_t, Y_t)). \tag{4.13}$$

Together with (4.2) and (4.6) this implies that

$$(1 - \hat{\ell}_t) - (1 - \ell_t) \le \gamma(u;(x_t, Y_t)) - \gamma(w_t;(x_t, Y_t)) = \gamma(u;(x_t, Y_t)) - \left(\langle w_t, \phi(x_t, r_t)\rangle - \langle w_t, \phi(x_t, s_t)\rangle\right) \le \langle u, \phi(x_t, r_t)\rangle - \langle u, \phi(x_t, s_t)\rangle - \left(\langle w_t, \phi(x_t, r_t)\rangle - \langle w_t, \phi(x_t, s_t)\rangle\right) = \langle u - w_t, \phi(x_t, r_t) - \phi(x_t, s_t)\rangle. \tag{4.14}$$

By (4.12) and (4.14),

$$\Delta_t \ge 2\tau_t(\ell_t - \hat{\ell}_t) - \tau_t^2\|\phi(x_t, r_t) - \phi(x_t, s_t)\|^2. \tag{4.15}$$

Together with (4.11) this implies (4.9) as asserted.

Now set

$$E_{\gamma_2} \equiv \left\{t \in \{1, 2, \ldots, T\} \mid \ell_t \ge \gamma_2\right\}. \tag{4.16}$$

Theorem 4.3 Let {(x1, Y1), (x2, Y2), …, (xT, YT)} be a sequence of examples with $x_t \in \mathbb{R}^n$, $Y_t \subseteq \{1, 2, \ldots, k\}$, let $w_1 = (0, 0, \ldots, 0)$, let $\tau_t$ satisfy (4.7) for all t, and let $u \in \mathbb{R}^n$ be such that $\hat{\ell}_t = \ell_{MC}(u;(x_t, Y_t)) = 0$ for all t. Then $\mathrm{card}(E_{\gamma_2}) \le \|u\|^2\left(\kappa(2 - \gamma_1)\gamma_2\right)^{-1}$.

Proof. It follows from (4.9) and (4.7) that, for t = 1, 2, …, T,

$$2\ell_t - \tau_t\|\phi(x_t, r_t) - \phi(x_t, s_t)\|^2 \ge (2 - \gamma_1)\ell_t. \tag{4.17}$$

Together with (4.9) and the assumption that $\hat{\ell}_t = 0$ for all t, this implies that

$$\|u\|^2 \ge \sum_{t=1}^{T}\tau_t(2 - \gamma_1)\ell_t. \tag{4.18}$$

By (4.18) and (4.7),

$$\|u\|^2 \ge \sum_{t\in E_{\gamma_2}}\tau_t(2 - \gamma_1)\gamma_2 \ge \kappa(2 - \gamma_1)\gamma_2\,\mathrm{card}(E_{\gamma_2}), \tag{4.19}$$

and the theorem follows.

Theorem 4.4 Let {(x1, Y1), (x2, Y2), …, (xT, YT)} be a sequence of examples with $x_t \in \mathbb{R}^n$, $Y_t \subseteq \{1, 2, \ldots, k\}$, let $\tau_t$ satisfy (4.7) for all t, and let $u \in \mathbb{R}^n$. Assume that there is a number c > 0 such that $\tau_t \le c$ for all t. Then

$$\gamma_2\,\mathrm{card}(E_{\gamma_2}) \le \sum_{t\in E_{\gamma_2}}\ell_t \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + 2c\sum_{t=1}^{T}\hat{\ell}_t\right). \tag{4.20}$$

Proof. By (4.9),

$$\sum_{t=1}^{T}\tau_t\left(2\ell_t - \tau_t\|\phi(x_t, r_t) - \phi(x_t, s_t)\|^2\right) \le \|u\|^2 + \sum_{t=1}^{T}2\tau_t\hat{\ell}_t. \tag{4.21}$$

Together with (4.7) this implies that

$$\|u\|^2 + \sum_{t=1}^{T}2\tau_t\hat{\ell}_t \ge \sum_{t=1}^{T}\tau_t\left(2\ell_t - \gamma_1\ell_t\right) = (2 - \gamma_1)\sum_{t=1}^{T}\tau_t\ell_t \ge (2 - \gamma_1)\kappa\sum_{t\in E_{\gamma_2}}\ell_t. \tag{4.22}$$

This implies that

$$\gamma_2\,\mathrm{card}(E_{\gamma_2}) \le \sum_{t\in E_{\gamma_2}}\ell_t \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + \sum_{t=1}^{T}2\tau_t\hat{\ell}_t\right) \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + 2c\sum_{t=1}^{T}\hat{\ell}_t\right) \tag{4.23}$$

and the result follows.

5 Cost-sensitive multiclass classification

With Y and ϕ as in Section 4, in cost-sensitive multiclass classification each instance $x_t$ is associated with a single label $y_t \in Y$ and the prediction made by the online algorithm is simply

$$\hat{y}_t = \arg\max\{\langle w_t, \phi(x_t, y)\rangle \mid y \in Y\}. \tag{5.1}$$

A prediction error occurs if $y_t \ne \hat{y}_t$. More specifically, for every pair of labels $(y, y')$ there is a cost $\rho(y, y')$. We assume that $\rho(y, y) = 0$ for all $y \in Y$ and that $\rho(y, y') > 0$ whenever $y \ne y'$. The goal is to minimize $\sum_{t=1}^{T}\rho(y_t, \hat{y}_t)$. Define the prediction-based cost-sensitive loss

$$\ell_{PB}(w;(x,y)) \equiv \langle w, \phi(x, \hat{y})\rangle - \langle w, \phi(x, y)\rangle + \sqrt{\rho(y, \hat{y})}. \tag{5.2}$$
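
The prediction-based loss (5.2) can be evaluated as follows (a Python sketch with hypothetical names; rho is any cost function with ρ(y, y) = 0 and ρ(y, y′) > 0 for y ≠ y′, and phi is as in Section 4).

```python
import numpy as np
from math import sqrt

def prediction_based_loss(w, phi, rho, x, y, k):
    """Loss (5.2): <w, phi(x, y_hat)> - <w, phi(x, y)> + sqrt(rho(y, y_hat))."""
    labels = range(1, k + 1)
    y_hat = max(labels, key=lambda r: float(np.dot(w, phi(x, r))))   # prediction (5.1)
    loss = (float(np.dot(w, phi(x, y_hat))) - float(np.dot(w, phi(x, y)))
            + sqrt(rho(y, y_hat)))
    return loss, y_hat
```

Since ŷ maximizes ⟨w, ϕ(x, ·)⟩, the returned loss is always at least √ρ(y, ŷ).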

Algorithm 5.1 General Online Passive-Aggressive Algorithmic Framework for Cost-Sensitive Multiclass Classification

Initialization: Set w1 = (0, 0, …, 0) and choose parameters γ1, γ2 and a sufficiently small κ > 0 such that

$$0 < \gamma_1 < 2, \quad \gamma_2 \in (0, 1]. \tag{5.3}$$

Iterative step: (1) Given the weight wt and receiving the instance xt, predict the label ŷt.

(2) Receive the correct label $y_t$ and calculate the loss $\ell_t := \ell_{PB}(w_t;(x_t, y_t))$.

(3) Choose a nonnegative parameter $\tau_t$ such that if $\ell_t = 0$, then $\tau_t = 0$; otherwise

$$\tau_t \le \gamma_1\ell_t\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\|^{-2}, \quad\text{and if } \ell_t \ge \gamma_2 \text{ then } \tau_t \ge \kappa. \tag{5.4}$$

(4) Update:

$$w_{t+1} = w_t + \tau_t\left(\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\right). \tag{5.5}$$
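
One iterative step of Algorithm 5.1 in Python (ours; choose_tau is any rule satisfying (5.4), with phi and rho as in the sketch after (5.2)).

```python
import numpy as np
from math import sqrt

def cost_sensitive_step(w, phi, rho, x, y, k, choose_tau):
    """One round of Algorithm 5.1: predict by (5.1), suffer loss (5.2), update by (5.5)."""
    labels = range(1, k + 1)
    y_hat = max(labels, key=lambda r: float(np.dot(w, phi(x, r))))   # prediction (5.1)
    loss = (float(np.dot(w, phi(x, y_hat))) - float(np.dot(w, phi(x, y)))
            + sqrt(rho(y, y_hat)))                                    # loss (5.2)
    diff = phi(x, y) - phi(x, y_hat)
    tau = 0.0 if loss == 0.0 else choose_tau(loss, diff)              # step (3): obey (5.4)
    return w + tau * diff                                             # update (5.5)
```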

Again, it can be shown that the three algorithmic variants that appear in [4, Section 8] are particular cases of our Algorithm 5.1 if there is a constant $m_0$ such that $\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\| \le m_0$ for all t.

To analyze Algorithm 5.1, let $\tilde{y} = \tilde{y}(w; x, y) \in Y$ be defined, for any given w, x and y, by

$$\tilde{y} \in \arg\max\left\{\langle w, \phi(x, r)\rangle - \langle w, \phi(x, y)\rangle + \sqrt{\rho(y, r)} \mid r \in Y\right\}, \quad \tilde{y}_t \equiv \tilde{y}(w_t; x_t, y_t). \tag{5.6}$$

Define the loss for the max-loss update by

$$\ell_{ML}(w;(x,y)) \equiv \langle w, \phi(x, \tilde{y})\rangle - \langle w, \phi(x, y)\rangle + \sqrt{\rho(y, \tilde{y})}. \tag{5.7}$$

By (5.2), (5.6) and (5.7),

$$\ell_{PB}(w_t;(x_t, y_t)) \le \ell_{ML}(w_t;(x_t, y_t)). \tag{5.8}$$
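
The max-loss label ỹ of (5.6) and the loss (5.7) can be computed as follows (a Python sketch with the same hypothetical phi and rho as above).

```python
import numpy as np
from math import sqrt

def max_loss_label(w, phi, rho, x, y, k):
    """Return (y_tilde, ell_ML(w;(x,y))) according to (5.6) and (5.7)."""
    score = lambda r: (float(np.dot(w, phi(x, r))) - float(np.dot(w, phi(x, y)))
                       + sqrt(rho(y, r)))
    y_tilde = max(range(1, k + 1), key=score)
    return y_tilde, score(y_tilde)
```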

Lemma 5.2 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples, where $x_t \in \mathbb{R}^n$ and $y_t \in Y$ for all t. Let u be an arbitrary vector in $\mathbb{R}^n$. If $\tau_t \ge 0$ satisfies (5.4), then, for any sequence {wt} generated by Algorithm 5.1,

$$\sum_{t=1}^{T}\left[\tau_t\left(2\ell_{PB}(w_t;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\|^2 - 2\ell_{ML}(u;(x_t, y_t))\tau_t\right] \le \|u\|^2. \tag{5.9}$$

Under the same conditions, if in Algorithm 5.1 $\hat{y}_t$ is replaced by $\tilde{y}_t$, then

$$\sum_{t=1}^{T}\left[\tau_t\left(2\ell_{ML}(w_t;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\|^2 - 2\ell_{ML}(u;(x_t, y_t))\tau_t\right] \le \|u\|^2. \tag{5.10}$$

Proof. As usual, set

$$\Delta_t \equiv \|w_t - u\|^2 - \|w_{t+1} - u\|^2. \tag{5.11}$$

Then

$$\sum_{t=1}^{T}\Delta_t = \|w_1 - u\|^2 - \|w_{T+1} - u\|^2 \le \|u\|^2. \tag{5.12}$$

Let $t \in \{1, 2, \ldots, T\}$ with

$$\ell_{PB}(w_t;(x_t, y_t)) > 0. \tag{5.13}$$

Then, by (5.11) and (5.5),

$$\Delta_t = \|w_t - u\|^2 - \|w_t - u + \tau_t(\phi(x_t, y_t) - \phi(x_t, \hat{y}_t))\|^2 = -2\tau_t\langle w_t - u, \phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\rangle - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\|^2. \tag{5.14}$$

By definition (see (5.6) and (5.7)),

$$\ell_{ML}(u;(x_t, y_t)) = \max\left\{\langle u, \phi(x_t, r) - \phi(x_t, y_t)\rangle + \sqrt{\rho(y_t, r)} \mid r \in Y\right\}. \tag{5.15}$$

Therefore,

$$\hat{\ell}_t \equiv \ell_{ML}(u;(x_t, y_t)) \ge \langle u, \phi(x_t, \hat{y}_t) - \phi(x_t, y_t)\rangle + \sqrt{\rho(y_t, \hat{y}_t)}. \tag{5.16}$$

By (5.14) and (5.16),

$$\Delta_t \ge -2\tau_t\langle w_t, \phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\rangle + 2\tau_t\left(\sqrt{\rho(y_t, \hat{y}_t)} - \ell_{ML}(u;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\|^2. \tag{5.17}$$

By the definition of $\ell_{PB}$ (see (5.2)),

$$\langle w_t, \phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\rangle = \sqrt{\rho(y_t, \hat{y}_t)} - \ell_{PB}(w_t;(x_t, y_t)). \tag{5.18}$$

By (5.17) and (5.18),

$$\Delta_t \ge -2\tau_t\left(\sqrt{\rho(y_t, \hat{y}_t)} - \ell_{PB}(w_t;(x_t, y_t))\right) + 2\tau_t\left(\sqrt{\rho(y_t, \hat{y}_t)} - \ell_{ML}(u;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\|^2 = \tau_t\left(2\ell_{PB}(w_t;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\|^2 - 2\ell_{ML}(u;(x_t, y_t))\tau_t. \tag{5.19}$$

Relations (5.19) and (5.12) show that

$$\|u\|^2 \ge \sum_{t=1}^{T}\Delta_t \ge \sum_{t=1}^{T}\left[\tau_t\left(2\ell_{PB}(w_t;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\|^2 - 2\ell_{ML}(u;(x_t, y_t))\tau_t\right], \tag{5.20}$$

which proves the first case of the lemma. Considering the second case, where in Algorithm 5.1 $\hat{y}_t$ is replaced by $\tilde{y}_t$, we define $\Delta_t$ again by (5.11). Clearly,

$$\sum_{t=1}^{T}\Delta_t \le \|u\|^2. \tag{5.21}$$

Let $t \in \{1, 2, \ldots, T\}$ with

$$\ell_{PB}(w_t;(x_t, y_t)) > 0. \tag{5.22}$$

As in (5.14) we can show that

$$\Delta_t = -2\tau_t\langle w_t - u, \phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\rangle - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\|^2. \tag{5.23}$$

By definition (see (5.6) and (5.7)),

$$\hat{\ell}_t \equiv \ell_{ML}(u;(x_t, y_t)) \ge \langle u, \phi(x_t, \tilde{y}_t) - \phi(x_t, y_t)\rangle + \sqrt{\rho(y_t, \tilde{y}_t)}. \tag{5.24}$$

By (5.24) and (5.23),

$$\Delta_t \ge -2\tau_t\langle w_t, \phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\rangle + 2\tau_t\left(\sqrt{\rho(y_t, \tilde{y}_t)} - \ell_{ML}(u;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\|^2. \tag{5.25}$$

Again, by definition (see (5.6) and (5.7)),

$$\langle w_t, \phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\rangle = \sqrt{\rho(y_t, \tilde{y}_t)} - \ell_{ML}(w_t;(x_t, y_t)). \tag{5.26}$$

By (5.25) and (5.26),

$$\Delta_t \ge -2\tau_t\left(\sqrt{\rho(y_t, \tilde{y}_t)} - \ell_{ML}(w_t;(x_t, y_t))\right) + 2\tau_t\left(\sqrt{\rho(y_t, \tilde{y}_t)} - \ell_{ML}(u;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\|^2 = \tau_t\left(2\ell_{ML}(w_t;(x_t, y_t)) - 2\ell_{ML}(u;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\|^2. \tag{5.27}$$

Together with (5.21) this implies that

$$\|u\|^2 \ge \sum_{t=1}^{T}\Delta_t \ge \sum_{t=1}^{T}\left[\tau_t\left(2\ell_{ML}(w_t;(x_t, y_t)) - 2\ell_{ML}(u;(x_t, y_t))\right) - \tau_t^2\|\phi(x_t, y_t) - \phi(x_t, \tilde{y}_t)\|^2\right], \tag{5.28}$$

completing the proof.

Consider the set

$$E_{\gamma_2^2} \equiv \left\{t \in \{1, 2, \ldots, T\} \mid \rho(y_t, \hat{y}_t) \ge \gamma_2^2\right\}. \tag{5.29}$$

Theorem 5.3 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples where $x_t \in \mathbb{R}^n$ and $y_t \in Y$ for all t, let $\tau_t$ satisfy (5.4) for all t, and let u be an arbitrary vector in $\mathbb{R}^n$. Assume that

$$\ell_{ML}(u;(x_t, y_t)) = 0 \tag{5.30}$$

for all t. Then

$$\mathrm{card}(E_{\gamma_2^2}) \le \left(\kappa\gamma_2(2 - \gamma_1)\right)^{-1}\|u\|^2. \tag{5.31}$$

Proof. By (5.1), (5.4), (5.6) and Lemma 5.2,

$$\|u\|^2 \ge \sum_{t=1}^{T}\tau_t\left(2\ell_{PB}(w_t;(x_t, y_t)) - \tau_t\|\phi(x_t, y_t) - \phi(x_t, \hat{y}_t)\|^2\right) \ge \sum_{t=1}^{T}\tau_t(2 - \gamma_1)\ell_{PB}(w_t;(x_t, y_t)) \ge \sum_{t=1}^{T}\tau_t(2 - \gamma_1)\sqrt{\rho(y_t, \hat{y}_t)}. \tag{5.32}$$

This and (5.4) yield

$$\|u\|^2 \ge \sum_{t\in E_{\gamma_2^2}}\tau_t(2 - \gamma_1)\gamma_2 \ge \kappa\gamma_2(2 - \gamma_1)\,\mathrm{card}(E_{\gamma_2^2}), \tag{5.33}$$

proving the theorem.

Theorem 5.4 Let {(x1, y1), (x2, y2), …, (xT, yT)} be a sequence of examples, where $x_t \in \mathbb{R}^n$ and $y_t \in Y$ for all t, let $\tau_t$ satisfy (5.4) for all t, and let u be an arbitrary vector in $\mathbb{R}^n$. Assume that there exists a number c > 0 such that $\tau_t \le c$ for all t. Then

$$\gamma_2\,\mathrm{card}(E_{\gamma_2^2}) \le \sum_{t\in E_{\gamma_2^2}}\sqrt{\rho(y_t, \hat{y}_t)} \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + \sum_{t=1}^{T}2c\,\ell_{ML}(u;(x_t, y_t))\right). \tag{5.34}$$

Proof. Inequality (5.9), when combined with (5.4), implies that

$$\|u\|^2 + \sum_{t=1}^{T}2\ell_{ML}(u;(x_t, y_t))\tau_t \ge \sum_{t=1}^{T}\tau_t\left(2\ell_{PB}(w_t;(x_t, y_t)) - \gamma_1\ell_{PB}(w_t;(x_t, y_t))\right) = (2 - \gamma_1)\sum_{t=1}^{T}\tau_t\ell_{PB}(w_t;(x_t, y_t)) \ge (2 - \gamma_1)\sum_{t=1}^{T}\tau_t\sqrt{\rho(y_t, \hat{y}_t)}. \tag{5.35}$$

This inequality, (5.4), (5.1) and (5.2) show that

$$\gamma_2\,\mathrm{card}(E_{\gamma_2^2}) \le \sum_{t\in E_{\gamma_2^2}}\sqrt{\rho(y_t, \hat{y}_t)} \le (2 - \gamma_1)^{-1}(2 - \gamma_1)\kappa^{-1}\sum_{t\in E_{\gamma_2^2}}\tau_t\sqrt{\rho(y_t, \hat{y}_t)} \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + \sum_{t=1}^{T}2\ell_{ML}(u;(x_t, y_t))\tau_t\right) \le \left((2 - \gamma_1)\kappa\right)^{-1}\left(\|u\|^2 + \sum_{t=1}^{T}2\ell_{ML}(u;(x_t, y_t))c\right), \tag{5.36}$$

concluding the proof.

Acknowledgments

We gratefully acknowledge illuminating discussions with Shai Shalev-Shwartz and Yoram Singer. This work was partially supported by grant No. 2003275 of the United States-Israel Binational Science Foundation (BSF) and by a National Institutes of Health (NIH) grant No. HL70472. The second author was partially supported by the Fund for the Promotion of Research at the Technion and by the Technion VPR Fund - B. and G. Greenberg Research Fund (Ottawa).

References

1. Bauschke HH, Borwein JM. On projection algorithms for solving convex feasibility problems. SIAM Review. 1996;38:367–426.
2. Butnariu D, Davidi R, Herman GT, Kazantsev IG. Stable convergence behavior under summable perturbations of a class of projection methods for convex feasibility and optimization problems. IEEE Journal of Selected Topics in Signal Processing. 2007;1:540–547.
3. Censor Y, Zenios SA. Parallel Optimization: Theory, Algorithms and Applications. Oxford University Press; New York, NY, USA: 1997.
4. Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y. Online passive-aggressive algorithms. Journal of Machine Learning Research. 2006;7:551–585.
5. Jiang M, Wang G. Convergence studies on iterative algorithms for image reconstruction. IEEE Transactions on Medical Imaging. 2003;22:569–579.
6. Littlestone N. Learning when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning. 1988;2:285–318.
7. Littlestone N, Warmuth MK. The weighted majority algorithm. Information and Computation. 1994;108:212–261.
8. Vovk VG. Proceedings of the Third Annual Workshop on Computational Learning Theory. Morgan Kaufmann Publishers Inc.; San Francisco, CA, USA: 1990. Aggregating strategies; pp. 371–383.