JMLR: Workshop and Conference Proceedings vol (2010) 1–21. 24th Annual Conference on Learning Theory

Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments ∗

Gábor Bartók bartok@cs.ualberta.ca
Dávid Pál dpal@cs.ualberta.ca
Csaba Szepesvári szepesva@cs.ualberta.ca
Department of Computing Science, University of Alberta, Edmonton, T6G 2E8, AB, Canada

Editors: Sham Kakade, Ulrike von Luxburg

Abstract

In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight. Assuming that the outcomes are generated in an i.i.d. fashion from an arbitrary and unknown probability distribution, we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes. It turns out that the minimax regret of any such game is either zero, Θ(√T), Θ(T^{2/3}), or Θ(T). We provide a computationally efficient learning algorithm that achieves the minimax regret within a logarithmic factor for any game.

Keywords: Online learning, Imperfect feedback, Regret analysis

1. Introduction

Partial monitoring provides a mathematical framework for sequential decision making problems with imperfect feedback. Various problems of interest can be modeled as partial monitoring instances, such as learning with expert advice (Littlestone and Warmuth, 1994), the multi-armed bandit problem (Auer et al., 2002), dynamic pricing (Kleinberg and Leighton, 2003), the dark pool problem (Agarwal et al., 2010), label efficient prediction (Cesa-Bianchi et al., 2005), and linear and convex optimization with full or bandit feedback (Zinkevich, 2003; Abernethy et al., 2008; Flaxman et al., 2005).

In this paper we restrict ourselves to finite games, i.e., games where both the set of actions available to the learner and the set of possible outcomes generated by the environment are finite. A finite partial monitoring game G is described by a pair of N × M matrices: the loss matrix L and the feedback matrix H. The entries ℓ_{i,j} of L are real numbers lying in, say, the interval [0, 1]. The entries h_{i,j} of H belong to an alphabet Σ on which we impose no structure; we only assume that the learner is able to distinguish distinct elements of the alphabet.

The game proceeds in T rounds according to the following protocol. First, G = (L, H) is announced to both players. In each round t = 1, 2, …, T, the learner chooses an action I_t ∈

∗ This work was supported in part by AICML, AITF (formerly iCore and AIF), NSERC and the PASCAL2 Network of Excellence under EC grant no. 216886.

© 2010 G. Bartók, D. Pál & C. Szepesvári.

{1, 2, …, N} and simultaneously, the environment chooses an outcome J_t ∈ {1, 2, …, M}. Then, the learner receives as feedback the entry h_{I_t,J_t}. The learner incurs the instantaneous loss ℓ_{I_t,J_t}, which is not revealed to him. The feedback can be thought of as masked information about the outcome J_t. In some cases h_{I_t,J_t} might uniquely determine the outcome; in other cases the feedback might give only partial or no information about the outcome. In this paper, we shall assume that J_t is chosen randomly from a fixed multinomial distribution.

The learner is scored according to the loss matrix L. In round t the learner incurs an instantaneous loss of ℓ_{I_t,J_t}. The goal of the learner is to keep his total loss ∑_{t=1}^T ℓ_{I_t,J_t} low. Equivalently, the learner's performance can also be measured in terms of his regret, i.e., the total loss of the learner is compared with the loss of the best fixed action in hindsight. The regret is defined as the difference of these two losses.
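To make the protocol concrete, here is a minimal simulation sketch. The 2×2 game matrices, the uniform stand-in learner and the horizon are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 2-action, 2-outcome game (illustration only): L[i, j] is the
# loss and H[i, j] the feedback symbol when action i meets outcome j.
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])
H = np.array([["a", "a"],    # action 0: feedback reveals nothing
              ["b", "c"]])   # action 1: feedback reveals the outcome

p_star = np.array([0.3, 0.7])   # the opponent's fixed multinomial strategy
T = 1000

total_loss = 0.0
outcome_counts = np.zeros(2)
for t in range(T):
    I_t = int(rng.integers(2))          # stand-in learner: uniform play
    J_t = int(rng.choice(2, p=p_star))  # i.i.d. outcome
    feedback = H[I_t, J_t]              # the only thing the learner sees
    total_loss += L[I_t, J_t]           # the loss itself stays hidden
    outcome_counts[J_t] += 1

# Regret against the best fixed action on the realized outcome sequence.
best_fixed_loss = (L @ outcome_counts).min()
regret = total_loss - best_fixed_loss
```

With this loss matrix the uniform learner pays about T/2 while the best fixed action pays about 0.3·T, so the sketch exhibits linear regret, as expected for a learner that ignores the feedback.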

In general , the regret grows with the number of rounds T . If the regret is sublinear in T , the learner is said to be Hannan consistent , and this means that the learner’s average per-round loss approaches the average per-round loss of the best action in hindsight .

Piccolboni and Schindelhauer (2001) were among the first to study the regret of these games. In fact, they studied the problem without making any probabilistic assumptions about the outcome sequence J_t. They proved that for any finite game (L, H), either the regret can be Ω(T) in the worst case for any algorithm, or there exists an algorithm whose regret is O(T^{3/4}) on any outcome sequence.¹ This result was later improved by Cesa-Bianchi et al. (2006), who showed that the algorithm of Piccolboni and Schindelhauer has regret O(T^{2/3}). Furthermore, they provided an example of a finite game, a variant of label-efficient prediction, for which any algorithm has regret Θ(T^{2/3}) in the worst case.

However, for many games O(T^{2/3}) is not optimal. For example, games with full feedback (i.e., when the feedback uniquely determines the outcome) can be viewed as a special instance of the problem of learning with expert advice, and in this case it is known that the "EWA forecaster" has regret O(√T); see e.g., Cesa-Bianchi and Lugosi (2006, Chapter 3). Similarly, for games with "bandit feedback" (i.e., when the feedback determines the instantaneous loss), the INF algorithm (Audibert and Bubeck, 2009) and the Exp3 algorithm (Auer et al., 2002) achieve O(√T) regret as well.²

This leaves open the problem of determining the minimax regret (i.e., optimal worst-case regret) of any given game (L, H). Partial progress was made in this direction by Bartók et al. (2010), who characterized (almost) all finite games with M = 2 outcomes. They showed that the minimax regret of any "non-degenerate" finite game with two outcomes falls into one of four categories: zero, Θ(√T), Θ(T^{2/3}) or Θ(T). They gave a combinatoric-geometric condition on the matrices L, H which determines the category a game belongs to. Additionally, they constructed an efficient algorithm which, for any game, achieves the minimax regret rate associated with the game within a poly-logarithmic factor.

In this paper, we consider the same problem, with two differences. In pursuit of a general result, we consider all finite games. At the same time, however, we deal only with stochastic environments, i.e., the case when the outcome sequences are generated from a fixed probability distribution in an i.i.d. manner.

1. The notations O(·) and Θ(·) hide polylogarithmic factors.
2. We ignore the dependence of the regret on the number of actions or any other parameters.

The regret against stochastic environments is defined as the difference between the cumulative loss suffered by the algorithm and that of the action with the lowest expected loss . That is , given an algorithm A and a time horizon T , if the outcomes are generated from a probability distribution p , the regret is

R_T(A, p) = ∑_{t=1}^T ℓ_{I_t,J_t} − min_{1≤i≤N} E_p[ ∑_{t=1}^T ℓ_{i,J_t} ].

In this paper we analyze the minimax expected regret (in what follows, minimax regret) of games, defined as

R_T(G) = inf_A sup_{p∈∆_M} E_p[ R_T(A, p) ].

We show that the minimax regret of any finite game falls into one of four categories: zero, Θ(√T), Θ(T^{2/3}), or Θ(T). Accordingly, we call the games trivial, easy, hard, and hopeless. We give a simple and efficiently computable characterization of these classes using a geometric condition on (L, H). We provide lower bounds and algorithms that achieve them within a poly-logarithmic factor. Our result is an extension of the result of Bartók et al. (2010) for stochastic environments.

It is clear that any lower bound which holds for stochastic environments must hold for adversarial environments too . On the other hand , algorithms and regret upper bounds for stochastic environments , of course , do not transfer to algorithms and regret upper bounds for the adversarial case . Our characterization is a stepping stone towards understanding the minimax regret of partial monitoring games . In particular , we conjecture that our characterization holds without any change for unrestricted environments .

2. Preliminaries

In this section, we introduce our conventions, along with some definitions. By default, all vectors are column vectors. We denote by ‖v‖ = √(vᵀv) the Euclidean norm of a vector v. For a vector v, the notation v ≥ 0 means that all entries of v are non-negative, and the notation v > 0 means that all entries are positive. For a matrix A, Im A denotes its image space, i.e., the vector space generated by its columns, while Ker A denotes its kernel, i.e., the set {x : Ax = 0}.

Consider a game G = (L, H) with N actions and M outcomes. That is, L ∈ ℝ^{N×M} and H ∈ Σ^{N×M}. For the sake of simplicity, and without loss of generality, we assume that no symbol σ ∈ Σ appears in two different rows of H. The signal matrix of an action is defined as follows:

Definition 1 (Signal matrix) Let {σ_1, …, σ_{s_i}} be the set of symbols listed in the i-th row of H. (Thus, s_i denotes the number of distinct symbols in row i of H.) The signal matrix S_i of action i is defined as the s_i × M matrix with entries a_{k,j} = 𝕀(h_{i,j} = σ_k) for 1 ≤ k ≤ s_i and 1 ≤ j ≤ M. The signal matrix of a set of actions is defined as the signal matrices of the actions in the set, stacked on top of one another, in the ordering of the actions.
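A minimal sketch of this construction (the `signal_matrix` helper and the sorted symbol ordering are our own illustrative choices):

```python
import numpy as np

def signal_matrix(H_row):
    """Signal matrix S_i of one action: an s_i x M 0/1 matrix with
    a_{k,j} = I(h_{i,j} = sigma_k), where sigma_1, ..., sigma_{s_i} are the
    distinct symbols of row i of H (taken here in sorted order)."""
    symbols = sorted(set(H_row))
    return np.array([[1 if h == s else 0 for h in H_row] for s in symbols])

# Example: a feedback row over M = 3 outcomes with s_i = 2 distinct symbols.
S = signal_matrix(["a", "b", "a"])
# Each column of a signal matrix is a unit vector: every outcome produces
# exactly one symbol.
assert (S.sum(axis=0) == 1).all()
```

The signal matrix of a set of actions is then just `np.vstack` over the per-action signal matrices, in the ordering of the actions.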

For an example of a signal matrix, see Section 3.1. We identify the strategy of a stochastic opponent with an element of the probability simplex ∆_M = {p ∈ ℝ^M : p ≥ 0, ∑_{j=1}^M p_j = 1}. Note that for any opponent strategy p, if the learner chooses action i then the vector S_i p ∈ ℝ^{s_i} is the probability distribution of the observed feedback: (S_i p)_k is the probability of observing the k-th symbol.

We denote by ℓ_i the i-th row of the loss matrix L and call ℓ_i the loss vector of action i. We say that action i is optimal under opponent strategy p ∈ ∆_M if for every 1 ≤ j ≤ N, ℓ_iᵀp ≤ ℓ_jᵀp. Action i is said to be Pareto-optimal if there exists an opponent strategy p such that action i is optimal under p. We now define the cell decomposition of ∆_M induced by L (for an example, see Figure 2):
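Checking optimality under a fixed p is a componentwise comparison of the entries of L p; a small sketch with a made-up 3×2 loss matrix (our own example, not from the paper):

```python
import numpy as np

def optimal_actions(L, p, tol=1e-12):
    """Indices of actions optimal under opponent strategy p
    (the argmin set of ell_i^T p over i)."""
    expected = L @ p                  # entry i is ell_i^T p
    return np.flatnonzero(expected <= expected.min() + tol)

# Hypothetical 3-action, 2-outcome loss matrix (illustration only).
L = np.array([[0.0, 1.0],
              [0.5, 0.5],
              [1.0, 0.0]])
assert list(optimal_actions(L, np.array([0.9, 0.1]))) == [0]
assert list(optimal_actions(L, np.array([0.1, 0.9]))) == [2]
# At p = (0.5, 0.5) all three actions are optimal; action 1 is optimal ONLY
# there, so its cell is the single point (0.5, 0.5).
assert list(optimal_actions(L, np.array([0.5, 0.5]))) == [0, 1, 2]
```

In this toy game the cells of actions 0 and 2 are full-dimensional segments of ∆_2, while the cell of action 1 is a single point.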

Definition 2 (Cell decomposition) For an action i, the cell C_i associated with i is defined as C_i = {p ∈ ∆_M : action i is optimal under p}. The cell decomposition of ∆_M is defined as the multiset C = {C_i : 1 ≤ i ≤ N, C_i has positive (M − 1)-dimensional volume}.

Actions whose cell has positive (M − 1)-dimensional volume are called strongly Pareto-optimal. Actions that are Pareto-optimal but not strongly Pareto-optimal are called degenerate. Note that the cells of the actions are defined by linear inequalities and thus they are convex polytopes. It follows that strongly Pareto-optimal actions are exactly the actions whose cells are (M − 1)-dimensional polytopes. It is also important to note that the cell decomposition is a multiset, since some actions may share the same cell. Nevertheless, if two actions have the same (M − 1)-dimensional cell, their loss vectors are necessarily identical.³

We call two cells of C neighbors if their intersection is an (M − 2)-dimensional polytope. The actions corresponding to these cells are also called neighbors. Neighborship is not defined for cells outside of C. For two neighboring cells C_i, C_j ∈ C, we define the neighborhood action set A_{i,j} = {1 ≤ k ≤ N : C_i ∩ C_j ⊆ C_k}. It follows from the definition that actions i and j are in A_{i,j}, hence A_{i,j} is nonempty. However, the neighborhood action set may contain more than two actions.

When discussing lower bounds we will need the definition of algorithms. For us, an algorithm A is a mapping A : Σ∗ → {1, 2, …, N} that maps past feedback sequences to actions. That the algorithms are deterministic is assumed for convenience; in particular, the lower bounds we prove can be extended to randomized algorithms by conditioning on the internal randomization of the algorithm. Note that the algorithms we design are themselves deterministic.

3. Classification of finite partial-monitoring games

In this section we present our main result : we state the theorem that classifies all finite stochastic partial-monitoring games based on how their minimax regret scales with the time horizon . Thanks to the previous section , we are now equipped to define a notion which will play a key role in the classification theorem:

3. One could think that actions with identical loss vectors are redundant and that all but one of such actions could be removed without loss of generality. However, since different actions can lead to different observations and thus yield different information, removing the duplicates can be harmful.

Definition 3 (Observability) Let S be the signal matrix of the set of all actions in the game. For actions i and j, we say that ℓ_i − ℓ_j is globally observable if ℓ_i − ℓ_j ∈ Im Sᵀ. Furthermore, if i and j are two neighboring actions, then ℓ_i − ℓ_j is called locally observable if ℓ_i − ℓ_j ∈ Im S_{(i,j)}ᵀ, where S_{(i,j)} is the signal matrix of the neighborhood action set A_{i,j}.

As we will see, global observability implies that we can estimate the difference of the expected losses of two actions after choosing each action once. Local observability means that we need only the actions of the neighborhood action set to estimate the difference.
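Both conditions are membership tests in the row space of a signal matrix, which can be checked numerically, e.g. via a least-squares residual; the helper name below is our own:

```python
import numpy as np

def in_row_space(S, d, tol=1e-9):
    """Return True iff d lies in the row space of S, i.e. d in Im S^T."""
    v = np.linalg.lstsq(S.T, d, rcond=None)[0]   # least-squares solve S^T v ~ d
    return np.linalg.norm(S.T @ v - d) <= tol

# Full information (signal matrix = identity): every loss difference is
# observable.
assert in_row_space(np.eye(3), np.array([0.2, -0.5, 0.3]))

# A single uninformative symbol (one all-ones row): only constant vectors
# lie in the row space, so this difference is not observable.
assert not in_row_space(np.ones((1, 3)), np.array([1.0, -1.0, 0.0]))
```

The same test with S replaced by the stacked signal matrix S_{(i,j)} of the neighborhood action set checks local observability of the pair (i, j).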

The classification theorem , which is our main result , is the following:

Theorem 4 (Classification) Let G = (L, H) be a partial-monitoring game with N actions and M outcomes. Let C = {C_1, …, C_k} be its cell decomposition, with corresponding loss vectors ℓ_1, …, ℓ_k. The game G falls into one of the following four categories:

(a) R_T(G) = 0 if there exists an action i with C_i = ∆_M. This case is called trivial.

(b) R_T(G) = Θ(T) if there exist two strongly Pareto-optimal actions i and j such that ℓ_i − ℓ_j is not globally observable. This case is called hopeless.

(c) R_T(G) = Θ(√T) if the game is not trivial and for all pairs of (strongly Pareto-optimal) neighboring actions i and j, ℓ_i − ℓ_j is locally observable. These games are called easy.

(d) R_T(G) = Θ(T^{2/3}) if G is not hopeless and there exists a pair of neighboring actions i and j such that ℓ_i − ℓ_j is not locally observable. These games are called hard.

Note that the conditions listed under (a)–(d) are mutually exclusive and cover all finite partial-monitoring games. The only non-obvious implication is that an easy game cannot be hopeless. This holds because for any pair of cells C_i, C_j in C, the vector ℓ_i − ℓ_j can be expressed as a telescoping sum of the differences of loss vectors of neighboring cells.

The remainder of the paper is dedicated to proving Theorem 4. We start with the simple cases. If there exists an action whose cell covers the whole probability simplex, then choosing that action in every round yields zero regret, proving case (a). The condition in case (b) is due to Piccolboni and Schindelhauer (2001), who showed that under this condition there is no algorithm that achieves sublinear regret.⁴ The upper bound for case (d) is achieved by the FeedExp3 algorithm due to Piccolboni and Schindelhauer (2001), for which a regret bound of O(T^{2/3}) was shown by Cesa-Bianchi et al. (2006). The lower bound for case (c) was proved by Antos et al. (2011). For a visualization of previous results, see Figure 1.

The above assertions help characterize trivial and hopeless games, and show that if a game is neither trivial nor hopeless then its minimax regret falls between Ω(√T) and O(T^{2/3}). Our contribution in this paper is to give the exact minimax rates (up to logarithmic factors) for these games. To prove the upper bound for case (c), we introduce a new algorithm, which we call Balaton, for "Bandit Algorithm for Loss Annihilation".⁵ This algorithm is presented in Section 4, and its analysis is given in Section 5. The lower bound for case (d) is presented in Section 6.

4. Although Piccolboni and Schindelhauer state their theorem for adversarial environments, their proof applies to stochastic environments without any change (which is important for the lower bound part).
5. Balaton is a lake in Hungary. We thank Gergely Neu for suggesting the name.

Figure 1: Partial monitoring games and their minimax regret as it was known previously. The big rectangle denotes the set of all games. Inside the big rectangle, the games are ordered from left to right based on their minimax regret: trivial, easy (including full-information and bandit games), hard (where l.e.p. denotes label-efficient prediction), and hopeless. The grey area contains games whose minimax regret is between Ω(√T) and O(T^{2/3}) but whose exact regret rate was unknown. This area is now eliminated, and the dynamic pricing problem is proven to be hard.

3.1. Example

In this section, as a corollary of Theorem 4, we show that the discretized dynamic pricing game (see, e.g., Cesa-Bianchi et al. (2006)) is hard. Dynamic pricing is a game between a vendor (learner) and a customer (environment). In each round, the vendor sets the price at which he wants to sell his product (action), and the customer sets the maximum price he is willing to pay for the product (outcome). If the product is not sold, the vendor suffers some constant loss; otherwise his loss is the difference between the customer's maximum price and his own price. The customer never reveals the maximum price, and thus the vendor's only feedback is whether he sold the product or not.

The discretized version of the game with N actions (and outcomes) is defined by the matrices

L = ( 0   1   2   ⋯   N−1
      c   0   1   ⋯   N−2
      ⋮   ⋱   ⋱   ⋱   ⋮
      c   ⋯   c   0   1
      c   ⋯   ⋯   c   0 ),

H = ( 1   ⋯   ⋯   1
      0   ⋱       ⋮
      ⋮   ⋱   ⋱   ⋮
      0   ⋯   0   1 ),

where c is a positive constant (see Figure 2 for the cell decomposition for N = 3). It is easy to see that all the actions are strongly Pareto-optimal. Also, after some linear algebra it turns out that the cells underlying the actions have a single common vertex in the interior of the probability simplex. It follows that any two actions are neighbors. On the other hand, if we take two non-consecutive actions i and i′, ℓ_i − ℓ_{i′} is not locally observable. For example, the signal matrix of actions 1 and N is

S_{(1,N)} = ( 1   ⋯   1   1
              1   ⋯   1   0
              0   ⋯   0   1 ),

whereas ℓ_N − ℓ_1 = (c, c − 1, …, c − N + 2, −N + 1)ᵀ. It is obvious that ℓ_N − ℓ_1 is not in the row space of S_{(1,N)}.
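The computation above can be reproduced numerically for N = 3 (the `signal_matrix` helper and the choice c = 2 are illustrative assumptions of ours):

```python
import numpy as np

c, N = 2.0, 3                      # c > 0 arbitrary; smallest nontrivial N
L = np.array([[0, 1, 2],
              [c, 0, 1],
              [c, c, 0]], dtype=float)
H = np.array([[1, 1, 1],           # lowest price: always sold
              [0, 1, 1],
              [0, 0, 1]])          # row i: 0 ("not sold") for outcomes j < i

def signal_matrix(row):
    return np.array([[1 if h == s else 0 for h in row] for s in sorted(set(row))])

# Stacked signal matrix of the non-consecutive pair: action 1 and action N.
S_1N = np.vstack([signal_matrix(H[0]), signal_matrix(H[N - 1])])
assert S_1N.tolist() == [[1, 1, 1], [1, 1, 0], [0, 0, 1]]

d = L[N - 1] - L[0]                # ell_N - ell_1 = (c, c-1, -N+1)
v = np.linalg.lstsq(S_1N.T, d, rcond=None)[0]
residual = np.linalg.norm(S_1N.T @ v - d)
# d is NOT in the row space of S_(1,N): the loss difference is not locally
# observable, so by Theorem 4(d) the game is hard.
assert residual > 1e-6
```

Every row of S_{(1,N)} has equal first two entries, while d = (2, 1, −2) does not, which is why the residual stays bounded away from zero.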

Figure 2: The cell decomposition of the discretized dynamic pricing game with 3 actions, drawn in the probability simplex with vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1). If the opponent strategy is p∗, then action 2 is the optimal action.

4. Balaton: An algorithm for easy games

In this section we present our algorithm that achieves O(√T) expected regret for easy games (case (c) of Theorem 4). The input of the algorithm is the loss matrix L, the feedback matrix H, the time horizon T and an error probability δ, to be chosen later. Before describing the algorithm, we introduce some notation. We define a graph G associated with the game G as follows. Let the vertex set be the set of cells of the cell decomposition C of the probability simplex, where cells C_i, C_j ∈ C share the same vertex when C_i = C_j. The graph has an edge between vertices whose corresponding cells are neighbors. This graph is connected, since the probability simplex is convex and the cell decomposition covers the simplex.

Recall that for neighboring cells C_i, C_j, the signal matrix S_{(i,j)} is defined as the signal matrix of the neighborhood action set A_{i,j} of cells i, j. Assuming that the game satisfies the condition of case (c) of Theorem 4, we have that for all neighboring cells C_i and C_j, ℓ_i − ℓ_j ∈ Im S_{(i,j)}ᵀ. This means that there exists a coefficient vector v_{(i,j)} such that ℓ_i − ℓ_j = S_{(i,j)}ᵀ v_{(i,j)}. We define the k-th segment of v_{(i,j)}, denoted by v_{(i,j),k}, as the vector of components of v_{(i,j)} that correspond to the k-th action in the neighborhood action set. That is, if S_{(i,j)}ᵀ = (S_1ᵀ ⋯ S_rᵀ), then ℓ_i − ℓ_j = S_{(i,j)}ᵀ v_{(i,j)} = ∑_{s=1}^r S_sᵀ v_{(i,j),s}, where S_1, …, S_r are the signal matrices of the individual actions in A_{i,j}.
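Concretely, v_{(i,j)} can be obtained by solving the linear system S_{(i,j)}ᵀ v = ℓ_i − ℓ_j, e.g. by least squares. The sketch below (with a made-up full-information signal matrix) is our own illustration, not the paper's Initialize routine:

```python
import numpy as np

def coefficient_vector(S_ij, ell_diff, tol=1e-9):
    """Solve S_ij^T v = ell_i - ell_j for v; an exact solution exists
    iff the pair (i, j) is locally observable."""
    v = np.linalg.lstsq(S_ij.T, ell_diff, rcond=None)[0]
    if np.linalg.norm(S_ij.T @ v - ell_diff) > tol:
        raise ValueError("pair (i, j) is not locally observable")
    return v

# Toy full-information case: the stacked signal matrix is the identity,
# so the coefficient vector equals the loss difference itself.
ell_diff = np.array([0.2, -0.5, 0.3])
v = coefficient_vector(np.eye(3), ell_diff)
assert np.allclose(v, ell_diff)
```

The segments v_{(i,j),k} are then simply the slices of v corresponding to each action's rows in the stacked matrix.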

Let J_t ∈ {1, …, M} denote the outcome at time step t. For 1 ≤ k ≤ M, let e_k ∈ ℝ^M be the k-th unit vector. For an action i, let O_i(t) = S_i e_{J_t} be the observation vector of action i at time step t. If the rows of the signal matrix S_i correspond to symbols σ_1, …, σ_{s_i} and action i is chosen at time step t, then the unit vector O_i(t) indicates which symbol was observed in that time step. Thus, O_{I_t}(t) holds the same information as the feedback at time t (recall that I_t is the action chosen by the learner at time step t). From now on, for simplicity, we will assume that the feedback at time step t is the observation vector O_{I_t}(t) itself.

The main idea of the algorithm is to successively eliminate actions in an efficient, yet safe manner. When all remaining strongly Pareto-optimal actions share the same cell, the elimination phase finishes and, from this point on, one of the remaining actions is played. During the elimination phase, the algorithm works in rounds. In each round, each 'alive' Pareto-optimal action is played once. The resulting observations are used to estimate the loss differences between the alive actions. If some estimate becomes sufficiently precise, the action of the pair deemed to be suboptimal is eliminated (possibly together with other

Algorithm 1 Balaton
Input: L, H, T, δ
Initialization:
  [G, C, {v_{(i,j),k}}, {path_{(i,j)}}, {(LB_{(i,j)}, UB_{(i,j)}, σ_{(i,j)}, R_{(i,j)})}] ← Initialize(L, H)
  t ← 0, n ← 0
  aliveActions ← {1 ≤ i ≤ N : C_i ∩ interior(∆_M) ≠ ∅}
while |V_G| > 1 and t < T do   {main loop}
  n ← n + 1
  for each i ∈ aliveActions do
    O_i ← ExecuteAction(i)
    t ← t + 1
  end for
  for each edge (i, j) in G do
    µ_{(i,j)} ← ∑_{k∈A_{i,j}} O_kᵀ v_{(i,j),k}
  end for
  for each non-adjacent vertex pair (i, j) in G do
    µ_{(i,j)} ← ∑_{(k,l)∈path_{(i,j)}} µ_{(k,l)}
  end for
  haveEliminated ← false
  for each vertex pair (i, j) in G do
    µ̂_{(i,j)} ← (1 − 1/n) µ̂_{(i,j)} + (1/n) µ_{(i,j)}
    if BStopStep(µ̂_{(i,j)}, LB_{(i,j)}, UB_{(i,j)}, σ_{(i,j)}, R_{(i,j)}, n, 1/2, δ) then
      [aliveActions, C, G] ← eliminate(i, j, sgn(µ̂_{(i,j)}))
      haveEliminated ← true
    end if
  end for
  if haveEliminated then
    {path_{(i,j)}} ← regeneratePaths(G)
  end if
end while
Let i be a strongly Pareto-optimal action in aliveActions
while t < T do
  ExecuteAction(i)
  t ← t + 1
end while

actions). To determine whether an estimate is sufficiently precise, we use an appropriate stopping rule. A small regret is achieved by tuning the error probability of the stopping rule appropriately.

The details of the algorithm are as follows. In the preprocessing phase, the algorithm constructs the neighborhood graph, the signal matrices S_{(i,j)} assigned to the edges of the graph, the coefficient vectors v_{(i,j)}, and their segment vectors v_{(i,j),k}. In addition, it constructs a path in the graph connecting every pair of nodes, and initializes some variables used by the stopping rule.

In the elimination phase, the algorithm runs a loop. In each round of the loop, the algorithm chooses each of the alive actions once and, based on the observations, updates the estimates µ̂_{(i,j)} of the loss differences (ℓ_i − ℓ_j)ᵀp∗, where p∗ is the actual opponent strategy. The algorithm maintains the set C of cells of alive actions and their neighborship graph G.

The estimates are calculated as follows. First we calculate estimates for neighboring actions (i, j). In round⁶ n, for every action k in A_{i,j}, let O_k be the observation vector of action k. Let µ_{(i,j)} = ∑_{k∈A_{i,j}} O_kᵀ v_{(i,j),k}. From the local observability condition and the construction of v_{(i,j),k}, it follows by simple algebra that the µ_{(i,j)} are unbiased estimates of (ℓ_i − ℓ_j)ᵀp∗ (see Lemma 5). For non-neighboring action pairs, we use telescoping sums: since the graph G (induced by the alive actions) stays connected, we can take a path i = i_0, i_1, …, i_r = j in the graph, and the estimate µ_{(i,j)}(n) will be the sum of the estimates along the path: ∑_{l=1}^r µ_{(i_{l−1},i_l)}. The estimate of the difference of the expected losses after round n will be the average µ̂_{(i,j)} = (1/n) ∑_{s=1}^n µ_{(i,j)}(s), where µ_{(i,j)}(s) denotes the estimate for the pair (i, j) computed in round s.

After updating the estimates , the algorithm decides which actions to eliminate . For each pair of vertices i , j of the graph , the expected difference of their loss is tested for its sign by the BStopStep subroutine , based on the estimate ˆ µ ( i , j ) and its relative error . This subroutine uses a stopping rule based on Bernstein’s inequality .

The subroutine's pseudocode is shown as Algorithm 2; it is essentially based on the work of Mnih et al. (2008). The algorithm maintains two values, LB and UB, computed from the supplied sequence of sample means (µ̂) and the deviation bounds

c(σ, R, n, δ) = σ √(2 L(δ, n) / n) + R L(δ, n) / (3n),  where  L(δ, n) = log( (3p/(p − 1)) · n^p / δ ).   (1)

Here p > 1 is an arbitrarily chosen parameter of the algorithm, σ is a (deterministic) upper bound on the (conditional) variance of the random variables whose common mean µ we wish to estimate, while R is a (deterministic) upper bound on their range. This is a general stopping rule method, which stops when it has produced an ε-relative accurate estimate of the unknown mean. The algorithm is guaranteed to be correct outside of a failure event whose probability is bounded by δ.
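A sketch of this stopping rule in code (our own function names; p = 1.1 is one arbitrary admissible choice, and the stop test is stated in the "(1 + ε)LB ≥ (1 − ε)UB" form of the Mnih et al. stopping condition):

```python
import math

def L_fn(n, delta, p=1.1):
    """L(delta, n) = log(3 * p/(p-1) * n^p / delta), cf. Eq. (1)."""
    return math.log(3.0 * p / (p - 1.0) * n**p / delta)

def c_fn(sigma, R, n, delta, p=1.1):
    """Deviation bound c(sigma, R, n, delta) of Eq. (1)."""
    Ln = L_fn(n, delta, p)
    return sigma * math.sqrt(2.0 * Ln / n) + R * Ln / (3.0 * n)

def bstop_step(mu_hat, state, sigma, R, n, eps, delta):
    """One BStopStep call; state = [LB, UB] is updated in place.

    Returns True when an eps-relative accurate estimate of |mu| is
    guaranteed (outside the failure event), i.e. when it is safe to stop."""
    c = c_fn(sigma, R, n, delta)
    state[0] = max(state[0], abs(mu_hat) - c)   # LB
    state[1] = min(state[1], abs(mu_hat) + c)   # UB
    return (1 + eps) * state[0] >= (1 - eps) * state[1]

# Toy usage: a fixed sample mean of 1.0 with small variance bound; with
# eps = 1/2 the rule fires once the radius c drops below |mu|/2.
state = [0.0, float("inf")]
stopped_at = None
for n in range(1, 1001):
    if bstop_step(1.0, state, sigma=0.1, R=1.0, n=n, eps=0.5, delta=0.05):
        stopped_at = n
        break
```

In the toy run the rule stops after a handful of rounds, with the interval [LB, UB] still bracketing the true magnitude.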

Algorithm Balaton calls this method with ε = 1/2. As a result, when BStopStep returns true, outside of the failure event the sign of the estimate µ̂ supplied to Balaton matches the sign of the mean to be estimated. The conditions under which the algorithm indeed produces ε-accurate estimates (with high probability) are given in Lemma 11 (see Appendix), which also states that, with high probability, the time when the algorithm stops is bounded by

C · max( σ²/(ε²µ²), R/(ε|µ|) ) · ( log(1/δ) + log(R/|µ|) ),

where µ ≠ 0 is the true mean. Note that the choice of p in (1) influences only C.

If BStopStep returns true for an estimate µ̂_{(i,j)}, the function eliminate is called. If, say, µ̂_{(i,j)} > 0, this function takes the closed half space {q ∈ ∆_M : (ℓ_i − ℓ_j)ᵀq ≤ 0} and eliminates all actions whose cell lies completely in this half space. The function also drops from the graph the vertices that correspond to eliminated cells. The elimination necessarily

6. Note that a round of the algorithm is not the same as the time step t. In a round, the algorithm chooses each of the alive actions once.

Algorithm 2 Algorithm BStopStep. Note that, somewhat unusually at least in pseudocode, the arguments LB, UB are passed by reference, i.e., the algorithm rewrites the values of these arguments (which are thus returned to the caller).
Input: µ̂, LB, UB, σ, R, n, ε, δ
LB ← max(LB, |µ̂| − c(δ, σ, R, n))
UB ← min(UB, |µ̂| + c(δ, σ, R, n))
return (1 + ε) LB ≥ (1 − ε) UB

concerns all actions with corresponding cell C_i, and possibly other actions as well. The remaining cells are redefined by taking their intersection with the complementary half space {q ∈ ∆_M : (ℓ_i − ℓ_j)ᵀq ≥ 0}.

By construction , after the elimination phase , the remaining graph is still connected , but some paths used in the round may have lost vertices or edges . For this reason , in the last phase of the round , new paths are constructed for vertex pairs with broken paths .

The main loop of the algorithm continues until either one vertex remains in the graph or the time horizon T is reached . In the former case , one of the actions corresponding to that vertex is chosen until the time horizon is reached .

5. Analysis of the algorithm

In this section we prove that the algorithm described in the previous section achieves O(√T) expected regret.

Let us assume that the outcomes are generated following the probability vector p∗ ∈ ∆_M. Let j∗ denote an optimal action, that is, for every 1 ≤ i ≤ N, ℓ_{j∗}ᵀp∗ ≤ ℓ_iᵀp∗. For every pair of actions i, j, let α_{i,j} = (ℓ_i − ℓ_j)ᵀp∗ be the expected difference of their instantaneous losses. The expected regret of the algorithm can be rewritten as

E[ ∑_{t=1}^T ℓ_{I_t,J_t} ] − min_{1≤i≤N} E[ ∑_{t=1}^T ℓ_{i,J_t} ] = ∑_{i=1}^N E[τ_i] α_{i,j∗},   (2)

where τ_i is the number of times action i is chosen by the algorithm.

Throughout the proof, the value that Balaton assigns to a variable x in round n will be denoted by x(n). Further, for 1 ≤ k ≤ N, we introduce the i.i.d. random sequence (J_k(n))_{n≥1}, taking values in {1, …, M}, with common multinomial distribution satisfying P[J_k(n) = j] = p∗_j. Clearly, a statistically equivalent model to the one where (J_t) is an i.i.d. sequence with multinomial distribution p∗ is when (J_t) is defined through

J_t = J_{I_t}( ∑_{s=1}^t 𝕀(I_s = I_t) ).   (3)

Note that this claim holds independently of the algorithm generating the actions I_t. Therefore, in what follows, we assume that the outcome sequence is generated through (3). As we will see, this construction significantly simplifies subsequent steps of the proof. In particular, the construction is very convenient because, if action k is selected by our algorithm in the n-th elimination round, then the outcome obtained in response is O_k(n) = S_k u_k(n), where u_k(n) = e_{J_k(n)}. (This holds because in the elimination rounds all alive actions are tried exactly once by Balaton.)
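The coupling (3) can be sketched as follows; the uniform stand-in policy and all variable names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, T = 3, 4, 200
p_star = np.array([0.5, 0.3, 0.2])

# Pre-draw, for each action k, a private i.i.d. outcome sequence
# J_k(1), J_k(2), ... with common distribution p_star.
J_table = rng.choice(M, size=(N, T), p=p_star)

counts = np.zeros(N, dtype=int)     # how many times each action was played
outcomes = []
for t in range(T):
    I_t = int(rng.integers(N))      # stand-in for any algorithm's choice
    counts[I_t] += 1
    # Eq. (3): the t-th outcome is the next unused draw of the chosen
    # action's private sequence.
    outcomes.append(int(J_table[I_t, counts[I_t] - 1]))
```

Since each table entry is fresh and independent of the past, the resulting (J_t) sequence is distributed exactly as the original i.i.d. model, whatever rule generates the I_t.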

Let (F_n)_n be the filtration defined as F_n = σ(u_k(m); 1 ≤ k ≤ N, 1 ≤ m ≤ n). We also introduce the notations E_n[·] = E[·|F_n] and Var_n(·) = Var(·|F_n) for the conditional expectation and conditional variance operators corresponding to F_n. Note that F_n contains the information known to Balaton (and more) at the end of elimination round n. Our first (trivial) observation is that µ_{(i,j)}(n), the estimate of α_{i,j} obtained in round n, is F_n-measurable. The next lemma establishes that, furthermore, µ_{(i,j)}(n) is an unbiased estimate of α_{i,j}:

Lemma 5 For any n ≥ 1 and i, j such that C_i, C_j ∈ C, we have E_{n−1}[µ_{(i,j)}(n)] = α_{i,j}.

Proof Consider first the case when actions i and j are neighbors. In this case,

µ_{(i,j)}(n) = ∑_{k∈A_{i,j}} O_k(n)ᵀ v_{(i,j),k} = ∑_{k∈A_{i,j}} (S_k u_k(n))ᵀ v_{(i,j),k} = ∑_{k∈A_{i,j}} u_k(n)ᵀ S_kᵀ v_{(i,j),k},

and thus

E_{n−1}[µ_{(i,j)}(n)] = ∑_{k∈A_{i,j}} E_{n−1}[u_k(n)]ᵀ S_kᵀ v_{(i,j),k} = p∗ᵀ ∑_{k∈A_{i,j}} S_kᵀ v_{(i,j),k} = p∗ᵀ S_{(i,j)}ᵀ v_{(i,j)} = p∗ᵀ(ℓ_i − ℓ_j) = α_{i,j}.

For non-adjacent i and j, we have a telescoping sum:

E_{n−1}[µ_{(i,j)}(n)] = ∑_{k=1}^r E_{n−1}[µ_{(i_{k−1},i_k)}(n)] = p∗ᵀ( (ℓ_{i_0} − ℓ_{i_1}) + (ℓ_{i_1} − ℓ_{i_2}) + ⋯ + (ℓ_{i_{r−1}} − ℓ_{i_r}) ) = α_{i,j},

where i = i_0, i_1, …, i_r = j is the path the algorithm uses in round n, known at the end of round n − 1.

Lemma 6 The conditional variance of $\mu_{(i,j)}(n)$, $\mathrm{Var}_{n-1}(\mu_{(i,j)}(n))$, is upper bounded by $V = 2 \sum_{\{i,j\}\ \mathrm{neighbors}} \|v_{(i,j)}\|_2^2$.

Proof For neighboring cells $C_i, C_j$, we write
$$\mu_{(i,j)}(n) = \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k},$$
and thus
\begin{align*}
\mathrm{Var}_{n-1}\big(\mu_{(i,j)}(n)\big) &= \mathrm{Var}_{n-1}\Big( \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} \Big) \\
&= \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\Big[ v_{(i,j),k}^\top \big( O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \big) \big( O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \big)^\top v_{(i,j),k} \Big] \\
&\le \sum_{k \in A_{i,j}} \|v_{(i,j),k}\|_2^2\, \mathbb{E}_{n-1}\Big[ \big\| O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \big\|_2^2 \Big] \\
&\le \sum_{k \in A_{i,j}} \|v_{(i,j),k}\|_2^2 = \|v_{(i,j)}\|_2^2, \qquad (4)
\end{align*}
where the second equality holds because, given $\mathcal{F}_{n-1}$, the observations $O_k(n)$, $k \in A_{i,j}$, are independent (each play in round $n$ receives its own independent outcome), and in (4) we used that $O_k(n)$ is a unit vector and $\mathbb{E}_{n-1}[O_k(n)]$ is a probability vector.

For non-neighboring cells $C_i, C_j$, let $i = i_0, i_1, \ldots, i_r = j$ be the path used for the estimate in round $n$. Then $\mu_{(i,j)}(n)$ can be written as
$$\mu_{(i,j)}(n) = \sum_{s=1}^{r} \mu_{(i_{s-1},i_s)}(n) = \sum_{s=1}^{r} \sum_{k \in A_{i_{s-1},i_s}} O_k(n)^\top v_{(i_{s-1},i_s),k}.$$
It is not hard to see that an action can be contained in at most two of the neighborhood action sets along the path, and so the double sum can be rearranged as
$$\mu_{(i,j)}(n) = \sum_{k \in \bigcup_s A_{i_{s-1},i_s}} O_k(n)^\top \big( v_{(i_{s_k-1},\, i_{s_k}),k} + v_{(i_{s_k},\, i_{s_k+1}),k} \big),$$
where $s_k$ indexes a pair along the path whose neighborhood action set contains $k$ (one of the two terms may be absent), and thus
$$\mathrm{Var}_{n-1}\big(\mu_{(i,j)}(n)\big) \le 2 \sum_{s=1}^{r} \|v_{(i_{s-1},i_s)}\|_2^2 \le 2 \sum_{\{i,j\}\ \mathrm{neighbors}} \|v_{(i,j)}\|_2^2 = V.$$

Lemma 7 The range of the estimates $\mu_{(i,j)}(n)$ is upper bounded by $R = \sum_{\{i,j\}\ \mathrm{neighbors}} \|v_{(i,j)}\|_1$.

Proof The bound trivially follows from the definition of the estimates.

Let $\delta$ be the confidence parameter used in BStopStep. Since, according to Lemmas 5, 6 and 7, $(\mu_{(i,j)}(n))_n$ is a "shifted" martingale difference sequence with conditional mean $\alpha_{i,j}$ and bounded conditional variance and range, we can apply Lemma 11, stated in the Appendix. By the union bound, the probability that any of the confidence bounds fails during the game is at most $N^2\delta$. Thus, with probability at least $1 - N^2\delta$, if BStopStep returns true for a pair $(i,j)$ then $\mathrm{sgn}(\alpha_{i,j}) = \mathrm{sgn}(\mu_{(i,j)})$ and the algorithm eliminates all the actions whose cell is contained in the closed half space $H = \{ p : \mathrm{sgn}(\alpha_{i,j})\, (\ell_i - \ell_j)^\top p \le 0 \}$. By definition $\alpha_{i,j} = (\ell_i - \ell_j)^\top p^*$. Thus $p^* \notin H$, and none of the eliminated actions can be optimal under $p^*$.

From Lemma 11 we also see that, with probability at least $1 - N^2\delta$, the number of times $\tau_i^*$ the algorithm experiments with a suboptimal action $i$ during the elimination phase is bounded by
$$\tau_i^* \le \frac{c(G)}{\alpha_{i,j^*}^2} \log\frac{R}{\delta\,\alpha_{i,j^*}} = T_i, \qquad (5)$$
where $c(G) = C(V + R)$ is a problem-dependent constant.

The following lemma, the proof of which can be found in the Appendix, shows that degenerate actions will be eliminated in time.

Lemma 8 Let action $i$ be a degenerate action. Let $A_i = \{ j : C_j \in \mathcal{C},\ C_i \subset C_j \}$. The following two statements hold:

1. If any of the actions in $A_i$ is eliminated, then action $i$ is eliminated as well.

2. There exists an action $k_i \in A_i$ such that $\alpha_{k_i, j^*} \ge \alpha_{i, j^*}$.

An immediate implication of the first claim of the lemma is that if action $k_i$ gets eliminated then action $i$ gets eliminated as well; that is, the number of times action $i$ is chosen cannot be greater than that of action $k_i$. Hence, $\tau_i^* \le \tau_{k_i}^*$. Let $\mathcal{E}$ be the complement of the failure event underlying the stopping rules. As discussed earlier, $\mathbb{P}(\mathcal{E}^c) \le N^2\delta$. Note that on $\mathcal{E}$, i.e., when the stopping rules do not fail, no suboptimal action can remain for the final phase. Hence, $\tau_i \mathbb{I}(\mathcal{E}) \le \tau_i^* \mathbb{I}(\mathcal{E})$, where $\tau_i$ is the number of times action $i$ is chosen by the algorithm. To upper bound the expected regret we continue from (2) as follows:

\begin{align*}
\sum_{i=1}^{N} \mathbb{E}[\tau_i]\,\alpha_{i,j^*}
&\le \sum_{i=1}^{N} \mathbb{E}\big[\mathbb{I}(\mathcal{E})\tau_i\big]\,\alpha_{i,j^*} + \mathbb{P}(\mathcal{E}^c)\,T && \text{(since $\textstyle\sum_{i=1}^{N} \tau_i = T$ and $0 \le \alpha_{i,j^*} \le 1$)} \\
&\le \sum_{i=1}^{N} \mathbb{E}\big[\mathbb{I}(\mathcal{E})\tau_i^*\big]\,\alpha_{i,j^*} + N^2\delta T \\
&= \sum_{i :\, C_i \in \mathcal{C}} \mathbb{E}\big[\mathbb{I}(\mathcal{E})\tau_i^*\big]\,\alpha_{i,j^*} + \sum_{i :\, C_i \notin \mathcal{C}} \mathbb{E}\big[\mathbb{I}(\mathcal{E})\tau_i^*\big]\,\alpha_{i,j^*} + N^2\delta T \\
&\le \sum_{i :\, C_i \in \mathcal{C}} \mathbb{E}\big[\mathbb{I}(\mathcal{E})\tau_i^*\big]\,\alpha_{i,j^*} + \sum_{i :\, C_i \notin \mathcal{C}} \mathbb{E}\big[\mathbb{I}(\mathcal{E})\tau_{k_i}^*\big]\,\alpha_{k_i,j^*} + N^2\delta T && \text{(by Lemma 8)} \\
&\le \sum_{i :\, C_i \in \mathcal{C}} T_i\,\alpha_{i,j^*} + \sum_{i :\, C_i \notin \mathcal{C}} T_{k_i}\,\alpha_{k_i,j^*} + N^2\delta T \\
&\le \sum_{\substack{i :\, C_i \in \mathcal{C} \\ \alpha_{i,j^*} \ge \alpha_0}} T_i\,\alpha_{i,j^*} + \sum_{\substack{i :\, C_i \notin \mathcal{C} \\ \alpha_{k_i,j^*} \ge \alpha_0}} T_{k_i}\,\alpha_{k_i,j^*} + \big(\alpha_0 + N^2\delta\big)T \\
&\le c(G)\Bigg( \sum_{\substack{i :\, C_i \in \mathcal{C} \\ \alpha_{i,j^*} \ge \alpha_0}} \frac{\log\frac{R}{\delta\alpha_{i,j^*}}}{\alpha_{i,j^*}} + \sum_{\substack{i :\, C_i \notin \mathcal{C} \\ \alpha_{k_i,j^*} \ge \alpha_0}} \frac{\log\frac{R}{\delta\alpha_{k_i,j^*}}}{\alpha_{k_i,j^*}} \Bigg) + \big(\alpha_0 + N^2\delta\big)T \\
&\le c(G)\,\frac{N \log\frac{R}{\delta\alpha_0}}{\alpha_0} + \big(\alpha_0 + N^2\delta\big)T.
\end{align*}

The above calculation holds for any value of $\alpha_0 > 0$. Setting
$$\alpha_0 = \sqrt{\frac{c(G)N}{T}} \qquad \text{and} \qquad \delta = \sqrt{\frac{c(G)}{TN^3}},$$
we get
$$\mathbb{E}[R_T] \le \sqrt{c(G)NT}\, \log\frac{RTN^2}{c(G)}.$$

In conclusion, if we run Balaton with parameter $\delta = \sqrt{c(G)/(TN^3)}$, the algorithm suffers regret of $O(\sqrt{T})$, finishing the proof.
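As a quick sanity check on the parameter tuning above, the snippet below (with toy constants assumed purely for illustration) verifies that the choices $\alpha_0 = \sqrt{c(G)N/T}$ and $\delta = \sqrt{c(G)/(TN^3)}$ balance the three contributions of the bound at the same $\sqrt{c(G)NT}$ scale (logarithmic factors dropped):

```python
import math

# Toy constants, assumed for illustration only.
c, N, T = 2.0, 8, 10**6

alpha0 = math.sqrt(c * N / T)          # gap threshold
delta = math.sqrt(c / (T * N**3))      # confidence parameter

term_gap = c * N / alpha0              # ~ c(G) N log(.) / alpha0, log dropped
term_threshold = alpha0 * T            # contribution of small-gap actions
term_confidence = N**2 * delta * T     # contribution of the failure event

base = math.sqrt(c * N * T)
print(term_gap / base, term_threshold / base, term_confidence / base)
# all three ratios equal 1: the terms balance at the sqrt(c N T) scale
```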

6. A lower bound for hard games

In this section we prove that for any game satisfying the condition of Case (d) of Theorem 4, the minimax regret is $\Omega(T^{2/3})$.

Theorem 9 Let $G = (L, H)$ be an $N$ by $M$ partial-monitoring game. Assume that there exist two neighboring actions $i$ and $j$ such that $\ell_i - \ell_j \notin \operatorname{Im} S_{(i,j)}^\top$. Then there exists a problem-dependent constant $c(G)$ such that for any algorithm $\mathcal{A}$ and time horizon $T$ there exists an opponent strategy $p$ such that the expected regret satisfies
$$\mathbb{E}[R_T(\mathcal{A}, p)] \ge c(G)\, T^{2/3}.$$

Proof Without loss of generality we can assume that the two neighboring cells in the condition are $C_1$ and $C_2$. Let $C_3 = C_1 \cap C_2$. For $i = 1, 2, 3$, let $\mathcal{A}_i$ be the set of actions associated with cell $C_i$. Note that $\mathcal{A}_3$ may be the empty set. Let $\mathcal{A}_4 = \mathcal{A} \setminus (\mathcal{A}_1 \cup \mathcal{A}_2 \cup \mathcal{A}_3)$. By our convention for naming loss vectors, $\ell_1$ and $\ell_2$ are the loss vectors of $C_1$ and $C_2$, respectively. Let $L_3$ collect the loss vectors of actions which lie on the open segment connecting $\ell_1$ and $\ell_2$. It is easy to see that $L_3$ is the set of loss vectors that correspond to the cell $C_3$. We define $L_4$ as the set of all the other loss vectors. For $i = 1, 2, 3, 4$, let $k_i = |\mathcal{A}_i|$.

Let $S = S_{(1,2)}$ be the signal matrix of the neighborhood action set of $C_1$ and $C_2$. It follows from the assumption of the theorem that $\ell_2 - \ell_1 \notin \operatorname{Im} S^\top$. Thus, $\{ \rho(\ell_2 - \ell_1) : \rho \in \mathbb{R} \} \not\subseteq \operatorname{Im} S^\top$, or equivalently, $(\ell_2 - \ell_1)^\perp \not\supseteq \operatorname{Ker} S$, where we used that $(\operatorname{Im} M^\top)^\perp = \operatorname{Ker} M$. Thus, there exists a vector $v$ such that $v \in \operatorname{Ker} S$ and $(\ell_2 - \ell_1)^\top v \ne 0$. By scaling we can assume that $(\ell_2 - \ell_1)^\top v = 1$. Note that since $v \in \operatorname{Ker} S$ and the row space of $S$ contains the vector $(1, 1, \ldots, 1)$, the coordinates of $v$ sum up to zero.

Let $p_0$ be an arbitrary probability vector in the relative interior of $C_3$. It is easy to see that for any sufficiently small $\varepsilon > 0$, $p_1 = p_0 + \varepsilon v \in C_1 \setminus C_2$ and $p_2 = p_0 - \varepsilon v \in C_2 \setminus C_1$.
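This construction can be illustrated numerically. The snippet below uses a hypothetical toy signal matrix and loss vectors (not taken from the paper): it finds $v \in \operatorname{Ker} S$ with $(\ell_2 - \ell_1)^\top v = 1$, and checks that the two perturbed distributions produce identical feedback through $S$ while their expected losses differ:

```python
import numpy as np

# Hypothetical toy data (not from the paper): a signal matrix whose row
# space contains (1,1,1,1), and loss vectors with ell2 - ell1 not in Im S^T.
S = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])
ell1 = np.array([0., 1., 1., 0.])
ell2 = np.array([1., 0., 0., 1.])

# Basis of Ker S from the SVD; project ell2 - ell1 onto the kernel to get
# a v with (ell2 - ell1)^T v != 0, then rescale so that product is 1.
_, _, Vt = np.linalg.svd(S)
kernel = Vt[np.linalg.matrix_rank(S):]      # rows span Ker S
v = (kernel @ (ell2 - ell1)) @ kernel
v = v / ((ell2 - ell1) @ v)

assert np.allclose(S @ v, 0)                # v is invisible through S
assert abs(v.sum()) < 1e-12                 # coordinates sum to zero

p0 = np.full(4, 0.25)                       # interior "base" distribution
eps = 0.05
p1, p2 = p0 + eps * v, p0 - eps * v
assert np.allclose(S @ p1, S @ p2)          # identical feedback through S
assert np.isclose((ell2 - ell1) @ (p1 - p2), 2 * eps)  # but losses differ
print(v)
```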

Let us fix a deterministic algorithm $\mathcal{A}$ and a time horizon $T$. For $i = 1, 2$, let $R_T^{(i)}$ denote the expected regret of the algorithm under opponent strategy $p_i$. For $i = 1, 2$ and $j = 1, \ldots, 4$, let $N_j^i$ denote the expected number of times the algorithm chooses an action from $\mathcal{A}_j$, assuming the opponent plays strategy $p_i$.

From the definition of $L_3$ we know that for any $\ell \in L_3$, $\ell - \ell_1 = \eta(\ell_2 - \ell_1)$ and $\ell - \ell_2 = (1 - \eta)(\ell_1 - \ell_2)$ for some $0 < \eta < 1$. Let $\lambda_1 = \min_{\ell \in L_3} \eta$, $\lambda_2 = \min_{\ell \in L_3} (1 - \eta)$, and let $\lambda = \min(\lambda_1, \lambda_2)$ if $L_3 \ne \emptyset$ and $\lambda = 1/2$ otherwise. Finally, let $\beta_i = \min_{\ell \in L_4} (\ell - \ell_i)^\top p_i$ and $\beta = \min(\beta_1, \beta_2)$. Note that $\lambda, \beta > 0$.

As the first step of the proof, we lower bound the expected regrets $R_T^{(1)}$ and $R_T^{(2)}$ in terms of the values $N_j^i$, $\varepsilon$, $\lambda$ and $\beta$:
\begin{align}
R_T^{(1)} &\ge N_2^1 \underbrace{(\ell_2 - \ell_1)^\top p_1}_{\varepsilon} + N_3^1\, \lambda\, (\ell_2 - \ell_1)^\top p_1 + N_4^1 \beta \ge \lambda\big(N_2^1 + N_3^1\big)\varepsilon + N_4^1 \beta, \nonumber\\
R_T^{(2)} &\ge N_1^2 \underbrace{(\ell_1 - \ell_2)^\top p_2}_{\varepsilon} + N_3^2\, \lambda\, (\ell_1 - \ell_2)^\top p_2 + N_4^2 \beta \ge \lambda\big(N_1^2 + N_3^2\big)\varepsilon + N_4^2 \beta. \tag{6}
\end{align}

For the next step, we need the following lemma.

Lemma 10 There exists a (problem-dependent) constant $c$ such that the following inequalities hold:
$$N_1^2 \ge N_1^1 - cT\varepsilon\sqrt{N_4^1}, \qquad N_3^2 \ge N_3^1 - cT\varepsilon\sqrt{N_4^1},$$
$$N_2^1 \ge N_2^2 - cT\varepsilon\sqrt{N_4^2}, \qquad N_3^1 \ge N_3^2 - cT\varepsilon\sqrt{N_4^2}.$$

Proof (Lemma 10) For any $1 \le t \le T$, let $f^t = (f_1, \ldots, f_t) \in \Sigma^t$ be a feedback sequence up to time step $t$. For $i = 1, 2$, let $p_i^*$ be the probability mass function of feedback sequences of length $T - 1$ under opponent strategy $p_i$ and algorithm $\mathcal{A}$. We start by upper bounding the difference between the values $N_k^i$ under the two opponent strategies. For $i \ne j \in \{1, 2\}$ and $k \in \{1, 2, 3\}$,

\begin{align}
N_k^i - N_k^j &= \sum_{f^{T-1}} \big( p_i^*(f^{T-1}) - p_j^*(f^{T-1}) \big) \sum_{t=0}^{T-1} \mathbb{I}\big( \mathcal{A}(f^t) \in \mathcal{A}_k \big) \nonumber\\
&\le \sum_{f^{T-1} :\ p_i^*(f^{T-1}) - p_j^*(f^{T-1}) \ge 0} \big( p_i^*(f^{T-1}) - p_j^*(f^{T-1}) \big) \sum_{t=0}^{T-1} \mathbb{I}\big( \mathcal{A}(f^t) \in \mathcal{A}_k \big) \nonumber\\
&\le T \sum_{f^{T-1} :\ p_i^*(f^{T-1}) - p_j^*(f^{T-1}) \ge 0} \big( p_i^*(f^{T-1}) - p_j^*(f^{T-1}) \big) = \frac{T}{2}\, \big\| p_1^* - p_2^* \big\|_1 \nonumber\\
&\le T \sqrt{ \mathrm{KL}\big( p_1^* \,\big\|\, p_2^* \big) / 2 }, \tag{7}
\end{align}

where $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence and $\|\cdot\|_1$ is the $L_1$-norm. The last inequality follows from Pinsker's inequality (Cover and Thomas, 2006). To upper bound $\mathrm{KL}(p_1^* \| p_2^*)$ we use the chain rule for the KL-divergence. We overload $p_i^*$ so that $p_i^*(f^{t-1})$ denotes the probability of the feedback sequence $f^{t-1}$ under opponent strategy $p_i$ and algorithm $\mathcal{A}$, and $p_i^*(f_t \mid f^{t-1})$ denotes the conditional probability of feedback $f_t \in \Sigma$ given that the past feedback sequence was $f^{t-1}$, again under $p_i$ and $\mathcal{A}$. With this notation we have

\begin{align}
\mathrm{KL}\big( p_1^* \,\big\|\, p_2^* \big) &= \sum_{t=1}^{T-1} \sum_{f^{t-1}} p_1^*(f^{t-1}) \sum_{f_t} p_1^*(f_t \mid f^{t-1}) \log\frac{p_1^*(f_t \mid f^{t-1})}{p_2^*(f_t \mid f^{t-1})} \nonumber\\
&= \sum_{t=1}^{T-1} \sum_{f^{t-1}} p_1^*(f^{t-1}) \sum_{i=1}^{4} \mathbb{I}\big( \mathcal{A}(f^{t-1}) \in \mathcal{A}_i \big) \sum_{f_t} p_1^*(f_t \mid f^{t-1}) \log\frac{p_1^*(f_t \mid f^{t-1})}{p_2^*(f_t \mid f^{t-1})}. \tag{8}
\end{align}

Let $a_{f_t}$ denote the row of $S$ that corresponds to the feedback symbol $f_t$.$^7$ Assume $k = \mathcal{A}(f^{t-1})$. If the feedback set of action $k$ does not contain $f_t$, then trivially $p_i^*(f_t \mid f^{t-1}) = 0$ for $i = 1, 2$. Otherwise $p_i^*(f_t \mid f^{t-1}) = a_{f_t}^\top p_i$. Since $p_1 - p_2 = 2\varepsilon v$ and $v \in \operatorname{Ker} S$, we have $a_{f_t}^\top v = 0$, and thus, if the choice of the algorithm is in $\mathcal{A}_1$, $\mathcal{A}_2$ or $\mathcal{A}_3$, then $p_1^*(f_t \mid f^{t-1}) = p_2^*(f_t \mid f^{t-1})$. It follows that the inequality chain can be continued from (8) by writing

\begin{align}
\mathrm{KL}\big( p_1^* \,\big\|\, p_2^* \big) &\le \sum_{t=1}^{T-1} \sum_{f^{t-1}} p_1^*(f^{t-1})\, \mathbb{I}\big( \mathcal{A}(f^{t-1}) \in \mathcal{A}_4 \big) \sum_{f_t} p_1^*(f_t \mid f^{t-1}) \log\frac{p_1^*(f_t \mid f^{t-1})}{p_2^*(f_t \mid f^{t-1})} \nonumber\\
&\le c_1 \varepsilon^2 \sum_{t=1}^{T-1} \sum_{f^{t-1}} p_1^*(f^{t-1})\, \mathbb{I}\big( \mathcal{A}(f^{t-1}) \in \mathcal{A}_4 \big) \tag{9}\\
&\le c_1 \varepsilon^2 N_4^1. \nonumber
\end{align}

7. Recall that we assumed that different actions have different feedback symbols, and thus the row of $S$ corresponding to a symbol is unique.

In (9) we used Lemma 12 (see Appendix) to upper bound the KL-divergence of $p_1$ and $p_2$. Swapping $p_1^*$ and $p_2^*$ in (7), we get the same result with $N_4^2$. Combining this with the bound in (7), we obtain all the desired inequalities.
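The two analytic facts used in this proof, Pinsker's inequality and the $O(\varepsilon^2)$ growth of the KL divergence between the two perturbed distributions (the role played by Lemma 12), can be checked numerically. The distributions below are arbitrary toy data chosen for illustration:

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """KL divergence between two discrete distributions (q > 0 assumed)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

q0 = np.array([0.3, 0.2, 0.5])        # arbitrary base distribution
u = np.array([1., 1., -2.]) / 3.      # perturbation direction, sums to zero
for eps in (0.1, 0.05, 0.025):
    q1, q2 = q0 + eps * u, q0 - eps * u
    d = kl(q1, q2)
    # Pinsker's inequality: (1/2) * ||q1 - q2||_1 <= sqrt(KL(q1||q2) / 2)
    assert 0.5 * np.abs(q1 - q2).sum() <= np.sqrt(d / 2)
    print(eps, d / eps**2)            # ratio stays bounded: KL = O(eps^2)
```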

Now we can continue lower bounding the expected regret. Let $r = \operatorname{argmin}_{i \in \{1,2\}} N_4^i$. It is easy to see that for $i = 1, 2$ and $j = 1, 2, 3$,
$$N_j^i \ge N_j^r - c_2 T \varepsilon \sqrt{N_4^r}.$$

If $i \ne r$ then this inequality is one of the inequalities from Lemma 10, while if $i = r$ then it trivially holds, since we merely subtract a nonnegative value. From (6) we have

\begin{align*}
R_T^{(i)} &\ge \lambda\big( N_{3-i}^i + N_3^i \big)\varepsilon + N_4^i \beta \ge \lambda\big( N_{3-i}^r - c_2 T \varepsilon \sqrt{N_4^r} + N_3^r - c_2 T \varepsilon \sqrt{N_4^r} \big)\varepsilon + N_4^r \beta \\
&= \lambda\big( N_{3-i}^r + N_3^r - 2 c_2 T \varepsilon \sqrt{N_4^r} \big)\varepsilon + N_4^r \beta.
\end{align*}

Now assume that, at the beginning of the game, the opponent randomly chooses between strategies $p_1$ and $p_2$ with equal probability. Then the expected regret of the algorithm is lower bounded by

\begin{align*}
R_T &= \tfrac{1}{2}\big( R_T^{(1)} + R_T^{(2)} \big) \ge \tfrac{1}{2}\lambda\big( N_1^r + N_2^r + 2 N_3^r - 4 c_2 T \varepsilon \sqrt{N_4^r} \big)\varepsilon + N_4^r \beta \\
&\ge \tfrac{1}{2}\lambda\big( N_1^r + N_2^r + N_3^r - 4 c_2 T \varepsilon \sqrt{N_4^r} \big)\varepsilon + N_4^r \beta \\
&= \tfrac{1}{2}\lambda\big( T - N_4^r - 4 c_2 T \varepsilon \sqrt{N_4^r} \big)\varepsilon + N_4^r \beta,
\end{align*}
where in the last step we used $N_1^r + N_2^r + N_3^r + N_4^r = T$.

Choosing $\varepsilon = c_3 T^{-1/3}$ we get
\begin{align*}
R_T &\ge \tfrac{1}{2}\lambda c_3 T^{2/3} - \tfrac{1}{2}\lambda c_3 N_4^r T^{-1/3} - 2\lambda c_2 c_3^2 T^{1/3} \sqrt{N_4^r} + N_4^r \beta \\
&\ge T^{2/3}\Big( \big(\beta - \tfrac{1}{2}\lambda c_3\big) x^2 - 2\lambda c_2 c_3^2\, x + \tfrac{1}{2}\lambda c_3 \Big),
\end{align*}
where $x = \sqrt{N_4^r}/T^{1/3}$ and we used $T^{-1/3} \le 1$. Now we see that $c_3 > 0$ can be chosen small enough, independently of $T$, so that for any choice of $x$ the quadratic expression in the parentheses is bounded away from zero and, simultaneously, $\varepsilon$ is small enough so that the threshold condition in Lemma 12 is satisfied, completing the proof of Theorem 9.
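The last step, choosing $c_3$ so that the quadratic in $x$ stays positive, can also be sanity-checked numerically. The constants $\lambda$, $\beta$, $c_2$ below are toy values assumed purely for illustration:

```python
import numpy as np

# Toy constants, assumed for illustration only.
lam, beta, c2 = 0.5, 0.4, 1.0
c3 = 0.1   # small enough for these constants

def q(x):
    # q(x) = (beta - lam*c3/2) * x^2 - 2*lam*c2*c3^2 * x + lam*c3/2
    return (beta - lam * c3 / 2) * x**2 - 2 * lam * c2 * c3**2 * x + lam * c3 / 2

xs = np.linspace(0.0, 10.0, 10_001)
print(q(xs).min())   # positive: the T^{2/3} lower bound has a positive constant
```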

7. Discussion

In this paper we classified all finite partial-monitoring games under stochastic environments, based on their minimax regret. We conjecture that our results extend to non-stochastic environments. This is the major open question that remains to be answered.

One question which we did not discuss so far is the computational efficiency of our algorithm. The issue is twofold. The first computational question is how to decide efficiently which of the four classes a given game $(L, H)$ belongs to. The second question is the computational efficiency of Balaton for a fixed easy game. Fortunately, in both cases an efficient implementation is possible, i.e., in polynomial time, by using a linear program solver (e.g., the ellipsoid method (Papadimitriou and Steiglitz, 1998)).
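For the first question, a core subtask is testing whether a loss difference $\ell_i - \ell_j$ lies in $\operatorname{Im} S^\top$ for the relevant signal matrix. A least-squares version of this check, sketched below with a toy matrix, is only a stand-in for the LP formulation mentioned above, not the paper's implementation:

```python
import numpy as np

def loss_diff_realizable(S: np.ndarray, d: np.ndarray, tol: float = 1e-9) -> bool:
    """Return True iff d (standing for ell_i - ell_j) lies in Im S^T, i.e.
    some v with S^T v = d exists.  A least-squares stand-in for the LP test
    mentioned in the text; not the paper's implementation."""
    v, *_ = np.linalg.lstsq(S.T, d, rcond=None)
    return bool(np.linalg.norm(S.T @ v - d) < tol)

# Hypothetical toy signal matrix: its rows span the observable directions.
S = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])
print(loss_diff_realizable(S, np.array([1., 1., 0., 0.])))    # True: easy case
print(loss_diff_realizable(S, np.array([1., -1., -1., 1.])))  # False: hard case
```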

Another interesting open question is to investigate the dependence of the regret on quantities other than $T$, such as the number of actions, the number of outcomes, and, more generally, the structure of the loss and feedback matrices.

Finally, let us note that our results can be extended to a more general framework, similar to that of Pallavi et al. (2011), in which a game with $N$ actions and an $M$-dimensional outcome space is defined as a tuple $G = (L, S_1, \ldots, S_N)$. The loss matrix is $L \in \mathbb{R}^{N \times M}$ as before, but the outcome and the feedback are defined differently. The outcome $y$ is an arbitrary vector from a bounded subset of $\mathbb{R}^M$, and the feedback received by the learner upon choosing action $i$ is $O_i = S_i y$.

References

Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 263–273, 2008.

Alekh Agarwal, Peter Bartlett, and Max Dama. Optimal allocation strategies for the dark pool problem. In 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Chia Laguna Resort, Sardinia, Italy, 2010.

András Antos, Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games, 2011. http://arxiv.org/abs/1102.2041.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (ALT 2010), pages 224–238. Springer, 2010.

Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, June 2005.

Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, New York, second edition, 2006.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2005), page 394. Society for Industrial and Applied Mathematics, 2005.

Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2003), pages 594–605. IEEE, 2003.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

Gábor Lugosi and Nicolò Cesa-Bianchi. Prediction, Learning, and Games. Cambridge University Press, 2006.

V. Mnih. Efficient stopping rules. Master's thesis, Department of Computing Science, University of Alberta, 2008.

V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 672–679. ACM, 2008.

A. Pallavi, R. Zheng, and Cs. Szepesvári. Sequential learning for optimal monitoring of multi-channel wireless networks. In INFOCOM, 2011.

Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Courier Dover Publications, New York, 1998.

Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In Proceedings of the 14th Annual Conference on Computational Learning Theory (COLT 2001), pages 208–223. Springer-Verlag, 2001.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.

Appendix

Proof (Lemma 8)

1. In an elimination step, we eliminate every action whose cell is contained in a closed half space. Let us assume that $j \in A_i$ is being eliminated. According to the definition of $A_i$, $C_i \subset C_j$, and thus $C_i$ is also contained in the half space; hence action $i$ is eliminated as well.

2. First let us assume that $p^*$ is not in the affine subspace spanned by $C_i$. Let $p$ be an arbitrary point in the relative interior of $C_i$. We define the point $p' = p + \varepsilon(p - p^*)$. For a small enough $\varepsilon > 0$, $p' \in C_k$ for some $k \in A_i$, while $p \in C_i \subset C_k$. Thus we have
\begin{align*}
\ell_k^\top\big( p + \varepsilon(p - p^*) \big) &\le \ell_i^\top\big( p + \varepsilon(p - p^*) \big), \\
(1 + \varepsilon)\,\ell_k^\top p - \varepsilon\,\ell_k^\top p^* &\le (1 + \varepsilon)\,\ell_i^\top p - \varepsilon\,\ell_i^\top p^*, \\
-\varepsilon\,\ell_k^\top p^* &\le -\varepsilon\,\ell_i^\top p^*, \\
\ell_k^\top p^* &\ge \ell_i^\top p^*, \\
\alpha_{k,j^*} &\ge \alpha_{i,j^*},
\end{align*}

where we used that $\ell_k^\top p = \ell_i^\top p$ (both $i$ and $k$ are optimal at $p \in C_i \subset C_k$). For the case when $p^*$ lies in the affine subspace spanned by $C_i$, we take a hyperplane that contains this affine subspace. Then we take an infinite sequence $(p_n)_n$ such that every element of the sequence lies on the same side of the hyperplane, $p_n \ne p^*$, and the sequence converges to $p^*$. The statement then holds for every element $p_n$ and, since the value $\alpha_{r,s}$ is continuous as a function of the underlying distribution, the limit has the desired property as well.

The following lemma concerns the problem of producing an estimate of the unknown mean of a stochastic process with a given relative error bound, with high probability, in a sample-efficient manner. The procedure is a simple variation of the one proposed by Mnih et al. (2008). The main difference is that here we deal with martingale difference sequences shifted by an unknown constant, which becomes the common mean, whereas Mnih et al. (2008) considered an i.i.d. sequence. On the other hand, we consider the case when we have a known upper bound on the predictable variance of the process, whereas one of the main contributions of Mnih et al. (2008) was the lifting of this assumption. The proof of the lemma is omitted, as it follows the same lines as the proofs of Mnih et al. (2008) (the details of these proofs can be found in the thesis of Mnih (2008)), the only difference being that here we need to use Bernstein's inequality for martingales in place of the empirical Bernstein inequality used by Mnih et al. (2008).

Lemma 11 Let $(\mathcal{F}_t)$ be a filtration on some probability space, and let $(X_t)$ be an $\mathcal{F}_t$-adapted sequence of random variables. Assume that $(X_t)$ is such that, almost surely, the range of each random variable $X_t$ is bounded by $R > 0$, $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = \mu$, and $\mathrm{Var}[X_t \mid \mathcal{F}_{t-1}] \le \sigma^2$ a.s., where $R$, $\mu \ne 0$ and $\sigma^2$ are non-random constants. Let $p > 1$, $\varepsilon > 0$, $0 < \delta < 1$, and let $\bar{X}_t = \frac{1}{t}\sum_{s=1}^{t} X_s$ denote the running average. Define

$$L_n = (1 + \varepsilon) \max_{1 \le t \le n} \big( |\bar{X}_t| - c_t \big), \qquad U_n = (1 - \varepsilon) \min_{1 \le t \le n} \big( |\bar{X}_t| + c_t \big),$$

where $c_t = c(\sigma, R, t, \delta)$ and $c(\cdot)$ is defined in (1). Define the estimate $\hat{\mu}_n$ of $\mu$ as follows:

$$\hat{\mu}_n = \operatorname{sgn}(\bar{X}_n)\, \frac{L_n + U_n}{2}.$$

Denote the stopping time $\tau = \min\{ n : L_n \ge U_n \}$. Then, with probability at least $1 - \delta$,