JMLR: Workshop and Conference Proceedings vol (2010) 1–21, 24th Annual Conference on Learning Theory
Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments ∗
Gábor Bartók bartok@cs.ualberta.ca
Dávid Pál dpal@cs.ualberta.ca
Csaba Szepesvári szepesva@cs.ualberta.ca
Department of Computing Science, University of Alberta, Edmonton, T6G 2E8, AB, Canada
Editors : Sham Kakade , Ulrike von Luxburg
Abstract
In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight. Assuming that the outcomes are generated in an i.i.d. fashion from an arbitrary and unknown probability distribution, we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes. It turns out that the minimax regret of any such game is either zero, Θ(√T), Θ(T^{2/3}), or Θ(T). We provide a computationally efficient learning algorithm that achieves the minimax regret within a logarithmic factor for any game.
Keywords: Online learning, Imperfect feedback, Regret analysis
1. Introduction
Partial monitoring provides a mathematical framework for sequential decision making problems with imperfect feedback . Various problems of interest can be modeled as partial monitoring instances , such as learning with expert advice ( Littlestone and Warmuth , 1994 ) , the multi-armed bandit problem ( Auer et al . , 2002 ) , dynamic pricing ( Kleinberg and Leighton , 2003 ) , the dark pool problem ( Agarwal et al . , 2010 ) , label efficient prediction ( Cesa-Bianchi et al . , 2005 ) , and linear and convex optimization with full or bandit feedback ( Zinkevich , 2003 ; Abernethy et al . , 2008 ; Flaxman et al . , 2005 ) .
In this paper we restrict ourselves to finite games, i.e., games where both the set of actions available to the learner and the set of possible outcomes generated by the environment are finite. A finite partial monitoring game G is described by a pair of N × M matrices: the loss matrix L and the feedback matrix H. The entries ℓ_{i,j} of L are real numbers lying in, say, the interval [0, 1]. The entries h_{i,j} of H belong to an alphabet Σ on which we do not impose any structure; we only assume that the learner is able to distinguish distinct elements of the alphabet.
The game proceeds in T rounds according to the following protocol. First, G = (L, H) is announced to both players. In each round t = 1, 2, ..., T, the learner chooses an action I_t ∈
∗ This work was supported in part by AICML , AITF ( formerly iCore and AIF ) , NSERC and the PASCAL2 Network of Excellence under EC grant no . 216886 .
© 2010 G. Bartók, D. Pál & C. Szepesvári.
{1, 2, ..., N} and simultaneously, the environment chooses an outcome J_t ∈ {1, 2, ..., M}. Then, the learner receives as feedback the entry h_{I_t,J_t}. The learner incurs the instantaneous loss ℓ_{I_t,J_t}, which is not revealed to him. The feedback can be thought of as masked information about the outcome J_t. In some cases h_{I_t,J_t} might uniquely determine the outcome, in other cases the feedback might give only partial or no information about the outcome. In this paper, we shall assume that J_t is chosen randomly from a fixed multinomial distribution.
The learner is scored according to the loss matrix L. In round t the learner incurs the instantaneous loss ℓ_{I_t,J_t}. The goal of the learner is to keep his total loss ∑_{t=1}^T ℓ_{I_t,J_t} low. Equivalently, the learner's performance can also be measured in terms of his regret, i.e., the total loss of the learner is compared with the loss of the best fixed action in hindsight. The regret is defined as the difference of these two losses.
In general , the regret grows with the number of rounds T . If the regret is sublinear in T , the learner is said to be Hannan consistent , and this means that the learner’s average per-round loss approaches the average per-round loss of the best action in hindsight .
Piccolboni and Schindelhauer (2001) were among the first to study the regret of these games. In fact, they studied the problem without making any probabilistic assumptions about the outcome sequence J_t. They proved that for any finite game (L, H), either the regret can be Ω(T) in the worst case for any algorithm, or there exists an algorithm which has regret O(T^{3/4}) on any outcome sequence.¹ This result was later improved by Cesa-Bianchi et al. (2006), who showed that the algorithm of Piccolboni and Schindelhauer has regret O(T^{2/3}). Furthermore, they provided an example of a finite game, a variant of label-efficient prediction, for which any algorithm has regret Θ(T^{2/3}) in the worst case.
However, for many games O(T^{2/3}) is not optimal. For example, games with full feedback (i.e., when the feedback uniquely determines the outcome) can be viewed as a special instance of the problem of learning with expert advice, and in this case it is known that the "EWA forecaster" has regret O(√T); see, e.g., Lugosi and Cesa-Bianchi (2006, Chapter 3). Similarly, for games with "bandit feedback" (i.e., when the feedback determines the instantaneous loss), the INF algorithm (Audibert and Bubeck, 2009) and the Exp3 algorithm (Auer et al., 2002) achieve O(√T) regret as well.²
This leaves open the problem of determining the minimax regret (i.e., the optimal worst-case regret) of any given game (L, H). Partial progress was made in this direction by Bartók et al. (2010), who characterized (almost) all finite games with M = 2 outcomes. They showed that the minimax regret of any "non-degenerate" finite game with two outcomes falls into one of four categories: zero, Θ(√T), Θ(T^{2/3}) or Θ(T). They gave a combinatoric-geometric condition on the matrices L, H which determines the category a game belongs to. Additionally, they constructed an efficient algorithm which, for any game, achieves the minimax regret rate associated with the game within a poly-logarithmic factor.
In this paper , we consider the same problem , with two exceptions . In pursuing a general result , we will consider all finite games . However , at the same time , we will only deal with stochastic environments , i.e. , when the outcome sequences are generated from a fixed probability distribution in an i.i.d . manner .
1. The notations O(·) and Θ(·) hide polylogarithmic factors.
2. We ignore the dependence of the regret on the number of actions or any other parameters.
The regret against stochastic environments is defined as the difference between the cumulative loss suffered by the algorithm and that of the action with the lowest expected loss . That is , given an algorithm A and a time horizon T , if the outcomes are generated from a probability distribution p , the regret is
$$ R_T(\mathcal{A}, p) \;=\; \sum_{t=1}^{T} \ell_{I_t, J_t} \;-\; \min_{1 \le i \le N} \mathbb{E}_p\left[ \sum_{t=1}^{T} \ell_{i, J_t} \right]. $$
In this paper we analyze the minimax expected regret ( in what follows , minimax regret ) of games , defined as
$$ R_T(G) \;=\; \inf_{\mathcal{A}} \sup_{p \in \Delta_M} \mathbb{E}_p\big[ R_T(\mathcal{A}, p) \big]. $$
We show that the minimax regret of any finite game falls into one of four categories: zero, Θ(√T), Θ(T^{2/3}), or Θ(T). Accordingly, we call the games trivial, easy, hard, and hopeless, respectively. We give a simple and efficiently computable characterization of these classes using a geometric condition on (L, H). We provide lower bounds and algorithms that achieve them within a poly-logarithmic factor. Our result is an extension of the result of Bartók et al. (2010) for stochastic environments.
It is clear that any lower bound which holds for stochastic environments must hold for adversarial environments too . On the other hand , algorithms and regret upper bounds for stochastic environments , of course , do not transfer to algorithms and regret upper bounds for the adversarial case . Our characterization is a stepping stone towards understanding the minimax regret of partial monitoring games . In particular , we conjecture that our characterization holds without any change for unrestricted environments .
2. Preliminaries
In this section, we introduce our conventions, along with some definitions. By default, all vectors are column vectors. We denote by ‖v‖ = √(vᵀv) the Euclidean norm of a vector v. For a vector v, the notation v ≥ 0 means that all entries of v are non-negative, and the notation v > 0 means that all entries are positive. For a matrix A, Im A denotes its image space, i.e., the vector space generated by its columns, and Ker A denotes its kernel, i.e., the set {x : Ax = 0}.
Consider a game G = (L, H) with N actions and M outcomes, that is, L ∈ ℝ^{N×M} and H ∈ Σ^{N×M}. For the sake of simplicity and without loss of generality, we assume that no symbol σ ∈ Σ appears in two different rows of H. The signal matrix of an action is defined as follows:
Definition 1 (Signal matrix) Let {σ_1, ..., σ_{s_i}} be the set of symbols listed in the i-th row of H. (Thus, s_i denotes the number of different symbols in row i of H.) The signal matrix S_i of action i is defined as the s_i × M matrix with entries a_{k,j} = I(h_{i,j} = σ_k) for 1 ≤ k ≤ s_i and 1 ≤ j ≤ M. The signal matrix of a set of actions is defined as the signal matrices of the actions in the set, stacked on top of one another, in the ordering of the actions.
For an example of a signal matrix, see Section 3.1. We identify the strategy of a stochastic opponent with an element of the probability simplex Δ_M = {p ∈ ℝ^M : p ≥ 0, ∑_{j=1}^M p_j = 1}. Note that for any opponent strategy p, if the learner chooses action i then the vector S_i p ∈ ℝ^{s_i} is the probability distribution of the observed feedback: (S_i p)_k is the probability of observing the k-th symbol.
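As an illustration of Definition 1 (our own sketch, not part of the original paper; the function name and the toy symbols are ours), the signal matrix of an action can be built directly from its row of H, and S_i p then gives the distribution of the observed symbol under an opponent strategy p:

```python
# Minimal sketch (ours): build the signal matrix S_i of an action from the
# i-th row of the feedback matrix H, and compute the feedback distribution
# S_i p induced by an opponent strategy p.
import numpy as np

def signal_matrix(H_row):
    """H_row: length-M sequence of feedback symbols of one action."""
    symbols = sorted(set(H_row))              # the s_i distinct symbols of the row
    S = np.zeros((len(symbols), len(H_row)))
    for k, sym in enumerate(symbols):
        for j, h in enumerate(H_row):
            S[k, j] = 1.0 if h == sym else 0.0    # a_{k,j} = I(h_{i,j} = sigma_k)
    return S

# Toy example with M = 3 outcomes and two distinct feedback symbols.
S_i = signal_matrix(['a', 'b', 'b'])
p = np.array([0.2, 0.3, 0.5])                 # an opponent strategy in the simplex
print(S_i @ p)                                # probabilities of observing 'a' and 'b'
```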
We denote by ℓ_i ∈ ℝ^M the i-th row of the loss matrix L, written as a vector, and we call ℓ_i the loss vector of action i. We say that action i is optimal under opponent strategy p ∈ Δ_M if for every 1 ≤ j ≤ N, ℓ_iᵀ p ≤ ℓ_jᵀ p. Action i is said to be Pareto-optimal if there exists an opponent strategy p such that action i is optimal under p. We now define the cell decomposition of Δ_M induced by L (for an example, see Figure 2):
Definition 2 (Cell decomposition) For an action i, the cell C_i associated with i is defined as C_i = {p ∈ Δ_M : action i is optimal under p}. The cell decomposition of Δ_M is defined as the multiset C = {C_i : 1 ≤ i ≤ N, C_i has positive (M − 1)-dimensional volume}.
Actions whose cell is of positive (M − 1)-dimensional volume are called strongly Pareto-optimal. Actions that are Pareto-optimal but not strongly Pareto-optimal are called degenerate. Note that the cells of the actions are defined by linear inequalities and thus they are convex polytopes. It follows that strongly Pareto-optimal actions are exactly the actions whose cells are (M − 1)-dimensional polytopes. It is also important to note that the cell decomposition is a multiset, since some actions can share the same cell. Nevertheless, if two actions have the same cell of dimension (M − 1), their loss vectors are necessarily identical.³
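The cell decomposition can be explored numerically. The following rough Monte Carlo sketch (ours, purely illustrative and not the exact geometric test used in the paper) flags actions whose cells appear to have positive (M − 1)-dimensional volume by sampling opponent strategies uniformly from the simplex; note that it may miss strongly Pareto-optimal actions that share a cell with another action, because ties are broken arbitrarily.

```python
# Heuristic sketch (ours): an action whose cell has positive (M-1)-dimensional
# volume should be the loss minimizer on a positive fraction of opponent
# strategies drawn uniformly at random from the probability simplex.
import numpy as np

def strongly_pareto_optimal_heuristic(L, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    n_actions, n_outcomes = L.shape
    p = rng.dirichlet(np.ones(n_outcomes), size=n_samples)   # uniform on the simplex
    expected_losses = p @ L.T                                 # shape (n_samples, n_actions)
    winners = expected_losses.argmin(axis=1)                  # optimal action per sample
    hits = np.bincount(winners, minlength=n_actions)
    return hits > 0                       # caveat: actions sharing a cell may be missed

L = np.array([[0.0, 1.0, 2.0],            # a small toy loss matrix with M = 3 outcomes
              [2.0, 0.0, 1.0],
              [2.0, 2.0, 0.0]])
print(strongly_pareto_optimal_heuristic(L))   # each of these actions wins somewhere
```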
We call two cells of C neighbors if their intersection is an (M − 2)-dimensional polytope. The actions corresponding to these cells will also be called neighbors. Neighborship is not defined for cells outside of C. For two neighboring cells C_i, C_j ∈ C, we define the neighborhood action set A_{i,j} = {1 ≤ k ≤ N : C_i ∩ C_j ⊆ C_k}. It follows from the definition that actions i and j are in A_{i,j} and thus A_{i,j} is nonempty. However, the neighborhood action set may contain more than two actions.
When discussing lower bounds we will need a definition of algorithms. For us, an algorithm A is a mapping A : Σ* → {1, 2, ..., N} that maps past feedback sequences to actions. That the algorithms are deterministic is assumed for convenience. In particular, the lower bounds we prove can be extended to randomized algorithms by conditioning on the internal randomization of the algorithm. Note that the algorithms we design are themselves deterministic.
3. Classification of finite partial-monitoring games
In this section we present our main result : we state the theorem that classifies all finite stochastic partial-monitoring games based on how their minimax regret scales with the time horizon . Thanks to the previous section , we are now equipped to define a notion which will play a key role in the classification theorem:
3 . One could think that actions with identical loss vectors are redundant and that all but one of such actions could be removed without loss of generality . However , since different actions can lead to different observations and thus yield different information , removing the duplicates can be harmful .
Definition 3 (Observability) Let S denote the signal matrix of the set of all actions in the game. For actions i and j, we say that ℓ_i − ℓ_j is globally observable if ℓ_i − ℓ_j ∈ Im Sᵀ. Furthermore, if i and j are two neighboring actions, then ℓ_i − ℓ_j is called locally observable if ℓ_i − ℓ_j ∈ Im Sᵀ_{(i,j)}, where S_{(i,j)} is the signal matrix of the neighborhood action set A_{i,j}.
As we will see, global observability implies that we can estimate the difference of the expected losses of two actions after choosing each action once. Local observability means that we only need actions from the neighborhood action set to estimate this difference.
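Both observability conditions are plain linear-algebra membership tests, so they can be checked mechanically. Here is a small sketch (ours; the helper names are illustrative and this is not the authors' code) that decides whether ℓ_i − ℓ_j lies in Im Sᵀ, i.e., in the row space of the stacked signal matrices, by comparing matrix ranks:

```python
# Sketch (ours) of the observability tests of Definition 3: l_i - l_j is
# observable with respect to a set of actions iff it lies in the row space
# of their stacked signal matrices (equivalently, in Im S^T).
import numpy as np

def in_row_space(S, d, tol=1e-9):
    """True iff the vector d lies in the row space of S."""
    return np.linalg.matrix_rank(np.vstack([S, d]), tol=tol) == np.linalg.matrix_rank(S, tol=tol)

def globally_observable(L, signal_matrices, i, j):
    S_all = np.vstack(signal_matrices)        # signal matrix of the set of all actions
    return in_row_space(S_all, L[i] - L[j])

def locally_observable(L, signal_matrices, neighborhood_ij, i, j):
    S_ij = np.vstack([signal_matrices[k] for k in neighborhood_ij])
    return in_row_space(S_ij, L[i] - L[j])
```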
The classification theorem , which is our main result , is the following:
Theorem 4 (Classification) Let G = (L, H) be a partial-monitoring game with N actions and M outcomes. Let C = {C_1, ..., C_k} be its cell decomposition, with corresponding loss vectors ℓ_1, ..., ℓ_k. The game G falls into one of the following four categories:
(a) R_T(G) = 0 if there exists an action i with C_i = Δ_M. This case is called trivial.
(b) R_T(G) = Θ(T) if there exist two strongly Pareto-optimal actions i and j such that ℓ_i − ℓ_j is not globally observable. This case is called hopeless.
(c) R_T(G) = Θ(√T) if it is not trivial and for all pairs of (strongly Pareto-optimal) neighboring actions i and j, ℓ_i − ℓ_j is locally observable. These games are called easy.
(d) R_T(G) = Θ(T^{2/3}) if G is not hopeless and there exists a pair of neighboring actions i and j such that ℓ_i − ℓ_j is not locally observable. These games are called hard.
Note that the conditions listed under (a)–(d) are mutually exclusive and cover all finite partial-monitoring games. The only non-obvious implication is that if a game is easy then it cannot be hopeless. The reason this holds is that for any pair of cells C_i, C_j in C, the vector ℓ_i − ℓ_j can be expressed as a telescoping sum of the differences of the loss vectors of neighboring cells.
The remainder of the paper is dedicated to proving Theorem 4. We start with the simple cases. If there exists an action whose cell covers the whole probability simplex, then choosing that action in every round yields zero regret, proving case (a). The condition in case (b) is due to Piccolboni and Schindelhauer (2001), who showed that under this condition there is no algorithm that achieves sublinear regret.⁴ The upper bound for case (d) is achieved by the FeedExp3 algorithm due to Piccolboni and Schindelhauer (2001), for which a regret bound of O(T^{2/3}) was shown by Cesa-Bianchi et al. (2006). The lower bound for case (c) was proved by Antos et al. (2011). For a visualization of previous results, see Figure 1.
The above assertions characterize trivial and hopeless games, and show that if a game is neither trivial nor hopeless then its minimax regret falls between Ω(√T) and O(T^{2/3}). Our contribution in this paper is that we give exact minimax rates (up to logarithmic factors) for these games. To prove the upper bound for case (c), we introduce a new algorithm, which we call Balaton, for "Bandit Algorithm for Loss Annihilation".⁵ This algorithm is presented in Section 4, while its analysis is given in Section 5. The lower bound for case (d) is presented in Section 6.
4. Although Piccolboni and Schindelhauer state their theorem for adversarial environments, their proof applies to stochastic environments without any change (which is important for the lower bound part).
5. Balaton is a lake in Hungary. We thank Gergely Neu for suggesting the name.
Figure 1: Partial monitoring games and their minimax regret as it was known previously. The big rectangle denotes the set of all games. Inside the big rectangle, the games are ordered from left to right based on their minimax regret (the regions are labeled trivial, easy, hard, and hopeless, with full-info games, bandits, l.e.p., and dynamic pricing shown as examples). In the "hard" area, l.e.p. denotes label-efficient prediction. The grey area contains games whose minimax regret is between Ω(√T) and O(T^{2/3}) but whose exact regret rate was unknown. This area is now eliminated, and the dynamic pricing problem is proven to be hard.
3.1. Example
In this section, as a corollary of Theorem 4, we show that the discretized dynamic pricing game (see, e.g., Cesa-Bianchi et al. (2006)) is hard. Dynamic pricing is a game between a vendor (learner) and a customer (environment). In each round, the vendor sets the price at which he wants to sell his product (action), and the customer sets the maximum price he is willing to pay for the product (outcome). If the product is not sold, the vendor suffers some constant loss; otherwise his loss is the difference between the customer's maximum price and his own price. The customer never reveals the maximum price, and thus the vendor's only feedback is whether he sold the product or not.
The discretized version of the game with N actions (and outcomes) is defined by the matrices
$$ L = \begin{pmatrix} 0 & 1 & 2 & \cdots & N-1 \\ c & 0 & 1 & \cdots & N-2 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ c & \cdots & c & 0 & 1 \\ c & \cdots & \cdots & c & 0 \end{pmatrix}, \qquad H = \begin{pmatrix} 1 & 1 & \cdots & \cdots & 1 \\ 0 & 1 & \cdots & \cdots & 1 \\ \vdots & \ddots & \ddots & & \vdots \\ 0 & \cdots & 0 & 1 & 1 \\ 0 & \cdots & \cdots & 0 & 1 \end{pmatrix}, $$
where c is a positive constant (see Figure 2 for the cell decomposition with N = 3). It is easy to see that all the actions are strongly Pareto-optimal. Also, after some linear algebra it turns out that the cells underlying the actions have a single common vertex in the interior of the probability simplex. It follows that any two actions are neighbors. On the other hand, if we take two non-consecutive actions i and i′, then ℓ_i − ℓ_{i′} is not locally observable. For example, the signal matrix of the neighborhood action set of action 1 and action N is
$$ S_{(1,N)} = \begin{pmatrix} 1 & \cdots & 1 & 1 \\ 1 & \cdots & 1 & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}, $$
whereas ℓ_N − ℓ_1 = (c, c − 1, ..., c − N + 2, −N + 1)ᵀ. It is obvious that ℓ_N − ℓ_1 is not in the row space of S_{(1,N)}.
Figure 2: The cell decomposition of the discretized dynamic pricing game with 3 actions, drawn over the probability simplex with vertices (1, 0, 0), (0, 1, 0), and (0, 0, 1). If the opponent strategy is p*, then action 2 is the optimal action.
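The failed local-observability check in this example is easy to reproduce numerically. The sketch below (ours, with illustrative helper names; it only assumes the matrices given above) builds the discretized dynamic pricing game for a given N and c and verifies that ℓ_N − ℓ_1 is not in the row space of S_{(1,N)}:

```python
# Sketch (ours): discretized dynamic pricing and the failed local-observability
# check for the non-consecutive pair of actions (1, N).
import numpy as np

def dynamic_pricing(N, c):
    L = np.empty((N, N))
    H = np.empty((N, N))
    for i in range(N):            # vendor asks price i+1
        for j in range(N):        # customer accepts up to price j+1
            sold = j >= i
            L[i, j] = (j - i) if sold else c
            H[i, j] = 1.0 if sold else 0.0     # feedback: sold / not sold
    return L, H

def signal_matrix(H_row):
    symbols = sorted(set(H_row))
    return np.array([[1.0 if h == s else 0.0 for h in H_row] for s in symbols])

N, c = 5, 2.0
L, H = dynamic_pricing(N, c)
S_1N = np.vstack([signal_matrix(H[0]), signal_matrix(H[N - 1])])
d = L[N - 1] - L[0]               # equals (c, c-1, ..., c-N+2, -N+1)
rank_without = np.linalg.matrix_rank(S_1N)
rank_with = np.linalg.matrix_rank(np.vstack([S_1N, d]))
print(rank_with > rank_without)   # True: l_N - l_1 is not locally observable
```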
4. Balaton: An algorithm for easy games
In this section we present our algorithm that achieves O(√T) expected regret for easy games (case (c) of Theorem 4). The input of the algorithm is the loss matrix L, the feedback matrix H, the time horizon T and an error probability δ, to be chosen later. Before describing the algorithm, we introduce some notation. We define a graph 𝒢 associated with the game G the following way. Let the vertex set be the set of cells of the cell decomposition C of the probability simplex, where cells C_i, C_j ∈ C share the same vertex when C_i = C_j. The graph has an edge between vertices whose corresponding cells are neighbors. This graph is connected, since the probability simplex is convex and the cell decomposition covers the simplex.
Recall that for neighboring cells C_i, C_j, the signal matrix S_{(i,j)} is defined as the signal matrix of the neighborhood action set A_{i,j} of cells i, j. Assuming that the game satisfies the condition of case (c) of Theorem 4, we have that for all neighboring cells C_i and C_j, ℓ_i − ℓ_j ∈ Im Sᵀ_{(i,j)}. This means that there exists a coefficient vector v_{(i,j)} such that ℓ_i − ℓ_j = Sᵀ_{(i,j)} v_{(i,j)}. We define the k-th segment of v_{(i,j)}, denoted by v_{(i,j),k}, as the vector of components of v_{(i,j)} that correspond to the k-th action in the neighborhood action set. That is, if Sᵀ_{(i,j)} = (S_1ᵀ · · · S_rᵀ), then ℓ_i − ℓ_j = Sᵀ_{(i,j)} v_{(i,j)} = ∑_{s=1}^r S_sᵀ v_{(i,j),s}, where S_1, ..., S_r are the signal matrices of the individual actions in A_{i,j}.
Let J_t ∈ {1, ..., M} denote the outcome at time step t. For 1 ≤ k ≤ M, let e_k ∈ ℝ^M be the k-th unit vector. For an action i, let O_i(t) = S_i e_{J_t} be the observation vector of action i at time step t. If the rows of the signal matrix S_i correspond to symbols σ_1, ..., σ_{s_i} and action i is chosen at time step t, then the unit vector O_i(t) indicates which symbol was observed in that time step. Thus, O_{I_t}(t) holds the same information as the feedback at time t (recall that I_t is the action chosen by the learner at time step t). From now on, for simplicity, we will assume that the feedback at time step t is the observation vector O_{I_t}(t) itself.
The main idea of the algorithm is to successively eliminate actions in an efficient , yet safe manner . When all remaining strongly Pareto optimal actions share the same cell , the elimination phase finishes and from this point , one of the remaining actions is played . During the elimination phase , the algorithm works in rounds . In each round each ‘alive’ Pareto optimal action is played once . The resulting observations are used to estimate the loss-difference between the alive actions . If some estimate becomes sufficiently precise , the action of the pair deemed to be suboptimal is eliminated ( possibly together with other
Algorithm 1 Balaton
Input: L, H, T, δ
Initialization:
  [𝒢, C, {v_{(i,j),k}}, {path_{(i,j)}}, {(LB_{(i,j)}, UB_{(i,j)}, σ_{(i,j)}, R_{(i,j)})}] ← Initialize(L, H)
  t ← 0, n ← 0
  aliveActions ← {1 ≤ i ≤ N : C_i ∩ interior(Δ_M) ≠ ∅}
main loop:
  while |V_𝒢| > 1 and t < T do
    n ← n + 1
    for each i ∈ aliveActions do
      O_i ← ExecuteAction(i)
      t ← t + 1
    end for
    for each edge (i, j) in 𝒢: μ_{(i,j)} ← ∑_{k∈A_{i,j}} O_kᵀ v_{(i,j),k} end for
    for each non-adjacent vertex pair (i, j) in 𝒢: μ_{(i,j)} ← ∑_{(k,l)∈path_{(i,j)}} μ_{(k,l)} end for
    haveEliminated ← false
    for each vertex pair (i, j) in 𝒢 do
      μ̂_{(i,j)} ← (1 − 1/n) μ̂_{(i,j)} + (1/n) μ_{(i,j)}
      if BStopStep(μ̂_{(i,j)}, LB_{(i,j)}, UB_{(i,j)}, σ_{(i,j)}, R_{(i,j)}, n, 1/2, δ) then
        [aliveActions, C, 𝒢] ← eliminate(i, j, sgn(μ̂_{(i,j)}))
        haveEliminated ← true
      end if
    end for
    if haveEliminated then
      {path_{(i,j)}} ← regeneratePaths(𝒢)
    end if
  end while
  Let i be a strongly Pareto-optimal action in aliveActions
  while t < T do
    ExecuteAction(i)
    t ← t + 1
  end while
actions ) . To determine if an estimate is sufficiently precise , we will use an appropriate stopping rule . A small regret will be achieved by tuning the error probability of the stopping rule appropriately .
The details of the algorithm are as follows. In the preprocessing phase, the algorithm constructs the neighbourhood graph, the signal matrices S_{(i,j)} assigned to the edges of the graph, the coefficient vectors v_{(i,j)}, and their segment vectors v_{(i,j),k}. In addition, it constructs a path in the graph connecting every pair of nodes, and initializes some variables used by the stopping rule.
In the elimination phase, the algorithm runs a loop. In each round of the loop, the algorithm chooses each of the alive actions once and, based on the observations, the estimates μ̂_{(i,j)} of the loss differences (ℓ_i − ℓ_j)ᵀ p* are updated, where p* is the actual opponent strategy. The algorithm maintains the set C of cells of alive actions and their neighborship graph 𝒢.
The estimates are calculated as follows. First we calculate estimates for neighboring action pairs (i, j). In round⁶ n, for every action k in A_{i,j} let O_k be the observation vector of action k. Let μ_{(i,j)} = ∑_{k∈A_{i,j}} O_kᵀ v_{(i,j),k}. From the local observability condition and the construction of v_{(i,j),k}, it follows by simple algebra that μ_{(i,j)} is an unbiased estimate of (ℓ_i − ℓ_j)ᵀ p* (see Lemma 5). For non-neighboring action pairs, we use telescoping sums: since the graph 𝒢 (induced by the alive actions) stays connected, we can take a path i = i_0, i_1, ..., i_r = j in the graph, and the estimate μ_{(i,j)}(n) will be the sum of the estimates along the path: ∑_{l=1}^r μ_{(i_{l−1},i_l)}. The estimate of the difference of the expected losses after round n will be the average μ̂_{(i,j)} = (1/n) ∑_{s=1}^n μ_{(i,j)}(s), where μ_{(i,j)}(s) denotes the estimate for the pair (i, j) computed in round s.
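As a quick numerical sanity check of this estimator (our own sketch, not the authors' code; the toy game and helper names are ours), one can solve Sᵀ_{(i,j)} v = ℓ_i − ℓ_j by least squares, split v into the per-action segments, and verify that the average of ∑_k O_kᵀ v_{(i,j),k} over random outcomes approaches (ℓ_i − ℓ_j)ᵀ p*:

```python
# Sketch (ours): form the local loss-difference estimator for a neighboring
# pair of actions and check empirically that it is unbiased (cf. Lemma 5).
import numpy as np

rng = np.random.default_rng(1)

# Toy neighborhood action set with two actions and M = 3 outcomes.
L = np.array([[0.0, 1.0, 2.0],
              [2.0, 0.0, 1.0]])
S_list = [np.array([[1.0, 0.0, 0.0],          # signal matrix of the first action
                    [0.0, 1.0, 1.0]]),
          np.array([[1.0, 1.0, 0.0],          # signal matrix of the second action
                    [0.0, 0.0, 1.0]])]

S = np.vstack(S_list)                         # stacked signal matrix S_{(i,j)}
d = L[0] - L[1]                               # l_i - l_j
v, *_ = np.linalg.lstsq(S.T, d, rcond=None)   # S^T v = d (assumes local observability)
segments = np.split(v, np.cumsum([m.shape[0] for m in S_list])[:-1])

p_star = np.array([0.2, 0.5, 0.3])            # opponent strategy
Js = rng.choice(3, size=100_000, p=p_star)    # outcomes drawn i.i.d. from p*
estimates = np.zeros(len(Js))
for S_k, v_k in zip(S_list, segments):
    estimates += v_k @ S_k[:, Js]             # O_k^T v_{(i,j),k} with O_k = S_k e_{J_t}
print(estimates.mean(), d @ p_star)           # the two numbers should be close
```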
After updating the estimates, the algorithm decides which actions to eliminate. For each pair of vertices i, j of the graph, the expected difference of their losses is tested for its sign by the BStopStep subroutine, based on the estimate μ̂_{(i,j)} and its relative error. This subroutine uses a stopping rule based on Bernstein's inequality.
The subroutine's pseudocode is shown as Algorithm 2 and is essentially based on the work of Mnih et al. (2008). The algorithm maintains two values, LB and UB, computed from the supplied sequence of sample means (μ̂) and the deviation bounds
$$ c(\sigma, R, n, \delta) \;=\; \sigma \sqrt{\frac{2\, L(\delta, n)}{n}} \;+\; \frac{R\, L(\delta, n)}{3n}, \qquad \text{where } L(\delta, n) = \log\!\left( \frac{3p}{p-1} \cdot \frac{n^p}{\delta} \right). \tag{1} $$
Here p > 1 is an arbitrarily chosen parameter of the algorithm, σ is a (deterministic) upper bound on the (conditional) variance of the random variables whose common mean μ we wish to estimate, while R is a (deterministic) upper bound on their range. This is a general stopping rule method, which stops when it has produced an ε-relative accurate estimate of the unknown mean. The algorithm is guaranteed to be correct outside of a failure event whose probability is bounded by δ.
Algorithm Balaton calls this method with ε = 1/2. As a result, when BStopStep returns true, outside of the failure event the sign of the estimate μ̂ supplied to Balaton will match the sign of the mean to be estimated. The conditions under which the algorithm indeed produces ε-accurate estimates (with high probability) are given in Lemma 11 (see Appendix), which also states that, also with high probability, the time when the algorithm stops is bounded by
$$ C \cdot \max\left( \frac{\sigma^2}{\varepsilon^2 \mu^2},\; \frac{R}{\varepsilon |\mu|} \right) \left( \log\frac{1}{\delta} + \log\frac{R}{\varepsilon |\mu|} \right), $$
where μ ≠ 0 is the true mean. Note that the choice of p in (1) influences only C.
If BStopStep returns true for an estimate μ_{(i,j)}, the function eliminate is called. If, say, μ_{(i,j)} > 0, this function takes the closed half space {q ∈ Δ_M : (ℓ_i − ℓ_j)ᵀ q ≤ 0} and eliminates all actions whose cell lies completely in this half space. The function also drops the vertices from the graph that correspond to eliminated cells.
6 . Note that a round of the algorithm is not the same as the time step t. In a round , the algorithm chooses each of the alive actions once .
Algorithm 2 Algorithm BStopStep. Note that, somewhat unusually at least in pseudocodes, the arguments LB, UB are passed by reference, i.e., the algorithm rewrites the values of these arguments (which are thus returned back to the caller).
Input: μ̂, LB, UB, σ, R, n, ε, δ
  LB ← max(LB, |μ̂| − c(σ, R, n, δ))
  UB ← min(UB, |μ̂| + c(σ, R, n, δ))
  return (1 + ε) LB ≥ (1 − ε) UB
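To make the stopping rule concrete, here is a hedged Python sketch of Algorithm 2 together with the deviation bound (1) as reconstructed above (the class, its state handling, and the default parameters are our own choices, not the authors' implementation):

```python
# Sketch (ours) of the Bernstein stopping-rule step: keep running LB/UB bounds
# on |mean| from the sample means and the deviation bound of (1), and report
# when the relative-error criterion is met (Balaton uses eps = 1/2).
import math

def deviation_bound(sigma, R, n, delta, p=2.0):
    """c(sigma, R, n, delta) as in our reconstruction of (1)."""
    L = math.log(3.0 * p * n**p / ((p - 1.0) * delta))
    return sigma * math.sqrt(2.0 * L / n) + R * L / (3.0 * n)

class BStop:
    """State for one (i, j) pair; step() plays the role of BStopStep."""
    def __init__(self, sigma, R, eps=0.5, delta=0.01):
        self.sigma, self.R, self.eps, self.delta = sigma, R, eps, delta
        self.LB, self.UB = 0.0, float("inf")

    def step(self, mu_hat, n):
        """Update the interval from the current sample mean; True means stop."""
        c = deviation_bound(self.sigma, self.R, n, self.delta)
        self.LB = max(self.LB, abs(mu_hat) - c)
        self.UB = min(self.UB, abs(mu_hat) + c)
        return (1.0 + self.eps) * self.LB >= (1.0 - self.eps) * self.UB
```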
The elimination necessarily concerns all actions whose corresponding cell is C_i, and possibly other actions as well. The remaining cells are redefined by taking their intersection with the complementary half space {q ∈ Δ_M : (ℓ_i − ℓ_j)ᵀ q ≥ 0}.
By construction , after the elimination phase , the remaining graph is still connected , but some paths used in the round may have lost vertices or edges . For this reason , in the last phase of the round , new paths are constructed for vertex pairs with broken paths .
The main loop of the algorithm continues until either one vertex remains in the graph or the time horizon T is reached . In the former case , one of the actions corresponding to that vertex is chosen until the time horizon is reached .
5. Analysis of the algorithm
In this section we prove that the algorithm described in the previous section achieves O(√T) expected regret.
Let us assume that the outcomes are generated following the probability vector p* ∈ Δ_M. Let j* denote an optimal action, that is, for every 1 ≤ i ≤ N, ℓ_{j*}ᵀ p* ≤ ℓ_iᵀ p*. For every pair of actions i, j, let α_{i,j} = (ℓ_i − ℓ_j)ᵀ p* be the expected difference of their instantaneous losses. The expected regret of the algorithm can be rewritten as
$$ \mathbb{E}\left[ \sum_{t=1}^{T} \ell_{I_t, J_t} \right] - \min_{1 \le i \le N} \mathbb{E}\left[ \sum_{t=1}^{T} \ell_{i, J_t} \right] \;=\; \sum_{i=1}^{N} \mathbb{E}[\tau_i]\, \alpha_{i,j^*}, \tag{2} $$
where τ i is the number of times action i is chosen by the algorithm .
Throughout the proof, the value that Balaton assigns to a variable x in round n will be denoted by x(n). Further, for 1 ≤ k ≤ N, we introduce the i.i.d. random sequence (J_k(n))_{n≥1}, taking values in {1, ..., M}, with common multinomial distribution satisfying P[J_k(n) = j] = p*_j. Clearly, a model statistically equivalent to the one where (J_t) is an i.i.d. sequence with multinomial distribution p* is obtained when (J_t) is defined through
$$ J_t \;=\; J_{I_t}\!\left( \sum_{s=1}^{t} \mathbb{I}(I_s = I_t) \right). \tag{3} $$
Note that this claim holds independently of the algorithm generating the actions I_t. Therefore, in what follows, we assume that the outcome sequence is generated through (3). As we will see, this construction significantly simplifies subsequent steps of the proof. In particular, the construction will be very convenient because if action k is selected by our algorithm in the n-th elimination round then the outcome obtained in response is going to be
$$ O_k(n) = S_k u_k(n), \qquad \text{where } u_k(n) = e_{J_k(n)}. $$
(This holds because in the elimination rounds all alive actions are tried exactly once by Balaton.)
Let (F_n)_n be the filtration defined as F_n = σ(u_k(m); 1 ≤ k ≤ N, 1 ≤ m ≤ n). We also introduce the notations E_n[·] = E[· | F_n] and Var_n(·) = Var(· | F_n) for the conditional expectation and conditional variance operators corresponding to F_n. Note that F_n contains the information known to Balaton (and more) at the end of elimination round n. Our first (trivial) observation is that μ_{(i,j)}(n), the estimate of α_{i,j} obtained in round n, is F_n-measurable. The next lemma establishes that, furthermore, μ_{(i,j)}(n) is an unbiased estimate of α_{i,j}:
Lemma 5 For any n ≥ 1 and i, j such that C_i, C_j ∈ C, E_{n−1}[μ_{(i,j)}(n)] = α_{i,j}.
Proof Consider first the case when actions i and j are neighbors. In this case,
$$ \mu_{(i,j)}(n) \;=\; \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} \;=\; \sum_{k \in A_{i,j}} \big( S_k u_k(n) \big)^\top v_{(i,j),k} \;=\; \sum_{k \in A_{i,j}} u_k(n)^\top S_k^\top v_{(i,j),k}, $$
and thus
$$ \mathbb{E}_{n-1}\big[ \mu_{(i,j)}(n) \big] \;=\; \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\big[ u_k(n) \big]^\top S_k^\top v_{(i,j),k} \;=\; p^{*\top} \sum_{k \in A_{i,j}} S_k^\top v_{(i,j),k} \;=\; p^{*\top} S_{(i,j)}^\top v_{(i,j)} \;=\; p^{*\top} (\ell_i - \ell_j) \;=\; \alpha_{i,j}. $$
For non-adjacent i and j, we have a telescoping sum:
$$ \mathbb{E}_{n-1}\big[ \mu_{(i,j)}(n) \big] \;=\; \sum_{k=1}^{r} \mathbb{E}_{n-1}\big[ \mu_{(i_{k-1}, i_k)}(n) \big] \;=\; p^{*\top} \big( \ell_{i_0} - \ell_{i_1} + \ell_{i_1} - \ell_{i_2} + \cdots + \ell_{i_{r-1}} - \ell_{i_r} \big) \;=\; \alpha_{i,j}, $$
where i = i_0, i_1, ..., i_r = j is the path the algorithm uses in round n, which is known at the end of round n − 1. ∎
Lemma 6 The conditional variance of μ_{(i,j)}(n), Var_{n−1}(μ_{(i,j)}(n)), is upper bounded by V = 2 ∑_{{i,j} neighbors} ‖v_{(i,j)}‖₂².
Proof For neighboring cells C_i, C_j, we write
$$ \mu_{(i,j)}(n) = \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} $$
and thus
$$ \begin{aligned} \operatorname{Var}_{n-1}\big( \mu_{(i,j)}(n) \big) &= \operatorname{Var}_{n-1}\Big( \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} \Big) \\ &= \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\Big[ v_{(i,j),k}^\top \big( O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \big) \big( O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \big)^\top v_{(i,j),k} \Big] \\ &\le \sum_{k \in A_{i,j}} \| v_{(i,j),k} \|_2^2\, \mathbb{E}_{n-1}\Big[ \big\| O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \big\|_2^2 \Big] \\ &\le \sum_{k \in A_{i,j}} \| v_{(i,j),k} \|_2^2 \;=\; \| v_{(i,j)} \|_2^2, \end{aligned} \tag{4} $$
where in (4) we used that O_k(n) is a unit vector and E_{n−1}[O_k(n)] is a probability vector.
For non-neighboring cells i, j, let i = i_0, i_1, ..., i_r = j be the path used for the estimate in round n. Then μ_{(i,j)}(n) can be written as
$$ \mu_{(i,j)}(n) \;=\; \sum_{s=1}^{r} \mu_{(i_{s-1}, i_s)}(n) \;=\; \sum_{s=1}^{r} \sum_{k \in A_{i_{s-1}, i_s}} O_k(n)^\top v_{(i_{s-1}, i_s),k}. $$
It is not hard to see that an action can be in at most two of the neighborhood action sets along the path, and so the double sum can be rearranged as
$$ \mu_{(i,j)}(n) \;=\; \sum_{k \in \bigcup_s A_{i_{s-1}, i_s}} O_k(n)^\top \big( v_{(i_{s_k-1}, i_{s_k}),k} + v_{(i_{s_k}, i_{s_k+1}),k} \big), $$
where s_k indexes the (at most two) neighborhood action sets containing k, with missing segment vectors taken to be zero. Thus
$$ \operatorname{Var}_{n-1}\big( \mu_{(i,j)}(n) \big) \;\le\; 2 \sum_{s=1}^{r} \| v_{(i_{s-1}, i_s)} \|_2^2 \;\le\; 2 \sum_{\{i,j\}\ \text{neighbors}} \| v_{(i,j)} \|_2^2. \qquad \blacksquare $$
Lemma 7 The range of the estimates μ_{(i,j)}(n) is upper bounded by R = ∑_{{i,j} neighbors} ‖v_{(i,j)}‖₁.
Proof The bound trivially follows from the definition of the estimates. ∎
Let δ be the confidence parameter used in BStopStep. Since, according to Lemmas 5, 6 and 7, (μ_{(i,j)}) is a "shifted" martingale difference sequence with conditional mean α_{i,j}, bounded conditional variance and bounded range, we can apply Lemma 11 stated in the Appendix. By the union bound, the probability that any of the confidence bounds fails during the game is at most N²δ. Thus, with probability at least 1 − N²δ, if BStopStep returns true for a pair (i, j) then sgn(α_{i,j}) = sgn(μ̂_{(i,j)}) and the algorithm eliminates all the actions whose cell is contained in the closed half space H = {p : sgn(α_{i,j}) (ℓ_i − ℓ_j)ᵀ p ≤ 0}. By definition α_{i,j} = (ℓ_i − ℓ_j)ᵀ p*. Thus p* ∉ H and none of the eliminated actions can be optimal under p*.
From Lemma 11 we also see that, with probability at least 1 − N²δ, the number of times τ*_i the algorithm experiments with a suboptimal action i during the elimination phase is bounded by
$$ \tau_i^* \;\le\; \frac{c(G)}{\alpha_{i,j^*}^2} \log \frac{R}{\delta\, \alpha_{i,j^*}} \;=\; T_i, \tag{5} $$
where c(G) = C(V + R) is a problem dependent constant.
The following lemma, whose proof can be found in the Appendix, shows that degenerate actions will be eliminated in time.
Lemma 8 Let action i be a degenerate action. Let A_i = {j : C_j ∈ C, C_i ⊂ C_j}. The following two statements hold:
1. If any of the actions in A_i is eliminated, then action i is eliminated as well.
2. There exists an action k_i ∈ A_i such that α_{k_i,j*} ≥ α_{i,j*}.
An immediate implication of the first claim of the lemma is that if action k_i gets eliminated then action i gets eliminated as well; that is, the number of times action i is chosen cannot be greater than that of action k_i. Hence, τ*_i ≤ τ*_{k_i}. Let E be the complement of the failure event underlying the stopping rules. As discussed earlier, P(E^c) ≤ N²δ. Note that on E, i.e., when the stopping rules do not fail, no suboptimal action can remain alive for the final phase. Hence, τ_i I(E) ≤ τ*_i I(E), where τ_i is the number of times action i is chosen by the algorithm. To upper bound the expected regret we continue from (2) as follows:
$$ \begin{aligned} \sum_{i=1}^{N} \mathbb{E}[\tau_i]\, \alpha_{i,j^*} &\le \sum_{i=1}^{N} \mathbb{E}\big[ \mathbb{I}(E)\, \tau_i \big]\, \alpha_{i,j^*} + P(E^c)\, T && \Big(\text{because } \textstyle\sum_{i=1}^{N} \tau_i = T \text{ and } 0 \le \alpha_{i,j^*} \le 1 \Big) \\ &\le \sum_{i=1}^{N} \mathbb{E}\big[ \mathbb{I}(E)\, \tau_i^* \big]\, \alpha_{i,j^*} + N^2 \delta T \\ &= \sum_{i:\, C_i \in \mathcal{C}} \mathbb{E}\big[ \mathbb{I}(E)\, \tau_i^* \big]\, \alpha_{i,j^*} + \sum_{i:\, C_i \notin \mathcal{C}} \mathbb{E}\big[ \mathbb{I}(E)\, \tau_i^* \big]\, \alpha_{i,j^*} + N^2 \delta T \\ &\le \sum_{i:\, C_i \in \mathcal{C}} \mathbb{E}\big[ \mathbb{I}(E)\, \tau_i^* \big]\, \alpha_{i,j^*} + \sum_{i:\, C_i \notin \mathcal{C}} \mathbb{E}\big[ \mathbb{I}(E)\, \tau_{k_i}^* \big]\, \alpha_{k_i,j^*} + N^2 \delta T && (\text{by Lemma 8}) \\ &\le \sum_{i:\, C_i \in \mathcal{C}} T_i\, \alpha_{i,j^*} + \sum_{i:\, C_i \notin \mathcal{C}} T_{k_i}\, \alpha_{k_i,j^*} + N^2 \delta T \\ &\le \sum_{\substack{i:\, C_i \in \mathcal{C} \\ \alpha_{i,j^*} \ge \alpha_0}} T_i\, \alpha_{i,j^*} + \sum_{\substack{i:\, C_i \notin \mathcal{C} \\ \alpha_{k_i,j^*} \ge \alpha_0}} T_{k_i}\, \alpha_{k_i,j^*} + \big( \alpha_0 + N^2 \delta \big) T \\ &\le c(G) \left( \sum_{\substack{i:\, C_i \in \mathcal{C} \\ \alpha_{i,j^*} \ge \alpha_0}} \frac{ \log \frac{R}{\delta \alpha_{i,j^*}} }{ \alpha_{i,j^*} } + \sum_{\substack{i:\, C_i \notin \mathcal{C} \\ \alpha_{k_i,j^*} \ge \alpha_0}} \frac{ \log \frac{R}{\delta \alpha_{k_i,j^*}} }{ \alpha_{k_i,j^*} } \right) + \big( \alpha_0 + N^2 \delta \big) T \\ &\le c(G)\, N\, \frac{ \log \frac{R}{\delta \alpha_0} }{ \alpha_0 } + \big( \alpha_0 + N^2 \delta \big) T. \end{aligned} $$
The above calculation holds for any value of α_0 > 0. Setting
$$ \alpha_0 = \sqrt{\frac{c(G)\, N}{T}} \qquad \text{and} \qquad \delta = \sqrt{\frac{c(G)}{T N^3}}, $$
we get
$$ \mathbb{E}[R_T] \;\le\; \sqrt{c(G)\, N\, T}\, \log \frac{R\, T\, N^2}{c(G)}. $$
In conclusion, if we run Balaton with parameter δ = √(c(G)/(T N³)), the algorithm suffers regret of O(√T), finishing the proof.
6. A lower bound for hard games
In this section we prove that for any game satisfying the condition of case (d) of Theorem 4, the minimax regret is Ω(T^{2/3}).
Theorem 9 Let G = (L, H) be an N by M partial-monitoring game. Assume that there exist two neighboring actions i and j such that ℓ_i − ℓ_j ∉ Im Sᵀ_{(i,j)}. Then there exists a problem dependent constant c(G) such that for any algorithm A and time horizon T there exists an opponent strategy p such that the expected regret satisfies
$$ \mathbb{E}\big[ R_T(\mathcal{A}, p) \big] \;\ge\; c(G)\, T^{2/3}. $$
Proof Without loss of generality we can assume that the two neighboring cells in the condition are C_1 and C_2. Let C_3 = C_1 ∩ C_2. For i = 1, 2, 3, let A_i be the set of actions associated with cell C_i. Note that A_3 may be the empty set. Let A_4 = A \ (A_1 ∪ A_2 ∪ A_3). By our convention for naming loss vectors, ℓ_1 and ℓ_2 are the loss vectors of C_1 and C_2, respectively. Let L_3 collect the loss vectors of actions which lie on the open segment connecting ℓ_1 and ℓ_2; it is easy to see that L_3 is the set of loss vectors that correspond to the cell C_3. We define L_4 as the set of all the other loss vectors. For i = 1, 2, 3, 4, let k_i = |A_i|.
Let S = S_{(1,2)} denote the signal matrix of the neighborhood action set of C_1 and C_2. It follows from the assumption of the theorem that ℓ_2 − ℓ_1 ∉ Im(Sᵀ). Thus, {ρ(ℓ_2 − ℓ_1) : ρ ∈ ℝ} ⊄ Im(Sᵀ), or equivalently, (ℓ_2 − ℓ_1)^⊥ ⊉ Ker S, where we used that (Im Mᵀ)^⊥ = Ker M. Thus, there exists a vector v such that v ∈ Ker S and (ℓ_2 − ℓ_1)ᵀ v ≠ 0. By scaling we can assume that (ℓ_2 − ℓ_1)ᵀ v = 1. Note that since v ∈ Ker S and the row space of S contains the vector (1, 1, ..., 1), the coordinates of v sum up to zero.
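The vector v used above can be computed explicitly: project ℓ_2 − ℓ_1 onto Ker S and rescale. The following numerical sketch (ours; the toy matrices are illustrative) assumes the pair is indeed not locally observable, so that the projection is nonzero:

```python
# Sketch (ours): find v in Ker S with (l_2 - l_1)^T v = 1, as used in the
# lower-bound construction; such a v exists exactly when l_2 - l_1 is not in Im S^T.
import numpy as np

def null_space_basis(S, tol=1e-10):
    _, s, Vt = np.linalg.svd(S)
    rank = int((s > tol).sum())
    return Vt[rank:].T                   # columns: orthonormal basis of Ker S

def perturbation_direction(S, d):
    """Return v with S v = 0 and d^T v = 1, or None if no such v exists."""
    K = null_space_basis(S)
    w = K @ (K.T @ d)                    # projection of d onto Ker S
    if np.allclose(w, 0.0):
        return None                      # d is orthogonal to Ker S, i.e. d lies in Im S^T
    return w / (d @ w)

S = np.array([[1.0, 1.0, 1.0, 1.0],      # toy stacked signal matrix
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
d = np.array([2.0, 1.0, 0.0, -3.0])      # a loss difference outside the row space of S
v = perturbation_direction(S, d)
print(np.allclose(S @ v, 0.0), d @ v)    # expect: True 1.0
```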
Let p 0 be an arbitrary probability vector in the relative interior of C 3 . It is easy to see that for any ε > 0 small enough , p 1 = p 0 + εv ∈ C 1 \ C 2 and p 2 = p 0 − εv ∈ C 2 \ C 1 .
Let us fix a deterministic algorithm A and a time horizon T. For i = 1, 2, let R_T^{(i)} denote the expected regret of the algorithm under opponent strategy p_i. For i = 1, 2 and j = 1, ..., 4, let N^i_j denote the expected number of times the algorithm chooses an action from A_j, assuming the opponent plays strategy p_i.
From the definition of L_3 we know that for any ℓ ∈ L_3, ℓ − ℓ_1 = η(ℓ_2 − ℓ_1) and ℓ − ℓ_2 = (1 − η)(ℓ_1 − ℓ_2) for some 0 < η < 1. Let λ_1 = min_{ℓ∈L_3} η and λ_2 = min_{ℓ∈L_3}(1 − η), and let λ = min(λ_1, λ_2) if L_3 ≠ ∅ and λ = 1/2 otherwise. Finally, let β_i = min_{ℓ∈L_4} (ℓ − ℓ_i)ᵀ p_i and β = min(β_1, β_2). Note that λ, β > 0.
As the first step of the proof, we lower bound the expected regrets R_T^{(1)} and R_T^{(2)} in terms of the values N^i_j, ε, λ and β:
$$ \begin{aligned} R_T^{(1)} &\ge N^1_2\, \underbrace{ (\ell_2 - \ell_1)^\top p_1 }_{\varepsilon} + N^1_3\, \lambda\, (\ell_2 - \ell_1)^\top p_1 + N^1_4\, \beta \;\ge\; \lambda \big( N^1_2 + N^1_3 \big) \varepsilon + N^1_4\, \beta, \\ R_T^{(2)} &\ge N^2_1\, \underbrace{ (\ell_1 - \ell_2)^\top p_2 }_{\varepsilon} + N^2_3\, \lambda\, (\ell_1 - \ell_2)^\top p_2 + N^2_4\, \beta \;\ge\; \lambda \big( N^2_1 + N^2_3 \big) \varepsilon + N^2_4\, \beta. \end{aligned} \tag{6} $$
For the next step , we need the following lemma .
Lemma 10 There exists a (problem dependent) constant c such that the following inequalities hold:
$$ N^2_1 \ge N^1_1 - c\, T \varepsilon \sqrt{N^1_4}, \qquad N^2_3 \ge N^1_3 - c\, T \varepsilon \sqrt{N^1_4}, $$
$$ N^1_2 \ge N^2_2 - c\, T \varepsilon \sqrt{N^2_4}, \qquad N^1_3 \ge N^2_3 - c\, T \varepsilon \sqrt{N^2_4}. $$
Proof (Lemma 10) For any 1 ≤ t ≤ T, let f_t = (f_1, ..., f_t) ∈ Σ^t be a feedback sequence up to time step t. For i = 1, 2, let p*_i be the probability mass function of feedback sequences of length T − 1 under opponent strategy p_i and algorithm A. We start by upper bounding the difference between the above values under the two opponent strategies. For i ≠ j ∈ {1, 2} and k ∈ {1, 2, 3},
$$ \begin{aligned} N^i_k - N^j_k &= \sum_{f_{T-1}} \big( p^*_i(f_{T-1}) - p^*_j(f_{T-1}) \big) \sum_{t=0}^{T-1} \mathbb{I}\big( \mathcal{A}(f_t) \in A_k \big) \\ &\le \sum_{f_{T-1}:\; p^*_i(f_{T-1}) - p^*_j(f_{T-1}) \ge 0} \big( p^*_i(f_{T-1}) - p^*_j(f_{T-1}) \big) \sum_{t=0}^{T-1} \mathbb{I}\big( \mathcal{A}(f_t) \in A_k \big) \\ &\le T \sum_{f_{T-1}:\; p^*_i(f_{T-1}) - p^*_j(f_{T-1}) \ge 0} \big( p^*_i(f_{T-1}) - p^*_j(f_{T-1}) \big) \;=\; \frac{T}{2}\, \big\| p^*_1 - p^*_2 \big\|_1 \\ &\le T \sqrt{ \mathrm{KL}\big( p^*_1 \,\big\|\, p^*_2 \big) / 2 }, \end{aligned} \tag{7} $$
where KL(·‖·) denotes the Kullback-Leibler divergence and ‖·‖₁ is the L₁-norm. The last inequality follows from Pinsker's inequality (Cover and Thomas, 2006). To upper bound KL(p*_1‖p*_2), we use the chain rule for KL-divergence. We overload the notation so that p*_i(f_{t−1}) denotes the probability of the feedback sequence f_{t−1} under opponent strategy p_i and algorithm A, and p*_i(f_t | f_{t−1}) denotes the conditional probability of feedback f_t ∈ Σ given that the past feedback sequence was f_{t−1}, again under p_i and A. With this notation we have
$$ \begin{aligned} \mathrm{KL}\big( p^*_1 \,\big\|\, p^*_2 \big) &= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p^*_1(f_{t-1}) \sum_{f_t} p^*_1(f_t \mid f_{t-1}) \log \frac{ p^*_1(f_t \mid f_{t-1}) }{ p^*_2(f_t \mid f_{t-1}) } \\ &= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p^*_1(f_{t-1}) \sum_{i=1}^{4} \mathbb{I}\big( \mathcal{A}(f_{t-1}) \in A_i \big) \sum_{f_t} p^*_1(f_t \mid f_{t-1}) \log \frac{ p^*_1(f_t \mid f_{t-1}) }{ p^*_2(f_t \mid f_{t-1}) }. \end{aligned} \tag{8} $$
Let a_{f_t} denote the row of S that corresponds to the feedback symbol f_t.⁷ Assume k = A(f_{t−1}). If the feedback set of action k does not contain f_t then trivially p*_i(f_t | f_{t−1}) = 0 for i = 1, 2. Otherwise p*_i(f_t | f_{t−1}) = a_{f_t} p_i. Since p_1 − p_2 = 2εv and v ∈ Ker S, we have a_{f_t} v = 0 and thus, if the choice of the algorithm is in A_1, A_2 or A_3, then p*_1(f_t | f_{t−1}) = p*_2(f_t | f_{t−1}). It follows that the inequality chain can be continued from (8) by writing
$$ \begin{aligned} \mathrm{KL}\big( p^*_1 \,\big\|\, p^*_2 \big) &\le \sum_{t=1}^{T-1} \sum_{f_{t-1}} p^*_1(f_{t-1})\, \mathbb{I}\big( \mathcal{A}(f_{t-1}) \in A_4 \big) \sum_{f_t} p^*_1(f_t \mid f_{t-1}) \log \frac{ p^*_1(f_t \mid f_{t-1}) }{ p^*_2(f_t \mid f_{t-1}) } \\ &\le c_1 \varepsilon^2 \sum_{t=1}^{T-1} \sum_{f_{t-1}} p^*_1(f_{t-1})\, \mathbb{I}\big( \mathcal{A}(f_{t-1}) \in A_4 \big) \\ &\le c_1 \varepsilon^2 N^1_4. \end{aligned} \tag{9} $$
7. Recall that we assumed that different actions have different feedback symbols, and thus a row of S corresponding to a symbol is unique.
In (9) we used Lemma 12 (see Appendix) to upper bound the KL-divergence of p_1 and p_2. Flipping p*_1 and p*_2 in (7), we get the same result with N²_4. Together with the bound in (7), this gives all the desired inequalities. ∎
Now we can continue lower bounding the expected regret. Let r = argmin_{i∈{1,2}} N^i_4. It is easy to see that for i = 1, 2 and j = 1, 2, 3,
$$ N^i_j \;\ge\; N^r_j - c_2\, T \varepsilon \sqrt{N^r_4}. $$
If i ≠ r then this inequality is one of the inequalities from Lemma 10. If i = r then it holds trivially, since we merely subtract a non-negative value. From (6) we have
$$ R_T^{(i)} \;\ge\; \lambda \big( N^i_{3-i} + N^i_3 \big) \varepsilon + N^i_4 \beta \;\ge\; \lambda \big( N^r_{3-i} - c_2 T \varepsilon \sqrt{N^r_4} + N^r_3 - c_2 T \varepsilon \sqrt{N^r_4} \big) \varepsilon + N^r_4 \beta \;=\; \lambda \big( N^r_{3-i} + N^r_3 - 2 c_2 T \varepsilon \sqrt{N^r_4} \big) \varepsilon + N^r_4 \beta. $$
Now assume that, at the beginning of the game, the opponent randomly chooses between strategies p_1 and p_2 with equal probability. Then the expected regret of the algorithm is lower bounded by
$$ \begin{aligned} R_T &= \tfrac{1}{2} \big( R_T^{(1)} + R_T^{(2)} \big) \;\ge\; \tfrac{1}{2} \lambda \big( N^r_1 + N^r_2 + 2 N^r_3 - 4 c_2 T \varepsilon \sqrt{N^r_4} \big) \varepsilon + N^r_4 \beta \\ &\ge \tfrac{1}{2} \lambda \big( N^r_1 + N^r_2 + N^r_3 - 4 c_2 T \varepsilon \sqrt{N^r_4} \big) \varepsilon + N^r_4 \beta \;=\; \tfrac{1}{2} \lambda \big( T - N^r_4 - 4 c_2 T \varepsilon \sqrt{N^r_4} \big) \varepsilon + N^r_4 \beta. \end{aligned} $$
Choosing ε = c 3 T −1/3 we get
$$ \begin{aligned} R_T &\ge \tfrac{1}{2} \lambda c_3 T^{2/3} - \tfrac{1}{2} \lambda N^r_4 c_3 T^{-1/3} - 2 \lambda c_2 c_3^2\, T^{1/3} \sqrt{N^r_4} + N^r_4 \beta \\ &\ge T^{2/3} \left( \Big( \beta - \tfrac{1}{2} \lambda c_3 \Big) x^2 - 2 \lambda c_2 c_3^2\, x + \tfrac{1}{2} \lambda c_3 \right), \end{aligned} $$
where x = √(N^r_4 / T^{2/3}). Now we see that c_3 > 0 can be chosen small enough, independently of T, so that for any choice of x the quadratic expression in the parentheses is bounded away from zero, and simultaneously ε is small enough so that the threshold condition in Lemma 12 is satisfied, completing the proof of Theorem 9. ∎
7. Discussion
In this paper we classified all finite partial-monitoring games under stochastic environments, based on their minimax regret. We conjecture that our results extend to non-stochastic environments. This is the major open question that remains to be answered.
One question which we did not discuss so far is the computational efficiency of our algorithm . The issue is twofold . The first computational question is how to efficiently decide which of the four classes a given game ( L , H ) belongs to . The second question is the computational efficiency of Balaton for a fixed easy game . Fortunately , in both cases an efficient implementation is possible , i.e. , in polynomial time by using a linear program solver ( e.g. , the ellipsoid method ( Papadimitriou and Steiglitz , 1998 ) ) .
Another interesting open question is to investigate the dependence of regret on quantities other than T such as the number of actions , the number of outcomes , and more generally the structure of the loss and feedback matrices .
Finally, let us note that our results can be extended to a more general framework, similar to that of Pallavi et al. (2011), in which a game with N actions and M-dimensional outcome space is defined as a tuple G = (L, S_1, ..., S_N). The loss matrix is L ∈ ℝ^{N×M} as before, but the outcome and the feedback are defined differently. The outcome y is an arbitrary vector from a bounded subset of ℝ^M and the feedback received by the learner upon choosing action i is O_i = S_i y.
References
Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 263–273. Citeseer, 2008.
Alekh Agarwal, Peter Bartlett, and Max Dama. Optimal allocation strategies for the dark pool problem. In 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), May 12–15, 2010, Chia Laguna Resort, Sardinia, Italy, 2010.
András Antos, Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games, 2011. http://arxiv.org/abs/1102.2041.
Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (ALT 2010), pages 224–238. Springer, 2010.
Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, June 2005.
Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, New York, second edition, 2006.
Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2005), page 394. Society for Industrial and Applied Mathematics, 2005.
Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2003), pages 594–605. IEEE, 2003.
Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
Gábor Lugosi and Nicolò Cesa-Bianchi. Prediction, Learning, and Games. Cambridge University Press, 2006.
V. Mnih. Efficient stopping rules. Master's thesis, Department of Computing Science, University of Alberta, 2008.
V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 672–679. ACM, 2008.
A. Pallavi, R. Zheng, and Cs. Szepesvári. Sequential learning for optimal monitoring of multi-channel wireless networks. In INFOCOM, 2011.
Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Courier Dover Publications, New York, 1998.
Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In Proceedings of the 14th Annual Conference on Computational Learning Theory (COLT 2001), pages 208–223. Springer-Verlag, 2001.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.
Appendix
Proof ( Lemma 8 )
1. In an elimination step, we eliminate every action whose cell is contained in a closed half space. Let us assume that an action j ∈ A_i is being eliminated. According to the definition of A_i, C_i ⊂ C_j, and thus C_i is also contained in the half space, so action i is eliminated as well.
2. First let us assume that p* is not in the affine subspace spanned by C_i. Let p be an arbitrary point in the relative interior of C_i. We define the point p′ = p + ε(p − p*). For a small enough ε > 0, p′ ∈ C_k for some k ∈ A_i, while at the same time p′ ∉ C_i. Thus we have
$$ \begin{aligned} \ell_k^\top \big( p + \varepsilon (p - p^*) \big) &\le \ell_i^\top \big( p + \varepsilon (p - p^*) \big) \\ (1+\varepsilon)\, \ell_k^\top p - \varepsilon\, \ell_k^\top p^* &\le (1+\varepsilon)\, \ell_i^\top p - \varepsilon\, \ell_i^\top p^* \\ -\varepsilon\, \ell_k^\top p^* &\le -\varepsilon\, \ell_i^\top p^* \\ \ell_k^\top p^* &\ge \ell_i^\top p^* \\ \alpha_{k,j^*} &\ge \alpha_{i,j^*}, \end{aligned} $$
where we used that ℓ_kᵀ p = ℓ_iᵀ p. For the case when p* lies in the affine subspace spanned by C_i, we take a hyperplane that contains this affine subspace. Then we take an infinite sequence (p_n)_n such that every element of the sequence lies on the same side of the hyperplane, p_n ≠ p*, and the sequence converges to p*. The statement then holds for every element p_n and, since the value α_{r,s} is continuous in the opponent strategy, the limit has the desired property as well. ∎
The following lemma concerns the problem of producing an estimate of an unknown mean of some stochastic process with a given relative error bound and with high probability, in a sample-efficient manner. The procedure is a simple variation of the one proposed by Mnih et al. (2008). The main differences are that here we deal with martingale difference sequences shifted by an unknown constant, which becomes the common mean, whereas Mnih et al. (2008) considered an i.i.d. sequence. On the other hand, we consider the case when we have a known upper bound on the predictable variance of the process, whereas one of the main contributions of Mnih et al. (2008) was the lifting of this assumption. The proof of the lemma is omitted, as it follows the same lines as the proofs of Mnih et al. (2008) (the details can be found in the thesis of Mnih (2008)), the only difference being that here we need to use Bernstein's inequality for martingales in place of the empirical Bernstein inequality used by Mnih et al. (2008).
Lemma 11 Let (F_t) be a filtration on some probability space, and let (X_t) be an F_t-adapted sequence of random variables. Assume that (X_t) is such that, almost surely, the range of each random variable X_t is bounded by R > 0, E[X_t | F_{t−1}] = μ, and Var[X_t | F_{t−1}] ≤ σ² a.s., where R, μ ≠ 0 and σ² are non-random constants. Let X̄_t = (1/t) ∑_{s=1}^t X_s denote the running averages, let p > 1, ε > 0, 0 < δ < 1, and let
$$ L_n = (1+\varepsilon) \max_{1 \le t \le n} \big( |\bar X_t| - c_t \big), \qquad U_n = (1-\varepsilon) \min_{1 \le t \le n} \big( |\bar X_t| + c_t \big), $$
where c_t = c(σ, R, t, δ) and c(·) is defined in (1). Define the estimate μ̂_n of μ as
$$ \hat\mu_n = \operatorname{sgn}(\bar X_n)\, \frac{(1+\varepsilon) L_n + (1-\varepsilon) U_n}{2}. $$
Denote the stopping time τ = min{n : L_n ≥ U_n}. Then, with probability at least 1 − δ,
$$ |\hat\mu_\tau - \mu| \le \varepsilon |\mu| \qquad \text{and} \qquad \tau \le C \cdot \max\left( \frac{\sigma^2}{\varepsilon^2 \mu^2},\; \frac{R}{\varepsilon |\mu|} \right) \left( \log\frac{1}{\delta} + \log\frac{R}{\varepsilon |\mu|} \right), $$
where C > 0 is a universal constant.
Lemma 12 Fix a probability vector p ∈ Δ_M, and let ε ∈ ℝ^M be a vector such that p − ε and p + ε are also probability vectors. Then KL(p − ε ‖ p + ε) = O(‖ε‖₂²) as ε → 0. The constant and the threshold in the O(·) notation depend on p.
Proof Since p, p + ε, and p − ε are all probability vectors, notice that |ε(i)| ≤ p(i) for 1 ≤ i ≤ M. So if a coordinate of p is zero then the corresponding coordinate of ε has to be zero as well. As zero coordinates do not modify the KL divergence, we can assume without loss of generality that all coordinates of p are positive. Since we are interested only in the case when ε → 0, we can also assume without loss of generality that |ε(i)| ≤ p(i)/2. Also note that the coordinates of ε = (p + ε) − p have to sum up to zero. By definition,
$$ \mathrm{KL}(p - \varepsilon \,\|\, p + \varepsilon) \;=\; \sum_{i=1}^{M} \big( p(i) - \varepsilon(i) \big) \log \frac{ p(i) - \varepsilon(i) }{ p(i) + \varepsilon(i) }. $$
We rewrite the term with the logarithm as
$$ \log \frac{ p(i) - \varepsilon(i) }{ p(i) + \varepsilon(i) } \;=\; \log\left( 1 - \frac{\varepsilon(i)}{p(i)} \right) - \log\left( 1 + \frac{\varepsilon(i)}{p(i)} \right), $$
so that we can use that, by a second order Taylor expansion around 0, log(1 − x) − log(1 + x) = −2x + r(x), where |r(x)| ≤ c|x|³ for |x| ≤ 1/2 and some c > 0. Combining these equations, we get
$$ \begin{aligned} \mathrm{KL}(p - \varepsilon \,\|\, p + \varepsilon) &= \sum_{i=1}^{M} \big( p(i) - \varepsilon(i) \big) \left( \frac{ -2\varepsilon(i) }{ p(i) } + r\!\left( \frac{\varepsilon(i)}{p(i)} \right) \right) \\ &= \sum_{i=1}^{M} \big( -2 \varepsilon(i) \big) + \sum_{i=1}^{M} \frac{ 2\varepsilon^2(i) }{ p(i) } + \sum_{i=1}^{M} \big( p(i) - \varepsilon(i) \big)\, r\!\left( \frac{\varepsilon(i)}{p(i)} \right). \end{aligned} $$
Here the first term is 0. Letting p_min = min_{i∈{1,...,M}} p(i), the second term is bounded by 2 ∑_{i=1}^M ε²(i)/p_min = (2/p_min)‖ε‖₂², and the third term is bounded by
$$ \left| \sum_{i=1}^{M} \big( p(i) - \varepsilon(i) \big)\, r\!\left( \frac{\varepsilon(i)}{p(i)} \right) \right| \;\le\; c \sum_{i=1}^{M} \frac{ p(i) - \varepsilon(i) }{ p^3(i) }\, |\varepsilon(i)|^3 \;\le\; c \sum_{i=1}^{M} \frac{ |\varepsilon(i)| }{ p^2(i) }\, \varepsilon^2(i) \;\le\; \frac{c}{2} \sum_{i=1}^{M} \frac{ \varepsilon^2(i) }{ p(i) } \;\le\; \frac{c}{2\, p_{\min}}\, \| \varepsilon \|_2^2. $$
Hence, KL(p − ε ‖ p + ε) ≤ ((4 + c)/(2 p_min)) ‖ε‖₂² = O(‖ε‖₂²). ∎
minimax
case
step
follow
game
low
opponent
algorithm
bound
regret
learner
loss
difference
action
matrix
outcome
extend
action
matrix
feedback
let
game
outcome
l
optimal
ok
h
choose
time
action
action
difference
loss
let
expect
optimal
pair
regret
outcome
algorithm
action
strategy
matrix
row
p
say
figure
example
vector
action
half
space
remain
pair
action
game
hard
neighbor
outcome
distribution
environment
action
p
time
probability
dynamic
pricing
bandit
algorithm
opponent
number
time
action
regret
n
game
algorithm
regret
expect
late
achieve
case
t
outcome
action
function
signal
assume
game
feedback
regret
case
bound
algorithm
result
paper
achieve
exists
zero
low
t
matrix
simplex
l
cell
probability
late
h
l
p
game
t
hopeless
hard
loss
t
action
difference
regret
round
t
t
action
choosing
outcome
arbitrary
game
loss
space
u
vector
matrix
regret
loss
hindsight
best
say
mean
learner
action
scale
game
stochastic
key
classification
time
finite
minimax
regret
action
algorithm
choose
environment
stochastic
game
outcome
symbol
matrix
row
action
signal
hopeless
trivial
structure
outcome
matrix
say
able
action
game
number
assume
feedback
case
action
regret
condition
bound
choose
achieve
upper
round
algorithm
t
expect
regret
previous
environment
base
stochastic
game
number
algorithm
bound
action
time
n
game
outcome
finite
minimax
characterize
regret
partial
action
loss
note
positive
game
regret
monitoring
partial
minimax
environment
game
turn
computationally
efficient
algorithm
analysis
minimax
factor
regret
zero
expect
regret
r
easy
game
algorithm
matrix
loss
neighbor
expect
choose
game
condition
efficient
algorithm
rate
factor
zero
minimax
regret
need
action
easy
pair
game
neighbor
expect
regret
algorithm
p
t
game
know
case
t
game
regret
learner
characterize
minimax
matrix
outcome
feedback
distinguish
environment
action
number
say
action
signal
matrix
row
t
expect
regret
game
expect
loss
difference
mean
action
outcome
distribution
j
feedback
case
choose
environment
outcome
game
action
monitoring
partial
distribution
characterize
regret
minimax
outcome
feedback
case
assume
expect
regret
constant
problem
game
time
let
action
regret
expect
algorithm
t
learner
feedback
monitoring
partial
loss
matrix
outcome
set
game
action
game
minimax
main
regret
theorem
finite
stochastic
action
linear
easy
turn
n
game
hopeless
exists
t
game
number
set
finite
outcome
say
action
monitoring
partial
n
game
finite
monitoring
say
partial
environment
regret
minimax
game
feedback
action
say
set
let
game
easy
game
condition
hard
ok
algorithm
use
action
t
time
choose
let
hold
outcome
regret
low
environment
stochastic
t
outcome
step
action
note
follow
distribution
outcome
algorithm
action
environment
difference
game
easy
condition
hold
hopeless
finite
extend
answer
question
game
loss
vector
action
easy
condition
neighbor
game
make
characterize
outcome
ok
action
algebra
strongly
matrix
linear
signal
condition
algorithm
need
bound
action
feedback
learner
choose
action
follow
game
action
game
opponent
feedback
outcome
know
game
minimax
regret
paper
base
result
stochastic
game
game
general
result
paper
stochastic
expect
regret
let
r
action
cell
note
environment
stochastic
game
feedback
outcome
case
example
know
monitoring
partial
game
finite
question
action
loss
matrix
outcome
game
dynamic
pricing
p
action
optimal
cell
use
condition
bound
algorithm
low
action
mention
achieves
regret
zero
upper
t
opponent
bound
upper
proof
action
condition
case
action
bound
previous
algorithm
cell
choose
action
case
t
game
right
figure
base
know
matrix
action
signal
let
say
outcome
distribution
environment
general
time
probability
regret
expect
bound
low
eliminate
regret
t
regret
minimax
problem
matrix
action
loss
say
action
regret
t
round
action
significantly
outcome
hold
round
note
algorithm
t
outcome
loss
feedback
action
question
answer
game
base
l
u
probability
action
matrix
action
let
expect
time
choose
algorithm
t
regret
feedback
outcome
learner
loss
action
matrix
game
action
game
matrix
minimax
regret
game
case
action
game
row
loss
signal
matrix
outcome
regret
bound
figure
zero
theorem
upper
low
t
game
condition
efficient
algorithm
matrix
action
outcome
optimal
p
u
probability
distribution
outcome
probability
game
regret
result
base
stochastic
minimax
game
stochastic
outcome
distribution
deal
time
finite
game
action
loss
vector
outcome
example
feedback
learner
action
matrix
choose
game
matrix
pair
action
loss
say
outcome
game
action
final
choose
time
note
game
algorithm
easy
set
time
regret
game
hopeless
characterize
trivial
action
hard
example
action
constant
matrix
number
action
time
algorithm
bound
upper
regret
n
alive
action
hold
round
action
set
regret
achieve
t
function
feedback
step
time
dynamic
pricing
action
game
strategy
let
p
choose
time
action
number
choose
action
number
time
action
loss
vector
function
action
regret
game
case
example
problem
result
expect
regret
p
algorithm
action
neighbor
outcome
p
let
action
vector
probability
feedback
step
time
learner
action
outcome
choose
information
correspond
matrix
example
loss
action
matrix
minimax
regret
previous
base
result
stochastic
game
time
matrix
outcome
action
t
game
algorithm
condition
action
u
action
set
action
t
round
choose
game
game
know
t
t
case
choose
outcome
loss
distribution
term
bound
l
action
cell
lie
note
feedback
monitoring
partial
dynamic
pricing
bandit
outcome
action
distribution
environment
difference
choose
feedback
game
algorithm
outcome
regret
finite
expect
base
loss
pair
difference
action
minimax
regret
game
base
obvious
row
c
section
case
loss
algorithm
bound
t
upper
regret
game
beginning
strategy
p
choose
probability
environment
important
stochastic
t
regret
term
low
dynamic
pricing
loss
action
set
let
condition
note
game
hopeless
exist
t
trivial
case
l
u
action
choose
expect
algorithm
game
learner
paper
action
strategy
figure
p
minimax
regret
loss
game
case
paper
game
regret
opponent
theorem
t
function
correspond
action
condition
game
l
action
h
action
j
action
observable
observability
game
action
signal
condition
matrix
neighbor
vector
game
action
loss
loss
action
difference
matrix
extend
deterministic
need
action
assume
feedback
game
hopeless
game
hopeless
matrix
action
signal
different
number
let
t
define
game
classification
finite
game
outcome
action
classification
n
action
case
condition
action
algorithm
figure
choose
learner
loss
action
game
hard
outcome
example
optimal
action
say
matrix
signal
let
game
action
space
outcome
loss
note
let
game
game
hopeless
hard
action
t
partial
environment
outcome
leave
game
outcome
l
example
late
h
monitoring
partial
p
action
loss
vector
action
case
time
choose
remain
t
algorithm
outcome
action
note
action
observability
game
set
say
easy
low
bound
regret
n
matrix
c
feedback
way
probability
case
choose
let
time
t
space
action
u
cell
loss
outcome
difference
expect
action
time
learner
loss
environment
outcome
feedback
action
game
regret
general
learner
t
action
cell
outcome
action
assume
previous
game
condition
minimax
regret
estimate
sign
algorithm
condition
time
bound
environment
bound
theorem
low
stochastic
proof
algorithm
action
expect
u
choose
action
loss
choose
outcome
action
different
matrix
signal
assume
game
case
section
condition
regret
t
game
game
outcome
algorithm
hold
action
proof
n
dynamic
pricing
bandit
action
algorithm
choose
action
let
action
t
case
pair
game
space
close
j
action
cell
case
b
action
zero
choose
figure
probability
c
learner
choose
feedback
action
stochastic
note
game
question
algorithm
use
case
game
regret
algorithm
t
condition
case
outcome
let
c
action
vector
feedback
function
step
difference
time
run
regret
t
algorithm
hopeless
trivial
different
game
theorem
row
space
t
game
hopeless
say
action
matrix
action
difference
row
row
action
matrix
action
action
cell
optimal
p
outcome
ok
open
feedback
strategy
let
p
time
probability
outcome
loss
action
game
action
vector
choose
row
signal
matrix
outcome
action
row
matrix
outcome
information
action
algorithm
base
estimate
choose
outcome
loss
action
consider
game
feedback
loss
outcome
action
matrix
regret
case
previous
paper
result
action
previous
define
example
observable
j
action
exist
case
regret
minimax
stochastic
game
case
game
hopeless
finite
minimax
regret
result
ok
game
stochastic
game
opponent
algorithm
action
time
regret
n
action
space
cell
information
t
learner
think
case
paper
l
action
h
p
monitoring
partial
regret
learner
t
low
action
opponent
say
n
turn
regret
t
achieve
game
algorithm
matrix
signal
action
c
vector
case
regret
t
say
mean
game
dynamic
pricing
hard
theorem
environment
game
t
learner
j
action
condition
local
observability
difference
action
neighbor
easy
game
action
hopeless
trivial
function
say
action
game
bound
p
apply
half
true
algorithm
use
need
outcome
feedback
know
game
case
action
space
optimal
h
probability
p
use
cell
game
t
regret
theorem
zero
p
case
note
vector
probability
section
regret
expect
algorithm
t
hold
regret
upper
bound
case
algorithm
game
outcome
action
note
regret
algorithm
t
game
algorithm
make
outcome
action
loss
regret
outcome
t
characterize
game
algorithm
achieves
analysis
regret
t
t
t
t
condition
cell
hold
j
difference
transaction
information
line
need
difference
proof
bound
case
game
finite
environment
stochastic
action
loss
important
note
follow
p
let
c
time
action
game
neighbor
expect
algorithm
different
define
main
theorem
difference
loss
hold
pair
condition
note
game
action
loss
let
game
outcome
algorithm
action
time
decide
base
j
action
difference
algorithm
game
want
loss
action
difference
hard
outcome
loss
algorithm
set
case
choose
section
achieve
game
t
regret
expect
time
let
algorithm
bound
low
bound
mean
upper
outcome
loss
action
follow
space
action
lie
cell
action
strongly
linear
action
set
let
game
expect
algorithm
choose
outcome
action
algorithm
hold
proof
action
t
pair
game
game
action
n
start
previous
p
feedback
symbol
row
action
let
pair
algorithm
vector
signal
use
matrix
action
pair
difference
game
space
algorithm
use
calculate
action
loss
follow
outcome
action
let
game
game
prediction
matrix
action
cell
j
condition
action
cell
case
regret
case
t
algorithm
outcome
game
need
condition
estimate
algorithm
outcome
game
action
choosing
n
trivial
exists
case
t
l
matrix
h
t
easy
expect
action
algorithm
choose
expect
regret
game
loss
case
time
feedback
let
loss
action
note
outcome
space
action
feedback
action
hard
set
outcome
action
let
assume
low
bound
proof
regret
n
number
action
factor
regret
game
turn
efficient
algorithm
algorithm
opponent
action
run
time
action
follow
action
difference
turn
linear
figure
action
note
p
say
action
loss
set
game
pair
outcome
action
need
run
algorithm
base
j
action
choose
action
estimate
achieve
regret
opponent
t
regret
outcome
game
t
t
t
t
t
t
hopeless
game
outcome
p
action
j
cell
regret
t
loss
action
difference
expect
set
choose
action
j
space
action
close
bound
difference
algorithm
cell
action
positive
optimal
action
algorithm
outcome
game
minimax
regret
action
regret
expect
time
let
t
algorithm
game
action
choose
number
time
let
p
note
c
hard
feedback
learner
loss
think
case
paper
action
let
c
t
action
observable
cell
environment
hold
bound
algorithm
case
space
close
j
cell
action
base
loss
action
t
proof
exist
j
action
algorithm
feedback
p
strategy
probability
bound
proof
action
different
define
h
t
difference
let
upper
time
bound
algorithm
game
know
outcome
environment
open
question
base
game
feedback
example
problem
case
low
stochastic
bound
adversarial
proof
action
loss
outcome
game
outcome
action
regret
bound
low
action
number
j
outcome
action
difference
algorithm
loss
action
action
algorithm
efficient
use
action
proof
case
j
t
feedback
correspond
difference
action
hide
regret
loss
action
easy
arbitrary
vector
space
action
u
b
action
case
c
p
c
let
vector
probability
game
game
learner
loss
paper
say
pair
feedback
action
game
general
result
learner
t
action
time
bound
algorithm
know
algorithm
use
choose
regret
expect
bound
algorithm
game
game
finite
ok
p
p
regret
information
prediction
estimate
upper
bound
choose
algorithm
action
game
choose
outcome
partial
ok
game
constant
problem
r
consider
action
opponent
algorithm
bound
time
upper
proof
l
ok
h
use
action
l
use
p
version
action
matrix
game
action
game
action
different
action
paper
problem
result
stochastic
game
time
dynamic
pricing
action
action
game
distribution
outcome
environment
define
action
algorithm
action
point
strongly
monitoring
partial
opponent
say
figure
regret
loss
feedback
action
bound
result
p
use
step
loss
say
example
action
main
proof
bound
mean
upper
know
stochastic
action
main
algorithm
choose
game
dynamic
pricing
learner
algorithm
action
set
example
outcome
game
loss
let
action
outcome
distribution
think
partial
n
know
algorithm
use
action
space
time
choose
action
p
c
regret
outcome
game
optimal
ok
estimate
upper
bound
opponent
stochastic
distribution
action
choose
subspace
p
lie
use
action
information
game
ok
p
algorithm
condition
bound
time
action
follow
expect
regret
action
time
let
game
algorithm
loss
hold
exactly
action
n
let
c
vector
action
open
note
action
outcome
let
matrix
feedback
signal
assume
bound
game
action
algorithm
use
half
n
zero
bound
theorem
t
proof
action
cell
action
t
t
t
case
optimal
know
game
outcome
t
step
t
information
learner
action
number
set
bound
general
algorithm
p
space
action
cell
j
t
bound
minimax
regret
game
term
t
loss
set
action
correspond
p
action
algorithm
bound
condition
note
achieve
regret
t
loss
action
say
p
loss
action
difference
set
section
round
game
outcome
player
learner
game
action
l
u
proof
cell
j
t
base
choose
action
expect
base
algorithm
use
time
action
hold
outcome
action
distribution
signal
vector
matrix
choose
case
action
c
bound
upper
action
linear
regret
question
t
monitoring
partial
action
matrix
game
linear
efficient
algorithm
bandit
monitoring
partial
dynamic
pricing
linear
round
know
algorithm
number
say
action
regret
strategy
action
c
loss
learner
action
action
time
choose
action
game
action
outcome
n
game
regret
t
p
action
cell
choice
note
action
observability
condition
local
n
use
space
action
say
estimate
start
x
t
bound
p
bound
algorithm
upper
t
t
define
estimate
bound
outcome
case
example
game
expect
regret
achieve
use
action
outcome
choose
loss
distribution
action
game
action
set
case
condition
game
outcome
case
game
algorithm
step
regret
t
monitoring
partial
action
u
t
space
let
bound
question
linear
game
case
define
condition
proof
choose
bound
p
t
loss
game
let
action
l
note
p
deterministic
let
action
number
assume
action
start
simplex
cell
probability
observable
j
action
action
cell
regret
expect
remain
algorithm
bound
p
game
finite
ok
action
important
note
vector
loss
feedback
learner
result
game
r
let
action
choose
t
regret
monitoring
partial
action
number
let
j
j
different
action
assume
feedback
action
t
expect
pair
loss
action
let
difference
r
regret
expect
game
strongly
action
case
game
strategy
action
let
p
time
c
loss
think
action
algorithm
result
true
bound
condition
algorithm
bound
action
time
proof
maintain
algorithm
bound
regret
l
action
h
u
action
choose
section
distribution
note
action
choose
action
note
time
p
let
vector
probability
algorithm
action
choose
loss
choose
feedback
time
game
follow
t
p
t
game
prediction
action
algorithm
condition
environment
ok
algorithm
space
positive
set
loss
outcome
expect
action
difference
exactly
hold
action
j
t
regret
expect
bound
difference
main
use
know
estimate
t
arbitrary
game
loss
p
learner
action
choose
game
game
ok
p
estimate
round
let
algorithm
use
n
regret
loss
game
feedback
action
c
t
game
vector
use
t
stochastic
adversarial
bandit
minimax
feedback
loss
set
matrix
game
correspond
expect
time
choose
game
linear
action
action
monitoring
space
matrix
t
achieve
game
condition
regret
algorithm
t
t
algorithm
run
regret
n
algorithm
j
action
choose
action
note
action
estimate
use
algorithm
game
loss
monitoring
partial
t
t
action
action
manner
action
set
n
t
t
t
t
t
t
cell
algorithm
j
case
choose
feedback
bound
game
mean
game
t
achieve
ok
condition
regret
game
algorithm
action
outcome
algorithm
regret
n
algorithm
game
regret
action
choose
time
game
algorithmic
classification
finite
loss
action
information
outcome
algorithm
case
j
action
algorithm
cell
distribution
p
game
main
base
action
u
l
p
probability
use
action
let
loss
game
want
constant
learner
difference
feedback
action
difference
estimate
case
bound
upper
know
bound
upper
c
action
p
loss
game
pair
difference
subspace
p
u
action
hard
set
easy
condition
action
proof
action
time
let
feedback
action
learner
r
loss
outcome
action
game
action
j
regret
t
case
action
hard
rate
regret
minimax
feedback
distribution
j
proof
game
distribution
outcome
algorithm
expect
action
bound
low
distinguish
outcome
j
action
environment
feedback
matrix
zero
signal
vector
action
let
c
note
action
base
time
stochastic
game
action
choose
algorithm
bound
game
set
outcome
hard
action
t
outcome
game
action
choose
action
algorithm
cell
action
p
choose
action
base
monitoring
partial
game
set
action
p
matrix
signal
vector
action
outcome
choose
feedback
hold
j
action
case
loss
feedback
game
outcome
game
estimate
neighbor
bound
proof
t
step
algorithm
bound
p
p
case
c
action
matrix
signal
case
game
feedback
action
x
order
use
label
efficient
outcome
feedback
case
feedback
loss
game
bound
use
upper
n
environment
case
t
loss
feedback
game
condition
algorithm
p
previous
algorithm
action
regret
question
second
case
time
proof
function
feedback
let
difference
time
bound
need
stochastic
use
proof
deal
upper
function
say
action
estimate
t
action
outcome
action
feedback
zero
p
case
j
point
let
assume
infinite
case
action
probability
game
action
action
feedback
loss
game
distribution
action
choose
feedback
j
outcome
feedback
environment
want
action
hard
action
case
algorithm
base
p
action
trivial
game
action
number
time
probability
regret
know
bound
outcome
ok
remain
algorithm
result
use
action
space
hold
set
n
game
classification
finite
case
choose
time
action
bound
upper
time
algorithm
action
say
action
number
condition
bound
zero
proof
n
base
know
game
action
j
cell
loss
outcome
action
matrix
game
define
minimax
stochastic
environment
stochastic
game
t
action
set
let
algorithm
work
base
mean
choose
c
run
algorithm
action
follow
define
vector
space
matrix
zero
proof
bound
upper
n
main
result
matrix
kernel
vector
step
action
choose
time
matrix
zero
constant
problem
linear
bandit
efficient
monitoring
partial
lower
bound
x
zero
bound
t
action
use
probability
define
regret
r
r
r
section
regret
time
game
outcome
action
step
outcome
action
cell
row
space
l
h
use
minimax
stochastic
j
j
action
case
condition
cell
action
proof
action
outside
set
correspond
p
p
action
follow
action
t
t
condition
choose
t
p
algorithm
bound
base
zero
environment
ok
hard
ok
p
c
environment
important
action
proof
algorithm
bound
t
t
action
game
condition
set
example
choose
game
t
t
opponent
bound
upper
regret
regret
t
thank
partial
j
bound
ok
p
know
regret
expect
let
bound
feedback
follow
half
action
set
equivalent
distribution
far
n
proof
j
action
second
p
estimate
algorithm
choose
time
action
p
true
use
case
algorithm
action
decide
use
t
t
loss
regret
algorithm
t
game
p
t
estimate
round
know
note
let
loss
positive
case
note
p
action
probability
let
row
action
p
pair
game
r
action
hard
environment
information
algorithm
t
feedback
choice
action
follow
example
learner
stochastic
r
feedback
action
game
cell
action
t
try
distribution
p
step
t
p
proof
p
p
loss
note
game
action
action
game
say
action
proof
action
information
vector
second
information
action
example
constant
matrix
action
set
game
ok
j
cell
j
consider
know
manner
follow
stochastic
area
figure
base
know
true
p
round
remain
pair
t
regret
game
know
set
regret
bound
know
difference
mean
constant
problem
result
stochastic
case
expect
estimate
use
feedback
difference
proof
algorithm
bound
j
want
outcome
action
environment
difference
feedback
t
t
matrix
action
action
think
feedback
correspond
set
action
choose
choice
j
bound
neighbor
constant
p
note
c
action
algorithm
bound
action
game
action
difference
feedback
case
r
action
action
work
action
result
pair
algorithm
base
action
set
matrix
game
regret
opponent
stochastic
algorithm
pair
game
case
r
action
deterministic
random
function
feedback
let
action
u
p
computation
algorithm
remain
algorithm
case
t
t
opponent
algorithm
bound
use
upper
n
n
n
n
corresponding
p
note
probability
algorithm
case
bound
p
main
t
environment
base
action
number
proof
bound
action
outcome
action
want
game
feedback
random
let
assume
action
time
action
vector
signal
use
matrix
p
c
bound
upper
n
label
make
efficient
j
monitoring
partial
game
game
outcome
action
general
action
say
number
t
p
algorithm
bound
use
t
game
matrix
use
feedback
zero
algorithm
bound
time
action
eliminate
t
action
pair
let
algorithm
regret
game
action
time
choose
expect
t
p
mean
p
t
algorithm
regret
action
ok
action
algorithm
case
choose
p
learner
stochastic
p
j
pair
use
c
action
let
action
action
let
p
vector
p
t
loss
game
feedback
ok
game
j
t
t
bound
mean
x
t
case
action
matrix
action
note
use
vector
information
think
feedback
action
note
actual
action
choose
action
u
note
bandit
stochastic
minimax
action
loss
action
half
loss
note
algorithm
step
t
action
j
loss
game
action
r
action
game
distribution
time
outcome
game
loss
difference
expect
action
expect
time
action
choose
remain
regret
game
algorithm
choose
bound
time
action
game
x
p
t
p
t
p
t
hold
set
t
t
t
t
t
t
t
p
bound
result
know
use
stochastic
case
algorithm
base
bound
u
trivial
bound
environment
proof
bound
game
game
ok
j
action
algorithm
proof
bound
j
bandit
c
p
action
let
half
say
action
feedback
game
estimate
regret
action
j
j
j
j
p
time
correspond
action
choose
proof
action
p
partial
j
action
half
true
hard
decide
algorithm
case
bound
trivial
action
matrix
signal
feedback
step
proof
choose
condition
bound
t
action
algorithm
use
p
p
c
programming
bandit
prediction
dynamic
pricing
t
p
action
assume
let
game
apply
bandit
algorithm
use
probability
action
p
constant
expect
action
time
game
p
action
set
bound
p
example
cell
use
t
vector
probability
optimal
half
r
p
positive
bound
space
leave
game
know
action
hard
proof
problem
feedback
regret
bound
upper
arbitrary
u
feedback
action
note
area
hard
ok
p
matrix
action
t
say
action
game
let
know
game
monitoring
bound
algorithm
choose
let
p
probability
choose
difference
feedback
action
zero
choose
c
u
action
number
t
pair
action
u
main
algorithm
algorithm
game
game
start
probability
p
half
remain
p
t
feedback
action
area
action
second
c
action
c
vector
note
time
case
action
p
p
u
p
low
action
regret
matrix
game
case
feedback
let
define
bound
low
stochastic
action
action
use
u
bound
p
t
proof
p
ok
game
mean
pair
difference
game
let
action
time
n
role
base
time
let
constant
r
t
main
cell
space
bound
t
action
assume
let
j
action
game
loss
difference
game
choose
action
bound
game
information
know
trivial
action
bound
t
action
t
t
difference
algorithm
regret
p
feedback
game
general
choose
p
probability
note
p
t
time
action
use
regret
action
constant
area
t
estimate
know
u
t
let
action
j
b
c
action
let
action
game
result
action
action
let
assume
step
expect
outcome
time
game
case
action
t
hold
choose
action
p
t
p
t
loss
case
t
cell
note
action
environment
ok
p
hard
t
regret
easy
u
action
c
algorithm
j
previous
base
mean
constant
bound
mean
define
game
mean
theorem
p
t
action
time
p
c
p
algorithm
j
u
proof
p
loss
game
action
row
algorithm
try
use
n
action
efficient
algorithm
r
base
action
define
bound
p
t
time
action
ok
p
outcome
action
p
p
j
action
loss
difference
expect
game
exists
case
exactly
action
follow
game
game
j
game
action
j
action
game
p
action
game
t
base
information
ok
loss
vector
algorithm
use
game
action
number
action
action
matrix
set
correspond
game
regret
mean
result
time
p
game
ok
action
p
c
bound
upper
x
p
c
upper
bound
t
action
action
n
c
c
c
t
proof
action
loss
action
note
set
let
algorithm
n
t
define
t
p
probability
action
n
trivial
action
expect
choose
game
algorithm
set
bandit
space
probability
p
t
t
set
action
exists
hard
work
ok
p
action
choose
probability
p
follow
t
t
action
algorithm
t
p
u
distribution
t
round
t
role
p
efficient
ok
define
p
action
extension
game
p
use
constant
u
feedback
j
j
action
game
classification
j
bound
p
game
action
section
ok
p
p
probability
case
game
time
algorithm
j
hold
let
j
role
time
game
loss
game
j
j
action
case
mean
r
choose
regret
r
action
matrix
signal
assume
let
algorithm
feedback
mean
c
feedback
p
action
action
correspond
p
t
p
feedback
action
mean
bandit
algorithm
game
case
time
c
estimate
j
exists
case
c
t
t
t
t
t
t
j
monitoring
case
c
bandit
r
optimal
t
p
t
p
feedback
algorithm
j
action
difference
mean
r
action
game
case
c
use
n
t
p
opponent
proof
ok
game
efficient
ok
p
p
exists
loss
matrix
game
r
p
action
expect
feedback
algorithm
bound
time
probability
j
j
vector
t
define
operator
know
let
know
regret
bound
time
action
linear
zero
proof
action
let
game
efficient
regret
bandit
feedback
remain
use
generalize
action
game
regret
monitoring
partial
regret
p
regret
n
expect
use
p
let
know
p
note
information
u
note
bandit
algorithm
t
t
let
time
constant
hand
use
probability
neighbor
note
action
probability
action
feedback
assume
let
action
matrix
use
t
let
r
t
matrix
signal
t
upper
bound
algorithm
action
time
define
p
c
loss
game
define
p
p
know
n
p
t
t
expect
game
choose
information
case
game
time
action
t
base
action
ok
p
p
let
information
know
note
game
t
neighbor
algorithm
action
set
t
t
t
p
define
c
c
generalize
let
time
t
case
n
space
bound
r
action
set
use
zero
t
p
t
action
assume
game
proof
action
t
action
p
ok
ok
game
t
know
bound
follow
know
note
ok
n
constant
algorithm
j
t
n
ok
ok
game
bound
j
choose
proof
p
hand
p
j
p
n
bound
action
know
hold
mean
t
let
p
t
information
j
ok
game
c
feedback
action
algorithm
t
algorithm
action
zero
theorem
need
know
case
time
game
algorithm
outside
time
j
n
p
information
outside
choose
pair
algorithm
action
define
bandit
r
j
t
bandit
algorithm
case
game
action
p
t
n
t
action
hold
action
say
action
game
p
t
t
n
action
j
t
strategy
time
t
action
p
algorithm
let
know
note
action
game
p
time
n
u
p
let
algorithm
feedback
j
j
j
main
action
game
p
zero
exists
use
ok
game
optimal
action
action
note
follow
action
r
feedback
t
use
time
p
say
set
action
probability
j
j
n
use
bound
regret
c
r
action
action
probability
time
algorithm
j
action
algorithm
use
use
linear
vector
bound
let
r
p
p
algorithm
stochastic
game
matrix
t
result
action
random
n
linear
feedback
bandit
action
bound
feedback
let
action
r
time
game
r
game
r
p
n
t
true
algorithm
bound
know
note
t
problem
game
feedback
information
assume
case
bandit
use
ok
t
r
action
difference
game
t
ok
p
p
j
t
c
j
j
distribution
linear
action
ok
bandit
t
t
algorithm
bandit
set
information
t
game
game
c
p
action
let
c
c
n
know
ok
p
p
p
n
n
t
t
action
regret
let
action
j
n
difference
constant
feedback
j
ok
n
action
j
p
t
game
stochastic
pair
n
n
n
case
t
algorithm
action
choose
regret
use
p
p
p
p
information
p
expect
information
t
p
p
n
n
algorithm
action
game
algorithm
note
space
time
r
action
ok
p
n
p
p
case
know
probability
expect
feedback
game
n
c
let
know
n
n
problem
game
information
correspond
know
p
bandit
proof
case
base
c
pair
feedback
game
p
let
use
n
algorithm
c
set
algorithm
optimal
t
p
use
game
bound
pair
r
let
action
use
probability
p
game
x
feedback
action
use
monitoring
base
p
bandit
p
r
j
c
r
j
matrix
c
r
p
know
game
case
r
algorithm
game
n
r
j
c
r
time
r
r
set
action
matrix
c
game
ok
n
let
n
set
bandit
action
ok
game
game
j
j
let
set
matrix
p
strongly
action
let
follow
n
t
p
action
j
use
game
time
c
game
action
c
note
n
t
feedback
regret
r
case
game
n
t
know
p
n
t
p
game
r
let
set
n
p
action
game
p
p
let
r
action
let
know
action
p
proof
c
n
n
j
n
game
game
bound
let
p
know
action
r
game
action
c
r
c
p
n
constant
know
n
r
n
algorithm
game
action
p
let
c
case
c
game
n
n
choose
time
game
note
t
p
algorithm
matrix
p
t
bound
c
note
n
c
p
p
game
game
game
algorithm
r
n
r
action
c
game
let
n
r
n
p
game
game
action
n
action