JMLR: Workshop and Conference Proceedings vol (2010) 1–21        24th Annual Conference on Learning Theory
Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments
Gábor Bartók    bartok@cs.ualberta.ca
Dávid Pál    dpal@cs.ualberta.ca
Csaba Szepesvári    szepesva@cs.ualberta.ca
Department of Computing Science, University of Alberta, Edmonton, T6G 2E8, AB, Canada
Editors : Sham Kakade , Ulrike von Luxburg
Abstract
In a partial monitoring game , the learner repeatedly chooses an action , the environment responds with an outcome , and then the learner suffers a loss and receives a feedback signal , both of which are fixed functions of the action and the outcome . The goal of the learner is to minimize his regret , which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight . Assuming that the outcomes are generated in an i.i.d . fashion from an arbitrary and unknown probability distribution , we characterize the minimax regret of any partial monitoring game with finitely many actions
and outcomes. It turns out that the minimax regret of any such game is either zero, Θ(√T), Θ(T^{2/3}), or Θ(T). We provide a computationally efficient learning algorithm that achieves the minimax regret within a logarithmic factor for any game.
Keywords: Online learning, Imperfect feedback, Regret analysis
1 . Introduction
Partial monitoring provides a mathematical framework for sequential decision making problems with imperfect feedback . Various problems of interest can be modeled as partial monitoring instances , such as learning with expert advice ( Littlestone and Warmuth , 1994 ) , the multi-armed bandit problem ( Auer et al . , 2002 ) , dynamic pricing ( Kleinberg and Leighton , 2003 ) , the dark pool problem ( Agarwal et al . , 2010 ) , label efficient prediction ( Cesa-Bianchi et al . , 2005 ) , and linear and convex optimization with full or bandit feedback ( Zinkevich , 2003 ; Abernethy et al . , 2008 ; Flaxman et al . , 2005 ) .
In this paper we restrict ourselves to finite games, i.e., games where both the set of actions available to the learner and the set of possible outcomes generated by the environment are finite. A finite partial monitoring game G is described by a pair of N × M matrices: the loss matrix L and the feedback matrix H. The entries ℓ_{i,j} of L are real numbers lying in, say, the interval [0, 1]. The entries h_{i,j} of H belong to an alphabet Σ on which we do not impose any structure; we only assume that the learner is able to distinguish distinct elements of the alphabet.
The game proceeds in T rounds according to the following protocol. First, G = (L, H) is announced to both players. In each round t = 1, 2, ..., T, the learner chooses an action I_t ∈ {1, 2, ..., N} and simultaneously, the environment chooses an outcome J_t ∈ {1, 2, ..., M}. Then, the learner receives as feedback the entry h_{I_t, J_t}. The learner incurs the instantaneous loss ℓ_{I_t, J_t}, which is not revealed to him. The feedback can be thought of as masked information about the outcome J_t. In some cases h_{I_t, J_t} might uniquely determine the outcome, in other cases the feedback might give only partial or no information about the outcome. In this paper, we shall assume that J_t is chosen randomly from a fixed multinomial distribution.
This work was supported in part by AICML, AITF (formerly iCore and AIF), NSERC and the PASCAL2 Network of Excellence under EC grant no. 216886.
© 2010 G. Bartók, D. Pál & C. Szepesvári.
The learner is scored according to the loss matrix L. In round t the learner incurs an instantaneous loss of ℓ_{I_t, J_t}. The goal of the learner is to keep his total loss Σ_{t=1}^T ℓ_{I_t, J_t} low. Equivalently, the learner's performance can also be measured in terms of his regret, i.e., the total loss of the learner is compared with the loss of the best fixed action in hindsight. The regret is defined as the difference of these two losses.
In general , the regret grows with the number of rounds T . If the regret is sublinear in T , the learner is said to be Hannan consistent , and this means that the learner’s average per-round loss approaches the average per-round loss of the best action in hindsight .
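The protocol and the regret are easy to simulate. The sketch below is a minimal illustration, not from the paper; the game matrices, the outcome distribution, and the uniformly random learner are all hypothetical choices. It plays T rounds and estimates the regret against the action with the lowest expected loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 3-action, 2-outcome game (illustrative values only).
L = np.array([[0.0, 1.0],          # loss matrix, entries in [0, 1]
              [1.0, 0.0],
              [0.5, 0.5]])
H = np.array([["a", "b"],          # feedback matrix over an alphabet Sigma
              ["c", "c"],
              ["d", "e"]])
p = np.array([0.3, 0.7])           # unknown multinomial outcome distribution

T = 10_000
total_loss = 0.0
for t in range(T):
    I_t = rng.integers(L.shape[0])       # learner's action (here chosen uniformly at random)
    J_t = rng.choice(L.shape[1], p=p)    # i.i.d. outcome drawn by the environment
    feedback = H[I_t, J_t]               # only this symbol would be revealed to the learner
    total_loss += L[I_t, J_t]            # the loss itself stays hidden from the learner

best_fixed = T * np.min(L @ p)           # expected total loss of the best fixed action
print("estimated regret:", total_loss - best_fixed)
```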
Piccolboni and Schindelhauer (2001) were among the first to study the regret of these games. In fact, they studied the problem without making any probabilistic assumptions about the outcome sequence J_t. They proved that for any finite game (L, H), either the regret can be Ω(T) in the worst case for any algorithm, or there exists an algorithm whose regret is O(T^{3/4}) on any outcome sequence.¹ This result was later improved by Cesa-Bianchi et al. (2006), who showed that the algorithm of Piccolboni and Schindelhauer has regret O(T^{2/3}). Furthermore, they provided an example of a finite game, a variant of label-efficient prediction, for which any algorithm has regret Θ(T^{2/3}) in the worst case.
However, for many games O(T^{2/3}) is not optimal. For example, games with full feedback (i.e., when the feedback uniquely determines the outcome) can be viewed as a special instance of the problem of learning with expert advice, and in this case it is known that the "EWA forecaster" has regret O(√T); see e.g., Lugosi and Cesa-Bianchi (2006, Chapter 3). Similarly, for games with "bandit feedback" (i.e., when the feedback determines the instantaneous loss) the INF algorithm (Audibert and Bubeck, 2009) and the Exp3 algorithm (Auer et al., 2002) achieve O(√T) regret as well.²
This leaves open the problem of determining the minimax regret (i.e., optimal worst-case regret) of any given game (L, H). Partial progress was made in this direction by Bartók et al. (2010), who characterized (almost) all finite games with M = 2 outcomes. They showed that the minimax regret of any "non-degenerate" finite game with two outcomes falls into one of four categories: zero, Θ(√T), Θ(T^{2/3}) or Θ(T). They gave a combinatoric-geometric condition on the matrices L, H which determines the category a game belongs to. Additionally, they constructed an efficient algorithm which, for any game, achieves the minimax regret rate associated to the game within a poly-logarithmic factor.
In this paper , we consider the same problem , with two exceptions . In pursuing a general result , we will consider all finite games . However , at the same time , we will only deal with stochastic environments , i.e. , when the outcome sequences are generated from a fixed probability distribution in an i.i.d . manner .
1. The notations O(·) and Θ(·) hide polylogarithmic factors.
2. We ignore the dependence of the regret on the number of actions or any other parameters.
The regret against stochastic environments is defined as the difference between the cumulative loss suffered by the algorithm and that of the action with the lowest expected loss . That is , given an algorithm A and a time horizon T , if the outcomes are generated from a probability distribution p , the regret is
\[
R_T(A, p) = \sum_{t=1}^{T} \ell_{I_t, J_t} - \min_{1 \le i \le N} \mathbb{E}_p\!\left[ \sum_{t=1}^{T} \ell_{i, J_t} \right].
\]
In this paper we analyze the minimax expected regret ( in what follows , minimax regret ) of games , defined as
\[
R_T(G) = \inf_{A} \sup_{p \in \Delta_M} \mathbb{E}_p\left[ R_T(A, p) \right].
\]
We show that the minimax regret of any finite game falls into four categories: zero, Θ(√T), Θ(T^{2/3}), or Θ(T). Accordingly, we call the games trivial, easy, hard, and hopeless. We give a simple and efficiently computable characterization of these classes using a geometric condition on (L, H). We provide lower bounds, and algorithms that achieve them within a poly-logarithmic factor. Our result is an extension of the result of Bartók et al. (2010) to stochastic environments.
It is clear that any lower bound which holds for stochastic environments must hold for adversarial environments too . On the other hand , algorithms and regret upper bounds for stochastic environments , of course , do not transfer to algorithms and regret upper bounds for the adversarial case . Our characterization is a stepping stone towards understanding the minimax regret of partial monitoring games . In particular , we conjecture that our characterization holds without any change for unrestricted environments .
2 . Preliminaries
In this section, we introduce our conventions, along with some definitions. By default, all vectors are column vectors. We denote by ‖v‖ = √(v⊤v) the Euclidean norm of a vector v. For a vector v, the notation v ≥ 0 means that all entries of v are non-negative, and the notation v > 0 means that all entries are positive. For a matrix A, Im A denotes its image space, i.e., the vector space generated by its columns, and Ker A denotes its kernel, i.e., the set {x : Ax = 0}.
Consider a game G = (L, H) with N actions and M outcomes. That is, L ∈ ℝ^{N×M} and H ∈ Σ^{N×M}. For the sake of simplicity and without loss of generality, we assume that no symbol σ ∈ Σ can be present in two different rows of H. The signal matrix of an action is defined as follows:
Definition 1 (Signal matrix) Let {σ_1, ..., σ_{s_i}} be the set of symbols listed in the i-th row of H. (Thus, s_i denotes the number of different symbols in row i of H.) The signal matrix S_i of action i is defined as an s_i × M matrix with entries a_{k,j} = 𝕀(h_{i,j} = σ_k) for 1 ≤ k ≤ s_i and 1 ≤ j ≤ M. The signal matrix for a set of actions is defined as the signal matrices of the actions in the set, stacked on top of one another, in the ordering of the actions.
For an example of a signal matrix, see Section 3.1. We identify the strategy of a stochastic opponent with an element of the probability simplex Δ_M = {p ∈ ℝ^M : p ≥ 0, Σ_{j=1}^M p_j = 1}. Note that for any opponent strategy p, if the learner chooses action i then the vector S_i p ∈ ℝ^{s_i} is the probability distribution of the observed feedback: (S_i p)_k is the probability of observing the k-th symbol.
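Definition 1 is easy to mechanize. The following sketch (illustrative only; the feedback row and the strategy p are hypothetical) builds S_i from row i of H and checks that S_i p is the distribution of the observed symbol.

```python
import numpy as np

def signal_matrix(h_row):
    """Rows of S_i are indicator vectors of the distinct symbols in row i of H."""
    symbols = sorted(set(h_row))                 # the s_i distinct symbols of the row
    S = np.array([[1.0 if h == s else 0.0 for h in h_row] for s in symbols])
    return S, symbols

h_row = ["a", "b", "a", "c"]                     # hypothetical feedback row (M = 4)
S_i, symbols = signal_matrix(h_row)
p = np.array([0.1, 0.4, 0.2, 0.3])               # an opponent strategy in the simplex

print(symbols)   # ['a', 'b', 'c']
print(S_i @ p)   # probability of observing each symbol when action i is played: [0.3 0.4 0.3]
```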
We denote by ℓ_i the i-th row of the loss matrix L and we call ℓ_i the loss vector of action i. We say that action i is optimal under opponent strategy p ∈ Δ_M if for any 1 ≤ j ≤ N, ℓ_i⊤p ≤ ℓ_j⊤p. Action i is said to be Pareto-optimal if there exists an opponent strategy p such that action i is optimal under p. We now define the cell decomposition of Δ_M induced by L (for an example, see Figure 2):
Definition 2 (Cell decomposition) For an action i, the cell C_i associated with i is defined as C_i = {p ∈ Δ_M : action i is optimal under p}. The cell decomposition of Δ_M is defined as the multiset C = {C_i : 1 ≤ i ≤ N, C_i has positive (M−1)-dimensional volume}.
Actions whose cell is of positive (M−1)-dimensional volume are called strongly Pareto-optimal. Actions that are Pareto-optimal but not strongly Pareto-optimal are called degenerate. Note that the cells of the actions are defined with linear inequalities and thus they are convex polytopes. It follows that strongly Pareto-optimal actions are the actions whose cells are (M−1)-dimensional polytopes. It is also important to note that the cell decomposition is a multiset, since some actions can share the same cell. Nevertheless, if two actions have the same cell of dimension (M−1), their loss vectors will necessarily be identical.³
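A crude but convenient way to see which actions are strongly Pareto-optimal is to sample opponent strategies uniformly from the simplex and record which action is optimal: cells with positive (M−1)-dimensional volume are hit with positive probability, while degenerate and non-Pareto-optimal actions are (almost) never hit. The sketch below is only an illustrative approximation with a hypothetical loss matrix; an exact test would use linear programming.

```python
import numpy as np

rng = np.random.default_rng(0)

L = np.array([[0.0, 1.0, 2.0],     # hypothetical loss matrix with N = 4 actions, M = 3 outcomes
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0],
              [0.9, 0.9, 0.9]])

hits = np.zeros(L.shape[0], dtype=int)
for _ in range(100_000):
    p = rng.dirichlet(np.ones(L.shape[1]))   # uniform sample from the probability simplex
    hits[np.argmin(L @ p)] += 1              # which cell does p fall into?

print(hits / hits.sum())   # strictly positive frequency <=> strongly Pareto-optimal (approximately)
```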
We call two cells of C neighbors if their intersection is an (M−2)-dimensional polytope. The actions corresponding to these cells will also be called neighbors. Neighborship is not defined for cells outside of C. For two neighboring cells C_i, C_j ∈ C, we define the neighborhood action set A_{i,j} = {1 ≤ k ≤ N : C_i ∩ C_j ⊆ C_k}. It follows from the definition that actions i and j are in A_{i,j}, and thus A_{i,j} is nonempty. However, one can have more than two actions in the neighborhood action set.
When discussing lower bounds we will need the definition of algorithms. For us, an algorithm A is a mapping A : Σ* → {1, 2, ..., N} which maps past feedback sequences to actions. That the algorithms are deterministic is assumed for convenience. In particular, the lower bounds we prove can be extended to randomized algorithms by conditioning on the internal randomization of the algorithm. Note that the algorithms we design are themselves deterministic.
3 . Classification of finite partial-monitoring games
In this section we present our main result : we state the theorem that classifies all finite stochastic partial-monitoring games based on how their minimax regret scales with the time horizon . Thanks to the previous section , we are now equipped to define a notion which will play a key role in the classification theorem:
3 . One could think that actions with identical loss vectors are redundant and that all but one of such actions could be removed without loss of generality . However , since different actions can lead to different observations and thus yield different information , removing the duplicates can be harmful .
Definition 3 (Observability) Let S be the signal matrix for the set of all actions in the game. For actions i and j, we say that ℓ_i − ℓ_j is globally observable if ℓ_i − ℓ_j ∈ Im S⊤. Furthermore, if i and j are two neighboring actions, then ℓ_i − ℓ_j is called locally observable if ℓ_i − ℓ_j ∈ Im S⊤_{(i,j)}, where S_{(i,j)} is the signal matrix for the neighborhood action set A_{i,j}.
As we will see , global observability implies that we can estimate the difference of the expected losses after choosing each action once . Local observability means we only need actions from the neighborhood action set to estimate the difference .
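Both observability conditions are linear-algebraic and can be checked numerically: ℓ_i − ℓ_j ∈ Im S⊤ holds exactly when the least-squares solution of S⊤x = ℓ_i − ℓ_j has zero residual. A sketch with hypothetical inputs:

```python
import numpy as np

def in_image(S, d, tol=1e-9):
    """Return True if d lies in Im(S^T), i.e. S^T x = d has a solution."""
    x, *_ = np.linalg.lstsq(S.T, d, rcond=None)
    return bool(np.linalg.norm(S.T @ x - d) <= tol)

# Hypothetical stacked signal matrix of an action set and a loss difference.
S = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
d = np.array([1.0, -2.0, 1.0])     # plays the role of l_i - l_j

print(in_image(S, d))   # True: this loss difference is observable from the given action set
```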
The classification theorem , which is our main result , is the following:
Theorem 4 (Classification) Let G = (L, H) be a partial-monitoring game with N actions and M outcomes. Let C = {C_1, ..., C_k} be its cell decomposition, with corresponding loss vectors ℓ_1, ..., ℓ_k. The game G falls into one of the following four categories:
(a) R_T(G) = 0 if there exists an action i with C_i = Δ_M. This case is called trivial.
(b) R_T(G) = Θ(T) if there exist two strongly Pareto-optimal actions i and j such that ℓ_i − ℓ_j is not globally observable. This case is called hopeless.
(c) R_T(G) = Θ(√T) if it is not trivial and for all pairs of (strongly Pareto-optimal) neighboring actions i and j, ℓ_i − ℓ_j is locally observable. These games are called easy.
(d) R_T(G) = Θ(T^{2/3}) if G is not hopeless and there exists a pair of neighboring actions i and j such that ℓ_i − ℓ_j is not locally observable. These games are called hard.
Note that the conditions listed under (a)–(d) are mutually exclusive and cover all finite partial-monitoring games. The only non-obvious implication is that if a game is easy then it cannot be hopeless. The reason this holds is that for any pair of cells C_i, C_j in C, the vector ℓ_i − ℓ_j can be expressed as a telescoping sum of the differences of loss vectors of neighboring cells.
The remainder of the paper is dedicated to proving Theorem 4. We start with the simple cases. If there exists an action whose cell covers the whole probability simplex then choosing that action in every round will yield zero regret, proving case (a). The condition in case (b) is due to Piccolboni and Schindelhauer (2001), who showed that under the condition mentioned there, there is no algorithm that achieves sublinear regret.⁴ The upper bound for case (d) is achieved by the FeedExp3 algorithm due to Piccolboni and Schindelhauer (2001), for which a regret bound of O(T^{2/3}) was shown by Cesa-Bianchi et al. (2006). The lower bound for case (c) was proved by Antos et al. (2011). For a visualization of previous results, see Figure 1.
The above assertions help characterize trivial and hopeless games, and show that if a game is not trivial and not hopeless then its minimax regret falls between Ω(√T) and O(T^{2/3}). Our contribution in this paper is that we give exact minimax rates (up to logarithmic factors) for these games. To prove the upper bound for case (c), we introduce a new algorithm, which we call Balaton, for "Bandit Algorithm for Loss Annihilation".⁵ This algorithm is presented in Section 4, while its analysis is given in Section 5. The lower bound for case (d) is presented in Section 6.
4. Although Piccolboni and Schindelhauer state their theorem for adversarial environments, their proof applies to stochastic environments without any change (which is important for the lower bound part).
5. Balaton is a lake in Hungary. We thank Gergely Neu for suggesting the name.
[Figure 1 depicts the set of all games partitioned, from left to right by minimax regret, into trivial, easy, hard, and hopeless regions; full-information and bandit games are marked as easy, label-efficient prediction (l.e.p.) as hard, and dynamic pricing sits in a grey area between easy and hard.]
Figure 1: Partial monitoring games and their minimax regret as it was known previously. The big rectangle denotes the set of all games. Inside the big rectangle, the games are ordered from left to right based on their minimax regret. In the "hard" area, l.e.p. denotes label-efficient prediction. The grey area contains games whose minimax regret is between Ω(√T) and O(T^{2/3}) but whose exact regret rate was unknown. This area is now eliminated, and the dynamic pricing problem is proven to be hard.
3.1 . Example
In this section, as a corollary of Theorem 4, we show that the discretized dynamic pricing game (see, e.g., Cesa-Bianchi et al. (2006)) is hard. Dynamic pricing is a game between a vendor (learner) and a customer (environment). In each round, the vendor sets the price he wants to sell his product at (action), and the customer sets the maximum price at which he is willing to buy the product (outcome). If the product is not sold, the vendor suffers some constant loss; otherwise his loss is the difference between the customer's maximum price and his own price. The customer never reveals the maximum price, and thus the vendor's only feedback is whether he sold the product or not.
The discretized version of the game with N actions (and outcomes) is defined by the matrices
\[
L = \begin{pmatrix}
0 & 1 & 2 & \cdots & N-1 \\
c & 0 & 1 & \cdots & N-2 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
c & \cdots & c & 0 & 1 \\
c & \cdots & \cdots & c & 0
\end{pmatrix},
\qquad
H = \begin{pmatrix}
1 & \cdots & \cdots & \cdots & 1 \\
0 & 1 & \cdots & \cdots & 1 \\
\vdots & \ddots & \ddots & & \vdots \\
0 & \cdots & 0 & 1 & 1 \\
0 & \cdots & \cdots & 0 & 1
\end{pmatrix},
\]
where c is a positive constant (see Figure 2 for the cell decomposition for N = 3). It is easy to see that all the actions are strongly Pareto-optimal. Also, after some linear algebra it turns out that the cells underlying the actions have a single common vertex in the interior of the probability simplex. It follows that any two actions are neighbors. On the other hand, if we take two non-consecutive actions i and i′, then ℓ_i − ℓ_{i′} is not locally observable. For example, the signal matrix for action 1 and action N is
\[
S_{(1,N)} = \begin{pmatrix}
1 & \cdots & 1 & 1 \\
1 & \cdots & 1 & 0 \\
0 & \cdots & 0 & 1
\end{pmatrix},
\]
whereas ℓ_N − ℓ_1 = (c, c − 1, ..., c − N + 2, −N + 1)⊤. It is obvious that ℓ_N − ℓ_1 is not in the row space of S_{(1,N)}.
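For concreteness, the two matrices of the discretized dynamic pricing game can be generated programmatically. The sketch below is an illustration (the function name and the choice c = 2 are ours); it reproduces the matrices displayed above for any N.

```python
import numpy as np

def dynamic_pricing(N, c):
    """Loss and feedback matrices of the discretized dynamic pricing game.

    Action i is the price set by the vendor, outcome j is the customer's maximum
    price.  If j >= i the product is sold and the loss is j - i; otherwise the
    loss is the constant c.  The feedback only reveals whether a sale happened.
    """
    L = np.empty((N, N))
    H = np.empty((N, N), dtype=int)
    for i in range(N):
        for j in range(N):
            sold = j >= i
            L[i, j] = (j - i) if sold else c
            H[i, j] = 1 if sold else 0
    return L, H

L, H = dynamic_pricing(3, c=2.0)
print(L)   # [[0. 1. 2.]  [2. 0. 1.]  [2. 2. 0.]]
print(H)   # [[1 1 1]  [0 1 1]  [0 0 1]]
```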
Figure 2 : The cell decomposition of the discretized dynamic pricing game with 3 actions . If the opponent strategy is p , then action 2 is the optimal action .
4. Balaton: An algorithm for easy games
In this section we present our algorithm that achieves O(√T) expected regret for easy games (case (c) of Theorem 4). The input of the algorithm is the loss matrix L, the feedback matrix H, the time horizon T and an error probability δ, to be chosen later. Before describing the algorithm, we introduce some notation. We define a graph 𝒢 associated with game G in the following way. Let the vertex set be the set of cells of the cell decomposition C of the probability simplex, with cells C_i, C_j ∈ C sharing the same vertex when C_i = C_j. The graph has an edge between vertices whose corresponding cells are neighbors. This graph is connected, since the probability simplex is convex and the cell decomposition covers the simplex.
Recall that for neighboring cells C_i, C_j, the signal matrix S_{(i,j)} is defined as the signal matrix for the neighborhood action set A_{i,j} of cells i, j. Assuming that the game satisfies the condition of case (c) of Theorem 4, we have that for all neighboring cells C_i and C_j, ℓ_i − ℓ_j ∈ Im S⊤_{(i,j)}. This means that there exists a coefficient vector v_{(i,j)} such that ℓ_i − ℓ_j = S⊤_{(i,j)} v_{(i,j)}. We define the k-th segment of v_{(i,j)}, denoted by v_{(i,j),k}, as the vector of components of v_{(i,j)} that correspond to the k-th action in the neighborhood action set. That is, if S_{(i,j)} is obtained by stacking the signal matrices S_1, ..., S_r of the individual actions in A_{i,j}, then ℓ_i − ℓ_j = S⊤_{(i,j)} v_{(i,j)} = Σ_{s=1}^r S_s⊤ v_{(i,j),s}.
Let J_t ∈ {1, ..., M} denote the outcome at time step t. For 1 ≤ k ≤ M, let e_k ∈ ℝ^M be the k-th unit vector. For an action i, let O_i(t) = S_i e_{J_t} be the observation vector of action i at time step t. If the rows of the signal matrix S_i correspond to symbols σ_1, ..., σ_{s_i} and action i is chosen at time step t, then the unit vector O_i(t) indicates which symbol was observed in that time step. Thus, O_{I_t}(t) holds the same information as the feedback at time t (recall that I_t is the action chosen by the learner at time step t). From now on, for simplicity, we will assume that the feedback at time step t is the observation vector O_{I_t}(t) itself.
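The estimation device behind Balaton can be illustrated numerically: for a locally observable pair, solve S⊤_{(i,j)} v = ℓ_i − ℓ_j once, and then one observation vector per action of A_{i,j} yields an unbiased estimate of (ℓ_i − ℓ_j)⊤p. The sketch below uses hypothetical signal matrices and loss difference, and simply averages many such one-round estimates to show that they concentrate around the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical neighborhood action set with signal matrices S1, S2 (M = 3 outcomes).
S1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
S2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 1.0]])
S = np.vstack([S1, S2])                      # stacked signal matrix S_(i,j)
diff = np.array([0.4, -0.3, 0.1])            # plays the role of l_i - l_j
p = np.array([0.2, 0.5, 0.3])                # true (unknown) opponent strategy

v, *_ = np.linalg.lstsq(S.T, diff, rcond=None)     # coefficient vector: S_(i,j)^T v = l_i - l_j
v1, v2 = v[:S1.shape[0]], v[S1.shape[0]:]          # segments of v, one per action

estimates = []
for _ in range(20_000):
    J1, J2 = rng.choice(3, p=p), rng.choice(3, p=p)   # outcomes when each action is played once
    O1, O2 = S1[:, J1], S2[:, J2]                     # observation (indicator) vectors
    estimates.append(O1 @ v1 + O2 @ v2)               # one-round estimate of (l_i - l_j)^T p

print(np.mean(estimates), diff @ p)   # both are approximately -0.04
```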
The main idea of the algorithm is to successively eliminate actions in an efficient, yet safe manner. When all remaining strongly Pareto-optimal actions share the same cell, the elimination phase finishes and, from this point on, one of the remaining actions is played. During the elimination phase, the algorithm works in rounds. In each round each 'alive' Pareto-optimal action is played once. The resulting observations are used to estimate the loss difference between the alive actions. If some estimate becomes sufficiently precise, the action of the pair deemed to be suboptimal is eliminated (possibly together with other actions).
Algorithm 1 Balaton
Input: L, H, T, δ
Initialization:
    [𝒢, C, {v_{(i,j),k}}, {path_{(i,j)}}, {(LB_{(i,j)}, UB_{(i,j)}, σ_{(i,j)}, R_{(i,j)})}] ← Initialize(L, H)
    t ← 0, n ← 0
    aliveActions ← {1 ≤ i ≤ N : C_i ∩ interior(Δ_M) ≠ ∅}
Main loop:
    while |V_𝒢| > 1 and t < T do
        n ← n + 1
        for each i ∈ aliveActions do
            O_i ← ExecuteAction(i)
            t ← t + 1
        end for
        for each edge (i, j) in 𝒢 do
            μ_{(i,j)} ← Σ_{k ∈ A_{i,j}} O_k⊤ v_{(i,j),k}
        end for
        for each non-adjacent vertex pair (i, j) in 𝒢 do
            μ_{(i,j)} ← Σ_{(k,l) ∈ path_{(i,j)}} μ_{(k,l)}
        end for
        haveEliminated ← false
        for each vertex pair (i, j) in 𝒢 do
            μ̂_{(i,j)} ← (1 − 1/n) μ̂_{(i,j)} + (1/n) μ_{(i,j)}
            if BStopStep(μ̂_{(i,j)}, LB_{(i,j)}, UB_{(i,j)}, σ_{(i,j)}, R_{(i,j)}, n, 1/2, δ) then
                [aliveActions, C, 𝒢] ← eliminate(i, j, sgn(μ̂_{(i,j)}))
                haveEliminated ← true
            end if
        end for
        if haveEliminated then
            {path_{(i,j)}} ← regeneratePaths(𝒢)
        end if
    end while
    Let i be a strongly Pareto-optimal action in aliveActions
    while t < T do
        ExecuteAction(i)
        t ← t + 1
    end while
To determine if an estimate is sufficiently precise, we will use an appropriate stopping rule. A small regret will be achieved by tuning the error probability of the stopping rule appropriately.
The details of the algorithm are as follows. In the preprocessing phase, the algorithm constructs the neighborhood graph, the signal matrices S_{(i,j)} assigned to the edges of the graph, the coefficient vectors v_{(i,j)} and their segment vectors v_{(i,j),k}. In addition, it constructs a path in the graph connecting any pair of nodes, and initializes some variables used by the stopping rule.
In the elimination phase, the algorithm runs a loop. In each round of the loop, the algorithm chooses each of the alive actions once and, based on the observations, the estimates μ̂_{(i,j)} of the loss differences (ℓ_i − ℓ_j)⊤p are updated, where p is the actual opponent strategy. The algorithm maintains the set C of cells of alive actions and their neighborship graph 𝒢.
The estimates are calculated as follows. First we calculate estimates for neighboring actions (i, j). In round n,⁶ for every action k in A_{i,j} let O_k be the observation vector for action k. Let μ_{(i,j)} = Σ_{k∈A_{i,j}} O_k⊤ v_{(i,j),k}. From the local observability condition and the construction of v_{(i,j),k}, with simple algebra it follows that μ_{(i,j)} is an unbiased estimate of (ℓ_i − ℓ_j)⊤p (see Lemma 5). For non-neighboring action pairs we use telescoping sums: since the graph 𝒢 (induced by the alive actions) stays connected, we can take a path i = i_0, i_1, ..., i_r = j in the graph, and the estimate μ_{(i,j)}(n) is the sum of the estimates along the path: Σ_{l=1}^r μ_{(i_{l−1}, i_l)}. The estimate of the difference of the expected losses after round n is the average μ̂_{(i,j)} = (1/n) Σ_{s=1}^n μ_{(i,j)}(s), where μ_{(i,j)}(s) denotes the estimate for pair (i, j) computed in round s.
After updating the estimates, the algorithm decides which actions to eliminate. For each pair of vertices i, j of the graph, the expected difference of their losses is tested for its sign by the BStopStep subroutine, based on the estimate μ̂_{(i,j)} and its relative error. This subroutine uses a stopping rule based on Bernstein's inequality.
The subroutine's pseudocode is shown as Algorithm 2 and is essentially based on the work by Mnih et al. (2008). The algorithm maintains two values, LB, UB, computed from the supplied sequence of sample means (μ̂) and the deviation bounds
\[
c(\sigma, R, n, \delta) = \sigma \sqrt{\frac{2 L(\delta, n)}{n}} + \frac{R\, L(\delta, n)}{3n},
\qquad \text{where } L(\delta, n) = \log\!\left( \frac{3 p\, n^{p}}{(p-1)\,\delta} \right). \tag{1}
\]
Here p > 1 is an arbitrarily chosen parameter of the algorithm, σ is a (deterministic) upper bound on the (conditional) variance of the random variables whose common mean μ we wish to estimate, while R is a (deterministic) upper bound on their range. This is a general stopping rule method, which stops when it has produced an estimate of the unknown mean with relative accuracy ε. The algorithm is guaranteed to be correct outside of a failure event whose probability is bounded by δ.
Algorithm Balaton calls this method with ε = 1/2 . As a result , when BStopStep returns true , outside of the failure event the sign of the estimate ˆ µ supplied to Balaton will match the sign of the mean to be estimated . The conditions under which the algorithm indeed produces ε-accurate estimates ( with high probability ) are given in Lemma 11 ( see Appendix ) , which also states that also with high probability , the time when the algorithm stops is bounded by
\[
C \cdot \max\left( \frac{\sigma^2}{\epsilon^2 \mu^2},\ \frac{R}{\epsilon |\mu|} \right)\left( \log\frac{1}{\delta} + \log\frac{R}{|\mu|} \right),
\]
where μ ≠ 0 is the true mean. Note that the choice of p in (1) influences only C.
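For illustration, the deviation bound of equation (1) and the update performed by BStopStep (Algorithm 2 below) can be written in a few lines. This is only a sketch based on our reconstruction of (1) and of Algorithm 2; the helper names are ours, and the exact constants are immaterial for the analysis.

```python
import math

def dev_bound(sigma, R, n, delta, p=1.1):
    """Bernstein-type deviation bound c(sigma, R, n, delta) as in equation (1)."""
    L = math.log(3.0 * p * n ** p / ((p - 1.0) * delta))
    return sigma * math.sqrt(2.0 * L / n) + R * L / (3.0 * n)

def b_stop_step(mu_hat, state, sigma, R, n, eps, delta):
    """One BStopStep update of (LB, UB); True means the sign of the mean is settled."""
    c = dev_bound(sigma, R, n, delta)
    state["LB"] = max(state["LB"], abs(mu_hat) - c)
    state["UB"] = min(state["UB"], abs(mu_hat) + c)
    return (1.0 + eps) * state["LB"] >= (1.0 - eps) * state["UB"]

state = {"LB": 0.0, "UB": float("inf")}
print(b_stop_step(0.3, state, sigma=1.0, R=2.0, n=100, eps=0.5, delta=0.01))
```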
If BStopStep returns true for an estimate μ̂_{(i,j)}, the function eliminate is called. If, say, μ̂_{(i,j)} > 0, this function takes the closed half space {q ∈ Δ_M : (ℓ_i − ℓ_j)⊤ q ≤ 0} and eliminates all actions whose cell lies completely in the half space. The function also drops the vertices from the graph that correspond to eliminated cells. The elimination necessarily concerns all actions whose cell is C_i, and possibly other actions as well.
6 . Note that a round of the algorithm is not the same as the time step t. In a round , the algorithm chooses each of the alive actions once .
Algorithm 2 Algorithm BStopStep. Note that, somewhat unusually at least in pseudocode, the arguments LB, UB are passed by reference, i.e., the algorithm rewrites the values of these arguments (which are thus returned back to the caller).
Input: μ̂, LB, UB, σ, R, n, ε, δ
    LB ← max(LB, |μ̂| − c(σ, R, n, δ))
    UB ← min(UB, |μ̂| + c(σ, R, n, δ))
    return (1 + ε) LB ≥ (1 − ε) UB
The remaining cells are then redefined by taking their intersection with the complement half space {q ∈ Δ_M : (ℓ_i − ℓ_j)⊤ q ≥ 0}.
By construction , after the elimination phase , the remaining graph is still connected , but some paths used in the round may have lost vertices or edges . For this reason , in the last phase of the round , new paths are constructed for vertex pairs with broken paths .
The main loop of the algorithm continues until either one vertex remains in the graph or the time horizon T is reached . In the former case , one of the actions corresponding to that vertex is chosen until the time horizon is reached .
5. Analysis of the algorithm
In this section we prove that the algorithm described in the previous section achieves O(√T) expected regret.
Let us assume that the outcomes are generated following the probability vector p ∈ Δ_M. Let j* denote an optimal action, that is, for every 1 ≤ i ≤ N, ℓ_{j*}⊤ p ≤ ℓ_i⊤ p. For every pair of actions i and j, let α_{i,j} = (ℓ_i − ℓ_j)⊤ p be the expected difference of their instantaneous losses. The expected regret of the algorithm can be rewritten as
\[
\mathbb{E}\left[ \sum_{t=1}^{T} \ell_{I_t, J_t} \right] - \min_{1 \le i \le N} \mathbb{E}\left[ \sum_{t=1}^{T} \ell_{i, J_t} \right]
= \sum_{i=1}^{N} \mathbb{E}[\tau_i]\, \alpha_{i, j^*}, \tag{2}
\]
where τ i is the number of times action i is chosen by the algorithm .
Throughout the proof, the value that Balaton assigns to a variable x in round n will be denoted by x(n). Further, for 1 ≤ k ≤ N, we introduce the i.i.d. random sequence (J_k(n))_{n≥1}, taking values in {1, ..., M}, with common multinomial distribution satisfying P[J_k(n) = j] = p_j. Clearly, a statistically equivalent model to the one where (J_t) is an i.i.d. sequence with multinomial distribution p is when (J_t) is defined through
\[
J_t = J_{I_t}\!\left( \sum_{s=1}^{t} \mathbb{I}(I_s = I_t) \right). \tag{3}
\]
Note that this claim holds independently of the algorithm generating the actions I_t. Therefore, in what follows, we assume that the outcome sequence is generated through (3). As we will see, this construction significantly simplifies subsequent steps of the proof. In particular, the construction will be very convenient since, if action k is selected by our algorithm in the n-th elimination round, then the outcome obtained in response is going to be O_k(n) = S_k u_k(n), where u_k(n) = e_{J_k(n)}. (This holds because in the elimination rounds all alive actions are tried exactly once by Balaton.)
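The reindexing in (3) is purely mechanical and can be checked by simulation; the snippet below (illustrative only) pre-draws an i.i.d. outcome sequence for every action and consumes the sequence of the chosen action at its own counter, exactly as in (3), regardless of how the actions I_t are produced.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 3, 4, 6
p = np.array([0.1, 0.2, 0.3, 0.4])

# Pre-drawn i.i.d. outcome sequences J_k(1), J_k(2), ... for every action k.
J = {k: rng.choice(M, size=T, p=p) for k in range(N)}

counts = np.zeros(N, dtype=int)
for t in range(T):
    I_t = t % N                        # any action sequence works; here a fixed rotation
    counts[I_t] += 1                   # number of times I_t has been chosen so far
    J_t = J[I_t][counts[I_t] - 1]      # outcome defined through equation (3)
    print(t, I_t, J_t)
```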
Let (F_n)_n be the filtration defined as F_n = σ(u_k(m); 1 ≤ k ≤ N, 1 ≤ m ≤ n). We also introduce the notations E_n[·] = E[· | F_n] and Var_n(·) = Var(· | F_n) for the conditional expectation and conditional variance operators corresponding to F_n. Note that F_n contains the information known to Balaton (and more) at the end of elimination round n. Our first (trivial) observation is that μ_{(i,j)}(n), the estimate of α_{i,j} obtained in round n, is F_n-measurable. The next lemma establishes that, furthermore, μ_{(i,j)}(n) is an unbiased estimate of α_{i,j}:
Lemma 5 For any n ≥ 1 and i, j such that C_i, C_j ∈ C, E_{n−1}[μ_{(i,j)}(n)] = α_{i,j}.
Proof Consider first the case when actions i and j are neighbors. In this case,
\[
\mu_{(i,j)}(n) = \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k}
= \sum_{k \in A_{i,j}} \left( S_k u_k(n) \right)^\top v_{(i,j),k}
= \sum_{k \in A_{i,j}} u_k(n)^\top S_k^\top v_{(i,j),k},
\]
and thus
\[
\mathbb{E}_{n-1}\!\left[ \mu_{(i,j)}(n) \right]
= \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\!\left[ u_k(n) \right]^\top S_k^\top v_{(i,j),k}
= p^\top \sum_{k \in A_{i,j}} S_k^\top v_{(i,j),k}
= p^\top S_{(i,j)}^\top v_{(i,j)}
= p^\top (\ell_i - \ell_j) = \alpha_{i,j}.
\]
For non-adjacent i and j, we have a telescoping sum:
\[
\mathbb{E}_{n-1}\!\left[ \mu_{(i,j)}(n) \right]
= \sum_{k=1}^{r} \mathbb{E}_{n-1}\!\left[ \mu_{(i_{k-1}, i_k)}(n) \right]
= p^\top \left( \ell_{i_0} - \ell_{i_1} + \ell_{i_1} - \ell_{i_2} + \cdots + \ell_{i_{r-1}} - \ell_{i_r} \right)
= \alpha_{i,j},
\]
where i = i_0, i_1, ..., i_r = j is the path the algorithm uses in round n, known at the end of round n − 1.
Lemma 6 The conditional variance of μ_{(i,j)}(n), Var_{n−1}(μ_{(i,j)}(n)), is upper bounded by V = 2 Σ_{{i,j} neighbors} ‖v_{(i,j)}‖²₂.
Proof For neighboring cells i, j we write
\[
\mu_{(i,j)}(n) = \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k},
\]
and thus
\[
\begin{aligned}
\mathrm{Var}_{n-1}\!\left( \mu_{(i,j)}(n) \right)
&= \mathrm{Var}_{n-1}\!\left( \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} \right) \\
&= \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\!\left[ v_{(i,j),k}^\top \left( O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \right)\left( O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \right)^\top v_{(i,j),k} \right] \\
&\le \sum_{k \in A_{i,j}} \| v_{(i,j),k} \|_2^2\; \mathbb{E}_{n-1}\!\left[ \left\| O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \right\|_2^2 \right] \\
&\le \sum_{k \in A_{i,j}} \| v_{(i,j),k} \|_2^2 = \| v_{(i,j)} \|_2^2,
\end{aligned} \tag{4}
\]
where in (4) we used that O_k(n) is a unit vector and E_{n−1}[O_k(n)] is a probability vector.
For non-neighboring cells i, j, let i = i_0, i_1, ..., i_r = j be the path used for the estimate in round n. Then μ_{(i,j)}(n) can be written as
\[
\mu_{(i,j)}(n) = \sum_{s=1}^{r} \mu_{(i_{s-1}, i_s)}(n)
= \sum_{s=1}^{r} \sum_{k \in A_{i_{s-1}, i_s}} O_k(n)^\top v_{(i_{s-1}, i_s),k}.
\]
It is not hard to see that an action can be in at most two neighborhood action sets along the path, and so the double sum can be rearranged as
\[
\mu_{(i,j)}(n) = \sum_{k \in \bigcup_s A_{i_{s-1}, i_s}} O_k(n)^\top \left( v_{(i_{s_k-1}, i_{s_k}),k} + v_{(i_{s_k}, i_{s_k+1}),k} \right),
\]
and thus Var_{n−1}(μ_{(i,j)}(n)) ≤ 2 Σ_{s=1}^r ‖v_{(i_{s−1}, i_s)}‖²₂ ≤ 2 Σ_{{i,j} neighbors} ‖v_{(i,j)}‖²₂.
Lemma 7 The range of the estimates μ_{(i,j)}(n) is upper bounded by R = Σ_{{i,j} neighbors} ‖v_{(i,j)}‖₁.
Proof The bound trivially follows from the definition of the estimates .
Let δ be the confidence parameter used in BStopStep. Since, according to Lemmas 5, 6 and 7, (μ_{(i,j)}) is a "shifted" martingale difference sequence with conditional mean α_{i,j}, bounded conditional variance and bounded range, we can apply Lemma 11 stated in the Appendix. By the union bound, the probability that any of the confidence bounds fails during the game is at most N²δ. Thus, with probability at least 1 − N²δ, if BStopStep returns true for a pair (i, j) then sgn(α_{i,j}) = sgn(μ̂_{(i,j)}) and the algorithm eliminates all the actions whose cell is contained in the closed half space H = {p′ ∈ Δ_M : sgn(α_{i,j}) (ℓ_i − ℓ_j)⊤ p′ ≤ 0}. By definition α_{i,j} = (ℓ_i − ℓ_j)⊤ p. Thus p ∉ H and none of the eliminated actions can be optimal under p.
From Lemma 11 we also see that, with probability at least 1 − N²δ, the number of times τ_i the algorithm experiments with a suboptimal action i during the elimination phase is bounded by
\[
\tau_i \le \frac{c(G)}{\alpha_{i,j^*}^2} \log \frac{R}{\delta\, \alpha_{i,j^*}} = T_i, \tag{5}
\]
where c(G) = C(V + R) is a problem dependent constant.
The following lemma , the proof of which can be found in the Appendix , shows that degenerate actions will be eliminated in time .
Lemma 8 Let action i be a degenerate action. Let A_i = {j : C_j ∈ C, C_i ⊂ C_j}. The following two statements hold:
1. If any of the actions in A_i is eliminated, then action i is eliminated as well.
2. There exists an action k_i ∈ A_i such that α_{k_i, j*} ≥ α_{i, j*}.
An immediate implication of the first claim of the lemma is that if action k_i gets eliminated then action i gets eliminated as well; in particular, during the elimination phase action i cannot be chosen more often than action k_i, and so 𝕀(E)τ_i ≤ 𝕀(E)τ_{k_i}, where E denotes the complement of the failure event underlying the stopping rules. As discussed earlier, P(E^c) ≤ N²δ. Note also that on E, i.e., when the stopping rules do not fail, no suboptimal action can remain for the final phase, so for a suboptimal action the total number of times it is chosen coincides with the number of times it is chosen during the elimination phase, and (5) applies to it. To upper bound the expected regret we continue from (2) as
\[
\begin{aligned}
\sum_{i=1}^{N} \mathbb{E}[\tau_i]\,\alpha_{i,j^*}
&\le \sum_{i=1}^{N} \mathbb{E}[\mathbb{I}(E)\,\tau_i]\,\alpha_{i,j^*} + \mathbb{P}(E^c)\, T
&& \text{(because } \textstyle\sum_{i=1}^{N} \tau_i = T \text{ and } 0 \le \alpha_{i,j^*} \le 1\text{)} \\
&\le \sum_{i=1}^{N} \mathbb{E}[\mathbb{I}(E)\,\tau_i]\,\alpha_{i,j^*} + N^2 \delta T \\
&= \sum_{i : C_i \in \mathcal{C}} \mathbb{E}[\mathbb{I}(E)\,\tau_i]\,\alpha_{i,j^*}
 + \sum_{i : C_i \notin \mathcal{C}} \mathbb{E}[\mathbb{I}(E)\,\tau_i]\,\alpha_{i,j^*} + N^2 \delta T \\
&\le \sum_{i : C_i \in \mathcal{C}} \mathbb{E}[\mathbb{I}(E)\,\tau_i]\,\alpha_{i,j^*}
 + \sum_{i : C_i \notin \mathcal{C}} \mathbb{E}\!\left[\mathbb{I}(E)\,\tau_{k_i}\right]\alpha_{k_i,j^*} + N^2 \delta T
&& \text{(by Lemma 8)} \\
&\le \sum_{i : C_i \in \mathcal{C}} T_i\, \alpha_{i,j^*} + \sum_{i : C_i \notin \mathcal{C}} T_{k_i}\, \alpha_{k_i,j^*} + N^2 \delta T \\
&\le \sum_{\substack{i : C_i \in \mathcal{C} \\ \alpha_{i,j^*} \ge \alpha_0}} T_i\, \alpha_{i,j^*}
 + \sum_{\substack{i : C_i \notin \mathcal{C} \\ \alpha_{k_i,j^*} \ge \alpha_0}} T_{k_i}\, \alpha_{k_i,j^*}
 + \left( \alpha_0 + N^2 \delta \right) T \\
&\le c(G)\left( \sum_{\substack{i : C_i \in \mathcal{C} \\ \alpha_{i,j^*} \ge \alpha_0}} \frac{\log \frac{R}{\delta \alpha_{i,j^*}}}{\alpha_{i,j^*}}
 + \sum_{\substack{i : C_i \notin \mathcal{C} \\ \alpha_{k_i,j^*} \ge \alpha_0}} \frac{\log \frac{R}{\delta \alpha_{k_i,j^*}}}{\alpha_{k_i,j^*}} \right)
 + \left( \alpha_0 + N^2 \delta \right) T \\
&\le c(G)\, N\, \frac{\log \frac{R}{\delta \alpha_0}}{\alpha_0} + \left( \alpha_0 + N^2 \delta \right) T.
\end{aligned}
\]
The above calculation holds for any value of α_0 > 0. Setting
\[
\alpha_0 = \sqrt{\frac{c(G)\, N}{T}} \qquad \text{and} \qquad \delta = \sqrt{\frac{c(G)}{T N^3}},
\]
we get
\[
\mathbb{E}[R_T] \le \sqrt{c(G)\, N\, T}\; \log \frac{R\, T\, N^2}{c(G)}.
\]
In conclusion, if we run Balaton with parameter δ = √(c(G)/(T N³)), the algorithm suffers regret of O(√T), finishing the proof.
6 . A lower bound for hard games
In this section we prove that for any game that satisfies the condition of case (d) of Theorem 4, the minimax regret is Ω(T^{2/3}).
Theorem 9 Let G = (L, H) be an N by M partial-monitoring game. Assume that there exist two neighboring actions i and j such that ℓ_i − ℓ_j ∉ Im S⊤_{(i,j)}. Then there exists a problem dependent constant c(G) such that for any algorithm A and time horizon T there exists an opponent strategy p such that the expected regret satisfies
\[
\mathbb{E}\left[ R_T(A, p) \right] \ge c(G)\, T^{2/3}.
\]
Proof Without loss of generality we can assume that the two neighboring cells in the condition are C_1 and C_2. Let C_3 = C_1 ∩ C_2. For i = 1, 2, 3, let A_i be the set of actions associated with cell C_i. Note that A_3 may be the empty set. Let A_4 = A \ (A_1 ∪ A_2 ∪ A_3). By our convention for naming loss vectors, ℓ_1 and ℓ_2 are the loss vectors for C_1 and C_2, respectively. Let L_3 collect the loss vectors of actions which lie on the open segment connecting ℓ_1 and ℓ_2. It is easy to see that L_3 is the set of loss vectors that correspond to the cell C_3. We define L_4 as the set of all the other loss vectors. For i = 1, 2, 3, 4, let k_i = |A_i|.
Let S = S_{(1,2)} be the signal matrix of the neighborhood action set of C_1 and C_2. It follows from the assumption of the theorem that ℓ_2 − ℓ_1 ∉ Im(S⊤). Thus, {ρ(ℓ_2 − ℓ_1) : ρ ∈ ℝ} ⊄ Im(S⊤), or equivalently, ℓ_2 − ℓ_1 is not orthogonal to Ker S, where we used that (Im M⊤)⊥ = Ker(M). Thus, there exists a vector v such that v ∈ Ker S and (ℓ_2 − ℓ_1)⊤ v ≠ 0. By scaling we can assume that (ℓ_2 − ℓ_1)⊤ v = 1. Note that since v ∈ Ker S and the row space of S contains the vector (1, 1, ..., 1), the coordinates of v sum up to zero.
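Numerically, such a vector v can be found from any element of Ker S that is not orthogonal to ℓ_2 − ℓ_1, rescaled so that the inner product is 1. A sketch with a hypothetical signal matrix (scipy.linalg.null_space is used to obtain a kernel basis):

```python
import numpy as np
from scipy.linalg import null_space

# Hypothetical stacked signal matrix S (its row space contains the all-ones vector)
# and a loss difference that is NOT in Im(S^T).
S = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
diff = np.array([1.0, -1.0, 2.0, -2.0])    # plays the role of l_2 - l_1

v = None
for k in null_space(S).T:                  # orthonormal basis of Ker S, one vector per row
    if abs(diff @ k) > 1e-9:               # not orthogonal to l_2 - l_1
        v = k / (diff @ k)                 # rescale so that (l_2 - l_1)^T v = 1
        break

print(v)
print(S @ v, v.sum(), diff @ v)            # zero vector, zero coordinate sum, and 1
```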
Let p_0 be an arbitrary probability vector in the relative interior of C_3. It is easy to see that for any ε > 0 small enough, p_1 = p_0 + εv ∈ C_1 \ C_2 and p_2 = p_0 − εv ∈ C_2 \ C_1.
Let us fix a deterministic algorithm A and a time horizon T. For i = 1, 2, let R_T^{(i)} denote the expected regret of the algorithm under opponent strategy p_i. For i = 1, 2 and j = 1, ..., 4, let N_j^i denote the expected number of times the algorithm chooses an action from A_j, assuming the opponent plays strategy p_i.
From the definition of L_3 we know that for any ℓ ∈ L_3, ℓ − ℓ_1 = η_ℓ (ℓ_2 − ℓ_1) and ℓ − ℓ_2 = (1 − η_ℓ)(ℓ_1 − ℓ_2) for some 0 < η_ℓ < 1. Let λ_1 = min_{ℓ∈L_3} η_ℓ, λ_2 = min_{ℓ∈L_3}(1 − η_ℓ), and λ = min(λ_1, λ_2) if L_3 ≠ ∅; let λ = 1/2 otherwise. Finally, let β_i = min_{ℓ∈L_4} (ℓ − ℓ_i)⊤ p_i and β = min(β_1, β_2). Note that λ, β > 0.
As the first step of the proof, we lower bound the expected regrets R_T^{(1)} and R_T^{(2)} in terms of the values N_j^i, ε, λ and β:
\[
\begin{aligned}
R_T^{(1)} &\ge N_2^1 \underbrace{(\ell_2 - \ell_1)^\top p_1}_{=\,\varepsilon} + N_3^1\, \lambda\, (\ell_2 - \ell_1)^\top p_1 + N_4^1 \beta
\ \ge\ \lambda \left( N_2^1 + N_3^1 \right) \varepsilon + N_4^1 \beta, \\
R_T^{(2)} &\ge N_1^2 \underbrace{(\ell_1 - \ell_2)^\top p_2}_{=\,\varepsilon} + N_3^2\, \lambda\, (\ell_1 - \ell_2)^\top p_2 + N_4^2 \beta
\ \ge\ \lambda \left( N_1^2 + N_3^2 \right) \varepsilon + N_4^2 \beta.
\end{aligned} \tag{6}
\]
For the next step , we need the following lemma .
Lemma 10 There exists a (problem dependent) constant c such that the following inequalities hold:
\[
N_1^2 \ge N_1^1 - c\, T \varepsilon \sqrt{N_4^1}, \qquad N_3^2 \ge N_3^1 - c\, T \varepsilon \sqrt{N_4^1},
\]
\[
N_2^1 \ge N_2^2 - c\, T \varepsilon \sqrt{N_4^2}, \qquad N_3^1 \ge N_3^2 - c\, T \varepsilon \sqrt{N_4^2}.
\]
Proof (Lemma 10) For any 1 ≤ t ≤ T, let f_t = (f_1, ..., f_t) ∈ Σ^t be a feedback sequence up to time step t. For i = 1, 2, let p_i^* be the probability mass function of feedback sequences of length T − 1 under opponent strategy p_i and algorithm A. We start by upper bounding the difference between the expected counts under the two opponent strategies. For i ≠ j ∈ {1, 2} and k ∈ {1, 2, 3},
\[
\begin{aligned}
N_k^i - N_k^j &= \sum_{f_{T-1}} \left( p_i^*(f_{T-1}) - p_j^*(f_{T-1}) \right) \sum_{t=0}^{T-1} \mathbb{I}\left( A(f_t) \in A_k \right) \\
&\le \sum_{f_{T-1} :\, p_i^*(f_{T-1}) - p_j^*(f_{T-1}) \ge 0} \left( p_i^*(f_{T-1}) - p_j^*(f_{T-1}) \right) \sum_{t=0}^{T-1} \mathbb{I}\left( A(f_t) \in A_k \right) \\
&\le T \sum_{f_{T-1} :\, p_i^*(f_{T-1}) - p_j^*(f_{T-1}) \ge 0} \left( p_i^*(f_{T-1}) - p_j^*(f_{T-1}) \right)
= \frac{T}{2} \left\| p_1^* - p_2^* \right\|_1 \\
&\le T \sqrt{\mathrm{KL}\!\left( p_1^* \,\|\, p_2^* \right) / 2},
\end{aligned} \tag{7}
\]
where KL(·‖·) denotes the Kullback-Leibler divergence and ‖·‖_1 is the L_1-norm. The last inequality follows from Pinsker's inequality (Cover and Thomas, 2006). To upper bound KL(p_1^*‖p_2^*) we use the chain rule for KL divergence. We overload the notation so that p_i^*(f_{t−1}) denotes the probability of the feedback sequence f_{t−1} under opponent strategy p_i and algorithm A, and p_i^*(f_t | f_{t−1}) denotes the conditional probability of feedback f_t ∈ Σ given that the past feedback sequence was f_{t−1}, again under p_i and A. With this notation we have
\[
\begin{aligned}
\mathrm{KL}\!\left( p_1^* \,\|\, p_2^* \right)
&= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1^*(f_{t-1}) \sum_{f_t} p_1^*(f_t \mid f_{t-1}) \log \frac{p_1^*(f_t \mid f_{t-1})}{p_2^*(f_t \mid f_{t-1})} \\
&= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1^*(f_{t-1}) \sum_{i=1}^{4} \mathbb{I}\left( A(f_{t-1}) \in A_i \right) \sum_{f_t} p_1^*(f_t \mid f_{t-1}) \log \frac{p_1^*(f_t \mid f_{t-1})}{p_2^*(f_t \mid f_{t-1})}.
\end{aligned} \tag{8}
\]
Let a_{f_t} denote the row of S that corresponds to the feedback symbol f_t.⁷ Assume the algorithm chooses action k = A(f_{t−1}). If the feedback set of action k does not contain f_t then trivially p_i^*(f_t | f_{t−1}) = 0 for i = 1, 2. Otherwise p_i^*(f_t | f_{t−1}) = a_{f_t}⊤ p_i. Since p_1 − p_2 = 2εv and v ∈ Ker S, we have a_{f_t}⊤ v = 0 and thus, if the choice of the algorithm is in either A_1, A_2 or A_3, then p_1^*(f_t | f_{t−1}) = p_2^*(f_t | f_{t−1}). It follows that the inequality chain can be continued from (8) by writing
\[
\begin{aligned}
\mathrm{KL}\!\left( p_1^* \,\|\, p_2^* \right)
&= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1^*(f_{t-1})\, \mathbb{I}\left( A(f_{t-1}) \in A_4 \right) \sum_{f_t} p_1^*(f_t \mid f_{t-1}) \log \frac{p_1^*(f_t \mid f_{t-1})}{p_2^*(f_t \mid f_{t-1})} \\
&\le c_1 \varepsilon^2 \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1^*(f_{t-1})\, \mathbb{I}\left( A(f_{t-1}) \in A_4 \right) \\
&= c_1 \varepsilon^2 N_4^1.
\end{aligned} \tag{9}
\]
7. Recall that we assumed that different actions have different feedback symbols, and thus the row of S corresponding to a symbol is unique.
In (9) we used Lemma 12 (see Appendix) to upper bound the KL divergence of the conditional feedback distributions. Flipping p_1 and p_2 in (7), we get the same result with N_4^2. Together with the bound in (7), this gives all the desired inequalities.
Now we can continue lower bounding the expected regret. Let r = argmin_{i∈{1,2}} N_4^i. It is easy to see that for i = 1, 2 and j = 1, 2, 3,
\[
N_j^i \ge N_j^r - c_2\, T \varepsilon \sqrt{N_4^r}.
\]
If i ≠ r then this inequality is one of the inequalities from Lemma 10. If i = r then it is a trivial lower bound obtained by subtracting a positive value. From (6) we have
\[
\begin{aligned}
R_T^{(i)} &\ge \lambda \left( N_{3-i}^i + N_3^i \right) \varepsilon + N_4^i \beta \\
&\ge \lambda \left( N_{3-i}^r - c_2 T \varepsilon \sqrt{N_4^r} + N_3^r - c_2 T \varepsilon \sqrt{N_4^r} \right) \varepsilon + N_4^r \beta \\
&= \lambda \left( N_{3-i}^r + N_3^r - 2 c_2 T \varepsilon \sqrt{N_4^r} \right) \varepsilon + N_4^r \beta.
\end{aligned}
\]
Now assume that, at the beginning of the game, the opponent randomly chooses between strategies p_1 and p_2 with equal probability. Then the expected regret of the algorithm is lower bounded by
\[
\begin{aligned}
R_T &= \tfrac{1}{2}\left( R_T^{(1)} + R_T^{(2)} \right) \\
&\ge \tfrac{1}{2} \lambda \left( N_1^r + N_2^r + 2 N_3^r - 4 c_2 T \varepsilon \sqrt{N_4^r} \right) \varepsilon + N_4^r \beta \\
&\ge \tfrac{1}{2} \lambda \left( N_1^r + N_2^r + N_3^r - 4 c_2 T \varepsilon \sqrt{N_4^r} \right) \varepsilon + N_4^r \beta \\
&= \tfrac{1}{2} \lambda \left( T - N_4^r - 4 c_2 T \varepsilon \sqrt{N_4^r} \right) \varepsilon + N_4^r \beta.
\end{aligned}
\]
Choosing ε = c_3 T^{−1/3} we get
\[
\begin{aligned}
R_T &\ge \tfrac{1}{2} \lambda c_3 T^{2/3} - \tfrac{1}{2} \lambda N_4^r c_3 T^{-1/3} - 2 \lambda c_2 c_3^2 T^{1/3} \sqrt{N_4^r} + N_4^r \beta \\
&\ge T^{2/3} \left( \left( \beta - \tfrac{1}{2} \lambda c_3 \right) x^2 - 2 \lambda c_2 c_3^2\, x + \tfrac{1}{2} \lambda c_3 \right),
\end{aligned}
\]
where x = √(N_4^r / T^{2/3}). Now we see that c_3 > 0 can be chosen small enough, independently of T, so that, for any choice of x, the quadratic expression in the parenthesis is bounded away from zero, and simultaneously ε is small enough so that the threshold condition in Lemma 12 is satisfied, completing the proof of Theorem 9.
7 . Discussion
In this paper we classified all finite partial-monitoring games under stochastic environments, based on their minimax regret. We conjecture that our results extend to non-stochastic environments. This is the major open question that remains to be answered.
One question which we did not discuss so far is the computational efficiency of our algorithm . The issue is twofold . The first computational question is how to efficiently decide which of the four classes a given game ( L , H ) belongs to . The second question is the computational efficiency of Balaton for a fixed easy game . Fortunately , in both cases an efficient implementation is possible , i.e. , in polynomial time by using a linear program solver ( e.g. , the ellipsoid method ( Papadimitriou and Steiglitz , 1998 ) ) .
Another interesting open question is to investigate the dependence of regret on quantities other than T such as the number of actions , the number of outcomes , and more generally the structure of the loss and feedback matrices .
Finally, let us note that our results can be extended to a more general framework, similar to that of Pallavi et al. (2011), in which a game with N actions and M-dimensional outcome space is defined as a tuple G = (L, S_1, ..., S_N). The loss matrix is L ∈ ℝ^{N×M} as before, but the outcome and the feedback are defined differently. The outcome y is an arbitrary vector from a bounded subset of ℝ^M and the feedback received by the learner upon choosing action i is O_i = S_i y.
References
Jacob Abernethy , Elad Hazan , and Alexander Rakhlin . Competing in the dark : An efficient algorithm for bandit linear optimization . In Proceedings of the 21st Annual Conference on Learning Theory ( COLT 2008 ) , pages 263–273 . Citeseer , 2008 .
Alekh Agarwal , Peter Bartlett , and Max Dama . Optimal allocation strategies for the dark pool problem . In 13th International Conference on Artificial Intelligence and Statistics ( AISTATS 2010 ) , May 12-15 , 2010 , Chia Laguna Resort , Sardinia , Italy , 2010 .
András Antos, Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games, 2011. http://arxiv.org/abs/1102.2041.
Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (ALT 2010), pages 224–238. Springer, 2010.
Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, June 2005.
Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.
Thomas M. Cover and Joy A. Thomas . Elements of Information Theory . Wiley , New York , second edition , 2006 .
Abraham D. Flaxman , Adam Tauman Kalai , and H. Brendan McMahan . Online convex optimization in the bandit setting : gradient descent without a gradient . In Proceedings of the 16th annual ACM-SIAM Symposium on Discrete Algorithms ( SODA 2005 ) , page 394 . Society for Industrial and Applied Mathematics , 2005 .
Robert Kleinberg and Tom Leighton . The value of knowing a demand curve : Bounds on regret for online posted-price auctions . In Proceedings of 44th Annual IEEE Symposium on Foundations of Computer Science 2003 ( FOCS 2003 ) , pages 594–605 . IEEE , 2003 .
Nick Littlestone and Manfred K. Warmuth . The weighted majority algorithm . Information and Computation , 108:212–261 , 1994 .
Gábor Lugosi and Nicolò Cesa-Bianchi. Prediction, Learning, and Games. Cambridge University Press, 2006.
V. Mnih . Efficient stopping rules . Master’s thesis , Department of Computing Science , University of Alberta , 2008 .
V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 672–679. ACM, 2008.
A. Pallavi, R. Zheng, and Cs. Szepesvári. Sequential learning for optimal monitoring of multi-channel wireless networks. In INFOCOMM, 2011.
Christos H. Papadimitriou and Kenneth Steiglitz . Combinatorial optimization : algorithms and complexity . Courier Dover Publications , New York , 1998 .
Antonio Piccolboni and Christian Schindelhauer . Discrete prediction games with arbitrary feedback and loss . In Proceedings of the 14th Annual Conference on Computational Learning Theory ( COLT 2001 ) , pages 208–223 . Springer-Verlag , 2001 .
Martin Zinkevich . Online convex programming and generalized infinitesimal gradient ascent . In Proceedings of Twentieth International Conference on Machine Learning ( ICML 2003 ) , 2003 .
Appendix
Proof ( Lemma 8 )
1. In an elimination step, we eliminate every action whose cell is contained in a closed half space. Let us assume that j ∈ A_i is being eliminated. According to the definition of A_i, C_i ⊂ C_j and thus C_i is also contained in the half space.
2. First let us assume that p is not in the affine subspace spanned by C_i. Let p′ be an arbitrary point in the relative interior of C_i. We define the point p″ = p′ + ε(p′ − p). For a small enough ε > 0, p″ ∈ C_k for some k ∈ A_i, and at the same time, p″ ∉ C_i. Thus we have
\[
\begin{aligned}
\ell_k^\top \left( p' + \varepsilon (p' - p) \right) &\le \ell_i^\top \left( p' + \varepsilon (p' - p) \right) \\
(1 + \varepsilon)\, \ell_k^\top p' - \varepsilon\, \ell_k^\top p &\le (1 + \varepsilon)\, \ell_i^\top p' - \varepsilon\, \ell_i^\top p \\
-\varepsilon\, \ell_k^\top p &\le -\varepsilon\, \ell_i^\top p \\
\ell_k^\top p &\ge \ell_i^\top p \\
\alpha_{k, j^*} &\ge \alpha_{i, j^*},
\end{aligned}
\]
where we used that ℓ_k⊤ p′ = ℓ_i⊤ p′. For the case when p lies in the affine subspace spanned by C_i, we take a hyperplane that contains the affine subspace. Then we take an infinite sequence (p_n)_n such that every element of the sequence lies on the same side of the hyperplane, p_n ≠ p, and the sequence converges to p. Then the statement is true for every element p_n and, since the value α_{r,s} is continuous in p, the limit has the desired property as well.
The following lemma concerns the problem of producing an estimate of the unknown mean of a stochastic process with a given relative error bound and with high probability, in a sample-efficient manner. The procedure is a simple variation of the one proposed by Mnih et al. (2008). The main differences are that here we deal with martingale difference sequences shifted by an unknown constant, which becomes the common mean, whereas Mnih et al. (2008) considered an i.i.d. sequence. On the other hand, we consider the case when we have a known upper bound on the predictable variance of the process, whereas one of the main contributions of Mnih et al. (2008) was the lifting of this assumption. The proof of the lemma is omitted, as it follows the same lines as the proofs of the results of Mnih et al. (2008) (the details of these proofs are found in the thesis of Mnih (2008)), the only difference being that here we need to use Bernstein's inequality for martingales in place of the empirical Bernstein inequality, which was used by Mnih et al. (2008).
Lemma 11 Let (F_t) be a filtration on some probability space, and let (X_t) be an F_t-adapted sequence of random variables. Assume that (X_t) is such that, almost surely, the range of each random variable X_t is bounded by R > 0, E[X_t | F_{t−1}] = μ, and Var[X_t | F_{t−1}] ≤ σ² a.s., where R, μ ≠ 0 and σ² are non-random constants. Let X̄_t = (1/t) Σ_{s=1}^t X_s, let p > 1, ε > 0, 0 < δ < 1, and let
\[
L_n = \max_{1 \le t \le n} \left( |\bar X_t| - c_t \right), \qquad U_n = \min_{1 \le t \le n} \left( |\bar X_t| + c_t \right),
\]
where c_t = c(σ, R, t, δ), and c(·) is defined in (1). Define the estimate μ̂_n of μ as follows:
\[
\hat\mu_n = \mathrm{sgn}(\bar X_n)\; \frac{(1 + \varepsilon) L_n + (1 - \varepsilon) U_n}{2}.
\]
Denote the stopping time τ = min{n : (1 + ε) L_n ≥ (1 − ε) U_n}. Then, with probability at least 1 − δ,
\[
|\hat\mu_\tau - \mu| \le \varepsilon\, |\mu|
\qquad \text{and} \qquad
\tau \le C \cdot \max\left( \frac{\sigma^2}{\varepsilon^2 \mu^2},\ \frac{R}{\varepsilon |\mu|} \right)\left( \log\frac{1}{\delta} + \log\frac{R}{|\mu|} \right),
\]
where C > 0 is a universal constant .
Lemma 12 Fix a probability vector p ∈ Δ_M, and let δ ∈ ℝ^M be such that p − δ, p + δ ∈ Δ_M also holds. Then KL(p − δ ‖ p + δ) = O(‖δ‖²₂) as δ → 0.
The constant and the threshold in the O(·) notation depend on p.
Proof Since p, p + δ, and p − δ are all probability vectors, notice that |δ(i)| ≤ p(i) for 1 ≤ i ≤ M. So if a coordinate of p is zero then the corresponding coordinate of δ has to be zero as well. As zero coordinates do not modify the KL divergence, we can assume without loss of generality that all coordinates of p are positive. Since we are interested only in the case when δ → 0, we can also assume without loss of generality that |δ(i)| ≤ p(i)/2. Also note that the coordinates of δ have to sum up to zero. By definition,
\[
\mathrm{KL}(p - \delta \,\|\, p + \delta) = \sum_{i=1}^{M} \left( p(i) - \delta(i) \right) \log \frac{p(i) - \delta(i)}{p(i) + \delta(i)}.
\]
We write the term with the logarithm as
\[
\log \frac{p(i) - \delta(i)}{p(i) + \delta(i)} = \log\!\left( 1 - \frac{\delta(i)}{p(i)} \right) - \log\!\left( 1 + \frac{\delta(i)}{p(i)} \right),
\]
so that we can use that, by a second order Taylor expansion around 0, log(1 − x) − log(1 + x) = −2x + r(x), where |r(x)| ≤ c|x|³ for |x| ≤ 1/2 and some c > 0. Combining these equations, we get
\[
\begin{aligned}
\mathrm{KL}(p - \delta \,\|\, p + \delta)
&= \sum_{i=1}^{M} \left( p(i) - \delta(i) \right)\left( \frac{-2\delta(i)}{p(i)} + r\!\left( \frac{\delta(i)}{p(i)} \right) \right) \\
&= \sum_{i=1}^{M} \left( -2\delta(i) \right) + \sum_{i=1}^{M} \frac{2\delta^2(i)}{p(i)} + \sum_{i=1}^{M} \left( p(i) - \delta(i) \right) r\!\left( \frac{\delta(i)}{p(i)} \right).
\end{aligned}
\]
Here the first term is 0; letting p_* = min_{i∈{1,...,M}} p(i), the second term is bounded by 2 Σ_{i=1}^M δ²(i)/p_* = (2/p_*) ‖δ‖²₂, and the third term is bounded by
\[
\left| \sum_{i=1}^{M} \left( p(i) - \delta(i) \right) r\!\left( \frac{\delta(i)}{p(i)} \right) \right|
\le c \sum_{i=1}^{M} \frac{p(i) - \delta(i)}{p^3(i)} |\delta(i)|^3
\le c \sum_{i=1}^{M} \frac{|\delta(i)|}{p^2(i)}\, \delta^2(i)
\le \frac{c}{2} \sum_{i=1}^{M} \frac{\delta^2(i)}{p^2(i)}
\le \frac{c}{2 p_*^2}\, \|\delta\|_2^2 .
\]
Hence, KL(p − δ ‖ p + δ) ≤ ((4 + c)/(2 p_*²)) ‖δ‖²₂ = O(‖δ‖²₂).
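A quick numerical sanity check of Lemma 12 (illustrative only): for a fixed p and a shrinking perturbation, the ratio KL(p − δ ‖ p + δ)/‖δ‖²₂ stays bounded, i.e., the divergence is quadratic in the perturbation.

```python
import numpy as np

def kl(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

p = np.array([0.2, 0.3, 0.5])
d = np.array([0.05, -0.02, -0.03])     # coordinates sum to zero; p - d and p + d stay in the simplex

for scale in [1.0, 0.1, 0.01, 0.001]:
    delta = scale * d
    print(scale, kl(p - delta, p + delta) / float(delta @ delta))   # ratio tends to a constant
```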
I'll start with defining partial monitoring, because it's a little different from the previous talk.
So consider a learner and an environment playing a repeated game. In every time step the learner chooses an action and the environment chooses an outcome. They give their choices to a referee, and the referee does the following: it calculates the feedback based on the action, the outcome and the feedback function, and it calculates a loss based on the loss function, the action and the outcome. It's important to note that these functions are known to both the learner and the environment. Then the referee gives the feedback to the learner and notes the loss, but the loss is not revealed to the learner. In this talk we care about finite partial monitoring, with finitely many actions and outcomes, and a stochastic environment, meaning that the outcomes are chosen in an i.i.d. manner in every time step.
OK, some examples of partial monitoring. If the loss function and the feedback function are the same, then we talk about bandits, because the learner gets exactly the loss as feedback. The next example is the full-information, or expert advice, setting: no matter which action the learner chooses (this row corresponds to action one, for example), the feedback will be the outcome itself, so the outcome is basically revealed to the learner. These are two canonical examples, but we like partial monitoring because it has examples that are outside the scope of these two. A good example is dynamic pricing, where a vendor wants to sell a product at every time step. The customer comes in and wants to buy it; the vendor sets a price, and the customer has a secret maximum price he is willing to pay for the product, and then the transaction happens or not. The feedback to the vendor, who plays the role of the learner, is only whether the transaction happened or not, and the loss is a constant if there was no transaction, and the difference between the maximum price and the actual price if the transaction happened. The discretized version of the dynamic pricing game, with N actions and outcomes, can be represented with these two matrices: this is the loss function and this is the feedback function.
OK, this slide: the performance measure of a player, or learner, is as usual the expected regret, which in the stochastic case is the difference between our expected loss and the expected loss of the best action in hindsight. The problem we want to solve in this paper is the following: given a game, that is, a pair of feedback and loss functions, determine the minimax expected regret of the game. A typical result for the minimax regret is, for example: let's say we have a game, and we can show that the minimax expected regret, or the expected regret, is at most a constant times T to the alpha. Some of you might remember this sentence from Csaba yesterday, and you may think that I stole this sentence from Csaba, but that is not the truth; the truth is that David had this thought like two months ago, and we both stole this sentence from him. OK, previous work.
Based on their expected regret, people have tried to characterize these games. This table shows the games ordered by their expected regret, from zero, which is the trivial game with no regret at all, to hopeless games, where there is not enough information and basically the learner cannot do anything. We know from these people that for full-information and bandit games the minimax regret is square root of T, and we know that, in general, if the game is not hopeless, then they gave an algorithm, based on the Exp3 algorithm, that achieves T to the three-fourths expected regret whenever the game is not hopeless. So they actually showed that there is a gap here. Then, later on, Nicolò, Gábor and Gilles showed that the very same algorithm has a little better regret than Piccolboni thought, and they also showed in their paper that there exists a game, a variant of the label-efficient prediction game, in which there is a lower bound of T to the two-thirds on the regret, meaning that this bound is in some sense tight, in the sense that there exists at least one game where we cannot do any better. But the question is still there: is that true for all games, or what can we do in general about a game? The next step was that we, with some other people, showed that if the game is not trivial, then the expected regret jumps immediately to at least T to the one-half. So the remaining open part is this grey area, including dynamic pricing. These results are all non-stochastic results, but they all apply to the stochastic case.
OK, so we still have the same table. What can we do? We know that in this grey area the lower bound is square root of T and the upper bound is T to the two thirds. Are there games with, say, T to the three fifths minimax regret? That is what we tried to figure out, and the answer turns out to be no: there are no games in between, meaning that every game falls into one of these four categories. As an extra, we show that the dynamic pricing game is hard, that is, you cannot do better than T to the two thirds. So the main theorem is that the minimax regret of any finite partial-monitoring game against a stochastic opponent, or environment, is either zero, square root of T with a tilde (meaning we hide some extra logarithmic factors in our proofs), T to the two thirds, or linear, the last case meaning the game is hopeless.
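In symbols, the classification just stated reads (this is a paraphrase of the theorem, with the game-dependent constants suppressed):

    R_T^*(G) \in \left\{\, 0,\ \tilde{\Theta}(\sqrt{T}),\ \Theta(T^{2/3}),\ \Theta(T) \,\right\} \quad \text{for every finite game } G = (L, H),

where the four cases correspond to trivial, easy, hard, and hopeless games, respectively.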
OK, so how do we do this? We have these two matrices, the loss matrix L and the feedback matrix H, and I'll start with L, explaining what we can do with it and how we use the information it contains. L consists of a bunch of rows; every action corresponds to a row. On the other hand, this is the space of all outcome distributions, so a point p here is a distribution over the outcomes. The rows of L give us a cell decomposition of the probability simplex, the space of outcome distributions: for example, the yellow action is optimal whenever the outcome distribution is in the yellow cell, and when the distribution is this one, the orange action is the optimal action, and so on. Just a little note to save for later: the boundary between the cells of two actions lies in the subspace where (l_i - l_j)^T p = 0, with l_i denoting the row of L corresponding to action i.
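As a small illustration (my own sketch, not code from the paper), this is how the cell decomposition can be read off L: the cell of action i is just the set of distributions under which row i has the smallest expected loss.

    import numpy as np

    # Which action's cell does an outcome distribution p fall into?
    # The expected loss of action i under p is the inner product of the
    # i-th row of L with p, so the optimal action is the row minimizing it.
    def optimal_action(L, p):
        expected_losses = np.asarray(L, dtype=float) @ np.asarray(p, dtype=float)
        return int(np.argmin(expected_losses))

    # The boundary between the cells of actions i and j lies in the
    # hyperplane {p : (L[i] - L[j]) @ p == 0} intersected with the simplex.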
OK, what can we do with H? H consists of symbols; H carries only the information we get as feedback, so its entries are not necessarily numbers. When we choose action i, in this example we get feedback a if the outcome was one or three, feedback b if the outcome was two, and feedback c if the outcome was four. A natural question that arises is: if we are given an environment strategy p, with what probability will we observe each of these symbols? In this case it is really obvious that the probability of observing symbol a is p_1 + p_3, and so on. But how can we generalize this? We all like linear algebra, so let's fill in this table. It's not very hard: we get p_1 + p_3 if we put 1 0 1 0 here, the second row will be this, and the third will be this. Now if we look at this matrix, we can see that the first row is the indicator of symbol a, the second row is the indicator of symbol b, and the third row is the indicator of symbol c. Because of this, we call this matrix the signal matrix of action i. If there is more than one action we care about, we can stack these signal matrices on top of each other and get signal matrices for sets of actions. Why is this important or interesting? One thing to note is that if we have two outcome distributions, we can only choose actions i and i', and these two vectors (the stacked signal matrix applied to the two distributions) are the same, then we cannot distinguish between the two outcome distributions at all: no matter what we do, and no matter how the outcomes are drawn from these distributions, there is no way to figure out which outcome distribution the environment chose at the beginning. So we can say that the kernel of this matrix is an area of danger. OK.
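A minimal sketch of the construction (illustrative code, not from the paper): the signal matrix of an action has one row per distinct feedback symbol, each row being the indicator of the outcomes that produce that symbol, so multiplying it by p gives the observation probabilities.

    import numpy as np

    # Build the signal matrix of action i from the i-th row of the
    # feedback matrix H. Row k of the result is the indicator of the
    # k-th distinct symbol, so that signal_matrix(H_row) @ p is the
    # vector of probabilities of observing each symbol under p.
    def signal_matrix(H_row):
        symbols = sorted(set(H_row))
        return np.array([[1.0 if h == s else 0.0 for h in H_row]
                         for s in symbols])

    # The example from the talk: feedbacks a, b, a, c over four outcomes.
    S_i = signal_matrix(["a", "b", "a", "c"])
    # S_i @ [p1, p2, p3, p4] == [p1 + p3, p2, p4]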
So, what makes a game easy? The question arises, and the answer is that the game is easy if we can figure out which of two actions is better without choosing any other actions, just those two. We want to decide, for two neighbouring actions, which one is better, and we do not want to use any other actions, because that might be costly, as we know from the label efficient prediction game. This is the main condition that characterizes the easy and the hard games. The local observability condition says that for every neighbouring action pair, the difference of the two loss vectors is in the row space of their signal matrix. This sounds really arbitrary right now, but it turns out that this condition is exactly what enables us to estimate the expected difference of the two losses, no matter what the outcome distribution is, by choosing only these two actions; a quick way to check the condition is sketched below. OK, so based on this we can design an algorithm.
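For instance (my own illustration, not the paper's procedure), local observability of a neighbouring pair can be tested with a rank computation on the stacked signal matrix:

    import numpy as np

    # Local observability of the neighbouring action pair (i, j):
    # is L[i] - L[j] in the row space of the stacked signal matrix
    # [S_i; S_j]? Appending the difference must not increase the rank.
    def locally_observable(L, S_i, S_j, i, j, tol=1e-9):
        stacked = np.vstack([S_i, S_j])
        diff = np.asarray(L[i], dtype=float) - np.asarray(L[j], dtype=float)
        return (np.linalg.matrix_rank(np.vstack([stacked, diff]), tol=tol)
                == np.linalg.matrix_rank(stacked, tol=tol))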
The algorithm is as follows. We maintain a set of alive actions, and in every round we choose each alive action once; it's like a racing algorithm, we choose each alive action once. We maintain an estimate of the loss difference for every neighbouring action pair, and after each round we update these estimates. Then, if it turns out that we are confident in the sign of a loss difference, that is, by a large margin we can say that the difference between two losses is negative or positive, then we can eliminate the whole half-space. Let's say it turns out that the yellow action is significantly better than the orange action, with confidence: then we can eliminate this whole half-space, and note that here not only the orange action is eliminated, but this greyish action is eliminated as well, because we can be sure that if the outcome distribution is in this half-space, then that action cannot be optimal, so we can eliminate it too. We keep doing this until we run out of time or only one action remains, and this algorithm achieves O-tilde of square root of T regret. So that's good: if the condition holds, then we can upper bound the expected regret by roughly square root of T. A rough sketch of the elimination loop follows.
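A heavily simplified sketch of that elimination loop (my own illustration, not the authors' algorithm; the loss-difference estimator and the confidence widths are assumed to be supplied, and eliminating the rest of the half-space is only hinted at in a comment):

    # Simplified elimination scheme. `play(i)` plays action i once and
    # returns the observed feedback symbol; `estimate_loss_difference(i, j, obs)`
    # is assumed to use local observability to update and return an
    # estimate of (L[i] - L[j]) @ p; `confidence_width(i, j, t)` is an
    # assumed high-probability margin.
    def eliminate(actions, neighbours, play, estimate_loss_difference,
                  confidence_width, T):
        alive = set(actions)
        t = 0
        while t < T and len(alive) > 1:
            obs = {i: play(i) for i in alive}      # one pull per alive action
            t += len(alive)
            for (i, j) in list(neighbours):
                if i in alive and j in alive:
                    d = estimate_loss_difference(i, j, obs)
                    if abs(d) > confidence_width(i, j, t):
                        # Confident about the sign: drop the worse action.
                        # The full algorithm also drops every other action
                        # that is only optimal inside the eliminated
                        # half-space of outcome distributions.
                        alive.discard(j if d < 0 else i)
        return alive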
OK, what is the other case? The other case is when we have two actions that are neighbours and we do not have enough feedback, meaning that there is a line in the outcome distribution space along which we cannot distinguish between outcome distributions based on actions i and j; this is where the null space of the stacked signal matrix of i and j comes in. In this case a third action is needed to decide which action is better, and that is why it becomes costly. And there is a nice coincidence here: when does this line exist? One condition is that the local observability condition does not hold, so the loss difference is not observable; the other is that this line of unobservability crosses the boundary between the two cells, and luckily these two conditions coincide. So we are in this scenario exactly when we cannot run the algorithm from the previous slide. Then we do the usual lower-bound proof technique: we put two distributions very close to the boundary, on this line, and we force the algorithm to decide which outcome distribution the environment chose; for that it needs to pull, sorry, this is not the bandit talk, it needs to choose actions other than i and j, and that becomes costly. OK.
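The rough trade-off behind that lower bound, as a back-of-the-envelope calculation (my reconstruction of the standard argument, not a quotation from the proof): if the two distributions sit at distance \varepsilon from each other along the unobservable direction, each pull of the revealing third action costs a constant amount of regret, while playing the wrong one of i and j costs about \varepsilon per round, so

    \underbrace{\varepsilon^{-2}}_{\text{rounds of the third action needed to tell them apart}} \;\approx\; \underbrace{\varepsilon T}_{\text{regret of never telling them apart}} \quad\Longrightarrow\quad \varepsilon \approx T^{-1/3}, \qquad R_T \approx \varepsilon T = T^{2/3}.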
So, in summary, we classified all finite stochastic partial-monitoring games, and it turns out that there are only four kinds of games: trivial games with zero regret, where there is an action that is superior to every other action no matter what the outcome distribution is; easy games, for which we can run the algorithm; hard games, where the lower bound holds; and hopeless games, where not even global observability holds. I didn't talk about this, but characterizing the hopeless games was done by Piccolboni and Schindelhauer, and it is basically equivalent to saying that global observability does not hold: there are outcome distributions that are indistinguishable even by choosing all the actions. So we have the key condition separating the easy and the hard games, which is the local observability condition, and the algorithm we designed achieves the minimax rate up to some logarithmic factors.

There are some remaining questions. The first is computational efficiency: can the separating condition be verified efficiently, and once we have decided that the game is easy, is our algorithm efficient? It turns out that all our efficiency questions reduce to linear programming, and since we have polynomial-time linear programming solvers, everything is efficient. The next question is how we scale with the number of actions; I did not talk about this at all. There are two parts here: the lower bound does not scale with the number of actions at all, because it only ever uses two actions in the proof, and the sad part is that the upper bound's dependence on the number of actions is a little ugly. We hope we can get it down to linear, but that is still not as good as in the bandit case, where it should be square root of N. So the big question is whether there is a lower bound that is linear in N, or N to the one half, or whether we need a better analysis or a better algorithm. As for scaling with the number of outcomes, the good news is that these bounds do not depend on the number of outcomes at all.

The last question is what we can say about non-stochastic opponents. The big conjecture is that exactly the same classification holds for non-stochastic opponents, and it is a very strong conjecture, because every lower bound that we have seen so far used only stochastic opponents; if we found a separation between adversarial and stochastic games here, that would be a pretty big deal, and I don't think it will happen. So the only thing missing is an algorithm for easy games in the non-stochastic setting, and then we are done with the classification.
Thank you. Questions?

Question: You assume that the feedback is a function of the action and the outcome?

Answer: Yes. There is no noise involved: given an action and an outcome, everything from that point on is deterministic, the feedback is a function of those two.

Question: Can you extend this to random signals?

Answer: We do not have such an extension. We rely very strongly on these combinatorial and linear-algebraic structures, so probably something different would be needed to solve that case.

Question: Can you extend this to infinite games, with infinitely many actions or infinitely many outcomes?

Answer: With infinitely many actions you obviously need some extra structure on the actions, and we hope we will be able to extend it, but I do not know how yet. As for outcomes, if we just drop the structure and allow infinitely many outcomes, the matrices become integral operators and the answer is probably no; but if we say that an outcome is a convex combination of atomic outcomes, then the answer is actually yes: the whole thing generalizes without any modification, and then we have a solution for infinitely many outcomes.

All right, if there are no more questions, let's thank Gábor.