JMLR: Workshop and Conference Proceedings vol (2010) 1–21, 24th Annual Conference on Learning Theory
Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments
Gábor Bartók bartok@cs.ualberta.ca
Dávid Pál dpal@cs.ualberta.ca
Csaba Szepesvári szepesva@cs.ualberta.ca
Department of Computing Science, University of Alberta, Edmonton, T6G 2E8, AB, Canada
Editors : Sham Kakade , Ulrike von Luxburg
Abstract
In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight. Assuming that the outcomes are generated in an i.i.d. fashion from an arbitrary and unknown probability distribution, we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes. It turns out that the minimax regret of any such game is either zero, $\tilde\Theta(\sqrt{T})$, $\Theta(T^{2/3})$, or $\Theta(T)$. We provide a computationally efficient learning algorithm that achieves the minimax regret within a logarithmic factor for any game.
Keywords: Online learning, Imperfect feedback, Regret analysis
1 . Introduction
Partial monitoring provides a mathematical framework for sequential decision making problems with imperfect feedback . Various problems of interest can be modeled as partial monitoring instances , such as learning with expert advice ( Littlestone and Warmuth , 1994 ) , the multi-armed bandit problem ( Auer et al . , 2002 ) , dynamic pricing ( Kleinberg and Leighton , 2003 ) , the dark pool problem ( Agarwal et al . , 2010 ) , label efficient prediction ( Cesa-Bianchi et al . , 2005 ) , and linear and convex optimization with full or bandit feedback ( Zinkevich , 2003 ; Abernethy et al . , 2008 ; Flaxman et al . , 2005 ) .
In this paper we restrict ourselves to finite games, i.e., games where both the set of actions available to the learner and the set of possible outcomes generated by the environment are finite. A finite partial monitoring game $G$ is described by a pair of $N \times M$ matrices: the loss matrix $L$ and the feedback matrix $H$. The entries $\ell_{i,j}$ of $L$ are real numbers lying in, say, the interval $[0,1]$. The entries $h_{i,j}$ of $H$ belong to an alphabet $\Sigma$ on which we do not impose any structure; we only assume that the learner is able to distinguish distinct elements of the alphabet.
The game proceeds in $T$ rounds according to the following protocol. First, $G = (L, H)$ is announced to both players. In each round $t = 1, 2, \dots, T$, the learner chooses an action $I_t \in$
This work was supported in part by AICML , AITF ( formerly iCore and AIF ) , NSERC and the PASCAL2 Network of Excellence under EC grant no . 216886 .
© 2010 G. Bartók, D. Pál & C. Szepesvári.
$\{1, 2, \dots, N\}$ and simultaneously, the environment chooses an outcome $J_t \in \{1, 2, \dots, M\}$. Then, the learner receives as feedback the entry $h_{I_t,J_t}$. The learner incurs the instantaneous loss $\ell_{I_t,J_t}$, which is not revealed to him. The feedback can be thought of as masked information about the outcome $J_t$. In some cases $h_{I_t,J_t}$ might uniquely determine the outcome, in other cases the feedback might give only partial or no information about the outcome. In this paper, we shall assume that $J_t$ is chosen randomly from a fixed multinomial distribution.
The learner is scored according to the loss matrix $L$. In round $t$ the learner incurs an instantaneous loss of $\ell_{I_t,J_t}$. The goal of the learner is to keep his total loss $\sum_{t=1}^T \ell_{I_t,J_t}$ low. Equivalently, the learner's performance can also be measured in terms of his regret, i.e., the total loss of the learner is compared with the loss of the best fixed action in hindsight. The regret is defined as the difference of these two losses.
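A minimal simulation may help fix ideas. The game below is a hypothetical 2×2 example of our own (not from the paper): action 0 yields an uninformative symbol, action 1 reveals the outcome exactly; the simulator tracks the loss and regret that remain hidden from the learner.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 2-action / 2-outcome game: L holds losses, H feedback symbols.
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])
H = np.array([["a", "a"],    # action 0 reveals nothing about the outcome
              ["b", "c"]])   # action 1 reveals the outcome exactly

T = 1000
p = np.array([0.3, 0.7])           # opponent strategy (unknown to the learner)
J = rng.choice(2, size=T, p=p)     # i.i.d. outcomes J_t
I = np.zeros(T, dtype=int)         # a naive learner that always plays action 0

feedback = H[I, J]                 # the only information the learner receives
total_loss = L[I, J].sum()         # not revealed to the learner
best_fixed = min(L[i, J].sum() for i in range(2))
regret = total_loss - best_fixed   # regret w.r.t. best fixed action in hindsight
```

Since this learner never deviates from a fixed action, its cumulative loss is at least that of the best fixed action, so its regret is nonnegative by construction.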
In general , the regret grows with the number of rounds T . If the regret is sublinear in T , the learner is said to be Hannan consistent , and this means that the learner’s average per-round loss approaches the average per-round loss of the best action in hindsight .
Piccolboni and Schindelhauer (2001) were among the first to study the regret of these games. In fact, they studied the problem without making any probabilistic assumptions about the outcome sequence $J_t$. They proved that for any finite game $(L, H)$, either the regret can be $\Omega(T)$ in the worst case for any algorithm, or there exists an algorithm whose regret is $\tilde{O}(T^{3/4})$ on any outcome sequence.1 This result was later improved by Cesa-Bianchi et al. (2006), who showed that the algorithm of Piccolboni and Schindelhauer has regret $\tilde{O}(T^{2/3})$. Furthermore, they provided an example of a finite game, a variant of label-efficient prediction, for which any algorithm has regret $\tilde\Theta(T^{2/3})$ in the worst case.
However, for many games $\tilde{O}(T^{2/3})$ is not optimal. For example, games with full feedback (i.e., when the feedback uniquely determines the outcome) can be viewed as a special instance of the problem of learning with expert advice, and in this case it is known that the "EWA forecaster" has regret $O(\sqrt{T})$; see, e.g., Cesa-Bianchi and Lugosi (2006, Chapter 3). Similarly, for games with "bandit feedback" (i.e., when the feedback determines the instantaneous loss) the INF algorithm (Audibert and Bubeck, 2009) and the Exp3 algorithm (Auer et al., 2002) achieve $O(\sqrt{T})$ regret as well.2
This leaves open the problem of determining the minimax regret (i.e., optimal worst-case regret) of any given game $(L, H)$. Partial progress was made in this direction by Bartók et al. (2010), who characterized (almost) all finite games with $M = 2$ outcomes. They showed that the minimax regret of any "non-degenerate" finite game with two outcomes falls into one of four categories: zero, $\tilde\Theta(\sqrt{T})$, $\tilde\Theta(T^{2/3})$, or $\Theta(T)$. They gave a combinatoric-geometric condition on the matrices $L, H$ which determines the category a game belongs to. Additionally, they constructed an efficient algorithm which, for any game, achieves the minimax regret rate associated with the game within a poly-logarithmic factor.
In this paper , we consider the same problem , with two exceptions . In pursuing a general result , we will consider all finite games . However , at the same time , we will only deal with stochastic environments , i.e. , when the outcome sequences are generated from a fixed probability distribution in an i.i.d . manner .
1. The notations $\tilde{O}(\cdot)$ and $\tilde\Theta(\cdot)$ hide polylogarithmic factors.
2. We ignore the dependence of the regret on the number of actions or any other parameters.
The regret against stochastic environments is defined as the difference between the cumulative loss suffered by the algorithm and that of the action with the lowest expected loss . That is , given an algorithm A and a time horizon T , if the outcomes are generated from a probability distribution p , the regret is
$$ R_T(A, p) = \sum_{t=1}^T \ell_{I_t,J_t} - \min_{1\le i\le N} \mathbb{E}_p\left[\sum_{t=1}^T \ell_{i,J_t}\right]. $$
In this paper we analyze the minimax expected regret ( in what follows , minimax regret ) of games , defined as
$$ R_T(G) = \inf_{A} \sup_{p \in \Delta_M} \mathbb{E}_p\left[R_T(A, p)\right]. $$
We show that the minimax regret of any finite game falls into one of four categories: zero, $\tilde\Theta(\sqrt{T})$, $\tilde\Theta(T^{2/3})$, or $\Theta(T)$. Accordingly, we call the games trivial, easy, hard, and hopeless. We give a simple and efficiently computable characterization of these classes using a geometric condition on $(L, H)$. We provide lower bounds and algorithms that achieve them within a poly-logarithmic factor. Our result is an extension, for stochastic environments, of the result of Bartók et al. (2010).
It is clear that any lower bound which holds for stochastic environments must hold for adversarial environments too . On the other hand , algorithms and regret upper bounds for stochastic environments , of course , do not transfer to algorithms and regret upper bounds for the adversarial case . Our characterization is a stepping stone towards understanding the minimax regret of partial monitoring games . In particular , we conjecture that our characterization holds without any change for unrestricted environments .
2 . Preliminaries
In this section, we introduce our conventions, along with some definitions. By default, all vectors are column vectors. We denote by $\|v\| = \sqrt{v^\top v}$ the Euclidean norm of a vector $v$. For a vector $v$, the notation $v \ge 0$ means that all entries of $v$ are non-negative, and the notation $v > 0$ means that all entries are positive. For a matrix $A$, $\operatorname{Im} A$ denotes its image space, i.e., the vector space generated by its columns, and $\operatorname{Ker} A$ denotes its kernel, i.e., the set $\{x : Ax = 0\}$.
Consider a game $G = (L, H)$ with $N$ actions and $M$ outcomes. That is, $L \in \mathbb{R}^{N \times M}$ and $H \in \Sigma^{N \times M}$. For the sake of simplicity and without loss of generality, we assume that no symbol $\sigma \in \Sigma$ can be present in two different rows of $H$. The signal matrix of an action is defined as follows:
Definition 1 (Signal matrix) Let $\{\sigma_1, \dots, \sigma_{s_i}\}$ be the set of symbols listed in the $i$th row of $H$. (Thus, $s_i$ denotes the number of different symbols in row $i$ of $H$.) The signal matrix $S_i$ of action $i$ is defined as the $s_i \times M$ matrix with entries $a_{k,j} = \mathbb{I}(h_{i,j} = \sigma_k)$ for $1 \le k \le s_i$ and $1 \le j \le M$. The signal matrix for a set of actions is defined as the signal matrices of the actions in the set, stacked on top of one another, in the ordering of the actions.
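Definition 1 is easy to operationalize. The following sketch (plain NumPy; both helper names are ours) constructs $S_i$ from the $i$th row of $H$ and stacks the signal matrices of a set of actions:

```python
import numpy as np

def signal_matrix(H, i):
    """Signal matrix S_i of action i: one row per distinct symbol sigma_k
    in row i of H, with entries (S_i)_{k,j} = I(h_{i,j} = sigma_k)."""
    symbols = []                       # distinct symbols, in order of first appearance
    for s in H[i]:
        if s not in symbols:
            symbols.append(s)
    M = len(H[i])
    S = np.zeros((len(symbols), M))
    for j, s in enumerate(H[i]):
        S[symbols.index(s), j] = 1.0
    return S

def stacked_signal_matrix(H, actions):
    """Signal matrix for a set of actions: the individual S_i stacked vertically."""
    return np.vstack([signal_matrix(H, i) for i in actions])
```

For instance, with `H = [["a", "a", "b"], ["c", "d", "d"]]`, `signal_matrix(H, 0)` is the 2×3 matrix `[[1, 1, 0], [0, 0, 1]]`; each column has exactly one nonzero entry, so $S_i p$ is a probability distribution whenever $p$ is.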
For an example of a signal matrix, see Section 3.1. We identify the strategy of a stochastic opponent with an element of the probability simplex $\Delta_M = \{p \in \mathbb{R}^M : p \ge 0,\ \sum_{j=1}^M p_j = 1\}$. Note that for any opponent strategy $p$, if the learner chooses action $i$ then the vector $S_i p \in \mathbb{R}^{s_i}$ is the probability distribution of the observed feedback: $(S_i p)_k$ is the probability of observing the $k$th symbol.
We denote by $\ell_i$ the $i$th row of the loss matrix $L$ and we call $\ell_i$ the loss vector of action $i$. We say that action $i$ is optimal under opponent strategy $p \in \Delta_M$ if for every $1 \le j \le N$, $\ell_i^\top p \le \ell_j^\top p$. Action $i$ is said to be Pareto-optimal if there exists an opponent strategy $p$ such that action $i$ is optimal under $p$. We now define the cell decomposition of $\Delta_M$ induced by $L$ (for an example, see Figure 2):
Definition 2 (Cell decomposition) For an action $i$, the cell $C_i$ associated with $i$ is defined as $C_i = \{p \in \Delta_M : \text{action } i \text{ is optimal under } p\}$. The cell decomposition of $\Delta_M$ is defined as the multiset $\mathcal{C} = \{C_i : 1 \le i \le N,\ C_i \text{ has positive } (M-1)\text{-dimensional volume}\}$.
Actions whose cell has positive $(M-1)$-dimensional volume are called strongly Pareto-optimal. Actions that are Pareto-optimal but not strongly Pareto-optimal are called degenerate. Note that the cells of the actions are defined by linear inequalities and thus they are convex polytopes. It follows that strongly Pareto-optimal actions are the actions whose cells are $(M-1)$-dimensional polytopes. It is also important to note that the cell decomposition is a multiset, since some actions can share the same cell. Nevertheless, if two actions have the same cell of dimension $(M-1)$, their loss vectors will necessarily be identical.3
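One can explore a cell decomposition numerically by sampling opponent strategies uniformly from the simplex and recording which action is optimal; cells of positive $(M-1)$-dimensional volume then show up with positive empirical frequency. A sketch with an illustrative 3×3 loss matrix of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

L = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])   # an illustrative loss matrix (N = M = 3)

# Uniform samples from the simplex = Dirichlet(1, 1, 1) draws.
P = rng.dirichlet(np.ones(3), size=20000)
winners = np.argmin(P @ L.T, axis=1)   # optimal action under each sampled p

# Empirical fraction of the simplex covered by each cell; positive frequency
# is a Monte Carlo proxy for positive (M-1)-dimensional volume.
freq = np.bincount(winners, minlength=3) / len(winners)
strongly_pareto = np.where(freq > 0)[0]
```

For this particular $L$, all three cells contain a neighborhood of a distinct region of the simplex (action 1 is optimal, e.g., at the uniform strategy), so all three actions are strongly Pareto-optimal.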
We call two cells of $\mathcal{C}$ neighbors if their intersection is an $(M-2)$-dimensional polytope. The actions corresponding to these cells will also be called neighbors. Neighborship is not defined for cells outside of $\mathcal{C}$. For two neighboring cells $C_i, C_j \in \mathcal{C}$, we define the neighborhood action set $A_{i,j} = \{1 \le k \le N : C_i \cap C_j \subseteq C_k\}$. It follows from the definition that actions $i$ and $j$ are in $A_{i,j}$ and thus $A_{i,j}$ is nonempty. However, the neighborhood action set may contain more than two actions.
When discussing lower bounds we will need a definition of algorithms. For us, an algorithm $\mathcal{A}$ is a mapping $\mathcal{A} : \Sigma^* \to \{1, 2, \dots, N\}$ which maps past feedback sequences to actions. That the algorithms are deterministic is assumed for convenience. In particular, the lower bounds we prove can be extended to randomized algorithms by conditioning on the internal randomization of the algorithm. Note that the algorithms we design are themselves deterministic.
3 . Classification of finite partial-monitoring games
In this section we present our main result : we state the theorem that classifies all finite stochastic partial-monitoring games based on how their minimax regret scales with the time horizon . Thanks to the previous section , we are now equipped to define a notion which will play a key role in the classification theorem:
3 . One could think that actions with identical loss vectors are redundant and that all but one of such actions could be removed without loss of generality . However , since different actions can lead to different observations and thus yield different information , removing the duplicates can be harmful .
Definition 3 (Observability) Let $S$ be the signal matrix for the set of all actions in the game. For actions $i$ and $j$, we say that $\ell_i - \ell_j$ is globally observable if $\ell_i - \ell_j \in \operatorname{Im} S^\top$. Furthermore, if $i$ and $j$ are two neighboring actions, then $\ell_i - \ell_j$ is called locally observable if $\ell_i - \ell_j \in \operatorname{Im} S_{(i,j)}^\top$, where $S_{(i,j)}$ is the signal matrix for the neighborhood action set $A_{i,j}$.
As we will see , global observability implies that we can estimate the difference of the expected losses after choosing each action once . Local observability means we only need actions from the neighborhood action set to estimate the difference .
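Both observability conditions reduce to a linear-algebra test: the loss difference must lie in the row space of the relevant signal matrix, which can be checked via a least-squares residual. A sketch (the helper name is ours):

```python
import numpy as np

def in_row_space(S, d, tol=1e-9):
    """Check whether d lies in the row space of S, i.e. whether the linear
    system S^T v = d has a solution v (the observability condition)."""
    v, *_ = np.linalg.lstsq(S.T, d, rcond=None)
    return np.linalg.norm(S.T @ v - d) < tol
```

For example, if $S$ has rows $(1,1,0)$ and $(0,0,1)$, then $(2,2,-1)$ is observable (it equals $2\cdot(1,1,0) - (0,0,1)$) while $(1,0,0)$ is not, since every combination of the rows has equal first and second coordinates.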
The classification theorem , which is our main result , is the following:
Theorem 4 (Classification) Let $G = (L, H)$ be a partial-monitoring game with $N$ actions and $M$ outcomes. Let $\mathcal{C} = \{C_1, \dots, C_k\}$ be its cell decomposition, with corresponding loss vectors $\ell_1, \dots, \ell_k$. The game $G$ falls into one of the following four categories:
(a) $R_T(G) = 0$ if there exists an action $i$ with $C_i = \Delta_M$. This case is called trivial.

(b) $R_T(G) = \Theta(T)$ if there exist two strongly Pareto-optimal actions $i$ and $j$ such that $\ell_i - \ell_j$ is not globally observable. This case is called hopeless.

(c) $R_T(G) = \tilde\Theta(\sqrt{T})$ if it is not trivial and for all pairs of (strongly Pareto-optimal) neighboring actions $i$ and $j$, $\ell_i - \ell_j$ is locally observable. These games are called easy.

(d) $R_T(G) = \tilde\Theta(T^{2/3})$ if $G$ is not hopeless and there exists a pair of neighboring actions $i$ and $j$ such that $\ell_i - \ell_j$ is not locally observable. These games are called hard.
Note that the conditions listed under (a)–(d) are mutually exclusive and cover all finite partial-monitoring games. The only non-obvious implication is that an easy game cannot be hopeless. This holds because for any pair of cells $C_i, C_j$ in $\mathcal{C}$, the vector $\ell_i - \ell_j$ can be expressed as a telescoping sum of the differences of loss vectors of neighboring cells.
The remainder of the paper is dedicated to proving Theorem 4. We start with the simple cases. If there exists an action whose cell covers the whole probability simplex, then choosing that action in every round yields zero regret, proving case (a). The condition in case (b) is due to Piccolboni and Schindelhauer (2001), who showed that under this condition there is no algorithm that achieves sublinear regret.4 The upper bound for case (d) is achieved by the FeedExp3 algorithm due to Piccolboni and Schindelhauer (2001), for which a regret bound of $\tilde{O}(T^{2/3})$ was shown by Cesa-Bianchi et al. (2006). The lower bound for case (c) was proved by Antos et al. (2011). For a visualization of previous results, see Figure 1.
The above assertions help characterize trivial and hopeless games, and show that if a game is neither trivial nor hopeless then its minimax regret falls between $\Omega(\sqrt{T})$ and $\tilde{O}(T^{2/3})$. Our contribution in this paper is to give exact minimax rates (up to logarithmic factors) for these games. To prove the upper bound for case (c), we introduce a new algorithm, which we call Balaton, for "Bandit Algorithm for Loss Annihilation".5 This algorithm is presented in Section 4, while its analysis is given in Section 5. The lower bound for case (d) is presented in Section 6.
4. Although Piccolboni and Schindelhauer state their theorem for adversarial environments, their proof applies to stochastic environments without any change (which is important for the lower bound part).
5. Balaton is a lake in Hungary. We thank Gergely Neu for suggesting the name.
Figure 1: Partial monitoring games and their minimax regret as it was known previously. The big rectangle denotes the set of all games. Inside the big rectangle, the games are ordered from left to right based on their minimax regret. In the "hard" area, l.e.p. denotes label-efficient prediction. The grey area contains games whose minimax regret is between $\Omega(\sqrt{T})$ and $\tilde{O}(T^{2/3})$ but whose exact regret rate was unknown. This area is now eliminated, and the dynamic pricing problem is proven to be hard.
3.1 . Example
In this section, as a corollary of Theorem 4, we show that the discretized dynamic pricing game (see, e.g., Cesa-Bianchi et al. (2006)) is hard. Dynamic pricing is a game between a vendor (learner) and a customer (environment). In each round, the vendor sets the price at which he wants to sell his product (action), and the customer sets the maximum price he is willing to pay for the product (outcome). If the product is not sold, the vendor suffers some constant loss; otherwise his loss is the difference between the customer's maximum price and his own. The customer never reveals the maximum price, and thus the vendor's only feedback is whether he sold the product or not.
The discretized version of the game with $N$ actions (and outcomes) is defined by the matrices
$$ L = \begin{pmatrix} 0 & 1 & 2 & \cdots & N-1 \\ c & 0 & 1 & \cdots & N-2 \\ \vdots & \ddots & \ddots & & \vdots \\ c & \cdots & c & 0 & 1 \\ c & \cdots & \cdots & c & 0 \end{pmatrix}, \qquad H = \begin{pmatrix} 1 & \cdots & \cdots & 1 \\ 0 & 1 & \cdots & 1 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix}, $$
where $c$ is a positive constant (see Figure 2 for the cell decomposition for $N = 3$). It is easy to see that all the actions are strongly Pareto-optimal. Also, after some linear algebra it turns out that the cells underlying the actions have a single common vertex in the interior of the probability simplex. It follows that any two actions are neighbors. On the other hand, if we take two non-consecutive actions $i$ and $i'$, then $\ell_i - \ell_{i'}$ is not locally observable. For example, the signal matrix for action $1$ and action $N$ is
$$ S_{(1,N)} = \begin{pmatrix} 1 & \cdots & 1 & 1 \\ 1 & \cdots & 1 & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}, $$
whereas $\ell_N - \ell_1 = (c, c-1, \dots, c-N+2, -N+1)^\top$. It is obvious that $\ell_N - \ell_1$ is not in the row space of $S_{(1,N)}$.
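This computation is easy to reproduce numerically. The sketch below builds the dynamic pricing matrices for $N = 5$ and $c = 2$ (our choice of constants), assembles the stacked signal matrix of actions $1$ and $N$, and confirms that $\ell_N - \ell_1$ is not in its row space:

```python
import numpy as np

def pricing_game(N, c):
    """Discretized dynamic pricing: action i = asked price, outcome j =
    customer's maximum price (both 0-indexed here)."""
    L = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            L[i, j] = (j - i) if j >= i else c   # sold: loss j - i, else c
    H = (np.arange(N)[None, :] >= np.arange(N)[:, None]).astype(float)  # 1 iff sold
    return L, H

N, c = 5, 2.0
L, H = pricing_game(N, c)

# Stacked signal matrix of actions 1 and N (0-indexed 0 and N-1):
# action 0 always observes "sold"; action N-1 has two symbols.
S = np.array([np.ones(N),      # S_1: single symbol, all outcomes
              H[N - 1],        # S_N: indicator of "sold"
              1.0 - H[N - 1]]) # S_N: indicator of "not sold"
d = L[N - 1] - L[0]            # loss difference l_N - l_1

v, *_ = np.linalg.lstsq(S.T, d, rcond=None)
locally_observable = np.linalg.norm(S.T @ v - d) < 1e-9
```

The row space of `S` consists of vectors that are constant on the first $N-1$ coordinates, while `d` equals $(2, 1, 0, -1, -4)$ here, so the residual is bounded away from zero and `locally_observable` is false, matching the argument above.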
Figure 2: The cell decomposition of the discretized dynamic pricing game with 3 actions, drawn on the probability simplex with vertices $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. If the opponent strategy is $p$, then action 2 is the optimal action.
4. Balaton: An algorithm for easy games

In this section we present our algorithm that achieves $\tilde{O}(\sqrt{T})$ expected regret for easy games (case (c) of Theorem 4). The input of the algorithm is the loss matrix $L$, the feedback matrix $H$, the time horizon $T$ and an error probability $\delta$, to be chosen later. Before describing the algorithm, we introduce some notation. We define a graph $\mathcal{G}$ associated with the game $G$ in the following way. Let the vertex set be the set of cells of the cell decomposition $\mathcal{C}$ of the probability simplex, where cells $C_i, C_j \in \mathcal{C}$ share the same vertex when $C_i = C_j$. The graph has an edge between vertices whose corresponding cells are neighbors. This graph is connected, since the probability simplex is convex and the cell decomposition covers the simplex.
Recall that for neighboring cells $C_i, C_j$, the signal matrix $S_{(i,j)}$ is defined as the signal matrix for the neighborhood action set $A_{i,j}$ of cells $i, j$. Assuming that the game satisfies the condition of case (c) of Theorem 4, we have that for all neighboring cells $C_i$ and $C_j$, $\ell_i - \ell_j \in \operatorname{Im} S_{(i,j)}^\top$. This means that there exists a coefficient vector $v_{(i,j)}$ such that $\ell_i - \ell_j = S_{(i,j)}^\top v_{(i,j)}$. We define the $k$th segment of $v_{(i,j)}$, denoted by $v_{(i,j),k}$, as the vector of components of $v_{(i,j)}$ that correspond to the $k$th action in the neighborhood action set. That is, if $S_1, \dots, S_r$ are the signal matrices of the individual actions in $A_{i,j}$, stacked to form $S_{(i,j)}$, then $\ell_i - \ell_j = S_{(i,j)}^\top v_{(i,j)} = \sum_{s=1}^r S_s^\top v_{(i,j),s}$.
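A small numerical check of this decomposition, with two hypothetical signal matrices of our own: solve $S_{(i,j)}^\top v = \ell_i - \ell_j$ by least squares, split $v$ into per-action segments, and verify the identity $\sum_k (S_k p)^\top v_{(i,j),k} = (\ell_i - \ell_j)^\top p$ that underlies the unbiasedness argument.

```python
import numpy as np

# Hypothetical neighborhood action set with two actions and signal matrices:
S1 = np.array([[1., 1., 0.],
               [0., 0., 1.]])
S2 = np.array([[1., 0., 0.],
               [0., 1., 1.]])
S = np.vstack([S1, S2])                # stacked signal matrix S_(i,j)
d = np.array([1., -1., 0.])            # a loss difference l_i - l_j (observable here)

# Solve S^T v = d, then split v into per-action segments v_(i,j),k.
v, *_ = np.linalg.lstsq(S.T, d, rcond=None)
segments = np.split(v, [S1.shape[0]])

# For any outcome distribution p, the expected estimate equals d^T p.
p = np.array([0.2, 0.5, 0.3])
estimate = (S1 @ p) @ segments[0] + (S2 @ p) @ segments[1]
```

Here `S1 @ p` plays the role of the expected observation vector $\mathbb{E}[O_k]$, so `estimate` is the expectation of the per-round estimator built from the segments.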
Let $J_t \in \{1, \dots, M\}$ denote the outcome at time step $t$. For $1 \le k \le M$, let $e_k \in \mathbb{R}^M$ be the $k$th unit vector. For an action $i$, let $O_i(t) = S_i e_{J_t}$ be the observation vector of action $i$ at time step $t$. If the rows of the signal matrix $S_i$ correspond to symbols $\sigma_1, \dots, \sigma_{s_i}$ and action $i$ is chosen at time step $t$, then the unit vector $O_i(t)$ indicates which symbol was observed in that time step. Thus, $O_{I_t}(t)$ holds the same information as the feedback at time $t$ (recall that $I_t$ is the action chosen by the learner at time step $t$). From now on, for simplicity, we will assume that the feedback at time step $t$ is the observation vector $O_{I_t}(t)$ itself.
The main idea of the algorithm is to successively eliminate actions in an efficient , yet safe manner . When all remaining strongly Pareto optimal actions share the same cell , the elimination phase finishes and from this point , one of the remaining actions is played . During the elimination phase , the algorithm works in rounds . In each round each ‘alive’ Pareto optimal action is played once . The resulting observations are used to estimate the loss-difference between the alive actions . If some estimate becomes sufficiently precise , the action of the pair deemed to be suboptimal is eliminated ( possibly together with other
Algorithm 1 Balaton
Input: L, H, T, δ
Initialization:
  [G, C, {v(i,j),k}, {path(i,j)}, {(LB(i,j), UB(i,j), σ(i,j), R(i,j))}] ← Initialize(L, H)
  t ← 0, n ← 0
  aliveActions ← {1 ≤ i ≤ N : C_i ∩ interior(Δ_M) ≠ ∅}
{main loop}
while |V_G| > 1 and t < T do
  n ← n + 1
  for each i ∈ aliveActions do
    O_i ← ExecuteAction(i)
    t ← t + 1
  end for
  for each edge (i, j) in G: μ(i,j) ← Σ_{k ∈ A_{i,j}} O_k⊤ v(i,j),k end for
  for each non-adjacent vertex pair (i, j) in G: μ(i,j) ← Σ_{(k,l) ∈ path(i,j)} μ(k,l) end for
  haveEliminated ← false
  for each vertex pair (i, j) in G do
    μ̂(i,j) ← (1 − 1/n) μ̂(i,j) + (1/n) μ(i,j)
    if BStopStep(μ̂(i,j), LB(i,j), UB(i,j), σ(i,j), R(i,j), n, 1/2, δ) then
      [aliveActions, C, G] ← eliminate(i, j, sgn(μ̂(i,j)))
      haveEliminated ← true
    end if
  end for
  if haveEliminated then
    {path(i,j)} ← regeneratePaths(G)
  end if
end while
Let i be a strongly Pareto-optimal action in aliveActions
while t < T do
  ExecuteAction(i)
  t ← t + 1
end while
actions ) . To determine if an estimate is sufficiently precise , we will use an appropriate stopping rule . A small regret will be achieved by tuning the error probability of the stopping rule appropriately .
The details of the algorithm are as follows. In the preprocessing phase, the algorithm constructs the neighborhood graph, the signal matrices $S_{(i,j)}$ assigned to the edges of the graph, the coefficient vectors $v_{(i,j)}$ and their segment vectors $v_{(i,j),k}$. In addition, it constructs a path in the graph connecting every pair of nodes, and initializes some variables used by the stopping rule.
In the elimination phase, the algorithm runs a loop. In each round of the loop, the algorithm chooses each of the alive actions once and, based on the observations, updates the estimates $\hat\mu_{(i,j)}$ of the loss differences $(\ell_i - \ell_j)^\top p$, where $p$ is the actual opponent strategy. The algorithm maintains the set $\mathcal{C}$ of cells of alive actions and their neighborship graph $\mathcal{G}$.
The estimates are calculated as follows. First we calculate estimates for neighboring actions $(i, j)$. In round6 $n$, for every action $k$ in $A_{i,j}$ let $O_k$ be the observation vector for action $k$, and let $\mu_{(i,j)} = \sum_{k \in A_{i,j}} O_k^\top v_{(i,j),k}$. From the local observability condition and the construction of $v_{(i,j),k}$, it follows by simple algebra that the $\mu_{(i,j)}$ are unbiased estimates of $(\ell_i - \ell_j)^\top p$ (see Lemma 5). For non-neighboring action pairs, we use telescoping sums: since the graph $\mathcal{G}$ (induced by the alive actions) stays connected, we can take a path $i = i_0, i_1, \dots, i_r = j$ in the graph, and the estimate $\mu_{(i,j)}(n)$ will be the sum of the estimates along the path: $\sum_{l=1}^r \mu_{(i_{l-1},i_l)}$. The estimate of the difference of the expected losses after round $n$ will be the average $\hat\mu_{(i,j)} = (1/n) \sum_{s=1}^n \mu_{(i,j)}(s)$, where $\mu_{(i,j)}(s)$ denotes the estimate for the pair $(i, j)$ computed in round $s$.
After updating the estimates, the algorithm decides which actions to eliminate. For each pair of vertices $i, j$ of the graph, the sign of the expected difference of their losses is tested by the BStopStep subroutine, based on the estimate $\hat\mu_{(i,j)}$ and its relative error. This subroutine uses a stopping rule based on Bernstein's inequality.
The subroutine's pseudocode is shown as Algorithm 2 and is essentially based on the work of Mnih et al. (2008). The algorithm maintains two values, LB and UB, computed from the supplied sequence of sample means ($\hat\mu$) and the deviation bounds
$$ c(\sigma, R, n, \delta) = \sigma\sqrt{\frac{2 L(\delta,n)}{n}} + \frac{R\, L(\delta,n)}{3n}, \quad \text{where } L(\delta, n) = \log\!\left(\frac{3 p\, n^p}{(p-1)\,\delta}\right). \tag{1} $$
Here $p > 1$ is an arbitrarily chosen parameter of the algorithm, $\sigma$ is a (deterministic) upper bound on the (conditional) variance of the random variables whose common mean $\mu$ we wish to estimate, while $R$ is a (deterministic) upper bound on their range. This is a general stopping rule method, which stops when it has produced an $\varepsilon$-relative accurate estimate of the unknown mean. The algorithm is guaranteed to be correct outside of a failure event whose probability is bounded by $\delta$.
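For concreteness, here is our reading of the deviation bound (1) in code; since the exact form of $L(\delta, n)$ is partly garbled in our copy, the constant inside the logarithm should be treated as an assumption.

```python
import math

def L_fn(delta, n, p=1.1):
    """Union-bound term L(delta, n) of eq. (1); the argument of the log
    is our reconstruction: 3 p n^p / ((p - 1) delta)."""
    return math.log(3 * p * n**p / ((p - 1) * delta))

def c_fn(sigma, R, n, delta, p=1.1):
    """Deviation bound c(sigma, R, n, delta) of eq. (1): a Bernstein-style
    confidence width for a mean estimated from n samples."""
    Ln = L_fn(delta, n, p)
    return sigma * math.sqrt(2 * Ln / n) + R * Ln / (3 * n)
```

The width shrinks roughly like $\sqrt{\log n / n}$, so for fixed $\sigma$, $R$, $\delta$ it eventually becomes small enough for the stopping rule to fire.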
Algorithm Balaton calls this method with $\varepsilon = 1/2$. As a result, when BStopStep returns true, outside of the failure event the sign of the estimate $\hat\mu$ supplied to Balaton will match the sign of the mean to be estimated. The conditions under which the algorithm indeed produces $\varepsilon$-accurate estimates (with high probability) are given in Lemma 11 (see Appendix), which also states that, again with high probability, the time when the algorithm stops is bounded by
$$ C \cdot \max\left( \frac{\sigma^2}{\varepsilon^2 \mu^2},\ \frac{R}{\varepsilon\, |\mu|} \right)\left( \log\frac{1}{\delta} + \log\frac{R}{|\mu|} \right), $$
where $\mu \ne 0$ is the true mean. Note that the choice of $p$ in (1) influences only $C$.
If BStopStep returns true for an estimate $\hat\mu_{(i,j)}$, the function eliminate is called. If, say, $\hat\mu_{(i,j)} > 0$, this function takes the closed half space $\{q \in \Delta_M : (\ell_i - \ell_j)^\top q \le 0\}$ and eliminates all actions whose cell lies completely in the half space. The function also drops from the graph the vertices that correspond to eliminated cells. The elimination necessarily
6. Note that a round of the algorithm is not the same as the time step $t$. In a round, the algorithm chooses each of the alive actions once.
Algorithm 2 Algorithm BStopStep. Note that, somewhat unusually for pseudocode, the arguments LB, UB are passed by reference, i.e., the algorithm rewrites the values of these arguments (which are thus returned to the caller).
Input: μ̂, LB, UB, σ, R, n, ε, δ
  LB ← max(LB, |μ̂| − c(σ, R, n, δ))
  UB ← min(UB, |μ̂| + c(σ, R, n, δ))
  return (1 + ε) LB ≥ (1 − ε) UB
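A Python sketch of one BStopStep call (ours): instead of pass-by-reference we return the updated (LB, UB) pair, and we take the stopping test to be $(1+\varepsilon)\mathrm{LB} \ge (1-\varepsilon)\mathrm{UB}$ in the style of the Mnih et al. (2008) stopping rules; the direction of this comparison and the form of the deviation bound are our reconstruction.

```python
import math

def bstop_step(mu_hat, state, sigma, R, n, eps, delta, p=1.1):
    """One call of BStopStep (a sketch). `state` holds the running (LB, UB)
    pair that the pseudocode passes by reference; returns (stop, new_state)."""
    Ln = math.log(3 * p * n**p / ((p - 1) * delta))     # L(delta, n), eq. (1)
    c = sigma * math.sqrt(2 * Ln / n) + R * Ln / (3 * n)  # deviation bound
    lb, ub = state
    lb = max(lb, abs(mu_hat) - c)
    ub = min(ub, abs(mu_hat) + c)
    stop = (1 + eps) * lb >= (1 - eps) * ub
    return stop, (lb, ub)
```

With few samples the confidence width dominates and the call returns false; once the width has shrunk well below $|\hat\mu|$, the bounds pinch together and the call returns true.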
concerns all actions whose cell is $C_i$, and possibly other actions as well. The remaining cells are redefined by taking their intersection with the complementary closed half space $\{q \in \Delta_M : (\ell_i - \ell_j)^\top q \ge 0\}$.
By construction , after the elimination phase , the remaining graph is still connected , but some paths used in the round may have lost vertices or edges . For this reason , in the last phase of the round , new paths are constructed for vertex pairs with broken paths .
The main loop of the algorithm continues until either one vertex remains in the graph or the time horizon T is reached . In the former case , one of the actions corresponding to that vertex is chosen until the time horizon is reached .
5. Analysis of the algorithm

In this section we prove that the algorithm described in the previous section achieves $\tilde{O}(\sqrt{T})$ expected regret.
Let us assume that the outcomes are generated following the probability vector $p \in \Delta_M$. Let $j$ denote an optimal action, that is, for every $1 \le i \le N$, $\ell_j^\top p \le \ell_i^\top p$. For every pair of actions $i, j$, let $\alpha_{i,j} = (\ell_i - \ell_j)^\top p$ be the expected difference of their instantaneous losses. The expected regret of the algorithm can be rewritten as
$$ \mathbb{E}\left[\sum_{t=1}^T \ell_{I_t,J_t}\right] - \min_{1\le i\le N} \mathbb{E}\left[\sum_{t=1}^T \ell_{i,J_t}\right] = \sum_{i=1}^N \mathbb{E}[\tau_i]\, \alpha_{i,j}, \tag{2} $$
where $\tau_i$ is the number of times action $i$ is chosen by the algorithm.
Throughout the proof, the value that Balaton assigns to a variable $x$ in round $n$ will be denoted by $x(n)$. Further, for $1 \le k \le N$, we introduce the i.i.d. random sequence $(J_k(n))_{n \ge 1}$, taking values in $\{1, \dots, M\}$, with common multinomial distribution satisfying $\mathbb{P}[J_k(n) = j] = p_j$. Clearly, a statistically equivalent model to the one where $(J_t)$ is an i.i.d. sequence with multinomial distribution $p$ is when $(J_t)$ is defined through
$$ J_t = J_{I_t}\!\left( \sum_{s=1}^t \mathbb{I}(I_s = I_t) \right). \tag{3} $$
Note that this claim holds , independently of the algorithm generating the actions , I t . Therefore , in what follows , we assume that the outcome sequence is generated through ( 3 ) . As we will see , this construction significantly simplifies subsequent steps of the proof . In particular , the construction will be very convenient since if action k is selected by our algorithm in the n th elimination round then the outcome obtained in response is going to be
$O_k(n) = S_k u_k(n)$, where $u_k(n) = e_{J_k(n)}$. (This holds because in the elimination rounds all alive actions are tried exactly once by Balaton.)
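Construction (3) is straightforward to implement: pre-draw an i.i.d. outcome table, one row per action, and serve $J_t$ from row $I_t$ at an index counting how many times $I_t$ has been chosen so far. A sketch (the round-robin learner is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

M, N, T = 3, 4, 500
p = np.array([0.2, 0.3, 0.5])

# Pre-draw, for each action k, an i.i.d. outcome sequence (J_k(n))_n  -- eq. (3)
J_table = rng.choice(M, size=(N, T), p=p)

counts = np.zeros(N, dtype=int)
outcomes = []
for t in range(T):
    I_t = t % N                      # any algorithm may pick I_t; round-robin here
    counts[I_t] += 1
    # J_t = J_{I_t}( #{s <= t : I_s = I_t} ), the coupling of eq. (3)
    outcomes.append(J_table[I_t, counts[I_t] - 1])
outcomes = np.asarray(outcomes)
```

Because each row of the table is i.i.d. with distribution $p$ and each entry is consumed at most once, the resulting sequence `outcomes` is statistically equivalent to drawing $J_t$ i.i.d. from $p$, regardless of how $I_t$ is chosen.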
Let $(\mathcal{F}_n)_n$ be the filtration defined as $\mathcal{F}_n = \sigma(u_k(m);\ 1 \le k \le N,\ 1 \le m \le n)$. We also introduce the notations $\mathbb{E}_n[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_n]$ and $\operatorname{Var}_n(\cdot) = \operatorname{Var}(\cdot \mid \mathcal{F}_n)$ for the conditional expectation and conditional variance operators corresponding to $\mathcal{F}_n$. Note that $\mathcal{F}_n$ contains the information known to Balaton (and more) at the end of elimination round $n$. Our first (trivial) observation is that $\mu_{(i,j)}(n)$, the estimate of $\alpha_{i,j}$ obtained in round $n$, is $\mathcal{F}_n$-measurable. The next lemma establishes that, furthermore, $\mu_{(i,j)}(n)$ is an unbiased estimate of $\alpha_{i,j}$:
Lemma 5 For any $n \ge 1$ and $i, j$ such that $C_i, C_j \in \mathcal{C}$, $\mathbb{E}_{n-1}[\mu_{(i,j)}(n)] = \alpha_{i,j}$.
Proof Consider first the case when actions $i$ and $j$ are neighbors. In this case,
$$ \mu_{(i,j)}(n) = \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} = \sum_{k \in A_{i,j}} \left(S_k u_k(n)\right)^\top v_{(i,j),k} = \sum_{k \in A_{i,j}} u_k(n)^\top S_k^\top v_{(i,j),k}, $$
and thus
$$ \mathbb{E}_{n-1}\left[\mu_{(i,j)}(n)\right] = \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\left[u_k(n)\right]^\top S_k^\top v_{(i,j),k} = p^\top \sum_{k \in A_{i,j}} S_k^\top v_{(i,j),k} = p^\top S_{(i,j)}^\top v_{(i,j)} = p^\top (\ell_i - \ell_j) = \alpha_{i,j}. $$
For non-adjacent $i$ and $j$, we have a telescoping sum:
$$ \mathbb{E}_{n-1}\left[\mu_{(i,j)}(n)\right] = \sum_{k=1}^r \mathbb{E}_{n-1}\left[\mu_{(i_{k-1},i_k)}(n)\right] = p^\top\left( \ell_{i_0} - \ell_{i_1} + \ell_{i_1} - \ell_{i_2} + \cdots + \ell_{i_{r-1}} - \ell_{i_r} \right) = \alpha_{i,j}, $$
where $i = i_0, i_1, \dots, i_r = j$ is the path the algorithm uses in round $n$, known at the end of round $n-1$.
Lemma 6 The conditional variance of $\mu_{(i,j)}(n)$, $\operatorname{Var}_{n-1}(\mu_{(i,j)}(n))$, is upper bounded by $V = 2 \sum_{\{i,j \text{ neighbors}\}} \|v_{(i,j)}\|_2^2$.
Proof For neighboring cells $i, j$, we write
$$ \mu_{(i,j)}(n) = \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} $$
and thus
$$\begin{aligned} \operatorname{Var}_{n-1}\left(\mu_{(i,j)}(n)\right) &= \operatorname{Var}_{n-1}\left( \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} \right) \\ &= \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\left[ v_{(i,j),k}^\top \left(O_k(n) - \mathbb{E}_{n-1}[O_k(n)]\right) \left(O_k(n) - \mathbb{E}_{n-1}[O_k(n)]\right)^\top v_{(i,j),k} \right] \\ &\le \sum_{k \in A_{i,j}} \|v_{(i,j),k}\|_2^2\ \mathbb{E}_{n-1}\left[ \left\| O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \right\|_2^2 \right] \\ &\le \sum_{k \in A_{i,j}} \|v_{(i,j),k}\|_2^2 = \|v_{(i,j)}\|_2^2, \end{aligned} \tag{4} $$
where in (4) we used that $O_k(n)$ is a unit vector and $\mathbb{E}_{n-1}[O_k(n)]$ is a probability vector.
For i , j non-neighboring cells , let i = i 0 , i 1 , . . . , i r = j the path used for the estimate in round n. Then µ ( i , j ) ( n ) can be written as
µ ( i , j ) ( n ) =
r
s=1
µ ( i s−1 , i s ) ( n ) =
r
s=1 k∈A is−1 , is
O k ( n ) v ( i s−1 , i s ) , k .
It is not hard to see that an action can only be in at most two neighborhood action sets in the path and so the double sum can be rearranged as
k∈ A is−1 , is
O k ( n ) ( v ( i sk−1 , i sk ) , k + v ( i sk i sk+1 ) , k ) ,
and thus Var n−1 µ ( i , j ) ( n ) 2 r s=1 v ( i s−1 , i s ) 2 2 2 { i , j neighbors } v ( i , j ) 2 2 .
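A quick empirical check of the per-pair variance bound (4) on a toy instance (the signal matrices and vectors below are made up):

```python
import numpy as np

# Empirical check that Var(mu) <= ||v_{(i,j)}||_2^2 for a neighbor-pair estimate.
rng = np.random.default_rng(2)
M = 3
p_star = np.array([0.3, 0.3, 0.4])
S = [np.array([[1., 0., 0.], [0., 1., 1.]]),
     np.array([[1., 1., 0.], [0., 0., 1.]])]
v = [np.array([0.5, -0.2]), np.array([-0.1, 0.7])]

def estimate(rng):
    # one term O_k^T v_{(i,j),k} per action, each with its own outcome draw
    return sum((Sk @ np.eye(M)[rng.choice(M, p=p_star)]) @ vk
               for Sk, vk in zip(S, v))

mus = np.array([estimate(rng) for _ in range(20000)])
bound = sum(np.dot(vk, vk) for vk in v)    # ||v_{(i,j)}||_2^2 from (4)
print(mus.var(), bound)
```

The empirical variance stays below the bound, which is what Lemma 6 aggregates over all neighboring pairs.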
Lemma 7 The range of the estimates $\mu_{(i,j)}(n)$ is upper bounded by $R = \sum_{\{i,j\ \mathrm{neighbors}\}} \|v_{(i,j)}\|_1$.

Proof The bound trivially follows from the definition of the estimates.
Let $\delta$ be the confidence parameter used in BStopStep. Since, according to Lemmas 5, 6 and 7, $(\mu_{(i,j)}(n))_n$ is a "shifted" martingale difference sequence with conditional mean $\alpha_{i,j}$ and bounded conditional variance and range, we can apply Lemma 11, stated in the Appendix. By the union bound, the probability that any of the confidence bounds fails during the game is at most $N^2\delta$. Thus, with probability at least $1 - N^2\delta$, if BStopStep returns true for a pair $(i,j)$, then $\mathrm{sgn}(\alpha_{i,j}) = \mathrm{sgn}(\mu_{(i,j)})$ and the algorithm eliminates all the actions whose cell is contained in the closed half space $H = \{p : \mathrm{sgn}(\alpha_{i,j})\,(\ell_i - \ell_j)^\top p \le 0\}$. By definition, $\alpha_{i,j} = (\ell_i - \ell_j)^\top p^*$, and hence $\mathrm{sgn}(\alpha_{i,j})\,(\ell_i - \ell_j)^\top p^* = |\alpha_{i,j}| > 0$. Thus $p^* \notin H$, and none of the eliminated actions can be optimal under $p^*$.
From Lemma 11 we also see that, with probability at least $1 - N^2\delta$, the number of times $\tau_i$ the algorithm experiments with a suboptimal action $i$ during the elimination phase is bounded by
$$\tau_i \le \frac{c(G)}{\alpha_{i,j^*}^2} \log \frac{R}{\delta\,\alpha_{i,j^*}} = T_i, \qquad (5)$$
where $c(G) = C(V + R)$ is a problem dependent constant and $j^*$ denotes an optimal action under $p^*$.
The following lemma, the proof of which can be found in the Appendix, shows that degenerate actions will be eliminated in time.
Lemma 8 Let action $i$ be a degenerate action. Let $A_i = \{j : C_j \in \mathcal{C},\ C_i \subseteq C_j\}$. The following two statements hold:

1. If any of the actions in $A_i$ is eliminated, then action $i$ is eliminated as well.

2. There exists an action $k_i \in A_i$ such that $\alpha_{k_i,j^*} \ge \alpha_{i,j^*}$.
An immediate implication of the first claim of the lemma is that if action $k_i$ gets eliminated then action $i$ gets eliminated as well; that is, the number of times action $i$ is chosen cannot be greater than that of action $k_i$. Hence, $\tau_i \le \tau_{k_i}$. Let $E$ be the complement of the failure event underlying the stopping rules. As discussed earlier, $\mathbb{P}(E^c) \le N^2\delta$. Note that on $E$, i.e., when the stopping rules do not fail, no suboptimal action can remain for the final phase; hence, on $E$, the total number of times a suboptimal action $i$ is chosen by the algorithm is bounded by its count $\tau_i$ in the elimination phase. To upper bound the expected regret we continue from (2) as
$$\begin{aligned}
\sum_{i=1}^{N} \mathbb{E}[\tau_i]\,\alpha_{i,j^*}
&\le \sum_{i=1}^{N} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + \mathbb{P}(E^c)\,T && \Big(\text{because } \textstyle\sum_{i=1}^{N}\tau_i = T \text{ and } 0 \le \alpha_{i,j^*} \le 1\Big) \\
&\le \sum_{i=1}^{N} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + N^2\delta T \\
&= \sum_{i:\,C_i \in \mathcal{C}} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + \sum_{i:\,C_i \notin \mathcal{C}} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + N^2\delta T \\
&\le \sum_{i:\,C_i \in \mathcal{C}} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + \sum_{i:\,C_i \notin \mathcal{C}} \mathbb{E}\left[\mathbb{I}(E)\tau_{k_i}\right]\alpha_{k_i,j^*} + N^2\delta T && (\text{by Lemma 8}) \\
&\le \sum_{i:\,C_i \in \mathcal{C}} T_i\,\alpha_{i,j^*} + \sum_{i:\,C_i \notin \mathcal{C}} T_{k_i}\,\alpha_{k_i,j^*} + N^2\delta T \\
&\le \sum_{\substack{i:\,C_i \in \mathcal{C}\\ \alpha_{i,j^*} \ge \alpha_0}} T_i\,\alpha_{i,j^*} + \sum_{\substack{i:\,C_i \notin \mathcal{C}\\ \alpha_{k_i,j^*} \ge \alpha_0}} T_{k_i}\,\alpha_{k_i,j^*} + \left(\alpha_0 + N^2\delta\right)T \\
&\le c(G)\Bigg(\sum_{\substack{i:\,C_i \in \mathcal{C}\\ \alpha_{i,j^*} \ge \alpha_0}} \frac{\log\frac{R}{\delta\alpha_{i,j^*}}}{\alpha_{i,j^*}} + \sum_{\substack{i:\,C_i \notin \mathcal{C}\\ \alpha_{k_i,j^*} \ge \alpha_0}} \frac{\log\frac{R}{\delta\alpha_{k_i,j^*}}}{\alpha_{k_i,j^*}}\Bigg) + \left(\alpha_0 + N^2\delta\right)T \\
&\le c(G)\,N\,\frac{\log\frac{R}{\delta\alpha_0}}{\alpha_0} + \left(\alpha_0 + N^2\delta\right)T.
\end{aligned}$$
The above calculation holds for any value of $\alpha_0 > 0$. Setting
$$\alpha_0 = \sqrt{\frac{c(G)N}{T}} \qquad \text{and} \qquad \delta = \sqrt{\frac{c(G)}{TN^3}},$$
we get
$$\mathbb{E}[R_T] \le \sqrt{c(G)NT}\left(2 + \log\frac{RTN}{c(G)}\right).$$
In conclusion, if we run Balaton with parameter $\delta = \sqrt{c(G)/(TN^3)}$, the algorithm suffers regret of $O(\sqrt{T}\log T)$, finishing the proof.
6. A lower bound for hard games

In this section we prove that for any game that satisfies the condition of Case (d) of Theorem 4, the minimax regret is $\Omega(T^{2/3})$.

Theorem 9 Let $G = (L, H)$ be an $N$ by $M$ partial-monitoring game. Assume that there exist two neighboring actions $i$ and $j$ such that $\ell_i - \ell_j \notin \operatorname{Im} S_{(i,j)}^\top$. Then there exists a problem dependent constant $c(G)$ such that for any algorithm $A$ and time horizon $T$ there exists an opponent strategy $p$ such that the expected regret satisfies
$$\mathbb{E}\left[R_T(A, p)\right] \ge c(G)\,T^{2/3}.$$
Proof Without loss of generality we can assume that the two neighboring cells in the condition are $C_1$ and $C_2$. Let $C_3 = C_1 \cap C_2$. For $i = 1, 2, 3$, let $A_i$ be the set of actions associated with cell $C_i$. Note that $A_3$ may be the empty set. Let $A_4 = A \setminus (A_1 \cup A_2 \cup A_3)$. By our convention for naming loss vectors, $\ell_1$ and $\ell_2$ are the loss vectors for $C_1$ and $C_2$, respectively. Let $L_3$ collect the loss vectors of actions which lie on the open segment connecting $\ell_1$ and $\ell_2$. It is easy to see that $L_3$ is the set of loss vectors that correspond to the cell $C_3$. We define $L_4$ as the set of all the other loss vectors. For $i = 1, 2, 3, 4$, let $k_i = |A_i|$.
Let $S = S_{(1,2)}$ be the signal matrix of the neighborhood action set of $C_1$ and $C_2$. It follows from the assumption of the theorem that $\ell_2 - \ell_1 \notin \operatorname{Im}(S^\top)$. Thus, $\{\rho(\ell_2 - \ell_1) : \rho \in \mathbb{R}\} \not\subseteq \operatorname{Im}(S^\top)$, or equivalently, $\ell_2 - \ell_1$ is not orthogonal to $\operatorname{Ker}(S)$, where we used that $(\operatorname{Im} M^\top)^\perp = \operatorname{Ker}(M)$. Thus, there exists a vector $v$ such that $v \in \operatorname{Ker}(S)$ and $(\ell_2 - \ell_1)^\top v \ne 0$. By scaling we can assume that $(\ell_2 - \ell_1)^\top v = 1$. Note that since $v \in \operatorname{Ker}(S)$ and the row space of $S$ contains the vector $(1, 1, \ldots, 1)$, the coordinates of $v$ sum up to zero.
Let $p_0$ be an arbitrary probability vector in the relative interior of $C_3$. It is easy to see that for any $\varepsilon > 0$ small enough, $p_1 = p_0 + \varepsilon v \in C_1 \setminus C_2$ and $p_2 = p_0 - \varepsilon v \in C_2 \setminus C_1$.
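The construction of $v$ is a small linear-algebra computation. The sketch below (with a made-up signal matrix and loss vectors) finds a basis of $\operatorname{Ker}(S)$ via the SVD and rescales the projection of $\ell_2 - \ell_1$ onto it so that $(\ell_2 - \ell_1)^\top v = 1$:

```python
import numpy as np

# Given S with l2 - l1 not in Im(S^T), find v in Ker(S) with (l2 - l1)^T v = 1.
S = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])       # row space contains (1, 1, 1, 1)
l1 = np.array([0., 1., 0., 1.])
l2 = np.array([1., 0., 0., 1.])
d = l2 - l1

# Basis of Ker(S) from the SVD: right singular vectors beyond the rank.
_, sv, Vt = np.linalg.svd(S)
null_basis = Vt[np.sum(sv > 1e-10):]

# Project d onto Ker(S); a nonzero projection certifies that d is not
# orthogonal to Ker(S). Rescale so that d^T v = 1.
w = null_basis.T @ (null_basis @ d)
v = w / (d @ w)
print(S @ v, d @ v, v.sum())           # ~zero vector, 1.0, ~0
```

The coordinates of $v$ sum to zero automatically here, because the all-ones vector lies in the row space of $S$.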
Let us fix a deterministic algorithm $A$ and a time horizon $T$. For $i = 1, 2$, let $R^{(i)}_T$ denote the expected regret of the algorithm under opponent strategy $p_i$. For $i = 1, 2$ and $j = 1, \ldots, 4$, let $N^i_j$ denote the expected number of times the algorithm chooses an action from $A_j$, assuming the opponent plays strategy $p_i$.

From the definition of $L_3$ we know that for any $\ell \in L_3$, $\ell - \ell_1 = \eta(\ell_2 - \ell_1)$ and $\ell - \ell_2 = (1-\eta)(\ell_1 - \ell_2)$ for some $0 < \eta < 1$. Let $\lambda_1 = \min_{\ell \in L_3} \eta$ and $\lambda_2 = \min_{\ell \in L_3} (1-\eta)$, and let $\lambda = \min(\lambda_1, \lambda_2)$ if $L_3 \ne \emptyset$ and $\lambda = 1/2$ otherwise. Finally, let $\beta_i = \min_{\ell \in L_4} (\ell - \ell_i)^\top p_i$ and $\beta = \min(\beta_1, \beta_2)$. Note that $\lambda, \beta > 0$.
As the first step of the proof, we lower bound the expected regrets $R^{(1)}_T$ and $R^{(2)}_T$ in terms of the values $N^i_j$, $\varepsilon$, $\lambda$ and $\beta$:
$$R^{(1)}_T \ge N^1_2 \underbrace{(\ell_2 - \ell_1)^\top p_1}_{\varepsilon} + N^1_3\,\lambda\,(\ell_2 - \ell_1)^\top p_1 + N^1_4\,\beta \ge \lambda\left(N^1_2 + N^1_3\right)\varepsilon + N^1_4\,\beta,$$
$$R^{(2)}_T \ge N^2_1 \underbrace{(\ell_1 - \ell_2)^\top p_2}_{\varepsilon} + N^2_3\,\lambda\,(\ell_1 - \ell_2)^\top p_2 + N^2_4\,\beta \ge \lambda\left(N^2_1 + N^2_3\right)\varepsilon + N^2_4\,\beta. \qquad (6)$$
For the next step, we need the following lemma.
Lemma 10 There exists a (problem dependent) constant $c$ such that the following inequalities hold:
$$N^2_1 - N^1_1 \le cT\varepsilon\sqrt{N^1_4}, \qquad N^2_3 - N^1_3 \le cT\varepsilon\sqrt{N^1_4},$$
$$N^1_2 - N^2_2 \le cT\varepsilon\sqrt{N^2_4}, \qquad N^1_3 - N^2_3 \le cT\varepsilon\sqrt{N^2_4}.$$
Proof (Lemma 10) For any $1 \le t \le T$, let $f_t = (f_1, \ldots, f_t) \in \Sigma^t$ be a feedback sequence up to time step $t$. For $i = 1, 2$, let $p_i$ also denote the probability mass function over feedback sequences of length $T-1$ under opponent strategy $p_i$ and algorithm $A$. We start by upper bounding the difference between the expected counts under the two opponent strategies. For $i \ne j \in \{1, 2\}$ and $k \in \{1, 2, 3\}$,
$$\begin{aligned}
N^i_k - N^j_k &= \sum_{f_{T-1}} \left(p_i(f_{T-1}) - p_j(f_{T-1})\right) \sum_{t=0}^{T-1} \mathbb{I}\left(A(f_t) \in A_k\right) \\
&\le \sum_{f_{T-1}:\ p_i(f_{T-1}) - p_j(f_{T-1}) \ge 0} \left(p_i(f_{T-1}) - p_j(f_{T-1})\right) \sum_{t=0}^{T-1} \mathbb{I}\left(A(f_t) \in A_k\right) \\
&\le T \sum_{f_{T-1}:\ p_i(f_{T-1}) - p_j(f_{T-1}) \ge 0} \left(p_i(f_{T-1}) - p_j(f_{T-1})\right) = \frac{T}{2}\,\|p_1 - p_2\|_1 \\
&\le T\sqrt{\mathrm{KL}(p_1\|p_2)/2}, \qquad (7)
\end{aligned}$$
where $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence and $\|\cdot\|_1$ is the $L_1$-norm. The last inequality follows from Pinsker's inequality (Cover and Thomas, 2006). To upper bound $\mathrm{KL}(p_1\|p_2)$ we use the chain rule for KL divergence. We further overload $p_i$ so that $p_i(f_{t-1})$ denotes the probability of feedback sequence $f_{t-1}$ under opponent strategy $p_i$ and algorithm $A$, and $p_i(f_t \mid f_{t-1})$ denotes the conditional probability of feedback $f_t \in \Sigma$ given that the past feedback sequence was $f_{t-1}$, again under $p_i$ and $A$. With this notation we have
$$\begin{aligned}
\mathrm{KL}(p_1\|p_2) &= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1(f_{t-1}) \sum_{f_t} p_1(f_t \mid f_{t-1}) \log\frac{p_1(f_t \mid f_{t-1})}{p_2(f_t \mid f_{t-1})} \\
&= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1(f_{t-1}) \sum_{i=1}^{4} \mathbb{I}\left(A(f_{t-1}) \in A_i\right) \sum_{f_t} p_1(f_t \mid f_{t-1}) \log\frac{p_1(f_t \mid f_{t-1})}{p_2(f_t \mid f_{t-1})}. \qquad (8)
\end{aligned}$$
Let $a_{f_t}$ denote the row of $S$ that corresponds to the feedback symbol $f_t$.⁷ Assume $k = A(f_{t-1})$. If the feedback set of action $k$ does not contain $f_t$, then trivially $p_i(f_t \mid f_{t-1}) = 0$ for $i = 1, 2$. Otherwise $p_i(f_t \mid f_{t-1}) = a_{f_t}^\top p_i$. Since $p_1 - p_2 = 2\varepsilon v$ and $v \in \operatorname{Ker}(S)$, we have $a_{f_t}^\top v = 0$ and thus, if the choice of the algorithm is in $A_1$, $A_2$ or $A_3$, then $p_1(f_t \mid f_{t-1}) = p_2(f_t \mid f_{t-1})$. It follows that the inequality chain can be continued from (8) by writing
$$\begin{aligned}
\mathrm{KL}(p_1\|p_2) &= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1(f_{t-1})\,\mathbb{I}\left(A(f_{t-1}) \in A_4\right) \sum_{f_t} p_1(f_t \mid f_{t-1}) \log\frac{p_1(f_t \mid f_{t-1})}{p_2(f_t \mid f_{t-1})} \\
&\le c_1\varepsilon^2 \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1(f_{t-1})\,\mathbb{I}\left(A(f_{t-1}) \in A_4\right) \qquad (9) \\
&= c_1\varepsilon^2\,N^1_4.
\end{aligned}$$
7. Recall that we assumed that different actions have different feedback symbols, and thus a row of $S$ corresponding to a symbol is unique.
In (9) we used Lemma 12 (see Appendix) to upper bound the KL divergence of $p_1(\cdot \mid f_{t-1})$ and $p_2(\cdot \mid f_{t-1})$. Swapping $p_1$ and $p_2$ in (7) we get the analogous bound with $N^2_4$. Combining these with the bound in (7), we obtain all the desired inequalities.
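Pinsker's inequality, as used in (7), can be sanity-checked numerically (an illustration only, on random distributions over a finite set):

```python
import numpy as np

# Pinsker: (1/2) ||p - q||_1 <= sqrt(KL(p||q) / 2).
rng = np.random.default_rng(3)

def kl(p, q):
    # KL divergence in nats; assumes q > 0 wherever p > 0
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    tv = 0.5 * np.abs(p - q).sum()
    assert tv <= np.sqrt(kl(p, q) / 2) + 1e-12
print("Pinsker's inequality holds on all sampled pairs")
```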
Now we can continue lower bounding the expected regret. Let $r = \operatorname{argmin}_{i \in \{1,2\}} N^i_4$. It is easy to see that for $i = 1, 2$ and $j = 1, 2, 3$,
$$N^i_j \ge N^r_j - c_2 T\varepsilon\sqrt{N^r_4}.$$
If $i \ne r$ then this inequality is one of the inequalities from Lemma 10. If $i = r$ then it is trivial, as we merely subtract a nonnegative value. From (6) we have
$$R^{(i)}_T \ge \lambda\left(N^i_{3-i} + N^i_3\right)\varepsilon + N^i_4\beta \ge \lambda\left(N^r_{3-i} - c_2T\varepsilon\sqrt{N^r_4} + N^r_3 - c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta = \lambda\left(N^r_{3-i} + N^r_3 - 2c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta.$$
Now assume that, at the beginning of the game, the opponent randomly chooses between strategies $p_1$ and $p_2$ with equal probability. Then the expected regret of the algorithm is lower bounded by
$$\begin{aligned}
R_T = \frac{1}{2}\left(R^{(1)}_T + R^{(2)}_T\right) &\ge \frac{1}{2}\lambda\left(N^r_1 + N^r_2 + 2N^r_3 - 4c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta \\
&\ge \frac{1}{2}\lambda\left(N^r_1 + N^r_2 + N^r_3 - 4c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta \\
&= \frac{1}{2}\lambda\left(T - N^r_4 - 4c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta,
\end{aligned}$$
where in the last step we used that $N^r_1 + N^r_2 + N^r_3 + N^r_4 = T$.
Choosing $\varepsilon = c_3 T^{-1/3}$ we get
$$\begin{aligned}
R_T &\ge \frac{1}{2}\lambda c_3 T^{2/3} - \frac{1}{2}\lambda c_3 N^r_4 T^{-1/3} - 2\lambda c_2 c_3^2 T^{1/3}\sqrt{N^r_4} + N^r_4\beta \\
&\ge T^{2/3}\left(\left(\beta - \frac{1}{2}\lambda c_3\right)x^2 - 2\lambda c_2 c_3^2\,x + \frac{1}{2}\lambda c_3\right),
\end{aligned}$$
where $x = \sqrt{N^r_4/T^{2/3}}$. Now we see that $c_3 > 0$ can be chosen small enough, independently of $T$, so that, for any choice of $x$, the quadratic expression in the parentheses is bounded away from zero, and simultaneously, $\varepsilon$ is small enough so that the threshold condition in Lemma 12 is satisfied, completing the proof of Theorem 9.
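The final step can be sanity-checked numerically: for illustrative values of the problem dependent constants $\lambda$, $\beta$, $c_2$ (made up here), a small enough $c_3$ keeps the quadratic in $x$ positive for every $x$:

```python
# q(x) = (beta - lam*c3/2) x^2 - 2*lam*c2*c3^2 x + lam*c3/2 stays positive
# when the leading coefficient is positive and the discriminant is negative.
lam, beta, c2 = 0.5, 0.3, 1.0   # illustrative problem dependent constants
c3 = 0.05                       # chosen small, independently of T

a = beta - lam * c3 / 2
b = -2 * lam * c2 * c3 ** 2
c = lam * c3 / 2
disc = b ** 2 - 4 * a * c
print(a > 0, disc < 0)          # both True => q(x) > 0 for all x
```

Since the constant term $\tfrac{1}{2}\lambda c_3$ shrinks only linearly in $c_3$ while the cross term shrinks quadratically, the discriminant is negative for small $c_3$, which is exactly the "bounded away from zero" claim.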
7 . Discussion
In this paper we classified all finite partial-monitoring games under stochastic environments, based on their minimax regret. We conjecture that our results extend to non-stochastic environments. This is the major open question that remains to be answered.
One question which we did not discuss so far is the computational efficiency of our algorithm. The issue is twofold. The first computational question is how to efficiently decide which of the four classes a given game $(L, H)$ belongs to. The second question is the computational efficiency of Balaton for a fixed easy game. Fortunately, in both cases an efficient implementation is possible, i.e., in polynomial time, by using a linear program solver (e.g., the ellipsoid method (Papadimitriou and Steiglitz, 1998)).
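As an illustration of the first question, deciding whether an action is strictly optimal for some outcome distribution reduces to a small LP. The sketch below uses `scipy.optimize.linprog` on a made-up loss matrix; it is a simplified feasibility check, not the paper's full classification procedure:

```python
import numpy as np
from scipy.optimize import linprog

# Action i is strictly optimal for some p in the simplex iff the optimum of
#   max t  s.t.  (l_j - l_i)^T p >= t for all j != i,  p in the simplex
# is positive. Toy loss matrix: action 3 is never optimal.
L = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.6, 0.6]])
N, M = L.shape

def best_slack(i):
    # variables x = (p_1, ..., p_M, t); linprog minimizes, so minimize -t
    c = np.r_[np.zeros(M), -1.0]
    A_ub = np.c_[L[i] - np.delete(L, i, axis=0), np.ones(N - 1)]  # (l_i-l_j)^T p + t <= 0
    b_ub = np.zeros(N - 1)
    A_eq = np.r_[np.ones(M), 0.0].reshape(1, -1)                  # sum_m p_m = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * M + [(None, None)])
    return -res.fun

print([round(best_slack(i), 3) for i in range(N)])
```

A positive slack certifies a cell with nonempty interior; a negative one certifies that the action is dominated everywhere on the simplex.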
Another interesting open question is to investigate the dependence of the regret on quantities other than $T$, such as the number of actions, the number of outcomes, and more generally the structure of the loss and feedback matrices.
Finally, let us note that our results can be extended to a more general framework, similar to that of Pallavi et al. (2011), in which a game with $N$ actions and $M$-dimensional outcome space is defined as a tuple $G = (L, S_1, \ldots, S_N)$. The loss matrix is $L \in \mathbb{R}^{N \times M}$ as before, but the outcome and the feedback are defined differently. The outcome $y$ is an arbitrary vector from a bounded subset of $\mathbb{R}^M$, and the feedback received by the learner upon choosing action $i$ is $O_i = S_i y$.
References
Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 263–273, 2008.

Alekh Agarwal, Peter Bartlett, and Max Dama. Optimal allocation strategies for the dark pool problem. In 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Chia Laguna Resort, Sardinia, Italy, 2010.

András Antos, Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games, 2011. http://arxiv.org/abs/1102.2041.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (ALT 2010), pages 224–238. Springer, 2010.

Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, June 2005.

Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, New York, second edition, 2006.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2005), page 394. Society for Industrial and Applied Mathematics, 2005.

Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2003), pages 594–605. IEEE, 2003.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

V. Mnih. Efficient stopping rules. Master's thesis, Department of Computing Science, University of Alberta, 2008.

V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 672–679. ACM, 2008.

A. Pallavi, R. Zheng, and Cs. Szepesvári. Sequential learning for optimal monitoring of multi-channel wireless networks. In INFOCOM, 2011.

Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Courier Dover Publications, New York, 1998.

Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In Proceedings of the 14th Annual Conference on Computational Learning Theory (COLT 2001), pages 208–223. Springer-Verlag, 2001.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.
Appendix
Proof (Lemma 8)

1. In an elimination step, we eliminate every action whose cell is contained in a closed half space. Let us assume that $j \in A_i$ is being eliminated. According to the definition of $A_i$, $C_i \subseteq C_j$, and thus $C_i$ is also contained in the half space, so action $i$ is eliminated as well.
2. First let us assume that $p^*$ is not in the affine subspace spanned by $C_i$. Let $p'$ be an arbitrary point in the relative interior of $C_i$. We define the point $p'' = p' + \varepsilon(p' - p^*)$. For a small enough $\varepsilon > 0$, $p'' \in C_{k_i}$ for some $k_i \in A_i$, and at the same time, $p'' \notin C_i$. Thus we have
$$\begin{aligned}
\ell_{k_i}^\top\left(p' + \varepsilon(p' - p^*)\right) &\le \ell_i^\top\left(p' + \varepsilon(p' - p^*)\right) \\
(1+\varepsilon)\,\ell_{k_i}^\top p' - \varepsilon\,\ell_{k_i}^\top p^* &\le (1+\varepsilon)\,\ell_i^\top p' - \varepsilon\,\ell_i^\top p^* \\
-\varepsilon\,\ell_{k_i}^\top p^* &\le -\varepsilon\,\ell_i^\top p^* \\
\ell_{k_i}^\top p^* &\ge \ell_i^\top p^* \\
\alpha_{k_i,j^*} &\ge \alpha_{i,j^*},
\end{aligned}$$
where we used that $\ell_{k_i}^\top p' = \ell_i^\top p'$. For the case when $p^*$ lies in the affine subspace spanned by $C_i$, we take a hyperplane that contains this affine subspace. Then we take an infinite sequence $(p_n)_n$ such that every element of the sequence lies on the same side of the hyperplane, $p_n \ne p^*$, and the sequence converges to $p^*$. The statement is true for every element $p_n$ and, since the value $\alpha_{r,s}$ is continuous in $p^*$, the limit has the desired property as well.
The following lemma concerns the problem of producing an estimate of the unknown common mean of a stochastic process with a given relative error bound, with high probability, in a sample-efficient manner. The procedure is a simple variation of the one proposed by Mnih et al. (2008). The main difference is that here we deal with martingale difference sequences shifted by an unknown constant, which becomes the common mean, whereas Mnih et al. (2008) considered an i.i.d. sequence. On the other hand, we consider the case when we have a known upper bound on the predictable variance of the process, whereas one of the main contributions of Mnih et al. (2008) was the lifting of this assumption. The proof of the lemma is omitted, as it follows the same lines as the proofs of Mnih et al. (2008) (the details of which can be found in the thesis of Mnih (2008)), the only difference being that here we need to use Bernstein's inequality for martingales in place of the empirical Bernstein inequality used by Mnih et al. (2008).
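For intuition, here is a minimal sketch of such a stopping rule in the spirit of Mnih et al. (2008). The Bernstein-style radius `c` and all constants are illustrative simplifications, not the exact procedure of Lemma 11:

```python
import numpy as np

def stop_and_estimate(sample, sigma2, R, eps=0.1, delta=0.01, max_n=200000):
    # Keep sampling until the confidence interval around |running mean| is
    # small relative to the mean itself, then return a midpoint-style estimate.
    total, n = 0.0, 0
    lb, ub = 0.0, float("inf")
    mean = 0.0
    while n < max_n:
        total += sample()
        n += 1
        mean = total / n
        # Bernstein-type deviation radius from known variance/range bounds
        c = (2 * sigma2 * np.log(3 / delta) / n) ** 0.5 + 3 * R * np.log(3 / delta) / n
        lb = max(lb, abs(mean) - c)
        ub = min(ub, abs(mean) + c)
        if (1 + eps) * lb >= (1 - eps) * ub:   # stop: relative width small enough
            break
    return np.sign(mean) * ((1 + eps) * lb + (1 - eps) * ub) / 2, n

rng = np.random.default_rng(4)
mu = 0.3
est, n = stop_and_estimate(lambda: mu + rng.uniform(-0.5, 0.5), sigma2=1.0 / 12, R=1.0)
print(est, n)
```

On this toy i.i.d. stream the rule stops after a few thousand samples with an estimate within roughly $\varepsilon|\mu|$ of the true mean; the lemma below states the analogous guarantee for shifted martingale difference sequences.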
Lemma 11 Let $(\mathcal{F}_t)$ be a filtration on some probability space, and let $(X_t)$ be an $\mathcal{F}_t$-adapted sequence of random variables. Assume that $(X_t)$ is such that, almost surely, the range of each random variable $X_t$ is bounded by $R > 0$, $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = \mu$, and $\mathrm{Var}[X_t \mid \mathcal{F}_{t-1}] \le \sigma^2$ a.s., where $R$, $\mu \ne 0$ and $\sigma^2$ are non-random constants. Let $p > 1$, $\varepsilon > 0$, $0 < \delta < 1$, let $\overline{X}_t$ denote the running average $\frac{1}{t}\sum_{s=1}^{t} X_s$, and let
$$L_n = (1+\varepsilon)\max_{1 \le t \le n}\left(|\overline{X}_t| - c_t\right), \qquad U_n = (1-\varepsilon)\min_{1 \le t \le n}\left(|\overline{X}_t| + c_t\right),$$
where $c_t = c(\sigma, R, t, \delta)$, and $c(\cdot)$ is defined in (1). Define the estimate $\hat\mu_n$ of $\mu$ as follows:
$$\hat\mu_n = \mathrm{sgn}(\overline{X}_n)\,\frac{L_n + U_n}{2}.$$
Denote the stopping time $\tau = \min\{n : L_n \ge U_n\}$. Then, with probability at least $1 - \delta$,