JMLR: Workshop and Conference Proceedings vol (2010) 1–21, 24th Annual Conference on Learning Theory
Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments
Gábor Bartók bartok@cs.ualberta.ca
Dávid Pál dpal@cs.ualberta.ca
Csaba Szepesvári szepesva@cs.ualberta.ca
Department of Computing Science, University of Alberta, Edmonton, T6G 2E8, AB, Canada
Editors : Sham Kakade , Ulrike von Luxburg
Abstract
In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight. Assuming that the outcomes are generated in an i.i.d. fashion from an arbitrary and unknown probability distribution, we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes. It turns out that the minimax regret of any such game is either zero, $\tilde\Theta(\sqrt{T})$, $\Theta(T^{2/3})$, or $\Theta(T)$. We provide a computationally efficient learning algorithm that achieves the minimax regret within a logarithmic factor for any game.
Keywords: Online learning, Imperfect feedback, Regret analysis
1 . Introduction
Partial monitoring provides a mathematical framework for sequential decision making problems with imperfect feedback . Various problems of interest can be modeled as partial monitoring instances , such as learning with expert advice ( Littlestone and Warmuth , 1994 ) , the multi-armed bandit problem ( Auer et al . , 2002 ) , dynamic pricing ( Kleinberg and Leighton , 2003 ) , the dark pool problem ( Agarwal et al . , 2010 ) , label efficient prediction ( Cesa-Bianchi et al . , 2005 ) , and linear and convex optimization with full or bandit feedback ( Zinkevich , 2003 ; Abernethy et al . , 2008 ; Flaxman et al . , 2005 ) .
In this paper we restrict ourselves to finite games, i.e., games where both the set of actions available to the learner and the set of possible outcomes generated by the environment are finite. A finite partial monitoring game $G$ is described by a pair of $N \times M$ matrices: the loss matrix $L$ and the feedback matrix $H$. The entries $\ell_{i,j}$ of $L$ are real numbers lying in, say, the interval $[0,1]$. The entries $h_{i,j}$ of $H$ belong to an alphabet $\Sigma$ on which we do not impose any structure; we only assume that the learner is able to distinguish distinct elements of the alphabet.
The game proceeds in $T$ rounds according to the following protocol. First, $G = (L, H)$ is announced to both players. In each round $t = 1, 2, \dots, T$, the learner chooses an action $I_t \in$
This work was supported in part by AICML , AITF ( formerly iCore and AIF ) , NSERC and the PASCAL2 Network of Excellence under EC grant no . 216886 .
© 2010 G. Bartók, D. Pál & C. Szepesvári.
$\{1, 2, \dots, N\}$ and simultaneously, the environment chooses an outcome $J_t \in \{1, 2, \dots, M\}$. Then, the learner receives as feedback the entry $h_{I_t,J_t}$. The learner incurs the instantaneous loss $\ell_{I_t,J_t}$, which is not revealed to him. The feedback can be thought of as masked information about the outcome $J_t$. In some cases $h_{I_t,J_t}$ might uniquely determine the outcome, in other cases the feedback might give only partial or no information about the outcome. In this paper, we shall assume that $J_t$ is chosen randomly from a fixed multinomial distribution.
The learner is scored according to the loss matrix $L$. In round $t$ the learner incurs an instantaneous loss of $\ell_{I_t,J_t}$. The goal of the learner is to keep his total loss $\sum_{t=1}^T \ell_{I_t,J_t}$ low. Equivalently, the learner's performance can also be measured in terms of his regret, i.e., the total loss of the learner is compared with the loss of the best fixed action in hindsight. The regret is defined as the difference of these two losses.
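A minimal simulation may help fix ideas. The game below is a hypothetical 2×2 example of our own (not from the paper): action 0 yields an uninformative symbol, action 1 reveals the outcome exactly; the simulator tracks the loss and regret that remain hidden from the learner.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 2-action / 2-outcome game: L holds losses, H feedback symbols.
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])
H = np.array([["a", "a"],    # action 0 reveals nothing about the outcome
              ["b", "c"]])   # action 1 reveals the outcome exactly

T = 1000
p = np.array([0.3, 0.7])           # opponent strategy (unknown to the learner)
J = rng.choice(2, size=T, p=p)     # i.i.d. outcomes J_t
I = np.zeros(T, dtype=int)         # a naive learner that always plays action 0

feedback = H[I, J]                 # the only information the learner receives
total_loss = L[I, J].sum()         # not revealed to the learner
best_fixed = min(L[i, J].sum() for i in range(2))
regret = total_loss - best_fixed   # regret w.r.t. best fixed action in hindsight
```

Since this learner never deviates from a fixed action, its cumulative loss is at least that of the best fixed action, so its regret is nonnegative by construction.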
In general , the regret grows with the number of rounds T . If the regret is sublinear in T , the learner is said to be Hannan consistent , and this means that the learner’s average per-round loss approaches the average per-round loss of the best action in hindsight .
Piccolboni and Schindelhauer (2001) were among the first to study the regret of these games. In fact, they studied the problem without making any probabilistic assumptions about the outcome sequence $J_t$. They proved that for any finite game $(L, H)$, either the regret can be $\Omega(T)$ in the worst case for any algorithm, or there exists an algorithm whose regret is $\tilde{O}(T^{3/4})$ on any outcome sequence.1 This result was later improved by Cesa-Bianchi et al. (2006), who showed that the algorithm of Piccolboni and Schindelhauer has regret $\tilde{O}(T^{2/3})$. Furthermore, they provided an example of a finite game, a variant of label-efficient prediction, for which any algorithm has regret $\tilde\Theta(T^{2/3})$ in the worst case.
However, for many games $\tilde{O}(T^{2/3})$ is not optimal. For example, games with full feedback (i.e., when the feedback uniquely determines the outcome) can be viewed as a special instance of the problem of learning with expert advice, and in this case it is known that the "EWA forecaster" has regret $O(\sqrt{T})$; see, e.g., Cesa-Bianchi and Lugosi (2006, Chapter 3). Similarly, for games with "bandit feedback" (i.e., when the feedback determines the instantaneous loss) the INF algorithm (Audibert and Bubeck, 2009) and the Exp3 algorithm (Auer et al., 2002) achieve $O(\sqrt{T})$ regret as well.2
This leaves open the problem of determining the minimax regret (i.e., optimal worst-case regret) of any given game $(L, H)$. Partial progress was made in this direction by Bartók et al. (2010), who characterized (almost) all finite games with $M = 2$ outcomes. They showed that the minimax regret of any "non-degenerate" finite game with two outcomes falls into one of four categories: zero, $\tilde\Theta(\sqrt{T})$, $\tilde\Theta(T^{2/3})$, or $\Theta(T)$. They gave a combinatoric-geometric condition on the matrices $L, H$ which determines the category a game belongs to. Additionally, they constructed an efficient algorithm which, for any game, achieves the minimax regret rate associated with the game within a poly-logarithmic factor.
In this paper , we consider the same problem , with two exceptions . In pursuing a general result , we will consider all finite games . However , at the same time , we will only deal with stochastic environments , i.e. , when the outcome sequences are generated from a fixed probability distribution in an i.i.d . manner .
1. The notations $\tilde{O}(\cdot)$ and $\tilde\Theta(\cdot)$ hide polylogarithmic factors.
2. We ignore the dependence of the regret on the number of actions or any other parameters.
The regret against stochastic environments is defined as the difference between the cumulative loss suffered by the algorithm and that of the action with the lowest expected loss . That is , given an algorithm A and a time horizon T , if the outcomes are generated from a probability distribution p , the regret is
$$ R_T(A, p) = \sum_{t=1}^T \ell_{I_t,J_t} - \min_{1\le i\le N} \mathbb{E}_p\left[\sum_{t=1}^T \ell_{i,J_t}\right]. $$
In this paper we analyze the minimax expected regret ( in what follows , minimax regret ) of games , defined as
$$ R_T(G) = \inf_{A} \sup_{p \in \Delta_M} \mathbb{E}_p\left[R_T(A, p)\right]. $$
We show that the minimax regret of any finite game falls into one of four categories: zero, $\tilde\Theta(\sqrt{T})$, $\tilde\Theta(T^{2/3})$, or $\Theta(T)$. Accordingly, we call the games trivial, easy, hard, and hopeless. We give a simple and efficiently computable characterization of these classes using a geometric condition on $(L, H)$. We provide lower bounds and algorithms that achieve them within a poly-logarithmic factor. Our result is an extension, for stochastic environments, of the result of Bartók et al. (2010).
It is clear that any lower bound which holds for stochastic environments must hold for adversarial environments too . On the other hand , algorithms and regret upper bounds for stochastic environments , of course , do not transfer to algorithms and regret upper bounds for the adversarial case . Our characterization is a stepping stone towards understanding the minimax regret of partial monitoring games . In particular , we conjecture that our characterization holds without any change for unrestricted environments .
2 . Preliminaries
In this section, we introduce our conventions, along with some definitions. By default, all vectors are column vectors. We denote by $\|v\| = \sqrt{v^\top v}$ the Euclidean norm of a vector $v$. For a vector $v$, the notation $v \ge 0$ means that all entries of $v$ are non-negative, and the notation $v > 0$ means that all entries are positive. For a matrix $A$, $\operatorname{Im} A$ denotes its image space, i.e., the vector space generated by its columns, and $\operatorname{Ker} A$ denotes its kernel, i.e., the set $\{x : Ax = 0\}$.
Consider a game $G = (L, H)$ with $N$ actions and $M$ outcomes. That is, $L \in \mathbb{R}^{N \times M}$ and $H \in \Sigma^{N \times M}$. For the sake of simplicity and without loss of generality, we assume that no symbol $\sigma \in \Sigma$ can be present in two different rows of $H$. The signal matrix of an action is defined as follows:
Definition 1 (Signal matrix) Let $\{\sigma_1, \dots, \sigma_{s_i}\}$ be the set of symbols listed in the $i$th row of $H$. (Thus, $s_i$ denotes the number of different symbols in row $i$ of $H$.) The signal matrix $S_i$ of action $i$ is defined as the $s_i \times M$ matrix with entries $a_{k,j} = \mathbb{I}(h_{i,j} = \sigma_k)$ for $1 \le k \le s_i$ and $1 \le j \le M$. The signal matrix for a set of actions is defined as the signal matrices of the actions in the set, stacked on top of one another, in the ordering of the actions.
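Definition 1 is easy to operationalize. The following sketch (plain NumPy; both helper names are ours) constructs $S_i$ from the $i$th row of $H$ and stacks the signal matrices of a set of actions:

```python
import numpy as np

def signal_matrix(H, i):
    """Signal matrix S_i of action i: one row per distinct symbol sigma_k
    in row i of H, with entries (S_i)_{k,j} = I(h_{i,j} = sigma_k)."""
    symbols = []                       # distinct symbols, in order of first appearance
    for s in H[i]:
        if s not in symbols:
            symbols.append(s)
    M = len(H[i])
    S = np.zeros((len(symbols), M))
    for j, s in enumerate(H[i]):
        S[symbols.index(s), j] = 1.0
    return S

def stacked_signal_matrix(H, actions):
    """Signal matrix for a set of actions: the individual S_i stacked vertically."""
    return np.vstack([signal_matrix(H, i) for i in actions])
```

For instance, with `H = [["a", "a", "b"], ["c", "d", "d"]]`, `signal_matrix(H, 0)` is the 2×3 matrix `[[1, 1, 0], [0, 0, 1]]`; each column has exactly one nonzero entry, so $S_i p$ is a probability distribution whenever $p$ is.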
For an example of a signal matrix, see Section 3.1. We identify the strategy of a stochastic opponent with an element of the probability simplex $\Delta_M = \{p \in \mathbb{R}^M : p \ge 0,\ \sum_{j=1}^M p_j = 1\}$. Note that for any opponent strategy $p$, if the learner chooses action $i$ then the vector $S_i p \in \mathbb{R}^{s_i}$ is the probability distribution of the observed feedback: $(S_i p)_k$ is the probability of observing the $k$th symbol.
We denote by $\ell_i$ the $i$th row of the loss matrix $L$ and we call $\ell_i$ the loss vector of action $i$. We say that action $i$ is optimal under opponent strategy $p \in \Delta_M$ if for every $1 \le j \le N$, $\ell_i^\top p \le \ell_j^\top p$. Action $i$ is said to be Pareto-optimal if there exists an opponent strategy $p$ such that action $i$ is optimal under $p$. We now define the cell decomposition of $\Delta_M$ induced by $L$ (for an example, see Figure 2):
Definition 2 (Cell decomposition) For an action $i$, the cell $C_i$ associated with $i$ is defined as $C_i = \{p \in \Delta_M : \text{action } i \text{ is optimal under } p\}$. The cell decomposition of $\Delta_M$ is defined as the multiset $\mathcal{C} = \{C_i : 1 \le i \le N,\ C_i \text{ has positive } (M-1)\text{-dimensional volume}\}$.
Actions whose cell has positive $(M-1)$-dimensional volume are called strongly Pareto-optimal. Actions that are Pareto-optimal but not strongly Pareto-optimal are called degenerate. Note that the cells of the actions are defined by linear inequalities and thus they are convex polytopes. It follows that strongly Pareto-optimal actions are the actions whose cells are $(M-1)$-dimensional polytopes. It is also important to note that the cell decomposition is a multiset, since some actions can share the same cell. Nevertheless, if two actions have the same cell of dimension $(M-1)$, their loss vectors will necessarily be identical.3
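One can explore a cell decomposition numerically by sampling opponent strategies uniformly from the simplex and recording which action is optimal; cells of positive $(M-1)$-dimensional volume then show up with positive empirical frequency. A sketch with an illustrative 3×3 loss matrix of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

L = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])   # an illustrative loss matrix (N = M = 3)

# Uniform samples from the simplex = Dirichlet(1, 1, 1) draws.
P = rng.dirichlet(np.ones(3), size=20000)
winners = np.argmin(P @ L.T, axis=1)   # optimal action under each sampled p

# Empirical fraction of the simplex covered by each cell; positive frequency
# is a Monte Carlo proxy for positive (M-1)-dimensional volume.
freq = np.bincount(winners, minlength=3) / len(winners)
strongly_pareto = np.where(freq > 0)[0]
```

For this particular $L$, all three cells contain a neighborhood of a distinct region of the simplex (action 1 is optimal, e.g., at the uniform strategy), so all three actions are strongly Pareto-optimal.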
We call two cells of $\mathcal{C}$ neighbors if their intersection is an $(M-2)$-dimensional polytope. The actions corresponding to these cells will also be called neighbors. Neighborship is not defined for cells outside of $\mathcal{C}$. For two neighboring cells $C_i, C_j \in \mathcal{C}$, we define the neighborhood action set $A_{i,j} = \{1 \le k \le N : C_i \cap C_j \subseteq C_k\}$. It follows from the definition that actions $i$ and $j$ are in $A_{i,j}$ and thus $A_{i,j}$ is nonempty. However, the neighborhood action set may contain more than two actions.
When discussing lower bounds we will need a definition of algorithms. For us, an algorithm $\mathcal{A}$ is a mapping $\mathcal{A} : \Sigma^* \to \{1, 2, \dots, N\}$ which maps past feedback sequences to actions. That the algorithms are deterministic is assumed for convenience. In particular, the lower bounds we prove can be extended to randomized algorithms by conditioning on the internal randomization of the algorithm. Note that the algorithms we design are themselves deterministic.
3 . Classification of finite partial-monitoring games
In this section we present our main result : we state the theorem that classifies all finite stochastic partial-monitoring games based on how their minimax regret scales with the time horizon . Thanks to the previous section , we are now equipped to define a notion which will play a key role in the classification theorem:
3 . One could think that actions with identical loss vectors are redundant and that all but one of such actions could be removed without loss of generality . However , since different actions can lead to different observations and thus yield different information , removing the duplicates can be harmful .
Definition 3 (Observability) Let $S$ be the signal matrix for the set of all actions in the game. For actions $i$ and $j$, we say that $\ell_i - \ell_j$ is globally observable if $\ell_i - \ell_j \in \operatorname{Im} S^\top$. Furthermore, if $i$ and $j$ are two neighboring actions, then $\ell_i - \ell_j$ is called locally observable if $\ell_i - \ell_j \in \operatorname{Im} S_{(i,j)}^\top$, where $S_{(i,j)}$ is the signal matrix for the neighborhood action set $A_{i,j}$.
As we will see , global observability implies that we can estimate the difference of the expected losses after choosing each action once . Local observability means we only need actions from the neighborhood action set to estimate the difference .
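Both observability conditions reduce to a linear-algebra test: the loss difference must lie in the row space of the relevant signal matrix, which can be checked via a least-squares residual. A sketch (the helper name is ours):

```python
import numpy as np

def in_row_space(S, d, tol=1e-9):
    """Check whether d lies in the row space of S, i.e. whether the linear
    system S^T v = d has a solution v (the observability condition)."""
    v, *_ = np.linalg.lstsq(S.T, d, rcond=None)
    return np.linalg.norm(S.T @ v - d) < tol
```

For example, if $S$ has rows $(1,1,0)$ and $(0,0,1)$, then $(2,2,-1)$ is observable (it equals $2\cdot(1,1,0) - (0,0,1)$) while $(1,0,0)$ is not, since every combination of the rows has equal first and second coordinates.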
The classification theorem , which is our main result , is the following:
Theorem 4 (Classification) Let $G = (L, H)$ be a partial-monitoring game with $N$ actions and $M$ outcomes. Let $\mathcal{C} = \{C_1, \dots, C_k\}$ be its cell decomposition, with corresponding loss vectors $\ell_1, \dots, \ell_k$. The game $G$ falls into one of the following four categories:
(a) $R_T(G) = 0$ if there exists an action $i$ with $C_i = \Delta_M$. This case is called trivial.

(b) $R_T(G) = \Theta(T)$ if there exist two strongly Pareto-optimal actions $i$ and $j$ such that $\ell_i - \ell_j$ is not globally observable. This case is called hopeless.

(c) $R_T(G) = \tilde\Theta(\sqrt{T})$ if it is not trivial and for all pairs of (strongly Pareto-optimal) neighboring actions $i$ and $j$, $\ell_i - \ell_j$ is locally observable. These games are called easy.

(d) $R_T(G) = \tilde\Theta(T^{2/3})$ if $G$ is not hopeless and there exists a pair of neighboring actions $i$ and $j$ such that $\ell_i - \ell_j$ is not locally observable. These games are called hard.
Note that the conditions listed under (a)–(d) are mutually exclusive and cover all finite partial-monitoring games. The only non-obvious implication is that an easy game cannot be hopeless. This holds because for any pair of cells $C_i, C_j$ in $\mathcal{C}$, the vector $\ell_i - \ell_j$ can be expressed as a telescoping sum of the differences of loss vectors of neighboring cells.
The remainder of the paper is dedicated to proving Theorem 4. We start with the simple cases. If there exists an action whose cell covers the whole probability simplex, then choosing that action in every round yields zero regret, proving case (a). The condition in case (b) is due to Piccolboni and Schindelhauer (2001), who showed that under this condition there is no algorithm that achieves sublinear regret.4 The upper bound for case (d) is achieved by the FeedExp3 algorithm due to Piccolboni and Schindelhauer (2001), for which a regret bound of $\tilde{O}(T^{2/3})$ was shown by Cesa-Bianchi et al. (2006). The lower bound for case (c) was proved by Antos et al. (2011). For a visualization of previous results, see Figure 1.
The above assertions help characterize trivial and hopeless games, and show that if a game is neither trivial nor hopeless then its minimax regret falls between $\Omega(\sqrt{T})$ and $\tilde{O}(T^{2/3})$. Our contribution in this paper is to give exact minimax rates (up to logarithmic factors) for these games. To prove the upper bound for case (c), we introduce a new algorithm, which we call Balaton, for "Bandit Algorithm for Loss Annihilation".5 This algorithm is presented in Section 4, while its analysis is given in Section 5. The lower bound for case (d) is presented in Section 6.
4. Although Piccolboni and Schindelhauer state their theorem for adversarial environments, their proof applies to stochastic environments without any change (which is important for the lower bound part).
5. Balaton is a lake in Hungary. We thank Gergely Neu for suggesting the name.
Figure 1: Partial monitoring games and their minimax regret as it was known previously. The big rectangle denotes the set of all games. Inside the big rectangle, the games are ordered from left to right based on their minimax regret. In the "hard" area, l.e.p. denotes label-efficient prediction. The grey area contains games whose minimax regret is between $\Omega(\sqrt{T})$ and $\tilde{O}(T^{2/3})$ but whose exact regret rate was unknown. This area is now eliminated, and the dynamic pricing problem is proven to be hard.
3.1 . Example
In this section, as a corollary of Theorem 4, we show that the discretized dynamic pricing game (see, e.g., Cesa-Bianchi et al. (2006)) is hard. Dynamic pricing is a game between a vendor (learner) and a customer (environment). In each round, the vendor sets the price at which he wants to sell his product (action), and the customer sets the maximum price he is willing to pay for the product (outcome). If the product is not sold, the vendor suffers some constant loss; otherwise his loss is the difference between the customer's maximum price and his own. The customer never reveals the maximum price, and thus the vendor's only feedback is whether he sold the product or not.
The discretized version of the game with $N$ actions (and outcomes) is defined by the matrices
$$ L = \begin{pmatrix} 0 & 1 & 2 & \cdots & N-1 \\ c & 0 & 1 & \cdots & N-2 \\ \vdots & \ddots & \ddots & & \vdots \\ c & \cdots & c & 0 & 1 \\ c & \cdots & \cdots & c & 0 \end{pmatrix}, \qquad H = \begin{pmatrix} 1 & \cdots & \cdots & 1 \\ 0 & 1 & \cdots & 1 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix}, $$
where $c$ is a positive constant (see Figure 2 for the cell decomposition for $N = 3$). It is easy to see that all the actions are strongly Pareto-optimal. Also, after some linear algebra it turns out that the cells underlying the actions have a single common vertex in the interior of the probability simplex. It follows that any two actions are neighbors. On the other hand, if we take two non-consecutive actions $i$ and $i'$, then $\ell_i - \ell_{i'}$ is not locally observable. For example, the signal matrix for action $1$ and action $N$ is
$$ S_{(1,N)} = \begin{pmatrix} 1 & \cdots & 1 & 1 \\ 1 & \cdots & 1 & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}, $$
whereas $\ell_N - \ell_1 = (c, c-1, \dots, c-N+2, -N+1)^\top$. It is obvious that $\ell_N - \ell_1$ is not in the row space of $S_{(1,N)}$.
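This computation is easy to reproduce numerically. The sketch below builds the dynamic pricing matrices for $N = 5$ and $c = 2$ (our choice of constants), assembles the stacked signal matrix of actions $1$ and $N$, and confirms that $\ell_N - \ell_1$ is not in its row space:

```python
import numpy as np

def pricing_game(N, c):
    """Discretized dynamic pricing: action i = asked price, outcome j =
    customer's maximum price (both 0-indexed here)."""
    L = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            L[i, j] = (j - i) if j >= i else c   # sold: loss j - i, else c
    H = (np.arange(N)[None, :] >= np.arange(N)[:, None]).astype(float)  # 1 iff sold
    return L, H

N, c = 5, 2.0
L, H = pricing_game(N, c)

# Stacked signal matrix of actions 1 and N (0-indexed 0 and N-1):
# action 0 always observes "sold"; action N-1 has two symbols.
S = np.array([np.ones(N),      # S_1: single symbol, all outcomes
              H[N - 1],        # S_N: indicator of "sold"
              1.0 - H[N - 1]]) # S_N: indicator of "not sold"
d = L[N - 1] - L[0]            # loss difference l_N - l_1

v, *_ = np.linalg.lstsq(S.T, d, rcond=None)
locally_observable = np.linalg.norm(S.T @ v - d) < 1e-9
```

The row space of `S` consists of vectors that are constant on the first $N-1$ coordinates, while `d` equals $(2, 1, 0, -1, -4)$ here, so the residual is bounded away from zero and `locally_observable` is false, matching the argument above.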
Figure 2: The cell decomposition of the discretized dynamic pricing game with 3 actions, drawn on the probability simplex with vertices $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. If the opponent strategy is $p$, then action 2 is the optimal action.
4. Balaton: An algorithm for easy games

In this section we present our algorithm that achieves $\tilde{O}(\sqrt{T})$ expected regret for easy games (case (c) of Theorem 4). The input of the algorithm is the loss matrix $L$, the feedback matrix $H$, the time horizon $T$ and an error probability $\delta$, to be chosen later. Before describing the algorithm, we introduce some notation. We define a graph $\mathcal{G}$ associated with the game $G$ in the following way. Let the vertex set be the set of cells of the cell decomposition $\mathcal{C}$ of the probability simplex, where cells $C_i, C_j \in \mathcal{C}$ share the same vertex when $C_i = C_j$. The graph has an edge between vertices whose corresponding cells are neighbors. This graph is connected, since the probability simplex is convex and the cell decomposition covers the simplex.
Recall that for neighboring cells $C_i, C_j$, the signal matrix $S_{(i,j)}$ is defined as the signal matrix for the neighborhood action set $A_{i,j}$ of cells $i, j$. Assuming that the game satisfies the condition of case (c) of Theorem 4, we have that for all neighboring cells $C_i$ and $C_j$, $\ell_i - \ell_j \in \operatorname{Im} S_{(i,j)}^\top$. This means that there exists a coefficient vector $v_{(i,j)}$ such that $\ell_i - \ell_j = S_{(i,j)}^\top v_{(i,j)}$. We define the $k$th segment of $v_{(i,j)}$, denoted by $v_{(i,j),k}$, as the vector of components of $v_{(i,j)}$ that correspond to the $k$th action in the neighborhood action set. That is, if $S_1, \dots, S_r$ are the signal matrices of the individual actions in $A_{i,j}$, stacked to form $S_{(i,j)}$, then $\ell_i - \ell_j = S_{(i,j)}^\top v_{(i,j)} = \sum_{s=1}^r S_s^\top v_{(i,j),s}$.
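A small numerical check of this decomposition, with two hypothetical signal matrices of our own: solve $S_{(i,j)}^\top v = \ell_i - \ell_j$ by least squares, split $v$ into per-action segments, and verify the identity $\sum_k (S_k p)^\top v_{(i,j),k} = (\ell_i - \ell_j)^\top p$ that underlies the unbiasedness argument.

```python
import numpy as np

# Hypothetical neighborhood action set with two actions and signal matrices:
S1 = np.array([[1., 1., 0.],
               [0., 0., 1.]])
S2 = np.array([[1., 0., 0.],
               [0., 1., 1.]])
S = np.vstack([S1, S2])                # stacked signal matrix S_(i,j)
d = np.array([1., -1., 0.])            # a loss difference l_i - l_j (observable here)

# Solve S^T v = d, then split v into per-action segments v_(i,j),k.
v, *_ = np.linalg.lstsq(S.T, d, rcond=None)
segments = np.split(v, [S1.shape[0]])

# For any outcome distribution p, the expected estimate equals d^T p.
p = np.array([0.2, 0.5, 0.3])
estimate = (S1 @ p) @ segments[0] + (S2 @ p) @ segments[1]
```

Here `S1 @ p` plays the role of the expected observation vector $\mathbb{E}[O_k]$, so `estimate` is the expectation of the per-round estimator built from the segments.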
Let $J_t \in \{1, \dots, M\}$ denote the outcome at time step $t$. For $1 \le k \le M$, let $e_k \in \mathbb{R}^M$ be the $k$th unit vector. For an action $i$, let $O_i(t) = S_i e_{J_t}$ be the observation vector of action $i$ at time step $t$. If the rows of the signal matrix $S_i$ correspond to symbols $\sigma_1, \dots, \sigma_{s_i}$ and action $i$ is chosen at time step $t$, then the unit vector $O_i(t)$ indicates which symbol was observed in that time step. Thus, $O_{I_t}(t)$ holds the same information as the feedback at time $t$ (recall that $I_t$ is the action chosen by the learner at time step $t$). From now on, for simplicity, we will assume that the feedback at time step $t$ is the observation vector $O_{I_t}(t)$ itself.
The main idea of the algorithm is to successively eliminate actions in an efficient , yet safe manner . When all remaining strongly Pareto optimal actions share the same cell , the elimination phase finishes and from this point , one of the remaining actions is played . During the elimination phase , the algorithm works in rounds . In each round each ‘alive’ Pareto optimal action is played once . The resulting observations are used to estimate the loss-difference between the alive actions . If some estimate becomes sufficiently precise , the action of the pair deemed to be suboptimal is eliminated ( possibly together with other
Algorithm 1 Balaton
Input: L, H, T, δ
Initialization:
  [G, C, {v(i,j),k}, {path(i,j)}, {(LB(i,j), UB(i,j), σ(i,j), R(i,j))}] ← Initialize(L, H)
  t ← 0, n ← 0
  aliveActions ← {1 ≤ i ≤ N : C_i ∩ interior(Δ_M) ≠ ∅}
{main loop}
while |V_G| > 1 and t < T do
  n ← n + 1
  for each i ∈ aliveActions do
    O_i ← ExecuteAction(i)
    t ← t + 1
  end for
  for each edge (i, j) in G: μ(i,j) ← Σ_{k ∈ A_{i,j}} O_k⊤ v(i,j),k end for
  for each non-adjacent vertex pair (i, j) in G: μ(i,j) ← Σ_{(k,l) ∈ path(i,j)} μ(k,l) end for
  haveEliminated ← false
  for each vertex pair (i, j) in G do
    μ̂(i,j) ← (1 − 1/n) μ̂(i,j) + (1/n) μ(i,j)
    if BStopStep(μ̂(i,j), LB(i,j), UB(i,j), σ(i,j), R(i,j), n, 1/2, δ) then
      [aliveActions, C, G] ← eliminate(i, j, sgn(μ̂(i,j)))
      haveEliminated ← true
    end if
  end for
  if haveEliminated then
    {path(i,j)} ← regeneratePaths(G)
  end if
end while
Let i be a strongly Pareto-optimal action in aliveActions
while t < T do
  ExecuteAction(i)
  t ← t + 1
end while
actions ) . To determine if an estimate is sufficiently precise , we will use an appropriate stopping rule . A small regret will be achieved by tuning the error probability of the stopping rule appropriately .
The details of the algorithm are as follows. In the preprocessing phase, the algorithm constructs the neighborhood graph, the signal matrices $S_{(i,j)}$ assigned to the edges of the graph, the coefficient vectors $v_{(i,j)}$ and their segment vectors $v_{(i,j),k}$. In addition, it constructs a path in the graph connecting every pair of nodes, and initializes some variables used by the stopping rule.
In the elimination phase, the algorithm runs a loop. In each round of the loop, the algorithm chooses each of the alive actions once and, based on the observations, updates the estimates $\hat\mu_{(i,j)}$ of the loss differences $(\ell_i - \ell_j)^\top p$, where $p$ is the actual opponent strategy. The algorithm maintains the set $\mathcal{C}$ of cells of alive actions and their neighborship graph $\mathcal{G}$.
The estimates are calculated as follows. First we calculate estimates for neighboring actions $(i, j)$. In round6 $n$, for every action $k$ in $A_{i,j}$ let $O_k$ be the observation vector for action $k$, and let $\mu_{(i,j)} = \sum_{k \in A_{i,j}} O_k^\top v_{(i,j),k}$. From the local observability condition and the construction of $v_{(i,j),k}$, it follows by simple algebra that the $\mu_{(i,j)}$ are unbiased estimates of $(\ell_i - \ell_j)^\top p$ (see Lemma 5). For non-neighboring action pairs, we use telescoping sums: since the graph $\mathcal{G}$ (induced by the alive actions) stays connected, we can take a path $i = i_0, i_1, \dots, i_r = j$ in the graph, and the estimate $\mu_{(i,j)}(n)$ will be the sum of the estimates along the path: $\sum_{l=1}^r \mu_{(i_{l-1},i_l)}$. The estimate of the difference of the expected losses after round $n$ will be the average $\hat\mu_{(i,j)} = (1/n) \sum_{s=1}^n \mu_{(i,j)}(s)$, where $\mu_{(i,j)}(s)$ denotes the estimate for the pair $(i, j)$ computed in round $s$.
After updating the estimates, the algorithm decides which actions to eliminate. For each pair of vertices $i, j$ of the graph, the sign of the expected difference of their losses is tested by the BStopStep subroutine, based on the estimate $\hat\mu_{(i,j)}$ and its relative error. This subroutine uses a stopping rule based on Bernstein's inequality.
The subroutine's pseudocode is shown as Algorithm 2 and is essentially based on the work of Mnih et al. (2008). The algorithm maintains two values, LB and UB, computed from the supplied sequence of sample means ($\hat\mu$) and the deviation bounds
$$ c(\sigma, R, n, \delta) = \sigma\sqrt{\frac{2 L(\delta,n)}{n}} + \frac{R\, L(\delta,n)}{3n}, \quad \text{where } L(\delta, n) = \log\!\left(\frac{3 p\, n^p}{(p-1)\,\delta}\right). \tag{1} $$
Here $p > 1$ is an arbitrarily chosen parameter of the algorithm, $\sigma$ is a (deterministic) upper bound on the (conditional) variance of the random variables whose common mean $\mu$ we wish to estimate, while $R$ is a (deterministic) upper bound on their range. This is a general stopping rule method, which stops when it has produced an $\varepsilon$-relative accurate estimate of the unknown mean. The algorithm is guaranteed to be correct outside of a failure event whose probability is bounded by $\delta$.
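For concreteness, here is our reading of the deviation bound (1) in code; since the exact form of $L(\delta, n)$ is partly garbled in our copy, the constant inside the logarithm should be treated as an assumption.

```python
import math

def L_fn(delta, n, p=1.1):
    """Union-bound term L(delta, n) of eq. (1); the argument of the log
    is our reconstruction: 3 p n^p / ((p - 1) delta)."""
    return math.log(3 * p * n**p / ((p - 1) * delta))

def c_fn(sigma, R, n, delta, p=1.1):
    """Deviation bound c(sigma, R, n, delta) of eq. (1): a Bernstein-style
    confidence width for a mean estimated from n samples."""
    Ln = L_fn(delta, n, p)
    return sigma * math.sqrt(2 * Ln / n) + R * Ln / (3 * n)
```

The width shrinks roughly like $\sqrt{\log n / n}$, so for fixed $\sigma$, $R$, $\delta$ it eventually becomes small enough for the stopping rule to fire.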
Algorithm Balaton calls this method with $\varepsilon = 1/2$. As a result, when BStopStep returns true, outside of the failure event the sign of the estimate $\hat\mu$ supplied to Balaton will match the sign of the mean to be estimated. The conditions under which the algorithm indeed produces $\varepsilon$-accurate estimates (with high probability) are given in Lemma 11 (see Appendix), which also states that, again with high probability, the time when the algorithm stops is bounded by
$$ C \cdot \max\left( \frac{\sigma^2}{\varepsilon^2 \mu^2},\ \frac{R}{\varepsilon\, |\mu|} \right)\left( \log\frac{1}{\delta} + \log\frac{R}{|\mu|} \right), $$
where $\mu \ne 0$ is the true mean. Note that the choice of $p$ in (1) influences only $C$.
If BStopStep returns true for an estimate $\hat\mu_{(i,j)}$, the function eliminate is called. If, say, $\hat\mu_{(i,j)} > 0$, this function takes the closed half space $\{q \in \Delta_M : (\ell_i - \ell_j)^\top q \le 0\}$ and eliminates all actions whose cell lies completely in the half space. The function also drops from the graph the vertices that correspond to eliminated cells. The elimination necessarily
6. Note that a round of the algorithm is not the same as the time step $t$. In a round, the algorithm chooses each of the alive actions once.
Algorithm 2 Algorithm BStopStep. Note that, somewhat unusually for pseudocode, the arguments LB, UB are passed by reference, i.e., the algorithm rewrites the values of these arguments (which are thus returned to the caller).
Input: μ̂, LB, UB, σ, R, n, ε, δ
  LB ← max(LB, |μ̂| − c(σ, R, n, δ))
  UB ← min(UB, |μ̂| + c(σ, R, n, δ))
  return (1 + ε) LB ≥ (1 − ε) UB
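A Python sketch of one BStopStep call (ours): instead of pass-by-reference we return the updated (LB, UB) pair, and we take the stopping test to be $(1+\varepsilon)\mathrm{LB} \ge (1-\varepsilon)\mathrm{UB}$ in the style of the Mnih et al. (2008) stopping rules; the direction of this comparison and the form of the deviation bound are our reconstruction.

```python
import math

def bstop_step(mu_hat, state, sigma, R, n, eps, delta, p=1.1):
    """One call of BStopStep (a sketch). `state` holds the running (LB, UB)
    pair that the pseudocode passes by reference; returns (stop, new_state)."""
    Ln = math.log(3 * p * n**p / ((p - 1) * delta))     # L(delta, n), eq. (1)
    c = sigma * math.sqrt(2 * Ln / n) + R * Ln / (3 * n)  # deviation bound
    lb, ub = state
    lb = max(lb, abs(mu_hat) - c)
    ub = min(ub, abs(mu_hat) + c)
    stop = (1 + eps) * lb >= (1 - eps) * ub
    return stop, (lb, ub)
```

With few samples the confidence width dominates and the call returns false; once the width has shrunk well below $|\hat\mu|$, the bounds pinch together and the call returns true.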
concerns all actions whose cell is $C_i$, and possibly other actions as well. The remaining cells are redefined by taking their intersection with the complementary closed half space $\{q \in \Delta_M : (\ell_i - \ell_j)^\top q \ge 0\}$.
By construction , after the elimination phase , the remaining graph is still connected , but some paths used in the round may have lost vertices or edges . For this reason , in the last phase of the round , new paths are constructed for vertex pairs with broken paths .
The main loop of the algorithm continues until either one vertex remains in the graph or the time horizon T is reached . In the former case , one of the actions corresponding to that vertex is chosen until the time horizon is reached .
5. Analysis of the algorithm

In this section we prove that the algorithm described in the previous section achieves $\tilde{O}(\sqrt{T})$ expected regret.
Let us assume that the outcomes are generated following the probability vector $p \in \Delta_M$. Let $j$ denote an optimal action, that is, for every $1 \le i \le N$, $\ell_j^\top p \le \ell_i^\top p$. For every pair of actions $i, j$, let $\alpha_{i,j} = (\ell_i - \ell_j)^\top p$ be the expected difference of their instantaneous losses. The expected regret of the algorithm can be rewritten as
$$ \mathbb{E}\left[\sum_{t=1}^T \ell_{I_t,J_t}\right] - \min_{1\le i\le N} \mathbb{E}\left[\sum_{t=1}^T \ell_{i,J_t}\right] = \sum_{i=1}^N \mathbb{E}[\tau_i]\, \alpha_{i,j}, \tag{2} $$
where $\tau_i$ is the number of times action $i$ is chosen by the algorithm.
Throughout the proof, the value that Balaton assigns to a variable $x$ in round $n$ will be denoted by $x(n)$. Further, for $1 \le k \le N$, we introduce the i.i.d. random sequence $(J_k(n))_{n \ge 1}$, taking values in $\{1, \dots, M\}$, with common multinomial distribution satisfying $\mathbb{P}[J_k(n) = j] = p_j$. Clearly, a statistically equivalent model to the one where $(J_t)$ is an i.i.d. sequence with multinomial distribution $p$ is when $(J_t)$ is defined through
$$ J_t = J_{I_t}\!\left( \sum_{s=1}^t \mathbb{I}(I_s = I_t) \right). \tag{3} $$
Note that this claim holds , independently of the algorithm generating the actions , I t . Therefore , in what follows , we assume that the outcome sequence is generated through ( 3 ) . As we will see , this construction significantly simplifies subsequent steps of the proof . In particular , the construction will be very convenient since if action k is selected by our algorithm in the n th elimination round then the outcome obtained in response is going to be
$O_k(n) = S_k u_k(n)$, where $u_k(n) = e_{J_k(n)}$. (This holds because in the elimination rounds all alive actions are tried exactly once by Balaton.)
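Construction (3) is straightforward to implement: pre-draw an i.i.d. outcome table, one row per action, and serve $J_t$ from row $I_t$ at an index counting how many times $I_t$ has been chosen so far. A sketch (the round-robin learner is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

M, N, T = 3, 4, 500
p = np.array([0.2, 0.3, 0.5])

# Pre-draw, for each action k, an i.i.d. outcome sequence (J_k(n))_n  -- eq. (3)
J_table = rng.choice(M, size=(N, T), p=p)

counts = np.zeros(N, dtype=int)
outcomes = []
for t in range(T):
    I_t = t % N                      # any algorithm may pick I_t; round-robin here
    counts[I_t] += 1
    # J_t = J_{I_t}( #{s <= t : I_s = I_t} ), the coupling of eq. (3)
    outcomes.append(J_table[I_t, counts[I_t] - 1])
outcomes = np.asarray(outcomes)
```

Because each row of the table is i.i.d. with distribution $p$ and each entry is consumed at most once, the resulting sequence `outcomes` is statistically equivalent to drawing $J_t$ i.i.d. from $p$, regardless of how $I_t$ is chosen.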
Let $(\mathcal{F}_n)_n$ be the filtration defined as $\mathcal{F}_n = \sigma(u_k(m);\ 1 \le k \le N,\ 1 \le m \le n)$. We also introduce the notations $\mathbb{E}_n[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_n]$ and $\operatorname{Var}_n(\cdot) = \operatorname{Var}(\cdot \mid \mathcal{F}_n)$ for the conditional expectation and conditional variance operators corresponding to $\mathcal{F}_n$. Note that $\mathcal{F}_n$ contains the information known to Balaton (and more) at the end of elimination round $n$. Our first (trivial) observation is that $\mu_{(i,j)}(n)$, the estimate of $\alpha_{i,j}$ obtained in round $n$, is $\mathcal{F}_n$-measurable. The next lemma establishes that, furthermore, $\mu_{(i,j)}(n)$ is an unbiased estimate of $\alpha_{i,j}$:
Lemma 5 For any $n \ge 1$ and $i, j$ such that $C_i, C_j \in \mathcal{C}$, $\mathbb{E}_{n-1}[\mu_{(i,j)}(n)] = \alpha_{i,j}$.
Proof Consider first the case when actions $i$ and $j$ are neighbors. In this case,
$$ \mu_{(i,j)}(n) = \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} = \sum_{k \in A_{i,j}} \left(S_k u_k(n)\right)^\top v_{(i,j),k} = \sum_{k \in A_{i,j}} u_k(n)^\top S_k^\top v_{(i,j),k}, $$
and thus
$$ \mathbb{E}_{n-1}\left[\mu_{(i,j)}(n)\right] = \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\left[u_k(n)\right]^\top S_k^\top v_{(i,j),k} = p^\top \sum_{k \in A_{i,j}} S_k^\top v_{(i,j),k} = p^\top S_{(i,j)}^\top v_{(i,j)} = p^\top (\ell_i - \ell_j) = \alpha_{i,j}. $$
For non-adjacent $i$ and $j$, we have a telescoping sum:
$$ \mathbb{E}_{n-1}\left[\mu_{(i,j)}(n)\right] = \sum_{k=1}^r \mathbb{E}_{n-1}\left[\mu_{(i_{k-1},i_k)}(n)\right] = p^\top\left( \ell_{i_0} - \ell_{i_1} + \ell_{i_1} - \ell_{i_2} + \cdots + \ell_{i_{r-1}} - \ell_{i_r} \right) = \alpha_{i,j}, $$
where $i = i_0, i_1, \dots, i_r = j$ is the path the algorithm uses in round $n$, known at the end of round $n-1$.
Lemma 6 The conditional variance of $\mu_{(i,j)}(n)$, $\operatorname{Var}_{n-1}(\mu_{(i,j)}(n))$, is upper bounded by $V = 2 \sum_{\{i,j \text{ neighbors}\}} \|v_{(i,j)}\|_2^2$.
Proof For neighboring cells $i, j$, we write
$$ \mu_{(i,j)}(n) = \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} $$
and thus
$$\begin{aligned} \operatorname{Var}_{n-1}\left(\mu_{(i,j)}(n)\right) &= \operatorname{Var}_{n-1}\left( \sum_{k \in A_{i,j}} O_k(n)^\top v_{(i,j),k} \right) \\ &= \sum_{k \in A_{i,j}} \mathbb{E}_{n-1}\left[ v_{(i,j),k}^\top \left(O_k(n) - \mathbb{E}_{n-1}[O_k(n)]\right) \left(O_k(n) - \mathbb{E}_{n-1}[O_k(n)]\right)^\top v_{(i,j),k} \right] \\ &\le \sum_{k \in A_{i,j}} \|v_{(i,j),k}\|_2^2\ \mathbb{E}_{n-1}\left[ \left\| O_k(n) - \mathbb{E}_{n-1}[O_k(n)] \right\|_2^2 \right] \\ &\le \sum_{k \in A_{i,j}} \|v_{(i,j),k}\|_2^2 = \|v_{(i,j)}\|_2^2, \end{aligned} \tag{4} $$
where in (4) we used that $O_k(n)$ is a unit vector and $\mathbb{E}_{n-1}[O_k(n)]$ is a probability vector.
For i , j non-neighboring cells , let i = i 0 , i 1 , . . . , i r = j the path used for the estimate in round n. Then µ ( i , j ) ( n ) can be written as
µ ( i , j ) ( n ) =
r
s=1
µ ( i s−1 , i s ) ( n ) =
r
s=1 k∈A is−1 , is
O k ( n ) v ( i s−1 , i s ) , k .
It is not hard to see that an action can only be in at most two neighborhood action sets in the path and so the double sum can be rearranged as
k∈ A is−1 , is
O k ( n ) ( v ( i sk−1 , i sk ) , k + v ( i sk i sk+1 ) , k ) ,
and thus Var n−1 µ ( i , j ) ( n ) 2 r s=1 v ( i s−1 , i s ) 2 2 2 { i , j neighbors } v ( i , j ) 2 2 .
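A quick empirical check of the per-pair variance bound (4) on a toy instance (the signal matrices and vectors below are made up):

```python
import numpy as np

# Empirical check that Var(mu) <= ||v_{(i,j)}||_2^2 for a neighbor-pair estimate.
rng = np.random.default_rng(2)
M = 3
p_star = np.array([0.3, 0.3, 0.4])
S = [np.array([[1., 0., 0.], [0., 1., 1.]]),
     np.array([[1., 1., 0.], [0., 0., 1.]])]
v = [np.array([0.5, -0.2]), np.array([-0.1, 0.7])]

def estimate(rng):
    # one term O_k^T v_{(i,j),k} per action, each with its own outcome draw
    return sum((Sk @ np.eye(M)[rng.choice(M, p=p_star)]) @ vk
               for Sk, vk in zip(S, v))

mus = np.array([estimate(rng) for _ in range(20000)])
bound = sum(np.dot(vk, vk) for vk in v)    # ||v_{(i,j)}||_2^2 from (4)
print(mus.var(), bound)
```

The empirical variance stays below the bound, which is what Lemma 6 aggregates over all neighboring pairs.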
Lemma 7 The range of the estimates $\mu_{(i,j)}(n)$ is upper bounded by $R = \sum_{\{i,j\ \mathrm{neighbors}\}} \|v_{(i,j)}\|_1$.

Proof The bound trivially follows from the definition of the estimates.
Let $\delta$ be the confidence parameter used in BStopStep. Since, according to Lemmas 5, 6 and 7, $(\mu_{(i,j)}(n))_n$ is a "shifted" martingale difference sequence with conditional mean $\alpha_{i,j}$ and bounded conditional variance and range, we can apply Lemma 11, stated in the Appendix. By the union bound, the probability that any of the confidence bounds fails during the game is at most $N^2\delta$. Thus, with probability at least $1 - N^2\delta$, if BStopStep returns true for a pair $(i,j)$, then $\mathrm{sgn}(\alpha_{i,j}) = \mathrm{sgn}(\mu_{(i,j)})$ and the algorithm eliminates all the actions whose cell is contained in the closed half space $H = \{p : \mathrm{sgn}(\alpha_{i,j})\,(\ell_i - \ell_j)^\top p \le 0\}$. By definition, $\alpha_{i,j} = (\ell_i - \ell_j)^\top p^*$, and hence $\mathrm{sgn}(\alpha_{i,j})\,(\ell_i - \ell_j)^\top p^* = |\alpha_{i,j}| > 0$. Thus $p^* \notin H$, and none of the eliminated actions can be optimal under $p^*$.
From Lemma 11 we also see that, with probability at least $1 - N^2\delta$, the number of times $\tau_i$ the algorithm experiments with a suboptimal action $i$ during the elimination phase is bounded by
$$\tau_i \le \frac{c(G)}{\alpha_{i,j^*}^2} \log \frac{R}{\delta\,\alpha_{i,j^*}} = T_i, \qquad (5)$$
where $c(G) = C(V + R)$ is a problem dependent constant and $j^*$ denotes an optimal action under $p^*$.
The following lemma, the proof of which can be found in the Appendix, shows that degenerate actions will be eliminated in time.
Lemma 8 Let action $i$ be a degenerate action. Let $A_i = \{j : C_j \in \mathcal{C},\ C_i \subseteq C_j\}$. The following two statements hold:

1. If any of the actions in $A_i$ is eliminated, then action $i$ is eliminated as well.

2. There exists an action $k_i \in A_i$ such that $\alpha_{k_i,j^*} \ge \alpha_{i,j^*}$.
An immediate implication of the first claim of the lemma is that if action $k_i$ gets eliminated then action $i$ gets eliminated as well; that is, the number of times action $i$ is chosen cannot be greater than that of action $k_i$. Hence, $\tau_i \le \tau_{k_i}$. Let $E$ be the complement of the failure event underlying the stopping rules. As discussed earlier, $\mathbb{P}(E^c) \le N^2\delta$. Note that on $E$, i.e., when the stopping rules do not fail, no suboptimal action can remain for the final phase; hence, on $E$, the total number of times a suboptimal action $i$ is chosen by the algorithm is bounded by its count $\tau_i$ in the elimination phase. To upper bound the expected regret we continue from (2) as
$$\begin{aligned}
\sum_{i=1}^{N} \mathbb{E}[\tau_i]\,\alpha_{i,j^*}
&\le \sum_{i=1}^{N} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + \mathbb{P}(E^c)\,T && \Big(\text{because } \textstyle\sum_{i=1}^{N}\tau_i = T \text{ and } 0 \le \alpha_{i,j^*} \le 1\Big) \\
&\le \sum_{i=1}^{N} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + N^2\delta T \\
&= \sum_{i:\,C_i \in \mathcal{C}} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + \sum_{i:\,C_i \notin \mathcal{C}} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + N^2\delta T \\
&\le \sum_{i:\,C_i \in \mathcal{C}} \mathbb{E}\left[\mathbb{I}(E)\tau_i\right]\alpha_{i,j^*} + \sum_{i:\,C_i \notin \mathcal{C}} \mathbb{E}\left[\mathbb{I}(E)\tau_{k_i}\right]\alpha_{k_i,j^*} + N^2\delta T && (\text{by Lemma 8}) \\
&\le \sum_{i:\,C_i \in \mathcal{C}} T_i\,\alpha_{i,j^*} + \sum_{i:\,C_i \notin \mathcal{C}} T_{k_i}\,\alpha_{k_i,j^*} + N^2\delta T \\
&\le \sum_{\substack{i:\,C_i \in \mathcal{C}\\ \alpha_{i,j^*} \ge \alpha_0}} T_i\,\alpha_{i,j^*} + \sum_{\substack{i:\,C_i \notin \mathcal{C}\\ \alpha_{k_i,j^*} \ge \alpha_0}} T_{k_i}\,\alpha_{k_i,j^*} + \left(\alpha_0 + N^2\delta\right)T \\
&\le c(G)\Bigg(\sum_{\substack{i:\,C_i \in \mathcal{C}\\ \alpha_{i,j^*} \ge \alpha_0}} \frac{\log\frac{R}{\delta\alpha_{i,j^*}}}{\alpha_{i,j^*}} + \sum_{\substack{i:\,C_i \notin \mathcal{C}\\ \alpha_{k_i,j^*} \ge \alpha_0}} \frac{\log\frac{R}{\delta\alpha_{k_i,j^*}}}{\alpha_{k_i,j^*}}\Bigg) + \left(\alpha_0 + N^2\delta\right)T \\
&\le c(G)\,N\,\frac{\log\frac{R}{\delta\alpha_0}}{\alpha_0} + \left(\alpha_0 + N^2\delta\right)T.
\end{aligned}$$
The above calculation holds for any value of $\alpha_0 > 0$. Setting
$$\alpha_0 = \sqrt{\frac{c(G)N}{T}} \qquad \text{and} \qquad \delta = \sqrt{\frac{c(G)}{TN^3}},$$
we get
$$\mathbb{E}[R_T] \le \sqrt{c(G)NT}\left(2 + \log\frac{RTN}{c(G)}\right).$$
In conclusion, if we run Balaton with parameter $\delta = \sqrt{c(G)/(TN^3)}$, the algorithm suffers regret of $O(\sqrt{T}\log T)$, finishing the proof.
6. A lower bound for hard games

In this section we prove that for any game that satisfies the condition of Case (d) of Theorem 4, the minimax regret is $\Omega(T^{2/3})$.

Theorem 9 Let $G = (L, H)$ be an $N$ by $M$ partial-monitoring game. Assume that there exist two neighboring actions $i$ and $j$ such that $\ell_i - \ell_j \notin \operatorname{Im} S_{(i,j)}^\top$. Then there exists a problem dependent constant $c(G)$ such that for any algorithm $A$ and time horizon $T$ there exists an opponent strategy $p$ such that the expected regret satisfies
$$\mathbb{E}\left[R_T(A, p)\right] \ge c(G)\,T^{2/3}.$$
Proof Without loss of generality we can assume that the two neighboring cells in the condition are $C_1$ and $C_2$. Let $C_3 = C_1 \cap C_2$. For $i = 1, 2, 3$, let $A_i$ be the set of actions associated with cell $C_i$. Note that $A_3$ may be the empty set. Let $A_4 = A \setminus (A_1 \cup A_2 \cup A_3)$. By our convention for naming loss vectors, $\ell_1$ and $\ell_2$ are the loss vectors for $C_1$ and $C_2$, respectively. Let $L_3$ collect the loss vectors of actions which lie on the open segment connecting $\ell_1$ and $\ell_2$. It is easy to see that $L_3$ is the set of loss vectors that correspond to the cell $C_3$. We define $L_4$ as the set of all the other loss vectors. For $i = 1, 2, 3, 4$, let $k_i = |A_i|$.
Let $S = S_{(1,2)}$ be the signal matrix of the neighborhood action set of $C_1$ and $C_2$. It follows from the assumption of the theorem that $\ell_2 - \ell_1 \notin \operatorname{Im}(S^\top)$. Thus, $\{\rho(\ell_2 - \ell_1) : \rho \in \mathbb{R}\} \not\subseteq \operatorname{Im}(S^\top)$, or equivalently, $\ell_2 - \ell_1$ is not orthogonal to $\operatorname{Ker}(S)$, where we used that $(\operatorname{Im} M^\top)^\perp = \operatorname{Ker}(M)$. Thus, there exists a vector $v$ such that $v \in \operatorname{Ker}(S)$ and $(\ell_2 - \ell_1)^\top v \ne 0$. By scaling we can assume that $(\ell_2 - \ell_1)^\top v = 1$. Note that since $v \in \operatorname{Ker}(S)$ and the row space of $S$ contains the vector $(1, 1, \ldots, 1)$, the coordinates of $v$ sum up to zero.
Let $p_0$ be an arbitrary probability vector in the relative interior of $C_3$. It is easy to see that for any $\varepsilon > 0$ small enough, $p_1 = p_0 + \varepsilon v \in C_1 \setminus C_2$ and $p_2 = p_0 - \varepsilon v \in C_2 \setminus C_1$.
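The construction of $v$ is a small linear-algebra computation. The sketch below (with a made-up signal matrix and loss vectors) finds a basis of $\operatorname{Ker}(S)$ via the SVD and rescales the projection of $\ell_2 - \ell_1$ onto it so that $(\ell_2 - \ell_1)^\top v = 1$:

```python
import numpy as np

# Given S with l2 - l1 not in Im(S^T), find v in Ker(S) with (l2 - l1)^T v = 1.
S = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])       # row space contains (1, 1, 1, 1)
l1 = np.array([0., 1., 0., 1.])
l2 = np.array([1., 0., 0., 1.])
d = l2 - l1

# Basis of Ker(S) from the SVD: right singular vectors beyond the rank.
_, sv, Vt = np.linalg.svd(S)
null_basis = Vt[np.sum(sv > 1e-10):]

# Project d onto Ker(S); a nonzero projection certifies that d is not
# orthogonal to Ker(S). Rescale so that d^T v = 1.
w = null_basis.T @ (null_basis @ d)
v = w / (d @ w)
print(S @ v, d @ v, v.sum())           # ~zero vector, 1.0, ~0
```

The coordinates of $v$ sum to zero automatically here, because the all-ones vector lies in the row space of $S$.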
Let us fix a deterministic algorithm $A$ and a time horizon $T$. For $i = 1, 2$, let $R^{(i)}_T$ denote the expected regret of the algorithm under opponent strategy $p_i$. For $i = 1, 2$ and $j = 1, \ldots, 4$, let $N^i_j$ denote the expected number of times the algorithm chooses an action from $A_j$, assuming the opponent plays strategy $p_i$.

From the definition of $L_3$ we know that for any $\ell \in L_3$, $\ell - \ell_1 = \eta(\ell_2 - \ell_1)$ and $\ell - \ell_2 = (1-\eta)(\ell_1 - \ell_2)$ for some $0 < \eta < 1$. Let $\lambda_1 = \min_{\ell \in L_3} \eta$ and $\lambda_2 = \min_{\ell \in L_3} (1-\eta)$, and let $\lambda = \min(\lambda_1, \lambda_2)$ if $L_3 \ne \emptyset$ and $\lambda = 1/2$ otherwise. Finally, let $\beta_i = \min_{\ell \in L_4} (\ell - \ell_i)^\top p_i$ and $\beta = \min(\beta_1, \beta_2)$. Note that $\lambda, \beta > 0$.
As the first step of the proof, we lower bound the expected regrets $R^{(1)}_T$ and $R^{(2)}_T$ in terms of the values $N^i_j$, $\varepsilon$, $\lambda$ and $\beta$:
$$R^{(1)}_T \ge N^1_2 \underbrace{(\ell_2 - \ell_1)^\top p_1}_{\varepsilon} + N^1_3\,\lambda\,(\ell_2 - \ell_1)^\top p_1 + N^1_4\,\beta \ge \lambda\left(N^1_2 + N^1_3\right)\varepsilon + N^1_4\,\beta,$$
$$R^{(2)}_T \ge N^2_1 \underbrace{(\ell_1 - \ell_2)^\top p_2}_{\varepsilon} + N^2_3\,\lambda\,(\ell_1 - \ell_2)^\top p_2 + N^2_4\,\beta \ge \lambda\left(N^2_1 + N^2_3\right)\varepsilon + N^2_4\,\beta. \qquad (6)$$
For the next step, we need the following lemma.
Lemma 10 There exists a (problem dependent) constant $c$ such that the following inequalities hold:
$$N^2_1 - N^1_1 \le cT\varepsilon\sqrt{N^1_4}, \qquad N^2_3 - N^1_3 \le cT\varepsilon\sqrt{N^1_4},$$
$$N^1_2 - N^2_2 \le cT\varepsilon\sqrt{N^2_4}, \qquad N^1_3 - N^2_3 \le cT\varepsilon\sqrt{N^2_4}.$$
Proof (Lemma 10) For any $1 \le t \le T$, let $f_t = (f_1, \ldots, f_t) \in \Sigma^t$ be a feedback sequence up to time step $t$. For $i = 1, 2$, let $p_i$ also denote the probability mass function over feedback sequences of length $T-1$ under opponent strategy $p_i$ and algorithm $A$. We start by upper bounding the difference between the expected counts under the two opponent strategies. For $i \ne j \in \{1, 2\}$ and $k \in \{1, 2, 3\}$,
$$\begin{aligned}
N^i_k - N^j_k &= \sum_{f_{T-1}} \left(p_i(f_{T-1}) - p_j(f_{T-1})\right) \sum_{t=0}^{T-1} \mathbb{I}\left(A(f_t) \in A_k\right) \\
&\le \sum_{f_{T-1}:\ p_i(f_{T-1}) - p_j(f_{T-1}) \ge 0} \left(p_i(f_{T-1}) - p_j(f_{T-1})\right) \sum_{t=0}^{T-1} \mathbb{I}\left(A(f_t) \in A_k\right) \\
&\le T \sum_{f_{T-1}:\ p_i(f_{T-1}) - p_j(f_{T-1}) \ge 0} \left(p_i(f_{T-1}) - p_j(f_{T-1})\right) = \frac{T}{2}\,\|p_1 - p_2\|_1 \\
&\le T\sqrt{\mathrm{KL}(p_1\|p_2)/2}, \qquad (7)
\end{aligned}$$
where $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence and $\|\cdot\|_1$ is the $L_1$-norm. The last inequality follows from Pinsker's inequality (Cover and Thomas, 2006). To upper bound $\mathrm{KL}(p_1\|p_2)$ we use the chain rule for KL divergence. We further overload $p_i$ so that $p_i(f_{t-1})$ denotes the probability of feedback sequence $f_{t-1}$ under opponent strategy $p_i$ and algorithm $A$, and $p_i(f_t \mid f_{t-1})$ denotes the conditional probability of feedback $f_t \in \Sigma$ given that the past feedback sequence was $f_{t-1}$, again under $p_i$ and $A$. With this notation we have
$$\begin{aligned}
\mathrm{KL}(p_1\|p_2) &= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1(f_{t-1}) \sum_{f_t} p_1(f_t \mid f_{t-1}) \log\frac{p_1(f_t \mid f_{t-1})}{p_2(f_t \mid f_{t-1})} \\
&= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1(f_{t-1}) \sum_{i=1}^{4} \mathbb{I}\left(A(f_{t-1}) \in A_i\right) \sum_{f_t} p_1(f_t \mid f_{t-1}) \log\frac{p_1(f_t \mid f_{t-1})}{p_2(f_t \mid f_{t-1})}. \qquad (8)
\end{aligned}$$
Let $a_{f_t}$ denote the row of $S$ that corresponds to the feedback symbol $f_t$.⁷ Assume $k = A(f_{t-1})$. If the feedback set of action $k$ does not contain $f_t$, then trivially $p_i(f_t \mid f_{t-1}) = 0$ for $i = 1, 2$. Otherwise $p_i(f_t \mid f_{t-1}) = a_{f_t}^\top p_i$. Since $p_1 - p_2 = 2\varepsilon v$ and $v \in \operatorname{Ker}(S)$, we have $a_{f_t}^\top v = 0$ and thus, if the choice of the algorithm is in $A_1$, $A_2$ or $A_3$, then $p_1(f_t \mid f_{t-1}) = p_2(f_t \mid f_{t-1})$. It follows that the inequality chain can be continued from (8) by writing
$$\begin{aligned}
\mathrm{KL}(p_1\|p_2) &= \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1(f_{t-1})\,\mathbb{I}\left(A(f_{t-1}) \in A_4\right) \sum_{f_t} p_1(f_t \mid f_{t-1}) \log\frac{p_1(f_t \mid f_{t-1})}{p_2(f_t \mid f_{t-1})} \\
&\le c_1\varepsilon^2 \sum_{t=1}^{T-1} \sum_{f_{t-1}} p_1(f_{t-1})\,\mathbb{I}\left(A(f_{t-1}) \in A_4\right) \qquad (9) \\
&= c_1\varepsilon^2\,N^1_4.
\end{aligned}$$
7. Recall that we assumed that different actions have different feedback symbols, and thus a row of $S$ corresponding to a symbol is unique.
In (9) we used Lemma 12 (see Appendix) to upper bound the KL divergence of $p_1(\cdot \mid f_{t-1})$ and $p_2(\cdot \mid f_{t-1})$. Swapping $p_1$ and $p_2$ in (7) we get the analogous bound with $N^2_4$. Combining these with the bound in (7), we obtain all the desired inequalities.
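Pinsker's inequality, as used in (7), can be sanity-checked numerically (an illustration only, on random distributions over a finite set):

```python
import numpy as np

# Pinsker: (1/2) ||p - q||_1 <= sqrt(KL(p||q) / 2).
rng = np.random.default_rng(3)

def kl(p, q):
    # KL divergence in nats; assumes q > 0 wherever p > 0
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    tv = 0.5 * np.abs(p - q).sum()
    assert tv <= np.sqrt(kl(p, q) / 2) + 1e-12
print("Pinsker's inequality holds on all sampled pairs")
```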
Now we can continue lower bounding the expected regret. Let $r = \operatorname{argmin}_{i \in \{1,2\}} N^i_4$. It is easy to see that for $i = 1, 2$ and $j = 1, 2, 3$,
$$N^i_j \ge N^r_j - c_2 T\varepsilon\sqrt{N^r_4}.$$
If $i \ne r$ then this inequality is one of the inequalities from Lemma 10. If $i = r$ then it is trivial, as we merely subtract a nonnegative value. From (6) we have
$$R^{(i)}_T \ge \lambda\left(N^i_{3-i} + N^i_3\right)\varepsilon + N^i_4\beta \ge \lambda\left(N^r_{3-i} - c_2T\varepsilon\sqrt{N^r_4} + N^r_3 - c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta = \lambda\left(N^r_{3-i} + N^r_3 - 2c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta.$$
Now assume that, at the beginning of the game, the opponent randomly chooses between strategies $p_1$ and $p_2$ with equal probability. Then the expected regret of the algorithm is lower bounded by
$$\begin{aligned}
R_T = \frac{1}{2}\left(R^{(1)}_T + R^{(2)}_T\right) &\ge \frac{1}{2}\lambda\left(N^r_1 + N^r_2 + 2N^r_3 - 4c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta \\
&\ge \frac{1}{2}\lambda\left(N^r_1 + N^r_2 + N^r_3 - 4c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta \\
&= \frac{1}{2}\lambda\left(T - N^r_4 - 4c_2T\varepsilon\sqrt{N^r_4}\right)\varepsilon + N^r_4\beta,
\end{aligned}$$
where in the last step we used that $N^r_1 + N^r_2 + N^r_3 + N^r_4 = T$.
Choosing $\varepsilon = c_3 T^{-1/3}$ we get
$$\begin{aligned}
R_T &\ge \frac{1}{2}\lambda c_3 T^{2/3} - \frac{1}{2}\lambda c_3 N^r_4 T^{-1/3} - 2\lambda c_2 c_3^2 T^{1/3}\sqrt{N^r_4} + N^r_4\beta \\
&\ge T^{2/3}\left(\left(\beta - \frac{1}{2}\lambda c_3\right)x^2 - 2\lambda c_2 c_3^2\,x + \frac{1}{2}\lambda c_3\right),
\end{aligned}$$
where $x = \sqrt{N^r_4/T^{2/3}}$. Now we see that $c_3 > 0$ can be chosen small enough, independently of $T$, so that, for any choice of $x$, the quadratic expression in the parentheses is bounded away from zero, and simultaneously, $\varepsilon$ is small enough so that the threshold condition in Lemma 12 is satisfied, completing the proof of Theorem 9.
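The final step can be sanity-checked numerically: for illustrative values of the problem dependent constants $\lambda$, $\beta$, $c_2$ (made up here), a small enough $c_3$ keeps the quadratic in $x$ positive for every $x$:

```python
# q(x) = (beta - lam*c3/2) x^2 - 2*lam*c2*c3^2 x + lam*c3/2 stays positive
# when the leading coefficient is positive and the discriminant is negative.
lam, beta, c2 = 0.5, 0.3, 1.0   # illustrative problem dependent constants
c3 = 0.05                       # chosen small, independently of T

a = beta - lam * c3 / 2
b = -2 * lam * c2 * c3 ** 2
c = lam * c3 / 2
disc = b ** 2 - 4 * a * c
print(a > 0, disc < 0)          # both True => q(x) > 0 for all x
```

Since the constant term $\tfrac{1}{2}\lambda c_3$ shrinks only linearly in $c_3$ while the cross term shrinks quadratically, the discriminant is negative for small $c_3$, which is exactly the "bounded away from zero" claim.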
7 . Discussion
In this paper we classified all finite partial-monitoring games under stochastic environments, based on their minimax regret. We conjecture that our results extend to non-stochastic environments. This is the major open question that remains to be answered.
One question which we did not discuss so far is the computational efficiency of our algorithm. The issue is twofold. The first computational question is how to efficiently decide which of the four classes a given game $(L, H)$ belongs to. The second question is the computational efficiency of Balaton for a fixed easy game. Fortunately, in both cases an efficient implementation is possible, i.e., in polynomial time, by using a linear program solver (e.g., the ellipsoid method (Papadimitriou and Steiglitz, 1998)).
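As an illustration of the first question, deciding whether an action is strictly optimal for some outcome distribution reduces to a small LP. The sketch below uses `scipy.optimize.linprog` on a made-up loss matrix; it is a simplified feasibility check, not the paper's full classification procedure:

```python
import numpy as np
from scipy.optimize import linprog

# Action i is strictly optimal for some p in the simplex iff the optimum of
#   max t  s.t.  (l_j - l_i)^T p >= t for all j != i,  p in the simplex
# is positive. Toy loss matrix: action 3 is never optimal.
L = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.6, 0.6]])
N, M = L.shape

def best_slack(i):
    # variables x = (p_1, ..., p_M, t); linprog minimizes, so minimize -t
    c = np.r_[np.zeros(M), -1.0]
    A_ub = np.c_[L[i] - np.delete(L, i, axis=0), np.ones(N - 1)]  # (l_i-l_j)^T p + t <= 0
    b_ub = np.zeros(N - 1)
    A_eq = np.r_[np.ones(M), 0.0].reshape(1, -1)                  # sum_m p_m = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * M + [(None, None)])
    return -res.fun

print([round(best_slack(i), 3) for i in range(N)])
```

A positive slack certifies a cell with nonempty interior; a negative one certifies that the action is dominated everywhere on the simplex.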
Another interesting open question is to investigate the dependence of the regret on quantities other than $T$, such as the number of actions, the number of outcomes, and more generally the structure of the loss and feedback matrices.
Finally, let us note that our results can be extended to a more general framework, similar to that of Pallavi et al. (2011), in which a game with $N$ actions and $M$-dimensional outcome space is defined as a tuple $G = (L, S_1, \ldots, S_N)$. The loss matrix is $L \in \mathbb{R}^{N \times M}$ as before, but the outcome and the feedback are defined differently. The outcome $y$ is an arbitrary vector from a bounded subset of $\mathbb{R}^M$, and the feedback received by the learner upon choosing action $i$ is $O_i = S_i y$.
References
Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 263–273, 2008.

Alekh Agarwal, Peter Bartlett, and Max Dama. Optimal allocation strategies for the dark pool problem. In 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Chia Laguna Resort, Sardinia, Italy, 2010.

András Antos, Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games, 2011. http://arxiv.org/abs/1102.2041.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (ALT 2010), pages 224–238. Springer, 2010.

Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, June 2005.

Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, New York, second edition, 2006.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2005), page 394. Society for Industrial and Applied Mathematics, 2005.

Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2003), pages 594–605. IEEE, 2003.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

V. Mnih. Efficient stopping rules. Master's thesis, Department of Computing Science, University of Alberta, 2008.

V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 672–679. ACM, 2008.

A. Pallavi, R. Zheng, and Cs. Szepesvári. Sequential learning for optimal monitoring of multi-channel wireless networks. In INFOCOM, 2011.

Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Courier Dover Publications, New York, 1998.

Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In Proceedings of the 14th Annual Conference on Computational Learning Theory (COLT 2001), pages 208–223. Springer-Verlag, 2001.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.
Appendix
Proof (Lemma 8)

1. In an elimination step, we eliminate every action whose cell is contained in a closed half space. Let us assume that $j \in A_i$ is being eliminated. According to the definition of $A_i$, $C_i \subseteq C_j$, and thus $C_i$ is also contained in the half space, so action $i$ is eliminated as well.
2. First let us assume that $p^*$ is not in the affine subspace spanned by $C_i$. Let $p'$ be an arbitrary point in the relative interior of $C_i$. We define the point $p'' = p' + \varepsilon(p' - p^*)$. For a small enough $\varepsilon > 0$, $p'' \in C_{k_i}$ for some $k_i \in A_i$, and at the same time, $p'' \notin C_i$. Thus we have
$$\begin{aligned}
\ell_{k_i}^\top\left(p' + \varepsilon(p' - p^*)\right) &\le \ell_i^\top\left(p' + \varepsilon(p' - p^*)\right) \\
(1+\varepsilon)\,\ell_{k_i}^\top p' - \varepsilon\,\ell_{k_i}^\top p^* &\le (1+\varepsilon)\,\ell_i^\top p' - \varepsilon\,\ell_i^\top p^* \\
-\varepsilon\,\ell_{k_i}^\top p^* &\le -\varepsilon\,\ell_i^\top p^* \\
\ell_{k_i}^\top p^* &\ge \ell_i^\top p^* \\
\alpha_{k_i,j^*} &\ge \alpha_{i,j^*},
\end{aligned}$$
where we used that $\ell_{k_i}^\top p' = \ell_i^\top p'$. For the case when $p^*$ lies in the affine subspace spanned by $C_i$, we take a hyperplane that contains this affine subspace. Then we take an infinite sequence $(p_n)_n$ such that every element of the sequence lies on the same side of the hyperplane, $p_n \ne p^*$, and the sequence converges to $p^*$. The statement is true for every element $p_n$ and, since the value $\alpha_{r,s}$ is continuous in $p^*$, the limit has the desired property as well.
The following lemma concerns the problem of producing an estimate of the unknown common mean of a stochastic process with a given relative error bound, with high probability, in a sample-efficient manner. The procedure is a simple variation of the one proposed by Mnih et al. (2008). The main difference is that here we deal with martingale difference sequences shifted by an unknown constant, which becomes the common mean, whereas Mnih et al. (2008) considered an i.i.d. sequence. On the other hand, we consider the case when we have a known upper bound on the predictable variance of the process, whereas one of the main contributions of Mnih et al. (2008) was the lifting of this assumption. The proof of the lemma is omitted, as it follows the same lines as the proofs of Mnih et al. (2008) (the details of which can be found in the thesis of Mnih (2008)), the only difference being that here we need to use Bernstein's inequality for martingales in place of the empirical Bernstein inequality used by Mnih et al. (2008).
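For intuition, here is a minimal sketch of such a stopping rule in the spirit of Mnih et al. (2008). The Bernstein-style radius `c` and all constants are illustrative simplifications, not the exact procedure of Lemma 11:

```python
import numpy as np

def stop_and_estimate(sample, sigma2, R, eps=0.1, delta=0.01, max_n=200000):
    # Keep sampling until the confidence interval around |running mean| is
    # small relative to the mean itself, then return a midpoint-style estimate.
    total, n = 0.0, 0
    lb, ub = 0.0, float("inf")
    mean = 0.0
    while n < max_n:
        total += sample()
        n += 1
        mean = total / n
        # Bernstein-type deviation radius from known variance/range bounds
        c = (2 * sigma2 * np.log(3 / delta) / n) ** 0.5 + 3 * R * np.log(3 / delta) / n
        lb = max(lb, abs(mean) - c)
        ub = min(ub, abs(mean) + c)
        if (1 + eps) * lb >= (1 - eps) * ub:   # stop: relative width small enough
            break
    return np.sign(mean) * ((1 + eps) * lb + (1 - eps) * ub) / 2, n

rng = np.random.default_rng(4)
mu = 0.3
est, n = stop_and_estimate(lambda: mu + rng.uniform(-0.5, 0.5), sigma2=1.0 / 12, R=1.0)
print(est, n)
```

On this toy i.i.d. stream the rule stops after a few thousand samples with an estimate within roughly $\varepsilon|\mu|$ of the true mean; the lemma below states the analogous guarantee for shifted martingale difference sequences.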
Lemma 11 Let $(\mathcal{F}_t)$ be a filtration on some probability space, and let $(X_t)$ be an $\mathcal{F}_t$-adapted sequence of random variables. Assume that $(X_t)$ is such that, almost surely, the range of each random variable $X_t$ is bounded by $R > 0$, $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = \mu$, and $\mathrm{Var}[X_t \mid \mathcal{F}_{t-1}] \le \sigma^2$ a.s., where $R$, $\mu \ne 0$ and $\sigma^2$ are non-random constants. Let $p > 1$, $\varepsilon > 0$, $0 < \delta < 1$, let $\overline{X}_t$ denote the running average $\frac{1}{t}\sum_{s=1}^{t} X_s$, and let
$$L_n = (1+\varepsilon)\max_{1 \le t \le n}\left(|\overline{X}_t| - c_t\right), \qquad U_n = (1-\varepsilon)\min_{1 \le t \le n}\left(|\overline{X}_t| + c_t\right),$$
where $c_t = c(\sigma, R, t, \delta)$, and $c(\cdot)$ is defined in (1). Define the estimate $\hat\mu_n$ of $\mu$ as follows:
$$\hat\mu_n = \mathrm{sgn}(\overline{X}_n)\,\frac{L_n + U_n}{2}.$$
Denote the stopping time $\tau = \min\{n : L_n \ge U_n\}$. Then, with probability at least $1 - \delta$,