Playing against Hedge

Hedge has been proposed as an adaptive scheme, which guides the player’s hand in a multi-armed bandit full information game. Applications of this game exist in network path selection, load distribution, and network interdiction. We perform a worst case analysis of the Hedge algorithm by using an adversary, who will consistently select penalties so as to maximize the player’s loss, assuming that the adversary’s penalty budget is limited. We further explore the performance of binary penalties, and we prove that the optimum binary strategy for the adversary is to make greedy decisions.


Introduction
The problems of adaptive network path selection and load distribution have often been considered as games that are played simultaneously and independently by agents controlling flows in a network. A possible abstraction of these and other related problems is the bandit game. In the multi-armed bandit game [1] a player chooses one out of $N$ strategies (or "machines" or "options" or "arms"). A loss or penalty $\ell_i$ (or a reward, which can be modeled as a negative loss) is assigned to each strategy $i$ ($i = 1, \dots, N$) after each round of the game. An agent facing repeated selections will possibly try to exploit the experience accumulated so far. A popular algorithm that can guide the agent in each selection round is the multiplicative updates algorithm or Hedge. In this paper we calculate the worst possible performance of Hedge by using the adversarial technique, i.e. we investigate the behavior of an intelligent adversary, who tries to maximize the player's cumulative loss. In Section 1 we describe Hedge; in Section 2 we give a rigorous formulation of the adversary's problem; in Section 3 we give a recursive solution; and in Section 4 we present sample numerical results. Finally, in Section 5 we explore binary adversarial strategies. Our main result is that the greedy adversarial strategy is optimal among binary strategies.

The Bandit Game
In a generalized bandit game the player is allowed to play mixed strategies, i.e. to assign a fraction $p_i$ (such that $\sum_{i=1}^{N} p_i = 1$) of the total bet to option $i$, thereby getting a loss equal to $L = \sum_{i=1}^{N} p_i \ell_i$. Alternatively, $p_i$ can be interpreted as the probability that the player assigns the bet to option $i$. In the "bandit" version only the total loss $L$ is announced to the player, while in the "full information" version the penalty vector $(\ell_1, \ell_2, \dots, \ell_N)$ is announced.
A game consists of $T$ rounds; a superscript $t$ marks the $t$th round ($t = 0, 1, \dots, T-1$). The player will try to minimize the total cumulative loss $\sum_{t=0}^{T-1} \sum_{i=1}^{N} p_i^t \ell_i^t$ by controlling the bet distribution, i.e. by properly selecting the variables $p_i^t$. We use the additional assumption that the loss budget is limited in each round by setting the constraint $\sum_{i=1}^{N} \ell_i^t = 1$. An extremely lucky player, or a player with "inside information", would select the minimum penalty option in each round and would put the whole bet on this option, thereby achieving a total loss equal to $\sum_{t=0}^{T-1} \min_i \ell_i^t$.

The Hedge Algorithm
Quite a few algorithmic solutions, which will guide the player's hand in the full information game, have appeared in the literature. Freund and Schapire have proposed the Hedge algorithm [2] for the full information game. Auer, Cesa-Bianchi, Freund and Schapire have proposed the Exp3 algorithm in [3]. Allenberg-Neeman and Neeman proposed a Hedge variant, the GL (Gain-Loss) algorithm, for the full information game with gains and losses [4]. Dani, Hayes, and Kakade have proposed the GeometricHedge algorithm in [5], and a modification was proposed by Bartlett, Dani et al. in [6]. Recently Cesa-Bianchi and Lugosi have proposed the ComBand algorithm for the bandit version [7]. A comparison can be found in [8].
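Hedge maintains a vector $w^t = (w_1^t, w_2^t, \dots, w_N^t)$ of weights, such that $w_i^t \ge 0$ ($i = 1, \dots, N$). In each round $t$ Hedge chooses the bet allocation according to the normalized weight $p_i^t = w_i^t / \sum_{j=1}^{N} w_j^t$. When the opponent reveals the loss vector of this round, the next round weight $w_i^{t+1}$ is determined so as to reflect the loss results, i.e. $w_i^{t+1} = w_i^t \beta^{\ell_i^t}$, where $\beta \in [0, 1]$ is the Hedge parameter.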
In [9] Auer, Cesa-Bianchi, Freund and Schapire have proved that the expected Hedge performance and the expected performance of the best arm differ at most by $O(\sqrt{TN \ln N})$. Freund and Schapire [2] have given a loss upper bound, which relates the total cumulative loss with the total loss of the best arm.
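The allocation and update steps described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard multiplicative rule $w_i^{t+1} = w_i^t \beta^{\ell_i^t}$ of Freund and Schapire; the function names `hedge_allocation`, `hedge_update` and `play_hedge` are hypothetical and do not come from the paper.

```python
def hedge_allocation(weights):
    """Return the normalized bet allocation p_i = w_i / sum_j w_j."""
    total = sum(weights)
    return [w / total for w in weights]


def hedge_update(weights, penalties, beta):
    """Multiplicative update w_i <- w_i * beta**penalty_i (assumed rule)."""
    return [w * beta ** l for w, l in zip(weights, penalties)]


def play_hedge(initial_weights, penalty_rounds, beta):
    """Play Hedge against a fixed penalty sequence; return the cumulative loss."""
    weights = list(initial_weights)
    total_loss = 0.0
    for penalties in penalty_rounds:
        p = hedge_allocation(weights)
        # Loss of this round: sum_i p_i * penalty_i.
        total_loss += sum(pi * li for pi, li in zip(p, penalties))
        weights = hedge_update(weights, penalties, beta)
    return total_loss


if __name__ == "__main__":
    # Three options, three rounds, each penalty vector sums to one.
    rounds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
    print(play_hedge([1.0, 1.0, 1.0], rounds, beta=0.5))
```

Values of `beta` close to 1 make the weights react slowly to penalties, while values close to 0 shift the bet away from a penalized option almost immediately.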

Competitive Analysis
The competitive analysis of an algorithm $\mathcal{A}$, which in this paper is Hedge, involves a comparison of $\mathcal{A}$'s performance with the performance of the optimal offline algorithm. In the bandit game the optimal offline algorithm, i.e. the optimal player's decisions given the sequence of all penalties in advance, is trivial. In a given round the player can just bet everything on the option with the lowest penalty.
According to S. Irani and A. Karlin (in Section 13.3.1 of [10]) a technique in finding bounds is to use an "adversary" who plays against $\mathcal{A}$ and concocts an input, which forces $\mathcal{A}$ to incur a high cost. Using an adversary is just an illustrative way of saying that we try to find the worst possible performance of an online algorithm. In our analysis the adversary tries to maximize Hedge's total loss by controlling the penalty vector (under a limited budget).

Interpretations and Applications
In this section we offer some interpretations from the areas of 1) communication networks and 2) transportation. The general setting of course involves a number of options or arms, which must be selected by a player without any knowledge of the future.
Bandit models have been used in quite diverse decision making situations. In [11] He, Chen, Wand and Liu have used a bandit model for the maximization of the revenue of a search engine provider, who charges for advertisements on a per-click basis. They have subsequently defined the "armed bandit problem with shared information"; arms are partitioned in groups and loss information is shared only among players using arms of the same group. In [12] Park and Lee have used a multi-armed bandit model for lane selection in automated highways and autonomous vehicles traffic control.

Traffic Load Distribution
This first application example can take multiple interpretations, which always involve a selection in a competitive environment, in which competition is limited. It can be seen as 1) a path selection problem in networking, 2) a transport means (mode) choice or path selection problem, or 3) a computational load distribution problem, which we mention at the end of this section. Firstly, we describe the problem in the context of networking.
Consider $N$ similar independent paths (in the simplest case just $N$ parallel links), which join a source node to a destination node. A traffic volume equal to $Q$ is sent from the source to the destination in consecutive time periods or rounds by a population of agents. $Q$ is the same in each round, but the allocation of $Q$ to paths, i.e. $(Q_1^t, Q_2^t, \dots, Q_N^t)$, is different in each round $t$. An agent $A$ produces a constant amount of traffic equal to $q$, such that $q \ll Q$, in $T$ consecutive rounds, and allocates a part equal to $q_i^t$ ($\sum_{i=1}^{N} q_i^t = q$) to the $i$th path in round $t$. The average delay (or cost) experienced by $A$'s traffic in the $t$th round is proportional to $\frac{1}{q} \sum_{i=1}^{N} q_i^t Q_i^t$, if we assume a linear delay (or cost) model. Linear models are used for simplicity in network analysis [13] and can be realistic if a network resource still operates in the linear region of the delay vs. load curve, e.g. when delay is calculated in a link, which operates not very close to capacity. Agent $A$ aims at minimizing the total delay for its own traffic and may use Hedge to determine the quantities $q_i^t$ in round $t$, assuming that $A$ knows the performance of its own traffic in each path in the past time period. Note that the maximum delay in a round occurs if $A$ puts the whole $q$ in a single path together with the whole traffic of the competition, i.e. with $Q$; then $A$'s average delay in this round equals $Q$. On the contrary, if $Q$ is evenly distributed in all paths, $A$'s allocation decision does not really matter, as the average will be equal to $\frac{1}{q} \sum_{i=1}^{N} q_i^t (Q/N) = Q/N$. Of course the minimum delay in a round will occur if $A$ puts the whole $q$ in an empty path, thereby achieving a zero delay.
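The three cases mentioned in the paragraph above can be checked with a small numerical example. The sketch assumes the linear model in which the delay on path $i$ is proportional to the competing load $Q_i^t$ (with proportionality constant 1); the function name `average_delay` and the sample numbers are illustrative only.

```python
def average_delay(own_allocation, competing_load):
    """Average delay of the agent's traffic in one round.

    own_allocation[i] is the part q_i of the agent's traffic sent on path i,
    competing_load[i] is the competing traffic Q_i found on that path.
    Assumed linear model: delay on path i equals Q_i.
    """
    q = sum(own_allocation)
    return sum(qi * Qi for qi, Qi in zip(own_allocation, competing_load)) / q


if __name__ == "__main__":
    Q, N = 12.0, 3
    # Worst case: everything on the path that carries all competing traffic.
    print(average_delay([2.0, 0.0, 0.0], [Q, 0.0, 0.0]))
    # Evenly spread competition: the allocation does not matter.
    print(average_delay([1.0, 0.5, 0.5], [Q / N] * N))
    # Best case: everything on an empty path.
    print(average_delay([0.0, 2.0, 0.0], [Q, 0.0, 0.0]))
```

In the terminology of the bandit game, the normalized competing loads play the role of the penalties and the fractions $q_i^t / q$ play the role of the bet allocation.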
The above problem can also be formulated as a more general problem of distributing workload over a collection of parallel resources (e.g. distributing jobs to parallel processors). A. Blum and C. Burch have used the following motivating scenario in [14]: A process runs on some machine in an environment with $N$ machines in total. The process may move to a different machine at the end of a time interval. The load $\ell_i^t$, which will be found on a machine $i$ at time round $t$, is the penalty felt by the process.

Interdiction
Although an adversary is usually a "technical" (fictional) concept, which serves the worst case analysis of online algorithms, in some environments a real adversary, who intentionally tries to oppose a player, does exist. An example is the interdiction problem.
We present a version of the interdiction problem in a network security context. An attacker attacks $N$ resources (e.g. launches a distributed denial of service attack on nodes, servers, etc., see [15]) by sending streams of harmful packets to resource $i$ at a rate $w_i$ (where $i = 1, \dots, N$ and $\sum_i w_i$ is constant). A defender assigns a defense mechanism of intensity $\ell_i$ (e.g. a filter that is able to detect and avoid harmful packets with a probability proportional to $\ell_i$) to resource $i$. At the end of a time interval $T$, e.g. one day, both the attacker and the defender revise the flows and the distribution of defense mechanisms to resources respectively, based on past performance.
Similar interpretations exist in transportation network environments, as in border and custom control, including illegal immigration control. An interdiction problem formulation can be used in a maritime transport security context: pirates attack the vessels traversing a maritime route. In [16] Vanek et al. assign the role of the player to the pirate. The pirate operates in rounds, starting and finishing in his home port. In each round he selects a sea area (arm) to sail to and search for possible victim vessels. A patrol force distributes the available escort resources to sea areas (arms), and pirate gains are inversely proportional to the strength of the defender's forces in this area. Naval forces reallocate their own resources to sea areas.

Problem Formulation
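In this paper we aim at finding the worst case performance of Hedge. Effectively, we try to solve the following problem:
Problem 1. Given a number of options $N$, an initial normalized weight vector $(w_1, \dots, w_N)$, and a Hedge parameter $\beta$, find the sequence $\ell^0, \ell^1, \dots, \ell^{T-1}$ that maximizes the player's total cumulative loss $\sum_{t=0}^{T-1} \sum_{i=1}^{N} p_i^t \ell_i^t$, where $\ell^t = (\ell_1^t, \dots, \ell_N^t)$ is the penalty vector in round $t$ ($\sum_{i=1}^{N} \ell_i^t = 1$) and the $t$th round weights $p_i^t$ are updated according to the Hedge rule $p_i^{t+1} = p_i^t \beta^{\ell_i^t} / \sum_{j=1}^{N} p_j^t \beta^{\ell_j^t}$, with $p_i^0 = w_i$.
The total loss is thus determined by the initial weights, the penalties of all rounds, and $\beta$. Due to the normalization of both weights and penalties there are $(N-1)(T+1) + 1$ independent variables in total. In the following we use $L_H(w_1, \dots, w_N; \ell^0, \dots, \ell^{T-1}; \beta)$ whenever it is necessary to refer to these variables.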

Recursion
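Assuming that a given round starts with weights $w = (w_1, \dots, w_N)$ and the adversary generates penalties $\ell = (\ell_1, \dots, \ell_N)$, the next round will start with weights $W(w, \ell)$, whose $i$th component is $w_i \beta^{\ell_i} / \sum_{j=1}^{N} w_j \beta^{\ell_j}$. Then, the total loss of a $T$ round game, which starts with weights $w$, can be written as the sum of the losses of a single round game, which starts with weights $w$, and a $T-1$ round game, which starts with weights $W(w, \ell^0)$:
$L_{T-1}(w; \ell^0, \dots, \ell^{T-1}; \beta) = L_0(w; \ell^0; \beta) + L_{T-2}(W(w, \ell^0); \ell^1, \dots, \ell^{T-1}; \beta)$,
where the subscript counts the rounds that follow the initial one, so that $L_0$ is the loss of a single round game.
Note that the term $L_{T-2}$, which expresses the contribution of the last $T-1$ rounds, depends only on the updated weights provided by the initial round. Such a Markovian property can be generalized in the following sense: A $T_1 + T_2$ round game can be seen as consisting of a $T_1$ round game $g_1$ followed by a $T_2$ round game $g_2$, whose initial weights are the final weights of $g_1$, and no more details about $g_1$ are passed to $g_2$. Assuming that the solution to Problem 1 is $L_{T-1}(w) = \max_{\ell^0, \dots, \ell^{T-1}} L_{T-1}(w; \ell^0, \dots, \ell^{T-1}; \beta)$, we obtain the following recursive formula for $L_{T-1}(w)$:
$L_{T-1}(w) = \max_{\ell} \left[ L_0(w; \ell) + L_{T-2}(W(w, \ell)) \right]$, (6)
where $\ell = \ell^0$ is the penalty vector chosen by the adversary in the initial round and $W(w, \ell)$ denotes the updated (normalized) weights. The optimal penalties can also be computed recursively. Let $\lambda^{t;T-1}(w) = (\lambda_1^{t;T-1}(w), \dots, \lambda_N^{t;T-1}(w))$, where $\lambda_i^{t;T-1}(w)$ denotes the optimal penalty of the $i$th option in the $t$th round of a $T$ round game (starting with weights $w$). The optimal penalty of the initial round $\lambda^{0;T-1}(w)$ is equal to the value of $\ell$, which optimizes (6). Therefore
$\lambda^{0;T-1}(w) = \arg\max_{\ell} \left[ L_0(w; \ell) + L_{T-2}(W(w, \ell)) \right]$. (7)
In all other rounds $t = 1, 2, \dots, T-1$ the optimal penalties are such that the total loss of the rest of the game is maximized, i.e. such that
$\lambda^{t;T-1}(w) = \lambda^{t-1;T-2}\left(W(w, \lambda^{0;T-1}(w))\right)$. (8)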

Two Option Games and Numerical Results
In this section we exploit the recursive methodology, which has been presented in the previous section, in order to provide some numerical results for two option games, i.e. $N = 2$. We compare these results with available bounds in the literature. We keep only the independent penalties $\ell_1^t$ in the extended notation and use the more compact version $L_{T-1}(w; \ell^0, \dots, \ell^{T-1})$, where $w$ and $\ell^t$ now denote the weight and the penalty of the first option. As an example, the loss of a single round game is given by
$L_0(w; \ell) = w\ell + (1 - w)(1 - \ell)$. (9)
Also, since the initial weights are $(w, 1 - w)$, the updated weight of the first option is
$W(w, \ell) = \frac{w\beta^{\ell}}{w\beta^{\ell} + (1 - w)\beta^{1 - \ell}}$. (10)
Then (6) is simplified to
$L_T(w) = \max_{0 \le \ell \le 1} \left[ L_0(w; \ell) + L_{T-1}(W(w, \ell)) \right]$, $T \ge 1$, (11)
where $\ell = \ell^0$ is the penalty chosen by the adversary for the first option in the initial round. The iteration starts from $L_0(w)$, the maximum loss of a single round game. In such a game the adversary controls a single penalty variable, as the loss is given by (9). The adversary will choose binary values, i.e. $\ell = 1$ if $w \ge 1/2$ and $\ell = 0$ otherwise, and the maximum total loss is $L_0(w) = \max(w, 1 - w)$. The graph of $L_0(w)$ appears as the lowest V-shaped "curve" in Figure 1. The fact that $L_0(w)$ is a piecewise linear function of $w$ with a breakpoint (i.e. a sudden change in its slope) creates even more breakpoints in $L_1(w)$ and so on. Therefore, while it is possible to use the aforementioned recursion in order to find analytical expressions for the maximum total loss and the associated penalties, the analysis becomes quite complicated even for small values of $T$ (i.e. in a $T + 1$ round game). We omit this tedious analysis and present numerical results based on the recursive methodology given above.
Instead we have implemented a numerical computation based on (11). Initially we create $L_0(w)$ by using (9). We use the result as input to (11) and create $L_1(w)$. Then we use the already calculated $L_0$ and $L_1$ in (11) to calculate $L_2$, then $L_0$ and $L_2$ to calculate $L_3$, and so on. The optimal penalties can be determined by using formulas (7) and (8) for $N = 2$. In Figure 2 we draw one of the curves of Figure 1 together with the respective optimal penalties. The final round optimal penalty (i.e. $\lambda^{3;3}(w)$ in this example) is certain to be binary, since the adversary will assign $\ell_i^3 = 1$ to the option $i$ with the greatest weight factor. However, the penalties of the earlier rounds are not necessarily binary.
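The recursion for two option games can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes the single round loss $w\ell + (1-w)(1-\ell)$, the weight update $w\beta^{\ell} / (w\beta^{\ell} + (1-w)\beta^{1-\ell})$, and it replaces the continuous maximization over $\ell$ by a search over a uniform penalty grid, so the result is a lower bound of the true maximum. All function names and the `grid` parameter are hypothetical.

```python
def single_round_loss(w, penalty):
    """Loss of one round: option 1 has weight w and penalty `penalty`,
    option 2 has weight 1 - w and penalty 1 - penalty."""
    return w * penalty + (1.0 - w) * (1.0 - penalty)


def next_weight(w, penalty, beta):
    """Normalized weight of option 1 after the Hedge update (assumed form)."""
    a = w * beta ** penalty
    b = (1.0 - w) * beta ** (1.0 - penalty)
    return a / (a + b)


def max_total_loss(w, rounds, beta, grid=21):
    """Approximate maximum total loss of a `rounds`-round game.

    The adversary's penalty for option 1 is restricted to `grid` equally
    spaced values in [0, 1]; the cost grows like grid**rounds, so this is
    only meant for a handful of rounds.
    """
    if rounds == 0:
        return 0.0
    best = 0.0
    for k in range(grid):
        penalty = k / (grid - 1)
        loss = single_round_loss(w, penalty) + max_total_loss(
            next_weight(w, penalty, beta), rounds - 1, beta, grid
        )
        best = max(best, loss)
    return best


if __name__ == "__main__":
    for w in (0.5, 0.7, 0.9):
        print(w, max_total_loss(w, rounds=3, beta=0.5))
```

A tabulation of the loss on a grid of weights with interpolation would scale much better with the number of rounds; the plain recursion is used here only because it mirrors the recursive formula directly.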

Binary and Greedy Schemes
The penalty values in the first two rounds in the example of Figure 2 prove that the adversary's optimal penalties are not necessarily binary. However, in this example $\beta$ is "unnaturally" close to 0, as in practical Hedge implementations $\beta$ is chosen close to 1; this choice achieves a more gradual adaptation to losses. Both experimental and analytical evidence show that the optimal penalties tend rapidly to binary values as $\beta$ approaches 1. Effectively, it seems that results very close to optimum can be achieved by a "binary adversary", i.e. an adversary that will resort to binary values only.
On the other hand the optimal adversarial policy with binary penalties can be found exhaustively as $\max_{\ell^0, \dots, \ell^{T-1} \in S} L_H(w; \ell^0, \dots, \ell^{T-1}; \beta)$, where $S$ is the set of the $N$ binary vectors $(\ell_1, \dots, \ell_N)$ with $\ell_i \in \{0, 1\}$ and $\sum_{i=1}^{N} \ell_i = 1$, i.e. only one component equals 1.
Apparently, the complexity of this calculation grows with $N^T$. However, in the following we show that the optimal binary adversary is in fact the "greedy adversary". The latter achieves binary optimality in linear time.
A "greedy adversary" is eager to punish the maximum weight option as much as possible in each round. Thus the adversary will assign exactly one unit of penalty to the maximum current weight option, and zero penalties to all other options. Given a sufficient number of rounds (say $t_0$), it is easy to see that the weights of an $N$ option game are "equalized" so that any two weights $p_i^t$, $p_j^t$ are such that $\beta p_i^t < p_j^t$ for $t \ge t_0$. When equalization is achieved, a periodic phenomenon starts and the greedy penalties form a rotation scheme.

Greedy Behavior
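We explore the greedy pattern in a two option game that can be generalized to $N$ options. Assuming initial weights $w_1$, $w_2$ ($w_1 > w_2$), a greedy adversary will choose $\ell_1^t = 1$, $\ell_2^t = 0$ for $t = 0, 1, \dots, t_0 - 1$, where $t_0 \ge 1$ (having assumed $w_1 > w_2$). At $t_0$ the weight of the second option becomes for the first time greater than the weight of the first option, and a loss equal to 1 is assigned to the second option. At $t_0$ the weights (before normalization) are $w_1\beta^{t_0}$ and $w_2$; in the next step $t_0 + 1$ they are $w_1\beta^{t_0}$ and $w_2\beta$, and the first option is penalized again. In step $t_0 + 2$ the weights are $w_1\beta^{t_0 + 1}$ and $w_2\beta$, which after normalization coincide with the weights at $t_0$, and in general they oscillate between these two pairs periodically. Therefore the total loss for $t \ge t_0$ in a pair of subsequent rounds is equal to
$\frac{w_2}{w_1\beta^{t_0} + w_2} + \frac{w_1\beta^{t_0}}{w_1\beta^{t_0} + w_2\beta}$.
The value of $t_0$ is determined by the initially assumed inequality, i.e. $w_1\beta^{t_0} < w_2 \le w_1\beta^{t_0 - 1}$, and since $t_0$ ought to be integer,
$t_0 = \left\lfloor \frac{\ln(w_2 / w_1)}{\ln \beta} \right\rfloor + 1$.
The loss in the first $t_0$ steps is $\sum_{t=0}^{t_0 - 1} \frac{w_1\beta^{t}}{w_1\beta^{t} + w_2}$. Therefore, for an even positive integer $T - t_0$ the total loss in $T$ steps is
$\sum_{t=0}^{t_0 - 1} \frac{w_1\beta^{t}}{w_1\beta^{t} + w_2} + \frac{T - t_0}{2} \left( \frac{w_2}{w_1\beta^{t_0} + w_2} + \frac{w_1\beta^{t_0}}{w_1\beta^{t_0} + w_2\beta} \right)$.
In a game with more than two options it is straightforward to show that in the "steady" (periodic) state weights tend to become equal, i.e. almost equal to $1/N$, where $N$ is the number of options. Consequently, the loss is approximately equal to $T/N$ in a $T$ round game.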
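The two option greedy pattern can be checked numerically. The following sketch simulates the greedy adversary and compares the simulated loss with a closed form built from a transient part (the first $t_0$ rounds, all penalizing option 1) and a periodic part (pairs of rounds after $t_0$). The closed form used here, $t_0 = \lfloor \ln(w_2/w_1) / \ln\beta \rfloor + 1$ and the pair loss $w_2/(w_1\beta^{t_0} + w_2) + w_1\beta^{t_0}/(w_1\beta^{t_0} + w_2\beta)$, is derived from the Hedge update under the assumption $w_1 > w_2$; the function names are hypothetical.

```python
import math


def switch_round(w1, w2, beta):
    """Smallest integer t0 with w1 * beta**t0 < w2 (requires w1 > w2)."""
    return math.floor(math.log(w2 / w1) / math.log(beta)) + 1


def greedy_two_option_loss(w1, w2, beta, rounds):
    """Simulate the greedy adversary; return (total loss, per-round losses)."""
    a, b = w1, w2  # unnormalized weights of options 1 and 2
    losses = []
    for _ in range(rounds):
        if a >= b:  # option 1 has the larger weight: penalize it
            losses.append(a / (a + b))
            a *= beta
        else:       # option 2 has the larger weight: penalize it
            losses.append(b / (a + b))
            b *= beta
    return sum(losses), losses


def closed_form_loss(w1, w2, beta, rounds):
    """Transient loss plus (rounds - t0) / 2 periodic pairs."""
    t0 = switch_round(w1, w2, beta)
    if rounds < t0 or (rounds - t0) % 2:
        raise ValueError("rounds - t0 must be a non-negative even integer")
    transient = sum(w1 * beta**t / (w1 * beta**t + w2) for t in range(t0))
    a = w1 * beta**t0
    pair = w2 / (a + w2) + a / (a + w2 * beta)
    return transient + (rounds - t0) // 2 * pair


if __name__ == "__main__":
    total, _ = greedy_two_option_loss(0.7, 0.3, 0.5, 6)
    print(total, closed_form_loss(0.7, 0.3, 0.5, 6))
```

With $w_1 = 0.7$, $w_2 = 0.3$ and $\beta = 0.5$ the first two rounds penalize option 1, and from the third round on the penalties alternate between the two options.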

Optimality of the Greedy Behavior
The following proposition provides a simple polynomial solution to the problem of finding the optimal binary adversary.
Proposition 1. The greedy strategy is optimal for the adversary among all strategies with binary penalties. □
Proof: Due to normalization of weights and penalties, in the proof we mention only option 1 weights and penalties. Assuming an initial weight $\omega$ and penalties $\ell^0, \ell^1, \dots, \ell^{n-1} \in \{0, 1\}$ in the first $n$ rounds, the weight, which emerges before the $(n + 1)$th round, is $\frac{\omega\beta^{x}}{\omega\beta^{x} + 1 - \omega}$, where $x = \sum_{t=0}^{n-1} (2\ell^t - 1)$ is the number of unit penalties assigned so far to option 1 minus the number assigned to option 2. Effectively, two options are available to the adversary in each step, either i) to assign a penalty equal to 1, which will produce an incremental loss equal to $f(x) = \frac{\omega\beta^{x}}{\omega\beta^{x} + 1 - \omega}$, and will update the weight to $f(x + 1)$, or ii) to assign a zero penalty, which will produce a loss equal to $1 - f(x)$ and an updated weight equal to $f(x - 1)$. This looks like a new game, in which the adversary is the player. The player's status is determined by a real number $x$, and possible rewards are $f(x)$ and $1 - f(x)$. If the player opts for $f(x)$, this will bring him to a new status $x + \delta$. If he opts for $1 - f(x)$, this will bring him to $x - \delta$. In our case $\delta = 1$. Note also that $f(-\infty) = 1$, $f(+\infty) = 0$, and $f(\xi_0) = 1/2$ for $\xi_0 = \ln((1 - \omega)/\omega) / \ln\beta$. It is easy to prove that there is an odd symmetry around $(\xi_0, 1/2)$, i.e. $f(\xi_0 + x) + f(\xi_0 - x) = 1$. Moreover, if $\omega \ge 1/2$, then $f(0) = \omega \ge 1/2$, and $\xi_0 \ge 0$. If the current status of the player is $x_1$, and $x_1 < \xi_0$, the greedy behavior is to move $\lceil (\xi_0 - x_1)/\delta \rceil$ times to the right, which (unless $T$ is too short) will bring the player to a point $x_2$ such that $x_2 \ge \xi_0$, and then to start an oscillation around $\xi_0$, which will last until the end of the game. In the following we prove that this behavior is optimal, in spite of the fact that profits around $\xi_0$ are low.
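The properties of the status function used in the proof can be verified numerically. The sketch below assumes the form $f(x) = \omega\beta^{x} / (\omega\beta^{x} + 1 - \omega)$, which is what the Hedge update yields when $x$ counts the unit penalties on option 1 minus those on option 2; the names `status_reward` and `balance_point` are hypothetical.

```python
import math


def status_reward(x, omega, beta):
    """f(x): normalized weight of option 1 at status x (assumed form)."""
    a = omega * beta ** x
    return a / (a + 1.0 - omega)


def balance_point(omega, beta):
    """xi_0, the status at which both options have weight 1/2."""
    return math.log((1.0 - omega) / omega) / math.log(beta)


def weight_after(omega, beta, k, m):
    """Weight of option 1 after k unit penalties on option 1 and m on option 2."""
    a = omega * beta ** k
    b = (1.0 - omega) * beta ** m
    return a / (a + b)


if __name__ == "__main__":
    omega, beta = 0.7, 0.5
    xi0 = balance_point(omega, beta)
    print(xi0, status_reward(xi0, omega, beta))
    # Odd symmetry around (xi0, 1/2).
    print(status_reward(xi0 + 0.8, omega, beta) + status_reward(xi0 - 0.8, omega, beta))
    # The weight depends only on the difference k - m.
    print(weight_after(omega, beta, 3, 1), status_reward(2, omega, beta))
```

The last line illustrates why a single number $x$ is enough to describe the state of the game: only the difference between the penalty counts of the two options matters.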
The main idea behind this sketch of proof is that a retreat (with consequent low profits $1 - f(x)$) is not a good investment for the future. Assume $x_1$ as the player's status, and $T$ steps (rounds) remain until the end of the game, while $x_1 + T\delta < \xi_0$. The player executes $M$ forward steps, i.e. collects the rewards $f(x_1), f(x_1 + \delta), \dots, f(x_1 + (M - 1)\delta)$. Then, $M$ backward steps with gains $1 - f(x_1 + M\delta), \dots, 1 - f(x_1 + \delta)$ follow, until $x_1$ is reached again. In the rest of the game, i.e. until the $T$th step, greedy selections are made. This course of events is shown on curve (a) in Figure 3, where the dots mark the rewards achieved (and some dots have been vertically displaced by a small amount so as to be distinguishable from other dots at the same position). If greedy selections had been made all the way, the course of events would be as shown by curve (b). If $y_i$ describes the status of the adversary on the greedy curve (b) at the $i$th step and $x_i$ the status on curve (a), then the difference between the cumulative reward on curve (b) and curve (a) is
$\Delta R = \sum_{i=T-2M+1}^{T} f(y_i) - \left[ \sum_{i=1}^{M} f(x_i) + \sum_{i=M+1}^{2M} (1 - f(x_i)) \right]$. (14)
Effectively we need to show that $\Delta R \ge 0$. First, let us make some observations and explore other variations of $\Delta R \ge 0$. Note that $\Delta R$, as given by (14), is positive if the cumulative reward from the back and forth movement (in the first $2M$ steps) is less than the reward in the last $2M$ steps. However, as $T$ increases, the position of the last step approaches $\xi_0$ and it can be shown that the cumulative reward of the last $2M$ steps decreases. This property will be proved later, and it is due to the convexity and monotonicity properties of $f$. When $T$ further increases, some of the very last $2M$ steps of the greedy behavior enter the phase of oscillation around $\xi_0$, and for $T$ sufficiently large, all $2M$ steps belong to the oscillation phase. Note, however, that the oscillation phase rewards are those closer to 1/2, which is the lower limit of all greedy steps. If the greedy algorithm is to be optimal, even the $2M$ oscillatory steps should bring a cumulative reward greater than the original back and forth movement. On the other hand, if we prove this last inequality, this will also prove (14), whose last $2M$ steps bring more reward than the $2M$ oscillatory steps.

Let $\psi_1 < \xi_0 \le \psi_2 = \psi_1 + \delta$ denote the two points of the oscillation phase. The back and forth movement brings a cumulative reward equal to $M + f(x_1) - f(x_1 + M\delta)$, while $2M$ oscillatory steps bring $M + M[f(\psi_1) - f(\psi_2)]$. Therefore it suffices to show that
$f(x_1) - f(x_1 + M\delta) \le M \left[ f(\psi_1) - f(\psi_2) \right]$.
This is a consequence of the following lemma:
Lemma 1. For any concave function $f(x)$ the following inequality is true:
$f(x) - f(x + \Delta x) \le f(x + \Delta x) - f(x + 2\Delta x)$. (15)
Inequality (15) holds because
$-f'(\phi_1)\Delta x \le -f'(\phi_2)\Delta x$, (16)
which is a consequence of the mean value theorem stating that there is a point $\phi_1$ in $(x, x + \Delta x)$ such that $f(x + \Delta x) - f(x) = f'(\phi_1)\Delta x$, and similarly a point $\phi_2$ in $(x + \Delta x, x + 2\Delta x)$ such that $f(x + 2\Delta x) - f(x + \Delta x) = f'(\phi_2)\Delta x$. However, $f$ is a concave function, and its derivative is non-increasing, therefore $f'(\phi_1) \ge f'(\phi_2)$, which proves (16). In fact (15) is easily generalized to any same length intervals, even overlapping ones, i.e. if $x_1 \le x_2$, then $f(x_1) - f(x_1 + \Delta x) \le f(x_2) - f(x_2 + \Delta x)$. Due to (15) each successive equal length (i.e. $\Delta x$) interval produces an incremental reward $f(x) - f(x + \Delta x)$, which is smaller than the incremental reward of the next interval, and of all succeeding intervals, as long as $f$ remains concave. Effectively, Lemma 1 proves that the incremental reward of the rightmost interval, which does not contain $\xi_0$, i.e. the interval $(\psi_1 - \delta, \psi_1)$, is the highest among the rewards of all intervals of the same length, which begin to the left of $\psi_1 - \delta$. Unfortunately, our aim was to prove (14), which would be secured if $f$ remained concave in $[\psi_1, \psi_2]$, e.g. if $\psi_1 = \xi_0 - \delta$ and $\psi_2 = \xi_0$. However this is not true in general, since at $\xi_0$ $f$ turns from concave to convex. Fortunately, the term $f(\psi_1) - f(\psi_2)$ can be seen as the sum of rewards related with the concave $f$ in $(\psi_1, \xi_0)$ and the convex $f$ in $(\xi_0, \psi_2)$. Due to the odd symmetry around $\xi_0$, $f(\xi_0) - f(\psi_2) = f(2\xi_0 - \psi_2) - f(\xi_0)$, so that both parts can be expressed as incremental rewards on the concave branch of $f$. However, due to the concavity of $f$, $f(\psi_1 - \delta) - f(\psi_1) \le f(\psi_1) - f(\psi_2)$. This result states that the interval $(\psi_1, \psi_1 + \delta)$, which contains $\xi_0$, provides higher $\Delta f$ than the previous interval $(\psi_1 - \delta, \psi_1)$, which in turn is higher than the $\Delta f$ of any previous interval of the same length. Therefore we have seen so far that a sequence of penalties, which begins at some $x < \xi_0$ and involves one fold, can be reduced to a sequence without any folds, and with improved total reward, as shown in Figure 4. In Figure 4 a sequence of steps with a single fold, which starts at $x_1$ and ends at $x_2$, is shown together with the respective greedy sequence, which starts at $x_1$ and ends at $x_3$.
Suppose that the initial position of the game is at $x_1$, and that $x_1 \le \xi_0$ (otherwise reverse the initial probabilities $\omega$, $1 - \omega$). Suppose also that the initial sequence does not extend beyond $\psi_2$, i.e. it does not reach $\xi_0$ or it involves a number of oscillations around $\xi_0$. Then take the last fold and reduce it as mentioned, i.e. by replacing it with an equal number of greedy steps at the end of the current sequence. If these steps reach $\xi_0$ they are oscillation steps. Repeat the same step, until all folds have disappeared (oscillations do not count as folds). If the original sequence does extend beyond $\xi_0$, the approach is the same, but the reader should note that the application of this algorithm will finally reduce the part, which extends beyond $\psi_2$, to oscillations between $\psi_1$ and $\psi_2$.
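Proposition 1 can be checked numerically on small instances by comparing the greedy adversary with an exhaustive search over all binary penalty sequences. The sketch below is only an illustration for small $N$ and $T$ (the search visits $N^T$ sequences); it assumes the Hedge update $w_i \leftarrow w_i\beta$ for the option that receives the unit penalty, and the function names are hypothetical.

```python
from itertools import product


def play(weights, choices, beta):
    """Total Hedge loss when, in each round, the option listed in `choices`
    receives a unit penalty and every other option receives zero."""
    weights = list(weights)
    total = 0.0
    for i in choices:
        total += weights[i] / sum(weights)  # normalized weight of the penalized option
        weights[i] *= beta                  # Hedge update for a unit penalty
    return total


def best_binary_loss(weights, rounds, beta):
    """Exhaustive maximum over all N**rounds binary penalty sequences."""
    n = len(weights)
    return max(play(weights, seq, beta) for seq in product(range(n), repeat=rounds))


def greedy_loss(weights, rounds, beta):
    """Loss achieved by always penalizing the option with the largest weight."""
    weights = list(weights)
    total = 0.0
    for _ in range(rounds):
        i = weights.index(max(weights))
        total += weights[i] / sum(weights)
        weights[i] *= beta
    return total


if __name__ == "__main__":
    w, beta = [0.7, 0.3], 0.5
    for rounds in range(1, 7):
        print(rounds, greedy_loss(w, rounds, beta), best_binary_loss(w, rounds, beta))
```

The greedy computation needs one pass over the rounds, whereas the exhaustive search grows exponentially with the number of rounds, which is the practical value of the proposition.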

Conclusion
We summarize the main results of this paper: A worst case (adversarial) performance analysis of the Hedge algorithm has been presented, under the assumption of limited penalties per round. A recursive expression has been given for the evaluation of the maximum total cumulative loss; this expression can be exploited both numerically and analytically. Moreover, binary penalty schemes provide an excellent approximation to the optimal scheme, and, remarkably, the greedy binary strategy has been proved optimal among binary schemes for the adversary.
Clearly the objective function (2) is a function of a) the $N$ initial weights $w_i$, b) the $N \times T$ variables $\ell_i^t$, and c) $\beta$.

Figure 1. Plot of the maximum total loss curves $L_T(w)$ of two option games; the plot is more "interesting" for "unreasonably" small values of $\beta$.

Figure 2. Plot of one of the curves of Figure 1 together with the respective optimal penalties.


Figure 3. Sample paths of player behavior, which are used in the proof of Proposition 1.
If the sequence must extend after $\xi_0$, the additional steps are oscillation steps around $\xi_0$. The rest of the proof of Proposition 1 is just an application of this result, so that a sequence with an arbitrary number of folds can be reduced to an improved reward foldless sequence.

Figure 4. Reduction of a sequence of penalties, which contains a fold, to a sequence without folds and with improved total reward.