1. Introduction
The problems of adaptive network path selection and load distribution have often been considered as games that are played simultaneously and independently by agents controlling flows in a network. A possible abstraction of these and other related problems is the bandit game. In the multi-armed bandit game [1] a player chooses one out of $N$ strategies (or “machines” or “options” or “arms”). A loss or penalty (or a reward, which can be modeled as a negative loss) is assigned to each strategy after each round of the game.
An agent facing repeated selections will typically try to exploit the experience accumulated so far. A popular algorithm that can guide the agent in each selection round is the multiplicative updates algorithm, or Hedge. In this paper we calculate the worst possible performance of Hedge by using the adversarial technique, i.e. we investigate the behavior of an intelligent adversary who tries to maximize the player’s cumulative loss. In Section 1 we describe Hedge; in Section 2 we give a rigorous formulation of the adversary’s problem; in Section 3 we give a recursive solution; and in Section 4 we present sample numerical results. Finally, in Section 5 we explore binary adversarial strategies. Our main result is that the greedy adversarial strategy is optimal among binary strategies.
1.1. The bandit game
In a generalized bandit game the player is allowed to play mixed strategies, i.e. to assign a fraction $p_i$ (such that $\sum_{i=1}^{N} p_i = 1$) of the total bet to option $i$, thereby getting a loss equal to $\sum_{i=1}^{N} p_i \ell_i$, where $\ell_i$ is the penalty of option $i$. Alternatively, $p_i$ can be interpreted as the probability that the player assigns the whole bet to option $i$. In the “bandit” version only the total loss $\sum_{i} p_i \ell_i$ is announced to the player, while in the “full information” version the whole penalty vector $\ell = (\ell_1, \ldots, \ell_N)$ is announced.
A game consists of $T$ rounds; a superscript $(t)$ marks the $t$th round. Evidently the player will try to minimize the total cumulative loss

(1) $L = \sum_{t=1}^{T} \sum_{i=1}^{N} p_i^{(t)} \ell_i^{(t)}$

by controlling the bet distribution, i.e. by properly selecting the variables $p_i^{(t)}$. We use the additional assumption that the loss budget is limited in each round by setting the constraint $\sum_{i=1}^{N} \ell_i^{(t)} = 1$. Clearly a player’s goal is to minimize his or her total cumulative loss. An extremely lucky player, or a player with “inside information”, would select the minimum penalty option in each round and would put all his or her bet on this option, thereby achieving a total loss equal to $\sum_{t=1}^{T} \min_i \ell_i^{(t)}$.
1.2. The Hedge algorithm
Quite a few algorithmic solutions, which can guide the player’s hand in the full information game, have appeared in the literature. Freund and Schapire have proposed the Hedge algorithm [2] for the full information game. Auer, Cesa-Bianchi, Freund and Schapire have proposed the Exp3 algorithm in [3]. Allenberg-Neeman and Neeman proposed a Hedge variant, the GL (Gain-Loss) algorithm, for the full information game with gains and losses [4]. Dani, Hayes, and Kakade have proposed the GeometricHedge algorithm in [5], and a modification was proposed by Bartlett, Dani et al. in [6]. Recently Cesa-Bianchi and Lugosi have proposed the ComBand algorithm for the bandit version [7]. A comparison can be found in [8].
Hedge maintains a vector $w^{(t)} = (w_1^{(t)}, \ldots, w_N^{(t)})$ of weights, such that $w_i^{(t)} > 0$ ($i = 1, \ldots, N$, and $\sum_{i=1}^{N} w_i^{(t)} = 1$). In each round $t$ Hedge chooses the bet allocation according to the normalized weight, i.e. $p_i^{(t)} = w_i^{(t)}$. When the opponent reveals the loss vector of this round, the next round weight $w_i^{(t+1)}$ is determined so as to reflect the loss results, i.e.

$w_i^{(t+1)} = \frac{w_i^{(t)} \beta^{\ell_i^{(t)}}}{\sum_{j=1}^{N} w_j^{(t)} \beta^{\ell_j^{(t)}}}$

for some fixed $\beta$, such that $0 < \beta < 1$.
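The allocation and update steps above can be sketched in a few lines of Python (a minimal illustration in our notation; the function names are ours, not from any library):

```python
def hedge_allocation(w):
    """Bet fractions: the normalized weight vector."""
    s = sum(w)
    return [wi / s for wi in w]

def hedge_update(w, losses, beta):
    """Multiplicative update: w_i <- w_i * beta**loss_i, then normalize."""
    w = [wi * beta ** li for wi, li in zip(w, losses)]
    s = sum(w)
    return [wi / s for wi in w]

# Two options, beta = 0.5, a unit penalty on option 1:
w = hedge_update([0.5, 0.5], [1, 0], 0.5)
# the punished option's weight shrinks: w is now [1/3, 2/3]
```

Note that the normalization leaves the allocation ratios unchanged; only the relative penalties matter.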
In [9] Auer, Cesa-Bianchi, Freund and Schapire have proved that the expected Hedge performance and the expected performance of the best arm differ at most by a quantity of order $\sqrt{T \ln N}$. Freund and Schapire [2] have given a loss upper bound, which relates the total cumulative loss with the total loss $L_{\min}$ of the best arm, namely $L \le \left( \ln(1/\beta)\, L_{\min} + \ln N \right) / (1-\beta)$.
1.3. Competitive analysis
The competitive analysis of an algorithm $A$, which in this paper is Hedge, involves a comparison of $A$’s performance with the performance of the optimal offline algorithm. In the bandit game the optimal offline algorithm, i.e. the optimal player’s decisions given the sequence of all penalties in advance, is trivial: in a given round the player can just bet everything on the option with the lowest penalty.
According to S. Irani and A. Karlin (in Section 13.3.1 of [10]) a technique for finding bounds is to use an “adversary” who plays against $A$ and concocts an input which forces $A$ to incur a high cost. Using an adversary is just an illustrative way of saying that we try to find the worst possible performance of an online algorithm. In our analysis the adversary tries to maximize Hedge’s total loss by controlling the penalty vector (under a limited budget).
1.4. Interpretations and applications
In this section we offer some interpretations from the areas of 1) communication networks and 2) transportation. The general setting of course involves a number of options or arms, which must be selected by a player without any knowledge of the future.
Bandit models have been used in quite diverse decision making situations. In [11] He, Chen, Wang and Liu have used a bandit model for the maximization of the revenue of a search engine provider, who charges for advertisements on a per-click basis. They have subsequently defined the “armed bandit problem with shared information”; arms are partitioned in groups and loss information is shared only among players using arms of the same group. In [12] Park and Lee have used a multi-armed bandit model for lane selection in automated highways and autonomous vehicles traffic control.
1.4.1. Traffic load distribution
This first application example can take multiple interpretations, which always involve a selection in a competitive environment, in which competition is limited. It can be seen as 1) a path selection problem in networking, 2) a transport means (mode) choice or path selection problem, or 3) a computational load distribution problem, which we mention at the end of this section. Firstly, we describe the problem in the context of networking.
Consider $N$ similar independent paths (in the simplest case just $N$ parallel links), which join a pair of nodes $s$ and $d$. A traffic volume equal to $V$ is sent from $s$ to $d$ in consecutive time periods or rounds by a population of agents. $V$ is the same in each round, but the allocation of $V$ to paths, i.e. the vector $(V_1^{(t)}, \ldots, V_N^{(t)})$ such that $\sum_{i=1}^{N} V_i^{(t)} = V$, is different in each round $t$. An agent $a$ produces a constant amount of traffic equal to $v$, such that $v < V$, in $T$ consecutive rounds, and allocates a part equal to $v_i^{(t)}$ to the $i$th path in round $t$. The average delay (or cost) experienced by $a$’s traffic in the $t$th round is proportional to $\sum_{i=1}^{N} \frac{v_i^{(t)}}{v} V_i^{(t)}$, if we assume a linear delay (or cost) model. Linear models are used for simplicity in network analysis [13] and can be realistic if a network resource still operates in the linear region of the delay vs. load curve, e.g. when delay is calculated in a link which operates not very close to capacity. Agent $a$ aims at minimizing the total delay for its own traffic and may use Hedge to determine the quantities $v_i^{(t)}$ in round $t$, assuming that $a$ knows the performance of its own traffic in each path in the past time period. Note that the maximum delay in a round occurs if $a$ puts the whole $v$ in a single path together with the whole traffic of the competition, i.e. with $V - v$; then $a$’s average delay in this round is proportional to $V$. On the contrary, if $V$ is evenly distributed over all paths, $a$’s allocation decision does not really matter, as the average will be proportional to $V/N$. Of course the minimum delay in a round will occur if $a$ puts the whole $v$ in an empty path, thereby achieving a zero delay.
The above problem can also be formulated as a more general problem of distributing workload over a collection of parallel resources (e.g. distributing jobs to parallel processors). A. Blum and C. Burch have used the following motivating scenario in [14]: A process runs on some machine in an environment with $N$ machines in total. The process may move to a different machine at the end of a time interval. The load $\ell_i^{(t)}$, which will be found on machine $i$ at time round $t$, is the penalty felt by the process.
1.4.2. Interdiction
Although an adversary is usually a “technical” (fictional) concept, which serves the worst case analysis of online algorithms, in some environments a real adversary, who intentionally tries to oppose a player, does exist. An example is the interdiction problem.
We present a version of the interdiction problem in a network security context. An attacker attacks $N$ resources (e.g. launches a distributed denial of service attack on nodes, servers, etc., see [15]) by sending streams of harmful packets to resource $i$ at a rate $r_i$ (where $\sum_{i=1}^{N} r_i = R$ and $R$ is constant). A defender assigns a defense mechanism of intensity $d_i$ (e.g. a filter that is able to detect and block harmful packets with a probability proportional to $d_i$) to resource $i$. At the end of a time interval, e.g. one day, both the attacker and the defender revise the flows and the distribution of defense mechanisms to resources respectively, based on past performance.
Similar interpretations exist in transportation network environments, as in border and custom control, including illegal immigration control. An interdiction problem formulation can be used in a maritime transport security context: pirates attack the vessels traversing a maritime route. In [16] Vanek et al. assign the role of the player to the pirate. The pirate operates in rounds, starting and finishing in his home port. In each round he selects a sea area (arm) to sail to and search for possible victim vessels. A patrol force distributes the available escort resources to sea areas (arms), and pirate gains are inversely proportional to the strength of the defender’s forces on this area. Naval forces reallocate their own resources to sea areas.
2. Problem formulation
In this paper we aim at finding the worst case performance of Hedge. Effectively, we try to solve the following problem:
Problem 1. Given a number of options $N$, an initial normalized weight vector $w^{(1)} = (w_1^{(1)}, \ldots, w_N^{(1)})$, and a Hedge parameter $\beta$, find the sequence $\ell^{(1)}, \ell^{(2)}, \ldots, \ell^{(T)}$ that maximizes the player’s total cumulative loss

(2) $L_T = \sum_{t=1}^{T} \sum_{i=1}^{N} w_i^{(t)} \ell_i^{(t)}$

where $\ell^{(t)} = (\ell_1^{(t)}, \ldots, \ell_N^{(t)})$ is the penalty vector in round $t$, such that $\sum_{i=1}^{N} \ell_i^{(t)} = 1$, and the $(t+1)$th round penalty weights $w_i^{(t+1)}$ are updated according to

(3) $w_i^{(t+1)} = \frac{w_i^{(t)} \beta^{\ell_i^{(t)}}}{\sum_{j=1}^{N} w_j^{(t)} \beta^{\ell_j^{(t)}}}$

for $i = 1, \ldots, N$ and $t = 1, \ldots, T-1$.

Clearly the objective function (2) is a function of a) the initial weights $w_i^{(1)}$, b) the variables $\ell_i^{(t)}$, and c) $\beta$. Due to the normalization of both weights and penalties there are $(N-1)(T+1) + 1$ independent variables in total. In the following we write $L_T(w^{(1)})$ or $L_T(w^{(1)}, \beta)$ instead of $L_T$ whenever it is necessary to refer to these variables.
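For concreteness, the objective (2) under the update (3) can be evaluated by a short routine (a sketch in our notation; `total_loss` is our name, not a library function):

```python
def total_loss(w, penalties, beta):
    """Player's cumulative loss (2) for a given penalty sequence,
    applying the normalized multiplicative update (3) each round."""
    w = list(w)
    loss = 0.0
    for ell in penalties:                    # one penalty vector per round
        assert abs(sum(ell) - 1.0) < 1e-9    # per-round budget constraint
        loss += sum(wi * li for wi, li in zip(w, ell))
        w = [wi * beta ** li for wi, li in zip(w, ell)]
        s = sum(w)
        w = [wi / s for wi in w]
    return loss
```

For example, `total_loss([0.5, 0.5], [[1, 0], [0, 1]], 0.5)` charges 0.5 in the first round and 2/3 in the second, because the first update has shifted weight onto option 2.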
3. Recursion
Assuming that a given round starts with weights $w = (w_1, \ldots, w_N)$ and the adversary generates penalties $\ell = (\ell_1, \ldots, \ell_N)$, the next round will start with weights $w' = (w'_1, \ldots, w'_N)$, where

(4) $w'_i = \frac{w_i \beta^{\ell_i}}{\sum_{j=1}^{N} w_j \beta^{\ell_j}}$

Then, the total loss of a $T$ round game, which starts with weights $w$, can be written as the sum of the losses of a single round game, which starts with weights $w$, and a $T-1$ round game, which starts with weights $w'$, as follows:

(5) $L_T(w) = \sum_{i=1}^{N} w_i \ell_i + L_{T-1}(w')$
Note that the term $L_{T-1}(w')$, which expresses the contribution of the last $T-1$ rounds, depends only on the updated weights provided by the initial round. Such a Markovian property can be generalized in the following sense: a $T$ round game can be seen as consisting of a $k$ round game $G_1$ followed by a $T-k$ round game $G_2$, whose initial weights are the final weights of $G_1$; no more details about $G_1$ are passed to $G_2$.
Assuming that the solution to Problem 1 is $L_T^{*}(w)$, the following recursive formula for $L_T^{*}(w)$ can be derived from (5):

(6) $L_T^{*}(w) = \max_{\ell} \left[ \sum_{i=1}^{N} w_i \ell_i + L_{T-1}^{*}(w') \right]$

where $\ell$ is the penalty vector chosen by the adversary in the initial round.
The optimal penalties can also be computed recursively. Let $\ell_i^{*(t)}(T, w)$ denote the optimal penalty of the $i$th option in the $t$th round of a $T$ round game (starting with weights $w$). The optimal penalty vector of the initial round is apparently equal to the value of $\ell$ which optimizes (6). Therefore

(7) $\ell^{*(1)}(T, w) = \arg\max_{\ell} \left[ \sum_{i=1}^{N} w_i \ell_i + L_{T-1}^{*}(w') \right]$

In all other rounds $t = 2, \ldots, T$ the optimal penalties are such that the total loss of the rest of the game is maximized, i.e. such that $L_{T-1}^{*}(w')$ is achieved. Since the total loss $L_{T-1}^{*}(w')$ is achieved by using penalties $\ell^{*(t)}(T-1, w')$, the optimal penalties of the original game are realized by using these values shifted by one round. Therefore

(8) $\ell^{*(t)}(T, w) = \ell^{*(t-1)}(T-1, w')$ for $t = 2, \ldots, T$.
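The recursion (6) translates directly into code. The following brute-force sketch for $N = 2$ searches the adversary's first round penalty over a uniform grid (the grid is our simplification; the true maximizer may fall between grid points):

```python
def max_loss(w, beta, rounds, grid=101):
    """Approximate L*_T(w) of (6) for a two option game by exhaustive
    search of the first round penalty x over a uniform grid."""
    if rounds == 0:
        return 0.0
    best = 0.0
    for k in range(grid):
        x = k / (grid - 1)                 # penalty of option 1, in [0, 1]
        step = w * x + (1 - w) * (1 - x)   # single round loss
        num = w * beta ** x                # two option form of update (4)
        w_next = num / (num + (1 - w) * beta ** (1 - x))
        best = max(best, step + max_loss(w_next, beta, rounds - 1, grid))
    return best
```

Its cost grows as `grid**rounds`, so it is only usable for very small games; the tabulated computation of Section 4 scales much better.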
4. Two option games and numerical results
In this section we exploit the recursive methodology, which has been presented in the previous section, in order to provide some numerical results for two option games. We compare these results with available bounds in the literature. We consider $N = 2$, i.e. two option games. We keep only the independent penalties of the extended notation and use the more compact version $x^{(t)} = \ell_1^{(t)}$ (so that $\ell_2^{(t)} = 1 - x^{(t)}$), together with $w = w_1$ (so that $w_2 = 1 - w$). As an example, the loss of a single round game is given by

(9) $L_1 = w x + (1-w)(1-x)$
Also, since the initial weights are $(w, 1-w)$, we simplify the maximum cumulative loss $L_T^{*}(w^{(1)})$ to $L_T^{*}(w)$. Assuming losses $x$ and $1-x$, the next round will start with weights $w'$ and $1-w'$, where

(10) $w' = \frac{w \beta^{x}}{w \beta^{x} + (1-w) \beta^{1-x}}$

Then (6) is simplified to

(11) $L_T^{*}(w) = \max_{0 \le x \le 1} \left[ w x + (1-w)(1-x) + L_{T-1}^{*}(w') \right]$

where $x$ is the penalty chosen by the adversary for the first option in the initial round.
The iteration starts from $L_1^{*}(w)$, i.e. the loss of a single round game. In such a game the adversary controls a single penalty variable, as the loss is given by (9). Apparently the adversary will choose binary values, i.e. $x = 1$ if $w > 1/2$ and $x = 0$ if $w < 1/2$, and the maximum total loss is the greater of the two weights, i.e.

(12) $L_1^{*}(w) = \max(w, 1-w)$
The graph of $L_1^{*}(w)$ appears as the lowest V-shaped “curve” in Figure 1. The fact that $L_1^{*}(w)$ is a piecewise linear function of $w$ with a breakpoint (i.e. a sudden change in its slope) creates even more breakpoints in $L_2^{*}(w)$, $L_3^{*}(w)$ and so on. Therefore, while it is possible to use the aforementioned recursion in order to find analytical expressions for the maximum total loss and the associated penalties, the analysis becomes quite complicated even for small values of the number of rounds $T$. We omit this tedious analysis and present numerical results based on the recursive methodology given above.
Instead we have implemented a numerical computation based on (11). $L_T^{*}(w)$ is approximated by samples in the interval $0 \le w \le 1$, i.e. by $L_T^{*}(w_k)$, where $w_k = k \delta$ for $k = 0, \ldots, K$ and $\delta = 1/K$. In the same way the functions appearing in the maximization of (11) are represented by samples at the points $w_k$, with intermediate values obtained by interpolation. Initially we create $L_1^{*}$ by using (9). We use the result as input to (11) and create $L_2^{*}$. Then we use the already calculated $L_2^{*}$ in (11) to calculate $L_3^{*}$, then $L_3^{*}$ to calculate $L_4^{*}$, and so on. In Figure 1 we show $L_T^{*}(w)$ as a function of the initial weight $w$ in games with up to ten rounds and for different values of $\beta$. Observe that the shape of $L_T^{*}(w)$ is more “interesting” for “unreasonably” small values of $\beta$.
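The tabulated computation just described can be reproduced as follows (a sketch; the grid sizes are our assumptions, not values from the paper):

```python
import numpy as np

def max_loss_tables(beta, T, K=200, X=201):
    """Tabulate L*_t(w), t = 1..T, of (11) on K+1 samples of w in [0, 1],
    maximizing over X sampled penalties x; L*_{t-1} at the updated
    weight w' of (10) is obtained by linear interpolation."""
    w = np.linspace(0.0, 1.0, K + 1)
    x = np.linspace(0.0, 1.0, X)
    ww, xx = np.meshgrid(w, x, indexing="ij")
    step = ww * xx + (1 - ww) * (1 - xx)                 # single round loss
    num = ww * beta ** xx
    w_next = num / (num + (1 - ww) * beta ** (1 - xx))   # update (10)
    L = np.zeros((T + 1, K + 1))                         # L[0] = 0: empty game
    for t in range(1, T + 1):
        cont = np.interp(w_next, w, L[t - 1])            # L*_{t-1}(w')
        L[t] = (step + cont).max(axis=1)
    return w, L
```

Row `L[1]` reproduces the V-shaped $\max(w, 1-w)$ of (12), and plotting the rows of `L` against `w` reproduces the family of curves of Figure 1.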
The optimal penalties can be determined by using formulas (7) and (8). In Figure 2 we draw one of the curves of Figure 1 together with the respective optimal penalties. The final round optimal penalty is certain to be binary, since the adversary will assign a unit penalty to the option with the greatest weight. However, the penalties of the first two rounds are clearly non-binary.
5. Binary and greedy schemes
The penalty values in the first two rounds in the example of Figure 2 prove that the adversary’s optimal penalties are not necessarily binary. However, in this example $\beta$ is “unnaturally” close to 0, while in practical Hedge implementations $\beta$ is chosen close to 1; this choice achieves a more gradual adaptation to losses. Both experimental and analytical evidence show that the optimal penalties tend rapidly to binary values as $\beta$ approaches 1. Effectively, it seems that results very close to the optimum can be achieved by a “binary adversary”, i.e. an adversary that resorts to binary values only.
On the other hand, the optimal adversarial policy with binary penalties can be found exhaustively as

$L_T^{bin}(w) = \max \sum_{t=1}^{T} \sum_{i=1}^{N} w_i^{(t)} \ell_i^{(t)}$, with $\ell^{(t)} \in B$ for all $t$,

where $B$ is the set of the $N$ binary vectors $e_1, \ldots, e_N$ such that $e_i$ has its $i$th component equal to 1 and all other components equal to 0, i.e. only one component equals 1. Apparently, the complexity of this calculation grows as $N^T$. However, in the following we show that the optimal binary adversary is in fact the “greedy adversary”. The latter achieves binary optimality in linear time.
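Both the exhaustive binary optimum and the greedy strategy discussed below are easy to implement for two options, so the claimed optimality can be spot-checked numerically (a sketch; the function names are ours):

```python
from itertools import product

def binary_loss(w, beta, seq):
    """Cumulative loss of a binary penalty sequence; seq[t] = 1 means
    the unit penalty goes to option 1 in round t."""
    loss = 0.0
    for x in seq:
        loss += w if x else (1 - w)
        num = w * (beta if x else 1.0)
        w = num / (num + (1 - w) * (1.0 if x else beta))
    return loss

def best_binary(w, beta, T):
    """Exhaustive optimum over all N**T = 2**T binary sequences."""
    return max(binary_loss(w, beta, seq) for seq in product((0, 1), repeat=T))

def greedy_binary(w, beta, T):
    """Greedy adversary: punish the currently heavier option."""
    loss, v = 0.0, w
    for _ in range(T):
        x = 1 if v >= 0.5 else 0
        loss += v if x else (1 - v)
        num = v * (beta if x else 1.0)
        v = num / (num + (1 - v) * (1.0 if x else beta))
    return loss
```

For instance, with $w = 0.7$, $\beta = 0.8$ and a handful of rounds the two values coincide, in agreement with Proposition 1, while the exhaustive search examines $2^T$ sequences.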
A “greedy adversary” is eager to punish the maximum weight option as much as possible in each round. Thus the adversary will assign exactly one unit of penalty to the maximum current weight option, and zero penalties to all other options. Given a sufficient number of rounds, it is easy to see that the weights of an $N$ option game are “equalized”, so that any two weights $w_i^{(t)}$, $w_j^{(t)}$ remain close to each other in all subsequent rounds. When equalization is achieved, a periodic phenomenon starts and the greedy penalties form a rotation scheme.
5.1. Greedy behavior
We explore the greedy pattern in a two option game; the discussion can easily be generalized to $N$ options. Assume initial weights $w$ and $1-w$ such that $w > 1/2$. A greedy adversary will choose $x^{(t)} = 1$ for $t = 1, \ldots, m$, i.e. for as long as the (unnormalized) weight $w \beta^{t-1}$ of the first option exceeds the weight $1-w$ of the second option; here $m$ is the first index with $w \beta^{m} < 1-w$. At step $m+1$ the weight of the second option becomes for the first time greater than the weight of the first option, and a loss equal to 1 is assigned to the second option. Therefore, in the next step the weights (before normalization) are $w \beta^{m}$ and $(1-w) \beta$, with the first option heavier again. In the next round they become $w \beta^{m+1}$ and $(1-w) \beta$, with the second option heavier again, and in general the normalized weights oscillate between these two pairs periodically. Therefore the total loss in a pair of subsequent rounds of the oscillation is equal to

(13) $\frac{1-w}{w \beta^{m} + 1-w} + \frac{w \beta^{m}}{w \beta^{m} + (1-w) \beta}$
The value of $m$ is determined by the initially assumed inequality, and since $m$ ought to be an integer, $m = \left\lceil \log_{\beta} \frac{1-w}{w} \right\rceil$. The loss in the first $m$ steps ($t = 1, \ldots, m$) is equal to

$\sum_{t=1}^{m} \frac{w \beta^{t-1}}{w \beta^{t-1} + 1-w}$

Therefore, for an even positive integer $T - m$ the total loss in $T$ steps is

$L = \sum_{t=1}^{m} \frac{w \beta^{t-1}}{w \beta^{t-1} + 1-w} + \frac{T-m}{2} \left( \frac{1-w}{w \beta^{m} + 1-w} + \frac{w \beta^{m}}{w \beta^{m} + (1-w) \beta} \right)$
In a game with more than two options it is straightforward to show that in the “steady” (periodic) state the weights tend to become equal, i.e. almost equal to $1/N$, where $N$ is the number of options. Consequently, the loss accumulated in the periodic state is approximately $1/N$ per round, i.e. roughly $T/N$ in a $T$ round game.
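The equalization and the subsequent rotation can be observed in simulation (a sketch for general $N$; the particular starting point and parameter values are arbitrary choices of ours):

```python
def greedy_rounds(w, beta, T):
    """Run T rounds of the greedy adversary against Hedge weights w;
    returns the final weights and the per round losses."""
    w = list(w)
    losses = []
    for _ in range(T):
        i = max(range(len(w)), key=w.__getitem__)  # heaviest option
        losses.append(w[i])                        # its weight is the round loss
        w = [wj * (beta if j == i else 1.0) for j, wj in enumerate(w)]
        s = sum(w)
        w = [wj / s for wj in w]
    return w, losses

w, losses = greedy_rounds([0.7, 0.1, 0.1, 0.1], beta=0.9, T=400)
# after equalization every weight stays near 1/4, and so does each round loss
```

Starting from a skewed distribution with $N = 4$, the weights settle near $1/4$ and the late rounds each contribute a loss close to $1/4$, consistent with the $T/N$ estimate above.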
5.2. Optimality of the greedy behavior
The following proposition provides a simple polynomial solution to the problem of finding the optimal binary adversary.
Proposition 1. The greedy strategy is optimal for the adversary among all strategies with binary penalties. □
Proof: Due to the normalization of weights and penalties, in the proof we mention only option 1 weights and penalties. Assuming an initial weight $w$ and binary penalties $x^{(1)}, \ldots, x^{(n)} \in \{0, 1\}$ in the first $n$ rounds, the weight which emerges before the $(n+1)$th round is

$w^{(n+1)} = \frac{w \beta^{s}}{w \beta^{s} + (1-w) \beta^{\,n-s}}$

where $s = \sum_{t=1}^{n} x^{(t)}$. Effectively, two options are available to the adversary in each step: either i) to assign a unit penalty to option 1, which will produce an incremental loss equal to $w^{(n+1)}$ and will increase $s$ by one, or ii) to assign a zero penalty to option 1 (i.e. a unit penalty to option 2), which will produce a loss equal to $1 - w^{(n+1)}$ and leave $s$ unchanged. Define

$g(z) = \frac{w \beta^{z}}{w \beta^{z} + 1 - w}$

so that $w^{(n+1)} = g(2s - n)$.
This looks like a new game, in which the adversary is the player. The player’s status is determined by a real number $z$ (in our case $z = 2s - n$, so each round changes $z$ by $\pm 1$), and the possible rewards are $g(z)$ and $1 - g(z)$. If the player opts for $g(z)$, this will bring him to a new status $z+1$. If he opts for $1 - g(z)$, this will bring him to $z-1$. Note also that $g(-\infty) = 1$, $g(+\infty) = 0$, and $g$ is strictly decreasing. Moreover, if $z_0$ is the root of $g(z) = 1/2$ (or, equivalently, of $w \beta^{z} = 1-w$), then $g(z) > 1/2$ for $z < z_0$, and $g(z) < 1/2$ for $z > z_0$. It is easy to prove that there is an odd symmetry around $(z_0, 1/2)$, i.e. $g(z_0 + u) + g(z_0 - u) = 1$, and that $g$ is concave in $(-\infty, z_0]$, while it is convex in $[z_0, +\infty)$.
Assume that $w > 1/2$; then $z_0 > 0$, and $g(0) = w > 1/2$. If the current status of the player is $z = 0$, the greedy behavior is to move to the right for as long as $g(z) > 1/2$, which (unless the game is too short) will bring the player to the integer point $z_1 = \lceil z_0 \rceil$. At $z_1$ we have $g(z_1) < 1/2$, so the greedy player must choose $1 - g(z_1)$ and move back to $z_1 - 1$. Effectively, this starts an oscillation between $z_1 - 1$ and $z_1$, which will last until the end of the game. In the following we prove that this behavior is optimal, in spite of the fact that profits around $z_0$ are low.
The main idea behind this sketch of proof is that a retreat (with consequent low profits $1 - g(z) < 1/2$) is never a good investment for the future. Assume $z$ as the player’s status, with $n$ steps (rounds) remaining until the end of the game, while $z < z_0$. The player executes $k$ forward steps, i.e. $z \to z+1 \to \cdots \to z+k$, with rewards $g(z), g(z+1), \ldots, g(z+k-1)$. Then $k$ backward steps with gains $1 - g(z+k), \ldots, 1 - g(z+1)$ are executed; consequently $z$ is reached again. In the rest of the game, i.e. until the $n$th step, greedy selections are made. This course of events is shown on curve (a) in Figure 3, where the dots mark the rewards achieved (and some dots have been vertically displaced by a small amount so as to be distinguishable from other dots at the same position). If greedy selections had been made all the way, the course of events would be as shown by curve (b).
If $b_t$ describes the status of the adversary on the greedy curve (b) at the $t$th step and $a_t$ the status on curve (a), then $b_t = a_{t+2k}$ for $t = 1, \ldots, n-2k$: after its back and forth excursion, curve (a) simply repeats the greedy trajectory with a delay of $2k$ steps, and the rewards collected on this common part coincide. Therefore the difference $\Delta$ between the cumulative reward on curve (b) and curve (a) is the reward of the last $2k$ greedy steps of (b) minus the reward of the back and forth movement, i.e.

$\Delta = \sum_{t=n-2k+1}^{n} r_b^{(t)} - \sum_{j=0}^{k-1} \left[ g(z+j) + 1 - g(z+j+1) \right]$

where $r_b^{(t)}$ denotes the reward collected on curve (b) at step $t$.
Figure 3. Sample paths of player behavior, which are used in the proof of Proposition 1.
Effectively we need to show that $\Delta \ge 0$. First, let us make some observations and explore other variations of curve (a). Note that $\Delta$ is positive if the cumulative reward from the back and forth movement (in the first $2k$ steps of (a)) is less than the reward of the last $2k$ steps of (b). However, as $n$ increases, the positions of the last steps approach $z_0$, and it can be shown that the cumulative reward of the last $2k$ steps decreases. This property will be proved later, and it is due to the convexity and monotonicity properties of $g$. When $n$ further increases, some of the very last $2k$ steps of the greedy behavior enter the phase of oscillation around $z_0$, and for $n$ sufficiently large, all $2k$ of them belong to the oscillation phase. Note, however, that the oscillation phase rewards are those closest to 1/2, which is the lower limit of all greedy rewards. If the greedy algorithm is to be optimal, even the oscillatory steps should bring a cumulative reward greater than the original back and forth movement. On the other hand, if we prove this last inequality, this will also prove the general case, whose last $2k$ steps bring more reward than the $2k$ oscillatory steps.
Let $(z_1 - 1, z_1)$ be the pair of oscillation points around $z_0$, i.e. $z_1 - 1 \le z_0$ and $z_1 > z_0$. In the worst case, which has just been mentioned, all of the last $2k$ steps are oscillation steps and

$\Delta = k \left( g(z_1 - 1) + 1 - g(z_1) \right) - \sum_{j=0}^{k-1} \left[ g(z+j) + 1 - g(z+j+1) \right]$

However, the subtracted sum can be seen as the sum of $k$ terms $1 + g(z+j) - g(z+j+1)$, for $j = 0, \ldots, k-1$, with $z + k \le z_1$. We shall further prove that each of these terms is smaller than the difference inside the big parentheses, i.e.

(14) $g(z+j) - g(z+j+1) \le g(z_1 - 1) - g(z_1)$
This is a consequence of the following lemma:

Lemma 1. For any concave function $g$ and any $x \le y$ the following inequality is true:

(15) $g(x) - g(x+1) \le g(y) - g(y+1)$

Inequality (15) holds because

(16) $g'(\xi_1) \ge g'(\xi_2)$

which is a consequence of the mean value theorem stating that there is a point $\xi_1$ in $(x, x+1)$ such that $g(x+1) - g(x) = g'(\xi_1)$. Also, there is a point $\xi_2$ in $(y, y+1)$ such that $g(y+1) - g(y) = g'(\xi_2)$. However, $g$ is a concave function, and its derivative is non-increasing, therefore $\xi_1 \le \xi_2$ implies $g'(\xi_1) \ge g'(\xi_2)$, which proves (16). In fact (15) can be easily generalized to any same length intervals, even overlapping ones, i.e. if $x \le y$ and $h > 0$, then

(17) $g(x) - g(x+h) \le g(y) - g(y+h)$
Due to (15), each successive equal length interval produces an incremental reward $g(x) - g(x+1)$, which is smaller than the incremental reward of the next interval, and of all succeeding intervals, as long as $g$ remains concave. Effectively, Lemma 1 proves that the incremental reward of the rightmost unit interval contained in the concave region is the highest among the rewards of all intervals of the same length which begin to its left. Unfortunately, our aim was to prove (14), which would be secured if $g$ remained concave in $[z_1 - 1, z_1]$, e.g. if $z_1 = z_0$. However this is not true, since at $z_0$ the function $g$ turns from concave to convex. Fortunately, the term $g(z_1 - 1) - g(z_1)$, which covers the interval $[z_1 - 1, z_1]$, can be seen as the sum of rewards related with the concave $g$ in $[z_1 - 1, z_0]$ and, via the symmetry, with the concave $g$ in $[2z_0 - z_1, z_0]$. Indeed, due to the odd symmetry around $(z_0, 1/2)$, $g(z_1) = 1 - g(2z_0 - z_1)$, therefore $g(z_0) - g(z_1) = g(2z_0 - z_1) - g(z_0)$, and

$g(z_1 - 1) - g(z_1) = \left[ g(z_1 - 1) - g(z_0) \right] + \left[ g(2z_0 - z_1) - g(z_0) \right]$

However, both bracketed terms are rewards of intervals which lie in the concave region and end at $z_0$; due to the concavity of $g$ and (17), each of them is at least as large as the reward of an interval of the same length placed anywhere to its left. Splitting any unit interval $[x, x+1]$ with $x+1 \le z_0$ into pieces of the same two lengths and applying (17) to each piece, we obtain $g(z_1 - 1) - g(z_1) \ge g(x) - g(x+1)$, which proves (14).
This result states that the interval $[z_1 - 1, z_1]$, which contains $z_0$, provides a higher incremental reward than the previous interval $[z_1 - 2, z_1 - 1]$, which in turn provides a higher incremental reward than any previous interval of the same length.
Therefore we have seen so far that a sequence of penalties, which begins at some status $z$ and involves one fold, can be reduced to a sequence without any folds and with improved total reward, as shown in Figure 4. In Figure 4 a sequence of steps with a single fold, which starts at $z$, is shown together with the respective greedy sequence, which starts at the same point and ends $2k$ steps further to the right. If the sequence must extend after $z_1$, the additional steps are oscillation steps around $z_0$. The rest of this proof is just an application of this result, so that a sequence with an arbitrary number of folds can be reduced to an improved reward foldless sequence.
Figure 4. Reduction of a sequence of penalties, which contains a fold, to a sequence without folds and with improved total reward.
Suppose that the initial position of the game is at $z = 0$, and that $w > 1/2$ (otherwise reverse the roles of the two options by exchanging the initial probabilities $w$ and $1-w$). Suppose also that the initial sequence does not extend beyond $z_1$, i.e. it either does not reach $z_1$ or it involves a number of oscillations around $z_0$. Then take the last fold and reduce it as mentioned, i.e. replace it with an equal number of greedy steps at the end of the current sequence. If these steps reach $z_1$, they are oscillation steps. Repeat the same step until all folds have disappeared (oscillations do not count as folds). If the original sequence does extend beyond $z_1$, the approach is the same, but the reader should note that the application of this algorithm will finally reduce the part which extends beyond $z_1$ to oscillations between $z_1 - 1$ and $z_1$. □
6. Conclusion
We summarize the main results of this paper: a worst performance (adversarial) analysis of the Hedge algorithm has been presented, under the assumption of limited penalties per round. A recursive expression has been given for the evaluation of the maximum total cumulative loss; this expression can be exploited both numerically and analytically. Moreover, binary penalty schemes provide an excellent approximation to the optimal scheme, and, remarkably, the greedy binary strategy has been proved optimal among binary schemes for the adversary.