A Competitive Markov Approach to the Optimal Combat Strategies of On-Line Action Role-Playing Game Using Evolutionary Algorithms

In on-line action role-playing games, combat strategies can be divided into three distinct classes: Strategy of Motion (SM), Strategy of Attacking Occasion (SAO) and Strategy of Using Skill (SUS). In this paper, we analyze these strategies for a basic game model in which the combat is modeled by the discrete competitive Markov decision process. By introducing the chase model and the combat assistant technology, we identify the optimal SM and the optimal SAO. We also propose an evolutionary framework, which integrates competitive coevolution and cooperative coevolution, to search for the optimal SUS pair, which is regarded as the Nash equilibrium point of the strategy space. Experiments demonstrate that the proposed framework is able to find the optimal SUS pair. Furthermore, the results show that using the cooperative coevolutionary algorithm is much more efficient than using the simple evolutionary algorithm.


Introduction
In recent years, on-line Action Role-Playing Games (ARPGs) have become more and more popular all over the world. An on-line ARPG is a virtual world that consists of several distinct races. A player first creates a character of any race, then plays the game by both exploring the virtual world and fighting with others. During gaming, the player has direct control over the created character. Usually, the on-line ARPG is regarded as an extension of the off-line one because it allows players to fight with each other in addition to the AI opponent. This new feature makes the game balance more complex.
What is game balance? In the off-line ARPG, game balance means the difficulty control of the game and can be reached simply by adjusting the power of the AI opponent. In the on-line ARPG, however, game balance mainly refers to power balancing among the races. Currently, on-line ARPGs are balanced by hand tuning, but this approach presents several problems. Human players are expensive in both time and resources, and even human players cannot explore all strategies to find out whether a dominant one exists [1]. Moreover, since strengthening one race will definitely weaken the others, the result of tuning operations may become worse and is hard to control. Hence, a more theoretical and efficient design method is needed.
Chen, et al. [2,3] have proposed an evolutionary design method for both turn-based and action-based on-line role-playing games. In these approaches, they successfully constructed an automated testing framework to verify whether the game world is well-balanced. However, they investigated only the case in which each race has a single skill. The multi-skill situation was ignored because they failed to retrieve the optimal Strategy of Using Skill (SUS), which plays a crucial role in the automated testing framework.
In this paper, we propose an evolutionary framework, which integrates competitive coevolution and cooperative coevolution, to search for the optimal SUS pair of a basic action-based game model, where the combat is modeled by the Discrete Competitive Markov Decision Process (DCMDP) [4]. Also, by introducing the Chase Model and the Combat Assistant Technology (CAT), we analyze the optimal Strategy of Motion (SM) and the optimal Strategy of Attacking Occasion (SAO) of the game model.
The paper is organized as follows. In Section 2, a brief description of the DCMDP and a theorem are given. The details of the Cooperative Coevolutionary Algorithm (CCEA) are explained in Section 3. Section 4 presents a basic action-based game model and its optimal SM as well as SAO. Section 5 describes the proposed competitive framework. Experimental results are reported in Section 6. The conclusions and possible future research directions are given in Section 7.

Discrete Competitive Markov Decision Process
A DCMDP is a multi-player dynamic system that evolves along discrete time points. At each time point t, the state of the system, denoted by $S_t$, is a random variable that can take on values from the finite set $S = \{1, 2, \ldots, N\}$. At these discrete time points, called stages, both players have the possibility to influence the course of the system. In this paper, we only consider the case of two players, and we associate with each state $s \in S$ two finite action sets, $A^1(s)$ for player 1 and $A^2(s)$ for player 2. At any stage, the system is in one of the states and both players are allowed to choose an action out of their respective action sets independently of one another.
If in a state s, at some decision moment, player 1 chooses $a^1 \in A^1(s)$ and player 2 chooses $a^2 \in A^2(s)$, then two things happen:
1) Player 1 earns the immediate reward $r^1(s, a^1, a^2)$ and player 2 earns $r^2(s, a^1, a^2)$.
2) The dynamics of the system are influenced. The state at the next decision moment is determined in a stochastic sense by a transition vector, which is denoted as:
$$p(s, a^1, a^2) = \big(p(1 \mid s, a^1, a^2),\ p(2 \mid s, a^1, a^2),\ \ldots,\ p(N \mid s, a^1, a^2)\big)$$

Classes of Strategy
Strategies for players are rules that tell them what action to choose in any situation. The choice at a certain decision moment may depend upon the history of the play up to that moment. Furthermore, as is usual in game theory, the choice of an action may occur in a randomized way, that is, the players can specify a probability vector over their action spaces, and the action is then the result of a chance experiment according to this probability vector.
There are three classes of strategy, namely, the behavior strategy, the Markov strategy, and the stationary strategy.
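A behavior strategy, being the most general type of strategy, can be represented by a sequence $\pi = (f_0, f_1, f_2, \ldots)$, where each decision rule $f_t$ specifies for each state s a probability vector $f_t(s)$ on A(s) as a function of the history of the game up to decision moment t. The class of behavior strategies is denoted by $F_B$.
A Markov strategy is a behavior strategy where, for every t = 0, 1, 2, ..., the decision rule $f_t$ is completely determined by the decision moment t and the current state $s_t$ at moment t. The class of Markov strategies is denoted by $F_M$.
A stationary strategy is a Markov strategy where, for every t = 0, 1, 2, ..., the decision rule $f_t$ is completely determined by the current state $s_t$ at moment t. Thus, a stationary strategy can be represented by a sequence $\pi = (f, f, f, \ldots)$, where f specifies for each state $s \in S$ a probability vector f(s) on A(s). The class of stationary strategies is denoted by $F_S$. If the players use $\pi^1 = (f, f, \ldots) \in F_S$ and $\pi^2 = (g, g, \ldots) \in F_S$ as their strategies respectively, there exist stationary transition probabilities:
$$p(s' \mid s, \pi^1, \pi^2) = P(S_{t+1} = s' \mid S_t = s, \pi^1, \pi^2) = \sum_{a^1 \in A^1(s)} \sum_{a^2 \in A^2(s)} p(s' \mid s, a^1, a^2)\, f(s, a^1)\, g(s, a^2) \qquad (1)$$
for all t = 0, 1, 2, ..., where $f(s, a^1)$ and $g(s, a^2)$ denote the entries of f(s) and g(s) for the actions $a^1$ and $a^2$. Formula 1 is called the Stationary Markov Transition Property.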
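The following is a minimal Python sketch of the Stationary Markov Transition Property, assuming Formula 1 has the usual form in which the per-action transition vectors are averaged over both players' action distributions. The function name `stationary_transition`, the nested-list layout of `p`, and the two-state numbers are illustrative assumptions, not taken from the paper.

```python
def stationary_transition(p, f, g):
    """Build the transition matrix induced by a stationary strategy pair.

    p[s][a1][a2] is the transition vector over next states (assumed layout),
    f[s] and g[s] are the action distributions of player 1 and player 2.
    """
    n = len(p)
    P = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for a1, prob_a1 in enumerate(f[s]):
            for a2, prob_a2 in enumerate(g[s]):
                for s_next, prob in enumerate(p[s][a1][a2]):
                    # Weight each joint action by the product of both players' probabilities.
                    P[s][s_next] += prob_a1 * prob_a2 * prob
    return P


# Hypothetical two-state system with two actions per player in each state.
p = [
    [[[1.0, 0.0], [0.5, 0.5]],
     [[0.0, 1.0], [0.5, 0.5]]],
    [[[0.0, 1.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]],
]
f = [[0.5, 0.5], [1.0, 0.0]]  # player 1 mixes in state 0
g = [[1.0, 0.0], [0.0, 1.0]]  # player 2 plays pure actions
P = stationary_transition(p, f, g)
```

Each row of `P` is again a probability vector, so the pair of stationary strategies turns the two-player system into an ordinary Markov chain.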

β-Discounted Competitive Markov Decision Model
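The infinite stream of rewards that results from a particular implementation of a strategy pair $(\pi^1, \pi^2)$ needs to be evaluated in some manner. For this purpose, the β-Discounted Competitive Markov Decision Model, denoted by $\Gamma_\beta$, is introduced.
Let $R_t^k$ denote the reward at time t to player k, and let
$$r^k(\pi^1, \pi^2) = \big(r^k(1, \pi^1, \pi^2),\ r^k(2, \pi^1, \pi^2),\ \ldots,\ r^k(N, \pi^1, \pi^2)\big)^T$$
denote the immediate expected reward vector to player k corresponding to a strategy pair $(\pi^1, \pi^2)$ mentioned above, where k = 1, 2, and $r^k(s, \pi^1, \pi^2)$, for each $s \in S$, can be calculated by the following formula:
$$r^k(s, \pi^1, \pi^2) = \sum_{a^1 \in A^1(s)} \sum_{a^2 \in A^2(s)} r^k(s, a^1, a^2)\, f(s, a^1)\, g(s, a^2)$$
where $f(s, a^1)$ and $g(s, a^2)$ are the probabilities that $\pi^1$ and $\pi^2$ assign to the actions $a^1$ and $a^2$ in state s. The expected reward at stage t to player k resulting from $(\pi^1, \pi^2)$ and an initial state s is denoted by:
$$E_{s \pi^1 \pi^2}[R_t^k] = \big[P^t(\pi^1, \pi^2)\, r^k(\pi^1, \pi^2)\big]_s$$
where $[u]_s$ denotes the sth entry of a vector u, and $P^t(\pi^1, \pi^2)$ is the t-step Transition Probability Matrix (TPM). Consequently, the sth entry of the overall discounted value vector of a strategy pair $(\pi^1, \pi^2)$ to player k will be given by:
$$v_\beta^k(s, \pi^1, \pi^2) = \sum_{t=0}^{\infty} \beta^t E_{s \pi^1 \pi^2}[R_t^k] = \Big[\sum_{t=0}^{\infty} \beta^t P^t(\pi^1, \pi^2)\, r^k(\pi^1, \pi^2)\Big]_s$$
Theorem 1. Let $(\pi^{1*}, \pi^{2*}) \in F_S \times F_S$ be an optimal strategy pair of $\Gamma_\beta$; then $(\pi^{1*}, \pi^{2*})$ is optimal in the entire class of behavior strategies. That is, for all $s \in S$:
$$v_\beta^1(s, \pi^1, \pi^{2*}) \le v_\beta^1(s, \pi^{1*}, \pi^{2*}) \ \text{ for all } \pi^1 \in F_B, \qquad v_\beta^2(s, \pi^{1*}, \pi^2) \le v_\beta^2(s, \pi^{1*}, \pi^{2*}) \ \text{ for all } \pi^2 \in F_B$$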

Proof:
Let either $\pi^{1*}$ or $\pi^{2*}$ be fixed; then $\Gamma_\beta$ reduces to the discounted Markov decision model, which has only one player. Further, for the discounted Markov decision model, it is well known that the optimal stationary strategy is optimal in the entire class of behavior strategies [4]. This completes the proof.
In game theory, $(\pi^{1*}, \pi^{2*})$ is also called a Nash Equilibrium Point (EP) of the space $F_B \times F_B$. This theorem is very important because it suggests that we can retrieve the EP of $F_B \times F_B$ by just searching the space $F_S \times F_S$.
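The model $\Gamma_\beta$ is called zero-sum if
$$r^1(s, a^1, a^2) + r^2(s, a^1, a^2) = 0$$
for all $s \in S$, $a^1 \in A^1(s)$, $a^2 \in A^2(s)$. Thus we may drop the superscript k by defining:
$$r(s, a^1, a^2) = r^1(s, a^1, a^2) = -r^2(s, a^1, a^2)$$
An extension of this definition leads to the following:
$$v_\beta(s, \pi^1, \pi^2) = v_\beta^1(s, \pi^1, \pi^2) = -v_\beta^2(s, \pi^1, \pi^2)$$
Hence, in the case of zero-sum $\Gamma_\beta$, the two sets of inequalities defining an EP reduce to the single set of saddle-point inequalities as follows:
$$v_\beta(s, \pi^1, \pi^{2*}) \le v_\beta(s, \pi^{1*}, \pi^{2*}) \le v_\beta(s, \pi^{1*}, \pi^2) \qquad (4)$$
Formula 4 leads to an important property, that is, if $(\hat{\pi}^1, \hat{\pi}^2)$ is another pair of optimal strategies, then we have the following equation:
$$v_\beta(s, \hat{\pi}^1, \hat{\pi}^2) = v_\beta(s, \pi^{1*}, \pi^{2*})$$
This simply means that in the case of zero-sum $\Gamma_\beta$, the overall discounted value vectors of all optimal strategy pairs coincide and can be denoted by:
$$v_\beta = v_\beta(\pi^{1*}, \pi^{2*}) \qquad (5)$$
Thus, we can retrieve $v_\beta$ by searching the space $F_S \times F_S$ instead of $F_B \times F_B$, and the result follows from Theorem 1 and Formula 5.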
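As an illustration of how a stationary strategy pair can be evaluated, the sketch below approximates the discounted value vector by repeatedly applying $v \leftarrow r + \beta P v$, which sums the series $\sum_t \beta^t P^t r$ term by term. The function name `discounted_value`, the fixed iteration count, and the two-state numbers are assumptions made for this example only.

```python
def discounted_value(P, r, beta, iterations=2000):
    """Approximate sum_t beta^t P^t r by fixed-point iteration.

    P is the transition matrix of a stationary strategy pair and r the
    immediate expected reward vector of player 1 (both hypothetical here).
    """
    n = len(r)
    v = [0.0] * n
    for _ in range(iterations):
        v = [r[s] + beta * sum(P[s][j] * v[j] for j in range(n)) for s in range(n)]
    return v


# Hypothetical chain: state 1 is absorbing and yields no reward.
P = [[0.5, 0.5], [0.0, 1.0]]
r = [1.0, 0.0]
v = discounted_value(P, r, beta=0.9)
# In the zero-sum case, player 2's value vector is simply the negation.
v_player2 = [-x for x in v]
```

For state 0 the fixed point satisfies v = 1 + 0.45 v, so the iteration approaches 1 / 0.55.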

Cooperative Coevolutionary Algorithm
CCEAs [5,6] have been applied to solve large and complex problems, such as multiagent systems [7][8][9], rule learning [10,11], fuzzy modeling [12], and neural network training [13]. A CCEA models an ecosystem which consists of two or more species. Mating restrictions are enforced simply by evolving the species in separate populations, which interact with one another within a shared domain model and have a cooperative relationship. The original architecture of the CCEA for optimization can be summarized as follows:
1) Problem Decomposition: Decompose the target problem into smaller subcomponents and assign each of the subcomponents to a population;
2) Subcomponents Interaction: Combine the individual of a particular population with representatives selected from the others to form a complete solution, then evaluate the solution and attribute the score to the individual as its fitness;
3) Subcomponent Optimization: Evolve each population separately, in turn, by using a different evolutionary algorithm.
Empirical analyses have shown that the power of CCEAs depends on the decomposition work as well as on the separate evolution of these populations, resulting in significant speedups over the Simple Evolutionary Algorithm (SEA) [14][15][16]. Here, we give theoretical evidence for such results under the following two assumptions.
1) The elitists of the CCEA populations are chosen as the representatives; 2) There are no variational operators in either the SEA or the CCEA.
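The three architectural steps can be sketched in Python as follows. This is only an illustrative toy, not the implementation used in the paper: the function name `ccea`, the bit-string encoding, truncation selection with one-bit mutation, and the use of the same simple operator for every species are all assumptions chosen to keep the example short.

```python
import random


def ccea(fitness, n_species, genome_len, pop_size=10, generations=30, seed=0):
    """Toy cooperative coevolution over concatenated bit strings (illustrative only)."""
    rng = random.Random(seed)
    # 1) Problem decomposition: one population per subcomponent.
    pops = [[[rng.randint(0, 1) for _ in range(genome_len)]
             for _ in range(pop_size)] for _ in range(n_species)]
    reps = [pop[0][:] for pop in pops]  # initial representatives
    history = []
    for _ in range(generations):
        for i in range(n_species):  # 3) evolve each population in turn
            def score(ind):
                # 2) Interaction: join the individual with the other representatives.
                complete = reps[:i] + [ind] + reps[i + 1:]
                return fitness([bit for part in complete for bit in part])

            ranked = sorted(pops[i] + [reps[i]], key=score, reverse=True)
            reps[i] = ranked[0][:]  # the elitist becomes the representative
            parents = ranked[:pop_size // 2]
            children = []
            for _ in range(pop_size):
                child = rng.choice(parents)[:]
                child[rng.randrange(genome_len)] ^= 1  # one-bit mutation
                children.append(child)
            pops[i] = children
        history.append(fitness([bit for part in reps for bit in part]))
    return reps, history


# Hypothetical target problem: maximize the number of ones in the complete solution.
reps, history = ccea(fitness=sum, n_species=3, genome_len=4)
```

Because the current representative always competes in the ranking, the fitness of the complete solution recorded in `history` never decreases from one generation to the next.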
Let us begin with some definitions.
Definition 1. Given schemata $H_1, H_2, \ldots, H_n$, where $H_i$ denotes a schema of the ith CCEA population, the n-expanded schema, denoted by $H_1^n$, is the sequential concatenation of the n schemata.
Definition 2. Let there be n populations in the CCEA. A complete genotype is the sequential concatenation of n individuals selected from n different populations. If all the individuals are representatives, then the complete genotype is the best one.
Definition 3. Given an individual I of the ith CCEA population, the expanded genotype of I is the best complete genotype in which the ith representative is replaced by I.
Definition 4. Given a target problem J, the two algorithms, SEA and CCEA, are comparable if the population of the SEA consists of all the expanded genotypes in the CCEA.
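Definitions 1 to 3 can be made concrete with a few string helpers. The helper names (`expanded_schema`, `matches`, `expanded_genotype`), the use of `*` as the wildcard symbol, and the specific bit strings are hypothetical choices for illustration.

```python
def expanded_schema(schemata):
    """Definition 1: sequential concatenation of the n schemata."""
    return "".join(schemata)


def matches(schema, genotype):
    """A genotype matches a schema if every non-wildcard position agrees."""
    return len(schema) == len(genotype) and all(
        c in ("*", g) for c, g in zip(schema, genotype))


def expanded_genotype(representatives, i, individual):
    """Definition 3: best complete genotype with the ith representative replaced."""
    parts = list(representatives)
    parts[i] = individual
    return "".join(parts)


# Hypothetical example with n = 2 populations of 3-bit individuals.
schema = expanded_schema(["1*0", "*11"])         # "1*0*11"
representatives = ["110", "011"]
best_complete = "".join(representatives)          # Definition 2: "110011"
expanded = expanded_genotype(representatives, 0, "100")
```

Under Definition 4, a comparable SEA population would consist of all such expanded genotypes, one for every individual of every CCEA population.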
Theorem 2. Let a target problem f be decomposed into n subcomponents, $r_i$ be the increasing rate of the individuals matching $H_i$ in the ith CCEA population, $r_{CCEA}$ be the increasing rate of the complete genotypes matching $H_1^n$ in the CCEA, and $r_{SEA}$ be the increasing rate of the individuals matching $H_1^n$ in the SEA. If the two algorithms are comparable, then
$$r_{SEA} = \frac{\sum_{i=1}^{n} r_i\, M(H_i, t)}{\sum_{i=1}^{n} M(H_i, t)}, \qquad r_{CCEA} = \prod_{i=1}^{n} r_i$$
Proof:
Since the two algorithms are comparable, in the SEA, the number of individuals matching $H_1^n$ at generation t can be calculated by:
$$M_{SEA}(H_1^n, t) = \sum_{i=1}^{n} M(H_i, t)$$
where $M(H_i, t)$ denotes the number of individuals matching $H_i$ at generation t, in the ith CCEA population. Then according to the schema theorem (refer to Thm. 1), we have
$$M(H_i, t+1) = M(H_i, t)\, \frac{f(H_i, t)}{F(i, t)} = r_i\, M(H_i, t), \qquad M_{SEA}(H_1^n, t+1) = \sum_{i=1}^{n} r_i\, M(H_i, t)$$
where, in the ith CCEA population, $f(H_i, t)$ and $F(i, t)$ denote the mean fitness of individuals matching $H_i$ and the mean fitness of all individuals, at generation t, respectively.
In the case of the CCEA, because $H_1^n$ is the conjunction of the $H_i$, the number of complete genotypes matching $H_1^n$ at generation t is given by:
$$M_{CCEA}(H_1^n, t) = \prod_{i=1}^{n} M(H_i, t)$$
Again, according to the schema theorem, we obtain the following equation:
$$M_{CCEA}(H_1^n, t+1) = \prod_{i=1}^{n} r_i\, M(H_i, t) = \Big(\prod_{i=1}^{n} r_i\Big) M_{CCEA}(H_1^n, t)$$
Hence, as $\min\{r_1, \ldots, r_n\}$ increases, $H_1^n$ receives a much higher increasing rate in the CCEA.
However, Theorem 2 does not mean that CCEAs are superior to SEAs; that depends on the target problem. Actually, since the representatives are necessary in calculating the fitness of an individual of an arbitrary population, the relationships between populations have a great influence on the efficiency of CCEAs [17,18]. It has been proved that even with perfect information, infinite population size and no variational operators, CCEAs can be expected to converge to suboptimal solutions [18], while SEAs do not suffer from such affliction [19][20][21]. However, Liviu [22] has emphasized that CCEAs will settle in the globally optimal solution with arbitrarily high probability when properly set and given enough resources.

Game Model
In this section, we construct a two-dimensional ARPG model in which three distinct races (A, B and C) are designed with a maximum level of L. Each race has seven properties, which are Health (H), Dodge Rate (DR), Skill Damage (SD), Skill Critical Hit Rate (SCHR), Skill Cool Down Time (SCDT), Skill Range (SR) and Velocity (V), as well as two skills. Except for H and SD, which are monotone increasing functions of the level $l \in \{1, 2, \ldots, L-1, L\}$, the other properties are designed to be constant. SCHR denotes the probability that the player outputs double damage. SCDT, a time feature, is how long the character must rest after an attack, and SR, a space feature, denotes the range in which the skill can be used. In particular, in the game world, Velocity is measured in pixels per frame instead of miles per second.
In order to evaluate the power of a race, a standard square map with width w is introduced, and all battles are held in it. We only consider the "1 versus 1" fighting type, which involves two players (the defender and the attacker), because it is the core of the combat system of the on-line ARPG. We let the low-velocity race play the role of the defender, because if a high-velocity defender chose the strategy of escaping continuously, the battle would become meaningless and uninteresting. As a principle, after being attacked, the defender should be able to launch at least one attack. There is no constraint on battle time, that is, players are allowed to fight as long as they like. The basic design contents are as follows:
1) The width of the standard map: w = 400;
2) Health:
4) Dodge Rate:
where $SD^k_{\max}(l)$ denotes the maximum damage among the two skills of race k with level l.
Based on these ideas, we model the motion of the players by using the Chase Model (refer to Figure 2), and by the theory of vector differential, we acquire the following formula:
$$d|\vec{r}| = \big(|\vec{v}_{atk}| \cos\omega - |\vec{v}_{def}| \cos\theta\big)\, dt \qquad (6)$$
where $\vec{r}$ is the distance vector of the two players, $\vec{v}_{atk}$ and $\vec{v}_{def}$ denote the velocities of the attacker and the defender, and $\omega$ and $\theta$ are the angles that these velocities make with $\vec{r}$.
According to Formula 6, the attacker should maximize $d|\vec{r}|$ by maximizing $\cos\omega$, and the defender should minimize $d|\vec{r}|$ by maximizing $\cos\theta$. These can be recognized as the optimal SM. Before analyzing the optimal SAO, we introduce the Combat Assistant Technology (CAT), which has been applied to most on-line ARPGs (such as JXOnline, World of Warcraft, etc.). It helps players to run towards the opponent automatically and attack if possible. The optimal SAO will be analyzed based on it.
Figure 3 demonstrates a freeze-frame of the attacking process in which each combatant uses the CAT to attack. After the attacker launches an attack at point f, the defender strikes back at a point within (a, b], which enables him (her) to output the maximum damage. So using the CAT to attack is the optimal SAO of the defender. On the other hand, the attacker may launch the attack in the range (c, d], (d, e] or (e, g] with the probabilities $P_1$, $P_2$ and $P_3$ respectively. We define $(P_1, P_2, P_3)$ as the Action Probability Vector (APV), which represents the action feature and can be obtained through statistical analysis. Therefore, the CAT with the APV of the expert-level players is the optimal SAO of the attacker.
Unlike the SM and SAO, an optimal SUS completely depends on the opponent's SUS, which means that the optimal SUS pair is an EP. Also, each combat group has a limited state space, which is defined as follows:
$$S = \{(t_{def}, t_{atk}) \mid 0 \le t_{def} \le T_{def},\ 0 \le t_{atk} \le T_{atk}\} \setminus \{(0, 0)\}$$
where the (0, 0) state does not exist, $T_{def}$ and $T_{atk}$ are the maximum lifetimes of the defender and the attacker, $H^k(l)$ denotes the health value of k with level l, and $SD^k_{\min}(l)$ is defined as the minimum damage among the two skills of k with level l.
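The Chase Model relation of this section can be checked numerically. The sketch below assumes that the distance vector points from the defender to the attacker and that ω and θ are the angles between that vector and the attacker's and defender's velocities; these sign conventions, the function names, and the sample numbers are assumptions, since Figure 2 is not reproduced here.

```python
import math


def distance_rate(pos_atk, pos_def, v_atk, v_def):
    """Rate of change of |r| per frame, with r = pos_atk - pos_def (assumed convention)."""
    rx, ry = pos_atk[0] - pos_def[0], pos_atk[1] - pos_def[1]
    dist = math.hypot(rx, ry)
    ux, uy = rx / dist, ry / dist  # unit vector along r
    # Project the relative velocity onto the direction of r.
    return (v_atk[0] - v_def[0]) * ux + (v_atk[1] - v_def[1]) * uy


def rate_from_angles(speed_atk, omega, speed_def, theta):
    """Same quantity written with the angles omega and theta, as in Formula 6."""
    return speed_atk * math.cos(omega) - speed_def * math.cos(theta)


# Hypothetical frame: attacker (5 px/frame) retreats straight away from the
# defender (3 px/frame), who runs straight towards the attacker.
straight = distance_rate((10.0, 0.0), (0.0, 0.0), (5.0, 0.0), (3.0, 0.0))
# If the attacker instead moves perpendicular to r, the gap closes.
sideways = distance_rate((10.0, 0.0), (0.0, 0.0), (0.0, 5.0), (3.0, 0.0))
```

With ω = 0 the attacker gains the most distance per frame, and with θ = 0 the defender loses the least, which matches the optimal SM stated in the text.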

Competitive Coevolutionary Framework
In this section, we first model the combat process by using the zero-sum $\Gamma_\beta$ process; in this step, the fitness function is constructed. Next, we introduce the competitive framework and demonstrate how to use it to find the optimal SUS pair.
As stated in the preceding section, there is no time constraint during fighting, that is, the combat can be regarded as an infinite process in which players keep changing their state from one to another by their actions. This infinite process can be evaluated by $\Gamma_\beta$.
It is natural for any player that he(she) wants to maximize his(her) win probability during battles. Thus, we use the increment of win probability from stage t to t + 1 as the expectation of the immediate reward of the decision rule pair $(f_t, g_t)$. Let $p_1(t)$, $p_2(t)$ be the win probabilities of player 1 and player 2 at stage t respectively; then, by the fact that the larger t becomes, the higher the probability of game over will be, we obtain the following:
$$\lim_{t \to \infty} \big[p_1(t) + p_2(t)\big] = 1$$
From this formula, it can be inferred that an increment of $p_1(t)$ will bring a potential decrement to $p_2(t)$. With these results, the combat can be regarded as a zero-sum game and can be modeled by the zero-sum $\Gamma_\beta$ process. The immediate reward vector and the overall discounted value vector are defined as follows:
$$r(\pi^1, \pi^2) = \big(r(1, \pi^1, \pi^2), \ldots, r(N, \pi^1, \pi^2)\big)^T, \qquad v_\beta(\pi^1, \pi^2) = \sum_{t=0}^{\infty} \beta^t P^t(\pi^1, \pi^2)\, r(\pi^1, \pi^2)$$
where
$$r(s, \pi^1, \pi^2) = \sum_{s' \in S_T} p(s' \mid s, \pi^1, \pi^2) \ \text{ for } s \notin S_T, \qquad r(s, \pi^1, \pi^2) = 0 \ \text{ for } s \in S_T, \qquad P^0(\pi^1, \pi^2) = I_N$$
Here, $\pi^1$, $\pi^2$ are the SUS of the defender and the attacker, N is the size of the state space, $I_N$ is the identity matrix of size N, and $S_T$ denotes the absorbing state set in which the lifetime of the attacker is 0. In particular, in this case, $v_\beta(s, \pi^1, \pi^2)$ will always converge to a certain value, even if the discount factor β is set to 1.
Since the combat always starts from the state (T def , T atk ), denoted by s N , it is natural that we shall use v 1 (s N , π 1 , π 2 ), denoting the win probability of defender, as the fitness function of our evolutionary algorithm.
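To illustrate how such a fitness value could be computed, the sketch below accumulates the probability of entering a defender-win state, stage by stage, with the discount factor set to 1. The function name `win_probability`, the rule that a state is absorbing when its self-transition probability equals 1, and the three-state transition matrix are assumptions for this example, not the experimental setting of the paper.

```python
def win_probability(P, win_states, start, iterations=500):
    """Probability of eventually being absorbed in win_states, starting from start.

    The immediate reward of a transient state is the one-step probability of
    entering a defender-win state; absorbing states earn nothing afterwards.
    """
    n = len(P)
    absorbing = {s for s in range(n) if P[s][s] == 1.0}
    r = [0.0 if s in absorbing else sum(P[s][j] for j in win_states)
         for s in range(n)]
    v = [0.0] * n
    for _ in range(iterations):
        # Undiscounted update (beta = 1): v <- r + P v.
        v = [r[s] + sum(P[s][j] * v[j] for j in range(n)) for s in range(n)]
    return v[start]


# Hypothetical chain: state 0 = attacker defeated, state 1 = defender defeated,
# state 2 = initial state in which both combatants are still alive.
P = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.3, 0.2, 0.5],
]
fitness = win_probability(P, win_states={0}, start=2)
```

In this toy chain the defender wins with probability 0.3 / (0.3 + 0.2) = 0.6, which the iteration approaches even though no discounting is applied, because the combat ends with probability 1.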
The competitive framework consists of two symmetrical blocks (refer to Figure 4): the left block denotes the evolutionary process of $\pi^1$, while the right block represents that of $\pi^2$. Both blocks use the CCEA to search the solution space, that is, each decision rule, f or g, is divided into N groups which are evaluated in turn. Each group, for example group i, denotes the probability vector on the actions of state i, that is, f(i) or g(i). Through the elite pair of the current generation, $(f^*, g^*)$, the two blocks of the framework interact in a competitive way, which enables the framework to lead the elite pair to converge to the decision rule pair of an EP.
The pseudo-code of the algorithm is shown as follows:
1) Create species $s_i^f$ for f(i) and $s_i^g$ for g(i), where $i \in \{1, 2, \ldots, N\}$;
2) Initialize the population of every species by using the uniform distribution, where each bit of a chromosome has the same probability of being 1;
3) Select an individual from each f(i) and g(i), randomly, to form the initial elite decision rule pair $(f^*, g^*)$;
4) Set generation t = 0;
5) Set i = 1;
6) Evaluate the individuals of species $s_i^f$ with $g^*$ as well as $f^*(j)$, where $j \ne i$ (maximize $v_\beta$);
7) Replace $f^*(i)$ by the current elite, if necessary;
8) Select the outstanding individuals, and construct the new probability distribution from them. The new generation can then be obtained by sampling this distribution and mutation;
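The alternating structure of the framework can be sketched in Python as below. This is a simplified illustration rather than the authors' algorithm: it assumes two actions per state so that each individual is a single probability, it replaces the bit-string chromosomes and the distribution-sampling step with truncation selection plus Gaussian mutation, and it mirrors steps 6) to 8) for g by minimizing the same value. The names `coevolve`, `step` and `value`, and the payoff matrix `M`, are hypothetical.

```python
import random


def step(pop, elite, score, rng):
    """Rank one species (elite included), keep the best, and breed a new generation."""
    ranked = sorted(pop + [elite], key=score, reverse=True)
    parents = ranked[:len(pop) // 2]
    children = [min(1.0, max(0.0, rng.choice(parents) + rng.gauss(0.0, 0.1)))
                for _ in pop]
    return ranked[0], children


def coevolve(value, n_states, generations=20, pop_size=8, seed=0):
    """Competitive loop: f maximizes value(f, g*), g minimizes value(f*, g)."""
    rng = random.Random(seed)
    f_pops = [[rng.random() for _ in range(pop_size)] for _ in range(n_states)]
    g_pops = [[rng.random() for _ in range(pop_size)] for _ in range(n_states)]
    f_star = [pop[0] for pop in f_pops]
    g_star = [pop[0] for pop in g_pops]
    for _ in range(generations):
        for i in range(n_states):
            def f_score(x):
                return value(f_star[:i] + [x] + f_star[i + 1:], g_star)
            f_star[i], f_pops[i] = step(f_pops[i], f_star[i], f_score, rng)

            def g_score(y):
                return -value(f_star, g_star[:i] + [y] + g_star[i + 1:])
            g_star[i], g_pops[i] = step(g_pops[i], g_star[i], g_score, rng)
    return f_star, g_star


# Hypothetical zero-sum payoff used in every state; row 0 and column 1 form a saddle point.
M = [[2.0, 1.0], [0.0, -1.0]]


def value(f, g):
    """Expected payoff to player 1 when f[i], g[i] are the probabilities of action 0."""
    return sum(x * y * M[0][0] + x * (1 - y) * M[0][1]
               + (1 - x) * y * M[1][0] + (1 - x) * (1 - y) * M[1][1]
               for x, y in zip(f, g))


f_star, g_star = coevolve(value, n_states=2)
```

In the paper's setting, `value` would be replaced by the win probability of the defender for the stationary strategy pair, and each individual would encode the full probability vector of one state.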

Experiments and Results
In this section, we use the proposed competitive coevolutionary framework to retrieve the EP of a battle group. A character of race A with level 3 is chosen as the defender, and a character of race C with the same level as the attacker. Also, for convenience, we define $A_1$, $A_2$ as the two skills of race A and, similarly, $C_1$, $C_2$ as those of race C. In accordance with the basic design contents mentioned before, we set the experiment environments as follows: We define the fighting round as the period of time from the attacker launching an attack to the end of his(her) waiting time. This definition implies that the attacker can attack only once in a round. Also, based on the experiment environments, we obtain the number of attacks of the defender when the players choose a certain skill pair and the attacker attacks in a certain range. The results are shown in Table 1.
According to those results and the optimal SM as well as the optimal SAO, the probability transition matrix defined by the skill pair $(A_i, C_j)$, where $i, j \in \{1, 2\}$, can be calculated (refer to Table 2); then, according to Formula 1, we can retrieve the probability transition matrix defined by a stationary strategy pair $(\pi^1, \pi^2)$. Consequently, such a strategy pair can be evaluated by $v_\beta(s_N, \pi^1, \pi^2)$. Besides the proposed algorithm, another one, whose blocks use the SEA instead of the CCEA, is introduced for comparison. Considering the stochastic nature of the algorithms, each of the programs is repeated 100 times. The results are shown in Figures 5 and 6.
According to the results in Figure 5, both algorithms can retrieve the EPs. Such data are very important because we can judge whether the designed contents are well-balanced based on them. Also, it can be seen from Figure 6 that the CCEA is much faster than the SEA and is hence a better solution than using the SEA.

Conclusion and Future Work
In this paper, we constructed a basic ARPG system and modeled the combat by using the zero-sum $\Gamma_\beta$ process. We noted that in the case of zero-sum $\Gamma_\beta$, an EP of the entire behavior strategy space can be retrieved just by searching the stationary strategy space, and that the overall discounted value vectors of all EPs coincide. Also, an evolutionary framework, which integrates competitive coevolution and the CCEA, has been proposed to search for the SUS pairs which are the EPs of the strategy space. Based on the framework, we carried out experiments in certain environments. The results showed that the EPs can be obtained and that, by using the CCEA, the efficiency and capability of the framework are improved significantly.
In future work, we will investigate another combat type in which the battle time is constrained; in such a type, the strategies of the players will no longer be stationary.
A simple example of the DCMDP is shown in Figure 1.

Figure 1. A discrete competitive Markov decision process with two states.

Figure 5. The fitness of the EP retrieved by each execution.

Figure 6. The computation time of retrieving the EP in each execution.