Topological Order Value Iteration Algorithm for Solving Probabilistic Planning

AI researchers typically formulate probabilistic planning under uncertainty using Markov Decision Processes (MDPs). Value Iteration is an inefficient algorithm for MDPs because it spends most of its effort backing up the entire state space, which is unnecessary in many cases. Many approaches have been proposed to overcome this problem; among them, LAO*, LRTDP and HDP are the state of the art. All of these use reachability analysis and heuristics to avoid some unnecessary backups. However, none of them fully exploits the graphical features of an MDP or uses these features to derive the best backup sequence over the state space. We introduce an improved algorithm named Topological Order Value Iteration (TOVI) that avoids unnecessary backups by detecting the structure of the MDP and backing up states in topological order. Experimental results demonstrate the effectiveness of our algorithm.


Introduction
In recent years, intelligent planning has developed into an important branch of artificial intelligence research, and planning under uncertainty in particular has attracted growing attention. Among the many approaches studied, probabilistic methods describe uncertain information more accurately, so they have received widespread attention and their solution methods have gradually matured. Probabilistic planning uses probability distributions to describe the uncertainty of the initial world state and of the effects of actions. In 2004, the International Planning Competition (IPC-4) added a track for probabilistic planning domains, which shows the importance of probabilistic planning within intelligent planning research.
Markov decision processes (MDPs) are a model for representing probabilistic planning problems. Value iteration and policy iteration are two fundamental dynamic programming algorithms for solving MDPs [1]. However, these two algorithms are sometimes inefficient: they spend too much time backing up states, often redundantly. Recently, several types of algorithms have been proposed to solve MDPs efficiently. The first type uses reachability information and heuristic functions to omit some unnecessary backups; examples are RTDP [2], LAO* [3], LRTDP [4] and HDP [5]. The second uses approximation methods to simplify the problem. The third aggregates groups of states of an MDP by features, represents them as a factored MDP and solves the factored MDP. Factored MDPs are often exponentially simpler, but the strategies for solving them are tricky; sLAO* [6] and sRTDP [7] are examples. Finally, one can use prioritization to decrease the number of inefficient backups; faster dynamic programming [8] and ranking policies in discrete Markov decision processes [9] are two recent examples.
In this paper we propose an improvement of the value iteration algorithm, named Topological Order Value Iteration (TOVI), which combines the first and last techniques. It decomposes an MDP into strongly connected components arranged in topological order and then uses value iteration to solve the components in that order, so it avoids computing a large number of useless states and backs up the remaining states in an orderly fashion. It performs backups in the best order and only when necessary. TOVI is not itself a heuristic algorithm, but it can make efficient use of existing heuristic functions to initialize value functions.
AI researchers typically use MDPs to formulate probabilistic planning problems. An MDP is defined as a four-tuple <S, A, T, C>, where S is a discrete set of states, A is a finite set of all applicable actions, T is the transition matrix describing the domain dynamics, and C denotes the cost of action transitions. The agent executes its actions in discrete time steps called stages. At each stage, the system is in one distinct state s ∈ S. The agent can pick any action a from a set of applicable actions Ap(s) ⊆ A, incurring a cost of C(s, a). The action takes the system to a new state s′ stochastically, with probability Ta(s′|s).
The horizon of an MDP is the number of stages over which costs are accumulated. There is a set of sink goal states G ⊆ S; reaching any of them terminates the execution. To solve the MDP we need to find an optimal policy π: S → A, an execution plan that reaches a goal state with the minimum expected cost. The value function of a policy π is defined as:
\[ V^{\pi}(s) = C(s, \pi(s)) + \sum_{s' \in S} T_{\pi(s)}(s' \mid s)\, V^{\pi}(s') \tag{1} \]
Any optimal policy must satisfy the following system of Bellman equations, which defines the optimal value function:
\[ V^{*}(s) = \min_{a \in Ap(s)} \Big[ C(s, a) + \sum_{s' \in S} T_{a}(s' \mid s)\, V^{*}(s') \Big] \tag{2} \]
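As a rough illustration of the four-tuple and of a single Bellman backup, the sketch below stores an MDP as a nested dictionary and evaluates the right-hand side of the optimal Bellman equation for one state. The dictionary layout, the function name `bellman_backup` and the two-state toy problem are assumptions made for this example, not part of the original formulation.

```python
from typing import Dict, List, Set, Tuple

State = str
Action = str
# Assumed layout: mdp[s][a] = (C(s, a), [(s', T_a(s'|s)), ...])
MDP = Dict[State, Dict[Action, Tuple[float, List[Tuple[State, float]]]]]


def bellman_backup(mdp: MDP, goals: Set[State], V: Dict[State, float], s: State) -> float:
    """Return min over applicable actions of C(s, a) + sum_s' T_a(s'|s) * V(s')."""
    if s in goals:
        return 0.0  # goal states are sinks and accumulate no further cost
    return min(
        cost + sum(p * V[s2] for s2, p in outcomes)
        for cost, outcomes in mdp[s].values()
    )


# Hypothetical toy problem: "safe" reaches the goal for cost 2,
# "risky" costs 1 but only succeeds half of the time.
toy_mdp: MDP = {
    "s0": {
        "safe": (2.0, [("g", 1.0)]),
        "risky": (1.0, [("g", 0.5), ("s0", 0.5)]),
    },
    "g": {},
}
V0 = {"s0": 0.0, "g": 0.0}
print(bellman_backup(toy_mdp, {"g"}, V0, "s0"))  # prints 1.0
```

With all values initialized to zero, the backup picks the cheaper immediate cost; later backups would raise the value of "risky" as V(s0) grows.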

Dynamic Programming
Most optimal MDP algorithms are based on dynamic programming. Its usefulness was first demonstrated by a simple yet powerful algorithm named value iteration [10]. Value iteration first initializes the value function arbitrarily. Its basic idea is to iteratively update the value function of every state until the values converge. In each iteration, the value function is updated according to Equation 2. We call one such update a Bellman backup. The Bellman residual of a state s is defined as the difference between the values of s in two consecutive iterations. The Bellman error is defined as the maximum Bellman residual over the state space. When the Bellman error falls below some threshold, we conclude that the value function has converged sufficiently.
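The loop described above can be sketched as follows. This is a minimal version that assumes a dictionary-based MDP, a zero initial value function and an undiscounted cost-to-goal setting; the function name `value_iteration`, the `epsilon` parameter and the toy problem are illustrative choices.

```python
def value_iteration(mdp, goals, epsilon=1e-6):
    """Back up every state in each sweep until the Bellman error drops below epsilon.

    Assumed layout: mdp[s][a] = (cost, [(next_state, probability), ...]).
    """
    V = {s: 0.0 for s in mdp}  # arbitrary initialization; zeros are used here
    sweeps = 0
    while True:
        new_V = {}
        bellman_error = 0.0
        for s in mdp:
            if s in goals:
                new_V[s] = 0.0
            else:
                # One Bellman backup of state s.
                new_V[s] = min(
                    cost + sum(p * V[t] for t, p in outcomes)
                    for cost, outcomes in mdp[s].values()
                )
            # Bellman residual of s; the error is the maximum residual.
            bellman_error = max(bellman_error, abs(new_V[s] - V[s]))
        V = new_V
        sweeps += 1
        if bellman_error < epsilon:
            return V, sweeps


# Hypothetical toy problem whose optimal value for s0 is 2.
toy_mdp = {
    "s0": {
        "safe": (2.0, [("g", 1.0)]),
        "risky": (1.0, [("g", 0.5), ("s0", 0.5)]),
    },
    "g": {},
}
values, sweeps = value_iteration(toy_mdp, {"g"})
print(values["s0"], sweeps)
```

Every sweep touches every state, including those whose values have already settled, which is the cost that ordering-based methods try to avoid.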
The main drawback of value iteration is that, in each iteration, the value of every state is updated, which is often unnecessary. First, some states are backed up before their successor states, and this type of backup is often fruitless. Second, different states converge at different rates. When only a few states have not converged, we may need to back up only a subset of the state space in the next iteration.

Topological Order Value Iteration
We study the sequence of state backups in light of an MDP's graphical structure, which is an intrinsic property of the MDP and potentially determines the complexity of solving it [11]. Our first observation is that states and their value functions are causally related. If, in an MDP M, state s′ is a successor of state s after applying action a, then V(s) depends on V(s′). For this reason, we want to back up s′ ahead of s. The causal relation is transitive. When the dependence is cyclic, so that two states depend on each other, no such ordering exists between them, and mutually dependent states have to be solved together.
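The effect of backup order can be seen on a small deterministic chain s0 → s1 → s2 → g with unit action costs. The sketch below is an illustration only: the chain, the function name `sweeps_until_converged` and the in-place update scheme are assumptions made for this example. It counts the sweeps needed when states are backed up successors-first versus predecessors-first.

```python
def sweeps_until_converged(order, successor, goal, step_cost=1.0, epsilon=1e-9):
    """In-place sweeps over `order`; returns (number of sweeps, final values)."""
    V = {s: 0.0 for s in order}
    V[goal] = 0.0
    sweeps = 0
    while True:
        error = 0.0
        for s in order:
            # Deterministic backup: one action, one successor.
            new_value = step_cost + V[successor[s]]
            error = max(error, abs(new_value - V[s]))
            V[s] = new_value
        sweeps += 1
        if error < epsilon:
            return sweeps, V


successor = {"s0": "s1", "s1": "s2", "s2": "g"}

# Successors first: s2 is backed up before s1, and s1 before s0.
good_sweeps, good_V = sweeps_until_converged(["s2", "s1", "s0"], successor, "g")
# Predecessors first: each sweep propagates information by only one step.
bad_sweeps, bad_V = sweeps_until_converged(["s0", "s1", "s2"], successor, "g")

print(good_sweeps, bad_sweeps)  # prints 2 4
```

Both orders reach the same values; the successors-first order needs one productive sweep plus one confirming sweep, while the reverse order needs one sweep per link in the chain before it can confirm convergence.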
Topological Order Value Iteration solves an MDP by exploiting the problem's graphical structure. Given an MDP, TOVI first builds a directed reachability graph Gsr, which has one vertex per state s ∈ S. A directed edge from vertex s1 to s2 exists if there is an action a such that Ta(s2|s1) > 0. TOVI then finds all the strongly connected components of Gsr and the topological order of the components. Then it solves every connected component individually, by value iteration, according to their topological order. Figure 1 shows the graphical representation of one simple MDP that has 7 states and 12 actions. In the figure, successors of probabilistic actions are connected by an arc. For simplicity, transition probabilities Ta and costs C(s, a) are omitted. Using TOVI, we can divide the MDP into two connected components C1 and C2. Based on the remaining actions, C1 and C2 can be subdivided into three and two smaller components respectively. By decomposing an MDP into smaller components, TOVI can converge much faster than VI.
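The construction of the reachability graph can be sketched in a few lines. The dictionary layout of the MDP and the name `build_reachability_graph` are assumptions for this example; the only rule taken from the text is that an edge s1 → s2 exists when some action has Ta(s2|s1) > 0.

```python
def build_reachability_graph(mdp):
    """Return {state: set of successor states} for a dictionary-based MDP.

    Assumed layout: mdp[s][a] = (cost, [(next_state, probability), ...]).
    """
    graph = {s: set() for s in mdp}  # one vertex per state
    for s, actions in mdp.items():
        for _cost, outcomes in actions.values():
            for s2, probability in outcomes:
                if probability > 0:  # zero-probability outcomes add no edge
                    graph[s].add(s2)
    return graph


# Hypothetical three-state problem used only to exercise the function.
toy_mdp = {
    "s0": {"risky": (1.0, [("s1", 0.5), ("s0", 0.5)])},
    "s1": {"move": (1.0, [("g", 1.0), ("s0", 0.0)])},
    "g": {},
}
graph = build_reachability_graph(toy_mdp)
print(sorted(graph["s0"]), sorted(graph["s1"]), sorted(graph["g"]))
```

Costs and exact probabilities are discarded at this stage, since the decomposition depends only on which transitions are possible.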
We use Kosaraju's algorithm to detect the strongly connected components of a directed graph and their topological order [12]. Note that Bonet and Geffner used Tarjan's algorithm to detect strongly connected components in their solver [5], but they do not use the topological order of these components to systematically back up each component of an MDP. Kosaraju's algorithm is simple to implement and its time complexity is linear in the size of the reachability graph (its states and edges), so when the state space is large, the overhead of ordering the state backup sequence is acceptable. Our experimental results also demonstrate that the overhead is well compensated by the computational gain. The pseudocode of TOVI is shown in Algorithm 1. We first use Kosaraju's algorithm to find the set of strongly connected components C in graph Gsr and their sequential order. Note that each c ∈ C maps to a set of states in M. We then use value iteration to solve each c. Since there are no cycles among the components, each component needs to be solved only once.
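The overall procedure can be sketched as below. This is not the authors' Algorithm 1; it is a minimal reading of the description above, assuming a dictionary-based MDP, zero initial values, in-place backups inside each component, and at least one action for every non-goal state. The names `kosaraju_scc` and `tovi` and the toy problem are illustrative.

```python
def kosaraju_scc(graph):
    """Strongly connected components, listed in topological order (sources first)."""
    visited, finish_order = set(), []
    for root in graph:  # pass 1: iterative DFS recording finish order
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all successors explored: node is finished
                finish_order.append(node)
                stack.pop()
    reverse = {s: [] for s in graph}  # transposed graph
    for s, succs in graph.items():
        for t in succs:
            reverse[t].append(s)
    assigned, components = set(), []
    for root in reversed(finish_order):  # pass 2: DFS on the transposed graph
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = [], [root]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def tovi(mdp, goals, epsilon=1e-6):
    """Solve components one at a time, starting with those closest to the goals."""
    graph = {s: {t for _c, outs in acts.values() for t, p in outs if p > 0}
             for s, acts in mdp.items()}
    components = kosaraju_scc(graph)
    V = {s: 0.0 for s in mdp}
    # V(s) depends on successors, so components are processed sinks first.
    for component in reversed(components):
        while True:
            error = 0.0
            for s in component:
                if s in goals:
                    continue
                new_value = min(c + sum(p * V[t] for t, p in outs)
                                for c, outs in mdp[s].values())
                error = max(error, abs(new_value - V[s]))
                V[s] = new_value
            if error < epsilon:
                break
    return V, components


# Hypothetical MDP with components {a}, {b}, {c, d} and {g}.
toy_mdp = {
    "a": {"go": (1.0, [("b", 0.5), ("c", 0.5)])},
    "b": {"loop": (1.0, [("b", 0.5), ("g", 0.5)])},
    "c": {"try": (1.0, [("d", 1.0)])},
    "d": {"back": (1.0, [("c", 0.5), ("g", 0.5)])},
    "g": {},
}
V, components = tovi(toy_mdp, {"g"})
print({s: round(v, 3) for s, v in V.items()})
```

Kosaraju's second pass yields components with sources first, so the sketch iterates over them in reverse; a component containing a single acyclic state converges after one backup and one confirming pass.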

Experiment
The experimental results show that TOVI outperforms the other three algorithms on most domains. It converges quickly because it backs up states only along the computed sequence, which avoids a large number of useless state computations. However, in the Single Arm Pendulum and TireWorld test domains, the time TOVI spends partitioning the state graph and ordering the connected components takes up a larger share of the overall running time, so it performs worse than LAO* and LRTDP.

Conclusions
We have introduced and analyzed an MDP solver for probabilistic planning, Topological Order Value Iteration, which studies the dependence relation among the value functions of the state space and uses it to decide the order in which states are backed up. The algorithm is based on the idea that different MDPs have different graphical structures, and that the graphical structure of an MDP intrinsically determines the complexity of solving that MDP. We observe that no current solver detects this information and uses it to guide state backups. Thus, they solve MDPs of the same problem size but with different graphical structures using almost the same strategies. In this sense, they are not "intelligent". Topological Order Value Iteration is proposed to solve this problem. It is guaranteed to find the optimal solution of a Markov decision process by solving its components sequentially.
Topological Order Value Iteration is also a flexible algorithm that can use initial-state information and apply reachability analysis. Our results show that TOVI is especially useful for MDPs with many connected components. As the number of layers increases, the complexity of TOVI grows more slowly than that of the other algorithms, which shows that TOVI is well suited to solving MDPs with layered structure.

Figure 1. A simplified MDP and its set of strongly connected components.

Table 1. All running times are in seconds; fastest times are bolded. BC size means the size of the biggest connected component. "-" means that the algorithm failed to solve the problem within 5 minutes.