
AI researchers typically use MDPs to formulate probabilistic planning problems. An MDP is defined as a four-tuple <S, A, T, C>, where S is a discrete set of states, A is a finite set of all applicable actions, T is the transition matrix describing the domain dynamics, and C denotes the cost of action transitions. The agent executes its actions in discrete time steps called stages. At each stage, the system is in one distinct state s ∈ S. The agent can pick any action a from the set of applicable actions Ap(s) ⊆ A, incurring a cost of C(s, a). The action takes the system to a new state s′ stochastically, with probability Ta(s′|s).
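To make the four-tuple concrete, the following minimal Python sketch encodes a toy instance of <S, A, T, C> as plain dictionaries; the state and action names are hypothetical and chosen only for illustration.

```python
# A toy MDP encoded as plain dictionaries (hypothetical example).
S = {"s0", "s1", "g"}                      # discrete set of states
G = {"g"}                                  # sink goal states (introduced below)
Ap = {"s0": {"a1", "a2"},                  # Ap(s): applicable actions per state
      "s1": {"a1"},
      "g": set()}                          # goal states have no applicable actions
T = {                                      # T[a][s][s'] = Ta(s'|s)
    "a1": {"s0": {"s1": 0.8, "s0": 0.2},
           "s1": {"g": 1.0}},
    "a2": {"s0": {"g": 0.5, "s0": 0.5}},
}
C = {("s0", "a1"): 1.0,                    # C[(s, a)]: cost of action a in state s
     ("s0", "a2"): 3.0,
     ("s1", "a1"): 1.0}
```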

The horizon of an MDP is the number of stages for which costs are accumulated. There is a set of sink goal states G ⊆ S, and reaching any of them terminates the execution. To solve the MDP we need to find an optimal policy π: S → A, a probabilistic execution plan that reaches a goal state with the minimum expected cost. Any optimal policy must satisfy the following system of Bellman equations. The value function of a policy π is defined as:

V^π(s) = C(s, π(s)) + γ ∑_{s′∈S} T_{π(s)}(s′|s) V^π(s′),   γ ∈ [0, 1]        (1)

and the optimal value function is deﬁned as:

V*(s) = min_{a∈Ap(s)} [ C(s, a) + γ ∑_{s′∈S} T_a(s′|s) V*(s′) ],   γ ∈ [0, 1]        (2)
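In code, a single application of Equation (2) to one state is short. The sketch below assumes the dictionary encoding of the toy MDP above (T[a][s] maps each successor s′ to Ta(s′|s)); the function name is ours.

```python
def bellman_backup(s, V, Ap, T, C, gamma=1.0):
    """Apply Equation (2) once to state s: the minimum, over applicable
    actions, of immediate cost plus expected (discounted) successor value."""
    return min(
        C[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[a][s].items())
        for a in Ap[s]
    )
```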

2.2. Dynamic Programming

Most optimal MDP algorithms are based on dynamic programming. Its usefulness was first demonstrated by a simple yet powerful algorithm named value iteration [10]. Value iteration first initializes the value function arbitrarily. Its basic idea is to iteratively update the value function of every state until it converges, and in each iteration the value function is updated according to Equation (2). We call one such update a Bellman backup. The Bellman residual of a state s is defined as the difference between the value functions of s in two consecutive iterations. The Bellman error is defined as the maximum Bellman residual over the state space. When the Bellman error is less than some threshold value, we conclude that the value function has converged sufficiently.
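A minimal sketch of value iteration over the dictionary encoding above, using the Bellman error as the stopping criterion (the threshold epsilon is a hypothetical default):

```python
def value_iteration(S, G, Ap, T, C, gamma=1.0, epsilon=1e-6):
    """Back up every non-goal state per Equation (2) until the Bellman
    error (maximum residual over the state space) drops below epsilon."""
    V = {s: 0.0 for s in S}                  # arbitrary initialization
    while True:
        error = 0.0
        for s in S:
            if s in G:                       # goal states stay at cost 0
                continue
            new_v = min(
                C[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[a][s].items())
                for a in Ap[s]
            )
            error = max(error, abs(new_v - V[s]))   # Bellman residual of s
            V[s] = new_v
        if error < epsilon:
            return V
```

On the toy MDP above, value_iteration(S, G, Ap, T, C) converges to the minimum expected cost of reaching the goal from each state.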

The main drawback of the value iteration algorithm is that, in each iteration, the value functions of every state are updated, which is often unnecessary. Firstly, some states are backed up before their successor states, and this type of backup is often fruitless. Secondly, different states converge at different rates. When only a few states have not converged, we may only need to back up a subset of the state space in the next iteration.

3. Topological Order Value Iteration

We have studied the sequence of state backups according to an MDP's graphical structure, which is an intrinsic property of an MDP and potentially determines the complexity of solving it [11]. Our first observation is that states and their value functions are causally related. If, in an MDP M, a state s′ is a successor state of s after applying action a, then V(s) depends on V(s′). For this reason, we want to back up s′ ahead of s. The causal relation is transitive.

Topological Order Value Iteration (TOVI) solves an MDP by using the problem's graphical structure wisely. Given an MDP, TOVI first builds a directed reachability graph Gsr, which has one vertex per state s ∈ S. A directed edge from vertex s1 to s2 exists if there is an action a such that Ta(s2|s1) > 0. TOVI then finds all the strongly connected components of Gsr, together with the topological order of those components. It then solves each connected component individually, by value iteration, according to their topological order. Figure 1 shows the graphical representation of a simple MDP that has 7 states and 12 actions. In the figure, successors of probabilistic actions are connected by an arc. For simplicity, the transition probabilities Ta and costs C(s, a) are omitted. Using TOVI, we can divide the MDP into two connected components, C1 and C2. Based on the remaining actions, C1 and C2 can be subdivided into three and two smaller components, respectively. By decomposing an MDP into smaller components, TOVI can converge much faster than VI.
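The following sketch shows the overall procedure under the same dictionary encoding. It assumes a helper kosaraju_scc(edges) that returns the strongly connected components in topological order (a sketch of it follows the next paragraph); components are solved sink-first, so every state's successors outside its own component are already converged when that component is backed up.

```python
def tovi(S, G, Ap, T, C, gamma=1.0, epsilon=1e-6):
    """Sketch of Topological Order Value Iteration: decompose the
    reachability graph into strongly connected components and run
    value iteration on one component at a time."""
    # Reachability graph Gsr: edge s1 -> s2 iff some action has Ta(s2|s1) > 0.
    edges = {s: set() for s in S}
    for s in S:
        for a in Ap[s]:
            edges[s].update(T[a][s].keys())

    components = kosaraju_scc(edges)   # assumed helper, sketched below
    V = {s: 0.0 for s in S}
    for comp in reversed(components):  # successor components first
        while True:                    # plain VI restricted to this component
            error = 0.0
            for s in comp:
                if s in G:             # goal states stay at cost 0
                    continue
                new_v = min(
                    C[(s, a)]
                    + gamma * sum(p * V[s2] for s2, p in T[a][s].items())
                    for a in Ap[s]
                )
                error = max(error, abs(new_v - V[s]))
                V[s] = new_v
            if error < epsilon:
                break
    return V
```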

We use Kosaraju's algorithm [12] to detect the topological order of the strongly connected components in a directed graph. Note that Bonet and Geffner used Tarjan's algorithm to detect the strongly connected components of a directed graph in their solver [5], but they do not use the topological order of these components to systematically back up each component of an MDP. Kosaraju's algorithm is simple to implement and its time complexity is linear in the size of the graph, so even when the state space is large, the overhead of ordering the state backups is acceptable. Our experimental results also demonstrate that this overhead is well compensated by the computational gain.
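For completeness, a compact sketch of Kosaraju's algorithm: one depth-first pass over the graph to record finish order, then a second pass over the transposed graph to peel off components. Returning components with source components first yields the topological order consumed by the tovi sketch above.

```python
def kosaraju_scc(edges):
    """Kosaraju's algorithm.  `edges` maps each vertex to its successor
    set.  Returns the strongly connected components in topological order
    of the component graph (source components first)."""
    order, seen = [], set()

    def dfs1(u):                       # pass 1: record finish order on G
        seen.add(u)
        for v in edges[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)

    for u in edges:
        if u not in seen:
            dfs1(u)

    rev = {u: set() for u in edges}    # transpose of G
    for u in edges:
        for v in edges[u]:
            rev[v].add(u)

    comps, assigned = [], set()

    def dfs2(u, comp):                 # pass 2: each DFS tree on the
        assigned.add(u)                # transpose, rooted in reverse finish
        comp.append(u)                 # order, is one strongly connected
        for v in rev[u]:               # component
            if v not in assigned:
                dfs2(v, comp)

    for u in reversed(order):
        if u not in assigned:
            comps.append([])
            dfs2(u, comps[-1])
    return comps
```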

Figure 1. A simplified MDP and its set of strongly connected components.
