_{1}

This paper considers the variance optimization problem of average reward in continuous-time Markov decision process (MDP). It is assumed that the state space is countable and the action space is Borel measurable space. The main purpose of this paper is to find the policy with the minimal variance in the deterministic stationary policy space. Unlike the traditional Markov decision process, the cost function in the variance criterion will be affected by future actions. To this end, we convert the variance minimization problem into a standard (MDP) by introducing a concept called pseudo-variance. Further, by giving the policy iterative algorithm of pseudo-variance optimization problem, the optimal policy of the original variance optimization problem is derived, and a sufficient condition for the variance optimal policy is given. Finally, we use an example to illustrate the conclusion of this paper.

The Postal Service Company’s catalogue information system, inventory issues, and supply chain management issues are all early successful applications of the Markov decision process. Later, many real-life problems, such as sequential assignments, machine maintenance issues, and secretarial issues, can be described as dynamic Markov Decision Processes (MDP) model. They are finally solved very well by MDP [

This paper considers the variance optimization problem of average reward in continuous-time Markov decision process. It is assumed that the state space is countable and the action space is Borel measurable space. The main purpose of this paper is to find the policy with the minimal variance in the deterministic stationary policy class, which is different from the mean-variance criterion problem. The study of the mean-variance criterion problem is generally based on the discount criterion or the average criterion. In the literature of MDPs, many studies focus on the problem of expected reward optimization in finite stage, the discounted MDP in infinite stage and the average reward problem in infinite stage [

Variance is an important performance metric of stochastic systems. In financial engineering, we use the mean to measure the expected return, and the variance to measure the risk. The mean-variance problem of the portfolio can be traced back to Markowitz [

V T ( i , f ) = E i f { ∫ 0 T r ( X ( t ) , f ) d t } , (1.1)

The variance of f, σ 2 ( i , f ) , is given by:

σ 2 ( i , f ) = lim T → ∞ 1 T E i f { ∫ 0 T r ( X ( t ) , f ) d t − V T ( i , f ) } 2 (1.2)

However, the variance function of average expected return of the continuous-time in this paper is given by

η σ f = lim T → ∞ 1 T E i f { ∫ 0 T ( r ( X ( t ) , f ) − η f ) 2 d t } , (1.3)

The long-run expected average reward is defined as below.

η f = lim T → ∞ 1 T E i f { ∫ 0 T r ( X ( t ) , f ) d t } . (1.4)

The main work of this paper is to find the iterative algorithm of the optimal policy under the variance criterion (minimum variance) in the countable state space and the Borel measurable action space. For countable state space, the reward function r ( i , a ) may be unbounded, the expected average reward η f , may not be infinite. To guarantee the finiteness of η f , we will impose the following Assumption 1. Next, we use a unique invariant probability measure of Markov chain to denote the average expected return and variance. To this end, we will impose the following Assumption 2, 3, 4. Suppose that Assumptions 1, 2, 3, and 4 are satisfied. We have established a variance criterion. Under the variance criterion, we define the cost function m ( i , a ) = ( r ( i , a ) − η f ) 2 , where r ( i , a ) is the system reward at the current stage with state i and action a, and η f is the expected average reward. Obviously, the cost will be affected by future actions, so, η f is also affected by future actions. The traditional MDPs differs from this. The cost function and state transition probability depend only on the current state and the action selected on this stage. Therefore, the conclusions in [

The control model associated with the continuous-time MDP that we are concerned with is the five-tuple

{ S , ( A ( i ) ⊆ A , i ∈ S ) , q ( . | i , a ) , r ( i , a ) } (2.1)

1) A denumerable set S, called the stated space, which is the set of all the states of the system under observation.

2) A Borel space A, called the action space. Let

K : = { ( i , a ) | i ∈ S , a ∈ A ( i ) } . (2.2)

be the set of all feasible state-action pairs.

3) The transition rates q ( j | i , a ) which satisfy q ( j | i , a ) ≥ 0 for all ( i , a ) ∈ K and j ≠ i . Moreover, we assume that the transition rates q ( j | i , a ) are conservative, i.e.,

∑ j ∈ S q ( j | i , a ) = 0 ∀ ( i , a ) ∈ K (2.3)

and stable, which means that

q ∗ ( i ) : = sup a ∈ A ( i ) q i ( a ) < ∞ ∀ i ∈ S (2.4)

where q i ( a ) : = − q ( i | i , a ) ≥ 0 for all ( i , a ) ∈ K . In addition, ∑ j ∈ S q ( j | i , a ) is measurable in a ∈ A ( i ) for each fixed i , j ∈ S .

4) A measurable real-valued function r ( i , a ) on K. called the reward function, which is assumed to be measurable in a ∈ A ( i ) for each fixed i ∈ S .

The above model is a classical continuous-time MDP model [

Definition 1. A deterministic stationary policy is a function f : S → A ( i ) such that f ( i ) is in A ( i ) for all i ∈ S . A deterministic stationary policy is simply referred to as a stationary policy.

Let F be the set of all deterministic stationary policies.

For each f ∈ F , the associated transition rates are defined as

q ( j | i , f ) : = q ( j | i , f ( i ) ) ∀ i , j ∈ S , (2.5)

the reward function is given by

r ( i , f ) : = r ( i , f ( i ) ) ∀ i ∈ S . (2.6)

Under Assumption 1, the transition function p f ( i , t , j ) is regular [

Assumption 1:

a) There exist a nondecreasing function ω ≥ 1 , on S, and constants c 1 > 0 , and b 1 > 0 , such that

∑ j ∈ S ω ( j ) q ( j | i , a ) ≤ − c 1 ω ( i ) + b 1 δ i 0 ,

for all ( i , a ) ∈ K , where δ 00 = 1 , δ i 0 = 0 , i ≠ 0 .

b) q ∗ ( i ) ≤ L 0 ω ( i ) for all i ∈ S , with L 0 > 0 and q ∗ ( i ) as in (2.4).

c) | r ( i , a ) | ≤ M ω ( i ) , for all ( i , a ) ∈ K , with some M > 0 .

d) The action set A ( i ) is compact for each i ∈ S , the functions r ( i , a ) , q ( j | i , a ) and ∑ k ∈ S ω ( k ) q ( k | i , a ) are all continuous in a ∈ A ( i ) for each fixed i , j ∈ S .

e) There exists a nonnegative function ω ′ on S and constants c ′ > 0 , b ′ > 0 , and L 1 > 0 such that

q ∗ ( i ) ω ( i ) ≤ L 1 ω ′ ( i ) and ∑ j ∈ S ω ′ ( j ) q ( j | i , a ) ≤ c ′ ω ′ ( i ) + b ′ (2.7)

For all ( i , a ) ∈ K , with K and q ∗ ( i ) as in (2.2) and (2.4), respectively.

For all f ∈ F and an arbitrary initial state i ∈ S , there is a unique probability space ( Ω , B ( Ω ) , P i f ) , where the probability measure P i f is determined by f and p f ( i , t , j ) . E i f is denoted as the expectation operator P i f . Define the expected average reward and variance respectively.

η f ( i ) = lim T → ∞ 1 T E i f { ∫ 0 T r ( X ( t ) , f ) d t } , (2.8)

η σ f ( i ) = lim T → ∞ 1 T E i f { ∫ 0 T ( r ( X ( t ) , f ) − η f ) 2 d t } . (2.9)

Remark 1. From (2.9), we can see that the definition of η σ f is different from the definition of the continuous-time MDP average reward variance criterion in ( [

Let’s give some marks first. For any measurable function ω ≥ 1 on S, we define the ω -weighted supremum norm ‖ . ‖ ω of a real-valued measurable function ω on S by

‖ u ‖ ω : = sup i ∈ S | u ( i ) | ω ( i ) , (2.10)

and the Banach space B ω ( S ) : = { u : ‖ u ‖ ω < ∞ } . Similarly, we can define B ω 2 ( S ) .

We will use the Markov chain invariant measure to represent Equation (2.8) and Equation (2.9). To this end, we impose the following three Assumptions (see [

Assumption 2:

For each f ∈ F , the corresponding Markov process { X ( t ) } with transition function p f ( i , t , j ) is irreducible, which means that, for any two states i ≠ j , there exists a set of distinct states i = i 1 , ⋯ , i m such that

q ( i 2 | i 1 , f ) ⋯ q ( j | i m , f ) > 0 . (2.11)

Under Assumptions 1(a) and 2, for each f ∈ F , Propositions C.11 and C.12 yield that the Markov chain { X ( t ) } has a unique invariant probability measure, denoted by μ f , which satisfies that μ f ( j ) = lim t → ∞ p f ( i , t , j ) (independent of i ∈ S ) for all j ∈ S . Thus, by [

μ f ( ω ) : = ∑ j ∈ S ω ( j ) μ f ( j ) ≤ b 1 c 1 , (2.12)

which shows that the μ f -expectation of ω (i.e., μ f ( ω ) ) is finite. Therefore, for all f ∈ F and u ∈ B ω ( S ) , the inequality | u ( i ) | ≤ ‖ u ‖ ω ω ( i ) for all i ∈ S gives that the expectation

μ f ( u ) : = ∑ i ∈ S u ( i ) μ f ( i ) (2.13)

exists and is finite.

Assumption 3:

With ω as in Assumption 1, assume the following conditions are true:

a) There exists constants c ′ > 0 , b ′ > 0 , such that

∑ j ∈ S ω 2 ( j ) q ( j | i , a ) ≤ − c 2 ω 2 ( i ) + b 2 (2.14)

For all ( i , a ) ∈ K .

b) There exists a nonnegative function ω ″ on S and constants c ″ > 0 , b ″ > 0 and L 2 ≥ 0 , such that

q ∗ ( i ) ω 2 ( i ) ≤ L 2 ω ″ ( i ) , ∑ j ∈ S ω ″ ( j ) q ( j | i , a ) ≤ − c ″ ω ″ ( i ) + b ″ (2.15)

for all ( i , a ) ∈ K , with K and q ∗ ( i ) as in (2.2) and (2.4).

Assumption 4:

a) The control model (2.1) is uniformly ω-exponentially ergodic, which means the following: there exist constants β > 0 , and L 3 > 0 δ > 0 such that

sup f ∈ F | E i f u ( X ( t ) ) − μ f ( u ) | ≤ L 3 e − β t ‖ u ‖ ω ω ( i ) (2.16)

for all i ∈ S , u ∈ B ω ( S ) , and t ≥ 0 .

b) The control model (2.1) is uniformly ω 2 -exponentially ergodic, the definition as in (a)

Remark 2. Under the premise of the above assumptions, it can be known from the literature [

η f = μ f r : = ∑ i ∈ S r ( i , f ) μ f ( i ) , (2.17)

η σ f = μ f m : = ∑ i ∈ S m ( i , f ) μ f ( i ) . (2.18)

We denote r as an S-dimensional column vector composed by element r ( i , f ) and m as an S-dimensional column vector composed by element m ( i , f )

where m ( i , f ) : = ( r ( i , f ) − η f ) 2 ∀ i ∈ S , f ∈ F . (2.19)

Our optimization goal is to select f ∗ ∈ F that satisfies the following condition

η σ f ∗ = min f ∈ F η σ f (2.20)

By (2.20), the variance minimization problem of Markov chains can be defined as below.

f ∗ = arg min f ∈ F { η σ f } = arg min f ∈ F { μ f m } (2.21)

Remark 3. From (2.21), we see that the value η f will be affected by future actions. There the problem (2.21) is different from standard MDP.

Even if we consider m as a cost function, we can’t directly use the existing conclusions to get the optimal policy.

In this section, we will define a pseudo-variance minimization problem. By proving the relation between the pseudo-variance and the variance, the optimization problem of (2.21) is transformed into the pseudo-variance optimization problem. Further, the optimal policy for variance optimization problem can be derived by the policy iterative algorithm for the pseudo-variance optimization problem, and we can give a sufficient condition for the variance optimal policy.

We define a new cost function as below.

m λ ( i , f ) = ( r ( i , f ) − λ ) 2 ∀ i ∈ S , f ∈ F . (3.1)

where λ is a given constant. We denote m λ as an S-dimensional column vector composed by element m λ ( i , f ) and we have

m λ : = ( r − λ Ι ) 2 (3.2)

where Ι denote an S-dimensional column vector composed by element 1. We define pseudo-variance function as below

η σ λ f = μ f m λ . (3.3)

Obviously, we have

η σ λ f = η σ f , when λ = η f .

the pseudo-variance minimization problem of Markov chains can be defined as below.

f λ * = arg min f ∈ F { η σ λ f } = arg min f ∈ F { μ f m λ } = arg min f ∈ F ∑ i ∈ S μ f ( i ) ( r ( i , f ) − λ ) 2 (3.4)

From (3.4), we can see that m λ is an instant cost and it has no relation to future actions, Below, we study the relation between these two problems (2.21) and (3.4). First, we have the following lemma about the relation between η σ λ f and η σ f .

Lemma 1. For all f ∈ F , the corresponding variance and the pseudo-variance has the following relation

η σ λ f = η σ f + ( η f − λ ) 2 . (3.5)

Proof: From (3.1) and (3.3)

η σ λ f = ∑ i ∈ S μ f ( i ) ( r ( i , f ) − λ ) 2 = ∑ i ∈ S μ f ( i ) ( r ( i , f ) − η f + η f − λ ) 2 = ∑ i ∈ S μ f ( i ) [ ( r ( i , f ) − η f ) 2 + ( η f − λ ) 2 + 2 ( r ( i , f ) − η f ) ( η f − λ ) ] = η σ f + ( η f ) 2 + λ 2 − 2 η f λ + 2 ( η f ) 2 − 2 η f λ − 2 ( η f ) 2 + 2 η f λ = η σ f + ( η f − λ ) 2

The lemma is proved. □

Below we discuss how to solve the pseudo-variance minimum problem. Because (3.4) is a traditional MDP optimization problem, we can solve the problem with the policy iterative algorithm (3.4). Before using the policy iterative algorithm to solve the problem (3.4), we need to prove the existence of the pseudo-variance optimal policy. We suppose that Assumption 1, 2, 3, and 4 are all satisfied, we give the following theorems and lemmas.

Theorem 1. A pair ( g ∗ , u ) ∈ ℝ × B ω 2 ( S ) is said to be a solution to the pseudo-variance of average-reward optimality equation if

g ∗ = inf a ∈ A ( i ) { ( r ( i , a ) − λ ) 2 + ∑ j ∈ S u ( j ) q ( j | i , a ) } ∀ i ∈ S (3.6)

Lemma 2. Suppose that Assumptions 1, 2, 3, and 4 are satisfied. Consider an arbitrary fixed state i 0 ∈ S . Then, for all f ∈ F and discount factors α > 0 , the relative differences of the discounted-reward function η α f , namely,

u α f ( i ) = η α f ( i ) − η α f ( i 0 ) i ∈ S (3.7)

are uniformly ω-bounded in α > 0 and f ∈ F . More precisely, we have

‖ u α f ‖ ω 2 ≤ L 3 ( M + | λ | ) 2 δ [ 1 + ω 2 ( i 0 ) ] α > 0 , f ∈ F . (3.8)

where η α f ( i ) = E i f [ ∫ 0 ∞ e − α t ( r ( X ( t ) , f ) − λ ) 2 d t ] . (3.9)

Prove: According to the literature [

where m λ ( i , f ) = ( r ( i , f ) − λ ) 2 ∀ i ∈ S , f ∈ F .

Theorem 2. Suppose that Assumptions 1, 2, 3, and 4 hold. Then:

There exists a solution ( g ∗ , u ) ∈ ℝ × B ω 2 ( S ) to pseudo variance of average-reward optimality equation. Moreover, the constant g ∗ coincides with the optimal average reward function η σ λ ∗ , i.e.

g ∗ = η σ λ ∗ ( i ) ∀ i ∈ S (3.10)

Prove: Our assumptions ensure the existence of a policy attaining the minimization in the pseudo-variance of average-reward optimality equation, that is,

g ∗ = ( r ( i , f ∗ ) − λ ) 2 + ∑ j ∈ S u ¯ ( j ) q ( j | i , f ∗ ) ∀ i ∈ S , t ≥ 0 , (3.11)

Therefore, Proposition 7.3 of the literature [

As a consequence, g ∗ = η σ λ ∗ ( i ) for every i ∈ S , and, moreover, f ∗ is optimal policy of pseudo-variance.

In the case where the existence of the pseudo-variance optimal policy is guaranteed, we use the policy iterative algorithm to get the optimal policy.

Suppose that Assumptions 1, 2, 3, and 4 hold, we gave the following concepts.

Definition 2. We define the bias of f as

h σ λ f ( i ) = ∫ 0 ∞ [ E i f ( r ( X ( t ) , f ) − λ ) 2 − η σ λ f ] d t i ∈ S . (3.12)

Assumption 4: Gives

| E i f ( r ( X ( t ) , f ) − λ ) 2 − η σ λ f | ≤ L 3 ( M + | λ | ) 2 e − δ t ω 2 ( i )

| h σ λ f ( i ) | ≤ ∫ 0 ∞ L 3 ( M + | λ | ) 2 e − δ t ω 2 ( i ) d t = L 3 ( M + | λ | ) 2 ω 2 ( i ) δ

sup f ∈ F ‖ h σ λ f ‖ ω 2 ≤ L 3 ( M + | λ | ) 2 δ .

So, h σ λ f is finite and in B ω 2 ( S ) . Moreover, the bias is uniformly bounded in the ω 2 -norm.

Next we introduce the Poisson equation, which is one of the main results of this paper.

Theorem 3. Let f ∈ F . We say that a pair ( η σ λ f , h σ λ f ) ∈ ℝ × B ω 2 ( S ) is a solution to the Poisson equation for f ∈ F if

η σ λ f = ( r ( i , f ) − λ ) 2 + ∑ j ∈ S h σ λ f ( j ) q ( j | i , f ) ∀ f ∈ F , i ∈ S . (3.13)

Proof: Our assumptions (in particular, Assumptions 1 and 4) allow us to interchange the sums and integrals in the following equations:

∑ j ∈ S h σ λ f ( j ) q ( j | i , f ) = ∑ j ∈ S [ ∫ 0 ∞ ( E j f ( r ( X ( t ) , f ) − λ ) 2 − η σ λ f ) d t ] q ( j | i , f ) = ∫ 0 ∞ ∑ j ∈ S E j f ( r ( X ( t ) , f ) − λ ) 2 q ( j | i , f ) d t − 0 = ∫ 0 ∞ ∑ j ∈ S ∑ k ∈ S ( r ( k , f ) − λ ) 2 p ( j , t , k ) q ( j | i , f ) d t = ∫ 0 ∞ ∑ k ∈ S ( r ( k , f ) − λ ) 2 ∑ j ∈ S q ( j | i , f ) p ( j , t , k ) d t

= ∫ 0 ∞ ∑ k ∈ S ( r ( k , f ) − λ ) 2 d d t p f ( i , t , k ) d t = ∑ k ∈ S ( r ( k , f ) − λ ) 2 p f ( i , t , k ) | 0 ∞ = ∑ k ∈ S ( r ( k , f ) − λ ) 2 μ f ( k ) − ( r ( i , f ) − λ ) 2 = η σ λ f − ( r ( i , f ) − λ ) 2

We have

η σ λ f = ( r ( i , f ) − λ ) 2 + ∑ j ∈ S h σ λ f ( j ) q ( j | i , f ) .

The theorem is proved. □

Finally, we should prove the uniqueness of the solution of Poisson’s equation.

Theorem 4. For every f ∈ F , the solutions to the Poisson equation for f are of the form ( η σ λ f , h σ λ f + z ) with z any real number. Moreover, is the unique solution to the Poisson equation

η σ λ f = ( r ( i , f ) − λ ) 2 + ∑ j ∈ S h σ λ f ( j ) q ( j | i , f ) ∀ i ∈ S (3.14)

for which μ f ( h σ λ f ) = 0

Prove: Suppose now that ( η σ λ f , h σ λ f ) and ( η σ λ f , h ′ σ λ f ) are two solutions to the Poisson equation, simultaneous transformation of both sides of Equation (3.14)

η σ λ f T = ∫ 0 T ∑ k ∈ S ( r ( k , f ) − λ ) 2 p ( 0 , i , t , k ) d t + ∑ j ∈ S ∫ 0 T h σ λ f ( j ) ∑ k ∈ S q ( j | k , f ) p ( 0 , i , t , k ) d t = ∫ 0 T E i f ( r ( X ( t ) , f ) − λ ) 2 d t + ∑ j ∈ S ∫ 0 T h σ λ f ( j ) ∑ k ∈ S p ( 0 , i , t , k ) q ( j | k , f ) d t = ∫ 0 T E i f ( r ( X ( t ) , f ) − λ ) 2 d t + ∑ j ∈ S ∫ 0 T h σ λ f ( j ) d p ( 0 , t , t , j )

= ∫ 0 T E i f ( r ( X ( t ) , f ) − λ ) 2 d t + ∑ j ∈ S h σ λ f ( j ) p ( 0 , i , t , j ) | 0 T = ∫ 0 T E i f ( r ( X ( t ) , f ) − λ ) 2 d t + ∑ j ∈ S h σ λ f ( j ) p ( 0 , i , T , j ) − h σ λ f ( i ) = ∫ 0 T E i f ( r ( X ( t ) , f ) − λ ) 2 d t + E i f h σ λ f ( X ( T ) ) − h σ λ f ( i ) (3.15)

Because ( η σ λ f , h ′ σ λ f ) is also the solution of the Poisson equation, therefore

η σ λ f T = ∫ 0 T E i f ( r ( X ( t ) , f ) − λ ) 2 d t + E i f h ′ σ λ f ( X ( T ) ) − h ′ σ λ f ( i ) (3.16)

(3.15) subtract (3.16)

E i f [ h σ λ f ( X ( t ) ) − h ′ σ λ f ( X ( t ) ) ] = h σ λ f ( i ) − h ′ σ λ f ( i ) ∀ i ∈ S . (3.17)

letting t → ∞ , it follows from Assumption 4 that μ f ( h σ λ f − h ′ σ λ f ) = h σ λ f ( i ) − h ′ σ λ f ( i ) , showing that the functions h σ λ f and h ′ σ λ f differ by the constant μ f ( h σ λ f − h ′ σ λ f ) .

It remains to show that μ f ( h σ λ f ) = 0 .

μ f ( h σ λ f ) = ∑ i ∈ S h σ λ f ( i ) μ f ( i ) = ∑ i ∈ S ∫ 0 ∞ [ E i f ( r ( X ( t ) ) − λ ) 2 − η σ λ f ] d t μ f ( i ) = ∑ i ∈ S ∫ 0 ∞ [ E i f ( r ( X ( t ) ) − λ ) 2 − ( r ( i , f ) − λ ) 2 − ∑ j ∈ S h σ λ f ( j ) q ( j | i , f ) ] d t μ f ( i ) = ∑ i ∈ S ∫ 0 ∞ [ E i f ( r ( X ( t ) ) − λ ) 2 ] d t μ f ( i ) − ∫ 0 ∞ ∑ i ∈ S ( r ( i , f ) − λ ) 2 μ f ( i ) d t − 0

= ∑ i ∈ S ∫ 0 ∞ ∑ j ∈ S ( r ( j , f ) − λ ) 2 p ( i , t , j ) μ f ( i ) d t − ∫ 0 ∞ ∑ i ∈ S ( r ( i , f ) − λ ) 2 μ f ( i ) d t = ∫ 0 ∞ ∑ j ∈ S ( r ( j , f ) − λ ) 2 ∑ i ∈ S μ f ( i ) p ( i , t , j ) d t − ∫ 0 ∞ ∑ i ∈ S ( r ( i , f ) − λ ) 2 μ f ( i ) d t = ∫ 0 ∞ ∑ j ∈ S ( r ( j , f ) − λ ) 2 μ f ( j ) d t − ∫ 0 ∞ ∑ i ∈ S ( r ( i , f ) − λ ) 2 μ f ( i ) d t = 0 □

Remark 4. Given f ∈ F , we can determine the gain and the bias of f by solving the following system of linear equations. First, determine the i.p.m. (invariant probability measure) vas the unique nonnegative solution (by Proposition C.12) to

{ ∑ j ∈ S q ( j | i , f ) μ f ( j ) = 0 ∑ j ∈ S μ f ( j ) = 1 (3.18)

Then, as a consequence of lecture [

η σ λ f = ∑ j ∈ S ( r ( j , f ) − λ ) 2 μ f ( j ) ∈ ℝ

and the bias h σ λ f ∈ B ω 2 ( S ) of f form the unique solution to the system of linear equations

{ η σ λ f = ( r ( i , f ) − λ ) 2 + ∑ j ∈ S h σ λ f ( j ) q ( j | i , f ) ∑ i ∈ S h σ λ f ( i ) μ f ( i ) = 0 (3.19)

Proposition 1: Policy iterative algorithm.

Step 1. From f ∈ F , we can choose a arbitrary.

Step 2. (Strategy evaluation process) Determine the pseudo-variance and deviation of the stationary policy as in Remark 4.

Step 3. (Policy Improvement Process) Choose f ′ as an improvement policy such that

f ′ ∈ arg min f ∈ F { ( r ( i , f ) − λ ) 2 + ∑ j ∈ S h σ λ f ( j ) q ( j | i , f ) } . (3.20)

Step 4. If f ′ = f , the iteration stops, it is the optimal strategy to minimize the pseudo variance, otherwise, replace f with f ′ and return to step 2.

Proposition 2: Convergence of the strategy iterative algorithm.

When the assumptions 1, 2, 3, 4 are established, let f 1 ∈ F be an arbitrary initial policy, let f n ∈ F be the sequence of policies obtained from the policy iterative algorithm. The one of the following results is hold.

1) After a finite number of policy iterations, the algorithm converges to the pseudo variance of average-reward optimal strategy.

2) as n → ∞ , the sequence η σ λ f n converges to the optimal AR function value η σ λ f ∗ .

The minimum pseudo-variance problem has been solved. The following theorem gives that when the pseudo-variance reaches a minimum, the variance is also minimized.

Theorem 5. For any policy f ∈ F , we compute η f with (2.8), and set λ = η f . If we obtain an improved policy f ′ such that η σ λ f ′ ≤ η σ λ f , then we have η σ f ′ ≤ η σ f . If η σ λ f ′ < η σ λ f , such that η σ f ′ < η σ f .

Prove: With lemma 1, we have

η σ λ f = η σ f + ( η f − λ ) 2 . (3.21)

η σ λ f ′ = η σ f ′ + ( η f ′ − λ ) 2 . (3.22)

Let (3.22) subtract (3.21) and substituting λ = η f , we obtain

η σ f ′ − η σ f = η σ λ f ′ − η σ λ f − ( η f ′ − η f ) 2 (3.23)

Obviously, if η σ λ f ′ ≤ η σ λ f , then η σ f ′ ≤ η σ f ; if η σ λ f ′ < η σ λ f , then η σ f ′ < η σ f .

With Theorem 3: when the pseudo-variance reaches a minimum, the variance also reaches a minimum. A sufficient condition for the variance minimization problem is obtained.

This section, we give an example to illustrate the conclusions of this paper.

Example 1 (Control M / M / ∞ of queue systems)

The system state X ( t ) indicates the number of customers waiting at the moment (including being served), the arrival rate λ is fixed, and the service rates μ can be controlled. When the system status is i ∈ S = { 0 , 1 , … } , the decision-maker takes an action a from the allowed action set A ( i ) . When the system is empty, we may impose that A ( 0 ) : = 0 . For each i ≥ 1 , let A ( i ) : = [ μ 1 , μ 2 ] with constants μ 2 > μ 1 > 0 , μ ∈ [ μ 1 , μ 2 ] , which may increase or decrease the service rate. This action incurs a cost c ( i , a ) . In addition, suppose that there is a benefit represented by p > 0 for each arriving customer, and then the net income of the system is

r ( i , a ) = p i − c ( i , a ) (4.1)

This is a continuous time MDP model, the corresponding transition rate are given as follows.

For each a ∈ [ μ 1 , μ 2 ] ,

q ( 0 | 0 , 0 ) = − q ( 1 | 0 , 0 ) : = − λ , q ( j | 0 , 0 ) = 0 , j ≥ 2 , (4.2)

For all i ≥ 1 , a ∈ A ( i ) = [ μ 1 , μ 2 ] ,

q ( j | i , a ) : = { λ i if j = i + 1 , − ( λ + u ) i + a if j = i , μ i − a if j = i − 1 , 0 otherwise , (4.3)

Our goal is to find the existence of a variance optimal policy. To this end, we consider the following assumptions:

D_{1}. μ − λ > 0 .

D_{2}. The function c ( i , a ) is continuous in a ∈ A ( i ) for each fixed i ∈ S , and sup a ∈ A ( i ) | c ( i , a ) | < M ˜ ( i + 1 ) for all i ∈ S , for some constant M ˜ ≥ 0 .

Proposition 3: under conditions D_{1}, D_{2}, the above controlled satisfies Assumptions 1, 2, 3, and 4. Therefore, there exists an variance optimal stationary policy.

Proof: Let c 1 : = 1 3 ( μ − λ ) , b 1 : = μ 2 + λ , ω ( i ) : = i + 1 , for all i ∈ S . Then, from (4.2) and (4.3), we have

∑ j ∈ S ω ( j ) q ( j | 0 , a ) = λ ≤ − c 1 ω ( 0 ) + μ 2 + λ ∀ a ∈ A ( i ) (4.4)

Moreover, for all i ≥ 1 , a ∈ [ μ 1 , μ 2 ]

∑ j ∈ S ω ( j ) q ( j | i , a ) = − ( μ − λ ) i + a ≤ − c 1 ω ( i ) + μ 2 + λ . (4.5)

which verifies Assumption 1(a).

On the other hand, by (4.2)-(4.3), we have

q ∗ ( i ) ≤ ( μ + λ ) ( i + 1 ) = ( μ + λ ) ω ( i ) , (4.6)

So Assumption 1(b) follows.

By (4.1) and D_{2}, we have | r ( i , a ) | ≤ ( p + M ˜ ) ω ( i ) , for all i ∈ S , which implies Assumption 1(c).

By (4.2)-(4.3) and D_{2}, we see that Assumption 1(d) holds.

To verity Assumption 1(e), let

ω ′ ( i ) : = ( i + 1 ) ( i + 2 ) , for each i ∈ S (4.7)

Then by (4.2)-(4.3) we have

q ∗ ( i ) ω ( i ) ≤ ( μ + λ ) ω ′ ( i ) ∀ i ∈ S (4.8)

∑ j ∈ S ω ′ ( j ) q ( j | i , a ) ≤ ( 4 λ + μ 2 ) ω ′ ( i ) ∀ a ∈ [ μ 1 , μ 2 ] , i ∈ S (4.9)

which imply Assumption 1(e) with

L 1 : = ( μ + λ ) , c ′ : = 4 λ + μ 2 , b ′ : = 0.

Obviously, Assumption 2 follows from the description of the model.

We verity Assumption 3, by D_{1} and (4.2)-(4.3), for all i ∈ S , i ∈ A ( i ) , we have

∑ j ∈ S ω 2 ( j ) q ( j | 0 , a ) = 3 λ ≤ − 1 2 ( μ − λ ) ω 2 ( 0 ) + b 2 , (4.10)

(there is b 2 > 0 , (presence is guaranteed by (D1))

∑ j ∈ S ω 3 ( j ) q ( j | 0 , a ) = 7 λ ≤ − ( μ − λ ) ω 3 ( 0 ) + b 3 . (4.11)

(there is b 3 > 0 , (presence is guaranteed by (D1))

For i ≥ 1 , a ∈ A ( i ) = [ μ 1 , μ 2 ] ,

∑ j ∈ s ω 2 ( j ) q ( j | i , a ) = − 2 ( μ − λ ) i 2 − ( μ − 2 a − 3 λ ) i + a ≤ − 2 ( μ − λ ) ( i + 1 ) 2 + 4 ( μ − λ ) i + 2 ( μ − λ ) + ( 3 λ + 2 a ) i + a ≤ − 1 2 ( μ − λ ) ω 2 ( i ) + b 2 (4.12)

∑ j ∈ s ω 3 ( j ) q ( j | i , a ) = − 3 ( μ − λ ) i 3 + 3 ( 3 λ − μ + a ) i 2 − ( μ − 7 λ − 3 a ) i + a = − 3 ( μ − λ ) ( i + 1 ) 3 + 9 ( μ − λ ) i 2 + 9 ( μ − λ ) i + 3 ( 3 λ − μ + a ) i 2 − ( μ − 7 λ − 3 a ) i + a ≤ − ( μ − λ ) ω 3 ( i ) + b 3 (4.13)

q ∗ ω 2 ( i ) ≤ ( μ + λ ) ω ″ ( i ) ∀ i ∈ S (4.14)

Let ω ″ ( i ) : = ω 3 ( i ) = ( i + 1 ) 3 , c 2 : = 1 2 ( μ − λ ) , b 2 as above, c ″ : = ( μ − λ ) , b ″ = b 3 and L 2 = μ + λ , by (4.10)-(4.14), we see that Assumption 1(d) holds.

Finally, we verify Assumption4, by (4.2)-(4.3) we have, for each fixed f ∈ F ,

∑ j ≥ k q ( j | i , f ( i ) ) ≤ ∑ j ≥ k q ( j | i + 1 , f ( i + 1 ) ) ∀ i , k ∈ S , k ≠ i + 1

which, together with Proposition C.16 of [

Assumption 1, 2, 3 and 4 holds. So, the Proposition is proved.

In this article, it defines the variance optimization problem of continuous-time Markov decision processes, which is different from the mean-variance optimization problem previously studied. By defining pseudo-variance, the deviation of the deterministic stationary policy f and the Poisson equation, a series of concepts and theorems, we prove the existence of the variance optimal strategy in the deterministic stationary policy space, and give the policy iterative algorithm to calculate optimal policy. Finally we prove the convergence of the policy iterative algorithm.

The author declares no conflicts of interest regarding the publication of this paper.

Fu, Y.Q. (2019) Variance Optimization for Continuous-Time Markov Decision Processes. Open Journal of Statistics, 9, 181-195. https://doi.org/10.4236/ojs.2019.92014