Variance Optimization for Continuous-Time Markov Decision Processes

This paper considers the variance optimization problem for the average reward of continuous-time Markov decision processes (MDPs). It is assumed that the state space is countable and the action space is a Borel measurable space. The main purpose of this paper is to find the policy with minimal variance in the deterministic stationary policy space. Unlike in traditional Markov decision processes, the cost function under the variance criterion is affected by future actions. To address this, we convert the variance minimization problem into a standard MDP by introducing a concept called the pseudo-variance. Further, by giving a policy iteration algorithm for the pseudo-variance optimization problem, we derive the optimal policy of the original variance optimization problem and give a sufficient condition for a variance optimal policy. Finally, we use an example to illustrate the conclusions of this paper.


Introduction
Early successful applications of Markov decision processes include the Postal Service Company's catalogue information system, inventory control, and supply chain management. Later, many real-life problems, such as sequential assignment, machine maintenance, and secretary problems, were modeled as dynamic Markov decision process (MDP) models. The purpose of this paper is to find the policy with minimal variance in the deterministic stationary policy class, which differs from the mean-variance criterion problem. The study of the mean-variance criterion is generally based on the discount criterion or the average criterion. In the MDP literature, many studies focus on expected-reward optimization over a finite horizon, discounted MDPs over an infinite horizon, and the average-reward problem over an infinite horizon [2] [3]. One first establishes the optimality equation, then proves the existence of an optimal policy, and finally solves the MDP problem by a policy iteration type algorithm. However, in real-life applications such as queueing systems and networks, the optimal policy of such an unconstrained optimization problem is often not unique. We therefore introduce the variance as a criterion for selecting among the optimal policies.
Variance is an important performance metric of stochastic systems. In financial engineering, the mean measures the expected return and the variance measures the risk. The mean-variance problem for portfolios can be traced back to Markowitz [4]. In subsequent studies of Markowitz's mean-variance portfolio problem [5]-[13], the decision maker's expected reward is often fixed to a constant, and the investor then chooses, among the policies attaining that expected return, one that minimizes the risk; thus the Markowitz mean-variance portfolio model maximizes return while minimizing risk. However, since the given expected return may not be maximal, an optimal policy for Markowitz's mean-variance portfolio may not be optimal in the usual sense of variance minimization for MDPs. Moreover, more and more real-life situations, such as queueing systems and networks, are described by MDPs rather than stochastic differential equations, so Markowitz's mean-variance portfolio problem should be extended to MDPs. For the mean-variance problem of MDPs, as in [14] [15] [16], one seeks a variance optimal policy over the set of policies for which the average or discounted reward is optimal, so the variance criterion can be transformed into an equivalent average or discounted criterion. However, when the mean criterion is not required to be optimal, it is not clear how to develop a policy iteration algorithm for the problem. For discrete time, the variance criterion under discounted and long-run average rewards has been studied in [17] [18]; these works consider the variance optimization problem directly, without constraining the mean. For continuous time, the variance of the average expected reward is defined over deterministic stationary policies. The finite-horizon expected reward is defined as below.
The finite-horizon expected reward of a policy f with initial state i is
\[ J_T(f, i) := \mathbb{E}_i^f\Big[ \int_0^T r(X_t, f(X_t))\,\mathrm{d}t \Big]. \]
In the discrete-time setting of [17], the variance of f is defined through the deviations of the reward from its long-run mean. The variance of the average expected reward for the continuous-time model in this paper is given by
\[ \sigma_f := \limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}_i^f\Big[ \int_0^T \big( r(X_t, f(X_t)) - \eta_f \big)^2\,\mathrm{d}t \Big], \]
where the long-run expected average reward is defined as
\[ \eta_f := \limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}_i^f\Big[ \int_0^T r(X_t, f(X_t))\,\mathrm{d}t \Big]. \]
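To make these definitions concrete, here is a small simulation sketch (my own illustration; the two-state chain, its rates, and its rewards are assumptions, not data from the paper). It estimates η_f and σ_f as time averages along a simulated path and compares them with the exact values given by the invariant probability measure:

```python
import random

# Two-state continuous-time Markov chain under a fixed stationary policy
# (rates and rewards are illustrative assumptions, not from the paper).
Q01, Q10 = 2.0, 3.0          # transition rates 0 -> 1 and 1 -> 0
REWARD = [1.0, 4.0]          # r(i, f(i)) for i = 0, 1

def simulate(T=200000.0, seed=0):
    """Estimate eta_f and sigma_f as time averages over [0, T]."""
    rng = random.Random(seed)
    state, t = 0, 0.0
    int_r, int_r2 = 0.0, 0.0   # integrals of r and r^2 along the path
    while t < T:
        rate = Q01 if state == 0 else Q10
        hold = min(rng.expovariate(rate), T - t)
        int_r += REWARD[state] * hold
        int_r2 += REWARD[state] ** 2 * hold
        t += hold
        state = 1 - state
    eta = int_r / T
    # time-average of (r - eta)^2 equals the plug-in value E[r^2] - eta^2
    sigma = int_r2 / T - eta ** 2
    return eta, sigma

# Exact values via the invariant probability measure mu_f
mu0 = Q10 / (Q01 + Q10)
mu1 = Q01 / (Q01 + Q10)
eta_exact = mu0 * REWARD[0] + mu1 * REWARD[1]
sigma_exact = mu0 * (REWARD[0] - eta_exact) ** 2 + mu1 * (REWARD[1] - eta_exact) ** 2

eta_mc, sigma_mc = simulate()
print(eta_exact, sigma_exact, eta_mc, sigma_mc)
```

For this chain the invariant measure is μ_f = (0.6, 0.4), giving η_f = 2.2 and σ_f = 2.16; the simulated time averages approach these values as T grows.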
The main work of this paper is to find a policy iteration algorithm for the optimal policy under the variance criterion (minimum variance) on a countable state space with a Borel measurable action space. For a countable state space, the reward function r(i, a) may be unbounded, so the expected average reward η_f may fail to be finite. To guarantee the finiteness of η_f, we will impose the following Assumption 1. Next, we use the unique invariant probability measure of the Markov chain to represent the average expected reward and the variance; to this end, we will impose Assumptions 2, 3, and 4. Suppose that Assumptions 1, 2, 3, and 4 are satisfied. We have established a variance criterion, under which the cost function is (r(i, a) − η_f)², where r(i, a) is the system reward at the current stage with state i and action a, and η_f is the expected average reward. Obviously, the cost is affected by future actions, because η_f is affected by future actions. This differs from traditional MDPs, where the cost function and the state transition probability depend only on the current state and the action selected at the current stage. Therefore, the conclusions in [14] [15] [16] do not apply to this model. In this paper, we define a pseudo-variance with cost function (r(i, a) − λ)², where λ is a given constant [17]. Obviously, the value of the pseudo-variance cost at the current stage is not affected by future actions; it depends only on the current state and the current action, so the pseudo-variance minimization problem is a standard MDP. We prove the relation between the variance and the pseudo-variance. Unlike [17], we define the deviation (bias) of a deterministic stationary policy f for the continuous-time MDP. We prove that the bias and the objective function satisfy a Poisson equation whose solution is unique up to an additive constant. Based on this, we develop a continuous-time MDP policy iteration algorithm to obtain the optimal policy, and we prove the convergence of the policy iteration algorithm.

Model and Optimization Criteria
The control model associated with the continuous-time MDP that we are concerned with is the five-tuple {S, A, (A(i), i ∈ S), q(j | i, a), r(i, a)}, consisting of:
1) A denumerable set S, called the state space, which is the set of all the states of the system under observation.
2) A Borel space A, called the action space, with A(i) ⊆ A denoting the set of actions allowed in state i. Let K := {(i, a) : i ∈ S, a ∈ A(i)} be the set of all feasible state-action pairs.
3) The transition rates q(j | i, a), which satisfy q(j | i, a) ≥ 0 for all (i, a) ∈ K and j ≠ i. Moreover, we assume that the transition rates q(j | i, a) are conservative, i.e.,
\[ \sum_{j \in S} q(j \mid i, a) = 0 \quad \text{for all } (i, a) \in K, \]
and stable, which means that
\[ q(i) := \sup_{a \in A(i)} \big[ -q(i \mid i, a) \big] < \infty \quad \text{for all } i \in S. \]
4) A measurable real-valued function r(i, a) on K, called the reward function, which is assumed to be measurable in (i, a).
The above model is a classical continuous-time MDP model [3]. In MDPs, the policies include stochastic Markov policies, stochastic stationary policies, and deterministic stationary policies. This paper only considers finding the minimal variance in the deterministic stationary policy class, so we only introduce the definition of a deterministic stationary policy.
Definition 1. A deterministic stationary policy is a function f: S → A such that f(i) ∈ A(i) for all i ∈ S. A deterministic stationary policy is simply referred to as a stationary policy.
Let F be the set of all deterministic stationary policies.
For each f ∈ F, the associated transition rates are defined as q(j | i, f) := q(j | i, f(i)), and the reward function is given by r(i, f) := r(i, f(i)). Under Assumption 1, the transition function p_f(i, t, j) of the corresponding Markov process is well defined; see [3].
For all f ∈ F and an arbitrary initial state i ∈ S, there is a unique probability space (Ω, 𝔽, P_i^f), and E_i^f denotes the expectation operator associated with the probability measure P_i^f. We define the expected average reward and the variance, respectively, by
\[ \eta_f := \limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}_i^f\Big[ \int_0^T r(X_t, f(X_t))\,\mathrm{d}t \Big], \tag{2.8} \]
\[ \sigma_f := \limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}_i^f\Big[ \int_0^T \big( r(X_t, f(X_t)) - \eta_f \big)^2\,\mathrm{d}t \Big]. \tag{2.9} \]
Let us first give some notation. For any measurable function ω ≥ 1 on S, we define the ω-weighted supremum norm ||·||_ω of a real-valued measurable function u on S by
\[ \|u\|_{\omega} := \sup_{i \in S} \frac{|u(i)|}{\omega(i)}, \]
and the Banach space B_ω(S) := {u : ||u||_ω < ∞}. We will use the invariant measure of the Markov chain to represent Equation (2.8) and Equation (2.9). To this end, we impose the following three assumptions (see [3]).
Assumption 2: For each f ∈ F, the corresponding Markov process with transition function p_f(i, t, j) is irreducible, which means that, for any two states i ≠ j, there exists a set of distinct states i = i_1, i_2, …, i_m = j such that q(i_{k+1} | i_k, f(i_k)) > 0 for k = 1, …, m − 1. Under Assumptions 1 and 2, the process admits a unique invariant probability measure μ_f, with μ_f(j) > 0 for all j ∈ S. Thus, by [3], the μ_f-expectation of ω, i.e., μ_f(ω) := Σ_{j∈S} ω(j) μ_f(j), is finite; moreover, for any measurable function u with |u| ≤ ω, the inequality |Σ_{j∈S} u(j) μ_f(j)| ≤ μ_f(ω) gives that the expectation μ_f(u) := Σ_{j∈S} u(j) μ_f(j) exists and is finite.

Assumption 3:
With ω as in Assumption 1, assume the following conditions are true:
a) There exists a constant M > 0 such that |r(i, a)| ≤ M ω(i) for all (i, a) ∈ K, with ω as in (2.2) and (2.4).
b) The control model (2.1) is uniformly ω²-exponentially ergodic; see [3] for the definition.
Remark 2. Under the premise of the above assumptions, it can be known from [3] that, for each given f ∈ F, the average reward and the variance defined by Equation (2.8) and Equation (2.9) are both constants, independent of the initial state, and they can be represented as expectations with respect to the invariant measure:
\[ \eta_f = \sum_{i \in S} \mu_f(i)\, r(i, f(i)), \qquad \sigma_f = \sum_{i \in S} \mu_f(i)\, \big( r(i, f(i)) - \eta_f \big)^2 . \]
We denote by r the S-dimensional column vector with components r(i, f) and by m the S-dimensional column vector with components m(i, f) := (r(i, f) − η_f)². Our optimization goal is to select f* ∈ F that satisfies the following condition:
\[ \sigma_{f^*} = \min_{f \in F} \sigma_f . \tag{2.20} \]
By (2.20), the variance minimization problem of Markov chains can be defined as below.

\[ f^* \in \operatorname*{arg\,min}_{f \in F} \sigma_f = \operatorname*{arg\,min}_{f \in F} \sum_{i \in S} \mu_f(i)\, m(i, f). \tag{2.21} \]
Remark 3. From (2.21), we see that the value η_f is affected by future actions, so problem (2.21) differs from a standard MDP.
Even if we regard m as a cost function, we cannot directly use existing conclusions to obtain the optimal policy.
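As a concrete illustration of the invariant-measure representation above (the generator, states, and rewards below are assumptions made for the example, not data from the paper), the following sketch computes μ_f for a finite generator matrix by solving μ_f Q_f = 0 with Σ_i μ_f(i) = 1, and then evaluates η_f and σ_f:

```python
import numpy as np

# Generator Q_f of a 3-state chain under a fixed policy f, and the
# corresponding reward vector r(i, f(i)); both are illustrative assumptions.
Q = np.array([[-3.0, 2.0, 1.0],
              [1.0, -4.0, 3.0],
              [2.0, 2.0, -4.0]])
r = np.array([0.0, 1.0, 5.0])

def invariant_measure(Q):
    """Solve mu @ Q = 0, sum(mu) = 1 for the invariant probability measure."""
    n = Q.shape[0]
    # Replace one (redundant) balance equation by the normalization constraint.
    A = np.vstack([Q.T[:-1], np.ones(n)])
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

mu = invariant_measure(Q)
eta = float(mu @ r)                   # average reward  eta_f = mu_f(r)
sigma = float(mu @ (r - eta) ** 2)    # variance  sigma_f = mu_f((r - eta)^2)
print(mu, eta, sigma)
```

For this generator the invariant measure is uniform, μ_f = (1/3, 1/3, 1/3), so η_f = 2 and σ_f = 14/3.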

Analysis and Optimization
In this section, we define a pseudo-variance minimization problem. By proving the relation between the pseudo-variance and the variance, the optimization problem (2.21) is transformed into a pseudo-variance optimization problem.
Further, the optimal policy for the variance optimization problem can be derived from the policy iteration algorithm for the pseudo-variance optimization problem, and we give a sufficient condition for a variance optimal policy.

Pseudo-Variance Minimization
We define a new cost function as below.
\[ m_{\lambda}(i, a) := \big( r(i, a) - \lambda \big)^2, \quad (i, a) \in K, \]
where λ is a given constant. We denote by m_λ the S-dimensional column vector with components m_λ(i, f) := (r(i, f) − λ)², and by I the S-dimensional column vector with all components equal to 1. We define the pseudo-variance function as
\[ \sigma_f^{\lambda} := \limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}_i^f\Big[ \int_0^T \big( r(X_t, f(X_t)) - \lambda \big)^2\,\mathrm{d}t \Big] = \sum_{i \in S} \mu_f(i)\, m_{\lambda}(i, f). \]
Obviously, the pseudo-variance minimization problem of Markov chains can be defined as below:
\[ f_{\lambda}^* \in \operatorname*{arg\,min}_{f \in F} \sigma_f^{\lambda} = \operatorname*{arg\,min}_{f \in F} \sum_{i \in S} \mu_f(i)\, m_{\lambda}(i, f). \tag{3.4} \]
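The relation between the variance and the pseudo-variance follows from a one-line expansion under the invariant measure. The derivation sketched here is my reconstruction of the standard argument, using the representations η_f = Σ_i μ_f(i) r(i, f) and σ_f = Σ_i μ_f(i)(r(i, f) − η_f)²:

```latex
\begin{aligned}
\sigma_f^{\lambda}
  &= \sum_{i\in S}\mu_f(i)\,\big(r(i,f)-\lambda\big)^2
   = \sum_{i\in S}\mu_f(i)\,\big[(r(i,f)-\eta_f)+(\eta_f-\lambda)\big]^2 \\
  &= \underbrace{\sum_{i\in S}\mu_f(i)\,(r(i,f)-\eta_f)^2}_{=\,\sigma_f}
   + 2(\eta_f-\lambda)\underbrace{\sum_{i\in S}\mu_f(i)\,(r(i,f)-\eta_f)}_{=\,\eta_f-\eta_f\,=\,0}
   + (\eta_f-\lambda)^2 \\
  &= \sigma_f + (\eta_f-\lambda)^2 .
\end{aligned}
```

In particular, for a fixed policy f the two criteria differ only by the constant (η_f − λ)², which vanishes when λ = η_f.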
The lemma is proved. Below we discuss how to solve the pseudo-variance minimization problem. Because (3.4) is a traditional MDP optimization problem, we can solve it with a policy iteration algorithm. Before using the policy iteration algorithm to solve problem (3.4), we need to prove the existence of a pseudo-variance optimal policy. Suppose that Assumptions 1, 2, 3, and 4 are all satisfied; we give the following theorems and lemmas.
Theorem 1. A pair (g, h) ∈ ℝ × B_ω(S) is said to be a solution of the pseudo-variance average-reward optimality equation if
\[ g = \min_{a \in A(i)} \Big\{ m_{\lambda}(i, a) + \sum_{j \in S} q(j \mid i, a)\, h(j) \Big\} \quad \text{for all } i \in S. \]
Lemma 2. Suppose that Assumptions 1, 2, 3, and 4 are satisfied, and consider an arbitrary fixed state i_0 ∈ S. Then, for all f ∈ F and discount factors α > 0, the relative differences of the discounted-reward function η_α^f, namely h_α^f(i) := η_α^f(i) − η_α^f(i_0), are uniformly ω-bounded in α > 0 and f ∈ F.
Proof: Lemma 2 follows from [3]. Therefore, Proposition 7.3 of [3] gives the existence of a solution of the optimality equation. As a consequence, there exists f* ∈ F attaining the minimum in the optimality equation for every i ∈ S; moreover, f* is an optimal policy for the pseudo-variance.
In the case where the existence of the pseudo-variance optimal policy is guaranteed, we use the policy iteration algorithm to obtain the optimal policy. Suppose that Assumptions 1, 2, 3, and 4 hold; we give the following concepts.
Definition 2. We define the bias of f as
\[ h_f(i) := \int_0^{\infty} \Big( \mathbb{E}_i^f\big[ m_{\lambda}(X_t, f(X_t)) \big] - \sigma_f^{\lambda} \Big)\,\mathrm{d}t, \quad i \in S. \tag{3.13} \]
Under our assumptions, the gain σ_f^λ and the bias h_f satisfy the Poisson equation
\[ \sigma_f^{\lambda} = m_{\lambda}(i, f(i)) + \sum_{j \in S} q(j \mid i, f(i))\, h_f(j), \quad i \in S. \tag{3.14} \]
Proof: Our assumptions (in particular, Assumptions 1 and 4) allow us to interchange the sums and integrals involved. The theorem is proved. Finally, we prove the uniqueness of the solution of the Poisson equation: subtracting two instances of (3.14) shows that any two solutions h and h′ differ by a constant, so the bias is unique up to an additive constant.
Remark 4. Given f ∈ F, we can determine the gain and the bias of f by solving the following system of linear equations. First, determine the i.p.m. (invariant probability measure) μ_f as the unique nonnegative solution (by Proposition C.12 of [3]) to μ_f Q_f = 0 and μ_f I = 1. Then, as a consequence of [2], the gain is σ_f^λ = μ_f m_λ, and the bias h_f is obtained from (3.14) together with the normalization μ_f h_f = 0.
Proposition 1: Policy iteration algorithm.
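Remark 4's linear systems can be sketched numerically. In this illustration (the generator, rewards, and the constant λ are assumed for the example, not taken from the paper), the gain and the bias of a fixed policy are obtained by solving μ_f Q_f = 0, μ_f I = 1, and then the Poisson equation with the normalization μ_f h_f = 0:

```python
import numpy as np

# Illustrative data: generator Q_f of a fixed policy f, rewards r(i, f(i)),
# and a given constant lambda for the pseudo-variance cost (all assumptions).
Q = np.array([[-2.0, 2.0, 0.0],
              [1.0, -3.0, 2.0],
              [0.0, 3.0, -3.0]])
r = np.array([1.0, 2.0, 4.0])
lam = 2.0
m_lam = (r - lam) ** 2            # pseudo-variance cost m_lambda(i, f(i))

n = Q.shape[0]

# Invariant probability measure: mu @ Q = 0, sum(mu) = 1.
A = np.vstack([Q.T[:-1], np.ones(n)])
mu = np.linalg.solve(A, np.concatenate([np.zeros(n - 1), [1.0]]))

gain = float(mu @ m_lam)          # gain = sigma_f^lambda = mu_f(m_lambda)

# Poisson equation  m_lam + Q h = gain * 1, with normalization mu @ h = 0.
# One balance row of Q is redundant; replace it by the normalization.
B = Q.copy()
B[-1] = mu
rhs = gain - m_lam
rhs[-1] = 0.0
h = np.linalg.solve(B, rhs)
print(mu, gain, h)
```

The normalization μ_f h_f = 0 pins down the additive constant that the Poisson equation leaves free (cf. the uniqueness statement above).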
Step 1. Choose an arbitrary f ∈ F.
Step 2. (Policy evaluation) Determine the pseudo-variance σ_f^λ and the bias h_f of the stationary policy f as in Remark 4.
Step 3. (Policy improvement) Choose f′ as an improvement policy such that
\[ f'(i) \in \operatorname*{arg\,min}_{a \in A(i)} \Big\{ m_{\lambda}(i, a) + \sum_{j \in S} q(j \mid i, a)\, h_f(j) \Big\}, \]
setting f′(i) := f(i) whenever f(i) already attains the minimum.
Step 4. If f′ = f, the iteration stops, and f is an optimal policy minimizing the pseudo-variance; otherwise, replace f with f′ and return to Step 2.
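The four steps above can be sketched for a finite model. Everything below (two states, two actions, rates, rewards, and λ) is an assumed toy instance, not data from the paper; the evaluation step reuses the linear systems of Remark 4:

```python
import numpy as np

# Toy finite CTMDP (assumed): states {0, 1}, actions {0, 1} in each state.
# q[a][i] is row i of the generator under action a; r[a][i] are rewards.
q = {0: np.array([[-1.0, 1.0], [2.0, -2.0]]),
     1: np.array([[-3.0, 3.0], [1.0, -1.0]])}
r = {0: np.array([1.0, 4.0]), 1: np.array([2.0, 3.0])}
lam = 2.5
states, actions = [0, 1], [0, 1]

def evaluate(f):
    """Gain and bias of stationary policy f (Remark 4's linear systems)."""
    Q = np.array([q[f[i]][i] for i in states])
    m = np.array([(r[f[i]][i] - lam) ** 2 for i in states])
    n = len(states)
    A = np.vstack([Q.T[:-1], np.ones(n)])
    mu = np.linalg.solve(A, np.concatenate([np.zeros(n - 1), [1.0]]))
    gain = float(mu @ m)
    B = Q.copy(); B[-1] = mu
    rhs = gain - m; rhs[-1] = 0.0
    h = np.linalg.solve(B, rhs)
    return gain, h

def improve(f, h):
    """One policy-improvement sweep against the bias h (Step 3)."""
    f_new = list(f)
    for i in states:
        best_a = f[i]
        best_v = (r[f[i]][i] - lam) ** 2 + q[f[i]][i] @ h
        for a in actions:
            v = (r[a][i] - lam) ** 2 + q[a][i] @ h
            if v < best_v - 1e-12:   # strict improvement only
                best_a, best_v = a, v
        f_new[i] = best_a
    return f_new

f = [0, 0]                           # Step 1: arbitrary initial policy
while True:
    gain, h = evaluate(f)            # Step 2
    f_next = improve(f, h)           # Step 3
    if f_next == f:                  # Step 4: stop when the policy is stable
        break
    f = f_next
print(f, gain)
```

Keeping the current action unless another action strictly improves the evaluated expression guarantees termination over the finite policy class.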
Proposition 2: Convergence of the policy iteration algorithm. Suppose Assumptions 1, 2, 3, and 4 hold. Let f_1 ∈ F be an arbitrary initial policy, and let {f_n} ⊂ F be the sequence of policies obtained from the policy iteration algorithm. Then one of the following results holds:
1) After a finite number of policy iterations, the algorithm converges to a pseudo-variance average-reward optimal policy.

Variance Minimization
The minimum pseudo-variance problem has been solved. The following theorem shows that when the pseudo-variance attains its minimum, the variance is also minimized.
Theorem 5. For any policy f ∈ F, compute η_f with (2.8) and set λ = η_f. If we obtain an improved policy f′ such that σ_{f′}^λ ≤ σ_f^λ, then σ_{f′} ≤ σ_f.
Proof: By Lemma 1, we have
\[ \sigma_{f'} = \sigma_{f'}^{\lambda} - (\eta_{f'} - \lambda)^2 \le \sigma_{f'}^{\lambda} \le \sigma_f^{\lambda} = \sigma_f + (\eta_f - \lambda)^2 = \sigma_f . \]
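As a numerical check of Theorem 5's argument (the two policies, rates, and rewards below are assumptions for illustration), one can set λ = η_f for a current policy f, evaluate the pseudo-variances of f and a candidate f′, and confirm that σ_{f′}^λ ≤ σ_f^λ forces σ_{f′} ≤ σ_f:

```python
import numpy as np

def stationary_stats(Q, r):
    """Invariant measure, average reward eta_f, and variance sigma_f."""
    n = Q.shape[0]
    A = np.vstack([Q.T[:-1], np.ones(n)])
    mu = np.linalg.solve(A, np.concatenate([np.zeros(n - 1), [1.0]]))
    eta = float(mu @ r)
    sigma = float(mu @ (r - eta) ** 2)
    return mu, eta, sigma

def pseudo_variance(Q, r, lam):
    """sigma_f^lambda = mu_f((r - lambda)^2)."""
    mu, _, _ = stationary_stats(Q, r)
    return float(mu @ (r - lam) ** 2)

# Two candidate policies on a two-state chain (illustrative assumptions).
Q_f = np.array([[-1.0, 1.0], [2.0, -2.0]]); r_f = np.array([0.0, 3.0])
Q_g = np.array([[-2.0, 2.0], [1.0, -1.0]]); r_g = np.array([1.0, 2.0])

_, eta_f, sigma_f = stationary_stats(Q_f, r_f)
_, eta_g, sigma_g = stationary_stats(Q_g, r_g)
lam = eta_f                               # Theorem 5: take lambda = eta_f

pv_f = pseudo_variance(Q_f, r_f, lam)     # equals sigma_f since lam = eta_f
pv_g = pseudo_variance(Q_g, r_g, lam)     # equals sigma_g + (eta_g - lam)^2
print(sigma_f, pv_f, sigma_g, pv_g)
```

Since λ = η_f, the pseudo-variance of f equals its variance, and Lemma 1 converts any pseudo-variance improvement into a variance improvement.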

Examples
In this section, we give an example to illustrate the conclusions of this paper.
Example 1 (Control of M/M/∞ queueing systems). The system state X(t) indicates the number of customers in the system at time t (including those being served); the arrival rate λ is fixed, while the service rate can be controlled. The state space is S = {0, 1, …}, and the decision-maker takes an action a from the allowed action set A(i). When the system is not empty, we may impose that a ∈ [μ_min, μ_max], which may increase or decrease the service rate. This action incurs a cost c(i, a). In addition, suppose that there is a benefit p > 0 for each arriving customer; then the net income of the system is r(i, a) = pλ − c(i, a). This is a continuous-time MDP model, and the corresponding transition rates are given as follows.
For each (i, a) ∈ K, the transition rates are
\[ q(i+1 \mid i, a) = \lambda, \tag{4.2} \]
\[ q(i-1 \mid i, a) = i\,a, \qquad q(i \mid i, a) = -(\lambda + i\,a), \tag{4.3} \]
and q(j | i, a) = 0 otherwise. Our goal is to establish the existence of a variance optimal policy. To this end, we consider two conditions, D_1 and D_2, which require suitable growth bounds on the model data, holding for all i ∈ S and some constant M ≥ 0. Proposition 3: Under conditions D_1 and D_2, the above controlled queueing system satisfies Assumptions 1, 2, 3, and 4. Therefore, there exists a variance optimal stationary policy.
Proof: Let ω(i) := i + 1 for all i ∈ S. Then, from (4.2) and (4.3), we can compare the tail rates Σ_{j ≥ k, j ≠ i} q(j | i, f(i)) across states i, which, together with Proposition C.16 of [3], implies that the corresponding Markov process X(t) is stochastically ordered. Thus, Assumption 4(a) follows from Proposition 7.6 of [3]. Similarly, Assumption 4(b) is established. Assumptions 1, 2, 3, and 4 hold, so the Proposition is proved.
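For a numerical sense of Example 1 (the truncation level, rates, and cost function are all assumptions for illustration, not from the paper), one can truncate the countable state space, build the birth-death generator for a constant service-rate policy, and evaluate η_f and σ_f through the invariant measure:

```python
import numpy as np

# Truncated M/M/infinity control model (illustrative parameters).
N = 60            # truncation level for the countable state space {0, 1, ...}
LAM = 2.0         # arrival rate lambda
P = 1.0           # benefit per arriving customer
A_RATE = 1.5      # constant service rate chosen by the stationary policy f

def generator(a):
    """Birth-death generator: q(i+1|i,a) = LAM, q(i-1|i,a) = i * a."""
    Q = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        up = LAM if i < N else 0.0   # arrivals blocked at the truncation level
        down = i * a
        Q[i, i] = -(up + down)
        if i < N:
            Q[i, i + 1] = up
        if i > 0:
            Q[i, i - 1] = down
    return Q

def reward(i, a):
    """Net income r(i,a) = p*lambda - c(i,a); the cost c(i,a) = i*a is assumed."""
    return P * LAM - i * a

Q = generator(A_RATE)
r = np.array([reward(i, A_RATE) for i in range(N + 1)])

A = np.vstack([Q.T[:-1], np.ones(N + 1)])
mu = np.linalg.solve(A, np.concatenate([np.zeros(N), [1.0]]))
eta = float(mu @ r)
sigma = float(mu @ (r - eta) ** 2)
print(eta, sigma)
```

With these parameters the invariant measure is essentially Poisson with mean λ/a = 4/3, so η_f ≈ 0 and σ_f ≈ 3; the truncation error at N = 60 is negligible.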

Discussion and Conclusion
In this article, we define the variance optimization problem for continuous-time Markov decision processes, which differs from the previously studied mean-variance optimization problem. By defining the pseudo-variance, the bias of a deterministic stationary policy f, and the Poisson equation, together with a series of concepts and theorems, we prove the existence of a variance optimal policy in the deterministic stationary policy space and give a policy iteration algorithm to compute the optimal policy. Finally, we prove the convergence of the policy iteration algorithm.

Remark 1. From (2.9), we can see that the definition of σ_f differs from the definition of the variance criterion for the average reward of continuous-time MDPs in ([3], Chapter 10): here the value of the cost function is affected by future actions, so this is not a standard MDP optimization problem.

From (3.4), we can see that m_λ is an instantaneous cost with no relation to future actions. Below, we study the relation between the two problems (2.21) and (3.4). First, we have the following lemma on the relation between σ_f and σ_f^λ.
Lemma 1. For all f ∈ F, the corresponding variance and pseudo-variance satisfy the following relation:
\[ \sigma_f^{\lambda} = \sigma_f + (\eta_f - \lambda)^2 . \]

Theorem 2.
Suppose that Assumptions 1, 2, 3, and 4 hold. Then there exists a solution (g*, h*) of the pseudo-variance average-reward optimality equation. Moreover, the constant g* coincides with the optimal pseudo-variance value, i.e., g* = min_{f∈F} σ_f^λ, and the assumptions ensure the existence of a policy attaining the minimum in the pseudo-variance average-reward optimality equation, that is, a policy f* such that
\[ g^* = m_{\lambda}(i, f^*(i)) + \sum_{j \in S} q(j \mid i, f^*(i))\, h^*(j) \quad \text{for all } i \in S. \]

Theorem 4.
For every f ∈ F, the solutions to the Poisson equation for f are of the form h_f + c·I for a constant c. Indeed, if h and h′ are two solutions to the Poisson equation, subtracting the two instances of Equation (3.14) shows that h − h′ is constant.

Theorem 3 shows that when the pseudo-variance reaches its minimum, the variance also reaches its minimum; thus a sufficient condition for the variance minimization problem is obtained.
Assumption 2 follows from the description of the model. We verify Assumption 3: by D_1 and (4.2)-(4.3), for all i ∈ S, the required bounds hold.