Conditional Value-at-risk for Random Immediate Reward Variables in Markov Decision Processes

We consider risk minimization problems for Markov decision processes. From a standpoint of making the risk of random reward variable at each time as small as possible, a risk measure is introduced using conditional value-at-risk for random immediate reward variables in Markov decision processes, under whose risk measure criteria the risk-optimal policies are characterized by the optimality equations for the discounted or average case. As an application, the inventory models are considered.


Introduction
As a measure of risk for income or loss random variables, the variance has been commonly considered since Markowitz work [1].The variance has the shortcoming that it does not approximately account for the phenomenon of "fat tail" in distribution functions.In recent years, many risk measures have been generated and analyzed by an economically motivated optimization problem, for example, value at risk , conditional value-at-risk [2,3], coherent risk of measure [4][5][6], convex risk of measure [7,8] and its applications [9,10].
On the other hand, a lot of research considering the risk have been progressed by many authors [11][12][13][14][15] in the framework of Markov decision processes (MDPs, for short).In [11,16], the risk control for the random total reward in MDPs is discussed.In the sequential decision making under uncertain circumstance, it may be better to minimize the total risk through the infinite horizon controlling the risk at each time.For example, in multiperiod inventory and production problem, we often want to order optimally by the ordering policy such that while it minimizes the total risk through all the periods it also makes the risk at each time as small as possible.
In this paper, with above motivation in mind we introduce a new risk measure for each policy using conditional value-at-risk for random immediate reward variables, under whose risk measure criteria the optimization will be done, respectively, in the discounted and average case.As an application, the inventory model is consid-ered.In the reminder of this section, we shall establish notations that will be used throughout the paper and define the problem with a new risk measure.
A Borel set is a Borel subset of a complete separable metric space.For a Borel set X, X denotes the B   algebra of Borel subset of X.For Borel sets X and Y,   X P and

 
X Y P be the sets of all probability measures on X and all conditional probability measures on X given Y respectively.The product of X and Y is denoted by XY.Let be the set of real numbers.Let I be a random income (or reward) variable on some probability space , and We define the inverse function Then, the Conditional Value-at-Risk for a level We note that   @ CV R I  is specified depending only on the law of the random variable I.For any Borel set X, the set of all bounded and Borel measurable functions on X will be denoted by  .

is in
The sample space is the product space .

B
=  such that the projections on the t-th factors describe the state and the action at the t-th time of the process denotes the set of all policies, i.e., for let , , , , 0 .
for all  a policy is called stationary.Such a policy will be denoted by f.Let For any we assume that and for   P  We want to minimize the total reward risk making the risk at each time as small as possible.So, using for the random reward variable will be defined in the discounted or average case as follows.With some abuse of notation, we denote by b) The average case.
For the family of random reward streams have same properties as those of coherent risk measures (cf.[4]), which is shown in the following proposition.

 
Other assertions in Proposition 1.1 are easily proved.This completes the proof.□ For  , where where     By the representation formula of @ CV R  (cf.[2,3]), the second equality of (7) holds, which completes the proof.□ The value function of the discounted and average cases are defined respectively by A policy is called discounted and average risk-optimal, respectively, if

Risk-Optimization
In this section, using for a random reward variable (1), we define a new immediate reward function by which the theory of MDPs will be easily applicable.Moreover, sufficient conditions are given for the existence of discounted or average risk optimal policies.@ CV R

Another Representation of Risk Measures
In this subsection, another representation for DS  and AV  are given.For any the corresponding immediate reward function will be defined by for each x S  and .a A  Then, we have the following, which shows that the original problem with is equivalent to the new problem with r. r Theorem 2.1.It holds that, for any .

The Discounted Case
Here, we drive the optimality equation for the discounted case, which characterizes a discount risk optimal policy.To this end, we need the following Assumption A.
Assumption A. The following 1) -4) holds: i.e., for any For any


, so that there exists 1 2 , > 0 y   Therefore, from Assumption A 2) and convergence assumptions there exists N for which z    , n n n y x a  for , which implies (12).Thus, by the general convergence theorem (cf.[17]) and ( 11) and ( 12), we have that We can be in position to state the main theorem in the discounted case.
Theorem 2.3.Suppose that Assumption A holds.Then, 1) The value function DS  is given by where is a unique solution to the optimality equation of the discounted case, 2) The exists a measurable function : attains the minimum in (14) and the stationary policy f  is discount risk-optimal.

The Average Case
In order to obtain the optimality equation for the average case, we assume that Assumption below holds, which guarantees the ergodicity of the process.
Assumption B. There exists a number where  denotes the variation norm for signed measures.
One of sufficient condition for Assumption B to hold, easily checked for applications, is as follows (cf.[19,20]).
Assumption B ' There exists a measure  on with Theorem 2.4.Suppose that Assumptions A and B hold.Then, there exists Moreover, there is an average risk-optimal stationary policy f  such that minimizes the righthand side of (17).
Proof.We have already obtained that is continuous in , applying the theory of average MDPs (cf.Corollary 3.6 in [19]), Theorem 2.4 follows, as required.□

An Application to Inventory Model
We consider the single-item model with a finite capacity < C  , in which the demands   =0 X denotes the stock level at the beginning of period t and action t is the quantity ordered (and immediate supplied) at the beginning of period t.Putting the amount sold during period t, , the system equation is given as follows.
  The transition probability Also, the immediate reward is given as , , r x a where is the unit sale price, the unit production cost and unit holding cost.Several lemmas are needed for risk analysis.Let > 0  be a random variable with a given demand distribution  and for where In order to the equivalent MDPs, we specify the immediate reward where   and the third equality follows from the monotonicity and homogeneous property of The function L defined above is proved to be a convex function.

Lemma 3.2
The following 1) -2) hold. 1) The second and the third inequalities follow from the monotonicity and the convexity of , respectively.This means that @ CV R   L u is convex.□ To applying Theorems 2.3 and 2.4 to inventory problems, the following is needed.We can state the main theorem.Theorem 3.3.Suppose that Assumption C holds.Then, for each of discounted or average case, there exists a constant level stationary policy f   which is optimal, that is, the ordered amount for some , x    where the critical level x  for each case is given from the corresponding optimality Equations (14) and (17).
Proof.First we verify that 1) -4) of Assumption A are satisfied.A 1) -A 4) are clearly true by definitions.For any Since   , r x a is convex in , using the result of Iglehant [21] (cf.[22]), it follows that the right-hand sides of the corresponding optimality equation ( 14) and ( 16) are convex in So, it is easily shown that there exists a risk-optimal policy f  of a constant level type (23) for each case.The proof is complete.□

Acknowledgements
This study was partly supported by "Development of methodologies for risk trade-off analysis toward optimizing management of chemicals" funded by New Energy and Industrial Technology Development Organization (NEDO).


process is a controlled dynamic system defined by a six-tuple sets S and A are state and action spaces, respectively,   A x is non-empty Borel subset of A which denotes the set of feasible actions when the system is in state , is the law of motion, is an immediate reward function and is an initial state distribution.Throughout this paper, we suppose that the set shows that D is closed.Similarly, we can prove that


are i.i.d. with the distribution function which has a continuous density   x  w.r.t. the Lebesgue measure . The state space and action space are assertion(16) in Assumption B ' holds.Thus, Theorems 2.3 and 2.4 are applicable. ,