Discrete-Time Hybrid Decision Processes: The Discounted Case

This paper is a sequel to Kageyama et al. [1], in which a Markov-type hybrid process was constructed and the corresponding discounted total reward was characterized by a recursive equation. The objective of this paper is to formulate a hybrid decision process and to establish the existence and characterization of optimal policies.


Introduction
Credibility theory, developed by Liu [2], is useful in dealing with uncertainty in human thinking. In the real world, we often encounter complex problems involving human thinking, which cannot be treated by probability theory alone. To deal with such problems, Li and Liu [3] introduced a more flexible theory of uncertainty, called chance theory, which is a hybrid of probability and credibility. Recently, Kageyama et al. [1] gave a method of constructing a Markov-type hybrid process from a stochastic kernel and a credibilistic kernel. Anticipating much wider applications of hybrid processes in the near future, it is meaningful to consider the case where the behavior of the hybrid processes given in Kageyama et al. [1] may be influenced by a suitable choice of decisions or actions. The objective of this paper is to formulate a hybrid decision process, by reference to the modeling of stochastic control systems known as Markov decision processes (cf. [4,5]), and to establish the existence and characterization of optimal policies.
In the remainder of this section, we establish the notation used throughout the paper and recall the chance measure and the hybrid variables whose expected values are defined. For a non-empty set $X$, the set of functions on $X$ satisfying the required condition will be denoted by $K(X)$. A Borel set is a Borel subset of a metric space. For an arbitrary Borel set $X$, we consider both the set of Borel subsets of $X$ and the power set of $X$. Let $P(X)$ be the set of probability measures on $X$. The set of all events and the chance measure defined on it are given as in [3,6]. In Section 2, we define a hybrid decision process by means of a credibilistic kernel and a stochastic kernel; in Section 3 this process is analyzed to show the existence and functional characterization of optimal policies. An example is given in Section 4.
The expected value $E[r(g,p)]$ of the hybrid variable $r(g,p)$ is defined by the Choquet integral:
$$E[r(g,p)] = \int_{0}^{+\infty} \mathrm{Ch}\{r(g,p) \ge t\}\,\mathrm{d}t - \int_{-\infty}^{0} \mathrm{Ch}\{r(g,p) \le t\}\,\mathrm{d}t,$$
provided that at least one of the two integrals is finite, where $\mathrm{Ch}$ denotes the chance measure.
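As a quick numerical illustration, the nonnegative part of the Choquet integral above can be approximated by a Riemann sum over the tail chance measure. The sketch below is ours, not the paper's: the function names are hypothetical, and we take a uniform random variable, for which the chance measure reduces to ordinary probability.

```python
def choquet_expected_value(chance_of_at_least, t_max, n=10000):
    """Approximate E[xi] = integral_0^t_max Ch{xi >= t} dt by a midpoint sum,
    for a nonnegative hybrid variable supported on [0, t_max]."""
    dt = t_max / n
    return sum(chance_of_at_least((i + 0.5) * dt) for i in range(n)) * dt

# Toy check: for xi uniform on [0, 1], Ch{xi >= t} = 1 - t, so E[xi] = 1/2.
ev = choquet_expected_value(lambda t: max(0.0, 1.0 - t), t_max=1.0)
```

Since the midpoint rule is exact for affine integrands, the approximation here agrees with the closed-form value up to floating-point error.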

Hybrid Decision Processes
The state and parameter of the dynamic systems under consideration will be denoted by points in a Borel set $X$ and in a finite set, respectively. Let $A = \{a_1, a_2, \dots, a_k\}$ be a finite action space.
The discrete-time Markov-type hybrid decision process with state space $X$ is given as a five-tuple, built from a credibilistic kernel and a stochastic kernel satisfying the required conditions. When the system is in a state $(g, p) \in X$ and an action $a \in A$ is chosen, two things happen.
1) The state $(g_0, p_0)$ of the system moves to the new state $(g_1, p_1) = (T^1(g_0, a),\, T^2(p_0, a)) \in X$ according to the state equation, whose two components are defined pointwise for the credibilistic and the probabilistic parts, respectively.
2) The expected value of the reward function $r$ is received. Once the state has moved to the new state, a new action is chosen and the process is repeated.
The metric $\rho$ on the state space $X$ is defined by means of the total variation metric $\| p - p' \|_{TV}$ on the probability components, and the Borel subsets of $X$ are those generated by the $\rho$-metric topology.
A stationary policy $f$ is measurable if, for any action $a \in A$, the preimage $f^{-1}(a)$ is a Borel subset of $X$. We denote by $F$ the set of all measurable stationary policies.


For any initial state $(g_0, p_0) = (g, p) \in X$, under a stationary policy $f \in F$, we define the total discounted reward function by
$$\varphi_f(g,p) = \sum_{t=0}^{\infty} \beta^{t}\, r\big(g_t, p_t, f(g_t, p_t)\big),$$
where $\beta \in (0,1)$ is the discount factor. A policy $f^* \in F$ is called optimal if $\varphi_{f^*}(g,p) = \sup_{f \in F} \varphi_f(g,p)$ for all $(g,p) \in X$.
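Numerically, a total discounted reward of this form can be approximated by truncating the series at a finite horizon, since the tail is geometrically small. The sketch below is ours (the function names and the toy constant-reward system are illustrative, not from the paper); it checks the truncated sum against the geometric series $\sum_t \beta^t = 1/(1-\beta)$.

```python
def discounted_total_reward(reward_fn, transition_fn, policy, state,
                            beta=0.9, horizon=500):
    """Truncated evaluation of phi_f(state) = sum_t beta^t r(state_t, f(state_t))."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(state)
        total += discount * reward_fn(state, a)
        state = transition_fn(state, a)
        discount *= beta
    return total

# Toy check: constant reward 1 under any policy gives the geometric series 1/(1 - beta).
phi = discounted_total_reward(lambda s, a: 1.0, lambda s, a: s, lambda s: 0,
                              state=0, beta=0.9)
```

With $\beta = 0.9$ and horizon 500, the truncation error $\beta^{500}/(1-\beta)$ is far below floating-point tolerance.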

Analysis
In this section, we utilize the method of dynamic programming (cf. [4,5,9]) to derive the discounted optimality equation, from which the existence of optimal policies is shown. First, we show the measurability of the total discounted reward function and of the value function.

Lemma 3.1. For any stationary policy $f \in F$, the $t$-step reward function is measurable.

Proof. For the case $t = 1$, it suffices to prove the measurability of the maps given by (2.1) and (2.2). From (1.4) and (1.5), the relevant set can be rewritten in terms of the action $a$ given in (2.5). Since $A$ is finite, together with (2.1), (2.2) and (2.5), the measurability for $t = 1$ follows. For the case $t \ge 2$, the measurability can be proved by induction, similarly to the above. This completes the proof.
Theorem 3.1. For any stationary policy $f \in F$, the discounted total reward function $\varphi_f$ is measurable on the state space $X$.
Proof. From Lemma 3.1 and the definition of $\varphi_f(g, p)$, the assertion holds. Let $\mathbf{X}$ denote the class of all bounded measurable functions on $X$, equipped with the supremum norm $\|\cdot\|$. Then it is clear that the space $(\mathbf{X}, \|\cdot\|)$ is complete. For any policy $f \in F$, $(g, p) \in X$ and $h \in \mathbf{X}$, we define the operator $U_f$ on $\mathbf{X}$ as follows:

$$(U_f h)(g,p) = r\big(g, p, f(g,p)\big) + \beta\, h\big(T^1(g, f(g,p)),\, T^2(p, f(g,p))\big),$$
where $\beta \in (0,1)$ is the discount factor.
Lemma 3.2. The operator $U_f$ is a contraction on the space $\mathbf{X}$.
Proof. For any state $(g, p) \in X$ and any $h, h' \in \mathbf{X}$, we have
$$\big|(U_f h)(g,p) - (U_f h')(g,p)\big| = \beta\,\big| h\big(T^1(g, f(g,p)),\, T^2(p, f(g,p))\big) - h'\big(T^1(g, f(g,p)),\, T^2(p, f(g,p))\big)\big| \le \beta\, \| h - h' \|,$$
so that $\| U_f h - U_f h' \| \le \beta\, \| h - h' \|$. This completes the proof.

Theorem 3.2. The discounted total reward $\varphi_f$ is the unique fixed point of the operator $U_f$, i.e., $U_f \varphi_f = \varphi_f$.
Proof. As $r$ is a non-negative and bounded function, there exists a constant $M$ such that $\| \varphi_f \| \le M$, so that $\varphi_f \in \mathbf{X}$. From the Banach fixed point theorem, the conclusion of the theorem follows.
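The Banach fixed point argument also gives a practical way to evaluate a policy: iterate the contraction from any starting function, with geometric error decay of rate $\beta$. The following sketch runs such a policy evaluation on a hypothetical three-state deterministic chain; all states, transitions and rewards below are toy stand-ins of our own, not the paper's abstract $T^1$, $T^2$ and $r$.

```python
# Policy evaluation by iterating a contraction U_f on a toy 3-state chain.
BETA = 0.9
NEXT = {0: 1, 1: 2, 2: 0}          # deterministic next state under a fixed policy f
REWARD = {0: 1.0, 1: 0.0, 2: 2.0}  # one-step reward r(s, f(s))

def apply_Uf(h):
    """(U_f h)(s) = r(s, f(s)) + beta * h(next state)."""
    return {s: REWARD[s] + BETA * h[NEXT[s]] for s in h}

h = {s: 0.0 for s in NEXT}
for _ in range(500):
    h = apply_Uf(h)
```

After the loop, `h` is a near-exact fixed point: solving the three linear fixed-point equations by hand gives `h[0] = 2.62 / 0.271`, and applying `apply_Uf` once more changes `h` only below floating-point tolerance.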
In order to describe the optimality equation with respect to the value function
$$\varphi(g,p) := \sup_{f \in F} \varphi_f(g,p),$$
we define the operator $U$ on $\mathbf{X}$ as follows:
$$(Uh)(g,p) = \max_{a \in A}\big\{ r(g,p,a) + \beta\, h\big(T^1(g,a),\, T^2(p,a)\big) \big\}, \quad (g,p) \in X,\ h \in \mathbf{X}.$$
Then, we have the following.

Lemma 3.3. The operator $U$ is a contraction with modulus $\beta$ on the space $\mathbf{X}$.
Proof. From the definitions of $\mathbf{X}$ and $U$, it is obvious that $U$ maps $\mathbf{X}$ into $\mathbf{X}$. For any $(g, p) \in X$ and any $h, h' \in \mathbf{X}$,
$$\big|(Uh)(g,p) - (Uh')(g,p)\big| \le \beta \max_{a \in A}\big| h\big(T^1(g,a),\, T^2(p,a)\big) - h'\big(T^1(g,a),\, T^2(p,a)\big)\big| \le \beta\, \| h - h' \|,$$
which completes the proof.

Lemma 3.4. The value function $\varphi$ is a bounded and measurable function, i.e., $\varphi \in \mathbf{X}$.


Here, we can state the main result, which shows the existence of optimal policies.

Theorem 3.3. The following hold:
1) The value function $\varphi$ is the unique fixed point of the operator $U$, i.e., $U\varphi = \varphi$.
2) An optimal stationary policy exists.
3) Any stationary policy $f^* \in F$ attaining the maximum in the optimality equation, i.e., satisfying $U_{f^*}\varphi = U\varphi$, is optimal.
Proof. For 1), we have, for any $f \in F$ and $(g,p) \in X$,
$$(U_f \varphi)(g,p) = r\big(g, p, f(g,p)\big) + \beta\, \varphi\big(T^1(g, f(g,p)),\, T^2(p, f(g,p))\big) \le (U\varphi)(g,p).$$
Repeating the above inequality, we have $\varphi = U\varphi$. Assertions 2) and 3) were already shown in the proof of 1). This completes the proof.
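Since $A$ is finite, a maximizing action in the optimality equation always exists, and the fixed point $\varphi = U\varphi$ can be computed by value iteration, after which a greedy policy is optimal. The sketch below does this for a hypothetical two-state, two-action problem; all transitions and rewards are illustrative choices of ours, not from the paper.

```python
# Value iteration for the optimality operator U on a toy 2-state, 2-action problem.
BETA = 0.9
NEXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}            # next_state[s][a]
REWARD = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.5}}  # reward[s][a]

def apply_U(v):
    """(U v)(s) = max_a { r(s, a) + beta * v(next(s, a)) }."""
    return {s: max(REWARD[s][a] + BETA * v[NEXT[s][a]] for a in (0, 1)) for s in v}

def greedy_policy(v):
    """Pick, in each state, an action attaining the maximum in the optimality equation."""
    return {s: max((0, 1), key=lambda a: REWARD[s][a] + BETA * v[NEXT[s][a]])
            for s in v}

v = {0: 0.0, 1: 0.0}
for _ in range(500):
    v = apply_U(v)
policy = greedy_policy(v)
```

Here the optimal behavior alternates between the two states, and solving the resulting pair of linear equations by hand gives `v[0] = 2.8 / 0.19`.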

The Controlled Floating Exchange Rate System
In this section, we give an example of an application of our model.

For simplicity, let $N(\mu, \sigma^2)$ denote a normal distribution with mean $\mu$ and variance $\sigma^2$. Let $x_t \in X$ be the amount of wealth at time $t$, whose controlled system equation is given by the state equation, where $\{Z_t\}$ is a sequence of i.i.d. income random variables with a normal distribution and $a_t$ is an action at time $t$, selected from the finite action space of possible exchange rates. The real income then determines the reward received when the action $a_t$ is taken.
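A Monte-Carlo simulation of such a wealth process under a fixed exchange rate can be sketched as follows. The state equation $x_{t+1} = x_t + a_t Z_t$, the normal-income parameters, and the candidate exchange rates below are all assumptions of ours for illustration, since the paper's numeric values are not reproduced here.

```python
import random

random.seed(0)
RATES = (0.9, 1.0)    # candidate exchange rates (action space), assumed values
MU, SIGMA = 6.0, 0.8  # parameters of the normal income distribution, assumed values

def simulate_wealth(policy_rate, x0=0.0, horizon=50):
    """Run one trajectory of x_{t+1} = x_t + a_t * Z_t under a fixed rate a_t."""
    x = x0
    for _ in range(horizon):
        z = random.gauss(MU, SIGMA)  # i.i.d. income Z_t ~ N(MU, SIGMA^2)
        x += policy_rate * z         # real income = exchange rate times income
    return x

mean_wealth = sum(simulate_wealth(RATES[1]) for _ in range(2000)) / 2000
```

Under the assumed parameters, the mean terminal wealth at rate 1.0 is close to `horizon * MU = 300`, with Monte-Carlo noise on the order of a tenth of a unit.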