Stability Analysis of a Neurocontroller with Kalman Estimator for the Inverted Pendulum Case

This paper presents a practical, simulation-based stability analysis of the effect of incorporating a Kalman estimator in the control loop of an inverted pendulum driven by a neurocontroller. The neurocontroller is computed by approximate optimal control without considering the Kalman estimator in the loop, following the separation theorem. The results are compared with those of a time-varying linear controller, which performs acceptably when there is no noise in the state or in the measurement, but which under noise conditions operates over a more limited range of the state space than the neurocontroller proposed here.


Introduction
The motivation for this case study, known as inverted pendulum control, arises from the need for robust controllers in situations where the equilibrium of an unstable system must be maintained. A directly related situation is the attitude control of a booster rocket at takeoff when sending a payload to space. It is a well-known problem in the control theory literature [1] [2] and in machine learning [3] [4] [5] [6]. However, such problems are challenging since real systems are difficult to control, to some extent because the controller must account for redundant feedback systems, an effect similar to that of incorporating a state-variable estimator into the controller. This can cause instability in the closed loop, which must be foreseen and analyzed. The analysis can be done through simulations of the estimator-controller system, in order to establish a stability domain. In this work we opt for control based on optimization [7] [8] [9], where the optimal control problem is formulated. To solve the optimal control problem, a very powerful tool is the Dynamic Programming technique [10] [11] implemented with approximations [3] [4] [5] in a machine learning scheme [12], since it allows dealing with constrained, nonlinear processes and non-quadratic performance indexes. However, it is often difficult to arrive at a methodology for implementing controllers based on machine learning, since they require heuristics and a good knowledge of the adaptation mechanisms involved [3]. In this work, a methodology to determine the conditions that achieve good results in simulation is shown. The result is a controller consisting of a compact function, called neurocontroller.
In this paper, an inverted pendulum model is studied with and without a state estimator. When a time-varying linear quadratic regulator (TVLQR) with direct state measurement is used, good performance can be achieved. However, it can be improved with a neurocontroller. The performances obtained with the linear controller and with the neurocontroller are shown in Figure 1 and Figure 2, respectively. Note that the cumulative cost of the neurocontroller (2691.3) is about 28% lower than that of the linear controller (3760.7). Moreover, when the controllers are used in more realistic situations, with a state estimator and noisy measurements, the performance of the linear controller deteriorates more than that of the neurocontroller, to the point that it fails to stabilize the system for the same initial conditions. In this paper, an analysis of the deterioration of the system performance when a state estimator is required is shown.
This paper is organized as follows. After this Introduction, the problem is detailed and expressed as mathematical equations in Section 2. Section 3 details the proposed solution. Section 4 develops the implementation of the obtained solution, together with a classical method for comparison purposes. The obtained results, with their pros and cons, are discussed in Section 5.

Problem Formulation
The dynamic programming approach assumes that the process evolution can be split into stages [10], so adopting the discrete-time version of dynamic systems [9] is straightforward. The problem formulation puts the elements of optimal control in formal terms. These elements are the cost function to be minimized, the control law, and the dynamic system model with its constraints. If the system model cannot be expressed, or is not feasible to express, in closed analytical form through a differential equation, it is useful to generate a black-box model [13].

1) Notation and Assumptions
This section introduces the nomenclature used throughout the article. The symbols are:

$r$ and $v$ are real-valued parameter vectors whose dimension depends on $h$, where $h \in \mathbb{N}$ is determined by the control engineer.
xi. $\tilde{J}$ is the approximation of the cost function $J$, and its domain includes the parameter vector $r$.
xii. $\tilde{\mu}$ is a function whose behavior approximates the function $\mu$, and its domain includes the parameter vector $v$.
$J^{*}(x, k)$ is the minimum cost to go from the state $x$ at time $k$ up to the terminal state at time $N$.
$Q(i, u)$ is a real-valued function associated with state $i$ and action $u$.
xxi. $\eta_n$ is a function that varies with the iteration number $n$, bounded between 0 and 1.
xxii. $\tilde{Q}(\cdot)$ is the approximate version of the factor $Q(\cdot)$.
xxiii. $\gamma_n$ is the discount factor, variable with iteration $n$ and bounded between 0 and 1.
xxiv. $\mu$ in tabulated form is the control action expressed as a look-up table.
$w(k)$ is a real-valued white noise sequence with zero mean and unit variance.
xxxiv. $\delta \in \Re$ is the longitudinal displacement of the cart.
xxxv. $\dot{\delta} \in \Re$ is the longitudinal velocity of the cart.
xxxvi. $\varphi \in \Re$ is the angle of the inverted pendulum bar.
xxxvii. $\dot{\varphi} \in \Re$ is the angular velocity of the inverted pendulum bar.
xxxviii. $M_P$ is the concentrated mass of the cart, whose value here is 0.5 kg.
xxxix. $m_P$ is the concentrated mass of the bar, whose value here is 0.1 kg.
xl. $F_P$ is the displacement friction constant, assigned 0.1 N·m⁻¹·s.
xli. $l_P$ is the length of the pendulum bar, 0.6 m.
xlii. $g_P$ is the standard acceleration due to gravity, 9.81 m·s⁻².
xliii. $Q \in \Re^{n \times n}$ is the weighting matrix for the state vector from k = 0 to k = N − 1, with N the terminal state time.
xliv. $S$ is the weighting matrix for the state vector at the terminal state time N.
xlv. $R \in \Re$ is the weighting matrix for the control action variable.
2) The basic problem
To formulate the optimal control problem, the expressions of the process model in discrete time, the restrictions on the variables, and the cost function to be minimized are presented. The problem considered is that of minimizing the separable cost function (1), where x(0) has a fixed value and the constraints (3) on the state and manipulated variables must be satisfied together with the system equation (2).
The function I(⋅) is defined by the control engineer; it must be convex but not necessarily quadratic, and f(⋅) is the nonlinear relationship between instants k and k + 1 of the state and manipulated variables. Moreover, both are bounded and continuous functions of their arguments, and x and u belong to closed and bounded subsets of $\Re^{n}$ and $\Re^{m}$, respectively. Then, the Weierstrass theorem asserts that there exists a minimizing policy, also called control law. Therefore, it is desired to find a correspondence relation that makes the process modelled by (2) evolve from any initial condition to the terminal state x(N), satisfying constraints (3) and minimizing the cost function (1). The implementation is shown in Figure 3, where the flow of information between the controller and the closed-loop system is stated. Note that the behavior of the closed-loop system is shaped by designing the performance index, which is added at each stage in the cost function (1).
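The displayed forms of (1)-(3) are not legible in this copy. One form consistent with the description above is sketched here as an assumption, not as the authors' exact statement; the set names $X$ and $U$ are introduced only for illustration.

```latex
\begin{align}
J &= \sum_{k=0}^{N} I\big(x(k), u(k)\big), \tag{1}\\
x(k+1) &= f\big(x(k), u(k)\big), \qquad k = 0, 1, \dots, N-1, \tag{2}\\
x(k) &\in X \subset \Re^{n}, \qquad u(k) \in U \subset \Re^{m}, \tag{3}
\end{align}
```

Here $X$ and $U$ stand for the closed and bounded subsets mentioned in the text, and the cost is separable because it is a sum of per-stage terms $I(\cdot)$.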

Proposed Solution
In order to solve the formulated problem, the proposed solution uses dynamic programming, and then approximations are introduced through functions whose parameter vectors v and r must be determined.
1) Optimal control for processes modelled as constrained nonlinear systems
The procedure to solve the optimal control problem for both continuous and discrete time dynamic systems is well known [7] [8] [14], and consists of analytically minimizing the proposed cost function (1) and obtaining from this minimization an expression for the function μ. When the system is linear and the cost function is quadratic, the optimal control problem has a unique solution through the Riccati equation. However, when the system is nonlinear, the solution of the Hamilton-Jacobi-Bellman equation [14] must be found, which is available only for a certain class of nonlinear systems. Here, an optimization principle is used that solves the same control problem while allowing any cost function and respecting the constraints on the state and control variables in a natural way.

2) Bellman's optimality principle
Figure 3. Implementation of the controller based on numerical dynamic programming.
Applied Mathematics

The principle of optimality [10] allows solving an optimization problem in which a dynamic process evolves over time through stages. Applying the principle of optimality to (1) yields the so-called Bellman equation. Therefore, the optimal control action $u^{o}$ is the one that attains the minimum in the Bellman equation, which defines the optimal decision policy or optimal control law. Note that $J^{*}$ does not depend explicitly on u(k), as Equation (8) shows.
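The displayed Bellman equation is missing from this copy. A standard finite-horizon form, written with the stage cost $I(\cdot)$ and the model $f(\cdot)$ of the problem formulation, is sketched below under the assumption that the cost is a plain sum of stage costs; the admissible set $U$ is a name assumed here.

```latex
\begin{align}
J^{*}\big(x(k), k\big) &= \min_{u(k) \in U}
  \Big\{ I\big(x(k), u(k)\big) + J^{*}\big(f(x(k), u(k)),\, k+1\big) \Big\},\\
u^{o}(k) &= \arg\min_{u(k) \in U}
  \Big\{ I\big(x(k), u(k)\big) + J^{*}\big(f(x(k), u(k)),\, k+1\big) \Big\}
  = \mu\big(x(k), k\big).
\end{align}
```

The minimization over $u(k)$ is what removes the explicit dependence of $J^{*}$ on the control action.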
3) Introducing approximations
To obtain the control law or decision policy, there exist numerical methods [3] [4] [10] [11] and approximations [3] [5] [12], which are detailed below. An approximation function for the values of Equation (1) over a compact domain is now introduced. Thus, a compact representation of the cost associated with each state of the process is obtained.

4) Design of the approximation function
The approximation function incorporates a parameter vector r, which is defined as a partitioned vector whose structure defines the structure of the function. Each vector $r_1^{i}$ has the same dimension, namely the number of inputs of the function plus one, to account for a constant unit input. Thus, h intermediate scalar values ξ are computed as the scalar product between the input vector x, augmented with the unit input, and the corresponding parameters. Every value ξ is then processed through the hyperbolic tangent function, written in a form that avoids large numbers and whose right-hand side requires only one exp(⋅) evaluation to reduce computation time. These h values, together with the unit bias term, are then combined through an inner product with the remaining part of the parameter vector, $r_2$, whose dimension must be consistent with this product. The parameter h therefore acts as a tuning parameter for the dimension of vector r, and hence for the structure of the approximation function.
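The structure described above can be illustrated with a minimal Python sketch. The function names `fast_tanh` and `approximator`, the clamping bound and the particular single-exponential identity are assumptions made for this example, since Equations (9)-(14) are not legible in this copy.

```python
import math


def fast_tanh(xi):
    """tanh computed with one exp() call: tanh(xi) = 1 - 2 / (exp(2*xi) + 1)."""
    # Clamp the argument so that exp() never overflows for large |xi|.
    xi = max(-20.0, min(20.0, xi))
    return 1.0 - 2.0 / (math.exp(2.0 * xi) + 1.0)


def approximator(x, r1, r2):
    """Single-hidden-layer approximation function.

    x  : list with the function inputs
    r1 : h parameter vectors, each of length len(x) + 1
    r2 : output parameter vector of length h + 1
    """
    x_aug = list(x) + [1.0]                      # inputs plus the unit input
    hidden = [fast_tanh(sum(w * xj for w, xj in zip(row, x_aug))) for row in r1]
    hidden.append(1.0)                           # unit bias for the output stage
    return sum(w * z for w, z in zip(r2, hidden))


# Example with 4 inputs and h = 7 hidden values (7 vectors of size 5, one of size 8).
r1 = [[0.0] * 5 for _ in range(7)]
r2 = [1.0] * 7 + [2.5]
print(approximator([0.1, 0.0, 0.2, 0.0], r1, r2))
```

With all hidden parameters set to zero, every hidden value is zero and only the output bias contributes, which is a quick way to check the dimensions.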
Once a suitable value for vector r is found, one has the approximate value of the minimum cost incurred to reach the terminal state from the current state x(k), and, with the model of the system, the control policy can be found as the argument u(k) that minimizes the resulting approximate cost-to-go expression. To find r, the search for the policy function is divided into two tasks, as shown in Figure 4. One of them is the evaluation of a given control policy or control law, from which the costs for all stages of the process are calculated. The other one is the policy improvement procedure. Both tasks are carried out approximately with respect to the original system, because they rely on an approximation function whose behavior is tuned.

5) Implementation
A set S of representative data in a domain of the state space is available, and for each state i ∈ S the cost values C(i) are calculated. To this end, an initial control law or control policy is proposed, and the system (2) is evolved from the given state i to the terminal stage, evaluating the performance index through the cost function (1) to obtain the cost $J_{\mu}(i)$. This procedure is performed for every state i ∈ S. Then, the approximate cost function is tuned by minimizing expression (16) with respect to r, so that an approximation function for the cost associated with the evaluated policy is obtained. The minimization is carried out by an incremental gradient iteration, where $\eta_n$ fulfills the conditions (18) under which the algorithm converges, and n refers to the tuning iteration. For computing C(i), the performance evaluation is implemented through the system model (2) and the cost function (1).
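The policy evaluation step can be sketched in Python on a toy problem. Everything below is an assumption made for illustration: the scalar system, the zero policy, the one-parameter approximator r·x² and the step-size schedule, which is just one diminishing sequence of the kind usually required for convergence of incremental gradient methods.

```python
def rollout_cost(x0, policy, f, stage_cost, n_stages):
    """Cost C(i) of following `policy` from state x0 for n_stages stages."""
    x, total = x0, 0.0
    for _ in range(n_stages):
        u = policy(x)
        total += stage_cost(x, u)   # accumulate the per-stage performance index
        x = f(x, u)                 # advance the system model one stage
    return total


def fit_cost_parameter(states, costs, sweeps=50):
    """Incremental gradient fit of the toy approximator J~(x, r) = r * x**2."""
    r = 0.0
    for n in range(sweeps):
        eta = 0.5 / (1.0 + 0.05 * n)        # hypothetical diminishing step size
        for x, c in zip(states, costs):     # one sample at a time
            error = r * x * x - c
            r -= eta * error * x * x        # gradient of 0.5 * error**2 w.r.t. r
    return r


# Toy setup (all assumptions): stable scalar system, zero policy, quadratic cost.
f = lambda x, u: 0.5 * x + u
policy = lambda x: 0.0
stage_cost = lambda x, u: x * x + u * u

S = [-1.0, -0.5, 0.5, 1.0]
C = [rollout_cost(i, policy, f, stage_cost, n_stages=30) for i in S]
r = fit_cost_parameter(S, C)
print(r)  # close to 4/3, since C(i) = i**2 * (1 + 1/4 + 1/16 + ...)
```

In the paper's setting the one-parameter model would be replaced by the tanh-based approximation function with parameter vector r, and the rollout by a simulation of the pendulum model.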
Then the costs associated with each state-action pair are computed by using the auxiliary cost function Q(i, u), in its approximate version, where $\gamma_n$ is a discount factor that can vary from iteration to iteration until it reaches unity. Then, the improved policy is obtained as the table (20). In this way the policy improvement task is carried out, in which a new tabulated control policy μ(⋅) expressed as (20) is obtained. After that, the calculation of the costs for each state i starts again, and in each iteration the function $\gamma_n$ is updated.

Figure 4. Scheme of the optimal control policy search process.
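The policy improvement step can be sketched as follows. The form of the approximate factor, stage cost plus discounted approximate cost-to-go of the successor state, is an assumption consistent with the description, since Equations (19) and (20) are not legible in this copy; the toy system, cost and action grid are hypothetical.

```python
def improve_policy(states, actions, f, stage_cost, j_approx, gamma):
    """Tabulated policy: for each state keep the admissible action that
    minimizes Q~(i, u) = I(i, u) + gamma * J~(f(i, u))."""
    table = {}
    for i in states:
        def q_factor(u, i=i):
            return stage_cost(i, u) + gamma * j_approx(f(i, u))
        table[i] = min(actions, key=q_factor)   # search only admissible actions
    return table


# Toy setup (all assumptions): integrator model, quadratic costs, coarse action grid.
f = lambda x, u: x + u
stage_cost = lambda x, u: x * x + u * u
j_approx = lambda x: x * x
actions = [-1.0, -0.5, 0.0, 0.5, 1.0]

policy_table = improve_policy([-1.0, 0.0, 1.0], actions, f, stage_cost,
                              j_approx, gamma=1.0)
print(policy_table)
```

Restricting the search to a finite set of admissible actions is how the constraint on u enters without any extra machinery.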
Simultaneously with the described tasks, an approximation for the improved control law μ(⋅) is introduced by means of a function with parameters v, as shown in Figure 5, following the same structure as that described by Equations (9)-(14).
Thus, since the function μ(⋅) is the analytical solution of the optimal control problem and is available only as a table, the aim is to obtain an approximation $\tilde{\mu}(\cdot, v)$ of it, where v is the parameter vector.
To find the approximation function $\tilde{\mu}(\cdot, v)$ using the data of the improved policy μ(⋅) defined in (20), it is proposed to minimize expression (21) over the set S, where the control law is represented by $\tilde{\mu}(\cdot, v)$ with the tuning parameter vector v. A solution of Equation (21) is obtained by the incremental gradient method [13], expressed as iterations on n, where $\eta_n$ fulfills the conditions (18). A summary of the algorithm is detailed in Table 1. Note that two approximation problems are solved at the same time: given a policy, the cost approximation with parameters r is tuned; then, given the cost approximation, the factors $\tilde{Q}$ are computed for i ∈ S and the new policy is found. Once $\tilde{\mu}(\cdot, v)$ is available, the control actions are obtained as shown in Figure 5. The control scheme is shown in Figure 6.

6) Discussion and comment on the implementation
The algorithm to solve the optimal control problem for nonlinear processes with non-quadratic cost functions and constraints was detailed. Given the use of approximations, the topic of function approximation in dynamic systems [13] must be well mastered to obtain suitable results in the closed-loop system.
As a general suggestion, it must be mentioned that, as in many nonlinear systems, the algorithm is strongly dependent on the initial conditions. Specifically, it depends on the initial policy and on the states used in the cost computation. The speed at which the parameters are tuned over the iterations is fixed by the function γ, and the method is sensitive to this parameter. Usually one can make the first attempts with γ = 1 held constant and few iterations, and then begin to vary it so that it converges to 1 over the iterations, always verifying that the performance of the controller improves in the long term. The number of adjustable parameters in each approximation function depends on the complexity of the data, which are generally conditioned by applying some normalization or feature extraction technique.

Control of the Inverted Pendulum
The inverted pendulum can be represented as shown in Figure 7. For this cart-bar system, a controller will be designed using the algorithm of Table 1. The equations that describe the angle and the linear displacement dynamics are given by (23). While the controller is designed, the system trajectories are generated by simulation for an initial angle φ of 0.2 rad. The force u must fulfill the constraint −30 ≤ u ≤ 30.

Figure 7. Diagram of the inverted pendulum system on a cart.

The proposed cost function (24) combines weightings on the state and on the control action, whose nonzero entries are 5, 50 and 0.001, with a term $\theta_u$, which is defined to constrain the values of u(k) to the ±30 bound. The continuous-time model is discretized with a sampling time of 0.1 s.
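Since Equation (23) is not legible in this copy, the following Python sketch assumes the standard cart-pendulum model with cart friction, (M+m)δ'' + m·l·φ''·cos φ − m·l·φ'²·sin φ + F·δ' = u and l·φ'' − g·sin φ + δ''·cos φ = 0, with the parameter values of the notation list. The state ordering [δ, δ', φ, φ'], the function names and the forward-Euler integration with substeps are assumptions for illustration, not the authors' discretization.

```python
import math

# Parameter values taken from the notation list.
M_P, m_P, F_P, l_P, g_P = 0.5, 0.1, 0.1, 0.6, 9.81


def pendulum_derivatives(x, u):
    """Time derivatives of x = [delta, delta_dot, phi, phi_dot] for force u."""
    _, delta_dot, phi, phi_dot = x
    s, c = math.sin(phi), math.cos(phi)
    # Cart acceleration after eliminating phi_ddot from the two equations.
    delta_ddot = (u - F_P * delta_dot + m_P * l_P * phi_dot ** 2 * s
                  - m_P * g_P * s * c) / (M_P + m_P * s * s)
    phi_ddot = (g_P * s - delta_ddot * c) / l_P
    return [delta_dot, delta_ddot, phi_dot, phi_ddot]


def sampled_step(x, u, dt=0.1, substeps=100):
    """Advance one sampling period with the force held constant and saturated."""
    u = max(-30.0, min(30.0, u))          # force constraint -30 <= u <= 30
    h = dt / substeps
    for _ in range(substeps):             # forward Euler with small substeps
        dx = pendulum_derivatives(x, u)
        x = [xi + h * dxi for xi, dxi in zip(x, dx)]
    return x


# Uncontrolled evolution from the initial angle of 0.2 rad: the bar falls away.
print(sampled_step([0.0, 0.0, 0.2, 0.0], 0.0))
```

A function like `sampled_step` can play the role of the discrete-time model f(⋅) when generating trajectories for the controller design.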

1) State estimation
In order to retrieve the system state vector, a Kalman estimator is used, based on the discrete-time linearized version of (23) given by (26) and (27), where $x = [\delta\ \dot{\delta}\ \varphi\ \dot{\varphi}]^{T}$ is the state and y(k) is the measurement. Furthermore, C = [1 0 0 0]. To find the estimate of x(k), an a priori estimate of the observed states is first computed from the model, and this estimate is then corrected with the measurements of the system output through the Kalman gain $K_O$ [15]. This gain is calculated by using a Gaussian noise model in the state x(k) and in the measurement y(k), given by (26) and (27).
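The predict-and-correct structure described above can be sketched in plain Python for a scalar measurement, as is the case with C = [1 0 0 0]. This is a generic recursive Kalman filter step written under assumptions: the names `kalman_step`, `Qn` and `Rn` (noise covariances) are hypothetical, and the paper may instead use a precomputed gain $K_O$.

```python
def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def kalman_step(x_hat, P, u, y, F, G, C, Qn, Rn):
    """One Kalman filter step for x(k+1) = F x + G u + v, y = C x + w (scalar y)."""
    n = len(x_hat)
    # A priori estimate from the model.
    x_pred = [fx + g * u for fx, g in zip(mat_vec(F, x_hat), G)]
    FPFt = mat_mul(mat_mul(F, P), transpose(F))
    P_pred = [[FPFt[i][j] + Qn[i][j] for j in range(n)] for i in range(n)]
    # Gain: the innovation variance is a scalar, so no matrix inverse is needed.
    PCt = mat_vec(P_pred, C)
    s = sum(c * p for c, p in zip(C, PCt)) + Rn
    K = [p / s for p in PCt]
    # Correction with the measured output.
    innovation = y - sum(c * xp for c, xp in zip(C, x_pred))
    x_new = [xp + k * innovation for xp, k in zip(x_pred, K)]
    # (I - K C) P_pred, using the symmetry of P_pred.
    P_new = [[P_pred[i][j] - K[i] * PCt[j] for j in range(n)] for i in range(n)]
    return x_new, P_new, K


# Toy check (assumptions): identity dynamics, unit covariances, one measurement.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z4 = [[0.0] * 4 for _ in range(4)]
x_new, P_new, K = kalman_step([0.0] * 4, I4, 0.0, 2.0, I4, [0.0] * 4,
                              [1.0, 0.0, 0.0, 0.0], Z4, 1.0)
print(x_new, K)
```

Only the measured component is corrected in this toy case because the initial covariance is diagonal; with the linearized pendulum matrices the cross-covariances let the position measurement correct the unmeasured states as well.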
2) Implementing the controller
The algorithm of Table 1 was tuned to achieve the control objective, which is to bring the bar to the vertical position starting from angles smaller than 1 rad. Figure 10 shows the evolution of the algorithm parameters, namely the cost-to-go function from the initial condition $[0\ 0\ 0.2\ 0]^{T}$ to the final time, set to 10 s. Note that the behavior is not stable in the first half of the performed iterations, but it then stabilizes and the control objective is achieved.
The set S consisted of 3000 samples in the $\Re^{4}$ space of the state variables, with the range shown in Figure 11. The control law is given by Equation (31), where v contains the parameters corresponding to a set as in (9). Seven hidden nodes were used, which gives 7 vectors $r_1^{i} \in \Re^{5}$ and a vector $r_2 \in \Re^{8}$, implementing Equation (14). For the approximation of the function J(⋅) defined in Equation (24), the same structure of the approximation function as for the control law is used. The parameter tuning was performed by the Levenberg-Marquardt algorithm [5].
In order to compare the neurocontroller performance, a time-varying discrete linear quadratic regulator (TVDLQR) was implemented using the classical LQR theory in discrete time [16].
3) System simulation results under noise conditions
Figure 8 and Figure 12 show the performance of the neurocontroller and of the TVDLQR, both with the same Kalman estimator. Under no-noise conditions, the performance of the system in both cases is quite similar, as shown in Figure 1 and Figure 2. Nevertheless, under noise conditions the TVDLQR does not achieve the same performance as the neurocontroller, since the latter extends the range of admissible initial conditions from 0.19 to 0.47 rad, as seen in Figure 13 and Figure 12. Furthermore, the estimated variables used by the LQR controller are shown in Figure 9.

4) Discussion
Since the control objective of the system with estimator is that the pendulum does not fall, a qualitative analysis of the performance of the TVLQR and NC controllers can be made. Figure 8 shows that the linear controller meets the control objective for initial conditions of 0.19 rad or less. In contrast, Figure 13 and Figure 12 show that for initial conditions larger than 0.19 rad the linear controller does not meet the control objective, but the NC does. Figure 10 shows that the parameter tuning procedure is erratic and difficult to adjust, since the costs associated with the control policies are not necessarily monotonically decreasing. This means that it is hard to tune the algorithm of Table 1, which must be tuned by trial and error for each particular system. Also, Figure 11 details the range of the system samples used in the calculation of the NC to implement the algorithm of Table 1. Note that the system evolution can go out of this range, and the policy function must be able to give a response that stabilizes the system. This was achieved because of the suitable relation between the data set S and the problem complexity. Therefore, the results are encouraging, since in the examples the control objective is met, which is to obtain zero error in the output with respect to the desired value, the origin. A comparison of the performance of both control systems can be seen in Table 2. It is important to state that the linear controller is unable to keep the pendulum vertical when the initial angle exceeds 0.2 rad, whereas the NC achieves it even up to 0.47 rad.

Table 2. Comparative figures of performance results obtained by both controller-estimator systems. The Monte Carlo simulation was carried out with 150 trials, over 1000 time steps of 0.1 s.

The temporal evolution of these indices is shown in Figure 8 and Figure 12, where the mean value and the 66% band of the 150 trajectories of the Monte Carlo simulation are highlighted. Table 2 summarizes the performance achieved by each controller. Note that even with higher cost-to-go figures, the NC gives robust behavior with regard to the initial conditions and noise.

Conclusions
In this paper, a stability analysis of a neurocontroller with Kalman estimator for the inverted pendulum case was presented. The NC performance was compared against the TVLQR controller with the same Kalman estimator.
Given the policy μ(⋅) from Equation (20), the costs associated with each state, symbolized by C(i), are evaluated by Equation (17), and the parameters r are tuned to obtain a new version of the approximation function $\tilde{J}(\cdot, r)$.

Figure 5. Compact expression of the policy or control law.

u(k) is the operating force, as shown in Figure 7. The noise sequences are $v(k) \in \Re^{4}$ and $w(k) \in \Re^{1}$, assumed Gaussian with zero mean and unit variance. The model is obtained by discretizing the continuous-time linear version of (23) with a sampling time of 0.1 s, which defines the matrices F and G for the case of the pendulum.

Figure 11 details the range of the system samples used in the calculation of the NC.

Figure 9. Estimated variables evolution for three initial conditions of the inverted pendulum system when it has noise in the states and in the measurement, with initial conditions of $x_0 = [0\ 0\ \varphi_0\ 0]^{T}$, where $\varphi_0$ takes the values 0.1, 0.15, and 0.19 rad, i.e. 5.7296°, 8.5944° and 10.8862°.

Figure 10. Cost-to-go function evolution associated with the initial condition $x_0 = [0\ 0\ 0.2\ 0]^{T}$ and tuning law evolution used for the neurocontroller calculation.

Figure 11. Phase planes for the initial conditions $x_0 = [0\ 0\ \varphi_0\ 0]^{T}$, where $\varphi_0$ takes the values 0.1, 0.15, and 0.47 rad, i.e. 5.7296°, 8.5944° and 26.9290°. Continuous lines show the evolution of system trajectories under noise conditions in the state and in the measurement.

Table 1. Approximate optimal controller calculation algorithm.