A novel approach to optimizing any given mathematical function, called the MOdified REinforcement Learning Algorithm (MORELA), is proposed. Although Reinforcement Learning (RL) was primarily developed for solving Markov decision problems, with some modifications it can be used to optimize mathematical functions. At the core of MORELA, a sub-environment is generated around the best solution found in the feasible solution space and compared with the original environment. MORELA can thus discover the global optimum of a mathematical function, because the optimum is sought around the best solution achieved in the previous learning episode with the assistance of the sub-environment. The performance of MORELA was tested against results obtained from other optimization methods described in the literature. The results showed that MORELA improves the performance of RL and outperforms many of the methods to which it was compared, in terms of the robustness measures adopted.


Although most of the methods described in the literature are able to discover the global optimum of a given optimization problem, the performance of newly developed algorithms should still be investigated. We therefore present a MOdified REinforcement Learning Algorithm (MORELA), which differs from RL-based approaches by generating a sub-environment around the best solution obtained so far; this solution is saved to prevent the search from being trapped at local optima. All function values, with their corresponding decision variables, are then ranked from best to worst, so that the sub-environment can be compared with the original environment. If a member of the sub-environment produces a better function value, it is added to the original environment and the worst solution is removed. This makes the search more effective, because the global optimum is sought around the best solution achieved so far with the assistance of both the sub-environment and the original environment.

RL has attracted a great deal of attention from the scientific community for solving different classes of problems, especially in recent decades [

There are three primary types of RL-based methods, each with its own advantages and disadvantages. Monte Carlo and temporal difference learning methods are able to learn directly from experience, whereas the third type, dynamic programming, requires a model of the environment. Because they do not need such a model, the first two are superior to dynamic programming in this respect. Indeed, temporal difference learning methods are at the core of RL [

The Q-learning algorithm consists of a sequence of learning episodes (i.e., iterations). At each learning episode, the agent chooses an action according to the information provided by a state s. The agent receives a reward based on its Q value and observes the next state,

Initialize Q values
Repeat t times (t = number of learning episodes)
    Select a random state s
    Repeat until the end of the learning episode
        Select an action a
        Receive an immediate reward r
        Observe the next state s'
        Update the Q table according to the update rule
        Set s ← s'

where
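The loop above is standard tabular Q-learning, whose update rule is Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)]. The paper's own implementation is in MATLAB; as an illustrative sketch only, the pseudocode can be written in Python against a hypothetical five-state chain environment (all names below are illustrative, not the paper's):

```python
import random

def q_learning(n_states, n_actions, step, episodes=200,
               alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning; `step(s, a) -> (next_state, reward, done)`."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = random.randrange(n_states)          # select a random state s
        done = False
        while not done:
            # epsilon-greedy action selection (ties broken at random)
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                best = max(Q[s])
                a = random.choice([i for i in range(n_actions) if Q[s][i] == best])
            s2, r, done = step(s, a)            # receive reward r, observe next state s'
            # update rule: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2                              # set s <- s'
    return Q

# Toy chain environment: 5 states; action 1 moves right, action 0 moves left;
# reward 1 for reaching the rightmost state, which ends the episode.
def chain_step(s, a):
    s2 = min(4, s + 1) if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

random.seed(0)                                  # reproducibility
Q = q_learning(5, 2, chain_step)
```

After training, the learned Q values favor the rightward action, i.e., the action leading to the rewarded state.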

This paper proposes a new and robust approach, called MORELA, to optimize any given mathematical function. MORELA differs from other RL-based approaches by generating a sub-environment. As a result, it is able to find the global optimum of a given mathematical function, because the optimum is sought around the best solution achieved so far with the assistance of both the sub-environment and the original environment. The rest of this paper is organized as follows. Section 2 defines the fundamental principles of MORELA. Section 3 presents the numerical experiments: a comparison of MORELA and RL, a robustness analysis, comparisons with other related methods, an explanation of MORELA's evolving strategy, and an investigation of the effect of high dimensionality. The last section concludes.

There are several studies related to RL combined with different heuristic methods for solving different types of optimization problems. Liu and Zeng [

From a different viewpoint, Derhami et al. [

Hybridizing RL algorithms with other optimization methods is a powerful technique for tackling different types of optimization problems arising in different fields. In the context of this paper, we therefore focus on the applicability of RL-based algorithms to finding the global optimum of a mathematical function. The proposed algorithm, MORELA, is based on Q-learning, a model-free RL approach. In addition, unlike other RL-based approaches, a sub-environment is generated in MORELA, so that the environment consists of the original environment plus the sub-environment, as shown in Equation (2) [

where m is the size of the original environment, n is the number of decision variables, and f is the fitness value at the t^{th} learning episode. As shown in Equation (2), the best solution found so far is stored in the (m + 1)^{th} row. At the t^{th} learning episode, a sub-environment is generated as given in Equation (3) and located between rows (m + 2) and (2m + 1). Thus, the global optimum is explored around the best solution with the assistance of the sub-environment.

After generating the sub-environment, the solution vectors in both environments are ranked from best to worst according to their fitness values. Using this ranking, the worst solution vector is excluded from the environment, while any solution vector providing a better function value is included. In this way, MORELA gains the ability to solve a given optimization problem without premature convergence.
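The rank-and-replace step just described can be sketched as follows. This is an illustration only, not the paper's MATLAB code: Equation (3) is not reproduced above, so the Gaussian perturbation and its scale `sigma` are assumptions.

```python
import random

def morela_step(env, fitness, sigma=0.1):
    """One sketch of MORELA's environment update: build a sub-environment of m
    solutions around the current best, then keep the best m of both environments."""
    m = len(env)
    best = min(env, key=fitness)                   # best solution found so far
    # Sub-environment: m perturbed copies of the best (Gaussian noise is an assumption)
    sub = [[x + random.gauss(0.0, sigma) for x in best] for _ in range(m)]
    # Rank all 2m solutions from best to worst and discard the worst half
    ranked = sorted(env + sub, key=fitness)
    return ranked[:m]

# Usage: minimize the 2-variable sphere function from a random environment of size 20.
random.seed(1)
sphere = lambda v: sum(x * x for x in v)
env = [[random.uniform(-5, 5) for _ in range(2)] for _ in range(20)]
for _ in range(200):
    env = morela_step(env, sphere)
best = min(env, key=sphere)
```

Because the current best always survives the ranking, the search is elitist: the best fitness can only improve from one learning episode to the next.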

In MORELA, each action a in state s is rewarded as shown in Equation (4) [

where the relevant quantities refer to the t^{th} learning episode. In MORELA, the reward value is determined for each member of the solution vector by considering its Q value and the best Q value obtained so far. Because of the structure of the reward function, the reward values approach 0 at the end of the solution process. In fact, a solution receives a smaller reward the closer it is to the global optimum. Conversely, giving larger rewards to more distant solutions may increase their probability of finding the global optimum. Thus, the reward function developed here may be regarded as a penalty rather than a reward.
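Equation (4) itself is not reproduced above. Purely as a hypothetical illustration of the behaviour described (rewards near 0 for members close to the best, larger for distant ones), one form consistent with that description is the gap between each member's Q value and the best Q value; the function name and formula below are assumptions, not the paper's actual equation.

```python
def penalty_reward(q_values, minimize=True):
    """Hypothetical reward in the spirit of MORELA's Eq. (4): each member is
    rewarded by its distance from the best Q value found so far, so members
    near the optimum receive rewards close to 0 (a penalty-like scheme)."""
    best = min(q_values) if minimize else max(q_values)
    return [abs(q - best) for q in q_values]

# Members with Q value 0.5 (the best) get reward 0; distant members get more.
rewards = penalty_reward([3.0, 0.5, 1.2, 0.5])
```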

An application of MORELA was carried out by solving several mathematical functions taken from the literature. Before solving these functions, however, it is worth demonstrating the effectiveness of MORELA over RL. For this purpose, a performance comparison was conducted on a single mathematical function. MORELA was coded in MATLAB for all test functions, using a computer with an Intel Core i7 2.70 GHz processor and 8 GB of RAM. The solution parameters for MORELA were set as follows: the environment size was taken as 20, the discounting parameter

The test function used to compare MORELA and RL is given in Equation (5). It has a global optimum solution of

As seen in

A robustness analysis for MORELA was carried out using the success ratio (SR) given in Equation (6).

where N_{s} is the number of successful runs, i.e., runs in which the algorithm produces the best solution at the required accuracy, and N_{T} is the total number of runs, which was set to 50 to make a fair comparison. For this experiment, a run is accepted as successful when its objective function value is within 3.5% of the global optimum. The robustness analysis for MORELA, PSACO [
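Equation (6) is simply the fraction of successful runs expressed as a percentage, SR = (N_s / N_T) × 100. A minimal sketch (the 3.5% tolerance follows the text; the handling of a zero-valued optimum is an assumption added for illustration):

```python
def success_ratio(run_values, optimum, tol=0.035):
    """SR = (N_s / N_T) * 100, where a run succeeds if its objective value is
    within `tol` (3.5% in the paper) of the global optimum."""
    denom = abs(optimum) if optimum != 0 else 1.0  # guard for a zero optimum (assumption)
    n_s = sum(1 for v in run_values if abs(v - optimum) <= tol * denom)
    return 100.0 * n_s / len(run_values)

# Five runs against a global optimum of 0: three fall within the tolerance.
sr = success_ratio([0.01, 0.2, 0.0, 0.03, 1.5], optimum=0.0)
```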

To gauge the performance of MORELA against the performance of some other methods described in the literature, sixteen well-known benchmark problems were used which are given in Appendix A. Functions 6, 7, 9, 13 and 16 are taken from Shelokar et al. [

To assess the ability of MORELA, its performance was compared with 12 algorithms listed in

The values of SR:

Function | MORELA | PSACO | CPSO | PSO | GA
---|---|---|---|---|---
F2 | 100 | 100 | 98 | 100 | 84
F7 | 100 | 100 | 100 | 98 | 98
F9 | 100 | 100 | 100 | 98 | 98
F12 | 98 | 100 | 90 | 96 | 16
F13 | 96 | 98 | 96 | 26 | 94
F16 | 100 | 100 | 100 | 94 | 92

Functions | Algorithm | Reference
---|---|---
F4-F7 | SZGA | Successive zooming genetic algorithm [
F1-F2-F3-F4 | IGARSET | Improving GA [
F7-F12-F13 | ACO | Ant colony optimization [
F4-F5-F11-F12-F13-F15-F16 | PSACO | Particle swarm and ant colony algorithm [
F5 | ECTS | Enhanced continuous tabu search [
F10 | ACORSES | Ant colony optimization [
F8 | SA | Simulated annealing [
F14 | RW-PSO+BOF | Random walking particle swarm optimization [
F5-F6-F7-F9-F11-F12-F13 | GA-PSO | Genetic algorithm particle swarm optimization [
F4-F7-F9 | GAWLS | Genetic algorithm [
F1-F3 | HAP | Hybrid ant particle optimization algorithm [
F1-F4-F10 | ACO-NPU | Ant colony optimization [
All problems | MORELA | Modified reinforcement learning algorithm (this study)

Function | Method | Best function value | Number of learning episodes^{*} | Best solution time (sec) | Success ratio | Average number of learning episodes^{*} | Average error
---|---|---|---|---|---|---|---
F1 | IGARSET | 0 | 2174 | 0.0568 | NA | 2375 | NA
F1 | ACO-NPU | 0 | 20,000 | 0.0590 | NA | NA | NA
F1 | HAP | 2.4893e−8 | 100 | NA | NA | NA | NA
F1 | MORELA | 0 | 68,760 | 0.8688 | 100 | 71,000 | 0
F2 | IGARSET | −2 | 2400 | 0.0614 | NA | 3111 | NA
F2 | MORELA | −2 | 34,920 | 1.1790 | 97 | 36,200 | 0
F3 | IGARSET | 2.08e−27 | 1821 | 0.0666 | NA | 2156 | NA
F3 | HAP | 2.56e−39 | 100 | NA | NA | NA | NA
F3 | MORELA | 9.71e−40 | 31,620 | 0.9584 | 98 | 31,740 | 9.33e−34
F4 | SZGA | 2.9e−8 | 4000 | NA | NA | NA | NA
F4 | IGARSET | 0 | 1004 | 0.0485 | NA | 1065 | NA
F4 | GAWLS | 0 | 2572 | NA | NA | NA | NA
F4 | PSACO | NA | NA | NA | 100 | 370 | 5.55e−17
F4 | ACO-NPU | 0 | 1000 | 0.0556 | NA | NA | NA
F4 | MORELA | 0 | 34,340 | 0.5660 | 99 | 35,700 | 1.11e−18
F5 | ECTS | NA | NA | NA | NA | 338 | 3e−08
F5 | PSACO | NA | NA | NA | 100 | 190 | 7.69e−29
F5 | GA-PSO | NA | NA | NA | 100 | 206 | 0.00004
F5 | MORELA | 0 | 36,955 | 21.1367 | 100 | 36,980 | 0
F6 | GA-PSO | NA | NA | NA | 100 | 8254 | 0.00009
F6 | MORELA | −1 | 40,680 | 0.6722 | 100 | 40,960 | 0
F7 | SZGA | 3 | 9000 | NA | NA | NA | NA
F7 | ACO | NA | NA | 0.11^{a} | NA | 264^{a} | NA
F7 | GAWLS | 3 | 2573 | NA | NA | NA | NA
F7 | GA-PSO | NA | NA | NA | 100 | 25,706 | 0.00012
F7 | MORELA | 3 | 33,920 | 0.5344 | 100 | 37,120 | 0
F8 | SA | −9.999994e−01 | 16,801 | NA | NA | NA | NA
F8 | MORELA | −1 | 35,640 | 0.5365 | 100 | 38,380 | 0
F9 | GAWLS | −186.7309 | 2568 | NA | NA | NA | NA
F9 | GA-PSO | NA | NA | NA | 100 | 96,211 | 0.00007
F9 | MORELA | −186.7309^{b} | 16,740 | 0.3516 | 100 | 17,460 | 0
F10 | ACORSES | −837.9658 | 1176 | 0.0690 | NA | NA | NA
F10 | ACO-NPU | −837.9658 | 750 | 0.0289 | NA | NA | NA
F10 | MORELA | −837.9658^{b} | 14,880 | 0.2722 | 100 | 17,500 | 0
F11 | PSACO | NA | NA | NA | 100 | 167 | 5.7061e−27
F11 | GA-PSO | NA | NA | NA | 100 | 95 | 0.00005
F11 | MORELA | 0 | 37,016 | 23.3041 | 100 | 37,856 | 0
F12 | PSACO | NA | NA | NA | 100 | 592 | 2.0755e−11
F12 | ACO | NA | NA | 0.74^{a} | NA | 528^{a} | NA
F12 | GA-PSO | NA | NA | NA | 100 | 2117 | 0.00020
F12 | MORELA | −3.8628^{b} | 13,840 | 0.4053 | 96 | 15,700 | 1.015e−13
F13 | PSACO | NA | NA | NA | 96 | 529 | 4.4789e−11
F13 | GA-PSO | NA | NA | NA | 100 | 12,568 | 0.00024
F13 | ACO | NA | NA | 4.10^{a} | NA | 1344^{a} | NA
F13 | MORELA | −3.32^{c} | 30,400 | 0.7804 | 96 | 32,200 | 2.3413e−16
F14 | RW-PSO+BOF | NA | NA | NA | NA | NA | 0^{d}
F14 | MORELA | 0 | 33,500 | 0.5254 | 100 | 34,440 | 0
F15 | PSACO | NA | NA | NA | 100 | 1081 | 6.23e−22
F15 | MORELA | 0 | 37,010 | 19.8856 | 100 | 38,896 | 0
F16 | PSACO | NA | NA | NA | 100 | 209 | 2.6185e−13
F16 | MORELA | 0.3979^{b} | 7680 | 0.1413 | 100 | 7900 | 0

NA: Not available; ^{a}The average number of function evaluations of four runs and running time in units of standard time. ^{b}The theoretical minimum value was considered to be four digits. ^{c}The theoretical minimum value was considered to be two digits. ^{d}Mean results of more than 30 independent trials.

The success ratio is determined based on the number of successful runs, in which the algorithm generates the best solution at the required accuracy. The average error is defined as the average difference between the best function value and the theoretical global optimum.

The findings indicate that MORELA showed remarkable performance on all of the test functions except F3, for which the theoretical global optimum, namely 0, could not be found at the required accuracy. Although MORELA was not able to solve this function exactly, it produced a better function value than those provided by the other compared algorithms, as shown in

In addition, we used the Bohachevsky function (F4) to better explain how the evolving population is diversified by the sub-environment in MORELA. F4 has a global optimum of 0 at (0, 0). The sub-environment is generated at the 2^{nd} learning episode using Equation (3), depending on the best solution found in the previous learning episode.

At an early learning episode, the solution points in the original environment are already fairly close to the global optimum, while the points in the sub-environment are still dispersed in the solution space, as can be seen in the corresponding figure. After the 250^{th} learning episode, the solution points in the original environment are very close to the global optimum, whereas those in the sub-environment continue to explore new solution points around it, although they also tend toward the global optimum. Finally, at the 1000^{th} learning episode, all solution points populated in the original and sub-environment have reached the global optimum, as seen in the corresponding figure.

The Ackley function given in Equation (7) was chosen to explore the effect of high dimensionality on the search capability of MORELA. The global minimum of this function is 0, attained at the origin.

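Equation (7) is not reproduced here, but the Ackley function is standard. A sketch assuming the usual parameterization (a = 20, b = 0.2, c = 2π), whose global minimum is 0 at the origin in any dimension:

```python
import math

def ackley(x, a=20.0, b=0.2, c=2 * math.pi):
    """Standard n-dimensional Ackley function; global minimum f(0, ..., 0) = 0."""
    n = len(x)
    s1 = sum(xi * xi for xi in x) / n               # mean of squares
    s2 = sum(math.cos(c * xi) for xi in x) / n      # mean of cosines
    return -a * math.exp(-b * math.sqrt(s1)) - math.exp(s2) + a + math.e

# The minimum value 0 is attained at the origin regardless of dimension,
# which is what makes the function convenient for scaling experiments.
val = ackley([0.0] * 1000)
```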

As

Although the average number of learning episodes increased notably as the dimension grew up to 1000, it increased only slightly between dimensions 1000 and 10,000. This experiment clearly demonstrates that the number of learning episodes required by MORELA is not appreciably affected by high dimensionality.

A powerful and robust algorithm called MORELA is proposed to find the global optimum of any given mathematical function. MORELA differs from RL-based approaches by generating a sub-environment. This makes it possible to find the global optimum, because it is sought around the best solution achieved so far with the assistance of both the sub-environment and the original environment.

The performance of MORELA was examined in several experiments: a comparison of MORELA and RL, a robustness analysis, comparisons with other methods, an explanation of MORELA's evolving strategy, and an investigation of the effect of high dimensionality. The comparison of MORELA and RL showed that MORELA requires far fewer learning episodes than RL to find the global optimum of a given function. The robustness analysis revealed that MORELA finds the global optimum with a high success ratio. MORELA was also tested on sixteen test functions that are difficult to optimize, and its performance was compared with that of other available methods. MORELA found the global optimum of all test functions except F3 at the required accuracy. Moreover, the last experiment clearly shows that MORELA is not significantly affected by high dimensionality.

Finally, all numerical experiments indicate that MORELA performed well in finding the global optimum of the mathematical functions considered, compared to other methods. Based on the results of this study, optimization methods based on RL are expected to show great potential for solving various optimization problems in future research.

Ozan, C., Baskan, O. and Haldenbilen, S. (2017) A Novel Approach Based on Reinforcement Learning for Finding Global Optimum. Open Journal of Optimization, 6, 65-84. https://doi.org/10.4236/ojop.2017.62006

F1: Rosenbrock (2 variables)

・ global minimum: f = 0 at (1, 1)

F2: (2 variables)

・ global minimum: f = −2

F3: (2 variables)

・ global minimum: f = 0

F4: Bohachevsky (2 variables)

・ global minimum: f = 0 at (0, 0)

F5: De Jong (3 variables)

・ global minimum: f = 0 at (0, 0, 0)

F6: Easom (2 variables)

・ global minimum: f = −1 at (π, π)

F7: Goldstein-Price (2 variables)

・ global minimum: f = 3 at (0, −1)

F8: Drop wave (2 variables)

・ global minimum: f = −1 at (0, 0)

F9: Shubert (2 variables)

・ 18 global minima with f = −186.7309

F10: Schwefel (2 variables)

・ global minimum: f = −837.9658 at (420.9687, 420.9687)

F11: Zakharov (2 variables)

・ global minimum: f = 0 at (0, 0)

F12: Hartman (3 variables)

・ global minimum: f = −3.8628

F13: Hartman (6 variables)

・ global minimum: f = −3.32

F14: Rastrigin (2 variables)

・ global minimum: f = 0 at (0, 0)

F15: Griewank (8 variables)

・ global minimum: f = 0 at (0, …, 0)

F16: Branin (2 variables)

・ three global minima with f = 0.3979: (−π, 12.275), (π, 2.275), (9.42478, 2.475)