Inference of General Mass Action-Based State Equations for Oscillatory Biochemical Reaction Systems Using k -Step Genetic Programming

Systems biology requires the development of algorithms that use omics data to infer interaction networks among biomolecules working within an organism. One major type of evolutionary algorithm, genetic programming (GP), is useful for its high heuristic ability as a search method for obtaining suitable solutions expressed as tree structures. However, because GP determines the values of parameters such as coefficients by random values, it is difficult to apply in the inference of state equations that describe oscillatory biochemical reaction systems with high nonlinearity. Accordingly, in this study, we pro-pose a new GP procedure called “k-step GP” intended for inferring the state equations of oscillatory biochemical reaction systems. The k-step GP procedure consists of two algorithms: 1) Parameter optimization using the modified Powell method—after genetic operations such as crossover and mutation, the values of parameters such as coefficients are optimized by applying the modified Powell method with secondary convergence. 2) GP using divided learning data—to improve the inference efficiency, imposes perturbations through the addition of learning data at various intervals and adaptations to these changes result in state equations with higher fitness. We are confident that k-step GP is an algorithm that is particularly well suited to inferring state equations for oscillatory biochemical reaction systems and contributes to solving inverse problems in systems biology.


Introduction
In recent years, the development of experimental technologies has enabled researchers to obtain various types of omics data.Consequently, functional analyses of complicated, large-scale metabolic pathways have been and will continue to be conducted more and more routinely.This trend is based on the conclusion that even when all of the individual biochemical reactions are analyzed, it is not always possible to predict the function of all the interaction networks among the biomolecules working as a complete system.We refer to this system as a "biochemical reaction system".Henceforth, systems biology [1] [2], which aims to understand biochemical reaction systems, will increasingly occupy an important position.
One major aim of systems biology is to comprehensively analyze omics data and to elucidate biochemical reaction systems.Biochemical reaction systems can be expressed as state equations, most of which are represented by simultaneous differential equations of interrelated biomolecules.In conventional research, the structure of state equations have been fixed and only inferring of their parameter values has been performed [3]- [14].But in the case where the structure of state equations is unknown because of uncertain information on the interaction networks among the biomolecules, it is necessary to infer the structure from only experimentally observed time-series data.In other words, state equations (i.e., both structures and parameter values) of biochemical reaction systems must be inferred from empirical results, a process which can be referred to as an inverse problem.To overcome this inverse problem encountered in systems biology, researchers predict and scrutinize unknown biochemical reaction systems from experimentally observed time-series data under various conditions.Following this, they formulate the state equations that are expressed by using simultaneous ordinary differential equations based on the general mass action (GMA) law.This method requires a great deal of trial and error, empirical guesswork, and labor.Thus, it would be invaluable to establish a method to infer state equations for biochemical reaction systems using computational science.When inferring state equations capable of reproducing experimentally observed time-series data, multiple state equations may reproduce similar dynamic behaviors.Thus, these potential state equations should be narrowed down by biologically verifying their validity.To infer undiscovered biochemical reaction systems, it is therefore essential to develop an algorithm with high heuristic ability to obtain multiple likely state equations in a short time.Moreover, obtaining various state equations can also lead to the discovery of new biochemical reaction systems that have not yet been considered.
Various solution methods of inverse problems in systems biology have been reported [3]- [22].Notably, genetic programming (GP) [23] is an evolutionary algorithm that has high heuristic ability as a search method for obtaining suitable solutions expressed as tree structures.GP is widely used in systems biology; for example, Miyahara and Kuboyama [24] reported the determination of glycan motifs by using GP.GP can also be adapted to the inference of state equations expressed as tree structures.Importantly, it is unnecessary to fix the structure of the state equations in advance.In other words, GP requires information about neither the biomolecules that constitute various system components nor their reaction mechanisms.Consequently, GP can obtain various state equations that are able to reproduce experimentally observed time-series data [25].In GP, since the values of parameters such as coefficients are set to random values, they do not necessarily take optimal values.On the other hand, biochemical reaction systems are highly nonlinear, and the values of these parameters exert strong influences on the dynamic behavior of these systems (i.e., the predicted time-series data).Thus, it is essential to develop an algorithm that can optimize the values of these parameters after inferring the structure of state equations using GP.Ando et al. [26], Sugimoto et al. [27] and Iba [28] reported a method for inferring the structures of state equations by using GP and optimizing the values of the parameters with the least squares method.However, the authors showed only examples in which monotonically increasing or monotonically decreasing biochemical reaction systems were inferred.Since highly stable biochemical reaction systems often show oscillatory dynamic behaviors, it is essential the development of an algorithm that can infer the state equations of oscillatory biochemical reaction systems.
Accordingly, in this study, we propose a new GP methodology, called "k-step GP", for inferring state equations of oscillatory biochemical reaction systems.
The k-step GP procedure consists of two algorithms: 1) Parameter optimization using the modified Powell method-after genetic operations such as crossover and mutation, the values of parameters such as coefficients are optimized by applying the modified Powell method with secondary convergence.2) GP using divided learning data-to improve the inference efficiency, imposes perturbations through the addition of learning data at various intervals and adaptations to these changes result in state equations with higher fitness.

Data Structure for Genetic Programming
State equations describing biochemical reaction systems often use simultaneous ordinary differential equations based on GMA.State equations based on GMA can be written as where n is the number of the system components (i.e., biomolecules), i X is the state variable for the system components, ik α and ik β are reaction rate con- stants, p and q are the numbers of generation and decomposition terms, respectively, and ijk g and ijk h are the interaction coefficients.The state equations based on GMA can describe the biochemical reaction systems in detail, and these state equations can also be obtained by intuitive understanding.
Based on Equation (1), it is possible to infer state equations that can reproduce experimentally observed time-series data.Each individual in the GP algorithm is a set of simultaneous ordinary differential equations that is equal in number to the number of state variables.The genotype of each individual is the set of simultaneous ordinary differential equations represented by the tree structure, and the phenotype of each individual is the time-series data calculated using these equations.We have defined the terminal and nonterminal symbols, respectively, in the tree structure data for expressing Equation (1) as follows: , .
n T aX aX aX Thus, all terminal symbols contain the coefficient a for expressing tree structure data based on Equation (1).Furthermore, nonterminal symbols need not contain a minus symbol because the coefficient a is contained in all terminal symbols and the optimum value of each coefficient a (including a negative value) is calculated using the modified Powell method [29].After generating new individuals (i.e., offspring) by genetic operations such as crossover and mutation, the tree structure data are optimized by the following rules. If a term n aX is added to the same term n aX (where both terms have the same n), then merge the two terms into one term (i.e., ).For example, the tree structure data of the state equations shown in Equation (3) below are rendered in Figure 1.

Modified Powell Method
After optimizing the tree structure data described in the previous section, the modified Powell method is applied to optimize (decide) the value of the coefficient of the term n aX .The modified Powell method is well known to have an ultimate fast convergence among various direct search methods without the calculation of the derivative of the objective function.

Fitness
For the purpose of evaluating each individual, we have calculated the average value of the relative error between the time-series data obtained by calculating state equations expressed for each individual and the experimentally observed time-series data using the equation , where L is the number of experimentally observed time-series data sets, M is the number of state variables describing system components, and N is the number of observation points for each experimentally observed time-series dataset.
The fitness f of each individual is calculated using the equation where δ is the penalty coefficient and m is the number of terms contained in each individual.The introduction of the penalty coefficient is based on the minimum description length (MDL) principle [30] [31], which is commonly used in GP.The MDL principle can also be used to evaluate the length of the description of the model itself.By including the penalty coefficient in the calculation of the fitness, the simpler state equations have higher fitness.The value of δ needs to be determined according the number of state variables.In this study, we have empirically determined the value of δ .

Genetic Programming Using Divided Learning Data
GP is a type of algorithm that mimics the evolutionary process of biological systems.Environmental changes (i.e., perturbations) exert strong influences on the evolution of such biological systems.When a stable biological species A is perturbed, it adapts to the perturbation and becomes biological species A', which has newly acquired functions.In this way, biological systems evolve through repeated adaptations to perturbations.In inferring state equations by GP, the learning data can be considered the environment to which adaptation occurs.In standard GP, all learning data are only given at the beginning, and subsequent changes in the data environment are not taken into consideration.In k-step GP, the learning data are divided and gradually change in accordance with the progress of the inference process.Accordingly, these changing data become the perturbations to GP individuals, which then collectively adapt to the perturbations and evolve, thus leading to the inference of state equations that can reproduce the learning data.In k-step GP, individuals evolve into better individuals by repeating the changing learning data as perturbations and by adapting to them.Consequently, k-step GP is expected to achieve improved inference efficiency.

The k-Step GP Procedure
The k-step GP procedure is shown below, and a flowchart of this procedure is illustrated in Figure 2.
1) Before processing, the learning data are divided.Length of learning dataset Step k End Genetic operations and modified Powell method 1 starts from the initial values to the first dividing point; length of learning dataset 2 is starts from the initial value to the second dividing point; that is, length of learning dataset k starts from the initial value to the k-th dividing point.Learning dataset k ultimately comprises all sections of the learning data.
2) The state equations that reproduce learning dataset 1 are inferred using the genetic operations and modified Powell method, yielding a population of these state equations as population 1 (Step 1).The completion conditions for Step 1 are as follows. An individual with sufficiently high fitness (over 95%) is obtained (Success). The average fitness of the whole population (all individuals) becomes 80% or more (Success). The maximum number of generations is reached (Failure).
3) Based on population 1, state equations are inferred that reproduce learning dataset 2 by using the genetic operations and modified Powell method, yielding a population of these state equations as population 2 (Step 2).The completion conditions for Step 2 are the same as those above in 2.
4) In the same way, based on population k-1, state equations are inferred that reproduce learning dataset k by using the genetic operations and modified Powell method, yielding populations of these state equations as population k (Step k).The individual with the highest fitness in population k represents the inferred state equations.

Experiments and Results
We have examined the inference efficiency of k-step GP compared with standard GP.The comparison was conducted based on the inference of state equations expressing two-and three-variable oscillatory biochemical reaction systems.When there only one time-series dataset used as the learning data, there may be cases where a monotonically increase or a monotonically decrease has a high fitness against the oscillation with small amplitude.By initializing the GP with multiple time-series datasets with different initial values, such circumstances can be avoided, thus improving the inference efficiency.In contrast, when the number of time-series datasets is increased, the constraint conditions increase, ultimately making it impossible to obtain a variety of state equations.Thus, we adopt the use of two time-series datasets in these experiments.Moreover, to investigate the influence of the number of partitions in k-step GP on inference efficiency, we examined k values of 1, 2, and 5 divisions.For 1 k = , the learning data is undivided (i.e., parameter values are optimized by the modified Powell method only).We refer to this case as "GP + MP".We conducted 100 trials with the standard GP and k-step GP procedures and scrutinized the state equations of the obtained oscillatory biochemical reaction systems and their associated time-series data.

Two-Variable Oscillatory Biochemical Reaction System
We used time-series data in which state variables 1 X and 2 X were obtained by using Equation ( 6) to generate pseudo-experimentally observed time-series data for a two-variable oscillatory biochemical reaction system.This equation is known as the Lotka-Volterra model [32].
Figure 3 shows the time-series data generated using Equation (6).In Figure 3, the initial values of the state variables 1 X and 2 X are given as follows: 1) ( ) We inferred the state equations that can reproduce Figure 3 by using standard GP and k-step GP.The parameters related to GP are shown in Table 1.The learning data for standard GP and GP + MP are shown in Figure 3.For 2 k = , the learning dataset is divided into two sections at time 5 t = , and from the initial value to time 5 t = is learning dataset 1, while that from the initial value to time 10 t = is learning dataset 2 (i.e., all learning data).Learning dataset 1 is shown in Figure 4, while learning dataset 2 is the same as the data shown in Figure 3.         3; these equations differ from Equation ( 6).Table 2. Numbers of successfully obtained state equations for two-variable oscillatory biochemical reaction systems (out of 100 trials).
Standard GP 0 Table 3. Examples of state equations for obtained two-variable oscillatory biochemical reaction systems.

State equations Fitness
95.3% 95.0% Figure 9 and Figure 10 show the time-series data for Equation (7) and Equation (8), respectively.As shown in Figure 9 and Figure 10, it was possible to obtain biochemical reaction systems that express steady-state oscillation.These results demonstrate that this approach may lead to the discovery of new biochemical reaction systems that have not yet been considered.
As shown in Table 2, while the standard GP had no success, the GP + MP approach had 73 successful results out of 100 trials.This demonstrates that the optimization of parameter values plays an important role in the inference of biochemical reaction systems involving high nonlinearity.
The average values of the elapsed time for the inference (i.e., computation time using a Xeon E5-1620V4 3.5 GHz CPU with 16 GB of memory) are shown in Table 4.The standard GP utilized less computation time, but failed to obtain solutions.To obtain solutions with standard GP, it is expected that many individuals must be prepared, which will require substantial computation time.In k-step GP, because the values of parameters such as coefficients are optimized by  Table 4. Average values of elapsed computation time for inference (in sec per 100 trials using a Xeon E5-1620V4 3.5 GHz CPU with 16 GB of memory).
Standard GP 131 the modified Powell method, there is no need to prepare a great number of individuals, which contributes to a reduction in the computation time.Although there was no obvious influence of increasing the number of learning datasets on the inference success rate, this is likely because the target biochemical reaction systems were considered on a relatively small scale.In the following section, we examine cases in which the scale of biochemical reaction systems to be inferred was increased.

Three-Variable Oscillatory Biochemical Reaction System
We used time-series data in which state variables 1 X , 2 X , and 3 X were obtained by using Equation ( 9) to generate pseudo-experimentally observed time-series data for a three-variable oscillatory biochemical reaction system.
Figure 11 shows the time-series data based on calculations with Equation ( 9); the initial values of state variables 1 X , 2 X , and 3 X are as follows: 1) ( ) We have inferred state equations that can reproduce Figure 11 using standard GP and k-step GP.The parameters related to GP are shown in Table 5.The learning data for standard GP and GP + MP are shown in Figure 11.
For 2 k = , the learning dataset is divided in the same way as the two-variable oscillatory biochemical reaction system.Figure 12 shows learning dataset 1, while Figure 11 shows all the data, which is equivalent to learning dataset 2. For 5 k = , the learning dataset is divided in the same way as the dataset for the two-variable oscillatory biochemical reaction system.Figures 13-16 show learning datasets 1, 2, 3, and 4, respectively, while Figure 11 shows all the data, which is the same as learning dataset 5.
We have been able to obtain many oscillatory biochemical reaction systems using k-step GP.The numbers of successfully obtained equations (out of 100 Figure 11.Time-series data based on the three-variable oscillatory biochemical reaction system calculated by Equation ( 9).      6.The examples of state equations obtained as three-variable oscillatory biochemical reaction systems are shown in Table 7; these equations differ from Equation (9).
to obtain biochemical reaction systems that express steady-state oscillation.
These results demonstrate that this approach may lead to the discovery of new biochemical reaction systems that have not yet been considered.Table 6 shows that increasing the number of divisions of the learning data improves the inference success rate.The numbers of successfully obtained state equations in GP + MP is reduced compared with Table 2 (the two-variable oscillatory biochemical reaction system).This is considered the number of parameters to be optimized has increased.In k-step GP, individuals that have high fitness are gathered during the early stage (Step 1), where the quantity of the learning data is relatively small, and that the individuals evolve while inheriting the characteristics.The

Conclusions
In progress of systems biology, it is essential to develop an algorithm with high heuristic ability for efficiently inferring multiple likely state equations of oscillatory biochemical reaction systems.In conventional research, fixed the structure of state equations, only the inference of included parameter values has been performed.In this study, we obtained various structures of the state equations for two-and three-variable oscillatory biochemical reaction systems with high nonlinearity from only the experimentally observed time-series data by using k-step GP.In particular, we showed that parameter optimization is indispensable for inferring the state equations of oscillatory biochemical reaction systems with high nonlinearity.Moreover, in k-step GP, the learning data are divided and are gradually changed in accordance with the progress of the inference process.This change essentially becomes a series of perturbations to the individuals in the GP procedure.As evolution occurs, the tree structure data of individuals reproduce the learning data by adapting to the perturbations.Through repetition of these perturbations and adaptations, individuals are expected to yield offspring with progressively higher fitness, thus improving inference efficiency.
In the inverse problem, since the correct solution is unknown, it is important to propose a variety of solutions, verify and scrutinize from there and narrow down the solution.Consequently, it is thought that this approach will ultimately lead to the discovery of new biochemical reaction systems that may not yet have been considered.The dynamic behavior of stable biochemical reaction systems can be described as monotonically increasing, monotonically decreasing, steady-state oscillation, or damped oscillation.We have shown that k-step GP can infer state equations for oscillatory biochemical reaction systems.Thus, k-step GP can be applied and contributed to solving the inverse problem of inferring state equations (i.e., both structures and parameter values) of biochemical reaction systems from systems biology data.We are confident that k-step GP is an algorithm that is particularly well suited to inferring state equations for oscillatory biochemical

Future Works
One problem that remains to be overcome in k-step GP methodology is its high computation time requirements.In particular, the inference of the three-variable oscillatory biochemical reaction system required an average value of computation time of 6720 seconds when the division number was 5 (using a Xeon E5-1620V4 3.5 GHz CPU with 16 GB of memory).We are planning to parallelize k-step GP using GPGPU (General Purpose computing on Graphics Processing Units) which has been attracting attention in recent years and to improve its computation time.

n
aX is multiplied by a term m aX , then the value of coefficient a of the multiplied term is replaced with 1 (i.e.,
For 5 k = , the learning dataset is divided into five sections according to times 2, 4, 6,8,10 t = ; from the initial value to time 2 t = is learning dataset 1, from the initial value to time 4 t = is learning dataset 2, and so on, from the initial value to time 10 t = is learning dataset 5 (i.e., all learning data).Learning datasets 1, 2, 3, and 4 are shown in Figures 5-8, respectively, while learning dataset 5 is the same as the data shown in Figure 3.

Figure 3 .
Figure 3. Time-series data for a two-variable oscillatory biochemical reaction system as calculated by Equation (6).

Figure 4 .
Figure 4. Learning dataset 1 for 2 k = in k-step GP for a two-variable oscillatory bio-

Figure 5 .
Figure 5. Learning dataset 1 for 5 k = in k-step GP for a two-variable oscillatory bio-

Figure 6 .
Figure 6.Learning dataset 2 for 5 k = in k-step GP for a two-variable oscillatory bio-

Figure 7 .
Figure 7. Learning dataset 3 for 5 k = in k-step GP for a two-variable oscillatory bio-

Figure 8 .
Figure 8. Learning dataset 4 for 5 k = in k-step GP for a two-variable oscillatory bio-

Figure 12 .
Figure 12.Learning dataset 1 for 2 k = in k-step GP for a three-variable oscillatory bi-

Figure 13 .
Figure 13.Learning dataset 1 for 5 k = in k-step GP for a three-variable oscillatory bi-

Figure 14 .
Figure 14.Learning dataset 2 for 5 k = in k-step GP for a three-variable oscillatory bi-

Figure 15 .
Figure 15.Learning dataset 3 for 5 k = in k-step GP for a three-variable oscillatory bi-

Figure 16 .
Figure 16.Learning dataset 4 for 5 k = in k-step GP for a three-variable oscillatory bi-

Table 6 .
Numbers of successfully obtained state equations for three-variable oscillatory biochemical reaction systems (out of 100 trials).

Table 7 .
Example state equations for obtained three-variable oscillatory biochemical reaction systems.

Table 8 .
Average values of elapsed computation time for inference (in sec per 100 trials using a Xeon E5-1620V4 3.5 GHz CPU with 16 GB of memory).