Research on Dynamic Forecast of Flowering Period Based on Multivariable LSTM and Ensemble Learning Classification Task

The flowering forecast provides recommendations for orchard cleaning, pest control, field management and fertilization, which can help increase tree vigor and resistance. Flowering forecast is not only an important part of the construction of the agro-meteorological index system, but also an important part of the meteorological service system. In this paper, by analyzing local meteorological data and phenological data of “Red Fuji” apples in Ji County, Linfen City, Shanxi Province, with the help of machine learning and neural networks, we propose a method that combines time series forecasting and classification forecasting to build a dynamic forecasting model of the local flowering period in Ji County. We then evaluate the model in terms of the number of error days and the number of days of advance notice. The experiments show that the proposed multivariable LSTM network predicts meteorological factors well, with a model loss below 0.2. In the binary classification task of flowering judgment, the combination-strategy idea from ensemble learning improves the judgment: the AUC value rises from 0.81 for RF and 0.80 for AdaBoost as single models to 0.82 for the combined model. The proposed model has high applicability and accuracy for flowering forecasting and, at the same time, avoids the decimal-rounding problem that arises when flowering dates are predicted by regression.


Introduction
How to cite this paper: Chen, C., Zhang, X.W. and Tian, S. (2020) Research on Dynamic Forecast of Flowering Period Based on Multivariable LSTM and Ensemble Learning Classification Task. Agricultural Sciences.

Flowering forecast is an important part in the construction of the agro-meteorological index system. Meteorological conditions not only provide materials and energy for organisms, but also constitute the external environmental conditions for the growth and development of organisms, determining yield and quality. Climatic factors influence the phenology of many animal and plant species, rendering them susceptible to the effects of climate change [5]. During its growth and development each year, with the seasonal changes of weather and climate, a plant passes through sprouting, branching and leafing, flowering and fruiting, deciduous dormancy and other regularly changing phenological phenomena. The plant phenology model is based on the reaction mechanism of the plant growth and development process to climatic factors; it is a mathematical expression used to simulate the plant growth and development process.
According to the number of meteorological elements used, flowering forecast models can be divided into single-factor and multi-factor forecast models. Temperature is a common meteorological factor of flowering and can influence a variety of stages in floral development [6]. Alcalá and Barranco [7] determined the heat storage period based on 10-year phenology and temperature data, and used a heat accumulation threshold to predict flowering time. The heat storage period method is the same as the accumulated temperature method: when the accumulated temperature reaches a certain number of degree Celsius, the plant enters its initial flowering period. Temperature-based models include the developmental rate (DVR) model, the chill day model and the new chill day model. The developmental rate (DVR) has been described as a function of temperature, with the relationship between DVR and temperature considered linear or exponential [8] [9] [10]. Surgiura and Honjo [11] found that the DVR model accurately predicted flowering dates with a Root Mean Square Error (RMSE) of 1.23 days. Jong Ahn Chun et al. [12] developed prediction models of full-blooming dates for five peach cultivars at six major peach cultivation sites using the DVR model, the chill day model and the new chill day model. Jina Hur and Joong Bae Ahn [13] used dynamically downscaled daily temperatures to study the effect of global warming on the first-flowering date (FFD) of cherry, peach and pear in Northeast Asia. Adnane et al. [14] considered the successive and independent effects of temperature on dormancy and the induction of ecodormancy in flower buds of fruit trees, using a sequential phenology model to simulate flowering dates. Soil moisture is also used to predict the flowering period: Chauhan et al. [15] used the APSIM model to predict the effect of soil moisture on the flowering time of chickpea and wheat.
Schneemilch et al. [16] used multivariate logistic regression to determine which environmental variables influenced flowering timing. The resulting models described a large amount of the variation in flowering presence or absence, with R² ranging from 0.72 to 0.79. Cenci and Ceschia [17] used daily maximum and minimum temperatures, daily rainfall, relative humidity and daily global insolation to establish the flowering prediction model z = αx^β + γy.
Among them, z represents the flowering time; the equation expresses z as a power of the variable x plus a linear term in the variable y. Park et al. [18] used Elasticnet regression with five climatic parameters, including the quantity of winter and spring precipitation that fell as snow in a given year, the number of frost-free days in a given winter and spring, and the date of the beginning of the frost-free period, to establish a predictive species-specific model of angiosperm flowering date. For regression models, finding the best meteorological elements (golden features) is a key step in improving accuracy. The flowering prediction model is also a time series prediction model, which is a special regression problem. With the spread of artificial intelligence and big data, neural networks have also begun to be used for flowering period prediction. Elizondo et al. [19] used the back-propagation training algorithm of a BP neural network to predict soybean flowering and physiological maturity; the meteorological factors used include daily maximum and minimum temperature and photoperiod.
Since the prediction result of a regression model contains decimals, while the flowering date is an integer number of days, different decimal-rounding methods cause prediction errors. Moreover, because a regression model tries to predict as far in advance as possible, a long blank period arises between the forecast and the actual flowering. If a meteorological disaster such as low-temperature freezing damage occurs during this blank period, the flowering period will be delayed, reducing the accuracy of the forecast. This paper proposes a flowering period prediction model based on a multivariable LSTM (Long Short-Term Memory) network and an ensemble learning classifier to address both the long blank period and the decimal-rounding problem in the prediction process.

Experimental Data
In this study, meteorological and phenological data were provided by Ji County, Linfen City, Shanxi Province. The meteorological data include 12 meteorological elements, such as temperature (maximum, minimum, average), precipitation, sunshine duration, ground temperature (5 cm, 10 cm, 15 cm) and wind speed. To determine that a certain day is a flowering day, meteorological data and phenological data must be combined. In the classification task, more data dimensions generally help the model, so two derived features, the sum of air temperature (SAT) and the sum of ground temperature (SGT), are added. The formulas are:

SAT = MaxT + MinT + AT
SGT = GT_5cm + GT_10cm + GT_15cm

Among them, MaxT, MinT and AT represent the maximum temperature, minimum temperature and average temperature, and GT_5cm, GT_10cm and GT_15cm the ground temperatures at the three depths.
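The two derived features can be computed directly from the daily records. The sketch below assumes a pandas DataFrame with hypothetical column names (the original column labels are not given in the paper):

```python
import pandas as pd

# Hypothetical daily weather frame; values and column names are illustrative.
df = pd.DataFrame({
    "max_t": [12.0, 15.0], "min_t": [2.0, 4.0], "avg_t": [7.0, 9.5],
    "gt_5cm": [6.0, 7.0], "gt_10cm": [5.5, 6.5], "gt_15cm": [5.0, 6.0],
})

# Sum of air temperature (SAT) and sum of ground temperature (SGT),
# assuming a simple per-day sum of the three component readings.
df["SAT"] = df["max_t"] + df["min_t"] + df["avg_t"]
df["SGT"] = df["gt_5cm"] + df["gt_10cm"] + df["gt_15cm"]
```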

Meteorological Data Analysis
Data quality directly affects the performance indicators of the model. Statistical description helps to discover obvious data-quality problems (e.g., missing values, duplicate values, outliers), deepens the understanding of the relationships between variables, and provides useful information for subsequent data preprocessing and model selection. Descriptive statistics of the meteorological data are shown in Table 1, giving the count, distribution and quartiles of the data. From January 1, 2005 to December 31, 2019 there should be 5478 daily records; however, there are only 2039 precipitation records.
There are 5472 records each for the 5 cm and 10 cm ground temperatures, i.e. 6 records are missing. There are no missing values for the 15 cm ground temperature. The average wind speed has 5 missing values. The minimum of the maximum wind speed during this period was positive, so no abnormal values occurred.
Periodicity is a prerequisite for time series forecasting. Before the periodicity analysis, the missing data were filled as follows: 1) missing precipitation values were filled with zero, since a missing value indicates that no rainfall was recorded; 2) for meteorological elements other than precipitation, the year containing the missing value was identified and the missing value was inferred from that year's data; when the same year contained multiple missing values, they were filled with the yearly average. Figure 1 shows the periodicity of each meteorological element. In Table 1, count denotes the number of records, and mean, std, min, max, Q1, Q2 and Q3 denote the mean, standard deviation, minimum, maximum, and first, second and third quartiles of each meteorological factor.
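The two filling rules can be sketched as follows (a minimal illustration assuming a pandas DataFrame indexed by date; column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with a datetime index; values are illustrative only.
idx = pd.to_datetime(["2015-01-01", "2015-01-02", "2016-01-01", "2016-01-02"])
df = pd.DataFrame({
    "precip": [1.2, np.nan, np.nan, 0.4],
    "avg_wind": [2.0, np.nan, 3.0, 5.0],
}, index=idx)

# 1) A missing precipitation value means no rainfall was recorded: fill with zero.
df["precip"] = df["precip"].fillna(0.0)

# 2) Other elements: fill each gap with that year's mean for the element.
df["avg_wind"] = df.groupby(df.index.year)["avg_wind"].transform(
    lambda s: s.fillna(s.mean())
)
```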
Selection of predictive variables. As shown in Figure 1, the average temperature, maximum temperature, minimum temperature, sum of air temperature, 5 cm ground temperature, 10 cm ground temperature, 15 cm ground temperature and sum of ground temperature all show obvious periodicity. The upper envelope of the sunshine duration is also periodic, while the periodicity of the other meteorological elements is not obvious. Therefore, these nine meteorological elements (average temperature, maximum temperature, minimum temperature, the sum of air temperature, 5 cm ground temperature, 10 cm ground temperature, 15 cm ground temperature, the sum of ground temperature, and sunshine duration) with observable periodicity are selected as the predictive variables.

Multivariable LSTM Network
A neural network used to process serialized data is called a recurrent neural network; unfolded in time, it can be regarded as a deep neural network with an arbitrary number of layers [20]. In a traditional neural network, adjacent layers are fully connected while the nodes within a layer are not connected, and each layer processes its input hierarchically; in a recurrent neural network, by contrast, the recurrent connections let the network memorize past data. Training such networks may suffer from vanishing and exploding gradients. Long short-term memory (LSTM) networks introduce new computing nodes while keeping the original recurrent structure [21] [22] [23]: storage units replace conventional neurons, and each storage unit is composed of an input gate, an output gate, a forget gate and a cell state, as shown in Figure 2.
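To make the storage-unit computation concrete, here is a minimal NumPy sketch of a single LSTM step with the input, forget and output gates and the cell state written out explicitly (illustrative only; it is not the Keras-style network the paper actually trains):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One step of an LSTM storage unit. W, U, b hold the parameters of the
    four gates stacked as [input; forget; cell candidate; output]."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # pre-activations for all four gates
    i = sigmoid(z[0:n])                 # input gate
    f = sigmoid(z[n:2 * n])             # forget gate
    g = np.tanh(z[2 * n:3 * n])         # candidate cell update
    o = sigmoid(z[3 * n:4 * n])         # output gate
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

# Tiny example: 9 input features (the selected meteorological elements),
# hidden size 4, small random parameters for illustration only.
rng = np.random.default_rng(0)
n_in, n_hidden = 9, 4
W = rng.standard_normal((4 * n_hidden, n_in)) * 0.1
U = rng.standard_normal((4 * n_hidden, n_hidden)) * 0.1
b = np.zeros(4 * n_hidden)
h, c = lstm_cell_step(rng.standard_normal(n_in),
                      np.zeros(n_hidden), np.zeros(n_hidden), W, U, b)
```

Because the hidden state is squashed through tanh and gated by a sigmoid, its entries stay in (−1, 1), which is why inputs are normalized before training.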

Basic Binary Classifier
Ensemble learning refers to first constructing multiple weak learners and then combining them with a certain integration strategy to obtain a "strong learner" with better performance indicators [24]. Logistic regression, naive Bayes, support vector machine, random forest, Bagging, decision tree, AdaBoost and Extra Trees classifiers are used as weak learners.
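A screening loop over these eight weak learners might look as follows, assuming scikit-learn and a synthetic stand-in dataset (the paper's real features and labels are not reproduced here):

```python
# Hypothetical screening loop over the eight weak learners; the dataset
# is synthetic, standing in for the nine meteorological features.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

learners = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "ExtraTrees": ExtraTreesClassifier(random_state=0),
}
scores = {}
for name, clf in learners.items():
    clf.fit(X_tr, y_tr)                 # train on the 70% split
    pred = clf.predict(X_te)            # evaluate on the 30% split
    scores[name] = (accuracy_score(y_te, pred), f1_score(y_te, pred))
```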

Experimental Process
The machine learning method makes the results of the flowering prediction model dynamic. The proposed method combines multivariable LSTM prediction with combination-strategy binary classification prediction. In this way, it avoids the vacuum period caused by the early prediction of a regression model and also avoids the decimal-rounding problem in the prediction process. The model mainly involves three aspects, namely data processing, the multivariable LSTM and binary classification ensemble learning models, and model evaluation, as shown in Figure 3. The steps of the flowering forecast are as follows:
Step 1: Data set partition
1) Forecast data set. For the LSTM, the input data must be sequential. The nine meteorological element series from January 1, 2005 to December 31, 2016 were used as the training set, and those from January 1, 2017 to December 31, 2019 as the test set.
2) Classification data set. The classification model requires input x (the nine meteorological elements) and output y (1 or −1). The data from March 25 to April 30 of each year from 2010 to 2019 were extracted and randomly divided into training and test sets at a ratio of 7:3.
Step 2: Multivariable LSTM network
1) Data normalization. The LSTM network is particularly sensitive to the scale of its inputs, so the Min-Max normalization method is used. Through a linear transformation, the data are mapped into the range [0, 1], turning dimensional data into dimensionless pure values and ensuring comparability. The formula is:

x' = (x − x_min) / (x_max − x_min)

Among them, x is the observed value, and x_min and x_max are the minimum and maximum of the series.
2) Window method. The window method uses multiple recent time steps to predict the next time steps: the data of the 90 days before time t are used to predict the data of the following 7 days.
3) Model building. Define the model by creating a Sequential model and adding configured layers; a Sequential model is a linear stack of network layers. The network configuration is shown in Table 2.

4) Model validation
Take the observable meteorological elements from January 1, 2019 to April 17, 2019. After normalization, a windowing process that predicts the next 3 days from the preceding 90 days is performed, generating a total of 18 windows. The first window uses the 90 days of data from January 1 to March 31 to predict the data of April 1, April 2 and April 3. The predictions for the next three days of the 18 windows are extracted from the multivariable LSTM model and divided, according to whether they are one, two or three days ahead, into Dataset 1, Dataset 2 and Dataset 3. Table 3 gives the details of the three data sets.
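The normalization and windowing steps above can be sketched as follows (a minimal illustration with a synthetic series; the 90-day input window is as in the paper, with the 7-day horizon of Step 2):

```python
import numpy as np

def min_max(x):
    """Min-Max normalization to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def make_windows(series, n_in=90, n_out=7):
    """Slide a window over the series: the previous n_in steps form the
    input, the following n_out steps form the target."""
    X, Y = [], []
    for t in range(n_in, len(series) - n_out + 1):
        X.append(series[t - n_in:t])
        Y.append(series[t:t + n_out])
    return np.array(X), np.array(Y)

# Synthetic stand-in for one meteorological element over 200 days.
series = min_max(np.arange(200, dtype=float))
X, Y = make_windows(series, n_in=90, n_out=7)
```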

Step 3: Binary classification ensemble model
1) Data standardization
Data standardization removes the unit restrictions of the data, converting them into dimensionless pure values and ensuring comparability. The formula is:

x' = (x − µ) / σ

Among them, x is the observed value, µ is the overall mean, and σ is the overall standard deviation.

Table 3. Information about the data set.

3) Combination strategy
The binary classifier combination strategy lets each classifier solve the same original task and combines the results of the individual models through a specific strategy to obtain a better global model. The arithmetic-average combination strategy of ensemble learning is used: only when all classification learners judge 1 does the ensemble judge 1; when any learner judges otherwise, the ensemble judges −1. The formula is:

pred = 1, if (1/n) Σᵢ predᵢ = 1; otherwise pred = −1

Among them, predᵢ represents the predicted value of the i-th learner and n is the number of learners; since each predᵢ ∈ {1, −1}, the average equals 1 only when all learners output 1.
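This all-agree arithmetic-average rule can be sketched as follows, assuming each learner outputs 1 or −1 per day:

```python
import numpy as np

def combine(preds):
    """Arithmetic-average combination: each learner outputs 1 or -1 per day.
    The per-day average equals 1 only when every learner votes 1, so the
    ensemble outputs 1 iff all learners agree on flowering."""
    avg = np.mean(preds, axis=0)
    return np.where(avg == 1, 1, -1)

# Made-up daily votes from two learners (e.g. RF and AdaBoost).
rf_pred = np.array([1, 1, -1, 1])
ada_pred = np.array([1, -1, -1, 1])
combined = combine([rf_pred, ada_pred])  # → [1, -1, -1, 1]
```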
Step 4: Judging the initial flowering period. When the result value 1 appears for the first time, the corresponding date is taken as the initial flowering date.
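Step 4 can be sketched as a simple scan over the combined daily labels (dates and labels below are made up for illustration):

```python
# Scan daily combined predictions and take the first date whose label
# is 1 as the initial flowering date.
dates = ["2019-04-01", "2019-04-02", "2019-04-03", "2019-04-04"]
labels = [-1, -1, 1, 1]

first_flowering = next(
    (d for d, y in zip(dates, labels) if y == 1), None
)
# → "2019-04-03"
```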

Multivariable LSTM Prediction Effect
The multivariable LSTM model was trained and evaluated with MSE as the loss function. The loss of the model is shown in Figure 5: after 50 iterations, the loss has stabilized and converged.

Binary Classifier Selection
F1 score, accuracy and recall are used as evaluation indicators to screen the weak learners. As shown in Table 4, the accuracy of every classifier is greater than 70% on both the training set and the test set.
Among them, logistic regression, naive Bayes and support vector machine are more accurate on the test set than on the training set. In general, performance on the training set should be at least as good as on the test set, so these three learners are excluded. The accuracy of the remaining five classifiers is higher than 95% on the training set and higher than 75% on the test set. The RF and AdaBoost classifiers have F1 scores greater than 0.75, and their recall of 0.76 is the largest among the remaining five classifiers. ROC curves were drawn for the remaining five learners; the results are shown in Figure 8. The AUC values rank, in descending order, RF, AdaBoost, Bagging, Extra Trees and Decision Tree. The ROC curve plots the True Positive Rate (TPR) on the vertical axis against the False Positive Rate (FPR) on the horizontal axis; the Area Under the ROC Curve (AUC) is the area under this curve and is used to measure the stability of the model.
Combining Table 4 and Figure 8, RF and AdaBoost are selected as the weak learners for the binary classification ensemble combination strategy. As shown in Figure 9, the AUC value of the model using the ensemble combination strategy is 0.82, greater than the 0.81 of RF and the 0.80 of AdaBoost.

Forecast Result
Dataset 1, Dataset 2 and Dataset 3 are fed into the ensemble learning model, and the date corresponding to the first occurrence of 1 is taken as the initial flowering date. The results are shown in Table 5. The model is evaluated using the number of error days (actual value − predicted value) and the number of days of advance notice.

Discussion
Crop phenology is highly dictated by weather variables such as radiation, precipitation and temperature [25] [26]. Thus, the accuracy of predicted weather inputs is critical for crop simulation models [27]. Selecting meteorological factors with obvious periodicity helps to improve the prediction accuracy of the LSTM model. Because each LSTM cell contains an input gate, a forget gate and an output gate to judge whether information is useful, the network is particularly suitable for time series prediction. Meanwhile, the advantage of the multivariable LSTM network is that it predicts multiple meteorological factors simultaneously. In this study, the LSTM network simultaneously predicts the maximum temperature, minimum temperature, average temperature, 5 cm ground temperature, 10 cm ground temperature, 15 cm ground temperature, sunshine duration, the sum of air temperature (SAT) and the sum of ground temperature (SGT).
For the binary classification task of ensemble learning, different classifiers are first screened to find those that neither underfit nor overfit. Underfitting is usually due to insufficient learning ability of the learner, and overfitting to excessive learning ability; both harm the generalization ability of the model. Secondly, the judgment of the flowering period (1 = flowering, −1 = non-flowering) is completed with the combination strategy.
The advantages of multiple classifiers are combined to enhance the classification effect. In addition, the parameters of the selected classifier can be further optimized.
In this paper, a machine learning technique that combines time series prediction (a special form of regression prediction) with classification prediction to complete the flowering forecast is proposed. By analyzing the quality and periodicity of the data, seven feature variables with no missing values and obvious periodicity were extracted, and the two features SAT and SGT were added.
Secondly, the weather data and phenology data are combined to divide the data into forecast data sets and classification data sets. Then the prediction results of the multivariable LSTM network are passed into the trained combined strategy binary classification learner to complete the prediction of flowering. Finally, the date corresponding to the first occurrence of the classification label 1 is the initial flowering period. The model solves the problem of decimal rounding in the regression prediction process, realizes the dynamic prediction of the flowering period, and the model error is within the range of one day.
In addition, several problems need to be solved to further improve the accuracy of the model. First, the RMSE can be reduced by improving the LSTM network. Secondly, the classification learners can be further tuned. Finally, the scope of the data can be increased. The LSTM network treats time series prediction as a special regression problem, and a neural network can complete not only regression prediction but also classification prediction; our future work will focus on using a single network to complete the flowering forecast.

Conclusion
This research combines neural networks with ensemble learning, and proposes a method to dynamically predict whether each of the next three days will be a flowering date, effectively solving the decimal-rounding and long blank period problems of regression prediction. This method utilizes the long-term