The flowering forecast provides recommendations for orchard cleaning, pest control, field management and fertilization, which can help increase tree vigor and resistance. It is an important part of both the agro-meteorological index system and the meteorological service system. In this paper, we analyze local meteorological data and phenological data of “Red Fuji” apples in Ji County, Linfen City, Shanxi Province and, with the help of machine learning and neural networks, propose a method that combines time series forecasting with classification forecasting to build a dynamic forecasting model of local flowering. We then evaluate the model by the number of error days and the number of days in advance. The results show that the proposed multivariable LSTM network predicts the meteorological factors well, with a model loss below 0.2. In the binary classification task of flowering judgment, the combination strategy from ensemble learning improves the result: the AUC rises from 0.81 and 0.80 for the single RF and AdaBoost models to 0.82. The proposed model has high applicability and accuracy for flowering forecast, and it also avoids the decimal rounding problem that arises when flowering dates are predicted by regression.

Flowering forecast is an important part of the construction of the agro-meteorological index system and the meteorological service system. Phenology describes seasonal changes as well as the response and adaptation of the ecosystem to changes in the external environment [

According to the number of meteorological elements, flowering forecast models can be divided into single-factor and multi-factor forecast models. Temperature is a common meteorological factor in flowering and can influence several stages of floral development [

Schneemilch et al. [^{2} ranging from 0.72 to 0.79. Cenci and Ceschia [

Because the output of a regression model contains decimals while the flowering date is an integer number of days, different rounding methods introduce prediction errors. In addition, a regression model that forecasts as early as possible leaves a long blank period between the forecast and the flowering date. If a meteorological disaster such as low-temperature freezing damage occurs during this blank period, flowering is delayed and the forecast loses accuracy. This paper proposes a flowering period prediction model based on a multivariable LSTM (Long Short-Term Memory) network and an ensemble learning classifier to address both the long blank period and the decimal rounding problem.

In this study, meteorological and phenological data were provided by Ji County, Linfen City, Shanxi Province. The meteorological data cover 12 meteorological elements from 2005 to 2019, including temperature (maximum, minimum, average), precipitation, sunshine hours, ground temperature (5 cm, 10 cm, 15 cm) and humidity. The phenological data are records of the local “Red Fuji” apple in Ji County from 2010 to 2019.

Determining whether a given day is a flowering day requires combining meteorological and phenological data. Because additional informative features can help a classifier, two derived features are added: the sum of air temperature (SAT) and the sum of ground temperature (SGT). The formulas are:

SAT = MaxT + MinT + AT (1)

where MaxT, MinT and AT are the maximum, minimum and average air temperature, respectively.

SGT = GT5 + GT10 + GT15 (2)

where GT5, GT10 and GT15 are the ground temperatures at depths of 5 cm, 10 cm and 15 cm, respectively.
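
As a minimal sketch of Eqs. (1) and (2), the two derived features can be computed per daily record as follows. The dictionary keys and the sample values are illustrative assumptions, not the authors' data layout.

```python
def add_sum_features(record):
    """Return a copy of one daily record with SAT and SGT added (Eqs. 1 and 2)."""
    out = dict(record)
    # Eq. (1): sum of maximum, minimum and average air temperature.
    out["SAT"] = record["MaxT"] + record["MinT"] + record["AT"]
    # Eq. (2): sum of the 5 cm, 10 cm and 15 cm ground temperatures.
    out["SGT"] = record["GT5"] + record["GT10"] + record["GT15"]
    return out


# Illustrative values only, not taken from the Ji County data.
day = {"MaxT": 20.0, "MinT": 6.0, "AT": 13.0,
       "GT5": 14.0, "GT10": 13.5, "GT15": 13.0}
features = add_sum_features(day)
```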

The meteorological data from March 25 to April 30 of each year from 2010 to 2019 are extracted and labeled according to the phenological data of the corresponding year. The flowering observation for each date is converted to a binary label (1 = flowering, −1 = non-flowering). Over this decade there are 370 samples, 187 with label 1 and 183 with label −1, so the positive and negative classes are approximately balanced.
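
The sample count can be checked with a short sketch. The labeling rule used here (label 1 on and after the observed first flowering date, −1 before it) is an assumption made for illustration, since the text does not spell out the rule; the 2019 date is the actual value reported in the results of this paper.

```python
from datetime import date, timedelta


def season_dates(year):
    """All dates from March 25 to April 30 (inclusive) of the given year."""
    start, end = date(year, 3, 25), date(year, 4, 30)
    return [start + timedelta(days=k) for k in range((end - start).days + 1)]


def label_season(year, first_flowering):
    """Assumed rule: 1 on or after the first flowering date, otherwise -1."""
    return [(d, 1 if d >= first_flowering else -1) for d in season_dates(year)]


# 37 days per season over the 10 seasons 2010-2019.
n_samples = sum(len(season_dates(y)) for y in range(2010, 2020))

labels_2019 = label_season(2019, date(2019, 4, 8))
n_positive_2019 = sum(1 for _, y in labels_2019 if y == 1)
```

Each season contributes 37 days, so ten seasons give the 370 samples stated above.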

Data quality directly affects the performance indicators of the model. Statistical description helps to discover some obvious problems related to data quality (e.g., missing values, duplicate values, outliers), deepen the understanding of the relationship between data and variables, and provide useful information for subsequent data preprocessing and model selection. Descriptive statistical analysis of meteorological data is shown in

Periodicity is a prerequisite for time series forecasting. Before the periodicity analysis, missing data were filled as follows: 1) Missing precipitation values are filled with zero, in line with the reason the precipitation records are missing. 2) For meteorological elements other than precipitation, the year containing the missing value is identified and the value is inferred from that year's data. When the same year contains several missing values, they are filled by averaging.
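
The two filling rules can be sketched as below. Reading "inferred from the annual data" as "replaced by the mean of that year's observed values" is an assumption; the function names and the `(year, value)` layout are hypothetical.

```python
from statistics import mean


def fill_precipitation(values):
    """Rule 1: a missing precipitation value (None) becomes 0."""
    return [0.0 if v is None else v for v in values]


def fill_with_annual_mean(records):
    """Rule 2 (assumed reading): fill None with the mean of the same year.

    records is a list of (year, value_or_None) pairs. A year with no
    observed value at all would raise KeyError and needs separate handling.
    """
    by_year = {}
    for year, value in records:
        if value is not None:
            by_year.setdefault(year, []).append(value)
    annual_mean = {year: mean(vals) for year, vals in by_year.items()}
    return [(year, annual_mean[year] if value is None else value)
            for year, value in records]


# Illustrative values only.
rain = fill_precipitation([None, 3.8, None])
wind = fill_with_annual_mean([(2010, 10.0), (2010, None),
                              (2010, 14.0), (2011, 3.0)])
```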

Variable | Count | Mean | Std | Min | Max | Q1 | Q2 | Q3 |
---|---|---|---|---|---|---|---|---|
Average Temperature | 5478 | 11.05 | 10.45 | −14.40 | 30.60 | 1.80 | 12.40 | 20.20 |
Maximum Temperature | 5478 | 18.24 | 10.68 | 9.70 | 39.70 | 9.50 | 19.70 | 27.40 |
Minimum Temperature | 5478 | 5.58 | 10.40 | −19.80 | 24.60 | −3.20 | 6.60 | 14.80 |
Precipitation | 2039 | 3.86 | 8.70 | 0 | 111.90 | 0 | 0.40 | 3.80 |
5 cm Ground Temperature | 5472 | 12.75 | 10.67 | −10.10 | 33.20 | 2.10 | 13.80 | 22.30 |
10 cm Ground Temperature | 5472 | 12.78 | 10.38 | −8.70 | 32.40 | 2.20 | 13.90 | 22.10 |
15 cm Ground Temperature | 5472 | 12.79 | 10.13 | −7.60 | 32.10 | 2.30 | 13.90 | 22.00 |
Average Humidity | 5478 | 58.21 | 18.78 | 8.00 | 99.00 | 44.00 | 59.00 | 73.00 |
Minimum Humidity | 5478 | 31.92 | 19.41 | 3.00 | 99.00 | 16.25 | 27.00 | 44.00 |
Sunshine Hours | 5478 | 5.94 | 3.95 | 0 | 13.40 | 2.00 | 7.00 | 9.00 |
Average Wind Speed | 5473 | 1.78 | 0.63 | 0.10 | 5.30 | 1.40 | 1.80 | 2.20 |
Maximum Wind Speed | 5478 | 4.81 | 1.33 | 1.40 | 12.10 | 3.90 | 4.70 | 5.60 |

Here, Count is the number of observations, and Mean, Std, Min, Max, Q1, Q2 and Q3 are the mean, standard deviation, minimum, maximum, and first, second and third quartiles of each meteorological factor.

Selection of predictive variables. As shown in

A neural network designed to process sequential data is called a recurrent neural network (RNN). When unfolded in time, it can be regarded as a deep neural network with infinitely many layers [

Ensemble learning first constructs multiple weak learners and then combines them through an integration strategy to obtain a “strong learner” with better performance [

The machine learning method makes the predictions of the flowering model dynamic. The proposed method combines multivariable LSTM prediction with binary classification under a combination strategy. In this way, it removes the blank period caused by the early forecast of a regression model and avoids the decimal rounding problem in the prediction process. The approach comprises three parts: data processing, the multivariable LSTM and binary classification ensemble learning models, and model evaluation, as shown in

The steps of flowering forecast are as follows:

Step 1: Data set partition

1) Forecast data set

For the LSTM, the input must be sequential data. Data for the nine meteorological elements from January 1, 2005 to December 31, 2016 were used as the training set, and data from January 1, 2017 to December 31, 2019 as the test set.

2) Classification data set

For the classification model, an input x (data for the nine meteorological elements) and an output y (1 or −1) are required. The data from March 25 to April 30 of each year from 2010 to 2019 are extracted and randomly divided into a training set and a test set at a ratio of 7:3.
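
The two partitions in Step 1 differ in kind: the forecast data are split chronologically, the classification data randomly. A minimal sketch follows; the function names, the fixed seed and the row layout are assumptions made for illustration.

```python
import random
from datetime import date


def chronological_split(rows, cutoff):
    """rows: list of (date, features). Train before cutoff, test from cutoff on."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test


def random_split(samples, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed and cut at the requested fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Forecast data: everything up to 2016-12-31 trains, 2017 onward tests.
rows = [(date(2016, 12, 31), "a"), (date(2017, 1, 1), "b")]
train_fc, test_fc = chronological_split(rows, date(2017, 1, 1))

# Classification data: 370 labeled days split 7:3.
train_cls, test_cls = random_split(range(370))
```

Keeping the forecast split chronological means the LSTM is never evaluated on days that precede its training data.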

Step 2: Multi-LSTM network

1) Data Normalization

The LSTM network is particularly sensitive to the scale of its input values, so the data are processed with Min-Max normalization, a common normalization method. A linear transformation maps the data into the range [0, 1], converting dimensional data into dimensionless values and ensuring comparability between variables. The formula is:

$$x^{*} = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{3}$$

where x is the observed value, and min(x) and max(x) are the minimum and maximum values of x.
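
A minimal sketch of Eq. (3) and its inverse is given below. Fitting the minimum and maximum separately from applying them is a design choice assumed here, so that the same bounds can be reused on new data; the example bounds are the average-temperature extremes from the descriptive statistics.

```python
def min_max_transform(values, lo, hi):
    """Eq. (3): map values linearly so that lo -> 0 and hi -> 1."""
    span = hi - lo  # a constant column (span == 0) must be handled separately
    return [(v - lo) / span for v in values]


def min_max_inverse(scaled, lo, hi):
    """Undo Eq. (3) to bring network outputs back to physical units."""
    return [s * (hi - lo) + lo for s in scaled]


lo, hi = -14.4, 30.6
scaled = min_max_transform([-14.4, 8.1, 30.6], lo, hi)
restored = min_max_inverse(scaled, lo, hi)
```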

2) Window method

The window method uses several recent time steps to predict the next ones. Here, the data of the 90 days before time t are used to predict the data of the 7 days after t.
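
The window method can be sketched as a sliding slice over the daily series. The function name and the use of plain lists are assumptions; in practice each element would be a vector of the nine normalized meteorological elements.

```python
def make_windows(series, n_in=90, n_out=7):
    """Pair every run of n_in consecutive days with the n_out days that follow."""
    pairs = []
    for start in range(len(series) - n_in - n_out + 1):
        inputs = series[start:start + n_in]
        targets = series[start + n_in:start + n_in + n_out]
        pairs.append((inputs, targets))
    return pairs


# A 100-day toy series yields 100 - 90 - 7 + 1 = 4 input/target pairs.
pairs = make_windows(list(range(100)))
```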

3) Model building

Define the model. A sequential model is created and layers are added to it. A sequential model is a linear stack of network layers, that is, a single path from input to output without branches. The layers used are the LSTM, RepeatVector, Dropout, TimeDistributed and Dense layers. The activation function of the LSTM layer is ReLU.

Compile the model. This step selects the loss function and the optimizer. The model is compiled with Adam as the optimizer and MSE as the loss function. The parameters of the model are shown in

4) Model validation

The meteorological elements observed from January 1, 2019 to April 17, 2019 are used for validation. After normalization, the data are windowed so that each 90-day window predicts the following 3 days, which yields 18 windows. The first window uses the 90 days from January 1 to March 31 to predict April 1, April 2 and April 3. The 3-day forecasts of the 18 windows produced by the multivariable LSTM model are then grouped by forecast horizon (first, second and third day ahead) into Dataset 1, Dataset 2 and Dataset 3. As shown in
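
The window count and the date ranges of the three datasets follow from calendar arithmetic alone, which the sketch below reproduces. Only dates are handled here, not the LSTM forecasts themselves; the function name is hypothetical.

```python
from datetime import date, timedelta


def forecast_windows(first_day, last_day, n_in=90, n_out=3):
    """Return (window_start, window_end, forecast_dates) for each n_in-day window."""
    n_days = (last_day - first_day).days + 1
    windows = []
    for k in range(n_days - n_in + 1):
        start = first_day + timedelta(days=k)
        end = start + timedelta(days=n_in - 1)
        targets = [end + timedelta(days=h) for h in range(1, n_out + 1)]
        windows.append((start, end, targets))
    return windows


windows = forecast_windows(date(2019, 1, 1), date(2019, 4, 17))

# Dataset h collects the h-th forecast day of every window.
datasets = {h: [w[2][h - 1] for w in windows] for h in (1, 2, 3)}
```

January 1 to April 17 spans 107 days, so 107 − 90 + 1 = 18 windows, and Dataset 1 runs from April 1 to April 18.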

Step 3: Binary classification

1) Data standardization

Data standardization removes the units of the data and converts them into dimensionless values, ensuring comparability between variables. The formula is:

$$x^{*} = \frac{x - \mu}{\sigma} \tag{4}$$

where x is the observed value, μ is the overall mean, and σ is the overall standard deviation.
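
Eq. (4) can be sketched with the standard library. Reading the "overall" standard deviation as the population standard deviation (`statistics.pstdev`) is an assumption; the sample values are illustrative.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Eq. (4): subtract the mean and divide by the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # sigma == 0 (constant column) must be handled separately
    return [(v - mu) / sigma for v in values]


# This toy column has mean 5 and population standard deviation 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```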

Layer | Type | Output Shape | Params |
---|---|---|---|
lstm_1 | LSTM | (None, 64) | 18,944 |
repeat_vector_1 | RepeatVector | (None, 7, 64) | 0 |
dropout_1 | Dropout | (None, 7, 64) | 0 |
lstm_2 | LSTM | (None, 7, 32) | 12,416 |
dropout_2 | Dropout | (None, 7, 32) | 0 |
time_distributed_1 | TimeDistributed | (None, 3, 9) | 297 |
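
The parameter counts in the table above can be reproduced by hand. The sketch assumes the standard LSTM formulation with four gates and bias terms, nine input features, and a 9-unit dense layer inside the time-distributed wrapper; these assumptions are inferred from the counts, not stated in the text.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input/output pair plus one bias per output unit."""
    return input_dim * units + units


counts = {
    "lstm_1": lstm_params(9, 64),             # 9 features in, 64 units
    "lstm_2": lstm_params(64, 32),            # 64 inputs from lstm_1, 32 units
    "time_distributed_1": dense_params(32, 9) # 32 inputs, 9 outputs per time step
}
```

The RepeatVector and Dropout layers have no trainable weights, which matches their zero entries in the table.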

Dataset | Starting time | Termination time | Length |
---|---|---|---|
Dataset 1 | 2019-04-01 | 2019-04-18 | 18 |
Dataset 2 | 2019-04-02 | 2019-04-19 | 18 |
Dataset 3 | 2019-04-03 | 2019-04-20 | 18 |

2) Basic classifier

Eight classification learners are considered as candidate weak learners: logistic regression, naive Bayes, support vector machine, random forest, bagging, decision tree, AdaBoost and extra trees. The learners used in the ensemble are selected from these.

3) Combination strategy

In the combination strategy, every classifier solves the same original task, and the results of the individual models are combined through a specific rule to obtain a better global model. Using the arithmetic average combination strategy from ensemble learning, the model outputs 1 only when all classification learners output 1; if any learner outputs a value other than 1, the model outputs −1. The formula is:

$$\mathrm{result} = \begin{cases} 1, & \dfrac{1}{n}\sum_{i=1}^{n} \mathrm{pred}_{i} = 1 \\ -1, & \text{otherwise} \end{cases} \tag{5}$$

where pred_i is the predicted value of the i-th learner and n is the number of learners.
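
The combination rule described above can be sketched in a few lines. The function name is hypothetical; the inputs are the per-learner labels in {1, −1} for a single day.

```python
def combine(predictions):
    """Return 1 only if the arithmetic average of the labels equals 1, else -1."""
    average = sum(predictions) / len(predictions)
    # With labels in {1, -1}, the average is 1 exactly when every learner says 1.
    return 1 if average == 1 else -1


both_agree = combine([1, 1])
one_dissents = combine([1, -1])
```

Because a single dissenting learner forces the output to −1, the rule favors a later, more certain flowering call over an early one.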

Step 4: Judging the initial flowering period

The date on which the result value is 1 for the first time is the initial flowering period.

Step 5: Model evaluation

The number of error days (actual value − predicted value) and the number of days in advance are used as indicators to evaluate model performance.
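
Steps 4 and 5 can be sketched together. The sequence of daily results below is hypothetical and chosen only to illustrate the rule; the actual date is the 2019 value reported in this paper.

```python
from datetime import date, timedelta


def first_flowering(dates, results):
    """Step 4: the first date whose combined result is 1, or None if there is none."""
    for d, r in zip(dates, results):
        if r == 1:
            return d
    return None


def error_days(actual, predicted):
    """Step 5: actual value minus predicted value, in days."""
    return (actual - predicted).days


# Hypothetical daily results for April 1-18: six days of -1, then 1 onward.
dates = [date(2019, 4, 1) + timedelta(days=k) for k in range(18)]
results = [-1] * 6 + [1] * 12

predicted = first_flowering(dates, results)
error = error_days(date(2019, 4, 8), predicted)
```

A positive error means the forecast date falls before the observed date.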

The input and output of the multivariable LSTM prediction model and binary classification model are shown in

The multivariable LSTM model was trained and evaluated with MSE as the loss function. The loss of the model is shown in

The F1 score, accuracy and recall are used as indicators to screen the weak learners. As shown in

Classifier | Accuracy (train data) | Accuracy (test data) | F1 score | Recall score |
---|---|---|---|---|
Logistic Regression | 0.78 | 0.79 | 0.79 | 0.78 |
Naive Bayes | 0.73 | 0.77 | 0.78 | 0.80 |
Support Vector Machine (SVM) | 0.79 | 0.81 | 0.80 | 0.75 |
Random Forest (RF) | 0.98 | 0.81 | 0.80 | 0.76 |
Bagging | 0.98 | 0.77 | 0.74 | 0.65 |
Decision Tree | 1.00 | 0.76 | 0.72 | 0.62 |
AdaBoost | 0.96 | 0.80 | 0.79 | 0.76 |
Extra Trees | 1.00 | 0.77 | 0.75 | 0.67 |

set. Therefore, when selecting a learner, Logistic Regression, Naive Bayes and Support Vector Machine are excluded first, because their accuracy on the test set is higher than on the training set. The remaining five classifiers all have a training accuracy above 95% and a test accuracy above 75%. Among them, only the RF and AdaBoost classifiers have an F1 score greater than 0.75, and their recall score of 0.76 is the largest among the remaining five classifiers.
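
The screening rule can be replayed on the scores from the table above. The sketch only encodes the two criteria stated in the text (exclude learners whose test accuracy exceeds their training accuracy, then keep those with an F1 score above 0.75); the function and variable names are hypothetical.

```python
# name: (train accuracy, test accuracy, F1 score, recall), copied from the table.
scores = {
    "Logistic Regression": (0.78, 0.79, 0.79, 0.78),
    "Naive Bayes": (0.73, 0.77, 0.78, 0.80),
    "Support Vector Machine": (0.79, 0.81, 0.80, 0.75),
    "Random Forest": (0.98, 0.81, 0.80, 0.76),
    "Bagging": (0.98, 0.77, 0.74, 0.65),
    "Decision Tree": (1.00, 0.76, 0.72, 0.62),
    "AdaBoost": (0.96, 0.80, 0.79, 0.76),
    "Extra Trees": (1.00, 0.77, 0.75, 0.67),
}


def screen(scores, f1_threshold=0.75):
    """Drop learners with test accuracy above train accuracy, then filter by F1."""
    kept = {name: s for name, s in scores.items() if s[1] <= s[0]}
    selected = sorted(name for name, s in kept.items() if s[2] > f1_threshold)
    return kept, selected


kept, selected = screen(scores)
best_recall = max(s[3] for s in kept.values())
```

Extra Trees sits exactly at an F1 score of 0.75 and is therefore not selected under the strict inequality.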

ROC curves were drawn and AUC values computed for the remaining five learners. The results are shown in

Combining

Dataset 1, Dataset 2 and Dataset 3 are passed to the ensemble learning model. The date on which the label 1 first occurs is the initial flowering period. The results are shown in

Dataset | Predicted value | Actual value | Days of error | Days in advance |
---|---|---|---|---|
Dataset 1 | 2019-04-07 | 2019-04-08 | +1 | 1 |
Dataset 2 | 2019-04-07 | 2019-04-08 | +1 | 2 |
Dataset 3 | 2019-04-08 | 2019-04-08 | 0 | 3 |

Crop phenology is highly dictated by weather variables such as radiation, precipitation and temperature [

In the binary classification task of ensemble learning, the candidate classifiers are first screened to find those that neither underfit nor overfit. Underfitting usually stems from insufficient learning capacity, and overfitting from excessive learning capacity; both harm the generalization ability of the model. Second, the flowering judgment (1 = flowering, −1 = non-flowering) is completed with the combination strategy, which combines the advantages of multiple classifiers to improve the classification result. In addition, the parameters of the selected classifiers can be further optimized.

In this paper, we propose a machine learning technique that combines time series prediction (a special case of regression prediction) with classification prediction to forecast flowering. First, by analyzing the quality and periodicity of the data, seven feature variables with no missing values and clear periodicity were extracted, and the two features SAT and SGT were added. Second, the weather and phenology data were combined and divided into forecast data sets and classification data sets. Then the predictions of the multivariable LSTM network were passed to the trained combination-strategy binary classifier to predict flowering. Finally, the date on which the classification label 1 first occurs is taken as the initial flowering period. The model avoids the decimal rounding problem of regression prediction, realizes dynamic prediction of the flowering period, and has an error within one day.

To further improve the accuracy of the model, several problems remain to be solved. First, the LSTM network should be improved to further reduce the RMSE. Second, the classification learners should be tuned further. Finally, the scope of the data should be enlarged. The LSTM network treats time series prediction as a special regression problem, and neural networks can perform classification as well as regression. Our future work will therefore focus on using a single network to complete the flowering forecast.

This research combines neural networks with ensemble learning and proposes a method that dynamically predicts whether each of the next three days will be a flowering date, effectively addressing the decimal rounding and long blank period problems of regression prediction. The method draws on the long-term memory of the LSTM network and the classification ability of Random Forest (RF) and AdaBoost. The loss of the multivariable LSTM model is below 0.2, and the RMSE is below 0.3. The AUC of the combined classification model based on RF and AdaBoost is 0.82. In short, the error of the prediction model is within 1 day.

This research was funded by the Chengdu Science and Technology Bureau Fund (2018-YF05-01217-SN). We would like to thank the Ji County Meteorological Bureau for the data provided.

The authors declare no conflicts of interest regarding the publication of this paper.

Chen, C., Zhang, X.W. and Tian, S. (2020) Research on Dynamic Forecast of Flowering Period Based on Multivariable LSTM and Ensemble Learning Classification Task. Agricultural Sciences, 11, 777-792. https://doi.org/10.4236/as.2020.119050