Tourism Traffic Demand Prediction Using Google Trends Based on EEMD-DBN

Accurate prediction of tourism traffic demand plays an important role in making effective policies for tourism administration. It helps to allocate resources reasonably and to avoid tourism congestion. This paper accounts for noise interference and proposes a hybrid model that combines ensemble empirical mode decomposition (EEMD), a deep belief network (DBN) and Google Trends for tourism traffic demand prediction. The model first applies a dislocation weighted synthesis method to combine Google Trends series into a composite search index, and then denoises the series with EEMD, which extracts the high-frequency noise from the original series. The low-frequency series of the composite search index is then used to forecast the low-frequency tourism traffic series. Taking inbound tourism in Shanghai as an example, this paper trains the model and predicts tourist arrivals for the next 12 months. The results show that the forecast error of the EEMD-DBN model is markedly lower than that of the baseline ARIMA, GM(1,1), FTS, SVM, CES and DBN models. This indicates that noise processing is necessary and that the EEMD-DBN forecast model can improve prediction accuracy.


How to cite this paper: Xiao, Y., Tian, X.T., Liu, J.J., Cao, G.H. and Dong, Q.X. (2020) Tourism Traffic Demand Prediction Using Google Trends Based on EEMD-DBN.

Introduction
According to data released by the China Tourism Research Institute, the growth of inbound tourist volumes in China is relatively slow. That is to say, for a long time the development of China's inbound tourism has been basically stagnant, which is inconsistent with the booming domestic and outbound tourism markets. Tourism demand forecasting can provide a timely basis for relevant departments to formulate effective tourism policies [1]. Modern information technology has brought great convenience to people's life, work and study. People usually turn to search engines when planning trips, and their travel plans often draw on keyword search results as a reference [2].
In recent years, research using network search data to build tourism demand forecasting models has been fruitful. Traditional econometric models and machine learning methods are limited by historical data when forecasting tourism demand, whereas network search data are immediate and subjective and can reflect the needs of tourists more accurately. In fact, the predictive power of web search data was confirmed as early as 2009; for example, the application of Google Trends in many fields is considered effective [3].
Tourism is a highly interconnected industry that is strongly affected by emergencies, and resolving the impact of emergencies on the tourism industry has been difficult for all sectors of society. These problems indicate that the spatial allocation of tourism resources is crucial to the healthy development of tourism [1]. An instant and efficient tourism demand forecasting model makes it possible to balance tourist volumes across different attractions. In this way, the tourism industry will become more orderly and standardized, creating a pleasant atmosphere for China's inbound tourism.
In the field of tourism demand forecasting, research methods vary with the conditions and objects of the forecast. Inbound tourists are more purposeful. Because they travel long distances and stay longer than domestic tourists, they are more likely to plan ahead, and they more often rely on online search to develop a travel schedule. However, a prediction model built simply on search data is not robust, and combining artificial intelligence with search data can greatly improve prediction accuracy [4] [5]. This paper uses EEMD to decompose the historical tourist volume sequence of Shanghai inbound tourism and the Google keyword search data separately, in order to eliminate the adverse effects of noise interference on the prediction results. Finally, the DBN, which converges better, is used to predict tourist volumes with the synthetic search index, which ensures the real-time validity of the prediction model.

Tourism Demand Forecast
Tourism demand forecasting has produced very rich research results over the past ten years [6]. Whether particular methods or combinations of methods predict better has always been the main direction of scholars' exploration [7]. The methods currently used in this field can be roughly divided into traditional time series models, artificial intelligence prediction methods [8] [9] and hybrid methods [10]. Among time series models, the most widely used are ARIMA models [11] [12], exponential smoothing models [13] and linear regression [14]. Among them, ARMA shows diverse prediction performance under different conditions; that is, it can be adjusted to different research conditions to achieve better prediction results [15]. The exponential smoothing model has also gradually evolved from primary to quadratic to cubic exponential smoothing to obtain more accurate predictions. A comparison of several exponential smoothing models can be seen in a study of tourism demand forecasting for several of Australia's major source countries [12]. In fact, the research of many scholars shows that no single method has an absolute advantage. In general, we think that a combined model is more accurate than a single model [4] [5]. The application of artificial intelligence methods in tourism demand forecasting has risen over the past 30 years. A back propagation neural network model can be applied to tourism demand forecasting [16]. For Hong Kong, one of the international tourist cities, many methods have been used to predict inbound and outbound tourism demand, such as rough set theory [17]. Grey models are also widely used in tourism demand forecasting, including research on air passenger traffic [18] [19]. Genetic algorithms [20] have been developed from artificial neural networks [21].
The two major source countries of the Balearic Islands, the UK and Germany, send visitors every month, and some scholars have used genetic algorithms to study these flows, which shows that genetic algorithms are also feasible in tourism demand forecasting [22]. Support vector machines can better handle practical problems such as small samples, nonlinearity, high dimensionality and local minima. Adding network search data greatly improves the accuracy of the prediction model, as demonstrated in a study predicting passenger flow to Barbados [23].

Network Search Forecast
Since network search data were used to successfully predict an epidemic [24], search data have been used to predict phenomena in many fields such as economics and the social sciences. For example, scholars have used Google search data to predict unemployment, housing prices and stocks [25] [26] [27]. Web search data have made a valuable contribution to research in various fields. With the rapid development of information technology and people's increasing reliance on online query tools, we also hope to use Google search data to further study future consumer behavior [28].
The application of network search data is increasingly extensive, which provides useful inspiration for research on the tourism industry [28] [29]. However, because language and cultural background differ between regions and countries, the search intensity of multi-language source markets and the different leading search engine platforms also affect the results of tourism demand forecasting [30]. A nonlinear auto-regressive method combined with keyword search data shows good predictive performance in predicting Malaysia's passenger volumes [28]. Since then, Google search data have also been used in a tourism demand forecasting study of the Caribbean region [31]. These studies indicate the effectiveness of web search data in forecasting tourism demand.
In China, the Baidu search engine has numerous users [32], and most scholars' research depends on the Baidu search index. Worldwide, however, the Google search engine dominates. This article takes Shanghai's inbound tourism demand forecast as an example and targets tourists from around the world, so it uses data from the Google search engine.

Principle of EEMD (Ensemble Empirical Mode Decomposition)
EEMD is an improved algorithm based on EMD (Empirical Mode Decomposition) [33], which effectively addresses EMD's reliance on local extrema of the data. EMD decomposes the original sequence according to the characteristics of the data itself and needs no prior conditions, so it is widely used in noise processing and prediction. However, when the signal is not stable or contains anomalous events, EMD cannot show its superiority [34]. When the signal is disturbed by an abnormal event (such as pulse interference), mode mixing occurs. EEMD compensates for this shortcoming of EMD and effectively resolves the mode mixing phenomenon. EEMD can make the decomposition scale more uniform, suppress the influence of abnormal events on the signal, and make the prediction more accurate.
The basic steps of EEMD are as follows:
Step 1: Find the local extrema of the sequence $P(t)$ using EMD. The local maxima constitute the upper envelope $m(t)$ and the local minima constitute the lower envelope $n(t)$. The pointwise mean of the upper and lower envelopes is $z(t) = [m(t) + n(t)]/2$; for an IMF, this mean is zero at every point.
Step 2: Subtract the envelope mean from the sequence to get $R(t) = P(t) - z(t)$. Verify that $R(t)$ satisfies the IMF condition. If not, repeat steps 1 and 2 until $R(t)$ satisfies the IMF condition, and treat $R(t)$ as an IMF separated from $P(t)$. Through repeated sifting in this way, a finite number of $IMF_i$ components and the remainders $u(t)$ and $Y(t)$ are separated one by one, from high frequency to low frequency.
Step 3: Add random white noise to the sequence $P(t)$ to equalize the abnormal events, so that the abnormal event mode is mixed into the random white noise mode during the EMD decomposition process, and then normalize. The signal with the added white noise is then decomposed with EMD to obtain the $IMF_i$ components.
Step 4: Obtain the integrated $IMF_i$ after decomposition (adding a new random nor...
Step 5: Preset a threshold k. If the integrated value from step 4 is less than k, the component is removed as noise. If the integrated value from step 4 is greater than k, the $IMF_i$ is reset, where Q is an entropy function.
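The sifting procedure in steps 1 and 2 can be illustrated with a short sketch. This is not the authors' implementation: it uses piecewise-linear envelopes instead of the cubic splines that EMD implementations normally use, a fixed number of sifting passes as the stopping rule, and hypothetical function names (`local_extrema`, `envelope`, `sift`, `emd`). The test signal is synthetic.

```python
import math


def local_extrema(x):
    """Return the indices of interior local maxima and minima of x."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] <= x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through the extrema in idx.

    Assumption: the envelope is held flat between each end of the
    series and the nearest extremum (real EMD uses cubic splines).
    """
    n = len(x)
    knots = [(0, x[idx[0]])] + [(i, x[i]) for i in idx] + [(n - 1, x[idx[-1]])]
    env, k = [], 0
    for t in range(n):
        while knots[k + 1][0] < t:
            k += 1
        (t0, v0), (t1, v1) = knots[k], knots[k + 1]
        env.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return env


def sift(x, passes=10):
    """Steps 1-2: repeatedly subtract the envelope mean z(t) from the series."""
    r = list(x)
    for _ in range(passes):
        maxima, minima = local_extrema(r)
        if len(maxima) < 2 or len(minima) < 2:
            break
        upper = envelope(r, maxima)            # m(t)
        lower = envelope(r, minima)            # n(t)
        r = [ri - (u + l) / 2.0 for ri, u, l in zip(r, upper, lower)]
    return r


def emd(x, max_imfs=6):
    """Peel off IMFs from high to low frequency; return (imfs, residue)."""
    residue, imfs = list(x), []
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residue)
        if len(maxima) < 2 or len(minima) < 2:
            break
        imf = sift(residue)
        imfs.append(imf)
        residue = [r - c for r, c in zip(residue, imf)]
    return imfs, residue


# Synthetic series: slow cycle + fast cycle + linear trend (illustrative only).
signal = [math.sin(2 * math.pi * t / 12) + 0.8 * math.sin(2 * math.pi * t / 3) + 0.01 * t
          for t in range(120)]
imfs, residue = emd(signal)
print("number of IMFs extracted:", len(imfs))
```

By construction the IMFs and the residue sum back to the input series, which is a useful sanity check for any EMD implementation.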

Compositions of Google Search Keyword Variables
The era of big data brings new opportunities for building tourism demand forecasting models. Users' dependence on search engines can provide important data for tourism demand forecasting. In this paper, the generalized dynamic factor model (GDFM) [35], which can process high-dimensional data, is used to combine keyword variables. The unique advantage of GDFM is that variable data can be updated in real time, and known variables can be interpreted by common parts of unknown variables and aggregated into travel-related indices [36]. Forni (2004) proposed the idea of using a VAR to represent the GDFM [37]. The traditional factor model is composed of the sum of s common factors $k_t$ and a special factor $j_t$. On this basis, GDFM multiplies the common-factor part by an m × n loading matrix, denoted α, so that the observed variable is expressed as the loaded common factors plus the special factor. The common factors $k_t$ are in turn given a matrix representation.
Tourism demand has many uncertainties and can be influenced by policies and media indices. Network search can reflect the motives behind tourists' decision-making behavior, but under the influence of emergencies, the search volume of one or several keywords in a certain period can be extremely high or very low, and these data are abnormal. We therefore create a standard scale for the search data when synthesizing the keywords. When a value falls beyond the standard scale, the abnormal value is replaced using the mean-value method. EEMD decomposes the tourist volume sequence into n IMF components, and the same approach is applied to the Google keyword search index sequence.

DBN (Deep Belief Network) Prediction Model
Hinton proposed the deep belief network (DBN) [38], and the initial parameters of the ... The energy of the RBM is then defined in terms of $\omega_{ij}$, the symmetric connection weight between the visible unit $\upsilon_i$ and the hidden unit $h_j$. The probability of the binary state of the hidden unit $h_j$ being set to 1 and the probability of the binary state of the visible unit $\upsilon_i$ being set to 1 are calculated [39]. In general, the contrastive divergence algorithm is used to approximate the log-likelihood gradient of the RBM, and the weight and offset parameters are updated accordingly; for the detailed algorithm, see Hinton's work [40].
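For reference, the standard textbook form of a binary RBM is given below. The symbols $a_i$ and $b_j$ (visible and hidden offsets), $\sigma$ (the logistic function) and $\varepsilon$ (the learning rate) are notational assumptions, since the original equations are not available here; only $\omega_{ij}$, $\upsilon_i$ and $h_j$ come from the text.

```latex
\begin{aligned}
E(\upsilon, h) &= -\sum_i a_i \upsilon_i - \sum_j b_j h_j - \sum_i \sum_j \upsilon_i \,\omega_{ij}\, h_j \\
P(h_j = 1 \mid \upsilon) &= \sigma\Big(b_j + \sum_i \upsilon_i \,\omega_{ij}\Big) \\
P(\upsilon_i = 1 \mid h) &= \sigma\Big(a_i + \sum_j \omega_{ij}\, h_j\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \\
\Delta \omega_{ij} &= \varepsilon \left( \langle \upsilon_i h_j \rangle_{\text{data}} - \langle \upsilon_i h_j \rangle_{\text{recon}} \right)
\end{aligned}
```

The last line is the usual one-step contrastive divergence update, in which the model expectation is replaced by the statistics of a one-step reconstruction.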
In this paper, we propose an EEMD-DBN prediction model whose structure is shown in Figure 1. The data selected in this paper are the monthly numbers of inbound tourists in Shanghai from 2004 to 2018, divided into two parts: the training set and the prediction test set. To ensure the validity of the prediction model, the data for 2004-2017 were selected as the sample for training and building the prediction model, and the 2018 tourist data were used as the prediction set. The Baidu search engine is more widely used in China, so it is more suitable for forecasting China's domestic tourism demand. Globally, Google has a broader user base, accounting for about 66.7% of the world's total [29], so this article uses search data from the Google search engine. Figure 2 reflects the long-term trend of Shanghai tourist volumes. Compared with annual data, the monthly time series changes more markedly, which can provide a more detailed basis for decisions at tourist destinations.

Keyword Search Data
Compared with short-distance travel, inbound tourists will stay at the destina-

Data Analysis
It can be seen from the time series of inbound tourist traffic in Shanghai that the tourist volume data fluctuate markedly and that a large amount of data needs to be processed. The peak of the tourist volume sequence occurs from September to October of 2010. From May to October 2010, Shanghai hosted the World Expo, which played a positive role in Shanghai's tourism development and led to rapid growth of Shanghai's inbound tourist volumes in a short period of time. To reduce the impact of abnormal events on the forecasting model, we performed a simple noise reduction on the tourist volume time series to eliminate the interference of peaks and valleys in the data. As shown in Figure 4, the time series is relatively flat after the linear trend factor of the volume data is smoothed.
Keywords are determined according to the six elements of tourism activities (food, accommodation, transport, sightseeing, shopping and entertainment). Basic keywords such as "travel to Shanghai", "Shanghai attractions", "Shanghai hotels" and "Shanghai flights" are chosen first and entered into Google to query their search volume. When a basic keyword is entered, Google recommends related keywords, which are recorded. In this way 50 keywords related to inbound tourism in Shanghai are collected, and the 16 keywords with the largest search volume are retained. The Pearson correlation coefficient between the tourist volumes and each search keyword is then calculated. The results show that some keywords are negatively correlated with tourist volumes, and these keywords are eliminated. Based on tourists' psychological motivation and the Pearson correlation results shown in Table 1, the keywords with a Pearson correlation coefficient above 0.3 are finally chosen as experimental data.
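The keyword screening described above can be sketched as follows. The function names and all numbers are illustrative assumptions, not the paper's data; only the rule itself (compute the Pearson coefficient against tourist volumes, drop negatively correlated keywords, keep those above 0.3) comes from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def select_keywords(volumes, keyword_series, threshold=0.3):
    """Keep keywords whose correlation with tourist volumes exceeds threshold.

    Negatively correlated keywords fall below the threshold and are
    therefore dropped as well.
    """
    selected = {}
    for name, series in keyword_series.items():
        r = pearson(volumes, series)
        if r > threshold:
            selected[name] = r
    return selected


# Made-up monthly values, for illustration only.
volumes = [10, 12, 15, 14, 18]
keyword_series = {
    "Shanghai hotels": [20, 25, 29, 30, 35],    # moves with tourist volumes
    "Shanghai flights": [30, 28, 25, 27, 20],   # moves against tourist volumes
}
for name, r in select_keywords(volumes, keyword_series).items():
    print(name, round(r, 3))
```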
Using the standard scale we set as the limit, the abnormal values in the search volume of the 5 most strongly correlated keywords are processed. When the search volume of a keyword is lower or higher than the standard scale, many researchers simply eliminate the abnormal data, but this damages the data through excessive cleaning. In line with the seasonal characteristics of the tourism industry, this paper instead replaces each abnormal value with the average of the values for the same month in the other years, which preserves the integrity of the data.
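A minimal sketch of this same-month averaging is shown below. The bounds `lower` and `upper` stand in for the "standard scale", whose actual values are not given in the text, and the function name and data layout are hypothetical. Excluding other out-of-range values from the average is an added assumption.

```python
def repair_outliers(monthly, lower, upper):
    """Replace out-of-range values with the same-month mean of the other years.

    monthly: dict mapping year -> list of 12 monthly search volumes.
    lower, upper: hypothetical bounds playing the role of the standard scale.
    """
    repaired = {year: list(values) for year, values in monthly.items()}
    for year, values in monthly.items():
        for month, value in enumerate(values):
            if lower <= value <= upper:
                continue
            # Same calendar month in the other years, ignoring other outliers.
            others = [monthly[y][month] for y in monthly
                      if y != year and lower <= monthly[y][month] <= upper]
            if others:
                repaired[year][month] = sum(others) / len(others)
    return repaired


# Made-up search volumes with one spike in May 2016 (index 4).
data = {
    2015: [50] * 12,
    2016: [60] * 4 + [100] + [60] * 7,
    2017: [70] * 12,
}
cleaned = repair_outliers(data, lower=0, upper=90)
print(cleaned[2016][4])
```

Because the replacement value comes from the same calendar month, the seasonal pattern of the series is preserved, which is the motivation given in the text for not simply deleting the abnormal points.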
The basic method of GDFM index composition is to form a weighted sum of the common components of the variables. First, the variance contribution rate is used to determine the number of factors; then the common components of the multidimensional stationary search data $s_{it}$ are calculated; finally, they are summed into the search index. Here n is the determined number of factors, L is the lag operator, and k is the number of lags. According to Forni, the model works best when n = 4 and k = 5 [35].
After the abnormal data in the keyword search volume are processed, the effect of abnormal values on the authenticity of the search index is reduced. However, tourism is a highly interconnected industry: emergencies or other urban activities unrelated to tourism have a certain impact on tourism searches, which leads to significant increases or decreases in the overall network search for tourist destinations over time. For example, the SARS outbreak in 2003 affected web searches for tourist destinations. Abnormal data processing addresses individual data points, as shown in Figure 5. We still need to optimize the predictive power of the search index through simple noise reduction.
It can be seen that the composite search index does reflect the trend of tourist volumes and shows a certain lead time, as shown in Figure 6, which is consistent with people's behavioral motives. In the peak travel season, people seem more willing to search in advance to help them make travel decisions.

Evaluating Indicator
To investigate the prediction ability of the established model from different angles, this paper selects the mean absolute percentage error (MAPE), the mean square error (MSE), the mean absolute error (MAE) and the fitting coefficient $R^2$ as the evaluation indicators that measure the predictive effect of the model. Among them, $\hat{x}_i$ represents the model output value, that is, the predicted tourist volume; $x_i$ represents the actual number of visitors; and n is the number of test data. The fitting coefficient $R^2$ represents the degree to which the predicted value curve fits the actual value curve. The value of $R^2$ can measure the fitting ability of the model.
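The indicators named above can be computed as in the sketch below, using their usual definitions. Two points are assumptions: $R^2$ is taken to be the ordinary coefficient of determination, since the paper's own formula is not reproduced here, and RMSE is included only because it is the square root of MSE. The numbers are illustrative.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not the paper's data.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
print(mae(actual, predicted), mse(actual, predicted),
      mape(actual, predicted), r_squared(actual, predicted))
```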

EEMD Noise Reduction
Recent research on tourism demand forecasting includes many combined forecasting models, especially combinations of a network search index with a time series model. However, these studies pay too little attention to noise processing. Modal decomposition of the tourist volume time series and the keyword search index gives both series a uniformly distributed decomposition scale, which smooths out the interference of abnormal events. The core idea of EEMD is to add Gaussian white noise to the signal and perform ensemble averaging. The two important parameters of EEMD are the ratio k of the white noise amplitude to the standard deviation of the original signal, and the number of ensemble trials M; however, there is no specific method for calculating the values of k and M. Drawing on the experience of other researchers, the experiments in this paper and the characteristics of the data, we take k = 0.2 and M = 100 as the benchmark values and then adjust them continuously in the experiments to obtain the EEMD model with the best decomposition effect.
In the first iteration of the EEMD test, the Gaussian white noise sequence f(t) was added to the Shanghai inbound tourist volume time series P(t) and the keyword search index I(t), and n trials were performed. After that, the tourist volume time series and the network search index sequence for the n-th trial, each awaiting decomposition, are obtained.
In further experiments, the EEMD parameter values were continuously adjusted.
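The role of the two parameters can be illustrated with a small sketch. It shows only the noise-adding and ensemble-averaging part: the EMD decomposition that full EEMD applies to every noisy copy before averaging is deliberately omitted. The function names are hypothetical, the signal is synthetic, and k = 0.2 and M = 100 are the benchmark values from the text.

```python
import math
import random
import statistics


def noisy_ensemble(signal, k=0.2, m=100, seed=0):
    """Create m copies of signal, each with fresh Gaussian white noise.

    The noise standard deviation is k times the standard deviation of
    the original signal.
    """
    rng = random.Random(seed)
    sigma = k * statistics.pstdev(signal)
    return [[x + rng.gauss(0.0, sigma) for x in signal] for _ in range(m)]


def ensemble_mean(trials):
    """Average the trials point by point."""
    m = len(trials)
    return [sum(column) / m for column in zip(*trials)]


# Synthetic 12-period cycle standing in for a monthly series.
signal = [math.sin(2 * math.pi * t / 12) for t in range(120)]
trials = noisy_ensemble(signal, k=0.2, m=100)
averaged = ensemble_mean(trials)

single_error = max(abs(a - b) for a, b in zip(trials[0], signal))
mean_error = max(abs(a - b) for a, b in zip(averaged, signal))
print("max deviation, one noisy copy:", single_error)
print("max deviation, ensemble mean: ", mean_error)
```

Because the added noise is independent across trials, its standard deviation in the ensemble mean shrinks roughly by a factor of the square root of M, which is why a larger M leaves less residual noise at the cost of more computation.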

DBN Forecast
Comparing the prediction models, we find that the predicted values of the ARIMA model show obvious convergence characteristics, and ARIMA has considerable limitations in dealing with non-stationary time series. The prediction ability of the GM(1,1) model usually shows large volatility and is affected by the smoothness of the tourist volume data series, so the prediction effect of the GM(1,1) model is less than ideal. FTS usually optimizes the uncertainty of data and solves fuzzy problems, but when used alone for forecasting it often does not work well. SVM has been widely used in tourism demand forecasting in recent years. However, kernel function selection for the SVM model is a very difficult problem, and the model is computationally complex and often sensitive to missing data. Cubic exponential smoothing builds on the primary and quadratic exponential smoothing models. It is commonly used in China's domestic tourism demand forecasting because it has good predictive ability for seasonal time series. From the figure, the fitting effect of cubic exponential smoothing and DBN is relatively trustworthy. Panel (b) represents well the degree of dispersion between the ideal prediction and the actual predictions of the six different models. The DBN model shows the most stable and the smallest dispersion; that is, the predicted values of the DBN model closely surround the actual values. Combining the results of panel (a) with panel (b), we can conclude that the DBN has better predictive power than the other models.

Comparison of Different Forecasting Methods
The comparison experiment was set up in two groups. The first group predicted tourist volumes without the composite search index, and the second group added the Google search index to predict tourist volumes. The results of both groups were evaluated by MAE, MAPE, RMSE, and $R^2$ as shown in

Stability Test and Granger Causality Test
The Granger causality test is used to verify the effectiveness of Google keyword search data in the tourism demand forecasting model. Before conducting the Granger causality test, we must first ensure that the tourist volume sequence and the Google search index sequence are stationary. According to the unit root test results, the Google search index is stationary at the 1% significance level (Table 4). The tourist volume sequence is non-stationary during the same period, so it has to be differenced. After second-order differencing, the tourist volume sequence is stationary (Table 5). The Granger causality test between the variables is shown in Table 6. According to its results, Google search data Granger-cause tourists' travel behavior. This shows that the Google search index has predictive ability for inbound tourism in Shanghai, and therefore that it is feasible to use the search data to predict tourist volumes. The tourist volume sequence is recorded as $Y_t$, and the keyword search index is recorded as $I_t$.
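The second-order differencing applied to the tourist volume sequence can be sketched as follows. The function name and the numbers are illustrative assumptions. The unit root test and the Granger causality test themselves are not reproduced here, since they would normally be run with a statistics package.

```python
def difference(series, order=1):
    """Apply first differences `order` times: d_t = y_t - y_(t-1)."""
    result = list(series)
    for _ in range(order):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


# A series with a quadratic trend becomes constant after two differences.
y = [1, 4, 9, 16, 25]
print(difference(y, order=1))
print(difference(y, order=2))
```

Each round of differencing shortens the series by one observation, so the search index must be aligned with the differenced tourist volume series before the two are tested together.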

Conclusions
Network technology is constantly being upgraded and has become widespread and an indispensable part of people's daily lives. With the advent of the 5G era, web search may penetrate deeper into our daily lives, especially in terms of travel. The EEMD decomposition method adopted in this paper overcomes the heavy noise found in traditional composite indices, allowing the keyword search index to play its most important role in tourism demand forecasting. However, we have to admit that the selection of keywords needs further exploration. It is now possible to use high-speed computers to extract keywords, and such methods will greatly improve the accuracy of keywords, but this technology has extremely high hardware requirements and is therefore not universally applicable. Accurately finding the keywords that best reflect tourists' motivation will further optimize the tourism demand forecasting model. The five comparative forecasting models selected in this paper are widely used in tourism demand forecasting. The maturity of artificial intelligence technology will also bring new opportunities and challenges to tourism demand forecasting. Tourism is a comprehensive industry that is affected not only by force majeure such as weather and natural disasters, but also by subjective factors such as politics, economy, culture and even religion. In traditional tourism demand forecasting, research using quantitative methods accounts for the majority, which to some extent neglects the tourism behavior caused by people's subjective consciousness, whereas keyword search data reflect people's subjective behavior. Although keyword information has been applied to some extent, it has not been explored deeply enough. Therefore, accurate extraction of keywords and their qualitative analysis is another challenging problem that we need to work hard to solve.