Accurate prediction of tourism traffic demand plays an important role in making effective tourism administration policies. It helps allocate resources reasonably and avoid tourist congestion. This paper accounts for noise interference and proposes a hybrid model that combines ensemble empirical mode decomposition (EEMD), a deep belief network (DBN) and Google Trends for tourism traffic demand prediction. The model first applies a dislocation weighted synthesis method to combine Google Trends series into a composite search index, and then denoises the series with EEMD, which extracts the high-frequency noise from the original series. The low-frequency series of the composite search index is then used to forecast the low-frequency tourism traffic series. Taking inbound tourism in Shanghai as an example, this paper trains the model and predicts tourist arrivals for the next 12 months. The results show that the forecast error of the EEMD-DBN model is remarkably lower than that of the ARIMA, GM(1,1), FTS, SVM, CES and DBN baselines. This indicates that noise processing is necessary and that the EEMD-DBN forecast model can improve prediction accuracy.
According to data released by the China Tourism Research Institute, the growth rate of inbound tourist volumes in China is relatively slow. That is to say, the development of China’s inbound tourism has been basically stagnant for a long time, in contrast to the booming domestic and outbound tourism markets. Tourism demand forecasts can provide a timely basis for the relevant departments to formulate effective tourism policies [
In recent years, research using web search data to build tourism demand forecasting models has been fruitful. Compared with web search data, traditional econometric models and machine learning methods are limited by historical data when forecasting tourism demand. Search data are immediate and subjective, and so reflect tourists’ needs more accurately. In fact, the predictive power of web search data was confirmed as early as 2009; for example, the application of Google Trends is considered effective across many fields [
Tourism is an interrelated industry that is strongly affected by emergencies, and the impact of emergencies on tourism has been difficult to address. Forecasting models cannot be adjusted in real time according to changes in tourism dynamics. Examples include the tourist congestion at the entrance of the Jiuzhaigou Scenic Spot in Sichuan, China, on October 2, 2013, and the Shanghai Bund stampede on December 31, 2014. In these incidents the crowds at local tourist attractions were difficult to disperse and the service quality of the scenic spots declined, which drew concern from all sectors of society. These problems indicate that the spatial allocation of tourism resources is crucial to the healthy development of tourism [
In the field of tourism demand forecasting, research methods vary with the conditions and objects of the forecast. Inbound tourists are more purposeful. Because they travel long distances and stay longer than domestic tourists, they are more likely to plan ahead, and they more often rely on online search to develop a travel schedule. However, a prediction model built on search data alone is not robust, and combining artificial intelligence with search data can greatly improve prediction accuracy [
In the forecast of tourism demand, there have been very rich research results in the past ten years [
The application of artificial intelligence methods in tourism demand forecasting has begun to rise in the past 30 years. A back propagation neural network model can be applied to tourism demand forecasting [
Since the network search data are used to successfully predict the epidemic [
The application of network search data is more and more extensive, providing a good enlightenment for the research of the tourism industry [
In China, there are numerous users of Baidu search engines [
EEMD is an improved algorithm of EMD (Empirical Mode Decomposition) [
The basic methods of EEMD are as follows:
Step 1: Using EMD, locate the local extreme points of the sequence P(t). The local maxima form the upper envelope m(t) and the local minima form the lower envelope n(t); z(t) denotes the mean of the upper and lower envelopes, which is zero at every point for an IMF.
Step 2: Subtract the mean of the upper and lower envelopes from the sequence to get R(t).
$$R(t) = P(t) - \frac{m(t) + n(t)}{2} \tag{1}$$
Verify that R(t) satisfies the IMF conditions. If not, repeat Steps 1 and 2 until it does, and then treat R(t) as an IMF separated from P(t). Through repeated sifting, the sequence Y(t) is decomposed, from high frequency to low frequency, into a finite number of IMFi components and a remainder u(t).
$$Y(t) = \sum_{i=1}^{N} \mathrm{IMF}_i + u(t) \tag{2}$$
Step 3: Add random white noise to the sequence P(t) so that abnormal events are evened out: during the EMD decomposition, the abnormal-event modes are mixed into the white-noise modes. The noise-added signal is then normalized and decomposed with EMD to obtain the IMFi components.
Step 4: Repeat the decomposition, adding a new normally distributed random white noise sequence each time, and integrate the resulting IMFi components.
$$P(t) = \sum_{i,j=1}^{n} \mathrm{IMF}_{i,j}(t), \quad i = 1, \cdots, N;\ j = 1, \cdots, n \tag{3}$$
Step 5: Preset a threshold k. If the integrated value from Step 4 is less than k, the component is removed as noise; if it is greater than k, the IMFi is reset. Here Q is an entropy function.
$$K_i[\mathrm{IMF}_i(t), P(t)] = Q[\mathrm{IMF}_i] - Q[\mathrm{IMF}_i(t), P(t)] \tag{4}$$
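As a rough illustration of Steps 1 and 2, the sketch below performs a single sifting pass corresponding to Equation (1). It is a simplified sketch under stated assumptions, not the authors' implementation: it uses piecewise-linear envelopes where EMD implementations typically use cubic splines, it treats both endpoints as maxima and minima so that the envelopes span the whole series, and the names `local_extrema`, `envelope` and `sift_once` are hypothetical.

```python
import math


def local_extrema(p):
    """Return (maxima, minima) index lists for the sequence p.

    Assumption: both endpoints are added to each list so that the
    envelopes cover the full series.
    """
    last = len(p) - 1
    maxima, minima = [0], [0]
    for t in range(1, last):
        if p[t - 1] < p[t] >= p[t + 1]:
            maxima.append(t)
        elif p[t - 1] > p[t] <= p[t + 1]:
            minima.append(t)
    maxima.append(last)
    minima.append(last)
    return maxima, minima


def envelope(p, knots):
    """Piecewise-linear envelope through the points (t, p[t]) for t in knots."""
    env = [0.0] * len(p)
    for left, right in zip(knots, knots[1:]):
        for t in range(left, right + 1):
            w = (t - left) / (right - left)
            env[t] = (1 - w) * p[left] + w * p[right]
    return env


def sift_once(p):
    """One sifting pass: R(t) = P(t) - (m(t) + n(t)) / 2."""
    maxima, minima = local_extrema(p)
    m = envelope(p, maxima)  # upper envelope m(t)
    n = envelope(p, minima)  # lower envelope n(t)
    return [p_t - (m_t + n_t) / 2 for p_t, m_t, n_t in zip(p, m, n)]


# Toy series (not the Shanghai data): a slow trend plus a fast oscillation.
series = [0.05 * t + math.sin(1.5 * t) for t in range(60)]
candidate = sift_once(series)
```

In a full EMD, `sift_once` would be repeated on its own output until the IMF conditions hold, and the resulting IMF would be subtracted from the series before extracting the next one.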
The era of big data brings new opportunities for building tourism demand forecasting models. Users’ dependence on search engines can provide important data for tourism demand forecasting. In this paper, the generalized dynamic factor model (GDFM) [
Forni (2004) proposed the idea of using VAR to represent the model of GDFM [
$$X = \alpha k_t + j_t \tag{5}$$
The matrix transformation of kt can be expressed as:
$$k_t = p k_{t-1} + q \delta_t \tag{6}$$
where $\delta_t = (\delta_{1t}, \delta_{2t}, \delta_{3t}, \cdots, \delta_{nt})$ is an s-dimensional common component.
Tourism demand has many uncertainties and can be influenced by policies and media indices. Web searches can reflect the motives behind tourists’ decisions, but because of emergencies, the search volume of one or several keywords in a certain period can be extremely high or very low, and these data are abnormal. We therefore create a standard scale for the search data when synthesizing the keywords. When a value falls beyond the standard scale, the abnormal data point is replaced by a mean value. EEMD decomposes the tourist volume sequence into n IMF components, and the same approach is applied to the Google keyword search index sequence.
Hinton proposed the Deep belief network [
Hinton proposed the idea of training each layer of RBM separately [
$$\theta = (\omega_{ij}, a_i, b_j) \tag{7}$$
Then, the energy of the RBM is expressed as:
$$E(\upsilon, h : \theta) = -\sum_{i=1}^{c} a_i \upsilon_i - \sum_{j=1}^{d} b_j h_j - \sum_{i=1}^{c} \sum_{j=1}^{d} \upsilon_i \omega_{ij} h_j \tag{8}$$
ωij is the symmetric connection weight between the visible unit υi and the hidden unit hj. The probability of the binary state of the hidden unit hj being set to 1 and the probability of the binary state of the visible unit υi being set to 1 are calculated [
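To make the energy function concrete, the following sketch evaluates it for a small binary RBM and computes the two activation probabilities. It is a minimal sketch under assumptions: the interaction term is taken in its standard form υi ωij hj, the conditional probabilities use the usual sigmoid expressions for a binary RBM (the source does not reproduce them here), and the function names and toy parameter values are hypothetical.

```python
import math


def rbm_energy(v, h, w, a, b):
    """Energy of a joint state (v, h) for parameters theta = (w, a, b)."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(
        v[i] * w[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -visible_term - hidden_term - interaction


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_hidden_on(v, w, b, j):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij)."""
    return sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))


def p_visible_on(h, w, a, i):
    """P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij * h_j)."""
    return sigmoid(a[i] + sum(w[i][j] * h[j] for j in range(len(h))))


# Toy RBM with c = 3 visible units and d = 2 hidden units (made-up values).
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
a = [0.1, 0.0, -0.1]
b = [0.2, -0.3]
v = [1, 0, 1]
h = [1, 0]

energy = rbm_energy(v, h, w, a, b)
prob_h0 = p_hidden_on(v, w, b, 0)
prob_v1 = p_visible_on(h, w, a, 1)
```

Lower energy corresponds to a more probable joint configuration, which is what layer-wise RBM training exploits.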
In this paper, we propose an EEMD-DBN prediction model whose structure is shown in
Shanghai is an important pillar of China’s national economic development, and its rich tourism resources attract many tourists from home and abroad. The regional differences in China’s inbound tourism development have gradually narrowed, but it is undeniable that Shanghai still gives a huge impetus to the progress of China’s inbound tourism. According to the 2017 National Statistical Report on National Economic and Social Development, as of the end of 2017 there were 99 A-level scenic spots in Shanghai, including 3 5A, 50 4A and 46 3A scenic spots. Shanghai has become one of the Chinese cities with the largest inbound tourist volumes.
The data selected in this paper are the monthly numbers of inbound tourists in Shanghai from 2004 to 2018, divided into two parts: the training set and the test set. To ensure the validity of the prediction model, the 2004-2017 data were used as the sample for training and building the prediction model, and the 2018 tourist data were used as the prediction set. The Baidu search engine is more widely used in China, so it is more suitable for forecasting China’s domestic tourism demand. Globally, Google has a broader user base, accounting for about 66.7% of the world’s total [
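The split described above can be sketched as follows. This is an illustrative sketch only: the arrival values are placeholders rather than the real Shanghai data, and `monthly_index` and `split_by_year` are hypothetical helper names.

```python
from datetime import date


def monthly_index(start_year, end_year):
    """First day of every month from January start_year to December end_year."""
    return [
        date(year, month, 1)
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]


def split_by_year(months, values, first_test_year):
    """Observations before first_test_year go to training, the rest to testing."""
    pairs = list(zip(months, values))
    train = [(d, x) for d, x in pairs if d.year < first_test_year]
    test = [(d, x) for d, x in pairs if d.year >= first_test_year]
    return train, test


months = monthly_index(2004, 2018)
arrivals = [0.0] * len(months)  # placeholder values, not the real series
train, test = split_by_year(months, arrivals, first_test_year=2018)
```

With monthly data for 2004 to 2018 this yields 168 training observations and 12 test observations, matching the 12-month forecast horizon.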
reflects the long-term trend of Shanghai tourist volumes. Compared with annual data, the monthly time series changes more markedly, which can provide a more detailed basis for tourism decisions at the destination.
Compared with short-distance travelers, inbound tourists stay at the destination for a relatively long time, so they tend to use search engines to develop travel plans in advance. This includes planning travel routes, booking hotels and looking up information about the destination. These activities are mostly carried out through search engines, so this paper uses a keyword search index to further improve the accuracy of the tourism demand forecasting model. A key step in synthesizing an online search index is the selection of search keywords. There is currently no mature procedure or theoretical system for keyword selection. This article uses the more common method of selecting keywords directly.
First, 50 common tourism-related keywords are selected and their search volumes are queried. According to the Google search volume ranking, the top 16 keywords are retained and a Pearson correlation test is performed on them; the 5 keywords most strongly correlated with tourist volumes are kept as research samples, as
It can be seen from the time series of inbound tourist traffic in Shanghai that the tourist volume data fluctuate markedly, and there is a large amount of data to be processed. The peak of the tourist volume sequence is from September to October 2010. From May to October 2010, Shanghai hosted the World Expo, which played a positive role in Shanghai’s tourism development and led to the rapid growth of Shanghai’s inbound tourist volumes in a short period of time. To reduce the impact of abnormal events on the forecasting model, we performed a simple noise reduction process on the tourist volume time series to eliminate the interference of the data peaks and valleys. As shown in
Basic keywords are determined according to the six elements of tourism activities (eating, accommodation, transport, sightseeing, shopping and entertainment), for example: travel to Shanghai, Shanghai attractions, Shanghai hotels, Shanghai flights, etc. The basic keywords are then entered into Google to query their search volumes. When a basic keyword is entered, Google recommends related keywords, which are recorded and sorted into a list of 50 keywords related to inbound tourism in Shanghai. The 16 keywords with the largest search volumes are retained, and the Pearson correlation coefficient between the tourist volumes and each search keyword is calculated. The results show that some keywords are negatively correlated with the tourist volumes, and these keywords are eliminated. Based on the tourist psychology motivation angle and the Pearson correlation coefficient calculation results, as shown in
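The screening step can be illustrated with a short sketch. The `pearson` function implements the standard sample correlation coefficient, and `reported` holds the coefficients from the keyword correlation table in this section. The selection rule in `select_keywords` (drop non-positive coefficients, keep the five largest) is an assumption made for illustration: the paper also weighs tourist motivation when choosing keywords, so the authors' final five may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Coefficients as reported in the keyword correlation table.
reported = {
    "Flight to Shanghai": 0.597,
    "Shanghai weather": 0.128,
    "Shanghai traffic": -0.420,
    "Weather in Shanghai China": 0.348,
    "Shanghai tourist attractions": -0.183,
    "China time Shanghai": 0.656,
    "Shanghai flight": 0.479,
    "Shanghai scene": -0.292,
    "Shanghai hotel booking": 0.024,
    "Bund Shanghai": 0.393,
    "Shanghai cuisine": -0.312,
    "Shanghai food": 0.178,
    "Shanghai airport": -0.272,
    "Shanghai airport arrivals": 0.045,
    "Shanghai restaurant": 0.124,
    "Shanghai visa": 0.053,
}


def select_keywords(coefficients, top_n=5):
    """Assumed rule: discard non-positive correlations, keep the top_n largest."""
    positive = {k: r for k, r in coefficients.items() if r > 0}
    return sorted(positive, key=positive.get, reverse=True)[:top_n]


selected = select_keywords(reported)
```

In practice `pearson` would be applied to each keyword's monthly search volume against the monthly tourist volumes to produce the coefficients that `select_keywords` ranks.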
Using the standard scale we set as the limit, the abnormal values in the search volumes of the 5 most strongly correlated keywords are processed. When the search volume of a keyword falls below or above the standard scale, many researchers directly eliminate such abnormal data, but this over-cleans and damages the data. Given the seasonal characteristics of the tourism industry, this paper instead replaces an abnormal value with the average of the data for the same month in the other years, which preserves the integrity of the data information.

Keyword | Correlation coefficient |
---|---|
Flight to Shanghai | 0.597 |
Shanghai weather | 0.128 |
Shanghai traffic | −0.420 |
Weather in Shanghai China | 0.348 |
Shanghai tourist attractions | −0.183 |
China time Shanghai | 0.656 |
Shanghai flight | 0.479 |
Shanghai scene | −0.292 |
Shanghai hotel booking | 0.024 |
Bund Shanghai | 0.393 |
Shanghai cuisine | −0.312 |
Shanghai food | 0.178 |
Shanghai airport | −0.272 |
Shanghai airport arrivals | 0.045 |
Shanghai restaurant | 0.124 |
Shanghai visa | 0.053 |
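The same-month averaging used for abnormal search values can be sketched as below. This is a sketch under assumptions: the series is assumed to start in January, the bounds `lower` and `upper` stand in for the paper's standard scale (whose actual values are not given), other abnormal values are excluded from the average, and the function name is hypothetical.

```python
def replace_with_same_month_mean(values, lower, upper):
    """Replace out-of-scale monthly values by the mean of the same calendar
    month in the other years.

    values: monthly series assumed to start in January.
    lower, upper: hypothetical bounds of the standard scale.
    """
    cleaned = list(values)
    for t, x in enumerate(values):
        if lower <= x <= upper:
            continue
        # Same calendar month in other years, skipping other abnormal values.
        same_month = [
            values[s]
            for s in range(t % 12, len(values), 12)
            if s != t and lower <= values[s] <= upper
        ]
        if same_month:
            cleaned[t] = sum(same_month) / len(same_month)
    return cleaned


# Toy example: three years of monthly data with one spike in March of year 2.
raw = [40 + month for _ in range(3) for month in range(12)]
raw[14] = 500
cleaned = replace_with_same_month_mean(raw, lower=0, upper=100)
```

Averaging within the same calendar month keeps the seasonal shape of the series, whereas deleting the point or using a global mean would distort it.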
The basic method of building the GDFM composite index is to form a weighted sum of the common components of the variables. First, the variance contribution rate is used to determine the number of factors; then the common components of the multidimensional stationary search data s_it are calculated; finally, these are added up to obtain the search index.
$$s_{it} = \sum_{i=1}^{n} b_{it}(L^{k})\, g_{nt} \tag{9}$$
Here n is the determined number of factors, L is the lag operator, and k is the number of lag operators. According to Forni, the model works best when n = 4 and k = 5 [
After the abnormal values in the keyword search volumes are processed, their effect on the authenticity of the search index is reduced. However, tourism is a highly interrelated industry: emergencies or other urban activities unrelated to tourism also affect tourism searches, causing the overall web search volume for a tourist destination to rise or fall significantly over time. One example is the SARS outbreak in 2003, which affected web searches for tourist destinations. Abnormal data processing targets individual data points, as shown in
It can be seen that the composite search index does reflect the trend of tourist volumes and shows a certain lead time, as shown in
To evaluate the prediction ability of the established model from different angles, this paper selects the mean absolute percentage error (MAPE), the root mean square error (RMSE), the mean absolute error (MAE) and the fitting coefficient R2 as evaluation indicators to measure the predictive effect of the model. Here $\bar{x}_i$ represents the model output, that is, the predicted tourist volume; $x_i$ represents the actual number of visitors; and n is the number of test data points.
The fitting coefficient R2 represents how well the predicted value curve fits the actual value curve, and its value measures the fitting ability of the model. R2 ∈ [0, 1], and the closer R2 is to 1, the stronger the fitting ability of the model.
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x}_i)^2}{\sum_{i=1}^{n} (x_i - \mu)^2}, \qquad \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{10}$$
The MAE measures prediction accuracy as the average absolute difference between the predicted and true values. The smaller the MAE, the higher the prediction accuracy.
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}_i| \tag{11}$$
MAPE is an evaluation criterion used to explain the relative error of the prediction model, which can well evaluate the prediction ability of the model.
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - \bar{x}_i|}{x_i} \times 100\% \tag{12}$$
RMSE is the square root of the mean of the squared differences between the predicted and true values; it is used to estimate the degree of deviation between the predicted value and the true value.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_i)^2} \tag{13}$$
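The four indicators can be computed with a few lines of code. The sketch below follows Equations (11) to (13) and, for R2, assumes the usual definition in which the denominator is the total sum of squares around the mean of the actual values. The function names and the toy numbers are illustrative, not taken from the experiments.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(x - p) for x, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return sum(abs(x - p) / x for x, p in zip(actual, predicted)) / n * 100


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((x - p) ** 2 for x, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """1 - SSE / SST, with SST taken around the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((x - p) ** 2 for x, p in zip(actual, predicted))
    sst = sum((x - mean_actual) ** 2 for x in actual)
    return 1 - sse / sst


# Toy values, not the Shanghai data.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
scores = {
    "MAE": mae(actual, predicted),
    "MAPE": mape(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "R2": r_squared(actual, predicted),
}
```

Note that `mape` returns a percentage, following Equation (12); dividing by 100 gives the fractional form.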
In recent tourism demand forecasting research there are many combined forecasting models, especially combinations of a web search index with a time series model. However, these studies pay little attention to noise processing. The modal decomposition of the tourist volume time series and the keyword search index gives both series a uniformly distributed decomposition scale, which smooths out the interference of abnormal events.
The core idea of EEMD is to add Gaussian white noise to the signal and perform ensemble averaging. The two important parameters of EEMD are the ratio k of the white noise amplitude to the standard deviation of the original signal, and the ensemble size M (the number of averaged trials). However, there is no specific method for calculating k and M. Based on the experience of researchers, the experiments in this paper and the characteristics of the data, we take k = 0.2 and M = 100 as the benchmark values, and then adjust them continuously during the experiments to obtain the EEMD model with the best decomposition effect.
When performing the EEMD test of the first iteration, the Gaussian white noise sequence f(t) was added to the Shanghai inbound tourist volumes time series P(t) and the keyword search index I(t), and n trials were performed. After that, the nth pending tourism volumes time series and the network search index sequence are obtained.
$$P_n(t) = P(t) + l f_n(t) \tag{14}$$
$$I_n(t) = I(t) + l f_n(t) \tag{15}$$
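The noise-adding step in Equations (14) and (15) can be sketched as follows. The sketch assumes that the coefficient l scales unit-variance Gaussian noise and equals the noise ratio times the standard deviation of the series, using the benchmark values 0.2 and 100 quoted for k and M; this link between l and k is an interpretation rather than something the text states, and `noisy_copies` is a hypothetical name.

```python
import math
import random
import statistics


def noisy_copies(p, noise_ratio=0.2, trials=100, seed=0):
    """Return `trials` noise-added copies of the series p.

    Assumption: each copy is P(t) + l * f_n(t), where f_n(t) is standard
    Gaussian white noise and l = noise_ratio * std(P).
    """
    rng = random.Random(seed)
    scale = noise_ratio * statistics.pstdev(p)
    return [[x + scale * rng.gauss(0.0, 1.0) for x in p] for _ in range(trials)]


# Toy series, not the Shanghai data.
series = [100 + 10 * math.sin(t / 2) for t in range(24)]
copies = noisy_copies(series)

# Averaging over the ensemble largely cancels the added noise.
ensemble_mean = [sum(c[t] for c in copies) / len(copies) for t in range(len(series))]
```

In EEMD each copy is decomposed separately with EMD and the corresponding IMFs are averaged, which is why the added noise cancels while mode mixing is reduced.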
In further experiments, the EEMD parameter values were continuously adjusted. The decomposed tourist volume time series and search index sequence each contain 6 IMF components and one residual. The amplitude and fluctuation of the IMF components differ. It can also be seen from the figure that the first IMF always has the largest amplitude and fluctuation frequency and a relatively short wavelength. The first IMF component is therefore removed as noise, and the remaining IMF2, IMF3, IMF4, IMF5, IMF6 and the smooth trend residual are summed to form the experimental sequences of tourist volumes and keyword searches, respectively.
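The reconstruction of the denoised series is a simple sum. The sketch below assumes the decomposition is already available as a list of IMF components ordered from high to low frequency plus a residual; the numbers are toy values and `denoise` is a hypothetical helper.

```python
def denoise(imfs, residual, drop=1):
    """Sum all IMFs except the first `drop` ones, plus the residual.

    imfs: list of equal-length sequences ordered from high to low frequency.
    residual: trend sequence of the same length.
    """
    kept = imfs[drop:]
    return [
        sum(component[t] for component in kept) + residual[t]
        for t in range(len(residual))
    ]


# Toy decomposition with 6 IMFs and a residual (made-up integers).
imfs = [
    [3, -3, 2, -2],  # IMF1: treated as high-frequency noise and dropped
    [1, 2, -1, -2],  # IMF2
    [2, 1, 0, -1],   # IMF3
    [1, 1, 1, 0],    # IMF4
    [0, 1, 1, 1],    # IMF5
    [1, 0, 0, 1],    # IMF6
]
residual = [10, 11, 12, 13]
smooth = denoise(imfs, residual)
```

The same function would be applied separately to the tourist volume series and to the composite search index, so that both model inputs are denoised in the same way.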
We need to determine the number of hidden layers in the DBN and the number of nodes in each hidden layer. There is no exact rule for the number of nodes in the DBN input layer and hidden layers. Based on the experience of researchers, this paper sets the number of DBN layers to N = 3 and varies the number of neurons in each hidden layer in steps of 5. The number of nodes in a layer is taken as an integer in [100, 1000]; each time a hidden layer is added, the optimal number of neurons in that layer is determined, and the experiment is repeated to achieve the highest accuracy. Each component RBM of the DBN needs to optimize its feature extraction ability through training. The DBN therefore also needs weights, which determine the influence factor of the maximum probability of the training sample. We set the learning rate of the DBN model to 0.1 and the number of iterations to 200. By repeated training, the final DBN structure is determined to be a composition of 3 RBM layers, with 20 and 15 hidden neurons in the first and second layers, respectively.
In this paper, the established DBN model is used for prediction and compared with support vector machine, ARIMA, GM(1,1), fuzzy time series and cubic exponential smoothing model. The group (a) of Figures 6-11 shows the fit between the predicted value and the actual value, and the group (b) of Figures 7-12 shows the degree of dispersion between the predicted value and the observed value.
It can be seen from the comparison of the group (a) images of the six different prediction models that the predicted values of the ARIMA model show obvious convergence characteristics, and ARIMA has greater limitations in dealing with non-stationary time series. The predictions of the GM(1,1) model usually show large volatility; affected by the smoothness of the tourist volume series, its prediction effect is less than ideal. FTS is usually used to handle data uncertainty and solve fuzzy problems, and it often does not work well when used alone for forecasting. SVM has been widely used in tourism demand forecasting in recent years. However, kernel function selection for the SVM model is a very difficult problem, and the model is computationally complex and often sensitive to missing data. Cubic exponential smoothing builds on the simple and quadratic exponential smoothing models. It is commonly used in China’s domestic tourism demand forecasting because it has good predictive ability for seasonal time series. From the figure, the fitting effect of cubic exponential smoothing and DBN is relatively trustworthy.
The group (b) images show the degree of dispersion between the ideal predictions and the actual predictions of the six different models. The DBN model has the most stable and smallest dispersion; that is, its predicted values lie closest to the actual values. Combining the results of the group (a) and group (b) images, we can conclude that the DBN has better predictive power than the other models.
The comparison experiment was set up in two groups. The first group predicted the tourist volumes without adding the composite search index, and the second group joined the Google search index to predict the tourist volumes. The results of both groups were evaluated by MAE, MAPE, RMSE, and R2 as shown in
Looking at the results of the second set of experiments, after joining the Google keyword search index, we used six different models to predict the tourist volumes as shown in
Model | MAE (×10^4) | MAPE | RMSE (×10^5) | R2 |
---|---|---|---|---|
ARIMA | 10.16828 | 0.14032 | 1.27639 | 0.62679 |
GM(1,1) | 8.47406 | 0.1159 | 1.21326 | 0.63701 |
FTS | 7.77496 | 0.11376 | 0.99984 | 0.70957 |
SVM | 7.95296 | 0.11417 | 1.13246 | 0.74384 |
CES | 7.31596 | 0.10397 | 0.99589 | 0.76065 |
DBN | 5.8507 | 0.08417 | 0.90995 | 0.80651 |
Model | MAE (×10^4) | MAPE | RMSE (×10^5) | R2 |
---|---|---|---|---|
Index-ARIMA | 10.40777 | 0.15558 | 1.30863 | 0.64662 |
Index-GM(1,1) | 7.03853 | 0.09941 | 0.85096 | 0.64336 |
Index-FTS | 8.02626 | 0.10963 | 1.05467 | 0.72485 |
Index-SVM | 6.28613 | 0.09048 | 1.00487 | 0.779 |
Index-CES | 5.99157 | 0.084653 | 0.91868 | 0.78252 |
Index-DBN | 5.04885 | 0.071167 | 0.65961 | 0.82518 |
effect in terms of tourist volumes forecasting, and the Google search index can indeed optimize the forecast of tourist volumes.
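The effect of adding the search index can be quantified directly from the two result tables. The sketch below computes the relative reduction in MAPE for each model from the reported values; the helper name `relative_reduction` is hypothetical, and the calculation is a worked illustration rather than an analysis taken from the paper.

```python
# MAPE values as reported in the two result tables.
baseline_mape = {
    "ARIMA": 0.14032,
    "GM(1,1)": 0.1159,
    "FTS": 0.11376,
    "SVM": 0.11417,
    "CES": 0.10397,
    "DBN": 0.08417,
}
index_mape = {
    "ARIMA": 0.15558,
    "GM(1,1)": 0.09941,
    "FTS": 0.10963,
    "SVM": 0.09048,
    "CES": 0.084653,
    "DBN": 0.071167,
}


def relative_reduction(before, after):
    """Fraction by which the error falls; negative if the error increases."""
    return (before - after) / before


reductions = {
    model: relative_reduction(baseline_mape[model], index_mape[model])
    for model in baseline_mape
}

for model, value in reductions.items():
    print(f"{model}: {value:+.1%}")
```

On these figures the DBN's MAPE falls by roughly 15% once the index is added, while ARIMA is the only model whose MAPE rises.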
The Granger causality test is used to verify the effectiveness of the Google keyword search data in the tourism demand forecasting model. Before conducting the Granger causality test, we must first ensure that the tourist volume sequence and the Google search index sequence are stationary. According to the unit root test results, the Google search index is stationary at the 1% significance level (
Test | Level | t-statistic | Prob |
---|---|---|---|
Augmented Dickey-Fuller test statistic | | −4.447879 | 0.0002 |
Test critical values | 1% level | −3.467205 | |
Test critical values | 5% level | −2.877636 | |
Test critical values | 10% level | −2.575430 | |
Test | Level | t-statistic | Prob |
---|---|---|---|
Augmented Dickey-Fuller test statistic | | −12.42683 | <0.0001 |
Test critical values | 1% level | −3.469933 | |
Test critical values | 5% level | −2.878829 | |
Test critical values | 10% level | −2.576067 | |
null hypothesis | F-statistic | Prob |
---|---|---|
Yt does not Granger cause It | 7.39027 | 0.0001 |
It does not Granger cause Yt | 3.06659 | 0.0295 |
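The decision rules behind these tables can be written out explicitly. The sketch below takes the reported statistics as given and applies the standard rules: a series is treated as stationary when the ADF statistic lies below the critical value, and a Granger null hypothesis is rejected when its p-value is below the chosen significance level. It does not rerun the tests, and the dictionary layout and function names are hypothetical.

```python
# Reported ADF results, in the order the two unit root tables appear.
adf_results = [
    {
        "statistic": -4.447879,
        "critical": {"1%": -3.467205, "5%": -2.877636, "10%": -2.575430},
    },
    {
        "statistic": -12.42683,
        "critical": {"1%": -3.469933, "5%": -2.878829, "10%": -2.576067},
    },
]

# Reported Granger causality p-values.
granger_p = {
    "Yt does not Granger cause It": 0.0001,
    "It does not Granger cause Yt": 0.0295,
}


def rejects_unit_root(result, level="1%"):
    """The unit root null is rejected when the statistic is below the critical value."""
    return result["statistic"] < result["critical"][level]


def rejected_at(p_value, alpha):
    """A null hypothesis is rejected when its p-value is below alpha."""
    return p_value < alpha


stationary = [rejects_unit_root(r) for r in adf_results]
decisions = {
    null: {alpha: rejected_at(p, alpha) for alpha in (0.01, 0.05)}
    for null, p in granger_p.items()
}
```

Applied to the reported numbers, both series pass the unit root test at the 1% level, and the hypothesis that the search index does not Granger cause tourist volumes is rejected at the 5% level but not at the 1% level.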
Network technology is constantly being upgraded and has become widespread, making it an indispensable part of people’s daily lives. With the advent of the 5G era, web search may penetrate deeper into our daily lives, especially in terms of travel. The EEMD decomposition method adopted in this paper overcomes the large noise defects in the traditional index composite, making the keyword search index play the most important role in tourism demand forecasting. However, we have to admit that the selection of keywords needs further exploration. It is now possible to use high-speed computers to extract keywords, and such methods can greatly improve the accuracy of the keywords, but this technology has extremely high hardware requirements and is therefore not universally applicable. Accurately finding the keywords that best reflect tourists’ motivation will further optimize the tourism demand forecasting model. The five comparative forecasting models selected in this paper are widely used in tourism demand forecasting. The maturity of artificial intelligence technology will also bring new opportunities and challenges to tourism demand forecasting.
Tourism is a comprehensive industry that is affected not only by force majeure such as weather and natural disasters, but also by subjective factors such as politics, economy, culture and even religion. In traditional tourism demand forecasting, research using quantitative methods accounts for the majority, which to some extent neglects the tourism behavior driven by people’s subjective consciousness, whereas keyword search data reflect people’s subjective behavior. Although keyword information has been applied to some extent, the application is not yet deep enough. Therefore, accurate extraction of keywords and their qualitative analysis is another challenging problem that we need to work hard to solve.
This research is supported by the Fundamental Research Funds for the Central Universities under Grant No. CCNU19ZN024 and the Humanities and Social Sciences Layout Foundation of the Ministry of Education of China under Grant No. 20YJA740047.
The authors declare no conflicts of interest regarding the publication of this paper.
Xiao, Y., Tian, X.T. and Xiao, M. (2020) Tourism Traffic Demand Prediction Using Google Trends Based on EEMD-DBN. Engineering, 12, 194-215. https://doi.org/10.4236/eng.2020.123016