Influenza is an infectious disease that spreads quickly and widely, and its outbreaks cause heavy losses to society. In this paper, flu keywords were organized into four major categories: “prevention phase”, “symptom phase”, “treatment phase”, and “commonly-used phrase”. A Python web crawler was used to obtain influenza data from the National Influenza Center’s weekly influenza surveillance reports and from the Baidu Index. Support vector regression (SVR), least absolute shrinkage and selection operator (LASSO), and convolutional neural network (CNN) prediction models were established through machine learning and, to account for the seasonal characteristics of influenza, a time series model (ARMA) was also established. The results show that it is feasible to predict influenza from web search data. Machine learning shows a certain predictive effect on web search data and may serve as a reference for future influenza prediction. The ARMA(3,0) model gives better predictions and generalizes better. Finally, the limitations of this study and directions for future research are given.

Influenza, referred to as the flu, is an acute respiratory infectious disease caused by influenza virus that cannot be completely controlled until now [

First of all, in order to control the spread of influenza virus and reduce the losses caused by influenza, it is necessary to use reasonable methods to predict the trend of influenza activity. However, the influenza virus has the characteristics of strong infectiousness, rapid propagation, wide spread, and antigen variability [

Nowadays, search engines are increasingly becoming the main method for people to obtain information. Web search data has become an ideal data source for influenza surveillance. In the United States, about 90 million adults annually search the Internet for health information such as disease and medicine [

Influenza has caused great difficulties in prevention and monitoring due to its rapid mutation rate. Therefore, the most important task in influenza epidemic surveillance research is to improve the timeliness of predictions. The use of more immediate and accurate data sources is the main reason for improving timeliness [

In 2008, Polgreen et al. [

The prediction of events through social networks is a hot topic of big data research. In foreign countries, there are many researchers who use the social platform Twitter to do data analysis, including flu trend monitoring. Nigel Collier et al. [

At present, the most representative foreign influenza surveillance platform is Flu Near You, a flu monitoring and visualization system that displays its data intuitively on maps. It is also open to participation by the general public: users can submit information about their flu symptoms every week. These data help researchers better understand the spread of the flu, while ordinary citizens can follow the spread of flu in their surrounding communities and nationwide [

At present, there is no standardized flu prediction model in China, and few studies have used web search data to build flu prediction models. This study establishes several prediction models by using Python to crawl relevant flu data and applying machine learning. Considering the seasonality of influenza, a time series model has also been established, which has certain reference value for the monitoring and prevention of influenza.

A major indicator of influenza surveillance at home and abroad is the proportion of influenza-like illness (ILI). An ILI case is a patient at an outpatient clinic of a sentinel hospital who presents with fever (body temperature ≥ 38 °C) accompanied by cough or sore throat, that is, one kind of acute respiratory infection [^{th} week of 2018 (2018/16, April 23, 2018). The data collected in this paper are mainly the national proportion of influenza-like cases (the total number of ILI patients divided by the number of outpatients, expressed as ILI%).

In order to obtain more timely data, this paper uses a crawler program written in Python, with Baidu’s web page data as the crawling target. Basic search terms such as “flu vaccine”, “cold”, “flu treatment”, “flu medicine” and “H7N9” were selected first, together with keywords used in related studies and keyword recommendations from search engines. The number of words for each keyword type was then expanded, and all valid keywords were combined to form an initial vocabulary. As shown in

When building data mining and machine learning models, we usually start with as many independent variables as possible, in order to minimize the model bias caused by omitting important variables. During actual modeling, however, it is usually necessary to find the subset of independent variables that best explains the response variable, so as to improve the model’s interpretability and predictive power. This process is called feature selection.

Principal component analysis (PCA) is a method of dimension reduction for unsupervised learning. It requires only eigenvalue decomposition to compress and denoise the data. Therefore, this paper uses PCA algorithm to extract features of influenza keywords.

Algorithm flow: Input: the $n$-dimensional sample set $D = (x^{(1)}, x^{(2)}, \cdots, x^{(m)})$ and the target dimension $n'$. Output: the sample set $D'$ after dimension reduction.

1) Centralize all samples: $x^{(i)} = x^{(i)} - \frac{1}{m}\sum_{j=1}^{m} x^{(j)}$,

Classification | Key words | Extended words |
---|---|---|
Prevention stage | Flu vaccine | Flu vaccine side effects, Influenza vaccine necessary to fight it, Bird flu vaccine |
Prevention stage | Flu prevention | How to prevent flu, Prevent influenza, Influenza outbreak, Prevent influenza A |
Stage of symptoms | Cold | Gastro-intestinal flu, Influenza, Cold symptoms, Viral influenza |
Stage of symptoms | Respiratory infection | Upper respiratory tract infection, Nasal congestion, Cough, Bronchitis, Sore throat, Runny nose, Rhinitis, Pharyngitis |
Stage of symptoms | Fever | Fever, High fever, Headache, Dizziness, Fatigue, Fever, Chills |
Treatment stage | Flu treatment | A stream of treatment, What medicine to eat flu |
Treatment stage | Cold medicine | Baijiahei, Contac, Tylenol, Gankang, Amoxicillin, Cough, Antipyretics, Cephalosporins, Oseltamivir |
Commonly used words | H1N1 | H1N1 flu, H7N9, H7N9 flu |
Commonly used words | Influenza-A | Influenza A, Type A H1N1 flu, Type A flu symptoms, What is Influenza A |
Commonly used words | Influenza | Flu virus, Swine flu, Bird flu |

2) Calculate the sample covariance matrix $XX^{T}$,

3) Perform eigenvalue decomposition on the matrix $XX^{T}$ and take the eigenvectors $(w_1, w_2, \cdots, w_{n'})$ corresponding to the $n'$ largest eigenvalues. After all the eigenvectors are normalized, they form the eigenvector matrix $W$,

4) Transform each sample $x^{(i)}$ in the sample set into a new sample $z^{(i)} = W^{T} x^{(i)}$,

5) Output the sample set $D' = (z^{(1)}, z^{(2)}, \cdots, z^{(m)})$.
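The five PCA steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in the study: samples are stored as rows (so `Xc.T @ Xc` plays the role of $XX^{T}$ in the column-sample notation of the text), the function name `pca_reduce` is hypothetical, and the data are random stand-ins for the 105 weeks by 47 keyword indicators.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X (m samples x n features) onto the top principal components."""
    # Step 1: centralize every feature (subtract the sample mean).
    Xc = X - X.mean(axis=0)
    # Step 2: scatter (unnormalized covariance) matrix of the features, n x n.
    scatter = Xc.T @ Xc
    # Step 3: eigen-decomposition; eigh is for symmetric matrices and returns
    # eigenvalues in ascending order with unit-norm eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(scatter)
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]
    # Steps 4 and 5: z = W^T x for every sample.
    Z = Xc @ W
    return Z, W, eigvals[order]

# Synthetic stand-in data (assumption): 105 weekly samples, 47 indicators.
rng = np.random.default_rng(0)
X = rng.normal(size=(105, 47))
Z, W, top_eigvals = pca_reduce(X, 16)
print(Z.shape)
```

Because the retained eigenvectors are orthonormal, the reduced features in `Z` are mutually uncorrelated, which is convenient for the regression models that follow.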

Given the training samples $(x_i, y_i)$, $i = 1, 2, \cdots, l$, the simplest support vector regression (SVR) uses a linear function $f(x, w) = (w \cdot x) + b$ to fit the sample points, where $w$ and $b$ are the normal vector and the offset of the linear regression function, respectively. Assume that all training data can be fitted by a linear function with an error of at most $\varepsilon$. This leads to the following optimization problem:

$$\min_{w} \; \Phi(w) = \frac{1}{2}\|w\|^{2} \tag{1}$$

$$\text{s.t.} \; \begin{cases} (w \cdot x_i + b) - y_i \le \varepsilon, & i = 1, 2, \cdots, l \\ y_i - (w \cdot x_i + b) \le \varepsilon, & i = 1, 2, \cdots, l \end{cases} \tag{2}$$

When the constraints in (2) cannot all be satisfied, slack variables $\xi_i$, $\xi_i^{*}$ and a penalty parameter $C$ are introduced to “soften” the constraints, in the same way as for linearly inseparable support vector classification. The original optimization problem becomes:

$$\min_{w,\, \xi_i,\, \xi_i^{*},\, b} \; \frac{1}{2}\|w\|^{2} + C \cdot \frac{1}{l}\sum_{i=1}^{l}\left(\xi_i + \xi_i^{*}\right) \tag{3}$$

$$\text{s.t.} \; \begin{cases} (w \cdot x_i + b) - y_i \le \varepsilon + \xi_i, & i = 1, 2, \cdots, l \\ y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i^{*}, & i = 1, 2, \cdots, l \\ \xi_i \ge 0, \; \xi_i^{*} \ge 0, & i = 1, 2, \cdots, l \end{cases} \tag{4}$$

Solving this problem with the Lagrange multipliers $\alpha_i$ and $\alpha_i^{*}$ gives the normal vector and the regression function:

$$w = \sum_{i=1}^{l}\left(\alpha_i^{*} - \alpha_i\right) x_i \tag{5}$$

$$f(x) = \sum_{i=1}^{l}\left(\alpha_i^{*} - \alpha_i\right)(x_i \cdot x) + b \tag{6}$$

Here, $(x_i \cdot x)$ is the inner product of the vector $x_i$ and the vector $x$.
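To make the role of the slack variables concrete, the soft-margin objective (3) can be evaluated for any candidate $w$ and $b$: at the optimum each slack equals the amount by which a residual exceeds $\varepsilon$ in the corresponding direction. The sketch below only evaluates the objective and does not solve the optimization; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def svr_primal_objective(w, b, X, y, eps, C):
    """Value of 1/2 ||w||^2 + C * (1/l) * sum(xi + xi*) for a linear SVR."""
    f = X @ w + b
    # Smallest slack satisfying (w.x_i + b) - y_i <= eps + xi_i
    xi = np.maximum(0.0, f - y - eps)
    # Smallest slack satisfying y_i - (w.x_i + b) <= eps + xi_i*
    xi_star = np.maximum(0.0, y - f - eps)
    l = len(y)
    return 0.5 * (w @ w) + C * (xi + xi_star).sum() / l

# Toy data (assumption): only the third point lies outside the eps-tube.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 3.0])
value = svr_primal_objective(np.array([1.0]), 0.0, X, y, eps=0.5, C=2.0)
print(value)
```

Points inside the $\varepsilon$-tube contribute nothing to the penalty term, which is why only some training samples end up with non-zero multipliers in (5).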

Least Absolute Shrinkage and Selection Operator (LASSO), also known as L1-regularized linear regression, is a shrinkage estimator. By adding a penalty function, it shrinks some coefficients and sets others exactly to zero, which yields a more compact model. It therefore retains the advantage of subset selection and is a biased estimation method suited to data with multicollinearity. The objective function is:

$$J(w) = \min_{w}\left\{\frac{1}{2N}\left\|X^{T}w - y\right\|_2^2 + \alpha\|w\|_1\right\} \tag{7}$$

Here, $y$ is the proportion of influenza-like cases, $X$ is the independent variable that affects influenza cases, $N$ is the number of data groups, $\alpha = 0.001$, and $w$ is the regression coefficient of the influenza model.
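One common way to minimize an objective of the form (7) is cyclic coordinate descent with soft-thresholding; the source does not state which solver was used, so the following is only a sketch of that standard technique. Samples are stored as rows, so `X @ w` corresponds to $X^{T}w$ in the text. The data are synthetic, and a larger penalty ($\alpha = 0.1$ rather than the paper’s 0.001) is assumed so that the sparsity effect is visible.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimize (1/(2N)) * ||X w - y||^2 + alpha * ||w||_1 one coordinate at a time."""
    N, d = X.shape
    w = np.zeros(d)
    col_scale = (X ** 2).sum(axis=0) / N
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution added back.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / N
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w

# Synthetic stand-in (assumption): 105 samples, 16 features, 2 truly active.
rng = np.random.default_rng(0)
X = rng.normal(size=(105, 16))
true_w = np.zeros(16)
true_w[0], true_w[3] = 2.0, -1.5
y = X @ true_w + 0.1 * rng.normal(size=105)

w_hat = lasso_coordinate_descent(X, y, alpha=0.1)
print(np.round(w_hat, 3))
```

The soft-threshold step is what sets small coefficients exactly to zero, in contrast to ridge regression, which only shrinks them.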

A convolutional neural network (CNN) is a deep neural network model containing convolutional layers. It has become a hot topic in the fields of speech analysis and image recognition. Because the feature detection layers of a CNN learn from the training data, explicit feature extraction is avoided and features are learned implicitly. Furthermore, because the neuron weights on the same feature map are shared, the network can learn in parallel. Therefore, this paper selected a CNN to establish an influenza prediction model.

Several important layers of a convolutional neural network:

1) Convolution layer: Each neuron acts as a filter that computes over a local region of the data. A data window slides continuously over the input until all of it has been covered.

2) Pooling layer: The pooling layer is sandwiched between successive convolution layers to compress the amount of data and parameters and to reduce overfitting.

3) Excitation layer: The excitation layer applies an excitation (activation) function that performs a non-linear mapping of the convolutional output.

4) Fully connected layer: In the fully connected layer, every neuron in one layer is connected by a weight to every neuron in the next. The fully connected layer is usually placed at the tail of the convolutional neural network, because the amount of data at the tail is no longer as large as at the beginning.

In this paper, the CNN is divided into six layers: input layer, first convolution layer, pooling layer, second convolution layer, fully connected layer, and output layer. An excitation layer applies the excitation (activation) function ReLU after each convolution. In addition, a dropout layer was added to the fully connected layer with an inactivation ratio of 0.3, which means that 70% of the neurons were retained, in order to reduce overfitting.

Each training input is a 1*16 matrix. Before the first convolutional layer, the matrix is reshaped to 4*4 and convolved with a 2*2*32 kernel, with a horizontal stride of 1 and a vertical stride of 1; the result is 4*4*32. The pooling layer, which uses MaxPool, then produces a 2*2*32 matrix. The second convolutional layer takes this 2*2*32 input and applies a 2*2*64 kernel, again with horizontal and vertical strides of 1, to obtain 2*2*64. Finally, in the fully connected layer, the preceding result is reduced in dimension and stretched into a 512*1 matrix, and the deactivation rate is set; the learning rate is 0.01, and the stochastic gradient method is used to minimize the mean-square error (MSE) function. Passing the output to the output layer completes one training step. The CNN training was completed after 500 training steps.
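The tensor shapes through the convolution and pooling stages can be traced with simple arithmetic. The sketch below assumes zero (“same”) padding for the stride-1 convolutions, since a 2*2 kernel keeps a 4*4 input at 4*4 only with padding, and a 2*2 max-pooling window with stride 2; neither setting is stated in the text, and the helper names are hypothetical.

```python
def conv_same_shape(shape, n_filters):
    """Output shape of a stride-1 convolution with 'same' padding (assumed)."""
    height, width, _channels = shape
    return (height, width, n_filters)

def max_pool_shape(shape, size=2, stride=2):
    """Output shape of max pooling without padding (window and stride assumed)."""
    height, width, channels = shape
    return ((height - size) // stride + 1, (width - size) // stride + 1, channels)

x = (4, 4, 1)                    # the 1*16 input reshaped to 4*4, one channel
c1 = conv_same_shape(x, 32)      # first convolution, 32 kernels
p1 = max_pool_shape(c1)          # MaxPool
c2 = conv_same_shape(p1, 64)     # second convolution, 64 kernels
for name, shape in [("input", x), ("conv1", c1), ("pool", p1), ("conv2", c2)]:
    print(name, shape)
```

The number of output channels of each convolution equals its number of kernels, while pooling changes only the spatial size.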

Taking into account the seasonal characteristics of influenza, this article also establishes a time series model. A time series model describes a variable using only its own past values and random disturbance terms. Its general form is:

$$Y_t = F(Y_{t-1}, Y_{t-2}, \cdots, u_t) \tag{8}$$

Two commonly used types of time series model are the ARMA (Autoregressive Moving Average) model and the ARIMA (Autoregressive Integrated Moving Average) model. The ARMA model is suitable for stationary time series data, and the ARIMA model is suitable for non-stationary time series data.
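An ARMA($p$, 0) model is a pure autoregression, $Y_t = c + \sum_{k=1}^{p}\varphi_k Y_{t-k} + u_t$, which is a linear special case of (8). As a hedged illustration, its coefficients can be estimated by ordinary least squares on lagged values; this is a simplification chosen for self-containment, not necessarily the estimation method used in the study, and the function names and simulated series are assumptions.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of y_t = c + sum_k phi_k * y_{t-k} + u_t."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Column k holds y_{t-k} for t = p, ..., n-1.
    lags = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])
    design = np.column_stack([np.ones(n - p), lags])
    coef, *_ = np.linalg.lstsq(design, y[p:], rcond=None)
    return coef  # [c, phi_1, ..., phi_p]

def forecast_one_step(y, coef):
    """Predict the next value from the last p observations."""
    p = len(coef) - 1
    recent = np.asarray(y[-p:], dtype=float)[::-1]  # y_{t-1}, ..., y_{t-p}
    return coef[0] + coef[1:] @ recent

# Simulated stationary AR(3) series (assumption, not influenza data).
rng = np.random.default_rng(1)
phi = np.array([0.5, -0.2, 0.1])
y = np.zeros(2000)
for t in range(3, 2000):
    y[t] = phi @ y[t - 3:t][::-1] + rng.normal()

coef = fit_ar(y, 3)
print(np.round(coef, 3), forecast_one_step(y, coef))
```

In practice a dedicated time series library would also report information criteria such as AIC for choosing $p$ and $q$.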

A total of 47 indicators were crawled in this study from the 16^{th} week of 2016 (started on April 25, 2016) to the 16^{th} week of 2018 (April 23, 2018). First, PCA dimensionality reduction left 16 principal components, which were then used as the inputs to the SVR, LASSO and CNN models.

In this paper, 10 of the 105 groups of data were randomly selected as the test set. SVR, LASSO and CNN were all tested on the same 10 groups, and the remaining 95 groups were used for training.
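A random hold-out of this kind can be sketched as follows. Fixing the seed is what lets several models share exactly the same 10 test groups; a chronological variant is included for comparison, since time series models need contiguous training data. The function names and the seed value are illustrative assumptions.

```python
import random

def random_split(n_samples, n_test, seed=0):
    """Pick n_test indices at random; the rest form the training set."""
    rng = random.Random(seed)
    test_idx = sorted(rng.sample(range(n_samples), n_test))
    test_set = set(test_idx)
    train_idx = [i for i in range(n_samples) if i not in test_set]
    return train_idx, test_idx

def chronological_split(n_samples, n_test):
    """Use the last n_test samples for testing and everything before for training."""
    cut = n_samples - n_test
    return list(range(cut)), list(range(cut, n_samples))

train_idx, test_idx = random_split(105, 10)
ts_train_idx, ts_test_idx = chronological_split(105, 10)
print(len(train_idx), len(test_idx), ts_test_idx[0], ts_test_idx[-1])
```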

The fitting results of the SVR, LASSO and CNN models are shown in

In the SVR model, a polynomial kernel was used, with C = 9.1896 and gamma = 0.0474; the training RMSE (Root Mean Square Error) = 0.1027 and the test RMSE = 6.4906.

The LASSO model uses the penalty function L1, α = 0.001, the training RMSE = 3.9954, and the test RMSE = 2.2268.

The learning rate of the CNN model is 0.01. In order to prevent over-fitting, a Dropout layer is added, which randomly deactivates some neurons (retention ratio 0.7). The training RMSE = 1.8670 and the test RMSE = 9.6885.

Due to the seasonal features of influenza, a time series model was also considered in this paper. Because a time series model requires consistent and complete time series data, the first 95 groups were used as training data and the last 10 groups as test data. The unit root test gives ADF = −3.6991 with p = 0.0041, indicating that the series is stationary and can be modeled with an ARMA model. The order was determined by the AIC criterion: the AIC is minimized at p = 3 and q = 0, so the ARMA(3,0) model was selected. The result of ARMA(3,0) fitting is shown in

From the training and predictive results of the SVR, LASSO, CNN and ARMA models, it is feasible to predict the proportion of influenza-like illnesses through the Web search data. Each model shows a certain predictive result, as shown in

of influenza. The absolute error of ARMA(3,0) is smaller and the error range is (0, 2.5).

From the training RMSE of the model (in

The use of web search data to predict flu epidemics has been a popular research topic in developed countries in recent years. In this study, Baidu Index was selected as the web search data source, and feature extraction of influenza keywords was performed with the PCA algorithm. Four influenza prediction models were established and compared to explore how web search data can assist influenza surveillance. The conclusions are as follows:

1) It is feasible to predict the proportion of influenza-like cases by web search data.

2) Machine learning shows a certain predictive effect in the prediction of influenza based on web search data, and it has certain reference value in the future of influenza prediction.

Date | Real value (RV) | SVR-PV | LASSO-PV | CNN-PV | Date (ARMA) | ARMA-RV | ARMA-PV |
---|---|---|---|---|---|---|---|
2016/22 | 3.3 | 2.190572 | 3.82588 | 5.128867 | 2018/7 | 33 | 35.43721 |
2016/32 | 2.7 | 2.759205 | 3.552288 | 4.206986 | 2018/8 | 31.4 | 30.83961 |
2016/42 | 8.3 | 8.047085 | 7.270786 | 11.35356 | 2018/9 | 24.9 | 25.3212 |
2016/52 | 23.3 | 16.69593 | 20.84384 | 41.21413 | 2018/10 | 19.8 | 20.00432 |
2017/10 | 16.7 | 22.42617 | 19.99406 | 6.872914 | 2018/11 | 15.6 | 14.70278 |
2017/20 | 6.1 | 6.652845 | 4.370603 | 3.13537 | 2018/12 | 11.8 | 9.889074 |
2017/30 | 18.1 | 30.91959 | 18.31482 | 26.71198 | 2018/13 | 8.1 | 6.629661 |
2017/40 | 6.8 | 1.038435 | 6.23335 | 4.948258 | 2018/14 | 6.6 | 5.14302 |
2017/50 | 40.2 | 44.69814 | 43.72965 | 44.19198 | 2018/15 | 5 | 3.50494 |
2018/8 | 24.9 | 13.69753 | 21.04979 | 4.802393 | 2018/16 | 4 | 2.24089 |
Training RMSE | | 0.1027 | 3.9954 | 1.8670 | | | 1.7123 |
Test RMSE | | 6.4906 | 2.2268 | 9.6885 | | | 1.4333 |

3) The ARMA(3,0) model gives better predictions and generalizes better. This also suggests that seasonal characteristics should be taken into account when predicting the proportion of influenza-like cases.
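The test RMSE values can be recomputed directly from the real and predicted values listed in the results table, which makes the comparison between models easy to check. The sketch below copies the ten test rows from that table; the helper name `rmse` and the variable names are illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Test-set values copied from the results table (ILI%).
real = [3.3, 2.7, 8.3, 23.3, 16.7, 6.1, 18.1, 6.8, 40.2, 24.9]
svr = [2.190572, 2.759205, 8.047085, 16.69593, 22.42617,
       6.652845, 30.91959, 1.038435, 44.69814, 13.69753]
lasso = [3.82588, 3.552288, 7.270786, 20.84384, 19.99406,
         4.370603, 18.31482, 6.23335, 43.72965, 21.04979]
cnn = [5.128867, 4.206986, 11.35356, 41.21413, 6.872914,
       3.13537, 26.71198, 4.948258, 44.19198, 4.802393]
arma_real = [33, 31.4, 24.9, 19.8, 15.6, 11.8, 8.1, 6.6, 5, 4]
arma_pred = [35.43721, 30.83961, 25.3212, 20.00432, 14.70278,
             9.889074, 6.629661, 5.14302, 3.50494, 2.24089]

results = {
    "SVR": rmse(real, svr),
    "LASSO": rmse(real, lasso),
    "CNN": rmse(real, cnn),
    "ARMA": rmse(arma_real, arma_pred),
}
for name, value in results.items():
    print(name, round(value, 4))
```

Note that the ARMA model is evaluated on a different (chronological) test period from the other three models, so its RMSE is not measured on the same weeks.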

The outbreak and spread of influenza are affected by a variety of factors, including meteorological factors, virus activity intensity, and air pollution, as well as the level of antibodies in the population and behavioral patterns. In this study, we only studied flu prediction models based on web search data and historical influenza data. Although the use of web search data for influenza surveillance has improved real-time performance, accuracy is still lacking, especially at the peak of the flu season.

Future study directions for this topic include:

1) In terms of data sources, on the one hand, we can consider integrating the original search data of multiple search engines to reflect the search behavior of Internet users as fully as possible. We can also collect interaction and browsing behaviors from social networks, professional medical information portals, etc. to obtain more information on public concern about influenza. On the other hand, we can collect other metrics that reflect the outbreak and spread of flu as part of the predictive model input.

2) With regard to the scope of research, the study can be narrowed down to the level of cities and counties. In a regional influenza prediction study, the impact of regional differences can be filtered out, and meteorological factors and other measurement indicators can be introduced more easily.

3) In terms of model optimization, more forecasting models can be combined through weighted combinatorial optimization, and other better combinatorial optimization methods can also be used. The next optimization goal is to improve the early warning capability and to predict influenza activity a period of time in advance.

4) For predictive visualization, some data visualization software can be combined to display the predictive analysis results by using charts and other methods. Displaying the real-time changes of various indicators can help users quickly obtain relevant information and respond quickly.

This project was supported by the Fundamental Research funds for Central Universities, China University of Geosciences (Wuhan) (1810491T09) and Laboratory Research Funds, China University of Geosciences (Wuhan) (SKJ2018240).

Bu, Y., Bai, J.H., Chen, Z., Guo, M.J. and Yang, F. (2018) The Study on China’s Flu Prediction Model Based on Web Search Data. Journal of Data Analysis and Information Processing, 6, 79-92. https://doi.org/10.4236/jdaip.2018.63006