The Study on China's Flu Prediction Model Based on Web Search Data

Influenza is an infectious disease that spreads quickly and widely, and its outbreaks cause huge losses to society. In this paper, flu keywords were grouped into four major categories: “prevention phase”, “symptom phase”, “treatment phase”, and “commonly-used phrase”. A Python web crawler was used to obtain influenza data from the National Influenza Center’s weekly influenza surveillance reports and from Baidu Index. Support vector regression (SVR), least absolute shrinkage and selection operator (LASSO), and convolutional neural network (CNN) prediction models were established through machine learning and, to take the seasonal characteristics of influenza into account, a time series model (ARMA) was also established. The results show that it is feasible to predict influenza based on web search data. The machine learning models show a certain predictive ability on web search data and may serve as a reference for future influenza prediction. The ARMA(3,0) model gives better predictions and generalizes better. Finally, the limitations of this study and future research directions are given.


Introduction
Influenza, commonly called the flu, is an acute respiratory infectious disease caused by the influenza virus that still cannot be completely controlled [1]. According to the World Health Organization (WHO), seasonal influenza causes about 3 to 5 million cases of severe illness each year, resulting in approximately 250,000 to 500,000 deaths [2]. From the Spanish flu (H1N1) in 1918, the Asian flu (H2N2) in 1957, the Hong Kong flu (H3N2) in 1968, and the Russian flu (H1N1) in 1977 to the H1N1 outbreak of April 2009, every flu outbreak has caused huge losses to human society [3] [4] [5]. For all countries in the world, the prevention and control of influenza has always been a serious problem.
To control the spread of the influenza virus and reduce the losses it causes, it is necessary to use reasonable methods to predict the trend of influenza activity. However, the influenza virus is highly infectious, propagates rapidly, spreads widely, and has variable antigens [6], which makes prevention and monitoring very difficult. As a result, researchers in various countries are focusing on improving the timeliness of flu epidemic forecasts, and the main means of doing so is to use more timely and accurate data sources. To obtain influenza case data, most national influenza surveillance agencies survey suspected influenza cases in hospitals. However, this method requires collecting case data nationwide, which involves complex data processing and a heavy workload, and the resulting surveillance data lag behind the actual development of influenza. To obtain more data on flu cases, flu monitoring agencies have therefore used data such as telephone consultations about influenza, sales of non-prescription flu drugs, and page views of relevant websites to predict the incidence of influenza [7]. To a certain extent, this improves the accuracy and timeliness of short-term forecasting.
Nowadays, search engines are increasingly becoming the main way for people to obtain information, and web search data has become an ideal data source for influenza surveillance. In the United States, about 90 million adults search the Internet each year for health information such as diseases and medicines [8].
Compared with other data sources, web search data is more targeted and more immediate. Search keywords directly reflect the intent of the searcher, and the search data can be collected promptly, keeping pace with the development of the flu epidemic. In addition, search data covers a wider population: it shows how much attention all Internet users in a given area pay to the flu, so the data is closer to the true population. Using web search data to monitor epidemic diseases is a faster, more accurate, and lower-cost approach. It can supplement traditional investigation methods by providing early warning of disease, and it is important for the prevention and control of infectious diseases in China and beyond.

Literature Review
Because the influenza virus mutates rapidly, it is very difficult to prevent and monitor. Therefore, the most important task in influenza epidemic surveillance research is to improve the timeliness of predictions, and the main way to do so is to use more immediate and accurate data sources [7]. In the era of big data, web search data has become an ideal data source for influenza surveillance. Flu monitoring applications based on web search data mainly include the following aspects.

Using Search Engines for Influenza Surveillance
In 2008, Polgreen et al. [9] were the first to use web search data for this purpose. They used the search volume of influenza-related search terms on the Yahoo! search engine in the United States to verify the correlation between search volume and influenza mortality. Ginsberg et al. [10] published flu trend monitoring research based on Google search data in Nature, which laid the theoretical foundation for Google Flu Trends (GFT), launched by Google later [11]. GFT is an online flu trend warning system released by Google and based on its own search data. It provides flu trend predictions for 28 countries around the world. After GFT was released, it was applied to influenza surveillance in different countries. Besides Google's search engine, search data can also be obtained from other sources, such as China's Baidu Index and the Weibo Micro Index.
Q. Yuan et al. [12] studied the relationship between search terms and flu trends through Baidu Index and fitted a multiple regression monitoring model. Lu Li et al. [13] compared the roles of Baidu Index and the Sina Weibo Micro Index in monitoring influenza in China and found that Baidu Index was more strongly correlated with the flu epidemic. Search engine-based influenza surveillance estimates influenza incidence from the search frequency of keywords alone. This can easily make the model over-sensitive, causing it to overestimate the epidemic, and leaves it exposed to seasonal and geographical effects. The approach is therefore still insufficient and needs improvement.

Using Social Networks for Influenza Surveillance
Predicting events through social networks is a hot topic in big data research. Outside China, many researchers use the social platform Twitter for data analysis, including flu trend monitoring. Nigel Collier et al. [14] used the SVM algorithm to analyze the epidemic situation from the behavior information in users' Twitter posts, compared the results with data from the CDC (United States Centers for Disease Control and Prevention), and found a very strong correlation between the two. Lampos, V. et al. [15] tracked the Twitter messages posted by users in the UK's most popular 49 regions. Using a weighted filtering method on flu keywords, they found that the estimated flu activity showed a strong linear correlation with the HPA's influenza-like illness (ILI) data. There are many similar examples of flu prediction based on social platforms. Chen et al. [16] used Facebook, micro-blog, and Instagram as data sources, filtered the text for flu symptom keywords to identify suspected influenza users, and associated GPS information from Instagram to monitor the flu geographically. Weibo, another social medium, has been popular among Chinese citizens in recent years. At present, many researchers are doing data mining based on Weibo, such as analysis of social relations, public opinion analysis, and outbreak analysis [17]. However, research on seasonal influenza surveillance based on Weibo data has not yet made significant progress.

Using Existing Disease Surveillance Platforms for Influenza Surveillance
At present, the most representative influenza surveillance platform outside China is Flu Near You, a flu monitoring and visualization system that displays its results intuitively on maps. It is also participatory: members of the general public can submit information about their flu symptoms every week. These data help researchers better understand the spread of the flu, while ordinary citizens can follow the spread of flu in their surrounding communities and nationwide [18]. In China, Baidu and the Chinese Center for Disease Control and Prevention launched a disease prediction platform. The Baidu Disease Forecasting Platform provides an online map tool that shows how active certain diseases are in each region, covering disease changes over the past 30 days and predictions for the next seven days.
At present, there is no standardized flu prediction model in China, and few studies use web search data to build flu prediction models. This study establishes several prediction models by using Python to crawl relevant flu data and applying machine learning. Considering the seasonality of influenza, a time series model has also been established. The results have a certain reference value for the monitoring and prevention of influenza.

Influenza-Like Illness
A major indicator of influenza surveillance at home and abroad is the proportion of influenza-like illness (ILI). An ILI case is a case of acute respiratory infection with fever (body temperature ≥ 38°C) accompanied by cough or sore throat, and the ILI proportion is the share of such cases among all outpatient visits at sentinel hospitals [19].

Feature Selection
When building data mining and machine learning models, we usually start with as many independent variables as possible in order to minimize the model bias caused by missing important variables. During actual modeling, however, it is usually necessary to find the subset of independent variables that can explain the response variable, which improves both the interpretability and the predictive ability of the model. This process is called feature selection.
Principal component analysis (PCA) is an unsupervised dimensionality reduction method. It requires only an eigenvalue decomposition to compress and denoise the data. Therefore, this paper uses the PCA algorithm to extract features from the influenza keywords.
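The PCA step can be illustrated with a minimal sketch, assuming NumPy is available. The function name `pca_reduce` and the random demo matrix are hypothetical and stand in for the crawled keyword data, which is not reproduced here; the shape (105 weekly samples, 47 indicators, 16 retained components) follows the figures reported in this paper. Samples are stored as rows here, so the covariance is formed as X^T X rather than XX^T.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X (samples x features) onto the top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                      # centralize all samples
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)   # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)               # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]     # indices of the largest eigenvalues
    W = eigvecs[:, order]                                # unit-norm eigenvectors as columns
    return X_centered @ W, eigvals[order]


# Synthetic stand-in for the real data: 105 weeks x 47 keyword indicators.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(105, 47))
Z, top_eigvals = pca_reduce(X_demo, 16)
print(Z.shape)  # (105, 16)
```

The reduced matrix `Z` plays the role of the 16 principal components that are fed to the regression models; its columns are uncorrelated by construction.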

Support Vector Regression
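Suppose the training samples are $(x_i, y_i)$, $i = 1, 2, \ldots, N$, and the linear regression function is $f(x) = w \cdot x + b$. Fitting every sample within the tolerance $\varepsilon$ imposes two constraints on each sample: $y_i - w \cdot x_i - b \le \varepsilon$ and $w \cdot x_i + b - y_i \le \varepsilon$. When these two constraints cannot all be satisfied, we introduce the slack variables $\xi_i$, $\xi_i^*$ and the penalty parameter $C$ to "soften" them, in the same way as in linearly inseparable support vector classification. The original optimization problem becomes:
$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$$
$$\text{s.t.} \quad y_i - w \cdot x_i - b \le \varepsilon + \xi_i, \quad w \cdot x_i + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \quad i = 1, 2, \ldots, N.$$
Solving this problem through its dual, with Lagrange multipliers $\alpha_i$ and $\alpha_i^*$ for the first and second constraint respectively, gives the normal vector and the regression function:
$$w = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, x_i, \qquad f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, (x_i \cdot x) + b.$$
Here, $(x_i \cdot x)$ is the inner product of the vector $x_i$ and the vector $x$.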
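To make the role of the slack variables concrete, the following minimal sketch evaluates the ε-insensitive loss that the soft constraints encode: at the optimum, $\xi_i + \xi_i^*$ equals $\max(0, |y_i - f(x_i)| - \varepsilon)$. The function names and the toy numbers are hypothetical, chosen only for illustration, and the sketch evaluates the primal objective rather than solving it.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


def svr_primal_objective(w, losses, C):
    """0.5 * ||w||^2 + C * sum of slack, with slack given by the tube losses."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(losses)


# Toy values (assumptions for illustration only).
y_true = [3.0, 4.0, 5.0]
y_pred = [3.2, 4.0, 3.5]
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
objective = svr_primal_objective([1.0, 2.0], losses, C=9.0)
print(losses)     # [0.0, 0.0, 1.0]
print(objective)  # 11.5
```

Only the third sample falls outside the tube, so it alone contributes to the penalty term; a larger C makes such violations more expensive relative to the flatness term.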
In the LASSO model, y is the proportion of influenza-like cases, X is the independent variable that affects influenza cases, N is the number of data groups, α = 0.001 is the penalty weight, and w is the regression coefficient of the influenza model.
Several important layers of convolutional neural networks are:
1) Convolution layer: Each neuron is seen as a filter that performs a computation on local data. A data window is taken and slid continuously until all samples are covered.
2) Pooling layer: The pooling layer is sandwiched between successive convolution layers to compress the amount of data and parameters and to reduce overfitting.
3) Excitation layer: The excitation layer has an excitation (activation) function that performs a non-linear mapping of the convolutional output.
4) Fully connected layer: In the fully connected layer, all neurons between the two layers are connected by weights. The fully connected layer is usually placed at the tail of the convolutional neural network, because the amount of information at the tail is no longer as large as at the start.
In this paper, the CNN is divided into six layers: input layer, first convolution layer, pooling layer, second convolution layer, fully connected layer, and output layer. The excitation function ReLU is applied after each convolution. In addition, a dropout layer was added to the fully connected layer with a dropout ratio of 0.3, which means that 70% of the neurons were retained, to reduce overfitting. Before entering the fully connected layer, the previous result is reduced in dimension by stretching it into a 512*1 matrix, and the dropout rate is set. The learning rate is 0.01, and the stochastic gradient method is used to find the best value of the mean-square error (MSE) function. Passing the output to the output layer completes one training step. The CNN training was completed after 500 training steps.
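The dropout setting can be illustrated with a small sketch. It assumes the common "inverted dropout" variant, in which retained activations are rescaled by 1/(1 - rate) during training; whether the original implementation rescales in this way is not stated in the paper, and the function name is hypothetical.

```python
import random


def inverted_dropout(activations, drop_rate, rng):
    """Zero each activation with probability drop_rate; rescale the survivors."""
    keep_prob = 1.0 - drop_rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)                      # fixed seed so the demo is repeatable
out = inverted_dropout([1.0] * 10000, drop_rate=0.3, rng=rng)
kept_fraction = sum(1 for v in out if v != 0.0) / len(out)
print(round(kept_fraction, 2))
```

With a dropout ratio of 0.3, roughly 70% of the activations survive each training step, and the rescaling keeps the expected activation unchanged so that no adjustment is needed at prediction time.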

Time Series Model
Taking into account the seasonal characteristics of influenza, this article also establishes a time series model. A time series model describes a variable using only its own past values and a random disturbance term. Its general form is:
$$x_t = f(x_{t-1}, x_{t-2}, \ldots, u_t),$$
where $u_t$ is the random disturbance term. At present, there are two main types of time series models. One is the ARMA (Autoregressive Moving Average) model; the other is the ARIMA (Autoregressive Integrated Moving Average) model. The ARMA model is suitable for stationary time series data, and the ARIMA model is suitable for non-stationary time series data.
In this paper, 10 of the 105 groups of data were randomly selected as the test set. SVR, LASSO and CNN were all tested on the same 10 groups and trained on the remaining 95 groups.
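An ARMA(3,0) model has no moving-average part, so it reduces to an autoregression of order 3: $x_t = c + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \phi_3 x_{t-3} + u_t$. The sketch below fits such a model by ordinary least squares with NumPy. This is an assumption made for illustration; the paper does not say which estimation routine was used, and the function names and the noise-free demo series are hypothetical.

```python
import numpy as np


def fit_ar(series, p):
    """Least-squares fit of x_t = c + sum_k phi_k * x_{t-k} + u_t."""
    x = np.asarray(series, dtype=float)
    # Column k holds the lag-k values x_{t-k} for t = p, ..., len(x) - 1.
    lags = np.column_stack([x[p - k: len(x) - k] for k in range(1, p + 1)])
    design = np.column_stack([np.ones(len(x) - p), lags])
    coef, *_ = np.linalg.lstsq(design, x[p:], rcond=None)
    return coef[0], coef[1:]


def forecast_next(series, c, phi):
    """One-step-ahead forecast from the last len(phi) observations."""
    p = len(phi)
    recent = np.asarray(series[-p:], dtype=float)[::-1]   # x_{t-1}, ..., x_{t-p}
    return float(c + phi @ recent)


# Noise-free demo series generated from known coefficients (illustrative values).
true_c, true_phi = 0.5, [0.5, -0.3, 0.2]
demo = [1.0, 2.0, 0.5]
for _ in range(37):
    demo.append(true_c + true_phi[0] * demo[-1] + true_phi[1] * demo[-2] + true_phi[2] * demo[-3])

c_hat, phi_hat = fit_ar(demo, 3)
next_value = forecast_next(demo, c_hat, phi_hat)
```

Because the demo series contains no noise, the fit recovers the generating coefficients; on real ILI data the residual term $u_t$ would be non-zero and a stationarity check would come first.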

Results Analysis
The fitting results of the SVR, LASSO and CNN models are shown in Figure 1.
The training results (TR) of the three models fit the flu trend.
In the SVR model, the polynomial kernel was used as the kernel function, with C = 9.

Conclusions
The use of web search data to predict flu epidemics has been a popular research topic in developed countries in recent years. In this study, Baidu Index was selected as the web search data source, and features were extracted from the influenza keywords with the PCA algorithm. Four influenza prediction models were established and compared to explore how web search data can assist influenza surveillance. The conclusions are as follows:
1) It is feasible to predict the proportion of influenza-like cases from web search data.
2) Machine learning shows a certain predictive ability in web search-based influenza prediction and has reference value for future influenza prediction.
3) The ARMA(3,0) model predicts better and generalizes better. This also suggests that seasonal characteristics should be taken into account when predicting the proportion of influenza-like cases.

Prospects
Influenza outbreaks and epidemics are affected by a variety of factors, including meteorological conditions, virus activity intensity, and air pollution, as well as the combined effects of factors such as antibody levels in the population and behavioral patterns. In this study, we built flu prediction models using only web search data and historical influenza data. Although using web search data for influenza surveillance improves real-time performance, accuracy is still lacking, especially during the peak of the flu season.
Future study directions for this topic include:
1) Data sources. On the one hand, the raw search data of multiple search engines can be integrated to reflect the search behavior of Internet users as fully as possible, and interaction and browsing behavior on social networks, professional medical information portals, and similar sites can provide more information about public concern over influenza. On the other hand, other indicators that reflect flu outbreaks and epidemics can be collected as additional inputs to the predictive model.
2) Scope of research. The scope of the study can be narrowed to the level of cities and counties. In a regional influenza prediction study, the impact of regional differences can be filtered out, and meteorological factors and other indicators can be introduced more easily.
3) Model optimization. More forecasting models can be combined through weighted combinatorial optimization, and other, better combination methods can also be used. The next optimization goal is to improve the early warning capability so that predictions can be made some period in advance.
4) Predictive visualization. Data visualization software can be used to display the predictive analysis results with charts and similar methods. Displaying the real-time changes of various indicators helps users obtain relevant information and respond quickly.

Y. Bu et al., Journal of Data Analysis and Information Processing

PCA algorithm. Input: sample set $D = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$, to be reduced from $n$ dimension to $n'$ dimension. Output: sample set $D'$ after dimension reduction.
1) Centralize all samples: $x^{(i)} \leftarrow x^{(i)} - \frac{1}{m} \sum_{j=1}^{m} x^{(j)}$.
2) Calculate the sample's covariance matrix $XX^T$.
3) Perform eigenvalue decomposition on the matrix $XX^T$, and take out the eigenvectors $(w_1, w_2, \ldots, w_{n'})$ corresponding to the largest $n'$ eigenvalues. After they are normalized, they form the eigenvector matrix $W$.
4) Transform each sample $x^{(i)}$ in the sample set into a new sample $z^{(i)} = W^T x^{(i)}$.

Table 1. Initial keyword vocabulary.
Phase | Category | Keywords
Prevention phase | […] | […] vaccine side effects, Influenza vaccine necessary to fight it, Bird flu vaccine
Prevention phase | Flu prevention | How to prevent flu, Prevent influenza, Influenza outbreak, Prevent influenza A
Symptom phase | Cold | Gastro-intestinal flu, Influenza, Cold symptoms, Viral influenza
Symptom phase | Respiratory infection | Upper respiratory tract infection, Nasal congestion, Cough, Bronchitis, Sore throat, Runny nose, Rhinitis, Pharyngitis
Symptom phase | Fever | Fever, High fever, Headache, Dizziness, Fatigue, Fever, Chills
Treatment phase | Flu treatment | A stream of treatment, What medicine to eat flu
Treatment phase | Cold medicine | Baijiahei, Contac, Tylenol, Gankang, Amoxicillin, Cough, Antipyretics, Cephalosporins, Oseltamivir
Commonly used phrases | H1N1 | H1N1 flu, H7N9, H7N9 flu
Commonly used phrases | Influenza-A | Influenza A, Type A H1N1 flu, Type A flu symptoms, What is Influenza A
Commonly used phrases | Influenza | Flu virus, Swine flu, Bird flu

The simplest support vector regression (SVR) uses a linear function $f(x) = w \cdot x + b$ to fit the sample points, where $w$ and $b$ are the normal vector and the offset of the linear regression function, respectively. Assume that all training data can be fitted by a linear function with errors no larger than $\varepsilon$. This leads to the following optimization problem:
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i - w \cdot x_i - b \le \varepsilon, \quad w \cdot x_i + b - y_i \le \varepsilon, \quad i = 1, 2, \ldots, N.$$

Convolutional Neural Networks
A convolutional neural network (CNN) is a deep neural network model containing convolutional layers. It has become a hot topic in the fields of speech analysis and image recognition. Because a CNN's feature detection layers learn from the training data, explicit feature extraction is avoided and features are learned implicitly. Furthermore, because the neuron weights on the same feature map are shared, the network can learn in parallel. Therefore, this paper selected a CNN to establish an influenza prediction model.

Each training input is a matrix of size 1*16. Before the first convolutional layer, the matrix is reshaped to 4*4 and convolved with a 2*2*32 kernel, with a horizontal step of 1 and a vertical step of 1; the result is 4*4*32. The pooling layer, which uses the MaxPool function, then yields a 2*2*32 matrix. The second convolution layer takes the 2*2*32 input, uses a 2*2*64 convolution kernel with a horizontal step of 1 and a vertical step of 1, and outputs 2*2*64.
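The layer sizes can be checked with a short shape-tracing sketch. It assumes "same" zero padding in the convolutions (the spatial size is preserved with a 2*2 kernel and stride 1) and a 2*2 max-pooling window with stride 2 (the spatial size halves from 4 to 2). Neither setting is stated explicitly in the text, and the helper names are hypothetical. A layer with 32 filters produces 32 output channels, and one with 64 filters produces 64.

```python
import math


def same_conv_shape(height, width, stride, filters):
    """Output shape of a convolution with 'same' padding: spatial size is ceil(in / stride)."""
    return (math.ceil(height / stride), math.ceil(width / stride), filters)


def max_pool_shape(height, width, channels, window, stride):
    """Output shape of max pooling without padding; the channel count is unchanged."""
    return ((height - window) // stride + 1, (width - window) // stride + 1, channels)


reshaped = (4, 4, 1)                                   # 1*16 input reshaped to 4*4, one channel
conv1 = same_conv_shape(reshaped[0], reshaped[1], stride=1, filters=32)
pool1 = max_pool_shape(*conv1, window=2, stride=2)
conv2 = same_conv_shape(pool1[0], pool1[1], stride=1, filters=64)
print(conv1, pool1, conv2)  # (4, 4, 32) (2, 2, 32) (2, 2, 64)
```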

A total of 47 indicators were crawled in this study, from the 16th week of 2016 (starting April 25, 2016) to the 16th week of 2018 (April 23, 2018). After PCA dimensionality reduction, 16 principal components remained, and these 16 components were used as the inputs of the SVR, LASSO and CNN models respectively.
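For the LASSO model among these, a minimal coordinate-descent sketch is given below. It assumes the objective $\frac{1}{2N}\|y - Xw\|^2 + \alpha \|w\|_1$ without an intercept, which is the normalization used by common libraries such as scikit-learn; the paper's exact objective is not recoverable from the text, so this form is an assumption, as are the function names. The demo uses a larger α than the reported α = 0.001 so that the shrinkage is visible.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1 / (2N)) * ||y - Xw||^2 + alpha * ||w||_1 (columns of X must be non-zero)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    w = np.zeros(d)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(d):
            partial_residual = y - X @ w + X[:, j] * w[j]   # residual ignoring feature j
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w


# Orthogonal toy design, so each coefficient is a soft-thresholded projection.
X_toy = 2.0 * np.eye(4)
y_toy = np.array([2.0, 0.2, -1.0, 0.0])
w_toy = lasso_coordinate_descent(X_toy, y_toy, alpha=0.25)
print(w_toy)
```

In the toy problem the small second coefficient is driven exactly to zero while the larger ones are only shrunk, which is the variable-selection behavior that motivates LASSO.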

Figure 2. The training RMSE = 1.7123 and the test RMSE = 1.4333. From the training and prediction results of the SVR, LASSO, CNN and ARMA models, it is feasible to predict the proportion of influenza-like illness from web search data. Each model shows a certain predictive ability, as shown in Figure 3 and Figure 4. Figure 5 shows the accumulated absolute error of the SVR, LASSO and CNN models (SVR-AE, LASSO-AE, CNN-AE). The LASSO model has the smallest absolute error. At the same time points (2016/52, 2017/10, 2017/30, 2018/8), almost all three models exhibited relatively large absolute errors, which indicates that the three models predict poorly for certain periods.
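The two error measures used in this comparison can be sketched as follows. The sketch assumes that the accumulated absolute error is the running sum of |actual - predicted| over the test points, which is one plausible reading of the figure; the function names and the toy values are hypothetical and are not the paper's data.

```python
import math
from itertools import accumulate


def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def accumulated_absolute_error(actual, predicted):
    """Running sum of absolute errors, one value per time point."""
    return list(accumulate(abs(a - p) for a, p in zip(actual, predicted)))


# Toy ILI proportions (illustrative only).
actual = [3.0, 4.0, 5.0, 6.0]
predicted = [3.0, 5.0, 5.0, 4.0]
print(rmse(actual, predicted))
print(accumulated_absolute_error(actual, predicted))  # [0.0, 1.0, 1.0, 3.0]
```

A steep jump in the accumulated curve marks a time point where the model missed badly, which is how periods of poor predictability show up in such a plot.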

Figure 3. Comparison of the prediction results of SVR, LASSO and CNN models.

Figure 5. Accumulated absolute error of the prediction results of SVR, LASSO and CNN models.
In order to obtain more timely data, this paper uses a crawler program written in Python, with Baidu's web page data as the crawling target. We first selected basic search terms such as "flu vaccine", "cold", "flu treatment", "flu medicine" and "H7N9", and then collected the keywords used in other related studies together with the keyword recommendations of search engines. The number of words for each keyword type was expanded, and all valid keywords were combined into an initial vocabulary. As shown in Table 1, the data were crawled according to this initial vocabulary.
Y. Bu et al. DOI: 10.4236/jdaip.2018.63006

Table 2. Training and prediction results based on SVR, LASSO, CNN and ARMA models.