Hot Events Detection of Stock Market Based on Time Series Data of Stock and Text Data of Network Public Opinion

With the deep integration of the Internet and the real world, Internet information provides real-time, effective data for financial investors, helps them understand market dynamics, and enables them to quickly identify financial events that may lead to stock market volatility. However, most studies of event detection in the financial field focus on micro-blogs, news and other online text information; few scholars have studied the characteristics of financial time series data. Because an event in the financial field often affects both the online public opinion space and the real transaction space, this paper proposes a multi-source heterogeneous information detection method that uses stock transaction time series data and online public opinion text data to detect hot events in the stock market. The method uses an outlier detection algorithm based on multi-member fusion to extract the time of hot events in the stock market. It then calculates the keyword weights of network public opinion information with the feature item weight formula proposed in this paper to obtain the core content of those hot events. Finally, accurate detection of stock market hot events is achieved.


Introduction
In the securities industry, once market fluctuations occur, investors first hope to find the answer in Internet information. However, the geometric expansion of Internet information makes it difficult to extract effective information. If investors are unable to obtain timely and accurate information about the events that lead to financial market volatility, the resulting losses can be incalculable. Therefore, quickly finding valuable topics and events in a large volume of Internet data is particularly important.
Over time, numerous methods for event discovery have been put forward [1]-[11]. However, most of these methods perform event discovery on text data [1]-[11] or on time series data [12]-[19] separately. To the best of our knowledge, few scholars have studied the characteristics of financial time series data and text data together [20] [21] [22]. As a record of real behavior in financial markets, time series data such as stock trading data and market data are often affected by events and can better reflect the changes before and after them. Therefore, this paper studies the discovery of financial events by combining network text information and financial time series data, so as to help investors quickly obtain hot events and correctly grasp market dynamics.

Definition and Quantification of Post's Activity
In web forums, netizens can express their concern for specific information by posting, reading and replying. This degree of attention is an important external feature of the emotional tendency of network public opinion, and in this paper we call it post's activity. To quantify users' attention to topic information intuitively, we calculate it from the amount of readings and the amount of comments of the posts. The amount of readings reflects the degree of dissemination of the information contained in a post and represents the instinctive concern of users. The amount of comments reflects the attention paid to the information contained in a post; it shows the user's emphasis on topic interaction, and its emotional intensity is stronger. We therefore choose the amount of readings and the amount of comments as indicators of post's activity. The activity of post $p_i$ is defined as the sum of its propagation coefficient and its attention coefficient.

Definition and Quantification of User's Influence
The influence of users in the stock bar forum refers to the popularity index of the user in the stock bar. It is mainly affected by the age of the user, the amount of comments posted by the user, the amount of forwarding, and other factors. So in this paper we use user's power, user's activity and user's attention to measure user's influence in the stock bar forum.

User's Power
User's power is the potential energy that a user has under static conditions. It is mainly reflected in three factors: age, the number of fans, and the number of people the user follows.
Definition 2-2 User's power: The user's power of the i-th user $a_i$ is defined in terms of these three factors, where Pfe is the average number of people that all users follow.

User's Activity
User's activity reflects the degree of user's autonomy, which is mainly determined

User's Attention
User's attention mainly reflects how much a user attracts the attention of other users in the online forum. When a user's posts are commented on by a large number of other users, it shows that these posts are of high quality and attractive, which further indicates that the user has great influence. In addition, some users rarely comment but are used to expressing their concern about posts by reading them, which also shows that the posts attract them. Therefore, the total amount of readings also needs to be regarded as an influencing factor of user's attention.

Definition of Stock Market Hot Events
The events studied in this paper are hot events that are related to the stock market and can lead to changes in stock trading behavior. This paper defines them as hot events of the stock market. They are embodied in the following three characteristics:
1) The events corresponding to popular posts (which have been read and commented on many times and have high influence) on the web forums.
2) The events corresponding to online hot news (which is reported or reproduced by multiple news websites).
3) The events that can have a significant impact on the stock market.
The first two characteristics identify stock market hot events from the feedback of the online public opinion space, and the third identifies them from the information fed back by the real trading conditions of the stock market. These three characteristics are fully combined below, and on this basis we conduct research on the detection of hot events in the stock market.

Time Extraction of Stock Market Hot Events Based on Multi-Member Fusion
The trajectory of an event in the real world is often reflected by the trajectories of the members related to that event. Whether abnormal changes in the state of the event members can be found is therefore the key to determining whether an event has occurred. According to the definition of stock market hot events in Section 3.1, this paper takes the relationships between stock trading behavior attributes, post's influence and online news volume as the relevant members of stock market hot events, and constructs an event time series from them. The specific relationships between the attributes of stock trading behavior were obtained in previous research by our team [23] [24], which found that relationships exist within four pairs of stock attributes: Vol and Close, Pcl and Open, Close and %Tuv, and %Tuv and %Chg. By detecting the abnormal points of the event time series, we can discover the occurrence time of hot events in the stock market. In this paper, the event member set consists of six event members: the relationship between Vol and Close, the relationship between Pcl and Open, the relationship between Close and %Tuv, the relationship between %Tuv and %Chg, post's influence, and online news volume. We use the k-nearest neighbor local anomaly detection algorithm to detect anomaly points in the event time series, where the event member values at each time point can be regarded as a feature point in the data point set D. The local anomaly coefficient (LOF) of each feature point is then calculated, and the feature points are sorted by LOF value. The λ feature points with the largest LOF values are output, and they constitute the set of abnormal points of the event time series.
In this paper, these abnormal points are considered as the occurrence points of hot events in the stock market, and the time corresponding to these points is the occurrence time of the hot events. Figure 1 shows the abnormal point detection process for stock market hot event time series.
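The detection step described above can be sketched as follows. This is a minimal pure-Python illustration of the standard LOF computation, not the authors' implementation. The function names `lof_scores` and `top_anomalies`, the neighborhood size `k`, the small constant `eps`, and the toy two-dimensional feature points are all assumptions introduced here; in the paper each feature point would hold the six event member values of one day.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor of every point (brute force, O(n^2) distances)."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neighbors = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # k-distance neighborhood; ties at the k-distance are included
        neighbors.append([j for j in order if dist[i][j] <= kd])

    eps = 1e-12  # assumption: avoids division by zero for duplicate points
    lrd = []     # local reachability density of each point
    for i in range(n):
        # reachability distance of point i with respect to each neighbor j
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(1.0 / (sum(reach) / len(reach) + eps))

    # LOF = mean density of the neighbors divided by the point's own density
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

def top_anomalies(points, k, lam):
    """Indices of the lam points with the largest LOF values."""
    scores = lof_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:lam]

# Hypothetical daily feature points: five ordinary days and one abnormal day.
days = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (20.0, 0.0)]
print(top_anomalies(days, k=2, lam=1))
```

Points inside a region of uniform density receive LOF values close to 1, while an isolated point receives a much larger value, which is why ranking by LOF and keeping the first λ points isolates the abnormal days.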

Weight Calculation of Feature Items Based on Multi-Feature Fusion
In this paper, we choose the Vector Space Model (VSM) to vectorize the pre-processed online public opinion text (posts and news). The document set is expressed as an $n \times m$ weight matrix $(w_{ij})_{n \times m}$, in which each row represents a document and each column represents a feature item. Each document consists of several feature items and can be represented by a vector $d_i = \big((t_1, w_{i1}), (t_2, w_{i2}), \ldots, (t_m, w_{im})\big)$. In the expression, $d_i$ denotes the i-th document, $t_j$ denotes the j-th feature item in the document, $w_{ij}$ denotes the weight of the j-th feature item in the document $d_i$, $n$ denotes the total number of documents, and $m$ denotes the total number of feature items. The weights of feature items measure the document content and are usually calculated by the classical Term Frequency-Inverse Document Frequency (TF-IDF) method [25]. The feature item set in this paper is mainly obtained by word segmentation of post content and network news content, and each keyword in the set carries valid information such as TF, IDF, and part of speech. Therefore, we optimize the original feature weight calculation formula and calculate the weight of feature items according to the optimized formula. The calculation formula is as follows:

In this formula, the position factor of the feature item $t_j$ is 1.5 when $t_j$ is located in the title and 1 when $t_j$ is located in the body. The title is more representative than the body because it summarizes the whole article, and netizens usually browse the title first when they read an article. Therefore, the location of the feature item in the document must be taken into account when calculating its weight.
$length(t_j)$ represents the length of the feature item $t_j$, which to a certain extent reflects the amount of information contained in the feature item. In general, the longer a feature item is, the more information it contains.
AvgLen represents the average length of all feature items in the document $d_i$.
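To make the role of these factors concrete, here is a hedged sketch of a multi-feature weight. The exact combination used in the paper is not reproduced in this text, so the multiplicative form below (TF × IDF × position factor × length / AvgLen) is an assumption, as are the smoothed IDF, the choice to average lengths over distinct feature items, the omission of any part-of-speech factor, and the function name `feature_weights`. Only the values 1.5 and 1 for the title and body come from the text.

```python
import math
from collections import Counter

TITLE_FACTOR = 1.5  # from the text: feature item located in the title
BODY_FACTOR = 1.0   # from the text: feature item located in the body

def feature_weights(title_terms, body_terms, doc_freq, n_docs):
    """Weight of each feature item in one document.

    Assumed form: weight = TF * IDF * position factor * length(t) / AvgLen.
    doc_freq maps a feature item to the number of documents containing it.
    """
    counts = Counter(list(title_terms) + list(body_terms))
    if not counts:
        return {}
    total = sum(counts.values())
    # AvgLen: average length over the distinct feature items (assumption)
    avg_len = sum(len(t) for t in counts) / len(counts)
    in_title = set(title_terms)

    weights = {}
    for term, count in counts.items():
        tf = count / total
        # Smoothed IDF (assumption) so unseen or ubiquitous terms stay finite
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1
        position = TITLE_FACTOR if term in in_title else BODY_FACTOR
        weights[term] = tf * idf * position * len(term) / avg_len
    return weights
```

Under this assumed form, moving a feature item from the body to the title multiplies its weight by exactly 1.5 while leaving the other items unchanged, which matches the intent of the position factor described above.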

Content Extraction of Stock Market Hot Events
When a hot event occurs in the stock market, the relevant reports and comments on this hot event will grow explosively and update rapidly every day. If the rele-

Detection of Hot Events in Stock Market
In the financial field, time series data and text data are often related and interact with each other. The emergence of a hot event often leads to changes in the stock market and further stimulates public opinion. Therefore, event discovery research for the financial field should take into account both the text features and the unique temporal characteristics of the field. This paper takes post information and news information as the starting point and combines them with the characteristics of time series data in the stock market to detect hot events in the stock market. The overall flow chart for event detection is shown in Figure 2. First, the event time series is constructed from the relationships between stock attributes, post's influence and network news volume. On this basis, the outlier detection algorithm is used to detect the time attributes of the events. Then, the keyword weights of the network public opinion information are calculated with the multi-feature fusion feature weight formula to obtain the core event content of each document set in the document set sequence. Finally, the corresponding event set is obtained according to the time attributes of the events, which realizes hot event detection in the stock market.

Collection and Preprocessing of the Information of Network Public Opinion Space
In this paper, we used the web crawler program to crawl the online post data and

Computation and Analysis of the Post's Influence in Network Public Opinion Space
According to the post's influence formula, the influence of posts differs from day to day. Most of the time, however, the difference is small: the trend is stable and the value is low. Only in a few time periods does the influence suddenly increase. We think this is mainly related to the activity of the stock market. When the stock market is running normally and there is no major change, netizens only follow and discuss stock information according to their daily habits, so the influence of related posts does not change much. Once a major event occurs in the market, however, it quickly attracts the attention of interested netizens. Related posts are then published, commented on and forwarded in large quantities, which leads to a sudden increase in post's influence on that day, that is, an abnormal point. To verify this notion, this paper randomly selected 10 historical events of 3 stocks for time comparison. The event table is shown in Table 1. From the graph in which these events are marked, we can see that the starting points of the historical events coincide with the abnormal points of post's influence. This shows that the post's influence defined in this paper can reflect abnormal changes in the stock market to a certain extent, and it verifies the above speculation.

Detection of Hot Events
According to the feature weight formula introduced in Section 3.3.1, the weights of the keywords in each document are calculated in turn, and the weights of the same keyword across the documents of the same day are summed, with one day as the time unit. The keywords are then sorted in descending order of weight. Six abnormal points are detected and marked with red dashed lines and text. We believe that the times corresponding to these six abnormal points are the occurrence times of the stock market hot events of stock 600519. Based on this result and the daily keyword sets obtained above, hot events in the stock market can be found. Table 3 shows the top 5 keywords of each day at the six abnormal time points mentioned above and the content of the stock market hot events integrated from these keywords.
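The daily aggregation described above can be illustrated with a short sketch. It assumes that each document has already been reduced to a date and a keyword-to-weight mapping; the function name `daily_top_keywords`, the input layout, and the alphabetical tie-breaking rule are assumptions made for this example.

```python
from collections import defaultdict

def daily_top_keywords(doc_weights, top_n=5):
    """Sum keyword weights per day and keep the top_n keywords of each day.

    doc_weights: iterable of (date, {keyword: weight}) pairs, one per document.
    Returns {date: [(keyword, total_weight), ...]} sorted by descending weight.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for date, weights in doc_weights:
        for keyword, weight in weights.items():
            totals[date][keyword] += weight  # same keyword, same day: add up

    return {
        date: sorted(kws.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for date, kws in totals.items()
    }
```

Looking up the dates of the detected abnormal points in the returned dictionary then gives the keyword set from which the content of each hot event is assembled.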
The above is the whole process of discovering stock market hot events for stock 600519 by combining the time series data of stock trading with online public opinion text data. The same process is used to find stock market hot events for the remaining 20 stocks. Figure 6 shows the calculated LOF values for four stocks, and Figure 7 shows the daily LOF values of the 21 stocks over the period from 2018/01/01 to 2018/10/31. According to this result, we set the threshold for abnormal points of hot events to 138, which yields the top ten stock market hot events of the liquor sector stocks in 2018; they are marked with text arrows in the graph. The specific hot event set is shown in Table 4.
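The thresholding step can be sketched as a simple filter over the daily LOF values of all stocks. The function name `select_hot_events`, the nested-dictionary layout, the strict `>` comparison, and every value in the usage data are assumptions for illustration; only the threshold 138 and the stock code 600519 come from the text.

```python
def select_hot_events(lof_by_stock, threshold):
    """Return (stock, date, lof) triples whose LOF exceeds the threshold.

    lof_by_stock: {stock_code: {date: lof_value}}.
    The result is sorted by descending LOF value.
    """
    hits = [
        (code, date, lof)
        for code, series in lof_by_stock.items()
        for date, lof in series.items()
        if lof > threshold  # assumption: strictly greater than the threshold
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

# Made-up LOF values; "STOCK_B" is a placeholder code, not a real stock.
example = {
    "600519": {"2018-02-01": 150.0, "2018-02-02": 20.0},
    "STOCK_B": {"2018-03-05": 300.0, "2018-03-06": 138.0},
}
print(select_hot_events(example, threshold=138))
```

Because a single threshold is applied across all stocks, the number of events it returns depends on the scale of the LOF values, so the value 138 is specific to the data set described in this paper.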

Conclusion and Future Work
In this paper, we consider that the impact of an event in the financial field will often be mapped to the network public opinion space and the real transaction space at the same time. Therefore, this paper proposes a multi-source heterogeneous information detection method combining stock transaction time series data and network public opinion text data to discover stock market hot events.
However, the characteristics of the time series data and text data considered in this paper are still limited. So, in subsequent studies, we can consider combining more temporal and textual features to assist in the discovery of hot events in the stock market.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.