Predicting the Stock Price Movement by Social Media Analysis

Prediction of stock trends has been an intriguing topic and is extensively studied by researchers from diverse fields. Machine learning, a well-established family of algorithms, has also been studied for its potential in predicting financial markets. In this paper, seven different data mining techniques are applied to predict the stock price movement of the Shanghai Composite Index. The approaches include Support vector machine, Logistic regression, Naive Bayesian, K-nearest neighbor classification, Decision tree, Random forest and Adaboost. Using the corresponding comments posted between April 2017 and May 2018, the results show that: 1) sentiment derived from Eastmoney, a social media platform for the financial community in China, further enhances model performance, 2) for positive and negative sentiment classification, all classifiers reach at least 75% accuracy and the linear SVC model performs best, 3) given the strong correlation between the price fluctuation and the bullish index, the approximate overall trend of the closing price can be obtained.

However, financial markets are considered to be a complex non-linear system, and it is very challenging to predict stock prices in a technical way [2] [3] [4] [5].
Market anomalies have been observed that contradict the basic assumptions of the efficient market hypothesis (EMH), according to which the prediction of share prices should not be possible [6] [7] [8] [9]. In recent years, financial economists have been trying to study the financial behavior of investors from the perspective of human science, which has also spawned a new field of financial research, behavioral finance, tracing back to the early 1990s [10]-[15]. An important branch with investor sentiment as its research object has gradually emerged as information technology has witnessed an unprecedented boom. Single events (e.g., sport results, daylight saving anomaly) or continuous effects (e.g., weather effect, air pollution) influence people's emotions [16] [17] [18] [19]. The prediction of share returns based on mood states can be seen as a market anomaly contradicting the efficient market hypothesis [20]. These mood-related anomalies can be explained by the misattribution bias, according to which people make risky decisions depending on mood states [21]. The Affect Infusion Model (AIM), which postulates that people in a positive mood rely on positive cues to make decisions, can explain the relationship between positive and negative mood states and the risk-taking tendency [22] [34]. Hassan, Nath, and Kirley proposed and implemented a fusion model combining the Hidden Markov Model (HMM), Artificial Neural Networks (ANN) and Genetic Algorithms (GA) to forecast financial market behavior [35]. Kumar & Thenmozhi compared five different approaches, including SVM, Random forest, Neural network, Logit and LDA, to predict Indian stock index movement based on economic variable indicators [36].
In this paper, we aim to analyze individual sentiment by addressing the accu- tic investors to interact. We compare the accuracy of these classifiers using the feature model: unigram TF-IDF. We assess the effects of including public mood information on the accuracy of a "baseline" prediction model rather than proposing an optimal prediction model.

Methods
As shown in Figure 1, the methodology proceeds in three phases.

Data Preparation and Feature Engineering
In the first phase, after data pre-processing, including word segmentation, stop word removal and tokenization, we leverage the unigram TF-IDF metric, a measure of word importance in a document that takes the product of term frequency (TF) and inverse document frequency (IDF). TF-IDF for a certain term t is defined as the multiplication of TF(t) by IDF(t). TF measures how frequently a term (feature) occurs in a comment. Since comments differ in length, a term may appear many more times in long comments than in shorter ones. Thus, the term frequency is often divided by the comment length as a way of normalization. Normalized TF for a given term t is defined as $\mathrm{TF}(t) = n / N$, where n = number of times term t occurs in the comment, N = total number of terms in the comment.
In contrast, IDF measures the importance of terms based on how frequently they appear across multiple comments. Intuitively, a term that appears in many comments carries less weight than one that appears in only a few. IDF for a given term t is defined as $\mathrm{IDF}(t) = \log(Q / q)$, where q = number of comments containing term t, Q = total number of comments.
Figure 1. Diagram outlining the methodology overview.
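The TF and IDF definitions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in the paper; the function name `tf_idf`, the natural logarithm, and the three sample comments are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(comments):
    """Return one {term: weight} dict per tokenized comment.

    TF(t)  = n / N       (n: occurrences of t in the comment, N: terms in the comment)
    IDF(t) = log(Q / q)  (q: comments containing t, Q: total comments)
    The natural logarithm is an assumption; the base only rescales the weights.
    """
    total_comments = len(comments)  # Q
    # q for every term: count each term at most once per comment
    doc_freq = Counter(term for comment in comments for term in set(comment))
    weights = []
    for comment in comments:
        counts = Counter(comment)  # n for every term in this comment
        length = len(comment)      # N
        weights.append({
            term: (n / length) * math.log(total_comments / doc_freq[term])
            for term, n in counts.items()
        })
    return weights


# Hypothetical, already segmented comments (illustrative only)
comments = [
    ["market", "up", "up"],
    ["market", "down"],
    ["market", "flat", "today"],
]
weights = tf_idf(comments)
```

In this toy corpus "market" occurs in every comment, so its IDF and therefore its TF-IDF weight is zero, while "up" receives a positive weight in the first comment.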

K-Fold Cross Validation with Multiple Machine Learning Algorithms
In the second phase, we build a bag-of-words corpus by manually sorting messages into positive and negative classes. We apply K-fold cross validation to train the models: the data are divided into 5 splits, so that 80% of the data serve as training observations and the remaining 20% as the test set. We leverage multiple machine learning algorithms for analyzing the emotional polarity (Table 1).
Table 1. Machine learning classifiers overview.

Algorithms Explanation
LinearSVC
Support vector machines (SVM) fall into two main categories: support vector classification (SVC) and support vector regression (SVR). SVM is a learning system that uses a high-dimensional feature space. The main objective of a support vector machine is to identify the maximum-margin hyperplane as the final decision boundary.

Logistic Regression
Logistic regression predicts the probability of an outcome that can only have two values (i.e. a dichotomy). A logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to a linear regression, but the curve is constructed using the natural logarithm of the "odds" of the target variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal variance in each group.

Naive Bayesian
The Naive Bayesian classifier is based on Bayes' theorem with independence assumptions between predictors. Bayes' theorem provides a way of calculating the posterior probability. A Naive Bayesian model is useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.

K neighbors Classifier
K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure. KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the 1970s. A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors, as measured by a distance function.

Decision Tree
Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

Random Forest
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners.

AdaBoost
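AdaBoost (adaptive boosting) is a machine learning meta-algorithm that enhances classifier performance and accuracy by adding more weight to previously misclassified instances.
Journal of Data Analysis and Information Processing
Taking the product of the two, precision $P$ and recall $R$, we calculate the F1-score, which is defined as $F_1 = \frac{2PR}{P + R}$, and the accuracy is defined as $A = \frac{TP + TN}{TP + FP + TN + FN}$, where $F_1$ is the F1-score of the model, $A$ is the accuracy of the model, and TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.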
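The second phase as a whole, 5-fold cross validation scored by accuracy and F1, can be sketched as follows. This is a simplified illustration and not the paper's code: the helper names, the one-dimensional toy features and the `midpoint_classifier` stand-in are all hypothetical, and any of the classifiers in Table 1 could take the place of `fit_predict`.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx); each fold holds out about 1/k of the data."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, tn, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cross_validate(features, labels, fit_predict, k=5):
    """Average accuracy and F1 over k folds.

    fit_predict(train_x, train_y, test_x) must return predicted labels.
    """
    accs, f1s = [], []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        train_x = [features[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [features[i] for i in test_idx]
        test_y = [labels[i] for i in test_idx]
        counts = confusion_counts(test_y, fit_predict(train_x, train_y, test_x))
        accs.append(accuracy(*counts))
        f1s.append(f1_score(*counts))
    return sum(accs) / k, sum(f1s) / k


def midpoint_classifier(train_x, train_y, test_x):
    """Hypothetical stand-in model: threshold halfway between the class means.

    Assumes both classes occur in the training fold.
    """
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [1 if x > threshold else 0 for x in test_x]


# Toy one-dimensional sentiment scores (illustrative only); 1 = positive
features = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.95, 0.05, 0.85, 0.15]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
mean_acc, mean_f1 = cross_validate(features, labels, midpoint_classifier, k=5)
```

The splitter here takes contiguous blocks, so real data would normally be shuffled first; otherwise a fold could end up containing a single class.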

Bivariate Correlation Analysis for the Two Time Series
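In the third phase, we select the model with the best accuracy and examine the relationship between bullish sentiment and the stock market trend. The bull/bear ratio is a market-sentiment indicator which reflects how market professionals are feeling about the market, and how they are likely advising their clients to invest based on those feelings. In this paper, we define the bullish indicator as:
The Pearson correlation coefficient formula is as follows: $r = \frac{\sum_{i=1}^{T} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{T} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{T} (y_i - \bar{y})^2}}$, where $x_i$ and $y_i$ are the values of the two time series on trading day $i$, $\bar{x}$ and $\bar{y}$ are their means, and $T$ is the number of trading days.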
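The correlation step can be illustrated with a short Python sketch. Note that the `bullish_index` function below is purely an assumption: it uses a log-ratio of daily bullish and bearish message counts, a form found in the sentiment literature, and should not be read as the definition used in this paper. The `pearson_r` function follows the standard Pearson formula; the daily counts and closing prices are made-up values.

```python
import math


def bullish_index(n_bullish, n_bearish):
    """ASSUMED log-ratio form, not the paper's own definition.

    Positive when bullish messages dominate, negative when bearish ones do.
    """
    return math.log((1 + n_bullish) / (1 + n_bearish))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up daily (bullish, bearish) message counts and closing prices
daily_counts = [(120, 80), (90, 110), (150, 60), (70, 130), (100, 100)]
closing_prices = [3250.0, 3231.5, 3278.2, 3205.9, 3240.1]

bi_series = [bullish_index(bull, bear) for bull, bear in daily_counts]
r = pearson_r(bi_series, closing_prices)
```

Adding 1 to both counts in the assumed index keeps the ratio defined on days with no bearish (or no bullish) messages.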

Data
We perform analysis on the Shanghai Composite Index. All price data and comments data are drawn from the period between April 2017 and May 2018, totaling 266 trading days. Two main datasets were used.

Comments Data
Comments data is collected from the financial forum of Eastmoney (http://guba.eastmoney.com/) in CSV format, containing over 480,000 messages.
In addition, we manually sort out about 5000 positive messages and 5000 negative messages.

Price Data
Daily split-adjusted stock price data of Shanghai Composite Index is collected via Tushare, a Python module which provides stock price data in dataframe format. We focus only on the closing price data.

Results and Discussions
As shown in Table 3 and Figure 2, the results indicate that the chosen algorithms clearly distinguish positive from negative sentiment, with a worst-case accuracy of 75%; SVC yielded the best accuracy of 88%.
We choose SVM as the basic classification algorithm for our prediction model. We calculate the time series of the sentiment indicator through the bullish index (BI) and plot it together with the time series of stock prices in a single figure (Figure 3). As shown in Figure 3, the BI and the Shanghai Composite Index were selected as variables, and the Pearson coefficient was used for the correlation test. The two series yielded a statistically significant Pearson correlation coefficient of 0.689 (as shown in Table 4).
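As a rough check on what "statistically significant" means for a coefficient of this size, the standard t-statistic for a Pearson correlation can be computed as below. This is a sketch under two assumptions that the text does not confirm: that the correlation was computed over 266 paired daily observations (the number of trading days in the sample), and that the usual t-test with n - 2 degrees of freedom applies, which ignores any autocorrelation in the two series. The function name `pearson_t_statistic` is hypothetical.

```python
import math


def pearson_t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


# r is taken from the text; n = 266 paired observations is an ASSUMPTION
t_stat = pearson_t_statistic(0.689, 266)
```

Under these assumptions the statistic comes out at roughly 15, far beyond the conventional critical values for 264 degrees of freedom.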

Conclusions
This research focused on predicting the direction of stocks and stock price indices. The prediction performances of seven models, namely SVM, Logistic regression, Naive Bayesian, KNN, Decision tree, Random forest and Adaboost, are compared based on one year of historical data on the Shanghai Composite Index and comments from the Eastmoney platform. Experiments with continuous-valued data show that the Adaboost model has the lowest performance, with 77.2% accuracy, and SVM the highest, with 88.16% accuracy. The SVM classifier fits dichotomies better. Since we divide emotions into positive and negative emotions, SVM is the most suitable classifier. Although these seven classification algorithms have achieved good fitting results, none of them is more than 90 percent accurate. On the one hand, Chinese words are more complex than English ones. On the other hand, most natural language processing work is aimed mainly at English and is not well suited to Chinese.
Further research will focus on extending the technical indicator's opinion about stock price movement into finer categories such as "highly possible to go up", "highly possible to go down", "less possible to go up", "less possible to go down" and "neutral signal", which are worth exploring. This may give more accurate input to the inference engine of the sentiment analysis algorithms. Besides calculating the correlation coefficient of the two time series, further research will examine the long-term, quarterly performance of stocks using an ARIMA model with exogenous variables for empirical testing.