Predicting Stock Movement Using Sentiment Analysis of Twitter Feed with Neural Networks

External factors, such as social media and financial news, can have widespread effects on stock price movement. For this reason, social media is considered a useful resource for precise market predictions. In this paper, we show the effectiveness of using Twitter posts to predict stock prices. We start by training various models on the Sentiment 140 Twitter data. We found that Support Vector Machines (SVM) performed best (0.83 accuracy) in the sentiment analysis, so we used SVM to predict the average sentiment of tweets for each day that the market was open. Next, we use the sentiment analysis of one year of tweets containing the keywords “stock market”, “stocktwits”, and “AAPL”, with the goal of predicting the corresponding stock prices of Apple Inc. (AAPL) and the Dow Jones Industrial Average (DJIA) index. Two models, Boosted Regression Trees and Multilayer Perceptron Neural Networks, were used to predict the closing price difference of AAPL and the DJIA. Our results show that the neural network predicts stock price differences better than the Boosted Regression Tree model.


Introduction
Many people and companies have an interest in predicting the price movement and direction of the stock market. The stock market is a vital component of a country's economy and one of the most significant investment opportunities for companies and investors. Stock traders need to predict trends in the stock market to determine when to sell or buy a stock. To make a profit, stock traders need to acquire stocks whose prices are expected to rise soon and sell stocks whose prices are expected to decline. If traders can adequately predict stock trends and patterns, they can earn a considerable profit margin.
However, stock markets are very volatile and, consequently, difficult to predict.
External factors, such as social media and financial news, can have widespread effects on stock price movement. For this reason, social media is considered to have profound importance for precise market predictions.
Investors assess a company's performance and its stock before determining whether to acquire the company's shares, in order to avoid buying risky stocks. This evaluation includes an analysis of how the company is discussed on social media websites. One social media platform of great importance in finance and the stock market is Twitter. One hundred million active Twitter users post nearly 500 million tweets every day [1]. Users express their opinions, decisions, feelings, and predictions through these tweets, which can be translated into useful information. However, such a tremendous amount of social media data cannot be assessed by investors alone; it is a nearly impossible task for humans to perform unaided. Therefore, investors need a computerized analysis system that automatically evaluates stock trends from such large data sets.
Much previous research on stock prediction has used historical or social media data. Research with historical data includes technical analysis, in which mathematics is employed to analyze data for finding future stock market trends and prices [2]. Researchers have applied different machine learning techniques, such as deep learning [3] and regression analysis [4], to historical stock price data. However, these studies did not include external factors such as social media. It is important to utilize social media data because prices are believed to change as a result of human behavior, which social media reflects, so events expressed through social media can significantly affect stock prices and trends.
Social media is an excellent reservoir of information, and sentiment analysis of it can provide insights that indicate positive or negative views on stocks and trends. There has been a considerable amount of research in past years on sentiment analysis of various kinds of text, such as movie reviews and Twitter feeds. Agarwal et al. 2011 [5] examined sentiment analysis on Twitter data, introduced POS-specific prior polarity features, and investigated the use of a tree kernel to eliminate the need for slow feature engineering. Pang and Lee 2004 [6] proposed a machine-learning method that applies text-categorization techniques to just the subjective portions of a text. Kim 2014 [7] proposed a simple one-layer Convolutional Neural Network (CNN) that produced impressive results across several different data sets. In four of the seven categories tested in the experiment, the CNN did much better than the other models, whereas it was comparable in the remaining three. For example, the CNN had the highest accuracy, 81.5, on movie reviews. The robust results achieved with this CNN design suggest that neural networks may serve as a better replacement for well-established baseline models, such as Support Vector Machines [8] and Logistic Regression.
Furthermore, the research that has used both social media and historical data has much room for improvement. Studies conducted on Twitter and Stock market data to predict the stock market using machine learning algorithms include Chakraborty et al. 2017 [9]; Khatri and Srivastava 2016 [10]; Chen and Lazer 2011 [11]; Khan et al. 2020 [12].
This particular research paper will build on Chakraborty et al.'s research paper: "Predicting stock movement using sentiment analysis of the Twitter feed".
In their article, the researchers found that Twitter data could predict stock prices very well on stable days in the stock market. However, the researchers used only a boosted regression tree model to predict the next day's stock price difference from the current day's stock market sentiment. This paper will implement neural networks to see if they produce better results than the boosted tree model. Specifically, a Multilayer Perceptron Neural Network (MLP) model will be employed. This paper aims to improve on the previous work using MLP and to analyze the effectiveness of using Twitter data to predict stock market trends and prices.
In this paper, a sentiment-tagged Twitter dataset of 1.6 million tweets collected from Sentiment 140 will be used for sentiment classification. Then, the Boosted Regression Tree and Multilayer Perceptron models will be used for predicting the next day's stock movement from the present day's tweets containing the keywords "stock market", "StockTwits", and "AAPL". The first question this paper will test is: "Can stock market related tweets accurately predict stock market movement?" The second is: "Are neural networks more effective at predicting stock market movement than traditional models?"

Data
Similar to Chakraborty et al., the training data set was collected through Sentiment 140, which is available on Kaggle [13]. The dataset contains 1.6 million hand-tagged tweets, collected through the Sentiment 140 API. The tweets are tagged "1" for "positive" and "0" for "negative". We perform a random split to divide the dataset into a training dataset and a testing dataset. The training dataset contains 1.52 million tweets, whereas the testing dataset contains 80,000 tweets. The distribution of the data is shown in Table 1. As shown in Table 1, the data is reasonably balanced, with an almost equal number of positive and negative tweets in both the Sentiment 140 testing and training data.
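The random split described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `random_split`, the fixed seed, and the toy data are assumptions, and the 5% test fraction is simply 80,000 / 1,600,000.

```python
import random

def random_split(items, test_fraction=0.05, seed=0):
    """Shuffle a copy of `items` and split it into (train, test).

    test_fraction=0.05 mirrors 80,000 test tweets out of 1.6 million;
    the seed is an arbitrary choice made only for reproducibility.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the labelled tweets: (text, label) pairs, label 1 or 0.
tweets = [(f"tweet {i}", i % 2) for i in range(1600)]
train, test = random_split(tweets)
print(len(train), len(test))  # 1520 80
```

Scaling the toy list up by a factor of 1,000 gives the 1.52 million / 80,000 split reported above.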
Next, tweets containing the keywords "stock market", "StockTwits", and "AAPL" that were posted between January and December 2016 were collected for predicting the corresponding stock movement. We collected at most one hundred tweets per day. The tweets were collected using GetOldTweets3, a Python library for accessing old tweets. It allows the user to retrieve old tweets by date and by keyword or username, and also by location. We use GetOldTweets3 because, unlike other APIs, it gives access to old tweets.
Historical stock price data is available on Yahoo Finance [14]. The price data for the selected stock and index were collected from Yahoo Finance for the chosen period in CSV format. Since the tweets collected by keyword were not tagged with sentiment labels, the models trained on the Sentiment 140 dataset were used to predict their sentiment values. Table 2 is a sample of the Sentiment 140 dataset.

Data Preprocessing
Each tweet is preprocessed by running a function over its text that applies the steps below. The function transforms the data as shown in Table 3. This preprocessing differs from the previous study in that Lemmatization, Removing Keywords, and Removing Short Words were added. These steps were added because they better prepare the data for sentiment analysis.
1) Lower Casing: Each text is converted to lowercase.
8) Removing Stopwords: Stopwords are English words that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence (e.g., "the", "he", "have").
9) Lemmatizing: Lemmatization is the process of converting a word to its base form (e.g., "better" to "good").
After preprocessing, the data of Table 2 took the form of Table 3.
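A minimal sketch of such a preprocessing function is given below. It covers only the steps named in this section (lower casing, removing keywords, removing stopwords, lemmatizing, removing short words). Several things in it are assumptions made for illustration: the tiny `STOPWORDS` and `LEMMAS` tables stand in for a full stopword list and a dictionary-based lemmatizer, "removing keywords" is taken to mean removing the search keywords, tokens are restricted to alphabetic words, and the minimum word length of 3 is arbitrary.

```python
import re

# Illustrative stand-ins only; a real pipeline would use a complete
# stopword list and a dictionary-based lemmatizer.
STOPWORDS = {"the", "he", "have", "a", "is", "are", "to", "of", "and"}
LEMMAS = {"stocks": "stock", "prices": "price", "rising": "rise", "better": "good"}
KEYWORDS = ("stock market", "stocktwits", "aapl")  # assumed: the search keywords

def preprocess(text, min_length=3):
    """Apply the preprocessing steps to one tweet and return the cleaned text."""
    text = text.lower()                                    # lower casing
    for keyword in KEYWORDS:                               # removing keywords
        text = text.replace(keyword, " ")
    tokens = re.findall(r"[a-z]+", text)                   # alphabetic tokens only
    tokens = [t for t in tokens if t not in STOPWORDS]     # removing stopwords
    tokens = [LEMMAS.get(t, t) for t in tokens]            # lemmatizing (toy lookup)
    tokens = [t for t in tokens if len(t) >= min_length]   # removing short words
    return " ".join(tokens)

print(preprocess("The AAPL stocks are rising to better prices"))
# stock rise good price
```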

Results of Sentiment Analysis
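Following the same steps as Chakraborty et al., several classification models were trained on the Sentiment 140 data, and their results are reported in Table 4. In Table 4, precision is the equation,
$$\text{precision} = \frac{tp}{tp + fp}$$
where tp is the number of true positives, and fp is the number of false positives.
A true positive is a tweet predicted as positive whose sentiment is actually positive, whereas a false positive is a tweet predicted as positive whose sentiment is actually negative.
In Table 4, recall is the equation,
$$\text{recall} = \frac{tp}{tp + fn}$$
where fn is the number of false negatives, that is, tweets predicted as negative whose sentiment is actually positive.
In Table 4, the F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and its worst value at 0. The F-beta score weights recall more than precision by a factor of beta; beta = 1.0 means recall and precision are equally important. The parameter beta = 1.0 was used in order to keep this research paper consistent with the research report conducted by Chakraborty et al. The F-beta score thus serves as a single summary of the model's classification quality.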
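As a worked illustration of these metrics, the sketch below computes precision, recall, and the F-beta score from raw counts, using the standard definition F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). Here fn denotes the number of false negatives (positive tweets predicted as negative). The function name and the example counts are hypothetical and are not taken from Table 4.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts.

    Assumes tp + fp > 0, tp + fn > 0 and tp > 0, so no denominator is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f = precision_recall_fbeta(tp=80, fp=20, fn=10)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8 0.8889 0.8421
```

With beta = 1.0 the score reduces to the ordinary harmonic mean of precision and recall, which is the setting used in this paper.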

Stock Movement Prediction
From the results of sentiment analysis, we found that SVM worked best on our Sentiment 140 test data, which is in line with the results reported by Chakraborty et al. For this reason, SVM was used for the sentiment analysis of the stock-related tweets with the keywords "stock market", "Stocktwits", and "AAPL". This was kept the same as in the previous research study.
Tweets containing the keywords "stock market" and "stocktwits" were used to predict DJIA closing difference values, whereas tweets containing the keyword "AAPL" were used to predict Apple Inc. closing difference values.

Stock Index Value Prediction Using Boosted Regression Tree
Similarly, to Chakraborty

Stock Index Value Prediction Using Multilayer Perceptron Neural Network
We extended the regression modeling by implementing a Multilayer Perceptron Neural Network model to see whether neural networks better predict the stock closing difference, since neural networks have been reported to outperform traditional models on related tasks.
Like the Boosted Tree model, the MLP was trained on data from January to August 2016 and tested on stock-related data from September to December 2016. The average sentiment values of tweets containing "stock market" and "stocktwits" are paired with the DJIA closing price difference, while the average sentiment values of tweets containing "AAPL" are paired with the Apple Inc. closing price difference. We trained our model to predict the stock price difference for the next day. In the training data set, each average sentiment value is computed from one day's tweets, and its corresponding closing price difference is the difference between that day and the next day. So, after getting the sentiment value of the present day's tweets, we can predict how much the stock market will rise or fall the next day. In other words, to predict the current day's stock value, we need the average sentiment value of the previous day's tweets.
Journal of Data Analysis and Information Processing
The tweets from September to December 2016, which were used for testing, first went through SVM classification to obtain average sentiment values, since our Multilayer Perceptron Neural Network model is trained with average sentiment values. These average sentiment values were then used by our Multilayer Perceptron Neural Network model to predict the next day's stock difference. Table 5 shows the first few entries that the models were trained with.
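The pairing of a day's average sentiment with the next day's closing difference, and the split by date into training and testing periods, can be sketched as follows. The function names, the handling of days without tweets, and all numbers in the toy data are assumptions made for illustration; they are not the study's data or code.

```python
from datetime import date

def build_pairs(daily_labels, closes):
    """Pair each trading day's average sentiment with the closing
    difference between that day and the next trading day.

    daily_labels: {day: [0/1 sentiment labels of that day's tweets]}
    closes:       {trading day: closing price}
    """
    days = sorted(closes)
    pairs = []
    for today, next_day in zip(days, days[1:]):
        labels = daily_labels.get(today)
        if not labels:
            continue  # assumed: skip days with no collected tweets
        avg_sentiment = sum(labels) / len(labels)
        pairs.append((today, avg_sentiment, closes[next_day] - closes[today]))
    return pairs

def split_by_date(pairs, first_test_day):
    """Training rows come before first_test_day, testing rows on or after it."""
    train = [p for p in pairs if p[0] < first_test_day]
    test = [p for p in pairs if p[0] >= first_test_day]
    return train, test

# Toy data with made-up prices and labels.
closes = {date(2016, 8, 30): 100.0, date(2016, 8, 31): 101.5,
          date(2016, 9, 1): 101.0, date(2016, 9, 2): 103.0}
daily_labels = {date(2016, 8, 30): [1, 1, 0, 1], date(2016, 8, 31): [0, 0, 1, 0],
                date(2016, 9, 1): [1, 1], date(2016, 9, 2): [1, 0]}

pairs = build_pairs(daily_labels, closes)
train, test = split_by_date(pairs, date(2016, 9, 1))
print(len(train), len(test))  # 2 1
```

The last day in the price series has no following day, so it contributes no training or testing row.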

Prediction Results of Stock Movement
We have plotted both actual stock differences and predicted stock differences in the testing period from September to December 2016. Furthermore, two tables show the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) between the actual stock differences and the stock differences predicted by the Boosted Tree model and the MLP regression model. Table 6 shows the MAE values, while Table 7 shows the RMSE values. Figure 1 shows the actual and predicted stock differences for tweets with "stock market". Figure 2 shows the actual and predicted stock differences for tweets with "stocktwits". Figure 3 shows the actual and predicted stock differences for tweets with "AAPL".
The MAE in Table 6 and the RMSE in Table 7 are computed as
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
where n is the number of days in the testing period, y_i is the actual stock difference on day i, and \hat{y}_i is the predicted stock difference on day i.
The results for each of the figures show that the MLP neural network is, on average, better than the boosted regression tree model at predicting the price difference of stocks. However, the boosted regression tree model tended to overpredict the price difference values, whereas the MLP neural network tended to underpredict them.
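As a small worked example of the two error measures, the sketch below computes the mean absolute error and the root mean square error from a list of actual and predicted differences, using the standard definitions. The function names and the four example values are hypothetical and are not results from this study.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical closing differences, chosen only to illustrate the arithmetic.
actual = [1.5, -0.5, 2.0, 0.0]
predicted = [1.0, 0.5, 1.0, 0.0]
print(mae(actual, predicted), rmse(actual, predicted))  # 0.625 0.75
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, which is why the two tables can rank days with extreme differences differently.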

Conclusions and Further Work
In our work, we predict the future movement of the United States stock market by analyzing the sentiment of Twitter posts related to the stock market. To do this, we collected stock-related tweets and obtained their average sentiment value by using SVM. After that, we prepared the training set from those tweets and the corresponding DJIA or Apple Inc. closing differences between the present day and the next day. Then we tested on similar stock-related tweets from a different period to see how well we could predict the closing differences. We used a Boosted Regression Tree model and a Multilayer Perceptron Neural Network model to do this.
We were able to answer both of our research questions. The results of our work indicate that tweets do play a role in the prediction of stock market movement. Furthermore, the results suggest that neural networks perform better than the Boosted Regression Tree. For all three sets of data, with the keywords "stock market", "stocktwits", and "AAPL", the Multilayer Perceptron Neural Network model has a lower MAE and RMSE than the Boosted Regression Tree model. Our results also show that very large and very small differences in stock indexes are challenging to predict with the Boosted Regression Tree. Except for those days, however, our models predicted very well on the given data set.
Future work on this study would include applying the models to different stock markets across the world. Furthermore, using a data range of more than one year may provide more accurate results. Additionally, analyzing the models in different economic situations, such as booms or recessions, may allow us to better assess the performance of the models. Finally, using a neural network to classify the sentiment of the tweets may offer better results.