Integrated Real-Time Big Data Stream Sentiment Analysis Service

Opinion (sentiment) analysis on big data streams, from the text constantly generated on social media networks to hundreds of millions of online consumer reviews, gives organizations in every field an opportunity to discover valuable intelligence in massive user-generated text. However, traditional content analysis frameworks are too inefficient to handle the unprecedented volume of unstructured text streams and the complexity of the analysis tasks required for real-time opinion mining on big data streams. In this paper, we propose a parallel real-time sentiment analysis system, the Social Media Data Stream Sentiment Analysis Service (SMDSSAS), that performs multiple phases of sentiment analysis on social media text streams effectively in real time, using two fully analytic opinion mining models to address both the scale of the text streams and the complexity of sentiment analysis on unstructured text. We propose two aspect-based opinion mining models, a Deterministic and a Probabilistic sentiment model, for real-time sentiment analysis on data streams correlated with user-given topics. Experiments on Twitter stream traffic captured during the pre-election weeks of the 2016 Presidential election, analyzing public opinion toward the two presidential candidates in real time, showed that the proposed system correctly predicted Donald Trump as the winner of the 2016 Presidential election. Cross-validation results showed that the proposed sentiment models, combined with the real-time streaming components of our framework, effectively analyzed opinions on the two presidential candidates with an average accuracy of 81% for the Deterministic model and 80% for the Probabilistic model, a 1% - 22% improvement over the results in the existing literature.


Introduction
In the era of web-based social media, user-generated content in any form, including blogs, wikis, forums, posts, chats, tweets, and podcasts, has become the norm for expressing people's opinions. The amount of data generated by individuals, businesses, governments, and research agencies has undergone exponential growth. Social networking giants such as Facebook and Twitter had 1.86 and 0.7 billion active users, respectively, as of Feb. 2018. These user-generated texts are a valuable resource for discovering intelligence that helps people in any field make critical decisions. Twitter has become an important platform of user-generated text streams where people express their opinions and views on new events, new products, or news. Such events, from the announcement of political parties and candidates for elections to a popular new product release, are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to measure the relationship between expressed public sentiment and the new events or products.
Sentiment analysis can help explore how these events affect public opinion, or how public opinion affects the future sales of new products. While traditional content analysis takes days or weeks to complete, opinion analysis of such large streams of user-generated text has demanded the research and development of a new generation of analytics methods and tools that can process them effectively in real time or near real time.
Big data is often defined by three characteristics: volume, velocity, and variety [1] [2], because it consists of constantly generated, massive data sets with large, varied, and complex structures, or with no structure at all (e.g. tweet text). These three characteristics imply difficulties in storing, analyzing, and visualizing the data with traditional analysis systems. The common problems of big data analytics are, first, that traditional data analysis systems cannot reliably process such volumes of data at an acceptable rate. Second, big data processing commonly requires complex, multi-phase processing of data cleaning, preprocessing, and transformation, since the data arrives in many different formats, either semi-structured or unstructured. Last, big data is constantly generated at high speed, so none of the traditional preprocessing architectures can process it efficiently in real time or near real time.
Two common approaches to processing big data are batch-mode analytics and streaming-based analytics. Batch processing is an efficient way to process high volumes of data where a group of transactions is collected over time [3]. Frameworks based on parallel and distributed system architectures, such as Apache Hadoop with MapReduce, currently dominate batch-mode big data analytics. This type of processing addresses the volume and variety components of big data analytics but not velocity. In contrast, stream processing is a model that computes over a small window of recent data at one time [3]. This makes computation real-time or near-real-time. To meet real-time constraints, the stream-processing model must be able to calculate statistical analytics on the fly, since streaming data, such as user-generated content from repeated online user interactions, arrives continuously at high speed [3].
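The window-at-a-time computation described above can be sketched in a few lines. The class name and the O(1) running-total update below are illustrative only, not part of any cited framework:

```python
from collections import deque

class WindowedMean:
    """Maintain a running mean over the most recent `size` stream items,
    so each new arrival is processed in O(1) without rescanning history."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def push(self, value):
        # Add the new arrival, evict the oldest item once the window is full.
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)
```

The same idea generalizes to any incremental statistic (counts, polarity ratios) computed per window of the stream.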
This notable high-velocity arrival characteristic of big data streams means that the corresponding analytics must process the stream in a single pass under strict constraints of time and space. Most existing work that leverages distributed parallel systems to analyze big social media data in real time or near real time performs statistical analysis with pre-computed data warehouse aggregations [4] [5] or with a simple frequency-based sentiment analysis model [6]. More sophisticated sentiment analyses on streaming data are mostly MapReduce-based batch-mode analytics. While batch-mode processing for sophisticated sentiment analysis of social media data is common, only a few works propose systems that perform complex real-time sentiment analysis on big data streams [7] [8] [9], and little work exists in which such systems are implemented and tested on real-time data streams. Sentiment analysis, otherwise known as opinion mining, commonly refers to the use of natural language processing (NLP) and text analysis techniques to extract and quantify subjective information in a text span [10]. NLP is a critical component in extracting useful viewpoints from streaming data [10]. Supervised classifiers are then trained on labeled training sets to make predictions. The polarity (positive or negative opinion) of a sentence is measured with scoring algorithms that quantify the level of opinion polarity in the sentence. The most established NLP method for capturing the essential meaning of a document is the bag-of-words (or bag-of-n-grams) representation [11]. Latent Dirichlet Allocation (LDA) [12] is another widely adopted representation. However, both representations fail to capture the semantic relatedness (context) between words in a sentence and suffer from problems such as polysemy and synonymy [13].
A recent paradigm in NLP, unsupervised text embedding methods such as Skip-gram [14] [15] and Paragraph Vector [16] [17], which use distributed representations for words [14] [15] and documents [16] [17], has been shown to be effective and scalable in capturing semantic and syntactic relationships, such as polysemy and synonymy, between words and documents. The essential idea of these approaches comes from the distributional hypothesis that a word is represented by its neighboring (context) words: you shall know a word by the company it keeps [18]. Le and Mikolov [16] [17] show that their method, Paragraph Vectors, can be used to classify movie reviews or cluster web pages. We employed a pre-trained network with the paragraph vector model [19] in our system's preprocessing to identify n-grams and synonymy in our data sets.
An advanced sentiment analysis beyond polarity is aspect-based opinion mining, which looks at other factors (aspects) to determine sentiment polarity, such as feelings of happiness, sadness, or anger. An example of aspect-oriented opinion mining is classifying movie reviews as thumbs up or thumbs down, as seen in the 2004 paper and many other papers by Pang and Lee [10] [20].

S. S. Chung, D. Aring Journal of Data Analysis and Information Processing
Another technique is the lexical approach to opinion mining, developed famously by Taboada et al. in their SO-CAL calculator [21]. The system calculated the semantic orientation, i.e. subjectivity, of a word in the text by capturing the strength and potency with which the word was oriented either positively or negatively toward a given topic, using advanced techniques such as amplifiers and polarity shift calculations.
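A minimal sketch of this lexical approach follows, assuming a toy lexicon and illustrative amplifier and negation-shift constants (the actual SO-CAL lexicon and weights differ):

```python
# Toy lexicon and modifier weights (illustrative values, not SO-CAL's).
LEXICON = {"good": 3, "great": 4, "bad": -3, "terrible": -4}
AMPLIFIERS = {"very": 0.5, "extremely": 0.8, "slightly": -0.5}
NEGATORS = {"not", "never", "no"}

def so_score(tokens):
    """Sum word polarities, scaling by a preceding amplifier and
    shifting the score toward the opposite pole after a negator."""
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = float(LEXICON[tok])
        # Amplifier immediately before the opinion word scales its strength.
        if i > 0 and tokens[i - 1] in AMPLIFIERS:
            value *= 1.0 + AMPLIFIERS[tokens[i - 1]]
        # SO-CAL-style polarity shift: a nearby negator moves the score a
        # fixed amount toward the opposite polarity rather than flipping it.
        window = tokens[max(0, i - 3):i]
        if any(t in NEGATORS for t in window):
            value += -4.0 if value > 0 else 4.0
        score += value
    return score
```

For example, "not good" shifts +3 down to -1 rather than flipping it to -3, which captures the observation that negated positives are usually mildly negative.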
The single most important information need to identify in sentiment analysis is opinions and perspectives on a particular topic, otherwise known as topic-based opinion mining [22]. Topic-based opinion mining seeks to extract the personal viewpoints and emotions surrounding social or political events by semantically orienting user-generated content that has been correlated by topic word(s) [22].
Despite the success of these sophisticated sentiment analysis methods, little is known about whether they may be scalable to apply in the multi-phased opinion analysis process to a huge text stream of user generated expressions in real time.
In this paper, we examine whether a stream-processing big data social media sentiment analysis service can offer scalability in processing these multi-phased, state-of-the-art sentiment analysis methods while offering efficient near-real-time processing of enormous data volumes. This paper also explores methodologies for opinion analysis of social network data. To summarize, we make the following contributions:
- We propose a fully integrated, real-time text analysis framework that performs complex multi-phase sentiment analysis on massive text streams: the Social Media Data Stream Sentiment Analysis Service (SMDSSAS).
- We propose two sentiment models that combine topic-, lexicon- and aspect-based sentiment analysis and can be applied to a real-time big data stream in cooperation with recent natural language processing (NLP) techniques:
  - a Deterministic Topic Model that accurately measures user sentiment in the subjectivity and context of user-provided topic word(s); and
  - a Probabilistic Topic Model that effectively identifies the polarity of sentiment for topic-correlated messages over the entire data stream.
- We fully experimented on Twitter message streams captured during the pre-election weeks of the 2016 Presidential Election to test the accuracy of our two proposed sentiment models and the performance of the proposed SMDSSAS for real-time sentiment analysis. The results show that our framework is an efficient and scalable alternative for extracting, transforming, scoring, and analyzing opinions in user-generated big social media text streams in real time.

Related Works
Many existing works in the related literature concentrate on topic-based opinion mining models. In topic-based opinion mining, sentiment is estimated from the messages related to a chosen topic of interest, such that topic and sentiment are jointly inferred [22]. There are many works on topic-based sentiment analysis in which the models are tested in batch mode, as listed in the reference Section, but few works in the literature address topic-based models for real-time sentiment analysis on streaming data. Real-time topic sentiment analysis is imperative to meet the strict time and space constraints of efficiently processing streaming data [6]. More recent research [23] [24] has proposed big data stream processing architectures. The first work, in 2015 [23], proposed a multi-layered Storm-based approach for applying sentiment analysis to big data streams in real time, and the second, in 2016 [24], proposed a big data analytics framework (ASMF) to analyze consumer sentiments embedded in hundreds of millions of online product reviews. Both approaches leverage probabilistic language models, either by mimicking "document relevance" with the probability of the document generating a user-provided query term found in the sentiment lexicon [23], or by adapting a classical language modeling framework to enhance the prediction of consumer sentiments [24]. However, the major limitation of these works is that neither proposed framework has been implemented and tested in an empirical setting or in real time.

Architecture of Big Data Stream Analytics Framework
In this Section, we describe the architecture of our proposed big data analytics framework, illustrated in Figure 1. The Data Preprocessing layer cleans the incoming streams, removing non-alphanumeric characters, before loading them into the Hive data warehouse. We employ natural language processing techniques in the Data Preprocessing layer with the pre-trained network in the paragraph vector model [16] [17]. This layer can also employ the Stanford Dependency Parser [26] and Named Entity Recognizer [27] to build an additional pipeline of dependency parsing, tokenization, sentence splitting, POS tagging and semantic tagging for more sophisticated syntax relationships in the Data Preprocessing stage. The transformation component of this layer preprocesses the streaming text in real time from JSON into CSV-formatted Twitter statuses for Hive table inserts with Hive DDL. The layer is also in charge of removing the ambiguity of words, determined with pre-defined word corpuses, for the later sentiment scoring process.
The fourth layer, the Feature Extraction Layer, comprises a topic-based feature extraction function for our Deterministic and Probabilistic sentiment models.
The topic-based feature extraction method employs the OpinionFinder Subjectivity Lexicon [28] to identify and extract sentiment based on the topics correlated with users' Twitter messages.
The fifth layer of our framework, the Prediction Layer, uses our two topic- and lexicon-based sentiment models, Deterministic and Probabilistic, for sentiment analysis. The accuracy of each model was measured using a supervised Multinomial Naive Bayes classifier to test each model's capability to correctly identify and correlate users' sentiments on data streams related to a given topic (event).
Our sixth and final layer is the Presentation Layer, which consists of a web-based user interface.

Sentiment Model
Extracting useful viewpoints (aspects) in context and subjectivity from streaming data is a critical task for sentiment analysis. Classical approaches to sentiment analysis have limitations in identifying accurate context; for instance, in lexicon-based methods, common sentiment lexicons may not detect the context-sensitive nature of opinion expressions. For example, while the term "small" may have a negative polarity in a mobile phone review that refers to a "small" screen size, the same term could have a positive polarity, as in "a small and handy notebook", in consumer reviews about computers. In fact, the token "small" is defined as a negative opinion word in the well-known sentiment lexicon list OpinionFinder [28].
The sentiment models developed for SMDSSAS are based on the aspect model [29]. Aspect-based opinion mining techniques identify and extract the personal opinions and emotions surrounding social or political events by capturing semantically oriented content, in subjectivity and context, that is correlated by aspects, i.e. topic words. The design of our sentiment model is based on the assumption that positive and negative opinions can be estimated per context of a given topic [22]. Therefore, in generating data for model training and testing, we employed a topic-based approach to perform sentiment annotation and quantification on the related user tweets.
The aspect model is the core of probabilistic latent semantic analysis, a probabilistic language model for general co-occurrence data that associates a latent class (topic) variable with each observation. The aspect model is a joint probability model defined by selecting a document d with probability P(d), picking a latent class (topic) t with probability P(t|d), and generating a word (token) w with probability P(w|t). As a result, one obtains an observed pair (d,w), while the latent class variable t is discarded. Translating this process into a joint probability model results in the expressions

P(d,w) = P(d) P(w|d),   (1)
P(w|d) = Σ_t P(w|t) P(t|d).   (2)

Essentially, to derive (2) one has to sum over the possible choices of t that could have generated the observation.
The aspect model is based on two independence assumptions: first, the pairs (d,w) are assumed to occur independently; this essentially corresponds to the bag-of-words (or bag-of-n-grams) approach. Second, a conditional independence assumption is made: conditioned on the latent class t, words w occur independently of the specific document identity d_i. Given that the number of class states is smaller than the number of documents (K ≪ N), t acts as a bottleneck variable in predicting w conditioned on d.
The model is fitted by maximizing the log-likelihood Σ_{d,w} n(d,w) log P(d,w), where n(d,w) denotes the term frequency, i.e., the number of times w occurs in d. An equivalent symmetric version of the model can be obtained by inverting the conditional probability: P(d,w) = Σ_t P(t) P(d|t) P(w|t). In the Information Retrieval context, this aspect model is used to estimate the probability that a document d is related to a query q [2]. Such probabilistic inference is used to derive a weighted vector in the Vector Space Model (VSM), where a document d contains a user-given query q [2] and q is a phrase or sentence treated as a set of classes (topic words). The term weight tf.idf_{t,d} of a topic word t in document d is defined as

w_{t,d} = tf_{t,d} × idf_t,   idf_t = log(N / df_t),

where tf_{t,d} is the term frequency with which topic word t_j occurs in d_i, df_t is the number of documents that contain t_j, and N is the total number of documents.
Then d and q are represented by weighted vectors over their common terms, and score(q,d) is derived using the cosine similarity function to capture the "relevance" of d with respect to q in the context of the topic words in q. The cosine similarity is defined as the score function over the length-normalized weighted vectors of q and d:

score(q,d) = V(q) · V(d) / (|V(q)| |V(d)|).
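The tf.idf weighting and cosine score above can be sketched as follows; the function names and the toy corpus are illustrative stand-ins, not the system's code:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, docs):
    """Weight each term of a token list by tf * log(N / df), per the
    tf.idf definition; terms absent from the corpus get weight 0."""
    N = len(docs)
    vec = {}
    for term, freq in Counter(doc_tokens).items():
        df = sum(1 for d in docs if term in d)
        vec[term] = freq * math.log(N / df) if df else 0.0
    return vec

def cosine_score(q_vec, d_vec):
    """Length-normalized dot product over the shared terms: score(q,d)."""
    dot = sum(q_vec[t] * d_vec.get(t, 0.0) for t in q_vec)
    nq = math.sqrt(sum(v * v for v in q_vec.values()))
    nd = math.sqrt(sum(v * v for v in d_vec.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```

A document sharing the query's topic word scores strictly higher than one with no overlap, which is exactly the "relevance" ordering the VSM derivation aims for.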

Context Identification
We derive a topic set T(q) by generating the set of all related topic words from a user-given query (topics) q, where q is a set of tokens. For each token t_i in q, we derive the related topic words to add to the topic set T(q) based on the related language semantics R(t_i) as follows:
where t_i, t_j ∈ T. t_i.*|*.t_i denotes any word concatenated with t_i, t_i_t_j denotes a bi-gram of t_i and t_j, and label_synonym(t_i) is the set of labeled synonyms of t_i identified in the WordNet dictionary [23]. For context identification, we can choose to employ the pre-trained network with the paragraph vector model [16] [17] in our system's preprocessing. The paragraph vector model is more robust in identifying synonyms of a new word that is not in the dictionary.
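A rough sketch of the topic-set expansion T(q) follows; `expand_topic_set`, the corpus vocabulary argument, and the synonym dictionary are hypothetical stand-ins for the WordNet and paragraph-vector machinery described above:

```python
def expand_topic_set(query_tokens, corpus_vocab, synonyms):
    """Build T(q): for each token t_i, add t_i itself, any corpus word
    containing t_i (the t_i.*|*.t_i patterns), bigrams t_i_t_j over the
    query tokens, and dictionary-labeled synonyms of t_i."""
    topic = set()
    for i, t in enumerate(query_tokens):
        topic.add(t)
        # Wildcard matches: words with t embedded (e.g. hashtag variants).
        topic.update(w for w in corpus_vocab if t in w and w != t)
        # Bigrams over the remaining query tokens.
        for u in query_tokens[i + 1:]:
            topic.add(f"{t}_{u}")
        # Labeled synonyms from the dictionary.
        topic.update(synonyms.get(t, ()))
    return topic
```

In the running example, the seed "trump" would pull in hashtag-style variants and labeled synonyms while leaving unrelated vocabulary out of T(q).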

Measure of Subjectivity in Sentiment: CMSE and CSOM
We define w(d_i) as a weight measuring the subjectivity of the sentiment orientation of a document d_i. The weighted sentiment measure senti_score(d_i) is then defined with w(d_i) and sentiment_i, the sentiment label of d_i:

senti_score(d_i) = sentiment_i × (1 + α·w(d_i)),

where −1 ≤ w(d_i) ≤ 1, w(d_i) = 0 for neutral, and α is a control parameter for learning; when α = 0, senti_score(d_i) = sentiment_i. senti_score(d_i) gives more weight to a short message with a strong sentiment orientation. T = ∪_j T_j with 1 ≤ j ≤ k, where T is the set of all related topic words derived from a user-given query q as a seed, and T_j, a subset of T, is the set of topic words derived from a given topic. For CSOM, we define two relative opinion measures, Semantic Orientation (SMO) and Sentiment Orientation (STO), to quantify polarity for a given data set correlated with a topic set T_j. SMO indicates the relative polarity ratio between the two polarity classes within a given topic data set. STO indicates the ratio of the polarity of a given topic set over the entire data set.
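Assuming the multiplicative form senti_score(d_i) = sentiment_i(1 + α·w(d_i)), which matches the stated boundary condition that α = 0 reduces the score to the raw label, a minimal sketch is:

```python
def senti_score(sentiment, weight, alpha=0.5):
    """Subjectivity-weighted sentiment score. The multiplicative form is
    an assumption consistent with the text: alpha = 0 yields the raw
    sentiment label, and weight in [-1, 1] (0 for neutral) boosts short
    messages with a strong sentiment orientation."""
    assert -1.0 <= weight <= 1.0
    return sentiment * (1.0 + alpha * weight)
```

With a positive label and a strong positive orientation weight, the score is amplified; a neutral orientation leaves the label unchanged regardless of α.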
With our Trump and Clinton example from the 2016 Presidential Election, the positive SMO for the data set D(TR_j) with the topic word "Donald Trump" and the negative SMO for the Hillary Clinton topic set D(HC_j) can be derived for each polarity class respectively, where Weight(TR_j) and Weight(HC_j) are the weights of the topics over the entire data set:

Weight(TR_j) = |D(TR_j)| / |D(T)|,   Weight(HC_j) = |D(HC_j)| / |D(T)|.

Therefore, STO(TR_j) = Weight(TR_j) × SMO(TR_j) indicates the weighted polarity of the topic TR_j over the entire data set D(T).
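A sketch of the SMO and STO computations, under the assumption that SMO is the positive share of polarized messages within the topic set and STO scales SMO by the topic's weight over the whole data set; the exact original formulas are not fully recoverable from the text:

```python
def pos_smo(pos_count, neg_count):
    """Relative polarity ratio within a topic set (assumed form:
    positive share of the polarized messages)."""
    return pos_count / (pos_count + neg_count)

def sto(topic_size, corpus_size, smo):
    """Weighted polarity of the topic over the entire data set:
    STO = Weight(T_j) * SMO, with Weight = |D(T_j)| / |D(T)|."""
    return (topic_size / corpus_size) * smo
```

For instance, a topic set with 30 positive and 10 negative messages that makes up 40% of the corpus would get PosSMO = 0.75 and STO = 0.3 under these assumptions.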

Deterministic Topic Model
The Deterministic Topic Model considers the context of the words in the text and the subjectivity of each word's sentiment given that context. Under the presumption that topic and sentiment can be jointly inferred, the Deterministic Topic Model measures the polarity strength of sentiment in the context of user-provided topic word(s), considering the subjectivity of each word (token) in d_i in D(T_j). Likelihoods were estimated as relative frequencies with the weighted subjectivity of each word. Using OpinionFinder [28], the lexicon tokens of the tweets were categorized and labeled by subjectivity and polarity. Subjectivity is defined by the 6 weight levels below: each token was assigned to one of the 6 strength classes and weighted on a subjectivity strength scale from −2 to +2, where −2 denotes the strongest subjective negative and +2 the strongest subjective positive word.
subjScale(w_t) is defined as the subjectivity strength scale for each token w_t in d_i.
The weight of each group is assigned as below for the 6 subjectivity strength sets. Any token that does not belong to any of the 6 subjectivity strength sets is weighted 0.
Then CMSE_Subj(D(T)) is the sum of the subjectivity-weighted opinion polarity over a given topic set D(T),

CMSE_Subj(D(T)) = Σ_{d_i ∈ D(T)} Σ_{w_t ∈ d_i} subjScale(w_t),

where 1 ≤ j ≤ k, T is the set of all related topic words derived from the user-given topics, and T_j is the set of topic words derived from a given query q as defined in Section 4.1.
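An illustrative computation of CMSE_Subj, assuming a toy subjectivity lexicon in place of the 6 OpinionFinder strength classes (the values below are examples on the −2..+2 scale, not OpinionFinder's actual entries):

```python
# Hypothetical lexicon mapping tokens to subjectivity strength weights
# on the -2..+2 scale described in the text.
SUBJ_SCALE = {
    "love": 2,   # strongest subjective positive
    "good": 1,
    "okay": 0,   # weak / neutral subjectivity
    "bad": -1,
    "hate": -2,  # strongest subjective negative
}

def cmse_subj(topic_docs):
    """Sum of subjectivity-weighted polarity over all tokens of all
    topic-correlated documents; unlisted tokens contribute 0."""
    return sum(SUBJ_SCALE.get(tok, 0)
               for doc in topic_docs for tok in doc)
```

A topic set whose tweets skew toward strongly subjective positive tokens thus yields a large positive CMSE_Subj, and vice versa.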

Probabilistic Topic Model
The Probabilistic Topic Model adopts the SMO and STO measures of CSOM, together with subjectivity, to derive a modified log-likelihood of the ratio of the subjectivity-weighted PosSMO and NegSMO over a given topic set D(T) and a subset D(T_j).
Our probabilistic model ρ with a given topic set D(T) and subset D(T_j) measures the probability of the sentiment polarity of D(T_j). For example, the probability of a positive opinion for Trump in D(T), denoted P(Pos_TR), is defined with a smoothing factor [30]; we consider strong neutral subjectivity as weak positivity here. We then define our probabilistic model ρ(TR), where NegativeInfo(TR) is essentially a subjectivity-weighted NegSMO(TR_j).

Multinomial Naive Bayes
The fifth layer of our framework, the Prediction Layer, applies the Deterministic and Probabilistic sentiment models discussed in Section 4 to our predictive classifiers for event outcome prediction in a real-time environment. The predictive performance of each model was measured using a supervised predictive analytics model, Multinomial Naive Bayes. Naive Bayes is a supervised probabilistic learning method popular for text categorization problems, i.e. judging whether documents belong to one category or another, because it is based on the assumption that each word occurrence in a document is independent, as in the bag-of-words model. Naive Bayes constructs a classifier that assigns class labels, drawn from a finite set, to problem instances represented as vectors of feature values [31]. We utilized the multinomial model for text classification, based on the bag-of-words representation of a document [32]. Multinomial Naive Bayes models the distribution of words in a document as a multinomial: a document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other. For classification, we assume a fixed number of classes. In the multinomial event model, a document is an ordered sequence of word events whose frequencies are generated by a multinomial (p_1, ..., p_n), where p_i is the probability that event i occurs and x_i is the feature vector counting the number of times event i was observed in an instance [32]. Each document d_i is drawn from a multinomial distribution of words with as many independent trials as the length of d_i, yielding a bag-of-words representation of the documents [32]. Thus the probability of a document given its class is represented by k such multinomials [32].
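The classifier described above can be written as a self-contained Multinomial Naive Bayes with Laplace smoothing. This is our own minimal sketch of the standard algorithm, not the system's implementation:

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial Naive Bayes over bag-of-words documents,
    with add-one (Laplace) smoothing of word likelihoods."""

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        self.prior, self.word_counts, self.totals = {}, {}, {}
        label_freq = Counter(labels)
        for c in label_freq:
            # Log prior from class frequency.
            self.prior[c] = math.log(label_freq[c] / len(labels))
            # Word counts pooled over all documents of class c.
            counts = Counter(w for d, y in zip(docs, labels) if y == c
                             for w in d)
            self.word_counts[c] = counts
            self.totals[c] = sum(counts.values())
        return self

    def predict(self, doc):
        V = len(self.vocab)
        best, best_lp = None, float("-inf")
        for c in self.prior:
            # Log posterior up to a constant: prior + sum of smoothed
            # log likelihoods for each word position.
            lp = self.prior[c] + sum(
                math.log((self.word_counts[c][w] + 1) / (self.totals[c] + V))
                for w in doc)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Training on sentiment-labeled tweets and predicting the class of a held-out tweet is then a two-line usage: `clf = MultinomialNB().fit(docs, labels)` followed by `clf.predict(tokens)`.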

Experiments
We applied the sentiment models discussed in Sections 4.2 and 4.3 to the real-time Twitter stream for the following events: the 2016 US Presidential election and the 2017 Inauguration. User opinion was identified, extracted, and measured around the political candidates and the corresponding election policies, in an effort to demonstrate SMDSSAS's capability for accurate, critical decision making.
A total of 74,310 topic-correlated tweets were randomly collected on a continuous 30-second interval with Apache Spark DStream, accessing the Twitter Streaming API, during the pre-election week of November 2016 and the pre-election month of October, as well as the pre-inauguration week in January. The context detector generated the set of topic words from the following topics: Hillary Clinton, Donald Trump, and political policies. The number of topic-correlated tweets for the candidate Donald Trump was ~53,009, while the number for the candidate Hillary Clinton was ~8510, substantially smaller than for Trump.
Tweets were preprocessed with a custom cleaning function, removing all non-English characters including the Twitter at "@" and hashtag "#" signs, image/website URLs, punctuation ("[. , ! " ']"), digits ([0-9]), and non-alphanumeric characters ($ % & ^ * () + ~), and stored in the NoSQL Hive database. Each topic-correlated tweet was labeled for sentiment using the OpinionFinder subjectivity word lexicon and the subjScale(w_t) defined in 4.3, associating a numeric value with each word based on polarity and subjectivity strength. The measured sentiment orientation dropped from its October level to 0.016 for November and January.
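The cleaning function can be sketched as below; the regular expressions are illustrative stand-ins for the system's exact rules:

```python
import re

def clean_tweet(text):
    """Strip URLs, the @ and # signs, punctuation, and digits, mirroring
    the cleaning steps described above (illustrative patterns)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # image/website URLs
    text = re.sub(r"[@#]", " ", text)                   # Twitter @ and # signs
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # punctuation, digits, symbols
    return re.sub(r"\s+", " ", text).strip().lower()    # collapse whitespace
```

The cleaned string is what gets inserted into the Hive table and fed to the lexicon-based labeling step.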

Predicting the Outcome of 2016 Presidential Election in Pre-Election Weeks
In contrast, Hillary Clinton's positive and negative sentiment orientation measures were consistently low during October and November; her positive sentiment measure ranged from 0.022 in October to 0.016 in November, almost ten times smaller than Trump's, and kept dropping to 0.007 in January.
Clinton's negative orientation measure was 10 times higher than Trump's, ranging from 0.03 in October to 0.01 in November, and decreased to 0.009 in January.

Predicting with Deterministic Topic Model
Our Deterministic Topic Model, as discussed in 4.3, was applied to the November 2016 pre-election tweet streams. The positive polarity orientation for Donald Trump increased to 0.60, while the positive polarity measure for Hillary Clinton was 0.069. From our results shown in Figure 3(b), we witnessed a sharply increased positive sentiment orientation for candidate Donald Trump in the data streams during pre-election November, along with his larger volume of Trump-correlated topic tweets (53,009) compared to Hillary Clinton's (8510) for the Subjectivity-Weighted CMSE shown in Figure 3(a). Our system identified Donald Trump as the definitive winner of the 2016 Presidential Election.

Cross Validation with Multinomial Naive Bayes Classifier for Deterministic and Probabilistic Models
Our cross validation was performed with the following experiment settings and assumptions for a user-chosen time period and a user-given topic (event).

Conclusions
The main contribution of this paper is the design and development of a real-time big data stream analytics framework, providing a foundation for an infrastructure for real-time sentiment analysis on big text streams. Our framework proved to be an efficient, scalable tool to extract, score, and analyze opinions in user-generated text streams per user-given topics in real time or near real time. The experimental results demonstrated the ability of our system architecture to accurately predict the outcome of the 2016 Presidential Race between candidates Hillary Clinton and Donald Trump. The proposed fully analytic Deterministic and Probabilistic sentiment models, coupled with the real-time streaming components, were tested on the user tweet streams captured during the pre-election month of October 2016 and the pre-election week of November 2016. The results showed that our system improved on prior approaches [30] that lacked the complexity of sentiment analysis processing, in either batch or real-time mode. Finally, SMDSSAS performed efficient real-time data processing and sentiment analysis in terms of scalability. The system continuously processes a small window of the data stream (e.g. a consistent 30-second window of streaming data), applying machine learning analytics to the context stream; this yields more accurate predictions through the system's ability to continuously apply multi-layered, fully analytic processes with complex sentiment models to a constant stream of data. The improved and stable model accuracies demonstrate that our proposed framework with the two sentiment models offers a scalable, real-time sentiment analytic alternative to traditional batch-mode data analytic frameworks for big data stream analysis.