TITLE:
Machine Learning Approaches for Classifying the Distribution of Covid-19 Sentiments
AUTHORS:
M. Kuyo, S. Mwalili, E. Okang’o
KEYWORDS:
Machine Learning, Sentiment Analysis, Natural Language Processing, Covid-19, Naive Bayes, N-Gram
JOURNAL NAME:
Open Journal of Statistics, Vol.11 No.5, September 30, 2021
ABSTRACT: Previously, rapid disease detection and prevention were difficult
because disease modeling and prediction depended on manually collected data,
such as surveys. With the increased use of social media platforms such as
Twitter, Facebook, and Instagram, data mining and sentiment analysis can
support disease prevention. Sentiment analysis is a powerful tool for analyzing people’s
perceptions, emotions, value assessments, attitudes, and feelings as expressed
in texts. The purpose of this research is to use machine learning techniques to
classify and predict the spatial distribution of positive and negative
sentiments about the Covid-19 pandemic as expressed on Twitter. The data for this study were
geo-tagged tweets concerning Covid-19, live-streamed using the streamR
package. The key terms used for streaming the data were Corona, Covid-19, sanitizer, virus, lockdown,
quarantine, and social distance. The classification used Naive Bayes algorithms
with n-gram approaches. An n-gram model is a probabilistic language model that
predicts the next item in a sequence as an (n-1)-order Markov model. It relies
on the Markov assumption: the probability of a word depends only on the
preceding n-1 words (for a bigram model, only the previous word), without
looking further into the past. The steps
followed in this research were: cleaning and preprocessing the data; tokenizing
the text into n-grams (1-gram, 2-gram, and 3-gram); converting the tweets into a
weighted matrix of numeric vectors using Term Frequency-Inverse Document
Frequency (TF-IDF); and dividing the data 80:20 between training and
test data. A confusion matrix was utilized to evaluate the classification
accuracy, precision, and recall performance of the various algorithms tested.
Prediction was done using the best-performing Naive Bayes algorithm. The
results of this research showed that under Multinomial Naive Bayes, unigram
accuracy was 92.02%, bigram accuracy was 97.37%, and trigram accuracy was
94.40%. Unigram had 89.34% accuracy, bigram had 96.80%, and trigram had 94.90%
accuracy using Bernoulli Naive Bayes. Unigram accuracy was 90.43%, bigram
accuracy was 95.67%, and trigram accuracy was 92.89% using Gaussian Naive
Bayes. Bigram tokenization outperformed unigram and trigram tokenization.
Bigram Multinomial Naive Bayes was used to predict the test data because it was
the most accurate in classifying the training data. On the test data, prediction
accuracy was 84.92%, precision 85.50%, recall 81.02%, and F1 measure
83.20%. Applying TF-IDF weighting increased prediction accuracy to
87.06%. The predicted sentiments were then plotted on a world map. The study indicates that
machine learning can identify patterns and emotions in public tweets, which may
then be used to steer targeted intervention programs aimed at limiting disease
spread.
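
The n-gram tokenization and the Markov assumption described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code (the study streamed tweets with the R package streamR); the example tweet and the helper names `ngrams` and `bigram_probability` are hypothetical and exist only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_probability(corpus_tokens, prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word) from bigram counts.

    Under the first-order Markov assumption, only the previous word is used.
    """
    pair_counts = Counter(ngrams(corpus_tokens, 2))
    prev_count = sum(c for (first, _), c in pair_counts.items() if first == prev_word)
    if prev_count == 0:
        return 0.0  # prev_word never starts a bigram in this corpus
    return pair_counts[(prev_word, word)] / prev_count


# Hypothetical tweet, assumed to be already cleaned and lowercased.
tweet = "stay home and keep social distance"
tokens = tweet.split()

unigrams = ngrams(tokens, 1)  # 1-gram tokens
bigrams = ngrams(tokens, 2)   # 2-gram tokens
trigrams = ngrams(tokens, 3)  # 3-gram tokens
```

A tweet with k tokens yields k unigrams, k-1 bigrams, and k-2 trigrams, so higher-order n-grams capture more context but produce sparser features.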
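
The evaluation measures named in the abstract (accuracy, precision, recall, and F1) follow from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the counts passed to `classification_metrics` are hypothetical, since the abstract does not report the confusion matrix itself, and both function names are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts.

    Assumes every denominator is nonzero, i.e. at least one predicted
    positive and at least one actual positive.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not taken from the study).
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)

# Consistency check using the precision and recall reported in the abstract.
reported_f1 = f1_from(0.8550, 0.8102)
```

With these definitions, the reported precision of 85.50% and recall of 81.02% give an F1 of about 83.20%, which agrees with the F1 measure stated in the abstract.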