Trend Analysis of Large-Scale Twitter Data Based on Witnesses during a Hazardous Event: A Case Study on California Wildfire Evacuation

Social media data have created a paradigm shift in assessing situational awareness during natural disasters and emergencies such as wildfires, hurricanes, and tropical storms. Twitter, as an emerging data source, is an effective and innovative digital platform for observing trends from the perspective of social media users who are direct or indirect witnesses of a calamitous event. This paper collects and analyzes Twitter data related to the recent wildfire in California to perform a trend analysis by classifying firsthand and credible information from Twitter users. Tweets on the wildfire are classified by witness type into two groups: 1) direct witnesses and 2) indirect witnesses. The collected and analyzed information can help law enforcement agencies and humanitarian organizations communicate and verify situational awareness during wildfire hazards. Trend analysis here is an aggregated approach that combines sentiment analysis and topic modeling, performed through domain-expert manual annotation and machine learning. It ultimately builds a fine-grained analysis to assess evacuation routes and provide valuable information to firsthand emergency responders.


Introduction
Data analytics has caused a paradigm shift in the transportation domain, specifically in areas such as safety, operation, control, and development [1]. Data-driven approaches have proven effective in extracting in-depth information suitable for transportation decision-making [2] [3] [4] [5]. Social media platforms have widened the sources of crisis-related information [6]. Accordingly, research on the use of communication tools during a crisis has emerged, given the attention any crisis attracts on social media [7]. As a microblogging tool, Twitter is considered an actively influential source of information that raises situational awareness regarding a crisis [8]. To establish this relation, the accuracy of information posted on Twitter during an emergency was studied [9]; the reliable detection of rumors indicates that Twitter can be a trustworthy source of information during a crisis. Notably, the National Police Headquarters in Nepal confirmed the reliability of Twitter as a source of information during the 2015 emergency [10]. A recent study shows that microblogging platforms used during California wildfires can be a reliable source for predicting air quality [11].
From trend analyses performed on Twitter users' responses to crises, it can be observed that the scale of an emergency and the behavior of Twitter users are directly related. Overwhelming natural disasters produce spontaneous reactions from the general public that fall into different forms of situational resonance, such as offering help or sharing information [12]. Prior to a natural disaster, it can be assumed that a large proportion of the public on Twitter is caught off guard by the crisis; nevertheless, the number of tweets increases exponentially right after a natural disaster hits. Hashtags initiated on Twitter capture the sympathy, anger, fear, and hope that Twitter users hold for the victims [6].
Trend analysis is a netnography-based analysis that observes participants' reactions and collects and analyzes computer-mediated data from the source. Trend analysis based on Twitter data begins with collecting tweets, which must then be filtered for relevancy. Twitcident, a framework for filtering and analyzing tweets concerning natural disasters, has been used for this purpose. An alternative to Twitcident is Twitter's search API, which relies on hashtags; hashtags are typically initiated during a storm by citizens who call for action, offer aid, share information, or seek help. On a more sophisticated level, tweets can be collected through Artificial Intelligence for Disaster Response (AIDR) [13]. Following data collection, further filtering must be applied for accuracy: data are annotated to decide the relevancy of the tweets, for example through the MicroMappers platform and further through CrowdFlower. The data are then sorted into study-specific categories using classifiers such as Naive Bayes (NB), Support Vector Machines (SVM), and Random Forest (RF). Because of the character limit per tweet, users may resort to informal language to stay within the limit; a spell checker can address this, as in the study of [13], where SCOWL (Spell Checker Oriented Word Lists), consisting of 349,554 English words, was used to normalize informal language in tweets. For the baseline, unigrams and bigrams were used. Features typically used in sentiment analysis were also included, namely features representing information from a sentiment lexicon and part-of-speech (POS) features. Finally, other features may be incorporated to capture some of the more domain-specific language of microblogging.
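The relevance-filtering and baseline feature-extraction steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline of [13]; the keyword list and regular expressions here are assumptions for demonstration only.

```python
import re

# Hypothetical relevance keywords/hashtags (illustrative, not the study's list)
CRISIS_TERMS = {"wildfire", "evacuation", "californiawildfire", "campfire"}

def is_relevant(tweet_text):
    """Return True if the tweet mentions any crisis keyword or hashtag."""
    tokens = re.findall(r"#?\w+", tweet_text.lower())
    normalized = {t.lstrip("#") for t in tokens}
    return bool(normalized & CRISIS_TERMS)

def ngrams(tweet_text, n):
    """Extract word n-grams (unigrams when n=1, bigrams when n=2) for the baseline."""
    words = re.findall(r"\w+", tweet_text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

The n-gram lists produced this way would then be combined with lexicon and POS features before being fed to an NB, SVM, or RF classifier.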
As a branch of Natural Language Processing, and as a process of sentence-level sentiment analysis [14] [15], sentiment analysis of Twitter has drawn researchers' interest for extracting relevant information and responses during a crisis [16]. Twitter sentiment analysis applies classifiers, built with machine learning models, to relevant data extracted from Twitter [17]. Sentiment analysis relies on subjective data that contain the opinions, reactions, and feelings of people concerning an occurring disaster. Microblogging features such as emoticons, hashtags, and part-of-speech data help assess the sentiment in a tweet [15]. Data processing begins with machine learning models that extract tweets based on certain criteria. However, the models must be trained carefully to prevent overgeneralization that would yield inaccurate sentiment analysis. For example, a model that treats the word "rain" as a negative crisis signal will not provide accurate analysis during a wildfire, considering that rain extinguishes the fire. This notion motivates domain-dependent sentiment analysis [16], whose design accounts for the domain so that sentiment can be accurately assessed from reaction tweets during a hurricane.
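The "rain" example can be made concrete with a toy lexicon-based scorer in which a domain-specific lexicon overrides a general one. The lexicon entries and scores below are invented for illustration; a real system would use a full sentiment lexicon and a learned classifier.

```python
# Toy general-purpose sentiment lexicon (illustrative entries only)
GENERAL_LEXICON = {"hope": 1, "safe": 1, "fear": -1, "destroyed": -1, "rain": -1}

# Domain override for wildfires: rain suppresses fire, so it reads as positive
WILDFIRE_OVERRIDES = {"rain": 1, "contained": 1, "spread": -1}

def domain_sentiment(text, overrides=WILDFIRE_OVERRIDES):
    """Score a tweet with the general lexicon, letting domain entries override it."""
    lexicon = {**GENERAL_LEXICON, **overrides}
    score = sum(lexicon.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With the wildfire overrides, a tweet mentioning rain scores positive; with the general lexicon alone (pass `overrides={}`), the same word pulls the score negative, which illustrates the domain dependence.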
The causes of California wildfires, and the scale of the danger they pose, depend on varying factors. These variables are linked to seasonal and regional conditions such as atmospheric aridity in summer and extreme offshore winds blowing westward from the desert toward coastal areas [18]. However, topography alone does not spontaneously ignite wildfires; in fact, anthropogenic warming has contributed to California wildfires since the late 1800s. The scale of a wildfire correlates with the scale of the damage. As wildfires result in land erosion and sediment and water discharge, the assessed damage of the 2018 Thomas and Camp Fires on ranchers and farmers amounted to 430,000 acres of land combined [19] [20]. The ecological damage was evident in loosened soil, and the resulting economic damage to farmers reached a scale of $170 million. Casualties are inevitable; however, they can be reduced. One study found that shelter and evacuation support during wildfires most often comes from friends and families, highlighting the strong trust between victims and helpers [21]. This suggests, as a future recommendation, that the accountable states should provide formal evacuation strategies.
Improving situational awareness based on public opinion by analyzing big data extracted from social media in real time is a challenging task [22] [23] [24]. Few sources can be harnessed to understand the pulse of the populace better than a social media platform. In terms of availability and popularity, Twitter is a rich source of users' behavior and opinions, especially during a crisis event such as a wildfire. Although analytical approaches have been developed to process Twitter data, a systematic framework accounting for granularity, one that can efficiently monitor and predict the outcome of hazardous events based on users' opinions, has not yet been developed. This study aims to narrow this knowledge gap by developing a framework for trend analysis based on Twitter data collected during the California wildfire.

Methodology
Trend analysis predicts public opinion and situational awareness through the automatic extraction and analysis of large amounts of social media data. Twitter, with its flexible communicative structure, is particularly well suited to trend analysis. A trend analysis paradigm for Twitter data can operate at different levels of granularity; the two most common in social media data analysis are fine-grained and coarse-grained. Fine-grained analysis usually makes tweet-level predictions on domain-independent factors such as sentiment, topics, and emotions, whereas coarse-grained analysis predicts events in real time by aggregating and utilizing fine-grained predictions. In this study, a customized fine-grained analysis has been conducted by aggregating witness detection, comparative sentiment analysis, and topic modeling; together, these three features form a trend analysis of California wildfire tweets. In essence, trend analysis based on geolocation, timeframe, witness-type identification, sentiment analysis, and word clouds can be used to make high-level predictions regarding real-time information such as evacuation routes, severity and damage assessment, and adaptive measures. A framework of the customized fine-grained trend analysis is shown in Figure 1.
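One way the three fine-grained components could be aggregated per tweet and rolled up into coarse counts is sketched below. The record fields and labels are hypothetical names chosen for illustration, not the study's data schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TweetAnalysis:
    """Fine-grained labels attached to one tweet (field names illustrative)."""
    witness: str    # "direct" or "indirect"
    sentiment: str  # "positive", "negative", or "neutral"
    topics: tuple   # high-frequency terms found in the tweet

def trend_summary(analyses):
    """Aggregate tweet-level labels into coarse counts for trend analysis."""
    return {
        "witness": Counter(a.witness for a in analyses),
        "sentiment": Counter(a.sentiment for a in analyses),
        "topics": Counter(t for a in analyses for t in a.topics),
    }
```

Aggregates of this shape are what a coarse-grained layer would consume to track, for example, the share of negative tweets over time.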

Data Collection
In this study, the Twitter search API (accessed via Tweepy) was used to collect wildfire-related tweets. Tweets were collected through keyword matching and hashtag processes: keywords such as wildfire and California, and hashtags such as #wildfire and #CaliforniaWildfire, were used to extract relevant data from the large pool of data within the Twitter network. Geotagged data in the California region and its surroundings during the timeline of the crisis event were categorized to illustrate a holistic picture of the "California Wildfire" on the Twitter platform. A total of 1958 tweets were collected using keyword matching. Initial screening of the raw dataset eliminated 590 tweets containing incomplete information. Based on topical relevance, 114 of the remaining tweets were termed "off-topic", and the other 1254 tweets were deemed "on-topic". The tweets were screened based on geolocation and timestamp tags, i.e., metadata fields. Additionally, the tweets were categorized into original tweets, retweets, and replies to the original source.
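The two-stage screening described above (dropping incomplete records, then separating off-topic from on-topic tweets) can be sketched as follows. The required metadata fields, the dictionary shape of a tweet record, and the on-topic terms are assumptions for illustration; the actual Tweepy query and credentials are not reproduced here.

```python
# Metadata fields assumed necessary for a record to be "complete" (illustrative)
REQUIRED_FIELDS = ("text", "geo", "created_at")

def screen(tweets, on_topic_terms=("wildfire", "california")):
    """Split raw tweet records (dicts) into incomplete, off-topic, and
    on-topic buckets, mirroring the screening stages described above."""
    incomplete, off_topic, on_topic = [], [], []
    for tw in tweets:
        if any(not tw.get(field) for field in REQUIRED_FIELDS):
            incomplete.append(tw)          # missing geotag, timestamp, or text
        elif not any(term in tw["text"].lower() for term in on_topic_terms):
            off_topic.append(tw)           # complete but not about the wildfire
        else:
            on_topic.append(tw)            # retained for analysis
    return incomplete, off_topic, on_topic
```

On the study's dataset this kind of screening is what reduced 1958 raw tweets to 1254 on-topic tweets.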

Data Preprocessing
Twitter data preprocessing is an important step in extracting information from tweets. Data cleansing, the preprocessing strategy in this study, was performed using the Python programming language.
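The exact cleansing steps are not enumerated in this section; the sketch below assumes typical tweet-cleaning operations (removing URLs, user mentions, the retweet marker, and punctuation, then lowercasing) rather than the study's actual script.

```python
import re

def clean_tweet(text):
    """Normalize a raw tweet: strip URLs, mentions, the RT marker, and
    punctuation, then collapse whitespace and lowercase (illustrative steps)."""
    text = re.sub(r"http\S+", " ", text)          # URLs
    text = re.sub(r"@\w+", " ", text)             # user mentions
    text = re.sub(r"\bRT\b", " ", text)           # retweet marker
    text = re.sub(r"[^A-Za-z0-9#\s]", " ", text)  # punctuation and symbols
    return re.sub(r"\s+", " ", text).strip().lower()
```

Hashtag symbols are kept here because the study uses hashtags as signals; a stricter pipeline might strip them as well.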

Witness Categorization
Social media applications like Twitter allow users either to compose their own words in a post or to re-post another user's tweet (i.e., retweet). This study focused on the texts of the on-topic "wildfire" tweets (including original tweets and retweets) in an affected geolocation (the California region) to identify the types of witnesses during the wildfire hazard. Retweets (RT) can be identified directly, since the term "RT@username" appears before the retweeted post. In the retweet network, the nodes are labeled as users who retweet other users' messages or posts, as well as users who are retweeted by others; hence, based on the type of node, one can easily verify the originality of a post. In crisis events, social media users tend to disseminate information related to the existing condition of the affected area, damage, fatalities, rescue updates, evacuation, and improvement of the situation.
Twitter users with original posts related to the crisis event, i.e., the wildfire, were termed direct witnesses, and users who shared posts from news outlets or from direct witnesses were termed indirect witnesses. Building a network of direct and indirect witnesses from "wildfire" tweets and retweets is important for gauging the value of the information disseminated through Twitter.
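The RT-marker rule above can be expressed as a simple classifier. This sketch covers only the retweet marker; as noted, indirect witnesses also include users who share news-outlet content in other ways, which a fuller implementation would have to detect separately.

```python
import re

# Retweets begin with "RT @username" (or "RT@username"), per the convention above
RT_PATTERN = re.compile(r"^RT\s*@\w+")

def witness_type(tweet_text):
    """Label the tweet's author: 'indirect' for a retweet of another user's
    post, 'direct' for an original post."""
    return "indirect" if RT_PATTERN.match(tweet_text) else "direct"
```

Applied over the on-topic tweets, this rule yields the per-user counts that the witness network is built from.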

Sentiment Analysis
Sentiment analysis of microblogging social media platforms plays a vital role in disaster management. Assessing the dynamic polarity of sentiment on Twitter over the course of a disaster is an effective disaster management strategy [25]. Sentiment analysis labels each tweet as positive, negative, or neutral.
Analyzing the tweets collected during the California wildfire by sentiment may improve decision-making regarding resource assistance and requests, humanitarian efforts, and disaster recovery. A Random Forest classifier was trained on the dataset (80% training data, 20% testing data) to check the accuracy of the sentiment analysis against the manual annotation. However, the unavailability of large-scale preprocessed data in the wildfire domain led to a lower-than-expected validation accuracy.
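The validation setup (80/20 split, Random Forest over tweet text) can be sketched with scikit-learn as below. The toy tweets and labels are invented stand-ins for the manually annotated dataset, and the TF-IDF unigram/bigram features are an assumption; the study does not specify its feature representation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy annotated tweets standing in for the manually labeled wildfire dataset
texts = [
    "homes destroyed thousands forced to evacuate",
    "smoke everywhere a terrifying night",
    "fire jumped the highway devastating loss",
    "ash falling scared for my family",
    "firefighters contained the blaze great work",
    "lost dog reunited with owners heartwarming",
    "community fundraiser for victims inspiring",
    "rain finally helping crews amazing news",
    "fema issues advisory on federal aid",
    "county posts updated evacuation map",
]
labels = ["negative"] * 4 + ["positive"] * 4 + ["neutral"] * 2

# 80/20 train/test split, mirroring the setup described above
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigram + bigram features
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(vectorizer.fit_transform(X_train), y_train)

accuracy = accuracy_score(y_test, clf.predict(vectorizer.transform(X_test)))
```

With a dataset this small the accuracy is not meaningful; the point is the shape of the validation loop, not the number it produces.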

Topic Modeling
A word cloud visualizes language or text data based on the frequency distribution of the words in a large dataset. In this paper, a word cloud was formed from the high-frequency words extracted from the preprocessed datasets. The word cloud developed from the tweets indicates the intensity and popularity of the words used in disseminating firsthand information from the direct and indirect witnesses of the hazardous event, i.e., the California wildfire. Finally, a topic model is formed based on the word cloud.
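The frequency counting behind the word cloud can be sketched as follows. The minimal stop-word list is an assumption for illustration; the resulting mapping is the kind of input a visualization library (e.g., the wordcloud package's frequency-based generator) would consume.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "in", "of", "to", "is", "and"}  # minimal illustrative list

def word_frequencies(tweets):
    """Count word frequencies across cleaned tweets, skipping stop words;
    the result can drive a word cloud or a frequency-based topic summary."""
    counts = Counter()
    for text in tweets:
        for word in re.findall(r"[a-z#]+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts
```

Normalizing these counts by the total yields the percentage shares reported for the most frequent topics.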

Result and Discussion
The domain-specific trend analysis approach designed for tweets posted during a wildfire aggregates the results obtained from witness detection, sentiment analysis, and the topic model. The numbers of direct and indirect witnesses were identified based on the originality and resharing of tweets: 476 tweets came from direct witnesses and 780 from indirect witnesses in the total sample. As shown in Table 1, indirect witnesses tend to retweet information from news outlets, such as CBS and the Southern California weather channel, as well as sources such as Cal_Fire, about the status of a wildfire or evacuation-related information such as evacuation alerts or routes.
On the other hand, direct witnesses can be further categorized into eyewitnesses reporting their own observations and personal experience, and witnesses who gather information from several outlets and write their own informational tweets. Information gathered from witness categorization can provide an innovative source for emergency responders racing against time to devise safe evacuation plans and pin down the affected locations.
As summarized in Figure 2, the sentiment analysis of the "on-topic" tweets shows that as the intensity of the wildfire increased in the California region, negative tweets (66%) became dominant. For instance, tweets such as "A new wildfire in California exploded in a matter of hours and forced tens of thousands from their homes" and "Forest management can do more than just protect against wildfire in California" indicated the negative impact on wildfire victims during the crisis. Tweets like "FEMA Warns California Wildfire Victims About Federal Aid Scams" could not be labeled as positive or negative and were therefore labeled neutral (11%). However, Twitter users also tended to share positivity by expressing positive sentiment (23%); for example, tweets like "A heartwarming GIVE FEARLESSLY and INFLUENCE POSITIVELY story: Lost dog reunites with family after California wildfire evacuation" spread positivity during or after the crisis period. To validate the result of the sentiment analysis, a supervised machine learning approach using a Random Forest classifier was implemented, yielding 76% model accuracy with 82% precision, 90% recall, and an 86% F1-score.
The topic model is an important feature of the trend analysis, as it acts as a starting point for any coarse-grained analysis. The topic model provides the most frequently used topics present in the extracted data. As shown in Figure 3 and Figure 4, the most important and frequent topics extracted from the dataset included the words California (28%), wildfire (26%), fire (6%), and wind (4%). The complete word cloud based on the frequencies of the most relevant "on-topic" words is summarized in Figure 4.
Like much social media data analysis research, this study suffers from the lack of a high-resolution dataset. Additionally, categorizing the data based on geolocation and timestamp is a cumbersome process due to the lack of clarity in the raw data. Coarse-grained analysis requires a large-scale dataset with precise geotags and timestamps; this study therefore focused only on the fine-grained analysis of the available data. The compiled fine-grained analysis can nonetheless form a preliminary assessment of the intensity of the wildfire based on Twitter reactions.

Conclusion
The trend analysis conducted in this study tested the developed approach for identifying tweets that carry useful information related to evacuation, emergency response, and disaster management, reflecting actual situations during wildfires. With the available dataset extracted from Twitter, a fine-grained analysis was performed by combining witness detection, comparative sentiment analysis, and topic modeling based on word-frequency detection. Witness categorization distinguished users who relay information and updates about the crisis from users who express personal observations. The sentiment of most of the tweets was deemed negative, which aligns with the general ambience of an ongoing crisis. As for the topic model, words related to the location of the crisis, as well as keywords describing the crisis itself, were the most frequent. In this sense, the analysis promises a practical source of updates and information for firsthand emergency responders who manage evacuation plans and locate affected areas to dispatch help. An uninterrupted flow of data and a faster processing pipeline could upgrade the trend analysis, allowing the dissemination of real-time information from reliable sources while a crisis is ongoing. A coarse-grained, inclusive analysis may be possible in the future based on the results achieved in this study. Additionally, witness levels could be further distinguished by enhancing the screening process to verify the authenticity of the information collected from the tweets.