Media Reports of the COVID-19 Pandemic: A Computational Text Analysis of English Reports in China, the UK, and the US

This study explored how three news outlets, China Daily (CD), Cable News Network (CNN), and Daily Mail (DM), reported the COVID-19 pandemic. Mainstream media is a credible communication path for guiding public attention on COVID-19. Computational text analysis contributes to understanding media activities about the pandemic and promotes health information communication. Word frequency statistics and lexical diversity highlighted how pandemic reports changed during the early outbreak. A cluster analysis illustrated the frequency of, and the semantic relationships between, the high-frequency words in CD, CNN, and DM reports. Sentiment analysis, based on natural language processing, was applied both to whole headlines and to the individual words in the headlines. This study also discussed similarities and differences in the coverage by the three media outlets at various stages of the outbreak. All three outlets covered the pandemic comprehensively. Because they are based in different countries, their focus and their number of reports differed at different stages. The richness of the vocabulary and the degree of emotion are related to their media attributes. These results can help health departments exchange information, guide accurate public awareness, and eliminate public fears regarding misconceptions about the pandemic.


Introduction
Worldwide, the media's reports on significant events have greatly influenced the public's perception, most recently evidenced by the daily coverage of the COVID-19 pandemic. In late December 2019, multiple cases with symptoms of unexplained viral pneumonia occurred in Wuhan, China. By February 2020, the localized outbreak in China had evolved into a global outbreak.
People can quickly obtain an unprecedented amount of content from online platforms beyond their inner social networks. Since everyone can communicate their opinion on social media platforms, these sites become burdened with misinformation, including the spread of rumors and "fake news", which are often difficult to verify (Bode & Vraga, 2018; van der Meer & Jin, 2020). Users tend to seek information that adheres to their views of the world and to ignore opposing information, establishing what is known as an "echo chamber" (Mocanu, Rossi, Zhang, Karsai, & Quattrociocchi, 2015). Algorithms mediate and promote content according to users' preferences and attitudes, thereby facilitating information exchange (Kulshrestha et al., 2017). This mediation profoundly impacts the construction of social concepts and narrative frameworks. The spread of uncertain information and concepts can cause group differentiation and negatively affect user emotions (Bakshy, Messing, & Adamic, 2015). Thus, the spread of misinformation in mainstream media is dangerous, perhaps even fatal.
Communication researchers have found that respondents who tend to obtain information from mainstream media are more aware of the disease's lethality and how to protect themselves (Ball & Maxmen, 2020). In mid-March 2020, Fox News reported that hydroxychloroquine was effective against COVID-19. In response to the Fox News report, Stanford University clarified that the author was not a consultant and that the university was not involved. Despite this correction, the misinformation spread widely on social media platforms, to the extent that it drowned out accurate information. This misinformation has caused medication shortages, poisoning, and death (Donovan, 2020). The spread of misinformation in mainstream media is dangerous and can even turn out deadly.
According to the media dependency theory (Ball-Rokeach, 1985; Ball-Rokeach, 1998), people mainly depend on the media to acquire the information they need in hazardous conditions (Jang & Baek, 2019; Seo, 2019). The mainstream media's impact is evident, so the information they release should be as accurate as possible (McCombs & Shaw, 2016). Therefore, examining the differences in the reporting habits of distinct media outlets is important. This includes whether cultural, regional, and media differences have affected the development of the pandemic and people's perception of the pandemic in each country; whether the media outlets disseminate credible information; and whether their reporting promoted panic.
Media from China, the UK, and the US were chosen as research objects. The goal of the study was to understand the patterns by which three major media outlets, one from each country, informed the public. Computational text analysis was performed for mining the data, and visual data analysis was applied to visualize the patterns of the COVID-19 pandemic coverage (DiMaggio, 2015). This study discusses similarities and differences in the way the media from these three countries reported on the COVID-19 pandemic. The following research questions (RQs) were asked: What are the trends in news reports concerning the pandemic (RQ1)? What did these media outlets report at different stages (RQ2)? Was the sentiment expressed in the headlines of their coverage positive or negative (RQ3)?
This study contributes to clarifying national or regional media discourses on the pandemic. It also develops a valuable model for future research on hot news reports, public opinion, and media effects.

Computational Text Analysis
Content analysis of text-based data is generally accepted as a popular method in social sciences (Grimmer & Stewart, 2013). Various computational techniques have been developed in computer science, bioinformatics, psychology, linguistics, and communication (e.g. computational communication research) (Nelson, 2017). Artificial intelligence knowledge, such as natural language processing (NLP), deep learning, and data mining, suggests implicit connections between data and how entities express, infer, predict, and visualize relationships between texts and concepts (Socher et al., 2013). The content analysis framework for health communication is explored in research on new methods suitable for combining computational research and humanities (Medford, Saleh, Sumarsono, Perl, & Lehmann, 2020).
NLP and other machine learning methods provide support and analyses of text data to construct newer and faster computing methods, especially for big data research (Agerri et al., 2015; DiMaggio, 2015; Scharkow, 2013). Additional software packages have been developed to bundle algorithms and simplify their application in conventional text analysis projects (Oh et al., 2020).
Many researchers are still skeptical about the role of computers in processing content. Human behavior is the core content of this type of research, which is not easy to elucidate using simple data-based methods (Hancock, Landrigan, & Silver, 2007). This article provides a hybrid approach for developing language and text analysis, which offers a comprehensive interpretation of the texts and incorporates the rigorous, reliable, and repeatable computational text analysis method.

Data Collection
In this research, Python version 3.7 was used for data processing and analysis (C. Luthra and D. Mittal, 2010). This study applied a computational text analysis approach to coverage published between 9 January 2020 (the first report about the pandemic) and 31 March 2020. At the beginning of the outbreak, the pandemic erupted in China and then spread to Europe and America. This particular period is essential for examining media reactions related to the early COVID-19 outbreak.
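As a minimal sketch of how reports might be restricted to the study window and counted per outlet, the example below uses only the Python standard library. The record layout, the placeholder headlines, and the helper names `in_window` and `reports_per_outlet` are assumptions made for illustration; they are not the study's actual data or pipeline.

```python
from collections import Counter
from datetime import date

# Study window stated in the text: 9 January 2020 to 31 March 2020.
WINDOW_START = date(2020, 1, 9)
WINDOW_END = date(2020, 3, 31)

# Hypothetical records (outlet, publication date, headline); not the study's data.
reports = [
    ("CD", date(2020, 1, 9), "Placeholder headline A"),
    ("CNN", date(2020, 1, 29), "Placeholder headline B"),
    ("DM", date(2020, 2, 10), "Placeholder headline C"),
    ("CNN", date(2020, 4, 2), "Placeholder headline D"),  # outside the window
]


def in_window(day, start=WINDOW_START, end=WINDOW_END):
    """Return True if the publication date falls inside the study window."""
    return start <= day <= end


def reports_per_outlet(records):
    """Count the reports inside the window for each outlet."""
    return Counter(outlet for outlet, day, _ in records if in_window(day))


counts = reports_per_outlet(reports)
```

The same filter could be applied with narrower date ranges to compare stages, once the stage boundaries are fixed.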

Lexical Diversity (LD)
LD is used for content analysis of vocabulary richness (McCarthy & Jarvis, 2010; Yu, 2010). LD is a measure of the number of different words used in a text (Johansson, 2009). One of the most common methods for measuring LD is the type-token ratio (TTR), i.e. the number of unique lexical items (types) divided by the total number of words (tokens) in a text sample (Bates et al., 1988; Fergadiotis, Wright, & West, 2013). This study used LD to compare the vocabulary of the different media outlets.
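The type-token ratio can be illustrated with a few lines of Python. This is a sketch under simple assumptions: the tokenizer keeps only lowercase alphabetic words (so a term such as "COVID-19" would be reduced to "covid"), and the function names `tokenize` and `type_token_ratio` are hypothetical rather than taken from the study.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(text):
    """Return unique tokens (types) divided by all tokens; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Seven tokens, five of them distinct ("the" and "outbreak" repeat).
sample = "The outbreak spread and the outbreak grew"
ttr = type_token_ratio(sample)
```

Because TTR tends to fall as texts get longer, comparisons between outlets are most meaningful when the samples are of similar size.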

Cluster Analysis
This study used keywords such as "COVID-19", "pandemic", etc. to capture the media's coverage. Word frequency analysis was performed, and high-frequency words were discussed. Using VOSviewer, a cluster analysis was performed (van Eck & Waltman, 2010, 2013). The analysis focused on the frequencies of single words in the coverage and visualized these frequencies through cluster analysis in order to identify the most common topics.
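The counting step behind such an analysis can be sketched as follows. The example computes word frequencies and document-level co-occurrence counts for a chosen keyword set, which is the kind of information a term map is built from. It does not reproduce VOSviewer's own processing; the stop-word list, the sample sentences, and the function names are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "on", "as"}


def tokenize(text):
    """Lowercase, keep hyphenated terms such as 'covid-19', drop stop words."""
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [w for w in words if w not in STOPWORDS]


def word_frequencies(docs):
    """Count how often each word occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def cooccurrence(docs, keywords):
    """Count the documents in which each pair of keywords appears together."""
    pairs = Counter()
    for doc in docs:
        present = sorted(set(tokenize(doc)) & set(keywords))
        pairs.update(combinations(present, 2))
    return pairs


# Hypothetical sentences, not taken from the study's dataset.
docs = [
    "Government responds to the outbreak",
    "Outbreak strains medical supplies",
    "Government expands medical testing as the outbreak grows",
]
freq = word_frequencies(docs)
pairs = cooccurrence(docs, {"outbreak", "government", "medical"})
```

Sorting `freq.most_common()` gives the high-to-low ordering of words, and the pair counts indicate which keywords tend to appear together.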

Sentiment Analysis
Sentiment analysis is an NLP method; here it was used to classify the sentiment of news report headlines. This study used two methods for sentiment analysis. First, keywords that appeared in headlines were analyzed. Positive, neutral, and negative words were tallied, based on the study by Pang and Lee (Pang & Lee, 2005).
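The word-level tally can be sketched as below. The two word sets are a hypothetical mini-lexicon invented for this example, since the text does not list the words that were actually used, and `classify_words` and `polar_share` are illustrative names.

```python
import re

# Hypothetical mini-lexicon for illustration only; not the study's word lists.
POSITIVE = {"recover", "hope", "safe"}
NEGATIVE = {"killed", "fear", "deadly"}


def classify_words(headline):
    """Tally positive, neutral, and negative words in one headline."""
    tally = {"positive": 0, "neutral": 0, "negative": 0}
    for word in re.findall(r"[a-z]+", headline.lower()):
        if word in POSITIVE:
            tally["positive"] += 1
        elif word in NEGATIVE:
            tally["negative"] += 1
        else:
            tally["neutral"] += 1
    return tally


def polar_share(tally):
    """Return the proportion of words that are positive or negative."""
    total = sum(tally.values())
    if total == 0:
        return 0.0
    return (tally["positive"] + tally["negative"]) / total


# Hypothetical headline: one positive word, two negative words, four neutral.
tally = classify_words("Deadly virus sparks fear as patients recover")
share = polar_share(tally)
```

Averaging `polar_share` over all headlines of an outlet would give one number per outlet that can be compared across outlets and stages.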
Second, whole headlines were analyzed with a recursive neural network (RNN) model (Socher et al., 2013). This model was developed together with the Stanford Sentiment Treebank and builds on grammatical structure: it is a deep recursive model of semantic composition over a sentiment tree.
Semantic vector spaces for single words are useful, but they cannot express the meaning of longer phrases in a principled way. The recursive model instead relies on the sentence structure to construct a representation of the entire sentence and measures the sentiment according to the meaning of longer phrases.
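The idea of building a sentence representation from its structure can be shown with a toy recursive composition. This is a simplified sketch of a plain recursive network, not the tensor-based model of Socher et al. (2013) and not the model used in this study: the word vectors, the weight matrix `W`, and the bias `B` are made-up values that a real model would learn, and the final sentiment classifier is omitted.

```python
import math

# Hypothetical 2-dimensional word vectors; a trained model would learn these.
WORD_VECTORS = {
    "not": [-0.5, 0.1],
    "very": [0.2, 0.3],
    "good": [0.9, 0.4],
}

# Hypothetical composition weights (2 x 4) and bias; also learned in practice.
W = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
B = [0.0, 0.0]


def compose(left, right):
    """Combine two child vectors into a parent: tanh(W [left; right] + b)."""
    children = left + right  # list concatenation gives a 4-element vector
    return [
        math.tanh(sum(w * c for w, c in zip(row, children)) + bias)
        for row, bias in zip(W, B)
    ]


def encode(tree):
    """Encode a binary parse tree given as a word or a (left, right) tuple."""
    if isinstance(tree, str):
        return WORD_VECTORS[tree]
    left, right = tree
    return compose(encode(left), encode(right))


# "not (very good)": the tree shape decides which words are combined first.
phrase_vector = encode(("not", ("very", "good")))
```

With the same three words, a different bracketing such as `(("not", "very"), "good")` yields a different vector, which is how sentence structure enters the representation.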

News Coverage Trend of COVID-19
From the perspective of global information dissemination, English is still the most widespread language. Therefore, this study chose these three media as research objects.
CNN's reports did not change significantly in Stage 1, although there were considerable increases in the number of reports posted on 29 January, 4 February, and 10 February. The overall trend was comparatively similar to that of DM. Table 1 shows the changes in the number of reports of the three media at different stages.

Word Frequency Statistics and LD
In the word frequency analysis, high-frequency words were sorted from high to low (Figure 2). The LD was calculated based on the number of articles, the word count, and the unique word count published over this period (Tweedie & Baayen, 1998). The three media reported on similar topics about this global pandemic. All media paid attention to the changes in the outbreak in different countries, the number of cases, and the respective government's responses. Owing to geographical reasons, the frequency of both "China" and "Daily" was high in CD. CNN repeatedly featured "the US" and "CNN". "UK" appeared in DM for the same reason. "Trump" and "president" appeared more often on CNN than in the other outlets, suggesting that the President of the United States plays a dominant role in the public sphere. Moreover, Trump keenly expressed his opinions on social media platforms, prominently Twitter: every time he posted a social media statement, he received considerable attention (Lee & Xu, 2018).
According to the content analysis, CNN and DM preferred to use "tell" to quote other people's statements, while CD opted for an objective narrative; thus, "photo(s)" appeared more often. Interestingly, the frequency of "Australian" was very high in DM's reporting.

Cluster Analysis
A cluster analysis was conducted on the high-frequency keywords in the reports between 9 January and 31 March 2020. Through the empirical judgment method, 23 high-frequency keywords with strong descriptiveness and contribution to the topic were selected from CNN, DM, and CD. This highlights the differences between different media concerning frequency and variation (Figures 3-5).
From the cluster analysis, news reports on the pandemic mainly revolved around four themes: 1) the pandemic situation and confirmed cases of COVID-19, 2) how to control the disease, 3) travel concerns, and 4) health and medical treatment concerns.
At this stage, both CD and DM were intensely concerned with Theme 4.
In Stage 2, the outbreak developed into a global pandemic. The focus of the reports shifted from China to global developments and the situation in the respective countries. The coverage was still closely related to these four themes.
The "outbreak" in CD's coverage was still strongly related to "government". CNN's "outbreak" began to connect to "government" and "medical". In DM, "outbreak" was also strongly related to "medical". During this time, CNN and DM discussed their own countries' medical conditions. Figure 4 shows that in DM and CNN, the word "killed" had a high frequency in Stage 1, during the outbreak in China. This term decreased in frequency during

Sentiment Analysis
In a news report, the headline is critical (Blom & Hansen, 2015). It helps the audience decide whether to continue reading. Therefore, sentiment analysis was applied to all headlines in the dataset (Rameshbhai & Paulose, 2019). Two sentiment analysis methods were employed. The resulting values indicate the attitudes and opinions of each medium regarding the outbreak. The first method analyzes every word in the headline. Semantic vector spaces have been extensively used as features for single words (Turney & Pantel, 2010). The larger the proportion of positive and negative words in the headline, the more exaggerated it is.
According to the results (Table 2), the proportion of positive and negative words was the highest in DM as compared to CNN and CD. This indicates that this newspaper preferred to use exaggerated expressions to attract readers. CD had the lowest percentage of positive and negative words as compared to CNN and DM. Its headline expression was closer to the objective narrative. Throughout Stages 1 and 2, the format of news headlines in the dataset did not change much, and the style was relatively consistent.
The other method analyzes the entire headline with the RNN model for sentiment analysis (Oh et al., 2020); the results are shown in Table 3. During Stage 1, the negative emotions expressed or evoked by DM, which used exaggerated emotional vocabulary, were of the highest degree; during the pandemic phase, they decreased slightly but remained at a high level.

Discussion
In a global health crisis, misinformation misleads people, for example into taking untested medication, ignoring public health advice, and even boycotting a prospective vaccination. Misinformation provides a hotbed for the pandemic to attack anywhere. It is impossible to stop the spread of misinformation completely, even though social media platforms have already stepped up their efforts to remove misinformation and lead people to reliable sources (Ball & Maxmen, 2020).
Mainstream media plays a decisive role in guiding people and conveying reliable information.
All three media's reports revolved around three words: "outbreak", "government", and "medical". These were the most used terms in the four main themes in both stages. The strong correlation with "government" was mainly related to government measures. The words are closely associated with the development of the pandemic and government decisions. The proportion of socially active words such as "travel" and "flight" in Theme 3 increased in the second stage, and these words were related.
CD paid more attention to "masks" than the other media. The difference in the frequency of the word "masks" depended on whether the different countries required people to wear masks as a public health measure. Perhaps CD mentioned masks more often because China emphasized that wearing a mask is the most effective protective measure. In Stage 1, the United Kingdom and the United States held that masks were only necessary for sick people. However, in Stage 2, British and American media reported more on masks, which significantly impacted the audience's decision to wear masks.
Rumors triggered by emergencies all have an inevitable life cycle: they develop from inception and gradually disappear, followed by the birth of new rumors (Shibutani, 1966). As a special kind of unexpected event, the pandemic has a prolonged impact cycle; therefore, the rumors it triggers can be expected to follow a prolonged cycle as well.
CNN had the highest overall vocabulary richness. DM's vocabulary richness was slightly lower than that of CD, even though CNN and DM both use English as their mother tongue. As part of the so-called yellow press, DM uses simplified language and exaggerated headlines.
According to the sentiment analysis, the proportion of positive and negative words in CD was smaller than in the other two outlets; CD preferred neutral words. In DM's Stage 1 headlines, negative words outnumbered positive words, indicating that DM had a particularly negative emotion toward China's localized outbreak. In Stage 2, although positive words slightly outnumbered negative words, both made up a large proportion. The DM headlines were the most exaggerated in order to garner attention. Regarding overall sentiment, CNN's and DM's reports focused heavily on the Chinese outbreak and were very negative overall. Once the disease had engulfed the whole world, CNN's negative headlines decreased significantly. When the outbreak became more severe in their own countries, these outlets expressed less negative sentiment. CD had the smallest before-and-after change in positive and negative sentiment of all three media outlets. In Stage 2, when the outbreak in China was already under control and the global outbreak was growing, the negative sentiment of CD increased by almost 2% from Stage 1. The negative sentiment was mainly related to the widespread knowledge of the pandemic and the current situation. In Stage 1, the negative sentiment reported by CNN and DM focused on China's pandemic. In Stage 2, CD's negative sentiment focused on its own pandemic situation, the spread of protection knowledge, and the pandemic situation in other countries.
The sentiment analysis and LD correlated with the media attributes. CNN targeted the more educated public. CD is officially operated by the state; its positive and negative words were not rich, its coverage was mainly based on objective narrative and a strict attitude, and its news sources were well-founded. DM relied heavily on exaggerated headlines to arouse potential readers' interest and tempt them to invest in the media. The average literacy level of its audience was lower than that of the others, meaning that its words were simple to read. However, both extremes, the extremely positive and the extremely negative, were numerous in its coverage.
Headlines with more positive and negative words were more stimulating and conveyed their own attitudes. However, news content published during outbreaks is likely to spread quickly and generate public opinions. If the media target an audience with a low overall literacy level, such news is more likely to promote the spreading of rumors.
Individuals' need for pandemic-related information rises rapidly because it concerns their safety and rights. If the information cannot be obtained from the authoritative official channels, the public will blindly adopt information from other channels; the sense of uncertainty and insecurity will gradually deepen, and public opinions and rumors will get out of control and even threaten people's lives. Mainstream media, as the leading voice channel for public affairs communication, should assume corresponding responsibility. Through objective and truthful reporting of emergencies, media outlets should reasonably help the public vent dissatisfaction and express opinions, thereby alleviating or eliminating public dissatisfaction and insecurity and maintaining stable social relationships.
Researchers are trying to use interdisciplinary methods for text analysis. Through computational text analysis, researchers express, infer, predict, and visualize the relationship between text and concepts. In line with people-oriented communication studies, researchers are finding new ways to combine computational research and humanism within the content analysis framework in health communication.

Implications
This study had some limitations. First, the length of the articles varied greatly, which may have affected data analysis. Analyzing articles of a similar length from the same date on the same topic would yield more accurate results. Second, as data were mined only from CNN, DM, and CD, there is still much room for optimizing data volume and data dimensions. Data mining on social media networks over the same timeline is necessary, and these data should be combined with media report data (Demszky et al., 2019; Ordun, Purushotham, & Raff, 2020). The final step is to formulate a practical management and prediction model through machine learning. For instance, understanding users' engagement with COVID-19-related tags on social media platforms and understanding users' sentiment dynamics behind the COVID-19 pandemic could help explore mainstream media's effect in a crisis and design more efficient social behavior models for addressing misinformation. Lastly, sentiment analysis is not a magic wand. It depends heavily on the psychological theories behind it; for example, whether a sentiment analysis is based on discrete emotions (Ekman & Friesen, 1971) or valence and arousal (Kim & Klinger, 2018).