A Comparative Study of Keywords and Sentiments of Abstracts by Python Programs

Four corpora are created to investigate the self-mentions, keywords and sentiments of abstracts. First, self-mentions are categorized to examine the authorial interactions with the reader. Then, high-frequency words and keywords are studied with different Python programs and the software AntConc. The keywords generated with the WordCloud and TF-IDF-LDA methods show a definite relation with the high-frequency words generated by Jieba_Counter and NLTK FreqDist. Further, sentiment analysis is performed with SnowNLP and TextBlob, which yield different results: the former supports the authorial interactions with the reader, and the latter indicates increased factual information. Finally, verification with reference corpora validates the consistency of the sentiment analysis by these two methods. The research suggests that the methods for generating high-frequency words, generating keywords and analyzing sentiment be selected discriminatively, since different methods generate different results; meanwhile, the study verifies that objectivity remains in the writing of abstracts. The investigation is conducive to the choice of keyword generation methods and of self-mentions in writing.


Introduction
Abstracts, as an indispensable part of academic writing, serve as a stepping stone for the reader to find important information and to decide whether to read the whole paper. They have been studied by the Swalesian School under the framework of the IMRD move pattern and its variations. However, these views are seldom verified quantitatively to show objectivity or subjectivity, although Hyland (2005) holds that the authorial interactions with the reader in abstracts are realized through various self-referencing strategies. Some related sentiment analysis can be found in the work of several scholars (Batool et al., 2013; Chong et al., 2014; Tyagi & Tripathi, 2019; Dubey, 2020). However, little research has been performed on the sentiments of abstracts to show the interactions between the author and the reader.
Considering this situation in the academic writing of abstracts and keywords, we select abstracts from several international agricultural journals to investigate the variations of keywords and the differences in the sentiments of abstracts produced by different Python libraries.
Previous studies on keyword generation can be found in the work of several scholars (Joshi & Motwani, 2006; Thomaidou & Vazirgiannis, 2011; Hussey et al., 2012; Liu et al., 2014; Savva et al., 2014; Scholz et al., 2019; Arora & Kumar, 2019; Zheng & Sun, 2019; Thushara et al., 2019). Among these researchers, Scholz et al. (2019) propose an automated approach for generating keywords for Sponsored Search Advertising based on their keyword generation algorithms. Zheng & Sun (2019) use three properties of candidate keywords (relevance, coverage, and evolvement), together with active learning and multiple-instance learning, to follow the main topics of tweets as events develop. Few researchers have assessed the reliability of the keywords generated by different Python programs.
In terms of genres, the study of self-mentions ranges from research articles (Khedri, 2016; McGrath, 2016; Wu & Zhu, 2015; Chen, 2020), research article abstracts (Friginal & Mustafa, 2017; Bonn & Swales, 2007), presentations (Zareva, 2013), speeches (Albalat-Mascarell & Carrió-Pastor, 2019) and the introduction part of academic writing (Loi, 2010; Wang & Yang, 2015; Tankó, 2017) to literature reviews (Soler-Monreal, 2015) and personal statements (Li & Deng, 2019). However, these studies of self-mentions have not covered the sentiments of abstracts and their self-reference. This paper intends to compare the keywords generated by WordCloud and TF-IDF-LDA and to explore the sentiments of abstracts generated by SnowNLP and TextBlob, in order to show whether the programs are reliable and whether authorial interactions with the reader can be improved by self-mentions. The next part, Section 2, presents the research materials and methods of the study; it is followed by the results for the high-frequency words by Jieba_Counter; Section 4 discusses the research results; and the last part concludes the paper.

Materials
The research materials we selected are abstracts from international agricultural journals. The journal names, the authors' countries and the number of abstracts in each journal are listed in Table 2. The raw materials include the author, journal name, volume, year, URL, ISSN and DOI, which are deleted with Python programs so that only the content of the abstracts is retained for obtaining keywords and sentiments. Altogether, 460 abstracts were selected from the journals, and 451 abstracts were obtained after processing by the programs. A corpus named INC is created from these abstracts. In order to verify the sentiment results, we build another corpus, named CHC, as a reference corpus; it is made up of 462 abstracts processed from 468 abstracts by Python programs. Both are raw corpora without annotation or POS tags.
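The cleaning step described above can be sketched as follows. This is a minimal illustration, not the authors' program: the layout of the raw records is not given in the paper, so the field labels in `METADATA_PREFIXES` and the sample records are hypothetical. Records left empty after cleaning are dropped, which is one way a count could shrink; the paper does not state why some abstracts were lost in processing.

```python
# Hypothetical field labels (assumption): the real layout of the raw records is not given.
METADATA_PREFIXES = ("Author:", "Journal:", "Volume:", "Year:", "URL:", "ISSN:", "DOI:")


def extract_abstract(record: str) -> str:
    """Drop metadata lines and return the remaining abstract text as one string."""
    kept = [
        line.strip()
        for line in record.splitlines()
        if line.strip() and not line.strip().startswith(METADATA_PREFIXES)
    ]
    return " ".join(kept)


def build_corpus(records):
    """Clean every record and keep only those that still contain abstract text."""
    cleaned = (extract_abstract(record) for record in records)
    return [text for text in cleaned if text]


# Invented sample records, for illustration only.
raw_records = [
    "Author: A. Example\nJournal: Example Journal\nDOI: 10.0000/example\n"
    "Cover crops increased soil water.",
    "Author: B. Example\nURL: https://example.org/1\nISSN: 0000-0000",  # no abstract text
]
corpus = build_corpus(raw_records)
print(len(raw_records), "records ->", len(corpus), "abstracts")
```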
The overall statistics of the two corpora are listed in Table 3.

Methods
After building the INC corpus, we investigate its high-frequency words and keywords by comparing several methods. The high-frequency words are generated by Jieba_Counter and NLTK FreqDist, and then cross-checked with the WordCloud and TF-IDF-LDA methods respectively. We apply the Jieba_Counter and NLTK FreqDist methods according to the following procedures. For Jieba_Counter, Jieba is first used to split the text into words with NLTK stopwords loaded, then symbols and punctuation are deleted, and finally Counter is imported to generate a list of high-frequency words. For NLTK FreqDist, the high-frequency words are generated with NLTK Word Tokenize and FreqDist, with NLTK stopwords and these additional stopwords and symbols loaded: "'.', '(', 'NO', 'The', ')', 'N', '%', '&', ';', ',', '1', 'l.', 'n', 'kg'". The high-frequency words are then visualized with Numpy and Matplotlib; only the first 15 high-frequency words are selected for the visualization. The results of Jieba_Counter and NLTK FreqDist are cross-checked and confirmed against the word list and frequencies of AntConc.
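The counting procedure can be illustrated with a short, dependency-free sketch. It is not the authors' code: a regular expression stands in for the Jieba and NLTK tokenizers (with those libraries installed, `jieba.lcut` or `nltk.word_tokenize` could take its place, and `nltk.FreqDist` could replace `Counter`), the stopword set is a small stand-in for the NLTK list, and the sample abstracts are invented. Lowercasing is a choice made here for simplicity.

```python
import re
from collections import Counter

# Stand-in stopword set (assumption); the paper loads the NLTK English stopword list
# plus extra tokens such as 'kg' and 'n'.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "was", "were", "by", "with", "kg", "n"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only, which drops symbols and digits."""
    return re.findall(r"[a-z]+", text.lower())


def high_frequency_words(texts, top_n=15):
    """Count non-stopword tokens over all texts and return the top_n (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in tokenize(text) if tok not in STOPWORDS)
    return counts.most_common(top_n)


# Invented sample abstracts, for illustration only.
abstracts = [
    "Soil water increased the yield of the crop.",
    "Crop yield and soil compaction were measured in the study.",
    "The yield was 12 kg per plot.",
]
top_words = high_frequency_words(abstracts, top_n=3)
print(top_words)
```

Running two independent counters over the same cleaned text and comparing the resulting lists is the same kind of cross-check the paper performs against AntConc.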
WordCloud keywords and TF-IDF-LDA topic keywords are produced as follows. First, NLTK Word Tokenize is used to split the abstracts into words, with NLTK stopwords and the additional stopwords "'and', 'of', 'the', 'in the', 'for ', 'in', 'with', 'at', 'by', 'under', 'wa', 'were', 'as', 'on', 'to the', 'kg', 'ha'" loaded, and then Python WordCloud and Matplotlib are imported to visualize the first 15 high-frequency words. TF-IDF-LDA topic keywords are generated by using TF-IDF and the LDA model with NLTK stopwords loaded. After several tests with more stopwords added, five topics are set, with 10 keywords generated for each topic.
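To make the TF-IDF part of this pipeline concrete, the sketch below computes TF-IDF scores by hand for a few invented, already tokenized documents. It uses one common variant (term frequency times the natural log of N over document frequency); libraries often add smoothing or normalization, and the paper does not say which implementation was used. The LDA step that groups terms into five topics is not reproduced here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    tf = term count / document length; idf = ln(N / document frequency).
    This is one common variant, chosen here for illustration.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(score_map, k=3):
    """Rank terms by descending score, breaking ties alphabetically."""
    ranked = sorted(score_map.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Invented documents, for illustration only.
docs = [
    ["soil", "water", "stress", "stress"],
    ["soil", "yield", "model"],
    ["soil", "yield", "growth", "growth"],
]
scores = tf_idf(docs)
print(top_keywords(scores[0], k=2))
```

A term that occurs in every document, such as "soil" in this toy example, receives a score of zero, which is why TF-IDF keywords need not coincide with raw high-frequency words.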
In order to calculate the sentiments of texts with self-mentions, we first extract the whole sentences and phrases containing self-mentions and then create two corpora, named INSM and CHSM, from INC and CHC respectively. The frequencies of each self-mention in INC and CHC are shown in Table 4. We have excluded those self-mentions that do not refer to the author, to ensure that the self-mentions reflect the authorial interactions with the reader. The overall statistics of INSM and CHSM are listed in Table 5. The statistics in this table are calculated in the same way as those for INC and CHC. These data suggest that CHSM can be used as a reference corpus in terms of comparability.
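The extraction of sentences with self-mentions might look like the sketch below. The list in `SELF_MENTIONS` is hypothetical (the paper's actual inventory is in its Table 4), the sentence splitter is deliberately naive, and the sample abstract is invented. Excluding self-mentions that do not refer to the author, as the paper does, still requires a manual check that this sketch does not attempt.

```python
import re

# Hypothetical self-mention list (assumption); the paper's inventory is given in its Table 4.
SELF_MENTIONS = ["we", "our", "us", "this study", "this paper"]

# Word boundaries prevent matches inside longer words, e.g. "us" inside "status".
SELF_MENTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in SELF_MENTIONS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (not the authors' method)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def self_mention_sentences(text):
    """Return the sentences that contain at least one self-mention."""
    return [s for s in split_sentences(text) if SELF_MENTION_RE.search(s)]


# Invented abstract, for illustration only.
abstract = (
    "Cover crops affect soil water. We measured yield over two years. "
    "This study shows that biomass increased. Status was recorded weekly."
)
selected = self_mention_sentences(abstract)
print(selected)
```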
Sentiment analysis is conducted with SnowNLP and TextBlob respectively.
Both libraries are designed for sentiment analysis, but they use different scoring systems to show the sentiments of texts; therefore, we use both libraries to compare the sentiment results quantitatively. First, the sentiment analysis of INC and CHC is performed, and then the sentiment analysis of INSM and CHSM is conducted, in order to explore whether the sentiment scores reflect the authorial interactions with the reader. The average sentiments are also calculated with Numpy and visualized with Matplotlib, together with the sentiments of the individual texts in these corpora.
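Averaging per-text scores over a corpus can be sketched with a pluggable scoring function. Because the two libraries may not be installed, a toy lexicon scorer (an assumption, not either library's algorithm) stands in for them here, and `statistics.mean` replaces Numpy. The difference in scales matters when comparing results: TextBlob reports polarity in [-1, 1] and subjectivity in [0, 1], while SnowNLP reports a single score in [0, 1].

```python
from statistics import mean

# Toy lexicon (assumption), standing in for a real sentiment library.
POSITIVE = {"increased", "improved", "high"}
NEGATIVE = {"decreased", "reduced", "stress"}


def toy_polarity(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    tokens = text.lower().replace(".", " ").split()
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def average_score(texts, scorer):
    """Average a per-text sentiment score over a corpus."""
    return mean(scorer(text) for text in texts)


# With the libraries installed, the scorer could instead be, for example:
#   lambda t: TextBlob(t).sentiment.polarity   # range [-1, 1]
#   lambda t: SnowNLP(t).sentiments            # range [0, 1]

# Invented sample texts, for illustration only.
sample_corpus = [
    "Yield increased under high moisture.",
    "Water stress reduced growth.",
    "Plots were sampled weekly.",
]
print(average_score(sample_corpus, toy_polarity))
```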

High-Frequency Words and Keywords in INC
The high-frequency words generated by NLTK FreqDist in INC are yield, soil, crop, water, study, production, cover, increased, minus, results, growing, grain, biomass, compacted, and two, as shown in Figure 1. The keywords generated by WordCloud in INC are yield, soil, plant, cover, crop, rate, treatment, using, cultivar, model, increased, this study, year, field, and high, which are shown in Figure 2. The high-frequency words calculated by Jieba_Counter, AntConc and NLTK FreqDist and the WordCloud keywords are listed in Table 6. The keywords generated by TF-IDF-LDA are listed with their respective TF-IDF values in the following.
The first three words with highest TF-IDF values in each topic are stress, growth, data, model, moisture, potential, results, water, different, soil, yield, using, study, two, and plant.

Sentiment Results
The

High-Frequency Words and Keywords in INC
The high-frequency words generated by Jieba_Counter and AntConc are the same, with frequency differences greater than 1 for the words yield, study, and plant, and less than 1 for grain by AntConc. We also searched for these words manually with Notepad. The keywords generated by WordCloud and TF-IDF-LDA overlap with the high-frequency words given by Jieba_Counter and AntConc, which shows that the keywords have a definite relation with the high-frequency words.
WordCloud generates 6 keywords that are the same as the high-frequency words (yield, soil, plant, cover, crop, and increased) and 9 words and phrases that differ from the high-frequency words (rate, treatment, using, cultivar, model, this study, year, field, and high).
Of the 15 TF-IDF-LDA keywords formed by the first 3 words in each topic, 8 are the same as high-frequency words (growth, results, water, soil, yield, study, two, and plant) and 7 are not among the high-frequency words (stress, data, model, moisture, potential, different, and using). Among the keywords that are not high-frequency words, two are common to both methods: using and model. The keywords shared with the high-frequency words account for 40% (6 of 15) and 53.3% (8 of 15) of the total high-frequency words for WordCloud and TF-IDF-LDA respectively, though the topic keywords are calculated with different methods.
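The proportions above can be rechecked with a few set operations. The keyword lists and the shared words are copied from the text of this paper; the variable and function names are illustrative only.

```python
# WordCloud keywords and TF-IDF-LDA top words, as listed in the results section.
wordcloud_keywords = [
    "yield", "soil", "plant", "cover", "crop", "rate", "treatment", "using",
    "cultivar", "model", "increased", "this study", "year", "field", "high",
]
lda_keywords = [
    "stress", "growth", "data", "model", "moisture", "potential", "results", "water",
    "different", "soil", "yield", "using", "study", "two", "plant",
]

# Keywords reported in the text as identical to high-frequency words.
wordcloud_shared = {"yield", "soil", "plant", "cover", "crop", "increased"}
lda_shared = {"growth", "results", "water", "soil", "yield", "study", "two", "plant"}


def share(keywords, shared):
    """Percentage of keywords that also appear among the high-frequency words."""
    return round(100 * len(shared & set(keywords)) / len(keywords), 1)


# Keywords outside the high-frequency list, per method, and those common to both methods.
wordcloud_only = set(wordcloud_keywords) - wordcloud_shared
lda_only = set(lda_keywords) - lda_shared
common_new = wordcloud_only & lda_only

print(share(wordcloud_keywords, wordcloud_shared), share(lda_keywords, lda_shared))
print(sorted(common_new))
```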
The topic keywords may offer more perspectives on the key information in a text, while WordCloud keywords may be eye-catching. Topic keywords may change with the topics assigned, which calls for tests and trials and for human judgement of the different outcomes. As the TF-IDF values are given with the keywords in each topic, we can judge and compare the keywords quantitatively, which is an advantage over the WordCloud keywords. The dynamic change of keywords with topics makes multiple views of the keywords possible. With different language models applied, a different picture of the keywords unfolds, which may deepen our understanding of the text.

Sentiments by SnowNLP and TextBlob
The average polarity of INC by TextBlob also decreases by 47.2% with the use of self-mentions in INSM, which suggests that the self-mentions in INSM serve as a strategy for lowering the sentiments of texts, thus showing little authorial interaction with the reader. In terms of the constituents of the self-mentions, inanimate entities make up a larger proportion than human entities, which might cause the decrease in sentiments.
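A percentage decrease of this kind is the relative change of the self-mention corpus average with respect to the full-corpus average. The sketch below shows only the arithmetic; the two averages are hypothetical values, since the underlying averages are not quoted in this passage, and a positive baseline is assumed.

```python
def percent_decrease(baseline, value):
    """Relative decrease of `value` with respect to `baseline`, in percent.

    Assumes a positive baseline, as with an average polarity above zero.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return round(100 * (baseline - value) / baseline, 1)


# Hypothetical averages, chosen only to illustrate the arithmetic.
full_corpus_polarity = 0.2
self_mention_polarity = 0.1
print(percent_decrease(full_corpus_polarity, self_mention_polarity))
```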
In order to confirm the sentiment results by TextBlob, the same sentiment analysis has been performed on CHC and CHSM, which yields similar data.
The average subjectivity and polarity of CHC and CHSM are of the same order as those of INC and INSM. The average polarity in CHC decreases by 36.5% and the average subjectivity decreases by 51.1% with the use of self-mentions in CHSM. The tests with CHC and CHSM may validate that the sentiment analysis by TextBlob is consistent with that of INC and INSM. However, the question arises as to which sentiment analysis is reliable, since both SnowNLP and TextBlob yield seemingly reasonable results. A second, equally thought-provoking problem is that self-mentions are devices for raising the sentiments of texts with SnowNLP, but they function as a strategy for lowering the subjectivity and polarity with TextBlob. Why do these two methods yield different outcomes?
We have checked the programs many times to ensure that they are correct and that the materials in these corpora are proper. As a result, no faults have been found with the programs or the materials. We then reexamined the source code of SnowNLP and TextBlob and found that TextBlob is NLTK-based, while SnowNLP is not, although both methods are Naïve-Bayes-based. In addition, another library, the Pattern Analyzer of TextBlob, may also contribute to the differences between the two methods. Theoretically speaking, the sentiment results by SnowNLP agree with Hyland's (2005) claim that self-mentions improve the authorial interactions with the reader, while the sentiment results by TextBlob suggest that personal opinions decrease with the use of self-mentions. The seemingly paradoxical results given by the two methods may be attributable to their different mechanisms, since the research materials are the same in these tests. Therefore, it is hard to conclude which method is more reliable or better than the other. What can be said is that different methods may offer different perspectives on sentiments.

Conclusion
We have examined different methods for generating high-frequency words, keywords and sentiments of abstracts. The high-frequency words generated by Jieba_Counter and AntConc are reliable, although their frequencies need further verification; the NLTK FreqDist high-frequency words are mostly reliable, with some needing further confirmation. The keywords by WordCloud and TF-IDF-LDA overlap with some of those high-frequency words, which shows a definite relation between high-frequency words and keywords.
Different methods for generating keywords yield different results; however, the TF-IDF-LDA method can provide dynamic topic keywords from which we can select the best combinations. The LDA model is only one of the language models for generating keywords; other language models can also be applied, which will show more scenarios. Keyword generation is somewhat like casting dice: it may produce different results and can give us a broader view of the text. The high-frequency words in combination with keywords can help us obtain the key information in the text, and they may help the writer to write the keywords for the paper.
The sentiment analyses by SnowNLP and TextBlob yield different results and can serve different purposes, since the two libraries use different parameters to reveal the sentiments of texts. SnowNLP results agree with the theoretical assumption that self-mentions improve the authorial interactions with the reader.
This finding also suggests that more opinions exist in the texts with the use of self-mentions. TextBlob sentiment results indicate that there is more factual information and there are fewer opinions in the texts with self-mentions. These two methods thus support different views on self-mentions, which seem contradictory.
This paradox may be due to the mechanisms of the two methods, and the contradiction may also be caused by the different constituents of self-mentions, as analyzed in the discussion. Recognizing the differences between the two methods, we may use them discriminatively. Based on the sentiment findings by SnowNLP and TextBlob, we may conclude that the objectivity of abstracts remains as before, though the authorial interactions with the reader have been greatly increased with the use of self-mentions.
This study can help the writer select proper approaches to generating the keywords of the paper on the one hand; on the other hand, it may also offer some implications for the writer in choosing proper self-mentions in order to enhance the authorial interactions with the reader. Meanwhile, this approach may broaden researchers' horizons by encouraging them to adopt Python programs to study automatic keyword generation and the sentiments of academic texts. More libraries and language models can be imported to explore the keywords and sentiments of academic texts quantitatively.
P. H. Zhang, Y. Pan, Open Journal of Modern Linguistics

Fund
The study is supported by the cross-disciplinary project in humanities and information (project name: Research on the Corpus-based Translation Universals of Abstracts and Their Application in Computer-aided Translation) of Xidian University (project No. RW180180).