A Stylometric Investigation of Linguistic Styles Based on a Vietnamese Corpus
T.-N. Nguyen, D. Dinh

The role of stylometric methods in linguistics has received increased attention across a number of disciplines in recent years, particularly in forensic linguistics. This study assesses the value of correspondence analysis, a stylometric method, in Vietnamese text analysis. The dataset is extracted from VVC (VnExpress Viewpoint Corpus), a 1.3-million-token corpus of Vietnamese opinion articles, and the linguistic features examined are seven parts-of-speech features, used to identify relational features that characterize authorial styles. The analysis focuses on feature effects, with the aim of shedding light on whether linguistic features of writing styles are consistent across genders and professions. Taken together, the seven features produce encouraging results for what is acknowledged to be a difficult problem for the Vietnamese language. In addition, we find that when correspondence analysis is applied to the seven linguistic features with the dataset grouped by authors’ gender, conjunctions and verbs perform best. When the dataset is grouped by authors’ profession, conjunctions and pronouns offer a striking improvement. The discriminating ability was particularly impressive, suggesting that, in a collective sense, parts-of-speech features provide a good set of markers.


Introduction
During the last decade, the link between stylometric analysis and linguistics has been at the center of much attention. The innovative work of Barlow (2013) pioneered a new approach to examining linguistic features by applying correspondence analysis to a specialized corpus, providing a reliable technique for identifying a language user. He argued that one consequence of relying on corpus data is that individual differences in usage tend to be obscured. To overcome this problem and investigate individual differences in spoken usage, he examined a corpus consisting of the spoken output of six White House press secretaries. The results provide strong evidence that, within this one particular discourse context, the patterns of speech of each individual are clearly recognizable.
Today's research focuses on individual variation in language use in the context of forensic author identification, with the purpose of developing the theoretical underpinnings of the notion of authorial style and validating methods of authorship analysis for a variety of forensic tasks, one of which is authorship attribution (PAN, 2019). In this study, the term "style" is used in its broadest sense to refer to a language user's unique way of choosing linguistic features in his or her works. The term "stylometry" is a relatively new name for forensic stylistics, commonly referred to as digital text forensics. In other words, the terms "stylometry" and "forensic stylistics" are used interchangeably to mean a research field that investigates and evaluates stylometric methods in forensic contexts.
So far, no large-scale studies have investigated the prevalence of a wide variety of linguistic features in Vietnamese data using a quantitative stylometric method such as correspondence analysis. There also remain several aspects of linguistic features about which relatively little is known. The experimental work presented here provides one of the first investigations into how linguistic features discriminate individual authors, based on a specialised Vietnamese corpus. The first purpose of this stylometric investigation was to explore the discriminating ability of correspondence analysis. The second was to assess the effect of each type of linguistic feature on that discrimination. The study sought to answer the following specific research questions:
Research question 1: Are parts-of-speech features able to discriminate authorial styles in Vietnamese texts?
Research question 2: Which features are the best style markers for author's gender and profession?
The current study is organized into six sections. Section 2 provides a brief summary of the literature relating to stylometric methods. Section 3 describes the data and methods of the study, including correspondence analysis of three sets of linguistic features. Section 4 analyzes the results and addresses each of the research questions in turn. Section 5 discusses the findings of the research, focusing on the best linguistic markers. The final section summarizes the main points and limitations of this study and offers perspectives for future work.

Related Works
Individual style has been investigated with a range of linguistic features, both lexical and grammatical. What we know about stylometry is largely based upon empirical studies that investigate how linguistic features discriminate individual authors of English texts. Around the early 1960s, small-scale research and case studies began to emerge that used stylometric techniques to attribute anonymous texts to their true authors. Over the last decades, hundreds of style markers, ranging from character-based to structure-based, and a great variety of stylometric techniques have been proposed, with some recent studies reporting attribution success rates in the region of 95% (e.g., Grieve, 2007; Wright, 2017). One well-known study that is often cited in research on stylometry is that of Juola (2007), who found that combining various linguistic features helps improve attribution accuracy. A more substantial treatment of the more stable significance of word-based features can be found in Juola (2013).
A variety of methods are used to assess the effect of linguistic features on authorial style. Each has its advantages and drawbacks. Three of the most common approaches to estimating such effects are statistical tests, machine learning and deep learning. More recent examples of methods based on statistical tests can be found in the work of Savoy (2020). Results from earlier studies demonstrate a strong and consistent association between word-based features and linguistic styles. A large number of published studies (e.g., Mealand, 1995; Koppel et al., 2012; Stamatatos et al., 2018) describe how the development of powerful computing tools and the easy accessibility of large quantities of linguistic data online have sparked renewed interest in authorship analysis, and it is the machine learning approach that seems to be the most promising at the moment.
However, there are several problems with these approaches. First, according to the Centre for Forensic Text Analysis, Aston University (2020), "the studies use non-transparent classification algorithms; meanwhile, in legal and forensic settings identification models need to be explanatorily rich because the forensic linguist needs to be both certain of the validity of their findings and able to explain them to lay triers of fact". Secondly, although there are many reports in the literature on linguistic style, most are restricted to grammatical features; the influence of such features has been the subject of intense debate within the scientific community. Last but not least, research into stylometry has mainly been concerned with too few linguistic features. Several divergent accounts of individual words have been proposed, creating numerous controversies.
As we can see, much of the quantitative stylometric research has focused on identifying and evaluating the best linguistic features in resource-rich languages such as English and Spanish. Nguyen et al. (2020) show that past publications concentrating on the linguistic style of Vietnamese texts more frequently adopted a qualitative approach. Previous qualitative research findings on authorial styles in Vietnamese texts have been inconsistent and contradictory, and the generalizability of much published research on this issue is problematic. Some small-scale studies suggest an association between individual words and linguistic style (Nguyen & Dang, 1999; Nguyen et al., 2018). In contrast to previously published studies, Ho et al. (2020) demonstrated that various quantitative approaches are able to identify the true author of online texts, i.e. online news articles or posts on the social networking site Facebook. However, Ho et al. focused on word length and the 20 most frequently occurring words, the majority of which are function words such as "và" (and), "của" (of) and "cho" (for).
To date, no large-scale studies have investigated the prevalence of a wide variety of linguistic features in Vietnamese data using a quantitative stylometric method such as correspondence analysis. Although studies have recognized some specific words, research has yet to systematically investigate the effect of linguistic features on authorial linguistic style. Our research will thus use socio-linguistically dynamic, cross-topic data, and in interpreting the findings we will look for ways to open the black box.

Data
The dataset under investigation includes 80 opinion articles taken from VVC (VnExpress Viewpoint Corpus) (Nguyen et al., 2020). VVC is a specialized corpus of opinion articles on a variety of topics, in which the authors are allowed to write in their own styles and with less formality than in other genres. These articles are written by authors of Goc nhin (Perspectives), a unique section where authors voice their opinions about various problems or express their observations. The authors chosen for the study consist of 10 males and 10 females. The choice was made mainly on the basis of the length of their opinion articles, since one objective of this study is to work with reasonably large samples of individual usage. The length of each article varies, but all contain at least 500 words. Table 1 gives an overview of the dataset we used.

Methods
According to Brezina (2018), there are two basic approaches in forensic linguistics, which depend on the amount of linguistic evidence available: if the text under investigation contains only a few sentences, close reading for signs of idiosyncratic language use may be appropriate; if the text has around 500 words or more, a statistical approach is called for.
To answer the research questions, we employ correspondence analysis, a summary technique which outputs a correspondence plot. Conceptually, correspondence analysis is related to the chi-squared test, which tests the null hypothesis of homogeneity. However, correspondence analysis has several merits in comparison with the chi-squared test. Instead of the single p-value that the chi-squared test produces, correspondence analysis shows the relationship between linguistic features visually by plotting both the authors and the linguistic features in the same correspondence plot. The sensitivity of correspondence analysis has been demonstrated in Brezina's book (2018), Statistics in Corpus Linguistics: While the chi-squared test can only answer a simple YES/NO question about statistical significance without indicating where exactly the difference lies (which is especially problematic with large cross-tabulation tables), the correspondence analysis can show us the larger picture of complex relationships, both similarities and differences (Brezina, 2018: p. 202).
Authorial style was examined using the same method that was detailed for White House press secretaries (Barlow, 2013), namely a series of correspondence analyses. However, the choice of opinion articles within one news site means that it may not be possible to generalize the authorial writing styles to other text genres. In this study, the statistical analysis was performed using R software. With four packages (FactoMineR, shiny, FactoInvestigate, and ggplot2), R visualizes the results of the analysis as a correspondence plot.
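The link between the chi-squared test and correspondence analysis described above can be illustrated with a minimal sketch. It is written in plain Python rather than the R packages used in this study, and the counts are hypothetical, not taken from VVC. The function name `chi_squared_summary` is an assumption made for illustration. It computes the chi-squared statistic, the total inertia (the chi-squared statistic divided by the grand total, which is the quantity that correspondence analysis decomposes into dimensions), and the standardized residuals that indicate where the differences lie.

```python
def chi_squared_summary(table):
    """Return (chi2, total_inertia, standardized_residuals) for a table of counts.

    Rows are text samples and columns are linguistic features.
    Assumes every row and every column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Count expected under the homogeneity null hypothesis.
            expected = row_totals[i] * col_totals[j] / n
            r = (observed - expected) / expected ** 0.5
            res_row.append(r)
            chi2 += r * r
        residuals.append(res_row)

    # Total inertia is the chi-squared statistic scaled by the grand total.
    return chi2, chi2 / n, residuals


# Hypothetical counts (illustrative only): 3 samples x 3 POS tags.
counts = [
    [120, 80, 30],
    [100, 90, 40],
    [60, 110, 70],
]
chi2, inertia, residuals = chi_squared_summary(counts)
print(f"chi2 = {chi2:.2f}, total inertia = {inertia:.4f}")
```

A large positive or negative residual marks a sample that uses a feature more or less often than expected; a correspondence plot presents the same information geometrically instead of as a single test result.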

Results and Analysis
In order to answer the two research questions, two series of correspondence analyses were conducted based on parts-of-speech features. The analysis took as its linguistic variables the proportions of different word classes in the subcorpus VVC_AA20, since these variables are both frequent and independent of the topic discussed.
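As a rough illustration of how such word-class variables might be prepared, the sketch below builds one row of POS counts per text sample and converts it to proportions. It assumes the samples are already POS-tagged as (word, tag) pairs; the tag labels, sample identifiers and function names are hypothetical and do not reproduce the tagset or the preprocessing actually used for VVC_AA20.

```python
from collections import Counter

# Assumed inventory of seven POS labels (illustrative, not the paper's tagset).
POS_TAGS = ["noun", "verb", "adjective", "adverb",
            "pronoun", "preposition", "conjunction"]


def pos_row(tagged_tokens, tags=POS_TAGS):
    """Count how often each tag of interest occurs in one tagged sample."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts.get(tag, 0) for tag in tags]


def cross_tabulate(samples, tags=POS_TAGS):
    """Map each sample id to its row of POS counts (one row per sample)."""
    return {sample_id: pos_row(tokens, tags)
            for sample_id, tokens in samples.items()}


def proportions(row):
    """Convert a row of counts to proportions of the row total."""
    total = sum(row)
    return [count / total if total else 0.0 for count in row]


# Toy tagged samples with hypothetical ids.
samples = {
    "A1": [("tôi", "pronoun"), ("đọc", "verb"), ("sách", "noun"),
           ("và", "conjunction"), ("báo", "noun")],
    "B1": [("họ", "pronoun"), ("rất", "adverb"), ("vui", "adjective")],
}
table = cross_tabulate(samples)
for sample_id, row in table.items():
    print(sample_id, row, [round(p, 2) for p in proportions(row)])
```

The resulting rows correspond to the rows of a cross-tabulation table of samples by word classes, which is the input that a correspondence analysis expects.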

Use of Word Classes by Males and Females
The full datasets on which the correspondence plots in Figure 1 and Figure 2 are based can be seen in the cross-tabulation tables below (Table 2 and Table 3).
The data in Table 2 and Table 3 were analyzed using correspondence analysis. The resulting correspondence plots are displayed in Figure 1 and Figure 2.
Overall, the two correspondence plots explain nearly 64% and 55% of the variation in the data, respectively, which is a reasonable amount.
In Figure 1 [...] positions and nouns. Most importantly, we can see that the samples drawn from the writing of the same four writers (1093, 766, 50 and 83) cluster relatively closely together; again, we can measure the chi-squared distances between the samples.
In Figure 2, the correspondence analysis groups individual text samples from the opinion writers less clearly than in Figure 1. However, the texts of females are characterized by the frequent use of pronouns and a relatively infrequent use of pronouns. In addition, the correspondence plot clearly shows four individual writers (95, 465, 47 and 149) clustered very closely to the left according to their use of different word classes. Interestingly, VnExpress's author introductions list the main occupation of all four as teacher. An exception is author 1160, whose samples stretch from the center to the left of the plot.
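The chi-squared distance between two samples mentioned above can be sketched as follows. This is a minimal plain-Python illustration under the standard definition used in correspondence analysis (squared differences between row profiles, weighted by the inverse of the column masses); the function name and the counts are hypothetical and are not the values behind Figure 1.

```python
def chi_squared_distance(table, i, k):
    """Chi-squared distance between the profiles of rows i and k.

    Assumes both rows and every column of the table have non-zero totals.
    """
    n = sum(sum(row) for row in table)
    # Column masses: share of each feature in the whole table.
    col_mass = [sum(col) / n for col in zip(*table)]
    # Row profiles: relative frequencies within each sample.
    profile_i = [x / sum(table[i]) for x in table[i]]
    profile_k = [x / sum(table[k]) for x in table[k]]
    squared = sum((a - b) ** 2 / m
                  for a, b, m in zip(profile_i, profile_k, col_mass))
    return squared ** 0.5


# Hypothetical counts: rows 0 and 1 have identical profiles, row 2 differs.
counts = [
    [10, 20],
    [20, 40],
    [30, 10],
]
d_same = chi_squared_distance(counts, 0, 1)
d_diff = chi_squared_distance(counts, 0, 2)
print(d_same, d_diff)
```

Samples with proportionally identical feature use are at distance zero regardless of their length, which is why samples by the same author can cluster together even when the texts differ in size.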

Use of Word Classes by Business People and Teachers
The full datasets on which the correspondence plots in Figure 3 and Figure 4 are based can be seen in the cross-tabulation tables (Table 2 and Table 3 in Subsection 4.1); the authors' professions are shown in Table 1 (Subsection 3.1). Similar results are obtained if we use authors' profession rather than their gender. Following the same methodology and creating a target feature list from the seven most frequent POS tags in each sample, we determine the frequency of each of the resulting POS tags in the 400-word samples. As with the authors' gender, the perception that there is a consistent authorial writing style is confirmed by performing a correspondence analysis on all the samples.
In Figure 3, Dim1 explains 29.20% and Dim2 explains 21.30% of the variation, which means that overall the correspondence plot explains about 50% of the variation. While it is possible to make out individual POS tags in the graph, the overall pattern is not really clear. In the plot of authors who are businesspeople, however, the POS data sometimes partition the text samples in such a way that samples from the same author cluster together and samples from different authors are distributed in different regions of the plot. For example, the articles associated with author 1020 cluster tightly at the bottom right of the graph, and three of the four articles associated with author 92 are found in the top middle region.
In Figure 4, we can observe that the main dimension on the horizontal axis accounts for 45.46% of the variation and, taken together with the vertical axis (20.16%), around 65% of the variation in the data is accounted for by these two dimensions. Most importantly, looking at the correspondence analysis for the POS data in this figure, we find that the POS tags do clearly differentiate the ten authors, as detailed below.
Of the opinion authors whose main job is teaching, author 766 shows the greatest variation, with sample 766c displaced from 766a, 766b and 766d. The use of nouns by this author decreases by nearly a fifth from 766b to 766c, which may be indicative of a general reduction in the use of terminology.
In addition, the samples from author 1026 cluster in two contiguous regions: samples 1026c and 1026b are located close together and a little distant from 1026a and 1026d. It is unclear what is happening in these cases, but it may be that these displacements represent some changes in style over time. The samples from authors 1035 and 83 are distinct but located close to each other. This confirms that the frequency of common POS use distinguishes authors.
Although they may be difficult to make out, the POS tags are displayed on the plot for teachers in locations related to the two axes. We find description-related POS tags such as adjectives and adverbs positioned towards the top and movement-related POS tags such as verbs and prepositions at the bottom. This dimension may reflect, in part, a difference in referential style: a distinction between a predominant use of words describing the quality and quantity of things and a predominant use of words describing the manner of movements.

Discussion
The graphical representations of the correspondence analysis results show the author samples in relation to their preferences for different linguistic features. The plots thus suggest that there is strong evidence in the data for distinct styles of writing among these authors. In the plots displayed in Figure 1 and Figure [...] This impressive result adds weight to Barlow's (2013) [...] Although the findings should be interpreted with caution, this study has several strengths. One of them is that it represents a comprehensive examination of a wide variety of grammatical features. All seven features have produced encouraging results. The data preparation underpinning both the gender and the profession analyses restricted the study to 20 authors and 80 texts, and our findings must be qualified in this respect. The discriminating ability was particularly impressive, suggesting that, in a collective sense, POS features provide a good set of markers.
The large sets of common POS tags employed help obtain meaningful results with the correspondence analysis technique, a finding which concurs with Barlow's work (2013) in this area. The unique feature of the correspondence plot is that it captures both the column and the row categories of the cross-tabulation table in the same space. Taken together, the results in this section indicate that there is an association between POS features and authorial styles in Vietnamese texts. The next section, therefore, summarizes the main points and provides suggestions for future work.

Conclusions and Future Perspectives
The present study was designed to determine the discriminating ability of correspondence analysis, a stylometric method, based on a specialised Vietnamese corpus. The second aim of this study was to investigate the effect of each type of linguistic feature on the stylometric investigation. Returning to the research questions posed at the beginning of this study, it is now possible to state that the linguistic style of an author can be identified by applying a stylometric method to a set of POS tags. The most obvious finding to emerge from this study is that when correspondence analysis is applied to the dataset grouped by authors' gender, conjunctions and verbs perform best. When the dataset is grouped by authors' profession, conjunctions and pronouns offer a striking improvement. These findings have significant implications for understanding how authorial style in Vietnamese texts can be determined by using linguistic features.
Several limitations to this study need to be acknowledged. Firstly, the study tends to use socio-linguistically and situationally homogeneous data, whereas forensically realistic identification methods need to be able to capture stylistic similarities between texts created in different contexts and for different purposes and audiences. Secondly, the lack of other POS tags, such as modifiers or exclamation words, in the sample adds further caution regarding the generalizability of these findings.
Stylometric methods in general, and correspondence analysis in particular, have hitherto been underutilized in forensic studies; our study suggests that the prospects for their successful application to Vietnamese data look good. Further studies regarding the role of lexical features, not just grammatical ones, would be worthwhile. Such studies could employ various aspects of lexis in a correspondence analysis to characterize individual writing style.