Visualization of Special Features in "The Tale of Genji" by Text Mining and Correspondence Analysis with Clustering

In this paper, special features of "The Tale of Genji", a representative work of Japanese classical literature, are visualized by text mining the auxiliary verbs and examining the similarity in sentence style by correspondence analysis with clustering. The results show that the text mining error in the number of auxiliary verbs can be as small as 15%. The extracted features support the hypothesis that "The Tale of Genji" had multiple authors, in good agreement with the result of Murakami and Imanishi [1]. The extracted features are also found to be robust to the text mining error, which suggests that the classification is only weakly affected by that error and that the technique can be used for further statistical studies of classical literature.


Introduction
Scientific art has been a topic of interest in recent visualization research. This field applies visualization techniques developed in science and engineering to the liberal arts, such as literature, social science, education, and archaeology. Examples can be found in archaeology [2], interdisciplinary education [3], generation of fluid art [4], and music and art [5][6][7][8][9].
The application of visualization techniques to literature has become active in recent years. Inami et al. [10] applied discrete wavelet multi-resolution analysis to "The Tale of Genji" and successfully visualized how the emotional streams of the major characters vary as the story progresses. Later, Yamada and Murai [11,12] developed a story-visualization technique that color-codes key words and interpolates between them using the Laplace equation, in order to trace the time variation of emotional feeling in a Shakespeare play. Carpena et al. [13] visualized spectra of word frequencies in "Don Quixote" based on statistical analysis. "The Tale of Genji" is one of the oldest stories in the world; it was written by Shikibu Murasaki in Japan in the early 11th century and has attracted worldwide attention [14,15]. It has also been suggested that "The Tale of Genji" was written by multiple writers, including Shikibu Murasaki, because some readers sense a change of impression in the flow of the story. The number of writers of "The Tale of Genji" has therefore been one of the important research topics in Japanese classical literature [1].
In recent years, the question of multiple authorship of "The Tale of Genji" has been examined statistically, building on a careful part-of-speech classification of the text [1]. That classification was carried out by five specialists over several years, and the results were compiled in ten volumes totaling more than 10,000 pages [16]. Based on this classification, Murakami and Imanishi [1] carried out a correspondence analysis of the number of auxiliary verbs and suggested that "The Tale of Genji" could have been written by multiple authors, including Shikibu Murasaki. However, the part-of-speech classification of a literary work generally requires a huge amount of time and manual labor, so it would be difficult to treat all works of classical literature with the same approach.
The purpose of this paper is to develop an efficient method of part-of-speech classification using text mining, and to extract features of "The Tale of Genji" by statistical correspondence analysis combined with clustering. This approach could reduce the subjectivity involved in extracting features from classical literature.

Structure of "The Tale of Genji"
"The Tale of Genji" is a representative work of classical Japanese literature and one of the longest stories in the world. The story deals with the love and death of its major characters in Heian-period Japan. It consists of 54 chapters, which are normally divided into four groups: Group 1 (Murasakinoue, 17 chapters), Group 2 (Tamakazura, 16 chapters), Group 3 (11 chapters), and Group 4 (Uji-jujo, 10 chapters). Among them, Murasakinoue (17 chapters) and Tamakazura (16 chapters) are considered in the classification study, as in the work by Murakami and Imanishi [1]. Note that this grouping of the chapters is based on the topics of the story. The classification of the chapters of "The Tale of Genji" is summarized in Table 1. The text of "The Tale of Genji" used in this study is the romanized version by Shibuya [17]. The opening passage of "Murasakinoue" reads in roman characters as follows: "Idure no ohom-toki ni ka, nyougo, kaui amata saburahi tamahi keru naka ni, ito yamgotonaki kiha ni ha ara nu ga, sugurete tokimeki tamahu ari keri." The auxiliary verbs in this passage (underlined in the original) are, in order: "ni", the continuative form of "nari"; "keru", the attributive form of "keri"; a second "ni", of the same kind as the first; "nu", the attributive form of "zu"; and "keri", the terminal form of "keri".

Auxiliary Verbs and Text Mining
Frequency analysis of auxiliary verbs characterizes each chapter of "The Tale of Genji", because the auxiliary verbs are the key to understanding the writer's personal traits, such as sentence style. In the present study, the auxiliary verbs are extracted from the whole text by a text mining program written in C++. In "The Tale of Genji", 26 auxiliary verbs exist, among which 21 are listed in Table 2. The rest of the auxiliary verbs are rarely used in the story, so they are not considered in this analysis, as in Murakami and Imanishi [1].
The auxiliary verbs consisting of more than four characters were extracted with high accuracy by a simple character search that considers only their conjugations. The shorter auxiliary verbs, on the other hand, are not easy to distinguish from other words, because they conjugate and also occur as parts of other words. Therefore, the auxiliary verbs shorter than two characters were extracted by examining several words before and after the target auxiliary verb. The extraction rules are as follows: the word before the auxiliary verb should be a verb, and several conjugations of the auxiliary verbs have to be followed by specific vowels owing to constraints of pronunciation. These rules were programmed in advance based on a manual classification of one chapter each from Murasakinoue and Tamakazura. The validity of the classification was then tested against the manual classification by Murakami and Imanishi [1].
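The two-stage extraction described above can be sketched as follows. This is only an illustration in Python (the authors' program is in C++ and its actual rules are not given in the text): the conjugation table contains only the forms named in the opening passage, and the context rules for "ni" and "nu" are hypothetical stand-ins chosen so that the sample sentence yields the five auxiliary verbs identified there.

```python
import re
from collections import Counter

# Opening passage of "Murasakinoue" as quoted in the text.
SAMPLE = ("Idure no ohom-toki ni ka, nyougo, kaui amata saburahi tamahi keru "
          "naka ni, ito yamgotonaki kiha ni ha ara nu ga, sugurete tokimeki "
          "tamahu ari keri.")

# Unambiguous conjugated forms -> dictionary form (illustrative subset only).
DIRECT_FORMS = {"keru": "keri", "keri": "keri"}

# Hypothetical context rules for short, ambiguous forms:
# form -> (dictionary form, test on the previous and next token).
CONTEXT_RULES = {
    "ni": ("nari", lambda prev, nxt: nxt in {"ka", "ha"}),
    "nu": ("zu", lambda prev, nxt: prev.endswith("a")),
}

def tokenize(text):
    """Lower-case the text and split it into romanized word tokens."""
    return re.findall(r"[a-z\-]+", text.lower())

def count_auxiliaries(text):
    """Count auxiliary verbs by dictionary form."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in DIRECT_FORMS:
            # Stage 1: simple search over known conjugated forms.
            counts[DIRECT_FORMS[tok]] += 1
        elif tok in CONTEXT_RULES:
            # Stage 2: accept a short form only if its neighbours fit the rule.
            lemma, rule = CONTEXT_RULES[tok]
            if rule(prev, nxt):
                counts[lemma] += 1
    return counts

if __name__ == "__main__":
    print(count_auxiliaries(SAMPLE))
```

With these assumed rules, the "ni" in "naka ni" is rejected as a non-auxiliary, while the two other occurrences are counted under "nari". A practical rule set would have to cover all 21 auxiliary verbs and their full conjugations.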

Correspondence Analysis and Clustering
In order to evaluate the similarity in sentence style among the chapters of "The Tale of Genji", correspondence analysis with clustering was applied to the text. Correspondence analysis is a multivariate statistical analysis applicable to qualitative data, such as text data in classical literature. In correspondence analysis, the factor scores $x_i$, $y_j$ are calculated from a frequency distribution $p_{ij}$ so as to maximize the correlation coefficient $\rho_{XY}$ [18], which is the ratio of the covariance $\sigma_{XY}$ of the two sets of scores to the product of their standard deviations $\sigma_X$ and $\sigma_Y$. In this study, the frequency distribution $p_{ij}$ corresponds to the frequency of auxiliary verbs, and $m$ and $n$ are the numbers of factor scores $x_i$ and $y_j$, respectively.
The chapters were grouped from the correspondence analysis results by k-means clustering [19] to remove arbitrariness. Note that Murakami and Imanishi [1] grouped the analyzed data by drawing the boundary free-hand, using their in-depth knowledge of the story of "The Tale of Genji". K-means clustering is a method for dividing a set of data into a specified number of groups. The technique is applicable on the condition that the number of clusters is known and that the clusters contain nearly equal numbers of data points. In this study, two clusters are assumed, following Murakami and Imanishi [1]. In the first stage of clustering, a random cluster is assigned to each data point; the centroid of each cluster is then evaluated, and each data point is reassigned to the cluster with the nearest centroid. These steps are repeated until the centroid positions and the cluster assignments no longer change.
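The displayed equations for this paragraph did not survive in the present text. For orientation, the standard textbook form of the quantities named above is given below. It assumes that $p_{ij}$ is normalized so that $\sum_{i}\sum_{j} p_{ij} = 1$, with row and column marginals $p_{i\cdot} = \sum_{j} p_{ij}$ and $p_{\cdot j} = \sum_{i} p_{ij}$; the notation in [18] may differ.

```latex
\begin{align}
\rho_{XY} &= \frac{\sigma_{XY}}{\sigma_X \, \sigma_Y}, \\
\sigma_{XY} &= \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij}\, x_i\, y_j
  - \Bigl( \sum_{i=1}^{m} p_{i\cdot}\, x_i \Bigr)
    \Bigl( \sum_{j=1}^{n} p_{\cdot j}\, y_j \Bigr), \\
\sigma_X^2 &= \sum_{i=1}^{m} p_{i\cdot}\, x_i^2
  - \Bigl( \sum_{i=1}^{m} p_{i\cdot}\, x_i \Bigr)^2, \\
\sigma_Y^2 &= \sum_{j=1}^{n} p_{\cdot j}\, y_j^2
  - \Bigl( \sum_{j=1}^{n} p_{\cdot j}\, y_j \Bigr)^2 .
\end{align}
```

Maximizing $\rho_{XY}$ over the scores gives the first pair of factor scores; later components maximize the same quantity subject to being uncorrelated with the earlier ones.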
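The clustering loop described above (random initial assignment, centroid update, reassignment to the nearest centroid, repeat until stable) can be sketched in Python as follows. The function name, the seed argument, the iteration cap, and the re-seeding of an empty cluster are assumptions made for this illustration, not details taken from the paper.

```python
import random

def kmeans(points, k=2, seed=0, max_iter=100):
    """K-means on a list of coordinate tuples, starting from random labels."""
    rng = random.Random(seed)
    # Stage 1: assign a random cluster to each data point.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Stage 2: evaluate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumed fallback: re-seed an empty cluster with a random point.
                members = [rng.choice(points)]
            centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        # Stage 3: reassign each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are as well
        labels = new_labels
    return labels, centroids

if __name__ == "__main__":
    scores = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(scores, k=2))
```

In the setting of this paper, each point would be the pair of 1st and 2nd factor scores of one chapter. Which cluster receives label 0 or 1 depends on the random start, so only the partition itself is meaningful.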

Text Mining of Auxiliary Verbs
Table 2 shows the frequency distribution of the auxiliary verbs in "The Tale of Genji" obtained from the present C++ program. As a typical example, results are shown for the subtotals in "Murasakinoue" and in "Tamakazura", together with the total number of auxiliary verbs in the 33 chapters. The RMS errors ε in the total number of auxiliary verbs with respect to those of Murakami and Imanishi [1] are also shown. The RMS error ε is evaluated from $N_{pi}$ and $N_{mi}$, the numbers of auxiliary verbs in the present study and in Murakami and Imanishi [1], respectively, where $M$ is the number of chapters. The results show that most of the auxiliary verbs can be extracted with an error smaller than 15%, but the three auxiliary verbs "Nari", "Su" and "Sasu" show comparatively larger errors of 23.6%, 23.6% and 44.3%, respectively. This is because these words can also be used as ordinary verbs, which makes them difficult to distinguish. Although the RMS error could be reduced further with more complicated rules, the influence of the error on the grouping should be examined before more effort is spent on improving the extraction accuracy. Part of the difference in the number of auxiliary verbs may come from the different source texts of "The Tale of Genji": Shibuya's text is used in the present study and Ikeda's [20] in Murakami and Imanishi [1]. Note that the number of auxiliary verbs manually extracted from the first 4 chapters of Shibuya's text is 6% smaller than that from Ikeda's text.
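The defining equation for ε is missing from the present text, so its exact normalization is unknown. The Python sketch below shows one plausible reading, stated here as an assumption: the root-mean-square difference between the two counts over the M chapters, divided by the mean reference count so that the result is a fraction. The function and argument names are illustrative.

```python
import math

def rms_error(present, reference):
    """Relative RMS difference between two per-chapter count lists.

    Assumed form: sqrt(mean((N_p - N_m)^2)) / mean(N_m), where the
    means run over the M chapters. The paper's own normalization may differ.
    """
    if len(present) != len(reference) or not reference:
        raise ValueError("both lists must be non-empty and of equal length")
    m = len(reference)
    mean_square = sum((p - r) ** 2 for p, r in zip(present, reference)) / m
    mean_reference = sum(reference) / m
    if mean_reference == 0:
        raise ValueError("reference counts must not all be zero")
    return math.sqrt(mean_square) / mean_reference

if __name__ == "__main__":
    # Hypothetical counts for one auxiliary verb in two chapters.
    print(rms_error([110, 90], [100, 100]))
```

Dividing by the mean rather than by each chapter's own count avoids division by zero when an auxiliary verb does not occur in a chapter.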

Correspondence Analysis and Clustering
Figure 1 shows the result of correspondence analysis and clustering for the 33 chapters of "The Tale of Genji", carried out using the frequency distribution of auxiliary verbs by Murakami and Imanishi [1]. The horizontal and vertical axes are the two major components of the factor scores obtained from the correspondence analysis. Note that the 1st component has a higher probability than the 2nd component. Square symbols show chapters of "Murasakinoue" and triangle symbols those of "Tamakazura", while closed symbols denote cluster 1 in the present classification and open symbols cluster 2. Chapters with closer factor scores have similar sentence styles, so they are assigned to the same cluster. The results indicate that most of the chapters in "Murasakinoue" lie close together, while those of "Tamakazura" fall in the other cluster, which indicates a difference in sentence style between "Murasakinoue" and "Tamakazura". On this basis, Murakami and Imanishi supported the theory of multiple writers for "The Tale of Genji" [1]. Chapter 16 was treated as an irregular chapter by Murakami and Imanishi on the basis of their in-depth knowledge. The present result agrees closely with that of Murakami and Imanishi, suggesting the validity of the computer program used in this analysis. A minor difference can be found in the grouping, because Murakami and Imanishi judged the grouping by personal in-depth knowledge rather than by an objective criterion [1].
Figure 2 shows the result of the correspondence analysis and clustering applied to the present text mining data of auxiliary verbs shown in Table 2. In comparison with Figure 1, the classification error is marginally increased, though the classification into "Murasakinoue" and "Tamakazura" is similarly observed in Figure 2. Some data points far from the origin of the figure deviate from their positions in Figure 1. These belong to chapters 1, 3, 11, 16, 17, and 27. These chapters are shorter than the others, so the text mining error in the auxiliary verbs has a larger influence on them. A minor difference can also be found in the overlapping region of the two clusters near the origin, which is due to the 3rd component of the factor scores. Because the influence of the text mining error on the classification is minor, the main features of the classification are unchanged. Therefore, it can be concluded that the feature extraction from "The Tale of Genji" is well reproduced by the fully computerized procedure of text mining of auxiliary verbs and correspondence analysis with clustering, without in-depth knowledge of "The Tale of Genji".

Influence of RMS Random Error on Clustering
In order to understand the influence of the RMS error on the clustering, the correspondence analysis with clustering was carried out on artificial data sets produced by adding random errors to the numbers of auxiliary verbs. The random error, assumed to follow a Gaussian distribution, was added to Murakami and Imanishi's data of auxiliary verbs. A typical clustering result is shown in Figure 3 for the case of 15% random RMS error, which is the same level of RMS error as in Figure 2. Although there is a minor difference in the factor scores, the main features of the result are in close agreement with those of Figures 1 and 2, which suggests the robustness of the present analysis to the RMS error in the number of auxiliary verbs. The minor difference in the factor scores between Figures 2 and 3 is due to the assumption of randomness in the frequency distribution of the auxiliary verbs. This assumption does not hold for the frequency distribution of extracted auxiliary verbs in Table 2, which shows large errors in "Nari", "Su" and "Sasu".
Figure 4 shows the relation between the misclassification rate α and the RMS random error ε in the frequency distribution of auxiliary verbs, obtained by the correspondence analysis with clustering, where $\alpha = N_c / M$ ($N_c$: number of misclassified chapters, $M$: number of chapters). The error bars indicate the standard deviation of the misclassification rate over 10 trials. The misclassification rate increases with increasing RMS random error, and it rises sharply when the RMS random error reaches around 20%. This result implies that the RMS random error in the frequency distribution of auxiliary verbs should be kept below 20% to obtain a reliable result. In addition, the influence of the number of auxiliary verbs on the classification was investigated by reducing the total number of auxiliary verbs. The numerical simulation shows that the misclassification is almost zero when the total number of auxiliary verbs is reduced by half. Therefore, the classification error of the correspondence analysis and clustering is only weakly affected by the number of auxiliary verbs in the present text.
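Generating such artificial data can be sketched in Python as below. The text states only that Gaussian random errors were added to the counts; scaling the error in proportion to each count and clipping negative values to zero are assumptions of this sketch, as are the function and parameter names.

```python
import random

def add_gaussian_noise(counts, rms_fraction, seed=None):
    """Return a noisy copy of a chapter-by-verb table of counts.

    Each count n becomes n * (1 + e), where e is drawn from a Gaussian
    with zero mean and standard deviation rms_fraction (0.15 for 15%).
    """
    rng = random.Random(seed)
    noisy = []
    for row in counts:
        noisy.append(
            [max(0.0, n * (1.0 + rng.gauss(0.0, rms_fraction))) for n in row]
        )
    return noisy

if __name__ == "__main__":
    # Hypothetical counts: 2 chapters by 3 auxiliary verbs.
    table = [[120, 35, 8], [90, 41, 12]]
    print(add_gaussian_noise(table, 0.15, seed=1))
```

Repeating this with different seeds gives the independent trials over which a mean and standard deviation of the clustering outcome can be taken.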
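The misclassification rate α = N_c / M can be computed as in the following Python sketch. Because the numbering of clusters produced by k-means is arbitrary, the sketch evaluates both ways of matching the two clusters to the two groups and keeps the smaller count; this matching step and the 0/1 encoding of groups are assumptions, since the text does not say how clusters were matched to "Murasakinoue" and "Tamakazura".

```python
def misclassification_rate(cluster_labels, true_groups):
    """Fraction of chapters whose cluster disagrees with their known group.

    Both inputs are lists of 0/1 values of equal length M. The smaller of
    the two possible mismatch counts is used, so swapping the cluster
    labels 0 and 1 does not change the result.
    """
    if len(cluster_labels) != len(true_groups) or not true_groups:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(true_groups)
    mismatches = sum(1 for c, g in zip(cluster_labels, true_groups) if c != g)
    n_c = min(mismatches, m - mismatches)
    return n_c / m

if __name__ == "__main__":
    # Hypothetical example with six chapters, one of them misclassified.
    print(misclassification_rate([0, 0, 0, 1, 1, 0], [1, 1, 1, 0, 0, 0]))
```

Averaging this rate over repeated noisy trials at each error level would give points and error bars of the kind plotted in Figure 4.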

Conclusion
The visualization of special features in "The Tale of Genji" was studied by text mining the auxiliary verbs and examining the similarity in sentence style by correspondence analysis with clustering, using computer programs. The present result shows that the text mining RMS error of the auxiliary verbs can be as small as 15%. The correspondence analysis with clustering is found to be robust to the text mining error. The features extracted by the present analysis agree well with the work of Murakami and Imanishi, which supports the theory of multiple writers of "The Tale of Genji". This method of analysis is applicable without in-depth knowledge of the classical literature, so that it will provide an efficient tool for feature extraction from classical stories.

Figure 1. Correspondence analysis and clustering of Murakami and Imanishi's data (numbers correspond to the chapter numbers).

Figure 2. Correspondence analysis and clustering of present text mining data (for caption see Figure 1).

Figure 3. Correspondence analysis and clustering of data with 15% RMS random error (for caption see Figure 1).

Figure 4. Relation between misclassification rate α and RMS random error ε in auxiliary verbs.

Table 2 .
The