Comparing and Analyzing Cohesive Devices of SMT and NMT from Chinese to English: A Diachronic Approach

This work presents a detailed comparison and analysis of the use of cohesive devices by three machine translation (MT) systems translating from Chinese to English, under both the statistical (SMT) and the neural (NMT) approach. Through a general analysis of sentence length and cohesive devices, and a detailed analysis of one sentence translated by SMT and NMT with a human translation as reference, it is shown that, compared with SMT, NMT is better at handling cohesive ties such as additive conjunctions, adverbs, and pronouns; however, both SMT and NMT underperform in dealing with demonstratives and lexical cohesion. This provides evidence of improved translation quality and points to the necessity of pre-editing and post-editing cohesive devices in machine translations.

The newer approach, NMT, can greatly improve the quality of automatic translation.
The evaluation of machine translation relies on either human or automatic metrics to assess the overall performance of an MT system. Well-known automatic metrics include BLEU (Papineni et al., 2002), TER (Snover et al., 2006), and Meteor (Denkowski & Lavie, 2011), which judge the quality of machine translations in terms of fluency, fidelity, and so on.
However, in this paper, we take a different approach and carry out a case study on a text from the academic genre by diachronically identifying and comparing the cohesive devices in translations of the text produced by a human translator and by three MT systems, i.e. Google Translate, Baidu Translate, and Bing Translator, in 2016 and 2020, when the systems relied on SMT and NMT respectively. Errors in cohesive devices, including reference devices and conjunction devices, are a kind of grammatical error that is easily found in machine translations and can greatly affect the coherence of the target text.
Given the linguistic differences between English and Chinese, the author also outlines possible strategies for MT pre-editing and post-editing.
The study has two aims: 1) to compare the overall use of cohesive devices by the human translator and by the three MT systems under both SMT and NMT, identifying particular strengths of the NMT approach over the SMT approach, and weaknesses of both NMT and SMT relative to the human translator, in handling cohesive devices; 2) to examine whether the NMT approach translates cohesive devices better than the SMT approach and whether the human translator handles them better than either, and hence to give suggestions for pre-editing and post-editing machine translations so that the translated text achieves cohesion.
MT quality varies a great deal across language pairs, genres, and domains. Chinese and English are two very different languages, and automatic translation between them underperforms compared with other language pairs such as English to French or Spanish to English. Comparing cohesive devices in Chinese to English translations produced by different MT systems under both the SMT and NMT approaches is therefore all the more significant, and can benefit the development of SMT and NMT systems as well as MT evaluation.

Related Works
Few studies have compared SMT and NMT between Chinese and English in detail, especially diachronically. An empirical analysis of Chinese to English news translation by the Microsoft AI & Research center reported that NMT was at "human parity" with professional human translations; however, errors including incorrect words, ungrammatical output, missing words, and named-entity errors were identified after human analysis, indicating considerable room for improving the quality even of NMT, the state-of-the-art approach. Other analyses comparing NMT with the phrase-based MT (PBMT) approach were carried out in (Bentivogli et al., 2016; Toral & Víctor, 2017). Motivated by the advantages of the statistical approach, a pre-translation technique was also proposed to combine PBMT and NMT on the English to German translation task: it uses PBMT to pre-translate the source text and NMT to generate the target text, increasing MT quality measured in BLEU by up to 2 points (Niehues et al., 2016).

Cohesive Devices
Cohesive devices are used in both English and Chinese, though differently, and many scholars have discussed the issue. Guo (2006: p. 188 Wang (2006) holds that some pronouns should be replaced by other types of cohesive devices, such as lexical cohesion and ellipsis, because pronouns are far more frequent in English whereas content words are prominent in Chinese, although equivalents can be found in Chinese for most items, with exceptions such as "the", "one", and "one's". He also warns that cohesive devices need not be translated form for form; instead, they should be translated according to their function (Wang, 2006). Cohesive devices are therefore regarded as a major source of difficulty in translation.
Cohesion in this paper refers to a set of overt, language-specific resources that link a text together at the global level. These resources fall into five categories, "reference, conjunction, substitution, ellipsis and lexical cohesion", realized through both grammar and vocabulary, as in Halliday and Hasan (1976). Reference, substitution, and ellipsis come under grammatical cohesion. Conjunction is both grammatical and lexical. Lexical cohesion, as its name indicates, is realized through vocabulary.
Of the five types, reference includes pronouns, which are further classified into "personal pronouns", "possessive determiners" and "possessive pronouns"; demonstratives such as "this", "that", "these" and "those"; the definite article "the"; comparatives such as "same", "similar", "equal", "other", "different"; and adverbs such as "here", "there", "now" and "then". Generally, pronouns in Chinese are simpler in form than in English. Substitution in English can be nominal (achieved by the use of "one/ones" or "the same" in place of a noun phrase, as in "We have no coal fires; only wood ones"), verbal (realized with the help of "do"/"did" in place of a verb, as in "No one can accomplish this task better than I do"), and clausal (realized through the use of "so" and "not", when they replace an entire clause, as in "Is there going to be an earthquake?" "It says so.") (Halliday & Hasan, 1976: pp. 91-130). According to Hu (1994: pp. 73-74), "这么着", "来" and "干" in Chinese can function as verbal substitutes. Chinese does not have as many substitution devices as English.
However, some instances of lexical cohesion and ellipsis should be replaced by substitution when translating from Chinese to English (Wang, 2006: p. 269).
Ellipsis occurs when an item is omitted and no tangible substitution happens.
The last category, lexical cohesion, includes reiteration and collocation. Reiteration refers to the direct repetition of lexical words or the repetition of their synonyms, and collocation means "a word that is in some way associated with another word in the preceding text", including superordinates, hyponyms, and antonyms (Halliday & Hasan, 1976: p. 318). According to Hoey (1991: p. 9), lexical cohesion probably accounts for more than 40% of all cohesive ties in Halliday and Hasan's text samples.
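Reiteration in its simplest form, the direct repetition of lexical words, can be approximated by counting content words that occur more than once. The following is a minimal Python sketch and not the procedure used in this study: the function name `repeated_content_words` and the small stop-word list are assumptions introduced for illustration, and neither synonyms nor collocation are detected. The sample text is the human translation of the sentence analysed in the sentence-level comparison of this paper.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption of this sketch): function words
# are ignored because reiteration concerns lexical (content) words only.
STOP_WORDS = {
    "up", "to", "now", "there", "are", "no", "less", "than", "and", "in",
    "of", "which", "most", "they", "the", "a", "an", "is",
}


def repeated_content_words(text):
    """Return the content words that occur at least twice, with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return {word: n for word, n in counts.items() if n >= 2}


human = (
    "Up to now, there are no less than twenty systemized approaches and "
    "methods in language teaching, five of which are most influential. "
    "They are Grammar-translation Method, Direct Method, Audio-lingual "
    "Approach, Cognitive Approach and Communicative Approach."
)

print(repeated_content_words(human))
```

Exact string matching treats "approach" and "approaches" as different words, so a lemmatizer would be needed to capture repetition across inflected forms.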

Data Collection and Research Methods
In January 2016, when SMT was still in use, a Chinese-English parallel corpus of 239,504 words was collected and compiled, comprising academic, literary, and news texts translated by human translators and by three online MT systems: Google Translate, Baidu Translate, and Bing Translator. A quantitative analysis of the corpus revealed differences between the MT systems and human translators in dealing with cohesive devices; e.g. in the abstract corpus, compared with human translators, Baidu Translate used more definite articles, and Baidu Translate and Bing Translator used more additive devices such as "and" and "besides".
In the latter part of 2016, NMT began to be deployed for users and developers. In 2020, one of the academic texts was retrieved and re-translated by the

Sentence Length of Each Translation
As shown in Table 1, for the same academic source text, the 21 original Chinese sentences were kept as the same segments by all three SMT-based MT systems, whereas the NMT systems segmented the sentences quite differently. The human translator tended to split some long sentences into shorter ones, producing 34 sentences in total. This suggests that NMT deals with sentence length more flexibly than SMT.
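A sentence count of this kind can be approximated with a simple rule-based splitter. The sketch below is an illustration only and not the tool used to produce Table 1: the function names `split_sentences` and `mean_words_per_sentence` are hypothetical, and the assumption that a sentence ends at ".", "!", "?" or their Chinese counterparts will mis-split text containing abbreviations or decimal numbers. The sample sentence and its translations are taken from the sentence-level comparison in this paper.

```python
import re

# Assumption: a sentence is a run of text ending in one of these marks.
SENTENCE_RE = re.compile(r"[^.!?。！？]+[.!?。！？]")


def split_sentences(text):
    """Split text into sentences at English or Chinese terminal punctuation."""
    return [s.strip() for s in SENTENCE_RE.findall(text)]


def mean_words_per_sentence(text):
    """Average number of whitespace-separated words per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


source = (
    "到目前为止，比较系统化的语言教学法流派不下二十种，"
    "其中最具影响力的流派有五种：翻译法、直接法、听说法、认知法、交际法。"
)
human = (
    "Up to now, there are no less than twenty systemized approaches and "
    "methods in language teaching, five of which are most influential. "
    "They are Grammar-translation Method, Direct Method, Audio-lingual "
    "Approach, Cognitive Approach and Communicative Approach."
)
google_nmt = (
    "So far, there are no less than 20 schools of more systematic language "
    "teaching methods, of which five are the most influential: translation, "
    "direct method, listening and speaking, cognitive, and communicative methods."
)

for name, text in [("source", source), ("human", human), ("Google NMT", google_nmt)]:
    print(name, len(split_sentences(text)))
print("human mean words per sentence:", mean_words_per_sentence(human))
```

On this one example the splitter finds one sentence in the source and in the Google NMT output and two in the human translation, which matches the observation that the human translator divides long sentences.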

Overall Usage of Cohesive Devices
To gain a general view of the cohesive ties used in each text, we used FileLocatePro to search both the human-translated text and the 6 MT texts; about 72 cohesive ties were classified as references and conjunctions under the aforementioned classification. The results are shown in Table 2.
For example, when we searched for personal pronouns in the human-translated text, all 11 results were located, and the source sentences with the highlighted keywords "they" and "we" were clearly shown, as in Figure 1. The overall count of cohesive ties suggests that both references and conjunctions are used in the texts translated by the three MT systems and by the human translator. However, minor differences can be spotted: 1) demonstratives such as "this", "that", "these" and "those" were used less frequently by both the SMT and NMT systems than by the human translator, indicating that the MT systems are unable to add enough demonstratives to their translations; 2) the definite article has 73 hits in the text translated by the Baidu SMT system, significantly more than in the other systems and the human translation; for example, "Based on the theory of intercultural communication, the ideal goal of teaching a foreign language is to let the students use the language in the cultural context of the target language to meet each other's cultural habits of communication is to cultivate the students ability of cross culture communication" (Baidu-SMT-Sentence 18) uses too many definite articles before nouns, making the sentence less fluent, although this feature was less prominent in Baidu NMT; 3) the categories of cohesive ties were more diverse in the human translation, as most of the MT systems lack causal or temporal conjunctions; 4) compared with SMT, the total number of cohesive ties tracked above was larger in both Google NMT and Bing NMT, though still smaller than in the human translation, which probably shows that NMT is more capable of incorporating cohesive devices into sentences.

Comparison of Cohesive Devices in Sentence
[Source]: 到目前为止，比较系统化的语言教学法流派不下二十种，其中最具影响力的流派有五种：翻译法、直接法、听说法、认知法、交际法。
[Human Translator]: Up to now, there are no less than twenty systemized approaches and methods in language teaching, five of which are most influential. They are Grammar-translation Method, Direct Method, Audio-lingual Approach, Cognitive Approach and Communicative Approach.
[Google-SMT]: So far, more systematic language teaching methods no less than twenty kinds of genres, including the most influential genre has five: translation method, direct method, I heard French, cognitive method, communicative approach.
[Google-NMT]: So far, there are no less than 20 schools of more systematic language teaching methods, of which five are the most influential: translation, direct method, listening and speaking, cognitive, and communicative methods.
J. Liu Open Journal of Modern Linguistics
Taking the seventh sentence translated by Google SMT and NMT as an example, the human translator divides it into two sentences and incorporates 7 cohesive ties, comprising 3rd person pronouns, comparatives, adverbs, and additive conjunctions, tracked as above; some lexical cohesive ties such as "twenty" and "five of which" are also noticeable. In Google SMT, by contrast, the number of cohesive ties is only 4: the translation lacks adverbs, appropriate personal pronouns (it includes the first person pronoun "I", but uses it improperly), and the necessary additive conjunctions. Google NMT seems better at dealing with cohesive ties, because its sentence contains enough of them to make it basically cohesive. However, Google NMT still fails to handle phrases such as "比较系统化", a common usage in Chinese: it retains the comparative "more" in the sentence, which clearly violates English grammar. An experienced human translator could easily decide to omit the comparative in this case.
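The category counts discussed here can be roughly approximated by matching tokens against word lists. The following Python sketch is an illustration under stated assumptions, not the search procedure used in this study: the function name `count_ties` is hypothetical, the personal pronoun list is an assumed subset, "more" is added to the comparatives because it is discussed above, and plain token matching cannot tell, for instance, existential "there" from the adverb or demonstrative "that" from the relative pronoun. Its counts therefore need not agree with the manually classified figures reported in the text.

```python
import re
from collections import Counter

# Word lists per category. Most items come from the classification used in
# this paper; the pronoun list and "more" are assumptions for illustration.
LEXICON = {
    "personal pronoun": {"i", "we", "you", "he", "she", "it", "they"},
    "demonstrative": {"this", "that", "these", "those"},
    "definite article": {"the"},
    "comparative": {"same", "similar", "equal", "other", "different", "more"},
    "adverb": {"here", "there", "now", "then"},
    "additive conjunction": {"and", "besides"},
}


def count_ties(text):
    """Count tokens of each cohesive-device category found in the text."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for category, words in LEXICON.items():
            if token in words:
                counts[category] += 1
    return counts


human = (
    "Up to now, there are no less than twenty systemized approaches and "
    "methods in language teaching, five of which are most influential. "
    "They are Grammar-translation Method, Direct Method, Audio-lingual "
    "Approach, Cognitive Approach and Communicative Approach."
)
google_smt = (
    "So far, more systematic language teaching methods no less than twenty "
    "kinds of genres, including the most influential genre has five: "
    "translation method, direct method, I heard French, cognitive method, "
    "communicative approach."
)
google_nmt = (
    "So far, there are no less than 20 schools of more systematic language "
    "teaching methods, of which five are the most influential: translation, "
    "direct method, listening and speaking, cognitive, and communicative methods."
)

for name, text in [("human", human), ("Google SMT", google_smt), ("Google NMT", google_nmt)]:
    ties = count_ties(text)
    print(name, sum(ties.values()), dict(ties))
```

With these word lists the sketch finds 5 ties in the human translation, 3 in Google SMT, and 5 in Google NMT. The ordering between SMT and NMT agrees with the analysis above, but the absolute numbers differ from the manual counts, and the pronoun "I" in the SMT output is counted even though it is used improperly, which shows why automatic counts still need human checking.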

Summary and Outlook
In this paper, we conducted a detailed comparison of cohesive ties between SMT and NMT for the Chinese to English language pair. The aims were to identify some strengths and weaknesses of the two approaches and to offer suggestions for pre-editing and post-editing cohesive ties in machine translations. Our findings are: 1) compared with SMT, NMT is better at handling a) additive devices that make an English sentence cohesive, b) adverbs, and c) pronouns, which provides evidence of improved translation quality.
2) Compared with the human translator, both SMT and NMT underperform in dealing with demonstratives and lexical cohesion, because Chinese usually lacks these kinds of cohesive devices, although experienced translators easily notice where they are needed; in addition, both SMT and NMT have difficulty deciding where a definite article is needed.
Therefore, to better handle cohesive devices in Chinese to English MT, we suggest making covert cohesive ties explicit during pre-editing, so that they are easier for machines to trace, and, during post-editing, carefully checking the use of those ties in the MT output, correcting them where they were misinterpreted, and adding enough cohesive ties to make the post-edited translation coherent.
We believe that our analysis can benefit both the development of MT systems and MT evaluation methods. A more linguistic approach to MT quality, such as the analysis presented in this paper, could bring MT quality to a higher level. Open Journal of Modern Linguistics members made their contributions to the paper.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.