We propose the first statistical theory of language translation based on communication theory. The theory is built on New Testament translations from Greek into Latin and into 35 other modern languages. When a text is translated into another language, all linguistic variables change numerically. To study the chaotic data that emerge, we model any translation as a complex communication channel affected by “noise”, studied according to communication theory, applied for the first time to this channel. This theory deals with aspects of languages more complex than those currently considered in machine translation. The input language is the “signal”; the output language is a “replica” of the input language, largely perturbed by noise, which is, however, indispensable for conveying the meaning of the input language to its readers. We define a noise-to-signal power ratio and find that channels are differently affected by translation noise. Communication channels are also characterized by channel capacity. The translation of novels is subject to more constraints than New Testament translations. We propose a global readability formula for alphabetical languages, not available for most of them, and conclude with a general theory of language translation, which shows that direct and reverse channels are not symmetric. The general theory can also be applied to channels of texts belonging to the same language, both to study how texts by the same author may have changed over time and to compare texts by different authors. In conclusion, a common underlying mathematical structure governing human textual/verbal communication channels seems to emerge. Language does not play the only role in translation; this role is shared with the reader’s reading ability and short-term memory capacity. Different versions of the New Testament within the same language can even seem, mathematically, to belong to different languages. These conclusions are timeless because they hold also for ancient Greek and Roman readers.

Translation is the replacement of textual material in one language by equivalent textual material in another language: it transfers meaning from one set of patterned symbols into another set of patterned symbols. Translation was formerly studied as a language-learning methodology or as part of comparative literature. Over time, however, the interdisciplinarity and specialization of the subject have become more evident, and theories and models have continued to be imported from other disciplines [

The aim of this paper is to show that there seems to be a mathematical/statistical background that unifies all alphabetical languages, despite the spreading of the statistics of linguistic variables from language to language, described by parallel communication channels, one for each linguistic variable. The differences between translations seem to be mostly due to differences in the expected reader’s reading ability—quantified by readability formulae—and reader’s short-term memory capacity—quantified by Miller’s 7 ± 2 law [

This unifying picture is assessed mainly by first defining an ideal translation channel, and then comparing the actual translation channels to it, according to communication theory. Shannon established the fundamental mathematics of the main parts of a communication channel [

In this paper, we study the translation channel, after suitably defining input and output symbols. Compared to Shannon’s channels, our linguistic channels work at a different level because, for equal meaning, both input and output texts are structured so as to match the reader’s expected reading ability and short-term memory capacity. In other words, these channels do not communicate with a machine but with human beings, who may have serious difficulty understanding what they read if the text is not matched to their own reading ability and short-term memory capacity. This peculiarity of linguistic channels makes the translation language less important. None of the previous studies has considered this unified approach.

The main mathematical/statistical characteristics are determined by studying the translation of a large selection of New Testament (NT) books—namely Matthew, Mark, Luke, John, Acts, the Epistle to the Romans and the Apocalypse, for a total of 155 chapters, according to the traditional subdivision of the original Greek texts—from Greek into Latin and into 35 other modern languages, 36 translations in total. The theory does not include meaning.

The rationale for studying the NT translations is their accuracy: each is produced by a team of experts whose aim is to render the meaning of the original Greek texts to their readers, regardless of the language used. These translations strictly respect the subdivision into chapters and verses of the Greek texts; therefore they can be studied at least at these two levels (chapters and verses), by comparing how a variable varies quantitatively from translation to translation. Notice that “translation” should not be confused with “language”, because language plays one of the roles in a translation, not the only one. For our analysis we have chosen the chapter level, because the amount of text is sufficiently large to assess reliable statistics. Overall, therefore, we have a database of 155 × 37 = 5735 samples for each stochastic variable, sufficiently large to give reliable results.

After this introduction, the rest of the paper is organized as follows. In Section 2 we list the NT translations and some fundamental statistics; in Section 3 we define the ideal translation and the real translation; in Section 4 we define the communication channel, its noise-to-signal power ratio, and its geometrical representation; in Section 5 we define the linguistic communication channels and study them; in Section 6 we deal with channel capacity according to communication theory; in Section 7 we relate the number of words per interpunction (also termed the word interval) to human short-term memory capacity; in Section 8 we define a readability formula applicable to any alphabetical language; in Section 9 we examine different versions of the NT translation in the same language; in Section 10 we compare the translations of a novel from English to some modern languages, and compare their statistics with the NT translation statistics from English to the same languages; in Section 11 we propose a general theory of language translation; finally, in Section 12 we summarize the main results and draw some conclusions. The appendices report further details.

Following our statistical study of a large corpus of literary texts taken from Italian literature spanning seven centuries [, the linguistic variables considered are: the number of words n_{W}, sentences n_{S} and interpunctions n_{I} per chapter; the number of characters per word C_{P}; the number of words per sentence P_{F}; the number of words per interpunction I_{P}, a variable that seems to be related to the short-term memory capacity [; the number of interpunctions per sentence M_{F}, which also gives the number of word intervals per sentence; and the total number of words W, sentences S and interpunctions I (Appendix A lists all mathematical symbols). How interpunctions were inserted into the scriptio continua of Greek and Latin texts is reviewed and discussed in [
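As a sketch of how these per-chapter counts could be extracted from a text, consider the following; the exact tokenization rules (what counts as a word, and which punctuation marks count as sentence endings or interpunctions) are our assumption, not specified by the paper:

```python
import re

# Assumed conventions (not the paper's specification):
SENTENCE_ENDERS = ".?!"     # marks taken to end a sentence
INTERPUNCTIONS = ".?!,:;"   # all punctuation marks counted as interpunctions

def chapter_stats(text: str) -> dict:
    """Compute the per-chapter linguistic variables for one chapter."""
    words = re.findall(r"[^\W\d_]+", text)   # runs of letters only
    n_w = len(words)
    n_s = sum(text.count(c) for c in SENTENCE_ENDERS)
    n_i = sum(text.count(c) for c in INTERPUNCTIONS)
    return {
        "n_W": n_w,
        "n_S": n_s,
        "n_I": n_i,
        "C_P": sum(len(w) for w in words) / n_w,  # characters per word
        "P_F": n_w / n_s,                          # words per sentence
        "I_P": n_w / n_i,                          # words per interpunction (word interval)
        "M_F": n_i / n_s,                          # interpunctions (word intervals) per sentence
    }
```

For example, a one-sentence chapter with one comma yields P_F equal to n_W, M_F = 2, and I_P = n_W/2.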

We downloaded each translation text from the web sites reported in

The first impression arising after reading these statistics is their large variety. Words, sentences and their distribution within chapters (for sentences) and within sentences (for words) can be very different from translation to translation. Even though all these texts convey the same meaning, the spread—i.e. the scattering of the values—is large. For example, the number of total words ranges from 90,799 (Latin) to 152,823 (Haitian), a spread of 62,024 words which represents 61.9% of the total number of words in Greek, 100,145; the number of total sentences ranges from 5370 (Latin) to 10,429 (Haitian), a spread of 106.3% of the total number of sentences in Greek, 4759. However, a ranking is evident as some translations are closer to the Greek originals than others. A similar spread is also noticeable in the average and standard deviation of the number of words per sentence P_{F} (from Hebrew to Welsh, range 52.4%), words per interpunction I_{P}—word interval—(from Russian to Cebuano, range 60.8%) and interpunctions per sentence M_{F} (from Cebuano to Esperanto, range 79.5%),

Language | Family | W | C_{P} | n_{W} | S | n_{S} | I | n_{I}
---|---|---|---|---|---|---|---|---
Greek^{1} | Hellenic | 100,145 | 4.86 (0.25) | 646.1 (220.4) | 4759 | 30.7 (14.0) | 13,698 | 88.4
Latin^{2} | Italic | 90,799 | 5.16 (0.28) | 585.8 (206.5) | 5370 | 34.6 (15.9) | 18,380 | 118.6
Esperanto^{3} | Constructed | 111,259 | 4.43 (0.20) | 717.8 (245.6) | 5483 | 35.4 (15.6) | 22,552 | 145.5
French^{4} | Romance | 133,050 | 4.20 (0.16) | 858.4 (282.5) | 7258 | 46.8 (17.0) | 18,284 | 118.0
Italian^{5} | Romance | 112,943 | 4.48 (0.19) | 728.7 (246.3) | 6396 | 41.7 (16.7) | 17,904 | 115.5
Portuguese^{6} | Romance | 117,537 | 4.43 (0.20) | 706.2 (239.6) | 6518 | 45.7 (18.1) | 18,410 | 118.8
Romanian^{7} | Romance | 109,468 | 4.34 (0.19) | 766.1 (265.4) | 7080 | 45.3 (19.5) | 20,105 | 129.7
Spanish^{8} | Romance | 118,744 | 4.30 (0.19) | 758.3 (252.3) | 7021 | 42.1 (17.4) | 18,587 | 119.9
Danish^{9} | Germanic | 131,021 | 4.14 (0.16) | 845.3 (299.1) | 8762 | 56.5 (22.5) | 22,196 | 143.2
English^{10} | Germanic | 122,641 | 4.24 (0.17) | 791.2 (274.8) | 6590 | 42.5 (17.3) | 16,666 | 107.5
Finnish^{11} | Uralic | 95,879 | 5.90 (0.31) | 618.6 (216.6) | 5893 | 38.0 (16.9) | 19,725 | 127.3
German^{12} | Germanic | 117,269 | 4.68 (0.19) | 756.6 (258.0) | 7069 | 45.6 (18.4) | 20,233 | 130.5
Icelandic^{13} | Germanic | 109,170 | 4.34 (0.18) | 704.3 (243.2) | 7193 | 46.4 (18.5) | 19,577 | 126.3
Norwegian^{14} | Germanic | 140,844 | 4.08 (0.13) | 908.7 (313.3) | 9302 | 60.0 (20.8) | 18,370 | 118.5
Swedish^{15} | Germanic | 118,833 | 4.23 (0.18) | 766.7 (268.9) | 7668 | 49.5 (19.6) | 15,139 | 97.7
Bulgarian^{16} | Balto-Slavic | 111,444 | 4.41 (0.19) | 719.0 (246.8) | 7727 | 49.8 (20.1) | 20,093 | 129.6
Czech^{17} | Balto-Slavic | 92,533 | 4.51 (0.21) | 597.0 (203.0) | 7514 | 48.5 (21.2) | 19,465 | 125.6
Croatian^{18} | Balto-Slavic | 97,336 | 4.39 (0.22) | 628.0 (220.6) | 6750 | 43.6 (18.7) | 17,698 | 114.2
Polish^{19} | Balto-Slavic | 99,592 | 5.10 (0.22) | 642.5 (224.6) | 8181 | 52.8 (18.9) | 21,560 | 139.1
Russian^{20} | Balto-Slavic | 92,736 | 4.67 (0.27) | 598.3 (208.3) | 5532 | 36.1 (16.4) | 22,083 | 142.5
Serbian^{21} | Balto-Slavic | 104,585 | 4.24 (0.20) | 674.7 (237.0) | 7532 | 48.6 (20.0) | 18,251 | 117.7
Slovak^{22} | Balto-Slavic | 100,151 | 4.65 (0.23) | 646.1 (223.7) | 8023 | 51.8 (20.8) | 19,690 | 127.0
Ukrainian^{23} | Balto-Slavic | 107,047 | 4.56 (0.26) | 690.6 (247.9) | 8043 | 51.9 (21.0) | 22,761 | 146.8
Estonian^{24} | Uralic | 101,657 | 4.89 (0.24) | 655.9 (229.8) | 6310 | 40.7 (17.5) | 19,029 | 122.8
Hungarian^{25} | Uralic | 95,837 | 5.31 (0.29) | 618.3 (212.3) | 5971 | 38.5 (16.7) | 22,970 | 148.2
Albanian^{26} | Albanian | 123,625 | 4.07 (0.22) | 797.6 (278.1) | 5807 | 37.5 (16.4) | 19,352 | 124.9
Armenian^{27} | Armenian | 100,604 | 4.75 (0.40) | 649.1 (235.6) | 6595 | 42.6 (18.7) | 18,086 | 116.7
Welsh^{28} | Celtic | 130,698 | 4.04 (0.15) | 843.2 (299.6) | 5676 | 36.6 (15.8) | 22,585 | 262.4
Basque^{29} | Isolate | 94,898 | 6.22 (0.27) | 612.2 (219.7) | 5591 | 36.1 (15.8) | 19,312 | 124.6
Hebrew^{30} | Semitic | 88,478 | 4.22 (0.17) | 570.8 (199.7) | 7597 | 49.0 (20.4) | 15,806 | 102.0
Cebuano^{31} | Austronesian | 146,481 | 4.65 (0.10) | 945.0 (326.6) | 9221 | 59.5 (22.4) | 16,788 | 108.3
Tagalog^{32} | Austronesian | 128,209 | 4.83 (0.17) | 827.2 (283.6) | 7944 | 51.2 (21.2) | 16,405 | 105.8
Chichewa^{33} | Niger-Congo | 94,817 | 6.08 (0.18) | 611.7 (203.7) | 7560 | 48.8 (18.3) | 15,817 | 102.0
Luganda^{34} | Niger-Congo | 91,819 | 6.23 (0.23) | 592.4 (207.9) | 7073 | 45.6 (18.8) | 16,401 | 105.8
Somali^{35} | Afro-Asiatic | 109,686 | 5.32 (0.16) | 707.7 (236.1) | 6127 | 39.5 (17.9) | 17,765 | 114.6
Haitian^{36} | French Creole | 152,823 | 3.37 (0.10) | 986.0 (330.1) | 10429 | 67.3 (24.3) | 23,813 | 153.6
Nahuatl^{37} | Uto-Aztecan | 121,600 | 6.71 (0.24) | 784.5 (260.3) | 9263 | 59.8 (21.4) | 19,271 | 124.3

^{1}https://www.biblegateway.com/versions/Tyndale−House−Greek−New−Testament/#booklist

^{2}http://www.vatican.va/archive/bible/nova_vulgata/documents/nova-vulgata_novum-testamentum_lt.html

^{3}https://newchristianbiblestudy.org/bible/esperanto/

^{4}https://www.bibliacatolica.com.br/

^{5}http://www.vatican.va/archive/ITA0001/_INDEX.HTM

^{6}https://www.bibliacatolica.com.br/

^{7}https://www.biblegateway.com/versions/Nou%C4%83%E2%88%92Traducere%E2%88%92%C3%8En%E2%88%92Limba%E2%88%92Rom%C3%A2n%C4%83%E2%88%92NTLR/%23booklist

^{8}http://www.vatican.va/archive/ESL0506/_INDEX.HTM

^{9}https://www.biblegateway.com/versions/Bibelen%E2%88%92p%C3%A5%E2%88%92hverdagsdansk%E2%88%92BPH/%23booklist

^{10}http://www.vatican.va/archive/ENG0839/_INDEX.HTM

^{11}https://www.biblegateway.com/versions/Raamattu-1933-1938/#booklist

^{12}https://www.uibk.ac.at/theol/leseraum/bibel/mt1.html

^{13}https://www.biblegateway.com/versions/Icelandic-Bible/#booklist

^{14}https://www.biblegateway.com/versions/En-Levende-Bok-LB/#booklist

^{15}https://www.biblegateway.com/versions/nuBibeln-Swedish-Contemporary-Bible-NUB/#booklist

^{16}https://www.biblegateway.com/versions/Bulgarian-Bible-Easy-to-Read-Version-ERV-BG/#booklist

^{17}https://www.biblegateway.com/versions/Bible-21-B21/#booklist

^{18}https://www.biblegateway.com/versions/Hrvatski-Novi-Zavjet-Rijeka-2001-HNZ-RI/#booklist

^{19}https://www.biblegateway.com/versions/Nowe-Przymierze/#booklist

^{20}https://www.biblegateway.com/versions/Russian-Synodal-Version-RUSV/#booklist

^{21}https://www.biblegateway.com/versions/New-Serbian-Translation-NSP-Bible/#booklist

^{22}https://www.biblegateway.com/versions/N%C3%A1dej-pre-kazd%C3%A9ho-NPK/#booklist

^{23}https://www.biblegateway.com/versions/Ukrainian-Bible-Easy-to-Read-Version-ERV-UK/#booklist

^{24}https://newchristianbiblestudy.org/bible/estonian/

^{25}https://www.biblegateway.com/versions/Hungarian-New-Translation/#booklist

^{26}https://www.biblegateway.com/versions/Albanian-Bible-ALB/#booklist

^{27}https://studybible.info/Armenian

^{28}https://www.biblegateway.com/versions/Beibl-William-Morgan-BWM-Bible/#booklist

^{29}http://www.vc.ehu.es/gordailua/testamentu.htm

^{30}https://www.biblegateway.com/versions/Habrit-Hakhadasha-Haderekh/#booklist

^{31}https://www.biblegateway.com/versions/Ang-Pulong-Sa-Dios-APSD-Cebuano/#booklist

^{32}https://www.biblegateway.com/versions/Filipino-Standard-Version-Biblia-FSV/#booklist

^{33}https://www.biblegateway.com/versions/Mawu-a-Mulungu-mu-Chichewa-Chalero-Word-of-God-in-Contemporary-Chichewa-CCL/#booklist

^{34}https://www.biblegateway.com/versions/Endagaano-Enkadde-nEndagaano-Empya-Luganda-Contemporary-Bible-LCB/#booklist

^{35}https://www.biblegateway.com/versions/Somali-Bible-SOM/#booklist

^{36}https://www.biblegateway.com/versions/Haitian-Creole-Version-HCV/#booklist

Language | P_{F} | I_{P} | M_{F}
---|---|---|---
Greek | 23.07 (6.65) 0.897 | 7.47 (1.09) 0.930 | 3.08 (0.73) 0.938
Latin | 18.28 (4.77) 0.901 | 5.07 (0.68) 0.952 | 3.60 (0.77) 0.937
Esperanto | 21.83 (5.22) 0.916 | 5.05 (0.57) 0.967 | 4.30 (0.76) 0.955
French | 18.73 (2.51) 0.942 | 7.54 (0.85) 0.948 | 2.50 (0.32) 0.951
Italian | 18.33 (3.27) 0.907 | 6.38 (0.95) 0.948 | 2.89 (0.40) 0.954
Portuguese | 16.18 (3.25) 0.913 | 5.54 (0.59) 0.962 | 2.93 (0.56) 0.948
Romanian | 18.00 (4.19) 0.910 | 6.49 (0.74) 0.959 | 2.78 (0.65) 0.938
Spanish | 19.07 (3.79) 0.926 | 6.55 (0.82) 0.962 | 2.91 (0.47) 0.958
Danish | 15.38 (2.15) 0.935 | 5.97 (0.64) 0.957 | 2.59 (0.33) 0.955
English | 19.32 (3.20) 0.917 | 7.51 (0.93) 0.951 | 2.58 (0.39) 0.948
Finnish | 17.44 (4.09) 0.939 | 4.94 (0.56) 0.962 | 3.54 (0.75) 0.946
German | 17.23 (2.77) 0.949 | 5.89 (0.60) 0.962 | 2.94 (0.45) 0.955
Icelandic | 15.72 (2.58) 0.934 | 5.69 (0.67) 0.960 | 2.77 (0.39) 0.953
Norwegian | 15.21 (1.43) 0.968 | 7.75 (0.84) 0.958 | 1.98 (0.22) 0.962
Swedish | 15.95 (2.17) 0.959 | 8.06 (1.35) 0.922 | 2.01 (0.31) 0.950
Bulgarian | 14.97 (2.61) 0.930 | 5.64 (0.64) 0.959 | 2.67 (0.43) 0.948
Czech | 13.20 (3.10) 0.920 | 4.89 (0.65) 0.950 | 2.71 (0.61) 0.928
Croatian | 15.32 (3.54) 0.928 | 5.62 (0.75) 0.950 | 2.72 (0.49) 0.961
Polish | 12.34 (1.93) 0.913 | 4.65 (0.43) 0.965 | 2.67 (0.40) 0.925
Russian | 17.90 (4.46) 0.898 | 4.28 (0.46) 0.971 | 4.18 (0.92) 0.927
Serbian | 14.46 (2.42) 0.929 | 5.81 (0.69) 0.951 | 2.50 (0.39) 0.944
Slovak | 12.95 (2.10) 0.929 | 5.18 (0.61) 0.953 | 2.51 (0.36) 0.954
Ukrainian | 13.81 (2.18) 0.963 | 4.72 (0.41) 0.973 | 2.95 (0.58) 0.945
Estonian | 17.09 (3.89) 0.927 | 5.45 (0.66) 0.956 | 3.14 (0.64) 0.947
Hungarian | 17.37 (4.54) 0.943 | 4.25 (0.45) 0.972 | 4.09 (0.93) 0.948
Albanian | 22.72 (4.86) 0.925 | 6.52 (0.78) 0.961 | 3.48 (0.61) 0.958
Armenian | 16.09 (3.07) 0.930 | 5.63 (0.52) 0.970 | 2.86 (0.47) 0.964
Welsh | 24.27 (4.75) 0.941 | 5.84 (0.44) 0.982 | 4.16 (0.76) 0.949
Basque | 18.09 (4.31) 0.934 | 4.99 (0.52) 0.967 | 3.63 (0.81) 0.951
Hebrew | 12.17 (2.04) 0.935 | 5.65 (0.59) 0.962 | 2.16 (0.33) 0.964
Cebuano | 16.15 (1.71) 0.968 | 8.82 (1.01) 0.947 | 1.85 (0.22) 0.958
Tagalog | 16.98 (3.24) 0.943 | 7.92 (0.82) 0.956 | 2.16 (0.44) 0.936
Chichewa | 12.89 (1.79) 0.940 | 6.18 (0.87) 0.942 | 2.10 (0.25) 0.960
Luganda | 13.65 (2.78) 0.931 | 5.74 (0.82) 0.949 | 2.39 (0.40) 0.950
Somali | 19.57 (5.50) 0.882 | 6.37 (1.01) 0.930 | 3.06 (0.65) 0.940
Haitian | 14.87 (1.83) 0.943 | 6.55 (0.71) 0.967 | 2.28 (0.26) 0.957
Nahuatl | 13.36 (1.70) 0.938 | 6.47 (0.91) 0.930 | 2.08 (0.24) 0.958

To avoid misuse of the results reported in

The correlation r between the number of characters and the number of words is not reported because, as for Italian [

When a text written in one language is translated into another language, all linguistic variables change numerically. Besides the total number of words W, sentences S and interpunctions I, the other main linguistic variables are the number of words n_{W}, sentences n_{S}, and interpunctions n_{I} per chapter. To these we add the number of characters per word C_{P}, words per sentence P_{F}, words per interpunction I_{P}, and interpunctions per sentence M_{F}. We refer to this latter set of variables as the deep-language variables. All these variables of a language Y can be statistically compared to those of a reference language X (Greek) by calculating the correlation coefficient^{38} r between any pair of variables y of language/translation Y and x of the reference language/translation X (in the following, where no confusion is possible, we refer to a variable and to the language/translation with the same mathematical symbol), and their expected regression line (i.e., the relationship between averages):

y = m x (1)

^{38}The correlation coefficient r between two variables x and y, with averages m_{x}, m_{y} and standard deviations s_{x}, s_{y}, is given by r = μ_{xy}/(s_{x}s_{y}), where μ_{xy} = ⟨(x − m_{x})(y − m_{y})⟩ is the covariance and ⟨·⟩ indicates the average value [

with m the slope of the line. Of course we expect, and it is so in the following, that no translation can yield r = 1 and m = 1, the case referred to as the ideal translation: in practice we always find r < 1 and m ≠ 1. The slope m measures the multiplicative “bias” of the dependent variable compared to the independent variable; the correlation coefficient measures how “precise” the linear fit is. Even though the ideal translation is never found, it is useful as a reference model to which real translations can be compared. In the following we refer to it as the self-translation channel.
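The slope m and correlation coefficient r of each channel can be estimated as follows; taking m as the ordinary least-squares slope r·s_y/s_x is our assumption, chosen because it is consistent with the relationship r^{2}s_{y}^{2} = m^{2}s_{x}^{2} used later in the noise analysis:

```python
import math

def channel_fit(x, y):
    """Slope m and correlation r of the regression line y = m*x between
    the same variable in two translations (footnote-38 definitions)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)   # std dev of x
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / n)   # std dev of y
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    r = cov / (sx * sy)          # correlation coefficient
    m = r * sy / sx              # slope, so that r^2*s_y^2 = m^2*s_x^2 holds
    return m, r
```

For a pair of perfectly proportional samples the ideal-translation case m = r = 1 is only approached, never exceeded.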

The scatter plots of n_{W} in Greek against n_{W} in the other languages listed in the tables above show that a translation can yield more or fewer words than Greek, and that Latin (red line) is one of the closest translations to Greek.

According to the regression lines, i.e., the relationship between the average values in Y for assigned values in X (Greek), the translations that reduce the number of words (regression lines below the 45° line y = x in

Let us now consider the number of sentences per chapter, n_{S}. All languages/translations have more sentences than Greek, with slopes ranging from Latin (m = 1.123) to Haitian (m = 2.085); the correlation coefficients are slightly smaller than those of n_{W}, in the range 0.899 ≤ r ≤ 0.969,

Let us now consider the number of interpunctions per chapter, n_{I}. Most translations use more interpunctions than Greek, with slopes ranging from Swedish (m = 1.107) to Haitian (m = 1.730), see

Language | n_{W} m | n_{W} r | n_{S} m | n_{S} r | n_{I} m | n_{I} r | P_{F} m | P_{F} r | I_{P} m | I_{P} r | M_{F} m | M_{F} r
---|---|---|---|---|---|---|---|---|---|---|---|---
Greek | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
Latin | 0.909 | 0.994 | 1.123 | 0.969 | 1.347 | 0.965 | 0.780 | 0.883 | 0.673 | 0.688 | 1.149 | 0.731
Esperanto | 1.110 | 0.991 | 1.139 | 0.966 | 1.647 | 0.961 | 0.925 | 0.842 | 0.668 | 0.586 | 1.356 | 0.621
French | 1.320 | 0.970 | 1.458 | 0.939 | 1.293 | 0.960 | 0.768 | 0.619 | 0.999 | 0.676 | 0.779 | 0.449
Italian | 1.125 | 0.985 | 1.303 | 0.935 | 1.338 | 0.957 | 0.757 | 0.602 | 0.846 | 0.536 | 0.894 | 0.218
Portuguese | 1.091 | 0.984 | 1.442 | 0.948 | 1.459 | 0.954 | 0.675 | 0.728 | 0.732 | 0.502 | 0.922 | 0.546
Romanian | 1.186 | 0.983 | 1.450 | 0.956 | 1.348 | 0.955 | 0.759 | 0.798 | 0.858 | 0.536 | 0.888 | 0.675
Spanish | 1.168 | 0.980 | 1.338 | 0.957 | 1.338 | 0.947 | 0.797 | 0.760 | 0.866 | 0.503 | 0.914 | 0.523
Danish | 1.308 | 0.963 | 1.784 | 0.943 | 1.612 | 0.958 | 0.630 | 0.571 | 0.789 | 0.625 | 0.805 | 0.394
English | 1.225 | 0.986 | 1.346 | 0.942 | 1.216 | 0.966 | 0.797 | 0.647 | 0.995 | 0.597 | 0.808 | 0.456
Finnish | 0.959 | 0.987 | 1.226 | 0.969 | 1.439 | 0.965 | 0.738 | 0.850 | 0.655 | 0.656 | 1.128 | 0.743
German | 1.167 | 0.968 | 1.440 | 0.936 | 1.468 | 0.953 | 0.713 | 0.721 | 0.779 | 0.674 | 0.920 | 0.468
Icelandic | 1.088 | 0.974 | 1.461 | 0.926 | 1.427 | 0.957 | 0.650 | 0.701 | 0.754 | 0.619 | 0.866 | 0.458
Norwegian | 1.401 | 0.956 | 1.848 | 0.905 | 1.331 | 0.958 | 0.612 | 0.152 | 1.024 | 0.548 | 0.609 | 0.073
Swedish | 1.187 | 0.980 | 1.559 | 0.939 | 1.107 | 0.960 | 0.656 | 0.700 | 1.073 | 0.673 | 0.631 | 0.565
Bulgarian | 1.110 | 0.973 | 1.572 | 0.929 | 1.458 | 0.953 | 0.614 | 0.492 | 0.746 | 0.556 | 0.830 | 0.305
Czech | 0.922 | 0.985 | 1.552 | 0.944 | 1.422 | 0.950 | 0.554 | 0.715 | 0.649 | 0.650 | 0.856 | 0.501
Croatian | 0.974 | 0.992 | 1.393 | 0.956 | 1.291 | 0.967 | 0.646 | 0.802 | 0.747 | 0.720 | 0.864 | 0.711
Polish | 0.995 | 0.986 | 1.631 | 0.899 | 1.558 | 0.959 | 0.498 | 0.179 | 0.614 | 0.600 | 0.824 | 0.124
Russian | 0.927 | 0.990 | 1.164 | 0.952 | 1.610 | 0.958 | 0.756 | 0.769 | 0.566 | 0.542 | 1.327 | 0.646
Serbian | 1.045 | 0.983 | 1.539 | 0.936 | 1.326 | 0.961 | 0.600 | 0.734 | 0.770 | 0.695 | 0.786 | 0.603
Slovak | 0.997 | 0.964 | 1.634 | 0.939 | 1.432 | 0.954 | 0.536 | 0.703 | 0.686 | 0.625 | 0.785 | 0.508
Ukrainian | 1.071 | 0.972 | 1.637 | 0.924 | 1.650 | 0.959 | 0.568 | 0.594 | 0.621 | 0.431 | 0.923 | 0.522
Estonian | 1.016 | 0.985 | 1.304 | 0.965 | 1.389 | 0.967 | 0.720 | 0.794 | 0.723 | 0.717 | 0.997 | 0.640
Hungarian | 0.956 | 0.986 | 1.235 | 0.956 | 1.671 | 0.961 | 0.734 | 0.740 | 0.561 | 0.512 | 1.301 | 0.650
Albanian | 1.235 | 0.984 | 1.203 | 0.965 | 1.409 | 0.956 | 0.957 | 0.850 | 0.862 | 0.513 | 1.104 | 0.760
Armenian | 1.008 | 0.979 | 1.366 | 0.956 | 1.321 | 0.974 | 0.669 | 0.698 | 0.743 | 0.587 | 0.901 | 0.663
Welsh | 1.309 | 0.985 | 1.174 | 0.961 | 1.642 | 0.956 | 1.015 | 0.786 | 0.770 | 0.489 | 1.313 | 0.639
Basque | 0.952 | 0.991 | 1.158 | 0.956 | 1.413 | 0.970 | 0.762 | 0.764 | 0.661 | 0.720 | 1.155 | 0.667
Hebrew | 0.881 | 0.949 | 1.556 | 0.940 | 1.142 | 0.938 | 0.503 | 0.658 | 0.745 | 0.447 | 0.676 | 0.617
Cebuano | 1.460 | 0.974 | 1.856 | 0.921 | 1.217 | 0.965 | 0.654 | 0.373 | 1.168 | 0.613 | 0.571 | 0.239
Tagalog | 1.278 | 0.979 | 1.619 | 0.918 | 1.187 | 0.943 | 0.706 | 0.682 | 1.047 | 0.583 | 0.683 | 0.618
Chichewa | 0.942 | 0.979 | 1.523 | 0.928 | 1.155 | 0.954 | 0.529 | 0.626 | 0.819 | 0.551 | 0.648 | 0.095
Luganda | 0.916 | 0.972 | 1.446 | 0.941 | 1.196 | 0.939 | 0.569 | 0.692 | 0.761 | 0.557 | 0.751 | 0.587
Somali | 1.090 | 0.979 | 1.276 | 0.958 | 1.295 | 0.970 | 0.835 | 0.813 | 0.848 | 0.695 | 0.973 | 0.665
Haitian | 1.518 | 0.971 | 2.085 | 0.912 | 1.730 | 0.951 | 0.603 | 0.363 | 0.864 | 0.411 | 0.702 | 0.012
Nahuatl | 1.205 | 0.956 | 1.205 | 0.956 | 1.398 | 0.938 | 0.544 | 0.448 | 0.857 | 0.559 | 0.642 | 0.067
Range | 0.881–1.518 | 0.949–0.994 | 1.123–2.085 | 0.899–0.969 | 1.107–1.730 | 0.938–0.974 | 0.503–1.015 | 0.152–0.883 | 0.561–1.168 | 0.411–0.720 | 0.571–1.356 | 0.012–0.760

A larger spread can be noticed in the deep-language variables P_{F}, I_{P} and M_{F}: compared to n_{W}, n_{S} and n_{I}, the multiplicative bias increases for all languages, with very few exceptions (e.g. Esperanto and Welsh in the variable P_{F}), and the correlation coefficients become smaller.

Now, to study the chaotic data reported in Tables 1-3, it is very useful to consider a translated text as the output of a communication channel fed by the original text. The characteristics of this channel (one for each stochastic variable) can give us more insight into the mathematical/statistical deep structure of alphabetical (and possibly human) languages. Before doing so, in the next section we define a useful parameter, namely the noise-to-signal power ratio of a real translation channel compared to the ideal channel.

We characterize any translation and its linguistic stochastic variables as a complex communication channel, made of parallel channels—one for each variable—affected by “noise”. The input language is the “signal”; the output language is a “replica” of the input language, largely perturbed by noise. From the point of view of the output language this noise is, of course, indispensable for conveying the meaning to readers of the output language. To study these channels, we define a suitable noise-to-signal power ratio and use a geometrical representation borrowed from the author’s design of deep-space radio links [

Two variables y and x, linked by a regression line y = mx, where m is the slope of the line, are perfectly correlated if the correlation coefficient r = 1, and are unbiased if m = 1; in other words, if the regression line is y = x (the 45° line, m = 1) and all y-values lie on the line (r = 1). If these conditions are not met, we consider the variance of the difference between the regression-line values (m ≠ 1) and the ideal-line y = x values, at given x-values, as the “regression noise” power N_{m}, and the variance of the difference between the values not lying on the line and the regression line y = mx (r ≠ 1) as the “correlation noise” power N_{r}.

Let us apply these concepts to language translation. Given the variance s_{x}^{2} of language x and the variance s_{y}^{2} of language y, the difference between the regression line of the real translation channel and that of the ideal channel, at a given x, is (m − 1)x; therefore the variance (or power) of the regression noise is:

N_{m} = (m − 1)^{2}s_{x}^{2} (2)

Then, the regression noise-to-signal power ratio, R_{m}, is given by:

R_{m} = N_{m}/s_{x}^{2} = (m − 1)^{2} (3)

Notice that in (3) what counts is the absolute difference |m − 1|, because R_{m} is an even function (a parabola) around m = 1.

According to the theory of regression lines [, the correlation noise power N_{r} is given by:

N_{r} = (1 − r^{2})s_{y}^{2} (4)

This noise power is correlated with the slope m, because the fraction of the variance s_{y}^{2} due to the regression line y = mx, namely r^{2}s_{y}^{2}, is related to m according to the following relationship [

r^{2}s_{y}^{2} = m^{2}s_{x}^{2} (5)

Therefore, the correlation noise-to-signal power ratio, R_{r}, is given by:

R_{r} = N_{r}/s_{x}^{2} = ((1 − r^{2})/r^{2})m^{2} (6)

Now, because the two noise sources are disjoint, the total noise-to-signal power ratio of the channel is given by:

R = R_{m} + R_{r} (7)

By (3) and (6), R depends only on the two parameters m and r of the regression line (

R = (m − 1)^{2} + ((1 − r^{2})/r^{2})m^{2} (8)
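Equations (3), (6) and (8) translate directly into code; a minimal sketch:

```python
def noise_to_signal(m: float, r: float) -> tuple:
    """Regression, correlation and total noise-to-signal power ratios
    of a translation channel, per Equations (3), (6) and (8)."""
    R_m = (m - 1.0) ** 2                      # regression noise / signal
    R_r = (1.0 - r ** 2) / r ** 2 * m ** 2    # correlation noise / signal
    return R_m, R_r, R_m + R_r                # (R_m, R_r, total R)
```

The ideal self-translation channel (m = 1, r = 1) gives R = 0, i.e., an infinite signal-to-noise ratio ρ = 1/R.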

For each pair of corresponding variables, in Greek and in a translation, we can represent Equation (8) graphically by considering the variables (not to be confused with translations):

X = R_{m} (9a)

Y = R_{r} (9b)

By setting R = R_{o}, with R_{o} a constant, the points of coordinates √X and √Y trace a circle of radius √R_{o} = √(R_{m} + R_{r}) in the first Cartesian quadrant. All points inside the circle correspond to R < R_{o}; the origin of the axes corresponds to R = 0 of the ideal channel (m = 1 and r = 1). The reciprocal of R is the signal-to-noise power ratio ρ = 1/R, which becomes infinite at the origin and decreases as the radius of the circle increases.
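A minimal sketch of this geometrical representation (our reconstruction): each channel is plotted at coordinates (√R_{m}, √R_{r}), so that channels with equal total R lie on a circle of radius √R centred at the origin, i.e., at the ideal channel:

```python
import math

def channel_point(m: float, r: float) -> tuple:
    """Plotting coordinates (sqrt(R_m), sqrt(R_r)) of a channel."""
    R_m = (m - 1.0) ** 2
    R_r = (1.0 - r ** 2) / r ** 2 * m ** 2
    return math.sqrt(R_m), math.sqrt(R_r)

def total_R(point: tuple) -> float:
    """Squared distance from the origin equals the total ratio R."""
    X, Y = point
    return X ** 2 + Y ** 2
```

A channel's distance from the origin thus measures, at a glance, how far the translation is from the ideal self-translation.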

As discussed in [, this geometrical representation shows how R_{r} and R_{m}, through their square roots, contribute to the total R, and which of the two pushes the translation away from the ideal self-translation.

In conclusion, the comparison between any couple of corresponding variables can be studied as a “communication channel” in which the input signal is the Greek text variable and the output signal is the translation variable. Compared to the ideal channel, the actual channel is noisy, always characterized by R > 0. Of course, as already noted, this indispensable “noise” is what actually makes the translation intelligible to the intended readers of the translated texts. In the next section we study these communication channels.

We compare, for each chapter, the numbers of words, sentences, interpunctions, and the so-called deep-language variables P_{F}, I_{P}, M_{F}, of the original Greek texts to those of another language. The values of the slope m of the linear model (1) and the correlation r for all variables and translations can be read in

Let us first consider the words channel n_{W}.

Y = 0.477 X + 0.157 (10)

A regression line Y = a X + b with a > 0, as Equation (10), is due to languages with m > 1, while a regression line with a < 0 is due to languages with m < 1.

From Equation (10) it turns out that, even though some translations can be practically unbiased (m ≈ 1), as is the case of Slovak, they can never be perfectly correlated with the Greek texts, i.e., their correlation coefficient can never approach 1. In fact, when m = 1, i.e. X = 0, from Equation (10) we get Y = 0.157 and, by setting m = 1 in Equation (6), we can calculate the corresponding “irreducible” (minimum) correlation coefficient:

r_{m=1} = 1/√(1 + b) = 0.930 (11)
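Equation (11) follows from Equation (6) with m = 1: setting R_{r} equal to the intercept b of Equation (10) and solving (1 − r²)/r² = b for r gives r = 1/√(1 + b). A quick numerical check:

```python
import math

def irreducible_r(b: float) -> float:
    """Minimum correlation coefficient of an unbiased (m = 1) channel,
    from (1 - r^2)/r^2 = b, i.e., r = 1/sqrt(1 + b) (Equation (11))."""
    return 1.0 / math.sqrt(1.0 + b)
```

With b = 0.157 from Equation (10), this returns approximately 0.930, the value quoted in Equation (11).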

This value has to be compared with the minimum value 0.949 of Hebrew (

In conclusion, even though the channel can be very close to ideal in its slope (m ≈ 1: no bias on average, very small regression noise), it can never be ideal in its correlation coefficient; there is always some significant correlation noise around the 45° line. Notice that there is no clear trend among the various language families, except for the Balto-Slavic family, which minimizes the regression noise X because m ≈ 1; these translations are therefore grouped towards the Y-axis. The noisiest languages are Norwegian, Cebuano and Haitian.

Let us consider the sentences channel n_{S}, whose results are shown in

Let us consider the interpunctions channel n_{I}, whose results are also shown in the same figures; its behaviour is comparable to that of the n_{W} and n_{S} channels. Swedish is the least noisy language, Haitian the noisiest. Each language, in fact, introduces a very different distribution of interpunctions in a chapter, both in type (full stops, question marks, exclamation marks, commas, colons, semicolons) and in quantity, thereby changing the length of sentences, the word intervals, and the number of interpunctions per sentence. The regression line drawn in

Let us consider the number of words per sentence, P_{F}, a deep-language variable.

The results of the channel concerning the number of words per interpunction, i.e. the word interval I_{P}, show that the I_{P} channel is less noisy than the P_{F} channel. It seems that I_{P} cannot be set as independently from Greek as P_{F} apparently can. A likely explanation is that the word interval is empirically correlated with the short-term memory capacity, and this capacity is limited according to Miller's 7 ± 2 law [ ]; this is not the case for P_{F}, a variable more linked to the output language, or to translation style and intended readers through a readability index (see Section 8), than to human short-term memory capacity.

The results of the channel concerning the number of interpunctions per sentence, M_{F}, are also shown in the same figures. The I_{P} and M_{F} channels are quite similar for most languages.

Compared to the n_{W}, n_{S} and n_{I} channels, the deep-language variable channels are the noisiest. The reason seems to be, again, the different distribution of interpunctions. For these channels we have not drawn regression lines because the correlation coefficients are small.

Let us summarize the main results of this section. The channels studied are differently affected by the translation noise. The most accurate channel is the word channel n_{W}, a finding that seems reasonable. Humans seem to express a given meaning with a number of words—i.e. finite strings of abstract signs (characters)—which cannot vary so much even if some languages (Hebrew, Welsh, Basque etc.) do not share, according to scholars, a common ancestor with most other languages. This result seems to be something basic to human processing capabilities.

The number of sentences and their length in words, i.e. P_{F}, can be treated more freely. We know that P_{F} affects readability indices very much, as shown for Italian [

Finally, we observe that, independently of the different channels, the correlation noise is always larger than the regression noise, therefore indicating that every translation tries as much as possible not to be biased, but it cannot avoid being decorrelated, with correlation coefficients which approximately decrease from words, to sentences, to interpunctions and down to the deep-language variables.

Besides the noise-to-signal power ratio, communication channels can be also characterized by the channel capacity, as we discuss in the next section.

The noise-to-signal power ratio and its universal geometrical representation are not the only interesting ways of studying noisy channels. Noisy channels can also be characterized by a single variable, namely the channel capacity, or mutual information, defined by Shannon [

According to Shannon [

C = 0.5 × log_{2}(1 + 1/R) (12)

In our analysis the term “symbol” is defined according to the linguistic variable under study. For example, in the words channel the “symbol” is the number of words per chapter, i.e. the actual values n_{W} of the input and output chapters. In Matthew, Chapter 5, the input symbol (Greek) is 823, while the output symbol is 1006 in English, 932 in Italian and 765 in Russian. Therefore, the additive noise is 1006 − 823 = 183 in English, 932 − 823 = 109 in Italian and 765 − 823 = −58 in Russian. This noise can be relatively large, as it peaks at 22.2% of the input value in the English translation. The signal-to-noise power ratio of this sample is, therefore, (823/183)^{2} = 20.2 in English and (823/58)^{2} = 201.3 in Russian, synthetically underlining that the Russian translation is closer to Greek than the English one.
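The per-sample arithmetic above can be sketched as follows (the chapter word counts are copied from the example; this is only a single-sample illustration, not the paper's full chapter-by-chapter pipeline):

```python
# Noise and signal-to-noise computation for Matthew, Chapter 5.
greek = 823  # input symbol: words in the Greek chapter

# output symbols: words in the same chapter, per translation
translations = {"English": 1006, "Italian": 932, "Russian": 765}

for lang, words in translations.items():
    noise = words - greek            # additive noise sample
    rel = 100 * abs(noise) / greek   # noise relative to the input (%)
    snr = (greek / noise) ** 2       # signal-to-noise power ratio
    print(f"{lang}: noise={noise:+d} ({rel:.1f}%), S/N={snr:.1f}")
```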

In other words, we do not consider the classical information content of texts according to communication/information theory, which, to a first approximation, is measured by the entropy of letters [

For a constant R = R_{o}, Equation (12) gives the minimum channel capacity if the noise is Gaussian. If the noise is not Gaussian, the actual channel capacity is larger than (12) [

Of the two noise sources defined in Section 4, the correlation noise and the regression noise, the latter is deterministic (it could be cancelled by dividing the variables of the output language by the corresponding m, if known), while the former can be approximately modelled as Gaussian (Figures 1-6). Therefore, if we assume that both sources of noise are Gaussian, the channel capacity calculated with Equation (12) is pessimistic. In any case, this is not of concern here because Equation (12) is used only for comparing different translations.
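As an illustration only, one can feed single-sample noise-to-signal ratios (taken from the Matthew 5 word counts quoted earlier) into Equation (12); the paper itself computes R from the slope m and the correlation coefficient r over all chapters, so these numbers are merely indicative:

```python
import math

def capacity(R):
    # Equation (12): channel capacity (bits per symbol) for a
    # noise-to-signal power ratio R.
    return 0.5 * math.log2(1 + 1 / R)

# Single-sample noise-to-signal power ratios from the Matthew 5 example
R_english = (183 / 823) ** 2   # ~ 0.049
R_russian = (58 / 823) ** 2    # ~ 0.005

print(round(capacity(R_english), 2))  # 2.2
print(round(capacity(R_russian), 2))  # 3.83
```

A smaller noise-to-signal ratio (the Russian sample) yields a larger capacity, consistently with the comparison in the text.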

We have already shown contours of constant capacity C (given, of course, by constant R = R_{o}) in Figures 7-9, namely the black arcs of circles. At the origin of the Cartesian coordinates R = 0, therefore ρ = ∞ and C = ∞. This last result, valid for the continuous channel assumed in Equation (12), merely means that the channel does not impose any limit on the output information; in this case the mutual information coincides with the input self-information of the Greek texts.

Of the channels studied in Section 5, the words channel n_{W} has the largest channel capacity for most translations. Comparing the capacities of the n_{W} and n_{S} channels, we notice that the two channels are quite correlated; for Welsh the two capacities are even practically identical. The capacities of the n_{W} and n_{I} channels, on the contrary, are practically uncorrelated. In Appendix C we report the scatterplots of the capacities of the words and sentences channels against the deep-language channel capacities. In all cases we notice a poor correlation, except partially for the P_{F} channel, evidencing, again, the fact that every translation has its own pattern of interpunctions within a chapter, which determines P_{F}, I_{P} and M_{F}.

Some interesting observations can be made on the mixed scatterplots of the n_{W}, n_{S} and I_{P} channel capacities. The correlation between these variables is evident: as I_{P} increases, thus loading the reader's short-term memory more, the channel capacities decrease. In other words, by decreasing this important deep-language variable, I_{P}, channels tend to be closer to the ideal channels of words, sentences and I_{P} itself.

Unlike the word interval I_{P}, the number of words per sentence P_{F} is well correlated only with its own channel capacity, which grows as P_{F} approaches the Greek value (23.07). The fact that, as I_{P} approaches the Greek value (7.47), the I_{P} channel capacity decreases underlines that I_{P} seems to be more related to how the human brain processes texts (short-term memory), regardless of the particular language. In other words, translations do not follow the high Greek I_{P}. On the contrary, P_{F} is more related and matched to the intended readers through the readability index, which does not consider I_{P} [

In the next subsection we discuss how large the capacity of linguistic channels is.

Two questions arise: 1) Are the channel capacities large? 2) How can we assess how large they are? Let us start with studying the sensitivity of the channel capacity to the parameters m and r.

Equations (12) and (8) describe the relationship between the channel capacity C and the slope m, as a function of the correlation coefficient r. For illustration, we have also reported the values of the words channel capacity of some translations.

The maxima of C are found from Equation (12) when R = R_{min}, which occurs if:

m_{C_{max}} = r^{2} (13)

Therefore, from (8) it follows

R_{min} = 1 − r^{2} (14)

Consequently, from (12) we get:

C_{max} = 0.5 × log_{2}[1 + 1/(1 − r^{2})] (15)
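A numerical sketch of Equations (13)-(15), assuming (consistently with Equation (27) of the general theory) that Equation (8) has the form R(m) = (m − 1)^{2} + (1 − r^{2})m^{2}/r^{2}; the value r = 0.9 is arbitrary:

```python
import math

r = 0.9  # arbitrary correlation coefficient for the check

def R_total(m):
    # Assumed form of Equation (8): regression noise plus correlation noise.
    return (m - 1) ** 2 + (1 - r ** 2) * m ** 2 / r ** 2

# Brute-force minimization of R over a fine grid of slopes m in [0.5, 1.5]
m_star = min((0.5 + i * 1e-4 for i in range(10001)), key=R_total)

C_max = 0.5 * math.log2(1 + 1 / (1 - r ** 2))  # Equation (15)

print(round(m_star, 3))           # 0.81, i.e. r^2 as in Equation (13)
print(round(R_total(m_star), 3))  # 0.19, i.e. 1 - r^2 as in Equation (14)
print(round(C_max, 3))
```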

Because of (15), we can normalize each channel capacity C to its maximum C_{max} and study how the ratio C/C_{max} varies. Notice that C/C_{max} practically follows the same mathematical function, regardless of the channel (words or sentences), when the correlation coefficient r is about the same for all languages; this is not the case for the P_{F} and I_{P} channels, whose correlation coefficients vary widely, although the M_{F} channel follows the same trend (not shown).

In conclusion, the capacities of the n_{W}, n_{S} and n_{I} channels follow the universal chart very closely because of similarly high correlation coefficients; on the contrary, the capacities of the P_{F} and I_{P} channels are more spread out because their correlation coefficients vary greatly from translation to translation.

As studied and discussed in [ ], the word interval I_{P} varies in the same range as the short-term memory capacity given by Miller's 7 ± 2 law [ ]. When I_{P} is plotted against the number of words per sentence P_{F}, I_{P} tends to saturate to a horizontal asymptote as P_{F} increases. In other words, even if sentences get longer, I_{P} cannot get much larger than the upper limit of Miller's law (namely 9), because of the constraints imposed by the short-term memory capacity of readers.

Empirically (by best fit), the average value of I_{P} is related to the average value of P_{F} according to the relationship [

I_{P} = (I_{P∞} − 1) × [1 − e^{−(P_{F} − 1)/(P_{Fo} − 1)}] + 1 (16)

where I_{P∞} gives the horizontal asymptote, and P_{Fo} gives the value of P_{F} at which the exponential falls to 1/e of its maximum value. We apply Equation (16) to the NT translations. Because both I_{P} and P_{F} depend on the translation, we find different constants in Equation (16), listed in the table below.
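A minimal sketch of Equation (16), using the Greek constants from the table below (I_{P∞} = 9.27, P_{Fo} = 13.61), shows the saturation behaviour:

```python
import math

def word_interval(P_F, I_P_inf=9.27, P_Fo=13.61):
    # Equation (16): average word interval I_P as a function of the
    # average sentence length P_F (Greek constants as defaults).
    return (I_P_inf - 1) * (1 - math.exp(-(P_F - 1) / (P_Fo - 1))) + 1

# As sentences get longer, I_P saturates towards the asymptote I_P_inf,
# consistently with the upper bound (about 9) of Miller's 7 +/- 2 law.
for P_F in (5, 15, 30, 60):
    print(P_F, round(word_interval(P_F), 2))
```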

For the Greek texts, I_{P} spreads within Miller's range. Not surprisingly, the ancient readers of these texts had the same short-term memory capacity as modern readers, i.e. they followed Miller's 7 ± 2 law. This finding is confirmed by the results concerning modern languages, for which, however, the spread within Miller's range can differ from translation to translation. Some translations, such as Latin and Hebrew, tend to use smaller values of I_{P} (see the values of I_{P∞} in the table below).

In conclusion, each translation tends to address readers with different reading abilities, because small I_{P} values are better matched to readers with a small short-term memory capacity, who can therefore handle only short sentences; this correlates well with a large readability index, as we show in the next section.

As discussed in [

Language | I_{P∞} | P_{Fo} | G | m | r |
---|---|---|---|---|---|
Greek | 9.27 | 13.61 | 58.44 (4.27) | 1 | 1 |
Latin | 6.15 | 10.50 | 62.06 (5.17) | 1.058 | 0.889 |
Esperanto | 6.34 | 14.06 | 59.02 (3.77) | 1.008 | 0.848 |
French | 9.46 | 11.78 | 60.63 (2.78) | 1.037 | 0.751 |
Italian | 9.66 | 17.56 | 61.22 (3.86) | 1.048 | 0.736 |
Portuguese | 6.45 | 8.18 | 63.48 (3.65) | 1.090 | 0.746 |
Romanian | 7.28 | 7.76 | 62.02 (4.53) | 1.058 | 0.770 |
Spanish | 8.43 | 12.78 | 60.81 (3.80) | 1.040 | 0.823 |
Danish | 7.56 | 10.00 | 64.26 (3.43) | 1.100 | 0.701 |
English | 9.53 | 12.45 | 60.45 (3.71) | 1.031 | 0.745 |
Finnish | 5.65 | 8.31 | 62.76 (4.73) | 0.785 | 0.802 |
German | 6.75 | 8.36 | 62.37 (3.55) | 1.063 | 0.755 |
Icelandic | 7.19 | 10.16 | 64.13 (3.56) | 1.095 | 0.713 |
Norwegian | 11.21 | 13.05 | 67.04 (2.81) | 1.099 | 0.435 |
Swedish | 13.38 | 17.57 | 63.65 (3.56) | 1.101 | 0.733 |
Bulgarian | 6.65 | 7.90 | 65.06 (3.76) | 1.112 | 0.704 |
Czech | 5.93 | 7.49 | 68.39 (5.45) | 1.168 | 0.722 |
Croatian | 7.37 | 10.68 | 65.15 (4.80) | 1.111 | 0.795 |
Polish | 4.98 | 4.42 | 68.78 (4.54) | 1.184 | 0.448 |
Russian | 4.80 | 7.94 | 62.44 (5.02) | 1.062 | 0.844 |
Serbian | 7.11 | 8.49 | 65.90 (4.45) | 1.125 | 0.736 |
Slovak | 6.52 | 8.26 | 68.01 (4.66) | 1.167 | 0.678 |
Ukrainian | 4.76 | 2.73 | 66.60 (3.52) | 1.140 | 0.700 |
Estonian | 6.40 | 8.83 | 62.99 (4.64) | 1.075 | 0.814 |
Hungarian | 4.77 | 7.76 | 62.90 (4.25) | 0.892 | 0.787 |
Albanian | 8.07 | 13.82 | 58.39 (3.76) | 0.996 | 0.781 |
Armenian | 6.44 | 7.65 | 64.38 (5.92) | 1.093 | 0.791 |
Welsh | 6.25 | 8.65 | 57.31 (2.82) | 1.025 | 0.786 |
Basque | 5.63 | 8.21 | 61.98 (4.01) | 0.691 | 0.801 |
Hebrew | 6.76 | 6.61 | 69.64 (4.99) | 1.192 | 0.682 |
Cebuano | 11.95 | 11.98 | 63.06 (2.05) | 1.079 | 0.400 |
Tagalog | 8.57 | 6.22 | 62.78 (3.11) | 1.071 | 0.592 |
Chichewa | 12.34 | 19.34 | 68.16 (3.51) | 1.166 | 0.659 |
Luganda | 7.65 | 9.86 | 67.33 (4.56) | 1.151 | 0.713 |
Somali | 8.57 | 14.31 | 60.92 (3.94) | 1.041 | 0.791 |
Haitian | 9.12 | 11.94 | 64.72 (3.06) | 1.109 | 0.519 |
Nahuatl | 11.35 | 16.32 | 67.04 (2.81) | 1.149 | 0.521 |

In particular, the last observation justifies our proposal to adopt a readability formula that can be used for comparing texts of different languages, because most languages do not have a readability formula of their own, and only a few adapt formulae developed for English texts [

For this purpose, we propose to adopt, as a calque, the readability formula used for Italian, amply studied in [

G = 89 − 10 × c / p + 300 × f / p (17a)

In Equation (17a), p is the total number of words in the text considered, c is the number of characters contained in the p words, and f is the number of sentences contained in the p words.

Notice that Equation (17a), as all readability formulae found in the literature, does not contain any reference to interpunctions, therefore it does not consider the very important parameter linked to the short-term memory capacity, namely the word interval I_{P}.

G can be interpreted as a readability index by considering the number of years of school attended in Italy’s school system, as shown in

G = 89 − 10 × C_{P} + 300/P_{F} (17b)

G = 89 − G_{C} + G_{F} (17c)

As shown in [ ], C_{P} is typical of a language and, if not scaled, would bias G without really quantifying the change in reading difficulty of readers, who are accustomed to reading words that are, on average, shorter or longer in their language than those found in Italian. In other words, this scaling avoids changing G for the sole reason that a language has, on average, words shorter or longer than Italian.

On the other hand, we maintain the constant 300 because P_{F} depends significantly on the reader's reading ability and short-term memory capacity [

Therefore, the readability formula of a text written in a language with average 〈 C P 〉 characters per word is given by:

G = 89 − 10 × k × C_{P} + 300/P_{F} (18a)

with

k = 〈C_{P,ITA}〉/〈C_{P}〉 (18b)

By using Equation (18), we force the average value of G_{C} to equal that found in Italian, namely G_{C} = 10 × 4.48. For example, for a language with 〈C_{P}〉 = 4.86, C_{P} is multiplied by 10 × 4.48/4.86 = 9.22 instead of 10; for Finnish (longer words, 〈C_{P}〉 = 6.22), C_{P} is multiplied by 10 × 4.48/6.22 = 7.20; and for Haitian (shorter words, 〈C_{P}〉 = 3.37), by 10 × 4.48/3.37 = 13.29.
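A sketch of Equation (18); the constant 4.48 is the Italian reference 〈C_{P}〉 quoted above, and the per-language averages (4.86, 6.22, 3.37) are those used in the examples:

```python
C_P_ITA = 4.48  # Italian reference: average characters per word

def G(C_P, P_F, C_P_avg):
    # Equation (18): G = 89 - 10*k*C_P + 300/P_F, with the language
    # scaling k = <C_P,ITA>/<C_P> of Equation (18b).
    k = C_P_ITA / C_P_avg
    return 89 - 10 * k * C_P + 300 / P_F

# The scaling factor 10*k reproduces the multipliers quoted in the text:
print(f"{10 * C_P_ITA / 4.86:.2f}")  # 9.22
print(f"{10 * C_P_ITA / 6.22:.2f}")  # 7.20 (Finnish, longer words)
print(f"{10 * C_P_ITA / 3.37:.2f}")  # 13.29 (Haitian, shorter words)
```

For an Italian text (k = 1) the formula reduces to Equation (17b).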

The scatterplots of G_{C} and G_{F} versus G, for Greek, Latin and all languages (with some further examples shown in Appendix E), show that G_{F} largely determines G, compared to G_{C}. For the regression line relating G_{F} to G, r^{2} = 0.518 is the fraction of the variance of G_{F} explained by the regression; the remaining fraction 1 − 0.518 = 0.482 is due to the values scattered around the line. On the contrary, the regression line G_{C} = −0.187 G + 56.6 shows that G_{C} is much more weakly correlated with G than G_{F} is.

G_{F} significantly distinguishes the translations.

In conclusion, Equation (18) can be useful for comparing the readability of texts (not necessarily translations) written in different languages, because it provides a “common ground” for interpreting them.

If we consider different translations of the NT within the same language, do the statistics of the linguistic parameters change? In other words, are different versions of the NT in the same language very similar, or do they differ from each other, perhaps as much as NT versions belonging to different languages do? Indeed, for some languages there is a huge number of distinct translations: we have counted at least 60 English and 20 Spanish versions^{39}, which means that at least 60 different audiences have been considered in the English case and 20 in the Spanish case, which is really remarkable.

In this section, just for a very preliminary investigation, we report the average values of the most important linguistic parameters concerning 6 languages and 18 distinct versions, 3 per language, of Matthew’s gospel, namely English, German, Polish, Russian, Spanish and Swedish,


^{39}https://classic.biblegateway.com/versions/

The spread of the values within the same language can be a sizeable fraction of the overall range calculable from

^{40}https://classic.biblegateway.com/versions/Contemporary-English-Version-CEV-Bible/#booklist

^{41}https://classic.biblegateway.com/versions/New-King-James-Version-NKJV-Bible/#booklist

^{42}https://classic.biblegateway.com/versions/Neue-Genfer-%C3%9Cbersetzung-NGU/#booklist

^{43}https://classic.biblegateway.com/versions/Luther-Bibel-1545-LUTH1545/#booklist

^{44}https://classic.biblegateway.com/versions/S%C5%82owo-%C5%BBycia-SZ/#booklist

^{45}https://classic.biblegateway.com/versions/Updated-Gda%C5%84sk-Bible-UBG/#booklist

^{46}https://classic.biblegateway.com/versions/-CARS/#booklist

^{47}https://classic.biblegateway.com/versions/Russian-Bible-Easy-to-Read-Version-ERV-RU/#booklist

^{48}https://classic.biblegateway.com/versions/Nueva-Version-Internacional-Castilian-Biblia-CST/#booklist

^{49}https://classic.biblegateway.com/versions/Traducci%C3%B3n-en-lenguaje-actual-TLA-Biblia/#booklist

^{50}https://classic.biblegateway.com/versions/Svenska-Folkbibeln-2015-SFB15-Bible/#booklist

^{51}https://classic.biblegateway.com/versions/Svenska-1917-SV1917/#booklist

Version | W | S | P_{F} | I_{P} | M_{F} | G |
---|---|---|---|---|---|---|
Greek | 18,121 | 914 | 20.66 | 7.23 | 2.86 | 59.1 |
English | 22,000 | 1247 | 17.88 | 6.91 | 2.61 | 61.8 |
CEV^{40} | 23,444 | 1728 | 13.64 | 8.28 | 1.67 | 67.8 |
St James^{41} | 23,397 | 1040 | 23.51 | 5.91 | 3.98 | 57.2 |
Range (%) | 8.0 | 75.3 | 47.8 | 32.8 | 80.8 | 17.9 |
German | 21,424 | 1324 | 16.69 | 5.80 | 2.90 | 61.4 |
NGU-DE^{42} | 23,122 | 1534 | 15.28 | 5.85 | 2.62 | 62.0 |
Luther^{43} | 21,998 | 1211 | 18.71 | 5.93 | 3.17 | 60.2 |
Range (%) | 9.4 | 23.0 | 16.6 | 1.8 | 19.2 | 3.0 |
Polish | 17,650 | 1563 | 11.61 | 4.54 | 2.56 | 59.3 |
Slowo^{44} | 17,211 | 1677 | 10.55 | 4.83 | 2.19 | 61.0 |
UBG^{45} | 17,651 | 1299 | 13.89 | 4.58 | 3.05 | 55.7 |
Range (%) | 2.4 | 41.4 | 16.2 | 4.0 | 30.1 | 9.0 |
Russian | 16,786 | 956 | 18.33 | 4.12 | 4.46 | 58.9 |
CARS^{46} | 18,243 | 1359 | 13.65 | 4.50 | 3.04 | 63.6 |
ERV-RU^{47} | 18,395 | 1353 | 13.90 | 4.65 | 3.00 | 63.5 |
Range (%) | 8.9 | 44.1 | 22.7 | 7.3 | 51.0 | 8.0 |
Spanish | 21,217 | 1232 | 18.22 | 6.12 | 2.99 | 60.5 |
CST^{48} | 21,318 | 1392 | 16.07 | 6.41 | 2.52 | 62.6 |
TLA^{49} | 25,367 | 1630 | 15.47 | 7.00 | 2.22 | 62.5 |
Range (%) | 22.9 | 43.5 | 13.3 | 12.2 | 26.9 | 3.6 |
Swedish | 21,552 | 1445 | 15.10 | 7.50 | 2.02 | 63.1 |
SFB15^{50} | 20,676 | 1409 | 15.10 | 7.56 | 2.00 | 64.3 |
SV1917^{51} | 22,503 | 1182 | 19.72 | 6.59 | 3.00 | 57.9 |
Range (%) | 5.2 | 28.8 | 22.4 | 13.4 | 35.0 | 10.8 |

In conclusion, it is clear that each different NT translation within the same language addresses a different audience, as can be noticed from the range of the linguistic parameters; but, more interestingly, a translation in one language can be confused, mathematically, with a translation in another language. In other words, this preliminary sampling seems to confirm that language does not play the only role in translation: this role is shared mainly with the reader's reading ability (i.e., P_{F}, G) and short-term memory (I_{P}).

Another question arises: are the above results applicable only to NT translations, or can they also be applied to translations of literary texts, such as novels? In this section we show, preliminarily with just one example, that novels tend to show similar statistics, but with more constraints on the translations than those found in the NT translations.

We have performed the following exercise: we have studied the translations of Treasure Island (by R. L. Stevenson) from the original English text into Italian, French and German, considering each chapter as a text unit (34 chapters).

The comparison with the NT translations must start, of course, from the English version of the NT and its translations. Only after this study can we consider Treasure Island as input text and calculate the same statistics. Therefore, we take the English NT as the reference (input) language and Italian, French and German as output languages, as if these NT versions had been obtained by translating the English text, not the original Greek text. This hypothesis assumes, of course, that if the Italian, French and German translators had started from the English version of the NT, they would have produced the same texts they translated from Greek. This is reasonable, although not directly verifiable; we show below that the assumption can be justified.

Let us examine the single channels. In the words channel n_{W} we notice that the slope m and correlation coefficient r of the three languages are about the same in both cases (

Language | Words | m | r | Sentences | m | r |
---|---|---|---|---|---|---|
English | 68,033; 2001.0 (302.3) | 1 | 1 | 3824; 112.5 (31.4) | 1 | 1 |
Italian | 64,603; 1900.1 (294.6) | 0.950 | 0.985 | 3805; 111.9 (30.2) | 1.181 | 0.904 |
French | 68,818; 2024.1 (334.6) | 1.013 | 0.982 | 4054; 119.2 (30.8) | 1.253 | 0.874 |
German | 72,119; 2121.1 (332.6) | 1.060 | 0.970 | 4111; 120.9 (31.5) | 0.889 | 0.833 |

Language | P_{F} | m | r | I_{P} | m | r | M_{F} | m | r |
---|---|---|---|---|---|---|---|---|---|
English | 18.9 (9.8) | 1 | 1 | 6.05 (1.86) | 1 | 1 | 3.09 (0.77) | 1 | 1 |
Italian | 17.9 (8.4) | 0.95 | 0.907 | 6.52 (1.68) | 1.024 | 0.900 | 2.72 (0.73) | 0.796 | 0.725 |
French | 17.9 (8.4) | 1.013 | 0.882 | 6.11 (1.62) | 0.959 | 0.927 | 2.88 (0.66) | 0.842 | 0.665 |
German | 18.3 (7.6) | 1.060 | 0.643 | 5.96 (1.53) | 1.025 | 0.861 | 3.05 (0.75) | 1.107 | 0.522 |

Language | Words | m (n_{W}) | r (n_{W}) | C (n_{W}) | C/C_{max} (n_{W}) | Sentences | m (n_{S}) | r (n_{S}) | C (n_{S}) | C/C_{max} (n_{S}) |
---|---|---|---|---|---|---|---|---|---|---|
English | 122,641 | 1 | 1 | n.a. | n.a. | 6590 | 1 | 1 | n.a. | n.a. |
Italian | 112,943 | 0.918 | 0.993 | 2.852 | 0.923 | 6396 | 0.963 | 0.951 | 1.728 | 0.982 |
French | 133,050 | 1.077 | 0.984 | 2.270 | 0.904 | 7258 | 1.076 | 0.945 | 1.497 | 0.888 |
German | 117,269 | 0.952 | 0.979 | 2.330 | 1.000 | 7069 | 1.064 | 0.952 | 1.607 | 0.907 |

Language | m (P_{F}) | r (P_{F}) | C (P_{F}) | C/C_{max} (P_{F}) | m (I_{P}) | r (I_{P}) | C (I_{P}) | C/C_{max} (I_{P}) |
---|---|---|---|---|---|---|---|---|
Italian | 0.757 | 0.602 | 0.478 | 0.703 | 0.846 | 0.536 | 0.318 | 0.503 |
French | 0.768 | 0.619 | 0.500 | 0.719 | 0.999 | 0.676 | 0.441 | 0.585 |
German | 0.713 | 0.721 | 0.736 | 0.906 | 0.779 | 0.674 | 0.597 | 0.795 |

Language | m (n_{W}) | r (n_{W}) | C (n_{W}) | C/C_{max} (n_{W}) | m (n_{S}) | r (n_{S}) | C (n_{S}) | C/C_{max} (n_{S}) |
---|---|---|---|---|---|---|---|---|
Italian | 0.950 | 0.985 | 2.542 | 0.995 | 1.181 | 0.904 | 0.982 | 0.729 |
French | 1.013 | 0.982 | 2.375 | 0.978 | 1.253 | 0.874 | 0.749 | 0.627 |
German | 1.060 | 0.970 | 1.930 | 0.927 | 0.889 | 0.833 | 0.959 | 0.916 |

Language | m (P_{F}) | r (P_{F}) | C (P_{F}) | C/C_{max} (P_{F}) | m (I_{P}) | r (I_{P}) | C (I_{P}) | C/C_{max} (I_{P}) |
---|---|---|---|---|---|---|---|---|
Italian | 0.950 | 0.907 | 1.301 | 0.953 | 1.024 | 0.899 | 1.165 | 0.884 |
French | 1.013 | 0.882 | 1.070 | 0.870 | 0.959 | 0.927 | 1.460 | 0.967 |
German | 1.060 | 0.642 | 0.349 | 0.487 | 1.025 | 0.861 | 0.946 | 0.829 |

In the sentences channel n_{S}, on the contrary, m and r of the three languages are significantly different in the two cases. This is, of course, confirmed by the different capacities. This trend is further enhanced in the P_{F} and I_{P} channels: as for P_{F} and I_{P} (or M_{F}), each translation has quite different ways of using interpunctions for its intended readers, therefore matching more the readers' reading ability and short-term memory capacity.

Finally, it is very interesting to notice that, in the n_{W} and n_{S} channels, as well as in the P_{F} and I_{P} channels, the Treasure Island translations are more accurate than the NT translations, very likely because all dialogues must be strictly respected in any translation.

In conclusion, the statistics of words and sentences of a novel seem to be similar to those found in the NT translations. For example, the ranking of the number of sentences, from minimum to maximum, is the same in both the NT and the Treasure Island translations: Italian, English, French, German. It is almost the same for words, namely Italian, German, English, French for the NT translations and Italian, English, German, French for the Treasure Island translations. The translation of a novel seems to be more respectful of the original text than the NT translations as regards P_{F} and I_{P}, mainly because the translators must consider the presence of dialogues, whose fraction of the total text can, however, vary largely among novels, according to the author's style. Because these results refer to just one particular case, they should be further assessed with other literary (novel) translations, a study well beyond the aim of this paper.

It is possible to extend the statistical theory outlined in the previous sections so as to arrive at a general theory of translation applicable to any alphabetical language. By knowing the statistics of the various linguistic variables studied in the previous sections (obtained in the translation channel from Greek to other languages), it is possible, as we show below, to estimate the statistics obtainable in the translation channel from any language to any other language considered here.

The theory can also be applied to channels of texts belonging to the same language (not shown for brevity): for example, the channel that transforms words into sentences in a text can be compared to the channel that transforms words into sentences in a different text, both written in the same language. This comparison can be useful to study how texts of the same author may have changed over time, or to compare texts of different authors.

Consider the flow chart of the direct channels, from any language Y_{k} (k = 1:36, Greek included) to any other language Y_{j} (channel Y_{k} → Y_{j}; j = 1:36), and the flow chart of the reverse channels, from any language Y_{j} back to the same language Y_{k} (channels Y_{j} → Y_{k}, k = 1:36, Greek included).

We first calculate the noise-to-signal power ratio obtainable in the general theory from the data reported in

Let us consider two languages Y_{k} and Y_{j}, and let us refer to Greek explicitly as language X. With reference to the ideal channel whose output is X (self-translation), we have found that the same variable of languages k and j is related by a linear relationship to the corresponding Greek variable x:

y_{k} = m_{k} x + n_{k} (19a)

y_{j} = m_{j} x + n_{j} (19b)

In Equation (19), n_{k} and n_{j} are the noise sources added to the regression lines y = m x. The slope m is the source of the regression noise (because |m| ≠ 1), and the correlation coefficient r is the source of the correlation noise (because |r| < 1), as discussed in Section 4. For example, in the words channel between Greek and English, m = 1.225 and r = 0.986.

Let us refer to the 36 possible translations from language k = 1:37 (including Greek) to language j. In other words, language k now plays the role played before by Greek. By eliminating x, i.e. Greek, from Equation (19), we get the linear relationship between the input language k and the output language j:

y_{j} = (m_{j}/m_{k}) y_{k} − (m_{j}/m_{k}) n_{k} + n_{j} (20)

Compared to the reference language y_{k}, the slope is given by:

m_{kj} = m_{j}/m_{k} (21)

Therefore, the regression noise-to-signal power ratio, R_{m}, of the channel is readily found, according to Equation (3), as:

R_{m} = (m_{kj} − 1)^{2} (22)

Notice that R_{m} depends only on the known slopes of the translations from Greek.

Let us calculate the correlation noise-to-signal power ratio, R_{r}. To apply Equation (6), we must insert the unknown correlation coefficient r_{kj} = r_{jk} = r between y_{j} and y_{k}, due, of course, to the two noise sources in Equation (20). We can calculate its value from the correlation coefficients r_{k} and r_{j} of the translations from Greek. The total noise n_{j,tot} relating the output variable y_{j} to the input variable y_{k} is given by:

n_{j,tot} = −m_{kj} n_{k} + n_{j} (23)

As we can see from Equation (23), the two noise sources are correlated, with unknown correlation coefficient r. Let s_{nk}^{2} and s_{nj}^{2} be the single noise powers; then the total noise power s_{n,j,tot}^{2} due to n_{j,tot} is given by ( [

s_{n,j,tot}^{2} = m_{kj}^{2} s_{nk}^{2} + s_{nj}^{2} + 2 r m_{kj} s_{nk} s_{nj} (24)

Equation (24) has a geometric representation [ ] in terms of the vectors s_{nk} and s_{nj}, which form the angle θ_{kj} = arccos(r_{kj}) between them. By applying this representation also to the vectors s_{nk} and s_{x} (Greek), forming the angle θ_{k} = arccos(r_{k}), and to the vectors s_{nj} and s_{x}, forming the angle θ_{j} = arccos(r_{j}), the angle θ_{kj} is given by θ_{kj} = |θ_{j} − θ_{k}|; therefore r is given by:

r = cos|arccos(r_{j}) − arccos(r_{k})| (25)

Now, by Equation (6), the correlation noise-to-signal power ratio in the translation channel from language k to language j is given by:

R_{r} = [(1 − r^{2})/r^{2}] m_{kj}^{2} (26)

In conclusion, the total noise-to-signal power ratio in the translation channel from language k to language j, for a given stochastic variable, is given by:

R = (m_{kj} − 1)^{2} + [(1 − r^{2})/r^{2}] m_{kj}^{2} (27)
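The chain of Equations (21), (25), (27) and (12) can be sketched as follows; the input values (m_{k}, r_{k}) and (m_{j}, r_{j}) below are hypothetical slopes and correlation coefficients measured against Greek, of the order of those reported in the tables, used purely for illustration:

```python
import math

def channel_k_to_j(m_k, r_k, m_j, r_j):
    # General-theory pipeline: from the Greek-referred parameters of
    # languages k and j to the k -> j channel characteristics.
    m_kj = m_j / m_k                                          # Equation (21)
    r = math.cos(abs(math.acos(r_j) - math.acos(r_k)))        # Equation (25)
    R = (m_kj - 1) ** 2 + (1 - r ** 2) / r ** 2 * m_kj ** 2   # Equation (27)
    C = 0.5 * math.log2(1 + 1 / R)                            # Equation (12)
    return m_kj, r, R, C

# Hypothetical sample values (not tied to a specific language pair)
m_kj, r, R, C = channel_k_to_j(m_k=1.058, r_k=0.889, m_j=1.008, r_j=0.848)
print(m_kj, r, R, C)
```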

Figures 21-23 show the geometrical representation of R_{m} and R_{r} in the first Cartesian quadrant, as discussed in Section 4, for all linguistic variables, together with the regression lines from Greek to the other languages. Notice that the words n_{W}, sentences n_{S} and interpunctions n_{I} channels are mostly dominated by R_{m}, because for most languages X > Y, i.e. R_{m} > R_{r}. This result underlines, again, the greater freedom used in these channels in sizing the number of words, sentences and interpunctions, whose average values may vary substantially. For the deep-language variables P_{F}, I_{P} (with some exceptions) and M_{F}, and for the readability index G, we mostly observe X < Y, i.e. R_{m} < R_{r}.

From Figures 21-23 we can calculate the direct and reverse channel capacities, namely C_{kj} (direct channel) and C_{jk} (reverse channel), compared here for some languages in the words channel n_{W}.

These scatterplots show that direct and reverse channels are not very different. Although C_{kj} ≠ C_{jk}, as we establish in the next subsection, the two capacities are very similar for all variables and languages, regardless of their absolute value. In other words, a common underlying structure emerges from the channel capacities, which seems to govern the textual/verbal communication channels defined here.

In the next subsection we show that C k j ≠ C j k .

Are the direct and reverse channels of a couple of languages, e.g. the translations from Greek to English and from English to Greek, symmetric? We can answer this question by considering the channel capacity.

The specific question becomes now: Is the capacity C_{kj} (bits per symbol) of the (direct) channel from language k to language j, equal to the capacity C_{jk} of the (reverse) channel from language j to language k? In other words, can the two languages be exchanged in the input-output relationship without changing the statistical characteristics of the translation channel? According to communication theory [

We now establish that any couple of direct and reverse channels is not symmetric, unless m_{k} = m_{j} and r_{k} = r_{j}, a case never found. The reason for this asymmetry is that the noise added to the ideal (self-translation) channel to obtain the text in another language is statistically always different in the two directions.

According to Equations (12) and (27), and recalling that r_{ij} = r_{ji} = r, the two channel capacities are equal if:

(m_{j}/m_{k} − 1)^{2} + [(1 − r^{2})/r^{2}](m_{j}/m_{k})^{2} = (m_{k}/m_{j} − 1)^{2} + [(1 − r^{2})/r^{2}](m_{k}/m_{j})^{2} (28)

Let x = m_{j}/m_{k}. After standard algebraic passages, we get the following solution for the unknown correlation coefficient:

r = ∓√[(x^{2} + 1)/(2x)] (29)

To yield real values, the radicand in Equation (29) must be positive and, to yield a correlation coefficient, it must not exceed 1; therefore we get the range:

0 ≤ (x^{2} + 1)/(2x) ≤ 1 (30)

The lower limit in (30) is always satisfied because x^{2} + 1 > 0 and x > 0; the upper limit gives:

$$x^{2}-2x+1=(x-1)^{2}\le 0\tag{31}$$

Inequality (31) is never satisfied unless x = 1, i.e. only if m_{j} = m_{k}, in which case Equation (29) gives r = ±1. In other words, in translation channels C_{jk} ≠ C_{kj}; only in the ideal channel (self-translation) is C_{jk} = C_{kj} = ∞. In the next subsection we assess how large the capacity difference is, in other words how asymmetric direct and reverse channels are.
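The impossibility just derived can be checked numerically. The sketch below is not code from the paper: it assumes the per-channel noise-to-signal form that underlies Equation (28), NSR = (m − 1)² + ((1 − r²)/r²)·m², with x = m_j/m_k playing the role of the slope in the direct channel and 1/x in the reverse one.

```python
# Numerical check of the asymmetry result (a sketch, not code from the paper).
# Assumption: each channel's noise-to-signal power ratio has the form that
# underlies Equation (28): NSR = (m - 1)**2 + ((1 - r**2)/r**2) * m**2.

def nsr(m: float, r: float) -> float:
    """Noise-to-signal power ratio for regression slope m and correlation r."""
    return (m - 1.0) ** 2 + (1.0 - r * r) / (r * r) * m * m

x = 1.3  # any slope ratio m_j/m_k different from 1

# For every legitimate correlation coefficient 0 < r < 1 the two channels
# are not equally noisy, hence their capacities differ:
assert all(nsr(x, r) != nsr(1.0 / x, r) for r in (0.5, 0.9, 0.99))

# The value of r**2 that would equalize them, Equation (29) squared:
r2 = (x ** 2 + 1.0) / (2.0 * x)
print(r2)  # exceeds 1 for any x != 1, so no admissible r exists

# Sanity check: the (inadmissible) value does equalize the two NSRs.
assert abs(nsr(x, r2 ** 0.5) - nsr(1.0 / x, r2 ** 0.5)) < 1e-9
```

Any x ≠ 1 gives r² > 1, confirming that the symmetry condition cannot be met by a real correlation coefficient.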

Figures 26-28 show the main statistics of ΔC_{kj}, for all pairs of direct and reverse channels (for the same linguistic variable) and for each language, by drawing, as a function of the mean 〈ΔC_{kj}〉, the standard deviation σ_{ΔC}, the root-mean-square (RMS) value (bits per symbol) and its relative (normalized) value RMS (%), the latter obtained by dividing the RMS of ΔC_{kj} by the average direct channel capacity 〈C_{kj}〉.

Several interesting observations can be made. First, we notice that 〈ΔC_{kj}〉, σ_{ΔC} and RMS vary over about the same range. The average value, for example, is approximately always in the range −0.4 ≲ 〈ΔC_{kj}〉 ≲ +0.4 (bits per symbol), regardless of the variable. Only Greek is clearly distinct from the other languages, with larger values. The standard deviation is even more stable, with σ_{ΔC} ≈ 0.2 (bits per symbol) in most cases. Only the RMS shows larger variations, between 0.2 and 0.6 (bits per symbol). As already noticed, Latin and the modern languages are closer to each other than to Greek.

On the contrary, the variations of the normalized RMS (%) are significantly different. In the words channel the RMS varies between 10% and 30%, and similarly in the sentences channel (10% to 40%) and the interpunctions channel (10% to 20%); in the deep-language channels P_{F}, I_{P} and M_{F}, on the contrary, the RMS varies over a much larger range, up to 300%.

We can rank the channels according to the normalized RMS (%).

Channel | C_{ave} = 〈(C_{kj} + C_{jk})〉/2 | s_{C} | 100 × s_{C}/C_{ave}
---|---|---|---
Words per chapter n_{W} | 2.67 | 0.27 | 10.1
Sentences per chapter n_{S} | 2.62 | 0.33 | 12.6
Interpunctions per chapter n_{I} | 3.02 | 0.26 | 8.6
Words per sentence P_{F} | 1.79 | 0.39 | 21.8
Words per interpunction (word interval) I_{P} | 2.09 | 0.35 | 16.7
Interpunctions (word intervals) per sentence M_{F} | 1.59 | 0.43 | 27.0
Readability index per chapter G | 2.42 | 0.23 | 9.5
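The normalized-spread column, and the ranking it implies, can be reproduced directly from the tabulated C_{ave} and s_{C} values; the short script below is a small illustrative sketch (channel names abbreviated for readability):

```python
# Reproduce the last column of the table above and rank the channels by
# the normalized spread 100 * s_C / C_ave (values copied from the table).

channels = {  # name: (C_ave, s_C), both in bits per symbol
    "Words per chapter n_W": (2.67, 0.27),
    "Sentences per chapter n_S": (2.62, 0.33),
    "Interpunctions per chapter n_I": (3.02, 0.26),
    "Words per sentence P_F": (1.79, 0.39),
    "Word interval I_P": (2.09, 0.35),
    "Word intervals per sentence M_F": (1.59, 0.43),
    "Readability index per chapter G": (2.42, 0.23),
}

normalized = {name: 100.0 * s_c / c_ave for name, (c_ave, s_c) in channels.items()}

# Channels ranked from the most to the least consistent across languages:
for name, pct in sorted(normalized.items(), key=lambda kv: kv[1]):
    print(f"{name:35s} {pct:4.1f} %")
```

The interpunctions channel comes out as the most stable (8.6%) and the M_{F} channel as the least stable (27.0%), matching the table.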

Channel | Interpunctions | G | Words | Sentences | I_{P} | P_{F} | M_{F}
---|---|---|---|---|---|---|---
RMS (%) | 13.7 | 15.9 | 16.7 | 22.4 | 36.0 | 59.9 | 73.2

We have proposed a unifying statistical theory of translation, based on communication theory, which involves stochastic linguistic variables, some of which have not previously been considered by scholars. Its main mathematical characteristics have emerged by studying the translation of most NT books.

When a text written in one language is translated into another, all linguistic variables change numerically. To study these apparently chaotic data we have characterized any translation as a complex communication channel affected by “noise”, studied according to Communication Theory, applied for the first time to this channel. The new theory deals with aspects of languages more complex than those currently considered in machine translations. The input language is the “signal”; the output language is a “replica” of the input language, largely perturbed by noise. For the output language, this noise is indispensable for conveying the meaning of the input language to its readers. To study these channels, we have defined a suitable noise-to-signal power ratio and applied a geometrical representation.

All channels studied are differently affected by translation noise. The most accurate channel is the word channel n_{W}, a finding that seems reasonable: humans seem to express a given meaning with a number of words, i.e. finite strings of abstract signs (characters), which cannot vary much even when languages do not share a common ancestor. On the contrary, the number of sentences, and especially their length in words, i.e. P_{F}, are treated more freely by translators. Since P_{F} strongly affects readability indices, this variable tends to be matched to the intended readers and their specific reading ability.

Independently of the different parallel channels (one for each variable), the correlation noise (due to a correlation coefficient r < 1) is mostly larger than the regression noise (due to a regression-line slope m ≠ 1), indicating that every translation tries as much as possible to be unbiased but cannot avoid being decorrelated, with correlation coefficients that approximately decrease from words, to sentences, to interpunctions, down to the deep-language variables P_{F}, I_{P}, M_{F} and C_{P}.

Different translations of the NT within the same language can, mathematically, be quite different; they can even seem to belong to different languages. In other words, in language translation the differences are mainly due to specific linguistic variables, not to the particular language. Clearly, the translations are matched to different audiences, an aspect not explicitly considered in machine translations.

Besides the noise-to-signal power ratio, communication channels can also be characterized by the channel capacity (bits per symbol, suitably defined). This parameter can be relatively large, very close to the maximum value obtainable, for the n_{W}, n_{S} and n_{I} channels, and smaller for the P_{F}, I_{P} and M_{F} channels. We have found that the NT translations are similar to translations of literary texts, as shown for the novel Treasure Island translated from English into Italian, French and German for the n_{W}, n_{S} and n_{I} channels. On the contrary, the translation of novels seems to set more stringent constraints on translators for the P_{F} and I_{P} channels, because dialogues must be strictly maintained; this is a topic for further research.

The number of words per interpunction I_{P} varies in the same range as the short-term memory capacity. Drawn against the number of words per sentence P_{F}, I_{P} tends to saturate to a horizontal asymptote as P_{F} increases: even though sentences get longer, I_{P} cannot exceed roughly the upper limit of Miller's law, because of the constraints imposed by readers' short-term memory capacity.

We have defined a formula for the readability index of any alphabetical language, a calque of the readability formula used for Italian, both to provide such an index to languages that have none and to estimate, on common grounds, the readability of texts belonging to different languages/translations.

Finally, we have extended the statistical theory outlined above to a general theory of translation applicable to any alphabetical language, even to texts written in the same language. The general theory shows that direct and reverse channels are not symmetric.

In conclusion, a common underlying statistical structure governing human textual/verbal communication channels—not defeated by the mythical biblical Tower of Babel—seems to emerge from these findings. The main result is that the statistical and communication characteristics of a text and its translations into other languages seem to depend not only on the particular language (mainly through the number of words and sentences) but also on the particular translation, because the text is strongly shaped by the reading ability and short-term memory capacity of the intended readers, aspects not explicitly considered in machine translations. These conclusions seem to be everlasting because they apply also to ancient Roman and Greek readers. Future research should extend the general theory to non-alphabetical languages.

The author wishes to warmly thank all those scholars who, with continuous great care and dedication, keep online the texts of the Bible in many languages, for the benefit of everyone.

The author declares no conflicts of interest regarding the publication of this paper.

Matricciani, E. (2020) A Statistical Theory of Language Translation Based on Communication Theory. Open Journal of Statistics, 10, 936-997. https://doi.org/10.4236/ojs.2020.106055

Appendix A. List of mathematical symbols.

Appendix B. Entropy and human information-processing

The short-term memory capacity follows Miller’s 7 ± 2 law [

Let us consider the total number of words W of each translation and the entropy per letter F_{1} (bits per letter) estimated by Shannon [ ]. The values of F_{1} for the mentioned languages are reported in the following table.

The total number of information bits produced can then be estimated, according to Communication/Information Theory, by:

$$N_{bit}=W\times C_{P}\times F_{1}\tag{B.1}$$

Language | F_{1} (bits per letter) | n_{W} (words) | C_{P} (letters per word) | N_{bit} (bits) | N_{bit} − N_{bit,Greek} (bits)
---|---|---|---|---|---
Greek | 4.09 | 100,145 | 4.86 | 1,990,622 | 0
French | 3.98 | 133,050 | 4.20 | 2,224,064 | 233,442
English | 4.14 | 122,641 | 4.24 | 2,152,791 | 162,169
German | 4.08 | 117,269 | 4.68 | 2,239,181 | 248,559
Italian | 3.95 | 112,943 | 4.48 | 1,998,639 | 8,017
Russian | 4.36 | 92,736 | 4.67 | 1,888,216 | −102,406
Spanish | 4.00 | 118,744 | 4.30 | 2,042,397 | 51,775
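As a quick check, Equation (B.1) can be evaluated directly from the table values; the minimal sketch below reproduces the N_{bit} entries for the Greek and Italian rows.

```python
# A minimal sketch of Equation (B.1): N_bit = W * C_P * F_1,
# checked against the Greek and Italian rows of the table above.

def n_bit(words: int, letters_per_word: float, bits_per_letter: float) -> int:
    """Total information bits produced by a text of `words` words."""
    return round(words * letters_per_word * bits_per_letter)

greek = n_bit(100145, 4.86, 4.09)    # 1,990,622 bits
italian = n_bit(112943, 4.48, 3.95)  # 1,998,639 bits
print(greek, italian, italian - greek)  # difference: 8,017 bits
```

Despite its much smaller word count, Greek carries almost the same number of information bits as Italian, thanks to its higher C_{P}.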

According to words | According to bits
---|---
Russian | Russian
Greek | Greek
Italian | Italian
German | Spanish
Spanish | English
English | French
French | German

Appendix C. Scatterplots of average channel capacity for the indicated channels. Languages are distinguished according to the symbols listed below.

Appendix D. Scatterplots of I_{P} versus P_{F} for the indicated languages. The horizontal magenta lines are Miller's bounds 5 and 9.

Appendix E. Scatterplots of G_{C} (blue) and G_{F} (red) versus G for the indicated languages.

Appendix F. Scatterplots between direct channel capacity (from … to) and the reverse channel capacity (to … from) for all languages. Red circles are the average values of each translation. The red symbol is the overall average value.