Deep Language Statistics of Italian throughout Seven Centuries of Literature and Empirical Connections with Miller’s 7 ∓ 2 Law and Short-Term Memory

Statistics of languages are usually calculated by counting characters, words, sentences and word rankings. Some of these random variables are also the main “ingredients” of classical readability formulae. Revisiting the readability formula of Italian, known as GULPEASE, shows that of the two terms that determine the readability index G (the semantic index $G_C$, proportional to the number of characters per word, and the syntactic index $G_F$, proportional to the reciprocal of the number of words per sentence), $G_F$ is dominant because $G_C$ is, in practice, constant for any author throughout seven centuries of Italian Literature. Each author can modulate the length of sentences more freely than the length of words, and in different ways from author to author. For any author, any pair of text variables can be modelled by a linear relationship $y = mx$, but with a different slope m from author to author, except for the relationship between characters and words, which is the same for all. The most important relationship found in the paper is that between the short-term memory capacity, described by Miller’s “7 ∓ 2 law” (i.e., the number of “chunks” that an average person can hold in short-term memory ranges from 5 to 9), and the word interval, a new random variable defined as the average number of words between two successive punctuation marks. The word interval can be converted into a time interval through the average reading speed. The word interval spreads over the same range as Miller’s law, and the time interval spreads over the same range as short-term memory response times. Technical and scientific writings ask more of their readers because words are on average longer, the readability index G is lower, and word and time intervals are longer. Future work on ancient languages, such as classical Greek and Latin literature (or on the literatures of modern languages), could give us insight into the short-term memory required of their well-educated ancient readers.

lute values because readers are used to considering them when they apply readability formulae, but I also discuss differences, which the reader can appreciate from the numerical values reported in the tables or in the scatter plots below.
In spite of the long-lasting controversy on the development and use of classical readability formulae, researchers continue to develop methods that overcome their weaknesses by advancing natural language processing and other computerized language methods [15] [16], in order to capture more complex linguistic features [17]. Benjamin [15], however, predicts that these new developments will also be judged controversial, as happens in any research field when no general consensus is reached on a specific topic. In any case, the classical readability formulae have served their purpose in leveling typical books for schoolchildren and general audiences, as was done in Italy in the 1980s [18]. Now, new methods, developed from cognitive processing theories, should allow the analysis of more complex texts for specific targets such as adolescents, university students, and adults. Moreover, with machine-learning developments, non-traditional texts, like those found on many web sites, can be categorized for greater accessibility. Some of these advances even involve eye tracking during reading [16] [19]. For Italian, the work by Dell'Orletta and colleagues [17] aims at automatically assessing the readability of newspaper texts for the specific task of text simplification, not at analyzing and studying literary texts and their statistics, as I do in this paper.
A readability formula is, however, very attractive because it gives a quantitative and automatic judgement of how difficult or easy a text is to read. Every readability formula, however, gives only a partial measurement of reading difficulty because its result is mainly linked to the length of words and sentences. It gives no clues as to the correct use of words, the variety and richness of the literary expression, or its beauty or efficacy; it does not measure the quality and clearness of ideas or give information on the correct use of grammar; and it does not help in better structuring the outline of a text, for example a scientific paper. The comprehension of a text (not to be confused with its readability, defined by the mathematical formulae) is the result of many other factors, the most important being the reader's culture and reading habits. In spite of these limits, readability formulae are very useful if we apply them for specific purposes and assess their possible connections with the short-term memory of readers.
Compared to the more sophisticated methods mentioned above, the classical readability formulae have several advantages:
1) They give an index that any writer (or reader) can calculate directly and easily.
2) Their "ingredients" are understandable by anyone, because they are interwoven with a long-lasting writing and reading experience based on characters, words and sentences.
3) Characters, words, sentences and punctuation marks appear to be related to the capacity and response time of short-term memory, as shown in this paper.
4) They give an index based on the same variables, regardless of the text considered. They therefore give an objective measurement for comparing different texts or authors, without resorting to readers' physical actions or psychological behavior, which vary largely from one reader to another, and within a reader on different occasions, and may require ad hoc assessment methods.
5) A final objective readability formula, or more recent software-developed methods valid universally, are very unlikely to be found or accepted by everyone. Instead of absolute readability, readability differences can be more useful and meaningful. The classical readability formulae provide these differences easily and directly.
In this paper, for Italian, I show that a relationship between some text statistics and the reader's short-term memory capacity and response time seems to exist. I have found an empirical relationship between the readability formula most used for Italian and short-term memory capacity, by considering a very large sample of literary works of Italian Literature spanning seven centuries, most of them still read and studied in Italian high schools or researched in universities.
The contemporaneous reader of any of these works is supposed to be, of course, educated and able to read long texts with good attention. In other words, this audience is quite different from that considered in the studies and experiments reported above on new techniques (based on complex software) for assessing the readability of specific types of texts [17]. The subject of my study is therefore the ingredients of a classical readability formula, not the formula itself (even though I have found some interesting features and limits of it), and their empirical relationship with short-term memory. From my results, it might be possible to establish interesting links to other cognitive issues, as discussed in [20], a task beyond the scope of this paper and the author's expertise.
The most important relationship I have found is that between the short-term memory capacity, described by Miller's "7 ∓ 2 law" [21], and what I call the word interval, a new random variable defined as the average number of words between two successive punctuation marks. The word interval can be converted into a time interval through the average reading speed. The word interval is numerically spread over a range very similar to that found in Miller's law, and more recently by Jones and Macken [12], and the time interval is spread over a range very similar to that found in the studies on short-term memory response time [10] [22] [23]. The connection between the word interval (and time interval) and short-term memory appears, at least empirically, justified and natural. An analysis of the New Testament Greek originals [24], similar to that reported in this paper, shows very similar results, therefore evidencing some universal and long-lasting characteristics of western languages and their readers.
In conclusion, the aim of this paper is to research, with regard to the high Italian language, the following topics:
1) The impact of character and sentence indices on the readability index (all defined in Section 2).
2) The relationship of these indices with the newly defined "word interval" and "time interval".
3) The "distance", absolute and relative, of literary texts by defining meaningful vectors based on characters, words, sentences, punctuation marks.
4) The relationship between the word interval and Miller's law, and between the time interval and short-term memory response time.
After this Introduction, Section 2 revisits the classical readability formula of Italian; Section 3 shows interesting relationships between its constituents; Section 4 reports the statistical results for a large number of texts of the Italian Literature since the XIV century; Section 5 discusses the "distance" of literary texts; Section 6 introduces word and time intervals and their empirical relationships with short-term memory features; Section 7 discusses some different results concerning scientific and technical texts, and finally Section 8 draws some conclusions and suggests future work.

Revisiting the GULPEASE Readability Formula of Italian
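For Italian, the most used formula (calculated by WinWord, for example), known by the acronym GULPEASE [25], is given by:

$$G = 89 - 10\,C_P + \frac{300}{P_F} \qquad (1a)$$

where $C_P$ is the number of characters per word and $P_F$ is the number of words per sentence. With the semantic index $G_C = 10\,C_P$ (Equation (2a)) and the syntactic index $G_F = 300/P_F$ (Equation (2b)), Equation (1a) can be written as:

$$G = 89 - G_C + G_F \qquad (1b)$$

We first analyze Equations (1a) and (1b) by means of standard statistics of their addends because, like other readability formulae, they contain important characteristics of literary texts. For Italian Literature, which extends over the longest period of time compared to other modern western languages, some of these characteristics have been stable over centuries (namely $G_C$).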
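As a minimal sketch of how the index and its two addends can be computed from raw counts, the following Python function assumes the standard GULPEASE form, G = 89 − 10 × (characters/words) + 300 × (sentences/words). The function name and the dictionary keys are illustrative choices, not taken from the paper.

```python
def gulpease(n_chars: int, n_words: int, n_sentences: int) -> dict:
    """Return the readability index G and its addends from raw counts.

    Assumes G = 89 - G_C + G_F, with G_C = 10 * (characters per word)
    and G_F = 300 / (words per sentence).
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    c_p = n_chars / n_words        # characters per word, C_P
    p_f = n_words / n_sentences    # words per sentence, P_F
    g_c = 10.0 * c_p               # semantic index G_C
    g_f = 300.0 / p_f              # syntactic index G_F
    return {"C_P": c_p, "P_F": p_f, "G_C": g_c, "G_F": g_f,
            "G": 89.0 - g_c + g_f}


if __name__ == "__main__":
    # Hypothetical text block: 467 characters, 100 words, 5 sentences.
    print(gulpease(467, 100, 5))
```

For the hypothetical block above, the words are 4.67 characters long and the sentences 20 words long, so the semantic term is 46.7 and the syntactic term is 15.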
Equation (1) says that a text is easier to read if it contains short words and short sentences, a known result applicable to the readability formulae of any language. Now, the study of Equation (1), and in particular of how the two terms $G_C$ and $G_F$ affect the value of G, brings very interesting results, as we show next. In this paper I apply the above equations to classical literary works of a large number of Italian writers¹, from Giovanni Boccaccio (XIV century) to Italo Calvino (XX century), see Table 1, by examining some complete works, as they are available today in their best edition².
¹ Information about authors and their literary texts can be found in any history of Italian literature, or in dictionaries of Italian literature.
² The great majority of these texts are available in digital format at https://www.liberliber.it.
³ The standard deviation found in n text blocks

Relationships among GC, GF and G
The semantic index $G_C$, given by the number of characters per word multiplied by 10 (Equation (2a)), and the syntactic index $G_F$, given by the reciprocal of the number of words per sentence $P_F$ multiplied by 300 (Equation (2b)), affect the final value of G (Equation (1b)) very differently. From the results reported in Table 1, it is evident that $G_C$ changes much less than $G_F$, a feature highlighted in the scatter plot of Figure 2(a), which shows $G_C$ and $G_F$ versus G for each text block (1260 text blocks in total, with different numbers of words) found in the listed literary works.
The theoretical range of G can be calculated by considering the theoretical range of $G_F$. The maximum value of $G_F$ is found when $P_F$ is minimum, the latter given by 1 when [...] (Table 1).
The constancy of $G_C$ versus G indicates that, in Italian, the number of characters per word $C_P$ has been very stable over many centuries, while the linear proportionality between $G_F$ and G is directly linked to the author's style. Figure 2(b) shows the scatter plots of the average number of characters per word $C_P$ vs. G, and of $P_F$ vs. G. In other words, the variation of G in Equation (1) is practically due only to the syntactic index $G_F$, and therefore to the number of words per sentence.

The two lines drawn in Figure 2(a) are given by the average value of $G_C$ (Table 2),

$$G_C = 46.70 \qquad (3)$$

and by the regression line between $G_F$ and G, Equation (4). The correlation coefficient between $G_F$ and G, Equation (4), is 0.932. The slope is 0.912, giving practically a 45° line. By considering the coefficient of determination, $100 \times 0.932^2 = 86.9\%$ of the variance of the data is explained by (4). Figure 2(a) also shows the average values of selected works listed in Table 1, to locate them in this scatter plot.

Figure 2(b) shows, superposed on the scattered values of $P_F$, the theoretical relationship between the average value of $P_F$ and G, given, according to Equations (1a) and (3), by:

$$P_F = \frac{300}{G - 89 + 46.70} = \frac{300}{G - 42.30} \qquad (5)$$

The correlation between the experimental values of $P_F$ and those calculated from (5) is [...]. Therefore, Equation (1) can be written as:

$$G \approx 89 - 46.70 + G_F = 42.30 + \frac{300}{P_F} \qquad (6)$$

From these results, it is evident that each author has his own "dynamics", in the sense that each author modulates the length of sentences over a significantly wider range than he does or, I should say, than he could do with the length of words, and differently from other authors, as we can read in Table 2. Figure 3 and Figure 4 show for Boccaccio [...] [26].
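The two numerical steps in this passage can be checked with a short sketch. It assumes that the share of explained variance is 100 × r², and that the theoretical curve for the words per sentence follows from the GULPEASE formula with the semantic index fixed at its average value of 46.70. The function names are illustrative.

```python
def explained_variance_percent(r: float) -> float:
    """Percentage of variance explained by a linear fit with correlation r."""
    return 100.0 * r * r


def words_per_sentence_from_g(g: float, g_c: float = 46.70) -> float:
    """Words per sentence implied by G when G_C is held constant.

    Inverts G = 89 - G_C + 300 / P_F, which gives P_F = 300 / (G - 89 + G_C).
    """
    denominator = g - (89.0 - g_c)
    if denominator <= 0:
        raise ValueError("G must exceed 89 - G_C for a finite, positive P_F")
    return 300.0 / denominator


if __name__ == "__main__":
    print(round(explained_variance_percent(0.932), 1))
    print(words_per_sentence_from_g(57.3))
```

With the correlation coefficient 0.932 quoted in the text, the first function reproduces the 86.9% figure.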
By considering the above findings, we can state that $G_C$ is practically a constant, $G_C = 46.70$, and that G can be approximated by (6). Figure 5(a) compares the value $G_{estim}$ estimated in this way with the value G calculated with Equation (1a), and shows the regression line between the two data sets. The slope is 0.998, in practice 1 (45° line), and the correlation coefficient is 0.932⁴. Defining the error as $G_{estim} - G$, its average value is −0.1, therefore 0 for any practical purpose, and its standard deviation is 2.14. For a constant readability level G, the latter value translates into an error of at most 1 year in the estimate of the school years required, see Figure 1. Figure 5(b) shows that a normal (Gaussian) probability density function with zero average value and standard deviation 2.14 describes the error scattering very well. Now, according to (6) it is obvious that the constant value $G_{min}$ can be set to zero, therefore making:

$$G_s = G_F = \frac{300}{P_F} \qquad (7)$$

with the advantage that the scaled index $G_s$ starts at 0. Equation (7) is not meant to reduce any computational effort, as today Equation (1), like any other readability formula or other approach, can be calculated by means of dedicated software with no particular effort. Equation (7) is useful because it underlines the fact that authors of the Italian Literature modulate the length of sentences, each with a personal style, much more than the length of words, and that the length of sentences substantially determines reading difficulty (as any Italian student knows when reading Boccaccio's Decameron, or Collodi's Pinocchio!), so that we could use Figure 6 as a guide, instead of Figure 1.

Table 3 shows that, for any author, there is a large correlation, close to unity, between the number of characters and the number of words, as [...] (Table 3). The relationship between words and sentences behaves differently.
⁴ This value is the same as that of the pair (G, $G_F$) because $G_F$ is linearly related to G.

Characters, Words, Sentences, Punctuation Marks, Word and Time Intervals
For each author, a line $y = mx$ still describes, usually very well, the relationship between words and sentences (see Table 2 and Table 3), but with a different slope, as Figure 8 shows. The average number of words per sentence varies from 11.93 (Cassola) to 44.47 (Boccaccio), and these values strongly affect the sentence term $G_F$, which varies from 25.65 (Cassola) to 6.94 (Boccaccio). In Figure 8, we can notice that there is an angular range within which all authors fall, a range that has collapsed into a line in Figure 7 because of a very tight relationship between characters and words, equal for all authors. Moreover, notice that the value of p/f (words per sentence) calculated from the average $G_F$, i.e. $p/f = 300/G_F$, is always smaller than or at most equal⁵ to the average value of the ratio p/f (Table 2).
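The inequality stated above says that the harmonic mean of the words per sentence never exceeds their arithmetic mean. A minimal sketch follows, using made-up block values rather than data from the paper; the function name is illustrative.

```python
from statistics import mean


def pf_from_mean_gf(words_per_sentence):
    """Words per sentence recovered from the average of G_F = 300 / P_F.

    This equals the harmonic mean of the P_F values of the text blocks.
    """
    g_f_values = [300.0 / p_f for p_f in words_per_sentence]
    return 300.0 / mean(g_f_values)


if __name__ == "__main__":
    blocks = [10.0, 20.0, 40.0]      # hypothetical P_F of three text blocks
    print(pf_from_mean_gf(blocks))   # harmonic-type estimate
    print(mean(blocks))              # arithmetic mean of P_F
```

The two values coincide only when all text blocks have the same number of words per sentence.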
Having defined the total number of punctuation marks (the sum of commas, semicolons, colons, question marks, exclamation marks, ellipses and periods) contained in a text, Figure 9 shows the scatter plot between this value and the number of sentences for each text block. Once more, for any author the relationship is a line $y = mx$ with correlation coefficients close to 1 (Table 3), but with different slopes, the latter close to the average number of punctuation marks per sentence. For example, in Boccaccio, the average number of punctuation marks per sentence is $M_F = 5.69$ (Table 2), whereas the slope⁶ of the corresponding line is $m = 5.57$ (Table 3).

An interesting comparison among different authors and their literary works can be made by considering the number of words per punctuation mark, that is to say, the average number of words between two successive punctuation marks, a random variable that is the word interval $I_P$ mentioned before, defined by:

$$I_P = \frac{\text{number of words}}{\text{number of punctuation marks}}$$

The word interval $I_P$ is very robust against changing habits in the use of punctuation marks over the decades. Punctuation marks are used for two goals: 1) improving readability by making lexical and sentence constituents of texts more easily recognizable, and 2) introducing pauses [27]; the two goals can coincide [28] [29]. In the last decades, in Italian, there has been a reduced use of semicolons in favor of periods [30], but this change affects only the number of words per sentence, not $I_P$. The values of $I_P$ listed in Table 2 vary from 5.64 (Cassola) to 7.8 (Boccaccio); the two authors represent approximate bounds to the angular region. For any author, the linear model $y = mx$ is still valid, as the high correlation coefficients listed in Table 3 and Figure 10 show. The slopes of the lines are very close to the averages, namely 5.56 and 7.82 respectively, because the correlation coefficients⁷ are close to 1.
⁵ It can be proved, with the Cauchy-Schwarz inequality, that for positive x the average value of 1/x is always greater than or equal to the reciprocal of the average value of x.
⁶ The slope m = y/x has dimensions of words per punctuation mark, like the word interval $I_P$.
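The counts behind these indices can be sketched in a few lines of Python. The tokenization rules below are assumptions made for illustration, not the author's software: a word is a run of letters, an ellipsis counts as a single punctuation mark, and a sentence ends at a period, question mark, exclamation mark or ellipsis. The second function fits a line through the origin by least squares, one common way to obtain the slope m of $y = mx$.

```python
import re

# Assumed tokenization rules (illustrative only).
PUNCTUATION = re.compile(r"\.\.\.|…|[,;:?!.]")
SENTENCE_END = re.compile(r"\.\.\.|…|[?!.]")
WORD = re.compile(r"[^\W\d_]+")


def text_statistics(text: str) -> dict:
    """Return C_P, P_F, M_F and the word interval I_P of a text block."""
    words = WORD.findall(text)
    n_marks = len(PUNCTUATION.findall(text))
    n_sentences = len(SENTENCE_END.findall(text))
    if not words or n_marks == 0 or n_sentences == 0:
        raise ValueError("text needs words, punctuation and a sentence end")
    n_chars = sum(len(w) for w in words)
    return {
        "C_P": n_chars / len(words),      # characters per word
        "P_F": len(words) / n_sentences,  # words per sentence
        "M_F": n_marks / n_sentences,     # punctuation marks per sentence
        "I_P": len(words) / n_marks,      # words per punctuation mark
    }


def slope_through_origin(xs, ys) -> float:
    """Least-squares slope m of the line y = m x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


if __name__ == "__main__":
    print(text_statistics("Il gatto dorme, il cane corre; piove. Che tempo!"))
```

When the points lie close to a line through the origin, the slope returned by `slope_through_origin` is close to the average of the ratios y/x, which is the behaviour described in the text for correlation coefficients close to 1.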
Finally, Figure 11 shows the scatter plot between G, $G_C$, $G_F$ and $I_P$. We can notice that $G_F$ (and G) is significantly correlated with $I_P$ through an inverse proportionality. This result is very interesting because it links the readability of a text, the index G or $G_F$, to $I_P$, another distinctive characteristic of an author. Moreover, the word interval has other very interesting and intriguing relationships, as Section 5 shows.

Comparing Different Literary Texts: Distances
The large number of texts produced today in several forms, both in hard copy and in digital formats, such as books, journals, technical reports and others, has prompted several methods for fast automatic information retrieval and document classification, including authorship attribution. The approach is to represent documents with n-grams using a vector representation of particular text features [31]. In this model, the similarity between two documents is estimated using the cosine of the angle between the corresponding vectors. This approach depends mainly on the similarity of the vocabulary used in the texts, while the characters and syntax are ignored. A more complex approach represents textual data in more detail [31]. These new techniques, implemented with complex software, are useful when, together with other tasks, automatic authorship attribution and verification are required.

In the case of the literary texts considered in this paper, it is more interesting to compare the statistical characteristics of different authors, or of different texts of the same author, by using the data reported in Tables 1-3, instead of using the more complex methods reviewed by [32]. For this purpose, the most significant parameters are the four random variables defined before: $C_P$, $P_F$, $M_F$ and $I_P$, because they represent fundamental indices and are mostly uncorrelated, except the pair ($M_F$, $I_P$), as Table 4 shows.
The ratio between the ordinate and the abscissa gives the word interval.
⁷ The ratio between $P_F$ (column 3 of Table 2) and $M_F$ (column 4) is another estimate of the word interval $I_P$ (column 5). The value so calculated and that of column 5 almost coincide because the correlation coefficient is close to 1. In other words, the ratio of the averages (column 3 divided by column 4) is practically equal to the average value of the ratio (column 5).
These parameters are suitable to assess similarities and differences of texts much better, as I show next, than the cosine of the angle between any two vectors. Therefore, in this section, I define absolute and relative "distances" of texts by considering the following six vectors of components⁸ x and y: [...]

[...], the similarity would be zero, $S = 0$. According to this criterion, two collinear vectors of very different length (the magnitude of the vector) will be classified as identical because $S_k = 1$, a conclusion that cannot be accepted. This is a serious drawback of the cosine similarity. Figure 12 shows the scatter plot between the average value of S, calculated by considering all text blocks, and the readability index G. Any text block is also compared to the other text blocks of the same literary text (but not with itself). The choice of not excluding the other text blocks of the same literary text leads to simple and straightforward software code and does not affect the general conclusion drawn from the scatter plot shown in Figure 12: there is no correlation between S and G, therefore S does not meaningfully discriminate between any two texts when the angle formed by their vectors is close to zero.

Table 4. Linear correlation coefficients between the indicated pairs of random variables (1260 text blocks).

Now, a better choice for comparing literary texts is to consider the "distance" of any text block from the origin of the x and y axes⁹, given by the magnitude of the resulting vector R:

$$R = \sqrt{x^2 + y^2} \qquad (10)$$

With this vectorial representation, a text block ends up at a point of coordinates x and y in the first Cartesian quadrant, as Figure 13 shows. The end point of the vectors with components given by the average values of the literary texts (obtainable from Tables 1-3) is also shown. The efficacy of R can be appreciated in Figure 14, which shows the scatter plot between R and G, and between its angle $\varphi = \arctan(y/x)$, expressed in degrees, and G (lower panel). The black lines describe the relationships between them very well; they are given by Equations (11a) and (11b).
⁸ The choice of which parameter represents the component x or y is not important. Once the choice is made, the numerical results will depend on it, but not the relative comparisons and general conclusions.
The correlation coefficient between the measured values of R and those estimated through (11a) is 0.802; that between the measured values of φ and those estimated with (11b) is 0.867¹². In conclusion, the magnitude (distance) R and the angle φ of the vector R are very well correlated with the readability index G.
¹⁰ A compulsory reading in any Italian High School.
¹¹ Notice that distances are distorted, if measured on the graph of Figure 13, because the abscissa (x scale) is expanded compared to the ordinate (y scale).
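The drawback of the cosine similarity discussed in this section, and the alternative of describing a text by the magnitude and angle of its vector, can be illustrated with a small sketch. The two-component vectors used here are arbitrary examples, not the six vectors defined in the paper, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two 2-D vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))


def magnitude_and_angle(v):
    """Return (R, phi): vector length and its angle in degrees."""
    x, y = v
    return math.hypot(x, y), math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    short, long_ = (1.0, 2.0), (10.0, 20.0)
    # Collinear vectors: cosine similarity cannot tell them apart...
    print(cosine_similarity(short, long_))
    # ...while their magnitudes differ by a factor of ten.
    print(magnitude_and_angle(short), magnitude_and_angle(long_))
```

In the first quadrant, where all text blocks fall, `atan2(y, x)` coincides with arctan(y/x).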

Word Interval, Miller's 7 ∓ 2 Law and Short-Term Memory Capacity
The range of the word interval $I_P$, shown in Figure 11, is very similar to the range mentioned in Miller's law 7 ∓ 2, although the short-term memory capacity for data for which chunking is restricted is 4 ± 1 [33] [34] [35] [36] [37]. For words, namely data that can be restricted (i.e., "compressed") by chunking, it seems that the average value is not 7 but around 5 to 6 [21], almost the average value of the word interval, 6.56 (Table 2). Now, as the range from 5 to 9 in Miller's law corresponds to 95% of the occurrences [37], it is correct to compare [...] with confidence level in excess of 99.99% (chi-square test) [38]. The log-normal probability density is valid only for $I_P \geq 1$, $I_P = 1$ being the minimum theoretical value of this variable (a single sentence made of only 1 word).
The theoretical constants of Equation (12) [...]

These results may be explained, at least empirically, according to the way our mind is thought to memorize "chunks" of information in short-term memory. When we start reading a sentence, our mind tries to predict its full meaning from what has been read up to that point, as seems to follow from the experiments of Jarvella [40]. Only when a punctuation mark is found can our mind better understand the meaning of the text. The longer and more convoluted the sentence, the longer the ideas remain deferred until the mind can establish the meaning of the sentence from all its words. In this case, the text is less readable, a result quantitatively expressed by the empirical Equation (1a) for Italian. Figure 16 shows the scatter plot between $I_P$ and $P_F$ for all the text blocks, together with the non-linear regression line (best-fit line) that models, on average, $I_P$ versus $P_F$ for the Italian Literature, given by: [...]

The time axis drawn in Figure 11 is useful to convert $I_P$ into the time interval $I_T$. The results, relating $I_T$ and $I_P$ to fundamental and accessible characteristics of short-term memory, are very interesting and should be pursued further by experts. Moreover, the same studies can be done on ancient languages, such as Greek and Latin, to test the expected capacity and response time of the short-term memory of these ancient and well-educated readers; this has been partially done already for the Greek of the New Testament [24].
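The conversion of a word interval into a time interval through the reading speed is a simple division, sketched below. The reading speed is left as a required argument because no value is given in this section; the 180 words per minute used in the example is a hypothetical figure chosen only to make the arithmetic easy. The function names are illustrative.

```python
def time_interval_seconds(word_interval: float, words_per_minute: float) -> float:
    """Time (in seconds) needed to read one word interval.

    words_per_minute is an assumed average reading speed supplied by the caller.
    """
    if words_per_minute <= 0:
        raise ValueError("reading speed must be positive")
    return word_interval * 60.0 / words_per_minute


def within_miller_range(word_interval: float, low: float = 5.0, high: float = 9.0) -> bool:
    """True if the word interval falls inside the 5-to-9 range of Miller's law."""
    return low <= word_interval <= high


if __name__ == "__main__":
    # Extremes of the word interval quoted in the text (Cassola, Boccaccio).
    for i_p in (5.64, 7.8):
        print(i_p, within_miller_range(i_p), time_interval_seconds(i_p, 180.0))
```

Both extreme values of the word interval quoted in the text fall inside the 5-to-9 range.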

Technical and Scientific Writings
Technical and scientific writings (papers, essays, etc.) ask more of their readers.
A preliminary investigation of short scientific texts published in the Italian popular science magazines Le Scienze and Sapere (because today it is rare to find original scientific papers written in Italian), of a popular science book and of newspaper editorials gave the results listed in Table 5. In this analysis, mathematical expressions, tables and legends have not been considered. From Table 5 we can notice some clear differences from the results for novels: words are on average longer, the readability index G is lower, and the word interval is longer.
These results are not surprising, because technical and scientific writings use long technical words and deal with abstract meanings in articulated and elaborate sentences, which results in long sentences with series of subordinate clauses. Of course, the reader of these texts expects to find the technical and abstract terms of his field, or specialty, and would not understand the text if these elements were absent.

Conclusions and Future Developments
Statistics of languages have been calculated for several western languages, mostly by counting characters, words, sentences and word rankings. Some of these parameters are also the main "ingredients" of classical readability formulae.
Revisiting the readability formula of Italian, known by the acronym GULPEASE, shows that of the two terms that determine the readability index G (the semantic index $G_C$, proportional to the number of characters per word, and the syntactic index $G_F$, proportional to the reciprocal of the number of words per sentence), $G_F$ is dominant because $G_C$ is, in practice, constant for any author. From these results, it is evident that each author modulates the length of sentences more freely than he can modulate word length, and in different ways from author to author. For any author, any pair of text variables can be described by a linear relationship $y = mx$, but with a different slope m from author to author, except for the relationship between characters and words, which is the same for all authors.
The most important relationship I have found is that between the short-term memory capacity, described by Miller's "7 ∓ 2 law", and what I have termed the word interval, a new random variable defined as the average number of words between two successive punctuation marks. The word interval can be converted into a time interval through the average reading speed. The word interval is numerically spread over a range very similar to that found in Miller's law, and the time interval is spread over a range very similar to that found in the studies on short-term memory response time. The connection between the word interval (and time interval) and short-term memory appears, at least empirically, justified and natural.
For ancient languages that are no longer spoken by a people but are rich in literary texts that have founded Western civilization, such as Greek or Latin, nobody can make reliable experiments such as those reported in the references recalled above.
These ancient languages, however, have left us a huge library of literary and (a few) scientific texts. Besides the traditional count of characters, words and sentences, the study of their word interval statistics should give us a flavor of the short-term memory features of these ancient readers, and this can be done very easily, as I have done for Italian. A preliminary analysis of a large number of Greek and Latin literary texts shows results very similar to those reported in this paper, therefore evidencing some universal and long-lasting characteristics of western languages and their readers.
In conclusion, it seems that there is a possible direct and interesting connection between readability formulae and the reader's short-term memory capacity and response time. As short-term memory features can be related to other cognitive parameters [20], this relationship seems to be very useful. However, its relationship with Miller's law should be further investigated because the word interval is another parameter that can be used to design a text, together with readability formulae, to better match the expected reader's characteristics.
¹³ Bellone, E. (1999) Spazio e tempo nella nuova scienza. Carocci, 136 pages.
¹⁴ Le Scienze, Scienze e ricerche, 2017 issues.
¹⁵ Il Corriere della Sera, La Repubblica, Il Sole 24 ore, 2018.