
In a previous study, we introduced dynamical aspects of written texts by regarding the serial sentence number, running from the first to the last sentence of a given text, as discretized time. Using this definition of a textual timeline, we defined an autocorrelation function (ACF) for word occurrences and demonstrated its utility both for representing dynamic word correlations and for measuring word importance within the text. In this study, we seek a stochastic process governing the occurrences of a given word that exhibits strong dynamic correlations. This is valuable because such words play a central role in developing or organizing textual contexts. In the course of this search, we find that additive binary Markov chain theory is useful for describing strong dynamic word correlations, in the sense that it can reproduce the characteristics of autocovariance functions (an unnormalized version of ACFs) observed in actual written texts. Using this theory, we propose a model of the time-varying probability of word occurrence in each sentence of a text. The proposed model takes into account hierarchical document structures such as chapters, sections, subsections, paragraphs, and sentences. Because such hierarchical structure is common to most documents, our model of word occurrence probability applies widely when interpreting dynamic word correlations in actual written texts. The main contributions of this study are therefore demonstrating the usefulness of additive binary Markov chain theory for analyzing dynamic correlations in written texts, and offering a new model of word occurrence probability that takes the common hierarchical structure of documents into account.

Introducing the notion of time to written texts reveals dynamical aspects of word occurrences, allowing us to apply standard dynamical analyses developed and used in the fields of signal processing and time series analysis. In a previous study [

Type-I words are known to occur multiple times in a text in a bursty and context-specific manner, and such occurrences ensure that the word has a dynamic correlation. Put another way, Type-I words are important for describing an idea or topic, and are therefore expected to be strongly correlated over a duration, typically several tens of sentences, in which the idea or topic is described. In contrast, Type-II words are not context-specific, and their appearance is governed by chance. Type-I words are therefore more important than Type-II words, in the sense that they play a central role in explaining the author's ideas or thoughts. The author's insights and thought process should thus be discernible through modeling of the stochastic process that generates Type-I words. However, despite the importance of Type-I words, the stochastic process yielding them could not be clarified in the previous study.

The purpose of the present study was to find such a stochastic process for Type-I words. We found that additive binary Markov chain theory is suited to this purpose because it can capture the characteristic behaviors of the dynamic correlations of Type-I words in actual written texts. To our knowledge, this is the first application of additive binary Markov chain theory to the analysis of written texts, although the theory has been utilized to model natural phenomena such as wind generation [

The remainder of this paper is organized as follows: In Section 2, we define the autocovariance function (ACVF), an unnormalized ACF used throughout this study. In Section 3, we describe the additive binary Markov chain theory, which allows mutual conversion between memory functions and ACVFs. We also present the relation between the time-varying probability of word occurrence and the memory function, which allows us to estimate the time-varying probability of a given word. In Section 4, we present typical examples of time-varying probability for Type-I words and their two distinctive features. In Section 5, we describe how to establish a recursive model for a probability distribution that successfully reproduces the two features of the time-varying probability. Finally, in Section 6, we present our conclusions and suggest directions for future research.

The autocovariance function gives the covariance of a given process with itself at pairs of time points. A standard definition of the ACVF for a weakly stationary process {X_t} is

K(τ) = E[(X_t − μ)(X_{t+τ} − μ)],   (1)

where E[·] is the expectation operator and μ = E[X_t] is the mean of {X_t} [

ρ(τ) = K(τ) / K(0).   (2)

As Equation (1) and Equation (2) show, these definitions of the ACVF and ACF use the deviation of X_t from its mean rather than X_t itself. In our previous study [

Another difference between the previous and present studies is that we extensively use ACVFs instead of ACFs because ACVFs more directly link to additive binary Markov chain theory, as will be shown later.
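For a concrete signal, the quantities of Equation (1) and Equation (2) can be estimated directly; the sketch below uses illustrative function names (`acvf`, `acf`) and a per-lag sample mean:

```python
def acvf(x, max_lag):
    """Empirical autocovariance K(tau) of a 1-D signal, per Equation (1)."""
    n = len(x)
    mu = sum(x) / n
    d = [v - mu for v in x]  # deviations from the mean
    return [sum(d[t] * d[t + tau] for t in range(n - tau)) / (n - tau)
            for tau in range(max_lag + 1)]

def acf(x, max_lag):
    """Normalized autocorrelation rho(tau) = K(tau) / K(0), per Equation (2)."""
    k = acvf(x, max_lag)
    return [v / k[0] for v in k]
```

For a strictly alternating signal 1, 0, 1, 0, …, this gives K(0) = 0.25, K(1) = −0.25, and hence ρ(1) = −1, as expected for perfect anticorrelation at lag 1.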

Because we use the set of serial sentence numbers assigned from the first to the last sentence in a considered text as discretized time, and because we intend to analyze word occurrence characteristics in terms of ACVFs, we define the signal X t representing word occurrence or non-occurrence as

X_t = { 1  (when a given word occurs in the t-th sentence)
      { 0  (when a given word does not occur in the t-th sentence)   (3)
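The binary signal of Equation (3) can be built from a sentence-segmented text with a simple whole-word match. This is a sketch; the exact tokenization used in the study may differ (e.g., lemmatization or case handling):

```python
import re

def occurrence_signal(sentences, word):
    """X_t = 1 if `word` occurs in the t-th sentence, else 0 (Equation (3))."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return [1 if pattern.search(s) else 0 for s in sentences]
```

Note that with this whole-word pattern, inflected forms such as "seeds" do not match "seed"; a lemmatizer would be needed to merge them.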

As

in the previous study. Therefore, the model functions used in the previous study to describe the ACFs of Type-I and Type-II words are still appropriate for describing ACVFs for each word type. Specifically, the Kohlrausch-Williams-Watts (KWW) function used to model the ACFs of Type-I words and the stepdown function used to describe the ACFs of Type-II words still provide full descriptive power when applied to modeling ACVFs.

Our previous study found that, without exception, all frequent words appearing in the twelve famous books are well classified into Type-I or Type-II words. This study also showed that the stochastic process governing occurrences of Type-II words is a homogeneous Poisson point process, which is completely memoryless. In the following section, we describe our attempts to determine a stochastic process for generating Type-I words in written text. We also investigate a mechanism for providing dynamic correlations to Type-I words.

One standard approach to analyzing time series with dynamic correlations is to use a Markov chain model [

Pr(X_{n+1} = x | X_1 = x_1, X_2 = x_2, ⋯, X_n = x_n) = Pr(X_{n+1} = x | X_n = x_n)   (4)

holds, then { X t } is a first-order Markov chain. In the case of a binary Markov chain in which signal X t takes only values 0 or 1 as in Equation (3), the stochastic properties of the first-order Markov chain can be completely determined by defining a transition matrix

P = ( P_00  P_01
      P_10  P_11 ),   (5)

where P i j denotes a transition probability from state i to state j. To determine all values of P i j in the transition matrix from signals { X t } observed in actual written texts, we simply used maximum likelihood estimators

P_ij = n_ij / (n_i0 + n_i1),   (6)

where n i j is the number of transitions from state i to state j observed in signals { X t } .
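The estimator of Equation (6) amounts to counting observed transitions and normalizing each row; a minimal sketch (assuming both states actually occur in the signal, so no row count is zero):

```python
def transition_matrix(x):
    """Maximum likelihood estimates of P_ij, Equation (6), from a binary signal."""
    n = [[0, 0], [0, 0]]
    for a, b in zip(x[:-1], x[1:]):   # count transitions a -> b
        n[a][b] += 1
    # normalize each row by n_i0 + n_i1 so that P_i0 + P_i1 = 1
    return [[n[i][j] / (n[i][0] + n[i][1]) for j in (0, 1)] for i in (0, 1)]
```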

After obtaining all values of P i j listed in

1) Arbitrarily set an initial state X 0 as 0 or 1.

2) Determine the next state X_1 by comparing P_00 or P_10 with a random number p ∈ [0, 1] drawn from the standard uniform distribution U(0, 1). If the initial state was X_0 = 0, compare p with P_00; if X_0 = 1, compare p with P_10. For example, consider the case X_0 = 0. If p < P_00, we set the next state as X_1 = 0, while if p ≥ P_00, we set X_1 = 1. This procedure ensures that Pr(X_1 = 0 | X_0 = 0) = P_00 and Pr(X_1 = 1 | X_0 = 0) = P_01, because from Equation (6), P_00 + P_01 = 1 always holds. Similarly, given X_0 = 1, if p < P_10 we set X_1 = 0, and otherwise X_1 = 1, so that Pr(X_1 = 0 | X_0 = 1) = P_10 and Pr(X_1 = 1 | X_0 = 1) = P_11.

3) Repeat Step 2 with the replacements X_1 → X_i and X_0 → X_{i−1} (i = 2, 3, ⋯, n). By repeating this procedure n times, we obtain simulated signals X_1, X_2, ⋯, X_n, which form a first-order Markov chain with the transition matrix P given by Equation (5).
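The three steps above can be sketched as follows (`simulate_chain` is an illustrative name; `P` is a row-stochastic matrix as in Equation (5)):

```python
import random

def simulate_chain(P, n, x0=0, seed=42):
    """Steps 1-3: generate a binary first-order Markov chain with matrix P."""
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n):
        p = rng.random()             # p ~ U(0, 1), Step 2
        x = 0 if p < P[x][0] else 1  # compare p with P_x0; row sums to 1
        out.append(x)
    return out
```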

Insets in

A direct and intuitive way to endow simulated signals X_t with long-duration dynamic correlations is to consider higher-order Markov chains [

Pr(X_{n+1} = x | X_1 = x_1, X_2 = x_2, ⋯, X_n = x_n) = Pr(X_{n+1} = x | X_{n−m+1} = x_{n−m+1}, X_{n−m+2} = x_{n−m+2}, ⋯, X_n = x_n).   (7)

| | Intermediate | Seed | Organ | Instinct |
|---|---|---|---|---|
| P_00 | 0.9800766 | 0.9787887 | 0.9723871 | 0.9860371 |
| P_01 | 0.0199234 | 0.0212114 | 0.0276129 | 0.0139629 |
| P_10 | 0.6446281 | 0.6747967 | 0.6645963 | 0.5670103 |
| P_11 | 0.3553719 | 0.3252033 | 0.3354037 | 0.4329897 |

However, one difficulty is that the number of transition probabilities, each an element of the transition matrix, grows exponentially with the order. In the case of binary Markov chains, we must evaluate 2^(m+1) transition probabilities to model an mth-order Markov chain, because we must consider all possible transition patterns in the last m signals. If m = 10, we must determine 2^11 = 2048 transition probabilities. This is infeasible because we cannot obtain sufficient samples to evaluate these probabilities when the number of signals X_t, which equals the number of sentences in the considered text, is of the same order as the number of transition probabilities to be evaluated. For example, The Origin of Species has 3991 sentences, so it is impossible to determine 2048 transition probabilities with sufficient statistical reliability from 3991 signals. We must therefore consider another model that is tractable and can generate dynamic correlations with long durations. In the following subsection, we introduce additive binary Markov chains for this purpose.

Melnyk et al. [

An additive Markov chain of order m has the property

Pr(X_n = x_n | X_{n−1} = x_{n−1}, X_{n−2} = x_{n−2}, ⋯, X_{n−m} = x_{n−m}) = Σ_{r=1}^{m} f(x_n, x_{n−r}, r).   (8)

This means that the influences of previous states at different times on the next state are mutually independent and can therefore be expressed in additive form. In the binary case, where the signal X_t is restricted to the values 0 and 1, the theory tells us that the conditional probability of Equation (8) can be rewritten as

Pr(X_n = 1 | X_{n−1} = x_{n−1}, X_{n−2} = x_{n−2}, ⋯, X_{n−m} = x_{n−m}) = X̄ + Σ_{r=1}^{m} F(r)(x_{n−r} − X̄),   (9)

where X̄ is the mean of the signals {X_t}, and F(r) is a memory function representing the degree of influence of a previous signal occurring r time steps earlier. Thus, if F(r) takes a large value, then the occurrence of a given word at t = n − r positively affects the word occurrence at t = n. Another implication of Equation (9) is that by obtaining F(r) for a given word, we can calculate the probability of signal X_t being 1 from this equation, and we can therefore generate signals X_t by a simple Monte Carlo procedure with that probability. Because the parameters needed to simulate the generation of X_t are the values F(r), the number of parameters to be evaluated equals the order m of the considered additive Markov chain. Tractability is therefore greatly improved, because the number of parameters grows only linearly with the order m of the considered Markov chain.
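The conditional probability of Equation (9) can be evaluated directly from a list of past states. In this sketch, `F[r-1]` holds F(r), `history[-r]` is the state r steps back, the sum is truncated when fewer past states are available, and the result is clipped to [0, 1] since the additive form can stray outside that range (the clipping is an assumption for robustness, not a prescription from the theory):

```python
def occurrence_probability(history, F, xbar):
    """Pr(X_n = 1 | past states), per Equation (9)."""
    m = min(len(F), len(history))  # truncate when history is shorter than m
    p = xbar + sum(F[r - 1] * (history[-r] - xbar) for r in range(1, m + 1))
    return min(max(p, 0.0), 1.0)
```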

Furthermore, the theory of additive binary Markov chains [

K(r) = Σ_{s=1}^{m} K(r − s) F(s)   (r = 1, 2, ⋯, m)   (10)

Note that the relations

K(r) = K(−r),   (11)

K(0) = X̄(1 − X̄),   (12)

always hold by the definition of K(r). By using Equation (11) and Equation (12), we can regard Equation (10) as m simultaneous equations relating the ACVFs K(1), K(2), ⋯, K(m) to the memory function values F(1), F(2), ⋯, F(m). We can thus calculate ACVFs K(r) from a theoretically assumed memory function F(r), or conversely calculate the memory function F(r) of a given word at r = 1, 2, ⋯, m from its actual ACVFs.
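Solving the m simultaneous equations of Equation (10) for F(1), ⋯, F(m) is a linear solve: by Equation (11) the coefficient K(r − s) equals K(|r − s|), so the system is symmetric. A minimal sketch with plain Gaussian elimination (`memory_function` is an illustrative name; input is the list [K(0), K(1), ⋯, K(m)]):

```python
def memory_function(K):
    """Solve Equation (10) for F(1..m), given K = [K(0), K(1), ..., K(m)]."""
    m = len(K) - 1
    A = [[K[abs(r - s)] for s in range(m)] for r in range(m)]  # K(|r-s|), Eq. (11)
    b = [K[r] for r in range(1, m + 1)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    F = [0.0] * m
    for r in range(m - 1, -1, -1):  # back substitution
        F[r] = (b[r] - sum(A[r][c] * F[c] for c in range(r + 1, m))) / A[r][r]
    return F
```

For m = 1 this reduces to F(1) = K(1)/K(0), the usual lag-1 regression coefficient.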

Before applying additive binary Markov chain theory to analyze dynamic correlations for Type-I words, we confirm the validity of the theory as follows. First, we assume the tentative memory function

F(r) = { 0.1 − 0.005r  (1 ≤ r < 20)
       { 0              (20 ≤ r)   (13)

which is shown as the solid line in

(C1) Signal X_n = 1 if a random number p ∈ [0, 1] drawn from the standard uniform distribution U(0, 1) is less than the conditional probability given by Equation (9); otherwise, X_n = 0. To calculate the conditional probability, we substitute F(r) as calculated from Equation (13) and the past m signal values into Equation (9). This procedure is repeated until we obtain the desired length of signals {X_t}.

(C2) The number of past signals is insufficient for generating the first m − 1 signals X_1, X_2, ⋯, X_{m−1}, because Equation (9) requires the past m signals X_{n−1}, X_{n−2}, ⋯, X_{n−m} to calculate the conditional probability of X_n being 1. In these cases, we use all n available past signals, from X_0 to X_{n−1}, to calculate Equation (9) and ignore the remaining terms, which would require X_{−1}, X_{−2}, ⋯.
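The validation run under conditions (C1) and (C2) can be sketched as below. The memory function assumes the linear decay reaches zero exactly at r = 20 (slope 0.005); the function names and the value of X̄ are illustrative:

```python
import random

def tentative_F(r):
    """Tentative memory function of Equation (13): linear decay, zero at r >= 20."""
    return 0.1 - 0.005 * r if 1 <= r < 20 else 0.0

def generate_signals(n, m=30, xbar=0.05, seed=0):
    """Monte Carlo generation per (C1), truncating the sum per (C2)."""
    rng = random.Random(seed)
    x = []
    for _ in range(n):
        depth = min(m, len(x))  # (C2): use only the available past states
        p = xbar + sum(tentative_F(r) * (x[-r] - xbar) for r in range(1, depth + 1))
        p = min(max(p, 0.0), 1.0)
        x.append(1 if rng.random() < p else 0)  # (C1)
    return x
```

ACVFs estimated from such simulated signals can then be compared with the K(r) implied by Equation (10) to check consistency.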

The obtained X_t and the ACVFs estimated from X_t under the condition m = 30 are shown in

In this section, we apply additive binary Markov chain theory to Type-I word occurrences to clarify characteristics of the stochastic process that generates their dynamic correlations. Since ACVFs for Type-I words can be calculated from actual signals X_t using Equation (1), and these ACVFs can be used as the observed K(r) values in Equation (10), it is easy to determine F(r) at each lag r from Equation (10).

obtained F(r) for typical Type-I words. When calculating F(r), we used K(r) represented by the best-fitted KWW functions at each lag step, instead of the original ACVFs, to reduce noise effects. Red lines in that figure represent the results of fitting F(r) with KWW functions, indicating that the memory functions F(r) of Type-I words can also be well described by KWW functions, as in the case of fitting ACVFs.

Strictly speaking, because the memory function F(r) is defined for r ≥ 1, as seen in Equation (9) and Equation (10), we use a modified form of the KWW function for F(r):

F(r) = F_0 exp{−((r − 1)/τ)^β},   (14)

while when fitting ACVFs we use the standard form of the KWW function, namely,

K(r) = K_0 exp{−(r/τ)^β}.   (15)

To confirm this observation, we conducted nonlinear least-squares fittings for all Type-I words from five well-known academic books, described in detail in the Appendix.
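As a rough stand-in for those nonlinear least-squares fits, the KWW form of Equation (15) can be fitted by a coarse grid search over β and τ, with the amplitude K_0 solved in closed form at each grid point. This is a sketch under assumed grid ranges, not the fitting routine used in the study:

```python
import math

def fit_kww(r, K):
    """Coarse grid-search least-squares fit of K(r) = K0 * exp(-(r/tau)**beta)."""
    best = None
    for beta in [0.1 + 0.02 * i for i in range(46)]:       # beta in [0.1, 1.0]
        for tau in [0.1 + 0.1 * j for j in range(100)]:    # tau in [0.1, 10.0]
            shape = [math.exp(-((ri / tau) ** beta)) for ri in r]
            # optimal amplitude for fixed shape: least-squares projection
            K0 = sum(k * s for k, s in zip(K, shape)) / sum(s * s for s in shape)
            err = sum((k - K0 * s) ** 2 for k, s in zip(K, shape))
            if best is None or err < best[0]:
                best = (err, K0, beta, tau)
    return best[1:]  # (K0, beta, tau)
```

A proper implementation would use a nonlinear least-squares routine; the grid search merely illustrates the three-parameter structure of the fit.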

The consistency of additive binary Markov chain theory for modeling occurrences of Type-I words can be confirmed by checking whether the theory reproduces observed ACVFs. The procedure for confirming consistency is as follows. Note that we use the symbol X_t for actual signals of word occurrence or non-occurrence observed in written text, while X_n designates simulated signals obtained through simple Monte Carlo procedures.

| Word | K(r): K_0 | K(r): β | K(r): τ | F(r): F_0 | F(r): β | F(r): τ |
|---|---|---|---|---|---|---|
| Instinct | 0.0233 | 0.220 | 3.78 | 0.257 | 0.619 | 1.69 |
| Intermediate | 0.0289 | 0.412 | 1.03 | 0.294 | 0.633 | 0.994 |
| Organ | 0.0374 | 0.303 | 2.01 | 0.292 | 0.622 | 1.31 |
| Seed | 0.0292 | 0.316 | 1.09 | 0.268 | 0.618 | 1.31 |

| Book | Number of Type-I words | K(r): K_0 | K(r): β | K(r): τ | F(r): F_0 | F(r): β | F(r): τ |
|---|---|---|---|---|---|---|---|
| Darwin | 109 | 0.0278 ± 0.0187 | 0.262 ± 0.115 | 0.485 ± 1.14 | 0.142 ± 0.0769 | 0.584 ± 0.0574 | 1.88 ± 0.545 |
| Einstein | 17 | 0.0732 ± 0.0199 | 0.261 ± 0.0420 | 0.224 ± 0.169 | 0.161 ± 0.0353 | 0.589 ± 0.0195 | 1.77 ± 0.281 |
| Freud | 14 | 0.0465 ± 0.0176 | 0.276 ± 0.144 | 0.658 ± 1.51 | 0.160 ± 0.0760 | 0.596 ± 0.0683 | 1.81 ± 0.564 |
| Kant | 142 | 0.0283 ± 0.0250 | 0.245 ± 0.106 | 0.237 ± 0.391 | 0.140 ± 0.0534 | 0.580 ± 0.0506 | 1.97 ± 0.442 |
| Smith | 382 | 0.0141 ± 0.0143 | 0.256 ± 0.106 | 0.305 ± 1.01 | 0.131 ± 0.0570 | 0.583 ± 0.0561 | 1.91 ± 0.491 |

| Book | Number of Type-I words | K(r): K_0 | K(r): β | K(r): τ | F(r): F_0 | F(r): β | F(r): τ |
|---|---|---|---|---|---|---|---|
| Darwin | 109 | 0.673 | 0.439 | 2.351 | 0.542 | 0.098 | 0.290 |
| Einstein | 17 | 0.272 | 0.161 | 0.754 | 0.219 | 0.033 | 0.159 |
| Freud | 14 | 0.378 | 0.522 | 2.295 | 0.475 | 0.115 | 0.312 |
| Kant | 142 | 0.883 | 0.433 | 1.650 | 0.381 | 0.087 | 0.224 |
| Smith | 382 | 1.014 | 0.414 | 3.311 | 0.435 | 0.096 | 0.257 |

1) ACVFs for a Type-I word are calculated using Equation (1) from actual signals X t observed in the text.

2) Curve fitting of Equation (15) to the ACVFs obtained in the previous step is performed to obtain optimized values of the fitting parameters.

3) K(r) is calculated at each lag step r using Equation (15) with the optimized fitting parameters. We set the maximum lag step to r_max = 100, which, as will become apparent in the following steps, is equivalent to setting the order of the additive binary Markov chain to m = 100. This setting is sufficient to cover the longest durations of dynamic correlations for Type-I words [

4) K ( r ) obtained in the previous step are substituted into the simultaneous equations, Equation (10), to obtain F ( r ) .

5) F ( r ) is used to calculate the conditional probability, Equation (9), and the calculated conditional probability of X n being 1 is used to generate simulated signals X n through simple Monte Carlo procedures. Conditions (C1) and (C2) described in Subsection 3.2 are still applied in this step. Before starting the Monte Carlo procedures, we set X ¯ to the averaged value of actual signals for a given word.

6) From the simulated signals X n , ACVFs are calculated using Equation (1) and compared with those obtained in step 1. If the ACVFs calculated from simulated signals agree well with the actual ACVFs, we can consider the additive binary Markov chain theory as consistent.

Within the framework of additive binary Markov chain theory, Equation (9) shows that the signals X_t are consequences of a time-varying conditional probability of word occurrence. In this sense, the unobserved time-varying probabilities are more essential than the observed signals X_t. The time-varying probabilities shown in

A closer look at

We further calculated the time-varying probabilities of X n being 1 for typical Type-I words selected from five academic books. Specifically, we chose the two words having the largest and second largest BIC in each book because BIC is a measure of dynamic correlation and thus indicates the importance of a given word [

strong dynamic correlations, although the second feature (aggregation along the horizontal axis) is not easy to see in Figures 7(g)-(j) due to compression of the horizontal axes.

The two features described above cannot be directly explained from Equation (9), so another viewpoint beyond the scope of additive binary Markov chain theory is needed to explain them. In the following section, we propose a recursive probability distribution model in which hierarchical structures of documents are considered to explain these features.

Almost all documents have a hierarchical structure consisting of chapters, sections, subsections, paragraphs, and sentences. This section describes the construction of a probability distribution model that reflects such hierarchical structures. The constructed model is expected to reproduce the two features of the time-varying probability of word occurrence described in the previous section. However, because our aim is to capture dynamic correlations of Type-I words with a simple model and we do not intend to build a complex or sophisticated model, some details of the proposed model will be tentatively determined. We believe, however, that the hierarchical structures of documents induce dynamic correlations of Type-I words, and that our model essentially captures the origin of these correlations.

Our model is based on a recursive probability redistribution that constructs a hierarchical probability distribution. After obtaining the hierarchical probability distribution, we convert it to the time-varying probability of word occurrence. The probability redistribution and conversion are performed in the following manner.

1) As a starting point, we consider the standard uniform distribution U ( 0 , 1 ) illustrated in

2) The unit interval [0, 1] is divided into 5 subintervals of the same length, indexed as 1, 2, 3, 4, 5 (

3) The rectangles representing probabilities at subintervals with indices a 1 and a 2 are removed and stacked on rectangles having indices b 1 and b 2 . In

zero, indicating that chapters 1 and 5 become irrelevant to descriptions using the considered word, while chapters 2, 3, and 4 are considered relevant.

4) Division into five subintervals, choosing indices a 1 , a 2 , b 1 , and b 2 and the stacking of probabilities are repeated for each portion of the top layer in the probability distribution as in

5) After obtaining the desired hierarchical probability distribution, that is, after repeating the probability redistribution a predefined number of times, we discretize the horizontal axis and convert it into sentence numbers. For example, if we repeat the previous step three times, that is, if we set the number of repetitions to r = 3, then the unit interval [0, 1] is divided into 5^3 = 125 subintervals, regarded as 125 sentences with serial sentence numbers from 1 to 125. In this case, because a text length of 125 sentences is too short to calculate ACVFs with statistical reliability, we concatenate 10 different hierarchical probability distributions, each having 125 sentences, to form a text with 1250 sentences. Concatenation of 10 hierarchical probability distributions is applied for r = 2, 3, 4, and concatenation of 5 distributions is used for r = 5.

6) The vertical axis of the probability distribution is also converted. Originally, a vertical axis value represents a density function value defined within the unit interval [0, 1]. However, because we intend to obtain time-varying probabilities, these vertical axis values must be converted into word occurrence probabilities at corresponding sentences. That is, we want to transform the probability density function to the time-varying probability, as in

Note that the time-varying probabilities shown in

Another important finding observed in the ACVFs in these plots is that dynamic correlations become more prominent and durations of dynamic correlations increase with the number of repetitions of the recursive procedure. This indicates that the origin of dynamic correlations observed for Type-I words is closely connected to the hierarchical structure of a considered text. Judging from the ACVFs in

occurrence and hence lowers the values of X̄ and K_0 = X̄(1 − X̄). The disagreement in K_0 mentioned above is thus not serious, and we can therefore conclude that the proposed model of recursive probability redistribution appropriately reproduces the dynamic correlations of Type-I words.
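The recursive redistribution of Steps 1-4 and the conversion of Steps 5-6 can be sketched as follows. The four indices are drawn uniformly at random without replacement, and `scale` is an assumed rescaling constant for the vertical-axis conversion; both are illustrative choices, not prescriptions from the construction above:

```python
import random

def redistribute(mass, depth, rng):
    """Steps 2-4: split `mass` into 5 equal parts, move the mass of parts
    a1, a2 onto parts b1, b2, then recurse into each part."""
    if depth == 0:
        return [mass]
    parts = [mass / 5] * 5
    a1, a2, b1, b2 = rng.sample(range(5), 4)  # assumed random choice of indices
    parts[b1] += parts[a1]
    parts[b2] += parts[a2]
    parts[a1] = parts[a2] = 0.0               # emptied parts become irrelevant
    out = []
    for p in parts:
        out.extend(redistribute(p, depth - 1, rng))
    return out

def time_varying_probability(depth=3, scale=0.5, seed=1):
    """Steps 5-6 sketch: 5**depth leaves become sentences; masses are rescaled
    so the tallest block maps to the assumed peak probability `scale`."""
    rng = random.Random(seed)
    masses = redistribute(1.0, depth, rng)
    peak = max(masses)
    return [scale * m / peak for m in masses]
```

Because each level only moves mass between sibling subintervals, the total mass is conserved, and the surviving blocks take a small set of discrete heights, mirroring the discretized, aggregated appearance of the observed time-varying probabilities.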

Type-I words are those used to describe notions or ideas in written texts over some duration of sentences in a context-specific manner. Therefore, if we treat the occurrences of a given word as time-series data by regarding serial sentence numbers as time, such words exhibit dynamic correlations that are well captured by autocovariance functions (ACVFs). In this study, we investigated a stochastic process governing word occurrences, with the aim of interpreting the origin of the dynamic correlations of Type-I words through that process.

To identify this process, we applied additive binary Markov chain theory to observed word occurrence signals of the considered Type-I words to estimate memory functions and time-varying probabilities of word occurrence. The obtained time-varying probabilities represent the probability of word occurrence as a function of time (i.e., serial sentence number). These probabilities show two distinctive features: their values appear to be discretized, and similar probability values aggregate over certain ranges of the time axis.

To explain these features, we constructed a recursive model of probability distribution that considers hierarchical structures of documents such as chapters, sections, subsections, paragraphs, and sentences. This construction was based on a recursive probability redistribution in a probability density function defined on the unit interval [0, 1]. After obtaining the hierarchical probability distribution in the density function, we converted the density function to time-varying probabilities of word occurrence by discretizing the horizontal axis and rescaling the vertical axis. We found that the obtained time-varying probabilities well reproduce the two distinctive features mentioned above. By using those time-varying probabilities, we generated signals X_t representing word occurrence or non-occurrence over an entire text and calculated ACVFs from the signals. The resulting ACVF with four repetitions of the recursive procedure used to construct the hierarchical probability distribution was quite similar to actual ACVFs of Type-I words.

At this stage, we have not yet considered optimizing the construction procedure of the recursive probability distribution model. For example, the use of five subintervals in the procedure was tentatively decided without statistical verification. To construct more realistic models, the number of subintervals should be drawn from an adequate probability distribution each time it is needed. Therefore, increased sophistication of the construction procedures described in Section 5 is one area for future research. Another is to interpret the time-varying probabilities of word occurrence as a fractal time series. As the recursive procedure for constructing the hierarchical probability distribution shows, the obtained probability distribution can be regarded as a statistical fractal [

We thank Dr. Yusuke Higuchi for useful discussions and illuminating suggestions. This work was supported in part by JSPS Grants-in-Aid (Grant Nos. 25589003 and 16K00160).

The authors declare no conflicts of interest regarding the publication of this paper.

Ogura, H., Amano, H. and Kondo, M. (2019) Origin of Dynamic Correlations of Words in Written Texts. Journal of Data Analysis and Information Processing, 7, 228-249. https://doi.org/10.4236/jdaip.2019.74014

In this study, we selected the English editions of five famous academic books as samples of written texts and analyzed all Type-I words appearing therein to clarify their features. Unlike in our previous study, we omitted novels from our text samples because the features of Type-I words are more prominent in academic books [

| Short name | Title | Author | Download URL |
|---|---|---|---|
| Darwin | On the Origin of Species | Charles Darwin | https://www.gutenberg.org/ebooks/1228 |
| Einstein | Relativity: The Special and General Theory | Albert Einstein | https://www.gutenberg.org/ebooks/5001 |
| Freud | Dream Psychology | Sigmund Freud | https://www.gutenberg.org/ebooks/15489 |
| Smith | An Inquiry into the Nature and Causes of the Wealth of Nations | Adam Smith | https://www.gutenberg.org/ebooks/3300 |
| Kant | The Critique of Pure Reason | Immanuel Kant | https://www.gutenberg.org/ebooks/4280 |