
Causal analysis is a powerful tool for unraveling data complexity and hence providing clues to achieving, say, better platform design, efficient interoperability, and service management. Data science will surely benefit from advances in this field. Here we introduce to this community a recent finding in physics on causality and the subsequent rigorous and quantitative causality analysis. The resulting formula is concise in form, involving only common statistics, namely sample covariances. A corollary is that causation implies correlation, but not vice versa, resolving the long-standing philosophical debate over correlation versus causation. The applicability to big data analysis is validated with time series purposely generated with hidden processes. As a demonstration, a preliminary application to the gross domestic product (GDP) data of the United States, China, and Japan reveals some subtle USA-China-Japan relations in certain periods.

We have entered an era of data wealth; how to analyze these data has become a big problem for scientists in the twenty-first century. This raises many challenging issues, among which is causal inference, a field which actually forms an important subject in many different scientific disciplines, even in philosophy (e.g., [

Causality analysis, however, is a very challenging problem. In their book Doing Data Science (p. 274) [

Recently, a rigorous and quantitative analysis has been developed to address the challenge (cf. [

However, this line of work has not yet been touched in big data studies. While we should avail ourselves of the arsenal of traditional tools, new ideas, particularly ones like this that rest on a firm physical footing, will surely facilitate the advancement of the new science. We are therefore motivated to introduce the newly developed causality analysis to data scientists; this is the main purpose of this study.

In the following we first give a brief review of the formalism, its development, and its major results. To test its utility in handling big data, in Section 3 we purposely generate series in extreme situations, particularly series in the presence of hidden processes. As a demonstration, Section 4 presents a preliminary application to the study of the USA-China-Japan relation. The study is summarized in Section 5.

Historically Granger [

So the two major lines of work on causality analysis eventually merge. The corresponding formalisms, however, have long been found unable to verify themselves in many applications, or they may even yield spurious causal relations. The verification is based on the following observation:

If the evolution of a variable, say, X_{1}, is independent of another one, X_{2}, then the causality from X_{2} to X_{1} vanishes.

Hereafter we will call it the Principle of Nil Causality. Recently, Smirnov [

Since causality can be quantitatively measured by information flow, while information flow is a real physical notion (not just something in statistics), Liang argued that it should be formulated on a rigorous footing, rather than be proposed as an ansatz [

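Consider the two-dimensional stochastic system

$$dX_{1} = F_{1}\,dt + b_{11}\,dW_{1} + b_{12}\,dW_{2}, \qquad (1)$$

$$dX_{2} = F_{2}\,dt + b_{21}\,dW_{1} + b_{22}\,dW_{2}, \qquad (2)$$

where (W_{1}, W_{2}) is a vector of standard Wiener processes, and F_{1} and F_{2} are differentiable functions of (X_{1}, X_{2}). He obtained the following theorems: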

Theorem 2.1. (Liang, 2008)

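For the dynamical system (1)-(2), the rate of information flowing from X_{2} to X_{1} is

$$T_{2\to1} = -E\left[\frac{1}{\rho_{1}}\frac{\partial (F_{1}\rho_{1})}{\partial x_{1}}\right] + \frac{1}{2}E\left[\frac{1}{\rho_{1}}\frac{\partial^{2}\left[(b_{11}^{2}+b_{12}^{2})\rho_{1}\right]}{\partial x_{1}^{2}}\right], \qquad (3)$$

where E stands for mathematical expectation, and \rho_{1} is the marginal probability density function of X_{1}.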

Theorem 2.2. Principle of nil causality (Liang, 2008)

If in the system (1)-(2), neither F_{1} nor b_{11} nor b_{12} has dependence on X_{2}, then

$$T_{2\to1} = 0.$$

Note both are proven theorems (proofs are referred to [

If only two time series are given, the information flow between them can be obtained through maximum likelihood estimation.

Theorem 2.3. (Liang, 2014)

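Given two time series X_{1} and X_{2}, under the assumption of a linear model, the maximum likelihood estimator (mle) of the rate of information flowing from X_{2} to X_{1} is

$$T_{2\to1} = \frac{C_{11}C_{12}C_{2,d1} - C_{12}^{2}C_{1,d1}}{C_{11}^{2}C_{22} - C_{11}C_{12}^{2}}. \qquad (4)$$

In this equation, C_{ij} is the sample covariance between X_{i} and X_{j}, and C_{i,dj} is the sample covariance between X_{i} and a series derived from X_{j} using the Euler forward differencing scheme:

$$\dot{X}_{j,n} = \frac{X_{j,n+k} - X_{j,n}}{k\,\Delta t},$$

with k ≥ 1 some integer and \Delta t the time step.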

Note that the T in (4) is actually the mle of the information flow and, strictly speaking, should bear a hat. We abuse the notation here since, from now on, only (4) will be used, and hence no confusion will arise. That is to say, (4) will be taken as the quantitative measure of causality from X_{2} to X_{1}. More precisely, the absolute value of T measures the causality. When T_{2→1} is nonzero, X_{2} is causal to X_{1}; if T_{2→1} = 0, X_{2} is not the cause of X_{1}.
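As a minimal sketch of how the estimator might be coded, the following Python function evaluates the covariance formula of Liang (2014), T_{2→1} = (C_{11}C_{12}C_{2,d1} − C_{12}^{2}C_{1,d1}) / (C_{11}^{2}C_{22} − C_{11}C_{12}^{2}), where d1 denotes the Euler forward difference of X_{1}. The names `information_flow` and `one_way_pair`, the parameters `k` and `dt`, and the toy system with its coefficients are assumptions made for illustration only; they are not taken from the original study.

```python
import numpy as np


def information_flow(x1, x2, k=1, dt=1.0):
    """Estimate T_{2->1}, the rate of information flow from x2 to x1."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Euler forward difference of x1 over k steps; it is k samples shorter.
    dx1 = (x1[k:] - x1[:-k]) / (k * dt)
    x1, x2 = x1[:-k], x2[:-k]
    # Sample covariance matrix of (x1, x2, dx1); each row is one variable.
    c = np.cov(np.vstack([x1, x2, dx1]))
    c11, c12, c22 = c[0, 0], c[0, 1], c[1, 1]
    c1d1, c2d1 = c[0, 2], c[1, 2]
    return (c11 * c12 * c2d1 - c12**2 * c1d1) / (c11**2 * c22 - c11 * c12**2)


def one_way_pair(n=20000, seed=0):
    """Toy system (hypothetical coefficients): x2 drives x1, not vice versa."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal((n, 2))
    x1, x2 = np.zeros(n), np.zeros(n)
    for i in range(n - 1):
        x1[i + 1] = 0.5 * x1[i] + 0.7 * x2[i] + e[i, 0]
        x2[i + 1] = 0.6 * x2[i] + e[i, 1]
    return x1, x2


x1, x2 = one_way_pair()
t21 = information_flow(x1, x2)  # flow from x2 to x1
t12 = information_flow(x2, x1)  # flow from x1 to x2
print(f"T(2->1) = {t21:.4f}, T(1->2) = {t12:.4f}")
```

In this toy system one would expect |T(2->1)| to be clearly larger than |T(1->2)|, since only x2 enters the equation for x1. Swapping the two arguments gives the flow in the opposite direction.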

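The formula for information flow, and hence causality, is very concise. Considering the long-standing debate over correlation versus causation, one may rewrite it in terms of correlation coefficients:

$$T_{2\to1} = \frac{r}{1-r^{2}}\left(r'_{2,d1} - r\,r'_{1,d1}\right), \qquad (5)$$

with

$$r = \frac{C_{12}}{\sqrt{C_{11}C_{22}}}, \qquad r'_{i,dj} = \frac{C_{i,dj}}{\sqrt{C_{ii}C_{jj}}},$$

where C_{ij} and C_{i,dj} are the sample covariances in (4), and r is the sample correlation coefficient between X_{1} and X_{2}. When r = 0, T_{2→1} vanishes, but a nonzero r does not guarantee a nonzero T_{2→1}. In other words:

Causation implies correlation, but correlation does not imply causation.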
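The asymmetry between correlation and causation can be illustrated numerically. The sketch below assumes the correlation form T_{2→1} = r/(1 − r^{2}) · (r'_{2,d1} − r r'_{1,d1}), with r the sample correlation between the two series and r'_{i,dj} = C_{i,dj}/sqrt(C_{ii}C_{jj}). The function name `flow_from_correlations` and the toy one-way system are hypothetical choices made for illustration.

```python
import numpy as np


def flow_from_correlations(x1, x2, k=1, dt=1.0):
    """Return (T_{2->1}, r) computed from correlation coefficients."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    dx1 = (x1[k:] - x1[:-k]) / (k * dt)  # Euler forward difference of x1
    x1, x2 = x1[:-k], x2[:-k]
    c = np.cov(np.vstack([x1, x2, dx1]))
    r = c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])      # correlation of x1 and x2
    r_2d1 = c[1, 2] / np.sqrt(c[1, 1] * c[0, 0])  # r'_{2,d1}
    r_1d1 = c[0, 2] / c[0, 0]                     # r'_{1,d1}
    return r / (1.0 - r**2) * (r_2d1 - r * r_1d1), r


# Toy one-way system (hypothetical coefficients): y drives x, not vice versa.
rng = np.random.default_rng(0)
n = 20000
e = rng.standard_normal((n, 2))
x, y = np.zeros(n), np.zeros(n)
for i in range(n - 1):
    x[i + 1] = 0.5 * x[i] + 0.7 * y[i] + e[i, 0]
    y[i + 1] = 0.6 * y[i] + e[i, 1]

t_yx, r_a = flow_from_correlations(x, y)  # flow from y to x
t_xy, r_b = flow_from_correlations(y, x)  # flow from x to y
print(f"r = {r_a:.3f}, T(y->x) = {t_yx:.4f}, T(x->y) = {t_xy:.4f}")
```

The correlation coefficient is the same whichever series is listed first, whereas the two information flows differ: the correlation is shared, but in this toy system only one direction should carry a sizable flow.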

Causality can be normalized so as to reveal its relative magnitude; see [

Equation (4) has been validated with touchstone problems on which the traditional Granger causality analysis fails. It has also been applied to many real-world problems, with remarkable success. Among these applications is the study of the causal structure between CO_{2} and global warming. It is found that the CO_{2} concentration rise during the past 120 years does cause the recent global warming; the causal relation is one-way, i.e., from CO_{2} to global atmospheric temperature. However, on a 1000-year (or longer) scale, the causality is totally reversed; i.e., it is global warming that causes CO_{2} to increase, in agreement with what has been inferred from recent Antarctic ice-core data. In addition, the anthropogenic gas emission, which comes mainly from the Northern Hemisphere, causes warming mainly in the Southern Hemisphere.

Another application is with several series of prices of US stocks downloaded from

significant causal relation can be interpreted based on common sense. For example, Ford is found to have a much larger causality to Wal-Mart than to CVS, the convenience store chain, since, in the States, people rely on motor vehicles to shop at Wal-Mart stores, while CVS stores may be within walking distance. A deeper study shows that the causality generally varies with time. For GE and IBM, overall they do not seem to be significantly causal to each other. However, a running time analysis reveals a very strong, almost one-way causality from IBM to GE in the 1970s, starting from 1971. This change in causal structure recalls an old story about the “Seven Dwarfs and a Giant” of the 1960s: GE was once the biggest computer user besides the U.S. Federal Government; to avoid relying on IBM, it began to manufacture mainframe computers, together with six other companies, competing for the computer market with IBM, the Giant. But in 1970, GE sold its computer division, and from 1971 it had to rely on IBM again. That is why there is such an abrupt jump in one-way causality from 1970 to 1971. While the story has almost fallen into oblivion, this finding, which is based solely on the analysis of a couple of stock price time series, is remarkable.

Consider the series generated from two autoregressive processes, which traditionally have been used to test causality analysis tools,

where e_{1} and e_{2} are noise terms. For each choice of a_{2} and b_{1}, we initialize the system with random numbers between 0 and 1, generate two series with 50000 values, and then compute the causalities using (4). The results are tabulated below.

The series generated for case I are shown in

For case II,

To test the validity of (4) further, we design a case (case III) with very weak coupling: a_{2} = b_{1} = 0.01.

In order to see whether the negligible causalities can be detected between series immersed in noises, we amplify e_{1} and e_{2} by ten times:

Case | a_{2} | b_{1} | | |
---|---|---|---|---|
I | 0.7 | 0 | 4049 ± 32 | 3 ± 45 |
II | 0 | 0 | 0.55 ± 0.71 | 0.26 ± 0.36 |
III | 0.01 | 0.01 | 3.8 ± 1.8 | 1.3 ± 0.9 |
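An experiment of this kind can be sketched as follows. The autoregressive coefficients below (0.1, 0.5, 0.7, 0.6), the unit-variance Gaussian noise, the roles assigned to a_{2} and b_{1}, and the reduced series length and number of trials are all assumptions made for illustration, so the printed values are not expected to match those in the table. The helper names `simulate` and `run_case` are hypothetical.

```python
import numpy as np


def information_flow(x1, x2):
    """Estimate T_{2->1} from two series with the covariance formula (4)."""
    dx1 = x1[1:] - x1[:-1]  # Euler forward difference, unit time step
    c = np.cov(np.vstack([x1[:-1], x2[:-1], dx1]))
    num = c[0, 0] * c[0, 1] * c[1, 2] - c[0, 1] ** 2 * c[0, 2]
    return num / (c[0, 0] ** 2 * c[1, 1] - c[0, 0] * c[0, 1] ** 2)


def simulate(a2, b1, n, rng):
    """Assumed AR(1) pair: a2 couples Y into X, b1 couples X into Y."""
    e = rng.standard_normal((n, 2))
    x, y = np.empty(n), np.empty(n)
    x[0], y[0] = rng.random(2)  # random initial values in [0, 1)
    for i in range(n - 1):
        x[i + 1] = 0.1 + 0.5 * x[i] + a2 * y[i] + e[i, 0]
        y[i + 1] = 0.7 + b1 * x[i] + 0.6 * y[i] + e[i, 1]
    return x, y


def run_case(a2, b1, n=20000, trials=5, seed=0):
    """Mean and standard deviation of |T(Y->X)| and |T(X->Y)| over trials."""
    rng = np.random.default_rng(seed)
    flows = []
    for _ in range(trials):
        x, y = simulate(a2, b1, n, rng)
        flows.append([abs(information_flow(x, y)), abs(information_flow(y, x))])
    flows = np.array(flows)
    return flows.mean(axis=0), flows.std(axis=0)


cases = {"I": (0.7, 0.0), "II": (0.0, 0.0), "III": (0.01, 0.01)}
results = {}
for case, (a2, b1) in cases.items():
    results[case] = run_case(a2, b1)
    mean, std = results[case]
    print(f"{case}: |T(Y->X)| = {mean[0]:.2e} ± {std[0]:.1e}, "
          f"|T(X->Y)| = {mean[1]:.2e} ± {std[1]:.1e}")
```

Averaging over several independent realizations gives a rough spread for each estimate, which is one simple way to judge whether a small flow is distinguishable from zero.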

Our causality analysis is for two time series and, as shown above, works well for series generated with two processes. In real problems, however, a pair of time series may result from many processes; moreover, we may have no idea what those processes are, or may even be unaware of their existence. Will (4) still work in this case? In other words, can our analysis work just as well in the presence of a hidden process? This is a problem where the traditional analyses fail.

Consider a pair of series formed from the X and Y in the following autoregressive processes:

Different from (6)-(7), here both X and Y are dependent on a third variable Z.

Pretending that we have no idea about the existence of Z, we perform a causality analysis just as before with the series X and Y. Repeat the experiments in

The results are just as one would expect. For example, case I is a one-way causal system, and the computed absolute information flow rates confirm this; in case II, X and Y do not depend on each other, and the calculated causalities are essentially zero in both directions; for case III, the causalities do exist, although they are very small. In short, our causality analysis is capable of handling series in the presence of hidden processes, even in extreme cases. It can then be utilized for data analysis on a generic basis, and is hence expected to play a role in the new science of big data.
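A hidden-process experiment of this type might be sketched as below. This is only an illustration under stated assumptions: the hidden process Z is taken to be Gaussian white noise added to both X and Y, and the autoregressive coefficients are hypothetical rather than those of the original equations. How the estimate behaves for other kinds of hidden processes is not examined in this sketch.

```python
import numpy as np


def information_flow(x1, x2):
    """Estimate T_{2->1} from two series with the covariance formula (4)."""
    dx1 = x1[1:] - x1[:-1]  # Euler forward difference, unit time step
    c = np.cov(np.vstack([x1[:-1], x2[:-1], dx1]))
    num = c[0, 0] * c[0, 1] * c[1, 2] - c[0, 1] ** 2 * c[0, 2]
    return num / (c[0, 0] ** 2 * c[1, 1] - c[0, 0] * c[0, 1] ** 2)


def simulate_with_hidden(a2, b1, n=20000, seed=0):
    """X and Y both receive the same hidden input Z (assumed white noise)."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal((n, 2))
    z = rng.standard_normal(n)  # hidden process, never shown to the analysis
    x, y = np.zeros(n), np.zeros(n)
    for i in range(n - 1):
        x[i + 1] = 0.5 * x[i] + a2 * y[i] + z[i] + e[i, 0]
        y[i + 1] = b1 * x[i] + 0.6 * y[i] + z[i] + e[i, 1]
    return x, y


# No direct coupling: any relation between X and Y comes from Z alone.
x, y = simulate_with_hidden(a2=0.0, b1=0.0)
r = np.corrcoef(x, y)[0, 1]
t_yx = information_flow(x, y)  # flow from Y to X
t_xy = information_flow(y, x)  # flow from X to Y
print(f"uncoupled: r = {r:.3f}, T(Y->X) = {t_yx:.4f}, T(X->Y) = {t_xy:.4f}")

# One-way coupling from Y to X, still with the hidden Z present.
x1w, y1w = simulate_with_hidden(a2=0.7, b1=0.0)
t_yx_1w = information_flow(x1w, y1w)
t_xy_1w = information_flow(y1w, x1w)
print(f"one-way:   T(Y->X) = {t_yx_1w:.4f}, T(X->Y) = {t_xy_1w:.4f}")
```

In the uncoupled run the two series are correlated through Z, yet the analysis sees only X and Y; comparing r with the two flows shows how a shared input can produce correlation without a corresponding information flow in this setting.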

As a demonstration, we now take a look at the GDP of the USA, China, and Japan, the three economic powers. The data are from the World Bank^{1}, available annually from 1960 through 2014. Note that it is by no means our intention to conduct research on international bilateral relations, which would require an in-depth investigation of the related economics and politics and, above all, more reliable data with finer time resolution; we merely provide an example to demonstrate how the above causality analysis tool may allow us to extract information underlying the data that would otherwise be very difficult, if not impossible, to extract.

Since the GDPs of the three countries soar from 1960 to 2014, we choose to examine their annual growth rates. Shown in

The validation in the preceding section allows us to examine the relations between the three countries without the GDP data of the rest of the world, particularly Europe, though we know the influence of the latter does exist. Since covariances need to be estimated, we pick a 40-year window to build the ensemble, and then perform a running time estimation. This results in a time period, 1980-1995, over which the causalities can be computed. A straightforward application of (4) yields these causalities, which we plot in
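The running time estimation described here can be sketched as a sliding-window loop. The series below are synthetic stand-ins, not the World Bank GDP data; the function name `running_flow` and the convention of labeling each 40-year window by the year at its midpoint are assumptions made for illustration.

```python
import numpy as np


def information_flow(x1, x2):
    """Estimate T_{2->1} from two series with the covariance formula (4)."""
    dx1 = x1[1:] - x1[:-1]  # Euler forward difference, unit time step
    c = np.cov(np.vstack([x1[:-1], x2[:-1], dx1]))
    num = c[0, 0] * c[0, 1] * c[1, 2] - c[0, 1] ** 2 * c[0, 2]
    return num / (c[0, 0] ** 2 * c[1, 1] - c[0, 0] * c[0, 1] ** 2)


def running_flow(x1, x2, years, window=40):
    """Slide a fixed window along the series; return (center years, T_{2->1})."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    centers, flows = [], []
    for start in range(len(x1) - window + 1):
        stop = start + window
        centers.append(years[start + window // 2])  # label by window midpoint
        flows.append(information_flow(x1[start:stop], x2[start:stop]))
    return np.array(centers), np.array(flows)


# Synthetic annual series for 1960-2014 (random numbers, NOT real GDP data).
rng = np.random.default_rng(1)
years = np.arange(1960, 2015)
g1 = rng.standard_normal(years.size)
g2 = rng.standard_normal(years.size)

centers, t21 = running_flow(g1, g2, years)
for yr, t in zip(centers, t21):
    print(f"{yr}: T(2->1) = {t:+.4f}")
```

With 55 annual values and a 40-year window there are 16 window positions, which under the midpoint convention span 1980 through 1995. Each estimate rests on only 40 samples, so the individual values should be read with corresponding caution.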

First look at

Causalities computed between X and Y for cases I-III in the presence of the hidden process Z:

Case | a_{2} | b_{1} | | |
---|---|---|---|---|
I | 0.7 | 0 | 3933 ± 38 | 22 ± 46 |
II | 0 | 0 | 2.9 ± 4.8 | 2.2 ± 2.4 |
III | 0.01 | 0.01 | 22 ± 8 | 19 ± 4 |

A causality from one series, say X_{1}, to another, say X_{2}, will result in a correlation, but a causality in the opposite direction, or a mutual causality, will also produce such a correlation. Here the figure shows that the three possibilities all exist in this particular application, nicely spanning different periods (approximately 1980-1987, 1987-1990, 1990-1995). This serves as an excellent example of correlation versus causation, though an explanation of the structure in

The emerging data science will surely benefit from advances in other data-related disciplines. In this study, we introduce to the community a recently established rigorous and quantitative causality analysis to help unravel the complexity of big datasets, explore the underlying causal structures, and hence design efficient platforms for service and management purposes. To summarize, we here repeat the formula in Theorem 2.3 for causality estimation: for series X_{1} and X_{2}, the information flow from the latter to the former is estimated to be

$$T_{2\to1} = \frac{C_{11}C_{12}C_{2,d1} - C_{12}^{2}C_{1,d1}}{C_{11}^{2}C_{22} - C_{11}C_{12}^{2}},$$

with C_{ij} the sample covariance between X_{i} and X_{j}, and C_{i,dj} the sample covariance between X_{i} and a series derived from X_{j} by taking the Euler forward difference. If T_{2→1} is nonzero, X_{2} is causal to X_{1}, and vice versa. An immediate corollary is that causation implies correlation, but correlation does not imply causation.

The above formalism (the Liang14 formalism) has been applied with remarkable success to many real problems. In this study, it has been validated with data series in the presence of hidden processes, and then exemplified with an analysis of the GDP data of the USA, China, and Japan. Though the study is preliminary, the results are encouraging and demonstrate the power of the method. This analysis tool is expected to play a role in the new interdisciplinary science of big data.

This study was supported by the National Science Foundation of China under Grant No. 41276032, by Jiangsu Provincial Government through “2015 Jiangsu Program for Innovation Research and Entrepreneurship Groups” and the Jiangsu Chair Professorship, and by the State Oceanic Administration through the Special Program on Global Change and Air-Sea Interaction (GASI-IPOVAI-06).

X. San Liang (2016) Exploring the Big Data Using a Rigorous and Quantitative Causality Analysis. Journal of Computer and Communications, 04, 53-59. doi: 10.4236/jcc.2016.45008