Exploring the Big Data Using a Rigorous and Quantitative Causality Analysis

Causal analysis is a powerful tool for unraveling data complexity, and hence provides clues to achieving, say, better platform design, efficient interoperability, and service management. Data science will surely benefit from advances in this field. Here we introduce to this community a recent finding in physics on causality and the subsequent rigorous and quantitative causality analysis. The resulting formula is concise in form, involving only common statistics, namely sample covariances. A corollary is that causation implies correlation, but not vice versa, resolving the long-standing philosophical debate over correlation versus causation. The applicability to big data analysis is validated with time series purposely generated with hidden processes. As a demonstration, a preliminary application to the gross domestic product (GDP) data of the United States, China, and Japan reveals some subtle USA-China-Japan relations in certain periods.


Introduction
We have entered an era of data wealth; how to analyze these data has become a big problem for scientists in the twenty-first century. This raises many challenging issues, among which is causal inference, a field which forms an important subject in many different scientific disciplines, even in philosophy (e.g., [1]). For data science, it will help to unravel the complexity of the ever-growing datasets, and hence help to build platforms for efficient management and better service.
Causality analysis, however, is a very challenging problem. In their book Doing Data Science (p. 274) [2], O'Neil and Schutt remarked, "One of the biggest statistical challenges, from both a theoretical and practical perspective, is establishing a causal relationship between two variables." In the past few years, there has been a surge of interest in this field, echoing the call from the newly emerged science of big data. Many empirical or half-empirical formalisms have been proposed, and they generally work well in their specific contexts (see the references in [3]).
Recently, a rigorous and quantitative analysis has been developed to address the challenge (cf. [3], hereafter Liang14, and [4]). It is found that causality analysis, which traditionally has been formulated as a statistical hypothesis test (e.g., [5]), is actually a problem in physics; causality is a real physical notion, which can be put on a rigorous footing. With the Liang14 formalism, many problems that traditional approaches fail to handle turn out to be easy. It also unambiguously and explicitly resolves the long-standing debate in philosophy regarding correlation versus causation, and has been successfully applied to many real world problems.
However, this line of work has not even been touched in big data studies. While we should avail ourselves of the arsenal of traditional tools, new ideas, particularly ones like this that rest firmly on a physical footing, will surely facilitate the advancement of the new science. We are therefore motivated to introduce the newly developed causality analysis to data scientists. This is the main purpose of this study.
In the following we first give a brief review of the formalism, its development and major results. To test its utility in handling big data, in Section 3 we purposely generate series in extreme situations, particularly series in the presence of hidden processes. As a demonstration, Section 4 presents a preliminary application to the study of the USA-China-Japan relation. This study is summarized in Section 5.

Theoretical Development and Applications
Historically, Granger [5] formulated causality analysis as a statistical hypothesis test, which is now referred to as Granger causality analysis. On the other hand, another real physical notion, namely information flow, or information transfer as it may appear in the literature, has been developed for over three decades. Information flow has applications in a wide variety of disciplines; people have gradually realized that what lies at the center of the field, and makes it so widely applicable, is its logical association with causality. This observation was further substantiated when it was established that Granger causality and the most popular empirical measure of information flow so far, namely transfer entropy [6], are actually equivalent [7].
So the two major lines of work on causality analysis eventually merge. The corresponding formalisms, however, have long been found to fail verification in many applications, and may even yield spurious causal relations. The verification is based on the following observation: If the evolution of a variable, say, $X_1$, is independent of another one, $X_2$, then the causality from $X_2$ to $X_1$ vanishes.
Hereafter we will call it the Principle of Nil Causality. Recently, Smirnov [8] investigated this systematically, and concluded that these formalisms fail to verify the principle in a wide range of situations; similar results are also shown in [9]. In response to the call from the new science of big data, we should go back to basics and re-examine the problem carefully.
Since causality can be quantitatively measured by information flow, while information flow is a real physical notion (not just something in statistics), Liang argued that it should be formulated on a rigorous footing, rather than be proposed as an ansatz [3] [10]. Besides, the above principle should be stated as a proven theorem, not something to be verified in applications. In this spirit, Liang [10] considered a stochastic system in the form
$$dX_1 = F_1(X_1, X_2)\,dt + b_{11}\,dW_1 + b_{12}\,dW_2, \qquad (1)$$
$$dX_2 = F_2(X_1, X_2)\,dt + b_{21}\,dW_1 + b_{22}\,dW_2, \qquad (2)$$
where $(W_1, W_2)$ is a vector of standard Wiener processes, and $F_1$ and $F_2$ are differentiable functions of $(X_1, X_2)$. He obtained the following theorems:
Theorem 2.1. (Liang, 2008) For the dynamical system (1)-(2), the rate of information flowing from $X_2$ to $X_1$ is
$$T_{2\to1} = -E\left[\frac{1}{\rho_1}\frac{\partial (F_1\rho_1)}{\partial x_1}\right] + \frac{1}{2}E\left[\frac{1}{\rho_1}\frac{\partial^2\left((b_{11}^2 + b_{12}^2)\rho_1\right)}{\partial x_1^2}\right], \qquad (3)$$
where $E$ stands for mathematical expectation, and $\rho_1 = \rho_1(x_1)$ is the marginal probability density of $X_1$.
Theorem 2.2. (Principle of nil causality; Liang, 2008) If in the system (1)-(2) neither $F_1$ nor $b_{11}$ nor $b_{12}$ depends on $X_2$, then $T_{2\to1} = 0$.
Note that both are proven theorems (for the proofs, see [10]). In particular, the second is just the principle of nil causality.
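As a quick check of how Theorem 2.1 is used, consider the linear special case. The sketch below assumes that the information flow rate of Theorem 2.1 has the form written in its first comment lines, that $F_1 = a_{11}x_1 + a_{12}x_2$ with constant $a_{11}$, $a_{12}$, $b_{11}$, $b_{12}$ (symbols introduced here only for illustration), and that $X_1$ is Gaussian with mean $\mu_1$ and variance $C_{11}$, with $C_{12}$ the covariance between $X_1$ and $X_2$.

```latex
% Assumed form of the information flow rate in Theorem 2.1:
%   T_{2\to1} = -E[(1/\rho_1)\,\partial_{x_1}(F_1\rho_1)]
%               + (1/2)\,E[(1/\rho_1)\,\partial_{x_1}^2((b_{11}^2+b_{12}^2)\rho_1)]
% Illustrative assumptions: F_1 = a_{11}x_1 + a_{12}x_2, constant b_{11}, b_{12},
% and Gaussian \rho_1 with mean \mu_1 and variance C_{11}.
\begin{align*}
\text{Noise term: }
 & \tfrac12 (b_{11}^2+b_{12}^2)\,
   E\!\left[\frac{1}{\rho_1}\frac{\partial^2\rho_1}{\partial x_1^2}\right]
 = \tfrac12 (b_{11}^2+b_{12}^2)\int \frac{\partial^2\rho_1}{\partial x_1^2}\,dx_1 = 0, \\
\text{Drift term: }
 & E\!\left[\frac{1}{\rho_1}\frac{\partial (F_1\rho_1)}{\partial x_1}\right]
 = E\!\left[\frac{\partial F_1}{\partial x_1}\right]
 + E\!\left[F_1\,\frac{\partial \ln\rho_1}{\partial x_1}\right] \\
 & \quad = a_{11} - \frac{E\!\left[(a_{11}X_1 + a_{12}X_2)(X_1-\mu_1)\right]}{C_{11}}
 = a_{11} - \frac{a_{11}C_{11} + a_{12}C_{12}}{C_{11}}
 = -a_{12}\,\frac{C_{12}}{C_{11}}, \\
\text{Hence: }
 & T_{2\to1} = a_{12}\,\frac{C_{12}}{C_{11}}.
\end{align*}
```

Under these assumptions the flow vanishes when $a_{12} = 0$, in line with the principle of nil causality, and it also vanishes when $C_{12} = 0$, which is the linear-case face of the statement that causation implies correlation.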
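If only two time series are given, the information flow between them can be obtained through maximum likelihood estimation (Theorem 2.3):
$$T_{2\to1} = \frac{C_{11}C_{12}C_{2,d1} - C_{12}^2 C_{1,d1}}{C_{11}^2 C_{22} - C_{11}C_{12}^2}. \qquad (4)$$
In this equation, $C_{ij}$ is the sample covariance between $X_i$ and $X_j$ (the entries of the sample covariance matrix of the time series $X_1$ and $X_2$), and $C_{i,dj}$ the sample covariance between $X_i$ and a series derived from $X_j$ using the Euler forward differencing scheme:
$$\dot{X}_{j,n} = \frac{X_{j,n+1} - X_{j,n}}{\Delta t},$$
with $\Delta t$ the time step. Note that the $T$ in (4) is actually the mle of the information flow and, strictly, should bear a hat. We abuse the notation here as, from now on, only (4) will be used, and hence no confusion will arise. That is to say, (4) will be taken as the quantitative measure of causality from $X_2$ to $X_1$. More precisely, the absolute value of $T$ measures the causality. When $T_{2\to1} = 0$, $X_2$ is not causal to $X_1$; otherwise it is causal.
The formula for information flow, and hence causality, is very concise. Considering the long-standing debate over correlation versus causation, one may transform it into a form in terms of the correlation coefficient:
$$T_{2\to1} = \frac{r}{1-r^2}\left(r'_{2,d1} - r\,r'_{1,d1}\right),$$
where $r = C_{12}/\sqrt{C_{11}C_{22}}$ is the correlation coefficient between $X_1$ and $X_2$, and $r'_{i,dj} = C_{i,dj}/\sqrt{C_{ii}C_{jj}}$. Clearly, if $r = 0$ then $T_{2\to1} = 0$; but if $T_{2\to1} = 0$, $r$ does not necessarily vanish. Contrapositively, this means that causation implies correlation, but correlation does not imply causation.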
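The step from the covariance form to the correlation form is a single normalization, spelled out below. The sketch assumes that Equation (4) has the covariance form given in its first comment lines; the symbols $r$ and $r'_{i,dj}$ are notation adopted here for illustration.

```latex
% Assumed form of Equation (4):
%   T_{2\to1} = (C_{11}C_{12}C_{2,d1} - C_{12}^2 C_{1,d1}) / (C_{11}^2 C_{22} - C_{11}C_{12}^2)
% Illustrative notation:
%   r = C_{12}/\sqrt{C_{11}C_{22}},   r'_{i,dj} = C_{i,dj}/\sqrt{C_{ii}C_{jj}}
% Divide numerator and denominator by C_{11}^2 C_{22}:
\begin{align*}
T_{2\to1}
 &= \frac{C_{11}C_{12}C_{2,d1} - C_{12}^2 C_{1,d1}}{C_{11}^2 C_{22} - C_{11}C_{12}^2}
  = \frac{\dfrac{C_{12}}{\sqrt{C_{11}C_{22}}}\,\dfrac{C_{2,d1}}{\sqrt{C_{11}C_{22}}}
          - \dfrac{C_{12}^2}{C_{11}C_{22}}\,\dfrac{C_{1,d1}}{C_{11}}}
         {1 - \dfrac{C_{12}^2}{C_{11}C_{22}}} \\
 &= \frac{r\,r'_{2,d1} - r^2\,r'_{1,d1}}{1-r^2}
  = \frac{r}{1-r^2}\left(r'_{2,d1} - r\,r'_{1,d1}\right).
\end{align*}
```

The prefactor $r$ makes the asymmetry visible: $r = 0$ forces $T_{2\to1} = 0$, whereas $T_{2\to1} = 0$ only requires $r'_{2,d1} = r\,r'_{1,d1}$, which can hold with $r \neq 0$.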
Causality can be normalized so as to reveal its relative magnitude; see [4] for details. One may also perform a statistical significance test for Equation (4); for details, see [3].
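To make the estimator concrete, here is a minimal Python sketch. It assumes that Equation (4) reads $T_{2\to1} = (C_{11}C_{12}C_{2,d1} - C_{12}^2C_{1,d1})/(C_{11}^2C_{22} - C_{11}C_{12}^2)$ and that the derived series is the plain Euler forward difference with step `dt`. The function names and the use of NumPy are choices made here for illustration, not part of the original formalism; normalization and significance testing are not included.

```python
import numpy as np


def information_flow(x1, x2, dt=1.0):
    """Estimate T_{2->1}, the information flow from x2 to x1.

    Assumes the covariance form of Equation (4) and an Euler forward
    difference for the derived series.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    dx1 = (x1[1:] - x1[:-1]) / dt      # Euler forward difference of x1
    a, b = x1[:-1], x2[:-1]            # align the series with dx1
    c = np.cov(a, b)                   # 2x2 sample covariance matrix
    c11, c12, c22 = c[0, 0], c[0, 1], c[1, 1]
    c1d1 = np.cov(a, dx1)[0, 1]        # C_{1,d1}
    c2d1 = np.cov(b, dx1)[0, 1]        # C_{2,d1}
    # The denominator vanishes only for perfectly correlated series.
    return (c11 * c12 * c2d1 - c12**2 * c1d1) / (c11**2 * c22 - c11 * c12**2)


def information_flow_from_correlations(x1, x2, dt=1.0):
    """Same estimate, written in terms of correlation coefficients."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    dx1 = (x1[1:] - x1[:-1]) / dt
    a, b = x1[:-1], x2[:-1]
    s1, s2 = a.std(ddof=1), b.std(ddof=1)
    r = np.corrcoef(a, b)[0, 1]
    r_2d1 = np.cov(b, dx1)[0, 1] / (s1 * s2)   # C_{2,d1}/sqrt(C11*C22)
    r_1d1 = np.cov(a, dx1)[0, 1] / (s1 * s1)   # C_{1,d1}/C11
    return r / (1.0 - r**2) * (r_2d1 - r * r_1d1)
```

Calling `information_flow(x1, x2)` gives the estimated flow from `x2` to `x1`; swapping the arguments gives the opposite direction. The two functions should agree to rounding error, which is a convenient self-check on the algebra.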
Equation (4) has been validated with benchmark problems on which the traditional Granger causality analysis fails. It has also been applied to many real world problems, with remarkable success. Among these applications is the study of the causal structure between CO$_2$ and global warming [11]. It is found that the rise in CO$_2$ concentration during the past 120 years does cause the recent global warming; the causal relation is one-way, i.e., from CO$_2$ to global atmospheric temperature. However, on a scale of 1000 years or more, the causality is totally reversed; i.e., it is global warming that causes CO$_2$ to increase, in agreement with what has recently been inferred from the Antarctic ice-core data. Besides, anthropogenic gas emissions, which come mainly from the Northern Hemisphere, are found to cause warming mainly in the Southern Hemisphere.
Another application is with several series of prices of US stocks downloaded from Yahoo! Finance. Basically, each significant causal relation can be interpreted based on common sense. For example, Ford is found to have a much larger causality to Wal-Mart than to CVS, the convenience store chain, since, in the States, people rely on motor vehicles to shop at Wal-Mart stores, while CVS stores could be within walking distance. A deeper study shows that the causality generally varies with time. For GE and IBM, overall it seems that they are not significantly causal to each other. However, a running time analysis reveals a very strong, almost one-way causality from IBM to GE in the 1970s, starting from 1971. This identified change in causal structure recalls an old story about the "Seven Dwarfs and a Giant" in the 1960s: GE was once the biggest computer user besides the U.S. Federal Government; to avoid relying on IBM, it began to manufacture mainframe computers, together with six other companies, competing for the computer market with IBM, the Giant. But in 1970, GE sold its computer division. Starting from 1971, it then had to rely on IBM again. That is why there is such an abrupt one-way causality jump from 1970 to 1971. While the story has almost sunk into oblivion, this finding, which is based solely on the analysis of a couple of stock price time series, is really remarkable.
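A running time analysis of the kind mentioned for GE and IBM can be sketched as a sliding-window application of the estimator. The code below is only an illustration under stated assumptions: Equation (4) is taken in its covariance form with an Euler forward difference, and the window length, the step, and the synthetic test series (with hypothetical coefficients 0.5 and 0.7, and a coupling switched on halfway) are invented here; they are not the settings or data of the stock price study.

```python
import numpy as np


def information_flow(x1, x2, dt=1.0):
    """Estimate T_{2->1} with the assumed covariance form of Equation (4)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    dx1 = (x1[1:] - x1[:-1]) / dt
    a, b = x1[:-1], x2[:-1]
    c = np.cov(a, b)
    c11, c12, c22 = c[0, 0], c[0, 1], c[1, 1]
    c1d1 = np.cov(a, dx1)[0, 1]
    c2d1 = np.cov(b, dx1)[0, 1]
    return (c11 * c12 * c2d1 - c12**2 * c1d1) / (c11**2 * c22 - c11 * c12**2)


def running_information_flow(x1, x2, window, step=1, dt=1.0):
    """Estimate T_{2->1} on successive windows of `window` samples."""
    starts = range(0, len(x1) - window + 1, step)
    return np.array(
        [information_flow(x1[s:s + window], x2[s:s + window], dt) for s in starts]
    )


def demo_switching_coupling(n=8000, seed=0):
    """Synthetic check: y starts driving x only in the second half."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    y = np.zeros(n)
    for i in range(n - 1):
        a2 = 0.5 if i >= n // 2 else 0.0   # hypothetical coupling y -> x
        x[i + 1] = 0.5 * x[i] + a2 * y[i] + rng.standard_normal()
        y[i + 1] = 0.7 * y[i] + rng.standard_normal()
    # Four non-overlapping windows: two before the switch, two after.
    return running_information_flow(x, y, window=2000, step=2000)
```

In this synthetic setting the windowed estimates stay near zero before the switch and jump afterwards, which is the kind of abrupt change in causal structure described above. Shorter windows track changes faster but give noisier estimates, so the window length is a trade-off to be chosen per application.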

Validation with Series Generated with a Pair of Processes
Consider the series generated from two autoregressive processes, Equations (6)-(7), which traditionally have been used to test causality analysis tools; in them the coefficient $a_2$ couples Y into X, and $b_1$ couples X into Y. For different $a_2$ and $b_1$, we initialize the system with random numbers between 0 and 1, generate two series with 50000 values, and then compute the causalities using (4). The results are tabulated in Table 1.
The series generated for case I are shown in Figure 1. By visual inspection they are correlated and look alike. This is not surprising, as Y drives X and hence X follows Y. As regards the causalities, since $b_1 = 0$, Y does not depend on X, and hence ideally $T_{x\to y}$ should vanish. Here, at a 90% confidence level, the computed $T_{x\to y}$ (in nats per iteration; see Table 1) has a confidence interval that includes zero, and so cannot be viewed as different from zero. In contrast, $T_{y\to x}$ is huge, clearly indicating a one-way causality. This is an example of highly correlated series that result in a zero causality in one direction.
For case II, $a_2 = b_1 = 0$, hence X and Y have nothing to do with each other. A faithful analysis should yield zero causalities for both directions. Indeed, at a 90% level, neither can be distinguished from zero.
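The experiment for the one-way and uncoupled cases can be imitated with a small simulation. The sketch below is not the paper's setup: the autoregressive model, its coefficients `a1 = 0.5` and `b2 = 0.7`, the unit-variance Gaussian noise, and the seed are hypothetical stand-ins chosen here, and Equation (4) is assumed to have its covariance form with an Euler forward difference. Only the roles of `a2` (Y into X) and `b1` (X into Y), the initialization with random numbers between 0 and 1, and the series length of 50000 follow the text.

```python
import numpy as np


def information_flow(x1, x2, dt=1.0):
    """Estimate T_{2->1} with the assumed covariance form of Equation (4)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    dx1 = (x1[1:] - x1[:-1]) / dt
    a, b = x1[:-1], x2[:-1]
    c = np.cov(a, b)
    c11, c12, c22 = c[0, 0], c[0, 1], c[1, 1]
    c1d1 = np.cov(a, dx1)[0, 1]
    c2d1 = np.cov(b, dx1)[0, 1]
    return (c11 * c12 * c2d1 - c12**2 * c1d1) / (c11**2 * c22 - c11 * c12**2)


def simulate_ar_pair(a2, b1, n=50000, a1=0.5, b2=0.7, seed=0):
    """Generate a hypothetical pair of coupled AR(1) series.

    x[i+1] = a1*x[i] + a2*y[i] + e1,   y[i+1] = b1*x[i] + b2*y[i] + e2
    """
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    y = np.empty(n)
    x[0], y[0] = rng.random(2)          # initial values in [0, 1)
    e1 = rng.standard_normal(n)
    e2 = rng.standard_normal(n)
    for i in range(n - 1):
        x[i + 1] = a1 * x[i] + a2 * y[i] + e1[i + 1]
        y[i + 1] = b1 * x[i] + b2 * y[i] + e2[i + 1]
    return x, y


def causalities(a2, b1):
    """Return (T_{y->x}, T_{x->y}) for one simulated pair."""
    x, y = simulate_ar_pair(a2, b1)
    return information_flow(x, y), information_flow(y, x)
```

With one-way coupling, for example `causalities(0.5, 0.0)`, the first value should be clearly nonzero and the second close to zero even though the two series are strongly correlated; with `causalities(0.0, 0.0)` both should be close to zero. Deciding whether a small value differs from zero still requires the significance test, which this sketch omits.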
To test the validity of (4), we design a case (case III) with very weak coupling: $a_2 = b_1 = 0.01$. In the equations X and Y are essentially independent, but theoretically there does exist causality, though negligible. Remarkably, our analysis yields two significant causalities, i.e., both of them, albeit very small, pass the significance test.
In order to see whether the negligible causalities can be detected between series immersed in noise, we amplify $e_1$ and $e_2$ by ten times. The analysis again yields two information flow rates that are, albeit negligible, significant at a 90% level, just as one would expect!

Table 1. Absolute information flow rates for the series generated with (6)-(7), and their respective confidence intervals at a 90% significance level. Units are in $10^{-4}$ nats per iteration.

X. S. Liang
Table 2. Absolute information flow rates for the series generated with (8)-(9), and their respective confidence intervals at a 90% significance level. Units are in $10^{-4}$ nats per iteration.
Case  $a_2$  $b_1$

... causality, will also make such a correlation. Here the figure shows that the three possibilities all exist in this particular application, nicely spanning different periods (approximately 1980-1987, 1987-1990, 1990-1995). This serves as an excellent example of correlation versus causation, though an explanation of the structure in Figure 3(c) requires more knowledge of the political and economic history of the two countries, which we leave to future studies.

Summary and Outlook
The emerging data science will surely benefit from advances in other data-related disciplines. In this study, we introduce to the community a recently established rigorous and quantitative causality analysis to help unravel the complexity of big datasets, explore the underlying causal structures, and hence design efficient platforms for service and management purposes. To summarize, we here repeat the formula in Theorem 2.3 for causality estimation: for series $X_1$ and $X_2$, the information flow from the latter to the former is estimated to be
$$T_{2\to1} = \frac{C_{11}C_{12}C_{2,d1} - C_{12}^2 C_{1,d1}}{C_{11}^2 C_{22} - C_{11}C_{12}^2},$$
with $C_{ij}$ the sample covariance between $X_i$ and $X_j$, and $C_{i,dj}$ that between $X_i$ and a series derived from $X_j$ by taking the Euler forward difference. If $T_{2\to1}$ is nonzero, then $X_2$ is causal to $X_1$; otherwise it is not. An immediate corollary is that causation implies correlation, but correlation does not imply causation.
The above formalism, or the Liang14 formalism as it is referred to in the text, has been applied with remarkable success to many real problems. In this study, it has been validated with data series in the presence of hidden processes, and then exemplified with an analysis of the GDP data of the USA, China, and Japan. Though the study is preliminary, the results are very encouraging and demonstrate the power of the method. This analysis tool is expected to play a role in the new interdisciplinary science, i.e., the science of big data.

Figure 3. Causalities (absolute information flow rates) between China and USA (a), between China and Japan (b), and between USA and Japan (c). Units: nats/yr.
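Theorem 2.3. (Liang, 2014) Given two time series $X_1$ and $X_2$, under the assumption of a linear model, the maximum likelihood estimator (mle) of the rate of information flowing from $X_2$ to $X_1$ is given by Equation (4):
$$T_{2\to1} = \frac{C_{11}C_{12}C_{2,d1} - C_{12}^2 C_{1,d1}}{C_{11}^2 C_{22} - C_{11}C_{12}^2}.$$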