Topology Data Analysis Using Mean Persistence Landscapes in Financial Crashes

Topological features in high-dimensional time series are used to characterize changes in stock market dynamics over time. We explored the daily log returns of four major US stock market indices and 10 ETF sectors between January 2010 and June 2020. Topological data analysis and persistent homology were applied to two sequences of point cloud data sets, one for the stock indices and one for the ETF sectors. Using these sequences, the daily log returns, persistence diagrams, persistence landscapes, and mean landscapes were used to quantify topological patterns in the multidimensional time series. For example, norms of the persistence landscapes were generated to detect critical transitions in the daily log returns. To measure statistical significance, we implemented three permutation tests at a significance level of $\alpha = 0.05$ to determine whether topological features change within a particular time frame by comparing sliding windows in the sequence of point cloud data sets. We found that between July 1, 2019 and July 1, 2020, there is evidence of changing structure in the US stock market. Critical transitions are identified by the statistical properties of the norms of the persistence landscape between contiguous daily sliding windows of the stock indices and ETF sector series.


Introduction
Topological data analysis (TDA) extracts topological features by examining the shape of the data through persistent homology to produce topological summaries. Two topological summaries, the persistence barcode [1] [2] and the persistence diagram [3], provide visual representations of persistent topological features. [...] works to track topological changes, using TDA with machine learning to understand what happens before a critical transition, and using the norms of persistence landscapes to indicate an approaching critical transition, these financial papers lack statistical inference. We are motivated to explore how the topological features change within a given time period for stocks and ETF sectors and to assess statistical significance using a permutation test [5] [8], which we discuss in detail in Section 2.5, Section 2.6, and Section 3.4.
While we acknowledge the previously cited authors, we regard our contribution as an empirical framework that adapts their analytical models to new data sets and extends them with statistical inference. Similar to Gidea and Katz [16], we investigate the same four major indices (DJIA, S&P 500, NASDAQ, and Russell 2000), but we extend our data set to include 10 ETF sectors (Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Materials, Information Technology, Utilities, and Index) for January 4, 2010 to July 1, 2020 and examine their topological features to detect one or more critical transitions. Moreover, we generate several topological summaries, compute norms of persistence landscapes for $p = 1$ and $p = 2$, and conduct statistical inference on how these topological features change over time. In particular, we compare only sliding windows that are within a sliding step of one day of each other, separately for all the stock indices and for all the ETF sectors. We also compare all the stock indices against the ETF sectors within the same sliding window. Our hypothesis tests determine, for two groups at a time, whether the means of the topological features are the same, either across sliding windows one sliding step apart or within the same sliding window. These statistical tests have not appeared before in financial papers and are our main contribution. The remainder of this paper is organized as follows.
In Section 2, we provide background information on algebraic topology, homology, constructing the Vietoris-Rips complex, persistent homology, topological summaries, norms of persistence landscapes, and statistical inference. In Section 3, we outline our methods for obtaining the data, constructing a sequence of point cloud data sets, applying persistent homology to that sequence, generating topological summaries, and performing statistical inference.
In Section 4, we present our findings from our data. In Section 5, we discuss and provide an interpretation of our results. In Section 6, we conclude the paper.

Background
This study presents a topological data analysis of financial time series data. Here we provide background material on four relevant areas: algebraic topology, homology, topological summaries, and norms of persistence landscapes. We apply topological data analysis to a sequence of point cloud data sets to examine their topological properties within a point cloud matrix of $d$ one-dimensional time series. For our analysis, a sequence of point cloud data sets, denoted $X_n$, is shown below:

$$X_n = \left( x(t_n), x(t_{n+1}), \ldots, x(t_{n+w-1}) \right), \tag{1}$$

where each point in the sequence is expressed as $x(t_n) = (x_n^1, x_n^2, \ldots, x_n^d) \in \mathbb{R}^d$, $d$ is the number of columns (one per one-dimensional time series), $w$ is the sliding window size for a certain number of trading days ($n_{td}$) with a sliding step of one day, and $n = 1, 2, \ldots, q$. To obtain $q$, the difference is taken between the total number of days of the daily log returns ($n_{dlr}$) and one less than the sliding window size, so $q = n_{dlr} - (w - 1)$. The formula used to approximate the daily log returns is discussed in Section 3.1. So, every point cloud is composed of a $d \times w$ matrix, where $w > d$ [16]. Note that our method uses a sliding window $w$ as in [16] and does not apply the sliding window embedding theorem or Takens' theorem. In the next two subsections, we provide background information on algebraic topology and persistent homology, so that for every point cloud, we generate topological summaries and compute their $L^p$ norms based on their corresponding persistence landscapes to conduct statistical inference. For a more in-depth background, we refer readers to [3].

A. Aguilar
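The sliding-window construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `sliding_point_clouds` and the synthetic Gaussian returns are assumptions made for the example, and the window count follows $q = n_{dlr} - w + 1$.

```python
import random

def sliding_point_clouds(series, w):
    """Build the sequence of point clouds X_1, ..., X_q from d time series.

    series: list of d equal-length lists of daily log returns (one per asset).
    w: sliding window size in trading days; the sliding step is one day.
    Returns q = n_dlr - w + 1 point clouds, each a list of w points in R^d.
    """
    n_dlr = len(series[0])
    if any(len(s) != n_dlr for s in series):
        raise ValueError("all series must have the same length")
    if not 1 <= w <= n_dlr:
        raise ValueError("window size must be between 1 and n_dlr")
    # points[t] is the d-dimensional point x(t) = (x_t^1, ..., x_t^d).
    points = list(zip(*series))
    return [points[n:n + w] for n in range(n_dlr - w + 1)]

# Synthetic stand-in data (hypothetical): d = 4 series with n_dlr = 252 returns.
rng = random.Random(0)
series = [[rng.gauss(0.0, 0.01) for _ in range(252)] for _ in range(4)]
clouds = sliding_point_clouds(series, w=50)
print(len(clouds), len(clouds[0]), len(clouds[0][0]))  # 203 50 4
```

Consecutive windows share all but one point, which is why adjacent point clouds are natural candidates for pairwise comparison.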

Algebraic Topology
To produce topological summaries, we must first construct a Vietoris-Rips filtration for each point cloud in a sequence of point cloud data sets, which requires understanding simplices and simplicial complexes, defined below [19]:
Definition 2.1 Let $\{a_0, \ldots, a_n\}$ be a geometrically independent set in $\mathbb{R}^N$. We define the $n$-simplex $\sigma$ spanned by $a_0, \ldots, a_n$ to be the set of all points $x$ of $\mathbb{R}^N$ such that

$$x = \sum_{i=0}^{n} t_i a_i, \quad \text{where} \quad \sum_{i=0}^{n} t_i = 1$$

and $t_i \ge 0$ for all $i$.
Definition 2.2 A simplicial complex $K$ in $\mathbb{R}^N$ is a collection of simplices in $\mathbb{R}^N$ such that:
• Every face of a simplex of $K$ is in $K$.
• The intersection of any two simplices of $K$ is a face of each of them.
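To make the link between simplicial complexes and the Vietoris-Rips construction concrete, the sketch below builds a Rips complex at a single scale. It assumes the convention that a simplex is included when all pairwise distances between its vertices are at most `eps` (other references use `2 * eps` or a strict inequality); the function name `rips_complex` and the unit-square example are hypothetical.

```python
import itertools
import math

def rips_complex(points, eps, max_dim=2):
    """Return simplices (sorted index tuples) whose vertices are pairwise within eps."""
    n = len(points)

    def close(i, j):
        return math.dist(points[i], points[j]) <= eps

    simplices = []
    for k in range(1, max_dim + 2):  # k vertices span a (k-1)-simplex
        for combo in itertools.combinations(range(n), k):
            if all(close(i, j) for i, j in itertools.combinations(combo, 2)):
                simplices.append(combo)
    return simplices

# Four corners of the unit square: sides have length 1, diagonals sqrt(2).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
small = rips_complex(square, eps=1.0)   # 4 vertices + 4 edges
large = rips_complex(square, eps=1.5)   # 4 vertices + 6 edges + 4 triangles
print(len(small), len(large))  # 8 14
```

Because every simplex at scale 1.0 is also present at scale 1.5, increasing `eps` yields nested complexes, which is the structure a filtration requires.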

Homology
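In homology, we are interested in a vector space $C_p(K)$ whose basis is the set of $p$-simplices of $K$, together with the boundary maps $\partial_p : C_p(K) \to C_{p-1}(K)$. The kernel of $\partial_p : C_p(K) \to C_{p-1}(K)$ is denoted $Z_p(K)$ and is called the group of $p$-cycles. The image of $\partial_{p+1} : C_{p+1}(K) \to C_p(K)$ is denoted $B_p(K)$ and is called the group of $p$-boundaries.
Definition 2.3 The $p$th homology of a simplicial complex $K$ is the quotient vector space $H_p(K) = Z_p(K) / B_p(K)$. Its dimension, $\beta_p(K) = \dim H_p(K) = \dim Z_p(K) - \dim B_p(K)$, is called the $p$th Betti number of $K$ [22].
The $p$-cycles that are not boundaries represent $p$-dimensional holes, which the $p$th Betti number counts. For the $p$th homology of a filtered simplicial complex $K$, we apply Definition 2.3 and define persistent homology as follows:
Definition 2.4 Let $K$ be a finite simplicial complex, and let $K_1 \subset K_2 \subset \cdots \subset K_r = K$ be a finite sequence of nested subcomplexes of $K$. The simplicial complex $K$ with such a sequence of subcomplexes is called a filtered simplicial complex. The $p$th persistent homology of $K$ is the pair

$$\left( \{H_p(K_i)\}_{1 \le i \le r}, \{f_{i,j}\}_{1 \le i \le j \le r} \right),$$

where, for all $i \le j$, the linear maps $f_{i,j} : H_p(K_i) \to H_p(K_j)$ are induced by the inclusion maps $K_i \to K_j$.
[...] where $X_n$ is a sequence of point cloud data sets as given by Equation (1).
Moreover, the filtration of [...], where $q$ is the difference between the number of the daily log returns and the sliding window ($q = n_{dlr} - w + 1$), is a filtered simplicial complex.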
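As an illustration of how Betti numbers count holes, the following sketch computes them for a small simplicial complex from the ranks of its boundary maps, using $\beta_p = \dim \ker \partial_p - \dim \operatorname{im} \partial_{p+1}$. It assumes coefficients in $\mathbb{Z}/2$, which is a common choice but is not stated in the text, and the function names are hypothetical.

```python
import itertools

def rank_mod2(rows):
    """Rank over Z/2 of a matrix whose rows are given as integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def betti_numbers(simplices, max_dim):
    """Betti numbers beta_0..beta_max_dim of a complex given as sorted vertex tuples.

    The complex must be closed under taking faces, as in a simplicial complex.
    """
    by_dim = {p: sorted(s for s in simplices if len(s) == p + 1)
              for p in range(max_dim + 2)}
    index = {p: {s: i for i, s in enumerate(by_dim[p])} for p in by_dim}

    def boundary_rank(p):
        # Rank of the boundary map from p-chains to (p-1)-chains; zero for p = 0.
        if p == 0 or not by_dim[p]:
            return 0
        rows = []
        for s in by_dim[p]:
            mask = 0
            for face in itertools.combinations(s, p):
                mask |= 1 << index[p - 1][face]
            rows.append(mask)
        return rank_mod2(rows)

    return [len(by_dim[p]) - boundary_rank(p) - boundary_rank(p + 1)
            for p in range(max_dim + 1)]

# A hollow square (one loop) and the same square filled by two triangles.
hollow = [(0,), (1,), (2,), (3,), (0, 1), (1, 2), (2, 3), (0, 3)]
filled = hollow + [(0, 2), (0, 1, 2), (0, 2, 3)]
print(betti_numbers(hollow, 1))  # [1, 1]
print(betti_numbers(filled, 2))  # [1, 0, 0]
```

The hollow square has one connected component and one loop; filling it with triangles turns the loop into a boundary, so the degree-1 Betti number drops to zero.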

Persistent Homology
Using Definition 2.4 and Definition 2.5, it is possible to find the $p$-dimensional homology of the Vietoris-Rips complex of $X_n$, labelled as [...] such that each map is determined by a bipartite matching of basis vectors [22].

Journal of Mathematical Finance

Topological Summaries
To visualize, construct, and produce topological summaries, we need a persistence module. For additional information about the construction of a persistence module, see [5]. There are three main types of topological summaries associated with a persistence module. The first type of topological summary is called a barcode.
Unfortunately, the geometric properties of barcodes and persistence diagrams make means and variances difficult to calculate: a set of barcodes or persistence diagrams may not have a unique Fréchet mean, which prevents statistical inference from being carried out on them directly. While the barcode and the persistence diagram are conventional topological summaries, Bubenik [5] showed that the persistence landscape is a better alternative.

Persistence Landscapes and Mean Landscape
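Bubenik and Dlotko [18] proved numerous statistical properties of persistence landscapes that we may use for statistical inference, such as stability, convergence, a central limit theorem, and a strong law of large numbers. The persistence landscape and mean landscape are also used as topological summaries to indicate how persistence changes by examining the number of peaks. First, given a pair of numbers $a \le b$, the corresponding Betti number of a persistence module $M$ is $\beta^{a,b} = \dim(\operatorname{im}(M(a \le b)))$. Second, given a persistence module $M$, the persistence landscape may be defined as the function $\lambda : \mathbb{N} \times \mathbb{R} \to \overline{\mathbb{R}}$ given by

$$\lambda(k, t) = \sup\{ m \ge 0 \mid \beta^{t-m,\, t+m} \ge k \}.$$

Third, given a persistence diagram $\{(b_i, d_i)\}_{i=1}^{n}$, let $f_{(b_i, d_i)}(t) = \max(0, \min(t - b_i, d_i - t))$, and the persistence landscape is defined as follows:

$$\lambda_k(t) = \operatorname{kmax} \{ f_{(b_i, d_i)}(t) \}_{i=1}^{n}, \tag{10}$$

where kmax denotes the $k$th largest element. Using Equation (10) for $X_n$ gives the persistence landscape of $X_n$, denoted by $\lambda(X_n)$ (Equation (11)), for $n = 1, 2, \ldots, q$, where $q = n_{dlr} - w + 1$. This results in the following lemma from [5]: The persistence landscape has the following properties: 1) $\lambda_k(t) \ge 0$; 2) $\lambda_k(t) \ge \lambda_{k+1}(t)$; 3) $\lambda_k$ is 1-Lipschitz. Using Equation (10), the persistence landscape is obtained and used to calculate the mean landscape, which is defined below: Let $Y_1, \ldots, Y_n$ be independent and identically distributed copies of $Y$, and let $\Lambda_1, \ldots, \Lambda_n$ be the corresponding persistence landscapes. The mean landscape $\overline{\Lambda}_n$ is given by the pointwise mean; in particular, $\overline{\Lambda}_n(\omega) = \overline{\lambda}_n$, where

$$\overline{\lambda}_n(k, t) = \frac{1}{n} \sum_{i=1}^{n} \lambda_i(k, t). \tag{12}$$

Equation (12) is applied to $X_n$ in the same way.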
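The landscape and mean landscape can be evaluated directly from persistence diagrams. The sketch below assumes the standard tent-function form from [5] and [18], in which $\lambda_k(t)$ is the $k$th largest value of $\max(0, \min(t - b, d - t))$ over the birth-death pairs $(b, d)$; the function names and the two toy diagrams are hypothetical.

```python
def landscape(diagram, k, t):
    """Evaluate lambda_k(t) for a diagram given as a list of (birth, death) pairs.

    Returns 0.0 when the diagram has fewer than k pairs.
    """
    values = sorted((max(0.0, min(t - b, d - t)) for b, d in diagram),
                    reverse=True)
    return values[k - 1] if k <= len(values) else 0.0

def mean_landscape(diagrams, k, t):
    """Pointwise mean of lambda_k(t) over several diagrams."""
    return sum(landscape(dg, k, t) for dg in diagrams) / len(diagrams)

# Two toy degree-1 diagrams (hypothetical values).
d1 = [(0, 4), (1, 3)]
d2 = [(0, 2)]
print(landscape(d1, 1, 2.0))             # 2.0
print(landscape(d1, 2, 2.0))             # 1.0
print(landscape(d1, 3, 2.0))             # 0.0
print(mean_landscape([d1, d2], 1, 2.0))  # 1.0
```

Unlike diagrams, landscapes are ordinary functions, so averaging them pointwise is well defined, which is what makes the mean landscape usable for inference.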

Norms for Persistence Landscapes
Gidea and Katz [16] applied $L^p$ norms of the persistence landscapes to identify the signs of a financial crash, which usually occurs at a time of high variance and cross-correlation among stocks or ETFs, and demonstrated that the $L^1$ and $L^2$ norms of the persistence landscapes of four stock indices exhibited significant rising trends before financial crashes. We adopt their approach in our study.
Therefore, for real-valued functions on $\mathbb{N} \times \mathbb{R}$, for $1 \le p < \infty$, the $p$-norms of persistence landscapes are defined as

$$\|\lambda\|_p = \left( \sum_{k=1}^{\infty} \|\lambda_k\|_p^p \right)^{1/p}, \tag{14}$$

and for $p = \infty$,

$$\|\lambda\|_\infty = \sup_{k,\, t} |\lambda_k(t)|.$$

Applying Equation (14) to our sequence of point cloud data sets $X_n$ gives the norms $\|\lambda(X_n)\|_p$ for $n = 1, 2, \ldots, q$, where $q = n_{dlr} - w + 1$.
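A numerical sketch of these norms is given below. It assumes the tent-function form of the landscape and approximates each integral with the trapezoid rule on a uniform grid; the grid size, the function names, and the single-pair example are assumptions made for illustration.

```python
import math

def landscape(diagram, k, t):
    """lambda_k(t): k-th largest tent value max(0, min(t - b, d - t))."""
    values = sorted((max(0.0, min(t - b, d - t)) for b, d in diagram),
                    reverse=True)
    return values[k - 1] if k <= len(values) else 0.0

def landscape_norm(diagram, p, t_min, t_max, steps=2000):
    """Approximate the L^p norm of a persistence landscape on [t_min, t_max]."""
    h = (t_max - t_min) / steps
    grid = [t_min + i * h for i in range(steps + 1)]
    layers = range(1, len(diagram) + 1)  # lambda_k vanishes for larger k
    if p == math.inf:
        return max(landscape(diagram, k, t) for k in layers for t in grid)
    total = 0.0
    for k in layers:
        vals = [landscape(diagram, k, t) ** p for t in grid]
        total += h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))  # trapezoid rule
    return total ** (1.0 / p)

# One pair (0, 2) gives a tent of height 1 and base 2.
tent = [(0.0, 2.0)]
l1 = landscape_norm(tent, 1, 0.0, 2.0)
l2 = landscape_norm(tent, 2, 0.0, 2.0)
print(l1, l2)  # close to 1 and to sqrt(2/3), respectively
```

For this tent, the $L^1$ norm is its area and the $L^2$ norm is $\sqrt{2/3}$, which gives a quick sanity check on the grid resolution.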

Statistical Inference: Part I
To compare the topological features between two groups, the persistence landscape is used to conduct a hypothesis test and statistical inference, which require several assumptions provided by [5]. First, the persistence landscapes lie in a separable Banach space. [...] Let $Y$ be a random variable on some underlying probability space $(\Omega, \mathcal{F}, P)$, and $\Lambda$ is the corresponding topological summary statistic. To avoid confusion, we use $Y$ instead of $X$ as a random variable, because our sequence of point cloud data sets uses the variable $X_n$. In addition, Bubenik [5] proved the convergence of persistence landscapes using the [...]. Recall the sample mean is [...], where again $k_1$ and $k_2$ are the samples taken from $Y_1$ and $Y_2$. We assume that $\mu_1$ and $\mu_2$ are the expectations of $Y_1$ and $Y_2$, that is, their population means. Therefore, the statistical hypotheses are:

$$H_0 : \mu_1 = \mu_2 \quad \text{versus} \quad H_1 : \mu_1 \ne \mu_2.$$

To test the null hypothesis, we use a two-sample permutation test. Let [...]. Using Equation (21) [...]. A general form of Equation (22) is: [...]. Hence, using Equation (22) and every instance where $t_{\text{observed}} \le t_s$, the p-value is obtained as the proportion of permutations $s$ for which $t_{\text{observed}} \le t_s$. To measure statistical significance, [8] used a significance level of $\alpha = 0.05$ in their study, which we adopt in our study. We may apply the above assumptions, equations, and definitions to compare the topological features of more groups. [...] $\mu_1, \mu_2, \ldots, \mu_q$ are assumed to be the population means of $Y_1, Y_2, \ldots, Y_q$, and the statistical hypotheses are: [...]
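A two-sample permutation test of this kind can be sketched as follows. This is not the R script from [5]: it assumes each observation is a vector (for example, a discretized landscape), uses the Euclidean distance between group mean vectors as the test statistic, and reports the fraction of random relabelings whose statistic is at least the observed one. The function names, the number of permutations, and the toy data are hypothetical.

```python
import math
import random

def mean_vector(vectors):
    """Componentwise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def permutation_test(group1, group2, n_perm=1000, seed=0):
    """Return (t_observed, p_value) for H0: the two group means are equal."""
    rng = random.Random(seed)

    def statistic(a, b):
        return math.dist(mean_vector(a), mean_vector(b))

    t_observed = statistic(group1, group2)
    pooled = list(group1) + list(group2)
    k1 = len(group1)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled observations
        if statistic(pooled[:k1], pooled[k1:]) >= t_observed:
            count += 1
    return t_observed, count / n_perm

# Two clearly separated toy groups of 2-dimensional summaries.
g1 = [[1.0 + 0.01 * i, 2.0] for i in range(10)]
g2 = [[5.0 + 0.01 * i, 2.0] for i in range(10)]
t_obs, p_value = permutation_test(g1, g2)
print(p_value < 0.05)  # True: the group means differ at alpha = 0.05
```

When both groups are identical, the observed statistic is zero and every relabeling matches or exceeds it, so the p-value is 1 and the null hypothesis is not rejected.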

Methods
In this section, we describe the methods used to obtain the data and analyze the financial time series using topological data analysis, statistical inference, and RStudio [23]. The data, which were obtained from Yahoo Finance, consisted of [...] which is an approximation of a return [24]. Because the daily log returns are forward daily changes, the time frame of the daily log returns is from January 5, 2010 to June 30, 2020.
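As a minimal sketch of the return calculation, the code below assumes the common definition $r_t = \ln(P_t / P_{t-1})$ applied to consecutive closing prices. The exact formula and price field used in this study are not reproduced here, so both the definition and the sample prices are assumptions.

```python
import math

def daily_log_returns(prices):
    """Return ln(P_t / P_{t-1}) for each pair of consecutive prices (assumed definition)."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be positive")
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

# Hypothetical closing prices for one index over four trading days.
prices = [100.0, 101.0, 99.5, 102.0]
returns = daily_log_returns(prices)
print(len(returns))  # 3: one fewer return than prices
```

Log returns add across days, so their sum over the period equals the log of the ratio of the last price to the first, and a series of prices always yields one fewer return than prices.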

Point Cloud Data
After approximating the daily log returns, we designed two sequences of point cloud data sets, each with a sliding window of $w = 50$ and a sliding step of one day, following the method in [16]. The first sequence of point cloud data sets is denoted by [...].
The second sequence of point cloud data sets is denoted by [...], based on similar methods found in [16]. Therefore, we obtained the following Rips filtration: [...]

Topological Summaries
By modifying the R script in [5], the first-dimensional persistence diagrams, denoted by [...], for each point cloud data set were used along with Equations (10) and (11) [...]. For this reason, the sequences of point cloud data sets go from $q = 951$ to $q = 203$. Also, recall that the daily log returns are forward daily changes, so the time frame of the daily log returns is from July 2, 2019 to June 30, 2020. We as-

Statistical Inference
While the topological summaries were useful for examining topological features, we were also interested in assessing the statistical significance of any changes in these topological features over time. The time period of interest is July 1, 2019 to July 1, 2020, which has $n_{td} = 253$ trading days.
We make the same assumptions as in Section 2.5 and Section 2.6. Our random variables derive from our two sequences of point cloud data sets. Using Equations (41)-(56), a permutation test is completed at a significance level of $\alpha = 0.05$ for homology in degree 1 for all our hypothesis tests. We look at homology in degree 1 because we are only interested in the number of loops. All these hypothesis testing methods were modified from the R script in [5]. After finding the p-values, we plotted the daily log returns together with the p-values, distinguishing those less than our significance level $\alpha$ from those greater than or equal to it, for either all the stock indices or all the ETF sectors along a sliding window of 50 trading days.

Results
The goal of this study is to detect a statistically significant critical transition and characterize any changes in topological features over time. To assess the statistical significance of observed differences in the topological features that change over time, we used a permutation test. For degree 1, we obtained ten sample [...] comparing all the stock indices and all the ETF sectors in the same sliding windows between July 1, 2019 and July 1, 2020, which results in 199 p-values of 0.0000 and 2 p-values of 0.001 for homology in degree 1.
In order to understand these results, we review the daily log returns, the norms of the persistence landscapes, and the topological summaries of all the stock indices and all the ETF sectors. When reviewing the daily log returns for DJIA, the S&P 500, NASDAQ, and Russell 2000 between January 5, 2010 and June 30, 2020 (see Figure 1) [...] date is noteworthy and warrants closer examination for potential critical transitions prior to this date. Focusing on when the peaks occur, we include summary statistics for July 1, 2019 to July 1, 2020 for all the stock indices and all the ETF sectors in Table 1 and Table 2, respectively.
The norms of the persistence landscapes in homology degree 1, presented in Figure 3 and Figure 4, display all of the stock indices and all of the ETFs, respectively, for $p = 1$ and $p = 2$ for 1001 trading days prior to March 16, 2020. For the stock indices, the $L^1$ distances are less than 0.01 between 2017 and 2018 and less than 0.02 between 2018 and 2020, but the greatest $L^1$ distance occurs in 2020 at approximately 0.08, as seen in Figure 3. The $L^1$ distances for all of the ETFs have more spikes than the $L^1$ distances of the stock indices, especially between 2018 and 2020, but the greatest $L^1$ distance occurs in March 2020 at approximately 0.14, as seen in Figure 4. While the $L^2$ norms for all of the stock indices and all of the ETFs have similar distances, there is a noticeable spike in 2020. However, the distances in $L^2$ are not as great as in $L^1$, as shown in Figure 3 and Figure 4.

Discussion
From reviewing the norms of the persistence landscapes, the daily log returns, persistence diagrams, persistence landscapes, and mean landscapes for all of the selected dates, we find that the loops in the relevant point clouds become more pronounced, resulting in more persistence, which signifies that the stock market is transitioning from a stable state to a more unpredictable, volatile state. Moreover, the ETF sectors demonstrate more volatility than the stock indices. The findings for the stock indices coincide with the findings for the 2000 and 2008 market crashes in [16]. Similar to Gidea and Katz [16], we observe $L^1$ distances that confirm the critical thresholds prior to the 2020 peak and are larger than the $L^2$ norms. In other words, the $L^p$ norms exhibit strong growth around the emergence of the primary peak.
While the highest peak occurred on February 21, 2020 for all of the stock in-

Conclusions
In this paper, we investigated the topological features of four major indices and 10 ETF sectors for January 4, 2010 to July 1, 2020. We used two sequences of point cloud data sets, one for all the stock indices and the other for all the ETFs, with a sliding window of $w = 50$. Both sequences were used to perform TDA through algebraic topology and persistent homology. From there, topological summaries were generated to determine persistence, and the norms of the persistence landscapes were used to detect a critical transition by adapting methods found in [16]. Our goal was to determine whether changes in the topological features of stock indices and ETF sectors over a specific time frame are statistically significant. We found that between July 1, 2019 and July 1, 2020, there is evidence of differences in topological features for all the stock indices and all the ETFs. As a result, critical transitions are identified using the norms of the persistence landscape, and the topological features of stock indices and ETF sectors change over time when comparing two sliding windows that are one sliding step of one day apart.
We conclude with possible future research goals. Further work could analyze persistence landscapes for homology in degree two. It would be interesting to study topological features based on higher-degree persistence. Furthermore, it would be worthwhile to expand to commodities, futures, and other financial time series. Moreover, it would be useful to extend topological data analysis beyond statistical inference and use it for predictive modeling with machine learning.

This table presents summary statistics for all the stock indices. We estimated the mean ($\mu$), standard deviation ($\sigma$), variance ($\sigma^2$), skewness ($\gamma$), and kurtosis ($\kappa$) of the daily log returns from July 2, 2019 to June 30, 2020. The reporting period of this table contains 253 trading days from July 1, 2019 to July 1,