A New Way to Compute the Probability of Informed Trading

Volume-Synchronized Probability of Informed Trading (VPIN) is a tool designed to predict extreme events such as flash crashes in high-frequency trading. It aims to estimate the Probability of Informed Trading (PIN), which was built on a probabilistic framework. Concerns have been raised about its theoretical foundations and its reliability. More precisely, it has been shown that the VPIN does not theoretically approximate the PIN, since the PIN was built within a time-clock framework and the VPIN within a volume-clock one. From a practical point of view, the VPIN has been found to be sensitive to the starting point of computation on a data set and to several parameters, such as the classification rule. In this paper, in order to improve the PIN theoretical framework, we first analyze the theoretical foundations of the PIN and VPIN models to gain a better view of all their assumption subtleties. This then makes it possible to point out some approximation flaws in the formula used to approximate the PIN and to propose another, exact way to compute the PIN. All results are illustrated with simulations.


Introduction
The amount of trading data in finance has exploded thanks to the continuing progress of high-frequency techniques. This compels practitioners to use ever more state-of-the-art algorithms to deal with this overwhelming amount of information. Computers and algorithms are increasingly efficient, but decision making is still based on both the quantity and the quality of information.
Thus, errors and speculation that can make the financial market toxic, i.e. conducive to crashes, are still possible. Past examples, such as the "Flash Crash" of May 6, 2010, have shown that algorithmic trading has introduced a new kind of crash characterized by its suddenness. Such quick crashes seem dangerous because of their inherent unpredictability. However, a theoretical framework to model this new phenomenon exists.
Easley, Engle, O'Hara and Wu [1] designed a model of the high-frequency financial market based on flows of informed and uninformed traders. In this model, informed traders are aware of the future evolution of the price and thus of which decision to take (buy or sell). The authors managed to show that information is a key parameter of the spread between ask and bid prices, as they demonstrate that, within their theoretical framework, the probability of being informed is proportionally linked with it. They named this key parameter the Probability of Informed Trading (PIN). A high value of the PIN is an indicator of the toxicity of the high-frequency trading market, as it would mean it relies on too many informed traders. Later, Easley, Lopez de Prado and O'Hara [2] [3] designed a tool, nicknamed Volume-Synchronized Probability of Informed Trading (VPIN), supposed to approximate the PIN. It appeared that it could have predicted the "Flash Crash" of May 6, 2010 a few hours before it happened [4]. A number of papers have been written about it [5] [6] [7], and it has been proposed to use it for regulation through a VPIN contract [4] [8]. However, critics have pointed out some flaws, questioning its reliability. For example, Andersen and Bondarenko have shown [9] that the VPIN is quite sensitive to the point at which one starts computing it on a data set, which indeed questions the quality of the VPIN predictions. Moreover, they have also shown that the VPIN is sensitive to other parameters, such as the trade classification rule used [10], or how one defines the average daily volume of trades [11]. Changing the classification rule may drastically change the VPIN behavior [12]. Tomas Pöppe, Sebastian Moos and Dirk Schiereck have arrived at the same conclusions with a different approach: using a different classification rule can change the VPIN prediction power toward a crash (in their paper, a German blue-chip stock) [13].
Besides, controlling for ex-ante parameters seems to give poorer prediction quality [10] [11]. This point has also been checked by D. Abad, M. Massot and R. Pascual [12]. Controlling for ex-ante realized volatility and trading intensity, as did T. G. Andersen and O. Bondarenko [11], the prediction quality seems to vanish. Digging deeper, they have also underlined that it is not obvious how one should define a VPIN prediction, analyzing more precisely toxic and non-toxic halts, as well as toxic events. Furthermore, Torben G. Andersen and Oleg Bondarenko find the VPIN to be too sensitive to trading intensity. They have also shown that the VPIN metric is sometimes unexpectedly correlated with other usual metrics (such as the VIX or realized volatility) [9] [10]. Moreover, it has been shown [14] [15] that the VPIN does not approximate the PIN, since the PIN was built on a time-clock theoretical framework and the VPIN on a volume-clock paradigm. In this study, we propose another way to estimate the PIN within its original time-clock framework.
The purpose of this paper is to improve the PIN theoretical framework. Some concerns have been raised about its theoretical foundations. For this reason we assess step by step all the different theoretical ideas of the PIN model. More precisely, we first make explicit the whole theoretical framework of the PIN and VPIN models to gain a better view of all their assumption subtleties. This then makes it possible to point out some approximation errors in the formula used to approximate the PIN and to propose another, exact way to compute it. In the following, we first recall the PIN model (Section 2). Second, after introducing the original VPIN ideas, we analyze the original first-order approximation and then recall the difference between the time-clock and volume-clock paradigms (Section 3). Finally, we suggest another way to compute the PIN (Section 4).

The Time-Clock Framework
The Probability of Informed Trading (PIN) is computed on a simple model of information among traders [16]. Let us describe it with the tree below (Figure 1), originally designed in [16]. Prior to the beginning of any trading day, Nature determines whether an information event relevant to the value of the asset occurs. Information events are independently distributed and occur with a Bernoulli probability α, which can be seen on the first two branches on the left-hand side of the tree. These events are good news with probability 1 − δ (i.e. signal High), or bad news with probability δ (i.e. signal Low). After the end of trading on any day, and before Nature moves again, the full information value of the asset is realized. Hence, for any of the three leaves of the tree in Figure 1, an informed trader would know which action to take. Trade arises from both informed traders (those who have seen a signal) and uninformed traders. On any day, arrivals of uninformed buyers and uninformed sellers are described by independent Poisson processes of intensity ε each, while informed traders arrive according to a Poisson process of intensity µ. Individuals trade a single risky asset and money with a market maker over the trading day. Denote, for a given trading day, S_t and B_t the events that a sell order and a buy order, respectively, arrive at time t. Let P_t(n), P_t(b) and P_t(g) be the market maker's prior beliefs about the events "no news" (n), "bad news" (b) and "good news" (g) at time t¹. Within this model we compute the spread at t, Σ_t = a_t − b_t, where a_t and b_t are the ask and bid at time t (respectively the minimum price a seller is willing to receive and the maximum price a buyer is willing to pay). Within this framework, b_t is the expectation of the asset value, denoted V_t, conditional on the history prior to t and on a sell order S_t. Similarly, a_t is the expectation of V_t conditional on the history prior to t and on a buy order B_t. Let us note V̄, V* and V̲ respectively the value of the asset under the conditions of good news, no information and bad news. We have of course the following inequalities: V̲ < V* < V̄.
¹ We summarize here the theoretical framework as described in [16]. Formally, considering the random variables corresponding to order arrivals of sells and buys, S_t and B_t, we associate the canonical respective filtrations to define conditional expectations later. They are still noted as the events "S_t" and "B_t".
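The day-level mechanics of this tree can be sketched in a short simulation. The code below is a minimal illustration of the model of [16], not the authors' implementation; the function names, the Poisson sampler, and all parameter values are ours.

```python
import random
from math import exp

def poisson(lam, rng):
    """Knuth's simple Poisson sampler (fine for moderate rates)."""
    l, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= l:
            return k
        k += 1

def simulate_day(alpha, delta, eps, mu, rng):
    """Simulate one trading day of the tree model.

    Returns (event, sells, buys): event is 'n', 'g' or 'b';
    sells/buys are the total daily order counts.
    """
    # Nature draws the information event for the day.
    if rng.random() < alpha:
        event = 'b' if rng.random() < delta else 'g'
    else:
        event = 'n'
    # Uninformed buyers and sellers each arrive at rate eps;
    # informed traders (rate mu) buy on good news, sell on bad news.
    sell_rate = eps + (mu if event == 'b' else 0.0)
    buy_rate = eps + (mu if event == 'g' else 0.0)
    # Poisson counts over one day (rates are per-day intensities).
    return event, poisson(sell_rate, rng), poisson(buy_rate, rng)

rng = random.Random(42)
days = [simulate_day(alpha=0.3, delta=0.5, eps=20, mu=50, rng=rng)
        for _ in range(2000)]
```

Averaging over many simulated days, the fraction of event days approaches α and the mean sell count approaches ε + αδµ, in line with the model description above.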

Computation of the Spread
We now explicate the content of [3]. Let us compute the bid; the ask follows exactly the same idea². The bid is b_t = E[V_t | S_t], which can be re-written, using the different possibilities of the tree, as:
b_t = P(n | S_t) V* + P(g | S_t) V̄ + P(b | S_t) V̲.
Let us compute the first term P(n | S_t); the others follow the same idea. Using Bayes' rule one finds:
P(n | S_t) = P(S_t | n) P_t(n) / P(S_t),
so, by decomposing the denominator:
P(S_t) = P(S_t | n) P_t(n) + P(S_t | g) P_t(g) + P(S_t | b) P_t(b).
Let us have a look at the term P(S_t | n), which is the probability at t that a sell order arrives under the condition of no news.
P(S_t | n) is a transition rate. To compute it, one must first calculate the transition probability over a small time step h. Dividing by h, one indeed recovers the intensity of the Poisson process, which is a special case of a Markov jump process: P(S_t | n) = ε. Applying the same for the other cases ("bad event", "good event"), we finally have P(S_t | g) = ε and P(S_t | b) = ε + µ. As the probabilities sum to one, P_t(n) + P_t(g) + P_t(b) = 1, we get the posterior probabilities, and finally the bid has this expression:
b_t = E[V_t] − (µ P_t(b) / (ε + µ P_t(b))) (E[V_t] − V̲).
With the same reasoning the ask has this expression:
a_t = E[V_t] + (µ P_t(g) / (ε + µ P_t(g))) (V̄ − E[V_t]).
So the spread equals:
Σ_t = a_t − b_t = (µ P_t(g) / (ε + µ P_t(g))) (V̄ − E[V_t]) + (µ P_t(b) / (ε + µ P_t(b))) (E[V_t] − V̲).
In the special case where P_t(g) = P_t(b) one finds the following simple form at the start of the day:
Σ = (αµ / (αµ + 2ε)) (V̄ − V̲),
where the factor αµ / (αµ + 2ε) is the PIN. If we make the hypothesis that δ = 1/2, then P_t(g) = P_t(b); we will keep this hypothesis for the rest of the paper.
² We use the same notations as the author, distinguishing the events "t" and "S_t".
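As a quick sanity check of the closed form PIN = αµ/(αµ + 2ε), the snippet below confirms its boundary behavior. This is a sketch; the helper name is ours.

```python
def pin(alpha: float, mu: float, eps: float) -> float:
    """Probability of informed trading in the time-clock model:
    expected informed arrival rate over expected total arrival rate."""
    return alpha * mu / (alpha * mu + 2.0 * eps)

# No information events -> no informed trades -> PIN is 0.
no_events = pin(0.0, 50.0, 20.0)
# Events every day and no uninformed flow -> every trade is informed.
all_informed = pin(1.0, 50.0, 0.0)
```

The PIN also increases in α and µ and decreases in ε, matching the interpretation of toxicity given in the introduction.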

Analysis of the First Order Approximate within the Time-Clock Framework
The idea behind the VPIN is to find an easy way to compute the last expression of the PIN above using a volume-clock paradigm. More precisely, it aims at easily computing the expressions obtained for the numerator αµ and the denominator αµ + 2ε. The key heuristic behind the VPIN is to take advantage of a supposedly good property of the expectation of the absolute difference between Poisson random variables within a volume-clock framework to approximate αµ, i.e. E[|X − Y|] ≈ αµ, where X and Y are Poisson variables. We will see that this heuristic does not really make it possible to conclude as expected. More precisely, in the first subsection we will see which idea has been used to approximate the PIN within a time-clock framework. Second, we will see that the first-order approximations used are not correct, as the framework does not satisfy a required hypothesis; we analyze more precisely the first-order approximations that can be made in the time-clock framework. In the third subsection, we describe the volume-clock framework and explain why its hypotheses lead to results different from those of the time-clock framework. Finally, we illustrate our results with simulations.

The Design of a New Heuristic
In this first subsection we see which idea has been used to approximate the PIN within a time-clock framework. We refer now to the related work of Easley et al. [1]. Considering the previous framework, the probability to obtain S sells and B buys during a day t of length one is:
P(S, B) = (1 − α) e^(−ε) ε^S/S! · e^(−ε) ε^B/B! + α(1 − δ) e^(−ε) ε^S/S! · e^(−(µ+ε)) (µ+ε)^B/B! + αδ e^(−(µ+ε)) (µ+ε)^S/S! · e^(−ε) ε^B/B!.
• Remark 1: the time period is fixed, thus S and B can take any possible positive integer values, which would not be the case if S + B were fixed.
• Remark 2: intensities are rates, thus the equation has a meaning because one implicitly multiplies them by one (trading day).
The authors propose to compute the expectation of the absolute value of the random number K = S − B with an approximation. This is the intuition behind the computation of the VPIN. They refer to the paper of Katti [17] but do not make any calculus explicit. They assert that E[|K|] ≈ αµ thanks to a first-order approximation, without explaining what this means. Let us first describe the content of this reference and the assumptions it makes. Then let us describe which computations are involved within this time-clock framework.
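The displayed mixture likelihood can be checked numerically: truncating S and B at a large bound, the joint probabilities should sum to (almost) 1. A minimal sketch, with helper names and parameter values chosen by us:

```python
from math import exp, lgamma, log

def log_pois(k, lam):
    # Log of the Poisson pmf e^(-lam) lam^k / k!, via lgamma for stability.
    return -lam + k * log(lam) - lgamma(k + 1)

def joint_pmf(s, b, alpha, delta, eps, mu):
    """P(S=s, B=b) over one day of unit length, as in the text:
    a three-component mixture over no news / good news / bad news."""
    no_news = (1 - alpha) * exp(log_pois(s, eps) + log_pois(b, eps))
    good = alpha * (1 - delta) * exp(log_pois(s, eps) + log_pois(b, mu + eps))
    bad = alpha * delta * exp(log_pois(s, mu + eps) + log_pois(b, eps))
    return no_news + good + bad

alpha, delta, eps, mu = 0.4, 0.5, 20.0, 50.0
total = sum(joint_pmf(s, b, alpha, delta, eps, mu)
            for s in range(200) for b in range(200))
```

With the maximal intensity µ + ε = 70, the truncation at 200 leaves a negligible tail, so `total` is 1 up to numerical error.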

Katti's Reference Assumptions
The reference proposes several ways to compute the expectation of the absolute difference of two random variables that follow the same discrete positive distribution but with possibly different parameters. The case of Poisson variables is treated. Let us describe the beginning of Katti's paper [17]. Let X_1 and X_2 be two Poisson random variables of intensities λ_1 and λ_2. We would like to compute the number Δ_1 = E[|X_1 − X_2|]. One can write the expectation as a double sum over the joint distribution and then split it into two different sums. The author, in order to simplify the calculus and use a trick, makes the following assumption: λ_1 λ_2 = ν, where ν is a constant not linked anymore to λ_1 nor λ_2. It thus implies an inverse relation between the two parameters (for example λ_2 = ν/λ_1). Thanks to this assumption he can express the sums with a confluent hypergeometric function; operating by a suitable differential operator finally leads to a closed form of Δ_1. The particular case λ_1 = λ_2 = λ cannot be treated with this trick, because it would imply that equal numbers are linked by an inverse relation, so that the product is independent of λ_2. But if λ_1 = λ_2 with λ_1 λ_2 = ν, then ν is no longer a constant independent of the main parameters λ_1 and λ_2, so applying the operator does not give the previous results. One may use here another reference, cited by Katti [18]. We will detail the same ideas later for our precise VPIN framework. Anyway, this case leads to the following:
E[|X_1 − X_2|] = 2λ e^(−2λ) (I_0(2λ) + I_1(2λ)),
where I_n is a modified Bessel function of the first kind.
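The equal-intensity closed form can be verified against a brute-force double sum over the two Poisson pmfs. The Bessel function is implemented directly from its power series; all names and the choice λ = 15 are ours.

```python
from math import exp, lgamma, log, pi

def bessel_i(n, x, terms=200):
    """Modified Bessel function of the first kind I_n(x), by its series
    sum_k (x/2)^(2k+n) / (k! (k+n)!), computed in log space."""
    return sum(exp((2 * k + n) * log(x / 2.0) - lgamma(k + 1) - lgamma(k + n + 1))
               for k in range(terms))

def mean_abs_diff(lam, bound=200):
    """Brute-force E|X - Y| for independent X, Y ~ Poisson(lam)."""
    p = [exp(-lam + k * log(lam) - lgamma(k + 1)) for k in range(bound)]
    return sum(p[i] * p[j] * abs(i - j)
               for i in range(bound) for j in range(bound))

lam = 15.0
closed_form = 2 * lam * exp(-2 * lam) * (bessel_i(0, 2 * lam) + bessel_i(1, 2 * lam))
brute = mean_abs_diff(lam)
```

Both values also sit close to the asymptotic 2·sqrt(λ/π), which foreshadows the first-order analysis below.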

How to Use as Far as Possible References' Work to Approximate the VPIN in a Time-Clock Framework
First, let us put ourselves in the context where we have differences of Poisson processes only. It is pretty simple: one just has to condition the expectation E[|K|] on each case. Then, recall K = S − B. S and B are, under the model assumptions, Poisson counts describing the numbers of sells and buys in one day of trade. We only need two different kinds of Poisson processes to describe the mixture of Poisson processes resulting from informed and uninformed traders in each case ("good event", "bad event" and "no event"). One finds then:
E[|K|] = α(1−δ) E[|P_ε − P_(µ+ε)|] + αδ E[|P_(µ+ε) − P_ε|] + (1−α) E[|P_ε^(1) − P_ε^(2)|],
where P_λ denotes a Poisson variable of intensity λ and all variables are independent. As all the Poisson processes are independent, one can sum independent ones to produce new Poisson processes; in particular the two first terms involve the same distribution of the absolute difference, and one can thus sum them and obtain³:
E[|K|] = α E[|P_(µ+ε) − P_ε|] + (1−α) E[|P_ε^(1) − P_ε^(2)|].
³ The S and B labels no longer matter; to differentiate the Poisson processes of the last expectation we have just put labels one and two to distinguish the "no event" case.
One finally has to treat two different cases:
• different intensities: first term,
• same intensities: second term.

How to Reach a First Order Approximate
In this subsection we will first see that the main assumption needed to use Katti's result does not hold, so it cannot be used to approximate the PIN. To approximate the PIN following the authors' intuition, we then describe the following two steps:
• one way to reach the numerator's exact value consists in using Ramasubban's ideas [18],
• a first-order asymptotic analysis involves separate cases to study the sensitivity of the approximation to the parameters' values.

Katti's Assumptions Are Not Met in the New Setting
We have seen that Katti's reference uses the assumption that the Poisson intensities are linked by a relation of the form λ_1 λ_2 = ν, where ν is independent of these parameters. Here the respective parameters would be µ + ε and ε. The product (µ + ε)ε clearly has no single reason to be a constant. One could create some tricky cases, but it does not seem that the model should be limited to them (indeed, one may consider for example fitting the PIN parameters by maximizing a likelihood, as in [1]). Thus the assumption is not met, and the reference [17] cannot be invoked to claim that E[|K|] ≈ αµ at first order, as was done in [1] for example.

Computation of E[|K|]
Anyway, let us nevertheless do the calculations to compute E[|K|]. We follow the same natural ideas as T. A. Ramasubban in his paper, which treats only the case of equal Poisson intensities [18]. Let us start with the easier calculation: the case where the Poisson intensities are equal.
All the sums exist separately, so we can split them into two different ones. One recognizes here the modified Bessel function of the first kind: for an integer n and a scalar x,
I_n(x) = Σ_(k≥0) (x/2)^(2k+n) / (k! (k+n)!).
Here we obtain:
E[|P_λ^(1) − P_λ^(2)|] = 2λ e^(−2λ) (I_0(2λ) + I_1(2λ)),
which is the result of Ramasubban's quoted paper. The computation with different intensities follows the same idea, except that the symmetry of the two initial sums is broken, so we have to compute them separately.
Let us calculate the first sum and then the second; each separates into sub-sums, as all the sums exist separately. Replacing the first sum of the right-hand side by modified Bessel functions of the first kind, we obtain a closed form for it; for the second sum, we do an equivalent calculus. If we put all the terms together and rearrange the last two sums of the left-hand side of the equality, we finally get an exact expression of E[|K|] in terms of modified Bessel functions. With an arbitrary time length t for a trading period, the intensities are simply replaced by λt.

Analysis of the First Order Approximate
Recall that ε and µ are rates of uninformed and informed traders per day (in the original PIN model). Thus these parameters are pretty large numbers: this is the first intuition behind a first-order approximation. Moreover, Hankel [19] derived an asymptotic expansion of the modified Bessel function of the first kind:
I_n(x) ~ e^x / √(2πx) · (1 + O(1/x)) as x → ∞.
We first apply this expansion to E[|K|] under the conditions µ ≫ 1 and ε ≫ 1, as we consider there are many informed and uninformed traders per day (compared to 1). Let us now distinguish three cases: µ and ε of the same order, µ ≫ ε, and ε ≫ µ; in each case the expansion reduces to a different leading term. Thus we can see that the quality of the first-order approximation depends a lot on the respective values of µ and ε: the approximation E[|K|] ≈ αµ proposed in [1] is not incorrect, as we will see in the simulations, but it is sometimes imprecise.

The Volume-Clock Paradigm: The Implicit Change of Model Assumptions
In this subsection, we describe the volume-clock framework and explain why its hypotheses lead to different results for the PIN compared to the time-clock framework. More precisely, we first describe the new assumptions. Second, we make the computations within this new framework, which lead to a new value of the PIN.

The New Assumptions
In [2] [3] they introduce the paradigm of volume clock and time bars. Let us first describe it and see that the assumptions are implicitly changed, though this change is ignored. The idea is pretty simple. Consider a trade described by a time series of prices, say p_t, labelled with time t. First, they package trades into objects called "bars" that have a fixed time length, i.e. they aggregate the time series into, for example, one-minute time bars. It is equivalent to a sampling of the time series. Each bar is a kind of new trade with several rules to set its price. Second, they aggregate these time bars to form buckets of fixed volume. Say these buckets have a volume V.
• Remark 1: nothing ensures that buckets will have a fixed volume size. Indeed, each time bar is sensitive to trading intensity. The last time bar can often be too big to be aggregated into a fixed-size bucket. This means that if one wants to force the bucket size to be constant, then many time bars will not be one minute long. If, on the contrary, one wants the time size to be constant, many buckets might not have a constant volume size. Suppose anyway that everything is ideal and that each bucket has constant volume. The authors note τ the label of a bucket of volume V, and V_τ^S and V_τ^B respectively the total numbers of sells and buys that occurred in this bucket.
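The bar-and-bucket aggregation, and the boundary problem of Remark 1, can be made concrete with a toy bucketing routine (ours, not the authors'): if bars are never split, realized bucket volumes drift away from the target V.

```python
def fill_buckets(bar_volumes, V):
    """Greedily aggregate time bars into buckets of target volume V,
    without splitting bars. Returns the list of realized bucket volumes."""
    buckets, current = [], 0
    for v in bar_volumes:
        current += v
        if current >= V:
            buckets.append(current)  # may overshoot V: this is Remark 1
            current = 0
    return buckets

# Hypothetical per-bar traded volumes for a short stretch of the day.
vols = [30, 80, 10, 55, 120, 5, 40, 90, 25]
buckets = fill_buckets(vols, V=100)
```

Here every realized bucket exceeds the target volume of 100, and a partial bucket (the trailing 25) is left over, illustrating why "fixed-volume buckets" is only approximately true in practice.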
They then refer to their previous work [1] and its result E[|K|] ≈ αµ. But even if that result held, which was shown above not to be the case, one should remark the following: this equality lacks a time, as we are talking about rates of traders. In the first model, the time was one day, and implicitly one would multiply, within the time-clock framework, rates by one day. Here, in the volume-clock framework, one no longer controls time. One should take into account the bucket-filling time, which is a new random variable. At first glance, the expression is inhomogeneous, and even if right, it is far from trivial. Indeed, the authors specify: "recall that we divide the trading day into equal-sized volume buckets and treat each volume bucket as equivalent to a period for information arrival". This is misleading. Recall that in the initial model time is fixed (one day) and thus volume is random. Here one has the contrary: volume is fixed and time is thus random. Let us detail the calculus a bit more with the new assumptions. To do so, let us make the new implicit framework a bit more precise.

Computation of E[|V_τ^S − V_τ^B|]
In fact we now want to compute E[|V_τ^S − V_τ^B|], as the bucket volume is fixed. Note t′ − t the filling time of the bucket τ, and note S_t^tot and B_t^tot the Poisson processes counting the total sells and buys up to time t. One can condition on the events "good event" (g), "bad event" (b) and "no event" (n); on each event, one knows the distribution of the arrivals. The two first terms, corresponding to "good" and "bad" events, are equal in distribution, which is why they can be grouped. Before going further, let us consider the joint probability density of, for example, sells, buys and the bucket-filling time t′ − t in the case of a bad event. Now we synthesize and refer to the great ideas of the proof of Kin and Le [14]. First remark that, on a bad event, the filling time of a bucket of volume V classically follows an Erlang law of parameters (V, µ + 2ε). Second, as V_τ^S + V_τ^B = V almost surely, and considering for example a continuous bounded function g, one can easily guess that, conditionally on a bad event, V_τ^S follows a binomial law of parameters (V, (µ + ε)/(µ + 2ε)). The "no event" case is similar, with success probability 1/2. After an integration over the random variable t′ − t, we are thus computing expectations of binomial random variables, say X and Y.
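The key distributional fact used here is that, for independent Poisson counts, conditioning on their total yields a binomial law. A quick Monte Carlo check of this fact (sampler, names and parameter values are ours):

```python
import random
from math import exp

def poisson(lam, rng):
    # Knuth's simple Poisson sampler.
    l, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= l:
            return k
        k += 1

# On a bad event, sells arrive at rate mu+eps and buys at rate eps.
# Conditioned on sells + buys = V, sells should follow
# Binomial(V, (mu+eps)/(mu+2*eps)).
mu, eps, V = 8.0, 4.0, 16
rng = random.Random(1)
hits = []
for _ in range(100_000):
    s, b = poisson(mu + eps, rng), poisson(eps, rng)
    if s + b == V:       # keep only draws where the "bucket" holds V trades
        hits.append(s)
mean_s = sum(hits) / len(hits)
expected = V * (mu + eps) / (mu + 2 * eps)   # binomial mean, = 12.0 here
```

The conditional mean of the sell count matches the binomial mean V(µ+ε)/(µ+2ε), supporting the derivation above.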

Moreover, if X follows the binomial distribution of p.d.f. b(x; m, p), then using Jensen's inequality for the concave function y ↦ √y, we have, for large enough m and p differing from 1/2:
E[|2X − m|] ≈ m |2p − 1|.
Thus, for large enough V, on an information event E[|V_τ^S − V_τ^B|] ≈ V µ/(µ + 2ε). Thus the VPIN metric approximates the following for large enough n, as shown by Kin and Le [14]:
αµ / (µ + 2ε),
which differs from the PIN, αµ / (αµ + 2ε).

Some Simulation Verification
We present here some simulation verifications. First we present the framework and the experiments tested. Second, we present the results.

Framework and Experiments Tested
For purposes of illustration, we compare the empirical form of E[|K|] with its approximations. Considering the values of ε and µ + ε, we have bounded the sums at i = 100000, beyond which probability values start to be very small.

Results
In each case, we first plot the empirical numerator E[|K|] against its approximations.
One case is trickier: the asymptotic limit is actually closer to the empirical value than the first-order approximation proposed by the authors, but the trend is not obvious and needs more study. We present here the good case that works fine. Further study should perhaps be done.

Another Suggestion to Compute the PIN
In this section, we propose another way to compute the PIN. Indeed, as seen in the last section, the first-order approximation of the PIN within the time clock is not always precise, and its theoretical foundation is not correct.
Furthermore, the asymptotic alternative derived above is only asymptotic and not easy to compute.
Hence we propose an exact formula to compute the PIN in the time-clock framework. More precisely, in the first subsection we describe how to compute exactly the numerator αµ and then the PIN. Second, we describe how one can numerically design at least one methodology to compute the PIN. Finally, we present some simulation verification of our results.

One PIN Upgrade
In this subsection, we detail how to compute the PIN exactly. Recall that the probability to obtain S sells and B buys during a period of length t is the mixture given previously, with intensities multiplied by t. Recall that to compute the PIN we have E[S + B] = (αµ + 2ε)t, and even, with δ = 1/2, E[S] = E[B] = (ε + αµ/2)t. So to estimate the PIN denominator, one can use, for an arbitrary time period, an average of S, B or the total number of trades. Let us work with S and take a time period of length t.
Let us estimate the numerator αµt/2. To do this, we first make explicit the marginal probability function of obtaining S sells in a time period of length t, and secondly we compute its first three moments. Thirdly, we explain how to compute α and hence the numerator, which finally leads to a new PIN formula.

Marginal Probability Function
The probability to obtain S sells during a time period of length t is the following (using the hypothesis δ = 1/2):
P(V_S = S) = (1 − α/2) e^(−εt) (εt)^S / S! + (α/2) e^(−(µ+ε)t) ((µ+ε)t)^S / S!.
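This marginal mixture can be checked numerically: it should sum to 1 and its mean should equal (ε + αµ/2)t. A sketch with arbitrary parameters of ours:

```python
from math import exp, lgamma, log

def pois_pmf(k, lam):
    # Poisson pmf via log-gamma for stability at large intensities.
    return exp(-lam + k * log(lam) - lgamma(k + 1))

def sell_marginal(s, alpha, eps, mu, t):
    """P(V_S = s) over a window of length t, assuming delta = 1/2:
    a two-component Poisson mixture."""
    return ((1 - alpha / 2) * pois_pmf(s, eps * t)
            + (alpha / 2) * pois_pmf(s, (mu + eps) * t))

alpha, eps, mu, t = 0.4, 20.0, 50.0, 1.0
probs = [sell_marginal(s, alpha, eps, mu, t) for s in range(300)]
total = sum(probs)
mean = sum(s * probs[s] for s in range(300))
```

With these values the mean is ε + αµ/2 = 30, which is exactly the quantity used to estimate the PIN denominator from the sell side.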

Computation of First Three Moments
Let us compute the moment-generating function of this process. We will estimate the numerator using relations between moments. Let u be a real value, let V_S be the random variable representing the volume of sells and t the fixed time period associated. We have:
E[e^(u V_S)] = (1 − α/2) exp(εt(e^u − 1)) + (α/2) exp((µ+ε)t(e^u − 1)),
i.e. we have the classic decomposition of the MGF of a mixture of two Poisson distributions.
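One can sanity-check this moment-generating function by comparing its numerical derivative at u = 0 with the theoretical mean of the mixture. A sketch assuming δ = 1/2, with names and parameters of ours:

```python
from math import exp

def mgf(u, alpha, eps, mu, t):
    """MGF of the sell-volume mixture: weighted MGFs of two Poissons."""
    w = alpha / 2.0
    return ((1 - w) * exp(eps * t * (exp(u) - 1))
            + w * exp((mu + eps) * t * (exp(u) - 1)))

alpha, eps, mu, t = 0.4, 20.0, 50.0, 1.0
h = 1e-6
# First moment = M'(0), approximated by a central difference.
m1 = (mgf(h, alpha, eps, mu, t) - mgf(-h, alpha, eps, mu, t)) / (2 * h)
theoretical = (eps + alpha * mu / 2.0) * t   # = 30.0 with these values
```

Higher moments can be checked the same way with higher-order differences, which is how the relations between moments used below can be validated numerically.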

Estimation of α
Remark the following: the first two moments of V_S give expressions of the mean m and variance σ² in terms of α, εt and µt. Then, with the same idea, let us compute the third moment. Using the formulas again, we can express αµt/2 in terms of these moments. If we rearrange the expression of the denominator and numerator on the left-hand side of the equation, we remark the following:
• Remark 1: introducing the skewness γ and suitable notations for m and σ, α solves a quadratic equation whose discriminant is positive. As α is a probability, we finally keep the root lying in [0, 1].

Estimation of αµt/2
We know the expression of m, so let us replace the α on the right-hand side of the equality (not the one multiplying µt) by the previous expression.
We then estimate αµt/2. With the previous notations, we finally obtain a closed-form expression of the numerator in terms of m, σ and γ.

A New PIN Formula
Finally we obtain an equivalent, exact formula for the PIN. One then just has to estimate, over an arbitrary time length t, m, σ and γ to estimate the PIN number. The difficulty is then shifted to estimating, over this time period, the volume of each direction of trades. We describe below a possible framework to compute this number. One can verify numerically that the two formulas give exactly the same PIN numbers.

A New Framework to Compute the PIN
In this subsection we explain how at least one framework can be designed to compute the PIN in practice. Two things must be implemented to estimate the PIN well:
• the empirical averages implicitly behind m, σ and γ: we will have to put some hypotheses on the time series of volumes in order to use classic theorems,
• the volume of sells: one needs a classification model to guess, over a given amount of time, the number of sells within the total volume.

Estimation of m, σ and γ
We would like to use the law of large numbers, for which we basically need independent and identically distributed random variables. Thus the choices to make here are:
• the time length η,
• the number n of sub-intervals, to have a precise average.
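The block-averaging of empirical moments described above can be sketched as follows (helper names are ours): the series is split into n consecutive sub-intervals, moments are computed per sub-interval, and the estimates are averaged.

```python
def moments(xs):
    """Empirical mean, standard deviation and skewness of a sample."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    sd = var ** 0.5
    skew = (sum((x - m) ** 3 for x in xs) / n) / sd ** 3 if sd > 0 else 0.0
    return m, sd, skew

def block_moments(series, n_blocks):
    """Split a series into n_blocks consecutive sub-intervals and
    average the per-block moment estimates."""
    size = len(series) // n_blocks
    blocks = [series[i * size:(i + 1) * size] for i in range(n_blocks)]
    stats = [moments(b) for b in blocks]
    return tuple(sum(s[i] for s in stats) / n_blocks for i in range(3))

series = [0, 1] * 5000   # toy symmetric data: mean 0.5, sd 0.5, skew 0
m, sd, g = block_moments(series, n_blocks=100)
```

In practice `series` would hold the per-interval sell volumes, and the averaged m, σ and γ would be plugged into the new PIN formula.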
To reduce the standard deviation of PIN_(t, t+η), one direct way is to average both the PIN estimated using the volume of sells and the PIN estimated using the volume of buys.

Some Simulation Verification
We finally present some simulation verifications. First we describe the framework. Second we present the results. The parameter values tested are exactly the same as in the previous framework, as we would like to compare previous results with the values of our new formula. The only difference, which slightly changes our framework, is that computing the new formula needs more samples. We detail it now.

Framework and Experiments Tested
For purposes of illustration, we compare the empirical form of the PIN with our new formula:
• We compute 20 values for each choice of ε and µ in the three cases above.
• For each of the 20 values, for a choice of ε and µ, we generate 1,000,000 Poisson samples, which we divide into 100 consecutive intervals of 10,000 values. For each of the 100 intervals we compute the empirical averages to approximate the mean m, standard deviation σ and skewness γ. We then compute an approximation of the PIN with an average of these 100 values¹⁰.

Results
In each case (Figure 8, Figure 9 and Figure 10), one sees that the estimate from the new formula is closer to the true value than the VPIN one.
By the way, we have checked that the new PIN formula obviously equals the true PIN formula for any parameters ε, µ and α of the model.

Conclusions
In this last section, we first present a general summary of our findings; then we propose suggestions for further research on this topic. In this study we have analyzed the theoretical foundations of the PIN model and we have shown that its time-clock framework makes it hard to apply the original VPIN heuristic to estimate the probability of informed trading. Indeed, the first-order asymptotics are not that simple to estimate, theoretically and in practice. That is why we propose another way to estimate the PIN, which is theoretically exact and hence more precise than the asymptotic formula, as confirmed by our first tests. Moreover, the study recalls and highlights the difference between the volume-clock and time-clock paradigms, which leads to a different formula for the PIN, and whose respective hypotheses cannot therefore be used simultaneously to approximate the PIN.
¹⁰ This double average equals the traditional VPIN formula, as the values are consecutive.
Here are some ideas to further study this precise subject:
• test and compare the performance of the new formula within the time-clock framework on real trading data: find locally optimal parameters (n, η, trade classification algorithm, …) to maximize prediction quality,
• analyze and assess the stability of the new formula and compare it to other ones.