
It is difficult, if not impossible, to select appropriately and effectively from the vast pool of existing neural network machine learning predictive models for industrial incorporation or academic exploration and enhancement. When every model outperforms all the others under disparate circumstances, none of them does. Selecting the ideal model becomes a matter of ill-supported opinion that is not grounded in the real world environment. This paper proposes a novel grouping of the model pool, drawn along the line of non-stationary real world data, into two groups: Permanent Data Learning and Reversible Data Learning. It further proposes a novel approach to demonstrating, qualitatively and quantitatively, their significant differences based on how they handle dynamic, raw real world data versus static laboratory data biased by prescient data mining. The results across 2040 separate simulation runs using 15,600 data points in realistically operationally controlled data environments show that the two-group division is effective and significant, with clear qualitative, quantitative, and theoretical support. Results across the empirical and theoretical spectrum are internally and externally consistent, yet demonstrate why and how this result is non-obvious.

Prior research is extremely mixed in terms of demonstrating which of the many forms of neural network predictive models provides superior performance under specific laboratory data set conditions. Resolving this apparent contradiction is of great interest both to industry, in its search for the appropriate technology to pursue, and to academia, in its search for the appropriate model to explore and enhance. This paper offers a potential resolution of this contradiction.

The tasks are twofold: first, to clarify and classify the dozens of disparate models into two main, descriptive families; and second, to theoretically and empirically explore their operational characteristics in real world settings such that their past published performances become understandable and the two main family approaches become prescriptive.

In accomplishing these two tasks, this paper hopes to point out how two modeling approaches initially appearing very similar are in fact qualitatively different, with different goals, different perspectives, and potentially very different outcomes—both in terms of model operations and in terms of their ensuing spiral effects on the users.

There are myriad ways to classify the available neural network predictive machine-learning models, including date of first publication, complexity, speed, pre- and post-processing layers, and more. These model features are analog in nature; there are no clear-cut boundaries whereby members on one side qualitatively differ from members on the other. Any model can trade off speed for more complexity and higher accuracy, and vice versa. Prior attempts at categorical subdivisions also fall short. For example, a division between instance-based lazy learning and generalized eager learning captures insignificant differences, since shifting the bulk of the processing towards the training or testing phase is irrelevant beyond needing to know which arbitrary training or testing set is larger. Similarly, a division between localized min/max and globalized min/max captures ephemeral differences, since even a nominally local k-NN model can simply expand k to shift from a locally dependent result to a generalized, global one.

By using real world model operational characteristics on non-stationary, heteroskedastic data, this paper categorically divides the models along a novel characteristic descriptor boundary. The two families are Permanent Data Learning and Reversible Data Learning. While they both nominally perform the same duty with nominally very similar pattern recognition and prediction mechanisms, they have subtle differences in data processing that have profound cascading effects on their users’ ensuing approaches towards predictive modeling.

A Permanent Data Learning (PDL) model irretrievably folds all training example data into a boundary classifier representative of neurological synaptic weights. Members include all forms of regression models, Perceptrons [

The following equations characterize the core training and pattern storage mechanisms for regression networks, back propagation and time series derivatives, and support vector networks, respectively.

The constant theme underlying Equations (1)-(3) is that all PDL model learning is operationally permanent. In Equation (1), all X-vector independent variables operate together with the Y-vector variables to irrevocably produce the B-vector learned components. In Equation (2), all upstream and downstream layer connections work together across all training data x to compress the data and irrevocably embed it into the learned weight components. In Equation (3), the sum of all training data vectors affects the saddle point, the ensuing support vectors, and the maximum margin boundary.
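For orientation, the three mechanisms named above are commonly written in the following textbook forms. These are assumed standard forms and not necessarily the notation of Equations (1)-(3); the symbols $\eta$ (learning rate), $E$ (error), $w_{ij}$ (connection weights), and $\alpha_i$ (Lagrange multipliers) are introduced here for illustration only.

```latex
\begin{align*}
% Linear regression (ordinary least squares): every row of X and Y contributes to B
\mathbf{B} &= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Y} \\
% Back propagation (batch gradient descent): each weight sums error gradients over all training data x
\Delta w_{ij} &= -\eta \sum_{x} \frac{\partial E(x)}{\partial w_{ij}} \\
% Support vector network: saddle point of the Lagrangian (minimize over w, b; maximize over alpha)
L(\mathbf{w}, b, \boldsymbol{\alpha}) &= \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  - \sum_{i} \alpha_i \left[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \right],
  \qquad \alpha_i \ge 0
\end{align*}
```

In each form the learned quantity is a sum or product over the whole training set, which is why no single training example can later be separated out again.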

On encountering a new training data point for incorporation, all prior learning partitions (be they nodes, neurons, support vectors, or boundaries and margins) need to be checked and modified. This causes the PDL models to be generalized global learners. There are no means to isolate any data point, retrieve a specific pattern in memory, and encapsulate its effects on the whole. This causes the PDL models to be fixedly global learners. They have no flexibility to be otherwise. Carrying out any change in the knowledge base of incorporated training data points, including adding new data or unlearning past data, requires a complete destruction of the model and recreation of a new one using the entire updated training set.
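The permanence described above can be illustrated with a minimal sketch, assuming an ordinary least-squares regression as a stand-in PDL member. The synthetic data, the helper `fit_ols`, and the outlier values are hypothetical and chosen only for illustration; NumPy is assumed to be available.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y ~ intercept + X by least squares and return the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Hypothetical training set: 50 points from a known linear rule plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

coef_before = fit_ols(X, y)

# Incorporating one badly positioned point means refitting on the entire updated set.
X_new = np.vstack([X, [[5.0, 5.0]]])
y_new = np.append(y, -40.0)
coef_after = fit_ols(X_new, y_new)

print("before:", np.round(coef_before, 3))
print("after: ", np.round(coef_after, 3))
```

The fitted coefficients are the only thing the model retains, so there is no record of which training point moved them; undoing the outlier's influence means discarding the fit and refitting without it.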

Translating this computational behavior into the physiological realm reads the PDL model as a black box, nonspecific brain. Translating this computational behavior into the cognitive realm reads the PDL as forming stereotypes. There is no ability to explain exactly why and how it generates its output classifications beyond the mechanical equations. It cannot extract which case history or precedent supports its response. It is technically very noise resistant and robust such that physical damage to any of its storage partitions leads to only commensurate degeneration across the entire network. It is operationally very noise sensitive in that a single badly positioned data point could potentially corrupt the entire system in a noisy real world environment.

A Reversible Data Learning (RDL) model stores all training example data into reversible and dynamic categories and classes representative of neurological cells, with a specific example including the granule cells in the hippocampal complex. Members include k-NN [

The following equations characterize the core training and pattern storage mechanisms for ARTMAP and Echo ART network derivatives and k-NN, respectively.

Equations (4) and (5) refer to ARTMAP and Echo ART networks incorporating new training data by first checking all prior learning partitions, selecting the single optimal target, and modifying only that target partition as needed. Equation (6) refers to a simple k-NN network storing the pattern for later retrieval in a group of k matches. Especially when k is enlarged to include the entire set, this allows RDL models to be generalized global learners. There are several means to isolate a data partition, retrieve a specific pattern in memory, and encapsulate its effects on the whole. This allows the RDL models to be flexibly global or local learners. Carrying out any change in the knowledge base of incorporated training data points, including adding new data or unlearning past data, requires simply locating the optimal partition to update or extract.
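The reversibility described above can be sketched with a small k-NN pattern store. This is a minimal illustration under stated assumptions, not the paper's ARTMAP or Echo ART implementation; the class `KNNMemory`, its method names, and the sample points are hypothetical.

```python
from collections import Counter
import numpy as np

class KNNMemory:
    """Keeps every training pattern separately so it can be retrieved or removed."""

    def __init__(self, k=1):
        self.k = k
        self.patterns = []                         # list of (vector, label) pairs

    def learn(self, x, label):
        self.patterns.append((np.asarray(x, dtype=float), label))

    def unlearn(self, index):
        """Remove one stored pattern; all other patterns are left untouched."""
        return self.patterns.pop(index)

    def predict(self, x):
        """Return the majority label and the indices of the supporting patterns."""
        x = np.asarray(x, dtype=float)
        dists = [np.linalg.norm(x - p) for p, _ in self.patterns]
        nearest = [int(i) for i in np.argsort(dists)[: self.k]]
        votes = Counter(self.patterns[i][1] for i in nearest)
        return votes.most_common(1)[0][0], nearest

memory = KNNMemory(k=1)
memory.learn([0.0, 0.0], "down")
memory.learn([1.0, 1.0], "up")
memory.learn([0.1, 0.1], "up")                     # a badly positioned (noisy) pattern

label_noisy, support_noisy = memory.predict([0.08, 0.08])
memory.unlearn(support_noisy[0])                   # isolate and extract the offender
label_clean, support_clean = memory.predict([0.08, 0.08])
print(label_noisy, support_noisy, "->", label_clean, support_clean)
```

The returned indices name the stored precedent behind each decision, and removing one pattern requires no retraining of the rest.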

Translating this computational behavior into the physiological realm reads the RDL model as a white box, specialized cerebellar or hippocampal-like structure with specifically encoded neurons. Translating this computational behavior into the cognitive realm reads the RDL as forming episodic memories available for retrieval and narration. There is full ability to explain exactly which case history or precedent supports its output decision. It is technically not very noise resistant or robust on a local scale because physical damage to a partition destroys that particular memory pattern. It is operationally very noise resistant and robust on a global scale since all other functions are completely normal and unimpaired, while impaired functions can be detected and encapsulated.

The next section discusses how models grouped into PDL and RDL families operated and performed in the prior literature.

Hinton, Osindero, and Teh [

Versace, Bhatt, Hinds, and Schiffer [

Zhang, Jiang, and Li [

Kim [

Medeiros, Terasvirta, and Rech [

West, Dellana, and Qian [

Saad, Prokhorov, and Wunsch [

West [

Ng, Quek, and Jiang [

Wong and Versace [

The next section discusses a hands-on demonstration comparing a PDL and RDL model on real world data with respect to data mining complexity and bias.

The goal of this experiment is to empirically explore their operational characteristics in various real world settings with respect to varying levels of data mining complexity. The hypothesis is that PDL is more vulnerable to data mining bias with performance more correlated to data mining complexity than an RDL with 95% confidence. This would lead to a conclusion that the PDL/RDL division is qualitative and significant. This would also show that PDL models provide more incentive for users to concentrate on data mining performance gains while RDL models provide more incentive for exploring and understanding the model itself for similar gains. The null hypothesis is that there are insignificant correlation differences between the PDL and RDL model divisions on data mining complexity and the PDL/RDL division is thus not significant.

To represent the PDL family, this paper uses a typical multi-layer Perceptron [

The universe of data includes 30 members of the Dow Jones Industrial Average with weekly-adjusted closing prices (http://finance.yahoo.com) over 10 years (2001-2010). This encompasses 15,600 total data points. This data sample includes uptrending bull and downtrending bear states both within the index aggregate and variously within each index member. This paper uses the 10-period moving average crossover rule stock data subdivision protocol as used in [

To represent the data mining complexity, this paper designates five novel subdivisions in the data for exploration: No Mining, Basic Mining, Moderate Mining, Heavy Mining, and Extreme Mining. See

In the No Mining Case, the data is un-partitioned so as to remove all user preconceptions of the data. The models train on the first year (2001), then test on the subsequent year (2002). The models proceed to incorporate the just-tested year into their training (2001 + 2002) before testing on the subsequent year (2003). To maximally use the data, both models end with training on (2002-2010) and testing on (2001). PDL models need to self-destruct, reset, and restart training at each step. Across 30 different stocks, the PDL model generates 300 simulated runs encompassing 15,600 test points. RDL models need not restart until the final testing year (2001) reversal and thus generate 60 simulated runs incorporating 15,600 test points.
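The No Mining schedule described above can be sketched as an expanding-window split generator. The function name `no_mining_splits` and the constants are hypothetical; the sketch assumes 52 weekly data points per stock per year, consistent with the 15,600 total (30 stocks, 10 years).

```python
def no_mining_splits(first_year=2001, last_year=2010):
    """Return (training_years, test_year) pairs for the expanding-window schedule."""
    years = list(range(first_year, last_year + 1))
    # Train on everything seen so far, then test on the next year.
    splits = [(years[:i], years[i]) for i in range(1, len(years))]
    # Final reversal: train on all later years and test on the first year.
    splits.append((years[1:], years[0]))
    return splits

N_STOCKS = 30          # Dow Jones Industrial Average members
WEEKS_PER_YEAR = 52    # assumed weekly data points per stock per year

splits = no_mining_splits()
pdl_runs = N_STOCKS * len(splits)      # a PDL model is rebuilt for every split
test_points = N_STOCKS * len(splits) * WEEKS_PER_YEAR

for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")
print(pdl_runs, test_points)
```

Under these assumptions the ten splits per stock reproduce the 300 PDL runs and 15,600 test points stated above. The RDL count of 60 corresponds to two runs per stock, one incremental pass and one restart for the final reversal.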

In the Basic Mining case, the data is partitioned as per standard textbook guidelines [

In the Moderate Mining case, the data partitions into one-year forward sliding windows. This represents where users assume the data are slowly non-stationary such that adjacent years are sufficiently similar. All models train on one year and test on the subsequent year. To maximally use the data, all models end with training on (2002) and testing on (2001). Both PDL and RDL each generate another 300 simulated runs encompassing 15,600 test points.

In the Heavy Mining case, the data partitions presciently into similar predefined states. Allowing the users to peek into the future, the partitions pair downtrending state years (2001-2002 vs. 2007-2008), uptrending state years (2003-2004 vs. 2009-2010), and non-trending state years (2005 vs. 2006). In each case, the models train on the first subdivision and test on the other, and vice versa. Both PDL and RDL each generate 180 simulated runs encompassing 15,600 data points. Clearly, this is the most unrealistic back-tested case, where the training and testing are highly corrupted and the processing is focused on the user rather than the PDL or RDL model.

In the Extreme Mining case, the models purposely violate training and testing segregation by testing on prior trained data years. All models train on one year and test on the same year. This process repeats ten times, once for each year. Both PDL and RDL each generate another 300 simulated runs encompassing 15,600 test points.
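The Moderate, Heavy, and Extreme Mining partitions described above can be enumerated in the same way as a quick check on the stated run counts. This is a sketch with hypothetical function names; the Basic Mining partition is omitted because its details are not given here.

```python
YEARS = list(range(2001, 2011))
N_STOCKS = 30

def moderate_splits(years=YEARS):
    """One-year sliding window: train on a year, test on the next, plus the final reversal."""
    splits = [(years[i], years[i + 1]) for i in range(len(years) - 1)]
    splits.append((years[1], years[0]))            # train on 2002, test on 2001
    return splits

def heavy_splits():
    """Presciently paired market states, each pair used in both directions."""
    pairs = [
        ((2001, 2002), (2007, 2008)),              # downtrending
        ((2003, 2004), (2009, 2010)),              # uptrending
        ((2005,), (2006,)),                        # non-trending
    ]
    splits = []
    for first, second in pairs:
        splits.append((first, second))
        splits.append((second, first))
    return splits

def extreme_splits(years=YEARS):
    """Train and test on the same year, deliberately violating segregation."""
    return [(year, year) for year in years]

runs = {
    "moderate": N_STOCKS * len(moderate_splits()),
    "heavy": N_STOCKS * len(heavy_splits()),
    "extreme": N_STOCKS * len(extreme_splits()),
}
heavy_test_years = sum(len(test) for _, test in heavy_splits())
print(runs, heavy_test_years)
```

Per model family this gives 300, 180, and 300 runs, and the Heavy Mining test sets together still cover all ten years, matching the 15,600 data points per case.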

This paper calculates the aggregate correlation coefficients between PDL and RDL across all 2040 simulations over the varying levels of data mining complexity. This paper also compares aggregate annual rates of return between PDL and RDL for significant differences in keeping with prior literature.

The PDL model benefits most with increasing data mining complexity. Under the No Mining case (annual percent rate (APR) = 2.7%, 2-tailed p < 0.01 vs. RDL), the PDL model attempts to generalize on all available past data to predict a given year. Since all financial data are notoriously chaotic, heterogeneous, and non-stationary (e.g. [

Progression through Basic Mining (APR = 2.1%, p < 0.10), Moderate Mining (APR = 5.1%, p < 0.02), Heavy Mining (APR = 9.0%, p < 0.01), and Extreme Mining (APR = 12.2%, p < 0.01) helps greatly to clean, filter, and select the data such that it becomes stationary and noise-corrected. This allows the PDL model to attain monotonically higher performances. Under the Heavy Mining case, for example, the PDL model generated the highest average feasible APR at 9.0% net of trading costs. The Extreme Mining case was not feasible, but is included as a demonstration of the trend. This leads to a PDL correlation coefficient of 0.98 (p < 0.01 vs. RDL). This shows a strong incentive for the user to replicate and extend the Heavy Mining case in order to enhance PDL model performance. The user research and development focus may thus shift exogenous to the PDL model in an attempt to limit and sanitize the data.

The RDL model benefits are unclear with respect to the data mining complexity. The Sharpe ratio performance was neither monotonic nor increasing. The RDL model performed best under the No Mining condition (APR = 6.6%, 2-tailed p < 0.01 vs. PDL), where the user is barred from introducing any preconceived notions on the data and the RDL model can make full use of all historical data in an automatically separated episodic memory manner. Progressing to the Basic (APR = 2.3%, p < 0.10) and Moderate Mining (APR = 4.3%, p < 0.02) cases only served to unnecessarily restrict the RDL model’s experiences by imposing artificial multi-year lags and restricting data access to five- and one-year historical data periods, with greater lag and restriction resulting in worsening performance. The Heavy Mining case (APR = 5.7%, p < 0.01) performance approaches the initial No Mining performance due to its forced stationarity and larger two-year historical data periods. The Extreme Mining case (APR = 3.4%, p < 0.01) continued the decline in performance, similar to the Moderate Mining case; this unusual and unintuitive trend may be due to the fact that RDL models attempt to generate episodic memory categories rather than an optimal boundary. The one-year historical data window size in each Extreme Mining run may have severely restricted the RDL model’s ability to form sufficient separate episodic memory categories vis-à-vis the No Mining and Heavy Mining cases, where much larger windows were available. The artificially shorter window sizes may have had a larger negative impact than the artificially positive impact from forced stationary data periods. This is a related operating characteristic of RDL models.

The RDL correlation coefficient was −0.43 (p < 0.01), significantly different from that of the PDL. The two-tailed differences in average returns between PDL and RDL were also significant at the 95% level in all cases except the Basic Mining case. Partitioning the data into varying levels of data mining complexity clearly shows that separating the models into PDL and RDL families demonstrates significant empirical differences and is therefore an effective qualitative partition. That the PDL and RDL were not significantly different at 95% under the typically used Basic Mining case confirms that the two representatives and their operations were appropriately selected, and further explains why this result went unnoticed earlier in the absence of more extensive data mining complexity analysis.
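The direction of the two trends can be checked from the five aggregate APR figures reported above. This is only a rough sketch: it assumes an ordinal coding of the mining levels as 0 to 4, and because the reported coefficients (0.98 and −0.43) are calculated across all 2040 simulations, the values obtained from five aggregates will differ from them.

```python
import numpy as np

# Assumed ordinal coding: No, Basic, Moderate, Heavy, Extreme Mining -> 0..4
levels = np.arange(5)

# Aggregate annual percentage rates reported in the text, in the same order.
pdl_apr = np.array([2.7, 2.1, 5.1, 9.0, 12.2])
rdl_apr = np.array([6.6, 2.3, 4.3, 5.7, 3.4])

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

r_pdl = pearson(levels, pdl_apr)
r_rdl = pearson(levels, rdl_apr)
print(f"PDL r = {r_pdl:.2f}, RDL r = {r_rdl:.2f}")
```

Even at this coarse level the PDL returns rise strongly with mining complexity while the RDL returns do not, which is the qualitative contrast the full-simulation coefficients quantify.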

These results are also consistent with the theoretical analysis and prior literature review showing PDL models with relatively complex data and model setups (e.g. [

The resulting cognitive impacts of this among the users and researchers cannot be overstated. Neural network models rely on the users to provide the training data for exposure and produce results that subtly train the user as to the next steps in the user-network interaction. As the users train the network, so too might the network train the user.

A mass of prior research explores a variety of neural network and machine learning predictive models, all with varying relative performances where comparable. Prior attempts to quantitatively and qualitatively isolate and extract utility from among this pool have been wanting. This paper proposes dividing the model population into two families along a novel boundary line, Permanent Data Learning (PDL) and Reversible Data Learning (RDL), based on operational and theoretical experience with real world data environments. It further proposes demonstrating the differences in behavior along a novel dimension: increasing levels of data mining complexity towards the prescient, exogenous user. The results show that PDL and RDL models are qualitatively and quantitatively different when viewed through the lens of real world data environments along this dimension. The PDL-RDL family grouping is effective and immediately lends itself to qualitative and quantitative behavioral differences that can greatly help industrial users and academic researchers select appropriately for utility and modeling.

PDL models are highly responsive to more complex data mining and so can theoretically produce better statistical machine learning results, which can be misleading and less robust in dynamic, non-stationary, real world environments. Long-term PDL users may be increasingly and subtly trained to rely on more complex, exogenous data mining in pursuit of short-term, stationary results on known laboratory datasets, as indicated by strong theoretical incentives and by the literature review. RDL models are unresponsive to complex data mining and produce socially and cognitively plausible approaches towards the unknown in realistic decision making. Long-term RDL users may have an incentive to focus on data-neutral, model-centric approaches to research and development enhancements.