
Artificial intelligence research in the stock market sector has been heavily geared towards stock price prediction rather than stock price manipulation. As online trading systems have increased the volume of high-frequency, real-time data transactions, the stock market has become increasingly vulnerable to attacks. This paper aims to detect these attacks based on normal trade behavior using an Artificial Immune System (AIS) approach combined with one of four clustering algorithms. The AIS approach is chosen for its proven ability to handle time-series data and to detect abnormal behavior while being trained only on regular trade behavior. These two points are essential: the models need to adapt over time as normal trade behavior evolves, and, due to confidentiality and data restrictions, real-world manipulations are not available for training. This paper presents a competitive alternative to the leading approach and investigates the effects of combining AIS with four clustering algorithms: Kernel Density Estimation, Self-Organizing Maps, Density-Based Spatial Clustering of Applications with Noise, and Spectral Clustering. The best performing solution achieves leading performance on common evaluation metrics, including Area Under the Curve, False Alarm Rate, False Negative Rate, and computation time.

The financial stock market is vulnerable to manipulation attacks designed to artificially increase or decrease a stock's value. These manipulation attacks appear as anomalies in financial trade datasets since they do not follow traditional trading patterns. Detecting these anomalies involves several challenges. The first is that normal trade behavior can become anomalous over time, or vice versa, which means the model must evolve with the financial data to capture new trading trends effectively. The second is that some manipulations produce only minor changes as a tactic to stay under the radar, which emphasizes the importance of the model's generalization. In addition, security restrictions on financial time-series trade data make training data hard to obtain: there are few exposed real-world manipulation cases, and most data are only partially observable (missing information such as the buyer or seller), which can make validation difficult. Most artificial intelligence or machine learning applications in the financial stock market domain aim to predict the value of a stock to execute a buy or a sell [

Current approaches in detecting anomalies in stock market data that use supervised learning [

Previous research studies [

This paper is organized as follows: In Section 2, a background of stock market fraud and previous related work on fraud detection are given. In Section 3, the artificial immune system (AIS) approach and previous work that utilizes it are explained. In Section 4, clustering analysis and the four clustering approaches implemented in this paper, including KDE, SOM, SC, and DBSCAN, are introduced. In Section 5, the proposed algorithm is presented. In Section 6, the experimental work and results are discussed. Section 7 concludes the paper along with a discussion of future research directions.

Stock market fraud is a common attack with the primary goal of manipulating a stock’s price. Attackers use two main forms of manipulation: pump and dump, and spoof trading. The main goal of pump and dump trading is to increase a stock's value and then sell once the price has risen to obtain the maximum possible profit [

In

Stock market fraud detection is a difficult anomaly detection problem, and it has received far less attention in previous research than stock market prediction models. Supervised learning techniques have been implemented in [

The natural immune system (NIS) defends the body against its cells’ dysfunctions and actions from foreign cells. The authors of [

DCA [

· Pre-processing Phase: Categorizes the data into three signal types (PAMP, Safe, Danger);

· Detection Phase: Concentration metrics are calculated for each Dendritic Cell (DC);

· Context Assessment Phase: Compares the concentration to a set threshold;

· Classification Phase: Classifies the DCs as anomalous or not.
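As a minimal sketch of the pre-processing phase, the rule below maps a single trade feature (e.g., its relative price change) onto the three signal categories. The thresholds here are purely illustrative assumptions; the paper itself derives the categories via dimension reduction, as described later.

```python
def categorize(trade, safe_band=0.01, danger_band=0.05):
    """Map a raw per-trade feature (e.g. relative price change) to the
    three DCA signal categories. Thresholds are illustrative only.

    Returns (Cp, Cs, Cd): PAMP, Safe and Danger signal strengths.
    """
    change = abs(trade)
    if change <= safe_band:        # ordinary movement -> Safe signal
        return 0.0, change, 0.0
    if change <= danger_band:      # suspicious movement -> Danger signal
        return 0.0, 0.0, change
    return change, 0.0, 0.0        # extreme movement -> PAMP signal
```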

The DCA [

Clustering is the grouping of data instances based on a set metric and/or threshold that ensures similar instances are grouped to form a cluster. A clustering approach generates a set number of clusters, each representing a certain class of the data. Different clustering-based anomaly detection approaches are used in many applications [

KDE detects anomalies by comparing the density of a sample with its neighbors based on a kernel and set thresholds [

SOMs are referred to as “Kohonen Neural Networks”, a type of unsupervised learning based on competitive learning. They are typically used for classification or pattern recognition [
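To illustrate the competitive-learning idea behind SOMs, the following is a minimal sketch (an assumption-laden illustration, not the implementation evaluated in this paper): units compete for each sample, the best matching unit and its grid neighbors move toward it, and the final quantization error can serve as an anomaly score.

```python
import numpy as np

def train_som(data, grid=(4, 4), epochs=5, lr=0.5, seed=0):
    """Train a minimal Self-Organizing Map on `data` (n_samples, n_features).

    Returns the trained unit weights, shape (grid[0] * grid[1], n_features).
    """
    rng = np.random.default_rng(seed)
    n_units = grid[0] * grid[1]
    w = rng.normal(size=(n_units, data.shape[1]))
    # Fixed 2-D grid coordinates of the units, used for neighborhoods.
    coords = np.array([(i, j) for i in range(grid[0])
                       for j in range(grid[1])], dtype=float)
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        sigma = max(grid) * frac + 1e-3    # shrinking neighborhood radius
        eta = lr * frac                    # decaying learning rate
        for x in data:
            bmu = int(np.argmin(((w - x) ** 2).sum(axis=1)))  # best matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)    # grid distance^2
            h = np.exp(-d2 / (2 * sigma ** 2))                # neighborhood kernel
            w += eta * h[:, None] * (x - w)                   # pull toward sample
    return w

def som_anomaly_scores(data, w):
    """Distance from each sample to its best matching unit; a large
    quantization error suggests an anomaly."""
    return np.array([np.sqrt(((w - x) ** 2).sum(axis=1).min()) for x in data])
```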

Spectral clustering is a graph-based clustering approach commonly used for anomaly detection with image-based data [

DBSCAN performs clustering by separating high and low-density regions within a data distribution. The DBSCAN algorithm is robust to noise and is highly scalable [

The proposed hybrid model combines the capabilities of AIS and clustering analysis. First, it performs data preprocessing, followed by the DCA algorithm. Once the first two stages of the DCA are completed, a clustering algorithm Ai is performed on the output of the DCA to detect anomalies in the dataset.

The first phase of the DCA algorithm is used as a dimension reduction step that utilizes Principal Component Analysis (PCA) to reduce the original five dimensions into three. The three dimensions represent the three signal categories: Cp represents the PAMP category, Cs represents the Safe Signal category, and Cd represents the Danger category. Once the data are reduced and have a value for each category type, the algorithm moves to the detection phase to calculate the concentration of co-stimulation (C_{csm}), semi-mature (C_{smDC}), and mature (C_{mDC}) signals for each DC using Equation (1) below.

C_{[csm, smDC, mDC]} = ((W_{p} · C_{p}) + (W_{s} · C_{s}) + (W_{d} · C_{d})) / (W_{p} + W_{s} + W_{d}) (1)

where W_{p}, W_{s}, and W_{d} are the weights for the different categories of signal. The weights used in this paper are summarized in the table below. Once C_{csm}, C_{smDC}, and C_{mDC} are calculated, the new 3-dimensional dataset is passed to the clustering stage.

| W | CSM | Semi-mature | Mature |
|---|---|---|---|
| PAMP signals (P) | 2 | 1 | 2 |
| Danger signals (D) | 0 | 0 | 2 |
| Safe signals (S) | 2 | 1 | −3 |
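Using these weights, Equation (1) can be evaluated directly. The sketch below computes one dendritic cell's three concentrations from its PAMP, Danger, and Safe signal values; note that with these weights the mature-context denominator W_{p} + W_{s} + W_{d} is 2 + (−3) + 2 = 1.

```python
import numpy as np

# Signal weights from the table above; rows are the CSM, semi-mature and
# mature contexts, columns are the PAMP (P), Danger (D) and Safe (S) weights.
W = np.array([
    [2.0, 0.0, 2.0],    # CSM:         Wp, Wd, Ws
    [1.0, 0.0, 1.0],    # semi-mature: Wp, Wd, Ws
    [2.0, 2.0, -3.0],   # mature:      Wp, Wd, Ws
])

def concentrations(cp, cd, cs):
    """Equation (1): one dendritic cell's weighted signal concentrations.

    Returns the triple (C_csm, C_smDC, C_mDC) for PAMP value cp,
    Danger value cd and Safe value cs.
    """
    signals = np.array([cp, cd, cs])
    return (W @ signals) / W.sum(axis=1)   # divide by Wp + Wd + Ws per row
```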

In the second stage of the hybrid model, a clustering algorithm Ai is applied to the output of the DCA (i.e., the three-dimensional feature set, each dimension representing a concentration of signal types). We have used four different clustering algorithms: KDE, DBSCAN, Spectral clustering, and SOM. Each of these algorithms works on datasets of different configurations and shapes. We have modified each algorithm to tailor it to our detection problem as follows:

The KDE detector focuses on the mean density of the data distribution to detect anomalies. Clusters are created based on their difference from the mean density, which is recalculated each time a new cluster is created using Equation (2) and Equation (3), where n is the number of data rows, g is a tuned smoothing parameter, and F_{i} is the i-th data instance. λ is used as a threshold to determine the cluster size for anomaly classification. The KDE detector algorithm is shown in

Mean Density = (1 / (n·g)) Σ_{i=1}^{n} K((f − F_{i}) / g) (2)

K(r) = (1 / √(2π)) exp(−r² / 2) (3)
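Equations (2) and (3) translate directly into code; a small sketch of the density estimate at a query point f, given observed instances F and smoothing parameter g:

```python
import numpy as np

def gaussian_kernel(r):
    """Equation (3): the standard Gaussian kernel."""
    return np.exp(-r ** 2 / 2) / np.sqrt(2 * np.pi)

def mean_density(f, F, g):
    """Equation (2): kernel density estimate at point f, given the
    n observed instances F and smoothing parameter g."""
    F = np.asarray(F, dtype=float)
    return gaussian_kernel((f - F) / g).sum() / (len(F) * g)
```

A point near the bulk of the data receives a higher density than a point far from it, which is the basis for flagging low-density points as anomalies.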

The SOM for anomaly detection shown in

This method starts by finding a lower-dimensional representation of the data that allows for more effective clustering using the DCA algorithm. The anomaly detector is created using the label diffusion approach of [

The first step in

In

Finally,

In this section, we have applied our proposed hybrid algorithm on five different stock market datasets. We have also compared the performance of the hybrid model against the individual-based models using various sets of evaluation metrics as shown next.

Financial data for the week of February 5th, 2018 through February 9th, 2018 were collected for each of the five stocks: Amazon (AMZN), Apple (AAPL), Microsoft (MSFT), Intel Corp. (INTC), and Google (GOOGL) [

Since each stock has a different trading volume, the total size of each dataset differs over the week of trading. The total size of each dataset is shown in

| Feature | Description |
|---|---|
| Timestamp | Epoch Timestamp |
| Sequence Number | Sequence Number |
| Exchange Id | Exchange Type |
| Size | Number of Shares |
| Price | Share Price |
| Conditions | Conditions on Trade |

| Stock Ticker | Number of Data Rows |
|---|---|
| AMZN | 590,387 |
| AAPL | 1,057,525 |
| MSFT | 818,603 |
| INTC | 571,101 |
| GOOGL | 225,349 |

The final dataset comprises five features based on the price of the stock and the timestamp of each trade. The first feature, x, represents the share price after standard normalization is applied. The second feature, x_{2}, represents the share price after wavelet denoising is applied. Wavelet denoising removes noise components from the data, which is important when dealing with naturally high-frequency data such as stock market data. The wavelet denoising is applied to the stock price x, and the result is the second feature in the feature set [

This is an effective feature to include, as it is a good representation of an unusual increase or decrease in a stock's price. Equation (4) and Equation (5) below show how Wilson’s Amplitude is calculated.

s(t) = x(t) − x(t − 1) (4)

w(t) = { 3·s(t), if s(t) > p; s(t), if s(t) ≤ p } (5)

where x(t) is the price at time t and x(t − 1) is the previous price. Threshold p is calculated as the average value of s(t). The final two features are Δx and Δw, which are the rates of change in the stock price (x) and Wilson’s Amplitude (w) over time, respectively. In Equation (6), y can be replaced by x or w accordingly.

Δy(t) = ((y(t) − y(t − 1)) / y(t − 1)) · 100 (6)
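Equations (4)-(6) can be computed directly from a price series. The sketch below follows the definitions literally, taking the threshold p as the average value of s(t) as stated above:

```python
import numpy as np

def wilson_amplitude(x):
    """Wilson's Amplitude per Equations (4)-(5): amplify differences
    that exceed threshold p, the average value of s(t)."""
    x = np.asarray(x, dtype=float)
    s = np.diff(x)             # Equation (4): s(t) = x(t) - x(t-1)
    p = s.mean()               # threshold p: average value of s(t)
    return np.where(s > p, 3 * s, s)   # Equation (5)

def pct_change(y):
    """Equation (6): percentage rate of change of a series."""
    y = np.asarray(y, dtype=float)
    return (y[1:] - y[:-1]) / y[:-1] * 100
```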

The final step of preprocessing is to apply standard normalization to all feature values. The final set of features, F = {x, x_{2}, w, Δx, Δw}, is described in

In this subsection, we review the main evaluation measures used to assess the quality of the proposed hybrid algorithms as well as the individual algorithms, including F-Score, Accuracy, Sensitivity, Specificity, False Negative Rate (FNR), False Alarm Rate (FAR), and Area Under the Curve (AUC). The final metric is the computation time, i.e., the run time of each algorithm.

F-Score is the harmonic mean of precision and recall, defined below in Equation (9), where TP is the number of True Positives, FP the number of False Positives, FN the number of False Negatives, and TN the number of True Negatives.

Precision = TP / (TP + FP) (7)

Recall = TP / (TP + FN) (8)

F-Score = 2 · (Precision · Recall) / (Precision + Recall) (9)
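Equations (7)-(9) computed from raw confusion-matrix counts; a small sketch:

```python
def f_score(tp, fp, fn):
    """F-Score from confusion-matrix counts, per Equations (7)-(9)."""
    precision = tp / (tp + fp)    # Equation (7)
    recall = tp / (tp + fn)       # Equation (8)
    return 2 * precision * recall / (precision + recall)   # Equation (9)
```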

| Feature | Description |
|---|---|
| x | Share Price (Normalized) |
| x_{2} | Wavelet Denoised Share Price |
| w | Wilson’s Amplitude |
| Δx | Change in Stock Price Over Time |
| Δw | Change in Wilson’s Amplitude Over Time |

Accuracy is a measure of how often data are classified correctly.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (10)

Sensitivity is a measure of the proportion of True Positive cases that got properly classified as positive.

Sensitivity = TP / (TP + FN) (11)

Specificity is a measure of the proportion of True Negative cases that got properly classified as negative.

Specificity = TN / (TN + FP) (12)

False Negative Rate (FNR) is the proportion of falsely classified negatives over all expected positive cases.

FNR = FN / (FN + TP) (13)

False Alarm Rate (FAR) is the proportion of falsely classified positives over all expected negative cases.

FAR = FP / (FP + TN) (14)
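The rate metrics of Equations (10)-(14) follow from the same confusion-matrix counts; a small sketch:

```python
def rates(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, FNR and FAR per
    Equations (10)-(14), from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # Equation (10)
        "sensitivity": tp / (tp + fn),                 # Equation (11)
        "specificity": tn / (tn + fp),                 # Equation (12)
        "fnr": fn / (fn + tp),                         # Equation (13)
        "far": fp / (fp + tn),                         # Equation (14)
    }
```

By construction, sensitivity + FNR = 1 and specificity + FAR = 1.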

Area Under the Curve (AUC) is a measure of separability for classification and is used to determine how well a model can distinguish between classes. The closer the value is to 1, the better the model is at distinguishing separate classes. The value is determined by taking the area under the curve created when plotting FPR (x-axis) vs. TPR (y-axis).
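Given a set of (FPR, TPR) operating points, the AUC can be approximated with the trapezoid rule; a small sketch:

```python
import numpy as np

def auc_from_roc(fpr, tpr):
    """Area under the ROC curve by the trapezoid rule, with the
    points sorted along the FPR axis first."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    order = np.argsort(fpr)
    fpr, tpr = fpr[order], tpr[order]
    # Sum of trapezoid areas between consecutive ROC points.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
```

A perfect classifier's ROC passes through (0, 1) and gives an area of 1; the chance diagonal gives 0.5.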

The following section compares the hybrid DCA-KDE, DCA-SOM, DCA-DBSCAN, DCA-Spectral, DCA-CBLOF (K-Means), and DCA-CBLOF (BIRCH) algorithms along with their individual detection counterparts. In

obtained a drop in accuracy across all datasets, with some drops being minuscule, as seen on the Apple and INTC datasets. DCA-Spectral has the worst performance, with a large drop in accuracy. Note that accuracy can be close to one hundred percent due to the limited number of anomalies within a large dataset. Similarly, in

In

In

| | Amazon | Apple | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| KDE | 0 | 0.5 | 0.54 | 0.8 | 0.62 |
| DCA-KDE | 0.01 | 0.03 | 0.03 | 0 | 0 |
| SOM | 0.042 | 0.004 | 0.048 | 0.001 | 0.003 |
| DCA-SOM | 0 | 0 | 0.05 | 0 | 0 |
| DBSCAN | 0.12 | 0.109 | 0 | 0 | 0.03 |
| DCA-DBSCAN | 0 | 0.05 | 0 | 0 | 0.3 |
| Spectral | 0.84 | 0.94 | 0.86 | 0.88 | 0.88 |
| DCA-Spectral | 0 | 0 | 0.04 | 0 | 0 |
| CBLOF (K-Means) | 0 | 0 | 0 | 0.03 | 0 |
| DCA-CBLOF (K-Means) | 0 | 0 | 0.02 | 0 | 0 |
| CBLOF (BIRCH) | 0 | 0 | 0.26 | 0 | 0 |
| DCA-CBLOF (BIRCH) | 0 | 0 | 0.16 | 0 | 0.01 |

| | Amazon | Apple | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| KDE | 0.0891 | 0.0283 | 0.2955 | 0.1796 | 0.0828 |
| DCA-KDE | 0.0013 | 0.0005 | 0.0011 | 0.0004 | 0.0007 |
| SOM | 0.0008 | 0.0005 | 0.0015 | 0.0009 | 0.0007 |
| DCA-SOM | 0.0011 | 0.0005 | 0.0012 | 0.0008 | 0.0174 |
| DBSCAN | 0.0007 | 0.0004 | 0.0012 | 0.0006 | 0.0008 |
| DCA-DBSCAN | 0.1879 | 0.0072 | 0.0386 | 0.0167 | 0.0384 |
| Spectral | 0.0917 | 0.0904 | 0.1296 | 0.1544 | 0.0882 |
| DCA-Spectral | 0.9927 | 0.9621 | 0.9589 | 0.9367 | 0.9664 |
| CBLOF (K-Means) | 0.0959 | 0.0191 | 0.0746 | 0.0006 | 0.0005 |
| DCA-CBLOF (K-Means) | 0.0055 | 0.0011 | 0.0012 | 0.0007 | 0.0004 |
| CBLOF (BIRCH) | 0.0554 | 0.0329 | 0.0011 | 0.0007 | 0.0005 |
| DCA-CBLOF (BIRCH) | 0.0014 | 0.0009 | 0.0011 | 0.0007 | 0.0004 |

| | Amazon | Apple | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| KDE | 268 | 450 | 120 | 321 | 750 |
| DCA-KDE | 17 | 100 | 15 | 25 | 24 |
| SOM | 13 | 22 | 6 | 14 | 18 |
| DCA-SOM | 12 | 20 | 5 | 12 | 17 |
| DBSCAN | 21,108 | 62,246 | 3042 | 20,737 | 36,814 |
| DCA-DBSCAN | 3131 | 10,125 | 453 | 2850 | 5678 |
| Spectral | 8748 | 8408 | 18,770 | 8493 | 8499 |
| DCA-Spectral | 8432 | 8326 | 8322 | 4980 | 8328 |
| CBLOF (K-Means) | 28 | 41 | 7 | 20 | 30 |
| DCA-CBLOF (K-Means) | 26 | 46 | 7 | 22 | 35 |
| CBLOF (BIRCH) | 20 | 42 | 10 | 16 | 36 |
| DCA-CBLOF (BIRCH) | 22 | 46 | 10 | 15 | 40 |

Tables 8-11 show the percentage improvement from using the hybrid models over the individual-based algorithms, measured by F-Score, accuracy, sensitivity, and specificity, respectively. We can observe that DCA-KDE, DCA-CBLOF (K-Means), and DCA-CBLOF (BIRCH) have the greatest F-Score improvement, as shown in

| | Amazon | AAPL | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| DCA-KDE vs. KDE | 98.18 | 98.78 | 99.65 | 99.91 | 99.57 |
| DCA-SOM vs. SOM | 8.775 | 3.548 | 7.212 | 2.461 | 99.99 |
| DCA-DBSCAN vs. DBSCAN | −99.99 | −99.99 | −99.99 | −99.99 | −99.99 |
| DCA-Spectral vs. Spectral | −99.99 | 38.15 | −5.557 | 28.44 | −26.96 |
| CBLOF (K-Means) vs. DCA-CBLOF (K-Means) | 92.94 | 90.80 | 93.76 | 0.287 | 8.039 |
| CBLOF (BIRCH) vs. DCA-CBLOF (BIRCH) | 95.13 | 95.05 | 7.764 | 2.623 | 10.21 |

| | Amazon | AAPL | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| DCA-KDE vs. KDE | 8.797 | 2.788 | 29.49 | 17.94 | 8.219 |
| DCA-SOM vs. SOM | 0.024 | 0.0006 | 0.0363 | 0.0055 | 1.696 |
| DCA-DBSCAN vs. DBSCAN | −23.04 | −0.6823 | −3.894 | −1.642 | −3.915 |
| DCA-Spectral vs. Spectral | −99.99 | −99.99 | −99.99 | −99.99 | −99.99 |
| CBLOF (K-Means) vs. DCA-CBLOF (K-Means) | 9.091 | 1.806 | 7.326 | −0.001 | 0.012 |
| CBLOF (BIRCH) vs. DCA-CBLOF (BIRCH) | 5.406 | 3.194 | 0.016 | 0.006 | 0.016 |

| | Amazon | AAPL | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| DCA-KDE vs. KDE | −1 | 48.31 | 52.58 | 80.00 | 62.00 |
| DCA-SOM vs. SOM | 4.200 | 0.400 | −0.2105 | 0.100 | 0.301 |
| DCA-DBSCAN vs. DBSCAN | 12.00 | 5.747 | 0 | 0 | −38.57 |
| DCA-Spectral vs. Spectral | 84.00 | 94.00 | 85.42 | 88.00 | 88.00 |
| CBLOF (K-Means) vs. DCA-CBLOF (K-Means) | 0 | 0 | −2.041 | 3.000 | 0 |
| CBLOF (BIRCH) vs. DCA-CBLOF (BIRCH) | 0 | 0 | 11.90 | 0 | −1 |

| | Amazon | AAPL | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| DCA-KDE vs. KDE | 8.799 | 2.784 | 29.48 | 17.73 | 8.213 |
| DCA-SOM vs. SOM | −0.0265 | −0.0008 | 0.0367 | 0.0054 | −1.697 |
| DCA-DBSCAN vs. DBSCAN | −23.05 | −0.683 | −3.896 | −1.642 | −3.912 |
| DCA-Spectral vs. Spectral | −99.99 | −99.99 | −99.99 | −99.99 | −99.99 |
| CBLOF (K-Means) vs. DCA-CBLOF (K-Means) | 9.098 | 1.807 | 7.343 | −0.004 | 0.0122 |
| CBLOF (BIRCH) vs. DCA-CBLOF (BIRCH) | 5.410 | 3.195 | −0.002 | 0.006 | 0.016 |

| | Amazon | AAPL | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| DCA-KDE vs. KDE | 3.916 | 25.18 | 40.86 | 48.97 | 35.12 |
| DCA-SOM vs. SOM | 0.0017 | 0.0003 | 0.002 | 0.0015 | 0.0020 |
| DCA-DBSCAN vs. DBSCAN | −1.463 | −99.99 | 96.98 | 96.67 | 86.05 |
| DCA-Spectral vs. Spectral | 35.82 | 31.71 | 12.81 | 39.24 | 19.14 |
| CBLOF (K-Means) vs. DCA-CBLOF (K-Means) | 0.0003 | 0.0002 | 0.002 | 0.011 | 0.0014 |
| CBLOF (BIRCH) vs. DCA-CBLOF (BIRCH) | 0.0001 | 0.0001 | 0.0018 | −0.0001 | 0.0011 |

| | Amazon | AAPL | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| DCA-KDE vs. KDE | 1 | −99.99 | −99.99 | −1 | −1 |
| DCA-SOM vs. SOM | −100 | −100 | 4.000 | −100 | −100 |
| DCA-DBSCAN vs. DBSCAN | −100 | −100 | 0 | 0 | 90 |
| DCA-Spectral vs. Spectral | −100 | −100 | −99.99 | −100 | −100 |
| CBLOF (K-Means) vs. DCA-CBLOF (K-Means) | 0 | 0 | 100 | −100 | 0 |
| CBLOF (BIRCH) vs. DCA-CBLOF (BIRCH) | 0 | 0 | −62.50 | 0 | 100 |

shows the positive results of the hybrid model in terms of FNR. Almost all approaches across all datasets showed a strong result, with very few false negatives remaining after combining with AIS.

| | Amazon | AAPL | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| DCA-KDE vs. KDE | −99.99 | −99.99 | −99.99 | −99.99 | −99.99 |
| DCA-SOM vs. SOM | 23.99 | 1.495 | −31.69 | −6.754 | 95.76 |
| DCA-DBSCAN vs. DBSCAN | 99.61 | 94.76 | 96.98 | 96.67 | 97.95 |
| DCA-Spectral vs. Spectral | 90.76 | 90.60 | 86.49 | 83.51 | 90.87 |
| CBLOF (K-Means) vs. DCA-CBLOF (K-Means) | −99.99 | −99.99 | −99.99 | 5.319 | −29.07 |
| CBLOF (BIRCH) vs. DCA-CBLOF (BIRCH) | −99.99 | −99.99 | 1.639 | −8.247 | −45.21 |

| | Amazon | AAPL | GOOGL | INTC | MSFT |
|---|---|---|---|---|---|
| DCA-KDE vs. KDE | −93.66 | −77.78 | −87.50 | −92.21 | −96.80 |
| DCA-SOM vs. SOM | −7.69 | −9.09 | −16.67 | −14.29 | −5.560 |
| DCA-DBSCAN vs. DBSCAN | −85.17 | −83.73 | −85.11 | −86.26 | −84.58 |
| DCA-Spectral vs. Spectral | −3.610 | −0.980 | −55.66 | −41.36 | −2.01 |
| CBLOF (K-Means) vs. DCA-CBLOF (K-Means) | −7.143 | 12.19 | 0 | 10 | 16.67 |
| CBLOF (BIRCH) vs. DCA-CBLOF (BIRCH) | 10 | 9.524 | 0 | −6.25 | 11.11 |

The effect of combining clustering algorithms with the DCA to create a hybrid model differed between clustering approaches. DCA-KDE was the biggest beneficiary of the hybrid combination: KDE as an individual solution did not perform well, but once combined with the DCA it demonstrated positive results. The comparable solutions were DCA-SOM, DCA-CBLOF (K-Means), and DCA-CBLOF (BIRCH). We observed that SOMs are an effective method for modeling financial time-series data and can be improved even further when combined with the DCA. This is an encouraging result, as we believe SOMs have not previously been applied to anomaly detection in financial data and may serve as a reliable tool for it. DCA-CBLOF (K-Means) showed a significant improvement in F-Score and FAR and was also one of the best algorithms in terms of AUC. DCA-CBLOF (BIRCH) performed similarly, achieving the largest improvement in F-Score and FAR, and showed further minor improvements across certain datasets. DCA-DBSCAN shows that not all clustering algorithms gain a clear improvement across all metrics; DBSCAN also had the most inconsistency across datasets, which suggests it may not handle financial time-series data as well as the other approaches. DBSCAN does, however, illustrate the computational advantage of the hybrid model, as it saw a significant decrease in computation time. Spectral clustering saw the greatest benefit from the hybrid combination in terms of AUC and FNR: the FNR dropped significantly across all datasets, but at the cost of a higher FAR. In this paper, we have introduced the adoption of KDE, SOM, K-Means, and BIRCH not only as clustering approaches but as detectors that achieve comparable results, and even exceed KDE in AUC and FAR, when combined with the DCA.
These hybrid combinations offered advantages on the compared metrics without a large penalty in computation time and proved to be competitive with DCA-KDE. In summary, this paper has investigated the hybrid DCA-Ai model with multiple standard clustering approaches. It found that DCA-SOM, DCA-CBLOF (K-Means), and DCA-CBLOF (BIRCH) can be effective tools for anomaly detection in financial stock market data and are competitive with the leading KDE approach that inspired this paper. We believe the application of SOMs, CBLOF (K-Means), and CBLOF (BIRCH) to anomaly detection has not been heavily researched in this domain until now, and the results show promise and opportunity for doing so. Possible future directions include expanding the datasets and analyzing the commonalities between the clustering algorithms that benefit from combination with the DCA versus those that are negatively affected. We will also investigate combining new clustering methods with the DCA to find further competitive hybrid models for anomaly detection. A more in-depth analysis of which of the two types of manipulation attack is easier to detect, and how each algorithm performs on each separately, could also yield interesting results. These future works would help us determine which combination is the strongest for anomaly detection, since it can be difficult to distinguish the best among the top-performing models.

The authors declare no conflicts of interest regarding the publication of this paper.

Close, L. and Kashef, R. (2020) Combining Artificial Immune System and Clustering Analysis: A Stock Market Anomaly Detection Model. Journal of Intelligent Learning Systems and Applications, 12, 83-108. https://doi.org/10.4236/jilsa.2020.124005