Knowledge Discovering in Corporate Securities Fraud by Using Grammar Based Genetic Programming

Securities fraud is a common problem worldwide and causes serious harm to securities markets every year. Securities regulatory commissions in many countries attach great importance to detecting and preventing it. In China, securities fraud is also increasing as the securities market expands rapidly. Data mining techniques could help the China Securities Regulatory Commission (CSRC) detect securities fraud. In this paper, we investigate the usefulness of the logistic regression model, Neural Networks (NNs), Sequential Minimal Optimization (SMO), Radial Basis Function (RBF) networks, Bayesian networks and Grammar Based Genetic Programming (GBGP) for classifying a large, recent, real-world China Corporate Securities Fraud (CCSF) database. The six techniques are compared in terms of their performance, and we find that GBGP outperforms the others. This paper describes in detail how GBGP is applied to the CCSF problem. In addition, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to generate synthetic minority-class examples for the imbalanced CCSF dataset.


Introduction
In the US, financial analysts have been shown to contribute to corporate fraud detection. Effective external monitoring can increase investors' confidence, which is crucial to the functioning of any capital market [1]. It is also important for China's securities market, because corporate fraud can impede China's economic development and has serious consequences for stakeholders, employees and society [1]. In recent years, corporate securities fraud detection has become an active research area in finance, and a wave of papers has studied effective policies for detecting and reducing fraud.
In China, the China Securities Regulatory Commission (CSRC) is the main regulator of the securities markets. It investigates potential violations of securities regulations and takes enforcement actions against corporations that have violated the relevant laws. Any enforcement action by the CSRC affects the stock price of the firm and can even result in bankruptcy [2]. Prior studies on the causes of securities fraud focused on different types of determinants, such as agency problems, business pressures and corporate governance [3,4]. For this study, a large dataset on China's listed companies was collected on the basis of these determinants, in order to find relationships that indicate whether a company is fraudulent or non-fraudulent. In this paper, we aim to evaluate several data mining techniques on this large and recent China Corporate Securities Fraud (CCSF) dataset. We also highlight the advantages of using SMOTE to handle the imbalanced data.
The main objective of this study is to help identify the company factors that are relevant to assessing the likelihood of fraud, by applying different statistical and Artificial Intelligence (AI) data mining techniques. AI data mining techniques have the theoretical advantage that they do not impose arbitrary assumptions on the input variables [5]. The models are built from the data itself and applied to that data. In this study, six data mining techniques are tested for their applicability to corporate securities fraud detection: the logistic regression model, Neural Networks (NNs), Sequential Minimal Optimization (SMO), Radial Basis Function (RBF) networks, Bayesian networks and Grammar Based Genetic Programming (GBGP). The six techniques are compared in terms of their classification accuracy, and we find that GBGP outperforms the others. The use of GBGP is therefore discussed in detail in this study.
The rest of the paper is organized as follows. Section 2 reviews the background and previous work. Section 3 describes the GBGP approach. Section 4 presents the experimental results and evaluation. Section 5 gives the conclusions and discusses future work.

Background and Previous Work
The China Securities Regulatory Commission (CSRC) has powers and operations similar to those of the Securities and Exchange Commission (SEC) in the U.S. It investigates listed corporations and takes enforcement actions against them if securities fraud is detected and proved. [6] examined these enforcement actions to determine whether the ownership and governance structures of corporations affect the likelihood of committing fraud. The authors concluded that the proportion of outside directors, the tenure of the chairman and the number of board meetings are related to fraud. [7] investigated enforcement actions from the viewpoint of the fraudulent firms rather than the factors that lead to fraud. They found that many of these firms have problems with published financial statements and irregular reports, such as inflated profit, false statements and major failures to disclose information, which are the common problems identified by the CSRC.
Considering federal securities laws, [8] examined four attributes that might be associated with fraud: the number of defrauded investors, asset size, losses and the financial distress of the firm. The authors concluded that only financial distress has a significant impact on the presence or absence of an enforcement action. In general, since the outcome of an enforcement action is binary (yes or no, i.e. 1 or 0), it is more reasonable to use a bivariate probit model as the learning method to analyse the data.
Conventional analysis methods may fail to discover many potential relationships. Many researchers have therefore studied concept learning from data using genetic algorithms in classification problems. In [9], the authors evaluated Genetic Programming (GP) on classification problems and found that longer training time yields a more accurate model. In addition, different runs may generate different novel models that are still able to solve the same problem. [10] developed a rule learning system that demonstrated power and flexibility for knowledge discovery in real-life medical problems. Moreover, the authors applied token competition to learn multiple rules and reduce training time.
Besides evolutionary techniques, other data mining techniques are also widely used in classification problems. [5] evaluated the effectiveness of Decision Trees, Neural Networks and Bayesian Belief Networks in detecting and identifying the factors associated with fraudulent financial statements (FFS). In terms of accuracy, the Bayesian Belief Network model outperformed the others. [11] developed a fuzzy neural network (FNN) for corporate fraud detection and compared its performance with that of traditional neural networks and logistic regression.

Research Methodology
Identifying corporate securities fraud can be regarded as a typical classification problem. Six methods are employed in this study: logistic regression, Neural Networks (NNs), Sequential Minimal Optimization (SMO), Radial Basis Function (RBF) networks, Bayesian networks and Grammar Based Genetic Programming (GBGP). Among these methods, GBGP is introduced in detail.

Introduction to Grammar Based Genetic Programming (GBGP)
Unlike traditional GP [14], GBGP [12,13] employs a grammar to control the structure of individuals during the evolutionary process. GBGP supports logic grammars, context-free grammars (CFGs) and context-sensitive grammars (CSGs) [15] to generate tree-based programs. A suitable grammar is designed for each particular problem. In this study, the grammar designed for rule learning on the CCSF dataset is shown in Table 1. To give a better understanding of the designed grammar, the simple example in Table 2 illustrates the idea of using grammars. GBGP ensures that the structures of evolved rules remain valid during evolution. Table 2 is an example of a context-free grammar. Expression is the start symbol. The items with capital letters are the non-terminal symbols, and the others are the terminal symbols. Each statement is a rule of the form α → β, which shows how a non-terminal symbol is expanded into other non-terminal or terminal symbols. An individual in GBGP is represented as a tree. The root node of an individual is the start symbol of the grammar. Figure 1 shows an example of an individual in GBGP, generated using the grammar in Table 2.
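The idea of growing an individual from a grammar can be sketched in a few lines of Python. Table 2 is not reproduced in this text, so the grammar below is a hypothetical stand-in chosen only so that it can produce a rule like the one in Figure 1; the names GRAMMAR, derive and leaves are illustrative assumptions, not part of the GBGP system.

```python
import random

# Hypothetical toy grammar (NOT the paper's Table 2).
# Keys are non-terminals (capitalized); each maps to a list of alternative
# right-hand sides. Symbols that are not keys are terminals.
GRAMMAR = {
    "Expression": [["if", "Condition", "then", "Result", "else", "Result"]],
    "Condition": [["Term", "Op", "Value"]],
    "Term": [["term1"], ["term2"]],
    "Op": [[">"], ["<"]],
    "Value": [["2"], ["4"], ["6"]],
    "Result": [["yes"], ["no"]],
}


def derive(symbol, rng):
    """Expand `symbol` into a derivation tree of (symbol, children) tuples."""
    if symbol not in GRAMMAR:              # terminal symbol: leaf node
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return (symbol, [derive(s, rng) for s in production])


def leaves(tree):
    """Read the terminal symbols off a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [leaf for child in children for leaf in leaves(child)]


rng = random.Random(0)
individual = derive("Expression", rng)     # root is the start symbol
print(" ".join(leaves(individual)))
```

Because every expansion follows a production of the grammar, any tree built this way is structurally valid by construction, which is the property the grammar is meant to guarantee.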

System Flows
Figure 2 shows the standard flowchart of the GBGP algorithm [10]. First, the system loads the grammar and creates an initial population of user-defined size. The initial population is generated randomly according to the grammar. Each individual represents one rule, which is evaluated by a fitness function to calculate a score (accuracy). The fitness function is described in Section 3.5. Second, token competition is applied to retain good rules and maintain the diversity of the population. Token competition is described in Section 3.6. Third, if the stopping criterion has not been reached, new individuals are evolved by the crossover and mutation operators, which are described in Section 3.4. Once the stopping criterion is reached, the evolved rules are evaluated on the testing instances.
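The flow described above can be outlined as a generic generational loop. The following is only a sketch under stated assumptions: the function evolve and its callable parameters are hypothetical placeholders, survivors are chosen by simple truncation rather than the ranking selection used in the paper, and the toy problem (integers approaching 42) exists only so that the loop can run.

```python
import random


def evolve(init, fitness, vary, compete, pop_size, generations, rng):
    """Generic generational loop; every callable is a caller-supplied placeholder."""
    population = [init(rng) for _ in range(pop_size)]          # random initial rules
    for _ in range(generations):                               # stopping criterion
        scored = [(fitness(ind), ind) for ind in population]   # fitness evaluation
        scored = compete(scored)                               # token competition hook
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Assumption: keep the better half (truncation), not ranking selection.
        survivors = [ind for _, ind in scored[: max(2, pop_size // 2)]]
        children = [
            vary(rng.choice(survivors), rng.choice(survivors), rng)  # crossover/mutation
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)


# Toy stand-ins so the loop can run: individuals are integers, the target is 42.
def toy_init(rng):
    return rng.randint(0, 100)


def toy_fitness(x):
    return -abs(x - 42)


def toy_vary(a, b, rng):
    return (a + b) // 2 + rng.choice([-1, 0, 1])


def toy_compete(scored):
    return scored  # identity: no token competition in the toy problem


best = evolve(toy_init, toy_fitness, toy_vary, toy_compete,
              pop_size=20, generations=30, rng=random.Random(1))
print(best, toy_fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next in this sketch.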

Grammar
The general format of a rule is "IF conditions, THEN results". The conditions part involves a set of descriptors, which are divided into three major groups and shown in Table 1. The first group covers the basic characteristics of the firm: location: where the firm is located; industry: the industry to which the firm belongs; market: whether the firm is listed on the Shanghai or the Shenzhen stock market; ABshare: whether the firm is listed as an A share or a B share. The second group covers the financial characteristics: assets: the current assets of the company; shortterm: short-term solvency, measured as the ratio of working capital to total assets; operating: the fixed asset turnover capacity; longterm: the ratio of liabilities to total assets of the firm; earning: the return on assets; risklevel: the total leverage of the firm; roe: return on equity, representing earnings per share; hdividend: whether the firm distributes a dividend (yes or no); dividend: how much dividend the firm distributes. The third group covers the governance features: nos: the number of shareholders; noe: the number of employees in the firm; chairCEO: whether the chairman of the board and the CEO are the same person; nom: the number of board meetings per year. The descriptors are selected on the basis of the previous work discussed in Section 2. The results part has only one descriptor, which indicates whether the firm is fraudulent or not.
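To make the rule format concrete, the following minimal sketch encodes one "IF conditions, THEN results" rule and applies it to two records. The encoding (a dictionary of descriptor tests), the thresholds, the default class "no" when a rule does not fire, and the treatment of a missing value as failing a concrete test are all assumptions made for illustration; only the descriptor names and the "any" wildcard come from the text.

```python
import operator

ANY = "any"   # wildcard: the descriptor is not considered by the rule
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def matches(conditions, record):
    """True if the record satisfies every condition that is not 'any'."""
    for attribute, condition in conditions.items():
        if condition == ANY:
            continue
        op, target = condition
        value = record.get(attribute)
        # Assumption: a missing value cannot satisfy a concrete test.
        if value is None or not OPS[op](value, target):
            return False
    return True


def classify(rule, record, default="no"):
    """Return the rule's result if it fires, otherwise the assumed default class."""
    return rule["then"] if matches(rule["if"], record) else default


# Hypothetical rule with made-up thresholds, for illustration only.
rule = {
    "if": {"market": ("=", "Shanghai"), "longterm": (">", 0.7), "chairCEO": ANY},
    "then": "yes",
}
firm_a = {"market": "Shanghai", "longterm": 0.82, "chairCEO": "yes"}
firm_b = {"market": "Shenzhen", "longterm": 0.82, "chairCEO": "no"}
print(classify(rule, firm_a), classify(rule, firm_b))  # yes no
```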

Genetic Operators
After the individuals are initialized to form the first-generation population, parent individuals are selected by ranking selection according to their fitness values [16]. The crossover and mutation operators produce new individuals from the selected parents. Crossover swaps subtrees between two different parents. Mutation alters a non-terminal: it either changes the value of the mutated variable randomly, subject to the designed grammar, or turns the attribute into "any" if the attribute is not to be considered in the rule [10,13].
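The three operators can be illustrated on a simplified, flat rule encoding. This is a sketch under assumptions, not the tree-based implementation of GBGP: rules are dictionaries rather than derivation trees, so swapping one condition stands in for swapping a subtree, the value domains in DOMAINS are hypothetical, and linear rank weights are one common form of ranking selection.

```python
import random

ANY = "any"
# Hypothetical value domains; in GBGP they would come from the grammar in Table 1.
DOMAINS = {
    "market": ["Shanghai", "Shenzhen"],
    "chairCEO": ["yes", "no"],
    "hdividend": ["yes", "no"],
}


def rank_select(population, fitnesses, rng):
    """Linear ranking: the worst individual gets weight 1, the best gets weight n."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    weights = [0] * len(population)
    for rank, index in enumerate(order, start=1):
        weights[index] = rank
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(parent_a, parent_b, rng):
    """Swap the condition on one randomly chosen attribute between parent copies."""
    child_a, child_b = dict(parent_a), dict(parent_b)
    attribute = rng.choice(sorted(DOMAINS))
    child_a[attribute], child_b[attribute] = parent_b[attribute], parent_a[attribute]
    return child_a, child_b


def mutate(rule, rng):
    """Replace one condition with a random legal value, or with 'any'."""
    child = dict(rule)
    attribute = rng.choice(sorted(DOMAINS))
    child[attribute] = rng.choice(DOMAINS[attribute] + [ANY])
    return child


rng = random.Random(0)
parent_a = {"market": "Shanghai", "chairCEO": "yes", "hdividend": ANY}
parent_b = {"market": "Shenzhen", "chairCEO": ANY, "hdividend": "no"}
child_a, child_b = crossover(parent_a, parent_b, rng)
print(child_a, child_b, mutate(parent_a, rng))
```

Because both operators only draw values from the declared domains (or "any"), every offspring remains a legal rule, mirroring the role the grammar plays in the tree-based setting.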

Fitness Evaluation
The fitness function measures the overall classification accuracy, that is, the percentage of training examples from both classes that are classified correctly. The possible outcomes of binary classification are shown in Table 3; in this fraud detection problem, the minority class (fraudulent examples) is the positive class. The overall accuracy is defined by Equation (1).
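Written out from the definition above, and using TP, TN, FP and FN for the counts of true positives, true negatives, false positives and false negatives (the usual confusion-matrix notation, assumed here to correspond to the four outcomes of Table 3), the overall accuracy would take the standard form:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```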

Token Competition
In addition to obtaining highly accurate individuals (classifiers), the token competition technique [17] is employed to maintain the diversity of the population. In token competition, each instance in the training sample is a token. If an individual (rule) classifies an instance correctly, it can obtain that token, in competition with the other rules that also classify the same instance correctly. Rules with few or no tokens are removed to make room for good, strong rules to enter the population. The evolved population will therefore eventually consist of a set of strong individuals [17]. Applying token competition to an individual simply multiplies its original fitness value by the ratio t, where t is the number of tokens that the rule has obtained divided by the ideal number of tokens (i.e. the number of training examples), as shown in Equation (2).
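A minimal sketch of this adjustment is given below. It assumes that rules pick tokens in order of decreasing raw fitness, so that a stronger rule seizes a contested token first; the function name token_competition and the input layout (one set of correctly classified example indices per rule) are hypothetical.

```python
def token_competition(raw_fitness, covers_correctly, n_examples):
    """Return the token-adjusted fitness of each rule.

    raw_fitness[r]      : original fitness (accuracy) of rule r
    covers_correctly[r] : set of training-example indices rule r classifies correctly
    n_examples          : ideal number of tokens (number of training examples)
    """
    # Assumption: stronger rules (higher raw fitness) seize tokens first.
    order = sorted(range(len(raw_fitness)), key=lambda r: raw_fitness[r], reverse=True)
    taken = set()
    adjusted = [0.0] * len(raw_fitness)
    for r in order:
        won = covers_correctly[r] - taken      # tokens not yet seized by a stronger rule
        taken |= won
        t = len(won) / n_examples              # ratio t from the text
        adjusted[r] = raw_fitness[r] * t       # modified fitness = raw fitness * t
    return adjusted


# Three rules, four training examples (made-up numbers).
raw = [0.75, 0.5, 0.25]
covers = [{0, 1, 2}, {0, 1}, {3}]
print(token_competition(raw, covers, n_examples=4))  # [0.5625, 0.0, 0.0625]
```

In the example, the second rule is redundant: every instance it classifies correctly has already been claimed by the first rule, so its adjusted fitness drops to zero, whereas the weaker third rule survives because it covers an instance that no other rule handles.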

Data Description
The original China Corporate Securities Fraud (CCSF) database contains records of corporations with their firm, financial, governance and trade characteristics. The variables are selected on the basis of the relevant literature discussed in Section 2. Moreover, including more attributes may provide more interesting information about the fraudulent firms for the system to learn. The original database has 21,396 instances with 25 attributes for all listed firms from 1998 to 2011. Each instance with more than 20 missing values in these 25 attributes is removed directly. In addition, 7 attributes on trade characteristics are removed because more than two thirds of the firms have no such trade data. The final dataset has 18,373 records with 18 attributes. The remaining missing values can be represented by "any" in the grammar. Table 4 shows the variables used in the GBGP rule learning system, with brief definitions.

Data Preprocessing
The original CCSF database is highly imbalanced, with 5.8% fraudulent and 94.2% non-fraudulent examples. If the imbalance is not taken into account, classifiers tend to be biased toward the majority class. Such classifiers are not useful, because their performance on the class of interest can be very low [18]. A number of approaches have been introduced to address the imbalance problem. One of the most popular is to resample the training data. This paper applies the synthetic minority over-sampling technique (SMOTE) for several reasons. First, standard SMOTE is simple to implement in practice. Second, SMOTE has been shown empirically to perform well compared with random oversampling in many experiments [18,19]. Third, the synthetic examples are generated in a less application-specific manner: the new examples are created by operating in feature space rather than data space [19]. SMOTE can therefore be applied widely to imbalanced datasets.
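The core interpolation step of SMOTE can be sketched as follows. This is a simplified illustration, not the configuration used in the experiments: it assumes purely numeric features, picks the base example at random instead of iterating over every minority example, and needs at least two minority points; the function name smote and its parameters are hypothetical.

```python
import random


def smote(minority, n_synthetic, k, rng):
    """Create synthetic minority examples by interpolating between a minority
    example and one of its k nearest minority neighbours (numeric features only)."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up two-feature minority examples.
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (2.5, 2.5)]
new_points = smote(minority, n_synthetic=5, k=2, rng=random.Random(0))
print(new_points)
```

Each synthetic point lies on the line segment between two existing minority examples, which is what is meant by generating examples in feature space rather than duplicating records in data space.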

Experiment Setup
The parameter settings that control the rule learning system are shown in Table 5.
The parameter settings for SMO and neural networks are shown in Table 6. All experiments used 10-fold cross-validation to evaluate the performance of the different methods.
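For readers unfamiliar with the procedure, a k-fold split can be sketched as below. The helper k_fold_indices is a hypothetical illustration of plain (unstratified) cross-validation; the text does not state whether the folds in the experiments were stratified.

```python
import random


def k_fold_indices(n_examples, k, rng):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(25, 10, random.Random(0)):
    print(len(train), len(test))
```

Every example is used for testing exactly once and for training in the remaining nine folds, and the reported performance is the average over the ten test folds.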

Results and Evaluations
The performance of Logistic Regression model, Neural Networks, SMO, RBF network, Bayesian networks and rule learning system (GBGP) is shown in Table 7. TP rate (yes) is the true positive rate for fraudulent firms, which is calculated by Equation (3).
TP rate (no) is the true negative rate for non-fraudulent firms, which is calculated by Equation (4).
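Using TP and FN for the numbers of fraudulent firms classified correctly and incorrectly, and TN and FP for the numbers of non-fraudulent firms classified correctly and incorrectly (standard confusion-matrix notation, assumed here), the two rates described above would take the standard forms:

```latex
\text{TP rate (yes)} = \frac{TP}{TP + FN},
\qquad
\text{TP rate (no)} = \frac{TN}{TN + FP}
```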
The results show that the logistic regression model, Neural Networks, SMO and Bayesian networks are able to classify the non-fraudulent firms; Bayesian networks in particular outperform the other techniques in terms of accuracy. However, these techniques perform poorly at detecting fraudulent firms. A possible reason is that it is hard to build models of the CCSF dataset with these techniques. In addition, the dataset contains many noisy examples, whose effect may be further amplified by SMOTE. In the GBGP method, by contrast, variables that are not learnt in a rule can be represented by the term "any". GBGP therefore performs well on both classes. The comparison between GBGP and the other models is shown in Table 8.
In Table 8, Acc (yes) is the classification accuracy for fraudulent firms, and Acc (no) is the classification accuracy for non-fraudulent firms. Diff. denotes the average difference between the performance of GBGP and that of the compared approach. S.D. is the standard deviation, and t-stat is the statistic used to test whether the average difference is significantly different from zero.
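One way to compute these three quantities is sketched below, assuming a paired t-test over per-fold scores; the text does not state which variant of the test was used, so this choice, the function name paired_t_stat and the numbers in the example are all assumptions for illustration.

```python
import math
import statistics


def paired_t_stat(scores_a, scores_b):
    """Return (mean difference, sample standard deviation, t statistic)
    for paired scores, e.g. per-fold accuracies of two methods."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)                    # sample S.D. (n - 1 denominator)
    t = mean_diff / (sd / math.sqrt(len(diffs)))    # tests whether mean_diff differs from 0
    return mean_diff, sd, t


# Made-up paired scores, not results from the paper.
diff, sd, t = paired_t_stat([3, 5, 4, 6], [1, 2, 3, 2])
print(diff, sd, t)
```

The t statistic would then be compared against the critical value of the t distribution with n - 1 degrees of freedom, where n is the number of paired scores.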
Table 8 shows that GBGP significantly outperforms logistic regression, NNs and SMO on both classes. GBGP is significantly better than the RBF network at classifying non-fraudulent firms, and significantly better than Bayesian networks at classifying fraudulent firms.

Conclusions and Future Work
In this study, we have compared the performance of six different approaches to the China Corporate Securities Fraud (CCSF) problem. We found that GBGP outperforms the logistic regression model, back-propagation neural networks, SMO, RBF networks with a Gaussian function and Bayesian networks in terms of accuracy. In addition, GBGP has three competitive advantages. First, GBGP can generate understandable individuals for classification tasks. Second, the designed grammar can describe the problem clearly and ensures that only valid individuals are generated in the learning process. Third, the token competition technique can improve the diversity of the evolved rules in GBGP.
It would be interesting to extend GBGP to multiple classes. For example, what types of fraud will firms commit? How much do fraudulent firms need to pay as a result of the enforcement action? Will fraudulent firms commit fraud again? Such an extension may discover more useful information than binary classification.

Figure 1. An individual program in GBGP, which represents: if the value of term2 > 4 then yes, otherwise no.

Figure 2. A flowchart of the GBGP system.

Table 1. A grammar for the CCSF problem.

Table 3. Four outcomes of binary classification.

Table 4. Definition of variables.

Table 5. The parameters and values for the system.

Table 6. The parameters and values for NNs and SMO.

Table 7. The performance table for the CCSF dataset.

Table 8. The comparison between GBGP and other models.