Feature Selection with Non-Linear PCA: A Neural Network Approach

Machine learning is concerned with the creation and development of algorithms that allow a machine to learn from data, gradually improving its behavior over time. This learning is the more effective, the more representative the features used to describe the problem are of the dataset. An important objective is therefore the correct selection (and, where possible, the reduction of the number) of the most relevant features, which is typically carried out through dimensionality-reduction tools such as Principal Component Analysis (PCA), which in the general case is non-linear. In this work, an approach to the computation of the reduced PCA space is proposed through the definition and implementation of appropriate artificial neural network models, which yields an accurate and at the same time flexible reduction of the dimensionality of the problem.


Introduction
The term machine learning [1] refers to one of the fundamental areas of artificial intelligence, centred on the development of systems and algorithms capable of generalizing from a series of observations. Starting from more or less large sets of data, the machine, using the methods and algorithms developed, becomes able to automatically recognize complex patterns and take "decisions''.
Today machine learning technologies [2] are easily accessible (see e.g. Google's TensorFlow [3] or Microsoft's Cognitive Services [4] [5]) for high-level processing with an emphasis on data semantics. These platforms are proposed as "open'' tools, accessible by any developer who wants to use artificial intelligence to perform complex elaborations and analyze large databases from a machine learning perspective.
When it comes to machine learning, one does not necessarily have to think of robotics, autonomous driving or the games won by DeepMind [6]: these automatic learning systems can also be used to combat spam (by better recognizing unsolicited e-mails), to detect intrusion attempts into a computer network, to improve optical character recognition (OCR), and for artificial vision. Search engines themselves make extensive use of them to offer users even more relevant results by analyzing the meaning (semantics) of the query.
The effective use of machine learning techniques depends strongly on the correct modelling of the problem by the researcher, who must be able to capture the fundamental characteristics that allow an effective implementation of the predictive model. If the selected features are too numerous with respect to the available cases, the "power'' [7] of the corresponding statistical model is compromised, since the number of cases required typically grows exponentially with the number of features of the model. It is therefore of fundamental importance to reduce the number of features, obviously without losing too much of the model's informative capacity. This reduction is usually obtained through mathematical dimensionality-reduction techniques such as Principal Component Analysis (PCA) [8] or Multi-Dimensional Scaling (MDS) [9], which transform the initial data space into a new space with a reduced number of components (the so-called principal components) onto which the original variables are projected.
In this paper an approach to feature reduction by non-linear PCA, the most general case, is presented. In our approach, the reduced space of components is determined by setting up appropriate artificial neural network models, of hierarchical or symmetrical type, so that the principal components are computed through the progressive self-learning typical of a neural network.
In the following sections, the general principles of artificial neural networks and PCA are briefly discussed; then the method of calculating PCA through appropriate neural network models is presented, and the results are discussed, as well as the conclusions and future directions of research.

Artificial Neural Networks for Supervised Learning
Learning by example plays a fundamental role in the process of human understanding (in newborns, for example, learning proceeds by imitation and by trial and error): the learner learns on the basis of specific cases, not general theories.
In essence, learning from examples is a process of reasoning that leads to the identification of general rules based on observations of specific cases (inductive inference).
There are two typical characteristics of the process of learning from examples: first, the knowledge learnt is more compact than the equivalent form with explicit examples, and therefore requires less memory capacity; second, the knowledge learnt contains more information than the examples observed, since, being a generalization, it also applies to cases never observed. In inductive inference, however, starting from a set of true or false facts, we arrive at the formulation of a general rule that is not necessarily always correct: in fact, a single false assertion is sufficient to invalidate a rule. An inductive system therefore can automatically generate knowledge that may be false. The frequency of errors depends strongly on how the set of examples from which the system learns was chosen, and on how representative it is of the universe of possible cases.
Artificial neural networks (ANNs) [10] [11] are, among the tools capable of learning from examples, those with the greatest capacity for generalization, because they can easily handle situations not foreseen during the learning phase. They are computational models directly inspired by the functioning of the brain and are at an advanced stage of research. An artificial neural network can be thought of as a machine designed to replicate the principles by which the neurons of the human brain work. In the field of automatic learning, a neural network is a mathematical-computational model used to solve engineering problems in different fields of application. It constitutes an adaptive system that changes its structure according to the flow of external or internal information passing through the network during the learning phase.
Neural networks are non-linear structures that can be used to simulate complex relationships between inputs and outputs that other analytical functions cannot represent. The external signals are received and processed by a set of input nodes, in turn connected with multiple internal nodes (organized into layers): each node processes the signals received and transmits the result to the following nodes. As neural networks are trained on data, connections between neurons are strengthened and the output gradually organizes into well-defined patterns that the machine can use to make decisions.
Largely abandoned during the winter of artificial intelligence, neural networks are now at the centre of most projects focused on artificial intelligence and machine learning in particular [2]. They consist of a layer of input neurons (elementary computational units), a layer of output neurons and possibly one or more intermediate layers called hidden (see Figure 1). Interconnections range from one layer to the next and the signal values can be both discrete and continuous. The weight values associated with the input of each node can be static or dynamic in such a way as to plastically adapt the behaviour of the network according to the variations of the input signals.
The functioning of a neural network can be schematically outlined in two phases: the "training'' (learning) phase and the "testing'' (recognition) phase. In the learning phase the network is trained on a sample drawn from the set of data that will then be processed; in the testing phase, which is the normal operating phase, the network processes the input data on the basis of the configuration reached in the previous phase.
Journal of Applied Mathematics and Physics
As for practical realizations, even though networks have an autonomous structure, computer simulations are generally used, since they allow even substantial modifications in a short time and at limited cost. However, the first neural chips [12] are appearing, with performance considerably higher than that of a simulation; so far they have had very little diffusion, mainly because of high costs and extreme structural rigidity.

Application of Neural Networks for Pattern Classification
Pattern recognition is currently the area of greatest use of neural networks. It consists in the classification of objects of the most varied nature into classes defined a priori, or created automatically by the application on the basis of the similarities between the input objects (in this case we speak of clustering).
To perform classification tasks through a computer, real objects must be represented in numerical form; this is done by modelling reality appropriately, associating each object with a pattern (a vector of numerical attributes) that identifies it. This first phase is called feature extraction [13]. The extracted features may then be reduced in number in order to speed up the classification process; this can be done manually or with automatic techniques such as Multi-Dimensional Scaling [9] or Principal Component Analysis [8] (see next Section), moving the patterns to a new space whose features can be classified more simply. After this further phase, called preprocessing, we finally move on to the construction of the classifier, which can be seen as a black box that associates each input pattern with a specific class.
Suppose, more formally, that a pattern p must be assigned to one of k classes c_1, ..., c_k. Given the input pattern p, the classifier outputs the binary vector y = (y_1, ..., y_k), where y_i = 1 if the pattern belongs to class c_i, and y_i = 0 otherwise.
Neural networks can be effectively used as classifiers thanks to their ability to learn from examples and generalize. The idea is to let the neural network learn (through special training algorithms) the correct classification of a representa-tive sample of patterns, and then make the same network work on the set of all possible patterns. At this point we distinguish two different types of learning: supervised and unsupervised.
In "supervised learning'', the set of patterns on which the network must learn (training set) is accompanied by a set of labels that show the correct classification of each pattern. In this way, the network makes a regulation of its structure and internal parameters (connection weights and thresholds) until it obtains a correct classification of training patterns. Given the above mentioned generalization capabilities, the network will work correctly even on external patterns and independent from the training set, provided that the training set itself is sufficiently representative.
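As an illustration of supervised learning on labelled patterns (an addition for this exposition, not part of the original text), the following minimal Python sketch trains a single-neuron perceptron on an invented toy training set and then classifies patterns outside the training set:

```python
import numpy as np

def train_perceptron(patterns, labels, eta=0.1, epochs=20):
    """Adjust the weights on every misclassified training pattern."""
    X = np.hstack([patterns, np.ones((len(patterns), 1))])  # append bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, labels):
            y = 1 if w @ x > 0 else 0
            w += eta * (t - y) * x  # no change when the pattern is already correct
    return w

def classify(w, pattern):
    return 1 if w @ np.append(pattern, 1.0) > 0 else 0

# Toy, linearly separable training set: class 1 when the feature sum is large.
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 0.9], [0.8, 1.1]])
t = np.array([0, 0, 1, 1])
w = train_perceptron(X, t)
print(classify(w, np.array([0.1, 0.1])), classify(w, np.array([0.9, 0.9])))  # 0 1
```

Because the toy set is representative and linearly separable, the trained weights also classify the two unseen patterns correctly, illustrating the generalization mentioned above.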
In "unsupervised learning'', a set of labels cannot be associated with the training set. This can happen for various reasons: the corresponding classes can be simply unknown and not obtainable manually or only inaccurately or slowly or, again, the a-priori knowledge could be ambiguous (the same pattern could be labeled differently by different experts). In this type of learning, the network tries to organize the patterns of the training set into subgroups called clusters [14] using appropriate similarity (or distance) measures, so that all the objects belonging to the same cluster are as similar (near) as possible while the objects belonging to different clusters are as different (distant) as possible. Next, you need to use the expert's a-priori knowledge to label the clusters obtained in the previous step in order to make the classifier usable.
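The clustering step described above can be sketched, for instance, with the classic k-means algorithm; the code below is an illustrative example (not from the original paper), using Euclidean distance as the dissimilarity measure:

```python
import numpy as np

def kmeans(patterns, k, iters=20, seed=0):
    """Group patterns into k clusters without labels; an expert can
    label the resulting clusters afterwards."""
    rng = np.random.default_rng(seed)
    centers = patterns[rng.choice(len(patterns), k, replace=False)]
    for _ in range(iters):
        # assign each pattern to the nearest center (Euclidean distance)
        d = np.linalg.norm(patterns[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # move each center to the mean of its cluster
        for j in range(k):
            if np.any(assign == j):
                centers[j] = patterns[assign == j].mean(axis=0)
    return assign, centers

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
assign, _ = kmeans(pts, 2)
print(assign[0] == assign[1], assign[2] == assign[3], assign[0] != assign[2])
# → True True True
```

Patterns in the same cluster end up as similar (near) as possible, those in different clusters as different (distant) as possible, exactly as described above.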
These two different approaches to learning give rise to the different types of neural networks [15] which are used in this work for the implementation of the non-linear PCA calculation algorithm.

Principal Component Analysis
The characteristics obtained during the extraction phase are rarely used directly as input for classification; some transformation is often necessary to facilitate the task. One of the most frequent problems to solve is the reduction of the pattern dimensionality (of the number of characteristics), in order to make machine learning algorithms more efficient and faster.
Increasing the number of features measured on the objects to be classified generally improves network performance because, intuitively, more information is available on which to base learning. In reality this is true only up to a point, after which the performance of the network tends to decrease (more wrong classifications are obtained). This is because we are forced to work on a limited set of data: increasing the dimensionality of the pattern space thins out our training set, which becomes a poor representation of the distribution. Larger sets would be needed (the growth must be exponential), slowing down the training process while bringing infinitesimal improvements. This problem is known in the literature as the curse of dimensionality. A network with few inputs is preferable because it has fewer adaptive parameters to determine, so even small training sets are sufficient; the result is a faster network with greater capacity for generalization. The problem then is to choose, among the characteristics available, those to preserve and those to discard, trying to lose as little information as possible. PCA helps us here.
Principal Component Analysis is a statistical technique whose aim is to reduce the dimensionality of patterns, based on the selection of the most significant characteristics, that is, those that carry the most information. It is used in many fields and under different names: Karhunen-Loève expansion, Hotelling transform, signal subspace approach, etc.
Given a statistical distribution of data in an M-dimensional space, this technique examines the properties of the distribution and tries to determine the components that maximize the variance or, alternatively, minimize the representation error. These components, called "principal components'', are linear combinations of the random variables, with variances given by the eigenvalues (and directions by the eigenvectors) of the covariance matrix of the distribution. For example, the first principal component is a normalized linear combination with maximum variance, the second principal component has the second largest variance, and so on. Geometrically, PCA corresponds to a rotation of the coordinate axes into a new coordinate system such that the projection of the points on the first axis has maximum variance, the projection on the second axis the second largest variance, and so on (see Figure 2). Thanks to this important property, the technique allows us to reduce a feature space while preserving as much of the relevant information as possible.
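As a concrete illustration, the rotation just described can be computed directly from the eigendecomposition of the covariance matrix; the following NumPy sketch (an addition for illustration, not part of the original text) projects the data onto the first principal components:

```python
import numpy as np

def pca(data, n_components):
    """Project data (n_samples x n_features) onto its principal components."""
    centered = data - data.mean(axis=0)      # center around the mean
    cov = np.cov(centered, rowvar=False)     # covariance matrix of the distribution
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric -> orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    components = eigvecs[:, order[:n_components]]
    return centered @ components             # coordinates in the reduced space

# Example: 3-D points that actually lie on a plane reduce to 2 components.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))  # rank-2 data
reduced = pca(points, 2)
print(reduced.shape)  # (200, 2)
```

The projection onto the first axis has the largest variance, the projection onto the second axis the second largest, as stated above.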
Mathematically, PCA is defined as follows. Consider M-dimensional vectors p_i, i = 1, ..., N, obtained from some distribution and centered around the mean. Each vector can be represented exactly as a linear combination of M orthonormal basis vectors v_j:

p_i = sum_{j=1}^{M} a_{ij} v_j,  with a_{ij} = v_j^T p_i.   (1)

We will now examine the mathematical and neural methods to compute the principal components of the distribution.

Computation of the Principal Components
Suppose we want to reduce the dimensionality of the space from M to L, with L < M, losing as little information as possible. The first step is to rewrite Equation (1) as

p_i = sum_{j=1}^{L} a_{ij} v_j + sum_{j=L+1}^{M} a_{ij} v_j,

and then to replace all a_{ij} (for j = L+1, ..., M) with constants k_j, so that each initial vector p_i can be approximated by a new vector p̂_i defined as

p̂_i = sum_{j=1}^{L} a_{ij} v_j + sum_{j=L+1}^{M} k_j v_j.

In this way we obtain a dimensionality reduction: the second sum is constant, so each M-dimensional vector p_i can be expressed approximately through the L-dimensional vector a_i = (a_{i1}, ..., a_{iL}). Let us now see how to find the basis vectors v_j and the coefficients k_j that minimize the loss of information. The error on p_i introduced by the reduction is

p_i − p̂_i = sum_{j=L+1}^{M} (a_{ij} − k_j) v_j.

We can then define a function E_L that computes the sum of the squared errors:

E_L = (1/2) sum_{i=1}^{N} ||p_i − p̂_i||² = (1/2) sum_{i=1}^{N} sum_{j=L+1}^{M} (a_{ij} − k_j)²,

where we used the "orthonormality'' relation v_j^T v_l = δ_{jl}. Setting the derivative of E_L with respect to k_j equal to zero, we get

k_j = (1/N) sum_{i=1}^{N} a_{ij} = v_j^T p̄,

where the first step follows from the fact that a_{ij} = v_j^T p_i and p̄ is the sample mean (zero for centered data). The error function then becomes

E_L = (1/2) sum_{j=L+1}^{M} v_j^T X v_j,

where X is the covariance matrix of the distribution. Minimizing E_L under the orthonormality constraints leads to the eigenvalue problem

X v_j = λ_j v_j,

for constants λ_j corresponding to the eigenvalues of the matrix X, whose eigenvectors are the v_j. It should also be noted that, since the covariance matrix is real and symmetric, its eigenvectors can be chosen orthonormal as required. Returning to the analysis of the error function, we notice that

E_L = (1/2) sum_{j=L+1}^{M} λ_j,

so the error is minimized by discarding the components associated with the smallest eigenvalues and keeping the L principal components with the largest ones.

A neural network can compute the principal components through Hebbian (therefore unsupervised) learning. The synaptic modification law, however, is not the standard Hebbian rule

w_j(t+1) = w_j(t) + η z(t) p_j(t),   (10)

where p_j(t), w_j(t) and z(t) are, respectively, the value of the j-th input, the j-th weight and the output of the network at time t (the network is supposed to consist of a single neuron), while η is the learning rate. Direct application of this rule would make the network unstable, since the weights grow without bound. Oja [16] proposed another type of rule for changing the weights over time, which turns the network into a principal component analyzer.
He thought of normalizing the weight vector at every step and, starting from (10), he obtained the following rule:

w_j(t+1) = w_j(t) + η z(t) [p_j(t) − z(t) w_j(t)],   (11)

where −η z(t)² w_j(t) is the stabilizing term that keeps the sum of the squared weights bounded and close to 1 without any explicit normalization appearing. The Oja rule can be generalized to networks with multiple output neurons, obtaining the two algorithms in Figure 5 and Figure 6: the first uses a symmetrical network and the second a hierarchical one. In both algorithms the weight vectors must be orthonormalized, i.e. w_i^T w_j = δ_{ij}.

PCA emerges as an excellent solution to several problems of information representation. At the same time, the PCA network has some limitations that make it less attractive:
- the network is able to carry out only linear input-output correspondences;
- the eigenvectors can be computed much more efficiently using standard mathematical techniques;
- the principal components take into consideration only the covariances of the data, which completely characterize only Gaussian distributions;
- the network is not able to separate independent subsignals from their linear combinations.

For these reasons, it is interesting to study non-linear generalizations of PCA, or learning algorithms derived from the generalization of the optimization problem of standard PCA. They can be divided into two classes: robust PCA algorithms (paragraphs 6.1 and 6.2) and non-linear PCA algorithms in the strict sense (Section 7). In the former, the criterion to be optimized is characterized by a function that grows more slowly than the quadratic function, and the initial conditions are the same as those of standard PCA (the neuron weight vectors must be mutually orthonormal); in these algorithms, non-linearity appears only at certain points. In non-linear PCA algorithms, instead, all the neuron outputs are a non-linear function of the response.
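Oja's single-neuron rule can be sketched as follows; this Python example (added for illustration, with invented toy data) applies the update w ← w + η z (p − z w) and converges, up to sign, to the first principal component:

```python
import numpy as np

def oja_first_component(data, eta=0.01, epochs=30, seed=0):
    """Single neuron trained with Oja's rule on centered data;
    the weight vector converges (up to sign) to the first principal component."""
    rng = np.random.default_rng(seed)
    centered = data - data.mean(axis=0)
    w = rng.normal(size=centered.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for p in centered:
            z = w @ p                   # neuron output z = w^T p
            w += eta * z * (p - z * w)  # Hebbian term minus Oja's stabilizer
    return w

# Toy data whose variance is dominated by the first axis.
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])
w = oja_first_component(data)
```

Note that no explicit normalization is performed: the stabilizing term alone keeps the weight norm close to 1, as stated above.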
It is also interesting to note that, while standard PCA needs some form of hierarchy to differentiate the output neurons in order to obtain the principal components (the symmetric algorithm yields only linear combinations of the principal components), in the non-linear generalizations the hierarchy is less important, since the non-linear function breaks the symmetry during the learning phase [17].

Generalization of Variance Maximization
The standard quadratic problem leading to a PCA solution can also be obtained by maximizing the output variances E{(w_i^T p)²} under orthonormality constraints on the weight vectors. The best solution in this case is any orthonormal basis of the PCA subspace, and is therefore not unique. The problem of maximizing variance under symmetric orthonormality constraints thus leads to symmetric networks, the so-called PCA subspace networks.
Let us now consider the generalization of the variance maximization problem for robust PCA. Instead of the mean square used above, we can maximize a more general mean

E{c(w_i^T p)},   (14)

where c(t) must be a valid cost function that grows more slowly than the square, at least for large values of t. In particular we assume that c(t) is even, non-negative, almost everywhere continuous and differentiable; a typical choice has derivative c'(t) = tanh(θ t), where θ represents a scaling factor that depends on the range within which the input values vary. In that case the criterion to maximize, for each weight vector w_i, is

G(w_i) = E{c(w_i^T p)} + sum_j λ_{ij} (w_i^T w_j − δ_{ij}).   (15)

In the summation, the Lagrange coefficients λ_{ij} impose the necessary orthonormality constraints.
Both the hierarchical and the symmetric problem can be discussed under the general criterion G. In the symmetric case the upper limit of the summation index is l(i) = L for every neuron, while in the hierarchical case it is l(i) = i. A stochastic gradient algorithm to maximize Equation (14) is obtained by inserting the estimate d(i) of the gradient vector (Equation (16)) into the weight-update step, which becomes

w_i(t+1) = w_i(t) + η g(y_i(t)) [p(t) − sum_{j=1}^{l(i)} y_j(t) w_j(t)],

with y_i = w_i^T p and g = c'. It is interesting to note that the optimal solution for the robust criterion in general does not coincide with the standard solution, but is very close to it: for example, if we consider c(t) = |t|, the directions w_i that maximize the criterion remain close to the standard principal component directions.
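A minimal sketch of the robust variance-maximization update for a single weight vector (the case l(i) = 1) follows; it is an illustrative addition assuming c'(t) = tanh(θ t), with an explicit renormalization at the end of each epoch to enforce the unit-norm constraint:

```python
import numpy as np

def robust_pca_first(data, eta=0.01, theta=1.0, epochs=150, seed=0):
    """Robust variance maximization for one weight vector: the raw output
    is passed through g(t) = tanh(theta * t), the derivative of a cost
    function that grows more slowly than the square."""
    rng = np.random.default_rng(seed)
    centered = data - data.mean(axis=0)
    w = rng.normal(size=centered.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for p in centered:
            y = w @ p
            w += eta * np.tanh(theta * y) * (p - y * w)  # robust Oja-type update
        w /= np.linalg.norm(w)  # re-impose the unit-norm (orthonormality) constraint
    return w

# The direction found stays close to the standard first principal component.
rng = np.random.default_rng(2)
data = rng.normal(size=(400, 3)) * np.array([3.0, 1.0, 0.3])
w = robust_pca_first(data)
```

Because tanh saturates, large (outlier) outputs influence the update less than in the quadratic criterion, which is the point of the robust generalization.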

Generalization of Error Minimization
Let us consider the linear approximation p̂ of the vectors p in terms of a set of weight vectors w_j, and see how to carry out the robust generalization of the quadratic mean representation error. Robust PCA algorithms can be obtained by minimizing the criterion

E{1^T c(p − sum_{j=1}^{l(i)} (w_j^T p) w_j)},   (21)

where 1 denotes the M-dimensional vector of ones, c(·) is applied componentwise, and c(t) meets the above-mentioned assumptions. Minimizing (21) with respect to the w_j yields the gradient descent algorithm shown in Figure 8. The algorithm can be applied, as usual, both to the symmetric and the hierarchical case; in the symmetric case it is l(i) = L for every neuron.

As Table 1 shows, the fastest converging networks were the non-linear PCA networks, followed by the linear PCA networks (GHA and Oja subspace) and, immediately after, by the PCA obtained from the generalization of variance maximization.
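A sketch of the non-linear PCA subspace rule in the symmetric case, in which every neuron output passes through the tanh nonlinearity, might look as follows (an illustrative addition with invented toy data, not the exact algorithm of Figure 8):

```python
import numpy as np

def nonlinear_pca_subspace(data, n_units, eta=0.005, epochs=60, seed=0):
    """Symmetric non-linear PCA subspace rule: every neuron output goes
    through tanh, which breaks the symmetry between the units during learning."""
    rng = np.random.default_rng(seed)
    centered = data - data.mean(axis=0)
    W = 0.1 * rng.normal(size=(n_units, centered.shape[1]))  # rows = weight vectors
    for _ in range(epochs):
        for p in centered:
            y = np.tanh(W @ p)         # non-linear neuron outputs
            e = p - W.T @ y            # reconstruction error in input space
            W += eta * np.outer(y, e)  # reduce the representation error
    return W

rng = np.random.default_rng(3)
data = rng.normal(size=(300, 3)) * np.array([1.0, 0.5, 0.1])
W = nonlinear_pca_subspace(data, 2)
```

Here no explicit hierarchy or orthonormalization is imposed; as noted earlier, the nonlinearity itself differentiates the output neurons during learning.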
In order to choose the most effective method, PCA networks have been divided into two main classes: networks with a linear input-output mapping and networks with a non-linear one. Networks of the first (linear) type identify in the images only very bright objects and/or objects with a clearly distinct outline, confusing the weakest objects with the background. Networks of the second type allow, instead, under certain conditions, the identification of less defined objects as well. The condition for obtaining this result is the use, in the algorithms of Figure 7 and Figure 10, of a sigmoid-type activation function (hyperbolic tangent) [11]: this type of network gives greater weight to the pixels of weak objects, separating them from the background.

Discussion and Conclusions
The aim of this paper is to construct an algorithm capable of implementing both standard (linear) and non-linear Principal Component Analysis (PCA) through the use of artificial neural network models. PCA is mainly used to reduce the dimensionality (number of features) of a problem but, in the traditional approach, the determination of the principal components most representative of the phenomenon has the following limitations: 1) the computation is of algebraic (matrix) nature and, for a high number of variables, can involve a long processing time; 2) standard PCA is suitable only for problems with linear relationships between the variables.
The approach presented is an algorithm for calculating the principal components for both standard and non-linear problems. The algorithm makes use of artificial neural network models, with an iterative processing given by the "convergence'' of the network towards the optimal weights, which correspond to the final solution of the problem. The neural network models proposed in the algorithm make use of multiple layers of neurons (see Section 6), with the hyperbolic tangent function applied to the PCA output.
The performance of the proposed approach has been evaluated in a test implementation, and can be further improved both in the definition phase of the neural network architecture (number of hidden layers and neurons) and in the learning and validation phase (e.g. through the introduction of cross-validation or leave-one-out depending on the size of the input dataset).
The performance of the proposed approach, while very good, can be further improved at both the detection and the classification stage: 1) to improve the recognition rate, it is desirable to create an algorithm capable of automatically recognizing and eliminating the spurious objects present on a plate; 2) to speed up the learning of the supervised networks used for classification, a hybrid training scheme exploiting the potential of several algorithms simultaneously can be considered.
As for the second point, it has been noted that with the scaled conjugate gradient learning algorithm the average error quickly decreases to a certain level, after which it tends to stabilise. Newton's method has much slower iterations but, on the other hand, manages to reach values lower than that average error. One can therefore think of hybridizing the two algorithms: use the first one until the average error stops dropping, then apply the second one starting from the final weight configuration of the first. In this way the error function can be lowered further without excessive computation time.