Orthogonal-Least-Squares Forward Selection for Parsimonious Modelling from Data

The objective of modelling from data is not that the model simply fits the training data well. Rather, the goodness of a model is characterised by its generalisation capability, interpretability and ease of knowledge extraction. All these desired properties depend crucially on the ability of the modelling process to construct appropriate parsimonious models, and a basic principle in practical nonlinear data modelling is the parsimonious principle of ensuring the smallest possible model that explains the training data. A vast body of work exists in the area of sparse modelling, and a widely adopted approach is based on linear-in-the-parameters data modelling, which includes the radial basis function network, the neurofuzzy network and all the sparse kernel modelling techniques. A well-tested strategy for parsimonious modelling from data is the orthogonal least squares (OLS) algorithm for forward selection modelling, which is capable of constructing sparse models that generalise well. This contribution continues this theme and provides a unified framework for sparse modelling from data that includes regression and classification, which belong to supervised learning, and probability density function estimation, which is an unsupervised learning problem. The OLS forward selection method based on the leave-one-out test criteria is presented within this unified data-modelling framework. Examples from regression, classification and density estimation applications are used to illustrate the effectiveness of this generic parsimonious modelling approach from data.


Introduction
Data modelling is an important and recurrent theme in all fields of engineering. Various data modelling applications can be classified into three categories, namely, regression [1-3], classification [4-6] and probability density function (PDF) estimation [7-9]. In regression, the task is to establish a model that links the observation data to their target function or desired output values. The goodness of a regression model is judged by its generalisation performance, which can be conveniently determined by the test mean square error (MSE) on data not used in training the model. Like regression, classification is also a supervised learning problem. However, the desired output is discrete valued, e.g. binary in two-class classification problems, and the goodness of a classifier is determined by its test error probability or misclassification rate. Despite these differences, classifier construction can be expressed in the same framework as regression modelling. The third class of data modelling, namely, PDF estimation, is very different in nature from regression and classification. The task of PDF estimation is to infer the underlying probability distribution that generates the observations. Because the true target function, the underlying PDF, is not available, this is an unsupervised learning problem and can only be carried out based on often noisy observation data. Nevertheless, this unsupervised task can be "transformed" into a supervised one, for example, by computing the empirical distribution function from the observation data and using it as the target function for the cumulative distribution function in PDF estimation. This contribution adopts this unified regression framework for data modelling. The theory and practice of linear regression modelling are well established [10-12], and the least squares (LS) method [13] has been a basic toolkit for data modelling.
Since the real-world phenomena that generate data are nonlinear to some extent, nonlinear models are often required in order to achieve adequate modelling accuracy. Over the past three decades, extensive efforts have been directed at developing coherent and concise methods for nonlinear regression modelling. A data modelling problem generally consists of two basic components: determining the model structure and estimating or fitting the model parameters.
Parameter fitting is relatively straightforward if the model structure is known a priori, but this information is rarely available in practice and must be learnt. Determining the model structure is crucial in any practical data modelling problem, and a fundamental principle is that the model should be no more complex than is required to capture the underlying data generating mechanism. This concept, known as the parsimonious principle, is particularly relevant in nonlinear data modelling because the size of a nonlinear model can easily become explosively large. An over-complicated model may simply fit the noise in the training data, resulting in overfitting. An overfitted model does not capture the underlying system structure and will perform badly on new data. In general, a huge model not only may have poor generalisation performance but also has little practical value in data analysis and system design.
A vast body of work exists in the area of parsimonious nonlinear regression modelling, but the most popular approach is perhaps to adopt a linear-in-the-parameters nonlinear model. This is typically achieved by placing a radial basis function (RBF) or other type of kernel on each training data sample, and a sparse representation is then sought which possesses excellent generalisation performance. Adopting a linear-in-the-parameters nonlinear model structure is attractive because many existing linear data modelling techniques can be applied successfully, provided that the model structure determination can be carried out effectively to guarantee a sufficiently parsimonious final model. Among the various linear-in-the-parameters nonlinear data modelling techniques, the support vector machine (SVM) method and other sparse kernel modelling methods [54-68] have become popular in recent years. In particular, the SVM technique [54] is widely regarded as the state-of-the-art technique for regression and classification applications, and it has also been proposed as a promising tool for sparse kernel density estimation [69-71]. The formulation of the SVM embodies the structural risk minimisation principle, thus combining excellent generalisation properties with a sparse model representation. Despite these attractive features and many good empirical results obtained using the SVM method, data modelling practitioners have realised that the ability of the SVM method to produce sparse models has perhaps been overstated.
The orthogonal least squares (OLS) algorithm [34], developed in the late 1980s for nonlinear system modelling, remains highly popular among nonlinear data modelling practitioners, for the reason that the algorithm is simple and efficient, and is capable of producing parsimonious linear-in-the-parameters nonlinear models with good generalisation performance. Unlike the SVM and many other sparse kernel modelling techniques, which work on the full kernel model defined on the training data set to obtain a sparse model, the OLS method [34] adopts forward selection to build up an adequate model by selecting only significant regressors. Since its derivation, many enhanced variants of the OLS-based forward regression algorithm have been proposed [37-53]. In particular, the local regularisation assisted OLS algorithm [39,41], which employs multiple regularisers to enforce model sparsity [60], has been shown to be capable of producing very sparse regression models that generalise well. A significant improvement to the original OLS algorithm for sparse regression modelling is to enhance the algorithm with optimal experimental design criteria [39,45,47,48,50,51]. In a traditional forward regression procedure, a separate stopping criterion is required to terminate the selection procedure at an appropriate model size, for example in order to avoid an over-fitted model. Typically, information based criteria, such as the AIC [72] and the minimum description length [73], were adopted to terminate the model selection process. An information based criterion can be viewed as a model structure regularisation that uses a penalty term to penalise large sized models. However, the penalty term in an information based criterion does not help to determine which model term should be selected. Multiple regularisers, i.e.
local regularisation [39,41,60], and optimal experimental design criteria [39,45,47,48,50,51] offer better solutions as model structure regularisation because they are directly linked to model efficiency and parameter robustness [74]. The basic criterion for most model construction procedures, including the original OLS algorithm [34,35], is the training MSE. However, the goodness of a regression model is its generalisation capability. Therefore, a better and more natural approach is to use a criterion of model generalisation performance directly in the model selection procedure, rather than only using it as a measure of model complexity. The evaluation of model generalisation capability is directly based on the concept of cross validation [75], and a commonly used cross validation is the delete-one or leave-one-out (LOO) cross validation [2,76,77]. One of the most important improvements to the OLS algorithm based forward regression is the development of the OLS forward selection based on the LOO test score or MSE [40,46,47,49], which is a measure of the model generalisation performance. The use of the LOO estimate for general nonlinear-in-the-parameters models has been studied for example in [78-80]. However, even for the class of linear-in-the-parameters models, the computation of the LOO MSE is normally expensive; for the OLS forward selection, fortunately, the LOO MSE can be computed with an efficient recursive formula, which makes the use of the LOO statistic in model selection computationally affordable [46]. An additional advantage of adopting the LOO test score based OLS algorithm is that the model construction process becomes truly automatic, without the need for the user to specify some additional terminating criterion [46].
Our empirical modelling results for regression [40], classification [52] and kernel density estimation [43,44] have demonstrated that this OLS algorithm based on the LOO test score coupled with local regularisation compares favourably with the SVM and many other existing state-of-the-art sparse kernel modelling methods, in terms of generalisation capability and model sparsity as well as the computational complexity of model construction. This contribution is organised as follows. Section 2 presents the regression modelling framework, which unifies all three classes of data modelling applications, namely, regression, classification and PDF estimation. In particular, the unsupervised density learning is converted into a supervised regression problem by adopting the Parzen window (PW) estimate as the target function [44]. Based on this unified data-modelling framework, the OLS forward selection algorithm using the LOO test criteria and local regularisation is detailed in Section 3. More specifically, for regression modelling, the model selection criterion is based on the LOO test MSE, while for classification applications, the LOO misclassification rate is employed for model selection. In kernel density estimation, the kernel weights must satisfy the nonnegative and unity constraints, and a combined approach is adopted to tackle this constrained regression modelling. A sparse kernel density estimate is first selected by the efficient OLS algorithm based on the LOO test score and local regularisation. The kernel weights of the final model are then updated using the multiplicative nonnegative quadratic programming (MNQP) algorithm [61,81] to meet the nonnegative and unity constraints. The MNQP algorithm additionally has the desirable property of forcing some kernel weights to (near) zero values, thus further reducing the model size [61,81].
The experimental results are included in Section 4, where empirical examples taken from regression, classification and PDF estimation applications demonstrate the effectiveness of the proposed OLS algorithm based on the LOO test criteria coupled with local regularisation within the unified data-modelling framework. The concluding remarks are summarised in Section 5.

A Unified Data Modelling Framework
The three classes of data modelling, namely, regression, classification and PDF estimation, can be unified under the generic regression framework of sparse kernel data modelling based on the appropriate modelling criteria, where the kernel model is interpreted in a generic sense, namely, a kernel or nonlinear basis is placed on each training data sample and the model is obtained as a linear combination of all the bases defined on the training data set. For kernel density estimation, a kernel should also meet the usual requirement of a density distribution, i.e. the area under the kernel is unity. The objective is to derive a sparse model representation with excellent generalisation capability based on a training data set.

Regression Modelling
Consider the general nonlinear data generating mechanism governed by the nonlinear model

  y_k = f(x_k) + e_k, 1 ≤ k ≤ N,   (1)

where y_k is the system output, x_k ∈ R^m is the system input vector and e_k is the observation noise, and approximate it with the generic kernel model

  ŷ(x) = Σ_{i=1}^{N} β_i K_ρ(x, c_i),   (2)

where ŷ denotes the model output, β_i are the kernel weights, K_ρ(•,•) is the chosen kernel function with width ρ, and c_i = x_i is the i-th kernel centre vector. The generic kernel model (2) is defined by placing a kernel at each of the training input samples and forming a linear combination of all the bases defined on the training data set. A sparse representation is then sought by selecting a subset kernel model with only N_s nonzero kernel weights, where

  N_s << N.   (3)

At a training data point x_k, the kernel model (2) can be expressed as

  y_k = ŷ_k + ε_k = φ^T(k) β_N + ε_k,   (4)

where ε_k = y_k − ŷ_k is the modelling error at x_k, β_N = [β_1 β_2 ... β_N]^T and φ(k) = [K_ρ(x_k, c_1) K_ρ(x_k, c_2) ... K_ρ(x_k, c_N)]^T. Collecting the N equations of (4) yields the matrix form

  y = Φ_N β_N + ε,   (5)

where y = [y_1 y_2 ... y_N]^T, ε = [ε_1 ε_2 ... ε_N]^T and Φ_N = [φ_1 φ_2 ... φ_N]. Note that φ_k is the k-th column of Φ_N, while φ^T(k) denotes the k-th row of Φ_N. Let an orthogonal decomposition of the regression matrix Φ_N be

  Φ_N = W_N A_N,   (6)

where A_N is an N × N unit upper triangular matrix and

  W_N = [w_1 w_2 ... w_N]   (7)

has orthogonal columns satisfying w_i^T w_j = 0 for i ≠ j. The regression model (5) can alternatively be expressed as

  y = W_N g_N + ε,   (8)

where the weight vector g_N = [g_1 g_2 ... g_N]^T defined in the orthogonal model space satisfies the triangular system A_N β_N = g_N. The space spanned by the original model bases φ_k, 1 ≤ k ≤ N, is identical to the space spanned by the orthogonal model bases w_k, 1 ≤ k ≤ N, and the model is equivalently expressed by

  y_k = w^T(k) g_N + ε_k,   (9)

where w^T(k) is the k-th row of W_N.
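For concreteness, the orthogonalisation Φ_N = W_N A_N of (6) can be sketched in a few lines of Python. This is a plain Gram-Schmidt pass for illustration only; the paper's own procedure is the one given in Appendix A, and the helper names here are ours:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(phi_cols):
    """Decompose Phi = W * A with W having mutually orthogonal columns
    and A unit upper triangular, as in (6)."""
    n = len(phi_cols)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    W = []
    for k, phi in enumerate(phi_cols):
        w = list(phi)
        for i in range(k):
            a_ik = dot(W[i], phi) / dot(W[i], W[i])  # projection onto w_i
            A[i][k] = a_ik
            w = [wj - a_ik * wij for wj, wij in zip(w, W[i])]
        W.append(w)
    return W, A

# Two columns in R^2: after the pass, w_1 and w_2 are orthogonal and
# phi_2 = A[0][1]*w_1 + w_2 reconstructs the original column.
W, A = gram_schmidt([[1.0, 1.0], [1.0, 0.0]])
print(dot(W[0], W[1]))  # → 0.0
```

Because A_N is unit upper triangular, the triangular system A_N β_N = g_N can later be solved by simple back substitution.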
A procedure that can be used to perform the orthogonalisation (6) is summarised in Appendix A.

Classification Application
Consider the two-class classification problem with the given training data set D_N = {x_k, y_k}, 1 ≤ k ≤ N, where x_k ∈ R^m is an m-dimensional pattern vector and y_k ∈ {−1, +1} is the class label for x_k. The task is to construct a kernel classifier of the form

  ŷ_k = sgn(z_k) with z_k = Σ_{i=1}^{N} β_i K_ρ(x_k, c_i) = φ^T(k) β_N,   (10)

where ŷ_k is the estimated class label for x_k and

  sgn(z) = −1 for z < 0, sgn(z) = +1 for z ≥ 0.   (11)

Let us define the modelling error as ε_k = y_k − z_k. Then the classification model over the training data set can be expressed in the regression model of (5), recited here again as

  y = Φ_N β_N + ε,   (12)

or equivalently in the orthogonal regression model of (9), rewritten here again as

  y_k = w^T(k) g_N + ε_k,   (13)

where all the relevant notations are as defined in Subsection 2.1. It is clear that the kernel classifier construction can be expressed in the same kernel regression modelling framework of Subsection 2.1, and the only difference is that the target function y_k in classification applications is discrete valued. In particular, for the two-class classification problem, y_k is binary. The objective is again to derive a sparse kernel model that possesses good generalisation capability and contains only N_s significant kernels.
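As a toy illustration of this point, the following sketch treats the ±1 class labels directly as regression targets and applies the sgn decision to the kernel model output. The one-dimensional data, kernel width and the naive choice of kernel weights are all illustrative assumptions, not a model trained by the algorithm of this paper:

```python
import math

def gauss(x, c, rho):
    # Gaussian kernel placed at centre c with width rho
    return math.exp(-((x - c) ** 2) / (2.0 * rho ** 2))

X = [-2.0, -1.5, 1.0, 2.5]        # 1-D training patterns
y = [-1, -1, +1, +1]              # class labels
rho = 1.0
beta = [yk / len(X) for yk in y]  # naive kernel weights, for illustration

def decide(x):
    z = sum(b * gauss(x, c, rho) for b, c in zip(beta, X))
    return 1 if z >= 0 else -1    # sgn decision on the regression output

train_err = sum(decide(xk) != yk for xk, yk in zip(X, y)) / len(X)
print(train_err)  # → 0.0 on this separable toy set
```

The point of the sketch is structural: the classifier output z is exactly a linear-in-the-parameters kernel regression model, so the same forward selection machinery applies.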

Kernel Density Estimation
Based on a finite data sample set D_N = {x_k}, 1 ≤ k ≤ N, drawn from a density p(x), where x_k ∈ R^m, the task is to estimate the unknown density p(x) using the kernel density estimate of the form

  p̂(x; β_N, ρ) = Σ_{i=1}^{N} β_i K_ρ(x, c_i),   (15)

with the constraints

  β_i ≥ 0, 1 ≤ i ≤ N,   (16)

and

  β_N^T 1_N = 1,   (17)

where 1_N denotes the N-dimensional vector of ones. The kernel K_ρ(•,•) is chosen to be the Gaussian kernel in this study. However, many other kernel functions can also be used in the density estimate (15). Following the approach of [44], this unsupervised kernel density learning is transformed into a supervised learning problem. The well-known PW estimate

  p̂_Par(x) = (1/N) Σ_{k=1}^{N} K_σ(x, x_k),   (18)

where σ is the PW kernel width, is itself a kernel density estimate with all the kernel weights set to 1/N and provides a good approximation of the true density, so the divergence between the sparse kernel density estimate (15) and the PW estimate (18) can serve as the learning criterion. Minimising this divergence subject to the constraints (16) and (17) can be shown to be equivalent to a constrained least squares regression problem [44]. Thus the generic kernel density estimation problem (15) can be viewed as the following regression problem with the PW estimate as the "desired response" or target function,

  y_k = p̂_Par(x_k) = φ^T(k) β_N + ε_k, 1 ≤ k ≤ N,   (21)

subject to the nonnegative constraint (16) and the unity constraint (17), where all the relevant notations have been defined in Subsection 2.1. The regression model (21) can of course be written equivalently in the orthogonal form of (9), which is recited here again as

  y_k = w^T(k) g_N + ε_k.   (22)

The objective is to obtain a sparse N_s-term kernel model, satisfying the kernel weight constraints (16) and (17) and yet having a test performance comparable to that of the full-sample optimised PW estimate.
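The conversion can be illustrated with a short sketch: the PW estimate, evaluated at each training sample, supplies the target values y_k for the constrained regression. The one-dimensional data and kernel width below are illustrative assumptions:

```python
import math

def gauss_pdf(x, c, sigma):
    # Gaussian density kernel: integrates to one, as a density kernel must
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

samples = [0.1, -0.3, 0.7, 1.2, -0.9]   # illustrative 1-D data
sigma = 0.5                              # illustrative PW kernel width

def parzen(x):
    # PW estimate: equal weight 1/N on a kernel at every sample
    return sum(gauss_pdf(x, c, sigma) for c in samples) / len(samples)

# Regression targets y_k for the constrained kernel model: the PW
# estimate evaluated at the training samples themselves
targets = [parzen(xk) for xk in samples]
print(all(t > 0 for t in targets))   # density targets are positive
```

Once the targets are in hand, the density learning problem looks exactly like the regression problem of Subsection 2.1, apart from the nonnegativity and unity constraints on the weights.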

Orthogonal-Least-Squares Algorithm
As established in the previous section, regression, classification and PDF estimation can all be unified within the common regression modelling framework. Therefore, the OLS forward selection based on the LOO test criteria and local regularisation (OLS-LOO-LR) [40] provides an efficient algorithm to construct a sparse kernel model that generalises well. For regression and kernel density modelling, the LOO MSE criterion is an appropriate measure of a model's generalisation capability for subset model selection, while for kernel classifier construction, the LOO misclassification rate offers a proper measure of a classifier's generalisation performance for selecting significant kernels [52]. Sparse kernel density (SKD) construction is special as it is formulated as a constrained regression modelling problem, where the kernel weights must meet the nonnegative and unity constraints. A combined OLS-LOO-LR and MNQP approach is adopted for this constrained regression modelling [44], where the OLS-LOO-LR algorithm determines the sparse kernel model structure by selecting a subset of significant kernels, while the MNQP algorithm [61,81] computes the kernel weights of the selected SKD estimate.

Sparse Kernel Regression Model Construction
The local regularisation aided least squares solution for the weight parameter vector g_N can be obtained by minimising the following regularised error criterion [41]

  J_R(g_N, λ_N) = ε^T ε + Σ_{i=1}^{N} λ_i g_i²,   (23)

where λ_N = [λ_1 λ_2 ... λ_N]^T is the vector of regularisation parameters, and the resulting regularised LS solution is g_i = w_i^T y / (w_i^T w_i + λ_i), 1 ≤ i ≤ N, which is also given in Appendix A. The criterion (23) is rooted in the Bayesian learning framework. According to the Bayesian learning theory [24,39,60], the optimal g_N is obtained by maximising the posterior probability of g_N, which can be shown to be

  p(g_N | y, h_N) ∝ p(y | g_N) p(g_N | h_N),   (24)

where h_N = [h_1 h_2 ... h_N]^T is the vector of hyperparameters. Assuming that the modelling errors follow a zero-mean Gaussian distribution with inverse variance β, the likelihood is given by

  p(y | g_N) ∝ exp(−(β/2) (y − W_N g_N)^T (y − W_N g_N)).   (26)

If the Gaussian prior is chosen, i.e.

  p(g_N | h_N) ∝ Π_{i=1}^{N} h_i^{1/2} exp(−h_i g_i² / 2),   (27)

maximising log p(g_N | y, h_N) with respect to g_N is equivalent to minimising the following Bayesian cost function

  J_B(g_N, h_N) = β ε^T ε + Σ_{i=1}^{N} h_i g_i²,   (28)

where ε = y − W_N g_N. It is obvious that the criterion (23) is equivalent to the criterion (28) with the relationship

  λ_i = h_i / β, 1 ≤ i ≤ N.   (29)

The hyperparameters h_i specify the prior distributions of g_i. Since initially the optimal values of λ_i are unknown, all the λ_i should be initialised to the same small value, and this corresponds to choosing the same flat distribution for each prior of g_i in (27). The beauty of the Bayesian learning framework is that it learns not only the model parameters g_N but also the related hyperparameters h_N. This can be done by iteratively optimising g_N and h_N using the evidence procedure [24,39,60]. Applying this evidence procedure results in the following iterative updating formulas for the regularisation parameters [39]

  λ_i^{new} = (γ_i ε^T ε) / ((N − γ) g_i²), 1 ≤ i ≤ N,   (30)

where g_i for 1 ≤ i ≤ N denote the current estimated parameter values, and

  γ = Σ_{i=1}^{N} γ_i with γ_i = (w_i^T w_i) / (λ_i + w_i^T w_i).   (31)

Usually a few iterations (typically less than 10) are sufficient to find a (near) optimal λ_N. The detailed derivation of the updating Formulas (30) and (31), quoted from [39], can be found in Appendix B. The use of multiple regularisers or local regularisation is known to be capable of providing very sparse solutions [41,60]. It is highly desirable to select a sparse model by directly optimising the model generalisation capability, rather than minimising the training MSE. The OLS-LOO-LR algorithm achieves this objective by incrementally minimising the LOO MSE criterion, which is a measure of the model's generalisation performance [2,40,46,47,78-80]. At the n-th stage of the OLS forward selection procedure, an n-term model is selected. It can be shown that the LOO test error, denoted as ε_n^{(−k)}(k), for the selected n-term model is [40,46,47]

  ε_n^{(−k)}(k) = ε_n(k) / η_n(k),   (32)

where ε_n(k) is the usual n-term modelling error and η_n(k) is the associated LOO error weighting.
The LOO MSE for the model with a size n is then defined by

  J_n = (1/N) Σ_{k=1}^{N} (ε_n^{(−k)}(k))² = (1/N) Σ_{k=1}^{N} (ε_n(k) / η_n(k))².   (33)

The LOO MSE can be computed efficiently due to the fact that the n-term modelling error ε_n(k) and the associated LOO error weighting η_n(k) can be calculated recursively according to [40,46,47]

  ε_n(k) = ε_{n−1}(k) − w_{k,n} g_n   (34)

and

  η_n(k) = η_{n−1}(k) − w_{k,n}² / (w_n^T w_n + λ_n),   (35)

respectively, where w_{k,n} is the k-th element of w_n. The derivation of the LOO test error (32) together with the recursive Formulas (34) and (35) is detailed in Appendix C.
The subset model selection procedure is carried out as follows. At the n-th stage of the selection procedure, a model term is selected among the remaining n to N candidates if the resulting n-term model produces the smallest LOO MSE J_n. The selection procedure is terminated when

  J_{N_s+1} ≥ J_{N_s},   (36)

yielding an N_s-term sparse model. It has been shown in [46] that the LOO statistic J_n is at least locally convex with respect to the model size n. That is, there exists an "optimal" model size N_s such that J_n decreases as n increases until n reaches N_s, after which the condition (36) holds. This property is extremely useful, as it enables the selection procedure to be automatically terminated with an N_s-term model, without the need for the user to specify a separate termination criterion. The sparse regression model selection procedure based on the OLS-LOO-LR algorithm is now summarised as follows.
Initialisation: Set λ_i = 10^{−6} for 1 ≤ i ≤ N, and set the iteration index I = 1.
Step 1: Given the current λ_N and with the following initial conditions

  ε_0(k) = y_k and η_0(k) = 1, 1 ≤ k ≤ N, with J_0 = y^T y / N,

use the procedure described in Appendix D to select a subset model with N_I terms.
Step 2: Update λ_N using (30) and (31) with N = N_I. If the pre-set maximum iteration number (e.g. 10) is reached, stop; otherwise set I = I + 1 and go to Step 1.
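The forward selection loop driven by the LOO MSE recursions (34) and (35) can be sketched as follows. This is a simplified illustration with a single fixed regulariser lam in place of the evidence-updated local regularisation, and the candidate columns and data are invented for the example:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ols_loo(cols, y, lam=1e-6):
    """Forward selection: at each stage pick the candidate column that
    minimises the LOO MSE J_n, using the recursions (34) and (35);
    stop as soon as J_n stops decreasing, as in (36)."""
    N = len(y)
    eps = list(y)              # epsilon_0(k) = y_k
    eta = [1.0] * N            # eta_0(k) = 1
    J_prev = dot(y, y) / N     # J_0
    selected, W_sel = [], []
    while True:
        best = None
        for j, phi in enumerate(cols):
            if j in selected:
                continue
            w = list(phi)      # orthogonalise candidate against W_sel
            for wi in W_sel:
                a = dot(wi, phi) / dot(wi, wi)
                w = [wk - a * wik for wk, wik in zip(w, wi)]
            d = dot(w, w) + lam
            g = dot(w, y) / d
            e_new = [ek - wk * g for ek, wk in zip(eps, w)]       # (34)
            h_new = [hk - wk * wk / d for hk, wk in zip(eta, w)]  # (35)
            J = sum((e / h) ** 2 for e, h in zip(e_new, h_new)) / N
            if best is None or J < best[0]:
                best = (J, j, w, e_new, h_new)
        if best is None or best[0] >= J_prev:   # LOO MSE stopped decreasing
            break
        J_prev, j, w, eps, eta = best
        selected.append(j)
        W_sel.append(w)
    return selected, J_prev

cols = [[1.0] * 6, [0.0, 1, 2, 3, 4, 5], [1.0, -1, 1, -1, 1, -1]]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]   # roughly 2x + 1 plus a disturbance
sel, J = ols_loo(cols, y)
print(sel)   # the slope column (index 1) is picked first
```

Because each candidate's contribution is evaluated through the recursively updated ε and η vectors, no explicit matrix inversion per candidate is required, which is the source of the method's efficiency.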

Sparse Kernel Classifier Construction
Since the generic kernel classifier construction takes the same form as regression modelling, the OLS-LOO-LR algorithm described in the previous subsection can be applied to select a sparse kernel classifier. However, the goal of a classifier is to minimise the misclassification or error rate, and the MSE in general is not an appropriate criterion for classifier construction. Note that the class label y_k ∈ {−1, +1}. Define the signed decision variable

  s_k = y_k z_k,   (38)

where z_k is the classifier output for the pattern x_k. Then the misclassification rate over the training data set is evaluated as

  J = (1/N) Σ_{k=1}^{N} I_d(s_k),   (39)

where the indication function is defined by

  I_d(s) = 1 if s < 0, and I_d(s) = 0 if s ≥ 0.   (40)

The classifier's generalisation capability, however, is measured by the test error rate over data unseen in training. The same LOO cross validation concept [2,76,77] is adopted to provide a measure of the classifier's generalisation capability.
Let the k-th data sample be removed from the training data set D_N, and the resulting LOO data set be used to construct an n-term classifier. The test output of the obtained LOO n-term model evaluated at the k-th data sample not used in training is denoted by z_n^{(−k)}(k). The associated LOO signed decision variable is then defined by

  s_n^{(−k)}(k) = y_k z_n^{(−k)}(k),   (41)

and the LOO misclassification rate is computed as

  J_n = (1/N) Σ_{k=1}^{N} I_d(s_n^{(−k)}(k)).   (42)

This LOO misclassification rate is a measure of the classifier's generalisation capability. Moreover, the LOO signed decision variable s_n^{(−k)}(k) can be calculated very fast owing to the orthogonal decomposition and, therefore, the LOO misclassification rate J_n can be evaluated efficiently [52]. Specifically, the LOO n-term modelling error is expressed by (also see Appendix C)

  ε_n^{(−k)}(k) = y_k − z_n^{(−k)}(k) = ε_n(k) / η_n(k),   (44)

where ε_n(k) and η_n(k) are the usual n-term modelling error and the associated LOO error weighting. From (44), and noting that y_k² = 1, the LOO n-term signed decision variable is given by

  s_n^{(−k)}(k) = 1 − y_k ε_n(k) / η_n(k) = ψ_n(k) / η_n(k),   (45)

where ψ_n(k) = η_n(k) − y_k ε_n(k). The recursive formula for the LOO error weighting η_n(k) is given in (35), while ψ_n(k) can be represented using the following recursive formula [52]

  ψ_n(k) = ψ_{n−1}(k) + y_k w_{k,n} g_n − w_{k,n}² / (w_n^T w_n + λ).   (46)

The OLS-LOO-LR algorithm described in Subsection 3.1 can readily be applied to select a sparse kernel classifier with some minor modifications. These modifications are due to the fact that the selection criterion is the LOO misclassification rate (42) rather than the LOO MSE (33). Extensive empirical experience has also suggested that, for constructing sparse kernel classifiers, multiple regularisers or local regularisation, which is so effective in further enforcing model sparsity in regression, becomes unnecessary. Thus, all the regularisation parameters λ_i, 1 ≤ i ≤ N, can be set to a small positive constant λ, and there is no need to update them using the evidence procedure. The sparse kernel classifier selection procedure based on this OLS-LOO algorithm is summarised as follows.
Initialisation: Set λ to a small positive number, and with the following initial conditions

  ε_0(k) = y_k, η_0(k) = 1 and ψ_0(k) = η_0(k) − y_k ε_0(k), 1 ≤ k ≤ N,

use the procedure described in Appendix E to select a subset model with N_s terms. The selection procedure of Appendix E is essentially the same as that described in Appendix D, with only minor modifications connected with the computation of the LOO misclassification rate J_n. Note that the LOO misclassification rate J_n is also locally convex with respect to the classifier's size n. Thus there exists an optimal model size N_s such that J_n decreases as n increases until n reaches N_s. Therefore the selection procedure is automatically terminated with a subset classifier containing only N_s significant kernels.
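The LOO misclassification count at the heart of this procedure can be sketched directly: for ±1 labels, the LOO signed decision variable reduces to s = 1 − y_k ε_n(k)/η_n(k) (since y_k² = 1), and a LOO error is counted whenever s < 0. The ε and η values below are illustrative stand-ins for the recursively maintained quantities:

```python
def loo_error_rate(y, eps, eta):
    """y: +/-1 labels; eps, eta: the n-term modelling errors and LOO
    error weightings maintained by the forward-selection recursions."""
    errors = 0
    for yk, ek, hk in zip(y, eps, eta):
        s = 1.0 - yk * ek / hk        # LOO signed decision variable
        if s < 0:
            errors += 1
    return errors / len(y)

y   = [+1, -1, +1, -1]
eps = [0.3, -0.2, 1.9, -0.1]          # illustrative n-term modelling errors
eta = [0.9, 0.8, 0.9, 0.85]           # illustrative LOO error weightings
print(loo_error_rate(y, eps, eta))    # → 0.25 (only the third sample fails)
```

Because ε and η are already available from the regression recursions, the classifier criterion costs essentially nothing extra per candidate.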

Sparse Kernel Density Estimator Construction
As shown in Subsection 2.3, the generic kernel density estimation problem can be expressed as a constrained regression modelling problem, and the regression modelling part itself is identical to that described in Subsection 2.1. Therefore, the OLS-LOO-LR algorithm detailed in Subsection 3.1 can be used to select a sparse kernel density estimate. The only problem is that the kernel weights obtained by the OLS-LOO-LR algorithm for this sparse kernel density estimate do not necessarily meet the nonnegative constraint (16) and the unity constraint (17). This "deficiency" however can easily be corrected by using the MNQP algorithm to modify or update the kernel weights of the selected sparse model [44]. This combined OLS-LOO-LR and MNQP algorithm offers an effective means of obtaining sparse kernel density estimates with excellent generalisation capability. The detailed OLS-LOO-LR algorithm has been described in Subsection 3.1 and, therefore, only the MNQP part needs to be discussed. After the structure determination using the OLS-LOO-LR algorithm of Subsection 3.1, the kernel weights β_s of the selected N_s-term subset model are recomputed by minimising the constrained LS criterion

  J(β_s) = (1/2) β_s^T B_s β_s − v_s^T β_s, subject to β_i ≥ 0 and Σ_i β_i = 1,

where B_s = Φ_s^T Φ_s and v_s = Φ_s^T y are the design matrix and the target correlation vector of the selected subset model, respectively. Forming the Lagrangian for the unity constraint and applying the multiplicative update of [61,81] leads to the following updating equations

  c_i^{<t>} = β_i^{<t>} / (B_s β_s^{<t>})_i,
  h^{<t>} = (1 − Σ_i c_i^{<t>} v_i) / Σ_i c_i^{<t>},
  β_i^{<t+1>} = c_i^{<t>} (v_i + h^{<t>}),

where the superindex <t> denotes the iteration index and h is the Lagrangian multiplier. During the iterative procedure, some of the kernel weights may be driven to (near) zero [61,81]. The corresponding kernels can then be removed from the kernel model, leading to a further reduction in the subset model size.

Experimental Results

Scalar Function Example
In this example, the data were generated from a noisy scalar function (58), where x was uniformly distributed in (−10, 10) and the noise e was Gaussian distributed with zero mean and standard deviation 0.2. The first 200 data points were used for training and the other 200 samples for model validation. The kernel variance ρ² = 10.0 was found empirically to be optimal for this example. As a Gaussian kernel was placed on each training data point, there were N = 200 candidate regressors in the regression model (5). The training data were very noisy. In addition to using the noisy test data set for evaluating the model's generalisation performance, two hundred noise-free data f(x), with equally spaced x in (−10, 10), were also generated as a second test data set. The OLS-LOO-LR algorithm was applied to the noisy training data set, and the algorithm automatically selected a 7-term kernel model. The modelling accuracy of the resulting 7-term kernel model is summarised in Table 1, and the corresponding model mapping generated by this 7-term kernel model is depicted in Figure 1, in comparison with the true scalar function (58).

The relevance vector machine (RVM) algorithm [60] is an existing sparse kernel modelling algorithm that is often regarded as the state-of-the-art. It has the same excellent generalisation performance as the SVM algorithm but achieves a dramatically sparser kernel model than the SVM method. A drawback of the RVM method is a significant increase in computational complexity, compared with the SVM method. The iterative procedure for updating the multiple regularisers in the RVM method converges much more slowly, and may even suffer from numerical instability, compared with the efficient OLS-LOO-LR algorithm. A detailed comparison of these two sparse kernel modelling algorithms is given in [40]. The RVM algorithm was also applied to fit a sparse Gaussian kernel model for this example, and the algorithm produced the 15-term kernel model listed in Table 1. The model mapping generated by the 15-term kernel model constructed using the RVM algorithm is shown in Figure 2. It can be seen that the OLS-LOO-LR algorithm and the RVM algorithm both had the same excellent generalisation performance, but the former produced a much sparser model than the latter. The OLS-LOO-LR algorithm additionally had significant computational advantages in model construction.

Engine Data Set
This example constructed a model of the relationship between the fuel rack position (input u_k) and the engine speed (output y_k) for a Leyland TL11 turbocharged, direct injection diesel engine operated at low engine speed. It is known that at low engine speed the relationship between the input and the output is nonlinear [84]. A detailed system description and the experimental setup can be found in [84]. The data set, depicted in Figure 3, contained 410 samples. The first 210 data points were used in modelling and the last 200 points in model validation. The previous results [84] have shown that this data set can be modelled adequately by a nonlinear model of the form (1). As each input vector x_k in the training data set was considered as a candidate kernel centre, there were N = 210 candidate kernel regressors in the full regression model (5).
Both the OLS-LOO-LR algorithm and the SVM algorithm [56] were applied to this data set, and the two sparse Gaussian kernel models obtained are compared in Table 2.
Copyright © 2009 SciRes. ENGINEERING
The model output ŷ_k and the modelling error ε_k = y_k − ŷ_k generated by the kernel model obtained by the OLS-LOO-LR algorithm are depicted in Figure 4. The modelling performance of the 92-term kernel model constructed by the SVM algorithm, not shown here, is very similar to that shown in Figure 4. It can be seen that the two sparse regression modelling techniques achieved the same excellent generalisation performance, but the OLS-LOO-LR method obtained a much sparser model than the SVM method. It should be emphasised that the model size is critically important for this particular example. The main purpose of identifying a model for this engine system is to use it for designing a controller. A large model will make the controller design a very complex task and, moreover, the resulting controller will be difficult to implement in the real system. It is also worth emphasising that the OLS-LOO-LR algorithm has considerable computational advantages over the SVM algorithm. Both algorithms require the kernel width ρ to be determined. However, the SVM method has two more learning parameters, namely the error-band and trade-off parameters [56], that require tuning. Therefore, the OLS-LOO-LR algorithm is easier to tune and computationally more efficient than the SVM algorithm.

Boston Housing Data Set
This was a regression benchmark data set from the UCI repository [85]. The data set comprised 506 data points with 14 variables. The task was to predict the median house value from the remaining 13 attributes. From the data set, 456 data points were randomly selected for training and the remaining 50 data points were used to form the test set. Because a Gaussian kernel was placed at each training data sample, there were N = 456 candidate regressors in the full regression model (5).
The kernel width for the OLS-LOO-LR algorithm was determined via a grid-search based cross validation. Similarly, the three learning parameters of the SVM, namely the kernel width, error-band and trade-off parameters, were tuned via cross validation. Average results were computed over 100 repetitions, and the two sparse Gaussian kernel models obtained by the OLS-LOO-LR and SVM algorithms, respectively, are compared in Table 3. For the particular computational platform used in the experiment, the recorded average run time of the OLS-LOO-LR algorithm, with the kernel width fixed, was 200 times faster than that of the SVM algorithm with the kernel width, error-band and trade-off parameters chosen. It can be seen from Table 3 that the OLS-LOO-LR algorithm achieved better modelling accuracy with a much sparser model than the SVM algorithm. The inferior accuracy of the SVM model was probably because its three learning parameters, namely the kernel width, error-band and trade-off parameters, were not tuned to their optimal values. For this regression problem of input dimension 13 and data size N ≈ 500, the grid search required by the SVM algorithm to tune the three learning parameters was expensive, and the optimal values of the three learning parameters were hard to find, compared with, for example, the previous smaller engine data set.

Classification Benchmark
This classification benchmark came from the UCI repository [85], and the actual data set used in the experiment was obtained from [86]. The feature input space dimension was m = 9. There were 100 realizations of this data set, each containing 200 training patterns and 77 test patterns. In [60], the SVM and RVM algorithms were applied to the first 10 realizations of this data set, and the results given in [60] are reproduced in Table 4. The OLS-LOO algorithm described in Subsection 3.2 was also applied to construct Gaussian kernel classifiers for the same first 10 realisations of this data set, and the results obtained are summarised in Table 4, in comparison with those obtained by the SVM and RVM algorithms. In [86,87], seven existing state-of-the-art RBF and kernel classifier construction algorithms were compared, with their performance averaged over all the 100 realizations. The OLS-LOO algorithm was applied to all the 100 realizations of the data set to construct sparse Gaussian kernel classifiers, and the results obtained are given in Table 7, with the first seven methods quoted from [86,87]. It can be seen that the classification accuracy of the proposed OLS-LOO method is comparable to that of the SVM method, but the former achieved a much smaller model size than the latter.

For the further benchmark comparison of Table 6, the model size of the reference classifier was not reported, but it could safely be assumed that it was much larger than 100. The OLS-LOO algorithm was applied to construct sparse Gaussian kernel classifiers for this data set, and the results averaged over the 100 realisations are also listed in Table 6. It can be seen that the proposed OLS-LOO method produced the best classification accuracy with the smallest classifier.

Sparse Kernel Density Estimation
The aim was to test the combined OLS-LOO-LR and MNQP algorithm and to compare its performance with the Parzen window estimator as well as the previous sparse kernel density estimation algorithm [43]. The algorithm presented in [43], although also based on the OLS-LOO-LR regression framework, is very different from the current combined OLS-LOO-LR and MNQP algorithm. In particular, it transfers the kernels into the corresponding cumulative distribution functions and uses the empirical distribution function calculated on the training data set as the target function of the unknown cumulative distribution function. In other words, the regression framework is defined in the cumulative distribution function "space", not the original PDF "space". Converting the kernels into corresponding cumulative distribution functions can be inconvenient and may be difficult for certain types of kernels. Moreover, in the work [43], the unity constraint is met by normalising the kernel weight vector of the final selected model, which is nonoptimal, and the nonnegative constraint is ensured by adding a test to the OLS forward selection procedure: in each selection stage, a candidate that, if included, would cause the resulting kernel weight vector to have negative elements is not considered at all. This nonnegative test imposes considerable computational cost on the OLS selection procedure. The proposed combined OLS-LOO-LR and MNQP algorithm, in comparison, is computationally simpler.
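The MNQP step referred to above can be sketched with the standard multiplicative update for a nonnegative quadratic programme under a sum-to-one constraint. The toy matrices and the small clipping constant are illustrative assumptions; the sketch assumes a design matrix B with nonnegative entries, as holds for Gaussian kernel Gram matrices.

```python
import numpy as np

def mnqp(B, c, iters=200):
    """Multiplicative updates for: min 0.5*b'Bb - c'b  s.t.  b >= 0, sum(b) = 1.
    Assumes the entries of B are nonnegative (true for Gaussian Gram matrices)."""
    n = len(c)
    b = np.full(n, 1.0 / n)                  # feasible starting point
    for _ in range(iters):
        t = b / (B @ b)                      # elementwise ratio used by the update
        h = (1.0 - t @ c) / t.sum()          # Lagrange multiplier for sum-to-one
        b = np.maximum(t * (c + h), 1e-12)   # multiplicative step, kept nonnegative
        b /= b.sum()                         # re-impose the unity constraint exactly
    return b

# Toy check: the returned weights are nonnegative and sum to one.
rng = np.random.default_rng(2)
A = rng.random((5, 5))
B = A @ A.T + np.eye(5)                      # nonnegative entries, well conditioned
c = rng.random(5)
b = mnqp(B, c)
print(b.sum(), (b >= 0).all())
```

In the combined algorithm, the OLS-LOO-LR stage fixes which kernels appear in the model, and an update of this form then recomputes their weights so the estimate is a valid density.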
The first and third examples were one-dimensional and six-dimensional density estimation problems, respectively, where a data set of N randomly drawn samples was used to construct kernel density estimates based on the regression model (21), and a separate test data set of N_test = 10,000 samples was used to calculate the L1 test error for the resulting estimate. The experiment was repeated over N_run different random runs. The second example was a two-class two-dimensional classification problem taken from [6]. For all three examples, the value of the kernel width ρ was determined via cross validation.
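The precise form of the L1 test error was lost in extraction; a common choice in the sparse density estimation literature, assumed here, is the mean absolute deviation between the true and estimated densities over the test samples. The sketch below illustrates it with a Parzen window estimate of a standard normal density standing in for the paper's estimators.

```python
import numpy as np

def l1_test_error(true_pdf, est_pdf, X_test):
    """Mean absolute deviation between true and estimated densities on test points."""
    return np.mean(np.abs(true_pdf(X_test) - est_pdf(X_test)))

# Toy 1-D illustration: Parzen window estimate vs. the standard normal density.
rng = np.random.default_rng(3)
train = rng.normal(size=100)                 # N training samples
rho = 0.5                                    # illustrative kernel width
true_pdf = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
parzen = lambda x: np.mean(
    np.exp(-(x[:, None] - train[None, :]) ** 2 / (2 * rho ** 2)), axis=1
) / (np.sqrt(2 * np.pi) * rho)
X_test = rng.normal(size=10_000)             # N_test = 10,000 as in the experiments
err = l1_test_error(true_pdf, parzen, X_test)
print(round(err, 3))
```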

One-Dimensional Density Estimation
The one-dimensional density to be estimated was a mixture of a Gaussian and a Laplacian. The number of data points for density estimation was N = 100. The optimal kernel widths were found empirically to be ρ = 0.54 for the Parzen window estimate and ρ = 1.1 for the proposed sparse kernel density estimate, respectively. The experiment was repeated N_run = 200 times. Table 8 compares the performance of these two kernel density estimates, in terms of the L1 test error and the number of kernels required. Figure 5(a) plots a Parzen window estimate obtained, while Figure 5(b) illustrates a sparse kernel density estimate obtained by the combined OLS-LOO-LR and MNQP algorithm, in comparison with the true distribution. It can be seen that the accuracy of the proposed sparse kernel density estimate was comparable to that of the Parzen window estimate, and the combined OLS-LOO-LR and MNQP algorithm achieved a sparse estimate with an average kernel number of less than 6% of the data samples. The maximum and minimum numbers of kernels over the 200 runs were 9 and 2, respectively, for the sparse kernel density estimator. The previous sparse kernel density estimator, which uses the empirical distribution function as the desired response and is based on the OLS-LOO-LR algorithm only [43], was also applied to this example. Under identical experimental conditions, the results obtained by this sparse kernel density estimator are also given in Table 8, where it can be seen that both sparse kernel density estimators had a very similar performance, in terms of the L1 test error and the average number of kernels required.
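The exact Gaussian-Laplacian mixture was elided in extraction; to make the experimental setup concrete, a hypothetical mixture with the same structure can be sampled as follows. The mixing weight `w` and the locations `mu_g`, `mu_l` are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def sample_gauss_laplace_mixture(n, rng, w=0.5, mu_g=-2.0, mu_l=2.0):
    """Draw n samples from w*N(mu_g, 1) + (1-w)*Laplace(mu_l, 1).
    The mixture parameters here are illustrative, not those of the paper."""
    from_gauss = rng.random(n) < w
    return np.where(from_gauss,
                    rng.normal(mu_g, 1.0, n),
                    rng.laplace(mu_l, 1.0, n))

rng = np.random.default_rng(4)
x = sample_gauss_laplace_mixture(100, rng)   # N = 100 points, as in the experiment
print(x.shape)
```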

Two-Class Classification Using Density Estimation
This was a two-class classification problem in a two-dimensional feature space [6]. The training data set contained 250 samples with 125 points for each class, and the test data set had 1000 points with 500 samples for each class. The optimal Bayes test error rate based on the true underlying probability distribution for this example was known to be 8%. The task was first to estimate the two class-conditional densities from the training data, and then to classify the test data set with the resulting plug-in Bayes rule and calculate the corresponding test error rate, where N_C0 and N_C1, the numbers of class C0 and class C1 training data points, respectively, served as estimates of the class prior proportions. Table 9 lists the results obtained by the three kernel density estimators: the Parzen window estimator, the current sparse kernel density estimator based on the combined OLS-LOO-LR and MNQP algorithm, and the previous sparse kernel density estimator with the empirical distribution function as the desired response and based on the OLS-LOO-LR algorithm only [43], where the value of the kernel width was determined by minimizing the test error rate. It can be seen that the proposed sparse kernel density estimation method yielded very sparse conditional density estimates and achieved the optimal Bayes classification performance. This clearly demonstrates the accuracy of the density estimates.
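The plug-in classification procedure described above can be sketched as follows. The Gaussian class distributions and the width value are illustrative assumptions standing in for the paper's actual two-class data; with equal class counts, the plug-in Bayes rule reduces to comparing the two estimated class-conditional densities directly.

```python
import numpy as np

def parzen(train, rho):
    """Return a Gaussian Parzen window density estimate over d-dimensional points."""
    d = train.shape[1]
    norm = (2 * np.pi * rho ** 2) ** (d / 2)
    def pdf(X):
        d2 = ((X[:, None, :] - train[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * rho ** 2)).mean(axis=1) / norm
    return pdf

rng = np.random.default_rng(5)
X0 = rng.normal([-1.0, 0.0], 1.0, size=(125, 2))   # class C0 training data
X1 = rng.normal([+1.0, 0.0], 1.0, size=(125, 2))   # class C1 training data
p0, p1 = parzen(X0, 0.8), parzen(X1, 0.8)

Xt = rng.normal([-1.0, 0.0], 1.0, size=(500, 2))   # test points drawn from C0
# N_C0 = N_C1 here, so the prior factors cancel in the plug-in Bayes rule.
pred_c1 = p1(Xt) > p0(Xt)
print("C0 test error rate:", pred_c1.mean())
```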

Six-Dimensional Density Estimation
The data set for density estimation contained N = 600 samples. The optimal kernel width was found via cross validation to be ρ = 0.65 for the Parzen window estimate and ρ = 1.2 for the sparse kernel density estimate based on the combined OLS-LOO-LR and MNQP algorithm, respectively. The experiment was repeated N_run = 100 times. The results obtained by the two density estimators are summarized in Table 10. For this example, again, the two density estimates were seen to have comparable accuracies, but the proposed sparse kernel density estimate required an average number of kernels of less than 2% of the data samples. The maximum and minimum numbers of kernels over the 100 runs were 16 and 7, respectively, for the proposed sparse kernel density estimator.
This example was also used to test the sparse kernel density estimation method of [43] under identical experimental conditions. The results obtained by this previous sparse kernel density estimator, quoted from [43], are also given in Table 10 for comparison. It is seen from Table 10 that, for this higher-dimensional example, the proposed sparse kernel density estimator outperformed the previous sparse kernel density estimator in terms of both the test performance and the level of model sparsity.

Conclusions
A regression framework has been proposed for sparse modelling from data, which unifies the supervised regression and classification problems as well as the unsupervised probability density function learning problem in the same kernel regression model. A powerful orthogonal-least-squares algorithm has been developed for selecting very sparse kernel models that generalise well, based on the leave-one-out test criteria and coupled with local regularisation. For sparse kernel density estimation, in particular, a combined approach of the OLS-LOO-LR algorithm and multiplicative nonnegative quadratic programming has been proposed, with the OLS-LOO-LR algorithm selecting a sparse kernel density estimate while the MNQP algorithm computes the kernel weights of the selected final model to meet the constraints for density estimation. Empirical data-modelling results involving all three classes of data modelling, namely regression, classification and density estimation, have been presented to demonstrate the effectiveness of the proposed unified kernel regression modelling framework based on the OLS-LOO-LR algorithm, and the results shown have clearly confirmed that this proposed unified sparse kernel regression framework offers a truly state-of-the-art approach for data modelling applications. The unified regression framework developed in this contribution is based on the linear-in-the-parameters kernel model, where the full candidate kernel set is obtained by placing a kernel at each training data point and employing a fixed kernel width for all the kernel regressors. Further research has been conducted to develop a nonlinear-in-the-parameters regression model, where each regressor has a tunable base centre vector and diagonal covariance matrix. A powerful orthogonal-least-squares assisted forward selection procedure can be developed based on the leave-one-out test criteria and local regularisation.
At each stage of the construction procedure, a nonlinear base is constructed by optimising the appropriate LOO test criterion to determine the base's centre vector and diagonal covariance matrix. Such sparse data modelling techniques based on tunable nonlinear base units have been proposed for regression data modelling [88] and classification applications [89]. Sparse density estimation based on this novel tunable regression modelling framework is currently under investigation [90].

Acknowledgements
The author acknowledges the contributions of Professor Hong and Professor Chris J. Harris to the topic reported in this work.
Appendix

Let the k-th data sample be deleted from the training data set D_N, and the resulting leave-one-out training set be used to estimate the model parameter vector. The corresponding regularised least squares solution is defined over this reduced set, and the LOO test error evaluated at the k-th data sample, which is not used for training, follows from it. Applying the matrix inversion lemma to (81), the LOO test error can be evaluated without explicitly recomputing the leave-one-out solution for each k.
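The equation bodies here were lost in extraction, but the identities involved are the standard leave-one-out results for regularised least squares. As a sketch consistent with the surrounding derivation, with assumed notation Φ for the regressor matrix, φ_k^T its k-th row, λ the regularisation parameter, β̂ the full-data solution and Φ_{-k}, y_{-k} the quantities with the k-th sample deleted, they read:

```latex
% Sketch of the standard regularised-LS leave-one-out identities; the symbols
% (\Phi, \phi_k, \lambda, e_k^{(-k)}) are assumed notation, not the paper's own.
\begin{align}
\hat{\beta}^{(-k)} &= \left(\Phi_{-k}^{T}\Phi_{-k} + \lambda I\right)^{-1}
                      \Phi_{-k}^{T}\, y_{-k}, \\
e_k^{(-k)} &= y_k - \phi_k^{T}\hat{\beta}^{(-k)}
            = \frac{y_k - \phi_k^{T}\hat{\beta}}
                   {1 - \phi_k^{T}\left(\Phi^{T}\Phi + \lambda I\right)^{-1}\phi_k}.
\end{align}
```

The second equality is exactly where the matrix inversion lemma enters: it expresses the LOO residual in terms of the full-data residual and a leverage-type correction, so all N LOO errors can be obtained from a single model fit.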