Neural networks based on high-dimensional random feature generation have become popular under the notions of extreme learning machines (ELM) and reservoir computing (RC). We provide an in-depth analysis of such networks with respect to feature selection, model complexity, and regularization. Starting from an ELM, we show how recurrent connections increase the effective complexity, leading to reservoir networks. In contrast, intrinsic plasticity (IP), a biologically inspired, unsupervised learning rule, acts as a task-specific feature regularizer, which tunes the effective model complexity. Combining both mechanisms in the framework of static reservoir computing, we achieve an excellent balance of feature complexity and regularization, which provides an impressive robustness to other model selection parameters like network size, initialization ranges, or the regularization parameter of the output learning. We demonstrate the advantages on several synthetic data sets as well as on benchmark tasks from the UCI repository, providing practical insights into how to use high-dimensional random networks for data processing.

In the last decade, machine learning techniques based on random projections have attracted a lot of attention because in principle they allow for very efficient processing of large and high-dimensional data sets [

A prominent example is the extreme learning machine (ELM) as proposed in [

Another prominent example of random projections is the reservoir computing (RC) approach [

The combination of a fixed, randomly initialized recurrent network with a non-recurrent linear readout layer combines the advantages of recurrent networks with the ease, efficiency and optimality of linear regression methods. New applications for processing temporal data have been reported, for instance in speech recognition [9,10], sensorimotor robot control [11-13], detection of diseases [14,15], or flexible central pattern generators in biological modeling [

An intermediate approach that uses dynamic reservoir encodings for processing data in static classification and regression tasks has also been considered under the notion of attractor-based reservoir computing [17-19]. The rationale behind this approach is that a recurrent network can efficiently encode static inputs in its attractors [19,20]. In this contribution, we regard static reservoir computing as a natural extension of the ELM. We point out that recurrent connections significantly enrich the set of possible features for an ELM by introducing non-linear mixtures. They thereby enhance approximation capability and performance under limited resources like a finite network size. It is noteworthy that this approach does not affect the output learning, where we will still use standard linear regression.

A central issue for all learning approaches is model selection, and it is even more severe for random projection networks because large parts of the networks remain fixed after initialization. The neuron model, the network architecture and particularly the network size strongly determine the generalization performance, compare

Several techniques to automatically adapt the network's size to a given task have been considered [21-23], where success is always measured after retraining the output layer of the network. Despite these efforts, it remains a challenge to understand the interplay between model complexity, output learning, and performance: controlling the network size affects only the number of features rather than their complexity, and it ignores the effects of regularization both in the output learning and in the ratio of data points to the number of neurons.

An essential mechanism to consider in this context is regularization ([24-26], Section 7 in the appendix). In this paper we distinguish two different levels of regularization: output regularization with regard to the linear output learning and input or feature regularization with regard to the feature encoding produced in the hidden layer. Output regularization typically refers to Tikhonov regularization [

The designer also has to make choices with respect to the input processing, e.g. on the hyper-parameters governing the distributions of random parameter initialization, on proper pre-scaling of the input data, and on the type of non-linear functions involved.

It is therefore highly desirable to gain insight on the interaction between parameter or feature selection and output regularization. The goal is to provide constructive tools to robustly reduce the dependency of the network performance on the different parameter choices while keeping peak performance. To this aim, we investigate recurrence and intrinsic plasticity, an unsupervised biologically motivated learning rule that adjusts bias and slope of a neuron’s sigmoid activation function [

The remainder of the paper is organized as follows. We introduce the ELM including Tikhonov regularization in the output learning in Section 2. Then we add recurrent connections to increase feature complexity in Section 3, which results in greater capacity of the network and enhanced performance. Not unexpectedly, we observe a trade-off with respect to the risk of overfitting. In Section 4 we investigate the influence of IP pre-training on the mapping properties of ELMs and show that IP results in proper input-specific regularization. Here the trade-off is the risk of poor approximation when regularizing too much. We proceed in Section 5 to show synergy effects between IP feature regularization and recurrence when applying the IP learning rule to recurrently enhanced ELM networks. Whereas IP simplifies the feature pool and tunes the neurons to a good regime, recurrent connections introduce nonlinear mixtures and thereby avoid ending up with a too simple feature set. We show experimentally that these two processes balance each other such that we obtain complex but IP-regularized features with reduced overfitting. As a result, we obtain input-tuned reservoir networks that are less dependent on the random initialization and less sensitive to the choice of the output regularization parameter. We confirm this in experiments, where we observe consistently good performance over a wide range of network initialization and learning parameters.

In 2004, Huang et al. introduced the extreme learning machine (ELM) [

The activations of the ELM input, hidden and output neurons are denoted by x, h and y, respectively (see the corresponding figure), with W^{inp} and W^{out} denoting the input and read-out weights. We consider parametrized activation functions

where s_{r} = Σ_{d=1}^{D} W^{inp}_{rd} x_{d} is the total activation of each hidden neuron h_{r} for input x and D is the input dimension. We denote a_{r} as the slope and b_{r} as the bias of the activation function f_{r}(·). The output y of an ELM is y = W^{out} h.

The key idea of the ELM approach is to restrict learning to the linear readout layer. All other network parameters, i.e. the input weights W^{inp} and the activation function parameters a, b stay fixed after initialization of the network.

The ELM is trained on a set of training examples indexed by n = 1, ···, N_{tr} by minimizing the mean squared error

E = (1/N_{tr}) Σ_{n} ||y^{tgt}_{n} – y_{n}||²

between the target outputs y^{tgt}_{n} and the actual network outputs y_{n} with respect to the read-out weights W^{out}. Given the fixed parameters and hidden activations h, the minimization reduces to a linear regression task as follows. We collect the network's states h_{n} in a state matrix H = (h_{1}, ···, h_{N_{tr}}) and the desired output targets in a target matrix Y for all n = 1, ···, N_{tr}, respectively. The minimizer is the least squares solution

W^{out} = Y H^{†},

where H^{†} is the pseudo-inverse of the matrix H.
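The training procedure above can be sketched in a few lines of NumPy. This is a hypothetical minimal setup: the dimensions, the logistic activation function, and the toy target are assumptions, and the state matrix is stored row-wise (one state per row) rather than column-wise as in the text, so the pseudo-inverse is applied from the left.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: D inputs, R hidden neurons, N_tr training samples.
D, R, N_tr = 1, 100, 50

# Fixed random input weights, slopes a and biases b (never trained).
W_inp = rng.uniform(-1.0, 1.0, size=(R, D))
a = np.ones(R)
b = rng.uniform(-0.5, 0.5, size=R)

def hidden(X):
    """Hidden activations h_r = f(a_r * s_r + b_r) with a logistic f."""
    S = X @ W_inp.T                      # total activations s_r, shape (N, R)
    return 1.0 / (1.0 + np.exp(-(a * S + b)))

# Toy regression data (a stand-in target, not the paper's task).
X = np.linspace(-1.0, 1.0, N_tr).reshape(-1, 1)
Y = np.sinc(3 * X)

H = hidden(X)                            # state matrix, rows are h_n
W_out = np.linalg.pinv(H) @ Y            # least-squares readout via pseudo-inverse

print(np.mean((Y - H @ W_out) ** 2))     # training MSE
```

Because only `W_out` is learned, training costs one pseudo-inverse; all randomly initialized parameters stay fixed, exactly as described above.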

The ELM approach is appealing because of its apparently efficient and simple training procedure. However, problems arise, for instance, if the network size R is too small, because then the network suffers from poor approximation abilities. This is illustrated in the corresponding figure (N_{tr} = 50), where we show the dependency of the ELM's generalization ability on the random distribution of the input weights W^{inp}, the network size R and the biases b on the Mexican hat regression task (cf. Section C.2 for this often employed illustrative task). In such cases, model selection becomes an important issue, since the generalization ability depends strongly on the choice of the model's parameters, e.g. output regularization or network size.

Since the ELM is based on the empirical risk minimization principle, overfitting can be counteracted by using a large number of training samples (e.g. N_{tr} = 1000) or by using small network sizes. Assuming noise in the data, it is well known that this is equivalent to some level of output regularization [33,34]. It is therefore natural to consider output regularization directly as a more appropriate technique for arbitrary network and training data sizes as e.g. in [27,35]. As a state-of-the-art method, Tikhonov regularization extends the error function by a penalty on the norm of the read-out weights:

E = (1/N_{tr}) Σ_{n} ||y^{tgt}_{n} – y_{n}||² + ε ||W^{out}||²,  (4)

and the regularized minimizer then becomes

W^{out} = Y H^{T} (H H^{T} + ε I)^{–1},

which is, as a side effect, also numerically more stable because of the better conditioned matrix inverse. A suitable regularization parameter ε needs to be chosen carefully. Too strong regularization, i.e. too large ε, can result in poor performance, because it limits the effective model complexity inappropriately [
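The regularized minimizer can be sketched as follows; this is a minimal illustration assuming the row-wise state matrix convention (one state per row, so the normal equations read (H^{T}H + εI)W = H^{T}Y), with random matrices standing in for actual network states.

```python
import numpy as np

def ridge_readout(H, Y, eps):
    """Tikhonov-regularized least squares for the readout weights.

    Row convention: H is (N_tr, R), Y is (N_tr, n_out). Solves the
    regularized normal equations instead of forming a pseudo-inverse,
    which is better conditioned for small eps > 0.
    """
    R = H.shape[1]
    return np.linalg.solve(H.T @ H + eps * np.eye(R), H.T @ Y)

# Toy check: as eps -> 0, the ridge solution approaches the
# (unregularized) pseudo-inverse solution for a full-rank H.
rng = np.random.default_rng(1)
H = rng.standard_normal((50, 10))
Y = rng.standard_normal((50, 1))
W_ridge = ridge_readout(H, Y, 1e-8)
W_pinv = np.linalg.pinv(H) @ Y
print(np.max(np.abs(W_ridge - W_pinv)))  # small
```

Increasing `eps` shrinks the readout weights and thereby limits the effective model complexity, which is exactly the trade-off discussed above.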

In the ELM paradigm, a typical heuristic is to scale the data to [–1, 1] and to set the activation function parameters a to one. Tuning of the input weights W^{inp} or the activation function parameters a, b is then considered unnecessary, because a random initialization of these parameters is sufficient to create a rich feature set. In practice, however, the hidden layer size is limited and the performance does indeed depend on the hyper-parameters controlling the distributions of the initialization, at least of the input weights and the biases b. Very small weights result in approximately linear neurons with no contribution to the approximation capability, whereas large weights drive the neurons into saturation, resulting in a binary encoding. This is illustrated in the corresponding figure for N_{tr} = 50, where we vary the initialization range of the input weights and biases b.

Finally, the number R of hidden neurons plays a central role and several techniques have been investigated to automatically adapt the hidden layer size. The error minimized extreme learning machine [

In summary, the performance of the ELM on a broader range of tasks depends on a number of model selection choices: the network size, the output regularization (or the equivalent choice of training set and network size), and the hyper-parameters for initialization. Methods that reduce the sensitivity of the performance to these parameters are therefore highly desired.

Adding recurrent connections to the hidden layer of an ELM converts it into a corresponding reservoir network^{1} (RN) (see the machine learning view on RNs in the corresponding figure). To understand how random projections work in these models, we compare an ELM and the corresponding reservoir network on the same tasks. We argue that the additional mixing effect of the recurrence enhances model complexity. The hypothesized effect can be visualized and evaluated on three levels: for the single feature, for the learned function, and with respect to the task performance.

We first consider the level of a single neuron and the feature it computes in a given architecture. We define such a feature F_{r} as the response of the r-th reservoir neuron h_{r} to the full range of possible inputs from the network’s input space:

F_{r}(x) = h̃_{r}(x),

where h̃_{r}(x) denotes the network's converged attractor state (cf. Section B). The feature can easily be visualized, as e.g. in the corresponding figure. The complexity on this level can, however, not easily be quantified and we therefore also consider the network level.
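Computing such a feature requires iterating the recurrent dynamics for a constant input until a fixed point is reached. The following sketch makes the usual assumptions: logistic neurons, and a recurrent matrix scaled to spectral radius below one so that the iteration contracts to an attractor state (the paper's exact initialization may differ).

```python
import numpy as np

rng = np.random.default_rng(2)
D, R = 1, 100

# Fixed random input weights and biases, as in the ELM.
W_inp = rng.uniform(-1.0, 1.0, size=(R, D))
b = rng.uniform(-0.5, 0.5, size=R)

# Recurrent weights, rescaled to spectral radius 0.9 (assumption) so
# that the iteration for a constant input settles into a fixed point.
W_rec = rng.standard_normal((R, R))
W_rec *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_rec)))

def attractor_state(x, max_iter=500, tol=1e-9):
    """Iterate h <- f(W_inp x + W_rec h + b) until convergence."""
    h = np.zeros(R)
    for _ in range(max_iter):
        h_new = 1.0 / (1.0 + np.exp(-(W_inp @ x + W_rec @ h + b)))
        if np.max(np.abs(h_new - h)) < tol:
            return h_new
        h = h_new
    return h

# The r-th feature value for input x = 0.3 is simply component r of
# the converged state.
h_star = attractor_state(np.array([0.3]))
```

Sweeping `x` over the input range and plotting component `r` of `h_star` visualizes the feature F_r; with `W_rec = 0` the code reduces to a plain ELM feature.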

To assess the effective model complexity, we consider the mean curvature (MC) of the network’s output function, which directly evaluates a property of the learned model. On the one hand, this measure is closely connected to the output regularization introduced in Section A. Typical choices for regularization functionals in (9) punish high curvatures such as strong oscillations. The network’s effective model complexity is reduced [

For these reasons, we measure the MC while decreasing the effective model complexity through either increasing the regularization parameter ε of the output regularization or decreasing the network size R and we expect qualitatively similar developments for varying both model selection parameters. Experiments are performed on the Mexican hat task and the default initialization parameters are shown in Section C.1. Due to the stochastic nature of parameter initialization, we average the MC over 30 networks and test each ELM and the corresponding RN for comparison.
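A simple proxy for such a curvature measurement of a one-dimensional output function is the average absolute second derivative, estimated by finite differences. This is an assumption for illustration; the paper's exact MC definition may differ.

```python
import numpy as np

def mean_curvature(f, lo=-1.0, hi=1.0, n=1000):
    """Proxy for the mean curvature of a 1-D function: the average
    absolute second derivative on [lo, hi], via finite differences."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    d2 = np.gradient(np.gradient(y, x), x)  # second derivative estimate
    return np.mean(np.abs(d2))

# A smooth function has a lower MC than a strongly oscillating one,
# which is the overfitting signature discussed in the text.
print(mean_curvature(np.sin) < mean_curvature(lambda x: np.sin(20 * x)))  # True
```

Applied to the trained network's output function, this number rises when the fit oscillates more strongly than the target, i.e. when the network overfits.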

The results, shown in the corresponding figure, indicate that weakly regularized networks exhibit a MC that is larger than the MC of the target function. This is an indication of overfitting. We also find that the ELM and the corresponding RN have very similar MCs, except for the unregularized case, where the RN overfits more strongly. This is expected, because the more complex features of the RN provide a larger model complexity, which is favorable if the network size is limited. Note that the results for varying network size use a regularization of ε = 10^{–5}, which is close to optimal and as such already prevents overfitting quite well. Vice versa, the results for varying ε are given for a network size of R = 100, which is clearly suitable for the task. This once more underlines that model selection and regularization are important issues.

From the above, we expect that measuring task performance on training and test data displays a typical overfitting pattern. For small networks or too strong regularization, training and test performance are poor, for increasing regularization and for larger network size the test error reaches a minimum and then starts increasing, while the training error keeps decreasing. This is exactly the case in

The results of the last section show the higher complexity of the RN in comparison to the ELM, which is caused by the non-linear mixing of features. While the exact class of features which is thereby produced is unknown, [

Thereby g(k) = (Σ_{i=1}^{k} λ_{i}) / (Σ_{i=1}^{R} λ_{i}) denotes the cumulative energy content of the first k principal components, and λ_{1} ≥ ··· ≥ λ_{R} ≥ 0 are the eigenvalues of the covariance matrix corresponding to the principal components (PCs) of the network's attractor or hidden state distribution. In principle, the cumulative energy content measures the increased dimensionality of the hidden data representation compared to the dimensionality D of the input data x. The case of g(D) < 1 implies a shift of the input information to additional PCs, because the encoded data then spans a space with more than D latent dimensions. If g(D) = 1, no information content shift occurs, which is true for any linear transformation of the data. The experiments conducted with several data sets from the UCI repository [
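The cumulative energy content is straightforward to compute from the eigenvalues of the state covariance matrix. The sketch below uses random stand-in data (not the UCI sets) and contrasts a linear projection, for which g(D) = 1, with a saturating nonlinear encoding, for which g(D) < 1.

```python
import numpy as np

def cumulative_energy(H, k):
    """g(k): fraction of total variance of the hidden states H
    (rows are samples) captured by the first k principal components."""
    lam = np.linalg.eigvalsh(np.cov(H.T))[::-1]  # lambda_1 >= ... >= 0
    lam = np.clip(lam, 0.0, None)                # guard tiny negative values
    return lam[:k].sum() / lam.sum()

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 2))                # D = 2 input data
H_lin = X @ rng.standard_normal((2, 50))         # linear projection of X
H_nl = np.tanh(2.0 * (X @ rng.standard_normal((2, 50))))  # saturating encoding

print(cumulative_energy(H_lin, 2))               # close to 1: no shift
print(cumulative_energy(H_nl, 2))                # clearly below 1: shift to extra PCs
```

The linear encoding keeps all variance in D = 2 latent dimensions, whereas the nonlinearity spreads information over additional PCs, which is exactly the effect g(D) is meant to capture.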

In the previous section, we have shown that overfitting can occur when using an ELM and is even stronger when a corresponding RN with its richer feature set is used. Output regularization can counteract this effect, but it needs proper tuning of the regularization parameter. Hence, we propose a different route to directly tune the features of an ELM and the corresponding RN with respect to the input. A machine learning view on this idea is visualized in the corresponding figure. It builds on previous work [38,39], where IP was shown to provide robustness against both varying weight and learning parameters. We show that IP in our context works as an input regularization mechanism. Again, we analyze the resulting networks on all three levels: with respect to feature complexity, by means of the MC, and by evaluating task performance.

Intrinsic Plasticity (IP) was developed by Triesch in 2004 as an unsupervised learning rule that adapts slope and bias of a neuron's activation function such that the neuron's output distribution f_{h} approximates an exponential distribution f_{exp} with fixed mean μ. It minimizes the Kullback-Leibler divergence D(f_{h}, f_{exp}) between the output distribution f_{h} and the exponential distribution f_{exp}:

D(f_{h}, f_{exp}) = –H(h) + (1/μ) E(h) + log μ,  (6)

where H(h) denotes the entropy and E(h) the expectation value of the output distribution. In fact, minimization of D(f_{h}, f_{exp}) in Eq. (6) for a fixed E(h) is equivalent to entropy maximization of the output distribution. For small mean values, i.e. μ ≈ 0.2, the neuron is forced to respond strongly only for a few input stimuli. The following online update equations for slope and bias, scaled by the step-width η_{IP}, are obtained:

Δb = η_{IP} (1 – (2 + 1/μ) h + h²/μ),
Δa = η_{IP} (1/a + s – (2 + 1/μ) s h + s h²/μ).  (7)

The only quantities used to update the neuron’s non-linear transfer function are s, the synaptic sum arriving at the neuron, the firing rate h and its squared value h^{2}. Since IP is an online learning algorithm, training is organized in epochs: For a pre-defined number of training epochs the network is fed with the entire training data and each hidden neuron is adapted to the network’s current input separately. Within the ELM paradigm, IP is used as a pre-training algorithm to optimize the hidden layer features before output regression is applied.
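The epoch-wise pre-training described above can be sketched as follows. The neuron model (logistic), the learning parameters, and the random stand-in inputs are assumptions; the update equations are Triesch's rule from Eq. (7).

```python
import numpy as np

def ip_step(a, b, s, eta=0.001, mu=0.2):
    """One online IP update for logistic neurons h = f(a*s + b),
    driving the output distribution toward an exponential with mean mu.
    Uses only s (synaptic sum), h and h^2, as in Eq. (7)."""
    h = 1.0 / (1.0 + np.exp(-(a * s + b)))
    db = eta * (1.0 - (2.0 + 1.0 / mu) * h + h ** 2 / mu)
    da = eta / a + s * db
    return a + da, b + db

# Hypothetical pre-training loop: feed the training inputs for a fixed
# number of epochs; each of the R neurons adapts to its own synaptic sum.
rng = np.random.default_rng(4)
a, b = np.ones(100), np.zeros(100)          # initial slopes and biases
S = rng.standard_normal((50, 100))          # synaptic sums per sample/neuron
for epoch in range(200):
    for s in S:                              # one epoch over the training data
        a, b = ip_step(a, b, s)

H_out = 1.0 / (1.0 + np.exp(-(a * S + b)))
print(H_out.mean())                          # mean output drifts toward mu
```

After IP pre-training, the readout regression is applied exactly as before; only the hidden-layer parameters a and b have changed.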

Since IP adapts the parameters a and b of the hidden neurons’ activation function it directly influences the features generated by an ELM.

On the network level, we evaluate model complexity again by means of the MC and the network performance on the Mexican hat regression task. We apply readout learning after each epoch to monitor the impact of IP on these measures over epochs. Learning and initialization parameters are collected in Section C.1. For illustration we choose the size of the ELMs’ hidden layer as R = 100 and the number of samples used for training as N_{tr} = 50 such that the ELM is prone to show overfitting and the effect of regularization can be observed clearly. The results are shown in

The task performance shown in

In

Finally, we plot 30 trained ELMs each for no IP pre-training, a medium number of IP epochs, and too many IP epochs in the corresponding figure.

The ELMs without IP training (a) clearly show the typical oscillations due to overfitting; a suitable number of IP pre-training epochs (b) leads to consistently good results, whereas too long IP pre-training (c) tends to reduce the model complexity inappropriately, so that the mapping is no longer accurately approximated. The set of corresponding features is shown in

The experiments in this section clearly reveal the regulatory nature of IP as a task-specific feature regularization for ELMs.

We now show that the combination of recurrence and IP can achieve a balance between task-specific regularization by means of IP and a large modeling capability by means of recurrence. Whereas this is interesting from a theoretical point of view, it turns out that this combination also strongly enhances the robustness of the performance with respect to other model selection parameters and eases the burden of performing grid search or other optimization of those parameters. To obtain results comparable to the experiments performed in the last sections, we add recurrent connections to the hidden layer of the ELMs to obtain the corresponding RN (see

We repeat the experiments from the previous Section 4 with the corresponding reservoir networks instead of ELMs. The network settings are given in Section C.1. The MC development with respect to IP-training of the reservoir networks is illustrated in

Figures 15(b) and (c) show the performance and the bias/variance decomposition. The behavior of the networks shows similar characteristics as the ELMs under the influence of IP (compare to

In previous sections, we used synthetic data and a rather simple one-dimensional task to clearly state and illustrate the concepts. We now investigate the enhanced intrinsic model complexity, which is due to the addition of recurrent connections, in a more complex function approximation task where the task complexity can be controlled by a single parameter. The target function is a two-dimensional sine function (cf. Section C.3), where the frequency ω is proportional to its mean curvature and the difficulty of the task.