An Acoustic Events Recognition for Robotic Systems Based on a Deep Learning Method

In this paper, we present a new approach to classifying and recognizing acoustic events for multiple autonomous robot systems based on deep learning. For disaster response robotic systems, recognizing certain acoustic events in a noisy environment is very effective for carrying out a given operation. In our approach, trained deep learning networks constructed from Restricted Boltzmann Machines (RBMs) classify acoustic events from input waveform signals. The usefulness of our approach is discussed and verified on the basis of experimental results.


Introduction
Social insects and other social animals can accomplish more than their individual abilities allow by working in concert with other individuals. They usually communicate with each other through sounds and vibrations. Also, certain animals recognize events in their surrounding environment using acoustic information obtained from environmental sounds.
On the other hand, multiple autonomous robot systems, or swarm robot systems, need to be developed for disaster response and search-and-rescue missions. The terrible nuclear plant disaster in Japan is a well-known reminder of the necessity of robotic response systems. These systems are expected to accomplish difficult missions through cooperation among relatively simple robots. In this case, detecting and recognizing environmental information is a very important function of the whole system. Usually, vision-based recognition mechanisms are adopted in autonomous robotic systems. However, in swarm robot systems, each robot has a comparatively simple structure without a camera, and each robot acts on the basis of local information that it can easily acquire. Also, sound information from beyond a wall cannot be recognized by vision-based systems alone. For example, it is very effective to detect and recognize explosion sounds or human voices from the other side of a wall in a noisy environment. Therefore, we focus on developing classification and recognition mechanisms for acoustic events.
Recognizing acoustic events is becoming a key component of multimedia computational systems of all types, including robotic systems. Until now, identifying real-world acoustic events has been attempted with several methodologies, e.g., a layered Hidden Markov Model (HMM).
In real environments, it is necessary to consider that an observed sound contains multiple sound sources mixed together. For example, the sound of the environment surrounding a living space is a mixture of voices, music, car engine sounds, and other everyday sounds. Therefore, it is important to separate the sound sources or to detect a typical sound at a certain time. In this paper, we focus on detecting a typical sound at a certain time, and we discuss acoustic event classification and recognition mechanisms based on a deep learning structure. This structure is constructed from Restricted Boltzmann Machines (RBMs). In the experiments, we configured a deep network based on convolutional RBMs and convolutional Deep Belief Nets. The learning and classification results of the models are compared and discussed.

Binary Visible Units and Binary Hidden Units
An RBM [1] [2] is an undirected graphical model used to describe the dependency among a set of random variables over a set of observed data. In this model, the stochastic visible units v are connected to the stochastic hidden units h. The joint distribution p(v, h) over the visible and hidden units is defined through an energy function E(v, h):

$$p(v, h) = \frac{1}{Z} \exp\left(-E(v, h)\right),$$

and the probability density p(v) over the visible units is defined as:

$$p(v) = \frac{1}{Z} \sum_{h} \exp\left(-E(v, h)\right),$$

where $Z = \sum_{v} \sum_{h} \exp\left(-E(v, h)\right)$ is the normalization factor (or partition function), which can be estimated by the annealed importance sampling (AIS) method.
In the common case (where $v_i, h_j \in \{0, 1\}$), the energy function E(v, h) of an RBM is defined as:

$$E(v, h) = -\sum_{i} b_i v_i - \sum_{j} c_j h_j - \sum_{i} \sum_{j} v_i W_{ij} h_j,$$

where $v_i$, $h_j$ are the states of visible unit i and hidden unit j, $b_i$, $c_j$ are their biases, and $W_{ij}$ is the weight between them. Since an RBM has no intra-layer connections, the visible unit activations are conditionally independent given the hidden units, and the hidden unit activations are conditionally independent given the visible units. Therefore, the conditional probabilities $p(v_i \mid h)$ and $p(h_j \mid v)$ of activating each unit are represented by simple functions:

$$p(h_j = 1 \mid v) = \mathrm{sigmoid}\left(c_j + \sum_{i} W_{ij} v_i\right),$$

$$p(v_i = 1 \mid h) = \mathrm{sigmoid}\left(b_i + \sum_{j} W_{ij} h_j\right),$$

where $\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$ is the standard sigmoid function.
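The relations above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the standard binary RBM energy E(v, h) = -b·v - c·h - vᵀWh; the function names, the toy sizes (3 visible units, 2 hidden units), and the random weights are hypothetical and not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, b, c):
    """Binary RBM energy: E(v, h) = -b.v - c.h - v^T W h (assumed standard form)."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

def p_h_given_v(v, W, c):
    """p(h_j = 1 | v) for every hidden unit j."""
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):
    """p(v_i = 1 | h) for every visible unit i."""
    return sigmoid(b + W @ h)

# Toy model; sizes and values are illustrative only.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))   # W[i, j]: weight between v_i and h_j
b = np.zeros(3)                          # visible biases
c = np.zeros(2)                          # hidden biases
v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])

print("E(v, h)    =", energy(v, h, W, b, c))
print("p(h=1 | v) =", p_h_given_v(v, W, c))
print("p(v=1 | h) =", p_v_given_h(h, W, b))
```

Because there are no intra-layer connections, each conditional is computed for all units of a layer with a single matrix-vector product.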

Gaussian Visible Units
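For real-valued data such as natural images or Mel-Frequency Cepstral Coefficients, the Bernoulli-Bernoulli (or binary-binary) form is a poor representation. However, an RBM can model the distribution of real-valued data by adopting its Gaussian-Bernoulli (or Gaussian-binary) form [3] [4], in which the visible units are real-valued, $v_i \in \mathbb{R}$, and the hidden units remain binary. In this case, the energy function E(v, h) of an RBM is defined as:

$$E(v, h) = \sum_{i} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j} c_j h_j - \sum_{i} \sum_{j} \frac{v_i}{\sigma_i} W_{ij} h_j,$$

and the conditional probability $p(v_i \mid h)$ is defined as:

$$p(v_i \mid h) = \mathcal{N}\left(v_i;\ b_i + \sigma_i \sum_{j} W_{ij} h_j,\ \sigma_i^2\right),$$

where $\mathcal{N}(x; \mu, \sigma^2)$ is the Gaussian probability density with mean $\mu$ and variance $\sigma^2$, and $\sigma_i^2$ is the variance parameter of the Gaussian noise on visible unit i.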
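As a rough sketch of the Gaussian-Bernoulli case, the snippet below assumes one commonly used form of the energy, in which the interaction term is scaled by 1/σ_i so that the conditional mean of v_i is b_i + σ_i Σ_j W_ij h_j. All function names are hypothetical, and the exact parameterization used by the authors may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gb_energy(v, h, W, b, c, sigma):
    """Gaussian-Bernoulli energy (assumed form):
    sum_i (v_i - b_i)^2 / (2 sigma_i^2) - c.h - (v / sigma)^T W h."""
    quadratic = np.sum((v - b) ** 2 / (2.0 * sigma ** 2))
    return quadratic - c @ h - (v / sigma) @ W @ h

def conditional_mean(h, W, b, sigma):
    """Mean of p(v_i | h): b_i + sigma_i * sum_j W_ij h_j."""
    return b + sigma * (W @ h)

def sample_v_given_h(h, W, b, sigma, rng):
    """Draw real-valued visible units from N(mean_i, sigma_i^2)."""
    return rng.normal(loc=conditional_mean(h, W, b, sigma), scale=sigma)

def p_h_given_v(v, W, c, sigma):
    """p(h_j = 1 | v) with visible units scaled by their standard deviation."""
    return sigmoid(c + (v / sigma) @ W)

# Toy usage; every size and value here is illustrative.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
b = np.array([0.5, -0.2, 0.1])
c = np.zeros(2)
sigma = np.array([1.0, 2.0, 0.5])
h = np.array([1.0, 0.0])

v = sample_v_given_h(h, W, b, sigma, rng)
print("sampled v  =", v)
print("p(h=1 | v) =", p_h_given_v(v, W, c, sigma))
```

In practice the inputs are often standardized to zero mean and unit variance so that σ_i can be fixed to 1; that is a common convention, not something stated in the text.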

Contrastive Divergence Learning Algorithm
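The CD-k algorithm [5] is a fast algorithm for approximating the gradients of the log-likelihood. Given a set of training data, the model parameters $\theta = \{W, b, c\}$ of an RBM are estimated by maximum likelihood learning of p(v). The model parameters that maximize the log-likelihood are generally determined with a stochastic gradient method. The gradient of this log-likelihood is given through the energy function E(v, h) of an RBM:

$$\frac{\partial \log p(v)}{\partial \theta} = -\left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{\mathrm{data}} + \left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{\mathrm{model}},$$

where $\langle \cdot \rangle_{\mathrm{data}}$ denotes the expectation under $p(h \mid v)$ with v fixed to the training data, and $\langle \cdot \rangle_{\mathrm{model}}$ denotes the expectation under the model distribution p(v, h). However, this gradient is difficult to calculate exactly, because the computational cost of the model expectation increases exponentially with the number of units. The CD algorithm approximates the gradients of the log-likelihood using k-step Gibbs sampling with the conditional probabilities $p(v \mid h)$ and $p(h \mid v)$. This gradient is given as:

$$\frac{\partial \log p(v)}{\partial \theta} \approx -\left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{0} + \left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{k},$$

where $\langle \cdot \rangle_{0}$ is the expectation at the training data and $\langle \cdot \rangle_{k}$ is the expectation after k steps of Gibbs sampling. Therefore, for the binary RBM, the gradients of each parameter are given as:

$$\frac{\partial \log p(v)}{\partial W_{ij}} \approx \langle v_i h_j \rangle_{0} - \langle v_i h_j \rangle_{k},$$

$$\frac{\partial \log p(v)}{\partial b_i} \approx \langle v_i \rangle_{0} - \langle v_i \rangle_{k},$$

$$\frac{\partial \log p(v)}{\partial c_j} \approx \langle h_j \rangle_{0} - \langle h_j \rangle_{k}.$$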
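A minimal sketch of a CD-k update for a binary RBM follows. It assumes mini-batch averaging and uses the hidden probabilities rather than sampled hidden states when forming the statistics, which is a common practical choice and not something the text specifies. The function name `cd_k_gradients` and the random toy data are hypothetical; the learning rate and batch size mirror the values reported in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradients(v0, W, b, c, rng, k=1):
    """Approximate log-likelihood gradients of a binary RBM with CD-k.

    v0: batch of binary visible vectors, shape (batch, n_visible).
    Returns (dW, db, dc), ascent directions averaged over the batch.
    """
    ph0 = sigmoid(c + v0 @ W)                  # p(h = 1 | v0): statistics at the data
    vk, phk = v0, ph0
    for _ in range(k):                         # k steps of alternating Gibbs sampling
        hk = (rng.random(phk.shape) < phk).astype(float)   # h ~ p(h | v)
        pvk = sigmoid(b + hk @ W.T)                         # p(v = 1 | h)
        vk = (rng.random(pvk.shape) < pvk).astype(float)   # v ~ p(v | h)
        phk = sigmoid(c + vk @ W)                           # p(h = 1 | v_k)
    n = v0.shape[0]
    dW = (v0.T @ ph0 - vk.T @ phk) / n         # <v_i h_j>_0 - <v_i h_j>_k
    db = (v0 - vk).mean(axis=0)                # <v_i>_0 - <v_i>_k
    dc = (ph0 - phk).mean(axis=0)              # <h_j>_0 - <h_j>_k
    return dW, db, dc

# Toy usage on random binary data (illustrative only, not the paper's data).
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
batch = (rng.random((100, n_visible)) < 0.5).astype(float)   # batch size 100

learning_rate = 0.0001
for _ in range(5):
    dW, db, dc = cd_k_gradients(batch, W, b, c, rng, k=1)
    W += learning_rate * dW                    # gradient ascent on the log-likelihood
    b += learning_rate * db
    c += learning_rate * dc
```

With k = 1 this corresponds to CD1: one reconstruction of the visible layer per update.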

Experiment Condition
We configured a deep neural network based on the convolutional RBM [6] and convolutional Deep Belief Nets (CDBN) [7], as shown in Figure 1. The network has three convolution layers, two max pooling layers, and one fully connected layer. The settings of each layer are listed in Table 1. For comparison, we also configured a Convolutional Neural Network (CNN) [8] [9] with the same parameters.
In the pre-training step, each layer is trained as a standard RBM using patches cut out from its inputs, in order to reduce the computational cost. CD learning with 1-step Gibbs sampling (CD1) was adopted for RBM training, with a learning rate of 0.0001. The batch size was set to 100, and 100 epochs were executed to estimate each RBM. In the fine-tuning step and in training the CNN, we used the Adam learning method [10] and early stopping.
We used the training and test datasets of the D-CASE challenge [11] for our experiments. This dataset contains recordings of 16 categories of typical sounds in office environments. Noise has also been added to both the training data and the test data. 256-dimensional spectrograms were derived from the waveforms by STFT analysis using a 512-point Hamming window with a frame shift of 10 milliseconds. We then constructed spectrogram patches of 256 frequency bins × 100 milliseconds from each spectrogram using a shift of 50 milliseconds.
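Early stopping can be sketched independently of the network itself. The helper below is a hypothetical illustration: it assumes that a validation loss is recorded once per epoch and that training halts after `patience` consecutive epochs without improvement. Neither the stopping criterion nor the patience value is given in the text.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return (best_epoch, stopped_epoch) for a sequence of per-epoch losses.

    best_epoch is the epoch whose weights would be kept;
    stopped_epoch is the last epoch that was actually run.
    """
    best_loss = float("inf")
    best_epoch = None
    stopped_epoch = None
    waited = 0
    for epoch, loss in enumerate(validation_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # no improvement for `patience` epochs in a row
    return best_epoch, stopped_epoch

# Illustrative loss curve: it improves, then starts to over-fit.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6]
print(early_stopping_epoch(losses, patience=3))   # (2, 5)
```

In this toy curve training stops at epoch 5, so the later value 0.6 is never reached; a larger patience trades extra training time for a lower chance of stopping too soon.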
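The feature extraction described above might look roughly as follows. This sketch makes several assumptions that are not in the text: a 16 kHz sampling rate, magnitude (not power or log) spectra, and dropping the Nyquist bin of the 512-point FFT to obtain 256 frequency bins. The function names `spectrogram` and `make_patches` are hypothetical.

```python
import numpy as np

def spectrogram(wave, sample_rate, n_fft=512, shift_ms=10):
    """Magnitude STFT with a Hamming window; returns shape (n_frames, 256)."""
    hop = int(sample_rate * shift_ms / 1000)          # frame shift in samples
    window = np.hamming(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))        # 257 bins for n_fft = 512
    return spec[:, :256]                              # assumption: drop the Nyquist bin

def make_patches(spec, shift_ms=10, patch_ms=100, patch_shift_ms=50):
    """Cut a spectrogram into patches of shape (256, frames_per_patch)."""
    frames_per_patch = patch_ms // shift_ms           # 100 ms / 10 ms = 10 frames
    step = patch_shift_ms // shift_ms                 # 50 ms / 10 ms = 5 frames
    starts = range(0, spec.shape[0] - frames_per_patch + 1, step)
    return np.stack([spec[s : s + frames_per_patch].T for s in starts])

# Toy input: one second of a 1 kHz sine at an assumed 16 kHz sampling rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(wave, sample_rate)
patches = make_patches(spec)
print(spec.shape)      # (97, 256)
print(patches.shape)   # (18, 256, 10)
```

With a 10 millisecond frame shift, each 100 millisecond patch spans 10 frames and consecutive patches overlap by half.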

Results and Discussions
In the experiments, the networks over-fitted (Figure 2 and Figure 3). Also, the transitions of each value during fine-tuning of the CDBN and during training of the CNN are similar. In the classification results after learning, the F-measure of each network is not significantly different from that of random classification (Figure 4). Judging from the transition of each value on the training data and from the outcomes of deep learning in recent years, we believe that the cause lies in the representation of the input data. In recent years, it has become known that deep learning can extract better features from FFT spectrograms. However, our results suggest that this may not be effective in all cases. We think that better results would be obtained if the frequency analysis were performed with an auditory filter that takes the hearing mechanism of organisms into account.
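For reference, a per-category F-measure can be computed as the harmonic mean of precision and recall. The sketch below assumes one true label and one predicted label per test item; the evaluation protocol actually used for Figure 4 is not described in the text and may differ. The category names are placeholders.

```python
def f_measure_per_category(true_labels, predicted_labels, categories):
    """Return {category: F-measure} from paired true and predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for cat in categories:
        tp = sum(1 for t, p in pairs if t == cat and p == cat)   # true positives
        fp = sum(1 for t, p in pairs if t != cat and p == cat)   # false positives
        fn = sum(1 for t, p in pairs if t == cat and p != cat)   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[cat] = 2 * precision * recall / (precision + recall)
        else:
            scores[cat] = 0.0
    return scores

# Placeholder labels for illustration only.
true_labels = ["speech", "speech", "cough", "door"]
predicted_labels = ["speech", "cough", "cough", "cough"]
scores = f_measure_per_category(true_labels, predicted_labels,
                                ["speech", "cough", "door"])
print(scores)
```

A classifier that guesses uniformly at random among 16 categories gets low precision and recall in every category, which is the baseline against which the per-category scores are judged.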

Conclusions
In swarm robot systems, each robot has a comparatively simple structure without a camera, and each robot acts on the basis of local information that it can easily acquire. In this case, detecting and recognizing environmental information is a very important function of the whole system. Therefore, we focus on developing classification and recognition mechanisms of acoustic events for swarm robots.
In this paper, we proposed acoustic event classification and recognition mechanisms based on a deep learning structure. This structure is constructed from RBMs. However, in the experiments in this paper, we could not obtain sufficient recognition accuracy in noisy environments. Judging from the transition of each value on the training data and from the outcomes of deep learning in recent years, we believe that the cause lies in the representation of the input data. We think that better results would be obtained if the frequency analysis were performed with auditory filters that take the hearing mechanism of organisms into account.
In future work, we plan to incorporate an auditory filter into our approach, and we expect this to improve the recognition accuracy.

Figure 1. Structure of the network used in our experiments.

Figure 4. F-measure score of each category.

Table 1. Settings of each layer. conv, pool, and full denote convolution layer, pooling layer, and fully connected layer, respectively.