Optimizing Feedforward Neural Networks Using Biogeography Based Optimization for E-Mail Spam Identification

Spam e-mail has a significant negative impact on individuals and organizations, and is considered a serious waste of resources, time and effort. Spam detection is a complex and challenging task. In the literature, researchers and practitioners have proposed numerous approaches for automatic e-mail spam detection. Learning-based filtering is one of the important approaches, in which a filter is trained to extract knowledge that can then be used to detect spam. In this context, Artificial Neural Networks are widely used machine-learning-based filters. In this paper, we propose the use of a common type of Feedforward Neural Network, the MultiLayer Perceptron (MLP), for e-mail spam identification, where the weights of the network are found using a new nature-inspired metaheuristic algorithm called Biogeography Based Optimization (BBO). Experiments on two different spam datasets show that the MLP model trained by BBO achieves high generalization performance compared to other optimization methods used in the literature for e-mail spam detection.


Introduction
Spam can be defined as a form of unwanted communication, usually sent in large volume, that negatively affects network bandwidth, server storage, user time and work productivity [1]-[4]. In the context of the internet, spammers utilize several applications including email systems, social network platforms, web blogs, web forums and search engines [5]. Email spam is commonly used for advertising products and services typically related

Multilayer Perceptron Neural Networks
The human brain has the ability to perform multi-tasking. These tasks include several activities such as controlling the human body temperature, controlling blood pressure, heart rate, breathing, and other tasks that enable human beings to see, hear, and smell. The brain can perform these tasks at a rate that is far less than the rate at which the conventional computer can perform the same tasks [25]. The cerebral cortex of the human brain contains over 20 billion neurons, with each neuron linked by up to 10,000 synaptic connections [25]. These neurons are responsible for transmitting nerve signals to and from the brain. Very little is known about how the brain actually works, but there are computer models that try to simulate the same tasks that the brain carries out. These computer models are called Artificial Neural Networks, and the method by which a Neural Network is trained is called a Learning Algorithm, which has the duty of training the network and modifying its weights in order to obtain a desired response.
The neuron (node) of a neural network is made up of three components:
1. A synapse (connection link), which is characterised by its own weight.
2. An adder for summing the input signals, which are weighted by the synapses of the neuron.
3. An activation function to compute the output of the neuron.
The main Neural Network architectures are the Feedforward Neural Network (FFNN) and the Recurrent Neural Network (RNN).
The most common and well-known Feedforward Neural Network (FFNN) model is the Multi-Layer Perceptron (MLP). Consider an MLP with K input units, N internal (hidden) units, and L output units, where $s = (s_1, s_2, \ldots, s_K)$, $x = (x_1, x_2, \ldots, x_N)$, and $y = (y_1, y_2, \ldots, y_L)$ are the inputs of the input nodes, the outputs of the hidden nodes, and the outputs of the output nodes, respectively. $b_j$ and $b_l$ are the biases in the input and output layers. A three-layer MLP is shown in Figure 1.
In the forward pass, the activations are propagated from the input layer to the output layer. The activations of the hidden nodes are the weighted inputs from all the input nodes plus the bias $b_j$. The activation of the jth hidden node is denoted as $\mathrm{net}_j$ and computed according to:

$$\mathrm{net}_j = \sum_{i=1}^{K} V_{ji} s_i + b_j,$$

where $V_{ji}$ is the connection weight from the ith input node to the jth hidden node. In the hidden layer, the corresponding output of the jth node ($x_j$) is usually calculated based on a sigmoid function as follows:

$$x_j = f(\mathrm{net}_j) = \frac{1}{1 + e^{-\mathrm{net}_j}}.$$

The outputs of the hidden layer $(x_1, x_2, \ldots, x_N)$ are used as inputs to the output layer. The activation of the output nodes $(y_1, y_2, \ldots, y_L)$ is also defined as the weighted inputs from all the hidden nodes plus the bias $b_l$, where $W_{lj}$ is the connection weight from the jth hidden node $x_j$ to the lth (linear) output node:

$$y_l = \sum_{j=1}^{N} W_{lj} x_j + b_l.$$

The backward pass starts by propagating back the error between the current output $y_l$ and the teacher output $\hat{y}_l$ in order to modify the network weights and the bias values. The classical MLP network attempts to minimise the Error (E) via the Backpropagation (BP) training algorithm [26], where for each epoch the Error (E) is computed as:

$$E = \sum_{p=1}^{P} \sum_{l=1}^{L} \left( y_l^{(p)} - \hat{y}_l^{(p)} \right)^2,$$

where P is the number of patterns.
In an MLP, all the network weights and bias values are initially assigned random values, and the goal of training is to find the set of network weights that causes the output of the network to match the teacher values as closely as possible. MLP has been successfully applied in a number of applications, including regression problems [27], classification problems [28], and time series prediction using simple auto-regressive models [29].
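As an illustration, the forward pass and the squared-error measure described in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names, the nested-list layout of the weights (`V` for input-to-hidden, `W` for hidden-to-output) and the unnormalised sum of squared errors are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mlp_forward(s, V, b_hidden, W, b_out):
    """Forward pass of a three-layer MLP.

    s        : list of K input values
    V        : N rows of K input-to-hidden weights (assumed layout)
    b_hidden : N biases added to the hidden activations
    W        : L rows of N hidden-to-output weights (assumed layout)
    b_out    : L biases added to the (linear) output nodes
    Returns (hidden outputs x, network outputs y).
    """
    x = [sigmoid(sum(V[j][i] * s[i] for i in range(len(s))) + b_hidden[j])
         for j in range(len(V))]
    y = [sum(W[l][j] * x[j] for j in range(len(x))) + b_out[l]
         for l in range(len(W))]
    return x, y


def sum_squared_error(outputs, targets):
    """Sum of squared differences over all patterns and output nodes."""
    return sum((y - t) ** 2
               for ys, ts in zip(outputs, targets)
               for y, t in zip(ys, ts))


# Tiny example: K = 2 inputs, N = 3 hidden nodes, L = 1 output node.
V = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
W = [[1.0, 1.0, 1.0]]
x, y = mlp_forward([0.3, -0.7], V, [0.0, 0.0, 0.0], W, [0.5])
print(x, y)
```

With all input-to-hidden weights set to zero, every hidden node outputs 0.5, so the single linear output is the sum of the three hidden outputs plus its bias.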

Biogeography Based Optimization
Biogeography based optimization (BBO) is an evolutionary computation algorithm motivated by a natural process (biogeography), which was originally introduced by Dan Simon in 2008 [28]. BBO typically optimizes a multidimensional real-valued function by improving candidate solutions with regard to a given measure of quality, or fitness function. It optimizes a given problem by combining an existing population of candidate solutions with newly created candidate solutions according to a simple formula. In this way the objective function is treated as a black box that provides a measure of quality (fitness) for a given candidate solution [22]. The environment of BBO is analogous to an archipelago of islands, where each island is considered a possible solution to the problem [30]. Each island consists of decision variables called suitability index variables (SIVs). The performance of each island is measured by an objective function; in our case we use the habitat suitability index (HSI) as the performance measure. The BBO algorithm shares existing SIVs between islands through migration and randomly creates new SIVs through mutation [30]. The BBO algorithm can be described by the following steps [30].
1. Define the BBO parameters, including the mutation probability and the elitism parameter, in the same way as for genetic algorithms (GAs).
2. Initialize the population, again in the same way as for genetic algorithms (GAs).
3. Calculate the immigration and emigration rates for each island: good solutions have a maximum emigration rate and a minimum immigration rate, while bad solutions have a maximum immigration rate and a minimum emigration rate.
4. Choose the immigrating islands based on the immigration rates.
5. Migrate randomly selected suitability index variables (SIVs) based on the previously selected islands.
6. Perform mutation for each island.
7. Replace the worst islands in the population with the newly generated islands.
8. If the termination condition is met, terminate; otherwise, go to step 3.
BBO has been associated with several evolutionary computation algorithms, including particle swarm optimization [31], evolution strategy [32], harmony search [33], and case-based reasoning [34]. Moreover, it has been extended to noisy [35] and multi-objective functions [36], and has been mathematically evaluated using Markov chain [37] and dynamic system [38] models.
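The steps listed above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation, and several details are assumptions not stated in the text: a linear, rank-based migration model, roulette-wheel selection of the emigrating island, mutation by uniform re-sampling within fixed bounds, and elitism implemented by overwriting the worst islands of each new generation with the best islands of the previous one. The function and parameter names (`bbo_minimize`, `cost`, `n_elite`, and so on) are hypothetical. The sketch minimises a cost, so a lower cost corresponds to a better HSI.

```python
import random


def bbo_minimize(cost, dim, bounds=(-1.0, 1.0), pop_size=20,
                 generations=100, p_mutation=0.05, n_elite=2, seed=0):
    """Minimise `cost` over `dim` real-valued SIVs.

    Returns (best_island, best_cost). Step numbers refer to the list above.
    """
    rng = random.Random(seed)  # step 1: parameters are the function arguments
    lo, hi = bounds
    # Step 2: random initial population, sorted so index 0 is the best island.
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    pop.sort(key=cost)
    # Step 3 (assumed linear model): rates depend only on rank, so they can be
    # computed once. The best island emigrates most and immigrates least.
    mu = [(pop_size - k) / (pop_size + 1) for k in range(pop_size)]  # emigration
    lam = [1.0 - m for m in mu]                                      # immigration
    for _ in range(generations):
        elites = [island[:] for island in pop[:n_elite]]
        new_pop = [island[:] for island in pop]
        for k in range(pop_size):
            for d in range(dim):
                # Steps 4-5: island k immigrates SIV d with probability lam[k];
                # the source island is drawn in proportion to emigration rate.
                if rng.random() < lam[k]:
                    source = rng.choices(range(pop_size), weights=mu, k=1)[0]
                    new_pop[k][d] = pop[source][d]
                # Step 6: mutation re-samples the SIV uniformly within bounds.
                if rng.random() < p_mutation:
                    new_pop[k][d] = rng.uniform(lo, hi)
        # Step 7 (elitism assumption): the worst new islands are overwritten
        # with the elites kept from the previous generation.
        new_pop.sort(key=cost)
        new_pop[pop_size - n_elite:] = elites
        pop = sorted(new_pop, key=cost)
    # Step 8: the termination condition here is a fixed number of generations.
    return pop[0], cost(pop[0])


def sphere(v):
    """Simple test function with its minimum of 0 at the origin."""
    return sum(x * x for x in v)


best, best_cost = bbo_minimize(sphere, dim=3, generations=50)
print(best, best_cost)
```

Because the elites are copied back unchanged, the best cost in the population can never get worse from one generation to the next. The cost function is re-evaluated on every sort for brevity; a practical implementation would cache these values.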

BBO for Training MLP
In our work the BBO algorithm is used for optimizing the MLP network as follows:
1. A predetermined number of habitats is generated. Each habitat represents a set of weights of an MLP network; therefore, a habitat corresponds to one MLP network.
2. The fitness value of each candidate network generated in step 1 is calculated. In our implementation we use the Mean Squared Error (MSE), so the goal is to minimize the difference between the actual and estimated values. The set of weights is assigned to an MLP network and the MSE (HSI) is calculated on the training dataset.
3. The emigration, immigration and mutation rates are updated as described in the previous section.
4. MLP networks are combined, selected and mutated.
5. Some MLP networks with high fitness (low MSE value) are kept intact and passed to the next generation.
6. Steps 2 to 5 are repeated until the predetermined number of iterations is reached.
The best MLP, with the lowest MSE value, is then tested and evaluated on a separate, unseen dataset.
The whole process is shown in Figure 2.
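Step 2 of this procedure, in which a habitat is mapped to an MLP and scored by its MSE, might look as follows in Python. This is a sketch under stated assumptions: the ordering of the weights inside the flat habitat vector (input-to-hidden weights, then hidden biases, then hidden-to-output weights, then output biases) is not given in the text and is chosen here for illustration, as are the names `decode_habitat` and `habitat_mse`.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_habitat(habitat, K, N, L):
    """Split a flat habitat into (V, b_hidden, W, b_out) using an assumed layout."""
    expected = K * N + N + N * L + L
    if len(habitat) != expected:
        raise ValueError(f"habitat has {len(habitat)} SIVs, expected {expected}")
    pos = 0
    V = [habitat[pos + j * K: pos + (j + 1) * K] for j in range(N)]
    pos += K * N
    b_hidden = habitat[pos:pos + N]
    pos += N
    W = [habitat[pos + l * N: pos + (l + 1) * N] for l in range(L)]
    pos += N * L
    b_out = habitat[pos:pos + L]
    return V, b_hidden, W, b_out


def habitat_mse(habitat, K, N, L, inputs, targets):
    """MSE of the MLP encoded by `habitat` on the given training patterns."""
    V, b_hidden, W, b_out = decode_habitat(habitat, K, N, L)
    total = 0.0
    for s, t in zip(inputs, targets):
        # Sigmoid hidden layer followed by linear output nodes.
        x = [sigmoid(sum(v * si for v, si in zip(V[j], s)) + b_hidden[j])
             for j in range(N)]
        y = [sum(w * xj for w, xj in zip(W[l], x)) + b_out[l]
             for l in range(L)]
        total += sum((yl - tl) ** 2 for yl, tl in zip(y, t))
    return total / (len(inputs) * L)


# Two patterns with K = 2 features, N = 2 hidden nodes, L = 1 output node.
inputs = [[1, 2], [3, 4]]
targets = [[1], [0]]
print(habitat_mse([0.0] * 9, 2, 2, 1, inputs, targets))
```

A function such as `habitat_mse`, with the training data bound in, can then serve as the cost that a BBO routine minimises, so that a lower MSE corresponds to a better HSI.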

Datasets
In order to assess the BBO-MLP approach in identifying e-mail spam, we apply it to two different datasets. The first dataset is extracted from the SpamAssassin public mail corpus. This dataset consists of 9346 records with 90 features. Each example in the data is labeled as Ham or Spam. The data includes 6951 ham e-mails and 2395 spam e-mails. Spam therefore forms approximately 25.6% of the e-mails, which makes the data imbalanced and therefore more challenging. The full description of the features can be found in [9]. The second dataset is obtained from the University of California at Irvine (UCI) Machine Learning Repository [39]. This dataset consists of 4601 instances and 57 features. Approximately 39.4% of the e-mails in this data are spam e-mails. The collected features in this data are based on the frequency of some selected words and special characters in the e-mails.
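The class proportions quoted for the SpamAssassin data can be checked with a few lines of Python. The snippet below is only a worked calculation from the counts given above; the function name `class_shares` is hypothetical. It also shows why the imbalance matters for an accuracy-based evaluation: a trivial filter that labels every e-mail as ham would already be correct for the ham share of the data.

```python
def class_shares(n_spam, n_ham):
    """Return the fractions of spam and ham in a labelled dataset."""
    total = n_spam + n_ham
    return n_spam / total, n_ham / total


spam_share, ham_share = class_shares(n_spam=2395, n_ham=6951)
print(f"spam share: {spam_share:.1%}")                         # 25.6%
print(f"accuracy of always predicting ham: {ham_share:.1%}")   # 74.4%
```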

Experiments and Results
The developed BBO-MLP classifier is evaluated using the two datasets described in the previous section and compared with the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), Ant Colony Optimization (ACO) and Back Propagation (BP) algorithms. All algorithms, including BBO, are tuned as listed in Table 1. For all metaheuristic algorithms, the number of individuals in the population and the number of iterations are unified. Each individual represents the connection weights of the neural network that connect the input layer to the hidden layer, the weights from the hidden layer to the output layer, and the set of bias terms. In our experiments, we tried different numbers of neurons in the hidden layer of the trained networks: 5, 10 and 15 neurons. Both datasets are split equally into two parts: one is used for training and the other is used for testing. Each experiment was repeated 10 times in order to get statistically significant results.
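To give a sense of the size of the search space, the length of one individual can be computed from the description above. The sketch below assumes a single output node and one bias per hidden and output node, neither of which is stated explicitly in the text; the equal split is shown as a plain random shuffle, which is also an assumption, and the helper names are hypothetical.

```python
import random


def individual_length(n_inputs, n_hidden, n_outputs=1):
    """Number of weights and biases encoded in one individual (assumed layout)."""
    return (n_inputs * n_hidden      # input-to-hidden weights
            + n_hidden               # hidden biases
            + n_hidden * n_outputs   # hidden-to-output weights
            + n_outputs)             # output biases


def split_half(records, seed):
    """Shuffle the records and split them into two (nearly) equal parts."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


for name, n_features in (("SpamAssassin", 90), ("Spambase", 57)):
    for n_hidden in (5, 10, 15):
        print(name, n_hidden, individual_length(n_features, n_hidden))

# Ten repetitions could use ten different seeds for the split and the trainer.
train, test = split_half(range(4601), seed=0)
print(len(train), len(test))
```

Under these assumptions, a Spambase network with 5 hidden neurons has 296 parameters and one with 15 hidden neurons has 886, so the larger networks roughly triple the dimensionality that each metaheuristic has to search.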
Figure 3 and Figure 4 show the convergence curves of the metaheuristic algorithms on the Spambase and SpamAssassin datasets, respectively. The figures show that the BBO trainer achieved the fastest convergence and the lowest curves, while GA and PSO come second and ACO is the worst.
ACO settings from Table 1: pheromone constant (q) = 1; global pheromone decay rate ($p_g$) = 0.9; local pheromone decay rate ($p_t$) = 0.5; pheromone sensitivity (α) = 1; visibility sensitivity (β).
For each of the training algorithms (GA, PSO, DE, ACO, BP, and BBO), we find the best representative MLP model and then evaluate it based on the accuracy rate, which is the number of correctly classified instances divided by the total number of instances. Table 2 and Table 3 show the best, average and standard deviation values achieved by each approach for the Spambase and SpamAssassin datasets, respectively. According to the tables, the BBO trainer achieved the highest average and best accuracy rates for both datasets (shown in bold). It can also be noticed that 5 neurons in the hidden layer were sufficient to train the MLP network: most of the metaheuristic trainers did not achieve better results with 10 or 15 neurons in the hidden layer.
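The accuracy rate and the best, average and standard deviation summary reported in the tables can be computed as in the following sketch. The function names are hypothetical, the numbers in the example are made up purely to exercise the code and are not results from the paper, and the use of the sample (rather than population) standard deviation is an assumption.

```python
import statistics


def accuracy(predicted, actual):
    """Number of correctly classified instances divided by the total."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def summarize(run_accuracies):
    """Best, average and sample standard deviation over repeated runs."""
    return {
        "best": max(run_accuracies),
        "average": statistics.mean(run_accuracies),
        "std": statistics.stdev(run_accuracies),
    }


# Illustrative labels only (1 = spam, 0 = ham).
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
# Illustrative accuracies from three hypothetical runs.
print(summarize([0.5, 0.75, 1.0]))
```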

Conclusion
In this work, a recent nature-inspired metaheuristic algorithm named Biogeography Based Optimizer is used to train the Multilayer Perceptron neural network for the purpose of e-mail spam identification. The developed approach is evaluated and compared with four metaheuristic algorithms (GA, PSO, DE and ACO) and the gradient descent based Backpropagation algorithm. Two spam datasets, with features extracted from the content of the e-mails, are used. The developed BBO-based training approach showed significant improvement in the accuracy of identifying spam e-mails compared to the other approaches. The results of the experiments support the conclusion that BBO is very effective in avoiding local minima and has a relatively fast convergence

Figure 1. An example of the topology of the Multi-Layer Perceptron (MLP).

Table 1. Parameter settings of the metaheuristic algorithms.

Table 3. Results for the SpamAssassin dataset.