Artificial neural networks (ANNs), among different soft computing methodologies, are widely used to meet the challenges posed by the main objectives of data mining classification techniques, due to their robust, powerful, distributed, fault-tolerant computing and their capability to learn in a data-rich environment. ANNs have been used in several fields, showing high performance as classifiers. One major obstacle that prevents using them with various data sets and domains is the problem of dealing with non-numerical data. Another problem is their complex structure and how hard they are to interpret. The Self-Organizing Map (SOM) is a type of neural network that can be easily interpreted, but it still cannot be used with non-numerical data directly. This paper presents an enhanced SOM structure to cope with non-numerical data, using DNA sequences as the training dataset. Results show very good performance compared to other classifiers. For better evaluation, both the micro-array structure and its sequential representation as proteins were targeted as datasets, and accuracy is measured accordingly.

Bioinformatics can be defined as the science of managing and analyzing biological data using advanced computing techniques. One of the main challenges in this area is information discovery from massive biological data [

Recently, the Self-Organizing Map (SOM) has received attention as a data mining and knowledge discovery technique due to its highly beneficial properties [3-5]. A key characteristic of the SOM is its topology-preserving ability to map a multi-dimensional input onto a two-dimensional form. This qualifies the SOM as a good tool for data classification and clustering [6-8].

A data mining approach based on the SOM for clustering, feature selection, and classification is introduced. The SOM is employed by redesigning several of its training phases to cope with the complex nature of DNA sequences, and by integrating evolutionary techniques into the learning process, using crossover and mutation to produce new features within the neighbor sequences of the winning unit in every training iteration. Finally, a set of class and cluster representatives is generated. The main advantage of the proposed approach is that no interpretation phase is needed.

Sequence alignment is also employed in the introduced model. It is the method of arranging DNA or other sequences to indicate their regions of similarity, to infer that a new sequence is similar to previously known genes, or to compare new sequences with all known sequences. Sequence alignment has two computational approaches: local alignment and global alignment. Local alignment seeks only (relatively) conserved pieces of the sequence, and the alignment stops at the ends of regions of strong similarity; an example of a local technique is the Smith-Waterman (SW) algorithm.

Global alignment identifies the similarity regions along the entire length, from end to end, of two or more sequences. Many algorithms are applied to the sequence alignment problem, such as Dynamic Programming (DP), which is slow but optimal. The standard global technique based on dynamic programming is the Needleman-Wunsch (NM&W) algorithm.

The rest of the paper is organized as follows: Section 2 presents a background of DNA sequences classification techniques. Section 3 describes the SOM algorithm. Section 4 describes the phases of the proposed system. Section 5 presents the experimental results, and Section 6 concludes the paper.

During the past decades, advances in genomics have generated a wealth of biological data, increasing the discrepancy between what is observed and what is actually known about life’s organization at the molecular level. To gain a deeper understanding of the processes underlying the observed data, pattern recognition techniques play an essential role.

Machine learning techniques are generally applied to the following problems: classification, clustering, construction of probabilistic graphical models, and optimization.

The goal of classification is to divide objects into classes, based on the characteristics of the objects.

The rule that is used to assign an object to a particular class is termed the classification function, classification model, or classifier. Many problems in bioinformatics can be cast as classification problems, and well-established methods can then be used to solve the task [9-11]. The classification of micro-array data is often the first step towards a more detailed analysis of the organism, as in [12, 13].

DNA sequence classification is a main class of problems in bioinformatics that depends on clustering, also termed unsupervised learning, because no class information is known a priori. The goal of clustering is to find natural groups (clusters) in the data, where objects in one cluster should be similar to each other while at the same time being different from the objects in other clusters. Clustering in bioinformatics is concerned with the clustering of microarray expression data [14,15] and the grouping of sequences, e.g. to build phylogenetic trees. Probabilistic graphical models, in turn, represent multivariate joint probability densities via a product of terms, each of which involves only a few variables. The structure of the problem is then modeled using a graph that represents the relations between the variables, which allows reasoning about the properties entailed by the product. Examples are Bayesian methods for constructing phylogenetic trees [

Additionally, many problems in computational biology involve searching for unknown repeated patterns, often called motifs, and identifying regularities in nucleic or protein sequences. Both imply inferring patterns, of unknown content at first, from one or more sequences. Regularities in a sequence may come under many guises. They may correspond to approximate repetitions randomly dispersed along the sequence, or to repetitions that occur in a periodic or approximately periodic fashion. The length and number of repeated elements one wishes to be able to identify may be highly variable [

The algorithms for motif discovery can be split into two categories: exhaustive and heuristic methods. In the former, the algorithms evaluate the statistical significance of all possible motifs and output a ranked list. This approach is efficient since it avoids the need to pre-select a subset of motifs to use in the classification. It also has the merit of achieving better performance than most of the other methods introduced for the same task [

One of the most common algorithms used for sequence alignment is the (NM&W) algorithm. It is the DP approach in bioinformatics for aligning protein or DNA sequences. In general, it tolerates erroneous string data and addresses the global alignment problem: determining an optimal alignment of two strings, i.e. aligning the strings so as to produce the greatest similarity between them [20,21]. Another algorithm is the (SW) algorithm, which applies a more sensitive approach to the alignment of strings with different lengths [

Data mining is an area that extracts hidden predictive information from large data sets with its powerful technology. One of the main objectives of data mining is classification learning. Classification assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data.

The Self-Organizing Map (SOM) is one of the most widely applied neural networks and has some interesting features compared to other neural networks. One advantage of using the SOM is that it is quite robust with respect to noisy data; its advantages over other classification models are its natural robustness and its very good illustrative power. Indeed, it has been successfully applied to several classification tasks.

The SOM does not require an external teacher during the training phase; therefore, it is classified as an unsupervised neural network. The SOM receives a number of input patterns, discovers significant features in these patterns, and learns how to classify input data into appropriate categories. Its best-known characteristic is that it projects high-dimensional data onto a low-dimensional grid, visually revealing the topological order of the original data. It was developed in 1982 by Teuvo Kohonen, a professor emeritus of the Academy of Finland [

SOM can also be viewed as a constrained version of k-means clustering [

1) Initialization: choose random values for the initial weight vectors w_{j}, and assign a small positive value to the learning rate parameter α.

2) Activation: apply the input vector X to activate the SOM network, and find the best-matching unit (BMU) neuron at iteration p by similarity matching, using the usual minimum Euclidean distance measure as in "Equation (1)",

i_{X}(p) = arg min_{j} ||X − w_{j}(p)||, j = 1, 2, ∙∙∙, m (1)

where n is the number of neurons in the input layer, and m is the number of neurons in the SOM layer.

3) Updating: apply the weight update equation

w_{j}(p + 1) = w_{j}(p) + α(p) Θ_{j}(p) [X − w_{j}(p)] (2)

where Θ is the restraint due to distance from the BMU, usually called the neighborhood function, α is the learning rate, and the last term gives the weight correction in the p^{th} iteration.

4) Continuation: return to step 2 until the feature map stops changing, or no noticeable changes occur in the feature map.

After processing all of the input, the result should be a spatial organization of the input data, organized into similar regions.
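The four steps above can be put together as a minimal numeric SOM training loop. This is an illustrative sketch, not the paper's implementation: the exponential decay schedules and the Gaussian form of the neighborhood function Θ are assumptions.

```python
import numpy as np

def train_som(data, grid_h=5, grid_w=5, epochs=100, lr0=0.5, sigma0=2.0):
    """Minimal SOM training loop following steps 1-4 above.

    data: (N, d) array of numeric input vectors.
    Returns the trained (grid_h, grid_w, d) weight grid.
    """
    rng = np.random.default_rng(0)
    d = data.shape[1]
    # 1) Initialization: random weight vectors.
    w = rng.random((grid_h, grid_w, d))
    coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                                  indexing="ij"), axis=-1)
    for p in range(epochs):
        lr = lr0 * np.exp(-p / epochs)        # decaying learning rate alpha(p)
        sigma = sigma0 * np.exp(-p / epochs)  # shrinking neighborhood radius
        for x in data:
            # 2) Activation: BMU = node with minimum Euclidean distance.
            dist = np.linalg.norm(w - x, axis=2)
            bmu = np.unravel_index(np.argmin(dist), dist.shape)
            # 3) Updating: Gaussian neighborhood function Theta around the BMU.
            grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
            theta = np.exp(-grid_dist2 / (2 * sigma ** 2))
            w += lr * theta[..., None] * (x - w)
        # 4) Continuation: a fixed epoch budget stands in for a convergence test.
    return w
```

A fixed number of epochs is used here in place of the "no noticeable changes" stopping criterion, which keeps the sketch short.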

The phases of the proposed system are described in the following subsections.

The SOM node structure is redesigned to handle DNA sequences. Let W = {w_{ij}} be the set of SOM node weights, i = 1, 2, ∙∙∙, m, j = 1, 2, ∙∙∙, n (where m is the grid height and n is the width). Every w ∈ W represents a character vector of length k, as shown in the figure.

In this phase the same idea of SOM training described in Section 3 is used, except for the similarity function and the neighborhood update. Initially, the SOM weights are set to random examples from the input data, w_{ij} = D_{y}. The traditional SOM can handle neither dynamic nor character-based data, since the Euclidean distance in "Equation (1)" and "Equation (2)" computes differences between numeric values; instead, we use the Needleman & Wunsch algorithm to calculate the difference between D_{ij} and W_{ij} as follows:
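The substitution of an alignment score for the Euclidean distance in the BMU search can be sketched as follows. The function name and the grid-of-strings representation are our assumptions; `score` stands for any similarity function such as a Needleman-Wunsch scorer, where a higher score means more similar.

```python
def find_bmu(grid, example, score):
    """Return the (i, j) index of the winning node: the node whose weight
    sequence has the highest alignment score against the input example.

    grid: 2-D list of weight sequences (strings).
    score: similarity function, e.g. a Needleman-Wunsch scorer (assumed).
    """
    best, best_ij = None, None
    for i, row in enumerate(grid):
        for j, w in enumerate(row):
            s = score(w, example)
            # Maximize similarity (equivalently, minimize distance).
            if best is None or s > best:
                best, best_ij = s, (i, j)
    return best_ij
```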

The cells of the score matrix are labeled C(i, j), where i = 1, 2, ∙∙∙, U, j = 1, 2, ∙∙∙, T.

Create a matrix with U + 1 rows and T + 1 columns.

The score matrix cells are filled row by row starting from C(2, 2), where: match score = +1; mismatch score = −1; g = gap penalty = −1.

The first row and the first column of the score matrix are filled as multiples of the gap penalty.

The score of any cell is the maximum of:

C(i, j) = max{ C(i − 1, j − 1) + s(i, j), C(i − 1, j) + g, C(i, j − 1) + g }

where s(i, j) is the substitution score for letters i and j (the match or mismatch score).

The value of a cell depends only on the values of its immediately adjacent northwest-diagonal, up, and left cells, as shown in the figure.

After filling the score matrix, the last cell holds the maximum alignment score.

Traceback is the process of deducing the best alignment from the score matrix.

Traceback starts from the last cell (bottom-right corner) of the score matrix. There are three possible moves: diagonal (toward the top-left corner of the matrix), up, or left. The traceback is completed when the first, top-left cell of the matrix is reached.
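The fill and traceback steps above can be sketched in one function. This is a minimal illustration using the match/mismatch/gap scores +1/−1/−1 given earlier; the function name and ties broken in diagonal-first order are our assumptions.

```python
def nw_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment: score-matrix fill plus traceback.

    Returns (optimal score, aligned a, aligned b) with '-' marking gaps.
    """
    U, T = len(a), len(b)
    # (U + 1) x (T + 1) score matrix; first row and column are
    # filled as multiples of the gap penalty.
    C = [[0] * (T + 1) for _ in range(U + 1)]
    for i in range(1, U + 1):
        C[i][0] = i * gap
    for j in range(1, T + 1):
        C[0][j] = j * gap
    # Fill row by row: each cell depends on its diagonal, up, and left neighbors.
    for i in range(1, U + 1):
        for j in range(1, T + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s,  # diagonal: match/mismatch
                          C[i - 1][j] + gap,    # up: gap in b
                          C[i][j - 1] + gap)    # left: gap in a
    # Traceback from the bottom-right cell, which holds the optimal score.
    out_a, out_b, i, j = [], [], U, T
    while i > 0 or j > 0:
        if i > 0 and j > 0 and C[i][j] == C[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and C[i][j] == C[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return C[U][T], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, aligning "ACGT" against "ACG" yields a score of 2 (three matches and one gap).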

After selecting the winning node w_{ij}, the neighbor nodes are selected according to "Equation (2)".

As stated previously, the SOM algorithm is designed for unsupervised learning. To use the SOM for supervised learning (classification), the enhanced node structure is used (as described in Phase I), and additional blocks are employed. These blocks are initially set to zero. At every step, if a node is selected as the winner, the class counter of the class corresponding to the selected example is incremented by one, as shown in the figure.

Every node w_{ij} in the SOM network is connected to the data by its connecting weight and by a set of winning-class counters C_{1}, C_{2}, ∙∙∙, C_{m}, where m is the number of classes, as shown in the figure.
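The enhanced node structure can be sketched as a small class. This is illustrative only; the field names, method names, and arg-max labeling are our assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SomNode:
    """Enhanced SOM node: a character weight sequence plus one winning-class
    counter per class (sketch; names are assumptions)."""
    weight: str                           # e.g. "ACTTG", length k
    counters: list = field(default_factory=list)

    def init_counters(self, num_classes):
        # Counter blocks are initially set to zero.
        self.counters = [0] * num_classes

    def record_win(self, class_index):
        # Incremented each time this node wins for an example of this class.
        self.counters[class_index] += 1

    def label(self):
        # After training, the largest class counter defines the node's label.
        return max(range(len(self.counters)), key=lambda c: self.counters[c])

node = SomNode("ACTTG")
node.init_counters(3)
node.record_win(1); node.record_win(1); node.record_win(2)
# node.label() -> 1
```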

This technique provides the possibility of utilizing the class labels in the training set while training the SOM: the BMU vector introduced into the node structure provides a voting criterion, so that nodes with maximum BMU_{i} are dragged along during the weight update process. Shifting such nodes towards the winning node, which is definitely of the same class, strengthens the relationship between these nodes, while at the same time leaving nodes from other classes behind, weakening the relationship between those nodes and their dissimilar neighbors.

The main idea of our proposed method is to measure the similarity between objects independently of the data representation by using the new NM&W distance. After the winning node is determined, BMU_{i} is incremented by one for the i^{th} class counter. This confidence indicates the similarity between the input data and the winning node (BMU).

In the last step, the weight update is performed as shown in the figure.

In addition, the winning SOM unit is the unit W_{ij} that has the smallest distance to each instance, and the appropriate class counter of the winning unit is incremented by one. After all instances have been presented, the largest class counter of each unit defines its label (see Phase III); the reliability of all instances is then calculated by the "Reliability Equation" below.

Weight Update:

To increase the similarity between neighborhood nodes, we introduce crossover and mutation. These operations reproduce a modified sequence oriented toward both the winning node and the current instance. For all nodes in the neighborhood of the BMU, crossover and mutation are performed as shown in Figures 8 and 9.

The number of crossover points is selected randomly, and the value decreases based on how close g_{i} is to n_{ij}. The node with the highest score against the winning node is selected and replaced with g_{i}.

The mutation step is applied here to reduce the local minima that might be caused by the crossover step, and it preserves the algorithm's diversity with respect to the winning nodes and the data.
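A minimal sketch of how crossover and mutation on neighbor sequences might look. This is illustrative only: equal-length sequences, the multi-point crossover scheme, and the per-base mutation rate are our assumptions, not the paper's exact operators.

```python
import random

BASES = "ACGT"

def crossover(neighbor, winner, n_points, rng=None):
    """Multi-point crossover pulling a neighbor sequence toward the
    winning node's sequence (equal lengths assumed)."""
    rng = rng or random.Random(0)
    points = sorted(rng.sample(range(1, len(neighbor)), n_points))
    out, take_winner, prev = [], False, 0
    for p in points + [len(neighbor)]:
        # Alternate segments: start from the neighbor, switch at each point.
        src = winner if take_winner else neighbor
        out.append(src[prev:p])
        take_winner = not take_winner
        prev = p
    return "".join(out)

def mutate(seq, rate=0.05, rng=None):
    """Point mutation: each base flips to a random other base with
    probability `rate`, preserving diversity after crossover."""
    rng = rng or random.Random(0)
    return "".join(rng.choice(BASES.replace(c, "")) if rng.random() < rate else c
                   for c in seq)
```

In the scheme described above, `n_points` would shrink as the neighbor's score against the winning node grows, so close neighbors are perturbed less.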

The generated SOM is then categorized based on the reliability computed by the "Reliability Equation", as shown in the figure.

[Figure: enhanced node structure: a character weight vector (A, C, T, T, G, ∙∙∙) followed by the class counters C_{1}, C_{2}, ∙∙∙, C_{Z}.]