Image Classification using Statistical Learning Methods

In general, digital images can be classified into photographs, textual documents and mixed documents. This taxonomy is useful in many applications, such as archiving tasks. However, there are no effective methods to perform this classification automatically. In this paper, we present a method for classifying and archiving documents into the following semantic classes: photographs, textual documents and mixed documents. Our method combines low-level image features such as the mean, standard deviation and skewness. Both a Decision Tree and a Neural Network classifier are used for the classification task.


Introduction
Nowadays, a huge number of documents are available in electronic format, whether as photos, plans, letters or press releases. As the amount of such information keeps growing, many applications for organizing this flood of documents are emerging. Among them, automatic image archiving systems are needed to classify and store large collections of documents autonomously, and to simplify searching for and retrieving individual documents.
Recently, automatic semantic classification and archiving of images has become an important field of research. It aims to assign images automatically to meaningful categories, such as outdoor/indoor, city/landscape and people/non-people scenes [1,2].
To classify images into two classes (indoor/outdoor, city/landscape, etc.), Vailaya et al. use a Bayesian framework and obtain an average accuracy of 94.1% [3].
In [4], Gorkani et al. suggest an image classification method based on the most dominant orientation in the image's texture. This feature differentiates two final classes of images, city and landscape, and yields a classification accuracy of 92.8%.
Another approach was proposed by Prabhakar et al. in [5]. They used three low-level image descriptors (color, texture and edge information) to separate pictures from graphic images. Their algorithm reaches an accuracy of 96.6%.
In [6]
This paper presents a system able to automatically classify and archive documents into the following three categories: photos, textual documents and mixed documents.
Section 2 explains the theoretical background of our approach. Section 3 describes the experimental setup, including the data sets, experimental results and evaluation criteria. Section 4 discusses the results and suggests new perspectives.

Proposed System
The system we propose discriminates between photographs, textual documents and mixed documents. It is based on two main stages (Figure 1): i) Feature extraction: the features are extracted automatically from images using specific programs. For every image, the values of these features are used as the coefficients of a representative vector. ii) The classification and archiving module: this is obtained after training and validating a model used to discriminate and store documents.

Features Extraction
Feature selection is the key step that determines the success or failure of the classification phase. Therefore, several features were tested for their relevance. Feature selection is an empirical process, although many approaches have been suggested to weight feature importance. In our system, images are classified based on six low-level features, which are used as the coefficients of the image's representative vector. They are calculated as follows:
• Mean: the average color value in the image. Here i denotes the color channel and Pij is the probability of occurrence of a pixel with intensity j.
• Standard deviation: the square root of the variance of the distribution.
• Skewness: a measure of the degree of asymmetry of the distribution.
• Entropy: a measure of the disorder or complexity of the image. A high entropy value indicates complex textures.
• Image dimension: the length and width of the image.
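The six features listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the extraction programs used in this work: the image is taken to be a single channel given as a list of rows of integer intensities, the skewness is assumed to be the third central moment divided by the cubed standard deviation, and the entropy is assumed to be the Shannon entropy in bits. The function name `feature_vector` and the parameter `levels` are hypothetical.

```python
import math


def feature_vector(pixels, levels=256):
    """Build the six-entry vector [mean, std, skewness, entropy, width, height].

    `pixels` is a list of rows, each row a list of integer intensities
    in [0, levels). Using a single channel is an assumption of this sketch.
    """
    height = len(pixels)
    width = len(pixels[0])
    total = width * height

    # Histogram, then probability prob[j] of observing intensity j.
    counts = [0] * levels
    for row in pixels:
        for value in row:
            counts[value] += 1
    prob = [c / total for c in counts]

    mean = sum(j * p for j, p in enumerate(prob))
    variance = sum((j - mean) ** 2 * p for j, p in enumerate(prob))
    std = math.sqrt(variance)

    # Assumed form: third central moment over std**3 (0 for a flat image).
    third = sum((j - mean) ** 3 * p for j, p in enumerate(prob))
    skewness = third / std ** 3 if std > 0 else 0.0

    # Assumed form: Shannon entropy in bits.
    entropy = -sum(p * math.log2(p) for p in prob if p > 0)

    return [mean, std, skewness, entropy, width, height]


# A tiny half-black, half-white image: symmetric histogram, two equally
# likely intensities.
print(feature_vector([[0, 0, 255, 255], [0, 0, 255, 255]]))
```

For a color image, the same computation could be repeated per channel or applied to a grayscale conversion; the text does not say which was done.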

Classification Stage
After the representative vector has been extracted for each image, every document is classified as a photo, a text or a mixed document. The photo family includes indoor, outdoor, scene, landscape and people images, as well as logos and maps. The text family includes scanned and computer-generated text in various fonts. Mixed documents are documents that contain both text and photo regions.

Training
Two well-known classifiers are used to classify our documents, namely the Decision Tree and the Neural Network [7,8].

The Decision Trees
The Decision Tree classifier is a set of hierarchical rules that are applied successively to the input data [9]. These rules are thresholds used to split the data at a node into two child nodes, such that the descendant nodes contain more homogeneous data samples. Many features can be fed into the Decision Tree to refine the class description. A split is chosen for its ability to make the nodes purer according to a purity measure, and it can be determined by any single feature [10].
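The way a single threshold split is selected can be illustrated with a short sketch. The purity measure is not named in the text, so the Gini impurity is used here purely as an assumption, and the helper names `gini` and `best_split` are hypothetical. The code tries every feature and every midpoint between consecutive feature values, and keeps the split whose two child nodes have the lowest weighted impurity.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(samples, labels):
    """Return (feature_index, threshold, weighted_impurity) of the purest split.

    Every feature is tried; candidate thresholds are the midpoints between
    consecutive distinct values of that feature.
    """
    n = len(samples)
    best = (None, None, gini(labels))  # a split must beat the parent node
    for f in range(len(samples[0])):
        values = sorted(set(s[f] for s in samples))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for s, y in zip(samples, labels) if s[f] <= t]
            right = [y for s, y in zip(samples, labels) if s[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best


# Toy data (made up for illustration): feature 0 separates the classes,
# feature 1 does not.
samples = [[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]]
labels = ["text", "text", "photo", "photo"]
print(best_split(samples, labels))
```

Applying `best_split` recursively to each child node would grow a full tree; pruning then removes the splits that do not help on held-out data.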
In this work, we fitted the Decision Tree to the training data using cross-validation in order to select the best tree. We thus obtained two tree-based models (original and pruned), which were used in the classification task.
The Artificial Neural Network
A neural network is a set of connected units (nodes, neurons). Each node has an input and an output and can be connected to other nodes. Each connection has an associated weight. The topology of the network, the training methodology and the connections between the nodes define the type of the neural network [11][12][13]. In our case we used an RBF network, in which the input layer has 6 nodes, equal to the number of features organized as vectors in the database. We chose 6 nodes for the hidden layer, while the output layer contains three nodes. At the end of this process, an input image is classified as a photo, a pure text or a compound document.
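The 6-6-3 topology described above can be sketched as a forward pass. This is a minimal illustration, not the trained network of this work: Gaussian hidden units and linear output nodes are assumed (a common form of RBF network), and the centers, widths, weights and biases below are placeholder values chosen by hand, whereas in practice they would be learned from the training vectors. All function and variable names are hypothetical.

```python
import math

CLASSES = ["photo", "text", "mixed"]


def rbf_forward(x, centers, widths, weights, biases):
    """Forward pass of an RBF network: Gaussian hidden units, linear outputs."""
    hidden = []
    for center, width in zip(centers, widths):
        dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        hidden.append(math.exp(-dist_sq / (2 * width ** 2)))
    # One linear combination of the hidden activations per output node.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, biases)]


def classify(x, centers, widths, weights, biases):
    """Return the class whose output node has the highest score."""
    scores = rbf_forward(x, centers, widths, weights, biases)
    return CLASSES[scores.index(max(scores))]


# Placeholder parameters: 6 hidden units with 6-dimensional centers,
# 3 output nodes that each listen to two of the hidden units.
centers = [[1.0 if i == k else 0.0 for i in range(6)] for k in range(6)]
widths = [1.0] * 6
weights = [
    [1, 1, 0, 0, 0, 0],  # photo
    [0, 0, 1, 1, 0, 0],  # text
    [0, 0, 0, 0, 1, 1],  # mixed
]
biases = [0.0, 0.0, 0.0]

print(classify([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], centers, widths, weights, biases))
```

In a real setting the six input features would first be scaled to comparable ranges, since the Gaussian units depend on Euclidean distance.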

Experimental Results
A database of 291 documents was considered for both classification systems. From this set, 75% of the documents were used for training and 25% for testing the system performance. The training data set consists of 136 photos (including indoor, outdoor, scene and landscape images), 39 textual documents (including scanned and computer-generated text in various fonts) and 51 compound documents. Figure 2 shows some images of each class from the training data set.
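A 75%/25% partition of this kind could be produced as in the sketch below. Splitting each class separately (a stratified split) and the fixed random seed are assumptions of this example; the text only gives the overall proportions. The function name `split_by_class` is hypothetical.

```python
import random


def split_by_class(documents, train_fraction=0.75, seed=0):
    """Shuffle each class separately and split it into training and test parts.

    `documents` is a list of (item, label) pairs. Splitting per class is an
    assumption of this sketch, made so that every class appears in both parts.
    """
    rng = random.Random(seed)
    by_label = {}
    for item, label in documents:
        by_label.setdefault(label, []).append(item)

    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train += [(item, label) for item in items[:cut]]
        test += [(item, label) for item in items[cut:]]
    return train, test


# Toy collection (made up): 8 photos and 4 texts.
docs = [(i, "photo") for i in range(8)] + [(i, "text") for i in range(4)]
train, test = split_by_class(docs)
print(len(train), len(test))
```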
In order to evaluate the accuracy of our approach, the following statistical coefficients are computed [14][15]:
• The recall rate = CCI/TI
• The precision rate = CCI/(TI+MI)
Figure 3 presents the results obtained using the Decision Tree. Only for textual documents does the full Decision Tree achieve a higher F-measure than the pruned one.
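The coefficients above translate directly into code. The sketch below follows the recall and precision rates exactly as they are written in this section, with CCI the number of correctly classified images, MI the number of misclassified images and TI the number of test images for a class. The F-measure formula is not given in the text, so the usual harmonic mean of precision and recall is assumed. The counts in the example are invented for illustration and are not results from this work.

```python
def recall_rate(cci, ti):
    """Recall as written in the text: CCI / TI."""
    return cci / ti


def precision_rate(cci, ti, mi):
    """Precision as written in the text: CCI / (TI + MI)."""
    return cci / (ti + mi)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed, not defined in the text)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one class.
cci, ti, mi = 18, 20, 4
r = recall_rate(cci, ti)
p = precision_rate(cci, ti, mi)
print(r, p, f_measure(p, r))
```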
The results obtained using the neural network as classifier are presented in Figure 4. There are some cases of misclassification produced by both classifiers. Figure 5 shows examples of these images.
Text misclassifications are mainly due to bad lighting conditions and excessively noisy backgrounds, which cause the final uniformity test to fail.

Conclusions
Automatic classification and archiving of images is an emerging research field in image processing. In this paper, an algorithm for classifying photos, textual documents and mixed documents based on low-level image features was presented. First, features are extracted from the images and assembled into a characteristic vector. Then, the Decision Tree and the Neural Network classifiers are used to train and validate a classification model on the extracted feature vectors. The resulting models reach an accuracy of 96% in discriminating between a photo, a text and a mixed document.
Nevertheless, feature relevance is weighted to select the most contributory features, in order to increase classification and archiving performance. Moreover, we are currently studying other useful high-level features to raise the accuracy and to build a new intelligent classifier.
CCI represents the number of Correctly Classified Images, MI is the number of Misclassified Images and TI is the number of Test Images for each class.

Figure 4.
These results show that both classifiers achieve notable results in the classification of documents. The DT classifier outperforms the NN classifier in execution speed and in recall (by 12%).

Figure 2. Examples of training data set images.