Automatic web page classification has become indispensable for web directories due to the multitude of web pages on the World Wide Web. In this paper an improved term weighting technique is proposed for automatic and effective classification of web pages. The web documents are represented as sets of features. The proposed method selects and extracts the most prominent features, reducing the high-dimensionality problem of the classifier. Proper selection of features from the large set improves the performance of the classifier. The proposed algorithm is implemented and tested on a benchmark dataset. The results show better performance than most of the existing term weighting techniques.

The rapid development of technology has led people and devices to connect to the Internet and share data. Thus information is accumulating on the WWW at a very high rate. In this scenario it is necessary to categorize web content in an organized way. Automatic classification of web pages into relevant categories helps search engines give quicker and better results. Web page classification [

The goal of web page categorization is to classify the information on the WWW into a certain number of predefined categories. Categorization is an active research area in information retrieval (IR) and machine learning. Several text categorization methods such as Naive Bayes [

Ali Selamat and Sigeru Omatu [

Thabit Sabbah et al. [

Ruma Dutta, Anirban Kundu and Debajyoti Mukhopadhyay [

Chris Buckley [

In this paper feature vectors of the web pages are classified using a new term weighting scheme. As the weight of a term drives the classifier towards efficient and accurate categorization, the weighted features are ranked and the top-ranked features are selected for classification. A neural network classifier is used to classify the 20 Newsgroups dataset [

In web page classification, not all the terms in a document help to identify the class the web page belongs to. Hence it is necessary to select only the most relevant terms from each document. Each document is represented as a weight vector [w_{1}, w_{2}, …, w_{n}], where the w_{i} are the weights of the respective terms. The most widely used term weighting schemes are described below.

a) Term Frequency (TF): TF is based on the normalized frequency of a term. The more occurrences a term has, the more significance it is assumed to carry.

b) Document Frequency (DF): Document frequency is the number of documents in the collection that contain the term t.

c) Term Frequency-Inverse Document Frequency (TF-IDF): In IDF, a term that occurs in many documents of the collection is considered least significant. It is a global term weighting scheme in which the weight is computed with respect to the term's occurrence in the entire collection. TF-IDF is a ranking measure in which a term that is rare in the collection but frequent within a particular document is considered significant, and vice versa.

The schemes are summarized below, with f_{t,d} the raw count of term t in document d, N the number of documents, |d| the length of document d, and GF_{t} the total count of t in the collection (the formulas shown are the standard forms matching the descriptions above).

Term Weighting Scheme | Formula |
---|---|
Term Frequency (TF) | TF_{t,d} = f_{t,d} / max_{k} f_{k,d} |
Document Frequency (DF) | DF_{t} = number of documents containing term t |
Term Frequency-Inverse Document Frequency (TF-IDF) | w_{t,d} = TF_{t,d} × log(N / DF_{t}) |
Glasgow | w_{t,d} = log(f_{t,d} + 1) / log(|d|) |
Entropy | w_{t,d} = log(f_{t,d} + 1) × (1 + (1 / log N) Σ_{j} p_{t,j} log p_{t,j}), with p_{t,j} = f_{t,j} / GF_{t} |

d) Glasgow term weighting: This weighting scheme is introduced to avoid favoring longer documents, which contain many irrelevant words, over shorter documents.

e) Entropy term weighting: This scheme is based on a probabilistic analysis of the text. A term that occurs frequently in most documents is considered more significant. The term weight is computed from two aspects, local term weighting and global term weighting: once every term receives a weight, it is composed of a local and a global component. The values are normalized to the range 0 - 1.
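As an illustration, the classical schemes above can be computed on a toy collection (the corpus and tokenization here are illustrative, not taken from the paper's dataset):

```python
import math

# Toy collection: each document is a list of tokens.
docs = [
    ["web", "page", "web", "classification"],
    ["neural", "network", "classification"],
    ["web", "directory"],
]
N = len(docs)

def tf(term, doc):
    # Normalized term frequency: raw count over the most frequent term's count.
    counts = {t: doc.count(t) for t in doc}
    return counts.get(term, 0) / max(counts.values())

def df(term):
    # Document frequency: number of documents containing the term.
    return sum(1 for d in docs if term in d)

def tf_idf(term, doc):
    # Terms rare in the collection but frequent in a document score highest.
    return tf(term, doc) * math.log(N / df(term))

print(tf("web", docs[0]))    # 1.0 ("web" is the most frequent term in doc 0)
print(df("web"))             # 2 (appears in two of the three documents)
```

Note how "web" gets a high TF in the first document but a dampened TF-IDF, since it also occurs in a second document of the collection.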

f) Proposed Term Weighting Scheme

The new term weighting scheme is an improvement of the traditional TF-IDF. The improved term weighting considers three main factors: a term frequency factor, a collection frequency factor, and a document length factor. The proposed term weighting formula is expressed as,

where TF_{t,d} represents the term frequency and DF_{t} represents the document frequency of the term.

The new term weighting technique focuses on a word's occurrence in a single document as well as its occurrence in the entire collection. The term value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the collection, which adjusts for the fact that some words appear more frequently in general.
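Since the formula itself was lost in extraction, the sketch below only illustrates how the three named factors could be composed; the combination shown is a hypothetical TF-IDF-with-length-normalization style, not the paper's exact formula:

```python
import math

def proposed_weight(tf_td, df_t, n_docs, doc_len):
    # HYPOTHETICAL composition of the three factors named in the text:
    # term frequency, collection frequency, and document length.
    tf_factor = math.log(1 + tf_td)           # damped term frequency factor
    idf_factor = math.log(n_docs / df_t)      # collection frequency factor
    len_factor = 1.0 / math.log(1 + doc_len)  # document length factor
    return tf_factor * idf_factor * len_factor

w = proposed_weight(tf_td=5, df_t=2, n_docs=100, doc_len=300)
```

Under this composition the weight grows with in-document frequency, shrinks as the term spreads across the collection, and is damped for longer documents, which matches the behavior the text describes.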

An artificial neural network (ANN) is used as the classifier to classify the web documents. An ANN is composed of a collection of neurons arranged in layers; it usually consists of three layers: an input layer, a hidden layer and an output layer. A single-hidden-layer feed-forward neural network is used here. Let X = (x_{i}) be the input vector, Y = (y_{i}) the output vector, H = (h_{i}) the hidden neuron vector, W = (w_{ij}) the weight matrix between the input layer and the hidden layer, and V = (v_{ij}) the weight matrix between the hidden layer and the output layer. A bias node with the constant value 1.0 is included in the input layer. The bias node increases the flexibility of the model to fit the data; it allows the network to fit the data even when all input features are equal to 0. The weighted sums for neurons in the hidden layer and the output layer can be calculated by

net^{h}_{j} = Σ_{i=1}^{n1} w_{ij} x_{i} and net^{o}_{k} = Σ_{j=1}^{n2} v_{jk} h_{j},

where n1 represents the number of input neurons, n2 the number of hidden neurons, net^{o} the net value for output-layer neurons and net^{h} the net value for hidden-layer neurons.

The training patterns are given as input to the network and the respective outputs of the hidden layer and the output layer are computed. The sigmoidal function is used as the activation function for the hidden and output layers. It is given as

f(net^{l}) = 1 / (1 + e^{−net^{l}}),

where l represents the layer.

The network learns by minimizing the Mean Square Error (MSE). MSE is defined as

MSE = (1/P) Σ_{p} Σ_{j} (d_{pj} − o_{pj})^{2},

where P is the number of patterns, p indexes the patterns, j denotes the j^{th} neuron of the output layer, d the desired value and o the obtained value.
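A minimal NumPy sketch of the forward pass and the MSE described above (the input size and random initialization are illustrative assumptions; the hidden and output layer sizes follow the parameter table given later):

```python
import numpy as np

def sigmoid(net):
    # Sigmoidal activation used for both the hidden and output layers.
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 10, 5                      # illustrative dimensions
W = rng.normal(scale=0.1, size=(n_in + 1, n_hidden))  # +1 row for the bias node
V = rng.normal(scale=0.1, size=(n_hidden, n_out))

def forward(x):
    x_b = np.append(x, 1.0)   # bias node with constant value 1.0
    h = sigmoid(x_b @ W)      # hidden-layer outputs
    y = sigmoid(h @ V)        # output-layer outputs
    return h, y

def mse(desired, obtained):
    # Mean squared error over all output neurons.
    return np.mean((desired - obtained) ** 2)

x = rng.random(n_in)
h, y = forward(x)
```

Every output stays strictly between 0 and 1 because of the sigmoid, which is why the targets are encoded in that range during training.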

Standard Backpropagation

The basic idea of the standard backpropagation algorithm is the repeated application of the chain rule to compute the influence of each weight in the network on an arbitrary error function. The algorithm is used to find the set of hidden-layer weights and thresholds that minimizes the error. The sum of squared errors of the network is

E = (1/2) Σ_{j} e_{j}^{2},

where the nonlinear error signal is e_{j} = d_{j} − y_{j}, and d_{j} and y_{j} represent the desired and obtained outputs for the j^{th} unit in the output layer.

Now the weight update rule for the hidden layer is

Δw^{h}_{ij} = −μ ∂E/∂w^{h}_{ij},

where μ is the learning rate and h denotes the hidden layer. The hidden-layer weights are then updated by

w^{h}_{ij}(new) = w^{h}_{ij}(old) + Δw^{h}_{ij}.
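The chain-rule updates can be sketched for a single training pattern as follows (shapes, initialization and the one-hot target are illustrative; the learning rate 0.005 matches the parameter table):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_step(x, d, W, V, mu=0.005):
    # One gradient-descent step for a single-hidden-layer network.
    h = sigmoid(x @ W)                        # hidden activations
    y = sigmoid(h @ V)                        # network output
    e = d - y                                 # error signal e_j = d_j - y_j
    delta_o = e * y * (1 - y)                 # output-layer local gradient
    delta_h = (delta_o @ V.T) * h * (1 - h)   # chain rule back to hidden layer
    V = V + mu * np.outer(h, delta_o)         # output-layer weight update
    W = W + mu * np.outer(x, delta_h)         # hidden-layer weight update
    return W, V

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 10))
V = rng.normal(scale=0.1, size=(10, 5))
x, d = rng.random(4), np.eye(5)[2]            # one pattern, one-hot target
W2, V2 = backprop_step(x, d, W, V)
```

Because the step moves both weight matrices against the gradient of E, a single application with this small learning rate reduces the squared error on the pattern.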

The news web pages can be of different lengths, structures and vocabularies. The high dimensionality of the web pages diminishes the performance of the classifier. There are many categories of news in web news pages, such as sports, weather, politics, economy and computers, and each category can contain many different classes. The objective of this paper is to classify the web pages according to their category. The principal function of the web page classification operates on the document collection, where Doc_{j} refers to each web page document that exists in the collection, t_{i} refers to the unique terms in the collection, and each entry records how often t_{i} occurred in Doc_{j}.

Since the number of terms is large, it is necessary to recognize and extract the significant ones. To identify the significant terms, their term weights are computed using the proposed Formula (1). The term weights are sorted in descending order and the top-ranked p features are selected and stored in the feature profiles. Feature selection is significant in supervised learning tasks such as binary or multiclass classification, as it improves the efficiency and training time of the supervised classifier. In the feature extraction phase the selected feature profiles are retrieved from the collection. The selected features are weighted using the new term weighting scheme and then normalized so that the data are scaled within the range [−1, 1]. To perform the normalization task, E is represented as a set of vectors e_{j}, where e_{j} denotes the p × 1 vector containing the p term features of document j. The normalized feature vector is given to the ANN for classification. The feed-forward back-propagation artificial neural network categorizes the given vector, and in the final stage the classifier performance is measured.

Classification Algorithm

The steps involved in the web page classification are given as follows. The input to the algorithm is the vector space model of the web documents in matrix format, and the output is the class labels of the web documents.

Doc/Term | t_{1} | t_{2} | t_{3} | … | t_{n} |
---|---|---|---|---|---|
Doc_{1} | 4 | 1 | 2 | … | 2 |
Doc_{2} | 2 | 3 | 3 | … | 1 |
Doc_{3} | 0 | 2 | 1 | … | 4 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Doc_{m} | 3 | 4 | 4 | … | 0 |

Step 1: Start the training.

Step 2: Fix the desired number of features p to be given to classifier.

Step 3: Read the VSM of the web documents in matrix A of dimension m x n, where n is the no. of terms in the collection; m is the no. of documents in the collection.

Step 4: Compute term weight matrix B by the new term weighting scheme for each element in matrix A.

Step 5: Rank the terms by sorting the value in B from highest to lowest according to sum of term weight values.

Step 6: Select p number of highest value from B and store it in feature profile matrix C.

Step 7: Extract the term values from the collection corresponding to feature profile matrix C and VSM model and store significant features in matrix D of dimension m x p.

Step 8: Compute the term weight of matrix D using proposed formula and store the result in E.

Step 9: Normalize the term matrix E and results can be stored in matrix F.

Step 10: The normalized output matrix F of dimension m x p is given as input to the neural network classifier.

Step 11: Initialize the parameters for neural network training.

Step 12: For each input pattern compute output of the network applying standard back propagation approach

Step 13: Update the weights for the network.

Step 14: Calculate network error for the network with updated weights.

Step 15: Repeat the training, until the MSE reaches 0.001.

Step 16: Test the trained network with the testing documents.

Step 17: Measure the classifier performance using Accuracy, Precision, Recall and F1.
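Steps 2-10 above can be sketched as follows, using plain TF-IDF as a stand-in for the paper's weighting formula (the toy matrix mirrors the first columns of the VSM table above):

```python
import numpy as np

def select_features(A, p):
    # Steps 2-10: weight, rank, select and normalize features from the
    # m x n vector space model A before handing it to the classifier.
    m, n = A.shape
    df = np.count_nonzero(A, axis=0)            # document frequency per term
    B = A * np.log(m / np.maximum(df, 1))       # Step 4: term-weight matrix
    rank = np.argsort(B.sum(axis=0))[::-1]      # Step 5: rank by summed weight
    top = rank[:p]                              # Step 6: feature profile
    D = A[:, top]                               # Step 7: extract features
    E = D * np.log(m / np.maximum(df[top], 1))  # Step 8: re-weight selection
    lo, hi = E.min(), E.max()                   # Step 9: scale to [-1, 1]
    F = 2 * (E - lo) / (hi - lo) - 1            # (assumes E is not constant)
    return F

A = np.array([[4, 1, 2, 2], [2, 3, 3, 1], [0, 2, 1, 4]], dtype=float)
F = select_features(A, p=2)                     # Step 10: classifier input
```

The resulting m × p matrix F is what the neural network receives in Step 10; every entry lies in [−1, 1].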

Experiments were done on the benchmark dataset called 20 Newsgroups [

The best representative features are selected using the new term weighting and ranking procedure. The features are weighted using the proposed formula, then sorted and ranked according to the sum of weights. After feature ranking, only the 100 (p = 100) most significant features are extracted for classification. The experiments were run under the hardware and software configurations specified in

Class | No. of documents taken for Training | No. of documents taken for Testing | Total no. of documents |
---|---|---|---|
Alt.atheism | 440 | 310 | 750 |
Comp.graphics | 520 | 360 | 880 |
Misc.forsale | 250 | 200 | 450 |
Sci.Crypt | 370 | 180 | 550 |
Rec.motorcycles | 410 | 260 | 670 |
Total | 1990 | 1310 | 3300 |

Hardware | Software |
---|---|
Processor: Intel Core Duo 2.1 GHz | Platform: MS Windows 8 |
Memory: 3 GB RAM; 32-bit OS | Software: Matlab R2014a |

The extracted p features are then given as input to the neural network classifier. A feed-forward neural network with an input layer, a hidden layer and an output layer is used. The back-propagation method is used for training the network. The network parameters are given in

The learning rate is chosen on a trial basis for minimum cost. The weights are initialized randomly. The weight adjustments are drastic initially and then stabilize; thus the network learning is found to be smooth, as shown in

After the neural network is trained, the test input patterns are given and the results of the classifier are evaluated using standard information retrieval measures: precision (P), recall (R), accuracy (Acc) and F1. They can be expressed as follows:

P = a / (a + b), R = a / (a + c), Acc = (a + d) / (a + b + c + d), F1 = 2PR / (P + R)

The values of a, b, c and d are explained in

Parameter | Value |
---|---|
Learning Rate | 0.005 |
Mean Squared Error | 0.001 |
Input Neurons | n (No. of terms) |
Hidden Neurons | 10 |
Output Neurons | 5 |

System Judgment \ Expert Judgment | Yes | No |
---|---|---|
Yes | a | b |
No | c | d |
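The four measures follow directly from the contingency counts a, b, c and d (the counts used below are illustrative, not from the paper's results):

```python
def metrics(a, b, c, d):
    # a: both say yes, b: system yes / expert no,
    # c: system no / expert yes, d: both say no.
    precision = a / (a + b)
    recall = a / (a + c)
    accuracy = (a + d) / (a + b + c + d)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

p, r, acc, f1 = metrics(a=90, b=10, c=5, d=95)   # p=0.9, acc=0.925
```

F1 is the harmonic mean of precision and recall, so it only rewards systems that keep both high at once.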

The classification accuracy of the term weighting methods TF, TF-IDF, DF, Glasgow, Entropy and the proposed method is tabulated in

With the explosive growth of web pages it is necessary to identify the category of each page. Given the similarity between pages and their varied attributes, classifiers have a tough time deciding the category of a web page. It is the terms in a page that describe its class. All the terms may not be essential

Category | TF-IDF | TF | DF | Glasgow | Entropy | Proposed Scheme |
---|---|---|---|---|---|---|
Alt.atheism | 95.86 | 96.01 | 96.08 | 64.01 | 92.25 | 98.3871 |
Comp.graphics | 96.74 | 97.25 | 97.10 | 70.5 | 94.35 | 98.3333 |
Misc.forsale | 94.29 | 95.5 | 96.45 | 75.5 | 93.4 | 97.5 |
Sci.Crypt | 97.06 | 96.9 | 96.24 | 68.35 | 90.5 | 98.3146 |
Rec.motorcycles | 89.75 | 90.62 | 91.55 | 63.18 | 94.65 | 95.7692 |
Average | 94.74 | 95.256 | 95.484 | 68.308 | 93.03 | 97.68524 |

Category | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|
Alt.atheism | 98.35 | 100 | 95.81 | 97.86 |
Comp.graphics | 98.1450 | 99.4 | 97.13 | 98.25 |
Misc.forsale | 97.9126 | 100 | 96.93 | 98.44 |
Sci.Crypt | 98.2121 | 99.5 | 96.33 | 97.89 |
Rec.motorcycles | 95.8065 | 96.7 | 95.65 | 96.17 |

during classification; hence selecting and retrieving the significant terms is mandatory. In this paper a new term weighting method is proposed that identifies the important and unique terms in a web page. As only a few significant terms are fed to the classifier, the results are accurate and efficient. The experiments were conducted on different classes of the 20 Newsgroups dataset, and the comparative results show that the average results are better than those of the existing methods. The performance of this method is comparatively good for text-based web documents, but when a page has more graphical content than text content the system performance degrades.

Thangairulappan, K. and Kanagavel, A.D. (2016) Improved Term Weighting Technique for Automatic Web Page Classification. Journal of Intelligent Learning Systems and Applications, 8, 63-76. http://dx.doi.org/10.4236/jilsa.2016.84006