An Approach for Content Retrieval from Web Pages Using Clustering Techniques

Mining content from an information database poses challenging problems for industry experts and researchers, because the relevant information is buried in huge volumes of data. In web searching, the information retrieved is often not appropriate: it is ambiguous with respect to the user query, and the user cannot get relevant information within the stipulated time. To overcome these issues, we propose a new information retrieval methodology, EPCRR, which provides the most exact information to the user by means of a collaborative clustered automated filter. The filter uses a collaborative data set and works by prediction, assigning the highest ranking to the most exact data retrieved. Retrieval is based on recommendation: the relevant data with the highest priority are drawn from the most heavily used cluster of data. In this work, we use an automated wrapper that works like a meta crawler and obtains the content in a semantic usage data format. Information passed from the user to the agent is ranked on the basis of the Enabled Pile clustered data, with respect to the metadata held by the agent and the end-user. The top-ranked information is given to the end-user within the stipulated time, and the remaining highly ranked information is moved to the data repository for future use. The data collected remain stable with respect to user preference, and the system follows an intelligent-system approach in which the user can choose any information in any instance and be provided with suitably exact content. We find that the proposed algorithm produces better results than existing work and costs less online computation time.

Over the past decades, we have seen vast growth in data access through the web [1]. Even so, the information obtained through the web is often not relevant to the user, because it is overcrowded: it contains too much content, or content that does not match the given query, so the user cannot get the relevant data for the request [2]. The information obtained may be overlooked or overloaded. Traditional information retrieval (IR) techniques have provided solutions to the fundamental issues. However, IR-based systems do not describe explicitly how a system can act like its users, and they do not support obtaining knowledge from large data sets to answer what users really want [3]. In data mining, it is a challenging task to know what must be done to get relevant information through content mining [4].
Over time, many mining methodologies have been proposed as approximate solutions to this challenge. Unfortunately, user-based and agent-based systems offer only architectural proposals for information gathering and management [5]; they do not provide novel approaches to these issues. Some mining methodologies deliver nearly exact content, but not within the stipulated time, and they accumulate a large amount of closely related information [6]. To overcome these issues, an exact content mining technique is proposed and named EPCRR (Enabled Pile Clustered exact content retrieval and repository). This work is similar to, but more advanced than, web content mining, which can be viewed as the use of data mining techniques directed towards automatic data retrieval [7]. It facilitates the web mining procedures, namely usage, structure, content and user profiles. To these we add content mining that delivers exact information to the end-user: an appropriate data cluster set is framed with the pile approach, the data are ranked hierarchically, and the time factor is respected so that exact content is delivered promptly, while the top-ranked information that was overlooked is stored in the data repository. Meta crawlers use the metadata stored in the data repository to deliver mined content to the user [5].
Mining data in response to a user request is a tedious process, since each user looks for different information. Different kinds of users all expect exact information from the mining process, yet for a given input the database may contain an ocean of information, which makes retrieval tedious and the choice of output uncertain. Irrespective of this, the system has to return exact information for the given input. In practice, it ranks the top information together with overcrowded and overlooked data as output. Traditional information retrieval techniques deliver overwhelming amounts of content without verification [7] [8]. The content is not organized by its characteristics and usage, too much information is returned, and the time factor is not considered. The proposed system is designed to address these issues.
A native solution is provided for these issues. In the proposed work we use a Pile filter process, and the data set is formed from the metadata maintained by the agent. An automated wrapper frames the data format for the given request, and an advanced meta crawler delivers the information in web data format within the stipulated time. The exact content and the remaining hierarchy of data are moved to the repository, so that any change in the information can be updated and used in the future [9] [10].
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 gives a brief overview of the proposed work and its architecture design. Section 4 explains the data set formation and the wrapper agent. Section 5 explains EPCRR (Enabled Pile Clustered exact content retrieval and repository). Section 6 discusses theoretical and experimental results. Section 7 justifies the efficiency of the proposed system. Finally, conclusions and future work are discussed.

Related Work
Collaborative filtering with clustering techniques has been studied extensively. Rong Hu et al. [11] proposed an approach that uses description and functionality information as metadata to measure the characteristic similarities between services. It used the AHC algorithm, whose results depend strongly on the choice of the number of clusters K, and the initial value of K is not known. This system also suffers from the cold start and data sparsity problems.
Mai et al. [12] designed a clustering collaborative filtering algorithm based on a neural network for e-commerce recommendation systems. Using data from web visiting messages, the cluster analysis groups users with similar characteristics. However, it was difficult to establish whether a user's preference in web visiting is relevant to the user's preference in purchasing.
Mittal et al. [13] proposed calculating user predictions by first minimizing the size of the item set explored by the user. The K-means clustering algorithm was applied to partition movies based on the query requested by the user. However, it has the drawback that each object must belong to exactly one group, which leads to the limitation that every group must have at least one member.
Li et al. [14] proposed incorporating multidimensional clustering into a collaborative filtering recommendation model. In the first stage, the user and item profiles were collected and clustered using the proposed algorithm. In the second stage, the clusters with poor similarity features were removed and the appropriate clusters were selected by cluster pruning. In the third stage, the rating of an item was predicted by a weighted average of deviations from the neighbour's mean. This approach increased the diversity of recommendations while maintaining their accuracy.
Zhou et al. [15] represented a data-providing service as vectors, taking into account the given input, the expected output, and the semantic relations among them. A refined fuzzy C-means algorithm was used to cluster the vectors. By merging similar services into the same cluster, the capability of the service search engine was significantly improved, especially in large Internet-based service repositories. However, this approach assumes that a domain ontology exists to facilitate semantic interoperability. Besides, it is not suitable for services that lack parameters.
Pham et al. [16] proposed using a network clustering technique on the social network of users to identify their neighbourhoods, and then using traditional CF algorithms to generate the recommendations.
Simon et al. [11] used a high-dimensional divisive hierarchical clustering algorithm that relies on implicit feedback from past user history to discover the relationships among users. Products of high interest were recommended to the users based on the clustering results. However, the implicit feedback did not provide reliable information about the user's preference.

Overview of the Proposed Work
We propose a new semantic content mining process that treats the user query as new data for the mining execution. The mining execution begins with data set formation from the user information: the data are first identified, and then a refining process forms clusters of data for the given information, both to avoid repetition of content and to extract the exact content for the given query. Once the clusters have been formed, we check the similarity between words and content; the cluster can also capture similarities between groups of two words, four words, and so on up to the length of the given input. The clustered data then move on to the collaborative automated filtering process, which establishes the hierarchy of the obtained data [17]. The clustering process begins by checking the data for function similarity, characteristic similarity and description similarity, and then the advanced agglomerative clustering algorithm is used to find the similarities between words with respect to the measured factor. Once the clusters of data are formed, they act as wrappers that produce the output for the user query. The advanced mining process, together with the automated collaborative filter acting as a wrapper, yields the specified output, which is the mined content [10]. The obtained information is then passed to the web with snippets. Before the information is released, the agent supervises it and forms a ranked list of the data; the listed data are moved to the data repository, and from the repository the information is given to users according to their priority. If the user obtains exactly the information required from the mining process, that information acts as metadata for the future retrieval of other data in the data repository. This repository is maintained by the agent and not by the user.
The architecture diagram in Figure 1 gives a layout of the exact mining process used to obtain the information. Each block in the diagram shows the systematic relationship between the processes and the flow between them, and thereby characterizes the working procedure of the system. To cross-check the hierarchy, that is, to provide the ranking after the filtering and refinement process, we introduce an agent that checks whether the refined content is aligned according to the hierarchy. This agent acts as a third-party expert system: it uses rating techniques to get the content from the collaborative filter and, while ranking the data, it also considers the user preferences over the period of time examined [17]. To confirm the process, a genetic algorithm can be used to check the match between the filtered content from the filter and the given user query.
The proposed architecture design has four blocks: the first covers data set formation, the second the filtering process, the third the monitoring and storage of the data set, and the fourth content retrieval. The proposed design facilitates efficient content mining and addresses how quickly users can get the data within the stipulated period of time. The system also avoids data overlap and mismatched information for the end user. The clustering algorithm used supports exact content retrieval. A comparative study against popular search engines has been carried out and analysed. Figure 2 depicts the steps involved in the proposed work.

Data Set Formations
Many systems use recommender systems to form the data set, and the recommender system in turn uses a direct recommender algorithm to deliver the data in the form of services [7]. In most cases the data set so formed does not take the description value of the data into account; instead, it checks only the value of the data derived from its usage. The recommended data set is also maintained by the user and not by an external party, so that for the same input the recommended data set yields widely varying values for different users. This variation and mismatch of data content arises in the data set recommender system, and it is mitigated by the measures introduced in the filtering phase by different collaborators [12] [14].
In our proposed work the data set is formed and sent to the wrapper, which is maintained by external agents so that loss of information can be avoided. The data set is formed by collecting all the information about the given data; the data are identified by their characteristics and by similarity in functionality and description. After this comparison the clustered data set is formed, and the advanced collaborative clustered automated filter is used to filter out unwanted content [18] [19]. Each data unit is given a unique id and value based on its characteristics and similarities, and the user preferences are also considered in framing the clustered data unit, which is called a pile in our proposed system. When uniqueness is found in the data set, the advanced collaborative filter formulates the data set. This filter is based on the agglomerative clustering algorithm and includes rating similarity computation and predicted rating computation. Once the framed data set is given to the web, the snippets formulate the data set in the desired format [8]. In this paper we use a real data set based on recovery services, in which each service has its own service description, functionality and user preferences in the form of ratings.

Characteristic Similarity and Functionality Calculation
Description similarity and functionality similarity are computed using the Jaccard similarity coefficient (JSC), a statistical measure of similarity between sample sets [9]. For two sets, the JSC is defined as the cardinality of their intersection divided by the cardinality of their union. Concretely, the description similarity between services a and b, with description tag sets $D_a$ and $D_b$, is computed as [19]

$$D_{sim}(a, b) = \frac{|D_a \cap D_b|}{|D_a \cup D_b|} \qquad (1)$$

It can be inferred from this formula that the larger $|D_a \cap D_b|$ is, the more similar the two services are. Division by $|D_a \cup D_b|$ is the scaling factor that ensures that the description similarity lies between 0 and 1. Similarly, the functionality similarity is calculated as

$$F_{sim}(a, b) = \frac{|F_a \cap F_b|}{|F_a \cup F_b|} \qquad (2)$$

The weighted sum of description similarity and functionality similarity is used to compute the characteristic similarity between a and b:

$$C_{sim}(a, b) = \alpha \times D_{sim}(a, b) + \beta \times F_{sim}(a, b) \qquad (3)$$

In this formula, $\alpha \in [0, 1]$ is the weight of description similarity and $\beta \in [0, 1]$ is the weight of functionality similarity. The weights express the relative importance of the two similarities. In the recommender system, for the n services provided, the characteristic similarities of every pair of services are calculated and an n × n characteristic similarity matrix M is formed. An entry $m_{a,b}$ in M represents the characteristic similarity between a and b.
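As an illustration only, the description, functionality and characteristic similarities defined above can be sketched in Python. The function names, the representation of a service as a pair of sets, and the placeholder tag names are assumptions introduced for this sketch; they are not taken from the implemented system.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def characteristic_similarity(desc_a, desc_b, func_a, func_b, alpha=0.5, beta=0.5):
    """Weighted sum of description similarity and functionality similarity."""
    return alpha * jaccard(desc_a, desc_b) + beta * jaccard(func_a, func_b)


def similarity_matrix(services, alpha=0.5, beta=0.5):
    """Build the n x n matrix M; each service is a (tags, functions) pair."""
    return [
        [characteristic_similarity(ta, tb, fa, fb, alpha, beta) for tb, fb in services]
        for ta, fa in services
    ]


# Hypothetical services: four shared tags out of six distinct tags,
# identical functionality (placeholder names, not the real data set).
service_x = ({"t1", "t2", "t3", "t4", "t5"}, {"f1"})
service_y = ({"t1", "t2", "t3", "t4", "t6"}, {"f1"})
c_sim = characteristic_similarity(service_x[0], service_y[0], service_x[1], service_y[1])
matrix = similarity_matrix([service_x, service_y])
print(round(c_sim, 3))
```

With four common tags among six distinct tags and identical functionality, the sketch gives 0.5 × 4/6 + 0.5 × 1, which rounds to 0.833.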

Enabled Pile Clustered Exact Content Retrieval and Repository
Clustering methods partition a set of objects into clusters such that, according to some defined criteria, objects in the same cluster are more similar to each other than to objects in different clusters. Cluster analysis algorithms have been widely used on huge data stores [20].
Clustering algorithms are either hierarchical or partition-based. Standard partition-based approaches such as K-means suffer from several limitations: 1) results depend on the number of clusters K, which is not known in advance; 2) cluster size is not monitored; 3) the algorithms converge to a local minimum [21]. Hierarchical clustering methods are classified into two types, agglomerative and divisive, according to whether they follow a bottom-up or a top-down approach.

Pile Clustering Process
In this paper, we present a pile clustering based collaborative filtering approach for big data applications that is relevant to recommendation. Services are merged into clusters via an agglomerative clustering algorithm before the collaborative filtering technique is applied, and the rating similarities between services are computed within a single cluster. Because a cluster contains fewer services than the whole system, this approach costs less online computation time. Moreover, because services with relevant ratings are grouped in the same cluster and dissimilar services fall in other clusters, rating predictions based on services in the same cluster are more accurate than those based on dissimilar services in other clusters. This approach provides a better solution to the data sparsity and cold start problems. The clustering of services is explained in Figure 3 and in the algorithm below.
Many current clustering systems use agglomerative hierarchical clustering because of its clustering strategy and good performance. Furthermore, it does not require the number of clusters as input [22]. Therefore, we use agglomerative clustering to form the pile clusters. Once the pile cluster is formed, it is passed to the repository for content retrieval by the filtering process under the supervision of the wrapper agent. The agent authenticates and retrieves the information on demand. Once the content is given to the end user, the query information and output values of the data set are stored in the repository, so that the user can get the filtered mined content within the stipulated period of time.
Future research can address service similarity, for example by performing semantic analysis on the description information of a service. In this way, more semantically similar services may be clustered together, which will increase the overall coverage of recommendations. Research can also be done to mine the implicit interests of the user from usage records and reviews. The semantic approach to obtaining the relevant query is to search through different search engines. The overall steps of the semantic meta search engine are: 1) relevant queries are formed using semantic similarity measures; 2) web documents are extracted based on the relevant queries; 3) the web documents are ranked. Here, the input query and its neighbours extracted from the ontology are used to select the most suitable query, and the web pages obtained from the different search engines are then ranked using the QSPR measure. Experiments were run with different sets of queries, and performance was analyzed in terms of precision and F-measure. From the experimental results, we found that the proposed meta search engine performed better than existing work, achieving a precision of 0.8. Finally, an expert system is introduced to rank the retrieved documents. Experts are authenticated to rate the displayed pages so that, when the same query is given again, the ranking is based on the experts' preferences.
It also suffers from a major disadvantage: each object must belong to exactly one group, which leads to the limitation that every group must have at least one member. Fuzzy clustering produces results that include too much noise, which affects the accuracy.

Experimental Results
For experimental verification, the proposed system is compared with other search engines, for example Google, Yahoo and AltaVista. The proposed system uses these ordinary search engines through their APIs and applies the proposed method as an enhanced filter on top of them. The experimental results are based on the distribution hypothesis for clustering the documents and on the genetic algorithm for identifying the similarity between contents. A sample of experts' preferences is then supplied and consolidated with the ranking system, which improves the search results to a certain level. The initial precision and recall data are collected from mined data records in different web pages, and the comparison is given in Table 1.
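For reference, precision, recall and F-measure for a single query can be computed as in the following minimal Python sketch. The function name and the document identifiers are hypothetical and serve only to illustrate the standard definitions; they are not the data reported in Table 1.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query.

    retrieved: documents returned by the engine
    relevant:  documents judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: 5 documents retrieved, 4 of them relevant,
# out of 8 relevant documents overall.
retrieved_docs = ["d1", "d2", "d3", "d4", "d5"]
relevant_docs = ["d1", "d2", "d3", "d4", "d6", "d7", "d8", "d9"]
p, r, f = precision_recall_f(retrieved_docs, relevant_docs)
```

In this made-up case precision is 4/5 and recall is 4/8, so the two measures can diverge considerably for the same result list.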
To verify the recommendation process experimentally, a real data set is processed using the EPCRR algorithm. A recommender system based on agglomerative clustering and collaborative filtering involves two stages. In the first stage, characteristic similarities between the services are computed, and all services are then merged into clusters using the agglomerative clustering algorithm. In the second stage, rating similarities between services that belong to the same cluster are computed; services whose rating similarity with the target service exceeds a threshold are selected as neighbours of the target service; and finally the predicted rating of the target service is computed. Generally, a recovery service is described by some tags and provides certain functionality. As an experimental case, ten recovery services are considered, and the corresponding description tags and functionality are listed in Table 2.
First, the description and functionality similarities between recovery services are computed. For instance, $s_2$ and $s_3$ share four stemmed tags out of six distinct stemmed tags, and the functionality of the two services is the same. Therefore, $D_{sim}(s_2, s_3) = 4/6$ and $F_{sim}(s_2, s_3) = 1$. Characteristic similarity is the weighted sum of the description similarity and the functionality similarity. The description similarity weight α is set to 0.5, and the computation likewise weights functionality similarity by 0.5. The characteristic similarity between $s_2$ and $s_3$ is then $C_{sim}(s_2, s_3) = (0.5 \times 4/6) + (0.5 \times 1) = 0.833$. Three digits after the decimal point are retained for the computation results. Characteristic similarities between all pairs of recovery services are computed in the same way, and the results are shown in Table 3. The agglomerative clustering algorithm is then applied. Initially each service is considered a cluster, and clusters are combined on the basis of characteristic similarity.
The reduction step of the algorithm is as follows:
Step 1: The most similar pair is searched for in the similarity matrix and merged to form a cluster.
Step 2: A new similarity matrix is created, and the similarities between clusters are calculated using average values.
Step 3: The similarities are stored.
Step 4: Repeat from Step 1 until the similarity is negligible.
The reduction steps are illustrated in Table 4.
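The reduction steps can be sketched as a naive average-linkage agglomeration over a similarity matrix. This is a minimal Python sketch, not the implemented algorithm: the function names, the cutoff parameter `min_sim` (standing in for "the similarity is negligible") and the small example matrix are assumptions made for illustration.

```python
from itertools import combinations


def average_similarity(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(sim[a][b] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(sim, min_sim=0.3):
    """Merge the most similar pair of clusters until no pair reaches min_sim.

    sim is a symmetric n x n similarity matrix (list of lists).
    Returns the final clusters (lists of indices) and the merge history.
    """
    clusters = [[i] for i in range(len(sim))]  # every item starts alone
    history = []
    while len(clusters) > 1:
        # Step 1: find the most similar pair of clusters (average linkage).
        (i, j), best = max(
            (((i, j), average_similarity(clusters[i], clusters[j], sim))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        # Step 4: stop once the best similarity is negligible.
        if best < min_sim:
            break
        # Steps 2 and 3: merge the pair and store the similarity used.
        merged = sorted(clusters[i] + clusters[j])
        history.append((merged, best))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters), history


# Hypothetical 4 x 4 similarity matrix with two obvious groups.
example_sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
final_clusters, merge_history = agglomerate(example_sim)
```

The sketch recomputes every cluster-to-cluster average on each pass, which keeps it short but is slower than updating a stored similarity matrix as described in Step 2.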
After several reduction steps only four clusters remain, and the algorithm terminates. The ten recovery services are thus merged into four clusters: $s_1$ and $s_6$ form cluster $C_1$; $s_2$, $s_3$, $s_5$, $s_7$ and $s_9$ form cluster $C_2$; $s_4$ alone forms cluster $C_3$; and $s_8$ and $s_{10}$ form cluster $C_4$. Suppose there are four users ($u_1$, $u_2$, $u_3$, $u_4$) who rated the ten recovery services. A rating matrix is shown in Table 4. The ratings are on a 5-point scale, and 0 means that the user did not rate the recovery service. As $u_3$ has not rated $s_5$ (a not-yet-experienced item), $u_3$ is regarded as the active user and $s_5$ as the target recovery service. By computing the predicted rating of $s_5$, it can be determined whether $s_5$ is a recommendable service for $u_3$. Furthermore, $s_2$ is chosen as another target recovery service: by comparing the predicted rating and the real rating of $s_2$, the accuracy of the proposed system can be verified. Since $s_5$ and $s_2$ both belong to cluster $C_2$, rating similarity is computed between the recovery services within $C_2$ using formula (4). The rating similarities between $s_5$, $s_2$ and every other recovery service in $C_2$ are listed in Table 5.
Rating similarity is computed using the Pearson correlation coefficient, which ranges from −1 to +1. A value of −1 indicates perfect negative correlation and a value of +1 indicates perfect positive correlation. Without loss of generality, the rating similarity threshold γ in formula (5) is set to 0.5. Since the rating similarity between $s_5$ and $s_2$ is 0.544 and that between $s_5$ and $s_3$ is 0.736, both greater than γ, $s_2$ and $s_3$ are chosen as the neighbours of $s_5$, i.e., $N(s_5) = \{s_2, s_3\}$.
Since the rating similarity between $s_2$ and $s_3$ is 0.839 and that between $s_2$ and $s_5$ is 0.544, both greater than γ, $s_3$ and $s_5$ are chosen as the neighbours of $s_2$, i.e., $N(s_2) = \{s_3, s_5\}$. According to formula (6), the predicted rating of $s_5$ for $u_3$ is 1.97 and the predicted rating of $s_2$ for $u_3$ is 1.06. Thus, $s_5$ is not a good recovery service for $u_3$ and will not be recommended to $u_3$. In addition, as the real rating of $s_2$ given by user $u_3$ is 1 (Table 6) while its predicted rating is 1.06, it can be inferred that the proposed system can make an accurate prediction in this case.
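The second stage (rating similarity, neighbour selection and prediction) can be sketched in Python as follows. Formulas (4) to (6) are not reproduced in this text, so the sketch makes its own assumptions: plain Pearson correlation over users who rated both services, and the common item-based prediction that adds a similarity-weighted average of neighbour deviations to the target service's mean rating. All names and the toy rating data are hypothetical, and the output is not expected to reproduce the values reported above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation over users who rated both services (0 = unrated)."""
    pairs = [(a, b) for a, b in zip(x, y) if a and b]
    if len(pairs) < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    vx = sum((a - mx) ** 2 for a, _ in pairs)
    vy = sum((b - my) ** 2 for _, b in pairs)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for constant ratings
    return cov / sqrt(vx * vy)


def mean_rating(ratings):
    """Mean of the non-zero (actually given) ratings of one service."""
    rated = [r for r in ratings if r]
    return sum(rated) / len(rated) if rated else 0.0


def neighbours(target, cluster, ratings, gamma=0.5):
    """Services in the same cluster whose rating similarity exceeds gamma."""
    sims = {s: pearson(ratings[target], ratings[s]) for s in cluster if s != target}
    return {s: v for s, v in sims.items() if v > gamma}


def predict(user, target, cluster, ratings, gamma=0.5):
    """Target mean plus similarity-weighted deviations of rated neighbours."""
    base = mean_rating(ratings[target])
    nbrs = {s: v for s, v in neighbours(target, cluster, ratings, gamma).items()
            if ratings[s][user]}  # keep only neighbours this user has rated
    if not nbrs:
        return base
    num = sum(v * (ratings[s][user] - mean_rating(ratings[s])) for s, v in nbrs.items())
    return base + num / sum(abs(v) for v in nbrs.values())


# Toy ratings: service -> ratings by users 0..3 (0 means not rated).
toy_ratings = {"a": [5, 4, 0, 1], "b": [4, 5, 2, 1], "c": [1, 2, 5, 4]}
toy_cluster = ["a", "b", "c"]
toy_neighbours = neighbours("a", toy_cluster, toy_ratings)
toy_prediction = predict(2, "a", toy_cluster, toy_ratings)
```

Restricting the neighbour search to `cluster` is what keeps the online step cheap: only services in the target's own cluster are compared.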

Efficiency of the Proposed System
A hybrid system uses a combination of collaborative and content-based filtering, and it restricts itself to the collaborative filtering strategies. Before the content-based filtering begins, it accepts the data set formed in the process; the recommended data sets are formulated according to the hierarchy of user preferences, and the ranking is assigned on the basis of those preferences [18]. In a content-based system, keywords are used to describe the items, and a user profile is built to indicate the keywords of the items the user prefers. In other words, the proposed algorithm tries to give the recommended user the most relevant data in the hierarchy, makes sure that the retrieved information contains no duplicates, and stores both used and unused information in the repository for future use [22]. Generally, the cold start and data sparsity problems may affect system performance; here we discuss how the proposed system overcomes these problems to improve performance.

Accuracy of the Proposed Recommendation:
To evaluate the accuracy of this algorithm, we use the mean absolute error (MAE), sometimes also called the absolute deviation, which measures the deviation of recommendations from their true user-specified ratings. It takes the mean of the absolute differences between each prediction and the corresponding user rating in the test set. MAE is computed as follows:

$$MAE = \frac{1}{n} \sum_{u,i} \left| r_{u,i} - P_{u,i} \right|$$

In this formula, n represents the number of rating-prediction pairs, $r_{u,i}$ is the rating that an active user u gives to a recovery service i, and $P_{u,i}$ denotes the predicted rating of i for u.
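A minimal Python sketch of the MAE computation is given below. The function name is an assumption, and the second example uses made-up ratings purely for illustration; the first example reuses the single rating-prediction pair quoted in the experimental section (real rating 1, predicted rating 1.06).

```python
def mean_absolute_error(actual, predicted):
    """Mean of |r - p| over paired real ratings and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(r - p) for r, p in zip(actual, predicted)) / len(actual)


# Single pair from the text: real rating 1, predicted rating 1.06.
mae_single = mean_absolute_error([1], [1.06])

# Made-up ratings for illustration: errors of 0.5, 0.5 and 1.0.
mae_toy = mean_absolute_error([4, 2, 5], [3.5, 2.5, 4])
```

A lower MAE means that the predictions lie closer to the ratings the users actually gave.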
For each test recovery service in each fold, the predicted rating is calculated with the traditional system and with the proposed system separately. The recovery service data set is processed with both approaches and the MAE is calculated. Without loss of generality, the value of K in our experiment is set to 4, 5, 6, 7 and 8 in turn. Furthermore, the rating similarity threshold γ is set for two cases. Under these parameter conditions, the predicted ratings of the test services are calculated by the proposed system and the traditional system, and the average MAEs of the two systems are then computed using the MAE formula above. The comparison results are shown in Figure 4.
When the rating similarity threshold γ < 0.5, the MAE values of the proposed recommendation decrease as the value of K increases. The services are divided into clusters, and the services in a cluster are more similar to each other. Furthermore, the neighbours of a target service are chosen from the cluster to which the target service belongs. These neighbours are therefore likely to be closer to the target service, which results in more accurate prediction.
When γ = 0.5, the MAE values of the proposed system and the traditional system both increase. Checking the intermediate results of the two approaches shows that, with the rating similarity threshold set to 0.5, the test services have few or no neighbours when neighbours have to be selected from a smaller cluster. This results in large deviations between the predicted ratings and the real ratings.

Computation Time for the Proposed Recommendation
The time complexity of this approach has two parts: the offline cluster formation with the agglomerative clustering algorithm and the online collaborative filtering. There are two main computationally expensive steps in the clustering algorithm. The first step is the computation of the pairwise similarity between all the services. If the number of services in the recommender system is n, the complexity of this step is generally $O(n^2)$. The second step is the repeated selection of the pair of most similar clusters, or the pair of clusters that optimizes the criterion function. A naive way of performing this step is to re-compute the gains achieved by merging each pair of clusters after each level of the agglomeration and to select the most promising pair. If the number of the target service's neighbours reaches the maximum value, the worst-case time complexity of item-based prediction is $O(n_k)$. Since $n_k \ll n$ and $m_k \ll m$, the cost of computation decreases significantly [18]. To evaluate the efficiency of the proposed recommender system, its online computation time is compared with that of a traditional recommender system, as shown in Figure 5.
Overall, the proposed recommender system spends less computation time than traditional item-based collaborative filtering. Since the number of services in a cluster is smaller than the total number of available services, the time for computing rating similarity between every pair of services is reduced. As the rating similarity threshold γ increases, the computation time of the proposed recommender system decreases, because the number of neighbours of the target service decreases when γ increases.
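The saving can be made concrete by counting service pairs. The short Python sketch below uses the cluster sizes from the ten-service example in the experimental section (2, 5, 1 and 2 services); counting pairs is only a rough proxy for computation time, and the helper name is an assumption.

```python
from math import comb


def pair_count(n):
    """Number of unordered pairs among n services, n(n-1)/2."""
    return comb(n, 2)


cluster_sizes = [2, 5, 1, 2]  # C1..C4 in the ten-service example

# Traditional item-based CF: rating similarity for every pair of services.
pairs_all = pair_count(sum(cluster_sizes))

# Clustered approach: rating similarity only within each cluster.
pairs_within = sum(pair_count(k) for k in cluster_sizes)

print(pairs_all, pairs_within)  # 45 12
```

For a single target service the effect is similar: only the other services in its own cluster need to be compared online, rather than all remaining services.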

Conclusion and Future Work
We conclude by proposing an exact semantic search engine that gives the user the data content with the highest priority, based on agglomerative clustering of the data. The work is extended with a filter that operates under the user agent without any supervision. In pile clustering, a ranking hierarchy is assigned according to the relevance of the data, so that the user receives the content in ranked order through an efficient mining process. The mining process refines the information content that has been clustered in the data set with the highest affinity to the relevant information. As a result, a user in any environment can get exact information according to their needs. Acquired information that is not used is moved to the repository for future use. The procedure makes information retrieval appear to work intelligently, while in fact data recommendation is what underlies that intelligence. In future, the same work can be extended to rely purely on an expert system, without any intervention from the external user, to obtain the content in the absence of mining.

Figure 1. Architecture diagram of proposed system.

Figure 2. Diagram for similarity and functionality calculation.

Figure 3. Diagram for clustering the services.

Figure 4. Comparison of MAE with proposed and traditional recommendation systems.

Figure 5. Comparison of computation time with proposed and traditional recommendation system.

Table 1. Comparison result analysis.

Table 2. Example of different recovery services in computer system.

Table 5. of cluster of services.

Table 6. Rating similarity between selected services.