Latent Semantic Analysis involves natural language processing techniques for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts, called semantic topics, related to the documents and terms. These semantic topics assist search engine users by providing leads to the more relevant documents. We develop a novel algorithm called Latent Semantic Manifold (LSM) that can identify the semantic topics in high-dimensional web data. The LSM algorithm is built upon concepts from topology and probability. A search tool was also developed using the LSM algorithm and deployed for two years at two sites in Taiwan: 1) Taipei Medical University Library, Taipei, and 2) Biomedical Engineering Laboratory, Institute of Biomedical Engineering, National Taiwan University, Taipei. We evaluated the effectiveness and efficiency of the LSM algorithm by comparing it with other contemporary algorithms. The results show that the LSM algorithm outperforms the others. This algorithm can be used to enhance the functionality of currently available search engines.

In the traditional approach to data gathering, we collect data on a few well-chosen variables and then manually perform various tasks, such as finding relevant information, analyzing it, and making decisions [

To combat the problem of losing relevant information in the overwhelming amount of data, a number of search engines have proliferated recently, which aid users in searching for content relevant to them [

Many effective search engines, such as MedEvi, EBIMed, MEDIE, PubNet, GoPubMed, Argo, and Vivisimo, provide capabilities to fit search results to the users’ intent. These search engines can discover latent semantics (relationships between a set of documents and the terms they contain) in the search-engine-generated documents and classify these documents into homogeneous semantic clusters [

In the past, many algorithms/techniques have been deployed to develop semantic search engines as described in the previous paragraph [

In the last decades, other dimension-reduction techniques, such as Latent Semantic Indexing, Probabilistic Latent Semantic Indexing, and Latent Dirichlet Allocation, have been proposed to overcome the shortcomings of earlier search engines. However, all of these are based on bag-of-words models, which follow the Aldous and de Finetti theorem of exchangeability, whereby the order of terms in a document or of documents in a corpus can be neglected [

As the literature review and our arguments show, there is a need to enhance search engines’ capabilities to reveal latent semantics in high-dimensional web data while preserving the relationships and order of terms and documents. We propose a novel algorithm called Latent Semantic Manifold (LSM), which identifies homogeneous groups in web data while preserving spatial information about terms in a document and documents in the corpus. This paper explains the Latent Semantic Manifold algorithm (hereafter, the LSM algorithm), its deployment, and its performance evaluation.

This study consists of three key components: proposing and describing the LSM algorithm, its deployment, and evaluation. They are described in the following subsections.

The proposed LSM algorithm is based upon concepts of probability and topology and identifies the latent semantics in data.

Step 1 (Identifying relevant fragments from the user-query-generated documents): A user enters a query into a search engine, which generates a set of documents. The relevant fragments (paragraphs, in the LSM) are identified from the generated documents. The identification of the fragments is handled by the “document preprocessor” of the search engine, which typically normalizes the document stream to a predefined format, breaks the document stream into desired retrievable units, and isolates and metatags subdocument pieces.
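The preprocessing role described above can be sketched as follows. This is an illustrative stand-in for the search engine’s document preprocessor; the normalization rules (whitespace collapsing, lowercasing, paragraph splitting) are assumptions, not the paper’s exact pipeline.

```python
import re

def preprocess(documents):
    """Split a stream of documents into normalized paragraph fragments,
    metatagging each fragment with its (document, paragraph) origin."""
    fragments = []
    for doc_id, text in enumerate(documents):
        # Break the document stream into retrievable units (paragraphs).
        for para_id, para in enumerate(text.split("\n\n")):
            # Normalize whitespace and case to a predefined format.
            normalized = re.sub(r"\s+", " ", para).strip().lower()
            if normalized:
                # Metatag the sub-document piece so order is preserved.
                fragments.append({"doc": doc_id, "para": para_id,
                                  "text": normalized})
    return fragments

docs = ["First paragraph.\n\nSecond  paragraph here.", "Another document."]
frags = preprocess(docs)
```

Keeping the `(doc, para)` position on each fragment is what lets later steps preserve the order of terms and documents, in contrast to bag-of-words models.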

Step 2 (Recognizing named entities and constructing a heterogeneous manifold): It is crucial to extract significant “terms” from the fragments (identified in Step 1) to construct heterogeneous manifolds. Notably, various types of terms can be extracted given a large number of training documents. However, extracting different types of terms and calculating their marginal and conditional probabilities is highly computation-intensive [

A CRF models a conditional distribution P(z|x) by selecting the label sequence z, a named category, to label a novel observation sequence x, with an associated undirected graph structure that obeys the Markov property. When conditioned on a particular observation sequence, the CRF defines a single log-linear distribution over the label sequence. The CRF model does not need to explicitly represent the dependencies among the input variables x, which affords the use of rich, global features of the input and thus relaxes the strong independence assumptions that HMMs require to ensure tractable inference. The relationships among these named entities construct a complex structure called a heterogeneous manifold.

Algorithm
Require: A collection of returned documents from a search query.
Ensure: A collection of semantic manifolds.

| Step | Operation |
|---|---|
| Step 1 | Perform feature extraction using the discriminative linear-chain Conditional Random Field method to generate named entities. |
| Step 2 | Construct a manifold from the set of named entities generated from the document collection. |
| Step 3 | Classify the manifold into isomorphic (homogeneous) categories using the Graph-based Tree-width Decomposition algorithm, starting from a fixed-dimension local manifold. Require: each named entity e_{i} is associated with its named categories, equipped with a weighted probability. Ensure: |
| Step 3.1 | Let a semantic topic set: |
| Step 3.2 | Given a tree-width d, find a semantic manifold M_{j} generated from single named entities for each semantic category z_{i} initially, in which \|M_{j}\| = d, and the semantic mapping |
| Step 3.3 | Perform graph decompositions on G starting from M_{j}. |

The named entities are annotated with their marginal probabilities, and the correlations among named entities with their conditional probabilities. For example, jaguar is considered a named entity and is assigned to the animal or vehicle type depending on the overall context of the fragment. As illustrated in

Step 3 (Decomposing a heterogeneous manifold into homogeneous manifolds): As mentioned in Step 2, the heterogeneous manifold consists of a complex structure of named entities, including estimates of marginal and conditional probabilities. A collection of fragment vectors lies on the heterogeneous manifold, which contains local spaces resembling Euclidean spaces of a fixed number of dimensions. Every point of the n-dimensional heterogeneous manifold has a neighborhood homeomorphic to the n-dimensional Euclidean space R^n. In addition, all the points in the local spaces are strongly connected. Because the heterogeneous manifold is overly complex and the semantics are latent in local spaces, instead of retaining one heterogeneous manifold we break it into a collection of homogeneous manifolds. Topological and geometrical concepts can be used to represent the latent semantics of a heterogeneous manifold as a collection of homogeneous manifolds. A Graph-based Tree-width Decomposition algorithm is used to decompose a heterogeneous manifold into a collection of homogeneous manifolds [
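As a simplified stand-in for the Graph-based Tree-width Decomposition, the sketch below only separates the entity graph into connected components, the degenerate case; the actual algorithm further splits each component at small separators bounded by the tree-width. The example entity graph is hypothetical.

```python
from collections import defaultdict

def decompose(edges):
    """Split a named-entity graph into connected pieces, each a candidate
    homogeneous manifold (strongly connected local space)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Two semantically unrelated entity clusters end up in separate manifolds.
edges = [("APC", "colorectal cancer"), ("colorectal cancer", "gene"),
         ("jaguar", "vehicle")]
parts = decompose(edges)
```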

let a heterogeneous manifold M_{i} for fragment i be the set of homogeneous manifolds, such that

are independent. In addition, a semantic topic set

with semantic mapping

probabilities indicate the number of documents that are relevant to a homogeneous manifold and match the user’s intent. To induce homogeneous manifolds, it is crucial to extract significant terms from the fragments. In addition, the relevance of each fragment to its homogeneous manifold should be demonstrated, so that users can consult only the fragments associated with the homogeneous manifold they want.

Step 4 (Exploring the homogeneous manifolds): The relevant fragments cluster around their related homogeneous manifolds. For instance, for a user query on the term APC, the fragments aggregate into a collection of homogeneous manifolds, as shown in

The LSM algorithm was deployed to develop a search tool. A team of three researchers including an expert in the Java programming language developed the tool using the Eclipse Software Development Kit. The LSM tool was used for two years at two places in Taiwan: 1) Taipei Medical University Library, Taipei; and 2) Biomedical Engineering Laboratory, Institute of Biomedical Engineering, National Taiwan University, Taipei. The members of the library and lab used the LSM tool to perform semantic searches in the PubMed database.

Data sets: Two data sets, Reuters-21578-Distribution-1 and OHSUMED, were used to evaluate the performance of the LSM algorithm. Reuters-21578-Distribution-1 is a standard benchmark for text categorization, consisting of newswire articles classified into 135 topics [

Evaluation criteria: Effectiveness and efficiency were measured in an experimental evaluation of the LSM algorithm. Effectiveness is defined as the ability to identify the right cluster (collection of documents). As shown in

Moreover, F_{1} is calculated as the mean of all per-category results, i.e., a macro-average over the categories.

In addition, two other evaluation metrics, Normalized Mutual Information (NMI) and overall F-measure, were also used [

where

| Statistics | Number of topics | Number of documents | Documents per topic |
|---|---|---|---|
| Origin | 135 | 21,578 | 0 - 3945 |
| Single topic | 65 | 8649 | 1 - 3945 |
| Single topic (≥5 documents) | 51 | 9494 | 5 - 3945 |

| Category | | Clustering results | |
|---|---|---|---|
| | | Yes | No |
| Expert Judgment | Yes | TP | FN |
| | No | FP | TN |

^{a}TP: True Positive; FP: False Positive; FN: False Negative; TN: True Negative.

The Normalized Mutual Information metric MI(C, C') returns a value between zero and max(H(C), H(C')).
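A minimal sketch of this metric is given below, normalizing MI(C, C') by max(H(C), H(C')) so the score falls between zero and one; this is the common convention and is assumed to match the paper’s definition.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information between a topic labeling C
    (labels_true) and a clustering C' (labels_pred)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    ct, cp = Counter(labels_true), Counter(labels_pred)
    # Mutual information between the two partitions.
    mi = sum((nij / n) * math.log((n * nij) / (ct[z] * cp[c]))
             for (z, c), nij in joint.items())
    # Entropy of a partition, used for normalization.
    def h(counts):
        return -sum((m / n) * math.log(m / n) for m in counts.values())
    denom = max(h(ct), h(cp))
    return mi / denom if denom else 0.0
```

A perfect clustering (up to relabeling) scores 1.0, and an uninformative one scores 0.0.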

Let

where F(z, z') calculates the F-measure between z and z'.
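The overall F-measure above, matching each topic z to the cluster z' that maximizes F(z, z'), can be sketched as follows. The size-weighted best-match rule used here is the common convention and is assumed to match the paper’s usage.

```python
def f_measure(z_docs, zp_docs):
    """Pairwise F between a topic z and a cluster z' (sets of documents)."""
    tp = len(z_docs & zp_docs)
    if tp == 0:
        return 0.0
    p, r = tp / len(zp_docs), tp / len(z_docs)
    return 2 * p * r / (p + r)

def overall_f(topics, clusters):
    """Weight each topic by its size and take its best-matching cluster."""
    n = sum(len(docs) for docs in topics.values())
    return sum(len(docs) / n * max(f_measure(docs, c)
                                   for c in clusters.values())
               for docs in topics.values())

# Hypothetical example: two topics recovered exactly by two clusters.
topics = {"ibd": {1, 2, 3}, "gene": {4, 5}}
clusters = {"a": {1, 2, 3}, "b": {4, 5}}
```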

Efficiency is defined as the clustering time for a search query under a fixed feature set for each clustering scheme.

Experiments: The experiments were conducted using the Reuters-21578-Distribution-1 and OHSUMED data sets. Clusters ranging from two to ten topics were randomly selected to compare the LSM with other clustering methods. For each clustering method, each test run was conducted on a selected topic, and the Normalized Mutual Information of the topic and its corresponding cluster was calculated. After conducting fifty test runs for each fixed number k of topics, where

| k | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| LSM | 0.461 | 0.505 | 0.622 | 0.686 | 0.714 | 0.792 | 0.893 | 0.884 | 0.9 | 0.717 |
| CCF | 0.569 | 0.563 | 0.607 | 0.62 | 0.605 | 0.624 | 0.633 | 0.647 | 0.676 | 0.616 |
| GMM | 0.475 | 0.468 | 0.462 | 0.516 | 0.551 | 0.522 | 0.551 | 0.557 | 0.548 | 0.517 |
| NB | 0.466 | 0.348 | 0.401 | 0.405 | 0.409 | 0.404 | 0.435 | 0.411 | 0.418 | 0.411 |
| GMM + DFM | 0.47 | 0.466 | 0.45 | 0.513 | 0.531 | 0.506 | 0.535 | 0.535 | 0.536 | 0.505 |
| KM | 0.404 | 0.402 | 0.461 | 0.525 | 0.561 | 0.548 | 0.583 | 0.597 | 0.618 | 0.522 |
| KM-NC | 0.438 | 0.462 | 0.525 | 0.554 | 0.592 | 0.577 | 0.594 | 0.607 | 0.618 | 0.552 |
| SKM | 0.458 | 0.407 | 0.499 | 0.561 | 0.567 | 0.558 | 0.591 | 0.598 | 0.619 | 0.54 |
| SKM-NCW | 0.434 | 0.423 | 0.515 | 0.556 | 0.577 | 0.563 | 0.593 | 0.602 | 0.612 | 0.542 |
| BP-NCW | 0.391 | 0.377 | 0.431 | 0.478 | 0.493 | 0.5 | 0.519 | 0.529 | 0.532 | 0.472 |
| AA | 0.443 | 0.415 | 0.488 | 0.531 | 0.571 | 0.542 | 0.587 | 0.594 | 0.611 | 0.531 |
| NC | 0.484 | 0.461 | 0.555 | 0.592 | 0.617 | 0.594 | 0.64 | 0.634 | 0.643 | 0.58 |
| RC | 0.417 | 0.381 | 0.505 | 0.46 | 0.485 | 0.456 | 0.548 | 0.484 | 0.495 | 0.47 |
| NMF | 0.48 | 0.426 | 0.498 | 0.559 | 0.591 | 0.552 | 0.603 | 0.601 | 0.623 | 0.548 |
| NMF-NCW | 0.494 | 0.5 | 0.586 | 0.615 | 0.637 | 0.613 | 0.654 | 0.659 | 0.658 | 0.602 |
| CF | 0.48 | 0.429 | 0.503 | 0.563 | 0.592 | 0.556 | 0.613 | 0.609 | 0.629 | 0.553 |
| CF-NCW | 0.496 | 0.505 | 0.595 | 0.616 | 0.644 | 0.615 | 0.66 | 0.66 | 0.665 | 0.606 |

^{b}LSM: Latent semantic manifold; CCF: k-clique community finding algorithm; GMM: Gaussian mixture model; NB: Naive Bayes clustering; GMM + DFM: Gaussian mixture model followed by the iterative cluster refinement method; KM: Traditional k-means; KM-NC: Traditional k-means with spectral clustering based on the normalized cut criterion; SKM: Spherical k-means; SKM-NCW: Spherical k-means, normalized-cut weighted form; BP-NCW: Spectral clustering based on bipartite normalized cut; AA: Average association criterion; NC: Normalized cut criterion; RC: Spectral clustering based on ratio cut criterion; NMF: Non-negative matrix factorization; NMF-NCW: Non-negative matrix factorization, normalized-cut weighted form; CF: Concept factorization; CF-NCW: Concept factorization, normalized-cut weighted form.

| k | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.9845 | 0.9579 | 0.9385 | 0.9352 | 0.8909 | 0.9013 | 0.9148 | 0.8913 | 0.8859 |
| Recall | 0.7085 | 0.6384 | 0.6453 | 0.6056 | 0.5916 | 0.6543 | 0.6822 | 0.6688 | 0.6805 |
| Overall F-measure | 0.7988 | 0.7297 | 0.7399 | 0.6986 | 0.6822 | 0.7329 | 0.7562 | 0.7343 | 0.7472 |
| NMI | 0.4617 | 0.5051 | 0.6221 | 0.6866 | 0.7148 | 0.7925 | 0.8936 | 0.8848 | 0.9006 |

The average precision, recall, overall F-measure, and Normalized Mutual Information of LSM, LST, PLSI, PLSI + AdaBoost, LDA, and CCF were evaluated using the Reuters-21578-Distribution-1 data set, and those of LSM, LST, and CCF using the OHSUMED data set, as shown in

Normalized Mutual Information comparison of the LSM algorithm with the other sixteen methods using the Reuters-21578-Distribution-1 data set is shown in

| Dataset | Method | Precision | Recall | Overall F-measure | NMI |
|---|---|---|---|---|---|
| Reuters | LSM | 0.81 | 0.773 | 0.786 | 0.717 |
| | LST | 0.779 | 0.745 | 0.754 | 0.633 |
| | PLSI | 0.649 | 0.627 | 0.636 | 0.54 |
| | PLSI + AdaBoost | 0.772 | 0.812 | 0.697 | N/A |
| | LDA | 0.66 | 0.714 | 0.686 | 0.61 |
| | CCF | 0.727 | 0.73 | 0.723 | 0.616 |
| OHSUMED | LSM | 0.59 | 0.479 | 0.522 | 0.315 |
| | LST | 0.586 | 0.388 | 0.456 | 0.257 |
| | CCF | 0.514 | 0.54 | 0.513 | 0.214 |

^{d}LSM: Latent semantic manifold; LST: Latent semantic topology; PLSI: Probabilistic latent semantic indexing; PLSI + AdaBoost: Probabilistic latent semantic indexing with adaptive boosting; LDA: Latent Dirichlet allocation; CCF: k-clique community finding algorithm.

Our findings suggest that the LSM algorithm, which can discover the latent semantics in high-dimensional web data, might play an instrumental role in enhancing search engine functionality. LSM carries out searches based on both keywords and meaning, which can assist researchers in performing semantic searches on databases. For example, a researcher can search APC with Adenomatous Polyposis Coli as his or her intended meaning in the PubMed database (the output of the user-queried term APC is shown in

APC can also have other meanings, such as Antigen-Presenting Cells, Anaphase-Promoting Complex, or Activated Protein C. Suppose that, in a homogeneous manifold, APC, Colorectal Cancer, and gene-related documents are assembled; the homogeneous manifold would then point to the meaning of APC as the Adenomatous Polyposis Coli gene. Similarly, if APC, Major Histocompatibility Complex, and T-cell-related documents are assembled, it would indicate the meaning of APC as Antigen-Presenting Cells.
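The sense-from-context behavior described above can be sketched as a toy lookup: the terms assembled with APC in a homogeneous manifold indicate its intended sense. The cue-term table below is a hypothetical simplification, not the paper’s model.

```python
# Map each candidate sense of "APC" to context terms that suggest it
# (hypothetical cue sets for illustration).
SENSE_CUES = {
    "adenomatous polyposis coli": {"colorectal cancer", "gene"},
    "antigen-presenting cells": {"major histocompatibility complex",
                                 "t-cells"},
}

def disambiguate(manifold_terms):
    """Pick the sense whose cue terms overlap the manifold's terms most;
    return None when no cue matches."""
    scores = {sense: len(cues & manifold_terms)
              for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

sense = disambiguate({"apc", "colorectal cancer", "gene"})
```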

According to the results, inflammatory bowel disease and its subtypes (Crohn’s disease and ulcerative colitis) are associated with the gene NOD2. The term NOD2 was found to be evenly spread over these three topics: inflammatory bowel disease, Crohn’s disease, and ulcerative colitis. Some evolving topics, such as the bacterial component, were also discovered. However, the result was different when we searched NOD2 on the Genia Corpus (

We can see that results (

(p-value < 0.05) (

Limitation and future studies: This study has a few limitations that open up scope for future studies. First, to identify and discriminate the correct topics in a collection of documents, a combination of features and their co-occurring relationships serve as clues, and probabilities indicate their significance. All the features in the documents comprise a topological probabilistic manifold, associated with probabilistic measures, that denotes the underlying structure. This complex structure is decomposed into inseparable components at various levels (various levels of skeletons) so that each component corresponds to a topic in the collection of documents. This process is computation-intensive and time-consuming, and depends strongly on the features and their identification (named entities). Second, some terms with similar meanings, such as anticipate, believe, estimate, expect, intend, and project, were separated into independent topics. Likewise, some terms were repeatedly assigned to many topics. These issues might be addressed by utilizing thesauri and other adaptive methods [

We found that the LSM algorithm can discover the latent semantics in high-dimensional web data and can organize them into several semantic topics. This algorithm can be used to enhance the functionality of currently available search engines.

This work was supported by the National Science Council of Taiwan (NSC 98-2221-E-038-012).

Ajit Kumar, Sanjeev Maskara, I-Jen Chiang (2015) Identifying Semantic in High-Dimensional Web Data Using Latent Semantic Manifold. Journal of Data Analysis and Information Processing, 03, 136-152. doi: 10.4236/jdaip.2015.34014