Knowledge Driven Paper Recommendation Using Heterogeneous Network Embedding Method

We search a variety of things over the Internet in our daily lives, and numerous search engines are available to get us more relevant results. With the rapid technological advancement, the internet has become a major source of obtaining information. Further, the advent of the Web2.0 era has led to an increased interaction between the user and the website. It has become challenging to provide information to users as per their interests. Because of copyright restrictions, most of existing research studies are confronting the lack of availability of the content of candidates recommending articles. The content of such articles is not always available freely and hence leads to inadequate recommendation results. Moreover, various research studies base recommendation on user profiles. Therefore, their recommendation needs a significant number of registered users in the system. In recent years, research work proves that Knowledge graphs have yielded better in generating quality recommendation results and alleviating sparsity and cold start issues. Network embedding techniques try to learn high quality feature vectors automatically from network structures, enabling vector-based measurers of node relatedness. Keeping the strength of Network embedding techniques, the proposed citation-based recommendation approach makes use of heterogeneous network embedding in generating recommendation results. The novelty of this paper is in exploiting the performance of a network embedding approach i.e., matapath2vec to generate paper recommendations. Unlike existing approaches, the proposed method has the capability of learning low-dimensional latent representation of nodes (i.e., research papers) in a network. We apply metapath2vec on a knowledge network built by the ACL Anthology Network (all about NLP) and use the node relatedness to generate item (research article) recommendations. How to cite this paper: Ahmed, I. and Kalhoro, Z.A. (2018) Knowledge Driven Paper Recommendation Using Heterogeneous Network Embedding Method. Journal of Computer and Communications, 6, 157-170. https://doi.org/10.4236/jcc.2018.612016 Received: December 5, 2018 Accepted: December 25, 2018 Published: December 28, 2018 Copyright © 2018 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/


Introduction
Due to the availability of enormous information on the World Wide Web, it is quite daunting and difficult to get access to relevant information.Information overload and cognitive stress are the core issues users are facing these days when they are in search of relevant items/information.To cope with such issues, recommender systems in different online services are playing a significant role in finding relevant information/item for the user.Traditional approaches such as a matrix factorization attempt try to characterize user-item interaction records (e.g., using user-item rating matrix) by learning an effective prediction function.Due to the availability of web services, recommender systems can consume various kinds of auxiliary information and generate more useful recommendations [1] [2] [3].However, to model and utilize such complex heterogeneous information is a daunting task.Furthermore, it is more challenging to develop a relatively general approach to model these varying data in different systems or platforms.In recent years, research works have proven that integrating knowledge graphs with state of the art CF methods have yielded quality recommendation results.Additionally, their hybridization helps in resolving issues such as new scarce items, faced by existing recommendation approaches [4]- [9].
Among existing approaches, meta-path2vec has performed well in learning the features of network structure.Research work [10] has exploited its efficiency over other existing approaches in link prediction and node classification scenarios.In light of these studies, our research aims to use metapath2vec in learning knowledge network and find research items that meet user requirements.
In this research, we have proposed a research paper recommendation system using a novel heterogeneous network embedding technique called CN_HER exploiting latent relations in the scholarly paper recommendation.Furthermore, it can adequately amalgamate different kinds of information in HIN to enhance the quality of generated results.In literature, researchers have exploited citation networks in recommending research papers; however, they ignore the latent relations by employing direct relations such as co-citation relationships as accomplished in [11] [12].Identifying and incorporating the latent relations across research papers could play a significant role and may improve the recommendation performance.The authors mined the hidden relation between a target paper and its references to present utile recommendations.
In a nutshell, the contributions of this paper can be summarized as follows: • To uncover the semantic and structural information, we use a heterogeneous network embedding approach called metapath2vec.
• To the best of our knowledge, we integrate heterogeneous network embedding approach, i.e.CN_HER which can adequately incorporate embedding information into HIN resulting in enhanced recommendation results.
• We perform experiments on real-world dataset showing the effectiveness of our proposed model.Additionally, we examine its capability in resolving various problems, including sparsity, cold start, prediction accuracy and scalability issues, and reveal its improved recommendation results in HINs.The paper is organized as follows.Section 2 mainly introduces the related work closely related to the research of this paper.Section 3 introduces Proposed Description and problem formulation of our model.Section 4 is devoted to experimental study, and section 5 concludes the paper.

Knowledge Graph Embeddings Based Item Recommendation Using Node2vec
In a graph, node2vec [10] is a learning representation of nodes using the word2vec model on node sequences experimented through random walks.Node2vec is the concept of random walk exploration that is easygoing to represent several of connectivity patterns in a graph, node2vec brings this innovation.This proposed work shows that node2vec can be efficiently used to learn Knowledge network embeddings to perform in recommendation system.Node2vec is applied in a knowledge network, including user feedback on items, modelled by the special relation "feedback", and item relations to other entities; then, using the similarity between users and items in the vector space model for producing the recommendations results.In this study, how learning knowledge network embedding's using node2vec can be used to generate items.To use node2vec on a knowledge graph built from the MovieLens 1M dataset and DBpedia and use the node similarity for generating item recommendations.Given a knowledge network K consisting of users, items (I, U) (the object of the recommendations, e.g. a movie) and other entities E (objects connected to items, e.g. the director of a movie), vector representations of the users x u and of the items x i (and of other entities x e ) are generated by node2vec.Thus, propose to use as a ranking function the similarity between the user and the item vectors: ( ) ( ) where c is the cosine similarity in this work.

Recommendation Model Based on Knowledge Graph Representation Learning Technology
The learning technology expresses the semantic information of the research object as dense low dimensional real value vector by machine learning method, which solves the problem that the former single heat (one-hot) vector cannot represent the semantic information of the research object.By means of knowledge network representation learning technology, the low dimensional dense vectors of entities and relationships in the Knowledge Graph can be obtained, and the semantic information of the entities under the structure of knowledge network is expressed.F. Zhang et al. [13], the idea of adding the information in the knowledge graph to the recommendation model by means of learning is presented in the following ways: first, through TRANSR [14] learning the vectors of entities and relationships in the knowledge network.Then, the potential feature model is used to decompose the user's object matrix to obtain the latent eigenvector of the user and the object, and finally the vector of the object and the potential eigenvector of the object, which is embedding with the structure information of the knowledge network is added to obtain the final vector of the object.The recommended results are ultimately determined by the score ranking obtained by the user vector and the object vector.The distributed representation of goods is based on the structure learning of the graph in the knowledge network, so the direct addition and form cannot directly describe the structural information that is learned from the graph, we propose a new method to combine the Knowledge Network Embeddings technology with the recommendation model.In response to the above problems, our approach uses a method similar to [13], using embedding technology to learn the vector of entities and relationships in the knowledge network, and then, by multiplying the product vectors to get the relationship matrix of the object.The user-object preference matrix is obtained through the user-object feedback matrix and the product-item relationship matrix products, we introduce the idea of knowledge network information similar to [15], pass the user's preference through the structure of the knowledge network.Finally, the user-object preference matrix is decomposed to obtain the potential eigenvector of users and objects.We propose a citation based research paper recommender system using Heterogeneous Information Network Embedding method for Recommendation (CN_HER).Given a Heterogeneous Information Network Embeddings, the task is to learn the low-dimensional latent representations (embeddings) for every object v ∈ V.The learned low-dimensional latent representations are expected to highly encapsulate informative characteristics, which are probable to be beneficial in paper recommender systems represented on Heterogeneous Information Network Embeddings.Nevertheless, most of the current network embedding approaches mostly focus on homogeneous networks, which are not capable to efficiently model heterogeneous information networks.For example, a node2vec and deepwalk, etc.; to generate node sequences are used by the innovative study [10] [16], which cannot contradistinguish nodes and edges with different object and relation types.Therefore, it needs a more righteous way to traverse the Heterogeneous Information Network Embeddings and generate meaningful node sequences.For the first time, we integrate heterogeneous network embedding model (metapath2vec) in recommendation, which has been enhanced the recommendation results.The learned score list ( ) , s x y is used to rank the scores between manuscript and all the scientific papers in the dataset to produce a recommendation.

Architecture and Working of Proposed Solution
In this paper, we investigate a citation based research paper recommender system using Heterogeneous Information Network Embedding model for Recommendation to recommend the relevant top-N research paper.In recommending papers, the proposed work goes through different phases.

System Overview
As introduced earlier, our Network-based citation recommendation framework includes the following three main stages: Training data preparation, Representation learning and Network-based citation recommendation (See Figure 1).

Heterogeneous Information Network Construction
For given information, our multi-layered network model can be defined as follows: A Graph G defines a Heterogeneous Information Network ( ) consisting of node v and edge e.A Heterogeneous Information Network is associated with their mapping functions  , respectively.The sets of object and relation types are denoted as a V A and E A , where 2 For instance, the information network can be one represented in Figure 1 with authors (V), papers (T), venues (A), terms (P) as an object set V (nodes), where, publish (A-P, P-V), co-author (A-A), terms (P-T) relationships are indicated as a link set E (edges).

Heterogeneous Network Embedding: Metapath2vec
The recent progress on network embedding [10] [16] encouraged.We propose a CN_HER approach in order to extract and represent useful information on Heterogeneous Information Networks for the paper recommendation.Given a Heterogeneous Information Network Embeddings, The task is to learn the low-dimensional latent representations (embedding) ∈ for every object v ∈ V.The learned low-dimensional latent representations are expected to highly encapsulate informative characteristics, which are probable to be beneficial in paper recommender systems represented on Heterogeneous Information Network Embeddings.Nevertheless, most of the current network embedding approaches mostly focus on homogeneous networks, which are not capable to efficiently model heterogeneous information networks.For example, a random walk to generate node sequences are used by the innovative study deepwalk [16], which cannot contradistinguish nodes and edges with different object and relation types.Therefore, it needs a more righteous way to traverse the Heterogeneous Information Network Embeddings and generate meaningful node sequences.

Meta-Path Based-Random Walks
To design an effective walking strategy, it is essential to generate meaningful node sequences, that are capable to capture the complex semantics in Heterogeneous Information Network Embeddings.In the literature of Heterogeneous Information Network Embeddings, to characterize the semantic patterns for Heterogeneous Information Network Embeddings [19], the meta-path is an important concept.Hence, to generate the node sequence, meta-path based random walk method proposed.As shown in Figure 2, giving a heterogeneous network, and a meta-path denotes the composite relations between node types A1 and Al [20].We can generate a sampled sequence The following distribution generates the walk path: where v has the type of A t , m t , is the t-th node in the walk, and the first-order neighbor set for node v with the type of The pattern of a me- ta-path will be followed continuously by a walk, until Walk has reached the pre-defined length.

CN_HER Model Citation Recommendation Algorithm
Given a manuscript P, we propose a heterogeneous scientific papers network representation citation recommendation approach, which aims to return top-ranked scientific papers as reference papers by measuring the similarity scores between the manuscript and all the scientific papers in the dataset.In our work, we formulate the manuscript P as a manuscript author a, term t, venue v i.e.

{ }
, , t P a t v = . We suppose P t as a testing paper, P a as an author of the manuscript and all the scientific papers in the dataset P T as training papers.Algorithm 1 Figure 2. A demonstrative example of the proposed meta-path based random walk.We perform random walks directed by some selected meta-paths.
where T P R is the vector representation of training papers, txt P is the vector representation of the manuscript text, V AR is the vector representation of authors

Experimental Evaluation
The aim of this section is to evaluate the performance of the recommendation using the meatapath2vec methodology on the dataset for determining its accuracy and efficiency.

Data Preparation
To evaluate the quality of our proposed model, we conduct experiments on the heterogeneous networks bibliographic dataset: AAN (ACL Anthology Network) dataset, which is established by Radev and Muthukrishnan (2009), it consists of conference papers and journal papers in the field of Natural Language Processing.
In our work, we extract the title and abstract of the papers in the dataset as document content.We remove the papers which have missed titles or abstracts in the dataset.Then, we use the remaining 13,929 papers published from 1975 to 2013 as an experimental dataset.Through the following steps each paper pre-processed 1) extract paper abstract and title; 2) remove the words that consist less than 3 characters; 3) remove stop words.For reducing the noise, we removed those words that seemed less than ten in the dataset.We got a total of 4397 different candidate words.Using a naive approach TF-IDF as an indicator [21], we recognized keywords from the set of candidate words for constructing the paper-word graph.If the TF-IDF of a word was greater than a TF-IDF threshold of 0:03, this word was selected as a keyword.A total of 3804 different keywords were identified in the set of 13,939 papers at a TF-IDF threshold frequency 0:03.For assessment purposes, we divide the entire data set into two separate sets, the papers 2015 are assessed as a training set (12,738 papers) and the remaining papers fall into the testing published before set (1211 papers).

Evaluation Metrics
In order to assess and evaluate recommendation results, we have used Recall [20], precision and NDCG [22].These matrices are used to check the performance of the proposed method in terms of recommendation accuracy and predicted ranks quality.In literature, they are commonly used for evaluating recommendation results.
• Recall.Recall measure the ratio between the number of total cited papers to the number of articles system has presented in a Top-N recommendation list.
The ratio represents the number of hits divided by the size of each user test data.The formula for recall is given as follows: • NDCG.The recommendation results generated by the paper recommender system are sensitive to the positions of the relevant reference papers, using recall we cannot evaluate that intuitively.Therefore, it is indispensable for closely related references to appear higher in the Top-N list.We use (NDCG) to measure the ranked recommendation list.The NDCG value can be calculated using the following formula: calculated as ( ) where r(i) is the rating value of the i-th paper in the ranking list.r(i) = 1 means the article is relevant otherwise r(i) = 0 if it is not.
• Precision.Precision is used to calculate and find the ratio between the relevant papers recommended by the system to the total number of recommended papers.In contrast, recall calculates the relevant articles recommended by the system to total relevant recommendations.Precision measures the exactness of recommendation results generated by the system.Preci-sion value has an inverse relationship with false positive, greater the precision the less is false positive.Trade-off exists between precision and recall scores as there is an inverse relationship between the two, increase in one metric results decrease in another.For establishing the balance, there should be a trade-off between the two as given following: relevant papers retrieved_ papers Precision relevant_papers

Results and Discussions
Evaluating metapath2vec with the various latest network representations, learning methods, DeepWalk [16], and node2vec [10], our original intention to propose the network representation approach consists in hopes to obtain more meaningful vector representation of each vertex in the network, and then perform citation recommendation based on the vector representations of these vertices.For that purpose, so we compare our proposed bibliographic network representation approach with another two network representation approach: DeepWalk, (2) Node2Vec, which learns a mapping of paper vertices to a low dimensional space of features that maximizes the likelihood of preserving paper network neighborhoods of paper vertices.After obtaining network representation with the above different approaches, citation recommendation can then be performed.Which learns paper network representation by utilizing network structure information; the obtained results are given in Figure 3 and Figure 4.
These figures show the results of comparisons supported (Recall) and (Precision) severally.The result obtained in Figure 5 shows that the model has immensely and commonly outperformed the baseline ways for all N recommendations values based on Recall.DeepWalk performs the worst of the three results, whereas for sure, the performance of our model will increase as the amount of N will increase.However, the best results supported Recall are obtained once N = 100 (Top@N).
Then again, the assessment results based on Precision are described in Figure 4.  Between our model and the Node2vec results differences are not much noteworthy.While as estimated, the performance of our model decreases when the number of @N increases.This is because as the number of @N increases, the affinity of retrieving inappropriate results also increases and thereby affecting the cumulative Precision results.Though, both the two approaches considerably outclassed the Deepwalk method.This is because, the two approaches can leverage which learns a mapping of paper vertices to a low dimensional space of features that maximizes the likelihood of preserving paper network neighborhoods of paper vertices, and different from the Deepwalk method which learns paper network representation by utilizing network structure information.Figure 5 respectively represents the results, comparisons based on Ndcg.The Ndcg result of our model is significant over the baseline methods for all N recommendations values.However, the Node2vec approach outperformed the proposed approach when N = 50 (N@50).Our model results start with encouraging results, specifically when N = 25 (N@25), but becomes less noteworthy when the number of recommendations (N) increases.However, both methodologies statistically outperformed the Deep walk method based on Ndcg.
In this section, we have compared our network embeddings method to the two state-of-the-art methods node2vec and deepwalk.These methods proposed,

Conclusion
In this paper, to uncover the semantic and structural information of research articles, we use a heterogeneous network embedding approach called meta-path2vec.The proposed citation-based recommendation approach makes use of heterogeneous network embedding in generating recommendation results.The novelty of this paper is in exploiting the performance of a network embedding approach i.e., matapath2vec to generate paper recommendations.Unlike existing approaches, the proposed method has the capability of learning low-dimensional latent representation of nodes (i.e., research papers) in a network.We performed experiments on real-world dataset showing the effectiveness of our proposed model.Additionally, we examine its capability in resolving various problems, including sparsity, cold start, prediction accuracy and scalability issues, and reveal its improved recommendation results in HINs.

Algorithm 1 .
The algorithm of CN_HER.summarizes the whole process of the heterogeneous scientific papers network representation citation recommendation approach.The similarity scores are represented as,The input to the recommendation system is the word sequence of training papers and testing papers, all the authors, terms, venues and citation papers of the training papers and authors, terms, venues of test papers, as well as the adjacency matrix based on the network.All these papers and, all the authors, terms, venues and citation papers are mapped into vectors based on our proposed CN-HER model.Thus the similarity scores can be calculated as:

V
is the vector representation of the manuscript terms, v P V is the vector representation of the manuscript venues, p P V is the vector representa- tion of the manuscript citation papers.Training papers are ranked according to the similarity scores, the top-ranked ones, are selected as the final recommendation list.
-based network embeddings learning frameworks.However, these methods have thus far focused on homogeneous network representation learning of a single type of nodes and relation types.Our network embeddings method (metapath2vec) performs considerably better heterogeneous network embedding's more than current high-tech methods, as measured by evaluation metrics.The plus point of our proposed method lies in their proper thought of the network heterogeneity challenge the existence of more than one type of nodes and relation types.