Scientific Communities Found Based on the Path Structure of Citation Network

Based on the structure of citation network, the citation paths among papers, and the association strength such as coupling, co-citation and etc. between two papers are defined in this article. We give formulas to quantify the association strength in order to establish citation network model based on the citation path structure. Then, the OPTICS algorithm is brought into the scientific communities found model since it can solve the parameter’s setting problem. This method combines various kinds of path structures together and thus it contains more complete citation network information. Experiments and analysis reveal the reliability and validity of this method.


Introduction
Many researchers have been studying and mining scientific communities using citation network and analysis of the citation network has also become an important tool for their research.Xiaoming Liu employed the analysis method of social network into the co-author network to detect condition variations of the digital libraries and constructed weighted directed citation network to mine the inner scientific communities [1].R. Ichise designed a userinteracted community mining system, in which the values of co-author network parameter can be constructed variably in different needs of users' purposes.On that basis, they added the author-keyword schema and make a good attempt at clarifying research interests of different scientific communities [2].
A large amount of previous research has been done on citation analysis method, however, these studies and experiments are all one-sided, doing survey on certain aspects of the research, instead of a comprehensive vision.Shortcomings are listed as follows: First of all, the comparison criterion between relational model is inconsistent.Although the citation path provides association relationship between papers, there's no uniform criterion to quantify this relationship.Secondly, there's no documental multiple correlation.The majority of previous scientific community found are doing survey in a certain aspect of citation path, co-author, coupling, co-citation, and thus it is incomprehensive.Thirdly, the results of the mining groups are impracticality.This article defines the association strength between authors and the association strength between papers.The formulas of those definitions innovatively combine coupling, co-citation, co-author and other structure to make the system more comprehensive.Besides, the proposed OP-TICS algorithm is not sensitive to parameter setting, and it can use heuristic algorithm to estimate the parameter and get good mining results of a large dataset.

Build of Citation Network Based on the Citation Path Structure
Three common types of citation network structures are coupling, co-citation, co-author.This article uses weighted directed graphic method to build the network model of citation network.Here are some hypotheses: Firstly, assuming there's no fault reference or repeated reference issues between papers.Secondly, association relations between papers can be measured numerically and the tightness of association relations can also be compared by numeric size.
In citation network, a paper is always regarded as a vertex and a relation as an edge, and thus a mesh structure is formed.Then a citation network model based on the citation path can be built according to the below definitions.The first four definitions below can be found in reference [3].
Definition 1) Citation Network Definition 2) DR (Direct Reference) Definition 3) RP (Reference Path) Definition 4) RL (Reference Length) Definition 5) RPR (Reference Path Relation) RPR is described by reference length between two papers.
where there are n citation paths from paper to paper i P j P , and the length of each path is j .The longer RL is, the less RPR is.The more RP is, the more RPR is.Thus, we use cumulative method., PR is accumulated by weighted RPR, BCR and CCR.These three association strengths present the association relations between two papers from different aspects, thus, their weighted value 1  , 2  , 3  should be different.In order to discuss the issue, here we set 1 2

Definition 9) TR (Together write Relation)
If two authors write a paper together, namely co-author, it demonstrates that they share some interrelated research direction.TR is used to represent this kind of association strength.
where n is the number of paper co-authored by i and a j a , t M represents the number of the authors who co-author the t-th paper, the PR of each paper is 2 t M , the overall TR is accumulated by TR of each co-authored paper.

PR P P PR P P AR a a
Year P Year P TR a a where m is the total number of the papers author i has published, n is the total number of the papers author TR a a P to represent their association strength.If  and P  is not the same paper, then   , TR a a 0 i j  , we can also use to represent their association strength.Considering the two parts represent the different association strength, we accumulate them by weights.In order to discuss this issue, here we set OPTICS algorithm is the extension of the DBSCAN algorithm [5].Unlike DBSCAN algorithm which generates clustering result after distributing the objects into different clusters [6], OPTICS algorithm stores a cluster order when each object is being clustered [7].Through the order we can get some basic clustering information such as the center of the cluster and inner cluster structure in the object set of the database.As for each object, these information is made up of two parts: core distance and reachable distance.

Scientific Communities Found Method Based on Citation Path
Definition 11) Core Distance (CD) In the object se of the database O, the core distance of object p is defined as: if object p is the core object,  , then the distance from object p to its Minpts nearest surrounding object is called the core distance of object p, written as core-distance (p), if object p is not the core object, then the core object of p is UN-DEFINED.In fact core-distance (p) is the minimum  to make object p become a core object, or else core-distance (p) is UNDEFINED.Definition 12) Reachable Distance (RD) In the object se of the database O, reachable distance of object p to object q is the larger one of core-distance (q) and the distance from object p and object q, written as reachable-distance (p,q).If q is not the core object, then reachability-distance (p,q) is UNDEFINED.In fact, reachable-distance (p,q) is the minimum direct-density reachable distance from object p to object q.Thus, this reachable distance cannot be smaller than the core-distance (q), or else there's no direct density-reachable object to object q.The reachable distance of object p to core object is related to the core distance of the core object.The above two definitions are as shown as Figure 3.Given the setting parameters Minpts and the initial distance  , the procedure of OPTICS clustering algorithm can be described as follows: Step1: Select an unclustered object p from the object set O of the database, and mark this object as clustered.
Step2: According to the parameter  , get the object p's neighbor object set, and set core-distance (p) and reachabledistance (q) according to the neighbor set.
Step3: If core-distance (p) is UNDEFINED, namely this object is the core object, then add neighbors of object p into the priority queue.
Step4: If the priority queue is empty, then go on to the Step 5, or else select the minimum reachability-distance object q in the queue, get the neighbor set of object q, set the core-distance (q) and reachable-distance (q) according to the neighbor set.If object q is the core object, then add the neighbors of object q into the priority queue, repeat Step 4.
Step 5: If all the objects in object set O have been marked as clustered, then go to the Step 6, or else to Step 1.
Step 6: Output the sorting result and clustering result according to the core distance and reachable distance of each object, then end this algorithm.
The major advantage of OPTICS algorithms is: it's not sensitive to the parameters Minpts and  , that is, Minpts and  do not influence the distribution of the reachable distance, the parameters can be selected randomly at a large scale without affecting the distribution of reachable distance and the inner clustering structure of object set.[8].Unlike DBSCAN algorithm, OPTICS algorithm can employ heuristic to estimate parameter  and Minpts according to experience [9].

Design of Scientific Communities Found System
The System of scientific communities found based on citation network contains three independent parts: data preprocessing sub-system, communities mining sub-system, graphic demonstration sub-system.Shown as Fig- ure 4, these three sub-systems can be divided into several independent function modules.Through transferring data, the three sub-systems are combined into a complex system.

Experiment and Analysis
In the experiment, SMARTY + PHP + MYSQL scheme is used to construct the model, in which SMARTY template is used for visualization display, PHP is used for processing data and controlling the procedure, and MYSQL is used for data storage.Initial input data of the experiment come from the paper set of Wanfang Date (http://www.Wanfangdata.com.cn/).These initial data is acquired in HTML source code format, and processed by PHD program, to extract title, abstract keyword, author, references, publish date and other information of each paper.These information are stored in the MYSQL database.
Here, information of 22 papers and 31 authors is extracted from the original database, as shown as Table 1 and Table 2.We can construct the citation network, in which each vertex represents each paper, named by the paper_id, and each edge represents the citation relationship between two papers.
Based on the citation network structure shown in Fig- ure 5, we use Formular (1) to compute RPR between any two papers and get the RPR matrix of citation network.Similarly, we use Formulas ( 2) and (3) to compute the BCR matrix and CCR matrix [10].Then according the Formula (4), we can get PR matrix.Furthermore, we can get AR matrix according Formula (5) and Formula (6), as shown as Figure 6.After computing the AR matrix.we can get the distance between any two authors and cluster the author objects.

Conclusions
This article researched on the structure of the citation network based on the citation path [11], defined the association relations between papers to transfer the scientific communities found model in the database to a clustering issue of points in the space.Besides, by using OPTICS algorithm, there's no need for users to concern about the selection of the parameters and clusters of any shape can be found effectively in a more practical way [12].In order to test and analyze conveniently, the number of selected papers is not large and all of them are in Chinese.This article hypothesis there's no fault reference or repeated reference, while in fact, these problems are quite likely to happen.So our future work is to improve the model, doing research on how to deal with those issues and filter the citation path accurately.
CCR (Co-citation Relation) CCR is described by the reference paths from the cocitation paper to the two papers respectively.the number of co-citation papers of i and P j P , P  represents the th  paper they co-citation,

3. 1 . 2 where 1 , 2 
OPTICS Clustering AlgorithmDBSCAN algorithm is sensitive to the parameters of  and Minpts[4].Even a slight different setting may lead to a different clustering result.Setting of the two parameters requires experience.As for the high-dimensional data, setting appropriate parameters is challenging.In order to solve the shortcomings of DBSCAN algorithm, this article brings in OPTICS[5] (Ordering Points to Identify the Clustering Structure) clustering algorithm [4].OPTICS algorithm is based on the fact that low-density clusterings contain high-density clusterings, as shown in the Figure is the distance parameter of clustering Cis the distance parameter from clusterincludes cluster-ing C1 and clustering C2 completely.

Figure 2 .
Figure 2. Global distance and local distance.

Minpts = 4 Figure 3 .
Figure 3. Core distance of object q and reachable distance of object q to object p.
Yea  separately represent the year of the published paper P  and P  .If published, P  is the th  paper author i has