Discovering the hierarchically ordered contexts of publications is the main task in the context-based searching paradigm. Existing techniques for discovering publication contexts rely on the availability of domain-specific inputs, namely pre-specified ontology terms. A problem with these techniques is that the needed domain-specific inputs may not be available in some scientific disciplines. In this paper, we propose utilizing a powerful input that is naturally available in any scientific discipline to discover its hierarchically ordered contexts, namely paper citation and co-authorship graphs. More specifically, we propose a set of domain-specific, bibliometry-aware features that are automatically computable, in place of domain-specific inputs that require experts’ effort to prepare. Another benefit of considering bibliometric features is that they adapt to the special characteristics of the literature environment being targeted, which in turn facilitates context-membership decisions. One key advantage of our proposal is that it considers temporal changes in the targeted publication set.

In this paper, we aim at enhancing the accuracy of search results, i.e., finding publications relevant to a given keyword query, by better capturing the notion of “publication importance”. Due to the vast amount of literature in all disciplines, keyword-based searching of digital libraries usually returns a large number of relevant publications. User studies show that users usually view only the first few results before rewording the keywords to obtain more relevant documents [

Despite their relative success in web search engines, link-based ranking approaches (or citation-based ranking, in the case of publications) did not find acceptance in ranking publications for digital libraries [

Most of the well-known digital libraries, like ACM Portal [

The text-based relevancy score only, e.g., ACM Portal.

Text-based relevancy and citation-based scores e.g., Google Scholar.

The pre-assigned document ID as the case in PubMed.

Practically, ranking publications in terms of citation-based scores faces accuracy-related problems that, if solved, will make it a standard in digital libraries design [

In this paper, we address the problem of ranking publications and propose techniques that help toward better ranking publications within hierarchically-ordered contexts. We start with an example that illustrates a problem that we refer to as the global ranking bias. After that, we illustrate the need for assigning publications to contexts to obtain scores that are more accurate and that considerably reduce the global ranking bias effect.

Utilization of citation networks is a common starting point among the proposed publication scoring measures [

Example 1:

pushed them up in the result set. On the other hand, the low citation-based scores of the matches reached next (

The scope of the ranking measure may result in comparing publications from new, rapidly emerging subfields with publications from overlapping, established subfields. The problem may be more severe for digital libraries that contain publications from different sciences, such as biochemistry and biology, as is the case in PubMed.

Therefore, we propose that each paper should be evaluated in terms of importance by taking into account its context and the characteristics of the citation graph of its context(s) [

The searching paradigm proposed in [

Our approach to discovering paper contexts has two stages. The first captures the communities of the authors in the target publication set; its output is used in the second stage. An author community is a system of scientists or scientist-units interacting frequently about shared topic(s) of research interest [

To rank publications within a context, we may imitate what HITS does in the web domain [

1) Research domains of papers may overlap in most of the cases. One cannot put a clear-cut boundary when separating papers into subdomains.

2) Users are usually sensitive to time and efforts spent on finding information [

We consider the different graph structures that can be inferred from the targeted publication set to locate paper contexts and to rank each paper within its candidate context(s). Examples of such graphs are paper citation graphs, author co-authorship graphs and author citation graphs. Paper contexts can be kept large or small depending on the application type. We also propose a technique to find paper contexts of optimal or reasonable size. Our main contributions are as follows. We propose:

a) A set of author-author and paper-paper similarity/distance measures.

b) A set of bibliometric features that can be captured from the targeted publication set.

To evaluate the numerical distribution of the proposed feature formulas, we use three sets of publications. The first is from the computer science field (around 87,000 articles selected from ACM, IEEE and VLDB; we refer to this set as the CS set). The second is from the genomics area in the life sciences (around 72,000 articles selected from PubMed; we refer to this set as the LS set). The third is from data management (around 15,000 articles from the ACM Anthology; we refer to this set as the DM set). These articles were crawled, downloaded and parsed.

Current ranking implementations assume a large community of papers that can be scored using the same citation infrastructure. This leads to the global ranking bias. Motivated by the fact that the citation relationship between papers gives a better clue to paper-paper similarity than text-based similarity does, we automatically discover paper contexts and organize the discovered clusters into a proper hierarchical order.

Assigning papers to contexts helps in enhancing search performance through better capturing their importance [

Classical document clustering techniques use a document’s features (words) to measure similarity between documents. In [

As an intermediate step in discovering paper contexts, we capture the scholarly-communication structure of the paper set in order to discover author communities. An author community is a set of authors who work in a common research domain.

Studying author communities helps:

1) Understanding the growth patterns of scholarly communication in different science disciplines, i.e. computer science, data management and medicine,

2) Discovering the relationships among research areas [

One issue is the variance in cluster densities, as well as in other network infrastructure properties, which makes cluster membership decisions hard to make. The network infrastructure of citation and co-authorship graphs is the main concern of bibliometrics. The goal of bibliometrics is to study the process of written communication and the nature of development of different disciplines [

We use three sets of publications to study the numerical distribution of the proposed features, namely the (D)ata (M)anagement set, the (L)ife (S)ciences set and the (C)omputer (S)cience set. The DM set is a collection of around 15,000 publications from the data management field. The CS set is a collection of around 87,000 publications from computer science fields; thus, the CS set is more heterogeneous than the DM set. The LS set is a collection of around 72,000 publications from the genomics area; thus, it is homogeneous like the DM set.

The three paper sets were parsed, and three databases of the information extracted from them were created.

Observation 1: the number of publications per year parameter is steadier in the DM field than in the CS and LS sets.

Observation 2: the number of publications per year grows significantly faster after 1985 in the CS and LS fields.

In this section, we present a number of bibliometric features that can be utilized to make context membership decisions and to compute similarity/distance scores between papers and between authors.

In this section, we present the bibliometric features that can be extracted from the paper-paper citation curve. We will use the curves and measures presented later to discover paper contexts and author communities.

Disciplines vary in their nature and rate of development. To capture these two bibliometric features, we define the citation age curve. We define the age of a citation $C_{P_1 \to P_2}$ from paper $P_1$ to paper $P_2$ as the absolute difference between the publication years of $P_1$ and $P_2$. The citation age distribution graph plots citation age values against the frequency of these values.

Observation 1: In the life sciences, authors tend to cite more up-to-date publications than authors in the data management field do.
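The citation age distribution described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the input layout (a list of citing/cited id pairs plus a year lookup) and the choice of relative rather than absolute frequencies are assumptions, and the toy data is hypothetical.

```python
from collections import Counter


def citation_age_distribution(citations, pub_year):
    """Map each citation age (in years) to its relative frequency.

    citations: iterable of (citing_paper, cited_paper) id pairs.
    pub_year:  dict mapping paper id -> publication year.
    """
    # Age of a citation = absolute difference of the two publication years.
    ages = [abs(pub_year[src] - pub_year[dst]) for src, dst in citations]
    counts = Counter(ages)
    total = len(ages)
    return {age: n / total for age, n in sorted(counts.items())}


# Hypothetical toy data, not taken from the DM, CS or LS sets.
pub_year = {"p1": 2005, "p2": 2003, "p3": 2003, "p4": 1999}
citations = [("p1", "p2"), ("p1", "p3"), ("p1", "p4"), ("p2", "p4")]
age_dist = citation_age_distribution(citations, pub_year)
print(age_dist)  # {2: 0.5, 4: 0.25, 6: 0.25}
```

A distribution concentrated on small ages would correspond to a discipline whose authors cite recent work.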

We may also benefit from the self-citation behavior of authors. Self-citation refers to the tendency of authors to cite their own work. One possible measure of the self-citation tendency of author A is the percentage of self-citations in A’s writings, computed as $SC_A(A) = P_{A \to A} / P_A$, where $P_{A \to A}$ is the number of A’s papers in which A cites his own work and $P_A$ is the total number of A’s papers.

Observation 2: Life scientists have a stronger tendency to cite their own previous work than data management scientists do.
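As a rough illustration of the self-citation measure, the sketch below counts the share of an author's papers that cite at least one other paper by the same author. The data layout (`papers_by_author`, `cites`) and the decision to ignore a paper citing itself are assumptions made for the example.

```python
def self_citation_ratio(author, papers_by_author, cites):
    """Share of the author's papers that cite another paper by the same author.

    papers_by_author: dict author -> set of paper ids.
    cites:            dict paper id -> set of cited paper ids.
    """
    own = papers_by_author[author]
    if not own:
        return 0.0
    # A paper is self-citing if it cites one of the author's other papers.
    self_citing = [p for p in own if (cites.get(p, set()) & own) - {p}]
    return len(self_citing) / len(own)


# Hypothetical toy data: author A wrote a1..a4; x9 is someone else's paper.
papers_by_author = {"A": {"a1", "a2", "a3", "a4"}}
cites = {"a2": {"a1", "x9"}, "a3": {"x9"}, "a4": {"a1", "a2"}}
sc_a = self_citation_ratio("A", papers_by_author, cites)
print(sc_a)  # 0.5
```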

Depending on the rate of growth of technology, and the need to publish papers rapidly in active research areas, authors tend to work jointly. The tendency to work jointly, or collaborative tendency, may vary from one discipline to another. One possible measure of the collaborative tendency of author A is the size of A’s collaboration group $CG(A)$. We define the collaboration group of A as the set of all authors with whom A has ever published a paper.

Observation 3: LS researchers tend to have larger collaboration groups than CS and DM researchers.

Members of an author’s collaboration group may vary in collaboration levels. We define the collaboration level of author B in author A’s collaboration group, $Cl(B, A)$, as the ratio between the number of joint publications of A and B, $P_{A,B}$, and the total number of A’s publications, $P_A$, i.e., $Cl(B, A) = P_{A,B} / P_A$.
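The collaboration group and the collaboration level can be computed directly from a paper-to-authors mapping. The following is a minimal sketch under that assumption; the names `authors_of`, `collaboration_group` and `collaboration_level` are hypothetical.

```python
def collaboration_group(author, authors_of):
    """CG(A): every author who shares at least one paper with `author`."""
    group = set()
    for authors in authors_of.values():
        if author in authors:
            group |= authors
    group.discard(author)
    return group


def collaboration_level(b, a, authors_of):
    """Cl(B, A) = (papers written by A and B together) / (papers written by A)."""
    papers_a = [p for p, authors in authors_of.items() if a in authors]
    if not papers_a:
        return 0.0
    joint = [p for p in papers_a if b in authors_of[p]]
    return len(joint) / len(papers_a)


# Hypothetical toy data: paper id -> set of authors.
authors_of = {
    "p1": {"A", "B"},
    "p2": {"A", "B", "C"},
    "p3": {"A"},
    "p4": {"B", "D"},
}
cg_a = collaboration_group("A", authors_of)
cl_b_in_a = collaboration_level("B", "A", authors_of)  # 2 of A's 3 papers
```

Note that the level is not symmetric in general: it is normalized by the publication count of the author whose group is being examined.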

We may go further and define the Collaboration Level Distribution curve as shown in

Observation 4: The DM set showed the highest collaboration levels; the CS set comes next, and the LS set is the lowest.

One bibliometric feature that may vary from one discipline to another is the productivity level of authors. One possible indicator of the productivity level of authors is the publishing frequency curve. The publishing frequency curve of author A is defined as the distribution of time spans between A’s consecutive publications. The time span between consecutive publications $P_1$ and $P_2$ of author A is computed as the absolute difference between the publication years of $P_1$ and $P_2$. Short time spans between A’s publications indicate a high productivity level.
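A publishing frequency curve can be approximated as a histogram of year gaps. The sketch below assumes the only input is the list of an author's publication years; the function name is hypothetical.

```python
from collections import Counter


def publishing_frequency_curve(years):
    """Histogram of gaps (in years) between consecutive publications."""
    ordered = sorted(years)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return Counter(gaps)


# Hypothetical publication years of one author.
curve = publishing_frequency_curve([2001, 2001, 2002, 2005, 2006])
print(dict(sorted(curve.items())))  # {0: 1, 1: 2, 3: 1}
```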

If two authors have published papers in common, then they probably work in the same research area and thus belong to the same community. Assume that authors A and B, who have published $|P_A|$ and $|P_B|$ papers respectively, have published $|P_A \cap P_B|$ papers in common; then they probably belong to the same community $C$, or $(A, B) \in C$. The probability $P((A, B) \in C)$ that these two authors belong to the same community is directly proportional to the percentage of common papers (PCP) between A and B, computed according to the following basic formula:

$$P((A, B) \in C) \propto PCP(A, B) = \frac{|P_A \cap P_B|}{|P_A \cup P_B|} \tag{1}$$

To check how unusual the PCP between two particular authors is, or to say how significant the PCP value is, we prepare the PCP distribution as shown in

We observe two types of collaborative couples in any publication set. One involves an advisor and a student, or an advisor-student couple. The other involves an author and a colleague, or a colleague-colleague couple. The advisor-student collaboration usually involves an unbalanced relationship, i.e., the common papers of the student and the advisor are all of the student’s papers, while they form only a subset of the advisor’s papers. In the case of a colleague-colleague pair, the collaborative relationship may also be unbalanced, but usually not perfectly so.

To capture the unbalanced relationship of the advisor-student and colleague-colleague pairs, we define the Single-Sided PCP, or SSPCP, between authors A and B, once from A’s perspective and once from B’s perspective. The SSPCP from A’s perspective can be computed as

$$SSPCP_A(A, B) = \frac{|P_A \cap P_B|}{|P_A|} \tag{2}$$

Similarly, we can compute $SSPCP_B(A, B)$ as

$$SSPCP_B(A, B) = \frac{|P_A \cap P_B|}{|P_B|}$$
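Formulas (1) and (2) translate almost directly into set operations. The sketch below assumes each author is represented by the set of ids of his papers; the function names and the toy advisor-student pair are hypothetical.

```python
def pcp(papers_a, papers_b):
    """Formula (1): common papers over the union of both authors' papers."""
    union = papers_a | papers_b
    return len(papers_a & papers_b) / len(union) if union else 0.0


def sspcp(papers_own, papers_other):
    """Formula (2): common papers over the papers of the first author only."""
    return len(papers_own & papers_other) / len(papers_own) if papers_own else 0.0


# Hypothetical advisor-student couple: all student papers are joint papers.
student = {"s1", "s2"}
advisor = {"s1", "s2", "v1", "v2", "v3", "v4", "v5", "v6"}

print(pcp(student, advisor))    # 0.25
print(sspcp(student, advisor))  # 1.0  (student's perspective)
print(sspcp(advisor, student))  # 0.25 (advisor's perspective)
```

The perfect score from the student's side combined with the low score from the advisor's side is the unbalanced pattern that the single-sided measure is meant to expose.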

Formula (2) suggests that a perfect or nearly perfect $SSPCP_A(A, B)$ score together with a low $SSPCP_B(A, B)$ score indicates that A and B form an advisor-student-like couple, with A being the student and B being the advisor. It also indicates the following:

1) B belongs to more than one community with different probabilities.

2) The probability that A belongs to one (or more) of B’s candidate communities is very high.

3) A alone may not help us decide to which community B most strongly belongs.

On the other hand, Formula (2) suggests that when the difference between the $SSPCP_A(A, B)$ and $SSPCP_B(A, B)$ scores is less than a certain threshold $\alpha$, this difference gives a clue as to how likely authors A and B are to belong to the same community. Still, A alone may not help us decide to which community B most strongly belongs, or vice versa. We observed that $\alpha = 0.5$ in the three publication sets.

To illustrate more, we discuss three possible scenarios that may occur. The scenarios are presented in the following table:

From

The first is the area where PCP and SSPCP are near perfect. Most of the author couples that lie within this area are of the advisor-student type. Notice that in the DM field, more research is conducted in the advisor-student setting, while in the LS field, research is conducted in a variety of other settings; for example, research in LS involves lab technicians and clinicians. This maps to the $|P_A| \gg |P_B|$ case in the above table.

The second is just in the middle, where the PCP and SSPCP values equal 0.5. This PCP/SSPCP value occurs when the number of common papers is half the total number of papers of both authors or of one of the authors. This maps to the $|P_A| \cong |P_B|$ case in the above table.

The third shows the widest distribution of PCP and SSPCP, over the interval [0, 0.3]. This maps to the $|P_A| > |P_B|$ case in the above table.

We notice that when the difference between the SSPCP scores of an author couple is less than 0.5, we can safely use SSPCP as an indicator of how likely A and B are to belong to the same community. However, in an advisor-student case, we need to consider the unbalanced relationship between the two authors when computing the final PCP score.

One question that is left is how to compute the final PCP score of authors A and B from the $SSPCP_A(A, B)$ and $SSPCP_B(A, B)$ scores.

We may think of the relationship between authors A and B as a two-dimensional relationship. The strength of this relationship is determined by combining the significance of the SSPCP values of the two authors.

The significance of an SSPCP value, or $\mathrm{Sig}(SSPCP_{A \text{ or } B}(A, B))$, can be computed based on a set of mapping functions:

The Raw SSPCP Value

In this approach, we use the SSPCP score as it is; the higher the SSPCP becomes, the closer the authors are to each other, i.e.

$$\mathrm{Sig}(SSPCP_{A \text{ or } B}(A, B)) = SSPCP_{A \text{ or } B}(A, B) \tag{3}$$

A problem with this approach is that it does not explicitly consider the bibliometric features of the publication set.

Frequency of SSPCP Value

The frequency of observing the value of SSPCP in the publication set, or $f(SSPCP_{A \text{ or } B}(A, B))$, can be used to infer the significance of that value, i.e.

$$\mathrm{Sig}(SSPCP_{A \text{ or } B}(A, B)) = f(SSPCP_{A \text{ or } B}(A, B)) \tag{4}$$

The motivation here is that scores that rarely occur are not informative. In this case, the significance of SSPCP values within the intervals [0.35, 0.5) and (0.5, 1) will be almost zero. This measure suggests that rarer SSPCP values are less significant than common ones.

The P-Value of SSPCP Score

The P-value of a score $v$ measures the probability of the following random event: “when randomly selecting an author couple A and B from the publication set, $SSPCP_A(A, B)$ is observed to be $v$ or higher”, i.e.

$$\mathrm{Sig}(x = v) = \int_{x = v}^{\infty} f(x)\, dx \tag{5}$$

where $x$ is a dummy variable that represents the SSPCP values and $f(x)$ is the frequency of observing $x$ in the publication set.
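On a finite publication set, the integral in Formula (5) can be approximated by an empirical tail fraction. The sketch below makes that substitution explicitly (a discrete sum instead of an integral), which is an assumption of the example; the function name and the observed values are hypothetical.

```python
def p_value_significance(v, observed):
    """Empirical share of observed SSPCP values that are >= v."""
    if not observed:
        raise ValueError("at least one observed SSPCP value is required")
    return sum(1 for x in observed if x >= v) / len(observed)


# Hypothetical SSPCP values observed over all author couples.
observed = [0.1, 0.1, 0.2, 0.5, 0.5, 1.0, 1.0, 1.0]

print(p_value_significance(0.5, observed))  # 0.625
print(p_value_significance(1.0, observed))  # 0.375
```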

Note: This measure is very useful when the distribution of the measure we target (in this case, SSPCP) follows the Zipf distribution.

The Z Score of SSPCP Value

One technique to isolate extreme scores and reduce their effect on the distribution is to compute the Z scores. We use the following Z score formula from [

$$Z(v) = \frac{v - m_{SSPCP}}{S_{SSPCP}} \tag{6}$$

where $m_{SSPCP}$ is the mean of the observed SSPCP values, and $S_{SSPCP}$ is the mean absolute deviation, which is defined as follows:

$$S_{SSPCP} = \frac{1}{n} \sum_{x_i \in [SSPCP]} |x_i - m_{SSPCP}|$$

where $[SSPCP]$ is the vector of all observed SSPCP values and $n$ is its length.
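The Z score of Formula (6) is straightforward to compute once the mean absolute deviation is available. The sketch below assumes the deviation is taken in absolute value, as the name "mean absolute deviation" implies; returning 0 when all observed values are identical is an additional assumption of the example.

```python
def z_score_mad(v, observed):
    """Z score of v using the mean absolute deviation as the scale."""
    n = len(observed)
    mean = sum(observed) / n
    mad = sum(abs(x - mean) for x in observed) / n
    if mad == 0:
        return 0.0  # assumption: no spread means no value stands out
    return (v - mean) / mad


# Hypothetical observed SSPCP values: mean 0.5, mean absolute deviation 0.25.
observed = [0.0, 0.5, 0.5, 1.0]
print(z_score_mad(1.0, observed))  # 2.0
print(z_score_mad(0.5, observed))  # 0.0
```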

We now return to the question of how to combine the two SSPCP scores into a single PCP score. One possible way to compute $P((A, B) \in C)$ is according to the Pythagorean theorem, i.e.

$$P((A, B) \in C) = \frac{\sqrt{\mathrm{Sig}(SSPCP_A(A, B))^2 + \mathrm{Sig}(SSPCP_B(A, B))^2}}{\sqrt{2}} \tag{7}$$

The $\sqrt{2}$ in the denominator is a normalizing factor: it is the value the numerator takes when both SSPCP significances are perfect (= 1).
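A small sketch of the Pythagorean combination follows. It reads the flattened formula as the length of the hypotenuse divided by $\sqrt{2}$, so that two perfect significances yield a score of 1; that reading, like the function name, is an assumption of this example rather than something the text states unambiguously.

```python
import math


def combine_pythagorean(sig_a, sig_b):
    """Hypotenuse of the two significances, scaled so that (1, 1) maps to 1."""
    return math.hypot(sig_a, sig_b) / math.sqrt(2)


balanced = combine_pythagorean(1.0, 1.0)     # both perspectives perfect
unbalanced = combine_pythagorean(1.0, 0.25)  # advisor-student-like couple
print(round(balanced, 6), round(unbalanced, 6))
```

Under this reading an unbalanced couple still scores well above zero, which is one motivation for the angle-based refinement discussed next.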

One problem with relying on co-authorship only is that two authors from different disciplines may have common papers. As an example, a database researcher may write a joint work in bioinformatics with a professor in the medical school. A statistician may publish a joint paper with a researcher in nursing or in other disciplines where statistical analysis is needed. One way to reduce the effect of this problem is to consider what we refer to as the angle between authors.

To illustrate the concept of the angle between authors, we discuss one possible way to measure the angle between authors A and B, which utilizes the citation relationships between them. Denote the expressions $\mathrm{Sig}(SSPCP_A(A, B))$, $\mathrm{Sig}(SSPCP_B(A, B))$ and $\sqrt{\mathrm{Sig}(SSPCP_A(A, B))^2 + \mathrm{Sig}(SSPCP_B(A, B))^2}$ by $A^\circ$, $B^\circ$ and $C^\circ$ respectively. The expression $C^\circ$ is nothing but the length of the third edge, opposite to the right angle, as shown in

If authors A and B are coauthors on a subset of their publications and cite each other’s works relatively frequently, then they more likely belong to the same community. In this case, the angle between the edges $A^\circ$ and $B^\circ$ will be small and $C^\circ$ will be long, indicating a higher probability that A and B belong to the same community (see

On the other hand, if authors A and B are coauthors on a subset of their publications and cite each other’s works relatively rarely, then they more likely belong to two different communities. In this case, the angle between the edges $A^\circ$ and $B^\circ$ will be large and $C^\circ$ will be short, indicating a lower probability that A and B belong to the same community (see

Consequently, Formula (7) can be rewritten as follows, where $f_{PCP}(\cdot)$ denotes the significance mapping applied to an SSPCP value:

$$P((A, B) \in C) = \frac{\sqrt{f_{PCP}(SSPCP_A(A, B))^2 + f_{PCP}(SSPCP_B(A, B))^2 + 2 \cdot f_{PCP}(SSPCP_A(A, B)) \cdot f_{PCP}(SSPCP_B(A, B)) \cdot \cos\theta_{A,B}}}{2} \tag{8}$$

The number 2 in the denominator is a normalizing factor: when both SSPCP values are perfect (= 1) and the angle $\theta_{A,B}$ is 0, the final score is 1. Based on the above discussion, we propose the following basic formula to compute $\theta_{A,B}$:

$$\theta_{A,B} = \left(1 - \max\left(\frac{|CS(A) \cap P_B|}{|P_B|}, \frac{|CS(B) \cap P_A|}{|P_A|}\right)\right) \cdot \pi \tag{9}$$

where $CS(A)$ is the citation space (CS) of A, i.e., the set of papers that A cites in his work ($CS(B)$ is defined similarly), and $|CS(A) \cap P_B|$ is the number of papers written by B that are cited by A.

We notice that $\theta_{A,B}$ ranges between 0, in the case of perfect relatedness between A and B, and $\pi$, when no citation relationship is observed between A and B.
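The interplay between the angle and the combined score can be sketched as follows. Two assumptions are made so that the code matches the behavior described in the text: the angle is taken as one minus the larger citation share, times $\pi$ (so that it is 0 for perfect relatedness and $\pi$ when there is no citation relationship), and Formula (8) is read with a square root over the numerator (so that perfect inputs give a score of 1). All names are hypothetical.

```python
import math


def citation_angle(cs_a, cs_b, papers_a, papers_b):
    """Angle in [0, pi]; 0 means one author cites all of the other's papers.

    Assumes both paper sets are non-empty.
    """
    share_ab = len(cs_a & papers_b) / len(papers_b)  # B's papers cited by A
    share_ba = len(cs_b & papers_a) / len(papers_a)  # A's papers cited by B
    return (1 - max(share_ab, share_ba)) * math.pi


def combine_with_angle(sig_a, sig_b, theta):
    """Length of the sum of two edges separated by angle theta, divided by 2."""
    inner = sig_a**2 + sig_b**2 + 2 * sig_a * sig_b * math.cos(theta)
    return math.sqrt(max(inner, 0.0)) / 2  # clamp guards against rounding


# Hypothetical couple: A cites every paper of B, B never cites A.
theta_close = citation_angle({"b1", "b2"}, set(), {"a1"}, {"b1", "b2"})
# Hypothetical couple with no citation exchange at all.
theta_far = citation_angle(set(), set(), {"a1"}, {"b1"})

print(combine_with_angle(1.0, 1.0, theta_close))  # 1.0
print(combine_with_angle(1.0, 1.0, theta_far))    # 0.0
```

With these readings, two coauthors who never cite each other receive a score of 0 even when both single-sided scores are perfect.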

We may also consider the age of citations between authors A and B. One way to do this is to utilize the citation age factor $r_{c\text{-}age}$, which we define in the next subsection:

$$\theta_{A,B} = \left(1 - \max\left(r_{c\text{-}age}(A \to B), r_{c\text{-}age}(A \leftarrow B)\right)\right) \cdot \pi \tag{10}$$

Other ways to measure the angle between authors A and B are:

The Relative Distance Based on the SSPCP Vectors of the Publication Set

For any author couple A and B, the higher the difference between $SSPCP_A(A, B)$ and $SSPCP_B(A, B)$ becomes, the lower the probability that A and B belong to the same community.

The relative distance between $SSPCP_A(A, B)$ and $SSPCP_B(A, B)$ is computed as follows:

$$REDist_{SSPCP}(A, B) = \frac{|SSPCP_A(A, B) - SSPCP_B(A, B)|}{\mathrm{EuclideanDistance}([SSPCP_A], [SSPCP_B]) \,/\, |[SSPCP_A]|} \cdot \pi \tag{11}$$

where $\mathrm{EuclideanDistance}([SSPCP_A], [SSPCP_B])$ is the Euclidean distance between the vector of all observed SSPCP values from A’s perspective, $[SSPCP_A]$, and from B’s perspective, $[SSPCP_B]$. We divide it by $|[SSPCP_A]|$, which represents the number of author couples in either of the SSPCP vectors.

Formula (11) suggests that, as $|SSPCP_A(A, B) - SSPCP_B(A, B)|$ increases, Formula (8) becomes less likely to be a good clue as to how related authors A and B are to each other, and it is thus given less weight.

Citation Exchange between A and B

We may use the citation exchange between A and B as presented in Formulas (9) and (10).

Citation Space Difference between A and B

As stated before, the citation space of an author A is the set of papers that A cites in his publications. To compute the distance between A and B, we consider the citations of the papers that are not common between A and B. A basic formula to compute the angle between authors A and B based on the citation space difference is:

$$CitSD(A, B) = \frac{|\{CS(A) \cap CS(B)\} - C(P_{A,B})|}{|\{CS(A) \cup CS(B)\} - C(P_{A,B})|} \cdot \pi \tag{12}$$

where:

$CS(A) \cap CS(B)$ is the overlap between the citation spaces of A and B.

$C(P_{A,B})$ is the set of citations of the common papers of A and B (excluded).

$CS(A) \cup CS(B)$ is the total set of citations from the citation spaces of both A and B.

The reason for excluding the citations of the common publications between the two authors is to identify authors who belong to different communities like the case of a researcher from the computer science domain publishing a paper with a researcher from the biomedical science domain when the paper is dealing with a topic from bioinformatics. Excluding the citations of the common papers of bioinformatics, we expect that the computer science researcher cites different papers than those cited by the biomedical specialist.
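Formula (12) is a ratio of two set sizes and can be sketched with plain Python sets. The example follows the formula as printed; the function name, the argument layout and the toy reference ids are assumptions.

```python
import math


def citation_space_difference(cs_a, cs_b, common_paper_citations):
    """Formula (12): shared citations over all citations, times pi.

    Citations made by the authors' joint papers are removed from both sets.
    """
    shared = (cs_a & cs_b) - common_paper_citations
    total = (cs_a | cs_b) - common_paper_citations
    if not total:
        return 0.0
    return len(shared) / len(total) * math.pi


# Hypothetical citation spaces; r1 is cited only through the joint paper.
cs_a = {"r1", "r2", "r3", "r4"}
cs_b = {"r1", "r2", "r5", "r6"}
value = citation_space_difference(cs_a, cs_b, {"r1"})
print(value / math.pi)  # share of remaining citations that both authors use
```

Here one of the five remaining citations is shared, so the ratio before scaling by $\pi$ is 1/5.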

We may weigh a citation $c$ according to how many times $c$ appears in the citation space of the author, as follows:

$$CitSD(A, B) = \frac{\sum_{C_i \in [\{CS(A) \cap CS(B)\} - C(P_{A,B})]} w(C_i)}{\sum_{C_i \in (\{CS(A) \cup CS(B)\} - C(P_{A,B}))} w(C_i)} \cdot \pi \tag{13}$$

where $[\{CS(A) \cap CS(B)\} - C(P_{A,B})]$ is the set of common citations between the citation spaces of A and B, excluding the citations of the common publications of A and B, and $\{CS(A) \cup CS(B)\}$ is the set of all citations in the citation spaces of A and B.

Second Level of Collaborative Set Difference

The second-level collaborative set of author A is defined as the union of the collaborative sets of all authors who have collaborated with A. We may use this measure to identify authors who belong to different communities but still have common publications. A basic formula to measure this parameter is:

$$L2ColSD(A, B) = \frac{|\{L2ColS(A) \cap L2ColS(B)\} - [L1ColS(A) \cap L1ColS(B)]|}{|\{L2ColS(A) \cup L2ColS(B)\} - [L1ColS(A) \cap L1ColS(B)]|} \cdot \pi \tag{14}$$

where:

$L2ColS(A)$ and $L1ColS(A)$ are the second-level and first-level collaboration sets of A.

$\{L2ColS(A) \cap L2ColS(B)\} - [L1ColS(A) \cap L1ColS(B)]$ is the set of common authors between $L2ColS(A)$ and $L2ColS(B)$, excluding the common authors from the first level.

We may also weigh a second-level author $x$ in the collaboration set of A by the number of common publications between $x$ and the first-level author(s), as follows:

$$L2ColSD(A, B) = \frac{\sum w\left[L2ColS(A) \cap L2ColS(B) - [L1ColS(A) \cap L1ColS(B)]\right]}{\sum w\left[L2ColS(A) \cup L2ColS(B) - [L1ColS(A) \cap L1ColS(B)]\right]} \cdot \pi \tag{16}$$

Another problem is that relying on the co-authorship relationship prevents discovering authors who belong to the same community but have no common publications. To overcome this problem, we utilize another relationship, one based on citations between authors. Details are presented in the next subsection.

If two authors directly or indirectly cite each other’s works, then probably these two authors belong to the same community.

One possible measure of the citation relationship strength between authors A and B is the Bidirectional Citation Bandwidth ($C_{2BW}$). The bidirectional citation bandwidth between authors A and B is defined, from A’s perspective, as the ratio of the citation exchange between A and B (from publications of A to those of B and vice versa) to the total citation exchange between A’s work and the work of all other authors citing or cited by A’s work. $C_{2BW}(A, B)$ is computed as follows:

$$C_{2BW}(A, B) = \frac{C_{A \to B} + C_{B \to A}}{C_{A \to} + C_{\to A}} \tag{17}$$

where $C_{A \to B}$ and $C_{B \to A}$ are the citations from A’s publications to B’s publications and from B’s publications to A’s publications, respectively, and $C_{A \to}$ and $C_{\to A}$ are the total outgoing and incoming citations of A’s publications.

Similarly, we may compute $C_{2BW}(B, A)$, this time from B’s perspective, according to the following formula:

$$C_{2BW}(B, A) = \frac{C_{B \to A} + C_{A \to B}}{C_{B \to} + C_{\to B}} \tag{18}$$

where $C_{B \to}$ and $C_{\to B}$ are the total outgoing and incoming citations of B’s publications.
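The bidirectional citation bandwidth can be computed from author-level citation counts. In the sketch below, the `cite_counts` layout (a dict keyed by citing and cited author) and the exclusion of self-citations from the totals are assumptions made for the example.

```python
def citation_bandwidth(a, b, cite_counts):
    """Citation exchange between a and b over all of a's in/out citations.

    cite_counts: dict (citing_author, cited_author) -> number of citations.
    """
    exchange = cite_counts.get((a, b), 0) + cite_counts.get((b, a), 0)
    out_total = sum(n for (src, dst), n in cite_counts.items()
                    if src == a and dst != a)
    in_total = sum(n for (src, dst), n in cite_counts.items()
                   if dst == a and src != a)
    total = out_total + in_total
    return exchange / total if total else 0.0


# Hypothetical author-level citation counts.
cite_counts = {
    ("A", "B"): 3, ("B", "A"): 1,
    ("A", "C"): 4, ("C", "A"): 2,
    ("B", "C"): 5, ("C", "B"): 7,
}
print(citation_bandwidth("A", "B", cite_counts))  # 0.4
print(citation_bandwidth("B", "A", cite_counts))  # 0.25
```

The two perspectives differ because the same exchange of four citations is a larger share of A's total citation traffic than of B's.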

We assumed here that citing and the cited works are topically related. However, citation-based relations between papers are often criticized on the ground that citation may not actually represent, due to topic diversity of paper citations, topic-relationship between the source and the destination of citation [

One possible indicator of the topical relatedness of citations between authors is the level-2 citation relationship strength. The level-2 citation relationship strength between authors A and B is defined as the overlap ratio between the out-citations of A’s publications and the out-citations of B’s publications. Denoting the sets of A’s and B’s out-citations by $C_{A \to}$ and $C_{B \to}$ respectively, the level-2 citation relationship strength between A and B can be computed using the formula $C_{2OL}(A, B) = |C_{A \to} \cap C_{B \to}| / \min(|C_{A \to}|, |C_{B \to}|)$. Following the same reasoning used to derive Formula (8), we derive $P((A, B) \in C)$ based on the citation relationship between authors A and B as follows:

$$P((A, B) \in C) = \frac{\sqrt{f_{C_{2BW}}(C_{2BW}(A, B))^2 + f_{C_{2BW}}(C_{2BW}(B, A))^2 + 2 \cdot f_{C_{2BW}}(C_{2BW}(A, B)) \cdot f_{C_{2BW}}(C_{2BW}(B, A)) \cdot \cos\omega_{A,B}}}{2} \tag{19}$$

where $\omega_{A,B}$ is computed as

$$\omega_{A,B} = (1 - C_{2OL}(A, B)) \cdot \pi$$

Notice that ω A , B ranges from 0 to π depending on how strong the level-2 citation relationship between A and B is. The weaker the level-2 citation relationship between authors is, the bigger the angle ω A , B becomes, and consequently P ( ( A , B ) ∈ C ) becomes smaller if the bidirectional citation bandwidth remains unchanged.
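The level-2 overlap, the angle $\omega_{A,B}$ and the citation-based score can be sketched together. The example assumes that the out-citations of each author are given as sets of cited paper ids, and it reads Formula (19) with a square root over the numerator so that perfect inputs with a zero angle give 1; both points, and all names, are assumptions.

```python
import math


def level2_overlap(out_a, out_b):
    """Shared out-citations over the size of the smaller out-citation set."""
    smaller = min(len(out_a), len(out_b))
    return len(out_a & out_b) / smaller if smaller else 0.0


def omega(out_a, out_b):
    """Angle in [0, pi]; 0 when the smaller set is fully contained."""
    return (1 - level2_overlap(out_a, out_b)) * math.pi


def citation_based_score(sig_ab, sig_ba, angle):
    """Combine the two bandwidth significances under the given angle."""
    inner = sig_ab**2 + sig_ba**2 + 2 * sig_ab * sig_ba * math.cos(angle)
    return math.sqrt(max(inner, 0.0)) / 2


# Hypothetical out-citation sets.
out_a = {"r1", "r2", "r3", "r4"}
w_full = omega(out_a, {"r1", "r2"})  # complete overlap -> 0
w_half = omega(out_a, {"r1", "r9"})  # half overlap -> pi / 2
w_none = omega(out_a, {"r8", "r9"})  # no overlap -> pi

for w in (w_full, w_half, w_none):
    print(round(citation_based_score(0.6, 0.8, w), 3))
```

For a fixed pair of bandwidth significances, the score shrinks as the overlap of the out-citations weakens, which is the behavior the text describes.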

One indicator of topic-relatedness between the citing and the cited papers is the age of the citation. We define the age of a citation as the absolute difference between the publication years of the citing and the cited papers. The effect of citation age on topic-relatedness appears clearly in technology-driven disciplines such as computer science.

Disciplines vary in their nature and rate of development. To capture these two bibliometric differences, we use the citation age curve. Recall that the age of a citation $C_{P_1 \to P_2}$ from paper $P_1$ to paper $P_2$ is the absolute difference between the publication years of $P_1$ and $P_2$. The citation age distribution graph $f_{cg}(t)$ relates citation age values to the frequency of these values.

Notice that the impact of a citation $C_i$ from a work $P_A$ of author A to a work $P_B$ of author B on the similarity between A and B is: a) inversely proportional to the duration between the two connected works, i.e., between the publication dates of $P_A$ and $P_B$; b) inversely proportional to the frequency of citations of that age in the paper set, $f_{cg}(t)$, where $t = |T(P_B) - T(P_A)|$ and $T(P_x)$ is the publication date of $P_x$; and c) directly proportional to the ratio of the number of citations from A to B with duration $t$, $n_{C_{A \to B}/t}$, to the total number of citations from A to B, $n_{C_{A \to B}}$. We refer to this ratio as the citation-age factor of related works of authors A and B, which is computed as

$$r_{c\text{-}age} = \frac{n_{C_{A \to B}/t}}{n_{C_{A \to B}}} \times (1 - f_{cg}(t))$$

We involve the frequency of citations with age $t$ in the targeted publication set as stated in item b). Thus, the probability that A and B belong to the same community, or the relationship strength between A and B based on the citations from A’s works to B’s works, can be computed as

$$r_{c\text{-}age}(A \to B) = \sum_{\text{all } t_i \text{ from A to B}} \frac{n_{C_{A \to B}/t_i}}{n_{C_{A \to B}}} \times (1 - f_{cg}(t_i)) = \frac{1}{n_{C_{A \to B}}} \sum_{\text{all } t_i \text{ from A to B}} n_{C_{A \to B}/t_i} \, (1 - f_{cg}(t_i))$$

The citation age curve $f_{cg}(t)$ is one of the bibliometric features that depend on the targeted publication set. Similarly, one may compute $r_{c\text{-}age}$ for citations in the opposite direction, i.e., from B’s works to A’s. The case of two authors citing each other’s works is given more weight by the proposed similarity measure. We refer to this phenomenon as the author’s citation-backward loop.
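The citation-age factor and the angle of Formula (10) can be sketched as follows. The example assumes that $f_{cg}$ is available as a dict of relative frequencies per age and that the citations from A to B are summarized by the list of their ages; unseen ages are treated as having frequency 0. These choices and all names are assumptions.

```python
import math
from collections import Counter


def citation_age_factor(ages_a_to_b, f_cg):
    """Average of (1 - f_cg(t)) over all citations from A's to B's works.

    ages_a_to_b: list with one age (in years) per citation from A to B.
    f_cg:        dict age -> relative frequency in the whole publication set.
    """
    n = len(ages_a_to_b)
    if n == 0:
        return 0.0
    per_age = Counter(ages_a_to_b)  # number of A->B citations with each age
    return sum(cnt * (1 - f_cg.get(t, 0.0)) for t, cnt in per_age.items()) / n


def age_weighted_angle(r_ab, r_ba):
    """Formula (10): the stronger direction determines the angle."""
    return (1 - max(r_ab, r_ba)) * math.pi


# Hypothetical set-wide age frequencies and A->B citation ages.
f_cg = {1: 0.5, 2: 0.25, 5: 0.25}
r_ab = citation_age_factor([1, 1, 2, 5], f_cg)  # (2*0.5 + 0.75 + 0.75) / 4
r_ba = citation_age_factor([], f_cg)            # B never cites A
theta = age_weighted_angle(r_ab, r_ba)
print(r_ab, theta / math.pi)  # 0.625 0.375
```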

Discovering the hierarchically ordered contexts of publications is a key task in the context-based searching paradigm. Existing techniques for discovering publication contexts and author communities (i.e., scholarly-communication structures) rely on the availability of domain-specific inputs that require experts’ effort to prepare. However, the needed domain-specific inputs may not be available in some scientific disciplines. In this paper, we proposed utilizing a powerful input that is naturally available in any scientific discipline to discover its hierarchically ordered contexts, namely paper citation and co-authorship graphs. More specifically, we proposed a set of domain-specific, bibliometry-aware features that are automatically computable, in place of domain-specific inputs that might be unavailable or difficult to prepare. Another benefit of considering bibliometric features is that they adapt to the special characteristics of the literature environment being targeted, which in turn facilitates context-membership decisions. Another key advantage of our proposal is that it considers temporal changes in the targeted publication set.

Bani-Ahmad, S. (2017) Bibliometry-Aware and Domain-Specific Features for Discovering Publication Hierarchically-Ordered Contexts and Scholarly-Communication Structures. Social Networking, 6, 61-79. http://dx.doi.org/10.4236/sn.2017.61005