Extracting a Heterogeneous Social Network of Academic Researchers on the Web Based on Information Retrieved from Multiple Sources

The majority of academic researchers present the results of their scientific activity on the Web. This trace can be used to derive useful information about their past and present activity and to forecast their future intentions. Hence, a social network of academic researchers can be of significant value to the scientific community. This information can be retrieved from various data sources currently available on the Web, and a separate network can be built from each of them. In this paper we present a method for combining multiple single-relational networks into a single multi-relational network that contains all of the relations.


Introduction
The appearance of the Web in the early 1990s posed a series of interesting challenges to researchers of social networks. First of all, the Web influenced the traditional way of thinking about social networks. Since social network analysis was usually conducted on small groups of nodes, it proved difficult to apply the same methods on the Web. In most cases it was nearly impossible to analyze a network of millions of people, given that proper analysis requires constructing most of the network. Another hard task was gathering information about a large group of people. Moreover, there can be, and in fact are, people who are actors in the network but have not generated any data, which requires special consideration. These types of networks are much more difficult to study [1].
Social networks on the Web have been extracted by retrieving relationships between entities automatically derived from multilingual news [2]. Social networks have also been extracted from log files of online shared workspaces [3]. Another method extracts biographical information about historical persons from multiple unstructured sources on the Web [4]. Extraction of social networks has also been conducted via the Internet and networked sensing [5]. Referral Web is a system designed to extract social networks from the Web [6]. Mika developed Flink, a system for extracting social networks from the Web [7]. A system that extracts a social network from a user's inbox has been developed as well [8]. Matsuo developed Polyphonet, a system for extracting and analyzing a social network of academic figures [9]. Tang et al. demonstrate a method for extracting an academic social network [10]. Some authors address the extraction of multi-relational networks [11]. How a social network can be extracted from users' email communication is addressed in [12]. If two users have exchanged more than some number N of emails, an edge is inserted between them. In order to assign ranks to users, two statistics are measured. The first is based on the hypothesis that users who communicate more exchange more emails, so it is simply the number of emails a user has sent or received. The second statistic is the average response time. After the social network has been built, all cliques are obtained by means of the method described in [13]. Users are ranked based on degree centrality, betweenness centrality, the number of cliques a user is contained in, the size of those cliques, and the weights of the clique edges.
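The thresholding step of the email-based approach can be illustrated with a minimal sketch. It is not the implementation from [12]: the function names, the message format (sender, recipient pairs) and the sample data are assumptions made here for illustration, and only the "more than N emails" rule and a simple degree count are shown.

```python
from collections import Counter


def build_email_network(messages, min_emails):
    """Return undirected edges for user pairs that exchanged more than min_emails emails.

    messages is an iterable of (sender, recipient) pairs (an assumed format).
    """
    pair_counts = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            # frozenset makes the pair unordered, so a->b and b->a count together
            pair_counts[frozenset((sender, recipient))] += 1
    return {pair: n for pair, n in pair_counts.items() if n > min_emails}


def degree(edges):
    """Count how many edges touch each user (degree centrality, unnormalized)."""
    deg = Counter()
    for pair in edges:
        for user in pair:
            deg[user] += 1
    return deg


# Hypothetical message log, for illustration only
messages = [
    ("ann", "bob"), ("bob", "ann"), ("ann", "bob"),
    ("ann", "cat"),
    ("cat", "bob"), ("bob", "cat"), ("cat", "bob"),
]
edges = build_email_network(messages, min_emails=2)
ranking = degree(edges)
```

With this sample log only the pairs that exchanged three emails are kept, so the single email between "ann" and "cat" produces no edge.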
One of the most popular kinds of search on the Web is searching for people related to some other person. For example, people need to find those who have published the most papers on a specific topic, or to find the most famous actor in a certain area. The authors of [14] propose an approach to the problem of searching for people who share similar interests by representing a person by the content of his or her Web site. Some studies have extracted networks from FOAF documents [15][16][17]. Differentiation of entity ranking results is demonstrated in [18]. Community mining, which is one of the primary research areas in SNA, is also addressed in some works [19][20][21].
Although a considerable amount of work has been done on detecting social networks on the Web, most of it assumes that there is only one relationship between entities in the graph. In fact, in most communities nodes are connected by more than one relation. Moreover, each relation has a particular role in a particular task. When it is supposed that only one relation exists between people in a given community, much valuable information can be lost. Besides, many of these algorithms concentrate on small networks and are not suitable for large-scale ones.
Academic researchers may have relations of different kinds: they may have co-authored one or more papers, participated in the same conference, be members of the same scientific centre, have taken part in the same project, and so on. Extraction of the social network of academic researchers is another interesting topic on the Web [9,10,19]. The major issue with existing work on extracting social networks of academic researchers on the Web is that it does not embrace the entire academic network.

Constructing a Heterogeneous Social Network of Academic Researchers from Multiple Sources
The richness of information on the Web gives grounds to assume that information about one and the same set of entities obtained from multiple different sources will rarely give the same picture. Say we have built a social network of given named entities based on information from one data source only. In such a case it may happen, for example, that two particular actors have no relation in common. But having built a network on the same actors from a different source, we may see that those two actors do indeed possess a relation. This in turn means that the more data sources we use to build the final network, the more exact the results will be. In our case we consider undirected networks only. We assume that the entities are given beforehand. Although the method described in the paper works for any number of data sources and actors, suppose we have three data sources $s_1$, $s_2$ and $s_3$ from which we can obtain network data, and assume we are given five named entities $a_1$, $a_2$, $a_3$, $a_4$ and $a_5$. Assume that from the three data sources we have built three different social networks, shown below.
Table 1 represents the edge weights for each of the networks. After we have built the three single-relational networks, our objective is to combine them into a single network with the aid of the method described in [22], in which a multi-relational network is built. In our case the single-relational networks are those depicted in Figures 1, 2 and 3. The edge weight between any two entities in the combined network is the sum of the weights of the corresponding edges in each of the single-relational networks, each multiplied by a coefficient that expresses the degree of importance of the data source. The sum of these coefficients equals unity. For this case the basic rule takes the following form:
$$w(a_i, a_j) = z_1 w_1(a_i, a_j) + z_2 w_2(a_i, a_j) + z_3 w_3(a_i, a_j), \qquad z_1 + z_2 + z_3 = 1, \tag{1}$$
where $w_k(a_i, a_j)$ is the weight of the edge between $a_i$ and $a_j$ in the network built from the $k$-th data source and $z_k$ is the coefficient of that source.
At this step we have built a single multi-relational network that combines the individual networks. We have also given a formula for the weights of the edges of the resultant network. The problem now reduces to finding the unknown coefficients of the data sources in formula (1). These coefficients tell us "how much of each of the networks" we need to include in the resultant network.
In order to find these unknown values we use the method presented in [23], which finds the best alternative among multiple possible choices. It supposes that each choice is evaluated by multiple criteria; namely, a set of criteria is given on the given set of choices. Each criterion $c_l$ is represented as the following fuzzy set:
$$c_l = \left\{ \frac{w_l(s_1)}{s_1}, \frac{w_l(s_2)}{s_2}, \dots, \frac{w_l(s_n)}{s_n} \right\},$$
where $w_l(s_i)$ is the weight of choice $s_i$ by criterion $c_l$. In order to find these weights the author uses the ranks $r_i$ of the choices, which are larger the more reliable the choice is. Hence, the following is true:
$$\frac{w_l(s_i)}{w_l(s_j)} = \frac{r_i}{r_j}.$$
If $s_j$ is the worst choice (by criterion $c_l$) with weight $w_l(s_j)$ and rank $r_j$, then the last relation can be written as follows:
$$w_l(s_i) = \frac{r_i}{r_j}\, w_l(s_j).$$
Since the sum of the weights of the choices equals unity for each criterion, the weight of the worst choice is
$$w_l(s_j) = \left( \sum_{i=1}^{n} \frac{r_i}{r_j} \right)^{-1}.$$
Here, as the author also mentions, the ratio $r_i/r_j$ is taken from the Saaty 9-point scale, in which this ratio equals one of the numbers from 1 to 9 depending on how much alternative $s_i$ is better than $s_j$. Using the last formula one can calculate the weights of the choices from the ratio of the rank of each choice to the rank of the worst choice.
Therefore, in order to use this method we need to consider one more condition: the networks from these data sources should possess some criteria. In the method presented here we suppose that the networks have the following criteria: average tie strength, density and average distance. Table 2 shows the values of each of these criteria for each of the networks.
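The weight computation just described can be sketched in a few lines, under the assumptions that weights are proportional to the rank ratios against the worst alternative and that they sum to one for each criterion. The function name and the sample ratios are hypothetical and are not taken from Table 2.

```python
from fractions import Fraction


def weights_from_worst(ratios):
    """Turn rank ratios r_i / r_j (j = worst alternative) into weights summing to 1.

    ratios[i] says how many times alternative i is better than the worst one,
    so the worst alternative itself must have ratio 1.
    """
    ratios = [Fraction(r) for r in ratios]
    if min(ratios) != 1:
        raise ValueError("the worst alternative must have ratio 1")
    worst_weight = 1 / sum(ratios)  # weight of the worst alternative
    return [r * worst_weight for r in ratios]


# Hypothetical comparison of three sources for one criterion:
# source 1 is the worst, source 2 is 3 times better, source 3 is 5 times better.
weights = weights_from_worst([1, 3, 5])
```

Exact fractions are used instead of floats so that the "sum equals unity" condition can be checked without rounding error.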
We proceed according to the method mentioned. For each criterion we determine the worst alternative, compute its weight, and then compute the weights of the other two alternatives. This is done for 1) average tie strength (ATS), 2) density (D) and, in the same way, average distance. In these calculations $r_i$ is the rank of alternative $s_i$, and the ratio $r_i/r_j$ is taken from the table (Saaty 1-9 scale) in the method and shows how much alternative $s_i$ is better than alternative $s_j$ with respect to a given criterion. According to the mentioned method, the weights of the alternatives obtained for the different criteria allow us to represent the criteria as fuzzy sets. Taking, for each source, the minimum of these values over the criteria, and then choosing the maximum of the minima, tells us which alternative is the best with respect to these criteria; the minima form the resultant set R. This shows us which source is the best, which is worse and which is the worst. The better source is the one with the higher value of the numerator in the fraction.
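The max-min step can be sketched as follows. This is an illustration only: the function name is an assumption and the per-criterion weights are hypothetical values, not the ones computed from Table 2.

```python
from fractions import Fraction as F


def max_min_choice(criteria):
    """Intersect the criteria (minimum per source) and pick the source with the largest minimum.

    criteria maps a criterion name to a list of weights, one per source.
    Returns the list of minima and the index of the best source.
    """
    per_source = list(zip(*criteria.values()))  # regroup weights by source
    resultant = [min(ws) for ws in per_source]
    best = max(range(len(resultant)), key=resultant.__getitem__)
    return resultant, best


# Hypothetical weights of three sources for three criteria (each row sums to 1)
criteria = {
    "average tie strength": [F(1, 9), F(3, 9), F(5, 9)],
    "density": [F(1, 5), F(1, 5), F(3, 5)],
    "average distance": [F(1, 2), F(1, 4), F(1, 4)],
}
resultant, best = max_min_choice(criteria)
```

With these sample values the third source (index 2) has the largest minimum and is therefore selected as the best alternative.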
As mentioned, our objective is not to find out which source is the most reliable but to integrate data from these sources so that we can build a single resultant network. We could take the values received for the alternatives as the coefficients we seek, but this would contradict our requirement that $z_1 + z_2 + z_3 = 1$. To satisfy this condition we normalize the values in the resultant set of alternative weights R. Hence we obtain $z_1 = 1/5$, $z_2 = 1/5$ and $z_3 = 3/5$, which satisfies the condition that the sum of the unknown coefficients must equal unity.
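The normalization can be sketched as below. The values placed in R here are hypothetical: they were chosen only because they normalize to the coefficients reported above, and they are not the paper's actual intermediate values.

```python
from fractions import Fraction


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [Fraction(v) / total for v in values]


# Hypothetical resultant set R for the three sources
R = [Fraction(1, 10), Fraction(1, 10), Fraction(3, 10)]
z = normalize(R)
```

Dividing every value by the total preserves the ordering of the sources while enforcing the unit-sum condition on the coefficients.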
Next, we use formula (1) to calculate the edge weights of the resultant network. The result is the network that follows, with the edge weight values shown in Table 3.
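A minimal sketch of this combination step is given below, assuming each network is stored as a mapping from an actor pair to an edge weight. The coefficients are the ones obtained above; the function name and the edge weights of the three sample networks are hypothetical and do not reproduce Table 1.

```python
from fractions import Fraction


def combine_networks(networks, coefficients):
    """Weighted sum of undirected single-relational networks.

    networks is a list of dicts {(u, v): weight}; coefficients must sum to 1.
    A missing edge contributes weight 0.
    """
    if sum(coefficients) != 1:
        raise ValueError("coefficients must sum to 1")
    combined = {}
    for network, z in zip(networks, coefficients):
        for (u, v), weight in network.items():
            edge = frozenset((u, v))  # undirected: (u, v) and (v, u) are the same edge
            combined[edge] = combined.get(edge, 0) + z * weight
    return combined


# Hypothetical edge weights for three single-relational networks on actors 1..5
net1 = {(1, 2): 2, (2, 3): 1}
net2 = {(1, 2): 1, (4, 5): 3}
net3 = {(2, 4): 5, (4, 5): 2}
z = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]
combined = combine_networks([net1, net2, net3], z)
```

In this sample, the edge between actors 2 and 4 exists only in the third network, yet it still appears in the combined network, because an edge is kept whenever at least one source reports it.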
Computing the value of each criterion for this network gives the results in Table 4.
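The three criteria can be computed as in the sketch below. The paper does not spell out the definitions, so the ones used here are assumptions: density is the share of possible undirected edges that are present, average tie strength is the mean edge weight, and average distance is the mean number of hops between connected pairs of actors. The sample network is hypothetical.

```python
from collections import deque
from fractions import Fraction


def density(nodes, edges):
    """Fraction of possible undirected edges that are present."""
    n = len(nodes)
    return Fraction(2 * len(edges), n * (n - 1))


def average_tie_strength(edges):
    """Mean weight over existing edges (assumed definition)."""
    return Fraction(sum(edges.values())) / len(edges)


def average_distance(nodes, edges):
    """Mean hop count over all connected ordered pairs, via breadth-first search."""
    adjacent = {v: set() for v in nodes}
    for pair in edges:
        u, v = tuple(pair)
        adjacent[u].add(v)
        adjacent[v].add(u)
    total = count = 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacent[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1  # reachable nodes other than the source
    return Fraction(total, count) if count else None


# Hypothetical weighted network on five actors
nodes = [1, 2, 3, 4, 5]
edges = {
    frozenset((1, 2)): 2,
    frozenset((2, 3)): 1,
    frozenset((2, 4)): 3,
    frozenset((4, 5)): 4,
}
d = density(nodes, edges)
ats = average_tie_strength(edges)
avg_dist = average_distance(nodes, edges)
```

Other definitions (for example, distances weighted by tie strength) would give different values, so the choice should match whatever was used to produce Tables 2 and 4.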
From the procedure carried out above we can note some interesting facts. Actors 2 and 4 do not have any direct relation in the 1st and 2nd networks but do have a relation in the 3rd network, and so they also have one in the resultant network. Similarly, actors 4 and 5 do not have a direct relationship in the 1st network but have relations in the 2nd and 3rd networks. Our method showed that they have a relation in the resultant network as well. This gives us grounds to assume that, in this sense, our method does not suffer information loss. Another important point is that $z_3 = 3/5$, i.e. the 3rd network should be the best in the sense that it should combine all optimal values of the criteria. In fact, for the 2nd criterion the 3rd network is the best, for the 1st criterion it is neither the worst nor the best, and for the 3rd criterion it is the worst, which does not correspond to the received value of the coefficient.

Conclusions
Different kinds of social networks of academic researchers can be retrieved from information available on the Web. Most existing methods deal with only specific types of researcher networks. Ignoring any of the networks derived in this way may result in the loss of valuable information. We presented a method which can be used to create a single heterogeneous network out of multiple homogeneous networks of academic researchers.
In the method presented, none of the networks is ignored completely. In future work, we may show some applications of the described method.