This paper presents a new method for obtaining network properties from incomplete data sets. Problems associated with missing data represent well-known stumbling blocks in Social Network Analysis. The method of “estimating connectivity from spanning tree completions” (ECSTC) is specifically designed to address situations where only spanning tree(s) of a network are known, such as those obtained through respondent driven sampling (RDS). Using repeated random completions derived from degree information, this method forgoes the usual step of trying to obtain final edge or vertex rosters, and instead aims to estimate network-centric properties of vertices probabilistically from the spanning trees themselves. In this paper, we discuss the problem of missing data, describe the protocols of our completion method, and finally present the results of an experiment where ECSTC was used to estimate graph dependent vertex properties from spanning trees sampled from a graph whose characteristics were known ahead of time. The results show that ECSTC methods hold more promise for obtaining network-centric properties of individuals from a limited set of data than researchers may have previously assumed. Such an approach represents a break with past strategies of working with missing data, which have mainly sought means to complete the graph, rather than ECSTC’s approach, which is to estimate network properties themselves without deciding on the final edge set.

Respondent-Driven Sampling (RDS) has become a popular technique for providing statistically meaningful data on hard to reach populations by using peer-referral methods. Data obtained using RDS can be subjected to mathematical modeling, which can in turn provide the sorts of confidence intervals and measurable design effects expected of social science research [

Yet given the prominent role that social networks play in the RDS methodology, the recruitment/sampling strategy produces very little social network information. This is for three reasons: 1) all interview participants are given the same number of coupons, usually far fewer than their degree, meaning that referral turnout gives little indication of an individual’s network neighborhood, 2) the random-walk method necessary for achieving representativeness intentionally disregards questions of the range of network degrees, questions of directionality, and edge strength variation, and 3) because individuals are prevented from appearing as referrals once they have already been interviewed, RDS produces spanning trees that lack cycles.

Despite all this, RDS methods do provide some network data for populations among which normal social network research methods remain problematic or prohibitively expensive: networks of drug users, sex workers, marginal youth, and other hard to reach populations where name generators are either not useful or not welcome, and increasingly subject to restriction on the basis of human subjects protection. The network connections that appear in the RDS edge set are the result of peer referral, yet can be collected anonymously (via coupon number), and thus normally meet IRB guidelines. Unfortunately, few methods currently exist for imputing structural information in settings where there is missing social network data, as is the case with RDS surveys.

As Huisman [

Here we propose a second method for dealing with the missing data inherent in RDS spanning trees. Rather than attempting to replace missing data, or quantify the effects of missing data, we begin by considering the network to be a fixed structure about which we wish to make inferences based on partial observation. Specifically, we evaluate the constraints implied by very limited information about the marginals of the adjacency matrix and a small subset of its entries, and assess the extent to which these constraints can be used to reconstruct the relative values of network-centric vertex measures. In the following paper, we describe a set of experiments undertaken to ascertain the extent to which network level statistics can be generated from the limited sorts of data normally produced by RDS samples. The method of “estimating connectivity from spanning tree completions” (ECSTC, pronounced ek-stuh-see) proposed here seeks to recover network-centric measures for individuals within RDS samples, given only very limited information about links within the ambient network in which the survey is conducted. The method does not seek to construct concrete networks that most plausibly impute missing network links from the limited input data. Rather, if ECSTC can estimate network-centric vertex measures in spite of the missing links peculiar to data generated through RDS, then combining ECSTC with RDS might potentially provide a way around the high cost of conventional social network survey methods.

The method of “estimating connectivity from spanning tree completions” (ECSTC) begins with the edge set determined in the course of referrals made during the RDS process, together with individual network degree information determined in each subject survey. The residual difference between these two quantities represents the number of undiscovered edges at each vertex. The ECSTC method randomly adds these missing edges to the RDS tree until each vertex has gained the requisite degree^{1}. Stated equivalently, ECSTC takes as its input very limited information: a small set of entries within a network’s adjacency matrix, together with the matrix’s marginals. It then samples from the space of all adjacency matrices that are consistent with the partial information provided. In assigning missing edges to form complete networks, the intention is not to assert a final edge set. Rather, ECSTC seeks only to estimate network-centric vertex measures, foregoing the attempt to deduce the network’s structure in any final manner. It does this by producing large numbers of random graph completions consistent with what is known about vertex degrees. Each randomly completed network is then analyzed to determine network variable(s) at each vertex; here we consider the betweenness centrality, Burt’s measure of aggregate constraint, and effective size of each vertex. The completion process is then repeated on the same RDS tree, and the vertex properties once again measured for each of the completions. The values obtained from multiple independent completions are used to obtain a mean value for each variable (for each vertex), and the standard deviation is calculated to estimate variability across different completions. The ECSTC method is described in greater detail in Section 4.

Our strategy for evaluating the ECSTC method makes use of computational experiments on known, albeit idealized, topologies drawn from a class of theoretically plausible Barabasi-Albert (BA) networks^{2}. For purposes of this trial, we use multiple instances of randomly generated BA graphs of 100 and 500 vertices. Unlike most tests of techniques aimed at addressing the problem of missing network data, we do not begin by removing a random subset of vertices or edges (or both). Rather, we begin by simulating an RDS sample on the known graph, by which a list of vertices and a fraction of their connecting edges are discovered. We take an idealized view of the RDS method, assuming that coupon referral tracks real network ties of equivalent edge strength, and that subjects distribute coupons randomly among their network neighbors, recursively, until the referral chains all reach vertices with no undiscovered neighbors^{3}.

To begin the RDS simulation, one “seed” vertex is chosen randomly from among the vertices, to serve as the starting point of the simulated RDS. We assume that at each progressive step in the RDS simulation, accurate information is obtained from the surveyed subject (vertex) regarding its network size and actual neighbors. Each surveyed vertex is then “given” three coupons^{4}.

We chose three coupons because this is the current standard practice in most RDS studies, though the proposed method is impervious to this parameter setting. This node “distributes” the three coupons to up to three of its as-yet undiscovered neighbors, which it chooses uniformly at random. This process continues to exhaustion, which is to say until we reach a state where no further steps to unsampled nodes are possible. In practice, we find that a relatively high proportion, though not necessarily all, of the vertices are encountered in this way. In addition, terminal nodes in the referral tree tend to be low degree nodes, though occasionally terminal nodes may have higher degree if all their neighbors have already been sampled at previous stages of the RDS simulation. The ECSTC method is then used to generate multiple independent completions of the RDS tree, as described previously. The network-centric vertex measures of betweenness centrality, Burt’s constraint, and effective size are computed for each vertex within each completion, and the mean of these values serves as the ECSTC-derived estimate of the per-vertex measures. ECSTC-derived estimates are then compared with the true values of the network-centric measures, where the latter are readily computed using the ambient graphs on which the RDS simulation itself was conducted. Plots of the estimated versus actual measures of each vertex (for each variable) are made, and serve as the basis of conclusions concerning the extent to which the relative magnitudes of ECSTC-derived estimates reflect the relative magnitudes of the true values of the measures.
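The simulated referral process described above can be sketched in code. The following is a minimal sketch, assuming the networkx library and Python's random module; the function name and implementation details (e.g., breadth-order processing of the referral frontier) are our own illustration, not the authors' implementation:

```python
import random
import networkx as nx

def simulate_rds(G, coupons=3, rng=random):
    """Idealized RDS on a known graph G: a randomly chosen seed recruits
    up to `coupons` as-yet undiscovered neighbors, chosen uniformly at
    random; recruitment recurses until no further steps are possible."""
    seed = rng.choice(list(G.nodes()))
    tree = nx.Graph()
    tree.add_node(seed)
    discovered = {seed}
    frontier = [seed]
    while frontier:
        v = frontier.pop(0)
        undiscovered = [u for u in G.neighbors(v) if u not in discovered]
        rng.shuffle(undiscovered)             # uniform random choice...
        for u in undiscovered[:coupons]:      # ...of up to `coupons` referrals
            discovered.add(u)
            tree.add_edge(v, u)
            frontier.append(u)
    return tree
```

Because each vertex joins through exactly one referral edge, the output is always a tree; as the text notes, it need not reach every vertex of the ambient graph.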

The preceding process is repeated for different RDS trees, in order to determine the sensitivity of our conclusions to the random choices involved in any particular RDS tree. The entire process is then repeated for different graphs in order to determine the sensitivity of the conclusions to the choice of particular BA network.

For purposes of this experiment, three common network measures were chosen to test the efficacy of the ECSTC method: effective size of a vertex, betweenness centrality, and Burt’s constraint coefficient. We chose Burt’s constraint and effective size as they represent related but quite different “neighborhood” measures for social network analysis. Betweenness centrality was chosen to assess the method’s performance on measures affected by global network geometry (rather than just the neighborhood of the measured vertex). We note, however, that any other measure defined for a (combinatorial) graph could be substituted in place of these three (e.g. triad census or other more complex topological functions). Since each round of the ECSTC process produces a “completed” network, all that is needed is to compute the measure of interest for each of the completions produced in successive ECSTC rounds; the mean of these computed values then serves as an estimate of the true measure.

The first function examined in the experiment is the effective size of a vertex. Like Burt’s constraint coefficient (discussed below), this is a measure of local or neighborhood topology intended to make clear the importance of a vertex to the connectivity of its neighbors (and is thus a measure of mediation or influence). Effective size is simply the degree of a vertex minus the average of the degrees of its k = 1 neighbors with respect to one another. Being largely dependent on degree information, and averaging across k = 1 neighbors, this function was thought beforehand as likely to be the most amenable to ECSTC methods. In the experiment, effective size was computed as

$$\mathrm{ES}(v) = \deg(v) - \frac{2t_v}{\deg(v)}$$

where t_v denotes the number of edges among the neighbors of v, so that 2t_v/deg(v) is the average degree of the neighbors of v with respect to one another.
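As a concrete check of the effective-size definition, here is a minimal sketch (assuming the networkx library; the helper name is ours) comparing a direct computation against networkx's built-in `effective_size` on a triangle, where each vertex has degree 2 and its two neighbors are tied to one another:

```python
import networkx as nx

def effective_size_manual(G, v):
    """Effective size: deg(v) minus the average number of ties the
    neighbors of v have among themselves (hypothetical helper)."""
    nbrs = list(G.neighbors(v))
    n = len(nbrs)
    if n == 0:
        return 0.0
    # count ties among the neighbors of v
    t = sum(1 for a in range(n) for b in range(a + 1, n)
            if G.has_edge(nbrs[a], nbrs[b]))
    return n - 2 * t / n

G = nx.complete_graph(3)             # a triangle
print(effective_size_manual(G, 0))   # 1.0
print(nx.effective_size(G)[0])       # 1.0
```

For each triangle vertex, deg = 2 and the one tie among its neighbors gives an average alter degree of 1, hence an effective size of 1.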

Betweenness centrality is defined by Wasserman and Faust [ ] as

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where σ_st is the number of shortest paths between vertices s and t, and σ_st(v) is the number of those paths that pass through v.
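As a small illustration of this definition (a minimal sketch assuming the networkx library): on a three-vertex path, the single shortest path between the two endpoints passes through the middle vertex, so its unnormalized betweenness is 1.

```python
import networkx as nx

G = nx.path_graph(3)  # 0 - 1 - 2
bc = nx.betweenness_centrality(G, normalized=False)
print(bc[1])  # 1.0: the one shortest path (0 to 2) passes through vertex 1
print(bc[0])  # 0.0: no shortest path between other vertices passes through 0
```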

Burt’s constraint is a measure of the extent to which a vertex is linked to alters who are in turn linked to one another [ ]. The constraint of vertex i with respect to a neighbor j, and the aggregate constraint of i, are

$$c_{ij} = \Big(p_{ij} + \sum_{q \neq i, j} p_{iq}\, p_{qj}\Big)^2, \qquad C_i = \sum_{j} c_{ij}$$

where p_ij is the proportion of i’s relations invested in j (for an unweighted graph, p_ij = 1/deg(i) when i and j are adjacent, and 0 otherwise), and the sum defining C_i runs over the neighbors j of i.
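To make the formula concrete, here is a minimal sketch (assuming the networkx library; the helper name is ours) computing aggregate constraint directly and checking it against networkx's built-in `constraint`. On a triangle, p = 1/2 for each neighbor, so c_ij = (1/2 + 1/4)^2 = 0.5625 and C_i = 1.125:

```python
import networkx as nx

def aggregate_constraint(G, i):
    """Burt's aggregate constraint C_i = sum_j (p_ij + sum_q p_iq p_qj)^2,
    with p_ij = 1/deg(i) for unweighted graphs (hypothetical helper)."""
    nbrs = list(G.neighbors(i))

    def p(u, v):
        return 1 / G.degree(u) if G.has_edge(u, v) else 0.0

    total = 0.0
    for j in nbrs:
        # only neighbors q of i contribute, since p_iq = 0 otherwise
        indirect = sum(p(i, q) * p(q, j) for q in nbrs if q not in (i, j))
        total += (p(i, j) + indirect) ** 2
    return total

G = nx.complete_graph(3)            # a triangle
print(aggregate_constraint(G, 0))   # 1.125
print(nx.constraint(G)[0])          # 1.125
```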

Denote by G = (V, E) the graph of the ambient network on which the RDS sample is conducted, and by d(v) the true degree of a vertex v in V^{5}.

Let T denote the spanning tree produced by an RDS sample of G.

The next two subsections present the ECSTC procedure precisely, using which the network-centric vertex measures of interest are estimated.

To begin, we note that rather than uniformly sampling spanning trees of a general graph G, we sample trees according to the simulated RDS referral process, as follows:

1) Pick a seed vertex uniformly at random from V.

2) Now starting at the seed, have each newly discovered vertex distribute up to three coupons among its as-yet undiscovered neighbors, chosen uniformly at random; recurse until no further undiscovered neighbors remain.

The above process implicitly defines a distribution over the trees of G that can arise from the simulated RDS.

Let T be a tree produced by the sampling process above. The completion procedure takes as its input:

1) The number of vertices in T;

2) Degrees of vertices in T, as reported by the corresponding survey subjects;

3) The graph T itself, i.e. its vertex and edge sets.

C1. Initialize the completion C := T, and for each vertex v set its residual degree r(v) := d(v) - deg_T(v), the number of as-yet undiscovered edges at v.

In the next step (C2), the vertex set of C is repeatedly sampled and missing edges are added until every residual degree has been exhausted.

C2. Repeat Steps (a)-(c) until r(v) = 0 for every vertex v:

(a) Define a probability distribution over the vertices of C in which each vertex is weighted proportionally to its residual degree r(v);

(b) Choose vertices u and w independently according to this distribution;

(c) If u and w are distinct and the edge (u, w) is not already present in C:

Add the edge (u, w) to C, and

increment the current degrees of u and w in C (equivalently, decrement the residual degrees r(u) and r(w))^{6}.

C3. Output C.

The output of the above process implicitly defines a distribution over completed graphs consistent with the observed tree T and the reported degrees.

Steps C2 (a)-(c) above are a sort of “preferential completion”, since the algorithm chooses vertices u and w with probability proportional to the number of undiscovered edges remaining at each.
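The completion loop C1-C3 can be sketched as follows. This is a minimal sketch assuming the networkx library; function and variable names are ours, and we cap the number of attempts because some residual-degree configurations cannot be completed exactly:

```python
import random
import networkx as nx

def ecstc_complete(tree, degrees, rng=random, max_attempts=100000):
    """One random ECSTC completion (steps C1-C3, sketched): add edges to
    the RDS tree until each vertex reaches its reported degree."""
    C = tree.copy()
    # C1: residual degree = reported degree minus degree already in the tree
    resid = {v: degrees[v] - C.degree(v) for v in C}
    active = [v for v in C if resid[v] > 0]
    attempts = 0
    # C2: draw vertex pairs with probability proportional to residual
    # degree, adding the edge when it is new and not a self-loop
    while len(active) >= 2 and attempts < max_attempts:
        attempts += 1
        u, w = rng.choices(active, weights=[resid[v] for v in active], k=2)
        if u != w and not C.has_edge(u, w):
            C.add_edge(u, w)
            resid[u] -= 1
            resid[w] -= 1
            active = [v for v in active if resid[v] > 0]
    return C  # C3: output the completed graph
```

Averaging a vertex measure over many such independent completions then yields the ECSTC estimate for that vertex.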

Repeating the aforementioned processes, we obtain multiple independent completions of the same RDS tree T.

Network-centric vertex measure estimates. Given a specific completion, each network-centric measure is computed at every vertex of that completion; the ECSTC estimate at a vertex is then the mean of the measure’s values across the completions, with the standard deviation across completions recording variability.

Let f(v) denote the true value of a network measure at vertex v, computed on the ambient graph, and let f̂(v) denote its ECSTC estimate. We assess the quality of the estimates using two statistics:

1) The correlation between true and estimated values, read from a scatter plot in which each point maps the true vertex measure f(v) against its estimate f̂(v);

2) The misclassification rate: the percentage of vertex pairs whose relative order under the ECSTC estimates disagrees with their relative order under the true measure.
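Both statistics are straightforward to compute; the following is a minimal plain-Python sketch (the helper names are ours):

```python
from itertools import combinations

def pearson_r(true, est):
    """Pearson correlation between true and estimated per-vertex values."""
    n = len(true)
    mt, me = sum(true) / n, sum(est) / n
    cov = sum((t - mt) * (e - me) for t, e in zip(true, est))
    st = sum((t - mt) ** 2 for t in true) ** 0.5
    se = sum((e - me) ** 2 for e in est) ** 0.5
    return cov / (st * se)

def misclassification_pct(true, est):
    """Percentage of vertex pairs ranked in the opposite order by the
    estimates relative to the true measure (ties not counted)."""
    pairs = list(combinations(range(len(true)), 2))
    bad = sum(1 for i, j in pairs
              if (true[i] - true[j]) * (est[i] - est[j]) < 0)
    return 100.0 * bad / len(pairs)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))        # ≈ 1.0 (perfectly linear)
print(misclassification_pct([1, 2, 3], [1, 3, 2]))   # one of three pairs flipped
```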

In this section, we seek to experimentally determine the effects of increasing the number of RDS trees sampled from a given graph, and the number of completions generated per tree, on the quality of the resulting ECSTC estimates.

The following constitutes a single experimental trial:

• Draw a random graph G from the BA distribution;

• Choose RDS trees of G by repeated runs of the RDS simulation;

• For each tree, generate a number of independent completions via the ECSTC procedure;

• Use the resulting completions to form per-vertex estimates of each network measure;

• Compute estimate quality measures (correlation and misclassification) by comparison with the true values on G.

To illustrate, fix a single trial; the tables below report estimate quality as a function of the number of trees sampled and the number of completions generated per tree.

To counter the possibility that these results might be due to chance (either in the choice of graph, or the choice of tree, or the choice of completions), we evaluated the robustness of the results by conducting 25 independent trials.

Correlation as a function of number of completions: For a fixed number of trees, the mean correlation across all vertices improves as the number of completions grows. The high values support the idea that the ECSTC method is able to successfully recover significant data across a range of network measures, with increased numbers of completions improving the fit of the estimated values to the actual ones. For several network measures, at high numbers of completions, correlation approaches 1. This holds true across a range of variables, with strong correlations between actual and estimated values apparent for betweenness centrality, effective size, and Burt’s constraint. These observations are mitigated in those instances where high numbers of trees were included. There, the correlation values (for 50 trees, for example) were already so high that the use of multiple completions added only very marginal gains. The standard deviation of correlation values across 25 independent trials shows a similar trend. Where the number of trees is held steady (and low), increasing numbers of completions produces a lower standard deviation across trials, meaning that high numbers of completions tend to mitigate sensitivity to initial starting conditions, and the vagaries of the starting point of the sampling tree.

Mean correlation between estimated and true values (rows: number of RDS trees; columns: completions per tree):

| | 1 comp | 10 comps | 30 comps | 50 comps |
|---|---|---|---|---|
| 1 tree | 0.954 | 0.977 | 0.979 | 0.979 |
| 10 trees | 0.979 | 0.981 | 0.981 | 0.982 |
| 30 trees | 0.981 | 0.982 | 0.982 | 0.982 |
| 50 trees | 0.981 | 0.982 | 0.982 | 0.982 |

Standard deviation of the correlation across trials:

| std | 1 comp | 10 comps | 30 comps | 50 comps |
|---|---|---|---|---|
| 1 tree | 0.009 | 0.002 | 0.002 | 0.001 |
| 10 trees | 0.002 | 0.000 | 0.000 | 0.000 |
| 30 trees | 0.001 | 0.000 | 0.000 | 0.000 |
| 50 trees | 0.001 | 0.000 | 0.000 | 0.000 |

Measure: ES

Measure: CON

Standard deviation of the misclassification percentage across trials:

| std | 1 comp | 10 comps | 30 comps | 50 comps |
|---|---|---|---|---|
| 1 tree | 1.035 | 0.641 | 0.589 | 0.592 |
| 10 trees | 0.476 | 0.414 | 0.281 | 0.264 |
| 30 trees | 0.439 | 0.271 | 0.176 | 0.167 |
| 50 trees | 0.462 | 0.233 | 0.161 | 0.174 |

Mean misclassification (percentage of vertex pairs):

| | 1 comp | 10 comps | 30 comps | 50 comps |
|---|---|---|---|---|
| 1 tree | 8.447 | 7.872 | 7.842 | 7.843 |
| 10 trees | 7.862 | 7.838 | 7.838 | 7.838 |
| 30 trees | 7.839 | 7.838 | 7.838 | 7.838 |
| 50 trees | 7.838 | 7.838 | 7.838 | 7.838 |

Measure: ES

Measure: CON

Correlation, as a function of multiple trees: Where the number of completions is held steady (and low), producing multiple trees has an effect similar to producing multiple completions, improving the fit between estimated and actual values. Here too, where high numbers of completions are included, the fit is already so tight that raising the number of trees provides only a marginal improvement. The standard deviation of correlation values across 25 independent trials shows a similar trend: where the number of completions is held steady (and low), increasing the number of trees produces a lower standard deviation across trials, meaning that multiple trees likewise tend to mitigate sensitivity to initial starting conditions.

Misclassification, as a function of number of completions. As with correlation, increasing the number of completions improves the fit between estimated and actual values, with high numbers of completions resulting in a lower percentage of misclassified vertex pairs. This holds true for effective size and Burt’s constraint, though not for betweenness centrality; there, a high number of completions did not result in a steady decrease in the number of misclassified pairs. Across 25 trials, the standard deviation of misclassification decreased as the number of completions increased. This held true across all three network measures. We note here, though, that where high numbers of trees were available, the improvement provided by high numbers of completions was negligible, as the standard deviation across trials was already approaching 0.

Misclassification, as a function of multiple trees. Here the observation that pertained to correlation is reversed. The inclusion of multiple trees did not significantly improve (i.e. lower) the percentage of misclassifications, and in the case of betweenness centrality, the percentage of misclassifications actually increased with the inclusion of more sampling trees of the same ambient graph.

These observations, overall, suggest that multiple completions carry much the same results as multiple spanning tree samples of the same network, and at times produce better results. They also have the effect of minimizing sensitivity to initial starting conditions, as examined across 25 distinct trials. Beyond this, for these (idealized) conditions, the ECSTC method proved capable of recovering significant amounts of network data, in close correlation with the values that obtain in the original network.

As above, the purpose of this experiment was to test the potential and begin to assess the validity of the ECSTC method for obtaining network properties from fairly sparse data sets, especially the sorts of spanning tree data sets normally produced by Respondent-Driven Sampling methodologies. The high conformity of the estimated values to the known values surprised the authors. These results are encouraging, showing that the method is capable under the circumstances described here of estimating accurately the values of a known but only partly sampled graph, with relatively small levels of variation in that estimate or dependence on initial conditions.

A major concern for the authors was the sensitivity of the method to any single random walk. Given the relationship between this method and RDS research protocols―where ordinarily only a single random walk sample is taken―we worried that stochastic factors inherent in the walk itself (randomness that plays a large role in RDS’s ability to reach sampling equilibrium in a population) would bias the results of the completions. Again this appears, at first attempt, not to be the case. The high concurrence of results over multiple sampling walks of the same networks, and the generally low standard deviation of the variation of those results across 25 distinct trials, means that we can have some confidence that the ECSTC method is not overly sensitive to peculiarities of any particular sampling walk.

Not surprisingly, the method was not equally successful across all measures, nor equally successful among those it was able to estimate closely. It worked best (closest fit and smallest individual error) for effective size. The authors were very surprised at the ability of the method to recover Burt’s constraint measure, with a very high Pearson’s r score, and low mean standard deviation. We expected the technique to fare worse on this measure. Despite past results showing betweenness centrality to be among the least resilient measures in the face of missing data, these scores were actually quite good as well, indicating that the mean values of these distributions (of estimates) were, in general, quite close to the actual values. These results were consistent over the course of 25 trials.

There remains much work to be done, as discussed below. But if the results shown here for the Barabasi-Albert distribution are consistent across other topologies and sampling scenarios, then the ECSTC method may prove a valuable extension of the Respondent-Driven Sampling method, allowing researchers to recover at least some broad topological data from the sampling trees produced by RDS. This would address two problems that social network researchers commonly face: the cost of large surveys where all participants must be asked about all others, and the problem of anonymity and informed consent. RDS trees are samples that do not attempt to ask respondents about others in the sample, other than the sorts of degree and ego-network questions necessary for tracking their own sampling. Likewise, the coupon referral method normally used in RDS allows for anonymous tracking of links, not necessitating the use of names or rosters.

Several important limits to our results must be discussed, however. Because the spanning tree samples stop when they reach a vertex with no additional undiscovered edges, nodes of degree one are known exactly (obviously), and low degree nodes in general have a lower proportion of their edges appear as “missing” in the sample. The result is that the initial spanning tree is much more accurate for low degree vertices. In a BA graph, these make up the majority of the network, so we begin the completion protocol with much of the periphery of the network fairly well known. This means that the ECSTC method does most of its work, in the current instance of a BA graph, among the more highly connected vertices. This may be why betweenness centrality estimation remained accurate despite the fact that, in general, less than 50% of the edges are discovered in the sampling walks.

An issue for our results is that we assumed that we were able to record accurate degree information at each step of the walk, even though we did not discover the full set of edges to which that degree corresponded. A legitimate question is, to what extent such a measure is normally accurate in network interviews [

The authors would like to thank the referees for many helpful suggestions and comments through which the paper was improved considerably. This research was supported by NIH/NIDA grants RO1DA034637-01, 1RC1DA-028476-01/02, NSF Social Behavioral Sciences grant SMA-1338485. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect those of the National Institutes of Health/National Institute on Drug Abuse. The analyses discussed in this paper were carried out at the labs of the New York City Social Networks Research Group (www.snrg-nyc.org). Special thanks to Samuel Friedman, Karen Terry, Jacob Marini and Susy Mendes in the John Jay Office for the Advancement of Research, and Colleen Syron, Emily Channell, Robert Riggs, David Marshall, Nathaniel Dombrowski, and the other members of the SNRG team. We would like to acknowledge that initial funding for a pilot version of this project was provided by the NSF Office of Behavioral, Social, and Economic Sciences, Anthropology Program Grant BCS-0752680.