1. Introduction
Social networks have emerged as a communication tool of unexpected impact. Frequent contact between people through these networks gives rise to virtual relationships developed according to their interests. This study’s approach follows [1] which poses emphasis not only in the individual behaviour but in social interaction within the network.
One of today’s challenges is to develop accurate tools to identify influential users by sectors and markets, and to understand information flow dynamics. In turn, it is difficult to fully observe a social network; therefore statistical problems and probabilistic modelling are important issues (see [2] [3] and [4] for surveys on this subject). Besides this topic, social networks also provide examples of situations of unidentifiable or censored data models and this makes them particularly interesting.
The above concerns lead us to study virtual social networks phenomenon and we restricted our analysis to Facebook. For this we develop a model which, although does not describe it in full reality, it is useful as an approach to network’s dynamics study. Naturally this model leads to graph theory which relates to work in [5] and [6] .
We will use mathematical tools and statistics from stochastic processes field [7] [8] [9] , trying to answer questions such as the existence of transversal communication or how to find if a network segmentation proposed is optimal within a certain communication behaviour between segments.
This paper is structured as follows. In Section 2 we address Facebook modelling using tools of Markov chains, we introduce the concept of complete transversality in the communication, and in this context, we try to find the distribution of random functions involved in the model.
Section 3 proposes two statistical hypothesis tests, one to prove network’s CT and the other to prove CT between network segments using U-statistics theory and asymptotic convergence theorems presented in [10] . We end this section, defining useful performance index to measure segmentation’s quality.
Section 4 is devoted to conclusions and acknowledgment.
Section 5 is an appendix with proofs of the results obtained in the preceding sections.
Further and related works on this subject can be read in one of author’s PhD theses [11] .
2. Model Description
In this section we propose a model for describe Facebook dynamics.
Consider
. Let
denote the set of all internet users at instant t and
the set of Facebook’s profiles at time t. Of course
and, we’ll also suppose that once a profile is created it cannot be eliminated. (Actually a profile can be eliminated, but this is tedious and difficult to do, so we disregard this behaviour.) Then
.
We also denote
and
to the sets
and
respectively.
For each instant t, we will model the network with a random graph where the nodes represent profiles and the edges represent friendship links.
Definition 1. Let
. We define the random “friendship” function at time t as the function
, such that if
,
and
Remark. The “friendship” relationship is symmetric, so function
is also symmetric. Then, the random graph determined by
is bi-directional and we can define the adjacency matrix as follows.
Definition 2. Let M be the cardinal number of
and let
be the set of binary symmetric matrices of order M. We define the random “friendship” matrix at time t as the matrix
whose elements are the values taken by
for each pair of profiles
.
Due to the high level of interactivity on social networks, is natural to suppose that
is a Markov chain with states space in
. Then, given two states A and B of
, we denote
to the one step transition probability from A to B.
Because of
symmetry’s,
is determined by the one step transition probability of the lower subdiagonal’s elements of
. Then, if we say that f precedes
if
and we write
and, given
, we denote
to the set of the lower subdiagonal’s entries of A, we have that:
(1)
For (1) calculation’s, we describe through events the profile’s actions which have impact on the transition. These events will be linked to certain indices that we will construct to measure affinities between profiles.
We will use the following notation. Let A be any set, we will denote:
to the set of distinct pairs of A’s elements, and by
to the set of triples of A’s elements they are different by pairs.
Definition 3. Let
, with
. We define the p “image index” over
as the function
such that
Remark. We will suppose that the network lies at steady state, this implies that image between profiles has also evolved to a steady state, so the function X does not depends on time t. Besides, function X is not symmetric, non observable and monotonic.
Definition 4. Let
, with
. We define f “image index” over g as the function
, given by
(2)
with
and
the users sets that administrate f and g profiles respectively,
a monotonic regression function, and
independent and identically distributed random variables with
and
, for all pair
,
.
To triples ordered of users and triples ordered of profiles we define the following indices.
Definition 5. Let
. For the triple ordered of users
we define an “image index” as the function
that assigns to each triple, a real number that represents acceptance level for the action p suggests to
to be a friend of
.
Remark. As with X, U doesn’t depend on t, is non symmetric and non observable.
Definition 6. Let
distinct by pairs. For the triple ordered of profiles
we define the “index image” as the function
given by
(3)
with
,
and
the users sets that administrate f, g and h profiles respectively,
a monotonic regression function and
independent and identically distributed random variables with
and
, for every triple
, distinct by pairs.
Then, given
, we will suppose that there are
,
and
, such that:
i) “f breaks friendship with g” Û
.
(The breakdown of friendship may be due to that f take decision to eliminate g or vice versa).
ii) “f successfully requests friendship to g” Û
.
(The request of successful friendship arises when f asks to g for friendship and g accepts to f as his friend).
iii) “f successfully suggests to g, h’s friendship” Û
.
(The suggestion of successful friendship is given when f suggests h to g, g requests h’s friendship and h accepts).
Then, if we denote with
to the interventions of f regardless its effects in the transitions from one instant to the next from
to
, these can be described in a disjunct union of the following events:
that is,
, and the transition probabilities in one step from
to
can be expressed in the following formula:
(4)
Proposition 1. The one step transition probability from
to
of the Markov chain
is given by:
with
,
,
and
Complete Transversality
Complete Transversality (CT) in a social network is associated to a certain communication behavior. This behavior implies that relationship probability between two profiles is always the same. No matter who the profiles are.
CT arises from social scientific theories [12] that poses that massification of social networks would bring an horizontality in communication and transversality in connection between people, overpassing social, economic, ethnographic and other differences. Virtual social networks would bring balance and democracy to all people connection.
In terms of the model, this could be reflected in the “image index” X of one user p over another
and in the “image index” U of the triple ordered of users
distinct by pairs. So, averages of regression functions
and
takes the same value in (2) and (3). Let’s say they are equal to
for all distinct user pairs and
for all triples ordered of users distinct by pairs. Under this conditions, “image indices” given by (2) and (3) are reduced to
and
.
When network follows this behaviour, the following results can be established.
Theorem 1. (Homogeneity)
Under the CT context,
is a time-homogeneous Markov chain.
Theorem 2. (Ergodicity)
Under the CT context, the Markov chain
is ergodic, and when initial distribution is ergodic, the friendship indicator between a pair of profiles
, with
, denoted by
, is a random variable with Bernoulli distribution and parameter p,
, with the same distribution and independent of the friendship indicator of any pair of distinct profiles.
From the results exposed in former Theorem we can conclude that, for
, ergodic distribution under CT is
3. Test and Estimation
We are aimed to discuss the CT’s validity in social network Facebook. For this, we will propose two statistics based on samples of N profiles and will study their asymptotic distribution under CT when N increases. Besides, we will present the CT tests.
Further, in this section we will introduce a lighter version of CT called Segment Transversality (ST) related to a given segmentation, that allow us to elaborate segmentation quality index.
3.1. Average Communication between Profiles
Let
(5)
this statistic averages proportion of friends who have profiles on the sample and therefore measures “sample’s communication average”.
Focusing on long term dynamics, we have seen that under CT, the random variables
,
are independent Bernoulli with parameter p. Then,
we can verify that
and
.
We want to find EN’s asymptotic distribution, so we study asymptotic behaviour of
(6)
when sample size N grows towards infinity.
If we denote
, we have that
, where
forms a triangular array.
Theorem 3. If the triangular array
verifies the following conditions:
i) For each
,
are independents;
ii)
, when
;
iii)
;
iv) Exists
such that
, for all N and for all i,
and Lyapunov conditions is met, that is,
(7)
then
(8)
Proof 1. Hypothesis i) is trivial because for
and N fixed,
is independent of
.
Besides,
, and
(9)
Then, the hypothesis ii) and iii) are met.
Let
. We will see that iv) holds. For this, we will calculate
with
,
,
and
.
Then,
and Lyapunov condition given by (7) can be verified for
:
Consequently the hypothesis i)-iv) holds and we conclude that
Returning to the centered statistic expression
and to the expression of
given by (9) results the following:
that is,
(10)
As EN is a communication estimation between profiles, if we select different profile samples under CT, we shouldn’t detect differences among p’s estimations.
To study this, we propose the following hypothesis test to compare communication average.
Given
and
, independent profile samples of
, that verifies
, we build the statistics:
and we have that
and
with
, this is, mean and variance are linked and
is derivable. So, if we take,
and,
(11)
and
(12)
Consequently, to test average communication, we can substract (11) and (12). This statistic, under CT assumption, results
(13)
and, given a signification level α, we obtain the following critic region
Real Data Testing
We perform such an experiment in a real profile network that gives permission to the authors for sampling. For confidential reasons we cannot release any of the data used to make calculations. We can state that we take two independent and disjunct samples of size
. The statistic value was
. Therefore, with a
significance level, we reject CT hypothesis.
This results indicates the social network Facebook is a platform in which communication between people or groups of people is it NOT TRANSVERSAL.
3.2. Mean Square Deviation of the Communication between Profiles
Let
(14)
be an statistic that estimates the mean square deviation of the communication between profiles respect to its mean.
In order to find the asymptotic distribution of TN we use properties of U-statistics introduced in [13] .
Suppose that
are independent and identically distributed random variables and that
,
is some symmetric function respect to permutations.
Definition 7. A U-statistic of order r with kernel h is defined as
We state the following theorem whose proof can be seen in [10] .
Theorem 4. Let UN be a U-statistic of order r with kernel h. Suppose that
and that
i)
are independent and identically distributed random variables,
ii)
and,
iii)
Then, if
,
(15)
Then we have that TN is a U-statistic of order 2 with kernel
where
and
. Thus, under CT,
and
.
Remark. For the proposed model for Facebook, p is the friendship probability of any pair of profiles in long term and, because of the large size of the network, p is probably less than 0.5. Thus,
,
and
, therefore
.
As
, by the Theorem 4 is
(16)
As
and
, we see that
and
is derivable. Taking
verifies that
and the limit expression on (16) is
(17)
when
.
We can make a test to prove CT by comparing the mean square deviation of two independent populations of profiles. For this we take two independent samples of N profiles of
,
and
such that
, we construct the statistics
and
then, by (17), we obtain the asymptotic distribution of the statistic for the comparison of mean square deviations under the context of CT, that is,
being the critical region at the level of significance α
Similarly as for the test comparing the average proportion of communication between profiles, we use the same real profile and, on his network of friends, we take two independent and disjoint samples and calculate the statistic and the critical region, concluding that the hypothesis of CT is rejected again.
3.3. Segmented Transversality
A context of CT in a social network like Facebook is far from reality as evidenced by the findings of the two tests that we made. Is reasonable to think that profiles tend to cluster in different segments according to social criteria such as political ideologies, economic interests, musical tastes, ages, etc., and that these segments are also related to each other.
We introduce the concept of Segmented Transversality (ST), that is, CT between segments. Then, making a priori segmentation on the network, we will introduce a statistic representing the communication between pairs of segments and we will prove CT between the profiles of the segments.
Let
be a partition in segments of
. We notice with
to the proportion of profiles of
,
,
,
, and we make a random stratified sampling by segments as follows:
are chosen randomly inside of
,
are chosen randomly inside of
, and so on until
are chosen randomly inside of
, being [x] the integer part of x, that is,
, for
.
Given the sets
,
,
,
and, given
and
two segments of
the partition on k segments of
, for
and
, with
and
, we notice with
to the probability of friendship between
and
, that is,
.
Remark. Friendship’s random functions
still are independent random variables with Bernoulli distribution, but now the parameter distribution depends on the intensity of the relationship between the pair of profiles considered.
Definition 8. Given two segments of profiles
and
, we say that there is CT between them if given
and
, with
and
,
has Bernoulli distribution with parameter
, for all
and for all
.
That is, CT between segments means a distinctive homogeneous behavior in the communication between them.
So, let
(18)
be the average proportion of friends of the profiles of
in the segment
. Then, under CT between
and
, we have that
and
.
Of the same way as we obtain the asymptotic distribution of the centered estimator
of the expression (6), we can obtain the asymptotic distribution of
(19)
resulting
(20)
If we want to test CT between a pair of segments
and
we make a stratified random sampling independent from the previous one in which
are chosen randomly inside of
,
are chosen randomly inside of
, and so on until
are chosen randomly inside of
, with
,
,
and we construct the statistic
(21)
Then, under CT between
and
, are
,
and
(22)
For
and
we have that the variance s is a function of the expected value m, that is,
, where
and
is derivable. Then, taking
we have that
and, similarly as in the previous section, we can conclude that
being the critical region at the level of significance α,
Therefore, if the test leads to the rejection of the hypothesis of CT between the segments
and
, with an error probability of α we say that such segments do not have a distinctive homogeneous behavior in the communication.
3.4. Quality on Segmentation
If we divide the network into k disjoint segments, we can take all possible pairs of those k segments,
, and make a total of
test, one of each pair and test whether the segment of this pair have a distinctive homogeneous behavior in the communication. We can represent these
test by a binary symmetric matrix of order k,
, in which each element
is one if the segments
and
were not homogeneous in terms of communication, that is, if the corresponding test rejects the hypothesis of CT and,
equals zero, otherwise.
Then, noticing the cardinal of the set of “ones” in the subdiagonal of
as
we can define the following useful performance index to measure the quality on segmentation:
(23)
If we keep the segmentation and make a stratified random sampling segment m times, with m sufficiently large, and calculate m times the index defined in (23),
, with
, we can observe the histogram representing the distribution of quality on segmentation. If the most of the times this measure results, for example, greater than the mean of the observations, we continue segmenting according to the criteria which has been used, otherwise it is desirable to modify the segmentation criteria.
Let’s illustrate this with an example. Suppose we segment people by age (>15, <20, >20 and <30, >30) and gender (M, F).
Given this segmentation, suppose we conduct the 15 hypothesis test for segment transversality and obtain the following segment adjacency matrix
A one in this matrix means that segment
with segment
behaves distinctly in the sense of transversality communication. Zeros in the upper triangle matrix doesn’t mean anything. Each segment compare to each self is homegenous (zero in the diagonals) except in segment 6. This could mean that further segmentation within it might improve audience segmentation. Nevertheless, segment 6 distinguish from other segments anyway.
If we sum the ones under the diagonal and divided by the number of segment combinations (see (23)), we calculate the performance of this segmentation which is 40%. The more the zeros the lower the performance index.
Following this framework we can improvement this segmentation by:
1) fusion homogenous segments
2) explore intra-segmentation in the cases were there was one in the diagonal
For actions in the first group, we look that in this toy example S1 and S2 doesn’t differ in communication behaviour, so we could consider to group them as one segment. We could summaries this saying that gender doesn’t segment among young people (under 20 years). We calculate again the matrix with this augmented segment and calculate the performance. Then occam’s razor lead us to select the least segmented partition when we have two or more with same performance.
For segment 6, this is female over thirty years, we calculate that it is inhomogenous with itself, so we could try a sub-segmentation by education degree or by motherhood. Then repeat the five tests between the new segment with the previous ones and calculate performance of this new matrix with one or more rows.
Of course this iterative method implies significant work with estimation, data recompilation, amount of data, independent sampling, etc. that although might be of extreme relevance it’s beyond the scope of pure mathematics and poses a great source of scientific challenge and interdisciplinary work.
4. Conclusions
In this work we analyze Facebook social network, modelling it with a Markov Chain and several random variables representing profiles friendships. We further propose communication behaviour between all profiles called complete transversality that assumes no bias between profiles willingness to connect as friends. This CT behaviour leads to estimations that allow us a hypothesis test by means of mean square deviation to reject the CT assumption. This might be an obvious conclusion (because people behaves within Facebook as they behave in real context), but it has all the hypothesis testing machinery behind which gives it strong rigorosity.
Next step in our work was to weaken CT and, for this, we introduce ST (segment Transversality). This is given a determined network segmentation, and each segment profile connects to any other segment profile with the same probability. (Of course this probability can change with different pair of segments.)
In this ST scenario we were able to compare between two entire segments and determine whether they behave in the same way or differently. If we don’t find differences we can safely group the two segments in order to improve the original segmentation. For a given segmentation we test intra segment and every pair of different segments and define a performance index that reflects a percentage of the segments that behaves differently from all the possible comparisons. With this result, we join similar segment (don’t reject null hypothesis) and recalculate tests and performance index to the new segmentation.
This iterative process of grouping homogenous segments or leaving different leads to a hierarchy in segmentations according to segmentation’s quality. This information could be interpreted and used to determine if segmentation’s rules are adequate to distinguish segments; of course the comparison falls in this communication behavior sense.
Appendix
1.1. Proof of Proposition 1
Let
,
and
. We will see that
is independent of
. For this, it is enough prove that
is independent from
, when
. In fact, if we look at
, the first coordinate of both involved indices
for
and
for
, is fixed and is the first coordinate of the random variables
and
of the mentioned indices respectively. Then,
is independent from
and
is independent from
, hence,
is independent from
, when
. Thus,
Applying inclusion-exclusion principle for the events
,
, we have
(24)
We have to check that the events
,
, are independent between them for a same profile f.
is independent from
, from
and
, because
, which corresponds to image index
involved in
is independent from
which corresponds to index
that appears in
,
and
.
is independent from
and from
, because if a profile
appears in the intersections of
, it doesn’t figure either in the intersections of
nor in the unions of
. The same argument allows us to affirm that
is independent from
.
Therefore, transition probability (24) is given by
We will see how
,
is finally written.
If
, with
,
,
,
is independent from
. Then, are independents:
from
,
from
,
from
and
from
. Thus,
and, applying set properties we have that
as if
then
and accordingly we have that
and
are independents.
Finally,
and, for fixed
,
,
,
,
,
is independent from
. So, are independents
from
and
from
. Besides,
and
are independents, as if
then
. Therefore,
1.2. Proof of Theorem 1
Suppose we are in a CT context. To prove time homogeneity in Markov chain
we will see that probabilities involved in
,
,
, don’t depend on time. To do this we introduce some useful notations.
Let F and G be the distribution function for random variables
and
related with image indices
and
respectively.
The CT hypothesis implies that F and G neither depend on time nor on profiles. This is,
and
, for all
, for all
and every ordered triple
, with F and G continuous functions.
We also denote with
and
the following cardinals numbers of sets:
and
Consequently, under CT hypothesis we have,
and
Then,
,
, and therefore
doesn’t depend on t. Consequently,
is homogeneous in time.
1.3. Proof of Theorem 2
Let
be the space states of
and suppose that we are in a CT context.
We denote
to the set of states with all ones on the diagonal.
Lemma 1. C is a closed, irreducible and aperiodic communication class.
Proof 1. Let
such that
. Then, for some
,
, that is, at time t, the friendship state
indicates that for some
, the profile
doesn’t exists on Facebook. As we have supposed that once a profile is created it can’t be deleted, then if
,
, that is, an state outside of C is not accessible from a state inside of C. Then, C is closed.
Also, given
, outside the diagonal, an entry equal to one can turn into in zero, or an entry with zero can turn into one in one step, that is, A and B are communicated and aperiodic, so C is irreducible and aperiodic.
As we are studying chain’s behaviour at steady state of the network, we suppose that all profiles had been created. Then, if
at time t, in a finite amount of steps the state
will be attracted by C. Therefore, on the long term, every states of S will be attracted in a finite amount of states by the closed, irreducible and aperiodic class C. As C is a finite set, all of its states are positive recurrent. Then, restricting the chain
to C, it is closed, irreducible, aperiodic and every state is positive recurrent, hence it is ergodic.
The ergodic property, ensures limit distribution existence. In this case, if
,
Moreover, friendship functions between profiles f and g of
,
, were defined as dicotomic variables taking zero or one according to whether they were Facebook friends or not at time t. Then, under the ergodic distribution, as time tends to infinity, random variable
, with
, has Bernoulli distribution.
Under CT hypothesis, each profile of
engages friendship with any other profile with the same probability, let’s say p, so that random variables
, with
,
are identically distributed.
Also, this variables keeps direct relation with “image index” between profile pairs
. Under CT this index is reduced to a constant plus a random variable. This variables are independent for all
, with
, so image index between distinct profile pairs are also independent. Consequently,
, with
,
are independent.