Application of Association Rule Mining Theory in Sina Weibo

A user profile contains information about a user. Substantial effort has been made to understand users' behavior by analyzing their profile data. Online social networks provide an enormous amount of such information for researchers. Sina Weibo, a Twitter-like microblogging platform, has achieved great success in China, although studies on it are still at an early stage. This paper explores the relationships among different profile attributes in Sina Weibo. We use association rule mining to identify the dependencies among the attributes, and we find that if a user's posts are welcomed, he or she is more likely to have a large number of followers. Our results demonstrate how the relationships among the profile attributes are affected by a user's verified type. We also examine data transformation and analyze the influence of the statistical properties of the data distribution on data discretization.


Introduction
Online social networks such as Facebook, Twitter and Google+ have become an integral part of people's daily lives. However much these sites differ from one another, user profiles are a key feature of all of them. A user profile may include, but is not limited to, gender, age, location, occupation and social contacts. The availability of this information may vary from one site to another. Although user profiles are less dynamic than other online behaviors, they still provide a clear signal of users' characteristics. Substantial effort has been made recently to obtain knowledge about users from their profile data. Lampe et al. [1] found that the profile completion percentage on Facebook has a positive relationship with the number of friends a user has. Mislove et al. [2] proposed an algorithm to infer the missing part of a user profile from other similar profiles. Quercia et al. [3] studied the relationship between the Big Five personality traits and user behaviors on Twitter. They introduced a novel method to predict personality based on the number of followers, followings and tweets a user has.
As Twitter is banned in China, Sina Weibo is considered a replacement for it. Sina Weibo has reached 56 million daily active users (who spend an average of one hour per day with the service) [4] and has had a significant influence on Chinese society. Unlike studies on its predecessors, studies on Sina Weibo are still at an early stage. There are a few studies on Sina Weibo with regard to user profiles. Guo et al. [5] found that the connections between users are mostly one-way and that the number of followers a user has changes very fast. Chen and She [6] carried out a similar study but compared verified users with unverified ones. They concluded that users whose real identity has been verified are more likely to have greater influence. Wang et al. [7] examined the correlation between the number of followers, followings and posts. They found that the number of followers grows rapidly as the number of followings increases from 10 to 3000. They also stated that an increase in posts can lead to more followers as long as the number of posts does not exceed 20,000.
Although considerable attention has been paid to Sina Weibo, associations among different profile attributes, such as the association between the number of reposts and the number of comments, have not yet been well examined. Because a large number of users on Sina Weibo have been verified according to their professional background, people on Sina Weibo are more likely to act responsibly and engage honestly with the community. It is worthwhile to explore users' characteristics on Sina Weibo, especially considering that they have different verified types (e.g. local authorities, news agency, and celebrity). Our research is based on a set of first-hand data collected from Sina Weibo, containing 1,192,972 users' profiles. The major contributions are summarized as follows:
• Continuous data (e.g. the number of followers) are replaced by meaningful labels (e.g. the grass roots and social star).
• The influence of the distribution of the data on data discretization is analyzed.
• Association rule mining is conducted with respect to users' verified types.
• A comparison between different types of users is made.
The rest of the paper is organized as follows. Section 2 presents the data model used in this paper and gives definitions such as the number of followings a user has. Section 3 explains the process of data collection and illustrates the social relationships among users in Sina Weibo. Section 4 discusses the methods for data discretization, taking the statistical properties of the data distribution into consideration. Section 5 introduces an Apriori-based method for association rule mining and explains how association rule mining is conducted on Sina Weibo. Empirical results are given in Section 6 and conclusions are drawn in the last section.

Data Model
The information in a user profile may include various attributes of a user such as geographical location, academic and professional background, interests, preferences, etc. The availability of such information may vary from one site to another. For a microblogging service such as Sina Weibo, the number of followers, followings and posts a user has are three indispensable parts of a user profile. Such information is always displayed in a prominent place. In addition, a verified type is added to a user profile, as users on Sina Weibo may choose to verify their identity based on their professional background. In this paper, a user profile is defined as follows:
profile(uid) = {username, province, gender, number of followers, number of followings, number of posts, number of reposts, number of comments, verified type, time since created}
Each user has a unique identification number (uid). The core attributes of a profile are defined as follows:
• NoA refers to the number of followers. NoA(uid) is the total size of the audience listening to the broadcasts of user uid. NoA is one of the major signs of a user's popularity.

Data Collection
Users' profiles are collected through the REST API provided by Sina Weibo. Bilateral relationships are used to expand the search for new users. Social relationships among users are defined as follows (see Figure 1).
Scenario 1 indicates that uid1 and uid2 have no connection between them. Scenario 2 shows that uid1 is a follower of uid2. Scenario 3 illustrates a bilateral friendship, where uid1 is a follower of uid2 and uid2 is also a follower of uid1. We assume that if two users follow each other, they are considered friends. Getting the friends of a friend is the strategy used in this paper to obtain users' ids from Sina Weibo. The REST API provides facilities to retrieve profile information according to a user's id. The implementation details are given below (see Table 1).
Unlike studies [4][5][6], where a user's followings are used to expand the search for new users, bilateral relationships are used in this study. Users who follow each other are likely to have a closer relationship. This method also keeps spammers out of the search for new users, because few users choose to subscribe to a spammer's microblog.
In total, 1,192,972 user profiles were retrieved, of which 39.58% belong to verified users. Red star and e-celebrity users account for 91.08% of verified users (see Figure 2).
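The friends-of-a-friend expansion described in this section can be sketched as follows. This is a minimal illustration rather than the crawler used in the study: the breadth-first order, the function names `crawl_bilateral` and `toy_bilateral`, the `max_users` limit and the in-memory follow graph are all assumptions, and no real Sina Weibo API call is shown.

```python
from collections import deque
from typing import Callable, Iterable, Set


def crawl_bilateral(seed_uid: int,
                    get_bilateral_friends: Callable[[int], Iterable[int]],
                    max_users: int) -> Set[int]:
    """Collect uids by repeatedly fetching the friends of already known friends."""
    seen = {seed_uid}
    queue = deque([seed_uid])
    while queue and len(seen) < max_users:
        uid = queue.popleft()
        for friend in get_bilateral_friends(uid):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
                if len(seen) >= max_users:
                    break
    return seen


# Hypothetical follow graph standing in for the API: uid -> uids that uid follows.
follows = {1: {2, 3}, 2: {1, 4}, 3: {5}, 4: {2}, 5: set()}


def toy_bilateral(uid: int) -> Set[int]:
    # Two users are friends only if each follows the other (Scenario 3).
    return {v for v in follows.get(uid, set()) if uid in follows.get(v, set())}


collected = crawl_bilateral(1, toy_bilateral, max_users=100)
```

In the toy graph, user 3 is followed by user 1 but does not follow back, so neither user 3 nor user 5 is reached. This mirrors the argument that one-way links, such as those a spammer would rely on, do not expand the search.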

Data Discretization
The data mining process involves a preprocessing step to ensure that the data have the quality and format required by the algorithm. Users are classified by their attributes. For example, according to NoA, users are classified into two groups: the grass roots and social star. Users in the latter group have many more followers than users in the former. Other continuous data are replaced by labels in a similar way (see Table 2).

The K-Means Method and the Pareto Principle
This paper experimented with two methods: the k-means clustering algorithm and the Pareto principle. The purpose of clustering is to search for similar examples and group them into clusters such that the distance between examples within a cluster is as small as possible and the distance between clusters is as large as possible [9]. Let $P = \{p_1, p_2, \ldots, p_n\}$ be a set of data points to be clustered and let $k$ be the number of clusters (here, $k = 2$). First, $k$ data points are randomly selected from $P$ as the initial centroids of the clusters. Then, the following steps are repeated until convergence: 1) Assign each data point to the cluster whose centroid is nearest to it. 2) Recompute the centroid of each cluster, where the centroid is the mean of the points in the cluster.
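The two steps above can be sketched for one-dimensional data such as follower counts. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name `kmeans_1d`, the `seed` and `max_iter` parameters and the sample values are hypothetical, and the initial centroids are drawn from the distinct values so that the two centroids never coincide.

```python
import random
from typing import List, Sequence, Tuple


def kmeans_1d(points: Sequence[float], k: int = 2, seed: int = 0,
              max_iter: int = 100) -> Tuple[List[float], List[int]]:
    """Cluster scalar values into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Randomly pick k initial centroids among the distinct data values.
    centroids = rng.sample(sorted(set(points)), k)
    labels: List[int] = []
    for _ in range(max_iter):
        # Step 1: assign each point to the cluster with the nearest centroid.
        labels = [min(range(k), key=lambda j: abs(p - centroids[j]))
                  for p in points]
        # Step 2: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(sum(members) / len(members)
                                 if members else centroids[j])
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical, heavily right-skewed follower counts.
followers = [1, 2, 3, 4, 5, 1000]
centroids, labels = kmeans_1d(followers, k=2)
cluster_sizes = sorted(labels.count(j) for j in range(2))
```

On this toy input the single extreme value ends up alone in its own cluster, which is the kind of uneven split that matters when the clusters are later used as items for association rule mining.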
The Pareto principle (also known as the 80-20 rule) [10] originally referred to the observation that 80% of Italy's wealth belonged to only 20% of the population. Here, we assume that, for example, a user who has more followers than 80% of the other users is classed as a social star. The quantile function used to calculate the cut points between the groups (e.g. the grass roots and social star) is defined as follows [11]:

$$Q(p) = \inf\{x \in \mathbb{R} : p \le F(x)\} \qquad (1)$$

Here, $X$ may refer to one of the variables in Table 2. The distribution function of $X$ is given by $F(x) = P(X \le x)$, which represents the probability that $X$ is less than or equal to $x$. With $p = 0.8$, Equation (1) determines the value below which 80% of the data lies, e.g. 80% of NoA is less than or equal to 1140 and 80% of NoR is less than or equal to 294.
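The 80% cut point can be computed from the empirical distribution of a sample. The sketch below assumes the cut is the smallest observed value whose empirical distribution function reaches $p$; the names `quantile_cut` and `label_noa` and the sample follower counts are hypothetical, and the label boundary follows the intervals reported for NoA, where values at or above the cut are labeled social star.

```python
import math
from typing import Sequence


def quantile_cut(values: Sequence[float], p: float = 0.8) -> float:
    """Smallest observed x whose empirical distribution F(x) is at least p."""
    ordered = sorted(values)
    # At least i + 1 of the n values are <= ordered[i], so the first index
    # with (i + 1) / n >= p is ceil(p * n) - 1.
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]


def label_noa(followers: float, cut: float) -> str:
    # Values at or above the cut point fall into the upper group.
    return "social star" if followers >= cut else "grass roots"


# Hypothetical follower counts for ten users.
sample = [3, 10, 25, 40, 80, 120, 300, 1140, 5000, 90000]
cut = quantile_cut(sample, p=0.8)
labels = [label_noa(v, cut) for v in sample]
```

Unlike the clustering approach, the size of the upper group here is fixed by $p$ rather than by how far the extreme values lie from the rest.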

Discretization Index
In this paper, a discretization index (di) is proposed to measure the quality of the discretization produced by the above methods. Let $X$ be a set of data points to be split, and suppose $X$ is partitioned into two groups $g_1$ and $g_2$. The index di is defined in terms of $\delta_i$, the proportion of $g_i$ in $X$, and $\mu_j$, the mean of the data points in $g_j$. The method with the smallest di is considered the best method based on the following criteria: 1) Minimize the distances within the clusters and maximize the distances between the clusters. 2) Split as equally as possible. Both criteria are needed because using the first criterion alone (i.e., clusters that are coherent internally but clearly different from each other) to split the data may cause an extremely uneven partition (see Section 4.3). As association rules are generated from frequent itemsets (see Section 5), data in the minority, for example 200 social star users among 1,192,972 users, are very likely to be overlooked. Section 5 explains further why partitioning as evenly as possible is important to this study. We propose di to strike a balance between the two criteria.

Comparison between the Methods
We found that the choice of discretization method depends on the statistical properties of the data distribution. The spread of the data (i.e. standard deviation) and the symmetry of the data (i.e. skewness) may have a significant influence on the performance of the discretization. A higher standard deviation implies a greater spread of the data. Positive values for the skewness indicate that the distribution is skewed to the right, and a higher skewness implies a longer tail on the right side. A normal distribution has a skewness of 0. We found that the k-means method is very good at creating clusters that are coherent internally but different from each other. However, the k-means method tends to partition data in an extremely uneven way when the distribution is skewed (see Table 2). On the other hand, partitioning based on the Pareto principle produces a lower di in most cases (see Table 2). Data are partitioned in an 80-20 way without impairing the internal coherence and the external difference of the clusters.
We use examples to illustrate how the statistical properties of the data distribution can influence data discretization. As shown in Figure 3, the distribution of NoB, with a skewness of 2, is much closer to a normal distribution, and it also has the lowest standard deviation among the variables. In this case, the k-means method produces a lower di than the Pareto principle. In comparison, data points in NoA are spread out over an extremely large range of values, from 1 to 63,717,128. A skewness of 116.64 indicates that the distribution of NoA has a very long tail on the right side (see Figure 3). As a consequence, the majority of data points in NoA fall within a very small range of values and very few data points fall within an extremely wide range of values. In fact, 80% of the data points in NoA fall within the interval [1, 1140) and the rest fall within the interval [1140, 63,717,128]. In this case, the k-means method tends to group almost all data points into one cluster and put the rest into another. Indeed, only 0.01% of users were classed as social star by the k-means method (see Table 2). Partitioning based on the Pareto principle is applied in this study because it makes a trade-off between the two criteria.
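The two statistics discussed in this section can be computed as in the sketch below. The paper does not state which skewness estimator was used; the code assumes the moment-based coefficient $m_3 / m_2^{3/2}$ and the population standard deviation, and the function name and sample values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def spread_and_skewness(values: Sequence[float]) -> Tuple[float, float]:
    """Return (population standard deviation, moment-based skewness)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return math.sqrt(m2), m3 / m2 ** 1.5


# A symmetric sample and a sample with a long right tail (both made up).
std_sym, skew_sym = spread_and_skewness([1, 2, 3, 4, 5])
std_tail, skew_tail = spread_and_skewness([1, 1, 1, 1, 100])
```

The symmetric sample has a skewness of zero, while the single large value in the second sample produces both a larger standard deviation and a positive skewness, the same qualitative pattern described for NoA.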

Association Rule Mining
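Association rule mining can be conceptualized as follows [9]. Let $f = \{I_1, I_2, \ldots, I_n\}$ be the set of all items. Let $DB$ be a set of database transactions where each transaction $T$ is a set of items such that $T \subseteq f$. Let $A$ be a set of items. A transaction $T$ is said to contain $A$ if and only if $A \subseteq T$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subset f$, $B \subset f$ and $A \cap B = \emptyset$. The support $s$, confidence $c$ and lift $l$ of the rule $A \Rightarrow B$ are defined as:

$$s = P(A \cup B), \qquad c = P(B \mid A), \qquad l = \frac{P(A \cup B)}{P(A)\,P(B)}$$

where $P(A \cup B)$ is the probability that a transaction contains every item in both $A$ and $B$. A set of items is referred to as an itemset. An itemset that contains $k$ items is a $k$-itemset. The support count of an itemset is the number of transactions containing the itemset. The minimum support count is defined as $\mathit{min}_s \cdot |DB|$. An itemset is frequent if its support count is not less than the minimum support count.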
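As a worked illustration of the three measures, the sketch below estimates the probabilities by relative frequencies over a small set of transactions. The function name `rule_metrics` and the five toy transactions are hypothetical and are not taken from the collected data.

```python
from typing import FrozenSet, List, Tuple


def rule_metrics(db: List[FrozenSet[str]], a: FrozenSet[str],
                 b: FrozenSet[str]) -> Tuple[float, float, float]:
    """Support, confidence and lift of the rule A => B over transactions db.

    Assumes A and B each occur in at least one transaction.
    """
    n = len(db)
    count_a = sum(1 for t in db if a <= t)         # transactions containing A
    count_b = sum(1 for t in db if b <= t)         # transactions containing B
    count_ab = sum(1 for t in db if (a | b) <= t)  # transactions containing both
    support = count_ab / n
    confidence = count_ab / count_a
    lift = confidence / (count_b / n)
    return support, confidence, lift


# Toy transactions built from labels of the kind used in this paper.
db = [
    frozenset({"topic initiator", "propagator", "social star"}),
    frozenset({"topic initiator", "propagator", "grass roots"}),
    frozenset({"grass roots"}),
    frozenset({"grass roots", "propagator"}),
    frozenset({"grass roots"}),
]
s, c, l = rule_metrics(db, frozenset({"topic initiator"}),
                       frozenset({"propagator"}))
```

Here the antecedent occurs in 2 of 5 transactions, always together with the consequent, which itself occurs in 3 of 5, so the rule has support 0.4, confidence 1 and lift 5/3.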

Apriori Algorithm
Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses the Apriori property, i.e., all nonempty subsets of a frequent itemset must also be frequent. Let $L_i$ be the set of frequent $i$-itemsets. Given $L_{k-1}$, the Apriori algorithm finds $L_k$ using join and prune actions as follows: 1) Join: a set of candidate $k$-itemsets, denoted $l_k$, is generated by joining $L_{k-1}$ with itself. 2) Prune: $l_k$ can be huge. To reduce the size of $l_k$, the Apriori property is used as follows. If any $(k-1)$-subset of a candidate $k$-itemset is not in $L_{k-1}$, the candidate cannot be frequent either and so can be removed from $l_k$. The set of remaining candidates in $l_k$ is a superset of $L_k$, that is, its elements may or may not be frequent, but all of the frequent $k$-itemsets are included in $l_k$. A scan of the database to determine the count of each candidate in $l_k$ then yields $L_k$, i.e., all candidates having a count no less than the minimum support count are frequent and therefore belong to $L_k$. With the Apriori algorithm, all frequent itemsets along with their support counts can be found efficiently.
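The join, prune and counting steps can be sketched in a few lines. This is a compact illustration rather than an optimized or authoritative implementation: the function name `apriori`, the `min_count` parameter and the toy transactions are hypothetical, and two $(k-1)$-itemsets are treated as joinable when their union has exactly $k$ items, i.e. when they share $k-2$ items.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set

Itemset = FrozenSet[str]


def apriori(db: List[Set[str]], min_count: int) -> Dict[Itemset, int]:
    """Return every frequent itemset together with its support count."""

    def count(candidates: Iterable[Itemset]) -> Dict[Itemset, int]:
        # One scan of the database per level.
        return {c: sum(1 for t in db if c <= t) for c in candidates}

    singletons = {frozenset([item]) for t in db for item in t}
    level = {c: n for c, n in count(singletons).items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: combine (k-1)-itemsets that share k-2 items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


# Toy transactions over three generic items.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
                {"A", "B", "C"}]
frequent = apriori(transactions, min_count=3)
```

With a minimum support count of 3, every single item and every pair is frequent, while the triple, which occurs in only two transactions, survives pruning but is rejected by the database scan.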

Experimental Design
Suppose dataset U contains all the data collected from Sina Weibo. Association rules are mined from both U and its subsets.
Given the property of association rule mining described above, rare types of users are very likely to be pruned due to their relatively low support counts. Splitting U into disjoint subsets based on VT and mining association rules from them separately is therefore necessary to avoid overlooking interesting patterns that are hidden in the rare types of users. The dataset U is divided into two subsets: verified_accounts and unverified_accounts. The association rules from verified_accounts and unverified_accounts are compared to identify the differences between verified and unverified users. If necessary, the dataset verified_accounts can be further divided according to VT. In this paper, a comparison between red star and e-celebrity is conducted for two reasons: 1) red star and e-celebrity together account for 91.08% of verified users; 2) red star users are members of the general public, in contrast to e-celebrity users, who are public figures and professionals well known in local communities.
Considering that users who have a large number of followers (followings, posts, reposts, and comments) account for only a very small part of the data, we have to use a relatively small $\mathit{min}_s$ to ensure that the rules for social star (scout, blog zealot, propagator, and topic initiator) can be elicited. Association rules are sorted by lift values. For an association rule of the form $A \Rightarrow B\ [s, c, l]$, a lift equal to 1 means that the occurrence of $A$ is independent of the occurrence of $B$. A lift greater than 1 indicates that the occurrence of $A$ has a positive effect on the occurrence of $B$. We are interested in the profile attributes which are dependent on each other.
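The design above, splitting the users by verification status and then ranking rules by lift, can be sketched as follows. The record layout (a `verified` flag per user), the `Rule` tuple and all numeric values are invented for illustration only and are not results from this study.

```python
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    antecedent: str
    consequent: str
    support: float
    confidence: float
    lift: float


def split_by_verified(users: List[Dict]) -> Dict[str, List[Dict]]:
    """Partition user records into two disjoint subsets."""
    subsets: Dict[str, List[Dict]] = {"verified_accounts": [],
                                      "unverified_accounts": []}
    for user in users:
        key = "verified_accounts" if user["verified"] else "unverified_accounts"
        subsets[key].append(user)
    return subsets


def rank_dependent_rules(rules: List[Rule]) -> List[Rule]:
    """Keep rules with lift above 1 and sort them by decreasing lift."""
    dependent = [r for r in rules if r.lift > 1.0]
    return sorted(dependent, key=lambda r: r.lift, reverse=True)


# Made-up records and rule statistics, for illustration only.
users = [{"uid": 1, "verified": True},
         {"uid": 2, "verified": False},
         {"uid": 3, "verified": False}]
rules = [Rule("propagator", "social star", 0.01, 0.5, 2.5),
         Rule("blog zealot", "social star", 0.05, 0.2, 1.0),
         Rule("topic initiator", "propagator", 0.02, 0.6, 3.0)]
subsets = split_by_verified(users)
ranked = rank_dependent_rules(rules)
```

Rules would then be mined separately within each subset, so that a pattern specific to a rare user type is measured against that subset's own size rather than against all of U.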

Empirical Results
We found that both NoR and NoC play important roles in a user's popularity (see Figure 4). If a user's posts are welcomed, meaning either that the posts are forwarded many times or that many people comment on them, the owner of the posts is much more likely to be tagged as a social star. Another finding was that NoC is positively correlated with NoR. Three of the four rules for "NoR = Propagator" were attributed to "NoC = Topic initiator" (i.e. Rules #3, 5, and 6). Also, an e-celebrity user is always accompanied by a large number of followers (social star). Thus, social star is a good indicator of e-celebrity. We found that a 5-year social star user is an e-celebrity with a confidence of 54.16%.
A comparison between association rules derived from unverified_accounts and verified_accounts was made (see Figures 5 and 6).
A positive correlation between NoC and NoR exists in both of them; however, rules in unverified_accounts have higher lift than those in verified_accounts. In other words, NoC and NoR are more dependent on one another in unverified_accounts. Based on these findings, we could state that, for a verified user, an increase in NoC may not enhance the probability of an increase in NoR. On the other hand, for an ordinary user who has not yet been verified, saying something controversial to receive more comments (NoC) is a good way to increase the rate of diffusion of his or her posts (NoR). In fact, this has already happened in many online social networks, where people initiate controversial topics in order to become famous [12].
Although red_star is one of the disjoint subsets of verified_accounts, the dependence between NoC and NoR, in terms of lift, is much stronger in red_star than in verified_accounts itself (see Figure 7). On the other hand, rules derived from e-celebrity are less interesting in terms of lift (see Figure 8).
We found that if an e-celebrity user's posts are welcomed, then he or she is a blog zealot with a confidence greater than 65%. In fact, this holds for many user types. Unlike red star users, users such as corporations, media, and application software have a strong motive for promoting themselves or something else. As a consequence, they are likely to send as many messages as possible. At the same time, due to their high reputation, other users prefer to forward their posts or have discussions with them. Having welcomed posts is therefore independent of having a large number of posts. For this reason, lift values in Figure 8 are very close to 1.

Conclusion
In this study, we explored the relationships among different profile attributes using association rule mining. We found that a user is more likely to have a large number of followers (NoA) if his or her posts are forwarded many times (NoR) or if many people get involved in the discussions he or she initiates (NoC). Our results indicate that NoR and NoC are strongly dependent on each other for ordinary users (both unverified users and red star users). Profile attributes for verified users are relatively independent of one another. We also examined both the k-means method and the Pareto principle as methods for data discretization. We found that the statistical properties of the data distribution can have a significant influence on data discretization. Because the data used in this study are heavily skewed, we suggest using the Pareto principle to partition the data.

Figure 2. The proportion of users.

Given $L_{k-1}$, the Apriori algorithm finds $L_k$ using join and prune actions as follows: 1) Join: To find $L_k$, a set of candidate $k$-itemsets, denoted $l_k$, is generated by joining $L_{k-1}$ with itself. Any two $(k-1)$-itemsets $A$ and $B$ are joinable if they contain $(k-2)$ common items.

Table 2. Data discretization.
a Calculation was based on the partitions generated from the k-means method; b Calculation was based on the partitions generated from the Pareto principle.