Network Hot Topic Discovery of Fuzzy Clustering Based on Improved Firefly Algorithm

The existing fuzzy c-means clustering algorithm (FCM) is sensitive to the initial center points, and simple distance-based clustering can neither discover hot topics on the network accurately nor cope with semantic diversity in Chinese. To address these problems, an improved fuzzy clustering method based on a dynamic adaptive step firefly algorithm (FA) is proposed. The cluster centers are optimized by the improved FA, and FCM is used to complete the final clustering. First, the step length is adjusted adaptively at each iteration, and the relationship between fireflies is established according to text similarity; then the topic influence value is applied to the fuzzy clustering algorithm to improve the optimization of the fitness function. In this process each topic is assigned to the class whose cluster center is closest, which reduces the impact of topic variation. Finally, hot topics are obtained by ranking the influence values. Experiments on real data collected from Sina micro-blog verify the effectiveness of the algorithm and show that the accuracy of topic discovery is greatly improved.


Introduction
The rapid development of social media has brought great challenges to the research of complex networks. Users express views according to their own moods, and these views form various themes through forwarding, commenting, and so on; as the themes develop, they generate social trends. Such trends can reveal relevant events that are happening at the moment [1]. If we can detect them in a timely and accurate manner, we can prepare appropriate countermeasures. Hot topics are representative of the public's recent concerns and have a deep influence on the development of society, while untrue or negative information has triggered a series of network public opinion problems. Therefore, research on hot topic discovery has received extensive attention from scholars at home and abroad [2].
Research on network hot topic discovery from text mainly follows two directions: feature extraction and clustering algorithms. Traditional clustering methods for topic discovery include hierarchical clustering, the CURE algorithm, single-pass, DBSCAN, etc. [3]. All of these are hard clustering algorithms.
These algorithms are limited by the fast change of network language and by the variants arising from Chinese semantic diversity, so users cannot discover hot topics on the network in a timely and accurate manner. Feng Liguang et al. [4] used a parallel FCM algorithm to discover hot topics; however, the fuzzy clustering algorithm shares the disadvantage of ordinary mean-based clustering algorithms because it is a local search algorithm. For such an algorithm, if the initial value is improperly selected, it easily converges to a local minimum.
Swarm intelligence algorithms, by contrast, have strong global parallel search ability. Fiho et al. [5] combined an improved particle swarm algorithm with fuzzy clustering in two hybrid methods (FCM-IDPSO and FCM2-IDPSO), alleviating FCM's tendency to become trapped in local optima and its sensitivity to the initial cluster centers; Wu et al. [6] combined simulated annealing with particle swarm optimization and proposed an enhanced adaptive weight fuzzy clustering; Jiang et al. [7] proposed a data set classification algorithm combining a genetic algorithm and FCM; Yang Fei et al. [8] combined a genetic algorithm with clustering and applied it to topic discovery to improve its efficiency. Babak et al. [9] used a multi-objective enhanced firefly algorithm to discover associations in complex networks, in which the multiple objectives and adaptive probability mutation increase the accuracy of discovery.
In summary, hot topic discovery methods based on word frequency have difficulty handling derived variant named entities and new words such as heteromorphic and polysemous words, and topic discovery methods based on heat have difficulty finding hidden topics with low heat but strong burstiness. To address the shortcomings of the FCM algorithm in hot topic discovery, the dynamic adaptive step firefly algorithm is used to optimize the FCM algorithm. Combining text influence with the FCM algorithm for hot topic discovery makes it possible to identify hot topics with low influence but strong burstiness. It also solves some problems caused by topic variation, thereby improving the accuracy of hot topic discovery.

Hot Topic Discovery Process
The hot topic discovery algorithm includes the following steps. Data collection mainly uses crawler technology to obtain information. Preprocessing performs the initial operations on the collected text, such as word segmentation and removal of stop words. A vector space model is then established after TF-IDF feature extraction, which makes it easier to compare similarity. Clustering is the most important part of this paper. The flowchart of the topic discovery algorithm is shown in Figure 1.
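As an illustration of the vector space step, the following is a minimal sketch of TF-IDF weighting, assuming the texts have already been segmented into tokens and stripped of stop words. The function name `tfidf_vectors` and the particular TF and IDF variants (relative term frequency, unsmoothed logarithmic IDF) are assumptions made for this example; the paper does not specify which variant it uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TF-IDF vectors for already-segmented texts.

    docs: list of token lists (segmentation and stop-word removal
    are assumed to have been done beforehand).
    Returns (vocabulary, list of weight vectors aligned with it).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of texts that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty texts
        # Assumed variant: relative term frequency times log(N / df).
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors
```

A term that occurs in every text receives weight 0 under this variant, so only terms that distinguish texts contribute to the later similarity comparison.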

Firefly Algorithm
The firefly algorithm was proposed by the Cambridge scholar Xin-She Yang, based on the glowing behavior of fireflies in nature [10], and it has good prospects for optimization in continuous space [11]. This article assumes that there are n fireflies, one for each of the n texts, and that fireflies i and j move according to their mutual attraction. A brighter firefly attracts less bright fireflies toward it, which completes the position optimization. The algorithm is described by the relative fluorescence brightness formula, the mutual attraction formula, and the position update formula, in which $\alpha$ represents the step size, $\beta_0$ the maximum attraction, $I_0$ the brightness of the brightest firefly, and $r_{ij}$ the distance between fireflies i and j.
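The three formulas can be illustrated with the forms commonly used in the firefly algorithm literature. The sketch below is an assumption in that sense: it takes brightness as $I = I_0 e^{-\gamma r_{ij}}$, attraction as $\beta = \beta_0 e^{-\gamma r_{ij}^2}$, and the position update as $x_i \leftarrow x_i + \beta (x_j - x_i) + \alpha(\mathrm{rand} - 1/2)$, with $\gamma$ the light absorption coefficient. The function names are hypothetical, and the exact exponents used in this paper may differ.

```python
import math
import random


def distance(xi, xj):
    """Euclidean distance r_ij between two firefly positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))


def relative_brightness(i0, gamma, r):
    """Assumed form: I = I0 * exp(-gamma * r)."""
    return i0 * math.exp(-gamma * r)


def attractiveness(beta0, gamma, r):
    """Assumed form: beta = beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * r * r)


def move_towards(xi, xj, alpha, beta0, gamma, rng=random):
    """Move firefly i toward the brighter firefly j.

    Each coordinate is pulled toward xj by the attraction beta and
    perturbed by a random term of size alpha centered on zero.
    """
    beta = attractiveness(beta0, gamma, distance(xi, xj))
    return [
        a + beta * (b - a) + alpha * (rng.random() - 0.5)
        for a, b in zip(xi, xj)
    ]
```

With `alpha = 0` and `gamma = 0` the move is deterministic and firefly i lands exactly on firefly j, which makes the roles of the two terms easy to check in isolation.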

FCM Algorithm
The FCM algorithm is a popular fuzzy clustering algorithm proposed by Dunn in 1973 [12]. Assuming that n is the number of elements in the data set, we divide it into c classes, that is, there are c class centers. The objective function to be minimized is defined as follows:

$$J = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \lVert x_i - c_j \rVert^{2} \quad (4)$$

where $\lVert x_i - c_j \rVert$ is the Euclidean distance from sample $x_i$ to cluster center $c_j$, m is the fuzzy weighting exponent, and $u_{ij}$ is the degree of membership of sample i in cluster j; the larger the value of $u_{ij}$, the higher the probability that the sample belongs to this class. The objective function is minimized subject to the constraint $\sum_{j=1}^{c} u_{ij} = 1$. According to Dunn [12], using the Lagrange multiplier method, the membership degree and class center formulas are as follows:

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)}} \quad (5)$$

$$c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} x_i}{\sum_{i=1}^{n} u_{ij}^{m}} \quad (6)$$

Figure 1. Topic discovery process.
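The alternating membership and class center updates of FCM can be sketched as follows. This is a minimal illustration of the standard update rules, not the authors' implementation; the function names and the handling of a sample that coincides with a center are assumptions of the example.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if any(dk == 0.0 for dk in d):
            # Assumed convention: a sample lying exactly on one or more
            # centers shares its full membership among those centers.
            hits = [1.0 if dk == 0.0 else 0.0 for dk in d]
            total = sum(hits)
            u.append([h / total for h in hits])
            continue
        u.append([1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d])
    return u


def fcm_centers(points, u, m=2.0):
    """Center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        total = sum(w)
        centers.append(
            [sum(wi * x[k] for wi, x in zip(w, points)) / total
             for k in range(dim)]
        )
    return centers
```

Iterating the two functions until the centers stop changing gives the usual FCM loop. Because each update only improves the objective locally, the result depends on the starting centers, which is the sensitivity the rest of the paper sets out to reduce.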

Improved FCM Network Hot Topic Discovery Algorithm Based on Firefly Algorithm
The traditional FCM algorithm is sensitive to noise and outliers, so it easily falls into a local minimum, and the selection of the initial centers strongly influences the final clustering, which makes the hot topic discovery results less than ideal. The firefly algorithm does not rely on the initial cluster centers and can overcome this shortcoming of FCM. However, FA has its own limitation: it also easily falls into a local optimum, so its solution accuracy is not high. We therefore propose the DASFA-FCM algorithm.

Optimized Firefly Algorithm
As the number of iterations increases, the firefly swarm in the standard FA gathers near the optimal value, so the distance between the optimal value and the other fireflies becomes small. When a firefly approaches the optimal value, its movement distance is likely to be greater than its distance from the optimal value, so it skips over the optimal value when updating its position, which lowers the rate at which the optimal solution is found.
The optimal solution and the step size largely determine the convergence performance of the algorithm. In the standard FA, the step value is fixed, and all fireflies use the same fixed step size throughout the iterations, which easily leads the algorithm into local optima and premature convergence [13]. To mitigate this, following the variable step size firefly algorithm (VSSFA) of Yu et al. [14], we propose an improved position update method: the dynamic adaptive step firefly algorithm (DASFA). It uses a dynamic step size instead of the fixed step; the step size changes automatically according to the current iteration number. In the early iterations, each firefly has a larger step size and a larger search space, which ensures a global search. As the number of iterations increases, the step value decreases gradually, and each firefly searches within its own neighborhood until it finds the most suitable solution. In the later iterations, fireflies do not need to move far, which prevents them from stepping over the optimal solution. At the beginning, therefore, DASFA has better global search capability and quickly moves close to the global optimum. When the optimal solution is found, the iteration and the change of step length are stopped to prevent falling into a local optimum. The improved adaptive step size, Formula (7), is a nonlinear function of the iteration number in which $\alpha(0)$ is the maximum step size, which is also the initial step size at t = 0, t is the current iteration number, and $T_{max}$ is the maximum number of iterations.
Journal of Computer and Communications
In the improved position update formula, Formula (8), $X_i$ and $X_j$ represent the spatial positions of fireflies i and j, and rand is a random factor uniformly distributed on [0, 1].
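To make the idea of a decreasing step concrete, the sketch below uses an exponential decay as a stand-in schedule. This is purely an assumption for illustration: the exact nonlinear formula of DASFA is not reproduced in this text, and the `decay` parameter, the function names, and the zero-centered random term (borrowed from the standard FA update) are all hypothetical. The only properties taken from the description are that the step equals $\alpha(0)$ at t = 0 and shrinks as t approaches $T_{max}$.

```python
import math
import random


def adaptive_step(alpha0, t, t_max, decay=3.0):
    """Hypothetical nonlinear schedule: alpha(t) = alpha(0) * exp(-decay * t / T_max).

    alpha0 is the maximum (initial) step size, t the current iteration,
    and t_max the maximum number of iterations. NOT the paper's formula.
    """
    return alpha0 * math.exp(-decay * t / t_max)


def dasfa_move(xi, xj, beta, alpha_t, rng=random):
    """Move firefly i toward firefly j using the current step alpha_t.

    rng.random() is uniform on [0, 1); subtracting 0.5 centers the
    perturbation on zero (an assumption taken from the standard FA).
    """
    return [
        a + beta * (b - a) + alpha_t * (rng.random() - 0.5)
        for a, b in zip(xi, xj)
    ]
```

Early iterations therefore perturb positions by up to half of $\alpha(0)$ in each coordinate, while late iterations move fireflies almost entirely by attraction, which is the behavior the text describes for avoiding overshoot near the optimum.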

Fitness Function
The spread of a topic depends on how the numbers of publishers, forwarders, reviewers, and readers change over time. The more attention a topic receives, the higher its influence, and the more likely it is to become a hot topic. Thus, following Qiu Jiangnan et al. [15], we use the number of forwards, the number of comments, and the number of praises as the influence factors of a topic. The influence $A(X_i)$ of each text $X_i$, i = 1, 2, ..., n, is given by Formula (9), a logarithmic function of these weighted counts, where $f(X_i)$, $z(X_i)$, and $c(X_i)$ indicate the numbers of forwards, comments, and praises of the i-th text, respectively, and s, p, and l are weighting factors whose sum is 1. Because users are more likely to praise content than to comment on it as a topic propagates, l is given a higher weight.
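A minimal sketch of such an influence score is given below. The form `log(1 + s*f + p*z + l*c)` and the default weights are assumptions chosen only to satisfy the stated constraints (logarithmic, weights summing to 1, largest weight on praises); they are not the paper's Formula (9), and the function name `influence` is hypothetical.

```python
import math


def influence(forwards, comments, praises, s=0.3, p=0.3, l=0.4):
    """Hypothetical influence score of one text.

    s, p, l weight forwards, comments, and praises; they must sum to 1,
    and l is the largest because praising is the most frequent action.
    log1p keeps the score at 0 for a text with no interactions.
    """
    if abs(s + p + l - 1.0) > 1e-9:
        raise ValueError("weights s, p, l must sum to 1")
    return math.log1p(s * forwards + p * comments + l * praises)
```

The logarithm compresses large counts, so a text with ten times the interactions of another does not receive ten times the influence, which keeps a few viral posts from dominating the brightness comparison.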
Generally, a hot topic passes through a germination period, an outbreak period, a stationary period, a turning period, and a decline period. During the germination period, attention increases continuously as the influence of the topic grows. In the outbreak period, attention reaches its maximum; during the stationary period, attention changes little, and after that it continues to diminish. The fluorescence brightness formula shows that the brightness of a firefly decreases as the distance increases and as light is absorbed by the propagation medium, which is similar to the way a topic changes over time during propagation. On this basis, the influence of a hot topic corresponds to the brightness of a firefly. A topic with high influence can be regarded as the position of the brightest firefly. By comparing its influence with that of surrounding topics, a firefly updates its position iteratively and finally finds the optimal solution. The relative brightness of a firefly is related to the objective function value: $I(X_i) \propto F(X_i)$. The smaller the value of the objective function, the better the spatial location and the greater the topic influence, so the topic is more likely to become a cluster center. In this way, even if topics are far apart, a topic can still be discovered if it has high influence, which reduces the number of outliers in space. Conversely, if a topic is close to the optimal position but has low influence, it is not representative and can be ignored or left out of the optimal center position. The fitness function is updated to reflect this.

Similarity Calculation
Clustering requires low similarity between classes and high similarity within each class [16]. Each text is represented as a vector, and the similarity between text i and text j, $sim(i, j) \in [0, 1]$, is computed from $w_k(i)$ and $w_k(j)$, the weights of the k-th feature word of text i and text j, respectively.
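One similarity measure that fits this description is cosine similarity, the usual choice for TF-IDF vectors; with non-negative weights its value lies in [0, 1]. The sketch below assumes that measure; the function name is hypothetical, and the paper's own formula may differ in detail.

```python
import math


def cosine_similarity(wi, wj):
    """Cosine similarity between two feature-weight vectors.

    wi[k] and wj[k] play the roles of w_k(i) and w_k(j). For
    non-negative weights the result lies in [0, 1].
    """
    dot = sum(a * b for a, b in zip(wi, wj))
    norm_i = math.sqrt(sum(a * a for a in wi))
    norm_j = math.sqrt(sum(b * b for b in wj))
    if norm_i == 0.0 or norm_j == 0.0:
        # A text with no weighted terms is treated as similar to nothing.
        return 0.0
    return dot / (norm_i * norm_j)
```

Because the measure depends only on direction, a short post and a long post that use the same feature words in the same proportions are judged fully similar, which suits the varying length of micro-blog texts.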

Fuzzy Clustering Based on Improved Firefly Algorithm
This paper combines DASFA with FCM by mapping each sample point in the space to a firefly. The text influence value corresponds to the firefly brightness, and the similarity corresponds to the membership value and the maximum attractiveness. If the similarity reaches the threshold, the two fireflies attract each other and there is a membership relationship between the two texts; otherwise there is no link between the texts. Similar topics are found according to text similarity, and clustering is achieved through the attraction between fireflies. In this process, the search area is adjusted adaptively through the step length; by comparing fitness values, we find the initial cluster centers and then use FCM for the final clustering, in which each topic is assigned to the closest class according to the distance from the text to the cluster centers. The hot topics are obtained once the termination condition is reached. The specific steps are as follows.
Input: preprocessed micro-blog texts. Output: hot topics after clustering.
DASFA-FCM proceeds as follows:
① Initialize the parameters $\gamma$, $T_{max}$, and m, and generate the initial population, where n is the number of micro-blog texts and k is the number of initial cluster centers; initialize the position of each firefly.
② Calculate the influence value $A(X_i)$ of each firefly according to Formula (9).
③ Calculate the similarity between texts (comparing each micro-blog text with the class centers). When $sim(i, j) < \varepsilon$, $\beta_0$ and $u_{ij}$ are both 0; when $sim(i, j) \geq \varepsilon$, both are 1. The mutual attraction between fireflies is then calculated according to Formula (2).
④ Calculate the dynamic adaptive step length for the current iteration according to Formula (7).
⑤ Calculate the fitness values $F(X_i)$ and $F(X_j)$. If $F(X_i) < F(X_j)$, firefly i has greater influence and is in a better position than firefly j, so firefly j moves toward i; update each firefly's position according to Formula (8).
⑥ Repeat steps ③ to ⑤ until the maximum number of iterations is reached. The most influential fireflies give the cluster centers; the number of cluster centers is C.
⑦ Starting from the initial class centers found above, calculate the cluster centers and the membership matrix.
⑧ Calculate the distance from micro-blog text i to cluster c, and assign each topic to the nearest cluster center.
⑨ Repeat steps ⑦ and ⑧. If the termination condition is reached, output the location and influence of the most influential firefly together with the clustering result; otherwise continue.
⑩ Rank the topics by influence value and output the top 50% as hot topics.
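Two of the steps above are simple enough to sketch directly: the similarity gate of step ③ and the final ranking of step ⑩. The function names and the rounding rule for an odd number of topics are assumptions of this example, not details given in the paper.

```python
def attraction_gate(sim, eps):
    """Step 3: beta_0 and u_ij are set to 1 when sim(i, j) >= eps, else 0."""
    return 1.0 if sim >= eps else 0.0


def hot_topics(topics, influences, fraction=0.5):
    """Step 10: rank topics by influence and keep the top fraction.

    topics and influences are parallel lists. The count is rounded down,
    but at least one topic is returned when any exist (an assumption).
    """
    ranked = sorted(zip(topics, influences), key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [name for name, _ in ranked[:keep]]
```

For example, with four topics whose influence values are 1.0, 4.0, 3.0, and 2.0, `hot_topics` returns the second and third topics, in that order.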
The algorithm flowchart is shown in Figure 2.
One study shows that topics published close together in time have relatively higher similarity, but because of Chinese semantic diversity and the limits of frequency-based topic discovery, ordinary similarity comparison cannot accurately identify variant topics. Therefore, we assign each topic to the class whose cluster center is closest in spatial distance, so that a variant topic that similarity comparison cannot classify accurately can still be placed in the correct cluster. This solves the problems caused by a small number of variant topics and improves the accuracy of topic discovery. In addition, in line with the rule of topic evolution, topics with larger influence values are taken to represent the hot topics, so the top-ranked topics are output.

Data Preprocessing
This experiment uses a web crawler to collect real experimental data. From the Sina Weibo website, we obtained 8126 micro-blog posts published from December 17 to 28, 2017, and randomly selected 6 topics comprising a total of 4967 posts. We labeled this experimental data set M. It includes: Topic 1, Xijia (978 posts); Topic 2, 396 Mathematics (349 posts); Topic 3, Discovery of the second solar system (152 posts); Topic 4, The Imperial Palace Response (263 posts); Topic 5, Jiangge Incident (1900 posts); Topic 6, Liu Yifei and Huang

Experimental Results and Analysis
1) Initial value determination
The initial step value has a great impact on the clustering results. Figure 3 compares the corresponding values of P, R, and F as $\alpha(0)$ varies from 0 to 1 at intervals of 0.1. As can be seen from the figure, the performance reaches a maximum when $\alpha(0)$ is around 0.8. To determine the value more accurately, we selected two further points near 0.8, namely 0.75 and 0.85, and conducted the experiment again. Finally, the performance values of the two experiments were combined. The results show that 0.8 is a turning point: when $\alpha(0)$ is 0.8, the performance of the algorithm reaches its maximum. Therefore, the initial step is set to 0.8.

Figure 3. Indicator values (P, R, F) for different initial step values.

The number of initial class centers also has a certain influence on the final experimental results. The number of cluster centers is varied, and the final number of class centers is obtained by comparing FCM with the method of this article. The results are shown in Figure 4.

Figure 4. F values for different numbers of cluster centers.

It can be seen from the figure that when the number of cluster centers is close to the number of topics, the F value of the algorithm is larger. If the number of clusters is smaller than the number of topics, the clustering effect is very poor, and as the number of clusters increases, the effect shows a gradual upward trend. However, once the number of clusters exceeds the number of topics, the clustering effect hardly changes, and the same holds for the FCM algorithm. Therefore, the number of initial cluster centers for data sets M and N is set to 6, at which the F value of the algorithm is about 10% higher than that of the FCM algorithm.

2) Similarity threshold determination
Since the data is obtained from micro-blog, which is characterized by short texts, timeliness, and randomness, the similarity within the same topic is low. If the selected similarity threshold is large, few topics will be judged similar.

Figure 5. F values for different similarity thresholds.

From the graph, we can see that the F value fluctuates with the similarity threshold. At ε = 0.1 the F value appears to peak, and it then tends to decrease as the threshold increases. Sampling further thresholds, we see that F reaches its optimum at ε = 0.25, after which it keeps decreasing as the threshold increases. In addition, there are obvious differences between the F values of the two data sets. Within a certain threshold range the F value for data set M is higher than that for data set N, but when ε > 0.27 it drops significantly and falls below that of data set N. This may be due to the larger size of data set M and to the characteristics of the topics themselves. There are more topics of