A Proposal for a Benchmark Generator of Weakly Connected Directed Graphs

Previous studies on the detection of communities in complex networks focused on non-directed graphs, such as neural networks, social networks, social interrelations, the contagion of diseases, and bibliographies. However, there are other problems whose modeling yields a weakly connected directed graph, such as student access to the university, public transport networks, or trophic chains. Those cases deserve a dedicated study, with an analysis and resolution adjusted to them. This is also a challenge, since most of the existing algorithms were originally designed for non-directed graphs or for symmetrical and regular graphs. Our proposal is a Benchmark Generator of Weakly Connected Directed Graphs whose properties can be defined by end-users according to their needs. The source code of the generators described in this article is available on GitHub under the GNU license.


Introduction
How to cite this paper: Montañana, J.M., Hervás, A. and Soriano, P.P. (2020) A Proposal for a Benchmark Generator of Weakly Connected Directed Graphs. Open Journal of Modelling and Simulation, 8, 18-34.
The interaction between the elements in many complex real-world systems can be modeled as graphs or networks, where the elements are represented as vertices and the relationships between them as edges. Networks in which the relationship between vertices can be unidirectional are referred to as Directed Networks; otherwise, they are referred to as Symmetric Networks. The weight of each edge, if it is defined, is an associated numerical [...] detection after applying different algorithms in the same graph.
The availability of different graphs to be analyzed by the different algorithms is important, because the quality of the algorithms is strongly related to the properties of the graphs to which they are applied [10]. For this purpose, generators of synthetic graphs with different properties were proposed for benchmarking the capabilities of detection algorithms [5] [11] [12].
In our work, we require detection algorithms for Weakly Connected Directed Graphs, which are common in real problems. In those graphs, the relationship between each pair of vertices can be unidirectional, or bidirectional with a different contribution in each direction. Examples of these graphs can be found in the representation of bus lines on their city map [13] [14], flights between airports [15], social networks [16], and the demand for enrolment of students in different degrees [17].
However, we found that the existing community detection algorithms are not able to correctly detect the communities in Weakly Connected Directed Graphs. In addition, we did not find a generator of synthetic graphs with such properties.
There are generators that, in general, work excellently for non-directed graphs [4] [5] [12] [13]. However, they cannot generate directed graphs as specific as the "students' choice of university degrees" case. In particular, the results of the best approximation obtained with what is, to the best of our knowledge, the main synthetic generator used for benchmarking [12] are available in the same GitHub repository as the source code of the generators described in this article.
Therefore, we consider that the research on these graphs requires a generator of synthetic Weakly Connected Directed Graphs, which can be used later for developing and benchmarking new detection algorithms.
For this reason, in this article we propose a new generator of directed graphs with new modelling capabilities, which is able to model weakly connected directed graphs. In particular, we consider the Students' Enrolment Demand graph, referred to as the SED-graph, to evaluate the modelling capabilities of the new generator. The extension to other models follows naturally.

Related Work
The benchmarking process consists of 3 steps. In the first step, a synthetic generator is used to obtain a set of Initial Sub-Graphs. In the second step, the detection algorithms under comparison are applied to each Initial Graph (IG), in which the communities are initially disjoint, i.e. there are no edges between communities. The detection algorithms are then reapplied after adding new edges between communities, until the number of added edges equals the number of edges in the IG. The ratio of the number of additional edges to the number of edges in the IG is commonly referred to as the mixing parameter, 0 ≤ µ ≤ 1. Notice that the additional edges are also defined by the synthetic generator. In the last step, the detection capability of each algorithm is identified, which corresponds to the highest value of the mixing parameter µ at which the algorithm is still able to detect the original communities. Typically, the Normalized Mutual Information (NMI) coefficient is used to evaluate the quality of the detection of the original communities. The NMI is evaluated by comparing the communities detected for the different values of the mixing parameter µ with those of the initial graph (µ = 0). An NMI value of 1 corresponds to a detection of communities identical to that of the original graph, while the NMI decreases as the detected communities differ from those of the original graph.
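The two quantities used in this benchmarking process can be illustrated with a minimal Python sketch. It is not taken from the published generator: the function names are hypothetical, and the NMI is normalized by the arithmetic mean of the two entropies, which is one common convention among several and is assumed here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a community assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(reference, detected):
    """NMI of two assignments (assumed normalization: mean of the entropies).

    reference[v] and detected[v] are the community labels of vertex v.
    """
    if len(reference) != len(detected):
        raise ValueError("both assignments must cover the same vertices")
    n = len(reference)
    ref_counts = Counter(reference)
    det_counts = Counter(detected)
    joint = Counter(zip(reference, detected))
    mutual = sum(
        (c / n) * log(c * n / (ref_counts[r] * det_counts[d]))
        for (r, d), c in joint.items()
    )
    denom = entropy(reference) + entropy(detected)
    # Two single-community assignments are treated as identical.
    return 1.0 if denom == 0 else 2.0 * mutual / denom


def mixing_parameter(added_edges, initial_edges):
    """Ratio of added inter-community edges to the edges of the IG."""
    return added_edges / initial_edges
```

The label values themselves do not matter, only the grouping: a detection that recovers the same groups under different names still scores 1, while a detection that is statistically independent of the original communities scores 0.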
As an example, we can consider the initial directed graph in Figure 2(a).
Each edge in a directed graph has a defined direction. In that graph, we can see 3 disjoint sets of vertices with only internal edges. In the figure, each community detected by a hypothetical algorithm is represented by a different colour. The main synthetic generator used for benchmarking, to the best of our knowledge, was published in [12]. As an example, we have generated with it a binary, a directed, a weighted, and a weighted-directed graph, all of them with a defined weight of edges. That synthetic generator also provides these graphs with different numbers of inter-community edges. We then apply the Girvan-Newman community detection algorithm to each of these graphs. The results for each of the graphs are shown, in terms of the Normalized Mutual Information (NMI), in Figure 3. Figure 3 shows that the original communities are no longer detected once the number of edges has been increased by about 40%.

Objectives
The existing community detection algorithms have been used to find communities in dense and sparse networks. The former are networks whose elements are highly related among themselves, with many more edges than vertices. The latter, which are more common, are networks where the number of edges is much smaller. The communities in these networks are groups of vertices that are highly connected among themselves and poorly connected with the vertices in other communities [9] [12] [13] [14] [19].
However, these algorithms have difficulty detecting the communities in directed graphs where only a few vertices have a high output weight while most vertices have a low total output weight, as we found in real scenarios.
Our objective is to obtain a synthetic generator of this type of graph, because such graphs are not adequately supported by the existing generators and detecting communities in them is a challenge. The difficulty of detection arises because most of the existing algorithms were originally designed for non-directed graphs or for symmetrical and regular graphs.
This new synthetic generator will allow us to evaluate and compare community detection algorithms on this kind of graph (benchmarking of detection algorithms). It will also be of particular interest to researchers who develop new algorithms.

Proportion of Vertices with High and Low Weights on Their Output Edges
The first big difference is the proportion of vertices with high and low weights on their output edges. As an example, Figure 4 shows the weight of the output edges sorted from highest to lowest in the largest community of the SED-graph. Fitting the data of Figure 4 to different functions shows that the best global fit is the exponential function, which also preserves the difference in weight ratio between vertices with higher weight and those with lower weight.
The estimated fitting parameters appear in Table 1, which shows a very good fit.
We can consider the model statistically significant, with a good fit, because the p-values are below the pre-determined statistical significance level, which is conventionally 0.05 (probability of 5% [20]). We can also see that the residuals are acceptable because they are centered around zero, i.e. the fit function is centered in the distribution of measurements. There are also no outliers, i.e. no measurement lies far from the fit function.
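The kind of fit discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R procedure used for Table 1: the model form w(k) = a * exp(-b * k) over the rank k, the function name fit_exponential, and the use of a least-squares line through the logarithm of the weights are all assumptions, and no p-values are computed.

```python
from math import exp, log


def fit_exponential(weights):
    """Fit w(k) = a * exp(-b * k) to weights sorted from highest to lowest.

    k is the 0-based rank of each weight. The fit is an ordinary
    least-squares line through (k, log w), so weights must be positive.
    Returns (a, b, residuals) with residuals in the original scale.
    """
    ranked = sorted(weights, reverse=True)
    if len(ranked) < 2 or ranked[-1] <= 0:
        raise ValueError("need at least two strictly positive weights")
    n = len(ranked)
    xs = range(n)
    ys = [log(w) for w in ranked]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = exp(mean_y - slope * mean_x)
    b = -slope
    residuals = [w - a * exp(-b * k) for k, w in enumerate(ranked)]
    return a, b, residuals
```

Residuals that scatter around zero with no isolated large value correspond to the two checks described in the text. A non-linear least-squares fit in the original scale would weight the largest values more heavily and can give slightly different parameters.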

Ratio of the Size of the Communities
The second main difference is the ratio of small communities to the total number of communities. Figure 5 shows the number of communities of the SED-graph sorted by size.
We look for a function that fits the distribution of community sizes. Figure 6 shows the functions that fit that distribution best. In particular, we found that the function that fits most accurately is the Weibull function. Table 2 shows the numerical metrics, for different criteria, that measure the fitting error with different types of functions. The Weibull function is the one that achieves the best coefficient in all the criteria (the smaller the better). In addition to these criteria, Figure 6 shows that the Weibull function is always the one that best fits the data samples for different types of probability analyses. The estimated fitting parameters appear in Table 3. It is a very good fit, for the same reasons as shown in Section 3.1. In the next section, we proceed to describe the graph generator with these statistical properties.
Figure 4. Weight of the output edges in the largest community of the SED-graph generator.
Table 1. Goodness-of-fit of the weight of the output edges in the SED-graph with an exponential function using R.
Table 3. Goodness-of-fit of the weight of the step levels of the output edges of the largest community in the SED-graph with a polynomial function using R.
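As an illustration of how a Weibull-shaped size distribution could be reproduced in synthetic data, the following Python sketch draws community sizes from a Weibull distribution. It is an assumption made for illustration, not the method of the published generator: the function name, the rounding to integers and the min_size floor are hypothetical, and the scale and shape values would have to come from a fit such as the one reported in Table 2.

```python
import random


def weibull_community_sizes(n_communities, scale, shape, min_size=1, seed=None):
    """Draw community sizes from a Weibull distribution, largest first.

    random.weibullvariate(alpha, beta) takes the scale as alpha and the
    shape as beta. Samples are rounded and floored at min_size so that
    every community has at least min_size vertices.
    """
    rng = random.Random(seed)
    sizes = [
        max(min_size, round(rng.weibullvariate(scale, shape)))
        for _ in range(n_communities)
    ]
    return sorted(sizes, reverse=True)
```

Sorting the sizes from largest to smallest gives a curve that can be compared directly with a plot of communities ordered by size.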

Proposed Synthetic Graph Generator
In this section, we describe the proposed generator, which is highly configurable according to the needs of the user and can generate directed and non-directed graphs, symmetric and regular as well as non-symmetric and non-regular.
To simplify the description, we first present a simplified version of the generator, and then the complete version, whose additional parameters make it possible to model our target graphs.
The first, simplified version of the algorithm is referred to as the "generator of Directed weighted graphs whose vertices have an Unbounded number of Output edges" (DUO), and the second one is referred to as the "Directed weighted graph whose vertices have a Bounded number of Output edges" (DBO).

Generator of Directed Weighted Graphs with Unbounded Number of Output Edges (DUO)
In this first generator, the number of vertices NC per community is drawn randomly from a normal distribution defined by the parameters provided as generator input. Next, the generator creates a routing table, where each route starts at one of the vertices and then visits other vertices of the same community, each different from those already visited in that route (see Algorithm 1).
The number and length of these paths are optional input parameters; when the user does not define them, the generator uses default values calculated as a function of the size of each community.
In order to obtain dominant vertices with a stronger connection in each community, these paths give preference to visiting certain vertices. This is achieved using the following probability function to visit a vertex i: [...] Each path is composed of a list of N vertices, and the weight of the edge between vertex i and vertex i + 1 of the path is increased by 1/2^i, i.e. the weight contribution of each step in the path is half that of the previous step.
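The halving rule for path weights can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the published C implementation: the function names are hypothetical, the step index is assumed to start at 1 (so the first edge of a path receives 1/2), and since the visiting probability function is not reproduced here, the optional preference argument is only a placeholder for it.

```python
import random


def random_path(community, length, rng, preference=None):
    """Pick up to `length` distinct vertices of one community, in visit order.

    `preference` is a hypothetical mapping vertex -> positive weight; a
    larger value makes a vertex more likely to be visited. The choice is
    uniform when it is omitted.
    """
    candidates = list(community)
    if preference is None:
        prefs = [1.0] * len(candidates)
    else:
        prefs = [preference[v] for v in candidates]
    path = []
    while candidates and len(path) < length:
        idx = rng.choices(range(len(candidates)), weights=prefs, k=1)[0]
        path.append(candidates.pop(idx))  # a vertex is never visited twice
        prefs.pop(idx)
    return path


def add_path_weights(weights, path):
    """Step i of the path (i = 1, 2, ...) adds 1 / 2**i to its directed edge."""
    for i, (src, dst) in enumerate(zip(path, path[1:]), start=1):
        weights[(src, dst)] = weights.get((src, dst), 0.0) + 1.0 / 2 ** i
    return weights
```

Because weights accumulate over many paths, an edge that is used early in several paths ends up with a much larger weight than one that only appears near the end of a path, which is what produces dominant vertices.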
As an example, Figure 8(a) shows a graph of 3 communities. The high degree of connectivity shown in Figure 8(a), together with the fact that small-world [21] style graphs have a small set of vertices with a high degree of connectivity while many other vertices have a low degree of connectivity, motivated a second version of the generator.

Generator of Directed Weighted Graphs with Bounded Number of Output Edges (DBO)
This second generator is based on the previous one; the main difference is [...] The definition of the weights of the edges is also done in the same way as for DUO.
However, the paths are defined under a restriction on the number of output edges, which limits the possible random paths that can be defined. We have to take into account that limiting the number of output edges means that some of the new paths will not reach the desired length. This happens because some paths reach a vertex from which the path cannot continue to any vertex that it has not already visited.
As an example, Figure 9 shows a graph of 4 vertices where all the possible directed edges are already defined, and where it is not possible to define a 3-hop path without visiting a vertex more than once. In particular, the only possible path starting at A is the path A → B. It is limited to a single hop because no vertex can be visited more than once within the same path, and the only output edge from vertex B leads to vertex A, which is already visited by this path. This was the reason to develop the DBO generator, which is an evolution of the DUO generator.
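The dead-end situation described above can be reproduced with a short Python sketch of a path walk under an output-edge bound. It is one interpretation of the restriction, not the published algorithm: the function name, the max_out parameter, and the rule that a vertex below its bound may create a new edge to any unvisited vertex are assumptions, and the 4-vertex graph in the usage note is a hypothetical instance consistent with the description of Figure 9.

```python
import random


def bounded_path(out_edges, vertices, start, length, max_out, rng):
    """Walk up to `length` hops from `start` without revisiting a vertex.

    out_edges maps each vertex to the set of targets of its output edges
    and is updated in place whenever a new edge is created.
    """
    path = [start]
    current = start
    while len(path) - 1 < length:
        targets = out_edges.setdefault(current, set())
        if len(targets) < max_out:
            # Room for a new output edge: any unvisited vertex is allowed.
            options = [v for v in vertices if v not in path]
        else:
            # Saturated vertex: only its existing output edges can be reused.
            options = [v for v in sorted(targets) if v not in path]
        if not options:
            break  # dead end: the path stays shorter than requested
        nxt = rng.choice(options)
        targets.add(nxt)
        path.append(nxt)
        current = nxt
    return path
```

With every vertex limited to one output edge and the edges A → B, B → A, C → A and D → B already defined, a 3-hop request starting at A stops after the single hop A → B, because the only output edge of B leads back to A.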

Analysis of a Generated Random Graph
The purpose of this section is to analyze whether the generated synthetic graphs have the main properties of weakly connected directed graphs that make it difficult to detect communities in them. Those properties are the Ratio of the Total Output Weights and the Ratio of the Size of the Communities.

Ratio of the Total Output Weights
The distribution of output edges of the largest community in the graph is represented by a dotted line in Figure 12. We considered the largest community because it is the one that provides the greatest number of values for curve fitting.
The adjustment is shown in Figure 12(a) for the SED-graph, and in Figure 12 [...]
Figure 11. Example of (a) a graph generated by the DUO, (b) the SED-graph, and (c) an example of a graph generated by the DBO. The interconnection degree between vertices in the same community in the SED-graph is more similar to the interconnection degree in the graph generated by the DBO.
Figure 13 shows, as a line with points, the number of communities ordered by size for (a) the SED-graph and (b) a random graph generated by the DBO generator proposed in this article. In both cases, the fit with the Weibull function appears as a dashed line. As the last result, Figure 15 shows some level steps when the vertices are ordered by their total number of output edges, both for the SED-graph and for the graph generated with the DBO algorithm. This results from the way in which the graph was generated, although this pattern was not imposed in the algorithm. We also consider this last result satisfactory, since [...]

Conclusions
In this paper, we have proposed two parametrizable benchmarking algorithms that can generate a wide range of graphs, including graphs not supported by the existing generators. In particular, these previously unsupported graphs are needed for the study of general problems in directed graphs that are "badly conditioned" from the traditional point of view, and in which it is particularly difficult to detect communities. In this way, the proposal in this paper intends to cover a space that until now has not been studied due to its difficulty.
We consider that the availability of synthetic directed graphs is essential for the development of new community detection algorithms, and therefore the generators proposed in this paper can be a key element. The source code of the proposed generators (written in C) is available on GitHub [22].

Funding
This work has been supported by the Project "Complex Networks" from the In-