The Dynamic Prediction Model of Number of Participants in Software Crowd Sourcing Collaboration Development Project

Many online platforms that give the crowd opportunities to participate in software development projects have existed for some time, and in recent years many enterprises have used crowd sourcing to develop their software collaboratively through these platforms. However, some software development projects on these platforms hardly attract any users. Project owners therefore need a way to predict the number of participants in their projects effectively and to plan their software and project specifications accordingly, such as the programming language and the size of the documentation, in order to attract more individuals to participate. Compared with past prediction models, our proposed model can dynamically add factors within the project life cycle, such as the number of participants in the initial stage of the project, and adjust the prediction model accordingly. The proposed model was verified using cross-validation. The results show that: 1) the model with the factor “the number of user participation” is more accurate than the model without it; 2) the factors of the crowd dimension have more influence on the prediction accuracy than those of the software project and owner dimensions. It is suggested that project owners consider not only the factors of the software project dimension in the initial stage of the project life cycle but also the factors of the crowd and interaction dimensions in the late stage, in order to attract more participants to their projects.


Introduction
According to Kalliamvakou [1], nearly 33% of the collaborative platforms have no users involved in the development projects. It is therefore important for project owners to know whether their projects are attractive to users and whether their specifications suit most users. In addition, if they can accurately predict in advance the number of users who are interested in participating in their projects, they can plan their development activities well.
Scholars have indicated that the more participants a project has, the higher the probability that it is developed collaboratively [2]. It is difficult, however, to predict the number of participants in a software development collaboration project correctly based only on the factors of the software and project dimensions that are known before the project is put on the platform. This study holds that an ideal software project prediction system should be able to adjust its prediction results dynamically based on data from within the software development life cycle.
Based on this research motivation, we propose a dynamic prediction model for the number of participants in collaborative software development, and explore the impact of the factors of the crowd dimension on the degree of attention a project receives.

Software Crowd Sourcing Collaboration Development
On a software crowd sourcing collaboration development platform, users can easily upload local software projects to the Internet, download the projects that interest them, and save those projects into their own project library for further participation. The platform makes it easy to develop software projects collaboratively: users can follow other users, form organizations, track the activity of software libraries, modify software code, make comments, and so on.
Mining software repositories (MSR) refers to the practice of searching software libraries or code data [3]. The research data come from the GHTorrent data set provided officially by MSR. The data source is accessed through the Github API and assembled into a new data set (Figure 1).
The software library is defined as all the log files saved during the process of software evolution. The files include the changes of metadata (such as user and developer IDs or time stamps), a record of the differences between versions (such as the change log, branches and tags between versions) and the project bug tracking system.
There are 34 events in Github. In the collaborative software development process, the project owner decides whether a commit submitted by any user is adopted. Users can raise issues or ideas in the discussion area.
If a user sees a project of interest and wants to contribute, the user can copy it to his or her own local repository. If the user wants the project owner to merge the changes, the user can submit a pull request. When the project is developed to a certain extent, a version can be released and become a more formal software product.
On the user side, a user can follow his or her favorite users and pay attention to the development of the projects they created. Github also lets users form an organization, with no limit on the number of members.

Software Crowd Sourcing Collaboration Development Project Prediction
Many scholars use the API to obtain real data from the platform. Some scholars believe that the amount of attention represents the user's interest in the project [4] [5] [6]; others believe that the number of forks represents the user's interest in the project and willingness to contribute [5] [7].
However, some scholars have also found that too many copies are made without any further participation. There are therefore many different opinions on how to predict whether a particular collaborative software development project is attractive to the crowd.
In this research, a popular collaborative software development project is defined as one that attracts a certain number of users on the platform. A project that is predicted to be attractive to the crowd has a higher probability of being put into action through collaborative development activities.

Research Process
A stepwise regression procedure is adopted to construct the predictive model. For cross-validation, the data were divided into ten equal parts, with nine tenths used as training data and one tenth used as test data to verify the final model. The research process consists of the following six steps.
Step 1: Grab the Github data through a third-party API and build a history database.
Step 2: Perform multi-layer grouping through the K-means algorithm until the group features are obvious.
Step 3: Include the influence factors using the clustering result obtained in Step 2.
Step 4: Evaluate the impact of each factor on the number of participants among the groups.
Step 5: Construct the prediction models using the influence factors and clustering groups obtained from the above steps.
Step 6: Verify and further compare the prediction accuracy of the two models using the MMRE and Pred (0.25) metrics.
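
The ten-part split used for cross-validation can be illustrated with a minimal sketch. The function name `ten_fold_splits`, the fixed random seed and the round-robin assignment of samples to folds are assumptions made for this illustration; the study does not describe how its folds were drawn.

```python
import random


def ten_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles the sample indices once, then uses each
    of the k folds in turn as test data and the other k - 1 as training data.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin slicing keeps the fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1096 is the sample size reported in the Data Collection section.
splits = list(ten_fold_splits(1096))
```

Each of the ten splits holds out roughly one tenth of the projects, so every project is used as test data exactly once.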

The Impact Factors of the Number of Participants
Github provides 34 attribute factors, which can be divided into three dimensions: software project, owner and crowd. This research found that the project owner has an important influence in the early stage of the project, so the project owner dimension is separated from the software project dimension. A project also changes over its development life cycle. Therefore, the research variables are divided into fixed factors and uncertain factors according to their time characteristics.

Dimension Design
We define the dimensions of the factors that affect the number of participants in a software development collaboration project. There are three main dimensions:

Software Project Dimension
The software project dimension covers the basic information of the project and the changes during its development process. This study draws four factors as research variables: Fix_doc, Fix_language, Fix_developer and Dy_release.

Project Owner Dimension
The project owner is the user who created the project. The owner's influence on the growth of the number of participants is felt in the initial stage of the project. The research variables for this dimension are Fix_type, Fix_follower and Fix_following.

Crowd Dimension
When a software project is developed through the collaborative crowd development platform, users on the platform can join the project development at any time. Whether crowd participation in turn affects the number of participants is an important issue of this study. The factors set for this dimension are Dy_commit, Dy_issue and Dy_fork.
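
The three dimensions and their variables can be written down as a small lookup structure. This is only a sketch: the dictionary keys and the helper `split_by_timing` are hypothetical names, and reading the `Fix_` prefix as a fixed factor and the `Dy_` prefix as a dynamic (uncertain) factor is an assumption based on the variable names.

```python
# Research variables grouped by dimension, as listed in this section.
FACTORS = {
    "software_project": ["Fix_doc", "Fix_language", "Fix_developer", "Dy_release"],
    "project_owner": ["Fix_type", "Fix_follower", "Fix_following"],
    "crowd": ["Dy_commit", "Dy_issue", "Dy_fork"],
}


def split_by_timing(factors):
    """Separate factor names into fixed and dynamic lists.

    Assumption: the prefix of each name encodes its time characteristic
    ("Fix_" for fixed, "Dy_" for dynamic).
    """
    names = [name for group in factors.values() for name in group]
    fixed = [name for name in names if name.startswith("Fix_")]
    dynamic = [name for name in names if name.startswith("Dy_")]
    return fixed, dynamic


fixed_factors, dynamic_factors = split_by_timing(FACTORS)
```

Under this reading, the crowd dimension consists entirely of dynamic factors, which matches the idea that they only become observable after the project is on the platform.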

Data Collection
The objects of this study are the artificial intelligence software projects on Github. The collected data cover the projects created from January 1, 2015 to January 31, 2016, together with the development information for each project during a one-year period. Projects were selected by Github's artificial intelligence related labels, and each project had to be an original, pure software project. Projects with zero attention were filtered out, leaving a final sample of 1096 projects.

Multi-Layered Data Grouping
Data grouping classifies similar things together, although the variables within the same group may still differ to unequal degrees. There are two ways to group the data. The first classifies projects by the number of participants added each week; the second classifies them by the growth trend of the weekly number of participants. In both cases, the projects placed in the same group share the same characteristics. However, the first method is not suitable because the weekly curve fluctuates too dramatically. The study therefore adopts the second method, which divides the projects into five groups with obvious characteristics.
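
The difference between the two views of the data can be seen in a short sketch. The weekly numbers are made up for illustration, and treating the "growth trend" as the running total of weekly additions is an assumption.

```python
from itertools import accumulate


def cumulative_trend(weekly_new_participants):
    """Turn weekly additions into a running total of participants."""
    return list(accumulate(weekly_new_participants))


# Hypothetical weekly additions for one project: spiky and hard to compare.
weekly = [0, 12, 1, 0, 7, 2]

# The running total is non-decreasing and much smoother than the weekly series.
trend = cumulative_trend(weekly)  # [0, 12, 13, 13, 20, 22]
```

The weekly series jumps between 0 and 12, while the running total changes monotonically, which is one way to understand why the trend view is easier to group.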

Grouping Method
Through data conversion, we first scale the growth trend of each project so that the relative trends can be compared, and then calculate the difference between each pair of data series.
Data standardization:
$$z = \frac{x - \mu}{\sigma}$$
where x is the raw data to be normalized, μ is the population mean and σ is the population standard deviation.
The difference between the standardized vectors is then calculated. The Clara algorithm can deal with larger data sets. Internally, this is achieved by considering sub-datasets of a fixed sample size, so that the time and storage requirements become linear in n rather than quadratic.
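
The standardization and vector-difference steps can be sketched as follows. Standardizing each project's series with its own mean and population standard deviation, and using the Euclidean distance as the difference between vectors, are both assumptions; the series are hypothetical.

```python
import math
import statistics


def standardize(values):
    """Return z-scores (x - mu) / sigma using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # A flat series has no trend to scale; map it to all zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical cumulative participant counts for three projects.
slow = standardize([1, 2, 3, 4, 5])
fast = standardize([10, 20, 30, 40, 50])
late = standardize([0, 0, 0, 0, 10])

same_shape = euclidean_distance(slow, fast)   # close to 0: same relative trend
different_shape = euclidean_distance(slow, late)
```

After scaling, two projects that grow linearly at very different absolute rates become nearly identical, while a project that grows only at the end remains clearly distinct.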

Grouping Results
The sample data were grouped in multiple layers in this study. The first grouping produced four groups: group one had 1003 projects, group two had 27, group three had 57, and group four had 8. Because the number and characteristics of group one were not focused enough, group one was further divided into three groups. After the analysis and merging, the final result was five groups: group one had 257 projects, group two 690, group three 84, group four 57, and group five 8. In groups one, two and three the number of participants grew gradually, but at different growth rates. In group four the number of participants stopped growing after a few weeks, and in group five it increased suddenly in the last five weeks. For brevity, this study presents only the experimental results of groups one and two.

Experimental Results
[...] the smaller the number is, the smaller the margin of error in the prediction result is. Since one-tenth of the samples were randomly selected as test data, the standard deviation was calculated to avoid the influence of outliers. The study treats predictions within plus or minus 25% of the actual value as falling within the acceptable error range [8].
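
The two accuracy metrics named in the research process can be sketched with their commonly used definitions: MMRE is the mean of |actual - predicted| / actual, and Pred (0.25) is the share of predictions whose relative error is at most 25%. The function names and the sample numbers are hypothetical, and the sketch assumes every actual value is greater than zero.

```python
def mre(actual, predicted):
    """Magnitude of relative error for one project (actual must be non-zero)."""
    return abs(actual - predicted) / actual


def mmre(actuals, predictions):
    """Mean magnitude of relative error over all test projects."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)


def pred(actuals, predictions, level=0.25):
    """Fraction of predictions whose relative error is within `level`."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= level)
    return hits / len(actuals)


# Hypothetical participant counts: relative errors are 0.10, 0.25, 0.00, 0.50.
actual = [100, 40, 10, 8]
predicted = [110, 30, 10, 12]

mmre_value = mmre(actual, predicted)   # 0.2125
pred_value = pred(actual, predicted)   # 0.75
```

A lower MMRE and a higher Pred (0.25) both indicate a better model, so the two metrics are usually read together.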

Group One-Stable Growth Prediction Model
The result of the prediction model for group one is shown in Table 1.

Group Two-Rapid Growth Prediction Model
The result of the prediction model for group two is shown in Table 4. The results show that Model II has a prediction accuracy of 61% at half a year and 62% at one year, which is 20% more accurate than Model I.

Conclusions
The experimental results show that Model II is better than Model I in both MMRE and Pred (0.25), and that the one-year predictions are better than the half-year predictions. It can be seen that the more data the model has access to, the more accurate the prediction is. That is, accuracy can be improved with more participant data.
In the software project dimension, the number of participants is affected mostly by Fix_developer. Fix_language has an effect only at half a year in group five, although the participants of group five increase only in late development. Its influence on the project participants is low, which is consistent with the conclusions of the past literature. Fix_doc has an effect only in groups one and five in Model I, and has no effect in Model II. In addition, Dy_release has no impact in any group, so this study believes that users do not care whether the project has been developed into a released version. In the project owner dimension, Fix_follower is the most influential factor: it has an effect in groups one, two and four for Model II at one year, and in groups one and four for Model I at one year. Therefore, the project owner needs to interact with users on the platform to increase the number of participants.
Fix_following has no effect in either Model I or Model II, which is contrary to the results of past literature. This study believes that project owners can build a reputation on the platform not only by following other users but also by participating in the development life cycle.
Within the crowd dimension, Dy_fork is the most influential factor, which is consistent with the past literature. The results also show that the crowd dimension has a high degree of influence on the collaborative development process. If the attracted users are willing to interact in the project, they have a positive impact on the number of participants. The study also found that users do not care whether the project has been developed, but do care whether the project is continuously developed or maintained, so all factors in the crowd dimension have an impact on the number of participants.