Estimation Models for Software Functional Test Effort

The International Software Benchmarking Standards Group (ISBSG) database was used to build models for estimating software functional test effort. The analysis of the data revealed three test productivity patterns representing economies or diseconomies of scale, and these patterns served as a basis for investigating the characteristics of the corresponding projects. Three groups of projects related to the three productivity patterns, characterized by domain, team size, elapsed time and rigor of verification and validation carried out during development, were found to be statistically significant. Within each project group, the variations in test effort can be explained, in addition to functional size, by 1) the processes executed during development, and 2) the processes adopted for testing. Portfolios of estimation models were built using combinations of the three independent variables. The performance of estimation models built using the functional sizing method developed by the Common Software Measurement International Consortium (COSMIC), known as COSMIC Function Points, was compared with that of models built using the method advocated by the International Function Point Users Group (IFPUG), known as IFPUG Function Points, to evaluate the impact of the respective sizing methods on test effort estimation.


Introduction
This paper reports on a set of estimation models designed with data chosen from the ISBSG repository, with functional sizes reported in both IFPUG function points [1] and COSMIC function points. These estimation models were evaluated using criteria for measuring outputs from estimation models. The models were compared to understand their performance based on the measure of their predictability.
How to cite this paper: Jayakumar, K.R. and Abran, A. (2017) Estimation Models for Software Functional Test Effort. Journal of Software Engineering and Applications,
The motivation for this research work arises from the fact that existing techniques for estimating test effort (such as judgment-based, work breakdown, factors & weights, and functional size based techniques) suffer from several limitations [2] [3], while other innovative approaches for estimating testing effort (such as fuzzy inference, artificial neural networks, and case-based reasoning as proposed in the literature) are yet to be adopted in the industry. There is a growing body of work on the use of the COSMIC function points [4] [5] for estimation and performance measurement of software development projects which can be adapted for estimating software test effort too.
The remainder of this paper is structured as follows. Section 2 presents the data preparation, Section 3 the data analysis, Section 4 the estimation models and Section 5 the conclusions.

ISBSG Data
Release 12 of ISBSG data published in 2013 [6] consists of data related to parameters of software projects reported over the last two and a half decades, providing industry and researchers with standardized data for benchmarking and estimation. The ISBSG dataset has been extensively reviewed for its applicability to building effort estimation models, including effects of outliers and missing values [7] [8].
The attributes of interest for test effort estimation models are:
a. Functional size data based on international measurement standards such as IFPUG and COSMIC function points.
b. Schedule, team size, work effort information, project elapsed time and breakdown of work effort by project phase (planning, specifications, design, build, test and install).
c. Project process related data based on software life cycle activities (e.g. planning, specifications, design, build, test) and adoption of practices from standards or models such as ISO

Data Preprocessing
A set of criteria was defined to ensure data quality, relevance to current industry needs, suitability to the testing context and adequacy for statistical analysis, as follows:

1) Data Quality
a. ISBSG quality rating: Only projects with data quality ratings of A or B were selected, to reduce risk and improve confidence in the results.
b. Function point size quality: When IFPUG function points were used for the measurement of size, only the unadjusted function point value was considered. Function point data quality ratings of C and D were excluded from the data.

2) Data Relevance
ISBSG data consist of projects reported since the early 1990s. Data prior to 2000 and projects with an architecture type of "standalone" were removed, while client/server or Web-based projects were considered for modelling.

3) Data Suitability
To exclude trivial projects, the following filters were applied:

Generation of Datasets
Applying the filters related to the criteria for data selection and removing outliers resulted in 142 data points, which were then grouped to form four datasets:
Dataset A: This dataset consists of all 142 data points, including project functional size measures reported in IFPUG 4.1 or COSMIC FP. For this study, they were not differentiated within dataset A as they correlate well, even though the relationship is not the same across all size ranges [9].
Dataset B: In the case of dataset A, projects with an architecture field value of "standalone" were eliminated from the original ISBSG data set, while "blanks" were retained. To be very specific about the architecture type, "blanks" were also eliminated from dataset A to arrive at dataset B, with 72 data points.
Dataset C: Dataset C is made up of projects where functional size was reported in COSMIC function points. It is a subset of dataset A and has 82 data points.
Dataset D: Dataset D includes only projects where functional size was reported in IFPUG function points. It is another subset of dataset A and contains

Strategy
The following strategy was adopted for data analysis:
a) Identify data point subsets exhibiting different levels of testing productivity.
b) Analyze these subsets to identify the possible causes for the differences in productivity.

Identification of Test Productivity Levels
The scatter diagram in Figure 1 shows a large dispersion in the relationship between functional size and test effort, the independent and dependent variables, respectively. The pattern is roughly wedge-shaped and is typical of data from large repositories [10].
Within the dataset of Figure 1, there are candidate groups exhibiting both large economies of scale and large diseconomies of scale. Test effort does not increase at the same rate for projects of similar functional size. Analyzing various slices of data brought out different testing productivity levels (Figure 2).
As economies and diseconomies of scale correspond to different productivity levels, a new term, "test delivery rate" (TDR), was defined to describe project testing productivity. TDR is the rate at which software functionality is tested relative to the effort required, and is expressed in hours per functional size unit (hr/FSU). Functional size unit (FSU) refers to either IFPUG or COSMIC function points, depending upon the sizing method used for measurement. The four varying levels of productivity are referred to as "TDR levels". TDR being the effect, the characteristics of the projects falling into each level were then investigated to identify the underlying causes. Due to the highly dispersed nature of TDR level 4, only TDR levels 1 to 3 were taken up for further analysis and development of the estimation models.
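The TDR definition above amounts to a single division. The following is a minimal sketch, assuming test effort is recorded in hours and size in either IFPUG or COSMIC function points. The function name `tdr` and the sample projects are hypothetical and are not taken from the ISBSG data.

```python
def tdr(test_effort_hours: float, functional_size: float) -> float:
    """Return the test delivery rate in hours per functional size unit (hr/FSU)."""
    if functional_size <= 0:
        raise ValueError("functional size must be positive")
    return test_effort_hours / functional_size


# Hypothetical projects: (test effort in hours, functional size in FSU).
projects = [(120.0, 300.0), (900.0, 450.0), (2400.0, 400.0)]
rates = [tdr(effort, size) for effort, size in projects]

for (effort, size), rate in zip(projects, rates):
    print(f"effort={effort:7.1f} h  size={size:6.1f} FSU  TDR={rate:.2f} hr/FSU")
```

Because TDR is expressed as hours per FSU, a lower value means that less effort was spent per unit of functionality tested.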

Identification of Candidate Characteristics of Projects
Previous research work [11] [12] based on data from hundreds of software projects has indicated that team size and schedule (duration of the project) within a particular domain affect the productivity of development projects. As software testing is one of the phases of development, project attributes such as domain, team size and elapsed time are likely to influence test productivity, too.
Testability of the software components, i.e., quality of the software delivered for testing, is critical for reducing the testing cost [13] and hence the effort for testing. The quality of the software delivered for testing can be determined by the extent of verification and validation activities carried out during the development process.
The group of projects contributing to each TDR level was termed the project group. Accordingly, project group 1 (PG1), project group 2 (PG2) and project group 3 (PG3) refer to TDR levels 1, 2 and 3 discussed in Section 3.2. Based on the percentage of projects falling into each of the project groups for the four attributes of interest (Table 2), we were able to characterize the project groups.
Close to half of the BFSI projects (46%) fell into PG3, followed by a third in PG2. All education projects fell into PG1, while slightly more than half of the government projects fell into PG2. Close to two thirds of projects with a small team size fell into PG1, while 82% (46% + 36%) of those with a medium team size were distributed between PG1 and PG2. The results of the analysis of the project characteristics in the three datasets A to C, excluding dataset D, demonstrate similar behavior. Characteristics of project groups PG1 to PG3 based on these attributes are summarized in Table 3.

1) Size
It has been observed that functional size is the most accepted approach for measuring size, as sensitivity to changes in functional size has a greater impact on project effort [14] [15]. Here, the correlation coefficients between size and test effort, computed using dataset A, are 0.9035 for PG1, 0.8572 for PG2 and 0.8572 for PG3, indicating good correlation of functional size with test effort. Size was therefore chosen as the primary independent variable.
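A correlation coefficient between size and test effort can be computed as sketched below. The paper does not name the coefficient used, so Pearson's product-moment coefficient is an assumption here, and the sample values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical project group: functional sizes (FSU) and test efforts (hours).
sizes = [120, 250, 310, 480, 700]
efforts = [210, 390, 520, 700, 1150]
print(f"r = {pearson_r(sizes, efforts):.4f}")
```

Values close to 1 indicate that test effort grows almost linearly with size within the group, which is the property used above to justify size as the primary independent variable.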

2) Non-Size Variables
Size being the main independent variable, other independent variables were next examined to determine whether incorporating them into the estimation models was statistically justified. It has been observed [13] that the "testability of software components", meaning the quality of the software delivered for testing, and the testing processes followed are critical factors for reducing testing effort and improving software quality. To accommodate these process factors, two new variables representing development process quality and testing process quality were defined and investigated as follows.

a) Development Process Quality Rating (DevQ)
The process followed during development was rated by considering the nature of the development life cycle followed and the artefacts produced, based on the following project attributes:
- Standards followed.
- Distinct development life cycle phases followed.
- Verification activities carried out during development.
The ISBSG data field "software process" takes one of the values CMMI, ISO, SPICE, PSP or any other such standard followed during development. A set of fields representing "Documents and Techniques" exists in the ISBSG data, providing information on the life cycle phases adopted and verification activities carried out during development. Based on these, a rating for DevQ was developed, as shown in Table 4.

b) Test Process Quality Rating (TestQ)
While reviewing the data related to the testing process followed, it was found that there were not enough fields in the ISBSG data to capture the details of the testing process, such as testing techniques adopted, levels of testing executed, test artefacts produced and reviews of test cases, to gauge the extent of testing. This notwithstanding, it was possible to classify the test process rating broadly into two categories (Table 5).

Analysis of DevQ and TestQ
Projects in dataset A were analyzed in terms of DevQ and TestQ:
- 36%, 49% and 15% of the projects had DevQ ratings of 0, 1 and 2, respectively.
- 80% and 20% of the projects had TestQ ratings of 0 and 1, respectively.
To further justify the inclusion of these variables, two statistical tests were carried out to quantify their significance (Table 6):
- the Kruskal-Wallis test for DevQ, as it involved three categories, and
- the Mann-Whitney test for TestQ.
The p value indicated that size, DevQ and TestQ were statistically significant.
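The two rank-based tests named above can be sketched using only the standard library. This is an illustrative sketch rather than the procedure used in the study: it applies no tie correction, uses the large-sample chi-square approximation for Kruskal-Wallis (exact closed form for three groups, i.e. 2 degrees of freedom) and a normal approximation without continuity correction for Mann-Whitney. The sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


def p_value_three_groups(h):
    """Chi-square upper tail with 2 degrees of freedom, which is exp(-h / 2)."""
    return math.exp(-h / 2.0)


def mann_whitney(a, b):
    """Return (U, two-sided p) using the normal approximation."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    u_a = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u_a, n1 * n2 - u_a)
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical test effort values grouped by DevQ rating (0, 1, 2).
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"H = {h:.2f}, p = {p_value_three_groups(h):.4f}")

# Hypothetical test effort values grouped by TestQ rating (0, 1).
u, p = mann_whitney([1, 2, 3], [4, 5, 6])
print(f"U = {u:.1f}, p = {p:.4f}")
```

For real analyses with ties or small samples, a statistics package that applies tie corrections and exact distributions would be the safer choice.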

Portfolio of Models
A linear relationship between the input and output variables was chosen to build the effort estimation models. Linear regression analysis, a well-known and well-understood technique in statistics and machine learning, does not require much training data and is easily interpreted by project managers. Parametric models are objective, repeatable, fast and easy to use, and can be used early in the life cycle if they are properly calibrated and validated [16]. A set of 24 models under four portfolios was generated (Table 7) using datasets A to D.
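A size-only linear model of this kind can be fitted as sketched below. The paper does not state which fitting procedure or tool was used, so ordinary least squares is an assumption, and the project data are hypothetical.

```python
def fit_line(sizes, efforts):
    """Least-squares fit of effort = intercept + slope * size.

    Returns the pair (intercept, slope).
    """
    n = len(sizes)
    if n != len(efforts) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(sizes) / n
    mean_y = sum(efforts) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("all sizes are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, efforts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical project group: sizes in FSU and test efforts in hours.
sizes = [100, 200, 300, 400]
efforts = [150, 250, 350, 450]
intercept, slope = fit_line(sizes, efforts)
print(f"test effort = {intercept:.1f} + {slope:.2f} * size")
```

Fitting one such line per project group, rather than one line for the whole repository, is what allows each group to have its own productivity level.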
Portfolio A models based on dataset A: Models 1, 2 and 3 use size as the only independent variable, one model for each project group. Models 4, 5 and 6 use both size and DevQ as independent variables and relate to project groups 1, 2 and 3, respectively.
Using estimation models based on size: Test effort for a particular functional size is estimated using Equation (1), which represents the size-based estimation models, by taking the values of A and B from Table 7 and substituting the functional size for "size".
Using estimation models based on size and DevQ: Test effort for a particular functional size for which a DevQ rating is available is estimated using Equation (2). D1 and D2 take different values depending on the value of DevQ, and the appropriate values from Table 7 are chosen depending on whether DevQ = 0 or DevQ = 1. For DevQ = 2, the value is 0, as this is the base value considered while modelling.
Using estimation models based on size, DevQ and TestQ: Test effort for particular values of size, DevQ and TestQ is estimated using Equation (3).
An estimator chooses the project group by mapping the characteristics of the project to be estimated to the attributes of the project groups, and selects the related dataset in order to choose the closest model for estimation.
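Applying a selected model can be sketched as follows. This is an illustration under stated assumptions, not the paper's equations: the ratings are assumed to enter as additive terms on top of a linear size term, with the base category contributing 0 (DevQ = 2 is the base according to the text; TestQ = 1 as the base is an assumption). The class `EffortModel` and all coefficient values are hypothetical and are not taken from Table 7.

```python
from dataclasses import dataclass, field


@dataclass
class EffortModel:
    """Linear test effort model with optional additive rating terms."""
    intercept: float
    slope: float
    # Additive terms keyed by rating; left empty when the model ignores the rating.
    dev_q_terms: dict = field(default_factory=dict)
    test_q_terms: dict = field(default_factory=dict)

    def estimate(self, size, dev_q=None, test_q=None):
        """Return estimated test effort in hours for the given inputs."""
        effort = self.intercept + self.slope * size
        if self.dev_q_terms:
            effort += self.dev_q_terms[dev_q]  # KeyError if the rating is missing
        if self.test_q_terms:
            effort += self.test_q_terms[test_q]
        return effort


# Made-up coefficients for illustration only.
size_only = EffortModel(intercept=40.0, slope=0.5)
size_dev_test = EffortModel(
    intercept=40.0,
    slope=0.5,
    dev_q_terms={0: 120.0, 1: 60.0, 2: 0.0},  # DevQ = 2 is the base category
    test_q_terms={0: 30.0, 1: 0.0},           # assumed base: TestQ = 1
)

print(size_only.estimate(200))
print(size_dev_test.estimate(200, dev_q=1, test_q=0))
```

In practice, one such model object would be kept per project group and dataset, and the estimator would pick the one whose group attributes best match the project at hand.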

Evaluation of Estimation Models
The quality of the estimation models was evaluated using criteria such as the coefficient of determination (R²), adjusted R², magnitude of relative error (MRE) and median magnitude of relative error (MedMRE) [10] (Table 8).
The value of R² for portfolio A ranged between 0.74 and 0.86, and that of adjusted R² ranged between 0.73 and 0.83, indicating a strong relationship between the independent variables (size, DevQ and TestQ) and the dependent variable, test effort, in all models.
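The two goodness-of-fit criteria can be computed as sketched below, using the standard textbook definitions. The actual and predicted effort values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R squared is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjust R squared for the number of independent variables in the model."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("not enough samples for this number of predictors")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical actual and model-predicted test efforts in hours.
actual = [100, 200, 300, 400, 500]
predicted = [110, 190, 310, 390, 500]
r2 = r_squared(actual, predicted)
adj = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=1)
print(f"R2 = {r2:.4f}, adjusted R2 = {adj:.4f}")
```

The adjustment matters when comparing a size-only model against models that add DevQ and TestQ, since plain R² never decreases when a variable is added.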
The MedMRE values, ranging between 0.22 and 0.28, show that for half of the projects in each sample the estimate deviates from the actual effort by no more than 22% to 28%, which is practical considering the multi-organizational data used for building the models.
Similar observations can be made for the rest of the models.
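MRE and MedMRE can be computed as sketched below, using the usual definition MRE = |actual - estimate| / actual. The effort values are hypothetical.

```python
from statistics import median


def mre(actual, estimate):
    """Magnitude of relative error for a single project."""
    if actual <= 0:
        raise ValueError("actual effort must be positive")
    return abs(actual - estimate) / actual


def med_mre(actuals, estimates):
    """Median of the per-project MRE values."""
    return median([mre(a, e) for a, e in zip(actuals, estimates)])


# Hypothetical actual and estimated test efforts in hours.
actuals = [100.0, 200.0, 400.0]
estimates = [120.0, 150.0, 400.0]
print([round(mre(a, e), 2) for a, e in zip(actuals, estimates)])
print(round(med_mre(actuals, estimates), 2))
```

Using the median rather than the mean keeps a few badly estimated projects from dominating the summary figure.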

Predictive Performance of Models
The criterion used to evaluate the predictive quality of an estimation model was PRED (l) = k/n, where k is the number of projects in a specific sample of size n for which MRE <= l. In the software engineering literature, an estimation model is considered good when PRED (0.25) = 0.75 [17] or

Dataset A Models in Portfolio A
There are nine models in portfolio A. A comparison of these nine models reveals how predictability varies between project groups and with the choice of independent variables. Figure 3 depicts the MRE levels of models corresponding to PG1 (the leftmost three bars), PG2 (the next three bars) and PG3 (the rightmost three bars).
Within PG1, the model with size, DevQ and TestQ as independent variables (model 7) demonstrates lower MRE for 50% of the population than the model with size and DevQ (model 4), which in turn is lower than the model with size alone (model 1). A similar pattern is observed in PG3 for models 3, 6 and 9. In the case of PG2, the model using size, DevQ and TestQ (model 5) exhibits higher MRE compared to other models in PG2 (models 2 & 8) as well as for all the models in dataset A.

Size-Based Models
A comparison of all models using only size as an independent variable across all portfolios shows in which contexts size-based models provide better predictability. Figure 4 illustrates size-based models from each portfolio, the first three bars corresponding to each project group in portfolio A, the next three corresponding to each project group in portfolio B and so on.
In summary:
- Size-based models for project groups PG1 and PG3 are better than PG2, with the exception of portfolio D.
- PG3 size models, in general, perform better than PG1 and PG2, except for the last model (Model ID 24).

Models in Portfolios A and B
Portfolio B models were developed using a subset of the data used for portfolio A. Portfolio B models were more specific to web or client/server architecture, unlike portfolio A models where there was an approximation due to differences in architecture. The models in portfolios A and B that use the independent variables DevQ and TestQ along with size were compared. Figure 5 depicts size and DevQ models for PG1, PG2 and PG3 for datasets A and B, while Figure 6 illustrates size, DevQ and TestQ models for PG1, PG2 and PG3 for datasets A and B.
Examination of Figure 5 reveals that models in portfolio B (models 13, 14, 15, 16 & 18) performed much better than models in portfolio A (models 4 to 9), with model 17 being an exception.

COSMIC and IFPUG Models
The performance of COSMIC (dataset C) and IFPUG (dataset D) models was compared next using size-based models from portfolio A as the reference. Both COSMIC and IFPUG data are subsets of dataset A consisting of projects measured using the corresponding sizing method. This comparison can help to evaluate the prediction accuracy of COSMIC-based models versus IFPUG-based models. COSMIC-based estimation models using dataset C performed better than IFPUG-based estimation models using dataset D, with the exception of PG2 (model 20). The COSMIC-based PG3 model demonstrated the best predictability. Furthermore, the R² values for COSMIC-based models ranged from 0.73 to

Conclusions
This research work explored software testing from the perspective of estimating effort for functional testing. The ISBSG database, with its wealth of project data from around the globe, was used for the first time in building effort models for functional testing. The analysis of the data revealed three test productivity patterns representing economies and diseconomies of scale, based on which characteristics of the corresponding projects were investigated. Three project groups, characterized by domain, team size, elapsed time and rigor of verification and validation, and related to the three productivity patterns, were found to be statistically significant. Within each project group, the variations in test effort could be explained, apart from the functional size, by 1) the processes executed during development, and 2) the processes adopted for testing.
Two new independent variables, DevQ and TestQ, were identified as influential in the estimation of effort. A total of 24 models were built, using combinations of the three independent variables. The quality of each model was evaluated using established criteria such as R², adjusted R², MRE and MedMRE. As these models were built from ISBSG data, they could serve as an industry benchmark for functional test effort. Test estimation models built from projects measured in COSMIC function points exhibited better quality and resulted in more accurate estimates than models built from projects measured in IFPUG function points.
The models are applicable only for the ranges of size in the dataset and for testing of business applications. The models generated are not applicable to enhancement projects. These limitations can be overcome by generating specific models for enhancements or real-time projects, using an approach like the one followed in this work. This may require identification of additional project characteristics, as well as other variables influencing testing effort. PG4, the fourth group of project data points, remains to be analyzed.
The process factors used for rating DevQ and TestQ can be further refined within an organizational context. There could be other variables that influence test effort in specific contexts, which would require further study and analysis. The estimation models can also be refined by introducing the testing techniques adopted as a parameter, evaluating its impact on test effort, and building new estimation models that include it.