J. Software Engineering & Applications, 2010, 3, 1005-1014
doi:10.4236/jsea.2010.311118 Published Online November 2010 (http://www.SciRP.org/journal/jsea)

Case-Based Reasoning for Reducing Software Development Effort

Adam Brady1, Tim Menzies1, Oussama El-Rawas1, Ekrem Kocaguneli1, Jacky W. Keung2

1Lane Department of CS&EE, West Virginia University, Morgantown, USA; 2Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China.
Email: adam.m.brady@gmail.com, tim@menzies.us, orawas@gmail.com, kocaguneli@gmail.com, jacky.keung@comp.polyu.edu.hk

Received June 24th, 2010; revised July 15th, 2010; accepted July 19th, 2010.

ABSTRACT

How can we best find project changes that most improve project estimates? Prior solutions to this problem required the use of standard software process models that may not be relevant to some new project. Also, those prior solutions suffered from limited verification (the only way to assess the results of those studies was to run the recommendations back through the standard process models). Combining case-based reasoning and contrast set learning, the W system requires no underlying model. Hence, it is widely applicable (since there is no need for data to conform to some software process models). Also, W's results can be verified (using holdout sets). For example, in the experiments reported here, W found changes to projects that greatly reduced estimate median and variance by up to 95% and 83% (respectively).

Keywords: Software Effort Estimation, Case-Based Reasoning, Effort Estimation

1. Introduction

Existing research in effort estimation focuses mostly on deriving estimates from past project data using (e.g.) parametric models [1], case-based reasoning (CBR) [2], or genetic algorithms [3]. That research is curiously silent on how to change a project in order to, say, reduce development effort. That is, that research reports what is, and not what should be changed.

Previously [4-7], we tackled this problem using STAR/NOVA, a suite of AI search algorithms that explored the input space of standard software process models to find project options that most reduced the effort estimates. That approach had some drawbacks, including 1) the dependency on the data being in the format of the standard process models, 2) the implementation complexity of the Monte Carlo simulator and the AI search engines, and 3) the lack of an independent verification module (the only way to assess the results of those studies was to run the recommendations back through the standard process models).

This paper describes "W", a simpler, yet more general solution to the problem of finding project changes that most improve project estimates. Combining case-based reasoning and contrast set learning, W requires no underlying parametric model. Hence, W can be applied to more data sets, since there is no requirement for the data sets to be in the format required by the standard software process models. For example, this paper explores five data sets with W, three of which cannot be processed using STAR/NOVA.

W is simpler to implement and easier to use than STAR/NOVA, requiring hundreds of lines of the scripting language AWK rather than thousands of lines of C++/LISP. For example, we describe below a case study that finds a loophole in Brooks' Law ("adding manpower [sic] to a late project makes it later"). Using W, that study took three days, which included the time to build W from scratch.
An analogous study, based on state-of-the-art model-based AI techniques, took two years. Also, the accuracy of STAR/NOVA is only as good as the underlying model (the USC COCOMO suite [1]). W's results, on the other hand, are verified using holdout sets on real-world data. In the experiments reported below, we show that W can find changes to projects that greatly reduce estimate median and variance, by up to 95% and 83%, respectively.

Finally, the difference between W, which finds what to change in order to improve an estimate, and a standard case-based effort estimator, which only generates estimates, is very small. Based on this experiment, we advise augmenting standard CBR tools with modules like the planning sub-systems in W.

The rest of this paper is structured as follows. After some background notes on effort estimation and STAR/NOVA, we describe the general framework for case-based reasoning. The W extension to CBR is then described (contrast set learning over the local neighborhood), using a small example. This is followed by fourteen case studies, one with Brooks' Law and thirteen others. Our conclusion will discuss when W is preferred over STAR/NOVA.

2. Background

2.1. Why Study Effort Estimation?

Generating (and regenerating) project effort estimates is an important and continuous process for project managers. Not only do good estimates allow the better use of resources, but by reviewing and improving their estimation process, a software company can learn and improve from its past experience.

Sadly, we often get estimates wrong. Consider NASA's Checkout Launch Control System, which was canceled when the initial estimate of $200 million was overrun by an additional $200 M [8]. This case is not unique, despite the significant effort put into designing more accurate estimation models. It has been reported that many predictions are wrong by a factor of four or more [9,10].

In order to conduct software effort estimation, it is standard practice to use models to estimate effort. Many software process models have emerged aiming to achieve that task, yet no single standardized model has emerged that is widely used by the software engineering industry. There are several reasons for this, including generality, data islands, and instability. Software models may not be general, so it can be inappropriate to apply a software model learned in one environment to another. Also, many companies prefer to keep cost-related data confidential. This data island effect has also contributed to the fragmentation of the field, with companies preferring to build private models rather than using publicly available ones.

This multiplicity of software effort models has led to a scarcity of the publicly available, local data needed for model-based effort estimation. Without sufficient data to build, audit, and tune models, the predictions generated by these models may be highly unstable. Baker [11] reports a study that learned values for the (a, b) (linear, exponential) constants in Boehm's COCOMO software process model [9]. The study was repeated 100 times, each time selecting from a random sample of 90% of the project data. The learned values for (a, b) exhibited an alarming variance:

    (2.2 ≤ a ≤ 9.18) ∧ (0.88 ≤ b ≤ 1.09)    (1)

Such large variations make it hard to understand the effects of changing project options. Suppose some proposed change doubles productivity, but a moves from 9 to 4.5. The improvement resulting from that change would be obscured by the tuning variance.
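To see how tuning variance can drown out a genuine improvement, consider the following small numerical sketch. It uses the basic COCOMO form effort = a * KLOC^b with invented, illustrative values (it is not data from [11]):

    def cocomo_effort(kloc, a, b):
        """COCOMO-style effort estimate in person-months: effort = a * kloc**b."""
        return a * (kloc ** b)

    kloc = 50.0
    before          = cocomo_effort(kloc, a=9.0, b=1.05)   # baseline tuning
    after_real_gain = before / 2.0                         # a genuine 2x productivity gain
    after_retuned   = cocomo_effort(kloc, a=4.5, b=1.05)   # no real gain, but a drifted tuning
    print(round(before), round(after_real_gain), round(after_retuned))
    # The last two numbers are identical, so an observer cannot tell whether
    # effort fell because the process improved or because the tuning moved.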
2.2. STAR and NOVA

The SBSE literature inspired us to try simulated annealing to search the what-ifs associated with Equation 1. This led to the STAR system [4,6]. NOVA was a generalization of STAR that included simulated annealing and other search engines [5,7].

STAR/NOVA handled model variance by finding conclusions that were stable across the space of possible tunings. This analysis assumed that, for a mature effort estimation model, the range of possible tunings is known (this is the case for models like COCOMO). For such models, it is possible for the AI search engines to find conclusions that hold across the space of all tunings. STAR and NOVA constrain the project options P but not the tuning options T. Hence, their recommendations contain subsets of the project options P that most improve the score, despite variations in the tunings T. This approach meant we could reuse COCOMO models without requiring local tuning data.

Briefly, STAR and NOVA operate as follows:

1) SAMPLE: We sample across the ranges of all the attributes in the model n1 times. Often we sample randomly across the range; some heuristics allow us to concentrate more on the extremes of the range.
2) DISCRETIZE: The data seen in the n1 samples is then discretized into D = 10 bins, using equal-frequency bins.
3) CLASSIFY: The top n% of projects are classified as best; the remainder are rest.
4) RANK: The ranges are then ranked in increasing order using support-based Bayesian ranking.
5) PRUNE: STAR runs n2 experiments with the models where the top-ranked ranges 1…X are pre-set and the remaining ranges can be selected at random.
6) REPORT: STAR returns the 1…X settings that best optimize the fitness function. These settings are determined by iterating back from the minimum point achieved towards the first point that is statistically similar to the minimum point.

In practice, the STAR/NOVA approach was very effective. Figure 1 shows the large effort reductions found by STAR in three out of four cases presented at ASE'09.

    Study            NOVA
    nasa93 flight    72%
    nasa93 ground    73%
    nasa93 osp       42%
    nasa93 osp2       5%

Figure 1. Effort estimate improvements found by NOVA (from [12]).

It is insightful to reflect on STAR/NOVA's failure to find large reductions in the fourth case study (nasa93 osp2). In this project, management had already fixed most of the project options. STAR/NOVA failed, in this case, since there was very little left to try to change. This fourth case study led to one of the lessons learned of STAR/NOVA: apply project option exploration tools as early as possible in the lifecycle of a project. Or, to say that more succinctly: if you fix everything, there is nothing left to fix [13].

While a successful prototype, STAR/NOVA has certain drawbacks:

Model dependency: STAR/NOVA requires a model to calculate (e.g.) estimated effort. In order to do so, we had to use some software process models to generate the estimates. Hence, the conclusions reached by STAR/NOVA are only as good as this model. That is, if a client doubts the relevance of those models, then the conclusions will also be doubted.

Data dependency: STAR/NOVA's AI algorithms explored an underlying software process model. Hence, they could only process project data in a format compatible with the underlying model. In practice, this limits the scope of the tool.
Inflexibility: It proved to be trickier than we thought to code up the process models in a manner suitable for Monte Carlo simulation. By our count, STAR/NOVA's models required 22 design decisions to handle certain special cases. Lacking guidance from the literature, we just had to apply "engineering judgment" to make those decisions. While we think we made the right decisions, we cannot rigorously justify them.

Performance: Our stochastic approach conducted several tens of thousands of iterations to explore the search space, with several effort estimates needing calculation in each iteration. This resulted in a performance disadvantage.

Size and maintainability: Due to all the above factors, our code base proved difficult to maintain.

While there was nothing in principle against applying our techniques to other software effort models, we believe that the limiting factor on disseminating our technique is the complexity of our implementation. As partial evidence for this, we note that in the three years since we first reported our technique [6]:

- We have only coded one set of software process models (COCOMO), which inherently limited the scope of our study.
- No other research group has applied these techniques.

Therefore, rather than elaborate a complex code base, we now explore a different option, based on case-based reasoning (CBR). This new approach has no model restrictions (since it does not use a model) and can accommodate a wide range of data sets (since there are no restrictions on the variables that can be processed).

3. Case-Based Reasoning (CBR)

Case-based reasoning is a method of machine learning that seeks to emulate human recollection and adaptation of past experiences in order to find solutions to current problems. That is, as humans we tend to base our decisions not on complex reductive analysis, but on an instantaneous survey of past experiences [14]; i.e., we don't think, we remember. CBR is purely based on this direct adaptation of previous cases, based on the similarity of those cases with the current situation. A CBR-based system has no dedicated world model; rather, that model is expressed through the available past cases in the case cache. This cache is continuously updated and appended with additional cases.

Aamodt and Plaza [15] describe a four-step general CBR cycle, which consists of:

1) Retrieve: Find the most similar cases to the target problem.
2) Reuse: Adapt the actions conducted for the past cases to solve the new problem.
3) Revise: Revise the proposed solution for the new problem and verify it against the case base.
4) Retain: Retain the parts of the current experience in the case base for future problem solving.

Having verified the results of the chosen adapted action on the new case, the new case is added to the available case base. This last step allows CBR to effectively learn from new experiences. In this manner, a CBR system is able to automatically maintain itself. As discussed below, W supports retrieve, reuse, and revise (as well as retain, if the user collecting the data so decides).

This four-stage cyclical CBR process is sometimes referred to as the R4 model [16]. Shepperd [16] considers the new problem as a case that comprises two parts: a description part and a solution part, which together form the basic data structure of the system. The description part is normally a vector of features that describe the case state at the point at which the problem is posed. The solution part describes the solution for the specific problem (the problem description part).

The similarity between the target case and each case in the case base is determined by a similarity measure. Different methods of measuring similarity have been proposed for different measurement contexts. A similarity measure quantifies the closeness (or the distance) between two objects in an n-dimensional Euclidean space; the result is usually presented in a distance matrix (similarity matrix) identifying the similarity among all cases in the data set. Although other distance metrics are available for different purposes, the Euclidean distance metric is probably the most commonly used in CBR.

Irrespective of the similarity measure used, the objective is to rank the cases in the case base by similarity to the target case and utilize the known solutions of the nearest k cases (see the sketch at the end of this section). The value of k has been the subject of debate [17,2]: Shepperd [2] and Mendes [18] argue for k = 3, while Li [3] proposes k = 5. Once the actual value of the target case is available, it can be reviewed and retained in the case base for future reference. Stored cases must be maintained over time to prevent information irrelevancy and inconsistency. This is a typical case of incremental learning in an organization utilizing the techniques of CBR.

Observe that these four general CBR steps (retrieve, reuse, revise, retain) do not include any explicit model-based calculations; rather, we rely on our past experience, expressed through the case base, to estimate any model calculations based on the similarity of the cases being used. This has two advantages:

1) It allows us to operate independently of the models being used. For example, our prior report to this conference [13] ran over two data sets. This study, based on CBR, uses twice as many data sets.

2) It improves our performance, since data retrieval can be more efficient than calculation, especially given that many thousands of iterations of calculation were needed with our traditional modeling-based tool. As evidence of this, despite the use of a slower language, W's AWK code runs faster than the C++/LISP used in STAR/NOVA. It takes just minutes to conduct 20 trials over 13 data sets with W. A similar trial, conducted with NOVA or STAR, can take hours to run.
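For illustration, the retrieve step can be sketched in a few lines of Python. This sketch is not the AWK implementation used in this study; the attribute names and case values are invented for the example, and k = 3 follows Shepperd [2]:

    from math import sqrt
    from statistics import median

    def normalize(cases, attributes):
        """Rescale each attribute to 0..1 using the min/max seen in the case base."""
        lo = {a: min(c[a] for c in cases) for a in attributes}
        hi = {a: max(c[a] for c in cases) for a in attributes}
        out = []
        for c in cases:
            row = {a: (c[a] - lo[a]) / ((hi[a] - lo[a]) or 1.0) for a in attributes}
            row["effort"] = c["effort"]
            out.append(row)
        return out

    def distance(x, y, attributes):
        """Euclidean distance over the normalized attributes."""
        return sqrt(sum((x[a] - y[a]) ** 2 for a in attributes))

    def cbr_estimate(target, cases, attributes, k=3):
        """Standard CBR estimate: median effort of the k nearest cases."""
        nearest = sorted(cases, key=lambda c: distance(target, c, attributes))[:k]
        return median(c["effort"] for c in nearest)

    attributes = ["rely", "cplx", "acap"]            # hypothetical project features
    cases = [{"rely": 3, "cplx": 4, "acap": 5, "effort": 120},
             {"rely": 4, "cplx": 5, "acap": 2, "effort": 380},
             {"rely": 3, "cplx": 3, "acap": 4, "effort":  90},
             {"rely": 5, "cplx": 6, "acap": 1, "effort": 640}]
    norm = normalize(cases, attributes)
    print(cbr_estimate(norm[0], norm, attributes, k=3))   # estimate for the first case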
4. From CBR to W

A standard CBR algorithm reports the median class value of some local neighborhood. The W algorithm treats the local neighborhood in a slightly different manner:

- The local neighborhood is divided into best and rest.
- A contrast set is learned that most separates the two regions (contrast sets contain attribute ranges that are common in one region, but rare in the other).
- W then searches for a subset of the contrast set that best selects for (e.g.) the region with lower effort estimates.

The rest of this section details this process.

4.1. Finding Contrast Sets

Once a contrast set learner is available, it is a simple matter to add W to CBR. W finds contrast sets using a greedy search, where candidate contrast sets are ranked by the log of the odds ratios. Let some attribute range x appear at frequency N1 and N2 in two regions of size R1 and R2. Let the R1 region be the preferred goal and R2 be some undesired goal. The log of the odds ratio, or LOR, is:

    LOR(x) = log( (N1/R1) / (N2/R2) )    (2)

Note that when LOR(x) = 0, x occurs with the same probability in each region (such ranges are therefore not useful for selecting one region or another). On the other hand, when LOR(x) > 0, x is more common in the preferred region than otherwise. These LOR-positive ranges are candidate members of the contrast set that selects for the desired outcome.

It turns out that, for many data sets, the LOR values for all the ranges contain a small number of very large values (strong contrasts) and a large number of very small values (weak contrasts). The reasons for this distribution do not concern us here (if the reader is interested in this master-variable effect, they are referred to [19,20]). What is relevant is that the LOR can be used to rank candidate members of a contrast set. W computes the LORs for all ranges, and then conducts experiments applying the top i ranked ranges. For more on LOR, and its use for multi-dimensional data, see [21].
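For illustration, the following Python sketch scores candidate ranges by Equation 2. It assumes discrete attribute values and adds a small eps term to guard against zero frequencies; the example data is invented, and this is not the AWK implementation used in this study:

    from math import log

    def lor_ranking(best, rest, controllables, eps=1e-6):
        """Score each (attribute, value) range by Equation 2 and keep the LOR-positive ones."""
        scored = []
        for attr in controllables:
            for value in {c[attr] for c in best} | {c[attr] for c in rest}:
                n1 = sum(1 for c in best if c[attr] == value)    # frequency in the best region
                n2 = sum(1 for c in rest if c[attr] == value)    # frequency in the rest region
                lor = log((n1 / len(best) + eps) / (n2 / len(rest) + eps))
                scored.append((lor, attr, value))
        scored.sort(reverse=True)                                # strongest contrasts first
        return [(attr, value) for lor, attr, value in scored if lor > 0]

    best = [{"rely": 3, "data": 2}, {"rely": 3, "data": 3}]
    rest = [{"rely": 5, "data": 4}, {"rely": 4, "data": 4}, {"rely": 3, "data": 5}]
    print(lor_ranking(best, rest, controllables=["rely", "data"]))

The returned order is the sorted order S used in the next section.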
4.2. The Algorithm

CBR systems input a query q and a set of cases. They return the subset of cases C that is relevant to the query. In the case of W:

Each case Ci is an historical record of one software project, plus the development effort required for that project. Within the case, the project is described by a set of attributes which we assume have been discretized into a small number of discrete values (e.g. analyst capability ∈ {1, 2, 3, 4, 5}, denoting very low, low, nominal, high, very high, respectively).

Each query q is a set of constraints describing the particulars of a project. For example, if we were interested in a schedule over-run for complex, high-reliability projects that have only minimal access to tools, then those constraints can be expressed in the syntax of Figure 2.

    @project example
    @attribute ?rely 3 4 5
    @attribute tool 2
    @attribute cplx 4 5 6
    @attribute ?time 4 5 6

Figure 2. W's syntax for describing the input query q. Here, all the values run 1 to 6. 4 ≤ cplx ≤ 6 denotes projects with above-average complexity. Question marks denote what can be controlled: in this case, rely and time (required reliability and development time).

W seeks q' (a change to the original query) that finds another set of cases C' such that the median effort values in C' are less than that of C (the cases found by q). W finds q' by first dividing the data into two-thirds training and one-third testing. Retrieve and reuse are applied to the training set. Revising is then applied to the test set.

1) Retrieve: The initial query q is used to find the N training cases nearest to q, using a Euclidean distance measure where all the attribute values are normalized from 0 to 1.

2) Reuse (adapt): The N cases are sorted by effort and divided into the K1 best cases (with lowest efforts) and K2 rest cases. For this study, we used K1 = 5 and K2 = 15. We then seek the contrast sets that select for the K1 best cases with the lowest estimates. All the attribute ranges that the user has marked as "controllable" are scored and sorted by LOR. This sorted order S defines a set of candidate q' queries that use the first i entries in S:

    qi' = q ∪ S1 ∪ S2 ∪ … ∪ Si    (3)

Formally, the goal of W is to find the smallest i value such that qi' selects cases with the least median estimates.

According to Figure 3, after retrieving and reusing comes revising (this is the "verify" step).

Figure 3. A diagram describing the steps of CBR (source: http://www.peerscience.com/Assets/cbrcycle1.gif).

When revising q', W prunes away irrelevant ranges as follows (a sketch of this loop appears below):

1) Set i = 0 and qi' = q.
2) Let Foundi be the test cases consistent with qi' (i.e., that do not contradict any of the attribute ranges in qi').
3) Let Efforti be the median effort seen in Foundi.
4) If Foundi is too small, then terminate (due to over-fitting). After Shepperd [2], we terminated for |Foundi| < 3.
5) If i > 1 and Efforti ≥ Efforti-1, then terminate (due to no improvement).
6) Print qi' and Efforti.
7) Set i = i + 1 and qi' = qi-1' ∪ Si.
8) Go to step 2.

On termination, W recommends changing a project according to the set q' - q. For example, in Figure 2, if q' - q is rely = 3 then this treatment recommends that the best way to reduce the effort for this project is to reject rely = 4 or 5.

One useful feature of the above loop is that it is not a black box that offers a single "all-or-nothing" solution. Rather, it generates enough information for a user to make their own cost-benefit tradeoffs. In practice, users may not accept all the treatments found by this loop. Rather, for pragmatic reasons, they may only adopt the first few Si changes seen in the first few rounds of this loop. Users might adopt this strategy if (e.g.) they have limited management control of a project (in which case, they may decide to apply just the most influential Si decisions).
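The loop above can be sketched as follows. This is a Python illustration rather than the AWK implementation; it assumes discrete attribute values, queries represented as maps from attributes to allowed values, and a ranking of candidate ranges such as the LOR sketch of Section 4.1:

    from statistics import median

    def consistent(case, query):
        """A case is consistent with a query if it contradicts none of its attribute ranges."""
        return all(case[a] in allowed for a, allowed in query.items() if a in case)

    def w_revise(query, ranked_ranges, test_cases, min_found=3):
        """Grow q' one contrast-set range at a time; stop on over-fitting or no improvement."""
        q_prime, last_effort, report = dict(query), None, []
        for attr, value in ranked_ranges:              # S1, S2, ... in descending LOR order
            candidate = dict(q_prime)
            candidate[attr] = {value}                  # qi' = q(i-1)' plus the next range Si
            found = [c for c in test_cases if consistent(c, candidate)]
            if len(found) < min_found:                 # step 4: too few cases -> over-fitting
                break
            effort = median(c["effort"] for c in found)
            if last_effort is not None and effort >= last_effort:
                break                                  # step 5: no improvement
            q_prime, last_effort = candidate, effort
            report.append((attr, value, effort))       # step 6: report qi' and Efforti
        return report                                  # q' - q is the recommended treatment

Each accepted prefix of the returned report corresponds to one round of the loop, which is what lets users adopt only the first few Si changes if they wish.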
Implementing W is simpler than the STAR/NOVA approach:

- Both NOVA and STAR contain a set of process models for predicting effort, defects, and project threats, as well as Monte Carlo routines to randomly select values from known ranges. STAR and NOVA implement simulated annealing, while NOVA also implements other search algorithms such as A*, LDS, MAXWALKSAT, beam search, etc. STAR/NOVA are 3000 and 5000 lines of C++ and LISP, respectively.
- W, on the other hand, is a 300-line AWK script.

Our pre-experimental suspicion was that W was too simple and would need extensive enhancement. However, the results shown below suggest that, at least for this task, simplicity can suffice (but see the future work section for planned extensions).

Note that W's verification results are more rigorous than those of STAR/NOVA. W reports results on data that is external to its deliberation process (i.e., on the test set). STAR/NOVA, on the other hand, only reported the changes to model output once certain new constraints were added to the model input space.

5. Data

Recall that a CBR system takes as input a query q and cases C. W has been tested using multiple queries on the data sets of Figure 4. These queries and data sets are described below.

5.1. Data Sets

As shown in Figure 4, our data includes:

- The standard public domain COCOMO data set (Cocomo81);
- Data from NASA;
- Data from the International Software Benchmarking Standards Group (ISBSG);
- The Desharnais and Maxwell data sets.

    Dataset     Attributes  Cases  Content                     Units    Min  Median  Mean  Max    Skewness
    coc81       17          63     NASA projects               months     6      98   683  11400  4.4
    nasa93      17          93     NASA projects               months     8     252   624   8211  4.2
    desharnais  12          81     Canadian software projects  hours    546    3647  5046  23940  2.0
    maxwell     26          62     Finnish banking software    months     6    5189  8223  63694  3.3
    isbsg       14          29     Banking projects of ISBSG   minutes  662    2355  5357  36046  2.6

Figure 4. The 328 projects used in this study come from 5 data sets.

Except for ISBSG, all the data used in this study is available at http://promisedata.org/data or from the authors. Note the skew of this data (the distance from min to median is much smaller than from median to max). Such asymmetric distributions complicate model-based methods that use Gaussian approximations to variables.

There is also much divergence in the attributes used in our data:

- While our data include effort values (measured in terms of months or hours), no other feature is shared by all data sets.
- The COCOMO and NASA data sets all use the attributes defined by Boehm [9]; e.g. analyst capability, required software reliability, memory constraints, and use of software tools.
- The other data sets use a variety of attributes such as the number of data model entities, the number of basic logical transactions, and the number of distinct business units serviced.

This attribute divergence is a significant problem for model-based methods like STAR/NOVA, since those systems can only accept data that conforms to the space of attributes supported by their model. For example, this study uses the five data sets listed in Figure 4; STAR/NOVA can only process two of them (coc81 and nasa93). CBR tools like W, on the other hand, avoid these two problems:

- W makes no assumption about the distributions of the variables.
- W can be easily applied to any attribute space (caveat: as long as there are some dependent variables).

5.2. Queries

Figure 2 showed an example of the W query language (a sketch of how such a query can be read appears below):

- The idiom "@attribute name range" defines the range of interest for some attribute "name".
- If "range" contains multiple values, then this represents a disjunction of possibilities.
- If "range" contains one value, then this represents a fixed decision that cannot be changed.
- The idiom "?x" denotes a controllable attribute (and W only generates contrast sets from these controllables). Note that the "range"s defined for "?x" must contain more than one value; otherwise there is no point in making the attribute controllable.
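As referenced above, reading such a query amounts to collecting, for each @attribute line, the allowed values and whether the attribute is controllable. The following Python sketch (not the AWK implementation used in this study) illustrates this on the Figure 2 query:

    def parse_query(text):
        """Return ({attribute: allowed values}, controllable attribute names)."""
        constraints, controllables = {}, set()
        for line in text.splitlines():
            parts = line.split()
            if not parts or parts[0] != "@attribute":
                continue                                  # e.g. skip the "@project" line
            name, values = parts[1], set(parts[2:])
            if name.startswith("?"):                      # "?x" marks a controllable attribute
                name = name[1:]
                controllables.add(name)
            constraints[name] = values
        return constraints, controllables

    query_text = """@project example
    @attribute ?rely 3 4 5
    @attribute tool 2
    @attribute cplx 4 5 6
    @attribute ?time 4 5 6"""

    constraints, controllables = parse_query(query_text)
    # e.g. constraints["rely"] == {"3", "4", "5"} and controllables == {"rely", "time"}
    print(constraints, controllables)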
6. Case Studies

6.1. Case Study #1: Brooks' Law

This section applies W to Brooks' Law. Writing in the 1970s [22], Brooks noted that software production is a very human-centric activity and managers need to be aware of the human factors that increase or decrease productivity. For example, a common practice at that time at IBM was to solve deadline problems by allocating more resources. In the case of programming, this meant adding more programmers to the team. Brooks argued that this was an inappropriate response since, according to Brooks' Law, "adding manpower [sic] to a late software project makes it later". The reason for this slowdown is two-fold:

- The more people involved, the greater the communication overhead. While this is certainly an issue if all parts of the software system are accessible to all other parts, with an intelligent module design this first issue can be mitigated.
- The second issue is more fundamental. Software construction is a complex activity. Newcomers to a project suffer from inexperience in the tools, the platform, the problem domain, etc.

The query of Figure 5 models this second issue. In this query, all the experience attributes have been set to their lowest value (apex, plex, and ltex are analyst experience, programmer language experience, and language and tool experience, respectively). The remaining attributes are all controllable and are allowed to move over their full range.

    @project brooks
    @attribute apex 1
    @attribute plex 1
    @attribute ltex 1
    @attribute kloc 0.9 2.2 3 [snip] 339 350 352 423 980

Figure 5. The brookslaw query.

For a (dataset, query) of (nasa93, q = brookslaw), W returns

    q' = q ∪ (data = 2)    (4)

That is, W recommends setting the database size to its lowest value. Databases are used to store program and data elements. In effect, W is recommending reining in the scope of the project. The recommendation can be paraphrased as follows: if the project is late, and you add more staff with less experience, you can still finish on time if you decrease the scope of the project.

Figure 6 shows the effects of this recommendation. The AsIs row shows the median and spread of the effort values in nasa93. The ToBe1 row shows the effect of Brooks' Law: in the subset of the data consistent with apex = plex = ltex = 1, the median effort has nearly doubled. The ToBe2 row shows the impact of W's recommendation: the project will now finish in nearly the same time as the AsIs row, and the spread is greatly reduced.

    Treatment                median  spread
    AsIs  = nasa93           225     670
    ToBe1 = nasa93 ∪ q       380     680
    ToBe2 = nasa93 ∪ q'      220     290

Figure 6. Effort estimates seen in different treatments for the Brooks' Law experiment.

One of the reasons we are exploring W is the simplicity of the implementation. In this regard, it is useful to compare our results on Brooks' Law to those of other researchers. Brooks' Law is a well-researched effect and other researchers have found special cases where the general law does not hold. For example, using sophisticated qualitative reasoning techniques, Zhang et al. [23] found their own loopholes in Brooks' Law. One of us (Keung) worked on site with the Zhang team and reports that the Brooks analysis was the main result of a two-year masters thesis. In contrast, writing W took three days, and the specific analysis of Brooks' Law took twenty minutes from first posing the question to graphing the output.

6.2. Thirteen More Case Studies

Apart from the Brooks' Law experiment, we have tested W on thirteen other case studies:

- For the ISBSG data set, we used our recent experience to describe the constraints suitable for a stand-alone or client-server system (denoted stdalone and clientsrv).
- For the Desharnais data set, we posed queries representing:
  (s, m, l), denoting (small, medium, large) projects;
  (mngr, team), denoting (manager, team) experience being low.
- For the Cocomo and NASA data sets, we used our contacts at the Jet Propulsion Laboratory to write queries describing (a) osp (see above); (b) the second version of that system, called osp2; as well as (c) generic flight and (d) ground systems.
- Lacking direct experience with the Finnish financial system, we could not pose specific queries to the Maxwell data set. Instead, we made half the attributes controllable and used that as the Maxwell query.

Figure 7 shows the improvements seen in our 13 queries, running on the data sets of Figure 4. As shown by the last line of Figure 7, the usual improvements were (36%, 68%) for (median, spread). Note that, unlike STAR/NOVA, these are results on real-world data not used during training.

    dataset     query q     median  spread
    coc81       ground        95%     83%
    coc81       osp           82%     68%
    nasa93      ground        77%     15%
    ISBSG       stdalone      69%    100%
    nasa93      flight        61%     73%
    nasa93      osp           49%     48%
    maxwell                   44%     76%
    ISBSG       clientserv    42%     88%
    desharnais  mngr-l        36%    -45%
    coc81       osp2          31%     71%
    nasa93      osp2          27%     42%
    desharnais  mngr-s        23%     85%
    desharnais  mngr-m        23%     85%
    coc81       flight         0%     18%
    desharnais  team-m         0%     60%
    desharnais  team-s       -15%   -206%
    desharnais  team-l       -15%    -93%
    median                    36%     68%

Figure 7. Improvements (100 * (initial - final)/initial) for the 13 queries, sorted by median improvement. Gray cells (here, the negative entries) show negative improvement.

Figure 8. Median and spread improvements for effort estimation, shown graphically; dashed lines mark median values.

Figure 8 displays the Figure 7 results graphically. The dashed lines indicate the median of the improvements for each axis. One data set had consistently worse results than any other. The gray cells of Figure 7 indicate when W failed (i.e., where the treatments increased median development effort or spread, or both). Note that the gray cells are found only in the Desharnais results. On investigation, the root cause of the problem was the granularity of the data. Whereas (e.g.) coc81 assigns only one of six values to each attribute, Desharnais' attributes have a very wide range. Currently, we are exploring discretization policies [24,25] to reduce attributes with large cardinality to a smaller set of values. Tentatively, we can say that discretization solves the problem with Desharnais, but we are still studying this aspect of our system.

Even counting the negative Desharnais results, in the majority of cases W found treatments that improved both the median and spread of the effort estimates. Sometimes the improvements were modest: in the case of (coc81, flight), the median did not improve (but did not get worse), while the spread was only reduced by 18%. But sometimes the improvements are quite dramatic:

- In the case of (ISBSG, stdalone), a 100% improvement in spread was seen when q' selected a group of projects that were co-developed and, hence, all had the same development time.
- In the case of (coc81, ground), a 95% effort improvement was seen when q' found that ground systems divide into two groups (the very expensive and the very simple). In this case W found the factor that drove a ground system into the very simple case.

7. Discussion

7.1. Comparisons to NOVA

Figure 9 compares the estimation improvements found by W (in this report) to the improvements reported previously (in Figure 1). This table is much shorter than Figure 7 since, due to the modeling restrictions imposed by the software process models, NOVA cannot be applied to all the data sets that can be processed by W.
The numbers in Figure 9 cannot be directly compared due to the different goals of the two systems: W tries to minimize effort, while NOVA tries to minimize effort and development time and delivered defects (we are currently extending W to handle such multiple-goal optimization tasks). Nevertheless, it is encouraging to note that the results are similar and that the W improvements are not always less than those found by STAR/NOVA.

Figure 9. Comparing improvements found by NOVA (from Figure 1) and W (Figure 7).

Regardless of the results in Figure 9, even though we prefer W (due to the simplicity of the analysis), there are clear indicators of when we would still use STAR/NOVA. W is a case-based method: if historical cases are not available, then STAR/NOVA is the preferred method. On the other hand, STAR/NOVA is based on the USC COCOMO suite of models: if the local business users do not endorse that model, then W is the preferred method.

7.2. Threats to Validity

External validity is the ability to generalize results outside the specifications of a study [26]. To ensure the generalizability of our results, we studied a large number of projects. Our data sets contain a wide diversity of projects in terms of their sources, their domains, and the time period in which they were developed. Our reading of the literature is that this study uses more project data, from more sources, than numerous other papers. Table 4 of [27] lists the total number of projects in all data sets used by other studies; the median value of that sample is 186, which is much less than the 328 projects used in our study.

Internal validity questions to what extent the cause-effect relationship between dependent and independent variables holds [28]. For example, the above results showed reductions in the effort estimates of up to 95%; i.e., by a factor of 20. Are such massive reductions possible? These reductions are theoretically possible. Making maximal changes to the first factor (personnel/team capability) can affect the development effort by up to a factor of 3.53. Making maximal changes to just the first four factors could have a net effect of up to:

    3.53 * 2.38 * 1.62 * 1.54 ≈ 21 > 20    (5)

By the same reasoning, making maximal changes to all factors could have a net effect of up to eleven thousand. Hence, an improvement of 95% (or even more) is theoretically possible. As to what is pragmatically possible, that is a matter for human decision making. No automatic tool such as STAR/NOVA/W has access to all the personnel factors and organizational constraints known to a human manager. Also, some projects are inherently expensive (e.g. the flight guidance system of a manned spacecraft), and cutting costs by, say, reducing the required reliability of the code is clearly not appropriate. Tools like W are useful for uncovering options that a human manager might have missed, yet ultimately the actual project changes must be a human decision.

8. Conclusions

If a manager is given an estimate for developing some software, they may ask "how do I change that?" The model variance problem makes it difficult to answer this question. Our own prior solution to this problem required an underlying process model that limited the data sets that could be analyzed.
This paper has introduced W, a case-based reasoning approach (augmented with a simple linear-time greedy search for contrast sets). W provides a mechanism that allows projects to improve over time based on historical events, greatly assisting project planning and resource allocation. W has proven to be simpler to implement and use than our prior solutions. Further, since this new method has no model-based assumptions, it can be applied to more data sets. When tested on 13 real-world case studies, this approach found changes to projects that could greatly reduce the median and variance of the effort estimate.

REFERENCES

[1] B. Boehm, E. Horowitz, R. Madachy, D. Reifer, B. K. Clark, B. Steece, A. W. Brown, S. Chulani and C. Abts, "Software Cost Estimation with Cocomo II," Prentice Hall, New Jersey, 2000.
[2] M. Shepperd and C. Schofield, "Estimating Software Project Effort Using Analogies," IEEE Transactions on Software Engineering, Vol. 23, No. 11, November 1997, pp. 736-743.
[3] Y. Li, M. Xie and T. Goh, "A Study of Project Selection and Feature Weighting for Analogy Based Software Cost Estimation," Journal of Systems and Software, Vol. 82, No. 2, February 2009, pp. 241-252.
[4] O. El-Rawas, "Software Process Control without Calibration," Master's Thesis, Morgantown, 2008.
[5] T. Menzies, O. El-Rawas, J. Hihn and B. Boehm, "Can We Build Software Faster and Better and Cheaper?" Proceedings of the 5th International Conference on Predictor Models in Software Engineering (PROMISE'09), 2009.
[6] T. Menzies, O. Elrawas, J. Hihn, M. Feather, B. Boehm and R. Madachy, "The Business Case for Automated Software Engineering," Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering (ASE'07), 2007, pp. 303-312.
[7] T. Menzies, S. Williams, O. El-Rawas, B. Boehm and J. Hihn, "How to Avoid Drastic Software Process Change (Using Stochastic Stability)," Proceedings of the 31st International Conference on Software Engineering (ICSE'09), 2009.
[8] K. Cowing, "NASA to Shut down Checkout & Launch Control System," August 26, 2002. http://www.spaceref.com/news/viewnews.html
[9] B. Boehm, "Software Engineering Economics," Prentice Hall, New Jersey, 1981.
[10] C. Kemerer, "An Empirical Validation of Software Cost Estimation Models," Communications of the ACM, Vol. 30, No. 5, May 1987, pp. 416-429.
[11] D. Baker, "A Hybrid Approach to Expert and Model-Based Effort Estimation," Master's Thesis, Morgantown, 2007.
[12] P. Green, T. Menzies, S. Williams and O. El-Rawas, "Understanding the Value of Software Engineering Technologies," Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering (ASE'09), 2009.
[13] T. Menzies, O. Elrawas, D. Baker, J. Hihn and K. Lum, "On the Value of Stochastic Abduction (If You Fix Everything, You Lose Fixes for Everything Else)," International Workshop on Living with Uncertainty (An ASE'07 Co-Located Event), 2007.
[14] R. C. Schank, "Dynamic Memory: A Theory of Reminding and Learning in Computers and People," Cambridge University Press, New York, 1983.
[15] A. Aamodt and E. Plaza, "Case-Based Reasoning: Foundational Issues, Methodological Variations and System Approaches," Artificial Intelligence Communications, Vol. 7, No. 1, 1994, pp. 39-59.
[16] M. J. Shepperd, "Case-Based Reasoning and Software Engineering," Technical Report TR02-08, Bournemouth University, UK, 2002.
[17] G. Kadoda, M. Cartwright, L. Chen and M. Shepperd, "Experiences Using Case-Based Reasoning to Predict Software Project Effort," Keele University, Staffordshire, 2000.
[18] E. Mendes, I. D. Watson, C. Triggs, N. Mosley and S. Counsell, "A Comparative Study of Cost Estimation Models for Web Hypermedia Applications," Empirical Software Engineering, Vol. 8, No. 2, June 2003, pp. 163-196.
[19] T. Menzies, D. Owen and J. Richardson, "The Strangest Thing about Software," IEEE Computer, Vol. 40, No. 1, January 2007, pp. 54-60.
[20] T. Menzies and H. Singh, "Many Maybes Mean (Mostly) the Same Thing," In: M. Madravio, Ed., Soft Computing in Software Engineering, Springer-Verlag, Berlin Heidelberg, 2003.
[21] M. Možina, J. Demšar, M. Kattan and B. Zupan, "Nomograms for Visualization of Naive Bayesian Classifier," Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'04), Vol. 3202, 2004, pp. 337-348.
[22] F. P. Brooks, "The Mythical Man-Month," Anniversary Edition, Addison-Wesley, Massachusetts, 1975.
[23] H. Zhang, M. Huo, B. Kitchenham and R. Jeffery, "Qualitative Simulation Model for Software Engineering Process," Proceedings of the Australian Software Engineering Conference (ASWEC'06), 2006, pp. 10-400.
[24] J. Dougherty, R. Kohavi and M. Sahami, "Supervised and Unsupervised Discretization of Continuous Features," International Conference on Machine Learning, 1995, pp. 194-202.
[25] Y. Yang and G. I. Webb, "A Comparative Study of Discretization Methods for Naive-Bayes Classifiers," Proceedings of PKAW 2002: The 2002 Pacific Rim Knowledge Acquisition Workshop, 2002, pp. 159-173.
[26] D. Milicic and C. Wohlin, "Distribution Patterns of Effort Estimations," 30th EUROMICRO Conference (EUROMICRO'04), 2004.
[27] B. Kitchenham, E. Mendes and G. H. Travassos, "Cross Versus Within-Company Cost Estimation Studies: A Systematic Review," IEEE Transactions on Software Engineering, Vol. 33, No. 5, May 2007, pp. 316-329.
[28] E. Alpaydin, "Introduction to Machine Learning," MIT Press, Cambridge, 2004.