Code Clone Detection Method Based on the Combination of Tree-Based and Token-Based Methods

This article proposes a high-speed and high-accuracy code clone detection method based on the combination of tree-based and token-based methods. The existence of duplicated program code, called code clones, is one of the main factors that reduce the quality and maintainability of software. If a code fragment contains faults (bugs) and it is copied and modified to other locations, all of the copies must be corrected. However, it is not easy to find all code clones in large and complex software. Much research effort has been devoted to code clone detection. There are mainly two methods for code clone detection: the token-based method and the tree-based method. The token-based method is fast and requires fewer resources, but it cannot detect all kinds of code clones. The tree-based method can detect all kinds of code clones, but it is slow and requires much computing resources. In this paper, a combination of these two methods is proposed to improve the efficiency and accuracy of detecting code clones. First, candidates of code clones are extracted by the token-based method, which is fast and lightweight. Then the selected candidates are checked more precisely by the tree-based method, which can find all kinds of code clones. A prototype system was developed. This system accepts source code and tokenizes it in the first step. Then the token-based method is applied to this token sequence to find candidates of code clones. After extracting candidates, the selected source code is converted into abstract syntax trees (ASTs) for applying the tree-based method. Some sample source code was used to evaluate the proposed method. This evaluation showed an improvement in the efficiency and precision of code clone detection.


Introduction
This article proposes a high-speed and high-accuracy code clone detection method. A code clone is a fragment of source code that is identical or similar to another portion of source code [1]. Code clones often reduce the maintainability of software. Suppose that there are two code fragments A and B, and fragment B is a clone of fragment A. If A contains errors (bugs), B must contain the same errors, and they must be removed at the same time as the errors in A. If the programmer does not recognize the existence of clone B, he or she may forget to fix them in clone B. This may cause the deterioration of software quality. It is often said that 10% to 20% of the code in large-scale software is duplicated code (code clones) [2] [3]. If the software is large, finding all code clones is hard work.
From a syntactical point of view, there are three types of code clone, named Type-1, Type-2, and Type-3 [4]. In Type-1, all parts of the original and the clones are identical. In Type-2, there are some differences, such as different identifier names, function names, and constant values, but the structure of the code is identical. In Type-3, several statements are inserted into or removed from the original source code.
In order to detect these code clones, several methods have been proposed in previous works. These methods are based on two principles: one is the token-based method [5] [6] and the other is the tree-based method [7] [8]. Token-based methods can detect Type-1 and Type-2 clones very fast, but they have difficulty detecting Type-3 clones. Because they are relatively fast, they can be applied to large-scale software. On the other hand, tree-based methods can detect all types of code clone but require large computing resources (CPU time and memory).
In this article, we propose a new method that can detect all types of code clone and runs relatively fast. Our method combines the token-based method and the tree-based method. Using the fast token-based method, candidates of code clones are extracted. After the extraction, each candidate fragment is examined by the tree-based method to determine whether it is a clone or not. By combining the two methods, our method can detect all types of code clone faster.

Definitions and Types of Code Clones
Note that although there are only two fragments in Figure 1, three or more code clones may exist in the source code. Suppose there is one bug in the original code, shown by the red star in Figure 1. Then similar bugs may exist in the other code fragments.
• Type-1 (Exact clone): All parts of the original and the clone are identical.
• Type-2 (Parameterized clone): Syntactically identical, but some names of identifiers (variable names, function/method names, etc.) and the values of constants are different in the two code fragments.
• Type-3 (Gap clone): Duplication with some insertion and/or deletion of statements.

Related Works of Clone Detection
Several methods to detect code clones have been proposed in previous works. For example, Baker et al. tried to detect clones by line-wise comparison of two files [2]. Furthermore, they introduced the parse tree and used it for detecting clones [2]. Currently, detection methods are classified into three categories.
• Text-based method (or line-based method): This method detects code clones by comparing two code fragments line by line. It is lightweight and the fastest of the three methods, but it can detect only Type-1 clones. Some sophisticated variants can also detect Type-2 clones.
• Token-based method: This method detects code clones by comparing two sequences of tokens (a token is the minimum lexically meaningful sequence of characters). As tokens only represent the kind of element in the programming language, this method can detect Type-1 and Type-2 clones, but it requires sophisticated modification to detect Type-3 clones. This method is relatively fast and lightweight.
• Tree-based method: This method detects code clones by comparing two abstract syntax trees (ASTs) [9] or other tree representations of source code. In this method, the source code is first tokenized and parsed. It is then converted into an AST (or another tree-based representation), and the similarity of two (or more) trees is compared. This method requires much computing resources (it is slow and needs large memory) but detects all types of clones.
Table 1 summarizes the comparison of previous methods.

Issues of Previous Works
As shown in Table 1, all previous methods have advantages and disadvantages. Text-based and token-based methods run relatively fast and do not require much computing resources. For example, Kamiya et al. reported that 40 seconds were required to find clones in 235,000 lines of source code [5]. Baker showed that only 12 seconds were necessary for the same source code [10]. However, finding Type-3 clones is difficult with these two methods.
Our preliminary investigation on some sample programs found the distribution of clone types shown in Figure 2. This graph shows that approximately 97% of clones are Type-2 and Type-3, and that Type-3 accounts for about 40%. Therefore, ignoring Type-3 clones makes the quality of software worse.

Table 1. Comparison of previous methods of code clone detection (columns: Type-1, Type-2, Type-3, Speed).
With this investigation, we can also see that the number of exact clones is relatively small. This makes sense, since in most cases of code cloning a part of the code is taken and changed to fit the needs of a new function that is close to the cloned program in terms of the service or computation it provides. In the process of modification, some identifiers may be renamed, some statements may be inserted or removed, and some conditional expressions may be changed. Therefore, the difficulty of finding Type-3 clones is a serious drawback from a practical point of view.
On the other hand, tree-based methods can detect almost all clones, including Type-3 clones that are hard to detect by text-based or token-based methods. However, tree-based methods require much computing time and resources. Baxter [7] and Krinke [11] indicated that about 3 hours and 63 hours, respectively, were necessary to find all code clones in approximately 115,000 lines of code. The reason why the tree-based method requires much computing resources is basically the time needed to compare two or more trees. When the number of nodes in the tree T is N, the time complexity of naive tree comparison is proportional to $N^6$ [11]. Therefore, when the number of nodes, which corresponds to the size of the software, becomes 10 times larger, the computation time becomes $10^6$ (1 million) times longer. This means that the tree-based method is hard to apply to large-scale software.

Overall of Proposed Method
Based on the above considerations, we propose a new code clone detection method based on the combination of the token-based method and the tree-based method. The token-based method runs faster but is not appropriate for detecting Type-3 clones. The tree-based method can find all types of clones but runs slower.
The large computing time of the tree-based method is caused by the time spent comparing trees. The larger the software becomes, the more trees have to be compared. Therefore, we use the token-based method to narrow down the number of trees to be compared. By reducing the number of trees to be compared, we can execute the tree-based method much faster. Figure 3 shows the overall steps of the proposed method, which consists of the following steps.
1) Lexically analyzing the source code and generating a token sequence,
2) Applying the token-based method to extract the candidates of code clones,
3) Generating abstract syntax trees (ASTs) of the code clone candidates,
4) Comparing the ASTs to identify code clones of all types.
In step (1), the source code is converted into a sequence of tokens. Then a conversion is applied to the sequence of tokens in order to detect Type-2 clones. This conversion replaces specific tokens, such as identifiers and function/method names, by special characters. Figure 4 is an example of the conversion. In this example, all identifiers are replaced by the special character "$" and constants are replaced by "#". After the conversion, the detection process is executed. After arranging all tokens as shown in Figure 5, identical tokens are checked and marked as "*". Code fragments that form sufficiently long diagonal lines in Figure 5 are candidates of code clones. After the extraction of code clone candidates in step (2), these code fragments are converted into abstract syntax trees in step (3). Then these trees are compared in step (4), using the tree distance, to find Type-3 code clones.
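Steps (1) and (2) can be illustrated by the following minimal sketch. It is not the prototype's implementation: the regular-expression lexer, the small keyword set, and the function names `normalize` and `diagonal_runs` are assumptions made for illustration, and the naive matrix scan is chosen for clarity rather than speed.

```python
import re

# Hypothetical, simplified lexer; a real tool would use the language's full lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")
KEYWORDS = {"int", "for", "if", "else", "while", "return"}  # illustrative subset


def normalize(source):
    """Tokenize source and replace identifiers by "$" and numeric constants by "#"."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isdigit():
            tokens.append("#")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            tokens.append("$")
        else:
            tokens.append(tok)
    return tokens


def diagonal_runs(a, b, min_len):
    """Return (start_a, start_b, length) for every maximal diagonal run of
    matching tokens whose length is at least min_len."""
    runs = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip cells that only continue a run started at an earlier cell.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                runs.append((i, j, k))
    return runs


left = normalize("int a = b + 10;")
right = normalize("int x = y + 25;")
print(left)
print(diagonal_runs(left, right, min_len=5))
```

Because both statements normalize to the same token sequence, they form one continuous diagonal, which is how a Type-2 clone appears in the token matrix.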

Gap of Diagonal Line
Type-1 and Type-2 code clones draw continuous diagonal lines when we represent them in the manner shown in Figure 5. However, as Type-3 code clones have some insertions and/or deletions of statements, there are gaps within their diagonal lines. Figure 6 shows an example of such a gapped diagonal line. In this case, the original source code has several inserted and/or removed tokens, which cause the gap shown in Figure 6.
In order to detect Type-3 code clones, we have to detect the candidates that have a gap in a diagonal line. To detect this gap, we need a method that looks for other parts of the code that might follow the gap.
Figure 7 shows an illustration of gapped diagonal lines. In this case, there are three diagonal lines $l_i$, $l_j$, and $l_k$. To find Type-3 code clones by merging several code clone fragments, we use the following algorithm.

Algorithm
Step 1 Extract the diagonal lines (code clone candidates) that are longer than a predefined length.
Step 2 Let $l_i$ be a diagonal line whose starting point is $l_i^s$ and whose ending point is $l_i^e$, and let $x(p)$ and $y(p)$ denote the coordinates of a point $p$.
Step 3 For all diagonal lines, if there is another diagonal line $l_j$ where $x(l_j^s) - x(l_i^e) < dx$ and $y(l_j^s) - y(l_i^e) < dy$, merge the two lines $l_i$ and $l_j$ and draw a diagonal line starting at $l_i^s$ and ending at $l_j^e$. Note that $dx$ and $dy$ are predefined tolerable limits. The fragment corresponding to this new diagonal line is a merged virtual diagonal line and is a candidate of Type-3 code clone.
Step 4 When two or more lines $l_j, \ldots, l_v$ can be merged successively in this way, the diagonal line from $l_i^s$ to $l_v^e$ is a candidate of Type-3 code clone.
Figure 8 illustrates a gap code clone that has a virtual diagonal line. This virtual diagonal line represents the candidate of Type-3 code clone.
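The merging of Steps 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: each diagonal line is represented as a pair of (x, y) end points, the gap between the end of one line and the start of the next is assumed to be non-negative in both directions, and the function name `merge_gapped` is hypothetical.

```python
def merge_gapped(lines, dx, dy):
    """Merge diagonal lines separated by small gaps into virtual diagonal lines.

    lines: list of ((xs, ys), (xe, ye)) segments, start point first.
    A line is appended to the current one when its start lies less than
    dx / dy beyond the current end point (assumed non-negative gap).
    """
    lines = sorted(lines)  # a follower always starts after its predecessor
    used = [False] * len(lines)
    merged = []
    for i, (start, end) in enumerate(lines):
        if used[i]:
            continue
        used[i] = True
        extended = True
        while extended:  # Step 4: keep chaining while a follower exists
            extended = False
            for j, (s, e) in enumerate(lines):
                if used[j]:
                    continue
                if 0 <= s[0] - end[0] < dx and 0 <= s[1] - end[1] < dy:
                    end = e  # Step 3: virtual line now ends at the follower's end
                    used[j] = True
                    extended = True
                    break
        merged.append((start, end))
    return merged


segments = [
    ((0, 0), (10, 10)),
    ((13, 14), (30, 31)),
    ((32, 33), (40, 41)),
    ((100, 200), (120, 220)),
]
print(merge_gapped(segments, dx=5, dy=5))
```

In this example the first three segments are chained into one virtual diagonal line, while the distant fourth segment stays separate.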

Applying Tree-Based Method
Exact and parameterized code clones (Type-1 and Type-2) can be found by the token-based method, but gap clones (Type-3) cannot be identified by it. In order to identify Type-3 code clones, we need further steps: 1) translating the source code into an abstract syntax tree (AST) or a similar tree-based representation of source code, 2) computing the difference (distance) of every pair of code clone candidates, and 3) identifying Type-3 code clones by examining the distance. Transforming source code into an AST is a well-known task of language processors such as compilers, so we do not describe this process further.
In order to compute the distance of two trees, we use the tree edit distance (TED) [12]. In TED, several primitive operations are used to modify trees. For example, we can assume that the following three operations are the primitive operations for tree transformation: 1) insertion of a node, 2) deletion of a node, and 3) renaming of a node. Let $T_1$ and $T_2$ be two trees. By applying the three primitive operations above, we can always transform $T_1$ into $T_2$. The trivial method of transforming $T_1$ into $T_2$ is to a) delete all nodes in $T_1$ to obtain the void (null) tree, and b) insert all nodes of $T_2$ into the void tree. Let $n_1$ and $n_2$ be the number of nodes in $T_1$ and $T_2$, respectively. Then the maximum TED of $T_1$ and $T_2$ is $n_1 + n_2$ (deleting all nodes of $T_1$ requires $n_1$ operations and inserting all nodes of $T_2$ requires $n_2$ operations). There may be a shorter sequence of operations that transforms $T_1$ into $T_2$. TED is defined as the minimum number of these primitive operations. For example, let us consider the two trees $T_A$ and $T_B$ in Figure 9. The number of nodes in $T_A$ is 6 and that of $T_B$ is 6. Therefore, by using the trivial method, $T_A$ can be converted into $T_B$ by 6 + 6 = 12 operations. This is the maximum length of the transformation. However, $T_A$ can also be converted by the following sequence: i) delete node "D" from tree $T_A$, ii) insert node "G" between nodes "A" and "B" of tree $T_B$, iii) rename node "F" of tree $T_A$ into "H". The total number of primitive operations is 3, and this is the minimum number of operations for transforming $T_A$ into $T_B$. By applying TED and computing the distance of any two trees based on the established algorithms [13], we can filter the Type-3 code clones.
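The definition of TED can be made concrete with a small sketch. The code below is a plain memoized recursion over ordered forests with unit costs for insertion, deletion, and renaming; it is not the optimized algorithm of [13] and is only practical for small trees. The tuple representation `(label, children)` and the function names are assumptions made for illustration, and the example trees are not those of Figure 9.

```python
from functools import lru_cache


def size(forest):
    """Number of nodes in a forest given as a tuple of (label, children) trees."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of insert/delete/rename operations turning f1 into f2."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert every node of f2
    if not f2:
        return size(f1)  # delete every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children take its place.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, renaming if the labels differ.
    rename = (forest_distance(kids1, kids2)
              + forest_distance(f1[:-1], f2[:-1])
              + (0 if label1 == label2 else 1))
    return min(delete, insert, rename)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


t1 = ("A", (("B", (("C", ()),)),))   # A -> B -> C
t2 = ("A", (("C", ()),))             # A -> C
print(tree_edit_distance(t1, t2))
```

Here deleting node "B" is enough to turn the first tree into the second, so the distance is 1, well below the trivial bound $n_1 + n_2 = 5$.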

Overall of Prototype
We have implemented a code clone detection system that adopts the proposed method. The target program of the experiment is as follows.
• Implementation language: Java
• Program name: JDK 1.5.0
• Number of files: 108
• Number of lines: 33,128
The proposed system is implemented under the following environment.
• OS: Windows 7 Professional 64 bits
• RAM: 6.0 GB (2.0 GB is used for executing the proposed method)
The proposed system is implemented in Java. The prototype consists of 4333 lines of code in total.

Experimental Result
Figure 10 shows the result of token-based detection. In this experiment, sequences of 50 or more tokens were selected as candidates of code clones. Because the test files contain more than 400,000 tokens, we divided the whole token sequence into blocks of 4000 tokens. We can identify several clone candidates around the upper left part of this diagram. By using this diagram, we can extract Type-1 and Type-2 code clones easily.
After extracting Type-1 and Type-2 code clones, we further try to extract the candidates of Type-3 (gap) clones. The threshold values of dx and dy mentioned in Section 3.3 were set to 500. The result is shown in Figure 11.

Computing Time Comparison
[...] $O(m)$ [15]. Therefore the complexity of the total computing time is $O(n + m)$. Table 2 shows the comparison of the proposed system with other systems.
CCFinder is the fastest system. However, since it uses the token-based method, it cannot easily detect Type-3 code clones. The other two systems and the proposed system

Quality of Proposed Method
Currently there is no standard criterion for the quality of a code clone detection method. One of the simplest quality measures is the ratio of the number of detected code clones to the number of all code clones. However, this measure is virtually useless because we cannot count all code clones in large-scale software.
Bellon [16] proposed a new assessment metric: select several sample code clones from the set of detected clones and evaluate whether they are clones or not. By using this metric, we can assess the precision rate of detection, but we cannot assess the recall rate (see footnote 3). However, in the detection of code clones, the precision rate is more important than the recall rate. A low precision rate means that the system generates many noisy code fragments that are not clones.
This wastes time in checking whether they are clones or not. On the other hand, even if undetected code clones remain in the source code, no serious negative effect on the quality of software will occur. Therefore, we use Bellon's method for assessing the proposed method.
3 The terms "precision rate" and "recall rate" are commonly used in the fields of information retrieval and pattern recognition. Precision rate is the fraction of detected instances that are relevant, and recall rate is the fraction of relevant instances that are detected.
The number of common tokens in the two code fragments of Figure 13 is 22, and the ratios of common tokens are 56% (left) and 48% (right). After converting these two codes into two ASTs, we compare the ratio of common nodes in the ASTs. The number of nodes of the left code is 77 and that of the right code is 79. The number of common nodes of these two ASTs is 63. Therefore the ratios of common nodes are 78% and 82% respectively. A higher common ratio (of tokens or nodes) means a higher similarity of the two codes. Comparing common tokens with common nodes, the ratio of common nodes is higher than that of common tokens. This means that using the tree-based method can detect code clones more precisely.
The overall average ratio of common tokens in our experiment is 57% and that of common nodes is 78%. Bellon's experiment reports that the average precision of previous works is approximately 64% [16]. Therefore, we conclude that our proposed method has sufficient precision.
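The ratio of common tokens used above can be computed as in the following sketch. The text does not state how common tokens are counted, so the multiset intersection used here is an assumption, and the function name `common_ratios` and the sample token lists are hypothetical.

```python
from collections import Counter


def common_ratios(left, right):
    """Return (number of common items, ratio w.r.t. left, ratio w.r.t. right).

    Common items are counted as a multiset intersection (an assumption).
    """
    common = sum((Counter(left) & Counter(right)).values())
    return common, common / len(left), common / len(right)


# Small made-up example: three tokens are shared between the two fragments.
print(common_ratios(list("aabbc"), list("abbd")))

# Arithmetic check of the token ratios reported for Figure 13.
print(round(100 * 22 / 39), round(100 * 22 / 46))
```

The last line reproduces the reported token ratios of 56% and 48% from the counts 22, 39, and 46.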

Conclusions
In this article, we proposed a new method of code clone detection that can detect all types of code clones. Our method combines the token-based method and the tree-based method. The former runs faster but cannot detect all types of code clones; the latter can detect all types of code clones but requires large computing resources (memory and CPU time). Therefore, applying tree-based methods to large-scale software is virtually impossible. We combine these two methods: the token-based method is used to extract the candidates of code clones, the extracted candidates are transformed into abstract syntax trees (ASTs), and the tree-based method is applied to these trees. By narrowing the candidates of code clones and reducing the number of comparison operations, the computing time is reduced to a reasonable level.
An experimental evaluation was conducted using sample files with approximately 35,000 lines of source code. The proposed method is 3 to 4 times faster than conventional tools such as DECKARD and CloneDR, both of which adopt the tree-based method. Detection accuracy was assessed using Bellon's criterion. The proposed method keeps almost the same accuracy as the conventional tools. Based on these evaluation

Figure 1 is a conceptual illustration of code clones. Consider the sample source code shown in Figure 1. The code fragment on the left-hand side is the original code. There are two other code fragments on the right-hand side of the source code, illustrated by rectangles. These two fragments are assumed to be identical or similar to the original, and are therefore code clones.

Figure 2. Ratio of code clone type.

Figure 3. Overall steps of proposed method.

Figure 9. Tree transformation.
For example, lines 121 to 210 of code B are a code clone of lines 113 to 202 of source code A.

Figure 12 shows the computing time of code clone detection. This graph shows

Figure 11. Result of the detection of gap code clone.

Figure 12. Computing time of three methods.

Figure 13 is an example of detected code clones. The left-hand code contains 39 tokens and the right-hand code contains 46 tokens. The number of common tokens
2 $9000 \times \frac{2.0}{3.2}$.

Figure 13. Example of detected code clones.

Table 2. Comparison of computing time.
This table shows that the proposed system is three to four times faster than the other two tree-based systems (DECKARD and CloneDR). Note that DECKARD uses the same CPU clock and memory size, whereas those of CloneDR are lower and smaller. Therefore the comparison with CloneDR is not accurate. Clone detection does not involve many disk accesses, so the computing time depends mainly on the CPU time, which is roughly proportional to the clock frequency. Therefore, if CloneDR ran at the same clock frequency as DECKARD and the proposed system, its computing time would be approximately 6000 sec (see footnote 2). The proposed system is faster than the other two tree-based systems because of the difference in the number of tree comparisons: our system narrows the candidates of code clones by the token-based method.