Enhancing ERS-A Algorithm for Pattern Matching (EERS-A)

Pattern matching is a very important topic in computer science. It has been used in various applications such as information retrieval, virus scanning, DNA sequence analysis, data mining, machine learning, network security and pattern recognition. This paper has presented a new pattern matching algorithm—Enhanced ERS-A, which is an improvement over ERS-S algorithm. In ERS-A, two sliding windows are used to scan the text from the left and the right simultaneously. The proposed algorithm also scans the text from the left and the right simultaneously as well as making comparisons with the pattern from both sides simultaneously. The comparisons done between the text and the pattern are done from both sides in parallel. The shift technique used in the Enhanced ERS-A is the four consecutive characters in the text immediately following the pattern window. The experimental results show that the Enhanced ERS-A has enhanced the process of pattern matching by reducing the number of comparisons performed.


Introduction
Many applications use pattern matching algorithms such as search engines, anti-virus and biological applications such as DNA [1]- [5].
Most of the algorithms have been implemented in order to make the searching process faster and more efficient; this can be achieved either by reducing the number of attempts, comparisons or by both.Some of algo-rithms made enhancements on preprocessing phase [6]- [9] and searching phase [10]- [12], and others modified the shifting technique [5] [9] [13]- [15] used in a case of a mismatch between the pattern and the text.
Searching process differs from one algorithm to another: some algorithms scan the text from one side only, either from left [15] or from right [5]; other algorithms use two sliding windows to scan the text from the left and the right sides simultaneously [10]- [13].
Some algorithms made modification on the shifting values.The shift value is the amount of shift that the sliding window will move in a case of a mismatch between the text and the window; shift values depend on the number of consecutive characters immediately after the sliding window.Some algorithms use one consecutive character [16] [17]; others use two [10] [12] [15] and few use three [13].
In this paper, we propose a new pattern matching algorithm-Enhanced ERS-A.The algorithm uses two sliding windows such as TSW [12], ETSW [10], ERS-A [11] and EBR [13] algorithms.It also uses four consecutive characters to compute the shift values such as RS-A [5] and ERS-A [11] algorithms.In addition to that, it uses the same comparison technique between the pattern and the text that was used in ETSW [10].
This paper compares between the new algorithms of EERS-A, ETSW [10] and ERS-A [11].The results showed that the new algorithm is better than others as explained in Section 5.The reminder of this paper is organized as follows: Section 2 consists of related works; Section 3 explains EERS-A algorithm; Section 4 covers the analysis.Finally, the conclusion and the future work are presented in Section 6.

Related Works
Pattern matching algorithms were needed in many applications [2] [14] [18]- [20].Some algorithms made enhancement in a memory used in preprocessing phase, while others try to make the searching process faster and more efficient [8] [9] [21].
The Berry-Ravindran algorithm (BR) [15] made enhancement in Boyer Moore algorithm [16].In a case of a mismatch between the text and the pattern window Boyer uses one consecutive character in the text immediately to the right of the pattern window to determine amount of shift the window must move, on the other hand Berry-Ravindran uses two consecutive characters.
Many enhancements made in Berry-Ravindran algorithm (BR) [15], some algorithms changed the searching process but on the other hand they used the same bad character shift values, bad character shift depends on using two consecutive characters immediately that follow the text such as TSW [12] and ETSW [10].Others made enhancement on the shift values such as RS-A [5] and ERS-A [11] these two algorithms changed the shift values by using four consecutive character instead of two, also EBR [13] algorithm made modifications on Berry-Ravindran bad character shift by depending on using three consecutive characters.
Two Sliding Windows algorithm TSW [12] used the same preprocessing technique used in BR [15] but made enhancements on the searching phase.In a case of a mismatch between the pattern and text, it uses two consecutive characters to determine the amount of shift the pattern window must slide which is the same technique used in BR [15], but instead of using one pattern window to scan the text as in BR [15] it uses two sliding windows that scan it in parallel.
Searching process become faster in Enhanced Two Sliding Window algorithm (ETSW) [10].ETSW [10] made enhancements on TSW [12] by minimizing the number of comparisons needed but it doesn't make any changes on the number of attempts.The reason for this is that two algorithms used the same preprocessing techniques and also used the two sliding window, the only difference between them is related to the idea that TSW [12] compares the text with the pattern from one side of the pattern while ETSW [10] compare the text with the pattern from the both sides of the pattern at the same time, the best time complexity is ( ) and the worst case time complexity is [5] algorithm used only one window to search for a pattern p in a text t from the right side of the text.In order to make the searching process faster ERS-A [11] algorithms made enhancements on RS-A by using two windows instead of one.The left window aligned with the text from the left and in a case of a mismatch the window will be shifted to the right, and the right window aligned with text from the right and in a case of a mismatch the window will be shifted to the left.The two windows slide in parallel.The search will stop either when the pattern is found or in a case the pattern not found at all.
The EERS-A made enhancement on ERS-A [11], it uses the same preprocessing technique that depends on using four consecutive characters to determine the amount of shift, it also uses two sliding windows to scan the text from the left and right sides at the same time.Enhancements made on a comparison between the text and the

Pre-Processing Phase
Two arrays are generated in this phase nextl and nextr, each array is one-dimensional array.The values of the nextr array are calculated according to RS-A algorithm [5].Initially the indexes of the four consecutive characters of the text after aligning it with the right window are (n − m − 4), (n − m − 3), (n − m − 2) and (n − m − 1) for a, b, c and d respectively, which are used to calculate the shift values in a case of a mismatch from the right side of the text as in Equation (1).
On the other hand, the values of the nextl array are calculated according ERS-A algorithm [11].The shift values are calculated by using four consecutive text characters a, b, c and d which are aligned immediately to the right of the left sliding window.Initially, the indexes of the four consecutive characters in the text string needed to search the text from the left side are (m + 1), (m + 2), (m + 3) and (m + 4) for a, b, c and d respectively as in Equation ( 2 The pre-processing phase is the same in both ERS-A and EERS-A algorithms but the searching phase is enhanced in EERS-A.

Searching Phase
In EERS-A algorithm the searching process started by aligning two windows with the text, the left window aligned with the beginning of the text and the right window aligned with the end of the text.In case of a mismatch the proposed window will be shifted according to next array values calculated in a preprocessing phase; the left window will be shifted to the right and the right window will be shifted to left.
Four pointers will be used in a comparison process between the text and the pattern, two for each window.The left window uses the L and temp_newlindex pointers while the right window uses the R and temp_newrindex pointers.
Comparisons between the text and the pattern from both sides is the same, in each window the first character of the pattern is compared with the corresponding text character and at the same time the last character of the pattern is compared with the corresponding character in the text.Each window uses different pointers as explained below:

Left_Window Search Process
After aligning the left window with the text from the left, comparisons between the pattern and the text will be done using two pointers, L and temp_newlindex, L pointer points at the last character of the pattern and the temp_newlindex points at the first character of the pattern, comparisons are made between the pointers and the corresponding characters of the text.
If a mismatch occurs in any pointer a shift will occurs according to the ERS-A bad character algorithm.
In case of a match the two pointers will move.
The L pointer will move to the left and the temp_newlindex will move to the right.A movement of pointers in a case of a match will stop either when two pointers reach the middle of the pattern or when the L pointer is less than or equal the temp_newlindex, in either case the pattern is found.

Right_Window Search Process
After aligning the right window with the text from the right, comparisons between the pattern and the text will be done using two pointers, R and temp_newrindex, R pointer points at the first character of the pattern and the temp_newrindex points at the last character of the pattern, comparisons are made between the pointers and the corresponding characters of the text.
If a mismatch occurs in any pointer a shift will occurs according to the ERS-A bad character algorithm.In case of a match the two pointers will move.The R pointer will move to the right and the temp_newrindex will move to the left.A movement of pointers in a case of a match will stop either when two pointers reach the middle of the pattern or when the R pointer is less than or equal the temp_newrindex, in either case the pattern is found.
The proposed EERS-A algorithm is explained in Figure 1.

Pre-Processing Phase
Initially give it an index starting from 0. For example for the pattern structure ACCBCBAC, the consecutive characters ACCB, CCBC, CBCB, BCBA and CBAC are given the indexes 0, 1, 2, 3 and 4 respectively.The shift values for the nextr array are calculated according to Equation (1) while the shift values for the nextl array are calculated according to Equation (2).

Searching Phase
The searching process for the pattern P is explained through the working example as shown in Figure 3.

First Attempt
In the first attempt (see Figure 3(a)), we align the left sliding window with the text from the left.In this case, a comparison is made between the text character located at index 0 (character A) with the leftmost character in the pattern (character A) although a match occurs, the total result is a mismatch since at the same time a comparison must be made between the text character at index 7 (character D) with the rightmost character in the pattern (character C); therefore we take the four consecutive characters of the text at index 8, 9, 10 and 11 which are (B, A, C and E) respectively.To determine the amount of shift (shiftl) we have to do the following: then according to a preprocessing algorithm the shift value will be 3.

Second Attempt
In the second attempt (see Figure 3(b)), we align the right sliding window with the text from the right.In this case, a match occurs between text character at index 42 (A) and left most character of the pattern character A but a mismatch occurs between the last text character B with the last pattern character C so as a total result, there is a mismatch; therefore we take the four consecutive characters from the text at index 38, 39, 40 and 41 which are A, C, C and B respectively.To determine the amount of shift (shiftr), we have to do the following: a) We find the index of ACCB in the pattern which is 0. b) Since we search the text from the right side we use nextr array, and shiftr = nextr[0] = 4; therefore the window will be shifted to the left 4 steps.

Third Attempt
In the third attempt (see Figure 3(c)), a mismatch occurs from the left between the text character at index 3 (character D) and the leftmost character in the pattern (character A) while there is a match between the text character at index 10 (character C) and the rightmost character in the pattern (character C); therefore we take the four consecutive characters from the text at indexes 11, 12, 13 and 14 which are E, D, A and B respectively, since EDAB is not found in the pattern, so the window will be shifted to the right 12 steps.

Fourth Attempt
In the fourth attempt (see Figure 3(d)), a comparison is made between the text character located at index 38 (character A) with the leftmost character in the pattern (character A) at the same time a comparison must be made between the text character at index 45 (character B) with the rightmost character in the pattern (character C) since there is a mismatch we take the four consecutive characters of the text at index 34, 35, 36 and 37 which are D, D, D and A respectively; since DDDB is not found in the pattern, so the window will be shifted to the left 12 steps.

Fifth Attempt
We align the left most character of the pattern P[0] with T [15].A comparison between the pattern and the text characters leads to a complete match at index 15.In this case, the occurrence of the pattern is found using the left window.

Analysis
Preposition 1: The space complexity is ( ) ( ) where m is the pattern length.Preposition 2: The pre-process time complexity is The worst occurs when a match between the text and patterns occurs in all characters except a character at index m/2 in the pattern, and if at the same time the shift values in case of a mismatch equal to 1.

Lemma 2:
The best case time complexity is ( ) The best case occurs when the pattern is found in the leftmost or the rightmost sides of the text.Lemma 3: The average case time complexity is ( . O n m + Proof: The average case occurs when the four consecutive characters of the text directly following the sliding window is not found in the pattern.In this case, the shift value will be (m + 4) and hence the time complexity is

Experimental Results and Discussion
Several experiments have been done using Book 1 from the Calgary corpus [22] to be the text in most of searching algorithms.Book 1 consists of 141,274 words (752,149 characters).Book 1 consists of 141,274 words (752,149 characters).Patterns of different lengths are also taken from Book 1.
Table 1 shows the results of comparing the algorithms ETSW, ERS-A and EERS-A.
In Table 1, the first column related to the pattern length; second column is the number of words in a certain length.It can be clearly seen that the number of attempts in EERS-A and ERS-A are the same since the two algorithms use the same shifting techniques in a case of a mismatch, and according to the results it's clear that the number of comparisons in EERS-A are better than all other algorithms.For example, as shown in Table 1, 2896 words of length 6, the average number of comparisons in ETSW is 7633, in ERS-A is 6750 and in the new algorithms is 6212, which is the minimum value ETSW algorithm and EERS-A algorithm use the same comparison technique, they use two sliding windows to scan the text, and in addition to that they use the same technique when comparing the text with the pattern where comparisons happened from both sides of the pattern simultaneously.It is clear that the number of comparisons and attempts in EERS-A algorithm are better than ERS-A algorithm and this refers to using different shifting technique.While ETSW uses Berry-Ravindran (BR) [15] bad character shift values which uses two consecutive characters, EERS-A uses ERS-A [11] bad character shift values that uses four consecutive characters.
Table 2 shows the average number of attempts and comparisons for 100 words taken from the right side of Book1.It is clear that EERS-A is the best among all others according to the number of comparisons and the same results can be seen in Table 3 and Table 4.
Table 5 and Figure 4 show the average number of comparisons needed to search for patterns with different lengths.The results show that EERS-A finds the patterns with minimum number of comparisons compared with       other algorithms such as TSW, ETSW, RSA and ERS-A.For example, it took EERS-A 8042 comparisons to locate the pattern of length 10 while it took TSW, ETSW, RS-A and ERS-A 10272, 9339, 12927 and 8830 comparisons respectively for the same pattern length.These results show that EERS-A is the best algorithm compared to others.

Conclusions and Future Work
In this paper, we presented a new pattern matching algorithm-Enhanced ERS-A algorithm.The Enhanced ERS-A algorithm enhances the ERS-A's process by utilizing the idea of the two sliding windows and by making comparisons with the pattern from both sides simultaneously.The comparisons done between the text and the pattern are done from both sides in parallel.This process gives the proposed algorithm a preference over the ERS-A algorithm.The Enhanced ERS-A algorithm utilizes the idea of RS shifting algorithm to maximize the shift values.To assess the performance of the proposed algorithm, we considered ETSW and ERS-A algorithms for comparison with the proposed algorithm.The experimental results show that the Enhanced ERS-A shows better results in the number of comparisons performed.

Figure 4 .
Figure 4.The average number of comparisons of TSW, ETSW, RS-A and ERS-A, EERS-A algorithms.

Table 1 .
The average number of attempts and comparisons of ETSW, ERS-A and EERS-A algorithms.

Table 2 .
The average number of attempts and comparisons performed to search for (100) patterns selected from the right side of the text.

Table 3 .
The average number of attempts and comparisons performed to search for (100) patterns selected from the middle of the text.

Table 4 .
The average number of attempts and comparisons performed to search for (100) patterns selected from the left side of the text.

Table 5 .
The average number of attempts and comparisons for patterns with different lengths.