Modern applications require large databases to be searched for regions that are similar to a given pattern. DNA sequence analysis, speech and text recognition, artificial intelligence, the Internet of Things, and many other applications depend heavily on pattern matching or similarity searches. In this paper, we discuss some of the string matching solutions developed in the past. Then, we present a novel mathematical model to search for a given pattern and its near approximations in the text.

Similarity searches, commonly known as approximate string matching, allow for some mismatches between the text and the pattern. They are widely used in computational biology, search engines, data mining, signal processing, digital dictionaries, and many other applications where exact and similar patterns are to be searched for in a given text. Most of the algorithms developed prior to the 1990s, such as Morris and Pratt [

The term “distance” is often used when comparing two strings for similarity. A smaller distance indicates greater similarity. The Hamming distance [_{i} be the Hamming distance such that $hd_i = \mathrm{ham}(P, t_i t_{i+1} \cdots t_{i+M-1})$, where $0 \le i \le N - M$. Then, for a given integer $k$ such that $0 \le k \le M$, report all locations $i$ in $T$ where $hd_i \le k$.
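As a point of reference for the k-mismatch problem stated above, the following is a minimal brute-force sketch. The function names `ham` and `k_mismatch_naive` and the sample strings are illustrative assumptions, not part of the original text, and the sketch is a naive O(NM) baseline rather than any of the cited algorithms.

```python
def ham(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_naive(text: str, pattern: str, k: int) -> list[int]:
    """Report every shift i (0 <= i <= N - M) with ham(P, T[i:i+M]) <= k."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if ham(pattern, text[i:i + m]) <= k]


# Illustrative strings (assumed for this sketch).
print(k_mismatch_naive("GCABABABCBA", "ABAB", 0))  # [2, 4]
print(k_mismatch_naive("GCABABABCBA", "ABAB", 1))  # [2, 4, 6]
```

With k = 0 the report reduces to exact matching; increasing k admits shifts with up to k mismatching characters.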

To solve the “k-mismatch” problem, Landau and Vishkin [_{i} such that $hd_i = \mathrm{ham}(P, t_i t_{i+1} \cdots t_{i+M-1})$. Abrahamson [_{max}) time and $O(2M + \sigma)$ space algorithm, where $f_{max}$ is the frequency of the most commonly occurring character in the pattern. Many of these algorithms are covered in Crochemore et al. [

We use the following assumptions and notations. “λ” represents a finite non-empty ordered set of characters called an alphabet, such that $|\lambda| = \sigma$ is the size of the alphabet. “$\lambda_i$” is the $i$-th character in λ such that $1 \le i \le \sigma$. A string S is a finite sequence of characters defined over the alphabet λ. S[i] or $S_i$ represents the $i$-th character of string S, where “i” is referred to as the shift, location, or index in S. T and P represent the non-empty text and pattern strings defined over the alphabet λ. N and M represent the sizes of the text and the pattern, respectively, such that $M \le N$. We use the phrase “number of matches of P in T at shift t” to refer to the number of character matches when pattern P is aligned with shift t in T.

Traditional pattern matching algorithms attempt to align P starting from the first character of T, which may result in losing valuable information about character matches. Consider an example: suppose pattern P = ABCDEF and text T = CDEFABCD. Taking 0 as the initial index of the strings, a four-character match is found when P is aligned at index −2 in T. Traditional algorithms lose this information because they only attempt to align P from index 0 in T. In other words, traditional algorithms consider the indexes in the range $0 \le i \le N - M$, where $i$ represents the text index. However, as we have seen, considering $i$ in the range $(1 - M) \le i \le (N - 1)$ may provide additional information, particularly when the pattern is considerably large and the character matches exist at opposite ends of the pattern or text. The lemma and proof given below are already discussed in our previous work [
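The effect of the extended index range can be checked with a short sketch. The helper name `matches_at_shift` is a hypothetical name introduced here; it simply counts character matches at a given alignment and ignores pattern positions that fall outside the text.

```python
def matches_at_shift(text: str, pattern: str, shift: int) -> int:
    """Count positions j with pattern[j] == text[shift + j].

    Pattern positions that land outside the text are skipped, so
    negative shifts and shifts beyond N - M are allowed.
    """
    n = len(text)
    return sum(
        1
        for j, ch in enumerate(pattern)
        if 0 <= shift + j < n and text[shift + j] == ch
    )


text, pattern = "CDEFABCD", "ABCDEF"
n, m = len(text), len(pattern)

# Scan the extended range (1 - M) <= i <= (N - 1).
for shift in range(1 - m, n):
    count = matches_at_shift(text, pattern, shift)
    if count:
        print(shift, count)
# prints:
# -2 4
# 4 4
```

Both reported alignments lie outside the traditional range 0 ≤ i ≤ N − M (here 0 to 2), where no character of P matches at all.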

The Lemma

Let T and P be non-empty text and pattern strings defined over an alphabet λ such that $T = t_0 t_1 \cdots t_{N-1}$ and $P = p_0 p_1 \cdots p_{M-1}$. Corresponding to every index $j$ in P, we define a set $R_j$ such that $R_j = \{\, i - j \mid t_i = p_j,\ 0 \le i \le N - 1 \,\}$. Further, let S represent the set $S = R_0 \cap R_1 \cap R_2 \cap \cdots \cap R_{M-1}$. Then:

1) Every integer s ∈ S represents an index in T where an exact match of P is found when P is aligned at s in T.

2) The cardinality |S| represents the number of occurrences of P in T. If |S| = 0, then P is not present in T.

Proof: Suppose P is found in T when P is aligned at index s in T. Then we have to show that $s \in S$. If P appears in T at shift (index) s, all M characters of the pattern $P = p_0 p_1 \cdots p_{M-1}$ match the text substring $t_s t_{s+1} t_{s+2} \cdots t_{s+M-1}$. Hence, for every $j$ in P such that $0 \le j \le M - 1$, we have $t_{s+j} = p_j$. Now, from the definition of $R_j$ with $i = s + j$, we get $(s + j) - j = s \in R_j$ for all $0 \le j \le M - 1$, and therefore $s \in S$. Conversely, if $s \in S$, then $s \in R_j$ for every $j$, which means $t_{s+j} = p_j$ for all $0 \le j \le M - 1$, i.e. P matches exactly at index s. Since every $s \in S$ represents an exact match of P when P is aligned at index s in T, |S| equals the number of occurrences of P in T.

Example 1: Let T = GCABABABCBA be a text array and P = ABAB be a pattern array of size 4. For the given text we have $i$ such that $0 \le i \le 10$. For each shift $j$ ($0 \le j \le 3$) in P we create a set $R_j$ such that $R_j = \{\, i - j \mid T[i] = p_j,\ 0 \le i \le 10 \,\}$, which gives $R_0 = \{2, 4, 6, 10\}$, $R_1 = \{2, 4, 6, 8\}$, $R_2 = \{0, 2, 4, 8\}$, and $R_3 = \{0, 2, 4, 6\}$. Hence, $R_0 \cap R_1 \cap R_2 \cap R_3 = S = \{2, 4\}$ and $|S| = 2$, which shows that P occurs in T twice, at locations 2 and 4.
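The construction in the lemma can be reproduced with a few lines of Python. The function names `shift_sets` and `exact_matches` are illustrative and do not come from the paper. The sketch scans the text once per pattern index, so it is meant to make the definition concrete rather than to be efficient.

```python
def shift_sets(text: str, pattern: str) -> list[set[int]]:
    """Return [R_0, ..., R_{M-1}] with R_j = {i - j | text[i] == pattern[j]}."""
    return [
        {i - j for i, t in enumerate(text) if t == p}
        for j, p in enumerate(pattern)
    ]


def exact_matches(text: str, pattern: str) -> set[int]:
    """S = R_0 & R_1 & ... & R_{M-1}; the pattern is assumed non-empty."""
    return set.intersection(*shift_sets(text, pattern))


for j, r in enumerate(shift_sets("GCABABABCBA", "ABAB")):
    print(f"R_{j} =", sorted(r))
print("S =", sorted(exact_matches("GCABABABCBA", "ABAB")))  # S = [2, 4]
```

The printed sets agree with Example 1, and the size of S gives the number of occurrences of P in T.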

Corollary 1: Let the M sets $R_j$ be defined as in the lemma above, and let $f_s$ be the frequency of occurrence of an integer “s” across all sets. Then $f_s$ represents the number of character matches when P is aligned with shift s in T.

Proof: We have already proved that any $s \in S = R_0 \cap R_1 \cap R_2 \cap \cdots \cap R_{M-1}$ represents exactly M character matches of $P = p_0 p_1 p_2 \cdots p_{M-1}$ at shift s in T. It follows that each $s \in R_j$ represents a single character match $p_j$ of P, so the frequency of the integer “s”, $f_s$, equals the number of character matches of P at shift s in T. Note that for an exact match of P at shift s in T, “s” must be present in all M sets; in that case $s \in S$, i.e. $f_s = M$. Also, $M - f_s$ represents the Hamming distance, i.e. the number of character mismatches when pattern P is aligned with shift s in T.

Example 2: In Example 1, the integer 6 appears in three sets: $R_0$, $R_1$ and $R_3$. Hence, the frequency of the integer 6 is $f_6 = 3$, which shows three character matches of P when P is aligned with shift 6 in T. Similarly, $f_0 = f_8 = 2$ reveals two character matches of P when P is aligned with locations 0 and 8 in T.
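The frequencies in the corollary can be tallied directly, without materializing the individual sets. The sketch below uses `collections.Counter`; the function name `match_frequencies` is an assumption made for illustration.

```python
from collections import Counter


def match_frequencies(text: str, pattern: str) -> Counter:
    """Return f_s for every shift s, i.e. how many sets R_j contain s."""
    freq = Counter()
    for j, p in enumerate(pattern):
        for i, t in enumerate(text):
            if t == p:
                # The pair (i, j) contributes the shift i - j to R_j.
                freq[i - j] += 1
    return freq


text, pattern = "GCABABABCBA", "ABAB"
freq = match_frequencies(text, pattern)
print(freq[6], freq[0], freq[8])   # 3 2 2
print(len(pattern) - freq[6])      # 1, the Hamming distance at shift 6
```

Note that at shift 8 the pattern extends past the end of the text, so M − f_s counts the off-text position as a mismatch as well.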

We present another example to see how the model described above can be implemented successfully. Consider the pattern array P = FCTHZCTZCF, and the text array T = SKRFCTHZCTZCFTYCTZGHTTCTHZTHZFCTHZCTZCFT. The first column of _{j} as defined in the lemma.

As noted above, the frequency of occurrence $f_s$ of an integer “s” in all sets represents the number of character matches of P at shift s in T. Therefore, we simply need a mechanism for counting the number of occurrences of each “s” in all sets in

| Pattern character | Shifts in T (i) | Shift in P (j) | $R_j = \{\, i - j \mid T[i] = p_j,\ 0 \le i \le N - 1 \,\}$ |
|---|---|---|---|
| F | 3, 12, 29, 38 | 0 | $R_0 = \{3, 12, 29, 38\}$ |
| | | 9 | $R_9 = \{-6, 3, 20, 29\}$ |
| C | 4, 8, 11, 15, 22, 30, 34, 37 | 1 | $R_1 = \{3, 7, 10, 14, 21, 29, 33, 36\}$ |
| | | 5 | $R_5 = \{-1, 3, 6, 10, 17, 25, 29, 32\}$ |
| | | 8 | $R_8 = \{-4, 0, 3, 7, 14, 22, 26, 29\}$ |
| T | 5, 9, 13, 16, 20, 21, 23, 26, 31, 35, 39 | 2 | $R_2 = \{3, 7, 11, 14, 18, 19, 21, 24, 29, 33, 37\}$ |
| | | 6 | $R_6 = \{-1, 3, 7, 10, 14, 15, 17, 20, 25, 29, 33\}$ |
| H | 6, 19, 24, 27, 32 | 3 | $R_3 = \{3, 16, 21, 24, 29\}$ |
| Z | 7, 10, 17, 25, 28, 33, 36 | 4 | $R_4 = \{3, 6, 13, 21, 24, 29, 32\}$ |
| | | 7 | $R_7 = \{0, 3, 10, 18, 21, 26, 29\}$ |

has been hit 5 times, location 14 has been hit 4 times, representing 5 and 4 character matches of P at alignment locations 21 and 14 respectively.
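One way to realize this hit counting is sketched below for the strings P = FCTHZCTZCF and T = SKRFCTHZCTZCFTYCTZGHTTCTHZTHZFCTHZCTZCFT. It is not necessarily the implementation used in the paper. The function name `hit_counts`, the per-character position index, and the offset that makes room for negative shifts (an array of N + M − 1 slots) are assumptions made for this sketch.

```python
from collections import defaultdict


def hit_counts(text: str, pattern: str) -> dict[int, int]:
    """Map each shift s in (1 - M) .. (N - 1) to its hit count f_s."""
    n, m = len(text), len(pattern)

    # character -> list of text indexes i where it occurs
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)

    offset = m - 1                 # assumption: slot 0 holds shift 1 - M
    hit = [0] * (n + m - 1)
    for j, ch in enumerate(pattern):
        for i in positions.get(ch, ()):
            hit[i - j + offset] += 1   # shift i - j belongs to R_j

    # Keep only the shifts that were hit at least once.
    return {slot - offset: c for slot, c in enumerate(hit) if c}


P = "FCTHZCTZCF"
T = "SKRFCTHZCTZCFTYCTZGHTTCTHZTHZFCTHZCTZCFT"
counts = hit_counts(T, P)
print(counts[21], counts[14])                                # 5 4
print(sorted(s for s, c in counts.items() if c == len(P)))   # [3, 29]
```

Shifts whose count equals M (here 3 and 29) are exact occurrences of P, while the remaining counts give the number of matching characters at partial alignments.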

The method described above has two shortcomings. First, we need an array of size N, which is undesirable for large values of N. To resolve this, the hit[_{5} and R_{6} in

Highly practical solutions can be drawn based on the model we have presented in this paper. The novel approach we have followed may become an alternative to existing solutions.

We thank anonymous reviewers for their comments and suggestions.

The author declares no conflicts of interest regarding the publication of this paper.

Vinod-Prasad, P. (2020) A Novel Mathematical Model for Similarity Search in Pattern Matching Algorithms. Journal of Computer and Communications, 8, 94-99. https://doi.org/10.4236/jcc.2020.89008