A Novel Mathematical Model for Similarity Search in Pattern Matching Algorithms

Modern applications require large databases to be searched for regions that are similar to a given pattern. DNA sequence analysis, speech and text recognition, artificial intelligence, the Internet of Things, and many other applications depend heavily on pattern matching or similarity searches. In this paper, we discuss some of the string matching solutions developed in the past. Then, we present a novel mathematical model to search for a given pattern and its near approximations in the text.


Introduction
Similarity searches, commonly known as approximate string matching, allow for some mismatches between the text and the pattern. They are widely used in computational biology, search engines, data mining, signal processing, digital dictionaries, and many other applications where exact and similar patterns are to be searched for in a given text. Most of the algorithms developed prior to the 1990s, such as those of Morris and Pratt [1], Aho and Corasick [2], Knuth et al. [3], Boyer and Moore [4], Horspool [5], and Karp and Rabin [6], were designed to search for exact pattern matches in the text. However, with some modifications, they can also be used to find approximate matches.
The term "distance" is often used when comparing two strings for similarity. A smaller distance indicates a greater similarity. The Hamming distance [7] measures the similarity between two strings of equal length by counting the character mismatches at corresponding locations. Let R and S be two non-empty equal-length strings of size M such that $R = r_0 r_1 \ldots r_{M-1}$ and $S = s_0 s_1 \ldots s_{M-1}$. Then, the Hamming distance between R and S is given by ham(R, S) = number of locations i where $r_i \neq s_i$, $0 \le i \le M-1$. Clearly, if R and S are identical, then ham(R, S) = 0.
For a given pattern P, modern applications may require a database to be searched for exact or approximate matches of P. In the literature, this problem is sometimes referred to as "string matching with k mismatches", which can be stated as follows: Given the text $T = t_0 t_1 \ldots t_{N-1}$, the pattern $P = p_0 p_1 \ldots p_{M-1}$, and an integer k, find all locations in T where P matches with at most k mismatches.
To solve the "k-mismatch" problem, Landau and Vishkin [8] proposed a suffix-tree-based algorithm. They used a suffix tree to preprocess the text and the pattern in O(N + M) time, and then report k mismatches in O(kN) time. However, the algorithm requires O(k(M + N)) space, which is a concern when N becomes large. Galil and Giancarlo [9] presented an algorithm that uses O(kN) time and O(M) space to solve the same problem. Amir et al. [10] developed an algorithm to identify all such locations in $O(N\sqrt{k \log k})$ time. As k increases, the performance of these algorithms deteriorates, and approaches O(MN) as k approaches M. For k = M, the problem becomes independent of k and reduces to "string matching with mismatches", which can be stated as follows: Given the text [...] [13]. For a detailed survey on approximate string matching, refer to Navarro [14] and Boytsov [15].
Most of the algorithms developed in the past use the data structures and methods outlined above to create indexes over the text or the pattern to accelerate the search process. However, their costly maintenance has always been a cause of concern.
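As a point of reference for the definitions above, the following minimal Python sketch computes the Hamming distance and solves the k-mismatch problem by brute force in O(MN) time. It is only a baseline for illustration, not any of the algorithms of [8], [9], or [10]; the function names `hamming` and `k_mismatch_naive` are our own.

```python
def hamming(r: str, s: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(r) != len(s):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(r, s))


def k_mismatch_naive(text: str, pattern: str, k: int) -> list[int]:
    """All shifts i where the pattern matches text[i:i+M] with at most k mismatches.

    Brute force: every one of the N - M + 1 windows is compared in full.
    """
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if hamming(pattern, text[i:i + m]) <= k]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))        # 3
    print(k_mismatch_naive("abcabd", "abd", 1)) # [0, 3]
```

With k = 0 the same function reports exact matches only, which shows how the exact and approximate problems relate.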

Assumptions and Notations
We use the following assumptions and notations. "λ" represents a finite non-empty ordered set of characters called an alphabet, such that $|\lambda| = \sigma$ is the size of the alphabet. "$\lambda_i$" is the i-th character in λ such that $1 \le i \le \sigma$. A string S is a finite sequence of characters defined over the alphabet λ. S[i] or $S_i$ represents the i-th character of string S, where "i" is referred to as the shift, location, or index in S. T and P represent non-empty text and pattern strings defined over the alphabet λ. N and M represent the sizes of the text and the pattern respectively, such that M ≤ N. We use the phrase "number of matches of P in T at shift t" to refer to the number of character matches when pattern P is aligned with shift t in T.

The Model
Traditional pattern matching algorithms attempt to align P starting from the first character of T, which may result in losing some valuable information regarding the character matches. Let us consider an example: suppose pattern P = ABCDEF and text T = CDEFABCD. Assuming 0 is the initial index of the strings, a four-character match is found when P is aligned at index −2 in T. All traditional algorithms would lose this information, as they attempt to align P from index 0 in T. In other words, traditional algorithms consider the indexes in the range $0 \le i \le N - M$, where i represents the text index. However, as we have seen, considering i in the range $-(M-1) \le i \le (N-1)$ may provide additional information, particularly when the pattern is considerably large and the character matches exist at opposite ends of the pattern or text. The lemma and proof given below are already discussed in our previous work [16]. However, a brief discussion is provided here for quick reference.
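The example above can be checked with a short Python sketch. It assumes that the extended range of shifts runs from −(M−1) to N−1 and that pattern characters hanging past either end of the text are simply ignored; both function names are illustrative and do not come from the paper.

```python
def matches_at_shift(text: str, pattern: str, shift: int) -> int:
    """Character matches when pattern[0] is aligned with text[shift].

    Pattern positions that fall outside the text are skipped (assumption).
    """
    n = len(text)
    return sum(1 for j, ch in enumerate(pattern)
               if 0 <= shift + j < n and text[shift + j] == ch)


def all_shift_counts(text: str, pattern: str) -> dict[int, int]:
    """Match counts for every shift in the extended range -(M-1) .. N-1."""
    n, m = len(text), len(pattern)
    return {s: matches_at_shift(text, pattern, s) for s in range(-(m - 1), n)}


if __name__ == "__main__":
    counts = all_shift_counts("CDEFABCD", "ABCDEF")
    print(counts[-2])                        # 4: CDEF overlaps the start of T
    print(counts[4])                         # 4: ABCD overlaps the end of T
    print(max(counts[s] for s in range(3)))  # 0: best count in the range 0..N-M
```

For this pair of strings the traditional range 0 ≤ i ≤ N − M yields no character matches at all, while the two overhanging alignments each yield four.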
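The Lemma: Let T and P be non-empty text and pattern strings defined over an alphabet λ such that $T = t_0 t_1 \ldots t_{N-1}$ and $P = p_0 p_1 \ldots p_{M-1}$. Corresponding to every index j in P, we define a set $R_j$ such that $R_j = \{\, i - j : t_i = p_j,\ 0 \le i \le N-1 \,\}$. Let S represent a set such that $S = \bigcap_{j=0}^{M-1} R_j$. Then: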

1) Every integer $s \in S$ represents an index in T where an exact match of P is found when P is aligned at s in T.
2) The cardinality |S| represents the number of occurrences of P in T. If |S| = 0, then P is not present in T.
Proof: Suppose P is found in T when P is aligned at index s in T. Then, we have to show that $s \in S$. If P appears in T at shift (index) s, that means all M characters of pattern [...]
Example 2: Consider Example 1. The integer 6 appears in three sets: $R_0$, $R_1$, and $R_3$. Hence, the frequency of the integer 6 is $f_6 = 3$, which shows three character matches of P when P is aligned with shift 6 in T. Similarly, the frequencies of the integers 0 and 80 reveal two character matches of P when P is aligned with locations 0 and 80 in T.
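The lemma and the frequency argument can be illustrated with a small Python sketch. It assumes that each set R_j holds the shifts s = i − j for which the text character at i equals the pattern character at j, and that S is the intersection of all R_j. The strings T = ABABC and P = ABC are hypothetical and are not the paper's Example 1; the function names are also our own.

```python
from collections import Counter


def build_shift_sets(text: str, pattern: str) -> list[set[int]]:
    """R[j] = set of shifts s = i - j with text[i] == pattern[j] (assumed definition)."""
    positions: dict[str, list[int]] = {}
    for i, ch in enumerate(text):
        positions.setdefault(ch, []).append(i)
    return [{i - j for i in positions.get(ch, [])}
            for j, ch in enumerate(pattern)]


def exact_matches(shift_sets: list[set[int]]) -> list[int]:
    """S = intersection of all R[j]: shifts where every pattern character matches."""
    return sorted(set.intersection(*shift_sets))


def shift_frequencies(shift_sets: list[set[int]]) -> Counter:
    """f[s] = number of sets R[j] containing s = character matches at shift s."""
    return Counter(s for r in shift_sets for s in r)


if __name__ == "__main__":
    sets = build_shift_sets("ABABC", "ABC")
    print(sets)                        # [{0, 2}, {0, 2}, {2}]
    print(exact_matches(sets))         # [2]
    print(shift_frequencies(sets)[0])  # 2: A and B match at shift 0, C does not
```

Here |S| = 1, so P occurs once in T, and the frequency of shift 0 reports a two-character near match.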

Model Implementation
We present another example to see how the model described above can be implemented successfully. Consider the pattern array P = FCTHZCTZCF, and the text array T = SKRFCTHZCTZCFTYCTZGHTTCTHZTHZFCTHZCTZCFT.
The first column of Table 1 below summarizes all unique characters of the pattern. The second and third columns of the table present the shifts of the corresponding pattern characters in the text and in the pattern respectively. Each row of the last column represents set R j as defined in the lemma.
As noted above, the frequency of occurrence $f_s$ of an integer "s" across all sets represents the number of character matches of P at shift s in T. Therefore, we simply need a mechanism for counting the number of occurrences of each "s" in all sets in Table 1. In Figure 1, location 21 has been hit 5 times and location 14 has been hit 4 times, representing 5 and 4 character matches of P at alignment locations 21 and 14 respectively. The integer −1 appears twice in Table 1, suggesting that we can obtain two character matches if P is aligned at location −1 in T. Such hits are ignored in Figure 1 because we cannot record hits at negative array locations. This issue can be resolved by assuming the initial index of the text to be ≥ M − 1 rather than 0. This ensures that s ≥ 0 for all $s \in R_j$. Figure 2 shows the re-indexed version of the array, where each text character location is shifted by M. Therefore, index M = 10 in the re-indexed array corresponds to location 0 of the actual text array, index 9 corresponds to location −1, and so on. The benefit is straightforward: with the re-indexed array, we can say that 2 hits are found if the pattern is aligned at location 9, which corresponds to location −1 in the actual text array.
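One possible way to realize the hit-counting array is sketched below in Python, using the pattern and text of this section. It assumes an offset of M, as in the re-indexed array described above, so that array index s + M holds the number of character matches at shift s. The name `count_hits` and the dictionary of character positions are illustrative choices, not the paper's implementation.

```python
def count_hits(text: str, pattern: str) -> tuple[list[int], int]:
    """Return (hits, offset) where hits[s + offset] counts matches at shift s."""
    n, m = len(text), len(pattern)
    offset = m  # index M in the array corresponds to location 0 of the text
    # Text positions of each character that occurs in the pattern.
    positions: dict[str, list[int]] = {ch: [] for ch in set(pattern)}
    for i, ch in enumerate(text):
        if ch in positions:
            positions[ch].append(i)
    # Shifts range from -(M-1) to N-1, so indexes 1 .. N+M-1 are used.
    hits = [0] * (n + m)
    for j, ch in enumerate(pattern):
        for i in positions[ch]:
            hits[i - j + offset] += 1  # one hit for shift s = i - j
    return hits, offset


if __name__ == "__main__":
    P = "FCTHZCTZCF"
    T = "SKRFCTHZCTZCFTYCTZGHTTCTHZTHZFCTHZCTZCFT"
    hits, offset = count_hits(T, P)
    print(hits[21 + offset], hits[14 + offset])  # 5 4
    print(hits[9])                               # 2: shift -1 in the actual text
    print([i - offset for i, h in enumerate(hits) if h == len(P)])  # [3, 29]
```

Array entries equal to M mark exact occurrences of P, while smaller non-zero entries rank the approximate alignments, including those that overhang the start of the text.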

Conclusion
Practical solutions can be built on the model we have presented in this paper. The approach we have followed may become an alternative to existing solutions.