Bo2SS

Bo2SS

4 String Matching Algorithms (Part 1): BF, KMP, SUNDAY, SHIFT-AND, Hash

Some classic and unparalleled algorithms related to single pattern matching

Course Content#

Related Definitions

Single Pattern Matching Problem: Finding whether a certain pattern string $t$ appears in the text string $s$

Multi-pattern Matching Problem: Finding multiple pattern strings $t$

Text String/Main String: The string being searched $s$

Pattern String: The string to be searched $t$

The length of the text string is $m$, and the length of the pattern string is $n$

Brute Force Matching Method#

The cornerstone of all string matching algorithms

Idea#

  • Align each position of the pattern string and the text string sequentially until a complete match is found
  • The idea is the foundation for understanding all matching algorithms: to find the answer without missing anything
    • The core of other algorithm optimizations is to avoid redundancy, reducing attempts that cannot match successfully
  • Time Complexity: $O(m \times n)$

Provoking Thoughts#

How does brute force matching handle the first mismatch?

Image
  • It will match the first position of the pattern string with the second position of the main string
  • If the match is successful, it means the first position of the pattern string is equal to the second position of the text string
    • Since it is known that the second position of the pattern string matches successfully with the second position of the text string
    • There must be a situation where the second position of the pattern string is equal to the first position
  • Whether the second position of the pattern string is equal to the first position can actually be known when obtaining the pattern string
  • For the above image, there is no need to attempt matching the first position of the pattern string with the second position of the main string
    • They definitely do not match
  • Is it possible to skip positions that definitely do not match? The KMP algorithm has its own ideas.

KMP#

State machine thinking, concept of self-structure repetition expansion

Thought Process#

Continuing, focus on positions A, B, C

  • Image
  • [Assume] When a mismatch occurs at the black shadow area, the pattern string $t$ moves forward by the size of ④, at this time the text string $s$'s ② and the pattern string $t$'s ① match successfully
  • [Key] The third part: its left boundary determines the distance $t$ moves forward [same as the right boundary of the first part]
  • [A] In this vertical column, there must be ②=③=①, with ② adjacent to the mismatch position 👉 ③ also adjacent to the mismatch position
  • [B] ③ is actually a suffix of $t$, and ① is a prefix of $t$ 👉 ③ equals some prefix of $t$
  • [C] To ensure no matches are missed, ④ should be as small as possible 👉 ③ should be as large as possible
  • [Conclusion] In summary, through ③, the matching process can be accelerated, and ③ is the longest prefix of $t$ that is adjacent to the mismatch position.

Problem Transformation#

Pre-obtain the size of ③ corresponding to the mismatch, which is also the size of ①

  • Image
  • Find the ① and ③ of the pattern string, use the green characters after ① to match the orange characters at the mismatch position after ②
  • Consider each position of the pattern string: because the mismatch position could be any position of the pattern string
  • The pattern string is known in advance; the text string is unknown and uncontrollable
  • 👉 The $next$ array of the pattern string

Next Array#

  • $next[i]$: Used when the $i$ position matches successfully, but the $i+1$ position mismatches, as the offset for the next attempt to match the pattern string
  • State Definition: The index of the longest prefix that can match in the suffix ending at position $i$ [right boundary]
    • Image
    • Only for the pattern string
  • Transition Formula: Understand by looking at the diagram, focus on positions ① and ② in the diagram
    • $next[i+1]=\left{\begin{aligned}&next[i]+1&t[i+1]=t[next[i]+1]\&next[next[i]]+1&t[i+1]\neq t[next[i]+1]\ &&\ t[i+1]=t[next[next[i]]+1]\&next[next[next[...[i]]]]+1&until\ equal\ or\ the\ end\end{aligned}\right.$
    • Image
    • First look at the second and third lines, determine ①
      • If $t[i+1]=t[next[i]+1]$, then the longest prefix corresponding to $i+1$ is $next[i]+1$
      • Otherwise, move $t$ to the right as short as possible [similar to ④ in the thought process above], which is to find the maximum prefix corresponding to the suffix ending at $j$ in the substring $t[0\to j]$ [this is the definition of the $next$ array]
    • Observe the first and second lines, determine ②
      • If $t[i+1]=t[next[j]+1]$, then the longest prefix corresponding to $i+1$ is $next[j]+1$, where $j=next[i]$
      • Otherwise, continue to move $t$ to the right, checking ③④... until the equality condition is satisfied, or $t$ can no longer move to the right
  • Involves the idea of dynamic programming, see "Interview and Written Algorithm Part 2" — 1 From Recursion to Dynamic Programming (Part 1)
  • Practice Understanding
    • Define a virtual position: -1 [when no prefix matches, the value is -1]
    • Practice 1
    • Image
    • Practice 2
    • Image
    • Answers: -1, -1, 0, -1, 0, 1, 2, 3, 4, 5, 6, 1
  • [PS] Reference KMP Algorithm Detailed Explanation — CSDN
    • In the text, the $next$ array is defined as: the maximum common length of the prefix and suffix [excluding the string itself, otherwise the maximum length is always the string itself]
    • It can be seen that the definition of the $next$ array is not unique
    • The concept of self-structure repetition expansion, when seeking the $next$ array, has this idea

Matching Process#

  1. Traverse each position of the text string $s$ sequentially
  2. Start matching from the beginning of the pattern string $t$
    1. If the current position of the text string $s$ is equal to the current position of the pattern string $t$, then match the next position of the pattern string $t$
    2. Otherwise, based on the $next$ array, continuously move the attempt position of the pattern string $t$ forward until the equality condition is satisfied, match the next position of the pattern string $t$; or cannot move forward anymore, match the next position of the text string $s$
  3. When successfully matching to the end of the pattern string $t$, output the corresponding matching position of the text string $s$; or not found

Summary#

  • Understand the importance of the third part in the pattern string [see thought process]
    • The core to speed up matching, avoiding a large number of useless matching attempts
  • The core to ensure no matches are missed: the third part matches the longest prefix of the pattern string
  • Can be understood from the perspective of a state machine, see code demonstration — State Machine Version KMP
  • Time Complexity: $O(m + n)$
  • [PS] KMP also has some optimization methods

SUNDAY#

Find the tail first, then find the head

Simulation Operation#

Directly simulate the following situation

  • ① When a mismatch occurs, observe the next position e corresponding to the end of the pattern string $t$ in the main string $s$ — golden alignment point
    • Image
    • [Note] Do not observe the position after the mismatch <don't be misled by the image>
  • ② In the pattern string $t$, find the first position of e from the back
    • Image
    • Found e at the second to last position
  • ③ The second to last position means that the attempt matching position of the pattern string $t$ with the main string $s$ moves forward by 2 positions, then retry matching the main string $s$ from the beginning
    • Image
    • [Personal Understanding] If this right shift is to match successfully, the e position must align, because it will always pass through e; finding the first e is to avoid missing answers
  • ④ The above image still has a mismatch, at this time find the first a in the pattern string $t$ from the back, align it, match... until successfully matched, or the matching ends
  • [PS] If the character corresponding to the golden alignment point cannot be found, the right shift offset will be the length of the entire pattern string $t$ + 1

Process#

  1. Preprocess the last occurrence position of each character in the pattern string
  2. Simulate the brute force matching algorithm process, when a mismatch occurs, the text string pointer moves back according to the above preprocessing information, then re-matches
  3. Finally, return the index if matched successfully, or not found

Summary#

  • Core: Understand the golden alignment point [see simulation operation]
    • If matching is successful, it must appear in the pattern string
    • It should align with the last occurrence position of that character in the pattern string
  • Right shift operation: The position of the character found in the pattern string from the end determines how many positions to shift the text string pointer to the right
  • Fastest Time Complexity: $O(m / n)$
    • Corresponding scenario: If the character to find the golden alignment point is not found in the pattern string, the text string pointer will move back the entire length of the pattern string $n + 1$.
  • [PS]
    • Worst-case time complexity: $O(m \times n)$, but this scenario generally has no practical significance
    • Usage scenario: Solving simple single-pattern matching problems like "finding a word in an article"
    • Compared to KMP
      • SUNDAY is easier to understand and implement
      • But the thinking of the KMP algorithm is more advanced and should be learned [State Machine]

SHIFT-AND#

A very elegant niche algorithm that tests imagination
State machine thinking: NFA, Non-deterministic Finite Automaton

Encoding Array d#

⭐ Convert the pattern string into a kind of encoding, and after conversion, the matching problem has nothing to do with the pattern string

  • Image
  • Starting from position 0, read each position of the pattern string $t$ sequentially
  • When the character 'a' appears at position 0, set the element at index 'a' in the $d$ array to 1
  • When the character 'e' appears at position 1, set the element at index 'e' in the $d$ array to 1
  • And so on, obtaining the $d$ array
  • [PS]
    • The index of the $d$ array can correspond to the ASCII code of the character
    • The elements of the $d$ array can store int type [32 bits], long long type, or custom types, which determines the maximum length of the pattern string that can be matched
    • Note the order of characters in the string and the numerical order of the $d$ array elements, the character at position 0 corresponds to the lower bits of the latter

State Information p#

Understand the state definition and feel the transition of states

  • State Definition: $p_i$
    • In the binary representation of $p_i$, if the $j$-th bit is 1, it indicates that when ending with the text string $s[i]$, the first $j$ positions of the pattern string can be successfully matched
    • Diagram, compare the above definition with the three pairs of circles below
      • Image
      • Red: $p_4$ represents the current matching condition ending with the text string $s[4]$
      • Orange: The second bit of $p_4$ is 1, indicating that the first 2 positions of the pattern string $t$ can satisfy the above matching condition, i.e., $t[0\to 1]$ matches $s[3\to 4]$
      • Green: The fifth bit of $p_4$ is 1, indicating that the first 5 positions of the pattern string $t$ can satisfy the above matching condition, i.e., $t[0\to 4]$ matches $s[0\to 4]$
      • [PS]
        • The binary length of $p$ depends on what data type is used, like the elements of the $d$ array, such as int: 32 bits
        • Do not focus on how $p_4$ is obtained for now, we will look at the transition process later
    • ⭐ It can be seen that
      • $p$ simultaneously stores information about multiple matching results❗
      • When the $n$-th bit of $p_i$ is 1, it indicates a successful match, that is
        • $p_i\ &\ (1\ <<\ (n-1))=1$
        • The matching position is: $i-n+1$
  • Transition Process: $p_i=(p_{i-1}\ <<\ 1\ |\ 1)\ &\ d[s[i]]$
    • ⭐ Key Formula ⭐
    • At this point, the pattern string $t$ has been completely abandoned, and through the previous state of $p$ and the $d$ array, the current value of $p$ can be obtained
    • Principle analysis, compare the key formula with the diagram below ①②
      • image-20210203004703369
      • Assuming $p_3$ is known, currently matching $s[2\to 3]$ and $s[0\to 3]$
      • Left shift by 1 + or 1 operation, obtain the transition state $p_4^{'}$
        • Left shift by 1: Attempting to swallow $s[4]$ based on the previous match, i.e., matching $s[2\to 4]$ and $s[0\to 4]$
          • The condition for successful matching lies in the subsequent and operation
          • [Note] The left shift operation appears as a right shift in the diagram because the order of digits is low on the left
        • Or 1 operation: Leave a chance for a separate match of $s[4]$, matching success indicates matching this one character
      • And operation
        • The above two operations are just attempts to match, and whether it is truly successful depends on the performance of $d[s_4]$
        • If the result of the and operation at the $j$-th bit is 1, it indicates a successful match of $s[4-j+1\to 4]$ with $t[0\to j-1]$
          • For example, if $p_4[2]=1$, $p_4[5]=1$, the matching result is as illustrated in the state definition part
  • [PS]
    • The initial value of $p$ is 0, and $p$ is actually a non-deterministic finite state automaton
    • Habitual conflict: Generally, the right side of the number order is the lower bit, but in the diagram, the left side of $p$ is the lower bit; the lowest bit of the binary number is 1, while the lowest bit of the string is 0; the elements of the $d$ array are numbers, so they should correspond to the numerical order, but the above diagram is to reflect the corresponding position relationship with the pattern string, using string order
    • Accurately understand the meaning of $p$: The last successful match was 4 bits, and the next time it may successfully match 5 bits
    • For matching problems with more than 64 bits, you need to create your own data type
      • Just support left shift, or, and operations
      • Then modify the types of the encoding array $d$ and state information $p$ accordingly

Summary#

  • ① Encoding Array $d$: Converts the position of each character in the pattern string into the corresponding binary encoding
  • ② State Information $p$: Transitions states through the key formula $p_i=(p_{i-1}\ <<\ 1\ |\ 1)\ &\ d[s[i]]$
  • Time Complexity: $O(m)$
    • Completely does not require back and forth on the pattern string
    • However, strictly speaking, it is not purely $O(m)$, which is related to the computer's data representation capability
  • ❗ Can solve regular expression matching problems
    • For example: $(a|b|c)&(c|d)&e&(f|a|b)$
    • ① Convert the above pattern string into the encoding array $d$
      • For each column of the $d$ array, multiple bits can be set to 1
      • That is, each position of the pattern string can set multiple corresponding binary bits of the $d$ array elements to 1
    • ② After that, it has nothing to do with the pattern string, solve the problem based on the $d$ array and the key formula
    • [PS] Here it can be seen that the reason why the pattern string is called a pattern string is that it refers not only to strings but also to regular expressions (which can also be seen as a kind of related multi-pattern)

Code Demonstration#

Brute Force Matching Method#

  • Image
  • Two versions, learn the skills of the concise version, flexibly use $flag$
  • Traverse the text string and the pattern string respectively, compare each position, if mismatched, move the text string pointer one position to the right
  • This idea is the foundation of all string matching algorithms

KMP#

2 programming methods, the latter is closer to the essence of the algorithm

  • ① Ordinary coding: Directly obtain the $next$ array, then use the $next$ array for mismatch offset
    • Image
    • Preprocess the $next$ array → Traverse each position of the text string → Match the pattern string [using the $next$ array]
    • ❗ Observe: The process of obtaining the $next$ array is very similar to the process of matching the text string, based on the state machine idea, the two parts can be integrated
  • ② Advanced programming: Based on the state machine idea, transform the process of obtaining the $next$ value into a state change process
    • Image
    • Abstracted a state machine model: $j$ points to the position of the state machine, the $getNext$ method is equivalent to performing a state transition based on the input character, which is actually changing the value of $j$
    • A way of thinking that is closer to the essence of KMP
  • PS: The $next$ array should also be freed before returning $i-n+1$ when matching is successful

SUNDAY#

  • Image
  • ❗ Note the meaning of 256: Used to record the position of each character in the pattern string
  • The process of preprocessing the $offset$ array: First initialize to all not appearing, $n+1$ indicates that the text string pointer can directly move right by a distance of the length of the pattern string + 1; during preprocessing, the positions of the same characters appearing later will overwrite those before, which is exactly what SUNDAY intends
  • $s[i + n]$: The character corresponding to the golden alignment point
  • $i + n <= m$: Reasonably utilizes the $n$, $m$ variables, indicating that the text string is still sufficient for the pattern string to match
  • Compared to KMP, SUNDAY is easier to understand, and the code is easier to implement [the best choice for solving simple single-pattern matching problems]

SHIFT-AND#

  • Image
  • ⭐ Beautiful idea, simple code
  • Note
    • The $d$ array elements related to numbers and state $p$, the lowest bit is 1; while the string corresponds to 0
    • But the transition process is arithmetic between numbers, so there is no need to worry about whether the positions are aligned

In-Class Exercises#

——Hash——#

Common problem-solving routine

HZOJ-275: Rabbits and Rabbits#

Image

Sample Input

aabbaabb
3
1 3 5 7
1 3 6 8
1 2 1 2

Sample Output

Yes
No
Yes
  • Idea
    • Note: The number of queries is particularly large, and the query interval is very large, brute force matching will definitely time out
    • How to compare quickly? 👉 Hash matching algorithm
      • By hashing, represent the string with numbers <hash value>
      • If the hash values of the two strings are different, the two strings must not be equal
        • This eliminates most of the unequal cases without needing to compare bit by bit
      • Otherwise, verify whether the strings are equal through brute force matching
        • If the two strings are equal, one match can succeed
      • [PS]
        • The hash value corresponding to long strings may be too large, so generally, the hash value is reduced through modulus operation
        • If the remainders are different 👉 the strings must be different
    • Design a hash function [string -> hash value]
      • $H=(\sum_{k=0}^{n-1}{C_k\times base^k})%P$
      • $n$: Length of the string $s$
      • $C_k$: ASCII code value corresponding to the $k$-th character
      • $base$: Weight [prime number]
      • $P$: Modulus [prime number]
      • [PS] Characters are stored in binary numbers at the computer's lower level
    • ⭐ The hash value $H_{j\to i}$ of a substring $s[j\to i]$ in this problem
      • Based on the relationship between strings and hash values, a prefix sum array based on hash can be preprocessed
        • $HS_i=(\sum_{k=0}^{i}{C_k\times base^k})%P$
      • Thus, the hash value based on hash for the substring $s[j\to i]$ is
        • $HS_{j\to i}=(\sum_{k=j}^{i}{C_k\times base^k})%P$
      • However, for the hash value of the substring, it should not include the positional information $j$, $i$ of the substring in the string, that is, the multiplier factor should start from $base^0$
      • Therefore, the interval sum $HS_{j\to i}$ should be divided by $base^j$, but for the modulus operation, division cannot be done directly, and the inverse element must be used
      • Finally, the hash value of the substring $s[j\to i]$ is converted to
        • $H_{j\to i}=HS_{j\to i}\times(base^j)^{-1}%P=(HS_i-HS_{j-1})\times(base^j)^{-1}%P$
        • $(base^j)^{-1}$: The inverse element of $base^j$
        • For the quick calculation of the inverse element, please refer to: Quickly find [mod] inverse element
    • Solution Method ①
        1. Preprocess the prefix sum based on hash for each position in the text string, convenient for later interval sum calculation
        1. Use the interval sum and inverse element to obtain the hash value of a substring $s[i\to j]$:
        • $H_{j\to i}=HS_{j\to i}\times(base^j)^{-1}%P$
        1. Compare the hash values of the two substrings
        • If not equal, the substrings must be different
        • Otherwise, verify whether they are the same through brute force matching
      • ❗ The probability of two substrings being different but having the same hash value is $1/P$
    • Solution Method ②
      • Based on Method ①, design two hash functions with different $P$ and $base$
      • Compare the two hash values corresponding to the two substrings
        • If both are equal, consider the two substrings the same
        • Otherwise, they must be different
      • ❗ The probability of two substrings being different but having the same hash value is $1/(P_1\times P_2)$
        • The error probability is extremely low, in fact, many algorithms are probabilistic algorithms
  • Code
    • Method ①: Hash matching + brute force matching
    • Image
    • Image
    • ① The $P$ value must be a prime number, otherwise some inverse elements are meaningless, leading to incorrect hash values
      • This can cause the same substring to correspond to different hash values
      • The sufficient and necessary condition for the existence of the inverse element: coprime
    • ② Need to use long long type, as the values in the intermediate process may be very large
      • This increases the probability of different substrings corresponding to the same hash value, increasing the time cost [brute force matching]
    • [PS]
      • Always remember to take modulus to prevent numbers from exceeding the representation range, even for long long
      • Ensure that the hash value is positive
        • The hash prefix sum array $H$ is not monotonically increasing because there are modulus operations during summation
      • A small $P$ value may lead to timeouts due to excessive brute force matching
    • Method ②: Hash matching * 2
    • img
    • img
    • Note: The two inverse element arrays should be initialized separately to ensure complete initialization [the range of traversal is different]
      • Or use the same $P$ and different $base$, which can be initialized together
    • [PS] To be safe, you can also add brute force matching verification

Quickly Find [mod] Inverse Element#

Derivation Process

  • Goal: Solve for $x\times x^{-1}\equiv1\ (mod\ P)$ in which $x^{-1}$
    • [PS] Only consider the case where $x<P$, because all $x$ greater than $P$ can be converted to less than $P$ by taking modulus
  • Let: $P\ %\ x=r,P\ /\ x=k$
  • Then:
    • $\begin{aligned}P&=kx+r\kx+r&\equiv0\ (mod\ P)\kx(x^{-1}r^{-1})+r(x^{-1}r^{-1})&\equiv0\ (mod\ P)\kr^{-1}+x^{-1}&\equiv0\ (mod\ P)\x^{-1}&\equiv-kr^{-1}\ (mod\ P)\\end{aligned}$
  • Ultimately, the problem of finding $x^{-1}$ is transformed into finding $r^{-1}$
    • And $r$ is the remainder of the modulus $x$, so $r$ must be smaller than $x$
  • A large-scale problem 👉 A smaller-scale problem [recursion]
  • To ensure the inverse element is positive, one more processing step is needed
    • $x^{-1}_{+}=(-kr^{-1}\ %\ P+P)\ %\ P$
  • [PS]

Code

  • Image
  • Regardless of the modulus, the inverse of 1 is 1 [initialization]
  • Output results
    • Image
    • You can verify: The original number multiplied by the inverse element modulo is 1

Additional Knowledge Points#

  • The essential ideas of KMP and SHIFT-AND: State machine
  • The SUNDAY algorithm is commonly used in life, and the Hash matching algorithm is often used in problem-solving
  • DFA and NFA are knowledge from compiler theory

Tips#

  • Before starting C++, make sure to finish the prerequisite courses
  • Becoming an expert in the field of algorithms relies on improving your aesthetic ability
  • Getting angry is a sign of incompetence, focus on doing the right thing! Step by step improvement, without seeking instant success

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.