This document explains Crochemore and Perrin's Two-Way string matching algorithm, in which a smaller string (the "pattern" or "needle") is searched for in a longer string (the "text" or "haystack"), determining whether the needle is a substring of the haystack, and if so, at what index(es). It is to be used by Python's string (and bytes-like) objects when calling `find`, `index`, `__contains__`, or implicitly in methods like `replace` or `partition`. This is essentially a re-telling of the paper Crochemore M., Perrin D., 1991, Two-way string-matching, Journal of the ACM 38(3):651-675. focused more on understanding and examples than on rigor. See also the code sample here: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 The algorithm runs in O(len(needle) + len(haystack)) time and with O(1) space. However, since there is a larger preprocessing cost than simpler algorithms, this Two-Way algorithm is to be used only when the needle and haystack lengths meet certain thresholds. These are the basic steps of the algorithm: * "Very carefully" cut the needle in two. * For each alignment attempted: 1. match the right part * On failure, jump by the amount matched + 1 2. then match the left part. * On failure jump by max(len(left), len(right)) + 1 * If the needle is periodic, don't re-do comparisons; maintain a "memory" of how many characters you already know match. -------- Matching the right part -------- We first scan the right part of the needle to check if it matches the the aligned characters in the haystack. We scan left-to-right, and if a mismatch occurs, we jump ahead by the amount matched plus 1. Example: text: ........EFGX................... pattern: ....abcdEFGH.... cut: <<<<>>>> Matched 3, so jump ahead by 4: text: ........EFGX................... pattern: ....abcdEFGH.... cut: <<<<>>>> Why are we allowed to do this? Because we cut the needle very carefully, in such a way that if the cut is ...abcd + EFGH... then we have d != E cd != EF bcd != EFG abcd != EFGH ... and so on. If this is true for every pair of equal-length substrings around the cut, then the following alignments do not work, so we can skip them: text: ........EFG.................... pattern: ....abcdEFGH.... ^ (Bad because d != E) text: ........EFG.................... pattern: ....abcdEFGH.... ^^ (Bad because cd != EF) text: ........EFG.................... pattern: ....abcdEFGH.... ^^^ (Bad because bcd != EFG) Skip 3 alignments => increment alignment by 4. -------- If len(left_part) < len(right_part) -------- Above is the core idea, and it begins to suggest how the algorithm can be linear-time. There is one bit of subtlety involving what to do around the end of the needle: if the left half is shorter than the right, then we could run into something like this: text: .....EFG...... pattern: cdEFGH The same argument holds that we can skip ahead by 4, so long as d != E cd != EF ?cd != EFG ??cd != EFGH etc. The question marks represent "wildcards" that always match; they're outside the limits of the needle, so there's no way for them to invalidate a match. To ensure that the inequalities above are always true, we need them to be true for all possible '?' values. We thus need cd != FG and cd != GH, etc. -------- Matching the left part -------- Once we have ensured the right part matches, we scan the left part (order doesn't matter, but traditionally right-to-left), and if we find a mismatch, we jump ahead by max(len(left_part), len(right_part)) + 1. That we can jump by at least len(right_part) + 1 we have already seen: text: .....EFG..... pattern: abcdEFG Matched 3, so jump by 4, using the fact that d != E, cd != EF, and bcd != EFG. But we can also jump by at least len(left_part) + 1: text: ....cdEF..... pattern: abcdEF Jump by len('abcd') + 1 = 5. Skip the alignments: text: ....cdEF..... pattern: abcdEF text: ....cdEF..... pattern: abcdEF text: ....cdEF..... pattern: abcdEF text: ....cdEF..... pattern: abcdEF This requires the following facts: d != E cd != EF bcd != EF? abcd != EF?? etc., for all values of ?s, as above. If we have both sets of inequalities, then we can indeed jump by max(len(left_part), len(right_part)) + 1. Under the assumption of such a nice splitting of the needle, we now have enough to prove linear time for the search: consider the forward-progress/comparisons ratio at each alignment position. If a mismatch occurs in the right part, the ratio is 1 position forward per comparison. On the other hand, if a mismatch occurs in the left half, we advance by more than len(needle)//2 positions for at most len(needle) comparisons, so this ratio is more than 1/2. This average "movement speed" is bounded below by the constant "1 position per 2 comparisons", so we have linear time. -------- The periodic case -------- The sets of inequalities listed so far seem too good to be true in the general case. Indeed, they fail when a needle is periodic: there's no way to split 'AAbAAbAAbA' in two such that (the stuff n characters to the left of the split) cannot equal (the stuff n characters to the right of the split) for all n. This is because no matter how you cut it, you'll get s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the needle in two so that n can be as big as possible. If we were to split it as AAbA + AbAAbA then A == A at the split, so this is bad (we failed at length 1), but if we split it as AA + bAAbAAbA we at least have A != b and AA != bA, and we fail at length 3 since ?AA == bAA. We already knew that a cut to make length-3 mismatch was impossible due to the period, but we now see that the bound is sharp; we can get length-1 and length-2 to mismatch. This is exactly the content of the *critical factorization theorem*: that no matter the period of the original needle, you can cut it in such a way that (with the appropriate question marks), needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period. Even "non-periodic" strings are periodic with a period equal to their length, so for such needles, the CFT already guarantees that the algorithm described so far will work, since we can cut the needle so that the length-k chunks on either side of the cut mismatch for all k < len(needle). Looking closer at the algorithm, we only actually require that k go up to max(len(left_part), len(right_part)). So long as the period exceeds that, we're good. The more general shorter-period case is a bit harder. The essentials are the same, except we use the periodicity to our advantage by "remembering" periods that we've already compared. In our running example, say we're computing "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA". We cut as AA + bAAbAAbA, and then the algorithm runs as follows: First alignment: bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ^^X - Mismatch at third position, so jump by 3. - This requires that A!=b and AA != bA. Second alignment: bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ^^^^^^^^ X - Matched entire right part - Mismatch at left part. - Jump forward a period, remembering the existing comparisons Third alignment: bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA mmmmmmm^^X - There's "memory": a bunch of characters were already matched. - Two more characters match beyond that. - The 8th character of the right part mismatched, so jump by 8 - The above rule is more complicated than usual: we don't have the right inequalities for lengths 1 through 7, but we do have shifted copies of the length-1 and length-2 inequalities, along with knowledge of the mismatch. We can skip all of these alignments at once: bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ~ A != b at the cut bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ~~ AA != bA at the cut bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ^^^^X 7-3=4 match, and the 5th misses. bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ~ A != b at the cut bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ~~ AA != bA at the cut bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ^X 7-3-3=1 match and the 2nd misses. bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ~ A != b at the cut Fourth alignment: bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ^X - Second character mismatches, so jump by 2. Fifth alignment: bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA ^^^^^^^^ X - Right half matches, so use memory and skip ahead by period=3 Sixth alignment: bbbAbbAAbAAbAAbbbAAbAAbAAbAA AAbAAbAAbA mmmmmmmm^^ - Right part matches, left part is remembered, found a match! The one tricky skip by 8 here generalizes: if we have a period of p, then the CFT says we can ensure the cut has the inequality property for lengths 1 through p-1, and jumping by p would line up the matching characters and mismatched character one period earlier. Inductively, this proves that we can skip by the number of characters matched in the right half, plus 1, just as in the original algorithm. To make it explicit, the memory is set whenever the entire right part is matched and is then used as a starting point in the next alignment. In such a case, the alignment jumps forward one period, and the right half matches all except possibly the last period. Additionally, if we cut so that the left part has a length strictly less than the period (we always can!), then we can know that the left part already matches. The memory is reset to 0 whenever there is a mismatch in the right part. To prove linearity for the periodic case, note that if a right-part character mismatches, then we advance forward 1 unit per comparison. On the other hand, if the entire right part matches, then the skipping forward by one period "defers" some of the comparisons to the next alignment, where they will then be spent at the usual rate of one comparison per step forward. Even if left-half comparisons are always "wasted", they constitute less than half of all comparisons, so the average rate is certainly at least 1 move forward per 2 comparisons. -------- When to choose the periodic algorithm --------- The periodic algorithm is always valid but has an overhead of one more "memory" register and some memory computation steps, so the here-described-first non-periodic/long-period algorithm -- skipping by max(len(left_part), len(right_part)) + 1 rather than the period -- should be preferred when possible. Interestingly, the long-period algorithm does not require an exact computation of the period; it works even with some long-period, but undeniably "periodic" needles: Cut: AbcdefAbc == Abcde + fAbc This cut gives these inequalities: e != f de != fA cde != fAb bcde != fAbc Abcde != fAbc? The first failure is a period long, per the CFT: ?Abcde == fAbc?? A sufficient condition for using the long-period algorithm is having the period of the needle be greater than max(len(left_part), len(right_part)). This way, after choosing a good split, we get all of the max(len(left_part), len(right_part)) inequalities around the cut that were required in the long-period version of the algorithm. With all of this in mind, here's how we choose: (1) Choose a "critical factorization" of the needle -- a cut where we have period minus 1 inequalities in a row. More specifically, choose a cut so that the left_part is less than one period long. (2) Determine the period P_r of the right_part. (3) Check if the left part is just an extension of the pattern of the right part, so that the whole needle has period P_r. Explicitly, check if needle[0:cut] == needle[0+P_r:cut+P_r] If so, we use the periodic algorithm. If not equal, we use the long-period algorithm. Note that if equality holds in (3), then the period of the whole string is P_r. On the other hand, suppose equality does not hold. The period of the needle is then strictly greater than P_r. Here's a general fact: If p is a substring of s and p has period r, then the period of s is either equal to r or greater than len(p). We know that needle_period != P_r, and therefore needle_period > len(right_part). Additionally, we'll choose the cut (see below) so that len(left_part) < needle_period. Thus, in the case where equality does not hold, we have that needle_period >= max(len(left_part), len(right_part)) + 1, so the long-period algorithm works, but otherwise, we know the period of the needle. Note that this decision process doesn't always require an exact computation of the period -- we can get away with only computing P_r! -------- Computing the cut -------- Our remaining tasks are now to compute a cut of the needle with as many inequalities as possible, ensuring that cut < needle_period. Meanwhile, we must also compute the period P_r of the right_part. The computation is relatively simple, essentially doing this: suffix1 = max(needle[i:] for i in range(len(needle))) suffix2 = ... # the same as above, but invert the alphabet cut1 = len(needle) - len(suffix1) cut2 = len(needle) - len(suffix2) cut = max(cut1, cut2) # the later cut For cut2, "invert the alphabet" is different than saying min(...), since in lexicographic order, we still put "py" < "python", even if the alphabet is inverted. Computing these, along with the method of computing the period of the right half, is easiest to read directly from the source code in fastsearch.h, in which these are computed in linear time. Crochemore & Perrin's Theorem 3.1 give that "cut" above is a critical factorization less than the period, but a very brief sketch of their proof goes something like this (this is far from complete): * If this cut splits the needle as some needle == (a + w) + (w + b), meaning there's a bad equality w == w, it's impossible for w + b to be bigger than both b and w + w + b, so this can't happen. We thus have all of the inequalities with no question marks. * By maximality, the right part is not a substring of the left part. Thus, we have all of the inequalities involving no left-side question marks. * If you have all of the inequalities without right-side question marks, we have a critical factorization. * If one such inequality fails, then there's a smaller period, but the factorization is nonetheless critical. Here's where you need the redundancy coming from computing both cuts and choosing the later one. -------- Some more Bells and Whistles -------- Beyond Crochemore & Perrin's original algorithm, we can use a couple more tricks for speed in fastsearch.h: 1. Even though C&P has a best-case O(n/m) time, this doesn't occur very often, so we add a Boyer-Moore bad character table to achieve sublinear time in more cases. 2. The prework of computing the cut/period is expensive per needle character, so we shouldn't do it if it won't pay off. For this reason, if the needle and haystack are long enough, only automatically start with two-way if the needle's length is a small percentage of the length of the haystack. 3. In cases where the needle and haystack are large but the needle makes up a significant percentage of the length of the haystack, don't pay the expensive two-way preprocessing cost if you don't need to. Instead, keep track of how many character comparisons are equal, and if that exceeds O(len(needle)), then pay that cost, since the simpler algorithm isn't doing very well.