mirror of https://github.com/python/cpython
This document explains Crochemore and Perrin's Two-Way string matching
algorithm, in which a smaller string (the "pattern" or "needle")
is searched for in a longer string (the "text" or "haystack"),
determining whether the needle is a substring of the haystack, and if
so, at what index(es). It is to be used by Python's string
(and bytes-like) objects when calling `find`, `index`, `__contains__`,
or implicitly in methods like `replace` or `partition`.

This is essentially a re-telling of the paper

    Crochemore M., Perrin D., 1991, Two-way string-matching,
        Journal of the ACM 38(3):651-675.

focused more on understanding and examples than on rigor. See also
the code sample here:

    http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260

The algorithm runs in O(len(needle) + len(haystack)) time and with
O(1) space. However, since there is a larger preprocessing cost than
simpler algorithms, this Two-Way algorithm is to be used only when the
needle and haystack lengths meet certain thresholds.


These are the basic steps of the algorithm:

    * "Very carefully" cut the needle in two.
    * For each alignment attempted:
        1. match the right part
            * On failure, jump by the amount matched + 1
        2. then match the left part.
            * On failure, jump by max(len(left), len(right)) + 1
    * If the needle is periodic, don't re-do comparisons; maintain
      a "memory" of how many characters you already know match.


-------- Matching the right part --------

We first scan the right part of the needle to check if it matches
the aligned characters in the haystack. We scan left-to-right,
and if a mismatch occurs, we jump ahead by the amount matched plus 1.

Example:

       text:    ........EFGX...................
    pattern:    ....abcdEFGH....
        cut:        <<<<>>>>

Matched 3, so jump ahead by 4:

       text:    ........EFGX...................
    pattern:        ....abcdEFGH....
        cut:            <<<<>>>>

Why are we allowed to do this? Because we cut the needle very
carefully, in such a way that if the cut is ...abcd + EFGH... then
we have

       d != E
      cd != EF
     bcd != EFG
    abcd != EFGH
    ... and so on.

If this is true for every pair of equal-length substrings around the
cut, then the following alignments do not work, so we can skip them:

       text:    ........EFG....................
    pattern:     ....abcdEFGH....
                        ^ (Bad because d != E)
       text:    ........EFG....................
    pattern:      ....abcdEFGH....
                        ^^ (Bad because cd != EF)
       text:    ........EFG....................
    pattern:       ....abcdEFGH....
                        ^^^ (Bad because bcd != EFG)

Skip 3 alignments => increment alignment by 4.


-------- If len(left_part) < len(right_part) --------

Above is the core idea, and it begins to suggest how the algorithm can
be linear-time. There is one bit of subtlety involving what to do
around the end of the needle: if the left half is shorter than the
right, then we could run into something like this:

       text:    .....EFG......
    pattern:       cdEFGH

The same argument holds that we can skip ahead by 4, so long as

       d != E
      cd != EF
     ?cd != EFG
    ??cd != EFGH
    etc.

The question marks represent "wildcards" that always match; they're
outside the limits of the needle, so there's no way for them to
invalidate a match. To ensure that the inequalities above are always
true, we need them to be true for all possible '?' values. We thus
need cd != FG and cd != GH, etc.
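
The inequalities above, wildcards included, can be checked mechanically.
The following is a brute-force sketch for illustration only, not the
routine used in fastsearch.h; the name `first_equality` is hypothetical.
It returns the smallest length k at which the k characters left of a cut
agree with the k characters right of it, treating positions outside the
needle as wildcards.

```python
def first_equality(needle: str, cut: int) -> int:
    """Smallest k >= 1 at which the inequality around `cut` fails.

    The k characters ending at `cut` are compared with the k characters
    starting at `cut`; positions outside the needle are wildcards.
    Hypothetical helper for illustration only (quadratic time).
    """
    n = len(needle)
    for k in range(1, n + 1):
        # Pair needle[i] (left chunk) with needle[i + k] (right chunk),
        # keeping only the pairs where both indices are inside the needle.
        pairs = range(max(0, cut - k), min(cut, n - k))
        if all(needle[i] == needle[i + k] for i in pairs):
            return k
    return n  # only reached for an empty needle


print(first_equality("cdEFGH", 2))    # 6: every shorter length mismatches
print(first_equality("abcdEFGH", 4))  # 8
```

A result equal to len(needle) means that no shorter length produces an
equality, so every inequality listed above holds for that cut.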


-------- Matching the left part --------

Once we have ensured the right part matches, we scan the left part
(order doesn't matter, but traditionally right-to-left), and if we
find a mismatch, we jump ahead by
max(len(left_part), len(right_part)) + 1. That we can jump by
at least len(right_part) + 1 we have already seen:

       text:    .....EFG.....
    pattern:     abcdEFG
    Matched 3, so jump by 4,
    using the fact that d != E, cd != EF, and bcd != EFG.

But we can also jump by at least len(left_part) + 1:

       text:    ....cdEF.....
    pattern:      abcdEF
    Jump by len('abcd') + 1 = 5.

    Skip the alignments:
       text:    ....cdEF.....
    pattern:       abcdEF
       text:    ....cdEF.....
    pattern:        abcdEF
       text:    ....cdEF.....
    pattern:         abcdEF
       text:    ....cdEF.....
    pattern:          abcdEF

    This requires the following facts:
           d != E
          cd != EF
         bcd != EF?
        abcd != EF??
        etc., for all values of ?s, as above.

If we have both sets of inequalities, then we can indeed jump by
max(len(left_part), len(right_part)) + 1. Under the assumption of such
a nice splitting of the needle, we now have enough to prove linear
time for the search: consider the forward-progress/comparisons ratio
at each alignment position. If a mismatch occurs in the right part,
the ratio is 1 position forward per comparison. On the other hand,
if a mismatch occurs in the left half, we advance by more than
len(needle)//2 positions for at most len(needle) comparisons,
so this ratio is more than 1/2. This average "movement speed" is
bounded below by the constant "1 position per 2 comparisons", so we
have linear time.
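
The two scans and their jumps can be sketched as follows. This is a
minimal sketch, assuming the caller supplies a cut for which all of the
inequalities above hold (the sketch does not check this). The function
name and signature are hypothetical; fastsearch.h is written in C and is
organized differently.

```python
def find_long_period(haystack: str, needle: str, cut: int) -> int:
    """Return the first index of `needle` in `haystack`, or -1.

    Assumes (not checked) that `cut` splits the needle so that every
    inequality of length 1..max(len(left), len(right)) holds.
    """
    n, m = len(haystack), len(needle)
    gap = max(cut, m - cut) + 1  # jump used after a left-part mismatch
    pos = 0                      # current alignment
    while pos <= n - m:
        # 1. Match the right part, left to right.
        i = cut
        while i < m and haystack[pos + i] == needle[i]:
            i += 1
        if i < m:
            pos += (i - cut) + 1  # amount matched, plus 1
            continue
        # 2. Match the left part, right to left.
        j = cut - 1
        while j >= 0 and haystack[pos + j] == needle[j]:
            j -= 1
        if j < 0:
            return pos
        pos += gap
    return -1


text = "....abcdEFGX....abcdEFGH...."
print(find_long_period(text, "abcdEFGH", 4) == text.find("abcdEFGH"))
```

The needle "abcdEFGH" has no repeated characters, so any cut satisfies
the assumption; needles with repetitions need the careful cut described
in the rest of this document.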


-------- The periodic case --------

The sets of inequalities listed so far seem too good to be true in
the general case. Indeed, they fail when a needle is periodic:
there's no way to split 'AAbAAbAAbA' in two such that

    (the stuff n characters to the left of the split)
    cannot equal
    (the stuff n characters to the right of the split)
    for all n.

This is because no matter how you cut it, you'll get
s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the
needle in two so that n can be as big as possible. If we were to
split it as

    AAbA + AbAAbA

then A == A at the split, so this is bad (we failed at length 1), but
if we split it as

    AA + bAAbAAbA

we at least have A != b and AA != bA, and we fail at length 3
since ?AA == bAA. We already knew that a cut to make length-3
mismatch was impossible due to the period, but we now see that the
bound is sharp; we can get length-1 and length-2 to mismatch.

This is exactly the content of the *critical factorization theorem*:
that no matter the period of the original needle, you can cut it in
such a way that (with the appropriate question marks),
needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.

Even "non-periodic" strings are periodic with a period equal to
their length, so for such needles, the CFT already guarantees that
the algorithm described so far will work, since we can cut the needle
so that the length-k chunks on either side of the cut mismatch for all
k < len(needle). Looking closer at the algorithm, we only actually
require that k go up to max(len(left_part), len(right_part)).
So long as the period exceeds that, we're good.

The more general shorter-period case is a bit harder. The essentials
are the same, except we use the periodicity to our advantage by
"remembering" periods that we've already compared. In our running
example, say we're computing

    "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".

We cut as AA + bAAbAAbA, and then the algorithm runs as follows:

    First alignment:
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
        AAbAAbAAbA
          ^^X
        - Mismatch at third position, so jump by 3.
        - This requires that A != b and AA != bA.

    Second alignment:
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
           AAbAAbAAbA
             ^^^^^^^^
            X
        - Matched entire right part
        - Mismatch at left part.
        - Jump forward a period, remembering the existing comparisons

    Third alignment:
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
              AAbAAbAAbA
              mmmmmmm^^X
        - There's "memory": a bunch of characters were already matched.
        - Two more characters match beyond that.
        - The 8th character of the right part mismatched, so jump by 8
        - The above rule is more complicated than usual: we don't have
          the right inequalities for lengths 1 through 7, but we do have
          shifted copies of the length-1 and length-2 inequalities,
          along with knowledge of the mismatch. We can skip all of these
          alignments at once:

            bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                   AAbAAbAAbA
                    ~                       A != b at the cut
            bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                    AAbAAbAAbA
                    ~~                      AA != bA at the cut
            bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                     AAbAAbAAbA
                       ^^^^X                7-3=4 match, and the 5th misses.
            bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                      AAbAAbAAbA
                       ~                    A != b at the cut
            bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                       AAbAAbAAbA
                       ~~                   AA != bA at the cut
            bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                        AAbAAbAAbA
                          ^X                7-3-3=1 match and the 2nd misses.
            bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                         AAbAAbAAbA
                          ~                 A != b at the cut

    Fourth alignment:
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                      AAbAAbAAbA
                        ^^^^^^^^
                       X
        - This is where the jump by 8 lands.
        - Right half matches, so use memory and skip ahead by period=3

    Fifth alignment:
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                         AAbAAbAAbA
                         mmmmmmm^^^
        - Right part matches, left part is remembered, found a match!

The one tricky skip by 8 here generalizes: if we have a period of p,
then the CFT says we can ensure the cut has the inequality property
for lengths 1 through p-1, and jumping by p would line up the
matching characters and mismatched character one period earlier.
Inductively, this proves that we can skip by the number of characters
matched in the right half, plus 1, just as in the original algorithm.

To make it explicit, the memory is set whenever the entire right part
is matched and is then used as a starting point in the next alignment.
In such a case, the alignment jumps forward one period, and the right
half matches all except possibly the last period. Additionally,
if we cut so that the left part has a length strictly less than the
period (we always can!), then we can know that the left part already
matches. The memory is reset to 0 whenever there is a mismatch in the
right part.

To prove linearity for the periodic case, note that if a right-part
character mismatches, then we advance forward 1 unit per comparison.
On the other hand, if the entire right part matches, then the skipping
forward by one period "defers" some of the comparisons to the next
alignment, where they will then be spent at the usual rate of
one comparison per step forward. Even if left-half comparisons
are always "wasted", they constitute less than half of all
comparisons, so the average rate is certainly at least 1 move forward
per 2 comparisons.
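
The memory bookkeeping described above can be sketched as follows. This
is a minimal sketch, assuming the caller passes a critical cut with
cut < period and the true period of the needle (neither is checked).
The function name, its signature, and the optional `trace` list are
hypothetical and exist only for illustration.

```python
def find_periodic(haystack, needle, cut, period, trace=None):
    """Return the first index of `needle` in `haystack`, or -1.

    Assumes (not checked) that `period` is the needle's period and that
    `cut` is a critical factorization with cut < period.
    If `trace` is a list, every attempted alignment is appended to it.
    """
    n, m = len(haystack), len(needle)
    pos = 0     # current alignment
    memory = 0  # leading needle characters already known to match
    while pos <= n - m:
        if trace is not None:
            trace.append(pos)
        # Right part, left to right, skipping what memory already covers.
        i = max(cut, memory)
        while i < m and haystack[pos + i] == needle[i]:
            i += 1
        if i < m:
            pos += (i - cut) + 1  # amount matched in the right part, plus 1
            memory = 0            # a right-part mismatch resets the memory
            continue
        # Left part, right to left, stopping where memory takes over.
        j = cut - 1
        while j >= memory and haystack[pos + j] == needle[j]:
            j -= 1
        if j < memory:
            return pos
        pos += period             # jump one period ...
        memory = m - period       # ... and remember the overlap
    return -1


hay = "bbbAbbAAbAAbAAbbbAAbAAbAAbAA"
print(find_periodic(hay, "AAbAAbAAbA", cut=2, period=3))  # 17
```

Passing a list as `trace` shows which alignments were attempted, which
is a convenient way to replay the running example by hand.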


-------- When to choose the periodic algorithm --------

The periodic algorithm is always valid but has an overhead of one
more "memory" register and some memory computation steps, so the
non-periodic/long-period algorithm described first -- skipping by
max(len(left_part), len(right_part)) + 1 rather than the period --
should be preferred when possible.

Interestingly, the long-period algorithm does not require an exact
computation of the period; it works even with some long-period, but
undeniably "periodic" needles:

    Cut: AbcdefAbc == Abcde + fAbc

This cut gives these inequalities:

             e != f
            de != fA
           cde != fAb
          bcde != fAbc
         Abcde != fAbc?
    The first failure is a period long, per the CFT:
        ?Abcde == fAbc??

A sufficient condition for using the long-period algorithm is having
the period of the needle be greater than
max(len(left_part), len(right_part)). This way, after choosing a good
split, we get all of the max(len(left_part), len(right_part))
inequalities around the cut that were required in the long-period
version of the algorithm.

With all of this in mind, here's how we choose:

    (1) Choose a "critical factorization" of the needle -- a cut
        where we have period minus 1 inequalities in a row.
        More specifically, choose a cut so that the left_part
        is less than one period long.
    (2) Determine the period P_r of the right_part.
    (3) Check if the left part is just an extension of the pattern of
        the right part, so that the whole needle has period P_r.
        Explicitly, check if
            needle[0:cut] == needle[0+P_r:cut+P_r]
        If so, we use the periodic algorithm. If not equal, we use the
        long-period algorithm.
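
Steps (2) and (3) can be sketched as follows. This is an illustration
only: it assumes the cut from step (1) is already known, and it finds
P_r by brute force rather than in linear time. The names
`smallest_period` and `choose_variant` are hypothetical.

```python
def smallest_period(s):
    """Brute-force smallest p >= 1 with s[i] == s[i + p] wherever both exist.

    Requires a non-empty string.
    """
    return next(p for p in range(1, len(s) + 1) if s[p:] == s[:len(s) - p])


def choose_variant(needle, cut):
    """Return (variant, shift) for a needle split at a critical `cut`.

    variant is "periodic" (shift is the needle's period) or
    "long-period" (shift is max(len(left), len(right)) + 1).
    """
    right_period = smallest_period(needle[cut:])                    # step (2)
    if needle[:cut] == needle[right_period:right_period + cut]:     # step (3)
        return "periodic", right_period
    return "long-period", max(cut, len(needle) - cut) + 1


print(choose_variant("AAbAAbAAbA", 2))  # ('periodic', 3)
print(choose_variant("AbcdefAbc", 5))   # ('long-period', 6)
```

For "AbcdefAbc" the right part "fAbc" has period 4, the comparison in
step (3) fails, and the resulting shift of 6 does not exceed the
needle's actual period, which is also 6.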

Note that if equality holds in (3), then the period of the whole
string is P_r. On the other hand, suppose equality does not hold.
The period of the needle is then strictly greater than P_r. Here's
the fact we rely on:

    If p is a substring of s and p has period r, then the period
    of s is either equal to r or greater than len(p).

As stated for arbitrary substrings, this can fail: p = "aabaa" has
period 3 and is a substring of s = "aabaaa", which has period 4.
The argument needs the extra structure of the cut chosen below,
where right_part is a maximal suffix of the needle.

We know that needle_period != P_r,
and therefore needle_period > len(right_part).
Additionally, we'll choose the cut (see below)
so that len(left_part) < needle_period.

Thus, in the case where equality does not hold, we have that
needle_period >= max(len(left_part), len(right_part)) + 1,
so the long-period algorithm works, but otherwise, we know the period
of the needle.

Note that this decision process doesn't always require an exact
computation of the period -- we can get away with only computing P_r!


-------- Computing the cut --------

Our remaining tasks are now to compute a cut of the needle with as
many inequalities as possible, ensuring that cut < needle_period.
Meanwhile, we must also compute the period P_r of the right_part.

The computation is relatively simple, essentially doing this:

    suffix1 = max(needle[i:] for i in range(len(needle)))
    suffix2 = ... # the same as above, but invert the alphabet
    cut1 = len(needle) - len(suffix1)
    cut2 = len(needle) - len(suffix2)
    cut = max(cut1, cut2) # the later cut

For cut2, "invert the alphabet" is different than saying min(...),
since in lexicographic order, we still put "py" < "python", even
if the alphabet is inverted. Computing these, along with the method
of computing the period of the right half, is easiest to read directly
from the source code in fastsearch.h, in which these are computed
in linear time.
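
For experimenting, the pseudocode above can be made runnable. The
following is a naive sketch for str needles, quadratic rather than
linear, and not a transcription of fastsearch.h; the helper names are
hypothetical. "Inverting the alphabet" is done by negating code points,
which keeps a proper prefix smaller than the longer string.

```python
def maximal_suffix_start(needle, invert=False):
    """Start index of the lexicographically largest suffix of `needle`.

    With invert=True the alphabet order is reversed, but a proper prefix
    still sorts before the longer string (so "py" < "python" either way).
    """
    sign = -1 if invert else 1

    def key(suffix):
        # Lists compare element by element, and a proper prefix is smaller.
        return [sign * ord(ch) for ch in suffix]

    best = max((needle[i:] for i in range(len(needle))), key=key)
    return len(needle) - len(best)


def critical_cut(needle):
    """The later of the two maximal-suffix cuts, as in the pseudocode above."""
    cut1 = maximal_suffix_start(needle)
    cut2 = maximal_suffix_start(needle, invert=True)
    return max(cut1, cut2)


def right_part_period(needle, cut):
    """Brute-force period P_r of needle[cut:]."""
    right = needle[cut:]
    return next(p for p in range(1, len(right) + 1)
                if right[p:] == right[:len(right) - p])


for needle in ("AAbAAbAAbA", "AbcdefAbc"):
    cut = critical_cut(needle)
    print(needle[:cut], "+", needle[cut:], "P_r =", right_part_period(needle, cut))
```

On the two needles used in this document, the sketch reproduces the
cuts AA + bAAbAAbA and Abcde + fAbc.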

Crochemore & Perrin's Theorem 3.1 gives that "cut" above is a
critical factorization less than the period, but a very brief sketch
of their proof goes something like this (this is far from complete):

    * If this cut splits the needle as some
      needle == (a + w) + (w + b), meaning there's a bad equality
      w == w, it's impossible for w + b to be bigger than both
      b and w + w + b, so this can't happen. We thus have all of
      the inequalities with no question marks.
    * By maximality, the right part is not a substring of the left
      part. Thus, we have all of the inequalities involving no
      left-side question marks.
    * If you have all of the inequalities without right-side question
      marks, we have a critical factorization.
    * If one such inequality fails, then there's a smaller period,
      but the factorization is nonetheless critical. Here's where
      you need the redundancy coming from computing both cuts and
      choosing the later one.


-------- Some more Bells and Whistles --------

Beyond Crochemore & Perrin's original algorithm, we can use a few
more tricks for speed in fastsearch.h:

    1. Even though C&P has a best-case O(n/m) time, this doesn't occur
       very often, so we add a Boyer-Moore bad character table to
       achieve sublinear time in more cases.

    2. The prework of computing the cut/period is expensive per
       needle character, so we shouldn't do it if it won't pay off.
       For this reason, if the needle and haystack are long enough,
       only automatically start with two-way if the needle's length
       is a small percentage of the length of the haystack.

    3. In cases where the needle and haystack are large but the needle
       makes up a significant percentage of the length of the
       haystack, don't pay the expensive two-way preprocessing cost
       if you don't need to. Instead, keep track of how many
       character comparisons are equal, and if that exceeds
       O(len(needle)), then pay that cost, since the simpler algorithm
       isn't doing very well.
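
To illustrate trick 1, here is a generic bad-character table keyed on the
last character of the current window, in the style of Boyer-Moore-Horspool.
It is a sketch of the general idea only: the names are hypothetical, and
the table layout actually used in fastsearch.h may differ.

```python
def last_char_shifts(needle):
    """Map each needle character to the distance from its last
    occurrence to the end of the needle (0 for the final character)."""
    m = len(needle)
    # Later occurrences overwrite earlier ones, leaving the last occurrence.
    return {ch: m - 1 - i for i, ch in enumerate(needle)}


def next_candidate(haystack, needle, pos, shifts):
    """Advance `pos` until the window's last character could end a match.

    Returns the new alignment, or -1 if the haystack is exhausted.
    Characters absent from the needle allow a jump of len(needle).
    """
    m = len(needle)
    while pos <= len(haystack) - m:
        shift = shifts.get(haystack[pos + m - 1], m)
        if shift == 0:
            return pos  # last character agrees; compare the rest here
        pos += shift
    return -1


shifts = last_char_shifts("abc")
print(shifts)                                           # {'a': 2, 'b': 1, 'c': 0}
print(next_candidate("xxxxxxxxabc", "abc", 0, shifts))  # 8
```

Only three haystack characters are inspected before alignment 8 is
reached, which is where the sublinear behavior comes from.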