This document explains Crochemore and Perrin's Two-Way string matching
algorithm, in which a smaller string (the "pattern" or "needle")
is searched for in a longer string (the "text" or "haystack"),
determining whether the needle is a substring of the haystack, and if
so, at what index(es). It is to be used by Python's string
(and bytes-like) objects when calling `find`, `index`, `__contains__`,
or implicitly in methods like `replace` or `partition`.
This is essentially a re-telling of the paper
    Crochemore M., Perrin D., 1991, Two-way string-matching,
        Journal of the ACM 38(3):651-675.
focused more on understanding and examples than on rigor. See also
the code sample here:
    http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
The algorithm runs in O(len(needle) + len(haystack)) time and with
O(1) space. However, since there is a larger preprocessing cost than
simpler algorithms, this Two-Way algorithm is to be used only when the
needle and haystack lengths meet certain thresholds.
These are the basic steps of the algorithm:
    * "Very carefully" cut the needle in two.
    * For each alignment attempted:
        1. match the right part
            * On failure, jump by the amount matched + 1
        2. then match the left part.
            * On failure jump by max(len(left), len(right)) + 1
    * If the needle is periodic, don't re-do comparisons; maintain
      a "memory" of how many characters you already know match.
-------- Matching the right part --------
We first scan the right part of the needle to check if it matches
the aligned characters in the haystack. We scan left-to-right,
and if a mismatch occurs, we jump ahead by the amount matched plus 1.
Example:
text:    ........EFGX...................
pattern: ....abcdEFGH....
cut:         <<<<>>>>
Matched 3, so jump ahead by 4:
text:    ........EFGX...................
pattern:     ....abcdEFGH....
cut:             <<<<>>>>
Why are we allowed to do this? Because we cut the needle very
carefully, in such a way that if the cut is ...abcd + EFGH... then
we have
d != E
cd != EF
bcd != EFG
abcd != EFGH
... and so on.
If this is true for every pair of equal-length substrings around the
cut, then the following alignments do not work, so we can skip them:
text:    ........EFG....................
pattern:  ....abcdEFGH....
                 ^ (Bad because d != E)
text:    ........EFG....................
pattern:   ....abcdEFGH....
                 ^^ (Bad because cd != EF)
text:    ........EFG....................
pattern:    ....abcdEFGH....
                 ^^^ (Bad because bcd != EFG)
Skip 3 alignments => increment alignment by 4.
-------- If len(left_part) < len(right_part) --------
Above is the core idea, and it begins to suggest how the algorithm can
be linear-time. There is one bit of subtlety involving what to do
around the end of the needle: if the left half is shorter than the
right, then we could run into something like this:
text:    .....EFG......
pattern:    cdEFGH
The same argument holds that we can skip ahead by 4, so long as
d != E
cd != EF
?cd != EFG
??cd != EFGH
etc.
The question marks represent "wildcards" that always match; they're
outside the limits of the needle, so there's no way for them to
invalidate a match. To ensure that the inequalities above are always
true, we need them to be true for all possible '?' values. We thus
need cd != FG and cd != GH, etc.
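The wildcard comparisons above can be checked mechanically. The following is a minimal sketch, not code from fastsearch.h; the helper name `chunks_can_match` is made up for illustration. It reports whether the length-k chunks on either side of a cut agree once positions outside the needle are treated as '?' wildcards.

```python
def chunks_can_match(needle, cut, k):
    """Return True if the k characters left of the cut can equal the k
    characters right of it, treating positions outside the needle as
    wildcards that match anything."""
    # Compare needle[i] with needle[i + k] for every i in [cut - k, cut)
    # for which both positions actually lie inside the needle.
    start = max(0, cut - k)
    stop = min(cut, len(needle) - k)
    return all(needle[i] == needle[i + k] for i in range(start, stop))


needle, cut = "cdEFGH", 2
# Lengths at which the two sides can agree. An empty comparison range
# (nothing but wildcards) counts as agreement.
agreeing = [k for k in range(1, len(needle) + 1)
            if chunks_can_match(needle, cut, k)]
print(agreeing)
```

For 'cd' + 'EFGH' the only agreeing length is 6, the full needle length, where every compared position is a wildcard. All the inequalities for lengths 1 through 5 therefore hold.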
-------- Matching the left part --------
Once we have ensured the right part matches, we scan the left part
(order doesn't matter, but traditionally right-to-left), and if we
find a mismatch, we jump ahead by
max(len(left_part), len(right_part)) + 1. That we can jump by
at least len(right_part) + 1 we have already seen:
text:    .....EFG.....
pattern:  abcdEFG
Matched 3, so jump by 4,
using the fact that d != E, cd != EF, and bcd != EFG.
But we can also jump by at least len(left_part) + 1:
text:    ....cdEF.....
pattern:   abcdEF
Jump by len('abcd') + 1 = 5.
Skip the alignments:
text:    ....cdEF.....
pattern:    abcdEF
text:    ....cdEF.....
pattern:     abcdEF
text:    ....cdEF.....
pattern:      abcdEF
text:    ....cdEF.....
pattern:       abcdEF
This requires the following facts:
d != E
cd != EF
bcd != EF?
abcd != EF??
etc., for all values of ?s, as above.
If we have both sets of inequalities, then we can indeed jump by
max(len(left_part), len(right_part)) + 1. Under the assumption of such
a nice splitting of the needle, we now have enough to prove linear
time for the search: consider the forward-progress/comparisons ratio
at each alignment position. If a mismatch occurs in the right part,
the ratio is 1 position forward per comparison. On the other hand,
if a mismatch occurs in the left half, we advance by more than
len(needle)//2 positions for at most len(needle) comparisons,
so this ratio is more than 1/2. This average "movement speed" is
bounded below by the constant "1 position per 2 comparisons", so we
have linear time.
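The two matching phases described so far can be sketched as a short loop. This is an illustrative Python version, not the C code in fastsearch.h. The function name `find_long_period` is invented here, and the cut is passed in by hand on the assumption that it already has the inequality property for every length up to max(len(left_part), len(right_part)).

```python
def find_long_period(haystack, needle, cut):
    """Two-Way search without memory. Assumes `cut` splits `needle` so
    that the chunks around the cut mismatch for every length up to
    max(len(left_part), len(right_part))."""
    n, m = len(haystack), len(needle)
    gap = max(cut, m - cut) + 1
    pos = 0
    while pos + m <= n:
        # 1. Match the right part, scanning left-to-right.
        i = cut
        while i < m and needle[i] == haystack[pos + i]:
            i += 1
        if i < m:
            pos += i - cut + 1      # amount matched + 1
            continue
        # 2. Match the left part, scanning right-to-left.
        j = cut - 1
        while j >= 0 and needle[j] == haystack[pos + j]:
            j -= 1
        if j < 0:
            return pos
        pos += gap                  # max(len(left), len(right)) + 1
    return -1


print(find_long_period("....abcd....abcdEFGX...abcdEFGH..", "abcdEFGH", 4))
```

Passing a cut that lacks the inequality property makes the skips unsafe, so this function is only as correct as the cut it is given.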
-------- The periodic case --------
The sets of inequalities listed so far seem too good to be true in
the general case. Indeed, they fail when a needle is periodic:
there's no way to split 'AAbAAbAAbA' in two such that
(the stuff n characters to the left of the split)
cannot equal
(the stuff n characters to the right of the split)
for all n.
This is because no matter how you cut it, you'll get
s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the
needle in two so that n can be as big as possible. If we were to
split it as
AAbA + AbAAbA
then A == A at the split, so this is bad (we failed at length 1), but
if we split it as
AA + bAAbAAbA
we at least have A != b and AA != bA, and we fail at length 3
since ?AA == bAA. We already knew that a cut to make length-3
mismatch was impossible due to the period, but we now see that the
bound is sharp; we can get length-1 and length-2 to mismatch.
This is exactly the content of the *critical factorization theorem*:
that no matter the period of the original needle, you can cut it in
such a way that (with the appropriate question marks),
needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.
Even "non-periodic" strings are periodic with a period equal to
their length, so for such needles, the CFT already guarantees that
the algorithm described so far will work, since we can cut the needle
so that the length-k chunks on either side of the cut mismatch for all
k < len(needle). Looking closer at the algorithm, we only actually
require that k go up to max(len(left_part), len(right_part)).
So long as the period exceeds that, we're good.
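The critical factorization theorem can be checked by brute force on small needles. The sketch below is illustrative only and far slower than the linear-time computation in fastsearch.h; `period` and `local_period` are names invented here. `local_period` returns the smallest k at which the chunks around a cut agree (wildcards included), and a cut is critical when that k equals the period of the whole needle.

```python
def period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] wherever both exist."""
    return next(p for p in range(1, len(s) + 1) if s[p:] == s[:len(s) - p])


def local_period(s, cut):
    """Smallest k >= 1 at which the k characters before the cut agree
    with the k characters after it, ignoring out-of-range positions."""
    def agrees(k):
        lo, hi = max(0, cut - k), min(cut, len(s) - k)
        return all(s[i] == s[i + k] for i in range(lo, hi))
    return next(k for k in range(1, len(s) + 1) if agrees(k))


needle = "AAbAAbAAbA"
local = {cut: local_period(needle, cut) for cut in range(1, len(needle))}
critical_cuts = [cut for cut, k in local.items() if k == period(needle)]
print(period(needle), critical_cuts)
```

For the running example, cut 2 (AA + bAAbAAbA) is among the critical cuts and cut 4 (AAbA + AbAAbA) is not, matching the discussion above.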
The more general shorter-period case is a bit harder. The essentials
are the same, except we use the periodicity to our advantage by
"remembering" periods that we've already compared. In our running
example, say we're computing
"AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".
We cut as AA + bAAbAAbA, and then the algorithm runs as follows:
First alignment:
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
AAbAAbAAbA
  ^^X
- Mismatch at third position, so jump by 3.
- This requires that A != b and AA != bA.
Second alignment:
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
   AAbAAbAAbA
     ^^^^^^^^
    X
- Matched entire right part
- Mismatch at left part.
- Jump forward a period, remembering the existing comparisons
Third alignment:
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
      AAbAAbAAbA
      mmmmmmm^^X
- There's "memory": a bunch of characters were already matched.
- Two more characters match beyond that.
- The 8th character of the right part mismatched, so jump by 8
- The above rule is more complicated than usual: we don't have
  the right inequalities for lengths 1 through 7, but we do have
  shifted copies of the length-1 and length-2 inequalities,
  along with knowledge of the mismatch. We can skip all of these
  alignments at once:
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
           AAbAAbAAbA
            ~   A != b at the cut
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
            AAbAAbAAbA
            ~~   AA != bA at the cut
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
             AAbAAbAAbA
               ^^^^X   7-3=4 match, and the 5th misses.
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
              AAbAAbAAbA
               ~   A != b at the cut
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
               AAbAAbAAbA
               ~~   AA != bA at the cut
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                AAbAAbAAbA
                  ^X   7-3-3=1 match and the 2nd misses.
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                 AAbAAbAAbA
                  ~   A != b at the cut
Fourth alignment:
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
              AAbAAbAAbA
                ^^^^^^^^
               X
- The jump by 8 lands here.
- Right part matches, left part mismatches, so use memory and
  skip ahead by period=3
Fifth alignment:
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                 AAbAAbAAbA
                 mmmmmmm^^^
- Right part matches, left part is remembered, found a match!
The one tricky skip by 8 here generalizes: if we have a period of p,
then the CFT says we can ensure the cut has the inequality property
for lengths 1 through p-1, and jumping by p would line up the
matching characters and mismatched character one period earlier.
Inductively, this proves that we can skip by the number of characters
matched in the right half, plus 1, just as in the original algorithm.
To make it explicit, the memory is set whenever the entire right part
is matched and is then used as a starting point in the next alignment.
In such a case, the alignment jumps forward one period, and the right
half matches all except possibly the last period. Additionally,
if we cut so that the left part has a length strictly less than the
period (we always can!), then we can know that the left part already
matches. The memory is reset to 0 whenever there is a mismatch in the
right part.
To prove linearity for the periodic case, note that if a right-part
character mismatches, then we advance forward 1 unit per comparison.
On the other hand, if the entire right part matches, then the skipping
forward by one period "defers" some of the comparisons to the next
alignment, where they will then be spent at the usual rate of
one comparison per step forward. Even if left-half comparisons
are always "wasted", they constitute less than half of all
comparisons, so the average rate is certainly at least 1 move forward
per 2 comparisons.
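The periodic variant with its memory register can be sketched in the same style. This is a simplified illustration rather than the fastsearch.h implementation; `find_periodic` is a name invented here, and the cut and period are supplied by the caller on the assumption that the cut is critical, that cut < period, and that `period` is the true period of the needle.

```python
def find_periodic(haystack, needle, cut, period):
    """Two-Way search with memory, for a needle of known period.
    Assumes `cut` is a critical factorization with cut < period."""
    n, m = len(haystack), len(needle)
    pos = 0
    memory = 0      # leading needle characters already known to match
    while pos + m <= n:
        # Right part, left-to-right, skipping what memory already covers.
        i = max(cut, memory)
        while i < m and needle[i] == haystack[pos + i]:
            i += 1
        if i < m:
            pos += i - cut + 1
            memory = 0              # reset on a right-part mismatch
            continue
        # Left part, right-to-left, stopping at the remembered prefix.
        j = cut - 1
        while j >= memory and needle[j] == haystack[pos + j]:
            j -= 1
        if j < memory:
            return pos
        # Whole right part matched: jump one period and remember it.
        pos += period
        memory = m - period
    return -1


print(find_periodic("bbbAbbAAbAAbAAbbbAAbAAbAAbAA", "AAbAAbAAbA", 2, 3))
```

On the running example this returns 17, the same index that `str.find` reports.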
-------- When to choose the periodic algorithm --------
The periodic algorithm is always valid but has an overhead of one
more "memory" register and some memory computation steps, so the
here-described-first non-periodic/long-period algorithm -- skipping by
max(len(left_part), len(right_part)) + 1 rather than the period --
should be preferred when possible.
Interestingly, the long-period algorithm does not require an exact
computation of the period; it works even with some long-period, but
undeniably "periodic" needles:
Cut: AbcdefAbc == Abcde + fAbc
This cut gives these inequalities:
     e != f
    de != fA
   cde != fAb
  bcde != fAbc
 Abcde != fAbc?
The first failure is a period long, per the CFT:
?Abcde == fAbc??
A sufficient condition for using the long-period algorithm is having
the period of the needle be greater than
max(len(left_part), len(right_part)). This way, after choosing a good
split, we get all of the max(len(left_part), len(right_part))
inequalities around the cut that were required in the long-period
version of the algorithm.
With all of this in mind, here's how we choose:
(1) Choose a "critical factorization" of the needle -- a cut
where we have period minus 1 inequalities in a row.
More specifically, choose a cut so that the left_part
is less than one period long.
(2) Determine the period P_r of the right_part.
(3) Check if the left part is just an extension of the pattern of
the right part, so that the whole needle has period P_r.
Explicitly, check if
needle[0:cut] == needle[0+P_r:cut+P_r]
If so, we use the periodic algorithm. If not equal, we use the
long-period algorithm.
Note that if equality holds in (3), then the period of the whole
string is P_r. On the other hand, suppose equality does not hold.
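The period of the needle is then strictly greater than P_r. Here's
the fact we need. It is not true of arbitrary substrings ('aabaa'
has period 3, yet 'abaaabaa' contains it and has period 4), but it
holds when p is the right part of a critical factorization of s
whose left part is shorter than the period of s, which is the kind
of cut we choose:
If p is such a right part and p has period r, then the period
of s is either equal to r or greater than len(p).
We know that needle_period != P_r,
and therefore needle_period > len(right_part).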
Additionally, we'll choose the cut (see below)
so that len(left_part) < needle_period.
Thus, in the case where equality does not hold, we have that
needle_period >= max(len(left_part), len(right_part)) + 1,
so the long-period algorithm works, but otherwise, we know the period
of the needle.
Note that this decision process doesn't always require an exact
computation of the period -- we can get away with only computing P_r!
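Steps (2) and (3) of the decision can be sketched as follows. This is a hedged illustration: `period` is a naive helper and `choose_algorithm` is a made-up name, neither taken from fastsearch.h, and the cut is assumed to already be a critical factorization with cut < needle_period.

```python
def period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] wherever both exist."""
    return next(p for p in range(1, len(s) + 1) if s[p:] == s[:len(s) - p])


def choose_algorithm(needle, cut):
    """Return ("periodic", period) or ("long-period", gap) for a cut
    assumed to be critical with cut < period(needle)."""
    p_r = period(needle[cut:])                    # step (2)
    if needle[:cut] == needle[p_r:cut + p_r]:     # step (3)
        return "periodic", p_r
    # Otherwise the true period is at least this gap, so skipping by it
    # after a left-part mismatch is safe.
    gap = max(cut, len(needle) - cut) + 1
    return "long-period", gap


print(choose_algorithm("AAbAAbAAbA", 2))
print(choose_algorithm("AbcdefAbc", 5))
```

The first call selects the periodic algorithm with period 3. The second selects the long-period algorithm with a gap of 6, even though 'AbcdefAbc' does have period 6.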
-------- Computing the cut --------
Our remaining tasks are now to compute a cut of the needle with as
many inequalities as possible, ensuring that cut < needle_period.
Meanwhile, we must also compute the period P_r of the right_part.
The computation is relatively simple, essentially doing this:
suffix1 = max(needle[i:] for i in range(len(needle)))
suffix2 = ... # the same as above, but invert the alphabet
cut1 = len(needle) - len(suffix1)
cut2 = len(needle) - len(suffix2)
cut = max(cut1, cut2) # the later cut
For cut2, "invert the alphabet" is different than saying min(...),
since in lexicographic order, we still put "py" < "python", even
if the alphabet is inverted. Computing these, along with the method
of computing the period of the right half, is easiest to read directly
from the source code in fastsearch.h, in which these are computed
in linear time.
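The pseudocode above can be made runnable for small `str` needles as follows. This is a naive sketch that compares whole suffixes, not the linear-time routine in fastsearch.h, and `naive_cut` is a name invented here. The inverted alphabet is modelled by negating code points inside a list key, so that a proper prefix still sorts before its extension.

```python
def naive_cut(needle):
    """Return the later of the two maximal-suffix cuts of a non-empty
    str needle (slow; for illustration only)."""
    suffixes = [needle[i:] for i in range(len(needle))]
    # Maximal suffix under the ordinary alphabet.
    suffix1 = max(suffixes)
    # Maximal suffix under the inverted alphabet. Lists compare
    # lexicographically, and a shorter prefix still compares smaller,
    # so "py" < "python" holds here too.
    suffix2 = max(suffixes, key=lambda s: [-ord(c) for c in s])
    cut1 = len(needle) - len(suffix1)
    cut2 = len(needle) - len(suffix2)
    return max(cut1, cut2)      # the later cut


print(naive_cut("AAbAAbAAbA"), naive_cut("AbcdefAbc"))
```

This reproduces the cuts used earlier: AA + bAAbAAbA (cut 2) and Abcde + fAbc (cut 5).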
Crochemore & Perrin's Theorem 3.1 gives that "cut" above is a
critical factorization less than the period, but a very brief sketch
of their proof goes something like this (this is far from complete):
* If this cut splits the needle as some
needle == (a + w) + (w + b), meaning there's a bad equality
w == w, it's impossible for w + b to be bigger than both
b and w + w + b, so this can't happen. We thus have all of
the inequalities with no question marks.
* By maximality, the right part is not a substring of the left
part. Thus, we have all of the inequalities involving no
left-side question marks.
* If you have all of the inequalities without right-side question
marks, we have a critical factorization.
* If one such inequality fails, then there's a smaller period,
but the factorization is nonetheless critical. Here's where
you need the redundancy coming from computing both cuts and
choosing the later one.
-------- Some more Bells and Whistles --------
Beyond Crochemore & Perrin's original algorithm, we can use a couple
more tricks for speed in fastsearch.h:
1. Even though C&P has a best-case O(n/m) time, this doesn't occur
very often, so we add a Boyer-Moore bad character table to
achieve sublinear time in more cases.
2. The prework of computing the cut/period is expensive per
needle character, so we shouldn't do it if it won't pay off.
For this reason, if the needle and haystack are long enough,
only automatically start with two-way if the needle's length
is a small percentage of the length of the haystack.
3. In cases where the needle and haystack are large but the needle
makes up a significant percentage of the length of the
haystack, don't pay the expensive two-way preprocessing cost
if you don't need to. Instead, keep track of how many
character comparisons are equal, and if that exceeds
O(len(needle)), then pay that cost, since the simpler algorithm
isn't doing very well.
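The idea behind the bad character table in item 1 can be sketched in a few lines. The table in fastsearch.h is organized differently; this dict-based version, with the invented names `bad_character_shifts` and `find_with_table`, only illustrates how looking at the last character of the window lets whole alignments be skipped. For simplicity it verifies candidates with a slice comparison instead of the Two-Way steps, and it assumes a non-empty needle.

```python
def bad_character_shifts(needle):
    """Map each needle character to the distance from its last
    occurrence to the end of the needle."""
    last = len(needle) - 1
    # Later occurrences overwrite earlier ones, keeping the smallest shift.
    return {ch: last - i for i, ch in enumerate(needle)}


def find_with_table(haystack, needle):
    """Find `needle` (non-empty), using the table to skip windows
    whose last character rules out a match."""
    m = len(needle)
    shifts = bad_character_shifts(needle)
    pos = 0
    while pos + m <= len(haystack):
        # Characters absent from the needle allow a full-length skip.
        shift = shifts.get(haystack[pos + m - 1], m)
        if shift:
            pos += shift
        elif haystack[pos:pos + m] == needle:
            return pos
        else:
            pos += 1
    return -1


print(bad_character_shifts("abcab"))
print(find_with_table("bbbAbbAAbAAbAAbbbAAbAAbAAbAA", "AAbAAbAAbA"))
```

When the haystack contains many characters that never occur in the needle, most windows are skipped after a single lookup, which is where the sublinear behavior comes from.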