mirror of https://github.com/python/cpython
David Goodger <dgoodger@atsautomation.com>:
Documentation for difflib/ndiff refactoring: more of the ndiff functionality has been moved to the underlying library (difflib). This closes SF patch #445413.
parent 97dbec97bc
commit 6943a29cbf
|
@ -10,6 +10,48 @@
|
||||||
\versionadded{2.1}
|
\versionadded{2.1}
|
||||||
|
|
||||||
|
|
||||||
|
\begin{classdesc*}{SequenceMatcher}
|
||||||
|
This is a flexible class for comparing pairs of sequences of any
|
||||||
|
type, so long as the sequence elements are hashable. The basic
|
||||||
|
algorithm predates, and is a little fancier than, an algorithm
|
||||||
|
published in the late 1980's by Ratcliff and Obershelp under the
|
||||||
|
hyperbolic name ``gestalt pattern matching.'' The idea is to find
|
||||||
|
the longest contiguous matching subsequence that contains no
|
||||||
|
``junk'' elements (the Ratcliff and Obershelp algorithm doesn't
|
||||||
|
address junk). The same idea is then applied recursively to the
|
||||||
|
pieces of the sequences to the left and to the right of the matching
|
||||||
|
subsequence. This does not yield minimal edit sequences, but does
|
||||||
|
tend to yield matches that ``look right'' to people.
|
||||||
|
|
||||||
|
\strong{Timing:} The basic Ratcliff-Obershelp algorithm is cubic
|
||||||
|
time in the worst case and quadratic time in the expected case.
|
||||||
|
\class{SequenceMatcher} is quadratic time for the worst case and has
|
||||||
|
expected-case behavior dependent in a complicated way on how many
|
||||||
|
elements the sequences have in common; best case time is linear.
|
||||||
|
\end{classdesc*}
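A minimal sketch of the matching step described above, run against the current standard-library \module{difflib} (an assumption; the sample strings are illustrative and do not come from the original text). \method{find_longest_match()} reports the longest contiguous block, and \method{get_matching_blocks()} shows what results from applying the same idea to the pieces on either side of it.

```python
from difflib import SequenceMatcher

# Illustrative inputs (not from the original text); any hashable elements work.
a = "abxcd"
b = "abcd"

s = SequenceMatcher(None, a, b)  # None: treat no element as junk

# Longest contiguous matching block within a[0:len(a)] and b[0:len(b)].
longest = s.find_longest_match(0, len(a), 0, len(b))

# All matching blocks, found by recursing to the left and right of each match.
# The final zero-length block is a sentinel marking the end of both sequences.
blocks = s.get_matching_blocks()

print(longest)
for block in blocks:
    print(block)
```

Both "ab" and "cd" have length 2 here; the block that starts earliest in the first sequence is the one reported as the longest match.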
|
||||||
|
|
||||||
|
\begin{classdesc*}{Differ}
|
||||||
|
This is a class for comparing sequences of lines of text, and
|
||||||
|
producing human-readable differences or deltas. Differ uses
|
||||||
|
\class{SequenceMatcher} both to compare sequences of lines, and to
|
||||||
|
compare sequences of characters within similar (near-matching)
|
||||||
|
lines.
|
||||||
|
|
||||||
|
Each line of a \class{Differ} delta begins with a two-character code:
|
||||||
|
|
||||||
|
\begin{tableii}{l|l}{code}{Code}{Meaning}
|
||||||
|
\lineii{'- '}{line unique to sequence 1}
|
||||||
|
\lineii{'+ '}{line unique to sequence 2}
|
||||||
|
\lineii{'  '}{line common to both sequences}
|
||||||
|
\lineii{'? '}{line not present in either input sequence}
|
||||||
|
\end{tableii}
|
||||||
|
|
||||||
|
Lines beginning with `\code{?~}' attempt to guide the eye to
|
||||||
|
intraline differences, and were not present in either input
|
||||||
|
sequence. These lines can be confusing if the sequences contain tab
|
||||||
|
characters.
|
||||||
|
\end{classdesc*}
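The codes in the table can be observed directly. The following is a small sketch, assuming a current Python 3 \module{difflib} in which \method{compare()} returns a generator (hence the \code{list()} call); the variable names are illustrative.

```python
from difflib import Differ

a = 'one\ntwo\nthree\n'.splitlines(True)
b = 'ore\ntree\nemu\n'.splitlines(True)

# Materialize the delta so it can be inspected more than once.
delta = list(Differ().compare(a, b))

# The first two characters of every delta line are its code.
codes = [line[:2] for line in delta]

# Lines coded '? ' are guide lines produced by Differ itself.
guides = [line for line in delta if line.startswith('? ')]

print(codes)
print(len(guides), "guide lines")
```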
|
||||||
|
|
||||||
\begin{funcdesc}{get_close_matches}{word, possibilities\optional{,
|
\begin{funcdesc}{get_close_matches}{word, possibilities\optional{,
|
||||||
n\optional{, cutoff}}}
|
n\optional{, cutoff}}}
|
||||||
Return a list of the best ``good enough'' matches. \var{word} is a
|
Return a list of the best ``good enough'' matches. \var{word} is a
|
||||||
|
@ -40,25 +82,85 @@
|
||||||
\end{verbatim}
|
\end{verbatim}
|
||||||
\end{funcdesc}
|
\end{funcdesc}
|
||||||
|
|
||||||
\begin{classdesc*}{SequenceMatcher}
|
\begin{funcdesc}{ndiff}{a, b\optional{, linejunk\optional{,
|
||||||
This is a flexible class for comparing pairs of sequences of any
|
charjunk}}}
|
||||||
type, so long as the sequence elements are hashable. The basic
|
Compare \var{a} and \var{b} (lists of strings); return a
|
||||||
algorithm predates, and is a little fancier than, an algorithm
|
\class{Differ}-style delta.
|
||||||
published in the late 1980's by Ratcliff and Obershelp under the
|
|
||||||
hyperbolic name ``gestalt pattern matching.'' The idea is to find
|
|
||||||
the longest contiguous matching subsequence that contains no
|
|
||||||
``junk'' elements (the Ratcliff and Obershelp algorithm doesn't
|
|
||||||
address junk). The same idea is then applied recursively to the
|
|
||||||
pieces of the sequences to the left and to the right of the matching
|
|
||||||
subsequence. This does not yield minimal edit sequences, but does
|
|
||||||
tend to yield matches that ``look right'' to people.
|
|
||||||
|
|
||||||
\strong{Timing:} The basic Ratcliff-Obershelp algorithm is cubic
|
Optional keyword parameters \var{linejunk} and \var{charjunk} are
|
||||||
time in the worst case and quadratic time in the expected case.
|
for filter functions (or \code{None}):
|
||||||
\class{SequenceMatcher} is quadratic time for the worst case and has
|
|
||||||
expected-case behavior dependent in a complicated way on how many
|
\var{linejunk}: A function that should accept a single string
|
||||||
elements the sequences have in common; best case time is linear.
|
argument, and return true if the string is junk (or false if it is
|
||||||
\end{classdesc*}
|
not). The default is the module-level function
|
||||||
|
\function{IS_LINE_JUNK()}, which filters out lines without visible
|
||||||
|
characters, except for at most one pound character (\character{\#}).
|
||||||
|
|
||||||
|
\var{charjunk}: A function that should accept a string of length 1.
|
||||||
|
The default is the module-level function \function{IS_CHARACTER_JUNK()},
|
||||||
|
which filters out whitespace characters (a blank or tab; note: bad
|
||||||
|
idea to include newline in this!).
|
||||||
|
|
||||||
|
\file{Tools/scripts/ndiff.py} is a command-line front-end to this
|
||||||
|
function.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
|
||||||
|
... 'ore\ntree\nemu\n'.splitlines(1))
|
||||||
|
>>> print ''.join(diff),
|
||||||
|
- one
|
||||||
|
? ^
|
||||||
|
+ ore
|
||||||
|
? ^
|
||||||
|
- two
|
||||||
|
- three
|
||||||
|
? -
|
||||||
|
+ tree
|
||||||
|
+ emu
|
||||||
|
\end{verbatim}
|
||||||
|
\end{funcdesc}
|
||||||
|
|
||||||
|
\begin{funcdesc}{restore}{sequence, which}
|
||||||
|
Return one of the two sequences that generated a delta.
|
||||||
|
|
||||||
|
Given a \var{sequence} produced by \method{Differ.compare()} or
|
||||||
|
\function{ndiff()}, extract lines originating from file 1 or 2
|
||||||
|
(parameter \var{which}), stripping off line prefixes.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
|
||||||
|
... 'ore\ntree\nemu\n'.splitlines(1))
|
||||||
|
>>> print ''.join(restore(diff, 1)),
|
||||||
|
one
|
||||||
|
two
|
||||||
|
three
|
||||||
|
>>> print ''.join(restore(diff, 2)),
|
||||||
|
ore
|
||||||
|
tree
|
||||||
|
emu
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
\end{funcdesc}
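A sketch of the same round trip under a current Python 3 \module{difflib}, where (an assumption about the interpreter in use, not something stated in this patch) \function{ndiff()} and \function{restore()} return generators. The delta is stored in a list first so that it can be consumed twice.

```python
from difflib import ndiff, restore

a = 'one\ntwo\nthree\n'.splitlines(True)
b = 'ore\ntree\nemu\n'.splitlines(True)

# Store the delta in a list; a generator could only be iterated once.
delta = list(ndiff(a, b))

first = list(restore(delta, 1))   # lines originating from sequence 1
second = list(restore(delta, 2))  # lines originating from sequence 2

print(''.join(first), end='')
print(''.join(second), end='')
```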
|
||||||
|
|
||||||
|
|
||||||
|
\begin{funcdesc}{IS_LINE_JUNK}{line}
|
||||||
|
|
||||||
|
Return 1 if \var{line} is ignorable, that is, iff it is blank or contains a
|
||||||
|
single \character{\#}. Used as a default for parameter
|
||||||
|
\var{linejunk} in \function{ndiff()}.
|
||||||
|
|
||||||
|
\end{funcdesc}
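A short sketch of the filter applied to a few sample lines (the samples are illustrative). It assumes a current Python 3 \module{difflib}, where the function returns \code{True} or \code{False} rather than 1 or 0.

```python
from difflib import IS_LINE_JUNK

samples = ['\n', '  #   \n', 'hello\n']

# True for blank lines and for lines holding only a single '#'.
results = [IS_LINE_JUNK(line) for line in samples]

for line, is_junk in zip(samples, results):
    print(repr(line), is_junk)
```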
|
||||||
|
|
||||||
|
|
||||||
|
\begin{funcdesc}{IS_CHARACTER_JUNK}{ch}
|
||||||
|
|
||||||
|
Return 1 if \var{ch} is ignorable, that is, iff it is a space or tab.
|
||||||
|
Used as a default for parameter \var{charjunk} in
|
||||||
|
\function{ndiff()}.
|
||||||
|
|
||||||
|
\end{funcdesc}
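The character filter can be checked the same way. This sketch uses illustrative sample characters and again assumes a current Python 3 \module{difflib}; note that a newline is not treated as junk.

```python
from difflib import IS_CHARACTER_JUNK

samples = [' ', '\t', '\n', 'x']

# Only a blank or a tab counts as junk.
results = [IS_CHARACTER_JUNK(ch) for ch in samples]

for ch, is_junk in zip(samples, results):
    print(repr(ch), is_junk)
```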
|
||||||
|
|
||||||
|
|
||||||
\begin{seealso}
|
\begin{seealso}
|
||||||
|
@ -231,9 +333,9 @@ replace a[3:4] (x) b[2:3] (y)
|
||||||
range [0, 1].
|
range [0, 1].
|
||||||
|
|
||||||
Where T is the total number of elements in both sequences, and M is
|
Where T is the total number of elements in both sequences, and M is
|
||||||
the number of matches, this is 2.0*M / T. Note that this is \code{1.}
|
the number of matches, this is 2.0*M / T. Note that this is
|
||||||
if the sequences are identical, and \code{0.} if they have nothing in
|
\code{1.0} if the sequences are identical, and \code{0.0} if they
|
||||||
common.
|
have nothing in common.
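The formula can be checked by hand. In this sketch, \code{ratio_by_hand} is a hypothetical helper (not part of \module{difflib}) that derives M from the matching blocks and T from the sequence lengths; the input strings are illustrative.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Compute 2.0*M / T from the matching blocks (hypothetical helper)."""
    s = SequenceMatcher(None, a, b)
    matches = sum(block.size for block in s.get_matching_blocks())  # M
    total = len(a) + len(b)                                         # T
    # Two empty sequences are treated as identical.
    return 2.0 * matches / total if total else 1.0


# "bcd" is the only matching block, so M = 3 and T = 8.
by_hand = ratio_by_hand("abcd", "bcde")
built_in = SequenceMatcher(None, "abcd", "bcde").ratio()

identical = ratio_by_hand("abc", "abc")
disjoint = ratio_by_hand("abc", "xyz")

print(by_hand, built_in, identical, disjoint)
```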
|
||||||
|
|
||||||
This is expensive to compute if \method{get_matching_blocks()} or
|
This is expensive to compute if \method{get_matching_blocks()} or
|
||||||
\method{get_opcodes()} hasn't already been called, in which case you
|
\method{get_opcodes()} hasn't already been called, in which case you
|
||||||
|
@ -272,7 +374,7 @@ at least as large as \method{ratio()}:
|
||||||
\end{verbatim}
|
\end{verbatim}
|
||||||
|
|
||||||
|
|
||||||
\subsection{Examples \label{difflib-examples}}
|
\subsection{SequenceMatcher Examples \label{sequencematcher-examples}}
|
||||||
|
|
||||||
|
|
||||||
This example compares two strings, considering blanks to be ``junk:''
|
This example compares two strings, considering blanks to be ``junk:''
|
||||||
|
@ -321,11 +423,122 @@ insert a[8:8] b[8:17]
|
||||||
equal a[14:29] b[23:38]
|
equal a[14:29] b[23:38]
|
||||||
\end{verbatim}
|
\end{verbatim}
|
||||||
|
|
||||||
See \file{Tools/scripts/ndiff.py} from the Python source distribution
|
|
||||||
for a fancy human-friendly file differencer, which uses
|
|
||||||
\class{SequenceMatcher} both to view files as sequences of lines, and
|
|
||||||
lines as sequences of characters.
|
|
||||||
|
|
||||||
See also the function \function{get_close_matches()} in this module,
|
See also the function \function{get_close_matches()} in this module,
|
||||||
which shows how simple code building on \class{SequenceMatcher} can be
|
which shows how simple code building on \class{SequenceMatcher} can be
|
||||||
used to do useful work.
|
used to do useful work.
|
||||||
|
|
||||||
|
|
||||||
|
\subsection{Differ Objects \label{differ-objects}}
|
||||||
|
|
||||||
|
Note that \class{Differ}-generated deltas make no claim to be
|
||||||
|
\strong{minimal} diffs. To the contrary, minimal diffs are often
|
||||||
|
counter-intuitive, because they synch up anywhere possible, sometimes
|
||||||
|
at accidental matches 100 pages apart. Restricting synch points to
|
||||||
|
contiguous matches preserves some notion of locality, at the
|
||||||
|
occasional cost of producing a longer diff.
|
||||||
|
|
||||||
|
The \class{Differ} class has this constructor:
|
||||||
|
|
||||||
|
\begin{classdesc}{Differ}{\optional{linejunk\optional{, charjunk}}}
|
||||||
|
Optional keyword parameters \var{linejunk} and \var{charjunk} are
|
||||||
|
for filter functions (or \code{None}):
|
||||||
|
|
||||||
|
\var{linejunk}: A function that should accept a single string
|
||||||
|
argument, and return true iff the string is junk. The default is
|
||||||
|
the module-level function \function{IS_LINE_JUNK()}, which filters out
|
||||||
|
lines without visible characters, except for at most one pound
|
||||||
|
character (\character{\#}).
|
||||||
|
|
||||||
|
\var{charjunk}: A function that should accept a string of length 1.
|
||||||
|
The default is the module-level function \function{IS_CHARACTER_JUNK()},
|
||||||
|
which filters out whitespace characters (a blank or tab; note: bad
|
||||||
|
idea to include newline in this!).
|
||||||
|
\end{classdesc}
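A minimal sketch of passing both filters explicitly, assuming a current Python 3 \module{difflib} (whose defaults may differ from those described here, which is why the filters are named outright). The input lines are illustrative. The filters only steer where matches are sought; every input line still appears in the delta, as the round trip through \function{restore()} suggests.

```python
from difflib import Differ, IS_CHARACTER_JUNK, IS_LINE_JUNK, restore

a = ['alpha beta\n', '\n', 'gamma\n']
b = ['alpha  beta\n', 'gamma\n', '#\n']

# Name both filters explicitly rather than relying on the defaults.
d = Differ(linejunk=IS_LINE_JUNK, charjunk=IS_CHARACTER_JUNK)
delta = list(d.compare(a, b))

# Junk lines are not dropped from the output: both inputs can be recovered.
recovered_a = list(restore(delta, 1))
recovered_b = list(restore(delta, 2))

print(''.join(delta), end='')
```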
|
||||||
|
|
||||||
|
\class{Differ} objects are used (deltas generated) via a single
|
||||||
|
method:
|
||||||
|
|
||||||
|
\begin{methoddesc}{compare}{a, b}
|
||||||
|
Compare two sequences of lines; return the resulting delta (list).
|
||||||
|
|
||||||
|
Each sequence must contain individual single-line strings ending
|
||||||
|
with newlines. Such sequences can be obtained from the
|
||||||
|
\method{readlines()} method of file-like objects. The list returned
|
||||||
|
is also made up of newline-terminated strings, and ready to be used
|
||||||
|
with the \method{writelines()} method of a file-like object.
|
||||||
|
\end{methoddesc}
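A short sketch of feeding the result of \method{compare()} straight to \method{writelines()}. It assumes a current Python 3 \module{difflib}, where \method{compare()} returns a generator rather than a list, and uses an in-memory \class{io.StringIO} object as a stand-in for a file; the two input sequences are illustrative.

```python
import io
from difflib import Differ

# Each element is a single newline-terminated line.
a = ['one\n', 'two\n']
b = ['one\n', 'three\n']

out = io.StringIO()  # stand-in for any file-like object
out.writelines(Differ().compare(a, b))

text = out.getvalue()
print(text, end='')
```

Because every delta line already ends with a newline, \method{writelines()} needs no extra separators.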
|
||||||
|
|
||||||
|
|
||||||
|
\subsection{Differ Example \label{differ-examples}}
|
||||||
|
|
||||||
|
This example compares two texts. First we set up the texts, sequences
|
||||||
|
of individual single-line strings ending with newlines (such sequences
|
||||||
|
can also be obtained from the \method{readlines()} method of file-like
|
||||||
|
objects):
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> text1 = ''' 1. Beautiful is better than ugly.
|
||||||
|
... 2. Explicit is better than implicit.
|
||||||
|
... 3. Simple is better than complex.
|
||||||
|
... 4. Complex is better than complicated.
|
||||||
|
... '''.splitlines(1)
|
||||||
|
>>> len(text1)
|
||||||
|
4
|
||||||
|
>>> text1[0][-1]
|
||||||
|
'\n'
|
||||||
|
>>> text2 = ''' 1. Beautiful is better than ugly.
|
||||||
|
... 3. Simple is better than complex.
|
||||||
|
... 4. Complicated is better than complex.
|
||||||
|
... 5. Flat is better than nested.
|
||||||
|
... '''.splitlines(1)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Next we instantiate a Differ object:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> d = Differ()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Note that when instantiating a \class{Differ} object we may pass
|
||||||
|
functions to filter out line and character ``junk.'' See the
|
||||||
|
\class{Differ} constructor for details.
|
||||||
|
|
||||||
|
Finally, we compare the two:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> result = d.compare(text1, text2)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
\code{result} is a list of strings, so let's pretty-print it:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> from pprint import pprint
|
||||||
|
>>> pprint(result)
|
||||||
|
[' 1. Beautiful is better than ugly.\n',
|
||||||
|
'- 2. Explicit is better than implicit.\n',
|
||||||
|
'- 3. Simple is better than complex.\n',
|
||||||
|
'+ 3. Simple is better than complex.\n',
|
||||||
|
'? ++ \n',
|
||||||
|
'- 4. Complex is better than complicated.\n',
|
||||||
|
'? ^ ---- ^ \n',
|
||||||
|
'+ 4. Complicated is better than complex.\n',
|
||||||
|
'? ++++ ^ ^ \n',
|
||||||
|
'+ 5. Flat is better than nested.\n']
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
As a single multi-line string it looks like this:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> import sys
|
||||||
|
>>> sys.stdout.writelines(result)
|
||||||
|
1. Beautiful is better than ugly.
|
||||||
|
- 2. Explicit is better than implicit.
|
||||||
|
- 3. Simple is better than complex.
|
||||||
|
+ 3. Simple is better than complex.
|
||||||
|
? ++
|
||||||
|
- 4. Complex is better than complicated.
|
||||||
|
? ^ ---- ^
|
||||||
|
+ 4. Complicated is better than complex.
|
||||||
|
? ++++ ^ ^
|
||||||
|
+ 5. Flat is better than nested.
|
||||||
|
\end{verbatim}
|
||||||
|
|