More edits

Andrew M. Kuchling 2007-01-29 21:28:48 +00:00
parent 5781dd2d7c
commit 15c1fe5047
1 changed file with 27 additions and 27 deletions

@@ -927,15 +927,15 @@ Now that we've looked at the general extension syntax, we can return
 to the features that simplify working with groups in complex REs.
 Since groups are numbered from left to right and a complex expression
 may use many groups, it can become difficult to keep track of the
-correct numbering, and modifying such a complex RE is annoying.
-Insert a new group near the beginning, and you change the numbers of
+correct numbering. Modifying such a complex RE is annoying, too:
+insert a new group near the beginning and you change the numbers of
 everything that follows it.
 
-First, sometimes you'll want to use a group to collect a part of a
-regular expression, but aren't interested in retrieving the group's
-contents. You can make this fact explicit by using a non-capturing
-group: \regexp{(?:...)}, where you can put any other regular
-expression inside the parentheses.
+Sometimes you'll want to use a group to collect a part of a regular
+expression, but aren't interested in retrieving the group's contents.
+You can make this fact explicit by using a non-capturing group:
+\regexp{(?:...)}, where you can replace the \regexp{...}
+with any other regular expression.
 
 \begin{verbatim}
 >>> m = re.match("([abc])+", "abc")
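The hunk stops at the first line of the example. As a minimal sketch of the contrast the paragraph describes (the names `capturing` and `non_capturing` are illustrative and not taken from the source), the same input can be matched with both group forms and the stored groups compared:

```python
import re

# Capturing group: each repetition overwrites group 1, so the last
# character matched by ([abc]) is what remains stored.
capturing = re.match(r"([abc])+", "abc")
print(capturing.groups())

# Non-capturing group: the overall match is the same, but no group
# contents are recorded.
non_capturing = re.match(r"(?:[abc])+", "abc")
print(non_capturing.groups())
```

Both calls match the whole string `abc`; only the contents of `groups()` differ.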
@@ -951,23 +951,23 @@ group matched, a non-capturing group behaves exactly the same as a
 capturing group; you can put anything inside it, repeat it with a
 repetition metacharacter such as \samp{*}, and nest it within other
 groups (capturing or non-capturing). \regexp{(?:...)} is particularly
-useful when modifying an existing group, since you can add new groups
+useful when modifying an existing pattern, since you can add new groups
 without changing how all the other groups are numbered. It should be
 mentioned that there's no performance difference in searching between
 capturing and non-capturing groups; neither form is any faster than
 the other.
 
-The second, and more significant, feature is named groups; instead of
+A more significant feature is named groups: instead of
 referring to them by numbers, groups can be referenced by a name.
 
 The syntax for a named group is one of the Python-specific extensions:
 \regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
-the group. Except for associating a name with a group, named groups
-also behave identically to capturing groups. The \class{MatchObject}
-methods that deal with capturing groups all accept either integers, to
-refer to groups by number, or a string containing the group name.
-Named groups are still given numbers, so you can retrieve information
-about a group in two ways:
+the group. Named groups also behave exactly like capturing groups,
+and additionally associate a name with a group. The
+\class{MatchObject} methods that deal with capturing groups all accept
+either integers that refer to the group by number or strings that
+contain the desired group's name. Named groups are still given
+numbers, so you can retrieve information about a group in two ways:
 
 \begin{verbatim}
 >>> p = re.compile(r'(?P<word>\b\w+\b)')
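The rest of this example falls outside the hunk. The following sketch shows the "two ways" the text mentions, using the pattern from the line above; the sample string and the names `by_name` and `by_number` are assumptions chosen for illustration:

```python
import re

# Pattern from the hunk: one named group called "word".
p = re.compile(r'(?P<word>\b\w+\b)')

# Assumed sample input; search() finds the first run of word characters.
m = p.search('(((( Lots of punctuation )))')

by_name = m.group('word')   # look the group up by its name
by_number = m.group(1)      # the same group is also numbered 1
print(by_name, by_number)
```

Because a named group still receives a number, both lookups return the same text.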
@@ -994,11 +994,11 @@ InternalDate = re.compile(r'INTERNALDATE "'
 It's obviously much easier to retrieve \code{m.group('zonem')},
 instead of having to remember to retrieve group 9.
 
-Since the syntax for backreferences, in an expression like
-\regexp{(...)\e 1}, refers to the number of the group there's
+The syntax for backreferences in an expression such as
+\regexp{(...)\e 1} refers to the number of the group. There's
 naturally a variant that uses the group name instead of the number.
-This is also a Python extension: \regexp{(?P=\var{name})} indicates
-that the contents of the group called \var{name} should again be found
+This is another Python extension: \regexp{(?P=\var{name})} indicates
+that the contents of the group called \var{name} should again be matched
 at the current point. The regular expression for finding doubled
 words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
 \regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
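The example that follows this sentence is not part of the hunk. As a hedged sketch, the two spellings of the doubled-word pattern can be compared side by side; the sample sentence and the names `numbered`, `named`, and `text` are assumptions, not taken from the source:

```python
import re

# Numbered backreference: \1 must repeat whatever group 1 matched.
numbered = re.compile(r'(\b\w+)\s+\1')

# Named backreference: (?P=word) must repeat whatever "word" matched.
named = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')

text = 'Paris in the the spring'   # assumed sample input
print(numbered.search(text).group())
print(named.search(text).group())
```

Both patterns find the same doubled word; the named form simply avoids counting groups.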
@@ -1028,11 +1028,11 @@ opposite of the positive assertion; it succeeds if the contained expression
 \emph{doesn't} match at the current position in the string.
 
 \end{itemize}
 
-An example will help make this concrete by demonstrating a case
-where a lookahead is useful. Consider a simple pattern to match a
-filename and split it apart into a base name and an extension,
-separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news}
-is the base name, and \samp{rc} is the filename's extension.
+To make this concrete, let's look at a case where a lookahead is
+useful. Consider a simple pattern to match a filename and split it
+apart into a base name and an extension, separated by a \samp{.}. For
+example, in \samp{news.rc}, \samp{news} is the base name, and
+\samp{rc} is the filename's extension.
 
 The pattern to match this is quite simple:
@@ -1079,12 +1079,12 @@ read and understand. Worse, if the problem changes and you want to
 exclude both \samp{bat} and \samp{exe} as extensions, the pattern
 would get even more complicated and confusing.
 
-A negative lookahead cuts through all this:
+A negative lookahead cuts through all this confusion:
 
 \regexp{.*[.](?!bat\$).*\$}
 % $
 
-The lookahead means: if the expression \regexp{bat} doesn't match at
+The negative lookahead means: if the expression \regexp{bat} doesn't match at
 this point, try the rest of the pattern; if \regexp{bat\$} does match,
 the whole pattern will fail. The trailing \regexp{\$} is required to
 ensure that something like \samp{sample.batch}, where the extension
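The hunk ends mid-sentence, but the behaviour it describes can be checked directly. This is a minimal sketch: `not_bat` is the pattern quoted above, while `not_bat_or_exe` is an assumed alternation-based variant for the two-extension case and is not copied from the source. The file names are made up for illustration, and each contains a single dot.

```python
import re

# Pattern quoted in the hunk: reject names whose extension is exactly "bat".
not_bat = re.compile(r'.*[.](?!bat$).*$')

# Assumed variant (not from the source): reject "bat" and "exe" alike.
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

# Hypothetical file names, each with exactly one dot.
names = ['news.rc', 'autoexec.bat', 'sample.batch', 'setup.exe']

accepted_1 = [n for n in names if not_bat.match(n)]
accepted_2 = [n for n in names if not_bat_or_exe.match(n)]
print(accepted_1)
print(accepted_2)
```

The name `sample.batch` is accepted by both patterns because the `$` inside the lookahead makes it fail on anything longer than `bat`.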
@@ -1101,7 +1101,7 @@ filenames that end in either \samp{bat} or \samp{exe}:
 \section{Modifying Strings}
 
 Up to this point, we've simply performed searches against a static
-string. Regular expressions are also commonly used to modify a string
+string. Regular expressions are also commonly used to modify strings
 in various ways, using the following \class{RegexObject} methods:
 
 \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}