More edits
parent 5781dd2d7c
commit 15c1fe5047
|
@ -927,15 +927,15 @@ Now that we've looked at the general extension syntax, we can return
|
||||||
to the features that simplify working with groups in complex REs.
|
to the features that simplify working with groups in complex REs.
|
||||||
Since groups are numbered from left to right and a complex expression
|
Since groups are numbered from left to right and a complex expression
|
||||||
may use many groups, it can become difficult to keep track of the
|
may use many groups, it can become difficult to keep track of the
|
||||||
correct numbering, and modifying such a complex RE is annoying.
|
correct numbering. Modifying such a complex RE is annoying, too:
|
||||||
Insert a new group near the beginning, and you change the numbers of
|
insert a new group near the beginning and you change the numbers of
|
||||||
everything that follows it.
|
everything that follows it.
|
||||||
|
|
||||||
First, sometimes you'll want to use a group to collect a part of a
|
Sometimes you'll want to use a group to collect a part of a regular
|
||||||
regular expression, but aren't interested in retrieving the group's
|
expression, but aren't interested in retrieving the group's contents.
|
||||||
contents. You can make this fact explicit by using a non-capturing
|
You can make this fact explicit by using a non-capturing group:
|
||||||
group: \regexp{(?:...)}, where you can put any other regular
|
\regexp{(?:...)}, where you can replace the \regexp{...}
|
||||||
expression inside the parentheses.
|
with any other regular expression.
|
||||||
|
|
||||||
\begin{verbatim}
|
\begin{verbatim}
|
||||||
>>> m = re.match("([abc])+", "abc")
|
>>> m = re.match("([abc])+", "abc")
|
||||||
|
@ -951,23 +951,23 @@ group matched, a non-capturing group behaves exactly the same as a
|
||||||
capturing group; you can put anything inside it, repeat it with a
|
capturing group; you can put anything inside it, repeat it with a
|
||||||
repetition metacharacter such as \samp{*}, and nest it within other
|
repetition metacharacter such as \samp{*}, and nest it within other
|
||||||
groups (capturing or non-capturing). \regexp{(?:...)} is particularly
|
groups (capturing or non-capturing). \regexp{(?:...)} is particularly
|
||||||
useful when modifying an existing group, since you can add new groups
|
useful when modifying an existing pattern, since you can add new groups
|
||||||
without changing how all the other groups are numbered. It should be
|
without changing how all the other groups are numbered. It should be
|
||||||
mentioned that there's no performance difference in searching between
|
mentioned that there's no performance difference in searching between
|
||||||
capturing and non-capturing groups; neither form is any faster than
|
capturing and non-capturing groups; neither form is any faster than
|
||||||
the other.
|
the other.
|
||||||
|
|
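The renumbering problem and the non-capturing fix described above can be illustrated with a minimal sketch. The key/value pattern and the optional `set` prefix are hypothetical, chosen only for illustration; they are not part of the original text.

```python
import re

# Hypothetical starting pattern: a key and a value separated by '='.
original = re.compile(r"(\w+)=(\w+)")
m_orig = original.match("color=red")
# Here the key is group 1 and the value is group 2.

# Adding a *capturing* group at the front shifts every later group:
# the key is now group 2 and the value is group 3.
capturing = re.compile(r"(set\s+)?(\w+)=(\w+)")
m_cap = capturing.match("set color=red")

# Adding the same prefix as a *non-capturing* group leaves the
# numbering alone: the key is still group 1, the value group 2.
noncapturing = re.compile(r"(?:set\s+)?(\w+)=(\w+)")
m_non = noncapturing.match("set color=red")

print(m_orig.groups())  # ('color', 'red')
print(m_cap.groups())   # ('set ', 'color', 'red')
print(m_non.groups())   # ('color', 'red')
```

Code that already reads `group(1)` and `group(2)` keeps working with the non-capturing version, which is the situation the paragraph above has in mind.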
||||||
The second, and more significant, feature is named groups; instead of
|
A more significant feature is named groups: instead of
|
||||||
referring to them by numbers, groups can be referenced by a name.
|
referring to them by numbers, groups can be referenced by a name.
|
||||||
|
|
||||||
The syntax for a named group is one of the Python-specific extensions:
|
The syntax for a named group is one of the Python-specific extensions:
|
||||||
\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
|
\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
|
||||||
the group. Except for associating a name with a group, named groups
|
the group. Named groups also behave exactly like capturing groups,
|
||||||
also behave identically to capturing groups. The \class{MatchObject}
|
and additionally associate a name with a group. The
|
||||||
methods that deal with capturing groups all accept either integers, to
|
\class{MatchObject} methods that deal with capturing groups all accept
|
||||||
refer to groups by number, or a string containing the group name.
|
either integers that refer to the group by number or strings that
|
||||||
Named groups are still given numbers, so you can retrieve information
|
contain the desired group's name. Named groups are still given
|
||||||
about a group in two ways:
|
numbers, so you can retrieve information about a group in two ways:
|
||||||
|
|
||||||
\begin{verbatim}
|
\begin{verbatim}
|
||||||
>>> p = re.compile(r'(?P<word>\b\w+\b)')
|
>>> p = re.compile(r'(?P<word>\b\w+\b)')
|
||||||
|
@ -994,11 +994,11 @@ InternalDate = re.compile(r'INTERNALDATE "'
|
||||||
It's obviously much easier to retrieve \code{m.group('zonem')},
|
It's obviously much easier to retrieve \code{m.group('zonem')},
|
||||||
instead of having to remember to retrieve group 9.
|
instead of having to remember to retrieve group 9.
|
||||||
|
|
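As a rough sketch of retrieving a group by name or by number, the following uses a much simplified, hypothetical timezone pattern. It borrows the group names `zonen`, `zoneh` and `zonem` for illustration only and is not the full `InternalDate` expression.

```python
import re

# Simplified, hypothetical pattern: a sign, two hour digits, two minute digits.
zone = re.compile(r"(?P<zonen>[-+])(?P<zoneh>\d\d)(?P<zonem>\d\d)")
m = zone.match("+0530")

# Named groups are still numbered, so both lookups return the same text.
by_name = m.group("zonem")
by_number = m.group(3)

# groupdict() returns every named group as a dictionary.
fields = m.groupdict()

print(by_name, by_number)  # 30 30
print(fields)              # {'zonen': '+', 'zoneh': '05', 'zonem': '30'}
```

With only three groups the numbers are easy to remember; the benefit of names grows with the number of groups in the pattern.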
||||||
Since the syntax for backreferences, in an expression like
|
The syntax for backreferences in an expression such as
|
||||||
\regexp{(...)\e 1}, refers to the number of the group there's
|
\regexp{(...)\e 1} refers to the number of the group. There's
|
||||||
naturally a variant that uses the group name instead of the number.
|
naturally a variant that uses the group name instead of the number.
|
||||||
This is also a Python extension: \regexp{(?P=\var{name})} indicates
|
This is another Python extension: \regexp{(?P=\var{name})} indicates
|
||||||
that the contents of the group called \var{name} should again be found
|
that the contents of the group called \var{name} should again be matched
|
||||||
at the current point. The regular expression for finding doubled
|
at the current point. The regular expression for finding doubled
|
||||||
words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
|
words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
|
||||||
\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
|
\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
|
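A minimal sketch, assuming a made-up sample sentence, suggests that the numbered and named backreference forms find the same doubled word:

```python
import re

# Numbered backreference: \1 must repeat whatever group 1 matched.
numbered = re.compile(r"(\b\w+)\s+\1")

# Named backreference: (?P=word) must repeat whatever 'word' matched.
named = re.compile(r"(?P<word>\b\w+)\s+(?P=word)")

text = "Paris in the the spring"  # sample text, chosen for illustration

found_numbered = numbered.search(text).group()
found_named = named.search(text).group()

print(found_numbered)  # the the
print(found_named)     # the the
```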
||||||
|
@ -1028,11 +1028,11 @@ opposite of the positive assertion; it succeeds if the contained expression
|
||||||
\emph{doesn't} match at the current position in the string.
|
\emph{doesn't} match at the current position in the string.
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
An example will help make this concrete by demonstrating a case
|
To make this concrete, let's look at a case where a lookahead is
|
||||||
where a lookahead is useful. Consider a simple pattern to match a
|
useful. Consider a simple pattern to match a filename and split it
|
||||||
filename and split it apart into a base name and an extension,
|
apart into a base name and an extension, separated by a \samp{.}. For
|
||||||
separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news}
|
example, in \samp{news.rc}, \samp{news} is the base name, and
|
||||||
is the base name, and \samp{rc} is the filename's extension.
|
\samp{rc} is the filename's extension.
|
||||||
|
|
||||||
The pattern to match this is quite simple:
|
The pattern to match this is quite simple:
|
||||||
|
|
||||||
|
@ -1079,12 +1079,12 @@ read and understand. Worse, if the problem changes and you want to
|
||||||
exclude both \samp{bat} and \samp{exe} as extensions, the pattern
|
exclude both \samp{bat} and \samp{exe} as extensions, the pattern
|
||||||
would get even more complicated and confusing.
|
would get even more complicated and confusing.
|
||||||
|
|
||||||
A negative lookahead cuts through all this:
|
A negative lookahead cuts through all this confusion:
|
||||||
|
|
||||||
\regexp{.*[.](?!bat\$).*\$}
|
\regexp{.*[.](?!bat\$).*\$}
|
||||||
% $
|
% $
|
||||||
|
|
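The negative-lookahead pattern above can be tried out with a short sketch. The helper name `is_allowed` and the sample filenames are assumptions made for illustration, and each sample contains only a single dot.

```python
import re

# The pattern from the text, written as a Python raw string.
not_bat = re.compile(r".*[.](?!bat$).*$")

def is_allowed(filename):
    """Return True if the pattern matches, i.e. the extension is not exactly 'bat'."""
    return not_bat.match(filename) is not None

print(is_allowed("news.rc"))       # True
print(is_allowed("autoexec.bat"))  # False: 'bat' followed by end of string
print(is_allowed("sample.batch"))  # True: 'bat' is only the start of the extension
```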
||||||
The lookahead means: if the expression \regexp{bat} doesn't match at
|
The negative lookahead means: if the expression \regexp{bat} doesn't match at
|
||||||
this point, try the rest of the pattern; if \regexp{bat\$} does match,
|
this point, try the rest of the pattern; if \regexp{bat\$} does match,
|
||||||
the whole pattern will fail. The trailing \regexp{\$} is required to
|
the whole pattern will fail. The trailing \regexp{\$} is required to
|
||||||
ensure that something like \samp{sample.batch}, where the extension
|
ensure that something like \samp{sample.batch}, where the extension
|
||||||
|
@ -1101,7 +1101,7 @@ filenames that end in either \samp{bat} or \samp{exe}:
|
||||||
\section{Modifying Strings}
|
\section{Modifying Strings}
|
||||||
|
|
||||||
Up to this point, we've simply performed searches against a static
|
Up to this point, we've simply performed searches against a static
|
||||||
string. Regular expressions are also commonly used to modify a string
|
string. Regular expressions are also commonly used to modify strings
|
||||||
in various ways, using the following \class{RegexObject} methods:
|
in various ways, using the following \class{RegexObject} methods:
|
||||||
|
|
||||||
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
|
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
|
||||||
|
|