917 lines
39 KiB
TeX
917 lines
39 KiB
TeX
\section{\module{re} ---
|
|
Regular expression operations}
|
|
\declaremodule{standard}{re}
|
|
\moduleauthor{Fredrik Lundh}{effbot@telia.com}
|
|
\sectionauthor{Andrew M. Kuchling}{akuchlin@mems-exchange.org}
|
|
|
|
|
|
\modulesynopsis{Regular expression search and match operations with a
|
|
Perl-style expression syntax.}
|
|
|
|
|
|
This module provides regular expression matching operations similar to
|
|
those found in Perl. Regular expression pattern strings may not
|
|
contain null bytes, but can specify the null byte using the
|
|
\code{\e\var{number}} notation. Both patterns and strings to be
|
|
searched can be Unicode strings as well as 8-bit strings. The
|
|
\module{re} module is always available.
|
|
|
|
Regular expressions use the backslash character (\character{\e}) to
|
|
indicate special forms or to allow special characters to be used
|
|
without invoking their special meaning. This collides with Python's
|
|
usage of the same character for the same purpose in string literals;
|
|
for example, to match a literal backslash, one might have to write
|
|
\code{'\e\e\e\e'} as the pattern string, because the regular expression
|
|
must be \samp{\e\e}, and each backslash must be expressed as
|
|
\samp{\e\e} inside a regular Python string literal.
|
|
|
|
The solution is to use Python's raw string notation for regular
|
|
expression patterns; backslashes are not handled in any special way in
|
|
a string literal prefixed with \character{r}. So \code{r"\e n"} is a
|
|
two-character string containing \character{\e} and \character{n},
|
|
while \code{"\e n"} is a one-character string containing a newline.
|
|
Usually patterns will be expressed in Python code using this raw
|
|
string notation.
|
|
|
|
\begin{seealso}
|
|
\seetitle{Mastering Regular Expressions}{Book on regular expressions
|
|
by Jeffrey Friedl, published by O'Reilly. The Python
|
|
material in this book dates from before the \refmodule{re}
|
|
module, but it covers writing good regular expression
|
|
patterns in great detail.}
|
|
\end{seealso}
|
|
|
|
|
|
\subsection{Regular Expression Syntax \label{re-syntax}}
|
|
|
|
A regular expression (or RE) specifies a set of strings that matches
|
|
it; the functions in this module let you check if a particular string
|
|
matches a given regular expression (or if a given regular expression
|
|
matches a particular string, which comes down to the same thing).
|
|
|
|
Regular expressions can be concatenated to form new regular
|
|
expressions; if \emph{A} and \emph{B} are both regular expressions,
|
|
then \emph{AB} is also a regular expression. If a string \emph{p}
|
|
matches A and another string \emph{q} matches B, the string \emph{pq}
|
|
will match AB if \emph{A} and \emph{B} do no specify boundary
|
|
conditions that are no longer satisfied by \emph{pq}. Thus, complex
|
|
expressions can easily be constructed from simpler primitive
|
|
expressions like the ones described here. For details of the theory
|
|
and implementation of regular expressions, consult the Friedl book
|
|
referenced below, or almost any textbook about compiler construction.
|
|
|
|
A brief explanation of the format of regular expressions follows. For
|
|
further information and a gentler presentation, consult the Regular
|
|
Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.
|
|
|
|
Regular expressions can contain both special and ordinary characters.
|
|
Most ordinary characters, like \character{A}, \character{a}, or
|
|
\character{0}, are the simplest regular expressions; they simply match
|
|
themselves. You can concatenate ordinary characters, so \regexp{last}
|
|
matches the string \code{'last'}. (In the rest of this section, we'll
|
|
write RE's in \regexp{this special style}, usually without quotes, and
|
|
strings to be matched \code{'in single quotes'}.)
|
|
|
|
Some characters, like \character{|} or \character{(}, are special.
|
|
Special characters either stand for classes of ordinary characters, or
|
|
affect how the regular expressions around them are interpreted.
|
|
|
|
The special characters are:
|
|
|
|
\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
|
|
|
|
\item[\character{.}] (Dot.) In the default mode, this matches any
|
|
character except a newline. If the \constant{DOTALL} flag has been
|
|
specified, this matches any character including a newline.
|
|
|
|
\item[\character{\textasciicircum}] (Caret.) Matches the start of the
|
|
string, and in \constant{MULTILINE} mode also matches immediately
|
|
after each newline.
|
|
|
|
\item[\character{\$}] Matches the end of the string or just before the
|
|
newline at the end of the string, and in \constant{MULTILINE} mode
|
|
also matches before a newline. \regexp{foo} matches both 'foo' and
|
|
'foobar', while the regular expression \regexp{foo\$} matches only
|
|
'foo'. More interestingly, searching for \regexp{foo.\$} in
|
|
'foo1\textbackslash nfoo2\textbackslash n' matches 'foo2' normally,
|
|
but 'foo1' in \constant{MULTILINE} mode.
|
|
|
|
\item[\character{*}] Causes the resulting RE to
|
|
match 0 or more repetitions of the preceding RE, as many repetitions
|
|
as are possible. \regexp{ab*} will
|
|
match 'a', 'ab', or 'a' followed by any number of 'b's.
|
|
|
|
\item[\character{+}] Causes the
|
|
resulting RE to match 1 or more repetitions of the preceding RE.
|
|
\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
|
|
will not match just 'a'.
|
|
|
|
\item[\character{?}] Causes the resulting RE to
|
|
match 0 or 1 repetitions of the preceding RE. \regexp{ab?} will
|
|
match either 'a' or 'ab'.
|
|
|
|
\item[\code{*?}, \code{+?}, \code{??}] The \character{*},
|
|
\character{+}, and \character{?} qualifiers are all \dfn{greedy}; they
|
|
match as much text as possible. Sometimes this behaviour isn't
|
|
desired; if the RE \regexp{<.*>} is matched against
|
|
\code{'<H1>title</H1>'}, it will match the entire string, and not just
|
|
\code{'<H1>'}. Adding \character{?} after the qualifier makes it
|
|
perform the match in \dfn{non-greedy} or \dfn{minimal} fashion; as
|
|
\emph{few} characters as possible will be matched. Using \regexp{.*?}
|
|
in the previous expression will match only \code{'<H1>'}.
|
|
|
|
\item[\code{\{\var{m}\}}]
|
|
Specifies that exactly \var{m} copies of the previous RE should be
|
|
matched; fewer matches cause the entire RE not to match. For example,
|
|
\regexp{a\{6\}} will match exactly six \character{a} characters, but
|
|
not five.
|
|
|
|
\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
|
|
\var{m} to \var{n} repetitions of the preceding RE, attempting to
|
|
match as many repetitions as possible. For example, \regexp{a\{3,5\}}
|
|
will match from 3 to 5 \character{a} characters. Omitting \var{n}
|
|
specifies an infinite upper bound; you can't omit \var{m}. As an
|
|
example, \regexp{a\{4,\}b} will match \code{aaaab}, a thousand
|
|
\character{a} characters followed by a \code{b}, but not \code{aaab}.
|
|
The comma may not be omitted or the modifier would be confused with
|
|
the previously described form.
|
|
|
|
\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
|
|
match from \var{m} to \var{n} repetitions of the preceding RE,
|
|
attempting to match as \emph{few} repetitions as possible. This is
|
|
the non-greedy version of the previous qualifier. For example, on the
|
|
6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
|
|
\character{a} characters, while \regexp{a\{3,5\}?} will only match 3
|
|
characters.
|
|
|
|
\item[\character{\e}] Either escapes special characters (permitting
|
|
you to match characters like \character{*}, \character{?}, and so
|
|
forth), or signals a special sequence; special sequences are discussed
|
|
below.
|
|
|
|
If you're not using a raw string to
|
|
express the pattern, remember that Python also uses the
|
|
backslash as an escape sequence in string literals; if the escape
|
|
sequence isn't recognized by Python's parser, the backslash and
|
|
subsequent character are included in the resulting string. However,
|
|
if Python would recognize the resulting sequence, the backslash should
|
|
be repeated twice. This is complicated and hard to understand, so
|
|
it's highly recommended that you use raw strings for all but the
|
|
simplest expressions.
|
|
|
|
\item[\code{[]}] Used to indicate a set of characters. Characters can
|
|
be listed individually, or a range of characters can be indicated by
|
|
giving two characters and separating them by a \character{-}. Special
|
|
characters are not active inside sets. For example, \regexp{[akm\$]}
|
|
will match any of the characters \character{a}, \character{k},
|
|
\character{m}, or \character{\$}; \regexp{[a-z]}
|
|
will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
|
|
letter or digit. Character classes such as \code{\e w} or \code{\e S}
|
|
(defined below) are also acceptable inside a range. If you want to
|
|
include a \character{]} or a \character{-} inside a set, precede it with a
|
|
backslash, or place it as the first character. The
|
|
pattern \regexp{[]]} will match \code{']'}, for example.
|
|
|
|
You can match the characters not within a range by \dfn{complementing}
|
|
the set. This is indicated by including a
|
|
\character{\textasciicircum} as the first character of the set;
|
|
\character{\textasciicircum} elsewhere will simply match the
|
|
\character{\textasciicircum} character. For example,
|
|
\regexp{[{\textasciicircum}5]} will match
|
|
any character except \character{5}, and
|
|
\regexp{[\textasciicircum\code{\textasciicircum}]} will match any character
|
|
except \character{\textasciicircum}.
|
|
|
|
\item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
|
|
creates a regular expression that will match either A or B. An
|
|
arbitrary number of REs can be separated by the \character{|} in this
|
|
way. This can be used inside groups (see below) as well. REs
|
|
separated by \character{|} are tried from left to right, and the first
|
|
one that allows the complete pattern to match is considered the
|
|
accepted branch. This means that if \code{A} matches, \code{B} will
|
|
never be tested, even if it would produce a longer overall match. In
|
|
other words, the \character{|} operator is never greedy. To match a
|
|
literal \character{|}, use \regexp{\e|}, or enclose it inside a
|
|
character class, as in \regexp{[|]}.
|
|
|
|
\item[\code{(...)}] Matches whatever regular expression is inside the
|
|
parentheses, and indicates the start and end of a group; the contents
|
|
of a group can be retrieved after a match has been performed, and can
|
|
be matched later in the string with the \regexp{\e \var{number}} special
|
|
sequence, described below. To match the literals \character{(} or
|
|
\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
|
|
inside a character class: \regexp{[(] [)]}.
|
|
|
|
\item[\code{(?...)}] This is an extension notation (a \character{?}
|
|
following a \character{(} is not meaningful otherwise). The first
|
|
character after the \character{?}
|
|
determines what the meaning and further syntax of the construct is.
|
|
Extensions usually do not create a new group;
|
|
\regexp{(?P<\var{name}>...)} is the only exception to this rule.
|
|
Following are the currently supported extensions.
|
|
|
|
\item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
|
|
\character{L}, \character{m}, \character{s}, \character{u},
|
|
\character{x}.) The group matches the empty string; the letters set
|
|
the corresponding flags (\constant{re.I}, \constant{re.L},
|
|
\constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
|
|
for the entire regular expression. This is useful if you wish to
|
|
include the flags as part of the regular expression, instead of
|
|
passing a \var{flag} argument to the \function{compile()} function.
|
|
|
|
Note that the \regexp{(?x)} flag changes how the expression is parsed.
|
|
It should be used first in the expression string, or after one or more
|
|
whitespace characters. If there are non-whitespace characters before
|
|
the flag, the results are undefined.
|
|
|
|
\item[\code{(?:...)}] A non-grouping version of regular parentheses.
|
|
Matches whatever regular expression is inside the parentheses, but the
|
|
substring matched by the
|
|
group \emph{cannot} be retrieved after performing a match or
|
|
referenced later in the pattern.
|
|
|
|
\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
|
|
the substring matched by the group is accessible via the symbolic group
|
|
name \var{name}. Group names must be valid Python identifiers, and
|
|
each group name must be defined only once within a regular expression. A
|
|
symbolic group is also a numbered group, just as if the group were not
|
|
named. So the group named 'id' in the example above can also be
|
|
referenced as the numbered group 1.
|
|
|
|
For example, if the pattern is
|
|
\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
|
|
name in arguments to methods of match objects, such as
|
|
\code{m.group('id')} or \code{m.end('id')}, and also by name in
|
|
pattern text (for example, \regexp{(?P=id)}) and replacement text
|
|
(such as \code{\e g<id>}).
|
|
|
|
\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
|
|
earlier group named \var{name}.
|
|
|
|
\item[\code{(?\#...)}] A comment; the contents of the parentheses are
|
|
simply ignored.
|
|
|
|
\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
|
|
consume any of the string. This is called a lookahead assertion. For
|
|
example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
|
|
followed by \code{'Asimov'}.
|
|
|
|
\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
|
|
is a negative lookahead assertion. For example,
|
|
\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
|
|
followed by \code{'Asimov'}.
|
|
|
|
\item[\code{(?<=...)}] Matches if the current position in the string
|
|
is preceded by a match for \regexp{...} that ends at the current
|
|
position. This is called a \dfn{positive lookbehind assertion}.
|
|
\regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the
|
|
lookbehind will back up 3 characters and check if the contained
|
|
pattern matches. The contained pattern must only match strings of
|
|
some fixed length, meaning that \regexp{abc} or \regexp{a|b} are
|
|
allowed, but \regexp{a*} and \regexp{a\{3,4\}} are not. Note that
|
|
patterns which start with positive lookbehind assertions will never
|
|
match at the beginning of the string being searched; you will most
|
|
likely want to use the \function{search()} function rather than the
|
|
\function{match()} function:
|
|
|
|
\begin{verbatim}
|
|
>>> import re
|
|
>>> m = re.search('(?<=abc)def', 'abcdef')
|
|
>>> m.group(0)
|
|
'def'
|
|
\end{verbatim}
|
|
|
|
This example looks for a word following a hyphen:
|
|
|
|
\begin{verbatim}
|
|
>>> m = re.search('(?<=-)\w+', 'spam-egg')
|
|
>>> m.group(0)
|
|
'egg'
|
|
\end{verbatim}
|
|
|
|
\item[\code{(?<!...)}] Matches if the current position in the string
|
|
is not preceded by a match for \regexp{...}. This is called a
|
|
\dfn{negative lookbehind assertion}. Similar to positive lookbehind
|
|
assertions, the contained pattern must only match strings of some
|
|
fixed length. Patterns which start with negative lookbehind
|
|
assertions may match at the beginning of the string being searched.
|
|
|
|
\end{list}
|
|
|
|
The special sequences consist of \character{\e} and a character from the
|
|
list below. If the ordinary character is not on the list, then the
|
|
resulting RE will match the second character. For example,
|
|
\regexp{\e\$} matches the character \character{\$}.
|
|
|
|
\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
|
|
|
|
\item[\code{\e \var{number}}] Matches the contents of the group of the
|
|
same number. Groups are numbered starting from 1. For example,
|
|
\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
|
|
\code{'the end'} (note
|
|
the space after the group). This special sequence can only be used to
|
|
match one of the first 99 groups. If the first digit of \var{number}
|
|
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
|
|
as a group match, but as the character with octal value \var{number}.
|
|
(There is a group 0, which is the entire matched pattern, but it can't
|
|
be referenced with \regexp{\e 0}; instead, use \regexp{\e g<0>}.)
|
|
Inside the \character{[} and \character{]} of a character class, all numeric
|
|
escapes are treated as characters.
|
|
|
|
\item[\code{\e A}] Matches only at the start of the string.
|
|
|
|
\item[\code{\e b}] Matches the empty string, but only at the
|
|
beginning or end of a word. A word is defined as a sequence of
|
|
alphanumeric or underscore characters, so the end of a word is indicated by
|
|
whitespace or a non-alphanumeric, non-underscore character. Note that
|
|
{}\code{\e b} is defined as the boundary between \code{\e w} and \code{\e
|
|
W}, so the precise set of characters deemed to be alphanumeric depends on the
|
|
values of the \code{UNICODE} and \code{LOCALE} flags. Inside a character
|
|
range, \regexp{\e b} represents the backspace character, for compatibility
|
|
with Python's string literals.
|
|
|
|
\item[\code{\e B}] Matches the empty string, but only when it is \emph{not}
|
|
at the beginning or end of a word. This is just the opposite of {}\code{\e
|
|
b}, so is also subject to the settings of \code{LOCALE} and \code{UNICODE}.
|
|
|
|
\item[\code{\e d}]Matches any decimal digit; this is
|
|
equivalent to the set \regexp{[0-9]}.
|
|
|
|
\item[\code{\e D}]Matches any non-digit character; this is
|
|
equivalent to the set \regexp{[{\textasciicircum}0-9]}.
|
|
|
|
\item[\code{\e s}]Matches any whitespace character; this is
|
|
equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
|
|
|
|
\item[\code{\e S}]Matches any non-whitespace character; this is
|
|
equivalent to the set \regexp{[\textasciicircum\ \e t\e n\e r\e f\e v]}.
|
|
|
|
\item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
|
|
flags are not specified,
|
|
matches any alphanumeric character; this is equivalent to the set
|
|
\regexp{[a-zA-Z0-9_]}. With \constant{LOCALE}, it will match the set
|
|
\regexp{[0-9_]} plus whatever characters are defined as letters for
|
|
the current locale. If \constant{UNICODE} is set, this will match the
|
|
characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
|
|
in the Unicode character properties database.
|
|
|
|
\item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
|
|
flags are not specified, matches any non-alphanumeric character; this
|
|
is equivalent to the set \regexp{[{\textasciicircum}a-zA-Z0-9_]}. With
|
|
\constant{LOCALE}, it will match any character not in the set
|
|
\regexp{[0-9_]}, and not defined as a letter for the current locale.
|
|
If \constant{UNICODE} is set, this will match anything other than
|
|
\regexp{[0-9_]} and characters marked at alphanumeric in the Unicode
|
|
character properties database.
|
|
|
|
\item[\code{\e Z}]Matches only at the end of the string.
|
|
|
|
\end{list}
|
|
|
|
Most of the standard escapes supported by Python string literals are
|
|
also accepted by the regular expression parser:
|
|
|
|
\begin{verbatim}
|
|
\a \b \f \n
|
|
\r \t \v \x
|
|
\\
|
|
\end{verbatim}
|
|
|
|
Octal escapes are included in a limited form: If the first digit is a
|
|
0, or if there are three octal digits, it is considered an octal
|
|
escape. Otherwise, it is a group reference.
|
|
|
|
|
|
% Note the lack of a period in the section title; it causes problems
|
|
% with readers of the GNU info version. See http://www.python.org/sf/581414.
|
|
\subsection{Matching vs Searching \label{matching-searching}}
|
|
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
|
|
|
|
Python offers two different primitive operations based on regular
|
|
expressions: match and search. If you are accustomed to Perl's
|
|
semantics, the search operation is what you're looking for. See the
|
|
\function{search()} function and corresponding method of compiled
|
|
regular expression objects.
|
|
|
|
Note that match may differ from search using a regular expression
|
|
beginning with \character{\textasciicircum}:
|
|
\character{\textasciicircum} matches only at the
|
|
start of the string, or in \constant{MULTILINE} mode also immediately
|
|
following a newline. The ``match'' operation succeeds only if the
|
|
pattern matches at the start of the string regardless of mode, or at
|
|
the starting position given by the optional \var{pos} argument
|
|
regardless of whether a newline precedes it.
|
|
|
|
% Examples from Tim Peters:
|
|
\begin{verbatim}
|
|
re.compile("a").match("ba", 1) # succeeds
|
|
re.compile("^a").search("ba", 1) # fails; 'a' not at start
|
|
re.compile("^a").search("\na", 1) # fails; 'a' not at start
|
|
re.compile("^a", re.M).search("\na", 1) # succeeds
|
|
re.compile("^a", re.M).search("ba", 1) # fails; no preceding \n
|
|
\end{verbatim}
|
|
|
|
|
|
\subsection{Module Contents}
|
|
\nodename{Contents of Module re}
|
|
|
|
The module defines the following functions and constants, and an exception:
|
|
|
|
|
|
\begin{funcdesc}{compile}{pattern\optional{, flags}}
|
|
Compile a regular expression pattern into a regular expression
|
|
object, which can be used for matching using its \function{match()} and
|
|
\function{search()} methods, described below.
|
|
|
|
The expression's behaviour can be modified by specifying a
|
|
\var{flags} value. Values can be any of the following variables,
|
|
combined using bitwise OR (the \code{|} operator).
|
|
|
|
The sequence
|
|
|
|
\begin{verbatim}
|
|
prog = re.compile(pat)
|
|
result = prog.match(str)
|
|
\end{verbatim}
|
|
|
|
is equivalent to
|
|
|
|
\begin{verbatim}
|
|
result = re.match(pat, str)
|
|
\end{verbatim}
|
|
|
|
but the version using \function{compile()} is more efficient when the
|
|
expression will be used several times in a single program.
|
|
%(The compiled version of the last pattern passed to
|
|
%\function{re.match()} or \function{re.search()} is cached, so
|
|
%programs that use only a single regular expression at a time needn't
|
|
%worry about compiling regular expressions.)
|
|
\end{funcdesc}
|
|
|
|
\begin{datadesc}{I}
|
|
\dataline{IGNORECASE}
|
|
Perform case-insensitive matching; expressions like \regexp{[A-Z]}
|
|
will match lowercase letters, too. This is not affected by the
|
|
current locale.
|
|
\end{datadesc}
|
|
|
|
\begin{datadesc}{L}
|
|
\dataline{LOCALE}
|
|
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
|
|
\regexp{\e B} dependent on the current locale.
|
|
\end{datadesc}
|
|
|
|
\begin{datadesc}{M}
|
|
\dataline{MULTILINE}
|
|
When specified, the pattern character \character{\textasciicircum}
|
|
matches at the beginning of the string and at the beginning of each
|
|
line (immediately following each newline); and the pattern character
|
|
\character{\$} matches at the end of the string and at the end of each
|
|
line (immediately preceding each newline). By default,
|
|
\character{\textasciicircum} matches only at the beginning of the
|
|
string, and \character{\$} only at the end of the string and
|
|
immediately before the newline (if any) at the end of the string.
|
|
\end{datadesc}
|
|
|
|
\begin{datadesc}{S}
|
|
\dataline{DOTALL}
|
|
Make the \character{.} special character match any character at all,
|
|
including a newline; without this flag, \character{.} will match
|
|
anything \emph{except} a newline.
|
|
\end{datadesc}
|
|
|
|
\begin{datadesc}{U}
|
|
\dataline{UNICODE}
|
|
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
|
|
\regexp{\e B} dependent on the Unicode character properties database.
|
|
\versionadded{2.0}
|
|
\end{datadesc}
|
|
|
|
\begin{datadesc}{X}
|
|
\dataline{VERBOSE}
|
|
This flag allows you to write regular expressions that look nicer.
|
|
Whitespace within the pattern is ignored,
|
|
except when in a character class or preceded by an unescaped
|
|
backslash, and, when a line contains a \character{\#} neither in a
|
|
character class or preceded by an unescaped backslash, all characters
|
|
from the leftmost such \character{\#} through the end of the line are
|
|
ignored.
|
|
% XXX should add an example here
|
|
\end{datadesc}
|
|
|
|
|
|
\begin{funcdesc}{search}{pattern, string\optional{, flags}}
|
|
Scan through \var{string} looking for a location where the regular
|
|
expression \var{pattern} produces a match, and return a
|
|
corresponding \class{MatchObject} instance.
|
|
Return \code{None} if no
|
|
position in the string matches the pattern; note that this is
|
|
different from finding a zero-length match at some point in the string.
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{match}{pattern, string\optional{, flags}}
|
|
If zero or more characters at the beginning of \var{string} match
|
|
the regular expression \var{pattern}, return a corresponding
|
|
\class{MatchObject} instance. Return \code{None} if the string does not
|
|
match the pattern; note that this is different from a zero-length
|
|
match.
|
|
|
|
\note{If you want to locate a match anywhere in
|
|
\var{string}, use \method{search()} instead.}
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
|
|
Split \var{string} by the occurrences of \var{pattern}. If
|
|
capturing parentheses are used in \var{pattern}, then the text of all
|
|
groups in the pattern are also returned as part of the resulting list.
|
|
If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
|
|
occur, and the remainder of the string is returned as the final
|
|
element of the list. (Incompatibility note: in the original Python
|
|
1.5 release, \var{maxsplit} was ignored. This has been fixed in
|
|
later releases.)
|
|
|
|
\begin{verbatim}
|
|
>>> re.split('\W+', 'Words, words, words.')
|
|
['Words', 'words', 'words', '']
|
|
>>> re.split('(\W+)', 'Words, words, words.')
|
|
['Words', ', ', 'words', ', ', 'words', '.', '']
|
|
>>> re.split('\W+', 'Words, words, words.', 1)
|
|
['Words', 'words, words.']
|
|
\end{verbatim}
|
|
|
|
This function combines and extends the functionality of
|
|
the old \function{regsub.split()} and \function{regsub.splitx()}.
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{findall}{pattern, string}
|
|
Return a list of all non-overlapping matches of \var{pattern} in
|
|
\var{string}. If one or more groups are present in the pattern,
|
|
return a list of groups; this will be a list of tuples if the
|
|
pattern has more than one group. Empty matches are included in the
|
|
result.
|
|
\versionadded{1.5.2}
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{finditer}{pattern, string}
|
|
Return an iterator over all non-overlapping matches for the RE
|
|
\var{pattern} in \var{string}. For each match, the iterator returns
|
|
a match object. Empty matches are included in the result.
|
|
\versionadded{2.2}
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{sub}{pattern, repl, string\optional{, count}}
|
|
Return the string obtained by replacing the leftmost non-overlapping
|
|
occurrences of \var{pattern} in \var{string} by the replacement
|
|
\var{repl}. If the pattern isn't found, \var{string} is returned
|
|
unchanged. \var{repl} can be a string or a function; if it is a
|
|
string, any backslash escapes in it are processed. That is,
|
|
\samp{\e n} is converted to a single newline character, \samp{\e r}
|
|
is converted to a linefeed, and so forth. Unknown escapes such as
|
|
\samp{\e j} are left alone. Backreferences, such as \samp{\e6}, are
|
|
replaced with the substring matched by group 6 in the pattern. For
|
|
example:
|
|
|
|
\begin{verbatim}
|
|
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
|
|
... r'static PyObject*\npy_\1(void)\n{',
|
|
... 'def myfunc():')
|
|
'static PyObject*\npy_myfunc(void)\n{'
|
|
\end{verbatim}
|
|
|
|
If \var{repl} is a function, it is called for every non-overlapping
|
|
occurrence of \var{pattern}. The function takes a single match
|
|
object argument, and returns the replacement string. For example:
|
|
|
|
\begin{verbatim}
|
|
>>> def dashrepl(matchobj):
|
|
.... if matchobj.group(0) == '-': return ' '
|
|
.... else: return '-'
|
|
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
|
|
'pro--gram files'
|
|
\end{verbatim}
|
|
|
|
The pattern may be a string or an RE object; if you need to specify
|
|
regular expression flags, you must use a RE object, or use embedded
|
|
modifiers in a pattern; for example, \samp{sub("(?i)b+", "x", "bbbb
|
|
BBBB")} returns \code{'x x'}.
|
|
|
|
The optional argument \var{count} is the maximum number of pattern
|
|
occurrences to be replaced; \var{count} must be a non-negative
|
|
integer. If omitted or zero, all occurrences will be replaced.
|
|
Empty matches for the pattern are replaced only when not adjacent to
|
|
a previous match, so \samp{sub('x*', '-', 'abc')} returns
|
|
\code{'-a-b-c-'}.
|
|
|
|
In addition to character escapes and backreferences as described
|
|
above, \samp{\e g<name>} will use the substring matched by the group
|
|
named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
|
|
\samp{\e g<number>} uses the corresponding group number;
|
|
\samp{\e g<2>} is therefore equivalent to \samp{\e 2}, but isn't
|
|
ambiguous in a replacement such as \samp{\e g<2>0}. \samp{\e 20}
|
|
would be interpreted as a reference to group 20, not a reference to
|
|
group 2 followed by the literal character \character{0}. The
|
|
backreference \samp{\e g<0>} substitutes in the entire substring
|
|
matched by the RE.
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{subn}{pattern, repl, string\optional{, count}}
|
|
Perform the same operation as \function{sub()}, but return a tuple
|
|
\code{(\var{new_string}, \var{number_of_subs_made})}.
|
|
\end{funcdesc}
|
|
|
|
\begin{funcdesc}{escape}{string}
|
|
Return \var{string} with all non-alphanumerics backslashed; this is
|
|
useful if you want to match an arbitrary literal string that may have
|
|
regular expression metacharacters in it.
|
|
\end{funcdesc}
|
|
|
|
\begin{excdesc}{error}
|
|
Exception raised when a string passed to one of the functions here
|
|
is not a valid regular expression (for example, it might contain
|
|
unmatched parentheses) or when some other error occurs during
|
|
compilation or matching. It is never an error if a string contains
|
|
no match for a pattern.
|
|
\end{excdesc}
|
|
|
|
|
|
\subsection{Regular Expression Objects \label{re-objects}}
|
|
|
|
Compiled regular expression objects support the following methods and
|
|
attributes:
|
|
|
|
\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
|
|
endpos}}}
|
|
Scan through \var{string} looking for a location where this regular
|
|
expression produces a match, and return a
|
|
corresponding \class{MatchObject} instance. Return \code{None} if no
|
|
position in the string matches the pattern; note that this is
|
|
different from finding a zero-length match at some point in the string.
|
|
|
|
The optional \var{pos} and \var{endpos} parameters have the same
|
|
meaning as for the \method{match()} method.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
|
|
endpos}}}
|
|
If zero or more characters at the beginning of \var{string} match
|
|
this regular expression, return a corresponding
|
|
\class{MatchObject} instance. Return \code{None} if the string does not
|
|
match the pattern; note that this is different from a zero-length
|
|
match.
|
|
|
|
\note{If you want to locate a match anywhere in
|
|
\var{string}, use \method{search()} instead.}
|
|
|
|
The optional second parameter \var{pos} gives an index in the string
|
|
where the search is to start; it defaults to \code{0}. This is not
|
|
completely equivalent to slicing the string; the
|
|
\code{'\textasciicircum'} pattern
|
|
character matches at the real beginning of the string and at positions
|
|
just after a newline, but not necessarily at the index where the search
|
|
is to start.
|
|
|
|
The optional parameter \var{endpos} limits how far the string will
|
|
be searched; it will be as if the string is \var{endpos} characters
|
|
long, so only the characters from \var{pos} to \code{\var{endpos} -
|
|
1} will be searched for a match. If \var{endpos} is less than
|
|
\var{pos}, no match will be found, otherwise, if \var{rx} is a
|
|
compiled regular expression object,
|
|
\code{\var{rx}.match(\var{string}, 0, 50)} is equivalent to
|
|
\code{\var{rx}.match(\var{string}[:50], 0)}.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[RegexObject]{split}{string\optional{,
|
|
maxsplit\code{ = 0}}}
|
|
Identical to the \function{split()} function, using the compiled pattern.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[RegexObject]{findall}{string}
|
|
Identical to the \function{findall()} function, using the compiled pattern.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[RegexObject]{finditer}{string}
|
|
Identical to the \function{finditer()} function, using the compiled pattern.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
|
|
Identical to the \function{sub()} function, using the compiled pattern.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
|
|
count\code{ = 0}}}
|
|
Identical to the \function{subn()} function, using the compiled pattern.
|
|
\end{methoddesc}
|
|
|
|
|
|
\begin{memberdesc}[RegexObject]{flags}
|
|
The flags argument used when the RE object was compiled, or
|
|
\code{0} if no flags were provided.
|
|
\end{memberdesc}
|
|
|
|
\begin{memberdesc}[RegexObject]{groupindex}
|
|
A dictionary mapping any symbolic group names defined by
|
|
\regexp{(?P<\var{id}>)} to group numbers. The dictionary is empty if no
|
|
symbolic groups were used in the pattern.
|
|
\end{memberdesc}
|
|
|
|
\begin{memberdesc}[RegexObject]{pattern}
|
|
The pattern string from which the RE object was compiled.
|
|
\end{memberdesc}
|
|
|
|
|
|
\subsection{Match Objects \label{match-objects}}
|
|
|
|
\class{MatchObject} instances support the following methods and
|
|
attributes:
|
|
|
|
\begin{methoddesc}[MatchObject]{expand}{template}
|
|
Return the string obtained by doing backslash substitution on the
|
|
template string \var{template}, as done by the \method{sub()} method.
|
|
Escapes such as \samp{\e n} are converted to the appropriate
|
|
characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and
|
|
named backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced
|
|
by the contents of the corresponding group.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
|
|
Returns one or more subgroups of the match. If there is a single
|
|
argument, the result is a single string; if there are
|
|
multiple arguments, the result is a tuple with one item per argument.
|
|
Without arguments, \var{group1} defaults to zero (the whole match
|
|
is returned).
|
|
If a \var{groupN} argument is zero, the corresponding return value is the
|
|
entire matching string; if it is in the inclusive range [1..99], it is
|
|
the string matching the the corresponding parenthesized group. If a
|
|
group number is negative or larger than the number of groups defined
|
|
in the pattern, an \exception{IndexError} exception is raised.
|
|
If a group is contained in a part of the pattern that did not match,
|
|
the corresponding result is \code{None}. If a group is contained in a
|
|
part of the pattern that matched multiple times, the last match is
|
|
returned.
|
|
|
|
If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
|
|
the \var{groupN} arguments may also be strings identifying groups by
|
|
their group name. If a string argument is not used as a group name in
|
|
the pattern, an \exception{IndexError} exception is raised.
|
|
|
|
A moderately complicated example:
|
|
|
|
\begin{verbatim}
|
|
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
|
|
\end{verbatim}
|
|
|
|
After performing this match, \code{m.group(1)} is \code{'3'}, as is
|
|
\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[MatchObject]{groups}{\optional{default}}
|
|
Return a tuple containing all the subgroups of the match, from 1 up to
|
|
however many groups are in the pattern. The \var{default} argument is
|
|
used for groups that did not participate in the match; it defaults to
|
|
\code{None}. (Incompatibility note: in the original Python 1.5
|
|
release, if the tuple was one element long, a string would be returned
|
|
instead. In later versions (from 1.5.1 on), a singleton tuple is
|
|
returned in such cases.)
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
|
|
Return a dictionary containing all the \emph{named} subgroups of the
|
|
match, keyed by the subgroup name. The \var{default} argument is
|
|
used for groups that did not participate in the match; it defaults to
|
|
\code{None}.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[MatchObject]{start}{\optional{group}}
|
|
\funcline{end}{\optional{group}}
|
|
Return the indices of the start and end of the substring
|
|
matched by \var{group}; \var{group} defaults to zero (meaning the whole
|
|
matched substring).
|
|
Return \code{-1} if \var{group} exists but
|
|
did not contribute to the match. For a match object
|
|
\var{m}, and a group \var{g} that did contribute to the match, the
|
|
substring matched by group \var{g} (equivalent to
|
|
\code{\var{m}.group(\var{g})}) is
|
|
|
|
\begin{verbatim}
|
|
m.string[m.start(g):m.end(g)]
|
|
\end{verbatim}
|
|
|
|
Note that
|
|
\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
|
|
\var{group} matched a null string. For example, after \code{\var{m} =
|
|
re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
|
|
\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
|
|
\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
|
|
an \exception{IndexError} exception.
|
|
\end{methoddesc}
|
|
|
|
\begin{methoddesc}[MatchObject]{span}{\optional{group}}
|
|
For \class{MatchObject} \var{m}, return the 2-tuple
|
|
\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
|
|
Note that if \var{group} did not contribute to the match, this is
|
|
\code{(-1, -1)}. Again, \var{group} defaults to zero.
|
|
\end{methoddesc}
|
|
|
|
\begin{memberdesc}[MatchObject]{pos}
|
|
The value of \var{pos} which was passed to the
|
|
\function{search()} or \function{match()} function. This is the index
|
|
into the string at which the RE engine started looking for a match.
|
|
\end{memberdesc}
|
|
|
|
\begin{memberdesc}[MatchObject]{endpos}
|
|
The value of \var{endpos} which was passed to the
|
|
\function{search()} or \function{match()} function. This is the index
|
|
into the string beyond which the RE engine will not go.
|
|
\end{memberdesc}
|
|
|
|
\begin{memberdesc}[MatchObject]{lastgroup}
|
|
The name of the last matched capturing group, or \code{None} if the
|
|
group didn't have a name, or if no group was matched at all.
|
|
\end{memberdesc}
|
|
|
|
\begin{memberdesc}[MatchObject]{lastindex}
|
|
The integer index of the last matched capturing group, or \code{None}
|
|
if no group was matched at all.
|
|
\end{memberdesc}
|
|
|
|
\begin{memberdesc}[MatchObject]{re}
|
|
The regular expression object whose \method{match()} or
|
|
\method{search()} method produced this \class{MatchObject} instance.
|
|
\end{memberdesc}
|
|
|
|
\begin{memberdesc}[MatchObject]{string}
|
|
The string passed to \function{match()} or \function{search()}.
|
|
\end{memberdesc}
|
|
|
|
\subsection{Examples}
|
|
|
|
\leftline{\strong{Simulating \cfunction{scanf()}}}
|
|
|
|
Python does not currently have an equivalent to \cfunction{scanf()}.
|
|
\ttindex{scanf()}
|
|
Regular expressions are generally more powerful, though also more
|
|
verbose, than \cfunction{scanf()} format strings. The table below
|
|
offers some more-or-less equivalent mappings between
|
|
\cfunction{scanf()} format tokens and regular expressions.
|
|
|
|
\begin{tableii}{l|l}{textrm}{\cfunction{scanf()} Token}{Regular Expression}
|
|
\lineii{\code{\%c}}
|
|
{\regexp{.}}
|
|
\lineii{\code{\%5c}}
|
|
{\regexp{.\{5\}}}
|
|
\lineii{\code{\%d}}
|
|
{\regexp{[-+]\e d+}}
|
|
\lineii{\code{\%e}, \code{\%E}, \code{\%f}, \code{\%g}}
|
|
{\regexp{[-+](\e d+(\e.\e d*)?|\e d*\e.\e d+)([eE]\e d+)?}}
|
|
\lineii{\code{\%i}}
|
|
{\regexp{[-+](0[xX][\e dA-Fa-f]+|0[0-7]*|\e d+)}}
|
|
\lineii{\code{\%o}}
|
|
{\regexp{0[0-7]*}}
|
|
\lineii{\code{\%s}}
|
|
{\regexp{\e S+}}
|
|
\lineii{\code{\%u}}
|
|
{\regexp{\e d+}}
|
|
\lineii{\code{\%x}, \code{\%X}}
|
|
{\regexp{0[xX][\e dA-Fa-f]+}}
|
|
\end{tableii}
|
|
|
|
To extract the filename and numbers from a string like
|
|
|
|
\begin{verbatim}
|
|
/usr/sbin/sendmail - 0 errors, 4 warnings
|
|
\end{verbatim}
|
|
|
|
you would use a \cfunction{scanf()} format like
|
|
|
|
\begin{verbatim}
|
|
%s - %d errors, %d warnings
|
|
\end{verbatim}
|
|
|
|
The equivalent regular expression would be
|
|
|
|
\begin{verbatim}
|
|
(\S+) - (\d+) errors, (\d+) warnings
|
|
\end{verbatim}
|
|
|
|
\leftline{\strong{Avoiding backtracking}}
|
|
|
|
If you create regular expressions that require the engine to perform a lot
|
|
of backtracking, you may encounter a RuntimeError exception with the message
|
|
\code{maximum recursion limit exceeded}. For example,
|
|
|
|
\begin{verbatim}
|
|
>>> s = "<" + "that's a very big string!"*1000 + ">"
|
|
>>> re.match('<.*?>', s)
|
|
Traceback (most recent call last):
|
|
File "<stdin>", line 1, in ?
|
|
File "/usr/local/lib/python2.3/sre.py", line 132, in match
|
|
return _compile(pattern, flags).match(string)
|
|
RuntimeError: maximum recursion limit exceeded
|
|
\end{verbatim}
|
|
|
|
You can often restructure your regular expression to avoid backtracking.
|
|
The above regular expression can be recast as
|
|
\regexp{\textless[\textasciicircum \textgreater]*\textgreater}. As a
|
|
further benefit, such regular expressions will run faster than their
|
|
backtracking equivalents.
|