|
|
|
\documentclass{howto}
|
|
|
|
|
|
|
|
% TODO:
|
|
|
|
% Document lookbehind assertions
|
|
|
|
% Better way of displaying a RE, a string, and what it matches
|
|
|
|
% Mention optional argument to match.groups()
|
|
|
|
% Unicode (at least a reference)
|
|
|
|
|
|
|
|
\title{Regular Expression HOWTO}
|
|
|
|
|
|
|
|
\release{0.05}
|
|
|
|
|
|
|
|
\author{A.M. Kuchling}
|
|
|
|
\authoraddress{\email{amk@amk.ca}}
|
|
|
|
|
|
|
|
\begin{document}
|
|
|
|
\maketitle
|
|
|
|
|
|
|
|
\begin{abstract}
|
|
|
|
\noindent
|
|
|
|
This document is an introductory tutorial to using regular expressions
|
|
|
|
in Python with the \module{re} module. It provides a gentler
|
|
|
|
introduction than the corresponding section in the Library Reference.
|
|
|
|
|
|
|
|
This document is available from
|
|
|
|
\url{http://www.amk.ca/python/howto}.
|
|
|
|
|
|
|
|
\end{abstract}
|
|
|
|
|
|
|
|
\tableofcontents
|
|
|
|
|
|
|
|
\section{Introduction}
|
|
|
|
|
|
|
|
The \module{re} module was added in Python 1.5, and provides
|
|
|
|
Perl-style regular expression patterns. Earlier versions of Python
|
|
|
|
came with the \module{regex} module, which provided Emacs-style
|
|
|
|
patterns.  The \module{regex} module was removed in Python 2.5.
|
|
|
|
|
|
|
|
Regular expressions (or REs) are essentially a tiny, highly
|
|
|
|
specialized programming language embedded inside Python and made
|
|
|
|
available through the \module{re} module. Using this little language,
|
|
|
|
you specify the rules for the set of possible strings that you want to
|
|
|
|
match; this set might contain English sentences, or e-mail addresses,
|
|
|
|
or TeX commands, or anything you like. You can then ask questions
|
|
|
|
such as ``Does this string match the pattern?'', or ``Is there a match
|
|
|
|
for the pattern anywhere in this string?''. You can also use REs to
|
|
|
|
modify a string or to split it apart in various ways.
|
|
|
|
|
|
|
|
Regular expression patterns are compiled into a series of bytecodes
|
|
|
|
which are then executed by a matching engine written in C. For
|
|
|
|
advanced use, it may be necessary to pay careful attention to how the
|
|
|
|
engine will execute a given RE, and write the RE in a certain way in
|
|
|
|
order to produce bytecode that runs faster. Optimization isn't
|
|
|
|
covered in this document, because it requires that you have a good
|
|
|
|
understanding of the matching engine's internals.
|
|
|
|
|
|
|
|
The regular expression language is relatively small and restricted, so
|
|
|
|
not all possible string processing tasks can be done using regular
|
|
|
|
expressions. There are also tasks that \emph{can} be done with
|
|
|
|
regular expressions, but the expressions turn out to be very
|
|
|
|
complicated. In these cases, you may be better off writing Python
|
|
|
|
code to do the processing; while Python code will be slower than an
|
|
|
|
elaborate regular expression, it will also probably be more understandable.
|
|
|
|
|
|
|
|
\section{Simple Patterns}
|
|
|
|
|
|
|
|
We'll start by learning about the simplest possible regular
|
|
|
|
expressions. Since regular expressions are used to operate on
|
|
|
|
strings, we'll begin with the most common task: matching characters.
|
|
|
|
|
|
|
|
For a detailed explanation of the computer science underlying regular
|
|
|
|
expressions (deterministic and non-deterministic finite automata), you
|
|
|
|
can refer to almost any textbook on writing compilers.
|
|
|
|
|
|
|
|
\subsection{Matching Characters}
|
|
|
|
|
|
|
|
Most letters and characters will simply match themselves. For
|
|
|
|
example, the regular expression \regexp{test} will match the string
|
|
|
|
\samp{test} exactly. (You can enable a case-insensitive mode that
|
|
|
|
would let this RE match \samp{Test} or \samp{TEST} as well; more
|
|
|
|
about this later.)
|
|
|
|
|
|
|
|
There are exceptions to this rule; some characters are
special \dfn{metacharacters}, and don't match themselves.  Instead, they signal that some
|
|
|
|
out-of-the-ordinary thing should be matched, or they affect other
|
|
|
|
portions of the RE by repeating them. Much of this document is
|
|
|
|
devoted to discussing various metacharacters and what they do.
|
|
|
|
|
|
|
|
Here's a complete list of the metacharacters; their meanings will be
|
|
|
|
discussed in the rest of this HOWTO.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
. ^ $ * + ? { } [ ] \ | ( )
|
|
|
|
\end{verbatim}
|
|
|
|
% $
|
|
|
|
|
|
|
|
The first metacharacters we'll look at are \samp{[} and \samp{]}.
|
|
|
|
They're used for specifying a character class, which is a set of
|
|
|
|
characters that you wish to match. Characters can be listed
|
|
|
|
individually, or a range of characters can be indicated by giving two
|
|
|
|
characters and separating them by a \character{-}. For example,
|
|
|
|
\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
|
|
|
|
\samp{c}; this is the same as
|
|
|
|
\regexp{[a-c]}, which uses a range to express the same set of
|
|
|
|
characters. If you wanted to match only lowercase letters, your
|
|
|
|
RE would be \regexp{[a-z]}.
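
As a minimal sketch of these classes in action, the following uses the
module-level \function{re.match()} function, which tests for a match at
the start of a string.  The sample characters and variable names are
illustrative assumptions, not part of the original text.

\begin{verbatim}
import re

# [abc] lists the characters; [a-c] names the same set as a range.
listed = [re.match('[abc]', ch) is not None for ch in 'abcd']
ranged = [re.match('[a-c]', ch) is not None for ch in 'abcd']

# [a-z] accepts any lowercase letter but not an uppercase one.
lower_ok = re.match('[a-z]', 'q') is not None
upper_ok = re.match('[a-z]', 'Q') is not None

print(listed == ranged)   # True: both accept a, b, c and reject d
\end{verbatim}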
|
|
|
|
|
|
|
|
Metacharacters are not active inside classes. For example,
|
|
|
|
\regexp{[akm\$]} will match any of the characters \character{a},
|
|
|
|
\character{k}, \character{m}, or \character{\$}; \character{\$} is
|
|
|
|
usually a metacharacter, but inside a character class it's stripped of
|
|
|
|
its special nature.
|
|
|
|
|
|
|
|
You can match the characters not listed within a class by \dfn{complementing}
|
|
|
|
the set. This is indicated by including a \character{\^} as the first
|
|
|
|
character of the class; \character{\^} elsewhere will simply match the
|
|
|
|
\character{\^} character. For example, \verb|[^5]| will match any
|
|
|
|
character except \character{5}.
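
A short sketch of both behaviours follows, with sample characters chosen
purely for illustration.  The class \verb|[5^]| is an added example of a
caret that is not in first position.

\begin{verbatim}
import re

# Inside a class, $ is an ordinary character.
dollar = re.match('[akm$]', '$') is not None

# A leading ^ complements the class: anything except 5 matches.
not_five = re.match('[^5]', '7') is not None
five = re.match('[^5]', '5') is not None

# A ^ that is not first in the class is a literal caret.
caret = re.match('[5^]', '^') is not None

print(dollar, not_five, five, caret)
\end{verbatim}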
|
|
|
|
|
|
|
|
Perhaps the most important metacharacter is the backslash, \samp{\e}.
|
|
|
|
As in Python string literals, the backslash can be followed by various
|
|
|
|
characters to signal various special sequences. It's also used to escape
|
|
|
|
all the metacharacters so you can still match them in patterns; for
|
|
|
|
example, if you need to match a \samp{[} or
|
|
|
|
\samp{\e}, you can precede them with a backslash to remove their
|
|
|
|
special meaning: \regexp{\e[} or \regexp{\e\e}.
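
The two escapes can be tried out as follows.  This is only a sketch: the
sample strings are assumptions, and the patterns are written with
Python's raw string notation (the \character{r} prefix) so that the
backslashes reach the \module{re} module unchanged.

\begin{verbatim}
import re

# \[ matches a literal [ and \\ matches a literal backslash.
bracket = re.search(r'\[', 'a[0]')
backslash = re.search(r'\\', r'C:\temp')

print(bracket.start())     # 1
print(backslash.start())   # 2
\end{verbatim}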
|
|
|
|
|
|
|
|
Some of the special sequences beginning with \character{\e} represent
|
|
|
|
predefined sets of characters that are often useful, such as the set
|
|
|
|
of digits, the set of letters, or the set of anything that isn't
|
|
|
|
whitespace. The following predefined special sequences are available:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item[\code{\e d}]Matches any decimal digit; this is
|
|
|
|
equivalent to the class \regexp{[0-9]}.
|
|
|
|
|
|
|
|
\item[\code{\e D}]Matches any non-digit character; this is
|
|
|
|
equivalent to the class \verb|[^0-9]|.
|
|
|
|
|
|
|
|
\item[\code{\e s}]Matches any whitespace character; this is
|
|
|
|
equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
|
|
|
|
|
|
|
|
\item[\code{\e S}]Matches any non-whitespace character; this is
|
|
|
|
equivalent to the class \verb|[^ \t\n\r\f\v]|.
|
|
|
|
|
|
|
|
\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
|
|
|
|
\regexp{[a-zA-Z0-9_]}.
|
|
|
|
|
|
|
|
\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
|
|
|
|
\verb|[^a-zA-Z0-9_]|.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
These sequences can be included inside a character class. For
|
|
|
|
example, \regexp{[\e s,.]} is a character class that will match any
|
|
|
|
whitespace character, or \character{,} or \character{.}.
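
The sequences above can be explored with \function{re.findall()}, which
returns every match in a string.  The sample text below is an assumption
and is plain ASCII, the case in which the equivalences listed above hold
exactly.

\begin{verbatim}
import re

text = 'Room 42,\tbuilding B.'

digits = re.findall(r'\d', text)          # ['4', '2']
non_digits = re.findall(r'\D', '4a2')     # ['a']

# \W and the class [\s,.] pick out the same characters in this text.
non_word = re.findall(r'\W', text)
separators = re.findall(r'[\s,.]', text)

print(non_word == separators)             # True
\end{verbatim}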
|
|
|
|
|
|
|
|
The final metacharacter in this section is \regexp{.}. It matches
|
|
|
|
anything except a newline character, and there's an alternate mode
|
|
|
|
(\code{re.DOTALL}) where it will match even a newline. \character{.}
|
|
|
|
is often used where you want to match ``any character''.
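
A brief sketch of the difference the \code{re.DOTALL} flag makes; the
three-character sample strings are illustrative assumptions.

\begin{verbatim}
import re

# By default . refuses to match a newline.
same_line = re.match('a.c', 'abc') is not None
default = re.match('a.c', 'a\nc') is not None

# With re.DOTALL, . matches the newline as well.
dotall = re.match('a.c', 'a\nc', re.DOTALL) is not None

print(same_line, default, dotall)
\end{verbatim}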
|
|
|
|
|
|
|
|
\subsection{Repeating Things}
|
|
|
|
|
|
|
|
Being able to match varying sets of characters is the first thing
|
|
|
|
regular expressions can do that isn't already possible with the
|
|
|
|
methods available on strings.  However, if that were the only
|
|
|
|
additional capability of regexes, they wouldn't be much of an advance.
|
|
|
|
Another capability is that you can specify that portions of the RE
|
|
|
|
must be repeated a certain number of times.
|
|
|
|
|
|
|
|
The first metacharacter for repeating things that we'll look at is
|
|
|
|
\regexp{*}. \regexp{*} doesn't match the literal character \samp{*};
|
|
|
|
instead, it specifies that the previous character can be matched zero
|
|
|
|
or more times, instead of exactly once.
|
|
|
|
|
|
|
|
For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
|
|
|
|
characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
|
|
|
|
characters), and so forth. The RE engine has various internal
|
|
|
|
limitations stemming from the size of C's \code{int} type that will
|
|
|
|
prevent it from matching over 2 billion \samp{a} characters; you
|
|
|
|
probably don't have enough memory to construct a string that large, so
|
|
|
|
you shouldn't run into that limit.
|
|
|
|
|
|
|
|
Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
|
|
|
|
the matching engine will try to repeat it as many times as possible.
|
|
|
|
If later portions of the pattern don't match, the matching engine will
|
|
|
|
then back up and try again with fewer repetitions.
|
|
|
|
|
|
|
|
A step-by-step example will make this more obvious. Let's consider
|
|
|
|
the expression \regexp{a[bcd]*b}. This matches the letter
|
|
|
|
\character{a}, zero or more letters from the class \code{[bcd]}, and
|
|
|
|
finally ends with a \character{b}. Now imagine matching this RE
|
|
|
|
against the string \samp{abcbd}.
|
|
|
|
|
|
|
|
\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
|
|
|
|
\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
|
|
|
|
\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
|
|
|
|
it can, which is to the end of the string.}
|
|
|
|
\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
|
|
|
|
current position is at the end of the string, so it fails.}
|
|
|
|
\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
|
|
|
|
one less character.}
|
|
|
|
\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
|
|
|
|
current position is at the last character, which is a \character{d}.}
|
|
|
|
\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
|
|
|
|
only matching \samp{bc}.}
|
|
|
|
\lineiii{7}{\code{abcb}}{Try \regexp{b} again.  This time
the character at the current position is \character{b}, so it succeeds.}
|
|
|
|
\end{tableiii}
|
|
|
|
|
|
|
|
The end of the RE has now been reached, and it has matched
|
|
|
|
\samp{abcb}. This demonstrates how the matching engine goes as far as
|
|
|
|
it can at first, and if no match is found it will then progressively
|
|
|
|
back up and retry the rest of the RE again and again. It will back up
|
|
|
|
until it has tried zero matches for \regexp{[bcd]*}, and if that
|
|
|
|
subsequently fails, the engine will conclude that the string doesn't
|
|
|
|
match the RE at all.
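
The outcome of the walkthrough can be checked directly.  In this sketch
the second string, \samp{acd}, is an added assumption that shows the
case where every amount of backing up fails.

\begin{verbatim}
import re

# Greedy matching followed by backtracking settles on 'abcb'.
m = re.match('a[bcd]*b', 'abcbd')
print(m.group())    # abcb
print(m.span())     # (0, 4)

# No b is available to end the match, so the result is None.
no_match = re.match('a[bcd]*b', 'acd')
\end{verbatim}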
|
|
|
|
|
|
|
|
Another repeating metacharacter is \regexp{+}, which matches one or
|
|
|
|
more times. Pay careful attention to the difference between
|
|
|
|
\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
|
|
|
|
times, so whatever's being repeated may not be present at all, while
|
|
|
|
\regexp{+} requires at least \emph{one} occurrence. To use a similar
|
|
|
|
example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
|
|
|
|
\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
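
The contrast between the two qualifiers can be sketched like this; the
list name \code{candidates} is an illustrative assumption.

\begin{verbatim}
import re

candidates = ['ct', 'cat', 'caaat']

# * allows zero repetitions, so 'ct' is accepted.
star = [s for s in candidates if re.match('ca*t', s)]

# + demands at least one 'a', so 'ct' is rejected.
plus = [s for s in candidates if re.match('ca+t', s)]

print(star)    # ['ct', 'cat', 'caaat']
print(plus)    # ['cat', 'caaat']
\end{verbatim}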
|
|
|
|
|
|
|
|
There are two more repeating qualifiers. The question mark character,
|
|
|
|
\regexp{?}, matches either once or zero times; you can think of it as
|
|
|
|
marking something as being optional. For example, \regexp{home-?brew}
|
|
|
|
matches either \samp{homebrew} or \samp{home-brew}.
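
A small sketch of the optional hyphen; the third spelling, with two
hyphens, is an added assumption to show what \regexp{?} does not allow.

\begin{verbatim}
import re

spellings = ['homebrew', 'home-brew', 'home--brew']

# -? allows zero or one hyphen, never two.
accepted = [s for s in spellings if re.match('home-?brew', s)]

print(accepted)    # ['homebrew', 'home-brew']
\end{verbatim}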
|
|
|
|
|
|
|
|
The most complicated repeating qualifier is
|
|
|
|
\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
|
|
|
|
integers. This qualifier means there must be at least \var{m}
|
|
|
|
repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b}
|
|
|
|
will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match
|
|
|
|
\samp{ab}, which has no slashes, or \samp{a////b}, which has four.
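
The bounds can be verified with a short sketch; the list name
\code{strings} is an illustrative assumption.

\begin{verbatim}
import re

strings = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']

# /{1,3} requires between one and three slashes.
matched = [s for s in strings if re.match('a/{1,3}b', s)]

print(matched)    # ['a/b', 'a//b', 'a///b']
\end{verbatim}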
|
|
|
|
|
|
|
|
You can omit either \var{m} or \var{n}; in that case, a reasonable
|
|
|
|
value is assumed for the missing value. Omitting \var{m} is
|
|
|
|
interpreted as a lower limit of 0, while omitting \var{n} results in an
|
|
|
|
upper bound of infinity --- actually, the 2 billion limit mentioned
|
|
|
|
earlier, but that might as well be infinity.
|
|
|
|
|
|
|
|
Readers of a reductionist bent may notice that the three other qualifiers
|
|
|
|
can all be expressed using this notation. \regexp{\{0,\}} is the same
|
|
|
|
as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
|
|
|
|
\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use
|
|
|
|
\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
|
|
|
|
they're shorter and easier to read.
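
These equivalences can be spot-checked on a few strings.  The sketch
below is not a proof; the helper \function{accepted()} and the sample
list are hypothetical names introduced only for this illustration.

\begin{verbatim}
import re

samples = ['ct', 'cat', 'caaat']

def accepted(pattern):
    # Return the sample strings that the pattern matches.
    return [s for s in samples if re.match(pattern, s)]

star_same = accepted('ca*t') == accepted('ca{0,}t')
plus_same = accepted('ca+t') == accepted('ca{1,}t')
optional_same = accepted('ca?t') == accepted('ca{0,1}t')

print(star_same and plus_same and optional_same)    # True
\end{verbatim}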
|
|
|
|
|
|
|
|
\section{Using Regular Expressions}
|
|
|
|
|
|
|
|
Now that we've looked at some simple regular expressions, how do we
|
|
|
|
actually use them in Python? The \module{re} module provides an
|
|
|
|
interface to the regular expression engine, allowing you to compile
|
|
|
|
REs into objects and then perform matches with them.
|
|
|
|
|
|
|
|
\subsection{Compiling Regular Expressions}
|
|
|
|
|
|
|
|
Regular expressions are compiled into \class{RegexObject} instances,
|
|
|
|
which have methods for various operations such as searching for
|
|
|
|
pattern matches or performing string substitutions.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> import re
|
|
|
|
>>> p = re.compile('ab*')
|
|
|
|
>>> print p
|
|
|
|
<re.RegexObject instance at 80b4150>
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
\function{re.compile()} also accepts an optional \var{flags}
|
|
|
|
argument, used to enable various special features and syntax
|
|
|
|
variations. We'll go over the available settings later, but for now a
|
|
|
|
single example will do:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('ab*', re.IGNORECASE)
|
|
|
|
\end{verbatim}
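
To see what the flag changes, the same pattern can be compiled with and
without it.  This is a sketch; the variable names and the sample string
\samp{ABBB} are assumptions.

\begin{verbatim}
import re

sensitive = re.compile('ab*')
insensitive = re.compile('ab*', re.IGNORECASE)

# Without the flag, uppercase text does not match.
plain_result = sensitive.match('ABBB')

# With re.IGNORECASE, letter case is ignored.
flag_result = insensitive.match('ABBB')

print(plain_result is None)    # True
print(flag_result.group())     # ABBB
\end{verbatim}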
|
|
|
|
|
|
|
|
The RE is passed to \function{re.compile()} as a string. REs are
|
|
|
|
handled as strings because regular expressions aren't part of the core
|
|
|
|
Python language, and no special syntax was created for expressing
|
|
|
|
them. (There are applications that don't need REs at all, so there's
|
|
|
|
no need to bloat the language specification by including them.)
|
|
|
|
Instead, the \module{re} module is simply a C extension module
|
|
|
|
included with Python, just like the \module{socket} or \module{zlib}
|
|
|
|
module.
|
|
|
|
|
|
|
|
Putting REs in strings keeps the Python language simpler, but has one
|
|
|
|
disadvantage which is the topic of the next section.
|
|
|
|
|
|
|
|
\subsection{The Backslash Plague}
|
|
|
|
|
|
|
|
As stated earlier, regular expressions use the backslash
|
|
|
|
character (\character{\e}) to indicate special forms or to allow
|
|
|
|
special characters to be used without invoking their special meaning.
|
|
|
|
This conflicts with Python's usage of the same character for the same
|
|
|
|
purpose in string literals.
|
|
|
|
|
|
|
|
Let's say you want to write a RE that matches the string
|
|
|
|
\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure
|
|
|
|
out what to write in the program code, start with the desired string
|
|
|
|
to be matched. Next, you must escape any backslashes and other
|
|
|
|
metacharacters by preceding them with a backslash, resulting in the
|
|
|
|
string \samp{\e\e section}.  The string passed
to \function{re.compile()} must therefore be \verb|\\section|.  However, to
|
|
|
|
express this as a Python string literal, both backslashes must be
|
|
|
|
escaped \emph{again}.
|
|
|
|
|
|
|
|
\begin{tableii}{c|l}{code}{Characters}{Stage}
|
|
|
|
\lineii{\e section}{Text string to be matched}
|
|
|
|
\lineii{\e\e section}{Escaped backslash for \function{re.compile}}
|
|
|
|
\lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
|
|
|
|
\end{tableii}
|
|
|
|
|
|
|
|
In short, to match a literal backslash, one has to write
|
|
|
|
\code{'\e\e\e\e'} as the RE string, because the regular expression
|
|
|
|
must be \samp{\e\e}, and each backslash must be expressed as
|
|
|
|
\samp{\e\e} inside a regular Python string literal. In REs that
|
|
|
|
feature backslashes repeatedly, this leads to lots of repeated
|
|
|
|
backslashes and makes the resulting strings difficult to understand.
|
|
|
|
|
|
|
|
The solution is to use Python's raw string notation for regular
|
|
|
|
expressions; backslashes are not handled in any special way in
|
|
|
|
a string literal prefixed with \character{r}, so \code{r"\e n"} is a
|
|
|
|
two-character string containing \character{\e} and \character{n},
|
|
|
|
while \code{"\e n"} is a one-character string containing a newline.
|
|
|
|
Regular expressions will often be written in Python
code using this raw string notation.
|
|
|
|
|
|
|
|
\begin{tableii}{c|c}{code}{Regular String}{Raw string}
|
|
|
|
\lineii{"ab*"}{\code{r"ab*"}}
|
|
|
|
\lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
|
|
|
|
\lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
|
|
|
|
\end{tableii}
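
The rows of the table can be confirmed by comparing the two spellings
directly.  In this sketch the variable names and the sample text being
matched are assumptions.

\begin{verbatim}
import re

# The same pattern written as a regular and as a raw string literal.
regular = "\\\\section"
raw = r"\\section"
same = (regular == raw)

# Escapes are processed only in the regular literal.
newline_length = len("\n")     # 1: a single newline character
raw_length = len(r"\n")        # 2: a backslash followed by n

# The pattern matches the text \section, spelled '\\section' here.
m = re.match(raw, '\\section{Introduction}')
print(same)         # True
print(m.group())    # \section
\end{verbatim}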
|
|
|
|
|
|
|
|
\subsection{Performing Matches}
|
|
|
|
|
|
|
|
Once you have an object representing a compiled regular expression,
|
|
|
|
what do you do with it? \class{RegexObject} instances have several
|
|
|
|
methods and attributes. Only the most significant ones will be
|
|
|
|
covered here; consult \ulink{the Library
|
|
|
|
Reference}{http://www.python.org/doc/lib/module-re.html} for a
|
|
|
|
complete listing.
|
|
|
|
|
|
|
|
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
|
|
|
|
\lineii{match()}{Determine if the RE matches at the beginning of
|
|
|
|
the string.}
|
|
|
|
\lineii{search()}{Scan through a string, looking for any location
|
|
|
|
where this RE matches.}
|
|
|
|
\lineii{findall()}{Find all substrings where the RE matches,
|
|
|
|
and return them as a list.}
|
|
|
|
\lineii{finditer()}{Find all substrings where the RE matches,
|
|
|
|
and return them as an iterator.}
|
|
|
|
\end{tableii}
|
|
|
|
|
|
|
|
\method{match()} and \method{search()} return \code{None} if no match
|
|
|
|
can be found. If they're successful, a \code{MatchObject} instance is
|
|
|
|
returned, containing information about the match: where it starts and
|
|
|
|
ends, the substring it matched, and more.
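
The four methods can be compared on one pattern.  This is a minimal
sketch; the pattern \regexp{[a-z]+} and the sample strings are
assumptions chosen for illustration.

\begin{verbatim}
import re

p = re.compile('[a-z]+')

# match() only succeeds at the very start of the string.
at_start = p.match('tempo 123')
not_at_start = p.match('123 tempo')

# search() scans forward until it finds a match.
found = p.search('123 tempo')

# findall() returns the matched substrings as a list;
# finditer() yields one match object per match.
text = '12 drummers, 11 pipers'
words = p.findall(text)
spans = [m.span() for m in p.finditer(text)]

print(at_start.group())        # tempo
print(not_at_start is None)    # True
print(found.span())            # (4, 9)
\end{verbatim}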
|
|
|
|
|
|
|
|
You can learn about this by interactively experimenting with the
|
|
|
|
\module{re} module. If you have Tkinter available, you may also want
|
|
|
|
to look at \file{Tools/scripts/redemo.py}, a demonstration program
|
|
|
|
included with the Python distribution. It allows you to enter REs and
|
|
|
|
strings, and displays whether the RE matches or fails.
|
|
|
|
\file{redemo.py} can be quite useful when trying to debug a
|
|
|
|
complicated RE. Phil Schwartz's
|
Four months of trunk changes (including a few releases...)
Merged revisions 51434-53004 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk
........
r51434 | neal.norwitz | 2006-08-21 20:20:10 +0200 (Mon, 21 Aug 2006) | 1 line
Fix a couple of ssize-t issues reported by Alexander Belopolsky on python-dev
........
r51439 | neal.norwitz | 2006-08-21 21:47:08 +0200 (Mon, 21 Aug 2006) | 6 lines
Patch #1542451: disallow continue anywhere under a finally
I'm undecided if this should be backported to 2.5 or 2.5.1.
Armin suggested to wait (I'm of the same opinion). Thomas W thinks
it's fine to go in 2.5.
........
r51443 | neal.norwitz | 2006-08-21 22:16:24 +0200 (Mon, 21 Aug 2006) | 4 lines
Handle a few more error conditions.
Klocwork 301 and 302. Will backport.
........
r51450 | neal.norwitz | 2006-08-22 00:21:19 +0200 (Tue, 22 Aug 2006) | 5 lines
Patch #1541585: fix buffer overrun when performing repr() on
a unicode string in a build with wide unicode (UCS-4) support.
This code could be improved, so add an XXX comment.
........
r51456 | neal.norwitz | 2006-08-22 01:44:48 +0200 (Tue, 22 Aug 2006) | 1 line
Try to get the windows bots working again with the new peephole.c
........
r51461 | anthony.baxter | 2006-08-22 09:36:59 +0200 (Tue, 22 Aug 2006) | 1 line
patch for documentation for recent uuid changes (from ping)
........
r51473 | neal.norwitz | 2006-08-22 15:56:56 +0200 (Tue, 22 Aug 2006) | 1 line
Alexander Belopolsky pointed out that pos is a size_t
........
r51489 | jeremy.hylton | 2006-08-22 22:46:00 +0200 (Tue, 22 Aug 2006) | 2 lines
Expose column offset information in parse trees.
........
r51497 | andrew.kuchling | 2006-08-23 01:13:43 +0200 (Wed, 23 Aug 2006) | 1 line
Move functional howto into trunk
........
r51515 | jeremy.hylton | 2006-08-23 20:37:43 +0200 (Wed, 23 Aug 2006) | 2 lines
Baby steps towards better tests for tokenize
........
r51525 | alex.martelli | 2006-08-23 22:42:02 +0200 (Wed, 23 Aug 2006) | 6 lines
x**2 should about equal x*x (including for a float x such that the result is
inf) but didn't; added a test to test_float to verify that, and ignored the
ERANGE value for errno in the pow operation to make the new test pass (with
help from Marilyn Davis at the Google Python Sprint -- thanks!).
........
r51526 | jeremy.hylton | 2006-08-23 23:14:03 +0200 (Wed, 23 Aug 2006) | 20 lines
Bug fixes large and small for tokenize.
Small: Always generate a NL or NEWLINE token following
a COMMENT token. The old code did not generate an NL token if
the comment was on a line by itself.
Large: The output of untokenize() will now match the
input exactly if it is passed the full token sequence. The
old, crufty output is still generated if a limited input
sequence is provided, where limited means that it does not
include position information for tokens.
Remaining bug: There is no CONTINUATION token (\) so there is no way
for untokenize() to handle such code.
Also, expanded the number of doctests in hopes of eventually removing
the old-style tests that compare against a golden file.
Bug fix candidate for Python 2.5.1. (Sigh.)
........
r51527 | jeremy.hylton | 2006-08-23 23:26:46 +0200 (Wed, 23 Aug 2006) | 5 lines
Replace dead code with an assert.
Now that COMMENT tokens are reliably followed by NL or NEWLINE,
there is never a need to add extra newlines in untokenize.
........
r51530 | alex.martelli | 2006-08-24 00:17:59 +0200 (Thu, 24 Aug 2006) | 7 lines
Reverting the patch that tried to fix the issue whereby x**2 raises
OverflowError while x*x succeeds and produces infinity; apparently
these inconsistencies cannot be fixed across ``all'' platforms and
there's a widespread feeling that therefore ``every'' platform
should keep suffering forevermore. Ah well.
........
r51565 | thomas.wouters | 2006-08-24 20:40:20 +0200 (Thu, 24 Aug 2006) | 6 lines
Fix SF bug #1545837: array.array borks on deepcopy.
array.__deepcopy__() needs to take an argument, even if it doesn't actually
use it. Will backport to 2.5 and 2.4 (if applicable.)
........
r51580 | martin.v.loewis | 2006-08-25 02:03:34 +0200 (Fri, 25 Aug 2006) | 3 lines
Patch #1545507: Exclude ctypes package in Win64 MSI file.
Will backport to 2.5.
........
r51589 | neal.norwitz | 2006-08-25 03:52:49 +0200 (Fri, 25 Aug 2006) | 1 line
importing types is not necessary if we use isinstance
........
r51604 | thomas.heller | 2006-08-25 09:27:33 +0200 (Fri, 25 Aug 2006) | 3 lines
Port _ctypes.pyd to win64 on AMD64.
........
r51605 | thomas.heller | 2006-08-25 09:34:51 +0200 (Fri, 25 Aug 2006) | 3 lines
Add missing file for _ctypes.pyd port to win64 on AMD64.
........
r51606 | thomas.heller | 2006-08-25 11:26:33 +0200 (Fri, 25 Aug 2006) | 6 lines
Build _ctypes.pyd for win AMD64 into the MSVC project file.
Since MSVC doesn't know about .asm files, a helper batch file is needed
to find ml64.exe in predefined locations. The helper script hardcodes
the path to the MS Platform SDK.
........
r51608 | armin.rigo | 2006-08-25 14:44:28 +0200 (Fri, 25 Aug 2006) | 4 lines
The regular expression engine in '_sre' can segfault when interpreting
bogus bytecode. It is unclear whether this is a real bug or a "won't
fix" case like bogus_code_obj.py.
........
r51617 | tim.peters | 2006-08-26 00:05:39 +0200 (Sat, 26 Aug 2006) | 2 lines
Whitespace normalization.
........
r51618 | tim.peters | 2006-08-26 00:06:44 +0200 (Sat, 26 Aug 2006) | 2 lines
Add missing svn:eol-style property to text files.
........
r51619 | tim.peters | 2006-08-26 00:26:21 +0200 (Sat, 26 Aug 2006) | 3 lines
A new test here relied on preserving invisible trailing
whitespace in expected output. Stop that.
........
r51624 | jack.diederich | 2006-08-26 20:42:06 +0200 (Sat, 26 Aug 2006) | 4 lines
- Move functions common to all path modules into genericpath.py and have the
OS speicifc path modules import them.
- Have os2emxpath import common functions fron ntpath instead of using copies
........
r51642 | neal.norwitz | 2006-08-29 07:40:58 +0200 (Tue, 29 Aug 2006) | 1 line
Fix a couple of typos.
........
r51647 | marc-andre.lemburg | 2006-08-29 12:34:12 +0200 (Tue, 29 Aug 2006) | 5 lines
Fix a buglet in the error reporting (SF bug report #1546372).
This should probably go into Python 2.5 or 2.5.1 as well.
........
r51663 | armin.rigo | 2006-08-31 10:51:06 +0200 (Thu, 31 Aug 2006) | 3 lines
Doc fix: hashlib objects don't always return a digest of 16 bytes.
Backport candidate for 2.5.
........
r51664 | nick.coghlan | 2006-08-31 14:00:43 +0200 (Thu, 31 Aug 2006) | 1 line
Fix the wrongheaded implementation of context management in the decimal module and add unit tests. (python-dev discussion is ongoing regarding what we do about Python 2.5)
........
r51665 | nick.coghlan | 2006-08-31 14:51:25 +0200 (Thu, 31 Aug 2006) | 1 line
Remove the old decimal context management tests from test_contextlib (guess who didn't run the test suite before committing...)
........
r51669 | brett.cannon | 2006-08-31 20:54:26 +0200 (Thu, 31 Aug 2006) | 4 lines
Make sure memory is properly cleaned up in file_init.
Backport candidate.
........
r51671 | brett.cannon | 2006-08-31 23:47:52 +0200 (Thu, 31 Aug 2006) | 2 lines
Fix comment about indentation level in C files.
........
r51674 | brett.cannon | 2006-09-01 00:42:37 +0200 (Fri, 01 Sep 2006) | 3 lines
Have pre-existing C files use 8 spaces indents (to match old PEP 7 style), but
have all new files use 4 spaces (to match current PEP 7 style).
........
r51676 | fred.drake | 2006-09-01 05:57:19 +0200 (Fri, 01 Sep 2006) | 3 lines
- SF patch #1550263: Enhance and correct unittest docs
- various minor cleanups for improved consistency
........
r51677 | georg.brandl | 2006-09-02 00:30:52 +0200 (Sat, 02 Sep 2006) | 2 lines
evalfile() should be execfile().
........
r51681 | neal.norwitz | 2006-09-02 04:43:17 +0200 (Sat, 02 Sep 2006) | 1 line
SF #1547931, fix typo (missing and). Will backport to 2.5
........
r51683 | neal.norwitz | 2006-09-02 04:50:35 +0200 (Sat, 02 Sep 2006) | 1 line
Bug #1548092: fix curses.tparm seg fault on invalid input. Needs backport to 2.5.1 and earlier.
........
r51684 | neal.norwitz | 2006-09-02 04:58:13 +0200 (Sat, 02 Sep 2006) | 4 lines
Bug #1550714: fix SystemError from itertools.tee on negative value for n.
Needs backport to 2.5.1 and earlier.
........
r51685 | nick.coghlan | 2006-09-02 05:54:17 +0200 (Sat, 02 Sep 2006) | 1 line
Make decimal.ContextManager a private implementation detail of decimal.localcontext()
........
r51686 | nick.coghlan | 2006-09-02 06:04:18 +0200 (Sat, 02 Sep 2006) | 1 line
Further corrections to the decimal module context management documentation
........
r51688 | raymond.hettinger | 2006-09-02 19:07:23 +0200 (Sat, 02 Sep 2006) | 1 line
Fix documentation nits for decimal context managers.
........
r51690 | neal.norwitz | 2006-09-02 20:51:34 +0200 (Sat, 02 Sep 2006) | 1 line
Add missing word in comment
........
r51691 | neal.norwitz | 2006-09-02 21:40:19 +0200 (Sat, 02 Sep 2006) | 7 lines
Hmm, this test has failed at least twice recently on the OpenBSD and
Debian sparc buildbots. Since this goes through a lot of tests
and hits the disk a lot it could be slow (especially if NFS is involved).
I'm not sure if that's the problem, but printing periodic msgs shouldn't hurt.
The code was stolen from test_compiler.
........
r51693 | nick.coghlan | 2006-09-03 03:02:00 +0200 (Sun, 03 Sep 2006) | 1 line
Fix final documentation nits before backporting decimal module fixes to 2.5
........
r51694 | nick.coghlan | 2006-09-03 03:06:07 +0200 (Sun, 03 Sep 2006) | 1 line
Typo fix for decimal docs
........
r51697 | nick.coghlan | 2006-09-03 03:20:46 +0200 (Sun, 03 Sep 2006) | 1 line
NEWS entry on trunk for decimal module changes
........
r51704 | raymond.hettinger | 2006-09-04 17:32:48 +0200 (Mon, 04 Sep 2006) | 1 line
Fix endcase for str.rpartition()
........
r51716 | tim.peters | 2006-09-05 04:18:09 +0200 (Tue, 05 Sep 2006) | 12 lines
"Conceptual" merge of rev 51711 from the 2.5 branch.
i_divmod(): As discussed on Python-Dev, changed the overflow
checking to live happily with recent gcc optimizations that
assume signed integer arithmetic never overflows.
This differs from the corresponding change on the 2.5 and 2.4
branches, using a less obscure approach, but one that /may/
tickle platform idiocies in their definitions of LONG_MIN.
The 2.4 + 2.5 change avoided introducing a dependence on
LONG_MIN, at the cost of substantially goofier code.
........
r51717 | tim.peters | 2006-09-05 04:21:19 +0200 (Tue, 05 Sep 2006) | 2 lines
Whitespace normalization.
........
r51719 | tim.peters | 2006-09-05 04:22:17 +0200 (Tue, 05 Sep 2006) | 2 lines
Add missing svn:eol-style property to text files.
........
r51720 | neal.norwitz | 2006-09-05 04:24:03 +0200 (Tue, 05 Sep 2006) | 2 lines
Fix SF bug #1546288, crash in dict_equal.
........
r51721 | neal.norwitz | 2006-09-05 04:25:41 +0200 (Tue, 05 Sep 2006) | 1 line
Fix SF #1552093, eval docstring typo (3 ps in mapping)
........
r51724 | neal.norwitz | 2006-09-05 04:35:08 +0200 (Tue, 05 Sep 2006) | 1 line
This was found by Guido AFAIK on p3yk (sic) branch.
........
r51725 | neal.norwitz | 2006-09-05 04:36:20 +0200 (Tue, 05 Sep 2006) | 1 line
Add a NEWS entry for str.rpartition() change
........
r51728 | neal.norwitz | 2006-09-05 04:57:01 +0200 (Tue, 05 Sep 2006) | 1 line
Patch #1540470, for OpenBSD 4.0. Backport candidate for 2.[34].
........
r51729 | neal.norwitz | 2006-09-05 05:53:08 +0200 (Tue, 05 Sep 2006) | 12 lines
Bug #1520864 (again): unpacking singleton tuples in list comprehensions and
generator expressions (x for x, in ... ) works again.
Sigh, I only fixed for loops the first time, not list comps and genexprs too.
I couldn't find any more unpacking cases where there is a similar bug lurking.
This code should be refactored to eliminate the duplication. I'm sure
the listcomp/genexpr code can be refactored. I'm not sure if the for loop
can re-use any of the same code though.
Will backport to 2.5 (the only place it matters).
........
r51731 | neal.norwitz | 2006-09-05 05:58:26 +0200 (Tue, 05 Sep 2006) | 1 line
Add a comment about some refactoring. (There's probably more that should be done.) I will reformat this file in the next checkin due to the inconsistent tabs/spaces.
........
r51732 | neal.norwitz | 2006-09-05 06:00:12 +0200 (Tue, 05 Sep 2006) | 1 line
M-x untabify
........
r51737 | hyeshik.chang | 2006-09-05 14:07:09 +0200 (Tue, 05 Sep 2006) | 7 lines
Fix a few bugs on cjkcodecs found by Oren Tirosh:
- gbk and gb18030 codec now handle U+30FB KATAKANA MIDDLE DOT correctly.
- iso2022_jp_2 codec now encodes into G0 for KS X 1001, GB2312
codepoints to conform to the standard.
- iso2022_jp_3 and iso2022_jp_2004 codec can encode JIS X 2013:2
codepoints now.
........
r51738 | hyeshik.chang | 2006-09-05 14:14:57 +0200 (Tue, 05 Sep 2006) | 2 lines
Fix a typo: 2013 -> 0213
........
r51740 | georg.brandl | 2006-09-05 14:44:58 +0200 (Tue, 05 Sep 2006) | 3 lines
Bug #1552618: change docs of dict.has_key() to reflect recommendation
to use "in".
........
r51742 | andrew.kuchling | 2006-09-05 15:02:40 +0200 (Tue, 05 Sep 2006) | 1 line
Rearrange example a bit, and show rpartition() when separator is not found
........
r51744 | andrew.kuchling | 2006-09-05 15:15:41 +0200 (Tue, 05 Sep 2006) | 1 line
[Bug #1525469] SimpleXMLRPCServer still uses the sys.exc_{value,type} module-level globals instead of calling sys.exc_info(). Reported by Russell Warren
........
r51745 | andrew.kuchling | 2006-09-05 15:19:18 +0200 (Tue, 05 Sep 2006) | 3 lines
[Bug #1526834] Fix crash in pdb when you do 'b f(';
the function name was placed into a regex pattern and the unbalanced paren
caused re.compile() to report an error
........
r51751 | kristjan.jonsson | 2006-09-05 19:58:12 +0200 (Tue, 05 Sep 2006) | 6 lines
Update the PCBuild8 solution.
Facilitate cross-compilation by having binaries in separate Win32 and x64 directories.
Rationalized configs by making proper use of platforms/configurations.
Remove pythoncore_pgo project.
Add new PGIRelease and PGORelease configurations to perform Profile Guided Optimisation.
Removed I64 support, but this can be easily added by copying the x64 platform settings.
........
r51758 | gustavo.niemeyer | 2006-09-06 03:58:52 +0200 (Wed, 06 Sep 2006) | 3 lines
Fixing #1531862: Do not close standard file descriptors in the
subprocess module.
........
r51760 | neal.norwitz | 2006-09-06 05:58:34 +0200 (Wed, 06 Sep 2006) | 1 line
Revert 51758 because it broke all the buildbots
........
r51762 | georg.brandl | 2006-09-06 08:03:59 +0200 (Wed, 06 Sep 2006) | 3 lines
Bug #1551427: fix a wrong NULL pointer check in the win32 version
of os.urandom().
........
r51765 | georg.brandl | 2006-09-06 08:09:31 +0200 (Wed, 06 Sep 2006) | 3 lines
Bug #1550983: emit better error messages for erroneous relative
imports (if not in package and if beyond toplevel package).
........
r51767 | neal.norwitz | 2006-09-06 08:28:06 +0200 (Wed, 06 Sep 2006) | 1 line
with and as are now keywords. There are some generated files I can't recreate.
........
r51770 | georg.brandl | 2006-09-06 08:50:05 +0200 (Wed, 06 Sep 2006) | 5 lines
Bug #1542051: Exceptions now correctly call PyObject_GC_UnTrack.
Also make sure that every exception class has __module__ set to
'exceptions'.
........
r51785 | georg.brandl | 2006-09-06 22:05:58 +0200 (Wed, 06 Sep 2006) | 2 lines
Fix missing import of the types module in logging.config.
........
r51789 | marc-andre.lemburg | 2006-09-06 22:40:22 +0200 (Wed, 06 Sep 2006) | 3 lines
Add news item for bug fix of SF bug report #1546372.
........
r51797 | gustavo.niemeyer | 2006-09-07 02:48:33 +0200 (Thu, 07 Sep 2006) | 3 lines
Fixed subprocess bug #1531862 again, after removing tests
offending buildbot
........
r51798 | raymond.hettinger | 2006-09-07 04:42:48 +0200 (Thu, 07 Sep 2006) | 1 line
Fix refcounts and add error checks.
........
r51803 | nick.coghlan | 2006-09-07 12:50:34 +0200 (Thu, 07 Sep 2006) | 1 line
Fix the speed regression in inspect.py by adding another cache to speed up getmodule(). Patch #1553314
........
r51805 | ronald.oussoren | 2006-09-07 14:03:10 +0200 (Thu, 07 Sep 2006) | 2 lines
Fix a glaring error and update some version numbers.
........
r51814 | andrew.kuchling | 2006-09-07 15:56:23 +0200 (Thu, 07 Sep 2006) | 1 line
Typo fix
........
r51815 | andrew.kuchling | 2006-09-07 15:59:38 +0200 (Thu, 07 Sep 2006) | 8 lines
[Bug #1552726] Avoid repeatedly polling in interactive mode -- only put a timeout on the select()
if an input hook has been defined. Patch by Richard Boulton.
This select() code is only executed with readline 2.1, or if
READLINE_CALLBACKS is defined.
Backport candidate for 2.5, 2.4, probably earlier versions too.
........
r51816 | armin.rigo | 2006-09-07 17:06:00 +0200 (Thu, 07 Sep 2006) | 2 lines
Add a warning notice on top of the generated grammar.txt.
........
r51819 | thomas.heller | 2006-09-07 20:56:28 +0200 (Thu, 07 Sep 2006) | 5 lines
Anonymous structure fields that have a bit-width specified did not work,
and they gave a strange error message from PyArg_ParseTuple:
function takes exactly 2 arguments (3 given).
With tests.
........
r51820 | thomas.heller | 2006-09-07 21:09:54 +0200 (Thu, 07 Sep 2006) | 4 lines
The cast function did not accept c_char_p or c_wchar_p instances
as first argument, and failed with a 'bad argument to internal function'
error message.
........
r51827 | nick.coghlan | 2006-09-08 12:04:38 +0200 (Fri, 08 Sep 2006) | 1 line
Add missing NEWS entry for rev 51803
........
r51828 | andrew.kuchling | 2006-09-08 15:25:23 +0200 (Fri, 08 Sep 2006) | 1 line
Add missing word
........
r51829 | andrew.kuchling | 2006-09-08 15:35:49 +0200 (Fri, 08 Sep 2006) | 1 line
Explain SQLite a bit more clearly
........
r51830 | andrew.kuchling | 2006-09-08 15:36:36 +0200 (Fri, 08 Sep 2006) | 1 line
Explain SQLite a bit more clearly
........
r51832 | andrew.kuchling | 2006-09-08 16:02:45 +0200 (Fri, 08 Sep 2006) | 1 line
Use native SQLite types
........
r51833 | andrew.kuchling | 2006-09-08 16:03:01 +0200 (Fri, 08 Sep 2006) | 1 line
Use native SQLite types
........
r51835 | andrew.kuchling | 2006-09-08 16:05:10 +0200 (Fri, 08 Sep 2006) | 1 line
Fix typo in example
........
r51837 | brett.cannon | 2006-09-09 09:11:46 +0200 (Sat, 09 Sep 2006) | 6 lines
Remove the __unicode__ method from exceptions. Allows unicode() to be called
on exception classes. Would require introducing a tp_unicode slot to make it
work otherwise.
Fixes bug #1551432 and will be backported.
........
r51854 | neal.norwitz | 2006-09-11 06:24:09 +0200 (Mon, 11 Sep 2006) | 8 lines
Forward port of 51850 from release25-maint branch.
As mentioned on python-dev, reverting patch #1504333 because it introduced
an infinite loop in rev 47154.
This patch also adds a test to prevent the regression.
........
r51855 | neal.norwitz | 2006-09-11 06:28:16 +0200 (Mon, 11 Sep 2006) | 5 lines
Properly handle a NULL returned from PyArena_New().
(Also fix some whitespace)
Klocwork #364.
........
r51856 | neal.norwitz | 2006-09-11 06:32:57 +0200 (Mon, 11 Sep 2006) | 1 line
Add a "crasher" taken from the sgml bug report referenced in the comment
........
r51858 | georg.brandl | 2006-09-11 11:38:35 +0200 (Mon, 11 Sep 2006) | 12 lines
Forward-port of rev. 51857:
Building with HP's cc on HP-UX turned up a couple of problems.
_PyGILState_NoteThreadState was declared as static inconsistently.
Make it static as it's not necessary outside of this module.
Some tests failed because errno was reset to 0. (I think the tests
that failed were at least: test_fcntl and test_mailbox).
Ensure that errno doesn't change after a call to Py_END_ALLOW_THREADS.
This only affected debug builds.
........
r51865 | martin.v.loewis | 2006-09-12 21:49:20 +0200 (Tue, 12 Sep 2006) | 2 lines
Forward-port 51862: Add sgml_input.html.
........
r51866 | andrew.kuchling | 2006-09-12 22:50:23 +0200 (Tue, 12 Sep 2006) | 1 line
Markup typo fix
........
r51867 | andrew.kuchling | 2006-09-12 23:09:02 +0200 (Tue, 12 Sep 2006) | 1 line
Some editing, markup fixes
........
r51868 | andrew.kuchling | 2006-09-12 23:21:51 +0200 (Tue, 12 Sep 2006) | 1 line
More wordsmithing
........
r51877 | andrew.kuchling | 2006-09-14 13:22:18 +0200 (Thu, 14 Sep 2006) | 1 line
Make --help mention that -v can be supplied multiple times
........
r51878 | andrew.kuchling | 2006-09-14 13:28:50 +0200 (Thu, 14 Sep 2006) | 1 line
Rewrite help message to remove some of the parentheticals. (There were a lot of them.)
........
r51883 | ka-ping.yee | 2006-09-15 02:34:19 +0200 (Fri, 15 Sep 2006) | 2 lines
Fix grammar errors and improve clarity.
........
r51885 | georg.brandl | 2006-09-15 07:22:24 +0200 (Fri, 15 Sep 2006) | 3 lines
Correct elementtree module index entry.
........
r51889 | fred.drake | 2006-09-15 17:18:04 +0200 (Fri, 15 Sep 2006) | 4 lines
- fix module name in links in formatted documentation
- minor markup cleanup
(forward-ported from release25-maint revision 51888)
........
r51891 | fred.drake | 2006-09-15 18:11:27 +0200 (Fri, 15 Sep 2006) | 3 lines
revise explanation of returns_unicode to reflect bool values
and to include the default value
(merged from release25-maint revision 51890)
........
r51897 | martin.v.loewis | 2006-09-16 19:36:37 +0200 (Sat, 16 Sep 2006) | 2 lines
Patch #1557515: Add RLIMIT_SBSIZE.
........
r51903 | ronald.oussoren | 2006-09-17 20:42:53 +0200 (Sun, 17 Sep 2006) | 2 lines
Port of revision 51902 in release25-maint to the trunk
........
r51904 | ronald.oussoren | 2006-09-17 21:23:27 +0200 (Sun, 17 Sep 2006) | 3 lines
Tweak Mac/Makefile.in to ensure that pythonw gets rebuilt when the major version
of python changes (2.5 -> 2.6). Bug #1552935.
........
r51913 | guido.van.rossum | 2006-09-18 23:36:16 +0200 (Mon, 18 Sep 2006) | 2 lines
Make this thing executable.
........
r51920 | gregory.p.smith | 2006-09-19 19:35:04 +0200 (Tue, 19 Sep 2006) | 5 lines
Fixes a bug with bsddb.DB.stat where the flags and txn keyword
arguments are transposed. (reported by Louis Zechtzer)
..already committed to release24-maint
..needs committing to release25-maint
........
r51926 | brett.cannon | 2006-09-20 20:34:28 +0200 (Wed, 20 Sep 2006) | 3 lines
Accidentally didn't commit Misc/NEWS entry on when __unicode__() was removed
from exceptions.
........
r51927 | brett.cannon | 2006-09-20 20:43:13 +0200 (Wed, 20 Sep 2006) | 6 lines
Allow exceptions to be directly sliced again
(e.g., ``BaseException(1,2,3)[0:2]``).
Discovered in Python 2.5.0 by Thomas Heller and reported to python-dev. This
should be backported to 2.5.
........
r51928 | brett.cannon | 2006-09-20 21:28:35 +0200 (Wed, 20 Sep 2006) | 2 lines
Make python.vim output more deterministic.
........
r51949 | walter.doerwald | 2006-09-21 17:09:55 +0200 (Thu, 21 Sep 2006) | 2 lines
Fix typo.
........
r51950 | jack.diederich | 2006-09-21 19:50:26 +0200 (Thu, 21 Sep 2006) | 5 lines
* regression bug, count_next was coercing a Py_ssize_t to an unsigned Py_size_t
which breaks negative counts
* added test for negative numbers
will backport to 2.5.1
........
r51953 | jack.diederich | 2006-09-21 22:34:49 +0200 (Thu, 21 Sep 2006) | 1 line
added itertools.count(-n) fix
........
r51971 | neal.norwitz | 2006-09-22 10:16:26 +0200 (Fri, 22 Sep 2006) | 10 lines
Fix %zd string formatting on Mac OS X so it prints negative numbers.
In addition to testing positive numbers, verify negative numbers work in configure.
In order to avoid compiler warnings on OS X 10.4, also change the order of the check
for the format character to use (PY_FORMAT_SIZE_T) in the sprintf format
for Py_ssize_t. This patch changes PY_FORMAT_SIZE_T from "" to "l" if it wasn't
defined at configure time. Need to verify the buildbot results.
Backport candidate (if everyone thinks this patch can't be improved).
........
r51972 | neal.norwitz | 2006-09-22 10:18:10 +0200 (Fri, 22 Sep 2006) | 7 lines
Bug #1557232: fix seg fault with def f((((x)))) and def f(((x),)).
These tests should be improved. Hopefully this fixes variations when
flipping back and forth between fpdef and fplist.
Backport candidate.
........
r51975 | neal.norwitz | 2006-09-22 10:47:23 +0200 (Fri, 22 Sep 2006) | 4 lines
Mostly revert this file to the same version as before. Only force setting
of PY_FORMAT_SIZE_T to "l" for Mac OSX. I don't know a better define
to use. This should get rid of the warnings on other platforms and Mac too.
........
r51986 | fred.drake | 2006-09-23 02:26:31 +0200 (Sat, 23 Sep 2006) | 1 line
add boilerplate "What's New" document so the docs will build
........
r51987 | neal.norwitz | 2006-09-23 06:11:38 +0200 (Sat, 23 Sep 2006) | 1 line
Remove extra semi-colons reported by Johnny Lee on python-dev. Backport if anyone cares.
........
r51989 | neal.norwitz | 2006-09-23 20:11:58 +0200 (Sat, 23 Sep 2006) | 1 line
SF Bug #1563963, add missing word and clean up first sentence
........
r51990 | brett.cannon | 2006-09-23 21:53:20 +0200 (Sat, 23 Sep 2006) | 3 lines
Make output on test_strptime() be more verbose in face of failure. This is in
hopes that more information will help debug the failing test on HPPA Ubuntu.
........
r51991 | georg.brandl | 2006-09-24 12:36:01 +0200 (Sun, 24 Sep 2006) | 2 lines
Fix webbrowser.BackgroundBrowser on Windows.
........
r51993 | georg.brandl | 2006-09-24 14:35:36 +0200 (Sun, 24 Sep 2006) | 4 lines
Fix a bug in the parser's future statement handling that led to "with"
not being recognized as a keyword after, e.g., this statement:
from __future__ import division, with_statement
........
r51995 | georg.brandl | 2006-09-24 14:50:24 +0200 (Sun, 24 Sep 2006) | 4 lines
Fix a bug in traceback.format_exception_only() that led to an error
being raised when print_exc() was called without an exception set.
In version 2.4, this printed "None"; restored that behavior.
........
r52000 | armin.rigo | 2006-09-25 17:16:26 +0200 (Mon, 25 Sep 2006) | 2 lines
Another crasher.
........
r52011 | brett.cannon | 2006-09-27 01:38:24 +0200 (Wed, 27 Sep 2006) | 2 lines
Make the error message for when the time data and format do not match clearer.
........
r52014 | andrew.kuchling | 2006-09-27 18:37:30 +0200 (Wed, 27 Sep 2006) | 1 line
Add news item for rev. 51815
........
r52018 | andrew.kuchling | 2006-09-27 21:23:05 +0200 (Wed, 27 Sep 2006) | 1 line
Make examples do error checking on Py_InitModule
........
r52032 | brett.cannon | 2006-09-29 00:10:14 +0200 (Fri, 29 Sep 2006) | 2 lines
Very minor grammatical fix in a comment.
........
r52048 | george.yoshida | 2006-09-30 07:14:02 +0200 (Sat, 30 Sep 2006) | 4 lines
SF bug #1567976 : fix typo
Will backport to 2.5.
........
r52051 | gregory.p.smith | 2006-09-30 08:08:20 +0200 (Sat, 30 Sep 2006) | 2 lines
wording change
........
r52053 | georg.brandl | 2006-09-30 09:24:48 +0200 (Sat, 30 Sep 2006) | 2 lines
Bug #1567375: a minor logical glitch in example description.
........
r52056 | georg.brandl | 2006-09-30 09:31:57 +0200 (Sat, 30 Sep 2006) | 3 lines
Bug #1565661: in webbrowser, split() the command for the default
GNOME browser in case it is a command with args.
........
r52058 | georg.brandl | 2006-09-30 10:43:30 +0200 (Sat, 30 Sep 2006) | 4 lines
Patch #1567691: super() and new.instancemethod() now don't accept
keyword arguments any more (previously they accepted them, but didn't
use them).
........
r52061 | georg.brandl | 2006-09-30 11:03:42 +0200 (Sat, 30 Sep 2006) | 3 lines
Bug #1566800: make sure that EnvironmentError can be called with any
number of arguments, as was the case in Python 2.4.
........
r52063 | georg.brandl | 2006-09-30 11:06:45 +0200 (Sat, 30 Sep 2006) | 2 lines
Bug #1566663: remove obsolete example from datetime docs.
........
r52065 | georg.brandl | 2006-09-30 11:13:21 +0200 (Sat, 30 Sep 2006) | 3 lines
Bug #1566602: correct failure of posixpath unittest when $HOME ends
with a slash.
........
r52068 | georg.brandl | 2006-09-30 12:58:01 +0200 (Sat, 30 Sep 2006) | 3 lines
Bug #1457823: cgi.(Sv)FormContentDict's constructor now takes
keep_blank_values and strict_parsing keyword arguments.
........
r52069 | georg.brandl | 2006-09-30 13:06:47 +0200 (Sat, 30 Sep 2006) | 3 lines
Bug #1560617: in pyclbr, return full module name not only for classes,
but also for functions.
........
r52072 | georg.brandl | 2006-09-30 13:17:34 +0200 (Sat, 30 Sep 2006) | 3 lines
Bug #1556784: allow format strings longer than 127 characters in
datetime's strftime function.
........
r52075 | georg.brandl | 2006-09-30 13:22:28 +0200 (Sat, 30 Sep 2006) | 3 lines
Bug #1446043: correctly raise a LookupError if an encoding name given
to encodings.search_function() contains a dot.
........
r52078 | georg.brandl | 2006-09-30 14:02:57 +0200 (Sat, 30 Sep 2006) | 3 lines
Bug #1546052: clarify that PyString_FromString(AndSize) copies the
string pointed to by its parameter.
........
r52080 | georg.brandl | 2006-09-30 14:16:03 +0200 (Sat, 30 Sep 2006) | 3 lines
Convert test_import to unittest.
........
r52083 | kurt.kaiser | 2006-10-01 23:16:45 +0200 (Sun, 01 Oct 2006) | 5 lines
Some syntax errors were being caught by tokenize during the tabnanny
check, resulting in obscure error messages. Do the syntax check
first. Bug 1562716, 1562719
........
r52084 | kurt.kaiser | 2006-10-01 23:54:37 +0200 (Sun, 01 Oct 2006) | 3 lines
Add comment explaining that error msgs may be due to user code when
running w/o subprocess.
........
r52086 | martin.v.loewis | 2006-10-02 16:55:51 +0200 (Mon, 02 Oct 2006) | 3 lines
Fix test for uintptr_t. Fixes #1568842.
Will backport.
........
r52089 | martin.v.loewis | 2006-10-02 17:20:37 +0200 (Mon, 02 Oct 2006) | 3 lines
Guard uintptr_t test with HAVE_STDINT_H, test for
stdint.h. Will backport.
........
r52100 | vinay.sajip | 2006-10-03 20:02:37 +0200 (Tue, 03 Oct 2006) | 1 line
Documentation omitted the additional parameter to LogRecord.__init__ which was added in 2.5. (See SF #1569622).
........
r52101 | vinay.sajip | 2006-10-03 20:20:26 +0200 (Tue, 03 Oct 2006) | 1 line
Documentation clarified to mention optional parameters.
........
r52102 | vinay.sajip | 2006-10-03 20:21:56 +0200 (Tue, 03 Oct 2006) | 1 line
Modified LogRecord.__init__ to make the func parameter optional. (See SF #1569622).
........
r52121 | brett.cannon | 2006-10-03 23:58:55 +0200 (Tue, 03 Oct 2006) | 2 lines
Fix minor typo in a comment.
........
r52123 | brett.cannon | 2006-10-04 01:23:14 +0200 (Wed, 04 Oct 2006) | 2 lines
Convert test_imp over to unittest.
........
r52128 | barry.warsaw | 2006-10-04 04:06:36 +0200 (Wed, 04 Oct 2006) | 3 lines
decode_rfc2231(): As Christian Robottom Reis points out, it makes no sense to
test for parts > 3 when we use .split(..., 2).
........
r52129 | jeremy.hylton | 2006-10-04 04:24:52 +0200 (Wed, 04 Oct 2006) | 9 lines
Fix for SF bug 1569998: break permitted inside try.
The compiler was checking that there was something on the fblock
stack, but not that there was a loop on the stack. Fixed that and
added a test for the specific syntax error.
Bug fix candidate.
........
r52130 | martin.v.loewis | 2006-10-04 07:47:34 +0200 (Wed, 04 Oct 2006) | 4 lines
Fix integer negation and absolute value to not rely
on undefined behaviour of the C compiler anymore.
Will backport to 2.5 and 2.4.
........
r52135 | martin.v.loewis | 2006-10-04 11:21:20 +0200 (Wed, 04 Oct 2006) | 1 line
Forward port r52134: Add uuids for 2.4.4.
........
r52137 | armin.rigo | 2006-10-04 12:23:57 +0200 (Wed, 04 Oct 2006) | 3 lines
Compilation problem caused by conflicting typedefs for uint32_t
(unsigned long vs. unsigned int).
........
r52139 | armin.rigo | 2006-10-04 14:17:45 +0200 (Wed, 04 Oct 2006) | 23 lines
Forward-port of r52136,52138: a review of overflow-detecting code.
* unified the way intobject, longobject and mystrtoul handle
values around -sys.maxint-1.
* in general, trying to entirely avoid overflows in any computation
involving signed ints or longs is extremely involved. Fixed a few
simple cases where a compiler might be too clever (but that's all
guesswork).
* more overflow checks against bad data in marshal.c.
* 2.5 specific: fixed a number of places that were still confusing int
and Py_ssize_t. Some of them could potentially have caused
"real-world" breakage.
* list.pop(x): fixing overflow issues on x was messy. I just reverted
to PyArg_ParseTuple("n"), which does the right thing. (An obscure
test was trying to give a Decimal to list.pop()... doesn't make
sense any more IMHO)
* trying to write a few tests...
........
r52147 | andrew.kuchling | 2006-10-04 15:42:43 +0200 (Wed, 04 Oct 2006) | 6 lines
Cause a PyObject_Malloc() failure to trigger a MemoryError, and then
add 'if (PyErr_Occurred())' checks to various places so that NULL is
returned properly.
2.4 backport candidate.
........
r52148 | martin.v.loewis | 2006-10-04 17:25:28 +0200 (Wed, 04 Oct 2006) | 1 line
Add MSVC8 project files to create wininst-8.exe.
........
r52196 | brett.cannon | 2006-10-06 00:02:31 +0200 (Fri, 06 Oct 2006) | 7 lines
Clarify what "re-initialization" means for init_builtin() and init_dynamic().
Also remove warning about re-initialization as possibly raising an exception as
both call _PyImport_FindExtension() which pulls any module that was already
imported from the Python process' extension cache and just copies the __dict__
into the module stored in sys.modules.
........
r52200 | fred.drake | 2006-10-06 02:03:45 +0200 (Fri, 06 Oct 2006) | 3 lines
- update links
- remove Sleepycat name now that they have been bought
........
r52204 | andrew.kuchling | 2006-10-06 12:41:01 +0200 (Fri, 06 Oct 2006) | 1 line
Case fix
........
r52208 | georg.brandl | 2006-10-06 14:46:08 +0200 (Fri, 06 Oct 2006) | 3 lines
Fix name.
........
r52211 | andrew.kuchling | 2006-10-06 15:18:26 +0200 (Fri, 06 Oct 2006) | 1 line
[Bug #1545341] Allow 'classifier' parameter to be a tuple as well as a list. Will backport.
........
r52212 | armin.rigo | 2006-10-06 18:33:22 +0200 (Fri, 06 Oct 2006) | 4 lines
A very minor bug fix: this code looks like it is designed to accept
any hue value and do the modulo itself, except it doesn't quite do
it in all cases. At least, the "cannot get here" comment was wrong.
........
r52213 | andrew.kuchling | 2006-10-06 20:51:55 +0200 (Fri, 06 Oct 2006) | 1 line
Comment grammar
........
r52218 | skip.montanaro | 2006-10-07 13:05:02 +0200 (Sat, 07 Oct 2006) | 6 lines
Note that the excel_tab class is registered as the "excel-tab" dialect.
Fixes 1572471. Make a similar change for the excel class and clean up
references to the Dialects and Formatting Parameters section in a few
places.
........
r52221 | georg.brandl | 2006-10-08 09:11:54 +0200 (Sun, 08 Oct 2006) | 3 lines
Add missing NEWS entry for rev. 52129.
........
r52223 | hyeshik.chang | 2006-10-08 15:48:34 +0200 (Sun, 08 Oct 2006) | 3 lines
Bug #1572832: fix a bug in ISO-2022 codecs which may cause segfault
when encoding non-BMP unicode characters. (Submitted by Ray Chason)
........
r52227 | ronald.oussoren | 2006-10-08 19:37:58 +0200 (Sun, 08 Oct 2006) | 4 lines
Add version number to the link to the python documentation in
/Developer/Documentation/Python, better for users that install multiple versions
of python.
........
r52229 | ronald.oussoren | 2006-10-08 19:40:02 +0200 (Sun, 08 Oct 2006) | 2 lines
Fix for bug #1570284
........
r52233 | ronald.oussoren | 2006-10-08 19:49:52 +0200 (Sun, 08 Oct 2006) | 6 lines
MacOSX: distutils changes the values of BASECFLAGS and LDFLAGS when using a
universal build of python on OSX 10.3 to ensure that those flags can be used
to compile code (the universal build uses compiler flags that aren't supported
on 10.3). This patch gives the same treatment to CFLAGS, PY_CFLAGS and
BLDSHARED.
........
r52236 | ronald.oussoren | 2006-10-08 19:51:46 +0200 (Sun, 08 Oct 2006) | 5 lines
MacOSX: The universal build requires that users have the MacOSX10.4u SDK
installed to build extensions. This patch makes distutils emit a warning when
the compiler should use an SDK but that SDK is not installed, hopefully reducing
some confusion.
........
r52238 | ronald.oussoren | 2006-10-08 20:18:26 +0200 (Sun, 08 Oct 2006) | 3 lines
MacOSX: add more logic to recognize the correct startup file to patch to the
shell profile patching post-install script.
........
r52242 | andrew.kuchling | 2006-10-09 19:10:12 +0200 (Mon, 09 Oct 2006) | 1 line
Add news item for rev. 52211 change
........
r52245 | andrew.kuchling | 2006-10-09 20:05:19 +0200 (Mon, 09 Oct 2006) | 1 line
Fix wording in comment
........
r52251 | georg.brandl | 2006-10-09 21:03:06 +0200 (Mon, 09 Oct 2006) | 2 lines
Patch #1572724: fix typo ('=' instead of '==') in _msi.c.
........
r52255 | barry.warsaw | 2006-10-09 21:43:24 +0200 (Mon, 09 Oct 2006) | 2 lines
List gc.get_count() in the module docstring.
........
r52257 | martin.v.loewis | 2006-10-09 22:44:25 +0200 (Mon, 09 Oct 2006) | 1 line
Bug #1565150: Fix subsecond processing for os.utime on Windows.
........
r52268 | ronald.oussoren | 2006-10-10 09:55:06 +0200 (Tue, 10 Oct 2006) | 2 lines
MacOSX: fix permission problem in the generated installer
........
r52293 | georg.brandl | 2006-10-12 09:38:04 +0200 (Thu, 12 Oct 2006) | 2 lines
Bug #1575746: fix typo in property() docs.
........
r52295 | georg.brandl | 2006-10-12 09:57:21 +0200 (Thu, 12 Oct 2006) | 3 lines
Bug #813342: Start the IDLE subprocess with -Qnew if the parent
is started with that option.
........
r52297 | georg.brandl | 2006-10-12 10:22:53 +0200 (Thu, 12 Oct 2006) | 2 lines
Bug #1565919: document set types in the Language Reference.
........
r52299 | georg.brandl | 2006-10-12 11:20:33 +0200 (Thu, 12 Oct 2006) | 3 lines
Bug #1550524: better heuristics to find correct class definition
in inspect.findsource().
........
r52301 | georg.brandl | 2006-10-12 11:47:12 +0200 (Thu, 12 Oct 2006) | 4 lines
Bug #1548891: The cStringIO.StringIO() constructor now encodes unicode
arguments with the system default encoding just like the write()
method does, instead of converting it to a raw buffer.
........
r52303 | georg.brandl | 2006-10-12 13:14:40 +0200 (Thu, 12 Oct 2006) | 2 lines
Bug #1546628: add a note about urlparse.urljoin() and absolute paths.
........
r52305 | georg.brandl | 2006-10-12 13:27:59 +0200 (Thu, 12 Oct 2006) | 3 lines
Bug #1545497: when given an explicit base, int() did ignore NULs
embedded in the string to convert.
........
r52307 | georg.brandl | 2006-10-12 13:41:11 +0200 (Thu, 12 Oct 2006) | 3 lines
Add a note to fpectl docs that it's not built by default
(bug #1556261).
........
r52309 | georg.brandl | 2006-10-12 13:46:57 +0200 (Thu, 12 Oct 2006) | 3 lines
Bug #1560114: the Mac filesystem does have accurate information
about the case of filenames.
........
r52311 | georg.brandl | 2006-10-12 13:59:27 +0200 (Thu, 12 Oct 2006) | 2 lines
Small grammar fix, thanks Sjoerd.
........
r52313 | georg.brandl | 2006-10-12 14:03:07 +0200 (Thu, 12 Oct 2006) | 2 lines
Fix tarfile depending on buggy int('1\0', base) behavior.
........
r52315 | georg.brandl | 2006-10-12 14:33:07 +0200 (Thu, 12 Oct 2006) | 2 lines
Bug #1283491: follow docstring convention wrt. keyword-able args in sum().
........
r52316 | georg.brandl | 2006-10-12 15:08:16 +0200 (Thu, 12 Oct 2006) | 3 lines
Bug #1560179: speed up posixpath.(dir|base)name
........
r52327 | brett.cannon | 2006-10-14 08:36:45 +0200 (Sat, 14 Oct 2006) | 3 lines
Clean up the language of a sentence relating to the connect() function and
user-defined datatypes.
........
r52332 | neal.norwitz | 2006-10-14 23:33:38 +0200 (Sat, 14 Oct 2006) | 3 lines
Update the peephole optimizer to remove more dead code (jumps after returns)
and inline jumps to returns.
........
r52333 | martin.v.loewis | 2006-10-15 09:54:40 +0200 (Sun, 15 Oct 2006) | 4 lines
Patch #1576954: Update VC6 build directory; remove redundant
files in VC7. Will backport to 2.5.
........
r52335 | martin.v.loewis | 2006-10-15 10:43:33 +0200 (Sun, 15 Oct 2006) | 1 line
Patch #1576166: Support os.utime for directories on Windows NT+.
........
r52336 | martin.v.loewis | 2006-10-15 10:51:22 +0200 (Sun, 15 Oct 2006) | 2 lines
Patch #1577551: Add ctypes and ET build support for VC6.
Will backport to 2.5.
........
r52338 | martin.v.loewis | 2006-10-15 11:35:51 +0200 (Sun, 15 Oct 2006) | 1 line
Loosen the test for equal time stamps.
........
r52339 | martin.v.loewis | 2006-10-15 11:43:39 +0200 (Sun, 15 Oct 2006) | 2 lines
Bug #1567666: Emulate GetFileAttributesExA for Win95.
Will backport to 2.5.
........
r52341 | martin.v.loewis | 2006-10-15 13:02:07 +0200 (Sun, 15 Oct 2006) | 2 lines
Round to int, because some systems support sub-second time stamps in stat, but not in utime.
Also be consistent with modifying only mtime, not atime.
........
r52342 | martin.v.loewis | 2006-10-15 13:57:40 +0200 (Sun, 15 Oct 2006) | 2 lines
Set the eol-style for project files to "CRLF".
........
r52343 | martin.v.loewis | 2006-10-15 13:59:56 +0200 (Sun, 15 Oct 2006) | 3 lines
Drop binary property on dsp files, set eol-style
to CRLF instead.
........
r52344 | martin.v.loewis | 2006-10-15 14:01:43 +0200 (Sun, 15 Oct 2006) | 2 lines
Remove binary property, set eol-style to CRLF instead.
........
r52346 | martin.v.loewis | 2006-10-15 16:30:38 +0200 (Sun, 15 Oct 2006) | 2 lines
Mention the bdist_msi module. Will backport to 2.5.
........
r52354 | brett.cannon | 2006-10-16 05:09:52 +0200 (Mon, 16 Oct 2006) | 3 lines
Fix turtle so that you can launch the demo2 function on its own instead of only
when the module is launched as a script.
........
r52356 | martin.v.loewis | 2006-10-17 17:18:06 +0200 (Tue, 17 Oct 2006) | 2 lines
Patch #1457736: Update VC6 to use current PCbuild settings.
Will backport to 2.5.
........
r52360 | martin.v.loewis | 2006-10-17 20:09:55 +0200 (Tue, 17 Oct 2006) | 2 lines
Remove obsolete file. Will backport.
........
r52363 | martin.v.loewis | 2006-10-17 20:59:23 +0200 (Tue, 17 Oct 2006) | 4 lines
Forward-port r52358:
- Bug #1578513: Cross compilation was broken by a change to configure.
Repair so that it's back to how it was in 2.4.3.
........
r52365 | thomas.heller | 2006-10-17 21:30:48 +0200 (Tue, 17 Oct 2006) | 6 lines
ctypes callback functions only support 'fundamental' result types.
Check this and raise an error when something else is used - before
this change ctypes would hang or crash when such a callback was
called. This is a partial fix for #1574584.
Will backport to release25-maint.
........
r52377 | tim.peters | 2006-10-18 07:06:06 +0200 (Wed, 18 Oct 2006) | 2 lines
newIobject(): repaired incorrect cast to quiet MSVC warning.
........
r52378 | tim.peters | 2006-10-18 07:09:12 +0200 (Wed, 18 Oct 2006) | 2 lines
Whitespace normalization.
........
r52379 | tim.peters | 2006-10-18 07:10:28 +0200 (Wed, 18 Oct 2006) | 2 lines
Add missing svn:eol-style to text files.
........
r52387 | martin.v.loewis | 2006-10-19 12:58:46 +0200 (Thu, 19 Oct 2006) | 3 lines
Add check for the PyArg_ParseTuple format, and declare
it if it is supported.
........
r52388 | martin.v.loewis | 2006-10-19 13:00:37 +0200 (Thu, 19 Oct 2006) | 3 lines
Fix various minor errors in passing arguments to
PyArg_ParseTuple.
........
r52389 | martin.v.loewis | 2006-10-19 18:01:37 +0200 (Thu, 19 Oct 2006) | 2 lines
Restore CFLAGS after checking for __attribute__
........
r52390 | andrew.kuchling | 2006-10-19 23:55:55 +0200 (Thu, 19 Oct 2006) | 1 line
[Bug #1576348] Fix typo in example
........
r52414 | walter.doerwald | 2006-10-22 10:59:41 +0200 (Sun, 22 Oct 2006) | 2 lines
Port test___future__ to unittest.
........
r52415 | ronald.oussoren | 2006-10-22 12:45:18 +0200 (Sun, 22 Oct 2006) | 3 lines
Patch #1580674: with this patch os.readlink uses the filesystem encoding to
decode unicode objects and returns a unicode object when the argument is one.
........
r52416 | martin.v.loewis | 2006-10-22 12:46:18 +0200 (Sun, 22 Oct 2006) | 3 lines
Patch #1580872: Remove duplicate declaration of PyCallable_Check.
Will backport to 2.5.
........
r52418 | martin.v.loewis | 2006-10-22 12:55:15 +0200 (Sun, 22 Oct 2006) | 4 lines
- Patch #1560695: Add .note.GNU-stack to ctypes' sysv.S so that
ctypes isn't considered as requiring executable stacks.
Will backport to 2.5.
........
r52420 | martin.v.loewis | 2006-10-22 15:45:13 +0200 (Sun, 22 Oct 2006) | 3 lines
Remove passwd.adjunct.byname from list of maps
for test_nis. Will backport to 2.5.
........
r52431 | georg.brandl | 2006-10-24 18:54:16 +0200 (Tue, 24 Oct 2006) | 2 lines
Patch [ 1583506 ] tarfile.py: 100-char filenames are truncated
........
r52446 | andrew.kuchling | 2006-10-26 21:10:46 +0200 (Thu, 26 Oct 2006) | 1 line
[Bug #1579796] Wrong syntax for PyDateTime_IMPORT in documentation. Reported by David Faure.
........
r52449 | andrew.kuchling | 2006-10-26 21:16:46 +0200 (Thu, 26 Oct 2006) | 1 line
Typo fix
........
r52452 | martin.v.loewis | 2006-10-27 08:16:31 +0200 (Fri, 27 Oct 2006) | 3 lines
Patch #1549049: Rewrite type conversion in structmember.
Fixes #1545696 and #1566140. Will backport to 2.5.
........
r52454 | martin.v.loewis | 2006-10-27 08:42:27 +0200 (Fri, 27 Oct 2006) | 2 lines
Check for values.h. Will backport.
........
r52456 | martin.v.loewis | 2006-10-27 09:06:52 +0200 (Fri, 27 Oct 2006) | 2 lines
Get DBL_MAX from float.h not values.h. Will backport.
........
r52458 | martin.v.loewis | 2006-10-27 09:13:28 +0200 (Fri, 27 Oct 2006) | 2 lines
Patch #1567274: Support SMTP over TLS.
........
r52459 | andrew.kuchling | 2006-10-27 13:33:29 +0200 (Fri, 27 Oct 2006) | 1 line
Set svn:keywords property
........
r52460 | andrew.kuchling | 2006-10-27 13:36:41 +0200 (Fri, 27 Oct 2006) | 1 line
Add item
........
r52461 | andrew.kuchling | 2006-10-27 13:37:01 +0200 (Fri, 27 Oct 2006) | 1 line
Some wording changes and markup fixes
........
r52462 | andrew.kuchling | 2006-10-27 14:18:38 +0200 (Fri, 27 Oct 2006) | 1 line
[Bug #1585690] Note that line_num was added in Python 2.5
........
r52464 | andrew.kuchling | 2006-10-27 14:50:38 +0200 (Fri, 27 Oct 2006) | 1 line
[Bug #1583946] Reword description of server and issuer
........
r52466 | andrew.kuchling | 2006-10-27 15:06:25 +0200 (Fri, 27 Oct 2006) | 1 line
[Bug #1562583] Mention the set_reuse_addr() method
........
r52469 | andrew.kuchling | 2006-10-27 15:22:46 +0200 (Fri, 27 Oct 2006) | 4 lines
[Bug #1542016] Report PCALL_POP value. This makes the return value of sys.callstats() match its docstring.
Backport candidate. Though it's an API change, this is a pretty obscure
portion of the API.
........
r52473 | andrew.kuchling | 2006-10-27 16:53:41 +0200 (Fri, 27 Oct 2006) | 1 line
Point users to the subprocess module in the docs for os.system, os.spawn*, os.popen2, and the popen2 and commands modules
........
r52476 | andrew.kuchling | 2006-10-27 18:39:10 +0200 (Fri, 27 Oct 2006) | 1 line
[Bug #1576241] Let functools.wraps work with built-in functions
........
r52478 | andrew.kuchling | 2006-10-27 18:55:34 +0200 (Fri, 27 Oct 2006) | 1 line
[Bug #1575506] The _singlefileMailbox class was using the wrong file object in its flush() method, causing an error
........
r52480 | andrew.kuchling | 2006-10-27 19:06:16 +0200 (Fri, 27 Oct 2006) | 1 line
Clarify docstring
........
r52481 | andrew.kuchling | 2006-10-27 19:11:23 +0200 (Fri, 27 Oct 2006) | 5 lines
[Patch #1574068 by Scott Dial] urllib and urllib2 were using
base64.encodestring() for encoding authentication data.
encodestring() can include newlines for very long input, which
produced broken HTTP headers.
........
r52483 | andrew.kuchling | 2006-10-27 20:13:46 +0200 (Fri, 27 Oct 2006) | 1 line
Check db_setup_debug for a few print statements; change sqlite_setup_debug to False
........
r52484 | andrew.kuchling | 2006-10-27 20:15:02 +0200 (Fri, 27 Oct 2006) | 1 line
[Patch #1503717] Tiny patch from Chris AtLee to stop a lengthy line from being printed
........
r52485 | thomas.heller | 2006-10-27 20:31:36 +0200 (Fri, 27 Oct 2006) | 5 lines
WindowsError.str should display the windows error code,
not the posix error code; with test.
Fixes #1576174.
Will backport to release25-maint.
........
r52487 | thomas.heller | 2006-10-27 21:05:53 +0200 (Fri, 27 Oct 2006) | 4 lines
Modulefinder now handles absolute and relative imports, including
tests.
Will backport to release25-maint.
........
r52488 | georg.brandl | 2006-10-27 22:39:43 +0200 (Fri, 27 Oct 2006) | 2 lines
Patch #1552024: add decorator support to unparse.py demo script.
........
r52492 | walter.doerwald | 2006-10-28 12:47:12 +0200 (Sat, 28 Oct 2006) | 2 lines
Port test_bufio to unittest.
........
r52493 | georg.brandl | 2006-10-28 15:10:17 +0200 (Sat, 28 Oct 2006) | 6 lines
Convert test_global, test_scope and test_grammar to unittest.
I tried to enclose all tests which must be run at the toplevel
(instead of inside a method) in exec statements.
........
r52494 | georg.brandl | 2006-10-28 15:11:41 +0200 (Sat, 28 Oct 2006) | 3 lines
Update outstanding bugs test file.
........
r52495 | georg.brandl | 2006-10-28 15:51:49 +0200 (Sat, 28 Oct 2006) | 3 lines
Convert test_math to unittest.
........
r52496 | georg.brandl | 2006-10-28 15:56:58 +0200 (Sat, 28 Oct 2006) | 3 lines
Convert test_opcodes to unittest.
........
r52497 | georg.brandl | 2006-10-28 18:04:04 +0200 (Sat, 28 Oct 2006) | 2 lines
Fix nth() itertool recipe.
........
r52500 | georg.brandl | 2006-10-28 22:25:09 +0200 (Sat, 28 Oct 2006) | 2 lines
make test_grammar pass with python -O
........
r52501 | neal.norwitz | 2006-10-28 23:15:30 +0200 (Sat, 28 Oct 2006) | 6 lines
Add some asserts. In sysmodule, I think these were to try to silence
some warnings from Klokwork. They verify the assumptions of the format
of svn version output.
The assert in the thread module helped debug a problem on HP-UX.
........
r52502 | neal.norwitz | 2006-10-28 23:16:54 +0200 (Sat, 28 Oct 2006) | 5 lines
Fix warnings with HP's C compiler. It doesn't recognize that infinite
loops are, um, infinite. These conditions should not be able to happen.
Will backport.
........
r52503 | neal.norwitz | 2006-10-28 23:17:51 +0200 (Sat, 28 Oct 2006) | 5 lines
Fix crash in test on HP-UX. Apparently, it's not possible to delete a lock if
it's held (even by the current thread).
Will backport.
........
r52504 | neal.norwitz | 2006-10-28 23:19:07 +0200 (Sat, 28 Oct 2006) | 6 lines
Fix bug #1565514, SystemError not raised on too many nested blocks.
It seems like this should be a different error than SystemError, but
I don't have any great ideas and SystemError was raised in 2.4 and earlier.
Will backport.
........
r52505 | neal.norwitz | 2006-10-28 23:20:12 +0200 (Sat, 28 Oct 2006) | 4 lines
Prevent crash if alloc of garbage fails. Found by Typo.pl.
Will backport.
........
r52506 | neal.norwitz | 2006-10-28 23:21:00 +0200 (Sat, 28 Oct 2006) | 4 lines
Don't inline Py_ADDRESS_IN_RANGE with gcc 4+ either.
Will backport.
........
r52513 | neal.norwitz | 2006-10-28 23:56:49 +0200 (Sat, 28 Oct 2006) | 2 lines
Fix test_modulefinder so it doesn't fail when run after test_distutils.
........
r52514 | neal.norwitz | 2006-10-29 00:12:26 +0200 (Sun, 29 Oct 2006) | 4 lines
From SF 1557890, fix problem of using wrong type in example.
Will backport.
........
r52517 | georg.brandl | 2006-10-29 09:39:22 +0100 (Sun, 29 Oct 2006) | 4 lines
Fix codecs.EncodedFile which did not use file_encoding in 2.5.0, and
fix all codecs file wrappers to work correctly with the "with"
statement (bug #1586513).
........
r52519 | georg.brandl | 2006-10-29 09:47:08 +0100 (Sun, 29 Oct 2006) | 3 lines
Clean up a leftover from old listcomp generation code.
........
r52520 | georg.brandl | 2006-10-29 09:53:06 +0100 (Sun, 29 Oct 2006) | 4 lines
Bug #1586448: the compiler module now emits the same bytecode for
list comprehensions as the builtin compiler, using the LIST_APPEND
opcode.
........
r52521 | georg.brandl | 2006-10-29 10:01:01 +0100 (Sun, 29 Oct 2006) | 3 lines
Remove trailing comma.
........
r52522 | georg.brandl | 2006-10-29 10:05:04 +0100 (Sun, 29 Oct 2006) | 3 lines
Bug #1357915: allow all sequence types for shell arguments in
subprocess.
........
r52524 | georg.brandl | 2006-10-29 10:16:12 +0100 (Sun, 29 Oct 2006) | 3 lines
Patch #1583880: fix tarfile's problems with long names and posix/
GNU modes.
........
r52526 | georg.brandl | 2006-10-29 10:18:00 +0100 (Sun, 29 Oct 2006) | 3 lines
Test assert if __debug__ is true.
........
r52527 | georg.brandl | 2006-10-29 10:32:16 +0100 (Sun, 29 Oct 2006) | 2 lines
Fix the new EncodedFile test to work with big endian platforms.
........
r52529 | georg.brandl | 2006-10-29 15:39:09 +0100 (Sun, 29 Oct 2006) | 2 lines
Bug #1586613: fix zlib and bz2 codecs' incremental en/decoders.
........
r52532 | georg.brandl | 2006-10-29 19:01:08 +0100 (Sun, 29 Oct 2006) | 2 lines
Bug #1586773: extend hashlib docstring.
........
r52534 | neal.norwitz | 2006-10-29 19:30:10 +0100 (Sun, 29 Oct 2006) | 4 lines
Update comments, remove commented out code.
Move assembler structure next to assembler code to make it easier to
move it to a separate file.
........
r52535 | georg.brandl | 2006-10-29 19:31:42 +0100 (Sun, 29 Oct 2006) | 3 lines
Bug #1576657: when setting a KeyError for a tuple key, make sure that
the tuple isn't used as the "exception arguments tuple".
........
r52537 | georg.brandl | 2006-10-29 20:13:40 +0100 (Sun, 29 Oct 2006) | 3 lines
Convert test_mmap to unittest.
........
r52538 | georg.brandl | 2006-10-29 20:20:45 +0100 (Sun, 29 Oct 2006) | 3 lines
Convert test_poll to unittest.
........
r52539 | georg.brandl | 2006-10-29 20:24:43 +0100 (Sun, 29 Oct 2006) | 3 lines
Convert test_nis to unittest.
........
r52540 | georg.brandl | 2006-10-29 20:35:03 +0100 (Sun, 29 Oct 2006) | 3 lines
Convert test_types to unittest.
........
r52541 | georg.brandl | 2006-10-29 20:51:16 +0100 (Sun, 29 Oct 2006) | 3 lines
Convert test_cookie to unittest.
........
r52542 | georg.brandl | 2006-10-29 21:09:12 +0100 (Sun, 29 Oct 2006) | 3 lines
Convert test_cgi to unittest.
........
r52543 | georg.brandl | 2006-10-29 21:24:01 +0100 (Sun, 29 Oct 2006) | 3 lines
Completely convert test_httplib to unittest.
........
r52544 | georg.brandl | 2006-10-29 21:28:26 +0100 (Sun, 29 Oct 2006) | 2 lines
Convert test_MimeWriter to unittest.
........
r52545 | georg.brandl | 2006-10-29 21:31:17 +0100 (Sun, 29 Oct 2006) | 3 lines
Convert test_openpty to unittest.
........
r52546 | georg.brandl | 2006-10-29 21:35:12 +0100 (Sun, 29 Oct 2006) | 3 lines
Remove leftover test output file.
........
r52547 | georg.brandl | 2006-10-29 22:54:18 +0100 (Sun, 29 Oct 2006) | 3 lines
Move the check for openpty to the beginning.
........
r52548 | walter.doerwald | 2006-10-29 23:06:28 +0100 (Sun, 29 Oct 2006) | 2 lines
Add tests for basic argument errors.
........
r52549 | walter.doerwald | 2006-10-30 00:02:27 +0100 (Mon, 30 Oct 2006) | 3 lines
Add tests for incremental codecs with an errors
argument.
........
r52550 | neal.norwitz | 2006-10-30 00:39:03 +0100 (Mon, 30 Oct 2006) | 1 line
Fix refleak
........
r52552 | neal.norwitz | 2006-10-30 00:58:36 +0100 (Mon, 30 Oct 2006) | 1 line
I'm assuming this is correct, it fixes the tests so they pass again
........
r52555 | vinay.sajip | 2006-10-31 18:32:37 +0100 (Tue, 31 Oct 2006) | 1 line
Change to improve speed of _fixupChildren
........
r52556 | vinay.sajip | 2006-10-31 18:34:31 +0100 (Tue, 31 Oct 2006) | 1 line
Added relativeCreated to Formatter doc (has been in the system for a long time - was unaccountably left out of the docs and not noticed until now).
........
r52588 | thomas.heller | 2006-11-02 20:48:24 +0100 (Thu, 02 Nov 2006) | 5 lines
Replace the XXX marker in the 'Arrays and pointers' reference manual
section with a link to the tutorial sections.
Will backport to release25-maint.
........
r52592 | thomas.heller | 2006-11-02 21:22:29 +0100 (Thu, 02 Nov 2006) | 6 lines
Fix a code example by adding a missing import.
Fixes #1557890.
Will backport to release25-maint.
........
r52598 | tim.peters | 2006-11-03 03:32:46 +0100 (Fri, 03 Nov 2006) | 2 lines
Whitespace normalization.
........
r52619 | martin.v.loewis | 2006-11-04 19:14:06 +0100 (Sat, 04 Nov 2006) | 4 lines
- Patch #1060577: Extract list of RPM files from spec file in
bdist_rpm
Will backport to 2.5.
........
r52621 | neal.norwitz | 2006-11-04 20:25:22 +0100 (Sat, 04 Nov 2006) | 4 lines
Bug #1588287: fix invalid assertion for `1,2` in debug builds.
Will backport
........
r52630 | andrew.kuchling | 2006-11-05 22:04:37 +0100 (Sun, 05 Nov 2006) | 1 line
Update link
........
r52631 | skip.montanaro | 2006-11-06 15:34:52 +0100 (Mon, 06 Nov 2006) | 1 line
note that user can control directory location even if default dir is used
........
r52644 | ronald.oussoren | 2006-11-07 16:53:38 +0100 (Tue, 07 Nov 2006) | 2 lines
Fix a number of typos in strings and comments (sf#1589070)
........
r52647 | ronald.oussoren | 2006-11-07 17:00:34 +0100 (Tue, 07 Nov 2006) | 2 lines
Whitespace changes to make the source more compliant with PEP8 (SF#1589070)
........
r52651 | thomas.heller | 2006-11-07 19:01:18 +0100 (Tue, 07 Nov 2006) | 3 lines
Fix markup.
Will backport to release25-maint.
........
r52653 | thomas.heller | 2006-11-07 19:20:47 +0100 (Tue, 07 Nov 2006) | 3 lines
Fix grammatical error as well.
Will backport to release25-maint.
........
r52657 | andrew.kuchling | 2006-11-07 21:39:16 +0100 (Tue, 07 Nov 2006) | 1 line
Add missing word
........
r52662 | martin.v.loewis | 2006-11-08 07:46:37 +0100 (Wed, 08 Nov 2006) | 4 lines
Correctly forward exception in instance_contains().
Fixes #1591996. Patch contributed by Neal Norwitz.
Will backport.
........
r52664 | martin.v.loewis | 2006-11-08 07:48:36 +0100 (Wed, 08 Nov 2006) | 2 lines
News entry for 52662.
........
r52665 | martin.v.loewis | 2006-11-08 08:35:55 +0100 (Wed, 08 Nov 2006) | 2 lines
Patch #1351744: Add askyesnocancel helper for tkMessageBox.
........
r52666 | georg.brandl | 2006-11-08 08:45:59 +0100 (Wed, 08 Nov 2006) | 2 lines
Patch #1592072: fix docs for return value of PyErr_CheckSignals.
........
r52668 | georg.brandl | 2006-11-08 11:04:29 +0100 (Wed, 08 Nov 2006) | 3 lines
Bug #1592533: rename variable in heapq doc example, to avoid shadowing
"sorted".
........
r52671 | andrew.kuchling | 2006-11-08 14:35:34 +0100 (Wed, 08 Nov 2006) | 1 line
Add section on the functional module
........
r52672 | andrew.kuchling | 2006-11-08 15:14:30 +0100 (Wed, 08 Nov 2006) | 1 line
Add section on operator module; make a few edits
........
r52673 | andrew.kuchling | 2006-11-08 15:24:03 +0100 (Wed, 08 Nov 2006) | 1 line
Add table of contents; this required fixing a few headings. Some more smalle edits.
........
r52674 | andrew.kuchling | 2006-11-08 15:30:14 +0100 (Wed, 08 Nov 2006) | 1 line
More edits
........
r52686 | martin.v.loewis | 2006-11-09 12:06:03 +0100 (Thu, 09 Nov 2006) | 3 lines
Patch #838546: Make terminal become controlling in pty.fork().
Will backport to 2.5.
........
r52688 | martin.v.loewis | 2006-11-09 12:27:32 +0100 (Thu, 09 Nov 2006) | 2 lines
Patch #1592250: Add elidge argument to Tkinter.Text.search.
........
r52690 | andrew.kuchling | 2006-11-09 14:27:07 +0100 (Thu, 09 Nov 2006) | 7 lines
[Bug #1569790] mailbox.Maildir.get_folder() loses factory information
Both the Maildir and MH classes had this bug; the patch fixes both classes
and adds a test.
Will backport to 25-maint.
........
r52692 | andrew.kuchling | 2006-11-09 14:51:14 +0100 (Thu, 09 Nov 2006) | 1 line
[Patch #1514544 by David Watson] use fsync() to ensure data is really on disk
........
r52695 | walter.doerwald | 2006-11-09 17:23:26 +0100 (Thu, 09 Nov 2006) | 2 lines
Replace C++ comment with C comment (fixes SF bug #1593525).
........
r52712 | andrew.kuchling | 2006-11-09 22:16:46 +0100 (Thu, 09 Nov 2006) | 11 lines
[Patch #1514543] mailbox (Maildir): avoid losing messages on name clash
Two changes:
Where possible, use link()/remove() to move files into a directory; this
makes it easier to avoid overwriting an existing file.
Use _create_carefully() to create files in tmp/, which uses O_EXCL.
Backport candidate.
........
r52716 | phillip.eby | 2006-11-10 01:33:36 +0100 (Fri, 10 Nov 2006) | 4 lines
Fix SF#1566719: not creating site-packages (or other target directory) when
installing .egg-info for a project that contains no modules or packages,
while using --root (as in bdist_rpm).
........
r52719 | andrew.kuchling | 2006-11-10 14:14:01 +0100 (Fri, 10 Nov 2006) | 1 line
Reword entry
........
r52725 | andrew.kuchling | 2006-11-10 15:39:01 +0100 (Fri, 10 Nov 2006) | 1 line
[Feature request #1542920] Link to wsgi.org
........
r52731 | georg.brandl | 2006-11-11 19:29:11 +0100 (Sat, 11 Nov 2006) | 2 lines
Bug #1594742: wrong word in stringobject doc.
........
r52733 | georg.brandl | 2006-11-11 19:32:47 +0100 (Sat, 11 Nov 2006) | 2 lines
Bug #1594758: wording improvement for dict.update() docs.
........
r52736 | martin.v.loewis | 2006-11-12 11:32:47 +0100 (Sun, 12 Nov 2006) | 3 lines
Patch #1065257: Support passing open files as body in
HTTPConnection.request().
........
r52737 | martin.v.loewis | 2006-11-12 11:41:39 +0100 (Sun, 12 Nov 2006) | 2 lines
Patch #1355023: support whence argument for GzipFile.seek.
........
r52738 | martin.v.loewis | 2006-11-12 19:24:26 +0100 (Sun, 12 Nov 2006) | 2 lines
Bug #1067760: Deprecate passing floats to file.seek.
........
r52739 | martin.v.loewis | 2006-11-12 19:48:13 +0100 (Sun, 12 Nov 2006) | 3 lines
Patch #1359217: Ignore 2xx response before 150 response.
Will backport to 2.5.
........
r52741 | martin.v.loewis | 2006-11-12 19:56:03 +0100 (Sun, 12 Nov 2006) | 4 lines
Patch #1360200: Use unmangled_version RPM spec field to deal with
file name mangling.
Will backport to 2.5.
........
r52753 | walter.doerwald | 2006-11-15 17:23:46 +0100 (Wed, 15 Nov 2006) | 2 lines
Fix typo.
........
r52754 | georg.brandl | 2006-11-15 18:42:03 +0100 (Wed, 15 Nov 2006) | 2 lines
Bug #1594809: add a note to README regarding PYTHONPATH and make install.
........
r52762 | georg.brandl | 2006-11-16 16:05:14 +0100 (Thu, 16 Nov 2006) | 2 lines
Bug #1597576: mention that the new base64 api has been introduced in py2.4.
........
r52764 | georg.brandl | 2006-11-16 17:50:59 +0100 (Thu, 16 Nov 2006) | 3 lines
Bug #1597824: return the registered function from atexit.register()
to facilitate usage as a decorator.
........
r52765 | georg.brandl | 2006-11-16 18:08:45 +0100 (Thu, 16 Nov 2006) | 4 lines
Bug #1588217: don't parse "= " as a soft line break in binascii's
a2b_qp() function, instead leave it in the string as quopri.decode()
does.
........
r52776 | andrew.kuchling | 2006-11-17 14:30:25 +0100 (Fri, 17 Nov 2006) | 17 lines
Remove file-locking in MH.pack() method.
This change looks massive but it's mostly a re-indenting after
removing some try...finally blocks.
Also adds a test case that does a pack() while the mailbox is locked; this
test would have turned up bugs in the original code on some platforms.
In both nmh and GNU Mailutils' implementation of MH-format mailboxes,
no locking is done of individual message files when renaming them.
The original mailbox.py code did do locking, which meant that message
files had to be opened. This code was buggy on certain platforms
(found through reading the code); there were code paths that closed
the file object and then called _unlock_file() on it.
Will backport to 25-maint once I see how the buildbots react to this patch.
........
r52780 | martin.v.loewis | 2006-11-18 19:00:23 +0100 (Sat, 18 Nov 2006) | 5 lines
Patch #1538878: Don't make tkSimpleDialog dialogs transient if
the parent window is withdrawn. This mirrors what dialog.tcl
does.
Will backport to 2.5.
........
r52782 | martin.v.loewis | 2006-11-18 19:05:35 +0100 (Sat, 18 Nov 2006) | 4 lines
Patch #1594554: Always close a tkSimpleDialog on ok(), even
if an exception occurs.
Will backport to 2.5.
........
r52784 | martin.v.loewis | 2006-11-18 19:42:11 +0100 (Sat, 18 Nov 2006) | 3 lines
Patch #1472877: Fix Tix subwidget name resolution.
Will backport to 2.5.
........
r52786 | andrew.kuchling | 2006-11-18 23:17:33 +0100 (Sat, 18 Nov 2006) | 1 line
Expand checking in test_sha
........
r52787 | georg.brandl | 2006-11-19 09:48:30 +0100 (Sun, 19 Nov 2006) | 3 lines
Patch [ 1586791 ] better error msgs for some TypeErrors
........
r52788 | martin.v.loewis | 2006-11-19 11:41:41 +0100 (Sun, 19 Nov 2006) | 4 lines
Make cStringIO.truncate raise IOError for negative
arguments (even for -1). Fixes the last bit of
#1359365.
........
r52789 | andrew.kuchling | 2006-11-19 19:40:01 +0100 (Sun, 19 Nov 2006) | 1 line
Add a test case of data w/ bytes > 127
........
r52790 | martin.v.loewis | 2006-11-19 19:51:54 +0100 (Sun, 19 Nov 2006) | 3 lines
Patch #1070046: Marshal new-style objects like InstanceType
in xmlrpclib.
........
r52792 | neal.norwitz | 2006-11-19 22:26:53 +0100 (Sun, 19 Nov 2006) | 4 lines
Speed up function calls into the math module by using METH_O.
There should be no functional changes. However, the error msgs are
slightly different. Also verified that the module dict is not NULL on init.
........
r52794 | george.yoshida | 2006-11-20 03:24:48 +0100 (Mon, 20 Nov 2006) | 2 lines
markup fix
........
r52795 | georg.brandl | 2006-11-20 08:12:58 +0100 (Mon, 20 Nov 2006) | 3 lines
Further markup fix.
........
r52800 | andrew.kuchling | 2006-11-20 14:39:37 +0100 (Mon, 20 Nov 2006) | 2 lines
Jython compatibility fix: if uu.decode() opened its output file, be sure to
close it.
........
r52811 | neal.norwitz | 2006-11-21 06:26:22 +0100 (Tue, 21 Nov 2006) | 9 lines
Bug #1599782: Fix segfault on bsddb.db.DB().type().
The problem is that _DB_get_type() can't be called without the GIL
because it calls a bunch of PyErr_* APIs when an error occurs.
There were no other cases in this file that it was called without the GIL.
Removing the BEGIN/END THREAD around _DB_get_type() made everything work.
Will backport.
........
r52814 | neal.norwitz | 2006-11-21 06:51:51 +0100 (Tue, 21 Nov 2006) | 1 line
Oops, convert tabs to spaces
........
r52815 | neal.norwitz | 2006-11-21 07:23:44 +0100 (Tue, 21 Nov 2006) | 1 line
Fix SF #1599879, socket.gethostname should ref getfqdn directly.
........
r52817 | martin.v.loewis | 2006-11-21 19:20:25 +0100 (Tue, 21 Nov 2006) | 4 lines
Conditionalize definition of _CRT_SECURE_NO_DEPRECATE
and _CRT_NONSTDC_NO_DEPRECATE.
Will backport.
........
r52821 | martin.v.loewis | 2006-11-22 09:50:02 +0100 (Wed, 22 Nov 2006) | 4 lines
Patch #1362975: Rework CodeContext indentation algorithm to
avoid hard-coding pixel widths. Also make the text's scrollbar
a child of the text frame, not the top widget.
........
r52826 | walter.doerwald | 2006-11-23 06:03:56 +0100 (Thu, 23 Nov 2006) | 3 lines
Change decode() so that it works with a buffer (i.e. unicode(..., 'utf-8-sig'))
SF bug #1601501.
........
r52833 | georg.brandl | 2006-11-23 10:55:07 +0100 (Thu, 23 Nov 2006) | 2 lines
Bug #1601630: little improvement to getopt docs
........
r52835 | michael.hudson | 2006-11-23 14:54:04 +0100 (Thu, 23 Nov 2006) | 3 lines
a test for an error condition not covered by existing tests
(noticed this when writing the equivalent code for pypy)
........
r52839 | raymond.hettinger | 2006-11-23 22:06:03 +0100 (Thu, 23 Nov 2006) | 1 line
Fix and/add typo
........
r52840 | raymond.hettinger | 2006-11-23 22:35:19 +0100 (Thu, 23 Nov 2006) | 1 line
... and the number of the counting shall be three.
........
r52841 | thomas.heller | 2006-11-24 19:45:39 +0100 (Fri, 24 Nov 2006) | 1 line
Fix bug #1598620: A ctypes structure cannot contain itself.
........
r52843 | martin.v.loewis | 2006-11-25 16:39:19 +0100 (Sat, 25 Nov 2006) | 3 lines
Disable _XOPEN_SOURCE on NetBSD 1.x.
Will backport to 2.5
........
r52845 | georg.brandl | 2006-11-26 20:27:47 +0100 (Sun, 26 Nov 2006) | 2 lines
Bug #1603321: make pstats.Stats accept Unicode file paths.
........
r52850 | georg.brandl | 2006-11-27 19:46:21 +0100 (Mon, 27 Nov 2006) | 2 lines
Bug #1603789: grammatical error in Tkinter docs.
........
r52855 | thomas.heller | 2006-11-28 21:21:54 +0100 (Tue, 28 Nov 2006) | 7 lines
Fix #1563807: _ctypes built on AIX fails with ld ffi error.
The contents of ffi_darwin.c must be compiled unless __APPLE__ is
defined and __ppc__ is not.
Will backport.
........
r52862 | armin.rigo | 2006-11-29 22:59:22 +0100 (Wed, 29 Nov 2006) | 3 lines
Forgot a case where the locals can now be a general mapping
instead of just a dictionary. (backporting...)
........
r52872 | guido.van.rossum | 2006-11-30 20:23:13 +0100 (Thu, 30 Nov 2006) | 2 lines
Update version.
........
r52890 | walter.doerwald | 2006-12-01 17:59:47 +0100 (Fri, 01 Dec 2006) | 3 lines
Move xdrlib tests from the module into a separate test script,
port the tests to unittest and add a few new tests.
........
r52900 | raymond.hettinger | 2006-12-02 03:00:39 +0100 (Sat, 02 Dec 2006) | 1 line
Add name to credits (for untokenize).
........
r52905 | martin.v.loewis | 2006-12-03 10:54:46 +0100 (Sun, 03 Dec 2006) | 2 lines
Move IDLE news into NEWS.txt.
........
r52906 | martin.v.loewis | 2006-12-03 12:23:45 +0100 (Sun, 03 Dec 2006) | 4 lines
Patch #1544279: Improve thread-safety of the socket module by moving
the sock_addr_t storage out of the socket object.
Will backport to 2.5.
........
r52908 | martin.v.loewis | 2006-12-03 13:01:53 +0100 (Sun, 03 Dec 2006) | 3 lines
Patch #1371075: Make ConfigParser accept optional dict type
for ordering, sorting, etc.
........
r52910 | matthias.klose | 2006-12-03 18:16:41 +0100 (Sun, 03 Dec 2006) | 2 lines
- Fix build failure on kfreebsd and on the hurd.
........
r52915 | george.yoshida | 2006-12-04 12:41:54 +0100 (Mon, 04 Dec 2006) | 2 lines
fix a versionchanged tag
........
r52917 | george.yoshida | 2006-12-05 06:39:50 +0100 (Tue, 05 Dec 2006) | 3 lines
Fix pickle doc typo
Patch #1608758
........
r52938 | georg.brandl | 2006-12-06 23:21:18 +0100 (Wed, 06 Dec 2006) | 2 lines
Patch #1610437: fix a tarfile bug with long filename headers.
........
r52945 | brett.cannon | 2006-12-07 00:38:48 +0100 (Thu, 07 Dec 2006) | 3 lines
Fix a bad assumption that all objects assigned to '__loader__' on a module
will have a '_files' attribute.
........
r52951 | georg.brandl | 2006-12-07 10:30:06 +0100 (Thu, 07 Dec 2006) | 3 lines
RFE #1592899: mention string.maketrans() in docs for str.translate,
remove reference to the old regex module in the former's doc.
........
r52962 | raymond.hettinger | 2006-12-08 04:17:18 +0100 (Fri, 08 Dec 2006) | 1 line
Eliminate two redundant calls to PyObject_Hash().
........
r52963 | raymond.hettinger | 2006-12-08 05:24:33 +0100 (Fri, 08 Dec 2006) | 3 lines
Port Armin's fix for a dict resize vulnerability (svn revision 46589, sf bug 1456209).
........
r52964 | raymond.hettinger | 2006-12-08 05:57:50 +0100 (Fri, 08 Dec 2006) | 4 lines
Port Georg's dictobject.c fix keys that were tuples got unpacked on the way to setting a KeyError (svn revision 52535, sf bug
1576657).
........
r52966 | raymond.hettinger | 2006-12-08 18:35:25 +0100 (Fri, 08 Dec 2006) | 2 lines
Add test for SF bug 1576657
........
r52970 | georg.brandl | 2006-12-08 21:46:11 +0100 (Fri, 08 Dec 2006) | 3 lines
#1577756: svnversion doesn't react to LANG=C, use LC_ALL=C to force
English output.
........
r52972 | georg.brandl | 2006-12-09 10:08:29 +0100 (Sat, 09 Dec 2006) | 3 lines
Patch #1608267: fix a race condition in os.makedirs() is the directory
to be created is already there.
........
r52975 | matthias.klose | 2006-12-09 13:15:27 +0100 (Sat, 09 Dec 2006) | 2 lines
- Fix the build of the library reference in info format.
........
r52994 | neal.norwitz | 2006-12-11 02:01:06 +0100 (Mon, 11 Dec 2006) | 1 line
Fix a typo
........
r52996 | georg.brandl | 2006-12-11 08:56:33 +0100 (Mon, 11 Dec 2006) | 2 lines
Move errno imports back to individual functions.
........
r52998 | vinay.sajip | 2006-12-11 15:07:16 +0100 (Mon, 11 Dec 2006) | 1 line
Patch by Jeremy Katz (SF #1609407)
........
r53000 | vinay.sajip | 2006-12-11 15:26:23 +0100 (Mon, 11 Dec 2006) | 1 line
Patch by "cuppatea" (SF #1503765)
........
2006-12-13 00:49:30 -04:00
|
|
|
\ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive
tool for developing and testing RE patterns. This HOWTO will use the
standard Python interpreter for its examples.

First, run the Python interpreter, import the \module{re} module, and
compile a RE:

\begin{verbatim}
Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 80c3c28>
\end{verbatim}

Now, you can try matching various strings against the RE
\regexp{[a-z]+}. An empty string shouldn't match at all, since
\regexp{+} means 'one or more repetitions'. \method{match()} should
return \code{None} in this case, which will cause the interpreter to
print no output. You can explicitly print the result of
\method{match()} to make this clear.

\begin{verbatim}
>>> p.match("")
>>> print p.match("")
None
\end{verbatim}

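As a small aside, the role of \regexp{+} can be seen by comparing it
with \regexp{*}, which also accepts zero repetitions. The sketch below
is illustrative only: the names \code{plus} and \code{star} are not
part of this HOWTO's session, and the function-call form of
\code{print} is used so that the code also runs on Python 3.

```python
import re

# [a-z]+ needs at least one lowercase letter; [a-z]* accepts zero.
plus = re.compile('[a-z]+')
star = re.compile('[a-z]*')

plus_result = plus.match("")   # no letters to consume, so no match
star_result = star.match("")   # matches the empty string at position 0

print(plus_result)
print(repr(star_result.group()))
```
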
Now, let's try it on a string that it should match, such as
\samp{tempo}. In this case, \method{match()} will return a
\class{MatchObject}, so you should store the result in a variable for
later use.

\begin{verbatim}
>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>
\end{verbatim}

Now you can query the \class{MatchObject} for information about the
matching string. \class{MatchObject} instances also have several
methods and attributes; the most important ones are:

\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{group()}{Return the string matched by the RE}
\lineii{start()}{Return the starting position of the match}
\lineii{end()}{Return the ending position of the match}
\lineii{span()}{Return a tuple containing the (start, end) positions
of the match}
\end{tableii}

Trying these methods will soon clarify their meaning:

\begin{verbatim}
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
\end{verbatim}

\method{group()} returns the substring that was matched by the
RE. \method{start()} and \method{end()} return the starting and
ending index of the match. \method{span()} returns both start and end
indexes in a single tuple. Since the \method{match()} method only
checks if the RE matches at the start of a string,
\method{start()} will always be zero. However, the \method{search()}
method of \class{RegexObject} instances scans through the string, so
the match may not start at zero in that case.

\begin{verbatim}
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<_sre.SRE_Match object at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)
\end{verbatim}

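The relationship between these methods can be checked directly. The
following is a minimal sketch rather than part of the original
session: the names \code{text}, \code{same_span} and \code{same_text}
are chosen for illustration, and the function-call form of
\code{print} is used so that the code also runs on Python 3.

```python
import re

p = re.compile('[a-z]+')
text = '::: message'

# match() only tries position 0, where ':' is not a lowercase letter.
print(p.match(text))

# search() scans forward until the pattern matches somewhere.
m = p.search(text)
print(m.group(), m.span())

# span() is the (start(), end()) pair, and slicing the original
# string with those indexes gives back group().
same_span = (m.start(), m.end()) == m.span()
same_text = text[m.start():m.end()] == m.group()
print(same_span, same_text)
```
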
In actual programs, the most common style is to store the
\class{MatchObject} in a variable, and then check if it was
\code{None}. This usually looks like:

\begin{verbatim}
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
\end{verbatim}

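The same idiom can be wrapped in a small function. This is a sketch
under stated assumptions: the function name \code{first\_word} and the
pattern are hypothetical, chosen only to make the \code{if m:} test
concrete, and the code is written for Python 3.

```python
import re

# Hypothetical pattern: a run of lowercase letters.
word_re = re.compile('[a-z]+')

def first_word(text):
    """Return the lowercase word at the start of text, or None."""
    m = word_re.match(text)
    if m:
        # A match object is always true, so this branch means success.
        return m.group()
    # match() returned None: the pattern did not match at position 0.
    return None

print(first_word('tempo 120'))
print(first_word('120 tempo'))
```
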
Two \class{RegexObject} methods return all of the matches for a pattern.
\method{findall()} returns a list of matching strings:

\begin{verbatim}
>>> p = re.compile('\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
\end{verbatim}

\method{findall()} has to create the entire list before it can be
returned as the result. As of Python 2.2, the \method{finditer()} method
is also available, returning a sequence of \class{MatchObject} instances
as an iterator.

\begin{verbatim}
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)
\end{verbatim}


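To see how the two methods relate, the sketch below runs both over the
abbreviated string used with \method{finditer()} above. The names
\code{text}, \code{strings}, \code{spans} and \code{from\_iter} are
illustrative, and the code assumes Python 3.

```python
import re

p = re.compile(r'\d+')
text = '12 drummers drumming, 11 ... 10 ...'

# findall() builds the complete list of matched strings up front.
strings = p.findall(text)

# finditer() yields match objects one at a time, so positions are
# available as well as the matched text.
spans = [m.span() for m in p.finditer(text)]
from_iter = [m.group() for m in p.finditer(text)]

print(strings)
print(spans)
```
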
\subsection{Module-Level Functions}

You don't have to produce a \class{RegexObject} and call its methods;
the \module{re} module also provides top-level functions called
\function{match()}, \function{search()}, \function{sub()}, and so
forth. These functions take the same arguments as the corresponding
\class{RegexObject} method, with the RE string added as the first
argument, and still return either \code{None} or a \class{MatchObject}
instance.

\begin{verbatim}
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<_sre.SRE_Match object at 80c5978>
\end{verbatim}

Under the hood, these functions simply produce a \class{RegexObject}
|
|
|
|
for you and call the appropriate method on it. They also store the
|
|
|
|
compiled object in a cache, so future calls using the same
|
|
|
|
RE are faster.
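The equivalence between the two styles can be sketched as follows,
using the \samp{From} pattern from the example above. The variable
names are illustrative only.

```python
import re

line = 'From amk Thu May 14 19:12:10 1998'

# Module-level call: the pattern string is the first argument.
m1 = re.match(r'From\s+', line)

# Equivalent explicit form: compile once, then call the method.
p = re.compile(r'From\s+')
m2 = p.match(line)

same_text = (m1.group() == m2.group())
same_span = (m1.span() == m2.span())

# No whitespace follows 'From' here, so the result is None.
no_match = re.match(r'From\s+', 'Fromage amk')
```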
|
|
|
|
|
|
|
|
Should you use these module-level functions, or should you get the
|
|
|
|
\class{RegexObject} and call its methods yourself? That choice
|
|
|
|
depends on how frequently the RE will be used, and on your personal
|
|
|
|
coding style. If a RE is being used at only one point in the code,
|
|
|
|
then the module functions are probably more convenient. If a program
|
|
|
|
contains a lot of regular expressions, or re-uses the same ones in
|
|
|
|
several locations, then it might be worthwhile to collect all the
|
|
|
|
definitions in one place, in a section of code that compiles all the
|
|
|
|
REs ahead of time. To take an example from the standard library,
|
|
|
|
here's an extract from \file{xmllib.py}:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
ref = re.compile( ... )
|
|
|
|
entityref = re.compile( ... )
|
|
|
|
charref = re.compile( ... )
|
|
|
|
starttagopen = re.compile( ... )
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
I generally prefer to work with the compiled object, even for
|
|
|
|
one-time uses, but few people will be as much of a purist about this
|
|
|
|
as I am.
|
|
|
|
|
|
|
|
\subsection{Compilation Flags}
|
|
|
|
|
|
|
|
Compilation flags let you modify some aspects of how regular
|
|
|
|
expressions work. Flags are available in the \module{re} module under
|
|
|
|
two names, a long name such as \constant{IGNORECASE}, and a short,
|
|
|
|
one-letter form such as \constant{I}. (If you're familiar with Perl's
|
|
|
|
pattern modifiers, the one-letter forms use the same letters; the
|
|
|
|
short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
|
|
|
|
Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
|
|
|
|
re.M} sets both the \constant{I} and \constant{M} flags, for example.
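A small sketch of combining flags follows. The sample text and pattern
are made up for illustration; the pattern relies on \regexp{\^} and
\regexp{\$}, which are described in more detail later in this document.

```python
import re

text = 'first line\nSecond line\nthird LINE'
pattern = r'^\w+ line$'

# re.I | re.M turns on both IGNORECASE and MULTILINE.
both = re.compile(pattern, re.I | re.M).findall(text)

# With only one of the two flags, fewer lines qualify.
only_multiline = re.compile(pattern, re.M).findall(text)
only_ignorecase = re.compile(pattern, re.I).findall(text)
```

With both flags set, all three lines match. With \constant{MULTILINE}
alone, the line ending in \samp{LINE} is missed, and with
\constant{IGNORECASE} alone the pattern has to span the whole string
and finds nothing.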
|
|
|
|
|
|
|
|
Here's a table of the available flags, followed by
|
|
|
|
a more detailed explanation of each one.
|
|
|
|
|
|
|
|
\begin{tableii}{c|l}{}{Flag}{Meaning}
|
|
|
|
\lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
|
|
|
|
character, including newlines}
|
|
|
|
\lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
|
|
|
|
\lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
|
|
|
|
\lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
|
|
|
|
affecting \regexp{\^} and \regexp{\$}}
|
|
|
|
\lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
|
|
|
|
which can be organized more cleanly and understandably}
|
|
|
|
\end{tableii}
|
|
|
|
|
|
|
|
\begin{datadesc}{I}
|
|
|
|
\dataline{IGNORECASE}
|
|
|
|
Perform case-insensitive matching; character classes and literal strings
will match letters regardless of case. For example, \regexp{[A-Z]} will match
lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
\samp{spam}, or \samp{spAM}.
This case folding doesn't take the current locale into account unless
you also set the \constant{LOCALE} flag.
|
|
|
|
\end{datadesc}
|
|
|
|
|
|
|
|
\begin{datadesc}{L}
|
|
|
|
\dataline{LOCALE}
|
|
|
|
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
|
|
|
|
and \regexp{\e B}, dependent on the current locale.
|
|
|
|
|
|
|
|
Locales are a feature of the C library intended to help in writing
|
|
|
|
programs that take account of language differences. For example, if
|
|
|
|
you're processing French text, you'd want to be able to write
|
|
|
|
\regexp{\e w+} to match words, but \regexp{\e w} only matches the
|
|
|
|
character class \regexp{[a-zA-Z0-9_]}; it won't match \character{\'e} or
|
|
|
|
\character{\c c}. If your system is configured properly and a French
|
|
|
|
locale is selected, certain C functions will tell the program that
|
|
|
|
\character{\'e} should also be considered a letter. Setting the
|
|
|
|
\constant{LOCALE} flag when compiling a regular expression will cause the
|
|
|
|
resulting compiled object to use these C functions for \regexp{\e w};
|
|
|
|
this is slower, but also enables \regexp{\e w+} to match French words as
|
|
|
|
you'd expect.
|
|
|
|
\end{datadesc}
|
|
|
|
|
|
|
|
\begin{datadesc}{M}
|
|
|
|
\dataline{MULTILINE}
|
|
|
|
(\regexp{\^} and \regexp{\$} haven't been explained yet;
|
|
|
|
they'll be introduced in section~\ref{more-metacharacters}.)
|
|
|
|
|
|
|
|
Usually \regexp{\^} matches only at the beginning of the string, and
|
|
|
|
\regexp{\$} matches only at the end of the string and immediately before the
|
|
|
|
newline (if any) at the end of the string. When this flag is
|
|
|
|
specified, \regexp{\^} matches at the beginning of the string and at
|
|
|
|
the beginning of each line within the string, immediately following
|
|
|
|
each newline. Similarly, the \regexp{\$} metacharacter matches both at
the end of the string and at the end of each line (immediately
preceding each newline).
|
|
|
|
|
|
|
|
\end{datadesc}
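The effect of \constant{MULTILINE} can be sketched with a short,
made-up three-line string. The variable names are illustrative.

```python
import re

text = 'one\ntwo\nthree'

# Default: ^ and $ anchor only at the ends of the whole string.
first_default = re.findall(r'^\w+', text)
last_default = re.findall(r'\w+$', text)

# MULTILINE: ^ and $ also anchor at every line boundary.
first_multi = re.findall(r'^\w+', text, re.M)
last_multi = re.findall(r'\w+$', text, re.M)
```

Without the flag only \samp{one} is found at the start and only
\samp{three} at the end; with it, every line contributes a match.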
|
|
|
|
|
|
|
|
\begin{datadesc}{S}
|
|
|
|
\dataline{DOTALL}
|
|
|
|
Makes the \character{.} special character match any character at all,
|
|
|
|
including a newline; without this flag, \character{.} will match
|
|
|
|
anything \emph{except} a newline.
|
|
|
|
\end{datadesc}
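A minimal sketch of the difference \constant{DOTALL} makes, using a
made-up two-line string:

```python
import re

text = 'first\nsecond'

# By default '.' refuses to match the newline between the two words.
default = re.match(r'first.second', text)

# With DOTALL (re.S) the newline is just another character.
dotall = re.match(r'first.second', text, re.S)

# The same applies to '.*', which normally stops at the end of a line.
one_line = re.match(r'.*', text).group()
everything = re.match(r'.*', text, re.S).group()
```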
|
|
|
|
|
|
|
|
\begin{datadesc}{X}
|
|
|
|
\dataline{VERBOSE} This flag allows you to write regular expressions
|
|
|
|
that are more readable by granting you more flexibility in how you can
|
|
|
|
format them. When this flag has been specified, whitespace within the
|
|
|
|
RE string is ignored, except when the whitespace is in a character
|
|
|
|
class or preceded by an unescaped backslash; this lets you organize
|
|
|
|
and indent the RE more clearly. It also enables you to put comments
|
|
|
|
within a RE that will be ignored by the engine; comments are marked by
|
|
|
|
a \character{\#} that's neither in a character class nor preceded by an
|
|
|
|
unescaped backslash.
|
|
|
|
|
|
|
|
For example, here's a RE that uses \constant{re.VERBOSE}; see how
|
|
|
|
much easier it is to read?
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
charref = re.compile(r"""
|
|
|
|
&[#] # Start of a numeric entity reference
|
|
|
|
(
|
|
|
|
[0-9]+[^0-9] # Decimal form
|
|
|
|
| 0[0-7]+[^0-7] # Octal form
|
|
|
|
| x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
|
|
|
|
)
|
|
|
|
""", re.VERBOSE)
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Without the verbose setting, the RE would look like this:
|
|
|
|
\begin{verbatim}
|
|
|
|
charref = re.compile("&#([0-9]+[^0-9]"
|
|
|
|
"|0[0-7]+[^0-7]"
|
|
|
|
"|x[0-9a-fA-F]+[^0-9a-fA-F])")
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
In the above example, Python's automatic concatenation of string
|
|
|
|
literals has been used to break up the RE into smaller pieces, but
|
|
|
|
it's still more difficult to understand than the version using
|
|
|
|
\constant{re.VERBOSE}.
|
|
|
|
|
|
|
|
\end{datadesc}
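The whitespace rule for \constant{VERBOSE} is easy to trip over, so
here is a small sketch of it. Both patterns are invented for this
illustration: the first protects a space by putting it in a character
class, while the second leaves the space bare.

```python
import re

# Whitespace inside a character class survives VERBOSE mode.
with_class = re.compile(r"""
    \d+     # a number
    [ ]     # one literal space, protected by the character class
    \w+     # a word
""", re.VERBOSE)

# Here the bare space is ignored, so the pattern is really \d+\w+.
bare_space = re.compile(r'\d+ \w+', re.VERBOSE)

full = with_class.match('10 lords').group()      # '10 lords'
partial = bare_space.match('10 lords').group()   # '10'
```

The second pattern still matches, but only because \regexp{\e d+}
gives up a digit for \regexp{\e w+} to consume; the space in the
pattern plays no part.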
|
|
|
|
|
|
|
|
\section{More Pattern Power}
|
|
|
|
|
|
|
|
So far we've only covered a part of the features of regular
|
|
|
|
expressions. In this section, we'll cover some new metacharacters,
|
|
|
|
and how to use groups to retrieve portions of the text that was matched.
|
|
|
|
|
|
|
|
\subsection{More Metacharacters\label{more-metacharacters}}
|
|
|
|
|
|
|
|
There are some metacharacters that we haven't covered yet. Most of
|
|
|
|
them will be covered in this section.
|
|
|
|
|
|
|
|
Some of the remaining metacharacters to be discussed are
|
|
|
|
\dfn{zero-width assertions}. They don't cause the engine to advance
|
|
|
|
through the string; instead, they consume no characters at all,
|
|
|
|
and simply succeed or fail. For example, \regexp{\e b} is an
|
|
|
|
assertion that the current position is located at a word boundary; the
|
|
|
|
position isn't changed by the \regexp{\e b} at all. This means that
|
|
|
|
zero-width assertions should never be repeated, because if they match
|
|
|
|
once at a given location, they can obviously be matched an infinite
|
|
|
|
number of times.
|
|
|
|
|
|
|
|
\begin{list}{}{}
|
|
|
|
|
|
|
|
\item[\regexp{|}]
|
|
|
|
Alternation, or the ``or'' operator.
|
|
|
|
If A and B are regular expressions,
|
|
|
|
\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
|
|
|
|
\regexp{|} has very low precedence in order to make it work reasonably when
|
|
|
|
you're alternating multi-character strings.
|
|
|
|
\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
|
|
|
|
\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
|
|
|
|
|
|
|
|
To match a literal \character{|},
|
|
|
|
use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
|
|
|
|
|
|
|
|
\item[\regexp{\^}] Matches at the beginning of lines. Unless the
|
|
|
|
\constant{MULTILINE} flag has been set, this will only match at the
|
|
|
|
beginning of the string. In \constant{MULTILINE} mode, this also
|
|
|
|
matches immediately after each newline within the string.
|
|
|
|
|
|
|
|
For example, if you wish to match the word \samp{From} only at the
|
|
|
|
beginning of a line, the RE to use is \verb|^From|.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> print re.search('^From', 'From Here to Eternity')
|
|
|
|
<re.MatchObject instance at 80c1520>
|
|
|
|
>>> print re.search('^From', 'Reciting From Memory')
|
|
|
|
None
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
%To match a literal \character{\^}, use \regexp{\e\^} or enclose it
|
|
|
|
%inside a character class, as in \regexp{[{\e}\^]}.
|
|
|
|
|
|
|
|
\item[\regexp{\$}] Matches at the end of a line, which is defined as
|
|
|
|
either the end of the string, or any location followed by a newline
|
|
|
|
character.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> print re.search('}$', '{block}')
|
|
|
|
<re.MatchObject instance at 80adfa8>
|
|
|
|
>>> print re.search('}$', '{block} ')
|
|
|
|
None
|
|
|
|
>>> print re.search('}$', '{block}\n')
|
|
|
|
<re.MatchObject instance at 80adfa8>
|
|
|
|
\end{verbatim}
|
|
|
|
% $
|
|
|
|
|
|
|
|
To match a literal \character{\$}, use \regexp{\e\$} or enclose it
|
|
|
|
inside a character class, as in \regexp{[\$]}.
|
|
|
|
|
|
|
|
\item[\regexp{\e A}] Matches only at the start of the string. When
|
|
|
|
not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
|
|
|
|
effectively the same. In \constant{MULTILINE} mode, however, they're
|
|
|
|
different; \regexp{\e A} still matches only at the beginning of the
|
|
|
|
string, but \regexp{\^} may match at any location inside the string
|
|
|
|
that follows a newline character.
|
|
|
|
|
|
|
|
\item[\regexp{\e Z}] Matches only at the end of the string.
|
|
|
|
|
|
|
|
\item[\regexp{\e b}] Word boundary.
|
|
|
|
This is a zero-width assertion that matches only at the
|
|
|
|
beginning or end of a word. A word is defined as a sequence of
|
|
|
|
alphanumeric characters, so the end of a word is indicated by
|
|
|
|
whitespace or a non-alphanumeric character.
|
|
|
|
|
|
|
|
The following example matches \samp{class} only when it's a complete
|
|
|
|
word; it won't match when it's contained inside another word.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile(r'\bclass\b')
|
|
|
|
>>> print p.search('no class at all')
|
|
|
|
<re.MatchObject instance at 80c8f28>
|
|
|
|
>>> print p.search('the declassified algorithm')
|
|
|
|
None
|
|
|
|
>>> print p.search('one subclass is')
|
|
|
|
None
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
There are two subtleties you should remember when using this special
|
|
|
|
sequence. First, this is the worst collision between Python's string
|
|
|
|
literals and regular expression sequences. In Python's string
|
|
|
|
literals, \samp{\e b} is the backspace character, ASCII value 8. If
|
|
|
|
you're not using raw strings, then Python will convert the \samp{\e b} to
|
|
|
|
a backspace, and your RE won't match as you expect it to. The
|
|
|
|
following example looks the same as our previous RE, but omits
|
|
|
|
the \character{r} in front of the RE string.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('\bclass\b')
|
|
|
|
>>> print p.search('no class at all')
|
|
|
|
None
|
|
|
|
>>> print p.search('\b' + 'class' + '\b')
|
|
|
|
<re.MatchObject instance at 80c3ee0>
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Second, inside a character class, where there's no use for this
|
|
|
|
assertion, \regexp{\e b} represents the backspace character, for
|
|
|
|
compatibility with Python's string literals.
|
|
|
|
|
|
|
|
\item[\regexp{\e B}] Another zero-width assertion, this is the
|
|
|
|
opposite of \regexp{\e b}, only matching when the current
|
|
|
|
position is not at a word boundary.
|
|
|
|
|
|
|
|
\end{list}
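The metacharacters in the list above can be tried out together in the
following sketch. The sample strings are either taken from the examples
above or made up for illustration, and the variable names are
arbitrary.

```python
import re

# Alternation binds loosely: the choices are the whole words.
alt = [m.group() for m in re.finditer(r'Crow|Servo', 'Crow and Servo')]

text = 'From here\nFrom there'
caret = re.findall(r'^From', text, re.M)     # matches on both lines
start = re.findall(r'\AFrom', text, re.M)    # start of string only

# $ tolerates one trailing newline; \Z does not.
dollar = re.search(r'}$', '{block}\n')
strict = re.search(r'}\Z', '{block}\n')

# \B succeeds only where there is no word boundary.
inside = re.search(r'\Bclass\B', 'the declassified algorithm')
whole = re.search(r'\Bclass\B', 'no class at all')
```

Note that every pattern is written as a raw string, which keeps
\samp{\e A}, \samp{\e Z}, and \samp{\e B} from being interpreted by
Python's string literal rules.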
|
|
|
|
|
|
|
|
\subsection{Grouping}
|
|
|
|
|
|
|
|
Frequently you need to obtain more information than just whether the
|
|
|
|
RE matched or not. Regular expressions are often used to dissect
|
|
|
|
strings by writing a RE divided into several subgroups which
|
|
|
|
match different components of interest. For example, an RFC-822
|
|
|
|
header line is divided into a header name and a value, separated by a
|
|
|
|
\character{:}. This can be handled by writing a regular expression
|
|
|
|
which matches an entire header line, and has one group which matches the
|
|
|
|
header name, and another group which matches the header's value.
|
|
|
|
|
|
|
|
Groups are marked by the \character{(}, \character{)} metacharacters.
|
|
|
|
\character{(} and \character{)} have much the same meaning as they do
|
|
|
|
in mathematical expressions; they group together the expressions
|
|
|
|
contained inside them. For example, you can repeat the contents of a
|
|
|
|
group with a repeating qualifier, such as \regexp{*}, \regexp{+},
|
|
|
|
\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
|
|
|
|
\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('(ab)*')
|
|
|
|
>>> print p.match('ababababab').span()
|
|
|
|
(0, 10)
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Groups indicated with \character{(}, \character{)} also capture the
|
|
|
|
starting and ending index of the text that they match; this can be
|
|
|
|
retrieved by passing an argument to \method{group()},
|
|
|
|
\method{start()}, \method{end()}, and \method{span()}. Groups are
|
|
|
|
numbered starting with 0. Group 0 is always present; it's the whole
|
|
|
|
RE, so \class{MatchObject} methods all have group 0 as their default
|
|
|
|
argument. Later we'll see how to express groups that don't capture
|
|
|
|
the span of text that they match.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('(a)b')
|
|
|
|
>>> m = p.match('ab')
|
|
|
|
>>> m.group()
|
|
|
|
'ab'
|
|
|
|
>>> m.group(0)
|
|
|
|
'ab'
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Subgroups are numbered from left to right, from 1 upward. Groups can
|
|
|
|
be nested; to determine the number, just count the opening parenthesis
|
|
|
|
characters, going from left to right.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('(a(b)c)d')
|
|
|
|
>>> m = p.match('abcd')
|
|
|
|
>>> m.group(0)
|
|
|
|
'abcd'
|
|
|
|
>>> m.group(1)
|
|
|
|
'abc'
|
|
|
|
>>> m.group(2)
|
|
|
|
'b'
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
\method{group()} can be passed multiple group numbers at a time, in
|
|
|
|
which case it will return a tuple containing the corresponding values
|
|
|
|
for those groups.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> m.group(2,1,2)
|
|
|
|
('b', 'abc', 'b')
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
The \method{groups()} method returns a tuple containing the strings
|
|
|
|
for all the subgroups, from 1 up to however many there are.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> m.groups()
|
|
|
|
('abc', 'b')
|
|
|
|
\end{verbatim}
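Returning to the RFC-822 header line mentioned at the start of this
section, the group methods can be combined as in the sketch below. The
pattern is a simplified assumption made for illustration; it does not
handle everything the RFC allows, such as continuation lines.

```python
import re

# Hypothetical header pattern: group 1 is the name, group 2 the value.
header = re.compile(r'([^:]+):\s*(.*)')
line = 'Subject: Regular Expression HOWTO'
m = header.match(line)

name = m.group(1)
value = m.group(2)
name_span = m.span(1)        # indices of group 1 within the string
value_start = m.start(2)     # where group 2 begins
both = m.groups()
```

\method{start()}, \method{end()}, and \method{span()} take a group
number in the same way as \method{group()}, so the position of each
component is available along with its text.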
|
|
|
|
|
|
|
|
Backreferences in a pattern allow you to specify that the contents of
|
|
|
|
an earlier capturing group must also be found at the current location
|
|
|
|
in the string. For example, \regexp{\e 1} will succeed if the exact
|
|
|
|
contents of group 1 can be found at the current position, and fails
|
|
|
|
otherwise. Remember that Python's string literals also use a
|
|
|
|
backslash followed by numbers to allow including arbitrary characters
|
|
|
|
in a string, so be sure to use a raw string when incorporating
|
|
|
|
backreferences in a RE.
|
|
|
|
|
|
|
|
For example, the following RE detects doubled words in a string.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile(r'(\b\w+)\s+\1')
|
|
|
|
>>> p.search('Paris in the the spring').group()
|
|
|
|
'the the'
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Backreferences like this aren't often useful for just searching
|
|
|
|
through a string --- there are few text formats which repeat data in
|
|
|
|
this way --- but you'll soon find out that they're \emph{very} useful
|
|
|
|
when performing string substitutions.
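The warning about raw strings can be checked directly, as in this
sketch. The sample sentence is made up, and \method{finditer()} is used
to collect every doubled word rather than just the first.

```python
import re

# Without the r prefix, '\1' is the single character with code 1,
# not a backslash followed by the digit 1.
assert '\1' == '\x01'
assert len(r'\1') == 2

p = re.compile(r'(\b\w+)\s+\1')
text = 'Paris in the the spring, and and so on'
doubled = [m.group(1) for m in p.finditer(text)]
```

Here \code{doubled} ends up holding \samp{the} and \samp{and}, the two
words that appear twice in a row.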
|
|
|
|
|
|
|
|
\subsection{Non-capturing and Named Groups}
|
|
|
|
|
|
|
|
Elaborate REs may use many groups, both to capture substrings of
|
|
|
|
interest, and to group and structure the RE itself. In complex REs,
|
|
|
|
it becomes difficult to keep track of the group numbers. There are
|
|
|
|
two features which help with this problem. Both of them use a common
|
|
|
|
syntax for regular expression extensions, so we'll look at that first.
|
|
|
|
|
|
|
|
Perl 5 added several additional features to standard regular
|
|
|
|
expressions, and the Python \module{re} module supports most of them.
|
|
|
|
It would have been difficult to choose new single-keystroke
|
|
|
|
metacharacters or new special sequences beginning with \samp{\e} to
|
|
|
|
represent the new features without making Perl's regular expressions
|
|
|
|
confusingly different from standard REs. If you chose \samp{\&} as a
|
|
|
|
new metacharacter, for example, old expressions would be assuming that
|
|
|
|
\samp{\&} was a regular character and wouldn't have escaped it by
|
|
|
|
writing \regexp{\e \&} or \regexp{[\&]}.
|
|
|
|
|
|
|
|
The solution chosen by the Perl developers was to use \regexp{(?...)}
|
|
|
|
as the extension syntax. \samp{?} immediately after a parenthesis was
|
|
|
|
a syntax error because the \samp{?} would have nothing to repeat, so
|
|
|
|
this didn't introduce any compatibility problems. The characters
|
|
|
|
immediately after the \samp{?} indicate what extension is being used,
|
|
|
|
so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
|
|
|
|
\regexp{(?:foo)} is something else (a non-capturing group containing
|
|
|
|
the subexpression \regexp{foo}).
|
|
|
|
|
|
|
|
Python adds an extension syntax to Perl's extension syntax. If the
|
|
|
|
first character after the question mark is a \samp{P}, you know that
|
|
|
|
it's an extension that's specific to Python. Currently there are two
|
|
|
|
such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
|
|
|
|
and \regexp{(?P=\var{name})} is a backreference to a named group. If
|
|
|
|
future versions of Perl 5 add similar features using a different
|
|
|
|
syntax, the \module{re} module will be changed to support the new
|
|
|
|
syntax, while preserving the Python-specific syntax for
|
|
|
|
compatibility's sake.
|
|
|
|
|
|
|
|
Now that we've looked at the general extension syntax, we can return
|
|
|
|
to the features that simplify working with groups in complex REs.
|
|
|
|
Since groups are numbered from left to right and a complex expression
|
|
|
|
may use many groups, it can become difficult to keep track of the
|
|
|
|
correct numbering, and modifying such a complex RE is annoying.
|
|
|
|
Insert a new group near the beginning, and you change the numbers of
|
|
|
|
everything that follows it.
|
|
|
|
|
|
|
|
First, sometimes you'll want to use a group to collect a part of a
|
|
|
|
regular expression, but aren't interested in retrieving the group's
|
|
|
|
contents. You can make this fact explicit by using a non-capturing
|
|
|
|
group: \regexp{(?:...)}, where you can put any other regular
|
|
|
|
expression inside the parentheses.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> m = re.match("([abc])+", "abc")
|
|
|
|
>>> m.groups()
|
|
|
|
('c',)
|
|
|
|
>>> m = re.match("(?:[abc])+", "abc")
|
|
|
|
>>> m.groups()
|
|
|
|
()
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Except for the fact that you can't retrieve the contents of what the
|
|
|
|
group matched, a non-capturing group behaves exactly the same as a
|
|
|
|
capturing group; you can put anything inside it, repeat it with a
|
|
|
|
repetition metacharacter such as \samp{*}, and nest it within other
|
|
|
|
groups (capturing or non-capturing). \regexp{(?:...)} is particularly
|
|
|
|
useful when modifying an existing pattern, since you can add new groups
|
|
|
|
without changing how all the other groups are numbered. It should be
|
|
|
|
mentioned that there's no performance difference in searching between
|
|
|
|
capturing and non-capturing groups; neither form is any faster than
|
|
|
|
the other.
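The effect on group numbering can be sketched as follows. Both patterns
and the sample text are invented for this illustration.

```python
import re

# A capturing group used only for structure shifts the numbering...
capturing = re.compile(r'(item|entry)\s+(\d+)-(\d+)')
# ...while a non-capturing group leaves the other groups' numbers alone.
non_capturing = re.compile(r'(?:item|entry)\s+(\d+)-(\d+)')

text = 'entry 10-20'
with_extra = capturing.match(text).groups()
without_extra = non_capturing.match(text).groups()
```

In the first pattern the two numbers are groups 2 and 3; in the second
they remain groups 1 and 2, as they would be if the alternation were
not there at all.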
|
|
|
|
|
|
|
|
The second, and more significant, feature is named groups; instead of
|
|
|
|
referring to them by numbers, groups can be referenced by a name.
|
|
|
|
|
|
|
|
The syntax for a named group is one of the Python-specific extensions:
|
|
|
|
\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
|
|
|
|
the group. Except for associating a name with a group, named groups
|
|
|
|
also behave identically to capturing groups. The \class{MatchObject}
|
|
|
|
methods that deal with capturing groups all accept either integers, to
|
|
|
|
refer to groups by number, or a string containing the group name.
|
|
|
|
Named groups are still given numbers, so you can retrieve information
|
|
|
|
about a group in two ways:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile(r'(?P<word>\b\w+\b)')
|
|
|
|
>>> m = p.search( '(((( Lots of punctuation )))' )
|
|
|
|
>>> m.group('word')
|
|
|
|
'Lots'
|
|
|
|
>>> m.group(1)
|
|
|
|
'Lots'
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Named groups are handy because they let you use easily-remembered
|
|
|
|
names, instead of having to remember numbers. Here's an example RE
|
|
|
|
from the \module{imaplib} module:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
InternalDate = re.compile(r'INTERNALDATE "'
|
|
|
|
r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
|
|
|
|
r'(?P<year>[0-9][0-9][0-9][0-9])'
|
|
|
|
r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
|
|
|
|
r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
|
|
|
|
r'"')
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
It's obviously much easier to retrieve \code{m.group('zonem')},
|
|
|
|
instead of having to remember to retrieve group 9.
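To see the named groups at work, the pattern can be applied to a sample
string. The sample below is made up to fit the pattern and is not taken
from \module{imaplib}; the variable names are likewise illustrative.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Made-up sample input, shaped to fit the pattern above.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

by_name = m.group('zonem')
by_number = m.group(9)
date_parts = m.group('day', 'mon', 'year')
```

Both lookups return the same string, but only the named form stays
correct if a group is later added earlier in the pattern.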
|
|
|
|
|
|
|
|
Since the syntax for backreferences, in an expression like
|
|
|
|
\regexp{(...)\e 1}, refers to the number of the group, there's
|
|
|
|
naturally a variant that uses the group name instead of the number.
|
|
|
|
This is also a Python extension: \regexp{(?P=\var{name})} indicates
|
|
|
|
that the contents of the group called \var{name} should again be found
|
|
|
|
at the current point. The regular expression for finding doubled
|
|
|
|
words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
|
|
|
|
\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
|
|
|
|
>>> p.search('Paris in the the spring').group()
|
|
|
|
'the the'
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
\subsection{Lookahead Assertions}
|
|
|
|
|
|
|
|
Another zero-width assertion is the lookahead assertion. Lookahead
|
|
|
|
assertions are available in both positive and negative form, and
|
|
|
|
look like this:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds
|
|
|
|
if the contained regular expression, represented here by \code{...},
|
|
|
|
successfully matches at the current location, and fails otherwise.
|
|
|
|
But, once the contained expression has been tried, the matching engine
|
|
|
|
doesn't advance at all; the rest of the pattern is tried right where
|
|
|
|
the assertion started.
|
|
|
|
|
|
|
|
\item[\regexp{(?!...)}] Negative lookahead assertion. This is the
|
|
|
|
opposite of the positive assertion; it succeeds if the contained expression
|
|
|
|
\emph{doesn't} match at the current position in the string.
|
|
|
|
\end{itemize}
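Before turning to a realistic case, the two forms can be sketched with
a deliberately tiny, made-up pattern:

```python
import re

# Positive lookahead: 'foo' must be followed by 'bar', but 'bar'
# is not consumed, so the match ends right after 'foo'.
pos = re.search(r'foo(?=bar)', 'foobar')
pos_fail = re.search(r'foo(?=bar)', 'foobaz')

# Negative lookahead: 'foo' must NOT be followed by 'bar'.
neg = re.search(r'foo(?!bar)', 'foobaz')
neg_fail = re.search(r'foo(?!bar)', 'foobar')
```

In both successful cases the matched text is just \samp{foo}; the
assertion only inspects what follows and never becomes part of the
match.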
|
|
|
|
|
|
|
|
An example will help make this concrete by demonstrating a case
|
|
|
|
where a lookahead is useful. Consider a simple pattern to match a
|
|
|
|
filename and split it apart into a base name and an extension,
|
|
|
|
separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news}
|
|
|
|
is the base name, and \samp{rc} is the filename's extension.
|
|
|
|
|
|
|
|
The pattern to match this is quite simple:
|
|
|
|
|
|
|
|
\regexp{.*[.].*\$}
|
|
|
|
|
|
|
|
Notice that the \samp{.} needs to be treated specially because it's a
|
|
|
|
metacharacter; I've put it inside a character class. Also notice the
|
|
|
|
trailing \regexp{\$}; this is added to ensure that all the rest of the
|
|
|
|
string must be included in the extension. This regular expression
|
|
|
|
matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
|
|
|
|
\samp{printers.conf}.
|
|
|
|
|
|
|
|
Now, consider complicating the problem a bit; what if you want to
|
|
|
|
match filenames where the extension is not \samp{bat}?
|
|
|
|
Some incorrect attempts:
|
|
|
|
|
|
|
|
\verb|.*[.][^b].*$|
|
|
|
|
% $
|
|
|
|
|
|
|
|
The first attempt above tries to exclude \samp{bat} by requiring that
|
|
|
|
the first character of the extension is not a \samp{b}. This is
|
|
|
|
wrong, because the pattern also doesn't match \samp{foo.bar}.
|
|
|
|
|
|
|
|
% Messes up the HTML without the curly braces around \^
|
|
|
|
\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
|
|
|
|
|
|
|
|
The expression gets messier when you try to patch up the first
|
|
|
|
solution by requiring one of the following cases to match: the first
|
|
|
|
character of the extension isn't \samp{b}; the second character isn't
|
|
|
|
\samp{a}; or the third character isn't \samp{t}. This accepts
|
|
|
|
\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
|
|
|
|
three-letter extension and won't accept a filename with a two-letter
|
|
|
|
extension such as \samp{sendmail.cf}. We'll complicate the pattern
|
|
|
|
again in an effort to fix it.
|
|
|
|
|
|
|
|
\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
|
|
|
|
|
|
|
|
In the third attempt, the second and third letters are both made
|
|
|
|
optional in order to allow matching extensions shorter than three
|
|
|
|
characters, such as \samp{sendmail.cf}.
|
|
|
|
|
|
|
|
The pattern's getting really complicated now, which makes it hard to
|
|
|
|
read and understand. Worse, if the problem changes and you want to
|
|
|
|
exclude both \samp{bat} and \samp{exe} as extensions, the pattern
|
|
|
|
would get even more complicated and confusing.
|
|
|
|
|
|
|
|
A negative lookahead cuts through all this:
|
|
|
|
|
|
|
|
\regexp{.*[.](?!bat\$).*\$}
|
|
|
|
% $
|
|
|
|
|
|
|
|
The negative lookahead means: if the expression \regexp{bat\$} doesn't match at
|
|
|
|
this point, try the rest of the pattern; if \regexp{bat\$} does match,
|
|
|
|
the whole pattern will fail. The trailing \regexp{\$} is required to
|
|
|
|
ensure that something like \samp{sample.batch}, where the extension
|
|
|
|
only starts with \samp{bat}, will be allowed.
|
|
|
|
|
|
|
|
Excluding another filename extension is now easy; simply add it as an
|
|
|
|
alternative inside the assertion. The following pattern excludes
|
|
|
|
filenames that end in either \samp{bat} or \samp{exe}:
|
|
|
|
|
|
|
|
\regexp{.*[.](?!bat\$|exe\$).*\$}
|
|
|
|
% $
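The patterns from this section can be checked against the filenames
used in the discussion. In the sketch below, \samp{setup.exe} and the
helper name \code{accepted} are additions made for illustration.

```python
import re

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'printers.conf', 'sample.batch', 'setup.exe']

first_attempt = re.compile(r'.*[.][^b].*$')
no_bat = re.compile(r'.*[.](?!bat$).*$')
no_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

def accepted(pattern):
    """Return the sample names that the pattern matches."""
    return [n for n in names if pattern.match(n)]

too_strict = accepted(first_attempt)
without_bat = accepted(no_bat)
without_bat_or_exe = accepted(no_bat_or_exe)
```

The first attempt wrongly drops \samp{foo.bar} and
\samp{sample.batch}, while the lookahead versions reject only the
extensions they name.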
|
|
|
|
|
|
|
|
|
|
|
|
\section{Modifying Strings}
|
|
|
|
|
|
|
|
Up to this point, we've simply performed searches against a static
|
|
|
|
string. Regular expressions are also commonly used to modify a string
|
|
|
|
in various ways, using the following \class{RegexObject} methods:
|
|
|
|
|
|
|
|
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
|
|
|
|
\lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
|
|
|
|
\lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
|
|
|
|
\lineii{subn()}{Does the same thing as \method{sub()},
|
|
|
|
but returns the new string and the number of replacements}
|
|
|
|
\end{tableii}
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Splitting Strings}
|
|
|
|
|
|
|
|
The \method{split()} method of a \class{RegexObject} splits a string
|
|
|
|
apart wherever the RE matches, returning a list of the pieces.
|
|
|
|
It's similar to the \method{split()} method of strings but provides
much more generality in the delimiters that you can split by; the
string method only supports splitting by whitespace or by
|
|
|
|
a fixed string. As you'd expect, there's a module-level
|
|
|
|
\function{re.split()} function, too.
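The extra generality can be sketched with a made-up line that mixes
three kinds of delimiter:

```python
import re

line = 'apples, pears;plums   cherries'

# str.split() with no argument splits on runs of whitespace only.
by_whitespace = line.split()

# re.split() can treat commas, semicolons and whitespace alike.
by_pattern = re.split(r'[,;\s]+', line)
```

The string method leaves the punctuation attached to the words, while
the regular expression version returns the four bare words.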
|
|
|
|
|
|
|
|
\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
|
|
|
|
Split \var{string} by the matches of the regular expression. If
|
|
|
|
capturing parentheses are used in the RE, then their contents will
|
|
|
|
also be returned as part of the resulting list. If \var{maxsplit}
|
|
|
|
is nonzero, at most \var{maxsplit} splits are performed.
|
|
|
|
\end{methoddesc}
|
|
|
|
|
|
|
|
You can limit the number of splits made by passing a value for
|
|
|
|
\var{maxsplit}. When \var{maxsplit} is nonzero, at most
|
|
|
|
\var{maxsplit} splits will be made, and the remainder of the string is
|
|
|
|
returned as the final element of the list. In the following example,
|
|
|
|
the delimiter is any sequence of non-alphanumeric characters.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile(r'\W+')
|
|
|
|
>>> p.split('This is a test, short and sweet, of split().')
|
|
|
|
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
|
|
|
|
>>> p.split('This is a test, short and sweet, of split().', 3)
|
|
|
|
['This', 'is', 'a', 'test, short and sweet, of split().']
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Sometimes you're not only interested in what the text between
|
|
|
|
delimiters is, but also need to know what the delimiter was. If
|
|
|
|
capturing parentheses are used in the RE, then their values are also
|
|
|
|
returned as part of the list. Compare the following calls:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile(r'\W+')
|
|
|
|
>>> p2 = re.compile(r'(\W+)')
|
|
|
|
>>> p.split('This... is a test.')
|
|
|
|
['This', 'is', 'a', 'test', '']
|
|
|
|
>>> p2.split('This... is a test.')
|
|
|
|
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
The module-level function \function{re.split()} adds the RE to be
|
|
|
|
used as the first argument, but is otherwise the same.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> re.split(r'[\W]+', 'Words, words, words.')
|
|
|
|
['Words', 'words', 'words', '']
|
|
|
|
>>> re.split(r'([\W]+)', 'Words, words, words.')
|
|
|
|
['Words', ', ', 'words', ', ', 'words', '.', '']
|
|
|
|
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
|
|
|
|
['Words', 'words, words.']
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
\subsection{Search and Replace}
|
|
|
|
|
|
|
|
Another common task is to find all the matches for a pattern, and
|
|
|
|
replace them with a different string. The \method{sub()} method takes
|
|
|
|
a replacement value, which can be either a string or a function, and
|
|
|
|
the string to be processed.
|
|
|
|
|
|
|
|
\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
|
|
|
|
Returns the string obtained by replacing the leftmost non-overlapping
|
|
|
|
occurrences of the RE in \var{string} by the replacement
|
|
|
|
\var{replacement}. If the pattern isn't found, \var{string} is returned
|
|
|
|
unchanged.
|
|
|
|
|
|
|
|
The optional argument \var{count} is the maximum number of pattern
|
|
|
|
occurrences to be replaced; \var{count} must be a non-negative
|
|
|
|
integer. The default value of 0 means to replace all occurrences.
|
|
|
|
\end{methoddesc}
|
|
|
|
|
|
|
|
Here's a simple example of using the \method{sub()} method. It
|
|
|
|
replaces colour names with the word \samp{colour}:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
The \method{subn()} method does the same work, but returns a 2-tuple
|
|
|
|
containing the new string value and the number of replacements
|
|
|
|
that were performed:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('(blue|white|red)')
>>> p.subn('colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn('colour', 'no colours at all')
('no colours at all', 0)
|
|
|
|
\end{verbatim}
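The count returned by \method{subn()} is a convenient way to learn
whether a substitution changed anything. The sketch below wraps this in
a small function; \code{normalize} is a hypothetical helper name, not
part of the \module{re} module.

```python
import re

p = re.compile('(blue|white|red)')

def normalize(line):
    "Return the rewritten line and a flag saying whether it changed"
    # subn() returns a (new_string, number_of_replacements) tuple.
    new_line, count = p.subn('colour', line)
    return new_line, count > 0
```

For instance, \code{normalize('no colours at all')} would return the
unchanged string together with a false flag, so a caller can skip any
further processing for that line.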
|
|
|
|
|
|
|
|
Empty matches are replaced only when they're not
|
|
|
|
adjacent to a previous match.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('x*')
|
|
|
|
>>> p.sub('-', 'abxd')
|
|
|
|
'-a-b-d-'
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
If \var{replacement} is a string, any backslash escapes in it are
|
|
|
|
processed. That is, \samp{\e n} is converted to a single newline
|
|
|
|
character, \samp{\e r} is converted to a carriage return, and so forth.
|
|
|
|
Unknown escapes such as \samp{\e j} are left alone. Backreferences,
|
|
|
|
such as \samp{\e 6}, are replaced with the substring matched by the
|
|
|
|
corresponding group in the RE. This lets you incorporate
|
|
|
|
portions of the original text in the resulting
|
|
|
|
replacement string.
|
|
|
|
|
|
|
|
This example matches the word \samp{section} followed by a string
|
|
|
|
enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
|
|
|
|
\samp{subsection}:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
|
|
|
|
>>> p.sub(r'subsection{\1}','section{First} section{second}')
|
|
|
|
'subsection{First} subsection{second}'
|
|
|
|
\end{verbatim}
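Backreferences can also reorder the captured text. As a small sketch,
assuming input of the form \samp{Surname, Givenname} (the sample name
and the variable names are illustrative):

```python
import re

# Two groups: the surname before the comma, the given name after it.
p = re.compile(r'(\w+), (\w+)')

# \2 and \1 in the replacement refer to the second and first group.
swapped = p.sub(r'\2 \1', 'Kuchling, Andrew')
```

The replacement is written as a raw string so that the backslashes
reach the \module{re} module unchanged.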
|
|
|
|
|
|
|
|
There's also a syntax for referring to named groups as defined by the
|
|
|
|
\regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the
|
|
|
|
substring matched by the group named \samp{name}, and
|
|
|
|
\samp{\e g<\var{number}>}
|
|
|
|
uses the corresponding group number.
|
|
|
|
\samp{\e g<2>} is therefore equivalent to \samp{\e 2},
|
|
|
|
but isn't ambiguous in a
|
|
|
|
replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be
|
|
|
|
interpreted as a reference to group 20, not a reference to group 2
|
|
|
|
followed by the literal character \character{0}.) The following
|
|
|
|
substitutions are all equivalent, but use all three variations of the
|
|
|
|
replacement string.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
|
|
|
|
>>> p.sub(r'subsection{\1}','section{First}')
|
|
|
|
'subsection{First}'
|
|
|
|
>>> p.sub(r'subsection{\g<1>}','section{First}')
|
|
|
|
'subsection{First}'
|
|
|
|
>>> p.sub(r'subsection{\g<name>}','section{First}')
|
|
|
|
'subsection{First}'
|
|
|
|
\end{verbatim}
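The ambiguity described above can be seen with a pattern that has only
one group. This is a sketch with made-up input, and the variable names
are illustrative.

```python
import re

p = re.compile(r'(\d)')

# \g<1>0 means "group 1, followed by a literal 0".
result = p.sub(r'\g<1>0', 'a1b2')

# \10 is read as a reference to group 10, which this pattern lacks,
# so the re module rejects the replacement template.
try:
    p.sub(r'\10', 'a1b2')
    short_form_accepted = True
except re.error:
    short_form_accepted = False
```

Using the \samp{\e g<\var{number}>} form whenever a digit follows the
reference avoids having to think about how many groups the RE contains.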
|
|
|
|
|
|
|
|
\var{replacement} can also be a function, which gives you even more
|
|
|
|
control. If \var{replacement} is a function, the function is
|
|
|
|
called for every non-overlapping occurrence of \var{pattern}. On each
|
|
|
|
call, the function is
|
|
|
|
passed a \class{MatchObject} argument for the match
|
|
|
|
and can use this information to compute the desired replacement string and return it.
|
|
|
|
|
|
|
|
In the following example, the replacement function translates
|
|
|
|
decimals into hexadecimal:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
|
|
|
|
...
|
|
|
|
>>> p = re.compile(r'\d+')
|
|
|
|
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
|
|
|
|
'Call 0xffd2 for printing, 0xc000 for user code.'
|
|
|
|
\end{verbatim}
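A replacement function can perform any computation on the matched text,
as long as it returns a string. As a further sketch, the hypothetical
function \code{increment} below adds one to every number it is given:

```python
import re

def increment(match):
    "Return the matched number plus one, as a string"
    # match.group() is the text matched by the whole RE.
    return str(int(match.group()) + 1)

p = re.compile(r'\d+')
bumped = p.sub(increment, 'chapter 9, page 41')
```

Note that the result of the arithmetic is converted back with
\function{str()}; the function's return value is inserted into the
output as-is.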
|
|
|
|
|
|
|
|
When using the module-level \function{re.sub()} function, the pattern
|
|
|
|
is passed as the first argument. The pattern may be a string or a
|
|
|
|
\class{RegexObject}; if you need to specify regular expression flags,
|
|
|
|
you must either use a \class{RegexObject} as the first parameter, or use
|
|
|
|
embedded modifiers in the pattern, e.g. \code{sub("(?i)b+", "x", "bbbb
|
|
|
|
BBBB")} returns \code{'x x'}.
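The two options can be compared side by side. This is only a sketch of
the calls described above, with illustrative variable names:

```python
import re

# Option 1: compile with the flag and pass the compiled object.
p = re.compile('b+', re.IGNORECASE)
with_object = re.sub(p, 'x', 'bbbb BBBB')

# Option 2: embed the modifier at the start of the pattern string.
with_modifier = re.sub('(?i)b+', 'x', 'bbbb BBBB')
```

Both calls give the same result; the compiled object is the better
choice if the same RE will be used more than once.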
|
|
|
|
|
|
|
|
\section{Common Problems}
|
|
|
|
|
|
|
|
Regular expressions are a powerful tool for some applications, but
their behaviour isn't always intuitive, and at times they don't
behave the way you might expect. This section points out
some of the most common pitfalls.
|
|
|
|
|
|
|
|
\subsection{Use String Methods}
|
|
|
|
|
|
|
|
Sometimes using the \module{re} module is a mistake. If you're
|
|
|
|
matching a fixed string, or a single character class, and you're not
|
|
|
|
using any \module{re} features such as the \constant{IGNORECASE} flag,
|
|
|
|
then the full power of regular expressions may not be required.
|
|
|
|
Strings have several methods for performing operations with fixed
|
|
|
|
strings and they're usually much faster, because the implementation is
|
|
|
|
a single small C loop that's been optimized for the purpose, instead
|
|
|
|
of the large, more generalized regular expression engine.
|
|
|
|
|
|
|
|
One example might be replacing a single fixed string with another
|
|
|
|
one; for instance, you might replace \samp{word}
|
|
|
|
with \samp{deed}. \code{re.sub()} seems like the function to use for
|
|
|
|
this, but consider the \method{replace()} method. Note that
|
|
|
|
\method{replace()} will also replace \samp{word} inside
|
|
|
|
words, turning \samp{swordfish} into \samp{sdeedfish}, but the
|
|
|
|
na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing
|
|
|
|
the substitution on parts of words, the pattern would have to be
|
|
|
|
\regexp{\e bword\e b}, in order to require that \samp{word} have a
|
|
|
|
word boundary on either side. This takes the job beyond
|
|
|
|
\method{replace()}'s abilities.)
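The difference between the two approaches can be sketched as follows;
the sample text and variable names are illustrative.

```python
import re

text = 'swordfish and word'

# The string method replaces every occurrence, even inside other words.
plain = text.replace('word', 'deed')

# \b restricts the match to 'word' standing on its own.
bounded = re.sub(r'\bword\b', 'deed', text)
```

If replacing inside longer words is acceptable, the string method does
the job without involving the regular expression engine at all.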
|
|
|
|
|
|
|
|
Another common task is deleting every occurrence of a single character
|
|
|
|
from a string or replacing it with another single character. You
|
|
|
|
might do this with something like \code{re.sub('\e n', ' ', S)}, but
|
|
|
|
\method{translate()} is capable of doing both tasks
|
|
|
|
and will be faster than any regular expression operation can be.
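Both tasks can be sketched with \method{translate()} as shown below.
This example assumes a Python version in which the translation table is
built with \code{str.maketrans()}; older versions construct the table
differently. The sample text is made up.

```python
text = 'line one\nline two\n'

# Replace each newline with a space (one character mapped to another).
spaced = text.translate(str.maketrans('\n', ' '))

# Delete every newline: the third argument lists characters to remove.
joined = text.translate(str.maketrans('', '', '\n'))
```

The same table can be reused for many strings, so it is worth building
it once outside any loop.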
|
|
|
|
|
|
|
|
In short, before turning to the \module{re} module, consider whether
|
|
|
|
your problem can be solved with a faster and simpler string method.
|
|
|
|
|
|
|
|
\subsection{match() versus search()}
|
|
|
|
|
|
|
|
The \function{match()} function only checks whether the RE matches at
the beginning of the string, while \function{search()} scans
forward through the string for a match.
It's important to keep this distinction in mind. Remember,
\function{match()} only reports a successful match that
starts at position 0; if the match wouldn't start at position 0,
\function{match()} will \emph{not} report it.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
On the other hand, \function{search()} will scan forward through the
|
|
|
|
string, reporting the first match it finds.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
Sometimes you'll be tempted to keep using \function{re.match()}, and
|
|
|
|
just add \regexp{.*} to the front of your RE. Resist this temptation
|
|
|
|
and use \function{re.search()} instead. The regular expression
|
|
|
|
compiler does some analysis of REs in order to speed up the process of
|
|
|
|
looking for a match. One such analysis figures out what the first
|
|
|
|
character of a match must be; for example, a pattern starting with
|
|
|
|
\regexp{Crow} must match starting with a \character{C}. The analysis
|
|
|
|
lets the engine quickly scan through the string looking for the
|
|
|
|
starting character, only trying the full match if a \character{C} is found.
|
|
|
|
|
|
|
|
Adding \regexp{.*} defeats this optimization, requiring scanning to
|
|
|
|
the end of the string and then backtracking to find a match for the
|
|
|
|
rest of the RE. Use \function{re.search()} instead.
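The two approaches also report different spans, which is one more
reason to prefer \function{re.search()}. A minimal sketch, with
illustrative variable names:

```python
import re

# Discouraged: .* first runs to the end of the string, then backtracks.
via_match = re.match('.*super', 'insuperable')

# Preferred: search() looks for the pattern itself.
via_search = re.search('super', 'insuperable')

match_span = via_match.span()
search_span = via_search.span()
```

With the \regexp{.*} prefix the reported match starts at 0 and includes
the skipped text, so the position of \samp{super} itself is no longer
directly available from the span.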
|
|
|
|
|
|
|
|
\subsection{Greedy versus Non-Greedy}
|
|
|
|
|
|
|
|
When repeating a regular expression, as in \regexp{a*}, the resulting
|
|
|
|
action is to consume as much of the string as possible. This
|
|
|
|
fact often bites you when you're trying to match a pair of
|
|
|
|
balanced delimiters, such as the angle brackets surrounding an HTML
|
|
|
|
tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't
|
|
|
|
work because of the greedy nature of \regexp{.*}.
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> s = '<html><head><title>Title</title>'
|
|
|
|
>>> len(s)
|
|
|
|
32
|
|
|
|
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
The RE matches the \character{<} in \samp{<html>}, and the
|
|
|
|
\regexp{.*} consumes the rest of the string. There's still more left
|
|
|
|
in the RE, though, and the \regexp{>} can't match at the end of
|
|
|
|
the string, so the regular expression engine has to backtrack
|
|
|
|
character by character until it finds a match for the \regexp{>}.
|
|
|
|
The final match extends from the \character{<} in \samp{<html>}
|
|
|
|
to the \character{>} in \samp{</title>}, which isn't what you want.
|
|
|
|
|
|
|
|
In this case, the solution is to use the non-greedy quantifiers
|
|
|
|
\regexp{*?}, \regexp{+?}, \regexp{??}, or
|
|
|
|
\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
|
|
|
|
possible. In the above example, the \character{>} is tried
|
|
|
|
immediately after the first \character{<} matches, and when it fails,
|
|
|
|
the engine advances a character at a time, retrying the \character{>}
|
|
|
|
at every step. This produces just the right result:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
>>> print(re.match('<.*?>', s).group())
|
|
|
|
<html>
|
|
|
|
\end{verbatim}
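The effect is easy to see with \function{re.findall()}, which returns
every non-overlapping match. The sketch below reuses the string from
the example above; the third pattern, a negated character class, is an
alternative added here for comparison and is not discussed in the text.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: a single match spanning the whole string.
greedy = re.findall('<.*>', s)

# Non-greedy: each match stops at the first '>'.
lazy = re.findall('<.*?>', s)

# Alternative: forbid '>' inside the tag instead of using *?.
negated = re.findall('<[^>]*>', s)
```

The non-greedy and negated-class patterns agree on this input; they can
differ on other inputs, for example when the text contains newlines,
which \regexp{.} does not match by default.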
|
|
|
|
|
|
|
|
(Note that parsing HTML or XML with regular expressions is painful.
|
|
|
|
Quick-and-dirty patterns will handle common cases, but HTML and XML
|
|
|
|
have special cases that will break the obvious regular expression; by
|
|
|
|
the time you've written a regular expression that handles all of the
|
|
|
|
possible cases, the patterns will be \emph{very} complicated. Use an
|
|
|
|
HTML or XML parser module for such tasks.)
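As a rough sketch of the parser-based approach, the example below
collects start tags using the \module{html.parser} module from the
standard library of recent Python versions; this choice of module is an
assumption, and \class{TagCollector} is a hypothetical helper class
written for the example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    "Hypothetical helper that records the name of every start tag"

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once for each opening tag it encounters.
        self.tags.append(tag)

collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
collector.close()
```

After the call to \method{feed()}, \code{collector.tags} holds the tag
names in document order, without any regular expression having to deal
with attributes, comments, or quoting.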
|
|
|
|
|
|
|
|
\subsection{Not Using re.VERBOSE}
|
|
|
|
|
|
|
|
By now you've probably noticed that regular expressions are a very
|
|
|
|
compact notation, but they're not terribly readable. REs of
|
|
|
|
moderate complexity can become lengthy collections of backslashes,
|
|
|
|
parentheses, and metacharacters, making them difficult to read and
|
|
|
|
understand.
|
|
|
|
|
|
|
|
For such REs, specifying the \code{re.VERBOSE} flag when
|
|
|
|
compiling the regular expression can be helpful, because it allows
|
|
|
|
you to format the regular expression more clearly.
|
|
|
|
|
|
|
|
The \code{re.VERBOSE} flag has several effects. Whitespace in the
|
|
|
|
regular expression that \emph{isn't} inside a character class is
|
|
|
|
ignored. This means that an expression such as \regexp{dog | cat} is
|
|
|
|
equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
|
|
|
|
will still match the characters \character{a}, \character{b}, or a
|
|
|
|
space. You can also put comments inside a RE; comments
|
|
|
|
extend from a \samp{\#} character to the next newline. When used with
|
|
|
|
triple-quoted strings, this enables REs to be formatted more neatly:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
pat = re.compile(r"""
|
|
|
|
\s* # Skip leading whitespace
|
|
|
|
(?P<header>[^:]+) # Header name
|
|
|
|
\s* : # Whitespace, and a colon
|
|
|
|
(?P<value>.*?) # The header's value -- *? used to
|
|
|
|
# lose the following trailing whitespace
|
|
|
|
\s*$ # Trailing whitespace to end-of-line
|
|
|
|
""", re.VERBOSE)
|
|
|
|
\end{verbatim}
|
|
|
|
% $
|
|
|
|
|
|
|
|
This is far more readable than:
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
|
|
|
|
\end{verbatim}
|
|
|
|
% $
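Both forms compile to the same RE, which can be checked by applying
them to the same input. The sketch below does so; the header line and
the names \code{verbose_pat}, \code{compact_pat} and \code{line} are
invented for the example.

```python
import re

verbose_pat = re.compile(r"""
    \s*                 # Skip leading whitespace
    (?P<header>[^:]+)   # Header name
    \s* :               # Whitespace, and a colon
    (?P<value>.*?)      # The header's value
    \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = 'From: author@example.com   '

# groupdict() maps each group name to the text it matched.
verbose_groups = verbose_pat.match(line).groupdict()
compact_groups = compact_pat.match(line).groupdict()
```

Note that the captured value keeps the space that follows the colon,
because the pattern skips whitespace only before the colon and at the
end of the line; calling \method{strip()} on the value removes it.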
|
|
|
|
|
|
|
|
\section{Feedback}
|
|
|
|
|
|
|
|
Regular expressions are a complicated topic. Did this document help
|
|
|
|
you understand them? Were there parts that were unclear, or problems
|
|
|
|
you encountered that weren't covered here? If so, please send
|
|
|
|
suggestions for improvements to the author.
|
|
|
|
|
|
|
|
The most complete book on regular expressions is almost certainly
|
|
|
|
Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
|
|
|
|
by O'Reilly. Unfortunately, it concentrates exclusively on the Perl and
Java flavours of regular expressions, and doesn't contain any Python
material at all, so it won't be useful as a reference for programming
in Python. (The first edition covered Python's now-removed
\module{regex} module, which won't help you much.) Consider checking
it out from your library.
|
|
|
|
|
|
|
|
\end{document}
|
|
|
|
|