\documentclass{howto}
% TODO:
% Document lookbehind assertions
% Better way of displaying a RE, a string, and what it matches
% Mention optional argument to match.groups()
% Unicode (at least a reference)
\title{Regular Expression HOWTO}
\release{0.05}
\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}
\begin{document}
\maketitle
\begin{abstract}
\noindent
This document is an introductory tutorial to using regular expressions
in Python with the \module{re} module. It provides a gentler
introduction than the corresponding section in the Library Reference.
This document is available from
\url{http://www.amk.ca/python/howto}.
\end{abstract}
\tableofcontents
\section{Introduction}
The \module{re} module was added in Python 1.5, and provides
Perl-style regular expression patterns. Earlier versions of Python
came with the \module{regex} module, which provided Emacs-style
patterns. The \module{regex} module was removed in Python 2.5.

Regular expressions (or REs) are essentially a tiny, highly
specialized programming language embedded inside Python and made
available through the \module{re} module. Using this little language,
you specify the rules for the set of possible strings that you want to
match; this set might contain English sentences, or e-mail addresses,
or TeX commands, or anything you like. You can then ask questions
such as ``Does this string match the pattern?'', or ``Is there a match
for the pattern anywhere in this string?''. You can also use REs to
modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes
which are then executed by a matching engine written in C. For
advanced use, it may be necessary to pay careful attention to how the
engine will execute a given RE, and write the RE in a certain way in
order to produce bytecode that runs faster. Optimization isn't
covered in this document, because it requires that you have a good
understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so
not all possible string processing tasks can be done using regular
expressions. There are also tasks that \emph{can} be done with
regular expressions, but the expressions turn out to be very
complicated. In these cases, you may be better off writing Python
code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.
\section{Simple Patterns}
We'll start by learning about the simplest possible regular
expressions. Since regular expressions are used to operate on
strings, we'll begin with the most common task: matching characters.

For a detailed explanation of the computer science underlying regular
expressions (deterministic and non-deterministic finite automata), you
can refer to almost any textbook on writing compilers.
\subsection{Matching Characters}
Most letters and characters will simply match themselves. For
example, the regular expression \regexp{test} will match the string
\samp{test} exactly. (You can enable a case-insensitive mode that
would let this RE match \samp{Test} or \samp{TEST} as well; more
about this later.)

There are exceptions to this rule; some characters are
special \dfn{metacharacters}, and don't match themselves. Instead, they signal that some
out-of-the-ordinary thing should be matched, or they affect other
portions of the RE by repeating them. Much of this document is
devoted to discussing various metacharacters and what they do.

Here's a complete list of the metacharacters; their meanings will be
discussed in the rest of this HOWTO.
\begin{verbatim}
. ^ $ * + ? { } [ ] \ | ( )
\end{verbatim}
% $
The first metacharacters we'll look at are \samp{[} and \samp{]}.
They're used for specifying a character class, which is a set of
characters that you wish to match. Characters can be listed
individually, or a range of characters can be indicated by giving two
characters and separating them by a \character{-}. For example,
\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
\samp{c}; this is the same as
\regexp{[a-c]}, which uses a range to express the same set of
characters. If you wanted to match only lowercase letters, your
RE would be \regexp{[a-z]}.

Metacharacters are not active inside classes. For example,
\regexp{[akm\$]} will match any of the characters \character{a},
\character{k}, \character{m}, or \character{\$}; \character{\$} is
usually a metacharacter, but inside a character class it's stripped of
its special nature.

You can match the characters not within a range by \dfn{complementing}
the set. This is indicated by including a \character{\^} as the first
character of the class; \character{\^} elsewhere will simply match the
\character{\^} character. For example, \verb|[^5]| will match any
character except \character{5}.
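
The character classes described above can be tried out directly. The
following is a minimal sketch, assuming a Python 3 interpreter; the
sample string is invented for illustration, and it relies on
\function{re.findall()}, which is described later in this document.

```python
import re

# Sample text invented for this illustration.
text = "a5k$m^z"

# [akm$] lists four literal characters; '$' is not special inside a class.
listed = re.findall(r"[akm$]", text)

# [a-z] uses a range to match any lowercase ASCII letter.
lowercase = re.findall(r"[a-z]", text)

# [^5] is a complemented class: any character except '5'.
not_five = re.findall(r"[^5]", text)

# A '^' that is not the first character of the class is a literal caret.
caret_or_five = re.findall(r"[5^]", text)

print(listed)         # ['a', 'k', '$', 'm']
print(lowercase)      # ['a', 'k', 'm', 'z']
print(not_five)       # ['a', 'k', '$', 'm', '^', 'z']
print(caret_or_five)  # ['5', '^']
```
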
Perhaps the most important metacharacter is the backslash, \samp{\e}.
As in Python string literals, the backslash can be followed by various
characters to signal various special sequences. It's also used to escape
all the metacharacters so you can still match them in patterns; for
example, if you need to match a \samp{[} or
\samp{\e}, you can precede them with a backslash to remove their
special meaning: \regexp{\e[} or \regexp{\e\e}.
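
As a small illustration of escaping, the sketch below searches for a
literal \samp{[} and a literal backslash. It assumes Python 3 and uses
raw string literals (explained in the section on the backslash plague)
so that the patterns appear exactly as written above; the sample string
is hypothetical.

```python
import re

# Hypothetical sample string; the raw literal keeps its single backslash.
text = r"path\to[file]"

# \[ matches a literal '[' and \\ matches a literal backslash.
bracket = re.search(r"\[", text)
backslash = re.search(r"\\", text)

print(bracket.start())    # 7
print(backslash.start())  # 4
```
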
Some of the special sequences beginning with \character{\e} represent
predefined sets of characters that are often useful, such as the set
of digits, the set of letters, or the set of anything that isn't
whitespace. The following predefined special sequences are available:
\begin{itemize}
\item[\code{\e d}]Matches any decimal digit; this is
equivalent to the class \regexp{[0-9]}.
\item[\code{\e D}]Matches any non-digit character; this is
equivalent to the class \verb|[^0-9]|.
\item[\code{\e s}]Matches any whitespace character; this is
equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
\item[\code{\e S}]Matches any non-whitespace character; this is
equivalent to the class \verb|[^ \t\n\r\f\v]|.
\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
\regexp{[a-zA-Z0-9_]}.
\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
\verb|[^a-zA-Z0-9_]|.
\end{itemize}
These sequences can be included inside a character class. For
example, \regexp{[\e s,.]} is a character class that will match any
whitespace character, or \character{,} or \character{.}.
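
The predefined sequences can be checked against a short string. This is
only a sketch, assuming Python 3 and an invented ASCII-only sample; note
that for Python 3 text strings these sequences also match non-ASCII
digits, letters, and whitespace unless the \code{re.ASCII} flag is
given, so the bracketed equivalents listed above describe the ASCII
case.

```python
import re

# ASCII-only sample invented for this illustration; "\t" is a tab.
text = "x_1, y.\tz"

digits = re.findall(r"\d", text)          # decimal digits
word_chars = re.findall(r"\w", text)      # letters, digits and underscore
separators = re.findall(r"[\s,.]", text)  # whitespace, ',' or '.'

print(digits)      # ['1']
print(word_chars)  # ['x', '_', '1', 'y', 'z']
print(separators)  # [',', ' ', '.', '\t']
```
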
The final metacharacter in this section is \regexp{.}. It matches
anything except a newline character, and there's an alternate mode
(\code{re.DOTALL}) where it will match even a newline. \character{.}
is often used where you want to match ``any character''.
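
The effect of \code{re.DOTALL} is easy to see on a string that contains
a newline. A minimal sketch, assuming Python 3 and an invented
three-character sample:

```python
import re

text = "a\nb"  # three characters: 'a', a newline, 'b'

# By default '.' skips the newline; with re.DOTALL it matches it too.
without_flag = re.findall(r".", text)
with_dotall = re.findall(r".", text, re.DOTALL)

print(without_flag)  # ['a', 'b']
print(with_dotall)   # ['a', '\n', 'b']
```
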
\subsection{Repeating Things}
Being able to match varying sets of characters is the first thing
regular expressions can do that isn't already possible with the
methods available on strings. However, if that was the only
additional capability of regexes, they wouldn't be much of an advance.
Another capability is that you can specify that portions of the RE
must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at is
\regexp{*}. \regexp{*} doesn't match the literal character \samp{*};
instead, it specifies that the previous character can be matched zero
or more times, instead of exactly once.

For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
characters), and so forth. The RE engine has various internal
limitations stemming from the size of C's \code{int} type that will
prevent it from matching over 2 billion \samp{a} characters; you
probably don't have enough memory to construct a string that large, so
you shouldn't run into that limit.
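
The \regexp{ca*t} example can be verified with a short script. This is
a sketch assuming Python 3.4 or later; it uses
\function{re.fullmatch()}, which is not otherwise covered in this
document and succeeds only when the entire string matches. The extra
candidate \samp{cut} is added here as a counterexample.

```python
import re

candidates = ["ct", "cat", "caaat", "cut"]

# Map each candidate to whether the whole string matches ca*t.
star_results = {s: bool(re.fullmatch(r"ca*t", s)) for s in candidates}

print(star_results)
# {'ct': True, 'cat': True, 'caaat': True, 'cut': False}
```
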
Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
the matching engine will try to repeat it as many times as possible.
If later portions of the pattern don't match, the matching engine will
then back up and try again with fewer repetitions.

A step-by-step example will make this more obvious. Let's consider
the expression \regexp{a[bcd]*b}. This matches the letter
\character{a}, zero or more letters from the class \code{[bcd]}, and
finally ends with a \character{b}. Now imagine matching this RE
against the string \samp{abcbd}.
\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
it can, which is to the end of the string.}
\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
current position is at the end of the string, so it fails.}
\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
one less character.}
\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
current position is at the last character, which is a \character{d}.}
\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
only matching \samp{bc}.}
\lineiii{7}{\code{abcb}}{Try \regexp{b} again. This time
the character at the current position is \character{b}, so it succeeds.}
\end{tableiii}
The end of the RE has now been reached, and it has matched
\samp{abcb}. This demonstrates how the matching engine goes as far as
it can at first, and if no match is found it will then progressively
back up and retry the rest of the RE again and again. It will back up
until it has tried zero matches for \regexp{[bcd]*}, and if that
subsequently fails, the engine will conclude that the string doesn't
match the RE at all.
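
The outcome of the walkthrough above can be confirmed in code. The
sketch below assumes Python 3 and uses \function{re.match()}, described
later in this document; the second string, \samp{acd}, is an added
example of a case where backing up never finds a \character{b}.

```python
import re

# Greedy matching followed by backtracking settles on 'abcb'.
match = re.match(r"a[bcd]*b", "abcbd")
matched_text = match.group()
matched_span = match.span()

# No 'b' is available after the 'a', so every retry fails.
failed = re.match(r"a[bcd]*b", "acd")

print(matched_text)  # abcb
print(matched_span)  # (0, 4)
print(failed)        # None
```
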
Another repeating metacharacter is \regexp{+}, which matches one or
more times. Pay careful attention to the difference between
\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
times, so whatever's being repeated may not be present at all, while
\regexp{+} requires at least \emph{one} occurrence. To use a similar
example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
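
The same kind of check shows the difference between \regexp{*} and
\regexp{+}. As before, this is a sketch assuming Python 3.4 or later,
with \function{re.fullmatch()} used to require that the whole string
matches.

```python
import re

candidates = ["ct", "cat", "caaat"]

# '+' needs at least one 'a', so 'ct' is rejected; '*' accepts it.
plus_results = {s: bool(re.fullmatch(r"ca+t", s)) for s in candidates}
star_results = {s: bool(re.fullmatch(r"ca*t", s)) for s in candidates}

print(plus_results)  # {'ct': False, 'cat': True, 'caaat': True}
print(star_results)  # {'ct': True, 'cat': True, 'caaat': True}
```
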
There are two more repeating qualifiers. The question mark character,
\regexp{?}, matches either once or zero times; you can think of it as
marking something as being optional. For example, \regexp{home-?brew}
matches either \samp{homebrew} or \samp{home-brew}.
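
A brief sketch of the optional hyphen, assuming Python 3.4 or later;
\samp{home--brew} is an added counterexample with two hyphens, and
\function{re.fullmatch()} requires the whole string to match.

```python
import re

candidates = ["homebrew", "home-brew", "home--brew"]

# '-?' allows zero or one hyphen, but not two.
optional_results = {s: bool(re.fullmatch(r"home-?brew", s)) for s in candidates}

print(optional_results)
# {'homebrew': True, 'home-brew': True, 'home--brew': False}
```
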
The most complicated repeated qualifier is
\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
integers. This qualifier means there must be at least \var{m}
repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b}
will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match
\samp{ab}, which has no slashes, or \samp{a////b}, which has four.

You can omit either \var{m} or \var{n}; in that case, a reasonable
value is assumed for the missing value. Omitting \var{m} is
interpreted as a lower limit of 0, while omitting \var{n} results in an
upper bound of infinity --- actually, the 2 billion limit mentioned
earlier, but that might as well be infinity.
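
The bounded qualifier, including the forms with an omitted bound, can
be exercised as follows. This is a sketch assuming Python 3.4 or later;
the helper \function{accepted()} is a hypothetical name introduced only
for this example.

```python
import re

samples = ["ab", "a/b", "a//b", "a///b", "a////b"]


def accepted(pattern):
    """Return the samples that the pattern matches in full (hypothetical helper)."""
    return [s for s in samples if re.fullmatch(pattern, s)]


one_to_three = accepted(r"a/{1,3}b")  # between one and three slashes
at_most_two = accepted(r"a/{,2}b")    # m omitted: lower limit of 0
at_least_two = accepted(r"a/{2,}b")   # n omitted: no practical upper limit

print(one_to_three)  # ['a/b', 'a//b', 'a///b']
print(at_most_two)   # ['ab', 'a/b', 'a//b']
print(at_least_two)  # ['a//b', 'a///b', 'a////b']
```
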
Readers of a reductionist bent may notice that the three other qualifiers
can all be expressed using this notation. \regexp{\{0,\}} is the same
as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use
\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
they're shorter and easier to read.
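
These equivalences can be spot-checked on a handful of strings. The
sketch below assumes Python 3.4 or later and compares each pair of
patterns on invented samples; agreement on a few strings illustrates
the equivalence but does not prove it. The helper
\function{accepted()} is hypothetical.

```python
import re

samples = ["b", "ab", "aab", "aaab"]


def accepted(pattern):
    """Return the samples that the pattern matches in full (hypothetical helper)."""
    return [s for s in samples if re.fullmatch(pattern, s)]


# Each brace form should accept exactly what its shorthand accepts.
star_same = accepted(r"a{0,}b") == accepted(r"a*b")
plus_same = accepted(r"a{1,}b") == accepted(r"a+b")
optional_same = accepted(r"a{0,1}b") == accepted(r"a?b")

print(star_same, plus_same, optional_same)  # True True True
print(accepted(r"a?b"))                     # ['b', 'ab']
```
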
\section{Using Regular Expressions}
Now that we've looked at some simple regular expressions, how do we
actually use them in Python? The \module{re} module provides an
interface to the regular expression engine, allowing you to compile
REs into objects and then perform matches with them.
\subsection{Compiling Regular Expressions}
Regular expressions are compiled into \class{RegexObject} instances,
which have methods for various operations such as searching for
pattern matches or performing string substitutions.
\begin{verbatim}
>>> import re
>>> p = re.compile('ab*')
>>> print p
<re.RegexObject instance at 80b4150>
\end{verbatim}
\function{re.compile()} also accepts an optional \var{flags}
argument, used to enable various special features and syntax
variations. We'll go over the available settings later, but for now a
single example will do:
\begin{verbatim}
>>> p = re.compile('ab*', re.IGNORECASE)
\end{verbatim}
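
To see what the flag changes, the following sketch compiles the same
pattern with and without \code{re.IGNORECASE} and records what each
object matches at the start of a few invented strings. It assumes
Python 3; the \method{match()} method is described in a later section,
and \function{matched()} is a hypothetical helper.

```python
import re

strings = ["ab", "AB", "aBBb", "ba"]

case_sensitive = re.compile(r"ab*")
case_insensitive = re.compile(r"ab*", re.IGNORECASE)


def matched(pattern, string):
    """Return the text matched at the start of string, or None (hypothetical helper)."""
    m = pattern.match(string)
    return m.group() if m else None


sensitive = [matched(case_sensitive, s) for s in strings]
insensitive = [matched(case_insensitive, s) for s in strings]

print(sensitive)    # ['ab', None, 'a', None]
print(insensitive)  # ['ab', 'AB', 'aBBb', None]
```
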
The RE is passed to \function{re.compile()} as a string. REs are
handled as strings because regular expressions aren't part of the core
Python language, and no special syntax was created for expressing
them. (There are applications that don't need REs at all, so there's
no need to bloat the language specification by including them.)
Instead, the \module{re} module is simply a C extension module
included with Python, just like the \module{socket} or \module{zlib}
module.

Putting REs in strings keeps the Python language simpler, but has one
disadvantage which is the topic of the next section.
\subsection{The Backslash Plague}
As stated earlier, regular expressions use the backslash
character (\character{\e}) to indicate special forms or to allow
special characters to be used without invoking their special meaning.
This conflicts with Python's usage of the same character for the same
purpose in string literals.

Let's say you want to write a RE that matches the string
\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure
out what to write in the program code, start with the desired string
to be matched. Next, you must escape any backslashes and other
metacharacters by preceding them with a backslash, resulting in the
string \samp{\e\e section}. The string passed
to \function{re.compile()} must therefore be \verb|\\section|. However, to
express this as a Python string literal, both backslashes must be
escaped \emph{again}.
\begin{tableii}{c|l}{code}{Characters}{Stage}
\lineii{\e section}{Text string to be matched}
\lineii{\e\e section}{Escaped backslash for \function{re.compile}}
\lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
\end{tableii}
In short, to match a literal backslash, one has to write
\code{'\e\e\e\e'} as the RE string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal. In REs that
feature backslashes repeatedly, this leads to lots of repeated
backslashes and makes the resulting strings difficult to understand.

The solution is to use Python's raw string notation for regular
expressions; backslashes are not handled in any special way in
a string literal prefixed with \character{r}, so \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.

Frequently regular expressions will be expressed in Python
code using this raw string notation.
\begin{tableii}{c|c}{code}{Regular string}{Raw string}
\lineii{"ab*"}{\code{r"ab*"}}
\lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
\lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
\end{tableii}
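
The correspondence between the two notations can be confirmed
directly. This is a minimal sketch assuming Python 3; the line of
\LaTeX\ being searched is invented for the example.

```python
import re

regular = "\\\\section"  # four backslashes typed in the source code
raw = r"\\section"       # two backslashes typed in the source code

# Both literals produce the same 9-character pattern: \\section
same_pattern = (regular == raw)
pattern_length = len(raw)

# The pattern matches a single literal backslash followed by 'section'.
latex_line = r"\section{Introduction}"
found = re.search(raw, latex_line).group()

# "\n" is one character (a newline); r"\n" is a backslash and an 'n'.
newline_lengths = (len("\n"), len(r"\n"))

print(same_pattern, pattern_length)  # True 9
print(found)                         # \section
print(newline_lengths)               # (1, 2)
```
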
\subsection{Performing Matches}
Once you have an object representing a compiled regular expression,
what do you do with it? \class{RegexObject} instances have several
methods and attributes. Only the most significant ones will be
covered here; consult \ulink{the Library
Reference}{http://www.python.org/doc/lib/module-re.html} for a
complete listing.
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{match()}{Determine if the RE matches at the beginning of
the string.}
\lineii{search()}{Scan through a string, looking for any location
where this RE matches.}
\lineii{findall()}{Find all substrings where the RE matches,
and return them as a list.}
\lineii{finditer()}{Find all substrings where the RE matches,
and return them as an iterator.}
\end{tableii}
\method{match()} and \method{search()} return \code{None} if no match
can be found. If they're successful, a \class{MatchObject} instance is
returned, containing information about the match: where it starts and
ends, the substring it matched, and more.
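
The four methods in the table can be compared side by side. The sketch
below assumes Python 3 and an invented sample string; it uses the
\method{group()} and \method{span()} methods of match objects to show
the matched substring and its position.

```python
import re

p = re.compile(r"[a-z]+")
text = "123 abc def"

# match() only looks at the start of the string, which is a digit here.
at_start = p.match(text)

# search() scans forward and stops at the first match.
first = p.search(text)
first_text, first_span = first.group(), first.span()

# findall() returns the matched substrings; finditer() yields match objects.
all_text = p.findall(text)
all_spans = [m.span() for m in p.finditer(text)]

print(at_start)                # None
print(first_text, first_span)  # abc (4, 7)
print(all_text)                # ['abc', 'def']
print(all_spans)               # [(4, 7), (8, 11)]
```
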
You can learn about this by interactively experimenting with the
\module{re} module. If you have Tkinter available, you may also want
to look at \file{Tools/scripts/redemo.py}, a demonstration program
included with the Python distribution. It allows you to enter REs and
strings, and displays whether the RE matches or fails.

\file{redemo.py} can be quite useful when trying to debug a
complicated RE. Phil Schwartz's
r51903 | ronald.oussoren | 2006-09-17 20:42:53 +0200 (Sun, 17 Sep 2006) | 2 lines Port of revision 51902 in release25-maint to the trunk ........ r51904 | ronald.oussoren | 2006-09-17 21:23:27 +0200 (Sun, 17 Sep 2006) | 3 lines Tweak Mac/Makefile in to ensure that pythonw gets rebuild when the major version of python changes (2.5 -> 2.6). Bug #1552935. ........ r51913 | guido.van.rossum | 2006-09-18 23:36:16 +0200 (Mon, 18 Sep 2006) | 2 lines Make this thing executable. ........ r51920 | gregory.p.smith | 2006-09-19 19:35:04 +0200 (Tue, 19 Sep 2006) | 5 lines Fixes a bug with bsddb.DB.stat where the flags and txn keyword arguments are transposed. (reported by Louis Zechtzer) ..already committed to release24-maint ..needs committing to release25-maint ........ r51926 | brett.cannon | 2006-09-20 20:34:28 +0200 (Wed, 20 Sep 2006) | 3 lines Accidentally didn't commit Misc/NEWS entry on when __unicode__() was removed from exceptions. ........ r51927 | brett.cannon | 2006-09-20 20:43:13 +0200 (Wed, 20 Sep 2006) | 6 lines Allow exceptions to be directly sliced again (e.g., ``BaseException(1,2,3)[0:2]``). Discovered in Python 2.5.0 by Thomas Heller and reported to python-dev. This should be backported to 2.5 . ........ r51928 | brett.cannon | 2006-09-20 21:28:35 +0200 (Wed, 20 Sep 2006) | 2 lines Make python.vim output more deterministic. ........ r51949 | walter.doerwald | 2006-09-21 17:09:55 +0200 (Thu, 21 Sep 2006) | 2 lines Fix typo. ........ r51950 | jack.diederich | 2006-09-21 19:50:26 +0200 (Thu, 21 Sep 2006) | 5 lines * regression bug, count_next was coercing a Py_ssize_t to an unsigned Py_size_t which breaks negative counts * added test for negative numbers will backport to 2.5.1 ........ r51953 | jack.diederich | 2006-09-21 22:34:49 +0200 (Thu, 21 Sep 2006) | 1 line added itertools.count(-n) fix ........ r51971 | neal.norwitz | 2006-09-22 10:16:26 +0200 (Fri, 22 Sep 2006) | 10 lines Fix %zd string formatting on Mac OS X so it prints negative numbers. 
In addition to testing positive numbers, verify negative numbers work in configure. In order to avoid compiler warnings on OS X 10.4, also change the order of the check for the format character to use (PY_FORMAT_SIZE_T) in the sprintf format for Py_ssize_t. This patch changes PY_FORMAT_SIZE_T from "" to "l" if it wasn't defined at configure time. Need to verify the buildbot results. Backport candidate (if everyone thinks this patch can't be improved). ........ r51972 | neal.norwitz | 2006-09-22 10:18:10 +0200 (Fri, 22 Sep 2006) | 7 lines Bug #1557232: fix seg fault with def f((((x)))) and def f(((x),)). These tests should be improved. Hopefully this fixes variations when flipping back and forth between fpdef and fplist. Backport candidate. ........ r51975 | neal.norwitz | 2006-09-22 10:47:23 +0200 (Fri, 22 Sep 2006) | 4 lines Mostly revert this file to the same version as before. Only force setting of PY_FORMAT_SIZE_T to "l" for Mac OSX. I don't know a better define to use. This should get rid of the warnings on other platforms and Mac too. ........ r51986 | fred.drake | 2006-09-23 02:26:31 +0200 (Sat, 23 Sep 2006) | 1 line add boilerplate "What's New" document so the docs will build ........ r51987 | neal.norwitz | 2006-09-23 06:11:38 +0200 (Sat, 23 Sep 2006) | 1 line Remove extra semi-colons reported by Johnny Lee on python-dev. Backport if anyone cares. ........ r51989 | neal.norwitz | 2006-09-23 20:11:58 +0200 (Sat, 23 Sep 2006) | 1 line SF Bug #1563963, add missing word and cleanup first sentance ........ r51990 | brett.cannon | 2006-09-23 21:53:20 +0200 (Sat, 23 Sep 2006) | 3 lines Make output on test_strptime() be more verbose in face of failure. This is in hopes that more information will help debug the failing test on HPPA Ubuntu. ........ r51991 | georg.brandl | 2006-09-24 12:36:01 +0200 (Sun, 24 Sep 2006) | 2 lines Fix webbrowser.BackgroundBrowser on Windows. ........ 
r51993 | georg.brandl | 2006-09-24 14:35:36 +0200 (Sun, 24 Sep 2006) | 4 lines Fix a bug in the parser's future statement handling that led to "with" not being recognized as a keyword after, e.g., this statement: from __future__ import division, with_statement ........ r51995 | georg.brandl | 2006-09-24 14:50:24 +0200 (Sun, 24 Sep 2006) | 4 lines Fix a bug in traceback.format_exception_only() that led to an error being raised when print_exc() was called without an exception set. In version 2.4, this printed "None", restored that behavior. ........ r52000 | armin.rigo | 2006-09-25 17:16:26 +0200 (Mon, 25 Sep 2006) | 2 lines Another crasher. ........ r52011 | brett.cannon | 2006-09-27 01:38:24 +0200 (Wed, 27 Sep 2006) | 2 lines Make the error message for when the time data and format do not match clearer. ........ r52014 | andrew.kuchling | 2006-09-27 18:37:30 +0200 (Wed, 27 Sep 2006) | 1 line Add news item for rev. 51815 ........ r52018 | andrew.kuchling | 2006-09-27 21:23:05 +0200 (Wed, 27 Sep 2006) | 1 line Make examples do error checking on Py_InitModule ........ r52032 | brett.cannon | 2006-09-29 00:10:14 +0200 (Fri, 29 Sep 2006) | 2 lines Very minor grammatical fix in a comment. ........ r52048 | george.yoshida | 2006-09-30 07:14:02 +0200 (Sat, 30 Sep 2006) | 4 lines SF bug #1567976 : fix typo Will backport to 2.5. ........ r52051 | gregory.p.smith | 2006-09-30 08:08:20 +0200 (Sat, 30 Sep 2006) | 2 lines wording change ........ r52053 | georg.brandl | 2006-09-30 09:24:48 +0200 (Sat, 30 Sep 2006) | 2 lines Bug #1567375: a minor logical glitch in example description. ........ r52056 | georg.brandl | 2006-09-30 09:31:57 +0200 (Sat, 30 Sep 2006) | 3 lines Bug #1565661: in webbrowser, split() the command for the default GNOME browser in case it is a command with args. ........ 
r52058 | georg.brandl | 2006-09-30 10:43:30 +0200 (Sat, 30 Sep 2006) | 4 lines Patch #1567691: super() and new.instancemethod() now don't accept keyword arguments any more (previously they accepted them, but didn't use them). ........ r52061 | georg.brandl | 2006-09-30 11:03:42 +0200 (Sat, 30 Sep 2006) | 3 lines Bug #1566800: make sure that EnvironmentError can be called with any number of arguments, as was the case in Python 2.4. ........ r52063 | georg.brandl | 2006-09-30 11:06:45 +0200 (Sat, 30 Sep 2006) | 2 lines Bug #1566663: remove obsolete example from datetime docs. ........ r52065 | georg.brandl | 2006-09-30 11:13:21 +0200 (Sat, 30 Sep 2006) | 3 lines Bug #1566602: correct failure of posixpath unittest when $HOME ends with a slash. ........ r52068 | georg.brandl | 2006-09-30 12:58:01 +0200 (Sat, 30 Sep 2006) | 3 lines Bug #1457823: cgi.(Sv)FormContentDict's constructor now takes keep_blank_values and strict_parsing keyword arguments. ........ r52069 | georg.brandl | 2006-09-30 13:06:47 +0200 (Sat, 30 Sep 2006) | 3 lines Bug #1560617: in pyclbr, return full module name not only for classes, but also for functions. ........ r52072 | georg.brandl | 2006-09-30 13:17:34 +0200 (Sat, 30 Sep 2006) | 3 lines Bug #1556784: allow format strings longer than 127 characters in datetime's strftime function. ........ r52075 | georg.brandl | 2006-09-30 13:22:28 +0200 (Sat, 30 Sep 2006) | 3 lines Bug #1446043: correctly raise a LookupError if an encoding name given to encodings.search_function() contains a dot. ........ r52078 | georg.brandl | 2006-09-30 14:02:57 +0200 (Sat, 30 Sep 2006) | 3 lines Bug #1546052: clarify that PyString_FromString(AndSize) copies the string pointed to by its parameter. ........ r52080 | georg.brandl | 2006-09-30 14:16:03 +0200 (Sat, 30 Sep 2006) | 3 lines Convert test_import to unittest. ........ 
r52083 | kurt.kaiser | 2006-10-01 23:16:45 +0200 (Sun, 01 Oct 2006) | 5 lines Some syntax errors were being caught by tokenize during the tabnanny check, resulting in obscure error messages. Do the syntax check first. Bug 1562716, 1562719 ........ r52084 | kurt.kaiser | 2006-10-01 23:54:37 +0200 (Sun, 01 Oct 2006) | 3 lines Add comment explaining that error msgs may be due to user code when running w/o subprocess. ........ r52086 | martin.v.loewis | 2006-10-02 16:55:51 +0200 (Mon, 02 Oct 2006) | 3 lines Fix test for uintptr_t. Fixes #1568842. Will backport. ........ r52089 | martin.v.loewis | 2006-10-02 17:20:37 +0200 (Mon, 02 Oct 2006) | 3 lines Guard uintptr_t test with HAVE_STDINT_H, test for stdint.h. Will backport. ........ r52100 | vinay.sajip | 2006-10-03 20:02:37 +0200 (Tue, 03 Oct 2006) | 1 line Documentation omitted the additional parameter to LogRecord.__init__ which was added in 2.5. (See SF #1569622). ........ r52101 | vinay.sajip | 2006-10-03 20:20:26 +0200 (Tue, 03 Oct 2006) | 1 line Documentation clarified to mention optional parameters. ........ r52102 | vinay.sajip | 2006-10-03 20:21:56 +0200 (Tue, 03 Oct 2006) | 1 line Modified LogRecord.__init__ to make the func parameter optional. (See SF #1569622). ........ r52121 | brett.cannon | 2006-10-03 23:58:55 +0200 (Tue, 03 Oct 2006) | 2 lines Fix minor typo in a comment. ........ r52123 | brett.cannon | 2006-10-04 01:23:14 +0200 (Wed, 04 Oct 2006) | 2 lines Convert test_imp over to unittest. ........ r52128 | barry.warsaw | 2006-10-04 04:06:36 +0200 (Wed, 04 Oct 2006) | 3 lines decode_rfc2231(): As Christian Robottom Reis points out, it makes no sense to test for parts > 3 when we use .split(..., 2). ........ r52129 | jeremy.hylton | 2006-10-04 04:24:52 +0200 (Wed, 04 Oct 2006) | 9 lines Fix for SF bug 1569998: break permitted inside try. The compiler was checking that there was something on the fblock stack, but not that there was a loop on the stack. 
Fixed that and added a test for the specific syntax error. Bug fix candidate. ........ r52130 | martin.v.loewis | 2006-10-04 07:47:34 +0200 (Wed, 04 Oct 2006) | 4 lines Fix integer negation and absolute value to not rely on undefined behaviour of the C compiler anymore. Will backport to 2.5 and 2.4. ........ r52135 | martin.v.loewis | 2006-10-04 11:21:20 +0200 (Wed, 04 Oct 2006) | 1 line Forward port r52134: Add uuids for 2.4.4. ........ r52137 | armin.rigo | 2006-10-04 12:23:57 +0200 (Wed, 04 Oct 2006) | 3 lines Compilation problem caused by conflicting typedefs for uint32_t (unsigned long vs. unsigned int). ........ r52139 | armin.rigo | 2006-10-04 14:17:45 +0200 (Wed, 04 Oct 2006) | 23 lines Forward-port of r52136,52138: a review of overflow-detecting code. * unified the way intobject, longobject and mystrtoul handle values around -sys.maxint-1. * in general, trying to entierely avoid overflows in any computation involving signed ints or longs is extremely involved. Fixed a few simple cases where a compiler might be too clever (but that's all guesswork). * more overflow checks against bad data in marshal.c. * 2.5 specific: fixed a number of places that were still confusing int and Py_ssize_t. Some of them could potentially have caused "real-world" breakage. * list.pop(x): fixing overflow issues on x was messy. I just reverted to PyArg_ParseTuple("n"), which does the right thing. (An obscure test was trying to give a Decimal to list.pop()... doesn't make sense any more IMHO) * trying to write a few tests... ........ r52147 | andrew.kuchling | 2006-10-04 15:42:43 +0200 (Wed, 04 Oct 2006) | 6 lines Cause a PyObject_Malloc() failure to trigger a MemoryError, and then add 'if (PyErr_Occurred())' checks to various places so that NULL is returned properly. 2.4 backport candidate. ........ r52148 | martin.v.loewis | 2006-10-04 17:25:28 +0200 (Wed, 04 Oct 2006) | 1 line Add MSVC8 project files to create wininst-8.exe. ........ 
r52196 | brett.cannon | 2006-10-06 00:02:31 +0200 (Fri, 06 Oct 2006) | 7 lines Clarify what "re-initialization" means for init_builtin() and init_dynamic(). Also remove warning about re-initialization as possibly raising an execption as both call _PyImport_FindExtension() which pulls any module that was already imported from the Python process' extension cache and just copies the __dict__ into the module stored in sys.modules. ........ r52200 | fred.drake | 2006-10-06 02:03:45 +0200 (Fri, 06 Oct 2006) | 3 lines - update links - remove Sleepycat name now that they have been bought ........ r52204 | andrew.kuchling | 2006-10-06 12:41:01 +0200 (Fri, 06 Oct 2006) | 1 line Case fix ........ r52208 | georg.brandl | 2006-10-06 14:46:08 +0200 (Fri, 06 Oct 2006) | 3 lines Fix name. ........ r52211 | andrew.kuchling | 2006-10-06 15:18:26 +0200 (Fri, 06 Oct 2006) | 1 line [Bug #1545341] Allow 'classifier' parameter to be a tuple as well as a list. Will backport. ........ r52212 | armin.rigo | 2006-10-06 18:33:22 +0200 (Fri, 06 Oct 2006) | 4 lines A very minor bug fix: this code looks like it is designed to accept any hue value and do the modulo itself, except it doesn't quite do it in all cases. At least, the "cannot get here" comment was wrong. ........ r52213 | andrew.kuchling | 2006-10-06 20:51:55 +0200 (Fri, 06 Oct 2006) | 1 line Comment grammar ........ r52218 | skip.montanaro | 2006-10-07 13:05:02 +0200 (Sat, 07 Oct 2006) | 6 lines Note that the excel_tab class is registered as the "excel-tab" dialect. Fixes 1572471. Make a similar change for the excel class and clean up references to the Dialects and Formatting Parameters section in a few places. ........ r52221 | georg.brandl | 2006-10-08 09:11:54 +0200 (Sun, 08 Oct 2006) | 3 lines Add missing NEWS entry for rev. 52129. ........ r52223 | hyeshik.chang | 2006-10-08 15:48:34 +0200 (Sun, 08 Oct 2006) | 3 lines Bug #1572832: fix a bug in ISO-2022 codecs which may cause segfault when encoding non-BMP unicode characters. 
(Submitted by Ray Chason) ........ r52227 | ronald.oussoren | 2006-10-08 19:37:58 +0200 (Sun, 08 Oct 2006) | 4 lines Add version number to the link to the python documentation in /Developer/Documentation/Python, better for users that install multiple versions of python. ........ r52229 | ronald.oussoren | 2006-10-08 19:40:02 +0200 (Sun, 08 Oct 2006) | 2 lines Fix for bug #1570284 ........ r52233 | ronald.oussoren | 2006-10-08 19:49:52 +0200 (Sun, 08 Oct 2006) | 6 lines MacOSX: distutils changes the values of BASECFLAGS and LDFLAGS when using a universal build of python on OSX 10.3 to ensure that those flags can be used to compile code (the universal build uses compiler flags that aren't supported on 10.3). This patches gives the same treatment to CFLAGS, PY_CFLAGS and BLDSHARED. ........ r52236 | ronald.oussoren | 2006-10-08 19:51:46 +0200 (Sun, 08 Oct 2006) | 5 lines MacOSX: The universal build requires that users have the MacOSX10.4u SDK installed to build extensions. This patch makes distutils emit a warning when the compiler should use an SDK but that SDK is not installed, hopefully reducing some confusion. ........ r52238 | ronald.oussoren | 2006-10-08 20:18:26 +0200 (Sun, 08 Oct 2006) | 3 lines MacOSX: add more logic to recognize the correct startup file to patch to the shell profile patching post-install script. ........ r52242 | andrew.kuchling | 2006-10-09 19:10:12 +0200 (Mon, 09 Oct 2006) | 1 line Add news item for rev. 52211 change ........ r52245 | andrew.kuchling | 2006-10-09 20:05:19 +0200 (Mon, 09 Oct 2006) | 1 line Fix wording in comment ........ r52251 | georg.brandl | 2006-10-09 21:03:06 +0200 (Mon, 09 Oct 2006) | 2 lines Patch #1572724: fix typo ('=' instead of '==') in _msi.c. ........ r52255 | barry.warsaw | 2006-10-09 21:43:24 +0200 (Mon, 09 Oct 2006) | 2 lines List gc.get_count() in the module docstring. ........ 
r52257 | martin.v.loewis | 2006-10-09 22:44:25 +0200 (Mon, 09 Oct 2006) | 1 line Bug #1565150: Fix subsecond processing for os.utime on Windows. ........ r52268 | ronald.oussoren | 2006-10-10 09:55:06 +0200 (Tue, 10 Oct 2006) | 2 lines MacOSX: fix permission problem in the generated installer ........ r52293 | georg.brandl | 2006-10-12 09:38:04 +0200 (Thu, 12 Oct 2006) | 2 lines Bug #1575746: fix typo in property() docs. ........ r52295 | georg.brandl | 2006-10-12 09:57:21 +0200 (Thu, 12 Oct 2006) | 3 lines Bug #813342: Start the IDLE subprocess with -Qnew if the parent is started with that option. ........ r52297 | georg.brandl | 2006-10-12 10:22:53 +0200 (Thu, 12 Oct 2006) | 2 lines Bug #1565919: document set types in the Language Reference. ........ r52299 | georg.brandl | 2006-10-12 11:20:33 +0200 (Thu, 12 Oct 2006) | 3 lines Bug #1550524: better heuristics to find correct class definition in inspect.findsource(). ........ r52301 | georg.brandl | 2006-10-12 11:47:12 +0200 (Thu, 12 Oct 2006) | 4 lines Bug #1548891: The cStringIO.StringIO() constructor now encodes unicode arguments with the system default encoding just like the write() method does, instead of converting it to a raw buffer. ........ r52303 | georg.brandl | 2006-10-12 13:14:40 +0200 (Thu, 12 Oct 2006) | 2 lines Bug #1546628: add a note about urlparse.urljoin() and absolute paths. ........ r52305 | georg.brandl | 2006-10-12 13:27:59 +0200 (Thu, 12 Oct 2006) | 3 lines Bug #1545497: when given an explicit base, int() did ignore NULs embedded in the string to convert. ........ r52307 | georg.brandl | 2006-10-12 13:41:11 +0200 (Thu, 12 Oct 2006) | 3 lines Add a note to fpectl docs that it's not built by default (bug #1556261). ........ r52309 | georg.brandl | 2006-10-12 13:46:57 +0200 (Thu, 12 Oct 2006) | 3 lines Bug #1560114: the Mac filesystem does have accurate information about the case of filenames. ........ 
r52311 | georg.brandl | 2006-10-12 13:59:27 +0200 (Thu, 12 Oct 2006) | 2 lines Small grammar fix, thanks Sjoerd. ........ r52313 | georg.brandl | 2006-10-12 14:03:07 +0200 (Thu, 12 Oct 2006) | 2 lines Fix tarfile depending on buggy int('1\0', base) behavior. ........ r52315 | georg.brandl | 2006-10-12 14:33:07 +0200 (Thu, 12 Oct 2006) | 2 lines Bug #1283491: follow docstring convention wrt. keyword-able args in sum(). ........ r52316 | georg.brandl | 2006-10-12 15:08:16 +0200 (Thu, 12 Oct 2006) | 3 lines Bug #1560179: speed up posixpath.(dir|base)name ........ r52327 | brett.cannon | 2006-10-14 08:36:45 +0200 (Sat, 14 Oct 2006) | 3 lines Clean up the language of a sentence relating to the connect() function and user-defined datatypes. ........ r52332 | neal.norwitz | 2006-10-14 23:33:38 +0200 (Sat, 14 Oct 2006) | 3 lines Update the peephole optimizer to remove more dead code (jumps after returns) and inline jumps to returns. ........ r52333 | martin.v.loewis | 2006-10-15 09:54:40 +0200 (Sun, 15 Oct 2006) | 4 lines Patch #1576954: Update VC6 build directory; remove redundant files in VC7. Will backport to 2.5. ........ r52335 | martin.v.loewis | 2006-10-15 10:43:33 +0200 (Sun, 15 Oct 2006) | 1 line Patch #1576166: Support os.utime for directories on Windows NT+. ........ r52336 | martin.v.loewis | 2006-10-15 10:51:22 +0200 (Sun, 15 Oct 2006) | 2 lines Patch #1577551: Add ctypes and ET build support for VC6. Will backport to 2.5. ........ r52338 | martin.v.loewis | 2006-10-15 11:35:51 +0200 (Sun, 15 Oct 2006) | 1 line Loosen the test for equal time stamps. ........ r52339 | martin.v.loewis | 2006-10-15 11:43:39 +0200 (Sun, 15 Oct 2006) | 2 lines Bug #1567666: Emulate GetFileAttributesExA for Win95. Will backport to 2.5. ........ r52341 | martin.v.loewis | 2006-10-15 13:02:07 +0200 (Sun, 15 Oct 2006) | 2 lines Round to int, because some systems support sub-second time stamps in stat, but not in utime. Also be consistent with modifying only mtime, not atime. ........ 
r52342 | martin.v.loewis | 2006-10-15 13:57:40 +0200 (Sun, 15 Oct 2006) | 2 lines Set the eol-style for project files to "CRLF". ........ r52343 | martin.v.loewis | 2006-10-15 13:59:56 +0200 (Sun, 15 Oct 2006) | 3 lines Drop binary property on dsp files, set eol-style to CRLF instead. ........ r52344 | martin.v.loewis | 2006-10-15 14:01:43 +0200 (Sun, 15 Oct 2006) | 2 lines Remove binary property, set eol-style to CRLF instead. ........ r52346 | martin.v.loewis | 2006-10-15 16:30:38 +0200 (Sun, 15 Oct 2006) | 2 lines Mention the bdist_msi module. Will backport to 2.5. ........ r52354 | brett.cannon | 2006-10-16 05:09:52 +0200 (Mon, 16 Oct 2006) | 3 lines Fix turtle so that you can launch the demo2 function on its own instead of only when the module is launched as a script. ........ r52356 | martin.v.loewis | 2006-10-17 17:18:06 +0200 (Tue, 17 Oct 2006) | 2 lines Patch #1457736: Update VC6 to use current PCbuild settings. Will backport to 2.5. ........ r52360 | martin.v.loewis | 2006-10-17 20:09:55 +0200 (Tue, 17 Oct 2006) | 2 lines Remove obsolete file. Will backport. ........ r52363 | martin.v.loewis | 2006-10-17 20:59:23 +0200 (Tue, 17 Oct 2006) | 4 lines Forward-port r52358: - Bug #1578513: Cross compilation was broken by a change to configure. Repair so that it's back to how it was in 2.4.3. ........ r52365 | thomas.heller | 2006-10-17 21:30:48 +0200 (Tue, 17 Oct 2006) | 6 lines ctypes callback functions only support 'fundamental' result types. Check this and raise an error when something else is used - before this change ctypes would hang or crash when such a callback was called. This is a partial fix for #1574584. Will backport to release25-maint. ........ r52377 | tim.peters | 2006-10-18 07:06:06 +0200 (Wed, 18 Oct 2006) | 2 lines newIobject(): repaired incorrect cast to quiet MSVC warning. ........ r52378 | tim.peters | 2006-10-18 07:09:12 +0200 (Wed, 18 Oct 2006) | 2 lines Whitespace normalization. ........ 
r52379 | tim.peters | 2006-10-18 07:10:28 +0200 (Wed, 18 Oct 2006) | 2 lines Add missing svn:eol-style to text files. ........ r52387 | martin.v.loewis | 2006-10-19 12:58:46 +0200 (Thu, 19 Oct 2006) | 3 lines Add check for the PyArg_ParseTuple format, and declare it if it is supported. ........ r52388 | martin.v.loewis | 2006-10-19 13:00:37 +0200 (Thu, 19 Oct 2006) | 3 lines Fix various minor errors in passing arguments to PyArg_ParseTuple. ........ r52389 | martin.v.loewis | 2006-10-19 18:01:37 +0200 (Thu, 19 Oct 2006) | 2 lines Restore CFLAGS after checking for __attribute__ ........ r52390 | andrew.kuchling | 2006-10-19 23:55:55 +0200 (Thu, 19 Oct 2006) | 1 line [Bug #1576348] Fix typo in example ........ r52414 | walter.doerwald | 2006-10-22 10:59:41 +0200 (Sun, 22 Oct 2006) | 2 lines Port test___future__ to unittest. ........ r52415 | ronald.oussoren | 2006-10-22 12:45:18 +0200 (Sun, 22 Oct 2006) | 3 lines Patch #1580674: with this patch os.readlink uses the filesystem encoding to decode unicode objects and returns an unicode object when the argument is one. ........ r52416 | martin.v.loewis | 2006-10-22 12:46:18 +0200 (Sun, 22 Oct 2006) | 3 lines Patch #1580872: Remove duplicate declaration of PyCallable_Check. Will backport to 2.5. ........ r52418 | martin.v.loewis | 2006-10-22 12:55:15 +0200 (Sun, 22 Oct 2006) | 4 lines - Patch #1560695: Add .note.GNU-stack to ctypes' sysv.S so that ctypes isn't considered as requiring executable stacks. Will backport to 2.5. ........ r52420 | martin.v.loewis | 2006-10-22 15:45:13 +0200 (Sun, 22 Oct 2006) | 3 lines Remove passwd.adjunct.byname from list of maps for test_nis. Will backport to 2.5. ........ r52431 | georg.brandl | 2006-10-24 18:54:16 +0200 (Tue, 24 Oct 2006) | 2 lines Patch [ 1583506 ] tarfile.py: 100-char filenames are truncated ........ r52446 | andrew.kuchling | 2006-10-26 21:10:46 +0200 (Thu, 26 Oct 2006) | 1 line [Bug #1579796] Wrong syntax for PyDateTime_IMPORT in documentation. Reported by David Faure. 
........ r52449 | andrew.kuchling | 2006-10-26 21:16:46 +0200 (Thu, 26 Oct 2006) | 1 line Typo fix ........ r52452 | martin.v.loewis | 2006-10-27 08:16:31 +0200 (Fri, 27 Oct 2006) | 3 lines Patch #1549049: Rewrite type conversion in structmember. Fixes #1545696 and #1566140. Will backport to 2.5. ........ r52454 | martin.v.loewis | 2006-10-27 08:42:27 +0200 (Fri, 27 Oct 2006) | 2 lines Check for values.h. Will backport. ........ r52456 | martin.v.loewis | 2006-10-27 09:06:52 +0200 (Fri, 27 Oct 2006) | 2 lines Get DBL_MAX from float.h not values.h. Will backport. ........ r52458 | martin.v.loewis | 2006-10-27 09:13:28 +0200 (Fri, 27 Oct 2006) | 2 lines Patch #1567274: Support SMTP over TLS. ........ r52459 | andrew.kuchling | 2006-10-27 13:33:29 +0200 (Fri, 27 Oct 2006) | 1 line Set svn:keywords property ........ r52460 | andrew.kuchling | 2006-10-27 13:36:41 +0200 (Fri, 27 Oct 2006) | 1 line Add item ........ r52461 | andrew.kuchling | 2006-10-27 13:37:01 +0200 (Fri, 27 Oct 2006) | 1 line Some wording changes and markup fixes ........ r52462 | andrew.kuchling | 2006-10-27 14:18:38 +0200 (Fri, 27 Oct 2006) | 1 line [Bug #1585690] Note that line_num was added in Python 2.5 ........ r52464 | andrew.kuchling | 2006-10-27 14:50:38 +0200 (Fri, 27 Oct 2006) | 1 line [Bug #1583946] Reword description of server and issuer ........ r52466 | andrew.kuchling | 2006-10-27 15:06:25 +0200 (Fri, 27 Oct 2006) | 1 line [Bug #1562583] Mention the set_reuse_addr() method ........ r52469 | andrew.kuchling | 2006-10-27 15:22:46 +0200 (Fri, 27 Oct 2006) | 4 lines [Bug #1542016] Report PCALL_POP value. This makes the return value of sys.callstats() match its docstring. Backport candidate. Though it's an API change, this is a pretty obscure portion of the API. ........ r52473 | andrew.kuchling | 2006-10-27 16:53:41 +0200 (Fri, 27 Oct 2006) | 1 line Point users to the subprocess module in the docs for os.system, os.spawn*, os.popen2, and the popen2 and commands modules ........ 
\ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive
tool for developing and testing RE patterns. This HOWTO will use the
standard Python interpreter for its examples.
First, run the Python interpreter, import the \module{re} module, and
compile a RE:
\begin{verbatim}
Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 80c3c28>
\end{verbatim}
Now, you can try matching various strings against the RE
\regexp{[a-z]+}. An empty string shouldn't match at all, since
\regexp{+} means 'one or more repetitions'. \method{match()} should
return \code{None} in this case, which will cause the interpreter to
print no output. You can explicitly print the result of
\method{match()} to make this clear.
\begin{verbatim}
>>> p.match("")
>>> print p.match("")
None
\end{verbatim}
Now, let's try it on a string that it should match, such as
\samp{tempo}. In this case, \method{match()} will return a
\class{MatchObject}, so you should store the result in a variable for
later use.
\begin{verbatim}
>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>
\end{verbatim}
Now you can query the \class{MatchObject} for information about the
matching string. \class{MatchObject} instances also have several
methods and attributes; the most important ones are:
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{group()}{Return the string matched by the RE}
\lineii{start()}{Return the starting position of the match}
\lineii{end()}{Return the ending position of the match}
\lineii{span()}{Return a tuple containing the (start, end) positions
of the match}
\end{tableii}
Trying these methods will soon clarify their meaning:
\begin{verbatim}
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
\end{verbatim}
\method{group()} returns the substring that was matched by the
RE. \method{start()} and \method{end()} return the starting and
ending index of the match. \method{span()} returns both start and end
indexes in a single tuple. Since the \method{match} method only
checks if the RE matches at the start of a string,
\method{start()} will always be zero. However, the \method{search}
method of \class{RegexObject} instances scans through the string, so
the match may not start at zero in that case.
\begin{verbatim}
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)
\end{verbatim}
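The relationship between \method{span()} and \method{group()} can be
checked directly. The following is only a small sketch; it assumes a
current Python 3 interpreter (hence \code{print()} as a function), and
the variable names are purely illustrative.

```python
import re

p = re.compile('[a-z]+')
text = '::: message'

# match() only tries position 0, where ':' is not in [a-z].
assert p.match(text) is None

# search() scans forward until the pattern matches somewhere.
m = p.search(text)
start, end = m.span()

# The span indexes the original string, so slicing recovers group().
assert (start, end) == (m.start(), m.end())
assert text[start:end] == m.group()
print(m.group())
```

Slicing the original string with the span is occasionally handy when
you need the text surrounding a match as well as the match itself.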
In actual programs, the most common style is to store the
\class{MatchObject} in a variable, and then check if it was
\code{None}. This usually looks like:
\begin{verbatim}
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
\end{verbatim}
Two \class{RegexObject} methods return all of the matches for a pattern.
\method{findall()} returns a list of matching strings:
\begin{verbatim}
>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
\end{verbatim}
\method{findall()} has to create the entire list before it can be
returned as the result. As of Python 2.2, the \method{finditer()} method
is also available; it returns the \class{MatchObject} instances one at a
time as an iterator.
\begin{verbatim}
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)
\end{verbatim}
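To see how the two methods relate, here is a short sketch comparing
them on the same input. It assumes Python 3 syntax, and the names
\code{all_at_once} and \code{one_by_one} are made up for this example.

```python
import re

p = re.compile(r'\d+')
text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# findall() builds the whole list of matched strings up front.
all_at_once = p.findall(text)

# finditer() yields one match object at a time; group() gives the
# same strings, and span() is available as well.
one_by_one = [(m.group(), m.span()) for m in p.finditer(text)]

assert [s for s, _ in one_by_one] == all_at_once
print(one_by_one)
```

If you only need the matched strings, \method{findall()} is simpler;
\method{finditer()} is the one to reach for when you also want
positions or want to stop early.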
\subsection{Module-Level Functions}
You don't have to produce a \class{RegexObject} and call its methods;
the \module{re} module also provides top-level functions called
\function{match()}, \function{search()}, \function{sub()}, and so
forth. These functions take the same arguments as the corresponding
\class{RegexObject} method, with the RE string added as the first
argument, and still return either \code{None} or a \class{MatchObject}
instance.
\begin{verbatim}
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<re.MatchObject instance at 80c5978>
\end{verbatim}
Under the hood, these functions simply produce a \class{RegexObject}
for you and call the appropriate method on it. They also store the
compiled object in a cache, so future calls using the same
RE are faster.
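
As a rough illustration of that equivalence, the sketch below runs the
same RE both ways and compares the results. It assumes Python 3, and
the variable names are only for demonstration.

```python
import re

pattern = r'From\s+'
line = 'From amk Thu May 14 19:12:10 1998'

# Module-level function: the RE string is the first argument.
m1 = re.match(pattern, line)

# Equivalent two-step form: compile once, then call the method.
p = re.compile(pattern)
m2 = p.match(line)

assert m1.group() == m2.group() == 'From '
assert m1.span() == m2.span()
assert re.match(pattern, 'Fromage amk') is None
```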
Should you use these module-level functions, or should you get the
\class{RegexObject} and call its methods yourself? That choice
depends on how frequently the RE will be used, and on your personal
coding style. If a RE is being used at only one point in the code,
then the module functions are probably more convenient. If a program
contains a lot of regular expressions, or re-uses the same ones in
several locations, then it might be worthwhile to collect all the
definitions in one place, in a section of code that compiles all the
REs ahead of time. To take an example from the standard library,
here's an extract from \file{xmllib.py}:
\begin{verbatim}
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )
\end{verbatim}
I generally prefer to work with the compiled object, even for
one-time uses, but few people will be as much of a purist about this
as I am.
\subsection{Compilation Flags}
Compilation flags let you modify some aspects of how regular
expressions work. Flags are available in the \module{re} module under
two names, a long name such as \constant{IGNORECASE}, and a short,
one-letter form such as \constant{I}. (If you're familiar with Perl's
pattern modifiers, the one-letter forms use the same letters; the
short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
re.M} sets both the \constant{I} and \constant{M} flags, for example.
Here's a table of the available flags, followed by
a more detailed explanation of each one.
\begin{tableii}{c|l}{}{Flag}{Meaning}
\lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
character, including newlines}
\lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
\lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
\lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
affecting \regexp{\^} and \regexp{\$}}
\lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
which can be organized more cleanly and understandably.}
\end{tableii}
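
Before going through the flags one by one, here is a small sketch of
combining two of them with bitwise OR. It assumes Python 3 and uses an
invented two-line string; \constant{IGNORECASE} and
\constant{MULTILINE} are described in detail below.

```python
import re

text = 'first line\nSecond line\n'

# re.I | re.M turns on IGNORECASE and MULTILINE together.
p = re.compile(r'^second', re.I | re.M)
m = p.search(text)
assert m is not None and m.group() == 'Second'

# The long names refer to the same flag values as the short ones.
assert re.I == re.IGNORECASE and re.M == re.MULTILINE

# With only one of the two flags the search fails.
assert re.search(r'^second', text, re.I) is None
assert re.search(r'^second', text, re.M) is None
```

Both flags are needed here: one lets \regexp{\^} match at the start of
the second line, the other lets \samp{second} match \samp{Second}.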
\begin{datadesc}{I}
\dataline{IGNORECASE}
Perform case-insensitive matching; character class and literal strings
will match
letters by ignoring case. For example, \regexp{[A-Z]} will match
lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
\samp{spam}, or \samp{spAM}.
This lowercasing doesn't take the current locale into account; it will
if you also set the \constant{LOCALE} flag.
\end{datadesc}
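
The behaviour of \constant{IGNORECASE} described above can be tried out
with a few lines. This is a minimal sketch assuming Python 3; the test
strings are the ones from the description.

```python
import re

p = re.compile('Spam', re.IGNORECASE)
for candidate in ('Spam', 'spam', 'spAM'):
    assert p.match(candidate) is not None

# Character classes are affected too: [A-Z] also accepts lowercase.
upper = re.compile('[A-Z]+', re.I)
assert upper.match('tempo').group() == 'tempo'

# Without the flag, the same class rejects lowercase input.
assert re.match('[A-Z]+', 'tempo') is None
```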
\begin{datadesc}{L}
\dataline{LOCALE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
and \regexp{\e B} dependent on the current locale.
Locales are a feature of the C library intended to help in writing
programs that take account of language differences. For example, if
you're processing French text, you'd want to be able to write
\regexp{\e w+} to match words, but \regexp{\e w} only matches the
character class \regexp{[A-Za-z0-9_]}; it won't match \character{\'e} or
\character{\c c}. If your system is configured properly and a French
locale is selected, certain C functions will tell the program that
\character{\'e} should also be considered a letter. Setting the
\constant{LOCALE} flag when compiling a regular expression will cause the
resulting compiled object to use these C functions for \regexp{\e w};
this is slower, but also enables \regexp{\e w+} to match French words as
you'd expect.
\end{datadesc}
\begin{datadesc}{M}
\dataline{MULTILINE}
(\regexp{\^} and \regexp{\$} haven't been explained yet;
they'll be introduced in section~\ref{more-metacharacters}.)
Usually \regexp{\^} matches only at the beginning of the string, and
\regexp{\$} matches only at the end of the string and immediately before the
newline (if any) at the end of the string. When this flag is
specified, \regexp{\^} matches at the beginning of the string and at
the beginning of each line within the string, immediately following
each newline. Similarly, the \regexp{\$} metacharacter matches both at
the end of the string and at the end of each line (immediately
preceding each newline).
\end{datadesc}
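
The difference \constant{MULTILINE} makes is easiest to see with
\method{findall()}. The sketch below assumes Python 3 and uses an
invented three-line string.

```python
import re

text = 'From: amk\nTo: editor\nFrom: reader\n'

# Default: ^ can only match at the very start of the string.
assert re.findall(r'^From', text) == ['From']

# MULTILINE: ^ also matches right after each newline.
assert re.findall(r'^From', text, re.MULTILINE) == ['From', 'From']

# $ behaves the same way at line ends.
assert re.findall(r'\w+$', text) == ['reader']
assert re.findall(r'\w+$', text, re.M) == ['amk', 'editor', 'reader']
```

Note that even without the flag, \regexp{\$} matches just before the
final newline, which is why \samp{reader} is found in both cases.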
\begin{datadesc}{S}
\dataline{DOTALL}
Makes the \character{.} special character match any character at all,
including a newline; without this flag, \character{.} will match
anything \emph{except} a newline.
\end{datadesc}
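
A two-line string is enough to show the effect of \constant{DOTALL}.
This is a small sketch assuming Python 3; the sample text is made up.

```python
import re

text = 'line one\nline two'

# By default . stops at the newline, so .* covers only the first line.
assert re.match(r'.*', text).group() == 'line one'

# With DOTALL, . matches the newline as well and .* spans the whole string.
assert re.match(r'.*', text, re.DOTALL).group() == text
```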
\begin{datadesc}{X}
\dataline{VERBOSE} This flag allows you to write regular expressions
that are more readable by granting you more flexibility in how you can
format them. When this flag has been specified, whitespace within the
RE string is ignored, except when the whitespace is in a character
class or preceded by an unescaped backslash; this lets you organize
and indent the RE more clearly. It also enables you to put comments
within a RE that will be ignored by the engine; comments are marked by
a \character{\#} that's neither in a character class nor preceded by an
unescaped backslash.
For example, here's a RE that uses \constant{re.VERBOSE}; see how
much easier it is to read?
\begin{verbatim}
charref = re.compile(r"""
 &[#]                          # Start of a numeric entity reference
 (
     [0-9]+[^0-9]              # Decimal form
   | 0[0-7]+[^0-7]             # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)
\end{verbatim}
Without the verbose setting, the RE would look like this:
\begin{verbatim}
charref = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")
\end{verbatim}
In the above example, Python's automatic concatenation of string
literals has been used to break up the RE into smaller pieces, but
it's still more difficult to understand than the version using
\constant{re.VERBOSE}.
\end{datadesc}
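
If you'd like to convince yourself that the verbose and compact forms
of the RE really are the same pattern, a quick check along the
following lines should do. It assumes Python 3; the names
\code{verbose}, \code{compact} and the sample strings are invented for
this sketch.

```python
import re

verbose = re.compile(r"""
 &[#]                          # Start of a numeric entity reference
 (
     [0-9]+[^0-9]              # Decimal form
   | 0[0-7]+[^0-7]             # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)

compact = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")

samples = ['&#65;', '&#x41;', '&#0101;', '&amp;', 'plain text']
for s in samples:
    mv, mc = verbose.match(s), compact.match(s)
    # Both forms must agree on whether s matches, and on what matched.
    assert (mv is None) == (mc is None)
    if mv is not None:
        assert mv.group() == mc.group()
```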
\section{More Pattern Power}
So far we've only covered a part of the features of regular
expressions. In this section, we'll cover some new metacharacters,
and how to use groups to retrieve portions of the text that was matched.
\subsection{More Metacharacters\label{more-metacharacters}}
There are some metacharacters that we haven't covered yet. Most of
them will be covered in this section.
Some of the remaining metacharacters to be discussed are
\dfn{zero-width assertions}. They don't cause the engine to advance
through the string; instead, they consume no characters at all,
and simply succeed or fail. For example, \regexp{\e b} is an
assertion that the current position is located at a word boundary; the
position isn't changed by the \regexp{\e b} at all. This means that
zero-width assertions should never be repeated, because if they match
once at a given location, they can obviously be matched an infinite
number of times.
\begin{list}{}{}
\item[\regexp{|}]
Alternation, or the ``or'' operator.
If A and B are regular expressions,
\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
\regexp{|} has very low precedence in order to make it work reasonably when
you're alternating multi-character strings.
\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
To match a literal \character{|},
use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
\item[\regexp{\^}] Matches at the beginning of lines. Unless the
\constant{MULTILINE} flag has been set, this will only match at the
beginning of the string. In \constant{MULTILINE} mode, this also
matches immediately after each newline within the string.
For example, if you wish to match the word \samp{From} only at the
beginning of a line, the RE to use is \verb|^From|.
\begin{verbatim}
>>> print re.search('^From', 'From Here to Eternity')
<re.MatchObject instance at 80c1520>
>>> print re.search('^From', 'Reciting From Memory')
None
\end{verbatim}
%To match a literal \character{\^}, use \regexp{\e\^} or enclose it
%inside a character class, as in \regexp{[{\e}\^]}.
\item[\regexp{\$}] Matches at the end of a line, which is defined as
either the end of the string, or any location followed by a newline
character.
\begin{verbatim}
>>> print re.search('}$', '{block}')
<re.MatchObject instance at 80adfa8>
>>> print re.search('}$', '{block} ')
None
>>> print re.search('}$', '{block}\n')
<re.MatchObject instance at 80adfa8>
\end{verbatim}
% $
To match a literal \character{\$}, use \regexp{\e\$} or enclose it
inside a character class, as in \regexp{[\$]}.
\item[\regexp{\e A}] Matches only at the start of the string. When
not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
effectively the same. In \constant{MULTILINE} mode, however, they're
different; \regexp{\e A} still matches only at the beginning of the
string, but \regexp{\^} may match at any location inside the string
that follows a newline character.
\item[\regexp{\e Z}] Matches only at the end of the string.
\item[\regexp{\e b}] Word boundary.
This is a zero-width assertion that matches only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.
The following example matches \samp{class} only when it's a complete
word; it won't match when it's contained inside another word.
\begin{verbatim}
>>> p = re.compile(r'\bclass\b')
>>> print p.search('no class at all')
<re.MatchObject instance at 80c8f28>
>>> print p.search('the declassified algorithm')
None
>>> print p.search('one subclass is')
None
\end{verbatim}
There are two subtleties you should remember when using this special
sequence. First, this is the worst collision between Python's string
literals and regular expression sequences. In Python's string
literals, \samp{\e b} is the backspace character, ASCII value 8. If
you're not using raw strings, then Python will convert the \samp{\e b} to
a backspace, and your RE won't match as you expect it to. The
following example looks the same as our previous RE, but omits
the \character{r} in front of the RE string.
\begin{verbatim}
>>> p = re.compile('\bclass\b')
>>> print p.search('no class at all')
None
>>> print p.search('\b' + 'class' + '\b')
<re.MatchObject instance at 80c3ee0>
\end{verbatim}
Second, inside a character class, where there's no use for this
assertion, \regexp{\e b} represents the backspace character, for
compatibility with Python's string literals.
\item[\regexp{\e B}] Another zero-width assertion, this is the
opposite of \regexp{\e b}, only matching when the current
position is not at a word boundary.
\end{list}
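The differences between these anchors are easiest to see side by side.
The following is a small sketch, using made-up sample strings, that
contrasts \regexp{\^} with \regexp{\e A}, \regexp{\$} with \regexp{\e Z},
and shows \regexp{\e B} rejecting a complete word.

```python
import re

text = 'From here\nFrom there'

# Without MULTILINE, ^ matches only at the start of the string.
plain = re.compile(r'^From').findall(text)

# With MULTILINE, ^ also matches just after each newline...
multi = re.compile(r'^From', re.MULTILINE).findall(text)

# ...but \A still matches only at the very start of the string.
anchored = re.compile(r'\AFrom', re.MULTILINE).findall(text)

# $ matches before a trailing newline; \Z matches only at the very end.
dollar = re.search(r'}$', '{block}\n')
strict = re.search(r'}\Z', '{block}\n')

# \B is the opposite of \b: 'class' must sit inside a longer word.
inside = re.search(r'\Bclass\B', 'the declassified algorithm')
whole = re.search(r'\Bclass\B', 'no class at all')
```

Here \code{plain} and \code{anchored} each contain one match while
\code{multi} contains two; \code{strict} and \code{whole} are both
\code{None}.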
\subsection{Grouping}
Frequently you need to obtain more information than just whether the
RE matched or not. Regular expressions are often used to dissect
strings by writing a RE divided into several subgroups which
match different components of interest. For example, an RFC-822
header line is divided into a header name and a value, separated by a
\character{:}. This can be handled by writing a regular expression
which matches an entire header line, and has one group which matches the
header name, and another group which matches the header's value.
Groups are marked by the \character{(}, \character{)} metacharacters.
\character{(} and \character{)} have much the same meaning as they do
in mathematical expressions; they group together the expressions
contained inside them. For example, you can repeat the contents of a
group with a repeating qualifier, such as \regexp{*}, \regexp{+},
\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
\begin{verbatim}
>>> p = re.compile('(ab)*')
>>> print p.match('ababababab').span()
(0, 10)
\end{verbatim}
Groups indicated with \character{(}, \character{)} also capture the
starting and ending index of the text that they match; this can be
retrieved by passing an argument to \method{group()},
\method{start()}, \method{end()}, and \method{span()}. Groups are
numbered starting with 0. Group 0 is always present; it's the whole
RE, so \class{MatchObject} methods all have group 0 as their default
argument. Later we'll see how to express groups that don't capture
the span of text that they match.
\begin{verbatim}
>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
\end{verbatim}
Subgroups are numbered from left to right, from 1 upward. Groups can
be nested; to determine the number, just count the opening parenthesis
characters, going from left to right.
\begin{verbatim}
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
\end{verbatim}
\method{group()} can be passed multiple group numbers at a time, in
which case it will return a tuple containing the corresponding values
for those groups.
\begin{verbatim}
>>> m.group(2,1,2)
('b', 'abc', 'b')
\end{verbatim}
The \method{groups()} method returns a tuple containing the strings
for all the subgroups, from 1 up to however many there are.
\begin{verbatim}
>>> m.groups()
('abc', 'b')
\end{verbatim}
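Returning to the header-line case mentioned at the start of this
section, here is a minimal sketch of a two-group pattern.  The pattern
is deliberately simple and is only an illustration, not a complete
RFC-822 parser; the sample line is made up.

```python
import re

# Group 1 is the header name, group 2 is the header's value.
header = re.compile(r'([^:]+):\s*(.*)')

line = 'Subject: Regular Expression HOWTO'   # hypothetical sample line
m = header.match(line)

name = m.group(1)          # text matched by the first group
value = m.group(2)         # text matched by the second group
both = m.group(1, 2)       # several groups at once, as a tuple
namespan = m.span(1)       # (start, end) indexes of group 1
valuestart = m.start(2)    # index where group 2 begins
```

Passing a group number to \method{span()} or \method{start()} reports
where that group matched, so the pieces can be located in the original
string as well as extracted from it.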
Backreferences in a pattern allow you to specify that the contents of
an earlier capturing group must also be found at the current location
in the string. For example, \regexp{\e 1} will succeed if the exact
contents of group 1 can be found at the current position, and fails
otherwise. Remember that Python's string literals also use a
backslash followed by numbers to allow including arbitrary characters
in a string, so be sure to use a raw string when incorporating
backreferences in a RE.
For example, the following RE detects doubled words in a string.
\begin{verbatim}
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'
\end{verbatim}
Backreferences like this aren't often useful for just searching
through a string --- there are few text formats which repeat data in
this way --- but you'll soon find out that they're \emph{very} useful
when performing string substitutions.
\subsection{Non-capturing and Named Groups}
Elaborate REs may use many groups, both to capture substrings of
interest, and to group and structure the RE itself. In complex REs,
it becomes difficult to keep track of the group numbers. There are
two features which help with this problem. Both of them use a common
syntax for regular expression extensions, so we'll look at that first.
Perl 5 added several additional features to standard regular
expressions, and the Python \module{re} module supports most of them.
It would have been difficult to choose new single-keystroke
metacharacters or new special sequences beginning with \samp{\e} to
represent the new features without making Perl's regular expressions
confusingly different from standard REs. If you chose \samp{\&} as a
new metacharacter, for example, old expressions would be assuming that
\samp{\&} was a regular character and wouldn't have escaped it by
writing \regexp{\e \&} or \regexp{[\&]}.
The solution chosen by the Perl developers was to use \regexp{(?...)}
as the extension syntax. \samp{?} immediately after a parenthesis was
a syntax error because the \samp{?} would have nothing to repeat, so
this didn't introduce any compatibility problems. The characters
immediately after the \samp{?} indicate what extension is being used,
so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
\regexp{(?:foo)} is something else (a non-capturing group containing
the subexpression \regexp{foo}).
Python adds an extension syntax to Perl's extension syntax. If the
first character after the question mark is a \samp{P}, you know that
it's an extension that's specific to Python. Currently there are two
such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
and \regexp{(?P=\var{name})} is a backreference to a named group. If
future versions of Perl 5 add similar features using a different
syntax, the \module{re} module will be changed to support the new
syntax, while preserving the Python-specific syntax for
compatibility's sake.
Now that we've looked at the general extension syntax, we can return
to the features that simplify working with groups in complex REs.
Since groups are numbered from left to right and a complex expression
may use many groups, it can become difficult to keep track of the
correct numbering, and modifying such a complex RE is annoying.
Insert a new group near the beginning, and you change the numbers of
everything that follows it.
First, sometimes you'll want to use a group to collect a part of a
regular expression, but aren't interested in retrieving the group's
contents. You can make this fact explicit by using a non-capturing
group: \regexp{(?:...)}, where you can put any other regular
expression inside the parentheses.
\begin{verbatim}
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()
\end{verbatim}
Except for the fact that you can't retrieve the contents of what the
group matched, a non-capturing group behaves exactly the same as a
capturing group; you can put anything inside it, repeat it with a
repetition metacharacter such as \samp{*}, and nest it within other
groups (capturing or non-capturing).  \regexp{(?:...)} is particularly
useful when modifying an existing pattern, since you can add new groups
without changing how all the other groups are numbered.  It should be
mentioned that there's no performance difference in searching between
capturing and non-capturing groups; neither form is any faster than
the other.
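The effect on numbering can be sketched as follows.  The patterns and
the sample text are invented for illustration: an optional prefix is
added to an existing two-group pattern, once as a capturing group and
once as a non-capturing group.

```python
import re

# Existing pattern: groups 1 and 2 are the two runs of digits.
original = re.compile(r'(\d+)-(\d+)')

# Adding the prefix as a capturing group renumbers everything after it.
capturing = re.compile(r'(ext\.\s*)?(\d+)-(\d+)')

# Adding it as a non-capturing group leaves the existing numbers alone.
noncapturing = re.compile(r'(?:ext\.\s*)?(\d+)-(\d+)')

text = 'ext. 555-1234'   # hypothetical sample input

before = original.search(text).groups()
shifted = capturing.match(text).groups()
unchanged = noncapturing.match(text).groups()
```

With the non-capturing version, code that already reads
\code{m.group(1)} and \code{m.group(2)} keeps working; with the
capturing version, the same code would silently pick up the prefix.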
The second, and more significant, feature is named groups; instead of
referring to them by numbers, groups can be referenced by a name.
The syntax for a named group is one of the Python-specific extensions:
\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
the group. Except for associating a name with a group, named groups
also behave identically to capturing groups. The \class{MatchObject}
methods that deal with capturing groups all accept either integers, to
refer to groups by number, or a string containing the group name.
Named groups are still given numbers, so you can retrieve information
about a group in two ways:
\begin{verbatim}
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
\end{verbatim}
Named groups are handy because they let you use easily-remembered
names, instead of having to remember numbers. Here's an example RE
from the \module{imaplib} module:
\begin{verbatim}
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')
\end{verbatim}
It's obviously much easier to retrieve \code{m.group('zonem')},
instead of having to remember to retrieve group 9.
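To make the comparison concrete, the sketch below applies the pattern
quoted above to a made-up input string in the expected format.  It also
uses the match object's \method{groupdict()} method, which returns all
the named groups as a dictionary; that method isn't discussed elsewhere
in this section.

```python
import re

# The pattern quoted above from imaplib.
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# A hypothetical input fragment in the format the pattern expects.
m = InternalDate.search('INTERNALDATE "17-Jul-1996 02:44:25 -0700"')

byname = m.group('zonem')    # retrieve the zone minutes by name
bynumber = m.group(9)        # the same group, retrieved by counting
fields = m.groupdict()       # every named group, keyed by its name
```

Both \code{byname} and \code{bynumber} hold the same string, but only
the first survives if someone later inserts another group earlier in
the pattern.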
Since the syntax for backreferences, in an expression like
\regexp{(...)\e 1}, refers to the number of the group, there's
naturally a variant that uses the group name instead of the number.
This is also a Python extension: \regexp{(?P=\var{name})} indicates
that the contents of the group called \var{name} should again be found
at the current point. The regular expression for finding doubled
words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
\begin{verbatim}
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'
\end{verbatim}
\subsection{Lookahead Assertions}
Another zero-width assertion is the lookahead assertion. Lookahead
assertions are available in both positive and negative form, and
look like this:
\begin{itemize}
\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds
if the contained regular expression, represented here by \code{...},
successfully matches at the current location, and fails otherwise.
But, once the contained expression has been tried, the matching engine
doesn't advance at all; the rest of the pattern is tried right where
the assertion started.
\item[\regexp{(?!...)}] Negative lookahead assertion. This is the
opposite of the positive assertion; it succeeds if the contained expression
\emph{doesn't} match at the current position in the string.
\end{itemize}
An example will help make this concrete by demonstrating a case
where a lookahead is useful. Consider a simple pattern to match a
filename and split it apart into a base name and an extension,
separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news}
is the base name, and \samp{rc} is the filename's extension.
The pattern to match this is quite simple:
\regexp{.*[.].*\$}
Notice that the \samp{.} needs to be treated specially because it's a
metacharacter; I've put it inside a character class. Also notice the
trailing \regexp{\$}; this is added to ensure that all the rest of the
string must be included in the extension. This regular expression
matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
\samp{printers.conf}.
Now, consider complicating the problem a bit; what if you want to
match filenames where the extension is not \samp{bat}?
Some incorrect attempts:
\verb|.*[.][^b].*$|
% $
The first attempt above tries to exclude \samp{bat} by requiring that
the first character of the extension is not a \samp{b}. This is
wrong, because the pattern also doesn't match \samp{foo.bar}.
% Messes up the HTML without the curly braces around \^
\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
The expression gets messier when you try to patch up the first
solution by requiring one of the following cases to match: the first
character of the extension isn't \samp{b}; the second character isn't
\samp{a}; or the third character isn't \samp{t}. This accepts
\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
three-letter extension and won't accept a filename with a two-letter
extension such as \samp{sendmail.cf}. We'll complicate the pattern
again in an effort to fix it.
\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
In the third attempt, the second and third letters are both made
optional in order to allow matching extensions shorter than three
characters, such as \samp{sendmail.cf}.
The pattern's getting really complicated now, which makes it hard to
read and understand. Worse, if the problem changes and you want to
exclude both \samp{bat} and \samp{exe} as extensions, the pattern
would get even more complicated and confusing.
A negative lookahead cuts through all this:
\regexp{.*[.](?!bat\$).*\$}
% $
The negative lookahead means: if the expression \regexp{bat\$} doesn't match at
this point, try the rest of the pattern; if \regexp{bat\$} does match,
the whole pattern will fail. The trailing \regexp{\$} is required to
ensure that something like \samp{sample.batch}, where the extension
only starts with \samp{bat}, will be allowed.
Excluding another filename extension is now easy; simply add it as an
alternative inside the assertion. The following pattern excludes
filenames that end in either \samp{bat} or \samp{exe}:
\regexp{.*[.](?!bat\$|exe\$).*\$}
% $
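The two lookahead patterns above can be tried out with a short sketch
like the following.  The filenames are invented for illustration, and
each contains exactly one dot, which is the case the discussion above
has in mind.

```python
import re

# The negative-lookahead patterns from the text.
notbat = re.compile(r'.*[.](?!bat$).*$')
notbatexe = re.compile(r'.*[.](?!bat$|exe$).*$')

# Hypothetical filenames, each containing a single dot.
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'sample.batch', 'setup.exe']

# Keep only the names each pattern accepts.
accepted1 = [n for n in names if notbat.match(n)]
accepted2 = [n for n in names if notbatexe.match(n)]
```

The first list drops only \samp{autoexec.bat}; the second also drops
\samp{setup.exe}.  In both cases \samp{sample.batch} is kept, because
of the \regexp{\$} inside the assertion.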
\section{Modifying Strings}
Up to this point, we've simply performed searches against a static
string. Regular expressions are also commonly used to modify a string
in various ways, using the following \class{RegexObject} methods:
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
\lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
\lineii{subn()}{Does the same thing as \method{sub()},
but returns the new string and the number of replacements}
\end{tableii}
\subsection{Splitting Strings}
The \method{split()} method of a \class{RegexObject} splits a string
apart wherever the RE matches, returning a list of the pieces.
It's similar to the \method{split()} method of strings but
provides much more generality in the delimiters that you can split by;
the string \method{split()} method only supports splitting by
whitespace or by a fixed string.  As you'd expect, there's a
module-level \function{re.split()} function, too.
\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
Split \var{string} by the matches of the regular expression. If
capturing parentheses are used in the RE, then their contents will
also be returned as part of the resulting list. If \var{maxsplit}
is nonzero, at most \var{maxsplit} splits are performed.
\end{methoddesc}
You can limit the number of splits made, by passing a value for
\var{maxsplit}. When \var{maxsplit} is nonzero, at most
\var{maxsplit} splits will be made, and the remainder of the string is
returned as the final element of the list. In the following example,
the delimiter is any sequence of non-alphanumeric characters.
\begin{verbatim}
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
\end{verbatim}
Sometimes you're not only interested in what the text between
delimiters is, but also need to know what the delimiter was. If
capturing parentheses are used in the RE, then their values are also
returned as part of the list. Compare the following calls:
\begin{verbatim}
>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
\end{verbatim}
The module-level function \function{re.split()} adds the RE to be
used as the first argument, but is otherwise the same.
\begin{verbatim}
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}
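The extra generality over the string method shows up as soon as the
input mixes delimiters.  The following is a small sketch with a made-up
input string; the character class used as the delimiter is just one
possible choice.

```python
import re

line = 'apples, pears;plums  figs'   # hypothetical mixed delimiters

# The string method splits on one fixed delimiter only.
fixed = line.split(', ')

# The RE version accepts any run of commas, semicolons, or whitespace.
flexible = re.split(r'[,;\s]+', line)
```

The string method finds a single split point here, while the RE version
separates all four words.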
\subsection{Search and Replace}
Another common task is to find all the matches for a pattern, and
replace them with a different string. The \method{sub()} method takes
a replacement value, which can be either a string or a function, and
the string to be processed.
\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
Returns the string obtained by replacing the leftmost non-overlapping
occurrences of the RE in \var{string} by the replacement
\var{replacement}. If the pattern isn't found, \var{string} is returned
unchanged.
The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative
integer. The default value of 0 means to replace all occurrences.
\end{methoddesc}
Here's a simple example of using the \method{sub()} method. It
replaces colour names with the word \samp{colour}:
\begin{verbatim}
>>> p = re.compile( '(blue|white|red)')
>>> p.sub( 'colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub( 'colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
\end{verbatim}
The \method{subn()} method does the same work, but returns a 2-tuple
containing the new string value and the number of replacements
that were performed:
\begin{verbatim}
>>> p = re.compile( '(blue|white|red)')
>>> p.subn( 'colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn( 'colour', 'no colours at all')
('no colours at all', 0)
\end{verbatim}
Empty matches are replaced only when they're not
adjacent to a previous match.
\begin{verbatim}
>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b-d-'
\end{verbatim}
If \var{replacement} is a string, any backslash escapes in it are
processed. That is, \samp{\e n} is converted to a single newline
character, \samp{\e r} is converted to a carriage return, and so forth.
Unknown escapes such as \samp{\e j} are left alone. Backreferences,
such as \samp{\e 6}, are replaced with the substring matched by the
corresponding group in the RE. This lets you incorporate
portions of the original text in the resulting
replacement string.
This example matches the word \samp{section} followed by a string
enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
\samp{subsection}:
\begin{verbatim}
>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First} section{second}')
'subsection{First} subsection{second}'
\end{verbatim}
There's also a syntax for referring to named groups as defined by the
\regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the
substring matched by the group named \samp{name}, and
\samp{\e g<\var{number}>}
uses the corresponding group number.
\samp{\e g<2>} is therefore equivalent to \samp{\e 2},
but isn't ambiguous in a
replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be
interpreted as a reference to group 20, not a reference to group 2
followed by the literal character \character{0}.) The following
substitutions are all equivalent, but use all three variations of the
replacement string.
\begin{verbatim}
>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}','section{First}')
'subsection{First}'
\end{verbatim}
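The ambiguity that \samp{\e g<2>0} avoids can be demonstrated with a
two-group pattern.  This is only a sketch with an invented pattern and
input; since the pattern has no group 20, the \samp{\e 20} form is
rejected with an error instead of producing a replacement.

```python
import re

p = re.compile(r'(\w+) (\w+)')

# \g<2>0 means "group 2, then a literal 0".
unambiguous = p.sub(r'\g<2>0', 'ab cd')

# \20 is read as a reference to group 20, which this pattern doesn't
# have, so the substitution is rejected with re.error.
try:
    p.sub(r'\20', 'ab cd')
    rejected = False
except re.error:
    rejected = True
```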
\var{replacement} can also be a function, which gives you even more
control. If \var{replacement} is a function, the function is
called for every non-overlapping occurrence of \var{pattern}. On each
call, the function is
passed a \class{MatchObject} argument for the match
and can use this information to compute the desired replacement string and return it.
In the following example, the replacement function translates
decimals into hexadecimal:
\begin{verbatim}
>>> def hexrepl( match ):
...     "Return the hex string for a decimal number"
...     value = int( match.group() )
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'
\end{verbatim}
When using the module-level \function{re.sub()} function, the pattern
is passed as the first argument. The pattern may be a string or a
\class{RegexObject}; if you need to specify regular expression flags,
you must either use a \class{RegexObject} as the first parameter, or use
embedded modifiers in the pattern, e.g. \code{sub("(?i)b+", "x", "bbbb
BBBB")} returns \code{'x x'}.
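The two options just described can be sketched side by side.  The
variable names are chosen only for this illustration.

```python
import re

text = 'bbbb BBBB'

# Option 1: compile with the flag and pass the compiled object.
p = re.compile('b+', re.IGNORECASE)
viaobject = re.sub(p, 'x', text)

# Option 2: embed the modifier at the start of the pattern string.
viamodifier = re.sub('(?i)b+', 'x', text)
```

Both calls give the same result; calling \code{p.sub('x', text)}
directly on the compiled object is equivalent to the first form.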
\section{Common Problems}
Regular expressions are a powerful tool for some applications, but in
some ways their behaviour isn't intuitive and at times they don't
behave the way you may expect them to. This section will point out
some of the most common pitfalls.
\subsection{Use String Methods}
Sometimes using the \module{re} module is a mistake. If you're
matching a fixed string, or a single character class, and you're not
using any \module{re} features such as the \constant{IGNORECASE} flag,
then the full power of regular expressions may not be required.
Strings have several methods for performing operations with fixed
strings and they're usually much faster, because the implementation is
a single small C loop that's been optimized for the purpose, instead
of the large, more generalized regular expression engine.
One example might be replacing a single fixed string with another
one; for example, you might replace \samp{word}
with \samp{deed}. \code{re.sub()} seems like the function to use for
this, but consider the \method{replace()} method.  Note that
\method{replace()} will also replace \samp{word} inside
words, turning \samp{swordfish} into \samp{sdeedfish}, but the
na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing
the substitution on parts of words, the pattern would have to be
\regexp{\e bword\e b}, in order to require that \samp{word} have a
word boundary on either side. This takes the job beyond
\method{replace}'s abilities.)
Another common task is deleting every occurrence of a single character
from a string or replacing it with another single character. You
might do this with something like \code{re.sub('\e n', ' ', S)}, but
\method{translate()} is capable of doing both tasks
and will be faster than any regular expression operation can be.
In short, before turning to the \module{re} module, consider whether
your problem can be solved with a faster and simpler string method.
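The \samp{word} and \samp{deed} case above can be written out as a
short sketch.  The sample text is made up; the point is only that the
string method and the na{\"\i}ve RE agree, and that the word-boundary
version is where the \module{re} module starts to earn its keep.

```python
import re

text = 'swordfish word'   # hypothetical sample text

# Fixed-string replacement: no RE needed, but it also hits 'sword'.
plain = text.replace('word', 'deed')

# The naive RE behaves exactly the same way.
naive = re.sub('word', 'deed', text)

# Only the word-boundary version leaves 'swordfish' alone.
bounded = re.sub(r'\bword\b', 'deed', text)
```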
\subsection{match() versus search()}
The \function{match()} function only checks if the RE matches at
the beginning of the string while \function{search()} will scan
forward through the string for a match.
It's important to keep this distinction in mind.  Remember,
\function{match()} will only report a successful match that
starts at position 0; if the match wouldn't start at position 0,
\function{match()} will \emph{not} report it.
\begin{verbatim}
>>> print re.match('super', 'superstition').span()
(0, 5)
>>> print re.match('super', 'insuperable')
None
\end{verbatim}
On the other hand, \function{search()} will scan forward through the
string, reporting the first match it finds.
\begin{verbatim}
>>> print re.search('super', 'superstition').span()
(0, 5)
>>> print re.search('super', 'insuperable').span()
(2, 7)
\end{verbatim}
Sometimes you'll be tempted to keep using \function{re.match()}, and
just add \regexp{.*} to the front of your RE. Resist this temptation
and use \function{re.search()} instead. The regular expression
compiler does some analysis of REs in order to speed up the process of
looking for a match. One such analysis figures out what the first
character of a match must be; for example, a pattern starting with
\regexp{Crow} must match starting with a \character{C}. The analysis
lets the engine quickly scan through the string looking for the
starting character, only trying the full match if a \character{C} is found.
Adding \regexp{.*} defeats this optimization, requiring scanning to
the end of the string and then backtracking to find a match for the
rest of the RE. Use \function{re.search()} instead.
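The two approaches can be compared on the earlier example string.  This
sketch shows only the difference in what gets reported, not the
difference in speed; the variable names are chosen for illustration.

```python
import re

text = 'insuperable'

# Tempting but discouraged: pad the front of the RE and use match().
padded = re.match('.*super', text)

# Preferred: let search() find the starting position.
found = re.search('super', text)

paddedspan = padded.span()   # includes the text swallowed by .*
foundspan = found.span()     # covers only the text of interest
```

Besides defeating the optimization, the padded version reports a match
that begins at 0 and includes the leading \samp{in}, so the position of
\samp{super} itself is lost unless a group is added.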
\subsection{Greedy versus Non-Greedy}
When repeating a regular expression, as in \regexp{a*}, the resulting
action is to consume as much of the string as possible.  This
fact often bites you when you're trying to match a pair of
balanced delimiters, such as the angle brackets surrounding an HTML
tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't
work because of the greedy nature of \regexp{.*}.
\begin{verbatim}
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>
\end{verbatim}
The RE matches the \character{<} in \samp{<html>}, and the
\regexp{.*} consumes the rest of the string. There's still more left
in the RE, though, and the \regexp{>} can't match at the end of
the string, so the regular expression engine has to backtrack
character by character until it finds a match for the \regexp{>}.
The final match extends from the \character{<} in \samp{<html>}
to the \character{>} in \samp{</title>}, which isn't what you want.
In this case, the solution is to use the non-greedy qualifiers
\regexp{*?}, \regexp{+?}, \regexp{??}, or
\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
possible. In the above example, the \character{>} is tried
immediately after the first \character{<} matches, and when it fails,
the engine advances a character at a time, retrying the \character{>}
at every step. This produces just the right result:
\begin{verbatim}
>>> print re.match('<.*?>', s).group()
<html>
\end{verbatim}
(Note that parsing HTML or XML with regular expressions is painful.
Quick-and-dirty patterns will handle common cases, but HTML and XML
have special cases that will break the obvious regular expression; by
the time you've written a regular expression that handles all of the
possible cases, the patterns will be \emph{very} complicated. Use an
HTML or XML parser module for such tasks.)
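As a quick illustration of the non-greedy qualifiers listed above, the
following sketch reuses the string \code{s} from the earlier example
and adds three tiny made-up cases, one per remaining qualifier.

```python
import re

s = '<html><head><title>Title</title>'

# Non-greedy .*? stops at the first '>' each time, so every tag in the
# string is found separately.
tags = re.findall('<.*?>', s)

# The other non-greedy qualifiers likewise take the shortest match.
shortestplus = re.match('a+?', 'aaa').group()       # one 'a' is enough
shortestrange = re.match('a{2,3}?', 'aaa').group()  # the minimum, two
shortestopt = re.match('a??', 'aaa').group()        # zero is preferred
```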
\subsection{Not Using re.VERBOSE}
By now you've probably noticed that regular expressions are a very
compact notation, but they're not terribly readable. REs of
moderate complexity can become lengthy collections of backslashes,
parentheses, and metacharacters, making them difficult to read and
understand.
For such REs, specifying the \code{re.VERBOSE} flag when
compiling the regular expression can be helpful, because it allows
you to format the regular expression more clearly.
The \code{re.VERBOSE} flag has several effects. Whitespace in the
regular expression that \emph{isn't} inside a character class is
ignored. This means that an expression such as \regexp{dog | cat} is
equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
will still match the characters \character{a}, \character{b}, or a
space. In addition, you can also put comments inside a RE; comments
extend from a \samp{\#} character to the next newline. When used with
triple-quoted strings, this enables REs to be formatted more neatly:
\begin{verbatim}
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
\end{verbatim}
% $
This is far more readable than:
\begin{verbatim}
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
\end{verbatim}
% $
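The two forms of \code{pat} above can be checked against each other
with a sketch like the following.  The header line is made up, and the
name \code{compact} for the one-line form is chosen only for this
illustration.

```python
import re

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = '  From: author@example.com   '   # hypothetical header line

m1 = pat.match(line)
m2 = compact.match(line)

header = m1.group('header')
# The value keeps the single space after the colon, since nothing in
# the pattern skips it; the trailing spaces are dropped.
value = m1.group('value')
```

Both match objects report the same named groups, so the verbose layout
costs nothing in behaviour.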
\section{Feedback}
Regular expressions are a complicated topic. Did this document help
you understand them?  Were there parts that were unclear, or problems
you encountered that weren't covered here? If so, please send
suggestions for improvements to the author.
The most complete book on regular expressions is almost certainly
Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
by O'Reilly. Unfortunately, it exclusively concentrates on Perl and
Java's flavours of regular expressions, and doesn't contain any Python
material at all, so it won't be useful as a reference for programming
in Python. (The first edition covered Python's now-removed
\module{regex} module, which won't help you much.) Consider checking
it out from your library.
\end{document}