This is a verifier for the binary code used by the _sre module (this
is often called bytecode, though to distinguish it from Python bytecode
I put it in quotes).
I wrote this for Google App Engine, and am making the patch available as
open source under the Apache 2 license. Below are the copyright
statement and license, for completeness.
# Copyright 2008 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
It's not necessary to include these copyrights and bytecode in the
source file. Google has signed a contributor's agreement with the PSF
already.
SRE_MATCH so that signal handlers can be invoked during
long regular expression matches. It also adds a new
error return value indicating that an exception
occurred in a signal handler during the match, allowing
exceptions in the signal handler to propagate up to the
main loop. Thanks Josh Hoyt and Ralf Schmitt.
Make sure the type of the return value of re.sub(x, y, z) is the type
of y+x (i.e. unicode if either is unicode, str if they are both str)
even if there are no substitutions or if x==z (which triggered various
special cases in join_list()).
Could be backported to 2.5; no need to port to 3.0.
loses information:
OverflowError: regular expression code size limit exceeded
Otherwise the compiled code is gibberish, possibly leading at
least to wrong results or (as reported on c.l.py) internal
sre errors at match time.
I'm not sure how to test this. SRE_CODE is a 2-byte type on
my box, and it's easy to create a regexp that causes the new
exception to trigger here. But it may be a 4-byte type on
other boxes, and creating a regexp large enough to trigger
problems there would be pretty crazy.
Bugfix candidate.
In C++, it's an error to pass a string literal to a char* function
without a const_cast(). Rather than require every C++ extension
module to put a cast around string literals, fix the API to state the
const-ness.
I focused on parts of the API where people usually pass literals:
PyArg_ParseTuple() and friends, Py_BuildValue(), PyMethodDef, the type
slots, etc. Predictably, there were a large set of functions that
needed to be fixed as a result of these changes. The most pervasive
change was to make the keyword args list passed to
PyArg_ParseTupleAndKewords() to be a const char *kwlist[].
One cast was required as a result of the changes: A type object
mallocs the memory for its tp_doc slot and later frees it.
PyTypeObject says that tp_doc is const char *; but if the type was
created by type_new(), we know it is safe to cast to char *.
stack usage on FreeBSD, requiring the recursion limit to be lowered
further. Building with gcc 2.95 (the standard compiler on FreeBSD 4.x)
is now also affected.
The underlying issue is that FreeBSD's pthreads implementation has a
hard-coded 1MB stack size for the initial (or "primary") thread, which
can not be changed without rebuilding libc_r. Exhausting this stack
results in a bus error.
Building without pthreads (configure --without-threads), or linking
with the port of the Linux pthreads library (aka Linuxthreads) instead
of libc_r, avoids this limitation.
On OS/2, only gcc 3.2 is affected and the stack size is controllable,
so the special handling has been removed.
to use LASTMARK_SAVE()/LASTMARK_RESTORE(), based on the discussion
in patch #712900.
- Cleaned up LASTMARK_SAVE()/LASTMARK_RESTORE() usage, based on the
established rules.
- Moved the upper part of the just commited patch (relative to bug #725106)
to outside the for() loop of BRANCH OP. There's no need to mark_save()
in every loop iteration.
This problem is related to a wrong behavior from mark_save/restore(),
which don't restore the mark_stack_base before restoring the marks.
Greg's suggestion was to change the asserts, which happen to be
the only recursive ops that can continue the loop, but the problem would
happen to any operation with the same behavior. So, rather than
hardcoding this into asserts, I have changed mark_save/restore() to
always restore the stackbase before restoring the marks.
Both solutions should fix these two cases, presented by Greg:
>>> re.match('(a)(?:(?=(b)*)c)*', 'abb').groups()
('b', None)
>>> re.match('(a)((?!(b)*))*', 'abb').groups()
('b', None, None)
The rest of the bug and patch in #725149 must be discussed further.
within repeats of alternatives. The only change to the original
patch was to convert the tests to the new test_re.py file.
This patch fixes cases like:
>>> re.match('((a)|b)*', 'abc').groups()
('b', '')
Which is wrong (it's impossible to match the empty string),
and incompatible with other regex systems, like the following
examples show:
% perl -e '"abc" =~ /^((a)|b)*/; print "$1 $2\n";'
b a
% echo "abc" | sed -r -e "s/^((a)|b)*/\1 \2|/"
b a|c
I've applied a modified version of Greg Chapman's patch. I've included
the fixes without introducing the reorganization mentioned, for the sake
of stability. Also, the second fix mentioned in the patch don't fix the
mentioned problem anymore, because of the change introduced by patch
#720991 (by Greg as well). The new fix wasn't complicated though, and is
included as well.
As a note. It seems that there are other places that require the
"protection" of LASTMARK_SAVE()/LASTMARK_RESTORE(), and are just waiting
for someone to find how to break them. Particularly, I belive that every
recursion of SRE_MATCH() should be protected by these macros. I won't
do that right now since I'm not completely sure about this, and we don't
have much time for testing until the next release.
to be compliant with previous python versions, by backing out the changes
made in revision 2.84 which affected this. The bugfix for backtracking is
still maintained.