cpython/Lib/sre_compile.py

#
# Secret Labs' Regular Expression Engine
#
# convert template to internal format
#
# Copyright (c) 1997-2001 by Secret Labs AB. All rights reserved.
#
# See the sre.py file for information on usage and redistribution.
#
"""Internal support module for sre"""
import _sre
import sre_parse
from sre_constants import *
assert _sre.MAGIC == MAGIC, "SRE module mismatch"
if _sre.CODESIZE == 2:
    MAXCODE = 65535
else:
    MAXCODE = 0xFFFFFFFF
_LITERAL_CODES = {LITERAL, NOT_LITERAL}
_REPEATING_CODES = {REPEAT, MIN_REPEAT, MAX_REPEAT}
_SUCCESS_CODES = {SUCCESS, FAILURE}
_ASSERT_CODES = {ASSERT, ASSERT_NOT}
def _compile(code, pattern, flags):
    # internal: compile a (sub)pattern
    emit = code.append
    _len = len
    LITERAL_CODES = _LITERAL_CODES
    REPEATING_CODES = _REPEATING_CODES
    SUCCESS_CODES = _SUCCESS_CODES
    ASSERT_CODES = _ASSERT_CODES
    for op, av in pattern:
        if op in LITERAL_CODES:
            if flags & SRE_FLAG_IGNORECASE:
                emit(OP_IGNORE[op])
                emit(_sre.getlower(av, flags))
            else:
                emit(op)
                emit(av)
        elif op is IN:
            if flags & SRE_FLAG_IGNORECASE:
                emit(OP_IGNORE[op])
                def fixup(literal, flags=flags):
                    return _sre.getlower(literal, flags)
            else:
                emit(op)
                fixup = None
            skip = _len(code); emit(0)
            _compile_charset(av, flags, code, fixup)
            code[skip] = _len(code) - skip
        elif op is ANY:
            if flags & SRE_FLAG_DOTALL:
                emit(ANY_ALL)
            else:
                emit(ANY)
        elif op in REPEATING_CODES:
            if flags & SRE_FLAG_TEMPLATE:
                raise error("internal: unsupported template operator")
            elif _simple(av) and op is not REPEAT:
                if op is MAX_REPEAT:
                    emit(REPEAT_ONE)
                else:
                    emit(MIN_REPEAT_ONE)
                skip = _len(code); emit(0)
                emit(av[0])
                emit(av[1])
                _compile(code, av[2], flags)
                emit(SUCCESS)
                code[skip] = _len(code) - skip
            else:
                emit(REPEAT)
                skip = _len(code); emit(0)
                emit(av[0])
                emit(av[1])
                _compile(code, av[2], flags)
                code[skip] = _len(code) - skip
                if op is MAX_REPEAT:
                    emit(MAX_UNTIL)
                else:
                    emit(MIN_UNTIL)
        elif op is SUBPATTERN:
            if av[0]:
                emit(MARK)
                emit((av[0]-1)*2)
            # _compile_info(code, av[1], flags)
            _compile(code, av[1], flags)
            if av[0]:
                emit(MARK)
                emit((av[0]-1)*2+1)
        elif op in SUCCESS_CODES:
            emit(op)
        elif op in ASSERT_CODES:
            emit(op)
            skip = _len(code); emit(0)
            if av[0] >= 0:
                emit(0) # look ahead
            else:
                lo, hi = av[1].getwidth()
                if lo != hi:
                    raise error("look-behind requires fixed-width pattern")
                emit(lo) # look behind
            _compile(code, av[1], flags)
            emit(SUCCESS)
            code[skip] = _len(code) - skip
        elif op is CALL:
            emit(op)
            skip = _len(code); emit(0)
            _compile(code, av, flags)
            emit(SUCCESS)
            code[skip] = _len(code) - skip
        elif op is AT:
            emit(op)
            if flags & SRE_FLAG_MULTILINE:
                av = AT_MULTILINE.get(av, av)
            if flags & SRE_FLAG_LOCALE:
                av = AT_LOCALE.get(av, av)
            elif flags & SRE_FLAG_UNICODE:
                av = AT_UNICODE.get(av, av)
            emit(av)
        elif op is BRANCH:
            emit(op)
            tail = []
            tailappend = tail.append
            for av in av[1]:
                skip = _len(code); emit(0)
                # _compile_info(code, av, flags)
                _compile(code, av, flags)
                emit(JUMP)
                tailappend(_len(code)); emit(0)
                code[skip] = _len(code) - skip
            emit(0) # end of branch
            for tail in tail:
                code[tail] = _len(code) - tail
        elif op is CATEGORY:
            emit(op)
            if flags & SRE_FLAG_LOCALE:
                av = CH_LOCALE[av]
            elif flags & SRE_FLAG_UNICODE:
                av = CH_UNICODE[av]
            emit(av)
        elif op is GROUPREF:
            if flags & SRE_FLAG_IGNORECASE:
                emit(OP_IGNORE[op])
            else:
                emit(op)
            emit(av-1)
        elif op is GROUPREF_EXISTS:
            emit(op)
            emit(av[0]-1)
            skipyes = _len(code); emit(0)
            _compile(code, av[1], flags)
            if av[2]:
                emit(JUMP)
                skipno = _len(code); emit(0)
                code[skipyes] = _len(code) - skipyes + 1
                _compile(code, av[2], flags)
                code[skipno] = _len(code) - skipno
            else:
                code[skipyes] = _len(code) - skipyes + 1
        else:
            raise ValueError("unsupported operand type", op)

def _compile_charset(charset, flags, code, fixup=None):
    # compile charset subprogram
    emit = code.append
    for op, av in _optimize_charset(charset, fixup):
        emit(op)
        if op is NEGATE:
            pass
        elif op is LITERAL:
            emit(av)
        elif op is RANGE or op is RANGE_IGNORE:
            emit(av[0])
            emit(av[1])
        elif op is CHARSET:
            code.extend(av)
        elif op is BIGCHARSET:
            code.extend(av)
        elif op is CATEGORY:
            if flags & SRE_FLAG_LOCALE:
                emit(CH_LOCALE[av])
            elif flags & SRE_FLAG_UNICODE:
                emit(CH_UNICODE[av])
            else:
                emit(av)
        else:
            raise error("internal: unsupported set operator")
    emit(FAILURE)

def _optimize_charset(charset, fixup):
    # internal: optimize character set
    out = []
    tail = []
    charmap = bytearray(256)
    for op, av in charset:
        while True:
            try:
                if op is LITERAL:
                    if fixup:
                        av = fixup(av)
                    charmap[av] = 1
                elif op is RANGE:
                    r = range(av[0], av[1]+1)
                    if fixup:
                        r = map(fixup, r)
                    for i in r:
                        charmap[i] = 1
                elif op is NEGATE:
                    out.append((op, av))
                else:
                    tail.append((op, av))
            except IndexError:
                if len(charmap) == 256:
                    # character set contains non-UCS1 character codes
                    charmap += b'\0' * 0xff00
                    continue
                # Character set contains non-BMP character codes.
                # There are only two ranges of cased non-BMP characters:
                # 10400-1044F (Deseret) and 118A0-118DF (Warang Citi),
                # and for both ranges RANGE_IGNORE works.
                if fixup and op is RANGE:
                    op = RANGE_IGNORE
                tail.append((op, av))
            break

    # compress character map
    runs = []
    q = 0
    while True:
        p = charmap.find(1, q)
        if p < 0:
            break
        if len(runs) >= 2:
            runs = None
            break
        q = charmap.find(0, p)
        if q < 0:
            runs.append((p, len(charmap)))
            break
        runs.append((p, q))
    if runs is not None:
        # use literal/range
        for p, q in runs:
            if q - p == 1:
                out.append((LITERAL, p))
            else:
                out.append((RANGE, (p, q - 1)))
        out += tail
        # if the case was changed or new representation is more compact
        if fixup or len(out) < len(charset):
            return out
        # else original character set is good enough
        return charset

    # use bitmap
    if len(charmap) == 256:
        data = _mk_bitmap(charmap)
        out.append((CHARSET, data))
        out += tail
        return out

    # To represent a big charset, first a bitmap of all characters in the
    # set is constructed. Then, this bitmap is sliced into chunks of 256
    # characters, duplicate chunks are eliminated, and each chunk is
    # given a number. In the compiled expression, the charset is
    # represented by a 32-bit word sequence, consisting of one word for
    # the number of different chunks, a sequence of 256 bytes (64 words)
    # of chunk numbers indexed by their original chunk position, and a
    # sequence of 256-bit chunks (8 words each).

    # Compression is normally good: in a typical charset, large ranges of
    # Unicode will be either completely excluded (e.g. if only cyrillic
    # letters are to be matched), or completely included (e.g. if large
    # subranges of Kanji match). These ranges will be represented by
    # chunks of all one-bits or all zero-bits.

    # Matching can be also done efficiently: the more significant byte of
    # the Unicode character is an index into the chunk number, and the
    # less significant byte is a bit index in the chunk (just like the
    # CHARSET matching).

    charmap = bytes(charmap) # should be hashable
    comps = {}
    mapping = bytearray(256)
    block = 0
    data = bytearray()
    for i in range(0, 65536, 256):
        chunk = charmap[i: i + 256]
        if chunk in comps:
            mapping[i // 256] = comps[chunk]
        else:
            mapping[i // 256] = comps[chunk] = block
            block += 1
            data += chunk
    data = _mk_bitmap(data)
    data[0:0] = [block] + _bytes_to_codes(mapping)
    out.append((BIGCHARSET, data))
    out += tail
    return out

_CODEBITS = _sre.CODESIZE * 8
_BITS_TRANS = b'0' + b'1' * 255
def _mk_bitmap(bits, _CODEBITS=_CODEBITS, _int=int):
    s = bits.translate(_BITS_TRANS)[::-1]
    return [_int(s[i - _CODEBITS: i], 2)
            for i in range(len(s), 0, -_CODEBITS)]

def _bytes_to_codes(b):
    # Convert block indices to word array
    import array
    a = array.array('I', b)
    assert a.itemsize == _sre.CODESIZE
    assert len(a) * a.itemsize == len(b)
    return a.tolist()

def _simple(av):
    # check if av is a "simple" operator
    lo, hi = av[2].getwidth()
    return lo == hi == 1 and av[2][0][0] != SUBPATTERN

def _generate_overlap_table(prefix):
    """
    Generate an overlap table for the following prefix.
    An overlap table is a table of the same size as the prefix which
    informs about the potential self-overlap for each index in the prefix:
    - if overlap[i] == 0, prefix[i:] can't overlap prefix[0:...]
    - if overlap[i] == k with 0 < k <= i, prefix[i-k+1:i+1] overlaps with
      prefix[0:k]
    """
    table = [0] * len(prefix)
    for i in range(1, len(prefix)):
        idx = table[i - 1]
        while prefix[i] != prefix[idx]:
            if idx == 0:
                table[i] = 0
                break
            idx = table[idx - 1]
        else:
            table[i] = idx + 1
    return table

def _compile_info(code, pattern, flags):
    # internal: compile an info block. in the current version,
    # this contains min/max pattern width, and an optional literal
    # prefix or a character map
    lo, hi = pattern.getwidth()
    if lo == 0:
        return # not worth it
    # look for a literal prefix
    prefix = []
    prefixappend = prefix.append
    prefix_skip = 0
    charset = [] # not used
    charsetappend = charset.append
    if not (flags & SRE_FLAG_IGNORECASE):
        # look for literal prefix
        for op, av in pattern.data:
            if op is LITERAL:
                if len(prefix) == prefix_skip:
                    prefix_skip = prefix_skip + 1
                prefixappend(av)
            elif op is SUBPATTERN and len(av[1]) == 1:
                op, av = av[1][0]
                if op is LITERAL:
                    prefixappend(av)
                else:
                    break
            else:
                break
        # if no prefix, look for charset prefix
        if not prefix and pattern.data:
            op, av = pattern.data[0]
            if op is SUBPATTERN and av[1]:
                op, av = av[1][0]
                if op is LITERAL:
                    charsetappend((op, av))
                elif op is BRANCH:
                    c = []
                    cappend = c.append
                    for p in av[1]:
                        if not p:
                            break
                        op, av = p[0]
                        if op is LITERAL:
                            cappend((op, av))
                        else:
                            break
                    else:
                        charset = c
            elif op is BRANCH:
                c = []
                cappend = c.append
                for p in av[1]:
                    if not p:
                        break
                    op, av = p[0]
                    if op is LITERAL:
                        cappend((op, av))
                    else:
                        break
                else:
                    charset = c
            elif op is IN:
                charset = av
##     if prefix:
##         print "*** PREFIX", prefix, prefix_skip
##     if charset:
##         print "*** CHARSET", charset
    # add an info block
    emit = code.append
    emit(INFO)
    skip = len(code); emit(0)
    # literal flag
    mask = 0
    if prefix:
        mask = SRE_INFO_PREFIX
        if len(prefix) == prefix_skip == len(pattern.data):
            mask = mask + SRE_INFO_LITERAL
    elif charset:
        mask = mask + SRE_INFO_CHARSET
    emit(mask)
    # pattern length
    if lo < MAXCODE:
        emit(lo)
    else:
        emit(MAXCODE)
        prefix = prefix[:MAXCODE]
    if hi < MAXCODE:
        emit(hi)
    else:
        emit(0)
    # add literal prefix
    if prefix:
        emit(len(prefix)) # length
        emit(prefix_skip) # skip
        code.extend(prefix)
        # generate overlap table
        code.extend(_generate_overlap_table(prefix))
    elif charset:
        _compile_charset(charset, flags, code)
    code[skip] = len(code) - skip

def isstring(obj):
    return isinstance(obj, (str, bytes))

def _code(p, flags):

    flags = p.pattern.flags | flags
    code = []

    # compile info block
    _compile_info(code, p, flags)

    # compile the pattern
    _compile(code, p.data, flags)

    code.append(SUCCESS)

    return code

def compile(p, flags=0):
    # internal: convert pattern list to internal format

    if isstring(p):
        pattern = p
        p = sre_parse.parse(p, flags)
    else:
        pattern = None

    code = _code(p, flags)

    # print(code)

    # map in either direction
    groupindex = p.pattern.groupdict
    indexgroup = [None] * p.pattern.groups
    for k, i in groupindex.items():
        indexgroup[i] = k

    return _sre.compile(
        pattern, flags | p.pattern.flags, code,
        p.pattern.groups-1,
        groupindex, indexgroup
        )