#
# Secret Labs' Regular Expression Engine
#
# convert template to internal format
#
# Copyright (c) 1997-2001 by Secret Labs AB. All rights reserved.
#
# See the sre.py file for information on usage and redistribution.
#

"""Internal support module for sre"""

import _sre
import sre_parse
from sre_constants import *

assert _sre.MAGIC == MAGIC, "SRE module mismatch"

_LITERAL_CODES = {LITERAL, NOT_LITERAL}
_REPEATING_CODES = {REPEAT, MIN_REPEAT, MAX_REPEAT}
_SUCCESS_CODES = {SUCCESS, FAILURE}
_ASSERT_CODES = {ASSERT, ASSERT_NOT}
_UNIT_CODES = _LITERAL_CODES | {ANY, IN}

# Sets of lowercase characters which have the same uppercase.
_equivalences = (
    # LATIN SMALL LETTER I, LATIN SMALL LETTER DOTLESS I
    (0x69, 0x131), # iı
    # LATIN SMALL LETTER S, LATIN SMALL LETTER LONG S
    (0x73, 0x17f), # sſ
    # MICRO SIGN, GREEK SMALL LETTER MU
    (0xb5, 0x3bc), # µμ
    # COMBINING GREEK YPOGEGRAMMENI, GREEK SMALL LETTER IOTA, GREEK PROSGEGRAMMENI
    (0x345, 0x3b9, 0x1fbe), # \u0345ιι
    # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
    (0x390, 0x1fd3), # ΐΐ
    # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS, GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
    (0x3b0, 0x1fe3), # ΰΰ
    # GREEK SMALL LETTER BETA, GREEK BETA SYMBOL
    (0x3b2, 0x3d0), # βϐ
    # GREEK SMALL LETTER EPSILON, GREEK LUNATE EPSILON SYMBOL
    (0x3b5, 0x3f5), # εϵ
    # GREEK SMALL LETTER THETA, GREEK THETA SYMBOL
    (0x3b8, 0x3d1), # θϑ
    # GREEK SMALL LETTER KAPPA, GREEK KAPPA SYMBOL
    (0x3ba, 0x3f0), # κϰ
    # GREEK SMALL LETTER PI, GREEK PI SYMBOL
    (0x3c0, 0x3d6), # πϖ
    # GREEK SMALL LETTER RHO, GREEK RHO SYMBOL
    (0x3c1, 0x3f1), # ρϱ
    # GREEK SMALL LETTER FINAL SIGMA, GREEK SMALL LETTER SIGMA
    (0x3c2, 0x3c3), # ςσ
    # GREEK SMALL LETTER PHI, GREEK PHI SYMBOL
    (0x3c6, 0x3d5), # φϕ
    # LATIN SMALL LETTER S WITH DOT ABOVE, LATIN SMALL LETTER LONG S WITH DOT ABOVE
    (0x1e61, 0x1e9b), # ṡẛ
    # LATIN SMALL LIGATURE LONG S T, LATIN SMALL LIGATURE ST
    (0xfb05, 0xfb06), # ſtst
)

# Maps the lowercase code to lowercase codes which have the same uppercase.
_ignorecase_fixes = {i: tuple(j for j in t if i != j)
                     for t in _equivalences for i in t}
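A minimal, standalone sketch of how this comprehension expands, using a small subset of the equivalence groups above for illustration: every code point in a group maps to a tuple of the *other* members of its group.

```python
# Subset of the _equivalences groups above, for illustration only.
equivalences = (
    (0x69, 0x131),           # i, dotless i
    (0x73, 0x17f),           # s, long s
    (0x345, 0x3b9, 0x1fbe),  # ypogegrammeni, iota, prosgegrammeni
)

# Same comprehension shape as _ignorecase_fixes: map each member to its peers.
fixes = {i: tuple(j for j in t if i != j)
         for t in equivalences for i in t}

assert fixes[0x73] == (0x17f,)           # 's' pairs with long s
assert fixes[0x3b9] == (0x345, 0x1fbe)   # iota pairs with the other two
```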


def _combine_flags(flags, add_flags, del_flags,
                   TYPE_FLAGS=sre_parse.TYPE_FLAGS):
    if add_flags & TYPE_FLAGS:
        flags &= ~TYPE_FLAGS
    return (flags | add_flags) & ~del_flags
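A standalone sketch of the rule `_combine_flags` implements. The flag values below are copied from `sre_constants` and should be treated as assumptions for this demo: when a scoped group adds a character-type flag (ASCII/LOCALE/UNICODE), the inherited type flags are cleared first; then additions are or-ed in and deletions masked out.

```python
# Flag values as in sre_constants (assumed here for a standalone demo).
SRE_FLAG_IGNORECASE = 2
SRE_FLAG_LOCALE = 4
SRE_FLAG_UNICODE = 32
SRE_FLAG_ASCII = 256
TYPE_FLAGS = SRE_FLAG_ASCII | SRE_FLAG_LOCALE | SRE_FLAG_UNICODE

def combine_flags(flags, add_flags, del_flags):
    # adding a type flag replaces the inherited type flags
    if add_flags & TYPE_FLAGS:
        flags &= ~TYPE_FLAGS
    return (flags | add_flags) & ~del_flags

# An (?a:...) group inside a Unicode, case-insensitive pattern:
# UNICODE is dropped, ASCII takes over, IGNORECASE is inherited.
assert combine_flags(SRE_FLAG_UNICODE | SRE_FLAG_IGNORECASE,
                     SRE_FLAG_ASCII, 0) == SRE_FLAG_ASCII | SRE_FLAG_IGNORECASE
```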


def _compile(code, pattern, flags):
    # internal: compile a (sub)pattern
    emit = code.append
    _len = len
    LITERAL_CODES = _LITERAL_CODES
    REPEATING_CODES = _REPEATING_CODES
    SUCCESS_CODES = _SUCCESS_CODES
    ASSERT_CODES = _ASSERT_CODES
    iscased = None
    tolower = None
    fixes = None
    if flags & SRE_FLAG_IGNORECASE and not flags & SRE_FLAG_LOCALE:
        if flags & SRE_FLAG_UNICODE:
            iscased = _sre.unicode_iscased
            tolower = _sre.unicode_tolower
            fixes = _ignorecase_fixes
        else:
            iscased = _sre.ascii_iscased
            tolower = _sre.ascii_tolower
    for op, av in pattern:
        if op in LITERAL_CODES:
            if not flags & SRE_FLAG_IGNORECASE:
                emit(op)
                emit(av)
            elif flags & SRE_FLAG_LOCALE:
                emit(OP_LOCALE_IGNORE[op])
                emit(av)
            elif not iscased(av):
                emit(op)
                emit(av)
            else:
                lo = tolower(av)
                if not fixes: # ascii
                    emit(OP_IGNORE[op])
                    emit(lo)
                elif lo not in fixes:
                    emit(OP_UNICODE_IGNORE[op])
                    emit(lo)
                else:
                    emit(IN_UNI_IGNORE)
                    skip = _len(code); emit(0)
                    if op is NOT_LITERAL:
                        emit(NEGATE)
                    for k in (lo,) + fixes[lo]:
                        emit(LITERAL)
                        emit(k)
                    emit(FAILURE)
                    code[skip] = _len(code) - skip
        elif op is IN:
            charset, hascased = _optimize_charset(av, iscased, tolower, fixes)
            if flags & SRE_FLAG_IGNORECASE and flags & SRE_FLAG_LOCALE:
                emit(IN_LOC_IGNORE)
            elif not hascased:
                emit(IN)
            elif not fixes: # ascii
                emit(IN_IGNORE)
            else:
                emit(IN_UNI_IGNORE)
            skip = _len(code); emit(0)
            _compile_charset(charset, flags, code)
            code[skip] = _len(code) - skip
        elif op is ANY:
            if flags & SRE_FLAG_DOTALL:
                emit(ANY_ALL)
            else:
                emit(ANY)
        elif op in REPEATING_CODES:
            if flags & SRE_FLAG_TEMPLATE:
                raise error("internal: unsupported template operator %r" % (op,))
            if _simple(av[2]):
                if op is MAX_REPEAT:
                    emit(REPEAT_ONE)
                else:
                    emit(MIN_REPEAT_ONE)
                skip = _len(code); emit(0)
                emit(av[0])
                emit(av[1])
                _compile(code, av[2], flags)
                emit(SUCCESS)
                code[skip] = _len(code) - skip
            else:
                emit(REPEAT)
                skip = _len(code); emit(0)
                emit(av[0])
                emit(av[1])
                _compile(code, av[2], flags)
                code[skip] = _len(code) - skip
                if op is MAX_REPEAT:
                    emit(MAX_UNTIL)
                else:
                    emit(MIN_UNTIL)
        elif op is SUBPATTERN:
            group, add_flags, del_flags, p = av
            if group:
                emit(MARK)
                emit((group-1)*2)
            # _compile_info(code, p, _combine_flags(flags, add_flags, del_flags))
            _compile(code, p, _combine_flags(flags, add_flags, del_flags))
            if group:
                emit(MARK)
                emit((group-1)*2+1)
        elif op in SUCCESS_CODES:
            emit(op)
        elif op in ASSERT_CODES:
            emit(op)
            skip = _len(code); emit(0)
            if av[0] >= 0:
                emit(0) # look ahead
            else:
                lo, hi = av[1].getwidth()
                if lo != hi:
                    raise error("look-behind requires fixed-width pattern")
                emit(lo) # look behind
            _compile(code, av[1], flags)
            emit(SUCCESS)
            code[skip] = _len(code) - skip
        elif op is CALL:
            emit(op)
            skip = _len(code); emit(0)
            _compile(code, av, flags)
            emit(SUCCESS)
            code[skip] = _len(code) - skip
        elif op is AT:
            emit(op)
            if flags & SRE_FLAG_MULTILINE:
                av = AT_MULTILINE.get(av, av)
            if flags & SRE_FLAG_LOCALE:
                av = AT_LOCALE.get(av, av)
            elif flags & SRE_FLAG_UNICODE:
                av = AT_UNICODE.get(av, av)
            emit(av)
        elif op is BRANCH:
            emit(op)
            tail = []
            tailappend = tail.append
            for av in av[1]:
                skip = _len(code); emit(0)
                # _compile_info(code, av, flags)
                _compile(code, av, flags)
                emit(JUMP)
                tailappend(_len(code)); emit(0)
                code[skip] = _len(code) - skip
            emit(FAILURE) # end of branch
            for tail in tail:
                code[tail] = _len(code) - tail
        elif op is CATEGORY:
            emit(op)
            if flags & SRE_FLAG_LOCALE:
                av = CH_LOCALE[av]
            elif flags & SRE_FLAG_UNICODE:
                av = CH_UNICODE[av]
            emit(av)
        elif op is GROUPREF:
            if not flags & SRE_FLAG_IGNORECASE:
                emit(op)
            elif flags & SRE_FLAG_LOCALE:
                emit(GROUPREF_LOC_IGNORE)
            elif not fixes: # ascii
                emit(GROUPREF_IGNORE)
            else:
                emit(GROUPREF_UNI_IGNORE)
            emit(av-1)
        elif op is GROUPREF_EXISTS:
            emit(op)
            emit(av[0]-1)
            skipyes = _len(code); emit(0)
            _compile(code, av[1], flags)
            if av[2]:
                emit(JUMP)
                skipno = _len(code); emit(0)
                code[skipyes] = _len(code) - skipyes + 1
                _compile(code, av[2], flags)
                code[skipno] = _len(code) - skipno
            else:
                code[skipyes] = _len(code) - skipyes + 1
        else:
            raise error("internal: unsupported operand type %r" % (op,))


def _compile_charset(charset, flags, code):
    # compile charset subprogram
    emit = code.append
    for op, av in charset:
        emit(op)
        if op is NEGATE:
            pass
        elif op is LITERAL:
            emit(av)
        elif op is RANGE or op is RANGE_UNI_IGNORE:
            emit(av[0])
            emit(av[1])
        elif op is CHARSET:
            code.extend(av)
        elif op is BIGCHARSET:
            code.extend(av)
        elif op is CATEGORY:
            if flags & SRE_FLAG_LOCALE:
                emit(CH_LOCALE[av])
            elif flags & SRE_FLAG_UNICODE:
                emit(CH_UNICODE[av])
            else:
                emit(av)
        else:
            raise error("internal: unsupported set operator %r" % (op,))
    emit(FAILURE)


def _optimize_charset(charset, iscased=None, fixup=None, fixes=None):
    # internal: optimize character set
    out = []
    tail = []
    charmap = bytearray(256)
    hascased = False
    for op, av in charset:
        while True:
            try:
                if op is LITERAL:
                    if fixup:
                        lo = fixup(av)
                        charmap[lo] = 1
                        if fixes and lo in fixes:
                            for k in fixes[lo]:
                                charmap[k] = 1
                        if not hascased and iscased(av):
                            hascased = True
                    else:
                        charmap[av] = 1
                elif op is RANGE:
                    r = range(av[0], av[1]+1)
                    if fixup:
                        if fixes:
                            for i in map(fixup, r):
                                charmap[i] = 1
                                if i in fixes:
                                    for k in fixes[i]:
                                        charmap[k] = 1
                        else:
                            for i in map(fixup, r):
                                charmap[i] = 1
                        if not hascased:
                            hascased = any(map(iscased, r))
                    else:
                        for i in r:
                            charmap[i] = 1
                elif op is NEGATE:
                    out.append((op, av))
                else:
                    tail.append((op, av))
            except IndexError:
                if len(charmap) == 256:
                    # character set contains non-UCS1 character codes
                    charmap += b'\0' * 0xff00
                    continue
                # Character set contains non-BMP character codes.
                if fixup:
                    hascased = True
                    # There are only two ranges of cased non-BMP characters:
                    # 10400-1044F (Deseret) and 118A0-118DF (Warang Citi),
                    # and for both ranges RANGE_UNI_IGNORE works.
                    if op is RANGE:
                        op = RANGE_UNI_IGNORE
                tail.append((op, av))
            break

    # compress character map
    runs = []
    q = 0
    while True:
        p = charmap.find(1, q)
        if p < 0:
            break
        if len(runs) >= 2:
            runs = None
            break
        q = charmap.find(0, p)
        if q < 0:
            runs.append((p, len(charmap)))
            break
        runs.append((p, q))
    if runs is not None:
        # use literal/range
        for p, q in runs:
            if q - p == 1:
                out.append((LITERAL, p))
            else:
                out.append((RANGE, (p, q - 1)))
        out += tail
        # if the case was changed or new representation is more compact
        if hascased or len(out) < len(charset):
            return out, hascased
        # else original character set is good enough
        return charset, hascased

    # use bitmap
    if len(charmap) == 256:
        data = _mk_bitmap(charmap)
        out.append((CHARSET, data))
        out += tail
        return out, hascased

    # To represent a big charset, first a bitmap of all characters in the
    # set is constructed. Then, this bitmap is sliced into chunks of 256
    # characters, duplicate chunks are eliminated, and each chunk is
    # given a number. In the compiled expression, the charset is
    # represented by a 32-bit word sequence, consisting of one word for
    # the number of different chunks, a sequence of 256 bytes (64 words)
    # of chunk numbers indexed by their original chunk position, and a
    # sequence of 256-bit chunks (8 words each).

    # Compression is normally good: in a typical charset, large ranges of
    # Unicode will be either completely excluded (e.g. if only cyrillic
    # letters are to be matched), or completely included (e.g. if large
    # subranges of Kanji match). These ranges will be represented by
    # chunks of all one-bits or all zero-bits.

    # Matching can also be done efficiently: the more significant byte of
    # the Unicode character is an index into the chunk number, and the
    # less significant byte is a bit index in the chunk (just like the
    # CHARSET matching).

    charmap = bytes(charmap) # should be hashable
    comps = {}
    mapping = bytearray(256)
    block = 0
    data = bytearray()
    for i in range(0, 65536, 256):
        chunk = charmap[i: i + 256]
        if chunk in comps:
            mapping[i // 256] = comps[chunk]
        else:
            mapping[i // 256] = comps[chunk] = block
            block += 1
            data += chunk
    data = _mk_bitmap(data)
    data[0:0] = [block] + _bytes_to_codes(mapping)
    out.append((BIGCHARSET, data))
    out += tail
    return out, hascased

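The "compress character map" step above can be sketched standalone. `find_runs` is a hypothetical helper, not part of this module: it scans the byte map for runs of set bytes and gives up once more than two runs are found, which is the point where the real code falls back to a bitmap.

```python
def find_runs(charmap, maxruns=2):
    # Return up to `maxruns` (start, stop) runs of 1-bytes, else None.
    runs, q = [], 0
    while True:
        p = charmap.find(1, q)
        if p < 0:
            break
        if len(runs) >= maxruns:
            return None          # too fragmented: caller uses a bitmap
        q = charmap.find(0, p)
        if q < 0:
            runs.append((p, len(charmap)))
            break
        runs.append((p, q))
    return runs

charmap = bytearray(256)
for c in b'abcdefghijklmnopqrstuvwxyz':
    charmap[c] = 1
assert find_runs(bytes(charmap)) == [(97, 123)]   # one run -> RANGE(97, 122)
charmap[0x30] = charmap[0x40] = charmap[0x50] = 1
assert find_runs(bytes(charmap)) is None          # three runs -> bitmap
```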
_CODEBITS = _sre.CODESIZE * 8
MAXCODE = (1 << _CODEBITS) - 1

_BITS_TRANS = b'0' + b'1' * 255
def _mk_bitmap(bits, _CODEBITS=_CODEBITS, _int=int):
    s = bits.translate(_BITS_TRANS)[::-1]
    return [_int(s[i - _CODEBITS: i], 2)
            for i in range(len(s), 0, -_CODEBITS)]

def _bytes_to_codes(b):
    # Convert block indices to word array
    a = memoryview(b).cast('I')
    assert a.itemsize == _sre.CODESIZE
    assert len(a) * a.itemsize == len(b)
    return a.tolist()
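A standalone sketch of the packing `_mk_bitmap` performs, assuming 32-bit code words: `translate()` turns each 0/1 byte of the charmap into an ASCII `'0'`/`'1'`, and the reversed bit string is sliced into machine words, so bit n of the set lands in word `n // 32`, bit `n % 32`.

```python
CODEBITS = 32                      # stands in for _sre.CODESIZE * 8
BITS_TRANS = b'0' + b'1' * 255

def mk_bitmap(bits):
    s = bits.translate(BITS_TRANS)[::-1]
    return [int(s[i - CODEBITS: i], 2)
            for i in range(len(s), 0, -CODEBITS)]

charmap = bytearray(256)
charmap[ord('a')] = 1              # put 'a' (code 97) in the set
words = mk_bitmap(bytes(charmap))
assert len(words) == 8             # 256 bits -> eight 32-bit words
assert words[97 // 32] == 1 << (97 % 32)
```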


def _simple(p):
    # check if this subpattern is a "simple" operator
    if len(p) != 1:
        return False
    op, av = p[0]
    if op is SUBPATTERN:
        return av[0] is None and _simple(av[-1])
    return op in _UNIT_CODES
2000-08-07 17:59:04 -03:00
|
|
|
|
|
2013-10-25 16:36:10 -03:00
|
|
|
|
def _generate_overlap_table(prefix):
|
|
|
|
|
"""
|
|
|
|
|
Generate an overlap table for the following prefix.
|
|
|
|
|
An overlap table is a table of the same size as the prefix which
|
|
|
|
|
informs about the potential self-overlap for each index in the prefix:
|
|
|
|
|
- if overlap[i] == 0, prefix[i:] can't overlap prefix[0:...]
|
|
|
|
|
- if overlap[i] == k with 0 < k <= i, prefix[i-k+1:i+1] overlaps with
|
|
|
|
|
prefix[0:k]
|
|
|
|
|
"""
|
|
|
|
|
table = [0] * len(prefix)
|
|
|
|
|
for i in range(1, len(prefix)):
|
|
|
|
|
idx = table[i - 1]
|
|
|
|
|
while prefix[i] != prefix[idx]:
|
|
|
|
|
if idx == 0:
|
|
|
|
|
table[i] = 0
|
|
|
|
|
break
|
|
|
|
|
idx = table[idx - 1]
|
|
|
|
|
else:
|
|
|
|
|
table[i] = idx + 1
|
|
|
|
|
return table
|
|
|
|
|
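This is the classic Knuth-Morris-Pratt failure function. A standalone copy with two worked inputs (hypothetical prefixes, not taken from a real pattern):

```python
def overlap_table(prefix):
    # table[i]: length of the longest proper prefix of prefix[:i+1]
    # that is also a suffix of it
    table = [0] * len(prefix)
    for i in range(1, len(prefix)):
        idx = table[i - 1]
        while prefix[i] != prefix[idx]:
            if idx == 0:
                table[i] = 0
                break
            idx = table[idx - 1]
        else:
            table[i] = idx + 1
    return table

assert overlap_table(b"abab") == [0, 0, 1, 2]   # "ab" overlaps itself
assert overlap_table(b"aaaa") == [0, 1, 2, 3]   # maximal self-overlap
```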


def _get_iscased(flags):
    if not flags & SRE_FLAG_IGNORECASE:
        return None
    elif flags & SRE_FLAG_UNICODE:
        return _sre.unicode_iscased
    else:
        return _sre.ascii_iscased


def _get_literal_prefix(pattern, flags):
    # look for literal prefix
    prefix = []
    prefixappend = prefix.append
    prefix_skip = None
    iscased = _get_iscased(flags)
    for op, av in pattern.data:
        if op is LITERAL:
            if iscased and iscased(av):
                break
            prefixappend(av)
        elif op is SUBPATTERN:
            group, add_flags, del_flags, p = av
            flags1 = _combine_flags(flags, add_flags, del_flags)
            if flags1 & SRE_FLAG_IGNORECASE and flags1 & SRE_FLAG_LOCALE:
                break
            prefix1, prefix_skip1, got_all = _get_literal_prefix(p, flags1)
            if prefix_skip is None:
                if group is not None:
                    prefix_skip = len(prefix)
                elif prefix_skip1 is not None:
                    prefix_skip = len(prefix) + prefix_skip1
            prefix.extend(prefix1)
            if not got_all:
                break
        else:
            break
    else:
        return prefix, prefix_skip, True
    return prefix, prefix_skip, False


def _get_charset_prefix(pattern, flags):
    while True:
        if not pattern.data:
            return None
        op, av = pattern.data[0]
        if op is not SUBPATTERN:
            break
        group, add_flags, del_flags, pattern = av
        flags = _combine_flags(flags, add_flags, del_flags)
        if flags & SRE_FLAG_IGNORECASE and flags & SRE_FLAG_LOCALE:
            return None

    iscased = _get_iscased(flags)
    if op is LITERAL:
        if iscased and iscased(av):
            return None
        return [(op, av)]
    elif op is BRANCH:
        charset = []
        charsetappend = charset.append
        for p in av[1]:
            if not p:
                return None
            op, av = p[0]
            if op is LITERAL and not (iscased and iscased(av)):
                charsetappend((op, av))
            else:
                return None
        return charset
    elif op is IN:
        charset = av
        if iscased:
            for op, av in charset:
                if op is LITERAL:
                    if iscased(av):
                        return None
                elif op is RANGE:
                    if av[1] > 0xffff:
                        return None
                    if any(map(iscased, range(av[0], av[1]+1))):
                        return None
        return charset
    return None


def _compile_info(code, pattern, flags):
    # internal: compile an info block. in the current version,
    # this contains min/max pattern width, and an optional literal
    # prefix or a character map
    lo, hi = pattern.getwidth()
    if hi > MAXCODE:
        hi = MAXCODE
    if lo == 0:
        code.extend([INFO, 4, 0, lo, hi])
        return
    # look for a literal prefix
    prefix = []
    prefix_skip = 0
    charset = [] # not used
    if not (flags & SRE_FLAG_IGNORECASE and flags & SRE_FLAG_LOCALE):
        # look for literal prefix
        prefix, prefix_skip, got_all = _get_literal_prefix(pattern, flags)
        # if no prefix, look for charset prefix
        if not prefix:
            charset = _get_charset_prefix(pattern, flags)
##     if prefix:
##         print("*** PREFIX", prefix, prefix_skip)
##     if charset:
##         print("*** CHARSET", charset)
    # add an info block
    emit = code.append
    emit(INFO)
    skip = len(code); emit(0)
    # literal flag
    mask = 0
    if prefix:
        mask = SRE_INFO_PREFIX
        if prefix_skip is None and got_all:
            mask = mask | SRE_INFO_LITERAL
    elif charset:
        mask = mask | SRE_INFO_CHARSET
    emit(mask)
    # pattern length
    if lo < MAXCODE:
        emit(lo)
    else:
        emit(MAXCODE)
        prefix = prefix[:MAXCODE]
    emit(min(hi, MAXCODE))
    # add literal prefix
    if prefix:
        emit(len(prefix)) # length
        if prefix_skip is None:
            prefix_skip = len(prefix)
        emit(prefix_skip) # skip
        code.extend(prefix)
        # generate overlap table
        code.extend(_generate_overlap_table(prefix))
    elif charset:
|
|
|
|
charset, hascased = _optimize_charset(charset)
|
|
|
|
|
assert not hascased
|
2003-02-23 21:18:35 -04:00
|
|
|
|
_compile_charset(charset, flags, code)
|
2000-06-29 20:33:12 -03:00
|
|
|
|
code[skip] = len(code) - skip
|
|
|
|
|
|
2003-07-02 18:37:16 -03:00
|
|
|
|
def isstring(obj):
|
2008-03-18 17:19:54 -03:00
|
|
|
|
return isinstance(obj, (str, bytes))
|
2003-07-02 18:37:16 -03:00
|
|
|
|
|
2000-08-01 18:05:41 -03:00
|
|
|
|
def _code(p, flags):
|
2000-06-29 20:33:12 -03:00
|
|
|
|
|
2018-09-18 03:16:26 -03:00
|
|
|
|
flags = p.state.flags | flags
|
2000-06-29 13:57:40 -03:00
|
|
|
|
code = []
|
2000-06-29 20:33:12 -03:00
|
|
|
|
|
|
|
|
|
# compile info block
|
|
|
|
|
_compile_info(code, p, flags)
|
|
|
|
|
|
|
|
|
|
# compile the pattern
|
2000-06-01 14:39:12 -03:00
|
|
|
|
_compile(code, p.data, flags)
|
2000-06-29 20:33:12 -03:00
|
|
|
|
|
2014-11-09 14:48:36 -04:00
|
|
|
|
code.append(SUCCESS)
|
2000-06-29 20:33:12 -03:00
|
|
|
|
|
2000-08-01 15:20:07 -03:00
|
|
|
|
return code
|
|
|
|
|
|
2017-05-14 03:05:13 -03:00
|
|
|
|
def _hex_code(code):
|
|
|
|
|
return '[%s]' % ', '.join('%#0*x' % (_sre.CODESIZE*2+2, x) for x in code)
|
|
|
|
|
|
|
|
|
|
def dis(code):
|
|
|
|
|
import sys
|
|
|
|
|
|
|
|
|
|
labels = set()
|
|
|
|
|
level = 0
|
|
|
|
|
offset_width = len(str(len(code) - 1))
|
|
|
|
|
|
|
|
|
|
def dis_(start, end):
|
|
|
|
|
def print_(*args, to=None):
|
|
|
|
|
if to is not None:
|
|
|
|
|
labels.add(to)
|
|
|
|
|
args += ('(to %d)' % (to,),)
|
|
|
|
|
print('%*d%s ' % (offset_width, start, ':' if start in labels else '.'),
|
|
|
|
|
end=' '*(level-1))
|
|
|
|
|
print(*args)
|
|
|
|
|
|
|
|
|
|
def print_2(*args):
|
|
|
|
|
print(end=' '*(offset_width + 2*level))
|
|
|
|
|
print(*args)
|
|
|
|
|
|
|
|
|
|
nonlocal level
|
|
|
|
|
level += 1
|
|
|
|
|
i = start
|
|
|
|
|
while i < end:
|
|
|
|
|
start = i
|
|
|
|
|
op = code[i]
|
|
|
|
|
i += 1
|
|
|
|
|
op = OPCODES[op]
|
|
|
|
|
if op in (SUCCESS, FAILURE, ANY, ANY_ALL,
|
|
|
|
|
MAX_UNTIL, MIN_UNTIL, NEGATE):
|
|
|
|
|
print_(op)
|
|
|
|
|
elif op in (LITERAL, NOT_LITERAL,
|
|
|
|
|
LITERAL_IGNORE, NOT_LITERAL_IGNORE,
|
2017-10-24 17:31:42 -03:00
|
|
|
|
LITERAL_UNI_IGNORE, NOT_LITERAL_UNI_IGNORE,
|
2017-05-14 03:05:13 -03:00
|
|
|
|
LITERAL_LOC_IGNORE, NOT_LITERAL_LOC_IGNORE):
|
|
|
|
|
arg = code[i]
|
|
|
|
|
i += 1
|
|
|
|
|
print_(op, '%#02x (%r)' % (arg, chr(arg)))
|
|
|
|
|
elif op is AT:
|
|
|
|
|
arg = code[i]
|
|
|
|
|
i += 1
|
|
|
|
|
arg = str(ATCODES[arg])
|
|
|
|
|
assert arg[:3] == 'AT_'
|
|
|
|
|
print_(op, arg[3:])
|
|
|
|
|
elif op is CATEGORY:
|
|
|
|
|
arg = code[i]
|
|
|
|
|
i += 1
|
|
|
|
|
arg = str(CHCODES[arg])
|
|
|
|
|
assert arg[:9] == 'CATEGORY_'
|
|
|
|
|
print_(op, arg[9:])
|
2017-10-24 17:31:42 -03:00
|
|
|
|
elif op in (IN, IN_IGNORE, IN_UNI_IGNORE, IN_LOC_IGNORE):
|
2017-05-14 03:05:13 -03:00
|
|
|
|
skip = code[i]
|
|
|
|
|
print_(op, skip, to=i+skip)
|
|
|
|
|
dis_(i+1, i+skip)
|
|
|
|
|
i += skip
|
2017-10-24 17:31:42 -03:00
|
|
|
|
elif op in (RANGE, RANGE_UNI_IGNORE):
|
2017-05-14 03:05:13 -03:00
|
|
|
|
lo, hi = code[i: i+2]
|
|
|
|
|
i += 2
|
|
|
|
|
print_(op, '%#02x %#02x (%r-%r)' % (lo, hi, chr(lo), chr(hi)))
|
|
|
|
|
elif op is CHARSET:
|
|
|
|
|
print_(op, _hex_code(code[i: i + 256//_CODEBITS]))
|
|
|
|
|
i += 256//_CODEBITS
|
|
|
|
|
elif op is BIGCHARSET:
|
|
|
|
|
arg = code[i]
|
|
|
|
|
i += 1
|
|
|
|
|
mapping = list(b''.join(x.to_bytes(_sre.CODESIZE, sys.byteorder)
|
|
|
|
|
for x in code[i: i + 256//_sre.CODESIZE]))
|
|
|
|
|
print_(op, arg, mapping)
|
|
|
|
|
i += 256//_sre.CODESIZE
|
|
|
|
|
level += 1
|
|
|
|
|
for j in range(arg):
|
|
|
|
|
print_2(_hex_code(code[i: i + 256//_CODEBITS]))
|
|
|
|
|
i += 256//_CODEBITS
|
|
|
|
|
level -= 1
|
2017-10-24 17:31:42 -03:00
|
|
|
|
elif op in (MARK, GROUPREF, GROUPREF_IGNORE, GROUPREF_UNI_IGNORE,
|
|
|
|
|
GROUPREF_LOC_IGNORE):
|
2017-05-14 03:05:13 -03:00
|
|
|
|
arg = code[i]
|
|
|
|
|
i += 1
|
|
|
|
|
print_(op, arg)
|
|
|
|
|
elif op is JUMP:
|
|
|
|
|
skip = code[i]
|
|
|
|
|
print_(op, skip, to=i+skip)
|
|
|
|
|
i += 1
|
|
|
|
|
elif op is BRANCH:
|
|
|
|
|
skip = code[i]
|
|
|
|
|
print_(op, skip, to=i+skip)
|
|
|
|
|
while skip:
|
|
|
|
|
dis_(i+1, i+skip)
|
|
|
|
|
i += skip
|
|
|
|
|
start = i
|
|
|
|
|
skip = code[i]
|
|
|
|
|
if skip:
|
|
|
|
|
print_('branch', skip, to=i+skip)
|
|
|
|
|
else:
|
|
|
|
|
print_(FAILURE)
|
|
|
|
|
i += 1
|
|
|
|
|
elif op in (REPEAT, REPEAT_ONE, MIN_REPEAT_ONE):
|
|
|
|
|
skip, min, max = code[i: i+3]
|
|
|
|
|
if max == MAXREPEAT:
|
|
|
|
|
max = 'MAXREPEAT'
|
|
|
|
|
print_(op, skip, min, max, to=i+skip)
|
|
|
|
|
dis_(i+3, i+skip)
|
|
|
|
|
i += skip
|
|
|
|
|
elif op is GROUPREF_EXISTS:
|
|
|
|
|
arg, skip = code[i: i+2]
|
|
|
|
|
print_(op, arg, skip, to=i+skip)
|
|
|
|
|
i += 2
|
|
|
|
|
elif op in (ASSERT, ASSERT_NOT):
|
|
|
|
|
skip, arg = code[i: i+2]
|
|
|
|
|
print_(op, skip, arg, to=i+skip)
|
|
|
|
|
dis_(i+2, i+skip)
|
|
|
|
|
i += skip
|
|
|
|
|
elif op is INFO:
|
|
|
|
|
skip, flags, min, max = code[i: i+4]
|
|
|
|
|
if max == MAXREPEAT:
|
|
|
|
|
max = 'MAXREPEAT'
|
|
|
|
|
print_(op, skip, bin(flags), min, max, to=i+skip)
|
|
|
|
|
start = i+4
|
|
|
|
|
if flags & SRE_INFO_PREFIX:
|
|
|
|
|
prefix_len, prefix_skip = code[i+4: i+6]
|
|
|
|
|
print_2(' prefix_skip', prefix_skip)
|
|
|
|
|
start = i + 6
|
|
|
|
|
prefix = code[start: start+prefix_len]
|
|
|
|
|
print_2(' prefix',
|
|
|
|
|
'[%s]' % ', '.join('%#02x' % x for x in prefix),
|
|
|
|
|
'(%r)' % ''.join(map(chr, prefix)))
|
|
|
|
|
start += prefix_len
|
|
|
|
|
print_2(' overlap', code[start: start+prefix_len])
|
|
|
|
|
start += prefix_len
|
|
|
|
|
if flags & SRE_INFO_CHARSET:
|
|
|
|
|
level += 1
|
|
|
|
|
print_2('in')
|
|
|
|
|
dis_(start, i+skip)
|
|
|
|
|
level -= 1
|
|
|
|
|
i += skip
|
|
|
|
|
else:
|
|
|
|
|
raise ValueError(op)
|
|
|
|
|
|
|
|
|
|
level -= 1
|
|
|
|
|
|
|
|
|
|
dis_(0, len(code))
|
|
|
|
|
|
|
|
|
|
|
2000-08-01 15:20:07 -03:00
|
|
|
|
def compile(p, flags=0):
|
|
|
|
|
# internal: convert pattern list to internal format
|
|
|
|
|
|
2003-07-02 18:37:16 -03:00
|
|
|
|
if isstring(p):
|
2000-08-01 15:20:07 -03:00
|
|
|
|
pattern = p
|
|
|
|
|
p = sre_parse.parse(p, flags)
|
|
|
|
|
else:
|
|
|
|
|
pattern = None
|
|
|
|
|
|
2000-08-01 18:05:41 -03:00
|
|
|
|
code = _code(p, flags)
|
2000-08-01 15:20:07 -03:00
|
|
|
|
|
2017-05-14 03:05:13 -03:00
|
|
|
|
if flags & SRE_FLAG_DEBUG:
|
|
|
|
|
print()
|
|
|
|
|
dis(code)
|
2000-07-23 18:46:17 -03:00
|
|
|
|
|
2000-07-02 19:25:39 -03:00
|
|
|
|
# map in either direction
|
2018-09-18 03:16:26 -03:00
|
|
|
|
groupindex = p.state.groupdict
|
|
|
|
|
indexgroup = [None] * p.state.groups
|
2000-07-02 19:25:39 -03:00
|
|
|
|
for k, i in groupindex.items():
|
|
|
|
|
indexgroup[i] = k
|
|
|
|
|
|
2000-06-01 14:39:12 -03:00
|
|
|
|
return _sre.compile(
|
2018-09-18 03:16:26 -03:00
|
|
|
|
pattern, flags | p.state.flags, code,
|
|
|
|
|
p.state.groups-1,
|
2016-11-22 18:04:39 -04:00
|
|
|
|
groupindex, tuple(indexgroup)
|
2000-06-30 04:50:59 -03:00
|
|
|
|
)
|
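As a usage sketch (an addition for illustration, not part of this module): compiling through the public `re` API with the `re.DEBUG` flag exercises the `SRE_FLAG_DEBUG` branch in `compile()` above, which prints the opcode listing via `dis()` before returning the compiled pattern object:

```python
import re

# re.DEBUG maps to SRE_FLAG_DEBUG, so compilation prints the
# disassembled bytecode to stdout before returning the pattern.
p = re.compile("ab|cd", re.DEBUG)

# The returned object behaves like any compiled pattern.
assert p.match("cd") is not None
assert p.match("xy") is None
```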