cpython/Lib/re.py

#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
# Copyright (c) 1998-2001 by Secret Labs AB. All rights reserved.
#
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license. For any other use, please contact Secret Labs
# AB (info@pythonware.com).
#
# Portions of this engine have been developed in cooperation with
# CNRI. Hewlett-Packard provided funding for 1.6 integration and
# other compatibility work.
#
r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl. It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves. You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.

The special sequences consist of "\\" and a character from the list
below. If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9].
    \D       Matches any non-digit character; equivalent to the set [^0-9].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
    \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match    Match a regular expression pattern to the beginning of a string.
    search   Search a string for the presence of a pattern.
    sub      Substitute occurrences of a pattern found in a string.
    subn     Same as sub, but also return the number of substitutions made.
    split    Split a string by the occurrences of a pattern.
    findall  Find all occurrences of a pattern in a string.
    compile  Compile a pattern into a RegexObject.
    purge    Clear the regular expression cache.
    escape   Backslash all non-alphanumerics in a string.

Some of the functions in this module take flags as optional parameters:
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines as well as the string.
                   "$" matches the end of lines as well as the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     Make \w, \W, \b, \B, dependent on the Unicode locale.

This module also defines an exception 'error'.

"""

import sys
import sre_compile
import sre_parse
# public symbols
__all__ = [ "match", "search", "sub", "subn", "split", "findall",
    "compile", "purge", "template", "escape", "I", "L", "M", "S", "X",
    "U", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE", "error" ]
__version__ = "2.2.1"
# flags
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode locale
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
# sre extensions (experimental, don't rely on these)
T = TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
DEBUG = sre_compile.SRE_FLAG_DEBUG # dump pattern after compilation
# sre exception
error = sre_compile.error
# --------------------------------------------------------------------
# public interface
def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, 0).sub(repl, string, count)

def subn(pattern, repl, string, count=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl. number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, 0).subn(repl, string, count)

def split(pattern, string, maxsplit=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings."""
    return _compile(pattern, 0).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

if sys.hexversion >= 0x02020000:
    __all__.append("finditer")
    def finditer(pattern, string, flags=0):
        """Return an iterator over all non-overlapping matches in the
        string. For each match, the iterator returns a match object.

        Empty matches are included in the result."""
        return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression cache"
    _cache.clear()
    _cache_repl.clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)

_alphanum = {}
for c in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890':
    _alphanum[c] = 1
del c

def escape(pattern):
    "Escape all non-alphanumeric characters in pattern."
    s = list(pattern)
    alphanum = _alphanum
    for i in range(len(pattern)):
        c = pattern[i]
        if c not in alphanum:
            if c == "\000":
                s[i] = "\\000"
            else:
                s[i] = "\\" + c
    return pattern[:0].join(s)

# --------------------------------------------------------------------
# internals
_cache = {}
_cache_repl = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 100
def _compile(*key):
    # internal: compile pattern
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p
    pattern, flags = key
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

def _compile_repl(*key):
    # internal: compile replacement pattern
    p = _cache_repl.get(key)
    if p is not None:
        return p
    repl, pattern = key
    p = sre_parse.parse_template(repl, pattern)
    if len(_cache_repl) >= _MAXCACHE:
        _cache_repl.clear()
    _cache_repl[key] = p
    return p

def _expand(pattern, match, template):
    # internal: match.expand implementation hook
    template = sre_parse.parse_template(template, pattern)
    return sre_parse.expand_template(template, match)

def _subx(pattern, template):
    # internal: pattern.sub/subn implementation helper
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
        # literal replacement
        return template[1][0]
    def filter(match, template=template):
        return sre_parse.expand_template(template, match)
    return filter

# register myself for pickling

import copy_reg

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copy_reg.pickle(_pattern_type, _pickle, _compile)

# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)
class Scanner:
    def __init__(self, lexicon, flags=0):
        from sre_constants import BRANCH, SUBPATTERN
        self.lexicon = lexicon
        # combine phrases into a compound pattern
        p = []
        s = sre_parse.Pattern()
        s.flags = flags
        for phrase, action in lexicon:
            p.append(sre_parse.SubPattern(s, [
                (SUBPATTERN, (len(p)+1, sre_parse.parse(phrase, flags))),
                ]))
        p = sre_parse.SubPattern(s, [(BRANCH, (None, p))])
        s.groups = len(p)
        self.scanner = sre_compile.compile(p)
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.scanner(string).match
        i = 0
        while 1:
            m = match()
            if not m:
                break
            j = m.end()
            if i == j:
                break
            action = self.lexicon[m.lastindex-1][1]
            if hasattr(action, '__call__'):
                self.match = m
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]
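
# --------------------------------------------------------------------
# usage sketch (not part of the original module)
#
# The experimental Scanner class takes a lexicon of (pattern, action)
# pairs; scan() applies them at successive positions and returns the
# collected results plus the unmatched remainder of the string. The
# sketch below is one possible way to drive it. It assumes the class is
# reachable as re.Scanner on the interpreter in use (it is undocumented,
# so treat that as an assumption); the token names "INT" and "NAME" and
# the sample input are illustrative only.

```python
import re

# Each action is either a callable taking (scanner, matched_text) or None.
# A None action means the matched text is consumed and dropped.
lexicon = [
    (r"\d+", lambda scanner, text: ("INT", int(text))),
    (r"[a-zA-Z_]\w*", lambda scanner, text: ("NAME", text)),
    (r"\s+", None),  # skip whitespace
]

scanner = re.Scanner(lexicon)

# scan() stops at the first position where no lexicon entry matches
# and hands back whatever is left of the input.
tokens, rest = scanner.scan("x1 42 ?")

print(tokens)  # [('NAME', 'x1'), ('INT', 42)]
print(rest)    # ?
```

# A caller would typically treat a non-empty remainder as a lexing
# error, since scan() itself does not raise when it gets stuck.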