mirror of https://github.com/python/cpython
#18562: various revisions to the regex howto for 3.x
* describe how \w is different when used in bytes and Unicode patterns.
* describe re.ASCII flag to change that behaviour.
* remove personal references ('I generally prefer...')
* add some more links to the re module in the library reference
* various small edits and re-wording.
parent ba5d8f33ec
commit 3f4f3ba1a8
@@ -104,13 +104,25 @@ you can still match them in patterns; for example, if you need to match a ``[``
 or ``\``, you can precede them with a backslash to remove their special
 meaning: ``\[`` or ``\\``.
 
-Some of the special sequences beginning with ``'\'`` represent predefined sets
-of characters that are often useful, such as the set of digits, the set of
-letters, or the set of anything that isn't whitespace. The following predefined
-special sequences are a subset of those available. The equivalent classes are
-for bytes patterns. For a complete list of sequences and expanded class
-definitions for Unicode string patterns, see the last part of
-:ref:`Regular Expression Syntax <re-syntax>`.
+Some of the special sequences beginning with ``'\'`` represent
+predefined sets of characters that are often useful, such as the set
+of digits, the set of letters, or the set of anything that isn't
+whitespace.
+
+Let's take an example: ``\w`` matches any alphanumeric character. If
+the regex pattern is expressed in bytes, this is equivalent to the
+class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will
+match all the characters marked as letters in the Unicode database
+provided by the :mod:`unicodedata` module. You can use the more
+restricted definition of ``\w`` in a string pattern by supplying the
+:const:`re.ASCII` flag when compiling the regular expression.
+
+The following list of special sequences isn't complete. For a complete
+list of sequences and expanded class definitions for Unicode string
+patterns, see the last part of :ref:`Regular Expression Syntax
+<re-syntax>` in the Standard Library reference. In general, the
+Unicode versions match any character that's in the appropriate
+category in the Unicode database.
 
 ``\d``
    Matches any decimal digit; this is equivalent to the class ``[0-9]``.
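As a rough check of the ``\w`` behaviour described in the hunk above, the sketch below runs the same pattern as a string pattern, as a string pattern with ``re.ASCII``, and as a bytes pattern. The sample text and the variable names are invented for illustration and are not part of the commit.

```python
import re

# Hypothetical sample text: 'café_42' contains one non-ASCII letter (é).
text = "caf\u00e9_42"

# String pattern: \w follows the Unicode database, so 'é' counts as a word character.
unicode_words = re.findall(r"\w+", text)

# String pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Bytes pattern: \w is [a-zA-Z0-9_], so the UTF-8 bytes of 'é' split the match.
byte_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)  # ['café_42']
print(ascii_words)    # ['caf', '_42']
print(byte_words)     # [b'caf', b'_42']
```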
@@ -160,9 +172,8 @@ previous character can be matched zero or more times, instead of exactly once.
 For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``),
 ``caaat`` (3 ``a`` characters), and so forth. The RE engine has various
 internal limitations stemming from the size of C's ``int`` type that will
-prevent it from matching over 2 billion ``a`` characters; you probably don't
-have enough memory to construct a string that large, so you shouldn't run into
-that limit.
+prevent it from matching over 2 billion ``a`` characters; patterns
+are usually not written to match that much data.
 
 Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
 engine will try to repeat it as many times as possible. If later portions of the
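The ``ca*t`` example and the note on greedy repetition in this hunk can be tried directly. This is only a small sketch; the candidate strings and variable names are assumptions chosen for the demonstration.

```python
import re

p = re.compile(r"ca*t")

# 'a*' allows zero or more 'a' characters between 'c' and 't'.
candidates = ["ct", "cat", "caaat", "cbt"]
results = [p.fullmatch(s) is not None for s in candidates]
print(results)  # [True, True, True, False]

# Greedy: 'a*' consumes as many 'a' characters as it can.
greedy = re.match(r"a*", "aaab").group()
print(greedy)  # 'aaa'
```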
@@ -353,7 +364,7 @@ for a complete listing.
 |                  | returns them as an :term:`iterator`.          |
 +------------------+-----------------------------------------------+
 
-:meth:`match` and :meth:`search` return ``None`` if no match can be found. If
+:meth:`~re.regex.match` and :meth:`~re.regex.search` return ``None`` if no match can be found. If
 they're successful, a :ref:`match object <match-objects>` instance is returned,
 containing information about the match: where it starts and ends, the substring
 it matched, and more.
@@ -419,8 +430,8 @@ Trying these methods will soon clarify their meaning::
    >>> m.span()
    (0, 5)
 
-:meth:`group` returns the substring that was matched by the RE. :meth:`start`
-and :meth:`end` return the starting and ending index of the match. :meth:`span`
+:meth:`~re.match.group` returns the substring that was matched by the RE. :meth:`~re.match.start`
+and :meth:`~re.match.end` return the starting and ending index of the match. :meth:`~re.match.span`
 returns both start and end indexes in a single tuple. Since the :meth:`match`
 method only checks if the RE matches at the start of a string, :meth:`start`
 will always be zero. However, the :meth:`search` method of patterns
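For readers checking the prose in this hunk, here is a short sketch of the four match-object methods, and of why ``start()`` is always zero for ``match()`` but not for ``search()``. The pattern and input strings are illustrative choices, not taken from the changed lines.

```python
import re

p = re.compile(r"[a-z]+")

# match() only succeeds at the beginning of the string, so start() is 0.
m = p.match("tempo 120")
print(m.group(), m.start(), m.end(), m.span())  # tempo 0 5 (0, 5)

# search() scans forward, so the match may begin later in the string.
s = p.search("::: message")
print(s.group(), s.span())  # message (4, 11)

# match() on the same input finds nothing at position 0 and returns None.
no_match = p.match("::: message")
print(no_match)  # None
```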
@@ -448,14 +459,14 @@ In actual programs, the most common style is to store the
        print('No match')
 
 Two pattern methods return all of the matches for a pattern.
-:meth:`findall` returns a list of matching strings::
+:meth:`~re.regex.findall` returns a list of matching strings::
 
    >>> p = re.compile('\d+')
    >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
    ['12', '11', '10']
 
 :meth:`findall` has to create the entire list before it can be returned as the
-result. The :meth:`finditer` method returns a sequence of
+result. The :meth:`~re.regex.finditer` method returns a sequence of
 :ref:`match object <match-objects>` instances as an :term:`iterator`::
 
    >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
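The difference between the two methods in this hunk can be sketched as follows: ``findall()`` builds the whole list of strings up front, while ``finditer()`` yields match objects one at a time. The variable names are assumptions; the input string is the one used in the howto.

```python
import re

p = re.compile(r"\d+")
text = "12 drummers drumming, 11 pipers piping, 10 lords a-leaping"

# findall() returns a fully built list of the matching strings.
all_numbers = p.findall(text)
print(all_numbers)  # ['12', '11', '10']

# finditer() returns an iterator of match objects, which also carry positions.
spans = [m.span() for m in p.finditer(text)]
recovered = [text[start:end] for start, end in spans]
print(recovered)  # ['12', '11', '10']
```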
@@ -473,9 +484,9 @@ Module-Level Functions
 ----------------------
 
 You don't have to create a pattern object and call its methods; the
-:mod:`re` module also provides top-level functions called :func:`match`,
-:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions
-take the same arguments as the corresponding pattern method, with
+:mod:`re` module also provides top-level functions called :func:`~re.match`,
+:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
+take the same arguments as the corresponding pattern method with
 the RE string added as the first argument, and still return either ``None`` or a
 :ref:`match object <match-objects>` instance. ::
 
@@ -485,26 +496,15 @@ the RE string added as the first argument, and still return either ``None`` or a
    <_sre.SRE_Match object at 0x...>
 
 Under the hood, these functions simply create a pattern object for you
-and call the appropriate method on it. They also store the compiled object in a
-cache, so future calls using the same RE are faster.
+and call the appropriate method on it. They also store the compiled
+object in a cache, so future calls using the same RE won't need to
+parse the pattern again and again.
 
 Should you use these module-level functions, or should you get the
-pattern and call its methods yourself? That choice depends on how
-frequently the RE will be used, and on your personal coding style. If the RE is
-being used at only one point in the code, then the module functions are probably
-more convenient. If a program contains a lot of regular expressions, or re-uses
-the same ones in several locations, then it might be worthwhile to collect all
-the definitions in one place, in a section of code that compiles all the REs
-ahead of time. To take an example from the standard library, here's an extract
-from the now-defunct Python 2 standard :mod:`xmllib` module::
-
-   ref = re.compile( ... )
-   entityref = re.compile( ... )
-   charref = re.compile( ... )
-   starttagopen = re.compile( ... )
-
-I generally prefer to work with the compiled object, even for one-time uses, but
-few people will be as much of a purist about this as I am.
+pattern and call its methods yourself? If you're accessing a regex
+within a loop, pre-compiling it will save a few function calls.
+Outside of loops, there's not much difference thanks to the internal
+cache.
 
 
 Compilation Flags
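The new wording about loops can be illustrated with a minimal sketch comparing the two styles. Both give the same results; the compiled form just skips the per-call cache lookup inside the loop. The sample lines and names below are hypothetical.

```python
import re

# Hypothetical input lines, for illustration only.
lines = ["From amk Thu May 14", "To: someone", "From nobody"]

# Module-level function: the pattern string is passed (and looked up in the
# internal cache) on every call.
hits_module = [line for line in lines if re.match(r"From\s+", line)]

# Pre-compiled pattern: compile once, then call the method inside the loop.
from_header = re.compile(r"From\s+")
hits_compiled = [line for line in lines if from_header.match(line)]

print(hits_module == hits_compiled)  # True
```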
@@ -524,6 +524,10 @@ of each one.
 +---------------------------------+--------------------------------------------+
 | Flag                            | Meaning                                    |
 +=================================+============================================+
+| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
+|                                 | ``\s`` and ``\d`` match only on ASCII      |
+|                                 | characters with the respective property.   |
++---------------------------------+--------------------------------------------+
 | :const:`DOTALL`, :const:`S`     | Make ``.`` match any character, including  |
 |                                 | newlines                                   |
 +---------------------------------+--------------------------------------------+
@@ -535,11 +539,7 @@ of each one.
 |                                 | ``$``                                      |
 +---------------------------------+--------------------------------------------+
 | :const:`VERBOSE`, :const:`X`    | Enable verbose REs, which can be organized |
-|                                 | more cleanly and understandably.           |
-+---------------------------------+--------------------------------------------+
-| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
-|                                 | ``\s`` and ``\d`` match only on ASCII      |
-|                                 | characters with the respective property.   |
+| (for 'extended')                | more cleanly and understandably.           |
 +---------------------------------+--------------------------------------------+
 
 
@@ -558,7 +558,8 @@ of each one.
           LOCALE
    :noindex:
 
-   Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale.
+   Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale
+   instead of the Unicode database.
 
    Locales are a feature of the C library intended to help in writing programs that
    take account of language differences. For example, if you're processing French
@@ -851,11 +852,10 @@ keep track of the group numbers. There are two features which help with this
 problem. Both of them use a common syntax for regular expression extensions, so
 we'll look at that first.
 
-Perl 5 added several additional features to standard regular expressions, and
-the Python :mod:`re` module supports most of them. It would have been
-difficult to choose new single-keystroke metacharacters or new special sequences
-beginning with ``\`` to represent the new features without making Perl's regular
-expressions confusingly different from standard REs. If you chose ``&`` as a
+Perl 5 is well-known for its powerful additions to standard regular expressions.
+For these new features the Perl developers couldn't choose new single-keystroke metacharacters
+or new special sequences beginning with ``\`` without making Perl's regular
+expressions confusingly different from standard REs. If they chose ``&`` as a
 new metacharacter, for example, old expressions would be assuming that ``&`` was
 a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.
 
@@ -867,22 +867,15 @@ what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
 assertion) and ``(?:foo)`` is something else (a non-capturing group containing
 the subexpression ``foo``).
 
-Python adds an extension syntax to Perl's extension syntax. If the first
-character after the question mark is a ``P``, you know that it's an extension
-that's specific to Python. Currently there are two such extensions:
-``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to
-a named group. If future versions of Perl 5 add similar features using a
-different syntax, the :mod:`re` module will be changed to support the new
-syntax, while preserving the Python-specific syntax for compatibility's sake.
+Python supports several of Perl's extensions and adds an extension
+syntax to Perl's extension syntax. If the first character after the
+question mark is a ``P``, you know that it's an extension that's
+specific to Python.
 
-Now that we've looked at the general extension syntax, we can return to the
-features that simplify working with groups in complex REs. Since groups are
-numbered from left to right and a complex expression may use many groups, it can
-become difficult to keep track of the correct numbering. Modifying such a
-complex RE is annoying, too: insert a new group near the beginning and you
-change the numbers of everything that follows it.
+Now that we've looked at the general extension syntax, we can return
+to the features that simplify working with groups in complex REs.
 
-Sometimes you'll want to use a group to collect a part of a regular expression,
+Sometimes you'll want to use a group to denote a part of a regular expression,
 but aren't interested in retrieving the group's contents. You can make this fact
 explicit by using a non-capturing group: ``(?:...)``, where you can replace the
 ``...`` with any other regular expression. ::
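The non-capturing group described at the end of this hunk can be compared with an ordinary capturing group. The sketch below uses the same small pattern as the howto; the variable names are assumptions.

```python
import re

# Capturing group: the last repetition's text is stored as group 1.
capturing = re.match(r"([abc])+", "abc")
print(capturing.groups())  # ('c',)

# Non-capturing group: same matching behaviour, but nothing is stored.
non_capturing = re.match(r"(?:[abc])+", "abc")
print(non_capturing.groups())  # ()

# Both match the same overall text.
print(capturing.group() == non_capturing.group())  # True
```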
@@ -908,7 +901,7 @@ numbers, groups can be referenced by a name.
 
 The syntax for a named group is one of the Python-specific extensions:
 ``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups
-also behave exactly like capturing groups, and additionally associate a name
+behave exactly like capturing groups, and additionally associate a name
 with a group. The :ref:`match object <match-objects>` methods that deal with
 capturing groups all accept either integers that refer to the group by number
 or strings that contain the desired group's name. Named groups are still
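To make the named-group paragraph concrete, the following sketch shows that a named group can be retrieved by name or by number. The group name ``word`` and the input string follow the howto's own example; the variable names are illustrative.

```python
import re

p = re.compile(r"(?P<word>\b\w+\b)")
m = p.search("(((( Lots of punctuation )))")

# A named group is still a numbered capturing group.
by_name = m.group("word")
by_number = m.group(1)
print(by_name, by_number)  # Lots Lots

# groupdict() maps every group name to the text it matched.
names = m.groupdict()
print(names)  # {'word': 'Lots'}
```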
@@ -975,9 +968,10 @@ The pattern to match this is quite simple:
 ``.*[.].*$``
 
 Notice that the ``.`` needs to be treated specially because it's a
-metacharacter; I've put it inside a character class. Also notice the trailing
-``$``; this is added to ensure that all the rest of the string must be included
-in the extension. This regular expression matches ``foo.bar`` and
+metacharacter, so it's inside a character class to only match that
+specific character. Also notice the trailing ``$``; this is added to
+ensure that all the rest of the string must be included in the
+extension. This regular expression matches ``foo.bar`` and
 ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``.
 
 Now, consider complicating the problem a bit; what if you want to match
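The filename pattern discussed in this hunk can be exercised against the four names the text lists. The extra name without a dot (``README``) is an assumption added to show a non-match.

```python
import re

# '[.]' is a character class holding a literal dot, so it matches only '.'.
p = re.compile(r".*[.].*$")

filenames = ["foo.bar", "autoexec.bat", "sendmail.cf", "printers.conf", "README"]
matched = [name for name in filenames if p.match(name)]
print(matched)  # ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'printers.conf']
```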
@@ -1051,7 +1045,7 @@ Splitting Strings
 The :meth:`split` method of a pattern splits a string apart
 wherever the RE matches, returning a list of the pieces. It's similar to the
 :meth:`split` method of strings but provides much more generality in the
-delimiters that you can split by; :meth:`split` only supports splitting by
+delimiters that you can split by; string :meth:`split` only supports splitting by
 whitespace or by a fixed string. As you'd expect, there's a module-level
 :func:`re.split` function, too.
 
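The distinction this hunk clarifies, between the pattern ``split()`` method and the string ``split()`` method, can be sketched as follows. The sentence is the one used in the howto; the variable names are assumptions.

```python
import re

sentence = "This is a test, short and sweet, of split()."

# Pattern split(): the delimiter is any run of non-word characters.
by_pattern = re.compile(r"\W+").split(sentence)
print(by_pattern)
# ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']

# String split(): only a fixed delimiter string (or whitespace) is supported.
by_string = sentence.split(", ")
print(by_string)
# ['This is a test', 'short and sweet', 'of split().']
```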
@@ -1106,7 +1100,6 @@ Another common task is to find all the matches for a pattern, and replace them
 with a different string. The :meth:`sub` method takes a replacement value,
 which can be either a string or a function, and the string to be processed.
 
-
 .. method:: .sub(replacement, string[, count=0])
    :noindex:
 
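For context on the ``.sub(replacement, string[, count=0])`` signature shown in this hunk, here is a minimal sketch with a string replacement, a ``count`` limit, and a function replacement. The helper ``shout`` is hypothetical and exists only for this example.

```python
import re

p = re.compile(r"(blue|white|red)")
text = "blue socks and red shoes"

# String replacement: every match is replaced.
replaced_all = p.sub("colour", text)
print(replaced_all)  # 'colour socks and colour shoes'

# count=1 limits the number of replacements.
replaced_once = p.sub("colour", text, count=1)
print(replaced_once)  # 'colour socks and red shoes'

# Function replacement: called with each match object, returns the new text.
def shout(match):
    """Hypothetical helper: upper-case the matched colour name."""
    return match.group().upper()

replaced_func = p.sub(shout, text)
print(replaced_func)  # 'BLUE socks and RED shoes'
```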
@@ -1362,4 +1355,3 @@ and doesn't contain any Python material at all, so it won't be useful as a
 reference for programming in Python. (The first edition covered Python's
 now-removed :mod:`regex` module, which won't help you much.) Consider checking
 it out from your library.
-