bpo-33409: Clarify PEP 538/540 relationship (GH-7534)

While locale coercion and UTF-8 mode turned out to
be complementary ideas rather than competing ones,
it isn't immediately obvious why it's useful to
have both, or how they interact at runtime.

This updates both the Python 3.7 What's New doc
and the PYTHONCOERCECLOCALE and PYTHONUTF8
documentation in an attempt to clarify that
relationship:

- in the respective What's New sections, add a closing paragraph
  explaining which problem each one solves, and pointing to the
  other PEP's section for the specific aspects it relies on the other
  PEP to solve
- use "locale-aware mode" as a more descriptive term for the
  default non-UTF-8 mode
- improve wording consistency between the PYTHONCOERCECLOCALE
  and PYTHONUTF8 docs when they cover the same thing (mostly
  related to legacy locale detection and setting the standard
  stream error handler)
- improve the description of the locale coercion trigger conditions
  (including pointing out that setting LC_ALL turns off locale coercion)
- port the full description of the UTF-8 mode behaviour changes
  from PEP 540 into the PYTHONUTF8 documentation
- be explicit that PYTHONIOENCODING still overrides the settings
  for the standard streams
- mention concrete examples of things that do and don't get their
  text encoding assumptions adjusted by the two text encoding
  assumption override techniques
Nick Coghlan 2018-06-09 16:54:08 +10:00 committed by GitHub
parent 4acc140f8d
commit 1bcb8a6368
3 changed files with 106 additions and 24 deletions

@ -438,8 +438,10 @@ Miscellaneous options
* Set the :attr:`~sys.flags.dev_mode` attribute of :attr:`sys.flags` to
``True``
* ``-X utf8`` enables the UTF-8 mode, whereas ``-X utf8=0`` disables the
UTF-8 mode.
* ``-X utf8`` enables UTF-8 mode for operating system interfaces, overriding
the default locale-aware mode. ``-X utf8=0`` explicitly disables UTF-8
mode (even when it would otherwise activate automatically).
See :envvar:`PYTHONUTF8` for more details.
It also allows passing arbitrary values and retrieving them through the
:data:`sys._xoptions` dictionary.
@ -789,14 +791,16 @@ conflict.
.. envvar:: PYTHONCOERCECLOCALE
If set to the value ``0``, causes the main Python command line application
to skip coercing the legacy ASCII-based C locale to a more capable UTF-8
based alternative.
to skip coercing the legacy ASCII-based C and POSIX locales to a more
capable UTF-8 based alternative.
If this variable is *not* set, or is set to a value other than ``0``, and
the current locale reported for the ``LC_CTYPE`` category is the default
``C`` locale, then the Python CLI will attempt to configure the following
locales for the ``LC_CTYPE`` category in the order listed before loading the
interpreter runtime:
If this variable is *not* set (or is set to a value other than ``0``), the
``LC_ALL`` locale override environment variable is also not set, and the
current locale reported for the ``LC_CTYPE`` category is either the default
``C`` locale, or else the explicitly ASCII-based ``POSIX`` locale, then the
Python CLI will attempt to configure the following locales for the
``LC_CTYPE`` category in the order listed before loading the interpreter
runtime:
* ``C.UTF-8``
* ``C.utf8``
@ -804,21 +808,32 @@ conflict.
If setting one of these locale categories succeeds, then the ``LC_CTYPE``
environment variable will also be set accordingly in the current process
environment before the Python runtime is initialized. This ensures the
updated setting is seen in subprocesses, as well as in operations that
query the environment rather than the current C locale (such as Python's
own :func:`locale.getdefaultlocale`).
environment before the Python runtime is initialized. This ensures that in
addition to being seen by both the interpreter itself and other locale-aware
components running in the same process (such as the GNU ``readline``
library), the updated setting is also seen in subprocesses (regardless of
whether or not those processes are running a Python interpreter), as well as
in operations that query the environment rather than the current C locale
(such as Python's own :func:`locale.getdefaultlocale`).
Configuring one of these locales (either explicitly or via the above
implicit locale coercion) will automatically set the error handler for
:data:`sys.stdin` and :data:`sys.stdout` to ``surrogateescape``. This
behavior can be overridden using :envvar:`PYTHONIOENCODING` as usual.
implicit locale coercion) automatically enables the ``surrogateescape``
:ref:`error handler <error-handlers>` for :data:`sys.stdin` and
:data:`sys.stdout` (:data:`sys.stderr` continues to use ``backslashreplace``
as it does in any other locale). This stream handling behavior can be
overridden using :envvar:`PYTHONIOENCODING` as usual.
For debugging purposes, setting ``PYTHONCOERCECLOCALE=warn`` will cause
Python to emit warning messages on ``stderr`` if either the locale coercion
activates, or else if a locale that *would* have triggered coercion is
still active when the Python runtime is initialized.
Also note that even when locale coercion is disabled, or when it fails to
find a suitable target locale, :envvar:`PYTHONUTF8` will still activate by
default in legacy ASCII-based locales. Both features must be disabled in
order to force the interpreter to use ``ASCII`` instead of ``UTF-8`` for
system interfaces.
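The trigger conditions described above can be summarised in a few lines of Python. This is only a sketch of the documented rules, not the interpreter's actual implementation: ``coercion_is_attempted`` and ``LEGACY_LOCALES`` are hypothetical names, and treating an empty ``LC_ALL`` value as "not set" is an assumption.

```python
# Hypothetical model of the documented locale coercion trigger conditions.
# It does not call into the interpreter; it only restates the rules above.
LEGACY_LOCALES = frozenset({"C", "POSIX"})


def coercion_is_attempted(environ, lc_ctype):
    """Return True if the documented conditions for locale coercion hold.

    environ  -- mapping of environment variables (for example os.environ)
    lc_ctype -- name of the locale currently reported for LC_CTYPE
    """
    # PYTHONCOERCECLOCALE=0 is the only value that turns coercion off;
    # any other value (including "warn") leaves it enabled.
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return False
    # Setting LC_ALL turns coercion off (assumption: empty counts as unset).
    if environ.get("LC_ALL"):
        return False
    # Only the legacy ASCII-based locales are candidates for coercion.
    return lc_ctype in LEGACY_LOCALES


if __name__ == "__main__":
    print(coercion_is_attempted({}, "C"))
    print(coercion_is_attempted({"LC_ALL": "C"}, "C"))
```

Even when this function returns ``True``, coercion can still fail if none of the target locales is available on the running system.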
Availability: \*nix
.. versionadded:: 3.7
@ -834,10 +849,56 @@ conflict.
.. envvar:: PYTHONUTF8
If set to ``1``, enable the UTF-8 mode. If set to ``0``, disable the UTF-8
mode. Any other non-empty string cause an error.
If set to ``1``, enables the interpreter's UTF-8 mode, where ``UTF-8`` is
used as the text encoding for system interfaces, regardless of the
current locale setting.
This means that:
* :func:`sys.getfilesystemencoding()` returns ``'UTF-8'`` (the locale
encoding is ignored).
* :func:`locale.getpreferredencoding()` returns ``'UTF-8'`` (the locale
encoding is ignored, and the function's ``do_setlocale`` parameter has no
effect).
* :data:`sys.stdin`, :data:`sys.stdout`, and :data:`sys.stderr` all use
UTF-8 as their text encoding, with the ``surrogateescape``
:ref:`error handler <error-handlers>` being enabled for :data:`sys.stdin`
and :data:`sys.stdout` (:data:`sys.stderr` continues to use
``backslashreplace`` as it does in the default locale-aware mode).
As a consequence of the changes in those lower level APIs, other higher
level APIs also exhibit different default behaviours:
* Command line arguments, environment variables and filenames are decoded
to text using the UTF-8 encoding.
* :func:`os.fsdecode()` and :func:`os.fsencode()` use the UTF-8 encoding.
* :func:`open()`, :func:`io.open()`, and :func:`codecs.open()` use the UTF-8
encoding by default. However, they still use the strict error handler by
default so that attempting to open a binary file in text mode is likely
to raise an exception rather than producing nonsense data.
Note that the standard stream settings in UTF-8 mode can be overridden by
:envvar:`PYTHONIOENCODING` (just as they can be in the default locale-aware
mode).
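One way to observe these settings is to start a child interpreter with ``-X utf8`` and have it report its own configuration. The following is a minimal sketch rather than part of the documentation change itself: ``PROBE`` and ``probe_child`` are hypothetical names, and the exact spelling of the reported encoding names (for example ``'UTF-8'`` versus ``'utf-8'``) may differ between Python versions.

```python
import json
import os
import subprocess
import sys

# Script executed in the child interpreter; it reports the settings
# discussed above as a single JSON object on stdout.
PROBE = """
import json, locale, sys
print(json.dumps({
    "utf8_mode": sys.flags.utf8_mode,
    "filesystem": sys.getfilesystemencoding(),
    "preferred": locale.getpreferredencoding(False),
    "stdout": [sys.stdout.encoding, sys.stdout.errors],
    "stderr": [sys.stderr.encoding, sys.stderr.errors],
}))
"""


def probe_child(*interpreter_args, **env_overrides):
    """Run PROBE in a child interpreter and return the decoded report."""
    # Drop inherited settings that would mask the behaviour being shown.
    env = {key: value for key, value in os.environ.items()
           if key not in ("PYTHONUTF8", "PYTHONIOENCODING")}
    env.update(env_overrides)
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", PROBE],
        env=env, stdout=subprocess.PIPE, check=True,
    )
    # json.dumps emits ASCII-only output by default.
    return json.loads(result.stdout.decode("ascii"))


if __name__ == "__main__":
    # UTF-8 mode: surrogateescape on stdout, backslashreplace on stderr.
    print(probe_child("-X", "utf8"))
    # PYTHONIOENCODING still overrides the standard stream settings.
    print(probe_child("-X", "utf8", PYTHONIOENCODING="latin-1:strict"))
```

The second call illustrates that ``PYTHONIOENCODING`` takes precedence for the standard streams even when UTF-8 mode is active.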
If set to ``0``, the interpreter runs in its default locale-aware mode.
Setting any other non-empty string causes an error during interpreter
initialisation.
If this environment variable is not set at all, then the interpreter defaults
to using the current locale settings, *unless* the current locale is
identified as a legacy ASCII-based locale
(as described for :envvar:`PYTHONCOERCECLOCALE`), and locale coercion is
either disabled or fails. In such legacy locales, the interpreter will
default to enabling UTF-8 mode unless explicitly instructed not to do so.
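The handling of the three cases (``1``, ``0``, and unset) can be restated as a small decision function. This is a sketch of the rules as documented here, not of the interpreter's internals: ``utf8_mode_enabled`` and ``LEGACY_LOCALES`` are hypothetical names, the ``coercion_succeeded`` flag stands in for the outcome of locale coercion, and the ``-X utf8`` command line option is deliberately left out.

```python
# Hypothetical model of how PYTHONUTF8 is described as being interpreted.
LEGACY_LOCALES = frozenset({"C", "POSIX"})


def utf8_mode_enabled(environ, lc_ctype, coercion_succeeded):
    """Return True if UTF-8 mode would be active under the documented rules.

    environ            -- mapping of environment variables
    lc_ctype           -- LC_CTYPE locale name at startup, before coercion
    coercion_succeeded -- whether locale coercion switched to a UTF-8 locale
    """
    setting = environ.get("PYTHONUTF8", "")
    if setting == "1":
        return True
    if setting == "0":
        return False
    if setting:
        # Any other non-empty string is reported as an initialisation error.
        raise ValueError("invalid PYTHONUTF8 value: {!r}".format(setting))
    # Unset: fall back to UTF-8 mode only in a legacy locale that
    # coercion did not manage to replace.
    return lc_ctype in LEGACY_LOCALES and not coercion_succeeded


if __name__ == "__main__":
    print(utf8_mode_enabled({}, "C", coercion_succeeded=False))
    print(utf8_mode_enabled({"PYTHONUTF8": "0"}, "C", coercion_succeeded=False))
```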
Also available as the :option:`-X` ``utf8`` option.
Availability: \*nix
.. versionadded:: 3.7
See :pep:`540` for more details.
Debug-mode variables

@ -97,9 +97,10 @@ Significant improvements in the standard library:
CPython implementation improvements:
* Avoiding the use of ASCII as a default text encoding:
* :ref:`PEP 538 <whatsnew37-pep538>`, legacy C locale coercion
* :ref:`PEP 540 <whatsnew37-pep540>`, forced UTF-8 runtime mode
* :ref:`PEP 552 <whatsnew37-pep552>`, deterministic .pycs
* :ref:`PEP 538 <whatsnew37-pep538>`, legacy C locale coercion
* :ref:`PEP 540 <whatsnew37-pep540>`, forced UTF-8 runtime mode
* :ref:`the new development runtime mode <whatsnew37-devmode>`
* :ref:`PEP 565 <whatsnew37-pep565>`, improved :exc:`DeprecationWarning`
handling
@ -184,7 +185,8 @@ PEP 538: Legacy C Locale Coercion
An ongoing challenge within the Python 3 series has been determining a sensible
default strategy for handling the "7-bit ASCII" text encoding assumption
currently implied by the use of the default C locale on non-Windows platforms.
currently implied by the use of the default C or POSIX locale on non-Windows
platforms.
:pep:`538` updates the default interpreter command line interface to
automatically coerce that locale to an available UTF-8 based locale as
@ -205,10 +207,18 @@ continues to be ``backslashreplace``, regardless of locale.
Locale coercion is silent by default, but to assist in debugging potentially
locale related integration problems, explicit warnings (emitted directly on
:data:`~sys.stderr` can be requested by setting ``PYTHONCOERCECLOCALE=warn``.
:data:`~sys.stderr`) can be requested by setting ``PYTHONCOERCECLOCALE=warn``.
This setting will also cause the Python runtime to emit a warning if the
legacy C locale remains active when the core interpreter is initialized.
While :pep:`538`'s locale coercion has the benefit of also affecting extension
modules (such as GNU ``readline``), as well as child processes (including those
running non-Python applications and older versions of Python), it has the
downside of requiring that a suitable target locale be present on the running
system. To better handle the case where no suitable target locale is available
(as occurs on RHEL/CentOS 7, for example), Python 3.7 also implements
:ref:`whatsnew37-pep540`.
.. seealso::
:pep:`538` -- Coercing the legacy C locale to a UTF-8 based locale
@ -231,8 +241,17 @@ The forced UTF-8 mode can be used to change the text handling behavior in
an embedded Python interpreter without changing the locale settings of
an embedding application.
The UTF-8 mode is enabled by default when the locale is "C". See
:ref:`whatsnew37-pep538` for details.
While :pep:`540`'s UTF-8 mode has the benefit of working regardless of which
locales are available on the running system, it has the downside of having no
effect on extension modules (such as GNU ``readline``), child processes running
non-Python applications, and child processes running older versions of Python.
To reduce the risk of corrupting text data when communicating with such
components, Python 3.7 also implements :ref:`whatsnew37-pep538`.
The UTF-8 mode is enabled by default when the locale is ``C`` or ``POSIX``, and
the :pep:`538` locale coercion feature fails to change it to a UTF-8 based
alternative (whether that failure is due to ``PYTHONCOERCECLOCALE=0`` being set,
``LC_ALL`` being set, or the lack of a suitable target locale).
.. seealso::

@ -0,0 +1,2 @@
Clarified the relationship between PEP 538's PYTHONCOERCECLOCALE and PEP
540's PYTHONUTF8 mode.