diff --git a/Doc/using/cmdline.rst b/Doc/using/cmdline.rst index e72dea90758..c6bb0be6bc4 100644 --- a/Doc/using/cmdline.rst +++ b/Doc/using/cmdline.rst @@ -438,8 +438,10 @@ Miscellaneous options * Set the :attr:`~sys.flags.dev_mode` attribute of :attr:`sys.flags` to ``True`` - * ``-X utf8`` enables the UTF-8 mode, whereas ``-X utf8=0`` disables the - UTF-8 mode. + * ``-X utf8`` enables UTF-8 mode for operating system interfaces, overriding + the default locale-aware mode. ``-X utf8=0`` explicitly disables UTF-8 + mode (even when it would otherwise activate automatically). + See :envvar:`PYTHONUTF8` for more details. It also allows passing arbitrary values and retrieving them through the :data:`sys._xoptions` dictionary. @@ -789,14 +791,16 @@ conflict. .. envvar:: PYTHONCOERCECLOCALE If set to the value ``0``, causes the main Python command line application - to skip coercing the legacy ASCII-based C locale to a more capable UTF-8 - based alternative. + to skip coercing the legacy ASCII-based C and POSIX locales to a more + capable UTF-8 based alternative. - If this variable is *not* set, or is set to a value other than ``0``, and - the current locale reported for the ``LC_CTYPE`` category is the default - ``C`` locale, then the Python CLI will attempt to configure the following - locales for the ``LC_CTYPE`` category in the order listed before loading the - interpreter runtime: + If this variable is *not* set (or is set to a value other than ``0``), the + ``LC_ALL`` locale override environment variable is also not set, and the + current locale reported for the ``LC_CTYPE`` category is either the default + ``C`` locale, or else the explicitly ASCII-based ``POSIX`` locale, then the + Python CLI will attempt to configure the following locales for the + ``LC_CTYPE`` category in the order listed before loading the interpreter + runtime: * ``C.UTF-8`` * ``C.utf8`` @@ -804,21 +808,32 @@ conflict. 
If setting one of these locale categories succeeds, then the ``LC_CTYPE`` environment variable will also be set accordingly in the current process - environment before the Python runtime is initialized. This ensures the - updated setting is seen in subprocesses, as well as in operations that - query the environment rather than the current C locale (such as Python's - own :func:`locale.getdefaultlocale`). + environment before the Python runtime is initialized. This ensures that in + addition to being seen by both the interpreter itself and other locale-aware + components running in the same process (such as the GNU ``readline`` + library), the updated setting is also seen in subprocesses (regardless of + whether or not those processes are running a Python interpreter), as well as + in operations that query the environment rather than the current C locale + (such as Python's own :func:`locale.getdefaultlocale`). Configuring one of these locales (either explicitly or via the above - implicit locale coercion) will automatically set the error handler for - :data:`sys.stdin` and :data:`sys.stdout` to ``surrogateescape``. This - behavior can be overridden using :envvar:`PYTHONIOENCODING` as usual. + implicit locale coercion) automatically enables the ``surrogateescape`` + :ref:`error handler ` for :data:`sys.stdin` and + :data:`sys.stdout` (:data:`sys.stderr` continues to use ``backslashreplace`` + as it does in any other locale). This stream handling behavior can be + overridden using :envvar:`PYTHONIOENCODING` as usual. For debugging purposes, setting ``PYTHONCOERCECLOCALE=warn`` will cause Python to emit warning messages on ``stderr`` if either the locale coercion activates, or else if a locale that *would* have triggered coercion is still active when the Python runtime is initialized. 
+ Also note that even when locale coercion is disabled, or when it fails to + find a suitable target locale, :envvar:`PYTHONUTF8` will still activate by + default in legacy ASCII-based locales. Both features must be disabled in + order to force the interpreter to use ``ASCII`` instead of ``UTF-8`` for + system interfaces. + Availability: \*nix .. versionadded:: 3.7 @@ -834,10 +849,56 @@ conflict. .. envvar:: PYTHONUTF8 - If set to ``1``, enable the UTF-8 mode. If set to ``0``, disable the UTF-8 - mode. Any other non-empty string cause an error. + If set to ``1``, enables the interpreter's UTF-8 mode, where ``UTF-8`` is + used as the text encoding for system interfaces, regardless of the + current locale setting. + + This means that: + + * :func:`sys.getfilesystemencoding()` returns ``'UTF-8'`` (the locale + encoding is ignored). + * :func:`locale.getpreferredencoding()` returns ``'UTF-8'`` (the locale + encoding is ignored, and the function's ``do_setlocale`` parameter has no + effect). + * :data:`sys.stdin`, :data:`sys.stdout`, and :data:`sys.stderr` all use + UTF-8 as their text encoding, with the ``surrogateescape`` + :ref:`error handler ` being enabled for :data:`sys.stdin` + and :data:`sys.stdout` (:data:`sys.stderr` continues to use + ``backslashreplace`` as it does in the default locale-aware mode) + + As a consequence of the changes in those lower level APIs, other higher + level APIs also exhibit different default behaviours: + + * Command line arguments, environment variables and filenames are decoded + to text using the UTF-8 encoding. + * :func:`os.fsdecode()` and :func:`os.fsencode()` use the UTF-8 encoding. + * :func:`open()`, :func:`io.open()`, and :func:`codecs.open()` use the UTF-8 + encoding by default. However, they still use the strict error handler by + default so that attempting to open a binary file in text mode is likely + to raise an exception rather than producing nonsense data. 
+ + Note that the standard stream settings in UTF-8 mode can be overridden by + :envvar:`PYTHONIOENCODING` (just as they can be in the default locale-aware + mode). + + If set to ``0``, the interpreter runs in its default locale-aware mode. + + Setting any other non-empty string causes an error during interpreter + initialisation. + + If this environment variable is not set at all, then the interpreter defaults + to using the current locale settings, *unless* the current locale is + identified as a legacy ASCII-based locale + (as described for :envvar:`PYTHONCOERCECLOCALE`), and locale coercion is + either disabled or fails. In such legacy locales, the interpreter will + default to enabling UTF-8 mode unless explicitly instructed not to do so. + + Also available as the :option:`-X` ``utf8`` option. + + Availability: \*nix .. versionadded:: 3.7 + See :pep:`540` for more details. Debug-mode variables diff --git a/Doc/whatsnew/3.7.rst b/Doc/whatsnew/3.7.rst index 8a3afdf1f89..762d84a89bf 100644 --- a/Doc/whatsnew/3.7.rst +++ b/Doc/whatsnew/3.7.rst @@ -97,9 +97,10 @@ Significant improvements in the standard library: CPython implementation improvements: +* Avoiding the use of ASCII as a default text encoding: + * :ref:`PEP 538 `, legacy C locale coercion + * :ref:`PEP 540 `, forced UTF-8 runtime mode * :ref:`PEP 552 `, deterministic .pycs -* :ref:`PEP 538 `, legacy C locale coercion -* :ref:`PEP 540 `, forced UTF-8 runtime mode * :ref:`the new development runtime mode ` * :ref:`PEP 565 `, improved :exc:`DeprecationWarning` handling @@ -184,7 +185,8 @@ PEP 538: Legacy C Locale Coercion An ongoing challenge within the Python 3 series has been determining a sensible default strategy for handling the "7-bit ASCII" text encoding assumption -currently implied by the use of the default C locale on non-Windows platforms. +currently implied by the use of the default C or POSIX locale on non-Windows +platforms.
:pep:`538` updates the default interpreter command line interface to automatically coerce that locale to an available UTF-8 based locale as @@ -205,10 +207,18 @@ continues to be ``backslashreplace``, regardless of locale. Locale coercion is silent by default, but to assist in debugging potentially locale related integration problems, explicit warnings (emitted directly on -:data:`~sys.stderr` can be requested by setting ``PYTHONCOERCECLOCALE=warn``. +:data:`~sys.stderr`) can be requested by setting ``PYTHONCOERCECLOCALE=warn``. This setting will also cause the Python runtime to emit a warning if the legacy C locale remains active when the core interpreter is initialized. +While :pep:`538`'s locale coercion has the benefit of also affecting extension +modules (such as GNU ``readline``), as well as child processes (including those +running non-Python applications and older versions of Python), it has the +downside of requiring that a suitable target locale be present on the running +system. To better handle the case where no suitable target locale is available +(as occurs on RHEL/CentOS 7, for example), Python 3.7 also implements +:ref:`whatsnew37-pep540`. + .. seealso:: :pep:`538` -- Coercing the legacy C locale to a UTF-8 based locale @@ -231,8 +241,17 @@ The forced UTF-8 mode can be used to change the text handling behavior in an embedded Python interpreter without changing the locale settings of an embedding application. -The UTF-8 mode is enabled by default when the locale is "C". See -:ref:`whatsnew37-pep538` for details. +While :pep:`540`'s UTF-8 mode has the benefit of working regardless of which +locales are available on the running system, it has the downside of having no +effect on extension modules (such as GNU ``readline``), child processes running +non-Python applications, and child processes running older versions of Python. 
+To reduce the risk of corrupting text data when communicating with such +components, Python 3.7 also implements :ref:`whatsnew37-pep538`. + +The UTF-8 mode is enabled by default when the locale is ``C`` or ``POSIX``, and +the :pep:`538` locale coercion feature fails to change it to a UTF-8 based +alternative (whether that failure is due to ``PYTHONCOERCECLOCALE=0`` being set, +``LC_ALL`` being set, or the lack of a suitable target locale). .. seealso:: diff --git a/Misc/NEWS.d/next/Documentation/2018-06-08-23-46-01.bpo-33409.r4z9MM.rst b/Misc/NEWS.d/next/Documentation/2018-06-08-23-46-01.bpo-33409.r4z9MM.rst new file mode 100644 index 00000000000..5b1a018df55 --- /dev/null +++ b/Misc/NEWS.d/next/Documentation/2018-06-08-23-46-01.bpo-33409.r4z9MM.rst @@ -0,0 +1,2 @@ +Clarified the relationship between PEP 538's PYTHONCOERCECLOCALE and PEP +540's PYTHONUTF8 mode.
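Reviewer note: the behaviour documented for ``-X utf8`` and ``-X utf8=0`` can be spot-checked with a small script. The following is a minimal sketch, not part of the patch; the names ``PROBE`` and ``probe`` are hypothetical helpers introduced here for illustration. It assumes Python 3.7 or later, and it launches a child interpreter because UTF-8 mode is fixed at interpreter startup.

```python
import subprocess
import sys

# Child script: print the settings that the patched documentation says
# UTF-8 mode affects. All values are space-free, so split() is safe.
PROBE = (
    "import sys, locale; "
    "print(sys.flags.utf8_mode, "
    "sys.getfilesystemencoding(), "
    "locale.getpreferredencoding(), "
    "sys.stdin.errors, sys.stdout.errors, sys.stderr.errors)"
)


def probe(xoption):
    """Run a child interpreter with ``-X <xoption>`` and return its report.

    The result is a list of strings: utf8_mode flag, filesystem encoding,
    preferred encoding, then the stdin/stdout/stderr error handlers.
    """
    result = subprocess.run(
        [sys.executable, "-X", xoption, "-c", PROBE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    for option in ("utf8", "utf8=0"):
        print(option, "->", probe(option))
```

The reported error handlers for the standard streams may differ from the documented defaults if :envvar:`PYTHONIOENCODING` is set in the calling environment, so the sketch prints them rather than asserting on them.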