bpo-42236: Use UTF-8 encoding if nl_langinfo(CODESET) fails (GH-23086)
If the nl_langinfo(CODESET) function returns an empty string, Python now uses UTF-8 as the filesystem encoding. In May 2010 (commitb744ba1d14
), I modified Python to log a warning and use UTF-8 as the filesystem encoding (instead of None) if nl_langinfo(CODESET) returns an empty string. In August 2020 (commit94908bbc15
), I modified Python startup to fail with a fatal error and a specific error message if nl_langinfo(CODESET) returns an empty string. The intent was to prevent guessing the encoding and also investigate user configuration where this case happens. In 10 years (2010 to 2020), I saw zero user report about the error message related to nl_langinfo(CODESET) returning an empty string. Today, UTF-8 became the defacto standard and it's safe to make the assumption that the user expects UTF-8. For example, nl_langinfo(CODESET) can return an empty string on macOS if the LC_CTYPE locale is not supported, and UTF-8 is the default encoding on macOS. While this change is likely to not affect anyone in practice, it should make UTF-8 lover happy ;-) Rewrite also the documentation explaining how Python selects the filesystem encoding and error handler.
This commit is contained in:
parent
82458b6cdb
commit
e662c398d8
|
@ -253,10 +253,16 @@ PyPreConfig
|
||||||
|
|
||||||
See :c:member:`PyConfig.isolated`.
|
See :c:member:`PyConfig.isolated`.
|
||||||
|
|
||||||
.. c:member:: int legacy_windows_fs_encoding (Windows only)
|
.. c:member:: int legacy_windows_fs_encoding
|
||||||
|
|
||||||
If non-zero, disable UTF-8 Mode, set the Python filesystem encoding to
|
If non-zero:
|
||||||
``mbcs``, set the filesystem error handler to ``replace``.
|
|
||||||
|
* Set :c:member:`PyPreConfig.utf8_mode` to ``0``,
|
||||||
|
* Set :c:member:`PyConfig.filesystem_encoding` to ``"mbcs"``,
|
||||||
|
* Set :c:member:`PyConfig.filesystem_errors` to ``"replace"``.
|
||||||
|
|
||||||
|
Initialized the from :envvar:`PYTHONLEGACYWINDOWSFSENCODING` environment
|
||||||
|
variable value.
|
||||||
|
|
||||||
Only available on Windows. ``#ifdef MS_WINDOWS`` macro can be used for
|
Only available on Windows. ``#ifdef MS_WINDOWS`` macro can be used for
|
||||||
Windows specific code.
|
Windows specific code.
|
||||||
|
@ -499,11 +505,47 @@ PyConfig
|
||||||
|
|
||||||
.. c:member:: wchar_t* filesystem_encoding
|
.. c:member:: wchar_t* filesystem_encoding
|
||||||
|
|
||||||
Filesystem encoding, :func:`sys.getfilesystemencoding`.
|
Filesystem encoding: :func:`sys.getfilesystemencoding`.
|
||||||
|
|
||||||
|
On macOS, Android and VxWorks: use ``"utf-8"`` by default.
|
||||||
|
|
||||||
|
On Windows: use ``"utf-8"`` by default, or ``"mbcs"`` if
|
||||||
|
:c:member:`~PyPreConfig.legacy_windows_fs_encoding` of
|
||||||
|
:c:type:`PyPreConfig` is non-zero.
|
||||||
|
|
||||||
|
Default encoding on other platforms:
|
||||||
|
|
||||||
|
* ``"utf-8"`` if :c:member:`PyPreConfig.utf8_mode` is non-zero.
|
||||||
|
* ``"ascii"`` if Python detects that ``nl_langinfo(CODESET)`` announces
|
||||||
|
the ASCII encoding (or Roman8 encoding on HP-UX), whereas the
|
||||||
|
``mbstowcs()`` function decodes from a different encoding (usually
|
||||||
|
Latin1).
|
||||||
|
* ``"utf-8"`` if ``nl_langinfo(CODESET)`` returns an empty string.
|
||||||
|
* Otherwise, use the LC_CTYPE locale encoding:
|
||||||
|
``nl_langinfo(CODESET)`` result.
|
||||||
|
|
||||||
|
At Python statup, the encoding name is normalized to the Python codec
|
||||||
|
name. For example, ``"ANSI_X3.4-1968"`` is replaced with ``"ascii"``.
|
||||||
|
|
||||||
|
See also the :c:member:`~PyConfig.filesystem_errors` member.
|
||||||
|
|
||||||
.. c:member:: wchar_t* filesystem_errors
|
.. c:member:: wchar_t* filesystem_errors
|
||||||
|
|
||||||
Filesystem encoding errors, :func:`sys.getfilesystemencodeerrors`.
|
Filesystem error handler: :func:`sys.getfilesystemencodeerrors`.
|
||||||
|
|
||||||
|
On Windows: use ``"surrogatepass"`` by default, or ``"replace"`` if
|
||||||
|
:c:member:`~PyPreConfig.legacy_windows_fs_encoding` of
|
||||||
|
:c:type:`PyPreConfig` is non-zero.
|
||||||
|
|
||||||
|
On other platforms: use ``"surrogateescape"`` by default.
|
||||||
|
|
||||||
|
Supported error handlers:
|
||||||
|
|
||||||
|
* ``"strict"``
|
||||||
|
* ``"surrogateescape"``
|
||||||
|
* ``"surrogatepass"`` (only supported with the UTF-8 encoding)
|
||||||
|
|
||||||
|
See also the :c:member:`~PyConfig.filesystem_encoding` member.
|
||||||
|
|
||||||
.. c:member:: unsigned long hash_seed
|
.. c:member:: unsigned long hash_seed
|
||||||
.. c:member:: int use_hash_seed
|
.. c:member:: int use_hash_seed
|
||||||
|
|
|
@ -616,29 +616,20 @@ always available.
|
||||||
.. function:: getfilesystemencoding()
|
.. function:: getfilesystemencoding()
|
||||||
|
|
||||||
Return the name of the encoding used to convert between Unicode
|
Return the name of the encoding used to convert between Unicode
|
||||||
filenames and bytes filenames. For best compatibility, str should be
|
filenames and bytes filenames.
|
||||||
used for filenames in all cases, although representing filenames as bytes
|
|
||||||
is also supported. Functions accepting or returning filenames should support
|
For best compatibility, str should be used for filenames in all cases,
|
||||||
either str or bytes and internally convert to the system's preferred
|
although representing filenames as bytes is also supported. Functions
|
||||||
representation.
|
accepting or returning filenames should support either str or bytes and
|
||||||
|
internally convert to the system's preferred representation.
|
||||||
|
|
||||||
This encoding is always ASCII-compatible.
|
This encoding is always ASCII-compatible.
|
||||||
|
|
||||||
:func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
|
:func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
|
||||||
the correct encoding and errors mode are used.
|
the correct encoding and errors mode are used.
|
||||||
|
|
||||||
* In the UTF-8 mode, the encoding is ``utf-8`` on any platform.
|
The filesystem encoding is initialized from
|
||||||
|
:c:member:`PyConfig.filesystem_encoding`.
|
||||||
* On macOS, the encoding is ``'utf-8'``.
|
|
||||||
|
|
||||||
* On Unix, the encoding is the locale encoding.
|
|
||||||
|
|
||||||
* On Windows, the encoding may be ``'utf-8'`` or ``'mbcs'``, depending
|
|
||||||
on user configuration.
|
|
||||||
|
|
||||||
* On Android, the encoding is ``'utf-8'``.
|
|
||||||
|
|
||||||
* On VxWorks, the encoding is ``'utf-8'``.
|
|
||||||
|
|
||||||
.. versionchanged:: 3.2
|
.. versionchanged:: 3.2
|
||||||
:func:`getfilesystemencoding` result cannot be ``None`` anymore.
|
:func:`getfilesystemencoding` result cannot be ``None`` anymore.
|
||||||
|
@ -660,6 +651,9 @@ always available.
|
||||||
:func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
|
:func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
|
||||||
the correct encoding and errors mode are used.
|
the correct encoding and errors mode are used.
|
||||||
|
|
||||||
|
The filesystem error handler is initialized from
|
||||||
|
:c:member:`PyConfig.filesystem_errors`.
|
||||||
|
|
||||||
.. versionadded:: 3.6
|
.. versionadded:: 3.6
|
||||||
|
|
||||||
.. function:: getrefcount(object)
|
.. function:: getrefcount(object)
|
||||||
|
@ -1457,6 +1451,9 @@ always available.
|
||||||
This is equivalent to defining the :envvar:`PYTHONLEGACYWINDOWSFSENCODING`
|
This is equivalent to defining the :envvar:`PYTHONLEGACYWINDOWSFSENCODING`
|
||||||
environment variable before launching Python.
|
environment variable before launching Python.
|
||||||
|
|
||||||
|
See also :func:`sys.getfilesystemencoding` and
|
||||||
|
:func:`sys.getfilesystemencodeerrors`.
|
||||||
|
|
||||||
.. availability:: Windows.
|
.. availability:: Windows.
|
||||||
|
|
||||||
.. versionadded:: 3.6
|
.. versionadded:: 3.6
|
||||||
|
|
|
@ -156,36 +156,13 @@ typedef struct {
|
||||||
/* Python filesystem encoding and error handler:
|
/* Python filesystem encoding and error handler:
|
||||||
sys.getfilesystemencoding() and sys.getfilesystemencodeerrors().
|
sys.getfilesystemencoding() and sys.getfilesystemencodeerrors().
|
||||||
|
|
||||||
Default encoding and error handler:
|
The Doc/c-api/init_config.rst documentation explains how Python selects
|
||||||
|
the filesystem encoding and error handler.
|
||||||
|
|
||||||
* if Py_SetStandardStreamEncoding() has been called: they have the
|
_PyUnicode_InitEncodings() updates the encoding name to the Python codec
|
||||||
highest priority;
|
name. For example, "ANSI_X3.4-1968" is replaced with "ascii". It also
|
||||||
* PYTHONIOENCODING environment variable;
|
sets Py_FileSystemDefaultEncoding to filesystem_encoding and
|
||||||
* The UTF-8 Mode uses UTF-8/surrogateescape;
|
sets Py_FileSystemDefaultEncodeErrors to filesystem_errors. */
|
||||||
* If Python forces the usage of the ASCII encoding (ex: C locale
|
|
||||||
or POSIX locale on FreeBSD or HP-UX), use ASCII/surrogateescape;
|
|
||||||
* locale encoding: ANSI code page on Windows, UTF-8 on Android and
|
|
||||||
VxWorks, LC_CTYPE locale encoding on other platforms;
|
|
||||||
* On Windows, "surrogateescape" error handler;
|
|
||||||
* "surrogateescape" error handler if the LC_CTYPE locale is "C" or "POSIX";
|
|
||||||
* "surrogateescape" error handler if the LC_CTYPE locale has been coerced
|
|
||||||
(PEP 538);
|
|
||||||
* "strict" error handler.
|
|
||||||
|
|
||||||
Supported error handlers: "strict", "surrogateescape" and
|
|
||||||
"surrogatepass". The surrogatepass error handler is only supported
|
|
||||||
if Py_DecodeLocale() and Py_EncodeLocale() use directly the UTF-8 codec;
|
|
||||||
it's only used on Windows.
|
|
||||||
|
|
||||||
initfsencoding() updates the encoding to the Python codec name.
|
|
||||||
For example, "ANSI_X3.4-1968" is replaced with "ascii".
|
|
||||||
|
|
||||||
On Windows, sys._enablelegacywindowsfsencoding() sets the
|
|
||||||
encoding/errors to mbcs/replace at runtime.
|
|
||||||
|
|
||||||
|
|
||||||
See Py_FileSystemDefaultEncoding and Py_FileSystemDefaultEncodeErrors.
|
|
||||||
*/
|
|
||||||
wchar_t *filesystem_encoding;
|
wchar_t *filesystem_encoding;
|
||||||
wchar_t *filesystem_errors;
|
wchar_t *filesystem_errors;
|
||||||
|
|
||||||
|
|
|
@ -50,7 +50,7 @@ PyAPI_FUNC(int) _Py_GetLocaleconvNumeric(
|
||||||
|
|
||||||
PyAPI_FUNC(void) _Py_closerange(int first, int last);
|
PyAPI_FUNC(void) _Py_closerange(int first, int last);
|
||||||
|
|
||||||
PyAPI_FUNC(wchar_t*) _Py_GetLocaleEncoding(const char **errmsg);
|
PyAPI_FUNC(wchar_t*) _Py_GetLocaleEncoding(void);
|
||||||
PyAPI_FUNC(PyObject*) _Py_GetLocaleEncodingObject(void);
|
PyAPI_FUNC(PyObject*) _Py_GetLocaleEncodingObject(void);
|
||||||
|
|
||||||
#ifdef __cplusplus
|
#ifdef __cplusplus
|
||||||
|
|
|
@ -841,12 +841,16 @@ extern _invalid_parameter_handler _Py_silent_invalid_parameter_handler;
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
#if defined(__ANDROID__) || defined(__VXWORKS__)
|
#if defined(__ANDROID__) || defined(__VXWORKS__)
|
||||||
/* Ignore the locale encoding: force UTF-8 */
|
// Use UTF-8 as the locale encoding, ignore the LC_CTYPE locale.
|
||||||
|
// See _Py_GetLocaleEncoding(), PyUnicode_DecodeLocale()
|
||||||
|
// and PyUnicode_EncodeLocale().
|
||||||
# define _Py_FORCE_UTF8_LOCALE
|
# define _Py_FORCE_UTF8_LOCALE
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
#if defined(_Py_FORCE_UTF8_LOCALE) || defined(__APPLE__)
|
#if defined(_Py_FORCE_UTF8_LOCALE) || defined(__APPLE__)
|
||||||
/* Use UTF-8 as filesystem encoding */
|
// Use UTF-8 as the filesystem encoding.
|
||||||
|
// See PyUnicode_DecodeFSDefaultAndSize(), PyUnicode_EncodeFSDefault(),
|
||||||
|
// Py_DecodeLocale() and Py_EncodeLocale().
|
||||||
# define _Py_FORCE_UTF8_FS_ENCODING
|
# define _Py_FORCE_UTF8_FS_ENCODING
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
|
|
|
@ -0,0 +1,2 @@
|
||||||
|
If the ``nl_langinfo(CODESET)`` function returns an empty string, Python now
|
||||||
|
uses UTF-8 as the filesystem encoding. Patch by Victor Stinner.
|
|
@ -826,20 +826,15 @@ _Py_EncodeLocaleEx(const wchar_t *text, char **str,
|
||||||
// - Return "UTF-8" if _Py_FORCE_UTF8_LOCALE macro is defined (ex: on Android)
|
// - Return "UTF-8" if _Py_FORCE_UTF8_LOCALE macro is defined (ex: on Android)
|
||||||
// - Return "UTF-8" if the UTF-8 Mode is enabled
|
// - Return "UTF-8" if the UTF-8 Mode is enabled
|
||||||
// - On Windows, return the ANSI code page (ex: "cp1250")
|
// - On Windows, return the ANSI code page (ex: "cp1250")
|
||||||
// - Return "UTF-8" if nl_langinfo(CODESET) returns an empty string
|
// - Return "UTF-8" if nl_langinfo(CODESET) returns an empty string.
|
||||||
// and if the _Py_FORCE_UTF8_FS_ENCODING macro is defined (ex: on macOS).
|
|
||||||
// - Otherwise, return nl_langinfo(CODESET).
|
// - Otherwise, return nl_langinfo(CODESET).
|
||||||
//
|
//
|
||||||
// Return NULL and set errmsg to an error message
|
// Return NULL on memory allocation failure.
|
||||||
// if nl_langinfo(CODESET) fails.
|
|
||||||
//
|
|
||||||
// Return NULL and set errmsg to NULL on memory allocation failure.
|
|
||||||
//
|
//
|
||||||
// See also config_get_locale_encoding()
|
// See also config_get_locale_encoding()
|
||||||
wchar_t*
|
wchar_t*
|
||||||
_Py_GetLocaleEncoding(const char **errmsg)
|
_Py_GetLocaleEncoding(void)
|
||||||
{
|
{
|
||||||
*errmsg = NULL;
|
|
||||||
#ifdef _Py_FORCE_UTF8_LOCALE
|
#ifdef _Py_FORCE_UTF8_LOCALE
|
||||||
// On Android langinfo.h and CODESET are missing,
|
// On Android langinfo.h and CODESET are missing,
|
||||||
// and UTF-8 is always used in mbstowcs() and wcstombs().
|
// and UTF-8 is always used in mbstowcs() and wcstombs().
|
||||||
|
@ -859,21 +854,14 @@ _Py_GetLocaleEncoding(const char **errmsg)
|
||||||
#else
|
#else
|
||||||
const char *encoding = nl_langinfo(CODESET);
|
const char *encoding = nl_langinfo(CODESET);
|
||||||
if (!encoding || encoding[0] == '\0') {
|
if (!encoding || encoding[0] == '\0') {
|
||||||
#ifdef _Py_FORCE_UTF8_FS_ENCODING
|
// Use UTF-8 if nl_langinfo() returns an empty string. It can happen on
|
||||||
// nl_langinfo() can return an empty string when the LC_CTYPE locale is
|
// macOS if the LC_CTYPE locale is not supported.
|
||||||
// not supported. Default to UTF-8 in that case, because UTF-8 is the
|
|
||||||
// default charset on macOS.
|
|
||||||
return _PyMem_RawWcsdup(L"UTF-8");
|
return _PyMem_RawWcsdup(L"UTF-8");
|
||||||
#else
|
|
||||||
*errmsg = "failed to get the locale encoding: "
|
|
||||||
"nl_langinfo(CODESET) returns an empty string";
|
|
||||||
return NULL;
|
|
||||||
#endif
|
|
||||||
}
|
}
|
||||||
|
|
||||||
wchar_t *wstr;
|
wchar_t *wstr;
|
||||||
int res = decode_current_locale(encoding, &wstr, NULL,
|
int res = decode_current_locale(encoding, &wstr, NULL,
|
||||||
errmsg, _Py_ERROR_SURROGATEESCAPE);
|
NULL, _Py_ERROR_SURROGATEESCAPE);
|
||||||
if (res < 0) {
|
if (res < 0) {
|
||||||
return NULL;
|
return NULL;
|
||||||
}
|
}
|
||||||
|
@ -887,15 +875,9 @@ _Py_GetLocaleEncoding(const char **errmsg)
|
||||||
PyObject *
|
PyObject *
|
||||||
_Py_GetLocaleEncodingObject(void)
|
_Py_GetLocaleEncodingObject(void)
|
||||||
{
|
{
|
||||||
const char *errmsg;
|
wchar_t *encoding = _Py_GetLocaleEncoding();
|
||||||
wchar_t *encoding = _Py_GetLocaleEncoding(&errmsg);
|
|
||||||
if (encoding == NULL) {
|
if (encoding == NULL) {
|
||||||
if (errmsg != NULL) {
|
PyErr_NoMemory();
|
||||||
PyErr_SetString(PyExc_ValueError, errmsg);
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
PyErr_NoMemory();
|
|
||||||
}
|
|
||||||
return NULL;
|
return NULL;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
@ -1318,7 +1318,7 @@ config_read_env_vars(PyConfig *config)
|
||||||
|
|
||||||
#ifdef MS_WINDOWS
|
#ifdef MS_WINDOWS
|
||||||
_Py_get_env_flag(use_env, &config->legacy_windows_stdio,
|
_Py_get_env_flag(use_env, &config->legacy_windows_stdio,
|
||||||
"PYTHONLEGACYWINDOWSSTDIO");
|
"PYTHONLEGACYWINDOWSSTDIO");
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
if (config_get_env(config, "PYTHONDUMPREFS")) {
|
if (config_get_env(config, "PYTHONDUMPREFS")) {
|
||||||
|
@ -1498,15 +1498,9 @@ static PyStatus
|
||||||
config_get_locale_encoding(PyConfig *config, const PyPreConfig *preconfig,
|
config_get_locale_encoding(PyConfig *config, const PyPreConfig *preconfig,
|
||||||
wchar_t **locale_encoding)
|
wchar_t **locale_encoding)
|
||||||
{
|
{
|
||||||
const char *errmsg;
|
wchar_t *encoding = _Py_GetLocaleEncoding();
|
||||||
wchar_t *encoding = _Py_GetLocaleEncoding(&errmsg);
|
|
||||||
if (encoding == NULL) {
|
if (encoding == NULL) {
|
||||||
if (errmsg != NULL) {
|
return _PyStatus_NO_MEMORY();
|
||||||
return _PyStatus_ERR(errmsg);
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
return _PyStatus_NO_MEMORY();
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
PyStatus status = PyConfig_SetString(config, locale_encoding, encoding);
|
PyStatus status = PyConfig_SetString(config, locale_encoding, encoding);
|
||||||
PyMem_RawFree(encoding);
|
PyMem_RawFree(encoding);
|
||||||
|
|
Loading…
Reference in New Issue