Commit Graph

227 Commits

Author SHA1 Message Date
Miss Islington (bot) 735a960ac9
bpo-36311: Fixes decoding multibyte characters around chunk boundaries and improves decoding performance (GH-15083)
(cherry picked from commit 7ebdda0dbe)

Co-authored-by: Steve Dower <steve.dower@python.org>
2019-08-21 16:55:57 -07:00
Miss Islington (bot) c755ca89c7 [3.7] bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) (GH-14369)
* bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304)

* The UTF-8 incremental decoder now fails fast if it encounters
  a sequence that can't be handled by the error handler.
* The UTF-16 incremental decoders with the surrogatepass error
  handler now decode a lone low surrogate with final=False.
(cherry picked from commit 894263ba80)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2019-06-25 12:29:18 +02:00
Miss Islington (bot) a6dc5d4e1c bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278)
(cherry picked from commit a6ec1ce1ac)

Co-authored-by: Ammar Askar <ammar_askar@hotmail.com>
2019-05-31 23:03:22 +03:00
Jelle Zijlstra 81c5ec9e41 [3.7] bpo-33482: fix codecs.StreamRecoder.writelines (GH-6779) (GH-13502)
A very simple fix. I found this while writing typeshed stubs for StreamRecoder.

https://bugs.python.org/issue33482.
(cherry picked from commit b3be407288)

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
2019-05-22 09:28:38 -07:00
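To make the StreamRecoder.writelines fix concrete, here is a minimal sketch of the call it concerns. The choice of UTF-8 as the frontend encoding and Latin-1 as the backend encoding, and the names raw, recoder and stored, are assumptions made for illustration; they are not taken from the commit.

```python
import codecs
import io

raw = io.BytesIO()
utf8 = codecs.lookup("utf-8")
latin1 = codecs.lookup("latin-1")

# Callers hand UTF-8 bytes to the recoder; the backend stream stores Latin-1.
recoder = codecs.StreamRecoder(
    raw, utf8.encode, utf8.decode, latin1.streamreader, latin1.streamwriter
)

# writelines() takes a list of bytes objects in the frontend encoding.
recoder.writelines(["caf".encode("utf-8"), "é\n".encode("utf-8")])

stored = raw.getvalue()  # bytes as written to the backend stream
```

On an interpreter that includes the fix, stored holds the Latin-1 encoding of the joined lines.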
Miss Islington (bot) bd48280cb6 bpo-24214: Fixed the UTF-8 incremental decoder. (GH-12603) (GH-12627)
The bug occurred when the encoded surrogate character is passed
to the incremental decoder in two chunks.
(cherry picked from commit 7a465cb5ee)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2019-03-30 15:52:41 +02:00
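The two-chunk case described in this commit can be sketched as follows. The code point U+D901 and the variable names are illustrative assumptions; the loop tries every possible split point of the three encoded bytes.

```python
import codecs

# A lone surrogate can only be encoded to UTF-8 with surrogatepass.
data = "\ud901".encode("utf-8", "surrogatepass")

results = []
for split in range(1, len(data)):
    decoder = codecs.getincrementaldecoder("utf-8")("surrogatepass")
    first = decoder.decode(data[:split])             # incomplete input is buffered
    rest = decoder.decode(data[split:], final=True)  # remaining bytes complete it
    results.append(first + rest)
```

On an interpreter that includes the fix, every split point yields the original surrogate.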
Miss Islington (bot) 74829b7323
bpo-36312: Fix decoders for some code pages. (GH-12369)
(cherry picked from commit c1e2c288f4)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2019-03-20 21:31:57 -07:00
Miss Islington (bot) bdeb56cd21
bpo-35372: Fix the code page decoder for input > 2 GiB. (GH-10848)
(cherry picked from commit 4013c17911)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2018-12-03 01:09:11 -08:00
Victor Stinner 91106cd9ff
bpo-29240: PEP 540: Add a new UTF-8 Mode (#855)
* Add -X utf8 command line option, PYTHONUTF8 environment variable
  and a new sys.flags.utf8_mode flag.
* If the LC_CTYPE locale is "C" at startup: automatically enable the
  UTF-8 mode.
* Add _winapi.GetACP(). encodings._alias_mbcs() now calls
  _winapi.GetACP() to get the ANSI code page
* locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8
  mode. As a side effect, open() now uses the UTF-8 encoding by
  default in this mode.
* Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding
  in the UTF-8 Mode.
* Update subprocess._args_from_interpreter_flags() to handle -X utf8
* Skip some tests relying on the current locale if the UTF-8 mode is
  enabled.
* Add test_utf8mode.py.
* _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to
  return also the length (number of wide characters).
* pymain_get_global_config() and pymain_set_global_config() now
  always copy flag values, rather than only copying if the new value
  is greater than the old value.
2017-12-13 12:29:09 +01:00
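A minimal sketch of how the -X utf8 option and the sys.flags.utf8_mode flag listed above can be observed, assuming Python 3.7 or later. It starts a child interpreter so that the parent's own mode does not matter; the probe string and variable names are illustrative.

```python
import subprocess
import sys

# Code run in the child interpreter: report the flag and the preferred encoding.
probe = "import sys, locale; print(sys.flags.utf8_mode, locale.getpreferredencoding())"

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", probe],
    capture_output=True,
    text=True,
    check=True,
)
mode, encoding = result.stdout.split()
```

The spelling of the reported encoding name ('UTF-8' or 'utf-8') differs between Python versions, so compare it case-insensitively.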
Serhiy Storchaka 219c2de5ad
bpo-32110: codecs.StreamReader.read(n) now returns not more than n (#4499)
characters/bytes for non-negative n.  This makes it compatible with
read() methods of other file-like objects.
2017-11-29 01:30:00 +02:00
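The read(n) behaviour described here can be sketched with an in-memory stream. The sample text and the chunk size of 4 are arbitrary choices for illustration.

```python
import codecs
import io

text = "héllo wörld"
reader = codecs.getreader("utf-8")(io.BytesIO(text.encode("utf-8")))

chunks = []
while True:
    chunk = reader.read(4)  # with the fix, at most 4 characters per call
    if not chunk:           # an empty string signals end of stream
        break
    chunks.append(chunk)
```

Each chunk is at most four characters long, and joining the chunks gives back the original text.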
Serhiy Storchaka 56cb465cc9 bpo-31825: Fixed OverflowError in the 'unicode-escape' codec (#4058)
and in codecs.escape_decode() when decoding an escaped non-ASCII byte.
2017-10-20 17:08:15 +03:00
Berker Peksag 7b4bcd2004 Issue #25270: Merge from 3.5 2016-09-16 17:32:06 +03:00
Berker Peksag 4a72a7b6c4 Issue #25270: Prevent codecs.escape_encode() from raising SystemError when an empty bytestring is passed 2016-09-16 17:31:06 +03:00
R David Murray 110b6fecbb #27364: Deprecate invalid escape strings in str/bytes.
Patch by Emanuel Barry, reviewed by Serhiy Storchaka and Martin Panter.
2016-09-08 15:34:08 -04:00
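A minimal sketch of what the deprecation looks like from Python code, assuming an interpreter where an invalid escape such as \d still produces a warning rather than an error. The warning category has changed across versions (DeprecationWarning at first, SyntaxWarning in later releases), so the sketch records whatever is raised.

```python
import warnings

source = r'"\d"'  # source text of a string literal with the invalid escape \d

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = eval(compile(source, "<example>", "eval"))

# The backslash is kept in the resulting string; only a warning is issued.
categories = [w.category for w in caught]
```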
R David Murray 44b548dda8 #27364: fix "incorrect" uses of escape character in the stdlib.
And most of the tools.

Patch by Emanuel Barry, reviewed by me, Serhiy Storchaka, and
Martin Panter.
2016-09-08 13:59:53 -04:00
Steve Dower f5aba58480 Issue #27959: Adds oem encoding, alias ansi to mbcs, move aliasmbcs to codec lookup 2016-09-06 19:42:27 -07:00
Serhiy Storchaka e437a10d15 Issue #23277: Remove unused imports in tests. 2016-04-24 21:41:02 +03:00
Martin Panter 8b04a945ef Merge typo fixes from 3.5 2016-04-16 09:29:17 +00:00
Martin Panter 119e502277 Fix typos in code comments and documentation 2016-04-16 09:28:57 +00:00
Martin Panter cda80940ed Issue #15984: Merge PyUnicode doc from 3.5 2016-04-15 02:27:11 +00:00
Martin Panter 6245cb3c01 Correct “an” → “a” with “Unicode”, “user”, “UTF”, etc
This affects documentation, code comments, and debugging messages.
2016-04-15 02:14:19 +00:00
Martin Panter e56a919100 Issue #25523: Merge a-to-an corrections from 3.5 2015-11-02 04:27:17 +00:00
Martin Panter 2eb819f7a8 Issue #25523: Merge "a" to "an" fixes from 3.4 into 3.5 2015-11-02 04:04:57 +00:00
Martin Panter 7462b64911 Issue #25523: Correct "a" article to "an" article
This changes the main documentation, doc strings, source code comments, and a
couple error messages in the test suite. In some cases the word was removed
or edited some other way to fix the grammar.
2015-11-02 03:37:02 +00:00
Victor Stinner 797485e101 Issue #25318: Avoid sprintf() in backslashreplace()
Rewrite backslashreplace() to be closer to PyCodec_BackslashReplaceErrors().

Add also unit tests for non-BMP characters.
2015-10-09 03:17:30 +02:00
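For reference, a small sketch of what the backslashreplace error handler produces when encoding, including a non-BMP character as mentioned above. The sample string is an arbitrary choice.

```python
# One Latin-1 character, one BMP character and one non-BMP character.
text = "caf\xe9 \u20ac \U0001f600"

# Unencodable characters become \xNN, \uNNNN or \UNNNNNNNN escapes.
encoded = text.encode("ascii", "backslashreplace")
```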
Victor Stinner 1d65d9192d Issue #25301: The UTF-8 decoder is now up to 15 times as fast for error
handlers: ``ignore``, ``replace`` and ``surrogateescape``.
2015-10-05 13:43:50 +02:00
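The three error handlers named in this commit behave as sketched below when the UTF-8 decoder meets an invalid byte. The input bytes are an arbitrary example; the commit changes only speed, not these results.

```python
data = b"abc\xffdef"  # 0xFF can never appear in valid UTF-8

ignored = data.decode("utf-8", "ignore")           # invalid byte dropped
replaced = data.decode("utf-8", "replace")         # U+FFFD inserted
escaped = data.decode("utf-8", "surrogateescape")  # byte mapped to U+DCFF

# surrogateescape is reversible: encoding restores the original bytes.
restored = escaped.encode("utf-8", "surrogateescape")
```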
Serhiy Storchaka 29e68edbf4 Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
1. Non-ASCII bytes were accepted after shift sequence.
2. A low surrogate could be emitted in case of error in high surrogate.
3. In some circumstances the '\xfd' character was produced instead of the
replacement character '\ufffd' (due to a bug in _PyUnicodeWriter).
2015-10-02 13:14:03 +03:00
Serhiy Storchaka 58c8f2bb6d Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
1. Non-ASCII bytes were accepted after shift sequence.
2. A low surrogate could be emitted in case of error in high surrogate.
3. In some circumstances the '\xfd' character was produced instead of the
replacement character '\ufffd' (due to a bug in _PyUnicodeWriter).
2015-10-02 13:13:14 +03:00
Serhiy Storchaka 28b21e50c8 Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
1. Non-ASCII bytes were accepted after shift sequence.
2. A low surrogate could be emitted in case of error in high surrogate.
2015-10-02 13:07:28 +03:00
Victor Stinner 01ada3996b Issue #25267: The UTF-8 encoder is now up to 75 times as fast for error
handlers: ``ignore``, ``replace``, ``surrogateescape``, ``surrogatepass``.
Patch co-written with Serhiy Storchaka.
2015-10-01 21:54:51 +02:00
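On the encoding side, the four handlers listed here differ in how they treat a lone surrogate, which UTF-8 cannot encode by default. A minimal sketch with an arbitrary sample string:

```python
text = "a\udc80b"  # contains the lone low surrogate U+DC80

ignored = text.encode("utf-8", "ignore")           # surrogate dropped
replaced = text.encode("utf-8", "replace")         # replaced with "?"
escaped = text.encode("utf-8", "surrogateescape")  # emitted as the raw byte 0x80
passed = text.encode("utf-8", "surrogatepass")     # encoded as a 3-byte sequence
```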
Victor Stinner c3713e9706 Optimize ascii/latin1+surrogateescape encoders
Issue #25227: Optimize ASCII and latin1 encoders with the ``surrogateescape``
error handler: the encoders are now up to 3 times as fast.

Initial patch written by Serhiy Storchaka.
2015-09-29 12:32:13 +02:00
Victor Stinner f96418de05 Issue #24870: Optimize the ASCII decoder for error handlers: surrogateescape,
ignore and replace. Initial patch written by Naoki Inada.

The decoder is now up to 60 times as fast for these error handlers.

Add also unit tests for the ASCII decoder.
2015-09-21 23:06:27 +02:00
Martin Panter 9ab96946ee Issue #16473: Merge codecs doc and test from 3.4 into 3.5 2015-09-12 01:22:17 +00:00
Martin Panter 06171bd52a Issue #16473: Fix byte transform codec documentation; test quotetabs=True
This changes the equivalent functions listed for the Base-64, hex and Quoted-
Printable codecs to reflect the functions actually used. Also mention and
test the "quotetabs" setting for Quoted-Printable encoding.
2015-09-12 00:34:28 +00:00
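A short sketch of the Quoted-Printable codec with tabs and spaces quoted, which is the behaviour the "quotetabs" setting refers to. The sample bytes are an illustrative choice.

```python
import codecs

original = b"space tab\teol \n"

# The codec escapes spaces as =20 and tabs as =09.
encoded = codecs.encode(original, "quopri-codec")
decoded = codecs.decode(encoded, "quopri-codec")
```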
Serhiy Storchaka f0eeedf0d8 Issue #22681: Added support for the koi8_t encoding. 2015-05-12 23:24:19 +03:00
Serhiy Storchaka ad8a1c3fb2 Issue #22682: Added support for the kz1048 encoding. 2015-05-12 23:16:55 +03:00
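Both newly added encodings are single-byte Cyrillic code pages, so a basic Russian word should round-trip through either one. A minimal sketch, with the sample word chosen for illustration:

```python
text = "Привет"

encoded = {name: text.encode(name) for name in ("koi8_t", "kz1048")}
roundtrips = {name: data.decode(name) for name, data in encoded.items()}
```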
Serhiy Storchaka 8490f5acfe Issue #23001: A few functions in modules mmap, ossaudiodev, socket, ssl, and
codecs that previously accepted only read-only bytes-like objects now accept
writable bytes-like objects too.
2015-03-20 09:00:36 +02:00
Victor Stinner f2be23d329 Issue #22286, #23321: Fix failing test on Windows code page 932
There was a bug which was fixed. The unit test was also wrong.
2015-01-26 23:26:11 +01:00
Serhiy Storchaka 07985ef387 Issue #22286: The "backslashreplace" error handler now works with
decoding and translating.
2015-01-25 22:56:57 +02:00
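With this change, backslashreplace can be used when decoding as well as when encoding. A minimal sketch, assuming Python 3.5 or later and an arbitrary byte string:

```python
data = b"abc\xff\xfe"  # two bytes that are invalid in UTF-8

# Each undecodable byte is rendered as a literal \xNN escape in the result.
decoded = data.decode("utf-8", "backslashreplace")
```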
Nick Coghlan 582acb75e9 Merge issue 19548 changes from 3.4 2015-01-07 00:37:01 +10:00
Nick Coghlan b9fdb7a452 Issue 19548: update codecs module documentation
- clarified the distinction between text encodings and other codecs
- clarified relationship with builtin open and the io module
- consolidated documentation of error handlers into one section
- clarified type constraints of some behaviours
- added tests for some of the new statements in the docs
2015-01-07 00:22:00 +10:00
Serhiy Storchaka f65d1d3b02 Issue #23071: "namereplace_errors" was added only in 3.5. 2014-12-20 18:53:01 +02:00
Serhiy Storchaka 4d33ff6183 Issue #23071: Added missing names to codecs.__all__. Patch by Martin Panter. 2014-12-20 17:46:05 +02:00
Serhiy Storchaka de3ee5b94f Issue #23071: Added missing names to codecs.__all__. Patch by Martin Panter. 2014-12-20 17:42:38 +02:00
Serhiy Storchaka 166ebc4e5d Issue #19676: Added the "namereplace" error handler. 2014-11-25 13:57:17 +02:00
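A brief sketch of the "namereplace" error handler in use, assuming Python 3.5 or later. The sample string is an illustrative choice.

```python
text = "café"

# Unencodable characters are replaced with \N{...} escapes using Unicode names.
encoded = text.encode("ascii", "namereplace")
```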
Serhiy Storchaka 85e7066278 Issue #22406: Fixed the uu_codec codec incorrectly ported to 3.x.
Based on patch by Martin Panter.
2014-11-07 14:06:19 +02:00
Serhiy Storchaka 519114df42 Issue #22406: Fixed the uu_codec codec incorrectly ported to 3.x.
Based on patch by Martin Panter.
2014-11-07 14:04:37 +02:00
Nick Coghlan a0f33759fa Merge fix for issue #22166 from 3.4 2014-09-15 23:55:16 +12:00
Nick Coghlan 8fad1676a2 Issue #22166: clear codec caches in test_codecs 2014-09-15 23:50:44 +12:00
Victor Stinner 0d4e01ca07 Issue #13916: Fix surrogatepass error handler on Windows 2014-05-16 14:46:20 +02:00
Serhiy Storchaka 88d8fb6af6 Issue #13916: Disallowed the surrogatepass error handler for non UTF-*
encodings.
2014-05-15 14:37:42 +03:00
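The restriction can be sketched as follows: surrogatepass still works with UTF-8, while an encoding outside the UTF family rejects it. ASCII is used here as an example of such an encoding; the variable names are illustrative.

```python
lone = "\ud800"  # a lone high surrogate

# Accepted by a UTF-* codec.
utf8_bytes = lone.encode("utf-8", "surrogatepass")

# Not accepted by a non-UTF codec: the original encode error is raised.
try:
    lone.encode("ascii", "surrogatepass")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```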