This PR replaces #1977. The reason for the replacement is two-fold.
The fix itself is different is that if the CTE header doesn't exist in the original message, it is inserted. This is important because the new CTE could be quoted-printable whereas the original is implicit 8bit.
Also the tests are different. The test_nonascii_as_string_without_cte test in #1977 doesn't actually test the issue in that it passes without the fix. The test_nonascii_as_string_without_content_type_and_cte test is improved here, and even though it doesn't fail without the fix, it is included for completeness.
Automerge-Triggered-By: @warsaw
Fixes a case in which email._header_value_parser.get_unstructured hangs the system for some invalid headers. This covers the cases in which the header contains either:
- a case without trailing whitespace
- an invalid encoded word
https://bugs.python.org/issue37764
This fix should also be backported to 3.7 and 3.8
https://bugs.python.org/issue37764
* bpo-30835: email: Fix AttributeError when parsing invalid Content-Transfer-Encoding
Parsing an email containing a multipart Content-Type, along with a
Content-Transfer-Encoding containing an invalid (non-ASCII-decodable) byte
will fail. email.feedparser.FeedParser._parsegen() gets the header and
attempts to convert it to lowercase before comparing it with the accepted
encodings, but as the header contains an invalid byte, it's returned as a
Header object rather than a str.
Cast the Content-Transfer-Encoding header to a str to avoid this.
Found using the AFL fuzzer.
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Andrew Donnellan <andrew@donnellan.id.au>
* Add email and NEWS entry for the bugfix.
This changes the main documentation, doc strings, source code comments, and a
couple error messages in the test suite. In some cases the word was removed
or edited some other way to fix the grammar.
This is more RFC compliant (see issue) and fixes a problem with
signature verifiers rejecting the part when signed. There is some
amount of backward compatibility concern here since it changes
the output, but the RFC issue coupled with fixing the problem
with signature verifiers seems worth the small risk of breaking
code that depends on the current incorrect output.
This is a bit of an ugly hack because of the way generator pieces together the
output message. The deepcopys aren't too expensive, though, because we know it
is only called on messages that are not multiparts, and the payload (the thing
that could be large) is an immutable object.
Test and preliminary work on patch by Vajrasky Kok.
This fixes an edge case (20206) where if the input ended in a character
needing encoding but there was no newline on the string, the last byte
of the encoded character would be dropped. The fix is to use a more
efficient algorithm, provided by Serhiy Storchaka (5803), that does not
have the bug.