This is more RFC compliant (see issue) and fixes a problem with
signature verifiers rejecting the part when signed. There is some
amount of backward compatibility concern here since it changes
the output, but the RFC issue coupled with fixing the problem
with signature verifiers seems worth the small risk of breaking
code that depends on the current incorrect output.
This applies only to the new parser. The old parser decodes encoded words
inside quoted strings already, although it gets the whitespace wrong
when it does so.
This version of the patch only handles the most common case (a single encoded
word surrounded by quotes), but I haven't seen any other variations of this in
the wild yet, so its good enough for now.
This is a bit of an ugly hack because of the way generator pieces together the
output message. The deepcopys aren't too expensive, though, because we know it
is only called on messages that are not multiparts, and the payload (the thing
that could be large) is an immutable object.
Test and preliminary work on patch by Vajrasky Kok.
This fixes an edge case (20206) where if the input ended in a character
needing encoding but there was no newline on the string, the last byte
of the encoded character would be dropped. The fix is to use a more
efficient algorithm, provided by Serhiy Storchaka (5803), that does not
have the bug.
This is a backward compatible partial fix, the complete fix requires raising
an error instead of accepting the invalid input, so the real fix is only
suitable for 3.4.
This adds EmailMessage and, MIMEPart subclasses of Message
with new API methods, and a ContentManager class used by
the new methods. Also a new policy setting, content_manager.
Patch was reviewed by Stephen J. Turnbull and Serhiy Storchaka,
and reflects their feedback.
I will ideally add some examples of using the new API to the
documentation before the final release.
This also backs out the previous fixes for for #14360, #1717, and #16564.
Those bugs were actually caused by the fact that set_payload didn't decode to
str, thus rendering the model inconsistent. This fix does mean the data
processed by the encoder functions goes through an extra encode/decode cycle,
but it means the model is always consistent. Future API updates will provide
a better way to encode payloads, which will bypass this minor de-optimization.
Tests by Vajrasky Kok.
This was triggered by wanting to make the doctest in email.policy.rst pass;
as_bytes and __bytes__ are clearly useful now that we have BytesGenerator.
Also updated the Message docs to document the policy keyword that was
added in 3.3.
There is more to be done here in terms of accepting RFC invalid
input that some mailers accept, but this covers the valid
RFC places where encoded words can occur in structured headers.
The problem was I was only checking for decimal digits after the third '?',
not for *hex* digits :(.
This changeset also fixes a couple of comment typos, deletes an unused
function relating to encoded word parsing, and removed an invalid
'if' test from the folding function that was revealed by the tests
written to validate this issue.
There were no tests for the encoders module. encode_base64 worked
because it is the default and so got tested implicitly elsewhere, and
we use encode_7or8bit internally, so that worked, too. I previously
fixed encode_noop, so this fix means that everythign in the encoders
module now works, hopefully correctly. Also added an explicit test
for encode_base64.
The new _has_surrogates code was suggested by Serhiy Storchaka. See
the issue for timings, but it is far faster than any other alternative,
and also removes the load time that we previously incurred from compiling
the complex regex this replaces.
Previously the parts of the message retained whatever linesep they had on
read, which means if the messages weren't read in univeral newline mode, the
line endings could well be inconsistent. In general sending it via smtplib
would result in them getting fixed, but it is better to generate them
correctly to begin with. Also, the new send_message method of smtplib does
not do the fixup, so that method is producing rfc-invalid output without this
fix.
Previously the parts of the message retained whatever linesep they had on
read, which means if the messages weren't read in univeral newline mode, the
line endings could well be inconsistent. In general sending it via smtplib
would result in them getting fixed, but it is better to generate them
correctly to begin with. Also, the new send_message method of smtplib does
not do the fixup, so that method is producing rfc-invalid output without this
fix.
Previously the parts of the message retained whatever linesep they had on
read, which means if the messages weren't read in univeral newline mode, the
line endings could well be inconsistent. In general sending it via smtplib
would result in them getting fixed, but it is better to generate them
correctly to begin with. Also, the new send_message method of smtplib does
not do the fixup, so that method is producing rfc-invalid output without this
fix.
This code passes all the same tests that the existing RFC mime header
parser passes, plus a bunch of additional ones.
There are a couple of commented out tests where there are issues with the
folding. The folding doesn't normally get invoked for headers parsed from
source, and the cases are marginal anyway (headers with invalid binary data)
so I'm not worried about them, but will fix them after the beta.
There are things that can be done to make this API even more convenient, but I
think this is a solid foundation worth having. And the parser is a full RFC
parser, so it handles cases that the current parser doesn't. (There are also
probably cases where it fails when the current parser doesn't, but I haven't
found them yet ;)
Oh, yeah, and there are some really ugly bits in the parser for handling some
'postel' cases that are unfortunately common.
I hope/plan to to eventually refactor a lot of the code in the parser which
should reduce the line count...but there is no escaping the fact that the
error recovery is welter of special cases.
The behavior of MessageDefect is legacy behavior. The chances anyone is
actually using the undocumented 'line' attribute is low, but it costs
little to retain backward compatibility. Although one of the costs is
having to restore normal exception behavior in HeaderDefect. On the
other hand, I'll probably add some specialized behavior there later.
This is a behavior change: before this leading and trailing spaces were
stripped from ASCII parts, now they are preserved. Without this fix we didn't
parse the examples in the RFC correctly, so I think breaking backward
compatibility here is justified.
Patch by Ralf Schlatterbeck.
This feature was supposed to be part of the initial email6 checkin, but it got
lost in my big refactoring.
In this patch I'm not providing an easy way to turn off the errors, but they
only happen when a header is added programmatically, and it is almost never
the right thing to do to allow the duplicate to be added. An application that
needs to add duplicates of unique headers can create a policy subclass to
allow it.
This commit also restores the news item for 167256 that it looks like
Terry inadvertently deleted. (Either that, or I don't understand
now merging works...which is equally possible.)
Which also means that it is now producing *something* for any base64
payload, which is what leads to the couple of older test changes in
test_email. This is a slightly backward incompatible behavior change,
but the new behavior is so much more useful than the old (you can now
*reliably* detect errors, and any program that was detecting errors by
sniffing for a base64 return from get_payload(decode=True) and then doing
its own error-recovery decode will just get the error-recovery decode
right away). So this seems to me to be worth the small risk inherent
in this behavior change.
This patch also refactors the defect tests into a separate test file,
since they are no longer just parser tests.
This patch also deprecates the MalformedHeaderDefect. My best guess is that
this defect was rendered obsolete by a refactoring of the parser, and the
corresponding defect for the new parser (which this patch introduces) was
overlooked.
When I made the checkin of the provisional email policy, I knew that
Address and Group needed to be made accessible from somewhere. The more
I looked at it, though, the more it became clear that since this is a
provisional API anyway, there's no good reason to hide headerregistry as
a private API. It was designed to ultimately be part of the public API,
and so it should be part of the provisional API.
This patch fully documents the headerregistry API, and deletes the
abbreviated version of those docs I had added to the provisional policy
docs.
Although '<>' is invalid according to RFC 5322, SMTP uses it for various
things, and it sometimes ends up in email headers. This patch changes
get_angle_addr to recognize it and just register a Defect instead of raising a
parsing error.
Without this function people would be tempted to use the other date functions
in email.utils to compute an aware localtime, and those functions are not as
good for that purpose as this code. The code is Alexander Belopolsy's from
his proposed patch for issue 9527, with a fix (and additional tests) by Brian
K. Jones.
When the new policies are used (and only when the new policies are explicitly
used) headers turn into objects that have attributes based on their parsed
values, and can be set using objects that encapsulate the values, as well as
set directly from unicode strings. The folding algorithm then takes care of
encoding unicode where needed, and folding according to the highest level
syntactic objects.
With this patch only date and time headers are parsed as anything other than
unstructured, but that is all the helper methods in the existing API handle.
I do plan to add more parsers, and complete the set specified in the RFC
before the package becomes stable.
This patch primarily does two things: (1) it adds some internal-interface
methods to Policy that allow for Policy to control the parsing and folding of
headers in such a way that we can construct a backward compatibility policy
that is 100% compatible with the 3.2 API, while allowing a new policy to
implement the email6 API. (2) it adds that backward compatibility policy and
refactors the test suite so that the only differences between the 3.2
test_email.py file and the 3.3 test_email.py file is some small changes in
test framework and the addition of tests for bugs fixed that apply to the 3.2
API.
There are some additional teaks, such as moving just the code needed for the
compatibility policy into _policybase, so that the library code can import
only _policybase. That way the new code that will be added for email6
will only get imported when a non-compatibility policy is imported.
Éric pointed out that given that the default was documented as None, someone
would reasonably pass that to get the default behavior. In fixing the code to
use None, I noticed that the change to _charset was being done after it had
already been passed to MIMENonMultipart. The change to the test verifies that
the order is now correct.
Previously it would just accept the unicode, which would wind up as unicode in
the transfer-encoded message object, which is just wrong.
Patch by Jeff Knupp.
In Python2, if a unicode string was assigned as the value of a header,
email would automatically CTE encode it using the UTF8 charset.
This capability was lost in the Python3 translation, and this patch
restores it.
Patch by Ali Ikinci, assisted by R. David Murray.
I also added a fix for the mailbox test that was depending (with a comment
that it was a bad idea to so depend) on non-ASCII causing message_from_string
to raise an error. It now uses support.patch to induce an error during
message serialization.
In Python2, if a unicode string was assigned as the value of a header,
email would automatically CTE encode it using the UTF8 charset.
This capability was lost in the Python3 translation, and this patch
restores it.
Patch by Ali Ikinci, assisted by R. David Murray.
I also added a fix for the mailbox test that was depending (with a comment
that it was a bad idea to so depend) on non-ASCII causing message_from_string
to raise an error. It now uses support.patch to induce an error during
message serialization.
Analogous to the decode_header fix, this fix makes Header.append and
make_header correctly handle the unknown-8bit charset introduced by email5.1,
when the input to them is binary strings. Previous to this fix the
make_header(decode_header(x)) == x invariant was broken in the face of the
unknown-8bit charset.
This new interface will also allow for future planned enhancements
in control over the parser/generator without requiring any additional
complexity in the parser/generator API.
Patch reviewed by Éric Araujo and Barry Warsaw.
Why I consider this a bug rather than an API change: the API change was
to Message, which didn't used to return Headers unless you added them
yourself. Now it does (for 8bit binary header input), so decode_header
needs to be able to handle them.