message/delivery-status clause, and genericize it to handle all (other)
message/* content types. This lets us correctly parse 2 more of Anthony's
MIME torture tests (specifically, the message/external-body examples).
bunch of module globals that aren't used.
__maxheaderlen -> _maxheaderlen
_handle_multipart(): This should be more RFC compliant now, and does match the
updated/fixed semantics for preamble and epilogue.
(standard) tests, and doesn't throw parse errors. I still need throw
Anthony's torture test at it, but I wanted to get this checked in and off my
disk.
> ----------------------------
> revision 1.20.4.4
> date: 2003/06/12 09:14:17; author: anthonybaxter; state: Exp; lines: +13 -6
> preamble is None when missing, not ''.
> Handle a couple of bogus formatted messages - now parses my main testsuite.
> Handle message/external-body.
> ----------------------------
> revision 1.20.4.3
> date: 2003/06/12 07:16:40; author: anthonybaxter; state: Exp; lines: +6 -4
> epilogue-processing is now the same as the old parser - the newline at the
> end of the line with the --endboundary-- is included as part of the epilogue.
> Note that any whitespace after the boundary is _not_ part of the epilogue.
> ----------------------------
> revision 1.20.4.2
> date: 2003/06/12 06:39:09; author: anthonybaxter; state: Exp; lines: +6 -4
> message/delivery-status fixed.
> HeaderParser fixed.
> ----------------------------
> revision 1.20.4.1
> date: 2003/06/12 06:08:56; author: anthonybaxter; state: Exp; lines: +163 -129
> A work-in-progress snapshot of the new parser. A couple of known problems:
>
> - first (blank) line of MIME epilogues is being consumed
> - message/delivery-status isn't quite right
>
> It still needs a lot of cleanup, but right now it parses a whole lot of
> badness that the old parser failed on. I also need to think about adding
> back the old 'strict' flag in some way.
> =============================================================================
patch removes dependencies on the old unsupported KoreanCodecs package
and the alternative JapaneseCodecs package. Since both of those
provide aliases for their codecs, this removal just makes the generic
codec names work.
We needed to make slight changes to __init__() as well.
This will be backported to Python 2.3 when its branch freeze is over.
gets done when maxheaderlen <> 0. The header really gets wrapped via
the email.Header.Header class, which has a more sophisticated
algorithm than just splitting on semi-colons.
quotes. Fixes SF bug #794466, with the essential patch provided by
Stuart D. Gathman. Specifically,
_parseparam(), _get_params_preserve(): Use the parsing function that
takes quotes into account, as given (essentially) in the bug report's
test program.
Backport candidate.
test_rfc2231_no_language_or_charset_in_boundary(),
test_rfc2231_no_language_or_charset_in_charset(): New tests for proper
decoding of some RFC 2231 headers.
Backport candidate (as was the Utils.py 1.25 change) to both Python
2.3.1 and 2.2.4 -- will do momentarily.
can be None, and what to do in that situation.
get_filename(), get_boundary(), get_content_charset(): Make sure these
handle RFC 2231 headers without a CHARSET field.
Backport candidate (as was the Utils.py 1.25 change) to both Python
2.3.1 and 2.2.4 -- will do momentarily.
in some locales. This code simplifies the boundary algorithm to use
randint() which is what we wanted anyway.
Bump package version to 2.5.3.
Backport candidate for Python 2.2.3
long header lines is now (properly) in the Header class. So we no
longer need _split_header() and we'll just defer to Header.encode()
when we have a plain string.
_encode_chunks(): Pass maxlinelen in instead of always using
self._maxlinelen, so we can adjust for shorter initial lines.
Pass this value through to _max_append().
encode(): Weave maxlinelen through to the _encode_chunks() call.
_split_ascii(): When recursively splitting a line on spaces
(i.e. lower level syntactic split), don't append the whole returned
string. Instead, split it on linejoiners and extend the lines up to
the last line (for proper packing). Calculate the linelen based on
the last element in the this list.
part itself is longer than maxlen, and we aren't already splitting on
whitespace, then we recursively split the part on whitespace and
append that to the this list.
preserve spaces in the encoded/unencoded word boundaries. RFC 2047 is
ambiguous here, but most people expect the space to be preserved.
Really closes SF bug # 640110.
_split(): New implementation of ASCII line splitting which should do a
better job and not be subject to the various weird artifacts (bugs)
reported. This should also do a better job of higher-level syntactic
splits by trying first to split on semis, then commas, then
whitespace.
Use a Timbot-ly binary search for optimal non-ASCII split points for
better packing of header lines. This also lets us remove one
recursion call. Don't pass in firstline, but instead pass in the
actual line length we're shooting for. Also pass in the list of split
characters.
encode(): Pass in the list of split characters so applications can
have some control over what "higher level syntactic breaks" are.
Also,
decode_header(): Transform binascii.Errors which can occur when
decoding a base64 RFC 2047 header with bogus data, into an
email.Errors.HeaderParseError. Closes SF bug #696712.
_handle_multipart(): Ensure that if the preamble exists but does not
end in a newline, a newline is still added. Without this, the
boundary separator will end up on the preamble line, breaking the MIME
structure.
_make_boundary(): Handle differences in the decimal point character
based on the locale.
Charset: Alias __repr__ to __str__ for debugging.
header_encode(): When calling quopriMIME.header_encode(), set
maxlinelen=None so that the lower level function doesn't (also) try to
wrap/fold the line.
_max_append(): Change the comparison so that the new string is
concatenated if it's less than or equal to the max length.
header_encode(): Allow for maxlinelen == None to mean, don't do any
line splitting. This is because this module is mostly used by higher
level abstractions (Header.py) which already ensures line lengths. We
do this in a cheapo way by setting the max_encoding to some insanely
<100k wink> large number.
because the test file, msg_26.txt which has \r\n line endings, was
getting munged by cvs, which knows to do line ending conversions for
text files. But we want \r\n to be preserved on all platforms, so we
cvs admin'd the file to be -kb (binary), which means we have to open
the file in binary mode to preserve these line ends. Hopefully this
will be the end of the thrashing on this issue (but probably not).
Test passes on *nix now, and Tim confirms it passes on Windows. We'll
leave it to Jack to test MacOS.
is passed straight through to the unicode() and ustr.encode() calls.
I think it's the best we can do to address the UnicodeErrors in badly
encoded headers such as is described in SF bug #648119.
file, needed because some binary distros (read RPMs) don't include the
test module in their standard Python package. This eliminates an
external dependency and closes SF bug # 650441.
binary distros (read RPMs) don't include the test module in their
standard Python package. This eliminates an external dependency and
closes SF bug # 650441.
where in lax parsing, the first non-header line after a header block
(e.g. the first line not containing a colon, and not a continuation),
can be treated as the first body line, even without the RFC mandated
blank line separator.
rfc822 had this behavior, and I vaguely remember problems with this,
but can't remember details. In any event, all the tests still pass,
so I guess we'll find out. ;/
This patch works by returning the non-header, non-continuation line
from _parseheader() and using that as the first header line prepended
to fp.read() if given. It's usually None.
We use this approach instead of trying to seek/tell the file-like
object.
multipart/digest isn't a message/rfc822. This is legal, but counter
to recommended practice in RFC 2046, $5.1.5.
The fix is to look at the content type after setting the default
content type. If the maintype is then message or multipart, attach
the parsed subobject, otherwise use set_payload() to set the data of
the other object.
Ben. If s is a byte string, make sure it can be converted to unicode
with the input codec, and from unicode with the output codec, or raise
a UnicodeError exception early. Skip this test (and the unicode->byte
string conversion) when the charset is our faux 8bit raw charset.
must be a Charset instance, not a string. The bug here was that
self._charset wasn't being converted to a Charset instance so later
.append() calls which used the default charset would break.
_split(): If the charset of the chunk is '8bit', return the chunk
unchanged. We can't safely split it, so this is the avenue of least
harm.
8-bit data, we cannot split it safely, so return the original string
unchanged.
_is8bitstring(): Helper function which returns True when we have a
byte string that contains non-ascii characters (i.e. mysterious 8-bit
data).
Also, it fixes a really egregious error in Header.encode() (really
in Header._encode_chunks()) that could cause a header to grow and
grow each time encode() was called if output_codec was different
from input_codec.
Also, fix a typo.
the change in revision 1.11 (test_email.py) in response to SF bug
#609988. We now think that was the wrong fix and that WinZip was the
real culprit there.
get_type(). Also, one of the regular expressions is constant so might
as well make it a module global. And, when splitting up digests,
handle lineseps that are longer than 1 character in length
(e.g. \r\n).
semantics of header chunks using byte and Unicode strings.
Specifically,
append(): When the given string is a byte string, charset (whether
specified explicitly in the argument list or implicitly via the
constructor default) is the encoding of the byte string, and a
UnicodeError will be raised if the string cannot be decoded with that
charset. If s is a Unicode string, then charset is a hint specifying
the character set of the characters in the string. In this case, when
producing an RFC 2822 compliant header using RFC 2047 rules, the
Unicode string will be encoded using the following charsets in order:
us-ascii, the charset hint, utf-8.
__init__(): Use the global USASCII Charset instance when the charset
argument is None. Also, clarification in the docstring.
Also, use True/False where appropriate.
Python 2.1.3. However it's required by the email tests suite, so poke
it into the encodings aliases if it's missing. The is apparently the
approved API for doing so.
Now we can remove the hexversion shortcircuits in the test suite.
encoding flag SHORTEST means to return the shortest encoding between
base64 and qp. This is used for the header_enc for utf-8. SHORTEST
isn't legal for body_enc.
Also some code cleanup:
- use True/False everywhere
- use == instead of `is' in a few places
- added _unicode() and make consistent the "is unicode" checks
- update docstrings
project, and with assistance from Oleg Broytmann. Specifically,
added some new tests to make sure we handle RFC 2231 encoded
parameters correctly. Two new data files were added which contain RFC
2231 encoded parameters.
project, and with assistance from Oleg Broytmann. Specifically,
get_param(), get_params(): Document that these methods may return
parameter values that are either strings, or 3-tuples in the case of
RFC 2231 encoded parameters. The application should be prepared to
deal with such return values.
get_boundary(): Be prepared to deal with RFC 2231 encoded boundary
parameters. It makes little sense to have boundaries that are
anything but ascii, so if we get back a 3-tuple from get_param() we
will decode it into ascii and let any failures percolate up.
get_content_charset(): New method which treats the charset parameter
just like the boundary parameter in get_boundary(). Note that
"get_charset()" was already taken to return the default Charset
object.
get_charsets(): Rewrite to use get_content_charset().
Move the imports of Parser and Message inside the
message_from_string() and message_from_file() functions. This way
just "import email" won't suck in most of the submodules of the
package.
Note: this will break code that relied on "import email" giving you a
bunch of the submodules, but that was never documented and should not
have been relied on.
_handle_text(): Use _isstring() for stringiness test.
_handle_multipart(): Add a test before the ListType test, checking for
stringiness of the payload. String payloads for multitypes means a
message with broken MIME chrome was parsed by a lax parser. Instead
of raising a BoundaryError in those cases, the entire body is assigned
to the message payload (but since the content type is still
multipart/*, the Generator needs to be updated too).
Broytmann in SF patch #600096. Specifically, the former function now
encodes the triplets, while the latter adds optional charset and
language arguments.
2045, section 5.2 states that if the Content-Type: header is
syntactically invalid, the default type should be text/plain.
Implement minimal sanity checking of the header -- it must have
exactly one slash in it. This closes SF patch #597593 by Skip, but in
a different way.
Note that these methods used to raise ValueError for invalid ctypes,
but now they won't.
get the MIME main and sub types, instead of getting the whole ctype
and splitting it here. The two more specific methods now correctly
implement RFC 2045, section 5.2.
imports e.g. test_support must do so using an absolute package name
such as "import test.test_support" or "from test import test_support".
This also updates the README in Lib/test, and gets rid of the
duplicate data dirctory in Lib/test/data (replaced by
Lib/email/test/data).
Now Tim and Jack can have at it. :)
from test.test_support import TestSkipped, run_unittest
to
from test_support import TestSkipped, run_unittest
Otherwise, if the Japanese codecs aren't installed, regrtest doesn't
believe the TestSkipped exception raised by this test matches the
except (ImportError, test_support.TestSkipped), msg:
it's looking for, and reports the skip as a crash failure instead of
as a skipped test.
I suppose this will make it harder to run this test outside of
regrtest, but under the assumption only Barry does that, better to
make it skip cleanly for everyone else.
(i.e. email.test), so move the guts of them here from Lib/test. The
latter directory will retain stubs to run the email.test tests using
Python's standard regression test.
test_email_torture.py is a torture tester which will not run under
Python's test suite because I don't want to commit megs of data to
that project (it will fail cleanly there). When run under the mimelib
project it'll stress test the package with megs of message samples
collected from various locations in the wild.
(i.e. email.test), so move the guts of them here from Lib/test. The
latter directory will retain stubs to run the email.test tests using
Python's standard regression test.
test_email_torture.py is a torture tester which will not run under
Python's test suite because I don't want to commit megs of data to
that project (it will fail cleanly there). When run under the mimelib
project it'll stress test the package with megs of message samples
collected from various locations in the wild.
email/test/data is a copy of Lib/test/data. The fate of the latter is
still undecided.
backwards compatibility, we're silently deprecating get_type(),
get_subtype() and get_main_type(). We may eventually noisily
deprecate these. For now, we'll just fix a bug in the splitting of
the main and subtypes.
get_content_type(), get_content_maintype(), get_content_subtype(): New
methods which replace the above. These /always/ return a content type
string and do not take a failobj, because an email message always at
least has a default content type.
set_default_type(): Someday there may be additional default content
types, so don't hard code an assertion about the value of the ctype
argument.
quoting:
in non-strict mode, messages don't require a blank line at the end
with a missing end-terminator. A single newline is sufficient now.
Handle trailing whitespace at the end of a boundary. Had to switch
from using string.split() to re.split()
Handle whitespace on the end of a parameter list for Content-type.
Handle whitespace on the end of a plain content-type header.
Specifically,
get_type(): Strip the content type string.
_get_params_preserve(): Strip the parameter names and values on both
sides.
_parsebody(): Lots of changes as described above, with some stylistic
changes by Barry (who hopefully didn't screw things up ;).
create a Header instance. Closes feature request #539481.
Header.__init__(): Allow the initial string to be omitted.
__eq__(), __ne__(): Support rich comparisons for equality of Header
instances withy Header instances or strings.
Also, update a bunch of docstrings.
argument to the constructor -- defaulting to true -- which is
different than Anthony's approach of using global state.
parse(), parsestr(): Grow a `headersonly' argument which stops parsing
once the header block has been seen, i.e. it does /not/ parse or even
read the body of the message. This is used for parsing message/rfc822
type messages.
We need test cases for the non-strict parsing. Anthony will supply
these.
_parsebody(): We can get rid of the isdigest end-of-line kludges,
although we still need to know if we're parsing a multipart/digest so
we can set the default type accordingly.
text/plain but the RFCs state that inside a multipart/digest, the
default type is message/rfc822. To preserve idempotency, we need a
separate place to define the default type than the Content-Type:
header.
get_default_type(), set_default_type(): Accessor and mutator methods
for the default type.
recursive generation).
_dispatch(): If the message object doesn't have a Content-Type:
header, check its default type instead of assuming it's text/plain.
This makes for correct generation of message/rfc822 containers.
_handle_multipart(): We can get rid of the isdigest kludge. Just
print the message as normal and everything will work out correctly.
_handle_mulitpart_digest(): We don't need this anymore either.
Specifically,
decode_rfc2231(), encode_rfc2231(): Functions to encode and decode RFC
2231 style parameters.
decode_params(): Function to decode a list of parameters.
Specifically,
_formatparam(): Teach this about encoded `param' arguments, which are
a 3-tuple of items (charset, language, value). language is ignored.
_unquotevalue(): Handle both 3-tuple RFC 2231 values and unencoded
values.
_get_params_preserve(): Decode the parameters before returning them.
get_params(), get_param(): Use _unquotevalue().
get_filename(), get_boundary(): Teach these about encoded (3-tuple)
parameters.
headers with no charset or 'us-ascii' charsets. Actually this is only
partially true: we know about semicolons (but not true parameters) and
we know about whitespace (but not technically folding whitespace).
Still it should be good enough for all practical purposes.
Other changes include:
__init__(): Add a continuation_ws argument, which defaults to a single
space. Set this to change the whitespace used for continuation lines
when a header must be split. Also, changed the way header line
lengths are calculated, so that they take into account continuation_ws
(when tabs-expanded) and any provided header_name parameter. This
should do much better on returning split headers for which the first
and subsequent lines must fit into a specified width.
guess_maxlinelen(): Removed. I don't think we need this method as
part of the public API.
encode_chunks() -> _encode_chunks(): I don't think we need this one as
part of the public API either.
know anything about RFC 2047 encoded headers. Fortunately we have a
perfectly good header splitter in Header.encode(). So we just call
that to give us a properly formatted and split header.
Header.encode() didn't know about "highest-level syntactic breaks" but
that's been fixed now too.
__call__() can be 2-3x slower than the equivalent normal method.
_handle_message(): The structure of message/rfc822 message has
changed. Now parent's payload is a list of length 1, and the zeroth
element is the Message sub-object. Adjust the printing of such
message trees to reflect this change.
subclasses.
MIMENonMultipart: Base class for non-multipart/* content type subclass
specializations, e.g. image/gif. This class overrides attach() which
raises an exception, since it makes no sense to attach a subpart to
e.g. an image/gif message.
MIMEMultipart: Base class for multipart/* content type subclass
specializations, e.g. multipart/mixed. Does little more than provide
a useful constructor.
email package's Parser to handle the three common line endings.
Certain protocols such as IMAP define CRLF line endings and it doesn't
make sense for the client app to have to normalize the line endings
before handing it message off to the Parser.
_parsebody(): Be more flexible in the matching of line endings for
finding the MIME separators. Accept any of \r, \n and \r\n. Note
that we do /not/ change the line endings in the payloads, we just
accept any of those three around MIME boundaries.
single byte character sets. Also fixed a semantic problem with the
constructor's default arguments. Specifically,
__init__(): Change the maxlinelen argument default to None instead of
MAXLINELEN. The semantics should have been (and now are) that if
maxlinelen is given it is always honored. If it isn't given, but
header_name is given, then the maximum line length is calculated. If
neither are given then the default 76 characters is used.
_split(): If the character set is a single byte character set then we
can split the line at the maxlinelen because we know that encoding the
header won't increase its length. If the charset isn't a single byte
charset then we use the quicker divide-and-conquer line splitting
algorithm as before.
for the email package. The former is now just a shell project that
has some extra files for packaging for independent use (e.g. setup.py
and README).
Added a compatibility layer so that the same API can be used in Python
2.1 and 2.2/2.3 with the major differences shuffled off into helper
modules (_compat21.py and _compat22.py).
Also bumped the package version number to 2.0.3 for some fixes to be
checked in momentarily.
double call to AddressList.getaddrlist(), and /that/ always returns an
empty list for the second and subsequent calls.
Instead, instantiate an AddressList directly, and get the parsed
addresses out of the addresslist attribute.
non-us-ascii character sets in headers and bodies. Some API changes
(with DeprecationWarnings for the old APIs). Better RFC-compliant
implementations of base64 and quoted-printable.
Updated test cases. Documentation updates to follow (after I finish
writing them ;).