#1349106: add linesep argument to generator.flatten and header.encode.

R. David Murray 2010-10-23 22:19:56 +00:00
parent 29aad0005d
commit 8451c4b6e0
7 changed files with 164 additions and 49 deletions

@ -56,7 +56,7 @@ Here are the public methods of the :class:`Generator` class, imported from the
The other public :class:`Generator` methods are:
.. method:: flatten(msg, unixfrom=False)
.. method:: flatten(msg, unixfrom=False, linesep='\\n')
Print the textual representation of the message object structure rooted at
*msg* to the output file specified when the :class:`Generator` instance
@ -71,12 +71,20 @@ Here are the public methods of the :class:`Generator` class, imported from the
Note that for subparts, no envelope header is ever printed.
Optional *linesep* specifies the string used to terminate lines in the
output. It defaults to ``\n`` because that is the most useful value for
Python application code (other library packages expect ``\n``-separated
lines). Pass ``linesep='\r\n'`` to generate output with RFC-compliant
line separators.
Messages parsed with a Bytes parser that have a
:mailheader:`Content-Transfer-Encoding` of 8bit will be converted to
use a 7bit Content-Transfer-Encoding. Any other non-ASCII bytes in the
message structure will be converted to '?' characters.
.. versionchanged:: 3.2 added support for re-encoding 8bit message bodies.
.. versionchanged:: 3.2
added support for re-encoding 8bit message bodies, and the linesep
argument
.. method:: clone(fp)
@ -97,16 +105,70 @@ formatted string representation of a message object. For more detail, see
.. class:: BytesGenerator(outfp, mangle_from_=True, maxheaderlen=78)
This class has the same API as the :class:`Generator` class, except that
*outfp* must be a file-like object that will accept :class:`bytes` input to
its ``write`` method. If the message object structure contains non-ASCII
bytes, this generator's :meth:`~BytesGenerator.flatten` method will produce
them as-is, including preserving parts with a
:mailheader:`Content-Transfer-Encoding` of ``8bit``.
The constructor for the :class:`BytesGenerator` class takes a binary
:term:`file-like object` called *outfp* for an argument. *outfp* must
support a :meth:`write` method that accepts binary data.
Note that even the :meth:`write` method API is identical: it expects
strings as input, and converts them to bytes by encoding them using
the ASCII codec.
Optional *mangle_from_* is a flag that, when ``True``, puts a ``>``
character in front of any line in the body that starts exactly as ``From``,
i.e. ``From`` followed by a space at the beginning of the line. This is the
only guaranteed portable way to avoid having such lines be mistaken for a
Unix mailbox format envelope header separator (see `WHY THE CONTENT-LENGTH
FORMAT IS BAD <http://www.jwz.org/doc/content-length.html>`_ for details).
*mangle_from_* defaults to ``True``, but you might want to set this to
``False`` if you are not writing Unix mailbox format files.
Optional *maxheaderlen* specifies the longest length for a non-continued
header. When a header line is longer than *maxheaderlen* (in characters,
with tabs expanded to 8 spaces), the header will be split as defined in the
:class:`~email.header.Header` class. Set to zero to disable header
wrapping. The default is 78, as recommended (but not required) by
:rfc:`2822`.
The other public :class:`BytesGenerator` methods are:
.. method:: flatten(msg, unixfrom=False, linesep='\n')
Print the textual representation of the message object structure rooted
at *msg* to the output file specified when the :class:`BytesGenerator`
instance was created. Subparts are visited depth-first and the resulting
text will be properly MIME encoded. If the input that created the *msg*
contained bytes with the high bit set and those bytes have not been
modified, they will be copied faithfully to the output, even if doing so
is not strictly RFC compliant. (To produce strictly RFC compliant
output, use the :class:`Generator` class.)
Messages parsed with a Bytes parser that have a
:mailheader:`Content-Transfer-Encoding` of 8bit will be reconstructed
as 8bit if they have not been modified.
Optional *unixfrom* is a flag that forces the printing of the envelope
header delimiter before the first :rfc:`2822` header of the root message
object. If the root object has no envelope header, a standard one is
crafted. By default, this is set to ``False`` to inhibit the printing of
the envelope delimiter.
Note that for subparts, no envelope header is ever printed.
Optional *linesep* specifies the string used to terminate lines in the
output. It defaults to ``\n`` because that is the most useful value for
Python application code (other library packages expect ``\n``-separated
lines). Pass ``linesep='\r\n'`` to generate output with RFC-compliant
line separators.
.. method:: clone(fp)
Return an independent clone of this :class:`BytesGenerator` instance with
the exact same options.
.. method:: write(s)
Write the string *s* to the underlying file object. *s* is encoded using
the ``ASCII`` codec and written to the :meth:`write` method of the
*outfp* passed to the :class:`BytesGenerator`'s constructor. This
provides just enough file-like API for :class:`BytesGenerator` instances
to be used in the :func:`print` function.
.. versionadded:: 3.2
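
As a rough illustration of the documented *linesep* argument, the sketch below flattens one small message three ways. The message contents and variable names are made up for the example. It only relies on the header lines ending with the requested separator, since how body line endings are treated may differ between Python versions.

```python
from io import BytesIO, StringIO
from email.generator import BytesGenerator, Generator
from email.message import Message

# A throwaway message; the header values are illustrative only.
msg = Message()
msg['Subject'] = 'linesep demo'
msg['To'] = 'someone@example.com'
msg.set_payload('first line\nsecond line\n')

# Default: lines are terminated with '\n'.
text_out = StringIO()
Generator(text_out).flatten(msg)
default_text = text_out.getvalue()

# RFC-style output: lines are terminated with '\r\n'.
crlf_out = StringIO()
Generator(crlf_out).flatten(msg, linesep='\r\n')
crlf_text = crlf_out.getvalue()

# BytesGenerator takes the same str argument but writes bytes.
bytes_out = BytesIO()
BytesGenerator(bytes_out).flatten(msg, linesep='\r\n')
crlf_bytes = bytes_out.getvalue()

print(repr(default_text))
print(repr(crlf_text))
print(repr(crlf_bytes))
```

Note that *linesep* is passed as a ``str`` even to :class:`BytesGenerator`; the generator encodes it internally.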

@ -104,7 +104,7 @@ Here is the :class:`Header` class description:
:func:`ustr.encode` call, and defaults to "strict".
.. method:: encode(splitchars=';, \\t', maxlinelen=None)
.. method:: encode(splitchars=';, \\t', maxlinelen=None, linesep='\\n')
Encode a message header into an RFC-compliant format, possibly wrapping
long lines and encapsulating non-ASCII parts in base64 or quoted-printable
@ -115,6 +115,13 @@ Here is the :class:`Header` class description:
*maxlinelen*, if given, overrides the instance's value for the maximum
line length.
*linesep* specifies the characters used to separate the lines of the
folded header. It defaults to the most useful value for Python
application code (``\n``), but ``\r\n`` can be specified in order
to produce headers with RFC-compliant line separators.
.. versionchanged:: 3.2 added the linesep argument
The :class:`Header` class also provides a number of methods to support
standard operators and built-in functions.
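
A minimal sketch of the new ``Header.encode`` argument, assuming a long ASCII value that has to be folded; the value and the small *maxlinelen* are chosen only to force folding and are not taken from the commit.

```python
from email.header import Header

# A long ASCII value so that encode() has to fold it across several lines.
long_value = ' '.join('word%d' % i for i in range(30))
header = Header(long_value, header_name='Subject', maxlinelen=40)

# Default separator between the folded lines is '\n'.
folded_lf = header.encode()

# Ask for RFC-style separators between the folded lines instead.
folded_crlf = header.encode(linesep='\r\n')

print(repr(folded_lf))
print(repr(folded_crlf))
```

The separator only appears between folded lines; ``encode`` does not append a trailing separator, which is why the generator code adds one itself after each header.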

@ -17,7 +17,7 @@ from email.header import Header
from email.message import _has_surrogates
UNDERSCORE = '_'
NL = '\n'
NL = '\n' # XXX: no longer used by the code below.
fcre = re.compile(r'^From ', re.MULTILINE)
@ -58,7 +58,7 @@ class Generator:
# Just delegate to the file object
self._fp.write(s)
def flatten(self, msg, unixfrom=False):
def flatten(self, msg, unixfrom=False, linesep='\n'):
"""Print the message object tree rooted at msg to the output file
specified when the Generator instance was created.
@ -68,12 +68,23 @@ class Generator:
is False to inhibit the printing of any From_ delimiter.
Note that for subobjects, no From_ line is printed.
linesep specifies the characters used to indicate a new line in
the output.
"""
# We use the _XXX constants for operating on data that comes directly
# from the msg, and _encoded_XXX constants for operating on data that
# has already been converted (to bytes in the BytesGenerator) and
# inserted into a temporary buffer.
self._NL = linesep
self._encoded_NL = self._encode(linesep)
self._EMPTY = ''
self._encoded_EMPTY = self._encode('')
if unixfrom:
ufrom = msg.get_unixfrom()
if not ufrom:
ufrom = 'From nobody ' + time.ctime(time.time())
self.write(ufrom + NL)
self.write(ufrom + self._NL)
self._write(msg)
def clone(self, fp):
@ -93,20 +104,18 @@ class Generator:
# it has already transformed the input; but, since this whole thing is a
# hack anyway this seems good enough.
# We use these class constants when we need to manipulate data that has
# already been written to a buffer (ex: constructing a re to check the
# boundary), and the module level NL constant when adding new output to a
# buffer via self.write, because 'write' always takes strings.
# Having write always take strings makes the code simpler, but there are
# a few occasions when we need to write previously created data back
# to the buffer or to a new buffer; for those cases we use self._fp.write.
_NL = NL
_EMPTY = ''
# Similarly, we have _XXX and _encoded_XXX attributes that are used on
# source and buffer data, respectively.
_encoded_EMPTY = ''
def _new_buffer(self):
# BytesGenerator overrides this to return BytesIO.
return StringIO()
def _encode(self, s):
# BytesGenerator overrides this to encode strings to bytes.
return s
def _write(self, msg):
# We can't write the headers yet because of the following scenario:
# say a multipart message includes the boundary string somewhere in
@ -158,14 +167,15 @@ class Generator:
for h, v in msg.items():
self.write('%s: ' % h)
if isinstance(v, Header):
self.write(v.encode(maxlinelen=self._maxheaderlen)+NL)
self.write(v.encode(
maxlinelen=self._maxheaderlen, linesep=self._NL)+self._NL)
else:
# Header's got lots of smarts, so use it.
header = Header(v, maxlinelen=self._maxheaderlen,
header_name=h)
self.write(header.encode()+NL)
self.write(header.encode(linesep=self._NL)+self._NL)
# A blank line always separates headers from body
self.write(NL)
self.write(self._NL)
#
# Handlers for writing types and subtypes
@ -208,11 +218,11 @@ class Generator:
for part in subparts:
s = self._new_buffer()
g = self.clone(s)
g.flatten(part, unixfrom=False)
g.flatten(part, unixfrom=False, linesep=self._NL)
msgtexts.append(s.getvalue())
# Now make sure the boundary we've selected doesn't appear in any of
# the message texts.
alltext = self._NL.join(msgtexts)
alltext = self._encoded_NL.join(msgtexts)
# BAW: What about boundaries that are wrapped in double-quotes?
boundary = msg.get_boundary(failobj=self._make_boundary(alltext))
# If we had to calculate a new boundary because the body text
@ -225,9 +235,9 @@ class Generator:
msg.set_boundary(boundary)
# If there's a preamble, write it out, with a trailing CRLF
if msg.preamble is not None:
self.write(msg.preamble + NL)
self.write(msg.preamble + self._NL)
# dash-boundary transport-padding CRLF
self.write('--' + boundary + NL)
self.write('--' + boundary + self._NL)
# body-part
if msgtexts:
self._fp.write(msgtexts.pop(0))
@ -236,13 +246,13 @@ class Generator:
# --> CRLF body-part
for body_part in msgtexts:
# delimiter transport-padding CRLF
self.write('\n--' + boundary + NL)
self.write(self._NL + '--' + boundary + self._NL)
# body-part
self._fp.write(body_part)
# close-delimiter transport-padding
self.write('\n--' + boundary + '--')
self.write(self._NL + '--' + boundary + '--')
if msg.epilogue is not None:
self.write(NL)
self.write(self._NL)
self.write(msg.epilogue)
def _handle_multipart_signed(self, msg):
@ -266,16 +276,16 @@ class Generator:
g = self.clone(s)
g.flatten(part, unixfrom=False, linesep=self._NL)
text = s.getvalue()
lines = text.split(self._NL)
lines = text.split(self._encoded_NL)
# Strip off the unnecessary trailing empty line
if lines and lines[-1] == self._EMPTY:
blocks.append(self._NL.join(lines[:-1]))
if lines and lines[-1] == self._encoded_EMPTY:
blocks.append(self._encoded_NL.join(lines[:-1]))
else:
blocks.append(text)
# Now join all the blocks with an empty line. This has the lovely
# effect of separating each block with an empty line, but not adding
# an extra one after the last one.
self._fp.write(self._NL.join(blocks))
self._fp.write(self._encoded_NL.join(blocks))
def _handle_message(self, msg):
s = self._new_buffer()
@ -333,10 +343,9 @@ class BytesGenerator(Generator):
The outfp object must accept bytes in its write method.
"""
# Bytes versions of these constants for use in manipulating data from
# Bytes version of this constant, for use in manipulating data from
# the BytesIO buffer.
_NL = NL.encode('ascii')
_EMPTY = b''
_encoded_EMPTY = b''
def write(self, s):
self._fp.write(s.encode('ascii', 'surrogateescape'))
@ -344,6 +353,9 @@ class BytesGenerator(Generator):
def _new_buffer(self):
return BytesIO()
def _encode(self, s):
return s.encode('ascii')
def _write_headers(self, msg):
# This is almost the same as the string version, except for handling
# strings with 8bit bytes.
@ -363,9 +375,9 @@ class BytesGenerator(Generator):
# Header's got lots of smarts and this string is safe...
header = Header(v, maxlinelen=self._maxheaderlen,
header_name=h)
self.write(header.encode()+NL)
self.write(header.encode(linesep=self._NL)+self._NL)
# A blank line always separates headers from body
self.write(NL)
self.write(self._NL)
def _handle_text(self, msg):
# If the string has surrogates the original source was bytes, so
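
The split between ``_NL`` (used with ``write``, which always takes strings) and ``_encoded_NL`` (used on data already sitting in a buffer) can be sketched outside the email package. The classes below are hypothetical stand-ins, not the real ``Generator`` and ``BytesGenerator``; only the ``_encode`` and ``_new_buffer`` override pattern mirrors the diff above.

```python
from io import BytesIO, StringIO


class TextWriter:
    """Hypothetical stand-in for Generator: writes str to a text file object."""

    def __init__(self, fp):
        self._fp = fp

    def _new_buffer(self):
        return StringIO()

    def _encode(self, s):
        # The text version stores separators unchanged.
        return s

    def write(self, s):
        # write() always takes str, whatever the underlying file accepts.
        self._fp.write(s)

    def emit(self, parts, linesep='\n'):
        self._NL = linesep                        # for new str output via write()
        self._encoded_NL = self._encode(linesep)  # for data taken from buffers
        rendered = []
        for part in parts:
            buf = self._new_buffer()
            self.__class__(buf).write(part)
            rendered.append(buf.getvalue())       # str or bytes, per subclass
        self.write('--start' + self._NL)
        # Buffer contents go straight to the file, joined with the encoded form.
        self._fp.write(self._encoded_NL.join(rendered))
        self.write(self._NL + '--end' + self._NL)


class BytesWriter(TextWriter):
    """Hypothetical stand-in for BytesGenerator: same API, bytes output."""

    def _new_buffer(self):
        return BytesIO()

    def _encode(self, s):
        return s.encode('ascii')

    def write(self, s):
        self._fp.write(s.encode('ascii', 'surrogateescape'))


text_fp = StringIO()
TextWriter(text_fp).emit(['a', 'b'], linesep='\r\n')
text_result = text_fp.getvalue()

bytes_fp = BytesIO()
BytesWriter(bytes_fp).emit(['a', 'b'], linesep='\r\n')
bytes_result = bytes_fp.getvalue()

print(repr(text_result))
print(repr(bytes_result))
```

Keeping both forms on the instance means the shared ``emit`` logic never has to know whether it is joining ``str`` or ``bytes``.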

@ -272,7 +272,7 @@ class Header:
output_string = input_bytes.decode(output_charset, errors)
self._chunks.append((output_string, charset))
def encode(self, splitchars=';, \t', maxlinelen=None):
def encode(self, splitchars=';, \t', maxlinelen=None, linesep='\n'):
"""Encode a message header into an RFC-compliant format.
There are many issues involved in converting a given string for use in
@ -293,6 +293,11 @@ class Header:
Optional splitchars is a string containing characters to split long
ASCII lines on, in rough support of RFC 2822's `highest level
syntactic breaks'. This doesn't affect RFC 2047 encoded lines.
Optional linesep is a string to be used to separate the lines of
the value. The default value is the most useful for typical
Python applications, but it can be set to \r\n to produce RFC-compliant
line separators when needed.
"""
self._normalize()
if maxlinelen is None:
@ -311,7 +316,7 @@ class Header:
if len(lines) > 1:
formatter.newline()
formatter.add_transition()
return str(formatter)
return formatter._str(linesep)
def _normalize(self):
# Step 1: Normalize the chunks so that all runs of identical charsets
@ -342,9 +347,12 @@ class _ValueFormatter:
self._lines = []
self._current_line = _Accumulator(headerlen)
def __str__(self):
def _str(self, linesep):
self.newline()
return NL.join(self._lines)
return linesep.join(self._lines)
def __str__(self):
return self._str(NL)
def newline(self):
end_of_line = self._current_line.pop()

@ -24,7 +24,8 @@ Simple email with attachment.
--1618492860--2051301190--113853680
Content-Type: application/riscos; name="clock.bmp,69c"; type=BMP; load=&fff69c4b; exec=&355dd4d1; access=&03
Content-Type: application/riscos; name="clock.bmp,69c"; type=BMP;
load=&fff69c4b; exec=&355dd4d1; access=&03
Content-Disposition: attachment; filename="clock.bmp"
Content-Transfer-Encoding: base64

@ -77,7 +77,7 @@ class TestMessageAPI(TestEmailBase):
eq(msg.get_all('cc'), ['ccc@zzz.org', 'ddd@zzz.org', 'eee@zzz.org'])
eq(msg.get_all('xx', 'n/a'), 'n/a')
def test_getset_charset(self):
def test_getset_charset(self):
eq = self.assertEqual
msg = Message()
eq(msg.get_charset(), None)
@ -2600,6 +2600,18 @@ Here's the message body
part2 = msg.get_payload(1)
eq(part2.get_content_type(), 'application/riscos')
def test_crlf_flatten(self):
# Using newline='\n' preserves the crlfs in this input file.
with openfile('msg_26.txt', newline='\n') as fp:
text = fp.read()
msg = email.message_from_string(text)
s = StringIO()
g = Generator(s)
g.flatten(msg, linesep='\r\n')
self.assertEqual(s.getvalue(), text)
maxDiff = None
def test_multipart_digest_with_extra_mime_headers(self):
eq = self.assertEqual
neq = self.ndiffAssertEqual
@ -2931,6 +2943,16 @@ class Test8BitBytesHandling(unittest.TestCase):
m = bfp.close()
self.assertEqual(str(m), self.latin_bin_msg_as7bit)
def test_crlf_flatten(self):
with openfile('msg_26.txt', 'rb') as fp:
text = fp.read()
msg = email.message_from_bytes(text)
s = BytesIO()
g = email.generator.BytesGenerator(s)
g.flatten(msg, linesep='\r\n')
self.assertEqual(s.getvalue(), text)
maxDiff = None
class TestBytesGeneratorIdempotent(TestIdempotent):
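
The two ``test_crlf_flatten`` tests depend on the ``msg_26.txt`` data file. A self-contained approximation of the same round trip, using a small inline message invented for this sketch (the helper names ``flatten_text`` and ``flatten_bytes`` are hypothetical), might look like this:

```python
import email
from io import BytesIO, StringIO
from email.generator import BytesGenerator, Generator

# Inline CRLF-terminated message standing in for the msg_26.txt test file.
CRLF_TEXT = (
    'From: sender@example.com\r\n'
    'Subject: round trip\r\n'
    '\r\n'
    'Body line one\r\n'
    'Body line two\r\n'
)


def flatten_text(text, linesep):
    """Parse a str message and flatten it again with the given separator."""
    msg = email.message_from_string(text)
    out = StringIO()
    Generator(out).flatten(msg, linesep=linesep)
    return out.getvalue()


def flatten_bytes(data, linesep):
    """Parse a bytes message and flatten it again with the given separator."""
    msg = email.message_from_bytes(data)
    out = BytesIO()
    BytesGenerator(out).flatten(msg, linesep=linesep)
    return out.getvalue()


text_round_trip = flatten_text(CRLF_TEXT, '\r\n')
bytes_round_trip = flatten_bytes(CRLF_TEXT.encode('ascii'), '\r\n')

print(text_round_trip == CRLF_TEXT)
print(bytes_round_trip == CRLF_TEXT.encode('ascii'))
```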

@ -48,6 +48,9 @@ Core and Builtins
Library
-------
- Issue #1349106: Generator (and BytesGenerator) flatten method and Header
encode method now support a 'linesep' argument.
- Issue #5639: Add a *server_hostname* argument to ``SSLContext.wrap_socket``
in order to support the TLS SNI extension. ``HTTPSConnection`` and
``urlopen()`` also use this argument, so that HTTPS virtual hosts are now