cpython

Commit Graph

Author	SHA1	Message	Date
Barry Warsaw	5b8c69f11e	_split_ascii() [method and function]: Don't join the lines just to split them again. Simply return them as chunk lists. _encode_chunks(): Don't add more folding whitespace than necessary.	2003-03-10 15:14:08 +00:00
Barry Warsaw	33975eac3d	_split_ascii(): lstrip the individual lines in the ascii split lines, since we'll be adding our own continuation whitespace later.	2003-03-07 23:24:34 +00:00
Barry Warsaw	9f3fcd9c23	More internal refinements of the ascii splitting algorithm. _encode_chunks(): Pass maxlinelen in instead of always using self._maxlinelen, so we can adjust for shorter initial lines. Pass this value through to _max_append(). encode(): Weave maxlinelen through to the _encode_chunks() call. _split_ascii(): When recursively splitting a line on spaces (i.e. lower level syntactic split), don't append the whole returned string. Instead, split it on linejoiners and extend the lines up to the last line (for proper packing). Calculate the linelen based on the last element in the this list.	2003-03-07 15:39:37 +00:00
Tim Peters	2b4821347f	Repaired a misleading comment Barry inherited from me.	2003-03-06 23:41:58 +00:00
Barry Warsaw	bd836dfba3	_split_ascii(): In the clause where curlen + partlen > maxlen, if the part itself is longer than maxlen, and we aren't already splitting on whitespace, then we recursively split the part on whitespace and append that to the this list.	2003-03-06 20:33:04 +00:00
Barry Warsaw	4848805341	__unicode__(): When converting to a unicode string, we need to preserve spaces in the encoded/unencoded word boundaries. RFC 2047 is ambiguous here, but most people expect the space to be preserved. Really closes SF bug # 640110.	2003-03-06 16:10:30 +00:00
Barry Warsaw	671c3e6373	decode_header(): Typo when appending an unencoded chunk to the previous unencoded chunk (e.g. when they appear on separate lines). Closes the 2nd bug in SF #640110 (the first one's already been fixed).	2003-03-06 06:37:42 +00:00
Barry Warsaw	e899e51c06	Merge of the folding-reimpl-branch. Specific changes, _split(): New implementation of ASCII line splitting which should do a better job and not be subject to the various weird artifacts (bugs) reported. This should also do a better job of higher-level syntactic splits by trying first to split on semis, then commas, then whitespace. Use a Timbot-ly binary search for optimal non-ASCII split points for better packing of header lines. This also lets us remove one recursion call. Don't pass in firstline, but instead pass in the actual line length we're shooting for. Also pass in the list of split characters. encode(): Pass in the list of split characters so applications can have some control over what "higher level syntactic breaks" are. Also, decode_header(): Transform binascii.Errors which can occur when decoding a base64 RFC 2047 header with bogus data, into an email.Errors.HeaderParseError. Closes SF bug #696712.	2003-03-06 05:39:46 +00:00
Barry Warsaw	f4fdff715a	Header.__init__(), .append(): Add an optional argument `errors' which is passed straight through to the unicode() and ustr.encode() calls. I think it's the best we can do to address the UnicodeErrors in badly encoded headers such as is described in SF bug #648119.	2002-12-30 19:13:00 +00:00
Barry Warsaw	67f8f2fe2a	append(): Fixing the test for convertability after consultation with Ben. If s is a byte string, make sure it can be converted to unicode with the input codec, and from unicode with the output codec, or raise a UnicodeError exception early. Skip this test (and the unicode->byte string conversion) when the charset is our faux 8bit raw charset.	2002-10-14 16:52:41 +00:00
Barry Warsaw	5e3bcff651	__init__(): Fix an invariant, that the charset item in a chunk tuple must be a Charset instance, not a string. The bug here was that self._charset wasn't being converted to a Charset instance so later .append() calls which used the default charset would break. _split(): If the charset of the chunk is '8bit', return the chunk unchanged. We can't safely split it, so this is the avenue of least harm.	2002-10-14 15:13:17 +00:00
Barry Warsaw	0c358258c9	_encode_chunks(), encode(): Don't modify self._chunks. As Ben says: Also, it fixes a really egregious error in Header.encode() (really in Header._encode_chunks()) that could cause a header to grow and grow each time encode() was called if output_codec was different from input_codec. Also, fix a typo.	2002-10-13 04:06:28 +00:00
Barry Warsaw	48330687f3	Docstring consistency with the updated .tex files.	2002-09-30 23:07:35 +00:00
Barry Warsaw	174aa49a88	With help from Martin v. Loewis, clarification is added for the semantics of header chunks using byte and Unicode strings. Specifically, append(): When the given string is a byte string, charset (whether specified explicitly in the argument list or implicitly via the constructor default) is the encoding of the byte string, and a UnicodeError will be raised if the string cannot be decoded with that charset. If s is a Unicode string, then charset is a hint specifying the character set of the characters in the string. In this case, when producing an RFC 2822 compliant header using RFC 2047 rules, the Unicode string will be encoded using the following charsets in order: us-ascii, the charset hint, utf-8. __init__(): Use the global USASCII Charset instance when the charset argument is None. Also, clarification in the docstring. Also, use True/False where appropriate.	2002-09-30 15:51:31 +00:00
Barry Warsaw	45d9bde6c1	_ascii_split(): Don't lstrip continuation lines. Closes SF bug #601392.	2002-09-10 15:57:29 +00:00
Barry Warsaw	92825a9a52	append(): Bite the bullet and let charset be the string name of a character set, which we'll convert to a Charset instance. Sigh.	2002-07-23 06:08:10 +00:00
Barry Warsaw	15d3739446	make_header(): Watch out for charset is None, which decode_header() will return as the charset if implicit us-ascii is used.	2002-07-23 04:29:54 +00:00
Barry Warsaw	8da39aa56a	make_header(): New function to take the output of decode_header() and create a Header instance. Closes feature request #539481. Header.__init__(): Allow the initial string to be omitted. __eq__(), __ne__(): Support rich comparisons for equality of Header instances with Header instances or strings. Also, update a bunch of docstrings.	2002-07-09 16:33:47 +00:00
Barry Warsaw	6ee7156996	append(): Clarify the expected type of charset.	2002-07-03 05:04:04 +00:00
Barry Warsaw	8e69bdac33	__unicode__(): Patch # 541263 by Mikhail Zabaluev, implementation modified by Barry.	2002-06-29 03:26:58 +00:00
Barry Warsaw	766125080f	Teach this class about "highest-level syntactic breaks" but only for headers with no charset or 'us-ascii' charsets. Actually this is only partially true: we know about semicolons (but not true parameters) and we know about whitespace (but not technically folding whitespace). Still it should be good enough for all practical purposes. Other changes include: __init__(): Add a continuation_ws argument, which defaults to a single space. Set this to change the whitespace used for continuation lines when a header must be split. Also, changed the way header line lengths are calculated, so that they take into account continuation_ws (when tabs-expanded) and any provided header_name parameter. This should do much better on returning split headers for which the first and subsequent lines must fit into a specified width. guess_maxlinelen(): Removed. I don't think we need this method as part of the public API. encode_chunks() -> _encode_chunks(): I don't think we need this one as part of the public API either.	2002-06-28 23:46:53 +00:00
Barry Warsaw	1c30aa2292	The _compat modules now export _floordiv() instead of _intdiv2() for better code reuse. _split() Use _floordiv().	2002-06-01 05:49:17 +00:00
Tim Peters	8ac1495a6a	Whitespace normalization.	2002-05-23 15:15:30 +00:00
Barry Warsaw	812031b955	Fixed a bug in the splitting of lines, and improved the splitting for single byte character sets. Also fixed a semantic problem with the constructor's default arguments. Specifically, __init__(): Change the maxlinelen argument default to None instead of MAXLINELEN. The semantics should have been (and now are) that if maxlinelen is given it is always honored. If it isn't given, but header_name is given, then the maximum line length is calculated. If neither are given then the default 76 characters is used. _split(): If the character set is a single byte character set then we can split the line at the maxlinelen because we know that encoding the header won't increase its length. If the charset isn't a single byte charset then we use the quicker divide-and-conquer line splitting algorithm as before.	2002-05-19 23:47:53 +00:00
Barry Warsaw	409a4c08b5	Sync'ing with standalone email package 2.0.1. This adds support for non-us-ascii character sets in headers and bodies. Some API changes (with DeprecationWarnings for the old APIs). Better RFC-compliant implementations of base64 and quoted-printable. Updated test cases. Documentation updates to follow (after I finish writing them ;).	2002-04-10 21:01:31 +00:00

25 Commits
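
Several of the commits above concern decode_header(), make_header(), and the maxlinelen and splitchars handling in Header.encode(). The sketch below shows how those pieces fit together. It assumes the present-day Python 3 module name email.header (the commits themselves date from the 2002-2003 email.Header module), and the header name X-Example and the parameter string are made up for illustration. The exact fold points depend on the Python version, so none are shown.

```python
from email.header import Header, decode_header, make_header

# Decode an RFC 2047 encoded word into (bytes, charset) chunks,
# then rebuild a Header instance from those chunks with make_header().
chunks = decode_header("=?iso-8859-1?q?p=F6stal?=")
rebuilt = make_header(chunks)
text = str(rebuilt)

# Build a long ASCII header value (hypothetical data) and fold it.
# maxlinelen is the target line width; splitchars lists the characters
# treated as preferred "higher level syntactic breaks".
long_value = "; ".join("param%d=value%d" % (i, i) for i in range(12))
header = Header(long_value, header_name="X-Example", maxlinelen=40)
folded = header.encode(splitchars=";, \t")
folded_lines = folded.split("\n")

print(text)
for line in folded_lines:
    print(repr(line))
```

Passing maxlinelen explicitly mirrors the semantics described in commit 812031b955: when it is given it is honored, otherwise the line length is derived from header_name or falls back to the default.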