Commit Graph

951 Commits

Author SHA1 Message Date
Marc-André Lemburg 2d9204199f This patch changes the way the string .encode() method works slightly
and introduces a new method .decode().

The major change is that strg.encode() will no longer try to convert
Unicode returns from the codec into a string, but instead pass along
the Unicode object as-is. The same is now true for all other codec
return types. The underlying C APIs were changed accordingly.

Note that even though this does have the potential of breaking
existing code, the chances are low since conversion from Unicode
previously took place using the default encoding which is normally
set to ASCII rendering this auto-conversion mechanism useless for
most Unicode encodings.

The good news is that you can now use .encode() and .decode() with
much greater ease and that the door was opened for better accessibility
of the builtin codecs.

As demonstration of the new feature, the patch includes a few new
codecs which allow string to string encoding and decoding (rot13,
hex, zip, uu, base64).

Written by Marc-Andre Lemburg. Copyright assigned to the PSF.
2001-05-15 12:00:02 +00:00
Tim Peters 342c65e19a Aggressive reordering of dict comparisons. In case of collision, it stands
to reason that me_key is much more likely to match the key we're looking
for than to match dummy, and if the key is absent me_key is much more
likely to be NULL than dummy:  most dicts don't even have a dummy entry.
Running instrumented dict code over the test suite and some apps confirmed
that matching dummy was 200-300x less frequent than matching key in
practice.  So this reorders the tests to try the common case first.
It can lose if a large dict with many collisions is mostly deleted, not
resized, and then frequently searched, but that's hardly a case we
should be favoring.
2001-05-13 06:43:53 +00:00
Tim Peters 2f228e75e4 Get rid of the superstitious "~" in dict hashing's "i = (~hash) & mask".
The comment following used to say:
	/* We use ~hash instead of hash, as degenerate hash functions, such
	   as for ints <sigh>, can have lots of leading zeros. It's not
	   really a performance risk, but better safe than sorry.
	   12-Dec-00 tim:  so ~hash produces lots of leading ones instead --
	   what's the gain? */
That is, there was never a good reason for doing it.  And to the contrary,
as explained on Python-Dev last December, it tended to make the *sum*
(i + incr) & mask (which is the first table index examined in case of
collison) the same "too often" across distinct hashes.

Changing to the simpler "i = hash & mask" reduced the number of string-dict
collisions (== # number of times we go around the lookup for-loop) from about
6 million to 5 million during a full run of the test suite (these are
approximate because the test suite does some random stuff from run to run).
The number of collisions in non-string dicts also decreased, but not as
dramatically.

Note that this may, for a given dict, change the order (wrt previous
releases) of entries exposed by .keys(), .values() and .items().  A number
of std tests suffered bogus failures as a result.  For dicts keyed by
small ints, or (less so) by characters, the order is much more likely to be
in increasing order of key now; e.g.,

>>> d = {}
>>> for i in range(10):
...    d[i] = i
...
>>> d
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}
>>>

Unfortunately. people may latch on to that in small examples and draw a
bogus conclusion.

test_support.py
    Moved test_extcall's sortdict() into test_support, made it stronger,
    and imported sortdict into other std tests that needed it.
test_unicode.py
    Excluced cp875 from the "roundtrip over range(128)" test, because
    cp875 doesn't have a well-defined inverse for unicode("?", "cp875").
    See Python-Dev for excruciating details.
Cookie.py
    Chaged various output functions to sort dicts before building
    strings from them.
test_extcall
    Fiddled the expected-result file.  This remains sensitive to native
    dict ordering, because, e.g., if there are multiple errors in a
    keyword-arg dict (and test_extcall sets up many cases like that), the
    specific error Python complains about first depends on native dict
    ordering.
2001-05-13 00:19:31 +00:00
Tim Peters 16cabc0a3d Repair "module has no attribute xxx" error msg; bug introduced when
switching from tp_getattr to tp_getattro.
2001-05-12 20:24:22 +00:00
Tim Peters d85e102337 Variant of patch #423262: Change module attribute get & set
Allow module getattr and setattr to exploit string interning, via the
previously null module object tp_getattro and tp_setattro slots.   Yields
a very nice speedup for things like random.random and os.path etc.
2001-05-11 21:51:48 +00:00
Jeremy Hylton 1b0feb4ada Variant of SF patch 423181
For rich comparisons, use instance_getattr2() when possible to avoid
the expense of setting an AttributeError.  Also intern the name_op[]
table and use the interned strings rather than creating a new string
and interning it each time through.
2001-05-11 14:48:41 +00:00
Tim Peters 5acbfcc164 Cosmetic: code under "else" clause was missing indent. 2001-05-11 03:36:45 +00:00
Tim Peters 4fa58bfac2 Restore dicts' tp_compare slot, and change dict_richcompare to say it
doesn't know how to do LE, LT, GE, GT.  dict_richcompare can't do the
latter any faster than dict_compare can.  More importantly, for
cmp(dict1, dict2), Python *first* tries rich compares with EQ, LT, and
GT one at a time, even if the tp_compare slot is defined, and
dict_richcompare called dict_compare for the latter two because
it couldn't do them itself.  The result was a lot of wasted calls to
dict_compare.  Now dict_richcompare gives up at once the times Python
calls it with LT and GT from try_rich_to_3way_compare(), and dict_compare
is called only once (when Python gets around to trying the tp_compare
slot).
Continued mystery:  despite that this cut the number of calls to
dict_compare approximately in half in test_mutants.py, the latter still
runs amazingly slowly.  Running under the debugger doesn't show excessive
activity in the dict comparison code anymore, so I'm guessing the culprit
is somewhere else -- but where?  Perhaps in the element (key/value)
comparison code?  We clearly spend a lot of time figuring out how to
compare things.
2001-05-10 21:45:19 +00:00
Tim Peters 3918fb2549 Repair typo in comment. 2001-05-10 18:58:31 +00:00
Tim Peters 95bf9390a4 SF bug #422121 Insecurities in dict comparison.
Fixed a half dozen ways in which general dict comparison could crash
Python (even cause Win98SE to reboot) in the presence of kay and/or
value comparison routines that mutate the dict during dict comparison.
Bugfix candidate.
2001-05-10 08:32:44 +00:00
Tim Peters 9c012af3c3 Heh. I need a break. After this: stropmodule & stringobject were more
out of synch than I realized, and I managed to break replace's "count"
argument when it was 0.  All is well again.  Maybe.
Bugfix candidate.
2001-05-10 00:32:57 +00:00
Tim Peters 4cd44ef4bf Fudge. stropmodule and stringobject both had copies of the buggy
mymemXXX stuff, and they were already out of synch.  Fix the remaining
bugs in both and get them back in synch.
Bugfix release candidate.
2001-05-10 00:05:33 +00:00
Tim Peters 1a97d5f098 SF patch #416247 2.1c1 stringobject: unused vrbl cleanup.
Thanks to Mark Favas.
2001-05-09 20:06:00 +00:00
Tim Peters 4862ab7bf4 Sheesh -- repair the dodge around "cast isn't an lvalue" complaints to
restore correct semantics.
2001-05-09 08:43:21 +00:00
Tim Peters 9e897f41db Mark Favas reported that gcc caught me using casts as lvalues. Dodge it. 2001-05-09 07:37:07 +00:00
Tim Peters b4bbcd76ea Ack! Restore the COUNT_ALLOCS one_strings code. 2001-05-09 00:31:40 +00:00
Tim Peters cf5ad5d6f6 My change to string_item() left an extra reference to each 1-character
interned string created by "string"[i].  Since they're immortal anyway,
this was hard to notice, but it was still wrong <wink>.
2001-05-09 00:24:55 +00:00
Tim Peters 5b4d477568 Intern 1-character strings as soon as they're created. As-is, they aren't
interned when created, so the cached versions generally aren't ever
interned.  With the patch, the
		Py_INCREF(t);
		*p = t;
		Py_DECREF(s);
		return;
indirection block in PyString_InternInPlace() is never executed during a
full run of the test suite, but was executed very many times before.  So
I'm trading more work when creating one-character strings for doing less
work later.  Note that the "more work" here can happen at most 256 times
per program run, so it's trivial.  The same reasoning accounts for the
patch's simplification of string_item (the new version can call
PyString_FromStringAndSize() no more than 256 times per run, so there's
no point to inlining that stuff -- if we were serious about saving time
here, we'd pre-initialize the characters vector so that no runtime testing
at all was needed!).
2001-05-08 22:33:50 +00:00
Tim Peters 72f98e9b83 SF bug #422177: Results from .pyc differs from .py
Store floats and doubles to full precision in marshal.
Test that floats read from .pyc/.pyo closely match those read from .py.
Declare PyFloat_AsString() in floatobject header file.
Add new PyFloat_AsReprString() API function.
Document the functions declared in floatobject.h.
2001-05-08 15:19:57 +00:00
Tim Peters e63415ead8 SF patch #421922: Implement rich comparison for dicts.
d1 == d2 and d1 != d2 now work even if the keys and values in d1 and d2
don't support comparisons other than ==, and testing dicts for equality
is faster now (especially when inequality obtains).
2001-05-08 04:38:29 +00:00
Jeremy Hylton 4c889011db SF patch 419176 from MvL; fixed bug 418977
Two errors in dict_to_map() helper used by PyFrame_LocalsToFast().
2001-05-08 04:08:59 +00:00
Jeremy Hylton d37292bb8d Remove unused variable 2001-05-08 04:00:45 +00:00
Tim Peters 6d60b2e762 SF bug #422108 - Error in rich comparisons.
2.1.1 bugfix candidate too.
Fix a bad (albeit unlikely) return value in try_rich_to_3way_compare().
Also document do_cmp()'s return values.
2001-05-07 20:53:51 +00:00
Tim Peters cb8d368b82 Reimplement PySequence_Contains() and instance_contains(), so they work
safely together and don't duplicate logic (the common logic was factored
out into new private API function _PySequence_IterContains()).
Visible change:
    some_complex_number  in  some_instance
no longer blows up if some_instance has __getitem__ but neither
__contains__ nor __iter__.  test_iter changed to ensure that remains true.
2001-05-05 21:05:01 +00:00
Tim Peters 75f8e35ef4 Generalize PySequence_Count() (operator.countOf) to work with iterators. 2001-05-05 11:33:43 +00:00
Tim Peters de9725f135 Make 'x in y' and 'x not in y' (PySequence_Contains) play nice w/ iterators.
NEEDS DOC CHANGES
A few more AttributeErrors turned into TypeErrors, but in test_contains
this time.
The full story for instance objects is pretty much unexplainable, because
instance_contains() tries its own flavor of iteration-based containment
testing first, and PySequence_Contains doesn't get a chance at it unless
instance_contains() blows up.  A consequence is that
    some_complex_number in some_instance
dies with a TypeError unless some_instance.__class__ defines __iter__ but
does not define __getitem__.
2001-05-05 10:06:17 +00:00
Tim Peters 2cfe368283 Make unicode.join() work nice with iterators. This also required a change
to string.join(), so that when the latter figures out in midstream that
it really needs unicode.join() instead, unicode.join() can actually get
all the sequence elements (i.e., there's no guarantee that the sequence
passed to string.join() can be iterated over *again* by unicode.join(),
so string.join() must not pass on the original sequence object anymore).
2001-05-05 05:36:48 +00:00
Tim Peters 12d0a6c78a Fix a tiny and unlikely memory leak. Was there before too, and actually
several of these turned up and got fixed during the iteration crusade.
2001-05-05 04:10:25 +00:00
Tim Peters 6912d4ddf0 Generalize tuple() to work nicely with iterators.
NEEDS DOC CHANGES.
This one surprised me!  While I expected tuple() to be a no-brainer, turns
out it's actually dripping with consequences:
1. It will *allow* the popular PySequence_Fast() to work with any iterable
   object (code for that not yet checked in, but should be trivial).
2. It caused two std tests to fail.  This because some places used
   PyTuple_Sequence() (the C spelling of tuple()) as an indirect way to test
   whether something *is* a sequence.  But tuple() code only looked for the
   existence of sq->item to determine that, and e.g. an instance passed
   that test whether or not it supported the other operations tuple()
   needed (e.g., __len__).  So some things the tests *expected* to fail
   with an AttributeError now fail with a TypeError instead.  This looks
   like an improvement to me; e.g., test_coercion used to produce 559
   TypeErrors and 2 AttributeErrors, and now they're all TypeErrors.  The
   error details are more informative too, because the places calling this
   were *looking* for TypeErrors in order to replace the generic tuple()
   "not a sequence" msg with their own more specific text, and
   AttributeErrors snuck by that.
2001-05-05 03:56:37 +00:00
Tim Peters f4848dac41 Make PyIter_Next() a little smarter (wrt its knowledge of iterator
internals) so clients can be a lot dumber (wrt their knowledge).
2001-05-05 00:14:56 +00:00
Fred Drake 6aebded915 The weakref support in PyObject_InitVar() as well; this should have come out
at the same time as it did from PyObject_Init() .
2001-05-03 20:04:33 +00:00
Fred Drake ba40ec42c8 Remove unnecessary intialization for the case of weakly-referencable objects;
the code necessary to accomplish this is simpler and faster if confined to
the object implementations, so we only do this there.

This causes no behaviorial changes beyond a (very slight) speedup.
2001-05-03 19:44:50 +00:00
Fred Drake 4dcb85b817 Since Py_TPFLAGS_HAVE_WEAKREFS is set in Py_TPFLAGS_DEFAULT, it does not
need to be specified in the type structures independently.  The flag
exists only for binary compatibility.

This is a "source cleanliness" issue and introduces no behavioral changes.
2001-05-03 16:04:13 +00:00
Guido van Rossum b1f35bffe5 Mchael Hudson pointed out that the code for detecting changes in
dictionary size was comparing ma_size, the hash table size, which is
always a power of two, rather than ma_used, wich changes on each
insertion or deletion.  Fixed this.
2001-05-02 15:13:44 +00:00
Marc-André Lemburg 542fe56cb9 Fix for bug #417030: "print '%*s' fails for unicode string" 2001-05-02 14:21:53 +00:00
Tim Peters 6ad22c41c2 Plug a memory leak in list(), when appending to the result list. 2001-05-02 07:12:39 +00:00
Tim Peters f553f89d45 Generalize list(seq) to work with iterators. This also generalizes list()
to no longer insist that len(seq) be defined.
NEEDS DOC CHANGES.
This is meant to be a model for how other functions of this ilk (max,
filter, etc) can be generalized similarly.  Feel encouraged to grab your
favorite and convert it!
Note some cute consequences:
    list(file) == file.readlines() == list(file.xreadlines())
    list(dict) == dict.keys()
    list(dict.iteritems()) = dict.items()
    list(xrange(i, j, k)) == range(i, j, k)
2001-05-01 20:45:31 +00:00
Guido van Rossum 47668928e6 Discard a misleading comment about iter_iternext(). 2001-05-01 17:01:25 +00:00
Guido van Rossum 4f288ab7d6 Printing objects to a real file still wasn't done right: if the
object's type didn't define tp_print, there were still cases where the
full "print uses str() which falls back to repr()" semantics weren't
honored.  This resulted in

    >>> print None
    <None object at 0x80bd674>
    >>> print type(u'')
    <type object at 0x80c0a80>

Fixed this by always using the appropriate PyObject_Repr() or
PyObject_Str() call, rather than trying to emulate what they would do.

Also simplified PyObject_Str() to always fall back on PyObject_Repr()
when tp_str is not defined (rather than making an extra check for
instances with a __str__ method).  And got rid of the special case for
strings.
2001-05-01 16:53:37 +00:00
Guido van Rossum 189f1df301 Add a proper implementation for the tp_str slot (returning self, of
course), so I can get rid of the special case for strings in
PyObject_Str().
2001-05-01 16:51:53 +00:00
Guido van Rossum 09e563abb4 Add experimental iterkeys(), itervalues(), iteritems() to dict
objects.

Tests show that iteritems() is 5-10% faster than iterating over the
dict and extracting the value with dict[key].
2001-05-01 12:10:21 +00:00
Guido van Rossum 82c690f11a Well darnit! The innocuous fix I made to PyObject_Print() caused
printing of instances not to look for __str__().  Fix this.
2001-04-30 14:39:18 +00:00
Tim Peters b3d8d1f76c A different approach to the problem reported in
Patch #419651: Metrowerks on Mac adds 0x itself
C std says %#x and %#X conversion of 0 do not add the 0x/0X base marker.
Metrowerks apparently does.  Mark Favas reported the same bug under a
Compaq compiler on Tru64 Unix, but no other libc broken in this respect
is known (known to be OK under MSVC and gcc).
So just try the damn thing at runtime and see what the platform does.
Note that we've always had bugs here, but never knew it before because
a relevant test case didn't exist before 2.1.
2001-04-28 05:38:26 +00:00
Guido van Rossum 3a80c4a29c (Adding this to the trunk as well.)
Fix a very old flaw in PyObject_Print().  Amazing!  When an object
type defines tp_str but not tp_repr, 'print x' to a real file
object would not call the tp_str slot but rather print a default style
representation: <foo object at 0x....>.  This even though 'print x' to
a file-like-object would correctly call the tp_str slot.
2001-04-27 21:35:01 +00:00
Marc-André Lemburg 8155e0e541 This patch originated from an idea by Martin v. Loewis who submitted a
patch for sharing single character Unicode objects.

Martin's patch had to be reworked in a number of ways to take Unicode
resizing into consideration as well. Here's what the updated patch
implements:

* Single character Unicode strings in the Latin-1 range are shared
  (not only ASCII chars as in Martin's original patch).

* The ASCII and Latin-1 codecs make use of this optimization,
  providing a noticable speedup for single character strings. Most
  Unicode methods can use the optimization as well (by virtue
  of using PyUnicode_FromUnicode()).

* Some code cleanup was done (replacing memcpy with Py_UNICODE_COPY)

* The PyUnicode_Resize() can now also handle the case of resizing
  unicode_empty which previously resulted in an error.

* Modified the internal API _PyUnicode_Resize() and
  the public PyUnicode_Resize() API to handle references to
  shared objects correctly. The _PyUnicode_Resize() signature
  changed due to this.

* Callers of PyUnicode_FromUnicode() may now only modify the Unicode
  object contents of the returned object in case they called the API
  with NULL as content template.

Note that even though this patch passes the regression tests, there
may still be subtle bugs in the sharing code.
2001-04-23 14:44:21 +00:00
Guido van Rossum 213c7a6aa5 Mondo changes to the iterator stuff, without changing how Python code
sees it (test_iter.py is unchanged).

- Added a tp_iternext slot, which calls the iterator's next() method;
  this is much faster for built-in iterators over built-in types
  such as lists and dicts, speeding up pybench's ForLoop with about
  25% compared to Python 2.1.  (Now there's a good argument for
  iterators. ;-)

- Renamed the built-in sequence iterator SeqIter, affecting the C API
  functions for it.  (This frees up the PyIter prefix for generic
  iterator operations.)

- Added PyIter_Check(obj), which checks that obj's type has a
  tp_iternext slot and that the proper feature flag is set.

- Added PyIter_Next(obj) which calls the tp_iternext slot.  It has a
  somewhat complex return condition due to the need for speed: when it
  returns NULL, it may not have set an exception condition, meaning
  the iterator is exhausted; when the exception StopIteration is set
  (or a derived exception class), it means the same thing; any other
  exception means some other error occurred.
2001-04-23 14:08:49 +00:00
Guido van Rossum 65967259f2 Oops, forgot to merge this from the iter-branch to the trunk.
This adds "for line in file" iteration, as promised.
2001-04-21 13:20:18 +00:00
Tim Peters cf96de052f SF but #417587: compiler warnings compiling 2.1.
Repaired *some* of the SGI compiler warnings Sjoerd Mullender reported.
2001-04-21 02:46:11 +00:00
Guido van Rossum 05311481d4 Adding iterobject.[ch], which were accidentally not added. Sorry\! 2001-04-20 21:06:46 +00:00
Guido van Rossum 59d1d2b434 Iterators phase 1. This comprises:
new slot tp_iter in type object, plus new flag Py_TPFLAGS_HAVE_ITER
new C API PyObject_GetIter(), calls tp_iter
new builtin iter(), with two forms: iter(obj), and iter(function, sentinel)
new internal object types iterobject and calliterobject
new exception StopIteration
new opcodes for "for" loops, GET_ITER and FOR_ITER (also supported by dis.py)
new magic number for .pyc files
new special method for instances: __iter__() returns an iterator
iteration over dictionaries: "for x in dict" iterates over the keys
iteration over files: "for x in file" iterates over lines

TODO:

documentation
test suite
decide whether to use a different way to spell iter(function, sentinal)
decide whether "for key in dict" is a good idea
use iterators in map/filter/reduce, min/max, and elsewhere (in/not in?)
speed tuning (make next() a slot tp_next???)
2001-04-20 19:13:02 +00:00