SF 560379: Karatsuba multiplication.
Much was changed from the original patch. This needs a lot more testing,
for correctness and speed, the latter especially when bit lengths are
unbalanced. For now, the Karatsuba code gets invoked if and only if the
environment variable KARAT exists.
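For reference, here's the shape of the algorithm as a minimal Python
sketch (hypothetical, illustration only -- the patch implements this in
C on the internal digits of Python longs, and int.bit_length() needs a
modern Python):

    def karatsuba(x, y):
        # Small operands: fall back to ordinary multiplication.
        if x < 256 or y < 256:
            return x * y
        # Split both numbers around half the smaller bit length.
        n = min(x.bit_length(), y.bit_length()) // 2
        xh, xl = x >> n, x & ((1 << n) - 1)
        yh, yl = y >> n, y & ((1 << n) - 1)
        a = karatsuba(xh, yh)                  # high * high
        d = karatsuba(xl, yl)                  # low * low
        # One recursive multiply replaces the two cross terms:
        e = karatsuba(xh + xl, yh + yl) - a - d
        return (a << (2 * n)) + (e << n) + d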
Added warnings for operations that currently return inconsistent results
for ints and longs; in particular: hex/oct/%u/%o/%x/%X of negative short
ints, and x<<n that either loses bits or changes sign. (No warnings for
repr() of a long, though that will also change to lose the trailing 'L'
eventually.)
This introduces some warnings in the test suite; I'll take care of
those later.
This is friendlier for caches.
2. Cut MIN_GALLOP to 7, but added a per-sort min_gallop variable that
adapts the "get into galloping mode" threshold higher when galloping
isn't paying, and lower when it is (see the sketch after this list).
There's no known case where this hurts.
It's (of course) neutral for /sort, \sort and =sort. It also happens
to be neutral for !sort. It cuts a tiny # of compares in 3sort and +sort.
For *sort, it reduces the # of compares to below what this did when
MIN_GALLOP was hardcoded to 10 (it did about 0.1% more *sort compares
before, but given how close we are to the limit, this is "a lot"!).
%sort used to do about 1.5% more compares, and ~sort about 3.6% more.
Here are exact counts:
  i     *sort    3sort    +sort    %sort     ~sort    !sort
 15    449235    33019    33016    51328    188720    65534   before
       448885    33016    33007    50426    182083    65534   after
        0.08%    0.01%    0.03%    1.79%     3.65%    0.00%   %ch from after

 16    963714    65824    65809   103409    377634   131070
       962991    65821    65808   101667    364341   131070
        0.08%    0.00%    0.00%    1.71%     3.65%    0.00%

 17   2059092   131413   131362   209130    755476   262142
      2057533   131410   131361   206193    728871   262142
        0.08%    0.00%    0.00%    1.42%     3.65%    0.00%

 18   4380687   262440   262460   421998   1511174   524286
      4377402   262437   262459   416347   1457945   524286
        0.08%    0.00%    0.00%    1.36%     3.65%    0.00%

 19   9285709   524581   524634   848590   3022584  1048574
      9278734   524580   524633   837947   2916107  1048574
        0.08%    0.00%    0.00%    1.27%     3.65%    0.00%

 20  19621118  1048960  1048942  1715806   6045418  2097150
     19606028  1048958  1048941  1694896   5832445  2097150
        0.08%    0.00%    0.00%    1.23%     3.65%    0.00%
3. Added some key asserts I overlooked before.
4. Updated the doc file.
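The adaptive threshold in #2 can be modeled in Python terms; this is a
hypothetical sketch of the adjustment rule only (the real logic lives
in the C merge routines):

    MIN_GALLOP = 7

    def adjust_min_gallop(min_gallop, gallop_paid_off):
        # Lower the bar while galloping keeps winning, and raise it
        # again (never below 1) as soon as it stops paying.
        if gallop_paid_off:
            return max(1, min_gallop - 1)
        else:
            return min_gallop + 1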
The old counts were done before %sort was introduced. Redid them (the
numbers change, but the conclusions don't). Also did the samplesort
counts with the released 2.2.1, as they're slightly different under the
last CVS 2.3 samplesort (some higher, some lower -- CVS had been changed
to stop doing the special-case business on recursive samplesort calls).
An example of where this changes behavior is when a new-style instance
defines '__mul__' and '__rmul__' and is multiplied by an int. Before the
change, the '__rmul__' method was never called, even if the int was the
left operand.
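A hypothetical example of the changed behavior (Python 2 syntax):

    class Num(object):                 # new-style class
        def __mul__(self, other):
            return "mul"
        def __rmul__(self, other):
            return "rmul"

    print 3 * Num()    # now prints 'rmul'; before the fix, __rmul__
                       # was never called with the int on the left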
There was a lot of pointless trampolining going on with the tp_new
descriptor: the inherited PyType_GenericNew was overwritten with the
much slower slot_tp_new, which would end up calling tp_new_wrapper,
which would eventually call PyType_GenericNew. Added a special case for
this to update_one_slot().
XXX Hope there isn't a loophole in this. I'll buy the first person to
point out a bug in the reasoning a beer.
Backport candidate (but I won't do it).
Intern the string "__new__" so we can call PyObject_GetAttr() rather
than PyObject_GetAttrString(). (Though it's a mystery why slot_tp_new
is being called when a class doesn't define __new__. I'll look into
that tomorrow.)
2.2 backport candidate (but I won't do it).
Looking up the __del__ method at deallocation time was a lot of work:
it had to save and restore the current exception around a call to
lookup_maybe(), because that could fail in rare cases, and most objects
don't have a __del__ method, so the whole exercise was usually a waste
of time. Changed this to cache the __del__ method in the type object
just like all other special methods, in a new slot tp_del. So now
subtype_dealloc() can test whether tp_del is NULL and skip the whole
exercise if it is. The new slot doesn't need a new flag bit:
subtype_dealloc() is only called if the type was dynamically allocated
by type_new(), so it's guaranteed to have all current slots. Types
defined in C cannot fill in tp_del with a function of their own, so
there's no corresponding "wrapper". (That functionality is already
available through tp_dealloc.)
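Illustration of the two paths at the Python level (hypothetical
example; the NULL test itself happens in C):

    class Plain(object):
        pass                 # no __del__ anywhere: tp_del stays NULL,
                             # so deallocation skips the finalizer
                             # lookup entirely

    class Finalized(object):
        def __del__(self):   # found when the type is created and
            pass             # cached in the new tp_del slot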
subtype_dealloc(): When call_finalizer() failed, it would return without
going through the trashcan end macro, thereby unbalancing the trashcan
nesting level counter, and thereby defeating the test case (slottrash()
in test_descr.py). This in turn meant that the assert in the GC_UNTRACK
macro wasn't triggered by the slottrash() test despite a bug in the
code: _PyTrash_destroy_chain() calls the dealloc routine with an object
that's untracked, and the assert in the GC_UNTRACK macro would fail on
this; but an earlier test that resurrects an object caused
call_finalizer() to fail and the trashcan nesting level to become
unbalanced, so _PyTrash_destroy_chain() was never called. Calling the
slottrash() test in isolation *did* trigger the assert, however.
So the fix is twofold: (1) call the GC_UnTrack() function instead of
the GC_UNTRACK macro, because the function is safe when the object is
already untracked; (2) when call_finalizer() fails, jump to a label
that exits through the trashcan end macro, keeping the trashcan
nesting balanced.
This is inspired by SF patch 581742 (by Jonathan Hogg, who also
submitted the bug report, and two other suggested patches), but
separates the non-GC case from the GC case to avoid testing for GC
several times.
Had to fix an assert() in call_finalizer() that asserted that the
object wasn't untracked, because it's possible that the object isn't
GC'ed!
For a file f, iter(f) now returns f (unless f is closed), and f.next()
is similar to f.readline() when EOF is not reached; however, f.next()
uses a readahead buffer that messes up the file position, so mixing
f.next() and f.readline() (or other methods) doesn't work right.
Calling f.seek() drops the readahead buffer, but other operations
don't.
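For example (a hypothetical sketch; the exact lines read depend on the
file's contents):

    f = open("data.txt")     # hypothetical file
    assert iter(f) is f      # the file is now its own iterator
    line1 = f.next()         # reads through the readahead buffer
    line2 = f.readline()     # not necessarily the following line: the
                             # readahead has pushed the real file
                             # position ahead
    f.seek(0)                # seek() discards the readahead buffer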
The real purpose of this change is to reduce the confusion between
objects and their iterators. By making a file its own iterator, it's
made clearer that using the iterator modifies the file object's state
(in particular the current position).
A nice side effect is that this speeds up "for line in f:" by not
having to use the xreadlines module. The f.xreadlines() method is
still supported for backwards compatibility, though it is the same as
iter(f) now.
(I made some cosmetic changes to Oren's code, and added a test for
"file closed" to file_iternext() and file_iter().)
Sped the usual case for sorting by invoking the comparison directly
when no comparison function is specified. This saves a layer of
function call on every compare in that case. Measured speedups:
  i     2**i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
 15    32768   12.5%    0.0%    0.0%  100.0%    0.0%   50.0%  100.0%  100.0%  -50.0%
 16    65536    8.7%    0.0%    0.0%    0.0%    0.0%    0.0%   12.5%    0.0%    0.0%
 17   131072    8.0%   25.0%    0.0%   25.0%    0.0%   14.3%    5.9%    0.0%    0.0%
 18   262144    6.3%  -10.0%   12.5%   11.1%    0.0%    6.3%    5.6%   12.5%    0.0%
 19   524288    5.3%    5.9%    0.0%    5.6%    0.0%    5.9%    5.4%    0.0%    2.9%
 20  1048576    5.3%    2.9%    2.9%    5.1%    2.8%    1.3%    5.9%    2.9%    4.2%
The best indicators are those that take significant time (larger i), and
where sort doesn't do very few compares (so *sort and ~sort benefit most
reliably). The large numbers are due to roundoff noise combined with
platform variability; e.g., the 14.3% speedup for %sort at i=17 reflects
a printed elapsed time of 0.18 seconds falling to 0.17, but a change in
the last digit isn't really meaningful (indeed, if it really took 0.175
seconds, one electron having a lazy nanosecond could shift it to either
value <wink>). Similarly the 25% at 3sort i=17 was a meaningless change
from 0.05 to 0.04. However, almost all the "meaningless changes" were
in the same direction, which is good. The before-and-after times for
*sort are clearest:
    before  after
      0.18   0.16
      0.25   0.23
      0.54   0.50
      1.18   1.11
      2.57   2.44
      5.58   5.30
longer to run than normal. A profiler run showed that this was due to
PyFrame_New() taking up an unreasonable amount of time. A little
thinking showed that this was due to the while loop clearing the space
available for the stack. The solution is to only clear the local
variables (and cells and free variables), not the space available for
the stack, since anything beyond the stack top is considered to be
garbage anyway. Also, use memset() instead of a while loop counting
backwards. This should be a time savings for normal code too! (By a
probably unmeasurable amount. :-)
version of PySlice_GetIndicesEx"):
> OK. Michael, if you want to check in indices(), go ahead.
Then I did what was needed, but didn't check it in. Here it is.
listsort. If the former calls itself recursively, they're a waste of
time, since it's called on a random permutation of a random subset of
elements. OTOH, for exactly the same reason, they're an immeasurably
small waste of time (the odds of finding exploitable order in a random
permutation are ~= 0, so the special-case loops looking for order give
up quickly). The point is more for conceptual clarity.
Also changed some "assert comments" into real asserts; when this code
was first written, Python.h didn't supply assert.h.
Long ago, list.sort() was rewritten to use only the "< or not <?"
distinction. After rich comparisons were introduced, docompare() was
fiddled to translate a Py_LT Boolean result into the old "-1 for <,
0 for ==, 1 for >" flavor of outcome, and the sorting code was left
alone. This left things more obscure than they should be, and it turns
out it also cost measurable cycles.
So: The old CMPERROR novelty is gone. docompare() is renamed to islt(),
and now has the same return conditions as PyObject_RichCompareBool. The
SETK macro is renamed to ISLT, and is even weirder than before (don't
complain unless you want to maintain the sort code <wink>).
Overall, this yields a 1-2% speedup in the usual (no explicit function
passed to list.sort()) case when sorting arrays of floats (as sortperf.py
does). The boost is higher for arrays of ints.
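In Python terms, the shape of the change looks roughly like this
(hypothetical model of the C code, not a transcription):

    # Old shape: compute a full -1/0/+1 outcome, then test "< 0".
    def docompare(x, y):
        if x < y:
            return -1
        elif x == y:
            return 0
        else:
            return 1

    # New shape: ask only the question the sort needs, with the same
    # return convention as PyObject_RichCompareBool(x, y, Py_LT).
    def islt(x, y):
        return x < y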
The staticforward define was needed to support certain broken C
compilers (notably SCO ODT 3.0, perhaps early AIX as well) that botched
the static keyword when it was used with a forward declaration of a
static initialized structure. Standard C allows the forward declaration
with static, and we've decided to stop catering to broken C compilers.
(In fact, we expect that the compilers are all fixed eight years later.)
I'm leaving staticforward and statichere defined in object.h as
static. This is only for backwards compatibility with C extensions
that might still use it.
XXX I haven't updated the documentation.
Remove the next() method -- one is supplied automatically by
PyType_Ready() because the tp_iternext slot is set (fortunately, because
using the tp_iternext implementation for the next() implementation is
buggy). Also changed the allocation order in enum_next() so that the
underlying iterator is only moved ahead when we have successfully
allocated the result tuple and index.
Clear the di_dict field when the end of the list is reached. Also make
the error ("dictionary changed size during iteration") a sticky state.
Also remove the next() method -- one is supplied automatically by
PyType_Ready() because the tp_iternext slot is set. That's a good
thing, because the implementation given here was buggy (it never
raised StopIteration).
Clear the object references (it_seq for seqiterobject, it_callable and
it_sentinel for calliterobject) when the end of the list is reached.
Also remove the next() methods -- one is supplied automatically by
PyType_Ready() because the tp_iternext slot is set. That's a good
thing, because the implementation given here was buggy (it never
raised StopIteration).
Clear the it_seq field when the end of the list is reached.
Also remove the next() method -- one is supplied automatically by
PyType_Ready() because the tp_iternext slot is set. That's a good
thing, because the implementation given here was buggy (it never
raised StopIteration).
If the object is an ExtensionClass, for example, the slot is not even
defined. So we must check that the type has the slot (implied by
HAVE_CLASS) before calling tp_init().
In the explicit comparison function case, use PyObject_Call instead of
PyEval_CallObject. Same thing in context, but gives a 2.4% overall
speedup when sorting a list of ints via list.sort(__builtin__.cmp).
MSDN sample programs use it, apparently in error. The correct name
is WIN32_LEAN_AND_MEAN. After switching to the correct name, in two
cases more was needed because the code actually relied on things that
disappear when WIN32_LEAN_AND_MEAN is defined.
Sped sort() with an explicit comparison function by reusing a single
arg tuple across compare calls instead of building a fresh one per
compare. This was suggested on c.l.py, but I'm afraid I can't find the
msg again for proper attribution. For
list.sort(cmp)
where list is a list of random ints, and cmp is __builtin__.cmp, this
yields an overall 50-60% speedup on my Win2K box. Of course this is a
best case, because the overhead of calling cmp relative to the cost of
actually comparing two ints is at an extreme. Nevertheless it's huge
bang for the buck. An additionak 20-30% can be bought by making the arg
tuple an immortal static (avoiding all but "the first" PyTuple_New), but
that's tricky to make correct since docompare needs to be reentrant. So
this picks the cherry and leaves the pits for Fred <wink>.
Note that this makes no difference to the
list.sort()
case; an arg tuple gets built only if the user specifies an explicit
sort function.
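A hypothetical micro-benchmark for that best case (Python 2 syntax; the
numbers will vary by box):

    import random, time, __builtin__

    data = [random.randrange(1000000) for i in xrange(100000)]
    t0 = time.clock()
    data.sort(__builtin__.cmp)   # the sped-up explicit-comparison path
    print "explicit cmp sort:", time.clock() - t0, "seconds"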
helper macros to something saner, and used them appropriately in other
files too, to reduce #ifdef blocks.
classobject.c, instance_dealloc(): One of my worst Python Memories is
trying to fix this routine a few years ago when COUNT_ALLOCS was defined
but Py_TRACE_REFS wasn't. The special-build code here is way too
complicated. Now it's much simpler. Difference: in a Py_TRACE_REFS
build, the instance is no longer in the doubly-linked list of live
objects while its __del__ method is executing, and that may be visible
via sys.getobjects() called from a __del__ method. Tough -- the object
is presumed dead while its __del__ is executing anyway, and not calling
_Py_NewReference() at the start allows enormous code simplification.
typeobject.c, call_finalizer(): The special-build instance_dealloc()
pain apparently spread to here too via cut-'n-paste, and this is much
simpler now too. In addition, I didn't understand why this routine
was calling _PyObject_GC_TRACK() after a resurrection, since there's no
plausible way _PyObject_GC_UNTRACK() could have been called on the
object by this point. I suspect it was left over from pasting the
instance_dealloc() code. Instead asserted that the object is still
tracked. Caution: I suspect we don't have a test that actually
exercises the subtype_dealloc() __del__-resurrected-me code.
more trivial lexical helper macros so that uses of these guys expand
to nothing at all when they're not enabled. This should help
substandard compilers that can't do a good job of optimizing away the
previous "(void)0" expressions.
Py_DECREF: There's only one definition of this now. Yay! That was
the last one in the family defined multiple times in an #ifdef maze.
Py_FatalError(): Changed the char* signature to const char*.
_Py_NegativeRefcount(): New helper function for the Py_REF_DEBUG
expansion of Py_DECREF. Calling an external function cuts down on
the volume of generated code. The previous inline expansion of abort()
didn't work as intended on Windows (the program often kept going, and
the error msg scrolled off the screen unseen). _Py_NegativeRefcount
calls Py_FatalError instead, which captures our best knowledge of
how to abort effectively across platforms.
Repair segfaults and infinite loops in COUNT_ALLOCS builds in the
presence of new-style (heap-allocated) classes/types.
Bugfix candidate. I'll backport this to 2.2. It's irrelevant in 2.1.
that have taken me "too long" to reverse-engineer over the years.
Vastly reduced the nesting level and redundancy of #ifdef-ery.
Took a light stab at repairing comments that are no longer true.
sys_gettotalrefcount(): Changed to enable under Py_REF_DEBUG.
It was enabled under Py_TRACE_REFS, which was much heavier than
necessary. sys.gettotalrefcount() is now available in a
Py_REF_DEBUG-only build.
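Usage is unchanged; the function simply exists in more builds now. A
hypothetical check:

    import sys
    # Present only when the interpreter was compiled with Py_REF_DEBUG:
    if hasattr(sys, "gettotalrefcount"):
        print sys.gettotalrefcount()   # sum of all live refcounts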
The trashcan mechanism is no longer evil: it no longer plays dangerous
games with
the type pointer or refcounts, and objects in extension modules can play
along too without needing to edit the core first.
Rewrote all the comments to explain this, and (I hope) give clear
guidance to extension authors who do want to play along. Documented
all the functions. Added more asserts (it may no longer be evil, but
it's still dangerous <0.9 wink>). Rearranged the generated code to
make it clearer, and to tolerate either the presence or absence of a
semicolon after the macros. Rewrote _PyTrash_destroy_chain() to call
tp_dealloc directly; it was doing a Py_DECREF again, and that has all
sorts of obscure distorting effects in non-release builds (Py_DECREF
was already called on the object!). Removed Christian's little "embedded
change log" comments -- that's what checkin messages are for, and since
it was impossible to correlate the comments with the code that changed,
I found them merely distracting.
In a fresh interpreter, type.mro(tuple) would segfault, because
PyType_Ready() isn't called for tuple yet. To fix, call
PyType_Ready(type) if type->tp_dict is NULL.
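The failing case was simply (hypothetical session):

    >>> type.mro(tuple)    # used to segfault in a fresh interpreter
    [<type 'tuple'>, <type 'object'>]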
These built-in functions are replaced by their (now callable) type:
slice()
buffer()
and these types can also be called (but have no built-in function
named after them):
classobj (type name used to be "class")
code
function
instance
instancemethod (type name used to be "instance method")
The module "new" has been replaced with a small backward compatibility
placeholder in Python.
A large portion of the patch simply removes the new module from
various platform-specific build recipes. The following binary Mac
project files still have references to it:
Mac/Build/PythonCore.mcp
Mac/Build/PythonStandSmall.mcp
Mac/Build/PythonStandalone.mcp
[I've tweaked the code layout and the doc strings here and there, and
added a comment to types.py about StringTypes vs. basestring. --Guido]
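For instance, what new.classobj() and new.instancemethod() did can now
be spelled through the (callable) types themselves; a hypothetical
illustration in Python 2 syntax:

    import types

    Cls = types.ClassType("Cls", (), {"greeting": "hello"})  # classic class
    obj = Cls()

    def method(self):
        return self.greeting

    bound = types.MethodType(method, obj, Cls)  # the instancemethod type
    print bound()                               # prints 'hello'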
gotten from a weak reference to NULL instead of to None. This caused
the following assert() to fail (but only in 2.2 in the debug build --
I have to find a better test case). Will backport.
When getting an optional attribute, only clear the exception when the
internal getattr operation raised AttributeError. Many places in this
file already had that policy; but just as many didn't, and there didn't
seem to be any rhyme or reason to it. Be consistently cautious.
Question: should I backport this? On the one hand it's a bugfix. On
the other hand it's a change in behavior. Certain forms of buggy or
just weird code would work in the past but raise an exception under
the new rules; e.g. if you define a __getattr__ method that raises a
non-AttributeError exception.
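A hypothetical example of code affected by the new rules:

    class Flaky(object):
        def __getattr__(self, name):
            raise RuntimeError(name)   # not an AttributeError

    # Optional-attribute probes inside the interpreter used to clear
    # whatever __getattr__ raised; now only AttributeError is cleared,
    # so this RuntimeError propagates instead of being masked.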
SF 473985: Through a subtle rearrangement of some members in the etype
struct (!), mapping methods are now preferred over sequence methods,
which is necessary to support str.__getitem__("hello", slice(4)) etc.
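E.g., the case from the report now works as expected:

    >>> str.__getitem__("hello", slice(4))
    'hell'
    >>> "hello"[slice(1, None)]    # same machinery via normal indexing
    'ello'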