space is no longer needed, so removed the code. It was only possible when
a degenerate (ah->ob_size == 0) split happened, but after that fix went
in I added k_lopsided_mul(), which saves the body of k_mul() from seeing
a degenerate split. So this removes code, and adds a honking long comment
block explaining why spilling out of bounds isn't possible anymore. Note:
if we end up spilling out of bounds anyway <wink>, an assert in v_iadd()
is certain to trigger.
Close the bug report again -- this time for Cygwin due to a newlib bug.
See the following for the details:
http://sources.redhat.com/ml/newlib/2002/msg00369.html
Note that this commit is only a documentation (i.e., comment) change.
(rev. 2.86). The other type is only disqualified from sq_repeat when
it has the CHECKTYPES flag. This means that for extension types that
only support "old-style" numeric ops, such as Zope 2's ExtensionClass,
sq_repeat still trumps nb_multiply.
test was written. So boosted the number of "digits" this generates, and
also beefed up the "* / divmod" test to tickle numbers big enough to
trigger the Karatsuba algorithm. It takes about 2 seconds now on my box.
k_mul() when inputs have vastly different sizes, and a little more
efficient when they're close to a factor of 2 out of whack.
I consider this done now, although I'll set up some more correctness
tests to run overnight.
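
A rough illustration of the idea behind handling very unbalanced inputs, written in Python rather than C. This is only a sketch under assumptions: the name lopsided_mul, the bit-based chunking, and the pluggable mul argument are hypothetical and are not taken from longobject.c. The larger input is cut into slices about as wide as the smaller input, so that each partial product is a balanced multiply.

```python
def lopsided_mul(a, b, mul=lambda x, y: x * y):
    """Multiply non-negative ints a and b, where one may be far larger.

    Hypothetical sketch: slice the larger input into chunks roughly as
    wide as the smaller one, multiply chunk by chunk, and add each
    partial product in at the matching bit offset.
    """
    if a > b:
        a, b = b, a              # make a the smaller input
    if a == 0:
        return 0
    chunk_bits = a.bit_length()  # slice width matches the smaller input
    mask = (1 << chunk_bits) - 1
    result = 0
    shift = 0
    while b:
        # Each call to mul() sees two inputs of similar size.
        result += mul(a, b & mask) << shift
        b >>= chunk_bits
        shift += chunk_bits
    return result
```

Passing a balanced-input routine as mul is the point of the exercise: it never sees a degenerate split, because no slice is wider than the smaller input.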
cases, overflow the allocated result object by 1 bit. In such cases,
it would have been brought back into range if we subtracted al*bl and
ah*bh from it first, but I don't want to do that because it hurts cache
behavior. Instead we just ignore the excess bit when it appears -- in
effect, this is forcing unsigned mod BASE**(asize + bsize) arithmetic
in a case where that doesn't happen all by itself.
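
A small Python sketch of why ignoring the excess bit is safe. It is an illustration only: the function name k_combine_mod and the bit-based split are assumptions, and the real code works on digit vectors in C. The partial products are added first, any bits beyond the result size are dropped, and the subtractions then bring the value back to the true product.

```python
def k_combine_mod(a, b, nbits):
    """Karatsuba combine step with wrap-around arithmetic.

    Assumes 0 <= a, b < 2**nbits.  The result area holds 2*nbits bits,
    so all arithmetic is done mod 2**(2*nbits).  Returns the product and
    a flag saying whether the intermediate sum overflowed that area.
    """
    half = nbits // 2
    modulus = 1 << (2 * nbits)
    mask = (1 << half) - 1
    ah, al = a >> half, a & mask
    bh, bl = b >> half, b & mask
    hi = ah * bh
    lo = al * bl
    k = (ah + al) * (bh + bl)
    # Add everything in first; this may exceed the result area.
    acc = lo + (hi << (2 * half)) + (k << half)
    overflowed = acc >= modulus
    acc %= modulus                          # ignore the excess bit
    # Subtracting ah*bh and al*bl afterwards wraps back into range.
    acc = (acc - (hi << half)) % modulus
    acc = (acc - (lo << half)) % modulus
    return acc, overflowed
```

With a = b = 65535 and nbits = 16 the intermediate sum is 4328129025, which is larger than 2**32, yet the final value is still 65535 * 65535.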
1. You can now have __dict__ and/or __weakref__ in your __slots__
(before only __weakref__ was supported). This is treated
differently than before: it merely sets a flag that the object
should support the corresponding magic.
2. Dynamic types now always have descriptors __dict__ and __weakref__
thrust upon them. If the type in fact does not support one or the
other, that descriptor's __get__ method will raise AttributeError.
3. (This is the reason for all this; it fixes SF bug 575229, reported
by Cesar Douady.) Given this code:
    class A(object): __slots__ = []
    class B(object): pass
    class C(A, B): __slots__ = []
the class object for C was broken; its size was less than that of
B, and some descriptors on B could cause a segfault. C now
correctly inherits __weakref__ and __dict__ from B, even though A
is the "primary" base (C.__base__ is A).
4. Some code cleanup, and a few comments added.
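
The behavior described in items 1 to 3 can be checked from Python. The following is a sketch run against a current CPython 3, not against the interpreter this change was made for; the class WithDict and the helper supports_dict are hypothetical names added for illustration.

```python
import weakref


class A(object):
    __slots__ = []


class B(object):
    pass


class C(A, B):
    __slots__ = []


class WithDict(object):
    # Item 1: naming __dict__ in __slots__ asks for a per-instance dict.
    __slots__ = ['__dict__']


def supports_dict(obj):
    """Return True if obj has a usable __dict__ attribute."""
    try:
        obj.__dict__
    except AttributeError:
        return False
    return True


c = C()
c.x = 1                        # works because C gets __dict__ from B
ref = weakref.ref(c)           # works because C gets __weakref__ from B
print(C.__base__ is A)         # A is still the "primary" base
print(supports_dict(A()))      # A has empty __slots__, so no __dict__
print(supports_dict(WithDict()))
```

Instances of A accept neither new attributes nor weak references, while instances of C accept both even though C itself declares empty __slots__.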
algorithm. MSVC 6 wasn't impressed <wink>.
Something odd: the x_mul algorithm appears to get substantially worse
than quadratic time as the inputs grow larger:
bits in each input   x_mul time   k_mul time
------------------   ----------   ----------
             15360         0.01         0.00
             30720         0.04         0.01
             61440         0.16         0.04
            122880         0.64         0.14
            245760         2.56         0.40
            491520        10.76         1.23
            983040        71.28         3.69
           1966080       459.31        11.07
That is, x_mul is perfectly quadratic-time until a little burp at
2.56->10.76, and after that goes to hell in a hurry. Under Karatsuba,
doubling the input size "should take" 3 times longer instead of 4, and
that remains the case throughout this range. I conclude that my "be nice
to the cache" reworkings of k_mul() are paying.
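
The growth factors can be read off the table with a few lines of Python. This is just arithmetic on the timings quoted above; the names timings and growth_ratios are made up for the sketch. A factor near 4 per doubling means quadratic time, and a factor near 3 is what Karatsuba "should" give.

```python
# (bits in each input, x_mul time, k_mul time), copied from the table.
timings = [
    (15360, 0.01, 0.00),
    (30720, 0.04, 0.01),
    (61440, 0.16, 0.04),
    (122880, 0.64, 0.14),
    (245760, 2.56, 0.40),
    (491520, 10.76, 1.23),
    (983040, 71.28, 3.69),
    (1966080, 459.31, 11.07),
]


def growth_ratios(times):
    """Ratio of each time to the previous one (None if previous is 0)."""
    return [
        round(later / earlier, 2) if earlier else None
        for earlier, later in zip(times, times[1:])
    ]


x_ratios = growth_ratios([row[1] for row in timings])
k_ratios = growth_ratios([row[2] for row in timings])
for (bits, _, _), xr, kr in zip(timings[1:], x_ratios, k_ratios):
    print(bits, xr, kr)
```

The x_mul factor stays at 4 through 245760 bits and then climbs past 6 for the last two doublings, while the k_mul factor settles at 3 for the largest sizes. The smallest k_mul timings are too coarse (0.00 and 0.01) to give meaningful ratios.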
correct now, so added some final comments, did some cleanup, and enabled
it for all long-int multiplies. The KARAT envar no longer matters,
although I left some #if 0'ed code in there for my own use (temporary).
k_mul() is still much slower than x_mul() if the inputs have very
different sizes, and that still needs to be addressed.
(it's possible, but should be harmless -- this requires more thought,
and allocating enough space in advance to prevent it requires exactly
as much thought, to know exactly how much that is -- the end result
certainly fits in the allocated space -- hmm, but that's really all
the thought it needs! borrows/carries out of the high digits really
are harmless).
k_mul(): This didn't allocate enough result space when one input had
more than twice as many bits as the other. This was partly hidden by
the fact that x_mul() didn't normalize its result.
The Karatsuba recurrence is pretty much hosed if the inputs aren't
roughly the same size. If one has at least twice as many bits as the
other, we get a degenerate case where the "high half" of the smaller
input is 0. Added a special case for that, for speed, and although it
helped, this can still be much slower than the "grade school" method.
It seems to take a really wild imbalance to trigger that; e.g., a
2**22-bit input times a 1000-bit input on my box runs about twice as slow
under k_mul than under x_mul. This still needs to be addressed.
I'm also not sure that allocating a->ob_size + b->ob_size digits is
enough, given that this is computing k = (ah+al)*(bh+bl) instead of
k = (ah-al)*(bl-bh); i.e., it's certainly enough for the final result,
but it's vaguely possible that adding in the "artificially" large k may
overflow that temporarily. If so, an assert will trigger in the debug
build, but we'll probably compute the right result anyway(!).
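
For reference, the recurrence with k = (ah+al)*(bh+bl) can be sketched in Python as follows. This is a minimal illustration on Python ints, not the C code: the function name karatsuba, the bit-based split, and the cutoff_bits parameter are assumptions, and it handles non-negative inputs only.

```python
def karatsuba(a, b, cutoff_bits=64):
    """Multiply non-negative ints with the three-product recurrence."""
    # Small inputs: fall back to the built-in multiply.
    if a.bit_length() <= cutoff_bits or b.bit_length() <= cutoff_bits:
        return a * b
    # Split both inputs at the same bit position.
    half = max(a.bit_length(), b.bit_length()) // 2
    mask = (1 << half) - 1
    ah, al = a >> half, a & mask
    bh, bl = b >> half, b & mask
    hi = karatsuba(ah, bh, cutoff_bits)           # ah*bh
    lo = karatsuba(al, bl, cutoff_bits)           # al*bl
    k = karatsuba(ah + al, bh + bl, cutoff_bits)  # (ah+al)*(bh+bl)
    # k - hi - lo == ah*bl + al*bh, the middle term of the product.
    return (hi << (2 * half)) + ((k - hi - lo) << half) + lo
```

Since ah+al and bh+bl can each be one bit wider than the halves, k is larger than the middle term ah*bl + al*bh that finally lands in the result, which is the "artificially" large value the paragraph above is worried about.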
addition and subtraction. Reworked the tail end of k_mul() to use them.
This saves oodles of one-shot longobject allocations (this is a triply-
recursive routine, so saving one allocation in the body saves 3**n
allocations at depth n; we actually save 2 allocations in the body).