* Using addition instead of substraction on array indices allows the
compiler to use a fast addressing mode. Saves about 10%.
* Using PyTuple_GET_ITEM and PyList_SET_ITEM is about 7% faster than
PySequenceFast_GET_ITEM which has to make a list check on every pass.
utilization, and speed:
* Moved the responsibility for emptying the previous list from list_fill
to list_init.
* Replaced the code in list_extend with the superior code from list_fill.
* Eliminated list_fill.
Results:
* list.extend() no longer creates an intermediate tuple except to handle
the special case of x.extend(x). The saves memory and time.
* list.extend(x) runs
5 to 10% faster when x is a list or tuple
15% faster when x is an iterable not defining __len__
twice as fast when x is an iterable defining __len__
* the code is about 15 lines shorter and no longer duplicates
functionality.
The Py2.3 approach overallocated small lists by up to 8 elements.
The last checkin would limited this to one but slowed down (by 20 to 30%)
the creation of small lists between 3 to 8 elements.
This tune-up balances the two, limiting overallocation to 3 elements
(significantly reducing space consumption from Py2.3) and running faster
than the previous checkin.
The first part of the growth pattern (0, 4, 8, 16) neatly meshes with
allocators that trigger data movement only when crossing a power of two
boundary. Also, then even numbers mesh well with common data alignments.
realloc(). This is achieved by tracking the overallocation size in a new
field and using that information to skip calls to realloc() whenever
possible.
* Simplified and tightened the amount of overallocation. For larger lists,
this overallocates by 1/8th (compared to the previous scheme which ranged
between 1/4th to 1/32nd over-allocation). For smaller lists (n<6), the
maximum overallocation is one byte (formerly it could be upto eight bytes).
This saves memory in applications with large numbers of small lists.
* Eliminated the NRESIZE macro in favor of a new, static list_resize function
that encapsulates the resizing logic. Coverting this back to macro would
give a small (under 1%) speed-up. This was too small to warrant the loss
of readability, maintainability, and de-coupling.
* Some functions using NRESIZE had grown unnecessarily complex in their
efforts to bend to the macro's calling pattern. With the new list_resize
function in place, those other functions could be simplified. That is
being saved for a separate patch.
* The ob_item==NULL check could be eliminated from the new list_resize
function. This would entail finding each piece of code that sets ob_item
to NULL and adding a new line to invalidate the overallocation tracking
field. Rather than impose a new requirement on other pieces of list code,
it was preferred to leave the NULL check in place and retain the benefits
of decoupling, maintainability and information hiding (only PyList_New()
and list_sort() need to know about the new field). This approach also
reduces the odds of breaking an extension module.
(Collaborative effort by Raymond Hettinger, Hye-Shik Chang, Tim Peters,
and Armin Rigo.)
Formerly, length data fetched from sequence objects.
Now, any object that reports its length can benefit from pre-sizing.
On one sample timing, it gave a threefold speedup for list(s) where s
was a set object.
The special-case code that was removed could return a value indicating
success but leave an exception set. test_fileinput failed in a debug
build as a result.
which can be reviewed via
http://coding.derkeiler.com/Archive/Python/comp.lang.python/2003-12/1011.html
Duncan Booth investigated, and discovered that an "optimisation" was
in fact a pessimisation for small numbers of elements in a source list,
compared to not having the optimisation, although with large numbers
of elements in the source list the optimisation was quite beneficial.
He posted his change to comp.lang.python (but not to SF).
Further research has confirmed his assessment that the optimisation only
becomes a net win when the source list has more than 100 elements.
I also found that the optimisation could apply to tuples as well,
but the gains only arrive with source tuples larger than about 320
elements and are nowhere near as significant as the gains with lists,
(~95% gain @ 10000 elements for lists, ~20% gain @ 10000 elements for
tuples) so I haven't proceeded with this.
The code as it was applied the optimisation to list subclasses as
well, and this also appears to be a net loss for all reasonable sized
sources (~80-100% for up to 100 elements, ~20% for more than 500
elements; I tested up to 10000 elements).
Duncan also suggested special casing empty lists, which I've extended
to all empty sequences.
On the basis that list_fill() is only ever called with a list for the
result argument, testing for the source being the destination has
now happens before testing source types.
key provides C support for the decorate-sort-undecorate pattern.
reverse provide a stable sort of the list with the comparisions reversed.
* Amended the docs to guarantee sort stability.
Reverted a Py2.3b1 change to iterator in subclasses of list and tuple.
They had been changed to use __getitem__ whenever it had been overriden
in the subclass.
This caused some usabilty and performance problems. Also, it was
inconsistent with the rest of python where many container methods
access the underlying object directly without first checking for
an overridden getter. Users needing a change in iterator behavior
should override it directly.
As a side issue on this bug, it was noted that list and tuple iterators
used macros to directly access containers and would not recognize
__getitem__ overrides. If the method is overridden, the patch returns
a generic sequence iterator which calls the __getitem__ method; otherwise,
it returns a high custom iterator with direct access to container elements.
interpreted by slicing, so negative values count from the end of the
list. This was the only place where such an interpretation was not
placed on a list index.
Obtain cleaner coding and a system wide
performance boost by using the fast, pre-parsed
PyArg_Unpack function instead of PyArg_ParseTuple
function which is driven by a format string.
Armin Rigo's Draconian but effective fix for
SF bug 453523: list.sort crasher
slightly fiddled to catch more cases of list mutation. The dreaded
internal "immutable list type" is gone! OTOH, if you look at a list
*while* it's being sorted now, it will appear to be empty. Better
than a core dump.
This is friendlier for caches.
2. Cut MIN_GALLOP to 7, but added a per-sort min_gallop vrbl that adapts
the "get into galloping mode" threshold higher when galloping isn't
paying, and lower when it is. There's no known case where this hurts.
It's (of course) neutral for /sort, \sort and =sort. It also happens
to be neutral for !sort. It cuts a tiny # of compares in 3sort and +sort.
For *sort, it reduces the # of compares to better than what this used to
do when MIN_GALLOP was hardcoded to 10 (it did about 0.1% more *sort
compares before, but given how close we are to the limit, this is "a
lot"!). %sort used to do about 1.5% more compares, and ~sort about
3.6% more. Here are exact counts:
i *sort 3sort +sort %sort ~sort !sort
15 449235 33019 33016 51328 188720 65534 before
448885 33016 33007 50426 182083 65534 after
0.08% 0.01% 0.03% 1.79% 3.65% 0.00% %ch from after
16 963714 65824 65809 103409 377634 131070
962991 65821 65808 101667 364341 131070
0.08% 0.00% 0.00% 1.71% 3.65% 0.00%
17 2059092 131413 131362 209130 755476 262142
2057533 131410 131361 206193 728871 262142
0.08% 0.00% 0.00% 1.42% 3.65% 0.00%
18 4380687 262440 262460 421998 1511174 524286
4377402 262437 262459 416347 1457945 524286
0.08% 0.00% 0.00% 1.36% 3.65% 0.00%
19 9285709 524581 524634 848590 3022584 1048574
9278734 524580 524633 837947 2916107 1048574
0.08% 0.00% 0.00% 1.27% 3.65% 0.00%
20 19621118 1048960 1048942 1715806 6045418 2097150
19606028 1048958 1048941 1694896 5832445 2097150
0.08% 0.00% 0.00% 1.23% 3.65% 0.00%
3. Added some key asserts I overlooked before.
4. Updated the doc file.
directly when no comparison function is specified. This saves a layer
of function call on every compare then. Measured speedups:
i 2**i *sort \sort /sort 3sort +sort %sort ~sort =sort !sort
15 32768 12.5% 0.0% 0.0% 100.0% 0.0% 50.0% 100.0% 100.0% -50.0%
16 65536 8.7% 0.0% 0.0% 0.0% 0.0% 0.0% 12.5% 0.0% 0.0%
17 131072 8.0% 25.0% 0.0% 25.0% 0.0% 14.3% 5.9% 0.0% 0.0%
18 262144 6.3% -10.0% 12.5% 11.1% 0.0% 6.3% 5.6% 12.5% 0.0%
19 524288 5.3% 5.9% 0.0% 5.6% 0.0% 5.9% 5.4% 0.0% 2.9%
20 1048576 5.3% 2.9% 2.9% 5.1% 2.8% 1.3% 5.9% 2.9% 4.2%
The best indicators are those that take significant time (larger i), and
where sort doesn't do very few compares (so *sort and ~sort benefit most
reliably). The large numbers are due to roundoff noise combined with
platform variability; e.g., the 14.3% speedup for %sort at i=17 reflects
a printed elapsed time of 0.18 seconds falling to 0.17, but a change in
the last digit isn't really meaningful (indeed, if it really took 0.175
seconds, one electron having a lazy nanosecond could shift it to either
value <wink>). Similarly the 25% at 3sort i=17 was a meaningless change
from 0.05 to 0.04. However, almost all the "meaningless changes" were
in the same direction, which is good. The before-and-after times for
*sort are clearest:
before after
0.18 0.16
0.25 0.23
0.54 0.50
1.18 1.11
2.57 2.44
5.58 5.30
listsort. If the former calls itself recursively, they're a waste of
time, since it's called on a random permutation of a random subset of
elements. OTOH, for exactly the same reason, they're an immeasurably
small waste of time (the odds of finding exploitable order in a random
permutation are ~= 0, so the special-case loops looking for order give
up quickly). The point is more for conceptual clarity.
Also changed some "assert comments" into real asserts; when this code
was first written, Python.h didn't supply assert.h.
introduced, list.sort() was rewritten to use only the "< or not <?"
distinction. After rich comparisons were introduced, docompare() was
fiddled to translate a Py_LT Boolean result into the old "-1 for <,
0 for ==, 1 for >" flavor of outcome, and the sorting code was left
alone. This left things more obscure than they should be, and turns
out it also cost measurable cycles.
So: The old CMPERROR novelty is gone. docompare() is renamed to islt(),
and now has the same return conditinos as PyObject_RichCompareBool. The
SETK macro is renamed to ISLT, and is even weirder than before (don't
complain unless you want to maintain the sort code <wink>).
Overall, this yields a 1-2% speedup in the usual (no explicit function
passed to list.sort()) case when sorting arrays of floats (as sortperf.py
does). The boost is higher for arrays of ints.
The staticforward define was needed to support certain broken C
compilers (notably SCO ODT 3.0, perhaps early AIX as well) botched the
static keyword when it was used with a forward declaration of a static
initialized structure. Standard C allows the forward declaration with
static, and we've decided to stop catering to broken C compilers. (In
fact, we expect that the compilers are all fixed eight years later.)
I'm leaving staticforward and statichere defined in object.h as
static. This is only for backwards compatibility with C extensions
that might still use it.
XXX I haven't updated the documentation.
it_seq field when the end of the list is reached.
Also remove the next() method -- one is supplied automatically by
PyType_Ready() because the tp_iternext slot is set. That's a good
thing, because the implementation given here was buggy (it never
raised StopIteration).
explicit comparison function case: use PyObject_Call instead of
PyEval_CallObject. Same thing in context, but gives a 2.4% overall
speedup when sorting a list of ints via list.sort(__builtin__.cmp).
arg tuple. This was suggested on c.l.py but afraid I can't find the msg
again for proper attribution. For
list.sort(cmp)
where list is a list of random ints, and cmp is __builtin__.cmp, this
yields an overall 50-60% speedup on my Win2K box. Of course this is a
best case, because the overhead of calling cmp relative to the cost of
actually comparing two ints is at an extreme. Nevertheless it's huge
bang for the buck. An additionak 20-30% can be bought by making the arg
tuple an immortal static (avoiding all but "the first" PyTuple_New), but
that's tricky to make correct since docompare needs to be reentrant. So
this picks the cherry and leaves the pits for Fred <wink>.
Note that this makes no difference to the
list.sort()
case; an arg tuple gets built only if the user specifies an explicit
sort function.
[ 400998 ] experimental support for extended slicing on lists
somewhat spruced up and better tested than it was when I wrote it.
Includes docs & tests. The whatsnew section needs expanding, and arrays
should support extended slices -- later.
A MemoryError is now raised when the list cannot be created.
There is a test, but as the comment says, it really only
works for 32 bit systems. I don't know how to improve
the test for other systems (ie, 64 bit or systems
where the data size != addressable size,
e.g. 64 bit data, but 48 bit addressable memory)
The fix makes it possible to call PyObject_GC_UnTrack() more than once
on the same object, and then move the PyObject_GC_UnTrack() call to
*before* the trashcan code is invoked.
BUGFIX CANDIDATE!
Rather than tweaking the inheritance of type object slots (which turns
out to be too messy to try), this fix adds a __hash__ to the list and
dict types (the only mutable types I'm aware of) that explicitly
raises an error. This has the advantage that list.__hash__([]) also
raises an error (previously, this would invoke object.__hash__([]),
returning the argument's address); ditto for dict.__hash__.
The disadvantage for this fix is that 3rd party mutable types aren't
automatically fixed. This should be added to the rules for creating
subclassable extension types: if you don't want your object to be
hashable, add a tp_hash function that raises an exception.
Also, it's possible that I've forgotten about other mutable types for
which this should be done.
many types were subclassable but had a xxx_dealloc function that
called PyObject_DEL(self) directly instead of deferring to
self->ob_type->tp_free(self). It is permissible to set tp_free in the
type object directly to _PyObject_Del, for non-GC types, or to
_PyObject_GC_Del, for GC types. Still, PyObject_DEL was a tad faster,
so I'm fearing that our pystone rating is going down again. I'm not
sure if doing something like
void xxx_dealloc(PyObject *self)
{
if (PyXxxCheckExact(self))
PyObject_DEL(self);
else
self->ob_type->tp_free(self);
}
is any faster than always calling the else branch, so I haven't
attempted that -- however those types whose own dealloc is fancier
(int, float, unicode) do use this pattern.
Gave Python linear-time repr() implementations for dicts, lists, strings.
This means, e.g., that repr(range(50000)) is no longer 50x slower than
pprint.pprint() in 2.2 <wink>.
I don't consider this a bugfix candidate, as it's a performance boost.
Added _PyString_Join() to the internal string API. If we want that in the
public API, fine, but then it requires runtime error checks instead of
asserts.
resizing.
Accurate timings are impossible on my Win98SE box, but this is obviously
faster even on this box for reasonable list.append() cases. I give
credit for this not to the resizing strategy but to getting rid of integer
multiplication and divsion (in favor of shifting) when computing the
rounded-up size.
For unreasonable list.append() cases, Win98SE now displays linear behavior
for one-at-time appends up to a list with about 35 million elements. Then
it dies with a MemoryError, due to fatally fragmented *address space*
(there's plenty of VM available, but by this point Win9X has broken user
space into many distinct heaps none of which has enough contiguous space
left to resize the list, and for whatever reason Win9x isn't coalescing
the dead heaps). Before the patch it got a MemoryError for the same
reason, but once the list reached about 2 million elements.
Haven't yet tried on Win2K but have high hopes extreme list.append()
will be much better behaved now (NT & Win2K didn't fragment address space,
but suffered obvious quadratic-time behavior before as lists got large).
For other systems I'm relying on common sense: replacing integer * and /
by << and >> can't plausibly hurt, the number of function calls hasn't
changed, and the total operation count for reasonably small lists is about
the same (while the operations are cheaper now).
This fixes SF bug #132008, reported by Warren J. Hack.
The copyright for this patch (and this patch only) belongs to CNRI, as
part of the (yet to be issued) 1.6.1 release.
This is now checked into the HEAD branch. Tim will check in a test
case to check for this specific bug, and an assertion in
PyArgs_ParseTuple() to catch similar bugs in the future.
- sort's docompare() calls RichCompare(Py_LT).
- list_contains(), list_index(), listcount(), listremove() call
RichCompare(Py_EQ).
- Get rid of list_compare(), in favor of new list_richcompare(). The
latter does some nice shortcuts, like when == or != is requested, it
first compares the lengths for trivial accept/reject. Then it goes
over the items until it finds an index where the items differe; then
it does more shortcut magic to minimize the number of additional
comparisons.
- Aligned the comments for large struct initializers.
Add definitions of INT_MAX and LONG_MAX to pyport.h.
Remove includes of limits.h and conditional definitions of INT_MAX
and LONG_MAX elsewhere.
This closes SourceForge patch #101659 and bug #115323.
This patch modifies the type structures of objects that
participate in GC. The object's tp_basicsize is increased when
GC is enabled. GC information is prefixed to the object to
maintain binary compatibility. GC objects also define the
tp_flag Py_TPFLAGS_GC.
The following patch adds "sq_contains" support to rangeobject, and enables
the already-written support for sq_contains in listobject and tupleobject.
The rangeobject "contains" code should be a bit more efficient than the
current default "in" implementation ;-) It might not get used much, but it's
not that much to add.
listobject.c and tupleobject.c already had code for sq_contains, and the
proper struct member was set, but the PyType structure was not extended to
include tp_flags, so the object-specific code was not getting called (Go
ahead, test it ;-). I also did this for the immutable_list_type in
listobject.c, eventhough it is probably never used. Symmetry and all that.
For more comments, read the patches@python.org archives.
For documentation read the comments in mymalloc.h and objimpl.h.
(This is not exactly what Vladimir posted to the patches list; I've
made a few changes, and Vladimir sent me a fix in private email for a
problem that only occurs in debug mode. I'm also holding back on his
change to main.c, which seems unnecessary to me.)
Added wrapping macros to dictobject.c, listobject.c, tupleobject.c,
frameobject.c, traceback.c that safely prevends core dumps
on stack overflow. Macros and functions in object.c, object.h.
The method is an "elevator destructor" that turns cascading
deletes into tail recursive behavior when some limit is hit.
diagnostics.
*** INCOMPATIBLE CHANGE: This changes append(), remove(), index(), and
*** count() to require exactly one argument -- previously, multiple
*** arguments were silently assumed to be a tuple.
+ Took the "list" argument out of the other functions that no longer need
it. This speeds things up a little more.
+ Small comment changes in accord with that.
+ Exploited the now-safe ability to cache values in the partitioning loop.
Makes no timing difference on my flavor of Pentium, but this machine ran out
of registers 12 iterations ago. It should yield a small speedup on a RISC
machine, and not hurt in any case.
instead of testing whether the list changed size after each
comparison, temporarily set the type of the list to an immutable list
type. This should allow continued use of the list for legitimate
purposes but disallows all operations that can change it in any way.
(Changes to the internals of list items are not caught, of cause;
that's not possible to detect, and it's not necessary to protect the
sort code, either.)
object pointers. Should be a bit faster than the C library's qsort(),
and doesn't have the prohibition on recursion that Solaris qsort() has
in the threaded version of their C library.
Thanks to discussions with Tim Peters.
* {tuple,list,mapping,array}object.c: call printobject with 0 for flags
* compile.c (parsestr): use quote instead of '\'' at one crucial point
* arraymodule.c (array_getattr): Added __members__ attribute
setlistslice() can be used to cut the unused part out of a freshly made
slice (as done by bagof()). [needed by the next mod!]
* structural changes to bagof(), map() etc.
image objects, and lots of new methods.
* Added counting of allocations and deallocations of builtin types if
COUNT_ALLOCS is defined. Had to move calls to NEWREF down in some
files.
* Bug fix in sorting lists.
Added $(SYSDEF) to its build rule in Makefile.
* cgensupport.[ch], modsupport.[ch]: removed some old stuff. Also
changed files that still used it... And made several things static
that weren't but should have been... And other minor cleanups...
* listobject.[ch]: add external interfaces {set,get}listslice
* socketmodule.c: fix bugs in new send() argument parsing.
* sunaudiodevmodule.c: added flush() and close().
listfontnames, bitmap ops.
* listobject.c: use mkvalue() when possible; avoid weird error when
calling append() without args.
* modsupport.c: new feature in getargs(): if the format string
contains a semicolor the string after that is used as the error
message instead of "bad argument list (format %s)" when there's an
error.
* various modules: added 1993 to copyright.
* thread.c: added copyright notice.
* ceval.c: minor change to error message for "+"
* stdwinmodule.c: check for error from wfetchcolor
* config.c: MS-DOS fixes (define PYTHONPATH, use DELIM, use osdefs.h)
* Add declaration of inittab to import.h
* sysmodule.c: added sys.builtin_module_names
* xxmodule.c, xxobject.c: fix minor errors
* posixmodule.c: move extern function declarations to top
* listobject.c: cmp() arguments must be void* if __STDC__
* Makefile, allobjects.h, panelmodule.c, modsupport.c: get rid of
strdup() -- it is a portability risk
* Makefile: enclosed ranlib command in parentheses for Sequent Make
which aborts if the command is not found even if '-' is present
* timemodule.c: time() returns a floating point number, in microsecond
precision if BSD_TIME is defined.
* flmodule.c: added some missing functions; changed readonly flags of
some data members based upon FORMS documentation.
* listobject.c: fixed int/long arg lint bug (bites PC compilers).
* several: removed redundant print methods (repr is good enough).
* posixmodule.c: added (still experimental) process group functions.
argument to malloc() (size_t or unsigned int)
* listobject.c: check for overflow of the size of the object,
so things like range(0x7fffffff) will raise MemoryError instead
of calling malloc() with -4 (and then crashing -- malloc's fault)
* socketmodule.c: get rid of makepair(); fix makesocketaddr to fix
broken recvfrom()
* socketmodule: get rid of getStrarg()
* ceval.h: move eval_code() to new file eval.h, so compile.h is no
longer needed.
* ceval.c: move thread comments to ceval.h; always make save/restore
thread functions available (for dynloaded modules)
* cdmodule.c, listobject.c: don't include compile.h
* flmodule.c: include ceval.h
* import.c: include eval.h instead of ceval.h
* cgen.py: add forground(); noport(); winopen(""); to initgl().
* bltinmodule.c, socketmodule.c, fileobject.c, posixmodule.c,
selectmodule.c:
adapt to threads (add BGN/END SAVE macros)
* stdwinmodule.c: adapt to threads and use a special stdwin lock.
* pythonmain.c: don't include getpythonpath().
* pythonrun.c: use BGN/END SAVE instead of direct calls; also more
BGN/END SAVE calls etc.
* thread.c: bigger stack size for sun; change exit() to _exit()
* threadmodule.c: use BGN/END SAVE macros where possible
* timemodule.c: adapt better to threads; use BGN/END SAVE; add
longsleep internal function if BSD_TIME; cosmetics