Remove the old comment suggesting that it was desireable to have
blocksize+2 as a multiple of the cache line length. That would
have made sense only if the block structure start point was always
aligned to a cache line boundary. However, the memory allocations
are 16 byte aligned, so we don't really have control over whether
the struct spills across cache line boundaries.
The former block size traded away good fit within cache lines in
order to gain faster division in deque_item(). However, compilers
are getting smarter and can now replace the slow division operation
with a fast integer multiply and right shift. Accordingly, it makes
sense to go back to a size that lets blocks neatly fill entire
cache-lines.
GCC-4.8 and CLANG 4.0 both compute "x // 62" with something
roughly equivalent to "x * 9520900167075897609 >> 69".
* Add comment explaining the endpoint checks
* Only do the checks in a debug build
* Simplify newblock() to only require a length argument
and leave the link updates to the calling code.
* Also add comment for the freelisting logic.
The division and modulo calculation in deque_item() can be compiled
to fast bitwise operations when the BLOCKLEN is a power of two.
Timing before:
~/cpython $ py -m timeit -r7 -s 'from collections import deque' -s 'd=deque(range(10))' 'd[5]'
10000000 loops, best of 7: 0.0627 usec per loop
Timing after:
~/cpython $ py -m timeit -r7 -s 'from collections import deque' -s 'd=deque(range(10))' 'd[5]'
10000000 loops, best of 7: 0.0581 usec per loop
* Clarified comment on the impact of BLOCKLEN on deque_index
(with a power-of-two, the division and modulo
computations are done with a right-shift and bitwise-and).
* Clarified comment on the overflow check to note that
it is general and not just applicable the 64-bit builds.
* In deque._rotate(), the "deque->" indirections are
factored-out of the loop (loop invariant code motion),
leaving the code cleaner looking and slightly faster.
* In deque._rotate(), replaced the memcpy() with an
equivalent loop. That saved the memcpy setup time
and allowed the pointers to move in their natural
leftward and rightward directions.
See comparative timings at: http://pastebin.com/p0RJnT5N