bpo-46406: Faster single digit int division. (#30626)

* bpo-46406: Faster single digit int division.

This expresses the algorithm in a more basic manner, resulting in better
instruction generation by today's compilers.

See https://mail.python.org/archives/list/python-dev@python.org/thread/ZICIMX5VFCX4IOFH5NUPVHCUJCQ4Q7QM/#NEUNFZU3TQU4CPTYZNF3WCN7DOJBBTK5
Gregory P. Smith 2022-01-23 02:00:41 -08:00 committed by GitHub
parent 83a0ef2162
commit c7f20f1cc8
2 changed files with 28 additions and 9 deletions

@ -0,0 +1,3 @@
The integer division ``//`` implementation has been optimized to better let the
compiler understand its constraints. It can be 20% faster on the amd64 platform
when dividing an int by a value smaller than ``2**30``.
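
One way to read the constraint mentioned in this entry: at every step the running remainder is smaller than the divisor, so the two-digit dividend is smaller than `n * 2**30` and the quotient always fits in a single 30-bit digit. The sketch below illustrates that bound and is not CPython's code. `digit`, `twodigits`, `DEMO_SHIFT`, `DEMO_MASK` and `divide_step` are hypothetical stand-ins, assuming the 30-bit digit configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for CPython's types, assuming 30-bit digits. */
typedef uint32_t digit;
typedef uint64_t twodigits;
#define DEMO_SHIFT 30
#define DEMO_MASK ((digit)(((digit)1 << DEMO_SHIFT) - 1))

/* One step of schoolbook division by a single digit.
   Preconditions: 0 < n <= DEMO_MASK, *remainder < n, low <= DEMO_MASK.
   Under those, dividend < n * 2**30, so dividend / n fits in one digit. */
static digit
divide_step(digit *remainder, digit low, digit n)
{
    assert(n > 0 && n <= DEMO_MASK);
    assert(*remainder < n && low <= DEMO_MASK);

    twodigits dividend = ((twodigits)*remainder << DEMO_SHIFT) | low;
    digit quotient = (digit)(dividend / n);
    *remainder = (digit)(dividend % n);
    return quotient;
}
```

Calling `divide_step` with the largest admissible inputs (`n = DEMO_MASK`, remainder `DEMO_MASK - 1`, low digit `DEMO_MASK`) is a quick way to check that the quotient never exceeds `DEMO_MASK`.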

@ -1617,25 +1617,41 @@ v_rshift(digit *z, digit *a, Py_ssize_t m, int d)
    in pout, and returning the remainder. pin and pout point at the LSD.
    It's OK for pin == pout on entry, which saves oodles of mallocs/frees in
    _PyLong_Format, but that should be done with great care since ints are
-   immutable. */
+   immutable.
+   This version of the code can be 20% faster than the pre-2022 version
+   on today's compilers on architectures like amd64. It evolved from Mark
+   Dickinson observing that a 128:64 divide instruction was always being
+   generated by the compiler despite us working with 30-bit digit values.
+   See the thread for full context:
+   https://mail.python.org/archives/list/python-dev@python.org/thread/ZICIMX5VFCX4IOFH5NUPVHCUJCQ4Q7QM/#NEUNFZU3TQU4CPTYZNF3WCN7DOJBBTK5
+   If you ever want to change this code, pay attention to performance using
+   different compilers, optimization levels, and cpu architectures. Beware of
+   PGO/FDO builds doing value specialization such as a fast path for //10. :)
+   Verify that 17 isn't specialized and this works as a quick test:
+   python -m timeit -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
+*/
 static digit
 inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
 {
-    twodigits rem = 0;
+    digit remainder = 0;
     assert(n > 0 && n <= PyLong_MASK);
-    pin += size;
-    pout += size;
     while (--size >= 0) {
-        digit hi;
-        rem = (rem << PyLong_SHIFT) | *--pin;
-        *--pout = hi = (digit)(rem / n);
-        rem -= (twodigits)hi * n;
+        twodigits dividend;
+        dividend = ((twodigits)remainder << PyLong_SHIFT) | pin[size];
+        digit quotient;
+        quotient = (digit)(dividend / n);
+        remainder = dividend % n;
+        pout[size] = quotient;
     }
-    return (digit)rem;
+    return remainder;
 }
 /* Divide an integer by a digit, returning both the quotient
    (as function result) and the remainder (through *prem).
    The sign of a is ignored; n should not be zero. */