All Blake2 params have to be encoded in little-endian byte order. For
the two multi-byte integer params, leaf_length and node_offset, that
means that assigning a native-endian integer to them appears to work on
little-endian platforms, but gives the wrong result on big-endian. The
current libb2 API doesn't make that very clear, and @sneves is working
on new API functions in the GH issue above. In the meantime, we can work
around the problem by explicitly assigning little-endian values to the
parameter block.
See https://github.com/BLAKE2/libb2/issues/12.
Replace occurence of nested comments in blake2 reference implementation
with preprocessor directive for disabling unused code.
`blake2s-load-xop.h` is conditionally pulled in only on chips with XOP
support, among others the AMD Bulldozer. The malformed comments in the
source file breaks the build of `hashlib`'s `_blake2` on GCC 6.3.0.
Official reference code on github uses `#if` so this change should be
uncontroversial.
Rework the code choosing BLAKE2 code paths from using the optimized
variant on all x86_64 machines to using it when SSSE3 or better
supported instructions sets are available.
Firstly, this solves the problem of using pure SSE2 code path on x86_64
machines. As reported in the bug, this code is slower than the reference
code on all tested x86_64 machines. Furthermore, on Athlon64 that lacks
SSSE3, it is even 2.5 times slower than the reference code! Checking
for SSSE3 therefore ensures that the optimized implementation will only
be used when it has a chance of performing better.
Secondly, this makes it possible to use SSSE3+ optimizations on 32-bit
x86 systems. This allows for even 2 times speed gain on modern 32-bit
x86 systems (tested in a 32-bit chroot).