Measured speedup on x86_64 Linux
--------------------------------
clang, -O2, without patch:
* 5,000,000x CRC of a 256 byte buffer: TOOK: 0.858567
* 5,000,000x CRC of a 512 byte buffer: TOOK: 1.67744
* 5,000,000x CRC of a 1024 byte buffer: TOOK: 3.31552
* 5,000,000x CRC of a 2048 byte buffer: TOOK: 6.58735
* 5,000,000x CRC of a 4096 byte buffer: TOOK: 13.1924
clang, -O2, with patch:
* 5,000,000x CRC of a 256 byte buffer: TOOK: 0.669745
* 5,000,000x CRC of a 512 byte buffer: TOOK: 1.3234
* 5,000,000x CRC of a 1024 byte buffer: TOOK: 2.63565
* 5,000,000x CRC of a 2048 byte buffer: TOOK: 5.26927
* 5,000,000x CRC of a 4096 byte buffer: TOOK: 10.6086
gcc, -O2, without patch:
* 5,000,000x CRC of a 256 byte buffer: TOOK: 0.752911
* 5,000,000x CRC of a 512 byte buffer: TOOK: 1.46402
* 5,000,000x CRC of a 1024 byte buffer: TOOK: 2.88934
* 5,000,000x CRC of a 2048 byte buffer: TOOK: 5.74819
* 5,000,000x CRC of a 4096 byte buffer: TOOK: 11.4839
gcc, -O2, with patch:
* 5,000,000x CRC of a 256 byte buffer: TOOK: 0.643093
* 5,000,000x CRC of a 512 byte buffer: TOOK: 1.20488
* 5,000,000x CRC of a 1024 byte buffer: TOOK: 2.39155
* 5,000,000x CRC of a 2048 byte buffer: TOOK: 4.75178
* 5,000,000x CRC of a 4096 byte buffer: TOOK: 9.34864
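
The lists above give raw timings only, so the speedup itself has to be derived. A minimal sketch of that arithmetic follows; the timings are copied verbatim from the lists above, while the names SIZES, TIMINGS and speedups are illustrative and not part of the patch. The unit of the "TOOK" values is not stated, which does not matter here because a ratio is unit-independent.

```python
# Timings copied from the benchmark lists above.
# The unit of "TOOK" is not stated in the source; ratios do not depend on it.
SIZES = [256, 512, 1024, 2048, 4096]  # buffer sizes in bytes

TIMINGS = {
    "clang": {
        "without": [0.858567, 1.67744, 3.31552, 6.58735, 13.1924],
        "with": [0.669745, 1.3234, 2.63565, 5.26927, 10.6086],
    },
    "gcc": {
        "without": [0.752911, 1.46402, 2.88934, 5.74819, 11.4839],
        "with": [0.643093, 1.20488, 2.39155, 4.75178, 9.34864],
    },
}


def speedups(compiler):
    """Return time-without-patch / time-with-patch for each buffer size."""
    t = TIMINGS[compiler]
    return [before / after for before, after in zip(t["without"], t["with"])]


for compiler in TIMINGS:
    for size, ratio in zip(SIZES, speedups(compiler)):
        print(f"{compiler:5s} {size:4d} byte buffer: {ratio:.2f}x")
```

On these numbers the patched build comes out about 1.17x to 1.28x faster, depending on compiler and buffer size.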