huff0: Assembler improvements (#736)
Main changes:

* Compute out[id * dstEvery + i] statically. This shaves four
  instructions off the main loops. (It also frees up a register.)

* Track "exhausted" by addition instead of OR. This gets rid of an
  additional instruction. The variable is now also zeroed inside the
  loop as a dependency hint.

Benchmark results show small speedups on some datasets:

```
name                                            old speed      new speed     delta
Decompress1XTable/digits-8                      350MB/s ± 0%   350MB/s ± 1%    ~     (p=0.764 n=10+9)
Decompress1XTable/gettysburg-8                  270MB/s ± 1%   268MB/s ± 1%  -0.72%  (p=0.001 n=10+10)
Decompress1XTable/twain-8                       329MB/s ± 1%   328MB/s ± 0%    ~     (p=0.035 n=10+9)
Decompress1XTable/low-ent.10k-8                 387MB/s ± 1%   386MB/s ± 0%    ~     (p=0.027 n=10+8)
Decompress1XTable/superlow-ent-10k-8            377MB/s ± 0%   375MB/s ± 0%  -0.48%  (p=0.000 n=10+10)
Decompress1XTable/crash2-8                     17.0MB/s ± 0%  16.9MB/s ± 0%  -0.36%  (p=0.004 n=9+10)
Decompress1XTable/endzerobits-8                53.3MB/s ± 0%  53.0MB/s ± 0%  -0.55%  (p=0.000 n=10+9)
Decompress1XTable/endnonzero-8                 11.3MB/s ± 0%  11.3MB/s ± 1%    ~     (p=0.060 n=10+10)
Decompress1XTable/case1-8                      22.0MB/s ± 0%  21.9MB/s ± 1%    ~     (p=0.015 n=9+9)
Decompress1XTable/case2-8                      18.1MB/s ± 1%  18.1MB/s ± 1%    ~     (p=0.202 n=10+9)
Decompress1XTable/case3-8                      19.1MB/s ± 1%  19.2MB/s ± 1%    ~     (p=0.056 n=9+10)
Decompress1XTable/pngdata.001-8                 374MB/s ± 0%   374MB/s ± 0%    ~     (p=0.148 n=10+10)
Decompress1XTable/normcount2-8                 54.4MB/s ± 1%  54.4MB/s ± 1%    ~     (p=0.617 n=10+10)
Decompress1XNoTable/digits/100-8                280MB/s ± 0%   280MB/s ± 1%    ~     (p=0.951 n=9+10)
Decompress1XNoTable/digits/10000-8              366MB/s ± 1%   367MB/s ± 0%    ~     (p=0.090 n=10+9)
Decompress1XNoTable/digits/262143-8             348MB/s ± 1%   349MB/s ± 0%    ~     (p=0.043 n=10+10)
Decompress1XNoTable/gettysburg/100-8            276MB/s ± 0%   277MB/s ± 1%  +0.44%  (p=0.009 n=10+10)
Decompress1XNoTable/gettysburg/10000-8          363MB/s ± 1%   363MB/s ± 0%    ~     (p=0.041 n=10+7)
Decompress1XNoTable/gettysburg/262143-8         349MB/s ± 1%   350MB/s ± 0%    ~     (p=0.123 n=10+10)
Decompress1XNoTable/twain/100-8                 267MB/s ± 0%   268MB/s ± 0%    ~     (p=0.052 n=10+10)
Decompress1XNoTable/twain/10000-8               357MB/s ± 3%   363MB/s ± 0%  +1.74%  (p=0.000 n=10+10)
Decompress1XNoTable/twain/262143-8              320MB/s ± 2%   329MB/s ± 0%  +3.09%  (p=0.000 n=10+10)
Decompress1XNoTable/low-ent.10k/100-8           183MB/s ± 1%   184MB/s ± 0%    ~     (p=0.211 n=9+10)
Decompress1XNoTable/low-ent.10k/10000-8         377MB/s ± 3%   385MB/s ± 1%  +2.14%  (p=0.000 n=10+10)
Decompress1XNoTable/low-ent.10k/262143-8        386MB/s ± 1%   389MB/s ± 1%  +0.84%  (p=0.005 n=10+10)
Decompress1XNoTable/superlow-ent-10k/262143-8   382MB/s ± 2%   389MB/s ± 1%  +1.89%  (p=0.001 n=10+10)
Decompress1XNoTable/crash2/100-8                276MB/s ± 2%   278MB/s ± 0%    ~     (p=0.180 n=10+8)
Decompress1XNoTable/crash2/10000-8              373MB/s ± 1%   374MB/s ± 1%    ~     (p=0.315 n=10+10)
Decompress1XNoTable/crash2/262143-8             373MB/s ± 1%   375MB/s ± 0%    ~     (p=0.165 n=10+8)
Decompress1XNoTable/endzerobits/100-8           184MB/s ± 0%   184MB/s ± 1%    ~     (p=0.845 n=9+9)
Decompress1XNoTable/endzerobits/10000-8         384MB/s ± 1%   386MB/s ± 0%  +0.61%  (p=0.007 n=10+10)
Decompress1XNoTable/endzerobits/262143-8        387MB/s ± 2%   389MB/s ± 0%    ~     (p=0.963 n=9+8)
Decompress1XNoTable/endnonzero/100-8            181MB/s ± 2%   183MB/s ± 0%    ~     (p=0.017 n=9+10)
Decompress1XNoTable/endnonzero/10000-8          385MB/s ± 0%   382MB/s ± 1%  -0.88%  (p=0.001 n=8+10)
Decompress1XNoTable/endnonzero/262143-8         387MB/s ± 1%   385MB/s ± 2%    ~     (p=0.143 n=10+10)
Decompress1XNoTable/case1/100-8                 278MB/s ± 2%   282MB/s ± 1%    ~     (p=0.013 n=10+9)
Decompress1XNoTable/case1/10000-8               373MB/s ± 1%   373MB/s ± 0%    ~     (p=0.274 n=10+8)
Decompress1XNoTable/case1/262143-8              374MB/s ± 1%   374MB/s ± 0%    ~     (p=0.589 n=10+9)
Decompress1XNoTable/case2/100-8                 274MB/s ± 0%   274MB/s ± 0%  -0.26%  (p=0.002 n=10+9)
Decompress1XNoTable/case2/10000-8               378MB/s ± 0%   377MB/s ± 0%    ~     (p=0.093 n=10+10)
Decompress1XNoTable/case2/262143-8              377MB/s ± 1%   376MB/s ± 1%    ~     (p=0.225 n=10+10)
Decompress1XNoTable/case3/100-8                 266MB/s ± 0%   265MB/s ± 0%  -0.20%  (p=0.007 n=10+9)
Decompress1XNoTable/case3/10000-8               371MB/s ± 0%   372MB/s ± 0%    ~     (p=0.211 n=10+9)
Decompress1XNoTable/case3/262143-8              373MB/s ± 0%   374MB/s ± 0%    ~     (p=0.073 n=10+10)
Decompress1XNoTable/pngdata.001/100-8           239MB/s ± 0%   239MB/s ± 0%    ~     (p=0.889 n=9+10)
Decompress1XNoTable/pngdata.001/10000-8         384MB/s ± 0%   384MB/s ± 0%    ~     (p=0.228 n=10+8)
Decompress1XNoTable/pngdata.001/262143-8        377MB/s ± 0%   379MB/s ± 0%  +0.56%  (p=0.000 n=10+10)
Decompress1XNoTable/normcount2/100-8            281MB/s ± 1%   282MB/s ± 1%    ~     (p=0.015 n=10+10)
Decompress1XNoTable/normcount2/10000-8          368MB/s ± 0%   370MB/s ± 0%  +0.37%  (p=0.004 n=10+10)
Decompress1XNoTable/normcount2/262143-8         371MB/s ± 0%   371MB/s ± 0%    ~     (p=0.034 n=8+10)
Decompress4XNoTable/digits/100-8                200MB/s ± 1%   201MB/s ± 0%    ~     (p=0.274 n=8+10)
Decompress4XNoTable/digits/10000-8              603MB/s ± 0%   622MB/s ± 1%  +3.20%  (p=0.000 n=8+10)
Decompress4XNoTable/digits/262143-8             578MB/s ± 0%   595MB/s ± 1%  +2.87%  (p=0.000 n=8+10)
Decompress4XNoTable/gettysburg/100-8            260MB/s ± 0%   260MB/s ± 1%    ~     (p=0.011 n=8+10)
Decompress4XNoTable/gettysburg/10000-8          643MB/s ± 0%   657MB/s ± 1%  +2.19%  (p=0.000 n=10+9)
Decompress4XNoTable/gettysburg/262143-8         572MB/s ± 0%   589MB/s ± 0%  +2.93%  (p=0.000 n=8+10)
Decompress4XNoTable/twain/100-8                 206MB/s ± 1%   206MB/s ± 1%    ~     (p=0.436 n=10+10)
Decompress4XNoTable/twain/10000-8               639MB/s ± 1%   653MB/s ± 1%  +2.25%  (p=0.000 n=10+10)
Decompress4XNoTable/twain/262143-8              516MB/s ± 0%   522MB/s ± 1%  +1.09%  (p=0.004 n=10+10)
Decompress4XNoTable/low-ent.10k/100-8           207MB/s ± 1%   207MB/s ± 0%    ~     (p=1.000 n=10+9)
Decompress4XNoTable/low-ent.10k/10000-8         631MB/s ± 0%   653MB/s ± 0%  +3.42%  (p=0.000 n=10+9)
Decompress4XNoTable/low-ent.10k/262143-8        685MB/s ± 1%   696MB/s ± 0%  +1.61%  (p=0.000 n=10+10)
Decompress4XNoTable/superlow-ent-10k/262143-8   684MB/s ± 1%   695MB/s ± 1%  +1.51%  (p=0.000 n=9+10)
Decompress4XNoTable/case1/100-8                 208MB/s ± 1%   207MB/s ± 0%    ~     (p=0.353 n=10+10)
Decompress4XNoTable/case1/10000-8               601MB/s ± 0%   621MB/s ± 1%  +3.22%  (p=0.000 n=10+10)
Decompress4XNoTable/case1/262143-8              613MB/s ± 1%   632MB/s ± 0%  +3.14%  (p=0.000 n=10+10)
Decompress4XNoTable/case2/100-8                 210MB/s ± 2%   208MB/s ± 2%    ~     (p=0.315 n=10+9)
Decompress4XNoTable/case2/10000-8               618MB/s ± 0%   636MB/s ± 0%  +2.95%  (p=0.000 n=10+10)
Decompress4XNoTable/case2/262143-8              635MB/s ± 0%   651MB/s ± 0%  +2.56%  (p=0.000 n=7+10)
Decompress4XNoTable/case3/100-8                 199MB/s ± 1%   200MB/s ± 1%    ~     (p=0.055 n=10+10)
Decompress4XNoTable/case3/10000-8               615MB/s ± 0%   633MB/s ± 1%  +2.94%  (p=0.000 n=10+10)
Decompress4XNoTable/case3/262143-8              620MB/s ± 0%   639MB/s ± 1%  +3.00%  (p=0.000 n=10+10)
Decompress4XNoTable/pngdata.001/100-8           212MB/s ± 0%   211MB/s ± 1%    ~     (p=0.211 n=10+9)
Decompress4XNoTable/pngdata.001/10000-8         649MB/s ± 0%   667MB/s ± 1%  +2.76%  (p=0.000 n=10+10)
Decompress4XNoTable/pngdata.001/262143-8        646MB/s ± 0%   660MB/s ± 0%  +2.28%  (p=0.000 n=9+10)
Decompress4XNoTable/normcount2/100-8            261MB/s ± 1%   262MB/s ± 1%    ~     (p=0.031 n=9+9)
Decompress4XNoTable/normcount2/10000-8          589MB/s ± 1%   613MB/s ± 0%  +3.99%  (p=0.000 n=10+9)
Decompress4XNoTable/normcount2/262143-8         585MB/s ± 3%   617MB/s ± 1%  +5.57%  (p=0.000 n=10+10)
Decompress4XNoTableTableLog8/digits-8           579MB/s ± 2%   610MB/s ± 0%  +5.33%  (p=0.000 n=10+10)
Decompress4XTable/digits-8                      584MB/s ± 1%   607MB/s ± 1%  +3.89%  (p=0.000 n=10+10)
Decompress4XTable/gettysburg-8                  370MB/s ± 0%   373MB/s ± 1%  +0.67%  (p=0.009 n=10+10)
Decompress4XTable/twain-8                       512MB/s ± 2%   523MB/s ± 1%  +2.08%  (p=0.000 n=9+10)
Decompress4XTable/low-ent.10k-8                 656MB/s ± 1%   677MB/s ± 1%  +3.21%  (p=0.000 n=10+10)
Decompress4XTable/superlow-ent-10k-8            603MB/s ± 4%   626MB/s ± 1%  +3.91%  (p=0.000 n=9+10)
Decompress4XTable/case1-8                      21.1MB/s ± 0%  21.0MB/s ± 0%  -0.55%  (p=0.000 n=9+9)
Decompress4XTable/case2-8                      17.6MB/s ± 0%  17.6MB/s ± 1%    ~     (p=0.736 n=9+10)
Decompress4XTable/case3-8                      18.7MB/s ± 1%  18.7MB/s ± 1%    ~     (p=0.642 n=10+10)
Decompress4XTable/pngdata.001-8                 648MB/s ± 0%   657MB/s ± 0%  +1.50%  (p=0.000 n=10+8)
Decompress4XTable/normcount2-8                 49.7MB/s ± 1%  49.7MB/s ± 1%    ~     (p=0.839 n=10+10)
[Geo mean]                                      271MB/s        274MB/s       +0.96%
```
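The change itself lives in the assembler source, which is not reproduced here. As a rough illustration only, the Go sketch below models the two ideas from the message: accumulating a per-iteration "exhausted" flag by addition rather than OR, and addressing the output as out[id * dstEvery + i] with a fixed stream id. The names `exhaustedOr`, `exhaustedAdd`, `flag`, `writeSymbols`, the `streams` constant, and the `minBytes` threshold are all hypothetical and are not taken from the huff0 code.

```go
package main

import "fmt"

// streams is an assumed stream count for the 4X decoder (hypothetical name).
const streams = 4

// flag converts a condition to 0 or 1 so it can be accumulated numerically.
func flag(b bool) uint8 {
	if b {
		return 1
	}
	return 0
}

// exhaustedOr combines the per-stream flags with OR.
func exhaustedOr(remaining [streams]int, minBytes int) bool {
	var exhausted uint8
	for _, r := range remaining {
		exhausted |= flag(r < minBytes)
	}
	return exhausted != 0
}

// exhaustedAdd combines the same flags by addition. The accumulator is
// reset on every call, mirroring a variable that is zeroed inside the loop.
// With at most four 0/1 flags the sum cannot overflow, so testing the sum
// for non-zero gives the same answer as the OR version.
func exhaustedAdd(remaining [streams]int, minBytes int) bool {
	exhausted := uint8(0)
	for _, r := range remaining {
		exhausted += flag(r < minBytes)
	}
	return exhausted != 0
}

// writeSymbols stores one symbol per stream at out[id*dstEvery+i].
// Because id only takes the fixed values 0..streams-1, the id*dstEvery
// term does not depend on the data being decoded.
func writeSymbols(out []byte, dstEvery, i int, syms [streams]byte) {
	for id := 0; id < streams; id++ {
		out[id*dstEvery+i] = syms[id]
	}
}

func main() {
	remaining := [streams]int{8, 3, 8, 8}
	fmt.Println(exhaustedOr(remaining, 4), exhaustedAdd(remaining, 4))

	out := make([]byte, streams*2)
	writeSymbols(out, 2, 1, [streams]byte{'a', 'b', 'c', 'd'})
	fmt.Printf("%q\n", out)
}
```

The sketch only shows that the two accumulation styles agree on the final test. It does not model instruction counts or register use, which depend on the generated assembly.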