
Add microbenchmarks for vector functions. #3

Closed
wants to merge 8 commits

Conversation

@jtibshirani (Owner) commented Aug 23, 2019

This PR shows some microbenchmarks for decoding vectors and taking the dot product of two vectors. These benchmarks are meant for local testing purposes and will not be merged into the elasticsearch repo.

Benchmark                                                  Mode  Cnt     Score     Error  Units
VectorFunctionBenchmark.decodeNoop                         avgt   30    31.820 ±   0.141  ns/op
VectorFunctionBenchmark.decode                             avgt   30   109.387 ±   0.179  ns/op
VectorFunctionBenchmark.decodeWithByteBuffer               avgt   30    74.400 ±   0.139  ns/op
VectorFunctionBenchmark.decodeWithUnrolling4               avgt   30   109.110 ±   0.150  ns/op
VectorFunctionBenchmark.dotProduct                         avgt   30    68.549 ±   0.737  ns/op
VectorFunctionBenchmark.dotProductWithUnrolling4           avgt   30    36.346 ±   0.030  ns/op
VectorFunctionBenchmark.decodeThenDotProduct               avgt   30   169.888 ±   0.274  ns/op
VectorFunctionBenchmark.decodeAndDotProduct                avgt   30   102.490 ±   0.135  ns/op
VectorFunctionBenchmark.decodeAndDotProductWithUnrolling2  avgt   30    92.387 ±   0.125  ns/op
VectorFunctionBenchmark.decodeAndDotProductWithUnrolling4  avgt   30   102.078 ±   0.281  ns/op

The results suggest a few directions to pursue, which I'll explore next in search macrobenchmarks:

  • Switching to ByteBuffer instead of manual shifts might help for decoding.
  • If a computation only requires one dot product, then decoding and performing the dot product at the same time could help.
  • In dotProductWithUnrolling4, we manually unroll the dot product loop to make clear that there are no dependencies between operations (see the sketch below). This likely encourages SIMD instructions to kick in, resulting in an improvement.
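
For reference, a minimal sketch of the unrolled dot product idea (the benchmark source isn't included in this conversation, so the exact method shape is an assumption):

static float dotProductWithUnrolling4(float[] docVector, float[] queryVector) {
    // Four independent accumulators remove the loop-carried dependency on a
    // single running sum, giving the JIT a better chance to auto-vectorize.
    float dot0 = 0, dot1 = 0, dot2 = 0, dot3 = 0;
    int length = (queryVector.length / 4) * 4; // largest multiple of 4
    for (int dim = 0; dim < length; dim += 4) {
        dot0 += docVector[dim] * queryVector[dim];
        dot1 += docVector[dim + 1] * queryVector[dim + 1];
        dot2 += docVector[dim + 2] * queryVector[dim + 2];
        dot3 += docVector[dim + 3] * queryVector[dim + 3];
    }
    // Handle leftover dimensions when the length isn't a multiple of 4.
    for (int dim = length; dim < queryVector.length; dim++) {
        dot0 += docVector[dim] * queryVector[dim];
    }
    return dot0 + dot1 + dot2 + dot3;
}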

Platform information:

  • openjdk 12.0.1 2019-04-16
  • Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

@mayya-sharipova commented Aug 25, 2019

@jtibshirani Thanks for this, great work! Great thinking to try ByteBuffer and unrolling to take advantage of SIMD.

I have added another function that combines both of your approaches: decoding with ByteBuffer and computing the dot product with unrolling.

static float decodeWithBufferAndDotProductWithUnrolling(float[] queryVector, BytesRef vectorBR) {
    if (vectorBR == null) {
        throw new IllegalArgumentException("A document doesn't have a value for a vector field!");
    }
    ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);

    // Four independent accumulators, so loop iterations don't depend on each other.
    float dot0 = 0;
    float dot1 = 0;
    float dot2 = 0;
    float dot3 = 0;

    // getFloat(int) takes an absolute index into the backing array, so we
    // start from vectorBR.offset rather than 0.
    int offset = vectorBR.offset;
    int length = (queryVector.length / 4) * 4; // largest multiple of 4
    for (int dim = 0; dim < length; dim += 4, offset += 16) {
        dot0 += byteBuffer.getFloat(offset) * queryVector[dim];
        dot1 += byteBuffer.getFloat(offset + 4) * queryVector[dim + 1];
        dot2 += byteBuffer.getFloat(offset + 8) * queryVector[dim + 2];
        dot3 += byteBuffer.getFloat(offset + 12) * queryVector[dim + 3];
    }

    // Handle leftover dimensions when the length isn't a multiple of 4.
    for (int dim = length; dim < queryVector.length; dim++, offset += 4) {
        dot0 += byteBuffer.getFloat(offset) * queryVector[dim];
    }
    return dot0 + dot1 + dot2 + dot3;
}
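
As an aside, these numbers come from JMH average-time runs (Mode avgt). A minimal harness for the new function might look like the following sketch; the state fields, dimension count, and setup values are assumptions, not the actual benchmark class:

import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.util.BytesRef;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class VectorFunctionBenchmark {
    private static final int DIMS = 100; // hypothetical dimension count

    private float[] queryVector;
    private BytesRef docVector;

    @Setup
    public void setup() {
        queryVector = new float[DIMS];
        byte[] bytes = new byte[DIMS * Float.BYTES];
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        for (int dim = 0; dim < DIMS; dim++) {
            queryVector[dim] = dim;
            buffer.putFloat(dim); // big-endian float encoding, matching getFloat above
        }
        docVector = new BytesRef(bytes);
    }

    @Benchmark
    public float benchDecodeWithBufferAndDotProductWithUnrolling() {
        // Calls the static method shown above (assumed to live in this class).
        return decodeWithBufferAndDotProductWithUnrolling(queryVector, docVector);
    }
}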

Here are results on my machine:

Benchmark                                                           Mode  Cnt     Score      Error  Units
VectorFunctionBenchmark.decodeNoop                                  avgt   30    41.404 ±    0.802  ns/op
VectorFunctionBenchmark.decode                                      avgt   30   127.624 ±    2.645  ns/op
VectorFunctionBenchmark.decodeWithByteBuffer                        avgt   30    88.841 ±    2.059  ns/op
VectorFunctionBenchmark.decodeWithUnrolling                         avgt   30   127.553 ±    1.940  ns/op

VectorFunctionBenchmark.dotProduct                                  avgt   30    76.797 ±    1.229  ns/op
VectorFunctionBenchmark.dotProductWithUnrolling                     avgt   30    41.452 ±    0.893  ns/op
VectorFunctionBenchmark.decodeThenDotProduct                        avgt   30   238.820 ±    3.210  ns/op
VectorFunctionBenchmark.decodeAndDotProduct                         avgt   30   172.425 ±    9.730  ns/op
VectorFunctionBenchmark.decodeWithBufferAndDotProductWithUnrolling  avgt   30   111.164 ±    2.277  ns/op

Indeed, an almost 2x speedup can be achieved by using ByteBuffer for decoding and unrolling the dot product.

My machine params:
Processor: 2.9 GHz Intel Core i7

sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

OpenJDK JDK 11.0.2, VM 11.0.2+9-LTS

@jtibshirani force-pushed the vector-microbenchmarks branch 3 times, most recently from 5116d06 to 52a211f on August 26, 2019 at 18:20
jtibshirani added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2019
This commit updates the vector encoding and decoding logic to use
`java.nio.ByteBuffer`. Using `ByteBuffer` shows an improvement in
[microbenchmarks](jtibshirani#3) and I
think it helps code readability. The performance gain might be due to the fact
`ByteBuffer` uses hotspot intrinsic candidates like `Unsafe#getIntUnaligned`
under the hood.
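
For reference, the manual-shift decoding that `ByteBuffer` replaces looks roughly like this (a sketch assuming the big-endian `Float.intBitsToFloat` encoding implied above, not the actual Elasticsearch source):

static float[] decodeWithShifts(BytesRef vectorBR) {
    int dims = vectorBR.length / Float.BYTES;
    float[] decoded = new float[dims];
    int offset = vectorBR.offset;
    for (int dim = 0; dim < dims; dim++) {
        // Assemble each float's int bits from four big-endian bytes;
        // ByteBuffer.getFloat performs the equivalent read, but can use
        // HotSpot intrinsics for the unaligned access.
        int intBits = ((vectorBR.bytes[offset] & 0xFF) << 24)
            | ((vectorBR.bytes[offset + 1] & 0xFF) << 16)
            | ((vectorBR.bytes[offset + 2] & 0xFF) << 8)
            | (vectorBR.bytes[offset + 3] & 0xFF);
        decoded[dim] = Float.intBitsToFloat(intBits);
        offset += 4;
    }
    return decoded;
}
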
jtibshirani added a commit to elastic/elasticsearch that referenced this pull request Aug 30, 2019
@jtibshirani (Owner, Author) commented:

A note for future context: although it helped in some microbenchmarks, on other platforms the unrolled dot product was substantially slower. For example, on @mayya-sharipova's Linux server (40-core Intel Xeon, OpenJDK 11.0.1), unrolling took 151.91 ns versus a baseline of 122.68 ns.

@jtibshirani (Owner, Author) commented:

I'm going to close this PR, since we've finished implementing a round of changes based on the results.
