
Add microbenchmarks for vector functions. #3

Closed
wants to merge 8 commits

Conversation

@jtibshirani (Owner) commented Aug 23, 2019

This PR shows some microbenchmarks for decoding vectors and taking the dot product of two vectors. These benchmarks are meant for local testing purposes and will not be merged into the elasticsearch repo.

Benchmark                                                  Mode  Cnt     Score     Error  Units
VectorFunctionBenchmark.decodeNoop                         avgt   30    31.820 ±   0.141  ns/op
VectorFunctionBenchmark.decode                             avgt   30   109.387 ±   0.179  ns/op
VectorFunctionBenchmark.decodeWithByteBuffer               avgt   30    74.400 ±   0.139  ns/op
VectorFunctionBenchmark.decodeWithUnrolling4               avgt   30   109.110 ±   0.150  ns/op
VectorFunctionBenchmark.dotProduct                         avgt   30    68.549 ±   0.737  ns/op
VectorFunctionBenchmark.dotProductWithUnrolling4           avgt   30    36.346 ±   0.030  ns/op
VectorFunctionBenchmark.decodeThenDotProduct               avgt   30   169.888 ±   0.274  ns/op
VectorFunctionBenchmark.decodeAndDotProduct                avgt   30   102.490 ±   0.135  ns/op
VectorFunctionBenchmark.decodeAndDotProductWithUnrolling2  avgt   30    92.387 ±   0.125  ns/op
VectorFunctionBenchmark.decodeAndDotProductWithUnrolling4  avgt   30   102.078 ±   0.281  ns/op

The results suggest a few directions to pursue, which I'll explore next in search macrobenchmarks:

  • Switching to ByteBuffer instead of manual shifts might help for decoding.
  • If a computation only requires one dot product, then decoding and performing the dot product at the same time could help.
  • In dotProductWithUnrolling4, we manually unroll the dot product loop to make clear that there are no dependencies between operations (see the sketch below). This likely encourages SIMD instructions to kick in, resulting in an improvement.
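
For reference, a minimal sketch of the unrolled dot product idea (the benchmark source isn't included in this conversation, so the exact method shape is an assumption):

static float dotProductWithUnrolling4(float[] docVector, float[] queryVector) {
    // Four independent accumulators remove the loop-carried dependency on a
    // single running sum, giving the JIT a better chance to auto-vectorize.
    float dot0 = 0, dot1 = 0, dot2 = 0, dot3 = 0;
    int length = (queryVector.length / 4) * 4; // largest multiple of 4
    for (int dim = 0; dim < length; dim += 4) {
        dot0 += docVector[dim] * queryVector[dim];
        dot1 += docVector[dim + 1] * queryVector[dim + 1];
        dot2 += docVector[dim + 2] * queryVector[dim + 2];
        dot3 += docVector[dim + 3] * queryVector[dim + 3];
    }
    // Handle leftover dimensions when the length isn't a multiple of 4.
    for (int dim = length; dim < queryVector.length; dim++) {
        dot0 += docVector[dim] * queryVector[dim];
    }
    return dot0 + dot1 + dot2 + dot3;
}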

Platform information:

  • openjdk 12.0.1 2019-04-16
  • Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

@mayya-sharipova commented Aug 25, 2019

@jtibshirani Thanks for this, great work! Great thinking to try ByteBuffer and unrolling to take advantage of SIMD.

I have added another function that combines both of your approaches: decoding with ByteBuffer and computing the dot product with unrolling.

static float decodeWithBufferAndDotProductWithUnrolling(float[] queryVector, BytesRef vectorBR) {
    if (vectorBR == null) {
        throw new IllegalArgumentException("A document doesn't have a value for a vector field!");
    }
    ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);

    // Four independent accumulators, so loop iterations don't depend on each other.
    float dot0 = 0;
    float dot1 = 0;
    float dot2 = 0;
    float dot3 = 0;

    // getFloat(int) takes an absolute index into the backing array, so we
    // start from vectorBR.offset rather than 0.
    int offset = vectorBR.offset;
    int length = (queryVector.length / 4) * 4; // largest multiple of 4
    for (int dim = 0; dim < length; dim += 4, offset += 16) {
        dot0 += byteBuffer.getFloat(offset) * queryVector[dim];
        dot1 += byteBuffer.getFloat(offset + 4) * queryVector[dim + 1];
        dot2 += byteBuffer.getFloat(offset + 8) * queryVector[dim + 2];
        dot3 += byteBuffer.getFloat(offset + 12) * queryVector[dim + 3];
    }

    // Handle leftover dimensions when the length isn't a multiple of 4.
    for (int dim = length; dim < queryVector.length; dim++, offset += 4) {
        dot0 += byteBuffer.getFloat(offset) * queryVector[dim];
    }
    return dot0 + dot1 + dot2 + dot3;
}
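
As an aside, these numbers come from JMH average-time runs (Mode avgt). A minimal harness for the new function might look like the following sketch; the state fields, dimension count, and setup values are assumptions, not the actual benchmark class:

import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.util.BytesRef;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class VectorFunctionBenchmark {
    private static final int DIMS = 100; // hypothetical dimension count

    private float[] queryVector;
    private BytesRef docVector;

    @Setup
    public void setup() {
        queryVector = new float[DIMS];
        byte[] bytes = new byte[DIMS * Float.BYTES];
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        for (int dim = 0; dim < DIMS; dim++) {
            queryVector[dim] = dim;
            buffer.putFloat(dim); // big-endian float encoding, matching getFloat above
        }
        docVector = new BytesRef(bytes);
    }

    @Benchmark
    public float benchDecodeWithBufferAndDotProductWithUnrolling() {
        // Calls the static method shown above (assumed to live in this class).
        return decodeWithBufferAndDotProductWithUnrolling(queryVector, docVector);
    }
}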

Here are results on my machine:

Benchmark                                                           Mode  Cnt     Score      Error  Units
VectorFunctionBenchmark.decodeNoop                                  avgt   30    41.404 ±    0.802  ns/op
VectorFunctionBenchmark.decode                                      avgt   30   127.624 ±    2.645  ns/op
VectorFunctionBenchmark.decodeWithByteBuffer                        avgt   30    88.841 ±    2.059  ns/op
VectorFunctionBenchmark.decodeWithUnrolling                         avgt   30   127.553 ±    1.940  ns/op

VectorFunctionBenchmark.dotProduct                                  avgt   30    76.797 ±    1.229  ns/op
VectorFunctionBenchmark.dotProductWithUnrolling                     avgt   30    41.452 ±    0.893  ns/op
VectorFunctionBenchmark.decodeThenDotProduct                        avgt   30   238.820 ±    3.210  ns/op
VectorFunctionBenchmark.decodeAndDotProduct                         avgt   30   172.425 ±    9.730  ns/op
VectorFunctionBenchmark.decodeWithBufferAndDotProductWithUnrolling  avgt   30   111.164 ±    2.277  ns/op

Indeed, an almost 2x speedup can be achieved by using ByteBuffer for decoding and unrolling the dot product.

My machine params:
Processor: 2.9 GHz Intel Core i7

sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

OpenJDK JDK 11.0.2, VM 11.0.2+9-LTS

@jtibshirani force-pushed the vector-microbenchmarks branch 3 times, most recently from 5116d06 to 52a211f on August 26, 2019 at 18:20
jtibshirani added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2019
This commit updates the vector encoding and decoding logic to use
`java.nio.ByteBuffer`. Using `ByteBuffer` shows an improvement in
[microbenchmarks](jtibshirani#3) and I
think it helps code readability. The performance gain might be due to the fact
`ByteBuffer` uses hotspot intrinsic candidates like `Unsafe#getIntUnaligned`
under the hood.
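
For reference, the manual-shift decoding that `ByteBuffer` replaces looks roughly like this (a sketch assuming the big-endian `Float.intBitsToFloat` encoding implied above, not the actual Elasticsearch source):

static float[] decodeWithShifts(BytesRef vectorBR) {
    int dims = vectorBR.length / Float.BYTES;
    float[] decoded = new float[dims];
    int offset = vectorBR.offset;
    for (int dim = 0; dim < dims; dim++) {
        // Assemble each float's int bits from four big-endian bytes;
        // ByteBuffer.getFloat performs the equivalent read, but can use
        // HotSpot intrinsics for the unaligned access.
        int intBits = ((vectorBR.bytes[offset] & 0xFF) << 24)
            | ((vectorBR.bytes[offset + 1] & 0xFF) << 16)
            | ((vectorBR.bytes[offset + 2] & 0xFF) << 8)
            | (vectorBR.bytes[offset + 3] & 0xFF);
        decoded[dim] = Float.intBitsToFloat(intBits);
        offset += 4;
    }
    return decoded;
}
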
jtibshirani added a commit to elastic/elasticsearch that referenced this pull request Aug 30, 2019
@jtibshirani (Owner, Author) commented:

A note for future context: although it helped in some microbenchmarks, on other platforms the unrolled dot product was substantially slower. For example, on @mayya-sharipova's Linux server (40-core Intel Xeon, OpenJDK 11.0.1), unrolling took 151.91 ns versus a baseline of 122.68 ns.

@jtibshirani (Owner, Author) commented:

I'm going to close this PR, since we've finished implementing a round of changes based on the results.
