
Rerun benchmark with elasticsearch 7.5 or above #2

Closed
jtibshirani opened this issue Mar 9, 2020 · 14 comments

Comments

@jtibshirani

In ES 7.5, we made some improvements to the performance of Elasticsearch dense_vector operations (elastic/elasticsearch#46294). Although I still expect the QPS to be significantly worse than Vespa's, it would be helpful to rerun the benchmarks against ES 7.5 to get an up-to-date comparison.

@jobergum
Owner

jobergum commented Mar 9, 2020

Hello @jtibshirani, thanks for reaching out. I'm actually working on an evaluation of 7.6. Why do you recommend 7.5? And yes, 7.6 is much faster than 7.4 (about 2x, I think, so well done!)

I'm also switching to the standard ann-benchmarks data instead of random data; see also my comments on elastic/elasticsearch#51243. From my tests, it seems those brute-force latency numbers must have been produced with the number of shards set to 2.

@jtibshirani
Author

> Why do you recommend 7.5?

My statement was a bit confusing -- any version 7.5 or above will contain the improvements.

> I'm also switching to standard ann benchmark data instead of random data, see also my comments on elastic/elasticsearch#51243.

Great that you're standardizing on the ann-benchmarks data!

@jobergum
Owner

jobergum commented Mar 9, 2020

Yes, coming first are gist-960-euclidean and sift-128-euclidean.

@jobergum
Owner

@jtibshirani would you be able to help with the following questions?

  • Is there a simple batch-oriented API for feeding a JSON-formatted file with the Elastic Docker distribution? Feeding one document at a time through the REST HTTP API is painfully slow with the 1M gist-960 dataset.
  • For comparisons it's best to use 1 shard with ES and 1 thread per search with Vespa; do you agree?
  • Could you look at the ES numbers below and tell me whether they are within your expectations?

The results are

gist-960-euclidean 1M vectors

single shard with Elastic and threads-per-search equal to one with Vespa

| Engine | QPS | Average Latency (ms) | 95P Latency (ms) | Recall@10 |
|---|---|---|---|---|
| Elastic 7.6 | 0.40 | 2502.82 | 2520.21 | 1.0000 |
| Vespa 7.184.8 | 0.63 | 1579.29 | 1787.40 | 1.0000 |

two shards for Elastic and two threads-per-search with Vespa

| Engine | QPS | Average Latency (ms) | 95P Latency (ms) | Recall@10 |
|---|---|---|---|---|
| Elastic 7.6 | 0.78 | 1276.69 | 1333.61 | 1.0000 |
| Vespa 7.184.8 | 1.26 | 794.28 | 892.23 | 1.0000 |

Are the results with Elastic comparable with your setup? Same HW as before. Vespa is implementing a variant of the HNSW algorithm for ANN (currently an experimental feature), so I will eventually publish some results with that enabled as well.

@jtibshirani
Author

jtibshirani commented Mar 11, 2020

> Is there a simple batch oriented api for feeding a json formatted file with the elastic docker distribution?

I would recommend using the bulk API. You can't feed a list of JSON documents directly; some minimal wrapping is still needed to create the request. The ES Python client has some nice bulk helpers to make the process easier.
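A minimal sketch of what that wrapping looks like with the Python client's bulk helper. The index name `doc`, the field name `vector`, and a local node on `localhost:9200` are illustrative assumptions, not details from this thread:

```python
def generate_actions(vectors, index="doc"):
    # Wrap each raw vector in the minimal action format the bulk
    # helper expects: target index, document id, and the source body.
    for i, vec in enumerate(vectors):
        yield {"_index": index, "_id": str(i), "_source": {"vector": vec}}

if __name__ == "__main__":
    # Requires a running node and `pip install elasticsearch`.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://localhost:9200")
    vectors = [[0.1] * 960, [0.2] * 960]  # stand-in for the gist-960 data
    success, _ = bulk(es, generate_actions(vectors), chunk_size=500)
    print(f"indexed {success} docs")
```

The helper batches the actions into `_bulk` requests under the hood, which is far faster than one HTTP POST per document.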

> For comparisons it's best to use 1 shard with ES and 1 thread per search with Vespa, you agree?

I don't have deep knowledge of Vespa's architecture, but from the ES side this seems like a reasonable comparison -- with only one shard ES will use a single thread to perform the search.

In addition to setting number_of_shards: 1, it'd be good to set number_of_replicas: 0 so that there is only one shard copy serving searches. (I mention this as a best practice; I assume you've configured vespa-fbench to run each search serially rather than submitting multiple searches at once? If so, setting number_of_replicas shouldn't change the performance numbers much.)
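A sketch of that single-copy index setup; the index name, field name, and dimensionality are illustrative, not from this thread:

```python
def benchmark_index_body(dims=960):
    # One primary shard and no replicas: exactly one shard copy exists,
    # and each search runs single-threaded against it.
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {"vector": {"type": "dense_vector", "dims": dims}}
        },
    }

if __name__ == "__main__":
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local node
    es.indices.create(index="doc", body=benchmark_index_body())
```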

There are a few other pieces of set-up that are important:

  • After indexing all the vectors, you should 'force merge' all of the segments in a shard. Otherwise ES will have to search many small segments serially, then merge together the results. An example of force merging an index to one segment can be found here.
  • In @mayya-sharipova's latest benchmarks under the section 'Bruteforce benchmarks', she set the heap size to 7GB. The default heap size is quite low, so it would be good to raise it -- instructions on how to set the heap size can be found here.
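Put together, the post-indexing steps above might look like this with the Python client (the index name and local node are assumptions; the heap is set in the environment before the node starts):

```python
# The heap is raised via the environment before starting the node, e.g.
#   ES_JAVA_OPTS="-Xms8g -Xmx8g" ./bin/elasticsearch
FORCE_MERGE_PARAMS = {"max_num_segments": 1}

if __name__ == "__main__":
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.indices.refresh(index="doc")
    # Collapse the shard to a single segment so a search doesn't have
    # to visit many small segments serially and merge their results.
    es.indices.forcemerge(index="doc", **FORCE_MERGE_PARAMS)
```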

@jobergum
Owner

Thanks a lot for your input @jtibshirani,

Yes, all numbers are reported using a single client and no concurrency. I also want to evaluate with higher concurrency, so thanks for the recommendation on number_of_replicas. Some of the ANN libraries out there scale pretty badly with increased concurrency, but that doesn't show up in any of the ann-benchmarks results.

I've given Elastic an 8G heap (ES_JAVA_OPTS="-Xms8g -Xmx8g"), I don't see any signs of GC pressure, and I've used force merge. Vespa has a similar mechanism for flushing the memory index; the difference is that Vespa's memory index (a B+ tree implementation) can be updated without any merging, unlike Lucene-based engines. Once the memory index reaches a threshold, it's flushed and merged with the disk index (similar to Lucene segment merging).

I'm able to reproduce the brute-force numbers in elastic/elasticsearch#51243 (comment), but in my setup I need 2 shards to get 0.78 QPS.

@jobergum
Owner

@jtibshirani I've updated the master branch to use 7.6.

@jtibshirani
Author

jtibshirani commented Mar 23, 2020

@jobergum I'm sorry for the late reply. I'm not sure why your benchmarking results aren't lining up with @mayya-sharipova's. The only other difference that comes to mind is that we always omit the full document source from results by setting _source: false in the search request body: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/search-request-body.html#request-body-search-source-filtering. Otherwise ES will load and return the whole stored vector for each of the top 10 results, whereas we are just interested in the document IDs.
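As a sketch, a 7.x brute-force query with the source disabled might look like the following. The `l2norm` script and the field name `vector` are assumptions based on the script_score vector functions, not details from this thread:

```python
def knn_query(query_vector, k=10):
    # _source: False keeps the stored 960-dim vectors out of the
    # response; only ids and scores come back for the top k hits.
    return {
        "size": k,
        "_source": False,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Euclidean similarity via the painless vector helpers
                    "source": "1 / (1 + l2norm(params.query_vector, 'vector'))",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }

if __name__ == "__main__":
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    res = es.search(index="doc", body=knn_query([0.1] * 960))
    print([hit["_id"] for hit in res["hits"]["hits"]])
```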

> @jtibshirani I've updated the master branch using 7.6.

Thanks! The 'Ivy Bridge' numbers make sense to me, based on the previous results and the performance improvements in ES. However the Haswell numbers are more surprising -- do you know why Vespa shows a latency improvement of ~2x between the Ivy Bridge and Haswell processors?

@jobergum
Owner

jobergum commented Mar 23, 2020

@jtibshirani the vector is not returned with the result; if it were, yes, I would have spotted it.

Sample response from ES

{
  "took": 604,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": {"value": 10000, "relation": "gte"},
    "max_score": 0.005666477,
    "hits": [
      {"_index": "doc", "_type": "_doc", "_id": "669835", "_score": 0.005666477},
      {"_index": "doc", "_type": "_doc", "_id": "408764", "_score": 0.0056393184},
      {"_index": "doc", "_type": "_doc", "_id": "408462", "_score": 0.0054252045},
      {"_index": "doc", "_type": "_doc", "_id": "408855", "_score": 0.0053858217},
      {"_index": "doc", "_type": "_doc", "_id": "551661", "_score": 0.0053397696},
      {"_index": "doc", "_type": "_doc", "_id": "861882", "_score": 0.005264404},
      {"_index": "doc", "_type": "_doc", "_id": "406273", "_score": 0.0052393572},
      {"_index": "doc", "_type": "_doc", "_id": "406324", "_score": 0.0052266084},
      {"_index": "doc", "_type": "_doc", "_id": "551743", "_score": 0.005219447},
      {"_index": "doc", "_type": "_doc", "_id": "861530", "_score": 0.0052178036}
    ]
  }
}
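Since only the ids matter for recall, a response shaped like the sample above reduces to the ranked id list, e.g.:

```python
import json

def hit_ids(response):
    # Pull the ranked document ids out of an ES search response.
    return [hit["_id"] for hit in response["hits"]["hits"]]

sample = json.loads(
    '{"hits": {"hits": [{"_id": "669835"}, {"_id": "408764"}]}}'
)
print(hit_ids(sample))  # ['669835', '408764']
```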

On CPU architectures: yes, it's explained by our use of AVX-512 instructions.

I will soon update with results using our HNSW implementation for approximate nearest neighbor search; some sample data with the gist dataset:

[image: sample HNSW results for the gist dataset]

@jtibshirani
Author

jtibshirani commented Mar 23, 2020

Thanks for the explanation + links on AVX. The HNSW implementation looks really promising.

If it's not too much work, it would be great to report the sift-128-euclidean results against Ivy Bridge as well. I'd be curious to see how consistent the latency differences are. Other than that I don't have anything else to add, happy if you'd like to close out this issue.

@jobergum
Owner

Thanks, yes, I just did. And thanks for the input on ES benchmarking.

@jobergum
Owner

I'm resolving this; hoping to have more time to introduce the ANN Vespa version later on.

@jobergum
Owner

@jtibshirani does ES ship with an async feed client that makes it easier to feed documents with high throughput? I'm using the synchronous HTTP POST API but would like to move away from it. Vespa has this utility to feed a JSON file: https://docs.vespa.ai/en/reference/vespa-cmdline-tools.html#vespa-feeder, so I'm looking for an ES equivalent.

@jtibshirani
Author

There's an Elasticsearch Python client, which adds convenient 'bulk helpers' for indexing a large set of documents: https://elasticsearch-py.readthedocs.io/en/v7.10.1/helpers.html#bulk-helpers.

Here's an example from one of my colleagues: https://github.com/elastic/examples/blob/master/Machine%20Learning/Online%20Search%20Relevance%20Metrics/bin/index#L34. You can ignore everything related to 'pipeline'; it is an optional piece of configuration for transforming documents before indexing them.
