allocate one NeighborQueue per search for results #12255
Conversation
@msokolov wrote earlier:
How can I reproduce this locally?
Hmm, this is interesting: the hashmap branch passes tests, the neighborqueue branch passes tests, but putting them together does not pass. Investigating...
^ this is addressed on the hashmap PR now
I think we can close if this has already been addressed, right?
The backwards compatibility breakage was fixed on the hashmap branch; this optimization was not.
Got it, thank you! I think we can later add a CHANGES entry where we track improvements to help with release notes.
Hello, is it expected that this change alters the KNN hits returned? That's fine (if it was expected)... Lucene's nightly benchmarks are angry about it though, so I'll just regold if this is OK/expected.
Not expected. How can I run the test locally?
Those tests are run using a separate package called luceneutil.
I wonder if it could have been #12248 that caused the difference?
Downloading now. In the meantime, can you give more details on the failure? Some variance is expected just by the nature of hnsw randomness, but it shouldn't go from e.g. 90% recall to 80%.
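For reference, "recall" here means the fraction of the true nearest neighbors that the approximate search returns. A minimal, self-contained sketch of how it could be computed; the class and method names are illustrative, not part of Lucene or luceneutil:

```java
import java.util.HashSet;
import java.util.Set;

public class RecallExample {
  /**
   * Fraction of the exact top-k neighbors that the approximate search also returned.
   * Both arrays hold document ids; the shapes here are purely illustrative.
   */
  static double recallAtK(int[] exactTopK, int[] approxTopK) {
    Set<Integer> exact = new HashSet<>();
    for (int id : exactTopK) {
      exact.add(id);
    }
    int hits = 0;
    for (int id : approxTopK) {
      if (exact.contains(id)) {
        hits++;
      }
    }
    return (double) hits / exactTopK.length;
  }

  public static void main(String[] args) {
    // e.g. 9 of the 10 true neighbors found -> recall 0.9
    int[] exact = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int[] approx = {1, 2, 3, 4, 5, 6, 7, 8, 9, 42};
    System.out.println(recallAtK(exact, approx)); // prints 0.9
  }
}
```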
Thanks @jbellis -- I'm not sure this change caused any difference. Something in the past couple days tweaked the KNN results and this one jumped out at me as a possibility. This is the only detail the nightly benchmark produced:
Unfortunately it is not so simple to reproduce these nightly benchmarks. Note that they only check for exact hit/score differences and not any precision/recall tradeoff.
The idea is that the hnsw randomness should be predictable based on its fixed random seed (42 IIRC). It isn't really a problem if we changed that as long as we have some idea how/why, and as you say the recall remains unchanged or improved. Also we'd want to make sure that the new normal is stable.
If the algorithm is implemented correctly, and I think that it is, then in theory the order of neighbor traversal should not matter. But we are seeing a difference here, so I think what causes that is the limited precision that you get in practice when computing vector similarities. If you have enough vectors, and enough dimensions, then the round-off error can accumulate enough to make a difference. That is why the test suite does not surface this difference.

I performed 38 runs of the Texmex SIFT benchmark with known-correct KNN. This resulted in the new code having a very slightly better recall on average, with p-value 0.16. My statistics are a bit rusty (very rusty), but I believe we're justified in concluding that recall is no worse than before, at least on this test.

Test harness is here, the Google sheet is here, and the raw data is attached as a CSV. The first column is the new code, and the second is the old (git sha 1fa2be9).
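To illustrate the round-off point, here is a small standalone sketch (not Lucene code; the dimension count and seed are arbitrary) showing that merely changing the accumulation order of a float dot product can change its last bits, which is enough to swap two near-tied candidates during traversal:

```java
import java.util.Random;

public class RoundOffExample {
  public static void main(String[] args) {
    int dims = 768; // arbitrary dimensionality for the demo
    float[] a = new float[dims];
    float[] b = new float[dims];
    Random r = new Random(42);
    for (int i = 0; i < dims; i++) {
      a[i] = r.nextFloat();
      b[i] = r.nextFloat();
    }
    // Accumulate the same dot product forwards and backwards.
    float forward = 0, backward = 0;
    for (int i = 0; i < dims; i++) {
      forward += a[i] * b[i];
    }
    for (int i = dims - 1; i >= 0; i--) {
      backward += a[i] * b[i];
    }
    // The two sums typically differ in the low bits, so two candidates with
    // nearly identical similarity can change order depending on traversal.
    System.out.println(forward + " vs " + backward + ", diff = " + (forward - backward));
  }
}
```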
Thanks Jonathan for this detailed analysis -- it looks like the returned hits did change a bit, but recall is a bit better, so it's fine. I'll regold the nightly benchmarks.
@jbellis would this change affect queries per second? There is a 20% drop in QPS since this commit in the Lucene bench nightlies: https://home.apache.org/~mikemccand/lucenebench/2023.05.08.18.03.38.html Looking at the other searches, none slow down this significantly, so it doesn't seem to be the luceneutil change. Did your testing for recall changes show any performance differences in single-threaded QPS?
The testing I performed pre-merge showed an improvement, but I'm happy to look closer. Is there a way to get a graph of performance over time out of that tool, or is it manually pasting things into a spreadsheet?
Also, is it actually this commit, or is it the HashMap commit 3c16374?
I don't see how it is that commit, as the change was only detected after this PR was merged and has persisted since then. Though I may be reading the graph incorrectly. We should probably test the two commits independently with luceneutil to determine the cause.
Here is the timeline graph, with roughly one sample per day: https://home.apache.org/~mikemccand/lucenebench/VectorSearch.html Hmm, there was a several-day gap around there, I think because the results changed and this breaks the benchmark tool. I suspect @mikemccand eventually decided that was benign and overrode its controls by imposing a "new normal".
@jbellis that change seems to only be for the writer and the OnHeap vectors, both of which aren't accessed during search, unless you think the graph building drastically changed for that commit and thus somehow reduced the queries per second.
^ This is almost the entire diff; there is also the same change for byte[]. The differences that I see are:
(I also see that Builder isn't taking advantage of the new API, so that's a missed opportunity on my part, but I assume QPS isn't measuring build time.)

It's hard for me to tell what code path the tester is exercising, but if it is specifying a low-ish visitedLimit then that would explain the regression. (It would also mean that the tester should expect pretty bad recall, since it's not actually getting to the bottom level of the graph.)

I tried pushing to the branch for this PR but it didn't seem to do anything, so I opened a new PR that addresses the Builder and the visitedLimit issues at #12303.

P.S. I also notice that the byte[] version is missing these lines from the float[] version; not sure if that's intended or not.
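To make the visitedLimit point concrete, here is a highly simplified, self-contained sketch of a search loop with a visit budget. The names and structure are hypothetical illustrations, not Lucene's HnswGraphSearcher; they only show why exhausting the budget before reaching the bottom layer hurts recall:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: how a visit budget can cut a graph search short.
public class VisitedLimitSketch {
  public static void main(String[] args) {
    int visitedLimit = 3;     // budget passed by the caller (hypothetical value)
    int visitedCount = 0;
    boolean incomplete = false;

    Deque<Integer> candidates = new ArrayDeque<>();
    candidates.push(0); // hypothetical entry node

    while (!candidates.isEmpty()) {
      if (visitedCount >= visitedLimit) {
        // Budget spent: stop early and mark the result incomplete. If this
        // happens before the bottom layer is ever searched, recall suffers.
        incomplete = true;
        break;
      }
      int node = candidates.pop();
      visitedCount++;
      // Pretend each node has two unvisited neighbors worth exploring.
      candidates.push(2 * node + 1);
      candidates.push(2 * node + 2);
    }
    System.out.println("visited=" + visitedCount + " incomplete=" + incomplete);
  }
}
```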
Hmm, curiously, clicking through the datapoint with the ~20% QPS drop, these were the changes to Lucene, which is just this PR. Could that have caused this drop?
So, using luceneutil, I have been comparing this PR's commit vs. the previous one. I am consistently getting lower QPS with this PR's change, but nowhere near the 20% slowdown (only around 3%, so it could be system noise...). Here are all the commits between the last good run and the one recorded with the significant slowdown: 397c2e5...223e28e. I will try including Luca's commit to see if that increases the QPS discrepancy.
Well, with just a single thread, Luca's commit changed nothing in my local tests. I do seem to have a single-threaded slowdown (though minor) due to this change. @mikemccand how can I determine which parameters the vector search task used when querying? Searching in luceneutil for ... Any additional help with troubleshooting this weird slowdown is welcome :).
Can you check the performance with the changes at #12303?
I see that
One thing that's confusing here is that luceneutil reports "commits since last successful run" relative to a 5/10 run, but there is no graph point showing that run. The last successful run according to the graph was on 5/7, and there were a bunch of other commits between that one and the 5/10 run, including this one:
so I think the change to the slice executor is a red herring and we are just missing some datapoints on the graph.
I tried running luceneutil before/after this change using this command:
and I get this error:
Maybe it's expected that we changed the results? I think this is what Mike M ran into with the nightly benchmarks.
Re-running with
lets the benchmark complete without errors and I get this result:
I don't understand why, but it seems to me we ought to revert this.
The nightly benchmarks "just" run
Eek -- I'll dig into why the nightly chart is misleading us. Maybe the "last successful run" logic is buggy.
Note that you can click & drag to zoom into the nightly chart! Very helpful when trying to isolate specific nights' builds!
OK, indeed there was a bug in this logic! I pushed a possible fix. Unfortunately it will not retroactively correct the past bug's effect, but going forward, new nightly builds should be fixed. I'll watch the next few nightlies to confirm. Thanks for catching this @msokolov!
+1 -- let's revert for now and then try to understand the performance regression offline.
This reverts commit 9a7efe9.
OK, I reverted. Maybe we can scratch our heads and learn something by understanding what the difference was.
One thing I noticed is that
OK, I think I see the problem -- when we search the upper levels of the graph we do so using topK=1. This initializes the NeighborQueue with "initialSize=1", and therefore its LongHeap is created with maxSize=1, and this is never changed when we clear() these data structures. But when we search the bottom layer of the graph we want to use topK=topK as passed by the caller.

So I guess we could re-use the same NeighborQueue for the upper graph layers, but we really do need to set the maxSize of the heap larger for the bottom layer, and it seems as if we might as well just create a new heap.

I guess some of the confusion arises because this data structure is sometimes used in a fixed-size way (we only insert nodes into the results using insertWithOverflow) and in other usages we allow it to grow without bound (we insert nodes into the candidate queue using add).
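To make that sizing behavior concrete, here is a tiny self-contained sketch in the spirit of NeighborQueue/LongHeap (an illustration with made-up class names, not the real Lucene code): clear() empties the heap but leaves maxSize alone, so a queue created for the topK=1 upper-layer searches can never hold topK results on the bottom layer.

```java
import java.util.PriorityQueue;

// Simplified stand-in for a results queue backed by a bounded min-heap.
// NOT the Lucene classes; it only shows the maxSize/clear() interaction.
public class BoundedQueueSketch {
  private final PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap of scores
  private final int maxSize;

  BoundedQueueSketch(int maxSize) {
    this.maxSize = maxSize;
  }

  /** Keep only the best maxSize scores, analogous to insertWithOverflow. */
  boolean insertWithOverflow(float score) {
    if (heap.size() >= maxSize) {
      if (score <= heap.peek()) {
        return false; // not better than the worst retained score
      }
      heap.poll(); // evict the worst
    }
    heap.add(score);
    return true;
  }

  /** Empties the queue but does NOT grow maxSize. */
  void clear() {
    heap.clear();
  }

  int size() {
    return heap.size();
  }

  public static void main(String[] args) {
    // Sized for the greedy upper-layer searches (topK = 1)...
    BoundedQueueSketch results = new BoundedQueueSketch(1);
    results.clear();
    // ...then reused for the bottom layer, where the caller wanted topK = 100:
    for (int i = 0; i < 100; i++) {
      results.insertWithOverflow(i / 100f);
    }
    System.out.println(results.size()); // prints 1, not 100
  }
}
```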
I've noticed that we create a NeighborQueue with an initialSize set to topK. For instance, if topK is 100, the maximum size of LongHeap is also 100. However, when we execute the searchLayer function for any layer other than 0, the maximum size of LongHeap in the original version is only 1.

In the `if (results.insertWithOverflow(friendOrd, friendSimilarity) && results.size() >= topK)` check, insertWithOverflow ends up in LongHeap:

```java
if (size >= maxSize) {
  if (value < heap[1]) {
    return false;
  }
  updateTop(value);
  return true;
}
push(value);
return true;
```

So, if maxSize is 100, it will store up to 100 points. However, if maxSize is 1, it will only store 1 point. When we try to return the result, we pop elements from the heap. If maxSize is 100, it will pop 99 elements when the layer is not 0. This could potentially be a time-consuming operation. In contrast, the original version does not pop any elements; it simply returns the result. Here is the relevant code from the Apache Lucene repository:

```java
while (results.size() > topK) {
  results.pop();
}
```

In summary, the performance degradation could be due to the increased number of pop operations when the maximum size of the heap is larger than 1.
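As a rough illustration of that cost (with made-up numbers, not a measurement): each upper-layer search only needs a single best entry point, but if the results heap was sized for the caller's topK it can accumulate up to topK entries that then have to be popped back down before returning.

```java
// Back-of-the-envelope estimate of the extra pops per query. The layer count
// and topK are hypothetical, chosen only to show the shape of the cost.
public class ExtraPopsEstimate {
  public static void main(String[] args) {
    int topK = 100;          // caller's topK, used to size the results heap
    int upperLayers = 4;     // assumed number of layers above layer 0
    int upperLayerTopK = 1;  // upper layers only need the single best entry point

    // Worst case: the heap fills to topK on each upper layer and must be
    // popped back down to upperLayerTopK before returning.
    int extraPopsPerQuery = upperLayers * (topK - upperLayerTopK);
    // Each pop is O(log topK) heap work, which adds up across many queries.
    System.out.println("extra pops per query (worst case) ~ " + extraPopsPerQuery);
  }
}
```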
But I'm still trying to run the luceneutil benchmark to test my hypothesis.
@msokolov I tried to run luceneutil using the command you provided, but it threw this exception:
My vector-test.py looks like:

```python
#!/usr/bin/env python
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import competition
import sys
import constants

# simple example that runs benchmark with WIKI_MEDIUM source and task files
# Baseline here is ../lucene_baseline versus ../lucene_candidate
if __name__ == '__main__':
    #sourceData = competition.sourceData('wikivector1m')
    #sourceData = competition.sourceData('wikivector10k')
    sourceData = competition.sourceData('wikimedium10k')

    comp = competition.Competition(verifyScores=False)
    index = comp.newIndex('baseline', sourceData,
                          vectorFile=constants.GLOVE_VECTOR_DOCS_FILE,
                          vectorDimension=100,
                          vectorEncoding='FLOAT32')

    # Warning -- Do not break the order of arguments
    # TODO -- Fix the following by using argparser
    concurrentSearches = True

    # create a competitor named baseline with sources in the ../trunk folder
    comp.competitor('baseline', 'baseline',
                    vectorDict=constants.GLOVE_WORD_VECTORS_FILE,
                    index=index, concurrentSearches=concurrentSearches)
    comp.competitor('candidate', 'candidate',
                    vectorDict=constants.GLOVE_WORD_VECTORS_FILE,
                    index=index, concurrentSearches=concurrentSearches)

    # use a different index
    # create a competitor named my_modified_version with sources in the ../patch folder
    # note that we haven't specified an index here, luceneutil will automatically use the index from the base competitor for searching
    # while the codec that is used for running this competitor is taken from this competitor.

    # start the benchmark - this can take long depending on your index and machines
    comp.benchmark("baseline_vs_candidate")
```

Could you tell me how I can fix that?
@tang-hi you need to switch the sourceData to use
@msokolov, thank you! I have successfully run the test and it confirms what I mentioned earlier. I believe that #12303 by @jbellis could resolve this issue.

baseline (this pull request) compared to the candidate version (version prior to this pull request):

baseline (version prior to this pull request) compared to the candidate version (the fix for this PR):
Thanks everyone for testing and fixing. I had reverted this yesterday and I believe what we have on main now has recovered the performance we had before. I also ran luceneutil a few times and wasn't able to observe any change from clear() vs creating a new NeighborQueue.
The performance impact on building is more meaningful, because that is where you are allocating large queues for multiple levels.
As discussed on the hashmap PR, this saves about 1% of graph build time.
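For the build-time point, here is a hedged sketch of the two allocation patterns being compared, assuming the saving comes from not re-allocating a results queue for every level visited while inserting a node during graph build. The names are illustrative; this is not the actual HnswGraphBuilder/HnswGraphSearcher code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative contrast: allocating a results buffer per level vs. allocating
// one per search and clearing it between levels.
public class QueuePerSearchSketch {

  static void searchAllocatingPerLevel(int numLevels, int topK) {
    for (int level = numLevels - 1; level >= 0; level--) {
      List<Float> results = new ArrayList<>(topK); // fresh buffer for every level
      results.add(0f); // stand-in for "search this level into results"
    }
  }

  static void searchReusingOneBuffer(int numLevels, int topK) {
    List<Float> results = new ArrayList<>(topK); // one allocation per search
    for (int level = numLevels - 1; level >= 0; level--) {
      results.clear(); // reuse the same backing storage across levels
      results.add(0f); // stand-in for "search this level into results"
    }
  }

  public static void main(String[] args) {
    // During graph build, every inserted node triggers searches over several
    // levels, so per-level allocations add up; reusing one buffer per search
    // trims that overhead, which is where a ~1% build-time saving could come from.
    searchAllocatingPerLevel(5, 100);
    searchReusingOneBuffer(5, 100);
  }
}
```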