Add VP-Tree HashIndex #177
Postponing this feature due to VP-Trees not performing as expected.
Given the recent overhaul of the LSH algorithm modularity, this would now be implemented as a new HashIndex.
I'd be happy to have a crack at this if it's still of interest. I didn't see much on the dev/vptree-nn-index branch, so a few questions:
Hi @wphicks! Thanks for your interest. Incoming wall of text.
In other details, the test implementation that I wrote, as well as this other implementation, seems to have a deficiency with regard to indexing values using the Hamming distance metric. I observed that both VP-tree implementations broke down into brute force, whereas the sklearn ball-tree performed as it does with other distance metrics. Now, this may be because both implementations have flaws that cause this, or VP-trees may just break down with this metric; I don't currently remember or know. This should probably be investigated.
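For concreteness, the core structure both implementations share is small. Below is a minimal pure-Python sketch of a VP tree (illustrative only; the class and function names are mine, not from SMQTK or either linked implementation): points are partitioned around a vantage point by the median distance, and search descends into the far partition only when the triangle-inequality bound cannot rule it out.

```python
class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point
        self.radius = radius    # median distance to the remaining points
        self.inside = inside    # subtree of points with dist <= radius
        self.outside = outside  # subtree of points with dist > radius

def build(points, dist):
    """Recursively partition points around a vantage point by median distance."""
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return VPNode(vp, 0.0, None, None)
    ds = [dist(vp, p) for p in rest]
    radius = sorted(ds)[len(ds) // 2]
    inside = [p for p, d in zip(rest, ds) if d <= radius]
    outside = [p for p, d in zip(rest, ds) if d > radius]
    return VPNode(vp, radius, build(inside, dist), build(outside, dist))

def nearest(node, query, dist, best=None):
    """Exact 1-NN search; `best` is a (distance, point) pair."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Search the side containing the query first; visit the other side only
    # if the current best radius still straddles the partition boundary.
    near, far = ((node.inside, node.outside) if d <= node.radius
                 else (node.outside, node.inside))
    best = nearest(near, query, dist, best)
    if abs(d - node.radius) < best[0]:   # triangle-inequality prune
        best = nearest(far, query, dist, best)
    return best

# Example with the Hamming metric on small bit tuples:
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
pts = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 0, 1, 1), (1, 0, 0, 0)]
tree = build(pts, hamming)
print(nearest(tree, (1, 0, 0, 1), hamming))  # (1, (1, 0, 0, 0))
```

The breakdown described above shows up in the pruning test: when almost every pairwise distance is close to `node.radius`, `abs(d - node.radius) < best[0]` is almost always true and both subtrees get visited.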
Okay, great! Responding point-by-point:
In terms of your final point, I'll do some research. I have a vague memory about something like this rattling around somewhere in the back of my skull, but I'll have a look at the literature on it.
I should note that the definition of cumulative gains in this context is
somewhat loose.
When you retrieve K nearest neighbors to a particular query with a VP tree,
you can rank them by actual distance. The idea here is to compare this
against the exact nearest neighbors to see how well the query results
adhere to ground truth.
So on the X axis, you have the RANK of the EXACT neighbor, and on the Y,
you have the RANK of the QUERY RESULT. If the nth ranked exact neighbor is
NOT present in the results, the graph keeps going horizontal. If it IS
present, the graph will go both up and right.
If the result matches the exact near neighbors set, then the graph will
climb up to a plateau at K and just go horizontally. Most of the time it
won't, so it will go up more gradually than the exact graph.
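The curve described above is straightforward to compute. Here is a small sketch (the function name is mine, not from SMQTK) using a hit-count variant: for each rank in the exact neighbor list, the curve records how many of the exact neighbors up to that rank appear anywhere in the query result, so it goes up-and-right on a hit and stays horizontal on a miss.

```python
def cumulative_gain_curve(exact_ids, result_ids):
    """Compare an approximate K-NN result against exact ground truth.

    X axis: rank of the exact neighbor. Y axis: how many of the top-X
    exact neighbors were recovered by the query. A perfect result climbs
    one step per rank up to K, then plateaus.
    """
    result_set = set(result_ids)
    curve = []
    hits = 0
    for exact_id in exact_ids:
        if exact_id in result_set:
            hits += 1  # graph goes up and right
        # otherwise the graph keeps going horizontal
        curve.append(hits)
    return curve

# Perfect agreement with ground truth: climbs straight to K.
print(cumulative_gain_curve([1, 2, 3, 4], [1, 2, 3, 4]))  # [1, 2, 3, 4]
# Partial agreement: rises more gradually than the exact graph.
print(cumulative_gain_curve([1, 2, 3, 4], [1, 3, 9, 7]))  # [1, 1, 2, 2]
```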
…On Wed, Sep 27, 2017, 4:44 PM William Hicks ***@***.***> wrote:
Okay, great! Responding point-by-point:
1. I'll go ahead and start doing my own implementation until/unless you can put your local implementations out in some form. If you do put it out sometime soon, I'd be happy to do testing and/or help flesh out whatever you've got.
2. Roger that - I'll start with pure Python and see how it goes.
3. Sounds good! @fishcorn <https://github.com/fishcorn>, any details are much appreciated.
In terms of your final point, I'll do some research. I have a vague memory about something like this rattling around somewhere in the back of my skull, but I'll have a look at the literature on it.
Makes sense, thanks! I'll get to work.
I should be able to put my tree implementations on a branch sometime today for you to pick apart.
Other implementations I've learned from, and which could be compared against the one coming on a branch shortly:
VpTree utilities now on branch dev/vp-tree-utils
Great, thanks! I'll take a look at it over the weekend and compare against what I've got so far.
Sorry, got my own wall of text for you this time! After going through the Yianilos paper again, I think the poor performance with the Hamming distance metric is inherent to VP trees and not an implementation flaw. Specifically, consider two random bit vectors of a given length: their Hamming distance is binomially distributed, so nearly all pairwise distances bunch tightly around half the vector length. Looking at Algorithm 2, the question is how often the distance bounds actually let a subtree be excluded; with distances this concentrated, the exclusion test rarely fires and the search degenerates toward brute force. The only good news here is that, in the limit of large vector lengths, there may still be enough absolute spread in the distances for the tree to give some benefit.
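The concentration effect is easy to check empirically: for uniform random bit vectors of length d, Hamming distances are Binomial(d, 1/2), with mean d/2 and standard deviation sqrt(d)/2, so the relative spread std/mean shrinks like 1/sqrt(d). A quick sketch (pure Python, illustrative function names):

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def distance_stats(d, n_pairs=200, seed=0):
    """Sample Hamming distances between random bit vectors of length d;
    return the sample (mean, std) of those distances."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        a = [rng.randint(0, 1) for _ in range(d)]
        b = [rng.randint(0, 1) for _ in range(d)]
        dists.append(hamming(a, b))
    mean = sum(dists) / len(dists)
    var = sum((x - mean) ** 2 for x in dists) / len(dists)
    return mean, var ** 0.5

for d in (64, 1024, 4096):
    mean, std = distance_stats(d)
    # Mean tracks d/2 while std grows only like sqrt(d)/2, so the
    # relative spread std/mean shrinks as the dimension grows.
    print(d, round(mean, 1), round(std, 1), round(std / mean, 3))
```

With everything at nearly the same distance from a vantage point, the median-radius split carries almost no information, which matches the brute-force behavior observed earlier in the thread.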
I think I will need to process your above explanation and maybe read the paper again, but it seems to follow what I've seen experimentally. With regard to typical descriptor vector sizes, that depends on the content being described and the algorithm being used. Currently, we have been focusing on imagery and using CNN-based description approaches, which usually yield 2048- to 4096-element vectors. Depending on how networks evolve going forward, that may fluctuate.
Okay, thanks! In that case, it may still be worth it. I'll run some tests at around those sizes and let you know what I find out.
I ran a bunch of tests over the weekend with your VP tree implementation (integrated into this fork). A few results:
One huge caveat on these results: In building and searching the VP trees, I converted the bit vectors to an integer representation using the functions in bit_utils.py (consistent with what is done for linear hash indices here). That turns out to be very expensive, accounting for 91% of the runtime in searching a VP tree at dimension 4096 (without numba). I think there might be a few ways to get around that conversion, but the takeaway should be that VP trees perform even better than these results suggest. Given that, it seems worth it to use VP trees (for SMQTK's typical domain) after all. Assuming that all sounds good to you, I'm going to see if I can work around the vector-to-int conversion issue and then finish integrating this as a hash index.
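For context on why that conversion is costly, here is a sketch of the kind of packing involved (an illustrative reimplementation, not the actual bit_utils.py code): building a Python arbitrary-precision integer one bit at a time performs O(d) big-int operations per vector, whereas numpy's packbits stays in fixed-width byte arrays.

```python
import numpy as np

def bits_to_int(v):
    """Pack a 0/1 vector into one arbitrary-precision integer, one bit at
    a time. Illustrative stand-in for a per-bit conversion: each step
    shifts a growing big-int, so cost rises with vector length."""
    out = 0
    for bit in v:
        out = (out << 1) | int(bit)
    return out

def bits_to_bytes(v):
    """Alternative: pack into fixed-width uint8 words with numpy,
    avoiding per-bit Python-level big-int arithmetic."""
    return np.packbits(np.asarray(v, dtype=np.uint8)).tobytes()

v = [1, 0, 1, 1, 0, 0, 0, 1]
print(bits_to_int(v))    # 0b10110001 == 177
print(bits_to_bytes(v))  # b'\xb1'
```

One design note: if the packed integers are only needed to compute Hamming distances, `bin(a ^ b).count("1")` on the int form, or an XOR plus popcount on the byte form, gives the distance without ever unpacking, which is one possible route around the conversion overhead.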
Sounds promising. Did you create any scripts, graphs, or plots to show your results? Are any of your tests something that could be used as performance unit-tests? Feel free to optimize the vector/int conversion and create a PR for a VP-tree hash-index and/or nn-index.
Yes, I have a couple of scripts that I used to generate a whole bunch of plots for different dimensions and numbers of indexed vectors. I was planning on putting the plots in a blog post later, but if there's a better or more useful place to stick them, I'll be happy to. I've just pushed the scripts to this branch (here and here) in case you want to look at them, but they're very hacky. I can definitely scrape out the good stuff into some clean performance tests, though.
Sorry for the lack of response. I have been preparing for and am now in a field test environment. I will be able to take a closer look at your linked work in about a week.
No worries and thanks for all the help so far! I'm hoping to get the full PR done this weekend, so you can just take a look at that when you're back.
Or somehow generalize this into the CodeIndex representation structure. It's pretty much the case that all LSH-based nearest-neighbor algorithms will/can use the same nearest-code-neighbor strategy once hashes are generated.