
Add VP-Tree HashIndex #177

Open
Purg opened this issue Jan 7, 2016 · 19 comments

Purg commented Jan 7, 2016

Or somehow generalize this into the CodeIndex representation structure. It's pretty much the case that all LSH-based nearest-neighbor algorithms will/can use the same nearest-code-neighbor strategy once hashes are generated.
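That shared strategy could be sketched roughly as follows (a minimal illustration only; `find_near_codes` and the int-encoded codes are assumptions for the sketch, not SMQTK's actual interface):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length hash codes stored as ints."""
    return bin(a ^ b).count("1")

def find_near_codes(query_code, indexed_codes, k=3):
    """Return the k indexed codes closest to the query in Hamming space.

    Any LSH variant could reuse this step once descriptors are hashed.
    """
    return sorted(indexed_codes, key=lambda c: hamming(query_code, c))[:k]

codes = [0b0000, 0b0011, 0b0111, 0b1111, 0b1000]
print(find_near_codes(0b0001, codes, k=2))  # → [0, 3]
```

The linear scan here is exactly what a VP-tree hash index would replace with a sub-linear search structure.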

@Purg Purg added this to the v0.3.0 milestone Jan 7, 2016
@Purg Purg self-assigned this Jan 7, 2016
@Purg Purg removed this from the v0.3.0 milestone Jan 19, 2016

Purg commented Jan 19, 2016

Postponing this feature due to VP-Trees not performing as expected.

@Purg Purg changed the title Add VP-Tree functionality to LSH indexing (ITQ impl) Add VP-Tree HashIndex Jan 20, 2016

Purg commented Jan 20, 2016

Given the recent overhaul of the LSH algorithm modularity, this would now be implemented as a new HashIndex implementation.

@Purg Purg changed the title Add VP-Tree HashIndex Add VP-Tree HashIndex Jan 20, 2016

wphicks commented Sep 27, 2017

I'd be happy to have a crack at this if it's still of interest. I didn't see much on the dev/vptree-nn-index branch, so a few questions:

  1. Was there an existing implementation of VP-trees that you were hoping to use here?
  2. If not, is performance enough of a consideration that you'd prefer this be implemented as a C++ extension rather than pure Python?
  3. In the same vein, are there performance benchmarks I should test this against before submitting a PR?


Purg commented Sep 27, 2017

Hi @wphicks! Thanks for your interest. Incoming wall of text.

  1. I don't have a specific implementation in mind with regard to backing an algorithm wrapper. I have a local implementation of VP, VP-s, and VP-sb trees as described in the original paper by Yianilos. I think it works given some fairly basic testing, but that code has not been checked in anywhere yet. I need to double-check some things before I can put that code somewhere.

  2. The basic performance check is to get an improvement in build and/or query time over the KD-tree and ball-tree implementations in scikit-learn. The literature claims that VP-trees should be faster and more space-efficient than KD/ball trees. If that can be achieved in pure Python, then hooray. Otherwise a C implementation would be required, though that may complicate model serialization.

  3. I don't have any standard nearest-neighbor performance benchmarks implemented in SMQTK, but @fishcorn (a coworker of mine) has used cumulative gains as a comparative measure. He can explain that measure better than I can at the moment.

In other details, the test implementation that I wrote, as well as this other implementation, seem to have a deficiency when indexing values under the Hamming distance metric. I observed that both VP-tree implementations degraded to brute force, whereas the sklearn ball tree performed as it does with other distance metrics. This may be because both implementations share a flaw that causes this, or because VP-trees simply break down under this metric; I don't currently remember or know. This should probably be investigated.
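For reference, the scikit-learn ball-tree baseline mentioned here might look something like this minimal sketch (the dataset shape and seed are arbitrary assumptions; a VP-tree implementation would be timed against the same queries):

```python
import numpy as np
from sklearn.neighbors import BallTree

# Build a small index of random binary vectors and query it under the
# Hamming metric, as a baseline for build/query-time comparisons.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 64)).astype(float)

tree = BallTree(X, metric="hamming")
dist, ind = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first vector

# The query point is itself in the index, so the nearest hit is exact.
print(ind[0], dist[0])
```

Note that scikit-learn's "hamming" metric is normalized (the fraction of differing coordinates), not the raw bit count.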


wphicks commented Sep 27, 2017

Okay, great! Responding point-by-point:

  1. I'll go ahead and start doing my own implementation until/unless you can put your local implementations out in some form. If you do put it out sometime soon, I'd be happy to do testing and/or help flesh out whatever you've got.
  2. Roger that; I'll start with pure Python and see how it goes.
  3. Sounds good! @fishcorn, any details are much appreciated.

In terms of your final point, I'll do some research. I have a vague memory about something like this rattling around somewhere in the back of my skull, but I'll have a look at the literature on it.


fishcorn commented Sep 28, 2017 via email


wphicks commented Sep 28, 2017

Makes sense, thanks! I'll get to work.


Purg commented Sep 28, 2017

I should be able to put my tree implementations on a branch sometime today for you to pick apart.


Purg commented Sep 29, 2017

Other implementations I've learned from, and that could be compared against the one coming on a branch shortly:


Purg commented Sep 29, 2017

VpTree utilities are now on branch dev/vp-tree-utils.


wphicks commented Sep 30, 2017

Great, thanks! I'll take a look at it over the weekend and compare against what I've got so far.


wphicks commented Oct 2, 2017

Sorry, got my own wall of text for you this time!

After going through the Yianilos paper again, I think the poor performance with the Hamming distance metric is inherent to VP trees and not an implementation flaw. Specifically, consider two random bit vectors of length D. The probability that the (non-normalized) Hamming distance between them equals r is (1/2)^D * binomial(D, r). This violates the ZPS property Yianilos lays out in Definition 2, so the question is whether it's a strong enough violation that we'll consistently revert to a brute-force search.

Looking at Algorithm 2, the question is how often x will be an element of both I_l and I_r, which is to say how often the distance from our query (q) to the vantage point (v) of a given node will fall between the lower bound of the node's right subtree minus the current tolerance and the upper bound of the node's left subtree plus the current tolerance. Let's make the generous assumption that this only happens when the distance from q to v is exactly the median distance (m) between v and all other vectors indexed in its subtrees.

If q is chosen randomly, then the probability of this happening is (1/2)^D * binomial(D, m), and v should be chosen to maximize or minimize m. Without specifying something about the distribution of our input vectors (e.g. that they have a particular kind of clustering), there's no guarantee that we can bring the probability of d(q, v) == m below any given threshold. If q is chosen from a distribution similar to that of the input vectors actually used to build the tree, I think the situation gets worse, though I haven't worked it out formally yet. In that case, I'm also reasonably sure that the usual method of maximizing the spread of distances to the input vectors when choosing v is still the best strategy.

The only good news here is that in the limit of large D, the probability of d(q, v) == m does go to 0 (for randomly selected q). I'm going to do some numerical testing, both to make sure I'm not fooling myself and to see if there's some dimensionality above which the performance beats out KD and ball trees. The question then is what descriptor vector sizes are typical for SMQTK, to see whether it's worth including this.
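To make the concentration argument concrete, a quick numerical check of the probability that a random query lands exactly on the median distance m = D/2 (a sketch of the kind of numerical testing described above):

```python
from math import comb

def p_exact_median(D: int) -> float:
    """Probability that the Hamming distance between two random D-bit
    vectors equals the median distance m = D/2: (1/2)^D * binomial(D, m)."""
    return comb(D, D // 2) / 2 ** D

# The probability shrinks with D, but only on the order of sqrt(2 / (pi * D)).
for D in (16, 256, 4096):
    print(D, p_exact_median(D))
```

The slow ~1/sqrt(D) decay is consistent with the suggestion that any VP-tree advantage would show up mainly at high dimension.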


Purg commented Oct 2, 2017

I think I will need to process your explanation above and maybe read the paper again, but it seems to follow what I've seen experimentally.

In regard to typical descriptor vector sizes, that depends on the content being described and the algorithm being used. Currently we have been focusing on imagery and using CNN-based description approaches, which usually yield 2048- to 4096-element vectors. Depending on how networks evolve going forward, that may fluctuate.


wphicks commented Oct 2, 2017

Okay, thanks! In that case, it may still be worth it. I'll run some tests at around those sizes and let you know what I find out.


wphicks commented Oct 9, 2017

I ran a bunch of tests over the weekend with your VP tree implementation (integrated into this fork). A few results:

  1. All variants of VP trees start to outperform the ball tree implementation (in terms of search runtime) for indices of ~1000 binary vectors or more at any dimension above ~16. For indices of 10000 vectors at dimension 4096, VP tree searches are about 25 times faster.
  2. Consistent with my earlier intuition, VP trees perform substantially worse (in terms of nodes visited) when searching with a query that is close to an indexed vector than with a random query. Even at their worst, though, they substantially outperform ball trees at high dimension (> ~16) and large index size (> ~1000).
  3. Building a VP tree is always substantially slower than building a ball tree. It scales comparably to (and slightly worse than) building a linear hash index.

One huge caveat on these results: in building and searching the VP trees, I converted the bit vectors to an integer representation using the functions in bit_utils.py (consistent with what is done for linear hash indices here). That turns out to be very expensive, accounting for 91% of the runtime when searching a VP tree at dimension 4096 (without numba). I think there might be a few ways to get around that conversion, but the takeaway should be that VP trees perform even better than these results suggest.

Given that, it seems worth it to use VP trees (for SMQTK's typical domain) after all. Assuming that all sounds good to you, I'm going to see if I can work around the vector to int conversion issue and then finish integrating this as a hash index.
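One possible way around that conversion (a hypothetical sketch, not the bit_utils.py API) is to keep codes packed as uint8 arrays and compute Hamming distances directly with a precomputed popcount table:

```python
import numpy as np

# Table mapping each byte value 0..255 to its number of set bits.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def hamming_packed(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two codes stored as packed uint8 arrays,
    computed without unpacking to per-bit arrays or converting to ints."""
    return int(POPCOUNT[np.bitwise_xor(a, b)].sum())

# 8-bit example: the two bit vectors differ in 3 positions.
a = np.packbits(np.array([1, 0, 1, 1, 0, 0, 0, 1], dtype=np.uint8))
b = np.packbits(np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=np.uint8))
print(hamming_packed(a, b))  # → 3
```

Whether this fits the existing VpTree utilities depends on how they store codes internally, so treat it only as an illustration of the idea.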


Purg commented Oct 9, 2017

Sounds promising. Did you create any scripts, graphs, or plots to show your results? Could any of your tests be used as performance unit tests?

Feel free to optimize the vector/int conversion and create a PR for a VP-tree hash-index and/or nn-index.


wphicks commented Oct 9, 2017

Yes, I have a couple of scripts that I used to generate a whole bunch of plots for different dimensions and numbers of indexed vectors. I was planning on putting the plots in a blog post later, but if there's a better/more useful place to put them, I'll be happy to do that. I've just pushed the scripts to this branch (here and here) in case you want to look at them, but they're very hacky. I can definitely scrape out the good stuff into some clean performance tests, though.


Purg commented Oct 13, 2017

Sorry for the lack of response. I have been preparing for, and am now in, a field-test environment. I will be able to take a look at your linked work in about a week.


wphicks commented Oct 13, 2017

No worries and thanks for all the help so far! I'm hoping to get the full PR done this weekend, so you can just take a look at that when you're back.
