
use cached matrix of pairwise dot products where possible #95

Open · senderle opened this issue Jan 15, 2020 · 6 comments
senderle commented Jan 15, 2020

Currently the 3CosAdd and 3CosMul methods run very slowly. It should be possible to speed up both operations with the strategy, used by hyperwords, of precalculating all pairwise cosine similarity scores once, then simply adding, subtracting, multiplying, or dividing the resulting scalars for each test.

This is possible because cos(b*, a* − a + b) can be distributed out to cos(b*, a*) − cos(b*, a) + cos(b*, b). (Strictly, for unit vectors the two sides differ only by division by the constant ‖a* − a + b‖, which is the same for every candidate b*, so the argmax is unchanged.) The second form looked worse to me, since it takes three dot products instead of one, until I realized you can precompute all possible pairwise dot products once and never have to calculate any more, whereas every new combination of a*, a, b in the first form requires executing a new dot product.
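Here's a toy numpy check of that identity (names and sizes are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of unit-normalized vectors; for unit vectors,
# cosine similarity is just the dot product.
W = rng.normal(size=(1000, 50))
W /= np.linalg.norm(W, axis=1, keepdims=True)

a, a_star, b = 1, 2, 3          # arbitrary word indices

# Precompute all pairwise dot products once.
D = W @ W.T

# One fresh dot product per candidate b* ...
direct = W @ (W[a_star] - W[a] + W[b])

# ... versus three lookups into the cached matrix.
cached = D[:, a_star] - D[:, a] + D[:, b]

assert np.allclose(direct, cached)
assert direct.argmax() == cached.argmax()
```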

Given the current architecture this might not be a trivial change, but at least for smaller vocabularies it should definitely be possible, and it could yield a ~100x speedup by leveraging fast linear algebra routines.

This is something I'd be interested in working on, though I may not be able to devote time to it in the near term. If that would be of interest, though, please let me know.

senderle changed the title from "use cached matrix of pairwise distances where possible" to "use cached matrix of pairwise dot products where possible" on Jan 15, 2020
@undertherain (Contributor)

Hi!
Yeah, that's interesting, and it doesn't look like it would take a lot of code to implement.

One reservation I have is that, as you pointed out, with a larger vocabulary the cached matrix of dot products will take a huge amount of memory. For instance, I typically see vocabularies on the order of 100K tokens, which means the matrix would be about 40 GB (100K × 100K 32-bit floats). So ideally we'd check the available RAM before caching and give appropriate feedback to the user.
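Something like this, maybe (just a sketch; psutil is one way to read available RAM, and the function name is made up):

```python
import numpy as np
import psutil

def can_cache_pairwise(vocab_size, dtype=np.float32, headroom=0.5):
    """Check whether a vocab_size x vocab_size matrix of `dtype`
    would fit within a fraction (headroom) of currently free RAM."""
    needed = vocab_size ** 2 * np.dtype(dtype).itemsize
    return needed <= psutil.virtual_memory().available * headroom

# e.g. 100K tokens -> 40 GB needed, so this would normally be False
print(can_cache_pairwise(100_000))
```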

@senderle (Author)

Yeah, I was going back over this in my head and remembered that what makes it practical is that the full VxV matrix doesn't need to be stored. It should be possible to store a VxT matrix of cosine values instead, where T is the number of distinct words in the analogy test set.

Then the solution for a given analogy is `(VT[:, a_] - VT[:, a] + VT[:, b]).argmax()`.
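In numpy terms, roughly (a sketch with made-up sizes; `W` stands in for the row-normalized embedding matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; in practice V might be ~100K and T a few thousand.
V, T, dim = 10_000, 500, 100
W = rng.normal(size=(V, dim)).astype(np.float32)
W /= np.linalg.norm(W, axis=1, keepdims=True)

test_idx = rng.choice(V, size=T, replace=False)  # distinct test-set words
VT = W @ W[test_idx].T                           # V x T cosines, computed once

# Solve one analogy; a, a_, b are column positions within test_idx.
a, a_, b = 0, 1, 2
answer = (VT[:, a_] - VT[:, a] + VT[:, b]).argmax()  # ranges over the full vocab
```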

Maybe you still couldn't do it with all the words for all 40 tests at once. But I bet it would still give a good speedup if you did it just once for each test. That would be, what, 2500 words or so? For 32-bit floats and a 100K vocabulary, that's about 1 GB (100,000 × 2,500 × 4 bytes), well within the capacity of most modern laptops.

@senderle (Author)

Wait, no! It's way smaller, because the number of distinct words per test is far smaller than the number of test questions, since the tests compare every pair against every other pair. There must be only ~2k distinct terms in the entire BATS test set. Is that right? In that case you could definitely build the full VxT matrix, though the approach of splitting it up by test might be more practical from an implementation standpoint.

@undertherain (Contributor)

You can't restrict the space of possible answers to words from the test set. Vocabulary size and content affect the performance of different embeddings tremendously, so for a fair comparison between two sets of embeddings we first make their vocabularies identical. If only words from the test were allowed, the answer would most often be correct.

@senderle (Author)

senderle commented Jan 16, 2020

This won't restrict the space of possible answers though — we need all possible outputs (the V dimension) but we only need a subset of inputs (the T dimension).

@senderle (Author)

senderle commented Jan 16, 2020

I wrote a quick example to show what I mean. I believe it works as it should, but let me know if you see something wrong.

(This used to be a gist, but gist wasn't playing nicely with the sample vector file so I just created a standard repo.)

https://github.com/senderle/3cosadd-demo
