use cached matrix of pairwise dot products where possible #95
Hi!
Yeah, I was going back over this in my head and remembering that what makes it practical is that the full VxV matrix doesn't need to be stored. It should be possible to store a VxT matrix of cosine values, where T is the number of words in the analogy test set. Then the solution for a given analogy is just a combination of precomputed entries from that matrix. Maybe you still couldn't do it with all words for all 40 tests at once, but I bet it would still give a good speedup if you did it just once for each test — that would be, what, 2500 words or so? For 32-bit floats that seems like about 2 GB, well within the capacity of most modern laptops.
Wait, no! It's way smaller, because the number of distinct words per test is far smaller than the number of tests, since the tests look at every pair against every other pair. There must only be ~2k distinct terms in the entire BATS test set — is that right? In that case you could definitely do the full VxT matrix, though the approach of splitting them by test might be more practical from an implementation standpoint.
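To make the shapes concrete, here's a minimal sketch of building that VxT matrix with NumPy. All names and sizes here are hypothetical stand-ins (random vectors, `test_words` as a placeholder for the ~2k distinct BATS terms), not the project's actual API:

```python
import numpy as np

# Hypothetical setup: V = vocabulary size, d = embedding dimension.
# Rows of `embeddings` are word vectors; `test_words` stands in for the
# vocabulary indices of the ~2k distinct terms appearing in the test set.
rng = np.random.default_rng(0)
V, d = 10_000, 50
embeddings = rng.standard_normal((V, d)).astype(np.float32)
test_words = np.arange(2_000)  # placeholder for the distinct test terms

# Normalize rows once, so that dot products are cosine similarities.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# V x T matrix: cosine of every vocabulary word against every test term.
sims = normed @ normed[test_words].T  # shape (V, T)
```

At real scale the memory cost is just `V * T * 4` bytes for float32, which is where the ~2 GB estimate above comes from.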
You cannot restrict the space of possible answers to words from the test set. Basically, vocabulary size and content affect the performance of different embeddings tremendously. For a fair comparison between two sets of embeddings, we first make the vocabularies identical. If only words from the test were allowed, the answer would most often be correct.
This won't restrict the space of possible answers, though — we need all possible outputs (the V dimension), but we only need a subset of inputs (the T dimension).
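A small self-contained sketch of that point, with toy random vectors and hypothetical names throughout: the precomputed matrix has one column per test term, but every query still scores and ranks all V vocabulary words.

```python
import numpy as np

# Toy setup (hypothetical sizes): a full vocabulary of V words, of which
# only T ever appear as the inputs a, a*, b of analogy queries.
rng = np.random.default_rng(1)
V, T, d = 5_000, 100, 50
vecs = rng.standard_normal((V, d)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit rows: dot == cosine

test_idx = rng.choice(V, size=T, replace=False)  # vocab rows of the test terms
sims = vecs @ vecs[test_idx].T                   # shape (V, T)

# One 3CosAdd query a : a* :: b : ?, with a, a*, b at test-term columns
# ta, tas, tb. The score vector covers ALL V words — the answer space
# is not restricted to the test set.
ta, tas, tb = 0, 1, 2
scores = sims[:, tas] - sims[:, ta] + sims[:, tb]

# Exclude the query words themselves (standard practice), then argmax over V.
scores[test_idx[[ta, tas, tb]]] = -np.inf
answer = int(np.argmax(scores))
```

The answer index ranges over the whole vocabulary; only the query words had to come from the T columns.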
I wrote a quick example to show what I mean. I believe it works as it should, but let me know if you see something wrong. (This used to be a gist, but gist wasn't playing nicely with the sample vector file, so I just created a standard repo.)
Currently the `3CosAdd` and `3CosMul` methods run very slowly. It should be possible to speed up both operations with the strategy used by hyperwords: precalculate all pairwise cosine similarity scores once, and then simply add, subtract, multiply, or divide the resulting scalars for the tests. This is possible because `cos(b*, a* − a + b)` can be distributed out to `cos(b*, a*) − cos(b*, a) + cos(b*, b)`. (The second form looked worse to me until I realized you can precompute all possible pairwise dot products and never have to calculate any more, whereas every new combination of `a*, a, b` in the first form requires executing a new dot product.)
Given the current architecture this might not be a trivial operation, but at least for smaller vocabularies it should definitely be possible, and could yield a ~100x speedup by leveraging fast linear algebra routines.
This is something I'd be interested in working on, though I may not be able to devote time to it in the near term. If that would be of interest, though, please let me know.