-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add KeyedVectors support to AnnoyIndexer #1318
Conversation
Below is the "monkey patched" version, in case it's useful to someone. It adds
|
Thanks! I actually remember some discussion about supporting the GloVe format in the past. @tmylk didn't we add a script to convert GloVe=>word2vec format? I agree we could (should) easily infer it directly from the file. It looks easy enough. |
@Quole This is a useful PR. Could you please add a quick unit test for it? Just loading and indexing a KV file. Glove vectors can be loaded in Gensim after running the conversion script https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/glove2word2vec.py Would appreciate a PR that loads glove into KeyedVectors directly, without any command line conversions. |
I've had a quick stab in the dark at writing a test, but I never got around to getting gensim to run from the source code, and getting test code running via monkey patching is kinda difficult, so didn't get it working. I think the setup of the test work, just need to work out some assert statements. I probably also need to spend more time working out how the test code is organized. I'll have a go at it another day if I get a chance, but probably not hard for someone more familiar with the codebase to fix it up from the broken code if someone wants to. |
Sorry you had problems setting up gensim development environment. We try to make it easy. What kind of problems did you encounter? Even without your environment you can see your test errors in our Travis build environment |
@Quole You can see error in travis log, also, if you want to run tests in currect class locally, you can use this line (in your repo directory) Now, please fix this problem
|
This PR is now abandoned. |
I've fixed up the tests. If they're acceptable, would it be better if I combined these commits and/or put them in a branch with a different name or anything? @tmylk Wasn't actually difficult to set up a dev environment. I just had an awkward setup inside a docker container that I needed to find the time to make more usable for development (which took less time than I expected), though in retrospect I could have probably just relied on the checks from Travis and done the changes in notepad. |
Could you please add the new use case to the ipynb tutorial? |
Is there a magic way of editing a notebook that doesn't make a mess of its change log? |
Nbdime helps a bit |
@Quole Also you can use your favorite text editor |
Would anyone be able to test if my updated version of annoytutorial.ipynb runs correctly with this fork? |
@Quole I run annoytutorial.ipynb and get exception after this code: %%time
model.save('/tmp/mymodel')
def f(process_id):
print('Process Id: ', os.getpid())
process = psutil.Process(os.getpid())
new_model = Word2Vec.load('/tmp/mymodel')
vector = new_model["science"]
annoy_index = AnnoyIndexer()
annoy_index.load('index')
annoy_index.model = new_model
approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
print('\nMemory used by process {}: '.format(os.getpid()), process.memory_info(), "\n---")
# Creating and running two parallel process to share the same index file.
p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join() output:
|
Ok, the notebook file works now. I've added a section titled "Work with Google's word2vec C formats". The part that is only possible with this PR is this bit:
I've also added file extensions to all the saved tmp files, as it gets confusing. |
I see some problems in your PR:
|
@menshikh-iv Thanks for reviewing the code. The first two points are from existing code. (Sorry I haven't not tried to clean up the diff as yet) Yes, there a bunch of stuff that is very slow and makes editing and testing this notebook quite tedious, especially section 5 ("Evaluate relationship of (3rd point) I will rewrite my code to also work with python2. Thanks for spotting this. |
-1 on that Or why is this obfuscation needed? |
I've changed the offending code as follows:
|
Thank you @Quole |
@menshikh-iv |
Currently the Annoy Indexer can be run on a
Word2Vec
orDoc2Vec
objects, which can be created, for example, by training on sentences. e.g.gensim.models.Word2Vec(sentences)
However, I find myself with a
KeyedVectors
instance. This object type comes from loading pre-trained data (e.g. Google's News dataset or Stanford's Glove[1] Twitter dataset). That data is loaded usingKeyedVectors.load_word2vec_format()
. TheKeyedVectors
class appears to have everything needed to create anAnnoyIndex
, so I've added support for it to GenSim.I'm fairly new to GenSim, so maybe I'm missing some part of the usual workflow that might make this patch redundant. I have only tested this pull request using monkey patching (overriding the class methods after they loaded) and have not tested this particular form of the patch so you might want to double check it, but I don't see any reason it should have a problem [edit: except for a missing import, oops. fixed].
Thanks
[1] note: Stanford Glove datasets are missing the header—which normally contains the number of items and vector dimension count—so this had to be manually inserted, although it should have been easily inferred by gensim.