Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add KeyedVectors support to AnnoyIndexer #1318

Merged
merged 14 commits into from
Jun 12, 2017

Conversation

pengowray
Copy link
Contributor

@pengowray pengowray commented May 13, 2017

Currently the Annoy Indexer can be run on a Word2Vec or Doc2Vec objects, which can be created, for example, by training on sentences. e.g. gensim.models.Word2Vec(sentences)

However, I find myself with a KeyedVectors instance. This object type comes from loading pre-trained data (e.g. Google's News dataset or Stanford's Glove[1] Twitter dataset). That data is loaded using KeyedVectors.load_word2vec_format(). The KeyedVectors class appears to have everything needed to create an AnnoyIndex, so I've added support for it to GenSim.

I'm fairly new to GenSim, so maybe I'm missing some part of the usual workflow that might make this patch redundant. I have only tested this pull request using monkey patching (overriding the class methods after they loaded) and have not tested this particular form of the patch so you might want to double check it, but I don't see any reason it should have a problem [edit: except for a missing import, oops. fixed].

Thanks

[1] note: Stanford Glove datasets are missing the header—which normally contains the number of items and vector dimension count—so this had to be manually inserted, although it should have been easily inferred by gensim.

@pengowray
Copy link
Contributor Author

pengowray commented May 13, 2017

Below is the "monkey patched" version, in case it's useful to someone. It adds KeyedVectors support to the Annoy Indexer without directly modifying gensim per se.

from gensim.models.keyedvectors import KeyedVectors

def new_init(self, model=None, num_trees=None):
	self.index = None
	self.labels = None
	self.model = model
	self.num_trees = num_trees
	
	if model and num_trees:
		if isinstance(self.model, Doc2Vec):
			self.build_from_doc2vec()
		elif isinstance(self.model, Word2Vec):
			self.build_from_word2vec()
		elif isinstance(self.model, KeyedVectors):
			self.build_from_keyedvectors()
		else:
			raise ValueError("Only a Word2Vec or Doc2Vec or KeyedVectors instance can be used")

def build_from_keyedvectors(self):
	"""Build an Annoy index using word vectors from a KeyedVectors model"""
	
	#skip normalization:
	#return self._build_from_model(self.model.syn0, self.model.index2word, self.model.vector_size)
	
	#with normalization:
	self.model.init_sims()
	return self._build_from_model(self.model.syn0norm, self.model.index2word, self.model.vector_size)

AnnoyIndexer.__init__ = new_init
AnnoyIndexer.build_from_keyedvectors = build_from_keyedvectors

@piskvorky
Copy link
Owner

Thanks!

I actually remember some discussion about supporting the GloVe format in the past. @tmylk didn't we add a script to convert GloVe=>word2vec format?

I agree we could (should) easily infer it directly from the file. It looks easy enough.

@tmylk
Copy link
Contributor

tmylk commented May 15, 2017

@Quole This is a useful PR. Could you please add a quick unit test for it? Just loading and indexing a KV file.

Glove vectors can be loaded in Gensim after running the conversion script https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/glove2word2vec.py

Would appreciate a PR that loads glove into KeyedVectors directly, without any command line conversions.

@pengowray
Copy link
Contributor Author

I've had a quick stab in the dark at writing a test, but I never got around to getting gensim to run from the source code, and getting test code running via monkey patching is kinda difficult, so didn't get it working.

I think the setup of the test work, just need to work out some assert statements. I probably also need to spend more time working out how the test code is organized. I'll have a go at it another day if I get a chance, but probably not hard for someone more familiar with the codebase to fix it up from the broken code if someone wants to.

@tmylk
Copy link
Contributor

tmylk commented May 18, 2017

Sorry you had problems setting up gensim development environment. We try to make it easy. What kind of problems did you encounter?

Even without your environment you can see your test errors in our Travis build environment

@menshikh-iv
Copy link
Contributor

@Quole You can see error in travis log, also, if you want to run tests in currect class locally, you can use this line (in your repo directory)
python gensim/test/test_similarities.py TestWord2VecAnnoyIndexer

Now, please fix this problem

======================================================================
ERROR: testAnnoyIndexingOfKeyedVectors (__main__.TestWord2VecAnnoyIndexer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "gensim/test/test_similarities.py", line 489, in testAnnoyIndexingOfKeyedVectors
    self.assertVectorIsSimilarToItself(model, index)
  File "gensim/test/test_similarities.py", line 499, in assertVectorIsSimilarToItself
    vector = model.wv.syn0norm[0]
AttributeError: 'KeyedVectors' object has no attribute 'wv'

----------------------------------------------------------------------
Ran 4 tests in 0.060s

@tmylk tmylk added the difficulty easy Easy issue: required small fix label May 29, 2017
@tmylk
Copy link
Contributor

tmylk commented May 29, 2017

This PR is now abandoned.
Anyone, especially someone looking for a first contribution, is welcome to pick it up.
It just needs a minor change to assertVectorIsSimilarToItself

@pengowray
Copy link
Contributor Author

I've fixed up the tests. If they're acceptable, would it be better if I combined these commits and/or put them in a branch with a different name or anything?

@tmylk Wasn't actually difficult to set up a dev environment. I just had an awkward setup inside a docker container that I needed to find the time to make more usable for development (which took less time than I expected), though in retrospect I could have probably just relied on the checks from Travis and done the changes in notepad.

@tmylk
Copy link
Contributor

tmylk commented May 30, 2017

Could you please add the new use case to the ipynb tutorial?
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/annoytutorial.ipynb

@pengowray
Copy link
Contributor Author

Is there a magic way of editing a notebook that doesn't make a mess of its change log?

@tmylk
Copy link
Contributor

tmylk commented May 30, 2017

Nbdime helps a bit

@menshikh-iv
Copy link
Contributor

@Quole Also you can use your favorite text editor

@pengowray
Copy link
Contributor Author

Would anyone be able to test if my updated version of annoytutorial.ipynb runs correctly with this fork?

@menshikh-iv
Copy link
Contributor

@Quole I run annoytutorial.ipynb and get exception after this code:

%%time

model.save('/tmp/mymodel')

def f(process_id):
    print('Process Id: ', os.getpid())
    process = psutil.Process(os.getpid())
    new_model = Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    annoy_index = AnnoyIndexer()
    annoy_index.load('index')
    annoy_index.model = new_model
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    print('\nMemory used by process {}: '.format(os.getpid()), process.memory_info(), "\n---")

# Creating and running two parallel process to share the same index file.
p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join()

output:

('Process Id: ', 13330)
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "<timed exec>", line 7, in f
  File "/home/ivan/.virtualenvs/clean2/local/lib/python2.7/site-packages/gensim-2.1.0-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 1386, in load
    if model.negative and hasattr(model.wv, 'index2word'):
AttributeError: 'KeyedVectors' object has no attribute 'negative'
('Process Id: ', 13353)
Process Process-4:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "<timed exec>", line 7, in f
  File "/home/ivan/.virtualenvs/clean2/local/lib/python2.7/site-packages/gensim-2.1.0-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 1386, in load
    if model.negative and hasattr(model.wv, 'index2word'):
AttributeError: 'KeyedVectors' object has no attribute 'negative'
CPU times: user 544 ms, sys: 52 ms, total: 596 ms
Wall time: 1.21 s

@pengowray
Copy link
Contributor Author

pengowray commented Jun 4, 2017

Ok, the notebook file works now. I've added a section titled "Work with Google's word2vec C formats". The part that is only possible with this PR is this bit:

# To create and save Annoy Index from a loaded `KeyedVectors` object (with 100 trees)
annoy_index = AnnoyIndexer(wv, 100)
annoy_index.save('/tmp/mymodel.annoyindex')

I've also added file extensions to all the saved tmp files, as it gets confusing.

@menshikh-iv
Copy link
Contributor

I see some problems in your PR:

  • print('\nMemory used by process {}: '.format(os.getpid()), process.memory_info(), "\n---") or print ('Process Id: ', os.getpid()) print tuple (not string)

  • This cycle takes a lot of time, I think it will be a good idea to logging progress

   x_values.append(x)
   start_time = time.time()
   ...
  • This code incorrect in python2 (because print isn't a function, print is an operator), you can rewrite this as simple cycle
# The first line should have the total number of entries and the vector size
with open('/tmp/vectors.txt') as myfile:
    [print(next(myfile)) for x in range(3)]

@pengowray
Copy link
Contributor Author

pengowray commented Jun 4, 2017

@menshikh-iv Thanks for reviewing the code. The first two points are from existing code. (Sorry I haven't not tried to clean up the diff as yet)

Yes, there a bunch of stuff that is very slow and makes editing and testing this notebook quite tedious, especially section 5 ("Evaluate relationship of num_trees to initialization time and accuracy"). It should probably also be moved to another notebook file. The accuracy graph could also be constructed much more quickly if the index was simply built only once with with 300 trees and then each iteration choosing the number of trees to consider (instead of re-building the entire index hundreds of times). This would probably require a small change to the API to add num_trees (aka n_trees) as a parameter for model.most_similar(). The initialization time graph still requires rebuilding the index, but it is very linear, so perhaps it could skip all but every 50th iteration. There should also be a graph for evaluating accuracy vs search_k (see annoy readme) too. But all these changes (and changes to logging) should probably be in a different PR.

(3rd point) I will rewrite my code to also work with python2. Thanks for spotting this.

@piskvorky
Copy link
Owner

-1 on that [print(next(myfile)) for x in range(3)] construct: just write a normal loop.

Or why is this obfuscation needed?

@pengowray
Copy link
Contributor Author

I've changed the offending code as follows:

with open('/tmp/vectors.txt') as myfile:
    for i in range(3):
        print(myfile.readline().strip())

@menshikh-iv
Copy link
Contributor

Thank you @Quole

@menshikh-iv menshikh-iv merged commit 686e975 into piskvorky:develop Jun 12, 2017
@piskvorky
Copy link
Owner

@menshikh-iv open => smart_open please (we try to use that consistently, for simpler reading from S3/compressed/HDFS etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants