Improving performance for function most_similar #527

Closed
anhldbk opened this issue Nov 13, 2015 · 9 comments
anhldbk commented Nov 13, 2015

The current implementation of most_similar is NOT efficient. I've found that Annoy can be used to yield better performance.

Here is my code:

from gensim import matutils
from gensim.models.word2vec_inner import REAL
from numpy import ndarray, array
from gensim.models import Doc2Vec
from six import integer_types
from six import string_types
import numpy as np
from annoy import AnnoyIndex

FILE_DOC2VEC = 'dbow_5ns_100features_1mw_r20'
FEATURES_NUM = 100
ANNOY_TREE_SIZE = 300  # choose carefully: more trees improve accuracy but slow down index building

dbow = Doc2Vec.load(FILE_DOC2VEC)
indexer_label_dict = dict()


def init_indexer(features_num, tree_size):
    """
    Initialize the Annoy indexer
    """
    annoy_indexer = AnnoyIndex(features_num)
    docvecs = dbow.docvecs
    docvecs.init_sims()
    size = len(docvecs.doctag_syn0norm)

    for i in range(size):
        annoy_indexer.add_item(i, docvecs.doctag_syn0norm[i])
        indexer_label_dict[i] = docvecs.index2doctag[i]

    annoy_indexer.build(tree_size)
    return annoy_indexer

def dev_most_similar(docvecs, positive=[], negative=[], topn=10, clip_start=0, clip_end=None):
    """
    Find the top-N most similar docvecs known from training. Positive docs contribute
    positively towards the similarity, negative docs negatively.

    This method computes cosine similarity between a simple mean of the projection
    weight vectors of the given docs. Docs may be specified as vectors, integer indexes
    of trained docvecs, or if the documents were originally presented with string tags,
    by the corresponding tags.

    ('clip_start' and 'clip_end' do not play any role here; they are kept only for backward compatibility.)
    The 'clip_start' and 'clip_end' allow limiting results to a particular contiguous
    range of the underlying doctag_syn0norm vectors. (This may be useful if the ordering
    there was chosen to be significant, such as more popular tag IDs in lower indexes.)
    """
    docvecs.init_sims()

    if isinstance(positive, string_types + integer_types) and not negative:
        # allow calls like most_similar('dog'), as a shorthand for most_similar(['dog'])
        positive = [positive]

    # add weights for each doc, if not already present; default to 1.0 for positive and -1.0 for negative docs
    positive = [(doc, 1.0) if isinstance(doc, string_types + (ndarray,) + integer_types)
                else doc for doc in positive]
    negative = [(doc, -1.0) if isinstance(doc, string_types + (ndarray,) + integer_types)
                else doc for doc in negative]

    # compute the weighted average of all docs
    all_docs, mean = set(), []
    for doc, weight in positive + negative:
        if isinstance(doc, ndarray):
            mean.append(weight * doc)
        elif doc in docvecs.doctags or doc < docvecs.count:
            mean.append(weight * docvecs.doctag_syn0norm[docvecs._int_index(doc)])
            all_docs.add(docvecs._int_index(doc))
        else:
            raise KeyError("doc '%s' not in trained set" % doc)
    if not mean:
        raise ValueError("cannot compute similarity with no input")
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)

    nearest = indexer.get_nns_by_vector(mean, topn, include_distances=True)
    size = len(nearest[0])

    # Annoy's angular distance is d = sqrt(2 * (1 - cos)), so cosine similarity = 1 - d**2 / 2
    result = [(indexer_label_dict[nearest[0][count]], 1 - nearest[1][count] ** 2 / 2) for count in range(size)]

    return result

indexer = init_indexer(FEATURES_NUM, ANNOY_TREE_SIZE)

# now put it into use
...
vec = dbow.infer_vector(tokens)

# Using the original method for getting most similar documents 
nearest = np.array(dbow.docvecs.most_similar([vec]), dtype=[("label", "S20"), ('similarity', 'f4')])

# Using dev_most_similar()
nearest = np.array(dev_most_similar(dbow.docvecs, [vec]), dtype=[("label", "S20"), ('similarity', 'f4')])
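As a side note on converting Annoy's distances back to similarities: Annoy's "angular" distance between unit vectors u, v is d = sqrt(2 · (1 − cos(u, v))), so the cosine similarity is recovered as 1 − d²/2. A quick standalone numpy check of that identity (not using Annoy itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# two random unit vectors
u = rng.standard_normal(100)
v = rng.standard_normal(100)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cos = float(u @ v)

# Annoy's "angular" distance between unit vectors: sqrt(2 * (1 - cos))
d = np.sqrt(2.0 * (1.0 - cos))

# recover cosine similarity from the angular distance
recovered = 1.0 - d ** 2 / 2.0

assert abs(recovered - cos) < 1e-12
```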

I used a corpus of 5000 documents and recorded the total execution time using most_similar and dev_most_similar (with different tree sizes):

| Test | Execution time |
| --- | --- |
| most_similar | 249s |
| dev_most_similar (tree size = 600) | 134s |
| dev_most_similar (tree size = 500) | 127s |
| dev_most_similar (tree size = 300) | 109s |

We trade off a one-time initialization cost for faster queries. But it's worth it, right?

@piskvorky
Owner

Yes, that's a good point @anhldbk !

Extending the similarity API to allow arbitrary kNN libs (such as Annoy) is actually one of our "wishlist tasks":
https://github.com/piskvorky/gensim/wiki/Ideas-&-Feature-proposals#integrate-a-fast-k-nn-library

The "init vs. query" tradeoff is fine; the main question here is how to design an API so that the choice of the underlying kNN implementation (MatrixSimilarity, Annoy, K-graph...) is up to the user and flexible.

Btw the particular benchmark times will depend on the BLAS implementation you're using. With a fast BLAS library (such as Intel's MKL on Intel processors), the tradeoff between using approximate kNN vs. exact kNN is not so clear cut, especially on smaller datasets. But we still want to give the user the choice, so they can plug in whatever NN lib they prefer.
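For reference, since the exact-kNN timings depend heavily on the linked BLAS, one way to check which BLAS/LAPACK numpy was built against (the exact output format varies by numpy version):

```python
import numpy as np

# Print numpy's build configuration, including the BLAS/LAPACK
# libraries it was linked against (MKL, OpenBLAS, reference BLAS, ...).
np.show_config()
```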

@anhldbk
Author

anhldbk commented Nov 13, 2015

@piskvorky So is it a good idea to integrate such libs in gensim, allowing end-users to choose by themselves?

@piskvorky
Owner

Yep. Sorry if I wasn't clear.

Desired steps:

  1. design an API that abstracts from kNN lib used
  2. write a wrapper for Annoy (or KGraph, or NMSlib, or other lib...) that follows this new API
  3. change algorithms inside gensim that rely on kNN (such as Word2Vec.most_similar), to use this new API, allowing users to choose what kNN lib (wrapper) to use

Not all steps must necessarily be implemented at once. But doing 3) without 1) seems suboptimal; I wouldn't like that. Without an API, we'd have to repeat the same set of adjustments you did in this PR for every single algorithm, one by one, manually, introducing inconsistencies and code duplication. Apart from most_similar, gensim also uses nearest-neighbour search in document similarity queries etc.
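The steps above could be sketched roughly as follows. All names here (`KNNIndexer`, `BruteForceIndexer`) are hypothetical, invented for illustration, not gensim's actual API: a small interface (step 1) with an exact brute-force baseline that an Annoy-backed wrapper (step 2) could mirror.

```python
import numpy as np


class KNNIndexer(object):
    """Hypothetical abstraction over a kNN backend (step 1)."""

    def most_similar(self, vector, topn):
        raise NotImplementedError


class BruteForceIndexer(KNNIndexer):
    """Exact baseline: full cosine similarity against all stored vectors."""

    def __init__(self, vectors, labels):
        # normalize rows so a plain dot product equals cosine similarity
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self.vectors = vectors / norms
        self.labels = labels

    def most_similar(self, vector, topn):
        query = vector / np.linalg.norm(vector)
        sims = self.vectors @ query
        best = np.argsort(-sims)[:topn]
        return [(self.labels[i], float(sims[i])) for i in best]


# An Annoy-backed wrapper (step 2) would implement the same most_similar()
# signature on top of an AnnoyIndex; callers inside gensim (step 3) would
# then accept any KNNIndexer instance and fall back to the exact search
# when none is supplied.
```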

@davechallis
Contributor

Sounds like a great idea. I've been using https://github.com/lyst/rpforest to achieve the same thing, so an abstraction layer would be much preferable to a hardcoded one.

@anhldbk
Author

anhldbk commented Nov 14, 2015

@piskvorky The 3 steps above are what's required to integrate such libs properly. I think I'll have my implementation ready soon.

@honnibal

honnibal commented Dec 6, 2015

Do users really need to drop in different kNN implementations? It seems to me that for this use case, some implementation is simply going to be best on the speed/accuracy trade-off. So is it really necessary to abstract this part?

@piskvorky
Owner

Yes, we definitely want to abstract the kNN behind an API.

@lechatpito
Contributor

+1. A faster, approximate most_similar would be very welcome. I would also rank having a good working approximate-search implementation above having an abstraction.

@tmylk
Contributor

tmylk commented Oct 18, 2016

Implemented in https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/annoytutorial.ipynb

@tmylk tmylk closed this as completed Oct 18, 2016