Improving performance for function most_similar #527

Closed
anhldbk opened this issue Nov 13, 2015 · 9 comments
anhldbk commented Nov 13, 2015

The current implementation of most_similar is NOT efficient. I've found that Annoy can be used to yield better performance.

Here is my code:

from gensim import matutils
from gensim.models.word2vec_inner import REAL
from numpy import ndarray, array
from gensim.models import Doc2Vec
from six import integer_types
from six import string_types
import numpy as np
from annoy import AnnoyIndex

FILE_DOC2VEC = 'dbow_5ns_100features_1mw_r20'
FEATURES_NUM = 100
ANNOY_TREE_SIZE = 300  # choose carefully: more trees improve accuracy but slow down index building

dbow = Doc2Vec.load(FILE_DOC2VEC)
indexer_label_dict = dict()


def init_indexer(features_num, tree_size):
    """
    Initialize the Annoy indexer
    """
    annoy_indexer = AnnoyIndex(features_num)
    docvecs = dbow.docvecs
    docvecs.init_sims()
    size = len(docvecs.doctag_syn0norm)

    for i in range(size):
        annoy_indexer.add_item(i, docvecs.doctag_syn0norm[i])
        indexer_label_dict[i] = docvecs.index2doctag[i]

    annoy_indexer.build(tree_size)
    return annoy_indexer

def dev_most_similar(docvecs, positive=[], negative=[], topn=10, clip_start=0, clip_end=None):
    """
    Find the top-N most similar docvecs known from training. Positive docs contribute
    positively towards the similarity, negative docs negatively.

    This method computes cosine similarity between a simple mean of the projection
    weight vectors of the given docs. Docs may be specified as vectors, integer indexes
    of trained docvecs, or if the documents were originally presented with string tags,
    by the corresponding tags.

    ('clip_start' and 'clip_end' do not play any role here; they are kept only for backward compatibility.)
    The 'clip_start' and 'clip_end' allow limiting results to a particular contiguous
    range of the underlying doctag_syn0norm vectors. (This may be useful if the ordering
    there was chosen to be significant, such as more popular tag IDs in lower indexes.)
    """
    docvecs.init_sims()

    if isinstance(positive, string_types + integer_types) and not negative:
        # allow calls like most_similar('dog'), as a shorthand for most_similar(['dog'])
        positive = [positive]

    # add weights for each doc, if not already present; default to 1.0 for positive and -1.0 for negative docs
    positive = [(doc, 1.0) if isinstance(doc, string_types + (ndarray,) + integer_types)
                else doc for doc in positive]
    negative = [(doc, -1.0) if isinstance(doc, string_types + (ndarray,) + integer_types)
                else doc for doc in negative]

    # compute the weighted average of all docs
    all_docs, mean = set(), []
    for doc, weight in positive + negative:
        if isinstance(doc, ndarray):
            mean.append(weight * doc)
        elif doc in docvecs.doctags or doc < docvecs.count:
            mean.append(weight * docvecs.doctag_syn0norm[docvecs._int_index(doc)])
            all_docs.add(docvecs._int_index(doc))
        else:
            raise KeyError("doc '%s' not in trained set" % doc)
    if not mean:
        raise ValueError("cannot compute similarity with no input")
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)

    nearest = indexer.get_nns_by_vector(mean, topn, include_distances=True)
    size = len(nearest[0])

    # Annoy's angular distance is d = sqrt(2 * (1 - cos)), so cosine similarity = 1 - d**2 / 2
    result = [(indexer_label_dict[nearest[0][count]], 1 - nearest[1][count] ** 2 / 2) for count in range(size)]

    return result

indexer = init_indexer(FEATURES_NUM, ANNOY_TREE_SIZE)

# now put it into use
...
vec = dbow.infer_vector(tokens)

# Using the original method for getting most similar documents 
nearest = np.array(dbow.docvecs.most_similar([vec]), dtype=[("label", "S20"), ('similarity', 'f4')])

# Using dev_most_similar()
nearest = np.array(dev_most_similar(dbow.docvecs, [vec]), dtype=[("label", "S20"), ('similarity', 'f4')])
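As a side note on converting Annoy's distances back to similarities: Annoy's "angular" distance between unit vectors u, v is d = sqrt(2 · (1 − cos(u, v))), so the cosine similarity is recovered as 1 − d²/2. A quick standalone numpy check of that identity (not using Annoy itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# two random unit vectors
u = rng.standard_normal(100)
v = rng.standard_normal(100)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cos = float(u @ v)

# Annoy's "angular" distance between unit vectors: sqrt(2 * (1 - cos))
d = np.sqrt(2.0 * (1.0 - cos))

# recover cosine similarity from the angular distance
recovered = 1.0 - d ** 2 / 2.0

assert abs(recovered - cos) < 1e-12
```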

I used a corpus of 5000 documents and recorded the total execution time using most_similar and dev_most_similar (with different tree sizes):

| Test | Execution time |
| --- | --- |
| most_similar | 249s |
| dev_most_similar (tree size = 600) | 134s |
| dev_most_similar (tree size = 500) | 127s |
| dev_most_similar (tree size = 300) | 109s |

We trade off a one-time initialization cost for faster queries. But it's worth it, right?

@piskvorky
Owner

Yes, that's a good point @anhldbk !

Extending the similarity API to allow arbitrary kNN libs (such as Annoy) is actually one of our "wishlist tasks":
https://github.com/piskvorky/gensim/wiki/Ideas-&-Feature-proposals#integrate-a-fast-k-nn-library

The "init vs. query" tradeoff is fine; the main question here is how to design an API so that the choice of the underlying kNN implementation (MatrixSimilarity, Annoy, K-graph...) is up to the user and flexible.

Btw the particular benchmark times will depend on the BLAS implementation you're using. With a fast BLAS library (such as Intel's MKL on Intel processors), the tradeoff between using approximate kNN vs. exact kNN is not so clear cut, especially on smaller datasets. But we still want to give the user the choice, so they can plug in whatever NN lib they prefer.
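For reference, since the exact-kNN timings depend heavily on the linked BLAS, one way to check which BLAS/LAPACK numpy was built against (the exact output format varies by numpy version):

```python
import numpy as np

# Print numpy's build configuration, including the BLAS/LAPACK
# libraries it was linked against (MKL, OpenBLAS, reference BLAS, ...).
np.show_config()
```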

@anhldbk
Author

anhldbk commented Nov 13, 2015

@piskvorky So is it a good idea to integrate such libs in gensim, allowing end-users to choose by themselves?

@piskvorky
Owner

Yep. Sorry if I wasn't clear.

Desired steps:

  1. design an API that abstracts from kNN lib used
  2. write a wrapper for Annoy (or KGraph, or NMSlib, or other lib...) that follows this new API
  3. change algorithms inside gensim that rely on kNN (such as Word2Vec.most_similar), to use this new API, allowing users to choose what kNN lib (wrapper) to use

Not all steps must necessarily be implemented at once. But doing 3) without 1) seems suboptimal; I wouldn't like that. Without an API, we'd have to repeat the same set of adjustments you did in this PR for every single algorithm, one by one, manually, introducing inconsistencies and code duplication. Apart from most_similar, gensim also uses nearest-neighbour search in document similarity queries etc.
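The steps above could be sketched roughly as follows. All names here (`KNNIndexer`, `BruteForceIndexer`) are hypothetical, invented for illustration, not gensim's actual API: a small interface (step 1) with an exact brute-force baseline that an Annoy-backed wrapper (step 2) could mirror.

```python
import numpy as np


class KNNIndexer(object):
    """Hypothetical abstraction over a kNN backend (step 1)."""

    def most_similar(self, vector, topn):
        raise NotImplementedError


class BruteForceIndexer(KNNIndexer):
    """Exact baseline: full cosine similarity against all stored vectors."""

    def __init__(self, vectors, labels):
        # normalize rows so a plain dot product equals cosine similarity
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self.vectors = vectors / norms
        self.labels = labels

    def most_similar(self, vector, topn):
        query = vector / np.linalg.norm(vector)
        sims = self.vectors @ query
        best = np.argsort(-sims)[:topn]
        return [(self.labels[i], float(sims[i])) for i in best]


# An Annoy-backed wrapper (step 2) would implement the same most_similar()
# signature on top of an AnnoyIndex; callers inside gensim (step 3) would
# then accept any KNNIndexer instance and fall back to the exact search
# when none is supplied.
```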

@davechallis
Contributor

Sounds like a great idea. I've been using https://github.com/lyst/rpforest to achieve the same thing, so an abstraction layer would be much preferable to a hardcoded one.

@anhldbk
Author

anhldbk commented Nov 14, 2015

@piskvorky The 3 steps above are what's required to integrate such libs properly. I think I'll have my implementation ready soon.

@honnibal

honnibal commented Dec 6, 2015

Do users really need to drop in different kNN implementations? It seems to me that for this use case, some implementation is simply going to be best on the speed/accuracy trade-off. So is it really necessary to abstract this part?

@piskvorky
Owner

Yes, we definitely want to abstract the kNN behind an API.

@lechatpito
Contributor

+1. A faster, approximate most_similar would be very welcome. I would also rank having a good working approximate-search implementation above having an abstraction.

@tmylk
Contributor

tmylk commented Oct 18, 2016

Implemented in https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/annoytutorial.ipynb

@tmylk tmylk closed this as completed Oct 18, 2016