
Implement Soft Cosine Measure #1827

Merged: 40 commits into piskvorky:develop on Feb 8, 2018

Conversation

@Witiko (Contributor) commented Jan 6, 2018

Introduction

I implemented the Soft Cosine Measure (SCM) [wiki, 1, 2] as part of the research for my thesis [3]. Although the original algorithm [1] has a time complexity that is quadratic in the document length, I implemented a linear-time approximate algorithm that I sketch in [3, sec. 4.4]. Since Gensim was such an indispensable asset in my work, I thought I would give back and contribute code. The implementation is showcased in a Jupyter notebook on corpora from the SemEval 2016 and 2017 competitions.

[notebook: soft_cosine_tutorial]

Description

My original implementation closely followed the Gensim implementation of the Word Mover's Distance (WMD), which is split into the gensim.models.keyedvectors.EuclideanKeyedVectors.wmdistance method, which takes two token lists and computes the WMD for them, and the gensim.similarities.WmdSimilarity class, which provides batch similarity queries. However, I was not quite happy with this for the following reasons:

  1. Not all useful term similarity matrices are constructed from word embeddings. Putting the entire logic into gensim.models.keyedvectors.EuclideanKeyedVectors therefore immediately seemed like a bad idea that would hinder further extensions.
  2. Because token lists are automatically converted into the bag-of-words representation behind the scenes, the user is unable to apply document-length normalization methods such as tf-idf.

For the above reasons, I ultimately decided to split the implementation into a function, a method, and a class as follows:

  1. The gensim.matutils.softcossim function takes two documents in the bag-of-words representation and a sparse term similarity matrix in the SciPy CSC format, and computes the SCM between the two documents.
  2. The gensim.models.keyedvectors.EuclideanKeyedVectors.similarity_matrix method takes a corpus of bag-of-words vectors and a dictionary, and produces the sparse term similarity matrix Mrel described by Charlet and Damnati, 2017 [1].
  3. The gensim.similarities.SoftCosineSimilarity class takes a corpus of bag-of-words vectors and a sparse term similarity matrix in the SciPy CSC format, and provides batch SCM queries against the corpus.

The above design achieves a much looser coupling between the individual components and eliminates the original concerns. I demonstrate the implementation in a Jupyter notebook on the corpus of Yelp reviews. The approximate linear-time algorithm for the SCM achieves about the same speed as the approximate linear-time algorithm for the WMD (see the corresponding Jupyter notebook).
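For reference, here is a minimal usage sketch of the three components on a toy corpus. The variable names are illustrative, the local path to the Google News vectors is an assumption, and the exact signatures may differ slightly from the version that was eventually merged.

```python
from gensim import matutils
from gensim.corpora import Dictionary
from gensim.models import KeyedVectors, TfidfModel
from gensim.similarities import SoftCosineSimilarity

texts = [['obama', 'speaks', 'media', 'illinois'],
         ['president', 'greets', 'press', 'chicago']]
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Assumption: the Google News word2vec vectors are available locally.
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# (2) Build the sparse term similarity matrix from the embeddings.
similarity_matrix = w2v.similarity_matrix(dictionary)

# (1) SCM between two bag-of-words vectors.
scm = matutils.softcossim(bow_corpus[0], bow_corpus[1], similarity_matrix)

# Because the inputs are plain bag-of-words vectors, a document-length
# normalization such as tf-idf can be applied first (one of the original concerns).
tfidf = TfidfModel(bow_corpus)

# (3) Batch SCM queries against the corpus.
index = SoftCosineSimilarity(tfidf[bow_corpus], similarity_matrix)
sims = index[tfidf[dictionary.doc2bow(['obama', 'press', 'conference'])]]
```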

Future work

The gensim.similarities.SoftCosineSimilarity class goes over the entire corpus and computes the SCM between the query and each document separately by calling gensim.matutils.softcossim. If performance is a concern, the SCM can instead be computed in a single step as q^T * S * C, where q is the normalized query vector, S is the term similarity matrix, C is the normalized term-document matrix of the corpus, and "normalized" in this context means that a vector v is divided by sqrt(v^T * S * v). This is similar to what the gensim.similarities.MatrixSimilarity.get_similarities method does, for example, only with the basic cosine similarity rather than the SCM.
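A minimal sketch of this single-step batch computation follows. It illustrates the idea and is not the merged code; all names (batch_softcossim, q, S, C) are assumed. Here q is a dense query term vector, S a sparse term similarity matrix, and C a sparse term-document matrix.

```python
import numpy as np
from scipy import sparse

def batch_softcossim(q, S, C):
    """SCM of a dense query vector q (num_terms,) against each column of a
    sparse term-document matrix C (num_terms, num_docs), given a sparse
    term similarity matrix S (num_terms, num_terms)."""
    SC = S.dot(C)                                       # S * C, still sparse
    q_norm = np.sqrt(q.dot(S.dot(q)))                   # sqrt(q^T * S * q)
    doc_norms = np.sqrt(C.multiply(SC).sum(axis=0)).A1  # sqrt(c^T * S * c) per column
    return SC.T.dot(q) / (q_norm * doc_norms)           # one SCM score per document
```

This computes all document norms with a single sparse elementwise product instead of one softcossim call per document.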

References

  1. Grigori Sidorov et al. "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model." 2014.
  2. Delphine Charlet and Géraldine Damnati. "SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering." 2017.
  3. Vít Novotný. "Vector Space Representations in Information Retrieval" (preprint). 2017.

@Witiko force-pushed the softcossim branch 4 times, most recently from defaf4d to 2b4b47c, on January 7, 2018 at 16:12
@Witiko (Contributor, Author) commented Jan 7, 2018

I added numpy-style documentation and unit tests. Hopefully, the code is good to go now.

@menshikh-iv (Contributor) reviewed:

Great work @Witiko; in general, it looks nice!

"outputs": [],
"source": [
"from time import time\n",
"start_nb = time()\n",
@menshikh-iv (Contributor): unused var

@Witiko (Author): Fixed in 08dea4e.

"metadata": {},
"outputs": [],
"source": [
"sentence_obama = 'Obama speaks to the media in Illinois'\n",
@menshikh-iv (Contributor): Change
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_obama = sentence_obama.lower().split()

to

sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()

Same for the other sentences.

@Witiko (Author): Fixed in 621ed0d.

}
],
"source": [
"start = time()\n",
@menshikh-iv (Contributor): To track the run time of a cell, it is better to use the %%time "magic":

%%time

.. <SOME CODE> ..

instead of

start = time()

.. <SOME CODE> ..

print('Cell took %.2f seconds to run.' % (time() - start))

here and everywhere.

@Witiko (Author): Fixed in 8af5f67.

"start = time()\n",
"import os\n",
"\n",
"from gensim.models import KeyedVectors\n",
@menshikh-iv (Contributor): It is better to use the gensim-data functionality instead of this part:

import gensim.downloader as api
model = api.load("word2vec-google-news-300")

"w2v_corpus = [] # Documents to train word2vec on (all 6 restaurants).\n",
"scs_corpus = [] # Documents to run queries against (only one restaurant).\n",
"documents = [] # scs_corpus, with no pre-processing (so we can see the original documents).\n",
"with open('/data/review.json') as data_file:\n",
@menshikh-iv (Contributor): It would be great if we added this dataset to gensim-data and used it here. Can you investigate, @Witiko, whether this is possible (does the Yelp license allow us to share it or not)?

@Witiko (Author) commented Jan 12, 2018:

I don't think we can; the license seems to explicitly forbid any distribution of the dataset (see section C). The dataset is also used in the Word Mover's Distance notebook, but judging from the description, it contained different data back when that notebook was created. This indicates that the dataset keeps changing over time.

@Witiko (Author): Should I update both the WMD and SCS notebooks to use some open dataset available in gensim-data?

@menshikh-iv (Contributor) commented Jan 13, 2018:

For this case, I see two possible solutions:

  • Leave it as is (the Yelp dataset is really good for this task, but we can't store it in gensim-data), and probably add a link where the user can download the dataset.
  • Choose another dataset (one that demonstrates the feature nicely and whose license permits adding it to gensim-data). @Witiko, do you have good candidates?

@Witiko (Author): I use the SemEval 2015–2017 Task 3 Subtask B question answering datasets in my thesis. These are also a good fit for the semantic similarity task and are permissively licensed, so it should be possible to add them to gensim-data.

@menshikh-iv (Contributor): Nice. Please replace the Yelp dataset with SemEval and create an issue in gensim-data with the needed info (guide: https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model). Adding other SemEval "tasks" would be really good too.

@Witiko (Author) commented Jan 29, 2018:

I am running the evaluations. If all looks fine, I will submit an issue to gensim-data. Until then, please consider the current state of the Jupyter notebook temporary.

if w1_index != w2_index and dictionary[w2_index] in self.vocab)
else: # Traverse only columns corresponding to the embeddings closest to w1.
num_nonzero = similarity_matrix[w1_index].getnnz() - 1
columns = ((dictionary.token2id[w2], similarity)
@menshikh-iv (Contributor): Please use hanging indents (instead of vertical).

@Witiko (Author): Fixed in e1eb7cd and c7f0ce1.

# Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
if similarity > threshold and similarity_matrix[w2_index].getnnz() <= nonzero_limit:
element = similarity**exponent
similarity_matrix[w1_index, w2_index] = element
@menshikh-iv (Contributor): similarity_matrix is symmetric; maybe it would be better to store only "half" of the matrix and reduce memory usage twofold?

@Witiko (Author) commented Jan 12, 2018:

I agree that such a saving would be nice, but there seems to be no support in SciPy for taking the dot product with a symmetric matrix that stores only its upper/lower triangle ((vec1.T).dot(similarity_matrix).dot(vec2)[0, 0]). Even beyond SciPy, I don't know of a sparse matrix format that allows efficient access both row-wise and column-wise.

@menshikh-iv (Contributor): Sad but true; thanks for the clarification.

@Witiko (Author): Note that this knowledge can still be useful when storing and transmitting the similarity matrix.
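For illustration, a minimal sketch of that idea, assuming a symmetric sparse similarity matrix S_full (the file name and variable names are hypothetical): store only the upper triangle on disk and rebuild the full matrix before querying.

```python
from scipy import sparse

# Keep only the upper triangle (including the diagonal), roughly halving storage.
upper = sparse.triu(S_full, format='csc')
sparse.save_npz('similarity_matrix_upper.npz', upper)

# Rebuild the full symmetric matrix at load time, before any SCM queries.
upper = sparse.load_npz('similarity_matrix_upper.npz')
S_full = upper + upper.T - sparse.diags(upper.diagonal())
```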

if w1 not in self.vocab:
continue # A word from the dictionary not present in the word2vec model.
# Traverse upper triangle columns.
if len(dictionary) <= nonzero_limit + 1: # Traverse all columns.
@menshikh-iv (Contributor): if num_rows instead of len(dictionary)?

@Witiko (Author): Fixed in effef71.

@@ -559,6 +560,90 @@ def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
"""
return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

def similarity_matrix(self, corpus, dictionary, threshold=0.0, exponent=2.0,
@menshikh-iv (Contributor): corpus isn't used

@Witiko (Author): Fixed in 08dea4e.

index = self.cls(texts, self.w2v_model)
else:
index = self.cls(corpus, num_features=len(dictionary))
index = self.factoryMethod()
@menshikh-iv (Contributor): Are you sure that this is equivalent?

@Witiko (Author) commented Jan 12, 2018:

Yes, this exact if-then-else statement currently appears at six places in test_similarities.py at gensim/develop. Rather than add two new lines for SoftCosineSimilarity at each of the six locations, I decided to refactor the if-then-else statement into factory methods.
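A hypothetical sketch of that refactoring (the fixture names self.cls, texts, corpus, dictionary, self.w2v_model, and self.similarity_matrix are illustrative, not the exact test code): each test case supplies a factory method that builds the right index type, and the six call sites collapse to one line each.

```python
def factoryMethod(self):
    """Build an index of the class under test from shared test fixtures."""
    if self.cls == similarities.WmdSimilarity:
        return self.cls(texts, self.w2v_model)
    elif self.cls == similarities.SoftCosineSimilarity:
        return self.cls(corpus, self.similarity_matrix)
    return self.cls(corpus, num_features=len(dictionary))

# Each former if-then-else call site becomes:
index = self.factoryMethod()
```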

@Witiko (Author) commented Jan 12, 2018: Should I turn this refactoring step into a separate pull request?

@menshikh-iv (Contributor): No (because this is a really small change).

@Witiko (Author): OK. I was more concerned that this pull request may take a little longer to finish up, whereas this single refactoring is independent and could be incorporated quickly.

@menshikh-iv (Contributor): @Witiko, also please merge develop from upstream into your branch.

@Witiko (Author) commented Jan 12, 2018:

I will look into these; thank you for taking the time to do the review.

@menshikh-iv (Contributor): Ping @Witiko, how is it going?

@Witiko (Author) commented Jan 23, 2018:

I am planning to push new commits this weekend.

@Witiko changed the title from "Implement Soft Cosine Similarity" to "Implement Soft Cosine Measure" on Jan 28, 2018
@Witiko (Author) commented Feb 5, 2018:

@evanmiltenburg That should be interesting; I will see if I can run a comparison by the end of the week.

@thomasniebler commented:

@evanmiltenburg @Witiko I already worked on that :) Long story short: the soft cosine with suitable weights can improve semantic word similarity (and, I'm sure, also semantic textual similarity). I learned the weights as a positive-definite matrix using a metric learning approach (the soft cosine is actually the same thing) on evaluation datasets like WordSimilarity-353 and MEN. If you are interested in the code, see https://thomasniebler.github.io/semantics-metriclearning/. The corresponding paper is at https://arxiv.org/abs/1705.07425 and was published as a poster at ISWC 2017: https://www.thomas-niebler.de/pub/niebler2017rrl.pdf. If you have any questions about it, let me know.

@evanmiltenburg commented:

@thomasniebler Note that MEN has been criticized for not being a good measure of similarity; rather, it measures relatedness between terms. See the SimLex-999 dataset for an alternative.

@thomasniebler commented:

@evanmiltenburg You're correct. Indeed, improving relatedness was my first goal with this measure. However, since we're talking about skewing the vector space, learning on SimLex data should allow us to learn similarity as well. Maybe not similarity scores as strict as those of other metric-learning-based algorithms (retrofitting, counterfitting, Paragram embeddings), as those do not consider intensity scores, but still. Overall, you have a good point; I will consider this in future work.

Just as a side note: if needed, try an algorithm from the https://github.com/metric-learn/ repositories. There should be plenty of metric learning algorithms there that give you a matrix for the soft cosine, learned on similarity/dissimilarity constraints. Technically, even the retrofitting/counterfitting/Paragram algorithms are nothing other than metric learning algorithms applied to semantic tasks, as they learn a (nonlinear) transformation of the embedding vector space.

@Naruto-Sasuke commented Feb 7, 2018:

I wonder whether it can be used for image patch similarity, for example by replacing an L2 loss with this.

@evanmiltenburg commented:

Well, the issue for me is not how to learn a model that can predict good similarity scores. The issue is: when people try the soft cosine for themselves, without any transformation of the vector space, what will the model be good at? It's good to know this in advance, and it's important to be precise in your claims about it. (We don't want to give users the wrong impression and then disappoint them.)

The STS task is a decent intrinsic test of whether the model can predict that two sentences are similar. It's a nice addition to the (slightly more extrinsic) Community Question Answering task on which the author of this pull request already tested the Soft Cosine Similarity.

@thomasniebler commented:

Well, technically, the soft cosine measure is a vector transformation, as it makes use of a dot product that can easily be parameterized by a quadratic matrix, as is done e.g. in the Mahalanobis distance. By using a parameterized similarity or distance measure, you are already transforming the vector space, since metric spaces are defined by their metrics. So basically, you rewrite the soft cosine as <a, Sb> / (sqrt(<a, Sa>) * sqrt(<b, Sb>)), and you have thereby transformed the vector space. If you don't want to transform your vector space, then you shouldn't use the soft cosine. But I admit that there is a difference between learning S and using a simple heuristic.
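For concreteness, a minimal sketch of that rewriting for dense vectors a and b and a positive-definite matrix S (the function name is assumed, not part of any library):

```python
import numpy as np

def parameterized_cosine(a, b, S):
    """Soft cosine: <a, Sb> / (sqrt(<a, Sa>) * sqrt(<b, Sb>))."""
    Sa, Sb = S.dot(a), S.dot(b)
    return a.dot(Sb) / (np.sqrt(a.dot(Sa)) * np.sqrt(b.dot(Sb)))

# With S the identity matrix, this reduces to the ordinary cosine similarity.
```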

@menshikh-iv (Contributor) commented:

@Witiko great work 👍

@menshikh-iv merged commit 43a33c7 into piskvorky:develop on Feb 8, 2018
sj29-innovate pushed a commit to sj29-innovate/gensim referencing this pull request on Feb 21, 2018:
* Implement Soft Cosine Similarity

* Added numpy-style documentation for Soft Cosine Similarity

* Added unit tests for Soft Cosine Similarity

* Make WmdSimilarity and SoftCosineSimilarity handle empty queries

* Rename Soft Cosine Similarity to Soft Cosine Measure

* Add links to Soft Cosine Measure papers

* Remove unused variables and parameters for Soft Cosine Measure

* Replace explicit timers with magic %time in Soft Cosine Measure notebook

* Rename var in term similarity matrix construction to reflect symmetry

* Update SoftCosineSimilarity class example to define all variables

* Make the code in Soft Cosine Measure notebook more compact

* Use hanging indents in EuclideanKeyedVectors.similarity_matrix

* Simplified expressions in WmdSimilarity and SoftCosineSimilarity

* Extract the sparse2coo function to the global scope

* Fix __str__ of SoftCosineSimilarity

* Use hanging indents in SoftCossim.__init__

* Fix formatting of the matutils module

* Make similarity matrix info messages appear at fixed frequency

* Construct term similarity matrix rows for important terms first

* Optimize softcossim for an estimated 100-fold constant speed increase

* Remove unused import in gensim.similarities.docsim

* Fix imports in gensim.models.keyedvectors

* replace reference to anonymous link

* Update "See Also" references to new *2vec implementation

* Fix formatting error in gensim.models.keyedvectors

* Update Soft Cosine Measure tutorial notebook

* Update Soft Cosine Measure tutorial notebook

* Use smaller glove-wiki-gigaword-50 model in Soft Cosine Measure notebook

* Use gensim-data to load SemEval datasets in Soft Cosine Measure notebook

* Use backwards-compatible syntax in Soft Cosine Similarity notebook

* Remove unnecessary package requirements in Soft Cosine Measure notebook

* Fix Soft Cosine Measure notebook to use true gensim-data dataset names

* fix docs[1]

* fix docs[2]

* fix docs[3]

* small fixes

* small fixes[2]