Implement Soft Cosine Measure #1827
Conversation
Force-pushed from defaf4d to 2b4b47c
I added numpy-style documentation and unit tests. Hopefully, the code should be good to go now.
Great work @Witiko, in general, looks nice!
"outputs": [], | ||
"source": [ | ||
"from time import time\n", | ||
"start_nb = time()\n", |
unused var
Fixed in 08dea4e.
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sentence_obama = 'Obama speaks to the media in Illinois'\n", |
Change
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_obama = sentence_obama.lower().split()
to
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
Same for the other sentences.
Fixed in 621ed0d.
}
],
"source": [
"start = time()\n",
To track the time of a cell, it is better to use the "magic" %%time:
%%time
.. <SOME CODE> ..
instead of
start = time()
.. <SOME CODE> ..
print('Cell took %.2f seconds to run.' % (time() - start))
here and everywhere.
Fixed in 8af5f67.
"start = time()\n", | ||
"import os\n", | ||
"\n", | ||
"from gensim.models import KeyedVectors\n", |
Better to use the gensim-data functionality instead of this part:
import gensim.downloader as api
model = api.load("word2vec-google-news-300")
"w2v_corpus = [] # Documents to train word2vec on (all 6 restaurants).\n", | ||
"scs_corpus = [] # Documents to run queries against (only one restaurant).\n", | ||
"documents = [] # scs_corpus, with no pre-processing (so we can see the original documents).\n", | ||
"with open('/data/review.json') as data_file:\n", |
It would be great if we added this dataset to gensim-data and used it here. Can you investigate, @Witiko, whether this is possible (does the Yelp license allow us to share it or not)?
I don't think we can; the license seems to explicitly forbid any distribution of the dataset (see section C). The dataset is also used in the Word Mover's Distance notebook, but judging from the description, it contained different data back when that notebook was created. This would indicate that the dataset keeps changing over time.
Should I update both the WMD and SCS notebooks to use some open dataset available in gensim-data?
For this case, I see two possible solutions:
- Leave it as is (the Yelp dataset is really good for this task, but we can't store it in gensim-data) and probably add a link where the user can download the dataset.
- Choose another dataset (one that is really nice for demonstration and whose license makes it possible to add it to gensim-data). @Witiko, do you have good candidates for it?
I use the SemEval 2015–2017 Task 3 Subtask B question answering datasets in my thesis. These are also a good fit for the semantic similarity task and are permissively licensed, so it should be possible to add them to gensim-data.
Nice. Please replace the Yelp dataset with SemEval and create an issue in gensim-data with the needed info (guide: https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model). Adding other "tasks" from SemEval would probably be really good too.
I am running the evaluations. If all looks fine, I will submit an issue to gensim-data. Until then, please consider the current state of the Jupyter notebook temporary.
gensim/models/keyedvectors.py
Outdated
if w1_index != w2_index and dictionary[w2_index] in self.vocab)
else:  # Traverse only columns corresponding to the embeddings closest to w1.
num_nonzero = similarity_matrix[w1_index].getnnz() - 1
columns = ((dictionary.token2id[w2], similarity)
Please use hanging indents (instead of vertical).
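For illustration, the difference between the two styles on a toy generator expression (hypothetical names, not the actual code under review):
token2id = {'bank': 0, 'river': 1}
most_similar = [('bank', 0.9), ('river', 0.4)]
# Vertical indentation aligns continuation lines with the opening bracket:
columns = ((token2id[w2], similarity)
           for w2, similarity in most_similar)
# Hanging indentation breaks after the opening bracket and uses a fixed indent:
columns = (
    (token2id[w2], similarity)
    for w2, similarity in most_similar)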
gensim/models/keyedvectors.py
Outdated
# Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
if similarity > threshold and similarity_matrix[w2_index].getnnz() <= nonzero_limit:
element = similarity**exponent
similarity_matrix[w1_index, w2_index] = element
similarity_matrix is symmetric; maybe it would be better to store only "half" of this matrix and thereby halve the memory usage?
I agree that such a saving would be nice, but there seems to be no support in SciPy for taking a dot product with a symmetric matrix that stores only its upper or lower triangle ((vec1.T).dot(similarity_matrix).dot(vec2)[0, 0]). Even beyond SciPy, I don't know of a sparse matrix format that would allow efficient access both row-wise and column-wise.
Sad but true, thanks for the clarification.
Note that this knowledge can still be useful when storing and transmitting the similarity matrix.
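For example, a minimal sketch of exploiting the symmetry for storage (the file name is made up; this is not code from the PR):
import scipy.sparse

similarity_matrix = scipy.sparse.csc_matrix(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 1.0]])

# Store only the upper triangle, roughly halving the stored nonzeros.
scipy.sparse.save_npz('matrix_upper.npz', scipy.sparse.triu(similarity_matrix, format='csc'))

# Reconstruct the full symmetric matrix after loading, before taking dot products.
upper = scipy.sparse.load_npz('matrix_upper.npz')
full = upper + upper.T - scipy.sparse.diags(upper.diagonal())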
gensim/models/keyedvectors.py
Outdated
if w1 not in self.vocab:
continue  # A word from the dictionary not present in the word2vec model.
# Traverse upper triangle columns.
if len(dictionary) <= nonzero_limit + 1:  # Traverse all columns.
Use num_rows instead of len(dictionary)?
Fixed in effef71.
gensim/models/keyedvectors.py
Outdated
@@ -559,6 +560,90 @@ def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
"""
return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

def similarity_matrix(self, corpus, dictionary, threshold=0.0, exponent=2.0,
corpus isn't used.
Fixed in 08dea4e.
index = self.cls(texts, self.w2v_model)
else:
index = self.cls(corpus, num_features=len(dictionary))
index = self.factoryMethod()
Are you sure that this is equivalent?
Yes, this exact if-then-else statement currently appears at six places in test_similarities.py at gensim/develop. Rather than add two new lines for SoftCosineSimilarity at each of the six locations, I decided to refactor the if-then-else statement into factory methods.
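A sketch of the kind of factory-method refactoring described, with hypothetical fixture names (the real test code differs):
from gensim import similarities

class SimilarityIndexTestMixin:
    # Subclasses set `cls` and the fixtures; `factoryMethod` replaces the
    # if-then-else that was previously repeated at each call site.
    def factoryMethod(self):
        if self.cls == similarities.WmdSimilarity:
            return self.cls(self.texts, self.w2v_model)
        return self.cls(self.corpus, num_features=self.num_features)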
Should I turn this refactoring step into a separate pull request?
No (because this is a really small change).
Ok, I was more concerned with the fact that this pull request may take a little longer to finish up, whereas this single refactoring is independent and could be incorporated fast.
@Witiko also please merge
I will look into these, thank you for taking the time to do the review.
Ping @Witiko, how is it going?
I am planning to push new commits this weekend.
@evanmiltenburg That should be interesting; I will see if I can run a comparison by the end of the week.
@evanmiltenburg @Witiko I already worked on that :) Long story short: The soft cosine with respective weights can improve semantic word similarity (and I'm sure, also semantic textual similarity).
@thomasniebler: Note that MEN has been criticized for not being a good measure of similarity. Rather, it measures relatedness between terms. See the SimLex-999 dataset for an alternative.
@evanmiltenburg You're correct. Indeed, improving relatedness was my first goal with this measure. However, since we're talking about skewing the vector space, learning on SimLex data should allow us to learn similarity. Maybe not as strict similarity scores as with other metric learning based algorithms (Retrofitting, Counterfitting, Paragram Embeddings), as those do not consider intensity scores, but still. Overall, you have a good point. I will consider this in future work.
I wonder if it can be used for image patch similarity. For example, replacing the L2 loss with this.
Well, the issue for me is not how to learn a model that can predict good similarity measures. The issue is that when people try soft cosine for themselves, without any transformations of the vector space, what will the model be good at? It's good to know this in advance, and it's important to be precise in your claims about this. (Don't want to give users the wrong impression and then disappoint them.) The STS task is a decent intrinsic test to see whether the model can predict whether two sentences are similar. It's a nice addition to the (slightly more extrinsic) Community Question Answering task that the author of this issue already tested Soft Cosine Similarity on.
Well, technically, the soft cosine measure is a vector transformation, as it makes use of the dot product, which can be easily parameterized by a quadratic matrix, as is done e.g. in the Mahalanobis distance. By using a parameterized similarity or distance measure, you are already transforming the vector space, as metric spaces are defined by their metrics.
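For concreteness, a tiny sketch of that parameterized dot product (toy numbers and my own variable names):
import numpy as np

S = np.array([[1.0, 0.5],
              [0.5, 1.0]])  # parameterizing (term similarity) matrix
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

def soft_cosine(x, y, S):
    # Cosine similarity under the inner product <x, y> = x^T * S * y.
    return (x @ S @ y) / (np.sqrt(x @ S @ x) * np.sqrt(y @ S @ y))

print(soft_cosine(x, y, S))  # 0.5, whereas the plain cosine of x and y is 0.0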
# Conflicts:
#	docs/notebooks/soft_cosine_tutorial.ipynb
@Witiko great work 👍
* Implement Soft Cosine Similarity
* Added numpy-style documentation for Soft Cosine Similarity
* Added unit tests for Soft Cosine Similarity
* Make WmdSimilarity and SoftCosineSimilarity handle empty queries
* Rename Soft Cosine Similarity to Soft Cosine Measure
* Add links to Soft Cosine Measure papers
* Remove unused variables and parameters for Soft Cosine Measure
* Replace explicit timers with magic %time in Soft Cosine Measure notebook
* Rename var in term similarity matrix construction to reflect symmetry
* Update SoftCosineSimilarity class example to define all variables
* Make the code in Soft Cosine Measure notebook more compact
* Use hanging indents in EuclideanKeyedVectors.similarity_matrix
* Simplified expressions in WmdSimilarity and SoftCosineSimilarity
* Extract the sparse2coo function to the global scope
* Fix __str__ of SoftCosineSimilarity
* Use hanging indents in SoftCossim.__init__
* Fix formatting of the matutils module
* Make similarity matrix info messages appear at fixed frequency
* Construct term similarity matrix rows for important terms first
* Optimize softcossim for an estimated 100-fold constant speed increase
* Remove unused import in gensim.similarities.docsim
* Fix imports in gensim.models.keyedvectors
* replace reference to anonymous link
* Update "See Also" references to new *2vec implementation
* Fix formatting error in gensim.models.keyedvectors
* Update Soft Cosine Measure tutorial notebook
* Update Soft Cosine Measure tutorial notebook
* Use smaller glove-wiki-gigaword-50 model in Soft Cosine Measure notebook
* Use gensim-data to load SemEval datasets in Soft Cosine Measure notebook
* Use backwards-compatible syntax in Soft Cosine Similarity notebook
* Remove unnecessary package requirements in Soft Cosine Measure notebook
* Fix Soft Cosine Measure notebook to use true gensim-data dataset names
* fix docs[1]
* fix docs[2]
* fix docs[3]
* small fixes
* small fixes[2]
Introduction
I implemented the Soft Cosine Measure (SCM) [wiki, 1, 2] as a part of the research for my thesis [3]. Although the original algorithm [1] has a time complexity that is quadratic in the document length, I implemented a linear-time approximate algorithm that I sketch in [3, sec. 4.4]. Since Gensim was such an indispensable asset in my work, I thought I would give back and contribute code. The implementation is showcased in a Jupyter notebook on corpora from the SemEval 2016 and 2017 competitions.
Description
My original implementation closely followed the Gensim implementation of the Word Mover's Distance (WMD), which is split into a gensim.models.keyedvectors.EuclideanKeyedVectors.wmdistance method that takes two token lists and computes the WMD for them, and into the gensim.similarities.WmdSimilarity class that provides batch similarity queries. However, I was not quite happy with this design: for one, putting more functionality into gensim.models.keyedvectors.EuclideanKeyedVectors immediately seemed like a bad idea that would hinder further extensions. For the above reasons, I ultimately decided to split the implementation into a function, a method, and a class as follows:
- The gensim.matutils.softcossim function takes two documents in the bag-of-words representation and a sparse term similarity matrix in the scipy CSC format, and computes the SCM.
- The gensim.models.keyedvectors.EuclideanKeyedVectors.similarity_matrix method takes a corpus of bag-of-words vectors and a dictionary, and produces the sparse term similarity matrix Mrel described by Charlet and Damnati, 2017 [1].
- The gensim.similarities.SoftCosineSimilarity class takes a corpus of bag-of-words vectors and a sparse term similarity matrix in the scipy CSC format, and provides batch SCM queries against the corpus.

The above design achieves a much looser coupling between the individual components and eliminates the original concerns. I demonstrate the implementation in a Jupyter notebook on the corpus of Yelp reviews. The linear-time approximate algorithm for the SCM achieves about the same speed as the linear-time approximate algorithm for the WMD (see the corresponding Jupyter notebook).
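To make the three components concrete, here is a hedged end-to-end sketch using the API names from this pull request and a small gensim-data model (the toy sentences and variable names are mine, and the exact signatures may differ from the final merged code):
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.matutils import softcossim
from gensim.similarities import SoftCosineSimilarity

texts = [
    'obama speaks to the media in illinois'.split(),
    'the president greets the press in chicago'.split()]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Sparse term similarity matrix (scipy CSC) built from word embeddings.
model = api.load('glove-wiki-gigaword-50')
similarity_matrix = model.similarity_matrix(dictionary)

# Pairwise SCM between two bag-of-words documents.
print(softcossim(corpus[0], corpus[1], similarity_matrix))

# Batch SCM queries against the whole corpus.
index = SoftCosineSimilarity(corpus, similarity_matrix)
print(index[dictionary.doc2bow('obama talks to the press'.split())])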
Future work
The gensim.similarities.SoftCosineSimilarity class goes over the entire corpus and computes the SCM between the query and each document separately by calling gensim.matutils.softcossim. If performance is a concern, the SCM can be computed in a single step as q^T * S * C, where q is the normalized query vector, S is the term similarity matrix, C is the normalized term-document matrix of the corpus, and "normalized" in this context means that a vector v is divided by sqrt(v^T * S * v). This is similar to what e.g. the gensim.similarities.MatrixSimilarity.get_similarities method does, only with the basic cosine similarity rather than the SCM.
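A minimal sketch of that single-step batch computation on toy matrices (not code from this PR; S, C, and q as defined above):
import numpy as np
import scipy.sparse

S = scipy.sparse.csc_matrix(np.array([
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.2],
    [0.0, 0.2, 1.0]]))  # term similarity matrix
C = scipy.sparse.csc_matrix(np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0]]))  # term-document matrix, one column per document
q = scipy.sparse.csc_matrix(np.array([[1.0], [0.0], [1.0]]))  # query vector

def normalize(v):
    # Divide v by sqrt(v^T * S * v), the norm induced by S.
    return v / np.sqrt((v.T).dot(S).dot(v)[0, 0])

q = normalize(q)
C = scipy.sparse.hstack([normalize(C[:, j]) for j in range(C.shape[1])], format='csc')
print((q.T).dot(S).dot(C).toarray())  # one SCM score per corpus document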
References