Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Materialize and copy the corpus passed to SoftCosineSimilarity #3128

Merged

Conversation

Witiko
Copy link
Contributor

@Witiko Witiko commented Apr 28, 2021

SoftCosineSimilarity currently does not perform any indexing and keeps a reference to the corpus passed to the constructor. This can cause confusion if a corpus is lazy and requests for documents are only resolved when reading the corpus. In general, subclasses of the SimilarityABC interface are expected to materialize and copy (i.e. index) the corpus rather than just reference it.

Some classes, such as WmdSimilarity, do not conform to this expectation. However, these classes also don't work with traditional BoW corpora and should be considered novelties/outliers. SoftCosineSimilarity, on the other hand, is very similar to the core SparseMatrixSimilarity class, and should satisfy the same invariants even if they are generally unspoken.

This pull request makes SoftCosineSimilarity materialize and copy the corpus passed to the constructor using the list() built-in. This is a minimal change with no change of interface and little chance of regressions. A more thorough solution would be to index the corpus to a sparse CSC matrix. However, this would likely break the interface, which may be undesirable so soon after 4.0.0.

@piskvorky piskvorky requested a review from mpenkov April 29, 2021 05:59
Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you @Witiko !

@mpenkov mpenkov merged commit 1be4674 into piskvorky:develop May 5, 2021
@sug4ndh
Copy link

sug4ndh commented May 20, 2021

Hi,

I have an update to a problem that I reported here https://groups.google.com/g/gensim/c/3ksqzgCTWxI. Not sure if this affects this PR merge. I noticed that the problem reported in the google group thread does not occur if I generate corpus and dictionary afresh before I train word2vec or fasttext models and then subsequently generate SoftCosineSimilarity index. However, if used a saved corpus and dictionary to train fasttext or word2vec model and then generate SoftCosineSimilarity index, I always get the problem that "No such file or directory: 'corpus.mm' if corpus.mm is not present in dir from where I am executing the program. If I copy corpus.mm to the dir from where I am executing the program then no error appears.

@Witiko
Copy link
Contributor Author

Witiko commented May 20, 2021

@sug4ndh Hello, this PR should solve your problem. As a temporary workaround with your release of Gensim, you can create the SoftCosineSimilarity index with list(corpus) instead of corpus. Then, you should be able to also save and subsequently reload your SoftCosineSimilarity index. Hope this helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants