Added a `maintain_sparsity` argument to SparseMatrixSimilarity. #590

davechallis · 2016-01-25T15:45:37Z

Adds an option to SparseMatrixSimilarity to allow it to return a sparse matrix for similarity queries, instead of always converting to dense (as discussed in #289).

The default behaviour is unchanged, so existing code should not be affected.

When set to True, causes the `get_similarities` method to return the sparse matrix used internally by this object, instead of converting to a dense matrix before returning.
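For illustration, here is a minimal sketch of the behaviour described above. It uses plain scipy rather than gensim's own code: the toy matrices and the `similarities` helper are hypothetical, and only the `maintain_sparsity` name comes from this PR.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy index: 3 documents x 4 features, rows already unit length.
index = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
]))

# Two query documents, stored as sparse rows as well.
queries = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
]))


def similarities(index, queries, maintain_sparsity=False):
    """Illustrative stand-in for a similarity query (not gensim's code)."""
    # Sparse product: shape (n_index_docs, n_queries).
    result = index @ queries.T
    if maintain_sparsity:
        # Stay sparse; transpose so there is one row per query.
        return result.T
    # Default path: densify, then transpose.
    return result.toarray().T


dense_out = similarities(index, queries)
sparse_out = similarities(index, queries, maintain_sparsity=True)

print(type(dense_out).__name__, dense_out.shape)
print(sparse.issparse(sparse_out), sparse_out.shape, sparse_out.nnz)
```

The sparse result stores only the non-zero similarities, which is what makes the option attractive when most document pairs share no terms.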

piskvorky · 2016-01-26T00:24:39Z

gensim/similarities/docsim.py

@@ -653,6 +658,9 @@ def get_similarities(self, query):
        if result.shape[1] == 1 and not is_corpus:
            # for queries of one document, return a 1d array
            result = result.toarray().flatten()
+        elif self.maintain_sparsity:
+            # avoid converting to dense array if maintaining sparsity
+            result = result.T


The line above, `result.toarray().flatten()`, will still densify the output. So you'll sometimes get dense and sometimes sparse output, depending on which `if` branch is hit.

I don't know your use case exactly, but isn't it better to do this `maintain_sparsity` check first, so that the output is always sparse (or always dense)? The API seems cleaner that way.
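The two orderings under discussion can be sketched side by side. This is an illustration only: the function names are hypothetical, and the final dense fallback is an assumption, since the hunk above does not show that branch.

```python
import numpy as np
from scipy import sparse


def branch_order_in_hunk(result, is_corpus, maintain_sparsity):
    """Mirrors the order of checks shown in the hunk."""
    if result.shape[1] == 1 and not is_corpus:
        # Single-document query: always a dense 1d array.
        return result.toarray().flatten()
    elif maintain_sparsity:
        return result.T
    # Assumed dense fallback (not part of the hunk shown above).
    return result.toarray().T


def sparsity_checked_first(result, is_corpus, maintain_sparsity):
    """The reviewer's suggestion: decide sparse vs. dense before anything else."""
    if maintain_sparsity:
        return result.T
    if result.shape[1] == 1 and not is_corpus:
        return result.toarray().flatten()
    return result.toarray().T


# (N, 1) result for a single query document, (N, 2) for a two-document corpus.
single = sparse.csr_matrix(np.array([[0.0], [0.5], [1.0]]))
multi = sparse.csr_matrix(np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.0]]))

mixed_single = branch_order_in_hunk(single, is_corpus=False, maintain_sparsity=True)
mixed_multi = branch_order_in_hunk(multi, is_corpus=True, maintain_sparsity=True)
always_sparse = sparsity_checked_first(single, is_corpus=False, maintain_sparsity=True)

print(type(mixed_single).__name__, mixed_single.shape)   # dense despite the flag
print(sparse.issparse(mixed_multi), mixed_multi.shape)
print(sparse.issparse(always_sparse), always_sparse.shape)
```

With the order from the hunk, the return type depends on the query; checking the flag first keeps it uniform, at the cost of returning a (1, N) matrix rather than a flat array for single documents.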

davechallis · 2016-01-26

Hmm, good point, I'll change that. I figured that a dense array was probably a better choice for queries of a single document, but it's probably more important to keep the API clean and predictable.

davechallis · 2016-01-28

Just started having a look at this, but I'm not sure it's actually possible. I think scipy's sparse matrices have to be 2-dimensional, so I can transform an (N, 1) sparse matrix into a (1, N) one, but can't convert it to an (N,)-shaped one as the dense branch of the code does.

Any preferences on behaviour for this?
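The shape constraint mentioned here can be seen in a short sketch. It assumes scipy's `csr_matrix` class, and the values are made up for illustration.

```python
import numpy as np
from scipy import sparse

# An (N, 1) sparse column, like the result for a single query document.
col = sparse.csr_matrix(np.array([[0.0], [0.5], [0.0]]))

# Transposing keeps it two-dimensional: (N, 1) becomes (1, N).
row = col.T

# Only the dense route produces a truly one-dimensional (N,) array.
flat = col.toarray().flatten()

print(col.shape, row.shape, flat.shape)
```

So a sparse return value for a single document would have to be a (1, N) matrix, unlike the flat array the dense branch returns.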

piskvorky · 2016-01-29

No preferences -- all code up till now uses dense outputs. You're the first "sparse output" user, so the decision on how to treat single-vector inputs is yours! You know the use case best.

davechallis · 2016-02-01

Having tried a few options, I think I'm happy with the pull request as it is. It's already a fairly specialised use case, so I think it's safe to leave it unchanged.

davechallis · 2016-03-10T10:50:00Z

Ah, just noticed the failing test on this. I think it's unrelated to anything changed in this PR; is it possible to rerun the tests?

tmylk · 2016-03-10T14:25:42Z

Please ignore the AppVeyor tests.
The Travis tests are passing, so the PR is OK.

tmylk · 2016-03-10T14:26:36Z

@davechallis Please add a test for the new functionality and we'll be good to merge.

davechallis · 2016-03-10T15:49:46Z

@tmylk Thanks for the reminder, will get on that tomorrow :)

tmylk · 2016-06-09T16:24:08Z

@davechallis Thanks for the PR!
Nice meeting you at PyData London too!

piskvorky · 2016-06-10T01:43:47Z

@tmylk @davechallis Thanks! But this deserves a CHANGELOG entry too.

Added a maintain_sparsity argument to SparseMatrixSimilarity.

97fada5

When set to True, causes the `get_similarities` method to return the sparse matrix used internally by this object, instead of converting to a dense matrix before returning.

piskvorky reviewed Jan 26, 2016

piskvorky assigned tmylk Mar 10, 2016

davechallis added 3 commits March 11, 2016 09:31

Merge remote-tracking branch 'upstream/develop' into develop

05a34d1

Added unit test for SparseMatrixSimilarity with maintain_sparsity set.

3203e21

Merge remote-tracking branch 'upstream/develop' into develop

fe1d443

tmylk merged commit c7e4c57 into piskvorky:develop Jun 9, 2016

menshikh-iv mentioned this pull request Oct 3, 2017

Wishlist item: maintaining sparse matrices when using SparseMatrixSimilarity #289

Closed
