-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added a maintain_sparsity
argument to SparseMatrixSimilarity.
#590
Conversation
When set to True, causes the `get_similarities` method to return the sparse matrix used internally by this object, instead of converting to a dense matrix before returning.
@@ -653,6 +658,9 @@ def get_similarities(self, query): | |||
if result.shape[1] == 1 and not is_corpus: | |||
# for queries of one document, return a 1d array | |||
result = result.toarray().flatten() | |||
elif self.maintain_sparsity: | |||
# avoid converting to dense array if maintaining sparsity | |||
result = result.T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line above, result.toarray().flatten()
, will still densify the output. So you'll get sometimes dense, sometimes sparse, depending which if
branch is hit.
I don't know your use case exactly, but isn't it better to do this maintain_sparsity
check as the first thing, so that the output is always sparse (or always dense)? The API seems cleaner that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point, I'll change that. I figured that a dense array was probably a better choice for queries of a single document, but probably more important to keep the API cleaner and more predictable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just started having a look at this, but not sure it's possible actually - I think scipy's sparse matrices have to be 2 dimensional, so I can transform an (N, 1) sparse matrix to a (1, N) one, but can't convert it to an (N,) shaped one as the dense branch of code does.
Any preferences on behaviour for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No preferences -- all code up till now uses dense outputs. You're the first "sparse output" user, so the decision on how to treat single-vector inputs is yours! You know the use-case best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having tried a few options, I think I'm happy with the pull request as it is. It's already a fairly specialised use case, so I think it's safe to leave it as it is.
Ah, just noticed the failing test on this - I think it's unrelated to anything changed in this PR, is it possible to rerun them? |
Please ignore appveyor tests. |
@davechallis Please add a test for the new functionality and we would be good to merge |
@tmylk Thanks for the reminder, will get on that tomorrow :) |
@davechallis Thanks for the PR! |
@tmylk @davechallis thanks! But deserves a CHANGELOG entry too. |
Adds an option to SparseMatrixSimilarity to allow it to return a sparse matrix for similarity queries, instead of always converting to dense (as discussed in #289).
Default behaviour hasn't changed though, so shouldn't change behaviour of any existing code.