Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a maintain_sparsity argument to SparseMatrixSimilarity. #590

Merged
merged 4 commits into from
Jun 9, 2016

Conversation

davechallis
Copy link
Contributor

Adds an option to SparseMatrixSimilarity to allow it to return a sparse matrix for similarity queries, instead of always converting to dense (as discussed in #289).

Default behaviour hasn't changed though, so shouldn't change behaviour of any existing code.

When set to True, causes the `get_similarities` method to return the
sparse matrix used internally by this object, instead of converting to a
dense matrix before returning.
@@ -653,6 +658,9 @@ def get_similarities(self, query):
if result.shape[1] == 1 and not is_corpus:
# for queries of one document, return a 1d array
result = result.toarray().flatten()
elif self.maintain_sparsity:
# avoid converting to dense array if maintaining sparsity
result = result.T
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line above, result.toarray().flatten(), will still densify the output. So you'll get sometimes dense, sometimes sparse, depending which if branch is hit.

I don't know your use case exactly, but isn't it better to do this maintain_sparsity check as the first thing, so that the output is always sparse (or always dense)? The API seems cleaner that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point, I'll change that. I figured that a dense array was probably a better choice for queries of a single document, but probably more important to keep the API cleaner and more predictable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just started having a look at this, but not sure it's possible actually - I think scipy's sparse matrices have to be 2 dimensional, so I can transform an (N, 1) sparse matrix to a (1, N) one, but can't convert it to an (N,) shaped one as the dense branch of code does.

Any preferences on behaviour for this?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No preferences -- all code up till now uses dense outputs. You're the first "sparse output" user, so the decision on how to treat single-vector inputs is yours! You know the use-case best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having tried a few options, I think I'm happy with the pull request as it is. It's already a fairly specialised use case, so I think it's safe to leave it as it is.

@davechallis
Copy link
Contributor Author

Ah, just noticed the failing test on this - I think it's unrelated to anything changed in this PR, is it possible to rerun them?

@tmylk
Copy link
Contributor

tmylk commented Mar 10, 2016

Please ignore appveyor tests.
The travis tests are passing so that pr is ok.

@tmylk
Copy link
Contributor

tmylk commented Mar 10, 2016

@davechallis Please add a test for the new functionality and we would be good to merge

@davechallis
Copy link
Contributor Author

@tmylk Thanks for the reminder, will get on that tomorrow :)

@tmylk
Copy link
Contributor

tmylk commented Jun 9, 2016

@davechallis Thanks for the PR!
Nice meeting you at PyData London too!

@tmylk tmylk merged commit c7e4c57 into piskvorky:develop Jun 9, 2016
@piskvorky
Copy link
Owner

@tmylk @davechallis thanks! But deserves a CHANGELOG entry too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants