Added a maintain_sparsity argument to SparseMatrixSimilarity. #590

Merged
merged 4 commits into from
Jun 9, 2016
10 changes: 9 additions & 1 deletion gensim/similarities/docsim.py
@@ -562,13 +562,18 @@ class SparseMatrixSimilarity(interfaces.SimilarityABC):
The matrix is internally stored as a `scipy.sparse.csr` matrix. Unless the entire
matrix fits into main memory, use `Similarity` instead.

Takes an optional `maintain_sparsity` argument; setting this to True
causes `get_similarities` to return a sparse matrix instead of a
dense representation if possible.

See also `Similarity` and `MatrixSimilarity` in this module.
"""
def __init__(self, corpus, num_features=None, num_terms=None, num_docs=None, num_nnz=None,
num_best=None, chunksize=500, dtype=numpy.float32):
num_best=None, chunksize=500, dtype=numpy.float32, maintain_sparsity=False):
self.num_best = num_best
self.normalize = True
self.chunksize = chunksize
self.maintain_sparsity = maintain_sparsity

if corpus is not None:
logger.info("creating sparse index")
@@ -633,6 +638,9 @@ def get_similarities(self, query):
if result.shape[1] == 1 and not is_corpus:
# for queries of one document, return a 1d array
result = result.toarray().flatten()
elif self.maintain_sparsity:
# avoid converting to dense array if maintaining sparsity
result = result.T
Owner

The line above, `result.toarray().flatten()`, will still densify the output. So you'll sometimes get dense and sometimes sparse output, depending on which `if` branch is hit.

I don't know your use case exactly, but isn't it better to do this maintain_sparsity check as the first thing, so that the output is always sparse (or always dense)? The API seems cleaner that way.
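
To make the mixed behaviour concrete, here is a minimal sketch of the three branches in this diff, using only NumPy and SciPy. The helper name `shape_result` and the toy matrices are assumptions made for illustration; this is not gensim's actual code path, only a stand-in for the tail of `get_similarities` as shown above.

```python
import numpy as np
from scipy import sparse


def shape_result(result, is_corpus, maintain_sparsity):
    """Hypothetical stand-in for the branching at the end of get_similarities.

    `result` is a sparse (#index x #queries) matrix of similarities.
    """
    if result.shape[1] == 1 and not is_corpus:
        # single-document query: densified to a 1d array regardless of the flag
        return result.toarray().flatten()
    elif maintain_sparsity:
        # several queries: stay sparse, transposed to (#queries x #index)
        return result.T
    else:
        # several queries: dense 2d array (#queries x #index)
        return result.toarray().T


# toy similarities: 3 indexed documents, 1 query and 2 queries respectively
single = sparse.csr_matrix(np.array([[0.5], [0.0], [1.0]]))
batch = sparse.csr_matrix(np.array([[0.5, 0.0], [0.0, 0.0], [1.0, 0.25]]))

single_out = shape_result(single, is_corpus=False, maintain_sparsity=True)
batch_out = shape_result(batch, is_corpus=True, maintain_sparsity=True)

print(sparse.issparse(single_out), single_out.shape)
print(sparse.issparse(batch_out), batch_out.shape)
```

With `maintain_sparsity=True`, the single-document query still comes back as a dense 1d array, while the batch of queries comes back sparse.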

Contributor Author

Hmm, good point, I'll change that. I figured that a dense array was probably a better choice for queries of a single document, but it's probably more important to keep the API cleaner and more predictable.

Contributor Author

Just started having a look at this, but I'm not sure it's actually possible. I think scipy's sparse matrices have to be 2-dimensional, so I can transform an (N, 1) sparse matrix into a (1, N) one, but can't convert it to an (N,)-shaped one as the dense branch of the code does.

Any preferences on behaviour for this?
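
The shape constraint described in this comment can be sketched with a toy column of similarities. The values are made up for illustration, and the sketch only covers the `csr_matrix` class that the index uses internally.

```python
import numpy as np
from scipy import sparse

# a sparse (N, 1) column, as produced for a single-document query (N = 3 here)
column = sparse.csr_matrix(np.array([[0.5], [0.0], [1.0]]))

# transposing keeps the result sparse, but it is still 2-dimensional: (1, N)
row = column.T

# only the dense route reaches a 1-dimensional (N,) array
flat = column.toarray().flatten()

print(column.shape, row.shape, flat.shape)
```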

Owner

No preferences -- all code up till now uses dense outputs. You're the first "sparse output" user, so the decision on how to treat single-vector inputs is yours! You know the use-case best.

Contributor Author

Having tried a few options, I think I'm happy with the pull request as it is. It's already a fairly specialised use case, so I think it's safe to leave the single-document behaviour unchanged.

else:
# otherwise, return a 2d matrix (#queries x #index)
result = result.toarray().T
15 changes: 15 additions & 0 deletions gensim/test/test_similarities.py
@@ -15,6 +15,7 @@
import tempfile

import numpy
import scipy

from gensim.corpora import mmcorpus, Dictionary
from gensim import matutils, utils, similarities
@@ -262,6 +263,20 @@ class TestSparseMatrixSimilarity(unittest.TestCase, _TestSimilarityABC):
def setUp(self):
self.cls = similarities.SparseMatrixSimilarity

def testMaintainSparsity(self):
"""Sparsity is correctly maintained when maintain_sparsity=True"""
num_features = len(dictionary)

index = self.cls(corpus, num_features=num_features)
dense_sims = index[corpus]

index = self.cls(corpus, num_features=num_features, maintain_sparsity=True)
sparse_sims = index[corpus]

self.assertFalse(scipy.sparse.issparse(dense_sims))
self.assertTrue(scipy.sparse.issparse(sparse_sims))
numpy.testing.assert_array_equal(dense_sims, sparse_sims.todense())


class TestSimilarity(unittest.TestCase, _TestSimilarityABC):
def setUp(self):