Materialize and copy the corpus passed to SoftCosineSimilarity #3128
SoftCosineSimilarity currently does not perform any indexing and only keeps a reference to the corpus passed to the constructor. This can cause confusion if the corpus is lazy, i.e. its documents are only produced when the corpus is actually read. In general, subclasses of the SimilarityABC interface are expected to materialize and copy (i.e. index) the corpus rather than just reference it.
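To illustrate the hazard, here is a minimal sketch in plain Python. It does not use gensim; `lazy_corpus` and `ReferencingIndex` are hypothetical names invented for this example, and the index only mimics the "keep a reference" behavior described above.

```python
def lazy_corpus(texts, token2id):
    """Yield bag-of-words documents one at a time (a one-shot generator)."""
    for text in texts:
        counts = {}
        for token in text.split():
            idx = token2id[token]
            counts[idx] = counts.get(idx, 0) + 1
        yield sorted(counts.items())


class ReferencingIndex:
    """Hypothetical index that only stores a reference to the corpus."""

    def __init__(self, corpus):
        self.corpus = corpus  # no copy, no materialization

    def documents(self):
        # Every call re-reads the referenced corpus.
        return [doc for doc in self.corpus]


token2id = {"soft": 0, "cosine": 1, "similarity": 2}
texts = ["soft cosine", "cosine similarity similarity"]

index = ReferencingIndex(lazy_corpus(texts, token2id))
first_pass = index.documents()   # consumes the generator
second_pass = index.documents()  # the generator is already exhausted
```

The second pass comes back empty because the generator was consumed by the first one. An index that only references its corpus inherits whatever laziness the corpus has.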
Some classes, such as WmdSimilarity, do not conform to this expectation. However, these classes also don't work with traditional BoW corpora and should be considered novelties/outliers. SoftCosineSimilarity, on the other hand, is very similar to the core SparseMatrixSimilarity class, and should satisfy the same invariants even if they are generally unspoken.
This pull request makes SoftCosineSimilarity materialize and copy the corpus passed to the constructor using the `list()` built-in. This is a minimal change with no change of interface and little chance of regressions. A more thorough solution would be to index the corpus to a sparse CSC matrix. However, this would likely break the interface, which may be undesirable so soon after 4.0.0.
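The effect of materializing with `list()` can be sketched in plain Python as follows. `MaterializingIndex` is a hypothetical stand-in for illustration, not the actual SoftCosineSimilarity code.

```python
class MaterializingIndex:
    """Hypothetical index that copies the corpus at construction time."""

    def __init__(self, corpus):
        # list() reads the corpus once and stores the documents,
        # so later reads no longer depend on the original iterable.
        self.corpus = list(corpus)

    def documents(self):
        return [doc for doc in self.corpus]


source = [[(0, 1), (1, 1)], [(1, 1), (2, 2)]]

# A one-shot generator is read exactly once, inside the constructor.
lazy_index = MaterializingIndex(doc for doc in source)
first_pass = lazy_index.documents()
second_pass = lazy_index.documents()

# Growing the caller's list afterwards does not change the index.
eager_index = MaterializingIndex(source)
source.append([(0, 3)])
indexed_count = len(eager_index.corpus)

# The copy is shallow: the document objects themselves are shared.
shares_documents = eager_index.corpus[0] is source[0]
```

Note that `list()` makes a shallow copy: the container is new, but the individual documents are the same objects. Mutating a document in place after construction would still be visible through the index, which is one reason converting to a sparse matrix would be the more thorough option.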