Adds support for Jaccard bag/multiset semantics #5
This commit adds support for calculating Jaccard similarity using
bag/multiset semantics, as described on pgs. 76-77 in chapter 3 of
Mining of Massive Datasets (MMDS).
MMDS uses the example of movie ratings:
To fit a model using bag semantics, the user:
true. This is an optional variable that is false by default.
where the indices of each SparseVector correspond to the distinct
items in the set, and the values of each correspond to the number of
times each corresponding item is repeated in the set.
Running the model and reading its output work the same way as with
ordinary sets, with one difference: under bag semantics the Jaccard
similarity between two sets has a maximum of 0.5 rather than 1.0,
because the bag union sums the counts from both sets.
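A minimal sketch of the bag computation, assuming the MMDS definition
(the intersection takes the minimum count of each item, the union takes
the sum of all counts). The function name `bag_jaccard` and the use of
plain dicts for the counts are illustrative assumptions, not the code in
this pull request.

```python
def bag_jaccard(counts_a, counts_b):
    """Jaccard similarity of two bags given as {item: count} dicts."""
    keys = set(counts_a) | set(counts_b)
    # Bag intersection: each item counted min(count in A, count in B) times.
    intersection = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    # Bag union: each item counted (count in A + count in B) times.
    union = sum(counts_a.values()) + sum(counts_b.values())
    return intersection / union if union else 0.0

# Bags {a, a, a, b} and {a, a, b, b, c}: intersection 3, union 9.
different = bag_jaccard({"a": 3, "b": 1}, {"a": 2, "b": 2, "c": 1})

# Identical bags: the intersection is exactly half the union.
identical = bag_jaccard({"a": 2, "b": 1}, {"a": 2, "b": 1})
```

Here `different` is 3/9 and `identical` is 0.5, which shows why 0.5 is
the ceiling: even a bag compared with itself contributes its counts to
the union twice.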