Adds support for Jaccard bag/multiset semantics #5

tcsalameh · 2016-11-01T02:39:03Z

This commit adds support for calculating Jaccard similarity using
bag/multiset semantics, as described on pgs. 76-77 in chapter 3 of
Mining of Massive Datasets (MMDS).

MMDS uses the example of movie ratings:

> If ratings are 1-to-5-stars, put a movie in a customer's set n times
> if they rated the movie n-stars. Then, use Jaccard similarity for bags
> when measuring the similarity of customers. The Jaccard similarity for
> bags B and C is defined by counting an element n times in the
> intersection if n is the minimum of the number of times the element
> appears in B and C. In the union, we count the element the sum of the
> number of times it appears in B and C.
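
As an illustration of the definition quoted above, here is a minimal Python sketch of bag Jaccard similarity. It is not code from this PR and does not use the library: the function name `bag_jaccard` and the sample ratings are hypothetical, and bags are represented as plain item-to-count mappings.

```python
from collections import Counter


def bag_jaccard(b, c):
    """Jaccard similarity for bags, following the MMDS definition.

    `b` and `c` map each item to the number of times it appears in the bag.
    (Hypothetical helper for illustration; not part of the PR.)
    """
    b, c = Counter(b), Counter(c)
    # Intersection: each element counts min(count in B, count in C) times.
    intersection = sum((b & c).values())
    # Union: each element counts (count in B + count in C) times.
    union = sum(b.values()) + sum(c.values())
    return intersection / union if union else 0.0


# Hypothetical ratings: a movie rated n stars appears n times in the bag.
alice = {"Alien": 5, "Brazil": 3}
bob = {"Alien": 2, "Brazil": 3, "Casablanca": 4}

print(bag_jaccard(alice, bob))    # intersection 2 + 3 = 5, union 8 + 9 = 17
print(bag_jaccard(alice, alice))  # identical bags: 8 / 16
```

Because the union is a sum rather than a maximum, two identical bags score 8 / 16 in this example, not 1.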

To fit a model using bag semantics, the user:

- Instantiates an LSH model with the variable repeatedItems set to
  true. This is an optional variable that is false by default.
- Passes their data into the model as a List or RDD of SparseVectors,
  where the indices of each SparseVector correspond to the distinct
  items in the set, and the values of each correspond to the number of
  times each corresponding item is repeated in the set.
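
To make the expected input shape concrete, the following Python sketch turns a bag of item counts into the size, indices, and values that a sparse vector carries. It is only an illustration under stated assumptions: `to_sparse_counts` and the sample vocabulary are hypothetical, the sketch does not call Spark or this library, and how the resulting pieces are handed to the model is not shown in this PR description.

```python
def to_sparse_counts(bag, vocabulary):
    """Encode a bag (item -> repeat count) as (size, indices, values).

    `vocabulary` fixes the index of every distinct item. Indices are
    returned in increasing order, and each value is the number of times
    the item at that index is repeated in the bag.
    (Hypothetical helper for illustration; not part of the PR.)
    """
    index_of = {item: i for i, item in enumerate(vocabulary)}
    pairs = sorted(
        (index_of[item], float(count))
        for item, count in bag.items()
        if count > 0  # items with a zero count are simply absent
    )
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return len(vocabulary), indices, values


# Hypothetical vocabulary of distinct items and one customer's ratings.
movies = ["Alien", "Brazil", "Casablanca"]
size, indices, values = to_sparse_counts({"Casablanca": 4, "Alien": 2}, movies)
print(size, indices, values)  # 3 [0, 2] [2.0, 4.0]
```

With repeatedItems left at its default of false, only the indices would matter; with bag semantics the values carry the repeat counts.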

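Running the model and reading its output work the same way as before.
The only difference is that the Jaccard similarity between two sets with
bag semantics has a maximum of 0.5 rather than 1.0: for two identical
bags, the intersection counts each element n times while the union
counts it 2n times.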

tcsalameh · 2016-11-01T02:40:32Z

Also: I just added a couple lines in the README about the repeatedItems option, but I'm happy to also add an example if needed.
