Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pivot with max cardinality percentage #241

Merged
merged 50 commits into from
Mar 29, 2019
Merged

Pivot with max cardinality percentage #241

merged 50 commits into from
Mar 29, 2019

Conversation

michaelweilsalesforce
Copy link
Contributor

@michaelweilsalesforce michaelweilsalesforce commented Mar 12, 2019

Related issues
Some pivots do not necessary make sense when the cardinality is too high.

Describe the proposed solution
Set a max cardinality percentage for pivot in OneHotVectorizer, TextMapVectorizer and MultiPickListMap Vectorizer. Number of uniques is computed with HyperLogLog Monoid.

Additional context
Default value will be 1.00. Then Experiments will be run in order to determine a good default.
Max cardinality percentage is implemented for MultiPickList(Map) Vectorizers. this may be questionable.
Max cardinality percentage is not in SmartText(Map)Vectorizer, because the param maxCardinality already exists, and it is used to differentiate a category from a text.

@codecov
Copy link

codecov bot commented Mar 12, 2019

Codecov Report

Merging #241 into master will decrease coverage by 0.18%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #241      +/-   ##
=========================================
- Coverage   86.78%   86.6%   -0.19%     
=========================================
  Files         314     315       +1     
  Lines       10466   10339     -127     
  Branches      354     566     +212     
=========================================
- Hits         9083    8954     -129     
- Misses       1383    1385       +2
Impacted Files Coverage Δ
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 87.5% <ø> (-0.26%) ⬇️
...alesforce/op/utils/spark/SequenceAggregators.scala 50% <ø> (ø) ⬆️
...ce/op/stages/impl/feature/OpOneHotVectorizer.scala 96.77% <100%> (+0.26%) ⬆️
...p/stages/impl/feature/TextMapPivotVectorizer.scala 100% <100%> (ø) ⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala 96.9% <100%> (-0.05%) ⬇️
...a/com/salesforce/op/filters/RawFeatureFilter.scala 92.77% <100%> (-2.41%) ⬇️
...ala/com/salesforce/op/dsl/RichNumericFeature.scala 100% <100%> (ø) ⬆️
...orce/op/stages/impl/feature/MathTransformers.scala 100% <100%> (ø)
...ages/impl/feature/MultiPickListMapVectorizer.scala 100% <100%> (ø) ⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala 85.71% <0%> (-4.09%) ⬇️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1099fc2...83b13c0. Read the comment docs.

@leahmcguire
Copy link
Collaborator

@mweilsalesforce can you please add a description of what this PR is doing?


val percentFilter = countUniques.flatMap(_.map{ case (k, v) =>
k -> (v.estimatedSize / n < $(maxPercentageCardinality))}.toSeq).toMap
val filteredDataset = filterHighCardinality(dataset, percentFilter)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than repeating this block of code can you make it a mixin trait?

@michaelweilsalesforce
Copy link
Contributor Author

BTW @tovbinm, couldn't make the Count Min Sketch work. Algebird has however a working CMSTopK Monoid, but we can't use it : we want topK based on minSupport.

@tovbinm tovbinm changed the title Mw/pivot fn Pivot with max cardinality percentage Mar 14, 2019
@tovbinm tovbinm requested a review from Jauntbox March 14, 2019 23:57

val countOccurrences: Seq[Map[String, Int]] = {
if (rdd.isEmpty) Seq.empty[Map[String, Int]]
else rdd.reduce((a, b) => a.zip(b).map { case (m1, m2) => m1 + m2 })
if (filteredRDD.isEmpty) Seq.empty[Map[String, Int]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of isEmpty and reduce use fold

Copy link
Collaborator

@tovbinm tovbinm Mar 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, I think this should be Seq.fill(inN.length)(Map.empty[String, Int]) no?

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need a bit more tests in UniqueCountTest in particular:

  1. make them more robust and cover more cases
  2. make better results assertions

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tovbinm tovbinm merged commit 8e6e050 into master Mar 29, 2019
@tovbinm tovbinm deleted the mw/pivotFn branch March 29, 2019 06:45
@tovbinm tovbinm mentioned this pull request Apr 10, 2019
@tovbinm tovbinm mentioned this pull request Jul 11, 2019
@salesforce-cla
Copy link

Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants