Pivot with max cardinality percentage #241

michaelweilsalesforce · 2019-03-12T18:34:03Z

Related issues
Some pivots do not necessarily make sense when the cardinality is too high.

Describe the proposed solution
Set a max cardinality percentage for pivoting in OneHotVectorizer, TextMapVectorizer and MultiPickListMapVectorizer. The number of unique values is estimated with a HyperLogLog Monoid.
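To make the idea concrete, here is a minimal sketch of the threshold check in plain Scala. It is not the PR's implementation: the object and method names are hypothetical, an exact `Set` stands in for the HyperLogLog estimate, and the strict `<` comparison is an assumption taken from the snippet quoted in the review further down.

```scala
object CardinalityFilterSketch {
  // Hypothetical helper: fraction of distinct non-empty values among all rows.
  // An exact Set is used here as a stand-in for the HyperLogLog estimate.
  def uniqueFraction(values: Seq[Option[String]]): Double = {
    val n = values.size
    if (n == 0) 0.0 else values.flatten.toSet.size.toDouble / n
  }

  // A feature is pivoted only when its unique fraction is below the threshold.
  // The strict comparison is an assumption, not a confirmed detail of the PR.
  def shouldPivot(values: Seq[Option[String]], maxPercentageCardinality: Double): Boolean =
    uniqueFraction(values) < maxPercentageCardinality
}
```

With a threshold of 0.9, a column holding two distinct values in four rows (fraction 0.5) would still be pivoted, while a column where every row is distinct would be dropped.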

Additional context
The default value will be 1.00. Experiments will then be run to determine a good default.
The max cardinality percentage is also implemented for the MultiPickList(Map) vectorizers; this may be questionable.
The max cardinality percentage is not added to SmartText(Map)Vectorizer, because the param maxCardinality already exists there and is used to differentiate a category from a text.

codecov · 2019-03-12T18:57:43Z

Codecov Report

Merging #241 into master will decrease coverage by 0.18%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #241      +/-   ##
=========================================
- Coverage   86.78%   86.6%   -0.19%     
=========================================
  Files         314     315       +1     
  Lines       10466   10339     -127     
  Branches      354     566     +212     
=========================================
- Hits         9083    8954     -129     
- Misses       1383    1385       +2

Impacted Files	Coverage Δ
.../src/main/scala/com/salesforce/op/OpWorkflow.scala	`87.5% <ø> (-0.26%)`	⬇️
...alesforce/op/utils/spark/SequenceAggregators.scala	`50% <ø> (ø)`	⬆️
...ce/op/stages/impl/feature/OpOneHotVectorizer.scala	`96.77% <100%> (+0.26%)`	⬆️
...p/stages/impl/feature/TextMapPivotVectorizer.scala	`100% <100%> (ø)`	⬆️
...sforce/op/stages/impl/feature/Transmogrifier.scala	`96.9% <100%> (-0.05%)`	⬇️
...a/com/salesforce/op/filters/RawFeatureFilter.scala	`92.77% <100%> (-2.41%)`	⬇️
...ala/com/salesforce/op/dsl/RichNumericFeature.scala	`100% <100%> (ø)`	⬆️
...orce/op/stages/impl/feature/MathTransformers.scala	`100% <100%> (ø)`
...ages/impl/feature/MultiPickListMapVectorizer.scala	`100% <100%> (ø)`	⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala	`85.71% <0%> (-4.09%)`	⬇️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1099fc2...83b13c0. Read the comment docs.

leahmcguire · 2019-03-12T19:56:43Z

@mweilsalesforce can you please add a description of what this PR is doing?

leahmcguire · 2019-03-12T19:59:11Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/TextMapPivotVectorizer.scala

+
+    val percentFilter = countUniques.flatMap(_.map{ case (k, v) =>
+      k -> (v.estimatedSize / n < $(maxPercentageCardinality))}.toSeq).toMap
+    val filteredDataset = filterHighCardinality(dataset, percentFilter)


Rather than repeating this block of code, can you make it a mixin trait?
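One way the suggested mixin could look, as a rough sketch only. The trait and class names below are hypothetical and Spark is left out entirely: the per-key unique counts are passed in as plain doubles instead of HLL values, and the threshold is a plain member rather than a Spark param.

```scala
// Hypothetical mixin: holds the threshold and the shared filtering logic.
trait MaxCardinalityFilter {
  def maxPercentageCardinality: Double

  // uniqueCounts: estimated number of distinct values per key; n: total row count.
  // Returns key -> whether that key is kept for pivoting.
  def percentFilter(uniqueCounts: Map[String, Double], n: Long): Map[String, Boolean] =
    uniqueCounts.map { case (k, v) => k -> (v / n < maxPercentageCardinality) }
}

// Two hypothetical vectorizers sharing the logic through the mixin
// instead of each repeating the block.
class TextMapVectorizerSketch(val maxPercentageCardinality: Double) extends MaxCardinalityFilter
class MultiPickListMapVectorizerSketch(val maxPercentageCardinality: Double) extends MaxCardinalityFilter
```

Each vectorizer then only supplies its threshold, and any change to the comparison is made in one place.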

michaelweilsalesforce · 2019-03-13T00:27:39Z

BTW @tovbinm, I couldn't make the Count-Min Sketch work. Algebird does have a working CMSTopK Monoid, but we can't use it: we want topK based on minSupport.
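For readers unfamiliar with the requirement, a small sketch of what "topK based on minSupport" could mean, using exact counts. The object and method names are hypothetical, and the tie-breaking rule (alphabetical) is an assumption added only to make the result deterministic.

```scala
object TopKMinSupportSketch {
  // Keep values seen at least minSupport times, then take the topK most frequent.
  // Ties on count are broken alphabetically (an assumption for determinism).
  def topKWithMinSupport(counts: Map[String, Int], topK: Int, minSupport: Int): Seq[String] =
    counts.toSeq
      .filter { case (_, c) => c >= minSupport }
      .sortBy { case (v, c) => (-c, v) }
      .take(topK)
      .map(_._1)
}
```

A plain top-K sketch returns the K heaviest values regardless of how often they occur, which is why a separate support threshold is needed here.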

tovbinm · 2019-03-20T02:28:36Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/OpOneHotVectorizer.scala


    val countOccurrences: Seq[Map[String, Int]] = {
-      if (rdd.isEmpty) Seq.empty[Map[String, Int]]
-      else rdd.reduce((a, b) => a.zip(b).map { case (m1, m2) => m1 + m2 })
+      if (filteredRDD.isEmpty) Seq.empty[Map[String, Int]]


Instead of isEmpty and reduce, use fold.

Also, I think this should be Seq.fill(inN.length)(Map.empty[String, Int]), no?
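The suggestion can be illustrated on plain Scala collections rather than an RDD. This is a sketch under assumptions: the names are hypothetical, `numFeatures` plays the role of `inN.length`, and the maps are merged with an explicit helper because `m1 + m2` on maps relies on operators not defined in the standard library.

```scala
object FoldCountsSketch {
  type Counts = Seq[Map[String, Int]]

  // Merge two count maps by summing the counts of shared keys.
  def mergeCounts(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] =
    m2.foldLeft(m1) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

  // Folding from a zero of numFeatures empty maps removes the isEmpty branch
  // and keeps one map per input feature even when there are no rows.
  def countOccurrences(rows: Seq[Counts], numFeatures: Int): Counts = {
    val zero: Counts = Seq.fill(numFeatures)(Map.empty[String, Int])
    rows.fold(zero)((a, b) => a.zip(b).map { case (m1, m2) => mergeCounts(m1, m2) })
  }
}
```

With `Seq.empty` as the zero, zipping would truncate every result to length zero, which is why the zero has to carry one empty map per feature.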

tovbinm

I think we still need a bit more tests in UniqueCountTest, in particular:

- make them more robust and cover more cases
- make better result assertions

leahmcguire

LGTM

salesforce-cla · 2020-10-21T18:44:13Z

Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.

mweilsalesforce added 9 commits March 11, 2019 13:14

MaxPctCardinality param

eaf3aa2

param in TransmogrifAI

a08ce06

Fixing OPSetVectorizerTest

ef34050

TextTransmogrifyTest fix

af74e1e

Fix SmartTextVectorizerTest

c0f876e

Add Cardinality Test for OPSetVectorizer

6d03740

Adding test in TextTransmogrify test

be64853

High Cardinality Map

b762437

MultiPickListMap Vectorizer as well

9375813

michaelweilsalesforce requested review from leahmcguire and tovbinm as code owners March 12, 2019 18:34

Merge branch 'master' into mw/pivotFn

87fa2c7

salesforce-cla bot added the cla:signed label Mar 12, 2019

leahmcguire reviewed Mar 12, 2019

mweilsalesforce added 2 commits March 12, 2019 13:06

Adding description

3dbedd0

Fix Scalastyle

883028c

mweilsalesforce added 4 commits March 12, 2019 17:47

Migrating HLL count Code Block to Trait

b69a04b

Adding tests for Map

9874d63

MultiPickListMap Test

932c418

Remove print statements

7634106

michaelweilsalesforce added the ready for review label Mar 13, 2019

Adding bit params for HLL

29fe9bd

tovbinm changed the title ~~Mw/pivot fn~~ Pivot with max cardinality percentage Mar 14, 2019

tovbinm requested a review from Jauntbox March 14, 2019 23:57

tovbinm mentioned this pull request Mar 15, 2019

Labels may not be dropped in MultiClassClassification #245

Closed

clean ups

7b44e1a

tovbinm reviewed Mar 20, 2019

tovbinm and others added 10 commits March 26, 2019 16:48

Refactoring

66d0847

Merge branch 'mw/pivotFn' of github.com:salesforce/TransmogrifAI into…

3075f0a

… mw/pivotFn

bugfixes and cleanup

e39107c

renames

4fd1683

fixes

f9e0291

pretty

7e612df

fun

a1b3b65

cleanuop

78a8d40

cleanup

f0e12ac

tests

cc46b69

tovbinm requested changes Mar 27, 2019

tovbinm and others added 2 commits March 26, 2019 21:06

cleanup

8c86945

Merge branch 'master' into mw/pivotFn

133cf04

leahmcguire approved these changes Mar 27, 2019

tovbinm and others added 8 commits March 27, 2019 11:35

newline

b0a9d74

Merge branch 'mw/pivotFn' of github.com:salesforce/TransmogrifAI into…

ebad483

… mw/pivotFn

Add more tests

8664901

Merge branch 'mw/pivotFn' of https://github.com/salesforce/TransmogrifAI

885cb8b

into mw/pivotFn

Merge branch 'master' into mw/pivotFn

ceb1b83

use fold instead of reduce

1099fc2

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mw/…

5b4c76d

…pivotFn

cleanup

83b13c0

tovbinm approved these changes Mar 29, 2019

tovbinm merged commit 8e6e050 into master Mar 29, 2019

tovbinm deleted the mw/pivotFn branch March 29, 2019 06:45

tovbinm mentioned this pull request Apr 10, 2019

Release 0.5.2 #277

Merged

tovbinm mentioned this pull request Jul 11, 2019

Release 3.3.3 #26

Merged

salesforce-cla bot removed the cla:signed label Oct 21, 2020

salesforce-cla bot added the cla:missing label Oct 21, 2020
