Pivot with max cardinality percentage #241
Codecov Report
@@            Coverage Diff            @@
##            master    #241     +/-   ##
=========================================
- Coverage    86.78%   86.6%   -0.19%
=========================================
  Files          314     315       +1
  Lines        10466   10339     -127
  Branches       354     566     +212
=========================================
- Hits          9083    8954     -129
- Misses        1383    1385       +2
@mweilsalesforce can you please add a description of what this PR is doing?
val percentFilter = countUniques.flatMap(_.map{ case (k, v) =>
  k -> (v.estimatedSize / n < $(maxPercentageCardinality))}.toSeq).toMap
val filteredDataset = filterHighCardinality(dataset, percentFilter)
Rather than repeating this block of code, can you make it a mixin trait?
BTW @tovbinm, I couldn't make the Count Min Sketch work. Algebird does have a working CMSTopK Monoid, but we can't use it: we want the top K based on minSupport.
val countOccurrences: Seq[Map[String, Int]] = {
if (rdd.isEmpty) Seq.empty[Map[String, Int]]
else rdd.reduce((a, b) => a.zip(b).map { case (m1, m2) => m1 + m2 })
if (filteredRDD.isEmpty) Seq.empty[Map[String, Int]]
Instead of isEmpty and reduce, use fold.
Also, I think this should be Seq.fill(inN.length)(Map.empty[String, Int]), no?
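The two suggestions above (fold instead of isEmpty plus reduce, with a zero of one empty map per input) can be sketched as follows. This is only an illustration under assumptions: a local Seq stands in for the RDD so it runs without Spark, `FoldCounts`, `mergeMaps` and `numFeatures` are hypothetical names, and the explicit map merge replaces the `m1 + m2` from the diff, which needs an implicit not shown in this PR excerpt.

```scala
// Sketch only: a local Seq stands in for the Spark RDD, and all names
// below are hypothetical, not taken from the PR.
object FoldCounts {
  type Counts = Seq[Map[String, Int]]

  // Merge two count maps by summing the values of shared keys.
  def mergeMaps(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] =
    m2.foldLeft(m1) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

  // fold with a zero of one empty map per feature: empty input returns the
  // zero itself, so no separate isEmpty branch is needed.
  def countOccurrences(rows: Seq[Counts], numFeatures: Int): Counts = {
    val zero: Counts = Seq.fill(numFeatures)(Map.empty[String, Int])
    rows.fold(zero)((a, b) => a.zip(b).map { case (m1, m2) => mergeMaps(m1, m2) })
  }
}
```

With this zero, an empty input yields a sequence of `numFeatures` empty maps rather than an empty sequence, so downstream code that zips the result against the input features keeps the same length in both cases.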
I think we still need a few more tests in UniqueCountTest, in particular:
- make them more robust and cover more cases
- make stronger assertions on the results
LGTM
Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.
Related issues
Some pivots do not necessarily make sense when the cardinality is too high.
Describe the proposed solution
Set a max cardinality percentage for pivot in OneHotVectorizer, TextMapVectorizer and MultiPickListMapVectorizer. The number of uniques is computed with the HyperLogLog Monoid.
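The filter described here can be sketched in plain Scala. This is a minimal illustration under assumptions, not the PR's implementation: exact distinct counts stand in for the HyperLogLog estimate, rows are modelled as `Map[String, String]`, and `CardinalityFilter` and its method signatures are hypothetical. Only the comparison `uniques / n < maxPercentageCardinality` is taken from the diff.

```scala
// Sketch only: exact distinct counts replace the HyperLogLog estimate,
// and the object and signatures are hypothetical.
object CardinalityFilter {
  // For each key, true when (distinct values / number of rows) is strictly
  // below the threshold, mirroring the comparison in the PR diff.
  def percentFilter(
    rows: Seq[Map[String, String]],
    maxPercentageCardinality: Double
  ): Map[String, Boolean] = {
    val n = rows.size.toDouble
    val uniques: Map[String, Set[String]] =
      rows.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).toSet }
    uniques.map { case (k, vs) => k -> (vs.size / n < maxPercentageCardinality) }
  }

  // Drop the entries of every key that did not pass the filter.
  def filterHighCardinality(
    rows: Seq[Map[String, String]],
    keep: Map[String, Boolean]
  ): Seq[Map[String, String]] =
    rows.map(_.filter { case (k, _) => keep.getOrElse(k, false) })
}
```

For example, with four rows where `id` takes four distinct values and `color` takes two, a threshold of 0.6 keeps `color` (ratio 0.5) and drops `id` (ratio 1.0). Because the comparison is strict, a key whose values are all distinct is dropped in this sketch even at a threshold of 1.0.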
Additional context
The default value will be 1.00. Experiments will then be run to determine a good default.
Max cardinality percentage is implemented for MultiPickList(Map) Vectorizers; this may be questionable.
Max cardinality percentage is not added to SmartText(Map)Vectorizer, because the param maxCardinality already exists there and is used to differentiate a category from text.