[SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator #20132

jkbradley · 2017-12-31T23:48:36Z

What changes were proposed in this pull request?

Follow-up cleanups for the OneHotEncoderEstimator PR. See some discussion in the original PR: #19527 or read below for what this PR includes:

* configedCategorySize: I reverted this to return an Array. I realized the original setup (which I had recommended in the original PR) caused the whole model to be serialized in the UDF.
* encoder: I reorganized the logic to show what I meant in the comment in the previous PR. I think it's simpler but am open to suggestions.

I also made some small style cleanups based on IntelliJ warnings.
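
As a general illustration of the serialization point above (not the actual Spark code), here is a minimal, Spark-free sketch of how a closure can end up capturing a whole object. `ToyModel`, `capturingThis`, `capturingLocal`, and `ClosureCaptureDemo` are hypothetical names, and it assumes the problem came from the UDF closure referencing a member of the model. `ToyModel` is deliberately not Serializable, so that capturing `this` shows up as a failure under plain Java serialization.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a model. It is deliberately NOT Serializable,
// so a closure that captures `this` cannot be serialized.
class ToyModel(val categorySizes: Array[Int]) {

  // The lambda reads a member of the enclosing instance, so it captures `this`.
  def capturingThis: Int => Int =
    (colIdx: Int) => categorySizes(colIdx)

  // The member is copied into a local val first, so the lambda captures
  // only the array, not the enclosing instance.
  def capturingLocal: Int => Int = {
    val localSizes = categorySizes
    (colIdx: Int) => localSizes(colIdx)
  }
}

object ClosureCaptureDemo {
  // Returns true if plain Java serialization of `f` succeeds.
  def serializes(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

With a Serializable model, the first variant would not fail; it would ship the whole model along with the closure, which is the cost described above. Copying the needed values into local vals before building the closure avoids that.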

How was this patch tested?

Existing unit tests

jkbradley · 2017-12-31T23:49:54Z

@viirya This basically has 2 changes:

* configedCategorySize: my mistake!
* encoder: clarify what I meant before

SparkQA · 2018-01-01T00:55:02Z

Test build #85569 has finished for PR 20132 at commit 9bf045d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-01-01T01:11:05Z

@jkbradley Thanks for this follow-up!

I had noticed that first issue in the original PR, but didn't have enough time to discuss it with you further.

I'll go through this soon.

viirya · 2018-01-01T08:28:43Z

The simplified logic for encoder looks good to me.

viirya · 2018-01-01T08:32:10Z

LGTM

viirya · 2018-01-01T15:09:19Z

mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala

        } else {
-          Vectors.sparse(size, Array(size - 1), oneValue)
+          if (label < 0) {
+            throw new SparkException(s"Negative value: $label. Input can't be negative. " +


I have a question. Since we don't allow negative values when fitting, should we allow them when transforming, even if handleInvalid is KEEP_INVALID?

jkbradley · 2018-01-05

Good point that it's unclear. I do think it'd be good to be robust during transform(). As for fitting, I could see going either way (forcing data validation vs. being robust to small issues). I'd like to keep this strict during fitting (throwing errors) and robust during transform(), but let me know what you think.

I'll clarify this in the documentation.
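
To make the transform-time behavior discussed above concrete, here is a minimal, Spark-free sketch of one reading of it. `EncoderSketch`, `encodeIndex`, `numCategories`, and `keepInvalid` are hypothetical names, and `IllegalArgumentException` stands in for `SparkException` so the snippet has no dependencies. It is not the code from this PR. It assumes that invalid labels (negative or unseen) map to one extra category when invalid values are kept, and raise an error otherwise.

```scala
// Hypothetical sketch: names and exception type are illustrative assumptions,
// not the Spark API.
object EncoderSketch {

  /** Returns the index of the single 1-valued element for `label`.
    *
    * Valid labels are 0 until numCategories. When keepInvalid is true, any
    * other label (negative or unseen) maps to the extra slot `numCategories`;
    * otherwise an error is thrown.
    */
  def encodeIndex(label: Double, numCategories: Int, keepInvalid: Boolean): Int =
    if (label >= 0 && label < numCategories) {
      label.toInt
    } else if (keepInvalid) {
      numCategories
    } else if (label < 0) {
      throw new IllegalArgumentException(
        s"Negative value: $label. Input can't be negative.")
    } else {
      throw new IllegalArgumentException(s"Unseen value: $label.")
    }
}
```

For example, `EncoderSketch.encodeIndex(-1.0, 3, keepInvalid = true)` lands in the extra slot, while the same call with `keepInvalid = false` throws. The strict check during fitting is a separate step and is not modeled here.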

jkbradley · 2018-01-05T06:33:51Z

Updated!

SparkQA · 2018-01-05T07:43:14Z

Test build #85719 has finished for PR 20132 at commit c547d0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

LGTM

jkbradley · 2018-01-05T19:51:00Z

Thanks! Merging with master and branch-2.3

## What changes were proposed in this pull request?

Follow-up cleanups for the OneHotEncoderEstimator PR. See some discussion in the original PR: #19527 or read below for what this PR includes:

* configedCategorySize: I reverted this to return an Array. I realized the original setup (which I had recommended in the original PR) caused the whole model to be serialized in the UDF.
* encoder: I reorganized the logic to show what I meant in the comment in the previous PR. I think it's simpler but am open to suggestions.

I also made some small style cleanups based on IntelliJ warnings.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <[email protected]>

Closes #20132 from jkbradley/viirya-SPARK-13030.

(cherry picked from commit 930b90a)
Signed-off-by: Joseph K. Bradley <[email protected]>

updates for final PR

9bf045d

jkbradley mentioned this pull request Dec 31, 2017

[SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator #19527

Closed

viirya reviewed Jan 1, 2018

Clarified role of handleInvalid during fitting

c547d0f

viirya approved these changes Jan 5, 2018

asfgit closed this in 930b90a Jan 5, 2018
