Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator #20132

Closed
wants to merge 2 commits into from

Conversation

jkbradley
Copy link
Member

What changes were proposed in this pull request?

Follow-up cleanups for the OneHotEncoderEstimator PR. See some discussion in the original PR: #19527 or read below for what this PR includes:

  • configedCategorySize: I reverted this to return an Array. I realized the original setup (which I had recommended in the original PR) caused the whole model to be serialized in the UDF.
  • encoder: I reorganized the logic to show what I meant in the comment in the previous PR. I think it's simpler but am open to suggestions.

I also made some small style cleanups based on IntelliJ warnings.

How was this patch tested?

Existing unit tests

@jkbradley
Copy link
Member Author

@viirya This basically has 2 changes:

  • configedCategorySize: my mistake!
  • encoder: clarify what I meant before

@SparkQA
Copy link

SparkQA commented Jan 1, 2018

Test build #85569 has finished for PR 20132 at commit 9bf045d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Jan 1, 2018

@jkbradley Thanks for this follow-up!

I've noticed that first issue in original PR. But don't have enough time to discuss with you further.

I'll go through this soon.

@viirya
Copy link
Member

viirya commented Jan 1, 2018

The simplified logic for encoder looks good to me.

@viirya
Copy link
Member

viirya commented Jan 1, 2018

LGTM

} else {
Vectors.sparse(size, Array(size - 1), oneValue)
if (label < 0) {
throw new SparkException(s"Negative value: $label. Input can't be negative. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question. Since we don't allow negative value when fitting, should we allow it in transforming even handleInvalid is KEEP_INVALID?

Copy link
Member Author

@jkbradley jkbradley Jan 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point that it's unclear. I do think it'd be good to be robust during transform(). As far as fitting, I could see going either way (forcing data validation vs. being robust to small issues). I'd like to keep this strict during fitting (throwing errors) and robust during transform(), but let me know what you think.

I'll clarify this in the documentation.

@jkbradley
Copy link
Member Author

Updated!

@SparkQA
Copy link

SparkQA commented Jan 5, 2018

Test build #85719 has finished for PR 20132 at commit c547d0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jkbradley
Copy link
Member Author

Thanks! Merging with master and branch-2.3

asfgit pushed a commit that referenced this pull request Jan 5, 2018
## What changes were proposed in this pull request?

Follow-up cleanups for the OneHotEncoderEstimator PR.  See some discussion in the original PR: #19527 or read below for what this PR includes:
* configedCategorySize: I reverted this to return an Array.  I realized the original setup (which I had recommended in the original PR) caused the whole model to be serialized in the UDF.
* encoder: I reorganized the logic to show what I meant in the comment in the previous PR.  I think it's simpler but am open to suggestions.

I also made some small style cleanups based on IntelliJ warnings.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <[email protected]>

Closes #20132 from jkbradley/viirya-SPARK-13030.

(cherry picked from commit 930b90a)
Signed-off-by: Joseph K. Bradley <[email protected]>
@asfgit asfgit closed this in 930b90a Jan 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants