[SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator #20132
@viirya This basically has 2 changes:
Test build #85569 has finished for PR 20132 at commit
@jkbradley Thanks for this follow-up! I noticed that first issue in the original PR, but didn't have enough time to discuss it with you further. I'll go through this soon.
The simplified logic for the encoder looks good to me.
LGTM
} else {
Vectors.sparse(size, Array(size - 1), oneValue)
if (label < 0) {
throw new SparkException(s"Negative value: $label. Input can't be negative. " +
I have a question. Since we don't allow negative values when fitting, should we allow them when transforming, even if handleInvalid is KEEP_INVALID?
Good point that it's unclear. I do think it'd be good to be robust during transform(). As far as fitting, I could see going either way (forcing data validation vs. being robust to small issues). I'd like to keep this strict during fitting (throwing errors) and robust during transform(), but let me know what you think.
I'll clarify this in the documentation.
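For readers following this thread, the transform-time behavior under discussion can be sketched without Spark: a valid label maps to its own slot, and an invalid label (negative or unseen) either raises an error or, when invalid values are kept, maps to one extra slot. This is a rough illustration and not the code from the PR. `EncoderSketch`, `encode`, and its parameters are hypothetical names, `IllegalArgumentException` stands in for `SparkException`, and the result is a plain (size, indices) pair rather than a Spark sparse vector.

```scala
// Hypothetical, Spark-free sketch of the encoder logic discussed in this thread.
object EncoderSketch {
  /**
   * @param label        category index stored as a Double
   * @param categorySize number of categories seen during fitting
   * @param keepInvalid  whether invalid labels get an extra slot instead of an error
   * @return (vector size, indices of the entries set to 1.0)
   */
  def encode(label: Double, categorySize: Int, keepInvalid: Boolean): (Int, Array[Int]) = {
    val isValid = label >= 0 && label < categorySize
    // Index of the single 1-valued element.
    val idx: Int =
      if (isValid) {
        label.toInt
      } else if (keepInvalid) {
        // Negative and unseen labels share the extra trailing slot.
        categorySize
      } else if (label < 0) {
        throw new IllegalArgumentException(
          s"Negative value: $label. Input can't be negative.")
      } else {
        throw new IllegalArgumentException(s"Unseen value: $label.")
      }
    // Assumption: keeping invalid values adds exactly one slot to the output.
    val size = if (keepInvalid) categorySize + 1 else categorySize
    (size, Array(idx))
  }
}
```

With three fitted categories, `EncoderSketch.encode(-1.0, 3, keepInvalid = true)` places the label in the extra fourth slot, while the same call with `keepInvalid = false` throws.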
Updated!
Test build #85719 has finished for PR 20132 at commit
LGTM
Thanks! Merging with master and branch-2.3
## What changes were proposed in this pull request?

Follow-up cleanups for the OneHotEncoderEstimator PR. See some discussion in the original PR: #19527 or read below for what this PR includes:

* configedCategorySize: I reverted this to return an Array. I realized the original setup (which I had recommended in the original PR) caused the whole model to be serialized in the UDF.
* encoder: I reorganized the logic to show what I meant in the comment in the previous PR. I think it's simpler but am open to suggestions.

I also made some small style cleanups based on IntelliJ warnings.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <[email protected]>

Closes #20132 from jkbradley/viirya-SPARK-13030.

(cherry picked from commit 930b90a)
Signed-off-by: Joseph K. Bradley <[email protected]>
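The configedCategorySize point concerns closure capture: a function that refers to a member of its enclosing object captures the whole object, whereas copying the needed value into a local val first captures only that value. The sketch below illustrates the general mechanism in plain Scala (2.12 or later assumed) with Java serialization; it is not the PR's code. `ToyModel`, `ClosureCaptureSketch`, and the method names are hypothetical, and `ToyModel` is deliberately left non-serializable so that capturing it shows up as a serialization failure instead of merely a larger payload.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a fitted model. It is deliberately NOT Serializable
// so that accidentally capturing it in a closure is easy to detect.
class ToyModel(val categorySizes: Array[Int]) {

  // The function body refers to a member, so the closure captures `this`.
  def sizeLookupCapturingModel: Int => Int =
    (colIdx: Int) => categorySizes(colIdx)

  // The array is copied into a local val first, so the closure captures
  // only the array and not the enclosing model.
  def sizeLookupCapturingArray: Int => Int = {
    val localSizes = categorySizes
    (colIdx: Int) => localSizes(colIdx)
  }
}

object ClosureCaptureSketch {
  // Returns true if Java serialization of `obj` succeeds.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Checking both closures with `ClosureCaptureSketch.isSerializable` should show that only the second one serializes on its own. When the enclosing object is itself serializable, the first form does not fail; it ships the whole object along with the function, which is the cost described above.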