-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 in RandomForests #4073
Conversation
Test build #25666 has started for PR 4073 at commit
|
@jkbradley @mengxr it would be great if you could have a look. |
Test build #25666 has finished for PR 4073 at commit
|
Test FAILed. |
I've made changes such that this not break anything, i.e everything is backward compat. |
@jkbradley Oops, the comments got deleted somehow. I meant that this is because there are 10 arguments in |
@MechCoder Taking a closer look, I now realize that part of this functionality is already there...see the JIRA & let me know what you think. |
Oh well, but still if I'm not mistaken, the |
Test build #25672 has started for PR 4073 at commit
|
Good point, yes, I think it's worth fixing. |
Test build #25672 has finished for PR 4073 at commit
|
Test FAILed. |
Also, as far as testing....it's hard. One way might be to:
|
Thanks, Also a design decision, is it worthy enough to add this as an option to |
I'd vote for not adding it to train since that part of the API is so unwieldy. |
This reverts commit 6685b4494d2cb1ec72dbc540d2d747c75c6939ee.
Test build #25704 has started for PR 4073 at commit
|
Test build #25704 has finished for PR 4073 at commit
|
Test PASSed. |
@jkbradley I've added a test according to the other tests in the |
Test build #25705 has started for PR 4073 at commit
|
Test build #25705 has finished for PR 4073 at commit
|
Test PASSed. |
test("subsampling rate in RandomForest"){ | ||
val arr = EnsembleTestHelper.generateOrderedLabeledPoints(5, 20) | ||
val rdd = sc.parallelize(arr) | ||
val strategy1 = new Strategy(algo = Classification, impurity = Gini, maxDepth = 2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make 1 instance of strategy, train rf1, modify the strategy's subsamplingRate, and train rf2. Simpler + more clearly using the same settings for other parameters
@jkbradley Thanks for the tip. Fixed. Anything more? |
Test build #25953 has started for PR 4073 at commit
|
Test build #25953 has finished for PR 4073 at commit
|
Test FAILed. |
Repushed after fixing the style checks. |
Test build #25955 has started for PR 4073 at commit
|
Test build #25955 has finished for PR 4073 at commit
|
Test PASSed. |
Test build #25961 has started for PR 4073 at commit
|
Test build #25961 has finished for PR 4073 at commit
|
Test PASSed. |
ping @jkbradley Could you please have a final look? |
@MechCoder This is an addition instead of a correction, but I just realized that Strategy.assertValid() does not check subsamplingRate. Would you mind adding that check? The rest looks good to me. Thanks! |
@jkbradley Fixed. I can haz merge? |
Test build #26061 has started for PR 4073 at commit
|
Test build #26061 has finished for PR 4073 at commit
|
Test PASSed. |
@MechCoder Thanks! LGTM CC: @mengxr Note this is sort of an API change: RandomForest can now be run with subsampled rows. (But this seems fine to me since users could set subsamplingRate before---it just wouldn't do anything.) |
@mengxr This can also be viewd as a bugfix which prevents overwriting of the param |
Merged into master. Thanks! |
I've added support for sampling_rate not equal to 1.0 . I have two major questions.