[SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision trees (and random forest) #13959
Conversation
@jkbradley @sethah Please have a look when free!
Test build #61425 has finished for PR 13959 at commit
Test build #61427 has finished for PR 13959 at commit
The test failure is just due to binary incompatibility. I can fix that once we decide that the current PR is the way to proceed.
I think we should fix the compatibility issue first rather than leaving this PR incomplete. If it is inactive, I would rather propose to close this for now.
I don't understand. If you don't have time to review, that is fine (I've been there too), but there is no need to close a PR due to unavailability of committers. One of the reasons that I am happy to have stopped contributing to Spark and to focus my energy elsewhere... Thanks!
I think the problem is that this PR was incomplete, and left open. We generally only leave open PRs that are active. There was evidently no interest in proceeding with it; I don't know if it was lack of attention.
The lack of bandwidth in MLlib means that sometimes good code that would make an impact just gets ignored. This is kind of the reality of things. However, if we are going to close the PR simply because committers could not or did not get to it - and that is the case here, IMO - then we should also close the JIRA. Closing a PR for this reason essentially means "we don't see this as an issue worth spending time on." That's a reason to close a JIRA as well. Closing the JIRA will at least prevent others from wasting their time on this issue like @MechCoder did. If we don't close the JIRA, then it seems like we are closing the PR merely because we don't want the "clutter" of long-waiting PRs. But if a PR is still valid, well-written, and solves a real problem, why would we not keep it open? This sends a bad message to contributors, IMO.
True, and I'd probably close the JIRA too. Maybe we can draw @jkbradley's attention for a comment? A closed PR still exists and can be examined or reopened, so it doesn't go away. I'd prefer to close it if it's almost surely not going to be merged, as a minor courtesy to the contributor, rather than leave it open. It's not so much the clutter, though that's a factor. If there are always 500 open PRs, what's one more? Being open carries virtually no information. I both want more closing of things, to reflect the fact that, at this stage, not a lot is going to change -- and want more committers.
FWIW, I think we already had a few discussions on the mailing list about the last resort - automatic closing. I was strongly against this. This is my effort to prevent that for now, for the reasons described above.
This is fine, but are we not also policing JIRAs? I've argued above that the reason this PR has been inactive is simply lack of interest in this issue. If that's the case, then the JIRA must also be closed, since we've implicitly decided that this is no longer, or never was, a problem worth solving. If it is not the case, then we must absolutely leave this open, since it is a) still unsolved, b) good code, and c) still of interest. The reason for closing PRs but not their corresponding JIRAs would be that the PR is either poorly implemented or the author is non-responsive. While it is no doubt frustrating for contributors to submit code that is well-written and solves a problem that a project committer asked for, I also don't know that leaving it open indefinitely is a solution either. I guess I don't understand why we'd be willing to leave JIRAs open indefinitely but not PRs. At any rate, in this case I would have proposed we ask for Joseph's (the issue creator's) input for a few days, and if we hear nothing, close both the JIRA and the PR. We surely do not want others to submit patches for this issue if it will not be reviewed and merged.
Yes, I tried to identify this case. For this PR (and such PRs), the author still looks responsive and active, so personally I do not disagree with re-opening, because this was the point in #18017. Probably, I should have left a comment about this in each PR for clarification, though.
Yes, this is a tough issue. Let's wait and see if @jkbradley has thoughts on this issue. If we don't hear anything, then I'd leave it up to @MechCoder on whether to reopen. Thanks, btw, for taking the time to do the cleanups. It is important and justified in many cases.
@MechCoder, I apologise; the reason for my suggestion was probably not clear initially, and I am sorry if it came across as disrespectful.
What changes were proposed in this pull request?
The current `findBestSplits` method creates an instance of `ImpurityCalculator` and `ImpurityStats` for every possible split and feature in the search for the best split. Every instance of `ImpurityCalculator` creates an array of size `statsSize`, which is unnecessary and takes a non-negligible amount of time. This pull request tackles the problem with the following technique (see the sketch after this list):

- Remove the `ImpurityCalculator` instantiation for every possible split and feature. Replace it with a `calculateGain` method for each impurity that computes the gain directly from the `allStats` attribute of the `DTStatsAggregator`, which holds all the necessary information.
- Replace the creation of `ImpurityStats` for every possible split and feature with just the information gain returned by the `calculateGain` method, since the gain is sufficient to determine the `bestSplit`.
- Return an instance of `ImpurityStats` only once, for the `bestSplit`, via a `calculateImpurityStats` method.
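Below is a minimal, self-contained sketch of the idea, assuming a Gini impurity and a simplified flat layout for `allStats`. The names `calculateGain`, `allStats`, and `statsSize` come from this PR; the offset parameters and the impurity arithmetic are illustrative assumptions, not Spark's exact internals.

```scala
// Hypothetical sketch: compute a split's information gain straight from the
// aggregated label counts, with no per-split ImpurityCalculator or
// ImpurityStats allocations.
object GainSketch {

  /**
   * `allStats` holds label counts for every (feature, bin) pair.
   * `leftOffset` and `rightOffset` point at the left/right child
   * histograms of the candidate split, each of length `statsSize`.
   */
  def calculateGain(
      allStats: Array[Double],
      leftOffset: Int,
      rightOffset: Int,
      statsSize: Int): Double = {
    var leftCount = 0.0
    var rightCount = 0.0
    var i = 0
    while (i < statsSize) {
      leftCount += allStats(leftOffset + i)
      rightCount += allStats(rightOffset + i)
      i += 1
    }
    val totalCount = leftCount + rightCount
    if (totalCount == 0.0) return Double.MinValue

    // Gini impurity is 1 - sum_k (count_k / total)^2. Accumulating sums of
    // squared counts lets us evaluate it without allocating any arrays.
    var leftSq = 0.0
    var rightSq = 0.0
    var parentSq = 0.0
    i = 0
    while (i < statsSize) {
      val l = allStats(leftOffset + i)
      val r = allStats(rightOffset + i)
      leftSq += l * l
      rightSq += r * r
      val p = l + r // parent counts are the sum of the child counts
      parentSq += p * p
      i += 1
    }
    val parentImpurity = 1.0 - parentSq / (totalCount * totalCount)
    val leftImpurity =
      if (leftCount == 0.0) 0.0 else 1.0 - leftSq / (leftCount * leftCount)
    val rightImpurity =
      if (rightCount == 0.0) 0.0 else 1.0 - rightSq / (rightCount * rightCount)

    // gain = impurity(parent) - weighted impurities of the children
    parentImpurity -
      (leftCount / totalCount) * leftImpurity -
      (rightCount / totalCount) * rightImpurity
  }
}
```

Because only the winning split needs full statistics, the full `ImpurityStats` object can be built once afterwards, rather than once per candidate.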
How was this patch tested?
Since this is a performance improvement, timing tests are necessary. Here are the improvements for a `RandomForestRegressor` with `maxDepth` set to 30, `subSamplingRate` set to 1, and `maxBins` set to 20, on synthetic data. The timings were calculated locally and the mean of 3 attempts was taken.
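A sketch of how such a benchmark could be set up, assuming a simple synthetic regression generator and wall-clock timing; the dataset shape, generator, and timing loop below are illustrative assumptions, not the exact script behind the numbers above.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

object FindBestSplitsBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("findBestSplits-benchmark")
      .master("local[*]")
      .getOrCreate()

    // Synthetic regression data: 100k rows, 20 noisy features (illustrative).
    val rng = new scala.util.Random(42)
    val rows = (0 until 100000).map { _ =>
      val features = Vectors.dense(Array.fill(20)(rng.nextDouble()))
      (features.toArray.sum + 0.1 * rng.nextGaussian(), features)
    }
    val data = spark.createDataFrame(rows).toDF("label", "features").cache()
    data.count() // materialize the cache before timing

    val rf = new RandomForestRegressor()
      .setMaxDepth(30)
      .setSubsamplingRate(1.0)
      .setMaxBins(20)

    // Mean of 3 attempts, as described above.
    val seconds = (1 to 3).map { _ =>
      val start = System.nanoTime()
      rf.fit(data)
      (System.nanoTime() - start) / 1e9
    }
    println(f"mean fit time: ${seconds.sum / seconds.size}%.2f s")
    spark.stop()
  }
}
```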