[SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision trees (and random forest) #13959
Conversation
@jkbradley @sethah Please have a look when free!
Test build #61425 has finished for PR 13959 at commit
Test build #61427 has finished for PR 13959 at commit
The test failure is just due to binary incompatibility. I can fix that once we decide that the current PR is the way to proceed.
I think we should fix the compatibility issue first rather than leaving this PR incomplete. If it is inactive, I would rather propose to close this for now.
I don't understand. If you don't have time to review, that is fine (I've been there too), but there is no need to close a PR due to unavailability of committers. One of the reasons that I am happy to have stopped contributing to Spark and to focus my energy elsewhere... Thanks!
I think the problem is that this PR was incomplete, and left open. We generally only leave open PRs that are active. There was evidently no interest in proceeding with it; I don't know if it was lack of attention.
The lack of bandwidth in MLlib means that sometimes good code that would make an impact just gets ignored. This is kind of the reality of things. However, if we are going to close the PR simply because committers could not or did not get to it - and that is the case here, IMO - then we should also close the JIRA. Closing a PR for this reason essentially means "we don't see this as an issue worth spending time on." That's a reason to close a JIRA as well. Closing the JIRA will at least prevent others from wasting their time on this issue like @MechCoder did. If we don't close the JIRA, then it seems like we are closing the PR merely because we don't want the "clutter" of long-waiting PRs. But if a PR is still valid, well-written, and solves a real problem, why would we not keep it open? This sends a bad message to contributors, IMO.
True, and I'd probably close the JIRA too. Maybe we can draw @jkbradley's attention for a comment? A closed PR still exists and can be examined or reopened, so it doesn't go away. I'd prefer to close it if it's almost surely not going to be merged, as a minor courtesy to the contributor, rather than leave it open. It's not so much the clutter, though that's a factor. If there are always 500 open PRs, what's one more? Being open carries virtually no information. I both want more closing of things, to reflect the fact that, at this stage, not a lot is going to change -- and want more committers.
FWIW, I think we already had a few discussions on the mailing list about the last resort - automatic closing. I was strongly against this. This is my effort to prevent that for now, for the reasons described above.
This is fine, but are we not also policing JIRAs? I've argued above that the reason this PR has been inactive is simply lack of interest in this issue. If that's the case, then the JIRA must also be closed, since we've implicitly decided that this is no longer, or never was, a problem worth solving. If it is not the case, then we must absolutely leave this open, since it is a) still unsolved, b) good code, and c) still of interest. The reason for closing PRs but not their corresponding JIRAs would be that the PR is either poorly implemented or the author is non-responsive. While it is no doubt frustrating for contributors to submit code that is well-written and solves a problem that a project committer asked for, I also don't know that leaving it open indefinitely is a solution either. I guess I don't understand why we'd be willing to leave JIRAs open indefinitely but not PRs. At any rate, in this case I would have proposed we ask for Joseph's (the issue creator's) input for a few days, and if we hear nothing, close both the JIRA and the PR. We surely do not want others to submit patches for this issue if it will not be reviewed and merged.
Yes, I tried to identify this case. For this PR (and such PRs), the author still looks responsive and active, so personally I do not disagree with re-opening, because this was the point in #18017. Probably, I should have left a comment about this in each PR for clarification, though.
Yes, this is a tough issue. Let's wait and see if @jkbradley has thoughts on this issue. If we don't hear anything, then I'd leave it up to @MechCoder on whether to reopen. Thanks, btw, for taking the time to do the cleanups. It is important and justified in many cases.
@MechCoder, I apologise; the reason for my suggestion was probably not clear initially, and I am sorry if it came across as disrespectful.
What changes were proposed in this pull request?
The current `findBestSplits` method creates an instance of `ImpurityCalculator` and `ImpurityStats` for every possible split and feature in the search for the best split. Every instance of `ImpurityCalculator` creates an array of size `statsSize`, which is unnecessary and takes a non-negligible amount of time. This pull request tackles the problem with the following technique (see the sketch after this list):

- Remove the `ImpurityCalculator` instantiation for every possible split and feature. Replace it with a `calculateGain` method for each impurity that computes the gain directly from the `allStats` attribute of the `DTStatsAggregator`, which holds all the necessary information.
- Replace the creation of `ImpurityStats` for every possible split and feature with just the information gain returned by the `calculateGain` method, since the gain is sufficient to determine the `bestSplit`.
- Return an instance of `ImpurityStats` only once, for the `bestSplit`, via a `calculateImpurityStats` method.
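Below is a minimal, self-contained sketch of the idea, assuming a Gini impurity and a simplified flat layout for `allStats`. The names `calculateGain`, `allStats`, and `statsSize` come from this PR; the offset parameters and the impurity arithmetic are illustrative assumptions, not Spark's exact internals.

```scala
// Hypothetical sketch: compute a split's information gain straight from the
// aggregated label counts, with no per-split ImpurityCalculator or
// ImpurityStats allocations.
object GainSketch {

  /**
   * `allStats` holds label counts for every (feature, bin) pair.
   * `leftOffset` and `rightOffset` point at the left/right child
   * histograms of the candidate split, each of length `statsSize`.
   */
  def calculateGain(
      allStats: Array[Double],
      leftOffset: Int,
      rightOffset: Int,
      statsSize: Int): Double = {
    var leftCount = 0.0
    var rightCount = 0.0
    var i = 0
    while (i < statsSize) {
      leftCount += allStats(leftOffset + i)
      rightCount += allStats(rightOffset + i)
      i += 1
    }
    val totalCount = leftCount + rightCount
    if (totalCount == 0.0) return Double.MinValue

    // Gini impurity is 1 - sum_k (count_k / total)^2. Accumulating sums of
    // squared counts lets us evaluate it without allocating any arrays.
    var leftSq = 0.0
    var rightSq = 0.0
    var parentSq = 0.0
    i = 0
    while (i < statsSize) {
      val l = allStats(leftOffset + i)
      val r = allStats(rightOffset + i)
      leftSq += l * l
      rightSq += r * r
      val p = l + r // parent counts are the sum of the child counts
      parentSq += p * p
      i += 1
    }
    val parentImpurity = 1.0 - parentSq / (totalCount * totalCount)
    val leftImpurity =
      if (leftCount == 0.0) 0.0 else 1.0 - leftSq / (leftCount * leftCount)
    val rightImpurity =
      if (rightCount == 0.0) 0.0 else 1.0 - rightSq / (rightCount * rightCount)

    // gain = impurity(parent) - weighted impurities of the children
    parentImpurity -
      (leftCount / totalCount) * leftImpurity -
      (rightCount / totalCount) * rightImpurity
  }
}
```

Because only the winning split needs full statistics, the full `ImpurityStats` object can be built once afterwards, rather than once per candidate.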
How was this patch tested?
Since this is a performance improvement, timing tests are necessary. Here are the improvements for a `RandomForestRegressor` with `maxDepth` set to 30, `subSamplingRate` set to 1, and `maxBins` set to 20, on synthetic data. The timings were calculated locally and the mean of 3 attempts was taken.
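A sketch of how such a benchmark could be set up, assuming a simple synthetic regression generator and wall-clock timing; the dataset shape, generator, and timing loop below are illustrative assumptions, not the exact script behind the numbers above.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

object FindBestSplitsBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("findBestSplits-benchmark")
      .master("local[*]")
      .getOrCreate()

    // Synthetic regression data: 100k rows, 20 noisy features (illustrative).
    val rng = new scala.util.Random(42)
    val rows = (0 until 100000).map { _ =>
      val features = Vectors.dense(Array.fill(20)(rng.nextDouble()))
      (features.toArray.sum + 0.1 * rng.nextGaussian(), features)
    }
    val data = spark.createDataFrame(rows).toDF("label", "features").cache()
    data.count() // materialize the cache before timing

    val rf = new RandomForestRegressor()
      .setMaxDepth(30)
      .setSubsamplingRate(1.0)
      .setMaxBins(20)

    // Mean of 3 attempts, as described above.
    val seconds = (1 to 3).map { _ =>
      val start = System.nanoTime()
      rf.fit(data)
      (System.nanoTime() - start) / 1e9
    }
    println(f"mean fit time: ${seconds.sum / seconds.size}%.2f s")
    spark.stop()
  }
}
```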