[Review] Random Forest & Decision Tree Regression + major updates to Classification #635
Conversation
vishalmehta1991 commented May 27, 2019 (edited)
- Fixing DT classification number-of-features limitation; major fix for RF classifier against branch-0.7
- Adding regression decision tree and random forest
- Adding entropy for classification; MSE/MAE for regression
- Base classes for RF and DT, derived classes for regression and classification
- Various bug fixes
- Added rfRegressor class to random forest, along with relevant stateless API functions. Implementation currently commented out as DecisionTreeRegressor is not yet implemented. - Updated RF_metrics struct to include regression metrics too. Metrics supported are mean absolute error, mean squared error and median absolute error as per SKL-rf.
- Added a base dt class, similar to what we had in rf. - Added a DecisionTreeRegressor class. API only, no implementation for now.
- Further templated TreeNode to work for both regression and classification. - Moved all predict methods to the base dt class.
…ion and metric info
- Further templated TemporaryMemory (for regression labels) + other minor fixes. - Moved split_branch to base dt class. - Placeholder methods for DecisionTreeRegressor.
- RFRegressor flat API signature fixes. - Minor tempmem fixes. - Temp addition to rf_test for regression testing. Very simple test. - Debugging ongoing.
- DecisionTreeRegressor: supports MSE (mean squared error) or MAE (mean absolute error) as split criterion. - DecisionTreeClassifier: default and only option is GINI.
… for classification
- Removed useless memory allocation from memory.cuh - Added temporary testing folder in randomforest w/ testing script for both regression and classification. This directory will be removed prior to any merge request.
- Preprocess quantiles in batches of batch_cols if there isn't enough device memory to process all ncols at once. - The number of columns per batch is dynamically determined based on the available device memory.
- Updated all our cudaMemcpy calls to updateDevice, updateHost, or updateAsync as appropriate. - Copied recent iota -> permute change from rfRegressor to rfClassifier fit too.
- Also minor update to minmax prim code. - minmax prim use in dt commented out.
- Calling find_best_fruit_all is unnecessary when a node will be considered a leaf due to depth (or max leaves) constraints. - Added stream to cub call in memory.cuh
- Do not update # bins for GLOBAL_QUANTILE when # rows per node < # bins.
Can one of the admins verify this patch?
- Added batching support to minmaxKernel to enable datasets with a large number of features. - Added an extra testcase where ncols wouldn't previously fit in the available shared memory.
…hta1991/cuml into fea-ext-randomforest_regression
…ndomforest_regression - Also run clang-format on files.
- Also more clang formatting changes that the previous commit missed.
- Examples run but accuracy is low. TODO Debug - Corrected doxygen comments in randomforest.cu
…ndomforest_regression - Updates to minmax prim.
…pmem optimization
…ndomforest_regression - Also, updated python wrapper w/ quantile_per_tree.
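One of the commits above batches the quantile preprocessing over columns when the full dataset does not fit in device memory, with the batch size derived from available memory. A minimal sketch of how such a per-batch column count might be computed (hypothetical names and a simplified memory model, not the actual cuML code):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: choose how many columns fit in one preprocessing
// batch given the free device memory. The real cuML code may also account
// for scratch buffers and alignment; this only shows the batching idea.
std::size_t batch_cols(std::size_t free_bytes, std::size_t n_rows,
                       std::size_t elem_size, std::size_t n_cols) {
  std::size_t per_col = n_rows * elem_size;  // bytes needed per column
  std::size_t fit = free_bytes / per_col;    // columns that fit at once
  fit = std::max<std::size_t>(fit, 1);       // always make some progress
  return std::min(fit, n_cols);              // never exceed total columns
}
```

Quantile preprocessing would then loop over `ncols` in chunks of `batch_cols(...)` instead of requiring all columns resident at once.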
It's a huge PR and hard to go over every line of code, but overall LGTM. Please check the two comments I left.
//Cleanup
selected_rows.release(stream);
trees[i].fit(user_handle, input, n_cols, n_rows, labels,
I think decision tree building should be parallelized to get better speedup. It should be relatively straightforward using OpenMP with multiple streams. It can be done in a separate PR.
Agreed, but as per the cuML developer guide we want to keep the algorithms single-threaded. OpenMP can be compiler-specific.
@vishalmehta1991, some of us had a discussion about this on #694. When I get some time, I'm going to submit a PR so that we can make the language in the developer guide clearer.
This is a case that, I believe, would warrant the use of threading, since building trees serially has such a massive impact on the performance of our application. This should be made possible by synchronizing the main stream before your for loop, creating new streams in the threads, and synchronizing them while still in the loop. I have also verified with the CUDA team that the CUDA API is, in fact, thread-safe, so you should not have a problem creating and submitting CUDA kernels from separate threads.
So long as you synchronize the main stream before the for loop, you should not have any problems maintaining thread-safety.
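The pattern described here could be sketched roughly as below. This is an illustration under the assumptions above, not the PR's actual implementation; `fit_forest` and the commented `trees[i].fit(...)` call are stand-ins for the real per-tree build code.

```cuda
#include <cuda_runtime.h>
#include <omp.h>

// Hedged sketch: synchronize the main stream first, then let each OpenMP
// thread build one tree on its own CUDA stream.
void fit_forest(cudaStream_t main_stream, int n_trees) {
  // 1. Ensure all prior work on the main stream (e.g. data transfers)
  //    is complete before worker threads launch kernels on other streams.
  cudaStreamSynchronize(main_stream);

  // 2. Trees are independent, so each iteration can run on its own thread
  //    and stream; the CUDA API itself is thread-safe.
  #pragma omp parallel for
  for (int i = 0; i < n_trees; ++i) {
    cudaStream_t s;
    cudaStreamCreate(&s);
    // trees[i].fit(..., s);      // submit this tree's kernels on stream s
    cudaStreamSynchronize(s);     // wait for this tree before cleaning up
    cudaStreamDestroy(s);
  }
}
```

The per-thread `cudaStreamSynchronize` keeps each tree's work contained within its loop iteration, so no stream outlives the parallel region.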
Hi @cjnolet, I am completely aware of the CUDA API and its thread support. That is not the concern. My concern is how OpenMP is treated across a range of compilers. To be compliant with all the compilers that support OpenMP, we would need to test them.
But again, from a developer's point of view, adding OpenMP is easy and simple. I would be up for it if the RAPIDS developer guide allows me to.
Tagging @jirikraus & @teju85 for additional feedback / suggestions of ways we can design this to build trees in parallel.
@vishalmehta1991, I would like to learn more about the problems related to OpenMP compiler support. Would you mind sharing some references on this problem?
I'm concerned that we're seeing signs that the current implementation's performance is bounded by the number of available CPU cores, as a result of the kernels being submitted in a single stream.
Seriously, what is the real issue here with OpenMP or any other multi-threading library? OpenMP was used in KNN and I don't recall any problem. It's obvious that we need to build the trees in parallel. What do you suggest if you don't want to use OpenMP or a multi-threading library? The only other option I can think of is to redesign and rewrite the whole algorithm from scratch to build the trees in parallel, which would require a significant amount of work. Any other suggestions?
Personally, I don't mind threads; it's embarrassingly parallel. If you want, we can do OpenMP in a separate PR (this PR is too big) and also open a PR to change the cuML developer guide, so that we adhere to cuML guidelines.
Let's merge this PR and try OpenMP in a separate PR. We can change the developer guideline if the OpenMP PR gives us good results.
The wording that is being discussed for our changes to the developer guide:
With the exception of the cumlHandle, cuML algorithms should maintain thread-safety and are, in general, assumed to be single threaded. Exceptions are made for algorithms that can take advantage of multiple CUDA streams in order to oversubscribe or increase occupancy on a single GPU. In these cases, multiple threads should be used only to maintain concurrency of the underlying CUDA streams. Multiple threads should be used sparingly, be bounded, and should not perform CPU-intensive computations.
LGTM.
Agree with @oyilmaz-nvidia. More perf updates can be dealt with in subsequent PRs after thorough rounds of profiling, once @cjnolet provides the Python script he's been using to benchmark the RF code.
Code at the current state looks good for merging.
…ntropy, added missing rf quantile check