Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential CUDA context poison when negative (invalid) categories provided to FIL model [21.10] #4315

Merged

Conversation

levsnv
Copy link
Contributor

@levsnv levsnv commented Oct 29, 2021

Fix potential CUDA context poison due to invalid global read when negative categories provided at inference:

now equivalent to non-matching. Same for +-Inf categories. NAN categories are still !def_left, and fractional categories are truncated (via a typecast), as this is what Treelite does

FIL now converts dummy nodes to numerical on import
and never generates max_matching == -1 categorical features in test.
FIL will still generate empty categorical nodes in test (a non-empty bits vector which contains only zeros), export them as dummy numerical nodes and import again as dummy numerical nodes. If a feature only contains dummy numerical nodes, it will be deemed a numerical feature (same as for non-dummy numerical nodes or a mix thereof). Therefore, categorical feature max_matching == -1 is still prevented.

…inference: now equivalent to non-matching. FIL now converts dummy nodes to numerical on import and never generates max_matching == -1 categorical features in test.
@levsnv levsnv requested a review from a team as a code owner October 29, 2021 04:49
@levsnv levsnv added bug Something isn't working non-breaking Non-breaking change labels Oct 29, 2021
cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

Codecov Report

Merging #4315 (a3dfb0e) into branch-21.12 (bf6ade1) will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff              @@
##           branch-21.12    #4315   +/-   ##
=============================================
  Coverage         86.04%   86.04%           
=============================================
  Files               231      231           
  Lines             18691    18715   +24     
=============================================
+ Hits              16083    16104   +21     
- Misses             2608     2611    +3     
Flag Coverage Δ
dask 47.00% <ø> (-0.01%) ⬇️
non-dask 78.74% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/cuml/manifold/t_sne.pyx 76.95% <0.00%> (-0.51%) ⬇️
python/cuml/linear_model/mbsgd_classifier.pyx 100.00% <0.00%> (ø)
python/cuml/ensemble/randomforestregressor.pyx 83.53% <0.00%> (ø)
python/cuml/dask/ensemble/randomforestregressor.py 88.70% <0.00%> (ø)
...ython/cuml/dask/ensemble/randomforestclassifier.py 61.17% <0.00%> (ø)
python/cuml/naive_bayes/naive_bayes.py 96.10% <0.00%> (+<0.01%) ⬆️
python/cuml/cluster/dbscan.pyx 98.26% <0.00%> (+0.03%) ⬆️
python/cuml/ensemble/randomforest_common.pyx 93.50% <0.00%> (+0.08%) ⬆️
python/cuml/metrics/regression.pyx 95.55% <0.00%> (+0.10%) ⬆️
python/cuml/explainer/kernel_shap.pyx 97.70% <0.00%> (+0.41%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf6ade1...a3dfb0e. Read the comment docs.

@wphicks
Copy link
Contributor

wphicks commented Oct 29, 2021

I think this is targeting the wrong branch? Looks to be targeting 21.12, but it's labeled 21.10

Copy link
Contributor

@canonizer canonizer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! As is, this should address the problem of out-of-bounds access for categorical features.

The main comment is about more testing for different kinds of values for categorical features. Other comments are mostly technical.

cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
cpp/test/sg/fil_test.cu Show resolved Hide resolved
cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Oct 29, 2021
@levsnv levsnv requested a review from a team as a code owner October 29, 2021 21:07
Copy link
Contributor

@canonizer canonizer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

Approved provided that the comments (mostly technical) are addressed.

cpp/test/sg/fil_test.cu Show resolved Hide resolved
python/cuml/fil/fil.pyx Outdated Show resolved Hide resolved
cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved
cpp/test/sg/fil_child_index_test.cu Show resolved Hide resolved
python/cuml/fil/fil.pyx Outdated Show resolved Hide resolved
@levsnv
Copy link
Contributor Author

levsnv commented Nov 3, 2021

@dantegd should I close this PR since there's the https://github.com/rapidsai/cuml/tree/hotfix-21.10-pr-4315 branch is open?
And is there anything I can do to help track that hotfix?
I'm also planning to submit a small optimization to recover some of the slowdown incurred in this bugfix, but that might take until the end of the week. Not sure if that needs to be part of the hotfix.

@levsnv
Copy link
Contributor Author

levsnv commented Nov 3, 2021

Update from Andy: we won't fix the perf for the hotfix, but will for sibling PR #4314.
Also: #4314 shows lightgbm test failures, which is worrying

@levsnv levsnv added the 4 - Waiting on Author Waiting for author to respond to review label Nov 3, 2021
@dantegd
Copy link
Member

dantegd commented Nov 3, 2021

@levsnv we still need this PR, since we want to release a full conda/docker hotfix. The only issue is that it is targetted to branch-21.12 as opposed to 21.10, could you change the target (or I can help doing that as well). After that, we can start the hotfix process later in the week.

@levsnv levsnv changed the base branch from branch-21.12 to branch-21.10 November 3, 2021 03:11
fix edge case of reeeeeeeally large models
give more room by not being pedantic about the name
extra test and technical questions
+1
@@ -510,11 +526,9 @@ def to_categorical(features, n_categorical):
for icol in range(n_categorical, n_features):
df_cols[icol] = pd.Series(features[:, icol])
# shuffle the columns around
seed(42)
new_idx = sample(range(n_features), k=n_features)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see review in #4326 - re: determinism

@levsnv levsnv added 3 - Ready for Review Ready for review by team ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround and removed 4 - Waiting on Author Waiting for author to respond to review labels Nov 9, 2021
@ajschmidt8 ajschmidt8 merged commit fcc4344 into rapidsai:branch-21.10 Nov 10, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
…provided to FIL model (rapidsai#4315)

Fix potential CUDA context poison due to invalid global read when negative categories provided at inference:

now equivalent to non-matching. Same for `+-Inf` categories. NAN categories are still `!def_left`, and fractional categories are truncated (via a typecast), as this is what Treelite does

FIL now converts dummy nodes to numerical on import
and never generates max_matching == -1 categorical features in test.
FIL will still generate empty categorical nodes in test (a non-empty bits vector which contains only zeros), export them as dummy numerical nodes and import again as dummy numerical nodes. If a feature only contains dummy numerical nodes, it will be deemed a numerical feature (same as for non-dummy numerical nodes or a mix thereof). Therefore, categorical feature max_matching == -1 is still prevented.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround bug Something isn't working CUDA/C++ Cython / Python Cython or Python issue non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants