Fix potential CUDA context poison when negative (invalid) categories provided to FIL model [21.10] #4315

levsnv · 2021-10-29T04:49:48Z

Fix potential CUDA context poison due to invalid global read when negative categories provided at inference:

now equivalent to non-matching. Same for +-Inf categories. NAN categories are still !def_left, and fractional categories are truncated (via a typecast), as this is what Treelite does

FIL now converts dummy nodes to numerical on import
and never generates max_matching == -1 categorical features in test.
FIL will still generate empty categorical nodes in test (a non-empty bits vector which contains only zeros), export them as dummy numerical nodes and import again as dummy numerical nodes. If a feature only contains dummy numerical nodes, it will be deemed a numerical feature (same as for non-dummy numerical nodes or a mix thereof). Therefore, categorical feature max_matching == -1 is still prevented.

…inference: now equivalent to non-matching. FIL now converts dummy nodes to numerical on import and never generates max_matching == -1 categorical features in test.

cpp/src/fil/internal.cuh

codecov-commenter · 2021-10-29T10:50:36Z

Codecov Report

Merging #4315 (a3dfb0e) into branch-21.12 (bf6ade1) will increase coverage by 0.00%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##           branch-21.12    #4315   +/-   ##
=============================================
  Coverage         86.04%   86.04%           
=============================================
  Files               231      231           
  Lines             18691    18715   +24     
=============================================
+ Hits              16083    16104   +21     
- Misses             2608     2611    +3

Flag	Coverage Δ
dask	`47.00% <ø> (-0.01%)`	⬇️
non-dask	`78.74% <ø> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/cuml/manifold/t_sne.pyx	`76.95% <0.00%> (-0.51%)`	⬇️
python/cuml/linear_model/mbsgd_classifier.pyx	`100.00% <0.00%> (ø)`
python/cuml/ensemble/randomforestregressor.pyx	`83.53% <0.00%> (ø)`
python/cuml/dask/ensemble/randomforestregressor.py	`88.70% <0.00%> (ø)`
...ython/cuml/dask/ensemble/randomforestclassifier.py	`61.17% <0.00%> (ø)`
python/cuml/naive_bayes/naive_bayes.py	`96.10% <0.00%> (+<0.01%)`	⬆️
python/cuml/cluster/dbscan.pyx	`98.26% <0.00%> (+0.03%)`	⬆️
python/cuml/ensemble/randomforest_common.pyx	`93.50% <0.00%> (+0.08%)`	⬆️
python/cuml/metrics/regression.pyx	`95.55% <0.00%> (+0.10%)`	⬆️
python/cuml/explainer/kernel_shap.pyx	`97.70% <0.00%> (+0.41%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf6ade1...a3dfb0e. Read the comment docs.

wphicks · 2021-10-29T13:18:16Z

I think this is targeting the wrong branch? Looks to be targeting 21.12, but it's labeled 21.10

canonizer

Great! As is, this should address the problem of out-of-bounds access for categorical features.

The main comment is about more testing for different kinds of values for categorical features. Other comments are mostly technical.

cpp/src/fil/internal.cuh

cpp/test/sg/fil_test.cu

cpp/src/fil/internal.cuh

cpp/test/sg/fil_child_index_test.cu

canonizer

Great!

Approved provided that the comments (mostly technical) are addressed.

cpp/test/sg/fil_test.cu

python/cuml/fil/fil.pyx

cpp/src/fil/internal.cuh

cpp/test/sg/fil_child_index_test.cu

python/cuml/fil/fil.pyx

levsnv · 2021-11-03T00:44:31Z

@dantegd should I close this PR since there's the https://github.com/rapidsai/cuml/tree/hotfix-21.10-pr-4315 branch is open?
And is there anything I can do to help track that hotfix?
I'm also planning to submit a small optimization to recover some of the slowdown incurred in this bugfix, but that might take until the end of the week. Not sure if that needs to be part of the hotfix.

levsnv · 2021-11-03T03:09:13Z

Update from Andy: we won't fix the perf for the hotfix, but will for sibling PR #4314.
Also: #4314 shows lightgbm test failures, which is worrying

dantegd · 2021-11-03T03:10:06Z

@levsnv we still need this PR, since we want to release a full conda/docker hotfix. The only issue is that it is targetted to branch-21.12 as opposed to 21.10, could you change the target (or I can help doing that as well). After that, we can start the hotfix process later in the week.

fix edge case of reeeeeeeally large models give more room by not being pedantic about the name extra test and technical questions +1

levsnv · 2021-11-09T04:17:33Z

python/cuml/test/test_fil.py

@@ -510,11 +526,9 @@ def to_categorical(features, n_categorical):
    for icol in range(n_categorical, n_features):
        df_cols[icol] = pd.Series(features[:, icol])
    # shuffle the columns around
-    seed(42)
-    new_idx = sample(range(n_features), k=n_features)


see review in #4326 - re: determinism

…provided to FIL model (rapidsai#4315) Fix potential CUDA context poison due to invalid global read when negative categories provided at inference: now equivalent to non-matching. Same for `+-Inf` categories. NAN categories are still `!def_left`, and fractional categories are truncated (via a typecast), as this is what Treelite does FIL now converts dummy nodes to numerical on import and never generates max_matching == -1 categorical features in test. FIL will still generate empty categorical nodes in test (a non-empty bits vector which contains only zeros), export them as dummy numerical nodes and import again as dummy numerical nodes. If a feature only contains dummy numerical nodes, it will be deemed a numerical feature (same as for non-dummy numerical nodes or a mix thereof). Therefore, categorical feature max_matching == -1 is still prevented.

fixed potential runtime crash due to negative categories provided at …

a3dfb0e

…inference: now equivalent to non-matching. FIL now converts dummy nodes to numerical on import and never generates max_matching == -1 categorical features in test.

levsnv requested a review from a team as a code owner October 29, 2021 04:49

github-actions bot added the CUDA/C++ label Oct 29, 2021

levsnv requested review from wphicks, canonizer and dantegd October 29, 2021 04:50

levsnv added bug Something isn't working non-breaking Non-breaking change labels Oct 29, 2021

levsnv mentioned this pull request Oct 29, 2021

[BUG] within cuml, Treelite might not handle empty categorical nodes #4246

Open

levsnv commented Oct 29, 2021

View reviewed changes

cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved

canonizer suggested changes Oct 29, 2021

View reviewed changes

wphicks approved these changes Oct 29, 2021

View reviewed changes

cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved

cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved

cpp/src/fil/internal.cuh Outdated Show resolved Hide resolved

behavior and documentation more robust now

ea5e370

github-actions bot added the Cython / Python Cython or Python issue label Oct 29, 2021

style, extra comment

ab3c24e

levsnv requested a review from a team as a code owner October 29, 2021 21:07

levsnv added 2 commits October 29, 2021 14:35

fix edge case of reeeeeeeally large models

3f7998f

give more room by not being pedantic about the name

804b21c

levsnv commented Oct 29, 2021

View reviewed changes

cpp/test/sg/fil_child_index_test.cu Show resolved Hide resolved

canonizer approved these changes Oct 29, 2021

View reviewed changes

levsnv added 3 commits October 29, 2021 15:40

extra test and technical questions

a9bc453

+1

d505b05

compile

f311778

levsnv added the 4 - Waiting on Author Waiting for author to respond to review label Nov 3, 2021

levsnv changed the base branch from branch-21.12 to branch-21.10 November 3, 2021 03:11

several commits

be13fa6

fix edge case of reeeeeeeally large models give more room by not being pedantic about the name extra test and technical questions +1

levsnv mentioned this pull request Nov 3, 2021

Use opt-in shared memory carveout for FIL #3759

Merged

levsnv added 2 commits November 8, 2021 20:11

moved to passing local generator in a thread-safe manner

04fba87

remove extra change

04d4cbf

levsnv commented Nov 9, 2021

View reviewed changes

levsnv added 3 - Ready for Review Ready for review by team ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround and removed 4 - Waiting on Author Waiting for author to respond to review labels Nov 9, 2021

dantegd approved these changes Nov 10, 2021

View reviewed changes

ajschmidt8 merged commit fcc4344 into rapidsai:branch-21.10 Nov 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix potential CUDA context poison when negative (invalid) categories provided to FIL model [21.10] #4315

Fix potential CUDA context poison when negative (invalid) categories provided to FIL model [21.10] #4315

levsnv commented Oct 29, 2021 •

edited

Loading

codecov-commenter commented Oct 29, 2021

wphicks commented Oct 29, 2021

canonizer left a comment

canonizer left a comment

levsnv commented Nov 3, 2021

levsnv commented Nov 3, 2021

dantegd commented Nov 3, 2021

levsnv Nov 9, 2021

Fix potential CUDA context poison when negative (invalid) categories provided to FIL model [21.10] #4315

Fix potential CUDA context poison when negative (invalid) categories provided to FIL model [21.10] #4315

Conversation

levsnv commented Oct 29, 2021 • edited Loading

codecov-commenter commented Oct 29, 2021

Codecov Report

wphicks commented Oct 29, 2021

canonizer left a comment

Choose a reason for hiding this comment

canonizer left a comment

Choose a reason for hiding this comment

levsnv commented Nov 3, 2021

levsnv commented Nov 3, 2021

dantegd commented Nov 3, 2021

levsnv Nov 9, 2021

Choose a reason for hiding this comment

levsnv commented Oct 29, 2021 •

edited

Loading