
Fix null-bounds calculation for ranged window queries #7568

Merged

Conversation

mythrocks
Contributor

This commit fixes an off-by-one error in grouped_time_range_rolling_window(), for cases where the order-by column (i.e. the timestamp) contains null values.
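
To make the failure mode concrete, below is a minimal host-side sketch (not the libcudf device code) of how per-row window bounds can be computed for a ranged query over an ascending, NULLS FIRST order-by column. The handling of the null region and all names are illustrative assumptions based on this PR's description; the point is only to show where the null/non-null boundary enters the bounds calculation and why an off-by-one there can index past the column.

// Host-side illustration only: compute half-open window bounds [begin, end)
// per row for a ranged window, assuming the order-by values are sorted
// ascending with nulls grouped first (NULLS FIRST).
#include <cstddef>
#include <iostream>
#include <optional>
#include <vector>

struct bounds { std::size_t begin; std::size_t end; };

std::vector<bounds> range_window_bounds(std::vector<std::optional<long>> const& order_by, long range)
{
  // With NULLS FIRST, the null region occupies [0, nulls_end).
  std::size_t nulls_end = 0;
  while (nulls_end < order_by.size() && !order_by[nulls_end].has_value()) { ++nulls_end; }

  std::vector<bounds> out(order_by.size());
  for (std::size_t i = 0; i < order_by.size(); ++i) {
    if (i < nulls_end) {
      out[i] = {0, nulls_end};  // null rows: the window is exactly the null region
      continue;
    }
    // Non-null rows: the scan must stay within [nulls_end, order_by.size()).
    // An off-by-one in either limit walks past the column -- the kind of
    // out-of-bounds access that can surface as a CUDA launch failure.
    std::size_t lo = i, hi = i + 1;
    while (lo > nulls_end && *order_by[lo - 1] >= *order_by[i] - range) { --lo; }
    while (hi < order_by.size() && *order_by[hi] <= *order_by[i] + range) { ++hi; }
    out[i] = {lo, hi};
  }
  return out;
}

int main()
{
  std::vector<std::optional<long>> ts{std::nullopt, std::nullopt, 1, 2, 2, 5};
  for (auto w : range_window_bounds(ts, 1)) { std::cout << '[' << w.begin << ", " << w.end << ")\n"; }
}

Compiled with -std=c++17, this prints [0, 2) for the two null rows and range-limited windows for the remaining rows.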

The error was reported when running Spark with the spark-rapids plugin enabled, manifesting in a process crash:

21/03/11 04:26:51 ERROR TaskSetManager: Task 3 in stage 4.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 1 times, most recent failure: Lost task 3.0 in stage 4.0 (TID 5, 10.0.0.23, executor driver): ai.rapids.cudf.CudfException: for_each: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
        at ai.rapids.cudf.Table.timeRangeRollingWindowAggregate(Native Method)
        at ai.rapids.cudf.Table.access$3100(Table.java:45)
        at ai.rapids.cudf.Table$AggregateOperation.aggregateWindowsOverTimeRanges(Table.java:2258)
        at com.nvidia.spark.rapids.GpuWindowExpression.$anonfun$evaluateRangeBasedWindowExpression$2(GpuWindowExpression.scala:233)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.GpuWindowExpression.withResource(GpuWindowExpression.scala:132)
        at com.nvidia.spark.rapids.GpuWindowExpression.$anonfun$evaluateRangeBasedWindowExpression$1(GpuWindowExpression.scala:224)

It is interesting that a fair number of the 870+ tests (across 75 test cases) in grouped_rolling_test.cpp exercise this code path, yet cuda-memcheck still does not seem to detect the corruption. Running it on CountMultiGroupTimestampASCNullsFirst, for instance:

✗ cuda-memcheck gtests/GROUPED_ROLLING_TEST --gtest_filter=TypedNullTimestampTestForRangeQueries/0.CountMultiGroupTimestampASCNullsFirst
========= CUDA-MEMCHECK
Note: Google Test filter = TypedNullTimestampTestForRangeQueries/0.CountMultiGroupTimestampASCNullsFirst
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TypedNullTimestampTestForRangeQueries/0, where TypeParam = signed char
[ RUN      ] TypedNullTimestampTestForRangeQueries/0.CountMultiGroupTimestampASCNullsFirst
[       OK ] TypedNullTimestampTestForRangeQueries/0.CountMultiGroupTimestampASCNullsFirst (663 ms)
[----------] 1 test from TypedNullTimestampTestForRangeQueries/0 (663 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (663 ms total)
[  PASSED  ] 1 test.
========= ERROR SUMMARY: 0 errors

@mythrocks mythrocks requested a review from a team as a code owner March 11, 2021 04:29
@mythrocks mythrocks requested review from trxcllnt and jrhemstad and removed request for a team March 11, 2021 04:29
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 11, 2021
@mythrocks mythrocks self-assigned this Mar 11, 2021
@mythrocks mythrocks added 4 - Needs Review Waiting for reviewer to review or respond bug Something isn't working Spark Functionality that helps Spark RAPIDS labels Mar 11, 2021
@mythrocks
Contributor Author

I've raised this against branch-0.18 since this is a bug in a released version of libcudf: users will experience application crashes or data corruption. I'm not sure of the best way to update a released CUDF version.

I'd be happy to retarget this PR to branch-0.19 if that's preferable.

@kkraus14
Collaborator

See the hotfix process here: https://docs.rapids.ai/releases/hotfix/

cc @jrhemstad @harrism for visibility and to weigh in on if this is hotfix worthy

@kkraus14 kkraus14 added ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround non-breaking Non-breaking change and removed 4 - Needs Review Waiting for reviewer to review or respond labels Mar 11, 2021
@kkraus14
Collaborator

In addition to the fix we should add a test that captures this behavior.

@mythrocks
Contributor Author

mythrocks commented Mar 11, 2021

Hello, @kkraus14. Thank you for your advice.

In addition to the fix we should add a test that captures this behavior

This is an off-by-one error that should have been triggered by tests that already exist in grouped_rolling_test.cpp (CountMultiGroupTimestampASCNullsFirst, for instance). So far, I have been unable to reproduce the crash reliably from unit tests, short of using a manually extracted Parquet file. The crash isn't a consistent one, outside of running cuda-memcheck as part of the test.

Please pardon my temerity, but I'm not sure I would have written a test for this case apart from the ones we already have. :/ (I'll keep trying, though.)
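
For illustration, a test along these lines would look roughly like the sketch below. The fixture and test names are made up, and the grouped_time_range_rolling_window parameter list is approximated, so the exact overload on this branch may differ:

#include <cudf/aggregation.hpp>
#include <cudf/rolling.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/wrappers/timestamps.hpp>
#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_wrapper.hpp>
#include <cudf_test/cudf_gtest.hpp>

struct GroupedTimeRangeRollingNullsTest : public cudf::test::BaseFixture {};

TEST_F(GroupedTimeRangeRollingNullsTest, CountSingleGroupTimestampASCNullsFirst)
{
  // One group; day-resolution timestamps sorted ascending, with nulls first
  // (validity 0 marks a null).
  cudf::test::fixed_width_column_wrapper<int32_t> group_keys{0, 0, 0, 0, 0, 0};
  cudf::test::fixed_width_column_wrapper<cudf::timestamp_D, int32_t> timestamps(
    {0, 0, 1, 2, 2, 5}, {0, 0, 1, 1, 1, 1});
  cudf::test::fixed_width_column_wrapper<int32_t> agg_input{10, 20, 30, 40, 50, 60};

  // Window of +/- 1 day around each row's timestamp; null order-by rows
  // window over the null region. Parameter order here is an approximation.
  auto result = cudf::grouped_time_range_rolling_window(cudf::table_view{{group_keys}},
                                                        timestamps,
                                                        cudf::order::ASCENDING,
                                                        agg_input,
                                                        1 /* preceding, days */,
                                                        1 /* following, days */,
                                                        1 /* min_periods */,
                                                        cudf::make_count_aggregation());

  // The expected counts would be verified here, e.g. against a
  // fixed_width_column_wrapper of hand-computed values.
  EXPECT_EQ(result->size(), 6);
}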

@mythrocks
Contributor Author

See the hotfix process here: https://docs.rapids.ai/releases/hotfix/

The process prescribes the following:

  1. Create your branch from the branch-M.B branch

M.B here is the next minor release, i.e. 0.19. I will re-target this PR to branch-0.19, and raise a separate one for main.

@mythrocks mythrocks requested review from a team as code owners March 11, 2021 07:33
@mythrocks mythrocks requested review from galipremsagar and removed request for a team March 11, 2021 07:33
@mythrocks mythrocks changed the base branch from branch-0.18 to branch-0.19 March 11, 2021 07:33
@mythrocks mythrocks removed request for a team March 11, 2021 07:33
@kkraus14
Collaborator

Will let @harrism take a look before we move forward in starting the hotfix release process. Thanks all.

@mike-wendt
Contributor

Working on getting the tests running on this, since we don't have testing enabled for this branch because it is not active.

@raydouglass
Member

rerun tests

@kkraus14
Collaborator

Python build failures are expected because of the Dask branch changes. Do we need to resolve these things on 0.18?

@mike-wendt
Contributor

Python build failures are expected because of the Dask branch changes. Do we need to resolve these things on 0.18?

@mythrocks can you patch the contents of #7535 onto your PR? This way we can get the Python builds and GPU tests to work now that Dask has changed their default branch?

@kkraus14
Collaborator

Needs changes from #7532 as well. I'll add them.

@kkraus14
Collaborator

@mike-wendt changes should be added

@mike-wendt
Contributor

Will merge when @raydouglass returns this afternoon

@mythrocks
Contributor Author

mythrocks commented Mar 12, 2021

Python build failures are expected because of the Dask branch changes. Do we need to resolve these things on 0.18?

@mythrocks can you patch the contents of #7535 onto your PR? This way we can get the Python builds and GPU tests to work now that Dask has changed their default branch?

@mike-wendt: Acknowledging here with gratitude that @kkraus14 updated the PR before I even read this message. :] Thanks!

@mythrocks
Contributor Author

Hmm... Tests are failing for Dask:

10:13:33 Tests failed for dask-cudf-0.18.0a-py37_gb4b3c72479_258.tar.bz2 - moving package to /opt/conda/envs/rapids/conda-bld/broken
10:13:33 WARNING:conda_build.build:Tests failed for dask-cudf-0.18.0a-py37_gb4b3c72479_258.tar.bz2 - moving package to /opt/conda/envs/rapids/conda-bld/broken
10:13:33 WARNING conda_build.build:tests_failed(2890): Tests failed for dask-cudf-0.18.0a-py37_gb4b3c72479_258.tar.bz2 - moving package to /opt/conda/envs/rapids/conda-bld/broken
10:13:33 export PREFIX=/opt/conda/envs/rapids/conda-bld/dask-cudf_1615572558850/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh

Is this a failure to build the conda env?

@kkraus14
Collaborator

13:15:12 ImportError: cannot import name 'stringify_path' from 'dask.bytes.core' (/opt/conda/envs/rapids/conda-bld/dask-cudf_1615572592450/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/lib/python3.7/site-packages/dask/bytes/core.py)

Needs the fix from #7580

@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 12, 2021
@harrism
Member

harrism commented Mar 12, 2021

I'm concerned that our tests failed to catch an error like this. @mythrocks, if this code path is exercised by multiple tests as you say, why do none of them fail? This PR should add a test that fails without the fix and succeeds after the fix.

@kkraus14
Collaborator

I'm concerned that our tests failed to catch an error like this. @mythrocks, if this code path is exercised by multiple tests as you say, why do none of them fail? This PR should add a test that fails without the fix and succeeds after the fix.

Let's add the test in 0.19 so as not to delay the hotfix any further?

@mythrocks
Contributor Author

This PR should add a test that fails without the fix and succeeds after the fix.

It hasn't been for want of effort. Outside of exercising this code path from Spark through a handcrafted Parquet file, I have been unable to reproduce the crash, even by constructing the same input columns programmatically.

Let's add the test in 0.19 so as not to delay the hotfix any further?

I have filed #7590 to add a test to branch-0.19. I will try to hit this ASAP.

rapids-bot bot pushed a commit that referenced this pull request Mar 16, 2021
Closes #7586 

Brings the hotfix from #7568 into branch-0.19.

Authors:
  - Keith Kraus (@kkraus14)
  - Ray Douglass (@raydouglass)
  - MithunR (@mythrocks)

Approvers:
  - Nghia Truong (@ttnghia)

URL: #7589