Fix debrotli issue on CUDA 11.5 #9632
I can confirm that this fixes the issue on both CUDA 11.4 and 11.5 on T4 GPUs.
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9632 +/- ##
================================================
- Coverage 10.79% 10.68% -0.11%
================================================
Files 116 117 +1
Lines 18869 19872 +1003
================================================
+ Hits 2036 2123 +87
- Misses 16833 17749 +916
rerun tests
@galipremsagar could you run a round of fuzz tests with brotli to confirm the fix?
cpp/src/io/comp/debrotli.cu
volatile uint32_t* heap_ptr = reinterpret_cast<volatile uint32_t*>(ext_heap_base);
uint32_t first_free_block = ~0;
auto const len = (bytes + 0xf) & ~0xf;
auto const heap_ptr = static_cast<volatile uint32_t*>(ext_heap_base);
This is const and volatile at the same time?
The pointer is const, and the data it points to is volatile.
I can avoid auto here so the type becomes uint32_t volatile* const, which reads fine (right to left).
Ok, that makes sense. I'd rather see the const on the left of the equal sign.
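The exchange above can be checked at compile time. The following is a minimal sketch, not code from this PR: the function name is hypothetical, and ext_heap_base is assumed to be a plain void* for illustration. It shows that the auto const declaration and the spelled-out uint32_t volatile* const declaration name the same type, where const applies to the pointer and volatile to the pointee.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical helper for illustration only; `ext_heap_base` is assumed to be void*.
inline bool pointer_is_const_pointee_is_volatile(void* ext_heap_base)
{
  assert(ext_heap_base != nullptr);

  // `auto const` makes the pointer itself const; the pointee stays volatile.
  auto const heap_ptr = static_cast<volatile std::uint32_t*>(ext_heap_base);
  // Same type spelled out explicitly, read right to left:
  // "const pointer to volatile uint32_t".
  std::uint32_t volatile* const heap_ptr2 = heap_ptr;

  static_assert(std::is_same<decltype(heap_ptr), decltype(heap_ptr2)>::value,
                "both declarations name the same type");
  static_assert(std::is_const<decltype(heap_ptr)>::value, "the pointer is const");
  static_assert(
    std::is_volatile<std::remove_pointer<decltype(heap_ptr)>::type>::value,
    "the pointee is volatile");
  static_assert(
    !std::is_const<std::remove_pointer<decltype(heap_ptr)>::type>::value,
    "the pointee is not const");

  heap_ptr[0] = 42;  // writing through the pointer is allowed
  // heap_ptr = nullptr;  // would not compile: the pointer is const
  return heap_ptr2[0] == 42;
}
```

Either spelling compiles to the same thing; the choice between auto const and the explicit type is purely about readability at the declaration site.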
@vuule In fuzz-testing we ran into a data-corruption issue with
>>> import cudf
>>> import pandas as pd
>>> actual = cudf.read_parquet('cpu_pdf.parquet')
>>> expected = pd.read_parquet('cpu_pdf.parquet')
(Pdb) actual['21'] # See the last value, it should be 27195 instead of 0.
0 -28871
1 <NA>
2 18224
3 -13182
4 <NA>
...
9426 6841
9427 3473
9428 31914
9429 -22488
9430 0
Name: 21, Length: 9431, dtype: int16
(Pdb) expected['21']
0 -28871
1 <NA>
2 18224
3 -13182
4 <NA>
...
9426 6841
9427 3473
9428 31914
9429 -22488
9430 27195
Name: 21, Length: 9431, dtype: Int16
(Pdb) expected['21'][actual['21'].to_pandas(nullable=True) != expected['21']] # 1339 out of 9431 values are corrupted.
6571 -25740
6573 6433
6575 -21601
6576 19348
6577 4142
...
9418 28837
9420 4155
9423 8501
9424 28381
9430 27195
Name: 21, Length: 1339, dtype: Int16
Approving python changes
Rerun tests.
Verified with fuzz-testing, no issues found. 🎉
@gpucibot merge
Closes #9546
This PR fixes the issue, likely through the elimination of undefined behavior.
Modified the local heap implementation to return void* instead of uint8_t*. This greatly reduces the number of reinterpret_casts. Also changed the heap type to char*, presumably reducing or eliminating aliasing issues.
Some other cleanup in related code is included.
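
As a rough illustration of the pattern described above, here is a minimal host-side sketch. It is not the debrotli local heap: local_heap and local_alloc are hypothetical names, and the allocator is a simple bump allocator assumed only for this example. The 16-byte rounding mirrors the (bytes + 0xf) & ~0xf expression in the reviewed diff. Storage is held as char*, which is allowed to alias any object type, and the allocator returns void* so that callers convert with static_cast rather than reinterpret_cast.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator for illustration; not the debrotli implementation.
struct local_heap {
  char* base;          // char* may alias any object type
  std::size_t size;    // total bytes available
  std::size_t offset;  // bytes already handed out
};

// Returns void* so callers use static_cast<T*> instead of reinterpret_cast<T*>.
// Returns nullptr when the request does not fit.
inline void* local_alloc(local_heap& heap, std::size_t bytes)
{
  // Round the request up to a multiple of 16 bytes.
  std::size_t const len = (bytes + 0xf) & ~static_cast<std::size_t>(0xf);
  if (len > heap.size - heap.offset) { return nullptr; }
  void* const ptr = heap.base + heap.offset;
  heap.offset += len;
  return ptr;
}
```

A caller would then write auto* words = static_cast<std::uint32_t*>(local_alloc(heap, 20)); with no reinterpret_cast at the call site. Whether this alone removes every aliasing hazard depends on how the memory is subsequently accessed.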