Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include encode type in the error message when unsupported Parquet encoding is detected #14453

Merged
merged 31 commits into from
Dec 20, 2023

Conversation

ZelboK
Copy link
Contributor

@ZelboK ZelboK commented Nov 20, 2023

Description

Per #14209 this will list out unsupported encodings that were found.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ZelboK ZelboK requested a review from a team as a code owner November 20, 2023 15:48
Copy link

copy-pr-bot bot commented Nov 20, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 20, 2023
@bdice bdice added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 20, 2023
@bdice
Copy link
Contributor

bdice commented Nov 20, 2023

@ZelboK Could you run pre-commit? See instructions in the Contributing Guide. I think this PR is going to fail C++ style checks from clang-format in its current state. https://github.com/rapidsai/cudf/blob/branch-23.12/CONTRIBUTING.md#using-pre-commit-hooks

@ZelboK
Copy link
Contributor Author

ZelboK commented Nov 20, 2023

@ZelboK Could you run pre-commit? See instructions in the Contributing Guide. I think this PR is going to fail C++ style checks from clang-format in its current state. https://github.com/rapidsai/cudf/blob/branch-23.12/CONTRIBUTING.md#using-pre-commit-hooks

Will do, thanks for mentioning!

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. Just a few suggestions :)

cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
@wence-
Copy link
Contributor

wence- commented Nov 20, 2023

/ok to test

@ZelboK
Copy link
Contributor Author

ZelboK commented Nov 20, 2023

Question: I can't use BitAnd because the newly added kernel_error class's value() does not return a scoped enumeration. BitAnd uses SFINAE to only allow for scoped enums to use it.
I can:

  1. Add new overload for BitAnd
  2. Modify kernel_error to use scoped enumerations? (Might entail a lot more effort)
  3. Maybe im forgetting some simple way to map the types properly. I'm still kinda new to C++ so i apologize.
  4. Just not use BitAnd and do a regular bitwise operation.

I hesitate doing 1, or 2, because I don't have enough context for the codebase to know the friction it may cause. I imagine there's some reason BitAnd is constrained on scoped enums, and I don't have a good visual on how much work is required for 2.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, thanks for the quick changes :) A few more minor suggestions.

cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
@vuule vuule changed the title Feat 14209 exception msg Include encode type in the error message when unsupported Parquet encoding is detected Nov 20, 2023
@vuule vuule added the cuIO cuIO issue label Nov 20, 2023
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, a few more suggestions.

cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
@GregoryKimball
Copy link
Contributor

Thank you @ZelboK and @etseidl for iterating on this!

@ZelboK
Copy link
Contributor Author

ZelboK commented Dec 19, 2023

I was not getting the list of bad encodings. One problem was using the device iterators, but another issue is that count_page_headers will also receive the same error code, but the pages vector doesn't exist yet so we can't use that to get the set of bad encodings. I'd suggest testing for UNSUPPORTED_ENCODING in the error handling for count_page_headers() and not FAIL if that flag is set.

Is there any way I can test this behavior without having to rely on cudf python? Unfortunately I've run into a few blockers trying to get it set up. I'd like to finish this PR but it wouldn't be too efficient for me to offload the testing part to you like this. Builds are also quite slow without access to the cloud build artifacts.

I should be able to create unsupported encodings in a unit test I hope? I wasn't sure if the tests didn't have them because they weren't perceived to add value or if there was some complication I'm unaware of.

@etseidl
Copy link
Contributor

etseidl commented Dec 19, 2023

Is there any way I can test this behavior without having to rely on cudf python?

You could write a small C++ prog that calls cudf::io::read_parquet(). Compiling it might be an adventure 😅

@ZelboK
Copy link
Contributor Author

ZelboK commented Dec 19, 2023

Is there any way I can test this behavior without having to rely on cudf python?

You could write a small C++ prog that calls cudf::io::read_parquet(). Compiling it might be an adventure 😅

Hm. Curious to know - when it comes to cuIO will I often need to run things from cudf python to test out functionality? If so that would justify me putting more effort into getting cudf python to work. I just have a lot more motivation for getting more experience with GPU programming than I do with tooling so I put it off lol. I'll spend a couple hours trying to figure this out

(btw this is the problem i get when trying to run pytest, forgive me, I don't program in python often 😅 ). Builds just fine, tests jut don't start up for me.

======================== test session starts ========================
platform linux -- Python 3.10.13, pytest-7.4.3, pluggy-1.3.0 -- /home/ksm/anaconda3/envs/cudf_dev/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/home/ksm/work/improving/cudf/python/cudf/.hypothesis/examples'))
rootdir: /home/ksm/work/improving/cudf/python/cudf
plugins: benchmark-4.0.0, cases-3.8.1, anyio-4.1.0, cov-4.1.0, hypothesis-6.92.0, xdist-3.5.0
collected 0 items / 1 error                                         

============================== ERRORS ===============================
___________________ ERROR collecting test session ___________________
/home/ksm/anaconda3/envs/cudf_dev/lib/python3.10/site-packages/_pytest/config/__init__.py:641: in _importconftest
    mod = import_path(conftestpath, mode=importmode, root=rootpath)
/home/ksm/anaconda3/envs/cudf_dev/lib/python3.10/site-packages/_pytest/pathlib.py:567: in import_path
    importlib.import_module(module_name)
/home/ksm/anaconda3/envs/cudf_dev/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1050: in _gcd_import
    ???
<frozen importlib._bootstrap>:1027: in _find_and_load
    ???
<frozen importlib._bootstrap>:1006: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:688: in _load_unlocked
    ???
/home/ksm/anaconda3/envs/cudf_dev/lib/python3.10/site-packages/_pytest/assertion/rewrite.py:186: in exec_module
    exec(co, module.__dict__)
benchmarks/conftest.py:59: in <module>
    from config import cudf  # noqa: W0611, E402, F401
benchmarks/common/config.py:46: in <module>
    import cudf  # noqa: W0611, F401
cudf/__init__.py:9: in <module>
    _setup_numba()
cudf/utils/_numba.py:122: in _setup_numba
    ptx_toolkit_version = _get_cuda_version_from_ptx_file(CC_60_PTX_FILE)
cudf/utils/_numba.py:157: in _get_cuda_version_from_ptx_file
    with open(path) as ptx_file:
E   FileNotFoundError: [Errno 2] No such file or directory: '/home/ksm/work/improving/cudf/python/cudf/cudf/utils/../core/udf/shim_60.ptx'
====================== short test summary info ======================
ERROR  - FileNotFoundError: [Errno 2] No such file or directory: '/home/k...
!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!
========================= 1 error in 0.77s ==========================
(cudf_dev) ksm@Kashimo:~/work/improving/cudf/python/cudf$

Edit:

NVM! I fixed them :D

@DanialJavady96
Copy link
Contributor

DanialJavady96 commented Dec 19, 2023

I was not getting the list of bad encodings. One problem was using the device iterators, but another issue is that count_page_headers will also receive the same error code, but the pages vector doesn't exist yet so we can't use that to get the set of bad encodings. I'd suggest testing for UNSUPPORTED_ENCODING in the error handling for count_page_headers() and not FAIL if that flag is set.

Might be misunderstanding, can you clarify a few things?
Just wondering, count_page_headers() without the pages vector will not be able to give anything past a general message no? Are you saying to relocate the logic in decode_page_headers over in count_page_headers()? I can use CUDF_LOG_ERROR to give a general message to the user.

Or are you saying that this logic

   if (error_code.value() != 0) {
    CUDF_FAIL("Parquet header parsing failed with code(s) " + error_code.str());
   }

is impacting the logic in decode_page_headers() because that shouldn't be the case.

Edit: Okay so it would seem that this is the case.

(cudf_dev) ksm@Kashimo:~/work/improving/cudf/python/cudf/cudf/tests$ pytest -s test_parquet.py::test_parquet_reader_unsupported_page_encoding
==================================== test session starts =====================================
platform linux -- Python 3.10.13, pytest-7.4.3, pluggy-1.3.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/ksm/work/improving/cudf/python/cudf/cudf/tests
configfile: pytest.ini
plugins: benchmark-4.0.0, cases-3.8.1, anyio-4.1.0, cov-4.1.0, hypothesis-6.92.0, xdist-3.5.0
collected 1 item

test_parquet.py [ 51428][12:58:39:100371][error ] Parquet header parsing failed with code(s) 0x4
[ 51428][12:58:39:102664][error ] Unsupported encodings found: DELTA_LENGTH_BYTE_ARRAY

is what I get, if I change CUDF_FAIL to CUDF_LOG_ERROR instead, in count_page_headers(). It'll fail before decode_page_headers gets a chance to? Do we need both of these in both functions?

@vuule
Copy link
Contributor

vuule commented Dec 19, 2023

I think @etseidl 's idea is to completely ignore UNSUPPORTED_ENCODING error in count_page_headers so we can get to print the unsupported encoding types after DecodePageHeaders, where it will be reported again.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more small request 😅

Then I think it's good to go.

cpp/src/io/parquet/reader_impl_preprocess.cu Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch, looks good!
Few small suggestions

cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
@ttnghia
Copy link
Contributor

ttnghia commented Dec 20, 2023

/ok to test

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Dec 20, 2023
@vuule vuule requested a review from etseidl December 20, 2023 20:20
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my nits have been picked 😄. LGTM

@vuule
Copy link
Contributor

vuule commented Dec 20, 2023

/merge

@rapids-bot rapids-bot bot merged commit 225040e into rapidsai:branch-24.02 Dec 20, 2023
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

9 participants