Add tests to check compatibility with fastparquet
#9366
Conversation
Signed-off-by: MithunR <[email protected]>
Also, recombined the test functions.
Plus, minor refactors.
Build
1. Fixed dataframe size in `test_read_fastparquet_single_column_tables`.
2. Removed unnecessary test params for write confs.
3. Moved fastparquet import to top of test file.
4. Expanded on test failure description for timestamps.
Build
To introduce a new test dep, please also update
Also, I would recommend providing a strategy here to enable/disable these cases.
My biggest concern here really is just follow on work. There are so many xfails. Are they things that we don't think will ever be fixed? If so then we might want to not test them at all. Are they things that we just have not had time to track down and file issues to fix? If so then we really should have a follow on issue to do that. Is it something else?
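One way to keep such follow-on work trackable is to require every xfail to carry its tracking issue. This is only a sketch of that idea, not this PR's actual code; the helper name and the issue number below are placeholders:

```python
import pytest

# Hypothetical helper (illustrative, not from the PR): refuse to create an
# xfail mark unless a tracking-issue URL is supplied, so untriaged xfails
# cannot be added silently.
def tracked_xfail(issue_url):
    assert issue_url.startswith("https://github.com/"), "xfail needs a tracking issue"
    return pytest.mark.xfail(reason="Tracked at " + issue_url)

# Hypothetical parametrized case; the issue number is a placeholder.
decimal_case = pytest.param(
    "decimal_column",
    marks=tracked_xfail("https://github.com/NVIDIA/spark-rapids/issues/0000"),
)
```

With this pattern, grepping the test file for `tracked_xfail` yields the full list of open follow-up issues.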
I'll reply to the individual comments in a moment. A lot of the xfails have to do with the inter-conversions between Pandas and Spark, and dtype inference.
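A minimal illustration (not taken from the PR) of the Pandas dtype-inference pitfall behind many of these xfails: a nullable integer column is widened to `float64` by Pandas, so it no longer round-trips cleanly against Spark's `IntegerType`.

```python
import pandas as pd

# A None in an otherwise-integer column forces Pandas to infer float64,
# because NaN is the default missing-value sentinel for numeric data.
df = pd.DataFrame({"a": [1, 2, None]})
print(df["a"].dtype)  # float64
```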
Classification of xfails:
- For fixing in Spark RAPIDS / CUDF:
- For follow-up in Spark RAPIDS tests:
- Problems in
I'm at a loss. I'm not sure where I've missed adding
FYI https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/Jenkinsfile-blossom.premerge#L34

NOTE: the pre-merge CI image build runs every 4 hours (so there could be a gap between merging this PR and the new pre-merge image becoming available to others). To avoid affecting others' pre-merge CI, you also need to manually trigger an ops/docker_image-manager build immediately after merging. You can also allow me to do the merge, to avoid potential blocking.
I see your point: your Dockerfile change was not considered a modified file in pre-merge CI. Let me take a look. UPDATE: fixed in #9415
build
The Dockerfile change detector works fine with the same merged commit locally. UPDATE: fixed in #9415
build in current trigger
the installed one
I still recommend adding a flag so people can choose whether to enable these cases. Thanks!
Dockerfile-blossom.integration.centos is deprecated.
@pxLi: So as not to break anyone else's tests, I have modified the test to check whether
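A guard of the kind described here can be written as a module-level skip. This is a hypothetical sketch (names are illustrative, not the PR's exact code), assuming the check is simply whether `fastparquet` can be imported:

```python
import pytest

# If fastparquet is not installed, record that fact instead of failing at
# import time, so environments without the new dependency are unaffected.
try:
    import fastparquet  # noqa: F401
    _fastparquet_missing = False
except ImportError:
    _fastparquet_missing = True

# Module-level mark: skips every test in this file when the dep is absent.
pytestmark = pytest.mark.skipif(
    _fastparquet_missing,
    reason="fastparquet is not installed; skipping compatibility tests",
)
```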
Build
+1, for CI update thanks!
Spark does not support
We also cannot fix the fastparquet code.
LGTM
Did not find tests for
LGTM
This commit adds tests to check for compatibility between the Spark RAPIDS plugin and `fastparquet`.

The tests include the following scenarios:

1. Tables written with `fastparquet` are read correctly with the Spark RAPIDS plugin.
2. … `fastparquet`.
3. … `fastparquet`.

There are known limitations for these tests:

1. … `spark-3.4.x`. This limits the Spark versions where the test may be run.
2. … `fastparquet` have known limitations with certain datatypes. For instance:
   a. `fastparquet` does not support `DECIMAL` columns, as per documentation; these are treated as `FLOAT`.
   b. There seem to be limits on the ranges of `Date` and `Timestamp` values, as compared to Apache Spark. For instance, a date like `date(year=8543, month=12, day=31)` is deemed out of range in Pandas/`fastparquet`. Such a date can neither be written nor read back correctly via `fastparquet`.
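The out-of-range date above follows from Pandas' nanosecond-resolution `datetime64[ns]` representation, which only covers roughly 1677-09-21 through 2262-04-11. A quick check (assuming Pandas at its default nanosecond timestamp resolution):

```python
import datetime
import pandas as pd

# The representable bounds of a nanosecond-resolution Pandas timestamp.
print(pd.Timestamp.min)  # around 1677-09-21
print(pd.Timestamp.max)  # around 2262-04-11

# The date from the PR description falls far outside those bounds,
# which is why Pandas/fastparquet deem it out of range.
spark_date = datetime.date(year=8543, month=12, day=31)
in_bounds = pd.Timestamp.min.date() <= spark_date <= pd.Timestamp.max.date()
print(in_bounds)  # False
```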