
Fix fastparquet tests to work with HDFS #9583

Merged

Conversation

mythrocks
Collaborator

Fixes #9545.

This commit fixes the `fastparquet` tests to run on Spark clusters where `fs.default.name` does not point to the local filesystem.

Before this commit, the `fastparquet` tests assumed that the Parquet files generated for the tests were written to the local filesystem, and could be read by both `fastparquet` and Spark from the same location. However, this fails on clusters whose default filesystem is HDFS, since `fastparquet` can only read from the local filesystem.

This commit changes the tests as follows:

  1. For tests where data is generated by Spark, the data is copied to the local filesystem before it is read by `fastparquet`.
  2. For tests where data is generated by `fastparquet`, the data is copied to the default Hadoop filesystem before it is read through Spark.
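The first step above can be sketched as follows. This is a minimal stand-in using only Python's standard library: `copy_to_local` is a hypothetical helper (not the name used in the actual tests), and on a real HDFS cluster the copy would go through the Hadoop FileSystem API or `hadoop fs -copyToLocal` rather than `shutil`, which is what the copy degenerates to when the default filesystem is already local.

```python
import pathlib
import shutil
import tempfile

def copy_to_local(default_fs_path: str, local_base: str) -> str:
    """Hypothetical helper: copy test data from the cluster's default
    filesystem to a local path so fastparquet can read it.  On a real
    HDFS cluster this would shell out to `hadoop fs -copyToLocal` (or
    use the Hadoop FileSystem API); plain shutil stands in here."""
    dest = pathlib.Path(local_base) / pathlib.Path(default_fs_path).name
    shutil.copytree(default_fs_path, dest)
    return str(dest)

# Usage sketch: Spark writes to `data_path`; fastparquet reads the local copy.
with tempfile.TemporaryDirectory() as tmp:
    data_path = pathlib.Path(tmp) / "FASTPARQUET_TEST_GPU_WRITE_PATH"
    data_path.mkdir()
    (data_path / "part-00000.parquet").write_bytes(b"PAR1")  # placeholder file
    local_base = pathlib.Path(tmp) / "local"
    local_base.mkdir()
    local_copy = copy_to_local(str(data_path), str(local_base))
    print(sorted(p.name for p in pathlib.Path(local_copy).iterdir()))
```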

Signed-off-by: MithunR <[email protected]>
@mythrocks
Collaborator Author

Build

@@ -167,6 +203,8 @@ def test_reading_file_written_with_gpu(spark_tmp_path, column_gen):
There are xfails here because of fastparquet limitations in handling Decimal, Timestamps, Dates, etc.
"""
data_path = spark_tmp_path + "/FASTPARQUET_TEST_GPU_WRITE_PATH"
local_base_path = (spark_tmp_path + "_local")
Collaborator

I think this is okay, but it does assume that the temp path in HDFS is going to be available in the local file system too. I don't think there is anything we can do about that though.

@mythrocks mythrocks merged commit 58612bb into NVIDIA:branch-23.12 Oct 31, 2023
38 checks passed
@mythrocks
Collaborator Author

mythrocks commented Oct 31, 2023

I've merged this change. With any luck, this should unblock the CDH builds.
Thank you for reviewing, @revans2.

@gerashegalov
Collaborator

Alternative:

`fastparquet` uses fsspec, which provides a number of filesystem implementations out of the box: https://github.com/fsspec/filesystem_spec/tree/master/fsspec/implementations. These include WebHDFS, which can be enabled on CDH.
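The fsspec dispatch this comment describes can be sketched as below. The `"memory"` implementation stands in so the snippet runs without a cluster; the WebHDFS host and port in the comment are hypothetical and would depend on the cluster's NameNode configuration.

```python
import fsspec

# fsspec dispatches on a protocol name to one of its bundled filesystem
# implementations.  For the WebHDFS alternative suggested here, the call
# would be something like (hypothetical endpoint, with WebHDFS enabled
# on the NameNode):
#   fs = fsspec.filesystem("webhdfs", host="namenode.example.com", port=9870)
# The "memory" implementation stands in below so the sketch is runnable.
fs = fsspec.filesystem("memory")
with fs.open("/demo/part-00000.parquet", "wb") as f:
    f.write(b"PAR1")
print(fs.cat("/demo/part-00000.parquet"))
```

With such a filesystem object, `fastparquet` could read remote files directly instead of the tests copying data to the local filesystem first.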

Labels
test Only impacts tests
Successfully merging this pull request may close these issues.

[BUG] Dataproc 2.0 test_reading_file_rewritten_with_fastparquet tests failing
4 participants