
Fix fastparquet tests to work with HDFS #9583

Merged

Conversation

mythrocks
Collaborator

Fixes #9545.

This commit fixes the `fastparquet` tests to run on Spark clusters where `fs.default.name` does not point to the local filesystem.

Before this commit, the `fastparquet` tests assumed that the Parquet files generated for the tests were written to the local filesystem, and could be read by both `fastparquet` and Spark from the same location. However, this fails on clusters whose default filesystem is HDFS, since `fastparquet` can only read from the local filesystem.

This commit changes the tests as follows:

  1. For tests where data is generated by Spark, the data is copied to the local filesystem before it is read by `fastparquet`.
  2. For tests where data is generated by `fastparquet`, the data is copied to the default Hadoop filesystem before it is read through Spark.
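The first step above can be sketched as follows. This is a minimal stand-in using only Python's standard library: `copy_to_local` is a hypothetical helper (not the name used in the actual tests), and on a real HDFS cluster the copy would go through the Hadoop FileSystem API or `hadoop fs -copyToLocal` rather than `shutil`, which is what the copy degenerates to when the default filesystem is already local.

```python
import pathlib
import shutil
import tempfile

def copy_to_local(default_fs_path: str, local_base: str) -> str:
    """Hypothetical helper: copy test data from the cluster's default
    filesystem to a local path so fastparquet can read it.  On a real
    HDFS cluster this would shell out to `hadoop fs -copyToLocal` (or
    use the Hadoop FileSystem API); plain shutil stands in here."""
    dest = pathlib.Path(local_base) / pathlib.Path(default_fs_path).name
    shutil.copytree(default_fs_path, dest)
    return str(dest)

# Usage sketch: Spark writes to `data_path`; fastparquet reads the local copy.
with tempfile.TemporaryDirectory() as tmp:
    data_path = pathlib.Path(tmp) / "FASTPARQUET_TEST_GPU_WRITE_PATH"
    data_path.mkdir()
    (data_path / "part-00000.parquet").write_bytes(b"PAR1")  # placeholder file
    local_base = pathlib.Path(tmp) / "local"
    local_base.mkdir()
    local_copy = copy_to_local(str(data_path), str(local_base))
    print(sorted(p.name for p in pathlib.Path(local_copy).iterdir()))
```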

Signed-off-by: MithunR <[email protected]>
@mythrocks
Collaborator Author

Build

@@ -167,6 +203,8 @@ def test_reading_file_written_with_gpu(spark_tmp_path, column_gen):
There are xfails here because of fastparquet limitations in handling Decimal, Timestamps, Dates, etc.
"""
data_path = spark_tmp_path + "/FASTPARQUET_TEST_GPU_WRITE_PATH"
local_base_path = (spark_tmp_path + "_local")
Collaborator

I think this is okay, but it does assume that the temp path in HDFS is going to be available in the local file system too. I don't think there is anything we can do about that though.

@mythrocks mythrocks merged commit 58612bb into NVIDIA:branch-23.12 Oct 31, 2023
38 checks passed
@mythrocks
Collaborator Author

mythrocks commented Oct 31, 2023

I've merged this change. With any luck, this should unblock the CDH builds.
Thank you for reviewing, @revans2.

@gerashegalov
Collaborator

Alternative:

`fastparquet` uses fsspec, which provides a number of filesystem implementations out of the box: https://github.com/fsspec/filesystem_spec/tree/master/fsspec/implementations. These include WebHDFS, which can be enabled on CDH.
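The fsspec dispatch this comment describes can be sketched as below. The `"memory"` implementation stands in so the snippet runs without a cluster; the WebHDFS host and port in the comment are hypothetical and would depend on the cluster's NameNode configuration.

```python
import fsspec

# fsspec dispatches on a protocol name to one of its bundled filesystem
# implementations.  For the WebHDFS alternative suggested here, the call
# would be something like (hypothetical endpoint, with WebHDFS enabled
# on the NameNode):
#   fs = fsspec.filesystem("webhdfs", host="namenode.example.com", port=9870)
# The "memory" implementation stands in below so the sketch is runnable.
fs = fsspec.filesystem("memory")
with fs.open("/demo/part-00000.parquet", "wb") as f:
    f.write(b"PAR1")
print(fs.cat("/demo/part-00000.parquet"))
```

With such a filesystem object, `fastparquet` could read remote files directly instead of the tests copying data to the local filesystem first.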

Labels
test Only impacts tests
Successfully merging this pull request may close these issues.

[BUG] Dataproc 2.0 test_reading_file_rewritten_with_fastparquet tests failing
4 participants