Update locate_parquet_testing_files function to support hdfs input path for dataproc CI #10356

yinqingh · 2024-02-01T07:25:29Z

To fix #10255

Adds an hdfs_glob function to support HDFS paths.

Signed-off-by: Yinqing Hao <[email protected]>
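
For readers unfamiliar with the approach, here is a minimal sketch of how a glob over an HDFS directory can be delegated to the `hadoop fs -ls -C` command used in this change. It is not the merged implementation: the names `build_ls_command` and `hdfs_glob_sketch`, the injectable `run` parameter, and the choice of `RuntimeError` are assumptions made purely for illustration.

```python
import subprocess
from pathlib import PurePosixPath


def build_ls_command(path, pattern):
    """Build the argv that lists files matching `pattern` under `path`.

    `-C` makes `hadoop fs -ls` print only the paths, one per line.
    """
    full_pattern = path.as_posix() + '/' + pattern
    return ['hadoop', 'fs', '-ls', '-C', full_pattern]


def hdfs_glob_sketch(path, pattern, run=subprocess.run):
    """Yield the paths that `hadoop fs -ls -C` reports for the pattern.

    `run` is a hypothetical injection point so the sketch can be exercised
    without a Hadoop installation; it must behave like subprocess.run.
    """
    result = run(build_ls_command(path, pattern),
                 capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'Failed to list files from hdfs: {result.stderr}')
    for line in result.stdout.splitlines():
        line = line.strip()
        if line:  # skip blank lines in the command output
            yield PurePosixPath(line)
```

With a real cluster, `list(hdfs_glob_sketch(PurePosixPath('hdfs:/some/dir'), '*.parquet'))` would start one `hadoop` process per call, which is the cost raised later in the review.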

jlowe

This should be targeted to 24.04 since the corresponding issue was moved to 24.04.

yinqingh · 2024-02-02T02:13:43Z

Retargeted to 24.04

jlowe

Copyrights that I missed in the first review and a minor suggestion, but otherwise LGTM.

yinqingh · 2024-02-05T01:04:21Z

build

NvTimLiu · 2024-02-05T01:32:24Z

build

gerashegalov · 2024-02-05T02:58:59Z

integration_tests/src/main/python/parquet_testing_test.py

+    """
+    path_str = path.as_posix()
+    full_pattern = path_str + '/' + pattern
+    cmd = ['hadoop', 'fs', '-ls', '-C', full_pattern]


It should be possible to call FileSystem globStatus directly via Py4J, without forking a JVM via hadoop fs.

yinqingh · 2024-02-05

Hi Gera, thanks for the review!

I tried the following code, which uses Py4J to glob files by pattern, and it works as well.
But it seems less straightforward than the hadoop fs -ls command, and we also need to process the path string further.

Considering readability, I think we could probably keep the current implementation?

Suggested change

-    cmd = ['hadoop', 'fs', '-ls', '-C', full_pattern]
+    path_str = path.as_posix()
+    full_pattern = path_str + '/' + pattern
+    sc = get_spark_i_know_what_i_am_doing().sparkContext
+    config = sc._jsc.hadoopConfiguration()
+    fs_path = sc._jvm.org.apache.hadoop.fs.Path(full_pattern)
+    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(config)
+    statuses = fs.globStatus(fs_path)
+    for status in statuses:
+        # status.getPath().toString() returns a string like
+        # "hdfs://hostname:8020/src/test/resources/parquet-testing/data/single_nan.parquet",
+        # but pathlib.Path removes the first "/" and converts it to
+        # "hdfs:/hostname:8020/src/test/resources/parquet-testing/data/single_nan.parquet",
+        # which is no longer a valid path, so we need to rebuild the path like this.
+        p = f'hdfs:{status.getPath().toUri().getPath()}'
+        yield Path(p)

gerashegalov · 2024-02-05 (edited)

I find it readable enough. Not a must-fix, but it would probably save us from troubleshooting various OOMs and other resource issues down the road, especially in the xdist case. Keep in mind that JVM initialization is slow, and Hadoop adds significant XML parsing overhead on top of it. We can file it as a potential improvement and mitigation.

yinqingh · 2024-02-06

I will file it as an improvement. Thanks a lot!
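
The path handling discussed in this thread can be reproduced without Spark or Hadoop. The sketch below shows pathlib collapsing the "//" of an HDFS URI, and then the same workaround as in the suggestion: keep only the path component and put the scheme back in front of it. Here `urlsplit` from the Python standard library stands in for the Java `toUri().getPath()` call, which is an assumption made so the example runs on its own; the variable names are illustrative.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

uri = 'hdfs://hostname:8020/src/test/resources/parquet-testing/data/single_nan.parquet'

# pathlib treats the URI as an ordinary POSIX path, so the empty component
# between the two slashes after "hdfs:" is dropped.
collapsed = PurePosixPath(uri).as_posix()
print(collapsed)
# hdfs:/hostname:8020/src/test/resources/parquet-testing/data/single_nan.parquet

# Workaround: take only the path component (no host or port) and re-prefix
# the scheme, so there is no "//" left for pathlib to collapse.
rebuilt = PurePosixPath('hdfs:' + urlsplit(uri).path).as_posix()
print(rebuilt)
# hdfs:/src/test/resources/parquet-testing/data/single_nan.parquet
```

The rebuilt form drops the host and port, so it relies on the cluster's default file system to resolve the path.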

Update locate_parquet_testing_files to support hdfs input path

4ccccba

Signed-off-by: Yinqing Hao <[email protected]>

yinqingh requested review from jlowe, GaryShen2008, NvTimLiu and YanxuanLiu February 1, 2024 07:53

jlowe added the "test" label (Only impacts tests) Feb 1, 2024

jlowe reviewed Feb 1, 2024

Update hadoop cmd to list all files directly based on the pattern

9dc04aa

Signed-off-by: Yinqing Hao <[email protected]>

yinqingh changed the base branch from branch-24.02 to branch-24.04 February 2, 2024 02:02

yinqingh commented Feb 2, 2024

jlowe reviewed Feb 2, 2024

Update copyrights and error message

34f3143

Signed-off-by: Yinqing Hao <[email protected]>

gerashegalov reviewed Feb 5, 2024

jlowe approved these changes Feb 5, 2024

GaryShen2008 approved these changes Feb 6, 2024

yinqingh merged commit b9da628 into NVIDIA:branch-24.04 Feb 6, 2024
39 of 40 checks passed

yinqingh deleted the fix-hdfs-parquet-file-path branch February 6, 2024 07:06
