[BUG] String columns written with `fastparquet` are read differently with Spark RAPIDS #9387

mythrocks · 2023-10-05T04:59:36Z

Description

As part of #9366, it was noticed that string columns written with fastparquet are read differently on Spark RAPIDS than on vanilla Apache Spark. On Spark RAPIDS, the string in the last row appears to carry extra trailing characters.

Repro

A gzipped example Parquet file is attached herewith: fastparquet_string.zip.

In the integration tests, the difference manifests as extra characters at the end of the last row. It isn't quite as obvious when read explicitly on Apache Spark and Spark RAPIDS:

// Spark RAPIDS:
scala> spark.read.parquet("/tmp/parq_write").show()
...
+--------------+
|           str|
+--------------+
|           all|
|           the|
|leaves|
+--------------+

// Apache Spark:
scala> spark.read.parquet("/tmp/parq_write").show()
+------+
|   str|
+------+
|   all|
|   the|
|leaves|
+------+

Note that the attachment is a smaller, contrived version of the actual failing test. That file is larger and less readable than the attachment here.


revans2 · 2023-10-05T13:58:17Z

Ya I tested this out myself and it looks like it has to be some kind of a bug in CUDF. CUDF thinks the length of the last string is 14, but Spark sees it as 6. Perhaps there is some odd padding happening. I am going to try to dump the raw data that we got from CUDF.

revans2 · 2023-10-05T14:13:01Z

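So it looks like there are null bytes (0x00) at the end of the last string. These are padding bytes inside the string data, not null rows: the null count (NC) is 0.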

GPU COLUMN LENGTH - NC: 0 DATA: DeviceMemoryBufferView{address=0x30a003400, length=20, id=-1} VAL: DeviceMemoryBufferView{address=0x30a001e00, length=64, id=-1}
COLUMN LENGTH - STRING
0 "all" 616c6c
1 "the" 746865
2 "leaves" 6c65617665730000000000000000

The parquet-mr cli tool appears to dump the data correctly like Spark does.

This even shows up when we try to read str as a binary column instead of a string column. The next step should be to see if we can get a repro case in pure cudf C++ and if so then we need to file an issue with CUDF for this.
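For anyone who wants to sanity-check the dump above without a GPU, here is a minimal Python sketch that decodes the three hex strings exactly as printed. It only inspects the bytes from the debug output; it does not call cudf or Spark, and the helper name `describe` is made up for illustration.

```python
# Hex dumps copied from the debug output above, one entry per row.
rows_hex = ["616c6c", "746865", "6c65617665730000000000000000"]


def describe(hex_str):
    """Return (raw_length, text_without_trailing_nulls, trailing_null_count)."""
    raw = bytes.fromhex(hex_str)
    stripped = raw.rstrip(b"\x00")  # drop only trailing 0x00 bytes
    return len(raw), stripped.decode("utf-8"), len(raw) - len(stripped)


summary = [describe(h) for h in rows_hex]
total_bytes = sum(length for length, _, _ in summary)

for i, (length, text, pad) in enumerate(summary):
    print(i, repr(text), "raw length:", length, "trailing nulls:", pad)
print("total bytes:", total_bytes)
```

The last row decodes to "leaves" (6 bytes) followed by 8 null bytes, which gives the length of 14 that CUDF reports, and the three raw lengths add up to the 20 bytes shown for the DATA buffer.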

mythrocks added the `bug` and `? - Needs Triage` labels Oct 5, 2023

revans2 added the `cudf_dependency` label Oct 5, 2023

This was referenced Oct 5, 2023:

[BUG] String columns written with fastparquet seem to be read incorrectly via CUDF's Parquet reader rapidsai/cudf#14258 (Open)

Add tests to check compatibility with fastparquet #9366 (Merged)

mattahrens removed the `? - Needs Triage` label Oct 10, 2023
