[BUG] String columns written with fastparquet are read differently with Spark RAPIDS
#9387
Labels
bug: Something isn't working
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
Description
As part of #9366, it was noticed that string columns written with fastparquet are read differently on Spark RAPIDS than on vanilla Apache Spark. There seemed to be trailing characters at the end of the last row.
Repro
A gzipped example Parquet file is attached: fastparquet_string.zip.
In the integration tests, the difference manifests as extra characters at the end of the last row. It isn't quite as obvious when read explicitly on Apache Spark and Spark RAPIDS:
Note that the attachment is a smaller, contrived version of the actual failing test. That file is larger and less readable than the attachment here.
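Since the extra characters are easy to miss in printed output, one way to surface them is to compare the collected string values row by row using `repr`, which shows non-printable characters explicitly. The following is only a sketch: `trailing_diffs` is a hypothetical helper, and the sample values are made up for illustration, not taken from the attached file.

```python
def trailing_diffs(cpu_rows, gpu_rows):
    """Compare two lists of string values row by row.

    Returns a list of (row_index, cpu_repr, gpu_repr, extra) tuples for
    rows that differ. `extra` holds the trailing characters when the GPU
    value is the CPU value plus a suffix, and is None otherwise.
    """
    diffs = []
    for i, (cpu, gpu) in enumerate(zip(cpu_rows, gpu_rows)):
        if cpu == gpu:
            continue
        extra = None
        # Only compute a suffix when both values are strings and the GPU
        # value starts with the CPU value.
        if cpu is not None and gpu is not None and gpu.startswith(cpu):
            extra = gpu[len(cpu):]
        diffs.append((i, repr(cpu), repr(gpu), extra))
    return diffs


if __name__ == "__main__":
    # Hypothetical values: the last row carries trailing NUL characters.
    cpu_values = ["alpha", "beta", "gamma"]
    gpu_values = ["alpha", "beta", "gamma\x00\x00"]
    for row in trailing_diffs(cpu_values, gpu_values):
        print(row)
```

In practice, the two lists could be built by collecting the string column from the same file in two sessions, one with the plugin active and one with `spark.rapids.sql.enabled` set to `false`; that setup is an assumption and is not shown here.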