[BUG] String columns written with fastparquet are read differently with Spark RAPIDS #9387

Open
mythrocks opened this issue Oct 5, 2023 · 2 comments
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

mythrocks commented Oct 5, 2023

Description

While working on #9366, we noticed that string columns written with fastparquet are read differently on Spark RAPIDS than on vanilla Apache Spark: the last row appears to have trailing characters.

Repro

A zipped example Parquet file is attached: fastparquet_string.zip.

In the integration tests, the difference manifests as extra characters at the end of the last row. It is less obvious when the file is read explicitly on Apache Spark and Spark RAPIDS, where the only visible hint is the wider column in the Spark RAPIDS output:

// Spark RAPIDS:
scala> spark.read.parquet("/tmp/parq_write").show()
...
+--------------+
|           str|
+--------------+
|           all|
|           the|
|leaves|
+--------------+
// Apache Spark:
scala> spark.read.parquet("/tmp/parq_write").show()
+------+
|   str|
+------+
|   all|
|   the|
|leaves|
+------+

Note that the attachment is a smaller, contrived version of the actual failing test. That file is larger and less readable than the attachment here.
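
Because rendered output can hide the difference, one way to make it visible is to compare lengths and escaped representations rather than printed text. The sketch below is plain Python and uses stand-in values, not data read from the attached file. The padded value assumes the trailing characters are NUL characters, and the helper name `describe_mismatches` is made up for this illustration.

```python
# Stand-in values, NOT read from the attached file. The padded value assumes
# the trailing characters are NUL ("\x00"), which many terminals render as nothing.
cpu_rows = ["all", "the", "leaves"]
gpu_rows = ["all", "the", "leaves" + "\x00" * 8]

def describe_mismatches(expected, actual):
    """Return (row index, expected repr, actual repr, length difference) per mismatch."""
    mismatches = []
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            mismatches.append((i, repr(e), repr(a), len(a) - len(e)))
    return mismatches

for row in describe_mismatches(cpu_rows, gpu_rows):
    print(row)
```

Using `repr` shows the escaped bytes explicitly, so a mismatch that looks identical on screen still stands out in a test log.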

mythrocks added the bug and "? - Needs Triage" labels on Oct 5, 2023

revans2 commented Oct 5, 2023

Yeah, I tested this out myself, and it looks like it has to be some kind of bug in cuDF. cuDF thinks the length of the last string is 14, but Spark sees it as 6. Perhaps there is some odd padding happening. I am going to try to dump the raw data that we got from cuDF.

revans2 commented Oct 5, 2023

So it looks like there are null (0x00) bytes at the end of the last string: the 6 bytes of "leaves" are followed by 8 zero bytes.

GPU COLUMN LENGTH - NC: 0 DATA: DeviceMemoryBufferView{address=0x30a003400, length=20, id=-1} VAL: DeviceMemoryBufferView{address=0x30a001e00, length=64, id=-1}
COLUMN LENGTH - STRING
0 "all" 616c6c
1 "the" 746865
2 "leaves" 6c65617665730000000000000000
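
The dump above can be checked by hand. The following Python sketch (standard library only) decodes the three hex strings copied from the dump, counts the trailing zero bytes in each, and sums the byte counts for comparison with the `length=20` reported for the DATA buffer. The helper name `trailing_zero_bytes` is made up for this illustration.

```python
# Hex strings copied from the dump above, one per row.
rows_hex = ["616c6c", "746865", "6c65617665730000000000000000"]

def trailing_zero_bytes(raw: bytes) -> int:
    """Count 0x00 bytes at the end of a byte string."""
    return len(raw) - len(raw.rstrip(b"\x00"))

decoded = [bytes.fromhex(h) for h in rows_hex]
for i, raw in enumerate(decoded):
    pad = trailing_zero_bytes(raw)
    text = raw[: len(raw) - pad].decode("utf-8")
    print(i, text, "bytes:", len(raw), "trailing zeros:", pad)

# Sum of per-row byte counts, to compare with the DATA buffer's length=20.
total_bytes = sum(len(raw) for raw in decoded)
print("total bytes:", total_bytes)
```

With these inputs the last row is 14 bytes, of which 8 are trailing zeros, which matches the 14 versus 6 length discrepancy noted earlier in the thread; the three rows together account for all 20 bytes of the DATA buffer.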

The parquet-mr CLI tool appears to dump the data correctly, as Spark does.

This even shows up when we try to read str as a binary column instead of a string column. The next step should be to see if we can get a repro case in pure cuDF C++; if so, we need to file an issue with cuDF for this.

revans2 added the cudf_dependency label on Oct 5, 2023
mattahrens removed the "? - Needs Triage" label on Oct 10, 2023