Add tests to check compatibility with fastparquet #9366

Merged: 23 commits merged into NVIDIA:branch-23.12 on Oct 23, 2023

Conversation


@mythrocks (Collaborator) commented Oct 2, 2023

This commit adds tests to check for compatibility between the Spark RAPIDS plugin and fastparquet.

The tests include the following scenarios:

  1. To check whether Parquet files written with fastparquet are read correctly with the Spark RAPIDS plugin (a minimal sketch follows this list).
  2. To check whether Parquet files written with the Spark RAPIDS plugin are read correctly with fastparquet.
  3. To check that files written with Apache Spark are read consistently by the Spark RAPIDS plugin and fastparquet.
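
For illustration, a minimal, hypothetical sketch of scenario 1 (assuming pytest's tmp_path fixture and a spark session fixture; this is not the PR's actual test code):

    import fastparquet
    import pandas as pd

    def test_spark_reads_fastparquet_output(spark, tmp_path):
        # Write a small dataframe with fastparquet, then read it back through
        # Spark, where the RAPIDS plugin substitutes the GPU Parquet reader.
        path = str(tmp_path / "fp.parquet")
        pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
        fastparquet.write(path, pdf)
        rows = spark.read.parquet(path).orderBy("a").collect()
        assert [(r.a, r.b) for r in rows] == [(1, "x"), (2, "y"), (3, "z")]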

There are known limitations for these tests:

  1. Converting Pandas dataframes to Spark dataframes seems to fail with Spark versions preceding spark-3.4.x. This limits the Spark versions on which the test may run.
  2. Pandas and fastparquet have known limitations with certain datatypes. For instance:
    a. fastparquet does not support DECIMAL columns, per its documentation. These are treated as FLOAT.
    b. There seem to be limits on the ranges of Date and Timestamp rows, as compared to Apache Spark. For instance, a date like date(year=8543, month=12, day=31) is deemed out of range in Pandas/fastparquet. Such a date can neither be written nor read correctly via fastparquet. (See the sketch after this list.)
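
As a hedged illustration of limitation 2(b), this is what Pandas itself does with that date (assuming Pandas' default nanosecond timestamp range, roughly 1677-09-21 through 2262-04-11):

    >>> import datetime, pandas
    >>> pandas.Timestamp(datetime.date(year=8543, month=12, day=31))
    ...
    pandas.errors.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 8543-12-31 00:00:00

Since the value cannot even be represented in a nanosecond-backed Pandas column, fastparquet has no exact value to round-trip.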

@mythrocks mythrocks self-assigned this Oct 2, 2023
@mythrocks mythrocks added the test Only impacts tests label Oct 2, 2023
@mythrocks mythrocks marked this pull request as draft October 2, 2023 21:51
@mythrocks (Collaborator Author)

Build

1. Fixed dataframe size in test_read_fastparquet_single_column_tables.
2. Removed unnecessary test params for write confs.
3. Moved fastparquet import to top of test file.
4. Expanded on test failure description for timestamps.
@mythrocks (Collaborator Author)

Build

@pxLi (Collaborator) commented Oct 6, 2023

To introduce a new test dependency, please also update:

  1. https://github.com/NVIDIA/spark-rapids/blob/branch-23.10/integration_tests/README.md#dependencies
  2. dockerfiles under https://github.com/NVIDIA/spark-rapids/tree/branch-23.10/jenkins/ if it needs to be covered in CI, OR
  3. include the requirement installation in the run_pyspark script

Also, I would recommend providing a strategy here to enable/disable the cases, e.g. skip the cases if the required package is not found, or provide a flag to help others control whether to test against specific cases. It would be better to disable them by default, so that a missing dependency does not affect others (see the sketch below):
https://github.com/NVIDIA/spark-rapids/blob/branch-23.10/integration_tests/conftest.py#L42-L50
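
For example, a minimal sketch of the skip-if-missing strategy, using plain pytest (the repo's conftest flag wiring would look different):

    import pytest

    # Skip every test in this module when the dependency is absent, so a
    # missing fastparquet install does not fail other people's runs.
    fastparquet = pytest.importorskip("fastparquet")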

@revans2 (Collaborator) left a comment

My biggest concern here really is just the follow-on work. There are so many xfails. Are they things that we don't think will ever be fixed? If so, then we might not want to test them at all. Are they things that we just have not had time to track down and file issues for? If so, then we really should have a follow-on issue to do that. Or is it something else?

@mythrocks (Collaborator Author)

I'll reply to the individual comments in a moment. A lot of the xfails have to do with the inter-conversions between Pandas and Spark, and with dtype inference.

  1. A Pandas dataframe containing <NA>/null fails to be converted to a Spark dataframe (via sparkSession.createDataFrame()), because the inferred dataframe type seems to be Struct, for some reason. This seems independent of fastparquet. Adding a schema changes the error, but not the fact that there is an error.
  2. Some dataframes simply change types during conversion. For instance, ARRAY<INT>, when converted to Pandas, turns into STRING.

@mythrocks (Collaborator Author) commented Oct 6, 2023

Classification of xfails:

For fixing in Spark RAPIDS / CUDF:

  1. [BUG] String columns written with fastparquet are read differently with Spark RAPIDS #9387, [BUG] String columns written with fastparquet seem to be read incorrectly via CUDF's Parquet reader rapidsai/cudf#14258: String columns written with fastparquet have extra null characters at the end of the last row.

For follow-up in Spark RAPIDS tests:

  1. [BUG] Fix STRUCT comparison between Pandas and Spark dataframes in fastparquet tests #9399: STRUCT<INT, FLOAT> columns, when read through fastparquet and converted to a Spark dataframe for diffing, have a different notation from those read through the plugin:
    a. CPU: Row(a.first=-341142443, a.second=3.333994866005594e-37)
    b. GPU: Row(a=Row(first=-341142443, second=3.333994866005594e-37))
    Might need custom diff logic to convert between notations; see the sketch below.
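
For instance, a hypothetical sketch of such diff logic (the helper name and approach are illustrative only, not the PR's code):

    from pyspark.sql import Row

    def flatten_row(row, prefix=""):
        # Recursively flatten nested Rows into a dict keyed by dotted paths,
        # so GPU-notation rows can be compared against CPU-notation rows.
        flat = {}
        for key, value in row.asDict().items():
            name = prefix + key
            if isinstance(value, Row):
                flat.update(flatten_row(value, prefix=name + "."))
            else:
                flat[name] = value
        return flat

    # flatten_row(Row(a=Row(first=-341142443, second=3.333994866005594e-37)))
    # -> {'a.first': -341142443, 'a.second': 3.333994866005594e-37}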

Problems in fastparquet:

  1. fastparquet reads DECIMAL columns in Parquet files as FLOAT columns, per fastparquet documentation. This manifests as differences from Spark results.
  2. fastparquet reads DATE columns in Parquet files as TIMESTAMP.
  3. fastparquet has different date/timestamp validity limits, as compared to Apache Spark. For instance, year=8705 and year=705 are both read incorrectly (i.e., as completely different values) by fastparquet.
  4. Any Pandas dataframe that contains None causes sparkSession.createDataFrame() to hit "merge errors":
    >>> sql(" values (0), (1), (2), (null) as foo(i) ").write.mode('overwrite').parquet("/tmp/foobar")
    >>> ipdf = fastparquet.ParquetFile("/tmp/foobar").to_pandas()
    >>> ipdf.dtypes
     i    Int32
     dtype: object
    
    >>> spark.createDataFrame(ipdf)
    ...
    TypeError: field i: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StructType'>
    Using an explicit schema doesn't improve matters much:
    >>> schema = StructType([StructField('i', IntegerType())])
    >>> spark.createDataFrame(ipdf, schema=schema)
    ...
    TypeError: field i: IntegerType() can not accept object <NA> in type <class 'pandas._libs.missing.NAType'>
    It's possible that this needs follow-up at our end. (A hypothetical workaround is sketched after this list.)
  5. A Pandas dataframe (say, from fastparquet.ParquetFile().to_pandas()) containing ARRAY<INT>, when passed to sparkSession.createDataFrame() produces the following error:
    >>> sql(" values (array(1,2,3)), (array(4,5,6)) as foo(arr) ").write.mode("overwrite").parquet("/tmp/foobar")
    >>> pdf = fastparquet.ParquetFile("/tmp/foobar").to_pandas()
    >>> spark.createDataFrame(pdf)
    ...
    TypeError: Unable to infer the type of the field arr.
    Specifying a schema does not seem to help matters.
    >>> spark.createDataFrame(pdf, schema=StructType([StructField('arr', ArrayType(IntegerType(), True), True)]))
    ...
    TypeError: element in array field arr: IntegerType() can not accept object 1 in type <class 'numpy.int32'>
    (A hypothetical workaround is sketched after this list.)
  6. Date/timestamp columns written by fastparquet in int96 cannot be rebased for correctness, to account for
    dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z, as Apache Spark does with
    spark.sql.legacy.parquet.int96RebaseModeInRead.
  7. fastparquet loses ARRAY<INT> type information when writing a Pandas dataframe of type ARRAY<INT>. The column is written as STRING:
    >>> pdf = pandas.DataFrame({ 'arr': [ [1,2,3], [4,5,6] ] })
    >>> pdf
            arr
    0  [1, 2, 3]
    1  [4, 5, 6]
    
    >>> spark.createDataFrame(pdf).printSchema()
    root
    |-- arr: array (nullable = true)
    |    |-- element: long (containsNull = true)
    
    >>> fastparquet.write("/tmp/foobar", pdf)
    >>> spark.read.parquet("/tmp/foobar").printSchema()
    root
    |-- arr: string (nullable = true)
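
Regarding items 4 and 5 above, hypothetical (untested) workarounds that reuse the ipdf and pdf frames from those snippets:

    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    # Item 4: replace Pandas' <NA> with plain None before conversion, so that
    # Spark's type checks see Python objects, not pandas._libs.missing.NAType.
    plain = ipdf.astype(object).where(ipdf.notna(), None)
    spark.createDataFrame(plain, schema=StructType([StructField('i', IntegerType())]))

    # Item 5: convert numpy.int32 elements to Python ints, which IntegerType accepts.
    pdf['arr'] = pdf['arr'].apply(lambda arr: [int(x) for x in arr])
    spark.createDataFrame(pdf, schema=StructType(
        [StructField('arr', ArrayType(IntegerType(), True), True)]))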

For fixing in Apache Spark (possibly):

  1. Date/timestamp columns written by fastparquet in int64 cannot be read by Spark or the plugin. (A minimal sketch of the encoding choice follows.)
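
For reference, a minimal sketch of how fastparquet selects the timestamp encoding via its times parameter (switching to 'int96' merely trades this problem for the rebase limitation in item 6 above):

    import fastparquet
    import pandas

    tdf = pandas.DataFrame({"ts": [pandas.Timestamp("2023-10-02T21:51:00")]})
    # times defaults to 'int64'; 'int96' selects the legacy 96-bit encoding.
    fastparquet.write("/tmp/ts_int96.parquet", tdf, times="int96")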

@mythrocks (Collaborator Author) commented Oct 10, 2023

  1. dockerfiles under https://github.com/NVIDIA/spark-rapids/tree/branch-23.10/jenkins/ if need to cover in CI, OR
  2. include the requirement installation in the run_pyspark script

I'm at a loss. I'm not sure where I've missed adding fastparquet as a dependency.

@pxLi (Collaborator) commented Oct 10, 2023

FYI, to check which docker image or file would be used in pre-merge:

https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/Jenkinsfile-blossom.premerge#L34
https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/Jenkinsfile-blossom.premerge#L135-L154 (run w/ modified dockerfile)

NOTE: The pre-merge CI image build runs every 4 hours (so there could be a gap between merging this PR and building the new pre-merge image for others). To avoid affecting others' pre-merge CI, you would also need to manually trigger an ops/docker_image-manager build immediately after merging. You can also allow me to do the merge, to avoid potentially blocking anyone.

@pxLi (Collaborator) commented Oct 10, 2023

  1. dockerfiles under https://github.com/NVIDIA/spark-rapids/tree/branch-23.10/jenkins/ if need to cover in CI, OR
  2. include the requirement installation in the run_pyspark script

I'm at a loss. I'm not sure where I've missed adding fastparquet as a dependency.

I see your point; your dockerfile change was not considered among the modified files in pre-merge CI. Let me take a look.

UPDATE: fixed in #9415

@pxLi (Collaborator) commented Oct 10, 2023

build

@pxLi (Collaborator) commented Oct 10, 2023

git fetch --tags --force --progress -- https://github.com/NVIDIA/spark-rapids.git "+refs/pull/9366/merge"
git checkout 008afd78a21c79c0b271e34a40a48509e641bd84
BASE=$(git --no-pager log --oneline -1 | awk '{ print $NF }')
git --no-pager diff --name-only HEAD $(git merge-base HEAD) -- jenkins/Dockerfile-blossom.ubuntu

The dockerfile change detector works fine with the same merged commit locally.
Not sure why it failed to trigger the corresponding build stage in Jenkins; triggered another run to check.

UPDATE: fixed in #9415

@pxLi (Collaborator) commented Oct 10, 2023

build

In the current trigger:


[2023-10-10T08:58:55.873Z] Collecting fastparquet

[2023-10-10T08:58:55.873Z]   Obtaining dependency information for fastparquet from https://files.pythonhosted.org/packages/d5/99/d6ed5914e30e794775d3bc645e952ba7b6855ca8db2dc41d6d5069e76abb/fastparquet-2023.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata

[2023-10-10T08:58:55.873Z]   Downloading fastparquet-2023.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)

That's the installed one.

@pxLi (Collaborator) commented Oct 10, 2023

(Quoting the FYI above about which pre-merge docker image is used, and the 4-hour pre-merge image build cycle.)

I still recommend adding the flag so people can choose whether to enable the cases. Thanks!

@mythrocks (Collaborator Author)

@pxLi: So as not to break anyone else's tests, I have modified the test to check whether fastparquet is available for import, per your earlier suggestion. I have verified that this skips the tests if fastparquet is not installed. The relevant commit is here:
fa356f8.

@mythrocks (Collaborator Author)

Not sure why so many reviewers got added here. I'll lean on @revans2, @pxLi, and @res-life for this one.

@mythrocks (Collaborator Author)

Build

@pxLi (Collaborator) left a comment

+1 for the CI update, thanks!

@res-life (Collaborator)

For fixing in Apache Spark (possibly):
Date/timestamp columns written by fastparquet in int64 cannot be read by Spark or the plugin.

Spark does not support us timestamp, so I think it's impossible to fix.
https://github.com/apache/spark/blob/v3.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala#L27-L30

For fixing in fastparquet:
Items 1-7

We also cannot fix the fastparquet code.

@res-life (Collaborator)

LGTM

@res-life (Collaborator)

Did not find tests for map type.
We can file a follow-up.

@NvTimLiu (Collaborator) left a comment

LGTM

@mythrocks merged commit a8cbfca into NVIDIA:branch-23.12 on Oct 23, 2023
28 checks passed