Add tests to check compatibility with fastparquet (#9366)
This commit adds tests to check for compatibility between the Spark RAPIDS plugin and [`fastparquet`](https://fastparquet.readthedocs.io/en/latest/index.html).

The tests include the following scenarios:
1. To check whether Parquet files written with `fastparquet` are read correctly with the Spark RAPIDS plugin.
2. To check whether Parquet files written with the Spark RAPIDS plugin are read correctly with `fastparquet`.
3. To check that files written with Apache Spark are read consistently by the Spark RAPIDS plugin and `fastparquet`.

There are known limitations for these tests:
1. Converting Pandas dataframes to Spark dataframes appears to fail with Spark versions preceding `spark-3.4.x`, limiting the Spark versions on which these tests can run.
2. Pandas and `fastparquet` have known limitations with certain datatypes.  For instance:
   a. `fastparquet` does not support `DECIMAL` columns, as per [documentation](https://fastparquet.readthedocs.io/en/latest/details.html#data-types). These are treated as `FLOAT`.
   b. The supported ranges of `Date` and `Timestamp` values seem narrower than Apache Spark's. For instance, a date like `date(year=8543, month=12, day=31)` is deemed out of range in Pandas/`fastparquet`; such dates can neither be written nor read back correctly via `fastparquet`.
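The `Timestamp` limit in (b) follows from pandas' default nanosecond resolution: a signed 64-bit count of nanoseconds since the Unix epoch only covers roughly the years 1677 to 2262, so a date in year 8543 cannot be represented at that resolution. A quick illustration (exact behaviour for the out-of-range construction varies across pandas versions, so both outcomes are handled):

```python
import pandas as pd

# pandas' nanosecond-resolution Timestamp packs nanoseconds since
# 1970-01-01 into a signed 64-bit integer, bounding the range:
print(pd.Timestamp.min)  # ~1677-09-21
print(pd.Timestamp.max)  # ~2262-04-11

# Year 8543 is far outside that window.
try:
    ts = pd.Timestamp(year=8543, month=12, day=31)
    # Newer pandas may fall back to a coarser resolution here...
    print("represented as:", ts)
except ValueError as e:
    # ...while nanosecond-only pandas rejects it outright.
    # (OutOfBoundsDatetime subclasses ValueError.)
    print("rejected:", e)
```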

Signed-off-by: MithunR <[email protected]>
mythrocks authored Oct 23, 2023
1 parent ed3a1f3 commit a8cbfca
Showing 7 changed files with 363 additions and 5 deletions.
3 changes: 3 additions & 0 deletions integration_tests/README.md
@@ -101,6 +101,9 @@ For manual installation, you need to setup your environment:
tests across multiple CPUs to speed up test execution
- findspark
: Adds pyspark to sys.path at runtime
- [fastparquet](https://fastparquet.readthedocs.io)
: A Python library (independent of Apache Spark) for reading/writing Parquet. Used in the
integration tests for checking Parquet read/write compatibility with the RAPIDS plugin.

You can install all the dependencies using `pip` by running the following command:

3 changes: 2 additions & 1 deletion integration_tests/requirements.txt
@@ -16,4 +16,5 @@ sre_yield
pandas
pyarrow
pytest-xdist >= 2.0.0
findspark
+fastparquet >= 2023.8.0
354 changes: 354 additions & 0 deletions integration_tests/src/main/python/fastparquet_compatibility_test.py

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion jenkins/Dockerfile-blossom.integration.rocky
@@ -57,7 +57,7 @@ RUN export CUDA_VER=`echo ${CUDA_VER} | cut -d '.' -f 1,2` && \
conda install -y -c conda-forge sre_yield && \
conda clean -ay
# install pytest plugins for xdist parallel run
-RUN python -m pip install findspark pytest-xdist pytest-order
+RUN python -m pip install findspark pytest-xdist pytest-order fastparquet

# Set default java as 1.8.0
ENV JAVA_HOME "/usr/lib/jvm/java-1.8.0-openjdk"
2 changes: 1 addition & 1 deletion jenkins/Dockerfile-blossom.integration.ubuntu
@@ -69,7 +69,7 @@ RUN export CUDA_VER=`echo ${CUDA_VER} | cut -d '.' -f 1,2` && \
conda install -y -c conda-forge sre_yield && \
conda clean -ay
# install pytest plugins for xdist parallel run
-RUN python -m pip install findspark pytest-xdist pytest-order
+RUN python -m pip install findspark pytest-xdist pytest-order fastparquet

RUN apt install -y inetutils-ping expect

2 changes: 1 addition & 1 deletion jenkins/Dockerfile-blossom.ubuntu
@@ -58,7 +58,7 @@ RUN update-java-alternatives --set /usr/lib/jvm/java-1.8.0-openjdk-amd64

RUN ln -sfn /usr/bin/python3.8 /usr/bin/python
RUN ln -sfn /usr/bin/python3.8 /usr/bin/python3
-RUN python -m pip install pytest sre_yield requests pandas pyarrow findspark pytest-xdist pre-commit pytest-order
+RUN python -m pip install pytest sre_yield requests pandas pyarrow findspark pytest-xdist pre-commit pytest-order fastparquet

# libnuma1 and libgomp1 are required by ucx packaging
RUN apt install -y inetutils-ping expect wget libnuma1 libgomp1
2 changes: 1 addition & 1 deletion jenkins/databricks/setup.sh
@@ -49,4 +49,4 @@ PYTHON_VERSION=$(${PYSPARK_PYTHON} -c 'import sys; print("python{}.{}".format(sy
# Set the path of python site-packages, and install packages here.
PYTHON_SITE_PACKAGES="$HOME/.local/lib/${PYTHON_VERSION}/site-packages"
# Use "python -m pip install" to make sure pip matches with python.
-$PYSPARK_PYTHON -m pip install --target $PYTHON_SITE_PACKAGES pytest sre_yield requests pandas pyarrow findspark pytest-xdist pytest-order
+$PYSPARK_PYTHON -m pip install --target $PYTHON_SITE_PACKAGES pytest sre_yield requests pandas pyarrow findspark pytest-xdist pytest-order fastparquet
