Add tests to check compatibility with fastparquet (#9366)
This commit adds tests to check for compatibility between the Spark RAPIDS plugin and [`fastparquet`](https://fastparquet.readthedocs.io/en/latest/index.html).

The tests include the following scenarios:
1. To check whether Parquet files written with `fastparquet` are read correctly with the Spark RAPIDS plugin.
2. To check whether Parquet files written with the Spark RAPIDS plugin are read correctly with `fastparquet`.
3. To check that files written with Apache Spark are read consistently by the Spark RAPIDS plugin and `fastparquet`.

There are known limitations for these tests:
1. Converting Pandas dataframes to Spark dataframes appears to fail with Spark versions preceding `spark-3.4.x`, limiting the Spark versions on which these tests can run.
2. Pandas and `fastparquet` have known limitations with certain datatypes.  For instance:
   a. `fastparquet` does not support `DECIMAL` columns, as per [documentation](https://fastparquet.readthedocs.io/en/latest/details.html#data-types). These are treated as `FLOAT`.
   b. The supported ranges of `Date` and `Timestamp` values seem narrower than Apache Spark's. For instance, a date like `date(year=8543, month=12, day=31)` is deemed out of range in Pandas/`fastparquet`; such dates can neither be written nor read back correctly via `fastparquet`.
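The `Timestamp` limit in (b) follows from pandas' default nanosecond resolution: a signed 64-bit count of nanoseconds since the Unix epoch only covers roughly the years 1677 to 2262, so a date in year 8543 cannot be represented at that resolution. A quick illustration (exact behaviour for the out-of-range construction varies across pandas versions, so both outcomes are handled):

```python
import pandas as pd

# pandas' nanosecond-resolution Timestamp packs nanoseconds since
# 1970-01-01 into a signed 64-bit integer, bounding the range:
print(pd.Timestamp.min)  # ~1677-09-21
print(pd.Timestamp.max)  # ~2262-04-11

# Year 8543 is far outside that window.
try:
    ts = pd.Timestamp(year=8543, month=12, day=31)
    # Newer pandas may fall back to a coarser resolution here...
    print("represented as:", ts)
except ValueError as e:
    # ...while nanosecond-only pandas rejects it outright.
    # (OutOfBoundsDatetime subclasses ValueError.)
    print("rejected:", e)
```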

Signed-off-by: MithunR <[email protected]>
mythrocks authored Oct 23, 2023
1 parent ed3a1f3 commit a8cbfca
Showing 7 changed files with 363 additions and 5 deletions.
3 changes: 3 additions & 0 deletions integration_tests/README.md
@@ -101,6 +101,9 @@ For manual installation, you need to setup your environment:
tests across multiple CPUs to speed up test execution
- findspark
: Adds pyspark to sys.path at runtime
- [fastparquet](https://fastparquet.readthedocs.io)
: A Python library (independent of Apache Spark) for reading/writing Parquet. Used in the
integration tests for checking Parquet read/write compatibility with the RAPIDS plugin.

You can install all the dependencies using `pip` by running the following command:

3 changes: 2 additions & 1 deletion integration_tests/requirements.txt
@@ -16,4 +16,5 @@ sre_yield
pandas
pyarrow
pytest-xdist >= 2.0.0
findspark
+fastparquet >= 2023.8.0
354 changes: 354 additions & 0 deletions integration_tests/src/main/python/fastparquet_compatibility_test.py

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion jenkins/Dockerfile-blossom.integration.rocky
@@ -57,7 +57,7 @@ RUN export CUDA_VER=`echo ${CUDA_VER} | cut -d '.' -f 1,2` && \
conda install -y -c conda-forge sre_yield && \
conda clean -ay
# install pytest plugins for xdist parallel run
-RUN python -m pip install findspark pytest-xdist pytest-order
+RUN python -m pip install findspark pytest-xdist pytest-order fastparquet

# Set default java as 1.8.0
ENV JAVA_HOME "/usr/lib/jvm/java-1.8.0-openjdk"
2 changes: 1 addition & 1 deletion jenkins/Dockerfile-blossom.integration.ubuntu
@@ -69,7 +69,7 @@ RUN export CUDA_VER=`echo ${CUDA_VER} | cut -d '.' -f 1,2` && \
conda install -y -c conda-forge sre_yield && \
conda clean -ay
# install pytest plugins for xdist parallel run
-RUN python -m pip install findspark pytest-xdist pytest-order
+RUN python -m pip install findspark pytest-xdist pytest-order fastparquet

RUN apt install -y inetutils-ping expect

2 changes: 1 addition & 1 deletion jenkins/Dockerfile-blossom.ubuntu
@@ -58,7 +58,7 @@ RUN update-java-alternatives --set /usr/lib/jvm/java-1.8.0-openjdk-amd64

RUN ln -sfn /usr/bin/python3.8 /usr/bin/python
RUN ln -sfn /usr/bin/python3.8 /usr/bin/python3
-RUN python -m pip install pytest sre_yield requests pandas pyarrow findspark pytest-xdist pre-commit pytest-order
+RUN python -m pip install pytest sre_yield requests pandas pyarrow findspark pytest-xdist pre-commit pytest-order fastparquet

# libnuma1 and libgomp1 are required by ucx packaging
RUN apt install -y inetutils-ping expect wget libnuma1 libgomp1
2 changes: 1 addition & 1 deletion jenkins/databricks/setup.sh
@@ -49,4 +49,4 @@ PYTHON_VERSION=$(${PYSPARK_PYTHON} -c 'import sys; print("python{}.{}".format(sy
# Set the path of python site-packages, and install packages here.
PYTHON_SITE_PACKAGES="$HOME/.local/lib/${PYTHON_VERSION}/site-packages"
# Use "python -m pip install" to make sure pip matches with python.
-$PYSPARK_PYTHON -m pip install --target $PYTHON_SITE_PACKAGES pytest sre_yield requests pandas pyarrow findspark pytest-xdist pytest-order
+$PYSPARK_PYTHON -m pip install --target $PYTHON_SITE_PACKAGES pytest sre_yield requests pandas pyarrow findspark pytest-xdist pytest-order fastparquet
