Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include avro test by using '--packages' option [skip ci] #6505

Merged
merged 4 commits into from
Sep 9, 2022

Conversation

zhanga5
Copy link
Contributor

@zhanga5 zhanga5 commented Sep 5, 2022

Closes #5657

@zhanga5 zhanga5 changed the title Include avro test by using '--packages' option Include avro test by using '--packages' option [skip ci] Sep 5, 2022
@zhanga5
Copy link
Contributor Author

zhanga5 commented Sep 5, 2022

build

@jlowe jlowe added the test Only impacts tests label Sep 6, 2022
@@ -284,6 +300,7 @@ export -f get_tests_by_tags
# - DEFAULT: all tests except cudf_udf tests
# - CUDF_UDF_ONLY: cudf_udf tests only, requires extra conda cudf-py lib
# - ICEBERG_ONLY: iceberg tests only
# - AVRO_ONLY: avro tests only (with --packages option instead of --jars)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add documentation to README about this option and mention using --packages since if you don't have internet access it won't work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which README? Didn't find any info about existing options DELTA_LAKE_ONLY or ICEBERG_ONLY, which used similar --packages...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like enabling avro test was quite different with iceberg or delta lake. Not sure whether should provide some consistency and then document it in same manner

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think this should be a new issue to bring avro test consistent w/ other ones. filed an issue to track this #6532

@@ -211,6 +211,18 @@ run_iceberg_tests() {
fi
}

# Test spark-avro with documented way of deploying at run time via --packages option from Maven
run_avro_tests() {
unset TEST_PARALLEL # Enable auto spark local parallel in run_pyspark_from_build.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPARK_SUBMIT_FLAGS and TEST_PARALLEL > 0 are not compatible, see the thread on #6403

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refined

Copy link
Collaborator

@pxLi pxLi Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not require it must be parallel in this PR as we are testing --package avro.

For this place to let run_pyspark_from_build to set the parallelism may not always be a good idea as it try to set as many as parallelism as possible, sometimes this would actually decrease the performance if over some specific value and could potentially caused host machine memory OOM with different GPU types. I guess we could hardcode it to 4 or 5 here

We can discuss about how to better parallelize tests like iceberg and avro in #6403. But for this PR lets keep it simple~

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. we need to drop unset TEST_PARALLEL in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -211,6 +211,18 @@ run_iceberg_tests() {
fi
}

# Test spark-avro with documented way of deploying at run time via --packages option from Maven
run_avro_tests() {
unset TEST_PARALLEL # Enable auto spark local parallel in run_pyspark_from_build.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. we need to drop unset TEST_PARALLEL in this PR

export PYSP_TEST_spark_jars_packages="org.apache.spark:spark-avro_2.12:${SPARK_VER}"

# Workaround to avoid appending avro jar file by '--jars'
rm -vf $LOCAL_JAR_PATH/spark-avro*.jar
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move out of the way (e.g. rename spark-avro.jar to spark-avro.jar%) before calling the avro tests and restore after to avoid side effects and creating dependence on the order of the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refined

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets take care of all avro_test inconsistency in #6532

pxLi
pxLi previously approved these changes Sep 9, 2022
Copy link
Collaborator

@pxLi pxLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I will upmerge this and update a few later in #6522

@pxLi
Copy link
Collaborator

pxLi commented Sep 9, 2022

build


# Workaround to avoid appending avro jar file by '--jars'
AVRO_JAR_FILE=`cd $LOCAL_JAR_PATH && ls spark-avro*.jar`
mv $LOCAL_JAR_PATH/$AVRO_JAR_FILE $LOCAL_JAR_PATH/$AVRO_JAR_FILE%
Copy link
Collaborator

@pxLi pxLi Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing that could cause issue is that mv could fail if no avro file in LOCAL_JAR_PATH.
may simply || true to avoid error out the case that people mark ENV INCLUDE_SPARK_AVRO_JAR=false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted to previous simple workaround. It would be fine as avro test would be executed at the end of scripts and avro jar file will be auto downloaded each time

@pxLi
Copy link
Collaborator

pxLi commented Sep 9, 2022

build

@pxLi pxLi merged commit 950e2c8 into NVIDIA:branch-22.10 Sep 9, 2022
@zhanga5 zhanga5 deleted the issue-5657 branch September 13, 2022 01:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Documented deployment of spark-avro is not tested
5 participants