Wrap scalar generation into spark session in integration test #9405
Conversation
Signed-off-by: Haoyang Li <[email protected]>
build
LGTM
What is the exception that actually triggered this? This is essentially the reason that we turned it on by default everywhere.
I just want to understand why the default settings are not applying.
@revans2 For example, in the following case from issue #9404:

```python
@pytest.mark.parametrize('data_gen', [DecimalGen(34, -5)], ids=idfn)
def test_greatest1(data_gen):
    num_cols = 20
    s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
    # we want lots of nulls
    gen = StructGen([('_c' + str(x), data_gen.copy_special_case(None, weight=100.0))
                     for x in range(0, num_cols)], nullable=False)
    command_args = [f.col('_c' + str(x)) for x in range(0, num_cols)]
    command_args.append(s1)
    data_type = data_gen.data_type
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : gen_df(spark, gen).select(
            f.greatest(*command_args)))
```

If we run this case individually, it will fail because `spark.sql.legacy.allowNegativeScaleOfDecimal` has not been set yet. In the pre-merge CI job, there are 4 xdist agents running in parallel, and the cases are assigned to different agents round robin. So if this case happens to be the first one run on a particular agent, the CI will fail. We believe that this is the reason why #9288's pre-merge keeps failing.
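For reference, a minimal, hedged illustration of that failure mode outside the test framework (a plain local session standing in for a session that has not yet had the integration-test defaults applied):

```python
from decimal import Decimal

from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import DecimalType

# A session without the integration-test defaults. In the real pre-merge run
# the shared session already exists, but the defaults are only applied once a
# test goes through with_spark_session.
spark = SparkSession.builder.master('local[1]').getOrCreate()

# Building a literal with a negative-scale decimal type is rejected by Spark
# unless spark.sql.legacy.allowNegativeScaleOfDecimal is set to true; this is
# roughly what gen_scalar with DecimalGen(34, -5) does under the hood when it
# marks the generated value as a literal.
f.lit(Decimal(0)).cast(DecimalType(34, -5))
```

Once any test has gone through `with_spark_session`, the config sticks to the shared session, which is why the failure only shows up when a case like this is the first one an xdist agent runs.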
Seems to me the issue is that one or more tests is generating data outside of the normal Spark session context that sets up the configs properly (i.e., move the data generation to within the dataframe callback that is currently a lambda). Personally, I'm not a fan of the current PR approach, where data_gen can silently smash config values and leave them in that smashed state. That can be very surprising behavior and annoying to track down if it bites someone explicitly trying to test without that config setting.
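A hedged sketch of what that suggestion could look like for the test above (reusing the names from that snippet; `build_query` is just an illustrative name, and this is not the change the PR ultimately made):

```python
def test_greatest1(data_gen):
    num_cols = 20
    # we still want lots of nulls in the columns
    gen = StructGen([('_c' + str(x), data_gen.copy_special_case(None, weight=100.0))
                     for x in range(0, num_cols)], nullable=False)

    def build_query(spark):
        # The scalar is generated inside the dataframe callback, so it runs
        # under the Spark session that already has the default configs
        # (including allowNegativeScaleOfDecimal) applied.
        s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
        command_args = [f.col('_c' + str(x)) for x in range(0, num_cols)] + [s1]
        return gen_df(spark, gen).select(f.greatest(*command_args))

    assert_gpu_and_cpu_are_equal_collect(build_query)
```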
Signed-off-by: Haoyang Li <[email protected]>
@jlowe Ok, I moved the scalar generation into a `with_cpu_session`.
```python
    src = _gen_scalars_common(data_gen, count, seed=seed)
    data_type = src.data_type
    return (_mark_as_lit(src.gen(force_no_nulls=force_no_nulls), data_type) for i in range(0, count))
return with_cpu_session(lambda spark: gen_scalars_help(data_gen=data_gen,
```
Putting this in a cpu_session fixes the current problem, but it adds a new one. If `gen_scalars` is called from inside a `with_*_session`, it will have other problems. `with_spark_session` calls `reset_spark_session_conf`, which does more than just reset the conf: it clears out the catalog too, with no way to get the original config or catalog back after it exits. That means `with_gpu_session` -> `gen_scalars` will result in the query running on the CPU after the `gen_scalars`.
I see a few ways to properly fix this:

1. We set `spark.sql.legacy.allowNegativeScaleOfDecimal` when launching Spark and have the test framework throw an exception if it is not set. Then we remove references to it in all of the tests for consistency. Then we file a follow-on issue to fix `with_spark_session` to not allow nesting and to throw an exception if it is nested.
2. We fix `with_spark_session` to throw an exception if it is ever nested and do what you are doing today, plus update the docs for it to be clear that it can never be called from within a `with_spark_session`.
3. We fix the test to call `gen_scalar` from within a `with_spark_session` and add a doc fix for `gen_scalar` to indicate that negative scale decimals can have problems if called from outside of a `with_spark_session` block. Then we file a follow-on issue to fix `with_spark_session` to not allow nesting and to throw an exception if it is nested.
I personally prefer option 1, but I am fine with option 2 or 3. Talking to @jlowe, he really prefers option 3. The main difference between option 3 and option 2 for me is really about the amount of code that needs to change. If we just fix the one test and add some docs, that feels like a really small change. If we have to fix nesting/etc., that feels a bit larger, but it is something we need to do either way and would mean all tests that use `gen_scalar` would be able to deal with all decimal values properly.
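A minimal sketch of the nesting guard that all three options eventually want, under assumed names (`_session_depth`, `_set_all_confs`, and `_spark` are stand-ins; only `reset_spark_session_conf` is named in the review above):

```python
_session_depth = 0  # hypothetical module-level nesting counter

def with_spark_session(func, conf={}):
    """Run func against the shared Spark session, refusing to be nested."""
    global _session_depth
    if _session_depth > 0:
        # Fail loudly instead of silently resetting the conf and catalog of a
        # session that some outer with_*_session already configured.
        raise RuntimeError('with_spark_session cannot be nested')
    _session_depth += 1
    try:
        reset_spark_session_conf()   # clears the conf (and catalog), as described above
        _set_all_confs(conf)         # assumed helper that applies the default + test confs
        return func(_spark)          # _spark: assumed name for the shared SparkSession
    finally:
        _session_depth -= 1
```

With a guard like this, a `with_gpu_session` -> `gen_scalars` -> `with_cpu_session` chain would fail immediately instead of silently leaving the rest of the query running on the CPU.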
I'm not a fan of 2. It's again surprising behavior (who would expect it to spawn a Spark session?). I'm fine with either 1 or 3, and even with 1, I still think we should fix the test(s). We should be putting all data generation inside a spark session context of some kind.
Thanks, updated the code to option 3. Now I wrap all scalar generation with a `with_cpu_session`, whether it calls `f.lit` or uses `DecimalGen`. Not sure if we only want to move the cases that could fail into Spark sessions.
Follow-on issue: #9412
Signed-off-by: Haoyang Li <[email protected]>
build
Signed-off-by: Haoyang Li <[email protected]>
build
build
build
PR headline should be updated to reflect the new approach.
Co-authored-by: Jason Lowe <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
f.lit in integration test
@jlowe Thanks for the review, all done.
build
```diff
@@ -67,8 +67,8 @@ def test_concat_double_list_with_lit(dg):

 @pytest.mark.parametrize('data_gen', non_nested_array_gens, ids=idfn)
 def test_concat_list_with_lit(data_gen):
-    lit_col1 = f.lit(gen_scalar(data_gen)).cast(data_gen.data_type)
-    lit_col2 = f.lit(gen_scalar(data_gen)).cast(data_gen.data_type)
+    lit_col1 = f.lit(with_cpu_session(lambda spark: gen_scalar(data_gen))).cast(data_gen.data_type)
```
This PR is intended to put `f.lit` into `with_cpu_session`, not only the `f.lit` in `gen_scalar` but also the `f.lit` in other places. Maybe change to the following:

```python
with_cpu_session(lambda spark: f.lit(gen_scalar(data_gen)).cast(data_gen.data_type))
```
Please check `f.lit` in other places in this PR.
Good catch, checked and fixed.
Signed-off-by: Haoyang Li <[email protected]>
build
Hi @revans2, please take another look, thanks.
Fixes #9404

When calling `f.lit`, an error will be reported if `spark.sql.legacy.allowNegativeScaleOfDecimal` is unset. This config is part of the default config of the integration tests, but those configs are only applied when calling `with_spark_session`, and calling `f.lit` can happen before any of them in some edge cases when the CI runs the integration tests in parallel.

This PR adds the negative scale config before calling `f.lit` when generating scalars. Also fixed the `cache_repr` of `TimestampGen` to add the new parameter.