Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String generation from complex regex in integration tests #8594

Merged
merged 12 commits into from
Jul 17, 2023

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Jun 21, 2023

Closes #8593
Closes #8607
Contributes to #8447

This PR :

  • Fixed StringGen in integration tests to correctly generate long and complex strings from a regex. StringGen can't do this when the state space of a regex is very large, as described in [FEA] Generation of long strings properly with StringGen in integration tests #8593.

  • Implemented a decimal generation function to speed up the generation of decimals that were previously generated by sre_yield.

  • Updated _cache_repr of UniqueLongGen and RepeatSeqGen to make them simpler.

Some small changes in test cases that affected by the StringGen change:

  1. In test_regexp_extract_all_idx_out_of_bounds, the error message is raised only if the given string matches the pattern, so it can't raise the error with mk_str_gen('[abcd]{0,3}') after StringGen changes because of bad luck. This behavior is matched between cpu and gpu, so I just change the gen to mk_str_gen('[a-d]{1,2}.{0,1}[0-9]{1,2}').

  2. One regex used by test_json_tuple and test_get_json_object will generate numbers with leading zeros, which is not compatible between spark and plugin, as described in [BUG] Mismatch cases in json_tuple function when json items have leading zeroes #8607. I changed it to not generate such numbers.

  3. When processing string to sql in _convert_to_sql, there is a edge case that if there is a \' in string before replace, the result will be r\\', which will make the ' escape. So I replaced all r\ with r\\ first to prevent it.

  4. [UPDATED] test_hash_groupby_collect_set_on_nested_type and test_hash_reduction_collect_set_on_nested_type faile after this PR, but they also fail when selecting different seeds, as described in [BUG] test_hash_groupby_collect_set_on_nested_type and test_hash_reduction_collect_set_on_nested_type failed #8716. I don't think it's related to this PR, so I temporarily marked them as XFAIL.

[UPDATED] Some test results:
For DecimalGen, this change brings some performance improvement. I tested this one from current IT:

def test_special_decimal_division():
    for precision in range(1, 39):
        for scale in range(-3, precision + 1):
            print("PRECISION " + str(precision) + " SCALE " + str(scale))
            data_gen = DecimalGen(precision, scale)
            assert_gpu_and_cpu_are_equal_collect(
                    lambda spark : two_col_df(spark, data_gen, data_gen).select(
                        f.col('a') / f.col('b')))

Before this PR: 0:07:05, after this PR: 0:04:44

For StringGen, if we use ''.join(rand.choice(sre_yield.CHARSET + ['\n']) for _ in range(30)) to generate default strings, the speed is even slower, so I leave it unchanged. I tested this case:

@pytest.mark.parametrize('data_gen', [string_gen]*100, ids=idfn)
def test_string_gen_perf(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark: unary_op_df(spark, data_gen))

Before this PR: 52.04s, First commit of this PR: 52.49s. Other string-related tests are also a bit slower. Thus, I will not close the sre_yield performance issue #8477.

The overall performance improvement will be small because it only affects DecimalGen.

Note

  • Now the DecimalGen will only generating numbers with the given precision rather than (1, precision) because it may cause precision limitations in some cases, leading to CPU/GPU mismatch.

  • test_str_to_map_expr_fixed_pattern_input and test_initcap are failed when using random() * length as index but passed now. I don't know why and I will investigate them later. Currently they don't block anything.

@thirtiseven thirtiseven self-assigned this Jun 21, 2023
@thirtiseven
Copy link
Collaborator Author

build

integration_tests/src/main/python/data_gen.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/data_gen.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/data_gen.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/data_gen.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/map_test.py Outdated Show resolved Hide resolved
@thirtiseven
Copy link
Collaborator Author

build

@@ -808,7 +808,7 @@ def test_regexp_extract_all_idx_negative():

@allow_non_gpu('ProjectExec', 'RegExpExtractAll')
def test_regexp_extract_all_idx_out_of_bounds():
gen = mk_str_gen('[abcd]{0,3}')
gen = StringGen('.{0,10}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change being made? The original regexp was made to match closely with the regexp_extract_all pattern below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply.

The test is failed after this PR because the error message "Regex group count is 2, but the specified group index is 3" will only raised when data matches the pattern.

In the previous code, what triggers this error message is actually the '.{0,10}' as a special case in mk_str_gen. So the test was PASSED because of good luck.

Now I change the pattern to '[a-d]{1,2}.{0,1}[0-9]{1,2}' to make sure they can match the pattern below.

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven marked this pull request as ready for review July 14, 2023 04:02
@thirtiseven thirtiseven changed the title WIP: String generation from complex regex in integration tests String generation from complex regex in integration tests Jul 14, 2023
@thirtiseven
Copy link
Collaborator Author

build

@@ -808,7 +808,7 @@ def test_regexp_extract_all_idx_negative():

@allow_non_gpu('ProjectExec', 'RegExpExtractAll')
def test_regexp_extract_all_idx_out_of_bounds():
gen = mk_str_gen('[abcd]{0,3}')
gen = mk_str_gen('[a-d]{1,2}.{0,1}[0-9]{1,2}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the change looks good. I just want to check that you changed the generation pattern to make it so it would always match regular expression below.

@thirtiseven thirtiseven merged commit 615156a into NVIDIA:branch-23.08 Jul 17, 2023
@sameerz sameerz added the test Only impacts tests label Jul 19, 2023
@thirtiseven thirtiseven deleted the sre_yield_issue branch August 18, 2023 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
3 participants