String generation from complex regex in integration tests #8594

thirtiseven · 2023-06-21T12:52:50Z

Closes #8593
Closes #8607
Contributes to #8447

This PR:

Fixed StringGen in integration tests to correctly generate long and complex strings from a regex. Previously, StringGen could not do this when the state space of a regex was very large, as described in [FEA] Generation of long strings properly with StringGen in integration tests #8593.
Implemented a decimal generation function to speed up the generation of decimals that were previously generated by sre_yield.
Updated _cache_repr of UniqueLongGen and RepeatSeqGen to make them simpler.
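
To give a feel for why a very large state space is awkward to sample from, here is a minimal stdlib-only sketch. It is not the code from this PR and does not use sre_yield; `nth_string` and the `[a-d]{50}` example are hypothetical. It shows one general hazard: an index derived by multiplying a float by a huge length always has zeros in its low-order bits, while `random.randrange` draws from arbitrarily large integers. Whether this hazard explains any behavior seen in this PR is not established.

```python
import random

ALPHABET = "abcd"  # stands in for the character class [a-d]


def nth_string(index, width, alphabet=ALPHABET):
    """Return the index-th string of alphabet{width} in lexicographic order."""
    chars = []
    for _ in range(width):
        index, digit = divmod(index, len(alphabet))
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


width = 50
total = len(ALPHABET) ** width  # 4**50 == 2**100 possible strings

rand = random.Random(0)

# Float-based index: random() has 53 bits of precision, so the low 47 bits
# of the index are always zero and the last 23 characters are always 'a'.
float_sample = nth_string(int(rand.random() * total), width)

# Integer-based index: randrange works on arbitrary-precision ints.
int_sample = nth_string(rand.randrange(total), width)

print(float_sample)
print(int_sample)
```

For lengths beyond the float range (roughly 1.8e308), the float multiplication does not just lose precision; it raises OverflowError, because the integer cannot be converted to a float at all.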

Some small changes in test cases that are affected by the StringGen change:

In test_regexp_extract_all_idx_out_of_bounds, the error message is raised only if the given string matches the pattern, so after the StringGen changes mk_str_gen('[abcd]{0,3}') can fail to trigger the error by bad luck. This behavior is the same on CPU and GPU, so I just changed the gen to mk_str_gen('[a-d]{1,2}.{0,1}[0-9]{1,2}').
One regex used by test_json_tuple and test_get_json_object generates numbers with leading zeros, which are not handled the same way by Spark and the plugin, as described in [BUG] Mismatch cases in json_tuple function when json items have leading zeroes #8607. I changed it to not generate such numbers.
When converting a string to SQL in _convert_to_sql, there is an edge case: if the string already contains \' before the replace, the result is \\', which leaves the ' unescaped. So I replaced every \ with \\ first to prevent this.
[UPDATED] test_hash_groupby_collect_set_on_nested_type and test_hash_reduction_collect_set_on_nested_type fail after this PR, but they also fail when selecting different seeds, as described in [BUG] test_hash_groupby_collect_set_on_nested_type and test_hash_reduction_collect_set_on_nested_type failed #8716. I don't think it's related to this PR, so I temporarily marked them as XFAIL.
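
The _convert_to_sql edge case above can be sketched as follows. This is an illustration only, not the actual _convert_to_sql code: the function names are hypothetical, and it assumes a SQL dialect in which a backslash escapes the next character inside a string literal.

```python
def naive_sql_literal(s):
    # Only escapes single quotes; a pre-existing backslash is left as-is.
    return "'" + s.replace("'", "\\'") + "'"


def safer_sql_literal(s):
    # Escape backslashes first, then single quotes, so that an existing
    # backslash cannot pair up with the backslash added for the quote.
    return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'"


tricky = "a\\'b"  # four characters: a, backslash, quote, b

# Naive result contains \\' : an escaped backslash followed by a bare quote,
# which would terminate the literal early.
print(naive_sql_literal(tricky))

# Safer result contains \\\' : an escaped backslash followed by an escaped quote.
print(safer_sql_literal(tricky))
```

The order of the two replacements matters: doing the quote replacement first would double the backslashes that were just added for the quotes.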

[UPDATED] Some test results:
For DecimalGen, this change brings some performance improvement. I tested this one from current IT:

def test_special_decimal_division():
    for precision in range(1, 39):
        for scale in range(-3, precision + 1):
            print("PRECISION " + str(precision) + " SCALE " + str(scale))
            data_gen = DecimalGen(precision, scale)
            assert_gpu_and_cpu_are_equal_collect(
                    lambda spark : two_col_df(spark, data_gen, data_gen).select(
                        f.col('a') / f.col('b')))

Before this PR: 0:07:05, after this PR: 0:04:44
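
For context, generating a decimal directly from random digits, instead of enumerating a regex, might look like the sketch below. It is not the function added in this PR: `random_decimal` is a hypothetical name, and it assumes "the given precision" means the value always has exactly `precision` digits.

```python
import random
from decimal import Decimal


def random_decimal(rand, precision, scale):
    """Build a Decimal with exactly `precision` digits and the given scale."""
    # Pick the unscaled integer directly; the leading digit is non-zero so
    # the value always uses the full precision.
    unscaled = rand.randrange(10 ** (precision - 1), 10 ** precision)
    sign = rand.choice((0, 1))  # 0 = positive, 1 = negative
    digits = tuple(int(ch) for ch in str(unscaled))
    # The tuple constructor is exact; it is not rounded to the context precision.
    return Decimal((sign, digits, -scale))


rand = random.Random(42)
for precision, scale in [(5, 2), (38, 10), (4, -3)]:
    print(precision, scale, random_decimal(rand, precision, scale))
```

Building the value from a (sign, digits, exponent) tuple avoids arithmetic on Decimal objects, which would be rounded to the default 28-digit context and could silently truncate 38-digit values.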

For StringGen, if we use ''.join(rand.choice(sre_yield.CHARSET + ['\n']) for _ in range(30)) to generate default strings, generation is even slower, so I left it unchanged. I tested this case:

@pytest.mark.parametrize('data_gen', [string_gen]*100, ids=idfn)
def test_string_gen_perf(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark: unary_op_df(spark, data_gen))

Before this PR: 52.04s, First commit of this PR: 52.49s. Other string-related tests are also a bit slower. Thus, I will not close the sre_yield performance issue #8477.

The overall performance improvement will be small because it only affects DecimalGen.

Note

Now DecimalGen will only generate numbers with the given precision rather than (1, precision) because it may cause precision limitations in some cases, leading to CPU/GPU mismatch.
test_str_to_map_expr_fixed_pattern_input and test_initcap failed when using random() * length as the index but pass now. I don't know why, and I will investigate them later. Currently they don't block anything.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-06-21T13:39:28Z

build

integration_tests/src/main/python/data_gen.py

integration_tests/src/main/python/map_test.py

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-06-26T12:18:49Z

build

revans2 · 2023-06-26T14:00:30Z

integration_tests/src/main/python/regexp_test.py

@@ -808,7 +808,7 @@ def test_regexp_extract_all_idx_negative():

 @allow_non_gpu('ProjectExec', 'RegExpExtractAll')
 def test_regexp_extract_all_idx_out_of_bounds():
-    gen = mk_str_gen('[abcd]{0,3}')
+    gen = StringGen('.{0,10}')


Why is this change being made? The original regexp was made to match closely with the regexp_extract_all pattern below.

Sorry for the late reply.

The test failed after this PR because the error message "Regex group count is 2, but the specified group index is 3" is only raised when the data matches the pattern.

In the previous code, what actually triggered this error message was '.{0,10}', which is a special case in mk_str_gen. So the test passed only by luck.

Now I have changed the pattern to '[a-d]{1,2}.{0,1}[0-9]{1,2}' to make sure the generated data can match the pattern below.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-07-14T04:10:14Z

build

revans2 · 2023-07-14T13:22:42Z

integration_tests/src/main/python/regexp_test.py

@@ -808,7 +808,7 @@ def test_regexp_extract_all_idx_negative():

 @allow_non_gpu('ProjectExec', 'RegExpExtractAll')
 def test_regexp_extract_all_idx_out_of_bounds():
-    gen = mk_str_gen('[abcd]{0,3}')
+    gen = mk_str_gen('[a-d]{1,2}.{0,1}[0-9]{1,2}')


Generally the change looks good. I just want to check that you changed the generation pattern so that it would always match the regular expression below.

String generation from complex regex in integration tests

6d823b6

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven self-assigned this Jun 21, 2023

revans2 reviewed Jun 21, 2023

thirtiseven and others added 7 commits June 25, 2023 16:42

Merge branch 'NVIDIA:branch-23.08' into sre_yield_issue

6132896

String generation from complex regex in integration tests

2152ae9

Signed-off-by: Haoyang Li <[email protected]>

address comments and fix an issue in json test

cfd700d

Signed-off-by: Haoyang Li <[email protected]>

clean up

f7dd666

Signed-off-by: Haoyang Li <[email protected]>

clean up

46e1731

Signed-off-by: Haoyang Li <[email protected]>

set default pattern of stringgen to None

5aab983

Signed-off-by: Haoyang Li <[email protected]>

clean up

b1910ef

Signed-off-by: Haoyang Li <[email protected]>

revans2 reviewed Jun 26, 2023

pxLi mentioned this pull request Jun 27, 2023

[BUG] maven unit test crashed during regex test intermittently #8612

Closed

thirtiseven and others added 3 commits July 13, 2023 19:57

Merge branch 'NVIDIA:branch-23.08' into sre_yield_issue

b1a3bcd

Removed default stringgen function and update a test case

fe0f5dc

Signed-off-by: Haoyang Li <[email protected]>

Merge branch 'NVIDIA:branch-23.08' into sre_yield_issue

997eb4a

thirtiseven mentioned this pull request Jul 14, 2023

[BUG] test_hash_groupby_collect_set_on_nested_type and test_hash_reduction_collect_set_on_nested_type failed #8716

Closed

Added two xfail cases

f32c534

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven marked this pull request as ready for review July 14, 2023 04:02

thirtiseven changed the title ~~WIP: String generation from complex regex in integration tests~~ String generation from complex regex in integration tests Jul 14, 2023

revans2 approved these changes Jul 14, 2023

thirtiseven merged commit 615156a into NVIDIA:branch-23.08 Jul 17, 2023

sameerz added the test Only impacts tests label Jul 19, 2023

thirtiseven deleted the sre_yield_issue branch August 18, 2023 02:43
