-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String generation from complex regex in integration tests #8594
String generation from complex regex in integration tests #8594
Conversation
Signed-off-by: Haoyang Li <[email protected]>
build |
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
build |
@@ -808,7 +808,7 @@ def test_regexp_extract_all_idx_negative(): | |||
|
|||
@allow_non_gpu('ProjectExec', 'RegExpExtractAll') | |||
def test_regexp_extract_all_idx_out_of_bounds(): | |||
gen = mk_str_gen('[abcd]{0,3}') | |||
gen = StringGen('.{0,10}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change being made? The original regexp was made to match closely with the regexp_extract_all pattern below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply.
The test is failed after this PR because the error message "Regex group count is 2, but the specified group index is 3" will only raised when data matches the pattern.
In the previous code, what triggers this error message is actually the '.{0,10}'
as a special case in mk_str_gen
. So the test was PASSED because of good luck.
Now I change the pattern to '[a-d]{1,2}.{0,1}[0-9]{1,2}'
to make sure they can match the pattern below.
Signed-off-by: Haoyang Li <[email protected]>
build |
@@ -808,7 +808,7 @@ def test_regexp_extract_all_idx_negative(): | |||
|
|||
@allow_non_gpu('ProjectExec', 'RegExpExtractAll') | |||
def test_regexp_extract_all_idx_out_of_bounds(): | |||
gen = mk_str_gen('[abcd]{0,3}') | |||
gen = mk_str_gen('[a-d]{1,2}.{0,1}[0-9]{1,2}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally the change looks good. I just want to check that you changed the generation pattern to make it so it would always match regular expression below.
Closes #8593
Closes #8607
Contributes to #8447
This PR :
Fixed
StringGen
in integration tests to correctly generate long and complex strings from a regex. StringGen can't do this when the state space of a regex is very large, as described in [FEA] Generation of long strings properly withStringGen
in integration tests #8593.Implemented a decimal generation function to speed up the generation of decimals that were previously generated by
sre_yield
.Updated
_cache_repr
ofUniqueLongGen
andRepeatSeqGen
to make them simpler.Some small changes in test cases that affected by the
StringGen
change:In
test_regexp_extract_all_idx_out_of_bounds
, the error message is raised only if the given string matches the pattern, so it can't raise the error withmk_str_gen('[abcd]{0,3}')
after StringGen changes because of bad luck. This behavior is matched between cpu and gpu, so I just change thegen
tomk_str_gen('[a-d]{1,2}.{0,1}[0-9]{1,2}')
.One regex used by
test_json_tuple
andtest_get_json_object
will generate numbers with leading zeros, which is not compatible between spark and plugin, as described in [BUG] Mismatch cases in json_tuple function when json items have leading zeroes #8607. I changed it to not generate such numbers.When processing string to sql in
_convert_to_sql
, there is a edge case that if there is a\'
in string beforereplace
, the result will be r\\'
, which will make the'
escape. So I replaced all r\
with r\\
first to prevent it.[UPDATED]
test_hash_groupby_collect_set_on_nested_type
andtest_hash_reduction_collect_set_on_nested_type
faile after this PR, but they also fail when selecting different seeds, as described in [BUG]test_hash_groupby_collect_set_on_nested_type
andtest_hash_reduction_collect_set_on_nested_type
failed #8716. I don't think it's related to this PR, so I temporarily marked them as XFAIL.[UPDATED] Some test results:
For DecimalGen, this change brings some performance improvement. I tested this one from current IT:
Before this PR: 0:07:05, after this PR: 0:04:44
For
StringGen
, if we use''.join(rand.choice(sre_yield.CHARSET + ['\n']) for _ in range(30))
to generate default strings, the speed is even slower, so I leave it unchanged. I tested this case:Before this PR: 52.04s, First commit of this PR: 52.49s. Other string-related tests are also a bit slower. Thus, I will not close the
sre_yield
performance issue #8477.The overall performance improvement will be small because it only affects
DecimalGen
.Note
Now the
DecimalGen
will only generating numbers with the given precision rather than (1, precision) because it may cause precision limitations in some cases, leading to CPU/GPU mismatch.test_str_to_map_expr_fixed_pattern_input
andtest_initcap
are failed when usingrandom() * length
as index but passed now. I don't know why and I will investigate them later. Currently they don't block anything.