[FEA] Improve integration test string generation performance #8447

Open
revans2 opened this issue May 31, 2023 · 2 comments
Labels
feature request (New feature or request) · good first issue (Good for newcomers) · test (Only impacts tests)

Comments

Collaborator

revans2 commented May 31, 2023

Is your feature request related to a problem? Please describe.
After seeing the big performance gains from #8441, I did a quick profiling run of a subset of the tests. Even after the caching fix, data generation takes the majority of the time when we are not waiting for Spark, and almost all of that time is spent in sre_yield, the tool we use for string generation (we also use it to generate strings that are formatted like proper decimal values). It would be good to see what we can do to speed up this string generation. Perhaps we can find a good way to generate random strings without it when no regular expression is passed in. We might also want to look at alternatives for generating decimal values or very simple patterns.

revans2 added the labels feature request (New feature or request), ? - Needs Triage (Need team to review and classify), and test (Only impacts tests) on May 31, 2023
mattahrens added the label good first issue (Good for newcomers) and removed the label ? - Needs Triage (Need team to review and classify) on Jun 6, 2023
Collaborator

ttnghia commented Jun 9, 2023

Can we do file caching for test data? I.e., a generator will first try to load from a cached file. If the file doesn't exist, data will be generated and written down to file.

Collaborator Author

revans2 commented Jun 12, 2023

> Can we do file caching for test data? I.e., a generator will first try to load from a cached file. If the file doesn't exist, data will be generated and written down to file.

That would be another interesting option. I'm not sure how much it would help beyond the in-memory caching for CI runs, but it would help a lot for local runs, where we can keep data cached for longer. We would want to add some kind of "version" number to the generators so that if we change them, we can invalidate the data saved to disk.
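
A rough sketch of how a versioned on-disk cache could look, to make the invalidation idea concrete. Everything here is an assumption for illustration: the function names, the use of `pickle`, the cache key layout, and the idea that a generator can be called as `generate(seed, length)`. It also assumes `gen_name` is safe to use in a file name.

```python
import hashlib
import os
import pickle
import tempfile


def cache_path(cache_dir, gen_name, gen_version, seed, length):
    """Build a file path whose name changes whenever any input changes."""
    key = f"{gen_name}:{gen_version}:{seed}:{length}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_dir, f"{gen_name}-{digest}.pkl")


def load_or_generate(cache_dir, gen_name, gen_version, seed, length, generate):
    """Load cached data if present, otherwise generate it and write it out.

    `generate` is a hypothetical callable taking (seed, length).
    """
    path = cache_path(cache_dir, gen_name, gen_version, seed, length)
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        pass  # Cache miss: fall through and generate the data.

    data = generate(seed, length)
    os.makedirs(cache_dir, exist_ok=True)
    # Write to a temporary file and rename it into place, so a concurrent
    # reader never sees a partially written cache file.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(data, f)
    os.replace(tmp_path, path)
    return data
```

Because the version is part of the key, bumping a generator's version makes old files unreachable rather than deleting them, so stale files would still need occasional cleanup.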
