-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sql-formatter optimization #368
Conversation
@veler, I haven't had much time lately, so it has taken some time. Please have a look at it and let me know if anything needs to be changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, thanks for proposing this BIG improvement :D It's such a relief not having just a literal translation from JS to C# with all the bad practices involved. Very great usage of ReadOnlySpan by the way!
Overall, it's all look good. Just a few things I asked that would help me understanding better what's going on, and a few styling issues.
Thank you so much! 😁 🚀
src/dev/impl/DevToys/Helpers/SqlFormatter/Languages/SparkSqlFormatter.cs
Outdated
Show resolved
Hide resolved
This is just a suggestion that I was thinking about for a while... Should not it be more efficient to write a simple grammar to read different SQL languages and later transform it for different outputs? That could (probably) even transform a MySQL query to PostgreSQL. To write a simple grammar could be a bit exhausting because it is one for each language (with their visitors and else to have a complete output) but it could be interesting... |
@youkai95, ANTLR or custom parser would definitely perform better than a simple regex IMO. In fact, Microsoft's TSql parser uses ANTLR internally for parsing. So you could generate a formatted script efficiently with the script generator after parsing the input, but this will only be exclusive to Microsoft SQL. |
It looks all good now :D Thank you very much @akarboush ! |
* Optimize sql-formatter (1) - avoid unnecessary string allocation when tokenizing and formatting * Optimize sql-formatter (2) - remove RegexOptions.Compiled * cleanup * cleanup * revert method access level to internal in TokenHelper * add regex default match timeout & compiled option to static regex fields
Pull request type
Please check the type of change your PR introduces:
What is the current behavior?
What is the new behavior?
Other information
this pull request contains 2 optimization 1 and 2.
Optimize sql-formatter (1):
During tokenization and after a successful regex match, only the index and the length of the match are stored. So there is no new string allocations. And during formatting, a read-only span is created over the string input and the new formatted string is written using the StringBuilder.
Optimize sql-formatter (2):
Removing the "Compiled" option from all regex objects has made a big difference in terms of memory and speed.
Note:
String-named placeholders (@"foo") are actually not originally supported in sql, so to simplify the key parsing I removed it and changed the other two placeholder types (indexed and indent named) to be of type char, as a result there is no additional property in the token object, which contains the key (mostly it is null), because the placeholder always has a length of 1(@, :, $ ...)
Benchmark: (0 is the current main branch)
I did additional micro-optimizations, but they had no noticeable impact on performance since most memory allocation happens in the GetTokenizer method, so I decided to drop them
Quality check
Before creating this PR, have you: