Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sql-formatter optimization #368

Merged
merged 6 commits into from
Feb 12, 2022
Merged

Sql-formatter optimization #368

merged 6 commits into from
Feb 12, 2022

Conversation

akarboush
Copy link
Contributor

@akarboush akarboush commented Feb 10, 2022

Pull request type

Please check the type of change your PR introduces:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Internationalization and localization
  • Other (please describe):

What is the current behavior?

What is the new behavior?

Other information

this pull request contains 2 optimization 1 and 2.

  • Optimize sql-formatter (1):
    During tokenization and after a successful regex match, only the index and the length of the match are stored. So there is no new string allocations. And during formatting, a read-only span is created over the string input and the new formatted string is written using the StringBuilder.

  • Optimize sql-formatter (2):
    Removing the "Compiled" option from all regex objects has made a big difference in terms of memory and speed.

Note:
String-named placeholders (@"foo") are actually not originally supported in sql, so to simplify the key parsing I removed it and changed the other two placeholder types (indexed and indent named) to be of type char, as a result there is no additional property in the token object, which contains the key (mostly it is null), because the placeholder always has a length of 1(@, :, $ ...)

Benchmark: (0 is the current main branch)

image

I did additional micro-optimizations, but they had no noticeable impact on performance since most memory allocation happens in the GetTokenizer method, so I decided to drop them

Quality check

Before creating this PR, have you:

  • Followed the code style guideline as described in CONTRIBUTING.md
  • Verified that the change work in Release build configuration
  • Checked all unit tests pass

@akarboush
Copy link
Contributor Author

@veler, I haven't had much time lately, so it has taken some time. Please have a look at it and let me know if anything needs to be changed.

@btiteux btiteux requested a review from veler February 10, 2022 17:27
Copy link
Collaborator

@veler veler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, thanks for proposing this BIG improvement :D It's such a relief not having just a literal translation from JS to C# with all the bad practices involved. Very great usage of ReadOnlySpan by the way!

Overall, it's all look good. Just a few things I asked that would help me understanding better what's going on, and a few styling issues.

Thank you so much! 😁 🚀

@veler veler added this to the v1.0.3.0 milestone Feb 11, 2022
@youkai95
Copy link
Contributor

This is just a suggestion that I was thinking about for a while... Should not it be more efficient to write a simple grammar to read different SQL languages and later transform it for different outputs? That could (probably) even transform a MySQL query to PostgreSQL. To write a simple grammar could be a bit exhausting because it is one for each language (with their visitors and else to have a complete output) but it could be interesting...
I am not sure if ANTLR could perform efficiently compared to this solution...

@akarboush
Copy link
Contributor Author

I am not sure if ANTLR could perform efficiently compared to this solution...

@youkai95, ANTLR or custom parser would definitely perform better than a simple regex IMO. In fact, Microsoft's TSql parser uses ANTLR internally for parsing. So you could generate a formatted script efficiently with the script generator after parsing the input, but this will only be exclusive to Microsoft SQL.
ANTLR would be for sure a nicer way and the transforming between languages would be possible too but as you mentioned it could be a bit exhausting knowing all grammars and differences for the languages

@veler
Copy link
Collaborator

veler commented Feb 12, 2022

It looks all good now :D Thank you very much @akarboush !

@veler veler merged commit cc455ff into DevToys-app:main Feb 12, 2022
@akarboush akarboush deleted the sqlformatter branch February 12, 2022 18:35
veler pushed a commit that referenced this pull request Mar 31, 2023
* Optimize sql-formatter (1) - avoid unnecessary string allocation when tokenizing and formatting

* Optimize sql-formatter (2) - remove RegexOptions.Compiled

* cleanup

* cleanup

* revert method access level to internal in TokenHelper

* add regex default match timeout & compiled option to static regex fields
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants