Sql-formatter optimization #368

akarboush · 2022-02-10T16:04:01Z

Pull request type

Please check the type of change your PR introduces:

What is the current behavior?

What is the new behavior?

Other information

this pull request contains 2 optimization 1 and 2.

Optimize sql-formatter (1):
During tokenization and after a successful regex match, only the index and the length of the match are stored. So there is no new string allocations. And during formatting, a read-only span is created over the string input and the new formatted string is written using the StringBuilder.
Optimize sql-formatter (2):
Removing the "Compiled" option from all regex objects has made a big difference in terms of memory and speed.

Note:
String-named placeholders (@"foo") are actually not originally supported in sql, so to simplify the key parsing I removed it and changed the other two placeholder types (indexed and indent named) to be of type char, as a result there is no additional property in the token object, which contains the key (mostly it is null), because the placeholder always has a length of 1(@, :, $ ...)

Benchmark: (0 is the current main branch)

I did additional micro-optimizations, but they had no noticeable impact on performance since most memory allocation happens in the GetTokenizer method, so I decided to drop them

Quality check

Before creating this PR, have you:

Followed the code style guideline as described in CONTRIBUTING.md
Verified that the change work in Release build configuration
Checked all unit tests pass

… tokenizing and formatting

akarboush · 2022-02-10T16:21:01Z

@veler, I haven't had much time lately, so it has taken some time. Please have a look at it and let me know if anything needs to be changed.

veler

Hello, thanks for proposing this BIG improvement :D It's such a relief not having just a literal translation from JS to C# with all the bad practices involved. Very great usage of ReadOnlySpan by the way!

Overall, it's all look good. Just a few things I asked that would help me understanding better what's going on, and a few styling issues.

Thank you so much! 😁 🚀

src/tests/DevToys.Tests/Providers/Tools/SqlFormatterTests.cs

src/dev/impl/DevToys/Helpers/SqlFormatter/Languages/SparkSqlFormatter.cs

src/dev/impl/DevToys/Helpers/SqlFormatter/Core/Tokenizer.cs

src/dev/impl/DevToys/Helpers/SqlFormatter/Core/TokenHelper.cs

src/dev/impl/DevToys/Helpers/SqlFormatter/Core/RegexFactory.cs

youkai95 · 2022-02-11T03:03:40Z

This is just a suggestion that I was thinking about for a while... Should not it be more efficient to write a simple grammar to read different SQL languages and later transform it for different outputs? That could (probably) even transform a MySQL query to PostgreSQL. To write a simple grammar could be a bit exhausting because it is one for each language (with their visitors and else to have a complete output) but it could be interesting...
I am not sure if ANTLR could perform efficiently compared to this solution...

akarboush · 2022-02-12T13:56:15Z

I am not sure if ANTLR could perform efficiently compared to this solution...

@youkai95, ANTLR or custom parser would definitely perform better than a simple regex IMO. In fact, Microsoft's TSql parser uses ANTLR internally for parsing. So you could generate a formatted script efficiently with the script generator after parsing the input, but this will only be exclusive to Microsoft SQL.
ANTLR would be for sure a nicer way and the transforming between languages would be possible too but as you mentioned it could be a bit exhausting knowing all grammars and differences for the languages

veler · 2022-02-12T18:11:39Z

It looks all good now :D Thank you very much @akarboush !

* Optimize sql-formatter (1) - avoid unnecessary string allocation when tokenizing and formatting * Optimize sql-formatter (2) - remove RegexOptions.Compiled * cleanup * cleanup * revert method access level to internal in TokenHelper * add regex default match timeout & compiled option to static regex fields

akarboush added 2 commits February 10, 2022 01:27

Optimize sql-formatter (1) - avoid unnecessary string allocation when…

ca52c27

… tokenizing and formatting

Optimize sql-formatter (2) - remove RegexOptions.Compiled

1ddfb77

btiteux requested a review from veler February 10, 2022 17:27

veler reviewed Feb 11, 2022

View reviewed changes

veler added this to the v1.0.3.0 milestone Feb 11, 2022

akarboush added 4 commits February 11, 2022 20:21

cleanup

ca6d4e7

cleanup

6d619fe

revert method access level to internal in TokenHelper

9a3f7ed

add regex default match timeout & compiled option to static regex fields

855ae76

veler merged commit cc455ff into DevToys-app:main Feb 12, 2022

akarboush deleted the sqlformatter branch February 12, 2022 18:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sql-formatter optimization #368

Sql-formatter optimization #368

akarboush commented Feb 10, 2022 •

edited

Loading

akarboush commented Feb 10, 2022

veler left a comment

youkai95 commented Feb 11, 2022

akarboush commented Feb 12, 2022

veler commented Feb 12, 2022

Sql-formatter optimization #368

Sql-formatter optimization #368

Conversation

akarboush commented Feb 10, 2022 • edited Loading

Pull request type

What is the current behavior?

What is the new behavior?

Other information

Quality check

akarboush commented Feb 10, 2022

veler left a comment

Choose a reason for hiding this comment

youkai95 commented Feb 11, 2022

akarboush commented Feb 12, 2022

veler commented Feb 12, 2022

akarboush commented Feb 10, 2022 •

edited

Loading