improve normalizer cpu usage #43

lu-zhengda · 2024-12-16T23:22:29Z

This PR reduces the Normalizer's high CPU utilization by removing frequent calls to strings.ToUpper. Profiling with Go's cpuprofile showed that strings.ToUpper was a major contributor to CPU usage.

Key Changes

1. Parsing Behavior for SQL Commands:

SQL commands are now matched only when the identifier is:
- All uppercase (e.g., SELECT)
- All lowercase (e.g., select)
- Titlecase (e.g., Select)
Identifiers with inconsistent casing (e.g., sElEcT) are no longer recognized as SQL commands in the statement metadata.
This is a deliberate tradeoff between performance and completeness: it avoids calling strings.ToUpper on every identifier, which was the main cause of high CPU usage.
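
The matching rule above can be illustrated with a minimal sketch. This is not the code from this PR: the command list, `buildCommandLookup`, and `lookupCommand` are hypothetical names, and a precomputed map is only one possible way to accept the three casings without uppercasing each identifier at parse time.

```go
package main

import (
	"fmt"
	"strings"
)

// commands is a hypothetical, abbreviated list of SQL commands.
var commands = []string{"SELECT", "INSERT", "UPDATE", "DELETE"}

// commandLookup maps every accepted spelling to its uppercase form.
// It is built once, so no case conversion happens per identifier.
var commandLookup = buildCommandLookup(commands)

// buildCommandLookup registers the uppercase, lowercase, and titlecase
// spelling of each command (assumes non-empty ASCII command names).
func buildCommandLookup(cmds []string) map[string]string {
	m := make(map[string]string, len(cmds)*3)
	for _, upper := range cmds {
		lower := strings.ToLower(upper)
		title := upper[:1] + lower[1:]
		m[upper] = upper
		m[lower] = upper
		m[title] = upper
	}
	return m
}

// lookupCommand reports whether ident is a command in one of the three
// accepted casings and, if so, returns its uppercase form.
func lookupCommand(ident string) (string, bool) {
	cmd, ok := commandLookup[ident]
	return cmd, ok
}

func main() {
	for _, ident := range []string{"SELECT", "select", "Select", "sElEcT"} {
		cmd, ok := lookupCommand(ident)
		fmt.Printf("%-6s matched=%v command=%q\n", ident, ok, cmd)
	}
}
```

In this sketch a mixed-case identifier such as sElEcT simply misses the map, which mirrors the tradeoff described above.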

2. Normalized Query Output:

The end output of the normalized query statement remains unaffected.
Users can still opt to uppercase all SQL keywords if desired.
By default, the normalizer does not alter the case of the input query, preserving existing behavior.

Benchmark Results

Before vs. After

Sub-Benchmark	Iterations (↑)	Time per Op (ns/op) (↓)	Memory (B/op) (↓)	Allocations (↓)
Escaping/512	+34.7%	-27.7%	-36.7%	-57.9%
Grouping/199	+37.4%	-28.9%	-27.4%	-42.1%
Large/3694	+73.3%	-45.7%	-45.4%	-69.2%
Complex/969	+79.5%	-43.2%	-48.3%	-65.6%
SuperLarge/4198	+101.6%	-43.8%	-42.0%	-70.0%

Summary of Improvements

Performance

Execution time dropped by ~27.7% to ~45.7% across benchmarks, with the most dramatic improvements for larger workloads.

Memory

Reduced memory consumption by ~27.4% to ~48.3%, lowering pressure on the garbage collector.

https://datadoghq.atlassian.net/browse/DBMON-4759

sqllexer_utils.go

lu-zhengda · 2024-12-16T23:35:40Z

Before

 ~/go/src/github.com/DataDog/go-sqllexer/ [main] go test -benchmem -run=^$ -bench ^BenchmarkObfuscationAndNormalization$ github.com/DataDog/go-sqllexer 
goos: darwin
goarch: arm64
pkg: github.com/DataDog/go-sqllexer
BenchmarkObfuscationAndNormalization/Escaping/512-10              144000              7182 ns/op            1440 B/op         76 allocs/op
BenchmarkObfuscationAndNormalization/Grouping/199-10              288408              4121 ns/op             760 B/op         38 allocs/op
BenchmarkObfuscationAndNormalization/Large/3694-10                 12823             94193 ns/op           26360 B/op       1103 allocs/op
BenchmarkObfuscationAndNormalization/Complex/969-10                44007             27684 ns/op            6520 B/op        302 allocs/op
BenchmarkObfuscationAndNormalization/SuperLarge/4198-10            10000            113920 ns/op           41448 B/op       1059 allocs/op
PASS
ok      github.com/DataDog/go-sqllexer  7.576s

After

 ~/go/src/github.com/DataDog/go-sqllexer/ [zhengda.lu/normalize*] go test -benchmem -run=^$ -bench ^BenchmarkObfuscationAndNormalization$ github.com/DataDog/go-sqllexer -cpuprofile=cpu.prof
goos: darwin
goarch: arm64
pkg: github.com/DataDog/go-sqllexer
BenchmarkObfuscationAndNormalization/Escaping/512-10              194044              5191 ns/op             912 B/op         32 allocs/op
BenchmarkObfuscationAndNormalization/Grouping/199-10              396208              2932 ns/op             552 B/op         22 allocs/op
BenchmarkObfuscationAndNormalization/Large/3694-10                 22224             51106 ns/op           14392 B/op        340 allocs/op
BenchmarkObfuscationAndNormalization/Complex/969-10                78969             15727 ns/op            3368 B/op        104 allocs/op
BenchmarkObfuscationAndNormalization/SuperLarge/4198-10            20156             64022 ns/op           24032 B/op        318 allocs/op
PASS
ok      github.com/DataDog/go-sqllexer  7.852s
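
For anyone who wants to re-derive the percentages in the summary table, the arithmetic is a plain relative change between the two runs. The helper name `pctChange` is an assumption for illustration; the inputs are the Escaping/512 figures copied from the output above.

```go
package main

import "fmt"

// pctChange returns the relative change from before to after, in percent.
// Negative values mean the metric went down.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Escaping/512, before vs. after.
	fmt.Printf("iterations: %+.1f%%\n", pctChange(144000, 194044))
	fmt.Printf("ns/op:      %+.1f%%\n", pctChange(7182, 5191))
	fmt.Printf("B/op:       %+.1f%%\n", pctChange(1440, 912))
	fmt.Printf("allocs/op:  %+.1f%%\n", pctChange(76, 32))
}
```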

lu-zhengda · 2024-12-16T23:43:48Z

strings.ToUpper is expensive on this hot path because it scans every character of its input and applies Unicode-compliant case mapping.

- Unicode Table Lookups: Determining whether a character needs transformation is computationally intensive, especially for non-ASCII characters.
- UTF-8 Decoding: Multi-byte characters are decoded to runes and encoded back, adding overhead.
- Memory Allocation: A new string is created whenever the input actually changes, increasing GC activity.
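
The allocation cost can be checked locally with a small sketch like the one below. It uses `testing.AllocsPerRun` from the standard library; the `sink` variable and `allocsToUpper` helper are hypothetical names, and the exact counts may vary with the Go version, so treat the result as indicative rather than definitive.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps each result reachable so the compiler cannot discard the call.
var sink string

// allocsToUpper reports the average number of heap allocations made by
// one strings.ToUpper call on s.
func allocsToUpper(s string) float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = strings.ToUpper(s)
	})
}

func main() {
	for _, s := range []string{"SELECT", "select", "Select"} {
		fmt.Printf("%-6s allocs per call: %.0f\n", s, allocsToUpper(s))
	}
}
```

An input that is already uppercase has nothing to rewrite, while lowercase or titlecase input forces a new string to be built on every call, which is why calling it per identifier adds up.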

improve normalizer cpu usage

33b632d

datadog-datadog-prod-us1 bot reviewed Dec 16, 2024

lu-zhengda added 2 commits December 16, 2024 18:23

remove cpu.prof

29c6a92

improve format

06a308b

lu-zhengda added 2 commits December 16, 2024 18:54

quick check for ascii letter

90812ab

upper case collected commands

c6d44ad

lu-zhengda marked this pull request as ready for review December 17, 2024 02:42

lu-zhengda requested a review from a team as a code owner December 17, 2024 02:42

nenadnoveljic approved these changes Dec 17, 2024

Merge branch 'main' into zhengda.lu/normalize

f67aaa9

lu-zhengda merged commit a418b44 into main Dec 17, 2024
4 checks passed

lu-zhengda deleted the zhengda.lu/normalize branch December 17, 2024 19:28

lu-zhengda mentioned this pull request Dec 17, 2024

bump github.com/DataDog/go-sqllexer to v0.0.18 DataDog/datadog-agent#32315

Merged
