Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlstats: fix reset stats container on flush #121134

Closed
xinhaoz opened this issue Mar 26, 2024 · 1 comment · Fixed by #121156
Closed

sqlstats: fix reset stats container on flush #121134

xinhaoz opened this issue Mar 26, 2024 · 1 comment · Fixed by #121156
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability

Comments

@xinhaoz
Copy link
Member

xinhaoz commented Mar 26, 2024

s.mu.stmts = make(map[stmtKey]*stmtStats, len(s.mu.stmts)/2)
s.mu.txns = make(map[appstatspb.TransactionFingerprintID]*txnStats, len(s.mu.txns)/2)
s.mu.sampledPlanMetadataCache = make(map[sampledPlanKey]time.Time, len(s.mu.sampledPlanMetadataCache)/2)
s.freeLocked(ctx)
if s.knobs != nil && s.knobs.OnAfterClear != nil {
s.knobs.OnAfterClear()
}

In the above, freeLocked is called after clearing the containers. Within freeLocked, the size of each container cleared is used to decrement uniqueServerCount which is used to track the total number of in-memory fingerprints tracked by sql stats. Due to these steps we never actually decrement the counter after each flush, causing the uniqueServerCount to grow erroneously. This can cause us to incorrectly spam sending the memory pressure signal to the sql stats flush worker, causing rapid flushing of empty containers.

Jira issue: CRDB-37092

@xinhaoz xinhaoz self-assigned this Mar 26, 2024
Copy link

blathers-crl bot commented Mar 26, 2024

Hi @xinhaoz, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@xinhaoz xinhaoz added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 26, 2024
craig bot pushed a commit that referenced this issue Mar 27, 2024
121086: kv: use response keys of ranged writes for lock tracking r=nvanbenschoten a=nvanbenschoten

Informs #117978.

This PR updates the `txnPipeliner` to use the response keys from `Get`, `Scan`, `ReverseScan`, and `DeleteRange` requests to track pipelined and non-pipelined lock acquisitions / intent writes, instead of assuming that the requests could have left intents anywhere in their request span. This more precise tracking avoid broad ranged intent resolution when more narrow intent resolution is possible. It will also be used by #117978 to track in-flight replicated lock acquisition.

This is a WIP. Some tests need to be updated.

Release note: None

121152: roachtest: adjust WAL failover disk stall roachtest's logging config r=sumeerbhola a=jbowens

The previous logging config outputted all logs to stderr too, which is not buffered and could become blocked on the stalled disk.

Epic: none
Release note: none

121156: sqlstats: fix reset in-memory sql stats on flush r=xinhaoz a=xinhaoz

After flushing in-memory sql stats to disk, we reset and prep each app container for reuse by:
- Decrementing the per-node fingerprint counter by the size of the app container. This counter prevents us from writing more sql stats when we reach the maximum amount of fingerprints stored in memory.
- Clearing the container and reducing its capacity to 1/2.

When introducing atomic flushing, we swapped the 2 ops above in the reset step, resulting in the decrement step being a noop. The counter never resets and eventually results in each attempt at writing sql stats to be throttled which then also signals the sql stats flush worker.

Epic: none
Fixes: #121134

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
@craig craig bot closed this as completed in ee97259 Mar 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant