sqlstats: add limit to flush data #97123
Conversation
077b002 to 3d99f14 (Compare)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w)
Previously, j82w (Jake) wrote…
What is the concern with the data blowing up? Just wondering if the persisted table growing is better than losing the stats data.
The problem is that we saw a case of 4 TB of data, and if the user wants a limit of 1M, that is a lot more.
After that we can work on improvements such as checking the status of the cleanup job to give a warning, or combining it with the flush, and this would no longer be an issue. But at least it is a safety for now.
Previously, maryliag (Marylia Gutierrez) wrote…
Even with 1M the UI will still fail to load, right? Just wondering if we should focus on reducing the response size limit or do some change so that the UI always loads.
Do we need to change this behaviour at all? What did we do for in-memory stats prior to persisted stats? It would be nice if it behaved like a FIFO cache so that we could hold on to the in-memory stats between flush attempts.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)
Previously, j82w (Jake) wrote…
Even with 1M the UI will still fail to load, right? Just wondering if we should focus on reducing the response size limit or do some change so that the UI always loads.
Yes, it does fail to load, but that is not what this PR is trying to address; that would require more significant changes. Here I'm just focusing on the case where the cleanup gets stuck or fails, and I'm adding a safety to avoid the data blowing up.
pkg/sql/sqlstats/persistedsqlstats/flush.go
line 49 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
Do we need to change this behaviour at all? What did we do for in-memory stats prior to persisted stats? It would be nice if it behaved like a FIFO cache so that we could hold on to the in-memory stats between flush attempts.
I'm still deciding if I should change this or not. I think I might keep it as is for this PR, to focus on the safety check; further improvements can look into this better. Especially if we want to change to do a cleanup along with the flush, this behaviour will certainly change.
LGTM as a safety check! I had a minor question/suggestion, but that could also wait for the follow-up PR.
Reviewed 2 of 3 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @maryliag, and @xinhaoz)
pkg/sql/sqlstats/persistedsqlstats/flush.go
line 104 at r2 (raw file):
}
actualSize := float64(tree.MustBeDInt(row[0]))
return actualSize > (maxPersistedRows * 1.5)
The `1.5` multiplier seems a bit arbitrary. Would it make more sense to use `sql.metrics.max_mem_stmt_fingerprints` (i.e., up to one extra flush)?
Code quote: `1.5`
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @j82w, and @xinhaoz)
pkg/sql/sqlstats/persistedsqlstats/flush.go
line 104 at r2 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
The `1.5` multiplier seems a bit arbitrary. Would it make more sense to use `sql.metrics.max_mem_stmt_fingerprints` (i.e., up to one extra flush)?
My thinking for this number was:
With the default for memory of 100k and the default for persisted of 1M, imagine that during each flush we send 100k, and we want to keep sending those until we reach the next hourly cleanup. By the 50-minute mark you would have added 500k extra rows, and on the hour you're about to clean, so I stop introducing more data at that point.
Allowing just the memory max as extra is more conservative, but I wanted to give some extra room for data while still not letting it blow up. I think it is okay to let the data reach 1.5x of its max, knowing it will get cleaned up soon.
We can always change this value if we receive any feedback on it, of course.
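To make the arithmetic above concrete, here is a minimal, self-contained sketch of the threshold check being discussed. The standalone function and its name are illustrative; per the code quote above, the real flush.go derives the row count from `tree.MustBeDInt(row[0])` and the maximum from `SQLStatsMaxPersistedRows`.

```go
package main

import "fmt"

// limitReached mirrors the check quoted above: stop flushing once the
// persisted table holds more than 1.5x the configured maximum. With the
// defaults discussed (100k in-memory, 1M persisted, hourly cleanup), five
// extra flushes of 100k rows add 500k on top of 1M, i.e. exactly 1.5x.
func limitReached(actualRowCount, maxPersistedRows float64) bool {
	return actualRowCount > maxPersistedRows*1.5
}

func main() {
	const maxPersistedRows = 1_000_000
	fmt.Println(limitReached(1_400_000, maxPersistedRows)) // false: keep flushing
	fmt.Println(limitReached(1_600_000, maxPersistedRows)) // true: stop flushing
}
```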
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @maryliag, and @xinhaoz)
pkg/sql/sqlstats/persistedsqlstats/flush.go
line 85 at r2 (raw file):
maxPersistedRows := float64(SQLStatsMaxPersistedRows.Get(&s.SQLStats.GetClusterSettings().SV))
readStmt := `
Do we need to do any perf testing on this? I've noticed that sometimes querying this table can take several seconds to over a minute. Would it be more efficient to get a descriptor and use a rough disk size, use the table stats, or directly query the time series DB?
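One possible way to keep that size check cheap, sketched below under the assumption that counting only up to the threshold is acceptable. This is not necessarily what the PR implements; the query shape, placeholder, and threshold handling are illustrative.

```go
package main

import "fmt"

// boundedCountQuery counts rows only up to a caller-supplied limit, so the
// scan can stop early once the threshold is exceeded instead of reading
// every row of system.statement_statistics. This is a sketch, not the PR's
// actual readStmt.
const boundedCountQuery = `
SELECT count(*)
FROM (
  SELECT fingerprint_id
  FROM system.statement_statistics
  LIMIT $1
) AS capped`

func main() {
	// A caller would pass maxPersistedRows*1.5 + 1 as $1 and treat a count
	// equal to that limit as "threshold reached".
	fmt.Println(boundedCountQuery)
}
```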
Build failed:
bors r-
bors r+
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 5f357ba to blathers/backport-release-22.1-97123: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
During the sql stats compaction job, we limit the number of rows being deleted per transaction. We used a default value of 1024, but we have increasingly been seeing customers needing to raise this value to allow the job to keep up with the large amount of data being flushed. We have been recommending a value of 20k, so, being more conservative with the default (plus the changes in cockroachdb#97123 that won't let tables get into a state with so many rows), this commit changes the value to 10k.

Fixes cockroachdb#97528

Release note (sql change): Increase the default value of `sql.stats.cleanup.rows_to_delete_per_txn` to 10k, to increase the efficiency of the cleanup job for sql statistics.
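For operators who hit the old 1024-row default before picking up this change, the setting can also be raised manually. Below is a minimal sketch using Go's standard database/sql over the Postgres wire protocol; the driver choice, connection string, and value are assumptions for illustration only.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // pgwire driver; CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Connection string is illustrative; adjust for your cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raise the per-transaction delete batch used by the sql stats cleanup
	// job; 10000 matches the new default described in the commit message.
	if _, err := db.Exec(
		"SET CLUSTER SETTING sql.stats.cleanup.rows_to_delete_per_txn = 10000",
	); err != nil {
		log.Fatal(err)
	}
	log.Println("setting updated")
}
```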
97486: ui: fix jobs page error state r=xinhaoz a=xinhaoz

Previously, the error state for the jobs page was only displayed if jobs had previously been returned. We should show the API error even when we have never received a successful jobs payload (e.g., an error on the first request). This commit changes error displaying on the jobs page so that we show the request error regardless of whether or not we have previously received data. If we have previous data when we receive an unsuccessful request response, we show the error along with the existing data.

Epic: none

Release note (bug fix): Jobs page properly shows the error state when we receive an error during data fetching.

97642: sqlstats: increase default value for deleted rows r=maryliag a=maryliag

During the sql stats compaction job, we limit the number of rows being deleted per transaction. We used a default value of 1024, but we have increasingly been seeing customers needing to raise this value to allow the job to keep up with the large amount of data being flushed. We have been recommending a value of 20k, so, being more conservative with the default (plus the changes in #97123 that won't let tables get into a state with so many rows), this commit changes the value to 10k.

Fixes #97528

Release note (sql change): Increase the default value of `sql.stats.cleanup.rows_to_delete_per_txn` to 10k, to increase the efficiency of the cleanup job for sql statistics.

97662: ui: show recent executions even without lock information r=maryliag a=maryliag

Previously, if the call to retrieve contention/lock information returned a max size limit error, we were not displaying any data on the recent executions. Now we show the data returned, with a warning that cluster locks information might be missing.

Part Of #96184

![Screenshot 2023-02-24 at 6 25 39 PM](https://user-images.githubusercontent.com/1017486/221320673-65a8f2eb-e7e3-4f27-b911-4111e8dc6728.png)

Release note (ui change): Still show active execution information when there is a max size limit error.

Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: maryliag <[email protected]>
Previously, we would always let data get flushed to the system tables, and then the cleanup job would remove the excess data.
When the cleanup job had hiccups and got stuck, data was still continuously being added, making the situation worse and requiring a reset of the stats in some cases.
To prevent these cases, this commit adds a limit on how much excess data can be flushed; this way, if the cleanup job stops working, the data won't blow up.
Part Of #97074
Follow-up work can do a better job of cleaning the data during the flush, but this commit focuses on adding the safety so it can be backported.
Release note (sql change): Add a hard limit on how much data can be flushed to system tables for sql stats.