sql: remove diagnostics.sql_stat_reset.interval #71597
Conversation
Follow up to #71596
34b5b1d to 7d5d4f3
Tagging @cockroachdb/sql-experience to review whether I retired the cluster setting properly.
7d5d4f3 to d337fdd
Are there any tests that would be useful to add here? I'm imagining something that would show the interaction of the happy-path flush to system tables with the uh-oh path reset because it's been too long and we're running out of memory. WDYT?
Reviewed 9 of 10 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @matthewtodd)
-- commits, line 17 at r1:
nit: some style / grammar adjustments, WDYT?
Previously, the SQL Stats system reset its in-memory SQL Stats periodically, determined by the diagnostics.sql_stat_reset.interval cluster setting. This periodic reset loop was introduced early in 21.1, prior to persisted SQL Stats. In 21.2, we introduced persisted SQL Stats along with a periodic task to flush in-memory SQL Stats into system tables. These two async tasks' coexistence introduced unwanted race conditions: if the SQL Stats reset task reset the in-memory store before the flusher could flush it to the system table, those particular stats would be lost forever. This commit changes the .Start() method in PersistedSQLStats to avoid this race condition. However, reported SQL Stats still periodically resets its in-memory storage in case the telemetry server goes offline.
pkg/sql/conn_executor.go, line 447 at r1 (raw file):
s.sqlStats.Start(ctx, stopper) // reportedStats is periodically cleared to prevent too much SQL Stats
nit: s/much/many/
We also need tests that can capture the initial issue this PR is trying to fix; otherwise it's just a guess that this change would indeed fix the issue.
Reviewed 8 of 10 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)
Hmm, testing this would be really difficult, because the solution to the original problematic component is to completely remove that component. That means any "new" tests we introduce would be testing against a non-existent component, since we will be removing it. The old component also lacks testing knobs for us to properly test this logic, so in order to introduce tests for the old component, we would need to:
The net effect would be the same as in this PR. On a different subject, I think there's a decent chance that this PR might not address the root issue that caused #71596. I will test my new theory and send in a new PR for that. However, regardless, we should merge this PR because we don't want 2 different goroutines racing to clear our in-memory data.
Also, the telemetry dumping unit tests were updated to reflect the new change.
Regarding testing, there are two separate stories:
- When I look at this PR without the context of the issue that motivates it, I still see a compelling reason to merge the PR: the periodic reset has become redundant with the new mechanism to persist stats, and it's good to remove redundant code (for hygiene). When removing an obsolete mechanism, we do not typically introduce new tests that confirm the mechanism has been removed. Instead, we remove the tests that had been added to support the original mechanism. (Btw, I am not recognizing this: where were the unit tests for diagnostics.sql_stats_reset.interval located? These should at least be altered, if not removed outright, by this PR.)
- Generally, if there's a race condition somewhere, it is due to the interaction between two or more subsystems. If we keep both subsystems and add some coordination mechanism (e.g. a mutex), then yes, we do want to add a new test for that race condition, to assert that it does not occur and to prevent regressions. However, in this case, we are removing one of the participants in the race condition, such that only one subsystem remains. After this PR is merged, there is no conflict between two mechanisms any more, and no environment where a race condition could even occur. In that case (all concurrent mechanisms but one have been removed), we also don't typically add new tests.
Reviewed 10 of 10 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)
-- commits, line 19 at r1:
More words here please.
d337fdd to 1e26f2c
Previously, the SQL Stats system reset its in-memory SQL Stats periodically, determined by the diagnostics.sql_stat_reset.interval cluster setting. This periodic reset loop was introduced early in 21.1, prior to persisted SQL Stats. In 21.2, we introduced persisted SQL Stats along with a periodic task to flush in-memory SQL Stats into system tables. These two async tasks' coexistence introduced unwanted race conditions: if the SQL Stats reset task reset the in-memory store before the flusher could flush it to the system table, those particular stats would be lost forever.

This commit changes the .Start() method in PersistedSQLStats to avoid this race condition. However, reported SQL Stats still periodically resets its in-memory storage in case the telemetry server goes offline. As a result, the diagnostics.sql_stats_reset.interval cluster setting is removed. The diagnostics.forced_sql_stats_reset.interval cluster setting now only controls the reset of the *reported* SQL Stats if it is not collected by the telemetry reporter.

Release note (sql change): Previously, in-memory SQL Stats was reset periodically. This behavior is now removed, since persisted SQL Stats was introduced in 21.2. As a result, the diagnostics.sql_stats_reset.interval cluster setting is removed. diagnostics.forced_sql_stats_reset.interval now only controls the reset of the reported SQL Stats if it is not collected by the telemetry reporter.
1e26f2c to af1866e
I was having a difficult time trying to find where we originally tested the behaviour of the diagnostics.sql_stats_reset.interval cluster setting, but I couldn't find any tests. Perhaps that is why this issue was not caught early on.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @maryliag, and @matthewtodd)
Previously, matthewtodd (Matthew Todd) wrote…
nit: some style / grammar adjustments, WDYT?
Done.
Previously, knz (kena) wrote…
More words here please.
Added more description to the release note.
pkg/sql/conn_executor.go, line 447 at r1 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
nit: s/much/many/
Done.
Reviewed 7 of 7 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)
Reviewed 7 of 7 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Azhng)
TFTR! bors r=knz,maryliag
Build failed (retrying...)
Build failed (retrying...)
Build succeeded
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from af1866e to blathers/backport-release-21.2-71597: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
You may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Should this be blindly backported to 21.2? Was the code for stats persistence already in 21.2?
blathers backport 21.2
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating backport branch refs/heads/blathers/backport-release-21.2-71597: POST https://api.github.com/repos/cockroachdb/cockroach/git/refs: 422 Reference already exists []
Backport to branch 21.2 failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Previously, SQL Stats reset its in-memory SQL Stats
periodically (determined by the diagnostics.sql_stat_reset.interval
cluster setting). The periodic reset loop was introduced early
in 21.1, prior to persisted SQL Stats. In 21.2, we introduced
persisted SQL Stats and also run a periodic task that flushes
in-memory SQL Stats into system tables. These two async tasks'
coexistence introduced unwanted race conditions. If the SQL
Stats reset task reset the in-memory store before the
flusher could flush it to the system table, those particular
stats would be lost forever.

This commit changes the .Start() method in PersistedSQLStats to
avoid this race condition. However, reported SQL Stats still
periodically resets its in-memory storage in case the telemetry
server goes offline.

Release note (sql change): diagnostics.sql_stats_reset.interval
is removed.