
sql: remove diagnostics.sql_stat_reset.interval #71597

Merged
merged 1 commit into cockroachdb:master from sqlstats-nuke-telemetry-dump
Nov 10, 2021

Conversation

Azhng
Contributor

@Azhng Azhng commented Oct 15, 2021

Previously, the SQL Stats subsystem reset its in-memory SQL Stats
periodically (determined by the diagnostics.sql_stat_reset.interval
cluster setting). This periodic reset loop was introduced early
in 21.1, prior to persisted SQL Stats. In 21.2, we introduced
persisted SQL Stats along with a periodic task that flushes
in-memory SQL Stats into system tables. The coexistence of these
two async tasks introduced unwanted race conditions: if the SQL
Stats reset task reset the in-memory store before the flusher
could flush it to the system table, those stats would be lost
forever.
This commit changes the .Start() method in PersistedSQLStats to
avoid this race condition. However, reported SQL Stats still
periodically resets its in-memory storage in case the telemetry
server goes offline.

Release note (sql change): the diagnostics.sql_stat_reset.interval
cluster setting is removed.
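
A minimal sketch of the shape of the fix, for illustration only: this is not the actual PersistedSQLStats code, and names such as flushAndReset and flushInterval are made up. The point is that the flush and the reset happen as a single step on a single task, so a reset can never discard stats that have not yet been flushed:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sqlStats is a stand-in for the in-memory SQL Stats container.
type sqlStats struct {
	inMemory []string
}

// flushAndReset persists the in-memory stats and clears them in one step.
// Because the reset only ever happens here, immediately after the flush,
// there is no window in which a separate reset task could discard
// unflushed stats.
func (s *sqlStats) flushAndReset() {
	for _, stat := range s.inMemory {
		fmt.Println("flushing to system table:", stat) // stand-in for the SQL write
	}
	s.inMemory = s.inMemory[:0]
}

// Start runs the combined flush-and-reset loop on a single goroutine,
// replacing the two racing tasks described above.
func (s *sqlStats) Start(ctx context.Context, flushInterval time.Duration) {
	go func() {
		ticker := time.NewTicker(flushInterval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				s.flushAndReset()
			case <-ctx.Done():
				return
			}
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	s := &sqlStats{inMemory: []string{"SELECT 1", "INSERT INTO t VALUES (1)"}}
	s.Start(ctx, 100*time.Millisecond)
	time.Sleep(300 * time.Millisecond) // let the loop run a few times
}
```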

@cockroach-teamcity
Member

This change is Reviewable

@Azhng
Contributor Author

Azhng commented Oct 15, 2021

Follow up to #71596

@Azhng Azhng force-pushed the sqlstats-nuke-telemetry-dump branch from 34b5b1d to 7d5d4f3 on October 15, 2021 03:47
@Azhng Azhng requested a review from a team October 15, 2021 03:59
@Azhng Azhng marked this pull request as ready for review October 15, 2021 03:59
@Azhng Azhng requested a review from a team October 15, 2021 03:59
@Azhng
Contributor Author

Azhng commented Oct 15, 2021

Tagging @cockroachdb/sql-experience to review whether I retired the cluster setting properly.

@Azhng Azhng force-pushed the sqlstats-nuke-telemetry-dump branch from 7d5d4f3 to d337fdd on October 15, 2021 04:22
Contributor

@matthewtodd matthewtodd left a comment


Are there any tests that would be useful to add here? I'm imagining something that would show the interaction of the happy-path flush to system tables with the uh-oh path reset because it's been too long and we're running out of memory. WDYT?

Reviewed 9 of 10 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @matthewtodd)


-- commits, line 17 at r1:
nit: some style / grammar adjustments, WDYT?

Previously, the SQL Stats system reset its in-memory SQL Stats periodically,
determined by the diagnostics.sql_stat_reset.interval cluster setting. This
periodic reset loop was introduced early in 21.1 prior to persisted SQL Stats.
In 21.2, we introduced persisted SQL Stats along with a periodic task to
flush in-memory SQL Stats into system tables. These two async tasks'
coexistence introduced unwanted race conditions. If the SQL Stats reset task
reset the in-memory store before the flusher could flush it to the system
table, those particular stats would be lost forever.

This commit changes the .Start() method in PersistedSQLStats to avoid this
race condition. However, reported SQL Stats still periodically resets its
in-memory storage in case the telemetry server goes offline.


pkg/sql/conn_executor.go, line 447 at r1 (raw file):

	s.sqlStats.Start(ctx, stopper)

	// reportedStats is periodically cleared to prevent too much SQL Stats

nit: s/much/many/

Contributor

@maryliag maryliag left a comment


We also need tests that are able to capture the initial issue this PR is trying to fix; otherwise it's just a guess that this change would indeed fix the issue.

Reviewed 8 of 10 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)

@Azhng
Contributor Author

Azhng commented Oct 19, 2021

Hmm, testing this would be really difficult, because the solution to the original problematic component is to completely remove it. That means any "new" tests we introduce would be testing against a component that no longer exists. The old component also lacks testing knobs for us to properly test this logic, so in order to introduce a test for the old component, we would need to:

  • commit 1: introduce testing knobs to the old component (which is slated for deletion)
  • commit 2: introduce a test that demonstrates the failure
  • commit 3: introduce the fix, but then remove the test, since it depends on the testing knobs and the old component

The net effect would be the same as in this PR.


On a different subject, I think there's a decent chance that this PR might not address the root issue that caused #71596. I will test my new theory and send in a new PR for that.

Regardless, we should merge this PR, because we don't want two different goroutines racing to clear our in-memory data.

@Azhng
Contributor Author

Azhng commented Oct 19, 2021

Also, the telemetry dumping unit tests were updated to reflect this change.

@Azhng Azhng requested review from a team and removed request for a team November 9, 2021 16:13
Contributor

@knz knz left a comment


Regarding testing, there are two separate stories:

  • When I look at this PR without the context of the issue that motivates it, I still see a compelling reason to merge the PR: the periodic flush has become redundant with the new mechanism to persist stats, and it's good to remove redundant code (for hygiene). When removing an obsolete mechanism, we do not typically introduce new tests that confirm that the mechanism has been removed. Instead, we remove the tests that had been added to support the original mechanism.

    (Btw, I am not recognizing this: where were the unit tests for diagnostics.sql_stat_reset.interval located? These should at least be altered, if not removed outright, by this PR)

  • Generally, if there's a race condition somewhere, it is due to the interaction between two or more subsystems. If we keep both subsystems and then add some coordination mechanism (e.g. a mutex; see the sketch after this list), then yes we do want to add a new test for that race condition, to assert that it does not occur and prevent regressions.

    However, in this case, we are removing one of the participants in the race condition such that only one subsystem remains. After this PR is merged, there is no conflict between two mechanisms any more - no environment where a race condition could even occur. In that case (all concurrent mechanisms, but one, have been removed) we also don't typically add new tests.
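
As an illustration of that alternative (a minimal sketch with made-up names, not CockroachDB code): keeping both tasks would mean coordinating them, and to avoid losing stats the reset path would have to flush first, under the same lock:

```go
package main

import (
	"fmt"
	"sync"
)

// guardedStats keeps both a flush path and a reset path alive and
// coordinates them with a mutex, i.e. the alternative described above.
type guardedStats struct {
	mu       sync.Mutex
	inMemory []string
}

// flushLocked persists and clears the stats; callers must hold mu.
func (g *guardedStats) flushLocked() {
	for _, stat := range g.inMemory {
		fmt.Println("flushing:", stat) // stand-in for the SQL write
	}
	g.inMemory = g.inMemory[:0]
}

// flush is the periodic flush task's entry point.
func (g *guardedStats) flush() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.flushLocked()
}

// reset is the periodic reset task's entry point. It cannot simply clear
// the slice: to avoid losing stats it has to flush first.
func (g *guardedStats) reset() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.flushLocked() // without this, unflushed stats would be lost forever
}

func main() {
	g := &guardedStats{inMemory: []string{"SELECT 1"}}
	g.reset() // flushes "SELECT 1" instead of silently dropping it
	g.flush() // nothing left to flush
}
```

Note how, once the reset path has to flush before clearing, it degenerates into the flush path itself, which is another way of seeing why removing the separate reset task, as this PR does, is the cleaner fix.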

Reviewed 10 of 10 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)


-- commits, line 19 at r1:
More words here please.

@Azhng Azhng force-pushed the sqlstats-nuke-telemetry-dump branch from d337fdd to 1e26f2c on November 9, 2021 17:22
Previously, the SQL Stats system reset its in-memory SQL Stats periodically,
determined by the diagnostics.sql_stat_reset.interval cluster setting. This
periodic reset loop was introduced early in 21.1, prior to persisted SQL Stats.
In 21.2, we introduced persisted SQL Stats along with a periodic task to
flush in-memory SQL Stats into system tables. These two async tasks'
coexistence introduced unwanted race conditions. If the SQL Stats reset task
reset the in-memory store before the flusher could flush it to the system
table, those particular stats would be lost forever.

This commit changes the .Start() method in PersistedSQLStats to avoid this
race condition. However, reported SQL Stats still periodically resets its
in-memory storage in case the telemetry server goes offline. As a result,
the diagnostics.sql_stat_reset.interval cluster setting is removed. The
diagnostics.forced_sql_stat_reset.interval cluster setting now only
controls the reset of the *reported* SQL Stats if they are not collected
by the telemetry reporter.

Release note (sql change): Previously, in-memory SQL Stats were reset
periodically. This behavior has been removed, since persisted SQL Stats
was introduced in 21.2. As a result, the diagnostics.sql_stat_reset.interval
cluster setting is removed. diagnostics.forced_sql_stat_reset.interval
now only controls the reset of the reported SQL Stats if they are not
collected by the telemetry reporter.
@Azhng Azhng force-pushed the sqlstats-nuke-telemetry-dump branch from 1e26f2c to af1866e on November 9, 2021 17:32
Contributor Author

@Azhng Azhng left a comment


I had a difficult time trying to find where we originally tested the behaviour of the diagnostics.sql_stat_reset.interval cluster setting, but I couldn't find anything. Perhaps that is why this issue was not caught early on.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @maryliag, and @matthewtodd)


-- commits, line 17 at r1:

Previously, matthewtodd (Matthew Todd) wrote…

nit: some style / grammar adjustments, WDYT?

Previously, the SQL Stats system reset its in-memory SQL Stats periodically,
determined by the diagnostics.sql_stat_reset.interval cluster setting. This
periodic reset loop was introduced early in 21.1 prior to persisted SQL Stats.
In 21.2, we introduced persisted SQL Stats along with a periodic task to
flush in-memory SQL Stats into system tables. These two async tasks'
coexistence introduced unwanted race conditions. If the SQL Stats reset task
reset the in-memory store before the flusher could flush it to the system
table, those particular stats would be lost forever.

This commit changes the .Start() method in PersistedSQLStats to avoid this
race condition. However, reported SQL Stats still periodically resets its
in-memory storage in case the telemetry server goes offline.

Done.


-- commits, line 19 at r1:

Previously, knz (kena) wrote…

More words here please.

Added more description to the release note.


pkg/sql/conn_executor.go, line 447 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

nit: s/much/many/

Done.

Contributor

@knz knz left a comment


Reviewed 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)

Contributor

@maryliag maryliag left a comment


:lgtm:

Reviewed 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @Azhng)

@Azhng
Contributor Author

Azhng commented Nov 9, 2021

TFTR!

bors r=knz,maryliag

@craig
Contributor

craig bot commented Nov 9, 2021

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 9, 2021

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 10, 2021

Build succeeded:

@craig craig bot merged commit fcb5221 into cockroachdb:master Nov 10, 2021
@blathers-crl

blathers-crl bot commented Nov 10, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from af1866e to blathers/backport-release-21.2-71597: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@knz
Contributor

knz commented Nov 10, 2021

should this be blindly backported to 21.2? Was the code for stats persistence already in 21.2?

@Azhng
Contributor Author

Azhng commented Nov 19, 2021

blathers backport 21.2

@blathers-crl

blathers-crl bot commented Nov 19, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating backport branch refs/heads/blathers/backport-release-21.2-71597: POST https://api.github.com/repos/cockroachdb/cockroach/git/refs: 422 Reference already exists []

Backport to branch 21.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
