
sql: atomic flushing of sql stats #116444

Merged
merged 2 commits into from
Mar 4, 2024

Conversation

koorosh
Collaborator

@koorosh koorosh commented Dec 14, 2023

This change refactors the existing logic for flushing SQL stats. Its main goal is to fix an issue where flushing and clearing in-memory stats are not an atomic operation, which can cause stats that have not yet been persisted to be cleared unintentionally.

This change introduces the following:

  1. A new PopAllStatementsStats function that prepares stats to be persisted and then clears the in-memory stats as a single atomic operation.
  2. The flushing process is now: pop all stats from in-memory storage, reset the in-memory stats, then persist the local copy of the stats.
  3. Previously, the process iterated over in-memory stats that could be updated during the iteration; persisting stats could take some time, making the iteration slow; and after flushing all stats, the in-memory storage was cleared with no guarantee that nothing had been added to the SQL stats in the meantime.
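The pop-reset-persist flow above can be sketched roughly as follows (a toy model with hypothetical simplified names; the real ssmemstorage.Container holds much richer state). The key idea is to swap the map out under the lock so concurrent writers immediately start populating a fresh map, and then persist the detached snapshot without holding the lock:

```go
// Sketch of the atomic pop: detaching the stats map under the lock means
// no stat can arrive between "flush" and "clear" and get lost.
package main

import (
	"fmt"
	"sync"
)

type stmtKey string

type stmtStats struct{ count int64 }

type Container struct {
	mu struct {
		sync.Mutex
		stmts map[stmtKey]*stmtStats
	}
}

func (c *Container) record(k stmtKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.mu.stmts == nil {
		c.mu.stmts = make(map[stmtKey]*stmtStats)
	}
	s, ok := c.mu.stmts[k]
	if !ok {
		s = &stmtStats{}
		c.mu.stmts[k] = s
	}
	s.count++
}

// popAllStatementsStats atomically detaches the current stats so that
// newly arriving stats cannot be lost between flushing and clearing.
func (c *Container) popAllStatementsStats() map[stmtKey]*stmtStats {
	c.mu.Lock()
	defer c.mu.Unlock()
	popped := c.mu.stmts
	c.mu.stmts = make(map[stmtKey]*stmtStats) // new writes land here
	return popped
}

func main() {
	c := &Container{}
	c.record("SELECT 1")
	c.record("SELECT 1")
	popped := c.popAllStatementsStats()
	c.record("SELECT 2") // arrives after the pop; kept for the next flush
	fmt.Println(popped["SELECT 1"].count, len(c.popAllStatementsStats())) // prints "2 1"
}
```

Swapping the map reference is O(1) regardless of how many stats are held, so the critical section stays short even under heavy load.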

The new implementation does have a disadvantage: between popping stats from in-memory storage and persisting them, users might not see up-to-date information. This is assumed to be better than permanently missing statistics.

Release note: None

Resolves: #115168


blathers-crl bot commented Dec 14, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Dec 14, 2023

Collaborator Author

@koorosh koorosh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @maryliag, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sql_stats.go line 164 at r1 (raw file):

// MaybeDumpStatsToLog flushes stats into the target if it is not nil.
func (s *SQLStats) MaybeDumpStatsToLog(

Refactored into a standalone function without any behavioral changes. The main reason is to reuse only a specific part of this functionality.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 151 at r1 (raw file):

		app := app
		container := s.GetApplicationStats(app, true).(*ssmemstorage.Container)
		if err := s.MaybeDumpStatsToLog(ctx, app, container, s.flushTarget); err != nil {

Dump stats to logs before we reset all in-memory stats.


pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 599 at r1 (raw file):

// PopAllStatementsStats returns all statement stats collected in memory to the caller and clears the
// in-memory stats, making sure that newly arriving stats won't interfere with the returned ones.
func (s *Container) PopAllStatementsStats() []*appstatspb.CollectedStatementStatistics {

The logic for constructing CollectedStatementStatistics and CollectedTransactionStatistics in the PopAllStatementsStats() and PopAllTransactionsStats() functions is borrowed from the StmtStatsIterator.Next function; nothing new is added. I couldn't reuse it directly because everything here needs to happen under a single lock.

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 2afef14 to 478a1a3 Compare December 14, 2023 15:26

blathers-crl bot commented Dec 14, 2023

Thank you for updating your pull request.

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 478a1a3 to 986ca15 Compare December 14, 2023 18:02
Contributor

@abarganier abarganier left a comment


I like the direction this PR is headed, and I think these are important changes to make!

I know @maryliag had some memory concerns, so we should outline those here and consider how to address them. IIRC, when we dump the cache to flush, the concern is that the new cache continues to grow before we've released the one we're flushing (e.g., up to 2x the current memory limit). I'm sure we could address that, for example by halving the memory limit - although admittedly I'm not sure of the implications of doing so.

I had a different idea in my mind on how I expected this to look re: the atomic flushing. I tried to explain below, let me know what you think.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @koorosh, @maryliag, and @xinhaoz)


-- commits line 24 at r3:
Good callout. What would this look like to a user? Would the user see the in-memory stats in the UI, refresh the page, and suddenly they're missing until they're finally persisted?

Code quote:

  New implementation does have disadvantage, it might cause glitches when we pop
  stats from in-memory storage and before persisting them - user might not see
  up to date information. It is assumed that this is better than having missing

pkg/sql/sqlstats/persistedsqlstats/flush.go line 100 at r3 (raw file):

			})

		if s.cfg.Knobs != nil && s.cfg.Knobs.OnStmtStatsFlushFinished != nil {

Do we need to maintain OnTxnStatsFlushFinished() as well? e.g.:

	if s.cfg.Knobs != nil && s.cfg.Knobs.OnTxnStatsFlushFinished != nil {
		s.cfg.Knobs.OnTxnStatsFlushFinished()
	}

I noticed it from the previous diff.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 151 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Dump stats to logs before we reset all stats from in-memory storage

Underneath, this causes an entirely separate iteration of the items in the ssmemstorage.Container. I wonder if we can somehow group this logic together with the existing iterations we do below to avoid iterating twice?

Seems like it's only done for stmtStats, not txnStats... So maybe it can be included as part of the 1st goroutine that iterates the stmtStats? LMK what you think, maybe it's a premature optimization 🤷‍♂️.

func (s *Container) SaveToLog(ctx context.Context, appName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.mu.stmts) == 0 {
		return
	}
	var buf bytes.Buffer
	for key, stats := range s.mu.stmts {
		json, err := func() ([]byte, error) {
			stats.mu.Lock()
			defer stats.mu.Unlock()
			return json.Marshal(stats.mu.data)
		}()
		if err != nil {
			log.Errorf(ctx, "error while marshaling stats for %q // %q: %v", appName, key.String(), err)
			continue
		}
		fmt.Fprintf(&buf, "%q: %s\n", key.String(), json)
	}
	log.Infof(ctx, "statistics for %q:\n%s", appName, buf.String())
}


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 152 at r3 (raw file):

) {
	apps := s.getAppNames(false)
	for _, app := range apps {

I imagined this to work a bit differently.

In SQLStats.mu we have apps map[string]*ssmemstorage.Container.

I was thinking we could just pull a local reference to the entire map into memory here, and replace it with a new empty map in SQLStats. For example:

func (s *SQLStats) dumpAllStatsByAppName() map[string]*ssmemstorage.Container {
    s.mu.Lock()
    defer s.mu.Unlock()
    toReturn := s.mu.apps
    s.mu.apps = make(map[string]*ssmemstorage.Container)

    // Maybe do some stuff with s.mu.mon as well to reset it? 
    // Or maybe that's handled already in `container.Free(ctx)`...

    return toReturn
}

This way, the "cutoff" for all apps in this map is the same. As we're doing it today, that time window is "sliding" because we iterate the apps individually, meaning that the point in time where an app's cache is "cleared" is not consistent from one app to the next.

We can do so, and then iterate the map k, v instead of the app names. Overall, this approach feels much more "atomic" to me, since the entire map is cleared all at once in the eyes of clients writing new SQL stats to the cache, instead of them clearing one-at-a-time. It also feels easier to reason about, because the caches we're iterating is no longer changing.


pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 614 at r3 (raw file):

		database := stmt.mu.database
		querySummary := stmt.mu.querySummary
		stmt.mu.Unlock() // nolint:deferunlockcheck

nit: can we wrap this in an anon function so we can use defer? e.g.:

// Declare vars outside (I omitted types, but you get the idea!)
var data, distSQLUsed, vectorized, fullScan, database, querySummary ...
// Assign within an anon function call, so we can still use defer for the unlock.
// Note: plain `=` here, not `:=`, so we assign the outer vars instead of shadowing them.
func() {
	stmt.mu.Lock()
	defer stmt.mu.Unlock()
	data = stmt.mu.data
	distSQLUsed = stmt.mu.distSQLUsed
	vectorized = stmt.mu.vectorized
	fullScan = stmt.mu.fullScan
	database = stmt.mu.database
	querySummary = stmt.mu.querySummary
}()

We've had some nasty deadlock bugs from undeferred unlocks, so if possible I'd like to use defer here.

Collaborator Author

@koorosh koorosh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @maryliag, and @xinhaoz)


-- commits line 24 at r3:

Previously, abarganier (Alex Barganier) wrote…

Good callout. What would this look like to a user? Would the user see the in-memory stats in the UI, refresh the page, and suddenly they're missing until they're finally persisted?

In general, yes. Here's what it looks like, if I understand the process of showing combined (in-memory and persisted) stats correctly:
Screenshot 2024-01-02 at 14.19.38.png


pkg/sql/sqlstats/persistedsqlstats/flush.go line 100 at r3 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Do we need to maintain OnTxnStatsFlushFinished() as well? e.g.:

	if s.cfg.Knobs != nil && s.cfg.Knobs.OnTxnStatsFlushFinished != nil {
		s.cfg.Knobs.OnTxnStatsFlushFinished()
	}

I noticed it from the previous diff.

Definitely it should be handled here as well! Thanks!


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 151 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Underneath, this causes an entirely separate iteration of the items in the ssmemstorage.Container. I wonder if we can somehow group this logic together with the existing iterations we do below to avoid iterating twice?

Seems like it's only done for stmtStats, not txnStats... So maybe it can be included as part of the 1st goroutine that iterates the stmtStats? LMK what you think, maybe it's a premature optimization 🤷‍♂️.

func (s *Container) SaveToLog(ctx context.Context, appName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.mu.stmts) == 0 {
		return
	}
	var buf bytes.Buffer
	for key, stats := range s.mu.stmts {
		json, err := func() ([]byte, error) {
			stats.mu.Lock()
			defer stats.mu.Unlock()
			return json.Marshal(stats.mu.data)
		}()
		if err != nil {
			log.Errorf(ctx, "error while marshaling stats for %q // %q: %v", appName, key.String(), err)
			continue
		}
		fmt.Fprintf(&buf, "%q: %s\n", key.String(), json)
	}
	log.Infof(ctx, "statistics for %q:\n%s", appName, buf.String())
}

All of your suggestions are reasonable and make sense to me; the main reason to push back on them is to avoid a total refactoring of the existing code within the scope of this fix.
Every bit of code is tightly coupled here, and I would suggest keeping that as a separate task, tbh, to avoid unintended regressions.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 152 at r3 (raw file):
My primary concern with this suggestion is that other places like MaybeDumpStatsToLog (and its internal calls) still refer to s.mu.apps, as do the container.Free calls. It would require refactoring this code even more, and I don't feel confident doing that in the scope of this change.

Maybe do some stuff with s.mu.mon as well to reset it?

I didn't find where mon is cleared in the current implementation; it is only initialised once 🤔. container.Free doesn't handle it.

@abarganier abarganier requested a review from kevin-v-ngo January 2, 2024 16:01
Contributor

@abarganier abarganier left a comment


@koorosh thanks for the discussion.

Looping in @kevin-v-ngo for opinions on the brief impact to users in the UI (see individual comment thread).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @kevin-v-ngo, @koorosh, @maryliag, and @xinhaoz)


-- commits line 24 at r3:

Previously, koorosh (Andrii Vorobiov) wrote…

In general, yes. Here's what it looks like if I understand the process of showing combined (in memory and persisted stats)
Screenshot 2024-01-02 at 14.19.38.png

I see, thank you for the helpful diagram! I can imagine this being something that users eventually notice, so let's be sure we check with @kevin-v-ngo to understand if this is an acceptable trade-off.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 151 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

All of your suggestions are reasonable and make sense to me; the main reason to push back on them is to avoid a total refactoring of the existing code within the scope of this fix.
Every bit of code is tightly coupled here, and I would suggest keeping that as a separate task, tbh, to avoid unintended regressions.

Understood - like I said in my other comment, I'm open to limiting the scope of the changes here. Let's do our best to outline the follow-up work, though, so that we understand the direction this code should move in.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 152 at r3 (raw file):
Just wondering, is the plan to backport this? I'm open to minimizing the initial scope of these changes. I'd like for us to outline future plans for this code, though, since these changes are complex & sensitive. We should understand the roadmap for follow-up changes.

I suppose one benefit of dumping them one app at a time, is that the scope of the issue where stats briefly "disappear" from the UI is minimized to just one app at a time. I suppose the cost is that we have more contention around the SQLStats mutex, and it's more difficult to reason about the operations because of the "sliding window" I mentioned in my previous comment.

I didn't find how mon is cleared in current implementation, it is only initialised once 🤔. container.Free doesn't handle this.

It looks like the SQLStats.mon is the parent monitor, from which we create child monitors in each Container for the apps. When a container is Free()d or Clear()d, it looks like the monitor is cleared: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go#L627

@kevin-v-ngo

@koorosh thanks for the discussion.

Looping in @kevin-v-ngo for opinions on the brief impact to users in the UI (see individual comment thread).

It seems like the impact is only when we show both in-memory and persisted stats, and the 'glitches' surface as a delay, correct? Meaning that it may take longer than 10 minutes to see statistics?

And in-memory only (when there are no persisted stats) is not impacted, correct?

Collaborator Author

@koorosh koorosh left a comment


Correct, it might affect only the last one-hour interval, when a request is dispatched exactly during the flushing process and both in-memory and persisted stats are requested.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @kevin-v-ngo, @maryliag, and @xinhaoz)

Collaborator Author

@koorosh koorosh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @kevin-v-ngo, @maryliag, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 152 at r3 (raw file):

Just wondering, is the plan to backport this?

I suppose this fix should be backported.

I suppose the cost is that we have more contention around the SQLStats mutex

Agree, and the entire process of persisting stats has a more "eventual" nature rather than an "atomic" one (i.e., we assume that all stats will be persisted at some point in time, but even the insertion of each stats record is performed sequentially). I assume the "sliding window" effect is not something that drastically affects how we show stats in the UI, but the increased contention around the mutex has to be resolved for sure.

One more thing that bothers me is that glitching stats in the UI should be resolved in the code responsible for displaying data in the UI, not here while flushing stats to disk 🙂. I mean, the UI should be aware of the eventual nature of data flushing and handle it properly.


pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 614 at r3 (raw file):

Previously, abarganier (Alex Barganier) wrote…

nit: can we wrap this in an anon function so we can use defer? e.g.:

// Declare vars outside (I omitted types, but you get the idea!)
var data, distSQLUsed, vectorized, fullScan, database, querySummary ...
// Assign within an anon function call, so we can still use defer for the unlock.
// Note: plain `=` here, not `:=`, so we assign the outer vars instead of shadowing them.
func() {
	stmt.mu.Lock()
	defer stmt.mu.Unlock()
	data = stmt.mu.data
	distSQLUsed = stmt.mu.distSQLUsed
	vectorized = stmt.mu.vectorized
	fullScan = stmt.mu.fullScan
	database = stmt.mu.database
	querySummary = stmt.mu.querySummary
}()

We've had some nasty deadlock bugs from undeferred unlocks, so if possible I'd like to use defer here.

Done.

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 986ca15 to 0b1342e Compare January 3, 2024 15:41
Collaborator Author

@koorosh koorosh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @kevin-v-ngo, @maryliag, and @xinhaoz)


pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 603 at r4 (raw file):

	var stmts map[stmtKey]*stmtStats

	func() {

@abarganier, I reused your suggestion with a small adjustment: I reassign the map under the mutex at the app level. It doesn't fix the "sliding window" behavior, but it reduces the time the lock is held.

@dhartunian dhartunian requested a review from abarganier January 3, 2024 18:54
Collaborator

@dhartunian dhartunian left a comment


I think this PR is a step in the right direction, but will need more iteration.

This PR contains no tests; we should add some. The diagram provided in the discussion is actually a great candidate for some examples that we should test with.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @kevin-v-ngo, @koorosh, @maryliag, and @xinhaoz)


-- commits line 24 at r3:

Previously, abarganier (Alex Barganier) wrote…

I see, thank you for the helpful diagram! I can imagine this being something that users eventually notice, so let's be sure we check with @kevin-v-ngo to understand if this is an acceptable trade-off.

Thanks for the helpful diagram. I don't think this is behavior we should accept because it makes it impossible to rely on consistent outputs.

@koorosh is it possible to "copy to buffer" instead of dump, and then delete the in-memory batch as part of the UPSERT transaction?

Collaborator Author

@koorosh koorosh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @kevin-v-ngo, @maryliag, and @xinhaoz)


-- commits line 24 at r3:
To make sure the result is consistent, we would need to revise the entire logic of how stats are accumulated and then flushed.
The initial logic works in a way that stats are flushed to disk and then cleaned from the in-memory batch, which creates the opposite problem: having the same stats both in memory and persisted on disk.
I would distinguish two problems with the current implementation:

  1. missing stats (which this PR aims to resolve);
  2. making the flushing of stats an atomic operation.

is it possible to "copy to buffer"

New stats can be added to the in-memory buffer after we have copied values from it, which would lead to having both "copied" and new values at the same time. Distinguishing one from the other later is the problem.
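As a toy illustration of the hazard described here (hypothetical names, not the actual implementation): with a copy-then-clear-later scheme, a write that lands between the copy and the clear is silently dropped, which is exactly what the atomic swap avoids.

```go
// Demonstrates why "copy to buffer" is racy: the write that arrives
// after the copy but before the clear is lost.
package main

import "fmt"

func main() {
	stats := map[string]int{"q1": 1}

	// Copy-to-buffer approach: snapshot now, clear later.
	buffer := make(map[string]int, len(stats))
	for k, v := range stats {
		buffer[k] = v
	}

	stats["q2"] = 1 // a new stat arrives after the copy...

	// ...and the deferred clear throws it away: q2 was never buffered.
	stats = map[string]int{}

	fmt.Println(len(buffer), len(stats)) // prints "1 0": q2 is lost
}
```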

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 0b1342e to 5f089f0 Compare January 8, 2024 15:37

blathers-crl bot commented Jan 8, 2024

Thank you for updating your pull request.

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 5f089f0 to f3a1d34 Compare January 8, 2024 19:23
Collaborator Author

@koorosh koorosh left a comment


  1. Added tests to validate the correctness of pulling stats from the in-memory cache, plus another test that simultaneously adds new stats and flushes them, to make sure that no stats are missed. With the additional testing knobs there's no interaction with the DB, and random test data is generated to make the stats distinct (we don't care about their content).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @kevin-v-ngo, @maryliag, and @xinhaoz)

@koorosh koorosh requested a review from dhartunian January 9, 2024 07:43
Contributor

@abarganier abarganier left a comment


Getting closer, sorry for the terrible review delay.

I made a comment about how we might be able to improve the test's determinism. Let me know what you think.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @kevin-v-ngo, @koorosh, @maryliag, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 161 at r5 (raw file):

	apps := s.getAppNames(false)
	for _, app := range apps {
		app := app

nit: why do we reassign app here to itself?


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 167 at r5 (raw file):

		}
		stmtStats := container.PopAllStatementsStats()
		txnStats := container.PopAllTransactionStats()

Can we combine these two functions so that both txn and stmt stats are popped together atomically? Something like:

stmtStats, txnStats := container.PopAllStats()

Currently, we acquire the container lock, dump the stmt stats, and then release the lock.

Then, afterwards, we acquire the container lock again, dump the txn stats, and then release the lock.

This makes the "popping" inconsistent across txn and statements. Maybe it's not a big deal, but doing both at the same time would be easy to implement (I think) and make things easier to reason about. This way, the container is "cleared" of everything all at once, instead of doing so in a staggered manner.

If we do this, we could probably also move the container.Free(ctx) call into PopAllStats().

Code quote:

		stmtStats := container.PopAllStatementsStats()
		txnStats := container.PopAllTransactionStats()
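A combined pop along these lines might look like the following sketch (simplified hypothetical types, not the real Container): both kinds of stats are detached under a single lock acquisition, so writers never observe a half-cleared container.

```go
// Sketch of a combined PopAllStats: statement and transaction stats are
// popped in one critical section instead of two staggered ones.
package main

import (
	"fmt"
	"sync"
)

type container struct {
	mu struct {
		sync.Mutex
		stmts map[string]int
		txns  map[string]int
	}
}

func newContainer() *container {
	c := &container{}
	c.mu.stmts = map[string]int{}
	c.mu.txns = map[string]int{}
	return c
}

// popAllStats clears statement and transaction stats atomically, so the
// container is emptied all at once from the point of view of writers.
func (c *container) popAllStats() (stmts, txns map[string]int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	stmts, txns = c.mu.stmts, c.mu.txns
	c.mu.stmts = map[string]int{}
	c.mu.txns = map[string]int{}
	return stmts, txns
}

func main() {
	c := newContainer()
	c.mu.Lock()
	c.mu.stmts["SELECT 1"] = 2
	c.mu.txns["txn-a"] = 1
	c.mu.Unlock()

	stmts, txns := c.popAllStats()
	fmt.Println(len(stmts), len(txns)) // prints "1 1"
}
```

If the real Container adopted this shape, the container.Free(ctx) call could plausibly move inside the same critical section, as suggested above.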

pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 855 at r5 (raw file):

	// This test aims to validate that stats are popped from memory atomically and don't cause
	// conditions where some stats are lost or consumed multiple times.
	t.Run("concurrently add stats and flush them", func(t *testing.T) {

To clarify, the previous problem was something like:

  1. The SQL Stats flush process begins
  2. While the stats within an app container are being iterated+flushed, new stats are added to the same container.
  3. The iterator doesn't pick up the newly added stats from step # 2.
  4. The app container is "cleared"
  5. The stats added in step # 2 are lost.

If so, my concern is that this test doesn't deterministically recreate this scenario.

IIUC, the problem in the old code occurred when new stats were added while the two flush goroutines were executing, is that right?

If so, could we use something like a lock or WaitGroup in the test interceptor functions to pause the stmt/txn flush goroutines to recreate this scenario? Something like:

  1. Add a bunch of app statistics.
  2. Trigger the flush, but make sure the interceptors block (for now). This causes the flush goroutines to hang.
  3. With the interceptors blocking, we now add more app stats. My understanding is that previously, this would trigger the bug since we're adding more statistics to an app before the flush has completed for that same app.
  4. Unblock the interceptors, let them complete the flush.
  5. Now, flush once more, and verify that the stats we added in step # 3 aren't lost.

Let me know what you think about this idea.

Code quote:

	t.Run("concurrently add stats and flush them", func(t *testing.T) {
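The blocking-interceptor scheme could be sketched like this, with channels standing in for the testing knobs (all names here are hypothetical; the real test would go through the sqlstats flush machinery):

```go
// Sketch of the deterministic test idea: pause the flush mid-way via an
// interceptor, add more stats while it is paused, then verify a second
// flush still sees them.
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu    sync.Mutex
	stats map[string]int
}

func (s *store) add(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stats[k]++
}

// flush atomically pops the stats, then runs the interceptor before
// "persisting" (returning) the popped snapshot, mimicking a testing knob.
func (s *store) flush(interceptor func()) map[string]int {
	s.mu.Lock()
	popped := s.stats
	s.stats = map[string]int{}
	s.mu.Unlock()
	if interceptor != nil {
		interceptor() // a test can block here to pause the flush
	}
	return popped
}

func main() {
	s := &store{stats: map[string]int{}}
	s.add("q1") // 1. add some app statistics

	paused := make(chan struct{})
	unblock := make(chan struct{})
	flushed := make(chan map[string]int)
	go func() {
		// 2. trigger the flush; the interceptor blocks, hanging the flush.
		flushed <- s.flush(func() { close(paused); <-unblock })
	}()

	<-paused    // wait until the pop happened and the flush is paused
	s.add("q2") // 3. add more stats while the flush is blocked
	close(unblock)

	first := <-flushed     // 4. let the first flush complete
	second := s.flush(nil) // 5. the stats from step 3 must not be lost

	fmt.Println(first["q1"], second["q2"]) // prints "1 1": nothing lost
}
```

The paused channel is what makes the scenario deterministic: the mid-flush write provably happens after the pop, so the old (buggy) clear-after-flush behavior would have lost it.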

Collaborator

@dhartunian dhartunian left a comment


Reviewed 10 of 16 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @kevin-v-ngo, @koorosh, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 16 at r10 (raw file):

	"context"
	"fmt"
	"github.com/cockroachdb/cockroach/pkg/util/stop"

nit: order imports


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 145 at r9 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

docstring needs more details about the "guarantees"

Done.

implement the sqlstats.Reader interface

I'm not sure that this interface is what we need in this case. The sqlstats.Reader interface only allows iterating over the available instances, and it is not clear why the Reader interface should also be responsible for clearing in-memory statistics after iteration.

This method makes things confusing because now there's yet another way to iterate.

Agreed that having multiple implementations doesn't make the code easy to read/maintain, but at the same time I don't see that the current implementations of the sqlstats.Reader interface make it easier. Here are the existing structs that implement it:

  • sslocal.SQLStats
  • ssmemstorage.Container

All of them introduce overhead like instantiating a memory monitor or other unrelated fields (or leaving them half-initialised, which doesn't feel correct and makes it harder to reason about why some fields aren't initialised).

👍 thanks for taking a look.


pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 861 at r9 (raw file):

	//																																^
	//														should be flushed with							|
	// 																	next Flush      --------------|

nit: what's going on with this comment? does the diagram need to be adjusted? I don't follow what's happening on the right side.


pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 934 at r9 (raw file):

		init.Store(true)
		// Flush all available stats.
		sqlStats.Flush(ctx)

can we do a check here to verify that 1 stmt and 1 txn was flushed. that's what should happen, right?


pkg/sql/sqlstats/persistedsqlstats/scheduled_sql_stats_compaction_test.go line 142 at r10 (raw file):

	// the system table.
	helper.sqlDB.Exec(t, "SELECT 1; SELECT 1, 1")
	helper.server.ApplicationLayer().SQLServer().(*sql.Server).GetSQLStatsProvider().(*persistedsqlstats.PersistedSQLStats).Flush(ctx, helper.server.Stopper())

I think what you want here is helper.server.AppStopper() or something like that. No need to pull the whole server in.

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 0dce983 to a61a221 Compare February 14, 2024 10:22
Collaborator Author

@koorosh koorosh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @kevin-v-ngo, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 16 at r10 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: order imports

Done.


pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 861 at r9 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: what's going on with this comment? does the diagram need to be adjusted? I don't follow what's happening on the right side.

😱 That's what happens when tabs and spaces are used together for offsets: Reviewable interprets tabs differently than GoLand does.
Fixed.


pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 934 at r9 (raw file):

Previously, dhartunian (David Hartunian) wrote…

can we do a check here to verify that 1 stmt and 1 txn was flushed. that's what should happen, right?

Good point, done!


pkg/sql/sqlstats/persistedsqlstats/scheduled_sql_stats_compaction_test.go line 142 at r10 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I think what you want here is helper.server.AppStopper() or something like that. No need to pull the whole server in.

Done.

Collaborator Author

@koorosh koorosh left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @kevin-v-ngo, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 176 at r12 (raw file):

		wg.Add(2)

		err := stopper.RunAsyncTask(ctx, "sql-stmt-stats-flush", func(ctx context.Context) {

Using stopper.RunAsyncTask instead of plain goroutines causes the following error:

panic: attempting to start span with parent from different Tracer.
parent operation: sql-txn-stats-flush, tracer created at:

goroutine 13 [running]:
runtime/debug.Stack()
	/opt/homebrew/opt/go/libexec/src/runtime/debug/stack.go:24 +0x64
github.com/cockroachdb/cockroach/pkg/util/tracing.NewTracer()
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:597 +0x3c
github.com/cockroachdb/cockroach/pkg/util/stop.NewStopper({0x0, 0x0, 0x0?})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:216 +0xe4
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.NewTestCluster({_, _}, _, {{{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:265 +0x28c
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.testClusterFactoryImpl.NewTestCluster(...)
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1963
github.com/cockroachdb/cockroach/pkg/testutils/serverutils.NewCluster(...)
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/testutils/serverutils/test_cluster_shim.go:303
github.com/cockroachdb/cockroach/pkg/testutils/serverutils.StartCluster({_, _}, _, {{{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/testutils/serverutils/test_cluster_shim.go:287 +0x84
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats_test.createCluster(0x10a970910?)
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/reader_test.go:170 +0x15c
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats_test.TestPersistedSQLStatsReadHybrid(0x14000781380)
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/reader_test.go:72 +0xd8
testing.tRunner(0x14000781380, 0x10a93c740)
	/opt/homebrew/opt/go/libexec/src/testing/testing.go:1595 +0xe8
created by testing.(*T).Run in goroutine 1
	/opt/homebrew/opt/go/libexec/src/testing/testing.go:1648 +0x33c


child operation: insert-txn-stats, tracer created at:
goroutine 13 [running]:
runtime/debug.Stack()
	/opt/homebrew/opt/go/libexec/src/runtime/debug/stack.go:24 +0x64
github.com/cockroachdb/cockroach/pkg/util/tracing.NewTracer()
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:597 +0x3c
github.com/cockroachdb/cockroach/pkg/util/tracing.NewTracerWithOpt({0x10a992b10, 0x10db35f00}, {0x14005ac5dd0, 0x2, 0x0?})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:639 +0x8c
github.com/cockroachdb/cockroach/pkg/server.(*testServer).StartTenant(_, {_, _}, {{0x0, 0x0}, {0xa}, 0x0, 0x140090ee000, 0x0, {{0x0, ...}, ...}, ...})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/server/testserver.go:1654 +0xa8c
github.com/cockroachdb/cockroach/pkg/server.(*testServer).startDefaultTestTenant(0x14000ef7500, {0x10a992b10, 0x10db35f00})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/server/testserver.go:638 +0x3e0
github.com/cockroachdb/cockroach/pkg/server.(*testServer).maybeStartDefaultTestTenant(0x14000ef7500, {0x10a992b10?, 0x10db35f00?})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/server/testserver.go:710 +0x24c
github.com/cockroachdb/cockroach/pkg/server.(*testServer).Activate(0x14000ef7500, {0x10a992b10, 0x10db35f00})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/server/testserver.go:836 +0xbc
github.com/cockroachdb/cockroach/pkg/testutils/serverutils.(*wrap).Activate(0x10a970940?, {0x10a992b10, 0x10db35f00})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/testutils/serverutils/ts_control_forwarder_generated.go:25 +0x54
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start(0x14000ef6e00, {0x10a9c8ea0, 0x14000781380?})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:484 +0x634
github.com/cockroachdb/cockroach/pkg/testutils/serverutils.StartCluster({_, _}, _, {{{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/testutils/serverutils/test_cluster_shim.go:288 +0xa0
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats_test.createCluster(0x10a970910?)
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/reader_test.go:170 +0x15c
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats_test.TestPersistedSQLStatsReadHybrid(0x14000781380)
	/Users/avor/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/reader_test.go:72 +0xd8
testing.tRunner(0x14000781380, 0x10a93c740)
	/opt/homebrew/opt/go/libexec/src/testing/testing.go:1595 +0xe8
created by testing.(*T).Run in goroutine 1
	/opt/homebrew/opt/go/libexec/src/testing/testing.go:1648 +0x33c
 [recovered]

Collaborator

@dhartunian dhartunian left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @kevin-v-ngo, @koorosh, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 176 at r12 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Using stopper.RunAsyncTask instead of go funcs causes the following error:

panic: attempting to start span with parent from different Tracer.
parent operation: sql-txn-stats-flush, tracer created at:

(full stack trace in the original comment above)

I think it's because the test passes context.Background() directly to the IterateTransactionStats calls in the TestPersistedSQLStatsReadHybrid test. The context contains the spans used for tracing operations. Modify the test to pass in the ctx created with the cluster and I think it will work.

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from a61a221 to 96d3331 Compare February 14, 2024 17:15
Collaborator Author

@koorosh koorosh left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @kevin-v-ngo, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 176 at r12 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I think it's because the test passes context.Background() directly to the IterateTransactionStats calls in the TestPersistedSQLStatsReadHybrid test. The context contains the spans used for tracing operations. Modify the test to pass in the ctx created with the cluster and I think it will work.

Just figured it out: I passed cluster.Stopper() instead of AppStopper(). Updated all affected tests.

@koorosh koorosh force-pushed the sql-stats-flush-sync branch 4 times, most recently from 6d8a475 to 3b8c941 Compare February 15, 2024 13:01
Contributor

@abarganier abarganier left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @kevin-v-ngo, @koorosh, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 50 at r14 (raw file):

			if err := s.SQLStats.Reset(ctx); err != nil {
				log.Warningf(ctx, "fail to reset in-memory SQL Stats: %s", err)
			}

Would it make more sense to call s.knobs.OnBeforeReset() in this deferred function, instead of in each individual Container.Clear() call?

Code quote:

			if err := s.SQLStats.Reset(ctx); err != nil {
				log.Warningf(ctx, "fail to reset in-memory SQL Stats: %s", err)
			}

pkg/sql/sqlstats/sslocal/sslocal_provider.go line 176 at r12 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Just figured it out: I passed cluster.Stopper() instead of AppStopper(). Updated all affected tests.

Nice, thanks for using the Stopper library here! I was wondering if we should be aware of context cancellation here, but I think these tasks are short-lived enough that it's okay to just let them finish in the event of a shutdown.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 144 at r14 (raw file):

}

// ConsumeStats leverages the process of atomic pulling stats from in-memory storage, cleaning in-memory stats, and

nit: clearing (I think)

Code quote:

cleaning

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 3b8c941 to 954d9d9 Compare February 15, 2024 16:41
Collaborator Author

@koorosh koorosh left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @kevin-v-ngo, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 50 at r14 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Would it make more sense to call s.knobs.OnBeforeReset() in this deferred function, instead of in each individual Container.Clear() call?

This deferred function is guarded by a condition and is called only when stats flushing is disabled, so it doesn't handle the default scenario where stats are flushed periodically.

We reset stats per app, so it felt more natural to call this hook right after each app's container is reset.
It would also be possible to invoke this hook for the entire in-memory store, but it isn't clear what to do when only some app containers are flushed (and others aren't, due to an error)?


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 144 at r14 (raw file):

Previously, abarganier (Alex Barganier) wrote…

nit: clearing (I think)

Done. Thanks, that's my usual mistake (among many others) 😅

Contributor

@abarganier abarganier left a comment

Things look good; I just want to make sure OnBeforeReset is well documented and its usage is clear.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @kevin-v-ngo, @koorosh, and @xinhaoz)


pkg/sql/sqlstats/test_utils.go line 69 at r15 (raw file):

	// OnBeforeReset is invoked right before reset SQLStats from in-memory cache
	OnBeforeReset func()

nit: Let's expand the comment/documentation on this a bit more. I think explaining more about when specifically this gets called, as well as what it's useful for (e.g. an example test scenario where this is helpful) would help make the purpose of the callback clear.

Code quote:

	// OnBeforeReset is invoked right before reset SQLStats from in-memory cache
	OnBeforeReset func()

pkg/sql/sqlstats/persistedsqlstats/flush.go line 50 at r14 (raw file):
I see, that makes sense.

We reset stats per app so I felt it is more natural to call this hook right after each app's container reset.

Is the name OnBeforeReset a bit misleading in that case? Just because it seems like we invoke this after we invoke Clear(). Also, if we're invoking Clear(), then should it be OnBeforeClear?

I just want to make sure that the testing callback's usage is clear.


pkg/sql/sqlstats/sslocal/sql_stats.go line 155 at r15 (raw file):

	for appName, statsContainer := range s.mu.apps {
		lastErr := s.MaybeDumpStatsToLog(ctx, appName, statsContainer, target)
		statsContainer.Clear(ctx)

If we're going to have something like OnBeforeReset, then should we also call it here? I think the callback should consistently be invoked any time we're going to clear the memory stats.

Code quote:

		statsContainer.Clear(ctx)

pkg/sql/sqlstats/sslocal/sslocal_provider.go line 144 at r14 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Done. Thanks, that's my usual mistake (among many others) 😅

All good! We all do this 😆

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 954d9d9 to 6ff0bca Compare February 22, 2024 19:26
Collaborator Author

@koorosh koorosh left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @kevin-v-ngo, and @xinhaoz)


pkg/sql/sqlstats/test_utils.go line 69 at r15 (raw file):

Previously, abarganier (Alex Barganier) wrote…

nit: Let's expand the comment/documentation on this a bit more. I think explaining more about when specifically this gets called, as well as what it's useful for (e.g. an example test scenario where this is helpful) would help make the purpose of the callback clear.

Done.


pkg/sql/sqlstats/persistedsqlstats/flush.go line 50 at r14 (raw file):

Previously, abarganier (Alex Barganier) wrote…

I see, that makes sense.

We reset stats per app so I felt it is more natural to call this hook right after each app's container reset.

Is the name OnBeforeReset a bit misleading in that case? Just because it seems like we invoke this after we invoke Clear(). Also, if we're invoking Clear(), then should it be OnBeforeClear?

I just want to make sure that the testing callback's usage is clear.

Agreed, the naming was confusing. Renamed to OnAfterClear.


pkg/sql/sqlstats/sslocal/sql_stats.go line 155 at r15 (raw file):

Previously, abarganier (Alex Barganier) wrote…

If we're going to have something like OnBeforeReset, then should we also call it here? I think the callback should consistently be invoked any time we're going to clear the memory stats.

It is called within the Clear function.

Collaborator

@dhartunian dhartunian left a comment

:lgtm: with a few logging requests.

Reviewed 2 of 6 files at r9, 17 of 17 files at r15.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @kevin-v-ngo, @koorosh, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 185 at r15 (raw file):

			}
		})
		if err != nil {

log warning here with error


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 199 at r15 (raw file):

			}
		})
		if err != nil {

log warning here with error

@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 6ff0bca to 62b6c08 Compare March 4, 2024 14:00
@koorosh koorosh requested a review from a team March 4, 2024 14:00
koorosh added 2 commits March 4, 2024 16:07
This change refactors the existing logic for flushing SQL stats. The main goal
of this change is to fix an issue where flushing and clearing in-memory
stats aren't an atomic operation, which may cause cases where not-yet-persisted
stats are cleared unintentionally.

This change introduces the following:
1. Introduces a `PopAllStats` function that prepares stats to
be persisted and then clears in-memory stats as an atomic operation.
2. The process of flushing stats is now:
    - pop all stats from in-memory storage
    - reset in-memory stats
    - use the local copy of stats and persist them
3. Before, this process was:
    - iterate over in-memory stats, which could be updated during iteration;
    - persisting stats could take some time, making iteration over stats slow;
    - after flushing all stats, in-memory storage is cleared, but there's no
      guarantee that nothing has been added to SQL stats in the meantime.

The new implementation does have a disadvantage: between popping stats from
in-memory storage and persisting them, the user might not see up-to-date
information. It is assumed that this is better than permanently missing
statistics.

Release note: None
This change refactors the `SQLStats.ConsumeStats` function to rely
on the `stopper.RunAsyncTask` function instead of calling `go` directly,
which doesn't handle context cancellation properly.
Most of the changes here are in tests, to comply with the new argument
required by the `ConsumeStats` func.

Release note: None
@koorosh koorosh force-pushed the sql-stats-flush-sync branch from 62b6c08 to 046641b Compare March 4, 2024 14:11
Collaborator Author

@koorosh koorosh left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @dhartunian, @kevin-v-ngo, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 185 at r15 (raw file):

Previously, dhartunian (David Hartunian) wrote…

log warning here with error

Done.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 199 at r15 (raw file):

Previously, dhartunian (David Hartunian) wrote…

log warning here with error

Done.

@koorosh
Collaborator Author

koorosh commented Mar 4, 2024

bors r+

@craig
Contributor

craig bot commented Mar 4, 2024

Build succeeded:

@craig craig bot merged commit a2420ac into cockroachdb:master Mar 4, 2024
17 of 18 checks passed
Labels
O-community Originated from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

observability: SQL Activity page is missing some statements
5 participants