-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: scheduled logger to capture index usage stats #76886
sql: scheduled logger to capture index usage stats #76886
Conversation
48fccc1
to
60d2799
Compare
7efa752
to
968c579
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asking two questions but they're not blockers to approval, so don't worry about them if you're busy and trying to get things in 🙂.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @THardy98)
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 146 at r1 (raw file):
// logIndexUsageStatsWithDelay logs a slice of eventpb.EventPayload at half // second intervals (2 logs per second) to avoid exceeding the 10 log-line per // second limit per node on the telemetry logging pipeline.
Just curious, is this 10 log-line limit cumulative across other things that might also be logging in the telemetry pipeline?
Code quote:
// second intervals (2 logs per second) to avoid exceeding the 10 log-line per
// second limit per node on the telemetry logging pipeline.
pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):
CommonEventDetails common = 1 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true]; // Couldn't use roachpb.CollectedIndexUsageStatistics due to circular dependency.
Just out of curiosity, why is the solution here not to pull something out in a way that removes the circular dependency (perhaps as a long-term fix if you're trying to get this in)? Are the module organizations too complicated/don't lend to reorganization/we're usually fine was just duplicating things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and to add tests before un-drafting.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @THardy98)
968c579
to
697727f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 10 files at r1, 1 of 3 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, and @THardy98)
pkg/server/server_sql.go, line 1249 at r3 (raw file):
) scheduledlogging.StartCapturingIndexUsageStats(ctx, stopper, s.execCfg.DB, s.execCfg.Settings, s.internalExecutor)
Maybe just call this Start
? Since we're dealing with a global limit on the telemetry pipeline, I imagine any future scheduled logging would happen inside the same async task?
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 199 at r3 (raw file):
// second intervals (2 logs per second) to avoid exceeding the 10 log-line per // second limit per node on the telemetry logging pipeline. func logIndexUsageStatsWithDelay(
Also, I notice that this function doesn't care about the EventPayload
contents. Perhaps there's a natural seam here where we've got one thing selecting what we want to log (just index usage stats for now) and another thing that slowly writes them into the logging pipeline. I see now that decoupling them means you need some kind of buffering, which gets more complicated, so maybe that's more than we need to think about for now?
pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):
Previously, jocrl (Josephine) wrote…
Just out of curiosity, why is the solution here not to pull something out in a way that removes the circular dependency (perhaps as a long-term fix if you're trying to get this in)? Are the module organizations too complicated/don't lend to reorganization/we're usually fine was just duplicating things?
I'm curious about this, too.
697727f
to
cefdbd2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added tests :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, and @matthewtodd)
pkg/server/server_sql.go, line 1249 at r3 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
Maybe just call this
Start
? Since we're dealing with a global limit on the telemetry pipeline, I imagine any future scheduled logging would happen inside the same async task?
I've changed this to Start
, and yes, I believe that the idea going forward. From what I've seen, Start
also seems to be the convention when starting background processes.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 146 at r1 (raw file):
Previously, jocrl (Josephine) wrote…
Just curious, is this 10 log-line limit cumulative across other things that might also be logging in the telemetry pipeline?
Yes. Currently, the only other event being emitted to the telemetry pipeline is SampledQuery
. Consequently, I reduced the frequency at which it could emit logs from 10 logs per second, to 8 per second, to allow CaptureIndexUsageStats
to emit 2 logs per second.
IIUC, there could be some improvement on this in the future.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 199 at r3 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
Also, I notice that this function doesn't care about the
EventPayload
contents. Perhaps there's a natural seam here where we've got one thing selecting what we want to log (just index usage stats for now) and another thing that slowly writes them into the logging pipeline. I see now that decoupling them means you need some kind of buffering, which gets more complicated, so maybe that's more than we need to think about for now?
I'm not sure I fully understand this, but maybe I can provide some context with my own efforts to write a logging scheduler that supported any logger. Initially, I had written the scheduler to allow for multiple loggers that wanted to emit events on an interval. It was effectively the same as the current scheduler but it started a separate go-routine for each logger, so was a bit expensive. It didn't have any need for buffering as each logger was responsible for their own schedule and writing, but the number of go-routines likely wouldn't scale well over time. I was thinking of using some sort of a buffer that all logs could write to, and a background go-routine would periodically write the actual logs from the buffer. Is this in line with what you were thinking?
pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
I'm curious about this, too.
I don't think it would be very difficult. It would just need a new .proto
file in a separate package (maybe idxusage
in this case?). That being said, I'm not aware if moving roachpb.CollectedIndexUsageStatistics
would cause for other issues (or maybe cause a different import cycle).
6e219a4
to
e4eddd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Approving for addressing my questions, and I'll leave the stamp to Matthew.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @matthewtodd, and @THardy98)
pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):
Previously, THardy98 (Thomas Hardy) wrote…
I don't think it would be very difficult. It would just need a new
.proto
file in a separate package (maybeidxusage
in this case?). That being said, I'm not aware if movingroachpb.CollectedIndexUsageStatistics
would cause for other issues (or maybe cause a different import cycle).
I would suggest that if it can be done, and if they are conceptually the same thing, then pulling it out to avoid duplication would be helpful. I'm not familiar enough to assess how neutral/concerning this is, and so my uninformed sentiment is to leave it up to you whether to make the change in this pr or punt to the future.
Also, FYI on the failing TeamCity build 🙃 |
1368c96
to
e0e3ed9
Compare
e0e3ed9
to
0ee384d
Compare
c3f5922
to
d39a917
Compare
ea0dbcf
to
84e1807
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, @maryliag, @matthewtodd, and @THardy98)
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 82 at r11 (raw file):
// on a scheduled interval. type CaptureIndexUsageStatsLoggingScheduler struct { DB *kv.DB
how come this is exported ?
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 83 at r11 (raw file):
type CaptureIndexUsageStatsLoggingScheduler struct { DB *kv.DB cs *cluster.Settings
(optional) nit: i think most of the code base i've seen uses either st
or settings
as abbreivation rather than cs
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 152 at r11 (raw file):
err := s.DB.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error { return captureIndexUsageStats(ctx, s.ie, stopper, telemetryCaptureIndexUsageStatsLoggingDelay.Get(&s.cs.SV))
Hmm why are we starting a transaction here? The txn
isn't passed to captureIndexUsageStats()
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 156 at r11 (raw file):
if err != nil { log.Errorf(ctx, "error capturing index usage stats: %+v", err) }
Hmm I feelerror
is indicating something severely wrong with the cluster and demand immediate attention. Might be worth to just call this a Warning instead. And guard it behind log.V()
so it doesn't pollute the log file too much.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 225 at r11 (raw file):
} databaseName := tree.MustBeDString(row[0])
minor nit: hmm why not just directly use the databaseName
that we already have? the '%s' as database_name
feels a bit unnecessary.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 257 at r11 (raw file):
if err != nil { return err }
small nit: inline the err
assignment
if err = it.Close(); err != nil {
return err
}
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 43 at r11 (raw file):
func (s *stubDurations) setLoggingDuration(d time.Duration) { s.RWMutex.Lock()
nit: you can just do s.Lock()
since the lock is embedded in the struct
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 66 at r11 (raw file):
} func installTelemetryLogFileSink(sc *log.TestLogScope, t *testing.T) func() {
nit: usually t *testing.T
is the first args in the function
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 78 at r11 (raw file):
if err := cfg.Validate(&dir); err != nil { t.Fatal(err) }
small nit: use require
here?
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 162 at r11 (raw file):
// of logs for each index. testutils.SucceedsWithin(t, func() error { return checkNumTotalEntriesAndNumIndexEntries(expectedTotalNumEntriesInSingleInterval, expectedNumberOfIndividualIndexEntriesInSingleInterval)
style nit: wrap the long line
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 205 at r11 (raw file):
0, math.MaxInt64, 10000,
Hmm, what do these literals mean?
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 219 at r11 (raw file):
testData := []time.Duration{ 0 * time.Second, stubScheduleInterval, /* second difference between first and second schedule */
style nit: we don't have too much inline comments with /**/
like this one. Move it above the stubScheduleInterval
and swtich to use //
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 220 at r11 (raw file):
0 * time.Second, stubScheduleInterval, /* second difference between first and second schedule */ sd.getOverlapDuration(), /* second difference between second and third schedule */
to clarify, do you mean the difference in number of seconds
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, @maryliag, @matthewtodd, and @THardy98)
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 82 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
how come this is exported ?
There is no reason for it to be exported, I've un-exported it.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 83 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
(optional) nit: i think most of the code base i've seen uses either
st
orsettings
as abbreivation rather thancs
Changed to st
.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 152 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Hmm why are we starting a transaction here? The
txn
isn't passed tocaptureIndexUsageStats()
No need, removed.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 156 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Hmm I feel
error
is indicating something severely wrong with the cluster and demand immediate attention. Might be worth to just call this a Warning instead. And guard it behindlog.V()
so it doesn't pollute the log file too much.
I changed it to Warningf
which I think should be fine. Only logs to the DEV
channel.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 225 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
minor nit: hmm why not just directly use the
databaseName
that we already have? the'%s' as database_name
feels a bit unnecessary.
Removed from the query, pass databaseName
directly into the log payload.
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 257 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
small nit: inline the
err
assignmentif err = it.Close(); err != nil { return err }
Done.
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 43 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
nit: you can just do
s.Lock()
since the lock is embedded in the struct
Done.
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 66 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
nit: usually
t *testing.T
is the first args in the function
Done.
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 78 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
small nit: use
require
here?
Done.
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 162 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
style nit: wrap the long line
Done for each checkNumTotalEntriesAndNumIndexEntries
call.
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 205 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Hmm, what do these literals mean?
I've assigned them to variables for naming.
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 219 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
style nit: we don't have too much inline comments with
/**/
like this one. Move it above thestubScheduleInterval
and swtich to use//
Done.
pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 220 at r11 (raw file):
Previously, Azhng (Archer Zhang) wrote…
to clarify, do you mean
the difference in number of seconds
?
Yes, clarified in the comment now.
84e1807
to
2f6f68d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, @maryliag, @matthewtodd, and @THardy98)
pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 82 at r11 (raw file):
Previously, THardy98 (Thomas Hardy) wrote…
There is no reason for it to be exported, I've un-exported it.
Ugh, I think it's still being exported 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r3, 4 of 5 files at r8, 2 of 2 files at r9, 6 of 8 files at r10, 2 of 2 files at r12, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @maryliag, @matthewtodd, and @THardy98)
2fa3944
to
1d8327f
Compare
This change introduces a scheduled logger that captures index usage statistics logs on a time interval. Release note (sql change): Initial implementation of a scheduled logger used to capture index usage statistics to the telemetry logging channel. Release justification: Category 4: Low risk, high benefit changes to existing functionality.
1d8327f
to
d09a45e
Compare
bors r+ |
Build succeeded: |
blathers backport 21.2 |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from d09a45e to blathers/backport-release-21.2-76886: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Resolves: #72486
This change introduces a scheduled logger that captures index usage
statistics logs on a time interval.
Release note (sql change): Initial implementation of a scheduled logger
used to capture index usage statistics to the telemetry logging channel.
Release justification: Category 4: Low risk, high benefit changes to
existing functionality.