-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
telemetry: log transaction exec events to TELEMETRY #115722
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
de3813e
to
ba6a808
Compare
ba6a808
to
be367bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this became fairly compact given the improvements from the previous commit!
Just a few comments, nothing crazy.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @maryliag, and @xinhaoz)
-- commits
line 43 at r3:
nit: ")" instead of "."
Code quote:
.
pkg/sql/exec_log.go
line 441 at r3 (raw file):
// maybeLogTransaction conditionally records the current transaction // to the TELEMETRY channel. func (p *planner) maybeLogTransaction(
nit: can we unit test this function specifically? I know we capture the output within the logSpy in the tests, but it'd be nice to have specific tests for all the different early return conditions.li
pkg/sql/exec_log.go
line 450 at r3 (raw file):
trackedForTelemetry bool, ) { isTxnTelemetryMode := telemetrySamplingMode.Get(&p.execCfg.Settings.SV) == telemetryModeTransaction
In the previous commit/PR we had a lot of added logic to determine whether a transaction should be sampled (e.g. in (*TelemetryLoggingMetrics).shouldTrackTransaction()
. It feels like we duplicate a lot of that logic once again here.
Is there some reason we can't just rely on extraTxnState.shouldLogToTelemetry
instead of making all these checks once again?
pkg/sql/exec_log.go
line 495 at r3 (raw file):
RangeKeySkippedPoints: txnStats.ExecStats.MvccRangeKeySkippedPoints, }, }
Would it make sense to use something like a sync.Pool
for these eventpb.SampledExecStats
objects?
Code quote:
sampledExecStats = &eventpb.SampledExecStats{
NetworkBytes: txnStats.ExecStats.NetworkBytesSent,
MaxMemUsage: txnStats.ExecStats.MaxMemUsage,
ContentionTime: int64(txnStats.ExecStats.ContentionTime.Seconds()),
NetworkMessages: txnStats.ExecStats.NetworkMessages,
MaxDiskUsage: txnStats.ExecStats.MaxDiskUsage,
CPUSQLNanos: txnStats.ExecStats.CPUTime.Nanoseconds(),
MVCCIteratorStats: &eventpb.MVCCIteratorStats{
StepCount: txnStats.ExecStats.MvccSteps,
StepCountInternal: txnStats.ExecStats.MvccStepsInternal,
SeekCount: txnStats.ExecStats.MvccSeeks,
SeekCountInternal: txnStats.ExecStats.MvccSeeksInternal,
BlockBytes: txnStats.ExecStats.MvccBlockBytes,
BlockBytesInCache: txnStats.ExecStats.MvccBlockBytesInCache,
KeyBytes: txnStats.ExecStats.MvccKeyBytes,
ValueBytes: txnStats.ExecStats.MvccValueBytes,
PointCount: txnStats.ExecStats.MvccPointCount,
PointsCoveredByRangeTombstones: txnStats.ExecStats.MvccPointsCoveredByRangeTombstones,
RangeKeyCount: txnStats.ExecStats.MvccRangeKeyCount,
RangeKeyContainedPoints: txnStats.ExecStats.MvccRangeKeyContainedPoints,
RangeKeySkippedPoints: txnStats.ExecStats.MvccRangeKeySkippedPoints,
},
}
pkg/sql/exec_log.go
line 510 at r3 (raw file):
} sampledTxn := eventpb.SampledTransaction{
ditto: would it make sense to use something like a sync.Pool
here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @maryliag, and @xinhaoz)
pkg/sql/exec_log.go
line 495 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
Would it make sense to use something like a
sync.Pool
for theseeventpb.SampledExecStats
objects?
We already have very similar code for statements that hasn't required such a technique, but I'd be curious to at least explore the need here. How can we determine whether these objects are "creating gc pressure"?
pkg/util/log/eventpb/telemetry.proto
line 483 at r3 (raw file):
// SkippedTransactions is the number of transactions that were not considered for sampling prior to this one. // We only count skipped transactions when telemetry logging is enabled for the transaction type.
nit: spacing
be367bc
to
cd230db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @maryliag, and @xinhaoz)
pkg/sql/exec_log.go
line 495 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
We already have very similar code for statements that hasn't required such a technique, but I'd be curious to at least explore the need here. How can we determine whether these objects are "creating gc pressure"?
Benchmarking allocations would be one way. We can compare results with/without a sync.Pool
over many calls to maybeLogStatementInternal
(where the sync.Pool
is shared across calls) to understand the impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @maryliag and @xinhaoz)
pkg/sql/conn_executor_exec.go
line 3304 at r8 (raw file):
&recordedTxnStats, ex.server.TelemetryLoggingMetrics, ex.extraTxnState.shouldLogToTelemetry)
nit: use the ,
and newline here before closing paren when writing multiline args
pkg/sql/testdata/telemetryLogging/stmts_per_txn_limit
line 21 at r8 (raw file):
---- {"ApplicationName":"telemetry-logging-datadriven","Database":"defaultdb","Distribution":"local","EventType":"sampled_query","NumRows":1,"OutputRowsEstimate":1,"PlanGist":"AgICAgYC","SkippedQueries":1,"Statement":"SELECT ‹1›","StatementFingerprintID":16085855936700856000,"StmtPosInTxn":1,"Tag":"SELECT","User":"root"} {"ApplicationName":"telemetry-logging-datadriven","Committed":true,"EventType":"sampled_transaction","NumRows":2,"RowsRead":0,"RowsWritten":0,"SkippedTransactions":1,"User":"root"}
Are we scrubbing the fingerprint here or should it not be present? Is there a txn ID? I guess I'm trying to understand how the stmt and txn events can be cross-referenced? Do we scrub those fields in this test?
pkg/sql/telemetry_logging_test.go
line 1786 at r8 (raw file):
sts.SetTracingStatus(query.tracingEnabled) spiedConn.Exec(t, query.query) require.Zero(t, telemetryLogging.getSkippedTransactionCount())
I don't quite understand this test. Why is the check identical for all cases? How do we check which txn is skipped vs emitted in different scenarios?
64840c7
to
b0dcd11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @maryliag)
Previously, abarganier (Alex Barganier) wrote…
nit: ")" instead of "."
Done.
pkg/sql/conn_executor_exec.go
line 3304 at r8 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: use the
,
and newline here before closing paren when writing multiline args
Done.
pkg/sql/exec_log.go
line 441 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
nit: can we unit test this function specifically? I know we capture the output within the logSpy in the tests, but it'd be nice to have specific tests for all the different early return conditions.li
Added a test case to verify the early return conditions!
pkg/sql/exec_log.go
line 450 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
In the previous commit/PR we had a lot of added logic to determine whether a transaction should be sampled (e.g. in
(*TelemetryLoggingMetrics).shouldTrackTransaction()
. It feels like we duplicate a lot of that logic once again here.Is there some reason we can't just rely on
extraTxnState.shouldLogToTelemetry
instead of making all these checks once again?
It's to be consistent with how we've been tracking skipped query events. We only increment the SkippedTransactions
counter when telemetry is enabled.
pkg/sql/exec_log.go
line 510 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
ditto: would it make sense to use something like a
sync.Pool
here as well?
I can perform some allocation benchmarking here and add a sync.Pool
for both statements and transactions if necessary.
pkg/sql/testdata/telemetryLogging/stmts_per_txn_limit
line 21 at r8 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Are we scrubbing the fingerprint here or should it not be present? Is there a txn ID? I guess I'm trying to understand how the stmt and txn events can be cross-referenced? Do we scrub those fields in this test?
Ah, forgot to add the transaction fingerprint id and its statement fingeprints to the allowed fields. Added.
pkg/sql/telemetry_logging_test.go
line 1786 at r8 (raw file):
Previously, dhartunian (David Hartunian) wrote…
I don't quite understand this test. Why is the check identical for all cases? How do we check which txn is skipped vs emitted in different scenarios?
This tests that we don't count a transaction as 'skipped' when telemetry logging is off. Specifically it's testing the portion of maybeLogTransaction
where we early exit when transaction telemetry is off so the transaction won't be counted as skipped by telemetry. It goes through all the combinations of the cluster settings for which we'd count txn telemetry as off. There's also one configuration for which it's on but due to the stubbing of time we never skip transactions. I've changed the test so that in the txn enabled mode we have some skipped transactions. Also left some more comments, hopefully that makes things more clear but lmk otherwise!
093baea
to
b0045ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @maryliag)
pkg/sql/exec_log.go
line 495 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
Benchmarking allocations would be one way. We can compare results with/without a
sync.Pool
over many calls tomaybeLogStatementInternal
(where thesync.Pool
is shared across calls) to understand the impact.
Added pools for both. While pooling these allocations did save an allocation per statement run with logging, just FYI that the benchmarking showed that the rest of the logging pkg performs a significantly more amount of allocation (I think this checks out since it's building a lot of strings?), so while pooling saves us some of those, at high logging frequencies we'll still have memory impact from logging in general.
pkg/sql/telemetry_logging_test.go
line 1786 at r8 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
This tests that we don't count a transaction as 'skipped' when telemetry logging is off. Specifically it's testing the portion of
maybeLogTransaction
where we early exit when transaction telemetry is off so the transaction won't be counted as skipped by telemetry. It goes through all the combinations of the cluster settings for which we'd count txn telemetry as off. There's also one configuration for which it's on but due to the stubbing of time we never skip transactions. I've changed the test so that in the txn enabled mode we have some skipped transactions. Also left some more comments, hopefully that makes things more clear but lmk otherwise!
Just FYI I converted it to a data driven test now that we have that handler 👍
b0045ce
to
cc0d1bb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r12, 14 of 14 files at r14, 3 of 3 files at r15, 11 of 11 files at r18, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @maryliag, and @xinhaoz)
pkg/util/log/eventpb/eventpbgen/gen.go
line 724 at r15 (raw file):
if m.{{.FieldName}} != nil { {{- end }} if printComma { b = append(b, ',')}; printComma = true
does this result in incorrect indentation of these files if the if
statement is missing?
pkg/sql/testdata/telemetry_logging/logging/skipped_transactions
line 81 at r19 (raw file):
show-skipped-transactions ---- 0
why is this 0 in all the tests? do we need a test with a positive number example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @maryliag)
pkg/util/log/eventpb/eventpbgen/gen.go
line 724 at r15 (raw file):
Previously, dhartunian (David Hartunian) wrote…
does this result in incorrect indentation of these files if the
if
statement is missing?
It shouldn't from what I can tell - it follows the same pattern as the .AllowZero
portion of the template whcih doesn't mess up the whitespace of the file.
pkg/sql/testdata/telemetry_logging/logging/skipped_transactions
line 81 at r19 (raw file):
Previously, dhartunian (David Hartunian) wrote…
why is this 0 in all the tests? do we need a test with a positive number example?
This is an edge case test to make sure that we don't increment skippedtransactions if telemetry is off (see comment at top of this file). The field is tested as non-zero when appropriate in other test files, but I can include an additional example here showing the non-zero value when telemetry is on if this is unclear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @maryliag, and @xinhaoz)
pkg/sql/conn_executor.go
line 2031 at r19 (raw file):
ex.planner.ExtendedEvalContext().Tracing.Enabled()) ex.extraTxnState.shouldLogToTelemetry, ex.extraTxnState.telemetrySkippedTxns = ex.server.TelemetryLoggingMetrics.shouldEmitTransactionLog(forceSampling, ex.executorType == executorTypeInternal)
can we compute shouldForceTxnSampling
within shouldEmitTransactionLog
?
pkg/sql/exec_log.go
line 460 at r19 (raw file):
if !telemetryLoggingEnabled.Get(&p.execCfg.Settings.SV) || !isTxnTelemetryMode || (execType == executorTypeInternal && !telemetryInternalQueriesEnabled.Get(&p.execCfg.Settings.SV)) { return
nit: split out if statement into separate cases for readability
pkg/sql/telemetry_logging.go
line 386 at r19 (raw file):
tracingEnabled := t.isTracing(nil, tracingOn) logConsoleQuery := telemetryInternalConsoleQueriesEnabled.Get(&t.st.SV) && strings.HasPrefix(applicationName, "$ internal-console")
consolidate internal query checks...somewhere. either here or in the maybeLog...
method. We have one for console txns and one for general internal ones.
This commit adds the field SkippedTransactions to the SampledTransaction protobuf to count the number of transactions that were not sampled while telemetry transaction logging is enabled. The corresponding field is added to the telemetryLogging struct and will be used in the following commit to track skipped transactions. Some whitespace in the SampledTransaction proto definition is adjusted. Epic: none Release note (sql change): New field `SkippedTransactions` in the SampledTransaction event, which is emitted to the TELEMETRY logging channel when telemetry logging is enabled and set to "transaction" mode.
This field should always exist in SampledExecStats. Since SampledTransaction is the only user of this message right now and is yet to be used we can safely change the proto definition. Epic: none Release note: None
Add a pool for SampledQuery objects. At high logging frequencies we can avoid an allocation per statement. Epic: none Release note: None
7270dff
to
4cc0197
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @maryliag)
pkg/sql/conn_executor.go
line 2031 at r19 (raw file):
Previously, dhartunian (David Hartunian) wrote…
can we compute
shouldForceTxnSampling
withinshouldEmitTransactionLog
?
Done.
pkg/sql/exec_log.go
line 460 at r19 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: split out if statement into separate cases for readability
Removed these checks.
pkg/sql/telemetry_logging.go
line 386 at r19 (raw file):
Previously, dhartunian (David Hartunian) wrote…
consolidate internal query checks...somewhere. either here or in the
maybeLog...
method. We have one for console txns and one for general internal ones.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r20, 3 of 3 files at r21, 2 of 2 files at r22, 3 of 12 files at r23.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @maryliag, and @xinhaoz)
-- commits
line 49 at r23:
add a note here that the logging decision is captured at start of transaction and not "refreshed" later.
pkg/sql/telemetry_logging.go
line 227 at r23 (raw file):
if !logConsoleQuery && isInternal && !telemetryInternalQueriesEnabled.Get(&t.st.SV) { return false, t.skippedTransactionCount.Load() }
nit: this if statement is doing two things that look the same but are coded differently. break this if statement into two with parallel structure for easy parsing like:
(pseudocode)
if isConsole && !telementryInternalConsoleQueriesEnabled {
...
}
if isInternal && !telememetryInternalQueriesEnabled {
...
}
This commit sends SampledTransaction events to the telemetry channel. In "transaction" sampling mode, if a transaction is marked to be logged to telemetry, we will emit a SampledTranaction event on transaction end containing transaction level stats. It is expected that if a transaction event exists in telemetry, its statement events will also have been logged (with a maximum number according to the setting sql.telemetry.transaction_sampling.statement_events_per_transaction.max). Transaction recording for telemetry is decided at the start of transaction execution (including on restarts), and will not be refreshed for the remainder of transaction execution. Closes: cockroachdb#108284 Release note (ops change): Transactions sampled for the telemetry logging channel will now emit a SampledTransaction event. To sample transactions, set the cluster setting `sql.telemetry.query_sampling.mode = 'transaction'` and enable telemetry logging via `sql.telemetry.query_sampling.enabled = true`.
4cc0197
to
9199146
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @dhartunian, and @maryliag)
Previously, dhartunian (David Hartunian) wrote…
add a note here that the logging decision is captured at start of transaction and not "refreshed" later.
Done.
pkg/sql/telemetry_logging.go
line 227 at r23 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: this if statement is doing two things that look the same but are coded differently. break this if statement into two with parallel structure for easy parsing like:
(pseudocode)
if isConsole && !telementryInternalConsoleQueriesEnabled { ... } if isInternal && !telememetryInternalQueriesEnabled { ... }
Because console queries are internal queries, I included them in the same case since otherwise console queries would not be logged if internal queries are on (this behaviour matches existing behaviour from statement sampling).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @maryliag, and @xinhaoz)
pkg/sql/telemetry_logging.go
line 227 at r23 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
Because console queries are internal queries, I included them in the same case since otherwise console queries would not be logged if internal queries are on (this behaviour matches existing behaviour from statement sampling).
Gotcha. Thanks for clarifying. We can keep as-is for now.
TFTR! |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 7b88f08 to blathers/backport-release-23.2-115722: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
1. sql/telemetry: add SkippedTransactions to SampledTransaction proto
This commit adds the field SkippedTransactions to the
SampledTransaction protobuf to count the number of transactions
that were not sampled while telemetry transaction logging
is enabled. The corresponding field is added to the
telemetryLogging struct and will be used in the following
commit to track skipped transactions. Some whitespace in the
SampledTransaction proto definition is adjusted.
Epic: none
Release note (sql change): New field
SkippedTransactions
inthe SampledTransaction event, which is emitted to the TELEMETRY
logging channel when telemetry logging is enabled and set to
"transaction" mode.
2. eventpb: make MVCCIteratorStats in SampledExecStats non-nullable
This field should always exist in SampledExecStats. Since
SampledTransaction is the only user of this message right now
and is yet to be used we can safely change the proto definition.
Epic: none
Release note: None
3. eventpb: make MVCCIteratorStats in SampledExecStats non-nullable
This field should always exist in SampledExecStats. Since
SampledTransaction is the only user of this message right now
and is yet to be used we can safely change the proto definition.
Epic: none
Release note: None
4. telemetry: log transaction exec events to TELEMETRY
This commit sends SampledTransaction events to the telemetry channel.
In "transaction" sampling mode, if a transaction is marked to be logged
to telemetry, we will emit a SampledTranaction event on transaction
end containing transaction level stats. It is expected that if a transaction
event exists in telemetry, its statement events will also have been logged
(with a maximum number according to the setting
sql.telemetry.transaction_sampling.statement_events_per_transaction.max.
Closes: #108284
Release note (ops change): Transactions sampled for the telemetry logging
channel will now emit a SampledTransaction event. To sample transactions,
set the cluster setting
sql.telemetry.query_sampling.mode = 'transaction'
and enable telemetry logging via
sql.telemetry.query_sampling.enabled = true
.