streamingccl: hook up tracing aggregator events for the C2C job #108458
Conversation
This looks good, but it has prompted another thought about some of the PRs it builds on. Once those PRs are in, I think this should be g2g.
pkg/ccl/streamingccl/streamingest/stream_ingestion_processor.go
```
@@ -313,7 +314,9 @@ func canWrap(mode sessiondatapb.VectorizeExecMode, core *execinfrapb.ProcessorCo
	case core.RestoreData != nil:
	case core.Filterer != nil:
	case core.StreamIngestionData != nil:
		return errStreamIngestionWrap
```
Interesting. I read the linked issue and understand why we need this. I'm a little surprised we don't need this for the restore case though.
I am surprised too, let me dig some more.
I think this has something to do with the fact that in C2C the stream ingestion procs push the meta downstream to another processor, the frontier proc. In restore, however, the restore data processor is the leaf processor and pushes metas to the row result writer. I'll add some logging to sanity-check that the restore data procs aren't doing any unwanted buffering of progress metas.
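For readers following along, a minimal, self-contained sketch of the topology difference described above, with toy types standing in for `execinfrapb.ProducerMetadata` and `execinfra.RowReceiver` (not the real API): in C2C, ingestion processors push metas to an intermediate frontier processor that may buffer them, while in restore the leaf processor hands metas straight to the result writer.

```go
package main

import "fmt"

// meta is a toy stand-in for execinfrapb.ProducerMetadata.
type meta struct{ payload string }

// consumer is a toy stand-in for execinfra.RowReceiver.
type consumer interface{ push(m meta) }

// frontier models a downstream processor that receives metas from each
// ingestion processor before forwarding them to the coordinator.
type frontier struct{ buffered []meta }

func (f *frontier) push(m meta) { f.buffered = append(f.buffered, m) }

// resultWriter models the row result writer that a leaf (restore) processor
// pushes metas to directly, with no intermediate processor in between.
type resultWriter struct{}

func (resultWriter) push(m meta) { fmt.Println("coordinator got:", m.payload) }

func main() {
	var c consumer

	// C2C: ingestion proc -> frontier proc (may buffer) -> coordinator.
	f := &frontier{}
	c = f
	c.push(meta{payload: "aggregator event from ingestion proc"})
	fmt.Println("frontier buffered", len(f.buffered), "meta(s)")

	// Restore: leaf proc -> row result writer, no buffering stage.
	c = resultWriter{}
	c.push(meta{payload: "aggregator event from restore proc"})
}
```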
Force-pushed 6370e46 to 53f69c9.
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed 53f69c9 to 21f4527.

Force-pushed 21f4527 to f0cac6f.
This change builds on top of cockroachdb#107994 and wires up each stream ingestion data processor to emit TracingAggregatorEvents to the frontier and subsequently the job coordinator. These events are periodically flushed to files in the `job_info` table and are consumable via the DBConsole Job Details page. Currently, the only aggregator event that is propagated is the IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126

Release note: None
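A hedged sketch of the mechanism this commit describes, using invented names rather than the actual CockroachDB processor API: the processor accumulates stats as it ingests, and when a flush timer fires it returns the snapshot as metadata from its `Next()`-style loop instead of a row, so the frontier and coordinator can persist it.

```go
package main

import (
	"fmt"
	"time"
)

// ingestionPerfStats is a toy stand-in for IngestionPerformanceStats.
type ingestionPerfStats struct{ sstBytes int64 }

type ingestionProc struct {
	stats     ingestionPerfStats
	flushTick *time.Ticker
}

// next models the processor's Next() loop: normally it returns rows, but
// when the flush timer fires it returns the aggregated stats as metadata.
func (p *ingestionProc) next() (row string, m *ingestionPerfStats) {
	select {
	case <-p.flushTick.C:
		snapshot := p.stats // copy so the processor keeps aggregating
		return "", &snapshot
	default:
		p.stats.sstBytes += 1 << 20 // pretend the sst batcher ingested 1MiB
		return "row", nil
	}
}

func main() {
	p := &ingestionProc{flushTick: time.NewTicker(10 * time.Millisecond)}
	defer p.flushTick.Stop()
	for i := 0; i < 5; i++ {
		row, m := p.next()
		if m != nil {
			fmt.Printf("flush aggregator event: %d bytes ingested so far\n", m.sstBytes)
		} else {
			fmt.Println("pushed", row, "downstream")
		}
		time.Sleep(5 * time.Millisecond)
	}
}
```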
Force-pushed f0cac6f to da18c77.
bors r=stevendanna

Build succeeded.
The motivation for this change was to fix the leaked goroutine in cockroachdb#109658, but it led to a larger cleanup as explained below.

In cockroachdb#108458 and other PRs we taught the processors of backup, restore, and C2C to periodically send metas containing the TracingAggregatorEvents, for the coordinator to then process and persist for improved observability. In that logic we were defensive against a scenario in which the processor received a root context with no tracing span, in which case we never initialized the tracing aggregator on the processor. This should never be possible in production code, but we want to prevent failing the replication job if this were to happen. As part of this defense we also returned `nil` instead of the meta in `Next()`. What we didn't realize was that returning a nil row and nil meta indicates that the consumer has been completely drained, and causes the processor to pre-emptively drain and shut down.

With this change, if the root context does not have a tracing span and we are unable to init a tracing aggregator, we simply do not allow the timer controlling the return of the TracingAggregatorEvents to fire. By doing this, we avoid a situation where a nil row and nil meta are returned too early.

As part of this cleanup, this change also simplifies the tracing aggregator. The aggregator is no longer responsible for creating a child span and registering a listener; instead it simply returns an object that can be registered as an event listener by the caller. The former approach was necessary when we wanted to decorate the span with LazyTags, but nobody uses LazyTags and we have a much better workflow to consume aggregator events with the `Advanced Debugging` job details page. This simplification was motivated by the fact that `processorbase.StartInternal` now takes event listeners that can be registered with the tracing span of the context that hangs off the processor, and is accessible via the `processor.Ctx()` method.

Fixes: cockroachdb#109658

Release note: None
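A simplified, self-contained sketch of the bug and the fix (hypothetical types, not the real `execinfra` code): under the contract assumed here, `Next()` returning a nil row and a nil meta signals that the producer is drained, so the old defensive branch shut the processor down early. The fix never arms the flush timer when the aggregator is missing, and a nil channel in a Go `select` never fires.

```go
package main

import (
	"fmt"
	"time"
)

type aggregator struct{ events int }

type proc struct {
	agg      *aggregator      // nil when the root context had no tracing span
	flushCh  <-chan time.Time // a nil channel never fires in a select
	rowsLeft int
}

func newProc(agg *aggregator) *proc {
	p := &proc{agg: agg, rowsLeft: 3}
	if agg != nil {
		// Only arm the flush timer when there is an aggregator; otherwise
		// the meta branch below is unreachable and next() can never return
		// (nil, nil) before the rows are exhausted.
		p.flushCh = time.NewTicker(time.Millisecond).C
	}
	return p
}

// next mimics the contract sketched above: a nil row AND a nil meta tell the
// consumer this producer is fully drained.
func (p *proc) next() (row, meta *int) {
	select {
	case <-p.flushCh:
		if p.agg == nil {
			// The old defensive branch: returning (nil, nil) here shut the
			// processor down prematurely. The fix makes it unreachable.
			return nil, nil
		}
		return nil, &p.agg.events
	default:
	}
	if p.rowsLeft == 0 {
		return nil, nil // the one legitimate "drained" signal
	}
	p.rowsLeft--
	return &p.rowsLeft, nil
}

func main() {
	p := newProc(nil) // no tracing span -> no aggregator, timer never armed
	for {
		row, meta := p.next()
		if row == nil && meta == nil {
			fmt.Println("drained only after all rows were produced")
			return
		}
	}
}
```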
109378: backupccl: avoid splitting if the split point might be unsafe r=lidorcarmel a=lidorcarmel

Restore may use unsafe keys as split points, which may cause unsafe splits between column families, which may cause SQL to fail when reading the row, or worse, return wrong results. This commit avoids splitting on keys that might be unsafe. See the issue for more info.

Epic: none
Informs: #109483
Release note: None.

109631: sql: serialize under stress and unskip TestQueryCache r=cucaroach a=cucaroach

This test would time out under stress+race because it exploits t.Parallel to run 15 tests concurrently that each fire up a TestServer. Fix the timeout by running the tests serially under stress.

Fixes: #105174
Epic: none
Release note: none

109683: spanconfig: skip protected timestamps on non-table data r=arulajmani a=aadityasondhi

Previously, we were placing a protected timestamp using the `EverythingSpan`, which covered the entire keyspace, when targeting a cluster backup. This was non-ideal because not all of the keyspace is used for backup. This is especially problematic for high-churn ranges, such as node liveness and timeseries, that can accumulate lots of MVCC garbage very quickly. Placing a protected timestamp on these ranges, thus preventing MVCC GC from running, can cause badness.

This patch introduces a new span that covers the keyspace excluded from backup. When we encounter a span that is within those bounds, we skip placing a protected timestamp on it.

Fixes: #102338
Release note: None

109720: cluster-ui: break circular import of `rootActions` r=xinhaoz a=xinhaoz

Importing `rootActions` from `reducers.ts` in cluster-ui was causing a circular import, preventing one of the redux fields experiencing the cyclic dependency from having its reducer populated, which omitted that field from the store altogether. Currently, this field is `uiConfig`, which affects features that rely on checking the SQL role of a user, such as displaying the `Reset SQL Stats` button. This commit extracts `rootActions` into its own file to remove the cyclic dependencies.

Fixes: #97996
Release note (bug fix): On CC, the `Reset SQL Stats` button is now visible if the user has the admin role.

[screenshot: `Reset SQL Stats` button shown on a production build of the 23.1 cluster-ui]

109734: bulk,ccl: refactor tracing aggregator integration r=stevendanna a=adityamaru

The motivation for this change was to fix the leaked goroutine in #109658 but it led to a larger cleanup as explained below. In #108458 and other PRs we taught the processors of backup, restore, and C2C to periodically send metas containing the TracingAggregatorEvents, for the coordinator to then process and persist for improved observability. In that logic we were defensive against a scenario in which the processor received a root context with no tracing span, in which case we never initialized the tracing aggregator on the processor. This should never be possible in production code, but we want to prevent failing the replication job if this were to happen. As part of this defense we also returned `nil` instead of the meta in `Next()`. What we didn't realize was that returning a nil row and nil meta indicates that the consumer has been completely drained, and causes the processor to pre-emptively drain and shut down.

With this change, if the root context does not have a tracing span and we are unable to init a tracing aggregator, we simply do not allow the timer controlling the return of the TracingAggregatorEvents to fire. By doing this, we avoid a situation where a nil row and nil meta are returned too early. As part of this cleanup, this change also simplifies the tracing aggregator. The aggregator is no longer responsible for creating a child span and registering a listener; instead it simply returns an object that can be registered as an event listener by the caller. The former approach was necessary when we wanted to decorate the span with LazyTags, but nobody uses LazyTags and we have a much better workflow to consume aggregator events with the `Advanced Debugging` job details page. This simplification was motivated by the fact that `processorbase.StartInternal` now takes event listeners that can be registered with the tracing span of the context that hangs off the processor, and is accessible via the `processor.Ctx()` method.

Fixes: #109658
Release note: None

Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: adityamaru <[email protected]>
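To illustrate the split-point fix in 109378, a standalone toy sketch of the underlying idea: a CockroachDB row can span multiple column-family KVs, and splitting between two families of the same row is unsafe, so a candidate split key must be trimmed back to a row boundary first. CockroachDB's real helper for this is `keys.EnsureSafeSplitKey`; the string-based key encoding below is an invented stand-in, not the actual key format.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureSafeSplitKey trims a toy key of the form
// "/Table/<id>/<index>/<rowPK>/<familyID>" back to its row prefix, so that a
// split never lands between column families of the same row.
func ensureSafeSplitKey(key string) (string, error) {
	parts := strings.Split(key, "/")
	if len(parts) < 6 {
		return "", fmt.Errorf("key %q has no column-family suffix to trim", key)
	}
	// Drop the family ID so the split point is a whole-row boundary.
	return strings.Join(parts[:len(parts)-1], "/"), nil
}

func main() {
	unsafe := "/Table/104/1/42/2" // points at family 2 of row 42
	safe, err := ensureSafeSplitKey(unsafe)
	if err != nil {
		panic(err)
	}
	fmt.Println("split at:", safe) // "/Table/104/1/42": whole-row boundary
}
```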
This change builds on top of #107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info` table and are consumable via the DBConsole Job Details page.
Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.
Informs: #108374
Fixes: #100126
Release note: None
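As a closing illustration, a hedged, standalone sketch (toy struct, not the actual `IngestionPerformanceStats` type or the tracing-aggregator interfaces) of what propagating these events buys: each processor ships a partial stat as a meta, and the coordinator merges the partials into one snapshot before flushing it to the `job_info` table.

```go
package main

import "fmt"

// perfStats is a toy stand-in for IngestionPerformanceStats.
type perfStats struct {
	dataSize int64
	flushes  int64
}

// combine merges another partial into the receiver, the way the coordinator
// folds together events arriving from each stream ingestion processor.
func (s *perfStats) combine(o perfStats) {
	s.dataSize += o.dataSize
	s.flushes += o.flushes
}

func main() {
	// Partials as they might arrive from two stream ingestion processors.
	fromProc1 := perfStats{dataSize: 512 << 20, flushes: 40}
	fromProc2 := perfStats{dataSize: 256 << 20, flushes: 22}

	var total perfStats
	total.combine(fromProc1)
	total.combine(fromProc2)
	fmt.Printf("flush to job_info: %d MiB across %d flushes\n",
		total.dataSize>>20, total.flushes)
}
```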