storage: EngineCapacity call is progressively expensive on an import in a large cluster #83137
Labels:
C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)
E-quick-win (Likely to be a quick win for someone experienced.)
T-storage (Storage Team)
Comments
lunevalex added the C-bug label on Jun 21, 2022.
mwang1026 added the E-quick-win label on Jun 22, 2022.
itsbilal added a commit to itsbilal/cockroach that referenced this issue on Jul 8, 2022:

Previously, we were creating subdirectories for ranges and range snapshots in the auxiliary directory every time we accepted a snapshot, but only cleaning up the snapshot subdirectories after a snapshot scratch space closed. This left empty parent range directories around on the FS, slowing down future calls to Pebble.Capacity() and indirectly slowing down AddSSTable in the future. This change adds code to clean up empty range directories in the aux directory if they're not being used. Some coordination and synchronization code had to be added to ensure we wouldn't remove a directory that was just created by a concurrent snapshot.

Fixes cockroachdb#83137.

Release note (performance improvement): Addresses issue where imports and rebalances were being slowed down due to the accumulation of empty directories from range snapshot applications.
itsbilal added a commit to itsbilal/cockroach that referenced this issue on Jul 11, 2022, with the same commit message.
craig bot pushed a commit that referenced this issue on Jul 11, 2022:
82161: ui: Add Jest as test runner to DB Console r=nathanstilwell a=nathanstilwell

DB Console is the last place Cockroach Labs is using a test runner other than [Jest](https://jestjs.io/). This PR adds Jest as the test runner intended to replace [Mocha](https://mochajs.org/). Mocha runs in a headless browser via [Karma](https://karma-runner.github.io/latest/index.html), whereas Jest runs tests in NodeJS and simulates a browser environment using [jsdom](https://github.com/jsdom/jsdom). Due to this change in environment, you will see not only files to set up the Jest test runner, but also changes to some tests, some mocking of browser globals that are not included in jsdom by default, and some configuration adjustments to the `tsconfig.json`. Since configuration changes are infrequent and highly contextual, we decided to err on the side of verbose inline documentation in configuration files. Details about individual changes to configs or tests are documented in commit messages.

84068: streamingccl: fix span use-after-finish in ingestion frontier r=samiskin a=stevendanna

This fixes the following use-after-finish panic:

panic: use of Span after Finish. Span: ingestfntr. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
goroutine 1617744 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish(0xc002c5f180)
    github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:186 +0x279
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Tracer).startSpanGeneric(0xc0138b2a50, {0x72291c0, 0xc019ebbe40}, {0x68e4eb2, 0x1d}, {{0x0}, 0x0, {0x0, 0x0, {{0x0, ...}, ...}, ...}, ...})
    github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:1207 +0x997
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Tracer).StartSpanCtx(0xc0138b2a50, {0x72291c0, 0xc019ebbe40}, {0x68e4eb2, 0x1d}, {0xc010cbb760, 0x1, 0x1})
    github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:1062 +0x1a7
github.com/cockroachdb/cockroach/pkg/util/tracing.ChildSpan({0x72291c0, 0xc019ebbe40}, {0x68e4eb2, 0x1d})
    github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:1577 +0x145
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamclient.(*partitionedStreamClient).Heartbeat(0xc00b311080, {0x72291c0, 0xc019ebbe40}, 0xac93aa873318001, {0x16ffc388c7d7c42a, 0x0, 0x0})
    github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamclient/partitioned_stream_client.go:83 +0xb7
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamingest.(*heartbeatSender).maybeHeartbeat(0xc01980f880, {0x72291c0, 0xc019ebbe40}, {0x16ffc388c7d7c42a, 0x0, 0x0})
    github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamingest/stream_ingestion_frontier_processor.go:180 +0x250
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamingest.(*heartbeatSender).startHeartbeatLoop.func1.1()
    github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamingest/stream_ingestion_frontier_processor.go:207 +0x41b
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamingest.(*heartbeatSender).startHeartbeatLoop.func1({0x72291c0, 0xc019ebbe40})
    github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamingest/stream_ingestion_frontier_processor.go:227 +0xb5
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
    github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:169 +0x52
golang.org/x/sync/errgroup.(*Group).Go.func1()
    golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74 +0xb4
created by golang.org/x/sync/errgroup.(*Group).Go
    golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:71 +0xdd
I220708 05:29:42.017167 1 (gostd) testmain.go:90 [-] 1 Test //pkg/ccl/streamingccl/streamingest:streamingest_test exited with error code 2

The use after finish was caused by goroutines in the ingestion frontier processor that lived past a call to (*ProcessorBase).InternalClose, which finishes the span attached to the context passed to the processor in Start. To address this we:

- ensure that we stop our heartbeat thread before calling InternalClose in ConsumerClosed,
- provide a TrailingMetaCallback so that we can perform cleanup when DrainHelper() is called. When a TrailingMetaCallback is provided, DrainHelper() calls it instead of InternalClose(), allowing us to correctly clean up before the span is closed, and
- use context cancellation rather than a channel to control heartbeat loop exit, to avoid having to deal with double-closes of the channel.

Fixes #84054

Release note: None

84100: kvserver: Clean up empty range directories after snapshots r=nicktrav a=itsbilal

(Same description as the commit message quoted earlier in this thread.) Fixes #83137

Release note (bug fix, performance improvement): Addresses issue where imports and rebalances were being slowed down due to the accumulation of empty directories from range snapshot applications.

84170: sql/sqlstats: record QuerySummary when merging stats r=ericharmeling a=stevendanna

During execution of a transaction, all statement statistics are collected in a struct local to that transaction, and then flushed to the main ApplicationStats container when the transaction finishes. Previously, when flushing, we failed to copy the QuerySummary field, leading to `metadata->'querySummary'` being empty in most cases. Prior to ce1b42b this only affected statements in an explicit transaction. After that commit, it affected all statements.

Release note (bug fix): Fix a bug that led to the querySummary field in crdb_internal.statements_statistics's metadata column being empty.

84194: opt: mark SimplifyRootOrdering as an essential rule r=mgartner a=mgartner

The unoptimized query oracle, which disables rules, found a bug in the execution engine that is only possible to hit if the `SimplifyRootOrdering` rule is disabled (see #84191). Until the bug is fixed, we mark the rule as essential so that it is not disabled by these tests. Fixes #84067

Release note: None

Co-authored-by: Nathan Stilwell <[email protected]>
Co-authored-by: Sean Barag <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
blathers-crl bot pushed a commit that referenced this issue on Jul 11, 2022, with the same commit message.
itsbilal added a commit to itsbilal/cockroach that referenced this issue on Sep 19, 2022, with the same commit message.
While running a large import, we noticed that the majority of the time spent in the EvalAddSSTable call went to the Pebble.EngineCapacity() call. This call was getting progressively more expensive because of how it computes available space in the auxiliary directory: to compute the space used by that directory, the code walks the whole directory tree and sums up the space used by each entry, so the more directories accumulate, the slower this method gets over time. The root cause of this slowdown is the accumulation of empty directories by the snapshot receiver, which creates a new directory for every snapshot received here:
cockroach/pkg/kv/kvserver/replica_sst_snapshot_storage.go, line 50 (at commit 2e038a1)
cockroach/pkg/kv/kvserver/replica_sst_snapshot_storage.go, line 59 (at commit 2e038a1)
In the cluster under test, a 10-node cluster with 200k+ ranges and a running import, there were 20k+ empty directories in that folder, which made the capacity call noticeably slow. A CPU profile capturing this is attached to this ticket.
Jira issue: CRDB-16880
Epic: CRDB-16237